Accelerated Nonnegative Tensor Completion via Integer Programming

The problem of tensor completion has applications in healthcare, computer vision, and other domains. However, past approaches to tensor completion have faced a tension: they either have polynomial-time computation but require exponentially more samples than the information-theoretic rate, or they use fewer samples but require solving NP-hard problems for which there are no known practical algorithms. A recent approach, based on integer programming, resolves this tension for nonnegative tensor completion. It achieves the information-theoretic sample complexity rate and deploys the Blended Conditional Gradients algorithm, which requires a linear (in the numerical tolerance) number of oracle steps to converge to the global optimum. The tradeoff is that, in the worst case, each oracle step requires solving an integer linear program. Despite this theoretical limitation, numerical experiments show that the algorithm can, on certain instances, scale up to 100 million entries while running on a personal computer. The goal of this paper is to further enhance this algorithm, with the aim of expanding both the breadth and scale of instances that can be solved. We explore several variants that maintain the same theoretical guarantees as the original algorithm but offer potentially faster computation. We consider different data structures, acceleration of the gradient descent steps, and the use of the Blended Pairwise Conditional Gradients algorithm. We describe the original approach and these variants, and conduct numerical experiments to explore the tradeoffs among these algorithmic design choices.


INTRODUCTION
A tensor is a multi-dimensional array, or a generalized matrix. ψ is called an order-p tensor if ψ ∈ R^{r_1 × ⋯ × r_p}, where r_i is the length of ψ in its i-th dimension. For example, an RGB image with a size of 30 by 30 pixels is an order-3 tensor in R^{30×30×3}. Since tensors and matrices are closely related, many matrix problems can be naturally generalized to tensors, such as computing a matrix norm and decomposing a matrix. However, the tensor generalization of such problems can be substantially more challenging in terms of computation [Hillar and Lim, 2013].
Like matrix completion, tensor completion uses the observed entries of a partially observed tensor ψ to interpolate the missing entries under a restriction on the rank of the interpolated tensor. The purpose of the rank restriction is to limit the degrees of freedom of the missing entries [Song et al., 2019], e.g., to avoid overfitting. Without this rank restriction, the tensor completion problem is ill-posed because there are too many degrees of freedom to be constrained by the available data. Tensor completion is a versatile model with many important applications in social sciences [Tan et al., 2013], healthcare [Gandy et al., 2011], computer vision [Liu et al., 2013], and many other domains.
In the past decade, there have been major advances in matrix completion [Zhang et al., 2019]. However, for general tensor completion there remains a critical tension. Past approaches either have polynomial-time computation but require exponentially more samples than the information-theoretic rate [Gandy et al., 2011; Mu et al., 2014; Barak and Moitra, 2016; Montanari and Sun, 2018], or they achieve the information-theoretic rate but require solving NP-hard problems for which there are no known practical numerical algorithms [Chandrasekaran et al., 2012; Yuan and Zhang, 2016, 2017; Rauhut and Stojanac, 2021].
We note that, aside from matrix completion, there are numerical algorithms that achieve the information-theoretic rate for some special cases of tensor completion; for instance, nonnegative rank-1 tensors [Aswani, 2016] or symmetric orthogonal tensors [Rao et al., 2015]. In this paper, we focus on a recent approach proposed by Bugg et al. [2022], designed for entrywise-nonnegative tensors, which arise naturally in applications such as image demosaicing. The authors defined a new norm for nonnegative tensors as the gauge of a specific 0-1 polytope that they constructed. Using this gauge norm, their approach achieves the information-theoretic sample complexity rate, although the resulting optimization problem is NP-hard to solve. Nevertheless, a practical approach was attained: because the norm is defined through a 0-1 polytope, the authors embedded integer linear optimization within the Blended Conditional Gradients (BCG) algorithm [Braun et al., 2019], a variant of the Frank-Wolfe algorithm, to construct a numerical computation algorithm that requires a linear (in the numerical tolerance) number of oracle steps to converge to the global optimum. This paper proposes several acceleration techniques for the original numerical computation algorithm created by Bugg et al. [2022]; we test the techniques in combination to evaluate the best configuration for overall speedup. The motivation is to further improve the implementation on large-scale problems. Our variants maintain the same theoretical guarantees as the original algorithm while offering potential speedups; indeed, our experiments demonstrate that such speedups can be attained consistently across a range of problem instances. This paper also fully describes the original numerical computation algorithm and its coding implementation, which were omitted in [Bugg et al., 2022].
We summarize preliminary material and introduce the framework and theory of Bugg et al. [2022]'s nonnegative tensor completion approach in Section 2. We then describe the computation algorithm of their approach in Section 3 and our acceleration techniques in Section 4. Section 5 presents the numerical experiment results.
PRELIMINARIES
A nonnegative rank-1 tensor ψ is one that can be written as an outer product of nonnegative vectors, ψ = θ^(1) ⊗ ⋯ ⊗ θ^(p) with θ^(k) ∈ R_+^{r_k}, so that ψ_x = ∏_{k=1}^p θ^(k)_{x_k} for every index x. Bugg et al. [2022] defined the ball of nonnegative rank-1 tensors whose maximum entry ‖ψ‖_max is at most λ ∈ R_+ to be

B_λ := {ψ = θ^(1) ⊗ ⋯ ⊗ θ^(p) : θ^(k) ∈ R_+^{r_k}, ‖ψ‖_max ≤ λ},    (1)

so the nonnegative rank of a nonnegative tensor ψ is

rank_+(ψ) := min{k : ψ = Σ_{i=1}^k ψ_i, ψ_i ∈ B_∞},    (2)

where B_∞ = lim_{λ→∞} B_λ. For a λ ∈ R_+, consider the finite set of scaled binary rank-1 tensors

S_λ := {ψ = λ · θ^(1) ⊗ ⋯ ⊗ θ^(p) : θ^(k) ∈ {0,1}^{r_k}}.    (3)

Bugg et al. [2022] established the following connection between B_λ and S_λ:

C_λ := conv(B_λ) = conv(S_λ),    (4)

where C_λ is the nonnegative tensor polytope. Bugg et al. [2022] also presented three implications of this result that are useful to their nonnegative tensor completion approach. First, C_λ is a polytope. Second, the elements of S_λ are the vertices of C_λ. Third, the following scaling relationships hold: B_λ = λB_1, S_λ = λS_1, and C_λ = λC_1.
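To make these definitions concrete, the following NumPy fragment (our illustration; the shapes are arbitrary) builds a nonnegative rank-1 order-3 tensor as an outer product and a vertex of S_λ from binary factors:

```python
import numpy as np

r = (4, 5, 3)                                          # dimensions r_1, r_2, r_3
theta = [np.random.rand(rk) for rk in r]               # nonnegative factors
psi = np.einsum('i,j,k->ijk', *theta)                  # rank-1 tensor: psi_x = prod_k theta^(k)_{x_k}

lam = 2.0
theta_bin = [np.random.randint(0, 2, rk) for rk in r]  # binary factors
vertex = lam * np.einsum('i,j,k->ijk', *theta_bin)     # an element of S_lambda
```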

Norm for Nonnegative Tensors
Key to the theoretical guarantee and numerical computation in Bugg et al. [2022]'s approach is their construction of a new norm for nonnegative tensors via a gauge (or Minkowski functional).
Definition 2.1 (Bugg et al. [2022]). The function ‖·‖_+ defined as

‖ψ‖_+ := inf{λ ≥ 0 : ψ ∈ C_λ} = inf{λ ≥ 0 : ψ ∈ λC_1}    (5)

is the gauge of C_1, and it is a norm for nonnegative tensors.

This norm has an important property: it can be used as a convex surrogate for tensor rank [Bugg et al., 2022]. In other words, if ψ is a nonnegative tensor, then we have

‖ψ‖_max ≤ ‖ψ‖_+ ≤ rank_+(ψ) · ‖ψ‖_max.

Nonnegative Tensor Completion
For a partially observed order-p tensor ψ, let (x_i, y_i) ∈ [r_1] × ⋯ × [r_p] × R for i = 1, …, n denote the indices and values of the n observed entries. We assume that an entry can be observed multiple times, so let U := {x̄_1, …, x̄_u}, where u ≤ n, denote the set of unique indices of observed entries. Since ‖·‖_+ is a convex surrogate for tensor rank, we have the following nonnegative tensor completion problem:

ψ̂ ∈ arg min { (1/n) Σ_{i=1}^n (ψ_{x_i} − y_i)² : ‖ψ‖_+ ≤ λ },    (6)

where λ ∈ R_+ is given and ψ̂ is the completed tensor. The feasible set {ψ : ‖ψ‖_+ ≤ λ} is equivalent to C_λ by the norm definition (2.1) and the fact that C_λ = λC_1.
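For reference in the algorithm descriptions below, the following sketch (our variable names; it assumes the tensor is represented only by its values at the u unique observed indices) evaluates the objective in (6) and its gradient, accumulating repeated observations of the same entry:

```python
import numpy as np

def objective(psi_u, X_ids, Y):
    """psi_u: values of psi at the u unique observed indices;
    X_ids: for each of the n samples, the position of its index in U;
    Y: the n observed values."""
    residual = psi_u[X_ids] - Y
    return np.mean(residual ** 2)

def gradient(psi_u, X_ids, Y):
    # d/d psi_u of (1/n) sum_i (psi_{x_i} - y_i)^2, summing over repeats
    n = len(Y)
    g = np.zeros_like(psi_u)
    np.add.at(g, X_ids, 2.0 * (psi_u[X_ids] - Y) / n)
    return g
```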
Although problem (6) is a convex optimization problem, Bugg et al. [2022] showed that solving it is NP-hard. Nonetheless, they designed an efficient numerical algorithm for computing global minima of problem (6) by exploiting the substantial structure of the problem. We explain their algorithm in more detail after presenting their findings on the statistical guarantees of problem (6).

Statistical Guarantees
Bugg et al. [2022] observed that problem (6) is equivalent to a convex aggregation [Nemirovski, 2000; Tsybakov, 2003; Lecué, 2013] problem for a finite set of functions, so the following tight generalization bound holds for the solution of problem (6).

PROPOSITION 2.1 ([Lecué, 2013]). Suppose |y| ≤ b almost surely. Given any δ > 0, with probability at least 1 − 4δ, the solution of problem (6) satisfies the generalization bound (7), where c_0 is an absolute constant and ζ_n is the convex aggregation rate defined in (8).

Under specific noise models, such as an additive noise model, we obtain the following corollary to the above proposition by combining it with the fact that ‖·‖_+ is a convex surrogate for tensor rank.

COROLLARY 2.1 ([Bugg et al., 2022]). Suppose ϕ is a nonnegative tensor with rank_+(ϕ) = k and ‖ϕ‖_max ≤ µ. If the (x_i, y_i) are independent and identically distributed with |y_i − ϕ_{x_i}| ≤ e almost surely and E[y_i] = ϕ_{x_i}, then given any δ > 0, with probability at least 1 − 4δ, the corresponding risk bound (9) holds, where ζ_n is as in (8) and c_0 is an absolute constant.
The two results above show that problem (6) achieves the information-theoretic sample complexity rate when k = O(1).

ORIGINAL COMPUTATION ALGORITHM
Since C_1 is a 0-1 polytope, we can use integer linear optimization to solve the linear separation problem over this polytope. Thus, we can apply the Frank-Wolfe algorithm or its variants to solve problem (6) to a desired numerical tolerance. Bugg et al. [2022] chose the BCG variant for two reasons.
First, the BCG algorithm can terminate (within numerical tolerance) in a linear number of oracle steps for an optimization problem with a polytope feasible set and a strictly convex objective function over the feasible set. To make the objective function in problem (6) strictly convex, we can reformulate problem (6) by changing its feasible set from C_λ to Proj_U(C_λ); implementing this reformulation amounts to simply discarding the unobserved entries of ψ. Second, the weak-separation oracle in the BCG algorithm accommodates early termination of the associated integer linear optimization problem, which can be formulated as

min ⟨c, w⟩
s.t. w_x ≤ θ^(k)_{x_k},  for k = 1, …, p and x ∈ U,
     w_x ≥ Σ_{k=1}^p θ^(k)_{x_k} − (p − 1),  for x ∈ U,
     w_x ∈ [0, 1],  θ^(k) ∈ {0,1}^{r_k},    (10)

where w_x linearizes the product ∏_k θ^(k)_{x_k}. The feasible set in the problem above is equivalent to S_λ (after scaling by λ), and the linear constraints are obtained from standard product-linearization techniques in integer optimization [Hansen, 1979; Padberg, 1989]. Bugg et al. [2022] also deploy a fast alternating minimization heuristic for the weak-separation oracle to avoid (if possible) solving problem (10) with the integer programming solver.
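As an illustration, a gurobipy model for (10) might be assembled as follows; the variable layout follows the standard linearization above and is our sketch, not necessarily the exact model built in the reference code:

```python
import gurobipy as gp
from gurobipy import GRB

def build_oracle_model(r, U):
    """r: tuple of dimensions; U: list of unique observed multi-indices
    (tuples). The objective is set per call from the current gradient c."""
    p = len(r)
    m = gp.Model('weak_separation')
    theta = [m.addVars(rk, vtype=GRB.BINARY) for rk in r]
    w = m.addVars(len(U), lb=0.0, ub=1.0)        # w_x linearizes prod_k theta^(k)_{x_k}
    for i, x in enumerate(U):
        for k in range(p):
            m.addConstr(w[i] <= theta[k][x[k]])  # w_x <= each factor entry
        m.addConstr(w[i] >= gp.quicksum(theta[k][x[k]] for k in range(p)) - (p - 1))
    return m, theta, w

# per oracle call, with gradient values c_u at the observed indices and scale lam:
#   m.setObjective(lam * gp.quicksum(c_u[i] * w[i] for i in range(len(c_u))), GRB.MINIMIZE)
#   m.optimize()
```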
Next, we fully describe the Python 3 implementation of the BCG algorithm adapted by Bugg et al. [2022] to solve problem (6) and its major computational components. This description is important for explaining how we accelerate the algorithm later, and it supplements the algorithm description omitted in [Bugg et al., 2022]. Their code is available from https://github.com/WenhaoP/TensorComp. For brevity, we do not explain the abstract frameworks of the adapted BCG (ABCG) or its major computational components here; they can be found in [Braun et al., 2019] and [Bugg et al., 2022].

Adapted Blended Conditional Gradients
The ABCG is implemented as the function nonten in original_nonten.py. The inputs are the indices of observed entries (X), the values of observed entries (Y), the dimensions (r), λ (lpar), and the numerical tolerance (tol). The output is the completed tensor ψ (sol). There are three important preparation steps in ABCG. First, it reformulates problem (6) by changing the feasible set from C_λ to Proj_U(C_λ). Second, it normalizes the feasible set to Proj_U(C_1) and scales sol by λ at the end of ABCG. Third, it flattens the iterate (psi_q) and recovers the original dimensions of sol at the end of ABCG.
We first build the integer programming problem (10) in Gurobi. The iterate is initialized as a tensor of ones; the active vertex set {v_1, …, v_k} (Pts and Vts) and the active vertex weights {γ_1, …, γ_k} (lamb) are initialized accordingly. Vts stores the θ^(i)'s for constructing each active vertex in Pts. bestbd stores the best lower bound on the optimal objective value found so far; it is initialized as 0, which is a valid lower bound for problem (6). objVal stores the objective function value of the current iterate. m.gap stores half of the current estimate of the primal gap, i.e., the difference between objVal and the optimal objective function value [Braun et al., 2019].

The iteration procedure of BCG is implemented as a while loop. In each iteration, we first compute the gradient of the objective function at the current iterate, stored in c (lines 318–320). Then, we compute the dot products between the scaled gradient (lpar * c) and each active vertex ({⟨lpar * c, v_1⟩, …, ⟨lpar * c, v_k⟩}), store them in pro, and use pro to find the away-step vertex (psi_a) and the Frank-Wolfe-step vertex (psi_f). Based on a test that compares ⟨lpar * c, psi_a − psi_f⟩ with m.gap, we either use the simplex gradient descent step (SiGD) or the weak-separation oracle step (LPsep) to find the next iterate, as sketched below. Since we initialize m.gap as +∞, we always solve problem (10) for the weak-separation oracle in the first iteration. At the end of each iteration, we update objVal and bestbd. The iteration procedure stops when either the primal gap estimate (2 * m.gap) or the largest possible primal gap (objVal − bestbd) is smaller than tol.
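The blending test itself is simple; a minimal sketch with our naming:

```python
import numpy as np

def choose_step(Pts, c, lpar, gap):
    """Sketch of the blending test (our naming). Pts: (k, u) array of active
    vertices restricted to the observed entries; c: current gradient;
    gap: m.gap, half of the primal gap estimate."""
    pro = Pts @ (lpar * c)                      # <lpar*c, v_i> for each active vertex
    i_a, i_f = int(np.argmax(pro)), int(np.argmin(pro))
    if pro[i_a] - pro[i_f] >= gap:              # enough progress within the hull
        return 'SiGD', i_a, i_f
    return 'LPsep', i_a, i_f
```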

Simplex Gradient Descent Step
SiGD is implemented in the function nonten. It determines the next iterate, with a lower objective function value, via a single descent step applied to the following reformulation of problem (6) in barycentric coordinates:

(γ̂_1, …, γ̂_k) ∈ arg min { (1/n) Σ_{i=1}^n ((Σ_{j=1}^k γ_j v_j)_{x_i} − y_i)² : (γ_1, …, γ_k) ∈ Δ_k },

where v_1, …, v_k are the active vertices and Δ_k is the probability simplex; for clarity, we ignore the projection onto the observed entries in the problem above. Since lamb lies in Δ_k, as all its elements are nonnegative and sum to one, SiGD has the term "Simplex" in its name [Braun et al., 2019].
We first compute the projection of pro onto the hyperplane containing Δ_k and store it in d. If all elements of d are zero, we set the next iterate to the first active vertex and end SiGD. Otherwise, we solve the one-dimensional problem η̂ = max{η ∈ R_+ : γ − ηd ≥ 0}, where γ is lamb, and store the optimal solution η̂ in eta. Next, we compute psi_n = x − η̂ Σ_i d_i v_i, where x is the current iterate, the d_i are the elements of d, and the v_i are the active vertices. If the objective function value of psi_n is smaller than or equal to that of the current iterate, we set the next iterate to psi_n and drop from the active vertex set any vertex whose weight in the representation of psi_n is zero. Otherwise, we perform an exact line search over the line segment between the current iterate and psi_n, and set the next iterate to the optimal solution.
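A simplified sketch of this step (omitting the accept-or-line-search branch described above) is:

```python
import numpy as np

def sigd_step(lamb, pro, Pts, psi):
    """One simplex-gradient-descent step (our simplification).
    lamb: weights over the k active vertices; pro: <lpar*c, v_i> values;
    Pts: (k, u) active vertices; psi: current iterate of length u."""
    d = pro - pro.mean()                 # project pro onto the simplex hyperplane
    if np.allclose(d, 0.0):
        return Pts[0], np.ones(1), Pts[:1]
    pos = d > 0
    eta = np.min(lamb[pos] / d[pos])     # largest eta with lamb - eta*d >= 0
    psi_n = psi - eta * (d @ Pts)        # x - eta * sum_i d_i v_i
    lamb_n = lamb - eta * d
    keep = lamb_n > 1e-12                # drop vertices with zero weight
    return psi_n, lamb_n[keep], Pts[keep]
```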

Weak-Separation Oracle Step
LPsep is implemented in the function nonten. It differs in two ways from its implementation in the original BCG. First, whether or not it finds a new vertex satisfying the weak-separation oracle, it always performs an exact line search over the line segment between the current iterate and that vertex to find the next iterate. Second, it uses bestbd to update m.gap. m.cmin is the dot product between lpar * c and the current iterate. The flag variable oflg is True while an improving vertex has not yet been found, i.e., one satisfying either the weak-separation oracle or the conditions described below. We first repeatedly use the alternating minimization heuristic (AltMin) to solve the weak-separation oracle. If AltMin fails to do so within 100 repetitions but finds a vertex showing that m.gap is an overestimate, we declare that this vertex satisfies the weak-separation oracle and update m.gap. Otherwise, we solve the integer programming problem (10) with the Gurobi global solver to obtain a satisfying vertex and update m.gap. Note that an exact solution is not required; any solution attaining weak separation suffices.
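A control-flow sketch, with the heuristic and the solver abstracted as callables (our naming; the exact m.gap update rule lives in the reference code):

```python
def lpsep_step(altmin, solve_ip, cmin, gap, max_tries=100):
    """altmin() and solve_ip() return (vertex, value) with
    value = <lpar*c, vertex>; cmin = <lpar*c, psi> for the current iterate."""
    for _ in range(max_tries):
        v, val = altmin()
        if val <= cmin - gap:            # weak separation satisfied
            return v, gap
    if val < cmin:                       # vertex shows m.gap was an overestimate
        return v, cmin - val             # shrink the gap estimate
    return solve_ip()                    # Gurobi fallback on problem (10)
```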

Alternating Minimization
AltMin is implemented as the function altmin in original_nonten.py. Since a vertex lies in S_λ, the objective function in the integer programming problem (10) becomes ⟨c, λ · θ^(1) ⊗ ⋯ ⊗ θ^(p)⟩. There is a motivating observation for solving this problem: if we fix θ^(2), …, θ^(p), then problem (10) is equivalent to

min_{θ^(1) ∈ {0,1}^{r_1}} ⟨c̃^(1), θ^(1)⟩,

where c̃^(1) is a constant vector in R^{r_1} computed from c, θ^(2), …, θ^(p). Because θ^(1) is binary, the optimal solution can easily be read off from the signs of the entries of c̃^(1): set θ^(1)_j = 1 exactly when c̃^(1)_j < 0. AltMin utilizes this observation, cycling through the modes and optimizing one factor at a time.
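A minimal sketch of AltMin, written against a dense gradient tensor for clarity (an assumption of ours; the reference code works only with the observed entries):

```python
import numpy as np

def contract_except(c_tensor, theta, k):
    """<c, theta^(1) x ... x theta^(p)> with mode k left free, i.e. the
    vector c_tilde^(k) in R^{r_k}."""
    p = len(theta)
    letters = 'abcdefghij'[:p]
    expr = ','.join([letters] + [letters[l] for l in range(p) if l != k])
    expr += '->' + letters[k]
    return np.einsum(expr, c_tensor, *[theta[l] for l in range(p) if l != k])

def altmin(c_tensor, r, n_rounds=20, rng=None):
    """Alternating minimization over binary factors (our simplification)."""
    rng = rng or np.random.default_rng()
    theta = [rng.integers(0, 2, rk) for rk in r]
    for _ in range(n_rounds):
        for k in range(len(r)):
            c_tilde = contract_except(c_tensor, theta, k)
            theta[k] = (c_tilde < 0).astype(int)  # sign rule for binary theta^(k)
    return theta
```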

ACCELERATED COMPUTATION ALGORITHM
To show the efficacy and scalability of their approach, Bugg et al. [2022] conducted numerical experiments on a personal computer. They compared the performance of their approach against three other approaches, namely alternating least squares (ALS) [Kolda and Bader, 2009], simple low rank tensor completion (SiLRTC) [Liu et al., 2013], and trace norm regularized CP decomposition (TNCP) [Song et al., 2019], on four sets of nonnegative tensor completion problems. Normalized mean squared error (NMSE), a Frobenius-norm measure of the error of the completed tensor relative to the ground truth, is used to measure accuracy. The results of their numerical experiments show that Bugg et al. [2022]'s approach attains higher accuracy but requires more computation time than the three other approaches on all four problem sets. We work directly on their code base to design acceleration techniques. Besides basic techniques such as caching repeated computations, we design five acceleration techniques based on profiling results. By combining these five techniques in different ways, we create ten variants of ABCG that maintain the same theoretical guarantees but offer potentially faster computation.

Technique 1A: Index
The first technique, Index, accelerates AltMin. The quantity fpro was originally computed with a for-loop of u iterations. This loop can be time-consuming when u is large (i.e., for large sample sizes). We rewrote the computation so that it instead uses a for-loop with r_k iterations for each mode k = 1, …, p. Although each iteration of the new loop does more work than an iteration of the original loop, the numerical experiments show that the total computation time drops significantly, especially for problems with large sample sizes. A sketch of the restructuring follows.
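The key idea, in our variable names: the inner loop runs over the r_k values of the k-th index rather than over all u observed entries, with each iteration handling a batch of entries at once.

```python
import numpy as np

def c_tilde_by_index(c_u, U, theta, k, r_k):
    """c_u: gradient values at the u unique observed indices;
    U: (u, p) integer array of those indices; theta: current binary
    factors; k: the free mode."""
    w = np.ones(len(c_u))
    for l in range(U.shape[1]):          # weight from the fixed factors
        if l != k:
            w *= theta[l][U[:, l]]
    c_tilde = np.zeros(r_k)
    for j in range(r_k):                 # r_k iterations instead of u
        mask = U[:, k] == j
        c_tilde[j] = np.sum(c_u[mask] * w[mask])
    return c_tilde
```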

Technique 1B: Pattern
The second technique, Pattern, also accelerates AltMin, and it builds on the Index technique described above. In each iteration of the new loop in Index, we extract indexing information from U for the computation. Since U remains unchanged throughout the algorithm, we can instead extract this information once at the beginning of the algorithm, avoiding the repeated work later on; see the sketch below.
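Continuing the names of the previous sketch, the one-time precomputation might look like:

```python
# r: tuple of dimensions; U: (u, p) index array, as before.
# masks[k][j] selects the observed entries whose k-th index equals j;
# U never changes, so this replaces the repeated `U[:, k] == j` tests
# inside AltMin.
masks = [[U[:, k] == j for j in range(r[k])] for k in range(U.shape[1])]
```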

Technique 2: Sparse
The third technique, Sparse, accelerates the test that decides whether SiGD or LPsep determines the next iterate. At the beginning of each BCG iteration, Pts and lpar * c are used to find psi_a and psi_f. This operation involves a matrix multiplication between Pts and lpar * c, which is time-consuming when Pts is large (i.e., for large sample sizes or active vertex sets). Because the vertices are binary, Pts is sparse, so we represent it with a SciPy [Virtanen et al., 2020] sparse matrix instead of a NumPy [Harris et al., 2020] array. This not only accelerates the matrix multiplication but also reduces the memory needed to store Pts.
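A minimal sketch of the conversion (toy shapes of our choosing):

```python
import numpy as np
from scipy import sparse

# toy stand-in for the binary active-vertex matrix
Pts_dense = (np.random.rand(50, 10000) < 0.1).astype(float)
Pts = sparse.csr_matrix(Pts_dense)       # sparse representation
c = np.random.randn(10000)
lpar = 2.0
pro = Pts @ (lpar * c)                   # same values as the dense product, faster and smaller
```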

Technique 3: NAG
The fourth technique, NAG, accelerates SiGD. Since SiGD is a gradient descent method, we can use Nesterov's accelerated gradient (NAG) to accelerate it. Specifically, we applied the technique of Besançon et al. [2022] of transforming problem (6) into barycentric coordinates, so that we minimize over Δ_k instead of over the convex hull of the active vertex set. We restart NAG whenever the active vertex set changes.
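A generic sketch of projected NAG over the simplex (our simplification; not Besançon et al.'s exact scheme), using the standard Euclidean simplex projection:

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(y)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(y) + 1) > 0)[0][-1]
    return np.maximum(y - css[rho] / (rho + 1.0), 0.0)

def nag_over_simplex(grad, gamma0, steps=50, lr=0.1):
    """grad maps simplex weights to the gradient of the objective in
    barycentric coordinates; gamma0 is the starting point (restart here
    whenever the active vertex set changes)."""
    gamma, prev = gamma0.copy(), gamma0.copy()
    for t in range(1, steps + 1):
        mom = (t - 1) / (t + 2)              # Nesterov momentum coefficient
        y = gamma + mom * (gamma - prev)
        prev = gamma
        gamma = project_simplex(y - lr * grad(y))
    return gamma
```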

Technique 4: BPCG
The fifth technique, BPCG, replaces SiGD. Tsuji et al. [2021] developed Blended Pairwise Conditional Gradients (BPCG) by combining Pairwise Conditional Gradients (PCG) with the blending criterion from BCG. Specifically, we applied the lazified version of BPCG, which differs from BCG only in replacing SiGD with a pairwise step, sketched below.
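A sketch of the pairwise update that takes the place of SiGD (in a full implementation the step size would come from a line search):

```python
import numpy as np

def pairwise_step(lamb, pro, Pts, psi, step):
    """Shift weight from the away vertex to the (local) FW vertex."""
    i_a, i_f = int(np.argmax(pro)), int(np.argmin(pro))
    step = min(step, lamb[i_a])          # cannot move more weight than v_a carries
    lamb = lamb.copy()
    lamb[i_a] -= step
    lamb[i_f] += step
    psi_n = psi + step * (Pts[i_f] - Pts[i_a])
    return psi_n, lamb
```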

Computation Variants
We create ten computational variants of ABCG by combining the five acceleration techniques described above. They are described in Table 1 of Appendix 1. Two observations are in order. First, Index and Pattern are always used together, since Pattern relies on Index, so we combine and treat them as a single technique in all following discussions. Second, NAG and BPCG are mutually exclusive, because BPCG removes SiGD entirely.

NUMERICAL EXPERIMENTS
For benchmarking, we adopted the same problem instances (four sets of nonnegative tensor completion problems) and setup as Bugg et al. [2022]. We ran the original algorithm as well as each variant on these instances. The experiments were conducted on a laptop computer with 32GB of RAM and an Intel Core i7 2.2GHz processor with 6 cores/12 threads. The algorithms were coded in Python 3, and Gurobi v9.5.2 [Gurobi Optimization, LLC, 2022] was used to solve the integer programming problem (10). The code is available from https://github.com/WenhaoP/TensorComp. We note that the experiments in Bugg et al. [2022] were conducted on a laptop computer with 8GB of RAM and an Intel Core i5 2.3GHz processor with 2 cores/4 threads, with algorithms coded in Python 3 and Gurobi v9.1 [Gurobi Optimization, LLC, 2022] used to solve the integer programming problem (10).
We used NMSE to measure accuracy and recorded its arithmetic mean (with standard error) over one hundred repetitions of each problem instance. We also recorded the arithmetic mean (with standard error), geometric mean, minimum, median, and maximum of the computation time over the 100 repetitions of each problem instance. Generally speaking, the most effective variants were version 1 (Index + Pattern) and version 8 (BPCG + Index + Pattern), which offered consistent speedups over the original version 0 (ABCG). All variants achieve almost the same NMSE as the original algorithm on each problem, which shows that they maintain in practice the same theoretical guarantee as the original algorithm; moreover, their runtimes were mostly competitive with the original.

Experiments with Order-3 Tensors
The first set of problems consists of order-3 tensors with increasing dimensions and n = 500 samples. The results are in Table 2 of Appendix 2, and the arithmetic mean and median computation times are plotted in Figure 1. Among all versions, version 1 (Index + Pattern) has the lowest mean and median computation time for each problem except the 100 × 100 × 100 tensor problem, where the original version 0 (ABCG) is slightly faster. Thus, we claim version 1 (Index + Pattern) is the best to use for this set of problems.

Experiments with Increasing Tensor Order
The second set of problems consists of tensors of increasing order, with each dimension r_i = 10 for i = 1, …, p and n = 10,000 samples. The results are in Table 3 of Appendix 2, and the arithmetic mean and median computation times are plotted in Figure 2. Among all versions, version 1 (Index + Pattern) and version 8 (BPCG + Index + Pattern) have the lowest mean and median computation times for each problem except the 10^{×8} tensor problem. Thus, we claim versions 1 (Index + Pattern) and 8 (BPCG + Index + Pattern) are the best to use for this set of problems.
For the 10^{×8} tensor problem, version 5 (NAG + Index + Pattern) has the lowest median computation time. We observe that its median computation time is much lower than its mean computation time, and the same occurs for the other versions involving NAG. The maximum computation time of these versions is also much higher than that of the other versions. This is because, in certain runs, AltMin failed to solve the weak-separation oracle, and the Gurobi solver was needed to solve the time-consuming integer programming problem (10).

Experiments with Increasing Sample Size
The third and fourth sets of problems consist of tensors of size 10^{×6} and 10^{×7} with increasing sample sizes. The results are in Tables 4 and 5 of Appendix 2, and the arithmetic mean and median computation times are plotted in Figures 3 and 4. Among all versions, either version 1 (Index + Pattern) or version 8 (BPCG + Index + Pattern) has the lowest mean and median computation time for each problem. Thus, we claim versions 1 (Index + Pattern) and 8 (BPCG + Index + Pattern) are the best to use for these sets of problems.

DISCUSSION AND CONCLUSION
This paper proposed and evaluated multiple speedup techniques for the original numerical computation algorithm for nonnegative tensor completion created by Bugg et al. [2022]. We benchmarked these algorithm variants on the same set of problem instances designed by Bugg et al. [2022]. Our benchmarking results show that versions 1 (Index + Pattern) and 8 (BPCG + Index + Pattern) generally had the fastest computation times for solving the nonnegative tensor completion problem (6), offering substantial speedups over the original algorithm. Version 1 combines the Index and Pattern techniques, and version 8 combines the BPCG, Index, and Pattern techniques. Because version 8 is version 1 plus BPCG, version 8 may be preferred over version 1 on problem instances where the active vertex set could become large. Surprisingly, we found that Sparse failed to yield improvements. A possible reason is that the implementation of Sparse needs extra operations (e.g., finding the indices of nonzero entries) to convert NumPy arrays to SciPy sparse arrays. If the active vertex set is not large enough, these extra operations take at least as much time as is saved in the matrix multiplications with the active vertex set, and no improvement is observed. These results suggest that Index and Pattern are the most important techniques for acceleration, and that BCG and BPCG work about equally well overall.

Figure 1. Results for order-3 nonnegative tensors with size r × r × r and n = 500 samples.

Figure 4. Results for nonnegative tensors with size 10^{×7} and increasing sample size n.


Table 1. ABCG and its ten variants. True means the variant implements the technique, and False means it does not. For example, version 5 implements NAG, Index, and Pattern.

Table 2. Results for order-3 nonnegative tensors with n = 500 samples.

Table 3. Results for nonnegative tensors of increasing order with n = 10,000 samples. For each problem, the lowest mean and median computation times are highlighted.