On The Complexity of Sparse Label Propagation

This paper investigates the computational complexity of sparse label propagation, which has recently been proposed for processing network-structured data. Sparse label propagation amounts to a convex optimization problem and can be viewed as an extension of basis pursuit from sparse vectors to network-structured datasets. Using a standard first-order oracle model, we characterize the number of iterations required for sparse label propagation to achieve a prescribed accuracy. In particular, we derive an upper bound on the number of iterations required to achieve a certain accuracy and show that this upper bound is sharp for datasets having a chain structure (e.g., time series).


I. INTRODUCTION
A powerful approach to processing massive datasets is to use graph models. In particular, we consider datasets which can be characterized by an "empirical graph" (cf. [8, Ch. 11]) whose nodes represent individual data points and whose edges connect data points which are similar in an application-specific sense. The empirical graph for a particular dataset might be obtained from (domain) expert knowledge, from an intrinsic network structure (e.g., for social network data), or in a data-driven fashion by imposing smoothness constraints on observed realizations of graph signals (which serve as training data) [21], [23], [24], [27], [30], [31], [38]. Besides the graph structure, datasets carry additional information in the form of labels (e.g., class membership) associated with individual data points. We will represent such label information as graph signals defined over the empirical graph [37].
Using graph signals to represent datasets is appealing for several reasons. Indeed, having a graph model for a dataset facilitates scalable distributed data processing in the form of message passing over the empirical graph [34]. Moreover, graph models allow us to cope with heterogeneous datasets containing mixtures of different data types, since they only require an abstract notion of similarity between individual data points. In particular, the structure encoded in the graph model of a dataset makes it possible to capitalize, by exploiting the similarity between data points, on massive amounts of unlabeled data via semi-supervised learning [8]. This is important, since labelling data points is often expensive, and label information is therefore typically available only for a small fraction of the overall dataset. The labels of individual data points induce a graph signal which is defined over the associated empirical graph. We typically have access to the signal values (labels) of only a few data points, and the goal is to learn or recover the remaining graph signal values (labels) of all other data points.
The processing of graph signals relies on particular models for graph signals. A prominent line of work applies spectral graph theory to extend the notion of band-limited signals from the time domain (which corresponds to the special case of a chain graph) to arbitrary graphs [8]-[10], [12], [17], [37]. These band-limited graph signals are smooth in the sense of having a small variation over well-connected subsets of nodes, where the variation is measured by the Laplacian quadratic form. In contrast, our approach targets datasets whose labels induce piece-wise constant graph signals, i.e., the signal values (labels) of data points belonging to a well-connected subset of data points (a cluster) are nearly identical. This signal model is useful, e.g., in change-point detection, image segmentation, or anomaly detection, where signal values might change abruptly [14], [15], [19], [35], [36].
The works closest to ours are [19], [35], [36] for general graph models, as well as a line of work on total variation-based image processing [5], [7], [29]. In contrast to [5], [7], [29], which consider only regular grid graphs, our approach applies to arbitrary graph topologies. The methods presented in [15], [35], [36] also apply to arbitrary graph topologies but require (noisy) labels to be available for all data points, while we consider labels available only on a small subset of nodes.

I-A. Contributions and Outline
In Section II, we formulate the problem of recovering clustered graph signals as a convex optimization problem. We solve this optimization problem by applying a preconditioned variant of the primal-dual method of Pock and Chambolle [29]. As detailed in Section III, the resulting algorithm can be implemented as a highly scalable message passing protocol, which we coin sparse label propagation (SLP). In Section IV, we present our main result, which is an upper bound on the number of SLP iterations ensuring a prescribed accuracy. We also discuss the tightness of this upper bound for datasets whose empirical graph is a chain graph (e.g., time series).

I-B. Notation
Given a vector x = (x_1, ..., x_n)^T ∈ R^n, we define the norms ‖x‖_1 := Σ_{i=1}^{n} |x_i| and ‖x‖_2 := (Σ_{i=1}^{n} x_i^2)^{1/2}. For a positive definite matrix Q, we define the norm ‖x‖_Q := (x^T Q x)^{1/2}. The signum sign{x} of a vector x = (x_1, ..., x_d)^T is defined as the vector (sign(x_1), ..., sign(x_d))^T ∈ R^d, with the scalar signum function sign(a) := 1 for a > 0, sign(a) := 0 for a = 0, and sign(a) := -1 for a < 0. Throughout this paper we consider convex functions g(x) whose epigraphs epi g := {(x, t) : x ∈ R^n, g(x) ≤ t} ⊆ R^n × R are non-empty closed convex sets [32]. Given such a convex function g(x), we denote its subdifferential at x by ∂g(x) := {y ∈ R^n : g(x') ≥ g(x) + y^T (x' - x) for all x' ∈ R^n}, and its convex conjugate function by [4]

g*(ŷ) := sup_{x ∈ R^n} ( ŷ^T x - g(x) ).   (2)

We can re-obtain a convex function g(x) from its convex conjugate via [4]

g(x) = sup_{ŷ ∈ R^n} ( ŷ^T x - g*(ŷ) ).   (3)
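As a concrete illustration of (2), consider the scalar function g(x) = |x|, whose conjugate is the indicator of the interval [-1, 1] (it equals 0 for |ŷ| ≤ 1 and grows without bound otherwise); this conjugate pair reappears in Section III. A minimal sketch in plain Python, where the function name and the finite grid standing in for the supremum are ours:

```python
def conj_abs(y_hat, grid):
    # Approximate the convex conjugate (2) of g(x) = |x| by taking the
    # supremum over a finite grid: g*(y_hat) = sup_x (y_hat * x - |x|).
    return max(y_hat * x - abs(x) for x in grid)

grid = [i / 10.0 for i in range(-1000, 1001)]  # x in [-100, 100]

# |y_hat| <= 1: the supremum is attained at x = 0 and equals 0.
assert conj_abs(0.5, grid) == 0.0
# |y_hat| > 1: y_hat * x - |x| grows without bound (here capped by the grid).
assert conj_abs(2.0, grid) == 100.0
```

On the bounded grid the second case returns a large finite value; in the exact supremum it is infinite, reflecting that g* is an indicator function.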

II. PROBLEM SETTING
We consider network-structured datasets which are represented by an undirected weighted graph G = (V, E, W), referred to as the "empirical graph" (see Figure 1). The nodes i ∈ V of the empirical graph represent individual data points, such as user profiles in a social network or documents in a repository. An undirected edge {i, j} ∈ E of the empirical graph encodes a notion of (physical or statistical) proximity of neighbouring data points, such as profiles of befriended social network users or documents which have been co-authored by the same person. This network structure can be identified with conditional independence relations within probabilistic graphical models (PGM) [21], [23], [24], [30], [31].
As opposed to PGM, we consider a fully deterministic graph-based model which does not invoke an underlying probability distribution for the observed data. In particular, given an edge {i, j} ∈ E, the nonzero value W_{i,j} > 0 represents the amount of similarity between the data points i, j ∈ V. The edge set E can be read off from the non-zero pattern of the weight matrix

W ∈ R^{|V|×|V|}, with W_{i,j} > 0 if and only if {i, j} ∈ E.   (4)

According to (4), we could in principle handle network-structured datasets using traditional multivariate (vector/matrix based) methods. However, putting the emphasis on the empirical graph leads naturally to scalable algorithms which are implemented as message passing methods (see Algorithm 2 below). The neighbourhood and weighted degree (strength) of a node i ∈ V are

N(i) := {j ∈ V : {i, j} ∈ E} and d_i := Σ_{j ∈ N(i)} W_{i,j}.   (5)

In what follows we assume the empirical graph to be connected, i.e., d_i > 0 for all nodes i ∈ V, and to have no self-loops, i.e., W_{i,i} = 0 for all i ∈ V. The maximum (weighted) node degree is

d_max := max_{i ∈ V} d_i.   (6)

It will be convenient to orient the undirected empirical graph G = (V, E, W), which yields the directed version G→ = (V, E→, W). The orientation amounts to declaring, for each edge e = {i, j}, one node as the head (origin node) and the other node as the tail (destination node), denoted e+ and e-, respectively. Given a set of edges S ⊆ E in the undirected graph G, we denote the corresponding set of directed edges in G→ by S→. The weighted incidence matrix D ∈ R^{|E|×|V|} of the oriented graph has the entries

D_{e,i} := W_e for i = e+, D_{e,i} := -W_e for i = e-, and D_{e,i} := 0 otherwise.   (7)

If we number the nodes and orient the edges in the chain graph in Fig. 1-(a) from left to right, each row of its weighted incidence matrix contains the entry W_{i,i+1} in column i and -W_{i,i+1} in column i+1, with all other entries zero. The directed neighbourhoods of a node i ∈ V are defined as N+(i) := {j ∈ V : e = {i, j} ∈ E and e+ = i} and N-(i) := {j ∈ V : e = {i, j} ∈ E and e- = i}, respectively. We highlight that the particular choice of orientation for the empirical graph G has no effect on our results and methods and will be used only for notational convenience.
In many applications we can associate each data point i ∈ V with a label x_i; e.g., in a social network application the label x_i might encode the group membership of member i ∈ V. We interpret the labels x_i as values of a graph signal x defined over the empirical graph G. Formally, a graph signal x ∈ R^{|V|} defined over the graph G maps each node i ∈ V to the graph signal value x[i] ∈ R. Since acquiring labels is often costly and error-prone, we typically have access to a few noisy labels x̂_i for the data points i ∈ M within a (small) subset M ⊆ V of nodes of the empirical graph. Thus, we are interested in recovering the entire graph signal x from knowledge of its values x[i] = x̂_i on the small subset M ⊆ V of labeled nodes. The signal recovery will be based on a clustering assumption [8].

Clustering Assumption (informal). Consider a graph signal x ∈ R^{|V|} whose signal values are the (mostly unknown) labels of the data points. The signal values x[i], x[j] at nodes i, j ∈ V within a well-connected subset (cluster) of nodes of the empirical graph are similar, i.e., x[i] ≈ x[j].

This assumption of clustered graph signals x can be made precise by requiring a small total variation (TV)

‖x‖_TV := Σ_{{i,j} ∈ E} W_{i,j} |x[j] - x[i]|.   (8)

The incidence matrix D (cf. (7)) allows us to represent the TV of a graph signal conveniently as

‖x‖_TV = ‖Dx‖_1.   (9)

We note that related but different measures for the total variation of a graph signal have been proposed previously (see, e.g., [11], [33]). The definition (8) is appealing for several reasons. First, it conforms with the class of piece-wise constant or clustered graph signals, which has proven useful in several applications including meteorology and binary classification [13], [18]. Second, as we demonstrate in what follows, the definition (8) allows us to derive semi-supervised learning methods which can be implemented by efficient message passing over the underlying empirical graph, ensuring scalability of the resulting algorithm to large-scale (big) data.
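The identity (9) can be checked numerically. The sketch below (plain Python; the helper names are ours) builds the weighted incidence matrix (7) of a small chain graph, oriented left to right, and verifies that the edge-wise sum (8) agrees with ‖Dx‖_1:

```python
def chain_incidence(weights):
    # Weighted incidence matrix D from (7) for a chain graph whose e-th
    # edge {e, e+1} is oriented left to right: D[e][e] = W_e (head) and
    # D[e][e+1] = -W_e (tail); all other entries are zero.
    n = len(weights) + 1
    D = [[0.0] * n for _ in weights]
    for e, w in enumerate(weights):
        D[e][e] = w
        D[e][e + 1] = -w
    return D

def tv_edgewise(weights, x):
    # Total variation (8): weighted absolute differences over the edges.
    return sum(w * abs(x[e + 1] - x[e]) for e, w in enumerate(weights))

def tv_incidence(D, x):
    # Total variation via (9): ||D x||_1.
    return sum(abs(sum(d * xi for d, xi in zip(row, x))) for row in D)

weights = [1.0, 0.5, 2.0]
x = [1.0, 1.0, 0.0, 0.0]  # clustered (piece-wise constant) graph signal
D = chain_incidence(weights)
assert tv_edgewise(weights, x) == tv_incidence(D, x) == 0.5
```

Only the single boundary edge {2, 3}, with weight 0.5, contributes to the TV of this clustered signal, which is the behaviour the clustering assumption exploits.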
A sensible strategy for recovering a graph signal with small TV is to minimize the TV ‖x‖_TV while requiring consistency with the observed noisy labels {x̂_i}_{i∈M}, i.e.,

x̂_SLP ∈ arg min_{x ∈ R^{|V|}} ‖x‖_TV subject to x[i] = x̂_i for all i ∈ M.   (10)

The objective function of the optimization problem (10) is the seminorm ‖x‖_TV, which is a convex function. Since, moreover, the constraints in (10) are linear, the optimization problem (10) is a convex optimization problem [4]. Rather trivially, problem (10) is equivalent to

x̂_SLP ∈ arg min_{x ∈ Q} ‖x‖_TV.   (11)

Here, we used the constraint set Q := {x ∈ R^{|V|} : x[i] = x̂_i for all i ∈ M}, which collects all graph signals x ∈ R^{|V|} that match the observed labels x̂_i on the sampling set M.
The usefulness of the learning problem (10) depends on two aspects: (i) the deviation of the solutions of (10) from the true underlying graph signal, and (ii) the difficulty (complexity) of computing the solutions of (10). The first aspect has been addressed in [25], which presents precise conditions on the sampling set M and the topology of the empirical graph G such that any solution of (10) is close to the true underlying graph signal whenever the latter is (approximately) piece-wise constant over well-connected subsets of nodes (clusters). The focus of this paper is the second aspect, i.e., the difficulty or complexity of computing approximate solutions of (10).
In what follows we apply an efficient primal-dual method to solve the convex optimization problem (10). This primal-dual method is appealing since it provides a theoretical convergence guarantee and also allows for an efficient implementation as message passing over the underlying empirical graph (cf. Algorithm 2 below). We coin the resulting semi-supervised learning algorithm sparse label propagation (SLP), since it bears some conceptual similarity to the ordinary label propagation (LP) algorithm for semi-supervised learning over graph models. In particular, LP algorithms can be interpreted as message passing methods for solving the recovery problem [8, Chap. 11.3.4]

x̂_LP ∈ arg min_{x ∈ Q} Σ_{{i,j} ∈ E} W_{i,j} (x[i] - x[j])^2.   (12)

The recovery problem (12) minimizes the weighted sum of squares of the signal differences x[i] - x[j] arising over the edges {i, j} ∈ E of the empirical graph G, while SLP (10) minimizes a weighted sum of their absolute values. It turns out that using the absolute values of the signal differences instead of their squares allows SLP methods to accurately learn graph signals x which vary abruptly over few edges, e.g., the clustered graph signals considered in [13], [18]. In contrast, LP methods tend to smooth out such abrupt signal variations.
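The different behaviour of the two criteria can be seen on a toy chain graph: a piece-wise constant "step" signal and a smoothly interpolating "ramp" with the same boundary values have identical TV, while the quadratic objective of (12) strictly prefers the ramp. A small sketch (plain Python; the helper names are ours):

```python
def lp_objective(weights, x):
    # Weighted sum of SQUARED signal differences, as minimized by LP in (12).
    return sum(w * (x[e + 1] - x[e]) ** 2 for e, w in enumerate(weights))

def tv_objective(weights, x):
    # Weighted sum of ABSOLUTE signal differences, i.e., the TV minimized by SLP.
    return sum(w * abs(x[e + 1] - x[e]) for e, w in enumerate(weights))

w = [1.0] * 4                        # unit-weight chain with 5 nodes
step = [1.0, 1.0, 1.0, 0.0, 0.0]     # abrupt, clustered signal
ramp = [1.0, 0.75, 0.5, 0.25, 0.0]   # smoothed-out signal

# Both signals connect x[0] = 1 to x[4] = 0, and their TV coincides ...
assert tv_objective(w, step) == tv_objective(w, ramp) == 1.0
# ... but the quadratic objective strictly favours the smooth ramp.
assert lp_objective(w, ramp) < lp_objective(w, step)
```

This is why minimizing (12) smears out sharp cluster boundaries, whereas the TV objective is indifferent to where, and how abruptly, the unit drop is placed.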
The SLP problem (10) is also closely related to the recently proposed network Lasso [19], [26]

x̂_nLasso ∈ arg min_{x ∈ R^{|V|}} Σ_{i ∈ M} (x[i] - x̂_i)^2 + λ ‖x‖_TV.   (13)

Indeed, according to Lagrangian duality [3], [4], by choosing λ in (13) suitably, the solutions of (13) coincide with those of (10). The tuning parameter λ trades a small empirical label fitting error Σ_{i∈M} (x[i] - x̂_i)^2 against a small total variation ‖x̂_nLasso‖_TV of the learned graph signal x̂_nLasso. Choosing a large value of λ enforces small total variation of the learned graph signal, while using a small value of λ puts more emphasis on the empirical error. In contrast to network Lasso (13), which requires choosing the parameter λ (e.g., using (cross-)validation [20], [22]), the SLP method (10) does not require any parameter tuning.

III. SPARSE LABEL PROPAGATION
The recovery problem (10) is a convex optimization problem with a non-differentiable objective function, which precludes the use of standard gradient methods such as (accelerated) gradient descent. However, the objective function and the constraint set of the optimization problem (10) each have a rather simple structure individually. This suggests the use of efficient proximal methods [28] for solving (10). In particular, we apply a preconditioned variant of the primal-dual method introduced in [6] to solve (10).
In order to apply the primal-dual method of [6], we reformulate (10) as the unconstrained problem (cf. (11))

x̂_SLP ∈ arg min_{x ∈ R^{|V|}} g(Dx) + h(x), with g(y) := ‖y‖_1.   (14)

The function h(x) in (14) is the indicator function (cf. [32]) of the convex set Q, i.e., h(x) = 0 for x ∈ Q and h(x) = ∞ otherwise; it can be described also via its epigraph epi h = Q × [0, ∞). It will be useful to define another optimization problem, which might be considered a dual problem to (14), i.e.,

ŷ_SLP ∈ arg max_{y ∈ R^{|E|}} f(y), with f(y) := -h*(-D^T y) - g*(y).   (15)

Note that the objective function f(y) of the dual SLP problem (15) involves the convex conjugates h*(·) and g*(·) (cf. (2)) of the convex functions h(x) and g(y) which define the primal SLP problem (14). By elementary convex analysis [32], the solutions x̂_SLP of (14) are characterized by the zero-subgradient condition

0 ∈ D^T ∂g(Dx̂_SLP) + ∂h(x̂_SLP).   (16)

A particular class of iterative methods for solving (14), referred to as proximal methods, is obtained via fixed-point iterations of some operator P : R^{|V|} → R^{|V|} whose fixed points are precisely the solutions of (16), i.e.,

P(x̂_SLP) = x̂_SLP.   (17)

In general, the operator P is not unique, i.e., there are different choices for P such that (17) is valid. These different choices for the operator P in (17) result in different proximal methods [28].
One approach to constructing the operator P in (17) is based on convex duality [32, Thm. 31.3], according to which a graph signal x̂_SLP ∈ R^{|V|} solves (14) if and only if there exists a (dual) vector ŷ_SLP ∈ R^{|E|} such that

-(D^T ŷ_SLP) ∈ ∂h(x̂_SLP), and Dx̂_SLP ∈ ∂g*(ŷ_SLP).   (18)

The dual vector ŷ_SLP ∈ R^{|E|} represents a signal defined over the edges E of the empirical graph G, with the entry ŷ_SLP[e] being the signal value associated with the particular edge e ∈ E. Let us now rewrite the two coupled conditions in (18) as

x̂_SLP - ΓD^T ŷ_SLP ∈ x̂_SLP + Γ∂h(x̂_SLP), and ŷ_SLP + ΛDx̂_SLP ∈ ŷ_SLP + Λ∂g*(ŷ_SLP),   (19)

with the invertible diagonal matrices (cf. (4) and (5))

Γ := diag(1/d_i)_{i ∈ V} and Λ := diag(1/(2W_e))_{e ∈ E}.   (20)

The specific choice (20) for the matrices Γ and Λ can be shown to satisfy [29, Lemma 2]

‖Λ^{1/2} D Γ^{1/2}‖_2 ≤ 1,   (21)

which will turn out to be crucial for ensuring the convergence of the iterative algorithm we propose for solving (14).
It will be convenient to define the resolvent operators for the functions g*(y) and h(x) (cf. (14) and (2)) [29, Sec. 1.1]

(I + Λ∂g*)^{-1}(ŷ) and (I + Γ∂h)^{-1}(x̂).   (22)

We can now rewrite the optimality condition (19) (for x̂_SLP, ŷ_SLP to be primal and dual optimal) more compactly as

x̂_SLP = (I + Γ∂h)^{-1}(x̂_SLP - ΓD^T ŷ_SLP), and ŷ_SLP = (I + Λ∂g*)^{-1}(ŷ_SLP + ΛDx̂_SLP).   (23)

The characterization (23) of the solutions x̂_SLP ∈ R^{|V|} of the SLP problem (10) leads naturally to the following fixed-point iterations for finding x̂_SLP (cf. [29]):

ŷ^(k+1) := (I + Λ∂g*)^{-1}(ŷ^(k) + ΛD(2x^(k) - x^(k-1))), x^(k+1) := (I + Γ∂h)^{-1}(x^(k) - ΓD^T ŷ^(k+1)).   (24)

The fixed-point iterations (24) are similar to those considered in [6, Sec. 6.2] for the grid graphs arising in image processing. In contrast, the iterations (24) are formulated for an arbitrary graph (network) structure, represented by the incidence matrix D ∈ R^{|E|×|V|}. By evaluating the application of the resolvent operators (cf. (22)), we obtain simple closed-form expressions (cf. [6, Sec. 6.2]) for the updates in (24), yielding, in turn, Algorithm 1.
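For the particular functions g and h of the SLP problem (14), both resolvent operators in (22) admit simple closed forms: since g*(y) is the indicator of the ℓ∞ unit ball, the resolvent of Λ∂g* reduces to entry-wise clipping onto [-1, 1], and since h is the indicator of Q, the resolvent of Γ∂h is the projection which overwrites the sampled entries with their observed labels. A minimal sketch (plain Python; the function names are ours):

```python
def resolvent_g_conj(y):
    # (I + Lambda dg*)^{-1}: since g* is the indicator of the l_inf unit
    # ball, the resolvent is entry-wise clipping onto [-1, 1], for any
    # positive diagonal Lambda.
    return [max(-1.0, min(1.0, v)) for v in y]

def resolvent_h(x, samples):
    # (I + Gamma dh)^{-1}: since h is the indicator of Q, the resolvent
    # projects onto Q by overwriting the sampled entries with their
    # observed labels, for any positive diagonal Gamma.
    out = list(x)
    for i, label in samples.items():
        out[i] = label
    return out

assert resolvent_g_conj([2.0, -0.3, -5.0]) == [1.0, -0.3, -1.0]
assert resolvent_h([0.2, 0.9, 0.4], {0: 1.0, 2: 0.0}) == [1.0, 0.9, 0.0]
```

Both operators act entry-wise, which is what makes the message passing implementation of Section III possible.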
Algorithm 1 Sparse Label Propagation
Input: incidence matrix D, sampling set M, noisy labels {x̂_i}_{i∈M}.
Initialize: k := 0, x^(0) := 0, x^(-1) := 0, ŷ^(0) := 0, x̄^(0) := 0.
1: repeat
2: x̃ := 2x^(k) - x^(k-1)
3: ŷ^(k+1) := ŷ^(k) + ΛDx̃
4: ŷ^(k+1)[e] := ŷ^(k+1)[e] / max{1, |ŷ^(k+1)[e]|} for all edges e ∈ E
5: x^(k+1) := x^(k) - ΓD^T ŷ^(k+1)
6: x^(k+1)[i] := x̂_i for all sampled nodes i ∈ M
7: k := k + 1
8: x̄^(k) := (1 - 1/k) x̄^(k-1) + (1/k) x^(k)
9: until stopping criterion is satisfied
Output: x̄^(k)

Note that Algorithm 1 does not directly output the iterate x^(k) but rather its running average x̄^(k). Computing the running average (see step 8 of Algorithm 1) requires only little extra effort but allows for a simpler convergence analysis (see the proof of Theorem 1 in the Appendix).
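The following sketch implements the updates of Algorithm 1 for an arbitrary oriented graph in plain Python (0-based node indices; the diagonal preconditioners Γ_{i,i} = 1/d_i and Λ_{e,e} = 1/(2W_e) are one choice compatible with (20)/(21), and the helper names are ours). On the weighted chain graph of Section IV, the running average of the iterates approaches the TV-optimal signal:

```python
def slp(n, edges, samples, num_iter):
    """Run the iterations (24) and return the running average of the
    primal iterates. edges: list of (head, tail, weight) tuples of the
    oriented graph; samples: dict mapping sampled nodes to labels."""
    # Diagonal preconditioners: Gamma[i] = 1/d_i, Lambda[e] = 1/(2 W_e).
    d = [0.0] * n
    for i, j, w in edges:
        d[i] += w
        d[j] += w
    gamma = [1.0 / di for di in d]
    lam = [1.0 / (2.0 * w) for _, _, w in edges]

    x, x_prev = [0.0] * n, [0.0] * n
    y = [0.0] * len(edges)
    x_bar = [0.0] * n
    for k in range(num_iter):
        x_tilde = [2.0 * a - b for a, b in zip(x, x_prev)]
        # Dual update: y := clip(y + Lambda D x_tilde) onto [-1, 1]^E.
        for e, (i, j, w) in enumerate(edges):
            y[e] = max(-1.0, min(1.0, y[e] + lam[e] * w * (x_tilde[i] - x_tilde[j])))
        x_prev = list(x)
        # Primal update: x := x - Gamma D^T y, then project onto Q.
        for e, (i, j, w) in enumerate(edges):
            x[i] -= gamma[i] * w * y[e]
            x[j] += gamma[j] * w * y[e]
        for i, label in samples.items():
            x[i] = label
        # Running average of the primal iterates (step 8 of Algorithm 1).
        x_bar = [(k * xb + xi) / (k + 1) for xb, xi in zip(x_bar, x)]
    return x_bar

# Chain graph of Section IV (0-based): W_{i,i+1} = 1/(i+1), labels at the ends.
n = 5
edges = [(i, i + 1, 1.0 / (i + 1)) for i in range(n - 1)]
x_bar = slp(n, edges, {0: 1.0, n - 1: 0.0}, num_iter=5000)
tv = sum(w * abs(x_bar[j] - x_bar[i]) for i, j, w in edges)
assert x_bar[0] == 1.0 and x_bar[n - 1] == 0.0  # sampled labels are enforced
assert tv < 0.4  # the optimal TV here is 1/4; averages approach it as O(1/K)
```

Note that the sampled entries of the running average are exact in every iteration, since step 6 of Algorithm 1 enforces them before averaging.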
One of the appealing properties of Algorithm 1 is that it allows for a highly scalable implementation via message passing over the underlying empirical graph G. This message passing implementation, summarized in Algorithm 2, is obtained by implementing the application of the graph incidence matrix D and its transpose D^T (cf. steps 2-5 of Algorithm 1) by local updates of the labels x[i], i.e., updates which involve only the neighbourhoods N(i), N(j) of the edges {i, j} ∈ E in the empirical graph G.
Note that executing Algorithm 2 does not require collecting global knowledge about the entire empirical graph (such as the maximum node degree d_max (6)) at some central processing unit. Indeed, if we associate each node of the data graph with a computational unit, the execution of Algorithm 2 only requires each node i ∈ V to store the values {ŷ[{i, j}], W_{i,j}}_{j ∈ N(i)} and x^(k)[i]. Moreover, the number of arithmetic operations required at each node i ∈ V during each time step is proportional to the number of its neighbours |N(i)|. These characteristics allow Algorithm 2 to scale to massive datasets (big data) whenever they can be represented by sparse networks having a small maximum degree d_max (6). The datasets generated in many important applications have been found to be accurately represented by such sparse networks [1].

IV. COMPLEXITY OF SPARSE LABEL PROPAGATION
There are various options for the stopping criterion in Algorithm 1, e.g., using a fixed number of iterations or testing for a sufficient decrease of the objective function (cf. [2]). When using a fixed number of iterations, we need a precise characterization of how many iterations are required to guarantee a prescribed accuracy of the resulting estimate. Such a characterization is provided by the following result.

Algorithm 2 Sparse Label Propagation as Message Passing
Input: directed empirical graph G→ = (V, E→, W), sampling set M, noisy labels {x̂_i}_{i∈M}.
1: repeat
2: for all nodes i ∈ V: x̃[i] := 2x^(k)[i] - x^(k-1)[i]
3: for all edges e = {i, j} ∈ E→: ŷ^(k+1)[e] := ŷ^(k)[e] + Λ_{e,e} W_{i,j} (x̃[i] - x̃[j])
4: for all edges e ∈ E→: ŷ^(k+1)[e] := ŷ^(k+1)[e] / max{1, |ŷ^(k+1)[e]|}
5: for all nodes i ∈ V: x^(k+1)[i] := x^(k)[i] - Γ_{i,i} ( Σ_{j ∈ N+(i)} W_{i,j} ŷ^(k+1)[{i,j}] - Σ_{j ∈ N-(i)} W_{i,j} ŷ^(k+1)[{i,j}] )
6: for all sampled nodes i ∈ M: x^(k+1)[i] := x̂_i
7: for all nodes i ∈ V: x̄^(k+1)[i] := (k x̄^(k)[i] + x^(k+1)[i]) / (k + 1)
8: until stopping criterion is satisfied

Theorem 1. Consider the sequences x^(k) and ŷ^(k) obtained from the update rule (24), starting from arbitrary initializations x^(0) and ŷ^(0). The averages

x̄^(K) := (1/K) Σ_{k=0}^{K-1} x^(k) and ȳ^(K) := (1/K) Σ_{k=0}^{K-1} ŷ^(k),

obtained after K iterations of (24), satisfy

‖x̄^(K)‖_TV - ‖x̂_SLP‖_TV ≤ (1/(2K)) ( ‖x^(0) - x̂_SLP‖²_{Γ^{-1}} + ‖ŷ^(0) - ỹ^(K)‖²_{Λ^{-1}} ),   (26)

with ỹ^(K) = sign{Dx̄^(K)}. Moreover, the sequence ŷ^(k) converges to a solution of the dual SLP problem (15).

According to (26), the sub-optimality in terms of objective value incurred by the output of Algorithm 1 after K iterations is bounded as

‖x̄^(K)‖_TV - ‖x̂_SLP‖_TV ≤ c/K,   (27)

where the constant c does not depend on K but might depend on the empirical graph via its weighted incidence matrix D (cf. (7)) as well as on the initial labels x̂_i. The bound (27) suggests that in order to reduce the sub-optimality by a factor of two, we need to run Algorithm 1 for twice as many iterations.
Let us now show that the bound (27) on the convergence speed is essentially tight. What is more, the bound cannot be improved substantially by any learning method, such as SLP (14) or network Lasso (13), which is implemented as message passing over the underlying empirical graph G. To this end, we consider a dataset whose empirical graph is a weighted chain graph (see Figure 2) with nodes V = {1, ..., N} which are connected by the N - 1 edges E = {{i, i+1}}_{i=1,...,N-1}. The weights of the edges are W_{i,i+1} = 1/i. The labels of the data points induce a graph signal x defined over G with x[i] = 1 for all nodes i ∈ {1, ..., N-1} and x[N] = 0. We observe the graph signal noise-free on the sampling set M = {1, N}, resulting in the observations x̂_1 = 1 and x̂_N = 0. According to [25, Theorem 3], the solution x̂_SLP of the SLP problem (14) is unique and coincides with the true underlying graph signal x. Thus, the optimal objective function value is ‖x̂_SLP‖_TV = ‖x‖_TV = 1/(N - 1).

Fig. 2. The empirical graph G is a chain graph with edge weights W_{i,i+1} = 1/i. We aim at recovering a graph signal from the observations x̂_1 = 1 and x̂_N = 0 using Algorithm 2.

On the other hand, the output x̄^(K) of Algorithm 1 after K iterations satisfies x̄^(K)[1] = 1 and x̄^(K)[i] = 0 for all nodes i ∈ {K + 1, ..., N}. Thus, since the unit drop of x̄^(K) must occur over the first K edges, whose weights satisfy W_{i,i+1} = 1/i ≥ 1/K,

‖x̄^(K)‖_TV ≥ 1/K,   (28)

implying, in turn,

‖x̄^(K)‖_TV - ‖x̂_SLP‖_TV ≥ 1/K - 1/(N - 1).   (29)

In the regime K/N ≪ 1, which is reasonable for big data applications where the number of iterations K computed in Algorithm 1 is small compared to the size N of the dataset, the dependency of the lower bound (29) on the number of iterations is essentially ∝ 1/K and therefore matches the upper bound (27). This example indicates that, for a certain structure of edge weights, chain graphs are among the most challenging topologies regarding the convergence speed of SLP.
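The locality argument underlying (28) is easy to reproduce: starting all iterates from zero, each iteration of the updates (24) can propagate non-zero values by at most one hop, so after K iterations every unlabeled node more than K hops away from a labeled node still carries the value 0. A small sketch on the chain graph of Fig. 2 (plain Python, 0-based indices, with the same preconditioning choice as in Section III):

```python
# Chain graph of Fig. 2 (0-based): N nodes, W_{i,i+1} = 1/(i+1),
# labels observed only at the two end nodes.
N, K = 12, 3
w = [1.0 / (e + 1) for e in range(N - 1)]
lam = [1.0 / (2.0 * we) for we in w]                          # Lambda from (20)
d = [w[0]] + [w[e - 1] + w[e] for e in range(1, N - 1)] + [w[-1]]
gamma = [1.0 / di for di in d]                                # Gamma from (20)

x, x_prev = [0.0] * N, [0.0] * N
y = [0.0] * (N - 1)
for k in range(K):
    x_t = [2.0 * a - b for a, b in zip(x, x_prev)]
    for e in range(N - 1):                 # dual update over the edges
        y[e] = max(-1.0, min(1.0, y[e] + lam[e] * w[e] * (x_t[e] - x_t[e + 1])))
    x_prev = list(x)
    for e in range(N - 1):                 # primal update over the edges
        x[e] -= gamma[e] * w[e] * y[e]
        x[e + 1] += gamma[e + 1] * w[e] * y[e]
    x[0], x[N - 1] = 1.0, 0.0              # enforce the observed labels

# After K iterations the signal still vanishes on every interior node
# more than K hops away from the labeled node 0.
assert all(x[i] == 0.0 for i in range(K, N - 1))
```

Since the running average x̄^(K) is a convex combination of such iterates, it inherits these zeros, which is exactly the property used to establish (28).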

V. CONCLUSIONS
We have studied the intrinsic complexity of sparse label propagation by deriving an upper bound on the number of iterations required to achieve a given accuracy.This upper bound is essentially tight as it cannot be improved substantially for the particular class of graph signals defined over a chain graph (such as time series).

APPENDIX PROOF OF THEOREM 1
Our proof closely follows the argument used for deriving [6, Thm. 1]. Let us start by rewriting the objective function of the SLP problem (14) using convex conjugate functions (cf. (3)) as

g(Dx) = sup_{y ∈ R^{|E|}} ( y^T Dx - g*(y) ),   (30)

so that we can reformulate the SLP problem (14) equivalently as

x̂_SLP ∈ arg min_{x ∈ R^{|V|}} sup_{y ∈ R^{|E|}} L(x, y), with L(x, y) := h(x) + y^T Dx - g*(y).   (31)

The SLP dual problem (15) is obtained by swapping the order of minimization and maximization:

ŷ_SLP ∈ arg max_{y ∈ R^{|E|}} inf_{x ∈ R^{|V|}} L(x, y).   (32)

According to [32, Corollary 31.2.1], the optimal objective values of the primal problem (31) and the dual problem (32) coincide, i.e.,

inf_{x ∈ R^{|V|}} sup_{y ∈ R^{|E|}} L(x, y) = sup_{y ∈ R^{|E|}} inf_{x ∈ R^{|V|}} L(x, y).   (33)

Therefore, by combining (33) with [32, Lemma 36.2], any pair x̂_SLP, ŷ_SLP consisting of a primal and a dual optimal point forms a saddle point of L, i.e.,

L(x̂_SLP, y) ≤ L(x̂_SLP, ŷ_SLP) ≤ L(x, ŷ_SLP) for all x ∈ R^{|V|}, y ∈ R^{|E|},   (34)

and, moreover, the common saddle value satisfies L(x̂_SLP, ŷ_SLP) = g(Dx̂_SLP) + h(x̂_SLP) = f(ŷ_SLP).