A Fixed-Point of View on Gradient Methods for Big Data

Interpreting gradient methods as fixed-point iterations, we provide a detailed analysis of those methods for minimizing convex objective functions. Due to their conceptual and algorithmic simplicity, gradient methods are widely used in machine learning for massive data sets (big data). In particular, stochastic gradient methods are considered the de- facto standard for training deep neural networks. Studying gradient methods within the realm of fixed-point theory provides us with powerful tools to analyze their convergence properties. In particular, gradient methods using inexact or noisy gradients, such as stochastic gradient descent, can be studied conveniently using well-known results on inexact fixed-point iterations. Moreover, as we demonstrate in this paper, the fixed-point approach allows an elegant derivation of accelerations for basic gradient methods. In particular, we will show how gradient descent can be accelerated by a fixed-point preserving transformation of an operator associated with the objective function.


I. INTRODUCTION
One of the main recent trends within machine learning and data analytics using massive data sets is to leverage the inferential strength of the vast amounts of data by using relatively simple, but fast, optimization methods as algorithmic primitives [5].Many of these optimization methods are modifications of the basic gradient descent (GD) method.Indeed, computationally more heavy approaches, such as interior point methods, are often infeasible for a given limited computational budget [7].
Moreover, the rise of deep learning has brought a significant boost for the interest in gradient methods.Indeed, a major insight within the theory of deep learning is that for typical high-dimensional models, e.g., those represented by deep neural networks, most of the local minima of the cost function (e.g., the empirical loss or training error) are reasonably close (in terms of objective value) to the global optimum [11].These local minima can be found efficiently by gradient methods such as stochastic gradient descent (SGD), which is considered the de-facto standard algorithmic primitive for training deep neural networks [11].This paper elaborates on the interpretation of some basic gradient methods such as GD and its variants as fixedpoint iterations.These fixed-point iterations are obtained for operators associated with the convex objective function.Emphasizing the connection to fixed-point theory unleashes some powerful tools, e.g., on the acceleration of fixedpoint iterations [17] or inexact fixed-point iterations [1], [3], for the analysis and construction of convex optimization methods.
In particular, we detail how the convergence of the basic GD iterations can be understood from the contraction properties of a specific operator which is associated naturally with a differentiable objective function.Moreover, we work out in some detail how the basic GD method can be accelerated by modifying the operator underlying GD in a way that preserves its fixed-points but decreases the contraction factor which implies faster convergence by the contraction mapping theorem.
Outline.We discuss the basic problem of minimizing convex functions in Section II.We then derive GD, which is a particular first order method, as a fixed-point iteration in Section III.In Section IV, we introduce one of the most widely used computational models for convex optimization methods, i.e., the model of first order methods.In order to assess the efficiency of GD, which is a particular instance of a first order method, we present in Section V a lower bound on the number of iterations required by any first order method to reach a given sub-optimality.Using the insight provided from the fixed-point interpretation we show how to obtain an accelerated variant of GD in Section VI, which turns out to be optimal in terms of convergence rate.
Notation.The set of natural numbers is denoted N := {1, 2, . ..}.Given a vector x = (x 1 , . . ., x n ) T ∈ C n , we denote its lth entry by x l .The (hermitian) transpose and trace of a square matrix A ∈ C n×n are denoted (A H ) A T and tr{A}, respectively.The Euclidian norm of a vector x is denoted Mx .The spectral decomposition of a positive semidefinite (psd) matrix Q ∈ C n×n is Q = UΛU H with matrix U = u (1) , . . ., u (n) whose columns are the orthonormal eigenvectors u (i) ∈ C n of Q and the diagonal matrix Λ containing the eigenvalues λ 1 (Q) ≥ . . .≥ λ n (Q) ≥ 0. For a square matrix M, we denote its spectral radius as ρ(M) := max{|λ| : λ is an eigenvalue of M}.

II. CONVEX FUNCTIONS
holds for any x, y ∈ R n and α ∈ [0, 1] [7].For a differentiable function f (•) with gradient ∇f (x), a necessary and sufficient condition for convexity is [6, p. 70] which has to hold for any x, y ∈ R n .
Our main object of interest in this paper is the optimization problem x 0 ∈ arg min Given a convex function f (x), we aim at finding a point x 0 with lowest function value f (x 0 ), i.e., f (x 0 ) = min x f (x).
In order to motivate our interest in optimization problems like (1), consider a machine learning problem based on training data X := {z (i) } N i=1 consisting of N data points z (i) = (d (i) , y (i) ) with feature vector d (i) ∈ R n (which might represent the RGB pixel values of a webcam snapshot) and output or label y (i) ∈ R (which might represent the local temperature during the snapshot).We wish to predict the label y (i) by a linear combination of the features, i.e., ( The choice for the weight vector x ∈ R n is typically based on balancing the empirical risk incurred by the predictor (2), i.e., (1/N ) with some regularization term, e.g., measured by the squared norm x 2 .Thus, the learning problem amounts to solving the optimization problem The learning problem (3) is precisely of the form (1) with the convex objective function By choosing a large value for the regularization parameter λ, we de-emphasize the relevance of the training error and thus avoid overfitting.However, choosing λ too large induces a bias if the true underlying weight vector has a large norm [4], [11].A principled approach to find a suitable value of λ is cross validation [4], [11].
Differentiable Convex Functions.Any differentiable function f (•) is accompanied by its gradient operator ∇f : R n → R n , x → ∇f (x). (5) While the gradient operator ∇f is defined for any (even non-convex) differentiable function, the gradient operator of a convex function satisfies a strong structural property, i.e., it is a monotone operator [2].
Smooth and Strongly Convex Functions.If all second order partial derivatives of the function f (•) exist and are continuous, then f (•) is convex if and only if [6, p. 71] We will focus on a particular class of twice differentiable convex functions, i.e., those with Hessian ∇ 2 f (x) satisfying with some known constants U ≥ L > 0.
The set of convex functions f (•) : R n → R satisfying (6) will be denoted S L,U n .As it turns out, the difficulty of finding the minimum of some function f (•) ∈ S L,U n using gradient methods is essentially governed by the condition number κ := U/L of the function class S L,U n .
(7) Thus, regarding the difficulty of optimizing the functions f (•) ∈ S L,U n , the absolute values of the bounds L and U in (6) are not crucial, only their ratio κ = U/L is.
One particular sub-class of functions f (•) ∈ S L,U n , which is of paramount importance for the analysis of gradient methods, are quadratic functions of the form with some vector q ∈ R n and a psd matrix As can be verified easily, the gradient and Hessian of a quadratic function of the form (8) are obtained as ∇f (x) = Qx + q and ∇ 2 f (x) = Q, respectively.It turns out that most of the results (see below) on gradient methods for minimizing quadratic functions of the form (8), with some matrix apply (with minor modifications) also when expanding their scope from quadratic functions to the larger set S L,U n .This should not come as a surprise, since any function f (•) ∈ S L,U n can be approximated locally around a point x 0 by a quadratic function which is obtained by a truncated Taylor series [16].
The crucial difference between the quadratic function ( 8) and a general function appearing in the quadratic form in (9) typically varies with the point x.In particular, we can rewrite (9) as with Q = ∇ 2 f (x 0 ).The last summand in (10) quantifies the approximation error obtained when approximating a function f (•) ∈ S L,U n with the quadratic f (x) obtained from (8) with the choices Thus, we can ensure a arbitrarily small approximation error ε by considering f (•) only over a neighbourhood B(x 0 , r) := {x : x−x 0 ≤ r} with sufficiently small radius r > 0.
Let us now verify that learning a (regularized) linear regression model (cf.( 3)) amounts to minimizing a convex quadratic function of the form (8). Indeed, using some elementary linear algebraic manipulations, we can rewrite the objective function in (4) as a quadratic of the form (8) using the particular choices Q = Q LR and q = q LR with (12) The eigenvalues of the matrix Q LR obey [10] Hence, learning a regularized linear regression model via (3) amounts to minimizing a convex quadratic function , where λ denotes the regularization parameter used in (3).

III. GRADIENT DESCENT
Let us now show how one of the most basic methods for solving the problem (1), i.e., the GD method, can be obtained naturally as fixed-point iterations involving the gradient operator ∇f (cf.( 5)).
Our point of departure is the necessary and sufficient condition [6] ∇f for a vector x 0 ∈ R n to be optimal for the problem (1) with a convex differentiable objective function f (•) ∈ S L,U n .
Lemma 1.We have ∇f (x) = 0 if and only if the vector x ∈ R n is a fixed point of the operator for an arbitrary but fixed non-zero α ∈ R \ {0}.Thus, Proof.Consider a vector x such that ∇f (x) = 0.Then, Conversely, let x be a fixed point of T (α) , i.e., Then, = 0.
According to Lemma 1, the solution x 0 of the optimization problem (1) is obtained as the fixed point of the operator T (α) (cf.( 14)) with some non-zero α.As we will see shortly, the freedom in choosing different values for α can be exploited in order to compute the fixed points of T (α) more efficiently.
A straightforward approach to finding the fixed-points of an operator T (α) is via the fixed-point iteration By tailoring a fundamental result of analysis (cf.[16, Theorem 9.23]), we can characterize the convergence of the sequence x (k) obtained from ( 16).
Lemma 2. Assume that for some q ∈ [0, 1), we have for any x, y ∈ R n .Then, the operator T (α) has a unique fixed point x 0 and the iterates x (k) (cf.(16)) satisfy Proof.Let us first verify that the operator T (α) cannot have two different fixed points.Indeed, assume there would be two different fixed points x, y such that x, and y = T (α) y.
This would imply, in turn, However, since q < 1, this inequality can only be satisfied if x−y = 0, i.e., we must have x = y.Thus, we have shown that no two different fixed points can exist.The existence of one unique fixed point x 0 follows from [16,Theorem 9.23].
The estimate (18) can be obtained by induction and noting ≤ q x (k) − x 0 .
In order to apply Lemma 2 to ( 16), we have to ensure that the operator T (α) is a contraction, i.e., it satisfies (17) with some contraction coefficient q ∈ [0, 1).For the operator T (α) (cf.( 14)) associated with the function f (•) ∈ S L,U n this can be verified by standard results from vector analysis.
Lemma 4. Consider a convex function f (•) ∈ S L,U n with the unique minimizer x 0 , i.e., f (x 0 ) = min x f (x).We then construct the operator T (α) : x → x − α∇f (x) with a step size α such that Then, starting from an arbitrary initial guess x (0) , the iterates x (k) (cf.(16)) satisfy According to Lemma 4, and also illustrated in Figure 1, starting from an arbitrary initial guess x (0) , the sequence x (k) generated by the fixed-point iteration ( 16) is guaranteed to converge to the unique solution x 0 of (1), i.e., lim k→∞ x (k) = x 0 .What is more, this convergence is quite fast, since the error x (k) − x 0 decays at least exponentially according to (25).Loosely speaking, this exponential decrease implies that the number of additional iterations required to have on more correct digit in x (k) is constant.
Let us now work out the iterations (16) more explicitly by inserting the expression (14) for the operator T (α) .We then obtain the following equivalent representation of (16): This iteration is nothing but plain vanilla GD using a fixed step size α [11].
Since the GD iteration (26) is precisely the fixed-point iteration (16), we can use Lemma 4 to characterize the convergence (rate) of GD.In particular, convergence of GD is ensured by choosing the step size of GD (26) such that q(α) = max{|1−U α|, |1−Lα|} < 1.Moreover, in order to make the convergence as fast as possible we need to chose the step size α = α * which makes the contraction factor q(α) (cf.(20)) as small as possible.
In Figure 2, we illustrate how the quantifies |1 − αL| and |1 − αU | evolve as the step size α (cf.(26)) is varied.From Figure 2 we can easily read off the optimal choice yielding the smallest possible contraction factor x y x (1)   x (2)   x 0 Fig. 1.Fixed-point iterations for a contractive mapping T (α) with the unique fixed point x 0 .We have arrived at the following characterization of GD for minimizing convex functions f (•) ∈ S L,U n .
Theorem 5. Consider the optimization problem (1) with objective function f (•) ∈ S L,U n , where the parameters L and U are fixed and known.Starting from an arbitrarily chosen initial guess x (0) , we construct a sequence by GD (26) using the optimal step size (27).Then, In what follows, we will use the shorthand T := T (α * ) for the gradient operator T (α) (cf.( 14)) obtained for the optimal step size α = α * (cf.( 27)).

IV. FIRST ORDER METHODS
Without a computational model taking into account a finite amount of resources, the study of the computational complexity inherent to (1) becomes meaningless.Consider having unlimited computational resources at our disposal.Then, we could build an "optimization device" which maps each function f (•) ∈ S L,U n to its unique minimum x 0 .Obviously, this approach is infeasible since we cannot perfectly represent such a mapping, let alone its domain S L,U n , using a physical hardware which allows us only to handle finite sets instead of continuous spaces like S L,U n .Let us further illustrate the usefulness of using a computational model in the context of machine learning from massive data sets (big data).In particular, as we have seen in the previous section, the regularized linear regression model (3) amounts to minimizing a convex quadratic function (8) with the particular choices (12).Even for this most simple machine learning model, it is typically infeasible to have access to a complete description of the objective function (8).
Indeed, in order to fully specify the quadratic function in (8), we need to fully specify the matrix Q ∈ R n×n and the vector q ∈ R n .For the (regularized) linear regression model (3) this would require to compute Q LR (cf.( 12)) from the training data X = {z (i) } N i=1 .Computing the matrix Q LR in a naive way, i.e., without exploiting any additional structure, amounts to a number of arithmetic operations on the order of N • n 2 .This might be prohibitive in a typical big data application with N and n being on the order of billions and using distributed storage of the training data X [9].
There has emerged a widely accepted computational model for convex optimization which abstracts away the details of the computational (hard-and software) infrastructure.Within this computational model, an optimization method for solving (1) is not provided with a complete description of the objective function, but rather it can access the objective function only via an "oracle" [7], [13].
We might think of an oracle model as an application programming interface (API), which specifies the format of queries which can be issued by a convex optimization method executed on an application layer (cf. Figure 3).There are different types of oracle models but one of the most popular type (in particular for big data applications) is a first order oracle [13].Given a query point x ∈ R n , a first order oracle returns the gradient ∇f (x) of the objective function at this particular point.
A first order method (FOM) aims at solving (1) by sequentially querying a first order oracle, at the current iterate x (k) , to obtain the gradient ∇f (x (k) ) (cf. Figure 3).Using the current and past information obtained from the oracle, a FOM then constructs the new iterate x (k+1) such that eventually lim k→∞ x (k) = x 0 .For the sake of simplicity and without essential loss in generality, we will only consider FOMs whose iterates x (k) satisfy [13] x (k) ∈ span x (0) , ∇f (x (0) ), . . ., ∇f (x (k−1) ) . (29)

V. LOWER BOUNDS ON NUMBER OF ITERATIONS
According to Section III, solving (1) can be accomplished by the simple GD iterations (26).The particular choice α * (27) for the step size α in (26) ensures the convergence rate κ−1 κ+1 k with the condition number κ = U/L of the function class S L,U n .While this convergence is quite fast, i.e., the error decays exponentially with iteration number k, we would, of course, like to know how efficient this method is in general.
As detailed in Section IV, in order to study the computational complexity and efficiency of convex optimization methods, we have to define a computational model such as those underlying FOMs (cf. Figure 3).The next result provides a fundamental lower bound on the convergence rate of any FOM (cf.(29)) for solving (1).Theorem 6.Consider a particular FOM, which for a given convex function f (•) ∈ S L,U n generates iterates x (k) satisfying (29).For fixed L, U there is a sequence of functions

Proof. see Section VIII-A.
There is a considerable gap between the upper bound (28) on the error achieved by GD after k iterations and the lower bound (30) which applies to any FOM which is run for the same number iterations.In order to illustrate this gap, we have plotted in Figure 4 the upper and lower bound for the (quite moderate) condition number κ = 100.
Thus, there might exist a FOM which converges faster than the GD method (28) and comes more close to the lower bound (30).Indeed, in the next section, we will detail how to obtain an accelerated FOM by applying a fixed point preserving transformation to the operator T (cf.( 16)), which is underlying the GD method (26).This accelerated gradient method is known as the heavy balls (HB) method [15] and effectively achieves the lower bound (30), i.e., the HB method is already optimal among all FOM's for solving (1) with an objective function f (•) ∈ S L,U n .

VI. ACCELERATING GRADIENT DESCENT
Let us now show how to modify the basic GD method (26) in order to obtain an accelerated FOM, whose convergence rate essentially matches the lower bound (30) for the function class S L,U n with condition number κ = U/L > 1 (cf.(7)) and is therefore optimal among all FOMs.
Our derivation of this accelerated gradient method, which is inspired by the techniques used in [8], starts from an equivalent formulation of GD as the fixed-point iteration with the operator As can be verified easily, the fixed-point iteration (31) starting from an arbitrary initial guess x(0) = z (0) y (0) is related to the GD iterate x (k) (cf.(26)), using initial guess z (0) , as for all iterations k ≥ 1.
By the equivalence (33), Theorem 5 implies that for any initial guess x(0) the iterations (31) converge to the fixed point with x 0 being the unique minimizer of f (•) ∈ S L,U n .Moreover, the convergence rate of the fixed-point iterations (31) is precisely the same as those of the GD method, i.e., governed by the decay of κ−1 κ+1 k , which is obtained for the optimal step size α = α * (cf.( 27)).
We will now modify the operator T in (32) to obtain a new operator M : R 2n → R 2n which has the same fixed points (34) but improved contraction behaviour, i.e., the fixed point iteration will converge faster than those obtained from T in (31).In particular, this improved operator M is defined as with As can be verified easily, the fixed point x T 0 , x T 0 T of T is also a fixed point of M.
Before we analyze the convergence rate of the fixed-point iteration (35), let us work out explicitly the FOM which is represented by the fixed-point iteration (35).To this end, we partition the kth iterate, for k ≥ 1, as HB := 0. The iteration (39) defines the HB method [15] for solving the optimization problem (1).As can be verified easily, like the GD method, the HB method is a FOM.However, contrary to the GD iteration (26), the HB iteration (39) also involves the penultimate iterate x (k−2) HB for determining the new iterate x (k) HB .We will now characterize the converge rate of the HB method (39) via its fixed-point equivalent (35).To this end, we restrict ourselves to the subclass of S L,U n given by quadratic functions of the form (8). HB via iterating (26).Then, Proof.see Section VIII-B.
The upper bound (40) differs from the lower bound (30) by the factor k.However, the discrepancy is rather decent as this linear factor in (40) grows much slower than the exponential √ κ−1 √ κ+1 k in (40) decays.In Figure 6, we depict the upper bound (40) on the error of the HB iterations (39) along with the upper bound (28) on the error of the GD iterations (26) and the lower bound (30) on the error of any FOM after k iterations.
We highlight that, strictly speaking, the bound (40) only applies to a subclass of smooth strongly convex functions f (•) ∈ S L,U n , i.e., it applies only to quadratic functions of the form (8).However, as discussed in Section II, given a particular point x, we can approximate an arbitrary function f (•) ∈ S L,U n with a quadratic function f (x) of the form (8). The approximation error ε(x) (cf.(11)) will be small for all points x sufficiently close to x 0 .Making this reasoning more precise and using well-known results on fixed-point iterations with inexact updates [1], one can verify that the bound (40) essentially applies to any function f (•) ∈ S L,U n .

VII. CONCLUSIONS
We have presented a fixed-point theory of some basic gradient methods for minimizing convex functions.The approach via fixed-point theory allows for a rather elegant analysis of the convergence properties of these gradient methods.In particular, their convergence rate is obtained as the contraction factor for an operator associated with the objective function.
The fixed-point approach is also appealing since it leads rather naturally to the acceleration of gradient methods via fixed-point preserving transformations of the underlying operator.We plan to further develop the fixed-point theory of gradient methods in order to accommodate stochastic variants of GD such as SGD.Furthermore, we can bring the popular class of proximal methods into the picture by replacing the gradient operator underlying GD with the proximal operator.
However, by contrast to FOMs (such as the GD method), proximal methods use a different oracle model (cf. Figure 3).In particular, proximal methods require an oracle which can evaluate the proximal mapping efficiently which is typically more expensive than gradient evaluations.Nonetheless, the popularity of proximal methods is due to the fact that for objective functions arising in many important machine learning applications, the proximal mapping can be evaluated efficiently.

ACKNOWLEDGEMENT
This paper is a wrap-up of the lecture material created for the course Convex Optimization for Big Data over Networks, taught at Aalto University in spring 2017.The student feedback on the lectures has been a great help to develop the presentation of the contents.In particular, the detailed feedback of students Stefan Mojsilovic and Matthias Grezet on early versions of the paper is appreciated sincerely.

Fig. 4 .
Fig. 4. Upper bound (28) on convergence rate of GD and lower bound (30) on convergence rate for any FOM minimizing functions f (•) ∈ S L,U n with condition number κ = U/L = 100.

Fig. 5 .
Fig.5.Schematic illustration of the fixed-point iteration using operator T (32) (equivalent to GD) and for the modified operator M (36) (yielding HB method).

Theorem 7 .
Consider the optimization problem (1) with objective function f (•) ∈ S L,U n which is a quadratic (8).Starting from an arbitrarily chosen initial guess x (−1) HB and x (0) HB , we construct a sequence x (k)

Fig. 6 .
Fig. 6.Dependence on iteration number k of the upper bound (40) on error of HB (solid), upper bound (28) for error of GD (dashed) and lower bound (30) (dotted) for FOMs for the function class S L,U n with condition number κ = U/L = 100.