Incremental and Parallel Machine Learning Algorithms With Automated Learning Rate Adjustments

The existing machine learning algorithms for minimizing a convex function over a closed convex set suffer from slow convergence because their learning rates must be determined before running them. This paper proposes two machine learning algorithms incorporating the line search method, which automatically and algorithmically finds appropriate learning rates at run-time. One algorithm is based on the incremental subgradient algorithm, which sequentially and cyclically uses each of the parts of the objective function; the other is based on the parallel subgradient algorithm, which uses the parts independently in parallel. These algorithms can be applied to constrained nonsmooth convex optimization problems appearing in tasks of learning support vector machines without precise adjustment of the learning rates. The proposed line search method can determine learning rates that satisfy weaker conditions than the ones used in the existing machine learning algorithms. This implies that the two algorithms are generalizations of the existing incremental and parallel subgradient algorithms for solving constrained nonsmooth convex optimization problems. We show that they generate sequences that converge to a solution of the constrained nonsmooth convex optimization problem under certain conditions. The main contribution of this paper is the provision of three kinds of experiments showing that the two algorithms can solve concrete experimental problems faster than the existing algorithms. First, we show that the proposed algorithms have performance advantages over the existing ones in solving a test problem. Second, we compare the proposed algorithms with a different algorithm, Pegasos, which is designed to train a support vector machine efficiently, in terms of prediction accuracy, value of the objective function, and computational time. Finally, we use one of our algorithms to train a multilayer neural network and discuss its applicability to deep learning.


Introduction
In this paper, we consider minimizing the sum of nonsmooth, convex functionals over a closed convex constraint set onto which the metric projection can be computed. The incremental subgradient method [12] and the parallel subgradient method [7,10] are subgradient-projection algorithms for solving this minimization problem. The simplest subgradient-projection method is as follows: choose µ_0 ∈ H and iterate

µ_{n+1} := P(µ_n − s_n g_n),   (1)

where P stands for the projection onto the constraint set, g_n is a subgradient of the sum of all objective functionals at µ_n, and {s_n} is a positive number sequence called the step-size [3, Section 8.2]. The step-size is generally expressed as s_n := c ŝ_n, where ŝ_n is a basic diminishing step-size and c is a regulation constant. To make the sequence {µ_n} defined by (1) converge faster, we have to empirically select a suitable regulation constant. One iteration of the incremental subgradient method [12] sequentially applies the iteration (1) to each of the functionals in order to minimize their sum. On the other hand, one iteration of the parallel subgradient method [7] applies the iteration (1) to each of the functionals independently and reduces the results to their barycenter. The sequences generated by the incremental subgradient method and the parallel subgradient method converge to a solution of the minimization problem [7, Theorem 3.6], [12, Proposition 2.4].
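As a point of reference, the following is a minimal sketch (our own illustration, not code from the paper) of the basic subgradient-projection iteration (1) in NumPy, assuming a ball constraint, a diminishing step-size s_n = c/(n+1), and a simple nonsmooth objective; all function names and the concrete objective are illustrative assumptions.

```python
import numpy as np

def project_ball(x, r=1.0):
    # Metric projection onto the closed ball {x : ||x|| <= r}.
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

def subgradient(x, A, b):
    # One subgradient of f(x) = sum_i |a_i^T x - b_i| (nonsmooth, convex).
    return A.T @ np.sign(A @ x - b)

def subgradient_projection(A, b, c=0.5, iters=200):
    x = np.zeros(A.shape[1])            # mu_0
    for n in range(iters):
        s = c / (n + 1)                 # diminishing step-size s_n = c * (1/(n+1))
        g = subgradient(x, A, b)        # g_n in the subdifferential at mu_n
        x = project_ball(x - s * g)     # mu_{n+1} := P(mu_n - s_n g_n)
    return x

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
print(subgradient_projection(A, b))
```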
In the unconstrained minimization problem for differentiable functionals, a line search is used to select a suitable step-size. In particular, the Wolfe conditions are step-size criteria for the line search. The Wolfe conditions require the step-size to satisfy a sufficient decrease condition and a curvature condition [13, Chapter 3]. The sufficient decrease condition accepts a step-size only if the resulting functional value lies below a linear function with a negative slope. This condition ensures that the algorithm updates the current point to a better one. However, it is not enough to ensure that the algorithm makes reasonable progress, because it is satisfied by all sufficiently small step-sizes. Therefore, a curvature condition is invoked to move the iterate far enough along the chosen direction.
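To make the sufficient decrease and curvature conditions concrete, the following minimal sketch (our own illustration, not from the paper) checks both Wolfe conditions for a differentiable toy objective; the constants c1 and c2 and the quadratic test function are illustrative assumptions.

```python
import numpy as np

def sufficient_decrease(f, x, d, g, step, c1=1e-4):
    # Armijo condition: the new value must lie below a line with negative slope.
    return f(x + step * d) <= f(x) + c1 * step * (g @ d)

def curvature(grad, x, d, g, step, c2=0.9):
    # Curvature condition: rules out steps that are too small by requiring the
    # directional derivative at the new point to have increased enough.
    return grad(x + step * d) @ d >= c2 * (g @ d)

# Toy differentiable objective f(x) = ||x||^2 / 2 with gradient x (an assumption).
f = lambda x: 0.5 * (x @ x)
grad = lambda x: x
x = np.array([2.0, -1.0])
g = grad(x)
d = -g                                  # steepest-descent direction
for step in (1.0, 0.5, 0.25, 0.125):    # try step-sizes from large to small
    if sufficient_decrease(f, x, d, g, step) and curvature(grad, x, d, g, step):
        print("accepted step:", step)
        break
```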
Our main idea is to use a line search together with the incremental and parallel subgradient methods to accelerate the existing methods. Reference [4] equips the method (1) with a line search that minimizes the objective functional. However, this method assumes that the objective functional is differentiable. In addition, it is designed for single-core computing and is not useful in parallel computing environments. Reference [9] gives a method with a line search for solving fixed-point problems, which cover the constrained minimization problem discussed in this paper. This method has a fast convergence property, but it decides only the coefficient of a convex combination and is not designed for multi-core computing. The results in [7,8,10,12] require a suitable regulation constant in order to converge efficiently. However, this constant depends on various factors, such as the number of objective functionals, the number of dimensions, the shapes of the objective functionals and the constraint set, and the selection of subgradients. Hence, selecting or adjusting this constant is very difficult. In contrast to previous reports, this paper proposes incremental and parallel subgradient methods with a line search that finds better step-sizes than the ones used in the existing methods. This is realized by turning the step-size into a step-range. We will show that these methods converge weakly to an optimizer of the problem when the step-range is diminishing. The convergence analysis in this paper is a generalization of the previously reported results in [7,12].
We also compare our algorithms with the existing algorithms [7,12] on a concrete optimization problem using a real computer. The results show that our algorithms converge faster. In particular, we implemented the parallel subgradient method and its extension on a multi-core computer. The experimental results show that parallel computing on a multi-core computer reduces the overhead of appending a line search to each iteration; the overall processing speed is as high as that of the original parallel subgradient method. Reference [5] pointed out that the computational complexity of evaluating practical objective functions can be extremely high. Indeed, acceleration methods driven by parallel computation have been developed as an antidote in recent years [5,6,11]. However, no parallel computing experiment had been tried in the previous research [8,10]. Here, we conducted experiments on a multi-core computer showing that our parallel method reduces the running time and the number of iterations needed to find an optimal solution compared with the existing ones. This paper is organized as follows. Section 2 gives the mathematical preliminaries and the formulation of the main problem. Section 3 presents our algorithms, together with the fundamental properties that are used to prove the main theorems. Section 4 presents their convergence analyses. Section 5 describes numerical comparisons of the proposed algorithms with the existing ones in [7,12]. Section 6 concludes this paper.

Mathematical Preliminaries
Let (H, ⟨·, ·⟩) be a real Hilbert space with the induced norm ‖x‖ := √⟨x, x⟩. We define R_+ := (0, ∞) and N := {1, 2, ...}. Let x_n → x denote that the sequence {x_n} converges to x, and let x_n ⇀ x denote that {x_n} converges weakly to x.
A subgradient g of a convex function f : H → R at a point x ∈ H is an element g ∈ H such that f(x) + ⟨y − x, g⟩ ≤ f(y) for all y ∈ H. The set of all subgradients of f at x is denoted by ∂f(x) [15], [16, Section 7.3]. The metric projection onto a nonempty, closed convex set C ⊂ H is denoted by P_C : H → H; it is defined by P_C(x) ∈ C and ‖x − P_C(x)‖ = inf_{y ∈ C} ‖x − y‖ [1, Section 4.2, Chapter 28]. P_C satisfies the nonexpansivity condition [16, Subchapter 5.2], i.e., ‖P_C x − P_C y‖ ≤ ‖x − y‖ for all x, y ∈ H.

The following propositions are used to prove our main theorems. Let {a_n}, {b_n} ⊂ [0, ∞). If there exists a ∈ R_+ such that a_n → a, then lim inf_{n→∞} a_n b_n = a lim inf_{n→∞} b_n. If a_{n+1} ≤ a_n + b_n for all n ∈ N and Σ_{n=1}^{∞} b_n < ∞, then lim_{n→∞} a_n exists. If {x_n} is a sequence in H with x_n ⇀ x and x ≠ y, then lim inf_{n→∞} ‖x_n − x‖ < lim inf_{n→∞} ‖x_n − y‖. A bounded sequence {x_n} in H is weakly convergent if and only if each of its weakly convergent subsequences has the same weak limit.
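As a quick, self-contained numerical illustration of the nonexpansivity of the metric projection (our own example; the box constraint is an illustrative assumption):

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    # Metric projection onto the box C = [lo, hi]^d is componentwise clipping.
    return np.clip(x, lo, hi)

rng = np.random.default_rng(1)
x, y = 3.0 * rng.standard_normal(4), 3.0 * rng.standard_normal(4)
lhs = np.linalg.norm(project_box(x) - project_box(y))
rhs = np.linalg.norm(x - y)
print(lhs <= rhs)   # nonexpansivity: ||P_C x - P_C y|| <= ||x - y|| holds
```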

Main Problem
Let f_i : H → R (i = 1, 2, ..., K) be convex, continuous functions and let C be a nonempty, closed convex subset of H. We will examine the following problem [7,12]:

minimize f(x) := Σ_{i=1}^{K} f_i(x) subject to x ∈ C.   (2)

The following assumptions are made throughout this paper.

Incremental Subgradient Method
This subsection presents the incremental subgradient method, Algorithm 1, for solving problem (2).

Algorithm 1 Incremental Subgradient Method
1: n := 1, x_1 ∈ H.
2: loop
3:   y_{n,0} := x_n.
4:   for i = 1, 2, ..., K do
5:     g_{n,i} ∈ ∂f_i(y_{n,i−1}).
6:     λ_{n,i} ∈ [λ_n, λ̄_n].   ⊳ By a line search algorithm
7:     y_{n,i} := P_C(y_{n,i−1} − λ_{n,i} g_{n,i}).
8:   end for
9:   x_{n+1} := y_{n,K}.
10: end loop

Let us compare Algorithm 1 with the existing one [12]. The difference is step 6 of Algorithm 1. The step-size λ_n of the existing method is decided before the algorithm runs, whereas Algorithm 1 only needs the step-range [λ_n, λ̄_n]; a step-size within this range can be determined automatically at run-time. Algorithm 1 coincides with the incremental subgradient method when λ_n := λ̄_n, in which case it chooses only one step-size λ_n from the step-range. This means that Algorithm 1 is a generalization of the method in [12]. This difference has three merits. First, we do not need to adjust the step-size in Algorithm 1 precisely. In our experiments, Algorithm 1 converged as fast as the best incremental subgradient method when the step-range roughly contained the best step-size for the existing method (this is shown in Section 5). The second merit is that the algorithm can be applied to difficult objective problems for which a suitable step-size is hard to choose with the existing method; it requires only the step-range, so we do not need to give it a pre-determined step-size. Finally, the step-size is appropriately selected by a line search in step 6 of the algorithm, which accelerates convergence. Subsection 3.3 provides some examples of the line search, and Section 5 examines the convergence properties.
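A minimal NumPy sketch of this incremental scheme with a step-range and a simple discrete argmin line search is given below; the hinge-loss components, the ball constraint, the diminishing range [lo/n, hi/n], and all function names are our illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def project_ball(x, r=10.0):
    # Metric projection onto the ball C = {x : ||x|| <= r}.
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def hinge(w, a, y):          # f_i(w) = max(0, 1 - y <a, w>), convex and nonsmooth
    return max(0.0, 1.0 - y * (a @ w))

def hinge_subgrad(w, a, y):  # one element of the subdifferential of f_i at w
    return -y * a if hinge(w, a, y) > 0 else np.zeros_like(w)

def incremental_subgradient(A, labels, lo, hi, iters=100, grid=5):
    # Cycle over the f_i; for each i, pick lambda_{n,i} from the step-range
    # [lo/n, hi/n] by a discrete argmin line search, then project.
    w = np.zeros(A.shape[1])
    for n in range(1, iters + 1):
        y = w.copy()                                         # y_{n,0} := x_n
        for a, lab in zip(A, labels):
            g = hinge_subgrad(y, a, lab)                     # g_{n,i}
            candidates = np.linspace(lo / n, hi / n, grid)
            trials = [project_ball(y - s * g) for s in candidates]
            y = min(trials, key=lambda z: hinge(z, a, lab))  # line-search pick
        w = y                                                # x_{n+1} := y_{n,K}
    return w

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 4))
labels = np.sign(A @ np.array([1.0, -2.0, 0.5, 0.0]))
print(incremental_subgradient(A, labels, lo=0.1, hi=1.0))
```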
Algorithm 1 satisfies the following properties.
Lemma 1 (Fundamental Property of Algorithm 1). Let {x_n} be a sequence generated by Algorithm 1. Then, for all y ∈ C and for all n ∈ N, the following inequality holds:

Proof. Fix y ∈ C and n ∈ N arbitrarily. From the nonexpansivity of P_C, the definition of subgradients, and Assumption 1, we have

where the second equation comes from ‖x − y‖² = ‖x‖² − 2⟨x, y⟩ + ‖y‖² (x, y ∈ H). Using the definition of subgradients and the Cauchy-Schwarz inequality, we have

Further, the nonexpansivity of P_C and the triangle inequality imply that, for all i = 2, 3, ..., K,

From the above inequality and the fact that ‖y_{n,0} − x_n‖ = ‖x_n − x_n‖ = 0, we find that

This completes the proof.

Parallel Subgradient Method
Algorithm 2 below is an extension of the parallel subgradient method [7].

Algorithm 2 Parallel Subgradient Method
1: n := 1, x_1 ∈ H.
2: loop
3:   for all i ∈ {1, 2, ..., K} do independently
4:     g_{n,i} ∈ ∂f_i(x_n).
5:     λ_{n,i} ∈ [λ_n, λ̄_n].   ⊳ By a line search algorithm
6:     y_{n,i} := P_C(x_n − λ_{n,i} g_{n,i}).
7:   end for
8:   x_{n+1} := (1/K) Σ_{i=1}^{K} y_{n,i}.
9: end loop

The difference between Algorithm 2 and the method in [7] is step 5 of Algorithm 2. The existing method uses a given step-size λ_n, while Algorithm 2 chooses a step-size λ_{n,i} from the step-range [λ_n, λ̄_n] at run-time.
The common feature of Algorithm 2 and the parallel subgradient method [7] is loop independence (step 3). This loop structure is not influenced by the computation order. Hence, the elements of this loop can be computed in parallel, and parallelization using multi-core processing reduces the time needed to compute this loop procedure. Generally, the main loop of Algorithm 2 is computationally heavier than that of the parallel subgradient method [7] because it appends the step-size selection (line search) procedure to the existing one. Hence, acceleration through parallelization alleviates the cost of the line search procedure (this is shown in Section 5).
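The following sketch (our own illustration, with the same hinge-loss and ball-constraint assumptions as above) shows the structure of the independent inner loop; threads are used here only to illustrate that the K updates share no state, since true speedups in Python generally require process- or native-level parallelism.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def project_ball(x, r=10.0):
    nrm = np.linalg.norm(x)
    return x if nrm <= r else (r / nrm) * x

def hinge(w, a, y):
    return max(0.0, 1.0 - y * (a @ w))

def hinge_subgrad(w, a, y):
    return -y * a if hinge(w, a, y) > 0 else np.zeros_like(w)

def one_update(args):
    # Inner step for index i: it reads only x_n, f_i, and the step-range,
    # so all K updates can be executed concurrently (loop independence).
    x, a, lab, candidates = args
    g = hinge_subgrad(x, a, lab)
    trials = [project_ball(x - s * g) for s in candidates]
    return min(trials, key=lambda z: hinge(z, a, lab))      # line-search pick

def parallel_subgradient(A, labels, lo, hi, iters=100, grid=5, workers=4):
    x = np.zeros(A.shape[1])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for n in range(1, iters + 1):
            candidates = np.linspace(lo / n, hi / n, grid)
            jobs = [(x, a, lab, candidates) for a, lab in zip(A, labels)]
            ys = list(pool.map(one_update, jobs))
            x = np.mean(ys, axis=0)      # x_{n+1} := (1/K) * sum_i y_{n,i}
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 4))
labels = np.sign(A @ np.array([1.0, -2.0, 0.5, 0.0]))
print(parallel_subgradient(A, labels, lo=0.1, hi=1.0))
```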
Next, we have the following lemma.
Lemma 2 (Fundamental Property of Algorithm 2). Let {x_n} be a sequence generated by Algorithm 2. Then, for all y ∈ C and for all n ∈ N, the following inequality holds:

Proof. Fix y ∈ C and n ∈ N arbitrarily. From the convexity of ‖·‖², the nonexpansivity of P_C, the definition of subgradients, and Assumption 1, we have

This completes the proof.

Line Search Algorithms
Step 6 of Algorithm 1 and step 5 of Algorithm 2 are implemented as line searches. These steps decide an efficient step-size λ_{n,i} in [λ_n, λ̄_n] by using y_{n,i−1} in Algorithm 1 (or x_n in Algorithm 2), g_{n,i}, f_i, and other accessible information on i. This is the principal idea of this paper. We can use any algorithm that satisfies this condition. The following are examples.
The simplest line search is the discrete argmin, shown as Algorithm 3 (Discrete Argmin Line Search Algorithm; it sets x_p := y_{n,i−1} for Algorithm 1 and x_p := x_n for Algorithm 2). First, we set the ratio candidates {L_1, L_2, ..., L_k} ⊂ [0, 1]. In each iteration, we compute the objective value at each candidate step-size determined by the ratios L_j (j = 1, 2, ..., k) within the step-range [λ_n, λ̄_n] and take the best step-size.
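The following is a minimal sketch of such a discrete argmin step-size choice (our own illustration; the interpolation λ_n + L_j(λ̄_n − λ_n) over the step-range and the concrete ratio candidates are assumptions, not the paper's exact listing).

```python
import numpy as np

def discrete_argmin_step(x_p, g, f_i, proj, lo, hi,
                         ratios=(0.0, 0.25, 0.5, 0.75, 1.0)):
    # Evaluate f_i at the trial point for each candidate step lo + L_j*(hi - lo),
    # L_j in [0, 1], and return the candidate with the smallest objective value.
    best_step, best_val = lo, np.inf
    for ratio in ratios:
        step = lo + ratio * (hi - lo)
        val = f_i(proj(x_p - step * g))
        if val < best_val:
            best_step, best_val = step, val
    return best_step

# Tiny usage example with f_i(x) = |x_1| + |x_2| and no constraint (proj = identity).
x_p = np.array([3.0, -2.0])
g = np.sign(x_p)                    # a subgradient of f_i at x_p
f_i = lambda x: np.abs(x).sum()
print(discrete_argmin_step(x_p, g, f_i, lambda z: z, lo=0.0, hi=4.0))   # -> 2.0
```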
Algorithm 4 (Uniform-Interval Armijo Line Search, with x_p defined as in Algorithm 3) is a line search based on the Wolfe conditions. It finds a step-size that satisfies the sufficient decrease condition on a grid with uniform interval d. Once such a step-size has been found, the algorithm stops, and the step-size it found is used in the caller algorithm. However, this algorithm may fail (step 8). To avoid such a failure, we made the caller algorithm use λ̄_n when the search fails; this is the largest step-size among the candidates, chosen to make an effective update of the solution.
The grid of Algorithm 4 can be changed into a logarithmic one. Algorithm 5 (Logarithmic-Interval Armijo Line Search) below uses such a logarithmic grid; it sets x_p := y_{n,i−1} for Algorithm 1 and x_p := x_n for Algorithm 2, tests the candidate step-sizes on a logarithmic grid, and stops with failure (step 8) if no candidate satisfies the condition. The results of the experiments described in Section 5 demonstrate the accelerating effect of this algorithm.
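A compact sketch of a logarithmic-interval Armijo search of this kind is shown below (our own illustration under stated assumptions: the parameters a, k, and c1, the slope surrogate -||g||² for the nonsmooth case, and the fallback choice on failure are assumptions, not the paper's exact listing).

```python
import numpy as np

def log_armijo_step(x_p, g, f_i, proj, lo, hi, a=8, k=5, c1=1e-4):
    # Test steps hi, hi/a, hi/a^2, ..., hi/a^k on a logarithmic grid and accept
    # the first one satisfying the sufficient decrease (Armijo) condition.
    fx = f_i(x_p)
    slope = -float(g @ g)          # surrogate for the directional derivative
    step = hi
    for _ in range(k + 1):
        if step < lo:
            break
        if f_i(proj(x_p - step * g)) <= fx + c1 * step * slope:
            return step            # success: sufficient decrease achieved
        step /= a
    return lo                      # failure: fall back to a bound of the range

# Usage with f_i(x) = |x_1| + |x_2| and no constraint (proj = identity).
x_p = np.array([3.0, -2.0])
g = np.sign(x_p)                   # a subgradient of f_i at x_p
print(log_armijo_step(x_p, g, lambda x: np.abs(x).sum(), lambda z: z,
                      lo=1e-3, hi=2.0))   # -> 2.0
```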

Convergence Analysis
Here, we first show that the limit inferiors of {f(x_n)} generated by Algorithms 1 and 2 are equal to the optimal value of f. Next, we show that {x_n} converges weakly to a solution of the main problem (2). The following assumption is used to show the convergence of Algorithms 1 and 2.

Assumption 3 (Step-Size Compositions).
The existing methods [7,8,10,12] require a suitable regulation constant in order to converge efficiently. However, this constant differs depending on the number of objective functionals, the number of dimensions, the shapes of the objective functionals and the constraint set, the selection of subgradients, and so on. The step-sizes of our proposals can be selected at run-time. This feature makes our methods much more flexible than the existing ones.
Lemma 3 (Evaluation of the Limit Inferior). For a sequence {x_n}, if there exists α ∈ R_+ such that, for all y ∈ C and for all n ∈ N,

then

in the main problem (2).

Proof. The property of the limit inferior and Proposition 1 ensure that

Further, from the positivity of f_i (i = 1, 2, ..., K), the fact that λ_n ≤ λ_{n,i}, and the assumption that lim_{n→∞}

This is a contradiction. Next, we assume min

From the definition of the limit inferior, there exists

From inequality (3), for all n ∈ N with n_0 ≤ n, we have

From Assumption 3, there exists n_1 ∈ N such that n_0 ≤ n_1 and, for all n ∈ N, if

for all n ∈ N. From Assumption 3, the right side diverges negatively, which is a contradiction. Overall, we have

Next, let us assume that lim_{n→∞}

Hence, from the positivity of f_i (i = 1, 2, ..., K), the fact that λ_n ≤ λ_{n,i}, and Proposition 1, we have

However, this is a contradiction. This completes the proof.
Theorem 1 (Main Theorem).The sequence {x n } generated by Algorithm 1 or 2 converges weakly to an optimal solution to the main problem (2).
Proof. Let ŷ ∈ argmin_{x∈C} f(x) and fix n ∈ N. From Lemmas 1 and 2, there exists α ∈ R_+ such that

From Assumption 3, the left side of the above inequality is bounded. Hence, {x_n} is bounded. From Proposition 2, for each ŷ ∈ argmin_{x∈C} f(x) there exists J ∈ R such that lim_{n→∞} ‖x_n − ŷ‖ = J. Moreover, from Lemma 3, there exists a subsequence {f(x_{n_i})} ⊂ {f(x_n)} such that lim_{i→∞} f(x_{n_i}) = f(ŷ). From Proposition 3, C is a weakly closed set. Therefore, there exist a subsequence {x_{n_{i_j}}} ⊂ {x_{n_i}} and a point u ∈ C such that x_{n_{i_j}} ⇀ u. Hence, from Proposition 4, we obtain

This implies that u ∈ argmin_{x∈C} f(x). Let {x_{n_{i_k}}} ⊂ {x_{n_i}} be another subsequence and assume that x_{n_{i_k}} ⇀ v ∈ argmin_{x∈C} f(x) and u ≠ v. From Proposition 5, we have

This is a contradiction. Accordingly, any weakly convergent subsequence of {x_{n_i}} converges weakly to u ∈ argmin_{x∈C} f(x). Therefore, from Proposition 6, x_{n_i} ⇀ u. Now let {x_{n_j}} ⊂ {x_n} be another subsequence and assume that x_{n_j} ⇀ w ≠ u. Then, from Proposition 5, we have

This is a contradiction. Therefore, any weakly convergent subsequence of {x_n} converges weakly to u ∈ argmin_{x∈C} f(x). Hence, by Proposition 6, x_n ⇀ u. This completes the proof.
Numerical Experiments

We used Algorithm 5 with a := 8 = 2³ and k := 5 to select λ_{n,i} ∈ [λ_n, λ̄_n] in Algorithms 1 and 2. We set the parameter a to a power of 2 to avoid computational errors and computed its powers using the bit-shift operation. We also set the parameter k such that 1/a^k is small enough.
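As a small illustration of the bit-shift computation mentioned above (our own example), the powers of a = 2³ can be formed exactly with integer shifts:

```python
# Powers of a = 8 = 2**3 computed with bit shifts: a**j == 1 << (3 * j).
# Integer shifts are exact, avoiding floating-point error in the grid values.
A_EXP = 3                            # a = 2**A_EXP
for j in range(6):                   # j = 0, 1, ..., 5 (k = 5)
    print(1 << (A_EXP * j))          # 1, 8, 64, 512, 4096, 32768
```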
Figure 1 compares the behaviors of the incremental subgradient method [12] and Algorithm 1. The y-axes in Figures 1a and 1b represent the value of f(x). The x-axis in Figure 1a represents the number of iterations, and the x-axis in Figure 1b represents the elapsed time. The results show that Algorithm 1 converges faster than the incremental subgradient method does.
Figure 2 compares the behaviors of the parallel subgradient method [7] and Algorithm 2. The y-axes in Figures 2a and 2b represent the value of f(x). The x-axis in Figure 2a represents the number of iterations, and the x-axis in Figure 2b represents the elapsed time. The results show that Algorithm 2 converges faster than the parallel subgradient method does. Table 1 compares experimental results computed with and without multiple cores for Algorithm 2. The methods were evaluated with the same problem, random seeds, initial point, computer, and experimental environment. In other words, the experiment genuinely evaluated the effect of multi-core computing. The values of "Time" and "f(x)" in Table 1 are those at the 1000th iteration.
The results show that multi-core computing accelerated the algorithm and provided a 70% to 80% time reduction. Finally, let us discuss the overhead of the line search procedure in the incremental subgradient method (Figure 1). The advantage of the proposed algorithm appears smaller when evaluated by the elapsed time than when evaluated by the number of iterations. However, the overhead of the line search procedure is lessened in the parallel subgradient method (Figure 2 and Table 1). This means that multi-core computing reduces the effect of the overhead; i.e., the parallel subgradient method benefits from line search acceleration more than the incremental subgradient method does.

Conclusion
We proposed step-size run-time selectable extensions, i.e., line-searchable extensions, of the incremental subgradient method and the parallel subgradient method. We showed that the extended algorithms converge to an optimal solution of the problem of minimizing the sum of objective functionals over a constraint set. We also found that they converge faster than the existing algorithms. Regarding the parallel subgradient method in particular, the issue of

Algorithm 4: Uniform-Interval Armijo Line Search

Figure 1: Behavior of f(x_n) for the Incremental Subgradient Method [12] and Algorithm 1

Figure 2: Behavior of f(x_n) for the Parallel Subgradient Method [7] and Algorithm 2

Table 1: Effect of multi-core computing in 1000 iterations