Dual Free Adaptive Minibatch SDCA for Empirical Risk Minimization

In this paper we develop an adaptive dual free Stochastic Dual Coordinate Ascent (adfSDCA) algorithm for regularized empirical risk minimization problems. This is motivated by the recent work on dual free SDCA of Shalev-Shwartz (2016). The novelty of our approach is that the coordinates to update at each iteration are selected non-uniformly from an adaptive probability distribution, and this extends the previously mentioned work which only allowed for a uniform selection of "dual" coordinates from a fixed probability distribution. We describe an efficient iterative procedure for generating the non-uniform samples, where the scheme selects the coordinate with the greatest potential to decrease the sub-optimality of the current iterate. We also propose a heuristic variant of adfSDCA that is more aggressive than the standard approach. Furthermore, in order to utilize multi-core machines we consider a mini-batch adfSDCA algorithm and develop complexity results that guarantee the algorithm's convergence. The work is concluded with several numerical experiments to demonstrate the practical benefits of the proposed approach.


INTRODUCTION
In this work we study the ℓ2-regularized Empirical Risk Minimization (ERM) problem, which is widely used in the field of machine learning. The problem can be stated as follows. Given training examples (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × R, loss functions φ_1, . . . , φ_n : R → R and a regularization parameter λ > 0, ℓ2-regularized ERM is an optimization problem of the form

min_{w ∈ R^d} P(w) := (1/n) Σ_{i=1}^n φ_i(x_i^T w) + (λ/2) ‖w‖²,  (P)

where the first term in the objective function is a data fitting term and the second is a regularization term that prevents over-fitting.
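To make the objective concrete, the following sketch evaluates the regularized ERM objective for the quadratic loss φ_i(a) = ½(a − y_i)²; the data and function names are illustrative choices, not taken from the paper.

```python
import numpy as np

def quadratic_loss(a, y):
    # phi_i(a) = 0.5 * (a - y_i)^2
    return 0.5 * (a - y) ** 2

def erm_objective(w, X, y, lam):
    """L2-regularized ERM: P(w) = (1/n) sum_i phi_i(x_i^T w) + (lam/2) ||w||^2."""
    margins = X @ w                      # x_i^T w for every example
    return quadratic_loss(margins, y).mean() + 0.5 * lam * w @ w

# tiny illustration: at w = 0 the loss term is 0.5 * mean(y^2) and the
# regularizer vanishes
X = np.array([[1.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, -1.0])
val = erm_objective(np.zeros(2), X, y, lam=0.1)  # = 0.5
```

The same helper works for any smooth loss by swapping `quadratic_loss` for another scalar loss function.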
One of the most popular methods for solving this problem is Stochastic Dual Coordinate Ascent (SDCA), which operates on the corresponding dual problem

max_{α ∈ R^n} D(α) := (1/n) Σ_{i=1}^n −φ_i*(−α_i) − (λ/2) ‖(1/(λn)) Σ_{i=1}^n α_i x_i‖²,  (D)

where φ_i* denotes the convex conjugate of φ_i. The algorithm proceeds as follows. At iteration t of SDCA a coordinate i ∈ {1, . . . , n} is chosen uniformly at random and the current iterate α^(t) is updated to α^(t+1) := α^(t) + δ* e_i, where δ* = arg max_{δ ∈ R} D(α^(t) + δ e_i). Much research has focused on analysing the theoretical complexity of SDCA under various assumptions imposed on the functions φ_i*, including the pioneering work of Nesterov (2012) and others, including Richtárik and Takáč (2014); Tappenden et al. (2017); Clipici (2013, 2016); Liu and Wright (2015); Takáč et al. (2013, 2015). A modification that has led to improvements in the practical performance of SDCA is the use of importance sampling when selecting the coordinate to update. That is, rather than using uniform probabilities, coordinate i is instead sampled with an arbitrary probability p_i; see for example Zhao and Zhang (2015).
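For intuition, the one-dimensional maximization defining δ* has a closed form in the classical ridge-regression setting φ_i(a) = ½(a − y_i)², under the standard convention w(α) = (1/(λn)) Σ_j α_j x_j. The sketch below is illustrative, not code from the paper; it checks the step by verifying stationarity of the one-dimensional concave subproblem.

```python
import numpy as np

def sdca_quadratic_step(i, alpha, w, X, y, lam):
    """Closed-form SDCA step for the quadratic loss: the increment delta
    that maximizes D(alpha + delta * e_i), assuming
    w = (1/(lam*n)) * sum_j alpha_j x_j."""
    n = X.shape[0]
    x_i = X[i]
    return (y[i] - x_i @ w - alpha[i]) / (1.0 + (x_i @ x_i) / (lam * n))

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
alpha = rng.standard_normal(n)
w = X.T @ alpha / (lam * n)              # primal-dual relation

i = 2
delta = sdca_quadratic_step(i, alpha, w, X, y, lam)
# stationarity of the 1-D subproblem at delta (derivative should vanish):
resid = y[i] - (alpha[i] + delta) - X[i] @ (w + delta * X[i] / (lam * n))
```

Other losses generally require a one-dimensional numerical maximization instead of a closed form.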
In many cases algorithms that employ non-uniform coordinate sampling outperform naïve uniform selection, and in some cases help to decrease the number of iterations needed to achieve a desired accuracy by several fold.
In addition, it is simple to observe that the function φ_i(x_i^T ·) : R^d → R is L_i-smooth, i.e., for all w, w̄ ∈ R^d and for all i ∈ [n] there exists a constant L_i ≤ ‖x_i‖² L̃ such that ‖∇φ_i(x_i^T w) − ∇φ_i(x_i^T w̄)‖ ≤ L_i ‖w − w̄‖. We will use the notation L̄ = max_i L_i. Throughout this work we let R_+ denote the set of nonnegative real numbers and we let R^n_+ denote the set of n-dimensional vectors with all components being real and nonnegative.

Contributions
In this section the main contributions of this paper are summarized (not in order of significance).
Adaptive SDCA. We modify the dual free SDCA algorithm proposed in Shalev-Shwartz (2015) to allow for the adaptive adjustment of probabilities and a non-uniform selection of coordinates. Note that the method is dual free, and hence in contrast to classical SDCA, where the update is defined by maximizing the dual objective (D), here we define the update slightly differently (see Section 2 for details).
Allowing non-uniform selection of coordinates from an adaptive probability distribution leads to improvements in practical performance, and the algorithm achieves a better complexity bound than in Shalev-Shwartz (2015). We show that the error after T iterations is decreased on average by a factor of ∏_{t=1}^T (1 − θ^(t)) ≤ (1 − θ*)^T, where θ* is a uniform lower bound on all θ^(t). Here 1 − θ^(t) ∈ (0, 1) is a parameter that depends on the current iterate α^(t) and the non-uniform probability distribution. By changing the coordinate selection strategy from uniform to adaptive, each 1 − θ^(t) becomes smaller, which leads to an improvement in the convergence rate.
Non-uniform sampling procedure. Rather than using a uniform sampling of coordinates, which is the commonly used approach, here we propose the use of non-uniform sampling from an adaptive probability distribution. With this novel sampling strategy, we are able to generate non-uniform, non-overlapping and proper (see Section 5) samplings for arbitrary marginal distributions under only one mild assumption. Indeed, we show that without this assumption there is no such non-uniform sampling strategy. We also extend our sampling strategy to allow the selection of mini-batches.
Better convergence and complexity results. By utilizing an adaptive probabilities strategy, we can derive complexity results for our new algorithm that, in the case when every loss function is convex, depend only on the average of the Lipschitz constants L_i. This improves upon the complexity theory developed in Shalev-Shwartz (2015) (which uses a uniform sampling) and in related work that uses an arbitrary but fixed probability distribution, because the results in those works depend on the maximum Lipschitz constant. Furthermore, even though adaptive probabilities are used here, we are still able to retain the very nice feature of the work in Shalev-Shwartz (2015) and show that the variance of the update naturally goes to zero as the iterates converge to the optimum, without any additional computational effort or storage costs. Our adaptive probabilities SDCA method also comes with an improved bound on the variance of the update in terms of the sub-optimality of the current iterate.
Practical aggressive variant. Following on from recent work, we propose an efficient heuristic variant of adfSDCA. For adfSDCA the adaptive probabilities must be computed at every iteration (i.e., once a single coordinate has been selected), which can be computationally expensive. However, for our heuristic adfSDCA variant the (exact/true) adaptive probabilities are only computed once at the beginning of each epoch (where an epoch is one pass over the data, i.e., n coordinate updates), and during that epoch, once a coordinate has been selected we simply reduce the probability associated with that coordinate so it is not selected again during that epoch. Intuitively this is reasonable because, after a coordinate has been updated, the dual residue associated with that coordinate decreases, and thus the probability of choosing this coordinate should also reduce. We show that in practice this heuristic adfSDCA variant converges, and the computational effort required by this algorithm is lower than for adfSDCA (see Sections 4 and 6).
Mini-batch variant. We extend the (serial) adfSDCA algorithm to incorporate a mini-batch scheme. The motivation for this approach is that there is a computational cost associated with generating the adaptive probabilities, so it is important to utilize them effectively. We develop a non-uniform mini-batch strategy that allows us to update multiple coordinates in one iteration, where the coordinates that are selected have high potential to decrease the sub-optimality of the current iterate. Further, we make use of the ESO (Expected Separable Overapproximation) framework (see for example Richtárik and Takáč (2012); Qu et al. (2015)) and present theoretical complexity results for mini-batch adfSDCA. In particular, for mini-batch adfSDCA used with batchsize b, we derive the optimal probabilities to use at each iteration, as well as the best step-size to use to guarantee speedup.

Outline
This paper is organized as follows. In Section 2 we introduce our new Adaptive Dual Free SDCA algorithm (adfSDCA), and highlight its connection with a reduced variance SGD method. In Section 3 we provide theoretical convergence guarantees for adfSDCA in the case when all loss functions φ_i(·) are convex, and also in the case when individual loss functions are allowed to be nonconvex but the average loss (1/n) Σ_{i=1}^n φ_i(·) is convex. Section 4 introduces a practical heuristic version of adfSDCA, and in Section 5 we present a mini-batch adfSDCA algorithm and provide convergence guarantees for that method. Finally, we present the results of our numerical experiments in Section 6. Note that the proofs for all the theoretical results developed in this work are left to the appendix.

THE ADAPTIVE DUAL FREE SDCA ALGORITHM
In this section we describe the Adaptive Dual Free SDCA (adfSDCA) algorithm, which is motivated by the dual free SDCA algorithm proposed by Shalev-Shwartz (2015). Note that in dual free SDCA two sequences of primal and dual iterates, {w^(t)}_{t≥0} and {α^(t)}_{t≥0} respectively, are maintained. At every iteration of that algorithm, the variable updates are computed in such a way that the well-known primal-dual relational mapping holds for every iteration t:

w^(t) = (1/(λn)) Σ_{i=1}^n α_i^(t) x_i.  (4)

The dual residue is defined as follows.
Definition 1 (Dual residue). The dual residue κ^(t) = (κ_1^(t), . . . , κ_n^(t))^T ∈ R^n associated with (w^(t), α^(t)) is defined via

κ_i^(t) := α_i^(t) + φ_i'(x_i^T w^(t)), for all i ∈ [n].  (5)

The Adaptive Dual Free SDCA algorithm is outlined in Algorithm 1 and is described briefly now; a more detailed description (including a discussion of coordinate selection and how to generate appropriate selection rules) will follow. An initial solution α^(0) is chosen, and then w^(0) is defined via (4). In each iteration of Algorithm 1 the dual residue κ^(t) is computed via (5), and this is used to generate a probability distribution p^(t). Next, a coordinate i ∈ [n] is selected (sampled) according to the generated probability distribution and a step is taken by updating the ith coordinate of α via

α_i^(t+1) = α_i^(t) − θ^(t) (n p_i^(t))^{−1} κ_i^(t).  (6)

Finally, the vector w is also updated,

w^(t+1) = w^(t) − θ^(t) (λ n² p_i^(t))^{−1} κ_i^(t) x_i,  (7)

and the process is repeated. Note that the updates to α and w using the formulas (6) and (7) ensure that the equality (4) is preserved.
Also note that the updates in (6) and (7) involve a step size parameter θ (t) , which will play an important role in our complexity results. The step size θ (t) should be large so that good progress can be made, but it must also be small enough to ensure that the algorithm is guaranteed to converge. Indeed, in Section 3.1 we will see that the choice of θ (t) depends on the choice of probabilities used at iteration t, which in turn depend upon a particular function that is related to the suboptimality at iteration t.
Algorithm 1 Adaptive Dual Free SDCA (adfSDCA)
1: Input: data {(x_i, y_i)}_{i=1}^n, regularization parameter λ > 0
2: Initialization: choose α^(0) ∈ R^n
3: Set w^(0) via (4)
4: for t = 0, 1, 2, . . . do
5: Compute dual residue κ^(t) via (5)
6: Generate adaptive probability distribution p^(t) ∼ κ^(t)
7: Sample coordinate i according to p^(t)
8: Set step-size θ^(t) ∈ (0, 1) as in (18)
9: Update: α_i^(t+1) via (6) and w^(t+1) via (7)
10: end for

The dual residue κ^(t) is informative and provides a useful way of monitoring the suboptimality of the current solution (w^(t), α^(t)). In particular, note that if κ_i = 0 for some coordinate i, then by (5) α_i = −φ_i'(x_i^T w), and substituting κ_i into (6) and (7) shows that α_i^(t+1) = α_i^(t) and w^(t+1) = w^(t), i.e., α and w remain unchanged in that iteration. On the other hand, a large value of |κ_i| (at some iteration t) indicates that a large step will be taken, which is anticipated to lead to good progress in terms of the improvement in sub-optimality of the current solution.
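The main loop can be sketched as follows for the quadratic loss. Probabilities proportional to |κ_i| are the simplest coherent choice (Definition 2 below); the 1/(n p_i) scalings and the fixed step-size θ used here are assumptions made for illustration, rather than the paper's exact rules (6)-(7) and (18).

```python
import numpy as np

def adfsdca_quadratic(X, y, lam, theta=0.05, epochs=40, seed=3):
    """Sketch of adfSDCA for quadratic loss phi_i(a) = 0.5*(a - y_i)^2.
    Assumed updates: alpha_i -= theta*kappa_i/(n*p_i), with the matching
    w update that keeps w = (1/(lam*n)) * sum_j alpha_j x_j exact."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    alpha = np.zeros(n)
    w = np.zeros(X.shape[1])              # consistent with alpha = 0
    for _ in range(epochs * n):
        # a real implementation would update kappa incrementally
        kappa = alpha + (X @ w - y)       # residue: alpha_i + phi_i'(x_i^T w)
        total = np.abs(kappa).sum()
        if total < 1e-12:
            break                          # already optimal
        p = np.abs(kappa) / total          # adaptive distribution p ~ kappa
        i = rng.choice(n, p=p)
        step = theta * kappa[i] / (n * p[i])
        alpha[i] -= step
        w -= step * X[i] / (lam * n)       # preserves the primal-dual relation
    return w, alpha

rng = np.random.default_rng(7)
n, d = 20, 5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w, alpha = adfsdca_quadratic(X, y, lam=1.0)
```

Because the w update mirrors the α update, relation (4) holds exactly throughout the run.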
The probability distributions used in Algorithm 1 adhere to the following definition.
Definition 2 (Coherence). A probability vector p ∈ R^n is coherent with the dual residue κ ∈ R^n if for any index i in the support set of κ, denoted by I_κ := {i ∈ [n] : κ_i ≠ 0}, we have p_i > 0. When i ∉ I_κ, then p_i = 0. We use p ∼ κ to represent this coherence relation.

Adaptive dual free SDCA as a reduced variance SGD method.
Reduced variance SGD methods have become very popular in the past few years; see for example Konečný and Richtárik (2017); Johnson and Zhang (2013); Roux et al. (2012); Defazio et al. (2014). It is shown in Shalev-Shwartz (2015) that uniform dual free SDCA is an instance of a reduced variance SGD algorithm (the variance of the stochastic gradient can be bounded by some measure of sub-optimality of the current iterate), and a similar result applies to adfSDCA in Algorithm 1. In particular, note that conditioned on w^(t), we have

∇P(w^(t)) = (1/n) Σ_{i=1}^n (φ_i'(x_i^T w^(t)) + α_i^(t)) x_i = (1/n) Σ_{i=1}^n κ_i^(t) x_i.  (8)

Combining (7) and (8) gives

E[w^(t+1) | w^(t)] = w^(t) − (θ^(t)/(λn)) ∇P(w^(t)),  (9)

which implies that (1/(n p_i)) κ_i^(t) x_i is an unbiased estimator of ∇P(w^(t)). Therefore, Algorithm 1 is effectively a variant of the Stochastic Gradient Descent method. However, we can prove (see later) that the variance of the update goes to zero as the iterates converge to an optimum, which is not true for vanilla Stochastic Gradient Descent.
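The unbiasedness claim is easy to verify numerically, since the probabilities p_i cancel inside the expectation; the quadratic loss and the particular data below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 8, 3, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
alpha = rng.standard_normal(n)
w = X.T @ alpha / (lam * n)                 # primal-dual relation

kappa = alpha + (X @ w - y)                 # dual residue, quadratic loss
p = np.abs(kappa) / np.abs(kappa).sum()     # any coherent distribution works

# E_i [ (1/(n p_i)) kappa_i x_i ] = sum_i p_i * kappa_i x_i / (n p_i)
estimator_mean = sum(p[i] * kappa[i] * X[i] / (n * p[i]) for i in range(n))

grad_P = X.T @ (X @ w - y) / n + lam * w    # gradient of the primal objective
```

Note that the estimator's mean reduces to (1/n) Σ_i κ_i x_i regardless of which coherent p is used.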

CONVERGENCE ANALYSIS
In this section we state the main convergence results for adfSDCA (Algorithm 1). The analysis is broken into two cases. In the first case it is assumed that each of the loss functions φ_i is convex. In the second case this assumption is relaxed slightly and it is only assumed that the average of the φ_i's is convex, i.e., individual functions φ_i(·) for some (possibly several) i ∈ [n] are allowed to be nonconvex, as long as (1/n) Σ_{j=1}^n φ_j(·) is convex. The proofs for all the results in this section can be found in the appendix.

Case I: All loss functions are convex
Here we assume that φ_i is convex for all i ∈ [n]. Define the parameter γ as in (10), where L̃ is given in (3). It will also be convenient to define the following potential function: for all iterations t ≥ 0, let D^(t) be given by (11), which combines the primal distance ‖w^(t) − w*‖² and the (pseudo) dual distance ‖α^(t) − α*‖², weighted using λ and γ respectively. The potential function (11) plays a central role in the convergence theory presented in this work. It measures the distance from the optimum in both the primal and (pseudo) dual variables. Thus, our algorithm will generate iterates that reduce this suboptimality and therefore push the potential function toward zero.

Also define

v_i := ‖x_i‖², for all i ∈ [n].  (12)
We have the following result.
LEMMA 1. Let L̃, κ^(t), γ, D^(t) and v_i be as defined in (3), (5), (10), (11) and (12), respectively. Suppose that φ_i is L̃-smooth and convex for all i ∈ [n] and let θ ∈ (0, 1). Then at every iteration t ≥ 0 of Algorithm 1, a probability distribution p^(t) that satisfies Definition 2 is generated, and the expected change in the potential function obeys the bound (13).

Note that if the right hand side of (13) is negative, then the potential function decreases (in expectation) in iteration t:

E[D^(t+1)] ≤ (1 − θ) D^(t).  (14)

The purpose of Algorithm 1 is to generate iterates (w^(t), α^(t)) such that the above holds. To guarantee negativity of the right hand term in (13), or equivalently, to ensure that (14) holds, consider the parameter θ. Specifically, any θ that is less than the value of a function Θ(·, ·) : R^n × R^n → R (defined in (15)) will ensure negativity of (13). Moreover, the larger the value of θ, the better progress Algorithm 1 will make in terms of the reduction in D^(t). The function Θ depends on the dual residue κ and the probability distribution p. Maximizing this function w.r.t. p will ensure that the largest possible value of θ can be used in Algorithm 1. Thus, we consider the following optimization problem:

max_{p ∈ R^n_+} Θ(κ, p) subject to Σ_{i=1}^n p_i = 1.  (16)

One may naturally be wary of the additional computational cost incurred by solving the optimization problem in (16) at every iteration. Fortunately, it turns out that there is an (inexpensive) closed form solution, as shown by the following lemma.
LEMMA 2. Let Θ(κ, p) be defined in (15). The optimal solution p*(κ) of (16) is given by (17), and the corresponding optimal step-size, obtained by using the optimal solution p*, is given by (18).

PROOF. This can be verified by deriving the KKT conditions of the optimization problem in (16). The details are deferred to the appendix for brevity.
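Although (17) is not reproduced above, the flavor of the KKT computation can be seen on the generic surrogate problem min_p Σ_i c_i/p_i subject to Σ_i p_i = 1, p ≥ 0, whose solution is p_i ∝ √c_i with optimal value (Σ_i √c_i)². The weights c_i below are hypothetical stand-ins for the residue-dependent terms appearing in Θ.

```python
import numpy as np

def optimal_p(c):
    """Minimizer of sum_i c_i / p_i over the probability simplex.
    KKT: c_i / p_i^2 is constant across i, hence p_i proportional to sqrt(c_i)."""
    root = np.sqrt(c)
    return root / root.sum()

rng = np.random.default_rng(2)
c = rng.uniform(0.1, 2.0, size=6)           # hypothetical positive weights
p_star = optimal_p(c)
objective = lambda p: (c / p).sum()

# optimal value is (sum_i sqrt(c_i))^2, and no random feasible point beats it
assert np.isclose(objective(p_star), np.sqrt(c).sum() ** 2)
for _ in range(200):
    q = rng.uniform(0.01, 1.0, size=6)
    q /= q.sum()
    assert objective(p_star) <= objective(q) + 1e-9
```

This is why an O(n) closed-form evaluation suffices in place of a generic solver at every iteration.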
The results in related work are weaker because they require a fixed sampling distribution p throughout all iterations. Here we allow adaptive sampling probabilities as in (17), which enables the algorithm to utilize the data information more effectively, and hence we obtain a better convergence rate. Furthermore, the optimal probabilities found in that work can only be applied to a quadratic loss function, whereas our results are more general because the optimal probabilities in (17) can be used whenever the loss functions are convex, or when individual loss functions are non-convex but the average of the loss functions is convex.
Before proceeding with the convergence theory we define several constants. Let C_0 be defined via (19), where γ is defined in (10). Note that C_0 in (19) is equivalent to the value of the potential function (11) at iteration t = 0, i.e., C_0 ≡ D^(0). Moreover, let Q be defined via (20). Now we have the following theorem.
THEOREM 1. Let L̃, κ^(t), γ, D^(t), v_i, C_0 and Q be as defined in (3), (5), (10), (11), (12), (19) and (20), respectively. Suppose that φ_i is L̃-smooth and convex for all i ∈ [n], let θ^(t) ∈ (0, 1) be given by (18) for all t ≥ 0 and let p* be defined via (17). Then, setting p^(t) = p* at every iteration t ≥ 0 of Algorithm 1 gives the bound (21), where θ* is given by (22).

Similar to Shalev-Shwartz (2015), we have the following corollary, which bounds the variance of the update in terms of the sub-optimality of the points α^(t) and w^(t) when the optimal probabilities are used.

COROLLARY 1. Let the conditions of Theorem 1 hold. Then at every iteration t ≥ 0 of Algorithm 1, the bound (23) holds.

Note that Theorem 1 establishes a linear rate of convergence, i.e., O(log(1/ε)) iterations suffice to reach sub-optimality ε. Furthermore, we achieve the same variance reduction rate as shown in Shalev-Shwartz (2015). For the dual free SDCA algorithm in Shalev-Shwartz (2015), where uniform sampling is adopted, the parameter θ should be set to at most λ/(λn + L̄), where L̄ ≥ max_i v_i L̃. However, from Corollary 1, we know that this θ is smaller than θ*, so dual free SDCA will have a slower convergence rate than our algorithm. In work that uses a fixed probability distribution p_i for sampling of coordinates, one must choose θ less than or equal to min_i p_i nλ/(L_i v_i + nλ). This is consistent with Shalev-Shwartz (2015), where p_i = 1/n for all i ∈ [n]. With respect to our adfSDCA Algorithm 1, at any iteration t we have that θ^(t) is greater than or equal to θ*, which again implies that our convergence results are better.

Case II: The average of the loss functions is convex
Here we follow the analysis in Shalev-Shwartz (2015) and consider the case where individual loss functions φ_i(·), i ∈ [n], are allowed to be nonconvex as long as the average (1/n) Σ_{j=1}^n φ_j(·) is convex. First we define several parameters that are analogous to the ones used in Section 3.1. Let γ̃ be defined via (24), where L_i is given in (2), and define the following potential function: for all iterations t ≥ 0, let D̃^(t) be given by (25), the analogue of (11) with γ̃ in place of γ. We also define the constants C̃_0 and M̃ via (26) and (27). Then we have the following theoretical results.
LEMMA 3. Let L_i, κ^(t), γ̃, D̃^(t) and v_i be as defined in (2), (5), (24), (25) and (12), respectively. Suppose that every φ_i, i ∈ [n], is L_i-smooth and that the average of the n loss functions, (1/n) Σ_{j=1}^n φ_j(·), is convex. Let θ ∈ (0, 1). Then at every iteration t ≥ 0 of Algorithm 1, a probability distribution p^(t) that satisfies Definition 2 is generated, and the analogue of the bound (13) holds.

THEOREM 2. Let L̃, κ^(t), γ̃, D̃^(t), v_i and C̃_0 be as defined in (3), (5), (24), (25), (12) and (26), respectively. Suppose that every φ_i, i ∈ [n], is L_i-smooth and that the average of the n loss functions is convex. Let θ^(t) ∈ (0, 1) be given by (18) for all t ≥ 0 and let p* be defined via (17). Then, setting p^(t) = p* at every iteration t ≥ 0 of Algorithm 1 gives a linear contraction of D̃^(t), where the rate is expressed in terms of the constant M̃ defined in (27).

We remark that L_i ≤ L̄ for all i ∈ [n], so γ̃ ≤ L̄², which means that a conservative complexity bound can be obtained by replacing γ̃ with L̄². We conclude this section with the following corollary.
COROLLARY 2. Let the conditions of Theorem 2 hold and let M̃ be defined in (27). Then at every iteration t ≥ 0 of Algorithm 1, the analogue of the variance bound of Corollary 1 holds, with M̃ in place of the corresponding Case I constant.

HEURISTIC ADFSDCA
One of the disadvantages of Algorithm 1 is that it is necessary to update the entire probability distribution p ∼ κ at each iteration, i.e., every time a single coordinate is updated the probability distribution is also updated. Note that if the data are sparse and coordinate i is sampled during iteration t, then one need only update the probabilities p_j for which x_j^T x_i ≠ 0; unfortunately, for some datasets this can still be expensive. In order to overcome this shortcoming, we follow recent work and present a heuristic algorithm that allows the probabilities to be updated less frequently and in a computationally inexpensive way. The process works as follows. At the beginning of each epoch the (full/exact) nonuniform probability distribution is computed, and this remains fixed for the next n coordinate updates, i.e., it is fixed for the rest of that epoch. During that same epoch, if coordinate i is sampled (and thus updated), the probability p_i associated with that coordinate is reduced (it is shrunk via p_i ← p_i/s). The intuition behind this procedure is that if coordinate i is updated, then the dual residue |κ_i| associated with that coordinate will decrease. Thus, there will be little benefit (in terms of reducing the sub-optimality of the current iterate) in sampling and updating that same coordinate i again. To avoid choosing coordinate i in the next iteration, we shrink the probability p_i associated with it, i.e., we reduce the probability by a factor of 1/s. Moreover, shrinking a single coordinate's probability is less computationally expensive than recomputing the full adaptive probability distribution from scratch, and so we anticipate a decrease in the overall running time if we use this heuristic strategy, compared with the standard adfSDCA algorithm. This procedure is stated formally in Algorithm 2. Note that Algorithm 2 does not fit the theory established in Section 3.
Nonetheless, we have observed convergence in practice and a good numerical performance when using this strategy (see the numerical experiments in Section 6).
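A minimal sketch of the probability-shrinking mechanics follows; the coordinate update itself is omitted, and the renormalization after each shrink is an implementation choice for illustration, not a detail taken from Algorithm 2.

```python
import numpy as np

def shrinking_epoch(kappa, s=10.0, seed=0):
    """One epoch of the heuristic scheme: exact probabilities p ~ |kappa|
    are computed once at epoch start, then after each draw the chosen
    coordinate's probability is divided by s so that it is unlikely to be
    drawn again within the same epoch."""
    rng = np.random.default_rng(seed)
    n = len(kappa)
    p = np.abs(kappa) / np.abs(kappa).sum()  # exact probabilities, epoch start
    drawn = []
    for _ in range(n):
        i = rng.choice(n, p=p)
        drawn.append(i)
        p[i] /= s                             # shrink the selected coordinate
        p /= p.sum()                          # renormalize (implementation choice)
    return drawn

drawn = shrinking_epoch(np.array([0.5, 0.4, 0.3, 0.2, 0.1]), s=10.0)
```

With a large shrinkage parameter s, the epoch behaves almost like sampling without replacement, which matches the intuition that a freshly updated coordinate has a reduced residue.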

MINI-BATCH ADFSDCA
In this section we propose a mini-batch variant of Algorithm 1. Before doing so, we stress that sampling a mini-batch non-uniformly is not an easy task. We first focus on the task of generating non-uniform random samples, and then we present our mini-batch algorithm.

Efficient single coordinate sampling
Before considering mini-batch sampling, we first show how to sample a single coordinate from a non-uniform distribution. Note that only discrete distributions are considered here.

Algorithm 2 Heuristic adfSDCA (adfSDCA+)
1: Input: data {(x_i, y_i)}_{i=1}^n, initial point α^(0), shrinkage parameter s
2: Set w^(0) via (4)
3: for t = 0, 1, 2, . . . do
4: Compute dual residue κ^(t) via (5)
5: if mod(t, n) == 0 then
6: Generate adaptive probability distribution p^(t) ∼ κ^(t)
7: else set p^(t) = p^(t−1)
8: end if
9: Select coordinate i from [n] according to p^(t)
10: Set step-size θ^(t) ∈ (0, 1) as in (18)
11: Update α_i^(t+1) via (6) and w^(t+1) via (7); shrink p_i^(t) ← p_i^(t)/s
12: end for

There are multiple approaches that can be taken in this case. One naïve approach is to consider the Cumulative Distribution Function (CDF) of p: a CDF can be computed in O(n) time, and it also takes O(n) time to make a decision. One can also use a better data structure (e.g. a binary search tree) to reduce the decision cost to O(log n), although the cost to set up the tree is O(n log n). Some more advanced approaches, like the so-called alias method of Kronmal and Peterson Jr (1979), can be used to sample a single coordinate in only O(1), i.e., sampling a single coordinate can be done in constant time, but with a cost of O(n) setup time. The alias method is based on the fact that any n-valued discrete distribution can be written as an equiprobable mixture of n two-valued distributions.
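A compact version of the alias construction (Vose's variant) can be sketched as follows; after O(n) setup, each draw costs O(1). This is an illustrative implementation, not code from the paper.

```python
import numpy as np

def build_alias(p):
    """Decompose distribution p into n equiprobable two-outcome tables."""
    n = len(p)
    prob = np.zeros(n)
    alias = np.zeros(n, dtype=int)
    scaled = np.asarray(p, dtype=float) * n
    small = [i for i in range(n) if scaled[i] < 1.0]
    large = [i for i in range(n) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]           # l donates mass to fill table s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                     # leftovers are exactly full
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    i = rng.integers(len(prob))                 # pick a table uniformly: O(1)
    return i if rng.random() < prob[i] else alias[i]

prob, alias = build_alias([0.5, 0.3, 0.2])
# the tables reproduce the input distribution exactly:
n = len(prob)
recon = np.zeros(n)
for i in range(n):
    recon[i] += prob[i] / n
    recon[alias[i]] += (1.0 - prob[i]) / n
```

The exactness check above is deterministic: summing each table's two outcomes, weighted by 1/n, recovers the original probabilities.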
In this paper we choose two sampling update strategies, one for each of Algorithms 1 and 2. For adfSDCA (Algorithm 1) the probability distribution must be recalculated at every iteration, so we use the alias method, which is highly efficient in this regime. The heuristic approach in Algorithm 2 only alters the probability of a single coordinate (p_i ← p_i/s) in each iteration. In this second case it is relatively expensive to use the alias method, due to the linear time cost of rebuilding the alias structure, so instead we build a binary tree when the algorithm is initialized, which reduces the per-update complexity to O(log n).
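The binary-tree sampler can be sketched with a Fenwick (binary indexed) tree over the unnormalized weights: updating one weight and drawing one sample both cost O(log n). The class below is an illustrative implementation, not the paper's code.

```python
class FenwickSampler:
    """Sampling from unnormalized nonnegative weights with O(log n)
    per-coordinate updates (e.g. the shrink w_i <- w_i / s) and
    O(log n) draws."""
    def __init__(self, weights):
        self.n = len(weights)
        self.w = [float(x) for x in weights]
        self.tree = [0.0] * (self.n + 1)
        for i, wi in enumerate(self.w):
            self._add(i, wi)

    def _add(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)

    def update(self, i, new_weight):
        self._add(i, new_weight - self.w[i])
        self.w[i] = new_weight

    def total(self):
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, u):
        """Map u in [0, 1) to the coordinate whose prefix interval contains
        u * total(); u would come from a uniform RNG."""
        target = u * self.total()
        idx, bit = 0, 1 << self.n.bit_length()
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] <= target:
                target -= self.tree[nxt]
                idx = nxt
            bit >>= 1
        return idx

ts = FenwickSampler([1.0, 2.0, 3.0, 4.0])   # prefix intervals: [0,1),[1,3),[3,6),[6,10)
picked = ts.sample(0.35)                    # 0.35 * 10 = 3.5 lies in coordinate 2's interval
ts.update(0, 0.5)                           # an O(log n) weight change
```

The descent through the implicit tree replaces the O(n) linear scan over the CDF, which is exactly the trade-off the text describes.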

Nonuniform Mini-batch Sampling
Many randomized coordinate descent type algorithms utilize a sampling scheme that assigns a probability p_S to every subset S ∈ 2^[n]. In this section, we consider a particular type of sampling, called a mini-batch sampling, that is defined as follows.
Definition 3. A sampling Ŝ is called a mini-batch sampling with batchsize b, consistent with the given marginal distribution q := (q_1, . . . , q_n)^T, if the following conditions hold: (i) P(|Ŝ| = b) = 1; and (ii) P(i ∈ Ŝ) = q_i for all i ∈ [n]. Note that we study samplings Ŝ that are non-uniform, since we allow q_i to vary with i. The motivation for designing such samplings arises from the fact that we wish to make use of the optimal probabilities that were studied in Section 3.
We make several remarks about non-uniform mini-batch samplings below.
1. For a given probability distribution p, one can derive a corresponding mini-batch sampling only if p_i ≤ 1/b for all i ∈ [n]. This is obvious in the sense that q_i = b p_i = Σ_{S∈Ŝ: i∈S} P(S) ≤ Σ_{S∈Ŝ} P(S) = 1. 2. For a given probability distribution p and a batchsize b, the mini-batch sampling may not be unique and it may not be proper; see for example Richtárik and Takáč (2012). (A proper sampling is a sampling for which any subset of size b has a positive probability of being sampled.) In Algorithm 3 we describe an approach that can be used to generate a non-uniform mini-batch sampling of batchsize b from a given marginal distribution q. Without loss of generality, we assume that the q_i ∈ (0, 1), i ∈ [n], are sorted from largest to smallest.
Algorithm 3 Non-uniform mini-batch sampling
1: Input: Marginal distribution q ∈ R^n with q_i ∈ (0, 1) for all i ∈ [n] and batchsize b such that Σ_{i=1}^n q_i = b. Define q_{n+1} = 0
2: Output: A mini-batch sampling S (Definition 3)
3: Initialization: index vectors i, j ∈ N^n, and set k = 1
4: for k = 1, . . . , n do
5: Compute the index bounds i_k and j_k
6: Obtain r_k
7: Update q
8: Terminate if q = 0, and set m = k
9: end for
10: Select K ∈ [m] randomly with discrete distribution (r_1, . . . , r_m)
11: Choose b − i_K + 1 coordinates uniformly at random from positions i_K to j_K, and denote this set by W
12: S = {1, . . . , i_K − 1} ∪ W

We now state several facts about Algorithm 3.
1. Algorithm 3 will terminate in at most n iterations. This is because the update rules for q (which depend on r_k at each iteration) ensure that at least one q_i is reduced to equal some smaller q_j (i.e., either q_{i_{k+1}−1} = q_b or q_{j_{k+1}+1} = q_b), and since there are n coordinates in total, after at most n iterations it must hold that q_i = q_j for all i, j ∈ [n]. Note that if the algorithm begins with q_i = q_j for all i, j ∈ [n], which corresponds to a uniform marginal distribution, the algorithm will terminate in a single step.
2. For Algorithm 3 we must have Σ_{k=1}^m r_k = 1, where we assume that the algorithm terminates at iteration m ∈ [1, n], since overall we have Σ_{k=1}^m b r_k = Σ_{i=1}^n q_i = b. 3. Algorithm 3 always generates a proper sampling, because at termination the remaining marginal values are all equal and positive, so the final stage samples uniformly. Thus, any subset of size b has a positive probability of being sampled.
4. It can be shown that this algorithm works for arbitrary given marginal probabilities as long as q_i ∈ (0, 1) for all i ∈ [n].
Figure 1 is a sample illustration of Algorithm 3, where we have a marginal distribution over 4 coordinates given by (0.8, 0.6, 0.4, 0.2)^T and we set the batchsize to b = 2. The algorithm is run and finds r = (0.2, 0.4, 0.4)^T. Afterwards, with probability r_1 = 0.2 we sample the 2-coordinate batch (1, 2); with probability r_2 = 0.4 we sample a 2-coordinate batch that contains coordinate 1 for sure, with the other coordinate chosen from (2, 3) uniformly at random; and with probability r_3 = 0.4 we sample 2 coordinates from (1, 2, 3, 4) uniformly at random.
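The mixture in this example can be checked exactly by enumerating every batch it can produce; the helper below is only a verification of the worked example, not an implementation of Algorithm 3.

```python
import itertools
from fractions import Fraction

def example_batch_distribution():
    """Exact distribution over 2-coordinate batches for the mixture
    r = (0.2, 0.4, 0.4) described in the text."""
    dist = {}
    def add(batch, prob):
        key = frozenset(batch)
        dist[key] = dist.get(key, Fraction(0)) + prob
    add({1, 2}, Fraction(1, 5))                       # stage 1: batch (1, 2)
    for j in (2, 3):                                  # stage 2: {1} plus one of (2, 3)
        add({1, j}, Fraction(2, 5) * Fraction(1, 2))
    pairs = list(itertools.combinations((1, 2, 3, 4), 2))
    for pair in pairs:                                # stage 3: uniform over all pairs
        add(pair, Fraction(2, 5) * Fraction(1, len(pairs)))
    return dist

dist = example_batch_distribution()
marginals = [sum(p for S, p in dist.items() if i in S) for i in (1, 2, 3, 4)]
# marginals == [4/5, 3/5, 2/5, 1/5], matching q = (0.8, 0.6, 0.4, 0.2)
```

Exact rational arithmetic confirms that the mixture reproduces the prescribed marginals and that the batch probabilities sum to one.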
Note that here we only need to perform two kinds of operations. The first is to sample a single coordinate from a discrete distribution (see Section 5.1), and the second is to sample batches from a uniform distribution (see for example Richtárik and Takáč (2012)).

Mini-batch adfSDCA algorithm
Here we describe a new adfSDCA algorithm that uses a mini-batch scheme. The algorithm is called mini-batch adfSDCA and is presented below as Algorithm 4.

Algorithm 4 Mini-batch adfSDCA
1: Input: data {(x_i, y_i)}_{i=1}^n, batchsize b, initial point α^(0)
2: Set w^(0) via (4)
3: for t = 0, 1, 2, . . . do
4: Compute dual residue κ^(t) via (5)
5: Generate adaptive probability distribution p^(t) ∼ κ^(t)
6: Choose mini-batch S ⊂ [n] of size b according to probability distribution p^(t)
7: Set step-size θ^(t) ∈ (0, 1) as in (75)
8: for i ∈ S do
9: Update α_i^(t+1)
10: end for
11: Update w^(t+1) so that (4) is preserved
12: end for

Briefly, Algorithm 4 works as follows. At iteration t, adaptive probabilities are generated in the same way as for Algorithm 1. Then, instead of updating only one coordinate, a mini-batch S of size b ≥ 1 is chosen that is consistent with the adaptive probabilities. Next, the dual variables α_i^(t), i ∈ S, are updated, and finally the primal variable w is updated according to the primal-dual relation (4).
In the next section we will provide a convergence guarantee for Algorithm 4. As was discussed in Section 3, theoretical results are detailed under two different assumptions on the type of loss function: (i) all loss functions are convex; and (ii) individual loss functions may be non-convex but the average over all loss functions is convex.

Expected Separable Overapproximation
Here we make use of the Expected Separable Overapproximation (ESO) theory introduced in Richtárik and Takáč (2012) and further extended, for example, in Qu and Richtárik (2016). The ESO definition is stated below.
Definition 4 (Expected Separable Overapproximation, Qu and Richtárik (2016)). Let Ŝ be a sampling with marginal distribution q = (q_1, . . . , q_n)^T. Then we say that the function f admits a v-ESO with respect to the sampling Ŝ if for all x, h ∈ R^n there exist v_1, . . . , v_n > 0 such that

E[f(x + h_[Ŝ])] ≤ f(x) + Σ_{i=1}^n q_i (∇_i f(x) h_i + (v_i/2) h_i²),

where h_[Ŝ] denotes the vector that agrees with h on the coordinates in Ŝ and is zero elsewhere.

REMARK 1. Note that here we do not assume that Ŝ is a uniform sampling, i.e., we do not assume that q_i = q_j for all i, j ∈ [n].

The ESO inequality is useful in this work because the parameter v plays an important role when setting a suitable step-size θ in our algorithm. Consequently, it also influences our complexity result, which depends on the sampling Ŝ. For the proof of Theorem 4 (which will be stated in the next subsection), the following is useful. Let f(x) = (1/2)‖Ax‖², where A = (x_1, . . . , x_n). We say that f(x) admits a v-ESO if inequality (33) holds. To derive the parameter v we will make use of the following theorem.
THEOREM 3 (Qu and Richtárik (2016)). Let f satisfy the assumption f(x + h) ≤ f(x) + ⟨∇f(x), h⟩ + (1/2) h^T A^T A h for all x, h ∈ R^n. Then f admits a v-ESO whose parameters v_i depend on λ'(P(Ŝ)) and λ'(A^T A). Here P(Ŝ) is called a sampling matrix (see Richtárik and Takáč (2012)), whose element p_ij is defined to be p_ij = Σ_{S∈Ŝ: {i,j}⊆S} P(S). For any matrix M, λ'(M) denotes the maximal regularized eigenvalue of M.

We may now apply Theorem 3, because f(x) = (1/2)‖Ax‖² satisfies its assumption. Note that in our mini-batch setting we have P(|Ŝ| = b) = 1, so we obtain λ'(P(Ŝ)) ≤ b (Theorem 4.1 in Qu and Richtárik (2016)). The quantity λ'(A^T A) can be bounded in terms of |J_j|, the number of non-zero elements of x_j for each j. A conservative choice of v that satisfies (33) then follows from Theorem 3; this choice is given in (34).

Now we are ready to give our complexity result for mini-batch adfSDCA (Algorithm 4). Note that we use the same notation as that established in Section 3, and we also define Q via (35).

THEOREM 4. Let L̃, κ^(t), γ, D^(t), v_i, C_0 and Q be as defined in (3), (5), (10), (11), (34), (19) and (35), respectively. Suppose that φ_i is L̃-smooth and convex for all i ∈ [n]. Then, at every iteration t ≥ 0 of Algorithm 4 run with batchsize b, the potential function contracts in expectation at a rate of at least 1 − θ*, where θ* is the mini-batch analogue of (18). Moreover, it follows that E[D^(T)] ≤ ε whenever T ≥ (1/θ*) log(C_0/ε).

It is also possible to derive a complexity result in the case when the average of the n loss functions is convex. The theorem is stated now.

THEOREM 5. Let L_i, κ^(t), γ̃, D̃^(t), v_i, C̃_0 and Q be as defined above. Suppose that every φ_i, i ∈ [n], is L_i-smooth and that the average of the n loss functions is convex. Then the analogue of Theorem 4 holds, where θ* now involves the quantities v_i γ̃ + nλ² in its denominator.

These theorems show that in the worst case (by setting b = 1) this mini-batch scheme shares the same complexity performance as the serial adfSDCA approach (recall Section 2). However, when the batchsize b is larger, Algorithm 4 converges in fewer iterations. This behaviour will be confirmed computationally in the numerical results given in Section 6.

NUMERICAL EXPERIMENTS
Here we present numerical experiments to demonstrate the practical performance of the adfSDCA algorithm. Throughout these experiments we used two loss functions: the quadratic loss φ_i(w^T x_i) = (1/2)(w^T x_i − y_i)² and the logistic loss φ_i(w^T x_i) = log(1 + exp(−y_i w^T x_i)). The experiments were run using datasets from the standard LIBSVM library of test problems (see Chang and Lin (2011)).
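For reference, the two losses and the derivatives that enter the dual residue κ_i = α_i + φ_i'(x_i^T w) can be written as follows; this is a routine sketch with a finite-difference sanity check, not code from the experiments.

```python
import numpy as np

def quadratic_loss(a, y):
    return 0.5 * (a - y) ** 2

def quadratic_grad(a, y):        # phi'(a) = a - y
    return a - y

def logistic_loss(a, y):
    return np.log1p(np.exp(-y * a))

def logistic_grad(a, y):         # phi'(a) = -y / (1 + exp(y * a))
    return -y / (1.0 + np.exp(y * a))

# finite-difference check of the logistic derivative at a test point
a, y, h = 0.3, 1.0, 1e-6
fd = (logistic_loss(a + h, y) - logistic_loss(a - h, y)) / (2 * h)
```

Both derivatives are bounded, which is consistent with the smoothness assumptions used throughout the analysis.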

Comparison for a variety of dfSDCA approaches
In this section we compare the adfSDCA algorithm (Algorithm 1) with dfSDCA, the uniform variant of adfSDCA described in Shalev-Shwartz (2015), and with Prox-SDCA from Shalev-Shwartz and Zhang (2014). We also report results using Algorithm 2, a heuristic version of adfSDCA, run with several different shrinking parameters.

Figures 2 and 3 compare the evolution of the duality gap for the standard and heuristic variants of our adfSDCA algorithm against the two state-of-the-art algorithms dfSDCA and Prox-SDCA. On these problems both of our variants out-perform dfSDCA and Prox-SDCA, which is consistent with our convergence analysis (recall Section 3). Now consider the adfSDCA+ algorithm, which was tested using the parameter values s = 1, 10, 20. adfSDCA+ with s = 1 clearly shows the worst performance, which is reasonable because in this case the algorithm only updates the sampling probabilities after each epoch; it is still better than dfSDCA, since it utilizes the sub-optimality at the beginning of each epoch. On the other hand, there is no obvious difference between adfSDCA+ with s = 10 and s = 20, with both variants performing similarly. Overall, adfSDCA performs best in terms of the number of passes through the data. In practice, however, even though adfSDCA+ may need more passes through the data to obtain the same sub-optimality as adfSDCA, it requires less computational effort.

Figure 4 shows the estimated density function of the dual residues |κ^(t)| after 1, 2, 3, 4 and 5 epochs for both uniform dfSDCA and our adaptive adfSDCA. One observes that the adaptive scheme pushes the large residues towards zero much faster than uniform dfSDCA: for example, after 2 epochs almost all residues are below 0.03 for adfSDCA, whereas for uniform dfSDCA many residues are still larger than 0.06. This is evidence that, by using adaptive probabilities, we update coordinates with large dual residues more often and therefore reduce the sub-optimality more efficiently.
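To make the role of the dual residues concrete, here is a minimal sketch of an adfSDCA-style loop for the quadratic loss. It is a sketch under several assumptions of ours: the residue is taken as κ_i = α_i + φ_i′(x_i^T w); the sampling probabilities are a simple uniform/|κ| mixture rather than the paper's exact rule from Lemma 2; all residues are recomputed at every step (the actual algorithm maintains them incrementally); and the step size is deliberately conservative.

```python
import numpy as np

def adfsdca_sketch(X, y, lam, epochs=50, seed=0):
    """Adaptive dual-free SDCA sketch for phi_i(z) = 0.5*(z - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)               # invariant: w = (1/(lam*n)) * X^T alpha
    R2 = np.max(np.sum(X * X, axis=1))
    eta = 0.25 * lam / (lam + R2)  # conservative global step (our choice)
    for _ in range(epochs * n):
        kappa = alpha + (X @ w - y)        # dual residues; zero at optimum
        total = np.abs(kappa).sum()
        if total < 1e-12:
            break
        # Mix with uniform so every p_i >= 1/(2n) and steps stay bounded.
        p = 0.5 / n + 0.5 * np.abs(kappa) / total
        i = rng.choice(n, p=p)
        delta = -(eta / (n * p[i])) * kappa[i]   # importance-weighted step
        alpha[i] += delta
        w += delta * X[i] / (lam * n)            # preserves the invariant
    return w, alpha
```

The 1/(n p_i) weighting is what keeps the update unbiased under non-uniform sampling; with p_i bounded below, the effective step stays small enough for the recursion to contract.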

Mini-batch SDCA
Here we investigate the behaviour of the mini-batch adfSDCA algorithm (Algorithm 4). In particular, we compare its practical performance using mini-batch sizes b varying from 1 to 32; note that when b = 1, Algorithm 4 is equivalent to the adfSDCA algorithm (Algorithm 1). Figures 5 and 6 show that, across the different batch sizes, the mini-batch algorithm needs roughly the same number of passes through the data to achieve the same sub-optimality. In terms of computational time, however, the larger the batch size, the faster the convergence. Recall that the results in Section 5 show that the number of iterations needed by Algorithm 4 with batch size b is roughly 1/b times the number needed by adfSDCA. Here we compute the adaptive probabilities every b samples, which is why each batch size requires roughly the same number of passes through the data to reach the same sub-optimality.
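The mini-batch variant needs to draw b distinct coordinates with probability weighted toward large dual residues. The paper's Algorithm 3 is a careful iterative scheme for this; the sketch below is only a simple stand-in based on successive weighted draws, which does not reproduce Algorithm 3's exact inclusion probabilities.

```python
import numpy as np

def weighted_batch_without_replacement(weights, b, rng):
    """Draw b distinct indices, one at a time, each proportional to the
    remaining (renormalized) weights. Zero-weight indices are never drawn."""
    w = np.asarray(weights, dtype=float).copy()
    picked = []
    for _ in range(b):
        p = w / w.sum()
        i = rng.choice(len(w), p=p)
        picked.append(i)
        w[i] = 0.0            # remove i from the pool for later draws
    return picked
```

In the mini-batch loop, `weights` would be the current residue magnitudes |κ_i| (possibly mixed with a uniform floor), and the b selected coordinates are then all updated before the probabilities are recomputed.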

ACKNOWLEDGEMENT
We would like to thank Professor Alexander L. Stolyar for his insightful help with Algorithm 3. This material is based upon work supported by the U.S. National Science Foundation under award numbers NSF:CCF:1618717, NSF:CMMI:1663256 and NSF:CCF:1740796.

A.1 Preliminaries and Technical Results
Recall that w* denotes an optimum of (P), and define α*_i = −φ_i′(x_i^T w*). To simplify the proofs we introduce the following variables. At the optimum w*, it holds that 0 = ∇P(w*) = (1/n) Σ_{i=1}^n φ_i′(x_i^T w*) x_i + λw*, and therefore the dual residues vanish at the optimum, i.e., κ*_i = 0 for all i ∈ [n]. The following two lemmas will be useful when proving our main results.
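In display form, the optimality condition being used here (with P the primal objective of (P)) reads:

```latex
0 \;=\; \nabla P(w^*) \;=\; \frac{1}{n}\sum_{i=1}^{n}\phi_i'(x_i^T w^*)\,x_i \;+\; \lambda w^*
\quad\Longrightarrow\quad
w^* \;=\; \frac{1}{\lambda n}\sum_{i=1}^{n}\alpha_i^{*}\,x_i,
\qquad \alpha_i^{*} \;=\; -\phi_i'(x_i^T w^*),
```

so each residue κ*_i = α*_i + φ_i′(x_i^T w*) is identically zero at the optimum.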
LEMMA 4. Let A^(t) and B^(t) be as defined in (40), and let v_i = ‖x_i‖² for all i ∈ [n]. Then, conditioned on α^(t), the following hold for a given θ.

PROOF. Note that at iteration t only coordinate i (of α) is updated; taking expectation over i ∈ [n], conditioned on α^(t), then gives the first result.
LEMMA 5. Assume that each φ_i is L̃_i-smooth and convex. Then the following bound holds for every w.

PROOF. Let z, z* ∈ R, and define g_i as in (46). Because φ_i is L̃_i-smooth, so too is g_i, which implies the quadratic upper bound (47) for all z, ẑ ∈ R. By convexity of φ_i, g_i is nonnegative, i.e., g_i(z) ≥ 0 for all z. Hence, by non-negativity and smoothness, g_i is self-bounded (see Section 12.1.3 in Shalev-Shwartz and Ben-David (2014), or set z = ẑ − (1/L̃_i) g_i′(ẑ) in (47) and rearrange); this yields (48). Differentiating (46) with respect to z and combining the result with (48), used with z = x_i^T w and z* = x_i^T w*, gives (49). Multiplying (49) through by 1/(n L̃_i) and summing over i ∈ [n] proves the claim, where we have used the fact that E[∇P(w*)] = E[φ_i′(x_i^T w*) x_i + λw*] = 0. The first inequality follows because L̃ = max_i L̃_i.
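The self-bounding step invoked in this proof can be spelled out: smoothness of g_i gives a quadratic upper bound, and evaluating it at the particular point z = ẑ − g_i′(ẑ)/L̃_i, together with g_i ≥ 0, gives

```latex
0 \;\le\; g_i\!\Big(\hat z - \tfrac{1}{\tilde L_i} g_i'(\hat z)\Big)
\;\le\; g_i(\hat z) \;-\; \frac{1}{\tilde L_i}\big(g_i'(\hat z)\big)^{2}
      \;+\; \frac{\tilde L_i}{2}\cdot\frac{\big(g_i'(\hat z)\big)^{2}}{\tilde L_i^{2}}
\;=\; g_i(\hat z) \;-\; \frac{\big(g_i'(\hat z)\big)^{2}}{2\tilde L_i},
```

and rearranging yields the self-bounding inequality (g_i′(ẑ))² ≤ 2 L̃_i g_i(ẑ).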

A.2 Proof of Lemmas 1 and 3
PROOF OF LEMMA 1. In this case it is assumed that every loss function is convex, and we set γ as defined in (10). For convenience, define the quantities C_1 and C_2 given in (50) and (51). Recall that A^(t) and B^(t) are defined in (40), D^(t) in (11), and γ in (10). The bound (52) then follows, where the last inequality uses convexity of P(w), i.e., P(w^(t)) − P(w*) ≤ ∇P(w^(t))^T (w^(t) − w*).
PROOF OF LEMMA 3. For this result we assume that the average of the loss functions, (1/n) Σ_{i=1}^n φ_i(·), is convex. Note that one can define parameters C̃_1 and C̃_2 that are analogous to C_1 and C_2 in (50) and (51), but with γ replaced by γ̃. Then the same arguments as those used in (52) can be used to show (54). Next, note the bound that follows from Lipschitz continuity of φ′(·). Further, since the average of the losses is convex, P(w) is strongly convex, so (56) holds; and since w* is the minimizer, (57) holds as well. Adding (56) and (57) gives (59). Thus, from (54) and (59) we have that E[D^(t+1) | α^(t)] − D^(t) ≤ −θD^(t) + C̃_2, which is the desired result.

A.3 Proof of Lemma 2
PROOF. This is easily verified by deriving the KKT conditions of the optimization problem (16), where μ is the Lagrange multiplier.
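Problem (16) itself is not reproduced in this excerpt, so purely as an illustration of the KKT computation for a problem of this type, consider minimizing an importance-sampling variance bound over the simplex, with the a_i standing in for the residue-dependent quantities of (16):

```latex
\min_{p>0}\ \sum_{i=1}^{n}\frac{a_i^{2}}{p_i}
\quad\text{s.t.}\quad \sum_{i=1}^{n}p_i=1 ,
\qquad
\mathcal{L}(p,\mu)=\sum_{i}\frac{a_i^{2}}{p_i}+\mu\Big(\sum_{i}p_i-1\Big),
```

```latex
\frac{\partial\mathcal{L}}{\partial p_i}
= -\frac{a_i^{2}}{p_i^{2}}+\mu = 0
\;\Longrightarrow\; p_i=\frac{|a_i|}{\sqrt{\mu}}
\;\Longrightarrow\; p_i=\frac{|a_i|}{\sum_{j}|a_j|}.
```

The stationarity condition thus forces the optimal probabilities to be proportional to the magnitudes |a_i|, which is the same mechanism that makes the adaptive probabilities track the dual residues.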
PROOF OF THEOREM 1. The expression for T in (23) is obtained by multiplying through by e^{θ*T}/ε, taking natural logs, rearranging, and noting that 1/θ* = n + LQ/λ.
PROOF OF THEOREM 2. Here we assume that the average loss (1/n) Σ_{i=1}^n φ_i(·) is convex, but individual loss functions φ_i(·) may not be. The proof of this result is almost identical to the proof of Theorem 1, but with the parameters defined in Section 3.2. Similarly to (64), we must find T for which the expected sub-optimality drops below ε, where γ̃ = (1/n) Σ_{i=1}^n L̃_i² is defined in (24) and C̃_0 is defined in (26). The expression for T in (30) is obtained by multiplying through by e^{θ*T}/ε, taking natural logs, rearranging, and noting that 1/θ* = n + γ̃Q/λ².
Similar arguments to those made in the final stages of the proof of Theorem 5 can be used to show that if T is given by the expression in (39), then E[P(w^(T)) − P(w*)] ≤ ε.