ORIGINAL RESEARCH article

Front. Artif. Intell., 28 January 2026

Sec. Machine Learning and Artificial Intelligence

Volume 9 - 2026 | https://doi.org/10.3389/frai.2026.1699896

This article is part of the Research Topic "Ethical Artificial Intelligence: Methods and Applications".

EF-Feddr: communication-efficient federated learning with Douglas–Rachford splitting and error feedback


Jiao Xue1 and Chundong Wang1,2*
  • 1School of Computer Science and Engineering, Tianjin University of Technology, Tianjin, China
  • 2Tianjin Police Institute, Tianjin, China

Introduction: Federated learning (FL) is a distributed machine learning paradigm that preserves data privacy and mitigates data silos. Nevertheless, frequent communication between clients and the server often becomes a major bottleneck, restricting training efficiency and scalability.

Methods: To address this challenge, we propose a novel communication-efficient algorithm, EF-Feddr, for federated composite optimization, where the objective function includes a potentially non-smooth regularization term and local datasets are non-IID. Our method is built upon the relaxed Douglas–Rachford splitting method and incorporates error feedback (EF)—a widely adopted compression framework—to ensure convergence when biased compression (e.g., top-k sparsification) is applied.

Results: Under the partial client participation setting, our theoretical analysis demonstrates that EF-Feddr achieves a fast convergence rate of O(1/K) and a communication complexity of O(1/ε²). Comprehensive experiments conducted on the FEMNIST and Shakespeare benchmarks, as well as controlled synthetic data, consistently validate the efficacy of EF-Feddr across diverse scenarios.

Discussion: The results confirm that the integration of error feedback with the relaxed Douglas–Rachford splitting method in EF-Feddr effectively overcomes the convergence degradation typically caused by biased compression, thereby offering a practical and efficient solution for communication-constrained federated learning.

1 Introduction

Federated learning (FL) (Konecný et al., 2016; McMahan et al., 2017) is a distributed framework designed to address large-scale learning problems across networks of edge clients. In this paradigm, clients update models locally on their private data, while the server aggregates these updates to refine a shared global model. This collaborative process enables the development of global or personalized models without compromising user privacy (Ezequiel et al., 2022; Saifullah et al., 2024). Despite these advantages, communication between clients and the server remains a critical bottleneck, particularly when the number of participating clients is large, bandwidth is constrained, and the models involve high-dimensional parameters (Bhardwaj et al., 2023; Talwar et al., 2021). Recent efforts to improve the communication efficiency of FL have primarily focused on two directions: (i) reducing the number of communication rounds through partial client participation or increased local computation, and (ii) lowering the number of transmitted bits per round via techniques such as quantization and residual gradient compression. While these strategies effectively cut communication costs, they also introduce additional variance, which may widen the neighborhood around the optimal solution and, in some cases, prevent convergence under biased compression. To mitigate these issues, variance-reduction techniques such as error feedback (EF) are commonly employed. In contrast to traditional distributed training, it is unrealistic to assume that data on each local device are always independent and identically distributed (IID). Prior studies have consistently shown that FL accuracy degrades significantly when faced with non-IID or heterogeneous data (Islam et al., 2024). In this study, we focus on the following federated composite optimization (FCO) problem:

$$\min_{x\in\mathbb{R}^d} F(x) = f(x) + g(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(x), \qquad (1)$$

where n denotes the number of clients, fi is the local loss function for the i-th client, which is L-smooth and non-convex, and g represents the regularization term, which is proper, closed, convex (possibly non-smooth). As a practical example, consider a collaborative environmental monitoring project in which multiple research institutions aim to analyze sensor data from diverse geographical locations to detect climate change patterns. Due to privacy concerns and proprietary restrictions, however, raw data cannot be shared directly. In this case, enforcing sparse regularization becomes particularly important: although the dataset may contain relatively few observations (e.g., readings from a sparse sensor network Bhardwaj et al., 2022), each observation typically involves a high-dimensional set of features such as temperature, humidity, wind speed, and pollution levels, a combination of factors that further justifies the use of sparse regularization to identify salient features and prevent overfitting.
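To make the setting concrete, the following minimal sketch builds a toy instance of Equation 1 with least-squares client losses and an ℓ1 regularizer (both illustrative assumptions, not the datasets used later) and runs a centralized proximal-gradient reference; the federated algorithms discussed below replace the centralized gradient averaging with client-side computation and compressed communication.

```python
import numpy as np

def local_grad(w, X, y):
    """Gradient of an illustrative least-squares local loss f_i(w) = ||Xw - y||^2 / (2m)."""
    return X.T @ (X @ w - y) / len(y)

def prox_l1(v, t):
    """Proximal operator of g(w) = t * ||w||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Toy instance of Equation 1: F(w) = (1/n) sum_i f_i(w) + mu * ||w||_1
rng = np.random.default_rng(0)
n, m, d, mu = 4, 50, 20, 0.05             # clients, samples per client, features, l1 weight
clients = [(rng.normal(size=(m, d)), rng.normal(size=m)) for _ in range(n)]

w, step = np.zeros(d), 0.1
for _ in range(200):                      # centralized proximal-gradient reference, not yet federated
    grad = np.mean([local_grad(w, X, y) for X, y in clients], axis=0)
    w = prox_l1(w - step * grad, step * mu)
print("selected features:", int(np.count_nonzero(w)), "of", d)
```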

Operator splitting constitutes a broad class of methods for solving optimization problems of the form (Equation 1). These methods decompose numerically intractable components into simpler subproblems, thereby reducing computational complexity, enhancing efficiency, and enabling modular algorithms that are naturally suited for parallelization. Operator splitting has been successfully applied to a wide range of challenging optimization problems. Among these, the Douglas–Rachford splitting method is particularly well-established due to its enhanced iterative stability and accelerated convergence rate. Furthermore, its update rule decomposes the global composite objective into local proximal steps that can be executed in a fully parallel manner. This structure inherently aligns with the distributed nature of federated learning, facilitating efficient client-side computation while also underpinning the method's enhanced iterative stability. From this perspective, many state-of-the-art FL algorithms can be interpreted within the operator splitting framework (Malekmohammadi et al., 2021). Examples include FedAvg (McMahan et al., 2017), FedProx (Li et al., 2020), FedSplit (Pathak and Wainwright, 2020), and FedDR (Tran-Dinh et al., 2021).

However, for the FCO problem (Equation 1), existing FL methods such as FedAvg and its communication-efficient variants are primarily designed for the smooth, unconstrained setting min_{x∈ℝ^d} F(x) = (1/n)Σ_{i=1}^n f_i(x). In non-smooth FL settings, subgradient methods are widely used but suffer from slow convergence (Jhunjhunwala et al., 2022). Although proximal operators offer a more effective alternative with superior convergence properties (Liu et al., 2024), their seamless integration into communication-efficient FL frameworks remains limited. Moreover, while compression techniques effectively reduce communication overhead, they introduce additional variance that can enlarge the solution neighborhood and hinder convergence. Critically, existing communication-efficient methods have predominantly been designed for smooth FL problems, leaving a pronounced gap in addressing non-smooth federated composite optimization under compression-induced variance and communication constraints simultaneously.

To bridge this multifaceted gap, this study presents EF-Feddr, a communication-efficient FL algorithm that employs the Top-k sparsification technique to compress transmitted parameters and reduce communication bits, incorporates an error feedback (Li and Li, 2023) mechanism to mitigate the variance introduced by compression, and further integrates the relaxed Douglas–Rachford splitting method (He et al., 2021) along with a proximal operator to accelerate the iterative process while effectively handling the non-smoothness of the global regularization term. This integrated design enables EF-Feddr to be applicable to a wider range of scenarios and constrained settings. Leveraging the Douglas–Rachford envelope, we establish convergence guarantees for EF-Feddr in non-convex FL problems under mild assumptions.

Our contributions are summarized as follows:

• We propose EF-Feddr, an algorithm that combines the relaxed Douglas–Rachford splitting method with error feedback to reduce communication costs between clients and the server without sacrificing accuracy in non-IID settings. In addition, the error feedback mechanism enhances the stability of communication-compressed training in FL.

• We establish theoretical convergence guarantees for EF-Feddr based on the Douglas–Rachford envelope. Specifically, our method achieves a convergence rate of O(1/K) and a communication complexity of O(1/ε²) for non-convex loss functions under partial client participation.

• Through experiments on synthetic datasets, the FEMNIST dataset, and the Shakespeare dataset, we show that EF-Feddr improves accuracy by 3.29%–12.97% over state-of-the-art FL variants, while significantly reducing communication costs compared to uncompressed FedDR.

2 Related work

2.1 Operator splitting methods

Classical operator splitting methods such as Douglas–Rachford (DR), Forward-Backward (FB), and the Alternating Direction Method of Multipliers (ADMM) have recently been adopted in FL (Godavarthi et al., 2025; Goel et al., 2025). FedAvg (McMahan et al., 2017) can be viewed as an instance of k-step FB splitting, while FedProx (Li et al., 2020) extends the backward-backward splitting method. It is another FB variant tailored for regularized FL problems. FedSplit (Pathak and Wainwright, 2020), based on Peaceman-Rachford splitting, aims to identify the correct fixed point for strictly convex FL problems. Its communication-efficient variant, Eco-FedSplit (Khirirat et al., 2022), incorporates error-compensated compression. For the FCO problem, FedDR (Tran-Dinh et al., 2021) integrates a randomized block-coordinate strategy with DR splitting to solve non-convex formulations. FedADMM (Wang et al., 2022) leverages ADMM by applying FedDR to the dual form of the FCO problem, while FedTOP-ADMM (Kant et al., 2022) generalizes FedADMM as the first three-operator method used in FL.

2.2 Communication-efficient FL

To address the communication bottleneck in FL (Sun et al., 2024), two categories of compression methods have been widely explored: unbiased compressors (e.g., stochastic quantization; Alistarh et al., 2017) and biased compressors (e.g., top-k sparsification; Khirirat et al., 2018). FedPAQ (Reisizadeh et al., 2020) reduces communication costs through periodic averaging, partial client participation, and quantization. However, this reduction comes at the expense of convergence accuracy, which requires additional training iterations; the authors also analyzed the trade-off between communication overhead and convergence in their experiments. The z-SignFedAvg algorithm (Tang et al., 2024), a variant of FedAvg, employs stochastic sign-based compression and achieves accuracy comparable to uncompressed FedAvg while greatly reducing communication overhead. Building on the lazily aggregated gradient rule and error feedback, Zhou et al. (2023) proposed two communication-efficient algorithms for non-convex FL, EF-LAG and BiEF-LAG, which adapt both uplink and downlink communications. Similarly, FedSQ (Long et al., 2024) introduces a hybrid approach combining sparsification and quantization to reduce communication costs while enhancing convergence.

2.3 Error feedback

In the realm of distributed optimization, it has been noted that employing biased compressors for direct updates may decelerate convergence, deteriorate generalization performance, or even induce divergence (Li and Li, 2023). To counteract these issues, error feedback techniques have been introduced, which reduce the compression error compared to direct compression. Seide et al. (2014) first proposed this method as a heuristic inspired by the idea of Sigma-Delta modulation. EF21 (Richtárik et al., 2021) removes strict assumptions such as bounded gradients and bounded dissimilarity and can handle arbitrary data heterogeneity among clients, but leads to worse computational complexity. EFSkip (Bao et al., 2025) also allows arbitrary data heterogeneity and enjoys a linear speedup, significantly improving upon previous results.

3 Compressed non-convex FL with error feedback

In this section, we present EF-Feddr, an algorithm that integrates error feedback into the relaxed Douglas–Rachford splitting framework to address the non-convex FCO problem. We begin with a brief introduction to the Douglas–Rachford splitting method, followed by an explanation of how error feedback is incorporated to improve communication efficiency. We then provide the detailed formulation of EF-Feddr and analyze its convergence properties. Main notations are listed in Table 1.

Table 1. Summary of main notations.

3.1 Problem formulation

The FCO Equation 1 is mathematically equivalent to the consensus optimization problem

$$\min_{x_1,\dots,x_n\in\mathbb{R}^d} F(x) = f(x) + g(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) + g(x) \quad \text{subject to} \quad x_1 = x_2 = \dots = x_n, \qquad (2)$$

where the consensus constraint set is E = {x = (x_1, …, x_n) | x_1 = x_2 = ⋯ = x_n}. Let l_E be the indicator function of E. With the indicator function, one can treat the constrained problem as unconstrained by moving the constraints into the objective function. Then Equation 1 is obviously equivalent to

$$\min_{x} \; \frac{1}{n}\sum_{i=1}^{n} f_i(x_i) + g(x) + l_E(x). \qquad (3)$$

The first-order optimality condition is given by 0 ∈ ∇f(x) + ∂g(x) + ∂l_E(x), where ∇f(x) = [∇f_1(x_1), ..., ∇f_n(x_n)]. A point x* is a stationary point of Equation 1 if 0 ∈ ∇f(x*) + ∂g(x*) + ∂l_E(x*). Additionally, operator splitting encompasses a broad range of techniques to effectively address Equation 3. A key advantage of operator splitting methods is their efficient per-iteration operations, which makes them particularly suitable for large-scale applications due to their lower computational costs (He et al., 2021); among them, the DR splitting method is particularly well-known. The iteration equations for the DR splitting method are given by

$$\begin{cases} y^{k+1} = y^{k} + x^{k} - z^{k}\\ z^{k+1} = \mathrm{prox}_{\gamma f}(y^{k+1})\\ x^{k+1} = \mathrm{prox}_{\gamma(g+l_E)}(2z^{k+1} - y^{k+1}). \end{cases} \qquad (4)$$

Given that the DR splitting method often demonstrates favorable and stable convergence behavior in practice, we base our approach on its relaxed variant to solve Equation 1. The detailed application is presented in Section 3.3.
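As a point of reference, the sketch below runs the unrelaxed Douglas–Rachford iteration of Equation 4 on a single-machine toy composite problem, assuming a quadratic smooth part f (so that prox_{γf} reduces to a linear solve) and an ℓ1 term for g; the consensus indicator l_E is omitted here since there is only one block. All problem data are illustrative.

```python
import numpy as np

def prox_quadratic(v, A, b, gamma):
    """prox_{gamma f}(v) for f(w) = 0.5 w'Aw - b'w: solve (I + gamma A) z = v + gamma b."""
    return np.linalg.solve(np.eye(len(v)) + gamma * A, v + gamma * b)

def prox_l1(v, t):
    """prox_{t g}(v) for g = ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(1)
d, gamma, mu = 20, 0.5, 0.05
M = rng.normal(size=(60, d)); A = M.T @ M / 60; b = rng.normal(size=d)

# Douglas-Rachford iteration of Equation 4, with the reflection 2z - y in the last step.
y, z, x = np.zeros(d), np.zeros(d), np.zeros(d)
for _ in range(300):
    y = y + x - z                        # y^{k+1} = y^k + x^k - z^k
    z = prox_quadratic(y, A, b, gamma)   # z^{k+1} = prox_{gamma f}(y^{k+1})
    x = prox_l1(2 * z - y, gamma * mu)   # x^{k+1} = prox_{gamma g}(2 z^{k+1} - y^{k+1})
print("||x - z|| at convergence:", np.linalg.norm(x - z))
```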

For convenience, we introduce the definitions of the key concepts that will be utilized. For a function f, the proximal operator at point x with a step size γ>0 is

$$\mathrm{prox}_{\gamma f}(x) = \arg\min_{y}\Big\{ f(y) + \frac{1}{2\gamma}\|y - x\|^2 \Big\},$$

the Moreau envelope of f with a step size γ>0 is

$$M_{\gamma f}(x) = \min_{y}\Big\{ f(y) + \frac{1}{2\gamma}\|y - x\|^2 \Big\},$$

the gradient mapping of f at point x with a step size γ>0 is

$$G_{\gamma f}(x) = \frac{1}{\gamma}\big(x - \mathrm{prox}_{\gamma f}(x)\big).$$

We observe that ∇Mγf(x) = Gγf(x) (Liu et al., 2019). Moreover, the proximal operator update zk=proxγf(yk) can be written as

$$z^{k} = y^{k} - \gamma\, G_{\gamma f}(y^{k}).$$

This representation reveals that the proximal operator update is analogous to taking a gradient step applied to the gradient mapping Gγf(yk) of f. For the composite function F(x) = f(x)+g(x), the corresponding gradient mapping is given by

$$G_{\gamma}(x) = \frac{1}{\gamma}\big(x - \mathrm{prox}_{\gamma g}(x - \gamma\nabla f(x))\big). \qquad (5)$$

In the context of general non-convex non-smooth problems, the gradient mapping G_γ(x) is commonly used to assess convergence (Liu et al., 2024). Specifically, the stationarity condition 0 ∈ ∇f(x*) + ∂g(x*) + ∂l_E(x*) for Equation 1 is equivalent to G_γ(x*) = 0.
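The next sketch illustrates the gradient mapping of Equation 5 on the same kind of toy composite objective (a quadratic f plus an ℓ1 term for g, both assumptions): driving ‖G_γ(x)‖ toward zero with proximal-gradient steps yields an approximate stationary point, which is exactly the quantity our analysis tracks.

```python
import numpy as np

def prox_l1(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def grad_f(x, A, b):
    """Gradient of the illustrative smooth part f(x) = 0.5 x'Ax - b'x."""
    return A @ x - b

def gradient_mapping(x, A, b, gamma, mu):
    """G_gamma(x) = (x - prox_{gamma g}(x - gamma grad f(x))) / gamma, as in Equation 5."""
    return (x - prox_l1(x - gamma * grad_f(x, A, b), gamma * mu)) / gamma

rng = np.random.default_rng(2)
d, gamma, mu = 10, 0.2, 0.1
M = rng.normal(size=(30, d)); A = M.T @ M / 30; b = rng.normal(size=d)

# Proximal-gradient iterations: x <- x - gamma * G_gamma(x) = prox_{gamma g}(x - gamma grad f(x)).
x = np.zeros(d)
for _ in range(500):
    x = x - gamma * gradient_mapping(x, A, b, gamma, mu)
print("||G_gamma(x)|| at the final iterate:", np.linalg.norm(gradient_mapping(x, A, b, gamma, mu)))
```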

3.2 Error feedback

We now define a general class of compressors that will be used throughout this study.

Definition 1. (Absolute compressor). A map C: ℝ^d → ℝ^d is an absolute compressor if there exists ν > 0 such that, for all x ∈ ℝ^d, 𝔼‖x − C(x)‖² ≤ ν².

Most popular compressors such as the sign compression (Bernstein et al., 2018), the Top-k sparsification (Khirirat et al., 2018) and the sparsification together with quantization (Alistarh et al., 2017) are in fact absolute compressors if the full-precision vector has a bounded norm (Khirirat et al., 2022; Sahu et al., 2021).
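For instance, Top-k sparsification satisfies Definition 1 whenever the input norm is bounded by some R, with ν² = (1 − k/d)R². The short check below verifies this bound empirically; the dimension, k, and the norm bound R are arbitrary illustrative choices.

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsification: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

# If ||x|| <= R, then ||x - top_k(x)||^2 <= (1 - k/d) R^2, so top-k is an absolute
# compressor with nu^2 = (1 - k/d) R^2 on that bounded set (illustrative check).
rng = np.random.default_rng(3)
d, k, R = 100, 10, 5.0
worst = 0.0
for _ in range(1000):
    x = rng.normal(size=d)
    x = R * x / np.linalg.norm(x)          # enforce ||x|| = R
    worst = max(worst, np.linalg.norm(x - top_k(x)) ** 2)
print(f"max ||x - C(x)||^2 = {worst:.3f}  <=  (1 - k/d) R^2 = {(1 - k / d) * R ** 2:.3f}")
```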

Error feedback (also known as error compensation) is a popular tool in FL to reduce compression error and improve convergence speed compared to direct compression (Valdeira et al., 2025). Its mechanism shares a fundamental principle with Sigma-Delta modulation in signal processing (Seide et al., 2014). Technically, when transmitting a sequence of vectors, the method incorporates an auxiliary vector that accumulates the compression error at each step. This accumulated error is then added to the current vector before it undergoes compression and transmission (Karimireddy et al., 2019). More specifically, based on the DR splitting method (Equation 4), the update steps of the direct compression scheme are as follows:

$$\begin{aligned} c^{k+1} &= C(2z^{k+1} - y^{k+1}), && \text{(direct compression)}\\ x^{k+1} &= \mathrm{prox}_{\gamma(g+l_E)}(c^{k+1}), && \text{(model update)} \end{aligned} \qquad (6)$$

the update steps with error feedback compression are as follows:

$$\begin{aligned} c^{k+1} &= C(2z^{k+1} - y^{k+1} + e^{k}), && \text{(error compensation)}\\ e^{k+1} &= 2z^{k+1} - y^{k+1} + e^{k} - c^{k+1}, && \text{(compute the error)}\\ x^{k+1} &= \mathrm{prox}_{\gamma(g+l_E)}(c^{k+1}). && \text{(model update)} \end{aligned} \qquad (7)$$

In direct compression, each vector 2z^{k+1} − y^{k+1} is individually compressed, and the receiver directly uses its compressed version C(2z^{k+1} − y^{k+1}) in place of the original. Conversely, error feedback compression employs a proxy vector c^{k+1} for 2z^{k+1} − y^{k+1} that integrates information from the prior steps 0, 1, …, k. This proxy is refined via an auxiliary vector e^{k+1}, which is iteratively updated and stored to accumulate the compression error at each step.
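The following sketch contrasts the two schemes of Equations 6, 7 on a synthetic stream of vectors standing in for 2z^{k+1} − y^{k+1} (an illustrative assumption): with error feedback, the information lost to compression stays bounded by the residual e^k, whereas under direct compression it accumulates.

```python
import numpy as np

def top_k(x, k):
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

rng = np.random.default_rng(4)
d, k, T = 200, 10, 100
stream = [rng.normal(size=d) for _ in range(T)]   # stands in for the vectors 2z^{k+1} - y^{k+1}

e = np.zeros(d)                                   # accumulated compression error (error feedback)
true_sum, direct_sum, ef_sum = np.zeros(d), np.zeros(d), np.zeros(d)
for v in stream:
    true_sum += v
    direct_sum += top_k(v, k)                     # Equation 6: compress each vector on its own
    c = top_k(v + e, k)                           # Equation 7: compress the error-compensated vector
    e = v + e - c                                 #             and accumulate what was dropped
    ef_sum += c

# The cumulative deviation of the error-feedback scheme equals ||e|| and stays bounded,
# while the deviation of direct compression keeps growing with the stream length.
print("direct compression:", np.linalg.norm(true_sum - direct_sum))
print("error feedback:    ", np.linalg.norm(true_sum - ef_sum))
```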

3.3 EF-Feddr algorithm

In this section, we present the EF-Feddr algorithm; its details are given in Algorithm 1. Specifically, applying the relaxed DR splitting method (He et al., 2021) to the consensus reformulation (Equation 3) of Equation 1 in a distributed setting yields the following iterative steps:

$$\begin{cases} y_i^{k+1} = y_i^{k} + \lambda\,(x^{k} - z_i^{k})\\ z_i^{k+1} = \mathrm{prox}_{\gamma f_i}(y_i^{k+1})\\ x_i^{k+1} = 2z_i^{k+1} - y_i^{k+1}\\ x^{k+1} = \mathrm{prox}_{n\gamma(g+l_E)}(x_i^{k+1}). \end{cases}$$

Algorithm 1. EF-Feddr.

By integrating the error feedback mechanism detailed in Section 3.2, we obtain the EF-Feddr iterative scheme:

$$\begin{cases} y_i^{k+1} = y_i^{k} + \lambda\,(x^{k} - z_i^{k})\\ z_i^{k+1} \approx \mathrm{prox}_{\gamma f_i}(y_i^{k+1})\\ x_i^{k+1} = C\big(2z_i^{k+1} - y_i^{k+1} + e_i^{k}\big)\\ e_i^{k+1} = 2z_i^{k+1} - y_i^{k+1} + e_i^{k} - x_i^{k+1}\\ x^{k+1} = \mathrm{prox}_{n\gamma(g+l_E)}(x_i^{k+1}), \end{cases} \qquad (8)$$

where λ ∈ (0, 2) (He et al., 2021) is the relaxation parameter. The variables y_i^{k+1}, z_i^{k+1}, x_i^{k+1} and e_i^{k+1} are updated locally on each client i. The key step involves compression and communication: instead of compressing 2z_i^{k+1} − y_i^{k+1} directly, each client compresses the error-compensated vector 2z_i^{k+1} − y_i^{k+1} + e_i^k. The resulting value x_i^{k+1} is then sent to the server. Furthermore, to compute the server aggregation x^{k+1}, we have the following conclusion.

Proposition 1. For every k ≥ 0, x^{k+1} = prox_{nγ(g+l_E)}(x_i^{k+1}) in Equation 8 is equal to prox_{γg}((1/n)Σ_{i∈S_k} x_i^{k+1}).

Proof. Let x̄ = (1/n)Σ_{i∈S_k} x_i^{k+1}. The result of prox_{nγ(g+l_E)}(x_i^{k+1}) must have all blocks equal to some vector z (Mishchenko et al., 2022), namely

$$\begin{aligned}
z &= \arg\min_{y}\Big\{ g(y) + \frac{1}{2n\gamma}\sum_{i=1}^{n}\|y - x_i^{k+1}\|^2 \Big\}\\
&= \arg\min_{y}\Big\{ g(y) + \frac{1}{2n\gamma}\sum_{i=1}^{n}\big(\|y-\bar{x}\|^2 + 2\langle y-\bar{x},\, \bar{x}-x_i^{k+1}\rangle + \|\bar{x}-x_i^{k+1}\|^2\big) \Big\}\\
&= \arg\min_{y}\Big\{ g(y) + \frac{1}{2n\gamma}\Big[\sum_{i=1}^{n}\|y-\bar{x}\|^2 + 2\langle y-\bar{x},\, n\bar{x}\rangle - 2\langle y-\bar{x},\, n\bar{x}\rangle\Big] \Big\}\\
&= \arg\min_{y}\Big\{ g(y) + \frac{1}{2\gamma}\|y-\bar{x}\|^2 \Big\} = \mathrm{prox}_{\gamma g}(\bar{x}) = \mathrm{prox}_{\gamma g}\Big(\frac{1}{n}\sum_{i\in S_k} x_i^{k+1}\Big).
\end{aligned}$$

Thus, we have the server aggregation

$$x^{k+1} = \mathrm{prox}_{n\gamma(g+l_E)}(x_i^{k+1}) = \mathrm{prox}_{\gamma g}\Big(\frac{1}{n}\sum_{i\in S_k} x_i^{k+1}\Big).$$

In Algorithm 1, during round k: (1) the clients receive the global model x^k from the server (line 5); (2) a subset of clients S_k is sampled following the sampling scheme described in Section 4; the i-th client performs a relaxation step, where λ is the relaxation parameter, computes the proximal local update to obtain the local model z_i^{k+1}, calculates the compressed local model update x_i^{k+1}, updates the local compression-error accumulator e_i^{k+1}, and sends the compressed x_i^{k+1} back to the server (lines 6–10); (3) the server receives the compressed x_i^{k+1} from the clients i ∈ S_k and performs a global model update using the averaged compressed local model updates (line 13). Notably, the relaxation strategy, akin to inertial extrapolation techniques (e.g., the heavy-ball method), has broadly accelerated iterative algorithms in convex and non-convex optimization, while the cost per iteration stays essentially unchanged (He et al., 2021). For any γ > 0, z_i^{k+1} serves as an approximation of prox_{γf_i}(y_i^{k+1}); the evaluation of prox_{γf_i} can be carried out using several established techniques, such as accelerated GD-type algorithms and local SGD (Parikh et al., 2014; Tran-Dinh et al., 2021). It is worth noting that this algorithm requires O(d) memory and incurs O(d) computational overhead per client per round.
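A minimal single-process simulation of the EF-Feddr round structure of Equation 8 and Proposition 1 is sketched below, assuming least-squares local losses, an ℓ1 regularizer, and an inexact local prox solved by a few gradient steps (all illustrative choices; the experiments in Section 5 use the models described there instead). Clients not sampled simply keep their last transmitted value.

```python
import numpy as np

def top_k(x, k):
    """Top-k sparsification compressor."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def prox_l1(v, t):
    """prox of t*||.||_1 (soft-thresholding), playing the role of prox_{gamma g}."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def approx_prox_f(y, X, t, gamma, steps=20, lr=0.05):
    """Inexact prox_{gamma f_i}(y): a few gradient steps on f_i(z) + ||z - y||^2 / (2*gamma)."""
    z = y.copy()
    for _ in range(steps):
        grad = X.T @ (X @ z - t) / len(t) + (z - y) / gamma
        z -= lr * grad
    return z

rng = np.random.default_rng(5)
n, m, d = 10, 40, 30                     # clients, samples per client, dimension
gamma, lam, mu, kk = 0.5, 0.9, 0.05, 5   # step size, relaxation, l1 weight, top-k budget
data = [(rng.normal(size=(m, d)), rng.normal(size=m)) for _ in range(n)]

x = np.zeros(d)
y = [np.zeros(d) for _ in range(n)]      # local y_i
z = [np.zeros(d) for _ in range(n)]      # local z_i
e = [np.zeros(d) for _ in range(n)]      # local error accumulators e_i
xi = [np.zeros(d) for _ in range(n)]     # last transmitted x_i

for _ in range(50):
    S = rng.choice(n, size=n // 3, replace=False)        # partial participation
    for i in S:
        y[i] = y[i] + lam * (x - z[i])                   # relaxation step
        z[i] = approx_prox_f(y[i], *data[i], gamma)      # inexact local proximal step
        v = 2 * z[i] - y[i] + e[i]                       # error-compensated reflection
        xi[i] = top_k(v, kk)                             # compress and transmit
        e[i] = v - xi[i]                                 # keep what was not transmitted
    # Server: prox_{gamma g} of the averaged compressed updates (cf. Proposition 1);
    # clients not sampled contribute their last transmitted value.
    x = prox_l1(np.mean(xi, axis=0), gamma * mu)

print("nonzero coordinates in the global model:", int(np.count_nonzero(x)))
```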

4 Theoretical results

For analyzing the convergence of Algorithm 1, we consider several basic assumptions and auxiliary results. Our analysis builds on the analytical framework of Tran-Dinh et al. (2021). First, we introduce a proper sampling scheme following Tran-Dinh et al. (2021). Let p_1, …, p_n > 0 be such that, for all i ∈ [n], ℙ(i ∈ S̄) = p_i. Here, S̄ is a proper sampling scheme on [n], and each S_k is an i.i.d. realization of S̄. Note that p_i = Σ_{S ⊆ [n], i ∈ S} ℙ(S̄ = S). Define A_k = σ(S_0, …, S_k) as the σ-algebra generated by the sequence S_0, …, S_k. This sampling scheme ensures that each client has a significant probability of being updated.
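As a concrete example of a proper sampling scheme, including each client independently with probability p_i gives ℙ(i ∈ S̄) = p_i exactly; the probabilities used below are arbitrary illustrative values.

```python
import numpy as np

# One simple proper sampling scheme: include client i independently with probability p_i,
# so P(i in S) = p_i exactly (uniform m-out-of-n sampling, with p_i = m/n, is another example).
rng = np.random.default_rng(6)
n = 8
p = np.linspace(0.2, 0.9, n)              # per-client participation probabilities (illustrative)

counts = np.zeros(n)
rounds = 20000
for _ in range(rounds):
    S = np.flatnonzero(rng.random(n) < p)  # one realization S_k of the sampling scheme
    counts[S] += 1
print(np.round(counts / rounds, 3))        # empirical frequencies approach p
```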

Assumption 1. (L-smoothness). All local functions f_i(·) are L-smooth, that is,

$$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$

Assumption 2. (Boundedness from below). F(·) given in (1) is bounded from below, that is, F* = inf_{x∈ℝ^d} F(x) > −∞.

In non-convex FL optimization, Assumptions 1 and 2 are standard. Assumption 2 guarantees that Equation 1 is well-defined and is independent of the choice of algorithms. We first present three useful lemmas that will be instrumental in proving our main theorem.

Lemma 1. Let {(y_i^k, z_i^k, x_i^k, e_i^k, x^k)} be generated by Algorithm 1. Then, for all i ∈ S_k, λ > 0, β_1 > 0 and γ > 0, we have

$$\|x^{k} - z_i^{k}\|^2 \le \frac{2(\gamma^2L^2+1)}{\lambda^2}\Big[(1+\beta_1)\|z_i^{k+1}-z_i^{k}\|^2 + 2\Big(1+\frac{1}{\beta_1}\Big)\big(\|m_i^{k+1}\|^2 + \|m_i^{k}\|^2\big)\Big]. \qquad (9)$$

Proof.

For the relation z_i^{k+1} ≈ prox_{γf_i}(y_i^{k+1}), where the approximation error satisfies ‖z_i^{k+1} − prox_{γf_i}(y_i^{k+1})‖ ≤ ε_i^k for a given accuracy ε_i^k ≥ 0, we introduce auxiliary variables w_i^0 and w_i^{k+1} for i ∈ [n] to analyze the convergence of Algorithm 1,

$$w_i^{0} = \mathrm{prox}_{\gamma f_i}(y_i^{0}), \qquad w_i^{k+1} = \begin{cases}\mathrm{prox}_{\gamma f_i}(y_i^{k+1}) & \text{if } i \in S_k\\ w_i^{k} & \text{if } i \notin S_k,\end{cases} \qquad z_i^{k} = w_i^{k} + m_i^{k}, \ \ \text{where } \|m_i^{k}\| \le \varepsilon_i^{k}. \qquad (10)$$

Here, m_i^k denotes the vector of errors associated with the approximations of the proximal operator, and w_i^{k+1} is the exact evaluation of prox_{γf_i}(y_i^{k+1}). Note that when i ∉ S_k, we have z_i^{k+1} = z_i^k and w_i^{k+1} = w_i^k, which implies ‖m_i^{k+1}‖ = ‖z_i^{k+1} − w_i^{k+1}‖ = ‖m_i^k‖ = ‖z_i^k − w_i^k‖. From Equation 10 (Atenas, 2025), we have

$$y_i^{k} = w_i^{k} + \gamma\nabla f_i(w_i^{k}). \qquad (11)$$

Then, using the update rule for y_i^{k+1} in Algorithm 1, we get x^k − z_i^k = (1/λ)(y_i^{k+1} − y_i^k) = (1/λ)(w_i^{k+1} − w_i^k) + (γ/λ)(∇f_i(w_i^{k+1}) − ∇f_i(w_i^k)). Using Young's inequality ‖a_1 + a_2‖² ≤ (1 + β)‖a_1‖² + (1 + 1/β)‖a_2‖² and the L-smoothness of f_i, we bound ‖x^k − z_i^k‖² for any β_1 > 0 and i ∈ S_k as follows

$$\begin{aligned}
\|x^{k} - z_i^{k}\|^2 &= \Big\|\frac{1}{\lambda}(w_i^{k+1}-w_i^{k}) + \frac{\gamma}{\lambda}\big(\nabla f_i(w_i^{k+1}) - \nabla f_i(w_i^{k})\big)\Big\|^2\\
&\le \frac{2}{\lambda^2}\|w_i^{k+1}-w_i^{k}\|^2 + \frac{2\gamma^2}{\lambda^2}\|\nabla f_i(w_i^{k+1}) - \nabla f_i(w_i^{k})\|^2\\
&\le \frac{2}{\lambda^2}\|w_i^{k+1}-w_i^{k}\|^2 + \frac{2\gamma^2L^2}{\lambda^2}\|w_i^{k+1}-w_i^{k}\|^2\\
&= \frac{2(\gamma^2L^2+1)}{\lambda^2}\|z_i^{k+1}-m_i^{k+1}-z_i^{k}+m_i^{k}\|^2\\
&\le \frac{2(\gamma^2L^2+1)}{\lambda^2}\Big[(1+\beta_1)\|z_i^{k+1}-z_i^{k}\|^2 + 2\Big(1+\frac{1}{\beta_1}\Big)\big(\|m_i^{k+1}\|^2+\|m_i^{k}\|^2\big)\Big],
\end{aligned}$$

which proves (9).

We then establish the relationship between Σ_{i=1}^n ‖x^k − z_i^k‖² and the squared norm of the gradient mapping ‖G_γ(x^k)‖².

Lemma 2. Let {(y_i^k, z_i^k, x_i^k, e_i^k, x^k, w_i^k)} be generated by Algorithm 1 and Equation 10, and let the gradient mapping G_γ be defined by (5). Then, for any λ > 0, β_2 > 0, and γ > 0, we have

$$\|G_\gamma(x^{k})\|^2 \le \frac{2(1+\gamma L)^2}{n\gamma^2}\sum_{i=1}^{n}\Big[(1+\beta_2)\|z_i^{k} - x^{k}\|^2 + \Big(1+\frac{1}{\beta_2}\Big)\|m_i^{k}\|^2\Big] + \frac{2}{n\gamma^2}\sum_{i=1}^{n}\|e_i^{k-1}-e_i^{k}\|^2. \qquad (12)$$

Proof. From the update of xik+1, eik+1 in Algorithm 1 and (11), we have

$$\frac{1}{n}\sum_{i=1}^{n} x_i^{k} = \frac{1}{n}\sum_{i=1}^{n}\big(2z_i^{k} - y_i^{k} + e_i^{k-1} - e_i^{k}\big) = \frac{1}{n}\sum_{i=1}^{n}\big(2z_i^{k} - w_i^{k} - \gamma\nabla f_i(w_i^{k}) + e_i^{k-1} - e_i^{k}\big). \qquad (13)$$

From the update rule of x^k in Algorithm 1, the definition of G_γ(x), the non-expansiveness of prox_{γg}, and the fact that ∇f(x^k) = (1/n)Σ_{i=1}^n ∇f_i(x^k), we obtain that

$$\begin{aligned}
\|G_\gamma(x^{k})\| &= \frac{1}{\gamma}\big\|x^{k} - \mathrm{prox}_{\gamma g}\big(x^{k} - \gamma\nabla f(x^{k})\big)\big\|\\
&= \frac{1}{\gamma}\Big\|\mathrm{prox}_{\gamma g}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i^{k}\Big) - \mathrm{prox}_{\gamma g}\big(x^{k} - \gamma\nabla f(x^{k})\big)\Big\|\\
&\le \frac{1}{\gamma}\Big\|\frac{1}{n}\sum_{i=1}^{n} x_i^{k} - x^{k} + \gamma\nabla f(x^{k})\Big\|\\
&= \frac{1}{n\gamma}\Big\|\sum_{i=1}^{n}\big[(2z_i^{k} - w_i^{k} - x^{k}) + \gamma\big(\nabla f_i(x^{k}) - \nabla f_i(w_i^{k})\big) + e_i^{k-1} - e_i^{k}\big]\Big\|.
\end{aligned}$$

By applying the L-smoothness of f_i and Young's inequality stated in Lemma 1, for any β_2 > 0 we deduce that

$$\begin{aligned}
\|G_\gamma(x^{k})\|^2 &\le \frac{1}{n^2\gamma^2}\Big[\sum_{i=1}^{n}\big(\|2z_i^{k} - w_i^{k} - x^{k}\| + \gamma L\|x^{k} - w_i^{k}\| + \|e_i^{k-1}-e_i^{k}\|\big)\Big]^2\\
&\le \frac{1}{n\gamma^2}\sum_{i=1}^{n}\big(\|2z_i^{k} - w_i^{k} - x^{k}\| + \gamma L\|x^{k} - w_i^{k}\| + \|e_i^{k-1}-e_i^{k}\|\big)^2\\
&\le \frac{1}{n\gamma^2}\sum_{i=1}^{n}\big[(1+\gamma L)\|z_i^{k} - x^{k}\| + (1+\gamma L)\|m_i^{k}\| + \|e_i^{k-1}-e_i^{k}\|\big]^2\\
&\le \frac{(1+\gamma L)^2}{n\gamma^2}\sum_{i=1}^{n}\Big[2(1+\beta_2)\|z_i^{k} - x^{k}\|^2 + 2\Big(1+\frac{1}{\beta_2}\Big)\|m_i^{k}\|^2 + \frac{2}{(1+\gamma L)^2}\|e_i^{k-1}-e_i^{k}\|^2\Big]\\
&= \frac{2(1+\gamma L)^2}{n\gamma^2}\sum_{i=1}^{n}\Big[(1+\beta_2)\|z_i^{k} - x^{k}\|^2 + \Big(1+\frac{1}{\beta_2}\Big)\|m_i^{k}\|^2 + \frac{1}{(1+\gamma L)^2}\|e_i^{k-1}-e_i^{k}\|^2\Big],
\end{aligned}$$

which proves (12).

Lemma 3. Let {(y_i^k, z_i^k, x_i^k, e_i^k, x^k)} be generated by Algorithm 1. Suppose that Assumptions 1 and 2 hold, and define the Lyapunov function

$$V_k(x^{k}) = g(x^{k}) + \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k}) + \langle\nabla f_i(z_i^{k}),\, x^{k} - z_i^{k}\rangle + \frac{1}{2\gamma}\|x^{k} - z_i^{k}\|^2\Big],$$

then by choosing

$$0 < \gamma < \frac{\sqrt{\big(1-\tfrac{\lambda}{4}\big)^2 - \lambda^2\beta_4(4\beta_4+1)} - \tfrac{\lambda}{4}}{L(2\lambda\beta_4+1)} \quad\text{and}\quad 0 < \lambda < \frac{\min\big\{\sqrt{4\beta_4+\tfrac{17}{16}}-\tfrac{1}{4},\ 2\big\}}{4\beta_4+1},$$

and for any ε1, β1, β4>0, we have

$$\mathbb{E}\big[V_{k+1}(x^{k+1})\,\big|\,\mathcal{A}_{k-1}\big] \le V_k(x^{k}) - \frac{\pi}{2n}\sum_{i=1}^{n}\|x^{k} - z_i^{k}\|^2 + \frac{4\epsilon_1}{\gamma}\nu^2 + \frac{1}{n}\sum_{i=1}^{n}\big(\delta_1(\varepsilon_i^{k})^2 + \delta_2(\varepsilon_i^{k+1})^2\big),$$

where

$$\pi = \frac{p\lambda\big[2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)\big]}{2\gamma(1+\beta_1)(\gamma^2L^2+1)}, \quad \delta_1 = \frac{2(1+\gamma L)^2}{\gamma\beta_4\lambda^2} + \frac{2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)}{\lambda\gamma\beta_1}, \quad \delta_2 = \delta_1 + \frac{1+\gamma^2L^2}{\gamma}.$$

Proof. Given the definition x̄^k = (1/n)Σ_{i=1}^n x_i^k, the update rule x^{k+1} = prox_{γg}(x̄^{k+1}) in Algorithm 1 (hence (x̄^{k+1} − x^{k+1})/γ ∈ ∂g(x^{k+1})), and the convexity of g, we obtain the following inequality

$$g(x^{k+1}) \le g(x^{k}) - \frac{1}{\gamma}\|x^{k+1}-x^{k}\|^2 + \frac{1}{\gamma}\langle \bar{x}^{k+1} - x^{k},\, x^{k+1} - x^{k}\rangle. \qquad (14)$$

Combining Equations 10 and 11, we obtain

$$z_i^{k+1} + \gamma\nabla f_i(z_i^{k+1}) = w_i^{k+1} + \gamma\nabla f_i(w_i^{k+1}) + m_i^{k+1} + \gamma\big(\nabla f_i(z_i^{k+1}) - \nabla f_i(w_i^{k+1})\big) = y_i^{k+1} + m_i^{k+1} + \gamma\big(\nabla f_i(z_i^{k+1}) - \nabla f_i(w_i^{k+1})\big). \qquad (15)$$

Next, using the update rules for xik+1 and eik+1 in Algorithm 1, we have

$$\bar{x}^{k+1} = \frac{1}{n}\sum_{i=1}^{n} x_i^{k+1} = \frac{1}{n}\sum_{i=1}^{n} C\big(2z_i^{k+1} - y_i^{k+1} + e_i^{k}\big) = \frac{1}{n}\sum_{i=1}^{n}\big(2z_i^{k+1} - y_i^{k+1} + e_i^{k} - e_i^{k+1}\big). \qquad (16)$$

In order to establish the descent property of the Lyapunov function Vk+1(xk+1), its second term is expanded and rearranged as follows

$$\begin{aligned}
&\frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k+1}) + \langle\nabla f_i(z_i^{k+1}),\, x^{k+1}-z_i^{k+1}\rangle + \frac{1}{2\gamma}\|x^{k+1}-z_i^{k+1}\|^2\Big]\\
&= \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k+1}) + \langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k+1} + x^{k+1}-x^{k}\rangle\Big] + \frac{1}{2\gamma n}\sum_{i=1}^{n}\|x^{k}-z_i^{k+1} + x^{k+1}-x^{k}\|^2\\
&= \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k+1}) + \langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k+1}\rangle + \frac{1}{2\gamma}\|x^{k}-z_i^{k+1}\|^2\Big]\\
&\quad + \frac{1}{n\gamma}\sum_{i=1}^{n}\big\langle x^{k} - 2z_i^{k+1} + \big(z_i^{k+1}+\gamma\nabla f_i(z_i^{k+1})\big),\, x^{k+1}-x^{k}\big\rangle + \frac{1}{2\gamma}\|x^{k+1}-x^{k}\|^2\\
&\overset{(15)}{=} \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k+1}) + \langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k+1}\rangle + \frac{1}{2\gamma}\|x^{k}-z_i^{k+1}\|^2\Big] + \frac{1}{n\gamma}\sum_{i=1}^{n}\langle x^{k} - 2z_i^{k+1} + y_i^{k+1},\, x^{k+1}-x^{k}\rangle\\
&\quad + \frac{1}{2\gamma}\|x^{k+1}-x^{k}\|^2 + \frac{1}{n\gamma}\sum_{i=1}^{n}\big\langle m_i^{k+1} + \gamma\big(\nabla f_i(z_i^{k+1})-\nabla f_i(w_i^{k+1})\big),\, x^{k+1}-x^{k}\big\rangle\\
&\overset{(16)}{=} \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k+1}) + \langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k+1}\rangle + \frac{1}{2\gamma}\|x^{k}-z_i^{k+1}\|^2\Big] + \frac{1}{\gamma}\Big\langle x^{k} - \bar{x}^{k+1} + \frac{1}{n}\sum_{i=1}^{n}(e_i^{k}-e_i^{k+1}),\, x^{k+1}-x^{k}\Big\rangle\\
&\quad + \frac{1}{2\gamma}\|x^{k+1}-x^{k}\|^2 + \frac{1}{n\gamma}\sum_{i=1}^{n}\big\langle m_i^{k+1} + \gamma\big(\nabla f_i(z_i^{k+1})-\nabla f_i(w_i^{k+1})\big),\, x^{k+1}-x^{k}\big\rangle. \qquad (17)
\end{aligned}$$

Here, Equation 15 is used to separate the term y_i^{k+1} from the approximation error m_i^{k+1}, while Equation 16 expresses 2z_i^{k+1} − y_i^{k+1} in terms of the average vector x̄^{k+1} and the accumulated compression errors e_i^{k+1} and e_i^k. Then, by combining Equations 14, 17 and using the definition of V_{k+1}(x^{k+1}), we obtain that

$$\begin{aligned}
V_{k+1}(x^{k+1}) &\le g(x^{k}) + \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k+1}) + \langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k+1}\rangle + \frac{1}{2\gamma}\|x^{k}-z_i^{k+1}\|^2\Big]\\
&\quad + \frac{1}{n\gamma}\sum_{i=1}^{n}\langle e_i^{k} - e_i^{k+1},\, x^{k+1}-x^{k}\rangle - \frac{1}{2\gamma}\|x^{k+1}-x^{k}\|^2\\
&\quad + \frac{1}{n\gamma}\sum_{i=1}^{n}\big\langle m_i^{k+1} + \gamma\big(\nabla f_i(z_i^{k+1}) - \nabla f_i(w_i^{k+1})\big),\, x^{k+1}-x^{k}\big\rangle. \qquad (18)
\end{aligned}$$

To bound the third term on the right-hand side of Equation 18, we employ the inequality 2⟨a_1, a_2⟩ ≤ ε_1‖a_1‖² + (1/ε_1)‖a_2‖² (for any ε_1 > 0) as follows

$$\begin{aligned}
\frac{1}{n\gamma}\sum_{i=1}^{n}\langle e_i^{k}-e_i^{k+1},\, x^{k+1}-x^{k}\rangle &\le \frac{1}{n\gamma}\sum_{i=1}^{n}\Big[\epsilon_1\|e_i^{k}-e_i^{k+1}\|^2 + \frac{1}{\epsilon_1}\|x^{k+1}-x^{k}\|^2\Big]\\
&\le \frac{1}{n\gamma}\sum_{i=1}^{n}\Big[2\epsilon_1\|e_i^{k}\|^2 + 2\epsilon_1\|e_i^{k+1}\|^2\Big] + \frac{1}{\gamma\epsilon_1}\|x^{k+1}-x^{k}\|^2\\
&\le \frac{2\epsilon_1}{n\gamma}\sum_{i=1}^{n}\big[\|e_i^{k}\|^2 + \|e_i^{k+1}\|^2\big] + \frac{1}{\gamma\epsilon_1}\|x^{k+1}-x^{k}\|^2. \qquad (19)
\end{aligned}$$

For i ∉ S_k, we have w_i^{k+1} = w_i^k. Applying Young's inequality stated in Lemma 1 with any β_3 > 0, we can evaluate the fifth term on the right-hand side of Equation 18 as follows

$$\begin{aligned}
&\frac{1}{n\gamma}\sum_{i=1}^{n}\big\langle m_i^{k+1} + \gamma\big(\nabla f_i(z_i^{k+1})-\nabla f_i(w_i^{k+1})\big),\, x^{k+1}-x^{k}\big\rangle\\
&\le \frac{1}{2n\gamma}\sum_{i=1}^{n}\Big[\frac{1}{\beta_3}\big\|m_i^{k+1} + \gamma\big(\nabla f_i(z_i^{k+1})-\nabla f_i(w_i^{k+1})\big)\big\|^2 + \beta_3\|x^{k+1}-x^{k}\|^2\Big]\\
&\le \frac{1}{n\gamma\beta_3}\sum_{i=1}^{n}\Big[\|m_i^{k+1}\|^2 + \gamma^2\big\|\nabla f_i(w_i^{k+1})-\nabla f_i(z_i^{k+1})\big\|^2\Big] + \frac{\beta_3}{2\gamma}\|x^{k+1}-x^{k}\|^2\\
&\le \frac{1+\gamma^2L^2}{n\gamma\beta_3}\Big[\sum_{i\notin S_k}\|m_i^{k}\|^2 + \sum_{i\in S_k}\|m_i^{k+1}\|^2\Big] + \frac{\beta_3}{2\gamma}\|x^{k+1}-x^{k}\|^2. \qquad (20)
\end{aligned}$$

To streamline the notation, denote

$$\Psi^{k+1} = -\frac{1}{\gamma}\Big(\frac{1}{2}-\frac{1}{\epsilon_1}-\frac{\beta_3}{2}\Big)\|x^{k+1}-x^{k}\|^2 + \frac{2\epsilon_1}{n\gamma}\sum_{i=1}^{n}\big[\|e_i^{k}\|^2+\|e_i^{k+1}\|^2\big] + \frac{1+\gamma^2L^2}{n\gamma\beta_3}\Big[\sum_{i\notin S_k}\|m_i^{k}\|^2 + \sum_{i\in S_k}\|m_i^{k+1}\|^2\Big], \qquad (21)$$

and substituting Equations 19 and 20 into Equation 18, we obtain an expanded expression for V_{k+1}. Distinguishing between the active client set S_k and the inactive set, and employing the L-smoothness of f_i (i.e., f_i(z_i^{k+1}) ≤ f_i(z_i^k) + ⟨∇f_i(z_i^k), z_i^{k+1} − z_i^k⟩ + (L/2)‖z_i^{k+1} − z_i^k‖²), we have

$$\begin{aligned}
V_{k+1}(x^{k+1}) &\le g(x^{k}) + \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k+1}) + \langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k+1}\rangle + \frac{1}{2\gamma}\|x^{k}-z_i^{k+1}\|^2\Big] + \Psi^{k+1}\\
&\quad\text{(by the fact that only } i\in S_k \text{ perform updates)}\\
&= g(x^{k}) + \frac{1}{n}\sum_{i\in S_k} f_i(z_i^{k+1}) + \frac{1}{n}\sum_{i\in S_k}\langle\nabla f_i(z_i^{k+1}),\, z_i^{k}-z_i^{k+1}\rangle + \frac{1}{n}\sum_{i\in S_k}\langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k}\rangle\\
&\quad + \frac{1}{2n\gamma}\sum_{i\in S_k}\|x^{k}-z_i^{k+1}\|^2 + \frac{1}{n}\sum_{i\notin S_k} f_i(z_i^{k}) + \frac{1}{n}\sum_{i\notin S_k}\langle\nabla f_i(z_i^{k}),\, x^{k}-z_i^{k}\rangle + \frac{1}{2n\gamma}\sum_{i\notin S_k}\|x^{k}-z_i^{k}\|^2 + \Psi^{k+1}\\
&\quad\text{(by the } L\text{-smoothness of } f_i)\\
&\le g(x^{k}) + \frac{1}{n}\sum_{i\in S_k} f_i(z_i^{k}) + \frac{L}{2n}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \frac{1}{n}\sum_{i\in S_k}\langle\nabla f_i(z_i^{k+1}),\, x^{k}-z_i^{k}\rangle + \frac{1}{2n\gamma}\sum_{i\in S_k}\|x^{k}-z_i^{k+1}\|^2\\
&\quad + \frac{1}{n}\sum_{i\notin S_k} f_i(z_i^{k}) + \frac{1}{n}\sum_{i\notin S_k}\langle\nabla f_i(z_i^{k}),\, x^{k}-z_i^{k}\rangle + \frac{1}{2n\gamma}\sum_{i\notin S_k}\|x^{k}-z_i^{k}\|^2 + \Psi^{k+1}\\
&= g(x^{k}) + \frac{1}{n}\sum_{i=1}^{n} f_i(z_i^{k}) + \frac{1}{n}\sum_{i=1}^{n}\langle\nabla f_i(z_i^{k}),\, x^{k}-z_i^{k}\rangle + \frac{L}{2n}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \frac{1}{2n\gamma}\sum_{i\in S_k}\|x^{k}-z_i^{k+1}\|^2\\
&\quad + \frac{1}{n}\sum_{i\in S_k}\langle\nabla f_i(z_i^{k+1})-\nabla f_i(z_i^{k}),\, x^{k}-z_i^{k}\rangle + \frac{1}{2n\gamma}\sum_{i\notin S_k}\|x^{k}-z_i^{k}\|^2 + \Psi^{k+1}. \qquad (22)
\end{aligned}$$

Next, applying the square-norm expansion

$$\|x^{k} - z_i^{k+1}\|^2 = \|x^{k} - z_i^{k}\|^2 + 2\langle x^{k} - z_i^{k},\, z_i^{k} - z_i^{k+1}\rangle + \|z_i^{k} - z_i^{k+1}\|^2.$$

For non-updated clients i ∉ S_k, the local variable remains unchanged, i.e., z_i^{k+1} = z_i^k. Substituting these relations into the original expression gives

$$\frac{1}{2n\gamma}\sum_{i\in S_k}\|x^{k}-z_i^{k+1}\|^2 + \frac{1}{2n\gamma}\sum_{i\notin S_k}\|x^{k}-z_i^{k}\|^2 = \frac{1}{2n\gamma}\sum_{i=1}^{n}\|x^{k}-z_i^{k}\|^2 + \frac{1}{2n\gamma}\sum_{i\in S_k}\big[2\langle x^{k}-z_i^{k},\, z_i^{k}-z_i^{k+1}\rangle + \|z_i^{k}-z_i^{k+1}\|^2\big].$$

Inserting the reorganized expression into the expansion of Vk+1(xk+1) and collecting common terms gives

$$\begin{aligned}
V_{k+1}(x^{k+1}) &\le V_k(x^{k}) + \frac{1}{n}\sum_{i\in S_k}\langle\nabla f_i(z_i^{k+1})-\nabla f_i(z_i^{k}),\, x^{k}-z_i^{k}\rangle + \frac{1}{n\gamma}\sum_{i\in S_k}\langle z_i^{k+1}-z_i^{k},\, z_i^{k}-x^{k}\rangle\\
&\quad + \frac{1+L\gamma}{2n\gamma}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \Psi^{k+1}. \qquad (23)
\end{aligned}$$

Then, from the update rule of yik+1 in Algorithm 1 together with Equations 10 and 11, we derive an expression for zik-xk:

$$\begin{aligned}
z_i^{k}-x^{k} &= \frac{1}{\lambda}\big(y_i^{k}-y_i^{k+1}\big)\\
&= \frac{1}{\lambda}\big(w_i^{k}-w_i^{k+1}\big) + \frac{\gamma}{\lambda}\big(\nabla f_i(w_i^{k})-\nabla f_i(w_i^{k+1})\big)\\
&= \frac{1}{\lambda}\big(z_i^{k}-z_i^{k+1}\big) + \frac{\gamma}{\lambda}\big(\nabla f_i(z_i^{k})-\nabla f_i(z_i^{k+1})\big)\\
&\quad + \frac{1}{\lambda}\Big[\big(m_i^{k+1}+\gamma(\nabla f_i(z_i^{k+1})-\nabla f_i(w_i^{k+1}))\big) - \big(m_i^{k}+\gamma(\nabla f_i(z_i^{k})-\nabla f_i(w_i^{k}))\big)\Big]\\
&= \frac{1}{\lambda}\big(z_i^{k}-z_i^{k+1}\big) + \frac{\gamma}{\lambda}\big(\nabla f_i(z_i^{k})-\nabla f_i(z_i^{k+1})\big) + n_i^{k}, \qquad (24)
\end{aligned}$$

where n_i^k is a composite error term involving the approximation errors m_i^k, m_i^{k+1} and gradient differences; the subsequent analysis controls the impact of n_i^k via its norm bound. It is defined as

$$n_i^{k} = \frac{1}{\lambda}\Big[\big(m_i^{k+1}+\gamma(\nabla f_i(z_i^{k+1})-\nabla f_i(w_i^{k+1}))\big) - \big(m_i^{k}+\gamma(\nabla f_i(z_i^{k})-\nabla f_i(w_i^{k}))\big)\Big].$$

Its squared norm satisfies

$$\|n_i^{k}\|^2 = \frac{1}{\lambda^2}\big\|m_i^{k+1}-m_i^{k}+\gamma\big(\nabla f_i(z_i^{k+1})-\nabla f_i(w_i^{k+1})\big) + \gamma\big(\nabla f_i(w_i^{k})-\nabla f_i(z_i^{k})\big)\big\|^2 \le \frac{2(1+\gamma L)^2}{\lambda^2}\big[\|m_i^{k}\|^2+\|m_i^{k+1}\|^2\big].$$

By applying the L-smoothness of f_i, Young's inequality, and Equation 24, we obtain for any β_4 > 0 that

$$\begin{aligned}
V_{k+1}(x^{k+1}) &\le V_k(x^{k}) + \frac{\lambda(1+L\gamma)-2}{2\lambda\gamma n}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \frac{\gamma}{\lambda n}\sum_{i\in S_k}\|\nabla f_i(z_i^{k+1})-\nabla f_i(z_i^{k})\|^2\\
&\quad + \frac{1}{\gamma n}\sum_{i\in S_k}\big\langle n_i^{k},\, (z_i^{k+1}-z_i^{k})+\gamma\big(\nabla f_i(z_i^{k})-\nabla f_i(z_i^{k+1})\big)\big\rangle + \Psi^{k+1}\\
&\quad\text{(by the } L\text{-smoothness of } f_i)\\
&\le V_k(x^{k}) + \frac{\gamma L^2}{\lambda n}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \frac{\lambda(1+L\gamma)-2}{2\lambda\gamma n}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \Psi^{k+1}\\
&\quad + \frac{1}{\gamma n}\sum_{i\in S_k}\Big[\frac{1}{\beta_4}\|n_i^{k}\|^2 + 2\beta_4\|z_i^{k}-z_i^{k+1}\|^2 + 2\beta_4\gamma^2\|\nabla f_i(z_i^{k})-\nabla f_i(z_i^{k+1})\|^2\Big]\\
&\le V_k(x^{k}) - \frac{2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)}{2\lambda\gamma n}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \frac{1}{\gamma\beta_4 n}\sum_{i\in S_k}\|n_i^{k}\|^2 + \Psi^{k+1}\\
&\le V_k(x^{k}) - \frac{2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)}{2\lambda\gamma n}\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 + \frac{2(1+\gamma L)^2}{\gamma\beta_4\lambda^2 n}\sum_{i\in S_k}\big[\|m_i^{k}\|^2+\|m_i^{k+1}\|^2\big] + \Psi^{k+1}. \qquad (25)
\end{aligned}$$

Next, leveraging the L-smoothness of f_i and assuming γ ≤ 1/L, we demonstrate the boundedness of V_k(x^k):

$$\begin{aligned}
V_k(x^{k}) &= g(x^{k}) + \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(z_i^{k}) + \langle\nabla f_i(z_i^{k}),\, x^{k} - z_i^{k}\rangle + \frac{1}{2\gamma}\|x^{k} - z_i^{k}\|^2\Big]\\
&\ge g(x^{k}) + \frac{1}{n}\sum_{i=1}^{n}\Big[f_i(x^{k}) - \frac{L}{2}\|x^{k} - z_i^{k}\|^2 + \frac{1}{2\gamma}\|x^{k} - z_i^{k}\|^2\Big]\\
&= F(x^{k}) + \Big(\frac{1}{2\gamma}-\frac{L}{2}\Big)\frac{1}{n}\sum_{i=1}^{n}\|x^{k} - z_i^{k}\|^2\\
&\ge F^{\star}.
\end{aligned}$$

From Lemma 1, we have

$$\frac{\lambda^2}{2(1+\beta_1)(\gamma^2L^2+1)}\sum_{i\in S_k}\|x^{k} - z_i^{k}\|^2 \le \sum_{i\in S_k}\Big[\|z_i^{k+1}-z_i^{k}\|^2 + \frac{2}{\beta_1}\big(\|m_i^{k+1}\|^2+\|m_i^{k}\|^2\big)\Big]. \qquad (26)$$

According to the sampling scheme, we consider the expectation of Σ_{i∈S_k} ‖z_i^{k+1} − z_i^k‖² with respect to S_k conditioned on A_{k−1}. Combined with (26), this yields

$$\mathbb{E}\Big[\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2 \,\Big|\, \mathcal{A}_{k-1}\Big] = \sum_{S}\mathbb{P}(S_k=S)\sum_{i\in S}\|z_i^{k+1}-z_i^{k}\|^2 = \sum_{i=1}^{n} p_i\|z_i^{k+1}-z_i^{k}\|^2 \ge \frac{p\lambda^2}{2(1+\beta_1)(\gamma^2L^2+1)}\sum_{i=1}^{n}\|x^{k} - z_i^{k}\|^2 - \frac{2p}{\beta_1}\sum_{i=1}^{n}\big(\|m_i^{k+1}\|^2+\|m_i^{k}\|^2\big), \qquad (27)$$

where p = min_{i∈[n]} p_i ∈ (0, 1]. By taking the conditional expectation of Equation 25 with respect to S_k conditioned on A_{k−1}, and combining it with Equations 10, 21, 27 under the setting β_3 = 1, we derive the following

$$\begin{aligned}
\mathbb{E}\big[V_{k+1}(x^{k+1})\,\big|\,\mathcal{A}_{k-1}\big] &\overset{(21)}{\le} V_k(x^{k}) + \frac{2(1+\gamma L)^2}{\gamma\beta_4\lambda^2 n}\sum_{i=1}^{n} p_i\big[\|m_i^{k}\|^2+\|m_i^{k+1}\|^2\big]\\
&\quad - \frac{2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)}{2\lambda\gamma n}\,\mathbb{E}\Big[\sum_{i\in S_k}\|z_i^{k+1}-z_i^{k}\|^2\,\Big|\,\mathcal{A}_{k-1}\Big]\\
&\quad + \frac{2\epsilon_1}{n\gamma}\,\mathbb{E}\Big[\sum_{i=1}^{n}\|e_i^{k}\|^2+\sum_{i=1}^{n}\|e_i^{k+1}\|^2\Big] + \frac{1+\gamma^2L^2}{n\gamma}\sum_{i=1}^{n}\big[(1-p_i)\|m_i^{k}\|^2 + p_i\|m_i^{k+1}\|^2\big]\\
&\quad\text{(by the definition of the absolute compressor)}\\
&\overset{(27)}{\le} V_k(x^{k}) + \frac{2(1+\gamma L)^2}{\gamma\beta_4\lambda^2 n}\sum_{i=1}^{n} p_i\big[\|m_i^{k}\|^2+\|m_i^{k+1}\|^2\big]\\
&\quad - \frac{p\lambda\big[2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)\big]}{4\gamma n(1+\beta_1)(\gamma^2L^2+1)}\sum_{i=1}^{n}\|x^{k}-z_i^{k}\|^2\\
&\quad + \frac{p\big[2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)\big]}{\lambda\gamma\beta_1 n}\sum_{i=1}^{n}\big(\|m_i^{k+1}\|^2+\|m_i^{k}\|^2\big)\\
&\quad + \frac{4\epsilon_1}{\gamma}\nu^2 + \frac{1+\gamma^2L^2}{n\gamma}\sum_{i=1}^{n}\big[(1-p_i)\|m_i^{k}\|^2 + p_i\|m_i^{k+1}\|^2\big]\\
&\overset{(10)}{\le} V_k(x^{k}) - \frac{\pi}{2n}\sum_{i=1}^{n}\|x^{k}-z_i^{k}\|^2 + \frac{4\epsilon_1}{\gamma}\nu^2 + \frac{1}{n}\sum_{i=1}^{n}\big(\delta_1(\varepsilon_i^{k})^2+\delta_2(\varepsilon_i^{k+1})^2\big).
\end{aligned}$$

To guarantee the descent property, let

$$\pi = \frac{p\lambda\big[2-\lambda(1+L\gamma)-2L^2\gamma^2-4\lambda\beta_4(1+L^2\gamma^2)\big]}{2\gamma(1+\beta_1)(\gamma^2L^2+1)} > 0.$$

Then, we have

$$0 < \lambda < \frac{\min\big\{\sqrt{4\beta_4+\tfrac{17}{16}}-\tfrac{1}{4},\ 2\big\}}{4\beta_4+1} \quad\text{and}\quad 0 < \gamma < \frac{\sqrt{\big(1-\tfrac{\lambda}{4}\big)^2 - \lambda^2\beta_4(4\beta_4+1)} - \tfrac{\lambda}{4}}{L(2\lambda\beta_4+1)}.$$

Theorem 1. Let {(y_i^k, z_i^k, x_i^k, e_i^k, x^k)} be generated by Algorithm 1. Suppose that Assumptions 1 and 2 hold. Then, for

$$0 < \gamma < \frac{\sqrt{\big(1-\tfrac{\lambda}{4}\big)^2 - \lambda^2\beta_4(4\beta_4+1)} - \tfrac{\lambda}{4}}{L(2\lambda\beta_4+1)} \quad\text{and}\quad 0 < \lambda < \frac{\min\big\{\sqrt{4\beta_4+\tfrac{17}{16}}-\tfrac{1}{4},\ 2\big\}}{4\beta_4+1},$$

we have

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|G_\gamma(x^{k})\|^2\big] \le \frac{M_1}{K}\big(F(x^{0})-F^{\star}\big) + \frac{1}{nK}\sum_{k=0}^{K-1}\sum_{i=1}^{n}\big[M_2(\varepsilon_i^{k})^2 + M_3(\varepsilon_i^{k+1})^2\big] + \frac{M_4}{K}\nu^2, \qquad (28)$$

where

$$M_1 = \frac{4(1+\beta_2)(1+\gamma L)^2}{\pi\gamma^2}, \quad M_2 = \frac{2\delta_1\beta_2+\pi}{\beta_2}M_1, \quad M_3 = \delta_2 M_1, \quad M_4 = \frac{4\epsilon_1 K}{\gamma}M_1 + \frac{4K}{n\gamma^2},$$

with ε1, β2>0, and π, δ1, δ2 defined in Lemma 3.

Proof. First, it follows from Lemma 3 that

$$\sum_{i=1}^{n}\|x^{k} - z_i^{k}\|^2 \le \frac{2n}{\pi}\Big[V_k(x^{k}) - \mathbb{E}\big[V_{k+1}(x^{k+1})\,\big|\,\mathcal{A}_{k-1}\big] + \frac{4\epsilon_1}{\gamma}\nu^2 + \frac{1}{n}\sum_{i=1}^{n}\big(\delta_1(\varepsilon_i^{k})^2 + \delta_2(\varepsilon_i^{k+1})^2\big)\Big].$$

Combining the derived estimates and Lemma 2, we obtain

$$\begin{aligned}
\|G_\gamma(x^{k})\|^2 &\le \frac{2(1+\gamma L)^2}{n\gamma^2}\sum_{i=1}^{n}\Big[(1+\beta_2)\|z_i^{k}-x^{k}\|^2 + \Big(1+\frac{1}{\beta_2}\Big)\|m_i^{k}\|^2\Big] + \frac{2}{n\gamma^2}\sum_{i=1}^{n}\|e_i^{k-1}-e_i^{k}\|^2\\
&\le \frac{4(1+\beta_2)(1+\gamma L)^2}{\pi\gamma^2}\Big[V_k(x^{k}) - \mathbb{E}\big[V_{k+1}(x^{k+1})\,\big|\,\mathcal{A}_{k-1}\big]\Big] + \frac{4(1+\beta_2)(1+\gamma L)^2}{n\pi\gamma^2}\sum_{i=1}^{n}\big(\delta_1(\varepsilon_i^{k})^2 + \delta_2(\varepsilon_i^{k+1})^2\big)\\
&\quad + \frac{2(1+\beta_2)(1+\gamma L)^2}{n\gamma^2\beta_2}\sum_{i=1}^{n}(\varepsilon_i^{k})^2 + \frac{2}{n\gamma^2}\sum_{i=1}^{n}\|e_i^{k-1}-e_i^{k}\|^2 + \frac{16(1+\beta_2)(1+\gamma L)^2\epsilon_1}{\pi\gamma^3}\nu^2. \qquad (29)
\end{aligned}$$

Taking the total expectation of ||Gγ(xk)||2 with respect to Ak, and by using the update of eik and the definition of the absolute compressor, we obtain the following result

$$\mathbb{E}\big[\|G_\gamma(x^{k})\|^2\big] \le M_1\big(\mathbb{E}[V_k(x^{k})] - \mathbb{E}[V_{k+1}(x^{k+1})]\big) + \frac{M_2}{n}\sum_{i=1}^{n}(\varepsilon_i^{k})^2 + \frac{M_3}{n}\sum_{i=1}^{n}(\varepsilon_i^{k+1})^2 + \frac{M_4}{K}\nu^2,$$

where

$$M_1 = \frac{4(1+\beta_2)(1+\gamma L)^2}{\pi\gamma^2}, \quad M_2 = \frac{2(1+\beta_2)(1+\gamma L)^2(4\delta_1\beta_2+2\pi)}{\gamma^2\beta_2\pi}, \quad M_3 = \frac{4(1+\beta_2)(1+\gamma L)^2\delta_2}{\pi\gamma^2}, \quad M_4 = \frac{16(1+\beta_2)(1+\gamma L)^2\epsilon_1 K}{\pi\gamma^3} + \frac{4K}{n\gamma^2}$$

are four constants. Summing the inequality over k from 0 to K−1 and then dividing the resulting sum by K, we derive

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|G_\gamma(x^{k})\|^2\big] \le \frac{M_1}{K}\big(\mathbb{E}[V_0(x^{0})] - \mathbb{E}[V_K(x^{K})]\big) + \frac{1}{K}\sum_{k=0}^{K-1}\Big[\frac{M_2}{n}\sum_{i=1}^{n}(\varepsilon_i^{k})^2 + \frac{M_3}{n}\sum_{i=1}^{n}(\varepsilon_i^{k+1})^2 + \frac{M_4}{K}\nu^2\Big]. \qquad (30)$$

With the initial condition z_i^0 = x^0, we obtain V_0(x^0) = g(x^0) + (1/n)Σ_{i=1}^n f_i(z_i^0) = F(x^0). Together with the lower bound 𝔼[V_{k+1}(x^{k+1})] ≥ F*, this implies that Equation 30 simplifies to

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|G_\gamma(x^{k})\|^2\big] \le \frac{M_1}{K}\big(F(x^{0})-F^{\star}\big) + \frac{1}{nK}\sum_{k=0}^{K-1}\sum_{i=1}^{n}\big[M_2(\varepsilon_i^{k})^2 + M_3(\varepsilon_i^{k+1})^2\big] + \frac{M_4}{K}\nu^2, \qquad (31)$$

which proves Equation 28.

Corollary 1. Suppose that Assumptions 1 and 2 hold. Then EF-Feddr (Algorithm 1) finds an ε-stationary point x^k, i.e., 𝔼‖G_γ(x^k)‖ ≤ ε, within the following number of iterations:

$$K \ge \frac{M_1\big[F(x^{0})-F^{\star}\big] + (M_2+M_3)M + M_4\nu^2}{\varepsilon^2},$$

where M > 0 is a constant, and M_1, M_2, M_3, M_4 are defined in Theorem 1. Consequently, the communication complexity is K = O(1/ε²).

Proof. As described in Tran-Dinh et al. (2021), the choice of accuracies ε_i^k is constrained such that, for a given constant M > 0, (1/n)Σ_{k=0}^{K−1}Σ_{i=1}^n (ε_i^k)² ≤ M. Therefore,

$$\frac{1}{K}\sum_{k=0}^{K-1}\mathbb{E}\big[\|G_\gamma(x^{k})\|^2\big] \le \frac{M_1\big(F(x^{0})-F^{\star}\big) + (M_2+M_3)M + M_4\nu^2}{K}. \qquad (32)$$

Consequently, to guarantee 𝔼‖G_γ(x^k)‖ ≤ ε, we have

$$K \ge \frac{M_1\big[F(x^{0})-F^{\star}\big] + (M_2+M_3)M + M_4\nu^2}{\varepsilon^2}.$$

Therefore, we can take K = (M_1[F(x^0) − F*] + (M_2 + M_3)M + M_4ν²)/ε² = O(1/ε²) as its lower bound.

5 Experiments

In the experiments, we evaluate EF-Feddr against Eco-FedSplit (Khirirat et al., 2022), Eco-FedProx (Khirirat et al., 2022), and FedDR (Tran-Dinh et al., 2021). In all compression-based baselines, the compression operator C denotes Top-k sparsification. For a fair comparison, we implement Eco-FedSplit, Eco-FedProx, and EF-Feddr on top of the FedDR framework. All experiments are conducted in TensorFlow (Abadi et al., 2016) on a cluster equipped with NVIDIA Tesla P100 (16 GB) GPUs. We next describe the datasets and models used in our study.

5.1 Non-IID datasets

We evaluate on both synthetic and real-world datasets: synthetic-(l, s), FEMNIST, and Shakespeare. Following prior studies (Caldas et al., 2018; Tran-Dinh et al., 2021), we generate synthetic-(l, s) with (l, s) = {(0, 0), (1, 1)}, where l controls how much the local models differ and s controls the degree of local data heterogeneity; larger l and s imply stronger non-IID heterogeneity. FEMNIST extends MNIST to 62 classes with over 800k samples; we use an 80%/20% train/test split and partition by writer, so each client's dataset comprises samples from a subset of writers, which naturally induces client-level variability in handwriting styles and features. Shakespeare is a character-level language modeling corpus; we partition by user/play, so each client is allocated a distinct subset of the texts, which may include a varying number of plays and scenes. This yields a non-uniform distribution of text across clients: certain clients predominantly receive data from specific plays, whereas others obtain a more diverse range of content. In both real-world datasets, the degree of non-IID-ness within a client's dataset can be quantified by the number of classes present. The datasets and model configurations used in our experiments are summarized in Table 2, which outlines their key statistical characteristics.

Table 2. Dataset and model characteristics for federated training.
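For reproducibility, the sketch below generates data in the spirit of the synthetic-(l, s) construction popularized by Li et al. (2020): l spreads the client-specific models and s spreads the client-specific feature distributions. The covariance decay, the unbalanced sample counts, and the exact scaling are assumptions for illustration rather than the exact generator used in our experiments.

```python
import numpy as np

def synthetic_ls(l, s, n_clients=30, dim=60, n_classes=10, seed=0):
    """Sketch of a synthetic-(l, s) generator in the spirit of Li et al. (2020)."""
    rng = np.random.default_rng(seed)
    cov = np.diag([(j + 1) ** (-1.2) for j in range(dim)])    # decaying feature covariance (assumed)
    datasets = []
    for _ in range(n_clients):
        u = rng.normal(0.0, np.sqrt(l))                       # l spreads the client-specific models
        B = rng.normal(0.0, np.sqrt(s))                       # s spreads the client-specific features
        W = rng.normal(u, 1.0, size=(n_classes, dim))
        b = rng.normal(u, 1.0, size=n_classes)
        v = rng.normal(B, 1.0, size=dim)
        m = int(rng.lognormal(4.0, 1.0)) + 50                 # unbalanced sample counts (assumed)
        X = rng.multivariate_normal(v, cov, size=m)
        y = np.argmax(X @ W.T + b, axis=1)                    # labels from the client's own model
        datasets.append((X, y))
    return datasets

data = synthetic_ls(1, 1)
print("samples on the first five clients:", [len(y) for _, y in data[:5]])
```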

5.2 Models and hyper-parameters selection

We use a fully connected network with a 60-32-10 architecture and train it for 200 communication rounds with a learning rate of 0.01 on all synthetic datasets. At each round, 10 out of 30 clients are sampled. To evaluate the algorithm's performance with an increased number of clients, we further extended the synthetic-(1, 1) setup from the original 30 clients to 90 clients while preserving the statistical characteristics defined by the (l, s) parameters. The data generation process maintained the same non-IID partition pattern and per-client data distribution profile as the original setup, and the client sampling ratio was kept constant at 1/3 (that is, 30 out of 90 clients are selected per round). Eco-FedSplit applies error-compensated compression to FedSplit, and Eco-FedProx does so to FedProx. To study an image classification problem on FEMNIST, we employ an artificial neural network (ANN) consisting of two fully connected layers: the first layer has 128 neurons followed by a ReLU activation function, and the second layer has 62 neurons followed by a softmax activation function for classification. In this experiment, we sample 50 clients out of 200 to perform updates at each communication round for all the above-mentioned algorithms. The FEMNIST model is trained for 200 communication rounds in total with an optimal learning rate of 0.003. Consistent with prior research (Li et al., 2020), our approach to character-level prediction on the Shakespeare dataset utilizes a recurrent neural network (RNN) architecture. Specifically, we deploy a two-layer stacked LSTM classifier, each layer comprising 256 hidden units. Each input sequence contains 80 characters, which are first embedded into an eight-dimensional space prior to LSTM processing. The model then generates a 62-class softmax distribution over the character vocabulary for each training instance. The training regimen involves a total of 50 communication rounds, and an optimal learning rate of 0.08 is determined for the four operator-splitting-based federated learning algorithms employed in this study. Parameters for each algorithm, such as α∈(0, 2) and η∈[1, 1000] for FedDR, μ∈[0.001, 1] for Eco-FedProx, and λ∈(0, 2) and γ∈[1, 1000] for EF-Feddr, are tuned over a large range of values. For each dataset, we pick the most suitable parameters for each algorithm.
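For concreteness, the two architectures described above can be assembled in a few lines of TensorFlow/Keras; the 28×28 input size for FEMNIST and the optimizer wiring are assumptions, since only the layer sizes, activations, and learning rates are specified in the text.

```python
import tensorflow as tf

# Sketch of the FEMNIST classifier described above: two dense layers,
# 128 ReLU units followed by 62 softmax units (input size assumed to be 28*28).
def build_femnist_model(lr=0.003):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28 * 28,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(62, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# Sketch of the Shakespeare character model: 8-dimensional embedding,
# two stacked 256-unit LSTMs, and a 62-way softmax over the character vocabulary.
def build_shakespeare_model(vocab_size=62, seq_len=80):
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(seq_len,)),
        tf.keras.layers.Embedding(vocab_size, 8),
        tf.keras.layers.LSTM(256, return_sequences=True),
        tf.keras.layers.LSTM(256),
        tf.keras.layers.Dense(vocab_size, activation="softmax"),
    ])

print("FEMNIST model parameters:", build_femnist_model().count_params())
```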

5.3 Comparison of methods

Figures 1–3 report training loss/accuracy and test accuracy vs. communication rounds and communication cost on the synthetic datasets; Figure 4 shows the same on FEMNIST. A key observation is that expanding the total number of clients does not substantially degrade the performance of EF-Feddr. Experimental results under the scaled setting (Figure 3) confirm this: the algorithm maintains nearly identical convergence speed and final accuracy compared to the original 30-client scenario (Figure 2). Across heterogeneous settings, EF-Feddr consistently outperforms the baselines. On FEMNIST, EF-Feddr reaches 80.5% test accuracy at round 50, whereas Eco-FedSplit attains 74.5% only at round 200. Within 200 rounds, EF-Feddr improves accuracy by 12.97% and 7.93% over Eco-FedSplit and Eco-FedProx, respectively. On synthetic-(0, 0), EF-Feddr exceeds the two baselines by 3.88% and 8.40%; on synthetic-(1, 1), by 7.20% and 3.29%. On Shakespeare, Figure 5 shows that EF-Feddr also surpasses two Douglas–Rachford splitting-based FL algorithms, Eco-FedSplit and FedDR. As shown in Table 3, EF-Feddr requires 18.64%–85.41% less runtime and 48.03%–93.18% less communication than baseline methods to achieve the same target test accuracy of 60% on the synthetic data and 70% on FEMNIST. Specifically, on FEMNIST, it meets this target in only 17 communication rounds (8.29 min), significantly outperforming competitors such as Eco-FedSplit. These substantial reductions in overhead are consistently observed across the synthetic datasets. Additionally, EF-Feddr achieves a substantial reduction in communication costs without compromising performance relative to the uncompressed FedDR.

Figure 1. Convergence performance of different methods on the synthetic-(0, 0) dataset with Top-k and participation rate p = 0.3.

Figure 2. Convergence performance of different methods on the synthetic-(1, 1) dataset with Top-k, participation rate p = 0.3, and N = 30 total clients.

Figure 3. Convergence performance of different methods on the synthetic-(1, 1) dataset with Top-k, participation rate p = 0.3, and N = 90 total clients.

Figure 4. Convergence performance of different methods on the FEMNIST dataset with Top-k and participation rate p = 0.3.

Figure 5. Convergence performance of different methods on the Shakespeare dataset with Top-k and participation rate p = 0.3.

Table 3. Efficiency comparison on the synthetic-(1, 1) and FEMNIST datasets.

5.4 Effect of the relaxation parameter

Figure 6 examines the effect of the relaxation parameter λ over 200 iterations. Empirically, the best convergence is observed at λ = 0.3. Consistent with prior findings on FL adaptations of Douglas–Rachford splitting, choosing 0 < λ < 1 often leads to faster convergence than the classical (unrelaxed) variant.

Figure 6. EF-Feddr on FEMNIST with relaxation parameter λ analysis.

6 Discussion

This study presents EF-Feddr, a communication-efficient federated learning algorithm that combines error-compensated compression with Douglas–Rachford splitting. The method's robustness is demonstrated across controlled synthetic and real-world benchmarks, yet we recognize that extreme heterogeneity, such as single-class clients, remains a challenging frontier. Furthermore, while our experiments simulate realistic constraints (partial participation, compression), fully asynchronous updates and dynamic network conditions warrant further study in real deployments.

Recent advances in behavior-based threat hunting (Bhardwaj et al., 2022), IoT firmware security assessment (Bhardwaj et al., 2023), and energy-efficient proactive fault tolerance in cloud environments (Talwar et al., 2021) provide complementary perspectives for building reliable and secure federated systems. While this study focuses on optimization efficiency under non-IID and communication constraints, these studies collectively point toward an integrated “Optimization + System + Security” paradigm for future research. Specifically, they motivate investigations into client behavior profiling for attack detection, trusted execution at the edge, and proactive fault-tolerant scheduling, all of which are essential for deploying robust and efficient federated learning in real-world, dynamic environments. Furthermore, to strengthen the generalizability of our findings, future studies will also include evaluations on a wider variety of datasets, encompassing diverse domains, scales, and heterogeneity patterns, thereby providing a more comprehensive assessment of the algorithm's practical applicability.

7 Conclusion

In this study, we introduced EF-Feddr, a communication-efficient algorithm for non-convex federated learning that leverages the Douglas–Rachford splitting method, error feedback compression, and a relaxation strategy. EF-Feddr improves communication efficiency while preserving solution accuracy. Both theoretical analysis and empirical experiments demonstrated that EF-Feddr substantially reduces the number of bits transmitted from clients to the server compared with uncompressed FedDR. In terms of solution accuracy, EF-Feddr performs comparably to the uncompressed FedDR. Building on the Douglas–Rachford envelope, we established convergence guarantees and analyzed the communication complexity of EF-Feddr under mild assumptions. Extensive experiments further confirmed that our method significantly outperforms existing state-of-the-art approaches in non-IID settings.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: arXiv preprint arXiv:1812.01097.

Author contributions

JX: Validation, Conceptualization, Methodology, Formal analysis, Data curation, Writing – original draft, Software. CW: Visualization, Investigation, Supervision, Resources, Funding acquisition, Project administration, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). "TensorFlow: a system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (Savannah, GA), 265–283.

Alistarh, D., Grubic, D., Li, J., Tomioka, R., and Vojnovic, M. (2017). “QSGD: communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems 30.

Atenas, F. (2025). Understanding the Douglas-Rachford splitting method through the lenses of moreau-type envelopes. Comput. Optim. Appl. 90, 881–910. doi: 10.1007/s10589-024-00646-9

Bao, H., Chen, P., Sun, Y., and Li, Z. (2025). EFSKIP: a new error feedback with linear speedup for compressed federated learning with arbitrary data heterogeneity. Proc. AAAI Conf. Artif. Intell. 39, 15489–15497. doi: 10.1609/aaai.v39i15.33700

Bernstein, J., Wang, Y.-X., Azizzadenesheli, K., and Anandkumar, A. (2018). “SIGNSGD: compressed optimisation for non-convex problems,” in International Conference on Machine Learning (Stockholm: PMLR), 560–569.

Bhardwaj, A., Kaushik, K., Alomari, A., Alsirhani, A., Alshahrani, M. M., Bharany, S., et al. (2022). BTH: behavior-based structured threat hunting framework to analyze and detect advanced adversaries. Electronics 11:2992. doi: 10.3390/electronics11192992

Bhardwaj, A., Kaushik, K., Bharany, S., and Kim, S. (2023). Forensic analysis and security assessment of IOT camera firmware for smart homes. Egypt. Inf. J. 24:100409. doi: 10.1016/j.eij.2023.100409

Caldas, S., Duddu, S. M. K., Wu, P., Li, T., Konečnỳ, J., McMahan, H. B., et al. (2018). Leaf: a benchmark for federated settings. arXiv [preprint]. arXiv:1812.01097. doi: 10.4885/arXiv.1812.01097

Ezequiel, C. E. J., Gjoreski, M., and Langheinrich, M. (2022). Federated learning for privacy-aware human mobility modeling. Front. Artif. Intell. 5:867046. doi: 10.3389/frai.2022.867046

Godavarthi, D., Jaswanth, V., Mohanty, S., Dinesh, P., Venkata Charan Sathvik, R., Moreira, F., et al. (2025). Federated quantum-inspired anomaly detection using collaborative neural clients. Front. Artif Intell. 8:1648609. doi: 10.3389/frai.2025.1648609

Goel, C., Anita, X., and Anbarasi, J. L. (2025). Federated knee injury diagnosis using few shot learning. Front. Artif. Intell. 8:1589358. doi: 10.3389/frai.2025.1589358

He, S., Dong, Q.-L., Tian, H., and Li, X.-H. (2021). On the optimal relaxation parameters of Krasnosel'ski-Mann iteration. Optimization 70, 1959–1986. doi: 10.1080/02331934.2020.1767101

Islam, F., Mahmood, A., Mukhtiar, N., Wijethilake, K. E., and Sheng, Q. Z. (2024). “Fairequityfl-a fair and equitable client selection in federated learning for heterogeneous IOV networks,” in International Conference on Advanced Data Mining and Applications (Cham: Springer), 254–269. doi: 10.1007/978-981-96-0814-0_17

Jhunjhunwala, D., Sharma, P., Nagarkatti, A., and Joshi, G. (2022). “Fedvarp: tackling the variance due to partial client participation in federated learning,” in Uncertainty in Artificial Intelligence (Eindhoven: PMLR), 906–916.

Kant, S., da Silva, J. M. B., Fodor, G., Göransson, B., Bengtsson, M., and Fischione, C. (2022). Federated learning using three-operator ADMM. IEEE J. Sel. Topics Signal Processing 17, 205–221. doi: 10.1109/JSTSP.2022.3221681

Karimireddy, S. P., Rebjock, Q., Stich, S., and Jaggi, M. (2019). “Error feedback fixes signsgd and other gradient compression schemes,” in International Conference on Machine Learning (Long Beach, CA: PMLR), 3252–3261.

Khirirat, S., Johansson, M., and Alistarh, D. (2018). “Gradient compression for communication-limited convex optimization,” in 2018 IEEE Conference on Decision and Control (CDC) (Miami, FL: IEEE), 166–171. doi: 10.1109/CDC.2018.8619625

Khirirat, S., Magnússon, S., and Johansson, M. (2022). “Eco-fedsplit: federated learning with error-compensated compression,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (Singapore: IEEE), 5952–5956. doi: 10.1109/ICASSP43922.2022.9747809

Konecný, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. (2016). Federated learning: strategies for improving communication efficiency. arXiv [preprint]. arXiv:1610.05492. doi: 10.48550/arXiv.1610.05492

Li, T., Sahu, A. K., Zaheer, M., Sanjabi, M., Talwalkar, A., Smith, V., et al. (2020). Federated optimization in heterogeneous networks. Proc. Mach. Learn. Syst. 2, 429–450. doi: 10.48550/arXiv.1812.06127

Li, X., and Li, P. (2023). “Analysis of error feedback in federated non-convex optimization with biased compression: fast convergence and partial participation,” in International Conference on Machine Learning (Honolulu, HI: PMLR), 19638–19688.

Liu, J., Xu, L., Shen, S., and Ling, Q. (2019). An accelerated variance reducing stochastic method with Douglas-Rachford splitting. Mach. Learn. 108, 859–878. doi: 10.1007/s10994-019-05785-3

Liu, Y., Zhou, Y., and Lin, R. (2024). The proximal operator of the piece-wise exponential function. IEEE Signal Process. Lett. 31, 894–898. doi: 10.1109/LSP.2024.3370493

Long, Z., Chen, Y., Dou, H., Zhang, Y., and Chen, Y. (2024). Fedsq: sparse-quantized federated learning for communication efficiency. IEEE Trans. Consum. Electron. 70, 4050–4061. doi: 10.1109/TCE.2024.3352432

Malekmohammadi, S., Shaloudegi, K., Hu, Z., and Yu, Y. (2021). An operator splitting view of federated learning. arXiv [preprint]. arXiv:2108.05974. doi: 10.48550/arXiv.2108.05974

McMahan, B., Moore, E., Ramage, D., Hampson, S., and Arcas, B. A. (2017). “Communication-efficient learning of deep networks from decentralized data,” in Artificial Intelligence and Statistics (Fort Lauderdale, FL: PMLR), 1273–1282.

Mishchenko, K., Khaled, A., and Richtárik, P. (2022). “Proximal and federated random reshuffling,” in International Conference on Machine Learning (Baltimore, MA: PMLR), 15718–15749.

Parikh, N., and Boyd, S. (2014). Proximal algorithms. Found. Trends Optim. 1, 127–239. doi: 10.1561/2400000003

Pathak, R., and Wainwright, M. J. (2020). Fedsplit: an algorithmic framework for fast federated optimization. Adv. Neural Inf. Process. Syst. 33, 7057–7066. doi: 10.48550/arXiv.2005.05238

Reisizadeh, A., Mokhtari, A., Hassani, H., Jadbabaie, A., and Pedarsani, R. (2020). “FEDPAQ: a communication-efficient federated learning method with periodic averaging and quantization,” in International Conference on Artificial Intelligence and Statistics (PMLR), 2021–2031.

Richtárik, P., Sokolov, I., and Fatkhullin, I. (2021). Ef21: a new, simpler, theoretically better, and practically faster error feedback. Adv. Neural Inf. Process. Syst. 34, 4384–4396. doi: 10.48550/arXiv.2106.05203

Sahu, A., Dutta, A., Abdelmoniem, M., Banerjee, A., Canini, T., Kalnis, M., et al. (2021). Rethinking gradient sparsification as total error minimization. Adv. Neural Inf. Process. Syst. 34, 8133–8146. doi: 10.48550/arXiv.2108.00951

Saifullah, S., Mercier, D., Lucieri, A., Dengel, A., and Ahmed, S. (2024). The privacy-explainability trade-off: unraveling the impacts of differential privacy and federated learning on attribution methods. Front. Artif. Intell. 7:1236947. doi: 10.3389/frai.2024.1236947

Seide, F., Fu, H., Droppo, J., Li, G., and Yu, D. (2014). “1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs,” in Interspeech, Vol. 2014 (Singapore), 1058–1062. doi: 10.21437/Interspeech.2014-274

Sun, W., Wang, A., Gao, Z., and Zhou, Y. (2024). “A communication-concerned federated learning framework based on clustering selection,” in International Conference on Advanced Data Mining and Applications (Cham: Springer), 285–300. doi: 10.1007/978-981-96-0814-0_19

Talwar, B., Arora, A., and Bharany, S. (2021). “An energy efficient agent aware proactive fault tolerance for preventing deterioration of virtual machines within cloud environment,” in 2021 9th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO) (Noida: IEEE), 1–7. doi: 10.1109/ICRITO51393.2021.9596453

Tang, Z., Wang, Y., and Chang, T.-H. (2024). z-signfedavg: a unified stochastic sign-based compression for federated learning. Proc. AAAI Conf. Artif. Intell. 38, 15301–15309. doi: 10.1609/aaai.v38i14.29454

Tran-Dinh, Q., Pham, N. H., Phan, D. T., and Nguyen, L. M. (2021). Feddr-randomized Douglas-Rachford splitting algorithms for nonconvex federated composite optimization. Adv. Neural Inf. Process. Syst. 34, 30326–30338. doi: 10.48550/arXiv.2103.0345

Valdeira, P., Xavier, J., Soares, C., and Chi, Y. (2025). Communication-efficient vertical federated learning via compressed error feedback. IEEE Trans. Signal Process. 73, 1065–1080. doi: 10.1109/TSP.2025.3540655

Wang, H., Marella, S., and Anderson, J. (2022). “FEDADMM: a federated primal-dual algorithm allowing partial participation,” in 2022 IEEE 61st Conference on Decision and Control (CDC) (Cancún: IEEE), 287–294. doi: 10.1109/CDC51059.2022.9992745

Zhou, X., Chang, L., and Cao, J. (2023). Communication-efficient nonconvex federated learning with error feedback for uplink and downlink. IEEE Trans. Neural Netw. Learn. Syst. 36, 1003–1014. doi: 10.1109/TNNLS.2023.3333804

Keywords: communication efficiency, composite optimization, data heterogeneity, error feedback, federated learning, operator splitting

Citation: Xue J and Wang C (2026) EF-Feddr: communication-efficient federated learning with Douglas–Rachford splitting and error feedback. Front. Artif. Intell. 9:1699896. doi: 10.3389/frai.2026.1699896

Received: 05 September 2025; Accepted: 05 January 2026;
Published: 28 January 2026.

Edited by:

Haifeng Chen, NEC Laboratories America Inc, United States

Reviewed by:

Mengmeng Ren, Xidian University, China
Salil Bharany, Chitkara University, India

Copyright © 2026 Xue and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Chundong Wang, michael3769@163.com
