- 1 College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, China
- 2 Jiangsu Key Laboratory of Wireless Communications and IoT, Nanjing University of Posts and Telecommunications, Nanjing, China
- 3 School of Information Science and Technology, Dalian Maritime University, Dalian, China
Introduction: Federated learning (FL) enables model training on edge devices using local data while aggregating model updates at a central server without exchanging raw data, thereby preserving privacy. However, achieving satisfactory convergence accuracy with low communication energy remains challenging. This work investigates a three-tier clustered FL (CFL) architecture to improve global training performance and communication efficiency through joint device clustering and resource scheduling.
Methods: We analyze how clustering strategies influence learning convergence and communication energy consumption. Based on this analysis, we propose a clustering method that jointly accounts for gradient cosine similarity and communication distance. A simplified procedure is further developed for device association and cluster-head selection, with the goals of improving intra-cluster data balance and reducing the overall communication distance to the server.
Results: Simulations demonstrate that the proposed method consistently improves model accuracy while reducing communication energy consumption compared with random clustering and similarity-based clustering baselines.
Discussion: These results indicate that jointly considering update similarity and communication distance in CFL can effectively balance learning quality and communication cost, offering a practical approach for energy-efficient federated training in edge networks.
1 Introduction
The explosive growth of mobile edge devices and the data they store locally has challenged traditional centralized machine learning paradigms. Centralized approaches require frequent data transmission to a central server, which results in significant privacy risks and high communication overhead (Xia et al., 2020). Federated learning (FL) has emerged as a promising distributed learning paradigm that trains a shared model by aggregating locally trained models at the central server without transmitting raw data, thereby preserving data privacy (Konečný et al., 2015). However, FL still relies on frequent communication between devices and the server, which becomes inefficient in wireless environments with heterogeneous data distributions (Lu et al., 2024).
During the evolution of FL, extensive efforts have been devoted to mitigating communication overhead from various perspectives. Hierarchical FL reduces communication frequency through staged aggregation, thereby lowering communication costs (Briggs et al., 2020; Tran et al., 2025); however, its rigid hierarchical design often limits adaptability in dynamic environments. Prototype-based clustered FL is a representative branch of personalized FL, which aligns client features through global or local prototypes, thereby effectively reducing communication costs by minimizing the need for frequent model exchanges (Yang et al., 2024; Tan et al., 2022). However, such methods often compromise global consistency, thereby limiting scalability when tasks are only partially shared. Moreover, resource-aware FL introduces client selection under bandwidth and computation constraints to improve communication efficiency but typically decouples resource optimization from model aggregation, neglecting their intrinsic interplay (Nishio and Yonetani, 2019). Against this backdrop, clustered FL (CFL) has become a more direct and effective solution. CFL reduces communication costs and improves convergence stability by performing intra- and inter-cluster aggregation after grouping clients based on their statistical or geographical characteristics (Ghosh et al., 2021; Zeng et al., 2023). For example, Yan et al. (2024) proposed adaptive clustering strategies that enable flexible client participation and dynamic cluster formation, thereby reducing redundant communication under non-independent and identically distributed (Non-IID) conditions. Similarly, Gao et al. (2023) clustered clients based on the similarity of their local data distributions and used acceleration algorithms to shorten training time and lower communication overhead. In addition, Zhang et al. (2024) developed an adaptive CFL framework that adjusts cluster size and communication intervals through online similarity measurement, thereby improving both robustness and communication efficiency. Despite these advances, existing CFL approaches still suffer from several limitations: 1) existing studies rarely explore how the clustering strategy influences both data richness and convergence dynamics. Over-reliance on data similarity for clustering may reduce intra-cluster diversity, thereby weakening model generalization. 2) Prior CFL methods primarily rely on gradient similarity or geographic proximity for clustering but often ignore joint optimization of learning performance and resource efficiency.
To bridge this research gap, we propose a data- and distance-aware clustering scheme. The proposed scheme exploits data distribution characteristics and geographical information to optimize cluster head selection and device association scheduling prior to the training process. Based on this clustering result, the CFL training procedure is subsequently carried out. The main contributions of this study are summarized as follows.
• We propose a CFL framework that collectively considers learning performance and communication cost. On the learning side, a convergence analysis is conducted to theoretically demonstrate that enhancing the diversity and representativeness of intra-cluster data effectively improves the convergence behavior of CFL under data heterogeneity. On the communication side, the communication cost is modeled in terms of the transmission distance. Based on this model, a combined optimization problem for cluster head selection and device association scheduling is formulated, which simultaneously accounts for learning performance and communication cost.
• Based on the formulated joint optimization objective, we develop an iterative algorithm to efficiently solve the cluster head selection and device association scheduling problem. The proposed algorithm decomposes the original NP-hard problem into two tractable subproblems, which are solved in an alternating optimization manner.
• Simulation results demonstrate that the proposed method consistently outperforms the three baseline algorithms in terms of model accuracy and communication energy efficiency, thereby validating the effectiveness of the proposed framework.
2 System model
We consider a wireless CFL system, as illustrated in Figure 1, which consists of a central server and a set of devices, each of which holds a local dataset. A binary association variable indicates whether a device is assigned to a given cluster head. The global loss function, defined as a weighted combination of the devices' local loss functions, is given in Equation 1.
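For concreteness, one commonly used form of such a global objective is sketched below in LaTeX; the notation (number of devices N, local datasets D_n, local loss F_n, per-sample loss ℓ) is illustrative and may differ from the symbols used in Equation 1 of this article.

```latex
% A standard federated learning objective of the kind referenced in Equation 1
% (illustrative notation, not necessarily the exact form used in the article).
\begin{equation*}
  F(\mathbf{w}) = \sum_{n=1}^{N} \frac{|\mathcal{D}_n|}{\sum_{m=1}^{N}|\mathcal{D}_m|}\, F_n(\mathbf{w}),
  \qquad
  F_n(\mathbf{w}) = \frac{1}{|\mathcal{D}_n|} \sum_{\xi \in \mathcal{D}_n} \ell(\mathbf{w};\, \xi).
\end{equation*}
```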
2.1 CFL process
The architecture of CFL consists of three layers: intra-cluster devices, cluster heads, and a central server. A synchronous aggregation scheme is used. The overall process comprises the following steps, where Step 1 is executed during the initialization phase and the remaining steps are iteratively performed throughout the training process.
1. Device clustering: during initialization, the devices are partitioned into clusters and a cluster head is designated for each cluster according to the proposed clustering strategy.
2. Local training: in each round, every device updates its local model on its own dataset according to the local update rule in Equation 2.
3. Intra-cluster aggregation: each cluster head periodically collects and aggregates the local models of its associated devices, as described in Equation 3. After intra-cluster model aggregation is completed, each cluster head broadcasts the aggregated model to its associated devices for continued local training.
4. Global aggregation: the central server periodically collects and aggregates the cluster-head models to update the global model, as described in Equation 4. Subsequently, the central server broadcasts the latest global model to all devices, which is then used for the next round of local training.
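To make the three-tier procedure concrete, the following minimal Python sketch implements one synchronous training round with sample-size-weighted averaging inside and across clusters. The function names, the least-squares local update, and the aggregation weights are illustrative stand-ins for the rules in Equations 2-4, not the exact formulation used in this article.

```python
import numpy as np

def local_update(w, data, lr=0.01, steps=5):
    """Stand-in for the local training rule (cf. Equation 2): a few gradient
    steps on a least-squares loss; any local optimizer could be substituted."""
    X, y = data
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def cfl_round(w_global, clusters):
    """One synchronous CFL round.
    clusters: list of clusters, each a list of (X, y) local datasets."""
    cluster_models, cluster_sizes = [], []
    for members in clusters:
        # Step 2: each device trains locally starting from the current model.
        local_models = [local_update(w_global.copy(), data) for data in members]
        sizes = [len(data[1]) for data in members]
        # Step 3: the cluster head aggregates its members (sample-size weighted).
        cluster_models.append(np.average(local_models, axis=0, weights=sizes))
        cluster_sizes.append(sum(sizes))
    # Step 4: the server aggregates the cluster-head models into the global
    # model and broadcasts it back to all devices for the next round.
    return np.average(cluster_models, axis=0, weights=cluster_sizes)

# Usage: two clusters of synthetic devices, ten global rounds.
rng = np.random.default_rng(0)
make_dev = lambda: (rng.normal(size=(20, 5)), rng.normal(size=20))
clusters = [[make_dev() for _ in range(3)], [make_dev() for _ in range(4)]]
w = np.zeros(5)
for _ in range(10):
    w = cfl_round(w, clusters)
```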
2.2 Problem formulation
The pairwise cosine similarity between the gradients of devices is computed, as shown in Equation 5:
Assuming that the spectrum is divided into orthogonal subchannels so that devices transmit without mutual interference, the communication energy consumed by each upload is modeled as an increasing function of the transmission distance, as shown in Equation 6.
The objective is to determine the optimal clustering strategy that minimizes the weighted sum of the final global training loss and the overall communication cost. The optimization problem is formulated in Equation 7, where a weighting factor balances the training loss term against the communication cost term, and the cluster head set together with the binary device-association variables constitute the optimization variables.
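The two ingredients of this formulation can be illustrated with the short Python sketch below, which computes the pairwise gradient cosine similarity in the spirit of Equation 5 and a distance-dependent upload-energy term of the kind used for the communication cost. The energy coefficients, the path-loss exponent, and the weighted-sum combination are assumptions standing in for Equations 6 and 7 rather than the exact models of the article.

```python
import numpy as np

def cosine_similarity_matrix(grads):
    """Pairwise cosine similarity between device gradients (cf. Equation 5).
    grads: array of shape (num_devices, model_dim)."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    unit = grads / np.clip(norms, 1e-12, None)
    return unit @ unit.T

def upload_energy(distance_m, bits=1e6, tx_coeff=1e-12, path_loss_exp=2.0):
    """Illustrative distance-dependent upload energy: transmit energy grows
    with distance**path_loss_exp for a fixed payload. The coefficients are
    assumptions, not values taken from the article."""
    return tx_coeff * bits * distance_m ** path_loss_exp

def clustering_objective(loss, distances, weight):
    """Weighted sum of training loss and total communication cost, in the
    spirit of Equation 7 (the exact formulation may differ)."""
    comm_cost = sum(upload_energy(d) for d in distances)
    return loss + weight * comm_cost

# Example: four devices with random gradients and upload distances in meters.
rng = np.random.default_rng(1)
S = cosine_similarity_matrix(rng.normal(size=(4, 100)))
print(np.round(S, 2))
print(clustering_objective(loss=0.8, distances=[30, 55, 80, 1000], weight=1e-3))
```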
3 Convergence analysis
To evaluate how clustering strategies influence learning performance, we conduct a convergence analysis of CFL. To obtain the expected convergence rate, we first make the following assumptions (Wang et al., 2020; Wan et al., 2021).
• Assumption 1: Assume that the global loss function is differentiable, its gradient is uniformly Lipschitz continuous, and there exists a positive constant (the Lipschitz constant) such that the smoothness condition in Equation 8 holds.
This condition is equivalently expressed in Equation 9.
• Assumption 2: Global divergence is bounded, as shown in Equation 10:
• Assumption 3: Local divergence is bounded, as presented in Equation 11:
• Assumption 4: Local variance is bounded, as shown in Equation 12:
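For reference, commonly used statements of these four assumptions, as they appear in the hierarchical and wireless FL literature cited above, are sketched below in LaTeX. The symbols (F for the global loss, F_k for the loss of cluster k, F_n for the local loss of device n in cluster C_k, g_n for a stochastic gradient, and the constants L, epsilon, delta, sigma) are illustrative and may differ from those used in Equations 8-12.

```latex
% Illustrative forms of Assumptions 1-4 (notation assumed, not taken verbatim
% from Equations 8-12 of the article).
\begin{align*}
  &\text{A1 (smoothness):}
    && \|\nabla F(\mathbf{w}) - \nabla F(\mathbf{w}')\| \le L\,\|\mathbf{w} - \mathbf{w}'\|,\\
  & && \;\Rightarrow\; F(\mathbf{w}) \le F(\mathbf{w}') + \nabla F(\mathbf{w}')^{\top}(\mathbf{w}-\mathbf{w}')
       + \tfrac{L}{2}\,\|\mathbf{w}-\mathbf{w}'\|^{2},\\
  &\text{A2 (bounded global divergence):}
    && \|\nabla F_k(\mathbf{w}) - \nabla F(\mathbf{w})\| \le \epsilon,\\
  &\text{A3 (bounded local divergence):}
    && \|\nabla F_n(\mathbf{w}) - \nabla F_k(\mathbf{w})\| \le \delta, \quad n \in \mathcal{C}_k,\\
  &\text{A4 (bounded local variance):}
    && \mathbb{E}\big[\|g_n(\mathbf{w}) - \nabla F_n(\mathbf{w})\|^{2}\big] \le \sigma^{2}.
\end{align*}
```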
Based on the aforementioned assumptions and the description of the global model, we present the following convergence results. Given the optimal global model that minimizes the global loss function, the gap between the expected global loss after a finite number of training rounds and the optimal loss is upper-bounded as shown in Equation 13.
From Equation 13, we can analyze how each key parameter affects the convergence of the proposed CFL algorithm. The learning rate, the intra- and inter-cluster aggregation periods, and the degree of data heterogeneity all enter the bound and therefore jointly determine the convergence behavior.
Notably, by combining the convergence bound in Equation 13 with Assumptions 2 and 3, it can be observed that client drift, defined as the deviation of local gradient updates from the global gradient direction, increases the convergence upper bound and thereby degrades learning performance. In particular, Assumptions 2 and 3 characterize such drift through the bounded global and local divergence measures; clustering strategies that make the aggregate data of each cluster more representative of the global distribution reduce these divergences and therefore tighten the bound.
4 Algorithm design
Based on the above convergence analysis, the cluster-level data representativeness metric is defined in Equation 15.
By substituting the original global loss function with this representativeness metric, we obtain the reformulated clustering problem in Equation 14, which is NP-hard and is therefore addressed by decomposing it into device-association and cluster-head selection subproblems based on cosine similarity and communication distance.
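Since the exact form of Equation 15 is not reproduced here, the following sketch shows one plausible proxy for cluster-level data representativeness: one minus half the total-variation distance between a cluster's aggregate label distribution and the global label distribution. The function name and the choice of distance are assumptions for illustration only, not the metric defined in the article.

```python
import numpy as np

def representativeness(cluster_label_counts, global_label_counts):
    """Illustrative proxy for cluster-level data representativeness: one minus
    half the L1 distance between the cluster's aggregate label distribution
    and the global label distribution. Returns 1.0 for a perfectly
    representative cluster and approaches 0.0 for a highly skewed one."""
    p = np.asarray(cluster_label_counts, dtype=float)
    q = np.asarray(global_label_counts, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return 1.0 - 0.5 * np.abs(p - q).sum()

# Example with 10 classes: a balanced cluster versus a two-class cluster.
global_counts = np.full(10, 1000)
print(representativeness(np.full(10, 50), global_counts))                     # 1.0
print(representativeness([300, 300, 0, 0, 0, 0, 0, 0, 0, 0], global_counts))  # 0.2
```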
Given an initial cluster head set, fixing the cluster heads in the reformulated problem yields the device association subproblem,
which is an integer nonlinear programming problem. We adopt the Gurobi solver for optimal device association under a fixed cluster head set. The overall clustering utility is defined in Equation 18, which jointly accounts for the cluster-level data representativeness and the overall communication distance of the resulting clusters. To minimize this utility, the cluster head set is iteratively refined through three greedy operations:
• Cluster head addition: for any device that is not currently a cluster head, it is tentatively added to the cluster head set; the addition is retained only if the clustering utility decreases.
• Cluster head exchange: for any pair consisting of a current cluster head and a non-head device, the two roles are tentatively swapped; the exchange is retained only if the clustering utility decreases.
• Cluster head removal: for any current cluster head, it is tentatively removed and its associated devices are reassigned to the remaining heads; the removal is retained only if the clustering utility decreases.
Therefore, in the greedy iterative strategy for cluster head selection, the number of cluster heads varies dynamically until convergence. We summarize the alternating optimization process of device association and cluster head selection in Algorithm 1. The overall complexity is mainly determined by two components: (i) solving the device association problem using the Gurobi solver for a fixed cluster head set and (ii) the greedy iterative updates for cluster head selection, including addition, exchange, and removal operations. In particular, the per-iteration cost of the greedy updates grows with the number of devices and the current number of cluster heads, since every candidate addition, exchange, and removal requires re-evaluating the clustering utility.
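A minimal Python sketch of this alternating procedure is given below. The clustering utility is a placeholder that combines intra-cluster gradient dissimilarity with the device-to-head and head-to-server distances, and the device-association step uses a nearest-head assignment as a simple stand-in for the exact Gurobi solution used in Algorithm 1; all names, coefficients, and the specific utility terms are illustrative assumptions.

```python
import numpy as np
from itertools import product

def clustering_utility(positions, server_pos, assign, heads, grads, weight=1e-3):
    """Placeholder utility (cf. Equation 18): average gradient dissimilarity
    within clusters plus a weighted total communication distance
    (device-to-head and head-to-server links)."""
    dissim, dist = 0.0, 0.0
    for h in heads:
        members = [i for i, a in enumerate(assign) if a == h]
        g_mean = grads[members].mean(axis=0)
        for i in members:
            cos = grads[i] @ g_mean / (np.linalg.norm(grads[i]) * np.linalg.norm(g_mean) + 1e-12)
            dissim += 1.0 - cos
            dist += np.linalg.norm(positions[i] - positions[h])
        dist += np.linalg.norm(positions[h] - server_pos)
    return dissim / len(assign) + weight * dist

def associate(positions, heads):
    """Stand-in for the device-association subproblem (solved exactly with the
    Gurobi solver in the article): assign each device to its nearest head."""
    return [min(heads, key=lambda h: np.linalg.norm(positions[i] - positions[h]))
            for i in range(len(positions))]

def greedy_cluster_heads(positions, server_pos, grads, init_heads, max_iter=20):
    """Alternating optimization in the spirit of Algorithm 1: devices are
    re-associated after every accepted head update, and addition, exchange,
    and removal moves are kept only when they lower the utility."""
    heads = list(init_heads)
    assign = associate(positions, heads)
    best = clustering_utility(positions, server_pos, assign, heads, grads)
    for _ in range(max_iter):
        improved = False
        non_heads = [i for i in range(len(positions)) if i not in heads]
        candidates = [heads + [d] for d in non_heads]                              # addition
        candidates += [[d if x == h else x for x in heads]
                       for h, d in product(heads, non_heads)]                      # exchange
        candidates += [[x for x in heads if x != h]
                       for h in heads if len(heads) > 1]                           # removal
        for cand in candidates:
            a = associate(positions, cand)
            u = clustering_utility(positions, server_pos, a, cand, grads)
            if u < best - 1e-9:
                heads, assign, best, improved = cand, a, u, True
        if not improved:
            break
    return heads, assign, best

# Usage: 30 devices scattered over a small area, server placed 1 km away,
# three arbitrary devices as the initial cluster head set.
rng = np.random.default_rng(2)
pos = rng.uniform(-100.0, 100.0, size=(30, 2))
g = rng.normal(size=(30, 50))
heads, assign, util = greedy_cluster_heads(pos, np.array([1000.0, 0.0]), g, [0, 1, 2])
```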
5 Numerical results
We simulate a wireless FL system consisting of a central server and 100 devices uniformly distributed within a circular area of 100 m radius. To model long-range communication, the server is positioned 1 km away from the device cluster center. The total number of global training rounds is fixed at 200, and the intra- and inter-cluster model aggregation periods are kept fixed throughout training.
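The device placement described above can be reproduced with the short sketch below; the random seed and coordinate frame are arbitrary choices, and the channel and energy parameters of the simulation are not reproduced here.

```python
import numpy as np

# Topology described above: 100 devices drawn uniformly inside a disc of
# radius 100 m, with the server placed 1 km from the disc centre.
rng = np.random.default_rng(42)
num_devices, radius_m = 100, 100.0
r = radius_m * np.sqrt(rng.uniform(size=num_devices))   # sqrt gives uniform area density
theta = rng.uniform(0.0, 2.0 * np.pi, size=num_devices)
device_xy = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
server_xy = np.array([1000.0, 0.0])                      # 1 km from the cluster centre
device_to_server = np.linalg.norm(device_xy - server_xy, axis=1)
print(device_to_server.min(), device_to_server.max())    # roughly 900-1100 m
```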
To evaluate the proposed method, three baseline algorithms are considered for comparison: (i) random clustering, where devices are grouped based on geographical proximity and cluster heads are selected randomly within each cluster; (ii) similarity-based clustering, which groups devices with similar local data distributions using statistical distance metrics, with heads randomly assigned; and (iii) FedAvg, the conventional FL scheme without clustering, where all devices communicate directly with the central server in each round.
To verify the effectiveness of the proposed clustering algorithm in enhancing learning performance in FL, we conduct test accuracy comparison experiments on the Fashion-MNIST and CIFAR-10 datasets. The experiments evaluate the impact of the different device clustering algorithms, along with the classical FedAvg, on the model's training accuracy. Figures 2, 3 illustrate the evolution of test accuracy during training, and the corresponding test accuracies at convergence are summarized in Table 1. As shown, the proposed algorithm consistently achieves the highest test accuracy for a given number of training rounds and maintains a significant advantage throughout the training process. The similarity-based clustering algorithm ranks second, suggesting that adjusting the data within clusters toward a more balanced distribution (i.e., aligning the data distribution of each cluster with the global distribution) leads to better convergence than clustering purely based on intra-cluster data similarity, which is in line with expectations. The random clustering algorithm ranks third because it accounts for neither the data distribution nor the similarity within clusters, leading to less balanced clusters and consequently slower convergence. The FedAvg algorithm exhibits the worst performance, primarily due to the heterogeneity of local data distributions across devices under the Non-IID setting.
To verify the effectiveness of the proposed algorithm in reducing global communication energy consumption, energy simulations are conducted on the Fashion-MNIST dataset under different numbers of devices, as shown in Figure 4. FedAvg incurs the highest energy consumption due to frequent long-distance communication with the server. In contrast, the other three algorithms use intra-cluster aggregation, which shortens communication distances and reduces upload frequency, thereby lowering energy usage. The proposed method performs best by jointly optimizing inter-device distances and balancing intra-cluster data, further reducing global communication energy.
As shown in Figures 2, 4, the proposed algorithm consistently achieves high training accuracy with low communication energy consumption, highlighting its advantage in maintaining model performance while reducing communication cost. In comparison, similarity-based and random clustering exhibit slower convergence and higher energy consumption. The results demonstrate that balancing intra-cluster data enhances cluster representativeness, mitigates aggregation conflicts, and, when combined with geographical proximity, contributes to reducing overall energy consumption.
Table 2 presents the learning performance and communication energy under different values of the weighting factor, which controls the relative emphasis placed on the training loss and the communication cost in the optimization objective.
6 Conclusion
This study investigates the trade-off between learning performance and communication energy consumption in CFL, focusing on how cluster head selection and device association affect model training and energy overhead. To jointly optimize model performance and communication efficiency in wireless FL, we first conduct a convergence analysis linking the global loss to inter-cluster data imbalance and use cosine similarity to quantify distributional dissimilarity. An optimization model of the training loss is then constructed based on gradient cosine similarity, while communication energy is modeled as a function of transmission distance. Finally, a clustering algorithm that jointly exploits cosine similarity and communication distance is proposed to solve the reformulated joint optimization problem. The simulation results show that the proposed method markedly reduces communication energy while improving model accuracy.
In future work, we aim to introduce data-size-aware weighting mechanisms to further optimize client selection and matching, along with adaptive channel allocation strategies to extend the applicability of the method to heterogeneous devices and non-uniform wireless environments. These directions are expected to improve both the scalability and robustness of the CFL framework, providing a more comprehensive solution for real-world federated learning scenarios.
Data availability statement
The datasets generated during this study are not publicly available due to the following restrictions: 1. The simulation data and model parameters are integral to the proprietary research methodology developed in this work. 2. The training data consist of standard benchmark datasets that are already publicly available through their original sources. 3. Raw gradient information and device-specific data cannot be shared as they may contain sensitive information about the federated learning system architecture. Requests to access the datasets should be directed to Zhenning Chen, link_chen@yeah.net.
Author contributions
ZC: Writing – original draft. ZX: Writing – review and editing. YD: Data curation, Investigation, Validation, Writing – review and editing. YW: Methodology, Supervision, Validation, Writing – review and editing.
Funding
The author(s) declared that financial support was not received for this work and/or its publication.
Conflict of interest
The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declared that generative AI was not used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence, and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Briggs, C., Fan, Z., and Andras, P. (2020). “Federated learning with hierarchical clustering of local updates to improve training on non-iid data,” in 2020 international joint conference on neural networks (IJCNN), 1–9. doi:10.1109/IJCNN48605.2020.9207469
Chen, M., Yang, Z., Saad, W., Yin, C., Poor, H. V., and Cui, S. (2020). A joint learning and communications framework for federated learning over wireless networks. IEEE Trans. Wirel. Commun. 20, 269–283. doi:10.1109/twc.2020.3024629
Gao, Z., Xiong, Z., Zhao, C., and Feng, F. (2023). “Clustered federated learning framework with acceleration based on data similarity,” in International conference on algorithms and architectures for parallel processing, 80–92.
Ghosh, A., Chung, J., Yin, D., and Ramchandran, K. (2021). An efficient framework for clustered federated learning. arXiv Preprint arXiv:2006.04088. doi:10.48550/arXiv.2006.04088
Konečný, J., McMahan, B., and Ramage, D. (2015). Federated optimization: distributed optimization beyond the datacenter. arXiv Preprint arXiv:1511.03575. doi:10.48550/arXiv.1511.03575
Lu, Z., Pan, H., Dai, Y., Si, X., and Zhang, Y. (2024). Federated learning with non-iid data: a survey. IEEE Internet Things J. 11, 19188–19209. doi:10.1109/JIOT.2024.3376548
Meng, X., Li, Y., Lu, J., and Ren, X. (2023). An optimization method for non-iid federated learning based on deep reinforcement learning. Sensors 23, 9226. doi:10.3390/s23229226
Nishio, T., and Yonetani, R. (2019). “Client selection for federated learning with heterogeneous resources in Mobile edge,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC), 1–7. doi:10.1109/ICC.2019.8761315
Taïk, A., Mlika, Z., and Cherkaoui, S. (2022). Clustered vehicular federated learning: process and optimization. IEEE Trans. Intelligent Transp. Syst. 23, 25371–25383. doi:10.1109/TITS.2022.3149860
Tan, Y., Long, G., Liu, L., Zhou, T., Lu, Q., Jiang, J., et al. (2022). Fedproto: Federated prototype learning across heterogeneous clients. Proc. AAAI Conference Artificial Intelligence 36, 8432–8440. doi:10.1609/aaai.v36i8.20819
Tran, D. T., Ha, N. B., Nguyen, V.-D., and Wong, K.-S. (2025). Sherl-fl: when representation learning meets split learning in hierarchical federated learning. arXiv Preprint arXiv:2508.08339. doi:10.48550/arXiv.2508.08339
Wan, S., Lu, J., Fan, P., Shao, Y., Peng, C., and Letaief, K. B. (2021). Convergence analysis and system design for federated learning over wireless networks. IEEE J. Sel. Areas Commun. 39, 3622–3639. doi:10.1109/jsac.2021.3118351
Wang, J., Wang, S., Chen, R.-R., and Ji, M. (2020). Local averaging helps: hierarchical federated learning and convergence analysis. arXiv Preprint arXiv:2010.12998. doi:10.48550/arXiv.2010.12998
Xia, W., Quek, T. Q. S., Guo, K., Wen, W., Yang, H. H., and Zhu, H. (2020). Multi-armed bandit-based client scheduling for federated learning. IEEE Trans. Wirel. Commun. 19, 7108–7123. doi:10.1109/TWC.2020.3008091
Yan, Y., Tong, X., and Wang, S. (2024). Clustered federated learning in heterogeneous environment. IEEE Trans. Neural Netw. Learn. Syst. 35, 12796–12809. doi:10.1109/TNNLS.2023.3264740
Yang, M., Xu, J., Ding, W., and Liu, Y. (2024). Fedhap: federated hashing with global prototypes for cross-silo retrieval. IEEE Trans. Parallel Distributed Syst. 35, 592–603. doi:10.1109/TPDS.2023.3324426
Zeng, D., Hu, X., Liu, S., Yu, Y., Wang, Q., and Xu, Z. (2023). Stochastic clustered federated learning. arXiv Preprint arXiv:2303.00897. doi:10.48550/arXiv.2303.00897
Keywords: device clustering, energy efficiency, federated learning, gradient similarity, wireless communications
Citation: Chen Z, Xu Z, Ding Y and Wang Y (2026) Data- and distance-aware clustering for scalable wireless federated learning. Front. Commun. Netw. 7:1748815. doi: 10.3389/frcmn.2026.1748815
Received: 18 November 2025; Accepted: 12 January 2026;
Published: 09 February 2026.
Edited by:
Osama Amin, King Abdullah University of Science and Technology, Saudi Arabia
Reviewed by:
Ahmad Bazzi, New York University Abu Dhabi, United Arab Emirates
Selvaraj Kandasamy, PSNA College of Engineering and Technology, India
Copyright © 2026 Chen, Xu, Ding and Wang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Zhenning Chen, link_chen@yeah.net