Entropy-Driven Stochastic Federated Learning in Non-IID 6G Edge-RAN

Scalable and sustainable AI-driven analytics are necessary to enable large-scale and heterogeneous service deployment in sixth-generation (6G) ultra-dense networks. This implies that the exchange of raw monitoring data across the network should be minimized by bringing the analysis functions closer to the data collection points. While federated learning (FL) is an efficient tool to implement such a decentralized strategy, real networks are generally characterized by time- and space-varying traffic patterns and channel conditions. As a result, the data collected at different points are non-independent and identically distributed (non-IID), which is challenging for FL. To sidestep this issue, we first introduce a new a priori metric that we call dataset entropy, whose role is to capture the distribution, the quantity of information, the unbalanced structure and the “non-IIDness” of a dataset independently of the models. This a priori entropy is calculated using a multi-dimensional spectral clustering scheme over both the features and the supervised output spaces, and is suitable for classification as well as regression tasks. The FL aggregation operations support system (OSS) server then uses the reported dataset entropies to devise 1) an entropy-based federated averaging scheme, and 2) a stochastic participant selection policy to significantly stabilize the training, minimize the convergence time, and reduce the corresponding computation cost. Numerical results are provided to show the superiority of these novel approaches.


INTRODUCTION
6G wireless networks announce the era of massive heterogeneous digital services that extend the vertical use cases to the final consumer, which is challenging from a network management point of view. Indeed, in this new context, classical centralized monitoring, analysis, and control would become impractical, as they usually represent a single point of failure and suffer from large overhead. Alternatively, decentralized service processing would bring scalability and low raw data exchange, and therefore more system sustainability. In this regard, distributed artificial intelligence (AI) approaches, and in particular FL schemes, can play a pivotal role in leveraging the potential of scattered monitoring data across the network as well as the computing power of the edge cloud, while reducing the computational costs and enabling fast local analysis and decision. Nevertheless, FL performance is often limited by the convergence delay due to several conceptual and operational issues that are reviewed in the sequel.

Related Work
In (Brendan McMahan et al., 2017), the authors have proposed the federated averaging (FedAvg) algorithm, which synchronously aggregates the parameters and is thus susceptible to the so-called straggler effect, i.e., each training round only progresses as fast as the slowest edge device, since the FL server waits for all devices to complete local training before the global aggregation can be performed. Alternatively, the asynchronous model in (Sprague et al., 2018) has been introduced to improve the scalability and efficiency of FL. In asynchronous FL, the server updates the global model whenever it receives a local update, which grants more robustness when participants join halfway through a training round, as well as when the federation involves participating devices with heterogeneous processing capabilities. However, the model convergence is found to be significantly delayed when data is non-independent and identically distributed (non-IID) and unbalanced (Zhao et al., 2018). To solve this issue, it has been proposed to distribute a public dataset to the FL clients at the beginning. However, such a dataset may not always exist, or the participants may refuse to download it for security reasons. Therefore, an alternative solution was to construct an approximately IID dataset using inputs from a limited number of privacy-insensitive participants (Yoshida et al., 2019). In this Hybrid-FL protocol, the server asks random participants if they allow their data to be uploaded. During the participant selection phase, apart from selecting participants based on computing capabilities, participants are selected such that their uploaded data can form an approximately IID dataset in the server, i.e., the amounts of collected data in each class have close values. Thereafter, the server trains a model on the collected IID dataset, and merges this model with the global model trained by the participants.
Nevertheless, requests for data sharing are not in line with the original intent of FL. As an improvement, the authors in (Xie et al., 2019) have proposed the FedAsync algorithm, in which newly received local updates are adaptively weighted according to staleness, defined as the difference between the current epoch and the iteration to which the received update belongs. For example, a stale update from a straggler is outdated since it should have been received in previous training rounds. As such, it is given a smaller weight. In addition, the authors prove the convergence guarantee for a restricted family of non-convex problems. However, the hyperparameters of the FedAsync algorithm still have to be tuned to ensure convergence in different settings. Hence, the algorithm is still unable to generalize to suit the dynamic computation constraints of heterogeneous devices. Given this uncertainty surrounding the reliability of asynchronous FL, synchronous FL remains the most commonly used approach (Keith et al., 2019). In this context, it has been confirmed that the correlation between the model parameters of different clients increases as the training progresses, which implies that aggregating parameters directly by averaging may not be a reasonable approach in general (Xiao et al., 2020). Besides, a fair resource federated learning approach has been studied recently in (Tian et al., 2020), which introduces a weighted averaging that gives higher weights to devices with the worst performance (i.e., the largest loss) to let them dominate the objective, and thereby imposes more uniformity on the training accuracy. Finally, the authors in (Niknam et al., 2019) and (Yang et al., 2021) have listed the different FL motivations, challenges and applications in 6G and wireless communications, where FL has been presented as a solution to address energy, bandwidth, delay and privacy questions.
As energy consumption is one of the important aspects to consider in FL, the trade-off between learning time, learning accuracy and terminal power consumption has been investigated in (Tran et al., 2019).

Contributions
In this paper, our contribution is two-fold.
• We first introduce the concept of the entropy of a dataset in both classification and regression tasks, where we jointly consider the features and supervised outputs to characterize the distribution of its samples and the underlying quantity of information based on a custom spectral clustering strategy. This generalized entropy captures the diversity of a dataset as well as its unbalanced structure and non-IIDness.
• By leveraging the proposed entropy as a priori information, we develop two novel FL strategies to make central units (CUs) at the 6G Edge-RAN collaborate in learning a certain resource usage, namely, 1) an entropy-weighted federated aggregation, which involves all the CUs in the FL training task while prioritizing the most balanced and uncorrelated datasets (i.e., those maximizing the entropy), and 2) an entropy-driven stochastic policy for selecting only a subset of CUs to take part in the FL task. The latter consists in sampling, at each FL round, the active CUs according to an entropy-based probability distribution, which dramatically improves the convergence stability and reduces the convergence time, as well as the underlying resource consumption, by avoiding concurrent training by all CUs at each round.

Edge-RAN
As depicted in Figure 1, the considered network corresponds to a 6G edge-RAN under the central unit (CU)/distributed unit (DU) functional split, where each transmission/reception point (TRP) is co-located with its DU, while all CUs are hosted in an edge cloud where they run as virtual network functions (VNFs). Each CU $k$ ($k = 1, \ldots, K$) collects RAN key performance indicator (KPI) data to build a local dataset $\mathcal{D}_k = \{(x_k^{(i)}, y_k^{(i)})\}_{i=1}^{D_k}$ of size $D_k$, where $x_k^{(i)}$ stands for the input features vector while $y_k^{(i)}$ represents the corresponding output. Since this dataset is generally not exhaustive enough to train accurate analytical models, the CU takes part in a federated learning task wherein an OSS server, located at the core cloud, plays the role of a model aggregator. In this work, the CUs and the OSS are connected via fiber transport links, which present a very stable behavior (compared to the wireless channel) and have no effect on the accuracy of the FL. The local datasets are collected under different traffic profiles, both in space and time, that tightly depend on the heterogeneous user distribution and behavior in each context (e.g., residential zones, business zones, entertainment events). On the other hand, the radio KPIs are correlated with the time-varying channel conditions. These realistic datasets are therefore non-IID, which is more challenging for FL algorithms, as studied in (Li et al., 2021).

PROPOSED ENTROPY-BASED FEDERATED LEARNING
To tackle the FL convergence in practical non-IID setups, we seek an objective and compressed metric that captures both the distribution of a dataset and its quantity of information, while not depending on the local models. In this regard, we introduce the notion of dataset entropy, a sufficient statistic to characterize the unbalanced structure of a dataset, as well as its independence from other datasets. Specifically, the entropy is maximized under a uniform distribution with low probability mass function (PMF). By relying on the a priori entropies of all CUs, the aggregation server can implement novel CU selection and model combining schemes to accelerate and stabilize the FL convergence.

Dataset Entropy
Since we are targeting a generalized definition of the entropy, the labels of a classification dataset alone are not a reliable reflection of the data distribution: they do not apply to regression tasks, where the supervised output is continuous, and they omit the effect of the input features. In particular, samples with different feature values but approximately similar outputs do not provide the same information and do not necessarily fit in the same group of data. Therefore, in order to accurately discern the samples, we consider a joint approach where both the features and the supervised output are used. To that end, each CU uses a clustering algorithm that operates on the so-called similarity matrix $S_k$, whose entries measure the logical correlations between the dataset sample vectors that include both the features and the supervised output, i.e.,
$$v_k^{(i)} = \left[x_{k,1}^{(i)}, \ldots, x_{k,F}^{(i)}, y_k^{(i)}\right],$$
where $F$ denotes the number of features in the datasets. This matrix is built using a radial basis function (RBF) kernel with parameter $\sigma$. As such, the $(i, j)$-th matrix element is given by
$$S_k(i, j) = \exp\left(-\frac{d^2(i, j)}{2\sigma^2}\right),$$
where $d(i, j)$ stands for the pairwise logical distance between the sample vectors $i$ and $j$. A general definition of this distance that involves both the features and the output can be written as
$$d(i, j) = \sqrt{\sum_{f=1}^{F+1} \alpha_f \left(v_{k,f}^{(i)} - v_{k,f}^{(j)}\right)^2},$$
where $\{\alpha_f\}$ stand for the weights of each feature/output in the distance and verify $\sum_{f=1}^{F+1} \alpha_f = 1$. They can be fine-tuned to orient the clustering towards the most relevant features or to prioritize the output. For the sake of simplicity, and to avoid generating a high number of scenarios, we settle in this work on the typical setting where the weights are uniform, i.e., $\alpha_f = 1/(F + 1)$. Further investigation of the effect of the weights on the entropy is left for future work.
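As an illustration of the similarity matrix construction, the following minimal sketch builds an RBF similarity matrix over joint (features, output) sample vectors. It assumes the common RBF form $\exp(-d^2/(2\sigma^2))$ and a weighted Euclidean distance; the function names, the toy samples and the value of `sigma` are hypothetical and not taken from the experiments.

```python
import math

def weighted_distance(u, v, alphas):
    """Weighted Euclidean distance between two joint vectors [features..., output].
    The exact distance form is an assumption made for this sketch."""
    return math.sqrt(sum(a * (ui - vi) ** 2 for a, ui, vi in zip(alphas, u, v)))

def similarity_matrix(samples, sigma=1.0, alphas=None):
    """RBF similarity matrix over joint (features, output) sample vectors."""
    dim = len(samples[0])  # F + 1: features plus the supervised output
    if alphas is None:
        alphas = [1.0 / dim] * dim  # uniform weights, alpha_f = 1 / (F + 1)
    n = len(samples)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = weighted_distance(samples[i], samples[j], alphas)
            S[i][j] = math.exp(-d ** 2 / (2.0 * sigma ** 2))
    return S

# Toy dataset: two features and one output per sample (hypothetical values).
samples = [[0.1, 0.2, 1.0], [0.1, 0.25, 1.1], [0.9, 0.8, 5.0]]
S = similarity_matrix(samples, sigma=1.0)
```

In this toy case the first two samples are close in both the feature and the output space, so their similarity is near 1, while the third sample is far from both and receives a much smaller similarity.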
Since basic clustering algorithms usually require the target number of clusters as an input, we resort to the well-established self-tuning spectral clustering (STSC) technique, which presents a time complexity of $O(D_k^3)$ but is still practical as long as the dataset size satisfies $D_k < 10^3$ (Tsironis et al., 2013). Since we have only small datasets for each CU (e.g., of size 100), which is incidentally one of the reasons to resort to federated learning, the clustering scheme is viable in our case.
The STSC relies on the eigenvalues and eigenvectors of the similarity matrix. To that end, we first define $\Lambda_k$ to be a diagonal matrix with
$$\Lambda_k(i, i) = \sum_{j=1}^{D_k} S_k(i, j),$$
and construct the normalized affinity matrix
$$L_k = \Lambda_k^{-1/2} S_k \Lambda_k^{-1/2}.$$
When $L_k$ is strictly block diagonal, its eigenvalues and eigenvectors are the union of the eigenvalues and eigenvectors of its blocks, padded appropriately with zeros. Let $X_k$ denote the block diagonal matrix gathering the eigenvectors. In this case, we can automatically cluster a dataset into an appropriate number of clusters that minimizes a custom cost function defined in terms of the coefficients of a rotated and normalized version of matrix $X_k$ (Zelnik and Pietro, 2004). Let us assume that for CU $k$, the clustering yields $n_k$ clusters $C_{k,1}, \ldots, C_{k,n_k}$ with probabilities $\Pr(C_{k,1}), \ldots, \Pr(C_{k,n_k})$ over dataset $\mathcal{D}_k$, which are calculated via the number of samples per cluster $\Delta_{k,p}$ as
$$\Pr(C_{k,p}) = \frac{\Delta_{k,p}}{D_k}, \quad p = 1, \ldots, n_k.$$
The corresponding entropy is then defined as
$$\varepsilon_k = -\sum_{p=1}^{n_k} \Pr(C_{k,p}) \log \Pr(C_{k,p}).$$
By letting the CUs report their dataset entropies $\{\varepsilon_k\}_{k=1}^{K}$ to the aggregation server before starting the training, it becomes possible to devise advanced entropy-driven FL strategies that prioritize the CUs with high entropy datasets.
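The entropy computation itself can be sketched as follows, once a clustering step has assigned a cluster label to every sample. The clustering is not reproduced here; the label lists are hypothetical, and the natural logarithm is an assumption since the base of the logarithm is not specified.

```python
import math
from collections import Counter

def dataset_entropy(cluster_labels):
    """Entropy of a dataset from the cluster label of each sample.
    Pr(C_p) is the fraction of samples falling in cluster p."""
    counts = Counter(cluster_labels)
    total = len(cluster_labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Hypothetical clustering outcomes for two CUs with six samples each.
balanced = [0, 0, 1, 1, 2, 2]   # three equally populated clusters
skewed = [0, 0, 0, 0, 0, 1]     # one dominant cluster

eps_balanced = dataset_entropy(balanced)
eps_skewed = dataset_entropy(skewed)
```

The balanced dataset reaches the maximum entropy for three clusters, $\log 3$, whereas the skewed one yields a lower value, which is the behavior the server relies on when ranking CUs.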

Entropy-Driven FL Combining
In this strategy, the aggregation server directly uses the entropies to perform a weighted averaging of the local models of all CUs at each round $t$, i.e.,
$$W^{(t+1)} = \sum_{k=1}^{K} \frac{\varepsilon_k}{E} W_k^{(t)},$$
where
$$E = \sum_{k=1}^{K} \varepsilon_k$$
is the cumulative sum of the entropies of the different CUs, which serves as a normalization factor. This allows the CUs with high entropies to dominate and orient the FL training, although it requires the participation of all CUs.
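A minimal sketch of such an entropy-weighted combining is given below, assuming each local model is flattened into a list of parameters and that each CU's weight is its entropy divided by the sum of all entropies. The function name and the numerical values are hypothetical.

```python
def entropy_weighted_average(local_models, entropies):
    """Average flattened local models with weights entropy_k / sum(entropies)."""
    total = sum(entropies)
    if total <= 0:
        raise ValueError("the sum of entropies must be positive")
    n_params = len(local_models[0])
    return [
        sum((e / total) * model[i] for model, e in zip(local_models, entropies))
        for i in range(n_params)
    ]

# Two CUs with two parameters each; the second CU has three times the entropy.
local_models = [[1.0, 2.0], [3.0, 4.0]]
entropies = [1.0, 3.0]
global_model = entropy_weighted_average(local_models, entropies)
```

Here the second CU contributes with weight 0.75, so the global model is pulled towards its parameters.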

Entropy-Driven Stochastic FL Policy
To optimize the federated learning computation time as well as the underlying resource consumption, we aim at selecting only a limited number of active CUs in each FL round. In this respect, we introduce an entropy-driven stochastic CU selection policy wherein the aggregation server first generates a probability distribution over all the CUs using their received entropies. In fact, CUs with high entropies hold datasets that are rich in terms of quantity of information and can lead to more generalized models in the training. A direct strategy would consist in selecting the $m$ CUs with the highest entropies during the whole training. However, the datasets of CUs with low entropy can also hold samples that do not exist in the high entropy datasets and yet can help in further generalizing the FL model. The idea we propose is therefore to give them a chance by implementing a softmax stochastic policy, where each CU can participate in the training with a probability that grows with its entropy. Hence, in the long term, even CUs with low entropy are given a chance to train the model in some rounds. This leads to a fast convergence (since it orients the training towards the CUs with high potential) while ensuring a more general model at the end. This is achieved by a direct softmax activation layer, i.e.,
$$\pi_k = \frac{\exp(\varepsilon_k)}{\sum_{j=1}^{K} \exp(\varepsilon_j)}, \quad k = 1, \ldots, K.$$
Next, at each FL round $t$, as illustrated in Figure 2, the server selects a subset $\mathcal{K}^{(t)}$ of $m < K$ CUs to participate in the training by sampling the non-uniform CUs set with probabilities $\{\pi_1, \ldots, \pi_K\}$, i.e.,
$$\mathcal{K}^{(t)} \sim \{\pi_1, \ldots, \pi_K\}, \quad |\mathcal{K}^{(t)}| = m,$$
which ensures that, by the convergence round, the CUs would have stochastically taken part in the FL task according to the initial probability distribution, while avoiding the concurrent training by all CUs at each round. In this case, the model averaging at round $t$ is performed as
$$W^{(t+1)} = \sum_{k \in \mathcal{K}^{(t)}} \frac{D_k}{D} W_k^{(t)}, \qquad (12)$$
where $D$ is the total number of samples over all CUs datasets.
This entropy-driven stochastic policy is summarized in Algorithm 1, where $L(\cdot, \cdot)$ stands for the mean square error (MSE) loss function and $b$ is the bias, while the rest of the FL setting parameters are provided in Table 2.
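The selection step of this policy can be sketched as follows. The softmax over the reported entropies follows the description above; drawing the $m$ CUs sequentially without replacement is an assumption of this sketch, since the exact sampling procedure is not detailed here. The function names, the entropy values and the seed are hypothetical.

```python
import math
import random

def softmax_policy(entropies):
    """Map dataset entropies to selection probabilities with a softmax."""
    shift = max(entropies)  # subtract the maximum for numerical stability
    exps = [math.exp(e - shift) for e in entropies]
    norm = sum(exps)
    return [x / norm for x in exps]

def sample_active_cus(probs, m, rng):
    """Draw m distinct CU indices, one at a time, proportionally to probs.
    Sampling without replacement is an assumption of this sketch."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(m):
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen

# Hypothetical entropies reported by K = 3 CUs; m = 2 CUs are active per round.
probs = softmax_policy([0.5, 1.0, 1.5])
active = sample_active_cus(probs, m=2, rng=random.Random(0))
```

Repeating the draw at every round lets low-entropy CUs participate occasionally while high-entropy CUs are selected more often.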

STOCHASTIC FEDERATED LEARNING CONVERGENCE ANALYSIS
In this section, we analyze the convergence probability of the stochastic federated learning. To this end, a closed-form expression for the lower bound of the convergence probability is derived, reflecting the effects of the CU selection probabilities and the dataset sizes.
Theorem 1 (Convergence Analysis of the Stochastic Federated Learning). Consider that the CU selection in the stochastic federated learning follows a policy $\{\pi_1, \ldots, \pi_K\}$, and let $\Omega$ and $B_k$ stand for the upper bounds on the weights and on the norm of the subgradient $\nabla L(W_k^{(t)})$, respectively. Let $\alpha_k \sim \mathcal{B}(\pi_k)$ denote the activation bit of CU $k$. Then, the federated learning convergence probability admits a closed-form lower bound that depends on the selection probabilities and the dataset sizes.
Proof. First, the subgradient inequality is applied at round $t$. The Cauchy-Schwarz inequality is then used to bound the resulting inner product. Next, the federated learning aggregation of Eq. 12 is recalled. Combining these steps (Eqs. 16 and 17) with the triangle inequality, and then using the monotonicity of the expectation, yields a bound in expectation. Finally, the Hoeffding-Azuma inequality (Hoeffding, 1963) gives the bound on the convergence probability.

FIGURE 2 | Entropy-driven stochastic federated learning policy.

NUMERICAL RESULTS

DNN Setting
The structure of the global model weights matrix $W$ has been defined by the server to satisfy the findings of (Ke and Liu, 2008), where the authors have estimated the required number $Q$ of neurons per layer based on the number $H$ of hidden layers, the dataset sizes $D_k$, and the number of features $F$. This is confirmed via Figures 3 and 4, where the best setting of the DNN model neurons turns out to be $Q = 4$ for $H = 3$. As a benchmark, the performance of our proposed approaches is compared with LossFedAvg (Li et al., 2021) and FedAvg (Brendan McMahan et al., 2017). The FL settings are listed in Table 2, where the FL system consists of $K = 6$ CUs running a local DNN with a learning rate $\eta = 0.001$ for $T = 20$ rounds.

Learning Rate
The learning rate is a key parameter in ML models, so its value has to be selected carefully. In this perspective, we have simulated different learning rate values to illustrate the convergence behaviour of the Entropy-Weighted model. In this respect, Figure 5 shows a fast convergence of the Entropy-Weighted model with learning rate $\eta = 0.01$, while for $\eta = 0.001$ it shows a stable yet slower convergence to the same loss as in the case of $\eta = 0.01$. Note that the adopted DNN optimizer is the Adam optimizer (Kingma and Ba, 2015).

Convergence
Figures 6A,B illustrate the gains achieved by the entropy-weighted approach compared to the baseline FedAvg and LossFedAvg. The comparison is done for both balanced and unbalanced non-IID datasets. As showcased in Table 3, the entropy metric varies across balanced datasets, since the clustering technique takes into account the correlation between features as well as the supervised output. In the unbalanced scenario, the entropy difference between CUs is even clearer, and it also demonstrates that datasets with smaller size can sometimes yield more clusters than larger datasets, which further corroborates the role of the introduced entropy metric in characterizing a dataset efficiently.
Slightly lower losses are obtained with the entropy-weighted approach than with the entropy stochastic policy, but both methods have the same convergence trend. In Figures 6A,B, both entropy-based FL schemes converge faster than FedAvg and LossFedAvg. Given how critical the bandwidth occupation of FL exchanges is, and how power-consuming the local model training at the CUs is, especially in 6G mobile systems, the introduced entropy stochastic policy shows good results. This aspect becomes more critical if the FL result is an input for fast decision-making algorithms such as network slicing orchestration or resource scheduling.
While outperforming FedAvg and LossFedAvg, the convergence curve of the entropy stochastic policy oscillates around that of the entropy-weighted approach, as shown in Figures 6A,B.

Time Complexity and Scalability
Another important achievement of the entropy stochastic policy is the reduction of the time required for a given number of rounds and exchanges between the OSS server and the CUs towards convergence, as shown in Figure 7, wherein the convergence time difference between the entropy-weighted approach and the entropy stochastic policy grows exponentially with the number of FL rounds. Note that the corresponding wall-clock time performance is tightly dependent on the computation capabilities of both the OSS server and the CUs. Nevertheless, it shows that the stochastic policy FL minimizes the computation burden by selecting only a subset of CUs to take part in the training according to their prior entropy measure, no matter how the number of CUs grows in the network. This demonstrates the scalability of the proposed stochastic FL in large-scale deployment scenarios. More results can be generated for different values of $K$ and $m$.

Learning Rate Sets
We have trained both the entropy-weighted and LossFedAvg models using a specific learning rate for each FL CU. As illustrated in Figure 8, better convergence is achieved with both sets of learning rates compared to a fixed $\eta = 0.001$. Here, set1 is a random selection of the CUs learning rates, while in set2, $\eta$ has been chosen according to each CU's entropy value, i.e., CUs with high entropy are assigned small $\eta$ values and vice versa. Note that the random learning rate strategy exhibits unstable convergence, since it allows CUs with low entropy to learn faster and therefore dominate in some cases.
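One possible way to derive entropy-dependent learning rates in the spirit of set2 is sketched below. The linear mapping between entropy and learning rate, the bounds `eta_min` and `eta_max`, and the entropy values are all assumptions made for illustration; the mapping used in the experiments is not specified here.

```python
def entropy_based_learning_rates(entropies, eta_min=0.001, eta_max=0.01):
    """Assign smaller learning rates to CUs with higher entropy.
    The linear interpolation is a hypothetical choice for this sketch."""
    lo, hi = min(entropies), max(entropies)
    if hi == lo:
        return [eta_min] * len(entropies)  # identical entropies: one common rate
    return [
        eta_max - (e - lo) / (hi - lo) * (eta_max - eta_min)
        for e in entropies
    ]

# Hypothetical entropies for three CUs.
etas = entropy_based_learning_rates([0.4, 0.9, 1.4])
```

With this mapping, the CU with the highest entropy trains with the smallest rate and the CU with the lowest entropy with the largest one, with the remaining CUs in between.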

CONCLUSION
In this paper, we have introduced a novel a priori metric termed dataset entropy to characterize the distribution, the quantity of information, the unbalanced structure and the "non-IIDness" of a dataset independently of the models. This entropy is calculated via a generalized clustering strategy that relies on a custom similarity matrix defined over both the features and the supervised output spaces, and supports both classification and regression tasks. The entropy metric has then been adopted to develop 1) an entropy-based federated averaging scheme, and 2) a stochastic CU selection policy to significantly stabilize the training, minimize the convergence time, and reduce the corresponding computation cost. Numerical results have been provided to corroborate these findings. In particular, the convergence time difference between the Entropy-Weighted and Entropy Stochastic Policy schemes grows exponentially with the number of FL rounds. Another important result is that the Entropy Stochastic Policy model converges better than FedAvg and LossFedAvg, while oscillating near the Entropy-Weighted model.

DATA AVAILABILITY STATEMENT
The datasets presented in this article are not readily available because the dataset is protected by IPR of the operator. Requests to access the datasets should be directed to aamer.brahim@gmail.com.