On Addressing Heterogeneity in Federated Learning for Autonomous Vehicles Connected to a Drone Orchestrator

In this paper we envision a federated learning (FL) scenario in service of improving the performance of autonomous road vehicles through a drone traffic monitor (DTM) that also acts as an orchestrator. Expecting non-IID data distribution, we focus on the issue of accelerating the learning of a particular class of critical object (CO) that may harm the nominal operation of an autonomous vehicle. This can be done through proper allocation of the wireless resources for addressing learner and data heterogeneity. Thus, we propose a reactive method for the allocation of wireless resources that happens dynamically each FL round and is based on each learner's contribution to the general model. In addition to this, we explore the use of static methods that remain constant across all rounds. Since we expect partial work from each learner, we use the FedProx FL algorithm in the task of computer vision. For testing, we construct a non-IID data distribution of the MNIST and FMNIST datasets among four types of learners, in scenarios that represent the quickly changing environment. The results show that proactive measures are effective and versatile at improving system accuracy and at quickly learning the CO class when it is underrepresented in the network. Furthermore, the experiments show a tradeoff between FedProx intensity and resource allocation efforts. Nonetheless, a well-adjusted FedProx local optimizer allows for an even better overall accuracy, particularly when using deeper neural network (NN) implementations.


I. INTRODUCTION
The adoption of ubiquitous Level-5 fully independent system autonomy in road vehicles (as per the SAE ranking system [1]) is barred from progress due to the omnipresence of chaotic traffic in legacy traffic situations. Moreover, a 38% share of prospective users are sceptical of the performance of the autonomous driving systems [2]. As such, lowering the number of negative outcome outliers in autonomous vehicle operation, particularly ones that lead to fatal incidents, can be addressed with an overabundance of statistically relevant data [3]. Thus, given the privacy requirements and the abundance of the data that is produced by road vehicles and/or unmanned aerial vehicles (UAVs) in the role of traffic monitors, the machine learning (ML) problem can be addressed by treating the participatory vehicles as learners in a federated learning (FL) network.
In more detail, FL is an ML technique that distributes the learning across many learners. In this way, many separate models are aggregated in order to acquire one general model at server side [4]. In FL, each learner does not have to send heaps of data to a common server for processing, but maintains the data privately. As such, the concept of FL is an extension of distributed ML with four important distinctions: (1) the training data distributions across devices can be non-IID; (2) not all devices have similar computational hardware; (3) FL scales from networks of just a few devices to vast networks of millions; (4) FL can be engineered in a way in which privacy is conserved. Given the vast complexity of implementing FL in autonomous vehicular traffic, particularly related to the quickly changing environment, in this paper we focus on solving the issues of non-IID data learnt across several devices with unequal processing power. We proceed with a review of relevant FL literature below.
I. Donevski, J. J. Nielsen, and P. Popovski are with the Department of Electronic Systems, Aalborg University, Denmark (e-mail: {igordonevski, jjn, petarp}@es.aau.dk).

A. State of the Art
FL is an emergent field that has gained immense popularity in the last five years. From the relevant literature we highlight several works. [5] covers the state of the art regarding computational models, while [6] gives a clear account of the FL potential and its most prominent applications. [7] and [8] provide comprehensive coverage of the communications challenges for the novel edge computation, and [9] analyzes scenarios of FL where learners use wireless connectivity. Challenges and future directions of FL systems in the context of the future 6G systems are given in [10], while [11] elaborates upon the applications of FL on connected automated vehicles and collaborative robotics. [12] covers resource allocation and incentive mechanisms in FL implementations. Most of the works on FL concerning UAVs treat the devices as learners [13]-[15]. This requires mounting heavy computational equipment on-board, and therefore it is an energy inefficient way of exploiting drones. In contrast, in our prior work [16] we have investigated techniques for reducing staleness when a UAV acts as an orchestrator by optimizing its flying trajectory.
There is also an interest in wireless resource allocation optimization for FL networks, as covered in the works that follow. The work of [17] proposes a detailed communications framework for resource allocation given complex wireless conditions and an FL implementation on IID data. This work has a strong contribution to the topic of convergence analysis of wireless implementations of FL with a very detailed channel model. The work of [18] does a detailed convex analysis for distributed stochastic gradient descent (SGD) and optimizes the power allocation for minimizing FL convergence times. The work of [19] formulates FL over a wireless network as an optimization problem and conducts numerical analysis given the subdivided optimization criteria. However, the aforementioned works perform their analysis on SGD, which has been shown to suffer in the presence of non-IID data and unequal work times [20]. The novel local subproblem that includes a proximal optimizer in [20] achieves 22% improvements in the presence of unequal work at each node. The learning of both single task and multi task objectives in the presence of unequal learner contributions is a difficult challenge and has received a lot of attention, e.g. in the works of [21]-[23]. This also leads to the question of analyzing contributions among many learners with vastly different hardware, which is considered in works covering FL incentive mechanisms by [12], [24]-[27]. The incentive based FL implementations rely on estimating each learner's contribution and rewarding them for doing the work. Hence calculating appropriate rewards becomes a difficult challenge that also comes at the price of computation and communications, as shown by [28]. Such mechanisms are useful when orchestrating an FL where learners would collect strongly non-IID data and learn with vastly different processing capabilities.

B. Drone Traffic Monitors as FL Orchestrators
Unmanned aerial vehicles (UAVs) or drones could provide an essential aid to the vehicular communication networks by carrying wireless base stations (BSs). In combination with the 5G standardisation and the emerging 6G connectivity, drone-aided vehicular networks (DAVNs) [29] are capable of providing ultra reliable and low latency communications (URLLC) [30], [31] when issuing prioritized and timely alarms. In accord, most benefits of DAVNs come as consequence of the UAV's capability to establish line of sight (LOS) with very high probability [32]. The good LOS perspective also benefits visual surveillance, hence enabling UAVs to offer just-in-time warnings for critical objects (COs) that can endanger the nominal work of autonomous vehicles. Though DAVNs expect many roles from the drone, we draw inspiration from UAVs in the role of drone traffic monitors (DTMs) that continuously improve and learn to perform timely and reliable detections of COs. To avoid requiring a plethora of drone-perspective camera footage of the traffic, we propose DTMs that take the role of a federated learning (FL) orchestrator, and autonomous vehicles participate as learners.
This FL architecture with a drone orchestrator, illustrated in Fig. 1, exploits the processing and sensing enabled vehicles contained in the monitoring area (MA) to participate both as learners and supervisors. The vehicle-learners receive the drone provided footage, and do the heavy computational work of ML training for the task of computer vision. This is possible since the vehicle-learners have robust sensing capabilities, and when they have the CO in view, can contribute to the learning process due to their secondary perspective [33] on the object, and their deeper knowledge of traffic classes. However, even when assuming perfect supervision by the learners, FL is not an easy feat since some knowledge can be obfuscated among omnipresent information and/or contained at computationally inferior straggler learners. In accord, we use a combination of state of the art FL implementation with a novel resource aware solution for balancing work times and learner contributions, which are described in the overview that follows.

C. Main Contributions
In this paper, we provide a novel perspective on continuous DTM improvements through an FL implementation onto vehicle-learners. Moreover, we aim to provide a robust and adaptable resource allocation method for improved FL performance in the presence of chaotic, quickly changing, and most importantly imbalanced and non-IID data. Since both computational and data bias cannot be analytically extracted before sampling the ML model received from each learner, we assume heuristic measures such as maximizing the epochs computed, or equalizing the epochs computed across the learners. Moreover, the core contribution of this work is a dynamic resource allocation method based on each learner's past contributions. To provide full compatibility with heterogeneous learners and non-IID data, we employ these methods in combination with the FedProx algorithm. Finally, we developed an experimental analysis in which the performance is evaluated through its capability to learn an underrepresented class of the dataset, while also balancing overall system accuracy.
The paper is organised as follows. Section II introduces the learning setup and the communications resource allocation setup. Section III defines the optimization problem, lists several static and reactive heuristic measures for improving the learning performance, and introduces the learner contribution calculations. This is followed by Section IV, where the experimental setup and its results are presented. Finally, Section V summarizes the outcomes and discusses future directions.

II. SYSTEM MODEL
The setup is depicted in Fig. 2. The DTM broadcasts unsupervised video surveillance footage at a constant data rate to all vehicles inside the MA. We assume that each vehicle acts as an ideal supervisor for the objects which are represented both in the broadcasted video and in its sensor feed. Given some deadline of completion $T$, the learner needs to return its locally learnt model to the drone-orchestrator. After receiving the models, the orchestrator aggregates the $K$ models, after which it can also evaluate the contribution of each learner separately. Each learner $k$ has a contribution that the contribution estimator estimates to be $G_{k,i}$ for some FL cycle/round $i$. Finally, the orchestrator contains a resource allocator module that, based on the aforementioned information, can readjust the wireless resources for the next round in a way that improves the FL process.

A. Federated Learning
The FL process starts when the orchestrator sends its weights to all $K$ learners, where each learner $k \in \mathcal{K} = \{1, 2, \dots, K\}$ is present in the MA. The goal of FL methods [4] is to coordinate the optimization of a single global learning objective $\min_{\omega} f(\omega)$, where the function $f(\cdot)$ is calculated across the whole network at each round $i$ as:
$$f(\omega) = \sum_{k=1}^{K} p_k F_k(\omega), \tag{1}$$
where $\omega$ is the instantaneous value of the local model weights, $F_k(\omega)$ is the local optimization function at each node, and $p_k \geq 0$ with $\sum_{k} p_k = 1$ is the averaging weight when aggregating. In a single FL round $i \in \mathbb{Z}^+$, a server, i.e. the DTM-orchestrator, has a global model with weights $\omega_{g,i}$. On round $i$ each $k$-th learner receives the model and computes $\tau_{k,i}$ epochs of solving the local optimization function $F_k(\cdot)$, with data batches of size $B$. Each batch represents a sample of items that have been sensed and collected from that learner's surroundings. The distributed training process produces a new set of weights $\omega_{k,i}$ at each $k$, which totals to $K$ different ML models. Hence, cycle $i$ concludes when all $\omega_{k,i}$ are aggregated to a single set of weights $\omega_{g,i+1}$ that serves as the collective model for the next iteration. The two most prominent approaches to solve the FL problem are FedAvg [4] and FedProx [20], which differ mainly in the local optimization problem $F_k(\cdot)$ at each device. Using stochastic gradient descent (SGD) as a local solver for $F_k(\cdot)$, federated averaging (FedAvg) locks the number of local epochs for each device to a fixed value. As such, each learner is fixed on computing the same $F_k(\cdot)$ with the same SGD learning rate for the same number of epochs. For the successful operation of this system, it is essential to tune the optimization hyperparameters properly, including the number of epochs.
The tradeoff in FedAvg becomes one of computation and communication, since computing more local epochs reduces communication overhead at the expense of diversifying the local objectives, as each learner converges to a local optimum given its portion of the non-IID data.
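The aggregation step that closes each round can be illustrated with a minimal sketch. It assumes that every local model is flattened into a plain list of floats and that the averaging weights `p` are supplied by the caller; the function name `aggregate` and the toy numbers are hypothetical and only serve the illustration.

```python
def aggregate(local_weights, p):
    """Weighted average of K local weight vectors (each a list of floats).

    Assumption: p[k] >= 0 and sum(p) == 1, matching the averaging weights
    p_k of the global objective.
    """
    if any(pk < 0 for pk in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("averaging weights must be non-negative and sum to 1")
    dim = len(local_weights[0])
    return [sum(pk * w[j] for pk, w in zip(p, local_weights)) for j in range(dim)]


# Three hypothetical learners returning two-parameter models.
local = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
p = [0.5, 0.25, 0.25]
global_next = aggregate(local, p)  # plays the role of the next global model
```

In a real deployment the weights would be per-layer tensors rather than flat lists, but the averaging rule stays the same.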
Due to the expected heterogeneity in the network of learners in the proposed FL implementation, we use the FedProx algorithm. The benefit of FedProx is that it can converge and provide good general models even under partial work and very dissimilar values of $\tau_{k,i}$. This is done by introducing a proximal term in $\|\omega - \omega_{g,i}\|$ that alleviates the negative impact of the heterogeneity as:
$$F_k(\omega; \omega_{g,i}) = L_k(\omega) + \frac{\mu}{2} \|\omega - \omega_{g,i}\|^2, \tag{2}$$
where $\omega$ is the instantaneous value of the local model weights at the local optimizer, $L_k(\omega)$ is a local cost function for the estimation losses, and $\mu$ is a hyperparameter controlling the impact of the proximal term. The role of the proximal term is to prevent the local optimizer from straying far from the global model at round $i$. Moreover, we can control the local optimization problem to vary from FedAvg ($\mu = 0$) to FedProx ($\mu > 0$). We note that even when using FedProx, too much local work causes the local optimizers to diverge from the global objective [20]. Finally, using (2) for minimizing the local sub-problem $\min_{\omega} F_k(\omega; \omega_{g,i})$, the FL converges to a solution even in the presence of heterogeneity and non-IID data distribution [20]. Therefore, we use the FedProx algorithm to allow for full flexibility in data and processing heterogeneity, in combination with the resource allocation module that follows.
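To make the role of the proximal term concrete, the sketch below runs plain gradient descent on a FedProx-style local objective, i.e. a local loss plus $\frac{\mu}{2}\|\omega - \omega_{g,i}\|^2$. The quadratic toy loss, the `target` vector, and the function name `fedprox_local_update` are assumptions made for illustration; they are not the CNN training used in the experiments.

```python
def fedprox_local_update(w_global, grad_loss, mu, lr, epochs):
    """Gradient descent on L_k(w) + (mu / 2) * ||w - w_global||^2."""
    w = list(w_global)
    for _ in range(epochs):
        g = grad_loss(w)
        # The proximal term adds mu * (w - w_global) to the loss gradient.
        w = [wj - lr * (gj + mu * (wj - wgj))
             for wj, gj, wgj in zip(w, g, w_global)]
    return w


# Toy local loss 0.5 * ||w - target||^2, whose gradient is (w - target).
target = [4.0, -2.0]


def grad(w):
    return [wj - tj for wj, tj in zip(w, target)]


w_avg = fedprox_local_update([0.0, 0.0], grad, mu=0.0, lr=0.1, epochs=50)
w_prox = fedprox_local_update([0.0, 0.0], grad, mu=0.5, lr=0.1, epochs=50)
```

With `mu=0` the learner drifts all the way towards its own local optimum, while `mu=0.5` keeps it between the global model and that optimum, which is the behaviour the proximal term is meant to provide.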

B. Allocation of Wireless Resources
Though the work of [17] covers a detailed cellular model for FL connectivity, drone provided connectivity is generally uniform and can be designed to be predominantly line of sight [34]. As we illustrate in Fig of the elevation can be derived from the environmental parameters while also accounting for the directivity of the antenna mounted on the drone, as in [35], and the service reliability that needs to be achieved [36].
Since our goal of a DTM implementation is to improve the worst case performance of autonomous traffic, we also model the communications system through $\theta_{edge}$ as a worst case design parameter. $\theta_{edge}$ is decided upon deployment, as it plays an important role in controlling the likelihood of establishing line of sight with the ground vehicles at the edge of the cell as in:
$$P_{LoS}(\theta_{edge}) = \frac{1}{1 + a \exp\left(-b(\theta_{edge} - a)\right)}, \tag{3}$$
where $a$ and $b$ are constants defined by the propagation topology of the environment, as given by [37]. Through $\theta_{edge}$ in (3) a system designer controls not only the probability of detecting a CO but also the average quality of the communications channel at the edge of the MA, which is determined by $L_{LoS}$ and $L_{NLoS}$, the pathloss coefficients when LOS is established or lost, respectively. As such, we arrive at the average rate $R_{avg}$ for the user located at the edge of the cell, given the transmission power $P_{tx}$ and the noise power $N$. As FL model transmissions usually take several seconds depending on the size of the model, we omit small scale fading as an impactful factor in the analysis and assume that the drone provided links are symmetrical in both directions and offer each learner $k$ a rate of $\frac{W}{K} \cdot R_{avg}$, where $W$ is the total bandwidth dedicated to the FL model passing. $W$ may be represented as discrete resource blocks or a band of spectrum that is left over after portioning part of it for the purpose of video broadcasting. Like this, $R_{avg}$ acts as a lower bound guarantee for the amount of time spent learning at each ground device.
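The chain from elevation angle to edge rate can be sketched numerically. The snippet below is only one possible reading: it assumes the common sigmoid form for the LOS probability, a probability-weighted average of the LOS and NLOS pathloss in dB, and a Shannon-type spectral efficiency. The environment constants (`a=9.61`, `b=0.16`, values often quoted for urban environments) and all power and pathloss figures are hypothetical placeholders, not parameters taken from the paper.

```python
import math


def p_los(theta_deg, a, b):
    """Assumed sigmoid LOS probability versus elevation angle in degrees."""
    return 1.0 / (1.0 + a * math.exp(-b * (theta_deg - a)))


def avg_pathloss_db(theta_deg, a, b, l_los_db, l_nlos_db):
    """Assumed average pathloss: LOS and NLOS losses weighted by P_LoS."""
    p = p_los(theta_deg, a, b)
    return p * l_los_db + (1.0 - p) * l_nlos_db


def avg_rate(p_tx_dbm, noise_dbm, pathloss_db):
    """Assumed Shannon-type spectral efficiency in bit/s/Hz."""
    snr_db = p_tx_dbm - pathloss_db - noise_dbm
    return math.log2(1.0 + 10.0 ** (snr_db / 10.0))


A, B = 9.61, 0.16  # hypothetical environment constants
rate_45 = avg_rate(20.0, -95.0, avg_pathloss_db(45.0, A, B, 100.0, 120.0))
rate_20 = avg_rate(20.0, -95.0, avg_pathloss_db(20.0, A, B, 100.0, 120.0))
```

Under these assumptions a steeper edge angle raises the LOS probability and therefore the guaranteed edge rate, which is the design lever described above.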
As the size of the processing batch is fixed to $B$, each device $k$ is tasked with an equal number of floating point operations (FLOPs) for each epoch, and computes $\tau_{k,i}$ epochs. However, for each learner $k$ we introduce a coefficient $f_k$ that represents the learner's computational power with regard to the model size, expressed as the number of epochs computed per unit time. Having full information on $f_k$ is generally trivial, since it depends on the processing capabilities of the learner, which should be publicly available in the device specifications.
Given an equal bandwidth allocation to all devices, the total number of epochs is a linear function of $f_k$. This results in the following equation for $\tau_{k,i}$:
$$\tau_{k,i} = f_k \left( T - \frac{D K}{W R_{avg}} \right),$$
where $D$ is the total amount of data that needs to be sent in both directions within the deadline of $T$. We convert the problem to a step-wise nomenclature that gives the relationship between each pair of learners $k, l \in \mathcal{K}$, $l \neq k$, independent of the length of $T$ but as a relative inter-learner metric. We then perform the substitution:
$$\alpha = T - \frac{D K}{W R_{avg}},$$
where $\alpha$ is the nominal time reserved for learning, and it is directly influenced by the amount of FLOPs required to compute one epoch. This simplifies to:
$$\tau_{k,i} = f_k \left( \alpha + \frac{D K}{W R_{avg}} - \frac{D K}{S_{k,i} W R_{avg}} \right),$$
where $S_{k,i} \geq 0$ with $\sum_{k=1}^{K} S_{k,i} = K$ is the bandwidth allocation for learner $k$ in round $i$, represented as the portion of the average spectrum $\frac{W}{K}$ occupied (i.e. $S_{k,i} = K$ is the full spectrum, and $S_{k,i} = 1$ is the average spectrum). We continue with the substitution:
$$\beta = \frac{D K}{W R_{avg}},$$
where $\beta$ is the portion of time spent transmitting within one round. As per $\beta$, it is much more important to investigate the ratio of data load on the channel than to focus solely on the achieved rate $R_{avg}$. Moreover, the time spent learning at each device becomes more significant the more we load the resources, in both the number of learners and the size of the model. This results in the final representation of epochs computed for learner $k$ as a function of the bandwidth allocated to them:
$$\tau_{k,i} = f_k \left( \alpha + \beta - \frac{\beta}{S_{k,i}} \right).$$
Given a no-drop policy (each learner must complete at least one epoch, $\tau_{k,i} \geq 1$), the lower bound on $S_{k,i}$ becomes:
$$S_{k,i} \geq S^{min}_{k} = \frac{\beta}{\alpha + \beta - 1/f_k}, \tag{12}$$
and the extreme upper bound of $S_{k,i}$ is therefore:
$$S_{k,i} \leq S^{max}_{k} = K - \sum_{l \in \mathcal{K},\, l \neq k} S^{min}_{l}. \tag{13}$$
The behaviour of the resource function for a single $\tau_{k,i}$ when adjusting $\beta$ and $S_{k,i}$ within the bounds of (12) and (13), is: The entire communications setup is reducible to the analysis of combinations of $\alpha$ and $\beta$, as both parameters directly determine the impact that resource allocation has on the system.
Moreover, the parameter β modifies the impact of resource allocation for each learner, where systems with high β values stand to benefit the most, while low β values indicate near instantaneous model transfers which cannot be influenced by modifying the bandwidth. On the other hand, α is a system design hyperparameter that indicates the amount of epochs computed within a single round, by an average learner, and it is fully customizable before or even during operation.
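A small numerical sketch can make the interplay of $\alpha$, $\beta$ and the bandwidth share tangible. It assumes that learning time is whatever remains of the round after the model transfer, so that a learner with share $S_{k,i}$ computes $f_k(\alpha + \beta - \beta / S_{k,i})$ epochs; this closed form, the helper names, and the treatment of epochs as real numbers (no rounding) are assumptions of the sketch.

```python
def epochs(f_k, s_k, alpha, beta):
    """Assumed epochs for a learner with speed f_k and bandwidth share s_k."""
    return f_k * (alpha + beta - beta / s_k)


def s_min(f_k, alpha, beta):
    """Smallest share that still allows one full epoch (no-drop policy)."""
    return beta / (alpha + beta - 1.0 / f_k)


def s_max(k, f, alpha, beta):
    """Largest share for learner k once every other learner keeps its minimum."""
    others = sum(s_min(f_l, alpha, beta) for l, f_l in enumerate(f) if l != k)
    return len(f) - others


# Learner speeds mirroring the vehicle types used later in the experiments.
f = [1.0, 1.0, 1.3, 1.3, 0.7, 0.7, 0.15]
alpha, beta = 100.0, 100.0

nominal = [epochs(f_k, 1.0, alpha, beta) for f_k in f]  # equal shares
straggler_floor = s_min(f[-1], alpha, beta)
straggler_ceiling = s_max(len(f) - 1, f, alpha, beta)
```

With equal shares an average learner (`f_k = 1`) computes `alpha` epochs, which matches the interpretation of $\alpha$ given above, and the straggler needs roughly half of an average share to finish a single epoch.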

III. ANALYSIS
Our goal is to improve the learning of a particular class among the network of FL devices, that may represent a CO, without harming the overall accuracy of the system. Thus, each round $i$ we exploit our control over the wireless resources and optimize the bandwidth allocated to each device $S_{k,i}$. The vector representation of the bandwidth allocation for each round becomes $\mathbf{S}_i = (S_{1,i}, S_{2,i}, \dots, S_{K,i})$. In the same way, the number of epochs computed in round $i$ and the contribution estimations are reformulated into the vectors $\boldsymbol{\tau}_i = (\tau_{1,i}, \tau_{2,i}, \dots, \tau_{K,i})$ and $\mathbf{G}_i = (G_{1,i}, G_{2,i}, \dots, G_{K,i})$, where $G_{k,i}$ is an estimate of the contribution of learner $k$ based on its learning performance in the past. Due to the rapidly changing environment around each learner, we cannot assume having information about the size or distribution of the data stored at each learner. Therefore, we can assume a function of utility from both aforementioned parameters $h_X(\boldsymbol{\tau}_i, \mathbf{G}_i)$, where $X$ is a placeholder for the name of the approach. Given this function, the optimization problem of maximizing the utility $X$ can be defined as:
$$\max_{\mathbf{S}_i} \; h_X(\boldsymbol{\tau}_i, \mathbf{G}_i) \quad \text{s.t.} \quad (12), (13), (14). \tag{15}$$
Extracting the direct impact of G k,i and τ i onto the future accuracy of the model, and under non-IID data distribution, is non-trivial and hence requires that we form several heuristic functions for h X () to be tested on an experimental setup.
Therefore we compare three different solutions for (15) by swapping the utility function h X () with the ones named as X ∈ {MAX, AAS, ACT}. The first two versions of the optimization problem (MAX and AAS) apply a static method that computes utility only as a function of the epochs that will be computed for that round for each learner. The third approach (ACT) is a novel reactive method, that extracts the utility of a learning round as a product of the estimated contribution by each learner and the epochs that will be computed by that learner. The details for each method follow below.

A. Static Resource Allocation Measures
The naive way of improving the convergence in a heterogeneous setting is maximizing the total amount of work done by all learners as in:
$$h_{MAX}(\boldsymbol{\tau}_i) = \sum_{k=1}^{K} \tau_{k,i}. \tag{16}$$
This optimization criterion maximizes the epochs computed across the whole network given the limited radio resources.
Since (16) implies an asynchronous amount of work performed among the learners, it may not be considered as a potential maximization metric when using classical FedAvg implementations. However, since we use FedProx as a local optimizer, this is a sufficient naive solution that represents an exploitative behavior from the orchestrator. Furthermore, given the work on asynchronous FL and the issues of diverse computational hardware in the network [38], [39], we identify maximum staleness [16] as an important criterion for the precision of the model. We define this as the maximal difference between the fastest and slowest learner, $s = \max_{k} \tau_{k,i} - \min_{k} \tau_{k,i}$. Nonetheless, minimizing staleness does not extract the full potential of our setup. Therefore, as in [16], we combine $s$ and the average of the anticipated epochs into a more balanced heuristic optimization metric, named Average Anchored Staleness (AAS). AAS gives a good general overview that is data-agnostic, relying solely on spatial and computational performance without the need to assume the impact of data at some particular learner. Like this, AAS provides a resource allocation objective function that serves an equally balanced amount of learning and staleness.
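The static criteria can be compared on a toy allocation. The sketch below assumes that MAX is the plain sum of epochs, that staleness is the gap between the fastest and slowest learner, and that AAS is the average number of epochs minus the staleness; the last form is an assumed reading of "average anchored" with equal weights, and the `epochs` helper and both allocations are likewise hypothetical.

```python
def epochs(f_k, s_k, alpha=100.0, beta=100.0):
    """Assumed epochs for speed f_k and bandwidth share s_k."""
    return f_k * (alpha + beta - beta / s_k)


def h_max(tau):
    """Total epochs computed across the network."""
    return sum(tau)


def staleness(tau):
    """Gap between the fastest and the slowest learner."""
    return max(tau) - min(tau)


def h_aas(tau):
    """Assumed AAS: average epochs penalised by the staleness."""
    return sum(tau) / len(tau) - staleness(tau)


f = [1.0, 1.0, 1.3, 1.3, 0.7, 0.7, 0.15]
equal = [1.0] * 7                               # every learner gets W/K
shifted = [0.9, 0.9, 0.8, 0.8, 1.1, 1.1, 1.4]   # bandwidth moved to slow learners

tau_equal = [epochs(fk, sk) for fk, sk in zip(f, equal)]
tau_shifted = [epochs(fk, sk) for fk, sk in zip(f, shifted)]
```

In this toy case MAX prefers the equal split while the assumed AAS prefers the shifted one, which illustrates how the two static criteria pull the allocation in different directions.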

B. Contribution Estimation for Reactive Resource Allocation
In the case of DTMs, the considered vehicle supervisors/learners can find themselves in the presence of vastly different objects, and the data they sense changes constantly while they operate. Given the aforementioned, the contribution of each learner is hard to estimate, especially in the presence of noisy samples. Hence, we assume that separating the important CO information ahead of time is impossible and only consider reactive approaches such as incentive mechanisms. To use incentive mechanisms we must assume that the validation dataset that is present at the orchestrator has equal representation of all classes. Hence, based on such validation data we can pass the weights $\omega$ through an evaluation function $E(\omega)$, which can be based on accuracy or loss evaluations of the model (we choose accuracy). To calculate the contribution for each round $i$ we define:
$$G_{k,i} = E(\omega_{g,i+1}) - E(\omega_{g \setminus \{k\},i}), \tag{19}$$
where $\omega_{g \setminus \{k\},i}$ is produced by a model aggregator that constructs a new model as an aggregate of all received models except the one of $k$. Hence the difference in accuracy between the fully aggregated model and $\omega_{g \setminus \{k\},i}$ [27] gives the added value (the uniqueness) of the learning done by learner $k$. Like this, the contribution estimator is capable of discovering the overall contribution from each learner for that round, without the capability of sampling for contributions on each detection class separately, or of discerning which object is underrepresented or is the CO. This is a central feature of our method, since we aim to improve CO learning without tailoring the solution to discern which class is the CO. We note that $\omega_{g \setminus \{k\},i}$ needs to be computed for each learner in order to produce $K$ different contribution estimations. In addition to having to compute an additional parameter, there is one extra set of weights that needs to be aggregated for the calculation of $\omega_{g \setminus \{k\},i}$ for all other learners, thus making the complexity of the estimator scale as the square of the number of nodes $K$ in the system.
Even though the computational complexity of this technique can escalate in big FL implementations, in the architecture that we propose there should be only several active learners inside the MA. Thus, even given the limited computational power on the drone, the estimator module should not experience lengthy computational times.
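The leave-one-out estimate described above can be sketched as follows. The snippet assumes flat weight lists, a caller-supplied evaluation function standing in for validation accuracy, and renormalised averaging weights when one learner is excluded; the renormalisation, the toy `evaluate` function, and all names are assumptions made for illustration.

```python
def aggregate(models, p):
    """Weighted average of flat weight vectors."""
    dim = len(models[0])
    return [sum(pk * m[j] for pk, m in zip(p, models)) for j in range(dim)]


def contributions(local_models, p, evaluate):
    """Contribution of learner k: score of the full aggregate minus the
    score of the aggregate built without learner k."""
    e_full = evaluate(aggregate(local_models, p))
    result = []
    for k in range(len(local_models)):
        rest = [m for j, m in enumerate(local_models) if j != k]
        p_rest = [pj for j, pj in enumerate(p) if j != k]
        norm = sum(p_rest)
        p_rest = [pj / norm for pj in p_rest]  # assumed renormalisation
        result.append(e_full - evaluate(aggregate(rest, p_rest)))
    return result


def evaluate(w, ideal=(1.0, 1.0)):
    """Toy stand-in for validation accuracy, clipped to [0, 1]."""
    err = sum(abs(wj - tj) for wj, tj in zip(w, ideal)) / len(ideal)
    return max(0.0, 1.0 - err)


# Two helpful learners and one that pulls the model away from the ideal.
models = [[1.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
G = contributions(models, [1.0 / 3.0] * 3, evaluate)
```

Each call performs one full aggregation plus one reduced aggregation per learner, which is where the quadratic scaling in the number of learners comes from.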
Following the first round, each device $k$ provides its model to the DTM-orchestrator, after which the aggregator provides the first aggregate model weights $\omega_{g,i+1}$. The resource allocator module in the orchestrator receives the contributions for each of the participating learners and hence can decide to adjust the resources based on $G_{k,i}$. Since $G_{k,i}$ is an estimation of the contributions for the past round, the goal is to maximize the total contribution of the upcoming round by introducing the following optimization function:
$$h_{ACT}(\boldsymbol{\tau}_i, \mathbf{G}_i) = \sum_{k=1}^{K} g(G_{k,i}) \, \tau_{k,i},$$
where $g(\cdot)$ is a utility function that scales the contributions to match the impact of the number of computed epochs. Introducing a utility function is necessary to properly scale each learner's impact, since $-1 \leq G_{k,i} \leq 1$ and $\tau_{k,i} \in \mathbb{Z}^+$. Since in an average scenario $E[\tau_{k,i}] = \alpha E[f_k]$ and $E[f_k] = 1$, we scale our utility function as per the average epochs computed for that round as $g(G_{k,i}) = \alpha^{G_{k,i}}$. The bounds of the function become $\frac{1}{\alpha} \leq g(G_{k,i}) \leq \alpha$, and the nominal non-contributive learners produce $g(0) = 1$. Thus the heuristic exponential optimization function for the reactive solution can be calculated as the contribution corrected maximum epochs computed as in:
$$h_{ACT}(\boldsymbol{\tau}_i, \mathbf{G}_i) = \sum_{k=1}^{K} \alpha^{G_{k,i}} \, \tau_{k,i}. \tag{21}$$
In the case of constantly equal contributions from all learners, the heuristic maximization criterion is reduced to the epoch maximization problem defined in (16). With $h_{ACT}$ defined as in (21) we maintain the problem within the bounds of mixed integer linear programming, since the utility is applied only to $G_{k,i}$, which remains constant for the whole round $i$.
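A short sketch shows how an exponential utility turns contributions into epoch weights. It assumes the utility $g(G) = \alpha^{G}$ and a reactive objective that sums $g(G_{k,i}) \, \tau_{k,i}$ over the learners; the function names and the sample numbers are hypothetical.

```python
def g(G_k, alpha):
    """Assumed exponential utility: alpha raised to the contribution."""
    return alpha ** G_k


def h_act(tau, G, alpha):
    """Contribution-corrected total of the epochs computed in a round."""
    return sum(g(G_k, alpha) * t for G_k, t in zip(G, tau))


alpha = 100.0
tau = [100.0, 130.0, 70.0, 15.0]      # hypothetical epochs per learner
neutral = [0.0, 0.0, 0.0, 0.0]        # nobody stands out
straggler_useful = [0.0, 0.0, 0.0, 0.2]

baseline = h_act(tau, neutral, alpha)
boosted = h_act(tau, straggler_useful, alpha)
```

With all contributions at zero the objective collapses to the plain sum of epochs, and a positive contribution from the straggler makes every epoch it computes count for more, which is what steers bandwidth towards it in the next round.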

A. Experimental Setup
For a set of learners that are scattered along the MA, our goal is to as closely as possible generate an experimental setup that simulates a realistic learner given the system model in Section II. Since each learner has a very short amount of time to do the learning for the DTM, we approach the data as fleeting (stored very briefly) and concealed (cannot be known beforehand). Due to the complexity and the issues of reliably simulating the FL performance for full scale traffic footage, we test the performance of the proposed methods through simple and easily accessible computer vision datasets. Each testing scenario was built using either the MNIST dataset [40] of handwritten digits, or the FMNIST [41] dataset consisting of 10 different grayscale icons of fashion accessories.
As we expect that each vehicle contains strongly non-IID data we create a custom data distribution among K = 7 learners as shown in Table II. In addition, the processing power for computing a certain amount of epochs per millisecond f k for each learner, is distributed as: two standard vehicles (f k = 1), two premium vehicles (f k = 1.3), and two budget vehicles (f k = 0.7); with the addition of one straggler that contains an older technology (f k = 0.15). At each epoch the learner samples a single batch of B = 16 randomly selected values from the stored data (as per Table II). Like this, the training data changes constantly, to mimic the changing environment of the vehicular scenario. This makes this FL testing scenario unique in that the number of epochs computed also reflects the amount of data sampled from the environment.
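The per-epoch sampling can be pictured with a few lines of code. The sketch assumes that a learner's stored data is a list of `(sample_id, label)` pairs and that a batch is drawn without replacement; the store contents below are made up, since the actual class split of Table II is not reproduced here.

```python
import random


def sample_batch(stored, batch_size=16, rng=random):
    """Draw one batch of distinct items from the learner's stored data."""
    return rng.sample(stored, batch_size)


# Hypothetical store for one learner holding 40 samples of three classes.
stored = [(i, label) for i, label in enumerate([0, 1, 2] * 13 + [0])]
rng = random.Random(0)  # seeded only to make the illustration repeatable

batches = [sample_batch(stored, 16, rng) for _ in range(3)]  # three epochs
```

Because a fresh batch is drawn at every epoch, a learner that computes more epochs also sees more of its local data, which is the coupling between epochs and sampled data noted above.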
In the described setting, the class-number 5 (6th class counting from zero) assumes the role of a CO. In addition to the CO, class-number 3 is another non-CO class that is not too common and appears at only 3 learners. This is an overexaggerated situation of having the CO data hidden at one node that is also a straggler. We expect this to be a realistic reflection of data in drone orchestrated FLs as nodes carry only a small amount of supervisory data for each class due to the fact that they stumble upon important objects randomly.
For detection, we implement a small convolutional neural network (CNN), common for the global and local models, implemented in Python with TensorFlow [42]. In more detail, the CNN has only one 3x3 layer of 64 channels using the rectified linear unit (ReLU), which goes to a 2x2 pooling layer. A dense, fully connected neural network (NN) layer of 64 ReLU activated neurons receives the pooled outputs of the convolutional layer, and is then fully connected to a NN layer of 10 soft-max activated neurons, one for each of the 10 categories of the MNIST dataset. The local optimizer at each learner is given by the FedProx calculation in Eq. (2), where the cost function $L_k(\cdot)$ is a categorical cross-entropy loss function, and the learning rate performed well when fixed to γ = 0.1. The communication phase coefficient was considered in milliseconds and chosen as β = 100, considering our CNN model with a size of 2.5Mb that needs to be transmitted to all 7 learners over a single W = 80MHz 802.11ax channel. Finally, in the reference frame of milliseconds, the cycle duration coefficient was set to α = 100 in favor of allowing for higher flexibility when scaling the bandwidth allocation.
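As a rough sanity check on the size of such a network, the trainable parameters can be counted by hand. The bookkeeping below assumes a 28x28 single-channel input, no padding and unit stride in the convolution, a pooling stride of 2, bias terms in every layer, and 32-bit floats; none of these details are stated in the text, so the resulting numbers are indicative only.

```python
def conv_output(size, kernel):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - kernel + 1


INPUT_HW, INPUT_CH = 28, 1    # assumed MNIST/FMNIST image shape
FILTERS, KERNEL = 64, 3
DENSE_UNITS, CLASSES = 64, 10

conv_params = KERNEL * KERNEL * INPUT_CH * FILTERS + FILTERS
conv_hw = conv_output(INPUT_HW, KERNEL)
pooled_hw = conv_hw // 2      # 2x2 pooling, stride 2
flat = pooled_hw * pooled_hw * FILTERS

dense_params = flat * DENSE_UNITS + DENSE_UNITS
output_params = DENSE_UNITS * CLASSES + CLASSES

total_params = conv_params + dense_params + output_params
size_bytes = total_params * 4  # assuming 32-bit floats
```

Almost all parameters sit in the first dense layer, so the transmitted model size, and with it the communication coefficient β, is dominated by the flattened pooling output.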

B. MNIST Testing
We proceed with the testing of all three approaches for four different values of the proximal importance hyperparameter µ ∈ {0, 0.01, 0.1, 0.5}, as guided by the recommended values in [20]. µ values larger than 0.5 failed to produce productive results and only harmed the convergence outlook. The testing lasts for 200 rounds on the aforementioned CNN model. Aside from the three shown FL implementations, we also implement a classical ML with only one learner that contains all the data. We do this to extract the performance ceiling of the NN approach, which is 98% for the validation accuracy and 0.0602 validation loss, paired with a training accuracy of 98.85% and a training loss of 0.0423.
In Fig. 4 we can notice a limited impact of changing the µ parameter of FedProx, most likely due to the small number of learners and the less significant straggler impact. This is expected given that [20] claim strong superiority over FedAvg in the cases of very large portions of stragglers. Interestingly, µ does not have a strong positive impact on the learning performance even in the case of MAX, and therefore a system designer would most likely introduce a weak proximal term of µ = 0.01. Additionally, using the ACT approach provides superior convergence, and in combination with µ = 0.01 achieves the best overall accuracy. In addition to this, the ACT and µ = 0.01 combination also keeps up with the performance of AAS with regards to the CO class after the first several rounds of convergence. To better investigate the behavior of the ACT approach we illustrate the evolution of the estimated contributions for learner k = 1 in Fig. 5, where $G_{1,i}$ is based on the performance of the learner estimated from the previous learning round as in Eq. (19). The overall conclusion here is that we achieve CO learning without tailoring the solution to discern which class is the CO. This is possible as the calculation of $G_{k,i}$ is focused around the uniqueness of the dataset at each learner. Here we can notice that increasing the strength of the proximal parameter through setting higher µ values equalizes the contributions between all three methods, particularly in the first 40 rounds. Moreover, when µ = 0.5 the contributions are stabilized and vary very little once the initial phase of 40 rounds has passed.
Most notably, the accuracy of AAS suffers significantly when µ = 0.5, which results in a performance that is equally matched to the MAX approach when detecting the CO. It is thus evident that a strong FedProx implementation harms total system accuracy and, above all, diminishes the impact of using resource allocation. Finally, we conclude that the task of learning MNIST is too simplistic for our assumed scenario of traffic monitoring, and thus we continue with testing the FMNIST dataset in the following subsection.

C. FMNIST Testing
Since modeling common tasks of computer vision on MNIST is a very easy task, we repeat the test on the FMNIST dataset. This dataset consists of 10 classes of fashion accessories in the same distribution as the MNIST dataset. Compared to the digits of MNIST, in FMNIST the intensity of each pixel plays a much bigger role and is scattered across larger parts of the image. We consider the FMNIST dataset as a computer vision task that sufficiently replicates the problem of detecting 10 different types of vehicles, in a much more simplistic context that is furthermore easily replicable. In Fig. 6, we show the learning performance in the same setting and µ ∈ {0, 0.01, 0.1, 0.5}, across 200 rounds of training. The overall accuracy has clearly dropped, from 98% in the MNIST case to 88% in the best-case scenario of ACT with µ = 0.01 for FMNIST. The largest difference is that the increased difficulty of the learning problem introduces a lot more noise in the learning process, particularly for the CO class. Due to this, when using no FedProx (µ = 0), AAS does a good job at accelerating the learning process in the first 20 rounds until it is overtaken by ACT. Even though the combination of ACT with µ = 0.01 shows the best overall accuracy on the validation data, the accuracy of detecting the CO class with ACT never truly reaches the performance of AAS.
Finally, we conclude that even though µ = 0.1 and µ = 0.5 were viable in the MNIST run, the overall increased complexity of FMNIST harms the accuracy outlook for both, with the most severe impact on AAS. This experimental run therefore inspired us to investigate the issue of underfitting, and we proceed with testing FMNIST performance with a deeper model.

D. Deeper FMNIST Testing
In this testing scenario we expand the small convolutional neural network by adding another 3x3 convolutional layer with 64 channels and ReLU activation as the first layer. In Fig. 7 we show the outcomes of the testing, where the overall accuracy of the system has been improved to 90%. However, the larger model acted as an equalizer across all three approaches and in the case of µ = 0 generally gave equal performance both in convergence time and overall accuracy. It is also important to look at the validation loss after round i = 150, as it starts to diverge for both the ACT and MAX approaches. This did not directly map into the accuracy of the detection, but nonetheless is a first sign of possible overfitting and eventual divergence.
With the deeper model, this effect is diminished for the case of ACT with µ = 0.01, which manages to reach the best convergence time along with the best overall accuracy among all tested implementations. This accuracy is also paired with improved detection of the CO that exactly matches the AAS approach. As such, ACT with µ = 0.01 is both the best overall learning solution and the best CO detector. It is also interesting to notice that the MAX approach does well with overall accuracy, particularly when compared to its inferior performance in the previous testing sets. Nonetheless, MAX is still inferior to both other approaches when it comes to detecting the CO class. Finally, we focus on the results for µ = 0.5. When the proximal term has such a strong impact on the learning, all three approaches show inferior overall performance by 4-5 percentage points with regard to the best performing µ = 0.01. However, it is interesting to see that the impact is by far most severe on the AAS approach, even reducing the CO detection performance. Additionally, MAX gives the best result when it comes to learning the CO behavior for µ = 0.5. In contrast to the behavior in the MNIST testing, here AAS suffers from the increased complexity of the task, and in combination with a very strong proximal term reduces the overall learning of detection. This makes it easy to conclude that a strong proximal term reduces the effect of resource allocation efforts. We seek to discover the culprit for the inferiority of AAS in CO discovery when µ = 0.5 by plotting the contributions of learner k = 1 in Fig. 8. Looking at the contribution evolution in the case of µ = 0.5, we infer that AAS aims to keep the learner relevant while the reduced amount of learning across the whole network harms the potential contribution of all other nodes.
This leads us to the final conclusion of this experiment, which is that the ACT-based approach is extremely versatile in providing good CO detection and accuracy in the cases of µ = 0, a properly assigned µ, and an overly restrictive FedProx implementation.

E. Testing Fleeting FMNIST
The final test with the experimental setup is constructed such that we introduce stress in the learning process through temporary losses in the supervision process. This is done by introducing a likelihood that a learner k loses access to a detection class. This would be representative of a learner losing line of sight (LOS) of the object it was able to supervise, and is therefore modelled as a two-state Markov model (such as the Gilbert-Elliott model [43]) that has a good and a bad state. Hence each supervisor has a p = 0.9 chance to maintain supervision for that class (stay in the good state), and a 1 − p = 0.1 probability to lose supervision capability (and move to the bad state). If the vehicle loses supervision capabilities for that class, it has an r = 0.5 probability to maintain that state (remain in the bad state) and a 1 − r = 0.5 probability to regain supervision of that class. The values for the state transitions in the Gilbert-Elliott model were chosen with the experimental setup in mind so that not too much data is lost with regard to the previous testing setups. These testing parameters were chosen arbitrarily, because higher values would make the learning process very lengthy and impose unrealistic testing times for our experiment, while the chosen values still put a lot of stress on the learning system.
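The two-state supervision model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulation code used for the experiments: the function names `simulate_supervision` and `stationary_good_probability`, the use of Python's `random` module, and the seed value are all hypothetical. The only values taken from the text are p = 0.9, r = 0.5, the 250 rounds, and the start in the good state.

```python
import random


def simulate_supervision(rounds, p=0.9, r=0.5, seed=0):
    """Simulate a two-state (good/bad) chain for one learner-class pair.

    p: probability of staying in the good state (supervision is kept).
    r: probability of staying in the bad state (supervision stays lost).
    Returns a list of booleans, True meaning the class is supervised.
    The seed and the random generator are illustrative assumptions.
    """
    rng = random.Random(seed)
    good = True  # the chain starts in the good state
    states = []
    for _ in range(rounds):
        states.append(good)
        stay_probability = p if good else r
        if rng.random() >= stay_probability:
            good = not good  # switch state with probability 1 - stay_probability
    return states


def stationary_good_probability(p=0.9, r=0.5):
    """Long-run fraction of rounds spent in the good state."""
    return (1 - r) / ((1 - p) + (1 - r))


if __name__ == "__main__":
    trace = simulate_supervision(rounds=250)
    print("simulated fraction of supervised rounds:", sum(trace) / len(trace))
    print("stationary probability of the good state:", stationary_good_probability())
```

Under this assumed model, p = 0.9 and r = 0.5 give a stationary good-state probability of 5/6, which suggests that roughly one sixth of the learner-class pairs would be unsupervised in a typical round.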
Hence, to compensate for the smaller dataset, we let the simulations run for 250 rounds, and focus only on µ ∈ {0.01, 0.1}. The fleeting data is provided from the same seed and the Gilbert-Elliott model starts from the good state for every possible detection combination. In Fig. 9 we show the performance of all approaches on the aforementioned setup. Comparing this to the previous testing setup, we notice that the overall accuracy dropped by 1 percentage point for µ = 0.01 and 2 percentage points when µ = 0.1 due to the increased stress in the learning process. It is also apparent that both ACT and MAX show signs of overfitting (the diverging lines in the validation loss), which is improved when using µ = 0.1, at the cost of reducing the overall system accuracy by an additional 1 percentage point. Focusing on µ = 0.01, all methods achieve nearly the same overall accuracy, since the learning of the computer vision task is bottlenecked by the presence of the data. However, AAS is superior in CO detection, while showing a slightly inferior convergence time for overall accuracy (i.e., around the 50-round mark). In addition to this, AAS is the most data-sensitive approach and experiences the largest overall accuracy dips in situations where many detection classes are in the bad state (such as around the 55th round and the 127th round). Finally, to better observe the noisy training data, we plot a 10-point moving average in Fig. 10. Here we notice that in the common training scenario AAS and ACT perform rather equally when learning hidden information. However, in the presence of fleeting data, the ACT performance becomes very noisy and becomes slightly inferior to AAS with regard to CO learning performance. Nonetheless, as already mentioned, this CO learning performance of the AAS approach comes at a slight cost of general detection performance, in both the fleeting and the normal setting.
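For reference, a 10-point moving average of a per-round accuracy curve can be computed as in the sketch below. The text does not state whether the smoothing in Fig. 10 is trailing or centered, so the trailing window used here, along with the function name `moving_average`, is an assumption made for illustration only.

```python
def moving_average(values, window=10):
    """Trailing moving average over `values`.

    The first `window - 1` outputs average over the points seen so far,
    which is an assumed convention for handling the start of the curve.
    """
    smoothed = []
    running_sum = 0.0
    for i, value in enumerate(values):
        running_sum += value
        if i >= window:
            running_sum -= values[i - window]  # drop the point leaving the window
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed


if __name__ == "__main__":
    # Hypothetical noisy accuracy values, used only to exercise the function.
    noisy_accuracy = [0.50, 0.70, 0.55, 0.80, 0.60, 0.85, 0.65, 0.90, 0.70, 0.88, 0.72]
    print(moving_average(noisy_accuracy))
```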

F. Key Takeaways
We condense several takeaways that were derived from all four experimental runs. The initial and most important conclusion is that the concepts of resource allocation and FedProx are at odds in the case of FL implementations. In more detail, the goal of FedProx is to reduce the impact of each learner individually, while resource allocation methods strive to improve the overall performance by exploiting or compensating for the heterogeneity of the system. Hence the impact of resource allocation methods is diminished when strengthening the role of the proximal term. Nonetheless, across the tests a safe balance between µ and resource allocation ensures good learning behavior. As such, we recommend that all future works consider perturbed gradient descent implementations, such as FedProx, when dealing with non-IID data in heterogeneous FL.
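The tension between the proximal term and resource allocation can be illustrated on a toy problem. The sketch below is not the experimental code: it assumes a scalar local loss 0.5 (w − target)^2, adds the FedProx proximal term (µ/2)(w − w_global)^2, and assumes that resource allocation mainly changes how many local gradient steps a learner completes. The function names, the learning rate, and the step counts are hypothetical.

```python
def fedprox_local_update(w_global, target, mu, steps, lr=0.1):
    """Gradient descent on 0.5*(w - target)**2 + (mu/2)*(w - w_global)**2.

    The quadratic local loss is a toy stand-in for a learner's objective.
    """
    w = w_global
    for _ in range(steps):
        gradient = (w - target) + mu * (w - w_global)  # loss gradient + proximal pull
        w -= lr * gradient
    return w


def drift(mu, steps, w_global=0.0, target=1.0):
    """Distance the local model moves away from the global model."""
    return abs(fedprox_local_update(w_global, target, mu, steps) - w_global)


if __name__ == "__main__":
    for mu in (0.0, 0.01, 0.1, 0.5):
        few, many = drift(mu, steps=5), drift(mu, steps=50)
        print(f"mu={mu}: drift after 5 steps={few:.3f}, "
              f"after 50 steps={many:.3f}, gap={many - few:.3f}")
```

In this toy setting the local optimum sits at 1 / (1 + µ) instead of 1, so a larger µ shrinks the gap between a learner that completes few local steps and one that completes many. This is one way to read why stronger proximal terms leave less room for resource allocation to make a difference.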
Additionally, in the initial testing of our setup we noticed that testing on MNIST is not sufficient to provide reasonable results for the implementations, due to how trivial the task of recognising digits is. Moreover, FL implementations, such as the proposed drone implementation, are based on the distributed learning of complex tasks and require deeper NN models. In such cases, it was evident that increasing the total number of computed epochs benefits the convergence time of the system, with potentially harmful effects on CO detection accuracy. Moreover, deeper model implementations did not behave well under strong proximal terms.
As a consequence of this, learning hidden data can be addressed by equalizing the contributions through AAS or by introducing strong proximal terms. However, strong proximal terms have the potential to slow down the convergence time for all nodes. Hence, the safest implementation for achieving the best combination of convergence time, overall accuracy and CO learning rate is the ACT approach with a weak proximal term.
Finally, in a case where the data is fleeting, using µ > 0 was crucial to reach stable learning performance. In this setting, the low availability of data acted as a lower bound for all learning implementations, but most importantly harmed the convergence time performance of AAS. This is understandable since AAS was the approach that cumulatively computed the fewest epochs at each round. On the other hand, the ACT approach maintained superior performance to both static approaches by maintaining good CO detection performance and great convergence times.
Lastly, we infer that defining a proper µ is crucial. However, the hyperparameter needs to be defined ahead of the deployment of the system. As such, since we would not have access to the training data, the feasibility of implementing AAS is uncertain, especially for situations where the presence of data changes quickly. This gives another strong motivation for using reactive measures based on contributions and incentive calculations, such as ACT.

V. CONCLUSION
In this paper we investigated the learning process in a novel Federated Learning (FL) architecture, where a DTM acts as an orchestrator and traffic participants act as supervisors on its model. Such an implementation expects impairments on the learning process due to unbalanced and non-IID data scattered across heterogeneous learners that have variable computational equipment. We therefore test the ability of two static methods (AAS and MAX) and one incentive-based reactive (ACT) resource allocation method to improve the speed of learning CO classes and maintain good overall model accuracy. The validity of the methods was tested with an experimental FL implementation that uses the novel FedProx algorithm to learn from the MNIST and FMNIST datasets. The testing was conducted across combinations of different FedProx strength, CNN model depth, and fleeting data. From the testing we conclude that both reactive (ACT) resource allocation and FedProx are essential to securing model accuracy. In more detail, due to the inability to anticipate the distribution of the data across the learners, the use of ACT ensures proper operation of the FL implementation. Accordingly, the combination of properly set FedProx with an ACT implementation provided faster convergence times and better accuracy, and most importantly it matched the AAS method in learning to recognize the CO. Such behavior was consistent across most runs given the varying task complexity, model size, and data presence. The goal of future works would be to look into more advanced proactive approaches, especially for the presence of imperfect data supervision.

CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

AUTHOR CONTRIBUTIONS
ID: investigation, writing; JJN and PP: writing, review, editing, resources, funding acquisition, supervision, and project administration.

FUNDING
The work was supported by the European Union's research and innovation programme under the Marie Sklodowska-Curie grant agreement No. 812991 "PAINLESS" within the Horizon 2020 Program.