Multi-armed bandit based device scheduling for crowdsensing in power grids

With the increasing number of devices in power grids, a critical challenge emerges: how to collect information from massive numbers of devices and how to manage them. Mobile crowdsensing is a large-scale sensing paradigm empowered by ubiquitous devices and can achieve more comprehensive observation of the area of interest. However, collecting sensing data from massive devices is not easy due to the scarcity of wireless channel resources, the large amount of sensing data, and the different capabilities among devices. To address these challenges, device scheduling is introduced, which chooses a subset of mobile devices in each time slot to collect more valuable sensing data. However, the lack of prior knowledge makes the device scheduling task hard, especially when the number of devices is huge. Thus, the device scheduling problem is reformulated as a multi-armed bandit (MAB) program in which the participation fairness of sensing devices with different coverage regions must be guaranteed. To deal with the MAB program, a device scheduling algorithm is proposed on the basis of the upper confidence bound policy and virtual queue theory. Besides, we conduct a regret analysis and prove that the performance regret of the proposed algorithm grows sub-linearly under certain conditions. Finally, simulation results verify the effectiveness of the proposed algorithm in terms of performance regret and convergence rate.


Introduction
Nowadays, the development of smart power grids brings much convenience to human life and production. Meanwhile, more and more devices, such as sensors and actuators, are deployed in power grids, e.g., in substations, transformers, and generators. Consequently, a critical challenge arises on how to collect information from massive devices and how to manage these devices. Mobile crowdsensing is a large-scale sensing paradigm empowered by ubiquitous devices. These devices interact with each other by sharing local knowledge according to the data they have perceived, and then the information can be further aggregated and fused in a central node for crowd intelligence extraction, decision-making, and service delivery (Guo et al., 2014).
However, collecting sensing data from massive devices is not easy for the following reasons. Firstly, the scarce channel resource limits the number of devices that can simultaneously access an edge server. That is to say, the available wireless channels are fewer than the sensing devices. Secondly, the overlap of the perception areas of different devices introduces sensing data redundancy. Besides, the system heterogeneity of sensing devices, such as processing capability, network connectivity, and battery capacity, leads to different processing capabilities (Xia et al., 2021). The system heterogeneity causes a drift of global statistical characteristics since the fast devices can collect more data according to their local observations. To achieve a more comprehensive observation of the area of interest, one should guarantee the participation fairness of sensing devices with different coverage regions. Therefore, the edge server has to perform device scheduling, i.e., choosing a subset of sensing devices in each time slot, to collect more valuable sensing data. However, the lack of prior knowledge makes the device scheduling task hard, especially when the number of sensing devices is huge.
There have been some works on device scheduling in crowdsensing tasks. For example, the authors in (Chu et al., 2013) proposed a selection scheme of individual sensors to collect data in different regions in order to optimize some specified objective while satisfying constraints on the number and costs of sensors. The authors in (Han et al., 2016) chose from a set of available participants to maximize sensing revenue under a limited budget. The authors in (Sun and Tang, 2019) proposed a greedy scheduling algorithm to find data-giver vehicles for every subtask with minimized cost in vehicular crowdsensing. The work in (Han et al., 2015) considered an online scheduling problem that determined sensing decisions for smartphones that were distributed over different regions of interest. The work in (Nguyen and Zeadally, 2021) studied a participant selection problem that aimed to maximize the number of event records reported by fewer users. Different from (Han et al., 2015; Han et al., 2016; Sun and Tang, 2019; Nguyen and Zeadally, 2021), the work in (Gendy et al., 2020) aimed to maximize the percentage of the accomplished sensing tasks in a given period, by modeling the interaction between the participating devices and sensing task publishers as auctions. However, these works did not take into account the effects of dynamic wireless channels on sensing performance. Besides, most of them performed sensing device scheduling under the assumption that some statistical information is available in advance, which is usually resource-consuming and even impractical, especially when the number of sensing devices is huge. Motivated by this fact, we aim to propose an online scheduling algorithm to find device scheduling decisions in crowdsensing tasks.
Recently, the rapid development of reinforcement learning (RL) techniques sheds light on the considered problem. Among these RL techniques, the multi-armed bandit (MAB) program is thought of as an important tool and has been widely adopted for scheduling and resource allocation problems. For example, MAB has been applied to advertisement placement, multi-antenna beam selection (Cheng et al., 2019), packet routing, offloading (Sun et al., 2018;Chen and Xu, 2019), caching (Blasco and Gündüz, 2014;Sengupta et al., 2014), and so on. In this work, we reformulate the sensing device scheduling problem as an MAB program, based on which a device scheduling algorithm is also proposed. The contributions of this work are summarized as follows.
• Considering the scarcity of wireless channel resources, we formulate a device scheduling problem in crowdsensing scenarios. We take into account not only the availability of devices caused by dynamic wireless channels but also fairness among the devices for a more comprehensive observation of the area of interest. Besides, no prior information about devices is available.
• Then, the device scheduling problem is reformulated as an MAB problem, based on which an online scheduling algorithm is proposed. The proposed algorithm incorporates the upper confidence bound (UCB) policy and virtual queue theory, and its regret performance is also analyzed in this work.
• Finally, simulations are conducted to verify the effectiveness of the proposed algorithm. The balance between the time needed to reach a point that meets the fairness constraints of devices and the performance regret is revealed.

System model
Consider a system consisting of an edge server and a set K = {1, 2, …, K} of crowdsensing devices (e.g., sensors, cameras, and so on), as shown in Figure 1. These devices are responsible for collecting raw data from the observed events or objects, pre-processing the raw data into samples, and finally transmitting these samples to the edge server for processing tasks, such as statistical analysis and training a neural network for classification. For simplicity, we assume that the samples generated by different devices have the same size δ. Since the observed events can be periodic or aperiodic, and the observed objects have different activity characteristics, the amount of raw data collected by different devices is different. Other factors such as device location and perception ability also influence the amount of raw data collected by different devices. In addition, the processing capabilities of different devices are heterogeneous. Taking these facts into account and for simplicity, we assume that time is slotted and that the number of newly generated samples of device k ∈ K in time slot τ, N_k(τ), is independently and identically distributed (i.i.d.) according to some unknown distribution whose expectation ν_k is also unknown a priori. Thus, the total number of samples of device k waiting for uplink transmission at the beginning of time slot τ is given by Eq. 1, where [x]^+ = max{0, x}, M_max is the largest number of samples that each device can store due to the limited storage space, and L_k(τ) is the number of samples that device k has transmitted to the edge server in time slot τ, which will be specified in the following.
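The buffer dynamics described above can be sketched as follows. The exact form of Eq. 1 is not reproduced in this text, so the sketch assumes one plausible reading: the backlog left after transmission (floored at zero) plus the new arrivals, capped at M_max. The function name `update_backlog` and the order of operations are assumptions made for illustration.

```python
def update_backlog(backlog, transmitted, arrivals, m_max):
    """One-slot backlog update (assumed form; Eq. 1 of the paper may differ).

    backlog:     samples waiting at the start of the previous slot
    transmitted: samples sent to the edge server in that slot (L_k)
    arrivals:    samples newly generated in that slot (N_k)
    m_max:       storage limit M_max of the device
    """
    remaining = max(0, backlog - transmitted)  # [x]^+ = max{0, x}
    return min(remaining + arrivals, m_max)    # cannot exceed the storage limit
```

Under this reading, arrivals that would push the buffer past M_max are simply dropped.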

Transmission model
The orthogonal frequency-division multiple access technique is adopted and there are F_max orthogonal channels, each with the same bandwidth w, that can be used for uplink transmission simultaneously. The channel h_k between the edge server and device k is i.i.d.; it is assumed to be constant within a time slot but to vary independently across different time slots. The achievable uplink rate of device k in time slot τ is computed as in Eq. 2, where σ^2 denotes the noise power and p_k denotes the transmit power of device k. Then, the number of samples that can be transmitted to the edge server is given by Eq. 3, where Δt is the duration of a time slot.
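As a rough illustration of the transmission model, the sketch below computes an uplink rate and the resulting number of samples per slot. It assumes the standard Shannon-capacity form w·log2(1 + p_k|h_k|²/σ²) for the rate and rounds the sample count down to an integer; both choices, as well as the function names, are assumptions rather than the paper's exact Eq. 2 and Eq. 3.

```python
import math


def dbm_to_watt(power_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((power_dbm - 30) / 10)


def uplink_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Shannon rate in bit/s; channel_gain is |h_k|^2 (assumed rate model)."""
    snr = tx_power_w * channel_gain / noise_power_w
    return bandwidth_hz * math.log2(1 + snr)


def samples_per_slot(rate_bps, slot_s, sample_bits):
    """Whole samples of size sample_bits that fit into one slot (assumed floor)."""
    return math.floor(rate_bps * slot_s / sample_bits)
```

For example, `samples_per_slot(uplink_rate(w, p, g, n), dt, delta)` gives the per-slot transmission capacity of a device for bandwidth `w`, transmit power `p`, channel gain `g`, noise power `n`, slot length `dt`, and sample size `delta`.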

Available channel constraint
When the edge server collects the generated samples from the devices, some devices can be unavailable for uplink transmission. For example, a device may experience poor channel conditions due to external interference, or it may be unable to work in the transmission mode while collecting raw data due to power constraints. We introduce the binary variable a_k(τ) to indicate the availability state of device k in time slot τ. Specifically, a_k(τ) = 1 indicates that device k can work in the transmission mode in time slot τ, and a_k(τ) = 0 otherwise. Let Z(τ) = {k ∈ K | a_k(τ) = 1} ∈ B(K) denote the set of available devices that can work in the transmission mode in time slot τ, where B(K) is the power set of K. We assume that Z(τ) is i.i.d. over time with a distribution P_Z(e) = P(Z(τ) = e), e ∈ B(K), that is unknown a priori, but Z(τ) is revealed to the edge server at the beginning of each time slot τ. Then, L_k(τ) is specified by Eq. 4. In the considered system, there can be a huge number of devices, but the number of channels available at the same time is limited. Therefore, the edge server has to select a subset W(τ) of the available devices that meets the available channel constraint in Eq. 5, i.e., |W(τ)| ≤ F_max, where |W(τ)| denotes the cardinality of W(τ).
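To make the available channel constraint concrete, the following sketch enumerates every subset of the available devices whose size does not exceed the number of channels. The helper name `feasible_super_arms` is hypothetical, and including the empty subset is an assumption made so that a slot with no available device still has a valid choice.

```python
from itertools import combinations


def feasible_super_arms(available, f_max):
    """List all subsets W of `available` with |W| <= f_max.

    available: iterable of device indices that are available in this slot
    f_max:     number of orthogonal channels
    The empty subset is included (an assumption of this sketch).
    """
    devices = sorted(available)
    largest = min(f_max, len(devices))
    arms = []
    for size in range(largest + 1):
        arms.extend(combinations(devices, size))
    return arms
```

The number of feasible subsets grows combinatorially with the number of available devices, so explicit enumeration is only practical for small systems.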

Fairness constraint
In order to achieve more comprehensive observation of the area of interest or better performance of computational tasks such as training a neural network, besides collecting as many samples as possible, the edge server is required to collect samples from different devices to ensure the diversity of samples. Thus, fairness among the devices is also an important issue that should be addressed in many practical applications. Here, a binary variable b_k(τ) is introduced, with b_k(τ) = 1 if device k is chosen to transmit its samples to the edge server in time slot τ and b_k(τ) = 0 otherwise. With the definition of b_k(τ), we formulate the fairness constraint as in Eq. 6, where T represents the total number of time slots, c_k ∈ (0, 1) represents the minimum fraction of time slots in which device k is required to transmit its samples, and E[·] is the expectation operator. We collect c_k, k ∈ K, into a vector c = [c_1, c_2, …, c_K]^T, and c is regarded as a feasible fairness constraint if there is a policy that generates a decision sequence {W(τ), τ ≥ 1} such that the fairness constraint in Eq. 6 is satisfied.
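The fairness requirement can be checked empirically on a recorded schedule, as in the sketch below. It replaces the expectation in the constraint by the observed selection fraction over a finite horizon, which is only a proxy; the function names and the 0-based device indexing are assumptions.

```python
def selection_fractions(history, num_devices):
    """Fraction of slots in which each device was chosen.

    history: one collection of chosen device indices (0-based) per time slot
    """
    if not history:
        raise ValueError("history must contain at least one time slot")
    counts = [0] * num_devices
    for chosen in history:
        for k in chosen:
            counts[k] += 1
    return [count / len(history) for count in counts]


def meets_fairness(history, c):
    """True if every device k was chosen in at least a fraction c[k] of slots."""
    fractions = selection_fractions(history, len(c))
    return all(f >= c_k for f, c_k in zip(fractions, c))
```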

Problem formulation
In this work, we aim to optimize a time sequence {W(τ), τ ≥ 1} that maximizes the number of samples received at the edge server within a given time horizon of T time slots. The underlying problem can be formulated as problem Eq. 7, subject to the available channel constraint in Eq. 5 and the fairness constraint in Eq. 6. This problem is hard to solve because the distribution of the number of newly generated samples and the distribution of the wireless channels are unknown. Besides, the fairness constraint and the available channel constraint make problem Eq. 7 more challenging. Fortunately, the MAB framework sheds light on solutions to problem Eq. 7.

Proposed algorithm
In this section, we first introduce a stationary policy optimization program to deal with the uncertainty of device availability. Then, the device scheduling problem is reformulated as an MAB program, based on which an arm-pull algorithm is proposed to determine the decision sequence.

Problem reformulation
In this work, to simplify the scheduling complexity, a class of stationary policies named Z-only policies is introduced, in which a super arm W(τ) ∈ Y(Z(τ)) is selected according to the observed Z(τ) only in each time slot τ (Neely, 2010), where Y(Z(τ)) denotes the set of all possible subsets when Z(τ) is observed. According to Theorem 4.5 in (Neely, 2010), if c lies strictly inside the maximum feasibility region C, a Z-only policy that meets the fairness constraint in Eq. 6 always exists.
We further use a vector of probability distributions q = [q_W(e), ∀W ∈ Y(e), ∀e ∈ B(K)] to describe a Z-only policy π with ∑_{W∈Y(e)} q_W(e) = 1, ∀e ∈ B(K). Then, we compute the mean of b_k(τ) as in Eq. 8, and the scheduling problem can be rewritten as problem Eq. 9, which is a linear program if the expectation of L_k is known a priori. However, this assumption does not usually hold in practice, and the edge server needs to estimate the average number of samples received from device k per time slot to make scheduling decisions.
To address this issue, we introduce the MAB program.

Multi-armed bandit program
An MAB program is a machine learning framework where a player chooses a sequence of actions (arms) in order to maximize its cumulative reward in the long term (Lattimore and Szepesvári, 2020). We can model problem Eq. 7 as an MAB problem, in which the edge server and the devices play the roles of the player and the arms, respectively. Each subset W(τ) of available arms is treated as a super arm. Correspondingly, we can interpret the objective of problem Eq. 7 as determining a time sequence of super arms to maximize the cumulative reward (i.e., the number of samples received at the edge server).
In the MAB program, there is an expected reward for each arm, but such statistical information is unknown to the player, which makes arm selection challenging. The main basis for determining actions is the observation of the state in the current round and the experience gathered in previous rounds. More specifically, the arms that performed well in the past should be associated with higher priority. In the meantime, the player continues to explore the expected payoffs of the other arms. In other words, the player has to balance the need to acquire more knowledge about the reward distribution of each arm (exploration) against the need to optimize rewards based on its current knowledge (exploitation) (Bubeck and Cesa-Bianchi, 2012). The exploration-exploitation dilemma inevitably causes performance loss, and regret is the most popular metric for evaluating the learning performance in MAB works. It is defined as the difference between the reward r* and the average reward in a given period of time (Lai and Robbins, 1985). Here, r* is the maximum achievable reward of problem Eq. 9 when the expectation of L_k is known for all k ∈ K. Therefore, the original problem Eq. 7 can be reformulated as a cumulative regret minimization under policy π by determining a super arm W(τ) in each time slot τ, i.e., problem Eq. 10, which is subject to the constraints in Eq. 5 and Eq. 6 and in which the reward is normalized as l_k(τ)/M_max.

Algorithm design
When designing an algorithm for problem Eq. 10, three challenges need to be addressed: 1) how to maximize the cumulative reward when the reward expectation of each arm is unknown, 2) how to choose a super arm under the available channel constraint, and 3) how to meet the fairness constraint. The first two challenges can be handled by extending the classic UCB algorithm (Auer et al., 2002), but meeting the fairness constraint requires additional methods. Inspired by (Neely, 2010; Li et al., 2019), we adopt the virtual queue technique, which has the potential to handle the fairness constraint. Specifically, a virtual queue is built for each arm k as in Eq. 11, where [x]^+ = max{0, x} and D_k(τ) represents the length of the virtual queue of arm k at the beginning of time slot τ. Define ϱ_k(τ) = ∑_{τ′=1}^{τ} b_k(τ′) as the number of times arm k has been chosen and υ_k(τ) as the empirical mean of the reward of arm k by the end of time slot τ. The update rules of υ_k(τ) and ϱ_k(τ) are given in Eq. 12 and Eq. 13, respectively. If ϱ_k(τ) = 0, we set υ_k(τ) = 0. Note that both ϱ_k(0) and υ_k(0) are initialized to 0. We estimate the mean reward of each arm k according to a truncated UCB method (Li et al., 2019), as in Eq. 14, where the UCB estimate is set to 1 if ϱ_k(τ − 1) = 0. Then, a super arm is selected in each time slot τ according to Eq. 15, where α ∈ (0, 1] is a weighting value. Finally, the whole algorithm is summarized in Algorithm 1.
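A minimal sketch of how the three ingredients fit together is given below. Since the display equations are not reproduced in this text, several forms are assumptions: the virtual queue follows the common update D_k ← [D_k + c_k − b_k]^+, the truncated UCB index uses the exploration bonus sqrt(3 ln τ / (2 ϱ_k)) capped at 1, and the selection weight is taken as α·D_k plus the UCB index, which is consistent with the statement that a smaller α makes the reward dominant. The class name `FairUCBScheduler` and its methods are hypothetical, rewards are assumed to be normalized to [0, 1], and this is not a transcription of Algorithm 1.

```python
import math


class FairUCBScheduler:
    """UCB scheduling with virtual queues (illustrative sketch, assumed forms)."""

    def __init__(self, c, f_max, alpha):
        self.c = list(c)            # fairness fractions c_k
        self.f_max = f_max          # number of channels
        self.alpha = alpha          # weight on the virtual queue (assumed role)
        n = len(self.c)
        self.queue = [0.0] * n      # virtual queue lengths D_k
        self.count = [0] * n        # number of times each arm was chosen
        self.mean = [0.0] * n       # empirical mean reward of each arm
        self.slot = 0

    def ucb(self, k):
        """Truncated UCB index; equals 1 for an arm that was never chosen."""
        if self.count[k] == 0:
            return 1.0
        bonus = math.sqrt(3 * math.log(self.slot) / (2 * self.count[k]))
        return min(self.mean[k] + bonus, 1.0)

    def select(self, available):
        """Pick up to f_max available arms with the largest weights."""
        self.slot += 1
        weight = {k: self.alpha * self.queue[k] + self.ucb(k) for k in available}
        # All weights are non-negative and additive, so taking the top
        # f_max arms maximizes the total weight of the super arm.
        ranked = sorted(weight, key=lambda k: (-weight[k], k))
        return ranked[: self.f_max]

    def update(self, chosen, rewards):
        """Update queues and estimates; rewards maps chosen arm -> value in [0, 1]."""
        chosen = set(chosen)
        for k in range(len(self.c)):
            b = 1 if k in chosen else 0
            self.queue[k] = max(0.0, self.queue[k] + self.c[k] - b)
            if b:
                self.count[k] += 1
                self.mean[k] += (rewards[k] - self.mean[k]) / self.count[k]
```

In each slot, the caller would pass the observed set of available devices to `select`, transmit from the returned devices, and then call `update` with the normalized numbers of received samples. An arm that is skipped accumulates queue length at rate c_k, which raises its weight until it is scheduled again.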

Regret analysis
We first introduce a lemma that specifies the upper bound on the expected regret of the proposed algorithm.

FIGURE 1
The considered system with an edge server and K devices. Different colored areas represent the perception areas of different devices.

FIGURE 4
Performance comparison of the proposed algorithm with the round-robin algorithm.

Remark 1. Given 0 < α ≤ 1/√T and a large value of T, we can simplify the upper bound in Eq. 16 as in Eq. 17, which suggests that the performance regret increases at a sub-linear rate (i.e., O(√(T ln T))) over time.

Simulation results
In this section, we provide simulation results to verify the effectiveness of the proposed algorithm. We consider a disc area with a radius of 200 m and a single-antenna access point (AP) equipped with an edge server located at the center of the considered area. The transmit power of each sensing device k is set to 23 dBm and the noise power σ^2 is set to −107 dBm. The channel response h_k is computed as the product of √β_k and a small-scale fading coefficient, where β_k stands for large-scale fading. The small-scale fading is represented by i.i.d. zero-mean complex Gaussian variables with unit variance. The large-scale fading is determined according to the path-loss model PL [dB] = 128.1 + 37.6 log_10(d), where d stands for the distance in km (Dahrouj and Yu, 2010). The number of newly generated samples N_k(τ) is assumed to be uniformly distributed in [N_k^LB, N_k^UB], where N_k^LB = (0.5k + 0.5) × 20 and N_k^UB = (0.5k + 1.5) × 80, ∀k ∈ K. We also assume the availability of each sensing device to be i.i.d., modeled as a binary random variable with a mean of 0.9. Besides, we assume M_max = 500 samples, δ = 100 bits/sample, and a time slot length of Δt = 0.1 s. We consider a system with K = 3 sensing devices randomly distributed within the coverage of the AP. However, only F_max = 2 channels are available and the bandwidth of each orthogonal channel is set to 15 kHz. The fairness constraint factors are c_1 = 0.7, c_2 = 0.5, and c_3 = 0.6. Here, we define Ω_1 and Ω_2 as the cumulative performance gap and average performance gap, respectively, with Ω_1 = Σ_π M_max and Ω_2 = Ω_1/T, which describe the difference between the optimal solution found by solving problem Eq. 9 and the solution found using the proposed algorithm (or baseline algorithms). Note that the optimal solution found by solving problem Eq. 9 satisfies the fairness constraint.
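The channel model used in the simulations can be sketched as follows. The code draws one channel response from the stated path-loss model and unit-variance complex Gaussian fading; the function names are assumptions, and the conversion from path loss in dB to a linear gain via 10^(−PL/10) is the usual convention rather than something stated explicitly in the text.

```python
import math
import random


def path_loss_db(distance_km):
    """Path loss in dB: 128.1 + 37.6 log10(d), with d in km."""
    return 128.1 + 37.6 * math.log10(distance_km)


def large_scale_gain(distance_km):
    """Linear large-scale gain, assuming beta = 10^(-PL/10)."""
    return 10 ** (-path_loss_db(distance_km) / 10)


def sample_channel(distance_km, rng):
    """Draw one complex channel response for a device at the given distance."""
    # A unit-variance complex Gaussian has real and imaginary parts
    # that are each zero-mean with variance 1/2.
    std = math.sqrt(0.5)
    small_scale = complex(rng.gauss(0.0, std), rng.gauss(0.0, std))
    return math.sqrt(large_scale_gain(distance_km)) * small_scale
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the draws reproducible.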
For comparison, we introduce a modified version of the proposed algorithm, which does not take into account the fairness constraint. More specifically, the UCB algorithm in the modified version does not use the virtual queue technique. In Figure 2, we compare the proposed algorithm under different α values with the modified version. We find that the performance gap of the modified algorithm is the smallest and is even negative, because the modified algorithm does not need to meet the fairness constraint, which may lead to biased observations of the area of interest. We also observe that the cumulative performance gap of the proposed algorithm grows sub-linearly. Besides, at first glance, the proposed algorithm with a smaller α value not only meets the fairness constraint but also enjoys better performance, which is more attractive. This is because a smaller α value makes the reward of each device dominant and the fairness constraint insignificant. However, what is missing in Figure 2 is the convergence time needed to meet the fairness constraint, which is also an important metric that should be taken into account in practice.
Figure 3 shows the change in the selection fractions of different devices over the time slots. Here, the selection fraction is defined as the ratio of the number of time slots in which a device is chosen to the total number of time slots. We find that the curves of all the arms obtained by the proposed algorithm meet the fairness constraints eventually, no matter which α value is taken. The modified algorithm is unaware of the fairness constraints and thus is not guaranteed to meet them. In addition, it is observed that a smaller α value leads to more time consumption before convergence is achieved, and the convergence rate of the curve with α = 0.001 is the slowest.
To further validate the effectiveness of the proposed algorithm, we consider a scenario with K = 20 devices and introduce the round-robin algorithm as a baseline, as shown in Figure 4. According to the results in Figure 4, we find that more samples are collected with the increase of the number of available channels. In addition, the proposed algorithm always achieves better performance than the round-robin algorithm.
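The text does not detail how the round-robin baseline handles unavailable devices, so the sketch below shows one common variant: devices are scanned cyclically from a moving pointer and unavailable ones are skipped. The function name `round_robin_select` and the pointer-advance rule are assumptions.

```python
def round_robin_select(available, f_max, num_devices, start):
    """Pick up to f_max available devices, scanning cyclically from `start`.

    Returns the chosen devices and the pointer to use in the next slot.
    Skipping unavailable devices is an assumption of this sketch.
    """
    if f_max <= 0:
        return [], start
    chosen = []
    for offset in range(num_devices):
        k = (start + offset) % num_devices
        if k in available:
            chosen.append(k)
            if len(chosen) == f_max:
                break
    # Resume after the last served device; keep the pointer if none was served.
    next_start = (chosen[-1] + 1) % num_devices if chosen else start
    return chosen, next_start
```

Because this baseline ignores the observed rewards, it serves devices evenly but cannot favor devices that deliver more samples.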

Conclusion
In this work, we considered the sensing device scheduling problem in mobile crowdsensing tasks, which suffers from the scarcity of wireless channel resources, the lack of prior knowledge, and the different capabilities among devices. To address these challenges, we reformulated the device scheduling problem as an MAB program in which the participation fairness of sensing devices with different coverage regions must be guaranteed. Then, we proposed a device scheduling algorithm on the basis of the UCB policy and virtual queue theory, whose performance regret was also analyzed. Finally, numerical simulations were conducted to verify the effectiveness of the proposed algorithm.

Data availability statement
The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

Author contributions
JZ contributed to the conception and prepared the first draft of the manuscript. YN performed the numerical simulations. HZ improved the writing of the manuscript. All authors approved the submitted version of the manuscript.

Funding
This work was supported in part by National Key R&D Program of China (2021ZD0140405), in part by Natural Science Foundation of Jiangsu Province (BK2021022532), in part by Jiangsu University Philosophy and Social Science Research Fund (2022SJYB0517).