# Adaptive Discrete Motion Control for Mobile Relay Networks

^{1}Electrical and Computer Engineering, Rutgers, The State University of New Jersey, New Brunswick, NJ, United States

^{2}Electrical Engineering, Yale University, New Haven, CT, United States

We consider the problem of joint beamforming and discrete motion control for mobile relaying networks in dynamic channel environments. We assume a single source-destination communication pair. We adopt a general time-slotted approach where, during each slot, every relay implements optimal beamforming and estimates its optimal position for the subsequent slot. We assume that the relays move in a 2D compact square region that has been discretized into a fine grid. The goal is to derive discrete motion policies for the relays, in an adaptive fashion, so that they accommodate the dynamic changes of the channel and, therefore, maximize the Signal-to-Interference-plus-Noise Ratio (SINR) at the destination. We present two different approaches for constructing the motion policies. The first approach assumes that the channel evolves as a Gaussian process and exhibits correlation with respect to both time and space. A stochastic programming method is proposed for estimating the relay positions (and the beamforming weights) based on causal information. The stochastic program is equivalent to a set of simple subproblems, and the exact evaluation of the objective of each subproblem is impossible. To tackle this, we propose a surrogate of the original subproblem based on the Sample Average Approximation method. We denote this approach as model-based because it assumes that the underlying correlation structure of the channels is completely known. The second method is denoted as model-free, because it makes no assumptions about the channel statistics. For this approach, we cast the problem of discrete relay motion control in a dynamic programming framework and employ deep Q learning to derive the motion policies. We provide implementation details that are crucial for achieving good performance in terms of the collective SINR at the destination.

## 1 Introduction

In distributed relay beamforming networks, spatially distributed relays synergistically support the communication between a source and a destination (Havary-Nassab et al., 2008a; Li et al., 2011; Liu and Petropulu, 2011). The concepts of distributed beamforming hold the promise of extending the communication range and of minimizing the transmit power that is being wasted by being scattered to unwanted directions (Barriac et al., 2004).

Intelligent node mobility has been studied as a means of improving the Quality-of-Service (QoS) in communications. In (Chatzipanagiotis et al., 2014), the interplay of relay motion control and optimal transmit beamforming is considered with the goal of minimizing the relay transmit power, subject to a QoS-related constraint. In (Kalogerias et al., 2013), optimal relay positioning in the presence of an eavesdropper is considered, aiming to maximize the secrecy rate. In the context of communication-aware robotics, motion has been controlled with the goal of maintaining in-network connectivity (Yan and Mostofi, 2012; Yan and Mostofi, 2013; Muralidharan and Mostofi, 2017).

In this work, we examine the problem of optimizing the sequence of relay positions (relay trajectory) and the beamforming weights so that some SINR-based metric is maximized at the destination. The assumption that we adopt is that the channel evolves as a stochastic process that exhibits spatiotemporal correlations. Intrinsically, optimal relay positioning requires the knowledge of the Channel State Information (CSI) in all candidate positions at a future time instance. This is almost impossible to achieve since the channel varies with respect to time and space. Nonetheless, since the channel exhibits spatiotemporal correlations (induced by the shadowing propagation effect (Goldsmith, 2005; MacCartney et al., 2013) that is prominent in urban environments), it can be, explicitly or implicitly, predicted. We follow two different directions, when it comes to the discrete relay motion control.

The first direction (Kalogerias and Petropulu, 2018; Kalogerias and Petropulu, 2016), which we call model-based, formulates a stochastic program that computes the beamforming weights and the subsequent relay positions so that some SINR-based metric at the destination is maximized, subject to a total relay power budget, assuming the availability of causal CSI. This 2-stage problem is equivalent to a set of 2-stage subproblems that can be solved in a distributed fashion, one by each relay. The objective of each subproblem cannot be evaluated analytically, so an efficient approximation is proposed that acts as a surrogate of the initial objective. The surrogate relies on the Sample Average Approximation (SAA) (Shapiro et al., 2009). The term “model-based” is not to be confused with model-based reinforcement learning. We use it because this direction assumes complete knowledge of the underlying correlation structure of the channels, so it is a helpful label for distinguishing it from the second approach, which makes no particular assumption about the channel statistics.

The second direction (Evmorfos et al., 2021a; Evmorfos et al., 2021b; Evmorfos et al., 2022) tackles the problem of discrete relay motion control from a dynamic programming viewpoint. We formulate the Markov Decision Process (MDP) that is induced by the problem of controlling the motion. We then employ deep Q learning (Mnih et al., 2015) to find relay motion policies that maximize the sum of SINRs at the destination over time, and we propose a pipeline for adapting deep Q learning to the problem at hand. We experimentally show that Multilayer Perceptron Neural Networks (MLPs) cannot capture high frequency components in natural signals (in low-dimensional domains). This phenomenon, referred to as *“Spectral Bias”* (Jacot et al., 2018), has been observed in several contexts, and also arises as an issue in the adaptation of deep Q learning for relay motion control. We present an approach to tackle spectral bias by parameterizing the Q function with a Sinusoidal Representation Network (SIREN) (Sitzmann et al., 2020).

Our intentions for this work lie in two directions. First, we compare two methods for relay motion control in urban communication environments. The two methods constitute two different viewpoints on tackling the problem. The first method assumes complete knowledge of the underlying statistics of the channels (model-based) (Kalogerias and Petropulu, 2018). The second method is completely model-free, in the sense that it drops all assumptions of knowledge of the channel statistics and employs deep reinforcement learning to control the relay motion (Evmorfos et al., 2022). Second, in addition to the head-to-head comparison, we propose a slight variation of the model-free method that deviates from the one in Evmorfos et al. (2022) by augmenting the state with the time step as an extra feature. This variation is more robust than the previous one, especially when the shadowing component of the urban environment is particularly strong.

*Notation*: We denote the matrices and vectors by bold uppercase and bold lowercase letters, respectively. The operators *σ*-algebras. The *ℓ*_{p}-norm of *N*-dimensional identity operator will be denoted as **I**_{N}. Additionally, we define *n* > *m*.

## 2 Problem Formulation

### 2.1 System Model

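Consider a scenario where a source *S* communicates with a destination *D*, and *R* single-antenna, trusted mobile relays are enlisted to support the communication. The relays are deployed over a two-dimensional space, which is partitioned into *M* × *M* imaginary grid cells. Time evolves in a time-slotted fashion, where *T* is the slot duration, and *t* denotes the current time slot. In every time slot, a grid cell can be occupied by at most one relay.

The source transmits a unit-power symbol *s*(*t*), using power *P*_{S} > 0, in time slot *t*. The signal received by relay *R*_{r}, located at **p**_{r}(*t*), *r* = 1, … , *R*, equals

$$x_r(t) = \sqrt{P_S}\, f_r(\mathbf{p}_r(t), t)\, s(t) + n_r(t),$$

where *f*_{r} denotes the flat fading channel from the source to relay *R*_{r}, and *n*_{r}(*t*) denotes reception noise at relay *R*_{r}, with variance *σ*^{2}, for *r* = 1, … , *R*.

Each relay operates in an Amplify-and-Forward (AF) fashion, i.e., it transmits the received signal, *x*_{r}(*t*), multiplied by a weight *w*_{r}(*t*). All relays transmit simultaneously, and the signal received at the destination is

$$y(t) = \sum_{r=1}^{R} g_r(\mathbf{p}_r(t), t)\, w_r(t)\, x_r(t) + n_D(t),$$

where *g*_{r} denotes the flat fading channel from relay *R*_{r} to the destination, and *n*_{D}(*t*) denotes reception noise at the destination. The received signal *y*(*t*) can be rewritten as

$$y(t) = \underbrace{\sqrt{P_S} \sum_{r=1}^{R} g_r(\mathbf{p}_r(t), t)\, w_r(t)\, f_r(\mathbf{p}_r(t), t)\, s(t)}_{y_{signal}(t)} + \sum_{r=1}^{R} g_r(\mathbf{p}_r(t), t)\, w_r(t)\, n_r(t) + n_D(t),$$

where *y*_{signal}(*t*) is the received signal component and the remaining terms constitute the aggregate noise at the destination.

In the following, we will use the vector **p**(*t*) to collect the positions of all relays at time slot *t*.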

### 2.2 Channel Model

The channel evolves in time and space and can be described in statistical terms. In particular, during time slot *t*, the channel between the source and a relay positioned at

where *ℓ*; *e*^{j2πϕ(t)}, with *ϕ* uniformly distributed in [0, 1], a phase term. A similar model holds for the relay-destination channel *g*_{r} (**p**_{r}, *t*).

The logarithm of the squared channel magnitude of Eq. 1 converts the multiplicative channel model into an additive one, i.e.,

with

where *η*^{2} is the shadowing power, and

The multipath fading component, **p**_{i} and **p**_{j}, and between any two time slots *t*_{a} and *t*_{b}, as (Kalogerias and Petropulu, 2018)

where

with *c*_{1} denoting the correlation distance, and *c*_{2} the correlation time. Similar correlations hold for similarly

Further,

where

and *c*_{3} denoting the correlation distance of the source-destination channel (Kalogerias and Petropulu, 2018).
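To make the role of the correlation distance *c*_{1} and the correlation time *c*_{2} more concrete, the following minimal sketch simulates a shadowing field on the grid. It assumes an exponential kernel in space and a first-order autoregressive recursion in time; both choices, as well as all function names, are illustrative assumptions and not necessarily the exact model of this section. The default parameter values mirror those used in Section 5.

```python
import numpy as np

def spatial_covariance(M, eta2=15.0, c1=10.0, cell=1.0):
    """Shadowing covariance between all M*M cell centers.

    Assumption: exponential kernel eta2 * exp(-distance / c1).
    """
    xs, ys = np.meshgrid(np.arange(M), np.arange(M), indexing="ij")
    centers = cell * np.stack([xs.ravel(), ys.ravel()], axis=1)
    dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    return eta2 * np.exp(-dists / c1)

def simulate_shadowing(M, n_slots, T=0.6, c2=20.0, eta2=15.0, c1=10.0, seed=0):
    """Sample a shadowing field of shape (n_slots, M, M).

    Assumption: AR(1) evolution with coefficient exp(-T / c2), which keeps
    the spatial covariance stationary across time slots.
    """
    rng = np.random.default_rng(seed)
    cov = spatial_covariance(M, eta2, c1)
    # Small jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-9 * np.eye(M * M))
    rho = np.exp(-T / c2)
    beta = np.empty((n_slots, M * M))
    beta[0] = chol @ rng.standard_normal(M * M)
    for t in range(1, n_slots):
        innovation = chol @ rng.standard_normal(M * M)
        beta[t] = rho * beta[t - 1] + np.sqrt(1.0 - rho**2) * innovation
    return beta.reshape(n_slots, M, M)
```

For instance, `simulate_shadowing(20, 50)` would produce 50 time slots of a 20 × 20 field. Under the stated assumptions, neighboring cells and consecutive slots are strongly correlated, which is what makes prediction of the channel at nearby positions possible.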

### 2.3 Joint Scheduling of Communications and Controls

Let us assume the same carrier for all communication tasks, and employ a basic joint communication/decision making TDMA-like protocol. At each time slot, the following steps are performed:

1. The source broadcasts a pilot signal to all relays, based on which the relays estimate their channels to the source.

2. The destination also broadcasts pilots, which the relays use to estimate their channels relative to the destination.

3. Then, based on the estimated channels, the relays beamform in AF mode. Here we assume perfect CSI estimation.

4. Based on the CSI that has been received up to that point, a decision is made on where the relays need to move, and relay motion controllers are determined to steer the relays to those positions.

The above steps are repeated for *N*_{T} time slots. Let us assume that the relays pass their estimated CSI to the destination via a dedicated low-rate channel. This simplifies information decoding at the destination (Gao et al., 2008; Proakis and Salehi, 2008).

Concerning relay motion, we assume that the relays obey the differential equation (Kalogerias and Petropulu, 2018)

where

with **u**_{t} denote the time interval that the relays are allowed to move in, and the respective relay controller, in each time slot *τ* belongs in the first *N*_{T} − 1 time slots. In each time slot *t*, the length of Δ*τ*_{t}, *γ*, which can be function of the slot width. Thus, relay velocity must be of the order of

To determine the relay motion controller *t*,

Based on the above, the motion control problem can be formulated in terms of specifying the relay positions at the next time slot, given the relay positions at the current time slot and the estimated CSI. We assume here for simplicity that there exists some path planning and collision avoidance mechanism, the derivation of which is out of the scope of this paper.

For simplicity and tractability, we assume that the channel is the same for every position *within* each grid cell, and for the duration of each time slot. In other words, we essentially adopt a *time-space block fading model*, at least for motion control purposes. This approximation becomes more accurate as the grid cell size and the time slot duration become smaller, at the expense of more stringent resource constraints at the relays and the need for faster channel sensing. Under this setting, *communication and relay control can indeed happen simultaneously* within each time slot, with the understanding that at the start of the next time slot, each relay must have completed its motion, which started in the previous time slot (see also our discussion earlier in this section). In this way, our approach is valid in a practical setting where communication needs to be continuous and uninterrupted.

Additionally, we assume that the relays move sufficiently slowly, such that the local spatial and temporal changes of the wireless channel due to relay motion itself (e.g., Doppler shift effects) are negligible. Then, spatial and temporal variations in channel quality are only due to changes in the physical environment, which happen at a much slower rate than that of actual communication. Note that this is, in any case, a standard requirement for achieving a high communication rate.

We see that there is a natural interplay between relay velocity and the relative rate of change of the communication channel (Kalogerias and Petropulu, 2018). The challenge here is to identify a fair tradeoff between a reasonable relay velocity, grid size and time slot duration, which would simultaneously enable faithful channel prediction and feasible and effective motion control (adhering to potential relay motion constraints). The width of the communication time slot depends on the spatial characteristics of the terrain, which vary with each application. This also determines the sampling rate employed for identifying the parameters of the adopted channel model. In theory, for a given relay velocity, the relays could move to any position up to which the channel remains correlated. However, as the per time slot rate of communications depends on the relay velocity (characterizing system throughput), the relays should move over much smaller distances within the slot.

In the following we use *along their trajectories* *along* *σ*-algebra) generated by the CSI observed up to and including time slot *t* and *σ*-algebra *t* ≡ 0, as a *dummy time slot*.

### 2.4 Spatially Controlled SINR Maximization at the Destination

Next, we present the first stage of the 2-stage generic formulation. The 2-stage approach optimizes network QoS by optimally selecting beamforming weights *and* relay positions, on a *per time slot* basis. In this subsection, we focus on the calculation of the beamforming weights. The calculation of the weights at each step remains the same both for the stochastic programming (model-based) method and the dynamic programming (model-free) method.

Optimization of Beamforming Weights: At time slot *given* CSI in

where *P*_{c} > 0 denotes the total relay transmission power budget. Based on the mutual independence of source and destination CSI, (Eq. 2) can be expressed as (Havary-Nassab et al., 2008b)

where, dropping the dependence on *t* for brevity,

The optimization problem of Eq. 3 is *always feasible, as long as P*_{c} *is nonnegative*, and the optimal value of Eq. 3 can be expressed in closed form as (Havary-Nassab et al., 2008b)

for all

The above analytical expression of the optimal value *V*_{t} in terms of relay positions and their corresponding channel magnitudes will be key in our subsequent development.

## 3 Stochastic Programming for Myopic Relay Control

During time slot *t* − 1, we need to determine the relay positions for time slot *t*, so that we achieve the maximum *V*_{t}. However, at time slot *t* − 1, we only know *t*. Therefore, exactly optimizing the relay positions at the next time slot seems to be an impossible task.

Since deterministic optimization of *V*_{t} with respect to *t* − 1, we can alternatively optimize a projection of *V*_{t} onto the space of all measurable functions of *V*_{t} is of finite variance, we can consider orthogonal projections. In other words, we can consider the Minimum Mean-Square Error (MMSE) predictor of *V*_{t} given the available information

to be solved at time slot *t*, but here, for simplicity that dependence is not shown.

The map *finite-valued multifunction*, and we write *first-stage problem* and the *second-stage problem*, respectively (Shapiro et al., 2009). The block diagram of the above described process is shown in Figure 1.

**FIGURE 1**. 2-Stage optimization of beamforming weights and relay motion controls. The variable *t* − 1.

Compared to traditional AF beamforming in the static case, the spatially controlled system described above uses the same CSI as in the stationary case to predict the optimal beamforming performance in its vicinity in the MMSE sense, and moves to the optimally selected location. The prediction here relies on the aforementioned spatiotemporal channel model. Of course, this requires a channel that varies sufficiently slowly relative to relay motion, which can be guaranteed if the motion is constrained to small steps.

### 3.1 Motion Policies & the Interchangeability Principle

To help in understanding the techniques used to solve Eq. 4, we note an important *variational* property of Eq. 4, related to the *long-term performance* of the proposed spatially controlled beamforming system. Our discussion pertains to the employment of the so-called *Interchangeability Principle (IP)* (Bertsekas and Shreve, 1978; Bertsekas, 1995; Rockafellar and Wets, 2004; Shapiro et al., 2009; Kalogerias and Petropulu, 2017), also known as the *Fundamental Lemma of Stochastic Control (FLSC)* (Astrom, 1970; Speyer and Chung, 2008; Kalogerias and Petropulu, 2018). The IP refers to conditions that allow the interchange of expectation and maximization or minimization in general stochastic programs.

A version of the IP for the first-stage problem of (4) is established in (Kalogerias and Petropulu, 2017). Specifically, the IP implies that (4) is exchangeable by the variational problem (Kalogerias and Petropulu, 2017)

to be solved at each *unconditional expectation* of *V*_{t} over all (measurable) mappings of the variables generating *t* − 1, whereas, in Eq. 4, *point*, since all variables generating *fixed before decision making*. Aligned with the literature, any feasible decision *admissible*) *policy*, or a *decision rule*. *Exchangeability* of Eqs. 4, 5 is understood in the sense that the optimal value of Eq. 5, which is a number, coincides with the *expectation* of the optimal value of Eq. 4, which is a measurable function of *interchangeable* with integration, in the sense that” (Kalogerias and Petropulu, 2017)

for all

### 3.2 Near-Optimal Beamformer Motion Control

One can readily observe that the problem of (4) is separable. Given that, for each *i*th relay being responsible for solving the problem (Kalogerias and Petropulu, 2018)

at each *.* Note that no local exchange of intermediate results is required among relays; given the available information, each relay independently solves its own subproblem. It is also evident that apart from the obvious difference in the feasible set, the optimization problems at each of the relays are identical.

However, the objective of problem Eq. 11 is impossible to obtain analytically, and it is necessary to resort to some well behaved and computationally efficient *surrogates*. Next, we present *a near-optimal* such approach. The said approach relies on *global* function approximation techniques, and achieves excellent empirical performance.

The proposed approximation to the stochastic program (11) will be based on the following technical, though simple, result.

Lemma 1 (Big Expectations) (Kalogerias and Petropulu, 2018) *Under the assumptions of the wireless channel model, it is true that, at any* *,*

*for all* *, and where we define*

*with* *m*_{1:t−1}*,* *μ*_{1:t−1}*,* *,* *,* *,* *and* **Σ**_{1:t−1} *defined as in* (6)*,* (7)*,* (8)*,* (9)*, and* (10) *respectively, for all* *. Further, for every choice of* *, the conditional correlation of the fields* *and* *relative to* *may be expressed in closed form as*

*at any* *and for all* *.*

The detailed description of the proposed technique for efficiently approximating our base problem (11) now follows.

Sample Average Approximation (SAA): This is a direct Monte Carlo approach, where, *at worst*, existence of a *sampling, or pseudosampling mechanism at each relay* is assumed, capable of generating samples from a bivariate Gaussian measure. We may then observe that the objective of Eq. 11 can be represented, for all

for any choice of

for all

for all

Now, for each relay *F* and *G*. Then, by defining the sample average estimate

the SAA of our initial problem Eq. 11 is formulated as

at relay *S* → *∞*. For further details, see (Shapiro et al. (2009), Chapter 5).

On the downside, computing the objective of the SAA problem Eq. 12 assumes availability of Monte Carlo samples, which could be restrictive in certain scenarios. Nevertheless, assuming mutual independence of the sequences *i* and each *t* is not required. In fact, one could generate one sequence for all relays, per time slot, or even better, one sequence for all relays, for all time slots altogether. Such sampling schemes are legitimate, for two reasons. First, all SAAs of the form Eq. 12 are solved independently for each relay and at each time slot. Second, Monte Carlo sampling is by construction statistically independent from the spatiotemporal channel fields *F* and *G*. As a result, such sampling schemes relax (in fact, eliminate) the need for (pseudo)random sampling at each *individual* relay. This makes them particularly attractive for practical purposes.

We denote this approach as SAA for the rest of the paper. The control flow of the SAA is presented in Algorithm 1.
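The selection step of the SAA can be sketched generically as follows. The sketch assumes that, for every candidate cell, the predictive mean and covariance of the two (log-scale) channel magnitudes are already available, and that the per-sample objective is supplied as a callable. The names `saa_select` and `value_fn`, and the input layout, are hypothetical and chosen only for illustration. In line with the discussion above, one set of standard normal draws is shared by all candidates.

```python
import numpy as np

def saa_select(candidates, cond_means, cond_covs, value_fn, std_samples):
    """Pick the candidate cell with the largest sample-average objective.

    candidates : list of (x, y) cells reachable in the next slot
    cond_means : array (K, 2), predictive means for each candidate
    cond_covs  : array (K, 2, 2), predictive covariances for each candidate
    value_fn   : hypothetical callable mapping sample arrays (F, G) to values
    std_samples: array (S, 2) of standard normal draws, shared by all candidates
    """
    estimates = []
    for mean, cov in zip(cond_means, cond_covs):
        # Color the shared standard normal draws with the candidate's statistics.
        chol = np.linalg.cholesky(cov)
        draws = mean + std_samples @ chol.T
        estimates.append(np.mean(value_fn(draws[:, 0], draws[:, 1])))
    best = int(np.argmax(estimates))
    return candidates[best], np.asarray(estimates)
```

Because the same draws are reused for every candidate, differences between the estimates reflect differences in the predictive statistics rather than sampling noise, which is one practical motivation for the shared-sequence schemes described above.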

## 4 Deep Reinforcement Learning for Adaptive Discrete Relay Motion Control

### 4.1 Dynamic Programming for Relay Motion Control

The previously mentioned approach tackles the problem of relay motion control from a myopic perspective in the sense that the stochastic program is formulated so as to select the relay positions for the subsequent time slot with the goal of maximizing the collective SINR at the destination only for that particular slot.

The employment of reinforcement learning for the problem of discrete relay motion control entails that we reformulate the problem as a dynamic program. In this set up we want, at time slot *t* − 1, to derive a motion policy (a methodology for choosing the relays’ displacement) so as to maximize the discounted sum of *V*_{I}s (in expectation) from the subsequent time step *t* to the infinite horizon.

To formally pose that program we need to introduce a Markov Decision Process (MDP). The MDP is a tuple defined as

The formulation of the dynamic program is as follows:

If *γ* is a discount factor, we can formulate the infinite horizon relay control problem as:

where **u**(*t*) is the control at time *t* (essentially determining the relay displacement), and the driving noise **W**(*t*) is Gaussian with covariance matrix **Σ**_{C}, the covariance matrix for all channels (source and destination) for all the cells in the grid. The said covariance matrix is explicitly defined in (Kalogerias and Petropulu, 2017) and admits a particular form if the channels evolve according to the spatiotemporal Gaussian process defined in Section 2.2.

Now, either the above problem defines a MDP or POMDP is dependent on the history

On the other hand, if

### 4.2 Deep Q Learning for Discrete Relay Motion Control

The employment of deep Q learning for relay motion control removes the need for any particular assumption about the underlying correlation structure of the channels.

Taking into account (12), one can infer that a single policy can be learned from the collective experience of all the agents/relays, and that the movement of all relays strictly adheres to this single policy. In that spirit, we instantiate one neural network to parameterize the state-action value function (Q), and it is trained on the experiences of all the relays. The motion policy is *ϵ*-greedy with respect to the estimate of the Q function.

Initially, we adopt the deep Q learning algorithm as described in (Mnih et al., 2015) and illustrated in Figure 2. Even though, as we pointed out in the previous subsection, the state of the MDP is the concatenation of the relay position **p** = ***s*** and the channels *f*(**p**, *t*) and *g*(**p**, *t*), we follow a slightly different approach in the adoption of deep Q learning. In particular, the input to the neural network is the concatenation of the position **p** = [*x*, *y*] and the time step *t*. We should note at this point that augmenting the neural network input with the time step of the transition differentiates the algorithm presented in this work from the solution proposed in Evmorfos et al. (2022). This alternative, even though it does not affect the implementation much, provides measurable improvements in cases where the power of the shadowing is strong. The reward *r* is the contribution of the relay to the SINR at the destination during the respective time step (*V*_{I}). At each time slot, the relay selects an action.

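In general, Q learning with rich function approximators such as neural networks requires some heuristics for stability. The first such heuristic is the *Experience Replay* (Mnih et al., 2015). Each tuple of experience for a relay, namely the state, action, reward and next state (***s***, *a*, *r*, ***s***′), is stored in a memory buffer, and batches of stored tuples are sampled from this buffer at every update step.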
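A minimal sketch of such a buffer is given below. The class name, the first-in-first-out eviction and the uniform sampling are assumptions in the spirit of (Mnih et al., 2015); the default capacity and batch size are the values reported in Section 5.1.

```python
import random
from collections import deque

class ExperienceReplay:
    """Fixed-size buffer of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=3000, seed=None):
        # A bounded deque silently drops the oldest tuple once it is full.
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=128):
        # Uniform sampling without replacement (assumed sampling scheme).
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```

Since a single Q function is trained on the experiences of all relays, one shared buffer of this kind can collect the transitions of every relay.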

The second heuristic is the *Target Network* (Mnih et al., 2015). The Target Network (*Q*_{target}(***s***′, *a*′; *θ*^{−})) provides the estimation for the targets (labels) for the updates of the *Policy Network* (*Q*_{policy}(***s***′, *a*′; *θ*^{+})), i.e., the network used for estimating the Q function. The two networks share (typically) the same architecture. We do not update the Target Network’s weights with any optimization scheme, but, after a predefined number of training steps, the weights of the Policy Network are copied to the Target Network. This provides stationary targets for the weight updates and brings the task of the Q function approximation closer to a supervised learning paradigm.
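The target computation and the periodic weight copy described above can be sketched as follows. The sketch assumes the standard deep Q learning target of (Mnih et al., 2015), i.e., the reward plus the discounted maximum of the Target Network's estimates at the next state. The function names and the representation of the weights as a list of arrays are hypothetical; the default discount factor and copy period are the values reported in Section 5.1.

```python
import numpy as np

def td_targets(rewards, next_q_target, gamma=0.99):
    """Batch of targets r + gamma * max_a' Q_target(s', a').

    rewards       : array-like of shape (B,)
    next_q_target : array-like of shape (B, A), Target Network outputs at s'
    """
    rewards = np.asarray(rewards, dtype=float)
    next_q_target = np.asarray(next_q_target, dtype=float)
    return rewards + gamma * next_q_target.max(axis=1)

def sync_target(step, policy_params, target_params, every=100):
    """Copy the Policy Network weights to the Target Network every `every` steps."""
    if step % every == 0:
        return [w.copy() for w in policy_params]
    return target_params
```

Between two copies, the targets are computed with frozen weights, so they do not move while the Policy Network is being updated.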

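Therefore, at each update step we sample a batch of experiences from the Experience Replay and use the batch to perform gradient descent on the loss:

$$\mathcal{L} = \left(Q_{policy}(\boldsymbol{s}, a; \theta^{+}) - Y_t\right)^2.$$

At each step, the Policy Network’s weights are updated according to:

$$\theta^{+}_{t+1} = \theta^{+}_{t} + \lambda \left(Y_t - Q_{policy}(\boldsymbol{s}, a; \theta^{+}_{t})\right) \nabla_{\theta^{+}_{t}} Q_{policy}(\boldsymbol{s}, a; \theta^{+}_{t}),$$

where,

$$Y_t = r + \gamma \max_{a'} Q_{target}(\boldsymbol{s}', a'; \theta^{-}).$$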

The parameter *λ* is the learning rate. The parameter *γ* is a scalar called the *discount factor*, with *γ* ∈ (0, 1). The choice of the discount factor pertains to a trade-off between the importance assigned to long-term rewards and the importance assigned to short-term rewards. The parameters *a*, ***s*** and ***s***′ correspond to the action, the current state and the next state, respectively. The general pipeline of the deep Q learning algorithm is defined in Figure 3.

**FIGURE 3**. This is a heatmap for visualizing a trajectory of the relays. We can see the *V*_{I} for all grid cells for four different time steps (each time step has a 2-time-slot difference with the previous and the next). One can see the positions of the relays for every time slot. The relays are moving towards better and better positions (larger *V*_{I}s).

When the relays move (i.e., they do not stay in the same grid cell for two consecutive slots), they consume additional energy. In some cases, though, the displacement to a neighboring grid cell does not correspond to a significant improvement in terms of the cumulative SINR at the destination. Therefore, to account for this energy consumption, we choose not to apply the *ϵ*-greedy policy directly to the estimates *Q*_{policy}(***s***, *a*; *θ*^{+}) of the Q function. Instead, we decrease the estimates for all actions *a*, except for the action of staying in the current cell, by a small percentage *μ*. In that way, we prohibit the relay displacement if this action does not correspond to a significant increase in the expectation of the cumulative sum of rewards (SINR). How significant the increase should be for a displacement to be performed is determined by the choice of *μ*. For our simulations, in the subsequent sections, we choose *μ* to be 1%.
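One possible way to implement this biased *ϵ*-greedy selection is sketched below. The action indexing (index 0 for staying in place), the function name, and the use of the absolute value to keep the penalty meaningful for negative estimates are assumptions made for illustration.

```python
import numpy as np

STAY = 0  # assumed index of the action that keeps the relay in its current cell

def select_action(q_values, epsilon, mu=0.01, rng=None):
    """Epsilon-greedy choice on Q estimates, penalizing every move by a fraction mu."""
    rng = np.random.default_rng() if rng is None else rng
    q = np.asarray(q_values, dtype=float)
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return int(rng.integers(len(q)))
    # Exploitation: shrink the estimates of all displacement actions by mu.
    biased = q - mu * np.abs(q)
    biased[STAY] = q[STAY]
    return int(np.argmax(biased))
```

With *μ* = 1%, a move whose estimate exceeds that of staying by less than roughly 1% is not selected, while a clearly better move still is.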

### 4.3 Sinusoidal Representation Networks for Q Function Parameterization

Several recent works show that coordinate-based Multilayer Perceptron Neural Networks (MLPs), i.e., MLPs that map a vector of coordinates to a low-dimensional natural signal, fail to learn high frequency components of the said signal. This phenomenon is called spectral bias in the machine learning literature (Jacot et al., 2018; Cao et al., 2019). The work in (Sitzmann et al., 2020) examines the amelioration of spectral bias for MLPs. The inadequacy of MLPs in this respect is bypassed by introducing a variation of the conventional MLP architecture with the sinusoid (sin(⋅)) as the activation function between layers. This MLP alternative was termed *Sinusoidal Representation Networks* (SIRENs), and was shown, both theoretically and experimentally, to effectively tackle the spectral bias.

The sinusoid is a periodic function, which makes it quite an atypical choice of activation function in neural networks. The authors in (Sitzmann et al., 2020) propose a weight initialization framework so that the distribution of activations is retained during training and convergence is achieved without the network oscillating.

In particular, consider an intermediate layer of the neural network with input ***x*** ∈ ℝ^{n}. The layer takes a linear combination of the input, weighted with ***w***, and passes it through the sinusoid activation, so the output is sin(***w***^{T}***x*** + *b*). Since the layer is intermediate, its input ***x*** is arcsine distributed. With these assumptions, it was shown in (Sitzmann et al., 2020) that, if the elements of ***w***, namely *w*_{i}, are initialized from the uniform distribution $\mathcal{U}(-\sqrt{6/n}, \sqrt{6/n})$, then ***w***^{T}***x*** is approximately standard normal as *n* grows. Therefore, one should initialize the weights of all intermediate layers with $w_i \sim \mathcal{U}(-\sqrt{6/n}, \sqrt{6/n})$. The neurons of the first layer are initialized with the use of a scalar hyperparameter *ω*_{0}, so that the output of the first layer, sin(*ω*_{0}***W******x*** + ***b***), spans multiple periods over [−1, 1]. Here, ***W*** is a matrix whose elements correspond to the weights of the first layer.

When we adopt the deep Q learning approach for discrete relay motion control, we basically train a neural network (MLP) to learn a low-dimensional natural signal from coordinates, namely the state-action value function *Q*(***s***, *a*). The Q function, *Q*(***s***, *a*), represents the sum of SINR at the destination that the relays are expected to achieve for an infinite time horizon, starting from the respective position ***s*** and performing action *a*. The Policy Network, being a coordinate MLP, may not be able to converge for the high frequency components of the underlying Q function that arise from the fact that the channels exhibit very abrupt spatiotemporal variations.

Therefore we propose that both the Policy and the Target Networks are SIRENs. The control flow of the algorithm we propose is given in Algorithm 2. We denote this as DQL-SIREN, which stands for *Deep Q Learning with Sinusoidal Representation Networks*.
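As an illustration of the SIREN initialization and forward pass, the sketch below builds a small network in NumPy that maps the input [*x*, *y*, *t*] to one Q estimate per action. The first-layer bound of 1/*n* and the value *ω*_{0} = 30 follow the defaults suggested in (Sitzmann et al., 2020); the zero biases, the linear output layer, the five-action output and the function names are assumptions made for illustration. The layer widths match the three 350-neuron dense layers reported in Section 5.1.

```python
import numpy as np

def init_siren(layer_sizes, seed=0):
    """Initialize SIREN weights; layer_sizes is, e.g., [3, 350, 350, 350, 5]."""
    rng = np.random.default_rng(seed)
    params = []
    for i, (n_in, n_out) in enumerate(zip(layer_sizes[:-1], layer_sizes[1:])):
        # First layer: U(-1/n, 1/n); later layers: U(-sqrt(6/n), sqrt(6/n)).
        bound = 1.0 / n_in if i == 0 else np.sqrt(6.0 / n_in)
        W = rng.uniform(-bound, bound, size=(n_out, n_in))
        b = np.zeros(n_out)  # zero biases are a simplifying assumption
        params.append((W, b))
    return params

def siren_forward(params, x, omega_0=30.0):
    """Return one Q estimate per action for a single input vector x."""
    h = np.asarray(x, dtype=float)
    W, b = params[0]
    h = np.sin(omega_0 * (W @ h) + b)   # first layer scaled by omega_0
    for W, b in params[1:-1]:
        h = np.sin(W @ h + b)           # intermediate sinusoidal layers
    W, b = params[-1]
    return W @ h + b                    # linear output layer (assumed)
```

For example, `siren_forward(init_siren([3, 350, 350, 350, 5]), [0.1, -0.4, 0.0])` returns five Q estimates. The inputs are assumed to be normalized to [−1, 1] before being fed to the network.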

## 5 Simulations

We test our proposed schemes by simulating a 20 × 20 m grid. All the grid cells are 1 m × 1 m. The number of agents/relays that assist the single source-destination communication pair is *R* = 3. For every time slot, the position of each relay is constrained within the boundaries of the gridded region and also constrained to adhere to a predetermined relay movement priority. Only one relay can occupy a grid cell per time slot. The center of the relay/agent and the center of the respective grid cell coincide.
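The movement constraints above can be sketched as follows. The sketch assumes a four-neighborhood (stay, or move one cell horizontally or vertically) and that relays decide in list order, which stands in for the predetermined movement priority; the function names and the `choose` callable are hypothetical.

```python
def feasible_moves(position, occupied, grid_size=20):
    """Cells a relay may occupy in the next slot (assumed 4-neighborhood plus stay)."""
    x, y = position
    steps = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    moves = []
    for dx, dy in steps:
        nx, ny = x + dx, y + dy
        inside = 0 <= nx < grid_size and 0 <= ny < grid_size
        if inside and (nx, ny) not in occupied:
            moves.append((nx, ny))
    return moves

def plan_moves(positions, choose, grid_size=20):
    """Let relays pick their next cell in priority (list) order.

    choose is a hypothetical callable (position, options) -> chosen cell,
    e.g., the SAA selection or the Q-based policy.
    """
    claimed = set()
    planned = []
    for idx, pos in enumerate(positions):
        # Block cells already claimed and cells still held by lower-priority relays.
        blocked = claimed | set(positions[idx + 1:])
        target = choose(pos, feasible_moves(pos, blocked, grid_size))
        claimed.add(target)
        planned.append(target)
    return planned
```

Under this scheme, staying in place is always feasible for every relay, so `choose` never receives an empty list of options.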

When it comes to the shadowing part of our assumed channel model, we define a threshold *θ* which quantifies the distance in time and space where the shadowing component is important and can be taken into account for the construction of the motion policy. We assume that the shadowing power is *η*^{2} = 15, the autocorrelation distance is *c*_{1} = 10 m and the autocorrelation time is *c*_{2} = 20 s. The variances of noises at the relays and destination are fixed as

Each relay can move one grid cell per time slot, and the size of each cell is 1 m × 1 m (as mentioned before). The time slot length is set to 0.6 s. Therefore, the calculation of the channel and the movement decision for each relay must together take strictly less time than the duration of the slot.

### 5.1 Specifications for the DQL-SIREN and the SAA

Regarding the DQL-SIREN, we employ SIRENs for both the Policy and the Target Networks. Each SIREN is composed of three dense layers (350 neurons per layer), and the learning rate is 10^{−4}.

The Experience Replay size is 3,000 tuples and, for all the deep Q learning approaches, we begin every experiment with 300 transitions generated by a completely random policy before training starts. The *ϵ* of the *ϵ*-greedy policy is initialized to 1 and is steadily decreased until it reaches 0.1. This is a typical regime in RL and a simple way to handle the exploration-exploitation dilemma: we begin by emphasizing exploration and then gradually trade exploration for exploitation. We copy the weights of the Policy Network to the Target Network every 100 training steps. The batch size is chosen to be 128 (the methods work reliably for batch sizes ranging from 64 to 512) and the discount factor *γ* is chosen to be 0.99.

Small values of *γ* translate to a more myopic agent (an agent that assigns significance to short-term rewards at the expense of long-term/delayed rewards). On the other hand, values of *γ* closer to 1 correspond to agents that assign almost equal value to long-term and short-term rewards. For the deep Q learning methods that we have proposed, we noticed that low values of *γ* impede convergence and performance, something that we attribute to the interplay of Q learning and the use of neural networks rather than to the nature of the underlying MDP.
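The training hyperparameters above can be collected in a short sketch. Several details are assumptions made only for illustration: the decay of *ϵ* is taken to be linear over a hypothetical `EPS_DECAY_STEPS` (the text only says that it is steadily decreased), and the helper names are not taken from the paper. The quantity 1/(1 − *γ*) is the standard rough measure of how many future steps a discounted agent effectively accounts for.

```python
import random
from collections import deque

EPS_START, EPS_END = 1.0, 0.1
EPS_DECAY_STEPS = 2000       # hypothetical: the decay length is not stated in the text
REPLAY_SIZE = 3000           # Experience Replay capacity (tuples)
WARMUP_TRANSITIONS = 300     # random-policy transitions collected before training
BATCH_SIZE = 128
TARGET_SYNC_EVERY = 100      # training steps between Policy -> Target weight copies
GAMMA = 0.99

def epsilon_at(step):
    """Linearly anneal epsilon from EPS_START to EPS_END (linear shape is an assumption)."""
    frac = min(step / EPS_DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, step, rng=random):
    """Epsilon-greedy: random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon_at(step):
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def should_sync_target(train_step):
    """True on the training steps where the Target Network is overwritten."""
    return train_step > 0 and train_step % TARGET_SYNC_EVERY == 0

def effective_horizon(gamma):
    """Rough number of future steps that matter under discount factor gamma."""
    return 1.0 / (1.0 - gamma)

# A deque with maxlen drops the oldest tuple once the buffer is full.
replay = deque(maxlen=REPLAY_SIZE)
for i in range(3500):
    replay.append((i, 0, 0.0, i + 1))  # placeholder (state, action, reward, next_state)
```

With *γ* = 0.99 the effective horizon is about 100 steps, whereas *γ* = 0.5 gives about 2 steps, which is one way to read the contrast between far-sighted and myopic agents.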

We set the *ω*_{0} for the DQL-SIREN to 5 (the performance of the algorithm is robust for different values of the said parameter). Finally, we use the Adam optimizer for updating the network weights.

When it comes to the SAA, the sample size is set to 150 for the experiments.
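The idea behind the Sample Average Approximation can be sketched as follows: the expectation in each subproblem is replaced by an average over a fixed number of samples, and the candidate cell with the largest average is selected. The sketch is generic and rests on assumptions: `sample_objective` is a hypothetical stand-in for whatever produces one sample of a relay's objective at a candidate cell, and the toy Gaussian sampler in the usage lines is purely illustrative.

```python
import numpy as np

def saa_best_cell(candidates, sample_objective, n_samples=150, seed=0):
    """Pick the candidate cell with the highest sample-average objective.

    candidates: iterable of (row, col) cells reachable in the next slot.
    sample_objective: callable (cell, rng) -> float, one sample of the objective
                      (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    estimates = {}
    for cell in candidates:
        draws = [sample_objective(cell, rng) for _ in range(n_samples)]
        estimates[cell] = float(np.mean(draws))  # sample average replaces the expectation
    best = max(estimates, key=estimates.get)
    return best, estimates

# Toy usage: three candidate cells with different (made-up) mean objectives.
toy_means = {(1, 1): 1.0, (1, 2): 5.0, (2, 1): 2.0}
toy_sampler = lambda cell, rng: toy_means[cell] + rng.normal()
best_cell, estimates = saa_best_cell(list(toy_means), toy_sampler)
```

The sample size trades accuracy for computation: the standard error of each average shrinks as one over the square root of the number of samples.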

### 5.2 Synthesized Data and Simulations

We create synthetic CSI data that adhere to the channel statistics described in Section 2.2.
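As an illustration of how spatiotemporally correlated data of this kind can be synthesized, the sketch below samples only a Gaussian shadowing field over the grid. It is not the full channel model: path loss and multipath fading are omitted, and the separable exponential kernel *η*^{2} exp(−distance/*c*_{1}) exp(−lag/*c*_{2}) is an assumption made here for concreteness. Under that assumption, the temporal part reduces to a first-order autoregression between consecutive slots. The function name `sample_shadowing` is hypothetical.

```python
import numpy as np

def sample_shadowing(grid_size=20, n_slots=30, eta2=15.0, c1=10.0, c2=20.0,
                     slot_len=0.6, seed=0):
    """Sample a shadowing field of shape (n_slots, grid_size, grid_size).

    Assumed covariance: eta2 * exp(-||p - q|| / c1) * exp(-|t - s| / c2).
    """
    rng = np.random.default_rng(seed)
    # Cell-center coordinates in meters (cells are 1 m x 1 m).
    xs, ys = np.meshgrid(np.arange(grid_size) + 0.5, np.arange(grid_size) + 0.5)
    pts = np.column_stack([xs.ravel(), ys.ravel()])
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    sigma = eta2 * np.exp(-dist / c1)               # spatial covariance matrix
    chol = np.linalg.cholesky(sigma + 1e-9 * np.eye(len(pts)))  # small jitter for stability
    a = np.exp(-slot_len / c2)                      # slot-to-slot correlation
    field = np.empty((n_slots, len(pts)))
    field[0] = chol @ rng.standard_normal(len(pts))
    for t in range(1, n_slots):
        innovation = chol @ rng.standard_normal(len(pts))
        # AR(1) update keeps the marginal covariance equal to sigma at every slot.
        field[t] = a * field[t - 1] + np.sqrt(1.0 - a ** 2) * innovation
    # Reshape so that field[t, row, col] indexes the grid (row = y index, col = x index).
    return field.reshape(n_slots, grid_size, grid_size), sigma, a

field, sigma, a = sample_shadowing()
```

With a slot length of 0.6 s and *c*_{2} = 20 s, consecutive slots are strongly correlated (the coefficient is exp(−0.03)), which is what makes causal prediction of the next-slot channel informative.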

In Figure 4, we plot the average SINR at the destination (in dB scale) achieved by the cooperation of all three relays, per episode, for 100 episodes, where every episode consists of 30 steps. The transmission power of the source is *P*_{S} = 57 dBm and the relay transmission power budget is *P*_{R} = 57 dBm. The assumed channel parameters are set as *ℓ* = 2.3, *ρ* = 3, *η*^{2} = 15, *c*_{1} = 10, *c*_{2} = 20, *c*_{3} = 0.5. The variance of the noise at the relays and destination are

We generate 3,000 = 100 × 30 instances of the source-relay and relay-destination channels for the whole grid (20 × 20). Every 30 time steps we initialize the relays to random positions in the grid and let them move. We plot the average SINR for every 30 steps of the algorithms.
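The bookkeeping behind the plotted metric can be sketched in a few lines. One assumption is made explicit: the SINR values are averaged in linear scale within each episode and only then converted to dB, since the order of the two operations is not stated in the text. The helper names are hypothetical; the dBm conversion is the standard one.

```python
import numpy as np

def per_episode_average_db(sinr_linear, n_episodes=100, steps_per_episode=30):
    """Average linear-scale SINR over each episode, then convert to dB (assumed order)."""
    sinr = np.asarray(sinr_linear, dtype=float).reshape(n_episodes, steps_per_episode)
    return 10.0 * np.log10(sinr.mean(axis=1))

def dbm_to_watts(p_dbm):
    """Convert a power level in dBm to watts."""
    return 10.0 ** ((p_dbm - 30.0) / 10.0)

# Toy check: a constant linear SINR of 10 corresponds to 10 dB in every episode.
curve = per_episode_average_db(np.full(3000, 10.0))
source_power_w = dbm_to_watts(57.0)  # roughly 501 W
```

Averaging over several seeds would then amount to stacking one such curve per seed and taking the mean along the seed axis.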

### 5.3 Simulation Results and Discussion

We present the results of our simulations in Figure 4. As we stated before, the results correspond to the average SINR at the destination for 100 episodes. Each episode consists of 30 time steps. The runs correspond to the average over six different seeds.

We compare three different policies. The first one is the Random policy, where each relay chooses the displacement for the next step at random. The second policy is the DQL-SIREN that solves the dynamic program (maximization of the discounted sum of *V*_{I}s for every relay from the current time step to the infinite horizon). The third policy is the myopic SAA that corresponds to the stochastic program and optimizes each individual relay’s *V*_{I} for the subsequent slot.
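The difference between the two non-random policies can be written compactly. The notation below is assumed for illustration and may differ from the symbols used in the earlier sections: **p**_{r}(*t*) is the position of relay *r* at slot *t*, C_{r}(*t*) is the set of cells it can reach in one move, and *N* is the SAA sample size.

```latex
% Non-myopic (DQL-SIREN) objective for relay r, with discount factor \gamma:
\max_{\pi}\; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\,
    V_I\big(\mathbf{p}_r(t+k+1),\, t+k+1\big)\right]

% Myopic (SAA) objective for the subsequent slot, approximated with N samples:
\max_{\mathbf{p}\in\mathcal{C}_r(t)}\;
    \mathbb{E}\big[V_I(\mathbf{p},\, t+1)\,\big|\,\text{information up to } t\big]
\;\approx\;
\max_{\mathbf{p}\in\mathcal{C}_r(t)}\;
    \frac{1}{N}\sum_{i=1}^{N} V_I^{(i)}\big(\mathbf{p},\, t+1\big)
```

In this notation, setting *γ* = 0 leaves only the *k* = 0 term of the discounted sum, so the first objective reduces to a one-step criterion of the same form as the second.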

As one can see, both the SAA and the DQL-SIREN perform significantly better than the Random policy (they both achieve an average SINR of approximately 7 dB, in contrast to the Random policy, which achieves about 4 dB). Table 1 contains a head-to-head comparison of the SAA and the DQL-SIREN approaches regarding some qualitative and some quantitative features.

The convergence of the DQL-SIREN is faster than that of the SAA. This is reasonable since, for the SAA approach, not enough samples (150) have been collected during the first five episodes. Both SAA and DQL-SIREN perform approximately the same in terms of average SINR. Towards the end of the experiments there is a small gap between the two (with the SAA performing slightly better). This can be attributed to the *ϵ*-greedy policy of the DQL-SIREN, where *ϵ* never goes to zero (a random action is still chosen a small percentage of the time to maintain exploration).

There are some interesting inferences that one can make based on the simulations. First of all, even though the SAA is myopic and only attempts to maximize the SINR for the subsequent time slot, it works quite well in terms of the aggregate statistic of the average SINR. This is a clear indication that, for the formulated problem, being greedy translates to performing adequately in the sense of cumulative reward.

Of course, this peculiarity holds only when the statistics of the channels are completely known and do not change significantly during the operation time. Apparently, in such a scenario, the phenomenon of delayed rewards is not very prevalent.

## 6 Conclusion

In this paper, we examine discrete motion control for mobile relays facilitating the communication between a source and a destination. We compare two different approaches to tackle the problem. The first approach employs stochastic programming for scheduling the relay motion. This approach is myopic, meaning that it seeks to maximize the SINR at the destination only at the subsequent time slot. In addition, the stochastic programming approach makes specific assumptions about the statistics of the channel evolution. The second approach is a deep reinforcement learning approach that is not myopic, meaning that its goal is to maximize the discounted sum of SINR at the destination from the subsequent slot to an infinite time horizon. Additionally, the second approach makes no particular assumptions about the channel statistics. We test our methods on synthetic channel data produced in accordance with a known model for spatiotemporally varying channels. Both methods perform similarly and achieve significant improvement in comparison to a standard random policy for relay motion. We also provide a head-to-head comparison of the two approaches regarding various key qualitative and quantitative features. As future work, we plan to extend the current methods to scenarios with multiple source-destination communication pairs and, possibly, to account for the presence of eavesdroppers.

## Data Availability Statement

The raw data supporting the conclusion of this article will be made available by the authors, without undue reservation.

## Author Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

## Funding

Work supported by ARO under grant W911NF2110071.

## Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

## References

Barriac, G., Mudumbai, R., and Madhow, U. (2004). “Distributed Beamforming for Information Transfer in Sensor Networks,” in Third International Symposium on Information Processing in Sensor Networks, 2004 (IEEE), 81–88. doi:10.1145/984622.984635

Bertsekas, D. (1995). *Dynamic Programming & Optimal Control*. 4th edn., II. Belmont, Massachusetts: Athena Scientific.

Bertsekas, D. P., and Shreve, S. E. (1978). *Stochastic Optimal Control: The Discrete Time Case*, 23. New York: Academic Press.

Cao, Y., Fang, Z., Wu, Y., Zhou, D.-X., and Gu, Q. (2019). Towards Understanding the Spectral Bias of Deep Learning. *arXiv preprint arXiv:1912.01198*.

Chatzipanagiotis, N., Liu, Y., Petropulu, A., and Zavlanos, M. M. (2014). Distributed Cooperative Beamforming in Multi-Source Multi-Destination Clustered Systems. *IEEE Trans. Signal Process.* 62, 6105–6117. doi:10.1109/tsp.2014.2359634

Evmorfos, S., Diamantaras, K., and Petropulu, A. (2021a). “Deep Q Learning with Fourier Feature Mapping for Mobile Relay Beamforming Networks,” in 2021 IEEE 22nd International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 126–130. doi:10.1109/SPAWC51858.2021.9593138

Evmorfos, S., Diamantaras, K., and Petropulu, A. (2021b). “Double Deep Q Learning with Gradient Biasing for Mobile Relay Beamforming Networks,” in 2021 55th Asilomar Conference on Signals, Systems, and Computers, 742–746. doi:10.1109/ieeeconf53345.2021.9723405

Evmorfos, S., Diamantaras, K., and Petropulu, A. (2022). Reinforcement Learning for Motion Policies in Mobile Relaying Networks. *IEEE Trans. Signal Process.* 70, 850–861. doi:10.1109/TSP.2022.3141305

Gao, F., Cui, T., and Nallanathan, A. (2008). On Channel Estimation and Optimal Training Design for Amplify and Forward Relay Networks. *IEEE Trans. Wirel. Commun.* 7, 1907–1916. doi:10.1109/TWC.2008.070118

Havary-Nassab, V., Shahbazpanahi, S., Grami, A., and Zhi-Quan Luo, Z.-Q. (2008a). Distributed Beamforming for Relay Networks Based on Second-Order Statistics of the Channel State Information. *IEEE Trans. Signal Process.* 56, 4306–4316. doi:10.1109/tsp.2008.925945

Havary-Nassab, V., ShahbazPanahi, S., Grami, A., and Zhi-Quan Luo, Z.-Q. (2008b). Distributed Beamforming for Relay Networks Based on Second-Order Statistics of the Channel State Information. *IEEE Trans. Signal Process.* 56, 4306–4316. doi:10.1109/TSP.2008.925945

Heath, R. W. (2017). *Introduction to Wireless Digital Communication: A Signal Processing Perspective*. Prentice-Hall.

Jacot, A., Gabriel, F., and Hongler, C. (2018). “Neural Tangent Kernel: Convergence and Generalization in Neural Networks,” in Advances in Neural Information Processing Systems. Editors S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Curran Associates, Inc).31.

Kalogerias, D. S., Chatzipanagiotis, N., Zavlanos, M. M., and Petropulu, A. P. (2013). “Mobile Jammers for Secrecy Rate Maximization in Cooperative Networks,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference, 2901–2905. doi:10.1109/ICASSP.2013.6638188

Kalogerias, D. S., and Petropulu, A. P. (2016). “Mobile Beamforming Amp; Spatially Controlled Relay Communications,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6405–6409. doi:10.1109/ICASSP.2016

Kalogerias, D. S., and Petropulu, A. P. (2017). Spatially Controlled Relay Beamforming: 2-stage Optimal Policies. *Arxiv*.

Kalogerias, D. S., and Petropulu, A. P. (2018). Spatially Controlled Relay Beamforming. *IEEE Trans. Signal Process.* 66, 6418–6433. doi:10.1109/tsp.2018.2875896

Li, J., Petropulu, A. P., and Poor, H. V. (2011). Cooperative Transmission for Relay Networks Based on Second-Order Statistics of Channel State Information. *IEEE Trans. Signal Process.* 59, 1280–1291. doi:10.1109/TSP.2010.2094614

Liu, Y., and Petropulu, A. P. (2011). On the Sumrate of Amplify-And-Forward Relay Networks with Multiple Source-Destination Pairs. *IEEE Trans. Wirel. Commun.* 10, 3732–3742. doi:10.1109/twc.2011.091411.101523

MacCartney, G. R., Zhang, J., Nie, S., and Rappaport, T. S. (2013). Path Loss Models for 5G Millimeter Wave Propagation Channels in Urban Microcells. *Globecom*, 3948–3953. doi:10.1109/glocom.2013.6831690

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level Control through Deep Reinforcement Learning. *nature* 518, 529–533. doi:10.1038/nature14236

Muralidharan, A., and Mostofi, Y. (2017). “First Passage Distance to Connectivity for Mobile Robots,” in Proceedings of the American Control Conference (IEEE), 1517–1523. doi:10.23919/ACC.2017.7963168

Rockafellar, R. T., and Wets, R. J.-B. (2004). *Variational Analysis*, 317. Springer Science & Business Media.

Shapiro, A., Dentcheva, D., and Ruszczyński, A. (2009). *Lectures on Stochastic Programming*. 2nd edn. Society for Industrial and Applied Mathematics.

Sitzmann, V., Martel, J., Bergman, A., Lindell, D., and Wetzstein, G. (2020). Implicit Neural Representations with Periodic Activation Functions. *Adv. Neural Inf. Process. Syst.* 33, 7462–7473.

Yan, Y., and Mostofi, Y. (2013). Co-optimization of Communication and Motion Planning of a Robotic Operation under Resource Constraints and in Fading Environments. *IEEE Trans. Wirel. Commun.* 12, 1562–1572. doi:10.1109/twc.2013.021213.120138

Yan, Y., and Mostofi, Y. (2012). Robotic Router Formation in Realistic Communication Environments. *IEEE Trans. Robot.* 28, 810–827. doi:10.1109/TRO.2012.2188163

Keywords: relay networks, discrete motion control, stochastic programming, dynamic programming, deep reinforcement learning

Citation: Evmorfos S, Kalogerias D and Petropulu A (2022) Adaptive Discrete Motion Control for Mobile Relay Networks. *Front. Sig. Proc.* 2:867388. doi: 10.3389/frsip.2022.867388

Received: 01 February 2022; Accepted: 01 June 2022; Published: 06 July 2022.

Edited by: Monica Bugallo, Stony Brook University, United States

Reviewed by: Francesco Palmieri, University of Campania Luigi Vanvitelli, Italy; Stefania Colonnese, Sapienza University of Rome, Italy

Copyright © 2022 Evmorfos, Kalogerias and Petropulu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Athina Petropulu, athinap@rutgers.edu

^{†}These authors have contributed equally to this work