
ORIGINAL RESEARCH article

Front. Robot. AI, 16 April 2024
Sec. Robot Learning and Evolution
Volume 11 - 2024 | https://doi.org/10.3389/frobt.2024.1229026

Decentralized multi-agent reinforcement learning based on best-response policies

  • Chair of Automatic Control Engineering, TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany

Introduction: Multi-agent systems are an interdisciplinary research field that describes the concept of multiple decision-making individuals interacting with a usually partially observable environment. Given the recent advances in single-agent reinforcement learning, multi-agent reinforcement learning (MARL) has gained tremendous interest in recent years. Most research studies apply a fully centralized learning scheme to ease the transfer from the single-agent domain to multi-agent systems.

Methods: In contrast, we claim that a decentralized learning scheme is preferable for applications in real-world scenarios as this allows deploying a learning algorithm on an individual robot rather than deploying the algorithm to a complete fleet of robots. Therefore, this article outlines a novel actor–critic (AC) approach tailored to cooperative MARL problems in sparsely rewarded domains. Our approach decouples the MARL problem into a set of distributed agents that model the other agents as responsive entities. In particular, we propose using two separate critics per agent to distinguish between the joint task reward and agent-based costs as commonly applied within multi-robot planning. On one hand, the agent-based critic intends to decrease agent-specific costs. On the other hand, each agent intends to optimize the joint team reward based on the joint task critic. As this critic still depends on the joint action of all agents, we outline two suitable behavior models based on Stackelberg games: a game against nature and a dyadic game against each agent. Following these behavior models, our algorithm allows fully decentralized execution and training.

Results and Discussion: We evaluate our presented method using the proposed behavior models within a sparsely rewarded simulated multi-agent environment. While our approach already outperforms the state-of-the-art learners, we conclude this article by outlining possible extensions of our algorithm that future research may build upon.

1 Introduction

Based on recent advances in robotics research over the last few decades, automated robotic systems have been established in everyday life, even beyond industrial applications. Nonetheless, it remains tedious and challenging to impart new tasks to robots, especially if the environment is stochastic and hard to model. In this context, applied machine learning (ML), specifically reinforcement learning (RL), is a promising research field that aims to continuously improve robotic performance from collected task trial samples. In particular, the core motivation is to equip robots with the ability to explore and learn unknown tasks simultaneously without relying on an accurate model of the environment or the task. Building upon this, the concept of MARL has raised interest in improving scalability by executing tasks with a fleet of robots rather than a single autonomous unit. In order to exploit results from single-agent RL, a common paradigm in MARL is centralized learning with decentralized execution. Nonetheless, it is desirable to handle each robot as an independent individual, such that the learning phase of a MARL algorithm also scales well. In contrast to simulated environments, where access to other agents’ policies and observations is realistic, this assumption is overly restrictive for real robot systems and adds additional constraints to heterogeneous robot fleets.

Therefore, the contribution of this article is a novel AC method for cooperative MARL problems in sparsely rewarded environments. Our MARL algorithm allows fully decoupling learning among the agents while achieving comparable performance to current state-of-the-art MARL approaches. This approach uses the best-response policies of other agents conditioned on each agent’s policy output. This removes the need to access the exact agent policies during learning and thus achieves a fully decentralized learning scheme. This decision-theoretic principle stems from Stackelberg equilibria from game theory and is tailored to non-zero-sum games in the scope of this article.

Our proposed method incorporates another concept from multi-robot planning and game theory by explicitly differentiating between interactive task rewards and agent-specific costs, i.e., native costs. In other words, each agent estimates the performance of the joint policy with regard to the current task to be learned but also maintains a cost critic that evaluates agent-specific costs.

In the remainder of this article, we briefly summarize the state-of-the-art algorithms of MARL in Section 2, followed by a summary of the technical foundations of this article and the technical problem in Section 3. The core concept of our proposed framework is outlined in Section 4. In order to evaluate the presented method, we present the collected results of our method compared against state-of-the-art MARL algorithms in Section 5. Based on these results, we discuss our algorithm and explicitly sketch conceptual modifications of our approach in Section 6. Finally, we conclude this article in Section 7.

2 Related work

Even though early applications of RL in robotic systems have shown promising results (Ng et al., 2004; Kolter and Ng, 2009), it was the success of outperforming humans in computer games via deep RL (Mnih et al., 2015; Silver et al., 2016; Vinyals et al., 2019) without suffering from catastrophic interference (McCloskey and Cohen, 1989) that opened the door for RL applications within complex, real-world environments. Given the computational power of modern graphics processing units (GPUs), policy gradients such as the stochastic policy gradient from Sutton et al. (1999a) or the deterministic policy gradient (DPG) from Silver et al. (2014) have been realized via function approximators such as neural networks (NNs). A famous example is the deep deterministic policy gradient (DDPG) by Lillicrap et al. (2016). DDPG has shown that deep RL can also be applied to continuous action spaces, such that the applicability of RL within robotic systems has been boosted drastically ever since. Even though further PG methods have been developed to mitigate the variance sensitivity issue, such as trust region policy optimization (Schulman et al., 2015), proximal policy optimization (Schulman et al., 2017), or maximum a posteriori policy optimization (Song et al., 2020), the majority of algorithms rely on an AC architecture, where an additional critic reduces the variance drastically, such as the soft actor–critic (SAC) (Haarnoja et al., 2018). As an extensive outline of advances in single-agent RL is beyond the scope of this article, we refer the interested reader to available survey papers (Kaelbling et al., 1996; Kober et al., 2013; Arulkumaran et al., 2017). Building upon the results from single-agent RL, MARL has gained great interest over the last decades and is thus outlined separately in the following.

2.1 Multi-agent reinforcement learning

In addition to solving complex Markov decision process (MDP) problems, the decentralized extension of the MDP, the Markov game (MG), has gained attention in the context of MARL (van der Wal, 1980; Littman, 1994). The naive approach of extending Q-learning to a set of $N_A$ independent learners (Tan, 1993) works well for small-scale problems or selective applications. Similar to deep RL, initial results on MARL have been obtained on discrete action sets, such as the differentiable inter-agent learning by Foerster et al. (2016) or the explicit communication learning by Havrylov and Titov (2017) and Mordatch and Abbeel (2017). In general, however, independent learners violate the Markov assumption (Laurent et al., 2011).

The multi-agent deep deterministic policy gradient (MADDPG) is an extension of DDPG to MARL (Lowe et al., 2017), which also applies an AC architecture. During training, a centralized critic uses additional information about the other agents’ states and actions to approximate the Q-function. Given this centralized critic, each agent updates a policy that is only conditioned on its local observations. Thus, the actor only relies on local observations during execution. MADDPG has achieved very robust results in simulated benchmark environments (Mordatch and Abbeel, 2017) for cooperative and competitive scenarios. Various extensions to MADDPG have been proposed. Li et al. (2019) introduced an extension that uses the minimax concept of game theory to make decisions under uncertainty, i.e., to take the best action in the worst possible case.

As pointed out by Ackermann et al. (2019), the overestimation bias is also present in MARL. Initial works have proposed to bridge concepts from the single-agent domain (van Hasselt, 2010) to MARL (Sun et al., 2020). Similarly, SAC has been adjusted to the multi-agent domain by Wei et al. (2018), for which further extensions have been outlined, e.g., Zhang et al. (2020) proposed a Lyapunov-based penalty term on the policy update to stabilize the policy gradient. As centralized learning inherently suffers from poor scaling, Iqbal and Sha (2019) introduced attention mechanisms in the multi-actor-attention-critic (MAAC). In order to cope with large-scale MARL, Sheikh and Bölöni (2020) explicitly differentiated between local and global reward metrics that each agent obtains from the environment.

In contrast to single-agent systems, the critic also suffers from the non-stationarity of the policies of other agents. This initiated the research on explicitly modeling the learning behavior of other agents, such as Foerster et al. (2018). Alternatively, Tian et al. (2019) proposed to model the MARL problem as an inference problem, i.e., to estimate the most likely action of the other agents and respond with the best response (br). Jaques et al. (2019) reversed this idea by applying counterfactual reasoning and thus incorporating the mutual influence among agents into the reward of each agent. They outlined a decentralized version of their algorithm, which applies behavioral cloning similarly to the decentralized version of MADDPG.

As a complete survey of MARL is beyond the scope of this article, we refer to Zhang et al. (2021), Hernandez-Leal et al. (2020), Yang and Wang (2020), Hernandez-Leal et al. (2019), and Nguyen et al. (2020) for a more detailed literature review. In order to illustrate the relevance of MARL from an application-driven perspective, there exists a variety of recent examples, such as logistics (Tang et al., 2021), the Internet of Things (Wu et al., 2021), or motion-planning for robots (He et al., 2021).

3 Preliminaries

As the methods presented in this article build upon various findings from the literature, we provide an insight into these methods. We begin by sketching the notation used in this article, followed by an introduction to MGs.

3.1 Notation

In order to outline the notation for the remainder of this article, we use $p$ as an arbitrary placeholder variable. We denote $p^{(i)}$ as a variable explicitly assigned to agent $i$, while $(p)_{i \in \mathbb{N}_A}$ denotes the joint team analog of said variable for all agents. This is most commonly denoted as $\underline{p}$ for brevity, while $\underline{p}^{(-i)}$ denotes all elements of $(p)_{i \in \mathbb{N}_A}$ except $p^{(i)}$. Furthermore, we denote vectors as $\boldsymbol{p}$ and matrices as $\mathbf{p}$, while $\boldsymbol{1}_p$, $\mathbf{1}_{p \times p}$, $\boldsymbol{0}_p$, and $\mathbf{0}_{p \times p}$ denote identity- and zero-vectors/matrices. In general, we denote time-variant variables as $p_t$, while a temporal successor $p_{t+1}$ is denoted as $p'$ for brevity. In the context of stochastic variables, we denote probability density functions (PDFs) as $\mathbb{P}[p]$ and conditionally dependent PDFs as $\mathbb{P}[p_1 \mid p_2]$. Similarly, $\mathbb{E}_{p_1 \sim \rho(p_2)}[\cdot]$ and $\mathrm{Var}_{p_1 \sim \rho(p_2)}[\cdot]$ symbolize the expectation and variance of the random variable $p_1$ that follows a probability distribution $\rho(\cdot)$, which depends on $p_2$. Finally, we denote hierarchical systems by denoting layer $k$ as $p^{\{k\}}$.

3.2 Markov game

An MG is an extension of an MDP to the multi-agent domain, which is fully described by the tuple $\langle \underline{A}, \underline{S}, \underline{\mathcal{A}}, T, \underline{R}, \gamma \rangle$, where $N_A$ agents $\underline{A} = \left(A^{(1)}, A^{(2)}, \ldots, A^{(N_A)}\right) = (A)_{i \in \mathbb{N}_A}$ interact with each other in a stochastic environment (Shapley, 1952; Shapley, 1953), as shown in Figure 1. The state $\underline{s} = \left(s^{(1)}, s^{(2)}, \ldots, s^{(N_A)}\right) \in \underline{S}$ of the environment with state space $\underline{S}$ is perceived as the individual state observations $s^{(i)}$ for each agent. Due to the Markov property, the dynamics of an MG are given by each individual choosing an action $a^{(i)} \in \mathcal{A}^{(i)} \subseteq \underline{\mathcal{A}}$ out of an agent-specific action space $\mathcal{A}^{(i)}$, thus forming a joint action $\underline{a}$ that transitions $\underline{s}$ to $\underline{s}'$ according to a transition probability function $T := \mathbb{P}[\underline{s}' \mid \underline{s}, \underline{a}]$, where $\mathbb{P}[\underline{s}' \mid \underline{s}, \underline{a}]$ is the conditional probability for $\underline{s}'$, given $\underline{s}$ and $\underline{a}$. The individual reward functions $\underline{R} = \left(R^{(i)}: S^{(i)} \times \mathcal{A}^{(i)} \times \underline{\mathcal{A}}^{(-i)} \times S^{(i)} \to \mathbb{R}\right)_{i \in \mathbb{N}_A}$ map a transition from $\underline{s}$ to $\underline{s}'$, given $\underline{a}$, to a numeric value for each agent $A^{(i)}$, which is denoted as $r^{(i)} := R^{(i)}\left(s^{(i)}, a^{(i)}, \underline{a}^{(-i)}, s'^{(i)}\right)$. Given this, each agent $A^{(i)}$ follows the stochastic behavior policy $a^{(i)} \sim \pi^{(i)}\left(s^{(i)}\right)$ that intends to maximize the objective for each agent

$$J^{(i)} := \sum_{t=0}^{\infty} \gamma^t \int_{\underline{\mathcal{A}}} \underline{\pi}\!\left(\underline{a}_t \mid \underline{s}_t\right) \int_{\underline{S}} T\!\left(\underline{s}_{t+1} \mid \underline{s}_t, \underline{a}_t\right) r^{(i)}\, \mathrm{d}\underline{s}_{t+1}\, \mathrm{d}\underline{a}_t = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}_{\underline{a}^{(-i)} \sim \underline{\pi}^{(-i)}\left(s_t^{(-i)}\right)}\!\left[ \int_{\underline{\mathcal{A}}} \pi\!\left(a_t^{(i)} \mid s_t^{(i)}\right) \mathbb{P}\!\left[\underline{a}^{(-i)}\right] \int_{\underline{S}} T\!\left(\underline{s}_{t+1} \mid \underline{s}_t, \underline{a}_t\right) r^{(i)}\, \mathrm{d}\underline{s}_{t+1}\, \mathrm{d}\underline{a}_t \right], \tag{1}$$

where the hyperparameter $\gamma \in [0, 1)$ is a temporal decay weight that scales short-term versus long-term impact. In order to solve Equation 1, the state-value function,

$$V_{\underline{\pi}}^{(i)}\!\left(\underline{s}\right) = \sum_{t=0}^{\infty} \mathbb{E}_{\underline{a}_t \sim \rho_{\underline{\pi}, \underline{s}_t},\; \underline{s}_{t+1} \sim \rho_{T, \underline{\pi}}}\!\left[\gamma^t r^{(i)} \mid \underline{s}_0 = \underline{s}\right], \tag{2}$$

the state-action value function,

$$Q_{\underline{\pi}}^{(i)}\!\left(\underline{s}, \underline{a}\right) = r^{(i)} + \gamma\, \mathbb{E}_{\underline{s}' \sim \rho_{T, \underline{\pi}}}\!\left[V_{\underline{\pi}}^{(i)}\!\left(\underline{s}'\right)\right], \tag{3}$$

and the advantage function,

$$A_{\underline{\pi}}^{(i)}\!\left(\underline{s}, \underline{a}\right) = Q_{\underline{\pi}}^{(i)}\!\left(\underline{s}, \underline{a}\right) - V_{\underline{\pi}}^{(i)}\!\left(\underline{s}\right) \tag{4}$$

have been introduced as the multi-agent version of the Bellman backup operator for MDPs (Bellman, 1957). Given that the agents follow a fixed and optimal policy $\underline{\pi}^*$, the dynamic programming problem eventually solves Equation 1 as the global optimum of the MG, as shown by Littman (1994). Given the optimal $Q_{\underline{\pi}^*}$ function, the optimal policies for each agent can be obtained as the following:

$$\pi^{(i)*}\!\left(\underline{s}\right) \in \underset{\pi^{(i)}}{\arg\max}\; Q_{\underline{\pi}^*}^{(i)}\!\left(\underline{s}, \underline{a}^{(-i)}, a\right), \quad a \sim \pi^{(i)}\!\left(\underline{s}\right). \tag{5}$$

As solving Equation 5 requires each agent to follow an optimal policy, the definition of a best-response policy is of importance in MGs.


Figure 1. Sketch of a general MARL problem, where $N_A$ agents interact with each other in an unknown environment. Each agent has access to the individual state observation $s^{(i)}$, which can be mapped to an action $a^{(i)}$ via the policy $\pi^{(i)}$ in such a manner that the expected individual return $r^{(i)}$ is maximized.

Definition 3.1. Best response policy

Given a joint policy $\underline{\pi}^{(-i)}$ for the neighboring agents of agent $A^{(i)}$, a policy $\pi^{(i)}_{\mathrm{br}}$ is called a br to $\underline{\pi}^{(-i)}$ if and only if

$$J^{(i)}\!\left(a^{(i)}_{\mathrm{br}} \sim \pi^{(i)}_{\mathrm{br}} \,\middle|\, \underline{\pi}^{(-i)}\right) \geq J^{(i)}\!\left(a^{(i)} \neq a^{(i)}_{\mathrm{br}} \,\middle|\, \underline{\pi}^{(-i)}\right),$$

i.e., the agent $A^{(i)}$ cannot improve the individual return $J^{(i)}$ by deviating from $\pi^{(i)}_{\mathrm{br}}$ (Shoham and Leyton-Brown, 2008).

Within an MG, the optimal policy requires that the policies of the individual agents are the br to the policies of the surrounding agents, leading to the definition of a Nash equilibrium (NE).

Definition 3.2. Nash equilibrium

According to Nash (1950), a policy $\underline{\pi}^{\mathrm{NE}} := \left(\pi^{(i)}_{\mathrm{NE}}\right)_{i \in \mathbb{N}_A}$ is an NE if and only if each agent following $\pi^{(i)}_{\mathrm{NE}} \in \underline{\pi}^{\mathrm{NE}}$ results in each policy being a br policy, according to Definition 3.1. Replacing the objectives $J^{(i)}$ by the state-action value $Q^{(i)}$, this requires

$$Q^{(i)}_{\pi^{(i)}_{\mathrm{NE}}, \underline{\pi}^{(-i)}_{\mathrm{NE}}} \geq Q^{(i)}_{\tilde{\pi}^{(i)}, \underline{\pi}^{(-i)}_{\mathrm{NE}}}, \qquad Q^{(i)}_{\pi^{(i)}_{\mathrm{NE}}, \underline{\pi}^{(-i)}_{\mathrm{NE}}} \geq Q^{(i)}_{\pi^{(i)}_{\mathrm{NE}}, \underline{\tilde{\pi}}^{(-i)}}, \quad \text{with } \tilde{\pi} \neq \pi_{\mathrm{NE}},\; \forall A^{(i)} \in \underline{A},\; \forall \underline{s} \in \underline{S},$$

to hold on the global state space $\underline{S}$.
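As a minimal illustration of Definitions 3.1 and 3.2 (a toy example of our own, not taken from the evaluated environments), the following snippet enumerates the deterministic NEs of a two-agent, two-action cooperative matrix game by checking the best-response condition for both agents.

```python
import numpy as np

# Toy cooperative matrix game: both agents receive the same payoff R[a1, a2].
R = np.array([[4.0, 0.0],
              [0.0, 2.0]])

def is_nash(a1: int, a2: int) -> bool:
    """Check Definition 3.2 for the deterministic joint action (a1, a2):
    neither agent can improve the shared payoff by unilaterally deviating."""
    best_reply_1 = R[:, a2].max()   # best response of agent 1 to a2 (Definition 3.1)
    best_reply_2 = R[a1, :].max()   # best response of agent 2 to a1
    return R[a1, a2] >= best_reply_1 and R[a1, a2] >= best_reply_2

print([(a1, a2) for a1 in range(2) for a2 in range(2) if is_nash(a1, a2)])
# -> [(0, 0), (1, 1)]: both joint actions are NEs, but only (0, 0) is team-optimal.
```

The example also highlights why the NE condition alone does not solve the MARL problem below: both (0, 0) and (1, 1) are NEs, yet only (0, 0) is team-optimal.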

Nonetheless, in real-world problems, neither π̲* nor the value functions are known. In addition, the environment is characterized by multiple learners, whose policies and, thus, actions vary over time and cannot directly be controlled by an individual agent in an MG. This results in the problem formulation of this article.

3.3 Multi-agent reinforcement learning problem

Given a set of agents $\left(A^{(i)}\right)_{i \in \mathbb{N}_A}$ that try to optimize their individual accumulated discounted reward according to Section 3.2, an optimal policy for each agent has to be found, which fulfills the following:

• Each individual policy is optimal, i.e., $\pi^{(i)*} \in \arg\max_{\pi^{(i)}} Q^{(i)}_{\underline{\pi}}$, according to Equation 5.

• The joint action $\underline{a} = \left(a \sim \pi^{(i)*}\right)_{i \in \mathbb{N}_A}$ is an NE of the MG, according to Definition 3.2.

We will continue with a short overview of RL methods that have been established as the current state-of-the-art methods within single-agent RL and MARL.

3.4 Policy gradient methods

Obtaining an optimal policy $\pi_\Pi$, parameterized by $\Pi$, has been tackled via PGs (Sutton et al., 1999b), which estimate the stochastic gradient of the policy objective with respect to $\Pi$ as the following:

$$\nabla_\Pi J\!\left(\pi_\Pi\right) = \mathbb{E}_{s \sim \pi}\!\left[\sum_{t=0}^{\infty} \nabla_\Pi \log \pi_\Pi\!\left(a_t \mid s_t\right) \chi_t\right], \tag{6}$$

where $\chi_t$ may, for example, be the single-agent version of Equation 3 or Equation 4, i.e., $Q_\pi$ or $A_\pi$. If one can obtain the gradient $\nabla_a \chi_t$ directly, i.e., the action space is continuous and the environment is stationary, it is also possible to obtain the DPG from Equation 6 as the following:

$$\nabla_\Pi J\!\left(\pi_\Pi\right) = \mathbb{E}_{s \sim \mathcal{D}}\!\left[\nabla_\Pi \pi_\Pi\!\left(a \mid s\right) \nabla_a \chi \,\big|_{a \sim \pi(s)}\right], \tag{7}$$

where the expectation is approximated by drawing samples from an experience replay buffer $\mathcal{D}$ that contains observed environment transitions. DDPG, for example, uses $\chi_t := Q_\pi$ in order to obtain the gradient of the state-action value in Equation 7. As can be seen in Equations 6 and 7, PGs and DPGs are generally highly sensitive to the variance of $\chi_t$. As a consequence, AC methods have been outlined that add a policy evaluation metric to the policy update of PG methods.
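To make the DPG update of Equation 7 concrete, the following minimal Python/PyTorch sketch (our own illustration; `policy` and `q_net` are assumed to be differentiable network modules) shows that backpropagating through the critic evaluated at the policy output yields the chained gradient $\nabla_\Pi \pi_\Pi(s)\, \nabla_a Q(s, a)$.

```python
import torch

def dpg_actor_loss(policy: torch.nn.Module, q_net: torch.nn.Module,
                   states: torch.Tensor) -> torch.Tensor:
    """Deterministic policy-gradient actor loss in the spirit of Equation 7.

    Minimizing this loss ascends Q along grad_a Q(s, a)|_{a = pi(s)}, because the
    actions are produced by the differentiable deterministic policy itself."""
    actions = policy(states)               # a = pi_Pi(s)
    return -q_net(states, actions).mean()  # negative sign: optimizers minimize
```

Calling `backward()` on this loss propagates $\nabla_a Q$ through the policy and thus realizes the chained gradient of Equation 7 over the sampled batch.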

3.5 Actor–critic methods

As the accumulated reward generally suffers from high variance over repeated episodes, AC algorithms simultaneously estimate $A_\pi$ or $Q_\pi$ alongside the PGs in Equation 6. The deep Q-network presented by Mnih et al. (2015) uses NNs as function approximators, thus approximating $Q_\pi$ by $Q_\Theta$ and $Q_{\Theta^-}$, parameterized by $\Theta$, where $Q_{\Theta^-}$ denotes the target net of $Q_\Theta$. These two function approximators are then used to learn $Q_\pi$ via off-policy temporal-difference learning, which is obtained by iteratively minimizing the loss function as follows:

$$L_Q\!\left(\Theta\right) := \mathbb{E}_{\left(s, a, r, s'\right) \sim \mathcal{D}}\!\left[\frac{1}{2}\left(p - Q_\Theta\!\left(s, a\right)\right)^2\right] \quad \text{with} \quad p = r\!\left(s, a, s'\right) + \gamma\left(1 - d\right) V_{\Theta^-}\!\left(s'\right), \quad V_{\Theta^-}\!\left(s'\right) = \mathbb{E}_{a' \sim \pi\left(s'\right)}\!\left[Q_{\Theta^-}\!\left(s', a'\right)\right], \tag{8}$$

where $\mathcal{D}$ is again a replay buffer that stores experienced transitions from the environment during the exploration process. Each sample contains the state $s$, the action $a$, the next state $s'$, the experienced reward $r$, and the termination flag $d$. The term $(1-d)$ thus ignores the value of the successor state in the Bellman backup operator in Equation 3 at terminal states.
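For illustration, a minimal sketch of the loss in Equation 8 is given below, assuming PyTorch-style function approximators and a replay batch layout of our own choosing; it is not the implementation evaluated in Section 5.

```python
import torch

def critic_td_loss(q_net, q_target_net, policy, batch, gamma=0.99):
    """Temporal-difference critic loss from Equation 8 (illustrative sketch).

    batch: tensors (s, a, r, s_next, d) sampled from the replay buffer D,
    where d is a float tensor of termination flags in {0, 1}."""
    s, a, r, s_next, d = batch
    with torch.no_grad():
        a_next = policy(s_next)                # a' ~ pi(s')
        v_next = q_target_net(s_next, a_next)  # single-sample estimate of V_{Theta^-}(s')
        p = r + gamma * (1.0 - d) * v_next     # Bellman target, masked at terminal states
    q = q_net(s, a)
    return 0.5 * torch.mean((p - q) ** 2)
```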

The SAC (Haarnoja et al., 2018) is an extension of the general AC that approximates the solution of Equation 1 via a maximum entropy objective by introducing a soft-value function, thus replacing Equation 2 by the following:

$$V_\pi\!\left(s\right) := \mathbb{E}_{s_t, a_t, s_{t+1} \sim \mathcal{D}}\!\left[\sum_{t=0}^{\infty} \gamma^t \left(r\!\left(s_t, a_t, s_{t+1}\right) + \alpha\, \mathcal{H}\!\left(\pi \mid s_t\right)\right) \,\middle|\, s_0 = s\right]. \tag{9}$$

In Equation 9, $\mathcal{H}(\cdot)$ denotes the policy entropy at a given state, and $\alpha$ is a temperature parameter that weighs the impact of the entropy against the environment reward. In contrast to Equation 2, this objective explicitly encourages exploration in regions of high rewards, thus decreasing the chance of converging to local minima. Furthermore, two function approximators are used for the critic, as in the twin-delayed deep deterministic policy gradient (TD3), such that the target value function in Equation 8 is obtained as the following:

$$V_{\Theta^-}\!\left(s'\right) = \mathbb{E}_{a' \sim \pi\left(s'\right)}\!\left[\min_{j=1,2} Q_{\Theta^-, j}\!\left(s', a'\right) - \alpha \log \pi_\Pi\!\left(a' \mid s'\right)\right]. \tag{10}$$

In Equation 10, $a'$ is obtained from $\pi(s')$, whereas $s'$ is drawn from $\mathcal{D}$. In contrast to this, the actual policy loss is obtained by applying the reparameterization trick as follows:

$$L_\pi^{\mathrm{SAC}}\!\left(\Pi\right) := \mathbb{E}_{s \sim \mathcal{D}}\!\left[\min_{j=1,2} Q_{\Theta, j}\!\left(s, f_\Pi\!\left(s, \zeta\right)\right) - \alpha \log \pi_\Pi\!\left(f_\Pi\!\left(s, \zeta\right) \mid s\right)\right], \tag{11}$$

which uses a deterministic function $f_\Pi(s, \zeta)$ that depends on the state $s$, the policy parameters $\Pi$, and an independent noise vector $\zeta$ drawn from a fixed distribution, e.g., zero-mean Gaussian noise. In contrast to DDPG, this parameterized policy is also squashed via a tanh function to the bounds of the action space, thus resulting in valid samples that can be used to generate a stochastic policy for the stochastic policy gradient update step.
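A minimal sketch of this reparameterized, tanh-squashed sampling step is given below, assuming a diagonal Gaussian policy head; the bound handling and the numerical constant are illustrative choices on our side.

```python
import torch
from torch.distributions import Normal

def squashed_reparameterized_sample(mean, log_std, low, high):
    """Draw a = f_Pi(s, zeta) with zeta ~ N(0, I) and squash it to the action bounds.

    Returns the action and its log-probability, including the tanh
    change-of-variables correction needed for the SAC policy loss (Equation 11)."""
    std = log_std.exp()
    zeta = torch.randn_like(mean)             # independent, zero-mean Gaussian noise
    u = mean + std * zeta                     # reparameterized pre-squash sample
    a = torch.tanh(u)                         # squash into (-1, 1)
    log_prob = Normal(mean, std).log_prob(u) - torch.log(1.0 - a.pow(2) + 1e-6)
    log_prob = log_prob.sum(dim=-1)
    a = low + 0.5 * (a + 1.0) * (high - low)  # rescale to the actual action bounds
    return a, log_prob
```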

3.6 Multi-agent actor–critic algorithms

The methods mentioned above have recently been extended to the multi-agent domain. MADDPG extends DDPG by combining decentralized execution with centralized learning. As such, each $A^{(i)}$ learns an individual (deterministic) policy $\pi^{(i)} := S^{(i)} \times \mathcal{A}^{(i)} \to [0, 1]$ while setting $\chi_t := Q^{(i)}(\underline{s}, \underline{a})$ in Equation 7, where the critic has access to the observations $s^{(i)}$, actions $a^{(i)}$, and policies of all agents such that Equation 8 can be directly applied to the multi-agent domain. This requires having access to all policies during learning in order to calculate the target values of Equation 8. Similar approaches have been proposed by MAAC and the counterfactual multi-agent (COMA) PG, which additionally incorporate a baseline value function for the policy update and thus use the multi-agent advantage function

$$\chi := A^{(i)}\!\left(\underline{s}, \underline{a}\right) = Q^{(i)}\!\left(\underline{s}, a^{(i)}, \underline{a}^{(-i)}\right) - V_b\!\left(\underline{s}, \underline{a}^{(-i)}\right) \tag{12}$$

for their policy loss definitions. The baseline $V_b(\underline{s}, \underline{a}^{(-i)})$ estimates the value of the current state and the opponents' current actions such that optimizing Equation 12 leads to a best-response action of agent $A^{(i)}$, according to Definition 3.1. While COMA substitutes Equation 12 into Equation 6, MAAC uses SAC, thus inserting Equation 12 into Equation 11. Furthermore, MAAC improves the centralized critic by adding an attention mechanism that explicitly learns which parts of the observations have an actual impact on the values of the critic.
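For clarity, the baseline-corrected advantage of Equation 12 can be sketched as follows (our illustration in Python/PyTorch; the network interfaces are assumptions rather than the original MAAC or COMA implementations).

```python
import torch

def multi_agent_advantage(q_net, baseline_net, s_joint, a_own, a_others):
    """Baseline-corrected multi-agent advantage of Equation 12 (illustrative sketch):
    Q(s, a^(i), a^(-i)) minus a baseline that only depends on the joint state and
    the opponents' actions, so that the difference isolates agent i's contribution."""
    q = q_net(s_joint, torch.cat([a_own, a_others], dim=-1))
    v_b = baseline_net(s_joint, a_others)
    return q - v_b
```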

4 Technical approach

A key challenge for real-world applications of multi-agent systems is the ability to handle decentralized decisions asynchronously. Although most MARL approaches allow decentralized execution, they still rely on centralized learning (Lowe et al., 2017). This imposes various constraints on the overall multi-agent system, e.g., the necessity to have access to all observations of all agents during learning. Furthermore, real robot systems are often commanded by a task planner or similar, where each robot is assigned a dedicated sub-task. Similarly, the upper layer could be realized via a hierarchical reinforcement learning (HRL) learner that learns to allocate tasks to each agent in a team-optimal manner. To visualize the necessity of our algorithm, a two-layered hierarchical decision framework for a multi-agent system is shown in Figure 2. The upper layer could either stem from a sub-task allocator that assigns tasks to each agent or from a multi-agent HRL algorithm. As can be seen from this figure, centralized learning would not only require synchronous updates along all agents and layers but would also require knowing the current output of the upper layer of each agent. In order to lift this constraint, we propose a novel decentralized MARL concept that builds upon the concept of best-response policies and separates joint rewards from internal agent objectives.


Figure 2. Exemplary step of a two-level hierarchical MARL setup, where each low-level step represents an interaction with the environment from Figure 1. For brevity, only selected nodes and edges are labeled. The upper layer acts synchronously, such that the observed transition would qualify for centralized learning on all layers, which is emphasized via the dashed lines for the upper layer.

4.1 Decentralized MARL based on Stackelberg equilibria

In order to achieve a decentralized model for MARL problems, previous work has either evaluated predicting the br policy to the inferred action of an opponent (Tian et al., 2019) or assumed overly restrictive access to the environment feedback of other agents. The latter is always fulfilled for centralized learning. In order to decouple the decentralized learning procedure, we propose an idea similar to that of Tian et al. (2019) and instead reformulate their inference-based policy by modeling the br policy of the other agents. In detail, we apply the concept of Stackelberg equilibria. The Stackelberg equilibrium evaluates the br of an agent once the opponent has revealed its current action. Therefore, each agent regresses not only a policy $\pi^{(i)}_\Pi := S^{(i)} \to \mathcal{A}^{(i)}$, parameterized by $\Pi$, that intends to optimize the player-individual agent objective but also a br policy $\underline{\pi}^{(i)}_{\mathrm{br}, \Xi} := S^{(i)} \times \mathcal{A}^{(i)} \to \underline{\mathcal{A}}^{(-i)}$, parameterized by $\Xi$, that represents the reactions of the other agent(s) at each step.

In addition to regressing the br policy of the other agent, we further claim that it is beneficial to distinguish between joint task rewards and individual or native cost terms. In general, we assume that the individual reward for a cooperative MARL problem is given in the following form:

$$J^{(i)}\!\left(\underline{s}, \underline{\pi}, \underline{s}'\right) = \sum_{t=0}^{\infty} \gamma^t\, \mathbb{E}\!\left[r^{(i)}\!\left(\underline{s}_t, \underline{\pi}, \underline{s}_{t+1}\right) - \hat{c}^{(i)}_{\mathrm{nat}}\!\left(\underline{s}_t, \pi^{(i)}, \underline{s}_{t+1}\right)\right] \in \mathbb{R}. \tag{13}$$

Thus, the individual reward in Equation 13 consists of a joint or cooperative task reward that depends on the joint action or policy, as well as a native cost component that only affects the individual agent. While some existing work assumes direct access to local and global rewards, i.e., to obtain $r(\underline{s}, \underline{\pi}, \underline{s}')$ and $\hat{c}^{(i)}_{\mathrm{nat}}$ directly (Sheikh and Bölöni, 2020), we propose a model, tailored toward sparsely rewarded environments, that only has access to the agent reward and the averaged joint task reward of all agents. The cost of each agent thus needs to be estimated from this joint reward at each transition, for which we apply the following:

$$\hat{c}^{(i)}_{\mathrm{nat}}\!\left(s^{(i)}, \underline{a}, s'^{(i)}\right) \approx -\min\!\left(r^{(i)}\!\left(s^{(i)}, \underline{a}, s'^{(i)}\right) - \frac{1}{N_A}\sum_{j=1}^{N_A} r^{(j)}\!\left(\underline{s}^{(j)}, \underline{a}, \underline{s}'^{(j)}\right),\, 0\right) \approx -\min\!\left(r^{(i)} - \frac{1}{N_A}\sum_{j=1}^{N_A} r^{(j)},\, 0\right), \quad \text{where } \underline{r} \sim \mathcal{D}^{(i)}, \tag{14}$$

i.e., we keep the individual agent reward as the joint task reward, which solely depends on the observation of each agent. In addition, we propose to regress a non-negative auxiliary cost term that contains information about local interaction penalties for each agent. Exploiting the rare occurrence of costs within sparsely rewarded environments, we estimate the step cost for each agent as the difference between the average reward of all agents and the individual agent reward. Having collected empirical data within a replay buffer $\mathcal{D}^{(i)}$ for each agent, the numeric cost is approximated as a non-negative difference of the average reward values of all agents without the necessity of explicitly accessing the observations of the other agents. Finally, our br-AC approach approximates the following:

• the (interactive) task critic $Q^{(i)}_{\mathrm{int}} := S^{(i)} \times \underline{\mathcal{A}} \to \mathbb{R}$, which intends to maximize the accumulated task reward.

• the (native) agent critic $Q^{(i)}_{\mathrm{nat}} := S^{(i)} \times \mathcal{A}^{(i)} \to \mathbb{R}$, which intends to minimize the agent-specific costs.

Therefore, the final goal of each agent is to maximize the interactive task critic, i.e., optimize the accumulated team reward, while minimizing the agent-specific cost penalties from the native agent critic. This concept is similar to the idea of combining RL rewards with the minimization of a myopic objective by means of numeric optimization, cf. Englert and Toussaint (2016). The agent policies and critics can then be regressed by means of existing AC methods, such as SAC or TD3. In contrast to the default methods, the policies need to optimize the joint task critic and the native agent critic simultaneously. Therefore, the difference between the two critics provides the final critic that is used for the policy gradient of the current actor. Eventually, the br policy needs to be updated as well. In contrast to the agent policy, the native critic is independent of the br policy and can thus be neglected for the update of the br policies. As we focus on cooperative MARL, the br policies intend to optimize the joint task critic as well. Thus, the br policy is found by obtaining the gradient of the joint task critic with regard to the policy of the other agents after applying the current agent policy. Denoting the cost estimation from Equation 14 as GetCost, a single update step for agent $i$ is sketched in Algorithm 1. The dedicated critic losses are denoted as CriticLoss and CostCriticUpdate in Algorithm 1. The native cost critic is regressed by calculating


Algorithm 1. Decentralized br policy-based MARL update step for agent i. Due to the decentralized learning, the update step can be run in a fully parallelized procedure. For brevity, the exploration is omitted from this algorithm skeleton.

$$p = \hat{c}^{(i)}_{\mathrm{nat}} + \gamma\left(1 - d\right) \max_{j=1,2} Q^{(i)}_{\mathrm{nat}, j}\!\left(s'^{(i)}, a'^{(i)}\right), \quad a'^{(i)} \sim \pi^{(i)}\!\left(s'^{(i)}\right), \tag{15}$$

in Equation 8. Note that Equation 15 uses the maximum critic value to account for the overestimation bias, as the cost critic intends to minimize the accumulated native costs. We explicitly do not apply SAC for the cost critic, as exploration should focus on the task to be learned rather than on accumulated costs. In order to regress the joint task critic, a similar approach to existing AC approaches is followed. Querying $a'^{(i)}$ identically to Equation 15, we set

$$p = r^{(i)}\!\left(s^{(i)}, \underline{a}, s'^{(i)}\right) + \gamma\left(1 - d\right)\left(\min_{j=1,2} Q^{(i)}_{\mathrm{int}, j}\!\left(s'^{(i)}, \underline{a}'\right) + p_{\mathrm{SAC}}\right), \quad \underline{a}'^{(-i)} \sim \underline{\pi}^{(i)}_{\mathrm{br}, \Xi}\!\left(s'^{(i)}, a'^{(i)}\right), \quad p_{\mathrm{SAC}} = \begin{cases} -\alpha \log \pi^{(i)}\!\left(a'^{(i)} \mid s'^{(i)}\right) & \text{for an SAC model} \\ 0 & \text{else,} \end{cases} \tag{16}$$

in Equation 8.
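To illustrate how the two targets differ, the following sketch computes Equations 15 and 16 for a sampled batch. All interfaces (per-agent policy, br policy, twin target critics, and an optional log-probability method for the SAC variant) are assumptions for illustration and do not reflect a reference implementation.

```python
import torch

def native_cost_target(c_hat, d, q_nat_targets, s_next, policy, gamma=0.99):
    """Target p of Equation 15: pessimistic (max) backup of the agent-specific costs."""
    with torch.no_grad():
        a_next = policy(s_next)                               # a'^(i) ~ pi^(i)(s'^(i))
        q1, q2 = (q(s_next, a_next) for q in q_nat_targets)   # twin native target critics
        return c_hat + gamma * (1.0 - d) * torch.max(q1, q2)

def joint_task_target(r, d, q_int_targets, s_next, policy, br_policy,
                      gamma=0.99, alpha=None):
    """Target p of Equation 16: min-backup of the task critic, with the other agents
    replaced by their predicted best responses to the agent's own next action."""
    with torch.no_grad():
        a_next = policy(s_next)                     # own next action, as in Equation 15
        a_others = br_policy(s_next, a_next)        # predicted br actions of the others
        a_joint = torch.cat([a_next, a_others], dim=-1)
        q1, q2 = (q(s_next, a_joint) for q in q_int_targets)
        p_sac = 0.0
        if alpha is not None:                       # SAC variant: entropy bonus, assuming
            p_sac = -alpha * policy.log_prob(s_next, a_next)   # a log_prob interface
        return r + gamma * (1.0 - d) * (torch.min(q1, q2) + p_sac)
```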

For brevity, Algorithm 1 only sketches the update, i.e., learning, step of our proposed MARL approach. During the exploration phase, each agent samples from its own policy and stores the actions and rewards of the other agents. In other words, each agent stores a tuple $\left(s^{(i)}, \underline{a}, \underline{r}, d, s'^{(i)}\right)$ in the replay buffer $\mathcal{D}^{(i)}$ until the end of an episode is reached or the task is completed. After collecting new data for a fixed number of episodes, Algorithm 1 is run for a fixed number of update steps.
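A minimal sketch of the cost estimate GetCost from Equation 14, operating on the rewards stored in $\mathcal{D}^{(i)}$, could look as follows; the buffer layout is an assumption on our side.

```python
import torch

def get_cost(rewards: torch.Tensor, agent_idx: int) -> torch.Tensor:
    """Per-step native cost estimate GetCost following Equation 14.

    rewards: tensor of shape (batch, N_A) holding the stored rewards of all agents
             for the transitions sampled from agent i's replay buffer D^(i).
    Returns a non-negative cost: the amount by which agent i's reward falls below
    the team average; agents at or above the average incur no estimated cost."""
    team_avg = rewards.mean(dim=-1)              # (1/N_A) * sum_j r^(j)
    own = rewards[..., agent_idx]                # r^(i)
    return torch.clamp(team_avg - own, min=0.0)  # equals -min(r^(i) - avg, 0)
```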

Finally, we distinguish between two interaction schemes in order to model $\underline{\pi}^{(i)}_{\mathrm{br}, \Xi}$. First, the other agents can be modeled as an unknown black-box system, usually denoted as a game against nature within game theory. Thus, a single policy is tracked as follows:

$$\underline{\pi}^{(i)}_{\mathrm{br}, \Xi} := S^{(i)} \times \mathcal{A}^{(i)} \to \underline{\mathcal{A}}^{(j \neq i)}, \tag{17}$$

which models an interaction with the current agent and the responsive nature. Our second approach uses a dyadic interaction scheme and models the br policy of each agent to the current agent individually via Eq. 18.

$$\underline{\pi}^{(i)}_{\mathrm{br}, \Xi} \equiv \left(\pi^{(j)}_{\mathrm{br}, \Xi} := S^{(i)} \times \mathcal{A}^{(i)} \to \mathcal{A}^{(j)}\right)_{j \neq i}. \tag{18}$$

This requires regressing $N_A - 1$ opponent policies rather than the single policy in Equation 17. Both behavior models neglect mutual interactions among the other agents in order to avoid a combinatorial explosion.
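The two behavior models only differ in the shape of the regressed br mapping, which the following sketch illustrates with two network heads (our own illustration; layer sizes and activations are arbitrary assumptions).

```python
import torch
import torch.nn as nn

class GameAgainstNatureBR(nn.Module):
    """Single br policy of Equation 17: (s^(i), a^(i)) -> joint action of all other agents."""
    def __init__(self, obs_dim, act_dim, n_agents, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim * (n_agents - 1)), nn.Tanh())

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

class DyadicBR(nn.Module):
    """N_A - 1 separate br policies of Equation 18, one head per opponent."""
    def __init__(self, obs_dim, act_dim, n_agents, hidden=128):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, act_dim), nn.Tanh())
            for _ in range(n_agents - 1))

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return torch.cat([head(x) for head in self.heads], dim=-1)
```

Both heads condition only on the agent's own observation and action, which is what allows the learning step to remain fully decentralized.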

5 Results

In this section, we evaluate the performance of our decentralized br policy MARL framework within the simulation environment described in Appendix 1. Within this environment, we compare our algorithm against state-of-the-art MARL algorithms, namely, MADDPG and the multi-agent soft actor–critic (MASAC).

Given our adjusted multi-agent particle environment (MPE) as outlined in Appendix 1, we ran a decentralized version of TD3 (Ackermann et al., 2019) and the multi-agent version of SAC (Haarnoja, 2018) for the joint critic in our algorithm. Given the dyadic and game-against-nature variants, we use the following notations:

• The state-of-the-art algorithms are directly denoted as commonly known in the literature, i.e., MADDPG and MASAC.

• Our extension of TD3 is denoted as br-TD3-dyad/nature.

• Our extension of SAC is denoted as br-SAC-dyad/nature.

As stated above, our main emphasis is set on improving the performance in sparsely rewarded environments. Furthermore, we explicitly tailor our approach to continuous action spaces in cooperative settings. Therefore, we applied the cooperative collection task, according to the parameterization in Appendix 2.

For brevity, we present the evaluation performance of the individual algorithms based on the average rewards of all agents in Figure 3. Here, the term evaluation refers to the agents greedily following their current policies rather than drawing samples from them. The collected results show a static version, i.e., using fixed goal locations, in the lower figure and a non-static version in the upper figure. In this environment, three agents update their policies over 5,000 exploration episodes. The evaluation is only run every 10th episode, so the number of evaluation steps is lower than the actual number of exploration episodes. In addition, the averaged reward per evaluation run is logged, which, in turn, strongly depends on the randomly sampled starting state of the agents and the goal locations in the non-static environment. As a result, the collected data encounter high noise, which is reduced by smoothing the collected reward values using a Savitzky–Golay filter (Savitzky and Golay, 1964) and the implementation by Virtanen et al. (2020).


Figure 3. Results of the decentralized br-based algorithms for the cooperative collection task using sparse rewards. The figures present the averaged rewards of all agents over eight learning runs per algorithm and environment. The shaded areas highlight a confidence interval of 70%. The upper figure shows the performance of the collection task with static goal locations, whereas the environment on the bottom samples new goal locations upon every reset. The x-axis denotes the evaluation steps, which are run after every 10 exploration episodes to evaluate the current performance.

As can be seen, our algorithms outperform the current state-of-the-art algorithms not only in terms of final performance but also in terms of convergence speed for both scenarios. Unsurprisingly, our method performs best in the static environment, where the agents have to reach fixed goal locations. In this scenario, there is a direct relation between the agent states and the actual value functions, which leads to an improved learning speed.

For a closer evaluation of our presented algorithms, the per-agent reward metrics are listed in Table 1. Furthermore, the total number of successful trials per algorithm during exploration and evaluation is listed. Exploration is not only run considerably more often but also contains twice as many steps per run. As a consequence, the number of successful exploration runs is considerably higher compared to the evaluation numbers.


Table 1. Detailed performance metrics for the evaluated environments. The results of the static environment are listed at the bottom. The values show the averaged results with the optional standard deviation appended via ±. The terms dyadic and nature are abbreviated by their first letter for brevity. Similarly, the numbers of successful trials during the exploration and evaluation runs are denoted as $d_{\mathrm{ex}}$ and $d_{\mathrm{ev}}$.

Nonetheless, the collected numbers underline that our presented method clearly outperforms current state-of-the-art methods, not only in terms of averaged accumulated rewards, as shown in Figure 3, but also for each individual agent involved. The performance increase becomes evident when comparing the success rates of the algorithms, where MASAC failed completely to find a successful policy.

Comparing the overall results, the TD3 agents outperformed not only the state-of-the-art methods but also our SAC variants. Furthermore, the dyadic setup resulted in improved performance across all evaluation metrics compared to the game-against-nature scheme. This confirms our initial statement that it is preferable to handle interactions individually rather than regressing interaction schemes fully from an NN.

Regarding the standard deviations of our proposed methods, it also becomes evident that our methods suffer from higher variance in the accumulated rewards. Even though this may seem like a disadvantage compared to the existing algorithms, it has to be noted that the reward obtained from a successful episode usually differs markedly from the reward obtained from an unsuccessful episode. Therefore, the increased variance is not inherent to our method but a consequence of the increased success rates, as highlighted in Table 1, where our algorithm achieves considerably higher success rates than the existing AC methods.

In summary, it can be stated that our presented algorithm outperforms the existing methods within our simulated environments even though it is run fully decentralized.

6 Discussion and technical extensions

Given our presented algorithm, we conclude this article with an outline of possible extensions in order to further improve the overall performance.

6.1 Applying best-response policies to competitive environments

A current drawback of our algorithm is its restriction to cooperative domains. If the agents have access to all reward values during learning, an additional critic for the objective of the other agent can be added to the presented algorithm. This results in applying the gradient step for the br policy not only over the joint task critic of the current agent but also over the agent-specific critic of the other agent. If this extension is applied, applying the dyadic interaction scheme from above is strongly recommended, as our algorithm is otherwise restricted to optimizing the average reward over all agents.

Another extension is given by modeling non-cooperative agent(s). In order to model this procedure, it is best to condition non-cooperative agents on the joint team policy of all cooperative agents, thus leading to the conditional interaction policy as follows:

$$\underline{\pi}\!\left(\underline{s}\right) \approx \underline{\pi}^{(-i, -i)}\!\left(s^{(i)} \mid \underline{a}^{(-i, i)}, a^{(i)}\right)\, \underline{\pi}^{(-i, i)}\!\left(s^{(i)} \mid a^{(i)}\right)\, \underline{\pi}^{(i)}\!\left(s^{(i)}\right). \tag{19}$$

In Equation 19, the cooperative policy is denoted as $\underline{\pi}^{(-i, i)}\!\left(s^{(i)} \mid a^{(i)}\right)$ and the non-cooperative policy as $\underline{\pi}^{(-i, -i)}\!\left(s^{(i)} \mid \underline{a}^{(-i, i)}, a^{(i)}\right)$. Alternatively, on-policy approaches, such as proximal policy optimization (PPO) or trust region policy optimization, are worth investigating to model the behavior of other agents. Here, a direct approach is given by conditioning the policy estimate on the average over all estimators. A more promising approach would be averaging over all agent advantages and applying a gradient step accordingly. This bears the potential of stabilizing the estimated opponent models and, thus, the overall task-critic updates, which eventually increases the likelihood of converging to the team-optimal policy. Furthermore, there is great potential in incorporating the findings of Jaques et al. (2019) and, thus, not only regressing the agents’ policies but also directly conditioning the reward of each agent on their mutual influence.

6.2 Improving convergence behavior by partially centralized learning

The presented method fully decouples learning by regressing opponent models without applying centralized learning schemes. This risks divergent agent behavior and thus convergence to suboptimal team behavior. Therefore, our current method could be further enhanced by introducing partially centralized learning without adding restrictive full-observability assumptions. Rather than sharing the full observations, the individual opponent policy predictions can be shared during learning such that the policy gradient can be conditioned on the Kullback–Leibler (KL) divergence of the predicted opponent policies:

$$\nabla J\!\left(\pi^{(j)}_{\mathrm{br}, \Xi}\right) = \mathbb{E}_{s^{(i)}, a^{(i)}}\!\left[\nabla_{a^{(j)}} Q^{(i)}_{\mathrm{int}}\!\left(s^{(i)}, \underline{a}\right) - \nabla_{a^{(j)}} \frac{1}{N_A - 2} \sum_{\substack{k=1 \\ k \neq j,\, k \neq i}}^{N_A} D_{\mathrm{KL}}\!\left(\pi^{(j)} \,\Vert\, \pi^{(k)}\right)\right], \quad a^{(j)} \sim \pi^{(j)}_{\mathrm{br}, \Xi}\!\left(s^{(i)}, a^{(i)}\right). \tag{20}$$

Finally, a baseline policy, e.g., obtained via behavioral cloning, can be substituted into Equation 20 to further stabilize the gradients.
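As an illustration of the KL-based coupling in Equation 20, the snippet below averages the KL divergence between agent $i$'s prediction of agent $j$ and the predictions of the same agent shared by the remaining agents, assuming Gaussian policy heads; it sketches the regularizer only, not the full gradient step.

```python
import torch
from torch.distributions import Normal, kl_divergence

def br_kl_regularizer(own_pred: Normal, shared_preds: list) -> torch.Tensor:
    """Average KL divergence between agent i's br prediction of agent j (own_pred)
    and the predictions of the same agent j shared by the remaining agents
    (shared_preds), as used to regularize the br-policy gradient in Equation 20."""
    if not shared_preds:
        return torch.tensor(0.0)
    kls = [kl_divergence(own_pred, other).sum(dim=-1) for other in shared_preds]
    return torch.stack(kls).mean(dim=0)
```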

6.3 Multi-robot hierarchical actor critic

Recalling the motivation of our algorithm in Section 4, a major advantage of our algorithm is the possibility of applying asynchronous actions. As this directly allows extending AC for MARL to HRL, we outline a conceptual extension of our approach tailored to RL tasks with sparse rewards. This extension relies on a collection of assumptions, which are also made in Levy et al. (2019):

• There exists an agent-specific state space $X^{(i)} \subseteq S^{(i)}$, and $x^{(i)} \subseteq s^{(i)}$ always holds1.

• There exist deterministic mapping functions $F_g := X^{(i)} \times \mathcal{A}^{(i)\{p\}} \to G^{(i)\{p-1\}} \subseteq X^{(i)}$ and $F_g^{-1} := X^{(i)} \times G^{(i)\{p-1\}} \to \mathcal{A}^{(i)\{p\}}$, which map the actions of the upper layer to the goal space of the lower layer and vice versa for each agent.

• There exists a deterministic evaluation metric $J^{\{p\}} := G \times X^{(i)\{p\}} \to \{0, 1\}$ that evaluates the achievement of a goal $g^{(i)\{p\}}$, given the current agent state $x^{(i)}$.

This differs from the original assumption by Levy et al. (2019) because we propose to explicitly distinguish between the internal agent state x(i) and the full environment observation s(i).

We claim that within multi-agent HRL, it is specifically beneficial to distinguish between internal and external observations. Therefore, we propose to use structured observations as follows:

$$s^{(i)} := \left(x, y_e, y^{(-i)}\right)^{(i)}, \quad i \in \mathbb{N}_A. \tag{21}$$

In Equation 21, $x^{(i)}$ reflects the internal state of an agent, e.g., its current position and velocity, $y_e^{(i)}$ denotes environmental observations from $A^{(i)}$’s perspective, e.g., images or laser range data, and $y^{(-i)(i)}$ denotes the observations of the other agents.
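For instance, the structured observation of Equation 21 can be represented as a simple per-agent container (an illustrative sketch; the field names are our own).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StructuredObservation:
    """Structured per-agent observation s^(i) following Equation 21."""
    x: np.ndarray         # internal agent state, e.g., position and velocity
    y_env: np.ndarray     # environmental observation from the agent's perspective
    y_others: np.ndarray  # observations of the other agents

    def flat(self) -> np.ndarray:
        """Concatenate the components whenever a flat vector s^(i) is required."""
        return np.concatenate([self.x, self.y_env, self.y_others])
```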

Given this representation, we propose a two-layered hierarchy where the upper layer proposes sub-goals to the lower-layer agents. This lower level is denoted as the environment layer, $p^{\{e\}}$, in the following, while the upper layer is denoted as the team-coordination layer, $p^{\{i\}}$. On the lowest layer, our decentralized MARL approach based on br policies from Section 4.1 can be applied using a dyadic interaction scheme.

Therefore, the individual components per agent are given as follows:

• A joint task critic $Q^{(i)}_{\mathrm{int}}\!\left(s^{(i)}, \underline{a}\right)$.

• A native hierarchical critic $Q^{(i)}_{\mathrm{nat}}\!\left(x^{(i)}, y_e^{(i)}, g^{(i)}\right)$.

• A goal-conditioned action policy for the current agent $\pi^{(i)}\!\left(s^{(i)}, g^{(i)}\right)$.

• The dyadic br policies $\left(\pi^{(j)}_{\mathrm{br}}\!\left(x^{(i)}, y_e^{(i)}, y^{(-i)(i)}, a^{(i)}\right)\right)_{j \neq i}$.

As the individual agents are provided with a sub-task that is to be reached by the dedicated agent alone, the hierarchical native critic preferably drops the dependency on the observations of other agents. In other words, this native hierarchical critic evaluates the hierarchically imposed rewards instead of estimating the current step cost from the deviation with regard to the average reward. As a result, the update step of the lower layer follows Algorithm 1 but replaces the difference in line 9 by an average over the two critics. Furthermore, the native critic not only evaluates the environmental task success $d$ but also evaluates whether the update step has accomplished the current sub-goal. As the rest remains identical to Algorithm 1 and Equations 15 and 16, we omit repeating the same equations.

In contrast, the upper layer only tracks a single critic, as infeasible sub-goals result in unpredictable task performance. Unfortunately, the agents do not have access to the goal mapping of other agents, such that it is impossible to directly incorporate their policies or higher-level actions into the critic within decentralized settings. Furthermore, the upper layer usually suffers from asynchronous decisions, which would require adding the decision epochs to the critic’s state to allow sampling from the experience buffer. Therefore, we propose to apply an observation oracle instead of a br policy,

$$\pi^{\mathrm{br}(j)\{i\}} := X \times Y_{\mathrm{env}} \times Y^{(j)} \times \left(a^{\{i\}}\right)_{i \in \mathbb{N}_A} \to Y^{(j)}, \tag{22}$$

i.e., instead of predicting the agent action on the upper layer, the next observation is predicted. In case a (partially) centralized learning scheme is applied, this observation oracle can also be replaced with the following:

$$\pi^{\mathrm{br}(j)\{i\}} := X \times Y_{\mathrm{env}} \times X^{(j)} \times \left(a^{\{i\}}\right)_{i \in \mathbb{N}_A} \to X^{(j)}, \tag{23}$$

thus predicting the next internal state of agent $j$. These opponent models from Equations 22 and 23 allow using data from an experience buffer independently of the higher-level policies or decision epochs during execution. As a result, the (interactive) task critics of the upper layer are regressed according to Equation 24 or Equation 25:

$$Q^{(j)\{i\}}_{\mathrm{int}} := S^{(i)} \times \mathcal{A}^{(i)} \times \left(Y^{(j)}\right)^{(i)}_{j \neq i} \to \mathbb{R} \tag{24}$$

$$Q^{(j)\{i\}}_{\mathrm{int}} := \underline{S} \times \mathcal{A}^{(i)} \times \left(X^{(j)}\right)^{(i)}_{j \neq i} \to \mathbb{R} \tag{25}$$

While the lower layer is updated similarly to Algorithm 1, the upper-layer critic update is given by calculating the following:

$$p = \frac{1}{T^{\{i\}}_{\max}} \sum_{n=1}^{T^{\{i\}}_{\max}} \gamma^{T^{\{i\}}_{\max} - n}\, r^{(i)}\!\left(s_{t+n}, \underline{a}_{t+n}, s_{t+n+1}\right) + \left(1 - d\right) \begin{cases} 0 & \text{if } J\!\left(x'^{(i)}, F_g\!\left(x^{(i)}, a^{(i)\{i\}}\right)\right) \\ -\left(T_{\max} - n\right) & \text{if } \exists n: J\!\left(x^{(i)}_{t+n+1}, F_g\!\left(x^{(i)}, a^{(i)\{i\}}\right)\right) \\ -T_{\max} & \text{else} \end{cases} + \gamma\left(1 - d\right) \min_{k=1,2} Q^{\{i\}}_{\mathrm{int}, k}\!\left(s'^{(i)}, a'^{(i)\{i\}}, \left(y'^{(j)}\right)^{(i)}_{j \neq i}\right), \tag{26}$$

where $T^{\{i\}}_{\max}$ denotes the maximum number of sub-steps for a hierarchical transition and $s' := s_{t + T^{\{i\}}_{\max}}$ represents the observation of agent $i$ after a hierarchical update step. The actions and observations for the target critic in Equation 26 are obtained via the proposed policies from Equation 27:

$$a'^{(i)\{i\}} \sim \pi^{(i)\{i\}}_{\Pi}\!\left(s'^{(i)}\right), \quad y'^{(j)} \sim \pi^{\mathrm{br}(j)\{i\}}_{\Xi}\!\left(s'^{(i)}, a'^{(i)\{i\}}\right), \quad j \neq i. \tag{27}$$

The first term in Equation 26 averages the environmental reward, while the second adds the hierarchical penalty term depending on whether the lower layer could achieve the current action or the respective sub-goal. Eventually, the value function is approximated via querying the current higher-level policy and predicting the observations of the other agents.

7 Conclusion

In this article, we have proposed a novel MARL framework that allows decentralized learning while differentiating between agent-based native costs and joint task rewards. In order to regress the team-optimal policy, each agent not only updates its own policy but also models the team-optimal response by estimating the br policies of the other agents. We propose to employ concepts from game theory, namely, either applying dyadic br policies for each agent pair or representing the collection of other agents as a game against nature. Even though our method relies on estimates of the agent-based costs, it outperforms recent state-of-the-art methods in terms of convergence speed within sparsely rewarded environments.

Given the promising results collected in simulation, this article outlines a variety of extensions that bear great potential for future research. First, an extension to competitive domains, i.e., zero-sum games, has been sketched by minimizing instead of maximizing the agent task critic when updating the other agents’ policies. Second, we outlined that sharing the predicted br policies can improve convergence. Finally, we sketched an extension of our method to a hierarchical MARL algorithm, as this may allow bootstrapping performance during learning.

Using such a hierarchical MARL algorithm combined with structured environment observations bears the additional advantage of explicitly incorporating available model knowledge. This allows moving beyond full end-to-end learning by combining MARL with optimization-based techniques, which remains a rarely covered field of research.

Data availability statement

The original contributions presented in the study are publicly available. This data can be found here: https://gitlab.com/vg_tum/multi-agentgym.git; https://gitlab.com/vg_tum/mahac_rl.git.

Author contributions

VG proposed, implemented, and outlined the methods presented in the article, performed the experiments, and evaluated the collected evidence. VG and DW verified the approach. All authors contributed to the article and approved the submitted version.

Funding

The research leading to the results presented in this work received funding from the Horizon 2020 research and innovation programme under grant agreement No. 820742 of the project “HR-Recycler—Hybrid Human–Robot RECYcling plant for electriCal and eLEctRonic equipment.”

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Footnotes

1The assumption $x^{(i)} \subseteq s^{(i)}$ emphasizes that we do not expect full state observability and that the intrinsic observations do not provide additional knowledge to the robotic agents.

2Source code: https://gitlab.com/vg_tum/multi-agent-gym.git

3Source code: https://gitlab.com/vg_tum/mahac_rl.git

4Various implementations available online realize the action input as the difference of two positive force terms as this eases the comparison to discrete action spaces, where the result equals learning an optimal bang–bang controller. As our framework explicitly highlights continuous applications, we kept this implementation for the comparison to the state-of-the-art methods but used the interfaces from Eqs 28, 29 for our method.

References

Ackermann, J., Gabler, V., Osa, T., and Sugiyama, M. (2019). Reducing overestimation bias in multi-agent domains using double centralized critics. CoRR abs/1910.01465 Available at: https://arxiv.org/abs/1910.01465.

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34, 26–38. doi:10.1109/MSP.2017.2743240

Bellman, R. (1957). A markovian decision process. J. Math. Mech. 6, 679–684. doi:10.1512/iumj.1957.6.56038

Englert, P., and Toussaint, M. (2016). “Combined optimization and reinforcement learning for manipulation skills,” in Robotics: science and systems (RSS). Editors D. Hsu, N. M. Amato, S. Berman, and S. A. Jacobs doi:10.15607/RSS.2016.XII.033

Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. CoRR abs/1605.06676 Available at: https://arxiv.org/abs/1605.06676.

Foerster, J. N., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. (2018). “Counterfactual multi-agent policy gradients,” in AAAI conference on artificial intelligence. Editors S. A. McIlraith, and K. Q. Weinberger (New Orleans, LA: AAAI Press), 2974–2982. Available at: https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17193.

Haarnoja, T. (2018). Acquiring diverse robot skills via maximum entropy deep reinforcement learning. Berkeley, USA: University of California. Ph.D. thesis.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., et al. (2018). Soft actor-critic algorithms and applications. CoRR abs/1812.05905 https://arxiv.org/abs/1812.05905.

Havrylov, S., and Titov, I. (2017). Emergence of language with multi-agent games: learning to communicate with sequences of symbols. CoRR abs/1705.11192 Available at: https://arxiv.org/abs/1705.11192.

He, Z., Dong, L., Song, C., and Sun, C. (2021). Multi-agent soft actor-critic based hybrid motion planner for mobile robots. CoRR abs/2112.06594 Available at: https://arxiv.org/abs/2112.06594.

Hernandez-Leal, P., Kartal, B., and Taylor, M. E. (2019). A survey and critique of multiagent deep reinforcement learning. Auton. Agents Multi Agent Syst. 33, 750–797. doi:10.1007/s10458-019-09421-1

Hernandez-Leal, P., Kartal, B., and Taylor, M. E. (2020). A very condensed survey and critique of multiagent deep reinforcement learning. AAMAS. Editor A. E. F. Seghrouchni, G. Sukthankar, B. An, and N. Yorke-Smith (International Foundation for Autonomous Agents and Multiagent Systems), 2146–2148. Available at: https://dl.acm.org/doi/10.5555/3398761.3399105.

Iqbal, S., and Sha, F. (2019). “Actor-attention-critic for multi-agent reinforcement learning,” in International conference on machine learning (ICML), (PMLR), vol. 97 of proceedings of machine learning research. Editors K. Chaudhuri, and R. Salakhutdinov, 2961–2970.

Jaques, N., Lazaridou, A., Hughes, E., Gülçehre, Ç., Ortega, P. A., Strouse, D., et al. (2019). “Social influence as intrinsic motivation for multi-agent deep reinforcement learning,” in Proceedings of the 36th international conference on machine learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, vol. 97 of proceedings of machine learning research. Editors K. Chaudhuri, and R. Salakhutdinov (PMLR), 3040–3049.

Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: a survey. J. Artif. Intell. Res. 4, 237–285. doi:10.1613/jair.301

Kingma, D. P., and Ba, J. (2015). “Adam: a method for stochastic optimization,” in International conference on learning representations (ICLR). Editors Y. Bengio, and Y. LeCun

Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: a survey. J. Artif. Intell. Res. 32, 1238–1274. doi:10.1177/0278364913495721

Kolter, J. Z., and Ng, A. Y. (2009). “Policy search via the signed derivative,” in Robotics: science and systems (RSS). doi:10.15607/R_C_RSS.2009.V.027

Laurent, G. J., Matignon, L., and Fort-Piat, N. L. (2011). The world of independent learners is not markovian. Int. J. Knowl. Based Intell. Eng. Syst. 15, 55–64. doi:10.3233/KES-2010-0206

Levy, A., Konidaris, G., and Saenko, K. (2019). “Learning multi-level hierarchies with hindsight,” in International Conference on Learning Representations (ICLR), Louisiana, United States, May 6 - May 9, 2019.

Li, S., Wu, Y., Cui, X., Dong, H., Fang, F., and Russell, S. J. (2019). “Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient,” in AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, January 27 – February 1, 2019, 4213–4220.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., et al. (2016). “Continuous control with deep reinforcement learning,” in International Conference on Learning Representations (ICLR), Puerto, Rico, May 2-4, 2016. Available at: http://arxiv.org/abs/1509.02971.

Littman, M. L. (1994). “Markov games as a framework for multi-agent reinforcement learning,” in International Conference on Machine Learning (ICML), New York, New York, USA, 20-22 June 2016, 157–163. doi:10.1016/b978-1-55860-335-6.50027-1

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. (2017). “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Annual Conference on Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, December 4-9, 2017, 6379–6390. Available at: http://papers.nips.cc/paper/7217-multi-agent-actor-critic-for-mixed-cooperative-competitive-environments.

McCloskey, M., and Cohen, N. J. (1989). “Catastrophic interference in connectionist networks: the sequential learning problem,” in Psychology of learning and motivation vol. 24 (Amsterdam, Netherlands: Elsevier), 109–165.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533. doi:10.1038/nature14236

Mordatch, I., and Abbeel, P. (2017). Emergence of grounded compositional language in multi-agent populations. CoRR abs/1703.04908 Available at: https://arxiv.org/abs/1703.04908.

Nash, J. (1950). Equilibrium points in N-person games. Proc. Natl. Acad. Sci. 36, 48–49. doi:10.1073/pnas.36.1.48

Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., et al. (2004). “Autonomous inverted helicopter flight via reinforcement learning,” in International Symposium on Experimental Robotics (ISER), Singapore, June 18–21, 2004. January 2006, 363–372. doi:10.1007/11552246_35

Nguyen, T. T., Nguyen, N. D., and Nahavandi, S. (2020). Deep reinforcement learning for multiagent systems: a review of challenges, solutions, and applications. IEEE Trans. Cybern. 50, 3826–3839. doi:10.1109/TCYB.2020.2977374

Savitzky, A., and Golay, M. J. (1964). Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36, 1627–1639. doi:10.1021/ac60214a047

Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. (2015). “Trust region policy optimization,” in International conference on machine learning (ICML) vol. 37 of JMLR workshop and conference proceedings. Editors F. R. Bach, and D. M. Blei, 1889–1897. JMLR.org.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. CoRR abs/1707.06347 Available at: https://arxiv.org/abs/1707.06347.

Shapley, L. S. (1952). A value for N-person games. Santa Monica, CA: RAND Corporation. doi:10.7249/P0295

Shapley, L. S. (1953). Stochastic games. Proc. Natl. Acad. Sci. 39, 1095–1100. doi:10.1073/pnas.39.10.1953

Sheikh, H. U., and Bölöni, L. (2020). “Multi-agent reinforcement learning for problems with combined individual and team reward,” in IEEE International Joint Conference on Neural Networks (IJCNN) (IEEE), Glasgow, United Kingdom, July 19-24, 2020, 1–8. doi:10.1109/IJCNN48605.2020.9206879

Shoham, Y., and Leyton-Brown, K. (2008). Multiagent systems: algorithmic, game-theoretic, and logical foundations. New York, NY, USA: Cambridge University Press.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–489. doi:10.1038/nature16961

Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. A. (2014). “Deterministic policy gradient algorithms,” in International Conference on Machine Learning (ICML), Beijing, China, June 21–June 26, 2014, 387–395. Available at: http://proceedings.mlr.press/v32/silver14.html.

Song, H. F., Abdolmaleki, A., Springenberg, J. T., Clark, A., Soyer, H., Rae, J. W., et al. (2020). “V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control,” in International Conference on Learning Representations (ICLR) (OpenReview.net), Addis Ababa, Ethiopia, April 26-30, 2020.

Sun, Y., Lai, J., Cao, L., Chen, X., Xu, Z., and Xu, Y. (2020). A novel multi-agent parallel-critic network architecture for cooperative-competitive reinforcement learning. IEEE Access 8, 135605–135616. doi:10.1109/ACCESS.2020.3011670

Sutton, R. S., McAllester, D. A., Singh, S., and Mansour, Y. (1999a). “Policy gradient methods for reinforcement learning with function approximation,” in Annual conference on neural information processing systems (NeurIPS). Editors S. A. Solla, T. K. Leen, and K. Müller (Massachusetts, United States: The MIT Press), 1057–1063.

Sutton, R. S., Precup, D., and Singh, S. (1999b). Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artif. Intell. 112, 181–211. doi:10.1016/S0004-3702(99)00052-1

Tan, M. (1993). “Multi-agent reinforcement learning: independent versus cooperative agents,” in International conference on machine learning (ICML). Editor P. E. Utgoff (Massachusetts, United States: Morgan Kaufmann), 330–337. doi:10.1016/b978-1-55860-307-3.50049-6

Tang, H., Wang, A., Xue, F., Yang, J., and Cao, Y. (2021). A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation. IEEE Access 9, 42568–42582. doi:10.1109/ACCESS.2021.3062457

Tian, Z., Wen, Y., Gong, Z., Punakkath, F., Zou, S., and Wang, J. (2019). “A regularized opponent model with maximum entropy objective,” in Proceedings of the twenty-eighth international joint conference on artificial intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. Editor S. Kraus, 602–608. ijcai.org. doi:10.24963/ijcai.2019/85

van der Wal, J. (1980). Stochastic dynamic programming. Amsterdam, Netherlands: Mathematisch Centrum. Ph.D. thesis.

van Hasselt, H. (2010). “Double Q-learning,” in Annual conference on neural information processing systems (NeurIPS). Editors J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (New York, United States: Curran Associates, Inc.), 2613–2621.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354. doi:10.1038/s41586-019-1724-z

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., et al. (2020). SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272. doi:10.1038/s41592-019-0686-2

Wei, E., Wicke, D., Freelan, D., and Luke, S. (2018). “Multiagent soft Q-learning,” in AAAI Spring Symposia (AAAI Press), California, USA, March 26-28, 2018.

Wu, X., Li, X., Li, J., Ching, P. C., Leung, V. C. M., and Poor, H. V. (2021). Caching transient content for IoT sensing: multi-agent soft actor-critic. IEEE Trans. Commun. 69, 5886–5901. doi:10.1109/TCOMM.2021.3086535

Yang, Y., and Wang, J. (2020). An overview of multi-agent reinforcement learning from game theoretical perspective. CoRR abs/2011.00583 Available at: https://arxiv.org/abs/2011.00583.

Zhang, K., Yang, Z., and Başar, T. (2021). Multi-agent reinforcement learning: a selective overview of theories and algorithms. Handb. Reinf. Learn. Control, 321–384. doi:10.1007/978-3-030-60990-0_12

Zhang, Q., Dong, H., and Pan, W. (2020). “Lyapunov-based reinforcement learning for decentralized multi-agent control,” in International conference on distributed artificial intelligence (DAI) vol. 12547 of lecture notes in computer science. Editors M. E. Taylor, Y. Yu, E. Elkind, and Y. Gao (Berlin, Germany: Springer), 55–68. doi:10.1007/978-3-030-64096-5_5

Appendix 1

The proposed algorithm has been evaluated on the MPE, which has been extended from previous work (Lowe et al., 2017; Mordatch and Abbeel, 2017) to fit the scope of this article. The source code of the benchmark scenarios2 and of the presented work3 is available online, while the hyper-parameters and further implementation details leading to the presented results are listed in Appendix 2.

In order to meet the assumptions stated in Section 4, the original simulation environment has been adjusted such that the agents are able to differentiate between internal, external agent-related, and external environment-based observations. Thus, we introduce structured observations for our adjusted version of the MPE. Furthermore, the goal-mapping and evaluation metrics stated in Section 4 have been handcrafted and embedded into the dedicated environments, similarly to the original work of Levy et al. (2019). As claimed in Section 4, our approach tackles cooperative multi-robot RL tasks, such that only environments with purely continuous action spaces have been tested. Before outlining the collected experimental findings, we highlight the adjustments made to the default gym environment and the MPE.
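
To make the notion of a structured observation concrete, the following minimal sketch shows one possible container for the three observation parts. The class and field names (StructuredObservation, internal, agents, env) are illustrative assumptions and do not correspond to identifiers in our published implementation.

from dataclasses import dataclass

import numpy as np


@dataclass
class StructuredObservation:
    """Per-agent observation, split into the three parts described above."""

    internal: np.ndarray  # own point-mass state (position and velocity)
    agents: np.ndarray    # external, agent-related observations (other agents)
    env: np.ndarray       # external, environment-based observations (goals, obstacles)

    def flatten(self) -> np.ndarray:
        """Concatenate all parts, e.g., to feed a standard MLP policy."""
        return np.concatenate(
            [self.internal.ravel(), self.agents.ravel(), self.env.ravel()]
        )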

A.1 Structured observations in the multi-agent particle environment

In order to ease the implementation of the presented extensions in Section 6, our adjusted MPE directly provides the observation of each agent in the structured form s^{(i)} := (\mathbf{x}^{(i)}, y_e^{(i)}, y_{env}^{(i)}). The MPE is characterized by N_A agents moving on a planar xy-surface by applying a force at their body center. Thus, each agent is implemented as a point mass, where the internal state and action are defined as

\mathbf{x}^{(i)} := \begin{bmatrix} x^{(i)} \\ y^{(i)} \\ \dot{x}^{(i)} \\ \dot{y}^{(i)} \end{bmatrix}, \qquad a^{(i)} := \begin{bmatrix} f_x^{(i)} \\ f_y^{(i)} \end{bmatrix},   (28)

where the action is a planar force4 actuated on the individual point-mass, which then follows the linear point-mass dynamics

\mathbf{x}^{(i)} := \begin{bmatrix} 1 & 0 & \mathrm{d}t & 0 \\ 0 & 1 & 0 & \mathrm{d}t \\ 0 & 0 & v & 0 \\ 0 & 0 & 0 & v \end{bmatrix} \begin{bmatrix} x^{(i)} \\ y^{(i)} \\ \dot{x}^{(i)} \\ \dot{y}^{(i)} \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \frac{1}{m} & 0 \\ 0 & \frac{1}{m} \end{bmatrix} a^{(i)},   (29)

using the mass of the entity m and a damping term v ∈ (0, 1) in free space. Each particle has a fixed size, i.e., a diameter of 0.05 m, which is used for collision checking with other agents and the environment. If a particle collides with an object or another agent, a simple point-mass collision is applied. Even though the exact observation highly depends on the scenario or task to be solved, all our implementations include the internal agent state in the observation of each agent.
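
As a minimal illustration of Eq. 29, the following sketch implements one discrete point-mass update; the default values for dt, m, and v are placeholders and not the parameters listed in Table A2.

import numpy as np


def point_mass_step(x, a, dt=0.1, m=1.0, v=0.95):
    """One step of the linear point-mass dynamics of Eq. 29.

    x: state [x, y, x_dot, y_dot], a: planar force [f_x, f_y],
    dt: integration step, m: entity mass, v: damping factor in (0, 1).
    """
    A = np.array([[1.0, 0.0, dt, 0.0],
                  [0.0, 1.0, 0.0, dt],
                  [0.0, 0.0, v, 0.0],
                  [0.0, 0.0, 0.0, v]])
    B = np.array([[0.0, 0.0],
                  [0.0, 0.0],
                  [1.0 / m, 0.0],
                  [0.0, 1.0 / m]])
    return A @ x + B @ a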

As our approach tackles cooperative multi-robot RL tasks in sparsely rewarded domains (Section 4), we evaluated it on the cooperative collection task in addition to cooperative navigation; there, N_A agents are expected to reach N_A goal locations. The reward signal is provided sparsely and is expressed as

r^{(i)} = \sum_{k=0}^{N_A} \begin{cases} 0 & \text{if } \mathrm{visited}(g_k) \;\vee\; \exists j : \lVert x^{(j)} - g_k \rVert_2 \leq \zeta_{g,\mathrm{MPE}}, \\ -1 & \text{else.} \end{cases}   (30)

In addition, each agent is penalized with a direct cost of −1 every time a collision with the environment or another agent occurs. For both environments, the observations of other agents y_e^{(i)} contain two-dimensional distance vectors to the other agents' center positions, while the environment observations y_{env}^{(i)} contain two-dimensional distance vectors to the goal locations and to the objects in the environment.
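
The sketch below illustrates how the sparse team reward of Eq. 30 and the agent-specific collision cost could be computed; the names collection_reward, collision_cost, and zeta_g are illustrative assumptions, and the threshold value is a placeholder rather than the parameter from Table A2.

import numpy as np


def collection_reward(agent_positions, goals, visited, zeta_g=0.05):
    """Sparse team reward of Eq. 30: 0 per visited or reached goal, -1 otherwise.

    agent_positions: (N_A, 2) array, goals: (N_G, 2) array,
    visited: (N_G,) boolean array, zeta_g: goal-reaching threshold.
    """
    reward = 0.0
    for k, g_k in enumerate(goals):
        reached = np.any(np.linalg.norm(agent_positions - g_k, axis=1) <= zeta_g)
        if not (visited[k] or reached):
            reward -= 1.0
    return reward


def collision_cost(num_collisions):
    """Agent-specific cost: -1 per collision with the environment or another agent."""
    return -1.0 * num_collisions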

Appendix 2

The hyper-parameters of the learning procedure, which are applied identically to all algorithms, are listed in Table A1.

TABLE A1. Hyper-parameters for the experimental evaluation and all evaluated algorithms.

Similarly, the (physical) parameters of the MPE are listed in Table A2.

TABLE A2. Environment parameters for the MPE and cooperative collection task.

Finally, the hyper-parameters of the individual algorithms are listed in Table A3. Our BR-based policies use identical parameters for the dyadic scheme and the game-against-nature scheme; therefore, only one column per algorithm is provided in the table below. We used Adam (Kingma and Ba, 2015) as the stochastic gradient descent (SGD) optimizer for all algorithms.

TABLE A3. Hyper-parameters for each algorithm. If an entry is left blank, the algorithm does not have this hyper-parameter. The critic parameters of our BR-based approaches have been chosen identically for the interaction critic and the native critic. Similarly, the parameters of the BR policies are identical to those of the actor policy.

The experimental results were collected on two separate computers with the following hardware and software configurations:

OS-Kernel:

• (Ubuntu) Linux-5.13.0–44

• (Ubuntu) Linux-4.15.0–187

Processor:

• Intel(R) Core(TM) i3-7100 CPU @ 3.90 GHz

• AMD Ryzen Threadripper 2990WX 32-Core

Python-version: 3.9.7

GPU-acceleration: disabled

Nomenclature

Keywords: multi-agent reinforcement learning, game theory, deep learning, artificial intelligence, actor–critic algorithm, multi-agent, Stackelberg, decentralized learning schemes, reinforcement learning

Citation: Gabler V and Wollherr D (2024) Decentralized multi-agent reinforcement learning based on best-response policies. Front. Robot. AI 11:1229026. doi: 10.3389/frobt.2024.1229026

Received: 26 May 2023; Accepted: 07 February 2024;
Published: 16 April 2024.

Edited by:

Giovanni Iacca, University of Trento, Italy

Reviewed by:

Laura Ferrarotti, Bruno Kessler Foundation (FBK), Italy
Eiji Uchibe, Advanced Telecommunications Research Institute International (ATR), Japan

Copyright © 2024 Gabler and Wollherr. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Volker Gabler, v.gabler@tum.de
