<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frai.2022.908353</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Hypothesis and Theory</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A Unifying Framework for Reinforcement Learning and Planning</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Moerland</surname> <given-names>Thomas M.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1654151/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Broekens</surname> <given-names>Joost</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Plaat</surname> <given-names>Aske</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Jonker</surname> <given-names>Catholijn M.</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1278160/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Leiden Institute of Advanced Computer Science (LIACS), Leiden University</institution>, <addr-line>Leiden</addr-line>, <country>Netherlands</country></aff>
<aff id="aff2"><sup>2</sup><institution>Interactive Intelligence, Delft University of Technology</institution>, <addr-line>Delft</addr-line>, <country>Netherlands</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Theophane Weber, DeepMind Technologies Limited, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Rosa Maria Vicari, Federal University of Rio Grande Do Sul, Brazil; Eszter Vertes, DeepMind Technologies Limited, United Kingdom</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Thomas M. Moerland  <email>t.m.moerland&#x00040;liacs.leidenuniv.nl</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence</p></fn></author-notes>
<pub-date pub-type="epub">
<day>11</day>
<month>07</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>5</volume>
<elocation-id>908353</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>03</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>06</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2022 Moerland, Broekens, Plaat and Jonker.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Moerland, Broekens, Plaat and Jonker</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license> </permissions>
<abstract>
<p>Sequential decision making, commonly formalized as optimization of a Markov Decision Process, is a key challenge in artificial intelligence. Two successful approaches to MDP optimization are <italic>reinforcement learning</italic> and <italic>planning</italic>, which both largely have their own research communities. However, if both research fields solve the same problem, then we might be able to disentangle the common factors in their solution approaches. Therefore, this paper presents a unifying algorithmic framework for reinforcement learning and planning (FRAP), which identifies underlying dimensions on which MDP planning and learning algorithms have to decide. At the end of the paper, we compare a variety of well-known planning, model-free and model-based RL algorithms along these dimensions. Altogether, the framework may help provide deeper insight in the algorithmic design space of planning and reinforcement learning.</p></abstract>
<kwd-group>
<kwd>planning</kwd>
<kwd>reinforcement learning</kwd>
<kwd>model-based reinforcement learning</kwd>
<kwd>framework</kwd>
<kwd>overview</kwd>
<kwd>synthesis</kwd>
</kwd-group>
<contract-sponsor id="cn001">Universiteit Leiden<named-content content-type="fundref-id">10.13039/501100001717</named-content></contract-sponsor>
<counts>
<fig-count count="8"/>
<table-count count="7"/>
<equation-count count="8"/>
<ref-count count="166"/>
<page-count count="25"/>
<word-count count="20372"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Sequential decision making is a key challenge in artificial intelligence (AI) research. The problem, commonly formalized as a Markov Decision Process (MDP) (Bellman, <xref ref-type="bibr" rid="B15">1954</xref>; Puterman, <xref ref-type="bibr" rid="B122">2014</xref>), has been studied in different research fields. The two prime research directions are <italic>reinforcement learning</italic> (RL) (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>), a subfield of machine learning, and <italic>planning</italic> (also known as <italic>search</italic>), of which the discrete and continuous variants have been studied in the fields of artificial intelligence (Russell and Norvig, <xref ref-type="bibr" rid="B126">2016</xref>) and control theory (Bertsekas, <xref ref-type="bibr" rid="B18">2012</xref>), respectively. Departing from different assumptions both fields have largely developed their own methodology, which has cross-pollinated in the field of <italic>model-based reinforcement learning</italic> (Sutton, <xref ref-type="bibr" rid="B146">1990</xref>; Hamrick, <xref ref-type="bibr" rid="B57">2019</xref>; Moerland et al., <xref ref-type="bibr" rid="B102">2020a</xref>; Plaat et al., <xref ref-type="bibr" rid="B117">2021</xref>).</p>
<p>However, the literature lacks a unified view of both fields, including how their approaches overlap or differ. For example, the classic AI textbook by Russell and Norvig (<xref ref-type="bibr" rid="B126">2016</xref>) discusses (heuristic) search methods in Chapters 3, 4, 10, and 11, while reinforcement learning methodology is separately discussed in Chapter 21. Similarly, the classic RL textbook by Sutton and Barto (<xref ref-type="bibr" rid="B148">2018</xref>) does discuss a variety of the topics in our framework, but never summarizes these as a single algorithmic space. Moreover, while the book does extensively discuss the relation between reinforcement learning and dynamic programming methods, it does not focus on the relation with the many other branches of planning literature. Therefore, this paper introduces a Framework for Reinforcement learning and Planning (FRAP) (<bold>Table 2</bold>), which attempts to identify the underlying algorithmic space shared by RL and MDP planning algorithms. We show that a wide range of algorithms, from Q-learning (Watkins and Dayan, <xref ref-type="bibr" rid="B161">1992</xref>) to Dynamic Programming (Bellman, <xref ref-type="bibr" rid="B15">1954</xref>) to A<sup>&#x022C6;</sup> (Hart et al., <xref ref-type="bibr" rid="B62">1968</xref>), fit the framework, each simply making different decisions on a number of its subdimensions (<bold>Table 7</bold>).</p>
<p>We need to warn experienced readers that many of the individual topics in the paper will be familiar to them. However, the main contribution of this paper is not the discussion of these ideas themselves, but the <italic>systematic structuring</italic> of these ideas into a single algorithmic space (<xref ref-type="table" rid="T8">Algorithm 1</xref>). Experienced readers may therefore skim over some sections more quickly, and only focus on the bigger integrative message. As a second contribution, we hope the paper points researchers from either field toward relevant literature from the other field, thereby stimulating cross-pollination. Third, we note that the framework is equally useful for researchers from model-free RL, since to the best of our knowledge &#x0201C;a framework for reinforcement learning&#x0201D; does not exist in the literature either (&#x0201C;a framework for planning&#x0201D; does, see Related Work). Finally, we hope the paper may also serve an educational purpose, for example for students in a university course, by putting algorithms that are often presented in different courses into a single perspective.</p>
<table-wrap position="float" id="T8">
<label>Algorithm 1</label>
<caption><p>FRAP pseudocode. In planning, there is no global solution, and the orange lines therefore disappear (and <bold>g</bold> therefore drops from all functions as well). In model-free RL there are restrictions on the blue lines: we can only select actions and next states in a single forward trace per root, which indirectly restricts the trial budget per root (to the number of target depths we reweight over within the trace, which is often set to one) and the way we set the next root (which either has to be a next state we reached within the trial or a reset to an initial state of the MDP). In model-based RL, all elements of the framework can be active.</p></caption>
<graphic xlink:href="frai-05-908353-i0001.tif"/>
</table-wrap>
<p>We also need to clearly demarcate what literature we do and do not include. First of all, planning and reinforcement learning are huge research fields, and the present paper is definitely <italic>not</italic> a systematic survey of both fields (which would likely require multiple books, not a single article). Instead, we focus on the core ideas in the joint algorithmic space and discuss characteristic, well-known algorithms to illustrate these key ideas. For the planning side of the literature, we exclusively focus on planning algorithms that search for <italic>optimal behavior</italic> in an MDP formulation, which for example excludes all non-MDP planning methods, as well as &#x0201C;planning as satisfiability&#x0201D; approaches, which attempt to verify whether a path from start to goal exists at all (Kautz et al., <xref ref-type="bibr" rid="B74">1992</xref>, <xref ref-type="bibr" rid="B73">2006</xref>). For the reinforcement learning side of the literature, we do not focus on approaches that treat the MDP formulation as a <italic>black-box optimization problem</italic>, such as evolutionary algorithms (Moriarty et al., <xref ref-type="bibr" rid="B107">1999</xref>), simulated annealing (Atiya et al., <xref ref-type="bibr" rid="B7">2003</xref>) or the cross-entropy method (Mannor et al., <xref ref-type="bibr" rid="B94">2003</xref>). While these approaches can be successful (Salimans et al., <xref ref-type="bibr" rid="B127">2017</xref>), they typically only require access to an evaluation function, and do not use MDP-specific characteristics in their solution (on which our framework is built).</p>
<p>The remainder of this article is organized as follows. After discussing Related Work (Section 2), we first formally introduce the MDP optimization setting (Section 3.1), the way we may get access to the MDP (Section 3.2), and give definitions of planning and reinforcement learning (Section 3.3). The next section provides brief overviews of literature in planning (Section 4.1) and reinforcement learning (Section 4.2). Together, Sections 3 and 4 should establish common ground to build the framework upon. The main contribution of this paper, the framework, is presented in Section 5, where we systematically discuss each consideration in the algorithmic space. Finally, Section 6 illustrates the applicability of the framework, by comparing a range of planning and reinforcement learning algorithms along the framework dimensions, and identifying interesting directions for future work.</p>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<p>The basis for a framework approach to planning (and reinforcement learning) is the FIND-and-REVISE scheme by Bonet and Geffner (<xref ref-type="bibr" rid="B25">2003a</xref>). FIND-and-REVISE specifies a general procedure for asynchronous value iteration, where we first <italic>find</italic> a new node that requires updating, and subsequently <italic>revise</italic> the value estimate of that node based on interaction with the MDP. Our framework follows a similar pattern, where we repeatedly find a new state (a root that requires updating), find interesting subsequent states to compute an improved value estimate for this state, and subsequently use this estimate to improve the solution. Our framework is also partially inspired by the reinforcement learning textbook of Sutton and Barto (<xref ref-type="bibr" rid="B148">2018</xref>), which provides a unified view on the back-up patterns in planning and reinforcement learning (regarding their depth and width), and thereby an integrated view on dynamic programming and reinforcement learning methodology. Similar ideas return in our framework, but we extend them with several additional dimensions, and apply them to a wide variety of other planning literature.</p>
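<p>To make the FIND-and-REVISE pattern concrete, the following minimal Python sketch implements it as asynchronous value iteration on a tabular MDP. All names (<monospace>find_and_revise</monospace>, the dictionary-based <monospace>T</monospace> and <monospace>R</monospace>) are our own illustrative assumptions, not code from Bonet and Geffner:</p>

```python
import random

def find_and_revise(states, actions, T, R, gamma=0.9, tol=1e-3, max_iters=10_000):
    """Hypothetical sketch of the FIND-and-REVISE pattern: repeatedly FIND a
    state whose value estimate is inconsistent with the Bellman optimality
    equation, then REVISE (back up) that state's value.

    T maps (s, a) to a dict {s': probability}; R maps (s, a, s') to a reward.
    """
    V = {s: 0.0 for s in states}

    def q(s, a):
        # Expected one-step return under the descriptive model (T, R).
        return sum(p * (R[(s, a, s2)] + gamma * V[s2])
                   for s2, p in T[(s, a)].items())

    def residual(s):
        # Bellman residual: gap between V(s) and its one-step back-up.
        return abs(max(q(s, a) for a in actions) - V[s])

    for _ in range(max_iters):
        # FIND: any state whose Bellman residual exceeds the tolerance.
        inconsistent = [s for s in states if residual(s) > tol]
        if not inconsistent:
            break
        s = random.choice(inconsistent)
        # REVISE: replace V(s) by its one-step Bellman back-up.
        V[s] = max(q(s, a) for a in actions)
    return V
```

<p>Note that the FIND step is free to pick states in any order, which is precisely the asynchronous aspect of the scheme.</p>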
<p>However, the main inspiration for our work is <italic>trial-based heuristic tree search</italic> (THTS) (Keller and Helmert, <xref ref-type="bibr" rid="B77">2013</xref>; Keller, <xref ref-type="bibr" rid="B76">2015</xref>), a framework that subsumes several planning algorithms, like Dynamic Programming (Bellman, <xref ref-type="bibr" rid="B15">1954</xref>), MCTS (Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>) and heuristic search (Pearl, <xref ref-type="bibr" rid="B114">1984</xref>) methods. THTS shows that a variety of planning algorithms can indeed be unified in the same algorithmic space, which we believe provided a lot of insight into the commonalities of these algorithms. Our present framework can be seen as an extension and modification of these ideas to also incorporate literature from the reinforcement learning community. Compared to THTS, we first of all add several new categories to the framework, such as &#x0201C;solution representation&#x0201D; and &#x0201C;update of the solution,&#x0201D; to accommodate the various ways in which planning and RL methods differ in the way they store and update the outcome of their back-ups. Second, THTS focused purely on the online planning setting, while we incorporate a new dimension &#x0201C;set root state&#x0201D; that also allows for different prioritization schemes in offline planning and learning. Third, we make several smaller adjustments and extensions, such as splitting up the back-up dimension into several subdimensions, and using a different definition of the concept of a trial (which we define as a single forward sequence of states and actions), which allows us to bound the computational effort per trial. This also leads to a new &#x0201C;budget per root&#x0201D; dimension in the framework, which specifies the number of trials (width) of the unfolded subtree in the local solution. We nevertheless invite the reader to also read the THTS papers, since they are a useful companion to the present paper.</p>
</sec>
<sec id="s3">
<title>3. Definitions</title>
<p>In sequential decision-making, formalized as Markov Decision Process optimization, we are interested in the following problem: given a (sequence of) state(s), what next action is best to choose, based on the criterion of highest cumulative pay-off in the future. More formally, we aim for <italic>context-dependent action prioritization based on a (discounted) cumulative reward criterion</italic>. This is a core challenge in artificial intelligence research, as it contains the key elements of the world: there is sensory information about the environment (states), we can influence that environment through actions, and there is some notion of what is preferable, now and in the future. The formulation can deal with a wide variety of well-known problem instances, like path planning, robotic manipulation, game playing and autonomous driving.</p>
<sec>
<title>3.1. Markov Decision Process</title>
<p>The formal definition of a <italic>Markov Decision Process</italic> (MDP) (Puterman, <xref ref-type="bibr" rid="B122">2014</xref>) is a tuple <inline-formula><mml:math id="M13"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">M</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. The environment consists of a <italic>transition function</italic> <inline-formula><mml:math id="M14"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and a <italic>reward function</italic> <inline-formula><mml:math id="M15"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi 
mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0211D;</mml:mi></mml:math></inline-formula>. At each timestep <italic>t</italic> we observe some state <inline-formula><mml:math id="M16"><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow></mml:math></inline-formula> and pick an action <inline-formula><mml:math id="M17"><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula>. Then, the environment returns a next state <inline-formula><mml:math id="M18"><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and associated scalar reward <inline-formula><mml:math id="M19"><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. The first state is sampled from the initial state distribution <italic>p</italic><sub>0</sub>(<italic>s</italic>), while &#x003B3; &#x02208; [0, 1] denotes a discount parameter.</p>
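<p>As a concrete illustration, the tuple above can be written down for a toy MDP in a few lines of Python. The dictionary representation and the <monospace>step</monospace> function are our own illustrative assumptions (an atomic state space queried through generative sampling), not part of the formal definition:</p>

```python
import random

# A minimal two-state MDP {S, A, T, R, gamma, p0}. States and actions are
# atomic integers; T maps (s, a) to a distribution over next states, and
# R maps (s, a, s') to a scalar reward (here: 1 for reaching state 1).
S = [0, 1]
A = [0, 1]
T = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {1: 1.0},
     (1, 0): {1: 1.0},         (1, 1): {0: 0.5, 1: 0.5}}
R = {(s, a, s2): (1.0 if s2 == 1 else 0.0)
     for (s, a), dist in T.items() for s2 in dist}
gamma = 0.95
p0 = {0: 1.0}  # the initial state distribution: always start in state 0

def step(s, a, rng=random):
    """Sample s' ~ T(.|s, a) and return (s', r): one MDP transition."""
    next_states, probs = zip(*T[(s, a)].items())
    s2 = rng.choices(next_states, weights=probs)[0]
    return s2, R[(s, a, s2)]
```

<p>Repeatedly calling <monospace>step</monospace> from a state sampled from <monospace>p0</monospace> produces exactly the trace of states, actions, and rewards described in the next paragraphs.</p>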
<p>The state space can either have an atomic, factorized, or structured form (Russell and Norvig, <xref ref-type="bibr" rid="B126">2016</xref>). <italic>Atomic</italic> state spaces treat each state as a separate, discrete entity, without the specification of any additional relation between states. In contrast, <italic>factorized</italic> states consist of a vector of attributes, which thereby provide a relation between different states (i.e., the attributes of states may partially overlap). Factorized state spaces allow for <italic>generalization</italic> between states, an important feature of learning algorithms. Finally, <italic>structured</italic> state spaces consist of factorized states with additional structure beyond simple discrete or continuous values, for example in the form of a symbolic language. In this work, we primarily focus on settings with atomic or factorized states.</p>
<p>The agent acts in the environment according to a <italic>policy</italic> <inline-formula><mml:math id="M20"><mml:mi>&#x003C0;</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. In the search community, a policy is also known as a <italic>contingency plan</italic> or <italic>strategy</italic> (Russell and Norvig, <xref ref-type="bibr" rid="B126">2016</xref>). By repeatedly selecting actions and transitioning to a next state, we can sample a <italic>trace</italic> through the environment. The <italic>cumulative return</italic> of the trace is denoted by: <inline-formula><mml:math id="M21"><mml:msub><mml:mrow><mml:mi>J</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>K</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msup><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, for a trace of length <italic>K</italic>. For <italic>K</italic> &#x0003D; &#x0221E; we call this the infinite-horizon return. The action-value function <italic>Q</italic><sup>&#x003C0;</sup>(<italic>s, a</italic>) is defined as the expectation of this cumulative return given a particular policy &#x003C0;:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>Q</mml:mi><mml:mi>&#x003C0;</mml:mi></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mover accent='true'><mml:mo>=</mml:mo><mml:mo>&#x002D9;</mml:mo></mml:mover><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mstyle  mathvariant='double-struck'><mml:mi>E</mml:mi></mml:mstyle><mml:mrow><mml:mi>&#x003C0;</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant='script'>T</mml:mi></mml:mrow></mml:msub><mml:mo stretchy='true'>[</mml:mo><mml:mstyle displaystyle='true'><mml:munderover><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>K</mml:mi></mml:munderover><mml:mrow><mml:msup><mml:mi>&#x003B3;</mml:mi><mml:mi>k</mml:mi></mml:msup></mml:mrow></mml:mstyle><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0007C;</mml:mo><mml:msub><mml:mi>s</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>a</mml:mi><mml:mi>t</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='true'>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This equation can be written in a recursive form, better known as the <italic>Bellman equation</italic>:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>Q</mml:mi><mml:mi>&#x003C0;</mml:mi></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:msub><mml:mstyle  mathvariant='double-struck'><mml:mi>E</mml:mi></mml:mstyle><mml:mrow><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>~</mml:mo><mml:mi mathvariant='script'>T</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo> <mml:mrow><mml:mi>&#x0211B;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:msub><mml:mstyle  mathvariant='double-struck'><mml:mi>E</mml:mi></mml:mstyle><mml:mrow><mml:msup><mml:mi>a</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>~</mml:mo><mml:mi>&#x003C0;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo>&#x0007C;</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo></mml:mrow></mml:msub><mml:mo stretchy='false'>[</mml:mo><mml:msup><mml:mi>Q</mml:mi><mml:mi>&#x003C0;</mml:mi></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>s</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo>,</mml:mo><mml:msup><mml:mi>a</mml:mi><mml:mo>&#x02032;</mml:mo></mml:msup><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow> <mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Our goal is to find a policy &#x003C0; that maximizes our expected return <italic>Q</italic><sup>&#x003C0;</sup>(<italic>s, a</italic>):</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x022C6;</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">arg</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>max</mml:mo><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In the planning and control literature, the above problem is typically formulated as a cost <italic>minimization</italic> problem (Bellman, <xref ref-type="bibr" rid="B16">1957</xref>). That formulation is interchangeable with our presentation by negating the reward function. The formulation also covers <italic>stochastic shortest path</italic> (SSP) problems (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B20">1991</xref>), which are a common setting in the planning literature. SSP problems are MDP specifications with negative rewards on all transitions and particular terminal goal states, where we attempt to reach the goal at as little cost as possible. The MDP specification induces a graph, which in the planning community is commonly referred to as an <italic>AND-OR graph</italic>: we repeatedly need to choose between actions (OR), and then take the expectation over the next states (AND). In a search tree these two operations are sometimes referred to as <italic>decision nodes</italic> (OR) and <italic>chance nodes</italic> (AND), respectively.</p>
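<p>The Bellman equation (Equation 2) also suggests a direct computational procedure: applying it repeatedly as an update rule converges to <italic>Q</italic><sup>&#x003C0;</sup>, a procedure known as iterative policy evaluation. The sketch below assumes a tabular MDP with full (descriptive) access to the dynamics; all names are illustrative:</p>

```python
def evaluate_policy(states, actions, T, R, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: sweep the Bellman equation (Eq. 2) as an
    update until Q^pi reaches a fixed point.

    pi maps a state to a dict {action: probability}; T maps (s, a) to a dict
    {s': probability}; R maps (s, a, s') to a scalar reward.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                # E_{s'~T}[ R(s, a, s') + gamma * E_{a'~pi}[ Q(s', a') ] ]
                new_q = sum(
                    p * (R[(s, a, s2)]
                         + gamma * sum(pi[s2][a2] * Q[(s2, a2)]
                                       for a2 in actions))
                    for s2, p in T[(s, a)].items())
                delta = max(delta, abs(new_q - Q[(s, a)]))
                Q[(s, a)] = new_q
        if delta < tol:
            break
    return Q
```

<p>Greedifying the policy with respect to the resulting <italic>Q</italic><sup>&#x003C0;</sup> and re-evaluating it is then one route toward the maximization in Equation (3).</p>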
</sec>
<sec>
<title>3.2. Access to the MDP Dynamics</title>
<p>A crucial aspect in MDP optimization is the way we can interact with the MDP, i.e., the <italic>type of access</italic> we have to the transition and reward function. We will here focus on the type of access to the transition function, since the type of access to the reward usually mimics the type of access to the transition function. All MDP algorithms at some point <italic>query</italic> the MDP transition function at a particular state-action pair (<italic>s, a</italic>), and get information back about the possible next state(s) <italic>s</italic>&#x02032; and associated reward <inline-formula><mml:math id="M25"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. However, there are differences in the <italic>order</italic> in which we can make queries, and in the <italic>type of information</italic> we get back after a query (Kearns et al., <xref ref-type="bibr" rid="B75">2002</xref>; Keller and Helmert, <xref ref-type="bibr" rid="B77">2013</xref>).</p>
<p>Regarding the first consideration, reinforcement learning methods often assume we need to make our next query at the state that resulted from our last query, i.e., we have to move forward (similar to the way humans interact with the real world). We propose to call this <italic>irreversible</italic> access to the MDP, since we cannot revert a particular action. In practice, RL approaches often assume that we can reset at any particular moment to a state sampled from the initial state distribution, so we may also call this <italic>resettable</italic> access to the MDP. In contrast, planning methods often assume we can query the MDP dynamics in any preferred order of state-action pairs, i.e., we can <italic>set</italic> the query to any state we like. This property also allows us to repeatedly plan forward from the same state (like humans plan in their mind), which we therefore propose to call <italic>reversible</italic> access to the MDP dynamics. The distinction between reversible/settable and irreversible/resettable access is visualized in the rows of <xref ref-type="fig" rid="F1">Figure 1</xref>. Reversible/settable access to the MDP dynamics is usually referred to as a (known) <italic>model</italic>.</p>
<disp-quote><p><italic>A model is a type of access to the MDP dynamics that can be queried in any preferred order of state-action pairs</italic>.</p></disp-quote>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Illustration of different types of access to the MDP transition dynamics. Rows: We may either have <italic>reversible/settable</italic> access to the MDP dynamics, in which case we can query the MDP on any desired state, or <italic>irreversible/resettable</italic> access to the MDP, in which case we have to make the next query at the resulting state, or we can reset to a state from the initial state distribution. Any type of reversible/settable access to the MDP is usually called a (known) <italic>model</italic>. Columns: On each query to the MDP dynamics, we may either get access to the full distribution of possible next states (<italic>descriptive</italic>/<italic>declarative</italic> access), or only get a single sample from this distribution (<italic>generative</italic> access). Note that we could theoretically think of irreversible descriptive access, in which we do see the probabilities but need to continue from the next state, but we are unaware of such a model in practice.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0001.tif"/>
</fig>
<p>A second important distinction concerns the type of information we get about the possible next states. A <italic>descriptive/declarative</italic> model provides us with the full probabilities of each possible next state, i.e., the entire distribution of <inline-formula><mml:math id="M26"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, which allows us to fully evaluate the expectation over the dynamics in the Bellman equation (Equation 2). In contrast, <italic>generative</italic> access only provides us with a sample from the next state distribution, without access to the true underlying probabilities (we may of course approximate the expectation in Equation (2) through repeated sampling). These two options are displayed in the columns of <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<p>Together, the two considerations lead to three types of access to the MDP dynamics, as shown in the cells of <xref ref-type="fig" rid="F1">Figure 1</xref>. Reversible descriptive access (top-left) is for example used in Value Iteration (Bellman, <xref ref-type="bibr" rid="B16">1957</xref>), reversible generative access (top-right) is used in Monte Carlo Tree Search (Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>), while irreversible generative access (bottom-right) is used in Q-learning (Watkins and Dayan, <xref ref-type="bibr" rid="B161">1992</xref>). The combination of irreversible and descriptive access, in the bottom-left of <xref ref-type="fig" rid="F1">Figure 1</xref>, is theoretically possible, but to our knowledge does not occur in practice. Note that there is also a natural ordering in these types of MDP access: reversible descriptive access gives the most information and freedom, followed by reversible generative access (since we can always sample from distributional access), followed by irreversible generative access (since we can always restrict the order of sampling ourselves). However, the difficulty of obtaining a particular type of access follows the opposite pattern: descriptive models are typically the hardest to obtain, while irreversible generative access is by definition available through real-world interaction.</p>
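<p>To make these three types of access concrete, the following Python sketch (illustrative only; the toy transition table and all interface names are our own) contrasts descriptive, generative, and irreversible access for a two-state MDP with a single action.</p>

```python
import random

# Toy MDP: states {0, 1}, one action 'a'; illustrative transition probabilities.
P = {(0, 'a'): {0: 0.3, 1: 0.7}, (1, 'a'): {0: 0.5, 1: 0.5}}

def descriptive_model(s, a):
    """Reversible descriptive access: the full distribution T(s'|s, a)."""
    return P[(s, a)]

def generative_model(s, a, rng=random):
    """Reversible generative access: a single sample s' ~ T(.|s, a),
    queryable at any state-action pair in any order."""
    states, probs = zip(*P[(s, a)].items())
    return rng.choices(states, weights=probs)[0]

class IrreversibleEnv:
    """Irreversible generative access: the next query must continue from the
    resulting state, but we may reset to the initial state distribution."""
    def __init__(self):
        self.state = None
        self.reset()

    def reset(self):
        self.state = 0  # initial state distribution: always state 0 here
        return self.state

    def step(self, a):
        self.state = generative_model(self.state, a)
        return self.state
```

<p>Note how the natural ordering appears in the sketch itself: generative access can be derived from descriptive access by sampling, and irreversible access from generative access by restricting the query order.</p>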
</sec>
<sec>
<title>3.3. Definitions of Planning and Reinforcement Learning</title>
<p>We are now ready to give formal definitions of MDP planning and reinforcement learning. While there are various definitions of both fields in the literature (Russell and Norvig, <xref ref-type="bibr" rid="B126">2016</xref>; Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>), these are typically not specific enough to discriminate planning from reinforcement learning. One possible distinction is based on the <italic>type of access</italic> to the MDP dynamics: planning approaches have settable/reversible access to the dynamics (&#x0201C;a known model&#x0201D;), while reinforcement learning approaches have irreversible access (&#x0201C;an unknown model&#x0201D;). However, there is a second possible distinction, based on the <italic>coverage or storage of the solution</italic>. This distinction seems known to many researchers, but is seldom explicitly discussed in research papers. On the one hand, planning methods tend to use <italic>local</italic> solution representations: the solution is only stored temporarily, and usually valid for only a subset of all states (for example repeatedly simulating forward from a current state). In contrast, reinforcement learning approaches tend to use a <italic>global</italic> solution: permanent storage of the solution which is typically valid for all possible states.</p>
<disp-quote><p><italic>A local solution temporarily stores solution estimates for a subset of all states</italic>.</p>
<p><italic>A global solution permanently stores solution estimates for all states</italic>.</p></disp-quote>
<p>The focus of RL methods on global solutions is easy to understand: without a model we cannot repeatedly simulate forward from the same state, and therefore our best bet is to store a solution for all possible states (we can never build a local solution beyond size one, since we have to move forward). The global solutions that we gradually update are typically referred to as <italic>learned</italic> solutions, which connects reinforcement learning to the broader machine learning literature.</p>
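<p>The contrast between global and local solutions can be sketched in a few lines of Python (hypothetical names; illustrative only): a global table persists across the whole run and is updated incrementally, while a local table is built for a single root state and discarded afterwards.</p>

```python
global_Q = {}  # global solution: permanent, covers all visited state-action pairs

def update_global(s, a, target, lr=0.1):
    """Incrementally move a permanently stored estimate toward a target (learning)."""
    key = (s, a)
    global_Q[key] = global_Q.get(key, 0.0) + lr * (target - global_Q.get(key, 0.0))

def plan_from_root(root, simulate, n_trials=10):
    """Local solution: estimates only for the current root, stored temporarily.
    `simulate(root)` -> (action, return) stands in for a forward trial with a model."""
    local_Q = {}
    for _ in range(n_trials):
        a, ret = simulate(root)
        local_Q.setdefault(a, []).append(ret)
    # act greedily with respect to the mean return per root action
    best = max(local_Q, key=lambda a: sum(local_Q[a]) / len(local_Q[a]))
    return best  # local_Q is discarded when the function returns
```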
<p>Interestingly, our two possible distinctions between planning and reinforcement learning (model vs. no model, and local vs. global solution) do not always agree. For example, both Value Iteration (Bellman, <xref ref-type="bibr" rid="B17">1966</xref>) and AlphaZero (Silver et al., <xref ref-type="bibr" rid="B137">2018</xref>) combine a global solution (which would make it reinforcement learning) with a model (which would make it planning). Indeed, Dynamic Programming has long been considered a bridging technique between planning and reinforcement learning. We propose to solve this issue by considering these borderline cases as <italic>model-based reinforcement learning</italic> (Samuel, <xref ref-type="bibr" rid="B128">1967</xref>; Sutton, <xref ref-type="bibr" rid="B146">1990</xref>; Moerland et al., <xref ref-type="bibr" rid="B102">2020a</xref>), and thereby let the global vs. local distinction dominate.</p>
<disp-quote><p><italic>Planning is a class of MDP algorithms that 1) use a model and 2) only store a local solution</italic>.</p></disp-quote>
<disp-quote><p><italic>Reinforcement learning is a class of MDP algorithms that store a global solution</italic>.</p></disp-quote>
<p>The definition of reinforcement learning may then be further partitioned into model-free and model-based RL:</p>
<disp-quote><p><italic>Model-free reinforcement learning is a class of MDP algorithms that 1) do not use a model, and 2) store a global solution</italic>.</p></disp-quote>
<disp-quote><p><italic>Model-based reinforcement learning is a class of MDP algorithms that 1) use a model, and 2) store a global solution</italic>.</p></disp-quote>
<p>These definitions are summarized in <xref ref-type="table" rid="T1">Table 1</xref>. We explicitly introduce these definitions since the boundaries between both fields have generally remained vague, and a clear separation (for example between local and global solutions) will later on be useful in our framework as well.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Categorization of planning and reinforcement learning, based on 1) the presence of a model (settable/reversible access to the MDP dynamics), and 2) the presence of a global/learned solution.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>Global solution</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Planning</td>
<td valign="top" align="left">&#x0002B;</td>
<td valign="top" align="left">-</td>
</tr>
<tr>
<td valign="top" align="left">Reinforcement learning</td>
<td valign="top" align="left">&#x0002B;/-</td>
<td valign="top" align="left">&#x0002B;</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;Model-free reinforcement learning</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">&#x0002B;</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;Model-based reinforcement learning</td>
<td valign="top" align="left">&#x0002B;</td>
<td valign="top" align="left">&#x0002B;</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
</sec>
<sec id="s4">
<title>4. Background</title>
<p>Both planning and reinforcement learning are mature research fields with a large corpus of literature. As mentioned in the Introduction, the intention of this paper is not to provide full surveys of these fields. Instead, the aim of this section is to provide a quick overview of research directions in both fields, pointing toward the relevant literature.</p>
<sec>
<title>4.1. Planning</title>
<p><italic>Planning</italic> (or <italic>search</italic>) is a large research field within artificial intelligence (LaValle, <xref ref-type="bibr" rid="B86">2006</xref>; Russell and Norvig, <xref ref-type="bibr" rid="B126">2016</xref>). A classic approach in MDP planning is <italic>dynamic programming</italic> (DP), of which value iteration (VI) (Bellman, <xref ref-type="bibr" rid="B17">1966</xref>) and policy iteration (PI) (Howard, <xref ref-type="bibr" rid="B68">1960</xref>) are classic examples. DP algorithms sweep through the entire state space, repeatedly solving small subproblems based on the Bellman optimality equation. Dynamic programming is thereby a bridging technique between planning and reinforcement learning (since it combines a model and a global representation of the solution), and would under our definitions be a form of model-based reinforcement learning. While guaranteed to converge on the optimal value function, we typically cannot store the entire solution in tabular form due to the curse of dimensionality (Bellman, <xref ref-type="bibr" rid="B17">1966</xref>). Sometimes tables may be stored more efficiently, for example through binary decision diagrams (BDD) (Akers, <xref ref-type="bibr" rid="B3">1978</xref>; Bryant, <xref ref-type="bibr" rid="B31">1992</xref>), or we can battle the curse of dimensionality through approximate solutions (Powell, <xref ref-type="bibr" rid="B119">2007</xref>; Bertsekas, <xref ref-type="bibr" rid="B19">2011</xref>), which we further discuss in the section on reinforcement learning.</p>
<p>Most planning literature has focused on local solutions derived from traces sampled from some start state, which are often represented as <italic>trees</italic> or <italic>graphs</italic>. Historically, this starts with research on <italic>uninformed search</italic>, which studied the order of node expansion in a search tree, like <italic>breadth-first search</italic> (BFS) (Moore, <xref ref-type="bibr" rid="B105">1959</xref>), <italic>depth-first search</italic> (Tarjan, <xref ref-type="bibr" rid="B150">1972</xref>), and <italic>iterative deepening</italic> (Slate and Atkin, <xref ref-type="bibr" rid="B142">1983</xref>). However, most planning algorithms follow a pattern of <italic>best-first search</italic>, where we next expand the node which currently seems most promising. An early example is Dijkstra&#x00027;s algorithm (Dijkstra, <xref ref-type="bibr" rid="B43">1959</xref>), which next expands the node with the current lowest path cost. Dijkstra also introduced the notions of a <italic>frontier</italic> (or open set), which is the set of states on the border of the planning tree/graph that are still candidates for expansion, and of an <italic>explored set</italic> (or closed set), which is the set of states that have already been expanded. By tracking a frontier and explored set we turn a tree search into a graph search, since it prevents the further expansion of <italic>redundant</italic> paths (multiple action sequences leading to the same state).</p>
<p>We may further improve planning performance through the use of <italic>heuristics</italic> (Simon and Newell, <xref ref-type="bibr" rid="B140">1958</xref>), which in planning are often functions that provide a quick, optimistic estimate of the value of a particular state. When we apply best-first search to the sum of the path cost and an admissible heuristic, we arrive at the well-known search algorithm A<sup>&#x022C6;</sup> (Hart et al., <xref ref-type="bibr" rid="B62">1968</xref>), which is applicable to deterministic domains. The same approach was extended to the stochastic MDP setting as AO<sup>&#x022C6;</sup> (Pohl, <xref ref-type="bibr" rid="B118">1970</xref>; Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>). Another successful idea in the (heuristic) planning literature is the use of <italic>labeling</italic> to mark a particular state as solved (not requiring further expansion) when its value estimate is guaranteed to have converged (which happens when the state is either terminal or all of its children have been solved). Labeling can be challenging due to the potential presence of loops (which we could otherwise expand indefinitely), for which LAO<sup>&#x022C6;</sup> (Hansen and Zilberstein, <xref ref-type="bibr" rid="B60">2001</xref>) further extends the AO<sup>&#x022C6;</sup> algorithm. A survey of heuristic search is provided by Pearl (<xref ref-type="bibr" rid="B114">1984</xref>), while Kanal and Kumar (<xref ref-type="bibr" rid="B71">2012</xref>) discuss the relation of these methods to <italic>branch-and-bound</italic> search, which has been popular in operations research.</p>
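<p>A compact sketch of A<sup>&#x022C6;</sup> (our own minimal implementation, for illustration) shows best-first search on the sum of path cost and an admissible heuristic, together with the frontier (open set) and explored (closed set) bookkeeping described above:</p>

```python
import heapq

def a_star(start, goal, neighbors, h):
    """Best-first search on f(s) = g(s) + h(s): path cost so far plus an
    optimistic (admissible) heuristic estimate of the remaining cost.
    `neighbors(s)` yields (next_state, step_cost) pairs."""
    frontier = [(h(start), 0.0, start, [start])]  # open set, ordered by f
    explored = set()                              # closed set
    while frontier:
        f, g, s, path = heapq.heappop(frontier)
        if s == goal:
            return path, g
        if s in explored:
            continue                              # prune redundant paths
        explored.add(s)
        for s2, cost in neighbors(s):
            if s2 not in explored:
                heapq.heappush(frontier,
                               (g + cost + h(s2), g + cost, s2, path + [s2]))
    return None, float('inf')
```

<p>With the trivial heuristic h(s) = 0 this reduces exactly to Dijkstra&#x00027;s algorithm; an informative admissible heuristic only reorders the frontier, expanding fewer nodes while preserving optimality.</p>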
<p>A bridging algorithm from the planning to the learning community was <italic>learning real-time</italic> A<sup>&#x022C6;</sup> (LRTA<sup>&#x022C6;</sup>) (Korf, <xref ref-type="bibr" rid="B82">1990</xref>), which started to incorporate learning methodology in planning methods (and was as such one of the first model-based RL papers). This approach was later extended to the MDP setting as Real-time Dynamic Programming (RTDP) (Barto et al., <xref ref-type="bibr" rid="B10">1995</xref>), which performs DP updates on traces sampled from a start state distribution. <italic>Labeled-RTDP</italic> (Bonet and Geffner, <xref ref-type="bibr" rid="B26">2003b</xref>) extends RTDP through a labeling mechanism for solved states, with further improvements of RTDP provided by McMahan et al. (<xref ref-type="bibr" rid="B98">2005</xref>), Smith and Simmons (<xref ref-type="bibr" rid="B144">2006</xref>), and Sanner et al. (<xref ref-type="bibr" rid="B129">2009</xref>).</p>
<p>Many planning algorithms suffer from high memory requirements, since it is typically infeasible to store all possible states in memory. Several research lines have therefore investigated planning algorithms that have reduced memory requirements. Some well-known examples are <italic>iterative deepening</italic> depth-first search (Slate and Atkin, <xref ref-type="bibr" rid="B142">1983</xref>), iterative deepening A<sup>&#x022C6;</sup> (Korf, <xref ref-type="bibr" rid="B81">1985</xref>), Simplified Memory-Bounded A<sup>&#x022C6;</sup> (SMA<sup>&#x022C6;</sup>) (Russell, <xref ref-type="bibr" rid="B125">1992</xref>) and recursive best-first search (RBFS) (Korf, <xref ref-type="bibr" rid="B83">1993</xref>). For a more extensive discussion of (heuristic) MDP planning methods we refer the reader to Kolobov (<xref ref-type="bibr" rid="B79">2012</xref>) and Geffner and Bonet (<xref ref-type="bibr" rid="B51">2013</xref>).</p>
<p>A different branch in planning research estimates action values based on statistical sampling techniques, better known as <italic>sample-based planning</italic>. A classic approach is <italic>Monte Carlo search</italic> (MCS) (Tesauro and Galperin, <xref ref-type="bibr" rid="B152">1997</xref>), in which we sample a number of traces for each currently available action and estimate their value as the mean return of these traces. Sample-based planning was further extended to <italic>sparse sampling</italic> (Kearns et al., <xref ref-type="bibr" rid="B75">2002</xref>), which formed the basis for <italic>Monte Carlo Tree Search</italic> (MCTS) (Coulom, <xref ref-type="bibr" rid="B39">2006</xref>; Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>; Browne et al., <xref ref-type="bibr" rid="B30">2012</xref>). While MCS only tracks statistics at the root of the tree search, MCTS recursively applies the same principle at deeper levels of the tree as well. Exploration and exploitation within the tree are typically based on variants of the upper confidence bounds (UCB) rule (Auer et al., <xref ref-type="bibr" rid="B8">2002</xref>). MCTS for example showed early success in the game of Go (Gelly and Wang, <xref ref-type="bibr" rid="B52">2006</xref>). In the control community, there is a second branch of sample-based planning known as <italic>rapidly-exploring random trees</italic> (RRTs) (LaValle, <xref ref-type="bibr" rid="B85">1998</xref>). In contrast to MCTS, which samples in action space to construct a tree, RRTs sample in state space and try to find an action that connects the new sampled state to the existing explicit tree in memory.</p>
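<p>Monte Carlo search as described above can be sketched in a few lines (a simplified version under our own assumed generative-model interface, where step(s, a) returns a next state, reward, and termination flag):</p>

```python
import random

def monte_carlo_search(s0, actions, step, gamma=0.99, n_traces=32,
                       depth=20, rng=random):
    """Monte Carlo search with a generative model: score each currently
    available root action by the mean discounted return of sampled traces."""
    def rollout(s, a):
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            s, r, done = step(s, a)   # query the generative model
            ret += discount * r
            discount *= gamma
            if done:
                break
            a = rng.choice(actions)   # default (uniform random) rollout policy
        return ret
    means = {a: sum(rollout(s0, a) for _ in range(n_traces)) / n_traces
             for a in actions}
    return max(means, key=means.get)  # greedy choice at the root
```

<p>MCTS generalizes this sketch by tracking such statistics not only at the root but recursively at deeper nodes, and by replacing the uniform rollout policy inside the tree with a UCB-style selection rule.</p>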
<p>Planning in continuous state and actions spaces, like in robotics, is typically referred to as <italic>optimal control</italic> (Lewis et al., <xref ref-type="bibr" rid="B90">2012</xref>; Levine, <xref ref-type="bibr" rid="B89">2018</xref>). Here, dynamics functions are often smooth and differentiable, and many algorithms therefore use a form of <italic>gradient-based planning</italic>. In this case, we directly optimize the policy for the cumulative reward objective by differentiating through the dynamics function. When the dynamics model is linear and the reward function quadratic, the solution is actually available in analytical form, better known as the linear-quadratic regulator (LQR) (Anderson and Moore, <xref ref-type="bibr" rid="B5">2007</xref>). In practice, dynamics are often not linear, but this can be partly mitigated by repeatedly linearizing the dynamics around the current state [known as iterative LQR (iLQR) Todorov and Li, <xref ref-type="bibr" rid="B154">2005</xref>]. In the RL community, gradient-based planning is often referred to as <italic>value gradients</italic> (Heess et al., <xref ref-type="bibr" rid="B64">2015</xref>). Alternatively, we can also write the MDP problem as a non-linear programming problem (i.e., take the more black-box optimization approach), where the dynamics function for example enters as a constraint, better known as <italic>direct optimal control</italic> (Bock and Plitt, <xref ref-type="bibr" rid="B23">1984</xref>). Another research line treats planning as probabilistic inference (Toussaint, <xref ref-type="bibr" rid="B155">2009</xref>; Botvinick and Toussaint, <xref ref-type="bibr" rid="B27">2012</xref>; Kappen et al., <xref ref-type="bibr" rid="B72">2012</xref>), where we construct message-passing algorithms to infer which actions would lead to receiving a final reward.</p>
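<p>For the LQR case, the analytical solution takes the form of a backward Riccati recursion. The numpy sketch below implements the standard finite-horizon, discrete-time textbook form for dynamics x&#x02032; = Ax + Bu and stage cost x<sup>T</sup>Qx + u<sup>T</sup>Ru (variable names are ours):</p>

```python
import numpy as np

def lqr_backward(A, B, Q, R, T):
    """Finite-horizon discrete-time LQR via the backward Riccati recursion.
    Returns feedback gains K_t such that u_t = -K_t x_t is optimal."""
    P = Q.copy()                 # terminal cost-to-go matrix
    gains = []
    for _ in range(T):
        # K = (R + B'PB)^{-1} B'PA
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # Riccati update: P = Q + A'P(A - BK)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]           # gains ordered t = 0, ..., T-1
```

<p>iLQR applies this same recursion to a local linearization of nonlinear dynamics around the current trajectory, and repeats until convergence.</p>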
<p>A popular approach in the control community is <italic>model predictive control</italic> (MPC) (Morari and Lee, <xref ref-type="bibr" rid="B106">1999</xref>), also known as <italic>receding-horizon control</italic> (Mayne and Michalska, <xref ref-type="bibr" rid="B96">1990</xref>), where we optimize for an action up to a certain lookahead depth, execute the best action from the plan, and then re-plan from the resulting next state (i.e., we never optimize for the full MDP horizon). Such interleaving of planning and acting (McDermott, <xref ref-type="bibr" rid="B97">1978</xref>) is in the planning community often referred to as <italic>decision-time</italic> planning or <italic>online</italic> planning, where we directly need to find an action for a current state. In contrast, <italic>background</italic> or <italic>offline</italic> planning (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>) uses planning operations to improve the solution for a variety of states, for example stored in a global solution.</p>
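<p>The receding-horizon loop of MPC can be sketched as follows, using simple random shooting as the inner optimizer (the model interface and all names are our own assumptions; practical MPC typically uses gradient-based or sampling-based optimizers such as CEM):</p>

```python
import random

def mpc_step(s, step_model, actions, horizon=10, n_candidates=100, rng=random):
    """One receding-horizon decision: optimize an open-loop action sequence up
    to `horizon` by random shooting, then return only its first action.
    `step_model(s, a)` -> (next_state, reward) is an assumed model interface."""
    best_ret, best_first = -float('inf'), None
    for _ in range(n_candidates):
        plan = [rng.choice(actions) for _ in range(horizon)]
        s_sim, ret = s, 0.0
        for a in plan:                  # evaluate the candidate plan in the model
            s_sim, r = step_model(s_sim, a)
            ret += r
        if ret > best_ret:
            best_ret, best_first = ret, plan[0]
    return best_first  # execute this action, observe the next state, re-plan
```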
</sec>
<sec>
<title>4.2. Reinforcement Learning</title>
<p>Reinforcement learning (RL) (Barto et al., <xref ref-type="bibr" rid="B12">1983</xref>; Wiering and Van Otterlo, <xref ref-type="bibr" rid="B163">2012</xref>; Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>) is a large research field within machine learning. While the planning literature is mostly organized in sub-disciplines (as discussed above), RL literature can best be covered through the range of subtopics within algorithms that have been studied. A central idea in RL is the use of <italic>bootstrapping</italic> (Sutton, <xref ref-type="bibr" rid="B145">1988</xref>), where we plug in a <italic>learned</italic> value estimate to improve the estimate of a state that precedes it. Literature has focused on the way we can construct these bootstrap estimates, for example distinguishing between <italic>on-policy</italic> (Rummery and Niranjan, <xref ref-type="bibr" rid="B124">1994</xref>) and <italic>off-policy</italic> back-ups (Watkins and Dayan, <xref ref-type="bibr" rid="B161">1992</xref>). The depth of the back-up has also received much attention in RL, where estimates of different depths can for example be combined through <italic>eligibility traces</italic> (Singh and Sutton, <xref ref-type="bibr" rid="B141">1996</xref>). We can also use multi-step methods in the off-policy setting through the use of importance sampling, where we generally reweight the back-up contribution of the next step by its probability under the optimal policy. Examples in this direction are the Tree-backup [TB(&#x003BB;)] algorithm (Precup, <xref ref-type="bibr" rid="B120">2000</xref>) and Retrace(&#x003BB;) (Munos et al., <xref ref-type="bibr" rid="B108">2016</xref>).</p>
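<p>The distinction between on-policy and off-policy back-ups is already visible in the one-step bootstrap targets of SARSA and Q-learning, sketched below for the tabular case (helper names are ours):</p>

```python
def sarsa_target(r, s2, a2, Q, gamma=0.99):
    """On-policy one-step target: bootstrap on the action the behavior policy
    actually selects in s' (Rummery and Niranjan, 1994)."""
    return r + gamma * Q.get((s2, a2), 0.0)

def q_learning_target(r, s2, actions, Q, gamma=0.99):
    """Off-policy one-step target: bootstrap on the greedy (max) action in s'
    (Watkins and Dayan, 1992)."""
    return r + gamma * max(Q.get((s2, a), 0.0) for a in actions)

def td_update(Q, s, a, target, lr=0.1):
    """Move the stored estimate a step toward the bootstrapped target."""
    Q[(s, a)] = Q.get((s, a), 0.0) + lr * (target - Q.get((s, a), 0.0))
```

<p>Eligibility traces and multi-step methods interpolate between such one-step bootstrapped targets and full Monte Carlo returns.</p>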
<p>Reinforcement learning research has also focused on direct specification of the solution, in the form of a policy function. An important result in this direction is the <italic>policy gradient theorem</italic> (Williams, <xref ref-type="bibr" rid="B164">1992</xref>; Sutton et al., <xref ref-type="bibr" rid="B149">2000</xref>; Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>), which specifies an unbiased estimate of the gradient of the objective with respect to policy parameters. Policy search methods can be stabilized in various ways (Schulman et al., <xref ref-type="bibr" rid="B132">2015</xref>, <xref ref-type="bibr" rid="B134">2017</xref>), can be integrated with (gradient-based) planning (Deisenroth and Rasmussen, <xref ref-type="bibr" rid="B41">2011</xref>; Levine and Koltun, <xref ref-type="bibr" rid="B88">2013</xref>), and have for example shown much success in robotics (Deisenroth et al., <xref ref-type="bibr" rid="B42">2013</xref>). Note that policy search can also be approached in a gradient-free way, for example through evolutionary strategies (Moriarty et al., <xref ref-type="bibr" rid="B107">1999</xref>; Whiteson and Stone, <xref ref-type="bibr" rid="B162">2006</xref>), including the successful <italic>cross-entropy method</italic> (CEM) (Mannor et al., <xref ref-type="bibr" rid="B94">2003</xref>).</p>
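<p>A minimal instance of the policy gradient theorem is REINFORCE on a stateless bandit with a softmax policy, sketched below in numpy (hypothetical names; illustrative only): each sampled action is reinforced in proportion to the return it obtained, using the unbiased score-function gradient &#x02207; log &#x003C0;(a).</p>

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())   # shifted for numerical stability
    return z / z.sum()

def reinforce_bandit(reward_fn, n_actions=2, steps=2000, lr=0.1, seed=0):
    """REINFORCE (Williams, 1992) on a stateless bandit: follow the sampled
    policy gradient, grad log pi(a) * return, for each pulled action."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_actions)       # softmax policy parameters
    for _ in range(steps):
        pi = softmax(theta)
        a = rng.choice(n_actions, p=pi)
        ret = reward_fn(a)
        grad_log = -pi                # grad of log pi(a) wrt theta: e_a - pi
        grad_log[a] += 1.0
        theta += lr * ret * grad_log  # stochastic gradient ascent
    return softmax(theta)             # final action probabilities
```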
<p>A central theme in reinforcement learning research is the use of supervised learning methods to <italic>approximate</italic> the solution, which allows information to <italic>generalize</italic> between similar states (and in larger problems allows a global solution to fit in memory). Early results on function approximation include tile coding (Sutton, <xref ref-type="bibr" rid="B147">1996</xref>) and linear approximation (Bradtke and Barto, <xref ref-type="bibr" rid="B28">1996</xref>), while state-of-the-art results are achieved by the use of deep neural networks (Goodfellow et al., <xref ref-type="bibr" rid="B54">2016</xref>), whose application to RL was pioneered by Mnih et al. (<xref ref-type="bibr" rid="B99">2015</xref>). Surveys of deep reinforcement learning are provided by Fran&#x000E7;ois-Lavet et al. (<xref ref-type="bibr" rid="B50">2018</xref>) and Arulkumaran et al. (<xref ref-type="bibr" rid="B6">2017</xref>).</p>
<p>Another fundamental theme in RL research is the balance between exploration and exploitation. Random perturbation approaches include &#x003F5;-greedy and Boltzmann exploration (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>), while other approaches, such as confidence bounds (Kaelbling, <xref ref-type="bibr" rid="B70">1993</xref>) and Thompson sampling (Thompson, <xref ref-type="bibr" rid="B153">1933</xref>), leverage the uncertainty in an action value estimate. Another large branch in RL exploration research is <italic>intrinsic motivation</italic> (Chentanez et al., <xref ref-type="bibr" rid="B36">2005</xref>), which explores based on concepts like curiosity (Schmidhuber, <xref ref-type="bibr" rid="B131">1991</xref>), novelty, and model uncertainty (Guez et al., <xref ref-type="bibr" rid="B56">2012</xref>).</p>
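<p>The two random-perturbation rules mentioned above can be sketched as follows (names are ours): &#x003F5;-greedy perturbs the greedy choice with a small uniform probability, while Boltzmann exploration samples actions in proportion to exp(Q/&#x003C4;), so the amount of exploration scales with the value differences.</p>

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """With probability epsilon act uniformly at random, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0, rng=random):
    """Sample actions with probability proportional to exp(Q/temperature):
    higher-valued actions are favored, but all keep non-zero probability."""
    prefs = [math.exp(q / temperature) for q in q_values]
    r, acc = rng.random() * sum(prefs), 0.0
    for a, p in enumerate(prefs):
        acc += p
        if r <= acc:
            return a
    return len(q_values) - 1
```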
<p>Reinforcement learning and planning have been combined in the field of model-based reinforcement learning (Hester and Stone, <xref ref-type="bibr" rid="B65">2012</xref>; Moerland et al., <xref ref-type="bibr" rid="B102">2020a</xref>). In the RL community, this idea started with <italic>Dyna</italic> (Sutton, <xref ref-type="bibr" rid="B146">1990</xref>), which uses sampled data (from an irreversible environment) to learn a reversible dynamics model, and subsequently makes planning updates over this learned model to further improve the value function. Successful model-based RL algorithms include AlphaZero (Silver et al., <xref ref-type="bibr" rid="B137">2018</xref>), which set superhuman performance in Go, Chess and Shogi, and Guided Policy Search (Levine and Koltun, <xref ref-type="bibr" rid="B88">2013</xref>), which was successful in robotics tasks. We can also use a learned model for gradient-based policy updates, as for example done in PILCO (Deisenroth and Rasmussen, <xref ref-type="bibr" rid="B41">2011</xref>), while a learned backward model allows us to more quickly spread new information over the state space [known as <italic>prioritized sweeping</italic> (PS) Moore and Atkeson, <xref ref-type="bibr" rid="B104">1993</xref>]. A full survey of model-based reinforcement learning is provided by Moerland et al. (<xref ref-type="bibr" rid="B102">2020a</xref>).</p>
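<p>The Dyna idea can be sketched as follows: every real transition both updates the value estimate and is stored in a learned (here deterministic, tabular) model, which then provides additional simulated planning updates. The environment interface and all names below are our own assumptions, in the spirit of Dyna-Q.</p>

```python
import random

def dyna_q(env_step, s0, actions, episodes=1, steps=100, n_planning=10,
           gamma=0.95, lr=0.1, eps=0.1, rng=random):
    """Dyna-style sketch: each real transition updates Q and a learned
    deterministic model; n_planning simulated updates then replay the model.
    `env_step(s, a)` -> (s', r, done) is an assumed environment interface."""
    Q, model = {}, {}
    def q(s, a): return Q.get((s, a), 0.0)
    def update(s, a, r, s2):
        target = r + gamma * max(q(s2, b) for b in actions)
        Q[(s, a)] = q(s, a) + lr * (target - q(s, a))
    for _ in range(episodes):
        s = s0
        for _ in range(steps):
            a = (rng.choice(actions) if rng.random() < eps
                 else max(actions, key=lambda b: q(s, b)))
            s2, r, done = env_step(s, a)      # real (irreversible) experience
            update(s, a, r, s2)
            model[(s, a)] = (r, s2)           # learned reversible model
            for _ in range(n_planning):       # planning over the learned model
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2)
            if done:
                break
            s = s2
    return Q
```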
<p>Reinforcement learning research is also organized around a variety of subtopics, such as hierarchical/temporal abstraction (Barto and Mahadevan, <xref ref-type="bibr" rid="B11">2003</xref>), goal setting and generalization over goals (Schaul et al., <xref ref-type="bibr" rid="B130">2015</xref>), transfer between tasks (Taylor and Stone, <xref ref-type="bibr" rid="B151">2009</xref>), and multi-agent reinforcement learning (Busoniu et al., <xref ref-type="bibr" rid="B33">2008</xref>). While these topics are all important, our framework solely focuses on a single agent in a single MDP optimization task. However, note that many of these topics are complementary to our framework (i.e., they could further extend it). For example, we may discover higher-level actions (hierarchical RL) to define a new, more abstract MDP, in which all of the principles of our framework are again applicable.</p>
<p>To summarize, this section covered some important research directions within planning and reinforcement learning. Our treatment was of course superficial, and by no means covered all relevant literature from both fields. Nevertheless, it does provide common ground on the type of literature we consider for our framework. In the next section, we will try to organize the ideas from both fields into a single framework.</p>
</sec>
</sec>
<sec id="s5">
<title>5. Framework</title>
<p>We will now introduce the Framework for Reinforcement Learning and Planning (FRAP). Pseudocode for the framework is provided in Algorithm 1, while all individual dimensions are summarized in <xref ref-type="table" rid="T2">Table 2</xref>. We will first cover the high-level intuition of the framework, as visualized in <xref ref-type="fig" rid="F2">Figure 2</xref>. FRAP centers around the notion of <italic>root states</italic> and <italic>trials</italic>.</p>
<disp-quote><p><italic>A root state is a state for which we attempt to improve the solution estimate</italic>.</p></disp-quote>
<disp-quote><p><italic>A trial is a sequence of forward actions and next states from a root state, which is used to compute an estimate of the cumulative reward from the root state</italic>.</p></disp-quote>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Overview of dimensions in the Framework for Reinforcement learning and Planning (FRAP).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Dimension</bold></th>
<th valign="top" align="left"><bold>Consideration</bold></th>
<th valign="top" align="left"><bold>Choices</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">1. Solution (Section 5.1)</td>
<td valign="top" align="left">- Coverage</td>
<td valign="top" align="left">Global, local</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Type</td>
<td valign="top" align="left">(Goal-conditioned) value, (goal-conditioned) policy, counts,&#x02026;</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Method</td>
<td valign="top" align="left">Param. tabular, param. approximate, non/semi-parametric</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Initialization</td>
<td valign="top" align="left">Uniform, random, optimistic, expert</td>
</tr>
<tr>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">2. Set root state (Section 5.2)</td>
<td valign="top" align="left">- Selection</td>
<td valign="top" align="left">Ordered, initial state, forward sampling, backward sampling, previously visited</td>
</tr>
<tr>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">3. Budget per root (Section 5.3)</td>
<td valign="top" align="left">- Number of trials (width)</td>
<td valign="top" align="left">1, <italic>n</italic>, convergence, &#x0221E;</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Depth per trial (<italic>d</italic><sub>max</sub>)</td>
<td valign="top" align="left">1, <italic>n</italic>, adaptive, &#x0221E;</td>
</tr>
<tr>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">4. Selection in trial<break/> (Section 5.4)</td>
<td valign="top" align="left">- Next action</td>
<td valign="top" align="left">Ordered, greedy (with heuristic), value-based perturbation (random, means, uncertainty), state-based perturbation (knowledge-based IM, competence-based IM)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Next state</td>
<td valign="top" align="left">Sample, ordered</td>
</tr>
<tr>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">5. Bootstrap (Section 5.5)</td>
<td valign="top" align="left">- Location</td>
<td valign="top" align="left">State, state-action</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Type</td>
<td valign="top" align="left">Learned, heuristic</td>
</tr>
<tr>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">6. Back-up (Section 5.6)</td>
<td valign="top" align="left">- Back-up policy</td>
<td valign="top" align="left">Behavioral policy, greedy/max, other policy...</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Policy expectation</td>
<td valign="top" align="left">Sample/partial, expected/full</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Dynamics expectation</td>
<td valign="top" align="left">Sample/partial, expected/full</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Additional characteristics</td>
<td valign="top" align="left">Explored states, convergence label, counts, uncertainty, return distribution</td>
</tr>
<tr>
<td/>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">7. Update (Section 5.7)</td>
<td valign="top" align="left">- Loss/objective</td>
<td valign="top" align="left">Squared loss, policy gradient, value gradient, cross-entropy, etc.</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Learning rate</td>
<td valign="top" align="left">Step (&#x003B7; fixed), Replace (&#x003B7; &#x0003D; 1.0 on table), Average (&#x003B7; &#x0003D; 1/<italic>n</italic> on table), Eligibility (&#x003B7; &#x0003D; (1&#x02212;&#x003BB;)&#x000B7;&#x003BB;<sup>(<italic>d</italic>&#x02212;1)</sup>), Adaptive (trust region), etc.</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Examples for several algorithms are shown in <xref ref-type="table" rid="T7">Table 7</xref>. IM, Intrinsic Motivation</italic>.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Graphical illustration of the framework (Algorithm 1). Left: Algorithm outer loop (Algorithm 1, line 4), illustrating the interplay of global and local solutions with trials. After possibly initializing a global solution, we repeatedly fix a new root state for which we want to improve our solution. Then, we initialize a new local solution for the particular root, and make one or multiple trials (trial budget), where each trial updates the local solution. After the budget is exhausted, we may use the local solution to update the global solution and/or set the next root state and/or reuse information for the next local solution. The process then repeats with setting a new root, possibly based on the global and/or local solution. Right: Algorithm inner loop (Algorithm 1, line 5), illustrating an individual trial. A trial starts from a root node, from which we repeatedly select actions, query the MDP at the specific state-action pair, and then transition to a next state. We repeat this process <italic>d</italic><sub>max</sub> times, after which we start the back-up phase, consisting of <italic>d</italic><sub>max</sub> back-ups. When budget is available, we start another trial from the same root node.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0002.tif"/>
</fig>
<p>The central idea of FRAP is that all planning and reinforcement learning algorithms repeatedly 1) fix root states, 2) make trials from these root states, 3) improve their solution based on the outcome of these trials, and 4) use this improved solution to better direct new trials and better set new root states. FRAP therefore consists of an <italic>outer loop</italic> (the while loop on Algorithm 1, line 4), in which we repeatedly set new root states, and an <italic>inner loop</italic> (the while loop on Algorithm 1, line 5), in which we (repeatedly) make trials from the current root state to update our solution. We will briefly discuss both loops.</p>
<p>A schematic illustration of the outer loop is shown on the left side of <xref ref-type="fig" rid="F2">Figure 2</xref>. The algorithm starts by potentially initializing a global solution (for all states), and subsequently fixing a new root state. Then, we initialize a local solution for the particular root, and start making trials from the root, which each update the local solution. When we run out of trial budget for this root, we may use the local solution to update the global solution (when used). Afterwards, we fix a next root state, and initialize a new local solution, in which we may reuse information from the last local solution (Algorithm 1, line 9). The outer loop then repeats for the new root state.</p>
<p>The inner loop of FRAP consists of trials, and is schematically visualized on the right of <xref ref-type="fig" rid="F2">Figure 2</xref>. A trial starts from the root node, and consists of a forward sequence of actions and resulting next states and rewards, which are obtained from <italic>queries</italic> to the MDP dynamics. This process repeats <italic>d</italic><sub>max</sub> times, where the specification of <italic>d</italic><sub>max</sub> depends on the local solution and differs between algorithms. The forward phase of the trial then halts, after which we possibly <italic>bootstrap</italic> to estimate the remaining expected return from the leaf state, without further unfolding the trial. Then, the trial proceeds with a sequence of <italic>one-step back-ups</italic>, which process the acquired information from the forward phase. We repeat the trial process until we run out of budget, after which we fix a new root state (Algorithm 1, line 8).</p>
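<p>To make the inner loop concrete, a single trial can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the paper's Algorithm 1: the interfaces (select_action, query_mdp, bootstrap) are assumptions, and the back-up rule is a simple discounted accumulation.</p>

```python
# Minimal sketch of one FRAP trial (an illustrative reading of the inner loop,
# not the paper's Algorithm 1): a forward phase of d_max queries to the MDP,
# a bootstrap at the leaf, and d_max one-step back-ups in reverse order.

def trial(root, select_action, query_mdp, bootstrap, d_max, gamma=1.0):
    # Forward phase: repeatedly select an action and query the MDP dynamics.
    s, path = root, []
    for _ in range(d_max):
        a = select_action(s)
        r, s_next = query_mdp(s, a)
        path.append((s, a, r))
        s = s_next
    # Bootstrap: estimate the remaining return at the leaf without unrolling.
    estimate = bootstrap(s)
    # Back-up phase: d_max one-step back-ups, from the leaf toward the root.
    backups = {}
    for (s, a, r) in reversed(path):
        estimate = r + gamma * estimate
        backups[(s, a)] = estimate  # new back-up estimate for the pair (s, a)
    return backups

# Toy usage: a deterministic chain where moving right always yields reward 1.
backups = trial(root=0,
                select_action=lambda s: 1,
                query_mdp=lambda s, a: (1.0, s + a),
                bootstrap=lambda s: 0.0,
                d_max=3)
# The root pair (0, 1) accumulates all three rewards: backups[(0, 1)] == 3.0.
```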
<p>Action selection in FRAP not only happens within the trial (Algorithm 1, line 16), but is in many algorithms also part of next root selection (Algorithm 1, line 8). It is important to mention that in the case of model-free RL, where we have irreversible access to the MDP dynamics, these two action selection moments are actually equal by definition. For example, a model-free RL agent may fix a root, sample a trial from this root, and use it to update the global solution. However, because the environment is irreversible, the next root selection has to use the same action and resulting next state as was taken within the trial. Model-free RL agents therefore have some specific restrictions in the FRAP pseudocode, as illustrated on the blue lines of Algorithm 1 (the trial budget per root is for example also by definition equal to one).</p>
<p>FRAP is therefore really a conceptual framework, and practical implementations may differ from the pseudocode in Algorithm 1. For example, many planning methods store an explicit frontier, i.e., the set of nodes that are candidates for expansion. Practical implementations would directly jump to the frontier, and not first traverse the known part of the tree from the root, as happens in each trial of Algorithm 1. However, it is conceptually useful to still think of these forward steps, since they will be part of the back-up phase (we are eventually looking for a good decision at the root). Another example would be a model-free RL agent that uses a Monte Carlo return estimate. Practical implementations may sample a full episode, compute the cumulative reward starting from each state in the episode, and jointly update the solution for all these states. However, conceptually every state in the episode has then been a root state once, for which we compute an estimate. In FRAP, we would therefore see this as sampling the actual episode only once from the first root, storing it in the local solution, and then repeatedly setting new roots along the states in the episode, while reusing the local solution from the last root (Algorithm 1, line 9). In summary, all algorithms conceptually fit FRAP, since they all fix root states for which they compute improved estimates of the cumulative return and solution, but some algorithms may take implementation shortcuts.</p>
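<p>The Monte Carlo implementation shortcut described above can be sketched in a few lines. This is an illustrative sketch (the episode's rewards and discount are assumed inputs): one backward pass computes the cumulative return from every state in a sampled episode, so that each timestep has conceptually served as a root state once.</p>

```python
# Sketch of the Monte Carlo implementation shortcut: sample one episode, then
# obtain the return G_t from every timestep t with one backward accumulation.

def mc_returns(rewards, gamma=1.0):
    """Discounted return G_t for every timestep t of one sampled episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]

# mc_returns([1.0, 0.0, 2.0], gamma=0.5) == [1.5, 1.0, 2.0]
```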
<p>We are now ready to discuss the individual dimensions of the framework, i.e., describe the possible choices on each of the lines in Algorithm 1. These dimensions are: how to <italic>represent</italic> the solution, how to <italic>set the next root state</italic>, which <italic>trial budget</italic> to allocate per root state, how to <italic>select</italic> actions and next states within a trial, how to <italic>back-up</italic> information obtained from the trial, and how to <italic>update</italic> the local and global solution based on these back-up estimates. The considerations of FRAP are summarized in <xref ref-type="table" rid="T2">Table 2</xref>, while the comments on the right side of Algorithm 1 indicate to which lines each dimension is applicable.</p>
<sec>
<title>5.1. Solution Representation</title>
<p>We first of all have to decide how we will represent the solution to our problem. The top row of <xref ref-type="table" rid="T2">Table 2</xref> shows the four relevant considerations: the coverage of our solution, the type of function we will represent, the method we use to represent this function, and the way we initialize the chosen method. The first item distinguishes between <italic>local/partial</italic> (for a subset of states) and <italic>global</italic> (for all states) solutions, a topic which we already extensively discussed in Section 3.3. Note that FRAP <italic>always</italic> builds a local solution: even a single episode of a model-free RL algorithm is considered a local solution that estimates the value of states in the trace. A local solution therefore aggregates information from one or more trials, which may then itself be used to update a global solution (when we use one) (Algorithm 1, line 1).</p>
<p>For both local and global solutions we next need to decide what type of function to represent. The most common choices are to represent the solution as a <italic>value</italic> function <inline-formula><mml:math id="M27"><mml:mi>V</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0211D;</mml:mi></mml:math></inline-formula>, <italic>state-action value</italic> function <inline-formula><mml:math id="M28"><mml:mi>Q</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>&#x0211D;</mml:mi></mml:math></inline-formula>, or <italic>policy</italic> function <inline-formula><mml:math id="M29"><mml:mi>&#x003C0;</mml:mi><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Some algorithms combine value and policy solutions, better known as <italic>actor-critic</italic> algorithms (Konda and Tsitsiklis, <xref ref-type="bibr" rid="B80">1999</xref>). We may also store the <italic>uncertainty</italic> around value estimates (Osband et al., <xref ref-type="bibr" rid="B111">2016</xref>; Moerland et al., <xref ref-type="bibr" rid="B100">2017</xref>), for example using <italic>counts</italic> (Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>), or through convergence labels that mark a particular value estimate as solved (Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>; Bonet and Geffner, <xref ref-type="bibr" rid="B26">2003b</xref>). 
Some methods also store the entire distribution of returns (Bellemare et al., <xref ref-type="bibr" rid="B14">2017</xref>; Moerland et al., <xref ref-type="bibr" rid="B101">2018</xref>), or condition their solution on a particular goal (Schaul et al., <xref ref-type="bibr" rid="B130">2015</xref>) (i.e., store a solution for multiple reward functions).</p>
<p>After deciding on the type of function to represent, we next need to specify the representation method. This is actually a supervised learning question, which we can largely break up into <italic>parametric</italic> and <italic>non-parametric</italic> approaches. <italic>Parametric tabular</italic> representations use a unique parameter for the solution at each state-action pair, which is for example used in the local solution of a graph search, or in the global solution of a tabular RL algorithm. For high-dimensional problems, we typically need to use <italic>parametric approximate</italic> representations, such as (deep) neural networks (Rumelhart et al., <xref ref-type="bibr" rid="B123">1986</xref>; Goodfellow et al., <xref ref-type="bibr" rid="B54">2016</xref>). Apart from their reduced memory requirements, a major benefit of approximate representations is their ability to <italic>generalize</italic> over the input space, and thereby make predictions for state-actions that have not been observed yet. However, the individual predictions of approximate methods may contain errors, and there are indications that the combination of tabular and approximate representations may provide the best of both worlds (Silver et al., <xref ref-type="bibr" rid="B139">2017</xref>; Wang et al., <xref ref-type="bibr" rid="B160">2019</xref>; Moerland et al., <xref ref-type="bibr" rid="B103">2020b</xref>). 
Alternatively, we may also store the solution in a <italic>non-parametric</italic> way, where we simply store exact sampled traces (e.g., a search tree that does not aggregate over different traces), or in a <italic>semi-parametric</italic> way (Graves et al., <xref ref-type="bibr" rid="B55">2016</xref>), where we may optimize a neural network to write to and read from a table (Blundell et al., <xref ref-type="bibr" rid="B22">2016</xref>; Pritzel et al., <xref ref-type="bibr" rid="B121">2017</xref>), sometimes referred to as <italic>episodic memory</italic> (Gershman and Daw, <xref ref-type="bibr" rid="B53">2017</xref>).</p>
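<p>The contrast between parametric tabular and parametric approximate representations can be sketched as follows. This is a minimal illustration with assumed class names and a linear approximator standing in for a neural network: the tabular form stores one parameter per state, while the approximate form shares parameters over features and therefore generalizes to states it has never seen.</p>

```python
# Sketch of the two parametric representation methods: a tabular value function
# with one parameter per state, and a linear approximation over state features.
# Class names, learning rates, and the toy inputs are assumptions.

class TabularV:
    """One parameter per state; no generalization across states."""
    def __init__(self, init=0.0):
        self.table, self.init = {}, init
    def __call__(self, s):
        return self.table.get(s, self.init)
    def update(self, s, target, lr=0.1):
        self.table[s] = self(s) + lr * (target - self(s))

class LinearV:
    """Parameters shared over features; generalizes to unseen states."""
    def __init__(self, n_features):
        self.w = [0.0] * n_features
    def __call__(self, features):
        return sum(wi * xi for wi, xi in zip(self.w, features))
    def update(self, features, target, lr=0.1):
        error = target - self(features)
        self.w = [wi + lr * error * xi for wi, xi in zip(self.w, features)]

tab = TabularV()
tab.update("s0", target=1.0)           # moves V(s0) from 0.0 toward 1.0
lin = LinearV(n_features=2)
lin.update([1.0, 0.0], target=1.0)     # the unseen input [2.0, 0.0] now
                                       # also receives a non-zero prediction
```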
<p>Finally, we also need to initialize our solution representation. Tabular representations are often <italic>uniformly</italic> initialized, for example setting all initial estimates to 0. Approximate representations are often <italic>randomly</italic> initialized, which provides the tie breaking necessary for gradient-based updating. Some approaches use initialization to guide exploration, either through <italic>optimistic initialization</italic> (when a state has not been visited yet, we consider its value estimate to be high) (Bertsekas and Tsitsiklis, <xref ref-type="bibr" rid="B21">1996</xref>) or <italic>expert initialization</italic> (where we use imitation learning from (human) expert demonstrations to initialize the solution) (Hussein et al., <xref ref-type="bibr" rid="B69">2017</xref>). We will further discuss exploration methods in Section 5.4.</p>
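<p>Optimistic initialization can be sketched with a tabular value function whose default entry is a high value. This is an illustrative sketch: the upper bound V_MAX, the toy state, and the action names are assumptions.</p>

```python
# Sketch of optimistic initialization: unvisited state-action pairs default to
# a high value V_MAX (an assumed upper bound on the return), so that greedy
# selection is automatically drawn toward unexplored actions.

V_MAX = 10.0
q_table = {("s0", "left"): 1.0}        # one visited pair with a learned value

def q(s, a):
    return q_table.get((s, a), V_MAX)  # optimistic default for unseen pairs

best = max(["left", "right"], key=lambda a: q("s0", a))
# Greedy selection picks the unvisited action "right" (optimistic value 10.0).
```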
<p>An overview of our notation for the different local/global and tabular/approximate solution types is shown in <xref ref-type="table" rid="T3">Table 3</xref>. We will denote <italic>local</italic> estimates with superscript <bold>l</bold>, e.g., <italic>V</italic><sup><bold>l</bold></sup>(<italic>s</italic>) or <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>), and <italic>global</italic> solutions with superscript <bold>g</bold>, e.g., <italic>V</italic><sup><bold>g</bold></sup>(<italic>s</italic>), <italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>) or &#x003C0;<sup><bold>g</bold></sup>(<italic>a</italic>|<italic>s</italic>). In practice, only global solutions are learned in approximate form, which we indicate with a subscript &#x003B8; (for parameters &#x003B8;).</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Overview of notation.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>Back-up estimate</bold></th>
<th valign="top" align="left"><bold>Local solution</bold></th>
<th valign="top" align="left"><bold>Global solution</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left"><inline-formula><mml:math id="M30"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M31"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="left"><italic>V</italic><sup><bold>l</bold></sup>(<italic>s</italic>), <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>)</td>
<td valign="top" align="left"><italic>V</italic><sup><bold>g</bold></sup>(<italic>s</italic>), <italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>), &#x003C0;<sup><bold>g</bold></sup>(<italic>a</italic>|<italic>s</italic>)</td>
</tr>
<tr>
<td valign="top" align="left">Approximate</td>
<td valign="top" align="left">(-)</td>
<td valign="top" align="left">(-)</td>
<td valign="top" align="left"><inline-formula><mml:math id="M32"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M33"><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M34"><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Each trial provides new back-up estimates <inline-formula><mml:math id="M35"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M36"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> at the states and actions that appear in the trial. These estimates are aggregated in the local solution V<sup><bold>l</bold></sup>(s) and Q<sup><bold>l</bold></sup>(s, a) (i.e., the local solution can be influenced by multiple trials). The local solution may itself be used to update the global solution V<sup><bold>g</bold></sup>(s), Q<sup><bold>g</bold></sup>(s, a) and/or &#x003C0;<sup><bold>g</bold></sup>(a|s). 
When the global solution is stored in approximate form (which is often the case), we denote them by <inline-formula><mml:math id="M37"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M38"><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and/or <inline-formula><mml:math id="M39"><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (where &#x003B8; denotes the parameters of the approximation). Back-up estimates and local solutions are in practice never represented in approximate form</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>As you will notice, <xref ref-type="table" rid="T3">Table 3</xref> contains a separate entry for the <italic>back-up estimate</italic>, <inline-formula><mml:math id="M40"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> or <inline-formula><mml:math id="M41"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, which are formed during every trial. Especially researchers from a planning background may find this confusing, since in many algorithms the back-up estimate and local solution are actually the same. However, we should treat these as two different quantities, for two reasons. First of all, in some algorithms, like the roll-out phase of MCTS, we do make additional MDP queries (the trial continues) and back-ups, but the back-up estimate from the last part of the trial is never stored in the local solution (the local solution expands with only one new node per trial). Second, many algorithms use their local solution to <italic>aggregate</italic> cumulative reward estimates from different depths, which is for example used in eligibility traces (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>). For our conceptual framework, we therefore consider each cumulative reward estimate to be the result of a single trial, and the local solution may combine the estimates of multiple trials in various ways. We will discuss ways to aggregate back-up estimates into the local solution in Section 5.7.</p>
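<p>One concrete way a local solution can aggregate back-up estimates from different depths is the &#x003BB;-weighting of eligibility traces, where the depth-<italic>d</italic> estimate receives weight (1 &#x02212; &#x003BB;)&#x000B7;&#x003BB;<sup>(<italic>d</italic>&#x02212;1)</sup>. A minimal sketch, with assumed example estimates; placing the remaining weight mass on the deepest estimate is a common truncation convention and an assumption here.</p>

```python
# Sketch of lambda-weighted aggregation of n-step back-up estimates, as in
# eligibility traces: the estimate at depth d gets weight (1 - lam)*lam**(d-1),
# and the deepest estimate receives the remaining weight mass lam**(n - 1).

def lambda_return(estimates, lam):
    """Combine back-up estimates [G_1, ..., G_n] of depths 1..n into one."""
    n = len(estimates)
    weights = [(1 - lam) * lam ** (d - 1) for d in range(1, n)]
    weights.append(lam ** (n - 1))  # weights sum to 1 for any 0 <= lam <= 1
    return sum(w * g for w, g in zip(weights, estimates))

# lam = 0 recovers the one-step estimate, lam = 1 the deepest (Monte Carlo)
# estimate, and intermediate lam interpolates between the depths:
# lambda_return([1.0, 2.0, 4.0], 0.0) == 1.0
# lambda_return([1.0, 2.0, 4.0], 1.0) == 4.0
# lambda_return([1.0, 2.0, 4.0], 0.5) == 2.0   (weights 0.5, 0.25, 0.25)
```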
</sec>
<sec>
<title>5.2. Set a Root State</title>
<p>The next consideration in our framework is the selection of a root state (Algorithm 1, lines 2 and 8), for which we will attempt to improve our solution (by computing a new value estimate). The main considerations are listed in the second row of <xref ref-type="table" rid="T2">Table 2</xref>. A first approach is to select a state from the state space in an <italic>ordered</italic> way, for example by sweeping through all possible states (Howard, <xref ref-type="bibr" rid="B68">1960</xref>; Bellman, <xref ref-type="bibr" rid="B17">1966</xref>). A major downside of this approach is that many states in the state space are often not even reachable from the start state (<xref ref-type="fig" rid="F3">Figure 3</xref>), and we may spend much computational effort on states that will never be part of the practical solution.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Venn diagram of the total state space. Only a subset of the entire state space is <italic>reachable</italic> from the start state under <italic>any policy</italic>. An even smaller subset of the reachable set is eventually <italic>relevant</italic>, in the sense that its states are reachable from the start state under the <italic>optimal policy</italic>. Finally, the start states themselves are of course a subset of the relevant states. Figure extended from Sutton and Barto (<xref ref-type="bibr" rid="B148">2018</xref>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0003.tif"/>
</fig>
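<p>The ordered (sweeping) approach to root selection can be sketched as a value-iteration style pass over a toy problem. This is an illustrative sketch under assumed dynamics (a 4-state chain with a terminal reward), not a specific algorithm from the literature; note how a fixed forward sweep order propagates the reward information backward by only one state per sweep.</p>

```python
# Ordered root selection: back up every state once, in a fixed order, as in
# value iteration. The 4-state chain (terminal state 3) is an assumption.

def sweep(V, states, actions, step, gamma=0.9):
    """One ordered sweep over all states (in-place, Gauss-Seidel style)."""
    for s in states:
        V[s] = max(r + gamma * V[s2] for r, s2 in (step(s, a) for a in actions))
    return V

# Toy chain: states 0..3; moving right from state 2 reaches the terminal
# state 3 and yields reward 1. V[3] stays 0 because state 3 is never swept.
V = {s: 0.0 for s in range(4)}
step = lambda s, a: ((1.0 if s + a == 3 else 0.0), min(max(s + a, 0), 3))
for _ in range(3):
    sweep(V, [0, 1, 2], [-1, 1], step)
# After three forward sweeps: V[2] = 1.0, V[1] = 0.9, V[0] = 0.81.
```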
<p>When the MDP definition includes the notion of a <italic>start state distribution</italic>, this information may be utilized to improve our selection of root states, by only sampling root states on traces from the start. This ensures that new roots are always reachable, which may strongly reduce the number of states we will update in practice (illustrated in <xref ref-type="fig" rid="F3">Figure 3</xref>). In <xref ref-type="table" rid="T2">Table 2</xref>, we list this as the <italic>forward sampling</italic> approach to selecting new root states. Note that this generally also involves an action selection question (in which direction do we set the next root), which we will discuss in Section 5.4.</p>
<p>The next option is to select new root states in the reverse direction, i.e., through backward sampling (instead of forward sampling). This approach does require a <italic>backwards model</italic> <italic>p</italic>(<italic>s, a</italic>|<italic>s</italic>&#x02032;), which specifies the possible state-action pairs (<italic>s, a</italic>) that may lead to a next state <italic>s</italic>&#x02032;. The main idea is to set next root states at the possible precursor states of a state whose value has just changed considerably, better known as <italic>prioritized sweeping</italic> (Moore and Atkeson, <xref ref-type="bibr" rid="B104">1993</xref>). We thereby focus our update budget on regions of the state space that likely need updating, which may speed up convergence. Similar ideas have been studied in the planning community as <italic>backward search</italic> or <italic>regression search</italic> (Nilsson, <xref ref-type="bibr" rid="B110">1982</xref>; Bonet and Geffner, <xref ref-type="bibr" rid="B24">2001</xref>; Alc&#x000E1;zar et al., <xref ref-type="bibr" rid="B4">2013</xref>), which makes prioritized sweeping an interleaved form of forward and backward search.</p>
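<p>The prioritized sweeping idea can be sketched with a priority queue over candidate roots. This is a minimal sketch under assumptions: the backward model is given as a predecessors function, the toy chain dynamics are invented for illustration, and duplicate queue entries are simply tolerated.</p>

```python
import heapq

# Sketch of prioritized sweeping for root selection: after a back-up changes a
# state's value by a large amount, the state's predecessors (from an assumed
# backward model) are pushed onto a priority queue, and the next root is popped.

def prioritized_sweeping(backup, predecessors, start, max_updates, threshold=1e-3):
    """backup(s) performs one back-up at s and returns |value change|;
    predecessors(s) yields the states that can transition into s."""
    queue = [(-float("inf"), start)]   # max-heap via negated priorities
    visited_order = []
    while queue and len(visited_order) < max_updates:
        _, s = heapq.heappop(queue)
        visited_order.append(s)
        change = backup(s)
        if change > threshold:         # large change: predecessors need updating
            for s_prev in predecessors(s):
                heapq.heappush(queue, (-change, s_prev))
    return visited_order

# Toy chain 0 -> 1 -> 2 -> 3, where state 2 earns reward 1 for reaching the
# terminal state 3: the value change at state 2 propagates back through 1 to 0.
V = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
def backup(s, gamma=0.9):
    old = V[s]
    V[s] = (1.0 if s == 2 else 0.0) + gamma * V[s + 1]
    return abs(V[s] - old)
order = prioritized_sweeping(backup, lambda s: [s - 1] if s > 0 else [],
                             start=2, max_updates=5)
# order == [2, 1, 0]: the updates flow backward from the changed state.
```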
<p>Finally, we do not always need to select the next root state from the current trace. For example, we may track the set of <italic>previously visited states</italic>, and select our next root from this set. This approach, which is for example part of Dyna (Sutton, <xref ref-type="bibr" rid="B146">1990</xref>), gives greater freedom in the order of root states, while it still ensures that we only update reachable states. To summarize, we need to decide on a way to set root states, which may for example be done in an ordered way, through forward sampling, through backward sampling, or by selecting previously visited states (<xref ref-type="table" rid="T2">Table 2</xref>, second row).</p>
</sec>
<sec>
<title>5.3. Budget per Root</title>
<p>After we have fixed a root state (a state for which we will attempt to improve the solution), we need to decide on 1) the number of trials from the particular root (Algorithm 1, line 5), and 2) when a trial itself will end, i.e., the depth <italic>d</italic><sub>max</sub> of each forward trial (Algorithm 1, lines 13 &#x00026; 22). The possible choices for each of these two considerations are listed in the third row of <xref ref-type="table" rid="T2">Table 2</xref>. Note that since every trial consists of a single forward beam, the total number of trials is actually a good measure of the total width of the local solution (<bold>Figure 6</bold>). The joint space of both considerations is visualized in <xref ref-type="fig" rid="F4">Figure 4</xref>, which we will discuss below.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Possible combinations of width (trial budget) and depth (<italic>d</italic><sub>max</sub>) per trial from a root state. Practical algorithms reside somewhere left of the left dotted line, since full width combined with full depth (exhaustive search) is not feasible in larger problems. Figure extended from Sutton and Barto (<xref ref-type="bibr" rid="B148">2018</xref>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0004.tif"/>
</fig>
<p>Regarding the <italic>trial budget per root state</italic>, a first possible choice is to only run a single trial. This choice is characteristic of model-free RL algorithms (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>). Algorithms that have access to a model may also run multiple trials per root state. This budget can for example be specified as a fixed hyperparameter, as is often the choice in MCTS (Browne et al., <xref ref-type="bibr" rid="B30">2012</xref>). When we interact with a real-world environment, the trial budget may actually be enforced by the time until the next decision is required. In the planning community, this is referred to as <italic>decision time planning</italic> or <italic>online planning</italic>. In offline approaches, we may also provide an adaptive trial budget, for example until some convergence criterion is met (often in combination with an admissible heuristic, which may greatly reduce the number of trials required for convergence) (Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>; Hansen and Zilberstein, <xref ref-type="bibr" rid="B60">2001</xref>; Bonet and Geffner, <xref ref-type="bibr" rid="B26">2003b</xref>). Finally, we may also specify an infinite trial budget, i.e., we will repeat trials until all possible sequences (for the specified depth) have been expanded.</p>
<p>The second decision involves the <italic>depth</italic> of each individual trial. A first option is to use a trial depth of one, which is for example part of value/policy iteration (Bellman, <xref ref-type="bibr" rid="B17">1966</xref>) and temporal difference learning (Sutton, <xref ref-type="bibr" rid="B145">1988</xref>; Watkins and Dayan, <xref ref-type="bibr" rid="B161">1992</xref>; Rummery and Niranjan, <xref ref-type="bibr" rid="B124">1994</xref>). We may also specify a fixed multi-step depth, which is the case for <italic>n</italic>-step methods, or specify a full depth (&#x0221E;), in which case we unroll the trial until a terminal state is reached (in practice we often still limit the trial by a large depth). The latter is also known as a <italic>Monte Carlo roll-out</italic>, which is for example used in MCTS. Finally, many algorithms make use of an <italic>adaptive</italic> trial depth, which depends on the current local solution (i.e., note that <italic>d</italic><sub>max</sub>(<bold>l</bold>) depends on <bold>l</bold> in Algorithm 1, lines 13 and 22). For example, several (heuristic) planning algorithms terminate a trial once we reach a state or action that did not yet appear in our current local solution (Hart et al., <xref ref-type="bibr" rid="B62">1968</xref>; Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>). As another example, we may terminate a trial once it reaches a state in the explored set or makes a cycle to a duplicate state, which are also examples of an adaptive <italic>d</italic><sub>max</sub>(<bold>l</bold>). To summarize, the trial budget and depth of each trial are important considerations in all planning and RL algorithms.</p>
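<p>The effect of the trial depth on the resulting back-up estimate can be made concrete with a small sketch. The reward sequence, leaf bootstrap value, and discount factor below are assumed example inputs.</p>

```python
# Sketch: how the trial depth d_max shapes the back-up estimate. Depth 1 gives
# a one-step temporal-difference target, depth n an n-step target, and full
# depth (d_max=None) a Monte Carlo return with no bootstrap.

def n_step_target(rewards, v_leaf, d_max=None, gamma=0.9):
    """Back-up estimate for a trial truncated at depth d_max."""
    d = len(rewards) if d_max is None else min(d_max, len(rewards))
    ret = sum(gamma ** t * rewards[t] for t in range(d))
    if d < len(rewards):               # truncated: bootstrap at the leaf state
        ret += gamma ** d * v_leaf
    return ret

rewards, v_leaf = [1.0, 1.0, 1.0], 5.0
# d_max=1:    1 + 0.9 * 5         = 5.5   (one-step TD target)
# d_max=2:    1 + 0.9 + 0.81 * 5  = 5.95  (two-step target)
# d_max=None: 1 + 0.9 + 0.81      = 2.71  (Monte Carlo, terminal reached)
```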
</sec>
<sec>
<title>5.4. Selection Within a Trial</title>
<p>Once we have specified the trial budget and depth rules from a particular root state, we have to decide how to actually select the actions and states that will appear in each individual trial (they may unroll in different directions). In other words, we have specified the overall shape of all trials in <xref ref-type="fig" rid="F4">Figure 4</xref>, but not yet how this shape will actually be unfolded. We will first discuss <italic>action selection</italic>, which happens in Algorithm 1 line 16 and in many algorithms also at line 8, when we set the next root through forward sampling. Afterwards, we will discuss <italic>next state selection</italic>, which happens in line 26 of Algorithm 1. The considerations that we discuss for both topics are listed in the fourth row of <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<p><bold>Action selection</bold> The first approach to action selection is to pick actions in an <italic>ordered</italic> way, where we select actions <italic>independently</italic> of our interaction history with the MDP. Examples include uninformed search methods, such as iterative deepening. A downside of ordered action selection is that it may spend much time on states with lower value estimates, which typically makes it infeasible in larger problems. Most methods therefore try to prioritize actions in trials based on knowledge from previous trials. A first category of approaches prioritizes actions based on their (current) value estimate, which we will call <italic>value-based selection</italic>. The canonical example of value-based selection is <italic>greedy</italic> action selection, which repeatedly selects the action with the highest current value estimate. This is the dominant approach in the heuristic search literature (Hart et al., <xref ref-type="bibr" rid="B62">1968</xref>; Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>; Barto et al., <xref ref-type="bibr" rid="B10">1995</xref>; Hansen and Zilberstein, <xref ref-type="bibr" rid="B60">2001</xref>), where an <italic>admissible</italic> heuristic may guarantee that greedy action selection will find the optimal solution.</p>
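<p>A minimal sketch of these two selection styles (the helper functions and the list-based value store are illustrative assumptions, not part of any cited algorithm):</p>

```python
def ordered_action(trial_index, num_actions):
    """Ordered selection: cycle through the actions independently of
    any interaction history, as in uninformed search."""
    return trial_index % num_actions

def greedy_action(q_values):
    """Value-based (greedy) selection: the action with the highest
    current value estimate, ties broken by the lowest index."""
    return max(range(len(q_values)), key=lambda a: q_values[a])
```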
<p>Note that heuristic search algorithms in practice usually maintain a <italic>frontier</italic> (<xref ref-type="fig" rid="F5">Figure 5</xref>), and therefore do not actually need to greedily traverse the local solution toward the best leaf state. However, as Schulte and Keller (<xref ref-type="bibr" rid="B135">2014</xref>) also show, any ordering on the frontier can also be achieved by step-wise action selection from the root, and frontiers therefore conceptually fully fit into our framework (although the practical implementation may differ). The notion of frontiers is important, because algorithms that use a frontier often <italic>switch</italic> their action selection strategy once they reach the frontier. For example, a heuristic search algorithm may greedily select actions within the known part of the local solution, but at the frontier expand all possible actions, which is a form of ordered action selection. For some algorithms, we will therefore separately mention the action selection strategy <italic>before the frontier</italic> (BF) and <italic>after the frontier</italic> (AF).</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Frontier-based exploration in planning (left) and reinforcement learning (right, <italic>intrinsic motivation</italic>). Left: Frontier and explored set in a graph. Blue denotes the start state, red a final state, green denotes the explored set (states that have been visited and whose successors have been visited), orange denotes the frontier (states that have been visited but whose successors have not all been visited). Methods without a frontier and explored set (like random perturbation, which is used in most RL approaches) may sample many redundant trials that make loops in the left part of the problem, because they do not find the narrow passage. Right: In large problems, it may become infeasible to store the frontier and explored set in tabular form. Part of intrinsic motivation literature (Colas et al., <xref ref-type="bibr" rid="B37">2020</xref>) tracks <italic>global</italic> (sub)goal spaces (red line) in global, approximate form. We may for example sample new goals from this space based on novelty, and subsequently attempt to reach that goal through a goal-conditioned policy, effectively mimicking frontier-based exploration in approximate, global form.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0005.tif"/>
</fig>
<p>Without an admissible heuristic, greedy action selection is not guaranteed to find the optimal solution. Algorithms therefore usually introduce a form of <italic>exploration</italic>. A first option in this category is <italic>random perturbation</italic>, which is in the RL community usually referred to as &#x003F5;-greedy exploration (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>). Similar ideas have been extensively studied in the planning community (Valenzano et al., <xref ref-type="bibr" rid="B156">2014</xref>), for example in limited discrepancy search (Harvey and Ginsberg, <xref ref-type="bibr" rid="B63">1995</xref>), <italic>k</italic>-best-first-search (KBFS) (Felner et al., <xref ref-type="bibr" rid="B48">2003</xref>), and best-first width search (BFWS) (Lipovetzky and Geffner, <xref ref-type="bibr" rid="B92">2017</xref>). We may also make the selection probabilities proportional to the current mean estimates of each action, which is for discrete and continuous action spaces achieved by, for example, Boltzmann exploration (Cesa-Bianchi et al., <xref ref-type="bibr" rid="B35">2017</xref>) and entropy regularization (Peters et al., <xref ref-type="bibr" rid="B116">2010</xref>), respectively.</p>
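<p>Both perturbation styles admit a short sketch; the function names and the temperature parameter below are illustrative assumptions rather than part of any cited method:</p>

```python
import math
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Random perturbation: with probability epsilon select a uniformly
    random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann_probs(q_values, temperature=1.0):
    """Mean perturbation: selection probabilities proportional to
    exp(Q / temperature), i.e., a softmax over the value estimates."""
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]
```

<p>A higher temperature flattens the Boltzmann distribution toward uniform random selection, while a lower temperature approaches greedy selection.</p>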
<p>A downside of random perturbation methods is their inability to naturally transition from exploration to exploitation. A solution is to track the uncertainty of the value estimate of each action, i.e., <italic>uncertainty-based perturbation</italic>. Such approaches have been extensively studied in the multi-armed bandit literature (Slivkins, <xref ref-type="bibr" rid="B143">2019</xref>), and successful exploration methods from RL and planning (Kaelbling, <xref ref-type="bibr" rid="B70">1993</xref>; Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>; Hao et al., <xref ref-type="bibr" rid="B61">2019</xref>) are actually based on work from the bandit literature (Auer et al., <xref ref-type="bibr" rid="B8">2002</xref>). Note that uncertainty estimation in sequential problems, like the MDP formulation, is harder than in the multi-armed bandit setting, since we need to take the uncertainty in the value estimates of future states into account (Dearden et al., <xref ref-type="bibr" rid="B40">1998</xref>; Moerland et al., <xref ref-type="bibr" rid="B100">2017</xref>). As an alternative, we may also estimate uncertainty in a Bayesian way, and for example explore through Thompson sampling (Thompson, <xref ref-type="bibr" rid="B153">1933</xref>; Osband et al., <xref ref-type="bibr" rid="B111">2016</xref>). Note that <italic>optimistic initialization</italic> of the solution, already discussed in Section 5.1, also uses optimism in the face of uncertainty to guide exploration, although it does not track the true uncertainty in the value estimates.</p>
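<p>As a sketch of uncertainty-based perturbation, the following follows the UCB1 bandit rule (Auer et al., 2002) in tabular form; the list-based counters and the exploration constant are illustrative assumptions:</p>

```python
import math

def ucb_action(q_values, counts, total_visits, c=math.sqrt(2)):
    """Uncertainty-based perturbation in the style of UCB1: add an
    exploration bonus that shrinks as an action is tried more often.
    Untried actions receive an infinite bonus and are therefore
    always selected first."""
    def score(a):
        if counts[a] == 0:
            return float("inf")
        return q_values[a] + c * math.sqrt(math.log(total_visits) / counts[a])
    return max(range(len(q_values)), key=score)
```

<p>As the text notes, applying such bandit rules in sequential (MDP) settings is harder, because the value estimates themselves depend on uncertain future estimates.</p>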
<p>In contrast to value-based perturbation, we may also use <italic>state-based perturbation</italic>, where we inject exploration noise <italic>based on our interaction history with the MDP</italic> (i.e., independently of the extrinsic reward). As a classic example, a particular state might be interesting because it is novel, i.e., we have not visited it before in our current interaction history with the MDP. In the reinforcement learning literature, this approach is often referred to as <italic>intrinsic motivation</italic> (IM) (Chentanez et al., <xref ref-type="bibr" rid="B36">2005</xref>; Oudeyer et al., <xref ref-type="bibr" rid="B112">2007</xref>). We already encountered the same idea in the planning literature through the use of frontiers and explored sets, which essentially prevent the expansion of states that we have already visited. In the RL (intrinsic motivation) literature, we usually make a separation between <italic>knowledge-based</italic> intrinsic motivation, which marks states or actions as interesting because they provide new knowledge about the MDP, and <italic>competence-based</italic> intrinsic motivation, where we prioritize target states based on our <italic>ability</italic> to reach them. Examples of knowledge-based IM include intrinsic rewards for <italic>novelty</italic> (Brafman et al., <xref ref-type="bibr" rid="B29">2003</xref>; Bellemare et al., <xref ref-type="bibr" rid="B13">2016</xref>), recency (Sutton, <xref ref-type="bibr" rid="B146">1990</xref>), curiosity (Pathak et al., <xref ref-type="bibr" rid="B113">2017</xref>), surprise (Achiam and Sastry, <xref ref-type="bibr" rid="B1">2017</xref>), and model uncertainty (Houthooft et al., <xref ref-type="bibr" rid="B67">2016</xref>), while we may also provide intrinsic motivation for the <italic>content</italic> of a state, for example object saliency (Kulkarni et al., <xref ref-type="bibr" rid="B84">2016</xref>).
Competence-based IM may for example prioritize (goal) states of intermediate difficulty (which we manage to reach sometimes) (Florensa et al., <xref ref-type="bibr" rid="B49">2018</xref>), or states on which we are currently making learning progress (Lopes et al., <xref ref-type="bibr" rid="B93">2012</xref>; Baranes and Oudeyer, <xref ref-type="bibr" rid="B9">2013</xref>; Matiisen et al., <xref ref-type="bibr" rid="B95">2017</xref>).</p>
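<p>A tabular sketch of knowledge-based IM in the count-based spirit described above; the 1/&#x0221A;<italic>n</italic> decay schedule and the class interface are illustrative choices, not taken from a specific cited method:</p>

```python
from collections import defaultdict

class NoveltyBonus:
    """Count-based novelty bonus: the intrinsic reward for a state
    decays as that state is visited more often, so novel states are
    prioritized during exploration."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = defaultdict(int)

    def intrinsic_reward(self, state):
        self.counts[state] += 1
        return self.scale / self.counts[state] ** 0.5
```

<p>Such a bonus is typically added to the extrinsic reward before the back-up, which biases action selection in future trials toward rarely visited regions.</p>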
<p>As mentioned above, there is a clear connection between the use of frontiers in the planning literature and the use of intrinsic motivation in the reinforcement learning literature, which we illustrate in <xref ref-type="fig" rid="F5">Figure 5</xref>. On the one hand, the planning literature has many techniques to track and prioritize frontiers, but these tabular approaches do suffer in high-dimensional problems. On the other hand, in RL methods that do not track frontiers (but for example use random perturbation) many trials may not hit a new state at all (Ecoffet et al., <xref ref-type="bibr" rid="B44">2021</xref>). The intrinsic motivation literature has studied the use of <italic>global, approximate frontiers</italic> (i.e., global, approximate sets of interesting states to explore), typically referred to as intrinsically motivated goal exploration processes (IMGEP) (Colas et al., <xref ref-type="bibr" rid="B37">2020</xref>). A successful example algorithm in this class is Go-Explore (Ecoffet et al., <xref ref-type="bibr" rid="B44">2021</xref>), which achieved state-of-the-art performance on the sparse-reward benchmark task Montezuma&#x00027;s Revenge. However, IMGEP approaches have their own challenges, especially because it is hard to track convergence of approximate solutions: our approximate goal space may be inaccurate, or we may encounter a novel region but lose the ability to return to it after an update of our goal-conditioned policy. Tabular solutions from the planning literature do not suffer from these issues, and we conjecture that there is much potential in combining ideas from both research fields.</p>
<p>As mentioned in the beginning, action selection often also plays a role in Algorithm 1, line 8, when we select next root states through forward sampling from the previous root (as discussed in Section 5.2). In the planning literature, this is often referred to as the <italic>recommendation function</italic> (Keller and Helmert, <xref ref-type="bibr" rid="B77">2013</xref>) (which action do we recommend at the root after all trials and back-ups). When we want to maximize performance, action recommendation is often greedy. However, during offline learning, we may inject additional exploration into action selection at the root, for example by <italic>planning to explore</italic> (the trials in a learned model direct the agent toward interesting new root states in the true environment) (Sekar et al., <xref ref-type="bibr" rid="B136">2020</xref>). We will refer to this type of action selection as <italic>next root</italic> (NR) selection, and note that some algorithms therefore have three different action selection strategies: before the frontier (BF) within a trial, after the frontier (AF) within a trial, and to set the next root (NR) for new trials. An overview of the discussed action selection methods, with some characteristic examples, is provided in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Overview of action selection methodology within a trial.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Action selection method</bold></th>
<th valign="top" align="left"><bold>Characteristic examples</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Ordered</bold></td>
<td valign="top" align="left">Value iteration Bellman, <xref ref-type="bibr" rid="B17">1966</xref> Iterative deepening Korf, <xref ref-type="bibr" rid="B81">1985</xref></td>
</tr>
<tr>
<td valign="top" align="left"><bold>Value-based</bold></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;- Greedy (with heuristic)</td>
<td valign="top" align="left">AO<sup>&#x022C6;</sup> Nilsson, <xref ref-type="bibr" rid="B109">1971</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">RTDP Barto et al., <xref ref-type="bibr" rid="B10">1995</xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;- Random perturbation</td>
<td valign="top" align="left">&#x003F5;-greedy Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Gaussian noise Van Hasselt and Wiering, <xref ref-type="bibr" rid="B158">2007</xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;- Mean perturbation</td>
<td valign="top" align="left">Boltzmann Cesa-Bianchi et al., <xref ref-type="bibr" rid="B35">2017</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Entropy regularization Peters et al., <xref ref-type="bibr" rid="B116">2010</xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;- Uncertainty perturbation</td>
<td valign="top" align="left">Upper confidence bounds Kaelbling, <xref ref-type="bibr" rid="B70">1993</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Posterior sampling Thompson, <xref ref-type="bibr" rid="B153">1933</xref></td>
</tr>
<tr>
<td valign="top" align="left"><bold>State-based</bold></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;- Knowledge-based IM</td>
<td valign="top" align="left">Novelty Brafman et al., <xref ref-type="bibr" rid="B29">2003</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Surprise Achiam and Sastry, <xref ref-type="bibr" rid="B1">2017</xref></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;- Competence-based IM</td>
<td valign="top" align="left">Learning progress P&#x000E9;r&#x000E9; et al., <xref ref-type="bibr" rid="B115">2018</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Goal-reaching success Florensa et al., <xref ref-type="bibr" rid="B49">2018</xref></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>At the highest level, we may prioritize actions in an ordered way (independent of our interaction history with the MDP), in a value-based way (based on obtained rewards in our interaction history with the MDP), or in a state-based way (based on our interaction history with the MDP, but independent of the value). The table shows possible subcategories, and some characteristic examples in the right column</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p><bold>State selection</bold> After our extensive discussion of action selection methods within a trial, we also need to discuss <italic>next state selection</italic>, which happens at line 26 of Algorithm 1. The two possible options here are ordered and sample selection. <italic>Ordered</italic> next state selection is for example used in value and policy iteration, where we simply expand every possible next state of an action. This approach is only feasible when we have settable, descriptive access to the MDP dynamics (see Section 3.2), since we can then decide ourselves which next state we want to make our next MDP query from. The second option is to <italic>sample</italic> the next state, which is by definition the choice when we only have generative access to the MDP dynamics. However, sampled next state selection may even be beneficial when we do have descriptive access (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>).</p>
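<p>The two next state selection options can be sketched as follows; the dictionary representation of the transition distribution is an illustrative assumption:</p>

```python
import random

def next_states_ordered(transition_probs):
    """Ordered next state selection (requires descriptive access):
    expand every reachable successor, as in value/policy iteration."""
    return [s for s, p in transition_probs.items() if p > 0]

def next_state_sampled(transition_probs, rng=random):
    """Sampled next state selection (sufficient under generative
    access): draw a single successor from the transition distribution."""
    states, probs = zip(*transition_probs.items())
    return rng.choices(states, weights=probs, k=1)[0]
```

<p>As the text notes, sampling can pay off even with descriptive access: successors with small probability are rarely expanded, saving computation.</p>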
<p>To summarize this section on action and next state selection within a trial, <xref ref-type="fig" rid="F6">Figure 6</xref> illustrates some characteristic trial patterns. On the left of the figure we visualize a local solution consisting of a single trial with <italic>d</italic><sub>max</sub> &#x0003D; 2, which is for example used in two-step temporal difference (TD) learning (Sutton, <xref ref-type="bibr" rid="B145">1988</xref>). In the middle, we see a local solution consisting of four trials, each with a <italic>d</italic><sub>max</sub> of 1. Each action and next state is selected in an ordered way, which is for example used in value iteration (Bellman, <xref ref-type="bibr" rid="B17">1966</xref>). Finally, the right side of the figure shows a local solution consisting of three trials, one with <italic>d</italic><sub>max</sub> &#x0003D; 1 and two with <italic>d</italic><sub>max</sub> &#x0003D; 2, which could for example appear in Monte Carlo Tree Search (Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>). With the methodology described in this section, we can construct any other preferred local solution pattern. In the next section we will discuss what to do at the leaf states of these patterns, i.e., what to do when we reach the trial&#x00027;s <italic>d</italic><sub>max</sub>.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Example local solution patterns. <bold>(A)</bold> Local solution consisting of a single trial with depth 2. Total queries to the MDP = 2. Example: two-step temporal difference learning. <bold>(B)</bold> Local solution consisting of four trials with depth 1. Total queries to the MDP = 4. Example: value iteration. <bold>(C)</bold> Local solution consisting of three trials, one with depth 1 and two with depth 2. Total queries to the MDP = 4. Example: Monte Carlo Tree Search.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0006.tif"/>
</fig>
</sec>
<sec>
<title>5.5. Bootstrap</title>
<p>The main aim of trials is to provide a new/improved estimate of the value of each action at the root, i.e., the expected cumulative sum of rewards from this state-action (Equation 1). However, when we choose to end a trial before we can evaluate the entire sum, we may still obtain an estimate of the cumulative reward through <italic>bootstrapping</italic>. A bootstrap function is a function that provides a quick estimate of the value of a particular state or state-action. When we decide to end our trial at a state, we need to bootstrap a state value (Algorithm 1, line 14), and when we decide to end the trial at an action, we need to bootstrap a state-action value (Algorithm 1, line 23). A potential benefit of a state value function is that it has lower dimension and might be easier to learn/obtain, while a state-action value function has the benefit that it allows for off-policy back-ups (see Section 5.6) without additional queries to the MDP. Note that terminal states have a value of 0 by definition.</p>
<p>The bootstrap function itself may either be obtained from a <italic>heuristic function</italic>, or it can be learned. Heuristic functions have been studied extensively in the planning community. A heuristic is called <italic>admissible</italic> when it provides an <italic>optimistic</italic> estimate of the remaining value for every state, which allows for greedy action selection strategies during the search. Heuristics can be obtained from prior knowledge, but much research has focused on automatic ways to obtain heuristics, often by first solving a simplified version of the problem. When the problem is stochastic, a popular approach is <italic>determinization</italic>, where we first solve a deterministic version of the MDP to obtain a heuristic for the full planning task (Hoffmann and Nebel, <xref ref-type="bibr" rid="B66">2001</xref>; Yoon et al., <xref ref-type="bibr" rid="B166">2007</xref>), or <italic>delete relaxations</italic> (Bonet and Geffner, <xref ref-type="bibr" rid="B24">2001</xref>), where we temporarily ignore the action effects that remove state attributes (which is only applicable in symbolic state spaces). A heuristic is called &#x00027;blind&#x00027; when it is initialized to the same value everywhere. For an extensive discussion of ways to obtain heuristics we refer the reader to Pearl (<xref ref-type="bibr" rid="B114">1984</xref>) and Edelkamp and Schrodl (<xref ref-type="bibr" rid="B45">2011</xref>).</p>
<p>The alternative approach is to <italic>learn</italic> a global state or state-action value function. Note that this function can also serve as our solution representation (see Section 5.1). The learned value function can be trained on the root value estimates of previous trials (see Section 5.7), and thereby gradually improve its performance (Sutton, <xref ref-type="bibr" rid="B145">1988</xref>; Korf, <xref ref-type="bibr" rid="B82">1990</xref>). The major benefits of learned value functions are (1) their ability to improve performance with more data, and (2) their ability to <italic>generalize</italic> when learned in approximate form. For example, while Deep Blue (Campbell et al., <xref ref-type="bibr" rid="B34">2002</xref>), the first computer programme to defeat a human Chess world champion, used a heuristic bootstrap function, this approach was later outperformed by AlphaZero (Silver et al., <xref ref-type="bibr" rid="B137">2018</xref>), which uses a deep neural network to learn a bootstrap function that provides better generalization.</p>
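<p>A tabular sketch of learning such a bootstrap function from trial targets, in the spirit of temporal difference learning (Sutton, 1988); the dictionary representation, learning rate, and function name are illustrative assumptions:</p>

```python
def td_update(value, s, r, s_next, alpha=0.1, gamma=1.0):
    """One tabular temporal-difference update of a learned bootstrap
    function. `value` maps states to estimates; s_next=None marks a
    terminal transition (terminal states have value 0 by definition)."""
    bootstrap = 0.0 if s_next is None else value.get(s_next, 0.0)
    target = r + gamma * bootstrap
    value[s] = value.get(s, 0.0) + alpha * (target - value.get(s, 0.0))
    return value
```

<p>In approximate form the table would be replaced by a function approximator, such as the deep neural network used by AlphaZero, which is what provides the generalization mentioned above.</p>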
</sec>
<sec>
<title>5.6. Back-Up</title>
<p>Bootstrapping ends the forward phase of a trial, after which we start the back-up phase (<xref ref-type="fig" rid="F2">Figure 2</xref>, right). The goal of back-ups is to process the acquired information of the trial. We will primarily focus on the <italic>value back-up</italic>, where we construct new estimates <inline-formula><mml:math id="M42"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M43"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> for states and actions that appear in the trial. At the end of this section, we will also briefly comment on other types of information we may include in the back-up.</p>
<p>Value back-ups are based on the one-step Bellman equation, as shown in Equation 2. The first expectation of this equation, over the possible next states, shows the <italic>dynamics back-up</italic>: we need to aggregate value estimates for different possible next states into a state-action value estimate for the state-action pair that may lead to them. The second expectation, over the possible actions, shows the <italic>policy back-up</italic>: we want to aggregate state-action values into a value estimate at the particular state. We therefore need to discuss how to deal with width (expectations) over the policy and dynamics. In Algorithm 1, policy and dynamics back-ups happen at lines 18 and 28, while we will now discuss the relevant considerations for these back-ups, as listed in the sixth row of <xref ref-type="table" rid="T2">Table 2</xref>.</p>
<p>For the policy back-up, we first need to specify which back-up policy we will actually employ. A first option is to use the current behavioral policy (which we used for action selection within the trial) as the back-up policy, which is in RL literature usually referred to as <italic>on-policy</italic> back-ups. An alternative is to use another policy than the behavioral policy, which is referred to as <italic>off-policy</italic>. The most common off-policy back-up is the <italic>greedy</italic> or <italic>max</italic> back-up, which puts all probability on the action with the highest current value estimate. The greedy back-up is common in tabular solutions, but can be unstable when combined with a global approximate solution and bootstrapping (Van Hasselt et al., <xref ref-type="bibr" rid="B157">2018</xref>). Note that off-policy back-ups do not need to be greedy, and we may also use back-up policies that are more greedy than the exploration policy, but less greedy than the max operator (Coulom, <xref ref-type="bibr" rid="B39">2006</xref>; Keller, <xref ref-type="bibr" rid="B76">2015</xref>).</p>
<p>We next need to decide whether we will make a <italic>full</italic>/<italic>expected</italic> policy back-up, or a <italic>partial</italic>/<italic>sample</italic> policy back-up. Expected back-ups evaluate the full expectation over the policy probabilities, and therefore need to expand all child actions of a state. In contrast, sample back-ups only back up the value from a sampled action, and therefore do not need to expand all child actions (and are therefore called &#x0201C;partial&#x0201D;). Sample back-ups are less accurate but computationally cheaper, and will move toward the true value over multiple samples.</p>
<p>The same consideration actually applies to the back-up over the dynamics, which can also be a <italic>full</italic>/<italic>expected</italic> or a <italic>partial</italic>/<italic>sample</italic> back-up. Which type of dynamics back-up we can make also depends on the type of access we have to the MDP. When we only have generative access to the MDP, we are forced to make sample back-ups. In contrast, when we have descriptive access to the MDP, we can make either expected or sample back-ups. Although sample back-ups have higher variance, they are computationally cheaper and may be more efficient when many next states have a small probability (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>). We summarize the common back-up equations for policy and dynamics in <xref ref-type="table" rid="T5">Table 5</xref>, while <xref ref-type="fig" rid="F7">Figure 7</xref> visualizes common combinations of these as back-up diagrams.</p>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Equations for the policy and dynamics back-up, applicable to Algorithm 1 line 18 and 28, respectively.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>Equation</bold></th>
<th/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Policy</bold></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;Sample back-up</td>
<td valign="top" align="left"><inline-formula><mml:math id="M44"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>,</td>
<td valign="top" align="left">for <italic>a</italic>&#x0007E;&#x003C0;(&#x000B7;|<italic>s</italic>)</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;Expected back-up</td>
<td valign="top" align="left"><inline-formula><mml:math id="M45"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x0007E;</mml:mo><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;Greedy back-up</td>
<td valign="top" align="left"><inline-formula><mml:math id="M46"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo class="qopname">^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td/>
</tr>
<tr>
<td valign="top" align="left"><bold>Dynamics</bold></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;Sample back-up</td>
<td valign="top" align="left"><inline-formula><mml:math id="M47"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>,</td>
<td valign="top" align="left">for <italic>s</italic>&#x02032;&#x0007E;<italic>T</italic>(&#x000B7;|<italic>s, a</italic>)</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;Expected back-up</td>
<td valign="top" align="left"><inline-formula><mml:math id="M48"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:msub><mml:mrow><mml:mi>E</mml:mi></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x0007E;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td/>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Types of 1-step back-ups. For the back-up over the policy (columns), we need to decide on i) the type of policy (on-policy or off-policy) and ii) whether we do a full or partial back-up. For the back-up over the dynamics (rows), we also need to decide whether we do a full or partial back-up. Note that for the greedy/max back-up policy the expected and sample back-ups are equivalent. Mentioned algorithms: Value Iteration (Bellman, <xref ref-type="bibr" rid="B17">1966</xref>), Expected SARSA (Van Seijen et al., <xref ref-type="bibr" rid="B159">2009</xref>), SARSA (Rummery and Niranjan, <xref ref-type="bibr" rid="B124">1994</xref>), MCTS (Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>), Q-learning (Watkins and Dayan, <xref ref-type="bibr" rid="B161">1992</xref>), and AO<sup>&#x022C6;</sup> (Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0007.tif"/>
</fig>
<p>Many algorithms back up additional information to improve action selection in future trials. We may want to track the uncertainty in the value estimates, for example by backing up visitation counts (Browne et al., <xref ref-type="bibr" rid="B30">2012</xref>), by backing up entire uncertainty distributions around value estimates (Dearden et al., <xref ref-type="bibr" rid="B40">1998</xref>; Deisenroth and Rasmussen, <xref ref-type="bibr" rid="B41">2011</xref>), or by backing up the distribution of the return (Bellemare et al., <xref ref-type="bibr" rid="B14">2017</xref>). Some methods back up <italic>labels</italic> that mark a particular value estimate as &#x0201C;solved&#x0201D; once we are completely certain about its value (Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>; Bonet and Geffner, <xref ref-type="bibr" rid="B26">2003b</xref>). As mentioned before, graph searches also back up information about frontiers and explored sets, which can be seen as another kind of label, one that removes duplicates and marks expanded states. The overarching theme of all these additional back-ups is that they track some form of uncertainty about the value of a particular state, which can be exploited during action selection in future trials.</p>
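To illustrate, a minimal sketch (all variable and function names are our own, for illustration only) of backing up a visitation count alongside a running value estimate, as MCTS-style algorithms do:

```python
# Illustrative sketch: backing up visitation counts alongside value
# estimates along a trajectory, MCTS-style. Names are hypothetical.
from collections import defaultdict

class NodeStats:
    def __init__(self):
        self.n = 0      # visitation count (a simple uncertainty proxy)
        self.q = 0.0    # running mean of backed-up returns

def backup(stats, trajectory, value_estimate, gamma=1.0):
    """Propagate a return estimate back along (state, action, reward)
    triples, updating both value estimates and visit counts."""
    g = value_estimate
    for (s, a, r) in reversed(trajectory):
        g = r + gamma * g
        node = stats[(s, a)]
        node.n += 1
        node.q += (g - node.q) / node.n  # incremental average update

stats = defaultdict(NodeStats)
backup(stats, [("s0", "a0", 1.0), ("s1", "a1", 0.0)], value_estimate=2.0)
```

The visit count n can then inform exploration in later trials, e.g., through an upper-confidence bonus that shrinks as n grows.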
</sec>
<sec>
<title>5.7. Update</title>
<p>The last step of the framework involves updating the local solutions (<italic>V</italic><sup><bold>l</bold></sup>(<italic>s</italic>) and <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>)) based on the back-up estimates (<inline-formula><mml:math id="M49"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M50"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>), and subsequently updating the global solution (<italic>V</italic><sup><bold>g</bold></sup>(<italic>s</italic>) and/or <italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>) and/or &#x003C0;<sup><bold>g</bold></sup>(<italic>a</italic>|<italic>s</italic>)) based on the local solution. In Algorithm 1, the updates of the local solution happen in lines 19 and 29, while the update of the global solution (when used) occurs in line 7. The main message of this section is that we can write both types of updates, whether they concern updates of nodes in a planning tree or updates of a global policy network, as <italic>gradient descent</italic> updates on a particular <italic>loss function</italic>. We hope this provides further insight into the similarity between planning and learning, since planning updates on a tree/graph can usually be written as tabular learning updates with a particular learning rate.</p>
<p>We will first introduce our general notation. A loss function is denoted by <inline-formula><mml:math id="M51"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, where &#x003B8; denotes the parameters to be updated. In the case of a tabular solution, the parameters are simply the individual entries in the table (like <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>)) (see Section 5.1 and <xref ref-type="table" rid="T3">Table 3</xref> for a summary of notation), and we will therefore not explicitly add a subscript &#x003B8;. When we have specified a solution and a loss function, the parameters can be updated based on gradient descent, with update rule:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M333"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02212;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x000B7;</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mo>&#x02207;</mml:mo><mml:mi>&#x003B8;</mml:mi></mml:msub><mml:mi>&#x02112;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>,</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B7;&#x02208;&#x0211D;<sup>&#x0002B;</sup> is a learning rate. We will first show which loss functions and update rules are common when updating the local solution, and subsequently discuss how they reappear in updates of the global solution based on the local solution. An overview of common loss functions and update rules is provided in <xref ref-type="table" rid="T6">Table 6</xref>, which we will now discuss in more detail.</p>
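To make this concrete, a minimal sketch (function and variable names are ours) showing that the tabular replace and average updates are gradient-descent steps on the squared loss with particular learning rates, since for L = &#x000BD;(target &#x02212; Q)&#x000B2; the gradient in Q is &#x02212;(target &#x02212; Q):

```python
# Illustrative sketch: the tabular update Q <- Q + eta * (target - Q)
# is a gradient descent step on the squared loss 0.5 * (target - Q)^2.
def tabular_update(q, target, eta):
    return q + eta * (target - q)

# Replace update: eta = 1 recovers Q <- target.
assert tabular_update(0.0, target=5.0, eta=1.0) == 5.0

# Average update: eta = 1/n computes a running mean of the targets.
q, n = 0.0, 0
for target in [2.0, 4.0, 6.0]:
    n += 1
    q = tabular_update(q, target, eta=1.0 / n)
# q now equals the mean of the three targets, 4.0.
```

With a parametric global solution, the same residual (target &#x02212; Q) is instead multiplied by the gradient of Q with respect to &#x003B8;, as in the global updates of Table 6.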
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p> Overview of common loss functions and update rules.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="left"><bold>Loss</bold></th>
<th valign="top" align="left"><bold>Update</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>Local update</bold></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;<italic>Value</italic></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;Squared loss</td>
<td valign="top" align="left"><inline-formula><mml:math id="M53"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td valign="top" align="left"><inline-formula><mml:math id="M54"><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;&#x000A0;Replace update (&#x003B7; &#x0003D; 1)</td>
<td/>
<td valign="top" align="left"><inline-formula><mml:math id="M55"><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;&#x000A0;Average update (<inline-formula><mml:math id="M56"><mml:mi>&#x003B7;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>)</td>
<td/>
<td valign="top" align="left"><inline-formula><mml:math id="M57"><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left">&#x000A0;&#x000A0;&#x000A0;Eligibility update</td>
<td/>
<td valign="top" align="left"><inline-formula><mml:math id="M58"><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003BB;</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>d</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr> <tr>
<td valign="top" align="left"><bold>Global update</bold></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;<italic>Value</italic></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;Squared loss</td>
<td valign="top" align="left"><inline-formula><mml:math id="M59"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:math></inline-formula></td>
<td valign="top" align="left"><inline-formula><mml:math id="M60"><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;Cross-entropy softmax loss</td>
<td valign="top" align="left"><inline-formula><mml:math id="M61"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mstyle class="text"><mml:mtext class="texttt" mathvariant="monospace">softmax</mml:mtext></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">log</mml:mo><mml:mstyle class="text"><mml:mtext class="texttt" mathvariant="monospace">softmax</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="left"><inline-formula><mml:math id="M62"><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mstyle class="text"><mml:mtext class="texttt" mathvariant="monospace">softmax</mml:mtext></mml:mstyle><mml:msup><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mo>&#x000B7;</mml:mo><mml:mo class="qopname">log</mml:mo><mml:mstyle class="text"><mml:mtext class="texttt" mathvariant="monospace">softmax</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mstyle><mml:mtext>a</mml:mtext></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;<italic>Policy</italic></td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;Policy gradient</td>
<td valign="top" align="left"><inline-formula><mml:math id="M63"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:mo class="qopname">ln</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="left"><inline-formula><mml:math id="M64"><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;Determ. policy gradient</td>
<td valign="top" align="left"><inline-formula><mml:math id="M65"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> &#x000A0;(<inline-formula><mml:math id="M66"><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup></mml:math></inline-formula> trained on <italic>Q</italic><sup><bold>l</bold></sup>)</td>
<td valign="top" align="left"><inline-formula><mml:math id="M67"><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;Value gradient</td>
<td valign="top" align="left"><inline-formula><mml:math id="M68"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="left"><inline-formula><mml:math id="M69"><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> (<xref ref-type="fig" rid="F8">Figure 8</xref>)</td>
</tr>
<tr>
<td valign="top" align="left">&#x000A0;&#x000A0;Cross-entropy loss</td>
<td valign="top" align="left"><inline-formula><mml:math id="M70"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">L</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B8;</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mo class="qopname">ln</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:math></inline-formula></td>
<td valign="top" align="left"><inline-formula><mml:math id="M71"><mml:mi>&#x003B8;</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>&#x003B8;</mml:mi><mml:mo>-</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow></mml:munder><mml:msup><mml:mrow><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x000B7;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow></mml:msub><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Top: Local update, where we use back-up values <inline-formula><mml:math id="M72"><mml:mover accent="true"><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and/or <inline-formula><mml:math id="M73"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> to update the local solution V<sup><bold>l</bold></sup>(s) and/or Q<sup><bold>l</bold></sup>(s, a). The special cases of replace update and average update are explicitly shown. Bottom: Global update, where we use the local solution estimates V<sup><bold>l</bold></sup>(s) and/or Q<sup><bold>l</bold></sup>(s, a) to update global (approximate) solutions <inline-formula><mml:math id="M74"><mml:msubsup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, <inline-formula><mml:math id="M75"><mml:msubsup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and/or <inline-formula><mml:math 
id="M76"><mml:msubsup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003B8;</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>g</mml:mtext></mml:mstyle></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. Parameters of the global solution are denoted by &#x003B8; (when the global value solution is tabular each &#x003B8; in the table can be read as Q<sup><bold>g</bold></sup>(s, a)). Note that the table illustrates some characteristic examples, but other losses and update rules are possible. <inline-formula><mml:math id="M77"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>d</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> denotes an estimate from a trial of depth d</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p><bold>Local solution update</bold> Here we will focus on the update of state-action values <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>) (Algorithm 1, line 29), but the same principles apply to state value updating (Algorithm 1, line 19). We therefore want to specify an update of <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>) based on a new back-up value <inline-formula><mml:math id="M78"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. A classic choice of loss function for continuous values is the <italic>squared loss</italic>, given by:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M599"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi>&#x02112;</mml:mi><mml:mo stretchy='false'>(</mml:mo><mml:msup><mml:mi>Q</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy="false">&#x0007C;</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>=</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac><mml:msup><mml:mrow><mml:mo stretchy='false'>[</mml:mo><mml:mover accent='true'><mml:mi>Q</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msup><mml:mi>Q</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Differentiating this loss with respect to <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>) and plugging it into Equation (4) (where <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>) are the parameters) gives the well-known <italic>tabular learning rule</italic>:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M601"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:msup><mml:mi>Q</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02190;</mml:mo><mml:msup><mml:mi>Q</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>+</mml:mo><mml:mi>&#x003B7;</mml:mi><mml:mo>&#x000B7;</mml:mo><mml:mo stretchy='false'>[</mml:mo><mml:mover accent='true'><mml:mi>Q</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo>&#x02212;</mml:mo><mml:msup><mml:mi>Q</mml:mi><mml:mstyle mathvariant='bold' mathsize='normal'><mml:mi>l</mml:mi></mml:mstyle></mml:msup><mml:mo stretchy='false'>(</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy='false'>)</mml:mo><mml:mo stretchy='false'>]</mml:mo><mml:mo>.</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Intuitively, we move our estimate <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>) a bit in the direction of our new back-up value <inline-formula><mml:math id="M81"><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>. In the tabular case, &#x003B7; is therefore restricted to [0, 1]. Most planning algorithms use special cases of the above update rule. A first common choice is to set &#x003B7; &#x0003D; 1.0, which gives the <italic>replace update</italic>:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M622"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This update completely overwrites the estimate in the local solution with the new back-up value. This is the typical approach in heuristic planning (Hart et al., <xref ref-type="bibr" rid="B62">1968</xref>; Nilsson, <xref ref-type="bibr" rid="B109">1971</xref>; Hansen and Zilberstein, <xref ref-type="bibr" rid="B60">2001</xref>), where an admissible heuristic often ensures that the new estimate (obtained from a deeper unfolding of the planning tree) is better informed than the previous one. Although one would typically not think of such a replace update as a gradient-based approach, these updates are in fact all connected.</p>
<p>When we do not have a good heuristic available (and we therefore need to bootstrap from a learned value function or use deep roll-outs to estimate the cumulative reward), estimates of different depths may have different reliability (known as the <italic>bias-variance trade-off</italic> ) (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>). We may for example equally weight the contribution of estimates of different depths, which we will call an <italic>averaging update</italic> (which uses <inline-formula><mml:math id="M85"><mml:mi>&#x003B7;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>, where <italic>n</italic> denotes the number of trials/back-up estimates for the node):</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M644"><mml:mtable class="eqnarray" columnalign="right center left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac><mml:mo>&#x000B7;</mml:mo><mml:mrow><mml:mo>[</mml:mo><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mstyle><mml:mtext>l</mml:mtext></mml:mstyle></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>]</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>This is for example used in MCTS implementations that use bootstrapping instead of rollouts (Silver et al., <xref ref-type="bibr" rid="B137">2018</xref>).</p>
<p>While the above update gives the value estimate from each trial equal weight, we may also make the contribution of a trial estimate dependent on the depth of the trial, as is for example done in <italic>eligibility traces</italic> (Schulman et al., <xref ref-type="bibr" rid="B133">2016</xref>; Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>). In this case, we essentially set &#x003B7; &#x0003D; (1&#x02212;&#x003BB;)&#x000B7;&#x003BB;<sup>(<italic>d</italic>&#x02212;1)</sup>, where &#x003BB;&#x02208;[0, 1] is the exponential decay and <italic>d</italic> is the length of the trace on which we update. More sophisticated reweighting schemes of the targets of different trials are possible as well (Munos et al., <xref ref-type="bibr" rid="B108">2016</xref>), for example based on the <italic>uncertainty</italic> of the estimate at each depth (Buckman et al., <xref ref-type="bibr" rid="B32">2018</xref>). In short, the local solution may combine value estimates from different trials (with different depths) in numerous ways, as summarized in the top part of <xref ref-type="table" rid="T6">Table 6</xref>.</p>
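<p>As a concrete (hypothetical) illustration, the local update rules in Equations (6)&#x02013;(8) can be sketched in a few lines of code; the function and variable names below are our own and do not come from any released implementation:</p>

```python
def local_update(q, target, eta):
    """Generic tabular update (Eq. 6): move q toward the back-up target."""
    return q + eta * (target - q)

# Replace update (Eq. 7): eta = 1.0 fully overwrites the old estimate.
q = local_update(5.0, 8.0, eta=1.0)           # q becomes 8.0

# Averaging update (Eq. 8): eta = 1/n maintains a running mean over n back-ups.
q, n = 0.0, 0
for backup in [4.0, 6.0, 8.0]:
    n += 1
    q = local_update(q, backup, eta=1.0 / n)  # q ends at the mean, 6.0

# Eligibility-style weighting: eta = (1 - lam) * lam**(d - 1) decays with trace depth d.
lam = 0.9
etas = [(1 - lam) * lam ** (d - 1) for d in range(1, 4)]
```

<p>Note that the replace and averaging updates are recovered purely by the choice of step size &#x003B7;, which is the sense in which all of these update rules are connected.</p>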
<p><bold>Global solution update</bold> When our algorithm uses a global solution, we next need to update this global solution (<italic>V</italic><sup><bold>g</bold></sup> and/or <italic>Q</italic><sup><bold>g</bold></sup> and/or &#x003C0;<sup><bold>g</bold></sup>) based on the estimates from our local solution (<italic>V</italic><sup><bold>l</bold></sup> and/or <italic>Q</italic><sup><bold>l</bold></sup>) (Algorithm 1, line 7). For a value-based solution that is <italic>tabular</italic>, we typically use the same squared loss (Equation 5), which leads to the global tabular update rule <italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>)&#x02190;<italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>)&#x0002B;&#x003B7;&#x000B7;[<italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>)&#x02212;<italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>)], which exactly resembles the local version (Equation 6), apart from the fact that we now update <italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>), while <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>) has the role of target. This approach is the basis of all tabular RL methods (Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>). [For (model-free) RL approaches that directly update the global solution after a single trial, we may also imagine the local solution does not exist, and we directly update the global solution from the back-up estimates].</p>
<p>We will therefore primarily focus on the function approximation setting, where we update a global approximate representation parametrized by &#x003B8;. <xref ref-type="table" rid="T6">Table 6</xref> shows some example loss functions and update rules that appear in this case. The most important point to note is that there are many ways in which we may combine a local estimate, such as <italic>Q</italic><sup><bold>l</bold></sup>(<italic>s, a</italic>), and the global solution, such as <italic>Q</italic><sup><bold>g</bold></sup>(<italic>s, a</italic>) or &#x003C0;<sup><bold>g</bold></sup>(<italic>a</italic>|<italic>s</italic>), in a loss function. For value-based updating, we may use the squared loss, but other options are possible as well, like a cross-entropy loss over the softmax of the Q-values returned from planning (the local solution) and the softmax of the Q-values from a global neural network approximation (Hamrick et al., <xref ref-type="bibr" rid="B58">2020a</xref>). For policy-based updating, well-known examples include the <italic>policy gradient</italic> (Williams, <xref ref-type="bibr" rid="B164">1992</xref>; Sutton et al., <xref ref-type="bibr" rid="B149">2000</xref>; Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref>) and <italic>deterministic policy gradient</italic> (Silver et al., <xref ref-type="bibr" rid="B138">2014</xref>; Lillicrap et al., <xref ref-type="bibr" rid="B91">2015</xref>) loss functions. Again, other options have been successful as well, such as a cross-entropy loss between the normalized visitations counts at the root of an MCTS (part of the local solution) and a global policy network, as for example used by AlphaZero (Silver et al., <xref ref-type="bibr" rid="B139">2017</xref>). In short, various objectives are possible (and more may be discovered), as long as minimization of the objective moves our global solution in the right direction (based on the obtained information from the trial).</p>
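<p>To make the global update step concrete, the following sketch (our own simplification, not code from any of the cited systems) shows the tabular global update rule alongside a cross-entropy distillation loss between the softmax of local and global Q-values, in the spirit of Hamrick et al. (2020a):</p>

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def tabular_global_update(q_g, q_l, eta):
    """Tabular global update: Q^g(s,a) <- Q^g(s,a) + eta * (Q^l(s,a) - Q^g(s,a))."""
    return q_g + eta * (q_l - q_g)

def cross_entropy(p, q):
    """Cross-entropy between two action distributions (local target p, global q)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

q_local = [1.0, 2.0, 0.5]    # Q^l at one state, produced by planning
q_global = [0.8, 1.5, 0.7]   # Q^g at the same state, from the global approximation
loss = cross_entropy(softmax(q_local), softmax(q_global))
```

<p>Minimizing such a loss with respect to the parameters of the global solution moves it toward the local planning result; the squared loss of Equation (5) is just one member of this family of objectives.</p>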
<p>Another important class of approaches is <italic>gradient-based planning</italic>, also known as <italic>value gradients</italic> (Fairbank and Alonso, <xref ref-type="bibr" rid="B47">2012</xref>; Heess et al., <xref ref-type="bibr" rid="B64">2015</xref>). These approaches require a (known or learned) differentiable transition and reward model (and a differentiable value function when we also include bootstrapping). When we also specify a differentiable policy, then each trial generates a fully differentiable graph, in which we can directly differentiate the cumulative reward with respect to the policy parameters. This idea is illustrated in <xref ref-type="fig" rid="F8">Figure 8</xref>, where we aggregate over all gradient paths in the graph (red dotted lines). Gradient-based planning is popular in the robotics and control community (Todorov and Li, <xref ref-type="bibr" rid="B154">2005</xref>; Anderson and Moore, <xref ref-type="bibr" rid="B5">2007</xref>; Deisenroth and Rasmussen, <xref ref-type="bibr" rid="B41">2011</xref>), where dynamics functions are relatively smooth and differentiable, although the idea can also be applied with discrete states (Wu et al., <xref ref-type="bibr" rid="B165">2017</xref>).</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Illustration of gradient-based planning. When we have access to a differentiable transition function <inline-formula><mml:math id="M83"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">T</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> and differentiable reward function <inline-formula><mml:math id="M84"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">R</mml:mi></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and we also specify a differentiable policy &#x003C0;<sub>&#x003B8;</sub>(<italic>a</italic>|<italic>s</italic>), then a single trial generates a fully differentiable computational graph. The figure shows an example graph for a trial of depth 3. The black arrows show the forward passes through the policy, dynamics function, and reward function. In the example, we also bootstrap from a differentiable (learned) value function, but this may be omitted. We may then update the policy parameters by directly differentiating the cumulative reward (objective, green box) with respect to the policy parameters, effectively summing the gradients over all backward paths indicated by the red dotted lines.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="frai-05-908353-g0008.tif"/>
</fig>
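<p>As a toy illustration of gradient-based planning (entirely our own construction, with an assumed linear dynamics function and quadratic reward), the sketch below unrolls a differentiable policy through a known model and ascends the gradient of the cumulative reward with respect to the policy parameter. For simplicity the gradient of the rollout graph is approximated by finite differences; a real implementation would backpropagate through the graph as in Figure 8:</p>

```python
def rollout_return(theta, s0=1.0, depth=3):
    """Unroll the policy u = theta * s through the known dynamics
    s' = 0.9 * s + u with reward r = -(s**2 + u**2), and return the
    cumulative reward of the trial (the green box in Figure 8)."""
    s, ret = s0, 0.0
    for _ in range(depth):
        u = theta * s             # differentiable policy
        ret += -(s * s + u * u)   # differentiable reward
        s = 0.9 * s + u           # differentiable transition
    return ret

def grad(f, theta, eps=1e-5):
    """Finite-difference stand-in for differentiating the rollout graph."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

# Gradient ascent on the policy parameter directly improves the trial return.
theta = 0.0
for _ in range(200):
    theta += 0.05 * grad(rollout_return, theta)
```

<p>After these updates the learned gain &#x003B8; is negative, steering the state toward the origin, and the cumulative reward of a trial improves over that of the initial policy.</p>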
<p><xref ref-type="table" rid="T6">Table 6</xref> summarizes some of the common loss functions we discussed. The examples in the table all have analytical gradients, but otherwise we may always use finite differences to numerically estimate the gradient of an objective. The learning rate in these update equations is typically tuned to a specific value (or decay scheme), although there are more sophisticated approaches that bound the step size, such as proximal policy optimization (PPO) (Schulman et al., <xref ref-type="bibr" rid="B134">2017</xref>). Moreover, we did not discuss gradient-free updating of a global solution, because these algorithms typically do not exploit MDP-specific knowledge (i.e., they do not construct and back up value estimates at states throughout the MDP, but only sample the objective function based on traces from the root). However, we do note that gradient-free black-box optimization can also be successful in MDP optimization, as has for example been shown for evolution strategies (Moriarty et al., <xref ref-type="bibr" rid="B107">1999</xref>; Whiteson and Stone, <xref ref-type="bibr" rid="B162">2006</xref>; Salimans et al., <xref ref-type="bibr" rid="B127">2017</xref>), simulated annealing (Atiya et al., <xref ref-type="bibr" rid="B7">2003</xref>), and the cross-entropy method (Mannor et al., <xref ref-type="bibr" rid="B94">2003</xref>).</p>
<p>This concludes our discussion of the dimensions in the framework. An overview of all considerations and their possible choices is shown in <xref ref-type="table" rid="T2">Table 2</xref>, while Algorithm 1 shows how all these considerations piece together in a general algorithmic framework. To illustrate the validity of the framework, the next section will analyze a variety of planning and RL methods along the framework dimensions.</p>
</sec>
</sec>
<sec id="s6">
<title>6. Comparison of Algorithms</title>
<p>Having discussed all the dimensions of the framework, we will now zoom out and reflect on its use and potential implications. The main point of our framework is that MDP planning and reinforcement learning algorithms occupy the same solution space. To illustrate this idea, <xref ref-type="table" rid="T7">Table 7</xref> shows for a range of well-known planning (blue), model-free RL (red) and model-based RL (green) algorithms the choices they make on the dimensions of the framework. The list is of course not complete (we could have included any other preferred algorithm), but the table illustrates that the framework is applicable to a wide range of algorithms.</p>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Comparison of algorithms (columns) along the framework dimensions (rows).</p></caption>
<table frame="hsides" rules="groups">
<tbody><tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left"><bold>Dimension</bold></td>
<td valign="top" align="left"><bold>Consideration</bold></td>
<td valign="top" align="left">Value iteration Bellman, <xref ref-type="bibr" rid="B17">1966</xref></td>
<td valign="top" align="left" style="background-color:#bbb7da;"> LAO<sup>&#x022C6;</sup> Hansen and Zilberstein, <xref ref-type="bibr" rid="B60">2001</xref></td>
<td valign="top" align="left" style="background-color:#bbb7da;"> Labeled RTDP Bonet and Geffner, <xref ref-type="bibr" rid="B26">2003b</xref></td>
<td valign="top" align="left" style="background-color:#bbb7da;"> Monte Carlo search Tesauro and Galperin, <xref ref-type="bibr" rid="B152">1997</xref></td>
<td valign="top" align="left" style="background-color:#bbb7da;"> MCTS Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref></td>
<td valign="top" align="left" style="background-color:#fcc9b5;"> Q-learning Watkins and Dayan, <xref ref-type="bibr" rid="B161">1992</xref></td>
<td valign="top" align="left" style="background-color:#fcc9b5;"> TD(&#x003BB;) Sutton and Barto, <xref ref-type="bibr" rid="B148">2018</xref></td>
</tr>
<tr>
<td valign="top" align="left">MDP access</td>
<td/>
<td valign="top" align="left">Settable descriptive</td>
<td valign="top" align="left">Settable descriptive</td>
<td valign="top" align="left">Settable descriptive</td>
<td valign="top" align="left">Settable generative</td>
<td valign="top" align="left">Settable generative</td>
<td valign="top" align="left">Resettable generative</td>
<td valign="top" align="left">Resettable generative</td>
</tr>
<tr>
<td valign="top" align="left">Solution</td>
<td valign="top" align="left">- Coverage</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Local</td>
<td valign="top" align="left">Local</td>
<td valign="top" align="left">Local</td>
<td valign="top" align="left">Local</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Global</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Type</td>
<td valign="top" align="left"><italic>V</italic>(<italic>s</italic>)</td>
<td valign="top" align="left"><italic>V</italic>(<italic>s</italic>)</td>
<td valign="top" align="left"><italic>V</italic>(<italic>s</italic>)</td>
<td valign="top" align="left"><italic>Q</italic>(<italic>s, a</italic>)</td>
<td valign="top" align="left"><italic>Q</italic>(<italic>s, a</italic>)</td>
<td valign="top" align="left"><italic>Q</italic>(<italic>s, a</italic>)</td>
<td valign="top" align="left"><italic>V</italic>(<italic>s</italic>)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Method</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Tabular</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Initialization</td>
<td valign="top" align="left">Uniform</td>
<td valign="top" align="left">Heuristic</td>
<td valign="top" align="left">Heuristic</td>
<td valign="top" align="left">Uniform</td>
<td valign="top" align="left">Optimistic</td>
<td valign="top" align="left">Uniform</td>
<td valign="top" align="left">Uniform</td>
</tr>
<tr>
<td valign="top" align="left">Root</td>
<td valign="top" align="left">- Selection</td>
<td valign="top" align="left">Ordered</td>
<td valign="top" align="left">Forward sampling</td>
<td valign="top" align="left">Forward sampling</td>
<td valign="top" align="left">Forward sampling</td>
<td valign="top" align="left">Forward sampling</td>
<td valign="top" align="left">Forward sampling</td>
<td valign="top" align="left">Forward sampling</td>
</tr>
<tr>
<td valign="top" align="left">Budget</td>
<td valign="top" align="left">- &#x00023; trials per root</td>
<td valign="top" align="left">up to <inline-formula><mml:math id="M87"><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>|</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>|</mml:mo></mml:math></inline-formula></td>
<td valign="top" align="left">till convergence</td>
<td valign="top" align="left">up to <inline-formula><mml:math id="M88"><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>|</mml:mo><mml:mo>&#x000B7;</mml:mo><mml:mo>|</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">S</mml:mi></mml:mrow><mml:mo>|</mml:mo></mml:math></inline-formula></td>
<td valign="top" align="left"><italic>n</italic></td>
<td valign="top" align="left"><italic>n</italic></td>
<td valign="top" align="left">1</td>
<td valign="top" align="left"><italic>d</italic><sub>max</sub></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Depth</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1..<italic>n</italic></td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">&#x0221E;</td>
<td valign="top" align="left">&#x0221E;</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1..<italic>d</italic><sub>max</sub></td>
</tr>
<tr>
<td valign="top" align="left">Selection</td>
<td valign="top" align="left">- Next action</td>
<td valign="top" align="left">Ordered</td>
<td valign="top" align="left">BF: Greedy, AF: Ordered, NR: Greedy</td>
<td valign="top" align="left">BF: Greedy, AF: Ordered, NR: Greedy</td>
<td valign="top" align="left">BF: Ordered, AF: Baseline</td>
<td valign="top" align="left">BF: Uncertainty, AF: Baseline, NR: Greedy</td>
<td valign="top" align="left">Random pert.</td>
<td valign="top" align="left">Random pert.</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Next state</td>
<td valign="top" align="left">Ordered</td>
<td valign="top" align="left">Ordered</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
</tr>
<tr>
<td valign="top" align="left">Bootstrap</td>
<td valign="top" align="left">- Location</td>
<td valign="top" align="left">State</td>
<td valign="top" align="left">State</td>
<td valign="top" align="left">State</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">State-action</td>
<td valign="top" align="left">State</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Type</td>
<td valign="top" align="left">Learned</td>
<td valign="top" align="left">Heuristic</td>
<td valign="top" align="left">Heuristic</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Learned</td>
<td valign="top" align="left">Learned</td>
</tr>
<tr>
<td valign="top" align="left">Back-up</td>
<td valign="top" align="left">- Back-up policy</td>
<td valign="top" align="left">Greedy/max</td>
<td valign="top" align="left">Greedy/max</td>
<td valign="top" align="left">Greedy/max</td>
<td valign="top" align="left">On-policy</td>
<td valign="top" align="left">On-policy</td>
<td valign="top" align="left">Greedy/max</td>
<td valign="top" align="left">On-policy</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Policy exp.</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Sample</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Dynamics exp.</td>
<td valign="top" align="left">Expected</td>
<td valign="top" align="left">Expected</td>
<td valign="top" align="left">Expected</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Add. back-ups</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Convergence label</td>
<td valign="top" align="left">Convergence label</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Counts</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
</tr>
<tr>
<td valign="top" align="left">Update</td>
<td valign="top" align="left">- Loss</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">(Squared)</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td/>
<td valign="top" align="left">- Update type</td>
<td valign="top" align="left">Replace (&#x003B7; &#x0003D; 1.0)</td>
<td valign="top" align="left">Replace (&#x003B7; &#x0003D; 1.0)</td>
<td valign="top" align="left">Replace (&#x003B7; &#x0003D; 1.0)</td>
<td valign="top" align="left">Average (&#x003B7; &#x0003D; 1/<italic>n</italic>)</td>
<td valign="top" align="left">Average (&#x003B7; &#x0003D; 1/<italic>n</italic>)</td>
<td valign="top" align="left">Fixed step</td>
<td valign="top" align="left">Eligibility</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="left"><bold>Dimension</bold></td>
<td valign="top" align="left"><bold>Consideration</bold></td>
<td valign="top" align="left">REINFORCE Williams, <xref ref-type="bibr" rid="B164">1992</xref></td>
<td valign="top" align="left" style="background-color:#fcc9b5;">DQN Mnih et al., <xref ref-type="bibr" rid="B99">2015</xref></td>
<td valign="top" align="left" style="background-color:#fcc9b5;">Prioritized sweeping Moore and Atkeson, <xref ref-type="bibr" rid="B104">1993</xref></td>
<td valign="top" align="left" style="background-color:#c0e2ca;">Dyna Sutton, <xref ref-type="bibr" rid="B146">1990</xref></td>
<td valign="top" align="left" style="background-color:#c0e2ca;">PILCO Deisenroth and Rasmussen, <xref ref-type="bibr" rid="B41">2011</xref></td>
<td valign="top" align="left" style="background-color:#c0e2ca;">AlphaGo Silver et al., <xref ref-type="bibr" rid="B139">2017</xref></td>
<td valign="top" align="left" style="background-color:#c0e2ca;">Go-Explore (policy-based) Ecoffet et al., <xref ref-type="bibr" rid="B44">2021</xref></td>
</tr>
<tr>
<td valign="top" align="left">MDP access</td>
<td/>
<td valign="top" align="left">Resettable generative</td>
<td valign="top" align="left">Resettable generative</td>
<td valign="top" align="left">Resettable generative</td>
<td valign="top" align="left">Resettable generative</td>
<td valign="top" align="left">Resettable generative</td>
<td valign="top" align="left">Settable generative</td>
<td valign="top" align="left">Resettable generative</td>
</tr>
<tr>
<td valign="top" align="left">Solution</td>
<td valign="top" align="left">- Coverage</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Global</td>
<td valign="top" align="left">Global</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Type</td>
<td valign="top" align="left">&#x003C0;(<italic>a</italic>|<italic>s</italic>)</td>
<td valign="top" align="left"><italic>Q</italic>(<italic>s, a</italic>)</td>
<td valign="top" align="left"><italic>Q</italic>(<italic>s, a</italic>)</td>
<td valign="top" align="left"><italic>Q</italic>(<italic>s, a</italic>)</td>
<td valign="top" align="left">&#x003C0;(<italic>a</italic>|<italic>s</italic>)</td>
<td valign="top" align="left">&#x003C0;(<italic>a</italic>|<italic>s</italic>), <italic>V</italic>(<italic>s</italic>)</td>
<td valign="top" align="left">&#x003C0;(<italic>a</italic>|<italic>s, g</italic>), <italic>V</italic>(<italic>s</italic>)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Method</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Approximate (NN)</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Tabular</td>
<td valign="top" align="left">Approximate (GP)</td>
<td valign="top" align="left">Approximate (NN)</td>
<td valign="top" align="left">Approximate (NN)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Initialization</td>
<td valign="top" align="left">Uniform</td>
<td valign="top" align="left">Random</td>
<td valign="top" align="left">Uniform</td>
<td valign="top" align="left">Uniform</td>
<td valign="top" align="left">Random</td>
<td valign="top" align="left">Random</td>
<td valign="top" align="left">Random</td>
</tr>
<tr>
<td valign="top" align="left">Root</td>
<td valign="top" align="left">- Selection</td>
<td valign="top" align="left">Forward</td>
<td valign="top" align="left">Forward</td>
<td valign="top" align="left">Forward &#x0002B; backward</td>
<td valign="top" align="left">Forward &#x0002B; visited states</td>
<td valign="top" align="left">Forward</td>
<td valign="top" align="left">Forward</td>
<td valign="top" align="left">Forward</td>
</tr>
<tr>
<td valign="top" align="left">Budget</td>
<td valign="top" align="left">- &#x00023; trials per root</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1600</td>
<td valign="top" align="left"><italic>d</italic><sub>max</sub></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Depth</td>
<td valign="top" align="left">&#x0221E;</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">1</td>
<td valign="top" align="left">&#x0221E;</td>
<td valign="top" align="left">MCTS: 1..<italic>n</italic>; NR: &#x0221E;</td>
<td valign="top" align="left">1..<italic>d</italic><sub>max</sub></td>
</tr>
<tr>
<td valign="top" align="left">Selection</td>
<td valign="top" align="left">- Next action</td>
<td valign="top" align="left">Rand. pert. (stoch. policy)</td>
<td valign="top" align="left">Rand. pert. (&#x003F5;-greedy)</td>
<td valign="top" align="left">State-based (novelty)</td>
<td valign="top" align="left">State-based (novelty) &#x0002B; Mean pert. (Boltzmann)</td>
<td valign="top" align="left">Rand. pert. (stoch. policy)</td>
<td valign="top" align="left">BF/AF: Uncertainty; NR: Rand. pert.</td>
<td valign="top" align="left">BF: Novelty &#x0002B; Mean pert. (entropy), AF: Rand. pert.</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Next state</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
</tr>
<tr>
<td valign="top" align="left">Bootstrap</td>
<td valign="top" align="left">- Location</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">State-action</td>
<td valign="top" align="left">State-action</td>
<td valign="top" align="left">State-action</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">State</td>
<td valign="top" align="left">State</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Type</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Learned</td>
<td valign="top" align="left">Learned</td>
<td valign="top" align="left">Learned</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Learned</td>
<td valign="top" align="left">Learned</td>
</tr>
<tr>
<td valign="top" align="left">Back-up</td>
<td valign="top" align="left">- Back-up policy</td>
<td valign="top" align="left">On-policy</td>
<td valign="top" align="left">Max/greedy</td>
<td valign="top" align="left">Max/greedy</td>
<td valign="top" align="left">On-policy</td>
<td valign="top" align="left">On-policy</td>
<td valign="top" align="left">On-policy</td>
<td valign="top" align="left">On-policy</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Policy exp.</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Max</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Dynamics exp.</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Expected</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
<td valign="top" align="left">Sample</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Add. back-ups</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">-</td>
<td valign="top" align="left">Priorities, counts</td>
<td valign="top" align="left">Counts</td>
<td valign="top" align="left">Uncertainty</td>
<td valign="top" align="left">Counts</td>
<td valign="top" align="left">Counts</td>
</tr>
<tr>
<td valign="top" align="left">Update</td>
<td valign="top" align="left">- Loss</td>
<td valign="top" align="left">Policy gradient</td>
<td valign="top" align="left">Squared</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">(Squared)</td>
<td valign="top" align="left">Value gradient</td>
<td valign="top" align="left">Cross-entropy (policy) &#x0002B; squared (value)</td>
<td valign="top" align="left">Policy gradient (PPO) &#x0002B; squared (value)</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">- Learning rate</td>
<td valign="top" align="left">Fixed step</td>
<td valign="top" align="left">Fixed step</td>
<td valign="top" align="left">Fixed step</td>
<td valign="top" align="left">Fixed step</td>
<td valign="top" align="left">Fixed step</td>
<td valign="top" align="left">Local: average; Global: fixed step</td>
<td valign="top" align="left">Local: eligibility; Global: adaptive</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Blue, red, and green denote planning, model-free RL, and model-based RL algorithms, respectively (although Value Iteration is technically model-based RL under our definitions in Section 3, we still list it as the first entry since it is a core algorithm). All methods that use a global solution also use a local solution (which we did not explicitly write in the table). Regarding action selection, where applicable we discriminate between before-frontier (BF) action selection, after-frontier (AF) action selection, and next-root (NR) action selection. When the squared loss is written between brackets, the algorithm uses a direct tabular update rule, and the squared loss is therefore never explicitly part of the algorithm. NN, neural network; GP, Gaussian process; PPO, Proximal Policy Optimization (Schulman et al., <xref ref-type="bibr" rid="B134">2017</xref>)</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>A first observation from the table is that it reads like a patchwork. On most dimensions the same decisions appear in both the planning and the reinforcement learning literature, showing that the two fields overlap substantially in the methodology they have developed. For example, the depth and back-up schemes of MCTS (Kocsis and Szepesv&#x000E1;ri, <xref ref-type="bibr" rid="B78">2006</xref>) and REINFORCE (Williams, <xref ref-type="bibr" rid="B164">1992</xref>) are exactly the same, but the two algorithms differ in their solution coverage (MCTS only uses a local solution, while REINFORCE updates a global solution after every trial) and in their exploration method. Such comparisons provide insight into the overlap and the differences between the various approaches.</p>
<p>A second observation from the table is therefore that <italic>every algorithm has to make a decision on each dimension</italic>. Even though we often do not consciously consider each of the dimensions when we design a new algorithm, we are still implicitly making a decision on each of them. The framework could thereby help structure the design of new algorithms, by consciously walking along its dimensions. It also shows what we should actually report about an algorithm to characterize it fully.</p>
<p>There is one deeper connection between planning and tabular reinforcement learning we have not discussed yet. In our framework, we treated the back-up estimates generated from a single model-free RL trial as a local solution. This increases consistency (i.e., allows for the pseudocode of Algorithm 1), but we could also view model-free RL as a direct update of the global solution based on the back-up estimate (i.e., skip the local solution). With this view we see another relation between common planning and tabular learning algorithms, such as MCTS (planning) and Monte Carlo reinforcement learning (MCRL). Both these algorithms sample trials and compute back-up estimates in the same way, but MCTS writes these to a local tabular solution (with learning rate <inline-formula><mml:math id="M89"><mml:mi>&#x003B7;</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:mfrac></mml:math></inline-formula>), while MCRL writes these to a global tabular solution (with fixed learning rate &#x003B7;). These algorithms from different research fields are therefore strongly connected, not only in their back-up, but also in their update schemes.</p>
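The shared back-up but different update of MCTS and MCRL can be made concrete with a minimal sketch (illustrative Python; this is our own notation, not code from the paper). Both write a Monte Carlo return estimate <italic>G</italic> into a tabular value with the rule Q &#x02190; Q + &#x003B7;(G &#x02212; Q): &#x003B7; = 1/<italic>n</italic> yields the running average kept in MCTS node statistics, while a fixed &#x003B7; yields the recency-weighted average of MCRL.

```python
def backup(q, n, g, eta=None):
    """One tabular back-up of return estimate g into value estimate q.

    eta=None  -> running average (eta = 1/n), as in MCTS node statistics.
    fixed eta -> recency-weighted average, as in Monte Carlo RL (MCRL).
    """
    step = 1.0 / n if eta is None else eta
    return q + step * (g - q)

# With eta = 1/n the estimate is the exact mean of the observed returns:
q = 0.0
for n, g in enumerate([2.0, 4.0, 6.0], start=1):
    q = backup(q, n, g)
# q == 4.0, the mean of 2, 4, 6
```

The only difference between the two algorithms in this sketch is whether `eta` is `None` (local tabular solution, MCTS) or a fixed constant (global tabular solution, MCRL); the back-up itself is identical.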
<p>We briefly highlight elements of the framework, and possible combinations of choices, that deserve extra attention. First of all, the main success of reinforcement learning originates from its use of global, approximate representations (Silver et al., <xref ref-type="bibr" rid="B139">2017</xref>; Ecoffet et al., <xref ref-type="bibr" rid="B44">2021</xref>), for example in the form of deep neural networks. These approximate representations allow for generalization between similar states, and planning researchers may therefore want to emphasize global solution representations in their algorithms. Conversely, a main part of the success of the planning literature comes from the stability and guarantees of building local, tabular solutions. Combinations of both approaches achieve state-of-the-art results (Levine and Abbeel, <xref ref-type="bibr" rid="B87">2014</xref>; Silver et al., <xref ref-type="bibr" rid="B139">2017</xref>; Hamrick et al., <xref ref-type="bibr" rid="B58">2020a</xref>), and illustrate that we can be very creative in how a learned global solution guides new planning iterations, and in how planning output influences the global solution and/or action selection. Important research questions are therefore how action selection within a trial can be influenced by the global solution (Algorithm 1, line 16), how a local solution should influence the global solution (i.e., variants of loss functions, Algorithm 1, line 7), and how we may adaptively assign planning budgets per root state (Algorithm 1, line 5). A recent systematic study of design considerations for planning in the context of model-based deep reinforcement learning is provided by Hamrick et al. (<xref ref-type="bibr" rid="B59">2020b</xref>).</p>
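One concrete instance of a loss function through which planning output can update a global solution is the AlphaGo Zero-style objective: cross-entropy between the network policy and the planner's root visit-count distribution, plus a squared value error. The sketch below (illustrative Python; the function name and numbers are our own) only evaluates this loss, whereas a real implementation would backpropagate it through a neural network.

```python
import math

def combined_loss(pi_net, v_net, visit_counts, z):
    """AlphaGo Zero-style loss at a single root state (illustrative).

    pi_net: policy probabilities predicted by the global network.
    v_net: value predicted by the global network.
    visit_counts: planner visit counts per action at the root.
    z: backed-up return (outcome) observed for the root state.
    """
    total = sum(visit_counts)
    pi_target = [c / total for c in visit_counts]  # planner's policy target
    cross_entropy = -sum(t * math.log(p)
                         for t, p in zip(pi_target, pi_net) if t > 0)
    squared_value_error = (z - v_net) ** 2
    return cross_entropy + squared_value_error

# A planner that strongly prefers action 0 pulls the uniform network
# policy toward it, and the value head toward the observed return:
loss = combined_loss([0.5, 0.5], 0.0, [90, 10], z=1.0)  # ln(2) + 1, approx. 1.69
```

Minimizing this loss makes the next planning iteration start from a better-informed global policy and value, which is exactly the feedback loop between local and global solutions discussed above.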
<p>Another important direction for cross-pollination is the study of <italic>global, approximate frontiers</italic>. On the one hand, planning research has extensively studied the benefit of local, tabular frontiers, a crucial idea which has been ignored in most RL literature. On the other hand, tabular frontiers do not scale to high-dimensional problems, and in these cases we need to track some kind of global approximate frontier, as studied in intrinsically motivated goal exploration processes (Colas et al., <xref ref-type="bibr" rid="B37">2020</xref>). Initial results in this direction are for example provided by P&#x000E9;r&#x000E9; et al. (<xref ref-type="bibr" rid="B115">2018</xref>) and Ecoffet et al. (<xref ref-type="bibr" rid="B44">2021</xref>), but much research remains to be done in this direction. Returning to the previous point, we also believe semi-parametric memory and episodic memory (Blundell et al., <xref ref-type="bibr" rid="B22">2016</xref>; Pritzel et al., <xref ref-type="bibr" rid="B121">2017</xref>) may play a big role in global approximate solutions, for example to ensure we can directly return to a recently discovered interesting state.</p>
<p>A third interesting direction is a stronger emphasis on the idea of backward search (planning terminology) or prioritized sweeping (RL terminology). In both communities, backward search has received considerably less attention than forward search, while backward approaches are crucial to spread acquired information efficiently over a (global) state space (by setting root states in a smarter way, see Section 5.2). The major bottleneck seems to be the need for a <italic>reverse</italic> model (which state-actions may lead to a particular state), which is often available in small, tabular problems, but not in large, complex problems where we only have a simulator or real-world interaction available. However, we may learn an approximate reverse model from data, which could bring these powerful ideas back into the picture. Promising initial results in this direction are provided by Corneil et al. (<xref ref-type="bibr" rid="B38">2018</xref>), Edwards et al. (<xref ref-type="bibr" rid="B46">2018</xref>), and Agostinelli et al. (<xref ref-type="bibr" rid="B2">2019</xref>).</p>
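As a concrete illustration of this backward spreading of information, the following minimal tabular sketch (illustrative Python; the chain MDP and all names are our own) runs prioritized sweeping on a four-state chain where the reverse model is known exactly. In large problems, this reverse model would have to be learned from data, which is precisely the open challenge discussed above.

```python
import heapq

# Chain MDP: states 0..3, one action moving right, reward 1 on reaching
# the terminal state 3. Priorities are absolute Bellman errors.
GAMMA, THETA = 0.9, 1e-5
N = 4
V = [0.0] * N

def successor(s):                      # forward model
    s2 = min(s + 1, N - 1)
    r = 1.0 if s2 == N - 1 and s != N - 1 else 0.0
    return s2, r

def predecessors(s):                   # reverse model (known here, learned in general)
    return [s - 1] if s > 0 else []

def backup(s):
    if s == N - 1:                     # terminal state keeps value 0
        return 0.0
    s2, r = successor(s)
    return r + GAMMA * V[s2]

# Seed the queue with the state whose value just changed (next to the
# goal), then sweep backward, pushing predecessors with large priority.
pq = [(-1.0, N - 2)]
while pq:
    _, s = heapq.heappop(pq)
    V[s] = backup(s)
    for p in predecessors(s):
        priority = abs(backup(p) - V[p])
        if priority > THETA:
            heapq.heappush(pq, (-priority, p))

# V now holds discounted distances to the goal: approx. [0.81, 0.9, 1.0, 0.0]
```

Three back-ups suffice to propagate the reward over the whole chain, whereas forward-only sweeps from the start state would need many more updates; this efficiency is what makes a learned reverse model attractive.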
<p>In summary, the framework for reinforcement learning and planning (FRAP), as presented in this paper, shows that planning and reinforcement learning algorithms share the same algorithmic space. This provides a common language for researchers from both fields, and may help inspire future research (for example through cross-pollination). Finally, we hope the paper also serves an educational purpose, for researchers from one field who enter the other, but particularly for students, as a systematic way to think about the decisions that need to be made in a planning or reinforcement learning algorithm, and as a way to integrate algorithms that are often taught in disjoint courses.</p>
</sec>
<sec sec-type="data-availability" id="s7">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.</p>
</sec>
<sec id="s8">
<title>Author Contributions</title>
<p>TM led the project and wrote the first manuscript. JB was involved in the conceptual design of the paper and provided feedback on the manuscript. AP and CJ both supervised the project, were involved in conceptual discussions, and provided comments on the manuscript to reach the final version. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec> </body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Achiam</surname> <given-names>J.</given-names></name> <name><surname>Sastry</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>Surprise-based intrinsic motivation for deep reinforcement learning</article-title>. <source>arXiv preprint arXiv:1703.01732</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1703.01732</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Agostinelli</surname> <given-names>F.</given-names></name> <name><surname>McAleer</surname> <given-names>S.</given-names></name> <name><surname>Shmakov</surname> <given-names>A.</given-names></name> <name><surname>Baldi</surname> <given-names>P.</given-names></name></person-group> (<year>2019</year>). <article-title>Solving the Rubik&#x00027;s cube with deep reinforcement learning and search</article-title>. <source>Nat. Mach. Intell</source>. <volume>1</volume>, <fpage>356</fpage>&#x02013;<lpage>363</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-019-0070-z</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Akers</surname> <given-names>S. B.</given-names></name></person-group> (<year>1978</year>). <article-title>Binary decision diagrams</article-title>. <source>IEEE Trans. Comput</source>. <volume>27</volume>, <fpage>509</fpage>&#x02013;<lpage>516</lpage>. <pub-id pub-id-type="doi">10.1109/TC.1978.1675141</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alc&#x000E1;zar</surname> <given-names>V.</given-names></name> <name><surname>Borrajo</surname> <given-names>D.</given-names></name> <name><surname>Fern&#x000E1;ndez</surname> <given-names>S.</given-names></name> <name><surname>Fuentetaja</surname> <given-names>R.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;Revisiting regression in planning,&#x0201D;</article-title> in <source>Twenty-Third International Joint Conference on Artificial Intelligence</source> (<publisher-loc>Beijing</publisher-loc>).</citation>
</ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Anderson</surname> <given-names>B. D.</given-names></name> <name><surname>Moore</surname> <given-names>J. B.</given-names></name></person-group> (<year>2007</year>). <source>Optimal Control: Linear Quadratic Methods</source>. Courier Corporation.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Arulkumaran</surname> <given-names>K.</given-names></name> <name><surname>Deisenroth</surname> <given-names>M. P.</given-names></name> <name><surname>Brundage</surname> <given-names>M.</given-names></name> <name><surname>Bharath</surname> <given-names>A. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Deep reinforcement learning: a brief survey</article-title>. <source>IEEE Signal Process. Mag</source>. <volume>34</volume>, <fpage>26</fpage>&#x02013;<lpage>38</lpage>. <pub-id pub-id-type="doi">10.1109/MSP.2017.2743240</pub-id></citation>
</ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Atiya</surname> <given-names>A. F.</given-names></name> <name><surname>Parlos</surname> <given-names>A. G.</given-names></name> <name><surname>Ingber</surname> <given-names>L.</given-names></name></person-group> (<year>2003</year>). <article-title>&#x0201C;A reinforcement learning method based on adaptive simulated annealing,&#x0201D;</article-title> in <source>2003 46th Midwest Symposium on Circuits and Systems, Vol. 1</source> (<publisher-loc>Cairo</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>121</fpage>&#x02013;<lpage>124</lpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Auer</surname> <given-names>P.</given-names></name> <name><surname>Cesa-Bianchi</surname> <given-names>N.</given-names></name> <name><surname>Fischer</surname> <given-names>P.</given-names></name></person-group> (<year>2002</year>). <article-title>Finite-time analysis of the multiarmed bandit problem</article-title>. <source>Mach Learn</source>. <volume>47</volume>, <fpage>235</fpage>&#x02013;<lpage>256</lpage>. <pub-id pub-id-type="doi">10.1023/A:1013689704352</pub-id></citation>
</ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Baranes</surname> <given-names>A.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name></person-group> (<year>2013</year>). <article-title>Active learning of inverse models with intrinsically motivated goal exploration in robots</article-title>. <source>Rob. Auton. Syst</source>. <volume>61</volume>, <fpage>49</fpage>&#x02013;<lpage>73</lpage>. <pub-id pub-id-type="doi">10.1016/j.robot.2012.05.008</pub-id></citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barto</surname> <given-names>A. G.</given-names></name> <name><surname>Bradtke</surname> <given-names>S. J.</given-names></name> <name><surname>Singh</surname> <given-names>S. P.</given-names></name></person-group> (<year>1995</year>). <article-title>Learning to act using real-time dynamic programming</article-title>. <source>Artif. Intell</source>. <volume>72</volume>, <fpage>81</fpage>&#x02013;<lpage>138</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(94)00011-O</pub-id><pub-id pub-id-type="pmid">30732992</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barto</surname> <given-names>A. G.</given-names></name> <name><surname>Mahadevan</surname> <given-names>S.</given-names></name></person-group> (<year>2003</year>). <article-title>Recent advances in hierarchical reinforcement learning</article-title>. <source>Discrete Event Dyn. Syst</source>. <volume>13</volume>, <fpage>41</fpage>&#x02013;<lpage>77</lpage>. <pub-id pub-id-type="doi">10.1023/A:1022140919877</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barto</surname> <given-names>A. G.</given-names></name> <name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Anderson</surname> <given-names>C. W.</given-names></name></person-group> (<year>1983</year>). <article-title>Neuronlike adaptive elements that can solve difficult learning control problems</article-title>. <source>IEEE Trans. Syst. Man Cybern</source>. <volume>13</volume>, <fpage>834</fpage>&#x02013;<lpage>846</lpage>. <pub-id pub-id-type="doi">10.1109/TSMC.1983.6313077</pub-id></citation>
</ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bellemare</surname> <given-names>M.</given-names></name> <name><surname>Srinivasan</surname> <given-names>S.</given-names></name> <name><surname>Ostrovski</surname> <given-names>G.</given-names></name> <name><surname>Schaul</surname> <given-names>T.</given-names></name> <name><surname>Saxton</surname> <given-names>D.</given-names></name> <name><surname>Munos</surname> <given-names>R.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Unifying count-based exploration and intrinsic motivation,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Barcelona</publisher-loc>), <fpage>1471</fpage>&#x02013;<lpage>1479</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bellemare</surname> <given-names>M. G.</given-names></name> <name><surname>Dabney</surname> <given-names>W.</given-names></name> <name><surname>Munos</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;A distributional perspective on reinforcement learning,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Sydney</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>449</fpage>&#x02013;<lpage>458</lpage>.</citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bellman</surname> <given-names>R.</given-names></name></person-group> (<year>1954</year>). <article-title>The theory of dynamic programming</article-title>. <source>Bull. New Ser. Am. Math. Soc</source>. <volume>60</volume>, <fpage>503</fpage>&#x02013;<lpage>515</lpage>. <pub-id pub-id-type="doi">10.1090/S0002-9904-1954-09848-8</pub-id></citation>
</ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bellman</surname> <given-names>R.</given-names></name></person-group> (<year>1957</year>). <article-title>A Markovian decision process</article-title>. <source>J. Math. Mech</source>. <volume>6</volume>, <fpage>679</fpage>&#x02013;<lpage>684</lpage>. <pub-id pub-id-type="doi">10.1512/iumj.1957.6.56038</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bellman</surname> <given-names>R.</given-names></name></person-group> (<year>1966</year>). <article-title>Dynamic programming</article-title>. <source>Science</source> <volume>153</volume>, <fpage>34</fpage>&#x02013;<lpage>37</lpage>. <pub-id pub-id-type="doi">10.1126/science.153.3731.34</pub-id><pub-id pub-id-type="pmid">17730601</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bertsekas</surname> <given-names>D.</given-names></name></person-group> (<year>2012</year>). <source>Dynamic Programming and Optimal Control: Volume I. Vol. 1</source>. Athena Scientific.</citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bertsekas</surname> <given-names>D. P.</given-names></name></person-group> (<year>2011</year>). <source>Dynamic Programming and Optimal Control 3rd Edition, Volume 2</source>. <publisher-loc>Belmont, MA</publisher-loc>: <publisher-name>Athena Scientific</publisher-name>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bertsekas</surname> <given-names>D. P.</given-names></name> <name><surname>Tsitsiklis</surname> <given-names>J. N.</given-names></name></person-group> (<year>1991</year>). <article-title>An analysis of stochastic shortest path problems</article-title>. <source>Math. Operat. Res</source>. <volume>16</volume>, <fpage>580</fpage>&#x02013;<lpage>595</lpage>. <pub-id pub-id-type="doi">10.1287/moor.16.3.580</pub-id><pub-id pub-id-type="pmid">34860657</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bertsekas</surname> <given-names>D. P.</given-names></name> <name><surname>Tsitsiklis</surname> <given-names>J. N.</given-names></name></person-group> (<year>1996</year>). <source>Neuro-Dynamic Programming, Vol. 5</source>. <publisher-loc>Belmont, MA</publisher-loc>: <publisher-name>Athena Scientific</publisher-name>.</citation>
</ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Blundell</surname> <given-names>C.</given-names></name> <name><surname>Uria</surname> <given-names>B.</given-names></name> <name><surname>Pritzel</surname> <given-names>A.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Ruderman</surname> <given-names>A.</given-names></name> <name><surname>Leibo</surname> <given-names>J. Z.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Model-free episodic control</article-title>. <source>arXiv preprint arXiv:1606.04460</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1606.04460</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bock</surname> <given-names>H. G.</given-names></name> <name><surname>Plitt</surname> <given-names>K.-J.</given-names></name></person-group> (<year>1984</year>). <article-title>A multiple shooting algorithm for direct solution of optimal control problems</article-title>. <source>IFAC Proc</source>. <volume>17</volume>, <fpage>1603</fpage>&#x02013;<lpage>1608</lpage>. <pub-id pub-id-type="doi">10.1016/S1474-6670(17)61205-9</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bonet</surname> <given-names>B.</given-names></name> <name><surname>Geffner</surname> <given-names>H.</given-names></name></person-group> (<year>2001</year>). <article-title>Planning as heuristic search</article-title>. <source>Artif. Intell</source>. <volume>129</volume>, <fpage>5</fpage>&#x02013;<lpage>33</lpage>. <pub-id pub-id-type="doi">10.1016/S0004-3702(01)00108-4</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bonet</surname> <given-names>B.</given-names></name> <name><surname>Geffner</surname> <given-names>H.</given-names></name></person-group> (<year>2003a</year>). <article-title>&#x0201C;Faster heuristic search algorithms for planning with uncertainty and full feedback,&#x0201D;</article-title> in <source>IJCAI</source> (<publisher-loc>Acapulco</publisher-loc>), <fpage>1233</fpage>&#x02013;<lpage>1238</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bonet</surname> <given-names>B.</given-names></name> <name><surname>Geffner</surname> <given-names>H.</given-names></name></person-group> (<year>2003b</year>). <article-title>&#x0201C;Labeled RTDP: improving the convergence of real-time dynamic programming,&#x0201D;</article-title> in <source>ICAPS Vol. 3</source> (<publisher-loc>Trento</publisher-loc>), <fpage>12</fpage>&#x02013;<lpage>21</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Botvinick</surname> <given-names>M.</given-names></name> <name><surname>Toussaint</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>Planning as inference</article-title>. <source>Trends Cogn. Sci</source>. <volume>16</volume>, <fpage>485</fpage>&#x02013;<lpage>488</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2012.08.006</pub-id><pub-id pub-id-type="pmid">22940577</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bradtke</surname> <given-names>S. J.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>1996</year>). <article-title>Linear least-squares algorithms for temporal difference learning</article-title>. <source>Mach. Learn</source>. <volume>22</volume>, <fpage>33</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1007/BF00114723</pub-id></citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brafman</surname> <given-names>R. I.</given-names></name> <name><surname>Tennenholtz</surname> <given-names>M.</given-names></name></person-group> (<year>2003</year>). <article-title>R-MAX&#x02013;A general polynomial time algorithm for near-optimal reinforcement learning</article-title>. <source>J. Mach. Learn. Res</source>. <volume>3</volume>, <fpage>213</fpage>&#x02013;<lpage>231</lpage>. <pub-id pub-id-type="doi">10.1162/153244303765208377</pub-id></citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Browne</surname> <given-names>C. B.</given-names></name> <name><surname>Powley</surname> <given-names>E.</given-names></name> <name><surname>Whitehouse</surname> <given-names>D.</given-names></name> <name><surname>Lucas</surname> <given-names>S. M.</given-names></name> <name><surname>Cowling</surname> <given-names>P. I.</given-names></name> <name><surname>Rohlfshagen</surname> <given-names>P.</given-names></name> <etal/></person-group>. (<year>2012</year>). <article-title>A survey of Monte Carlo tree search methods</article-title>. <source>IEEE Trans. Comput. Intell. AI Games</source> <volume>4</volume>, <fpage>1</fpage>&#x02013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.1109/TCIAIG.2012.2186810</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bryant</surname> <given-names>R. E.</given-names></name></person-group> (<year>1992</year>). <article-title>Symbolic Boolean manipulation with ordered binary-decision diagrams</article-title>. <source>ACM Comput. Surveys</source> <volume>24</volume>, <fpage>293</fpage>&#x02013;<lpage>318</lpage>. <pub-id pub-id-type="doi">10.1145/136035.136043</pub-id></citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Buckman</surname> <given-names>J.</given-names></name> <name><surname>Hafner</surname> <given-names>D.</given-names></name> <name><surname>Tucker</surname> <given-names>G.</given-names></name> <name><surname>Brevdo</surname> <given-names>E.</given-names></name> <name><surname>Lee</surname> <given-names>H.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Sample-efficient reinforcement learning with stochastic ensemble value expansion,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>8224</fpage>&#x02013;<lpage>8234</lpage>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Busoniu</surname> <given-names>L.</given-names></name> <name><surname>Babuska</surname> <given-names>R.</given-names></name> <name><surname>De Schutter</surname> <given-names>B.</given-names></name></person-group> (<year>2008</year>). <article-title>A comprehensive survey of multiagent reinforcement learning</article-title>. <source>IEEE Trans. Syst. Man Cybern. C</source> <volume>38</volume>, <fpage>156</fpage>&#x02013;<lpage>172</lpage>. <pub-id pub-id-type="doi">10.1109/TSMCC.2007.913919</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Campbell</surname> <given-names>M.</given-names></name> <name><surname>Hoane Jr</surname> <given-names>A. J.</given-names></name> <name><surname>Hsu</surname> <given-names>F.-H.</given-names></name></person-group> (<year>2002</year>). <article-title>Deep Blue</article-title>. <source>Artif. Intell</source>. <volume>134</volume>, <fpage>57</fpage>&#x02013;<lpage>83</lpage>. <pub-id pub-id-type="doi">10.1016/S0004-3702(01)00129-1</pub-id></citation>
</ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cesa-Bianchi</surname> <given-names>N.</given-names></name> <name><surname>Gentile</surname> <given-names>C.</given-names></name> <name><surname>Lugosi</surname> <given-names>G.</given-names></name> <name><surname>Neu</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Boltzmann exploration done right,&#x0201D;</article-title> in <source>31st Conference on Neural Information Processing Systems (NIPS 2017)</source> (<publisher-loc>Long Beach, CA</publisher-loc>).</citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chentanez</surname> <given-names>N.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name> <name><surname>Singh</surname> <given-names>S. P.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;Intrinsically motivated reinforcement learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Vancouver, BC</publisher-loc>), <fpage>1281</fpage>&#x02013;<lpage>1288</lpage>.</citation></ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Colas</surname> <given-names>C.</given-names></name> <name><surname>Karch</surname> <given-names>T.</given-names></name> <name><surname>Sigaud</surname> <given-names>O.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name></person-group> (<year>2020</year>). <article-title>Intrinsically motivated goal-conditioned reinforcement learning: a short survey</article-title>. <source>arXiv preprint arXiv:2012.09830</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2012.09830</pub-id></citation>
</ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Corneil</surname> <given-names>D.</given-names></name> <name><surname>Gerstner</surname> <given-names>W.</given-names></name> <name><surname>Brea</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Efficient model-based deep reinforcement learning with variational state tabulation</article-title>. <source>arXiv preprint arXiv:1802.04325</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1802.04325</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Coulom</surname> <given-names>R.</given-names></name></person-group> (<year>2006</year>). <article-title>&#x0201C;Efficient selectivity and backup operators in Monte-Carlo tree search,&#x0201D;</article-title> in <source>International Conference on Computers and Games</source> (<publisher-loc>Turin</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>72</fpage>&#x02013;<lpage>83</lpage>.</citation>
</ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dearden</surname> <given-names>R.</given-names></name> <name><surname>Friedman</surname> <given-names>N.</given-names></name> <name><surname>Russell</surname> <given-names>S.</given-names></name></person-group> (<year>1998</year>). <article-title>&#x0201C;Bayesian Q-learning,&#x0201D;</article-title> in <source>AAAI/IAAI</source> (<publisher-loc>Madison</publisher-loc>), <fpage>761</fpage>&#x02013;<lpage>768</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deisenroth</surname> <given-names>M.</given-names></name> <name><surname>Rasmussen</surname> <given-names>C. E.</given-names></name></person-group> (<year>2011</year>). <article-title>&#x0201C;PILCO: a model-based and data-efficient approach to policy search,&#x0201D;</article-title> in <source>Proceedings of the 28th International Conference on Machine Learning (ICML-11)</source>, <fpage>465</fpage>&#x02013;<lpage>472</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deisenroth</surname> <given-names>M. P.</given-names></name> <name><surname>Neumann</surname> <given-names>G.</given-names></name> <name><surname>Peters</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>A survey on policy search for robotics</article-title>. <source>Foundat. Trends&#x000AE; Rob</source>. <volume>2</volume>, <fpage>1</fpage>&#x02013;<lpage>142</lpage>. <pub-id pub-id-type="doi">10.1561/2300000021</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dijkstra</surname> <given-names>E. W.</given-names></name></person-group> (<year>1959</year>). <article-title>A note on two problems in connexion with graphs</article-title>. <source>Numerische Math</source>. <volume>1</volume>, <fpage>269</fpage>&#x02013;<lpage>271</lpage>. <pub-id pub-id-type="doi">10.1007/BF01386390</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ecoffet</surname> <given-names>A.</given-names></name> <name><surname>Huizinga</surname> <given-names>J.</given-names></name> <name><surname>Lehman</surname> <given-names>J.</given-names></name> <name><surname>Stanley</surname> <given-names>K. O.</given-names></name> <name><surname>Clune</surname> <given-names>J.</given-names></name></person-group> (<year>2021</year>). <article-title>First return, then explore</article-title>. <source>Nature</source> <volume>590</volume>, <fpage>580</fpage>&#x02013;<lpage>586</lpage>. <pub-id pub-id-type="doi">10.1038/s41586-020-03157-9</pub-id><pub-id pub-id-type="pmid">33627813</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Edelkamp</surname> <given-names>S.</given-names></name> <name><surname>Schr&#x000F6;dl</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <source>Heuristic Search: Theory and Applications</source>. Elsevier.</citation>
</ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Edwards</surname> <given-names>A. D.</given-names></name> <name><surname>Downs</surname> <given-names>L.</given-names></name> <name><surname>Davidson</surname> <given-names>J. C.</given-names></name></person-group> (<year>2018</year>). <article-title>Forward-backward reinforcement learning</article-title>. <source>arXiv preprint arXiv:1803.10227</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1803.10227</pub-id></citation>
</ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Fairbank</surname> <given-names>M.</given-names></name> <name><surname>Alonso</surname> <given-names>E.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Value-gradient learning,&#x0201D;</article-title> in <source>The 2012 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Brisbane, QLD</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Felner</surname> <given-names>A.</given-names></name> <name><surname>Kraus</surname> <given-names>S.</given-names></name> <name><surname>Korf</surname> <given-names>R. E.</given-names></name></person-group> (<year>2003</year>). <article-title>KBFS: K-best-first search</article-title>. <source>Ann. Math. Artif. Intell</source>. <volume>39</volume>, <fpage>19</fpage>&#x02013;<lpage>39</lpage>. <pub-id pub-id-type="doi">10.1023/A:1024452529781</pub-id></citation>
</ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Florensa</surname> <given-names>C.</given-names></name> <name><surname>Held</surname> <given-names>D.</given-names></name> <name><surname>Geng</surname> <given-names>X.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Automatic goal generation for reinforcement learning agents,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>1514</fpage>&#x02013;<lpage>1523</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fran&#x000E7;ois-Lavet</surname> <given-names>V.</given-names></name> <name><surname>Henderson</surname> <given-names>P.</given-names></name> <name><surname>Islam</surname> <given-names>R.</given-names></name> <name><surname>Bellemare</surname> <given-names>M. G.</given-names></name> <name><surname>Pineau</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>An introduction to deep reinforcement learning</article-title>. <source>Foundat. Trends&#x000AE; Mach. Learn</source>. <volume>11</volume>, <fpage>219</fpage>&#x02013;<lpage>354</lpage>. <pub-id pub-id-type="doi">10.1561/9781680835397</pub-id></citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Geffner</surname> <given-names>H.</given-names></name> <name><surname>Bonet</surname> <given-names>B.</given-names></name></person-group> (<year>2013</year>). <article-title>A concise introduction to models and methods for automated planning</article-title>. <source>Synthesis Lectures Artif. Intell. Mach. Learn</source>. <volume>8</volume>, <fpage>1</fpage>&#x02013;<lpage>141</lpage>. <pub-id pub-id-type="doi">10.2200/S00513ED1V01Y201306AIM022</pub-id></citation>
</ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gelly</surname> <given-names>S.</given-names></name> <name><surname>Wang</surname> <given-names>Y.</given-names></name></person-group> (<year>2006</year>). <article-title>&#x0201C;Exploration exploitation in go: UCT for Monte-Carlo go,&#x0201D;</article-title> in <source>NIPS: Neural Information Processing Systems Conference On-line trading of Exploration and Exploitation Workshop</source>.</citation></ref>
<ref id="B53">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gershman</surname> <given-names>S. J.</given-names></name> <name><surname>Daw</surname> <given-names>N. D.</given-names></name></person-group> (<year>2017</year>). <article-title>Reinforcement learning and episodic memory in humans and animals: an integrative framework</article-title>. <source>Annu. Rev. Psychol</source>. <volume>68</volume>, <fpage>101</fpage>&#x02013;<lpage>128</lpage>. <pub-id pub-id-type="doi">10.1146/annurev-psych-122414-033625</pub-id><pub-id pub-id-type="pmid">27618944</pub-id></citation></ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <source>Deep Learning</source>. MIT Press.</citation>
</ref>
<ref id="B55">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Graves</surname> <given-names>A.</given-names></name> <name><surname>Wayne</surname> <given-names>G.</given-names></name> <name><surname>Reynolds</surname> <given-names>M.</given-names></name> <name><surname>Harley</surname> <given-names>T.</given-names></name> <name><surname>Danihelka</surname> <given-names>I.</given-names></name> <name><surname>Grabska-Barwi&#x00144;ska</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Hybrid computing using a neural network with dynamic external memory</article-title>. <source>Nature</source> <volume>538</volume>, <fpage>471</fpage>&#x02013;<lpage>476</lpage>. <pub-id pub-id-type="doi">10.1038/nature20101</pub-id><pub-id pub-id-type="pmid">27732574</pub-id></citation></ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Guez</surname> <given-names>A.</given-names></name> <name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Efficient Bayes-adaptive reinforcement learning using sample-based search,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>1025</fpage>&#x02013;<lpage>1033</lpage>.</citation>
</ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hamrick</surname> <given-names>J. B.</given-names></name></person-group> (<year>2019</year>). <article-title>Analogues of mental simulation and imagination in deep learning</article-title>. <source>Curr. Opin. Behav. Sci</source>. <volume>29</volume>, <fpage>8</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1016/j.cobeha.2018.12.011</pub-id></citation>
</ref>
<ref id="B58">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hamrick</surname> <given-names>J. B.</given-names></name> <name><surname>Bapst</surname> <given-names>V.</given-names></name> <name><surname>Sanchez-Gonzalez</surname> <given-names>A.</given-names></name> <name><surname>Pfaff</surname> <given-names>T.</given-names></name> <name><surname>Weber</surname> <given-names>T.</given-names></name> <name><surname>Buesing</surname> <given-names>L.</given-names></name> <etal/></person-group>. (<year>2020a</year>). <article-title>&#x0201C;Combining q-learning and search with amortized value estimates,&#x0201D;</article-title> in <source>International Conference on Learning Representations (ICLR)</source>.</citation>
</ref>
<ref id="B59">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hamrick</surname> <given-names>J. B.</given-names></name> <name><surname>Friesen</surname> <given-names>A. L.</given-names></name> <name><surname>Behbahani</surname> <given-names>F.</given-names></name> <name><surname>Guez</surname> <given-names>A.</given-names></name> <name><surname>Viola</surname> <given-names>F.</given-names></name> <name><surname>Witherspoon</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2020b</year>). <article-title>&#x0201C;On the role of planning in model-based deep reinforcement learning,&#x0201D;</article-title> in <source>International Conference on Learning Representations</source>.</citation></ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hansen</surname> <given-names>E. A.</given-names></name> <name><surname>Zilberstein</surname> <given-names>S.</given-names></name></person-group> (<year>2001</year>). <article-title>LAO&#x022C6;: a heuristic search algorithm that finds solutions with loops</article-title>. <source>Artif. Intell</source>. <volume>129</volume>, <fpage>35</fpage>&#x02013;<lpage>62</lpage>. <pub-id pub-id-type="doi">10.1016/S0004-3702(01)00106-0</pub-id></citation>
</ref>
<ref id="B61">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hao</surname> <given-names>B.</given-names></name> <name><surname>Abbasi-Yadkori</surname> <given-names>Y.</given-names></name> <name><surname>Wen</surname> <given-names>Z.</given-names></name> <name><surname>Cheng</surname> <given-names>G.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Bootstrapping upper confidence bound,&#x0201D;</article-title> in <source>33rd Conference on Neural Information Processing Systems (NeurIPS 2019)</source> (<publisher-loc>Vancouver, BC</publisher-loc>).</citation>
</ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hart</surname> <given-names>P. E.</given-names></name> <name><surname>Nilsson</surname> <given-names>N. J.</given-names></name> <name><surname>Raphael</surname> <given-names>B.</given-names></name></person-group> (<year>1968</year>). <article-title>A formal basis for the heuristic determination of minimum cost paths</article-title>. <source>IEEE Trans. Syst. Sci. Cybern</source>. <volume>4</volume>, <fpage>100</fpage>&#x02013;<lpage>107</lpage>. <pub-id pub-id-type="doi">10.1109/TSSC.1968.300136</pub-id></citation>
</ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Harvey</surname> <given-names>W. D.</given-names></name> <name><surname>Ginsberg</surname> <given-names>M. L.</given-names></name></person-group> (<year>1995</year>). <article-title>&#x0201C;Limited discrepancy search,&#x0201D;</article-title> in <source>IJCAI</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>607</fpage>&#x02013;<lpage>615</lpage>.</citation>
</ref>
<ref id="B64">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Heess</surname> <given-names>N.</given-names></name> <name><surname>Wayne</surname> <given-names>G.</given-names></name> <name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Lillicrap</surname> <given-names>T.</given-names></name> <name><surname>Erez</surname> <given-names>T.</given-names></name> <name><surname>Tassa</surname> <given-names>Y.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Learning continuous control policies by stochastic value gradients,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>2944</fpage>&#x02013;<lpage>2952</lpage>.</citation>
</ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hester</surname> <given-names>T.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Learning and using models,&#x0201D;</article-title> in <source>Reinforcement Learning</source> (<publisher-name>Springer</publisher-name>), <fpage>111</fpage>&#x02013;<lpage>141</lpage>.</citation>
</ref>
<ref id="B66">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hoffmann</surname> <given-names>J.</given-names></name> <name><surname>Nebel</surname> <given-names>B.</given-names></name></person-group> (<year>2001</year>). <article-title>The FF planning system: fast plan generation through heuristic search</article-title>. <source>J. Artif. Intell. Res</source>. <volume>14</volume>, <fpage>253</fpage>&#x02013;<lpage>302</lpage>. <pub-id pub-id-type="doi">10.1613/jair.855</pub-id></citation>
</ref>
<ref id="B67">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Houthooft</surname> <given-names>R.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Duan</surname> <given-names>Y.</given-names></name> <name><surname>Schulman</surname> <given-names>J.</given-names></name> <name><surname>De Turck</surname> <given-names>F.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Vime: variational information maximizing exploration,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>1109</fpage>&#x02013;<lpage>1117</lpage>.</citation>
</ref>
<ref id="B68">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Howard</surname> <given-names>R. A.</given-names></name></person-group> (<year>1960</year>). <source>Dynamic Programming and Markov Processes</source>. John Wiley.</citation>
</ref>
<ref id="B69">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hussein</surname> <given-names>A.</given-names></name> <name><surname>Gaber</surname> <given-names>M. M.</given-names></name> <name><surname>Elyan</surname> <given-names>E.</given-names></name> <name><surname>Jayne</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Imitation learning: a survey of learning methods</article-title>. <source>ACM Comput. Surveys</source> <volume>50</volume>, <fpage>1</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1145/3054912</pub-id></citation>
</ref>
<ref id="B70">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kaelbling</surname> <given-names>L. P.</given-names></name></person-group> (<year>1993</year>). <source>Learning in Embedded Systems</source>. MIT Press.</citation>
</ref>
<ref id="B71">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kanal</surname> <given-names>L.</given-names></name> <name><surname>Kumar</surname> <given-names>V.</given-names></name></person-group> (<year>2012</year>). <source>Search in Artificial Intelligence</source>. Springer Science &#x00026; Business Media.</citation>
</ref>
<ref id="B72">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kappen</surname> <given-names>H. J.</given-names></name> <name><surname>G&#x000F3;mez</surname> <given-names>V.</given-names></name> <name><surname>Opper</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>Optimal control as a graphical model inference problem</article-title>. <source>Mach. Learn</source>. <volume>87</volume>, <fpage>159</fpage>&#x02013;<lpage>182</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-012-5278-7</pub-id></citation></ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kautz</surname> <given-names>H.</given-names></name> <name><surname>Selman</surname> <given-names>B.</given-names></name> <name><surname>Hoffmann</surname> <given-names>J.</given-names></name></person-group> (<year>2006</year>). <article-title>&#x0201C;SatPlan: planning as satisfiability,&#x0201D;</article-title> in <source>5th International Planning Competition, Vol. 20</source> (<publisher-loc>Cumbria</publisher-loc>), <fpage>156</fpage>.</citation>
</ref>
<ref id="B74">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kautz</surname> <given-names>H. A.</given-names></name> <name><surname>Selman</surname> <given-names>B.</given-names></name></person-group> (<year>1992</year>). <article-title>&#x0201C;Planning as satisfiability,&#x0201D;</article-title> in <source>ECAI, Vol. 92</source>, <fpage>359</fpage>&#x02013;<lpage>363</lpage>.</citation>
</ref>
<ref id="B75">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kearns</surname> <given-names>M.</given-names></name> <name><surname>Mansour</surname> <given-names>Y.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name></person-group> (<year>2002</year>). <article-title>A sparse sampling algorithm for near-optimal planning in large Markov decision processes</article-title>. <source>Mach. Learn</source>. <volume>49</volume>, <fpage>193</fpage>&#x02013;<lpage>208</lpage>. <pub-id pub-id-type="doi">10.1023/A:1017932429737</pub-id></citation>
</ref>
<ref id="B76">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Keller</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <source>Anytime optimal MDP planning with trial-based heuristic tree search</source> (<publisher-loc>Ph.D. thesis</publisher-loc>). University of Freiburg, Freiburg im Breisgau, Germany.</citation>
</ref>
<ref id="B77">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Keller</surname> <given-names>T.</given-names></name> <name><surname>Helmert</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;Trial-based heuristic tree search for finite horizon MDPs,&#x0201D;</article-title> in <source>Twenty-Third International Conference on Automated Planning and Scheduling</source> (<publisher-loc>Rome</publisher-loc>).</citation>
</ref>
<ref id="B78">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kocsis</surname> <given-names>L.</given-names></name> <name><surname>Szepesv&#x000E1;ri</surname> <given-names>C.</given-names></name></person-group> (<year>2006</year>). <article-title>Bandit based Monte-Carlo planning</article-title>. <source>ECML</source> <volume>6</volume>, <fpage>282</fpage>&#x02013;<lpage>293</lpage>. <pub-id pub-id-type="doi">10.1007/11871842_29</pub-id></citation>
</ref>
<ref id="B79">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kolobov</surname> <given-names>A.</given-names></name></person-group> (<year>2012</year>). <article-title>Planning with Markov decision processes: an AI perspective</article-title>. <source>Synthesis Lectures Artif. Intell. Mach. Learn</source>. <volume>6</volume>, <fpage>1</fpage>&#x02013;<lpage>210</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-031-01559-5</pub-id></citation>
</ref>
<ref id="B80">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Konda</surname> <given-names>V.</given-names></name> <name><surname>Tsitsiklis</surname> <given-names>J.</given-names></name></person-group> (<year>1999</year>). <article-title>&#x0201C;Actor-critic algorithms,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Denver, CO</publisher-loc>).</citation>
</ref>
<ref id="B81">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Korf</surname> <given-names>R. E.</given-names></name></person-group> (<year>1985</year>). <article-title>Depth-first iterative-deepening: an optimal admissible tree search</article-title>. <source>Artif. Intell</source>. <volume>27</volume>, <fpage>97</fpage>&#x02013;<lpage>109</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(85)90084-0</pub-id></citation>
</ref>
<ref id="B82">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Korf</surname> <given-names>R. E.</given-names></name></person-group> (<year>1990</year>). <article-title>Real-time heuristic search</article-title>. <source>Artif. Intell</source>. <volume>42</volume>, <fpage>189</fpage>&#x02013;<lpage>211</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(90)90054-4</pub-id></citation>
</ref>
<ref id="B83">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Korf</surname> <given-names>R. E.</given-names></name></person-group> (<year>1993</year>). <article-title>Linear-space best-first search</article-title>. <source>Artif. Intell</source>. <volume>62</volume>, <fpage>41</fpage>&#x02013;<lpage>78</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(93)90045-D</pub-id></citation>
</ref>
<ref id="B84">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kulkarni</surname> <given-names>T. D.</given-names></name> <name><surname>Narasimhan</surname> <given-names>K.</given-names></name> <name><surname>Saeedi</surname> <given-names>A.</given-names></name> <name><surname>Tenenbaum</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Barcelona</publisher-loc>), <fpage>3675</fpage>&#x02013;<lpage>3683</lpage>.</citation>
</ref>
<ref id="B85">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>LaValle</surname> <given-names>S.</given-names></name></person-group> (<year>1998</year>). <article-title>Rapidly-exploring random trees: a new tool for path planning</article-title>. <source>Technical Report TR 98-11, Computer Science Department, Iowa State University</source>.</citation>
</ref>
<ref id="B86">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>LaValle</surname> <given-names>S. M.</given-names></name></person-group> (<year>2006</year>). <source>Planning Algorithms</source>. <publisher-loc>Cambridge</publisher-loc>: <publisher-name>Cambridge University Press</publisher-name>.</citation>
</ref>
<ref id="B87">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levine</surname> <given-names>S.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Learning neural network policies with guided policy search under unknown dynamics,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>1071</fpage>&#x02013;<lpage>1079</lpage>.</citation>
</ref>
<ref id="B88">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levine</surname> <given-names>S.</given-names></name> <name><surname>Koltun</surname> <given-names>V.</given-names></name></person-group> (<year>2013</year>). <article-title>&#x0201C;Guided policy search,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>Atlanta</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>9</lpage>.</citation>
</ref>
<ref id="B89">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Levine</surname> <given-names>W. S.</given-names></name></person-group> (<year>2018</year>). <source>The Control Handbook (Three Volume Set)</source>. <publisher-loc>Boca Raton, FL</publisher-loc>: <publisher-name>CRC Press</publisher-name>.</citation>
</ref>
<ref id="B90">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lewis</surname> <given-names>F. L.</given-names></name> <name><surname>Vrabie</surname> <given-names>D.</given-names></name> <name><surname>Syrmos</surname> <given-names>V. L.</given-names></name></person-group> (<year>2012</year>). <source>Optimal Control</source>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x00026; Sons</publisher-name>.</citation>
</ref>
<ref id="B91">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lillicrap</surname> <given-names>T. P.</given-names></name> <name><surname>Hunt</surname> <given-names>J. J.</given-names></name> <name><surname>Pritzel</surname> <given-names>A.</given-names></name> <name><surname>Heess</surname> <given-names>N.</given-names></name> <name><surname>Erez</surname> <given-names>T.</given-names></name> <name><surname>Tassa</surname> <given-names>Y.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Continuous control with deep reinforcement learning</article-title>. <source>arXiv preprint arXiv:1509.02971</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1509.02971</pub-id></citation></ref>
<ref id="B92">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lipovetzky</surname> <given-names>N.</given-names></name> <name><surname>Geffner</surname> <given-names>H.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Best-first width search: exploration and exploitation in classical planning,&#x0201D;</article-title> in <source>Thirty-First AAAI Conference on Artificial Intelligence</source> (<publisher-loc>San Francisco, CA</publisher-loc>).</citation>
</ref>
<ref id="B93">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lopes</surname> <given-names>M.</given-names></name> <name><surname>Lang</surname> <given-names>T.</given-names></name> <name><surname>Toussaint</surname> <given-names>M.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name></person-group> (<year>2012</year>). <article-title>&#x0201C;Exploration in model-based reinforcement learning by empirically estimating learning progress,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Lake Tahoe</publisher-loc>), <fpage>206</fpage>&#x02013;<lpage>214</lpage>.</citation>
</ref>
<ref id="B94">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mannor</surname> <given-names>S.</given-names></name> <name><surname>Rubinstein</surname> <given-names>R. Y.</given-names></name> <name><surname>Gat</surname> <given-names>Y.</given-names></name></person-group> (<year>2003</year>). <article-title>&#x0201C;The cross entropy method for fast policy search,&#x0201D;</article-title> in <source>Proceedings of the 20th International Conference on Machine Learning (ICML-03)</source> (<publisher-loc>Washington, DC</publisher-loc>), <fpage>512</fpage>&#x02013;<lpage>519</lpage>.</citation>
</ref>
<ref id="B95">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Matiisen</surname> <given-names>T.</given-names></name> <name><surname>Oliver</surname> <given-names>A.</given-names></name> <name><surname>Cohen</surname> <given-names>T.</given-names></name> <name><surname>Schulman</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>Teacher-student curriculum learning</article-title>. <source>arXiv preprint arXiv:1707.00183</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1707.00183</pub-id></citation>
</ref>
<ref id="B96">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mayne</surname> <given-names>D. Q.</given-names></name> <name><surname>Michalska</surname> <given-names>H.</given-names></name></person-group> (<year>1990</year>). <article-title>Receding horizon control of nonlinear systems</article-title>. <source>IEEE Trans. Automat. Contr</source>. <volume>35</volume>, <fpage>814</fpage>&#x02013;<lpage>824</lpage>. <pub-id pub-id-type="doi">10.1109/9.57020</pub-id></citation>
</ref>
<ref id="B97">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McDermott</surname> <given-names>D.</given-names></name></person-group> (<year>1978</year>). <article-title>Planning and acting</article-title>. <source>Cogn. Sci</source>. <volume>2</volume>, <fpage>71</fpage>&#x02013;<lpage>109</lpage>. <pub-id pub-id-type="doi">10.1207/s15516709cog0202_1</pub-id></citation>
</ref>
<ref id="B98">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McMahan</surname> <given-names>H. B.</given-names></name> <name><surname>Likhachev</surname> <given-names>M.</given-names></name> <name><surname>Gordon</surname> <given-names>G. J.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;Bounded real-time dynamic programming: RTDP with monotone upper bounds and performance guarantees,&#x0201D;</article-title> in <source>Proceedings of the 22nd International Conference on Machine Learning</source> (<publisher-loc>Bonn</publisher-loc>), <fpage>569</fpage>&#x02013;<lpage>576</lpage>.</citation>
</ref>
<ref id="B99">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mnih</surname> <given-names>V.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Rusu</surname> <given-names>A. A.</given-names></name> <name><surname>Veness</surname> <given-names>J.</given-names></name> <name><surname>Bellemare</surname> <given-names>M. G.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Human-level control through deep reinforcement learning</article-title>. <source>Nature</source> <volume>518</volume>, <fpage>529</fpage>&#x02013;<lpage>533</lpage>. <pub-id pub-id-type="doi">10.1038/nature14236</pub-id><pub-id pub-id-type="pmid">25719670</pub-id></citation></ref>
<ref id="B100">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Moerland</surname> <given-names>T. M.</given-names></name> <name><surname>Broekens</surname> <given-names>J.</given-names></name> <name><surname>Jonker</surname> <given-names>C. M.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Efficient exploration with double uncertain value networks,&#x0201D;</article-title> in <source>Deep Reinforcement Learning Symposium, 31st Conference on Neural Information Processing Systems (NIPS)</source> (<publisher-loc>Long Beach, CA</publisher-loc>).</citation>
</ref>
<ref id="B101">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Moerland</surname> <given-names>T. M.</given-names></name> <name><surname>Broekens</surname> <given-names>J.</given-names></name> <name><surname>Jonker</surname> <given-names>C. M.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;The potential of the return distribution for exploration in RL,&#x0201D;</article-title> in <source>Exploration in Reinforcement Learning Workshop, 35th International Conference on Machine Learning (ICML)</source> (<publisher-loc>Stockholm</publisher-loc>).</citation>
</ref>
<ref id="B102">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Moerland</surname> <given-names>T. M.</given-names></name> <name><surname>Broekens</surname> <given-names>J.</given-names></name> <name><surname>Jonker</surname> <given-names>C. M.</given-names></name></person-group> (<year>2020a</year>). <article-title>Model-based reinforcement learning: a survey</article-title>. <source>arXiv preprint arXiv:2006.16712</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2006.16712</pub-id></citation>
</ref>
<ref id="B103">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Moerland</surname> <given-names>T. M.</given-names></name> <name><surname>Deichler</surname> <given-names>A.</given-names></name> <name><surname>Baldi</surname> <given-names>S.</given-names></name> <name><surname>Broekens</surname> <given-names>J.</given-names></name> <name><surname>Jonker</surname> <given-names>C. M.</given-names></name></person-group> (<year>2020b</year>). <article-title>Think too fast nor too slow: the computational trade-off between planning and reinforcement learning</article-title>. <source>arXiv preprint arXiv:2005.07404</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2005.07404</pub-id></citation>
</ref>
<ref id="B104">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moore</surname> <given-names>A. W.</given-names></name> <name><surname>Atkeson</surname> <given-names>C. G.</given-names></name></person-group> (<year>1993</year>). <article-title>Prioritized sweeping: reinforcement learning with less data and less time</article-title>. <source>Mach. Learn</source>. <volume>13</volume>, <fpage>103</fpage>&#x02013;<lpage>130</lpage>. <pub-id pub-id-type="doi">10.1007/BF00993104</pub-id></citation>
</ref>
<ref id="B105">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moore</surname> <given-names>E. F.</given-names></name></person-group> (<year>1959</year>). <article-title>The shortest path through a maze</article-title>. <source>Proc. Int. Symp. Switch. Theory</source> <volume>1959</volume>, <fpage>285</fpage>&#x02013;<lpage>292</lpage>.</citation>
</ref>
<ref id="B106">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Morari</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>J. H.</given-names></name></person-group> (<year>1999</year>). <article-title>Model predictive control: past, present and future</article-title>. <source>Comput. Chem. Eng</source>. <volume>23</volume>, <fpage>667</fpage>&#x02013;<lpage>682</lpage>. <pub-id pub-id-type="doi">10.1016/S0098-1354(98)00301-9</pub-id></citation>
</ref>
<ref id="B107">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Moriarty</surname> <given-names>D. E.</given-names></name> <name><surname>Schultz</surname> <given-names>A. C.</given-names></name> <name><surname>Grefenstette</surname> <given-names>J. J.</given-names></name></person-group> (<year>1999</year>). <article-title>Evolutionary algorithms for reinforcement learning</article-title>. <source>J. Artif. Intell. Res</source>. <volume>11</volume>, <fpage>241</fpage>&#x02013;<lpage>276</lpage>. <pub-id pub-id-type="doi">10.1613/jair.613</pub-id></citation>
</ref>
<ref id="B108">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Munos</surname> <given-names>R.</given-names></name> <name><surname>Stepleton</surname> <given-names>T.</given-names></name> <name><surname>Harutyunyan</surname> <given-names>A.</given-names></name> <name><surname>Bellemare</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Safe and efficient off-policy reinforcement learning,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Barcelona</publisher-loc>), <fpage>1054</fpage>&#x02013;<lpage>1062</lpage>.</citation>
</ref>
<ref id="B109">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nilsson</surname> <given-names>N. J.</given-names></name></person-group> (<year>1971</year>). <source>Problem-Solving Methods in Artificial Intelligence</source>. <publisher-name>McGraw-Hill Pub. Co.</publisher-name></citation>
</ref>
<ref id="B110">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nilsson</surname> <given-names>N. J.</given-names></name></person-group> (<year>1982</year>). <source>Principles of Artificial Intelligence</source>. <publisher-name>Springer Science &#x00026; Business Media</publisher-name>.</citation>
</ref>
<ref id="B111">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Osband</surname> <given-names>I.</given-names></name> <name><surname>Blundell</surname> <given-names>C.</given-names></name> <name><surname>Pritzel</surname> <given-names>A.</given-names></name> <name><surname>Van Roy</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Deep exploration via bootstrapped DQN,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Barcelona</publisher-loc>), <fpage>4026</fpage>&#x02013;<lpage>4034</lpage>.</citation>
</ref>
<ref id="B112">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name> <name><surname>Kaplan</surname> <given-names>F.</given-names></name> <name><surname>Hafner</surname> <given-names>V. V.</given-names></name></person-group> (<year>2007</year>). <article-title>Intrinsic motivation systems for autonomous mental development</article-title>. <source>IEEE Trans. Evolut. Comput</source>. <volume>11</volume>, <fpage>265</fpage>&#x02013;<lpage>286</lpage>. <pub-id pub-id-type="doi">10.1109/TEVC.2006.890271</pub-id></citation></ref>
<ref id="B113">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pathak</surname> <given-names>D.</given-names></name> <name><surname>Agrawal</surname> <given-names>P.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Curiosity-driven exploration by self-supervised prediction,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>, <fpage>16</fpage>&#x02013;<lpage>17</lpage>.</citation>
</ref>
<ref id="B114">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pearl</surname> <given-names>J.</given-names></name></person-group> (<year>1984</year>). <source>Heuristics: Intelligent Search Strategies for Computer Problem Solving</source>. <publisher-name>Addison-Wesley Longman Publishing Co., Inc.</publisher-name></citation></ref>
<ref id="B115">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>P&#x000E9;r&#x000E9;</surname> <given-names>A.</given-names></name> <name><surname>Forestier</surname> <given-names>S.</given-names></name> <name><surname>Sigaud</surname> <given-names>O.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Unsupervised learning of goal spaces for intrinsically motivated goal exploration</article-title>. <source>arXiv preprint arXiv:1803.00781</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1803.00781</pub-id></citation></ref>
<ref id="B116">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Peters</surname> <given-names>J.</given-names></name> <name><surname>Mulling</surname> <given-names>K.</given-names></name> <name><surname>Altun</surname> <given-names>Y.</given-names></name></person-group> (<year>2010</year>). <article-title>&#x0201C;Relative entropy policy search,&#x0201D;</article-title> in <source>Twenty-Fourth AAAI Conference on Artificial Intelligence</source>.</citation>
</ref>
<ref id="B117">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Plaat</surname> <given-names>A.</given-names></name> <name><surname>Kosters</surname> <given-names>W.</given-names></name> <name><surname>Preuss</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>High-accuracy model-based reinforcement learning, a survey</article-title>. <source>arXiv preprint arXiv:2107.08241</source>. <pub-id pub-id-type="doi">10.48550/arXiv.2107.08241</pub-id></citation>
</ref>
<ref id="B118">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pohl</surname> <given-names>I.</given-names></name></person-group> (<year>1970</year>). <article-title>Heuristic search viewed as path finding in a graph</article-title>. <source>Artif. Intell</source>. <volume>1</volume>, <fpage>193</fpage>&#x02013;<lpage>204</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(70)90007-X</pub-id></citation>
</ref>
<ref id="B119">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Powell</surname> <given-names>W. B.</given-names></name></person-group> (<year>2007</year>). <source>Approximate Dynamic Programming: Solving the Curses of Dimensionality, Vol. 703</source>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x00026; Sons</publisher-name>.</citation>
</ref>
<ref id="B120">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Precup</surname> <given-names>D.</given-names></name></person-group> (<year>2000</year>). <article-title>&#x0201C;Eligibility traces for off-policy policy evaluation,&#x0201D;</article-title> in <source>Computer Science Department Faculty Publication Series</source>, 80.</citation>
</ref>
<ref id="B121">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pritzel</surname> <given-names>A.</given-names></name> <name><surname>Uria</surname> <given-names>B.</given-names></name> <name><surname>Srinivasan</surname> <given-names>S.</given-names></name> <name><surname>Badia</surname> <given-names>A. P.</given-names></name> <name><surname>Vinyals</surname> <given-names>O.</given-names></name> <name><surname>Hassabis</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>&#x0201C;Neural episodic control,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>2827</fpage>&#x02013;<lpage>2836</lpage>.</citation>
</ref>
<ref id="B122">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Puterman</surname> <given-names>M. L.</given-names></name></person-group> (<year>2014</year>). <source>Markov Decision Processes: Discrete Stochastic Dynamic Programming</source>. <publisher-loc>Hoboken, NJ</publisher-loc>: <publisher-name>John Wiley &#x00026; Sons</publisher-name>.</citation>
</ref>
<ref id="B123">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rumelhart</surname> <given-names>D. E.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name> <name><surname>Williams</surname> <given-names>R. J.</given-names></name></person-group> (<year>1986</year>). <article-title>Learning representations by back-propagating errors</article-title>. <source>Nature</source> <volume>323</volume>, <fpage>533</fpage>&#x02013;<lpage>536</lpage>. <pub-id pub-id-type="doi">10.1038/323533a0</pub-id></citation>
</ref>
<ref id="B124">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rummery</surname> <given-names>G. A.</given-names></name> <name><surname>Niranjan</surname> <given-names>M.</given-names></name></person-group> (<year>1994</year>). <source>On-line Q-Learning Using Connectionist Systems, Vol. 37</source>. <publisher-loc>Cambridge, UK</publisher-loc>: <publisher-name>University of Cambridge, Department of Engineering Cambridge</publisher-name>.</citation>
</ref>
<ref id="B125">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russell</surname> <given-names>S. J.</given-names></name></person-group> (<year>1992</year>). <article-title>Efficient memory-bounded search methods</article-title>. <source>ECAI</source> <volume>92</volume>, <fpage>1</fpage>&#x02013;<lpage>5</lpage>.</citation></ref>
<ref id="B126">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Russell</surname> <given-names>S. J.</given-names></name> <name><surname>Norvig</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <source>Artificial Intelligence: A Modern Approach</source>. <publisher-loc>Kuala Lumpur</publisher-loc>: <publisher-name>Pearson Education Limited</publisher-name>.</citation>
</ref>
<ref id="B127">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Salimans</surname> <given-names>T.</given-names></name> <name><surname>Ho</surname> <given-names>J.</given-names></name> <name><surname>Chen</surname> <given-names>X.</given-names></name> <name><surname>Sidor</surname> <given-names>S.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name></person-group> (<year>2017</year>). <article-title>Evolution strategies as a scalable alternative to reinforcement learning</article-title>. <source>arXiv preprint arXiv:1703.03864</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1703.03864</pub-id></citation></ref>
<ref id="B128">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Samuel</surname> <given-names>A. L.</given-names></name></person-group> (<year>1967</year>). <article-title>Some studies in machine learning using the game of checkers. II-Recent progress</article-title>. <source>IBM J. Res. Dev</source>. <volume>11</volume>, <fpage>601</fpage>&#x02013;<lpage>617</lpage>. <pub-id pub-id-type="doi">10.1147/rd.116.0601</pub-id></citation>
</ref>
<ref id="B129">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sanner</surname> <given-names>S.</given-names></name> <name><surname>Goetschalckx</surname> <given-names>R.</given-names></name> <name><surname>Driessens</surname> <given-names>K.</given-names></name> <name><surname>Shani</surname> <given-names>G.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Bayesian real-time dynamic programming,&#x0201D;</article-title> in <source>Twenty-First International Joint Conference on Artificial Intelligence</source>.</citation>
</ref>
<ref id="B130">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schaul</surname> <given-names>T.</given-names></name> <name><surname>Horgan</surname> <given-names>D.</given-names></name> <name><surname>Gregor</surname> <given-names>K.</given-names></name> <name><surname>Silver</surname> <given-names>D.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Universal value function approximators,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>1312</fpage>&#x02013;<lpage>1320</lpage>.</citation>
</ref>
<ref id="B131">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schmidhuber</surname> <given-names>J.</given-names></name></person-group> (<year>1991</year>). <article-title>&#x0201C;A possibility for implementing curiosity and boredom in model-building neural controllers,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats</source>, <fpage>222</fpage>&#x02013;<lpage>227</lpage>.</citation>
</ref>
<ref id="B132">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schulman</surname> <given-names>J.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Jordan</surname> <given-names>M.</given-names></name> <name><surname>Moritz</surname> <given-names>P.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Trust region policy optimization,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>1889</fpage>&#x02013;<lpage>1897</lpage>.</citation></ref>
<ref id="B133">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schulman</surname> <given-names>J.</given-names></name> <name><surname>Moritz</surname> <given-names>P.</given-names></name> <name><surname>Levine</surname> <given-names>S.</given-names></name> <name><surname>Jordan</surname> <given-names>M.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;High-dimensional continuous control using generalized advantage estimation,&#x0201D;</article-title> in <source>Proceedings of the International Conference on Learning Representations (ICLR)</source>.</citation>
</ref>
<ref id="B134">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schulman</surname> <given-names>J.</given-names></name> <name><surname>Wolski</surname> <given-names>F.</given-names></name> <name><surname>Dhariwal</surname> <given-names>P.</given-names></name> <name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Klimov</surname> <given-names>O.</given-names></name></person-group> (<year>2017</year>). <article-title>Proximal policy optimization algorithms</article-title>. <source>arXiv preprint arXiv:1707.06347</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1707.06347</pub-id></citation>
</ref>
<ref id="B135">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schulte</surname> <given-names>T.</given-names></name> <name><surname>Keller</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Balancing exploration and exploitation in classical planning,&#x0201D;</article-title> in <source>International Symposium on Combinatorial Search, Vol. 5</source>.</citation>
</ref>
<ref id="B136">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sekar</surname> <given-names>R.</given-names></name> <name><surname>Rybkin</surname> <given-names>O.</given-names></name> <name><surname>Daniilidis</surname> <given-names>K.</given-names></name> <name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Hafner</surname> <given-names>D.</given-names></name> <name><surname>Pathak</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Planning to explore via self-supervised world models,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source> (<publisher-loc>PMLR</publisher-loc>), <fpage>8583</fpage>&#x02013;<lpage>8592</lpage>.</citation>
</ref>
<ref id="B137">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Hubert</surname> <given-names>T.</given-names></name> <name><surname>Schrittwieser</surname> <given-names>J.</given-names></name> <name><surname>Antonoglou</surname> <given-names>I.</given-names></name> <name><surname>Lai</surname> <given-names>M.</given-names></name> <name><surname>Guez</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2018</year>). <article-title>A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play</article-title>. <source>Science</source> <volume>362</volume>, <fpage>1140</fpage>&#x02013;<lpage>1144</lpage>. <pub-id pub-id-type="doi">10.1126/science.aar6404</pub-id><pub-id pub-id-type="pmid">30523106</pub-id></citation></ref>
<ref id="B138">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Lever</surname> <given-names>G.</given-names></name> <name><surname>Heess</surname> <given-names>N.</given-names></name> <name><surname>Degris</surname> <given-names>T.</given-names></name> <name><surname>Wierstra</surname> <given-names>D.</given-names></name> <name><surname>Riedmiller</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Deterministic policy gradient algorithms,&#x0201D;</article-title> in <source>International Conference on Machine Learning</source>, <fpage>387</fpage>&#x02013;<lpage>395</lpage>.</citation>
</ref>
<ref id="B139">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Silver</surname> <given-names>D.</given-names></name> <name><surname>Schrittwieser</surname> <given-names>J.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Antonoglou</surname> <given-names>I.</given-names></name> <name><surname>Huang</surname> <given-names>A.</given-names></name> <name><surname>Guez</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Mastering the game of Go without human knowledge</article-title>. <source>Nature</source> <volume>550</volume>, <fpage>354</fpage>&#x02013;<lpage>359</lpage>. <pub-id pub-id-type="doi">10.1038/nature24270</pub-id><pub-id pub-id-type="pmid">29052630</pub-id></citation></ref>
<ref id="B140">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simon</surname> <given-names>H. A.</given-names></name> <name><surname>Newell</surname> <given-names>A.</given-names></name></person-group> (<year>1958</year>). <article-title>Heuristic problem solving: the next advance in operations research</article-title>. <source>Oper. Res</source>. <volume>6</volume>, <fpage>1</fpage>&#x02013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1287/opre.6.1.1</pub-id></citation>
</ref>
<ref id="B141">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Singh</surname> <given-names>S. P.</given-names></name> <name><surname>Sutton</surname> <given-names>R. S.</given-names></name></person-group> (<year>1996</year>). <article-title>Reinforcement learning with replacing eligibility traces</article-title>. <source>Mach. Learn</source>. <volume>22</volume>, <fpage>123</fpage>&#x02013;<lpage>158</lpage>. <pub-id pub-id-type="doi">10.1007/BF00114726</pub-id></citation>
</ref>
<ref id="B142">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Slate</surname> <given-names>D. J.</given-names></name> <name><surname>Atkin</surname> <given-names>L. R.</given-names></name></person-group> (<year>1983</year>). <article-title>&#x0201C;Chess 4.5&#x02013;the Northwestern University chess program,&#x0201D;</article-title> in <source>Chess Skill in Man and Machine</source> (<publisher-loc>Springer</publisher-loc>), <fpage>82</fpage>&#x02013;<lpage>118</lpage>.</citation>
</ref>
<ref id="B143">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Slivkins</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Introduction to multi-armed bandits</article-title>. <source>Found. Trends Mach. Learn</source>. <volume>12</volume>, <fpage>1</fpage>&#x02013;<lpage>286</lpage>. <pub-id pub-id-type="doi">10.1561/9781680836219</pub-id></citation>
</ref>
<ref id="B144">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Smith</surname> <given-names>T.</given-names></name> <name><surname>Simmons</surname> <given-names>R.</given-names></name></person-group> (<year>2006</year>). <article-title>&#x0201C;Focused real-time dynamic programming for MDPs: squeezing more out of a heuristic,&#x0201D;</article-title> in <source>AAAI</source>, <fpage>1227</fpage>&#x02013;<lpage>1232</lpage>.</citation>
</ref>
<ref id="B145">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name></person-group> (<year>1988</year>). <article-title>Learning to predict by the methods of temporal differences</article-title>. <source>Mach. Learn</source>. <volume>3</volume>, <fpage>9</fpage>&#x02013;<lpage>44</lpage>. <pub-id pub-id-type="doi">10.1007/BF00115009</pub-id></citation>
</ref>
<ref id="B146">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name></person-group> (<year>1990</year>). <article-title>&#x0201C;Integrated architectures for learning, planning, and reacting based on approximating dynamic programming,&#x0201D;</article-title> in <source>Machine Learning Proceedings 1990</source> (<publisher-loc>Elsevier</publisher-loc>), <fpage>216</fpage>&#x02013;<lpage>224</lpage>.</citation>
</ref>
<ref id="B147">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name></person-group> (<year>1996</year>). <article-title>&#x0201C;Generalization in reinforcement learning: successful examples using sparse coarse coding,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>1038</fpage>&#x02013;<lpage>1044</lpage>.</citation>
</ref>
<ref id="B148">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>2018</year>). <source>Reinforcement Learning: An Introduction</source>. <publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>.</citation>
</ref>
<ref id="B149">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>McAllester</surname> <given-names>D. A.</given-names></name> <name><surname>Singh</surname> <given-names>S. P.</given-names></name> <name><surname>Mansour</surname> <given-names>Y.</given-names></name></person-group> (<year>2000</year>). <article-title>&#x0201C;Policy gradient methods for reinforcement learning with function approximation,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems</source>, <fpage>1057</fpage>&#x02013;<lpage>1063</lpage>.</citation>
</ref>
<ref id="B150">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tarjan</surname> <given-names>R.</given-names></name></person-group> (<year>1972</year>). <article-title>Depth-first search and linear graph algorithms</article-title>. <source>SIAM J. Comput</source>. <volume>1</volume>, <fpage>146</fpage>&#x02013;<lpage>160</lpage>. <pub-id pub-id-type="doi">10.1137/0201010</pub-id></citation>
</ref>
<ref id="B151">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Taylor</surname> <given-names>M. E.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2009</year>). <article-title>Transfer learning for reinforcement learning domains: a survey</article-title>. <source>J. Mach. Learn. Res</source>. <volume>10</volume>, <fpage>1633</fpage>&#x02013;<lpage>1685</lpage>.</citation></ref>
<ref id="B152">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tesauro</surname> <given-names>G.</given-names></name> <name><surname>Galperin</surname> <given-names>G. R.</given-names></name></person-group> (<year>1997</year>). <article-title>&#x0201C;On-line policy improvement using monte-carlo search,&#x0201D;</article-title> in <source>Advances in Neural Information Processing Systems 9</source>, eds M. C. Mozer, M. I. Jordan, and T. Petsche (Denver, CO: MIT Press), <fpage>1068</fpage>&#x02013;<lpage>1074</lpage>.</citation>
</ref>
<ref id="B153">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thompson</surname> <given-names>W. R.</given-names></name></person-group> (<year>1933</year>). <article-title>On the likelihood that one unknown probability exceeds another in view of the evidence of two samples</article-title>. <source>Biometrika</source> <volume>25</volume>, <fpage>285</fpage>&#x02013;<lpage>294</lpage>. <pub-id pub-id-type="doi">10.1093/biomet/25.3-4.285</pub-id></citation>
</ref>
<ref id="B154">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Todorov</surname> <given-names>E.</given-names></name> <name><surname>Li</surname> <given-names>W.</given-names></name></person-group> (<year>2005</year>). <article-title>&#x0201C;A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems,&#x0201D;</article-title> in <source>Proceedings of the 2005, American Control Conference, 2005</source> (<publisher-loc>Portland, OR</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>300</fpage>&#x02013;<lpage>306</lpage>.</citation>
</ref>
<ref id="B155">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Toussaint</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;Robot trajectory optimization using approximate inference,&#x0201D;</article-title> in <source>Proceedings of the 26th Annual International Conference on Machine Learning</source> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1049</fpage>&#x02013;<lpage>1056</lpage>.</citation>
</ref>
<ref id="B156">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Valenzano</surname> <given-names>R. A.</given-names></name> <name><surname>Sturtevant</surname> <given-names>N. R.</given-names></name> <name><surname>Schaeffer</surname> <given-names>J.</given-names></name> <name><surname>Xie</surname> <given-names>F.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;A comparison of knowledge-based GBFS enhancements and knowledge-free exploration,&#x0201D;</article-title> in <source>Twenty-Fourth International Conference on Automated Planning and Scheduling</source> (<publisher-loc>Portsmouth, NH</publisher-loc>).</citation>
</ref>
<ref id="B157">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Van Hasselt</surname> <given-names>H.</given-names></name> <name><surname>Doron</surname> <given-names>Y.</given-names></name> <name><surname>Strub</surname> <given-names>F.</given-names></name> <name><surname>Hessel</surname> <given-names>M.</given-names></name> <name><surname>Sonnerat</surname> <given-names>N.</given-names></name> <name><surname>Modayil</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Deep reinforcement learning and the deadly triad</article-title>. <source>arXiv preprint arXiv:1812.02648</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1812.02648</pub-id></citation>
</ref>
<ref id="B158">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Van Hasselt</surname> <given-names>H.</given-names></name> <name><surname>Wiering</surname> <given-names>M. A.</given-names></name></person-group> (<year>2007</year>). <article-title>&#x0201C;Reinforcement learning in continuous action spaces,&#x0201D;</article-title> in <source>2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning</source> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>272</fpage>&#x02013;<lpage>279</lpage>.</citation>
</ref>
<ref id="B159">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Van Seijen</surname> <given-names>H.</given-names></name> <name><surname>Van Hasselt</surname> <given-names>H.</given-names></name> <name><surname>Whiteson</surname> <given-names>S.</given-names></name> <name><surname>Wiering</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>&#x0201C;A theoretical and empirical analysis of expected Sarsa,&#x0201D;</article-title> in <source>2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning</source> (<publisher-loc>Nashville, TN</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>177</fpage>&#x02013;<lpage>184</lpage>.</citation>
</ref>
<ref id="B160">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>T.</given-names></name> <name><surname>Bao</surname> <given-names>X.</given-names></name> <name><surname>Clavera</surname> <given-names>I.</given-names></name> <name><surname>Hoang</surname> <given-names>J.</given-names></name> <name><surname>Wen</surname> <given-names>Y.</given-names></name> <name><surname>Langlois</surname> <given-names>E.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>Benchmarking model-based reinforcement learning</article-title>. <source>arXiv preprint arXiv:1907.02057</source>. <pub-id pub-id-type="doi">10.48550/arXiv.1907.02057</pub-id></citation>
</ref>
<ref id="B161">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Watkins</surname> <given-names>C. J.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name></person-group> (<year>1992</year>). <article-title>Q-learning</article-title>. <source>Mach. Learn</source>. <volume>8</volume>, <fpage>279</fpage>&#x02013;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1023/A:1022676722315</pub-id></citation>
</ref>
<ref id="B162">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Whiteson</surname> <given-names>S.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2006</year>). <article-title>Evolutionary function approximation for reinforcement learning</article-title>. <source>J. Mach. Learn. Res</source>. <volume>7</volume>, <fpage>877</fpage>&#x02013;<lpage>917</lpage>.</citation>
</ref>
<ref id="B163">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wiering</surname> <given-names>M.</given-names></name> <name><surname>Van Otterlo</surname> <given-names>M.</given-names></name></person-group> (<year>2012</year>). <article-title>Reinforcement learning</article-title>. <source>Adaptat. Learn. Optim</source>. <volume>12</volume>, <fpage>3</fpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-27645-3</pub-id></citation>
</ref>
<ref id="B164">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Williams</surname> <given-names>R. J.</given-names></name></person-group> (<year>1992</year>). <article-title>Simple statistical gradient-following algorithms for connectionist reinforcement learning</article-title>. <source>Mach. Learn</source>. <volume>8</volume>, <fpage>229</fpage>&#x02013;<lpage>256</lpage>. <pub-id pub-id-type="doi">10.1007/BF00992696</pub-id></citation>
</ref>
<ref id="B165">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>G.</given-names></name> <name><surname>Say</surname> <given-names>B.</given-names></name> <name><surname>Sanner</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Scalable planning with TensorFlow for hybrid nonlinear domains,&#x0201D;</article-title> in <source>31st Conference on Neural Information Processing Systems (NIPS 2017)</source> (<publisher-loc>Long Beach, CA</publisher-loc>).</citation>
</ref>
<ref id="B166">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yoon</surname> <given-names>S. W.</given-names></name> <name><surname>Fern</surname> <given-names>A.</given-names></name> <name><surname>Givan</surname> <given-names>R.</given-names></name></person-group> (<year>2007</year>). <article-title>&#x0201C;FF-Replan: a baseline for probabilistic planning,&#x0201D;</article-title> in <source>ICAPS Vol. 7</source> (<publisher-loc>Providence, RI</publisher-loc>), <fpage>352</fpage>&#x02013;<lpage>359</lpage>.</citation>
</ref>
</ref-list> 
</back>
</article> 