<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="review-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/frobt.2021.584075</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Reinforcement Learning With Human Advice: A Survey</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Najar</surname> <given-names>Anis</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1033652/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chetouani</surname> <given-names>Mohamed</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/119789/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Laboratoire de Neurosciences Cognitives Computationnelles, INSERM U960</institution>, <addr-line>Paris</addr-line>, <country>France</country></aff>
<aff id="aff2"><sup>2</sup><institution>Institute for Intelligent Systems and Robotics, Sorbonne Universit&#x000E9;, CNRS UMR 7222</institution>, <addr-line>Paris</addr-line>, <country>France</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Iolanda Leite, Royal Institute of Technology, Sweden</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Garrett Warnell, United States Army Research Laboratory, United States; Tesca Fitzgerald, Carnegie Mellon University, United States</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Anis Najar <email>anis.najar&#x00040;ens.fr</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Human-Robot Interaction, a section of the journal Frontiers in Robotics and AI</p></fn></author-notes>
<pub-date pub-type="epub">
<day>01</day>
<month>06</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>8</volume>
<elocation-id>584075</elocation-id>
<history>
<date date-type="received">
<day>16</day>
<month>07</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>03</day>
<month>03</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2021 Najar and Chetouani.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Najar and Chetouani</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>In this paper, we provide an overview of the existing methods for integrating human advice into a reinforcement learning process. We first propose a taxonomy of the different forms of advice that can be provided to a learning agent. We then describe the methods that can be used for interpreting advice when its meaning is not determined beforehand. Finally, we review different approaches for integrating advice into the learning process.</p></abstract>
<kwd-group>
<kwd>advice-taking systems</kwd>
<kwd>reinforcement learning</kwd>
<kwd>interactive machine learning</kwd>
<kwd>human-robot interaction</kwd>
<kwd>unlabeled teaching signals</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="3"/>
<equation-count count="28"/>
<ref-count count="135"/>
<page-count count="20"/>
<word-count count="17338"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Teaching a machine through natural interaction is an old idea dating back to the foundations of AI, as already stated by Alan Turing in 1950: &#x0201C;<italic>It can also be maintained that it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English. That process could follow the normal teaching of a child. Things would be pointed out and named, etc.&#x0201D;</italic> (Turing, <xref ref-type="bibr" rid="B124">1950</xref>). Since then, many efforts have been made to endow robots and artificial agents with the capacity to learn from humans in a natural and unconstrained manner (Chernova and Thomaz, <xref ref-type="bibr" rid="B21">2014</xref>). However, designing human-like learning robots still raises several challenges regarding their capacity to adapt to different teaching strategies and their ability to take advantage of the variety of teaching signals that can be produced by humans (Vollmer et al., <xref ref-type="bibr" rid="B127">2016</xref>).</p>
<p>The interactive machine learning literature references a plethora of teaching signals such as instructions (Pradyot et al., <xref ref-type="bibr" rid="B97">2012b</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>), demonstrations (Argall et al., <xref ref-type="bibr" rid="B6">2009</xref>), and feedback (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>; Najar et al., <xref ref-type="bibr" rid="B89">2016</xref>). These signals can be categorized in several ways depending on what, when, and how they are produced. For example, a common taxonomy is to divide interactive learning methods into three groups: learning from advice, learning from evaluative feedback (or critique), and learning from demonstration (LfD) (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>, <xref ref-type="bibr" rid="B56">2011b</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>). While this taxonomy is commonly used in the literature, it is not clear-cut, as the categories can overlap. For example, in some papers, evaluative feedback is considered as a particular type of advice (Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>; Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>). More rarely, demonstrations (Whitehead, <xref ref-type="bibr" rid="B130">1991</xref>; Lin, <xref ref-type="bibr" rid="B65">1992</xref>) have also been referred to as advice (Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Maclin et al., <xref ref-type="bibr" rid="B74">2005a</xref>). The definition of advice in the literature is relatively vague, with no specific constraints on what type of input can be provided to the learning agent. 
For example, it has been defined as &#x0201C;<italic>concept definitions, behavioral constraints, and performance heuristics&#x0201D;</italic> (Hayes-Roth et al., <xref ref-type="bibr" rid="B44">1981</xref>), or as &#x0201C;<italic>any external input to the control algorithm that could be used by the agent to take decisions about and modify the progress of its exploration or strengthen its belief in a policy&#x0201D;</italic> (Pradyot and Ravindran, <xref ref-type="bibr" rid="B98">2011</xref>). Although more specific definitions can be found, such as &#x0201C;<italic>suggesting an action when a certain condition is true&#x0201D;</italic> (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>), in other works advice also represents state preferences (Utgoff and Clouse, <xref ref-type="bibr" rid="B125">1991</xref>), action preferences (Maclin et al., <xref ref-type="bibr" rid="B74">2005a</xref>), constraints on action values (Maclin et al., <xref ref-type="bibr" rid="B75">2005b</xref>; Torrey et al., <xref ref-type="bibr" rid="B122">2008</xref>), explanations (Krening et al., <xref ref-type="bibr" rid="B62">2017</xref>), instructions (Clouse and Utgoff, <xref ref-type="bibr" rid="B25">1992</xref>; Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>; Rosenstein et al., <xref ref-type="bibr" rid="B100">2004</xref>), feedback (Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>; Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>; Celemin and Ruiz-Del-Solar, <xref ref-type="bibr" rid="B19">2019</xref>), or demonstrations (Whitehead, <xref ref-type="bibr" rid="B130">1991</xref>; Lin, <xref ref-type="bibr" rid="B65">1992</xref>; Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>). 
In some papers, the term feedback is used as shorthand for evaluative feedback (Thomaz and Breazeal, <xref ref-type="bibr" rid="B117">2006</xref>; Leon et al., <xref ref-type="bibr" rid="B64">2011</xref>; Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>; Knox et al., <xref ref-type="bibr" rid="B59">2013</xref>; Loftin et al., <xref ref-type="bibr" rid="B68">2016</xref>). However, the same term is sometimes used to refer to corrective feedback (Argall et al., <xref ref-type="bibr" rid="B5">2011</xref>). While these two types of feedback, evaluative and corrective, are sometimes designated by the same label, they are fundamentally different. The lack of consensus about terminology makes these concepts difficult to disentangle, and represents an obstacle to establishing a systematic understanding of how these teaching signals relate to each other from a computational point of view. The goal of this survey is to clarify some of the terminology used in the interactive machine learning literature by providing a taxonomy of the different forms of advice, and to review how these teaching signals can be integrated into a reinforcement learning (RL) process (Sutton and Barto, <xref ref-type="bibr" rid="B109">1998</xref>). In this survey, we define advice as <italic>teaching signals that can be communicated by the teacher to the learning system without executing the task</italic>. Thus, we do not cover LfD, since demonstration is different from advice under this definition, and comprehensive surveys on this topic already exist (Argall et al., <xref ref-type="bibr" rid="B6">2009</xref>; Chernova and Thomaz, <xref ref-type="bibr" rid="B21">2014</xref>).</p>
<p>Although the methods we cover belong to various mathematical frameworks, we mainly focus on the RL perspective. We use the terms &#x0201C;agent,&#x0201D; &#x0201C;robot,&#x0201D; and &#x0201C;system&#x0201D; interchangeably, abstracting away the platform on which the RL algorithm is implemented. Throughout this paper, we use the term &#x0201C;shaping&#x0201D; to refer to the mechanism by which advice is integrated into the learning process. Although this concept has mainly been used within the RL literature as a method for accelerating the learning process by providing the learning agent with intermediate rewards (Gullapalli and Barto, <xref ref-type="bibr" rid="B41">1992</xref>; Singh, <xref ref-type="bibr" rid="B103">1992</xref>; Dorigo and Colombetti, <xref ref-type="bibr" rid="B34">1994</xref>; Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>; Judah et al., <xref ref-type="bibr" rid="B48">2014</xref>; Cederborg et al., <xref ref-type="bibr" rid="B16">2015</xref>), the general meaning of shaping is equivalent to training, which is to make an agent&#x00027;s &#x0201C;<italic>behavior converge to a predefined target behavior&#x0201D;</italic> (Dorigo and Colombetti, <xref ref-type="bibr" rid="B34">1994</xref>).</p>
<p>The paper is organized as follows. We first introduce some background about RL in section 2. We then provide an overview of the existing methods for integrating human advice into an RL process in section 3. The different methods are discussed in section 4, before concluding the paper in section 5.</p>
</sec>
<sec id="s2">
<title>2. Reinforcement Learning</title>
<p>RL refers to a family of problems where an autonomous agent has to learn a sequential decision-making task (Sutton and Barto, <xref ref-type="bibr" rid="B109">1998</xref>). These problems are generally represented as a Markov decision process (MDP), defined as a tuple &#x0003C; <italic>S, A, T, R</italic>, &#x003B3; &#x0003E;. <italic>S</italic> represents the state-space over which the problem is defined and <italic>A</italic> is the set of actions the agent can perform at every time-step. <italic>T</italic> : <italic>S</italic> &#x000D7; <italic>A</italic> &#x02192; <italic>Pr</italic>(<italic>s</italic>&#x02032;|<italic>s, a</italic>) defines a state-transition probability function, where <italic>Pr</italic>(<italic>s</italic>&#x02032;|<italic>s, a</italic>) represents the probability that the agent transitions from state <italic>s</italic> to state <italic>s</italic>&#x02032; after executing action <italic>a</italic>. <italic>R</italic> : <italic>S</italic> &#x000D7; <italic>A</italic> &#x02192; &#x0211D; is a reward function that defines the reward <italic>r</italic>(<italic>s, a</italic>) that the agent gets for performing action <italic>a</italic> in state <italic>s</italic>. When, at time <italic>t</italic>, the agent performs an action <italic>a</italic><sub><italic>t</italic></sub> from state <italic>s</italic><sub><italic>t</italic></sub>, it receives a reward <italic>r</italic><sub><italic>t</italic></sub> and transitions to state <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub>. The discount factor, &#x003B3;, determines how much future rewards are taken into account in the current decision.</p>
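To make the tuple concrete, the following sketch writes out a tiny two-state MDP as plain Python dictionaries. The state names, action names, and all numerical values are illustrative choices for this sketch, not taken from the survey.

```python
# A minimal two-state MDP < S, A, T, R, gamma >; all names and
# values here are illustrative examples.
S = ["s0", "s1"]
A = ["stay", "go"]
gamma = 0.9  # discount factor

# T[(s, a)] maps each successor state s' to Pr(s' | s, a).
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 0.2, "s1": 0.8},
}

# R[(s, a)] is the reward r(s, a) for performing a in s.
R = {
    ("s0", "stay"): 0.0,
    ("s0", "go"):   1.0,
    ("s1", "stay"): 0.5,
    ("s1", "go"):   0.0,
}

# Sanity check: each transition distribution sums to one.
assert all(abs(sum(p.values()) - 1.0) < 1e-9 for p in T.values())
```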
<p>The behavior of the agent is represented as a policy &#x003C0; that defines the probability to select each action in every state: &#x02200;<italic>s</italic> &#x02208; <italic>S</italic>, &#x003C0;(<italic>s</italic>) &#x0003D; {&#x003C0;(<italic>s, a</italic>); <italic>a</italic> &#x02208; <italic>A</italic>} &#x0003D; {<italic>Pr</italic>(<italic>a</italic>|<italic>s</italic>); <italic>a</italic> &#x02208; <italic>A</italic>}. The quality of a policy is measured by the amount of rewards it enables the agent to collect over the long run. The expected amount of cumulative rewards, when starting from a state <italic>s</italic> and following a policy &#x003C0;, is given by the state-value function and is written as:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mi>R</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="false">]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Another form of value function, called the action-value function and denoted <italic>Q</italic><sup>&#x003C0;</sup>, provides more directly exploitable information than <italic>V</italic><sup>&#x003C0;</sup> for decision-making, as the agent has direct access to the value of each possible decision:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo stretchy="false">|</mml:mo><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mtext>&#x02003;</mml:mtext><mml:mo>;</mml:mo><mml:mo>&#x02200;</mml:mo><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
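Equations (1) and (2) suggest a short iterative policy-evaluation routine. The sketch below assumes the MDP is encoded as Python dictionaries (T[(s, a)] maps each successor state to its probability, R[(s, a)] gives the reward, pi[(s, a)] the action probability); this encoding and the function name are illustrative choices, not from the survey.

```python
def evaluate_policy(pi, T, R, states, actions, gamma=0.9, tol=1e-8):
    """Iteratively apply Equation (1) until the state values stop changing.

    pi[(s, a)]: probability of selecting a in s.
    T[(s, a)]: dict mapping each successor state s2 to Pr(s2 | s, a).
    R[(s, a)]: reward r(s, a).
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # V(s) = sum_a pi(s,a) [ R(s,a) + gamma * sum_s' Pr(s'|s,a) V(s') ]
            v = sum(
                pi[(s, a)]
                * (R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
                for a in actions
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```

For a single self-looping state with reward 1 and &#x3b3; = 0.9, this converges to V(s) = 1/(1 &#x2212; &#x3b3;) = 10.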
<p>To optimize its behavior, the agent must find the optimal policy &#x003C0;<sup>&#x0002A;</sup> that maximizes <italic>V</italic><sup>&#x003C0;</sup> and <italic>Q</italic><sup>&#x003C0;</sup>. When both the reward and transition functions are unknown, the optimal policy must be learnt from the rewards the agent obtains by interacting with its environment, using an RL algorithm. RL algorithms fall into three categories: value-based, policy-gradient, and Actor-Critic (Sutton and Barto, <xref ref-type="bibr" rid="B109">1998</xref>).</p>
<sec>
<title>2.1. Value-Based RL</title>
<p>In value-based RL, the optimal policy is obtained by iteratively optimizing the value function. Examples of value-based algorithms include Q-learning (Watkins and Dayan, <xref ref-type="bibr" rid="B128">1992</xref>) and SARSA (Sutton, <xref ref-type="bibr" rid="B108">1996</xref>).</p>
<p>In Q-learning, the action-value function of the optimal policy &#x003C0;<sup>&#x0002A;</sup> is computed iteratively. On every time-step <italic>t</italic>, when the agent transitions from state <italic>s</italic><sub><italic>t</italic></sub> to state <italic>s</italic><sub><italic>t</italic>&#x0002B;1</sub> by performing an action <italic>a</italic><sub><italic>t</italic></sub>, and receives a reward <italic>r</italic><sub><italic>t</italic></sub>, the Q-value of the last state-action pair is updated using:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mtext>&#x000A0;</mml:mtext><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>Q</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1; &#x02208; [0, 1] is a learning rate.</p>
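A minimal sketch of the update in Equation (3), with Q-values stored in a Python dictionary keyed by state-action pairs; the function name and encoding are illustrative choices of this sketch.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Equation (3): move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```

Starting from a zero-initialized table, a single update with r = 1, &#x3b1; = 0.5 moves the Q-value of the visited pair to 0.5.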
<p>At decision time, the policy &#x003C0; can be derived from the Q-function using different action-selection strategies. The &#x003F5;<italic>-greedy</italic> action-selection strategy consists of selecting, most of the time, the greedy action with respect to the Q-function, <inline-formula><mml:math id="M4"><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">argmax</mml:mo></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:munder><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, and, with a small probability &#x003F5;, a random action. With the <italic>softmax</italic> action-selection strategy, the policy &#x003C0; is derived at decision time by computing a softmax distribution over the Q-values:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
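Both action-selection strategies can be sketched as follows, again assuming a dictionary-based Q-table (an illustrative encoding). Shifting by the maximum Q-value in the softmax is a standard numerical precaution, not part of Equation (4).

```python
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # With probability epsilon, explore; otherwise act greedily w.r.t. Q.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def softmax_policy(Q, s, actions):
    # Equation (4): Pr(a | s) proportional to exp(Q(s, a)).
    m = max(Q[(s, a)] for a in actions)  # shift for numerical stability
    exp_q = {a: math.exp(Q[(s, a)] - m) for a in actions}
    z = sum(exp_q.values())
    return {a: e / z for a, e in exp_q.items()}
```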
<p>The SARSA algorithm is similar to Q-learning, differing only in the update of the Q-values:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M6"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo 
stretchy="false">]</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>a</italic><sub><italic>t</italic>&#x0002B;1</sub> is the action the agent selects at time-step <italic>t</italic> &#x0002B; 1. At decision time, the same action-selection strategies can be implemented as for Q-learning.</p>
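The on-policy update in Equation (5) can be sketched in the same illustrative dictionary encoding; the only change from Q-learning is the bootstrap term, which uses the action actually selected at <italic>t</italic> + 1.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Equation (5): bootstrap on Q(s_{t+1}, a_{t+1}), the action actually
    taken at the next step, rather than on max_a' Q(s_{t+1}, a')."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```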
</sec>
<sec>
<title>2.2. Policy-Gradient RL</title>
<p>In contrast to value-based RL, policy-gradient methods do not compute a value function (Williams, <xref ref-type="bibr" rid="B134">1992</xref>). Instead, the policy is directly optimized from the perceived rewards. In this approach, the policy &#x003C0; is controlled with a set of parameters <italic>w</italic> &#x02208; &#x0211D;<sup><italic>n</italic></sup>, such that &#x003C0;<sub><italic>w</italic></sub>(<italic>s, a</italic>) is differentiable in <italic>w</italic>; &#x02200;<italic>s</italic> &#x02208; <italic>S, a</italic> &#x02208; <italic>A</italic>. For example, <italic>w</italic> can be defined so that <italic>w</italic>(<italic>s, a</italic>) reflects the preference for taking an action in a given state by expressing the policy as a softmax distribution over the parameters:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M7"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:msup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msup></mml:mrow></mml:mfrac><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>A learning iteration is composed of two stages. First, the agent estimates the expected returns, <italic>G</italic>, by sampling a set of trajectories. Then, the policy &#x003C0;<sub><italic>w</italic></sub> is updated using the gradient of the expected returns with respect to <italic>w</italic>. For example, in the REINFORCE algorithm (Williams, <xref ref-type="bibr" rid="B134">1992</xref>), a trajectory of <italic>T</italic> time-steps is first sampled from a single episode. Then, for every time-step <italic>t</italic> of the trajectory, the return <italic>G</italic> is computed as <inline-formula><mml:math id="M8"><mml:mi>G</mml:mi><mml:mo>&#x02190;</mml:mo><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:munderover><mml:msup><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>-</mml:mo><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:msup><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, and the policy parameters are updated with:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M9"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>w</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003B3;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msup><mml:mi>G</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo class="qopname">ln</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
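<p>The two-stage iteration described above can be sketched as follows for a tabular softmax policy, for which the gradient of ln &#x003C0;<sub><italic>w</italic></sub> reduces to the indicator of the chosen action minus the action probabilities. The data structures and names are illustrative assumptions, not taken from Williams (1992):</p>

```python
import numpy as np

def softmax(prefs):
    exp_prefs = np.exp(prefs - prefs.max())
    return exp_prefs / exp_prefs.sum()

def reinforce_update(w, trajectory, gamma=0.99, alpha=0.01):
    """One REINFORCE iteration over a sampled episode (Equation 7).

    trajectory: list of (state, action, reward) tuples, one per time-step;
    w: dict mapping each state to a vector of action preferences.
    """
    T = len(trajectory)
    for t, (s, a, _) in enumerate(trajectory):
        # Return from step t: G = sum_{k=t+1..T} gamma^(k-t-1) * r_k
        G = sum(gamma ** (k - t - 1) * trajectory[k - 1][2]
                for k in range(t + 1, T + 1))
        pi = softmax(w[s])
        grad_log_pi = -pi             # d/dw ln pi_w(a_t|s_t), softmax case
        grad_log_pi[a] += 1.0
        w[s] = w[s] + alpha * gamma ** t * G * grad_log_pi
    return w
```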
</sec>
<sec>
<title>2.3. Actor-Critic RL</title>
<p>Actor-Critic architectures constitute a hybrid approach between value-based and policy-gradient methods by computing both the policy (the actor) and a value function (the critic) (Barto et al., <xref ref-type="bibr" rid="B10">1983</xref>). The actor can be represented as a parameterized softmax distribution as in Equation (6). The critic computes a value function that is used for evaluating the actor. The reward <italic>r</italic><sub><italic>t</italic></sub> received at time <italic>t</italic> is used for computing a temporal difference (TD) error:</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M10"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B3;</mml:mi><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The TD error is then used for updating both the critic and the actor, using respectively, Equations (9) and (10):</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M11"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mi>V</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E10"><label>(10)</label><mml:math id="M12"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x02190;</mml:mo><mml:mi>w</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1; &#x02208; [0, 1] and &#x003B2; &#x02208; [0, 1] are two learning rates. A positive TD error increases the probability of selecting <italic>a</italic><sub><italic>t</italic></sub> in <italic>s</italic><sub><italic>t</italic></sub>, while a negative TD error decreases it.</p>
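<p>Equations (8)&#x02013;(10) translate almost line for line into code. A minimal tabular sketch (the dictionary representation, and the assumption that <italic>s</italic><sub><italic>t</italic>+1</sub> is non-terminal, are ours):</p>

```python
def actor_critic_step(V, w, s, a, r, s_next, alpha=0.1, beta=0.1, gamma=0.99):
    """One tabular actor-critic update.

    V: dict state -> value estimate (the critic);
    w: dict (state, action) -> preference (the actor);
    assumes s_next is non-terminal.
    """
    delta = r + gamma * V[s_next] - V[s]   # TD error, Equation (8)
    V[s] += alpha * delta                  # critic update, Equation (9)
    w[(s, a)] += beta * delta              # actor update, Equation (10)
    return delta
```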
<p>The main advantage of RL algorithms is the autonomy of the learning process. Given a predefined reward function, they allow an agent to optimize its behavior without the intervention of a human supervisor. However, they present several limitations. For instance, they involve a time-consuming iterative process that limits their applicability to complex real-world problems (Kober et al., <xref ref-type="bibr" rid="B60">2013</xref>). Some existing techniques, such as reward shaping, aim at overcoming this limitation by defining intermediate rewards (Gullapalli and Barto, <xref ref-type="bibr" rid="B41">1992</xref>; Mataric, <xref ref-type="bibr" rid="B79">1994</xref>). However, they generally require expert knowledge for designing an appropriate reward shaping function (Ng et al., <xref ref-type="bibr" rid="B91">1999</xref>; Wiewiora et al., <xref ref-type="bibr" rid="B133">2003</xref>). Also, the exploration aspect of autonomous learning methods raises several safety issues (Garcia and Fernandez, <xref ref-type="bibr" rid="B36">2015</xref>).</p>
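<p>For instance, the potential-based scheme of Ng et al. (1999) augments the environment reward with the difference of a designer-supplied potential function over states, a form of shaping that provably leaves the optimal policy unchanged. A minimal sketch, where <italic>potential</italic> stands for any such heuristic:</p>

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99):
    """Potential-based reward shaping (Ng et al., 1999).

    Adds F(s, s') = gamma * Phi(s') - Phi(s) to the environment reward r;
    `potential` is a designer-supplied heuristic Phi over states.
    """
    return r + gamma * potential(s_next) - potential(s)
```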
<p>Interactive learning constitutes a complementary approach that aims at overcoming these limitations by involving a human teacher in the learning process. In the next section, we show how a human teacher can provide an RL agent with various forms of advice to convey different information about the task. We then show how advice can be interpreted by the agent, for instance by grounding its meaning in the learning process using either the reward function, the value function or the policy. Finally, we show how advice can be used, in turn, to intervene at different levels of the learning process, by influencing either the reward function, the value function, the policy, or the action-selection strategy.</p>
</sec>
</sec>
<sec id="s3">
<title>3. Reinforcement Learning With Human Advice</title>
<p>In one of the first papers of artificial intelligence, John McCarthy described an &#x0201C;<italic>Advice Taker&#x0201D;</italic> system that could learn by being told (McCarthy, <xref ref-type="bibr" rid="B82">1959</xref>). This idea was then elaborated in Hayes-Roth et al. (<xref ref-type="bibr" rid="B43">1980</xref>) and Hayes-Roth et al. (<xref ref-type="bibr" rid="B44">1981</xref>), where a general framework for learning from advice was proposed. This framework can be summarized in the following five steps (Cohen and Feigenbaum, <xref ref-type="bibr" rid="B26">1982</xref>; Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>):</p>
<list list-type="order">
<list-item><p>Requesting or receiving the advice.</p></list-item>
<list-item><p>Converting the advice into an internal representation.</p></list-item>
<list-item><p>Converting the advice into a usable form (operationalization).</p></list-item>
<list-item><p>Integrating the reformulated advice into the agent&#x00027;s knowledge base.</p></list-item>
<list-item><p>Judging the value of the advice.</p></list-item>
</list>
<p>The first step describes how human advice can be provided to the system. Different forms of advice can be distinguished based on this criterion. Step 2 refers to encoding the perceived advice into an internal representation. Most existing advice-taking systems assume that the internal representation of advice is predetermined by the system designer. However, some recent works tackle the problem of letting the system learn how to interpret raw advice, in order to make the interaction protocol less constraining for the human teacher (Vollmer et al., <xref ref-type="bibr" rid="B127">2016</xref>). Steps 3&#x02013;5 describe how human advice can be used by the agent for learning. These three steps are often conflated into a single process, which we call shaping, and which consists of integrating advice into the agent&#x00027;s learning process.</p>
<p>In the remainder of this section, we first propose a taxonomy of different categories of advice based on how they can be provided to the system (step 1). Then we detail how advice can be interpreted (step 2). Finally, we present how advice can be integrated into an RL process (steps 3&#x02013;5).</p>
<sec>
<title>3.1. Providing Advice</title>
<p>The means by which teaching signals can be communicated to a learning agent vary. They can be provided via natural language (Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>; Cruz et al., <xref ref-type="bibr" rid="B29">2015</xref>; Pal&#x000E9;ologue et al., <xref ref-type="bibr" rid="B95">2018</xref>), computer vision (Atkeson and Schaal, <xref ref-type="bibr" rid="B8">1997</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>), hand-written programs (Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Maclin et al., <xref ref-type="bibr" rid="B74">2005a</xref>,<xref ref-type="bibr" rid="B75">b</xref>; Torrey et al., <xref ref-type="bibr" rid="B122">2008</xref>), artificial interfaces (Abbeel et al., <xref ref-type="bibr" rid="B1">2010</xref>; Suay and Chernova, <xref ref-type="bibr" rid="B105">2011</xref>; Knox et al., <xref ref-type="bibr" rid="B59">2013</xref>), or physical interaction (Lozano-Perez, <xref ref-type="bibr" rid="B70">1983</xref>; Akgun et al., <xref ref-type="bibr" rid="B3">2012</xref>). Despite the variety of communication channels, we can distinguish two main categories of teaching signals based on how they are produced: advice and demonstration. Even though advice and demonstration can share the same communication channels, like computer vision (Atkeson and Schaal, <xref ref-type="bibr" rid="B8">1997</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>) and artificial interfaces (Abbeel et al., <xref ref-type="bibr" rid="B1">2010</xref>; Suay and Chernova, <xref ref-type="bibr" rid="B105">2011</xref>; Knox et al., <xref ref-type="bibr" rid="B59">2013</xref>), they are fundamentally different from each other in that demonstration requires the task to be executed by the teacher (demonstrated), while advice does not. 
In rare cases, demonstration (Whitehead, <xref ref-type="bibr" rid="B130">1991</xref>; Lin, <xref ref-type="bibr" rid="B65">1992</xref>) has been referred to as advice (Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Maclin et al., <xref ref-type="bibr" rid="B74">2005a</xref>). However, it is more common to consider demonstration and advice as two distinct and complementary approaches for interactive learning (Dillmann et al., <xref ref-type="bibr" rid="B32">2000</xref>; Argall et al., <xref ref-type="bibr" rid="B4">2008</xref>; Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>, <xref ref-type="bibr" rid="B56">2011b</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>). Based on this distinction, we define advice as <italic>teaching signals that can be communicated by the teacher to the learning system without executing the task</italic>.</p>
<p>We mainly distinguish two forms of advice depending on how it is provided to the system: <italic>general advice</italic> and <italic>contextual advice</italic> (<xref ref-type="fig" rid="F1">Figure 1</xref>, <xref ref-type="table" rid="T1">Table 1</xref>). <italic>General advice</italic> can be communicated to the system, non-interactively, prior to the learning process (offline). This type of advice represents information about the task that does not depend on the context in which it is provided. It is self-sufficient, in that it includes all the required information for being converted into a usable form (operationalization). Examples include specifying general constraints about the task and providing general instructions about the desired behavior. <italic>Contextual advice</italic>, on the other hand, is context-dependent, in that the communicated information depends on the current state of the task. So, unlike <italic>general advice</italic>, it must be provided interactively throughout the task (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>; Celemin and Ruiz-Del-Solar, <xref ref-type="bibr" rid="B19">2019</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>). <italic>Contextual advice</italic> can also be provided in an offline fashion, with the teacher interacting with previously recorded task executions by the learning agent (Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>; Argall et al., <xref ref-type="bibr" rid="B5">2011</xref>). Even in this case, each piece of advice has to be provided at a specific moment of the task execution. 
Examples of <italic>contextual advice</italic> include evaluative feedback (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>; Najar et al., <xref ref-type="bibr" rid="B89">2016</xref>), corrective feedback (Argall et al., <xref ref-type="bibr" rid="B5">2011</xref>; Celemin and Ruiz-Del-Solar, <xref ref-type="bibr" rid="B19">2019</xref>), guidance (Thomaz and Breazeal, <xref ref-type="bibr" rid="B117">2006</xref>; Suay and Chernova, <xref ref-type="bibr" rid="B105">2011</xref>), and contextual instructions (Clouse and Utgoff, <xref ref-type="bibr" rid="B25">1992</xref>; Rosenstein et al., <xref ref-type="bibr" rid="B100">2004</xref>; Pradyot et al., <xref ref-type="bibr" rid="B96">2012a</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Taxonomy of advice.</p></caption>
<graphic xlink:href="frobt-08-584075-g0001.tif"/>
</fig>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Types of advice.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Category</bold></th>
<th valign="top" align="left"><bold>References</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">General constraints</td>
<td valign="top" align="left">Hayes-Roth et al., <xref ref-type="bibr" rid="B44">1981</xref>; Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>; Mangasarian et al., <xref ref-type="bibr" rid="B78">2004</xref>; Maclin et al., <xref ref-type="bibr" rid="B74">2005a</xref>,<xref ref-type="bibr" rid="B75">b</xref>; Torrey et al., <xref ref-type="bibr" rid="B122">2008</xref></td>
</tr>
<tr>
<td valign="top" align="left">General instructions</td>
<td valign="top" align="left">Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>; Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>, <xref ref-type="bibr" rid="B13">2010</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref></td>
</tr>
<tr>
<td valign="top" align="left">Guidance</td>
<td valign="top" align="left">Thomaz, <xref ref-type="bibr" rid="B116">2006</xref>; Thomaz and Cakmak, <xref ref-type="bibr" rid="B120">2009</xref>; Suay and Chernova, <xref ref-type="bibr" rid="B105">2011</xref>; Chu et al., <xref ref-type="bibr" rid="B24">2016</xref>; Subramanian et al., <xref ref-type="bibr" rid="B107">2016</xref></td>
</tr>
<tr>
<td valign="top" align="left">Contextual instructions</td>
<td valign="top" align="left">Utgoff and Clouse, <xref ref-type="bibr" rid="B125">1991</xref>; Clouse and Utgoff, <xref ref-type="bibr" rid="B25">1992</xref>; Nicolescu and Mataric, <xref ref-type="bibr" rid="B93">2003</xref>; Rosenstein et al., <xref ref-type="bibr" rid="B100">2004</xref>; Rybski et al., <xref ref-type="bibr" rid="B101">2007</xref>; Thomaz and Breazeal, <xref ref-type="bibr" rid="B119">2007b</xref>; Branavan et al., <xref ref-type="bibr" rid="B13">2010</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Pradyot et al., <xref ref-type="bibr" rid="B97">2012b</xref>; Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>; MacGlashan et al., <xref ref-type="bibr" rid="B71">2014a</xref>; Cruz et al., <xref ref-type="bibr" rid="B29">2015</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref></td>
</tr>
<tr>
<td valign="top" align="left">Corrective feedback</td>
<td valign="top" align="left">Nicolescu and Mataric, <xref ref-type="bibr" rid="B93">2003</xref>; Chernova and Veloso, <xref ref-type="bibr" rid="B22">2009</xref>; Argall et al., <xref ref-type="bibr" rid="B5">2011</xref>; Celemin and Ruiz-Del-Solar, <xref ref-type="bibr" rid="B19">2019</xref></td>
</tr>
<tr>
<td valign="top" align="left">Evaluative feedback</td>
<td valign="top" align="left">Dorigo and Colombetti, <xref ref-type="bibr" rid="B34">1994</xref>; Colombetti et al., <xref ref-type="bibr" rid="B27">1996</xref>; Isbell et al., <xref ref-type="bibr" rid="B47">2001</xref>; Kaplan et al., <xref ref-type="bibr" rid="B50">2002</xref>; Thomaz et al., <xref ref-type="bibr" rid="B121">2006</xref>; Kim and Scassellati, <xref ref-type="bibr" rid="B52">2007</xref>; Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>, <xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B57">2012a</xref>,<xref ref-type="bibr" rid="B58">b</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Lopes et al., <xref ref-type="bibr" rid="B69">2011</xref>; Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>; Grizou et al., <xref ref-type="bibr" rid="B39">2014b</xref>; Loftin et al., <xref ref-type="bibr" rid="B67">2014</xref>, <xref ref-type="bibr" rid="B68">2016</xref>; Ho et al., <xref ref-type="bibr" rid="B45">2015</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>; Najar et al., <xref ref-type="bibr" rid="B89">2016</xref>, <xref ref-type="bibr" rid="B90">2020b</xref>; MacGlashan et al., <xref ref-type="bibr" rid="B72">2017</xref></td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>3.1.1. General Advice</title>
<p>Advice can be used by the human teacher to provide the agent with general information about the task prior to the learning process. This information can be provided to the system in written form (Hayes-Roth et al., <xref ref-type="bibr" rid="B43">1980</xref>; Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>; Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>).</p>
<p>General advice can specify <italic>general constraints</italic> about the task such as domain concepts, behavioral constraints, and performance heuristics. For example, the first ever implemented advice-taking system relied on general constraints that were written as LISP expressions, to specify concepts, rules and heuristics for a card-playing agent (Hayes-Roth et al., <xref ref-type="bibr" rid="B44">1981</xref>).</p>
<p>A second form of general advice, <italic>general instructions</italic>, explicitly specifies to the agent what actions to perform in different situations. It can be provided either in the form of <italic>if-then</italic> rules (Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>), or as detailed action plans describing the step-by-step sequence of actions that should be performed in order to solve the task (Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>). Action plans can be seen as a sequence of low-level or high-level <italic>contextual instructions</italic> (cf. definition below). For example, an action plan like &#x0201C;<italic>Click start, point to search, and then click for files or folders.&#x0201D;</italic> can be decomposed into a sequence of three low-level <italic>contextual instructions</italic> (Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>).</p>
</sec>
<sec>
<title>3.1.2. Contextual Advice</title>
<p>In contrast to <italic>general advice</italic>, <italic>contextual advice</italic> depends on the state in which it is provided. In the terms of the advice-taking process, part of the information required for operationalization is implicit, and must be inferred by the learner from the current context. Consequently, <italic>contextual advice</italic> must be provided to the learning agent progressively, throughout the task. Contextual advice can be divided into two main categories: guidance and feedback. Guidance informs about future actions, whereas feedback informs about past ones.</p>
</sec>
<sec>
<title>3.1.3. Guidance</title>
<p>The term guidance appears in many papers, and was popularized by the work of Thomaz (<xref ref-type="bibr" rid="B116">2006</xref>) on socially guided machine learning. In the broadest sense, guidance represents the general idea of guiding the learning process of an agent; in this sense, all interactive learning methods can be considered a form of guidance. A more specific definition restricts guidance to human inputs that are provided in order to bias the exploration strategy (Thomaz and Cakmak, <xref ref-type="bibr" rid="B120">2009</xref>). For instance, in Subramanian et al. (<xref ref-type="bibr" rid="B107">2016</xref>), demonstrations were provided in order to teach the agent how to explore interesting regions of the state space. In Chu et al. (<xref ref-type="bibr" rid="B24">2016</xref>), kinesthetic teaching was used for guiding the exploration process for learning object affordances. In the most specific sense, guidance constitutes a form of advice that consists of suggesting a limited subset of actions from all the possible ones (Thomaz and Breazeal, <xref ref-type="bibr" rid="B117">2006</xref>; Suay and Chernova, <xref ref-type="bibr" rid="B105">2011</xref>).</p>
</sec>
<sec>
<title>3.1.4. Contextual Instructions</title>
<p>One particular type of guidance is to suggest only one action to perform. We refer to this type of advice as <italic>contextual instructions</italic>. For example, in Cruz et al. (<xref ref-type="bibr" rid="B29">2015</xref>), the authors used both terms of advice and guidance for referring to contextual instructions. Contextual instructions can be either low-level or high-level (Branavan et al., <xref ref-type="bibr" rid="B13">2010</xref>). Low-level instructions indicate the next action to perform (Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>), whereas high-level instructions indicate a more extended goal without explicitly specifying the sequence of actions that should be executed (MacGlashan et al., <xref ref-type="bibr" rid="B71">2014a</xref>). High-level instructions were also referred to as commands (MacGlashan et al., <xref ref-type="bibr" rid="B71">2014a</xref>; Tellex et al., <xref ref-type="bibr" rid="B114">2014</xref>). In RL terminology, high-level instructions would correspond to performing <italic>options</italic> (Sutton et al., <xref ref-type="bibr" rid="B110">1999</xref>). Contextual instructions can be provided through speech (Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>), gestures (Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>), or myoelectric (EMG) interfaces (Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>).</p>
</sec>
<sec>
<title>3.1.5. Feedback</title>
<p>We distinguish two main forms of feedback: evaluative and corrective. Evaluative feedback, also called critique, consists of evaluating the quality of the agent&#x00027;s actions (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>). Corrective feedback, also called instructive feedback, implies that the performed action is wrong (Argall et al., <xref ref-type="bibr" rid="B5">2011</xref>; Celemin and Ruiz-Del-Solar, <xref ref-type="bibr" rid="B19">2019</xref>). However, it goes beyond simply criticizing the performed action by informing the agent about the correct one.</p>
</sec>
<sec>
<title>3.1.6. Corrective Feedback</title>
<p>Corrective feedback can be either a corrective instruction (Chernova and Veloso, <xref ref-type="bibr" rid="B22">2009</xref>) or a corrective demonstration (Nicolescu and Mataric, <xref ref-type="bibr" rid="B93">2003</xref>). The main difference with instructions (respectively, demonstrations) is that they are provided after an action (respectively, a sequence of actions) is executed by the agent, not before. So, operationalization is made with respect to the previous state instead of the current one.</p>
<p>So far, corrective feedback has been mainly used for augmenting LfD systems (Nicolescu and Mataric, <xref ref-type="bibr" rid="B93">2003</xref>; Chernova and Veloso, <xref ref-type="bibr" rid="B22">2009</xref>; Argall et al., <xref ref-type="bibr" rid="B5">2011</xref>). For example, in Chernova and Veloso (<xref ref-type="bibr" rid="B22">2009</xref>), while the robot is reproducing the provided demonstrations, the teacher could interactively rectify any incorrect action. In Nicolescu and Mataric (<xref ref-type="bibr" rid="B93">2003</xref>), corrective demonstrations were delimited by two predefined verbal commands that were pronounced by the teacher. In Argall et al. (<xref ref-type="bibr" rid="B5">2011</xref>), the authors presented a framework based on <italic>advice-operators</italic>, allowing a teacher to correct entire segments of demonstrations through a visual interface. Advice-operators were defined as numerical operations that can be performed on state-action pairs. The teacher could choose an operator from a predefined set, and apply it to the segment to be corrected. In Celemin and Ruiz-Del-Solar (<xref ref-type="bibr" rid="B19">2019</xref>), the authors took inspiration from advice-operators to propose learning from corrective feedback as a standalone method, contrasting with other methods for learning from evaluative feedback such as TAMER (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>).</p>
</sec>
<sec>
<title>3.1.7. Evaluative Feedback</title>
<p>Teaching an agent by evaluating its actions is an alternative solution to the standard RL approach. Evaluative feedback can be provided in different forms: a scalar value <italic>f</italic> &#x02208; [&#x02212;1, 1] (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>), a binary value <italic>f</italic> &#x02208; {&#x02212;1, 1} (Thomaz et al., <xref ref-type="bibr" rid="B121">2006</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>), a positive reinforcer <italic>f</italic> &#x02208; {&#x0201C;<italic>Good</italic>!&#x0201D;, &#x0201C;<italic>Bravo</italic>!&#x0201D;} (Kaplan et al., <xref ref-type="bibr" rid="B50">2002</xref>), or categorical information <italic>f</italic> &#x02208; {<italic>Correct, Wrong</italic>} (Loftin et al., <xref ref-type="bibr" rid="B68">2016</xref>). These values can be provided through buttons (Kaplan et al., <xref ref-type="bibr" rid="B50">2002</xref>; Suay and Chernova, <xref ref-type="bibr" rid="B105">2011</xref>; Knox et al., <xref ref-type="bibr" rid="B59">2013</xref>), speech (Kim and Scassellati, <xref ref-type="bibr" rid="B52">2007</xref>; Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>), gestures (Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>), or electroencephalogram (EEG) signals (Grizou et al., <xref ref-type="bibr" rid="B38">2014a</xref>).</p>
<p>Another form of evaluative feedback is to provide preferences between demonstrated trajectories (Christiano et al., <xref ref-type="bibr" rid="B23">2017</xref>; Sadigh et al., <xref ref-type="bibr" rid="B102">2017</xref>; Cui and Niekum, <xref ref-type="bibr" rid="B30">2018</xref>). Instead of critiquing one single action or a sequence of actions, the teacher provides a ranking for demonstrated trajectories. The provided human preferences are then aggregated in order to infer the reward function. This form of evaluative feedback has been mainly investigated within the LfD community as an alternative to the standard Inverse Reinforcement Learning approach (IRL) (Ng and Russell, <xref ref-type="bibr" rid="B92">2000</xref>), by relaxing the constraint for the teacher to provide demonstrations.</p>
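<p>A common way to aggregate such preferences (Christiano et al., 2017) is to fit a reward model so that the probability of preferring one trajectory over another follows a logistic (Bradley&#x02013;Terry) model of their summed rewards. The following sketch assumes a linear reward over hand-crafted features; that representation, and the function names, are simplifying assumptions of ours:</p>

```python
import numpy as np

def preference_prob(theta, traj_a, traj_b, features):
    """P(trajectory A preferred over B) under linear reward theta . phi(s, a)."""
    return_a = sum(theta @ features(s, a) for s, a in traj_a)
    return_b = sum(theta @ features(s, a) for s, a in traj_b)
    return 1.0 / (1.0 + np.exp(return_b - return_a))

def preference_grad_step(theta, traj_a, traj_b, features, lr=0.1):
    """One gradient ascent step on log P(A > B), for a preference labeled A over B."""
    p = preference_prob(theta, traj_a, traj_b, features)
    phi_a = sum(features(s, a) for s, a in traj_a)
    phi_b = sum(features(s, a) for s, a in traj_b)
    return theta + lr * (1.0 - p) * (phi_a - phi_b)
```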
</sec>
</sec>
<sec>
<title>3.2. Interpreting Advice</title>
<p>The second step of the advice-taking process stipulates that advice needs to be converted into an internal representation. Predefining the meaning of advice by hand-coding the mapping between raw signals and their internal representation has been widely used in the literature (Clouse and Utgoff, <xref ref-type="bibr" rid="B25">1992</xref>; Nicolescu and Mataric, <xref ref-type="bibr" rid="B93">2003</xref>; Lockerd and Breazeal, <xref ref-type="bibr" rid="B66">2004</xref>; Rosenstein et al., <xref ref-type="bibr" rid="B100">2004</xref>; Rybski et al., <xref ref-type="bibr" rid="B101">2007</xref>; Thomaz and Breazeal, <xref ref-type="bibr" rid="B119">2007b</xref>; Chernova and Veloso, <xref ref-type="bibr" rid="B22">2009</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Pradyot et al., <xref ref-type="bibr" rid="B96">2012a</xref>; Cruz et al., <xref ref-type="bibr" rid="B29">2015</xref>; Celemin and Ruiz-Del-Solar, <xref ref-type="bibr" rid="B19">2019</xref>). However, this solution has several limitations. First, programming the meaning of raw advice signals for new tasks requires expert programming skills that not all human users have. Second, it limits the possibility for different teachers to use their own preferred signals.</p>
<p>One way to address these limitations is to teach the system how to interpret the teacher&#x00027;s raw advice signals. This way, the system would be able to understand advice that can be expressed through natural language or non-verbal cues, without predetermining the meaning of each signal. In this case, we talk about learning with unlabeled teaching signals (Grizou et al., <xref ref-type="bibr" rid="B39">2014b</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>). To achieve this goal, different approaches have been taken in the literature. <xref ref-type="table" rid="T2">Table 2</xref> summarizes the literature addressing the question of interpreting advice. We categorize them according to the type of advice, the communication channel, the interpretation method, and the inputs given to the system for interpretation.</p>
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Interpreting advice.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Advice</bold></th>
<th valign="top" align="left"><bold>Channel</bold></th>
<th valign="top" align="left"><bold>Method</bold></th>
<th valign="top" align="left"><bold>Inputs</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Kate and Mooney, <xref ref-type="bibr" rid="B51">2006</xref></td>
<td valign="top" align="left">GI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">SVM</td>
<td valign="top" align="left">Demonstration<xref ref-type="table-fn" rid="TN1"><sup>&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Kim and Scassellati, <xref ref-type="bibr" rid="B52">2007</xref></td>
<td valign="top" align="left">EFB</td>
<td valign="top" align="left">Speech</td>
<td valign="top" align="left">kNN</td>
<td valign="top" align="left">Binary EFB classes</td>
</tr>
<tr>
<td valign="top" align="left">Chen and Mooney, <xref ref-type="bibr" rid="B20">2011</xref></td>
<td valign="top" align="left">GLI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">SVM</td>
<td valign="top" align="left">Demonstration</td>
</tr>
<tr>
<td valign="top" align="left">Tellex et al., <xref ref-type="bibr" rid="B113">2011</xref></td>
<td valign="top" align="left">GHI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">Graphical model</td>
<td valign="top" align="left">Demonstration</td>
</tr>
<tr>
<td valign="top" align="left">Artzi and Zettlemoyer, <xref ref-type="bibr" rid="B7">2013</xref></td>
<td valign="top" align="left">GHI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">Perceptron</td>
<td valign="top" align="left">Rewards or demonstration &#x0002B; language model</td>
</tr>
<tr>
<td valign="top" align="left">Duvallet et al., <xref ref-type="bibr" rid="B35">2013</xref></td>
<td valign="top" align="left">GLI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">MCC</td>
<td valign="top" align="left">Demonstration &#x0002B; language model</td>
</tr>
<tr>
<td valign="top" align="left">Tellex et al., <xref ref-type="bibr" rid="B114">2014</xref></td>
<td valign="top" align="left">GHI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">Gradient descent</td>
<td valign="top" align="left">Demonstration</td>
</tr>
<tr>
<td valign="top" align="left">Pradyot et al., <xref ref-type="bibr" rid="B97">2012b</xref></td>
<td valign="top" align="left">CLI</td>
<td valign="top" align="left">Gestures</td>
<td valign="top" align="left">MLN</td>
<td valign="top" align="left">Demonstration<xref ref-type="table-fn" rid="TN1"><sup>&#x0002A;</sup></xref></td>
</tr>
<tr>
<td valign="top" align="left">Lopes et al., <xref ref-type="bibr" rid="B69">2011</xref></td>
<td valign="top" align="left">EFB and CFB</td>
<td valign="top" align="left">Simulation</td>
<td valign="top" align="left">IRL</td>
<td valign="top" align="left">EFB and CFB</td>
</tr>
<tr>
<td valign="top" align="left">Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref></td>
<td valign="top" align="left">EFB or CLI</td>
<td valign="top" align="left">Speech</td>
<td valign="top" align="left">EM</td>
<td valign="top" align="left">Task models</td>
</tr>
<tr>
<td valign="top" align="left">Grizou et al., <xref ref-type="bibr" rid="B39">2014b</xref></td>
<td valign="top" align="left">EFB</td>
<td valign="top" align="left">EEG</td>
<td valign="top" align="left">EM</td>
<td valign="top" align="left">Task models</td>
</tr>
<tr>
<td valign="top" align="left">MacGlashan et al., <xref ref-type="bibr" rid="B71">2014a</xref></td>
<td valign="top" align="left">GHI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">EM</td>
<td valign="top" align="left">Task and language models</td>
</tr>
<tr>
<td valign="top" align="left">MacGlashan et al., <xref ref-type="bibr" rid="B73">2014b</xref></td>
<td valign="top" align="left">GHI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">EM</td>
<td valign="top" align="left">EFB &#x0002B; language model</td>
</tr>
<tr>
<td valign="top" align="left">Loftin et al., <xref ref-type="bibr" rid="B68">2016</xref></td>
<td valign="top" align="left">EFB</td>
<td valign="top" align="left">Buttons</td>
<td valign="top" align="left">EM</td>
<td valign="top" align="left">Task models</td>
</tr>
<tr>
<td valign="top" align="left">Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref></td>
<td valign="top" align="left">GLI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">PGRL</td>
<td valign="top" align="left">Rewards</td>
</tr>
<tr>
<td valign="top" align="left">Branavan et al., <xref ref-type="bibr" rid="B13">2010</xref></td>
<td valign="top" align="left">GHI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">MB-PGRL</td>
<td valign="top" align="left">Rewards</td>
</tr>
<tr>
<td valign="top" align="left">Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref></td>
<td valign="top" align="left">GLI</td>
<td valign="top" align="left">Text</td>
<td valign="top" align="left">SARSA</td>
<td valign="top" align="left">Demonstration</td>
</tr>
<tr>
<td valign="top" align="left">Najar et al., <xref ref-type="bibr" rid="B88">2015b</xref></td>
<td valign="top" align="left">CLI</td>
<td valign="top" align="left">Simulation</td>
<td valign="top" align="left">XCS</td>
<td valign="top" align="left">Rewards</td>
</tr>
<tr>
<td valign="top" align="left">Najar et al., <xref ref-type="bibr" rid="B87">2015a</xref></td>
<td valign="top" align="left">CLI</td>
<td valign="top" align="left">Gestures</td>
<td valign="top" align="left">XCS</td>
<td valign="top" align="left">EFB</td>
</tr>
<tr>
<td valign="top" align="left">Najar et al., <xref ref-type="bibr" rid="B89">2016</xref></td>
<td valign="top" align="left">CLI</td>
<td valign="top" align="left">Gestures</td>
<td valign="top" align="left">Q-learning</td>
<td valign="top" align="left">EFB</td>
</tr>
<tr>
<td valign="top" align="left">Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref></td>
<td valign="top" align="left">CLI</td>
<td valign="top" align="left">EMG</td>
<td valign="top" align="left">ACRL</td>
<td valign="top" align="left">Rewards and/or EFB</td>
</tr>
<tr>
<td valign="top" align="left">Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref></td>
<td valign="top" align="left">CLI</td>
<td valign="top" align="left">Gestures</td>
<td valign="top" align="left">ACRL</td>
<td valign="top" align="left">Rewards and/or EFB</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>GI, General instruction; GLI, general low-level instruction; GHI, general high-level instruction; CLI, contextual low-level instruction; EFB, evaluative feedback; CFB, corrective feedback; SVM, Support Vector Machines; kNN, k-nearest neighbors; MCC, multi-class classification; MLN, Markov Logic Networks; IRL, Inverse Reinforcement Learning; EM, expectation maximization; PGRL, policy-gradient RL; MB-PGRL, model-based policy-gradient RL; XCS, eXtended Classifier System; ACRL, Actor-Critic RL</italic>.</p>
<fn id="TN1"><label>&#x0002A;</label><p><italic>The term demonstration here is taken in the general sense as a trajectory, not necessarily the optimal one</italic>.</p></fn>
</table-wrap-foot>
</table-wrap>
<sec>
<title>3.2.1. Supervised Interpretation</title>
<p>Some methods relied on interpreters trained with supervised learning methods (Kate and Mooney, <xref ref-type="bibr" rid="B51">2006</xref>; Zettlemoyer and Collins, <xref ref-type="bibr" rid="B135">2009</xref>; Matuszek et al., <xref ref-type="bibr" rid="B81">2013</xref>). For example, in Kuhlmann et al. (<xref ref-type="bibr" rid="B63">2004</xref>), the system converted general instructions expressed in a constrained natural language into a formal representation based on <italic>if-then</italic> rules, using a parser previously trained on annotated data. In Pradyot et al. (<xref ref-type="bibr" rid="B97">2012b</xref>), two different models of contextual instructions were first learned using Markov logic networks (MLN) (Domingos et al., <xref ref-type="bibr" rid="B33">2016</xref>), and then used to guide a learning agent in a later phase. The most likely interpretation was taken from the instruction model with the highest confidence. In Kim and Scassellati (<xref ref-type="bibr" rid="B52">2007</xref>), a binary classification of prosodic features was performed offline, before being used to convert evaluative feedback into a numerical reward signal for task learning.</p>
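As an illustration of this kind of offline supervised interpreter, the sketch below labels a raw advice signal with a feedback class by k-nearest neighbors, in the spirit of the prosody classifier of Kim and Scassellati (2007). The feature names and the toy corpus are hypothetical, not taken from the original study.

```python
from collections import Counter
import math

def knn_classify(features, examples, k=3):
    """Label a raw advice signal (e.g., a prosodic feature vector) by
    majority vote among its k nearest labelled examples."""
    nearest = sorted(examples, key=lambda ex: math.dist(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled corpus: (pitch_mean, energy) -> feedback class
corpus = [((0.9, 0.8), +1), ((0.8, 0.9), +1), ((0.7, 0.7), +1),
          ((0.2, 0.3), -1), ((0.1, 0.2), -1), ((0.3, 0.1), -1)]

reward = knn_classify((0.85, 0.75), corpus)  # classified as positive feedback
```

Once trained offline, such a classifier turns each incoming signal into a numerical reward usable by any RL algorithm.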
</sec>
<sec>
<title>3.2.2. Grounded Interpretation</title>
<p>More recent approaches take inspiration from the <italic>grounded language acquisition</italic> literature (Mooney, <xref ref-type="bibr" rid="B83">2008</xref>) to learn a model that grounds the meaning of advice into concepts from the task. For example, general instructions expressed in natural language can be paired with demonstrations of the corresponding tasks to learn the mapping between low-level contextual instructions and their intended actions (Chen and Mooney, <xref ref-type="bibr" rid="B20">2011</xref>; Tellex et al., <xref ref-type="bibr" rid="B113">2011</xref>; Duvallet et al., <xref ref-type="bibr" rid="B35">2013</xref>). In MacGlashan et al. (<xref ref-type="bibr" rid="B71">2014a</xref>), the authors proposed a model for grounding general high-level instructions into reward functions from user demonstrations. The agent had access to a set of hypotheses about possible tasks, in addition to command-to-demonstration pairings. Generative models of tasks, language, and behaviors were then inferred using expectation maximization (EM) (Dempster et al., <xref ref-type="bibr" rid="B31">1977</xref>). In addition to having a set of hypotheses about possible reward functions, the agent was also endowed with planning abilities that allowed it to infer a policy according to the most likely task. The authors extended their model in MacGlashan et al. (<xref ref-type="bibr" rid="B73">2014b</xref>) to ground command meanings in reward functions using evaluative feedback instead of demonstrations.</p>
<p>In similar work (Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>), a robot learned to interpret both low-level contextual instructions and evaluative feedback, while inferring the task using an EM algorithm. Contextual advice was interactively provided through speech. As in MacGlashan et al. (<xref ref-type="bibr" rid="B73">2014b</xref>), the robot knew the set of possible tasks and was endowed with a planning algorithm allowing it to derive a policy for each possible task. This model was also used for interpreting evaluative feedback provided through EEG signals (Grizou et al., <xref ref-type="bibr" rid="B39">2014b</xref>). In Lopes et al. (<xref ref-type="bibr" rid="B69">2011</xref>), a predefined set of known feedback signals, both evaluative and corrective, was used for interpreting additional signals with IRL.</p>
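The joint inference of the task and the signal meanings can be sketched with a toy EM loop. This is an illustrative simplification in the spirit of Grizou et al. (2013), not their exact model: it assumes a finite set of candidate task policies and a teacher whose unlabeled signal has the hidden meaning "correct" exactly when the observed action matches the true task's optimal action.

```python
from collections import defaultdict

def em_ground_feedback(logs, tasks, n_iter=20):
    """Jointly infer the task and the meaning of unlabeled feedback signals.

    logs:  list of (state, action, signal) triples observed during teaching.
    tasks: dict task_name -> {state: optimal_action} (candidate policies).
    Meanings are booleans: True = the signal means "correct".
    """
    signals = {sig for _, _, sig in logs}
    # Pr(signal | meaning), initialised uniformly
    like = {m: {sig: 1.0 / len(signals) for sig in signals} for m in (True, False)}
    for _ in range(n_iter):
        # E-step: posterior over tasks under the current signal model
        post = {}
        for t, pol in tasks.items():
            p = 1.0
            for s, a, sig in logs:
                p *= like[pol[s] == a][sig]
            post[t] = p
        z = sum(post.values())
        post = {t: p / z for t, p in post.items()}
        # M-step: re-estimate Pr(signal | meaning) from expected labels
        counts = {m: defaultdict(float) for m in (True, False)}
        for t, pt in post.items():
            for s, a, sig in logs:
                counts[tasks[t][s] == a][sig] += pt
        for m in (True, False):
            z = sum(counts[m].values()) or 1.0
            like[m] = {sig: counts[m][sig] / z for sig in signals}
    return post, like
```

With logs consistent with one candidate task, the posterior concentrates on that task while the signal model converges to the teacher's actual signal-to-meaning mapping.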
</sec>
<sec>
<title>3.2.3. RL-Based Interpretation</title>
<p>A different approach relies on RL for interpreting advice (Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>, <xref ref-type="bibr" rid="B13">2010</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>). In Branavan et al. (<xref ref-type="bibr" rid="B12">2009</xref>), the authors used a policy-gradient RL algorithm with a predefined reward function to interpret general low-level instructions for a software application. This model was extended in Branavan et al. (<xref ref-type="bibr" rid="B13">2010</xref>) to allow for the interpretation of high-level instructions by learning a model of the environment. In Vogel and Jurafsky (<xref ref-type="bibr" rid="B126">2010</xref>), a similar approach was used for interpreting general low-level instructions, in a path-following task, using the SARSA algorithm. The rewards were computed according to the deviation from a provided demonstration.</p>
<p>In Mathewson and Pilarski (<xref ref-type="bibr" rid="B80">2016</xref>), contextual low-level instructions were provided to a prosthetic robotic arm in the form of myoelectric control signals and interpreted using evaluative feedback with an Actor-Critic architecture. In Najar et al. (<xref ref-type="bibr" rid="B88">2015b</xref>), a model of contextual low-level instructions was built using the XCS algorithm (Butz and Wilson, <xref ref-type="bibr" rid="B15">2001</xref>) to predict task rewards, while simultaneously being used to speed up the learning process. This model was extended in Najar et al. (<xref ref-type="bibr" rid="B87">2015a</xref>) to predict action values instead of task rewards. In Najar et al. (<xref ref-type="bibr" rid="B89">2016</xref>), interpretation was based on evaluative feedback using the Q-learning algorithm. In Najar (<xref ref-type="bibr" rid="B84">2017</xref>), several methods for interpreting contextual low-level instructions were compared. Each contextual low-level instruction was defined as a <italic>signal policy</italic> representing a probability distribution over the action-space in the same way as an RL policy:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M13"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>;</mml:mo><mml:mi>a</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>i</italic> is an observed instruction signal, such as a pointing gesture or a vocal command. Two types of interpretation methods were proposed: batch and incremental. The main idea of batch interpretation methods is to derive the signal policy of an instruction signal by combining the policies of all the task states in which that signal has been observed. Different combination methods were investigated. The Bayes optimal solution derives the signal policy by marginalizing the state policies over all the states where the signal has been observed:</p>
<disp-formula id="E12"><label>(12)</label><mml:math id="M14"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<disp-formula id="E13"><label>(13)</label><mml:math id="M15"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>Pr</italic>(<italic>i</italic>|<italic>s</italic>), <italic>Pr</italic>(<italic>s</italic>), and <italic>Pr</italic>(<italic>i</italic>) represent, respectively, the probability of observing the signal <italic>i</italic> in state <italic>s</italic>, the probability of being in state <italic>s</italic> and the probability of observing the signal <italic>i</italic>.</p>
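Equation (12) can be implemented directly from logged state-signal co-occurrences, estimating <italic>Pr</italic>(<italic>s</italic>|<italic>i</italic>) by empirical counts. The sketch below is illustrative; the data structures are assumptions, not taken from the original work.

```python
from collections import defaultdict

def bayes_signal_policy(observations, state_policy, actions):
    """Derive pi(i, a) = sum_s pi(s, a) * Pr(s | i)  (Equation 12).

    observations: list of (state, signal) pairs logged during teaching.
    state_policy: dict state -> {action: probability}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for s, i in observations:
        counts[i][s] += 1
    signal_policy = {}
    for i, state_counts in counts.items():
        total = sum(state_counts.values())
        signal_policy[i] = {
            a: sum(state_policy[s][a] * (n / total)  # n / total estimates Pr(s | i)
                   for s, n in state_counts.items())
            for a in actions}
    return signal_policy
```

Because each state policy is a distribution and the weights Pr(s|i) sum to one, the resulting signal policy is itself a valid distribution over actions.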
<p>Other batch interpretation methods were inspired by ensemble methods (Wiering and van Hasselt, <xref ref-type="bibr" rid="B131">2008</xref>), which have classically been used for combining the policies of different learning algorithms. These methods compute preferences <italic>p</italic>(<italic>i, a</italic>) for each action, which are then transformed into a policy using the softmax distribution as in Equation (6). Boltzmann Multiplication multiplies the state policies:</p>
<disp-formula id="E14"><label>(14)</label><mml:math id="M16"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x0220F;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>i</italic><sup>&#x0002A;</sup>(<italic>s</italic>) denotes the instruction signal associated with state <italic>s</italic>.</p>
<p>Boltzmann Addition sums the state policies:</p>
<disp-formula id="E15"><label>(15)</label><mml:math id="M17"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In Majority Voting, the most preferred interpretation for a signal <italic>i</italic> is the action that is most often optimal across all its contingent states:</p>
<disp-formula id="E16"><label>(16)</label><mml:math id="M18"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>I</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>I</italic>(<italic>x, y</italic>) is the indicator function that outputs 1 when <italic>x</italic> &#x0003D; <italic>y</italic> and 0 otherwise.</p>
<p>In Rank Voting, the most preferred action for <italic>i</italic> is the one that has the highest cumulative ranking over all its contingent states:</p>
<disp-formula id="E17"><label>(17)</label><mml:math id="M19"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>S</mml:mi><mml:mo>;</mml:mo><mml:msup><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>R</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>R</italic>(<italic>s, a</italic>) is the rank of action <italic>a</italic> in state <italic>s</italic>, such that if <italic>a</italic><sub><italic>j</italic></sub> and <italic>a</italic><sub><italic>k</italic></sub> denote two different actions and &#x003C0;(<italic>s, a</italic><sub><italic>j</italic></sub>) &#x02265; &#x003C0;(<italic>s, a</italic><sub><italic>k</italic></sub>) then <italic>R</italic>(<italic>s, a</italic><sub><italic>j</italic></sub>) &#x02265; <italic>R</italic>(<italic>s, a</italic><sub><italic>k</italic></sub>).</p>
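The four batch combination rules of Equations (14)-(17) can be written compactly over the list of state policies contingent on a signal, followed by the softmax of Equation (6). A hedged sketch, with illustrative representations:

```python
import math

def softmax(prefs, beta=1.0):
    """Turn action preferences p(i, a) into a policy (Equation 6)."""
    z = sum(math.exp(beta * p) for p in prefs.values())
    return {a: math.exp(beta * p) / z for a, p in prefs.items()}

def combine(state_policies, method):
    """Compute preferences p(i, a) from the policies of all states
    contingent on one signal (Equations 14-17).

    state_policies: list of {action: probability} dicts, one per state
    in which the signal was observed."""
    actions = state_policies[0].keys()
    if method == "boltzmann_mult":   # Eq. 14: product of state policies
        return {a: math.prod(pi[a] for pi in state_policies) for a in actions}
    if method == "boltzmann_add":    # Eq. 15: sum of state policies
        return {a: sum(pi[a] for pi in state_policies) for a in actions}
    if method == "majority_vote":    # Eq. 16: count states where a is greedy
        return {a: sum(1 for pi in state_policies
                       if a == max(pi, key=pi.get)) for a in actions}
    if method == "rank_vote":        # Eq. 17: cumulative rank of a
        return {a: sum(sorted(pi.values()).index(pi[a]) for pi in state_policies)
                for a in actions}
    raise ValueError(method)
```

For instance, with three contingent state policies favoring "L", "L", and "R", Majority Voting yields preferences of 2 and 1, which softmax then turns into a signal policy biased toward "L".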
<p>Incremental interpretation methods, on the other hand, incrementally update the meaning of each instruction signal using information from the task-learning process, such as the rewards, the TD error, or the policy gradient. With Reward-based Updating, instruction signals constitute the state space of an alternative MDP, which is solved using a standard RL algorithm. This approach is similar to the one used in Branavan et al. (<xref ref-type="bibr" rid="B12">2009</xref>, <xref ref-type="bibr" rid="B13">2010</xref>) and Vogel and Jurafsky (<xref ref-type="bibr" rid="B126">2010</xref>). In Value-based Updating, the meaning of an instruction is updated by the same amount as the Q-values of its corresponding state:</p>
<disp-formula id="E18"><label>(18)</label><mml:math id="M20"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B4;</mml:mi><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>whereas in Policy-based Updating, it is updated using the policy update:</p>
<disp-formula id="E19"><label>(19)</label><mml:math id="M21"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003B4;</mml:mi><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>&#x003B4;</mml:mi><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>These methods were compared using both a reward function and evaluative feedback. Policy-based Updating offered the best trade-off between performance and computational cost.</p>
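Value-based Updating (Equation 18) can be sketched as a tabular Q-learning step that applies the same temporal-difference update to the preference of the signal observed in the current state. The data structures below are illustrative assumptions, not taken from the surveyed implementations.

```python
def q_update(Q, P, s, a, r, s_next, signal, alpha=0.1, gamma=0.95):
    """One Q-learning step with Value-based Updating (Equation 18).

    Q: dict state -> {action: value}   (task model)
    P: dict signal -> {action: pref}   (instruction model)
    The preference for the observed signal moves by the same
    amount as Q(s, a)."""
    delta = alpha * (r + gamma * max(Q[s_next].values()) - Q[s][a])
    Q[s][a] += delta
    if signal is not None:
        P[signal][a] += delta  # signal meaning shares the Q-value update
```

Policy-based Updating (Equation 19) follows the same pattern, except that the signal preferences track the actor's policy update rather than the critic's value update.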
</sec>
</sec>
<sec>
<title>3.3. Shaping With Advice</title>
<p>We can distinguish several strategies for integrating advice into an RL system, depending on which stage of the learning process is influenced by the advice. The overall RL process can be summarized as follows. First, the main source of information to an RL agent is the reward function. In value-based RL, the reward function is used for computing a value function, which is then used for deriving a policy. In policy-based RL, the policy is directly derived from the reward function without computing any value function. Finally, the policy is used for decision-making. Advice can be integrated into the learning process at any of these four different stages: the reward function, the value function, the policy, or the decision.</p>
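The four injection points can be made concrete in a single value-based learning step. The sketch below is illustrative only, not an algorithm from the surveyed works; the <italic>advice</italic> hooks are hypothetical callables, one per shaping stage.

```python
import math

def shaped_step(Q, s, a, r, s2, advice, alpha=0.1, gamma=0.95, beta=5.0):
    """One value-based RL step marking the four stages at which
    advice can be injected. `advice` maps stage names to optional
    callables (hypothetical hooks for illustration)."""
    if "reward" in advice:                # 1. reward shaping
        r += advice["reward"](s, a)
    target = r + gamma * max(Q[s2].values())
    Q[s][a] += alpha * (target - Q[s][a])
    if "value" in advice:                 # 2. value shaping
        Q[s][a] += advice["value"](s, a)
    z = {b: math.exp(beta * Q[s2][b]) for b in Q[s2]}
    pi = {b: v / sum(z.values()) for b, v in z.items()}
    if "policy" in advice:                # 3. policy shaping
        pi = advice["policy"](s2, pi)
    if "decision" in advice:              # 4. decision biasing
        return pi, advice["decision"](s2, pi)
    return pi, max(pi, key=pi.get)
```

Reward and value shaping alter what the agent learns, whereas policy shaping and decision biasing alter how the learned (or advised) preferences are acted upon.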
<p>We qualify the methods used for integrating advice as shaping methods. In the literature, this term has been used exclusively for evaluative feedback, especially as a technique for providing extra rewards. For example, we find different terminologies such as reward shaping (Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>), interactive shaping (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>), and policy shaping (Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>; Cederborg et al., <xref ref-type="bibr" rid="B16">2015</xref>). In some works, the term shaping is not adopted at all (Loftin et al., <xref ref-type="bibr" rid="B68">2016</xref>). In this survey, we generalize this term to all types of advice by considering shaping in its general meaning of influencing an RL agent toward a desired behavior. In this sense, all methods for integrating advice into an RL process are considered shaping methods, particularly since similar shaping patterns can be found across different categories of advice.</p>
<p>Depending on the stage at which advice is integrated into the learning process, we distinguish four main strategies for integrating advice into an RL system: reward shaping, value shaping, policy shaping, and decision biasing (cf. <xref ref-type="table" rid="T3">Table 3</xref>). Orthogonal to this categorization, we distinguish model-free from model-based shaping strategies. In model-free shaping, the perceived advice is directly integrated into the learning process, whereas model-based shaping methods build a model of the teacher that is kept in parallel with the agent&#x00027;s own model of the task. Both models can be combined using several combination techniques that we review in this section.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Shaping methods.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Shaping method</bold></th>
<th valign="top" align="left"><bold>Model</bold></th>
<th valign="top" align="left"><bold>Advice</bold></th>
<th valign="top" align="left"><bold>References</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Reward shaping</td>
<td valign="top" align="left">Model-free</td>
<td valign="top" align="left">Contextual instructions</td>
<td valign="top" align="left">Clouse and Utgoff, <xref ref-type="bibr" rid="B25">1992</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Evaluative feedback</td>
<td valign="top" align="left">Isbell et al., <xref ref-type="bibr" rid="B47">2001</xref>; Thomaz et al., <xref ref-type="bibr" rid="B121">2006</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Model-based</td>
<td valign="top" align="left">Contextual instructions</td>
<td valign="top" align="left">Najar et al., <xref ref-type="bibr" rid="B88">2015b</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Evaluative feedback</td>
<td valign="top" align="left">Knox and Stone, <xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B58">2012b</xref></td>
</tr>
<tr>
<td valign="top" align="left">Value shaping</td>
<td valign="top" align="left">Model-free</td>
<td valign="top" align="left">General instructions</td>
<td valign="top" align="left">Utgoff and Clouse, <xref ref-type="bibr" rid="B125">1991</xref>; Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>; Maclin et al., <xref ref-type="bibr" rid="B74">2005a</xref>,<xref ref-type="bibr" rid="B75">b</xref>; Torrey et al., <xref ref-type="bibr" rid="B122">2008</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Evaluative feedback</td>
<td valign="top" align="left">Dorigo and Colombetti, <xref ref-type="bibr" rid="B34">1994</xref>; Colombetti et al., <xref ref-type="bibr" rid="B27">1996</xref>; Najar et al., <xref ref-type="bibr" rid="B89">2016</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Model-based</td>
<td valign="top" align="left">Contextual instructions</td>
<td valign="top" align="left">Najar et al., <xref ref-type="bibr" rid="B87">2015a</xref>, <xref ref-type="bibr" rid="B89">2016</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Evaluative feedback</td>
<td valign="top" align="left">Knox and Stone, <xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B58">2012b</xref></td>
</tr>
<tr>
<td valign="top" align="left">Policy shaping</td>
<td valign="top" align="left">Model-free</td>
<td valign="top" align="left">Contextual instructions</td>
<td valign="top" align="left">Rosenstein et al., <xref ref-type="bibr" rid="B100">2004</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Evaluative feedback</td>
<td valign="top" align="left">Ho et al., <xref ref-type="bibr" rid="B45">2015</xref>; MacGlashan et al., <xref ref-type="bibr" rid="B72">2017</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref></td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Model-based</td>
<td valign="top" align="left">Contextual instructions</td>
<td valign="top" align="left">Pradyot et al., <xref ref-type="bibr" rid="B97">2012b</xref>; Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Evaluative feedback</td>
<td valign="top" align="left">Knox and Stone, <xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B58">2012b</xref>; Lopes et al., <xref ref-type="bibr" rid="B69">2011</xref>; Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>; Loftin et al., <xref ref-type="bibr" rid="B68">2016</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Corrective feedback</td>
<td valign="top" align="left">Lopes et al., <xref ref-type="bibr" rid="B69">2011</xref></td>
</tr>
<tr>
<td valign="top" align="left">Decision biasing</td>
<td/>
<td valign="top" align="left">Guidance</td>
<td valign="top" align="left">Thomaz and Breazeal, <xref ref-type="bibr" rid="B117">2006</xref>; Suay and Chernova, <xref ref-type="bibr" rid="B105">2011</xref></td>
</tr>
<tr>
<td/>
<td/>
<td valign="top" align="left">Contextual instructions</td>
<td valign="top" align="left">Nicolescu and Mataric, <xref ref-type="bibr" rid="B93">2003</xref>; Rosenstein et al., <xref ref-type="bibr" rid="B100">2004</xref>; Rybski et al., <xref ref-type="bibr" rid="B101">2007</xref>; Thomaz and Breazeal, <xref ref-type="bibr" rid="B119">2007b</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Cruz et al., <xref ref-type="bibr" rid="B29">2015</xref></td>
</tr>
</tbody>
</table>
</table-wrap>
<sec>
<title>3.3.1. Reward Shaping</title>
<p>Traditionally, reward shaping has been used as a technique for providing an RL agent with intermediate rewards to speed up the learning process (Gullapalli and Barto, <xref ref-type="bibr" rid="B41">1992</xref>; Mataric, <xref ref-type="bibr" rid="B79">1994</xref>; Ng et al., <xref ref-type="bibr" rid="B91">1999</xref>; Wiewiora, <xref ref-type="bibr" rid="B132">2003</xref>). One way of providing intermediate rewards is to use evaluative feedback (Isbell et al., <xref ref-type="bibr" rid="B47">2001</xref>; Thomaz et al., <xref ref-type="bibr" rid="B121">2006</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>). In these works, evaluative feedback was treated in the same way as the feedback provided by the agent&#x00027;s environment in RL, so intermediate rewards are homogeneous with MDP rewards. Once converted into a numerical value, evaluative feedback can be treated as a delayed reward, just like MDP rewards, and used for computing a value function with standard RL algorithms (cf. <xref ref-type="fig" rid="F2">Figure 2</xref>) (Isbell et al., <xref ref-type="bibr" rid="B47">2001</xref>; Thomaz et al., <xref ref-type="bibr" rid="B121">2006</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>). This means that the effect of the provided feedback extends beyond the last performed action. When the RL agent also has access to a predefined reward function <italic>R</italic>, a new reward function <italic>R</italic>&#x02032; is computed by summing both forms of reward: <italic>R</italic>&#x02032; &#x0003D; <italic>R</italic> &#x0002B; <italic>R</italic><sup><italic>h</italic></sup>, where <italic>R</italic><sup><italic>h</italic></sup> is the human-delivered reward. This way of shaping is model-free in that the numerical values provided by the human teacher are directly used for augmenting the reward function.</p>
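As a minimal sketch (the states, action names, and feedback values below are illustrative, not taken from any of the cited systems), model-free reward shaping amounts to summing the human-delivered reward with the MDP reward inside an ordinary Q-learning update:

```python
from collections import defaultdict

def shaped_q_update(Q, s, a, r_mdp, r_human, s_next, actions,
                    alpha=0.1, gamma=0.9):
    """One Q-learning step with model-free reward shaping: the
    human-delivered reward R^h is summed with the MDP reward, so its
    effect propagates through the value function like any other
    delayed reward."""
    r_shaped = r_mdp + r_human                        # R' = R + R^h
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r_shaped + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)
actions = ["left", "right"]
# The teacher approves of "right" in state 0; the environment is silent.
shaped_q_update(Q, 0, "right", r_mdp=0.0, r_human=1.0, s_next=1,
                actions=actions)
```

After this single update, the shaped reward has raised the value of the approved state-action pair while leaving all others untouched.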
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Shaping with evaluative feedback. 1: model-free reward shaping. 2: model-based reward shaping. 3: model-free value shaping. 4: model-based value shaping. 5: model-free policy shaping. 6: model-based policy shaping.</p></caption>
<graphic xlink:href="frobt-08-584075-g0002.tif"/>
</fig>
<p>Reward shaping can also be performed with instructions (cf. <xref ref-type="fig" rid="F3">Figure 3</xref>). For example, in Clouse and Utgoff (<xref ref-type="bibr" rid="B25">1992</xref>), <italic>contextual instructions</italic> were integrated into an RL algorithm by positively reinforcing the proposed actions in a model-free fashion.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Shaping with contextual instructions. 1: model-free reward shaping. 2: model-based reward shaping. 3: model-free value shaping. 4: model-based value shaping. 5: model-free policy shaping. 6: model-based policy shaping. 7: decision biasing.</p></caption>
<graphic xlink:href="frobt-08-584075-g0003.tif"/>
</fig>
<p>Other works considered building an intermediate model of human rewards to perform model-based reward shaping. In the TAMER framework (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>), evaluative feedback was converted into rewards and used for computing a regression model <italic>&#x00124;</italic>, called the &#x0201C;<italic>Human Reinforcement Function</italic>.&#x0201D; This model predicted the amount of reward <italic>&#x00124;</italic>(<italic>s, a</italic>) that the human would provide for each state-action pair (<italic>s, a</italic>). Knox and Stone (<xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B58">2012b</xref>) proposed eight different shaping methods for combining the <italic>human reinforcement function &#x00124;</italic> with a predefined MDP reward function <italic>R</italic>. One of them, Reward Shaping, generalizes the reward shaping method by introducing a decaying weight factor &#x003B2; that controls the contribution of <italic>&#x00124;</italic> relative to <italic>R</italic>:</p>
<disp-formula id="E20"><label>(20)</label><mml:math id="M22"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>R</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>*</mml:mo><mml:mi>&#x00124;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Model-based reward shaping can also be performed with <italic>contextual instructions</italic>. In Najar et al. (<xref ref-type="bibr" rid="B88">2015b</xref>), a human teacher provided social cues to a humanoid robot about the next action to perform. A model of these cues was built in order to predict task rewards, and was simultaneously used for reward shaping.</p>
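Equation (20) can be sketched as follows. The reward function and the human-reinforcement model here are toy stand-ins (simple lambdas, not TAMER's learned regression model), chosen only to make the decaying contribution of <italic>&#x00124;</italic> visible:

```python
def shaped_reward(R, H_hat, s, a, beta):
    """Equation (20): R'(s, a) = R(s, a) + beta * H_hat(s, a),
    where beta decays so that H_hat's contribution fades over time."""
    return R(s, a) + beta * H_hat(s, a)

# Toy stand-ins: a uniform step cost, and a model predicting that the
# teacher rewards moving forward.
R = lambda s, a: -1.0
H_hat = lambda s, a: 2.0 if a == "forward" else 0.0

beta, decay = 1.0, 0.99
r0 = shaped_reward(R, H_hat, 0, "forward", beta)   # -1.0 + 1.00 * 2.0 = 1.0
beta *= decay                                      # beta decays after each use
r1 = shaped_reward(R, H_hat, 0, "forward", beta)   # -1.0 + 0.99 * 2.0 = 0.98
```

As &#x003B2; decays toward zero, the shaped reward converges back to the predefined MDP reward, so the human's influence is transient.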
</sec>
<sec>
<title>3.3.2. Value Shaping</title>
<p>While investigating reward shaping, some authors pointed out the fundamental difference that exists between immediate and delayed rewards (Dorigo and Colombetti, <xref ref-type="bibr" rid="B34">1994</xref>; Colombetti et al., <xref ref-type="bibr" rid="B27">1996</xref>; Knox and Stone, <xref ref-type="bibr" rid="B57">2012a</xref>). In particular, they considered evaluative feedback as immediate information about the value of an action, as opposed to standard MDP rewards (Ho et al., <xref ref-type="bibr" rid="B46">2017</xref>). For example, in Dorigo and Colombetti (<xref ref-type="bibr" rid="B34">1994</xref>), the authors used a <italic>myopic discounting</italic> scheme by setting the discount factor &#x003B3; to zero. In this way, evaluative feedback constituted <italic>immediate reinforcements in response to the actions of the learning agent</italic>, which amounts to treating rewards as equivalent to action values. So, value shaping constitutes an alternative to reward shaping by considering evaluative feedback as an action-preference function. The work of Dorigo and Colombetti (<xref ref-type="bibr" rid="B34">1994</xref>) was one of the earliest examples of model-free value shaping. Another example can be found in Najar et al. (<xref ref-type="bibr" rid="B89">2016</xref>), where evaluative feedback was directly used for updating a robot&#x00027;s action values with <italic>myopic discounting</italic>.</p>
<p>Model-free value shaping can also be done with <italic>general advice</italic>. For example, <italic>if-then</italic> rules can be incorporated into a kernel-based regression model by using the Knowledge-Based Kernel Regression (KBKR) method (Mangasarian et al., <xref ref-type="bibr" rid="B78">2004</xref>). This method was used for integrating <italic>general constraints</italic> into the value function of a SARSA agent using Support Vector Regression for value function approximation (Maclin et al., <xref ref-type="bibr" rid="B75">2005b</xref>). In this case, advice was provided in the form of constraints on action values (e.g., <italic>if</italic> condition <italic>then Q</italic>(<italic>s, a</italic>) &#x02265; 1), and incorporated into the value function through the KBKR method. This approach was extended in Maclin et al. (<xref ref-type="bibr" rid="B74">2005a</xref>) by proposing a new way of defining constraints on action values. In the new method, pref-KBKR (preference KBKR), the constraints were expressed in terms of action preferences (e.g., <italic>if</italic> condition <italic>then</italic> prefer action <italic>a</italic> to action <italic>b</italic>). This method was also used in Torrey et al. (<xref ref-type="bibr" rid="B122">2008</xref>). Another possibility is given by the Knowledge-Based Artificial Neural Network (KBANN) method, which allows incorporating knowledge expressed in the form of <italic>if-then</italic> rules into a neural network (Towell and Shavlik, <xref ref-type="bibr" rid="B123">1994</xref>). This method was used in RATLE, an advice-taking system based on Q-learning that used a neural network to approximate its Q-function (Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>). <italic>General instructions</italic> written in the form of <italic>if-then</italic> rules and <italic>while-repeat</italic> loops were incorporated into the Q-function using an extension of the KBANN method. In Kuhlmann et al. (<xref ref-type="bibr" rid="B63">2004</xref>), a SARSA agent was augmented with an <italic>Advice Unit</italic> that computed additional action values. <italic>General instructions</italic> were expressed in a specific formal language in the form of <italic>if-then</italic> rules. Each time a rule was activated in a given state, the value of the corresponding action was increased or decreased by a constant in the Advice Unit, depending on whether the rule advised for or against the action. These values were then used for augmenting the values generated by the agent&#x00027;s value function approximator.</p>
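This Advice Unit scheme can be sketched as follows. The rule, the constant shift, and the state encoding are all illustrative assumptions; the cited systems use richer formal languages and function approximators:

```python
ADVICE_BONUS = 0.5  # constant shift per activated rule (illustrative)

def apply_rule(advice_unit, condition, state, action, advise_for=True):
    """If the rule's condition holds in this state, shift the advised
    action's value up (advice for) or down (advice against)."""
    if condition(state):
        delta = ADVICE_BONUS if advise_for else -ADVICE_BONUS
        advice_unit[(state, action)] = advice_unit.get((state, action), 0.0) + delta

def advised_value(q_approx, advice_unit, state, action):
    """The approximator's value augmented by the Advice Unit's offset."""
    return q_approx(state, action) + advice_unit.get((state, action), 0.0)

# Toy rule: "if near the ball, prefer shooting".
advice_unit = {}
near_ball = lambda s: s == "near_ball"
apply_rule(advice_unit, near_ball, "near_ball", "shoot")

q_approx = lambda s, a: 0.0  # untrained approximator
value = advised_value(q_approx, advice_unit, "near_ball", "shoot")
```

The approximator itself is never modified: the advice lives in a separate table whose offsets are added on top of the learned values.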
<p>Model-based value shaping with evaluative feedback has been investigated by Knox and Stone (<xref ref-type="bibr" rid="B57">2012a</xref>), who compared different discount factors for the <italic>human reinforcement function &#x00124;</italic>. The authors showed that setting the discount factor to zero was best suited, which amounts to treating <italic>&#x00124;</italic> as an action-value function rather than a reward function.<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> In this case, the numerical representation of evaluative feedback is used for modifying the Q-function rather than the reward function. One of their shaping methods, Q-Augmentation (Knox and Stone, <xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B58">2012b</xref>), augments the MDP Q-function with the human reinforcement function <italic>&#x00124;</italic>:</p>
<disp-formula id="E21"><label>(21)</label><mml:math id="M23"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>Q</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x02032;</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>*</mml:mo><mml:mi>&#x00124;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B2; is the same decaying weight factor as in Equation (20).</p>
<p>Model-based value shaping can also be done with <italic>contextual instructions</italic>. In Najar et al. (<xref ref-type="bibr" rid="B87">2015a</xref>) and Najar et al. (<xref ref-type="bibr" rid="B89">2016</xref>), a robot built a model of contextual instructions in order to predict action values, which were used in turn for updating the value function.</p>
</sec>
<sec>
<title>3.3.3. Policy Shaping</title>
<p>The third shaping strategy is to integrate the advice directly into the agent&#x00027;s policy. Examples of model-free policy shaping with evaluative feedback can be found in MacGlashan et al. (<xref ref-type="bibr" rid="B72">2017</xref>) and Najar et al. (<xref ref-type="bibr" rid="B90">2020b</xref>). In both methods, evaluative feedback was used for updating the actor of an Actor-Critic architecture. In MacGlashan et al. (<xref ref-type="bibr" rid="B72">2017</xref>), the update term was scaled by the gradient of the policy:</p>
<disp-formula id="E22"><label>(22)</label><mml:math id="M24"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>w</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo class="qopname">ln</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>f</italic><sub><italic>t</italic></sub> is the feedback provided at time <italic>t</italic>. In Najar et al. (<xref ref-type="bibr" rid="B90">2020b</xref>), however, the authors did not consider a multiplying factor for evaluative feedback:</p>
<disp-formula id="E23"><label>(23)</label><mml:math id="M25"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>w</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
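The difference between the two updates can be sketched with a tabular actor that keeps action preferences under a softmax policy (all names and values below are illustrative). Following Equation (23), positive feedback simply raises the preference for the last action:

```python
import math

def softmax_policy(prefs, s, actions):
    """Softmax over the actor's action preferences in state s."""
    z = [math.exp(prefs.get((s, a), 0.0)) for a in actions]
    total = sum(z)
    return [v / total for v in z]

def feedback_update(prefs, s, a, f, alpha=0.5):
    """Equation (23): the preference for the last action is shifted
    directly by the signed human feedback, with no gradient factor."""
    prefs[(s, a)] = prefs.get((s, a), 0.0) + alpha * f

prefs = {}
actions = ["left", "right"]
feedback_update(prefs, 0, "right", +1)   # the teacher approves "right"
p_left, p_right = softmax_policy(prefs, 0, actions)
```

Under Equation (22), the same feedback term would additionally be scaled by the gradient of the log-policy, so actions the actor already takes with high probability receive smaller updates.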
<p>Model-free policy shaping with <italic>contextual instructions</italic> was considered in Rosenstein et al. (<xref ref-type="bibr" rid="B100">2004</xref>), in the context of an Actor-Critic architecture, where the error between the instruction and the <italic>actor</italic>&#x00027;s decision was used as an additional term to the TD error for updating the <italic>actor</italic>&#x00027;s parameters:</p>
<disp-formula id="E24"><label>(24)</label><mml:math id="M26"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>w</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>w</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B1;</mml:mi><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:msub><mml:mrow><mml:mi>&#x003B4;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msup><mml:mo>-</mml:mo><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mo>&#x02207;</mml:mo></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:msup><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>a</italic><sup><italic>E</italic></sup> is the actor&#x00027;s exploratory action, <italic>a</italic><sup><italic>A</italic></sup> is its deterministic action, <italic>a</italic><sup><italic>S</italic></sup> is the teacher&#x00027;s action, &#x003C0;<sup><italic>A</italic></sup>(<italic>s</italic>) is the actor&#x00027;s deterministic policy, and <italic>k</italic> is an interpolation parameter.</p>
<p>Knox and Stone proposed two model-based policy shaping methods for evaluative feedback (Knox and Stone, <xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B58">2012b</xref>). Action Biasing uses the same equation as Q-Augmentation (Equation 21) but only at decision time, so that the agent&#x00027;s Q-function is not modified:</p>
<disp-formula id="E25"><label>(25)</label><mml:math id="M27"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msup><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mi>Q</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>&#x003B2;</mml:mi><mml:mo>*</mml:mo><mml:mi>&#x00124;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The second method, Control Sharing, arbitrates between the decisions of both value functions based on a probability criterion. A parameter &#x003B2; is used as a threshold for determining the probability of selecting the decision according to <italic>&#x00124;</italic>:</p>
<disp-formula id="E26"><label>(26)</label><mml:math id="M28"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mi>r</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:mo>=</mml:mo><mml:mi>a</mml:mi><mml:mi>r</mml:mi><mml:mi>g</mml:mi><mml:mi>m</mml:mi><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:mi>&#x00124;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mi>i</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>&#x003B2;</mml:mi><mml:mo>,</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Otherwise, the decision is made according to the MDP policy.</p>
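Both decision rules can be sketched over a fixed Q-function and a learned <italic>&#x00124;</italic>, represented here as toy lookup tables (the states, actions, and values are illustrative):

```python
import random

def action_biasing(Q, H_hat, s, actions, beta):
    """Equation (25): decisions are biased by beta * H_hat, while the
    stored Q-function itself is left untouched."""
    return max(actions,
               key=lambda a: Q.get((s, a), 0.0) + beta * H_hat.get((s, a), 0.0))

def control_sharing(Q, H_hat, s, actions, beta, rng=random.random):
    """Equation (26): with probability min(beta, 1), follow H_hat's
    greedy action; otherwise follow the MDP policy."""
    if rng() < min(beta, 1.0):
        return max(actions, key=lambda a: H_hat.get((s, a), 0.0))
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {(0, "left"): 1.0, (0, "right"): 0.0}   # MDP values prefer "left"
H = {(0, "left"): 0.0, (0, "right"): 2.0}   # the teacher prefers "right"
a1 = action_biasing(Q, H, 0, ["left", "right"], beta=1.0)   # teacher wins
a2 = control_sharing(Q, H, 0, ["left", "right"], beta=0.0)  # MDP policy wins
```

Note the contrast: Action Biasing blends the two value scales inside one argmax, whereas Control Sharing makes an all-or-nothing choice between the two policies.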
<p>Other model-based policy shaping methods do not convert evaluative feedback into a scalar but into categorical information (Lopes et al., <xref ref-type="bibr" rid="B69">2011</xref>; Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>; Loftin et al., <xref ref-type="bibr" rid="B68">2016</xref>). The distribution of provided feedback is used within a Bayesian framework in order to derive a policy. The method proposed in Griffith et al. (<xref ref-type="bibr" rid="B37">2013</xref>) outperformed Action Biasing, Control Sharing, and Reward Shaping. After inferring the teacher&#x00027;s policy from the feedback distribution, it computed the Bayes optimal combination with the MDP policy by multiplying both probability distributions: &#x003C0; &#x0221D; &#x003C0;<sub><italic>R</italic></sub> &#x000D7; &#x003C0;<sub><italic>F</italic></sub>, where &#x003C0;<sub><italic>R</italic></sub> is the policy derived from the reward function and &#x003C0;<sub><italic>F</italic></sub> the policy derived from evaluative feedback. In Lopes et al. (<xref ref-type="bibr" rid="B69">2011</xref>), both evaluative and corrective feedback were considered under a Bayesian IRL perspective.</p>
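The combination step itself reduces to an elementwise product of the two action distributions followed by renormalization (the probabilities below are illustrative):

```python
def combine_policies(pi_R, pi_F):
    """Bayes-optimal combination pi ∝ pi_R × pi_F: multiply the two
    action distributions elementwise and renormalize."""
    unnorm = {a: pi_R[a] * pi_F[a] for a in pi_R}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

pi_R = {"left": 0.5, "right": 0.5}   # policy derived from the reward function
pi_F = {"left": 0.2, "right": 0.8}   # policy inferred from human feedback
pi = combine_policies(pi_R, pi_F)
```

When the MDP policy is uninformative (uniform, as here), the combined policy simply follows the teacher; when both are confident, the product concentrates on actions they agree on.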
<p>Model-based policy shaping can also be performed with <italic>contextual instructions</italic>. For example, in Pradyot et al. (<xref ref-type="bibr" rid="B97">2012b</xref>), the RL agent arbitrates between the action proposed by its Q-learning policy and the one proposed by the instruction model based on a confidence criterion:</p>
<disp-formula id="E27"><label>(27)</label><mml:math id="M29"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003BA;</mml:mi></mml:mrow><mml:mrow><mml:mi>&#x003C0;</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>a</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:mi>b</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mi>A</mml:mi><mml:mo>;</mml:mo><mml:mi>b</mml:mi><mml:mo>&#x02260;</mml:mo><mml:mi>a</mml:mi></mml:mrow></mml:munder></mml:mstyle><mml:mi>&#x003C0;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>s</mml:mi><mml:mo>,</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>The same arbitration criterion was used in Najar et al. (<xref ref-type="bibr" rid="B90">2020b</xref>) to decide between the outputs of an Instruction Model and a Task Model.</p>
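A sketch of this confidence criterion (Equation 27) and the resulting arbitration, with illustrative action scores standing in for the two models' policies:

```python
def confidence(action_scores):
    """Equation (27): gap between the best and second-best action scores."""
    ranked = sorted(action_scores.values(), reverse=True)
    return ranked[0] - ranked[1]

def arbitrate(rl_scores, instr_scores):
    """Follow whichever model is the more confident in the current state."""
    chosen = instr_scores if confidence(instr_scores) > confidence(rl_scores) else rl_scores
    return max(chosen, key=chosen.get)

rl = {"left": 0.55, "right": 0.45}    # near-indifferent Q-learning policy
instr = {"left": 0.1, "right": 0.9}   # confident instruction model
a = arbitrate(rl, instr)
```

Here the instruction model's larger gap between its top two actions makes it the more confident source, so its action is executed.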
</sec>
<sec>
<title>3.3.4. Decision Biasing</title>
<p>In the previous paragraphs, we said that policy shaping methods can be either model-free, by directly modifying the agent&#x00027;s policy, or model-based, by building a model that is used at decision time to bias the output of the policy. A different approach consists of using advice to directly bias the output of the policy at decision time, without modifying the policy or modeling the advice. This strategy, which we call decision biasing, is the simplest way of using advice, as it only biases the exploration strategy of the agent without modifying any of its internal variables. In this case, learning is done indirectly, by experiencing the effects of following the advice.</p>
<p>This strategy has mainly been used in the literature with guidance and contextual instructions. For example, in Suay and Chernova (<xref ref-type="bibr" rid="B105">2011</xref>) and Thomaz and Breazeal (<xref ref-type="bibr" rid="B117">2006</xref>), guidance reduces the set of actions that the agent can perform at a given time step.</p>
<p>Contextual instructions can also be used for guiding a robot throughout the learning process (Thomaz and Breazeal, <xref ref-type="bibr" rid="B119">2007b</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Cruz et al., <xref ref-type="bibr" rid="B29">2015</xref>). For example, in Nicolescu and Mataric (<xref ref-type="bibr" rid="B93">2003</xref>) and Rybski et al. (<xref ref-type="bibr" rid="B101">2007</xref>), an LfD system was augmented with verbal instructions in order to make the robot perform some actions during the demonstrations. In Rosenstein et al. (<xref ref-type="bibr" rid="B100">2004</xref>), in addition to model-free policy shaping, the provided instruction was also used for decision biasing. The robot executed a composite real-valued action that was computed as a linear combination of the <italic>actor</italic>&#x00027;s decision and the supervisor&#x00027;s instruction:</p>
<disp-formula id="E28"><label>(28)</label><mml:math id="M30"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>a</mml:mi><mml:mo>&#x02190;</mml:mo><mml:mi>k</mml:mi><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>E</mml:mi></mml:mrow></mml:msup><mml:mo>&#x0002B;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mi>a</mml:mi></mml:mrow><mml:mrow><mml:mi>S</mml:mi></mml:mrow></mml:msup><mml:mo>,</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>a</italic><sup><italic>E</italic></sup> is the actor&#x00027;s exploratory action, <italic>a</italic><sup><italic>S</italic></sup> the supervisor&#x00027;s action, and <italic>k</italic> an interpolation parameter.</p>
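For real-valued actions, Equation (28) is a simple linear blend, sketched below with illustrative values:

```python
def composite_action(a_explore, a_supervisor, k):
    """Equation (28): a = k * a^E + (1 - k) * a^S, blending the actor's
    exploratory action with the supervisor's instructed action."""
    return k * a_explore + (1.0 - k) * a_supervisor

# With k = 0.25, the supervisor's instruction dominates the blend.
a = composite_action(0.2, 1.0, k=0.25)   # 0.25 * 0.2 + 0.75 * 1.0 = 0.8
```

The interpolation parameter <italic>k</italic> sets how much autonomy the actor retains: k = 1 ignores the supervisor entirely, k = 0 follows the instruction verbatim.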
</sec>
</sec>
</sec>
<sec sec-type="discussion" id="s4">
<title>4. Discussion</title>
<p>In this section, we first discuss the difference between the various forms of advice introduced in section 3.1. We then discuss the approaches presented in sections 3.2 and 3.3. Finally, we open some perspectives toward a unified view of interactive learning methods.</p>
<sec>
<title>4.1. Comparing Different Forms of Advice</title>
<p>When designing an advice-taking system, one may ask which type of advice is best suited (Suay et al., <xref ref-type="bibr" rid="B106">2012</xref>). In this survey, we categorized different forms of advice according to how they are provided to the system. Even though the same interpretation and shaping methods can be applied to different categories of advice, each form of advice requires a different level of involvement from the human teacher and provides a different level of control over the learning process. Some forms provide little information about the policy, so the learning process relies mostly on autonomous exploration. Others are more informative about the policy, so the learning process depends mainly on the human teacher.</p>
<p>This aspect has been described in the literature as the guidance-exploration spectrum (Breazeal and Thomaz, <xref ref-type="bibr" rid="B14">2008</xref>). In section 3.1, we presented guidance as a special type of advice. So, in order to avoid confusion about the term guidance, we will use the term exploration-control spectrum instead of guidance-exploration (<xref ref-type="fig" rid="F4">Figure 4</xref>). In the following paragraphs, we compare different forms of advice along this spectrum, by putting them into perspective with respect to other learning schemes such as autonomous learning and LfD.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Exploration-control spectrum. As we move to the right, teaching signals inform more directly about the optimal policy and provide more control to the human over the learning process.</p></caption>
<graphic xlink:href="frobt-08-584075-g0004.tif"/>
</fig>
<sec>
<title>4.1.1. Autonomous Learning</title>
<p>At one end of the exploration-control spectrum, autonomous learning methods assume that the robot is able to autonomously evaluate its performance on the task, through a predefined evaluation function such as a reward function. The main advantage of this approach is the autonomy of the learning process. Since the evaluation function is integrated on board, the robot can optimize its behavior without requiring help from a supervisor.</p>
<p>However, this approach has some limitations when deployed in real-world settings. First, it is often hard to design an appropriate evaluation function that anticipates all aspects of a task, especially in complex environments (Kober et al., <xref ref-type="bibr" rid="B60">2013</xref>). Second, this approach relies on autonomous exploration, which raises some practical challenges. For example, exploring the space of behaviors makes the convergence of the learning process very slow, which limits the feasibility of such an approach in complex problems. Also, autonomous exploration may lead to dangerous situations, so safety is an important issue that has to be considered when designing autonomous learning systems (Garcia and Fernandez, <xref ref-type="bibr" rid="B36">2015</xref>).</p>
</sec>
<sec>
<title>4.1.2. Evaluative Feedback</title>
<p>Evaluative feedback constitutes another way to evaluate the agent&#x00027;s performance, with many advantages over predefined reward functions. First, like all other types of teaching signals, it can alleviate the limitations of autonomous learning by allowing faster convergence rates and safer exploration. Whether it is represented as categorical information (Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>) or as immediate rewards (Dorigo and Colombetti, <xref ref-type="bibr" rid="B34">1994</xref>), it provides a more straightforward evaluation of the policy, as it directly informs about the optimality of the performed action (Ho et al., <xref ref-type="bibr" rid="B45">2015</xref>). Second, from an engineering point of view, evaluative feedback is generally easier to implement than a reward function: whereas designing a proper reward function can be challenging in practice, evaluative feedback generally takes the form of binary values that can be easily implemented (Knox et al., <xref ref-type="bibr" rid="B59">2013</xref>).</p>
<p>Nevertheless, the informativeness of evaluative feedback is still limited, as it is only given as a reaction to the agent&#x00027;s actions, without communicating the optimal one. So, the agent still needs to explore different actions through trial and error, as in the autonomous learning setting. The main difference is that exploration is no longer required once the agent tries the optimal action and gets positive feedback. So, the trade-off between exploration and exploitation is less tricky to address than in autonomous learning. The limited informativeness of evaluative feedback can lead to poor performance. In fact, when it is the only available communicative channel, people tend to use it also as a form of guidance, in order to inform the agent about future actions (Thomaz et al., <xref ref-type="bibr" rid="B121">2006</xref>). This violates the assumption about how evaluative feedback should be used, which affects learning performance. Performance significantly improves when teachers are provided with an additional communicative channel for guidance (Thomaz and Breazeal, <xref ref-type="bibr" rid="B117">2006</xref>). This reflects the limitations of evaluative feedback and demonstrates that human teachers also need to provide guidance.</p>
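<p>The following sketch illustrates learning from binary evaluative feedback, loosely in the spirit of TAMER-style approaches (Knox and Stone). The simulated teacher, the target policy, and the update rule are all illustrative assumptions; note that the agent must still try each action at least once before the feedback can identify the optimal one.</p>

```python
# Simulated teacher reacting with binary evaluative feedback (+1 / -1).
# STATES, ACTIONS, OPTIMAL and the update rule are illustrative assumptions.
STATES, ACTIONS = range(3), ["left", "right"]
OPTIMAL = {0: "right", 1: "right", 2: "left"}  # hypothetical target policy

def teacher_feedback(state, action):
    """+1 if the action matches the teacher's target, -1 otherwise."""
    return 1.0 if action == OPTIMAL[state] else -1.0

# H(s, a): predicted human reinforcement, moved toward each new feedback
H = {(s, a): 0.0 for s in STATES for a in ACTIONS}
ALPHA = 0.5
for _ in range(20):           # a few rounds of interaction
    for s in STATES:
        for a in ACTIONS:     # the agent still has to try each action...
            f = teacher_feedback(s, a)
            H[(s, a)] += ALPHA * (f - H[(s, a)])

# ...but once the optimal action has been tried and rewarded,
# no further exploration of that state is needed.
learned = {s: max(ACTIONS, key=lambda a: H[(s, a)]) for s in STATES}
```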
</sec>
<sec>
<title>4.1.3. Corrective Feedback</title>
<p>One possibility for improving the feedback channel is to allow for corrections and refinements (Thomaz and Breazeal, <xref ref-type="bibr" rid="B118">2007a</xref>). Corrective instructions improve the informativeness of evaluative feedback by allowing the teacher to inform the agent about the optimal action (Celemin and Ruiz-Del-Solar, <xref ref-type="bibr" rid="B19">2019</xref>). Being also reactive to the agent&#x00027;s actions, they still require exploration. However, they spare the agent from having to wait until it tries the correct action on its own, so they require less exploration than evaluative feedback.</p>
<p>On the other hand, corrective instructions require more engineering effort than evaluative feedback, as they generally convey more than binary information. Since they operate over the action space, they require the system designer to encode the mapping between contextual instruction signals and their corresponding actions.</p>
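<p>A minimal sketch of this designer-encoded mapping follows; the signal vocabulary and action names are purely hypothetical, and the point is only that each raw corrective signal must be mapped by hand to an action that then directly overwrites the agent&#x00027;s stored choice for that state.</p>

```python
# Designer-encoded mapping from raw corrective signals to actions.
# Signal names and actions are illustrative assumptions.
SIGNAL_TO_ACTION = {"no, left": "left", "no, right": "right", "stop": "noop"}

def apply_correction(policy, state, signal):
    """Overwrite the stored action for `state` with the corrected one."""
    policy[state] = SIGNAL_TO_ACTION[signal]  # hand-built mapping required
    return policy

policy = apply_correction({}, "s0", "no, left")
```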
<p>An even more informative form of corrective feedback is provided by corrective demonstrations, which extend beyond correcting one single action to correcting a whole sequence of actions (Chernova and Veloso, <xref ref-type="bibr" rid="B22">2009</xref>). Corrective demonstrations operate on the same space as demonstrations, which requires more engineering than contextual instructions but also provides more control over the learning process (cf. the paragraph about demonstrations below).</p>
</sec>
<sec>
<title>4.1.4. Guidance</title>
<p>The experiments of Thomaz and Breazeal have shown that human teachers want to provide guidance (Thomaz and Breazeal, <xref ref-type="bibr" rid="B117">2006</xref>). In contrast to feedback, guidance allows the agent to be informed about future aspects of the task, such as the next action to perform (contextual instruction) (Cruz et al., <xref ref-type="bibr" rid="B29">2015</xref>), an interesting region to explore (demonstration) (Subramanian et al., <xref ref-type="bibr" rid="B107">2016</xref>) or a set of interesting actions to try (guidance) (Thomaz and Breazeal, <xref ref-type="bibr" rid="B117">2006</xref>).</p>
<p>Even though guidance requires less exploration than feedback by informing about future aspects of the task, the control over the learning process is exerted indirectly through decision biasing (cf. section 3.3). By performing the communicated guidance, the agent does not directly integrate this information as being the optimal behavior. Instead, it learns only through the experienced effects, for example by receiving a reward. So guidance only limits exploration, without providing full control over the learning process, as learning still depends on the evaluation of the performed actions.</p>
</sec>
<sec>
<title>4.1.5. Instructions</title>
<p>With respect to guidance, instructions inform more directly about the optimal policy in two main aspects. First, instructions are a special case of guidance where the teacher communicates only the optimal action. Second, the information about the optimal action can be integrated more directly into the learning process via reward shaping, value shaping, or policy shaping.</p>
<p>In section 3.1, we presented two main strategies for providing instructions: providing general instructions in the form of <italic>if-then</italic> rules, or interactively providing contextual instructions as the agent progresses in the task. The advantage of general instructions is that they do not depend on the dynamics of the task. Even though in the literature they are generally provided offline, prior to the learning process, there is no reason they cannot be integrated at any moment of the task. For example, in works like Kuhlmann et al. (<xref ref-type="bibr" rid="B63">2004</xref>), we can imagine different rules being activated and deactivated at different moments of the task. Their integration into the learning process will only depend on the validity of their conditions, not on the moment of their activation by the teacher. This puts less interactive load on the teacher, who does not need to stay focused in order to provide the correct information at the right moment.</p>
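<p>This condition-dependent integration can be sketched as follows. The rule set, state encoding, and action names are illustrative assumptions; what matters is that a rule fires whenever its condition holds, independently of when the teacher supplied it, and that rules can be added at any moment of the task.</p>

```python
# General instructions as if-then rules: a rule applies whenever its
# condition is valid, regardless of when it was communicated.
rules = []  # list of (condition, action) pairs; may grow at any time

def add_rule(condition, action):
    rules.append((condition, action))

def advised_action(state):
    """Return the first applicable advised action, or None."""
    for condition, action in rules:
        if condition(state):
            return action
    return None

add_rule(lambda s: s["ball_close"], "shoot")        # "if close, then shoot"
add_rule(lambda s: not s["ball_close"], "approach")

a1 = advised_action({"ball_close": True})
a2 = advised_action({"ball_close": False})
```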
<p>General instructions also present some drawbacks. First, they can be difficult to formulate. The teacher needs to gain insight into the task and the environment dynamics in order to take different situations into account in advance and to formulate relevant rules (Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>). Furthermore, the teacher needs to know about the robot&#x00027;s sensors and effectors in order to correctly express the desired behaviors. So, formulating rules requires expertise about the task, the environment, and the robot. Second, general instructions can be difficult to communicate. They require either expert programming skills from the teacher or sophisticated natural language understanding capabilities from the agent.</p>
<p>Contextual instructions, on the other hand, communicate a simpler message at a time, which makes them easier to formulate and to provide. Compared to general instructions, they only inform about the next action to perform, without expressing the condition, which can be inferred by the agent from the current task state. However, this makes them more prone to ambiguity. For instance, writing general instructions by hand allows the teacher to specify the features that are relevant to the application of each rule, i.e., to control generalization. With contextual instructions, however, generalization has to be inferred by the agent from the context.</p>
<p>Finally, interactively providing instructions makes it easy for the teacher to adapt to changes in the environment&#x00027;s dynamics. So they provide more control over the learning process than general instructions. However, this can be challenging in highly dynamic tasks, as the teacher needs some time to communicate each contextual instruction.</p>
</sec>
<sec>
<title>4.1.6. Demonstration</title>
<p>Formally, a demonstration is defined as a sequence of state-action pairs representing a trajectory in the task space (Argall et al., <xref ref-type="bibr" rid="B6">2009</xref>). So, from a strictly formal view, a demonstration is not very different from a general instruction providing a sequence of actions to perform (Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>). The only difference is the sequence of states that the robot is supposed to experience. In many LfD settings, such as teleoperation (Abbeel et al., <xref ref-type="bibr" rid="B1">2010</xref>) and kinesthetic teaching (Akgun et al., <xref ref-type="bibr" rid="B3">2012</xref>), the states visited by the robot are controlled by the human. So, controlling a robot through these devices can be seen as providing a continuous stream of contextual instructions: the commands sent via the joystick or the forces exerted on the robot&#x00027;s kinesthetic device. So the difference between action plans and demonstrations provided under these settings goes beyond their formal definitions as sequences of actions or state-action pairs.</p>
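<p>The formal view above can be sketched in a few lines. The deterministic toy dynamics are an assumption, used only to show how a stream of teleoperation commands induces both sequences: the human supplies the actions, and thereby also controls the states the robot visits.</p>

```python
from typing import List, Tuple

# A demonstration as a trajectory of state-action pairs
# (Argall et al., 2009), recorded from a stream of teacher commands.
State, Action = int, int

def record_demonstration(start: State, commands: List[Action]) -> List[Tuple[State, Action]]:
    trajectory, s = [], start
    for a in commands:
        trajectory.append((s, a))  # state experienced + command executed
        s += a                     # assumed dynamics: the command is applied
    return trajectory

demo = record_demonstration(0, [1, 1, -1])
```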
<p>The main difference between demonstrations and general instructions (actually, all forms of advice) is that demonstrations provide control not only over the learning process but also over task execution. When providing demonstrations, the teacher controls the robot joints, so the communicated instruction is systematically executed. With instructions, however, the robot is in control of its own actions. Even though the instruction can be integrated into the learning process, via any shaping methods, the robot is still free to execute or not the communicated action.</p>
<p>One downside of this control is that demonstrations involve more human load than instructions. Demonstrations require the teacher to be active in executing the task, while instructions involve only communication. This gives instructions some advantages in terms of interaction: they can be provided through different modalities such as speech or gesture, and by using a wider variety of words or signals. Demonstrations, however, are constrained by the control interface. Moreover, demonstrations require continuous focus in providing complete trajectories, while instructions can be sporadic, as with contextual instructions.</p>
<p>Therefore, instructions can be better suited to situations where demonstrations are difficult to provide. For example, people with limited autonomy may be unable to demonstrate a task by themselves, or to control a robot&#x00027;s joints. In these situations, communication is more convenient. On the other hand, demonstrations are better adapted to highly dynamic tasks and continuous environments, since instructions require some time to be communicated.</p>
</sec>
</sec>
<sec>
<title>4.2. Comparing Different Interpretation Methods</title>
<p>In section 3.2, we presented three main approaches for interpreting advice. The classical approach, supervised interpretation, relies on annotated data for training linguistic parsers. Even though this approach can be effective for building systems that are able to take natural language advice into account, it comes at the cost of constituting large corpora of language-to-command alignments.</p>
<p>The second approach, grounded interpretation, relaxes this constraint by relying on examples of task executions instead of perfectly aligned commands. This approach is easier to implement by taking advantage of crowd-sourcing platforms like Amazon Mechanical Turk. Also, the annotation process is facilitated, as it can be performed in the reverse order compared to the standard approach. First, various demonstrations of the task are collected, for example in the form of videos (Tellex et al., <xref ref-type="bibr" rid="B113">2011</xref>, <xref ref-type="bibr" rid="B114">2014</xref>). Then, each demonstration is associated with a general instruction. Even though this approach is more affordable than standard language-to-command annotation, it still comes at the cost of providing demonstrations, which can be challenging in some contexts, as discussed in the previous section.</p>
<p>The third approach, RL-based interpretation, relaxes these constraints even more by relying only on a predefined performance criterion to guide the interpretation process (Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>, <xref ref-type="bibr" rid="B13">2010</xref>). Some intermediate methods also exist, for example deriving a reward function from demonstrations and then using an RL algorithm to interpret advice (Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>; Tellex et al., <xref ref-type="bibr" rid="B114">2014</xref>). Given that reward functions can also be challenging to design, some methods rely on predefined advice for interpreting other advice (Lopes et al., <xref ref-type="bibr" rid="B69">2011</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>; Najar et al., <xref ref-type="bibr" rid="B89">2016</xref>), or on a combination of advice and reward functions (Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>).</p>
<p>Orthogonal to the difference between supervised, grounded, and RL-based interpretation methods, we can distinguish two different strategies for teaching the system how to interpret unlabeled advice. The first strategy is to teach the system how to interpret advice without using it in parallel for task learning. For example, a human can teach an agent how to interpret continuous streams of contextual instructions by using evaluative feedback (Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>). Here, the main task for the agent is to learn how to interpret unlabeled instructions, not to use them for learning another task. Another example is when the agent is first provided with general instructions, either in the form of <italic>if-then</italic> rules or action plans, and is then taught how to interpret these instructions using either demonstrations (Tellex et al., <xref ref-type="bibr" rid="B113">2011</xref>; MacGlashan et al., <xref ref-type="bibr" rid="B71">2014a</xref>), evaluative feedback (MacGlashan et al., <xref ref-type="bibr" rid="B73">2014b</xref>), or a predefined reward function (Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>, <xref ref-type="bibr" rid="B13">2010</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>). In this case, even though the agent is allowed to interact with its environment, the main task is still to learn how to interpret advice, not to use it for task learning.</p>
<p>The second strategy consists of guiding a task-learning process by interactively providing the agent with unlabeled contextual advice. In this case, the agent learns how to interpret advice at the same time as it learns to perform the task (Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>). For example, in Grizou et al. (<xref ref-type="bibr" rid="B40">2013</xref>), the robot is provided with a set of hypotheses about possible tasks and advice meanings. The robot then infers the task and advice meanings that are the most coherent with each other and with the history of observed advice signals. In Najar et al. (<xref ref-type="bibr" rid="B90">2020b</xref>), task rewards are used for grounding the meaning of contextual instructions, which are used in turn for speeding-up the task-learning process.</p>
<p>It is important to understand the difference between these two strategies. First, when the agent learns how to interpret advice while using it for task learning, we must think about which shaping method to use for integrating the interpreted advice into the task-learning process (cf. section 3.3). Second, when the goal is only to interpret advice, there is no challenge regarding the optimality or the sparsity of the unlabeled advice.</p>
<p>With the first strategy, advice cannot be erroneous as it constitutes the reference for the interpretation process. Even though the methods implementing this strategy do not explicitly assume perfect advice, the robustness of the interpretation methods against inconsistent advice is not systematically investigated. When advice is also used for task learning, however, we need to take into account whether or not advice is correct with respect to the target task. For example, in Grizou et al. (<xref ref-type="bibr" rid="B40">2013</xref>), the authors report the performance of their system under erroneous evaluative feedback. In Najar et al. (<xref ref-type="bibr" rid="B90">2020b</xref>), the system is evaluated in simulation against various levels of error for both evaluative feedback and contextual instructions. Also with the first strategy, advice signals cannot be sparse since they constitute the state-space of the interpretation process. For instance, the standard RL methods that have been used for interpreting general instructions (Branavan et al., <xref ref-type="bibr" rid="B12">2009</xref>, <xref ref-type="bibr" rid="B13">2010</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>) cannot be used for interpreting sparse contextual instructions. In these methods, instructions constitute the state-space of an MDP over which the RL algorithm is deployed, so they need to be instantiated on every time-step. This problem has been addressed in Najar et al. (<xref ref-type="bibr" rid="B90">2020b</xref>), where the system was able to interpret sporadic contextual instructions by using the TD error of the task-learning process.</p>
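<p>The second strategy can be sketched with a simple running-average scheme: each raw signal is credited with the meaning (action) whose execution after that signal tended to be followed by a positive TD error from the task-learning process. The signals, actions, update rule, and TD-error samples below are illustrative assumptions, not the algorithm of any particular surveyed system.</p>

```python
from collections import defaultdict

# Grounding unlabeled instruction signals in the task-learning process:
# credit accrues to the (signal, action) pair that co-occurs with
# positive TD errors. All names and values are illustrative assumptions.
meaning_value = defaultdict(float)  # (signal, action) -> running credit
ALPHA = 0.5

def update_interpretation(signal, action, td_error):
    key = (signal, action)
    meaning_value[key] += ALPHA * (td_error - meaning_value[key])

def interpret(signal, actions):
    return max(actions, key=lambda a: meaning_value[(signal, a)])

# simulated interaction: the signal "bip" tends to precede good
# outcomes when it is followed by the action "left"
for _ in range(10):
    update_interpretation("bip", "left", +1.0)
    update_interpretation("bip", "right", -0.5)

meaning = interpret("bip", ["left", "right"])
```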
</sec>
<sec>
<title>4.3. Comparing Different Shaping Methods</title>
<p>In section 3.3, we presented different methods for integrating advice into an RL process: reward shaping, value shaping, policy shaping, and decision biasing. The standard approach, reward shaping, has been effective in many domains (Clouse and Utgoff, <xref ref-type="bibr" rid="B25">1992</xref>; Isbell et al., <xref ref-type="bibr" rid="B47">2001</xref>; Thomaz et al., <xref ref-type="bibr" rid="B121">2006</xref>; Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>; Mathewson and Pilarski, <xref ref-type="bibr" rid="B80">2016</xref>). However, this way of providing intermediate rewards has been shown to cause sub-optimal behaviors such as positive circuits (Knox and Stone, <xref ref-type="bibr" rid="B57">2012a</xref>; Ho et al., <xref ref-type="bibr" rid="B45">2015</xref>). Even though these effects have been mainly studied under the scope of evaluative feedback, they can also be extended to other forms of advice such as instructions, since the positive circuits problem is inherent to the reward shaping scheme regardless of the source of the rewards (Mahadevan and Connell, <xref ref-type="bibr" rid="B77">1992</xref>; Randlov and Alstrom, <xref ref-type="bibr" rid="B99">1998</xref>; Ng et al., <xref ref-type="bibr" rid="B91">1999</xref>; Wiewiora, <xref ref-type="bibr" rid="B132">2003</xref>).</p>
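<p>The classic remedy for positive circuits is potential-based reward shaping (Ng et al., <xref ref-type="bibr" rid="B91">1999</xref>), sketched below: the shaping term F(s, s&#x00027;) = &#x003B3;&#x003D5;(s&#x00027;) &#x02212; &#x003D5;(s) telescopes along any trajectory, so no cycle can accumulate a net positive bonus. The potential function &#x003D5; used here is an illustrative assumption.</p>

```python
# Potential-based reward shaping: F(s, s') = gamma * phi(s') - phi(s).
# The potential phi (a hypothetical "closeness to the goal") is an
# illustrative assumption.
GAMMA = 0.9
phi = {0: 0.0, 1: 1.0, 2: 2.0}

def shaped_reward(r, s, s_next):
    return r + GAMMA * phi[s_next] - phi[s]

# Net shaping bonus around the cycle 0 -> 1 -> 0 (environment reward 0):
cycle_bonus = shaped_reward(0.0, 0, 1) + shaped_reward(0.0, 1, 0)
# = (0.9*1 - 0) + (0.9*0 - 1) = -0.1, so circling is never profitable.
```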
<p>Consequently, many authors considered value shaping as an alternative solution to reward shaping (Knox and Stone, <xref ref-type="bibr" rid="B58">2012b</xref>; Ho et al., <xref ref-type="bibr" rid="B46">2017</xref>). However, when comparing different shaping methods for evaluative feedback, Knox and Stone observed that &#x0201C;<italic>the more a technique directly affects action selection, the better it does, and the more it affects the update to the Q function for each transition experience, the worse it does&#x0201D;</italic> (Knox and Stone, <xref ref-type="bibr" rid="B58">2012b</xref>). In fact, this can be explained by the specificity of the Q-function with respect to other preference functions. Unlike other preference functions (e.g., Advantage function, Harmon et al., <xref ref-type="bibr" rid="B42">1994</xref>), a Q-function also informs about the proximity to the goal via temporal discounting. Contextual advice such as evaluative feedback and contextual instructions, however, only inform about local preferences like the last or the next action, without including such information (Ho et al., <xref ref-type="bibr" rid="B45">2015</xref>). So, like reward shaping, value shaping with contextual advice may also lead to convergence problems.</p>
<p>Overall, policy shaping methods show better performance compared to other shaping methods (Knox and Stone, <xref ref-type="bibr" rid="B58">2012b</xref>; Griffith et al., <xref ref-type="bibr" rid="B37">2013</xref>; Ho et al., <xref ref-type="bibr" rid="B45">2015</xref>). In addition to performance, another advantage of policy shaping is that it is applicable to a wider range of methods that directly derive a policy, without computing a value function or even using rewards.</p>
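<p>A minimal sketch of policy shaping by multiplying action distributions, in the spirit of Griffith et al. (<xref ref-type="bibr" rid="B37">2013</xref>), follows. The Q-values, feedback counts, and the assumed probability C that a single feedback signal is correct are all illustrative; the point is that the feedback-derived distribution is combined multiplicatively with the agent&#x00027;s own Boltzmann policy.</p>

```python
import math

# Policy shaping: multiply the RL policy by the distribution implied by
# human feedback counts, then renormalize. Numbers are assumptions.

def boltzmann(values, beta=1.0):
    exps = [math.exp(beta * v) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

def shape_policy(q_values, delta, C=0.8):
    """delta[a]: number of positive minus negative feedbacks for action a;
    C: assumed probability that a single feedback signal is correct."""
    pi_rl = boltzmann(q_values)
    pi_fb = [C**d / (C**d + (1 - C)**d) for d in delta]
    combined = [p * f for p, f in zip(pi_rl, pi_fb)]
    z = sum(combined)
    return [c / z for c in combined]

# RL slightly prefers action 0, but feedback strongly favors action 1
pi = shape_policy([0.2, 0.1], delta=[0, 3])
```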
</sec>
<sec>
<title>4.4. Toward a Unified View</title>
<p>Overall, all forms of advice overcome the limitations of autonomous learning by providing more control over the learning process. Since more control comes at the cost of more interaction load, the autonomy of the learning process is important for minimizing the burden on the human teacher (Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>). Consequently, many advice-taking systems combine different learning modalities in order to balance between autonomy and control. For example, RL can be augmented with evaluative feedback (Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>; Sridharan, <xref ref-type="bibr" rid="B104">2011</xref>; Knox and Stone, <xref ref-type="bibr" rid="B58">2012b</xref>), corrective feedback (Celemin et al., <xref ref-type="bibr" rid="B18">2019</xref>), instructions (Maclin and Shavlik, <xref ref-type="bibr" rid="B76">1996</xref>; Kuhlmann et al., <xref ref-type="bibr" rid="B63">2004</xref>; Rosenstein et al., <xref ref-type="bibr" rid="B100">2004</xref>; Pradyot et al., <xref ref-type="bibr" rid="B97">2012b</xref>), instructions and evaluative feedback (Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>), demonstrations (Taylor et al., <xref ref-type="bibr" rid="B112">2011</xref>; Subramanian et al., <xref ref-type="bibr" rid="B107">2016</xref>), demonstrations and evaluative feedback (Leon et al., <xref ref-type="bibr" rid="B64">2011</xref>), or demonstrations, evaluative feedback, and instructions (Tenorio-Gonzalez et al., <xref ref-type="bibr" rid="B115">2010</xref>). 
Demonstrations can be augmented with corrective feedback (Chernova and Veloso, <xref ref-type="bibr" rid="B22">2009</xref>; Argall et al., <xref ref-type="bibr" rid="B5">2011</xref>), instructions (Rybski et al., <xref ref-type="bibr" rid="B101">2007</xref>), instructions and feedback, both evaluative and corrective (Nicolescu and Mataric, <xref ref-type="bibr" rid="B93">2003</xref>), or with prior RL (Syed and Schapire, <xref ref-type="bibr" rid="B111">2007</xref>). In Waytowich et al. (<xref ref-type="bibr" rid="B129">2018</xref>), the authors proposed a framework for combining different learning modalities in a principled way. The system could balance autonomy and human control by switching from demonstration to guidance to evaluative feedback using a set of predefined metrics such as performance.</p>
<p>Integrating different forms of advice into one single and unified formalism remains an active research question. So far, different forms of advice have been mainly investigated separately by different communities. For example, some shaping methods have been designed exclusively for evaluative feedback and were not tested with other forms of advice such as contextual instructions, and the converse is also true. In this survey, we extracted several aspects that were shared across different forms of advice. Regardless of the type of advice, we must ask the same computational questions as we go through the same overall process (<xref ref-type="fig" rid="F5">Figure 5</xref>): First, we must think about how advice will be represented and whether its meaning will be predetermined or interpreted by the learning agent. Second, we must decide whether to aggregate advice into a model, or directly use it for influencing the learning process (model-based vs. model-free shaping). Finally, we must choose a shaping method for integrating advice (or its model) into the learning process. From this perspective, all shaping methods that were specifically designed for evaluative feedback could also be used for instructions and <italic>vice versa</italic>. For example, all the methods proposed by Knox and Stone for learning from evaluative feedback (Knox and Stone, <xref ref-type="bibr" rid="B54">2010</xref>, <xref ref-type="bibr" rid="B55">2011a</xref>, <xref ref-type="bibr" rid="B58">2012b</xref>), can be recycled for learning from instructions. Similarly, the confidence criterion used in Pradyot et al. 
(<xref ref-type="bibr" rid="B97">2012b</xref>) for learning from contextual instructions constitutes another Control Sharing mechanism, similar to the one proposed in Knox and Stone (<xref ref-type="bibr" rid="B54">2010</xref>), Knox and Stone (<xref ref-type="bibr" rid="B55">2011a</xref>), and Knox and Stone (<xref ref-type="bibr" rid="B58">2012b</xref>) for learning from evaluative feedback.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Shaping with advice, a unified view. When advice is provided to the learning agent, it has first to be encoded into an appropriate representation. If the mapping between teaching signals and their corresponding internal representation is not predetermined, then advice has to be interpreted by the agent. Then advice can be integrated into the learning process (shaping), either in a model-free or a model-based fashion. Optional steps, interpretation and modeling, are sketched in light gray.</p></caption>
<graphic xlink:href="frobt-08-584075-g0005.tif"/>
</fig>
<p>It is also interesting to think about the relationship between interpretation and shaping. For example, we can notice the similarity between interpretation and shaping methods. In Section 3.2, we mentioned that some interpretation methods relying on the task-learning process can be either reward-based, value-based, or policy-based. This scheme is reminiscent of the different shaping methods: reward shaping, value shaping, and policy shaping. For instance, the policy shaping method proposed in Griffith et al. (<xref ref-type="bibr" rid="B37">2013</xref>) for combining evaluative feedback with a reward function is mathematically equivalent to the Boltzmann Multiplication method used in Najar (<xref ref-type="bibr" rid="B84">2017</xref>) for interpreting contextual instructions. So by extension, the other ensemble methods that have been used for interpreting contextual instructions could also be used for shaping. We also note that the confidence criterion in Pradyot et al. (<xref ref-type="bibr" rid="B97">2012b</xref>) was used for both interpreting instructions and policy shaping. So, we can think of the relationship between shaping and interpretation as a reciprocal influence scheme, where advice can be interpreted from the task-learning process in a reward-based, value-based, or a policy-based way, and in turn can influence the learning process in a reward-based, value-based, or policy-based shaping way (Najar, <xref ref-type="bibr" rid="B84">2017</xref>). This view contrasts with the standard flow of the advice-taking process, where advice is interpreted before being integrated into the learning process (Hayes-Roth et al., <xref ref-type="bibr" rid="B44">1981</xref>). In fact in many works, interpretation and shaping happen simultaneously, sometimes by using the same mechanisms (Pradyot and Ravindran, <xref ref-type="bibr" rid="B98">2011</xref>; Najar et al., <xref ref-type="bibr" rid="B85">2020a</xref>).</p>
<p>Under this perspective, we can extend the similarity between all forms of advice to include other sources of information such as demonstrations and reward functions. In the end, even though these signals can sometimes contradict each other, they globally inform about one and the same thing, i.e., the task (Cederborg and Oudeyer, <xref ref-type="bibr" rid="B17">2014</xref>). Until recently, advice and demonstration have mainly been considered as two complementary but distinct approaches, i.e., communication vs. action (Dillmann et al., <xref ref-type="bibr" rid="B32">2000</xref>; Argall et al., <xref ref-type="bibr" rid="B4">2008</xref>; Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>, <xref ref-type="bibr" rid="B56">2011b</xref>; Judah et al., <xref ref-type="bibr" rid="B49">2010</xref>). However, these two approaches share many common aspects. For example, the counterpart of interpreting advice in the LfD literature is the correspondence problem, which is the question of how to map the teacher&#x00027;s states and actions into the agent&#x00027;s own states and actions. With advice, we also have a correspondence problem that consists of interpreting the raw advice signals. So, we can consider a more general correspondence problem that consists of interpreting raw teaching signals, independently of their nature. So far, the correspondence problem has mainly been addressed within the community of learning by imitation. Imitation is a special type of social learning in which the agent reproduces what it perceives. So, there is a built-in assumption that what is seen has to be reproduced. Advice is different from imitation in that the robot has to reproduce what is communicated by the advice, not what is perceived. For instance, saying &#x0201C;turn left&#x0201D; requires the robot to perform the action of turning left, not to reproduce the sentence &#x0201C;turn left&#x0201D;. 
However, evidence from neuroscience gave rise to a new understanding of the emergence of human language as a sophistication of imitation throughout evolution (Adornetti and Ferretti, <xref ref-type="bibr" rid="B2">2015</xref>). In this view, language is grounded in action, just like imitation (Corballis, <xref ref-type="bibr" rid="B28">2010</xref>). For example, there is evidence that the mirror neurons of monkeys also fire to the sounds of certain actions, such as the tearing of paper or the cracking of nuts (Kohler et al., <xref ref-type="bibr" rid="B61">2002</xref>), and that spoken phrases about movements of the foot and the hand activate the corresponding mirror-neuron regions of the pre-motor cortex in humans (Aziz-Zadeh et al., <xref ref-type="bibr" rid="B9">2006</xref>).</p>
<p>So, one challenging question is whether we could unify the problem of interpreting any kind of teaching signal under the scope of one general correspondence problem. This is a relatively new research question, and few attempts have been made in this direction. In Cederborg and Oudeyer (<xref ref-type="bibr" rid="B17">2014</xref>), the authors proposed a mathematical framework for learning from different sources of information. The main idea is to relax the assumptions about the meaning of teaching signals by taking advantage of the coherence between the different sources of information. When comparing demonstrations with instructions, we mentioned that some demonstration settings could be considered as a way of providing continuous streams of contextual instructions, with the subtle difference that demonstrations are systematically executed by the robot. Considering this analogy, the growing literature about interpreting instructions (Branavan et al., <xref ref-type="bibr" rid="B13">2010</xref>; Vogel and Jurafsky, <xref ref-type="bibr" rid="B126">2010</xref>; Grizou et al., <xref ref-type="bibr" rid="B40">2013</xref>; Najar et al., <xref ref-type="bibr" rid="B90">2020b</xref>) could provide insights for designing new ways of solving the correspondence problem in imitation.</p>
<p>Unifying all types of teaching signals under the same view is a relatively recent research question (Cederborg and Oudeyer, <xref ref-type="bibr" rid="B17">2014</xref>; Waytowich et al., <xref ref-type="bibr" rid="B129">2018</xref>), and this survey aims to push in this direction by clarifying some of the concepts used in the interactive learning literature and by highlighting the similarities between different approaches. The computational questions covered in this survey extend beyond the boundaries of Artificial Intelligence, as similar questions regarding the computational implementation of social learning strategies are also addressed by the Cognitive Neuroscience community (Biele et al., <xref ref-type="bibr" rid="B11">2011</xref>; Najar et al., <xref ref-type="bibr" rid="B85">2020a</xref>; Olsson et al., <xref ref-type="bibr" rid="B94">2020</xref>). We hope this survey will contribute to bridging the gap between the two communities.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusion</title>
<p>In this paper, we provided an overview of existing methods for integrating human advice into an RL process. We first proposed a taxonomy of the different forms of advice that can be provided to a learning agent. We then described methods for interpreting advice and for integrating it into the learning process. Finally, we discussed these different approaches and outlined some perspectives toward a unified view of interactive learning methods.</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>AN wrote the manuscript. MC supervised the project. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back>
<ack><p>This work was supported by the Romeo2 project. This manuscript has been released as a pre-print at arXiv (Najar and Chetouani, <xref ref-type="bibr" rid="B86">2020</xref>).</p></ack>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abbeel</surname> <given-names>P.</given-names></name> <name><surname>Coates</surname> <given-names>A.</given-names></name> <name><surname>Ng</surname> <given-names>A. Y.</given-names></name></person-group> (<year>2010</year>). <article-title>Autonomous helicopter aerobatics through apprenticeship learning</article-title>. <source>Int. J. Robot. Res</source>. <volume>29</volume>, <fpage>1608</fpage>&#x02013;<lpage>1639</lpage>. <pub-id pub-id-type="doi">10.1177/0278364910371999</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Adornetti</surname> <given-names>I.</given-names></name> <name><surname>Ferretti</surname> <given-names>F.</given-names></name></person-group> (<year>2015</year>). <article-title>The pragmatic foundations of communication: an action-oriented model of the origin of language</article-title>. <source>Theor. Histor. Sci</source>. <volume>11</volume>, <fpage>63</fpage>&#x02013;<lpage>80</lpage>. <pub-id pub-id-type="doi">10.12775/ths-2014-004</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Akgun</surname> <given-names>B.</given-names></name> <name><surname>Cakmak</surname> <given-names>M.</given-names></name> <name><surname>Yoo</surname> <given-names>J. W.</given-names></name> <name><surname>Thomaz</surname> <given-names>A. L.</given-names></name></person-group> (<year>2012</year>). <article-title>Trajectories and keyframes for kinesthetic teaching: a human-robot interaction perspective</article-title>, in <source>Proceedings of the Seventh Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI &#x00027;12</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>391</fpage>&#x02013;<lpage>398</lpage>. <pub-id pub-id-type="doi">10.1145/2157689.2157815</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Argall</surname> <given-names>B. D.</given-names></name> <name><surname>Browning</surname> <given-names>B.</given-names></name> <name><surname>Veloso</surname> <given-names>M.</given-names></name></person-group> (<year>2008</year>). <article-title>Learning robot motion control with demonstration and advice-operators</article-title>, in <source>2008 IEEE/RSJ International Conference on Intelligent Robots and Systems</source> (<publisher-loc>Nice</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>399</fpage>&#x02013;<lpage>404</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2008.4651020</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Argall</surname> <given-names>B. D.</given-names></name> <name><surname>Browning</surname> <given-names>B.</given-names></name> <name><surname>Veloso</surname> <given-names>M. M.</given-names></name></person-group> (<year>2011</year>). <article-title>Teacher feedback to scaffold and refine demonstrated motion primitives on a mobile robot</article-title>. <source>Robot. Auton. Syst</source>. <volume>59</volume>, <fpage>243</fpage>&#x02013;<lpage>255</lpage>. <pub-id pub-id-type="doi">10.1016/j.robot.2010.11.004</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Argall</surname> <given-names>B. D.</given-names></name> <name><surname>Chernova</surname> <given-names>S.</given-names></name> <name><surname>Veloso</surname> <given-names>M.</given-names></name> <name><surname>Browning</surname> <given-names>B.</given-names></name></person-group> (<year>2009</year>). <article-title>A survey of robot learning from demonstration</article-title>. <source>Robot. Auton. Syst</source>. <volume>57</volume>, <fpage>469</fpage>&#x02013;<lpage>483</lpage>. <pub-id pub-id-type="doi">10.1016/j.robot.2008.10.024</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Artzi</surname> <given-names>Y.</given-names></name> <name><surname>Zettlemoyer</surname> <given-names>L.</given-names></name></person-group> (<year>2013</year>). <article-title>Weakly supervised learning of semantic parsers for mapping instructions to actions</article-title>. <source>Trans. Assoc. Comput. Linguist</source>. <volume>1</volume>, <fpage>49</fpage>&#x02013;<lpage>62</lpage>. <pub-id pub-id-type="doi">10.1162/tacl_a_00209</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Atkeson</surname> <given-names>C. G.</given-names></name> <name><surname>Schaal</surname> <given-names>S.</given-names></name></person-group> (<year>1997</year>). <article-title>Learning tasks from a single demonstration</article-title>, in <source>Proceedings of International Conference on Robotics and Automation</source> (<publisher-loc>Albuquerque, NM</publisher-loc>), <fpage>1706</fpage>&#x02013;<lpage>1712</lpage>. <pub-id pub-id-type="doi">10.1109/ROBOT.1997.614389</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Aziz-Zadeh</surname> <given-names>L.</given-names></name> <name><surname>Wilson</surname> <given-names>S. M.</given-names></name> <name><surname>Rizzolatti</surname> <given-names>G.</given-names></name> <name><surname>Iacoboni</surname> <given-names>M.</given-names></name></person-group> (<year>2006</year>). <article-title>Congruent embodied representations for visually presented actions and linguistic phrases describing actions</article-title>. <source>Curr. Biol</source>. <volume>16</volume>, <fpage>1818</fpage>&#x02013;<lpage>1823</lpage>. <pub-id pub-id-type="doi">10.1016/j.cub.2006.07.060</pub-id><pub-id pub-id-type="pmid">16979559</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Barto</surname> <given-names>A. G.</given-names></name> <name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Anderson</surname> <given-names>C. W.</given-names></name></person-group> (<year>1983</year>). <article-title>Neuronlike adaptive elements that can solve difficult learning control problems</article-title>. <source>IEEE Trans. Syst. Man Cybernet</source>. <volume>13</volume>, <fpage>834</fpage>&#x02013;<lpage>846</lpage>. <pub-id pub-id-type="doi">10.1109/TSMC.1983.6313077</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Biele</surname> <given-names>G.</given-names></name> <name><surname>Rieskamp</surname> <given-names>J.</given-names></name> <name><surname>Krugel</surname> <given-names>L. K.</given-names></name> <name><surname>Heekeren</surname> <given-names>H. R.</given-names></name></person-group> (<year>2011</year>). <article-title>The neural basis of following advice</article-title>. <source>PLoS Biol</source>. <volume>9</volume>:<fpage>e1001089</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pbio.1001089</pub-id><pub-id pub-id-type="pmid">21713027</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Branavan</surname> <given-names>S. R. K.</given-names></name> <name><surname>Chen</surname> <given-names>H.</given-names></name> <name><surname>Zettlemoyer</surname> <given-names>L. S.</given-names></name> <name><surname>Barzilay</surname> <given-names>R.</given-names></name></person-group> (<year>2009</year>). <article-title>Reinforcement learning for mapping instructions to actions</article-title>, in <source>Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>82</fpage>&#x02013;<lpage>90</lpage>. <pub-id pub-id-type="doi">10.3115/1687878.1687892</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Branavan</surname> <given-names>S. R. K.</given-names></name> <name><surname>Zettlemoyer</surname> <given-names>L. S.</given-names></name> <name><surname>Barzilay</surname> <given-names>R.</given-names></name></person-group> (<year>2010</year>). <article-title>Reading between the lines: learning to map high-level instructions to commands</article-title>, in <source>Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL &#x00027;10</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>1268</fpage>&#x02013;<lpage>1277</lpage>.</citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Breazeal</surname> <given-names>C.</given-names></name> <name><surname>Thomaz</surname> <given-names>A. L.</given-names></name></person-group> (<year>2008</year>). <article-title>Learning from human teachers with socially guided exploration</article-title>, in <source>2008 IEEE International Conference on Robotics and Automation</source> (<publisher-loc>Pasadena, CA</publisher-loc>), <fpage>3539</fpage>&#x02013;<lpage>3544</lpage>. <pub-id pub-id-type="doi">10.1109/ROBOT.2008.4543752</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Butz</surname> <given-names>M. V.</given-names></name> <name><surname>Wilson</surname> <given-names>S. W.</given-names></name></person-group> (<year>2001</year>). <article-title>An algorithmic description of XCS</article-title>, in <source>Advances in Learning Classifier Systems: Third International Workshop, IWLCS 2000</source>, eds <person-group person-group-type="editor"><name><surname>Lanzi</surname> <given-names>P. L.</given-names></name> <name><surname>Stolzmann</surname> <given-names>W.</given-names></name> <name><surname>Wilson</surname> <given-names>S. W.</given-names></name></person-group> (<publisher-loc>Paris</publisher-loc>), <fpage>253</fpage>&#x02013;<lpage>272</lpage>. <pub-id pub-id-type="doi">10.1007/3-540-44640-0</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cederborg</surname> <given-names>T.</given-names></name> <name><surname>Grover</surname> <given-names>I.</given-names></name> <name><surname>Isbell</surname> <given-names>C. L.</given-names></name> <name><surname>Thomaz</surname> <given-names>A. L.</given-names></name></person-group> (<year>2015</year>). <article-title>Policy shaping with human teachers</article-title>, in <source>Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI&#x00027;15</source> (<publisher-loc>Buenos Aires</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>3366</fpage>&#x02013;<lpage>3372</lpage>.</citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cederborg</surname> <given-names>T.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name></person-group> (<year>2014</year>). <article-title>A social learning formalism for learners trying to figure out what a teacher wants them to do</article-title>. <source>Paladyn J. Behav. Robot</source>. <volume>5</volume>, <fpage>64</fpage>&#x02013;<lpage>99</lpage>. <pub-id pub-id-type="doi">10.2478/pjbr-2014-0005</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Celemin</surname> <given-names>C.</given-names></name> <name><surname>Maeda</surname> <given-names>G.</given-names></name> <name><surname>del Solar</surname> <given-names>J. R.</given-names></name> <name><surname>Peters</surname> <given-names>J.</given-names></name> <name><surname>Kober</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Reinforcement learning of motor skills using policy search and human corrective advice</article-title>. <source>Int. J. Robot. Res</source>. <volume>38</volume>, <fpage>1560</fpage>&#x02013;<lpage>1580</lpage>. <pub-id pub-id-type="doi">10.1177/0278364919871998</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Celemin</surname> <given-names>C.</given-names></name> <name><surname>Ruiz-Del-Solar</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>An interactive framework for learning continuous actions policies based on corrective feedback</article-title>. <source>J. Intell. Robot. Syst</source>. <volume>95</volume>, <fpage>77</fpage>&#x02013;<lpage>97</lpage>. <pub-id pub-id-type="doi">10.1007/s10846-018-0839-z</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>D. L.</given-names></name> <name><surname>Mooney</surname> <given-names>R. J.</given-names></name></person-group> (<year>2011</year>). <article-title>Learning to interpret natural language navigation instructions from observations</article-title>, in <source>Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, AAAI&#x00027;11</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>859</fpage>&#x02013;<lpage>865</lpage>.</citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chernova</surname> <given-names>S.</given-names></name> <name><surname>Thomaz</surname> <given-names>A. L.</given-names></name></person-group> (<year>2014</year>). <article-title>Robot learning from human teachers</article-title>. <source>Synthesis Lect. Artif. Intell. Mach. Learn</source>. <volume>8</volume>, <fpage>1</fpage>&#x02013;<lpage>121</lpage>. <pub-id pub-id-type="doi">10.2200/S00568ED1V01Y201402AIM028</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chernova</surname> <given-names>S.</given-names></name> <name><surname>Veloso</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>Interactive policy learning through confidence-based autonomy</article-title>. <source>J. Artif. Int. Res</source>. <volume>34</volume>, <fpage>1</fpage>&#x02013;<lpage>25</lpage>. <pub-id pub-id-type="doi">10.1613/jair.2584</pub-id></citation></ref>
<ref id="B23">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Christiano</surname> <given-names>P. F.</given-names></name> <name><surname>Leike</surname> <given-names>J.</given-names></name> <name><surname>Brown</surname> <given-names>T.</given-names></name> <name><surname>Martic</surname> <given-names>M.</given-names></name> <name><surname>Legg</surname> <given-names>S.</given-names></name> <name><surname>Amodei</surname> <given-names>D.</given-names></name></person-group> (<year>2017</year>). <article-title>Deep reinforcement learning from human preferences</article-title>, in <source>Advances in Neural Information Processing Systems</source>, eds <person-group person-group-type="editor"><name><surname>Von Luxburg</surname> <given-names>U.</given-names></name> <etal/></person-group>. (<publisher-loc>Long Beach, CA</publisher-loc>; <publisher-name>Neural Information Processing Systems Foundation, Inc. (NIPS)</publisher-name>), <fpage>4299</fpage>&#x02013;<lpage>4307</lpage>.</citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chu</surname> <given-names>V.</given-names></name> <name><surname>Fitzgerald</surname> <given-names>T.</given-names></name> <name><surname>Thomaz</surname> <given-names>A. L.</given-names></name></person-group> (<year>2016</year>). <article-title>Learning object affordances by leveraging the combination of human-guidance and self-exploration</article-title>, in <source>The Eleventh ACM/IEEE International Conference on Human Robot Interaction, HRI &#x00027;16</source> (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>IEEE Press</publisher-name>), <fpage>221</fpage>&#x02013;<lpage>228</lpage>. <pub-id pub-id-type="doi">10.1109/HRI.2016.7451755</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Clouse</surname> <given-names>J. A.</given-names></name> <name><surname>Utgoff</surname> <given-names>P. E.</given-names></name></person-group> (<year>1992</year>). <article-title>A teaching method for reinforcement learning</article-title>, in <source>Proceedings of the Ninth International Workshop on Machine Learning, ML &#x00027;92</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>), <fpage>92</fpage>&#x02013;<lpage>110</lpage>. <pub-id pub-id-type="doi">10.1016/B978-1-55860-247-2.50017-6</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cohen</surname> <given-names>P.</given-names></name> <name><surname>Feigenbaum</surname> <given-names>E. A.</given-names></name></person-group> (<year>1982</year>). <source>The Handbook of Artificial Intelligence</source>, <volume>Vol. 3</volume>. <publisher-loc>Los Altos, CA</publisher-loc>: <publisher-name>William Kaufmann &#x00026; HeurisTech Press</publisher-name>.</citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Colombetti</surname> <given-names>M.</given-names></name> <name><surname>Dorigo</surname> <given-names>M.</given-names></name> <name><surname>Borghi</surname> <given-names>G.</given-names></name></person-group> (<year>1996</year>). <article-title>Behavior analysis and training-a methodology for behavior engineering</article-title>. <source>IEEE Trans. Syst. Man Cybernet. B</source> <volume>26</volume>, <fpage>365</fpage>&#x02013;<lpage>380</lpage>. <pub-id pub-id-type="doi">10.1109/3477.499789</pub-id><pub-id pub-id-type="pmid">18263040</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Corballis</surname> <given-names>M. C.</given-names></name></person-group> (<year>2010</year>). <article-title>Mirror neurons and the evolution of language</article-title>. <source>Brain Lang</source>. <volume>112</volume>, <fpage>25</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1016/j.bandl.2009.02.002</pub-id><pub-id pub-id-type="pmid">19342089</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cruz</surname> <given-names>F.</given-names></name> <name><surname>Twiefel</surname> <given-names>J.</given-names></name> <name><surname>Magg</surname> <given-names>S.</given-names></name> <name><surname>Weber</surname> <given-names>C.</given-names></name> <name><surname>Wermter</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Interactive reinforcement learning through speech guidance in a domestic scenario</article-title>, in <source>2015 International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Killarney</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/IJCNN.2015.7280477</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cui</surname> <given-names>Y.</given-names></name> <name><surname>Niekum</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Active reward learning from critiques</article-title>, in <source>2018 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>Brisbane, QLD</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6907</fpage>&#x02013;<lpage>6914</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2018.8460854</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dempster</surname> <given-names>A. P.</given-names></name> <name><surname>Laird</surname> <given-names>N. M.</given-names></name> <name><surname>Rubin</surname> <given-names>D. B.</given-names></name></person-group> (<year>1977</year>). <article-title>Maximum likelihood from incomplete data via the EM algorithm</article-title>. <source>J. R. Stat. Soc. Ser. B</source> <volume>39</volume>, <fpage>1</fpage>&#x02013;<lpage>38</lpage>. <pub-id pub-id-type="doi">10.1111/j.2517-6161.1977.tb01600.x</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dillmann</surname> <given-names>R.</given-names></name> <name><surname>Rogalla</surname> <given-names>O.</given-names></name> <name><surname>Ehrenmann</surname> <given-names>M.</given-names></name> <name><surname>Z&#x000F6;llner</surname> <given-names>R.</given-names></name> <name><surname>Bordegoni</surname> <given-names>M.</given-names></name></person-group> (<year>2000</year>). <article-title>Learning robot behaviour and skills based on human demonstration and advice: the machine learning paradigm</article-title>, in <source>Robotics Research</source>, eds <person-group person-group-type="editor"><name><surname>Hollerbach</surname> <given-names>J. M.</given-names></name> <name><surname>Koditscheck</surname> <given-names>D. E.</given-names></name></person-group> (<publisher-loc>Snowbird, UT</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>229</fpage>&#x02013;<lpage>238</lpage>. <pub-id pub-id-type="doi">10.1007/978-1-4471-0765-1_28</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Domingos</surname> <given-names>P.</given-names></name> <name><surname>Lowd</surname> <given-names>D.</given-names></name> <name><surname>Kok</surname> <given-names>S.</given-names></name> <name><surname>Nath</surname> <given-names>A.</given-names></name> <name><surname>Poon</surname> <given-names>H.</given-names></name> <name><surname>Richardson</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Unifying logical and statistical AI</article-title>, in <source>2016 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS)</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1145/2933575.2935321</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dorigo</surname> <given-names>M.</given-names></name> <name><surname>Colombetti</surname> <given-names>M.</given-names></name></person-group> (<year>1994</year>). <article-title>Robot shaping: developing autonomous agents through learning</article-title>. <source>Artif. Intell</source>. <volume>71</volume>, <fpage>321</fpage>&#x02013;<lpage>370</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(94)90047-7</pub-id></citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Duvallet</surname> <given-names>F.</given-names></name> <name><surname>Kollar</surname> <given-names>T.</given-names></name> <name><surname>Stentz</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Imitation learning for natural language direction following through unknown environments</article-title>, in <source>2013 IEEE International Conference on Robotics and Automation</source> (<publisher-loc>Karlsruhe</publisher-loc>), <fpage>1047</fpage>&#x02013;<lpage>1053</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2013.6630702</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Garcia</surname> <given-names>J.</given-names></name> <name><surname>Fernandez</surname> <given-names>F.</given-names></name></person-group> (<year>2015</year>). <article-title>A comprehensive survey on safe reinforcement learning</article-title>. <source>J. Mach. Learn. Res</source>. <volume>16</volume>, <fpage>1437</fpage>&#x02013;<lpage>1480</lpage>. <pub-id pub-id-type="doi">10.5555/2789272.2886795</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Griffith</surname> <given-names>S.</given-names></name> <name><surname>Subramanian</surname> <given-names>K.</given-names></name> <name><surname>Scholz</surname> <given-names>J.</given-names></name> <name><surname>Isbell</surname> <given-names>C. L.</given-names></name> <name><surname>Thomaz</surname> <given-names>A.</given-names></name></person-group> (<year>2013</year>). <article-title>Policy shaping: integrating human feedback with reinforcement learning</article-title>, in <source>Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS&#x00027;13</source> (<publisher-loc>Lake Tahoe, CA</publisher-loc>: <publisher-name>Curran Associates Inc.</publisher-name>), <fpage>2625</fpage>&#x02013;<lpage>2633</lpage>.</citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grizou</surname> <given-names>J.</given-names></name> <name><surname>Iturrate</surname> <given-names>I.</given-names></name> <name><surname>Montesano</surname> <given-names>L.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name> <name><surname>Lopes</surname> <given-names>M.</given-names></name></person-group> (<year>2014a</year>). <article-title>Calibration-free BCI based control</article-title>, in <source>Twenty-Eighth AAAI Conference on Artificial Intelligence</source> (<publisher-loc>Qu&#x000E9;bec City, QC</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation></ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grizou</surname> <given-names>J.</given-names></name> <name><surname>Iturrate</surname> <given-names>I.</given-names></name> <name><surname>Montesano</surname> <given-names>L.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name> <name><surname>Lopes</surname> <given-names>M.</given-names></name></person-group> (<year>2014b</year>). <article-title>Interactive learning from unlabeled instructions</article-title>, in <source>Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI&#x00027;14</source> (<publisher-loc>Arlington, VA</publisher-loc>: <publisher-name>AUAI Press</publisher-name>), <fpage>290</fpage>&#x02013;<lpage>299</lpage>.</citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grizou</surname> <given-names>J.</given-names></name> <name><surname>Lopes</surname> <given-names>M.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P. Y.</given-names></name></person-group> (<year>2013</year>). <article-title>Robot learning simultaneously a task and how to interpret human instructions</article-title>, in <source>2013 IEEE Third Joint International Conference on Development and Learning and Epigenetic Robotics (ICDL)</source> (<publisher-loc>Osaka</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/DevLrn.2013.6652523</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gullapalli</surname> <given-names>V.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>1992</year>). <article-title>Shaping as a method for accelerating reinforcement learning</article-title>, in <source>Proceedings of the 1992 IEEE International Symposium on Intelligent Control</source> (<publisher-loc>Glasgow</publisher-loc>), <fpage>554</fpage>&#x02013;<lpage>559</lpage>. <pub-id pub-id-type="doi">10.1109/ISIC.1992.225046</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Harmon</surname> <given-names>M. E.</given-names></name> <name><surname>Baird</surname> <given-names>L. C.</given-names></name> <name><surname>Klopf</surname> <given-names>A. H.</given-names></name></person-group> (<year>1994</year>). <article-title>Advantage updating applied to a differential game</article-title>, in <source>Proceedings of the 7th International Conference on Neural Information Processing Systems, NIPS&#x00027;94</source> (<publisher-loc>Cambridge, MA</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>353</fpage>&#x02013;<lpage>360</lpage>.</citation></ref>
<ref id="B43">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hayes-Roth</surname> <given-names>F.</given-names></name> <name><surname>Klahr</surname> <given-names>P.</given-names></name> <name><surname>Mostow</surname> <given-names>D. J.</given-names></name></person-group> (<year>1980</year>). <source>Knowledge Acquisition, Knowledge Programming, and Knowledge Refinement</source>. <publisher-loc>Santa Monica, CA</publisher-loc>: <publisher-name>Rand Corporation</publisher-name>.</citation></ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hayes-Roth</surname> <given-names>F.</given-names></name> <name><surname>Klahr</surname> <given-names>P.</given-names></name> <name><surname>Mostow</surname> <given-names>D. J.</given-names></name></person-group> (<year>1981</year>). <article-title>Advice-taking and knowledge refinement: an iterative view of skill acquisition</article-title>. <source>Cognit Skills Acquisit</source>. <fpage>231</fpage>&#x02013;<lpage>253</lpage>.</citation></ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ho</surname> <given-names>M. K.</given-names></name> <name><surname>Littman</surname> <given-names>M. L.</given-names></name> <name><surname>Cushman</surname> <given-names>F.</given-names></name> <name><surname>Austerweil</surname> <given-names>J. L.</given-names></name></person-group> (<year>2015</year>). <article-title>Teaching with rewards and punishments: reinforcement or communication?</article-title> in <source>Proceedings of the 37th Annual Meeting of the Cognitive Science Society</source> (<publisher-loc>Pasadena, CA</publisher-loc>).</citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ho</surname> <given-names>M. K.</given-names></name> <name><surname>MacGlashan</surname> <given-names>J.</given-names></name> <name><surname>Littman</surname> <given-names>M. L.</given-names></name> <name><surname>Cushman</surname> <given-names>F.</given-names></name></person-group> (<year>2017</year>). <article-title>Social is special: a normative framework for teaching with and learning from evaluative feedback</article-title>. <source>Cognition</source> <volume>167</volume>, <fpage>91</fpage>&#x02013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1016/j.cognition.2017.03.006</pub-id><pub-id pub-id-type="pmid">28341268</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Isbell</surname> <given-names>C.</given-names></name> <name><surname>Shelton</surname> <given-names>C. R.</given-names></name> <name><surname>Kearns</surname> <given-names>M.</given-names></name> <name><surname>Singh</surname> <given-names>S.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2001</year>). <article-title>A social reinforcement learning agent</article-title>, in <source>Proceedings of the Fifth International Conference on Autonomous Agents, AGENTS &#x00027;01</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>377</fpage>&#x02013;<lpage>384</lpage>. <pub-id pub-id-type="doi">10.1145/375735.376334</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Judah</surname> <given-names>K.</given-names></name> <name><surname>Fern</surname> <given-names>A.</given-names></name> <name><surname>Tadepalli</surname> <given-names>P.</given-names></name> <name><surname>Goetschalckx</surname> <given-names>R.</given-names></name></person-group> (<year>2014</year>). <article-title>Imitation learning with demonstrations and shaping rewards</article-title>, in <source>Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI&#x00027;14</source> (<publisher-loc>Qu&#x000E9;bec City, QC</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>1890</fpage>&#x02013;<lpage>1896</lpage>.</citation></ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Judah</surname> <given-names>K.</given-names></name> <name><surname>Roy</surname> <given-names>S.</given-names></name> <name><surname>Fern</surname> <given-names>A.</given-names></name> <name><surname>Dietterich</surname> <given-names>T. G.</given-names></name></person-group> (<year>2010</year>). <article-title>Reinforcement learning via practice and critique advice</article-title>, in <source>Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI&#x00027;10</source> (<publisher-loc>Atlanta, GA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>481</fpage>&#x02013;<lpage>486</lpage>.</citation></ref>
<ref id="B50">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaplan</surname> <given-names>F.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name> <name><surname>Kubinyi</surname> <given-names>E.</given-names></name> <name><surname>Mikl&#x000F3;si</surname> <given-names>A.</given-names></name></person-group> (<year>2002</year>). <article-title>Robotic clicker training</article-title>. <source>Robot. Auton. Syst</source>. <volume>38</volume>, <fpage>197</fpage>&#x02013;<lpage>206</lpage>. <pub-id pub-id-type="doi">10.1016/S0921-8890(02)00168-9</pub-id></citation></ref>
<ref id="B51">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kate</surname> <given-names>R. J.</given-names></name> <name><surname>Mooney</surname> <given-names>R. J.</given-names></name></person-group> (<year>2006</year>). <article-title>Using string-kernels for learning semantic parsers</article-title>, in <source>Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>913</fpage>&#x02013;<lpage>920</lpage>. <pub-id pub-id-type="doi">10.3115/1220175.1220290</pub-id></citation></ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>E. S.</given-names></name> <name><surname>Scassellati</surname> <given-names>B.</given-names></name></person-group> (<year>2007</year>). <article-title>Learning to refine behavior using prosodic feedback</article-title>, in <source>2007 IEEE 6th International Conference on Development and Learning</source> (<publisher-loc>London</publisher-loc>), <fpage>205</fpage>&#x02013;<lpage>210</lpage>. <pub-id pub-id-type="doi">10.1109/DEVLRN.2007.4354072</pub-id></citation></ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Knox</surname> <given-names>W. B.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2009</year>). <article-title>Interactively shaping agents via human reinforcement: the TAMER framework</article-title>, in <source>Proceedings of the Fifth International Conference on Knowledge Capture, K-CAP &#x00027;09</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>9</fpage>&#x02013;<lpage>16</lpage>. <pub-id pub-id-type="doi">10.1145/1597735.1597738</pub-id></citation></ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Knox</surname> <given-names>W. B.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2010</year>). <article-title>Combining manual feedback with subsequent MDP reward signals for reinforcement learning</article-title>, in <source>Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems, AAMAS &#x00027;10</source> (<publisher-loc>Richland, SC</publisher-loc>: <publisher-name>International Foundation for Autonomous Agents and Multiagent Systems</publisher-name>), <fpage>5</fpage>&#x02013;<lpage>12</lpage>.</citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Knox</surname> <given-names>W. B.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2011a</year>). <article-title>Augmenting reinforcement learning with human feedback</article-title>, in <source>ICML 2011 Workshop on New Developments in Imitation Learning</source> (<publisher-loc>Bellevue, WA</publisher-loc>).</citation></ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Knox</surname> <given-names>W. B.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2011b</year>). <article-title>Understanding human teaching modalities in reinforcement learning environments: a preliminary report</article-title>, in <source>IJCAI 2011 Workshop on Agents Learning Interactively from Human Teachers (ALIHT)</source> (<publisher-loc>Barcelona</publisher-loc>).</citation></ref>
<ref id="B57">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Knox</surname> <given-names>W. B.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2012a</year>). <article-title>Reinforcement learning from human reward: discounting in episodic tasks</article-title>, in <source>2012 IEEE RO-MAN: The 21st IEEE International Symposium on Robot and Human Interactive Communication</source> (<publisher-loc>Paris</publisher-loc>), <fpage>878</fpage>&#x02013;<lpage>885</lpage>. <pub-id pub-id-type="doi">10.1109/ROMAN.2012.6343862</pub-id></citation></ref>
<ref id="B58">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Knox</surname> <given-names>W. B.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name></person-group> (<year>2012b</year>). <article-title>Reinforcement learning from simultaneous human and MDP reward</article-title>, in <source>Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems - Volume 1, AAMAS &#x00027;12</source> (<publisher-loc>Richland, SC</publisher-loc>: <publisher-name>International Foundation for Autonomous Agents and Multiagent Systems</publisher-name>), <fpage>475</fpage>&#x02013;<lpage>482</lpage>.</citation></ref>
<ref id="B59">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Knox</surname> <given-names>W. B.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2013</year>). <article-title>Training a robot via human feedback: a case study</article-title>, in <source>Proceedings of the 5th International Conference on Social Robotics - Volume 8239, ICSR 2013</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer-Verlag</publisher-name>), <fpage>460</fpage>&#x02013;<lpage>470</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-02675-6_46</pub-id></citation></ref>
<ref id="B60">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kober</surname> <given-names>J.</given-names></name> <name><surname>Bagnell</surname> <given-names>J. A.</given-names></name> <name><surname>Peters</surname> <given-names>J.</given-names></name></person-group> (<year>2013</year>). <article-title>Reinforcement learning in robotics: a survey</article-title>. <source>Int. J. Robot. Res</source>. <volume>32</volume>, <fpage>1238</fpage>&#x02013;<lpage>1274</lpage>. <pub-id pub-id-type="doi">10.1177/0278364913495721</pub-id></citation></ref>
<ref id="B61">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kohler</surname> <given-names>E.</given-names></name> <name><surname>Keysers</surname> <given-names>C.</given-names></name> <name><surname>Umilt&#x000E0;</surname> <given-names>M. A.</given-names></name> <name><surname>Fogassi</surname> <given-names>L.</given-names></name> <name><surname>Gallese</surname> <given-names>V.</given-names></name> <name><surname>Rizzolatti</surname> <given-names>G.</given-names></name></person-group> (<year>2002</year>). <article-title>Hearing sounds, understanding actions: action representation in mirror neurons</article-title>. <source>Science</source> <volume>297</volume>, <fpage>846</fpage>&#x02013;<lpage>848</lpage>. <pub-id pub-id-type="doi">10.1126/science.1070311</pub-id><pub-id pub-id-type="pmid">12161656</pub-id></citation></ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krening</surname> <given-names>S.</given-names></name> <name><surname>Harrison</surname> <given-names>B.</given-names></name> <name><surname>Feigh</surname> <given-names>K. M.</given-names></name> <name><surname>Isbell</surname> <given-names>C. L.</given-names></name> <name><surname>Riedl</surname> <given-names>M.</given-names></name> <name><surname>Thomaz</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>Learning from explanations using sentiment and advice in RL</article-title>. <source>IEEE Trans. Cogn. Dev. Syst</source>. <volume>9</volume>, <fpage>44</fpage>&#x02013;<lpage>55</lpage>. <pub-id pub-id-type="doi">10.1109/TCDS.2016.2628365</pub-id></citation></ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kuhlmann</surname> <given-names>G.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name> <name><surname>Mooney</surname> <given-names>R. J.</given-names></name> <name><surname>Shavlik</surname> <given-names>J. W.</given-names></name></person-group> (<year>2004</year>). <article-title>Guiding a reinforcement learner with natural language advice: initial results in RoboCup soccer</article-title>, in <source>The AAAI-2004 Workshop on Supervisory Control of Learning and Adaptive Systems</source> (<publisher-loc>San Jose, CA</publisher-loc>).</citation></ref>
<ref id="B64">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Leon</surname> <given-names>A.</given-names></name> <name><surname>Morales</surname> <given-names>E. F.</given-names></name> <name><surname>Altamirano</surname> <given-names>L.</given-names></name> <name><surname>Ruiz</surname> <given-names>J. R.</given-names></name></person-group> (<year>2011</year>). <article-title>Teaching a robot to perform task through imitation and on-line feedback</article-title>, in <source>Proceedings of the 16th Iberoamerican Congress Conference on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, CIARP&#x00027;11</source> (<publisher-loc>Berlin; Heidelberg</publisher-loc>: <publisher-name>Springer-Verlag</publisher-name>), <fpage>549</fpage>&#x02013;<lpage>556</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-25085-9_65</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>L.-J.</given-names></name></person-group> (<year>1992</year>). <article-title>Self-improving reactive agents based on reinforcement learning, planning and teaching</article-title>. <source>Mach. Learn</source>. <volume>8</volume>, <fpage>293</fpage>&#x02013;<lpage>321</lpage>. <pub-id pub-id-type="doi">10.1007/BF00992699</pub-id></citation></ref>
<ref id="B66">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lockerd</surname> <given-names>A.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2004</year>). <article-title>Tutelage and socially guided robot learning</article-title>, in <source>2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source> (<publisher-loc>Sendai</publisher-loc>), <fpage>3475</fpage>&#x02013;<lpage>3480</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2004.1389954</pub-id></citation></ref>
<ref id="B67">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Loftin</surname> <given-names>R.</given-names></name> <name><surname>MacGlashan</surname> <given-names>J.</given-names></name> <name><surname>Peng</surname> <given-names>B.</given-names></name> <name><surname>Taylor</surname> <given-names>M. E.</given-names></name> <name><surname>Littman</surname> <given-names>M. L.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>A strategy-aware technique for learning behaviors from discrete human feedback</article-title>, in <source>Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, AAAI&#x00027;14</source> (<publisher-loc>Qu&#x000E9;bec City, QC</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>937</fpage>&#x02013;<lpage>943</lpage>.</citation></ref>
<ref id="B68">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Loftin</surname> <given-names>R.</given-names></name> <name><surname>Peng</surname> <given-names>B.</given-names></name> <name><surname>Macglashan</surname> <given-names>J.</given-names></name> <name><surname>Littman</surname> <given-names>M. L.</given-names></name> <name><surname>Taylor</surname> <given-names>M. E.</given-names></name> <name><surname>Huang</surname> <given-names>J.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning</article-title>. <source>Auton. Agents Multiagent Syst</source>. <volume>30</volume>, <fpage>30</fpage>&#x02013;<lpage>59</lpage>. <pub-id pub-id-type="doi">10.1007/s10458-015-9283-7</pub-id></citation></ref>
<ref id="B69">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lopes</surname> <given-names>M.</given-names></name> <name><surname>Cederborg</surname> <given-names>T.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P. Y.</given-names></name></person-group> (<year>2011</year>). <article-title>Simultaneous acquisition of task and feedback models</article-title>, in <source>2011 IEEE International Conference on Development and Learning (ICDL)</source> (<publisher-loc>Frankfurt am Main</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1109/DEVLRN.2011.6037359</pub-id></citation></ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lozano-P&#x000E9;rez</surname> <given-names>T.</given-names></name></person-group> (<year>1983</year>). <article-title>Robot programming</article-title>. <source>Proc. IEEE</source> <volume>71</volume>, <fpage>821</fpage>&#x02013;<lpage>841</lpage>. <pub-id pub-id-type="doi">10.1109/PROC.1983.12681</pub-id></citation></ref>
<ref id="B71">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>MacGlashan</surname> <given-names>J.</given-names></name> <name><surname>Babes-Vroman</surname> <given-names>M.</given-names></name> <name><surname>DesJardins</surname> <given-names>M.</given-names></name> <name><surname>Littman</surname> <given-names>M.</given-names></name> <name><surname>Muresan</surname> <given-names>S.</given-names></name> <name><surname>Squire</surname> <given-names>S.</given-names></name></person-group> (<year>2014a</year>). <source>Translating English to Reward Functions</source>. Technical Report CS14-01, Computer Science Department, Brown University.</citation></ref>
<ref id="B72">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>MacGlashan</surname> <given-names>J.</given-names></name> <name><surname>Ho</surname> <given-names>M. K.</given-names></name> <name><surname>Loftin</surname> <given-names>R.</given-names></name> <name><surname>Peng</surname> <given-names>B.</given-names></name> <name><surname>Wang</surname> <given-names>G.</given-names></name> <name><surname>Roberts</surname> <given-names>D. L.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>Interactive learning from policy-dependent human feedback</article-title>, in <source>Proceedings of the 34th International Conference on Machine Learning</source> (<publisher-loc>Sydney, NSW</publisher-loc>), <fpage>2285</fpage>&#x02013;<lpage>2294</lpage>.</citation></ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>MacGlashan</surname> <given-names>J.</given-names></name> <name><surname>Littman</surname> <given-names>M.</given-names></name> <name><surname>Loftin</surname> <given-names>R.</given-names></name> <name><surname>Peng</surname> <given-names>B.</given-names></name> <name><surname>Roberts</surname> <given-names>D.</given-names></name> <name><surname>Taylor</surname> <given-names>M. E.</given-names></name></person-group> (<year>2014b</year>). <article-title>Training an agent to ground commands with reward and punishment</article-title>, in <source>Proceedings of the AAAI Machine Learning for Interactive Systems Workshop</source> (<publisher-loc>Qu&#x000E9;bec City, QC</publisher-loc>).</citation></ref>
<ref id="B74">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maclin</surname> <given-names>R.</given-names></name> <name><surname>Shavlik</surname> <given-names>J.</given-names></name> <name><surname>Torrey</surname> <given-names>L.</given-names></name> <name><surname>Walker</surname> <given-names>T.</given-names></name> <name><surname>Wild</surname> <given-names>E.</given-names></name></person-group> (<year>2005a</year>). <article-title>Giving advice about preferred actions to reinforcement learners via knowledge-based kernel regression</article-title>, in <source>Proceedings of the 20th National Conference on Artificial Intelligence - Volume 2, AAAI&#x00027;05</source> (<publisher-loc>Pittsburgh, PA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>819</fpage>&#x02013;<lpage>824</lpage>.</citation></ref>
<ref id="B75">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Maclin</surname> <given-names>R.</given-names></name> <name><surname>Shavlik</surname> <given-names>J.</given-names></name> <name><surname>Walker</surname> <given-names>T.</given-names></name> <name><surname>Torrey</surname> <given-names>L.</given-names></name></person-group> (<year>2005b</year>). <article-title>Knowledge-based support-vector regression for reinforcement learning</article-title>, in <source>IJCAI 2005 Workshop on Reasoning, Representation, and Learning in Computer Games</source> (<publisher-loc>Edinburgh</publisher-loc>), <fpage>61</fpage>.</citation></ref>
<ref id="B76">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Maclin</surname> <given-names>R.</given-names></name> <name><surname>Shavlik</surname> <given-names>J. W.</given-names></name></person-group> (<year>1996</year>). <article-title>Creating advice-taking reinforcement learners</article-title>. <source>Mach. Learn</source>. <volume>22</volume>, <fpage>251</fpage>&#x02013;<lpage>281</lpage>. <pub-id pub-id-type="doi">10.1007/BF00114730</pub-id></citation></ref>
<ref id="B77">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mahadevan</surname> <given-names>S.</given-names></name> <name><surname>Connell</surname> <given-names>J.</given-names></name></person-group> (<year>1992</year>). <article-title>Automatic programming of behavior-based robots using reinforcement learning</article-title>. <source>Artif. Intell</source>. <volume>55</volume>, <fpage>311</fpage>&#x02013;<lpage>365</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(92)90058-6</pub-id></citation></ref>
<ref id="B78">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mangasarian</surname> <given-names>O. L.</given-names></name> <name><surname>Shavlik</surname> <given-names>J. W.</given-names></name> <name><surname>Wild</surname> <given-names>E. W.</given-names></name></person-group> (<year>2004</year>). <article-title>Knowledge-based kernel approximation</article-title>. <source>J. Mach. Learn. Res</source>. <volume>5</volume>, <fpage>1127</fpage>&#x02013;<lpage>1141</lpage>. <pub-id pub-id-type="doi">10.5555/1005332.1044697</pub-id></citation></ref>
<ref id="B79">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mataric</surname> <given-names>M. J.</given-names></name></person-group> (<year>1994</year>). <article-title>Reward functions for accelerated learning</article-title>, in <source>Proceedings of the Eleventh International Conference on Machine Learning</source> (<publisher-loc>New Brunswick, NJ</publisher-loc>), <fpage>181</fpage>&#x02013;<lpage>189</lpage>. <pub-id pub-id-type="doi">10.1016/B978-1-55860-335-6.50030-1</pub-id></citation></ref>
<ref id="B80">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mathewson</surname> <given-names>K. W.</given-names></name> <name><surname>Pilarski</surname> <given-names>P. M.</given-names></name></person-group> (<year>2016</year>). <article-title>Simultaneous control and human feedback in the training of a robotic agent with actor-critic reinforcement learning</article-title>. <source>arXiv [Preprint]. arXiv:1606.06979</source>.</citation></ref>
<ref id="B81">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Matuszek</surname> <given-names>C.</given-names></name> <name><surname>Herbst</surname> <given-names>E.</given-names></name> <name><surname>Zettlemoyer</surname> <given-names>L.</given-names></name> <name><surname>Fox</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>Learning to parse natural language commands to a robot control system</article-title>, in <source>Experimental Robotics: The 13th International Symposium on Experimental Robotics</source>, eds <person-group person-group-type="editor"><name><surname>Desai</surname> <given-names>J. P.</given-names></name> <name><surname>Dudek</surname> <given-names>G.</given-names></name> <name><surname>Khatib</surname> <given-names>O.</given-names></name> <name><surname>Kumar</surname> <given-names>V.</given-names></name></person-group> (<publisher-loc>Heidelberg</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>403</fpage>&#x02013;<lpage>415</lpage> <pub-id pub-id-type="doi">10.1007/978-3-319-00065-7</pub-id></citation></ref>
<ref id="B82">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>McCarthy</surname> <given-names>J.</given-names></name></person-group> (<year>1959</year>). <article-title>Programs with common sense</article-title>, in <source>Proceedings of the Teddington Conference on the Mechanization of Thought Processes</source> (<publisher-loc>London</publisher-loc>: <publisher-name>Her Majesty&#x00027;s Stationery Office</publisher-name>), <fpage>75</fpage>&#x02013;<lpage>91</lpage>.</citation></ref>
<ref id="B83">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mooney</surname> <given-names>R. J.</given-names></name></person-group> (<year>2008</year>). <article-title>Learning to connect language and perception</article-title>, in <source>Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3, AAAI&#x00027;08</source> (<publisher-loc>Chicago, IL</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>1598</fpage>&#x02013;<lpage>1601</lpage>.</citation></ref>
<ref id="B84">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Najar</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <source>Shaping robot behaviour with unlabeled human instructions</source> (Ph.D. thesis). <publisher-loc>Paris</publisher-loc>: <publisher-name>University Paris VI</publisher-name>.</citation></ref>
<ref id="B85">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Najar</surname> <given-names>A.</given-names></name> <name><surname>Bonnet</surname> <given-names>E.</given-names></name> <name><surname>Bahrami</surname> <given-names>B.</given-names></name> <name><surname>Palminteri</surname> <given-names>S.</given-names></name></person-group> (<year>2020a</year>). <article-title>The actions of others act as a pseudo-reward to drive imitation in the context of social reinforcement learning</article-title>. <source>PLoS Biol</source>. <volume>18</volume>:<fpage>e3001028</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pbio.3001028</pub-id><pub-id pub-id-type="pmid">33290387</pub-id></citation></ref>
<ref id="B86">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Najar</surname> <given-names>A.</given-names></name> <name><surname>Chetouani</surname> <given-names>M.</given-names></name></person-group> (<year>2020</year>). <article-title>Reinforcement learning with human advice: a survey</article-title>. <source>arXiv [Preprint]. arXiv:2005.11016</source>.</citation></ref>
<ref id="B87">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Najar</surname> <given-names>A.</given-names></name> <name><surname>Sigaud</surname> <given-names>O.</given-names></name> <name><surname>Chetouani</surname> <given-names>M.</given-names></name></person-group> (<year>2015a</year>). <article-title>Social-task learning for HRI</article-title>, in <source>Social Robotics: 7th International Conference, ICSR 2015</source>, eds <person-group person-group-type="editor"><name><surname>Tapus</surname> <given-names>A.</given-names></name> <name><surname>Andr&#x000E9;</surname> <given-names>E.</given-names></name> <name><surname>Martin</surname> <given-names>J. C.</given-names></name> <name><surname>Ferland</surname> <given-names>F.</given-names></name> <name><surname>Ammi</surname> <given-names>M.</given-names></name></person-group> (<publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>472</fpage>&#x02013;<lpage>481</lpage> <pub-id pub-id-type="doi">10.1007/978-3-319-25554-5</pub-id></citation></ref>
<ref id="B88">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Najar</surname> <given-names>A.</given-names></name> <name><surname>Sigaud</surname> <given-names>O.</given-names></name> <name><surname>Chetouani</surname> <given-names>M.</given-names></name></person-group> (<year>2015b</year>). <article-title>Socially guided XCS: using teaching signals to boost learning</article-title>, in <source>Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO Companion &#x00027;15</source> (<publisher-loc>Madrid</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>1021</fpage>&#x02013;<lpage>1028</lpage>. <pub-id pub-id-type="doi">10.1145/2739482.2768452</pub-id></citation></ref>
<ref id="B89">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Najar</surname> <given-names>A.</given-names></name> <name><surname>Sigaud</surname> <given-names>O.</given-names></name> <name><surname>Chetouani</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>Training a robot with evaluative feedback and unlabeled guidance signals</article-title>, in <source>2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN)</source> (<publisher-loc>New York, NY</publisher-loc>), <fpage>261</fpage>&#x02013;<lpage>266</lpage>. <pub-id pub-id-type="doi">10.1109/ROMAN.2016.7745140</pub-id></citation></ref>
<ref id="B90">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Najar</surname> <given-names>A.</given-names></name> <name><surname>Sigaud</surname> <given-names>O.</given-names></name> <name><surname>Chetouani</surname> <given-names>M.</given-names></name></person-group> (<year>2020b</year>). <article-title>Interactively shaping robot behaviour with unlabeled human instructions</article-title>. <source>Auton. Agents Multiagent Syst</source>. <volume>34</volume>:<fpage>35</fpage>. <pub-id pub-id-type="doi">10.1007/s10458-020-09459-6</pub-id></citation></ref>
<ref id="B91">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ng</surname> <given-names>A. Y.</given-names></name> <name><surname>Harada</surname> <given-names>D.</given-names></name> <name><surname>Russell</surname> <given-names>S. J.</given-names></name></person-group> (<year>1999</year>). <article-title>Policy invariance under reward transformations: theory and application to reward shaping</article-title>, in <source>Proceedings of the Sixteenth International Conference on Machine Learning, ICML &#x00027;99</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>), <fpage>278</fpage>&#x02013;<lpage>287</lpage>.</citation></ref>
<ref id="B92">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ng</surname> <given-names>A. Y.</given-names></name> <name><surname>Russell</surname> <given-names>S. J.</given-names></name></person-group> (<year>2000</year>). <article-title>Algorithms for inverse reinforcement learning</article-title>, in <source>Proceedings of the Seventeenth International Conference on Machine Learning, ICML &#x00027;00</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>), <fpage>663</fpage>&#x02013;<lpage>670</lpage>.</citation></ref>
<ref id="B93">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nicolescu</surname> <given-names>M. N.</given-names></name> <name><surname>Mataric</surname> <given-names>M. J.</given-names></name></person-group> (<year>2003</year>). <article-title>Natural methods for robot task learning: instructive demonstrations, generalization and practice</article-title>, in <source>Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, AAMAS &#x00027;03</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>241</fpage>&#x02013;<lpage>248</lpage>. <pub-id pub-id-type="doi">10.1145/860575.860614</pub-id></citation></ref>
<ref id="B94">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Olsson</surname> <given-names>A.</given-names></name> <name><surname>Knapska</surname> <given-names>E.</given-names></name> <name><surname>Lindstr&#x000F6;m</surname> <given-names>B.</given-names></name></person-group> (<year>2020</year>). <article-title>The neural and computational systems of social learning</article-title>. <source>Nat. Rev. Neurosci</source>. <volume>21</volume>, <fpage>197</fpage>&#x02013;<lpage>212</lpage>. <pub-id pub-id-type="doi">10.1038/s41583-020-0276-4</pub-id><pub-id pub-id-type="pmid">32221497</pub-id></citation></ref>
<ref id="B95">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pal&#x000E9;ologue</surname> <given-names>V.</given-names></name> <name><surname>Martin</surname> <given-names>J.</given-names></name> <name><surname>Pandey</surname> <given-names>A. K.</given-names></name> <name><surname>Chetouani</surname> <given-names>M.</given-names></name></person-group> (<year>2018</year>). <article-title>Semantic-based interaction for teaching robot behavior compositions using spoken language</article-title>, in <source>Social Robotics - 10th International Conference, ICSR 2018</source> (<publisher-loc>Qingdao</publisher-loc>), <fpage>421</fpage>&#x02013;<lpage>430</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-05204-1_41</pub-id></citation></ref>
<ref id="B96">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pradyot</surname> <given-names>K. V. N.</given-names></name> <name><surname>Manimaran</surname> <given-names>S. S.</given-names></name> <name><surname>Ravindran</surname> <given-names>B.</given-names></name></person-group> (<year>2012a</year>). <article-title>Instructing a reinforcement learner</article-title>, in <source>Proceedings of the Twenty-Fifth International Florida Artificial Intelligence Research Society Conference</source> (<publisher-loc>Marco Island, FL</publisher-loc>), <fpage>23</fpage>&#x02013;<lpage>25</lpage>.</citation></ref>
<ref id="B97">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pradyot</surname> <given-names>K. V. N.</given-names></name> <name><surname>Manimaran</surname> <given-names>S. S.</given-names></name> <name><surname>Ravindran</surname> <given-names>B.</given-names></name> <name><surname>Natarajan</surname> <given-names>S.</given-names></name></person-group> (<year>2012b</year>). <article-title>Integrating human instructions and reinforcement learners: an SRL approach</article-title>, in <source>Proceedings of the UAI workshop on Statistical Relational AI</source> (<publisher-loc>Catalina Island, CA</publisher-loc>).</citation></ref>
<ref id="B98">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pradyot</surname> <given-names>K. V. N.</given-names></name> <name><surname>Ravindran</surname> <given-names>B.</given-names></name></person-group> (<year>2011</year>). <article-title>Beyond rewards: learning from richer supervision</article-title>, in <source>Proceedings of the 9th European Workshop on Reinforcement Learning</source> (<publisher-loc>Athens</publisher-loc>).</citation></ref>
<ref id="B99">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Randl&#x000F8;v</surname> <given-names>J.</given-names></name> <name><surname>Alstr&#x000F8;m</surname> <given-names>P.</given-names></name></person-group> (<year>1998</year>). <article-title>Learning to drive a bicycle using reinforcement learning and shaping</article-title>, in <source>Proceedings of the Fifteenth International Conference on Machine Learning, ICML &#x00027;98</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>Morgan Kaufmann Publishers Inc.</publisher-name>), <fpage>463</fpage>&#x02013;<lpage>471</lpage>.</citation></ref>
<ref id="B100">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rosenstein</surname> <given-names>M. T.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name> <name><surname>Si</surname> <given-names>J.</given-names></name> <name><surname>Barto</surname> <given-names>A.</given-names></name> <name><surname>Powell</surname> <given-names>W.</given-names></name> <name><surname>Wunsch</surname> <given-names>D.</given-names></name></person-group> (<year>2004</year>). <article-title>Supervised actor-critic reinforcement learning</article-title>, in <source>Handbook of Learning and Approximate Dynamic Programming</source>, eds <person-group person-group-type="editor"><name><surname>Si</surname> <given-names>J.</given-names></name> <name><surname>Barto</surname> <given-names>A.</given-names></name> <name><surname>Powell</surname> <given-names>W.</given-names></name> <name><surname>Wunsch</surname> <given-names>D.</given-names></name></person-group> (<publisher-name>John Wiley &#x00026; Sons, Inc.</publisher-name>), <fpage>359</fpage>&#x02013;<lpage>380</lpage>.</citation></ref>
<ref id="B101">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rybski</surname> <given-names>P. E.</given-names></name> <name><surname>Yoon</surname> <given-names>K.</given-names></name> <name><surname>Stolarz</surname> <given-names>J.</given-names></name> <name><surname>Veloso</surname> <given-names>M. M.</given-names></name></person-group> (<year>2007</year>). <article-title>Interactive robot task training through dialog and demonstration</article-title>, in <source>2007 2nd ACM/IEEE International Conference on Human-Robot Interaction (HRI)</source> (<publisher-loc>Arlington, VA</publisher-loc>), <fpage>49</fpage>&#x02013;<lpage>56</lpage>. <pub-id pub-id-type="doi">10.1145/1228716.1228724</pub-id></citation></ref>
<ref id="B102">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sadigh</surname> <given-names>D.</given-names></name> <name><surname>Dragan</surname> <given-names>A. D.</given-names></name> <name><surname>Sastry</surname> <given-names>S.</given-names></name> <name><surname>Seshia</surname> <given-names>S. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Active preference based learning of reward functions</article-title>, in <source>Robotics: Science and Systems</source>, eds <person-group person-group-type="editor"><name><surname>Amato</surname> <given-names>N.</given-names></name> <name><surname>Srinivasa</surname> <given-names>S.</given-names></name> <name><surname>Ayanian</surname> <given-names>N.</given-names></name> <name><surname>Kuindersma</surname> <given-names>S.</given-names></name></person-group> (<publisher-name>Robotics: Science and Systems Foundation</publisher-name>). <pub-id pub-id-type="doi">10.15607/RSS.2017.XIII.053</pub-id></citation></ref>
<ref id="B103">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Singh</surname> <given-names>S. P.</given-names></name></person-group> (<year>1992</year>). <article-title>Transfer of learning by composing solutions of elemental sequential tasks</article-title>. <source>Mach. Learn</source>. <volume>8</volume>, <fpage>323</fpage>&#x02013;<lpage>339</lpage>. <pub-id pub-id-type="doi">10.1007/BF00992700</pub-id></citation></ref>
<ref id="B104">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sridharan</surname> <given-names>M.</given-names></name></person-group> (<year>2011</year>). <article-title>Augmented reinforcement learning for interaction with non-expert humans in agent domains</article-title>, in <source>Proceedings of the 2011 10th International Conference on Machine Learning and Applications and Workshops - Volume 01, ICMLA &#x00027;11</source> (<publisher-loc>Washington, DC</publisher-loc>: <publisher-name>IEEE Computer Society</publisher-name>), <fpage>424</fpage>&#x02013;<lpage>429</lpage>. <pub-id pub-id-type="doi">10.1109/ICMLA.2011.37</pub-id></citation></ref>
<ref id="B105">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Suay</surname> <given-names>H. B.</given-names></name> <name><surname>Chernova</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>Effect of human guidance and state space size on interactive reinforcement learning</article-title>, in <source>2011 RO-MAN</source> (<publisher-loc>Atlanta, GA</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>6</lpage>. <pub-id pub-id-type="doi">10.1109/ROMAN.2011.6005223</pub-id></citation></ref>
<ref id="B106">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Suay</surname> <given-names>H. B.</given-names></name> <name><surname>Toris</surname> <given-names>R.</given-names></name> <name><surname>Chernova</surname> <given-names>S.</given-names></name></person-group> (<year>2012</year>). <article-title>A practical comparison of three robot learning from demonstration algorithms</article-title>. <source>Int. J. Soc. Robot</source>. <volume>4</volume>, <fpage>319</fpage>&#x02013;<lpage>330</lpage>. <pub-id pub-id-type="doi">10.1007/s12369-012-0158-7</pub-id></citation></ref>
<ref id="B107">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Subramanian</surname> <given-names>K.</given-names></name> <name><surname>Isbell</surname> <given-names>C. L. Jr.</given-names></name> <name><surname>Thomaz</surname> <given-names>A. L.</given-names></name></person-group> (<year>2016</year>). <article-title>Exploration from demonstration for interactive reinforcement learning</article-title>, in <source>Proceedings of the 2016 International Conference on Autonomous Agents &#x00026; Multiagent Systems, AAMAS &#x00027;16</source> (<publisher-loc>Richland, SC</publisher-loc>: <publisher-name>International Foundation for Autonomous Agents and Multiagent Systems</publisher-name>), <fpage>447</fpage>&#x02013;<lpage>456</lpage>.</citation></ref>
<ref id="B108">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name></person-group> (<year>1996</year>). <article-title>Generalization in reinforcement learning: successful examples using sparse coarse coding</article-title>, in <source>Advances in Neural Information Processing Systems</source>, eds <person-group person-group-type="editor"><name><surname>Touretzky</surname> <given-names>D.</given-names></name> <name><surname>Mozer</surname> <given-names>M. C.</given-names></name> <name><surname>Hasselmo</surname> <given-names>M.</given-names></name></person-group> (<publisher-loc>Denver, CO</publisher-loc>: <publisher-name>MIT Press</publisher-name>), <fpage>1038</fpage>&#x02013;<lpage>1044</lpage>.</citation></ref>
<ref id="B109">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Barto</surname> <given-names>A. G.</given-names></name></person-group> (<year>1998</year>). <source>Reinforcement Learning: An Introduction</source>. <publisher-name>MIT Press</publisher-name>. <pub-id pub-id-type="doi">10.1109/TNN.1998.712192</pub-id></citation></ref>
<ref id="B110">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sutton</surname> <given-names>R. S.</given-names></name> <name><surname>Precup</surname> <given-names>D.</given-names></name> <name><surname>Singh</surname> <given-names>S.</given-names></name></person-group> (<year>1999</year>). <article-title>Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning</article-title>. <source>Artif. Intell</source>. <volume>112</volume>, <fpage>181</fpage>&#x02013;<lpage>211</lpage>. <pub-id pub-id-type="doi">10.1016/S0004-3702(99)00052-1</pub-id></citation></ref>
<ref id="B111">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Syed</surname> <given-names>U.</given-names></name> <name><surname>Schapire</surname> <given-names>R. E.</given-names></name></person-group> (<year>2007</year>). <article-title>Imitation learning with a value-based prior</article-title>, in <source>Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, UAI&#x00027;07</source> (<publisher-loc>Arlington, VA</publisher-loc>: <publisher-name>AUAI Press</publisher-name>), <fpage>384</fpage>&#x02013;<lpage>391</lpage>.</citation></ref>
<ref id="B112">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Taylor</surname> <given-names>M. E.</given-names></name> <name><surname>Suay</surname> <given-names>H. B.</given-names></name> <name><surname>Chernova</surname> <given-names>S.</given-names></name></person-group> (<year>2011</year>). <article-title>Integrating reinforcement learning with human demonstrations of varying ability</article-title>, in <source>The 10th International Conference on Autonomous Agents and Multiagent Systems - Volume 2, AAMAS &#x00027;11</source> (<publisher-loc>Richland, SC</publisher-loc>: <publisher-name>International Foundation for Autonomous Agents and Multiagent Systems</publisher-name>), <fpage>617</fpage>&#x02013;<lpage>624</lpage>.</citation></ref>
<ref id="B113">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tellex</surname> <given-names>S.</given-names></name> <name><surname>Kollar</surname> <given-names>T.</given-names></name> <name><surname>Dickerson</surname> <given-names>S.</given-names></name> <name><surname>Walter</surname> <given-names>M. R.</given-names></name> <name><surname>Banerjee</surname> <given-names>A. G.</given-names></name> <name><surname>Teller</surname> <given-names>S. J.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>Understanding natural language commands for robotic navigation and mobile manipulation</article-title>, in <source>Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence</source> (<publisher-loc>San Francisco, CA</publisher-loc>).</citation></ref>
<ref id="B114">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Tellex</surname> <given-names>S.</given-names></name> <name><surname>Thaker</surname> <given-names>P.</given-names></name> <name><surname>Joseph</surname> <given-names>J.</given-names></name> <name><surname>Roy</surname> <given-names>N.</given-names></name></person-group> (<year>2014</year>). <article-title>Learning perceptually grounded word meanings from unaligned parallel data</article-title>. <source>Mach. Learn</source>. <volume>94</volume>, <fpage>151</fpage>&#x02013;<lpage>167</lpage>. <pub-id pub-id-type="doi">10.1007/s10994-013-5383-2</pub-id></citation></ref>
<ref id="B115">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tenorio-Gonz&#x000E1;lez</surname> <given-names>A. C.</given-names></name> <name><surname>Morales</surname> <given-names>E. F.</given-names></name> <name><surname>Villase&#x000F1;or-Pineda</surname> <given-names>L.</given-names></name></person-group> (<year>2010</year>). <article-title>Dynamic reward shaping: training a robot by voice</article-title>, in <source>Advances in Artificial Intelligence - IBERAMIA 2010: 12th Ibero-American Conference on AI</source>, eds <person-group person-group-type="editor"><name><surname>Kuri-Morales</surname> <given-names>A.</given-names></name> <name><surname>Simari</surname> <given-names>G. R.</given-names></name></person-group> (<publisher-loc>Berlin; Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>483</fpage>&#x02013;<lpage>492</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-16952-6</pub-id></citation></ref>
<ref id="B116">
<citation citation-type="thesis"><person-group person-group-type="author"><name><surname>Thomaz</surname> <given-names>A. L.</given-names></name></person-group> (<year>2006</year>). <source>Socially guided machine learning</source> (Ph.D. thesis). Massachusetts Institute of Technology, Cambridge, MA, United States.</citation></ref>
<ref id="B117">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thomaz</surname> <given-names>A. L.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2006</year>). <article-title>Reinforcement learning with human teachers: evidence of feedback and guidance with implications for learning performance</article-title>, in <source>Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, AAAI&#x00027;06</source> (<publisher-loc>Boston, MA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>1000</fpage>&#x02013;<lpage>1005</lpage>.</citation></ref>
<ref id="B118">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thomaz</surname> <given-names>A. L.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2007a</year>). <article-title>Asymmetric interpretations of positive and negative human feedback for a social learning agent</article-title>, in <source>RO-MAN 2007 - The 16th IEEE International Symposium on Robot and Human Interactive Communication</source> (<publisher-loc>Jeju-si</publisher-loc>), <fpage>720</fpage>&#x02013;<lpage>725</lpage>. <pub-id pub-id-type="doi">10.1109/ROMAN.2007.4415180</pub-id></citation></ref>
<ref id="B119">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thomaz</surname> <given-names>A. L.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2007b</year>). <article-title>Robot learning via socially guided exploration</article-title>, in <source>2007 IEEE 6th International Conference on Development and Learning</source> (<publisher-loc>London, UK</publisher-loc>), <fpage>82</fpage>&#x02013;<lpage>87</lpage>. <pub-id pub-id-type="doi">10.1109/DEVLRN.2007.4354078</pub-id></citation></ref>
<ref id="B120">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thomaz</surname> <given-names>A. L.</given-names></name> <name><surname>Cakmak</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>Learning about objects with human teachers</article-title>, in <source>Proceedings of the 4th ACM/IEEE International Conference on Human Robot Interaction, HRI &#x00027;09</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>15</fpage>&#x02013;<lpage>22</lpage>. <pub-id pub-id-type="doi">10.1145/1514095.1514101</pub-id></citation></ref>
<ref id="B121">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thomaz</surname> <given-names>A. L.</given-names></name> <name><surname>Hoffman</surname> <given-names>G.</given-names></name> <name><surname>Breazeal</surname> <given-names>C.</given-names></name></person-group> (<year>2006</year>). <article-title>Reinforcement learning with human teachers: understanding how people want to teach robots</article-title>, in <source>ROMAN 2006 - The 15th IEEE International Symposium on Robot and Human Interactive Communication</source> (<publisher-loc>Hatfield</publisher-loc>), <fpage>352</fpage>&#x02013;<lpage>357</lpage>. <pub-id pub-id-type="doi">10.1109/ROMAN.2006.314459</pub-id></citation></ref>
<ref id="B122">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Torrey</surname> <given-names>L.</given-names></name> <name><surname>Walker</surname> <given-names>T.</given-names></name> <name><surname>Maclin</surname> <given-names>R.</given-names></name> <name><surname>Shavlik</surname> <given-names>J. W.</given-names></name></person-group> (<year>2008</year>). <article-title>Advice taking and transfer learning: naturally inspired extensions to reinforcement learning</article-title>, in <source>AAAI Fall Symposium: Naturally-Inspired Artificial Intelligence (AAAI)</source> (<publisher-loc>Arlington, VA</publisher-loc>), <fpage>103</fpage>&#x02013;<lpage>110</lpage>.</citation></ref>
<ref id="B123">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Towell</surname> <given-names>G. G.</given-names></name> <name><surname>Shavlik</surname> <given-names>J. W.</given-names></name></person-group> (<year>1994</year>). <article-title>Knowledge-based artificial neural networks</article-title>. <source>Artif. Intell</source>. <volume>70</volume>, <fpage>119</fpage>&#x02013;<lpage>165</lpage>. <pub-id pub-id-type="doi">10.1016/0004-3702(94)90105-8</pub-id></citation></ref>
<ref id="B124">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Turing</surname> <given-names>A. M.</given-names></name></person-group> (<year>1950</year>). <article-title>Computing machinery and intelligence</article-title>. <source>Mind</source> <volume>59</volume>, <fpage>433</fpage>&#x02013;<lpage>460</lpage>. <pub-id pub-id-type="doi">10.1093/mind/LIX.236.433</pub-id></citation></ref>
<ref id="B125">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Utgoff</surname> <given-names>P. E.</given-names></name> <name><surname>Clouse</surname> <given-names>J. A.</given-names></name></person-group> (<year>1991</year>). <article-title>Two kinds of training information for evaluation function learning</article-title>, in <source>Proceedings of the Ninth National Conference on Artificial Intelligence</source> (<publisher-loc>Anaheim, CA</publisher-loc>: <publisher-name>Morgan Kaufmann</publisher-name>), <fpage>596</fpage>&#x02013;<lpage>600</lpage>.</citation></ref>
<ref id="B126">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vogel</surname> <given-names>A.</given-names></name> <name><surname>Jurafsky</surname> <given-names>D.</given-names></name></person-group> (<year>2010</year>). <article-title>Learning to follow navigational directions</article-title>, in <source>Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL &#x00027;10</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>806</fpage>&#x02013;<lpage>814</lpage>.</citation></ref>
<ref id="B127">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vollmer</surname> <given-names>A.-L.</given-names></name> <name><surname>Wrede</surname> <given-names>B.</given-names></name> <name><surname>Rohlfing</surname> <given-names>K. J.</given-names></name> <name><surname>Oudeyer</surname> <given-names>P.-Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Pragmatic frames for teaching and learning in human-robot interaction: review and challenges</article-title>. <source>Front. Neurorobot</source>. <volume>10</volume>:<fpage>10</fpage>. <pub-id pub-id-type="doi">10.3389/fnbot.2016.00010</pub-id><pub-id pub-id-type="pmid">27752242</pub-id></citation></ref>
<ref id="B128">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Watkins</surname> <given-names>C. J.</given-names></name> <name><surname>Dayan</surname> <given-names>P.</given-names></name></person-group> (<year>1992</year>). <article-title>Q-learning</article-title>. <source>Mach. Learn</source>. <volume>8</volume>, <fpage>279</fpage>&#x02013;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1023/A:1022676722315</pub-id></citation></ref>
<ref id="B129">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Waytowich</surname> <given-names>N. R.</given-names></name> <name><surname>Goecks</surname> <given-names>V. G.</given-names></name> <name><surname>Lawhern</surname> <given-names>V. J.</given-names></name></person-group> (<year>2018</year>). <article-title>Cycle-of-learning for autonomous systems from human interaction</article-title>. <source>arXiv [Preprint]. arXiv:1808.09572</source>.</citation></ref>
<ref id="B130">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Whitehead</surname> <given-names>S. D.</given-names></name></person-group> (<year>1991</year>). <article-title>A complexity analysis of cooperative mechanisms in reinforcement learning</article-title>, in <source>Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 2, AAAI&#x00027;91</source> (<publisher-loc>Anaheim, CA</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>607</fpage>&#x02013;<lpage>613</lpage>.</citation></ref>
<ref id="B131">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wiering</surname> <given-names>M. A.</given-names></name> <name><surname>van Hasselt</surname> <given-names>H.</given-names></name></person-group> (<year>2008</year>). <article-title>Ensemble algorithms in reinforcement learning</article-title>. <source>IEEE Trans. Syst. Man Cybern. B</source> <volume>38</volume>, <fpage>930</fpage>&#x02013;<lpage>936</lpage>. <pub-id pub-id-type="doi">10.1109/TSMCB.2008.920231</pub-id><pub-id pub-id-type="pmid">18632380</pub-id></citation></ref>
<ref id="B132">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wiewiora</surname> <given-names>E.</given-names></name></person-group> (<year>2003</year>). <article-title>Potential-based shaping and Q-value initialization are equivalent</article-title>. <source>J. Artif. Intell. Res</source>. <volume>19</volume>, <fpage>205</fpage>&#x02013;<lpage>208</lpage>. <pub-id pub-id-type="doi">10.1613/jair.1190</pub-id></citation></ref>
<ref id="B133">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wiewiora</surname> <given-names>E.</given-names></name> <name><surname>Cottrell</surname> <given-names>G.</given-names></name> <name><surname>Elkan</surname> <given-names>C.</given-names></name></person-group> (<year>2003</year>). <article-title>Principled methods for advising reinforcement learning agents</article-title>, in <source>Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML&#x00027;03</source> (<publisher-loc>Washington, DC</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>792</fpage>&#x02013;<lpage>799</lpage>.</citation></ref>
<ref id="B134">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Williams</surname> <given-names>R. J.</given-names></name></person-group> (<year>1992</year>). <article-title>Simple statistical gradient-following algorithms for connectionist reinforcement learning</article-title>. <source>Mach. Learn</source>. <volume>8</volume>, <fpage>229</fpage>&#x02013;<lpage>256</lpage>. <pub-id pub-id-type="doi">10.1007/BF00992696</pub-id></citation></ref>
<ref id="B135">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zettlemoyer</surname> <given-names>L. S.</given-names></name> <name><surname>Collins</surname> <given-names>M.</given-names></name></person-group> (<year>2009</year>). <article-title>Learning context-dependent mappings from sentences to logical form</article-title>, in <source>Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL &#x00027;09</source> (<publisher-loc>Stroudsburg, PA</publisher-loc>: <publisher-name>Association for Computational Linguistics</publisher-name>), <fpage>976</fpage>&#x02013;<lpage>984</lpage>. <pub-id pub-id-type="doi">10.3115/1690219.1690283</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup>The authors proposed an additional mechanism for temporal credit assignment to cope with highly dynamic tasks (Knox and Stone, <xref ref-type="bibr" rid="B53">2009</xref>): human-generated rewards were distributed backward over the actions performed within a fixed time window.</p></fn>
</fn-group>
</back>
</article>
