<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">693050</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2021.693050</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Machine Teaching for Human Inverse Reinforcement Learning</article-title>
<alt-title alt-title-type="left-running-head">Lee et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">Machine Teaching for Human IRL</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Lee</surname>
<given-names>Michael S.</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1208362/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Admoni</surname>
<given-names>Henny</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1145126/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Simmons</surname>
<given-names>Reid</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1373716/overview"/>
</contrib>
</contrib-group>
<aff>Robotics Institute, Carnegie Mellon University, <addr-line>Pittsburgh</addr-line>, <addr-line>PA</addr-line>, <country>United&#x20;States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/9158/overview">Tony Belpaeme</ext-link>, Ghent University, Belgium</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/28828/overview">Goren Gordon</ext-link>, Tel Aviv University, Israel</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/655147/overview">Emmanuel Senft</ext-link>, University of Wisconsin-Madison, United&#x20;States</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Michael S. Lee, <email>ml5@andrew.cmu.edu</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Human-Robot Interaction, a section of the journal Frontiers in Robotics and&#x20;AI</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>30</day>
<month>06</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>8</volume>
<elocation-id>693050</elocation-id>
<history>
<date date-type="received">
<day>09</day>
<month>04</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>07</day>
<month>06</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Lee, Admoni and Simmons.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Lee, Admoni and Simmons</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>As robots continue to acquire useful skills, their ability to teach their expertise will provide humans the two-fold benefit of learning from robots and collaborating fluently with them. For example, robot tutors could teach handwriting to individual students and delivery robots could convey their navigation conventions to better coordinate with nearby human workers. Because humans naturally communicate their behaviors through selective demonstrations, and comprehend others&#x2019; through reasoning that resembles inverse reinforcement learning (IRL), we propose a method of teaching humans based on demonstrations that are informative for IRL. But unlike prior work that optimizes solely for IRL, this paper incorporates various human teaching strategies (e.g. scaffolding, simplicity, pattern discovery, and testing) to better accommodate human learners. We assess our method with user studies and find that our measure of test difficulty corresponds well with human performance and confidence, and also find that favoring simplicity and pattern discovery increases human performance on difficult tests. However, we did not find a strong effect for our method of scaffolding, revealing shortcomings that indicate clear directions for future&#x20;work.</p>
</abstract>
<kwd-group>
<kwd>inverse reinforcement learning</kwd>
<kwd>learning from demonstration</kwd>
<kwd>scaffolding</kwd>
<kwd>policy summarization</kwd>
<kwd>machine teaching</kwd>
</kwd-group>
<contract-num rid="cn001">N00014-18-1-2503</contract-num>
<contract-num rid="cn002">W911NF-20-1-0006</contract-num>
<contract-sponsor id="cn001">Office of Naval Research<named-content content-type="fundref-id">10.13039/100000006</named-content>
</contract-sponsor>
<contract-sponsor id="cn002">Defense Advanced Research Projects Agency<named-content content-type="fundref-id">10.13039/100000185</named-content>
</contract-sponsor>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>As robots become capable of tasks once accomplished only by humans, the extent of their influence will depend in part on their ability to teach and convey their skills. From the youngest of us learning to handwrite (<xref ref-type="bibr" rid="B36">Sandygulova et&#x20;al., 2020</xref>; <xref ref-type="bibr" rid="B16">Guneysu Ozgur et&#x20;al., 2020</xref>) to practitioners of crafts such as chess, many of us stand to benefit from robots that can effectively teach their mastered skill. Furthermore, our ability to collaborate fluently with robots partly depends on our understanding of their behaviors. For example, workers at a construction site could better coordinate with a new delivery robot if the robot could clearly convey its navigation conventions (e.g. when it would choose to go through mud over taking a long detour).</p>
<p>While demonstrations are a natural method of teaching and learning behaviors for humans, their effectiveness still hinges on conveying an informative set of demonstrations. The literature on how humans generate and understand behaviors provides insight into what makes a demonstration informative. Cognitive science suggests that humans often model one another&#x2019;s behavior as exactly or approximately maximizing a reward function (<xref ref-type="bibr" rid="B22">Jern et&#x20;al., 2017</xref>; <xref ref-type="bibr" rid="B20">Jara-Ettinger et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B28">Lucas et&#x20;al., 2014</xref>), which they can infer through reasoning resembling inverse reinforcement learning (IRL) (<xref ref-type="bibr" rid="B30">Ng and Russell, 2000</xref>; <xref ref-type="bibr" rid="B21">Jara-Ettinger, 2019</xref>; <xref ref-type="bibr" rid="B7">Baker et&#x20;al., 2009</xref>; <xref ref-type="bibr" rid="B8">Baker et&#x20;al., 2011</xref>). Furthermore, humans are often able to obtain a behavior that (approximately) maximizes a reward function through planning, which can be modeled as dynamic programming or Monte Carlo tree search (<xref ref-type="bibr" rid="B37">Shteingart and Loewenstein, 2014</xref>; <xref ref-type="bibr" rid="B42">Wunderlich et&#x20;al., 2012</xref>).</p>
<p>Putting these insights together, we can often expect humans to be able to model others&#x2019; behaviors once equipped with their reward functions.<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> For example, upon seeing a new worker consistently arrive on time each workday, a manager will infer that the worker places a high value on punctuality and consistency and will arrive promptly at other work-related functions. Thus, the problem of conveying a behavior or skill can be reduced to conveying the underlying reward function, and the informativeness of a demonstration can be quantified by how much information it reveals regarding the reward function using&#x20;IRL.</p>
<p>Though IRL offers a principled measure of a demonstration&#x2019;s informativeness, human learning is multi-faceted and is also influenced by other factors, such as the simplicity of explanations (<xref ref-type="bibr" rid="B27">Lombrozo, 2016</xref>). Thus, unlike prior work on machine teaching that optimizes solely for IRL (<xref ref-type="bibr" rid="B10">Brown and Niekum, 2019</xref>), this paper incorporates insights on how humans effectively learn to further accommodate human learners.</p>
<p>In this work, we explore whether augmenting IRL with insights from human teaching improves human learning over optimizing for IRL alone. We first employ <italic>scaffolding</italic> from social constructivism (learning theory) to encourage demonstrations that are not just informative but also comprehensible. Specifically, we assume a general human learner without prior knowledge, and sequence demonstrations that incrementally increase in informativeness and difficulty. Noting the cognitive science literature that suggests humans favor simple explanations that follow a discernible pattern (<xref ref-type="bibr" rid="B27">Lombrozo, 2016</xref>; <xref ref-type="bibr" rid="B40">Williams et&#x20;al., 2010</xref>), we also optimize for visual <italic>simplicity and pattern discovery</italic> when selecting demonstrations. Finally, toward effective <italic>testing</italic> of the learner&#x2019;s understanding, we show that the measure of a demonstration&#x2019;s informativeness during teaching can be inverted into a measure of expected difficulty for a human to predict that exact demonstration during testing.</p>
<p>Two user studies show that our measure of test difficulty correlates strongly with human performance and confidence, with low, medium, and high difficulty tests yielding high, medium, and low performance and confidence, respectively. Study results also show that favoring simplicity and pattern discovery significantly increases human performance on difficult tests. However, we do not find a strong effect for our method of scaffolding, revealing shortcomings that indicate clear directions for future&#x20;work.</p>
</sec>
<sec id="s2">
<title>2 Related Work</title>
<sec id="s2-1">
<title>2.1 Policy Summarization and Machine Teaching</title>
<p>The problem of policy summarization considers which states and actions should be conveyed to help a user obtain a global understanding of a robot&#x2019;s policy (i.e. behavior or skill) (<xref ref-type="bibr" rid="B5">Amir et&#x20;al., 2019</xref>). There are two primary approaches to this problem. The first relies on heuristics to estimate the value of communicating certain states and actions, such as entropy (<xref ref-type="bibr" rid="B18">Huang et&#x20;al., 2018</xref>), differences in Q-values (<xref ref-type="bibr" rid="B4">Amir and Amir, 2018</xref>), and differences between the policies of two agents (<xref ref-type="bibr" rid="B6">Amitai and Amir, 2021</xref>).</p>
<p>We build on the second approach, which follows the machine teaching paradigm (<xref ref-type="bibr" rid="B43">Zhu et&#x20;al., 2018</xref>). Given an assumed learning model of the student (e.g. IRL to learn a reward function), the machine teaching objective is to select the minimal set of teaching examples (i.e. demonstrations) that will help the learner arrive at a specific target model (e.g. a policy). Though machine teaching was first applied to classification and regression (<xref ref-type="bibr" rid="B44">Zhu, 2015</xref>; <xref ref-type="bibr" rid="B26">Liu and Zhu, 2016</xref>), it has also recently been employed to convey reward functions from which the corresponding policy can be reconstructed. <xref ref-type="bibr" rid="B19">Huang et&#x20;al. (2019)</xref> selected informative demonstrations for humans modeled as employing approximate Bayesian IRL to recover the reward. This technique requires the true reward function to be within a candidate set of reward functions over which to perform Bayesian inference, and computation scales linearly with the size of the set. <xref ref-type="bibr" rid="B11">Cakmak and Lopes (2012)</xref> instead focused on IRL learners and selected demonstrations that maximally reduced uncertainty over all viable reward parameters, posed as a volume removal problem. <xref ref-type="bibr" rid="B10">Brown and Niekum (2019)</xref> improved this method (particularly for high dimensions) by solving an equivalent set cover problem instead with their Set Cover Optimal Teaching (SCOT) algorithm. However, SCOT is not explicitly designed for human learners, and this paper builds on SCOT to address that&#x20;gap.</p>
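The set cover reduction underlying SCOT can be illustrated with the standard greedy approximation. The sketch below is ours, not the authors' implementation: it assumes that each candidate demonstration's non-redundant reward constraints have already been computed elsewhere and made hashable, and it greedily picks whichever demonstration covers the most still-uncovered constraints of the full policy.

```python
# Illustrative greedy set cover over reward constraints (not the published
# SCOT code). policy_constraints: set of constraints characterizing the
# policy's BEC; demo_constraints: dict mapping demo id -> set of constraints
# that demonstration conveys.

def greedy_set_cover(policy_constraints, demo_constraints):
    uncovered = set(policy_constraints)
    selected = []
    while uncovered:
        # Pick the demonstration covering the most still-uncovered constraints.
        best = max(demo_constraints,
                   key=lambda d: len(demo_constraints[d] & uncovered))
        gained = demo_constraints[best] & uncovered
        if not gained:  # remaining constraints cannot be covered by any demo
            break
        selected.append(best)
        uncovered -= gained
    return selected
```

Greedy selection gives the usual logarithmic approximation guarantee for set cover, which is what makes this tractable where exhaustive search is not.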
</sec>
<sec id="s2-2">
<title>2.2 Techniques for Human Teaching</title>
<p>Human teaching and learning is a multifaceted process that has been studied extensively. Thus, we also take inspiration from social constructivism (learning theory) and cognitive science to inform how a robot may teach a skill to a human learner so that the learner may correctly reproduce that skill in new situations.</p>
<p>
<bold>Scaffolding</bold>: Scaffolding is a well-established pedagogical technique in which a more knowledgeable teacher assists a learner in accomplishing a task currently beyond the learner&#x2019;s abilities, e.g. by reducing the degrees of freedom of the problem and/or by demonstrating partial solutions to the task (<xref ref-type="bibr" rid="B41">Wood et&#x20;al., 1976</xref>). Noting the benefits demonstrated by automated scaffolding to date [e.g. <xref ref-type="bibr" rid="B35">Sampayo-Vargas et&#x20;al. (2013)</xref>], we implement the first recommendation made by <xref ref-type="bibr" rid="B34">Reiser (2004)</xref> for software-based scaffolding, which is to reduce the complexity of the learning problem through additional structure. Specifically, we incorporate this technique when teaching a skill by providing demonstrations that sequentially increase in informativeness and difficulty.</p>
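The sequencing step above can be sketched in a few lines. This is a minimal illustration of ours, assuming informativeness is proxied by a precomputed per-demonstration score (the paper later uses BEC area, where a larger area leaves more uncertainty over the reward weights, so larger-area demonstrations come first):

```python
# Illustrative scaffolding order: least informative (easiest) demonstrations
# first. demos_with_area: list of (demo_id, bec_area) pairs, with BEC area
# assumed precomputed elsewhere as the informativeness proxy.

def scaffold_order(demos_with_area):
    # Sort by decreasing BEC area: large area = less informative = shown first.
    return [d for d, _ in sorted(demos_with_area, key=lambda p: -p[1])]
```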
<p>
<bold>Simplicity and Pattern Discovery</bold>: Studies on explanations preferred by humans indicate a bias toward those that are simpler and have fewer causes (<xref ref-type="bibr" rid="B27">Lombrozo, 2016</xref>). Furthermore, <xref ref-type="bibr" rid="B40">Williams et&#x20;al. (2010)</xref> found that explanations can be detrimental if they do not help the learner to notice useful patterns or even mislead them with false patterns. Together, these two works support the idea that explanations should minimize distractions that potentially inspire false correlations and instead highlight and reinforce the minimal set of causes. We thus also optimize for simplicity and pattern discovery when selecting demonstrations that naturally &#x201c;explain&#x201d; the underlying&#x20;skill.</p>
<p>
<bold>Testing</bold>: Effective scaffolding requires an accurate diagnosis of the learner&#x2019;s current abilities to provide the appropriate level of assistance throughout the teaching process (<xref ref-type="bibr" rid="B12">Collins et&#x20;al., 1988</xref>). A common diagnostic method is presenting the learner with tests of varying difficulties and assessing their understanding of a skill. Toward this, we propose a way to quantify the difficulty of a test that specifically assesses the student&#x2019;s ability to predict the right behavior in a new situation.</p>
</sec>
</sec>
<sec id="s3">
<title>3 Technical Background</title>
<sec>
<title>3.1 Markov Decision Process</title>
<p>The robot&#x2019;s environment is represented as an instance (indexed by <italic>i</italic>) of a deterministic<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref> Markov decision process, <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>D</mml:mi>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msubsup>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>0</mml:mn>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf3">
<mml:math id="m3">
<mml:mi mathvariant="script">A</mml:mi>
</mml:math>
</inline-formula> denote the state and action sets, <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>&#x2192;</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> the transition function, <inline-formula id="inf5">
<mml:math id="m5">
<mml:mrow>
<mml:mi>R</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> the reward function, <inline-formula id="inf6">
<mml:math id="m6">
<mml:mrow>
<mml:mi>&#x3b3;</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> the discount factor, and <inline-formula id="inf7">
<mml:math id="m7">
<mml:mrow>
<mml:msubsup>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>0</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula> the initial state distribution, and <inline-formula id="inf8">
<mml:math id="m8">
<mml:mrow>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mo>:</mml:mo>
<mml:mstyle displaystyle="true">
<mml:msub>
<mml:mo>&#x222a;</mml:mo>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mstyle>
</mml:mrow>
</mml:math>
</inline-formula> the union over the states of all related instances of MDPs, which we call a domain (to be described in the following paragraphs).</p>
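The setup above can be summarized in code. The sketch below is a minimal illustration under our own naming (not the authors' implementation) of a deterministic MDP instance whose reward is linear in hand-designed features, R(s, a, s&#x2032;) = w&#x2a;&#x22a4;&#x3d5;(s, a, s&#x2032;):

```python
# Minimal sketch of one deterministic MDP instance from a domain. The action
# set, reward features, weights, and discount factor are shared across the
# domain; states, transitions, and start state vary per instance.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class DeterministicMDP:
    states: List[str]                                 # S_i
    actions: List[str]                                # A (shared across domain)
    transition: Dict[Tuple[str, str], str]            # T_i: (s, a) -> s'
    features: Callable[[str, str, str], List[float]]  # phi(s, a, s') in R^l
    weights: List[float]                              # w*, unknown to learner
    gamma: float = 0.95                               # discount factor
    start: str = "s0"                                 # initial state

    def reward(self, s: str, a: str, s_next: str) -> float:
        # R(s, a, s') = w* . phi(s, a, s')
        return sum(w * f for w, f in zip(self.weights,
                                         self.features(s, a, s_next)))
```

Under this representation, a domain is simply a collection of such instances sharing `actions`, `features`, `weights`, and `gamma`.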
<p>Finally, the robot has an optimal policy (i.e. a skill) <inline-formula id="inf9">
<mml:math id="m9">
<mml:mrow>
<mml:mrow>
<mml:msup>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
<mml:msub>
<mml:mi>&#xa0;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>:</mml:mo>
</mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> that maps each state in an MDP to the action that will optimize the reward in an infinite horizon. A sequence of <inline-formula id="inf10">
<mml:math id="m10">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>s</mml:mi>
<mml:mtext>&#x2032;</mml:mtext>
</mml:msup>
<mml:msub>
<mml:mi>&#xa0;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> tuples obtained by following <inline-formula id="inf11">
<mml:math id="m11">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> gives rise to an optimal trajectory (i.e. a demonstration) <inline-formula id="inf12">
<mml:math id="m12">
<mml:mrow>
<mml:mi>&#x3be;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf13">
<mml:math id="m13">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msup>
<mml:mi>s</mml:mi>
<mml:mtext>&#x2032;</mml:mtext>
</mml:msup>
<mml:msub>
<mml:mi>&#xa0;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. We assume that <italic>R</italic> can be expressed as a weighted linear combination of <italic>l</italic> reward features<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref> <inline-formula id="inf14">
<mml:math id="m14">
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mi>l</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, i.e. <inline-formula id="inf15">
<mml:math id="m15">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:mi>R</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mrow>
<mml:msup>
<mml:mo>&#x2a;</mml:mo>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>&#x3d5;</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</inline-formula> (<xref ref-type="bibr" rid="B1">Abbeel and Ng, 2004</xref>). We also assume that the human is aware of all aspects of an MDP (including the reward features) but not the weights <inline-formula id="inf16">
<mml:math id="m16">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<p>Let a domain refer to a collection of related MDPs that share <inline-formula id="inf17">
<mml:math id="m17">
<mml:mrow>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>R</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> but differ in <inline-formula id="inf18">
<mml:math id="m18">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="script">S</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf19">
<mml:math id="m19">
<mml:mrow>
<mml:msubsup>
<mml:mi>S</mml:mi>
<mml:mi>i</mml:mi>
<mml:mn>0</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:math>
</inline-formula>. Take for example the delivery domain, which modifies the Taxi domain (<xref ref-type="bibr" rid="B14">Dietterich, 1998</xref>) by adding mud (see <xref ref-type="fig" rid="F1">Figure&#x20;1</xref>). The robot is rewarded for efficiently delivering the package to the destination while avoiding the mud if the detour is not too costly. Though MDPs in this domain may vary in the number and locations of mud patches and consequently offer a diverse set of demonstrations (e.g. see <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>), they importantly share the same reward function&#x20;<italic>R</italic>.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>
<bold>(A)</bold> A demonstration <inline-formula id="inf20">
<mml:math id="m20">
<mml:mi mathvariant="script">D</mml:mi>
</mml:math>
</inline-formula> of an optimal policy &#x3c0; in the delivery domain. The agent aims to deliver the package to the destination while avoiding walls and avoiding mud if the detour is not too costly. <bold>(B)</bold> The left demonstration can be translated into a set of half-space constraints on the underlying policy reward weights using <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>. The darker shaded region is where all constraints (the red and light blue lines) hold true, which corresponds to the behavior equivalence class BEC(<inline-formula id="inf21">
<mml:math id="m21">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>), see Section 3.3.</p>
</caption>
<graphic xlink:href="frobt-08-693050-g001.tif"/>
</fig>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>
<bold>(A)</bold> Sample demonstrations exhibiting scaffolding, simplicity, and pattern discovery. We scaffold by showing demonstrations that incrementally decrease in BEC area (which appears to correlate inversely with informativeness and difficulty). Simplicity is encouraged by minimizing visual clutter (i.e. unnecessary mud patches). Pattern discovery is encouraged by holding the agent and package locations constant while highlighting the single additional mud patch between demonstrations that changes the optimal behavior. <bold>(B)</bold> Histogram of BEC areas of the 25,600 possible demonstrations in the delivery domain. Cluster centers returned by k-means (k &#x3d; 6) are shown as red circles along the <italic>x</italic>-axis. Demonstrations from every other cluster are selected and shown in order of largest to smallest BEC area for scaffolded machine teaching.</p>
</caption>
<graphic xlink:href="frobt-08-693050-g002.tif"/>
</fig>
<p>Because instances of a domain share <italic>R</italic>, the various demonstrations all support inference over the same <inline-formula id="inf22">
<mml:math id="m22">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> through IRL. Thus, we overload the notation <inline-formula id="inf23">
<mml:math id="m23">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> to refer to any policy of a domain instance that optimizes a reward with <inline-formula id="inf24">
<mml:math id="m24">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. Furthermore, while a demonstration strictly consists of both an optimal trajectory <inline-formula id="inf25">
<mml:math id="m25">
<mml:mrow>
<mml:mi>&#x3be;</mml:mi>
<mml:mo>&#x2a;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> (obtained by following <inline-formula id="inf26">
<mml:math id="m26">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>) and the corresponding MDP (minus <inline-formula id="inf27">
<mml:math id="m27">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>), we will refer to a demonstration only by <inline-formula id="inf28">
<mml:math id="m28">
<mml:mrow>
<mml:mi>&#x3be;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> in this work for notational simplicity.</p>
<p>Having represented the robot&#x2019;s environment and policy, we now define the problem of generating demonstrations for teaching that policy through the lens of machine teaching.</p>
</sec>
<sec>
<title>3.2 Machine Teaching for Policies</title>
<p>As formalized by <xref ref-type="bibr" rid="B24">Lage et&#x20;al. (2019)</xref>, machine teaching for policies seeks to convey a set of demonstrations <inline-formula id="inf29">
<mml:math id="m29">
<mml:mi mathvariant="script">D</mml:mi>
</mml:math>
</inline-formula> of size <italic>n</italic> (i.e. the allotted teaching budget) that will maximize the similarity <italic>&#x3c1;</italic> between <inline-formula id="inf30">
<mml:math id="m30">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> and the policy <inline-formula id="inf31">
<mml:math id="m31">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> recovered using a model <inline-formula id="inf32">
<mml:math id="m32">
<mml:mi mathvariant="normal">&#x2133;</mml:mi>
</mml:math>
</inline-formula> on <inline-formula id="inf33">
<mml:math id="m33">
<mml:mi mathvariant="script">D</mml:mi>
</mml:math>
</inline-formula>
<disp-formula id="e1">
<mml:math id="m34">
<mml:mrow>
<mml:munder>
<mml:mrow>
<mml:mtext>arg</mml:mtext>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>&#xa0;max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mtext>&#x39e;</mml:mtext>
</mml:mrow>
</mml:munder>
<mml:mi>&#x3c1;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="normal">&#x2133;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>s</mml:mtext>
<mml:mo>.</mml:mo>
<mml:mtext>t</mml:mtext>
<mml:mo>.</mml:mo>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x7c;</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>where <inline-formula id="inf34">
<mml:math id="m35">
<mml:mtext>&#x39e;</mml:mtext>
</mml:math>
</inline-formula> is the set of all optimal demonstrations of <inline-formula id="inf35">
<mml:math id="m36">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> in a domain. We assume that the <inline-formula id="inf36">
<mml:math id="m37">
<mml:mi mathvariant="normal">&#x2133;</mml:mi>
</mml:math>
</inline-formula> employed by humans to approximate the underlying <inline-formula id="inf37">
<mml:math id="m38">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> is IRL. Once <inline-formula id="inf38">
<mml:math id="m39">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> (and the subsequent reward function) is approximated, we assume that human learners are able to arrive at <inline-formula id="inf39">
<mml:math id="m40">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, i.e. the skill, through planning on the underlying&#x20;MDP.</p>
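Concretely, Eq. 1 amounts to a search over candidate demonstration sets. The brute-force sketch below (ours, and intractable beyond toy problems) makes the objective explicit; `learner_model` stands in for the assumed learner model &#x2133; (IRL-based recovery) and `similarity` for the metric &#x3c1;, both of which are placeholders rather than the paper's implementations:

```python
# Brute-force version of the machine teaching objective (Eq. 1): among all
# size-n subsets of demonstrations, return the one whose recovered policy is
# most similar to the true policy.
from itertools import combinations

def best_teaching_set(all_demos, n, learner_model, similarity, true_policy):
    return max(
        (list(D) for D in combinations(all_demos, n)),
        key=lambda D: similarity(learner_model(D), true_policy),
    )
```

In practice this exhaustive search is replaced by structured methods such as the set-cover approach discussed in Section 2.1.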
<p>Thus, the teaching objective reduces to effectively conveying <inline-formula id="inf40">
<mml:math id="m41">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> through well-selected demonstrations.<xref ref-type="fn" rid="fn4">
<sup>4</sup>
</xref> In order to quantify the information a demonstration provides on <inline-formula id="inf41">
<mml:math id="m42">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>, we leverage the idea of behavior equivalence classes.</p>
</sec>
<sec>
<title>3.3 Behavior Equivalence Class</title>
<p>The <italic>behavior equivalence class</italic> (BEC) of <italic>&#x3c0;</italic> is the set of (viable) reward weights under which <italic>&#x3c0;</italic> is still optimal. The larger the BEC(<italic>&#x3c0;</italic>) is, the greater the potential uncertainty over <inline-formula id="inf42">
<mml:math id="m43">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> that underlies the robot&#x2019;s optimal policy.<disp-formula id="e2">
<mml:math id="m44">
<mml:mrow>
<mml:mtext>BEC</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mi>l</mml:mi>
</mml:msup>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#xa0;optimal&#xa0;w</mml:mtext>
<mml:mo>.</mml:mo>
<mml:mtext>r</mml:mtext>
<mml:mo>.</mml:mo>
<mml:mtext>t</mml:mtext>
<mml:mo>.</mml:mo>
<mml:mtext>&#xa0;</mml:mtext>
<mml:mi>R</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msup>
<mml:mi>&#x3d5;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
<p>The BEC(<italic>&#x3c0;</italic>) can be calculated as the intersection of the following half-space constraints generated by the central IRL equation (<xref ref-type="bibr" rid="B30">Ng and Russell, 2000</xref>)<disp-formula id="e3">
<mml:math id="m45">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>&#x3bc;</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>&#x3bc;</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2265;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>&#x2200;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>arg</mml:mtext>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>max</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mi mathvariant="normal">&#x2032;</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi>Q</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
<mml:mi mathvariant="normal">&#x2032;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>s</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">S</mml:mi>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>where <inline-formula id="inf43">
<mml:math id="m46">
<mml:mtable columnalign="left">
<mml:mtr>
<mml:mtd>
<mml:msubsup>
<mml:mi>&#x3bc;</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi mathvariant="double-struck">E</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
<mml:mi>&#x221e;</mml:mi>
</mml:msubsup>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msup>
<mml:mi>&#x3b3;</mml:mi>
<mml:mi>t</mml:mi>
</mml:msup>
<mml:mi>&#x3d5;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#x3d;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:math>
</inline-formula> is the vector of expected reward feature counts accrued from taking action <italic>a</italic> in state <italic>s</italic> and following <italic>&#x3c0;</italic> thereafter, and <inline-formula id="inf44">
<mml:math id="m47">
<mml:mrow>
<mml:mi>Q</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> refers to the optimal Q-value of taking action <italic>a</italic> in state <italic>s</italic> (<xref ref-type="bibr" rid="B39">Watkins and Dayan, 1992</xref>).</p>
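For deterministic policies and dynamics, as in the gridworlds considered here, the expected feature counts reduce to a single discounted rollout. The following sketch illustrates this; the toy chain MDP and all function names are our own illustration, not the authors' code.

```python
import numpy as np

def feature_counts(s, a, policy, step, phi, gamma, horizon):
    """mu_pi(s, a): discounted feature counts with s_0 = s, a_0 = a,
    deterministic transitions, and actions thereafter given by the policy."""
    mu = np.array(phi(s), dtype=float)            # t = 0 term
    state, action = s, a
    for t in range(1, horizon):
        state = step(state, action)               # deterministic transition
        mu += (gamma ** t) * np.array(phi(state))
        action = policy(state)
    return mu

# Toy 1-D chain: states 0..3, 'right' moves +1, state 3 is absorbing.
step = lambda s, a: min(s + 1, 3) if a == 'right' else s
phi = lambda s: [1.0 if s == 3 else 0.0, 1.0]     # [at-goal, action-taken]
policy = lambda s: 'right'

mu = feature_counts(0, 'right', policy, step, phi, gamma=0.9, horizon=200)
```

The finite horizon truncates the infinite sum; with discounting the truncation error is negligible for a long enough rollout.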
<p>
<xref ref-type="bibr" rid="B10">Brown and Niekum (2019)</xref> proved that the BEC(<inline-formula id="inf45">
<mml:math id="m48">
<mml:mrow>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:mi>&#x3c0;</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>) of a set of demonstrations <inline-formula id="inf46">
<mml:math id="m49">
<mml:mi mathvariant="script">D</mml:mi>
</mml:math>
</inline-formula> of a policy <italic>&#x3c0;</italic> can be formulated similarly as the intersection of the following half-spaces<disp-formula id="e4">
<mml:math id="m50">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mi mathvariant="normal">&#x22a4;</mml:mi>
</mml:msup>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mi>&#x3bc;</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mi>&#x3bc;</mml:mi>
<mml:mi>&#x3c0;</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msubsup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2265;</mml:mo>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2200;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>s</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">D</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi mathvariant="script">A</mml:mi>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
</p>
<p>Using <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>, every demonstration can be translated into a set of constraints on the viable reward weights.</p>
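As a sketch of this translation (the helper names are ours; the half-space normals follow Eq. 4 directly), each state-action pair in a demonstration contributes one constraint per alternative action:

```python
import numpy as np

def demo_to_constraints(demo, actions, mu):
    """Translate a demonstration (a list of (state, action) pairs) into
    BEC half-space normals: each normal v satisfies w . v >= 0 under any
    reward weights w for which the demonstration is optimal. `mu(s, a)`
    returns expected feature counts (any implementation may be plugged in)."""
    constraints = []
    for s, a in demo:
        for b in actions:
            if b == a:
                continue
            constraints.append(mu(s, a) - mu(s, b))
    return constraints
```

Intersecting these half-spaces over all demonstrated pairs yields BEC(D|pi).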
<p>Consider an example in the delivery domain with <inline-formula id="inf47">
<mml:math id="m51">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mo>&#x3d;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> {<italic>up</italic>, <italic>down</italic>, <italic>left</italic>, <italic>right</italic>, <italic>pick up</italic>, <italic>drop</italic>, <italic>exit</italic>}, <inline-formula id="inf48">
<mml:math id="m52">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mo>&#x2a;</mml:mo>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:mn>26</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>3</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
<xref ref-type="fn" rid="fn5">
<sup>5</sup>
</xref> and binary reward features <inline-formula id="inf49">
<mml:math id="m53">
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> [<italic>dropped off package at destination</italic>, <italic>entered mud</italic>, <italic>action taken</italic>]. The demonstration in the left image of <xref ref-type="fig" rid="F1">Figure&#x20;1</xref> corresponds to the constraints in the right image. With a unit cost for each action, the constraints on viable reward weights intuitively indicate that 1) <inline-formula id="inf50">
<mml:math id="m54">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>&#x2265;</mml:mo>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> since a total of 10 actions were taken in the demonstration and that 2) <inline-formula id="inf51">
<mml:math id="m55">
<mml:mrow>
<mml:msub>
<mml:mi>w</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mtext>&#x2a;</mml:mtext>
<mml:mo>&#x2264;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> as the detour around the mud took two actions.</p>
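The two constraints can be checked numerically against the stated reward weights. The alternative trajectories' feature counts below are our reconstruction from the text (a 10-action mud-free delivery, an immediate exit, and an 8-action shortcut through one mud patch), assuming undiscounted feature sums:

```python
import numpy as np

w_star = np.array([26.0, -3.0, -1.0])   # [dropoff, mud, action]
phi_demo = np.array([1, 0, 10])         # demonstrated: deliver, no mud, 10 actions
phi_exit = np.array([0, 0, 1])          # alternative: exit immediately
phi_mud = np.array([1, 1, 8])           # alternative: cut through the mud

# Eq. 4: w* . (phi_demo - phi_alt) >= 0 for the demonstration to be optimal.
assert w_star @ (phi_demo - phi_exit) >= 0   # delivering beats exiting
assert w_star @ (phi_demo - phi_mud) >= 0    # the 2-action detour beats the mud
```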
</sec>
<sec>
<title>3.4 Set Cover Optimal Teaching (SCOT)</title>
<p>SCOT (<xref ref-type="bibr" rid="B10">Brown and Niekum, 2019</xref>) allows a robot to select the minimum number of demonstrations that results in the smallest BEC area (i.e. the area of the intersection of the constraints) for an IRL learner. As it considers only IRL, it serves as a baseline for the techniques proposed in this work, which augment SCOT with human teaching strategies.</p>
<p>The SCOT algorithm is summarized here for completeness. The robot first translates all possible demonstrations of its policy in a domain into a corresponding set of BEC constraints. After taking a union of these constraints, redundant constraints are removed using linear programming (<xref ref-type="bibr" rid="B33">Paulraj and Sumathi, 2010</xref>). These non-redundant constraints together form the minimal representation of BEC(<inline-formula id="inf52">
<mml:math id="m56">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>). SCOT now iteratively runs through all possible demonstrations again and greedily adds to the teaching set <inline-formula id="inf53">
<mml:math id="m57">
<mml:mi mathvariant="script">D</mml:mi>
</mml:math>
</inline-formula> the demonstration that covers as many of the remaining constraints in BEC(<inline-formula id="inf54">
<mml:math id="m58">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>), until all constraints are covered.<xref ref-type="fn" rid="fn6">
<sup>6</sup>
</xref> These steps correspond to lines 2&#x2013;13 in <xref ref-type="other" rid="alg1">Algorithm&#x20;1</xref>.</p>
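The greedy set-cover core of these steps can be sketched as follows. This is a simplified illustration, not the authors' implementation: constraints are matched by identity (e.g. precomputed ids), rather than by the linear-programming redundancy checks described above.

```python
def scot_greedy(candidate_demos, bec_constraints, constraints_of):
    """Greedily build a teaching set: at each step add the demonstration
    covering the most still-uncovered BEC constraints, until all of
    BEC(pi*) is covered. `constraints_of(demo)` returns the set of
    constraint ids that demo conveys (assumed precomputed)."""
    uncovered = set(bec_constraints)
    teaching_set = []
    while uncovered:
        best = max(candidate_demos, key=lambda d: len(constraints_of(d) & uncovered))
        gained = constraints_of(best) & uncovered
        if not gained:                  # remaining constraints not coverable
            break
        teaching_set.append(best)
        uncovered -= gained
    return teaching_set
```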
</sec>
</sec>
<sec id="s4">
<title>4 Proposed Techniques for Teaching Humans</title>
<sec id="s4-1">
<title>4.1 Scaffolding</title>
<p>The SCOT algorithm efficiently selects the minimum number of demonstrations that results in the smallest BEC area for a pure IRL learner (<xref ref-type="bibr" rid="B10">Brown and Niekum, 2019</xref>). Such a learner is assumed to fully grasp these few highly nuanced examples that delicately straddle decision-making boundaries and find any other demonstrations redundant. However, <italic>we posit that the BEC area of a demonstration not only inversely corresponds to the amount of information it contains about the possible values of</italic> <inline-formula id="inf55">
<mml:math id="m59">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>
<italic>, but also inversely corresponds to the effort required for a human to extract that information.</italic> Thus humans will likely benefit from additional scaffolded examples that ease them in and incrementally relax the degrees of freedom of the learning problem.</p>
<p>We develop a scaffolding method for a learner without any prior knowledge, outlined as follows. First, obtain the SCOT demonstrations that contain the maximum information on <inline-formula id="inf56">
<mml:math id="m60">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>. If space remains in the teaching budget <italic>n</italic> for additional demonstrations, begin scaffolding by sorting all possible demonstrations in a domain according to their BEC areas. Then cluster them using k-means into twice as many clusters as the remaining budget to ensure that no two consecutive demonstrations are nearly identical in BEC area (see <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>). Randomly draw <italic>m</italic> candidate demonstrations from every other cluster. Finally from these <italic>n</italic> pools of candidate demonstrations, select the ones that best optimize visuals for the teaching set <inline-formula id="inf57">
<mml:math id="m61">
<mml:mi mathvariant="script">D</mml:mi>
</mml:math>
</inline-formula> (as described in the next section). See lines 16&#x2013;21 in <xref ref-type="other" rid="alg1">Algorithm 1</xref>. In this paper, the algorithm always divided the BEC areas into 6 clusters, considering every other cluster to correspond to &#x201c;low&#x201d;, &#x201c;medium&#x201d;, and &#x201c;high&#x201d; information respectively.</p>
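The clustering step can be sketched with a simple one-dimensional k-means over BEC areas. This is our own minimal implementation; the quantile initialization is an assumption made here for determinism.

```python
import numpy as np

def cluster_bec_areas(areas, k, iters=50):
    """Cluster demonstrations' scalar BEC areas into k groups, returned
    smallest-area (most informative) first. A sketch of the scaffolding
    step, where k is set to twice the remaining teaching budget."""
    areas = np.sort(np.asarray(areas, dtype=float))
    centers = np.quantile(areas, np.linspace(0.0, 1.0, k))  # deterministic init
    for _ in range(iters):
        labels = np.argmin(np.abs(areas[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = areas[labels == j].mean()
    order = np.argsort(centers)                 # small BEC area = high information
    return [list(areas[labels == j]) for j in order]
```

Candidate demonstrations would then be drawn from every other cluster, as described above.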
</sec>
<sec id="s4-2">
<title>4.2 Simplicity and Pattern Discovery</title>
<p>Though the BEC area of a demonstration provides an unbiased, quantitative measure of the information transferred to a pure IRL learner, <italic>human learners are likely also influenced by the medium of the demonstration, e.g. visuals, and the simplicity and patterns it affords</italic>. For example, visible differences between sequential demonstrations can highlight relevant aspects, while visual clutter that does not actually influence the robot&#x2019;s behavior (e.g. extraneous mud not in the path of the delivery robot) may distract or even mislead the&#x20;human.</p>
<p>We perform a greedy sequential optimization for pattern discovery and then for simplicity. We first encourage pattern matching by considering candidates from different BEC clusters (which often exhibit qualitatively different behaviors) that are most visually similar to the previous demonstration.<xref ref-type="fn" rid="fn7">
<sup>7</sup>
</xref> The aim is to highlight a change in environment (e.g. a new mud patch) that caused the change in behavior (e.g. robot takes a detour) while keeping all other elements constant. We then optimize for simplicity. A measure of visual simplicity is manually defined for each domain (e.g. the number of mud patches in the delivery domain), and out of the scaffolding candidates, the visually simplest demonstration is selected.</p>
<p>The proposed methods for scaffolding and visual optimization come together in <xref ref-type="other" rid="alg1">Algorithm 1</xref>.<xref ref-type="fn" rid="fn8">
<sup>8</sup>
</xref> Since the highest-information SCOT demonstrations are selected first, and the remaining demonstrations are then drawn via k-means clustering from high to low information, the algorithm concludes by reversing the demonstration list to order the demonstrations from easiest to hardest (line 28).<xref ref-type="fn" rid="fn9">
<sup>9</sup>
</xref> <inline-formula id="inf58">
<mml:math id="m62">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi mathvariant="normal">N</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> denotes the operation of extracting unit normal vectors corresponding to a set of half-space constraints, and <inline-formula id="inf59">
<mml:math id="m63">
<mml:mi mathvariant="normal">&#x2216;</mml:mi>
</mml:math>
</inline-formula> denotes set subtraction. An example of a sequence of demonstrations that exhibits scaffolding, simplicity, and pattern discovery can be found at the top of <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>.</p>
</sec>
<sec id="s4-3">
<title>4.3 Testing</title>
<p>An optimal trajectory&#x2019;s BEC area intuitively captures its informativeness as a teaching demonstration. The smaller the area, the less uncertainty there is regarding the value of <inline-formula id="inf60">
<mml:math id="m64">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<p>We propose a complementary and novel idea: <italic>that the BEC area can be inverted as a measure of a trajectory&#x2019;s difficulty as a question during testing</italic>, i.e. when a human is asked to predict the robot&#x2019;s trajectory in a new situation. Intuitively, a large BEC area indicates that there are many viable reward weights for a demonstration, and thus the human does not need to precisely understand <inline-formula id="inf61">
<mml:math id="m65">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> to correctly predict the robot&#x2019;s trajectory. We can also use this measure to scaffold tests of varying difficulties to gauge the human&#x2019;s understanding of <inline-formula id="inf62">
<mml:math id="m66">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> and subsequently <inline-formula id="inf63">
<mml:math id="m67">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<p>
<statement content-type="algorithm" id="alg1">
<label>
<bold>Algorithm 1</bold>
</label>
<p>Machine Teaching for Human Learners.</p>
</statement>
</p>
<fig id="F8" position="float">
<p>
<inline-graphic xlink:href="frobt-08-693050-fx1.tif"/>
</p>
</fig>
</sec>
</sec>
<sec id="s5">
<title>5 User Studies</title>
<p>We ran two online user studies that involved participants watching demonstrations of a 2D agent&#x2019;s policy and predicting the optimal trajectory in new test environments.<xref ref-type="fn" rid="fn10">
<sup>10</sup>
</xref> The studies were designed to evaluate the following hypotheses.</p>
<p>
<bold>H1</bold>: The BEC area of a demonstration correlates 1) inversely with the expected difficulty for a human to correctly predict it during testing, and 2) directly with their confidence in that prediction.</p>
<p>
<bold>H2</bold>: The BEC area of a demonstration also correlates 1) inversely with the information transferred to a human during teaching and 2) inversely with the subsequent test performance.</p>
<p>
<bold>H3</bold>: Forward scaffolding (demonstrations shown in increasing difficulty) will result in better qualitative assessments of the teaching set and better participant test performance over no scaffolding (only high difficulty demonstrations shown) and backward scaffolding (demonstrations shown in decreasing difficulty), in that&#x20;order.</p>
<p>
<bold>H4</bold>: Positive visual optimization will result in better qualitative assessments of the teaching set and better test performance over negative visual optimization (with positive and negative visual optimization corresponding to the maximization and minimization, respectively, of both simplicity and pattern discovery).</p>
<p>The two user studies jointly tested H1. The first study tested H2 and the second study tested H3 and&#x20;H4.</p>
<sec id="s5-1">
<title>5.1 Domains</title>
<p>Three simple gridworld domains were designed for this study (see <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>). The available actions were {<italic>up</italic>, <italic>down</italic>, <italic>left</italic>, <italic>right</italic>, <italic>pick up</italic>, <italic>drop</italic>, <italic>exit</italic>}. Each domain consisted of one shared reward feature of unit action cost, and two unique reward features as follows.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Three domains were presented in the user study, each with a different set of reward weights to infer from demonstrations using inverse reinforcement learning. <bold>(A)</bold> delivery, <bold>(B)</bold> two-goal, <bold>(C)</bold> skateboard.</p>
</caption>
<graphic xlink:href="frobt-08-693050-g003.tif"/>
</fig>
<p>
<bold>Delivery domain</bold>: The agent is rewarded for bringing a package to the destination and penalized for moving into&#x20;mud.</p>
<p>
<bold>Two-goal domain</bold>: The agent is rewarded for reaching one of two goals, with each goal having a different reward.</p>
<p>
<bold>Skateboard domain</bold>: The agent is rewarded for reaching the goal. It is penalized less per action if it has picked up a skateboard (i.e. riding a skateboard is less costly than walking).</p>
<p>To convey an upper bound on the positive reward weight, the agent exited the game immediately if it encountered an environment where working toward the positive reward would yield a lower overall reward (e.g. too much mud along its path). The semantics of each domain were masked with basic geometric shapes and colors to prevent biasing human learners with priors. All domains were implemented using the simple_rl framework (<xref ref-type="bibr" rid="B2">Abel, 2019</xref>).</p>
</sec>
<sec id="s5-2">
<title>5.2 Study Design</title>
<p>The first and second user studies (US1 and US2, respectively) used the same domains, procedures, and measures, though they differed in which variable was manipulated.</p>
<p>US1 explored how the BEC area of demonstrations correlates with a human&#x2019;s understanding of the underlying policy. Thus, the between-subjects variable was <italic>information class</italic>, with three levels: low, medium, and maximum (i.e. SCOT). The low and medium information demonstrations were selected from the fifth and third BEC clusters, respectively (see <xref ref-type="fig" rid="F2">Figure&#x20;2</xref>). When selecting multiple demonstrations from a <italic>single</italic> cluster, we optimized for visual simplicity and <italic>dissimilarity</italic>, as diversity<xref ref-type="fn" rid="fn11">
<sup>11</sup>
</xref> of demonstrations has been shown to improve human learning (<xref ref-type="bibr" rid="B4">Amir and Amir, 2018</xref>; <xref ref-type="bibr" rid="B19">Huang et&#x20;al., 2019</xref>). The number of demonstrations shown in each domain was set to equal the number of SCOT demonstrations for fair comparison (2 for delivery and skateboard, 3 for two-goal).</p>
<p>US2 explored how incorporating human learning strategies impacts a human&#x2019;s understanding of the underlying policy. Specifically, it examined how the presence and direction of scaffolding, and the optimization of visuals, would impact the human&#x2019;s test performance. The between-subjects variables were <italic>scaffolding class</italic> (none, forward, and backward) and <italic>visual optimization</italic> (positive and negative). For scaffolding class, forward scaffolding showed demonstrations according to <xref ref-type="other" rid="alg1">Algorithm 1</xref>, backward scaffolding showed forward scaffolding&#x2019;s demonstrations in reverse, and no scaffolding showed only highly informative examples from the first BEC cluster (<xref ref-type="fig" rid="F2">Figure&#x20;2</xref>). Five demonstrations were shown for each domain, always ending with the demonstrations determined by&#x20;SCOT.</p>
<p>Both US1 and US2 had two additional within-subject variables: <italic>domain</italic> (delivery, two-goal, and skateboard, described in <xref ref-type="sec" rid="s5-1">Section 5.1</xref>) and <italic>test difficulty</italic> (low, medium, and high, determined by the BEC area of the test).</p>
<p>For both user studies, participants first completed a series of tutorials that introduced them to the mechanics of the domains they would encounter. In the tutorials, participants learned that the agent would be rewarded or penalized according to key events (i.e. reward features) specific to each domain. They were then asked to generate a few predetermined trajectories in a practice domain with a live reward counter to familiarize themselves with the keyboard controls and a practice reward function. Finally, participants entered the main user study and completed a single trial in each of the delivery, two-goal, and skateboard domains. Each trial involved a teaching portion and a test portion. In the teaching portion, participants watched videos of optimal trajectories that maximized reward in that domain, then answered subjective questions about the demonstrations (M2-M4, see <xref ref-type="sec" rid="s5-3">Section 5.3</xref>). In the subsequent test portion, participants were given six new test environments and asked to provide the optimal trajectory. The tests always included two low, two medium, and two high difficulty environments shown in random order. For each of the tests, participants also provided their confidence in their response (M5). The teaching videos for each condition were pulled from a filtered pool of 3 exemplary sets of demonstrations proposed by <xref ref-type="other" rid="alg1">Algorithm 1</xref> to control for bias in the results. The tests were likewise pulled from a filtered pool of 3 exemplary sets of demonstrations for each of the low, medium, and high difficulty test conditions.</p>
<p>Finally, though the methods described in this paper are designed for a human with no prior knowledge regarding any of the weights, the agent in our user studies assumed that the human was aware of the step cost and only needed to learn the relationship between the remaining two weights in each domain. This simplified the problem at the expense of a less accurate human model and measure of a demonstration&#x2019;s informativeness via BEC area. However, the effect was likely mitigated in part by the clustering and sampling in <xref ref-type="other" rid="alg1">Algorithm 1</xref>, which only makes use of coarse BEC&#x20;areas.</p>
</sec>
<sec id="s5-3">
<title>5.3 Measures</title>
<p>The following objective and subjective measures were recorded to evaluate the aforementioned hypotheses.</p>
<p>
<bold>M1</bold>. <bold>Optimal response:</bold> For each test, whether the participant&#x2019;s trajectory received the optimal reward or not was recorded.</p>
<p>
<bold>M2</bold>. <bold>Informativeness rating:</bold> 5-point Likert scale with prompt &#x201c;How informative were these demonstrations in understanding how to score well in this game?&#x201d;</p>
<p>
<bold>M3</bold>. <bold>Mental effort rating:</bold> 5-point Likert scale with prompt &#x201c;How much mental effort was required to process these demonstrations?&#x201d;</p>
<p>
<bold>M4</bold>. <bold>Puzzlement rating:</bold> 5-point Likert scale with prompt &#x201c;How puzzled were you by these demonstrations?&#x201d;</p>
<p>
<bold>M5</bold>. <bold>Confidence rating:</bold> 5-point Likert scale with prompt &#x201c;How confident are you that you obtained the optimal score?&#x201d;</p>
</sec>
</sec>
<sec sec-type="results" id="s6">
<title>6 Results</title>
<p>One hundred and sixty-two participants were recruited using Prolific (<xref ref-type="bibr" rid="B32">Palan and Schitter, 2018</xref>) for the two user studies. Participants&#x2019; ages ranged from 18 to 57 (M &#x3d; 26.07, SD &#x3d; 8.35). Participants self-reported gender (roughly 67% male, 30% female, 2% non-binary, and 1% preferred to not disclose). Each of the nine possible between-subjects conditions across the two user studies was randomly assigned 18 participants (such that US1 and US2 contained 54 and 108 participants respectively), and the order of the domains presented to each participant was counterbalanced.</p>
<p>The three domains were designed to vary in the difficulty of their respective optimal trajectories. We calculated an intraclass coefficient (ICC) based on a mean-rating (k &#x3d; 3), consistency-based, 2-way mixed effects model (<xref ref-type="bibr" rid="B23">Koo and Li, 2016</xref>) to evaluate the consistency of each participant&#x2019;s performance across domains. A low ICC value of 0.37 (<inline-formula id="inf64">
<mml:math id="m68">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) indicated that performance in fact varied considerably across domains for each participant. We subsequently average each participant&#x2019;s scores across the domains in all following analyses, potentially yielding results that are representative of domains with a range of difficulties.</p>
<p>
<bold>H1:</bold> We combine the test responses from both user studies as they shared the same pool of tests. A one-way repeated measures ANOVA revealed a statistically significant difference in the percentage of optimal responses (M1) across test difficulty (<inline-formula id="inf65">
<mml:math id="m69">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2,322</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>275.35</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). Post-hoc pairwise Tukey analyses further revealed significant differences between each of the three groups, with the percentage of optimal responses dropping from low (<inline-formula id="inf66">
<mml:math id="m70">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.89</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), to medium (<inline-formula id="inf67">
<mml:math id="m71">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.68</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), to high (<inline-formula id="inf68">
<mml:math id="m72">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.36</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) test difficulties (<inline-formula id="inf69">
<mml:math id="m73">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> in all cases).</p>
<p>Spearman&#x2019;s rank-order correlation further showed a significant inverse correlation between test difficulty and confidence (M5, <inline-formula id="inf70">
<mml:math id="m74">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>.40</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>486</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). See <xref ref-type="fig" rid="F4">Figure&#x20;4</xref> for the raw confidence&#x20;data.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Participants were significantly more confident of their responses as test difficulty decreased.</p>
</caption>
<graphic xlink:href="frobt-08-693050-g004.tif"/>
</fig>
<p>
<italic>Objective and subjective results both support H1, that BEC area can indeed be used as a measure of difficulty for testing.</italic> We thus proceed with the rest of the analyses with &#x201c;test difficulty&#x201d; as a validated independent variable.</p>
<p>
<bold>H2:</bold> A two-way mixed ANOVA on percentage of optimal responses (M1) did not reveal a significant effect of information class of the teaching set (<inline-formula id="inf71">
<mml:math id="m75">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2,51</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.23</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.30</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), though test difficulty had a significant effect consistent with the H1 analysis (<inline-formula id="inf72">
<mml:math id="m76">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2,102</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>118.58</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). There was no interaction between information class and test difficulty (<inline-formula id="inf73">
<mml:math id="m77">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>4,102</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.67</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.61</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>).</p>
<p>Spearman&#x2019;s correlation test found a significant negative correlation only between information class and perceived informativeness (M2, <inline-formula id="inf74">
<mml:math id="m78">
<mml:mrow>
<mml:msub>
<mml:mi>r</mml:mi>
<mml:mi>s</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>0.28</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.04</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>54</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). Neither mental effort (M3, <inline-formula id="inf75">
<mml:math id="m79">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.08</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) nor puzzlement (M4, <inline-formula id="inf76">
<mml:math id="m80">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.36</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) was found to have a significant correlation with information class. See <xref ref-type="fig" rid="F5">Figure&#x20;5</xref> for the raw subjective ratings.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>The information class of demonstrations only significantly influences their perceived informativeness, ironically decreasing from low to maximum information class. This suggests that a demonstration&#x2019;s intrinsic information content (as measured by its BEC area) does not always correlate with the information transferred to human learners. No significant effects were found between information class and mental effort or puzzlement.</p>
</caption>
<graphic xlink:href="frobt-08-693050-g005.tif"/>
</fig>
<p>
<italic>The data failed to support H2.</italic> The data suggests that IRL alone is indeed an imperfect model of human learning, motivating the use of human teaching techniques to better accommodate human learners.</p>
<p>There was no correlation between information class and test performance, likely a result of two factors. First, the number of demonstrations provided (two or three) across the conditions in US1 were likely too few for human learners, who are not pure IRL learners and can sometimes benefit from &#x201c;redundant&#x201d; examples that reinforce a concept. Second, as will be discussed under the scaffolding subsection in <xref ref-type="sec" rid="s7">Section 7.2</xref>, BEC area is likely an insufficient model of a demonstration&#x2019;s informativeness to a human and warrants further iteration.</p>
<p>Accordingly, maximum information demonstrations provided by SCOT (<inline-formula id="inf77">
<mml:math id="m81">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.61</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) failed to significantly improve the percentage of optimal responses compared to medium (<inline-formula id="inf78">
<mml:math id="m82">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.65</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and low (<inline-formula id="inf79">
<mml:math id="m83">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.67</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) information demonstrations, as IRL would have predicted. The subjective results further indicate that people ironically found the maximally informative demonstrations least informative. We hypothesize that participants struggled to digest the information contained within SCOT&#x2019;s demonstrations all at once, motivating the use of scaffolding to stage learning into manageable segments.</p>
<p>
<bold>H3:</bold> A two-way mixed ANOVA on percentage of optimal responses (M1) revealed a significant interaction effect between scaffolding and test difficulty (<inline-formula id="inf80">
<mml:math id="m84">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>4,210</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>2.79</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.03</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). Tukey analyses showed that no scaffolding (<inline-formula id="inf81">
<mml:math id="m85">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.46</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) yielded significantly better test performance than forward scaffolding (<inline-formula id="inf82">
<mml:math id="m86">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.34</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) for high difficulty tests (<inline-formula id="inf83">
<mml:math id="m87">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.05</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). Though not statistically significant, a trend of forward and backward scaffolding outperforming no scaffolding on low (<inline-formula id="inf84">
<mml:math id="m88">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.89</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>0.89</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>0.85</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> respectively) and medium difficulty tests (<inline-formula id="inf85">
<mml:math id="m89">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.69</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>0.69</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn>0.62</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> respectively) can be observed as well (see <xref ref-type="fig" rid="F6">Figure&#x20;6</xref>).</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Though the three scaffolding conditions perform similarly in aggregate across all tests, &#x201c;no scaffolding&#x201d; significantly increases performance for high difficulty&#x20;tests.</p>
</caption>
<graphic xlink:href="frobt-08-693050-g006.tif"/>
</fig>
<p>A two-way mixed ANOVA surprisingly did not reveal a significant effect from scaffolding (<inline-formula id="inf86">
<mml:math id="m90">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2,105</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.02</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.98</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) but did find a significant effect for test difficulty (<inline-formula id="inf87">
<mml:math id="m91">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2,210</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>167.63</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) on percentage of optimal responses (M1) as expected.</p>
<p>A Kruskal&#x2013;Wallis test did not find differences between the informativeness (<inline-formula id="inf88">
<mml:math id="m92">
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>5.18</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.07</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), mental effort (<inline-formula id="inf89">
<mml:math id="m93">
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1.16</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.56</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>), or puzzlement (<inline-formula id="inf90">
<mml:math id="m94">
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.59</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.74</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) ratings (M2&#x2013;M4) of differently scaffolded teaching&#x20;sets.</p>
<p>
<italic>The data largely failed to support H3.</italic> Forward and backward scaffolding surprisingly led to nearly identical test performance. Though no scaffolding performed similarly overall, it yielded a significant increase in performance specifically for high difficulty tests. These two surprising results are addressed in the discussion. The subjective measures did not indicate any clear relationships.</p>
<p>
<bold>H4</bold>: A two-way mixed ANOVA on percentage of optimal responses (M1) revealed significant effects of test difficulty (<inline-formula id="inf91">
<mml:math id="m95">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2,212</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>169.21</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and an interaction effect between optimized visuals and test difficulty (<inline-formula id="inf92">
<mml:math id="m96">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2,212</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>5.61</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.004</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>). Exploring the interaction effect with Tukey analyses revealed that visual optimization had no effect on test performance on low (<inline-formula id="inf93">
<mml:math id="m97">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.24</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and medium (<inline-formula id="inf94">
<mml:math id="m98">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.90</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) difficulty tests, but led to a significant improvement in performance in high (<inline-formula id="inf95">
<mml:math id="m99">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>.001</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) difficulty tests for positive visual optimization (<inline-formula id="inf96">
<mml:math id="m100">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.45</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) over negative (<inline-formula id="inf97">
<mml:math id="m101">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.31</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>); see <xref ref-type="fig" rid="F7">Figure&#x20;7</xref>. The two-way mixed ANOVA did not reveal a significant effect of optimized visuals alone (<inline-formula id="inf98">
<mml:math id="m102">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1,106</mml:mn>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>2.27</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.13</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>).</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>
<bold>(A)</bold> Optimizing teaching demonstration visuals does not significantly affect performance on low and medium difficulty tests, but leads to a significant improvement on high difficulty tests. <bold>(B)</bold> Ratings on mental effort and puzzlement surprisingly increased for positive visual optimization, likely an artifact of unforeseen study design effects. No significant effects were found for ratings on informativeness.
</caption>
<graphic xlink:href="frobt-08-693050-g007.tif"/>
</fig>
<p>A Mann&#x2013;Whitney <italic>U</italic> test surprisingly found that ratings for mental effort (<inline-formula id="inf99">
<mml:math id="m103">
<mml:mrow>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>54</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>54</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1131.5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.03</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) and puzzlement (<inline-formula id="inf100">
<mml:math id="m104">
<mml:mrow>
<mml:mi>U</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>g</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>54</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>54</mml:mn>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1082.5</mml:mn>
<mml:mo>,</mml:mo>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.02</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>) (M3 and M4) increased for positive visual optimization. Informativeness ratings were not found to differ significantly between the two visual optimizations (<inline-formula id="inf101">
<mml:math id="m105">
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>.11</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>).</p>
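For reference, the Mann&#x2013;Whitney <italic>U</italic> statistic reported above can be computed directly from pairwise comparisons between the two samples. The sketch below uses hypothetical Likert ratings, not the study's data; note that reporting conventions differ (some report <italic>U</italic> for the first sample, others the smaller of the two <italic>U</italic> values).

```python
def mann_whitney_u(x, y):
    """U via pairwise wins: each (a, b) pair adds 1 if a > b, 0.5 if tied."""
    u1 = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    u2 = len(x) * len(y) - u1
    return min(u1, u2)  # the smaller of the two U values (a common convention)

# Hypothetical 5-point Likert mental-effort ratings per condition:
neg_opt = [3, 4, 4, 5, 3]  # negative visual optimization
pos_opt = [4, 5, 5, 5, 4]  # positive visual optimization
print(mann_whitney_u(neg_opt, pos_opt))
```

A library routine such as `scipy.stats.mannwhitneyu` additionally computes the p-value (with tie and continuity corrections); the sketch shows only the statistic.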
<p>
<italic>The data partially supports H4.</italic> Optimizing visuals improved test performance for high difficulty tests. However, optimizing visuals also yielded counterintuitive results for the subjective measures on mental effort and puzzlement, which we address in the following section.</p>
</sec>
<sec sec-type="discussion" id="s7">
<title>7 Discussion</title>
<sec>
<title>7.1 Learning Styles</title>
<p>Analyzing the free-form comments provided by participants throughout the user studies revealed unexpected insights about their learning styles. Though this paper assumed that participant learning would only resemble IRL, we discovered it sometimes resembled imitation learning<xref ref-type="fn" rid="fn12">
<sup>12</sup>
</xref>, which models humans as learning the optimal behavior directly from demonstrations (as opposed to through an intermediate reward function like IRL) (<xref ref-type="bibr" rid="B13">Daw et&#x20;al., 2005</xref>; <xref ref-type="bibr" rid="B24">Lage et&#x20;al., 2019</xref>). For example, one participant expounded upon their mental effort Likert rating (M3) with the following description of IRL-style learning: &#x201c;You need to make a moderate amount of mental effort to understand all the rules and outweight [sic] everything and see what is worth it or not in the game.&#x201d; In contrast, another expounded upon their mental effort rating with the following description of IL-style learning: &#x201c;The primary &#x2018;mental effort&#x2019; was in memorizing the patterns of each level/stage and matching the optimal movements for them.&#x201d;</p>
<p>To better understand the types of learning employed by our participants, we analyzed their optional responses to the following questions: &#x201c;Feel free to explain any of your selections above if you wish:&#x201d; (asked in conjunction with prompts for ratings of informativeness, mental effort, and puzzlement of demonstrations in each domain, i.e. up to three times) and &#x201c;Do you have any comments or feedback on the study?&#x201d; (asked after the completion of the full study, i.e. once). Similar to <xref ref-type="bibr" rid="B24">Lage et&#x20;al. (2019)</xref>, we coded relevant responses from participants regarding their thought process as resembling IRL (e.g. &#x201c;So, the yellow squares should be avoided if possible and they possibly remove two points when crossed but I&#x2019;m not sure&#x201d;) or as resembling IL (e.g. &#x201c;I did not understand the rule regarding yellow tiles. It seems they should be avoided, but not always. Interesting&#x2026;&#x201d;), or as &#x201c;unclear&#x201d; (e.g. &#x201c;After some examples I feel like I&#x2019;m understanding way better these puzzles.&#x201d;). A second coder uninvolved in the study independently labeled the same set of responses, assigning the same label to <inline-formula id="inf102">
<mml:math id="m106">
<mml:mrow>
<mml:mn>79</mml:mn>
<mml:mtext>%</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> of the responses. A Cohen&#x2019;s kappa of 0.64 between the two sets of codings further indicates moderate to substantial agreement (<xref ref-type="bibr" rid="B25">Landis and Koch, 1977</xref>; <xref ref-type="bibr" rid="B3">Altman, 1990</xref>; <xref ref-type="bibr" rid="B29">McHugh, 2012</xref>). Please refer to the <xref ref-type="sec" rid="s14">Supplementary Material</xref> for the responses, labels, and further details on the coding process.</p>
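Cohen's kappa corrects the raw agreement rate for the agreement expected by chance given each coder's label frequencies. A minimal sketch with hypothetical codings (not the study's actual responses or labels):

```python
from collections import Counter

def cohens_kappa(a, b):
    """kappa = (p_o - p_e) / (1 - p_e): chance-corrected inter-rater agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected chance agreement from each coder's marginal label frequencies
    p_e = sum(ca[lab] * cb[lab] for lab in ca) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical learning-style codings by two independent coders:
coder1 = ["IRL", "IRL", "IL", "unclear", "IRL", "IL"]
coder2 = ["IRL", "IRL", "IL", "IRL", "IRL", "unclear"]
print(round(cohens_kappa(coder1, coder2), 2))
```

`sklearn.metrics.cohen_kappa_score` implements the same quantity for production use.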
<p>As <xref ref-type="table" rid="T1">Table&#x20;1</xref> conveys, both coders agreed that more responses resembled IRL than IL and &#x201c;unclear&#x201d; combined, suggesting that people perhaps employed IRL more often than not. However, we note that the way the tutorials introduced the domains may have influenced this result. For example, explicitly conveying each domain&#x2019;s unique reward features and clarifying that a trajectory&#x2019;s reward is determined by a weighting over those features may have encouraged participants to first infer the reward weights from optimal demonstrations (e.g. through IRL) and then infer the optimal policy (as opposed to directly inferring the optimal policy e.g. through&#x20;IL).</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Coding of qualitative participant responses as resembling inverse reinforcement learning (IRL) or imitation learning (IL), or &#x201c;unclear.&#x201d;</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="center">Learning style</th>
<th colspan="2" align="center">Raw counts (across user studies)</th>
<th colspan="2" align="center">Percentages (across coders)</th>
</tr>
<tr>
<th align="center">Coder 1</th>
<th align="center">Coder 2</th>
<th align="center">User study 1 (%)</th>
<th align="center">User study 2 (%)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">IRL</td>
<td align="char" char=".">25</td>
<td align="char" char=".">27</td>
<td align="char" char=".">32</td>
<td align="char" char=".">68</td>
</tr>
<tr>
<td align="left">IL</td>
<td align="char" char=".">7</td>
<td align="char" char=".">9</td>
<td align="char" char=".">27</td>
<td align="char" char=".">12</td>
</tr>
<tr>
<td align="left">Unclear</td>
<td align="char" char=".">15</td>
<td align="char" char=".">11</td>
<td align="char" char=".">41</td>
<td align="char" char=".">20</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Examining the percentage of each response across the two user studies reveals another interesting trend. Responses were far more likely to be coded as IRL in US2, where participants saw five demonstrations, than in US1, where participants saw only two or three. This echoes the observation of <xref ref-type="bibr" rid="B24">Lage et&#x20;al. (2019)</xref> that people may be more inclined to use IL over IRL in less familiar situations, an effect that may be moderated in future studies through more extensive pre-study practice and/or additional informative demonstrations that better familiarize participants with the domains.</p>
<p>Finally, out of 15 participants who provided more than one response, coders agreed that eight appeared to employ the same learning style throughout the user study (e.g. participants 129 and 142 in US2 only provided responses resembling IRL), four appeared to have changed styles through the user study (e.g. participants 59 in US1 and 20 in US2 provided various responses that resembled IL, IRL, or were unclear), and three were ambiguous (i.e. one coder coded a consistent learning style while the other did not). Though we controlled for learning effects by counterbalancing the order of the domains, participants likely found the domains to vary in the difficulty of their respective optimal trajectories (as suggested by the ICC score). Furthermore, certain conditions led to significant differences in subjective and objective outcomes (e.g. maximum information demonstrations were ironically perceived to be least informative (H2) and positive visual optimization improved performance for high difficulty tests (H4)). We thus hypothesize that the varying difficulties in domains and conditions non-trivially influenced the learning styles at different times [e.g. by moderating familiarity (<xref ref-type="bibr" rid="B24">Lage et&#x20;al., 2019</xref>)].</p>
<p>
<italic>Future work:</italic> The multi-faceted nature of human learning can be described by a number of models such as IRL and IL. <xref ref-type="bibr" rid="B24">Lage et&#x20;al. (2019)</xref> show post hoc that tailoring the teaching to the human&#x2019;s favored learning style can improve the learning outcome. Thus, predicting a human&#x2019;s current learning style a priori or in situ (e.g. by using features such as the human&#x2019;s familiarity of the task or domain) and matching the teaching appropriately in real time will be an important direction of future&#x20;work.</p>
</sec>
<sec>
<title>7.2 Scaffolding</title>
<p>Though BEC area is a well-motivated preliminary model of a demonstration&#x2019;s informativeness to a human, backward scaffolding&#x2019;s unexpected on-par performance with forward scaffolding suggests that it is insufficient and our scaffolding order likely was not clear cut in either direction. In considering possible explanations, we note that <xref ref-type="disp-formula" rid="e4">Eq. 4</xref> presents a computationally elegant method of generating BEC constraints via sub-optimal, one-step deviations from the optimal trajectory. However, these suboptimal trajectories do not always correspond to the suboptimal trajectories in the human&#x2019;s mind (e.g. which may allow more than one-step deviations). This sometimes leads to a disconnect between a demonstration&#x2019;s informativeness as measured by BEC area and its informativeness from the point of view of the&#x20;human.</p>
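To make the preceding discussion concrete, the following is a hedged sketch of the general BEC construction referenced above: with a reward linear in features, <italic>R</italic> = <italic>w</italic>&#xb7;<italic>&#x3bc;</italic>, each suboptimal deviation from an optimal demonstration induces a halfspace constraint <italic>w</italic>&#xb7;(<italic>&#x3bc;</italic>(opt) &#x2212; <italic>&#x3bc;</italic>(dev)) &#x2265; 0 on the reward weights. The feature counts and weights below are illustrative assumptions, and the exact form of Eq. 4 is not reproduced here.

```python
def bec_constraints(mu_opt, mu_devs):
    """One halfspace constraint per suboptimal deviation: w . c >= 0,
    where c = mu(optimal trajectory) - mu(deviating trajectory)."""
    return [tuple(o - d for o, d in zip(mu_opt, dev)) for dev in mu_devs]

def in_bec(w, constraints, tol=1e-9):
    """True if weight vector w is consistent with every constraint,
    i.e. the optimal demonstration really is optimal under w."""
    return all(sum(wi * ci for wi, ci in zip(w, c)) >= -tol for c in constraints)

# Hypothetical 2-feature delivery domain: features = (mud patches crossed, steps).
mu_opt = (1.0, 5.0)                 # optimal: cross 1 mud patch in 5 steps
mu_devs = [(0.0, 9.0), (3.0, 4.0)]  # deviations: mud-free detour, or muddy shortcut
cs = bec_constraints(mu_opt, mu_devs)

print(in_bec((-1.0, -0.5), cs))  # moderately mud-averse weights: consistent
print(in_bec((-3.0, -0.5), cs))  # heavily mud-averse: would prefer the detour
```

The sketch also illustrates the limitation noted above: the constraint set, and hence the BEC area, depends entirely on which deviations are enumerated, and one-step deviations need not match the suboptimal alternatives a human imagines.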
<p>Furthermore, forward and backward scaffolding (each comprised of low, medium, and high information demonstrations) yielded higher performance for low and medium difficulty tests, and no scaffolding (comprised of only high information demonstrations) yielded significantly higher performance for high difficulty tests. Improved performance when matching the informativeness and difficulty of teaching and testing demonstrations respectively (which yields similar demonstrations) further suggests that IL-style learning may have also been at&#x20;play.</p>
<p>Finally, participants across each condition never achieved a mean score greater than 0.5 for high difficulty tests, indicating that they were largely unable to grasp the more subtle aspects of the agent&#x2019;s optimal behavior. While the five demonstrations shown in US2 should have conveyed the maximum possible information (in an IRL sense), they were not as effective in reality. One reason may be that human cognition is constrained by limited time and computation (<xref ref-type="bibr" rid="B15">Griffiths, 2020</xref>), and at times may opt for approximate, rather than exact, inference (<xref ref-type="bibr" rid="B38">Vul et&#x20;al., 2014</xref>; <xref ref-type="bibr" rid="B19">Huang et&#x20;al., 2019</xref>). Approximate inference (and even IL-style learning) indeed would have struggled with high difficulty tests whose optimal behavior could often only be discerned through exact computation of rewards. In addition to potentially showing more demonstrations (including &#x201c;redundant&#x201d; demonstrations that reinforce concepts and are still useful for approximate IRL), we believe that more effective scaffolding that further simplifies the concepts being taught while simultaneously challenging the human&#x2019;s current knowledge will be key to addressing this gap, as we discuss&#x20;next.</p>
<p>
<italic>Future work:</italic> We propose two directions for future work on scaffolding. First, we note that our selected demonstrations often revealed information about multiple reward weights at once, which could be difficult to process. Instead, we can further scaffold by teaching about one weight at a time, when possible. Second, <xref ref-type="bibr" rid="B34">Reiser (2004)</xref> suggests that scaffolding should not only provide structure that reduces problem complexity but at times induce cognitive conflict to challenge and engage the learner. The current method of scaffolded teaching assumes that the learner has no prior knowledge when calculating a demonstration&#x2019;s informativeness (e.g. <xref ref-type="other" rid="alg1">Algorithm 1</xref> considers a repeat showing of a demonstration to a learner to be as informative as the first showing). But when filtering for teaching and testing sets for the user studies, we sometimes observed and accounted for the fact that demonstrations with the same BEC area could further vary in informativeness or difficulty to different learners based on whether they presented an expected behavior or not. We believe that providing demonstrations that incrementally deviate from the human&#x2019;s current model will be more informative to a human and would be better suited to scaffolding.</p>
</sec>
<sec>
<title>7.3 Simplicity and Pattern Discovery</title>
<p>Optimizing visuals improved test performance, but only for high difficulty tests. This suggests that simplicity and pattern discovery could produce a meaningful reduction in complexity for only high information demonstrations (which contain the insights necessary to do well on the high difficulty tests), while those of low and medium information were already comprehensible.</p>
<p>We found counterintuitive results on the mental effort and puzzlement ratings (M3&#x2013;M4) for H4, where ratings for both increased from negative to positive visual optimizations. One factor may have been the open-ended phrasing of the corresponding Likert prompts, which did not always elicit the intended measure. For example, one participant expounded upon their mental effort rating by saying &#x201c;it takes a bit of efford [sic] remembering that you can quit at any time,&#x201d; referencing the difficulty of remembering all available actions rather than the intended difficulty of performing inference over the optimal behavior.</p>
<p>Similarly, the open-ended prompt for puzzlement failed to always query specifically for potential puzzlement arising from (a potentially counterintuitive) ordering of the demonstrations. Instead it sometimes invited comments such as &#x2018;I think i [sic] saw the same distance to the objective 2 times and 2 differnt [sic] outcomes,&#x2019; and interestingly informed us of possible unforeseen confounders on puzzlement such as limited memory. As participants were not allowed to rewatch previous demonstrations to enforce scaffolding order, similar demonstrations (in correspondingly similar environments) were sometimes mistaken to have shown different behaviors in the same environment.</p>
<p>
<italic>Future work:</italic> Future iterations would benefit from &#x201c;marking critical features&#x201d; that &#x201c;accentuates certain features of the task that are relevant&#x201d;, as suggested by <xref ref-type="bibr" rid="B41">Wood et&#x20;al. (1976)</xref>. For example, imagine showing two side-by-side demonstrations in the delivery domain, one where the robot exits because of the many mud patches in its path and one where the robot completes the delivery because of one fewer mud patch in its path. Outlining the presence and absence of the critical mud patch with a salient border in the two demonstrations respectively would help highlight the relevant cause for the change in robot behavior to the learner.</p>
</sec>
<sec>
<title>7.4 Testing</title>
<p>Objective and subjective results strongly support BEC area as a measure of test difficulty for human learners. Following studies may thus use tests of varying BEC areas and difficulties to evaluate and track the learner&#x2019;s understanding throughout the learning process.</p>
<p>
<italic>Future work:</italic> Effective scaffolding is contingent on maintaining an accurate model of the learner&#x2019;s current abilities. Though this work assumed disjoint teaching and testing phases, learning is far more dynamic in reality. Future work should therefore explore how to select an initial set of tests that can accurately discern the learner&#x2019;s current knowledge, and how to decide when to switch between teaching and testing throughout the learning process.</p>
</sec>
<sec>
<title>7.5 Real-world Applicability</title>
<p>Though the proposed method of machine teaching is theoretically general, there are additional considerations that must be addressed for real-world applicability.</p>
<p>First, a robot&#x2019;s policy may be a function of many parameters. Though performing IRL in a high-dimensional space may sometimes be warranted, humans naturally exhibit a bias toward simpler explanations with fewer causes (<xref ref-type="bibr" rid="B27">Lombrozo, 2016</xref>) and can only effectively reason about a few variables at once (e.g. <xref ref-type="bibr" rid="B17">Halford et&#x20;al. (2005)</xref> suggest the&#x20;limit to be around four). Thus, future work may examine approximating a high-dimensional policy with a low-dimensional policy that can be conveyed instead with minimal loss. Additionally, scaffolding methods that explicitly convey only a subset of the reward weights at a time should be developed as previously&#x20;noted.</p>
<p>Second, a robot&#x2019;s entire trajectory will not always be necessary or reasonable to convey if it is lengthy. Thus, techniques that extract and convey only the informative segments along with sufficient context will be important. For segments that are infeasible to convey in the real world (e.g. due to necessary preconditions not being met), demonstrations may be given in simulation instead.</p>
</sec>
</sec>
<sec sec-type="conclusion" id="s8">
<title>8 Conclusion</title>
<p>As robots continue to gain useful skills, their ability to teach those skills to humans will benefit people looking to acquire them and will also facilitate fluent collaboration. In this work, we thus explored how a robot may teach by providing demonstrations of its skill that are tailored for human learning.</p>
<p>We augmented the common model of humans as inverse reinforcement learners with insights from learning theory and cognitive science to better accommodate human learning. Scaffolding provided demonstrations that increase in informativeness and difficulty, aiming to ease the learner into the skill being taught. Furthermore, simple demonstrations that conveyed a discernible pattern were favored to minimize potentially misleading distractions and instead highlight critical features. Finally, a measure for quantifying the difficulty of tests was proposed toward effective evaluation of learning progress.</p>
<p>In our user studies, our measure of test difficulty correlated strongly with human performance and confidence. Favoring simplicity and pattern discovery when selecting teaching demonstrations also led to a significant increase in performance on high difficulty tests. However, scaffolding failed to produce a significant effect on test performance, informing both the shortcomings of the current implementation and the ways it can be improved in future iterations. Finally, though this work assumed disjoint teaching and testing phases with a static human model, effective scaffolding requires that the teacher query, maintain, and leverage a dynamic model of the student to tailor the learning appropriately. We leave this as an exciting direction for future&#x20;work.</p>
</sec>
</body>
<back>
<sec id="s9">
<title>Data Availability Statement</title>
<p>The code for the human teaching techniques can be found in the following repository: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/SUCCESS-MURI/machine-teaching-human-IRL">https://github.com/SUCCESS-MURI/machine-teaching-human-IRL</ext-link>. The code for generating the user study (including videos of the teaching and testing demonstrations) and the data corresponding to our results can be found in the following repository: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/SUCCESS-MURI/psiturk-machine-teaching">https://github.com/SUCCESS-MURI/psiturk-machine-teaching</ext-link>.</p>
</sec>
<sec id="s10">
<title>Ethics Statement</title>
<p>The studies involving human participants were reviewed and approved by the Institutional Review Board of Carnegie Mellon University. The participants provided their written informed consent to participate in this&#x20;study.</p>
</sec>
<sec id="s11">
<title>Author Contributions</title>
<p>All authors equally contributed to the ideas presented in this paper, i.e. the techniques for human teaching and user studies design. ML implemented the techniques and user studies, ran the user studies, and analyzed the data. The manuscript was prepared, revised, and approved by all authors.</p>
</sec>
<sec id="s12">
<title>Funding</title>
<p>This work was supported by the Office of Naval Research award N00014-18-1-2503 and Defense Advanced Research Projects Agency (DARPA)/Army Research Office (ARO) award W911NF-20-1-0006. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DARPA, ARO, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for government purposes notwithstanding any copyright notation herein.</p>
</sec>
<sec sec-type="COI-statement" id="s13">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ack>
<p>We would like to thank Vignesh Rajmohan and Meghna Behari for their assistance in creating the user study, and Pallavi Koppol for serving as an independent coder and for sharing her user study and data analysis templates.</p>
</ack>
<sec id="s14">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frobt.2021.693050/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frobt.2021.693050/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="Table1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>
<xref ref-type="bibr" rid="B30">Ng and Russell (2000)</xref> suggest that &#x201c;the reward function, rather than the policy, is the most succinct, robust, and transferable definition of the task.&#x201d;</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>Though we assume a deterministic MDP, the methods described here naturally generalize to MDPs with stochastic transition functions and policies.</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>This assumption can be made without loss of generality as the reward features can be nonlinear with respect to states and actions and be arbitrarily complex.</p>
</fn>
<fn id="fn4">
<label>4</label>
<p>In principle, a robot could simply convey <inline-formula id="inf103">
<mml:math id="m107">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> explicitly to a human. However, it can be nontrivial for humans to map precise numerical reward weights to the corresponding optimal behavior through planning, especially if there is a large number of reward features. Thus, providing demonstrations that inherently carry information regarding <inline-formula id="inf104">
<mml:math id="m108">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> and directly conveying the optimal behavior can be a more effective teaching method for human learners.</p>
</fn>
<fn id="fn5">
<label>5</label>
<p>In practice, we also require that <inline-formula id="inf105">
<mml:math id="m109">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> to circumvent the scaling invariance of IRL solutions and to eliminate the degenerate all-zero reward function (<xref ref-type="bibr" rid="B9">Brown and Niekum, 2018</xref>). We convey the non-normalized <inline-formula id="inf106">
<mml:math id="m110">
<mml:mi mathvariant="normal">w</mml:mi>
</mml:math>
</inline-formula> here for intuition.</p>
</fn>
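The L1 normalization described in this footnote can be sketched as follows. This is a minimal illustration, not taken from the released implementation; the function name normalize_weights is our own.

```python
def normalize_weights(w):
    """Scale reward weights to unit L1 norm, circumventing the scaling
    invariance of IRL solutions and rejecting the degenerate all-zero
    reward function."""
    norm = sum(abs(x) for x in w)
    if norm == 0:
        raise ValueError("the all-zero reward function is degenerate")
    return [x / norm for x in w]

# e.g. normalize_weights([2, -1, 1]) yields [0.5, -0.25, 0.25]
```

Any positive scaling of a reward function induces the same optimal policy, which is why the constraint is needed to make the conveyed weights unique.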
<fn id="fn6">
<label>6</label>
<p>Instead of greedily adding the first demonstration that covers the most remaining constraints of BEC(<inline-formula id="inf107">
<mml:math id="m111">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>) at each iteration, one can enumerate all possible combinations of demonstrations that cover BEC(<inline-formula id="inf108">
<mml:math id="m112">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>) and optimize for simplicity and pattern discovery here as&#x20;well.</p>
</fn>
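The greedy strategy contrasted in this footnote amounts to a standard greedy set cover over BEC constraints. The sketch below is a hypothetical illustration: demo_constraints maps each candidate demonstration to the set of constraint identifiers it covers, and target is the constraint set of BEC(&#x3c0;&#x2a;) to be covered.

```python
def greedy_demo_selection(demo_constraints, target):
    """At each iteration, add the demonstration covering the most
    remaining target constraints, until the target is covered or no
    candidate makes further progress."""
    remaining = set(target)
    chosen = []
    while remaining:
        best = max(demo_constraints,
                   key=lambda d: len(demo_constraints[d] & remaining))
        gained = demo_constraints[best] & remaining
        if not gained:
            break  # no candidate covers any remaining constraint
        chosen.append(best)
        remaining -= gained
    return chosen, remaining
```

Exhaustively enumerating all combinations of demonstrations instead permits optimizing secondary criteria such as simplicity and pattern discovery, at exponential cost in the number of candidates.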
<fn id="fn7">
<label>7</label>
<p>We measure the visual similarity of two states by defining a hash function over a domain&#x2019;s state space and calculating the edit distance between the two corresponding state hashes.</p>
</fn>
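The edit distance between state hashes mentioned in this footnote can be computed with the standard Levenshtein dynamic program, shown below as a minimal sketch; the domain-specific hash function itself is omitted.

```python
def edit_distance(a, b):
    """Levenshtein distance between two state-hash strings, computed
    with a single-row dynamic program over insertions, deletions, and
    substitutions."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[-1]

# e.g. edit_distance("kitten", "sitting") is 3
```

A small edit distance between two hashes then indicates that the corresponding states look similar.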
<fn id="fn8">
<label>8</label>
<p>An implementation is available at <ext-link ext-link-type="uri" xlink:href="https://github.com/SUCCESS-MURI/machine-teaching-human-IRL">https://github.com/SUCCESS-MURI/machine-teaching-human-IRL</ext-link>.</p>
</fn>
<fn id="fn9">
<label>9</label>
<p>In theory, one could order SCOT and k-means demonstrations jointly by BEC area and potentially allow them to mix in order. However, a SCOT demonstration that contributes a maximally informative constraint of BEC(<inline-formula id="inf109">
<mml:math id="m113">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>) may in fact have a large BEC area. Thus, showing this SCOT demonstration early on may actually render a later k-means demonstration as uninformative (i.e. the SCOT demonstration&#x2019;s BEC(<inline-formula id="inf110">
<mml:math id="m114">
<mml:mrow>
<mml:mi>&#x3c0;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula>) constraint may cause a later k-means demonstration&#x2019;s constraints to be redundant). Instead, showing k-means demonstrations that iteratively decrease in BEC area, then showing SCOT demonstrations ensures that the learner receives non-redundant constraints on <inline-formula id="inf111">
<mml:math id="m115">
<mml:mrow>
<mml:mi mathvariant="normal">w</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:mrow>
</mml:math>
</inline-formula> at each&#x20;step.</p>
</fn>
<fn id="fn10">
<label>10</label>
<p>Code for the user studies, videos of teaching and testing demonstrations, and the collected data are available at <ext-link ext-link-type="uri" xlink:href="https://github.com/SUCCESS-MURI/psiturk-machine-teaching">https://github.com/SUCCESS-MURI/psiturk-machine-teaching</ext-link>.</p>
</fn>
<fn id="fn11">
<label>11</label>
<p>Note that <xref ref-type="other" rid="alg1">Algorithm 1</xref> already achieves diversity by scaffolding demonstrations across <italic>different</italic> BEC clusters and thus benefits instead from visual similarity.</p>
</fn>
<fn id="fn12">
<label>12</label>
<p>Note that the term &#x201c;behavior cloning&#x201d; is sometimes used instead to refer to the process of directly learning the optimal behavior. Accordingly, &#x201c;imitation learning&#x201d; is sometimes used to refer to the broad class of techniques that learn optimal behavior from demonstrations, encompassing both behavior cloning and IRL (<xref ref-type="bibr" rid="B31">Osa et&#x20;al., 2018</xref>).</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>A. Y.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Apprenticeship Learning via Inverse Reinforcement Learning</article-title>. In <conf-name>Proceedings of the twenty-first international conference on Machine learning</conf-name>. </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Abel</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Simple Rl: Reproducible Reinforcement Learning in python</article-title>. In <conf-name>ICLR Workshop on Reproducibility in Machine Learning</conf-name>. </citation>
</ref>
<ref id="B3">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Altman</surname>
<given-names>D. G.</given-names>
</name>
</person-group> (<year>1990</year>). <source>Practical Statistics for Medical Research</source> (<publisher-loc>Boca Raton, FL</publisher-loc>: <publisher-name>CRC Press</publisher-name>).</citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Amir</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Amir</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Highlights: Summarizing Agent Behavior to People</article-title>. In <conf-name>Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems</conf-name> (<publisher-loc>Richland, South Carolina</publisher-loc>: <publisher-name>International Foundation for Autonomous Agents and Multiagent Systems</publisher-name>), <fpage>1168</fpage>&#x2013;<lpage>1176</lpage>. </citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Amir</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Doshi-Velez</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Sarne</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Summarizing Agent Strategies</article-title>. <source>Auton. Agent Multi-agent Syst.</source> <volume>33</volume>, <fpage>628</fpage>&#x2013;<lpage>644</lpage>. <pub-id pub-id-type="doi">10.1007/s10458-019-09418-w</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Amitai</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Amir</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2021</year>). <source>&#x201c;I Don&#x2019;t Think So&#x201d;: Disagreement-Based Policy Summaries for Comparing Agents</source> <comment>arXiv preprint arXiv:2102.03064</comment>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baker</surname>
<given-names>C. L.</given-names>
</name>
<name>
<surname>Saxe</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Tenenbaum</surname>
<given-names>J.&#x20;B.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Action Understanding as Inverse Planning</article-title>. <source>Cognition</source> <volume>113</volume>, <fpage>329</fpage>&#x2013;<lpage>349</lpage>. <pub-id pub-id-type="doi">10.1016/j.cognition.2009.07.005</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baker</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Saxe</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Tenenbaum</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Bayesian Theory of Mind: Modeling Joint Belief-Desire Attribution</article-title>. In <conf-name>Proceedings of the Annual Meeting of the Cognitive Science Society</conf-name>. vol. <volume>33</volume>. </citation>
</ref>
<ref id="B9">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>D. S.</given-names>
</name>
<name>
<surname>Niekum</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). <source>Thirty-Second AAAI Conference on Artificial Intelligence</source>. <article-title>Efficient Probabilistic Performance Bounds for Inverse Reinforcement Learning</article-title>. </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brown</surname>
<given-names>D. S.</given-names>
</name>
<name>
<surname>Niekum</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Machine Teaching for Inverse Reinforcement Learning: Algorithms and Applications</article-title>. In <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence Aaai</conf-name> <volume>33</volume>, <fpage>7749</fpage>&#x2013;<lpage>7758</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v33i01.33017749</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cakmak</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lopes</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Algorithmic and Human Teaching of Sequential Decision Tasks</article-title>. In <conf-name>Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence</conf-name>. <fpage>1536</fpage>&#x2013;<lpage>1542</lpage>. </citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Collins</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Brown</surname>
<given-names>J.&#x20;S.</given-names>
</name>
<name>
<surname>Newman</surname>
<given-names>S. E.</given-names>
</name>
</person-group> (<year>1988</year>). <article-title>Cognitive Apprenticeship</article-title>. <source>Thinking: J.&#x20;Philos. Child.</source> <volume>8</volume>, <fpage>2</fpage>&#x2013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.5840/thinking19888129</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Daw</surname>
<given-names>N. D.</given-names>
</name>
<name>
<surname>Niv</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Dayan</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Uncertainty-based Competition between Prefrontal and Dorsolateral Striatal Systems for Behavioral Control</article-title>. <source>Nat. Neurosci.</source> <volume>8</volume>, <fpage>1704</fpage>&#x2013;<lpage>1711</lpage>. <pub-id pub-id-type="doi">10.1038/nn1560</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dietterich</surname>
<given-names>T. G.</given-names>
</name>
</person-group> (<year>1998</year>). <article-title>The Maxq Method for Hierarchical Reinforcement Learning</article-title>. In <conf-name>Proceedings of the Fifteenth International Conference on Machine Learning</conf-name>. <fpage>118</fpage>&#x2013;<lpage>126</lpage>. </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Griffiths</surname>
<given-names>T. L.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Understanding Human Intelligence through Human Limitations</article-title>. <source>Trends Cogn. Sci.</source> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guneysu Ozgur</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>&#xd6;zg&#xfc;r</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Asselborn</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Johal</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Yadollahi</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Bruno</surname>
<given-names>B.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Iterative Design and Evaluation of a Tangible Robot-Assisted Handwriting Activity for Special Education</article-title>. <source>Front. Robot. AI</source> <volume>7</volume>, <fpage>29</fpage>. <pub-id pub-id-type="doi">10.3389/frobt.2020.00029</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Halford</surname>
<given-names>G. S.</given-names>
</name>
<name>
<surname>Baker</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>McCredden</surname>
<given-names>J.&#x20;E.</given-names>
</name>
<name>
<surname>Bain</surname>
<given-names>J.&#x20;D.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>How many Variables Can Humans Process?</article-title> <source>Psychol. Sci.</source> <volume>16</volume>, <fpage>70</fpage>&#x2013;<lpage>76</lpage>. <pub-id pub-id-type="doi">10.1111/j.0956-7976.2005.00782.x</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Bhatia</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Dragan</surname>
<given-names>A. D.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Establishing Appropriate Trust via Critical States</article-title>. In <conf-name>2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</conf-name>. <publisher-name>IEEE</publisher-name>, <fpage>3929</fpage>&#x2013;<lpage>3936</lpage>. </citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Held</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Dragan</surname>
<given-names>A. D.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Enabling Robots to Communicate Their Objectives</article-title>. <source>Auton. Robot</source> <volume>43</volume>, <fpage>309</fpage>&#x2013;<lpage>326</lpage>. <pub-id pub-id-type="doi">10.1007/s10514-018-9771-0</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jara-Ettinger</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Gweon</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Schulz</surname>
<given-names>L. E.</given-names>
</name>
<name>
<surname>Tenenbaum</surname>
<given-names>J.&#x20;B.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>The Na&#xef;ve Utility Calculus: Computational Principles Underlying Commonsense Psychology</article-title>. <source>Trends Cognitive Sciences</source> <volume>20</volume>, <fpage>589</fpage>&#x2013;<lpage>604</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2016.05.011</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jara-Ettinger</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Theory of Mind as Inverse Reinforcement Learning</article-title>. <source>Curr. Opin. Behav. Sci.</source> <volume>29</volume>, <fpage>105</fpage>&#x2013;<lpage>110</lpage>. <pub-id pub-id-type="doi">10.1016/j.cobeha.2019.04.010</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jern</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lucas</surname>
<given-names>C. G.</given-names>
</name>
<name>
<surname>Kemp</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>People Learn Other People&#x27;s Preferences through Inverse Decision-Making</article-title>. <source>Cognition</source> <volume>168</volume>, <fpage>46</fpage>&#x2013;<lpage>64</lpage>. <pub-id pub-id-type="doi">10.1016/j.cognition.2017.06.017</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koo</surname>
<given-names>T. K.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>M. Y.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research</article-title>. <source>J.&#x20;Chiropractic Med.</source> <volume>15</volume>, <fpage>155</fpage>&#x2013;<lpage>163</lpage>. <pub-id pub-id-type="doi">10.1016/j.jcm.2016.02.012</pub-id> </citation>
</ref>
<ref id="B24">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lage</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Lifschitz</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Doshi-Velez</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Amir</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Exploring Computational User Models for Agent Policy Summarization</article-title>. In <conf-name>Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19</conf-name> (<publisher-loc>Red Hook, New York</publisher-loc>: <publisher-name>International Joint Conferences on Artificial Intelligence Organization</publisher-name>), <fpage>1401</fpage>&#x2013;<lpage>1407</lpage>. <pub-id pub-id-type="doi">10.24963/ijcai.2019/194</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Landis</surname>
<given-names>J.&#x20;R.</given-names>
</name>
<name>
<surname>Koch</surname>
<given-names>G. G.</given-names>
</name>
</person-group> (<year>1977</year>). <article-title>The Measurement of Observer Agreement for Categorical Data</article-title>. <source>Biometrics</source> <volume>33</volume>, <fpage>159</fpage>&#x2013;<lpage>174</lpage>. <pub-id pub-id-type="doi">10.2307/2529310</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>The Teaching Dimension of Linear Learners</article-title>. <source>J.&#x20;Machine Learn. Res.</source> <volume>17</volume>, <fpage>1</fpage>&#x2013;<lpage>25</lpage>. </citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lombrozo</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Explanatory Preferences Shape Learning and Inference</article-title>. <source>Trends Cogn. Sci.</source> <volume>20</volume>, <fpage>748</fpage>&#x2013;<lpage>759</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2016.08.001</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lucas</surname>
<given-names>C. G.</given-names>
</name>
<name>
<surname>Griffiths</surname>
<given-names>T. L.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Fawcett</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Gopnik</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kushnir</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). <article-title>The Child as Econometrician: A Rational Model of Preference Understanding in Children</article-title>. <source>PloS one</source> <volume>9</volume>, <fpage>e92160</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0092160</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>McHugh</surname>
<given-names>M. L.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Interrater Reliability: the Kappa Statistic</article-title>. <source>Biochem. Med.</source> <volume>22</volume>, <fpage>276</fpage>&#x2013;<lpage>282</lpage>. <pub-id pub-id-type="doi">10.11613/bm.2012.031</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ng</surname>
<given-names>A. Y.</given-names>
</name>
<name>
<surname>Russell</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2000</year>). <article-title>Algorithms for Inverse Reinforcement Learning</article-title>. In <conf-name>in Proc. 17th International Conf. on Machine Learning</conf-name>. </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Osa</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Pajarinen</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Neumann</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Bagnell</surname>
<given-names>J.&#x20;A.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Peters</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). <article-title>An Algorithmic Perspective on Imitation Learning</article-title>. <source>FNT in Robotics</source> <volume>7</volume>, <fpage>1</fpage>&#x2013;<lpage>179</lpage>. <pub-id pub-id-type="doi">10.1561/2300000053</pub-id> </citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Palan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Schitter</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Prolific.ac-A Subject Pool for Online Experiments</article-title>. <source>J.&#x20;Behav. Exp. Finance</source> <volume>17</volume>, <fpage>22</fpage>&#x2013;<lpage>27</lpage>. <pub-id pub-id-type="doi">10.1016/j.jbef.2017.12.004</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Paulraj</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sumathi</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>A Comparative Study of Redundant Constraints Identification Methods in Linear Programming Problems</article-title>. <source>Math. Probl. Eng.</source>, <volume>2010</volume>. </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Reiser</surname>
<given-names>B. J.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Scaffolding Complex Learning: The Mechanisms of Structuring and Problematizing Student Work</article-title>. <source>J.&#x20;Learn. Sci.</source> <volume>13</volume>, <fpage>273</fpage>&#x2013;<lpage>304</lpage>. <pub-id pub-id-type="doi">10.1207/s15327809jls1303_2</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sampayo-Vargas</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Cope</surname>
<given-names>C. J.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Byrne</surname>
<given-names>G. J.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>The Effectiveness of Adaptive Difficulty Adjustments on Students&#x27; Motivation and Learning in an Educational Computer Game</article-title>. <source>Comput. Edu.</source> <volume>69</volume>, <fpage>452</fpage>&#x2013;<lpage>462</lpage>. <pub-id pub-id-type="doi">10.1016/j.compedu.2013.07.004</pub-id> </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sandygulova</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Johal</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhexenova</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Tleubayev</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Zhanatkyzy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Turarova</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Cowriting Kazakh: Learning a New Script with a Robot</article-title>. In <conf-name>Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction</conf-name>. <fpage>113</fpage>&#x2013;<lpage>120</lpage>. </citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shteingart</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Loewenstein</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Reinforcement Learning and Human Behavior</article-title>. <source>Curr. Opin. Neurobiol.</source> <volume>25</volume>, <fpage>93</fpage>&#x2013;<lpage>98</lpage>. <pub-id pub-id-type="doi">10.1016/j.conb.2013.12.004</pub-id> </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vul</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Goodman</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Griffiths</surname>
<given-names>T. L.</given-names>
</name>
<name>
<surname>Tenenbaum</surname>
<given-names>J.&#x20;B.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>One and Done? Optimal Decisions from Very Few Samples</article-title>. <source>Cogn. Sci.</source> <volume>38</volume>, <fpage>599</fpage>&#x2013;<lpage>637</lpage>. <pub-id pub-id-type="doi">10.1111/cogs.12101</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Watkins</surname>
<given-names>C. J.&#x20;C. H.</given-names>
</name>
<name>
<surname>Dayan</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>1992</year>). <article-title>Q-learning</article-title>. <source>Mach. Learn.</source> <volume>8</volume>, <fpage>279</fpage>&#x2013;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1023/a:1022676722315</pub-id> </citation>
</ref>
<ref id="B40">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Williams</surname>
<given-names>J.&#x20;J.</given-names>
</name>
<name>
<surname>Lombrozo</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Rehder</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Why Does Explaining Help Learning? Insight from an Explanation Impairment Effect</article-title>. In <conf-name>Proceedings of the Annual Meeting of the Cognitive Science Society</conf-name>. vol. <volume>32</volume>. </citation>
</ref>
<ref id="B41">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wood</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Bruner</surname>
<given-names>J.&#x20;S.</given-names>
</name>
<name>
<surname>Ross</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>1976</year>). <article-title>The Role of Tutoring in Problem Solving</article-title>. <source>J.&#x20;Child Psychol. Psychiat.</source> <volume>17</volume>, <fpage>89</fpage>&#x2013;<lpage>100</lpage>. <pub-id pub-id-type="doi">10.1111/j.1469-7610.1976.tb00381.x</pub-id> </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wunderlich</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Dayan</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Dolan</surname>
<given-names>R. J.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Mapping Value Based Planning and Extensively Trained Choice in the Human Brain</article-title>. <source>Nat. Neurosci.</source> <volume>15</volume>, <fpage>786</fpage>&#x2013;<lpage>791</lpage>. <pub-id pub-id-type="doi">10.1038/nn.3068</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Singla</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zilles</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Rafferty</surname>
<given-names>A. N.</given-names>
</name>
</person-group> (<year>2018</year>). <source>An Overview of Machine Teaching</source> (<comment>arXiv preprint arXiv:1801.05927</comment>).</citation>
</ref>
<ref id="B44">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Machine Teaching: an Inverse Problem to Machine Learning and an Approach toward Optimal Education</article-title>. In <conf-name>Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence</conf-name>. <fpage>4083</fpage>&#x2013;<lpage>4087</lpage>. </citation>
</ref>
</ref-list>
</back>
</article>