<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2020.00026</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Intention-Related Natural Language Grounding via Object Affordance Detection and Intention Semantic Extraction</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Mi</surname> <given-names>Jinpeng</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/766706/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Liang</surname> <given-names>Hongzhuo</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Katsakis</surname> <given-names>Nikolaos</given-names></name>
<xref ref-type="aff" rid="aff3"><sup>3</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/683340/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Tang</surname> <given-names>Song</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Li</surname> <given-names>Qingdu</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Changshui</given-names></name>
<xref ref-type="aff" rid="aff4"><sup>4</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/634099/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Jianwei</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/637751/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>Institute of Machine Intelligence (IMI), University of Shanghai for Science and Technology</institution>, <addr-line>Shanghai</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Technical Aspects of Multimodal Systems, Department of Informatics, University of Hamburg</institution>, <addr-line>Hamburg</addr-line>, <country>Germany</country></aff>
<aff id="aff3"><sup>3</sup><institution>Human-Computer Interaction, Department of Informatics, University of Hamburg</institution>, <addr-line>Hamburg</addr-line>, <country>Germany</country></aff>
<aff id="aff4"><sup>4</sup><institution>Department of Automation, State Key Lab of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University</institution>, <addr-line>Beijing</addr-line>, <country>China</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Mehdi Khamassi, Centre National de la Recherche Scientifique (CNRS), France</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Oluwarotimi WIlliams Samuel, Chinese Academy of Sciences, China; Markus Vincze, Vienna University of Technology, Austria</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Song Tang <email>tang&#x00040;informatik.uni-hamburg.de</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>13</day>
<month>05</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>14</volume>
<elocation-id>26</elocation-id>
<history>
<date date-type="received">
<day>30</day>
<month>10</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>09</day>
<month>04</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Mi, Liang, Katsakis, Tang, Li, Zhang and Zhang.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Mi, Liang, Katsakis, Tang, Li, Zhang and Zhang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Similar to specific natural language instructions, intention-related natural language queries also play an essential role in our daily life communication. Inspired by the psychology term &#x0201C;affordance&#x0201D; and its applications in Human-Robot interaction, we propose an object affordance-based natural language visual grounding architecture to ground intention-related natural language queries. Formally, we first present an attention-based multi-visual features fusion network to detect object affordances from RGB images. When fusing deep visual features extracted from a pre-trained CNN model with deep texture features encoded by a deep texture encoding network, the presented object affordance detection network takes into account the interaction of the multi-visual features, and preserves the complementary nature of the different features by integrating attention weights learned from sparse representations of the multi-visual features. We train and validate the attention-based object affordance recognition network on a self-built dataset in which a large number of images originate from MSCOCO and ImageNet. Moreover, we introduce an intention semantic extraction module to extract intention semantics from intention-related natural language queries. Finally, we ground intention-related natural language queries by integrating the detected object affordances with the extracted intention semantics. We conduct extensive experiments to validate the performance of the object affordance detection network and the intention-related natural language grounding architecture.</p></abstract>
<kwd-group>
<kwd>intention-related natural language grounding</kwd>
<kwd>object affordance detection</kwd>
<kwd>intention semantic extraction</kwd>
<kwd>multi-visual features</kwd>
<kwd>attention-based dynamic fusion</kwd>
</kwd-group>
<counts>
<fig-count count="9"/>
<table-count count="1"/>
<equation-count count="11"/>
<ref-count count="43"/>
<page-count count="12"/>
<word-count count="7363"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Human beings live in a multi-modal environment, where natural language and vision are the dominant channels for communication and perception. Naturally, we would like to develop intelligent agents with the ability to communicate and perceive their working scenarios as humans do. Natural language processing, computer vision, and the interplay between them are involved in the tasks for grounding natural language queries in working scenarios.</p>
<p>We often refer to objects in the environment when we have a pragmatic interaction with others, and we have the ability to comprehend specific and intention-related natural language queries in a wide range of practical applications. For instance, we can locate the target object &#x0201C;remote controller&#x0201D; according to the given specific natural language instruction &#x0201C;give me the remote controller next to the TV,&#x0201D; and we also can infer the intended &#x0201C;drinkware&#x0201D; from the intention-related query &#x0201C;I am thirsty, I want to drink some water.&#x0201D;</p>
<p>Cognitive psychologist Don Norman discussed affordance from the design perspective so that the function of objects could be easily perceived. He argued that affordance refers to the fundamental properties of an object and determines how the object could possibly be used (Norman, <xref ref-type="bibr" rid="B22">1988</xref>). According to Norman&#x00027;s viewpoint, drinks afford <italic>drinking</italic>, foods afford <italic>eating</italic>, and reading materials, such as text documents, afford <italic>reading</italic>.</p>
<p>When new objects come into our sight in our daily life, we can infer their function according to multiple visual properties, such as shape, size, color, texture, and material. The capacity to infer the functional aspects of objects, or object affordance, is crucial for us to describe and categorize objects more easily. Moreover, affordance is widely used to boost model performance in different tasks: Celikkanat et al. (<xref ref-type="bibr" rid="B7">2015</xref>) demonstrate that affordance can improve the quality of natural human-robot interaction (HRI), Yu et al. (<xref ref-type="bibr" rid="B41">2015</xref>) integrate affordance to improve the understanding of human intentions over different time periods, Thermos et al. (<xref ref-type="bibr" rid="B37">2017</xref>) fuse visual features with affordance to improve the robustness of sensorimotor object recognition, and Mi et al. (<xref ref-type="bibr" rid="B17">2019</xref>) utilize affordance to prompt a robot to understand human spoken instructions.</p>
<p>Following Norman&#x00027;s standpoint, we generalize 10 affordances [<italic>calling, drinking(I), drinking(II), eating(I), eating(II), playing, reading, writing, cleaning</italic>, and <italic>cooking</italic>] for objects that are commonly used in indoor environments. Although both drinkware and drinks can be used for drinking, drinkware affords a different function than drinks do, i.e., the affordance of drinkware is different from that of drinks. The same situation also exists between foods and eating utensils. Therefore, we utilize <italic>drinking(I)</italic> to denote the affordance of drinkware, <italic>drinking(II)</italic> for drinks, <italic>eating(I)</italic> for eating utensils, and <italic>eating(II)</italic> for foods, respectively.</p>
<p>Moreover, multiple features can improve a model&#x00027;s performance in recognizing objects. Texture features can serve as complementary information for the visual representation of partially occluded objects, and according to Song et al. (<xref ref-type="bibr" rid="B35">2015</xref>), local texture features can enhance object grasping estimation performance. Motivated by the complementary nature of multiple features, we adopt multi-visual features, namely the deep visual features extracted from a pretrained CNN and the deep texture features encoded by a deep texture encoding network, to learn object affordances. The primary issue in fusing multi-visual features is that the fusion scheme should preserve the complementary nature of the features. Fusing different features through naive concatenation may fail to learn the relevance of the multiple features, bring about redundancies, and lead to overfitting during training. Consequently, in order to preserve the complementary nature of the multi-visual features in the process of affordance learning, we take advantage of the interaction information between the multi-visual features, and integrate an attention network with the interaction information to fuse the multi-visual features.</p>
<p>In addition, inspired by the role of affordance and its applications in HRI, and in order to enable robots to understand intention-related natural language instructions, we attempt to ground intention-related natural language queries via object affordance. In this work, we decompose intention-related natural language grounding into three subtasks: (1) detect the affordances of objects in working scenarios; (2) extract intention semantics from intention-related natural language queries; (3) ground target objects by integrating the detected affordances with the extracted intention semantics. In other words, we ground intention-related natural language queries via object affordance detection and intention semantic extraction.</p>
<p>In summary, we propose an intention-related natural language grounding architecture which is composed of an object affordance detection network, an intention semantic extraction module, and a target object grounding module. Moreover, we conduct extensive experiments to validate the performance of the introduced object affordance detection network and the intention-related natural language grounding architecture. We also implement target object grounding and grasping experiments on a robotic platform to evaluate the introduced intention-related natural language grounding architecture.</p>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<sec>
<title>2.1. Natural Language Grounding</title>
<p>Natural language grounding requires a comprehensive understanding of natural language expressions and images, and aims to locate the most related objects within images. Multiple approaches have been proposed to address natural language grounding. Yu et al. (<xref ref-type="bibr" rid="B40">2016</xref>) introduce referring expression grounding, which grounds referring expressions within given images by jointly learning region visual features and the semantics embedded in the referring expressions. Chen et al. (<xref ref-type="bibr" rid="B8">2017</xref>) present phrase grounding, which aims to locate referred targets by the corresponding phrases in natural language queries. These approaches need large datasets to train models to achieve natural language grounding.</p>
<p>Natural language grounding also attracts great interest in robotics. Thomason et al. (<xref ref-type="bibr" rid="B38">2017</xref>) apply opportunistic active learning to ground natural language in home and office environments, and the presented model needs to ask human users &#x0201C;inquisitive&#x0201D; questions to locate target objects. Shridhar and Hsu (<xref ref-type="bibr" rid="B33">2018</xref>) employ expressions generated by a captioning model (Johnson et al., <xref ref-type="bibr" rid="B12">2016</xref>), gestures, and a dialog system to ground targets. Ahn et al. (<xref ref-type="bibr" rid="B1">2018</xref>) utilize position maps generated by the hourglass network (Newell et al., <xref ref-type="bibr" rid="B19">2016</xref>) and a question generation module to infer referred objects. Thomason et al. (<xref ref-type="bibr" rid="B39">2019</xref>) translate spoken language instructions into robot action commands and use clarification conversations with human users to ground targets. However, conversation and dialog systems make HRI time-consuming and cumbersome.</p>
<p>Other work presents non-dialog methods to ground natural language queries. Bastianelli et al. (<xref ref-type="bibr" rid="B4">2016</xref>) utilize features extracted from semantic maps and spatial relationships between objects within the working environment to locate the targets for spoken language-based HRI. Alomari et al. (<xref ref-type="bibr" rid="B2">2017</xref>) locate target objects by learning to extract concepts of objects and building the mapping between the concepts and natural language commands. Paul et al. (<xref ref-type="bibr" rid="B23">2018</xref>) parse hierarchical abstract and concrete factors from natural language commands and adopt an approximate inference procedure to ground targets within working scenarios. Roesler et al. (<xref ref-type="bibr" rid="B29">2019</xref>) employ cross-situational learning to ground unknown synonymous objects and actions; the introduced method utilizes different word representations to identify synonymous words and grounds targets according to their geometric characteristics. These methods are proposed to ground natural language commands that embed specific target objects.</p>
<p>Different from the above mentioned approaches, we attempt to ground intention-related natural language queries without dialogs with human users or other auxiliary information. To this end, we draw support from object affordance to ground intention-related natural language instructions.</p>
</sec>
<sec>
<title>2.2. Object Affordance</title>
<p>Existing work utilizes multiple approaches to infer object affordances. Sun et al. (<xref ref-type="bibr" rid="B36">2014</xref>) predict object affordances through human demonstration, Kim and Sukhatme (<xref ref-type="bibr" rid="B13">2014</xref>) deduce affordance from geometric features extracted from point cloud segments, Zhu et al. (<xref ref-type="bibr" rid="B43">2014</xref>) reason about affordance by querying the visual attributes, physical attributes, and categorical characteristics of objects in a pre-built knowledge base, and Myers et al. (<xref ref-type="bibr" rid="B18">2015</xref>) perceive affordance from local shape and geometry primitives of objects. These methods rely on visual characteristics or geometric features to infer object affordances, so their scalability and flexibility are limited.</p>
<p>Several recently published methods adopt deep learning-based approaches to detect object affordance. Dehban et al. (<xref ref-type="bibr" rid="B11">2016</xref>) propose a denoising auto-encoder to actively learn the affordances of objects and tools through observing the consequences of actions performed on them. Roy and Todorovic (<xref ref-type="bibr" rid="B30">2016</xref>) use a multi-scale CNN to extract mid-level visual features and combine them to segment affordances from RGB images. Unlike Roy and Todorovic (<xref ref-type="bibr" rid="B30">2016</xref>), Sawatzky et al. (<xref ref-type="bibr" rid="B32">2017</xref>) regard affordance perception as semantic image segmentation and adopt a deep CNN-based architecture to segment affordances from weakly labeled images. Nguyen et al. (<xref ref-type="bibr" rid="B20">2016</xref>) extract deep features from a CNN model and apply an encoder-decoder architecture to detect affordances for object parts. Mi et al. (<xref ref-type="bibr" rid="B17">2019</xref>) utilize deep features extracted from different convolutional layers of a pretrained CNN model to recognize object affordances, and Nguyen et al. (<xref ref-type="bibr" rid="B21">2017</xref>) apply an object detector, a CNN, and dense conditional random fields to detect object affordances from RGB images.</p>
<p>The aforementioned work utilized geometric features or deep features extracted from a pretrained CNN to infer object affordance, and did not take into consideration that features from another source can be applied to improve affordance recognition accuracy. Rendle (<xref ref-type="bibr" rid="B28">2010</xref>) proposes Factorization Machines (FM), which model interactions between different features via factorized parameters and are able to estimate these interactions from sparse data. Bahdanau et al. (<xref ref-type="bibr" rid="B3">2015</xref>) present an attention mechanism that learns different weights for different parts of the input features and automatically searches for the most relevant parts of the source features.</p>
<p>Inspired by Rendle (<xref ref-type="bibr" rid="B28">2010</xref>) and Bahdanau et al. (<xref ref-type="bibr" rid="B3">2015</xref>), we propose an attention-based architecture to fuse deep visual features with deep texture features through an attention network. The introduced fusion architecture takes sparse representations of the multi-visual features as input and achieves attention-based dynamic fusion for learning object affordances.</p>
</sec>
</sec>
<sec id="s3">
<title>3. Architecture Overview</title>
<p>Similar to specific natural language instructions, intention-related natural language queries are also a crucial component in our daily communication. Given an intention-related natural language command, such as &#x0201C;I am hungry, I want to eat something,&#x0201D; and a working scenario which is composed of multiple household objects, the objective of intention-related natural language grounding is to locate the most related object &#x0201C;food&#x0201D; within the working scenario.</p>
<p>In order to ground intention-related natural language queries, we propose an architecture as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. In this work, we formulate the proposed intention-related natural language grounding architecture as three sub-modules: (1) an object affordance detection network detects object affordances from RGB images; (2) an intention semantic extraction module extracts intention semantic words from intention-related natural language instructions; (3) a target object grounding module locates the intended target objects by integrating the detected object affordances with the extracted intention semantic words. A minimal pseudocode sketch of this pipeline is given after <xref ref-type="fig" rid="F1">Figure 1</xref>.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Architecture of the intention-related natural language grounding via object affordance detection and intention semantic extraction. The object affordance detection network detects object affordance from RGB images. The intention semantic extraction module calculates the different weights of each word in given natural language queries and extracts the intention semantic word. The grounding module locates target objects by combining the outputs of the object affordance detection network and the intention semantic extraction module.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0001.tif"/>
</fig>
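<p>The three sub-modules can be summarized in the following minimal pseudocode sketch; the function names <monospace>detect_affordances</monospace>, <monospace>extract_intention_semantic</monospace>, and <monospace>ground_target</monospace> are hypothetical placeholders for the components detailed in sections 4, 5, and 6.</p>
<preformat>
# Minimal sketch of the three-stage grounding pipeline (hypothetical interfaces).
def ground_intention_query(image, query):
    # 1. Detect object affordances (section 4): list of (bounding_box, affordance_label).
    detections = detect_affordances(image)
    # 2. Extract the intention semantic word from the query (section 5), e.g., "drink".
    intention_word = extract_intention_semantic(query)
    # 3. Ground the target by matching intention semantics with affordances (section 6).
    target_box = ground_target(detections, intention_word)
    return target_box
</preformat>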
<p>We illustrate the details of the object affordance detection in section 4, we introduce the intention semantic extraction in section 5, and we describe the target object grounding module in section 6. Moreover, we give the details of the experiments conducted to validate the performance of the object affordance detection network and the intention-related natural language grounding architecture, and outline the acquired results in section 7.</p>
</sec>
<sec id="s4">
<title>4. Object Affordance Detection</title>
<p>Following Norman&#x00027;s viewpoint, we generalize ten affordances for ordinary household objects, and we present an attention-based multi-visual features fusion architecture, which can be trained end-to-end, to learn the affordances. <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates the details of the proposed multi-visual features fusion architecture. The presented architecture is composed of a Region of Interest (RoI) detection network (RetinaNet), a deep features extraction module, an attention network, an attention-based dynamic fusion module, and an MLP (Multi-Layer Perceptron). We adopt two different deep networks to extract multi-visual features, the attention network is employed to generate dynamic attention weights through the sparse representations of the extracted features, while the dynamic fusion module fuses the multi-visual features by integrating them with the generated attention weights, and the MLP is applied to learn the object affordances. In this section, we introduce the details of each component of the proposed architecture.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>Architectural diagram of the object affordance detection via attention-based multi-visual features fusion. RetinaNet is adopted to detect RoIs from raw images, and then for each detected RoI, the deep visual features and deep texture features are extracted by a pretrained CNN and a texture encoding network, respectively. In order to preserve the complementary nature of the different features and avoid introducing redundancies during the multi-visual features fusion, an attention-based fusion mechanism is applied to fuse the multi-visual features. After the attention-based fusion, the fused features are fed into an MLP to learn object affordances.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0002.tif"/>
</fig>
<sec>
<title>4.1. Deep Features Extraction</title>
<sec>
<title>4.1.1. Deep Visual Feature Extraction</title>
<p>RetinaNet (Lin et al., <xref ref-type="bibr" rid="B15">2020</xref>) acquires better detection accuracy on MSCOCO (Lin et al., <xref ref-type="bibr" rid="B16">2014</xref>) than all state-of-the-art two-stage detectors. Considering this performance, we adopt RetinaNet to generate RoIs from raw images. The deep visual feature <italic>f</italic><sub><italic>v</italic></sub> is extracted by a pretrained CNN for each RoI <italic>I</italic><sub><italic>R</italic></sub>:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>C</mml:mi><mml:mi>N</mml:mi><mml:mi>N</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>f</italic><sub><italic>v</italic></sub> &#x02208; &#x0211D;<sup><italic>m&#x000D7;n&#x000D7;d</italic><sub><italic>v</italic></sub></sup>, <italic>m</italic> &#x000D7; <italic>n</italic> denotes the size of the extracted deep features, and <italic>d</italic><sub><italic>v</italic></sub> is the output dimension of the CNN layer. In order to improve learning dynamics and reduce training time, we use <italic>L</italic><sub>2</sub> normalization to process the extracted deep visual features.</p>
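<p>As a concrete illustration of Equation (1), the following sketch extracts the deep visual feature of a rescaled RoI from the last pooling layer of a pretrained VGG19 (the backbone used in section 7.1.2) and applies <italic>L</italic><sub>2</sub> normalization; the torchvision model and the ImageNet preprocessing statistics are standard, while the RoI input itself is a placeholder.</p>
<preformat>
import torch
import torch.nn.functional as F
from torchvision import models, transforms

# Pretrained VGG19; "features" ends with the last pooling layer (7 x 7 x 512 output).
cnn = models.vgg19(pretrained=True).features.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),          # rescale each detected RoI
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_visual_feature(roi_image):
    """roi_image: a PIL image cropped from a detected RoI."""
    x = preprocess(roi_image).unsqueeze(0)   # shape (1, 3, 224, 224)
    with torch.no_grad():
        f_v = cnn(x)                         # shape (1, 512, 7, 7)
    # L2-normalize the extracted feature map, as described above.
    return F.normalize(f_v.flatten(1), p=2, dim=1).view_as(f_v)
</preformat>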
</sec>
<sec>
<title>4.1.2. Deep Texture Feature Extraction</title>
<p>Multiple texture recognition networks can be used to encode texture features. For example, Cimpoi et al. (<xref ref-type="bibr" rid="B9">2015</xref>) generate texture features through Fisher Vector pooling of a pretrained CNN filter bank, and Zhang et al. (<xref ref-type="bibr" rid="B42">2017</xref>) propose a texture encoding network for material and texture recognition. The texture encoding network encodes deep texture features through a texture encoding layer that is integrated on top of convolutional layers and is capable of transferring CNNs from object recognition to texture and material recognition. Furthermore, the texture encoding network achieves state-of-the-art performance on the material dataset MINC2500 (Bell et al., <xref ref-type="bibr" rid="B5">2015</xref>). Due to the good performance of the texture encoding network introduced in Zhang et al. (<xref ref-type="bibr" rid="B42">2017</xref>), we select it to encode the texture feature for each detected RoI and convert the texture feature into the vector <bold>v</bold><sub><bold><italic>t</italic></bold></sub>:</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M2"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>T</mml:mi><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>N</mml:mi><mml:mi>e</mml:mi><mml:mi>t</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>I</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>v</bold><sub><bold><italic>t</italic></bold></sub> &#x02208; &#x0211D;<sup><italic>1&#x000D7;d</italic><sub><italic>t</italic></sub></sup>, <italic>d</italic><sub><italic>t</italic></sub> is the output size of the texture encoding network.</p>
<p>We also apply <italic>L</italic><sub>2</sub> normalization to process each texture vector <bold>v</bold><sub><bold><italic>t</italic></bold></sub>. For modeling convenience, we utilize a single perceptron, which consists of a linear layer followed by a tanh layer, to transform <bold>v</bold><sub><bold><italic>t</italic></bold></sub> into a new vector:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mstyle mathvariant="italic"><mml:mi>W</mml:mi></mml:mstyle><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mstyle mathvariant="italic"><mml:mi>b</mml:mi></mml:mstyle></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>v^</bold><sub><bold><italic>t</italic></bold></sub> &#x02208; &#x0211D;<sup>1&#x000D7;d<sub><italic>l</italic></sub></sup>, <italic>W</italic> is a weight matrix and <italic>b</italic> is a bias vector for the linear layer, and <italic>d</italic><sub><italic>l</italic></sub> is the dimension of the linear layer. According to Ben-Younes et al. (<xref ref-type="bibr" rid="B6">2017</xref>) and our experimental results, the hyperbolic tangent produces slightly better results.</p>
<p>For fusing convenience, we adopt the tile operation to expand the texture vector <inline-formula><mml:math id="M4"><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula><sub><italic>t</italic></sub> to generate the deep texture representation <italic>f</italic><sub><italic>t</italic></sub>, which has the same dimensions as the deep visual feature <italic>f</italic><sub><italic>v</italic></sub>, i.e., the generated <italic>f</italic><sub><italic>t</italic></sub> &#x02208; &#x0211D;<sup>m&#x000D7;n&#x000D7;d<sub><italic>v</italic></sub></sup>.</p>
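<p>A minimal sketch of the single-perceptron transform in Equation (3) and the subsequent tile operation, assuming the 1 &#x000D7; 4,096 texture vector has already been produced by the texture encoding network (here a random placeholder) and using <italic>d</italic><sub><italic>l</italic></sub> = 512 as in section 7.1.2:</p>
<preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

d_t, d_l, m, n = 4096, 512, 7, 7

# Single perceptron: a linear layer followed by tanh (Equation 3).
perceptron = nn.Sequential(nn.Linear(d_t, d_l), nn.Tanh())

v_t = F.normalize(torch.randn(1, d_t), p=2, dim=1)   # placeholder texture vector
v_t_hat = perceptron(v_t)                            # shape (1, 512)

# Tile the transformed vector so that f_t matches the spatial size of f_v.
f_t = v_t_hat.view(1, d_l, 1, 1).expand(1, d_l, m, n)   # shape (1, 512, 7, 7)
</preformat>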
</sec>
</sec>
<sec>
<title>4.2. Attention-Based Multi-Visual Features Dynamic Fusion</title>
<p>Factorization Machines (FM) were proposed for recommendation systems (Rendle, <xref ref-type="bibr" rid="B28">2010</xref>) and aim at modeling feature interactions under large-scale sparse data. Given a feature vector list, FM predicts the target by modeling all interactions between each pair of features:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M5"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x00177;</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x0002B;</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>&#x00175;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>w</italic><sub>0</sub> &#x02208; &#x0211D; is the global bias, <italic>x</italic><sub><italic>i</italic></sub> and <italic>x</italic><sub><italic>j</italic></sub> denote the <italic>i</italic>-th and <italic>j</italic>-th feature in the given feature list, <italic>w</italic> &#x02208; &#x0211D;<sup><italic>t</italic></sup> is the weight vector whose element <italic>w</italic><sub><italic>i</italic></sub> represents the weight of the <italic>i</italic>-th feature, and &#x00175;<sub><italic>ij</italic></sub> models the interaction between the <italic>i</italic>-th and <italic>j</italic>-th features and is calculated by:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x00175;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>v</italic><sub><italic>i</italic></sub>, <italic>v</italic><sub><italic>j</italic></sub> &#x02208; &#x0211D;<sup><italic>s</italic></sup> are the sparse representations of <italic>x</italic><sub><italic>i</italic></sub> and <italic>x</italic><sub><italic>j</italic></sub>, i.e., embedding vectors for the non-zero elements of <italic>x</italic><sub><italic>i</italic></sub> and <italic>x</italic><sub><italic>j</italic></sub>, <italic>s</italic> denotes the dimension of the embedding vectors.</p>
<p>In light of FM, &#x00175;<sub><italic>ij</italic></sub> encodes the interaction information between different features, and should be represented by the sparse non-zero elements of the different features. Formally, we extract the non-zero element set from <italic>f</italic><sub><italic>v</italic></sub> and <bold>v</bold><sub><bold><italic>t</italic></bold></sub>, and adopt an embedding layer to acquire the sparse representations <italic>e</italic><sub><italic>v</italic></sub> for <italic>f</italic><sub><italic>v</italic></sub> and <italic>e</italic><sub><italic>t</italic></sub> for <bold>v</bold><sub><bold><italic>t</italic></bold></sub>, respectively. We calculate the interaction matrix <italic>k</italic><sub><italic>vt</italic></sub>, which embeds the interaction information between <italic>f</italic><sub><italic>v</italic></sub> and <bold>v</bold><sub><bold><italic>t</italic></bold></sub>, by:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msubsup><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>k</italic><sub><italic>vt</italic></sub>&#x02208; &#x0211D;<sup><italic>p</italic>&#x000D7;<italic>p</italic></sup>, <italic>e</italic><sub><italic>v</italic></sub> and <italic>e</italic><sub><italic>t</italic></sub> &#x02208; &#x0211D;<sup>1&#x000D7;</sup><sup><italic>p</italic></sup>, <italic>p</italic> denotes the output size of the embedding layer.</p>
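<p>The interaction matrix of Equation (6) can be sketched as follows; the exact form of the embedding layers and the pooling of the visual feature map into a single descriptor are assumptions made for illustration, with <italic>p</italic> = 512 as in section 7.1.2.</p>
<preformat>
import torch
import torch.nn as nn

p = 512
embed_v = nn.Linear(512, p)   # embedding layer for the deep visual feature (assumed form)
embed_t = nn.Linear(512, p)   # embedding layer for the deep texture feature (assumed form)

def interaction_matrix(f_v, v_t_hat):
    # Pool the visual feature map into a single descriptor before embedding (assumption).
    e_v = embed_v(f_v.mean(dim=(2, 3)))        # shape (1, p)
    e_t = embed_t(v_t_hat)                     # shape (1, p)
    # Outer product e_v^T e_t models the interaction between the two features (Equation 6).
    k_vt = e_v.transpose(0, 1) @ e_t           # shape (p, p)
    return k_vt
</preformat>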
<p>In order to avoid introducing information redundancies during feature fusion, we integrate the attention mechanism with <italic>k</italic><sub><italic>vt</italic></sub> to complete the feature fusion. By learning attention weights, the attention mechanism endows the model with the ability to emphasize the different weights of the multi-visual features while learning affordances. The attention weights can be parameterized by an attention network which is composed of an MLP and a softmax layer. The input of the attention network is the interaction matrix <italic>k</italic><sub><italic>vt</italic></sub>, and the generated weights encode the interaction information between the different features. The attention weights &#x003C4;<sub><italic>att</italic></sub> can be acquired by:</p>
<disp-formula id="E7"><label>(7)</label><mml:math id="M8"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mo>&#x02211;</mml:mo><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>and</p>
<disp-formula id="E8"><label>(8)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:msup><mml:mstyle mathvariant="italic"><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>h</mml:mi></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="italic"><mml:mi>W</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="italic"><mml:mi>b</mml:mi></mml:mstyle></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003C4;<sub><italic>att</italic></sub> &#x02208; &#x0211D;<sup>1&#x000D7;</sup><sup><italic>p</italic></sup>, and <italic>W</italic><sub><italic>att</italic></sub>, <italic>b</italic><sub><italic>att</italic></sub>, and &#x003B1; are the weight matrix, bias vector, and model parameters of the attention network, respectively.</p>
<p>By means of the learned &#x003C4;<sub><italic>att</italic></sub>, we fuse <italic>f</italic><sub><italic>v</italic></sub> and <italic>f</italic><sub><italic>t</italic></sub> to produce the fused feature <italic>f</italic><sub><italic>fuse</italic></sub> to learn object affordances. The fused feature <italic>f</italic><sub><italic>fuse</italic></sub> is generated by:</p>
<disp-formula id="E9"><label>(9)</label><mml:math id="M10"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>f</mml:mi><mml:mi>u</mml:mi><mml:mi>s</mml:mi><mml:mi>e</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>v</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02295;</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C4;</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>t</mml:mi><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>f</italic><sub><italic>fuse</italic></sub> &#x02208; &#x0211D;<sup><italic>m&#x000D7;n&#x000D7;</italic>2<italic>d</italic><sub><italic>v</italic></sub></sup>, and &#x02295; denotes concatenation. <xref ref-type="fig" rid="F3">Figure 3</xref> shows the details of the attention-based multi-visual features fusion.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Attention-based multi-visual features fusion network. The feature embedding layers process the sparse representations of the deep visual feature and the deep texture feature, and the outputs of the feature embedding layers are applied to generate the interaction information of the multi-visual features. Subsequently, the interaction information is fed into the attention network to acquire the attention weights, which are adopted to complete attention based dynamic fusion.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0003.tif"/>
</fig>
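<p>A minimal sketch of the attention network and the dynamic fusion in Equations (7)&#x02013;(9); reducing <italic>k</italic><sub><italic>vt</italic></sub> to a 1 &#x000D7; <italic>p</italic> input by averaging and the single-layer form of the attention MLP are assumptions made for illustration.</p>
<preformat>
import torch
import torch.nn as nn

p = 512

class AttentionFusion(nn.Module):
    """Sketch of the attention-based dynamic fusion (Equations 7-9)."""
    def __init__(self):
        super().__init__()
        self.W_att = nn.Linear(p, p)              # W_att k_vt + b_att
        self.alpha = nn.Linear(p, p, bias=False)  # alpha^T tanh(...)

    def forward(self, k_vt, f_v, f_t):
        k = k_vt.mean(dim=0, keepdim=True)        # reduce k_vt to a 1 x p input (assumed tiling step)
        A_vt = self.alpha(torch.tanh(self.W_att(k)))
        tau_att = torch.softmax(A_vt, dim=1)      # attention weights, shape (1, p)
        w = tau_att.view(1, p, 1, 1)              # broadcast over the 7 x 7 spatial grid
        # Weighted features are concatenated along the channel dimension (Equation 9).
        f_fuse = torch.cat([(1 - w) * f_v, w * f_t], dim=1)   # shape (1, 2p, 7, 7)
        return f_fuse
</preformat>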
</sec>
</sec>
<sec id="s5">
<title>5. Intention Semantic Extraction</title>
<p>Each word plays a different role in representing the semantics of a natural language expression, so we argue that each word should have a different weight when grounding target objects from natural language queries. In order to acquire these weights, we propose a self-attentive network to calculate the weight of each word in natural language queries. We acquire the weights in three steps. First, given a natural language sentence <italic>S</italic>, we tokenize <italic>S</italic> into words with the NLTK toolkit (Perkins, <xref ref-type="bibr" rid="B25">2010</xref>), i.e., <italic>S</italic> = <italic>s</italic><sub>1</sub>, <italic>s</italic><sub>2</sub>, &#x02026;, <italic>s</italic><sub><italic>n</italic></sub>, where <italic>n</italic> denotes the number of words in <italic>S</italic>. Moreover, the lexical category of each tokenized word <italic>s</italic><sub><italic>i</italic></sub>, <italic>i</italic> &#x02208; (1, <italic>n</italic>), is generated by the POS tagger (part-of-speech tagger) of NLTK.</p>
<p>Second, we adopt GloVe (Pennington et al., <xref ref-type="bibr" rid="B24">2014</xref>) to transform each <italic>s</italic><sub><italic>i</italic></sub> into a 300-D vector <italic>r</italic><sub><italic>i</italic></sub> as its word representation, <italic>r</italic><sub><italic>i</italic></sub> &#x02208; &#x0211D;<sup>1&#x000D7;300</sup>. These word representation vectors are concatenated as the representation of the sentence, i.e., <italic>R</italic> = (<italic>r</italic><sub>1</sub>, <italic>r</italic><sub>2</sub>, &#x02026;, <italic>r</italic><sub><italic>n</italic></sub>), <italic>R</italic> &#x02208; &#x0211D;<sup>n &#x000D7; 300</sup>. We then feed the generated sentence representation <italic>R</italic> into the self-attentive network to calculate the weight of each word. The self-attentive network applies an attention mechanism over the hidden vectors of a BiLSTM to generate a weight score &#x003B1;<sub><italic>i</italic></sub> for <italic>s</italic><sub><italic>i</italic></sub>. The self-attentive network is defined as:</p>
<disp-formula id="E10"><label>(10)</label><mml:math id="M11"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mtext class="textrm" mathvariant="normal">BiLSTM</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>R</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd columnalign="center"><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>t</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mi>h</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>W</mml:mi><mml:msub><mml:mrow><mml:mi>h</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mstyle displaystyle="true"><mml:msub><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mstyle><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>h</italic><sub><italic>t</italic></sub> represents the hidden vector of the BiLSTM, and <italic>u</italic><sub><italic>i</italic></sub> is the transformation vector generated by an MLP with learnable weight matrix <italic>W</italic> and bias vector <italic>b</italic>. In practice, we adopt the weights trained on the supervised data of the Stanford Natural Language Inference dataset (Conneau et al., <xref ref-type="bibr" rid="B10">2017</xref>) as the initial weights of the BiLSTM in the self-attentive network.</p>
<p>Finally, the words of the sentence <italic>S</italic> are re-ordered according to the acquired &#x003B1;<sub><italic>i</italic></sub>, the verb with the largest weight is selected to represent the semantics of the intention-related instruction, and the selected verb is fed into the grounding module to complete target object grounding.</p>
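<p>A minimal sketch of the intention semantic extraction step; the GloVe lookup table <monospace>glove</monospace> is a hypothetical word-to-vector dictionary, and the hidden size and initialization of the BiLSTM are placeholders rather than the pretrained weights described above.</p>
<preformat>
import numpy as np
import torch
import torch.nn as nn
import nltk

class SelfAttentiveScorer(nn.Module):
    """Sketch of the self-attentive word-weight scorer (Equation 10)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(300, hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Linear(2 * hidden, 1)

    def forward(self, R):                      # R: sentence representation of shape (1, n, 300)
        h, _ = self.bilstm(R)                  # hidden vectors, shape (1, n, 2 * hidden)
        u = torch.tanh(self.mlp(h)).squeeze(-1)
        return torch.softmax(u, dim=1)         # per-word weights alpha_i

def extract_intention_semantic(sentence, glove, scorer):
    words = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(words)                 # lexical category of each word
    R = torch.tensor(np.stack([glove[w.lower()] for w in words]),
                     dtype=torch.float32).unsqueeze(0)
    alpha = scorer(R).squeeze(0)
    # Select the verb with the largest attention weight as the intention semantic word.
    verb_ids = [i for i, (_, t) in enumerate(tags) if t.startswith('VB')]
    return words[max(verb_ids, key=lambda i: alpha[i].item())]
</preformat>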
</sec>
<sec id="s6">
<title>6. Target Object Grounding</title>
<p>An essential step toward intention-related natural language grounding is to build the mapping between the detected affordances and the extracted intention semantic words. Inspired by Latent Semantic Analysis (LSA), which is used to measure the semantic similarity between words and text documents, we propose a semantic similarity-based approach to build the mapping between the detected affordances and the intention-related natural language queries.</p>
<p>We first transform the extracted intention semantic word and the detected affordances into 300-D vectors with GloVe, and then calculate the word semantic similarity between them to achieve target grounding. Formally, we transform the extracted intention semantic word into the vector <italic>v</italic><sub><italic>sem</italic></sub> &#x02208; &#x0211D;<sup>1&#x000D7;300</sup>, and transform the detected affordances into vectors <italic>v</italic><sub><italic>aff, i</italic></sub> &#x02208; &#x0211D;<sup>1&#x000D7;300</sup>, i &#x02208; (1, <italic>N</italic>), where <italic>N</italic> denotes the number of detected object affordances. We calculate the semantic similarity between them by:</p>
<disp-formula id="E11"><label>(11)</label><mml:math id="M12"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mi>S</mml:mi><mml:mi>i</mml:mi><mml:mi>m</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>f</mml:mi><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>f</mml:mi><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>s</mml:mi><mml:mi>e</mml:mi><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x000B7;</mml:mo><mml:mo>|</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>v</mml:mi></mml:mrow><mml:mrow><mml:mi>a</mml:mi><mml:mi>f</mml:mi><mml:mi>f</mml:mi><mml:mo>,</mml:mo><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mo>|</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where ||&#x000B7;||<sub>2</sub> denotes the <italic>L</italic><sub>2</sub> norm.</p>
<p>The object whose affordance has the largest semantic similarity to the intention semantic word is selected as the target. Through the semantic similarity calculation, the extracted intention semantics are mapped onto the corresponding human-centered object affordance.</p>
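<p>The grounding step of Equation (11) reduces to a cosine-similarity argmax over the detected affordances; in the sketch below, <monospace>glove</monospace> is again a hypothetical dictionary of 300-D vectors, and the affordance labels are assumed to be mapped to plain words (e.g., &#x0201C;drinking&#x0201D;) before lookup.</p>
<preformat>
import numpy as np

def cosine_similarity(a, b):
    # Equation (11): dot product divided by the product of the L2 norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ground_target(detections, intention_word, glove):
    """detections: list of (bounding_box, affordance_label) pairs from section 4."""
    v_sem = glove[intention_word]
    scores = [cosine_similarity(v_sem, glove[affordance])
              for _, affordance in detections]
    # The object whose affordance is most similar to the intention semantics is the target.
    best = int(np.argmax(scores))
    return detections[best][0]
</preformat>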
</sec>
<sec id="s7">
<title>7. Experiments and Results</title>
<sec>
<title>7.1. Object Affordance Detection</title>
<sec>
<title>7.1.1. Dataset</title>
<p>In MSCOCO (Lin et al., <xref ref-type="bibr" rid="B16">2014</xref>) and ImageNet (Russakovsky et al., <xref ref-type="bibr" rid="B31">2015</xref>), there are only a few indoor scenes and few objects associated with the introduced ten affordances. Therefore, we create a dataset to train and evaluate the proposed object affordance recognition architecture. The proposed dataset<xref ref-type="fn" rid="fn0001"><sup>1</sup></xref> is composed of images collected by a Kinect V2 sensor and indoor scenes from MSCOCO and ImageNet.</p>
<p>The dataset contains a total of 12,349 RGB images and 14,695 bounding box annotations for object affordance detection (of which 3,378 annotations are from MSCOCO and ImageNet). We randomly select 56.1% of the regions (8,250) for training, 22.1% (3,253) for validation, and the remaining 21.8% (3,192) for testing. <xref ref-type="fig" rid="F4">Figure 4</xref> shows some example images from the proposed dataset.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Example images from the proposed dataset. <bold>(Top)</bold> Images from MSCOCO. <bold>(Middle)</bold> Images from ImageNet. <bold>(Bottom)</bold> Images taken by Kinect V2.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0004.tif"/>
</fig>
<p>As mentioned above, we generalize ten affordances that are related to ordinary household objects. <xref ref-type="fig" rid="F5">Figure 5</xref> illustrates the affordance distribution in the presented dataset. Few <italic>writing</italic> and <italic>cleaning</italic> objects are included in the MSCOCO and ImageNet images, so we collected a large portion of the images for these two categories with a Kinect sensor.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>The affordance distribution in the presented dataset. Y-axis denotes the region number of each affordance.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0005.tif"/>
</fig>
</sec>
<sec>
<title>7.1.2. Experimental Setup and Results</title>
<p>We utilize the available source<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>, an implementation of RetinaNet (Lin et al., <xref ref-type="bibr" rid="B15">2020</xref>), and select ResNet-50 as the backbone to detect RoIs from RGB images. We extract the deep visual features from the last pooling layer of VGG19 (Simonyan and Zisserman, <xref ref-type="bibr" rid="B34">2014</xref>) trained on ImageNet (Russakovsky et al., <xref ref-type="bibr" rid="B31">2015</xref>) for each detected RoI. To produce uniformly sized feature maps for RoIs of different sizes, we rescale the detected RoIs to 224 &#x000D7; 224 pixels. Accordingly, the dimension of the extracted deep visual feature for each RoI is 7 &#x000D7; 7 &#x000D7; 512, i.e., <italic>f</italic><sub><italic>v</italic></sub> &#x02208; &#x0211D;<sup><italic>7&#x000D7;7&#x000D7;512</italic></sup>.</p>
<p>We adopt the deep texture encoding network (Zhang et al., <xref ref-type="bibr" rid="B42">2017</xref>) trained on the material database MINC2500 to generate deep texture representations. We extract the texture features from the texture encoding layer for RoIs. The output size of the texture encoding layer is 32 &#x000D7; 128, so the dimension of <bold>v</bold><sub><bold><italic>t</italic></bold></sub> is 1 &#x000D7; 4,096. We set the output size of the single perceptron <italic>d</italic><sub><italic>l</italic></sub> = 512, therefore, the dimension of the transformed texture vector <inline-formula><mml:math id="M13"><mml:mover accent="true"><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>v</mml:mtext></mml:mstyle></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:math></inline-formula><sub><italic>t</italic></sub> is 1 &#x000D7; 512. Through the tile operation, the dimension of the generated deep texture representation <italic>f</italic><sub><italic>t</italic></sub> &#x02208; &#x0211D;<sup>7 &#x000D7; 7 &#x000D7; 512</sup>.</p>
<p>For modeling convenience, we set the size of the embedding layer to <italic>p</italic> = 512, so the generated sparse representations of the deep visual feature and the deep texture feature, <italic>e</italic><sub><italic>v</italic></sub> and <italic>e</italic><sub><italic>t</italic></sub>, are vectors of dimension 1 &#x000D7; 512, and the dimension of the produced interaction matrix is <italic>k</italic><sub><italic>vt</italic></sub> &#x02208; &#x0211D;<sup><italic>512&#x000D7;512</italic></sup>. We tile the produced <italic>k</italic><sub><italic>vt</italic></sub> and feed it into the attention network, so the size of the generated attention weights is &#x003C4;<sub><italic>att</italic></sub> &#x02208; &#x0211D;<sup><italic>1&#x000D7;512</italic></sup>. Through the attention weights based dynamic fusion, the dimension of each produced fused feature <italic>f</italic><sub><italic>fuse</italic></sub> is 7 &#x000D7; 7 &#x000D7; 1,024, i.e., <italic>f</italic><sub><italic>fuse</italic></sub> &#x02208; &#x0211D;<sup><italic>7&#x000D7;7&#x000D7;1,024</italic></sup>.</p>
<p>The fused features are fed into the MLP to learn affordances. We train the MLP with a cross-entropy loss function, Rectified Linear Unit (ReLU) activation functions, and the Adam optimizer. The structure of the MLP is 50176-4096-1024-10. In practice, we adopt the standard error back-propagation algorithm to train the model. We set the learning rate to 0.0001 and the batch size to 32, and to prevent overfitting, we employ dropout to randomly drop 50% of the neurons during training.</p>
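<p>The classification head and the optimization settings described above translate directly into the following minimal PyTorch sketch; the data loader yielding batches of fused features and affordance labels is a placeholder.</p>
<preformat>
import torch
import torch.nn as nn

# MLP head with the structure 50176-4096-1024-10 described above.
mlp = nn.Sequential(
    nn.Flatten(),                 # 7 x 7 x 1024 fused feature flattened to 50,176 inputs
    nn.Linear(50176, 4096), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(4096, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 10),          # ten affordance classes
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=0.0001)

def train_epoch(loader):          # loader yields (fused_feature, affordance_label) batches of 32
    for f_fuse, label in loader:
        optimizer.zero_grad()
        loss = criterion(mlp(f_fuse), label)
        loss.backward()           # standard error back-propagation
        optimizer.step()
</preformat>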
<p>We train the architecture in PyTorch. After 100 epochs of training, the proposed network acquires 61.38% average accuracy on the test set. <xref ref-type="fig" rid="F6">Figure 6</xref> shows the confusion matrix of the results acquired by the presented network.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Generated confusion matrix of object affordance detection on the test set.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0006.tif"/>
</fig>
<p>From <xref ref-type="fig" rid="F6">Figure 6</xref>, the affordances <italic>writing, cleaning</italic>, and <italic>cooking</italic> have relatively low accuracy compared to the other affordances. The shapes and textures of the selected objects in these three categories differ significantly from each other. We therefore deduce that the primary cause of the low accuracy for these three affordances is this large variation in shape and texture, which makes the similarities between the deep features within one category difficult to generalize and learn. <xref ref-type="fig" rid="F7">Figure 7</xref> shows some example results of object affordance detection on the test set.</p>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Example results of object affordance detection on the test dataset. Raw images are collected from MSCOCO and ImageNet, used with permission.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0007.tif"/>
</fig>
</sec>
<sec>
<title>7.1.3. Ablation Study and Comparison Experiments</title>
<p>Besides validating the attention-based multi-visual feature fusion network on the presented dataset, we also adopt different feature fusion approaches and utilize different networks to compare the detection accuracy.</p>
<p><bold>VGG19 Deep Features</bold>: To verify the effectiveness of multi-visual feature fusion for object affordance learning, we compare the results generated by the attention-based fusion network with a model trained on the deep visual features extracted from VGG19 alone. In this case, the deep features with a shape of 7 &#x000D7; 7 &#x000D7; 512 are fed into an MLP with the structure 25088-4096-1024-10 to learn the affordances. After 100 epochs of training, the generated model acquires 55.54% average accuracy on the test set.</p>
<p><bold>Naive Concatenation</bold>: To validate the performance of the attention-based fusion scheme, we use naive concatenation to join the deep visual features and the deep texture features into fused representations of the multi-visual features. The concatenated features have a shape of 7 &#x000D7; 7 &#x000D7; 1,024 and are fed into an MLP with the same structure as in the multi-visual fusion architecture to recognize affordances. After 100 epochs, the generated model acquires 58.21% average accuracy on the test set.</p>
<p><bold>RetinaNet</bold>: We directly train RetinaNet (Lin et al., <xref ref-type="bibr" rid="B15">2020</xref>) (available source<xref ref-type="fn" rid="fn0002"><sup>2</sup></xref>) on the proposed dataset. For a fair comparison, the backbone is also ResNet 50. After 100 epochs of training, the generated model obtains 58.92% average accuracy on the test set.</p>
<p><bold>YOLO V3</bold>: We also adopt the original pretrained weights to train YOLO V3 (Redmon and Farhadi, <xref ref-type="bibr" rid="B27">2018</xref>) (available code<xref ref-type="fn" rid="fn0003"><sup>3</sup></xref>) on the dataset. After 100 epochs of training, the YOLO V3 model obtains 49.63% average accuracy on the test set. <xref ref-type="table" rid="T1">Table 1</xref> lists the results acquired by these different networks, deep features, and feature fusion approaches.</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Object affordance detection results acquired by different networks, deep features, and feature fusion methods.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>Attention multi-visual features fusion</bold></th>
<th valign="top" align="center"><bold>VGG deep features</bold></th>
<th valign="top" align="center"><bold>Naive concatenation</bold></th>
<th valign="top" align="center"><bold>RetinaNet</bold></th>
<th valign="top" align="center"><bold>YOLO V3</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">calling</td>
<td valign="top" align="center">0.9036</td>
<td valign="top" align="center"><bold>0.9096</bold></td>
<td valign="top" align="center">0.8723</td>
<td valign="top" align="center">0.7747</td>
<td valign="top" align="center">0.5783</td>
</tr>
<tr>
<td valign="top" align="left">drinkingI</td>
<td valign="top" align="center"><bold>0.8991</bold></td>
<td valign="top" align="center">0.7785</td>
<td valign="top" align="center">0.8195</td>
<td valign="top" align="center">0.7806</td>
<td valign="top" align="center">0.4771</td>
</tr>
<tr>
<td valign="top" align="left">eatingII</td>
<td valign="top" align="center"><bold>0.7943</bold></td>
<td valign="top" align="center">0.7658</td>
<td valign="top" align="center">0.7569</td>
<td valign="top" align="center">0.6829</td>
<td valign="top" align="center">0.5696</td>
</tr>
<tr>
<td valign="top" align="left">playing</td>
<td valign="top" align="center">0.5676</td>
<td valign="top" align="center">0.4791</td>
<td valign="top" align="center">0.5305</td>
<td valign="top" align="center"><bold>0.8305</bold></td>
<td valign="top" align="center">0.7871</td>
</tr>
<tr>
<td valign="top" align="left">reading</td>
<td valign="top" align="center">0.5148</td>
<td valign="top" align="center">0.4938</td>
<td valign="top" align="center">0.5297</td>
<td valign="top" align="center"><bold>0.6424</bold></td>
<td valign="top" align="center">0.652</td>
</tr>
<tr>
<td valign="top" align="left">writing</td>
<td valign="top" align="center"><bold>0.2995</bold></td>
<td valign="top" align="center">0.2028</td>
<td valign="top" align="center">0.286</td>
<td valign="top" align="center">0.2628</td>
<td valign="top" align="center">0.2028</td>
</tr>
<tr>
<td valign="top" align="left">cleaning</td>
<td valign="top" align="center">0.1875</td>
<td valign="top" align="center">0.1625</td>
<td valign="top" align="center">0.175</td>
<td valign="top" align="center"><bold>0.375</bold></td>
<td valign="top" align="center">0.3327</td>
</tr>
<tr>
<td valign="top" align="left">drinkingII</td>
<td valign="top" align="center"><bold>0.7838</bold></td>
<td valign="top" align="center">0.7627</td>
<td valign="top" align="center">0.7248</td>
<td valign="top" align="center">0.6128</td>
<td valign="top" align="center">0.5824</td>
</tr>
<tr>
<td valign="top" align="left">eatingI</td>
<td valign="top" align="center"><bold>0.8162</bold></td>
<td valign="top" align="center">0.7103</td>
<td valign="top" align="center">0.7049</td>
<td valign="top" align="center">0.6738</td>
<td valign="top" align="center">0.4837</td>
</tr>
<tr style="border-bottom: thin solid #000000;">
<td valign="top" align="center">cooking</td>
<td valign="top" align="center">0.3719</td>
<td valign="top" align="center">0.2893</td>
<td valign="top" align="center"><bold>0.4214</bold></td>
<td valign="top" align="center">0.2562</td>
<td valign="top" align="center">0.2968</td>
</tr> <tr style="border-bottom: thin solid #000000;">
<td valign="top" align="center"><bold>Average</bold></td>
<td valign="top" align="center"><bold>0.6138</bold></td>
<td valign="top" align="center">0.5554</td>
<td valign="top" align="center">0.5821</td>
<td valign="top" align="center">0.5892</td>
<td valign="top" align="center">0.4963</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The bold value in each row indicates the best accuracy acquired for that affordance</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>From the experimental results, it is clear that the attention-based multi-visual feature fusion network achieves higher accuracy than the VGG deep features and the naive concatenation approach. Although RetinaNet obtains 58.92% average accuracy, our attention-based fusion network acquires the best detection accuracy on five affordance categories and the best average accuracy on the test set. The results demonstrate the effectiveness of the multi-visual features and the attention-based fusion network for learning object affordances.</p>
</sec>
</sec>
<sec>
<title>7.2. Intention-Related Natural Language Queries Grounding</title>
<p>In order to validate the performance of the intention-related natural language grounding architecture, we select 100 images from the introduced test dataset. To ensure the diversity of the intention-related queries, we collect 150 instructions by showing 10 participants different scenarios and asking them to give one or two queries for each image. We use the intention semantic extraction module to extract semantic words from these natural language sentences; the presented extraction module acquires 90.67% accuracy (136 correct samples out of 150 sentences).</p>
<p>We utilize the collected images and queries to test the effectiveness of the grounding architecture. <xref ref-type="fig" rid="F8">Figure 8</xref> lists some example results of intention-related natural language query grounding. Through analyzing the failed target groundings, we find that the performance of the grounding architecture is strongly influenced by the affordance detection results.</p>
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Example results of intention-related natural language query grounding. The first row lists example results of object affordance detection. The bar charts in the second row show the weights of each word in the given natural language instructions, acquired by the intention semantic extraction module. &#x0003C;s&#x0003E; and &#x0003C;/s&#x0003E; represent the beginning-of-sentence token and the end-of-sentence token, respectively. The third row includes the natural language queries; the extracted intention semantic words are highlighted in the color of the corresponding detected affordances.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0008.tif"/>
</fig>
</sec>
<sec>
<title>7.3. Robotic Applications</title>
<p>We also conduct several spoken intention-related instruction grounding and target object grasping experiments on a platform consisting of a UR5 robotic arm and a Robotiq 3-finger adaptive robot gripper. We first train an online speech recognizer with Kaldi (Povey et al., <xref ref-type="bibr" rid="B26">2011</xref>) and translate the spoken instructions into text with this recognizer; we then ground the spoken intention-related queries via the introduced grounding architecture.</p>
<p>In order to complete target object grasping, we combine the bounding box values of the grounded target objects with depth data acquired by a Kinect V2 camera to locate the targets in 3D environments. Furthermore, we adopt the model from our previous work (Liang et al., <xref ref-type="bibr" rid="B14">2019</xref>) to learn the best grasping poses. <xref ref-type="fig" rid="F9">Figure 9</xref> shows some example results of spoken instruction grounding, target object point cloud segmentation, and learned target object grasping poses. The video of the robotic applications can be found at: <ext-link ext-link-type="uri" xlink:href="https://www.youtube.com/watch?v=rchZeoAagxM">https://www.youtube.com/watch?v=rchZeoAagxM</ext-link>.</p>
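<p>For reference, this 3D localization can be sketched with a standard pinhole back-projection; the intrinsic parameters below are placeholders that would be replaced by the calibrated values of the Kinect V2, and the choice of the bounding box center as the query pixel is an illustrative simplification:</p>
<preformat>
# Sketch (pinhole camera model, placeholder intrinsics): locate a grounded
# target in 3D by combining its bounding box with the aligned depth image.
import numpy as np

# Placeholder intrinsics: focal lengths fx, fy and principal point cx, cy (pixels).
fx, fy, cx, cy = 365.0, 365.0, 256.0, 212.0

def locate_target(bbox, depth_image):
    """bbox: (x_min, y_min, x_max, y_max) in pixels; depth_image in meters."""
    x_min, y_min, x_max, y_max = bbox
    u = int((x_min + x_max) / 2)             # bounding box center pixel
    v = int((y_min + y_max) / 2)
    z = float(depth_image[v, u])             # depth at the center pixel
    x = (u - cx) * z / fx                    # back-project into the camera frame
    y = (v - cy) * z / fy
    return np.array([x, y, z])
</preformat>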
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>Example results of spoken natural language query grounding, point cloud segmentation, and learned target object grasping poses. The rectangles in the first row list the natural language queries; the extracted intention semantic words are highlighted in the corresponding color. The second row shows the results of the target object grounding. The images in the third row show the point cloud segmentations obtained by combining the bounding box values of the grounded targets with the depth data acquired by a Kinect camera; the red point clouds are the segmentations of the grounded target objects. The images in the fourth row show the grasping scenarios in MoveIt; the red grippers represent the learned best grasping poses.</p></caption>
<graphic xlink:href="fnbot-14-00026-g0009.tif"/>
</fig>
</sec>
</sec>
<sec id="s8">
<title>8. Conclusion and Future Work</title>
<p>We proposed an architecture that integrates an object affordance detection network with an intention semantic extraction module to ground intention-related natural language queries. In contrast to existing affordance detection frameworks, the proposed affordance detection network fuses deep visual features and deep texture features to recognize object affordances from RGB images. We fused the multi-visual features via an attention-based dynamic fusion architecture, which takes into account the interaction of the multi-visual features, preserves the complementary nature of the features extracted from different networks, and avoids producing information redundancy during feature fusion. We trained the object affordance detection network on a self-built dataset, and we conducted extensive experiments to validate the performance of the attention-based multi-visual feature fusion for learning object affordances.</p>
<p>Moreover, we presented an intention-related natural language grounding architecture by fusing the object affordance detection with intention semantic extraction. We evaluated this grounding architecture, and the experimental results demonstrate its effectiveness. We also integrated the grounding architecture with an online speech recognizer to ground spoken intention-related natural language instructions, and we implemented target object grasping experiments on a robotic platform.</p>
<p>Currently, the introduced affordance detection network learns ten affordances by fusing the deep visual features and the deep texture features. In the future, we will apply meta-learning to learn more affordances from a smaller number of annotated images and develop a network-based framework to learn the different contributions of the different features for object affordance learning. Additionally, we will integrate image captioning methodology with affordances to generate affordance-aware expressions for each detected region within working scenarios.</p>
</sec>
<sec sec-type="data-availability-statement" id="s9">
<title>Data Availability Statement</title>
<p>All datasets generated for this study are included in the article/<xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>.</p>
</sec>
<sec id="s10">
<title>Author Contributions</title>
<p>JM designed the study, wrote the initial draft of the manuscript, trained the object affordance detection network, completed the intention-related natural language grounding architecture, and designed and implemented the validation experiments. HL completed the point cloud segmentation and grasping trajectory generation. JM and HL conducted the spoken instruction grounding experiments on the robotic platform. ST and QL provided critical revision advice for the manuscript. All authors contributed to the final paper revision.</p>
</sec>
<sec id="s11">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</body>
<back><sec sec-type="supplementary-material" id="s12">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fnbot.2020.00026/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fnbot.2020.00026/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Video_1.MP4" id="SM1" mimetype="video/mp4" xmlns:xlink="http://www.w3.org/1999/xlink">
<label>Supplementary Video 1</label>
<caption><p>Robotic applications based on the proposed intention-related natural language grounding architecture.</p></caption></supplementary-material>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ahn</surname> <given-names>H.</given-names></name> <name><surname>Choi</surname> <given-names>S.</given-names></name> <name><surname>Kim</surname> <given-names>N.</given-names></name> <name><surname>Cha</surname> <given-names>G.</given-names></name> <name><surname>Oh</surname> <given-names>S.</given-names></name></person-group> (<year>2018</year>). <article-title>Interactive text2pickup networks for natural language-based human-robot collaboration</article-title>. <source>IEEE Robot. Autom. Lett.</source> <volume>3</volume>, <fpage>3308</fpage>&#x02013;<lpage>3315</lpage>. <pub-id pub-id-type="doi">10.1109/LRA.2018.2852786</pub-id></citation></ref>
<ref id="B2">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Alomari</surname> <given-names>M.</given-names></name> <name><surname>Duckworth</surname> <given-names>P.</given-names></name> <name><surname>Hawasly</surname> <given-names>M.</given-names></name> <name><surname>Hogg</surname> <given-names>D. C.</given-names></name> <name><surname>Cohn</surname> <given-names>A. G.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Natural language grounding and grammar induction for robotic manipulation commands,&#x0201D;</article-title> in <source>Proceedings of the First Workshop on Language Grounding for Robotics</source> (<publisher-loc>Vancouver, BC</publisher-loc>), <fpage>35</fpage>&#x02013;<lpage>43</lpage>. <pub-id pub-id-type="doi">10.18653/v1/W17-2805</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bahdanau</surname> <given-names>D.</given-names></name> <name><surname>Cho</surname> <given-names>K.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Neural machine translation by jointly learning to align and translate,&#x0201D;</article-title> in <source>International Conference on learning and Representation (ICLR)</source> (<publisher-loc>San Diego, CA</publisher-loc>).</citation></ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bastianelli</surname> <given-names>E.</given-names></name> <name><surname>Croce</surname> <given-names>D.</given-names></name> <name><surname>Vanzo</surname> <given-names>A.</given-names></name> <name><surname>Basili</surname> <given-names>R.</given-names></name> <name><surname>Nardi</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;A discriminative approach to grounded spoken language understanding in interactive robotics,&#x0201D;</article-title> in <source>International Joint Conferences on Artificial Intelligence (IJCAI)</source> (<publisher-loc>New York, NY</publisher-loc>), <fpage>2747</fpage>&#x02013;<lpage>2753</lpage>.</citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bell</surname> <given-names>S.</given-names></name> <name><surname>Upchurch</surname> <given-names>P.</given-names></name> <name><surname>Snavely</surname> <given-names>N.</given-names></name> <name><surname>Bala</surname> <given-names>K.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Material recognition in the wild with the materials in context database,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>3479</fpage>&#x02013;<lpage>3487</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2015.7298970</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ben-Younes</surname> <given-names>H.</given-names></name> <name><surname>Cadene</surname> <given-names>R.</given-names></name> <name><surname>Cord</surname> <given-names>M.</given-names></name> <name><surname>Thome</surname> <given-names>N.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Mutan: multimodal tucker fusion for visual question answering,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision (ICCV)</source> (<publisher-loc>Venice</publisher-loc>), <fpage>2612</fpage>&#x02013;<lpage>2620</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.285</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Celikkanat</surname> <given-names>H.</given-names></name> <name><surname>Orhan</surname> <given-names>G.</given-names></name> <name><surname>Kalkan</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>A probabilistic concept web on a humanoid robot</article-title>. <source>IEEE Trans. Auton. Mental Dev.</source> <volume>7</volume>, <fpage>92</fpage>&#x02013;<lpage>106</lpage>. <pub-id pub-id-type="doi">10.1109/TAMD.2015.2418678</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chen</surname> <given-names>K.</given-names></name> <name><surname>Kovvuri</surname> <given-names>R.</given-names></name> <name><surname>Nevatia</surname> <given-names>R.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Query-guided regression network with context policy for phrase grounding,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision (ICCV)</source>, (Venice) <fpage>824</fpage>&#x02013;<lpage>832</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.95</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cimpoi</surname> <given-names>M.</given-names></name> <name><surname>Maji</surname> <given-names>S.</given-names></name> <name><surname>Vedaldi</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Deep filter banks for texture recognition and segmentation,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Boston, MA</publisher-loc>), <fpage>3828</fpage>&#x02013;<lpage>3836</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2015.7299007</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Conneau</surname> <given-names>A.</given-names></name> <name><surname>Kiela</surname> <given-names>D.</given-names></name> <name><surname>Schwenk</surname> <given-names>H.</given-names></name> <name><surname>Barrault</surname> <given-names>L.</given-names></name> <name><surname>Bordes</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Supervised learning of universal sentence representations from natural language inference data,&#x0201D;</article-title> in <source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (<publisher-loc>Copenhagen</publisher-loc>), <fpage>670</fpage>&#x02013;<lpage>680</lpage>. <pub-id pub-id-type="doi">10.18653/v1/D17-1070</pub-id></citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dehban</surname> <given-names>A.</given-names></name> <name><surname>Jamone</surname> <given-names>L.</given-names></name> <name><surname>Kampff</surname> <given-names>A. R.</given-names></name> <name><surname>Santos-Victor</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Denoising auto-encoders for learning of objects and tools affordances in continuous space,&#x0201D;</article-title> in <source>2016 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>Stockholm</publisher-loc>), <fpage>4866</fpage>&#x02013;<lpage>4871</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2016.7487691</pub-id></citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Johnson</surname> <given-names>J.</given-names></name> <name><surname>Karpathy</surname> <given-names>A.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Densecap: fully convolutional localization networks for dense captioning,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>4565</fpage>&#x02013;<lpage>4574</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.494</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>D. I.</given-names></name> <name><surname>Sukhatme</surname> <given-names>G. S.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Semantic labeling of 3d point clouds with object affordance for robot manipulation,&#x0201D;</article-title> in <source>2014 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>Hong Kong</publisher-loc>), <fpage>5578</fpage>&#x02013;<lpage>5584</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2014.6907679</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>H.</given-names></name> <name><surname>Ma</surname> <given-names>X.</given-names></name> <name><surname>Li</surname> <given-names>S.</given-names></name> <name><surname>G&#x000F6;rner</surname> <given-names>M.</given-names></name> <name><surname>Tang</surname> <given-names>S.</given-names></name> <name><surname>Fang</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Pointnetgpd:1 detecting grasp configurations from point sets,&#x0201D;</article-title> in <source>International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>3629</fpage>&#x02013;<lpage>3635</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2019.8794435</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.</given-names></name> <name><surname>Goyal</surname> <given-names>P.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name></person-group> (<year>2020</year>). <article-title>Focal loss for dense object detection</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>42</volume>, <fpage>318</fpage>&#x02013;<lpage>327</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.324</pub-id><pub-id pub-id-type="pmid">30040631</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Ramanan</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Microsoft coco: common objects in context,&#x0201D;</article-title> in <source>European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Zurich</publisher-loc>), <fpage>740</fpage>&#x02013;<lpage>755</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-10602-1_48</pub-id></citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mi</surname> <given-names>J.</given-names></name> <name><surname>Tang</surname> <given-names>S.</given-names></name> <name><surname>Deng</surname> <given-names>Z.</given-names></name> <name><surname>Goerner</surname> <given-names>M.</given-names></name> <name><surname>Zhang</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Object affordance based multimodal fusion for natural human-robot interaction</article-title>. <source>Cogn. Syst. Res.</source> <volume>54</volume>, <fpage>128</fpage>&#x02013;<lpage>137</lpage>. <pub-id pub-id-type="doi">10.1016/j.cogsys.2018.12.010</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Myers</surname> <given-names>A.</given-names></name> <name><surname>Teo</surname> <given-names>C. L.</given-names></name> <name><surname>Ferm&#x000FC;ller</surname> <given-names>C.</given-names></name> <name><surname>Aloimonos</surname> <given-names>Y.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Affordance detection of tool parts from geometric features,&#x0201D;</article-title> in <source>2015 IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>Seattle, WA</publisher-loc>), <fpage>1374</fpage>&#x02013;<lpage>1381</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2015.7139369</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Newell</surname> <given-names>A.</given-names></name> <name><surname>Yang</surname> <given-names>K.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Stacked hourglass networks for human pose estimation,&#x0201D;</article-title> in <source>European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Amsterdam</publisher-loc>), <fpage>483</fpage>&#x02013;<lpage>499</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46484-8_29</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nguyen</surname> <given-names>A.</given-names></name> <name><surname>Kanoulas</surname> <given-names>D.</given-names></name> <name><surname>Caldwell</surname> <given-names>D. G.</given-names></name> <name><surname>Tsagarakis</surname> <given-names>N. G.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Detecting object affordances with convolutional neural networks,&#x0201D;</article-title> in <source>2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source> (<publisher-loc>Daejeon</publisher-loc>), <fpage>2765</fpage>&#x02013;<lpage>2770</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2016.7759429</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Nguyen</surname> <given-names>A.</given-names></name> <name><surname>Kanoulas</surname> <given-names>D.</given-names></name> <name><surname>Caldwell</surname> <given-names>D. G.</given-names></name> <name><surname>Tsagarakis</surname> <given-names>N. G.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Object-based affordances detection with convolutional neural networks and dense conditional random fields,&#x0201D;</article-title> in <source>2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source> (<publisher-loc>Vancouver, BC</publisher-loc>), <fpage>5908</fpage>&#x02013;<lpage>5915</lpage>. <pub-id pub-id-type="doi">10.1109/IROS.2017.8206484</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Norman</surname> <given-names>D.</given-names></name></person-group> (<year>1988</year>). <source>The Design of Everyday Things</source>. <publisher-loc>New York, NY</publisher-loc>: <publisher-name>Basic Books</publisher-name>.</citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Paul</surname> <given-names>R.</given-names></name> <name><surname>Arkin</surname> <given-names>J.</given-names></name> <name><surname>Aksaray</surname> <given-names>D.</given-names></name> <name><surname>Roy</surname> <given-names>N.</given-names></name> <name><surname>Howard</surname> <given-names>T. M.</given-names></name></person-group> (<year>2018</year>). <article-title>Efficient grounding of abstract spatial concepts for natural language interaction with robot platforms</article-title>. <source>Int. J. Robot. Res.</source> <volume>37</volume>, <fpage>1269</fpage>&#x02013;<lpage>1299</lpage>. <pub-id pub-id-type="doi">10.1177/0278364918777627</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pennington</surname> <given-names>J.</given-names></name> <name><surname>Socher</surname> <given-names>R.</given-names></name> <name><surname>Manning</surname> <given-names>C.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Glove: global vectors for word representation,&#x0201D;</article-title> in <source>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source> (<publisher-loc>Doha</publisher-loc>), <fpage>1532</fpage>&#x02013;<lpage>1543</lpage>. <pub-id pub-id-type="doi">10.3115/v1/D14-1162</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Perkins</surname> <given-names>J.</given-names></name></person-group> (<year>2010</year>). <source>Python Text Processing With NLTK 2.0 Cookbook</source>. <publisher-loc>Birmingham</publisher-loc>: <publisher-name>Packt Publishing Ltd</publisher-name>.</citation></ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Povey</surname> <given-names>D.</given-names></name> <name><surname>Ghoshal</surname> <given-names>A.</given-names></name> <name><surname>Boulianne</surname> <given-names>G.</given-names></name> <name><surname>Burget</surname> <given-names>L.</given-names></name> <name><surname>Glembek</surname> <given-names>O.</given-names></name> <name><surname>Goel</surname> <given-names>N.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>&#x0201C;The kaldi speech recognition toolkit,&#x0201D;</article-title> in <source>IEEE 2011 Workshop on Automatic Speech Recognition and Understanding</source>.</citation></ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>Yolov3: an incremental improvement</article-title>. <source>arXiv</source> 1804.02767.</citation></ref>
<ref id="B28">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rendle</surname> <given-names>S.</given-names></name></person-group> (<year>2010</year>). <article-title>&#x0201C;Factorization machines,&#x0201D;</article-title> in <source>IEEE International Conference on Data Mining (ICDM)</source> (<publisher-loc>Sydney, NSW</publisher-loc>), <fpage>995</fpage>&#x02013;<lpage>1000</lpage>. <pub-id pub-id-type="doi">10.1109/ICDM.2010.127</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Roesler</surname> <given-names>O.</given-names></name> <name><surname>Aly</surname> <given-names>A.</given-names></name> <name><surname>Taniguchi</surname> <given-names>T.</given-names></name> <name><surname>Hayashi</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Evaluation of word representations in grounding natural language instructions through computational human-robot interaction,&#x0201D;</article-title> in <source>14th ACM/IEEE International Conference on Human-Robot Interaction (HRI)</source> (<publisher-loc>Daegu</publisher-loc>), <fpage>307</fpage>&#x02013;<lpage>316</lpage>. <pub-id pub-id-type="doi">10.1109/HRI.2019.8673121</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Roy</surname> <given-names>A.</given-names></name> <name><surname>Todorovic</surname> <given-names>S.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;A multi-scale cnn for affordance segmentation in RGB images,&#x0201D;</article-title> in <source>European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Amsterdam</publisher-loc>), <fpage>186</fpage>&#x02013;<lpage>201</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46493-0_12</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Satheesh</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Imagenet large scale visual recognition challenge</article-title>. <source>Int. J. Comput. Vis.</source> <volume>115</volume>, <fpage>211</fpage>&#x02013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sawatzky</surname> <given-names>J.</given-names></name> <name><surname>Srikantha</surname> <given-names>A.</given-names></name> <name><surname>Gall</surname> <given-names>J.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Weakly supervised affordance detection,&#x0201D; 1in <italic>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</italic></article-title>, <fpage>5197</fpage>&#x02013;<lpage>5206</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.552</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Shridhar</surname> <given-names>M.</given-names></name> <name><surname>Hsu</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Interactive visual grounding of referring expressions for human-robot interaction,&#x0201D;</article-title> in <source>Proceedings of Robotics: Science &#x00026; Systems (RSS)</source> (<publisher-loc>Pittsburgh, PA</publisher-loc>). <pub-id pub-id-type="doi">10.15607/RSS.2018.XIV.028</pub-id></citation></ref>
<ref id="B34">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <source>arXiv</source> abs/1409.1556.</citation></ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>H. O.</given-names></name> <name><surname>Fritz</surname> <given-names>M.</given-names></name> <name><surname>Goehring</surname> <given-names>D.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>Learning to detect visual grasp affordance</article-title>. <source>IEEE Trans. Autom. Sci. Eng.</source> <volume>13</volume>, <fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1109/TASE.2015.2396014</pub-id></citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>Y.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Lin</surname> <given-names>Y.</given-names></name></person-group> (<year>2014</year>). <article-title>Object-object interaction affordance learning</article-title>. <source>Robot. Auton. Syst.</source> <volume>62</volume>, <fpage>487</fpage>&#x02013;<lpage>496</lpage>. <pub-id pub-id-type="doi">10.1016/j.robot.2013.12.005</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Thermos</surname> <given-names>S.</given-names></name> <name><surname>Papadopoulos</surname> <given-names>G. T.</given-names></name> <name><surname>Daras</surname> <given-names>P.</given-names></name> <name><surname>Potamianos</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Deep affordance-grounded sensorimotor object recognition,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <fpage>49</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.13</pub-id></citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thomason</surname> <given-names>J.</given-names></name> <name><surname>Padmakumar</surname> <given-names>A.</given-names></name> <name><surname>Sinapov</surname> <given-names>J.</given-names></name> <name><surname>Hart</surname> <given-names>J.</given-names></name> <name><surname>Stone</surname> <given-names>P.</given-names></name> <name><surname>Mooney</surname> <given-names>R. J.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Opportunistic active learning for grounding natural language descriptions,&#x0201D;</article-title> in <source>Conference on Robot Learning</source> (<publisher-loc>Mountain View, CA</publisher-loc>), <fpage>67</fpage>&#x02013;<lpage>76</lpage>.</citation></ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Thomason</surname> <given-names>J.</given-names></name> <name><surname>Padmakumar</surname> <given-names>A.</given-names></name> <name><surname>Sinapov</surname> <given-names>J.</given-names></name> <name><surname>Walker</surname> <given-names>N.</given-names></name> <name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Yedidsion</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2019</year>). <article-title>&#x0201C;Improving grounded natural language understanding through human-robot dialog,&#x0201D;</article-title> in <source>IEEE International Conference on Robotics and Automation (ICRA)</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>6934</fpage>&#x02013;<lpage>6941</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2019.8794287</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>L.</given-names></name> <name><surname>Poirson</surname> <given-names>P.</given-names></name> <name><surname>Yang</surname> <given-names>S.</given-names></name> <name><surname>Berg</surname> <given-names>A. C.</given-names></name> <name><surname>Berg</surname> <given-names>T. L.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;Modeling context in referring expressions,&#x0201D;</article-title> in <source>European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Amsterdam</publisher-loc>), <fpage>69</fpage>&#x02013;<lpage>85</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46475-6_5</pub-id></citation></ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yu</surname> <given-names>Z.</given-names></name> <name><surname>Sangwook</surname> <given-names>K.</given-names></name> <name><surname>Mallipeddi</surname> <given-names>R.</given-names></name> <name><surname>Lee</surname> <given-names>M.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Human intention understanding based on object affordance and action classification,&#x0201D;</article-title> in <source>International Joint Conference on Neural Networks (IJCNN)</source> (<publisher-loc>Killarney</publisher-loc>: <publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/IJCNN.2015.7280587</pub-id></citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>H.</given-names></name> <name><surname>Xue</surname> <given-names>J.</given-names></name> <name><surname>Dana</surname> <given-names>K.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Deep ten: texture encoding network,&#x0201D;</article-title> in <source>Proceedings 1of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>, <fpage>2896</fpage>&#x02013;<lpage>2905</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.309</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhu</surname> <given-names>Y.</given-names></name> <name><surname>Fathi</surname> <given-names>A.</given-names></name> <name><surname>Fei-Fei</surname> <given-names>L.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Reasoning about object affordances in a knowledge base representation,&#x0201D;</article-title> in <source>European Conference on Computer Vision (ECCV)</source> (<publisher-loc>Zurich</publisher-loc>), <fpage>408</fpage>&#x02013;<lpage>424</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-10605-2_27</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn id="fn0001"><p><sup>1</sup><ext-link ext-link-type="uri" xlink:href="https://tams.informatik.uni-hamburg.de/research/datasets/index.php">https://tams.informatik.uni-hamburg.de/research/datasets/index.php</ext-link></p></fn>
<fn id="fn0002"><p><sup>2</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/fizyr/keras-retinanet">https://github.com/fizyr/keras-retinanet</ext-link></p></fn>
<fn id="fn0003"><p><sup>3</sup><ext-link ext-link-type="uri" xlink:href="https://github.com/qqwweee/keras-yolo3">https://github.com/qqwweee/keras-yolo3</ext-link></p></fn>
</fn-group>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This work was partly funded by the German Research Foundation (DFG) and National Science Foundation (NSFC) in project Crossmodal Learning under contract Sonderforschungsbereich Transregio 169, the DAAD German Academic Exchange Service under CASY project, and the National Natural Science Foundation of China (61773083).</p>
</fn>
</fn-group>
</back>
</article> 