<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article article-type="methods-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Robot. AI</journal-id>
<journal-title>Frontiers in Robotics and AI</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Robot. AI</abbrev-journal-title>
<issn pub-type="epub">2296-9144</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">833173</article-id>
<article-id pub-id-type="doi">10.3389/frobt.2022.833173</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Robotics and AI</subject>
<subj-group>
<subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Visual state estimation in unseen environments through domain adaptation and metric learning</article-title>
<alt-title alt-title-type="left-running-head">G&#xfc;ler et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frobt.2022.833173">10.3389/frobt.2022.833173</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>G&#xfc;ler</surname>
<given-names>P&#xfc;ren</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<xref ref-type="fn" rid="fn2">
<sup>&#x2020;</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/771569/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Stork</surname>
<given-names>Johannes A.</given-names>
</name>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Stoyanov</surname>
<given-names>Todor</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1654221/overview"/>
</contrib>
</contrib-group>
<aff>
<institution>Autonomous Mobile Manipulation Lab</institution>, <institution>&#xd6;rebro University</institution>, <addr-line>&#xd6;rebro</addr-line>, <country>Sweden</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/47417/overview">Lucia Beccai</ext-link>, Italian Institute of Technology (IIT), Italy</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1058968/overview">Aysegul Ucar</ext-link>, Firat University, Turkey</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1319341/overview">Ioanis Kostavelis</ext-link>, Centre for Research and Technology Hellas (CERTH), Greece</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: P&#xfc;ren G&#xfc;ler, <email>puren.guler@gmail.com</email>
</corresp>
<fn fn-type="present-address" id="fn2">
<label>
<sup>&#x2020;</sup>
</label>
<p>
<bold>Present address:</bold> P&#x00FC;ren G&#x00FC;ler, Ericsson Research, Lund, Sweden</p>
</fn>
<fn fn-type="other">
<p>This article was submitted to Robot and Machine Vision, a section of the journal Frontiers in Robotics and AI</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>19</day>
<month>08</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>9</volume>
<elocation-id>833173</elocation-id>
<history>
<date date-type="received">
<day>10</day>
<month>12</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>07</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 G&#xfc;ler, Stork and Stoyanov.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>G&#xfc;ler, Stork and Stoyanov</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>In robotics, deep learning models are used in many visual perception applications, including the tracking, detection and pose estimation of robotic manipulators. State-of-the-art methods, however, are conditioned on the availability of annotated training data, which may in practice be costly or even impossible to collect. Domain augmentation is one popular method to improve generalization to out-of-domain data by extending the training data set with predefined sources of variation, unrelated to the primary task. While this typically results in better performance on the target domain, it is not always clear that the trained models are capable of accurately separating the signals relevant to solving the task (e.g., appearance of an object of interest) from those associated with differences between the domains (e.g., lighting conditions). In this work we propose to improve the generalization capabilities of models trained with domain augmentation by formulating a secondary structured metric-space learning objective. We concentrate on one particularly challenging domain transfer task&#x2014;visual state estimation for an articulated underground mining machine&#x2014;and demonstrate the benefits of imposing structure on the encoding space. Our results indicate that the proposed method has the potential to transfer feature embeddings learned on the source domain, through a suitably designed augmentation procedure, to an unseen target domain.</p>
</abstract>
<kwd-group>
<kwd>articulated pose estimation</kwd>
<kwd>joint state estimation</kwd>
<kwd>deep metric learning</kwd>
<kwd>domain augmentation</kwd>
<kwd>triplet loss</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<sec id="s1-1">
<title>1.1 Motivation</title>
<p>In recent years, deep learning models have increasingly been applied to solve visual perception problems in robotics. For structured environments such as factories or warehouses that do not change dramatically over time, training such models and obtaining good results on test data is feasible. However, for fully autonomous operation, these methods should also work under test conditions in unstructured and unpredictable environments&#x2014;e.g., in scenes with continuously changing background, illumination or appearance. Data augmentation is one common approach to enhancing the ability of deep visual models to cope with unexpected changes in the environment. The basic principle is to increase robustness by introducing synthetic changes to the source domain during training, such as changing the background or texture, cropping images, or introducing artificial camera noise. Yet, simply adding more samples to the training data may not be enough to cover every scenario that can occur during testing<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref>.</p>
<p>To address the discrepancies between domains, models need to learn which features in the data are task-relevant. Data augmentation helps accomplish this by simply showing the model more varied data during training. There is, however, an alternative: explicitly supervising which training samples the model should consider similar or dissimilar. Metric learning is one such alternative, which aims to find an appropriate way to structure the similarities and differences in the underlying data (<xref ref-type="bibr" rid="B6">Kaya and Bilge (2019)</xref>). Metric learning, however, typically requires annotated data from all potential target domains during training (e.g., detecting faces from different viewpoints (<xref ref-type="bibr" rid="B19">Schroff et al. (2015)</xref>)). Collecting and labeling sufficient data from all potential domains is at best time consuming and often impossible in a robotics scenario. In this work we explore the possibility of combining the two approaches: domain data augmentation and metric learning. This allows us to use a metric learning objective without access to labeled data from the target domain, making a principled approach to domain augmentation possible.</p>
<p>The target application we investigate in this work is the visual state estimation of an articulated mining machine (<xref ref-type="fig" rid="F1">Figure 1A</xref>). Kinematic chains, such as traditional robot manipulators and the booms of our mining machine, are composed of individual links coupled with actuators. The state estimation problem is thus typically solved by measuring angles between links through joint encoder sensors. However, encoders can cause erroneous pose estimates due to sensor noise, cable strain, deflection or vibration of the manipulator. Drilling rigs that are used in mining and construction operate in dangerous and highly corrosive environments (<xref ref-type="fig" rid="F1">Figure 1B</xref>). Hence, encoder sensors and data cables are subject to high wear and tear, motivating the need for a redundant visual state estimation system (<xref ref-type="fig" rid="F1">Figures 1C&#x2013;E</xref>).</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>
<bold>(A)</bold> A heavy-duty drilling machine with an articulated manipulator produced by the mining equipment manufacturer Epiroc <xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> <bold>(B)</bold> Cabin view from the same machine operating in a mine. <bold>(C&#x2013;E)</bold> Epiroc testing warehouse. We collect a <italic>source domain</italic> data set with the hall lights on in <bold>(C)</bold>, as well as a <italic>target domain</italic> data set with only on-board lighting in <bold>(D)</bold>. <bold>(E)</bold> shows augmented source domain data that we alter to simulate conditions in the target domain.</p>
</caption>
<graphic xlink:href="frobt-09-833173-g001.tif"/>
</fig>
</sec>
<sec id="s1-2">
<title>1.2 Related work</title>
<p>Our work lies at the intersection of several fields (robotics, computer vision, and machine learning) and topics (transfer learning, domain augmentation, metric learning, the triplet loss, etc.). A comprehensive overview of related work on each of these topics is beyond the scope of this paper. In this section, we instead briefly review each topic and list the papers we consider most relevant to our work.</p>
<sec id="s1-2-1">
<title>1.2.1 Robotics pose estimation through vision</title>
<p>In recent years, several studies in robotics have focused on estimating the pose of articulated links through visual sensors. Approaches based on markers (<xref ref-type="bibr" rid="B24">Vahrenkamp et al. (2008)</xref>), as well as on depth data and 3D models (<xref ref-type="bibr" rid="B9">Krainin et al. (2011)</xref>; <xref ref-type="bibr" rid="B8">Klingensmith et al. (2013)</xref>; <xref ref-type="bibr" rid="B18">Schmidt et al. (2014)</xref>) have been proposed. A large amount of work uses discriminative approaches that learn a direct mapping from the features of visual data (e.g., RGB or point cloud) to joint states or pose of articulated links (<xref ref-type="bibr" rid="B26">Widmaier et al. (2016)</xref>; <xref ref-type="bibr" rid="B1">Byravan and Fox (2017)</xref>; <xref ref-type="bibr" rid="B28">Zhou et al. (2019)</xref>). These features are usually extracted using either hand-made feature extractors or more end-to-end approaches such as Convolutional Neural Network (CNN) models. We choose the latter type of approach and employ a CNN architecture that can learn complex tasks directly from visual data (<xref ref-type="bibr" rid="B10">Krizhevsky et al. (2012)</xref>).</p>
<p>The feature-based methods mentioned above rely on the availability of a large amount of annotated data from both source and target domain. However, it may not be possible to collect annotated data for all the conditions a robot can encounter in a complex uncontrolled real-world environment such as an underground mine.</p>
</sec>
<sec id="s1-2-2">
<title>1.2.2 Transfer learning</title>
<p>Transfer learning is a broad field that has been categorized in several ways, e.g., by <italic>label setting</italic>, where labels of the source and/or target domain are available (transductive, inductive) or unavailable (unsupervised); by <italic>domain feature space</italic>, where the source and target domain feature spaces are similar (homogeneous) or different (heterogeneous); or by <italic>field/topic</italic>, such as deep learning, computer vision, or activity recognition (<xref ref-type="bibr" rid="B29">Zhuang et al. (2020)</xref>). A detailed analysis and comparison of each of these categorizations with respect to our proposed method is beyond the scope of this paper.</p>
<p>In brief, however, the objective of transfer learning is to improve the generalization of a learned model on the target domain by transferring knowledge contained in different but related source domains. This objective is accomplished by minimizing the distance between target and source domain data during training (e.g., <xref ref-type="bibr" rid="B4">Ganin et al. (2016)</xref>; <xref ref-type="bibr" rid="B23">Tzeng et al. (2017)</xref>; <xref ref-type="bibr" rid="B13">Laradji and Babanezhad (2020)</xref>). This naturally requires access to target domain data during training or fine-tuning, which, as mentioned previously, is often not readily available. In contrast, in our work we apply domain-aware augmentation to the source domain data without requiring training or fine-tuning on the target domain.</p>
</sec>
<sec id="s1-2-3">
<title>1.2.3 Domain augmentation</title>
<p>Domain augmentation is a way of overcoming data scarcity by adding a large amount of annotated synthetic data or by transforming existing data. Data augmentation is a broad field (e.g., <xref ref-type="bibr" rid="B20">Shorten and Khoshgoftaar (2019)</xref>) and an in-depth discussion of its many techniques is beyond the scope of this paper. Nevertheless, techniques such as background augmentation, noise injection, and cropping or transforming images are common means of increasing the data variation in the source domain (<xref ref-type="bibr" rid="B12">Lambrecht and K&#xe4;stner (2019)</xref>; <xref ref-type="bibr" rid="B5">Gulde et al. (2019)</xref>; <xref ref-type="bibr" rid="B14">Lee et al. (2020)</xref>; <xref ref-type="bibr" rid="B11">Labbe et al. (2021)</xref>). The model is then trained under more varied conditions, which helps improve generalization and breaks the dependence on annotated data from the target domain. In our work, rather than applying such random augmentations (e.g., random noise injection or geometric image transformations), we apply a domain-aware augmentation, assuming that knowledge of the target domain is available. Hence, even though we do not have sufficient target data, we compensate for this through target-domain-aware augmentation of the source data.</p>
</sec>
<sec id="s1-2-4">
<title>1.2.4 Metric learning</title>
<p>Metric learning is another approach to improving model generalization by learning the relation between samples in a dataset belonging to a certain domain. Learning such relations imposes structure on the feature encoding space, which in turn has been demonstrated to improve transfer in various applications, such as multi-view face recognition (<xref ref-type="bibr" rid="B19">Schroff et al. (2015)</xref>), medical imaging (<xref ref-type="bibr" rid="B15">Litjens et al. (2017)</xref>) or remote sensing for hyperspectral image classification (<xref ref-type="bibr" rid="B3">Dong et al. (2021)</xref>). The main challenges when combining deep learning with metric learning include the design of the metric loss function (e.g., contrastive or triplet loss function), the strategy for selecting samples (e.g., hard-negative, semi-hard negative), and the design of the network structure (e.g., siamese, triplet networks) (<xref ref-type="bibr" rid="B6">Kaya and Bilge (2019)</xref>). We apply a standard triplet loss and propose a domain-specific sample selection strategy as our contribution.</p>
</sec>
</sec>
<sec id="s1-3">
<title>1.3 Problem definition and contribution</title>
<p>In this article we aim to address some of the challenges in transferring learned vision-based models to new domains. In particular, we are interested in training a machine learning model for operation in an environment in which we are not able to collect data. We instead propose to use the background knowledge and prior information available at design time in order to appropriately augment the training procedure.</p>
<p>In doing so, our contributions are as follows:<list list-type="simple">
<list-item>
<p>&#x2022; We combine techniques from domain augmentation&#x2014;namely, the use of a designed augmentation procedure&#x2014;and from metric learning.</p>
</list-item>
<list-item>
<p>&#x2022; We adapt the triplet learning methodology and propose an approach for principled integration of domain-augmented data as a source for both positive and negative examples. Our main contribution is thus the said principled treatment of domain augmentation with the purpose of transfer of a vision-based learned model.</p>
</list-item>
<list-item>
<p>&#x2022; We evaluate our approach on a data set within mining robotics, thus demonstrating the practical use of the proposed approach.</p>
</list-item>
</list>
</p>
</sec>
</sec>
<sec id="s2">
<title>2 Methods</title>
<p>In this section we present a learning architecture aimed at recovering the joint angles of an articulated kinematic chain from visual observations. We design our approach to utilize domain adapted training data to improve model transfer to images collected in previously unseen environments. We accomplish this by posing two objectives&#x2014;a primary joint recovery objective and a secondary metric learning objective. This section begins with a problem specification in <xref ref-type="sec" rid="s2-1">Section 2.1</xref>, followed by a discussion of the base joint regression task in <xref ref-type="sec" rid="s2-2">Section 2.2</xref>. Next, in <xref ref-type="sec" rid="s2-3">Sections 2.3</xref>, <xref ref-type="sec" rid="s2-4">2.4</xref>, and <xref ref-type="sec" rid="s2-5">2.5</xref>, we augment our method with a secondary objective that aims to learn a smooth feature embedding space.</p>
<sec id="s2-1">
<title>2.1 Learning a generalizable visual model</title>
<p>In this paper we are interested in solving a particular task relevant to mining robots: the visual state estimation problem. The base problem of recovering the robot state from visual observations has been previously discussed in other contexts, e.g., for robot manipulators (<xref ref-type="bibr" rid="B28">Zhou et al. (2019)</xref>). Given sufficient observations, it is possible to successfully train a neural network architecture, such as the one described in the following section. The challenge here lies in the difficulty of collecting sufficiently varied observations that span the full range of possible operating conditions for the machine. This problem is often solved via data augmentation, but as we show here, data augmentation alone may not be sufficient to guarantee good transfer of the learned visual models to out-of-domain data.</p>
<p>We formalize our problem as follows. We assume access to a sufficiently large data set of in-domain annotated examples. In our case these are supervised pairs of images <bold>I</bold> and measured robot joint configurations <bold>q</bold> from an onboard encoder system. In addition, we assume some prior knowledge of the target domain, which allows us to design an imperfect, yet admissible data augmentation procedure <italic>g</italic>
<sub>
<italic>aug</italic>
</sub>(<bold>I</bold>). The goal is then to best use the fixed data augmentation procedure in order to train a model that successfully generalizes to a novel domain.</p>
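As a concrete illustration of such a fixed, imperfect augmentation procedure, the sketch below simulates a darker target domain (e.g., only on-board lighting) by darkening source images. This is only a minimal example of the idea; the specific gamma and brightness values are hypothetical design-time choices and are not parameters from this work.

```python
import numpy as np

def g_aug(image, gamma=2.2, brightness=0.5):
    """Domain-aware augmentation sketch: simulate low-light target
    conditions by combining a gamma curve with a global brightness scale.
    `image` is an HxWx3 uint8 RGB array; returns the same shape/dtype."""
    x = image.astype(np.float64) / 255.0      # normalize to [0, 1]
    x = brightness * np.power(x, gamma)       # darken: gamma curve + dimming
    return (np.clip(x, 0.0, 1.0) * 255.0).astype(np.uint8)

# The augmented copy keeps the annotation q of the original image,
# so supervised pairs (g_aug(I), q) extend the source training set.
```

Because the annotation is untouched, the augmented pair can be fed to the same supervised pipeline as the original source data.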
</sec>
<sec id="s2-2">
<title>2.2 Regressing joint states</title>
<p>Our approach is based on a CNN that extracts feature embeddings <bold>f</bold>, given a batch of RGB images <bold>I</bold>. The CNN is trained on a <italic>source</italic> domain of images, where each sample depicts a predetermined articulated kinematic chain (e.g., manipulator, machine boom) in a known configuration <bold>q</bold>. The joint regression task is thus to estimate a configuration <inline-formula id="inf1">
<mml:math id="m1">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> that is as close as possible to the true configuration <bold>q</bold>. We use the VGG16 network architecture (<xref ref-type="bibr" rid="B21">Simonyan and Zisserman (2014)</xref>) as a backbone for the feature extraction task and initialize it using weights pre-trained on the ImageNet classification data set (<xref ref-type="bibr" rid="B2">Deng et al. (2009)</xref>). Note, however, that the proposed method does not depend on any single CNN backbone, and VGG16 could be substituted by an alternative feature extraction architecture. We then supervise the feature extraction task with a joint regression head, as seen in <xref ref-type="fig" rid="F2">Figure 2</xref> and outlined below.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>The overall proposed machine learning architecture. We use a CNN model based on the VGG16 architecture as a backbone for feature extraction. We extract feature embeddings <bold>f</bold> from a given image <bold>I</bold> and regress the joint state <bold>q</bold> via a fully connected output layer. In addition, we pose a metric learning objective where we strive to keep the embedding <bold>f</bold> close to select positive examples (<bold>f</bold>
<sub>
<italic>pos</italic>
</sub>) and far from select negative ones (<bold>f</bold>
<sub>
<italic>neg</italic>
</sub>).</p>
</caption>
<graphic xlink:href="frobt-09-833173-g002.tif"/>
</fig>
<p>The backbone, based on the VGG16 architecture (<xref ref-type="bibr" rid="B21">Simonyan and Zisserman (2014)</xref>), feeds the input image <bold>I</bold>
<sub>
<italic>i</italic>
</sub> through a series of convolution layers. We use all convolutional and pooling layers of VGG16, but discard the last fully connected layers, i.e., <italic>FC-4096</italic> and <italic>FC-1000</italic> in (<xref ref-type="bibr" rid="B21">Simonyan and Zisserman (2014)</xref>). Hence, the last layer of the backbone is the fifth maxpool layer of VGG16 and <bold>f</bold> is the feature embedding extracted from this maxpool layer. We regress the joint target <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> in <xref ref-type="fig" rid="F2">Figure 2</xref> via two fully-connected layers, <bold>
<italic>fc</italic>
</bold>. These layers have the same input structure as the <italic>FC-4096</italic> layer of VGG16 and for that reason we resize the input image to 224 &#xd7; 224 using nearest-neighbor interpolation.</p>
<p>We supervise the joint target regression task with a loss defined on the predicted state <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>. In our evaluation scenario discussed in <xref ref-type="sec" rid="s3-1">Section 3.1</xref> we have an output space with <inline-formula id="inf4">
<mml:math id="m4">
<mml:mi mathvariant="bold">q</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>7</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, where each dimension represents the state of a joint in the kinematic chain. Five of these joints are revolute, while two are prismatic, resulting in a non-homogeneous configuration vector which is partially defined in radians and partially in meters. To account for this difference, we regress the radian <inline-formula id="inf5">
<mml:math id="m5">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">rad</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>5</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and meter joint states <inline-formula id="inf6">
<mml:math id="m6">
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">met</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> in different layers simultaneously. The range of motion of the revolute joints can span from 0 to 2<italic>&#x3c0;</italic> radians. Hence, to avoid issues due to angle wraparound, we define our regression loss function over a cosine/sine transform of the radian joint angles and concatenate the result in a single array, <inline-formula id="inf7">
<mml:math id="m7">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mi mathvariant="italic">cos</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">rad</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="italic">sin</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">rad</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">met</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="" close=")">
<mml:mrow>
<mml:mspace width="-0.17em"/>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. Then, the loss for a batch of size <italic>n</italic>
<sub>
<italic>batch</italic>
</sub> is calculated by computing the Mean Squared Error (MSE) between the ground-truth <inline-formula id="inf8">
<mml:math id="m8">
<mml:mi mathvariant="bold">q</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">batch</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and estimated <inline-formula id="inf9">
<mml:math id="m9">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">batch</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>:<disp-formula id="e1">
<mml:math id="m10">
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">batch</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">batch</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:munderover>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
<label>(1)</label>
</disp-formula>
</p>
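The target encoding and loss above can be sketched as follows, assuming each raw configuration is ordered with the five radian joints first and the two meter joints last (an assumption about the data layout, not stated in the text):

```python
import numpy as np

def encode_config(q_raw):
    """Map a batch of raw configurations (n_batch x 7: five radian
    joints, then two meter joints) to the 12-dim regression target
    {cos(q_rad), sin(q_rad), q_met}, avoiding angle wraparound."""
    q_rad, q_met = q_raw[:, :5], q_raw[:, 5:]
    return np.concatenate([np.cos(q_rad), np.sin(q_rad), q_met], axis=1)

def joint_state_loss(q_true, q_pred):
    """Mean squared error over all n_batch x 12 entries, as in Eq. 1."""
    return np.mean((q_true - q_pred) ** 2)
```

Note that angles near 0 and near 2<italic>&#x3c0;</italic> map to nearly identical cosine/sine targets, which is exactly the wraparound behavior the transform is meant to provide.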
</sec>
<sec id="s2-3">
<title>2.3 Learning a metric space</title>
<p>Estimating joint states from visual input, as described in the previous section, works well if we have sufficient in-domain data. In this work, however, we are interested in the case where such data are not readily available. To improve our model&#x2019;s generalization potential, we lean on the concept of metric space learning. In particular, we employ a triplet loss function similar to the ones used in (<xref ref-type="bibr" rid="B19">Schroff et al. (2015)</xref>; <xref ref-type="bibr" rid="B22">Sun et al. (2014)</xref>). While the triplet loss is well known, for completeness we detail our use of it in this section.</p>
<p>Given an image <bold>I</bold>
<sub>
<italic>i</italic>
</sub>, we aim to extract a lower-dimensional feature embedding <inline-formula id="inf10">
<mml:math id="m11">
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. Intuitively, we want our embedding to map similar images to nearby feature vectors, while dissimilar images map to locations that are far apart. Crucially, similar in this case signifies similarity in terms of the primary task&#x2014;that is, images that show the articulated manipulator chain in nearby configurations&#x2014;and not image similarity per se. We bias our model to learn such an embedding by feeding the network a triplet of images&#x2014;associating to every sample <bold>I</bold>
<sub>
<italic>i</italic>
</sub> a similar image <bold>I</bold>
<sub>
<italic>pos</italic>
</sub> and a dissimilar image <bold>I</bold>
<sub>
<italic>neg</italic>
</sub>&#x2014;as seen in <xref ref-type="fig" rid="F2">Figure 2</xref>. In the metric learning literature, these images are known as the <italic>anchor</italic> <bold>I</bold>
<sub>
<italic>i</italic>
</sub>, the <italic>positive</italic> <bold>I</bold>
<sub>
<italic>pos</italic>
</sub> and the <italic>negative</italic> <bold>I</bold>
<sub>
<italic>neg</italic>
</sub>.</p>
<p>As depicted in <xref ref-type="fig" rid="F2">Figure 2</xref>, the three images are embedded to corresponding feature-space vectors via copies of our backbone architecture, where the weights of the three networks are shared. The corresponding feature embeddings <bold>f</bold>, <bold>f</bold>
<sub>
<italic>pos</italic>
</sub> and <bold>f</bold>
<sub>
<italic>neg</italic>
</sub> are extracted from the final fully connected layer of the backbone networks and normalized. We want to enforce a margin <italic>m</italic> between similar and dissimilar features such that:<disp-formula id="e2">
<mml:math id="m12">
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">pos</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">neg</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
<label>(2)</label>
</disp-formula>
</p>
<p>Hence, we formulate and minimize the following loss (Triplet Target in <xref ref-type="fig" rid="F2">Figure 2</xref>):<disp-formula id="e3">
<mml:math id="m13">
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">triplet</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:munderover accentunder="false" accent="false">
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">batch</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:munderover>
<mml:mi mathvariant="italic">max</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">pos</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2212;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">neg</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msubsup>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(3)</label>
</disp-formula>where <bold>f</bold>
<sup>
<italic>i</italic>
</sup> is the <italic>i</italic>th element in the batch. We incorporate this secondary objective into the overall training loss; this weighted combination constitutes our modification to the standard usage of the triplet loss. The total loss is minimized as:<disp-formula id="e4">
<mml:math id="m14">
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">total</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>w</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>w</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">triplet</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
<label>(4)</label>
</disp-formula>where <italic>w</italic> is a weight specifying the relative importance between the primary (<italic>L</italic>
<sub>
<italic>js</italic>
</sub>) and secondary (<italic>L</italic>
<sub>
<italic>triplet</italic>
</sub>) targets.</p>
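As a concrete sketch, Equations 3, 4 can be written in NumPy as follows. This is a minimal illustration rather than the TensorFlow implementation used in the paper, and the function names are ours:

```python
import numpy as np

def triplet_loss(f, f_pos, f_neg, m=0.05):
    """Eq. 3: summed triplet margin loss over a batch of (already
    normalized) embeddings, each array of shape (n_batch, d)."""
    d_pos = np.sum((f - f_pos) ** 2, axis=1)  # ||f^i - f^i_pos||_2^2
    d_neg = np.sum((f - f_neg) ** 2, axis=1)  # ||f^i - f^i_neg||_2^2
    return float(np.sum(np.maximum(d_pos - d_neg + m, 0.0)))

def total_loss(l_js, l_triplet, w=0.1):
    """Eq. 4: weighted combination of the primary joint-state loss and
    the secondary triplet loss."""
    return w * l_js + (1.0 - w) * l_triplet
```

A batch where every anchor-positive distance beats the anchor-negative distance by at least the margin contributes zero loss, so only margin-violating triplets drive the gradient.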
</sec>
<sec id="s2-4">
<title>2.4 Selecting samples</title>
<p>Choosing the negative and positive examples for each anchor in a triplet is known to be critically important for fast convergence and good performance; in particular, finding anchor-negative pairs that violate <xref ref-type="disp-formula" rid="e2">Equation 2</xref> (i.e., hard negatives) is important (<xref ref-type="bibr" rid="B19">Schroff et al. (2015)</xref>). To select negatives, we use an online hard-negative mining strategy that draws from the whole training data set. In this section, we explain this strategy, adapted for our dataset.</p>
<p>At the end of each training epoch, we calculate and store the Euclidean distance between the embedded features of each training sample, obtaining a confusion matrix <italic>C</italic>
<sub>
<italic>f</italic>
</sub>(<bold>f</bold>) &#x2208; <italic>R</italic>
<sup>
<italic>N</italic>&#xd7;<italic>N</italic>
</sup> (where <italic>N</italic> is the cardinality of the training data set):<disp-formula id="e5">
<mml:math id="m15">
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msup>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
<p>In addition, we also calculate and store the distance between ground-truth joint state of each training sample, obtaining another confusion matrix <italic>C</italic>
<sub>
<italic>q</italic>
</sub>(<bold>q</bold>) &#x2208; <italic>R</italic>
<sup>
<italic>N</italic>&#xd7;<italic>N</italic>
</sup>:<disp-formula id="e6">
<mml:math id="m16">
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
<p>Then, for each sample <italic>i</italic>, we eliminate all samples <italic>k</italic> that are too close in terms of joint configuration, that is <italic>&#x2200;k</italic>: <italic>C</italic>
<sub>
<italic>q</italic>
</sub> (<italic>i</italic>, <italic>k</italic>) &#x3c; <italic>&#x3b1;</italic> with a preset similarity threshold <italic>&#x3b1;</italic>. Finally, we select hard-negative samples among the remaining possible pairs by looking up the feature-space confusion matrix <italic>C</italic>
<sub>
<italic>f</italic>
</sub> and choosing the closest feature-space sample <inline-formula id="inf11">
<mml:math id="m17">
<mml:munder>
<mml:mrow>
<mml:mi mathvariant="italic">arg</mml:mi>
<mml:mspace width="0.17em"/>
<mml:mi mathvariant="italic">min</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
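The mining procedure above can be sketched as follows, assuming the embeddings and ground-truth joint states are stacked row-wise into arrays; the function and variable names are ours:

```python
import numpy as np

def mine_hard_negatives(F, Q, alpha=0.25):
    """Online hard-negative mining (Section 2.4).  F is an (N, d) array of
    feature embeddings, Q an (N, k) array of ground-truth joint states.
    For each anchor i, samples closer than `alpha` in joint space
    (including i itself) are eliminated; among the remaining samples,
    the one closest in feature space is the hard negative."""
    C_f = np.linalg.norm(F[:, None, :] - F[None, :, :], axis=-1)  # Eq. 5
    C_q = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1)  # Eq. 6
    masked = np.where(C_q < alpha, np.inf, C_f)  # drop too-similar configs
    return np.argmin(masked, axis=1)             # hard negative per anchor
```

The joint-space mask also removes each sample itself (its joint-space self-distance is zero), so the argmin never returns the anchor.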
</sec>
<sec id="s2-5">
<title>2.5 Data augmentation</title>
<p>We apply the domain-aware augmentation procedure <italic>g</italic>
<sub>
<italic>aug</italic>
</sub> randomly with 50% chance to the negatives <bold>I</bold>
<sub>
<italic>neg</italic>
</sub> mined from the source domain. This results in negative images that are appearance-wise both dissimilar (<italic>g</italic>
<sub>
<italic>aug</italic>
</sub> (<bold>I</bold>
<sub>
<italic>neg</italic>
</sub>)) and similar (<bold>I</bold>
<sub>
<italic>neg</italic>
</sub>) to the anchors <bold>I</bold>
<sub>
<italic>i</italic>
</sub>. For positive pair selection, we apply augmentation to each anchor image <bold>I</bold>
<sub>
<italic>pos</italic>
</sub> &#x3d; <italic>g</italic>
<sub>
<italic>aug</italic>
</sub> (<bold>I</bold>
<sub>
<italic>i</italic>
</sub>) and select it as the positive pair for anchor <bold>I</bold>
<sub>
<italic>i</italic>
</sub>. Augmentation makes positive images appearance-wise dissimilar to the anchor image, while keeping an identical joint state configuration. Hence intuitively, we aim to bring closer the embeddings of these visually distinct images by learning to abstract from appearance and focus on what matters for the primary task.</p>
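A minimal sketch of this triplet assembly, assuming the hard negative has already been mined and `g_aug` is any image-to-image callable (the names are ours):

```python
import random

def build_triplet(anchor_img, hard_neg_img, g_aug):
    """Triplet assembly (Section 2.5).  `g_aug` stands in for the
    domain-aware augmentation procedure."""
    # Positive: the augmented anchor -- identical joint state,
    # different appearance.
    pos_img = g_aug(anchor_img)
    # Negative: the mined hard negative, augmented with 50% probability,
    # so negatives are appearance-wise both similar and dissimilar
    # to source-domain anchors.
    neg_img = g_aug(hard_neg_img) if random.random() < 0.5 else hard_neg_img
    return anchor_img, pos_img, neg_img
```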
</sec>
</sec>
<sec id="s3">
<title>3 Materials and equipment</title>
<p>In this section, we give an overview of our data collection (<xref ref-type="sec" rid="s3-1">Section 3.1</xref>) and experimental setup (<xref ref-type="sec" rid="s3-2">Section 3.2</xref>).</p>
<sec id="s3-1">
<title>3.1 Dataset collection</title>
<p>We evaluate our approach on a task of visual state estimation for a drilling rig (see <xref ref-type="fig" rid="F3">Figure 3</xref>). The input of our method is an RGB image, <bold>I</bold> &#x2208; <italic>R</italic>
<sup>224&#xd7;224&#xd7;3</sup>, while the expected outputs are the joint configurations <bold>q</bold> describing the state of one articulated boom of the machine. We measure <bold>q</bold> by means of encoder and resolver modules attached to each rotational and prismatic joint of the boom and connected to the vehicle&#x2019;s CAN network. Simultaneously, we record corresponding images from a MultiSense S21 stereo camera mounted on top of the operator cabin, as shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. Hence, we train the network using <bold>I</bold> as input, with the ground-truth joint states <bold>q</bold> as output targets. The simultaneous collection of ground-truth joint states over the CAN network and images from the stereo camera is implemented via the Robot Operating System (ROS) (<xref ref-type="bibr" rid="B17">Quigley et al. (2009)</xref>). We conducted our experiments on a computer with a GeForce RTX 2080 GPU and an Intel(R) Xeon(R) E-2176G CPU.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>We mount a MultiSense S21 stereo camera on the operator cabin for data collection. The sensor placement is indicated by the red oval. Alternate mounting locations were explored in different data collection runs.</p>
</caption>
<graphic xlink:href="frobt-09-833173-g003.tif"/>
</fig>
<p>We collect data under two different sets of conditions, mimicking the scenario that our system would need to face in a real deployment. The machine is deployed in a service hall and we record images of the boom in different configurations. We do so first with the hall lights on, creating a <italic>source domain</italic> data set with good lighting conditions (<xref ref-type="fig" rid="F1">Figure 1C</xref>). Next, we repeat the data acquisition but with the hall lights switched off and the on-vehicle headlights turned on, creating a second <italic>target domain</italic> data set (<xref ref-type="fig" rid="F1">Figure 1D</xref>). This setup is meant to mimic the real deployment conditions of our system, wherein it is not possible to collect data from all target domains likely to occur in the field.</p>
<p>Overall, our data set consists of 20,066 annotated images from the source domain and 6,693 corresponding images from the target domain. The range of boom motions observed in the two data sets is similar. We partition the data sets into a 60/20/20 split for training, validation and testing. We apply the augmentation procedure <italic>g</italic>
<sub>
<italic>aug</italic>
</sub> only to the source domain data. In our experiments, <italic>g</italic>
<sub>
<italic>aug</italic>
</sub> involves adding randomly weighted Gaussian noise to each pixel, randomly decreasing the brightness of the full image by up to 40%, and randomly adding simulated specular reflections. The last step is meant to replicate the oversaturated reflections of the vehicle headlights in the target domain and is implemented by superimposing random white circles of varying radius, with edges smoothed by a Gaussian filter (example shown in <xref ref-type="fig" rid="F1">Figure 1E</xref>).</p>
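An illustrative version of such an augmentation could look as follows. This is a sketch with assumed noise-scale and spot-radius parameters, and the Gaussian smoothing of the circle edges described above is omitted for brevity:

```python
import numpy as np

def g_aug(img, rng=None):
    """Illustrative domain-aware augmentation: per-pixel Gaussian noise,
    a brightness decrease of up to 40%, and a simulated specular spot.
    The noise scale and spot radius ranges are assumed values."""
    if rng is None:
        rng = np.random.default_rng()
    out = img.astype(np.float32)
    # Randomly weighted Gaussian noise added to each pixel.
    out += rng.uniform(0.0, 10.0) * rng.standard_normal(out.shape)
    # Random brightness decrease of up to 40%.
    out *= 1.0 - rng.uniform(0.0, 0.4)
    # Simulated specular reflection: a saturated white disc at a random
    # location (edge smoothing via a Gaussian filter is omitted here).
    h, w = out.shape[:2]
    cy, cx, r = rng.integers(h), rng.integers(w), rng.integers(5, 30)
    yy, xx = np.ogrid[:h, :w]
    out[(yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2] = 255.0
    return np.clip(out, 0.0, 255.0).astype(np.uint8)
```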
</sec>
<sec id="s3-2">
<title>3.2 Training details</title>
<p>We implement our network architecture using TensorFlow&#x2019;s Estimator API. The joint regression loss <italic>L</italic>
<sub>
<italic>js</italic>
</sub> is calculated for each batch with size <italic>n</italic>
<sub>
<italic>batch</italic>
</sub> &#x3d; 8 as described in <xref ref-type="sec" rid="s2-2">Section 2.2</xref>. For calculating the triplet loss <italic>L</italic>
<sub>
<italic>triplet</italic>
</sub> we use batches of <italic>n</italic>
<sub>
<italic>batch</italic>
</sub> &#x3d; 8 image triplets <bold>I</bold>, <bold>I</bold>
<sub>
<italic>pos</italic>
</sub> and <bold>I</bold>
<sub>
<italic>neg</italic>
</sub>. The triplet loss metaparameter <italic>m</italic> in <xref ref-type="disp-formula" rid="e3">equation 3</xref> is set to 0.05. We combine the two losses using <xref ref-type="disp-formula" rid="e4">equation 4</xref>, with <italic>w</italic> set to 0.1. Finally, the metaparameter <italic>&#x3b1;</italic> used in mining of negatives (see <xref ref-type="sec" rid="s2-4">Section 2.4</xref>) is set to 0.25.</p>
<p>We use the Adam optimizer (<xref ref-type="bibr" rid="B7">Kingma and Ba (2014)</xref>) to minimize the total loss and train the network end-to-end. Adam is a widely used, fast-converging adaptive optimization algorithm in deep learning. Since convergence with a triplet loss can be slow, e.g., due to sample selection, fast convergence is particularly important here, and we expect Adam&#x2019;s estimation quality to be comparable to that of other optimizers used in deep neural networks. We set Adam&#x2019;s learning rate to 1e-5. For regularization, we set <italic>&#x3bb;</italic> to 5e-4 and the dropout rate to 0.5; L2 regularization is applied in each layer, and dropout is applied in the final fully connected (<bold><italic>fc</italic></bold>) layers. We apply early stopping, terminating training if the validation loss does not decrease for three consecutive epochs.</p>
<p>We distinguish five distinct training/testing conditions. In all cases we evaluate the trained architectures on the retained test data from the <italic>target</italic> domain.<list list-type="simple">
<list-item>
<p>&#x2022; Baseline target (BT): As a baseline we train a version of our architecture that only contains the joint state estimation head&#x2014;that is, optimizing only the loss <italic>L</italic>
<sub>
<italic>js</italic>
</sub>. The baseline is given access to the training set from the target domain and represents the <italic>ideal</italic> case. That is, the best possible performance achievable by the architecture, if sufficient labeled in-domain data were available. We note that this baseline should not be taken as the performance we aim to achieve, since the premise of this work is that we operate in a regime in which it is not possible to collect data from all conceivable deployment domains.</p>
</list-item>
<list-item>
<p>&#x2022; Pre-trained baseline source (PBS): Under this condition we directly transfer a network trained on the source domain and evaluate it on the target domain. This case represents the naive approach of hoping for the best and is meant to evaluate the difficulty of generalizing between our two domains.</p>
</list-item>
<list-item>
<p>&#x2022; Pre-trained source domain data with 12k data augmentation (PDA12k): A network trained only on the joint estimation task, using source domain data augmented with an additional 12k samples (i.e., doubling the training data by providing one augmented sample for each original).</p>
</list-item>
<list-item>
<p>&#x2022; Pre-trained triplet loss with source (PTrip): This is the proposed approach. We train using both the joint state estimation and metric learning losses, with the same data as in the previous condition&#x2014;all source domain training data, plus an additional 12k augmented images.</p>
</list-item>
</list>
</p>
</sec>
</sec>
<sec id="s4">
<title>4 Results</title>
<sec id="s4-1">
<title>4.1 Estimation accuracy</title>
<p>As a first step, we evaluate the different transfer approaches based on the primary task error. To evaluate the joint state estimation error, we extract the estimates <inline-formula id="inf12">
<mml:math id="m18">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">rad</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and <inline-formula id="inf13">
<mml:math id="m19">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="italic">met</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> separately from each architecture. Then, as explained in <xref ref-type="sec" rid="s2-2">Section 2.2</xref>, to avoid angle wrap-around errors we apply the cosine/sine transform to the rotational joints. The transformed radian joint states and meter joint states are concatenated in <inline-formula id="inf14">
<mml:math id="m20">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. For each data sample <italic>i</italic> &#x3d; 1, &#x2026;, <italic>N</italic>, where <italic>N</italic> is the number of test samples, the prediction error is calculated as the <inline-formula id="inf15">
<mml:math id="m21">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">L</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> norm:<disp-formula id="e7">
<mml:math id="m22">
<mml:mi>E</mml:mi>
<mml:mi>r</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
<label>(7)</label>
</disp-formula>
</p>
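A sketch of this error computation, under the assumption that the cosine and sine components of the rotational joints are simply concatenated with the prismatic (meter) joint values; the exact ordering in the paper's implementation may differ:

```python
import numpy as np

def joint_state_error(q_rad, q_met, qhat_rad, qhat_met):
    """Eq. 7: L2 norm between transformed ground-truth and estimated
    joint states.  Rotational joints are mapped to (cos, sin) pairs to
    avoid wrap-around errors; the concatenation order with the
    prismatic joints is an assumption of this sketch."""
    def transform(rad, met):
        return np.concatenate([np.cos(rad), np.sin(rad), met])
    diff = transform(q_rad, q_met) - transform(qhat_rad, qhat_met)
    return float(np.linalg.norm(diff))
```

Because of the (cos, sin) mapping, a rotational estimate that differs from the ground truth by a full revolution incurs no penalty.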
<p>Since the error distribution is not Gaussian, we compare the median and interquartile range (IQR) rather than the mean and standard deviation over the whole test data set.</p>
<p>According to <xref ref-type="table" rid="T1">Table 1</xref>, both PDA12K and PTrip decrease the error significantly compared to direct transfer (PBS). Hence, our combination of data augmentation and a triplet loss improves the transferability of the baseline model trained only on the source domain. However, even at best, the prediction error of the transferred models is still much higher (&#x2248;7 times) than that of the BT model, which is trained directly on labeled target domain data.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Median joint-space error <italic>Err</italic>
<sub>
<italic>js</italic>
</sub>. The bold text indicates the best results among transferred models (PBS, PDA12K, PTrip).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="left">BT</th>
<th align="left">PBS</th>
<th align="left">PDA12k</th>
<th align="left">PTrip</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<bold>Median</bold>
</td>
<td align="left">
<italic>0.0937</italic>
</td>
<td align="left">0.91</td>
<td align="left">
<bold>0.639</bold>
</td>
<td align="left">0.71</td>
</tr>
<tr>
<td align="left">
<bold>IQR</bold>
</td>
<td align="left">
<italic>0.0931</italic>
</td>
<td align="left">0.53</td>
<td align="left">0.567</td>
<td align="left">
<bold>0.435</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>We also test the prediction accuracy of the evaluated models on a secondary task&#x2014;pose estimation for the links of the boom. In practice, this secondary task is of more interest in our application, but supervising the network on it efficiently is very challenging. We calculate the pose estimation error of the end-effector using the model-based displacement measure (DISP) introduced by <xref ref-type="bibr" rid="B27">Zhang et al. (2007)</xref>. DISP calculates the maximum distance between corresponding vertices of a mesh model of a given manipulator when placed in different configurations. In our case, we are interested in the DISP measure between the ground-truth configuration <bold>q</bold> and the estimated configuration <inline-formula id="inf16">
<mml:math id="m23">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>. This measure provides a more interpretable metric and directly correlates with the expected accuracy in task space when using the estimated joint configurations. Formally, we calculate the measure over all points <italic>p</italic> that are vertices of the manipulator mesh <italic>M</italic> as:<disp-formula id="e8">
<mml:math id="m24">
<mml:mi>D</mml:mi>
<mml:mi>I</mml:mi>
<mml:mi>S</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi>P</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mi>max</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
<label>(8)</label>
</disp-formula>where <bold>p</bold>(<bold>q</bold>) is the position of point <bold>p</bold> when the model is placed in joint configuration <bold>q</bold>.</p>
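Given the vertex positions p(q) and p(q̂) of the mesh model placed at the two configurations, the DISP measure of Eq. 8 reduces to a single maximum over vertex displacements. A minimal sketch, with forward kinematics over the mesh model assumed to be computed separately:

```python
import numpy as np

def disp_measure(p_q, p_qhat):
    """DISP (Eq. 8): maximum Euclidean displacement between corresponding
    mesh vertices at the ground-truth configuration q and the estimated
    configuration q-hat.  `p_q` and `p_qhat` are (V, 3) arrays of vertex
    positions p(q) and p(q-hat)."""
    return float(np.max(np.linalg.norm(p_q - p_qhat, axis=1)))
```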
<p>The DISP errors for each evaluated approach are shown in <xref ref-type="table" rid="T2">Table 2</xref>. We note that our proposed approach with domain-aware data augmentation and triplet selection performs best at this measure. Both our full approach and the domain-aware data augmentation variant result in improved pose estimation, compared to the direct transfer approach. Overall, the PTrip approach results in an improvement of roughly 30% compared to the direct transfer baseline (PBS). While this is encouraging, we note that all transfer approaches remain far from the desired performance attained by the method trained in-domain.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Median DISP error (in meters) of the end-effector pose for target domain data under different training/testing approaches. The bold text indicates the best results among transferred models (PBS, PDA12K, PTrip).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="left">BT</th>
<th align="left">PBS</th>
<th align="left">PDA12k</th>
<th align="left">PTrip</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<bold>Median</bold>
</td>
<td align="left">
<italic>0.265</italic>
</td>
<td align="left">1.719</td>
<td align="left">1.487</td>
<td align="left">
<bold>1.198</bold>
</td>
</tr>
<tr>
<td align="left">
<bold>IQR</bold>
</td>
<td align="left">
<italic>0.189</italic>
</td>
<td align="left">1.046</td>
<td align="left">0.819</td>
<td align="left">
<bold>0.67</bold>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>It is important to note that, although our approach performs better on the DISP measure, we did not directly supervise this task, and consequently there is a degree of randomness to this outcome. Our intuition is that the metric space learning objective forces our prediction model to make errors in a similar direction for similar joints. The result is that, although PTrip often makes errors in predicting a joint configuration comparable to those of the PDA12K model, these errors are correlated and often cancel out. As an illustrative example, consider <xref ref-type="fig" rid="F4">Figure 4</xref>. Two input images are shown, along with a corresponding bird&#x2019;s-eye view visualization of the estimated and ground-truth configurations (<xref ref-type="fig" rid="F4">Figures 4A,B</xref>). The prediction in <xref ref-type="fig" rid="F4">Figure 4C</xref> has a higher configuration-space error than the one in <xref ref-type="fig" rid="F4">Figure 4D</xref>. However, the bulk of the error in the first case is distributed over the two prismatic axes, with errors of opposite sign. This results in a lower DISP measure for the estimate in <xref ref-type="fig" rid="F4">Figure 4C</xref>. Visually, this result is not unexpected, as the models make predictions based on appearance, and in appearance space the predicted and ground-truth configurations in <xref ref-type="fig" rid="F4">Figure 4C</xref> are much closer. We note this unexpected benefit of our proposed method and defer deeper investigation to future work.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>
<bold>(A)</bold>, <bold>(B)</bold> Two example frames from the target data set. Contrast and colors are re-adjusted for clarity of display. <bold>(C)</bold>, <bold>(D)</bold> Corresponding ground-truth (in yellow) and predicted (in purple) boom configurations. The prediction in <bold>(C)</bold> results in a large joint state error, but a low DISP measure (<italic>Err</italic>
<sub>
<italic>js</italic>
</sub> &#x3d; 0.6, DISP &#x3d; 0.57). On the other hand, the prediction in <bold>(D)</bold> results in a low JSE, but a high DISP measure (<italic>Err</italic>
<sub>
<italic>js</italic>
</sub> &#x3d; 0.35, DISP &#x3d; 0.89).</p>
</caption>
<graphic xlink:href="frobt-09-833173-g004.tif"/>
</fig>
</sec>
<sec id="s4-2">
<title>4.2 Statistical analysis</title>
<p>To determine whether the differences between the transfer-learning error results in <xref ref-type="table" rid="T1">Table 1</xref> and <xref ref-type="table" rid="T2">Table 2</xref> are statistically significant rather than random, we apply further statistical tests. Because the error distributions are non-Gaussian, as stated above, we chose Mood&#x2019;s median test (<xref ref-type="bibr" rid="B16">Mood (1954)</xref>), a non-parametric test that can replace more common tests such as the <italic>t</italic>-test or ANOVA, which assume normally distributed data.</p>
<p>Mood&#x2019;s median test is well known; for completeness, we summarize it briefly. Its null hypothesis is that the population medians are all equal, i.e., that there is no significant difference between the populations. To assess this null hypothesis, we choose <italic>&#x3b1;</italic> &#x3d; 0.05 as the significance level. A chi-square value is then calculated between <italic>k</italic> populations; in our case, we compare the error results of PBS, PDA12K and PTrip pairwise, i.e., <italic>k</italic> &#x3d; 2. The calculated chi-square value is compared against a critical value determined by <italic>k</italic> and the chosen <italic>&#x3b1;</italic>; for <italic>k</italic> &#x3d; 2 and <italic>&#x3b1;</italic> &#x3d; 0.05, the critical value is 3.841. If the chi-square value exceeds the critical value, we can reject the null hypothesis, meaning that the differences between the errors stated in <xref ref-type="table" rid="T1">Table 1</xref> and <xref ref-type="table" rid="T2">Table 2</xref> are meaningful.</p>
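In practice, such a pairwise comparison can be run with SciPy's implementation of Mood's median test; a sketch, with the helper name being ours:

```python
import numpy as np
from scipy.stats import median_test

def compare_error_populations(err_a, err_b, alpha=0.05):
    """Pairwise Mood's median test (k = 2) between two error populations.
    Returns the chi-square statistic and whether the equal-medians null
    hypothesis is rejected at significance level `alpha`."""
    stat, p_value, _, _ = median_test(err_a, err_b)
    return stat, p_value < alpha
```

For k = 2 and alpha = 0.05, rejecting via the p-value is equivalent to the chi-square statistic exceeding the critical value 3.841.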
<p>For clarity, we show the chi-square results in matrix format in <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="table" rid="T4">Table 4</xref>. All chi-square values in <xref ref-type="table" rid="T3">Table 3</xref> and <xref ref-type="table" rid="T4">Table 4</xref> are at least 10 times larger than the critical value. Hence, the null hypothesis is rejected and the differences between the errors stated in <xref ref-type="table" rid="T1">Table 1</xref> and <xref ref-type="table" rid="T2">Table 2</xref> are statistically significant.</p>
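The mechanics of the test described above can be sketched in a few lines of pure Python. The two samples below are illustrative placeholders only, not the actual error distributions from our experiments; the critical value 3.841 is the standard chi-square threshold for 1 degree of freedom at &#x3b1; = 0.05.

```python
def moods_median_test(sample_a, sample_b):
    """Mood's median test for two samples (k = 2).

    Pools both samples, finds the grand median, counts how many values
    in each sample fall above vs. at-or-below it, and computes the
    chi-square statistic of the resulting 2x2 contingency table.
    """
    pooled = sorted(sample_a + sample_b)
    n = len(pooled)
    grand_median = (pooled[(n - 1) // 2] + pooled[n // 2]) / 2

    # Rows = samples, columns = (above median, at-or-below median).
    table = []
    for sample in (sample_a, sample_b):
        above = sum(1 for x in sample if x > grand_median)
        table.append([above, len(sample) - above])

    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2

# Critical value for k = 2 (1 degree of freedom) at alpha = 0.05.
CRITICAL_VALUE = 3.841

# Illustrative samples with clearly separated medians.
a = [0.10, 0.20, 0.15, 0.12, 0.18, 0.11, 0.16, 0.14]
b = [0.40, 0.50, 0.45, 0.42, 0.48, 0.41, 0.46, 0.44]
chi2 = moods_median_test(a, b)
print(chi2, chi2 > CRITICAL_VALUE)  # null hypothesis rejected
```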
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Mood&#x2019;s median test&#x2019;s chi-square values calculated from joint state errors of different training/testing conditions. Italicized values are the smallest chi-square values. Bold text indicates the best results among transferred models (PBS, PDA12K, PTrip).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="left">PBS</th>
<th align="left">PDA12k</th>
<th align="left">PTrip</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<bold>PBS</bold>
</td>
<td align="left">N/A</td>
<td align="left">156.912</td>
<td align="left">157.15</td>
</tr>
<tr>
<td align="left">
<bold>PDA12k</bold>
</td>
<td align="left">156.912</td>
<td align="left">N/A</td>
<td align="left">
<italic>11.593</italic>
</td>
</tr>
<tr>
<td align="left">
<bold>PTrip</bold>
</td>
<td align="left">157.15</td>
<td align="left">
<italic>11.593</italic>
</td>
<td align="left">N/A</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Mood&#x2019;s median test&#x2019;s chi-square values calculated from DISP errors of different training/testing conditions. Italicized values are the smallest chi-square values. Bold text indicates the best results among transferred models (PBS, PDA12K, PTrip).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="left">PBS</th>
<th align="left">PDA12k</th>
<th align="left">PTrip</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<bold>PBS</bold>
</td>
<td align="left">N/A</td>
<td align="left">
<italic>38.804</italic>
</td>
<td align="left">226.528</td>
</tr>
<tr>
<td align="left">
<bold>PDA12k</bold>
</td>
<td align="left">
<italic>38.804</italic>
</td>
<td align="left">N/A</td>
<td align="left">135.63</td>
</tr>
<tr>
<td align="left">
<bold>PTrip</bold>
</td>
<td align="left">226.528</td>
<td align="left">135.63</td>
<td align="left">N/A</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The statistical results are also consistent with the error results. For instance, the smallest chi-square value in <xref ref-type="table" rid="T3">Table 3</xref> is between PTrip and PDA12k. The difference between the errors of PTrip and PDA12k is therefore less significant than that between PTrip and PBS, or between PBS and PDA12k. This result is consistent with the smallest error difference being between PTrip and PDA12k, as shown in <xref ref-type="table" rid="T1">Table 1</xref>. We observe similar consistency between <xref ref-type="table" rid="T4">Table 4</xref> and <xref ref-type="table" rid="T2">Table 2</xref> for PBS and PDA12k as well.</p>
</sec>
<sec id="s4-3">
<title>4.3 Latent space analysis</title>
<p>In addition to evaluating the primary task, we also analyze the performance according to our secondary metric learning objective. In particular, we are interested in the generalization properties of the learned feature encoders, and thus in this section we base our evaluation on sequences of images from the target domain. We embed both consecutive images with similar appearance and joint configuration, as well as images from remote sections of the data set. In order to visualize the obtained embeddings <inline-formula id="inf17">
<mml:math id="m25">
<mml:mi mathvariant="bold">f</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, we map the whole target data set through each of the three test conditions PBS, PDA12K and PTrip. We then take the corresponding sets of feature embeddings in <italic>d</italic>-dimensional space and pass them through an additional dimensionality reduction step to obtain an interpretable 2D visualization. For this step we use the popular t-SNE dimensionality reduction scheme (<xref ref-type="bibr" rid="B25">Van der Maaten and Hinton (2008)</xref>), as it creates locally smooth embeddings for each feature space. In this manner, we can easily discern how closely similar and dissimilar feature points are placed in the learned latent space (e.g., <xref ref-type="fig" rid="F5">Figure 5</xref>) and qualitatively evaluate how well each approach captures the smoothness and structure of the target domain.</p>
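This projection step can be sketched as follows, assuming scikit-learn's TSNE implementation. The random matrix stands in for the actual <italic>d</italic>-dimensional encoder outputs, and the latent dimension and perplexity are illustrative choices, not the values used in our experiments.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
d = 32                                 # placeholder latent dimension
features = rng.normal(size=(100, d))   # stand-in for encoder outputs f

# perplexity must be smaller than the number of samples; smaller values
# emphasize local neighborhood structure in the 2D map.
tsne = TSNE(n_components=2, perplexity=20, init="pca", random_state=0)
embedding_2d = tsne.fit_transform(features)
print(embedding_2d.shape)  # one 2D point per input embedding
```

The 2D coordinates can then be scatter-plotted and colored by distance to a reference configuration, as in the figures.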
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>
<bold>Case1</bold>: t-SNE plots of learned feature embeddings&#x2014;<bold>(A)</bold> through PBS <bold>(B)</bold> PDA12K and <bold>(C)</bold> PTrip&#x2014;colored with ground-truth joint states distance to a reference frame (red cross). Both PBS and PDA12K separate points at similar distances, while PTrip brings them closer together.</p>
</caption>
<graphic xlink:href="frobt-09-833173-g005.tif"/>
</fig>
<p>For clarity, we select several data points from exemplary cases rather than plotting the whole feature space. To display similarity in the primary task space, we color the feature embeddings by Euclidean distance to a fixed reference configuration. We plot the embedding of the reference with a red cross (e.g., <xref ref-type="fig" rid="F5">Figure 5</xref>) and use the same color scale in all images, with lighter colors representing more dissimilar joint configurations. The Euclidean distance is calculated using the cosine/sine transformed <inline-formula id="inf18">
<mml:math id="m26">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>12</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> as explained in <xref ref-type="sec" rid="s2">Section 2</xref>. We plot both the feature embedding, as well as the corresponding input.</p>
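A small sketch of this distance computation, assuming six revolute joints so that the transformed vector lies in R^12 (the joint count is inferred from the dimensionality, not stated explicitly here):

```python
from math import cos, sin, sqrt, pi

def transform(q):
    """Map joint angles q (here 6 joints) to the 12-dimensional
    cosine/sine representation, which removes angle wrap-around."""
    q_hat = []
    for angle in q:
        q_hat.extend((cos(angle), sin(angle)))
    return q_hat

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# 0 and 2*pi describe the same physical configuration, so their
# transformed distance is effectively zero; 0 vs. pi stays far apart.
d_wrapped = euclidean(transform([0.0] * 6), transform([2 * pi] * 6))
d_far = euclidean(transform([0.0] * 6), transform([pi] * 6))
print(d_wrapped, d_far)
```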
<p>
<xref ref-type="fig" rid="F5">Figure 5</xref> illustrates the feature embeddings for a sequence of images that capture a yawing motion of the boom of the machine (<bold>Case1</bold>). Even though frame 2 is almost equidistant from frame 1 and frame 3, it is placed closer to frame 3 in both PBS (<xref ref-type="fig" rid="F5">Figure 5A</xref>) and PDA12K (<xref ref-type="fig" rid="F5">Figure 5B</xref>). On the other hand, PTrip manages to bring them closer together (<xref ref-type="fig" rid="F5">Figure 5C</xref>) and thus yields a more faithful representation of these points in latent space. To verify this observation, we compare the smoothness of the estimated joint configurations with that of the ground-truth joint configurations. Smoothness of the joint configuration is an important factor for robots to move the end-effector accurately and in a smooth, continuous manner along a specified trajectory. Therefore, we calculate the trajectory smoothness for the joint configurations <inline-formula id="inf19">
<mml:math id="m27">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> predicted by each model over the entire yawing motion. We measure smoothness using the center line average (CLA) metric: i.e., <inline-formula id="inf20">
<mml:math id="m28">
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:mfrac>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:msubsup>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msubsup>
<mml:mo stretchy="false">&#x7c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo stretchy="false">&#x7c;</mml:mo>
</mml:math>
</inline-formula> where <inline-formula id="inf21">
<mml:math id="m29">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">q</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> is the sample mean and <italic>d</italic> &#x3d; 12. We calculate the average CLA over joint state estimates normalized between 0 and 1 and report the results in <xref ref-type="table" rid="T5">Table 5</xref>. We note, for example, that for <bold>Case1</bold> PTrip achieves a trajectory with smoothness comparable to that of the ground-truth trajectory. Hence PTrip achieves the best representation of the ground-truth joint states in latent space by bringing similar features closer together and keeping dissimilar ones apart.</p>
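The CLA metric defined above translates directly into code. The two short trajectories below are illustrative only (with d = 2 for brevity rather than the d = 12 used in the paper): a constant trajectory scores 0, and a noisier one scores higher.

```python
def center_line_average(q_hat):
    """Center line average (CLA) smoothness of a joint-state trajectory.

    q_hat: list of n joint-state vectors, each of dimension d.
    CLA = (1/d) * (1/n) * sum_i |q_hat_i - q_bar|, where q_bar is the
    per-joint sample mean and |.| sums absolute deviations over joints.
    Lower values mean a smoother trajectory.
    """
    n = len(q_hat)
    d = len(q_hat[0])
    q_bar = [sum(q[j] for q in q_hat) / n for j in range(d)]
    total = sum(abs(q[j] - q_bar[j]) for q in q_hat for j in range(d))
    return total / (d * n)

smooth = [[0.5, 0.5]] * 4                                   # constant
noisy = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]    # oscillating
print(center_line_average(smooth), center_line_average(noisy))
```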
<table-wrap id="T5" position="float">
<label>TABLE 5</label>
<caption>
<p>Center line average of predicted joint states for the selected cases discussed. Bold text indicates the best results among transferred models (PBS, PDA12K, PTrip); italicized values are the ground-truth CLA.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left"/>
<th align="left">PBS</th>
<th align="left">PDA12k</th>
<th align="left">PTrip</th>
<th align="left">Ground-truth</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<bold>Case1</bold>
</td>
<td align="left">0.061</td>
<td align="left">0.054</td>
<td align="left">
<bold>0.038</bold>
</td>
<td align="left">
<italic>0.020</italic>
</td>
</tr>
<tr>
<td align="left">
<bold>Case2</bold>
</td>
<td align="left">0.060</td>
<td align="left">0.077</td>
<td align="left">
<bold>0.069</bold>
</td>
<td align="left">
<italic>0.070</italic>
</td>
</tr>
<tr>
<td align="left">
<bold>Case3</bold>
</td>
<td align="left">0.081</td>
<td align="left">
<bold>0.072</bold>
</td>
<td align="left">0.096</td>
<td align="left">
<italic>0.055</italic>
</td>
</tr>
<tr>
<td align="left">
<bold>Case4</bold>
</td>
<td align="left">
<bold>0.041</bold>
</td>
<td align="left">0.044</td>
<td align="left">
<bold>0.041</bold>
</td>
<td align="left">
<italic>0.024</italic>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>In <xref ref-type="fig" rid="F6">Figure 6</xref>, we display a more complex case (<bold>Case2</bold>) where the boom executes a combination of motions of several joints simultaneously, i.e., the end effector is yawing, rolling and translating. In <xref ref-type="fig" rid="F6">Figure 6A</xref>, PBS pushes frame 2 (dark blue) away from the reference frame 1, while it brings frame 5 (light green) closer to frame 2, creating an inconsistency. PDA12K brings the similar points frame 2 and frame 1 closer while pushing frame 5 (a dissimilar point) further away (<xref ref-type="fig" rid="F6">Figure 6B</xref>). However, mild green frames such as frame 4, which is almost equidistant from frame 1 and frame 5, are pulled closer to the darker points. Finally, PTrip finds a balance between these similar and dissimilar points (<xref ref-type="fig" rid="F6">Figure 6C</xref>).</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>
<bold>Case2</bold>: t-SNE plots of learned feature embeddings&#x2014;<bold>(A)</bold> through PBS <bold>(B)</bold> PDA12K and <bold>(C)</bold> PTrip&#x2014;colored with ground-truth joint states distance to a reference frame (red cross). PTrip brings similar blue dots closer than PBS and PDA12K.</p>
</caption>
<graphic xlink:href="frobt-09-833173-g006.tif"/>
</fig>
<p>In <bold>Case1</bold> and <bold>Case2</bold> we examine sample motion sequences where PTrip performs better than the other models. However, there are cases where PTrip also fails to bring similar points together or keep dissimilar points apart in latent space. For instance, in <xref ref-type="fig" rid="F7">Figure 7</xref>, frame 3 (mild green) is equidistant from frame 2 and frame 4, but both PBS and PTrip place it closer to frame 4 (<xref ref-type="fig" rid="F7">Figures 7A,C</xref>), while PDA12k places them in a more balanced way (<xref ref-type="fig" rid="F7">Figure 7B</xref>). This is reflected in the smoothness measure: PDA12k gives the average CLA closest to the ground-truth value (<xref ref-type="table" rid="T5">Table 5</xref>).</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>
<bold>Case3</bold>: t-SNE plots of learned feature embeddings&#x2014;<bold>(A)</bold> through PBS <bold>(B)</bold> PDA12K and <bold>(C)</bold> PTrip&#x2014;colored with ground-truth joint states distance to a reference frame (red cross). PDA12K successfully brings similar points closer and provides a smooth transition between consecutive frames, while PTrip fails to do so. As seen in <bold>(A)</bold>, the ground-truth data is also erroneous, and PTrip fails to correct this error; with more accurate labels for training, PTrip should perform well here too.</p>
</caption>
<graphic xlink:href="frobt-09-833173-g007.tif"/>
</fig>
<p>Finally, we show another relatively simple case (<bold>Case4</bold>), similar to <bold>Case1</bold>, where we mainly observe a yaw motion of the boom. For this sequence all models fail to produce a result consistent with the ground-truth labels (<xref ref-type="fig" rid="F8">Figures 8A&#x2013;C</xref>). While frame 2 and frame 1 capture almost identical end effector poses, all three models place frame 2 much closer to the dissimilar frame 3. Hence, for this sequence the feature encoders map dissimilar poses as similar, which is reflected in the smoothness measure: all models give CLA values twice as large as the ground-truth. Intuitively, this makes the shape of the latent space more complex, which in turn places higher demands on the subsequent regression network, and may be the cause of the observed high prediction errors and meager transfer capability of the three evaluated models.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>
<bold>Case4</bold>: t-SNE plots of learned feature embeddings&#x2014;<bold>(A)</bold> through PBS <bold>(B)</bold> PDA12K and <bold>(C)</bold> PTrip&#x2014;colored with ground-truth joint states distance to a reference frame (red cross). All models fail to correct the error in the ground-truth data and give erroneous results, in which similar points are scattered in different places.</p>
</caption>
<graphic xlink:href="frobt-09-833173-g008.tif"/>
</fig>
</sec>
<sec id="s4-4">
<title>4.4 Discussion</title>
<p>In our results section, we present several experiments to show that our use of data augmentation with a triplet loss increases the transfer capacity of the baseline model trained only on the source domain. In these experiments, we observe a decrease in joint state estimation error for PDA12K and PTrip compared to the direct transfer baseline (PBS) in <xref ref-type="table" rid="T1">Table 1</xref>. Also, in <xref ref-type="table" rid="T2">Table 2</xref>, the PTrip approach yields an improvement of roughly 30% in pose estimation compared to PBS. Moreover, our latent space analysis shows that the feature embeddings learned through PDA12K and PTrip represent the smoothness and structure of the target domain better than PBS in several cases (<xref ref-type="fig" rid="F5">Figure 5</xref> and <xref ref-type="fig" rid="F6">Figure 6</xref>).</p>
<p>However, even though we show the improved transfer capability of our proposed method, it also has limitations. The main limitation is that we do not directly train the regression task for pose estimation, which is of greater interest in our application. This may introduce a degree of randomness into our pose estimation results (<xref ref-type="fig" rid="F4">Figure 4</xref>); e.g., our combined metric learning and data augmentation approach (PTrip) performs better on the DISP measure (<xref ref-type="table" rid="T2">Table 2</xref>) than on joint state estimation (<xref ref-type="table" rid="T1">Table 1</xref>). We can observe this randomness in the latent space analysis as well: in <xref ref-type="fig" rid="F8">Figure 8</xref>, the feature encoders of all three models map dissimilar poses as similar. We therefore conclude that the latent space has a more complex shape for this sequence than for the other sequences, such as the one presented in <xref ref-type="fig" rid="F5">Figure 5</xref>. This more complex latent space places higher demands on the regression task, causing high prediction errors and low transfer capability. Supervising the regression task directly on pose estimation could help differentiate similar and dissimilar poses more accurately in the latent space. This limitation thus stresses the importance of a more careful selection of the training task (e.g., regression directly on pose estimation).</p>
</sec>
</sec>
<sec id="s5">
<title>5 Conclusion</title>
<p>In this paper we introduce a new transfer learning method that combines metric learning and domain-aware data augmentation. Unlike previous transfer learning methods, our approach does not use target domain data directly during training but incorporates target domain knowledge through source domain augmentation. We apply the method to a scenario in mining robotics that features a deployment domain which is difficult to predict and fully capture. We concentrate on the challenging task of estimating the joint configurations of an articulated manipulator in an unknown target domain, with access only to labeled data from a different source domain. Our results indicate that the proposed integration of a metric learning objective and domain-aware data augmentation has a promising transfer capacity, with <inline-formula id="inf22">
<mml:math id="m30">
<mml:mo>&#x2248;</mml:mo>
<mml:mn>30</mml:mn>
<mml:mi>%</mml:mi>
</mml:math>
</inline-formula> improvement with respect to a model trained only on source domain data. Moreover, we qualitatively evaluate the latent space of our approach and demonstrate that the trained feature encoder produces a smooth embedding. Hence, our approach has the capacity to map images of similar manipulator configurations to close-by regions of the latent space, regardless of visual appearance. Due to the challenging transfer task, however, the error obtained for joint state prediction on the target domain is still substantially higher than what can be obtained by supervising the model with real in-domain data. Our future work will concentrate on further exploring the relationship between latent space smoothness and the subsequent regression task. We also aim to devise more generic domain augmentation methods and explore adversarial approaches to generating relevant out-of-domain data.</p>
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The datasets presented in this article are not readily available because the data will be made publicly available pending approval by industrial partners. Requests to access the datasets should be directed to PG, <email>puren.guler@gmail.com</email>.</p>
</sec>
<sec id="s7">
<title>Author contributions</title>
<p>Conceptualization, PG, JS and TS; methodology, PG, JS, and TS; software, PG, TS; validation, PG and TS; formal analysis, PG and TS; investigation, PG and TS; resources, PG and TS; data curation, PG and TS; writing&#x2014;original draft preparation, PG; writing&#x2014;review and editing, TS; visualization, PG and TS; supervision, TS; project administration, TS; funding acquisition, TS. All authors have read and agreed to the published version of the manuscript.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This work is supported by Vinnova/SIP STRIM project 2017-02205.</p>
</sec>
<ack>
<p>TS and JS would like to acknowledge support by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.</p>
</ack>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>Epiroc AB, <ext-link ext-link-type="uri" xlink:href="https://www.epirocgroup.com/">https://www.epirocgroup.com/</ext-link>.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Byravan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Fox</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Se3-nets: Learning rigid body motion using deep neural networks</article-title>,&#x201d; in <conf-name>2017 IEEE International Conference on Robotics and Automation (ICRA)</conf-name> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>173</fpage>&#x2013;<lpage>180</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2017.7989023</pub-id> </citation>
</ref>
<ref id="B2">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Deng</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Socher</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.-J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Fei-Fei</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>Imagenet: A large-scale hierarchical image database</article-title>,&#x201d; in <conf-name>Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on</conf-name> (<publisher-loc>Miami, FL, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>248</fpage>&#x2013;<lpage>255</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206848</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dong</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep metric learning with online hard mining for hyperspectral classification</article-title>. <source>Remote Sens.</source> <volume>13</volume>, <fpage>1368</fpage>. <pub-id pub-id-type="doi">10.3390/rs13071368</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ganin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ustinova</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Ajakan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Germain</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Larochelle</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Laviolette</surname>
<given-names>F.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). <article-title>Domain-adversarial training of neural networks</article-title>. <source>J. Mach. Learn. Res.</source> <volume>17</volume>, <fpage>2096</fpage>&#x2013;<lpage>2030</lpage>. </citation>
</ref>
<ref id="B5">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Gulde</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ludl</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Andrejtschik</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Thalji</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Curio</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Ropose-real: Real world dataset acquisition for data-driven industrial robot arm pose estimation</article-title>,&#x201d; in <conf-name>2019 International Conference on Robotics and Automation (ICRA)</conf-name> (<publisher-loc>Montreal, QC, Canada</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4389</fpage>&#x2013;<lpage>4395</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2019.8793900</pub-id> </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kaya</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bilge</surname>
<given-names>H. &#x15e;.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Deep metric learning: A survey</article-title>. <source>Symmetry</source> <volume>11</volume>, <fpage>1066</fpage>. <pub-id pub-id-type="doi">10.3390/sym11091066</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kingma</surname>
<given-names>D. P.</given-names>
</name>
<name>
<surname>Ba</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Adam: A method for stochastic optimization</article-title>. <comment>
<italic>arXiv preprint arXiv:1412.6980</italic>
</comment>. </citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Klingensmith</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Galluzzo</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Dellin</surname>
<given-names>C. M.</given-names>
</name>
<name>
<surname>Kazemi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Bagnell</surname>
<given-names>J. A.</given-names>
</name>
<name>
<surname>Pollard</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>Closed-loop servoing using real-time markerless arm tracking</article-title>. </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krainin</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Henry</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Fox</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Manipulator and object tracking for in-hand 3d object modeling</article-title>. <source>Int. J. Robotics Res.</source> <volume>30</volume>, <fpage>1311</fpage>&#x2013;<lpage>1327</lpage>. <pub-id pub-id-type="doi">10.1177/0278364911403178</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G. E.</given-names>
</name>
</person-group> (<year>2012</year>). &#x201c;<article-title>Imagenet classification with deep convolutional neural networks</article-title>,&#x201d; in <source>Advances in neural information processing systems</source>, <fpage>1097</fpage>&#x2013;<lpage>1105</lpage>. </citation>
</ref>
<ref id="B11">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Labbe</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Carpentier</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Aubry</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Sivic</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Single-view robot pose and joint angle estimation via render and compare</article-title>,&#x201d; in <conf-name>Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>. </citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lambrecht</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>K&#xe4;stner</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Towards the usage of synthetic data for marker-less pose estimation of articulated robots in rgb images</article-title>,&#x201d; in <conf-name>2019 19th International Conference on Advanced Robotics (ICAR)</conf-name> (<publisher-loc>Belo Horizonte, Brazil</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>240</fpage>&#x2013;<lpage>247</lpage>. <pub-id pub-id-type="doi">10.1109/ICAR46387.2019.8981600</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Laradji</surname>
<given-names>I. H.</given-names>
</name>
<name>
<surname>Babanezhad</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>M-adda: Unsupervised domain adaptation with deep metric learning</article-title>,&#x201d; in <source>Domain adaptation for visual understanding</source> (<publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>17</fpage>&#x2013;<lpage>31</lpage>. </citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>T. E.</given-names>
</name>
<name>
<surname>Tremblay</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>To</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Mosier</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kroemer</surname>
<given-names>O.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>Camera-to-robot pose estimation from a single image</article-title>,&#x201d; in <conf-name>2020 IEEE International Conference on Robotics and Automation (ICRA)</conf-name> (<publisher-loc>Paris, France</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>9426</fpage>&#x2013;<lpage>9432</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA40945.2020.9196596</pub-id> </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Litjens</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kooi</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Bejnordi</surname>
<given-names>B. E.</given-names>
</name>
<name>
<surname>Setio</surname>
<given-names>A. A. A.</given-names>
</name>
<name>
<surname>Ciompi</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Ghafoorian</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>A survey on deep learning in medical image analysis</article-title>. <source>Med. Image Anal.</source> <volume>42</volume>, <fpage>60</fpage>&#x2013;<lpage>88</lpage>. <pub-id pub-id-type="doi">10.1016/j.media.2017.07.005</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mood</surname>
<given-names>A. M.</given-names>
</name>
</person-group> (<year>1954</year>). <article-title>On the asymptotic efficiency of certain nonparametric two-sample tests</article-title>. <source>Ann. Math. Stat.</source> <volume>25</volume>, <fpage>514</fpage>&#x2013;<lpage>522</lpage>. <pub-id pub-id-type="doi">10.1214/aoms/1177728719</pub-id> </citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Quigley</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Conley</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Gerkey</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Faust</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Foote</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Leibs</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2009</year>). &#x201c;<article-title>ROS: An open-source robot operating system</article-title>,&#x201d; in <conf-name>ICRA workshop on open source software</conf-name>, <conf-loc>Kobe, Japan</conf-loc>, <fpage>5</fpage>. <comment>vol. 3</comment>. </citation>
</ref>
<ref id="B18">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Schmidt</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Newcombe</surname>
<given-names>R. A.</given-names>
</name>
<name>
<surname>Fox</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>DART: Dense articulated real-time tracking</article-title>,&#x201d; in <source>Robotics: Science and systems</source>, <volume>Vol. 2</volume>. </citation>
</ref>
<ref id="B19">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Schroff</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Kalenichenko</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Philbin</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>FaceNet: A unified embedding for face recognition and clustering</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name> (<publisher-loc>Boston, MA, USA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>815</fpage>&#x2013;<lpage>823</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2015.7298682</pub-id> </citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shorten</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Khoshgoftaar</surname>
<given-names>T. M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A survey on image data augmentation for deep learning</article-title>. <source>J. Big Data</source> <volume>6</volume>, <fpage>60</fpage>. <pub-id pub-id-type="doi">10.1186/s40537-019-0197-0</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <comment>
<italic>arXiv preprint arXiv:1409.1556</italic>
</comment>. </citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Deep learning face representation by joint identification-verification</article-title>,&#x201d; in <source>Advances in neural information processing systems</source>, <fpage>1988</fpage>&#x2013;<lpage>1996</lpage>. </citation>
</ref>
<ref id="B23">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Tzeng</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Hoffman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Saenko</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Darrell</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Adversarial discriminative domain adaptation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, <fpage>7167</fpage>&#x2013;<lpage>7176</lpage>. </citation>
</ref>
<ref id="B24">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Vahrenkamp</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wieland</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Azad</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Gonzalez</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Asfour</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Dillmann</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2008</year>). &#x201c;<article-title>Visual servoing for humanoid grasping and manipulation tasks</article-title>,&#x201d; in <conf-name>Humanoid Robots, 2008. Humanoids 2008. 8th IEEE-RAS International Conference on</conf-name> (<publisher-loc>Daejeon, South Korea</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>406</fpage>&#x2013;<lpage>412</lpage>. <pub-id pub-id-type="doi">10.1109/ICHR.2008.4755985</pub-id> </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Van der Maaten</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Visualizing data using t-SNE</article-title>. <source>J. Mach. Learn. Res.</source> <volume>9</volume>, <fpage>2579</fpage>&#x2013;<lpage>2605</lpage>. </citation>
</ref>
<ref id="B26">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Widmaier</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Kappler</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Schaal</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Bohg</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Robot arm pose estimation by pixel-wise regression of joint angles</article-title>,&#x201d; in <conf-name>Robotics and Automation (ICRA), 2016 IEEE International Conference on</conf-name> (<publisher-loc>Stockholm, Sweden</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>616</fpage>&#x2013;<lpage>623</lpage>. <pub-id pub-id-type="doi">10.1109/ICRA.2016.7487185</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>Y. J.</given-names>
</name>
<name>
<surname>Manocha</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2007</year>). &#x201c;<article-title>C-Dist: Efficient distance computation for rigid and articulated models in configuration space</article-title>,&#x201d; in <conf-name>Proceedings of the 2007 ACM symposium on Solid and physical modeling</conf-name> (<publisher-loc>Beijing, China</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>159</fpage>&#x2013;<lpage>169</lpage>. </citation>
</ref>
<ref id="B28">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Chi</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhuang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>3D pose estimation of robot arm with RGB images based on deep learning</article-title>,&#x201d; in <conf-name>International Conference on Intelligent Robotics and Applications</conf-name> (<publisher-loc>New York, NY, USA</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>541</fpage>&#x2013;<lpage>553</lpage>. </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhuang</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Xi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>A comprehensive survey on transfer learning</article-title>. <source>Proc. IEEE</source> <volume>109</volume>, <fpage>43</fpage>&#x2013;<lpage>76</lpage>. <pub-id pub-id-type="doi">10.1109/jproc.2020.3004555</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>