<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="review-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Virtual Real.</journal-id>
<journal-title>Frontiers in Virtual Reality</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Virtual Real.</abbrev-journal-title>
<issn pub-type="epub">2673-4192</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1171230</article-id>
<article-id pub-id-type="doi">10.3389/frvir.2023.1171230</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Virtual Reality</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Hand interaction designs in mixed and augmented reality head mounted display: a scoping review and classification</article-title>
<alt-title alt-title-type="left-running-head">Nguyen et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frvir.2023.1171230">10.3389/frvir.2023.1171230</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Nguyen</surname>
<given-names>Richard</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2204852/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Gouin-Vallerand</surname>
<given-names>Charles</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Amiri</surname>
<given-names>Maryam</given-names>
</name>
<xref ref-type="aff" rid="aff3">
<sup>3</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>DOMUS Laboratory</institution>, <institution>Department of Computer Science</institution>, <institution>Universit&#xe9; de Sherbrooke</institution>, <addr-line>Sherbrooke</addr-line>, <addr-line>QC</addr-line>, <country>Canada</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>DOMUS Laboratory</institution>, <institution>Business School</institution>, <institution>Universit&#xe9; de Sherbrooke</institution>, <addr-line>Sherbrooke</addr-line>, <addr-line>QC</addr-line>, <country>Canada</country>
</aff>
<aff id="aff3">
<sup>3</sup>
<institution>VMware Canada</institution>, <addr-line>Ottawa</addr-line>, <addr-line>ON</addr-line>, <country>Canada</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/711018/overview">Andrea Sanna</ext-link>, Polytechnic University of Turin, Italy</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/135107/overview">Mark Billinghurst</ext-link>, University of South Australia, Australia</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/578347/overview">Mingze Xi</ext-link>, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Richard Nguyen, <email>richard.nguyen@usherbrooke.ca</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>31</day>
<month>07</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>4</volume>
<elocation-id>1171230</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>02</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>18</day>
<month>07</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Nguyen, Gouin-Vallerand and Amiri.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Nguyen, Gouin-Vallerand and Amiri</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Mixed reality took its first step towards democratization in 2017 with the launch of a first generation of commercial devices. As a new medium, one of its challenges is to develop interactions that exploit its spatial awareness and body tracking. More specifically, at the crossroads between artificial intelligence and human-computer interaction, the goal is to go beyond the Window, Icon, Menu, Pointer (WIMP) paradigm that humans mainly use on desktop computers. Hand interactions, either as a standalone modality or as a component of a multimodal interface, are among the most popular and supported techniques across mixed reality prototypes and commercial devices. In this context, this paper presents a scoping literature review of hand interactions in mixed reality. The goal of this review is to identify recent findings on the design of hand interactions and on the place of artificial intelligence in their development and behavior. This review highlights the main interaction techniques and their technical requirements between 2017 and 2022, and introduces the Metaphor-Behavior taxonomy to classify those interactions.</p>
</abstract>
<kwd-group>
<kwd>augmented reality</kwd>
<kwd>mixed reality</kwd>
<kwd>hand interaction</kwd>
<kwd>POST WIMP</kwd>
<kwd>hand grasp</kwd>
<kwd>gestures</kwd>
<kwd>machine learning</kwd>
<kwd>scoping review</kwd>
</kwd-group>
<contract-sponsor id="cn001">Mitacs<named-content content-type="fundref-id">10.13039/501100004489</named-content>
</contract-sponsor>
<contract-sponsor id="cn002">VMware<named-content content-type="fundref-id">10.13039/100016682</named-content>
</contract-sponsor>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Augmented Reality</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Mixed Reality embodies experiences that involve both the physical world and virtual content. Following the popular Virtuality-Reality Continuum of (<xref ref-type="bibr" rid="B28">Milgram et al., 1995</xref>), it encompasses Augmented Reality, which consists of adding virtual content to the real world, and Augmented Virtuality, which consists of representing physical objects in virtual environments.</p>
<p>For this new medium, hand interaction is one of the most popular modalities for manipulating virtual content in mixed reality. Indeed, as we are used to grabbing and manipulating physical objects to explore the real world, this modality is intuitive and perceived as natural. Recent headsets such as HoloLens 2, Meta 2 or Magic Leap One natively support hand interactions. Besides, external sensors such as the Leap Motion (<xref ref-type="bibr" rid="B17">Kim et al., 2019</xref>) and the Myo Armband (<xref ref-type="bibr" rid="B3">Bautista et al., 2020</xref>) are also used to improve or enable hand interactions with older headsets. This modality has been made possible by the progress of computer vision and, to some extent, of machine learning. Mixed reality headsets are endowed with sensors and computer vision algorithms that allow the machine to understand the context around the user, including, but not limited to, tracking hand movements in real time. As such, it is possible to design more intelligent user interfaces where virtual content and physical objects are manipulated seamlessly.</p>
<p>According to <xref ref-type="bibr" rid="B4">Billinghurst et al. (2015)</xref>&#x2019;s new interface medium adoption steps, mixed and augmented reality are still at the second step of adoption, where they reuse the desktop computer interface metaphors known as the Window, Icon, Menu, Pointer (WIMP) paradigm. The goal is to go beyond this paradigm and reach the third step of adoption by designing new interface metaphors suited to mixed and augmented reality. By definition, mixed reality encompasses interactions with both virtual content and the real physical world. As such, the new interface and its metaphors will aim at closing the gap between what is real and what is computer generated. This can be done, for example, by enabling users to manipulate virtual content as they would its physical counterpart, or by allowing them to seamlessly switch between interacting with the real and the virtual using diverse modalities and effectors.</p>
<p>In this paper, we propose a scoping literature review on hand interactions in mixed reality over the period 2017&#x2013;2022. The year 2017 was chosen as the starting point of our review because it is when researchers and developers were able to start prototyping with commercial devices in the form in which they are popularized today. Indeed, HoloLens 1 and Meta 2<sup>1</sup> were released in 2016 and were mostly shipped globally in early 2017. Magic Leap One, a direct competitor, was also announced at the end of 2017 for a release in 2018. We aim to provide an outline of the hand interaction techniques designed and their technical requirements, to give students and researchers an overview of recent trends in a context where accessible headsets for prototyping have become available. We also propose a Metaphor-Behavior taxonomy to describe hand interactions, which extends the models of <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref> and <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref> and was formulated while conducting this review.</p>
</sec>
<sec id="s2">
<title>2 Related works</title>
<sec id="s2-1">
<title>2.1 What is an interaction in mixed reality?</title>
<p>In the work of <xref ref-type="bibr" rid="B12">Hand (1997)</xref>, a user in virtual reality is acting to achieve four fundamental tasks.<list list-type="simple">
<list-item>
<p>1. Navigation;</p>
</list-item>
<list-item>
<p>2. Selection;</p>
</list-item>
<list-item>
<p>3. Manipulation;</p>
</list-item>
<list-item>
<p>4. Application Control.</p>
</list-item>
</list>The model proposed by <xref ref-type="bibr" rid="B12">Hand (1997)</xref> can be applied to mixed reality, as the user also evolves in a 3D environment. The difference is that, in this context, the environment is composed of both virtual and physical content as targets of the interactions. The navigation task refers to exploring the environment by changing the point of view on the scene. In virtual reality, it consists of changing the point of view of the camera that represents the head of the user. To extend this model to augmented and mixed reality, the first task also refers to moving in the physical world that is spatially registered by the system. Indeed, in augmented and mixed reality, there is a need to synchronize the position of the user between the virtual environment reference frame and the physical world reference frame. This is mostly done by computing a common origin through calibration to match the two coordinate systems. More advanced systems also map the real environment topography to a 3D mesh that allows virtual content to interact with physical surfaces. The selection task refers to designating an object as the focus of the interaction with the user; in mixed reality, the object can be either virtual or real. The manipulation task refers to changing the properties of a designated object, such as its position or its shape. The application control task refers to triggering functionalities and communicating information to the application. To achieve those goals, the user has several modalities to convey their intention to the application and interact with the environment. The most popular modalities in commercial devices are: hand control, controllers, voice commands, and head/eye control.</p>
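<p>The common-origin calibration mentioned above can be pictured as a rigid transform between the two reference frames. The following is a minimal sketch (the function names and numeric calibration values are our own, hypothetical choices, not the implementation of any particular headset):</p>

```python
import numpy as np

def make_registration(rotation, translation):
    """Build a 4x4 rigid transform taking physical-world coordinates
    into the virtual environment's reference frame. Rotation and
    translation would come from the calibration of the common origin."""
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = translation
    return transform

def to_virtual_frame(transform, point_physical):
    """Express a physical-world point in the virtual reference frame."""
    homogeneous = np.append(point_physical, 1.0)  # homogeneous coordinates
    return (transform @ homogeneous)[:3]

# Hypothetical calibration: both frames share their orientation and the
# virtual origin sits 1 m away from the physical origin along x.
T = make_registration(np.eye(3), np.array([1.0, 0.0, 0.0]))
hand_physical = np.array([0.5, 0.0, 2.0])
hand_virtual = to_virtual_frame(T, hand_physical)  # -> [1.5, 0.0, 2.0]
```

<p>Once such a transform is available, any tracked point (a hand joint, for instance) can be expressed consistently in either frame, which is what lets virtual content react to physical surfaces.</p>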
</sec>
<sec id="s2-2">
<title>2.2 Hand interaction</title>
<p>Based on the previous definition, hand interactions can be separated into two categories in most commercial mixed reality applications:<list list-type="simple">
<list-item>
<p>&#x2022; gesture based interactions;</p>
</list-item>
<list-item>
<p>&#x2022; trajectory- and simulated-physics-based interactions.</p>
</list-item>
</list>A gesture, according to <xref ref-type="bibr" rid="B5">Bouchard et al. (2014)</xref>, is an expressive and meaningful body motion (hand, face, arms, etc.) that conveys a message or, more generally, embeds important information of a spatio-temporal nature. In the work of <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref>, a systematic review on hand gestures, hand gestures can be classified in three ways: temporal classification, contextual classification and instruction-based classification. Temporal classification separates gestures into two classes, static or dynamic: the first corresponds to hand poses while the second is composed of several hand moves. Contextual classification describes what the gestures are used for and how they convey information in comparison to speech. In this classification model, gestures can be separated into two classes: communicative and manipulative. In the communicative class, the sub-categories are:<list list-type="simple">
<list-item>
<p>&#x2022; Symbolic gestures, representing a symbolic object or concept with a cultural semantic without having a direct morphological relation to the designated object;</p>
</list-item>
<list-item>
<p>&#x2022; Semaphoric gestures, formalizing a dictionary of gesture-to-action without any semantic background;</p>
</list-item>
<list-item>
<p>&#x2022; Pantomimic gestures, imitating a sequence of actions to describe a narrative;</p>
</list-item>
<list-item>
<p>&#x2022; Iconic gestures, illustrating the content of the speech by describing a shape, a spatial relation or an action of an object;</p>
</list-item>
<list-item>
<p>&#x2022; Metaphoric gestures, illustrating the content of the speech by describing an abstract concept;</p>
</list-item>
<list-item>
<p>&#x2022; Modalizing symbolic gestures or Beat gestures, following the rhythm of the speech while emphasizing on part of it;</p>
</list-item>
<list-item>
<p>&#x2022; Cohesive gestures, highlighting the continuation of a topic that has been interrupted by using a recurrent gesture;</p>
</list-item>
<list-item>
<p>&#x2022; Adaptors, releasing body tension through unconscious gesture.</p>
</list-item>
</list>The first three are independent of speech and can communicate on their own, while the last five complement speech. For the second class, any gesture that affects a spatial component of an object is considered manipulative. Deictic gestures, i.e., pointing gestures, can be considered both communicative and manipulative, as they communicate the object of focus while also manipulating a direction, which is a spatial component. Finally, instruction-based classification separates gestures into two categories: prescribed or free-form. The former designates a predefined dictionary of gestures that needs to be learnt, while the latter is non-restrictive. The hand gestures reviewed by <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref> were classified in one of those groups or in a group formed by a combination of those gesture categories. As mentioned by the authors, contextual classification has the flaw of being tightly tied to the context of speech. According to them, research on a classification that decouples hand gestures from speech could be interesting for both ergotic gestures (gestures for manipulation) and epistemic gestures (gestures for learning from tactile exploration).</p>
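<p>The three classification schemes above can be summarized as a small data model. The sketch below encodes them in Python (the enum and field names are our own labels, not Vuletic et al.&#x2019;s notation); note how a deictic gesture carries both contextual classes:</p>

```python
from dataclasses import dataclass
from enum import Enum

class Temporal(Enum):
    STATIC = "hand pose"
    DYNAMIC = "several hand moves"

class Contextual(Enum):
    COMMUNICATIVE = "conveys information"
    MANIPULATIVE = "affects a spatial component of an object"

class Instruction(Enum):
    PRESCRIBED = "predefined dictionary that must be learnt"
    FREE_FORM = "non-restrictive"

@dataclass
class Gesture:
    name: str
    temporal: Temporal
    contextual: frozenset   # a gesture may belong to both contextual classes
    instruction: Instruction

# A deictic (pointing) gesture: it communicates the object of focus while
# also manipulating a direction, hence both contextual classes apply.
pointing = Gesture(
    name="deictic point",
    temporal=Temporal.STATIC,
    contextual=frozenset({Contextual.COMMUNICATIVE, Contextual.MANIPULATIVE}),
    instruction=Instruction.PRESCRIBED,
)
```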
<p>
<xref ref-type="bibr" rid="B18">Koutsabasis and Vogiatzidakis (2019)</xref> conducted a review on mid-air interactions covering empirical studies such as gesture elicitations and user studies. They note that there is no standard for gesture design and that designs and implementations are targeted at selected users and a specific context. In terms of classification, <xref ref-type="bibr" rid="B18">Koutsabasis and Vogiatzidakis (2019)</xref> grouped the reviewed interactions into types that describe what the interaction actually does: Targeting, Navigate, Pan, Typing, Rotate, Select, Point, 3D model shaping, Grabbing 3D object, Travel, Zoom, Other. The first three are 2D interactions while the next seven are 3D interactions; Zoom and Other contain both 2D and 3D interactions. This classification can be considered a more granular version of the model of <xref ref-type="bibr" rid="B12">Hand (1997)</xref>. Indeed, we can group the subcategories as follows:<list list-type="simple">
<list-item>
<p>1. Navigation: Navigate, Zoom, Pan, Travel;</p>
</list-item>
<list-item>
<p>2. Selection: Select, Point, Grabbing 3D object;</p>
</list-item>
<list-item>
<p>3. Manipulation: 3D model shaping;</p>
</list-item>
<list-item>
<p>4. Application control: Typing.</p>
</list-item>
</list>
</p>
</sec>
</sec>
<sec sec-type="methods" id="s3">
<title>3 Methodology</title>
<sec id="s3-1">
<title>3.1 Systematic review</title>
<p>We conducted this scoping review following the PRISMA filtering method (<xref ref-type="bibr" rid="B33">Page et al., 2021</xref>), as described in this section and shown in <xref ref-type="fig" rid="F1">Figure 1</xref>. The research questions we answer in this review are:<list list-type="simple">
<list-item>
<p>&#x2022; RQ1: What hand interactions have been designed and how to classify them?</p>
</list-item>
<list-item>
<p>&#x2022; RQ2: What are the apparatus and algorithms for the implementation of those interactions?</p>
</list-item>
<list-item>
<p>&#x2022; RQ3: What impact did the availability of commercial mixed reality headset have on hand interaction design and prototyping?</p>
</list-item>
</list>
</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Screening process.</p>
</caption>
<graphic xlink:href="frvir-04-1171230-g001.tif"/>
</fig>
<p>To gather the articles, we made one request on the Scopus database with the following filters:<list list-type="simple">
<list-item>
<p>&#x2022; (&#x201c;augmented reality&#x201d; OR &#x201c;mixed reality&#x201d;) AND (&#x201c;HMD&#x201d; OR &#x201c;head-mounted display&#x201d; OR &#x201c;head mounted display&#x201d; OR &#x201c;helmet mounted display&#x201d; OR &#x201c;helmet-mounted display&#x201d; OR &#x201c;HoloLens&#x201d; OR &#x201c;egocentric&#x201d; OR &#x201c;glass&#x201d; OR &#x201c;headset&#x201d;), contextualizing the request on the subject of mixed and augmented reality;</p>
</list-item>
<list-item>
<p>&#x2022; AND (&#x201c;hand gesture&#x201d; OR &#x201c;hand interaction&#x201d; OR &#x201c;hand manipulation&#x201d; OR &#x201c;hand grasp&#x201d; OR &#x201c;mid-air interactions&#x201d;), restricting the articles to hand interactions.</p>
</list-item>
</list>
</p>
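<p>The two filter groups combine into a single boolean query. For illustration, such a query string can be assembled as follows (plain keyword joining; this sketch does not reproduce the exact Scopus field syntax):</p>

```python
context_terms = ['"augmented reality"', '"mixed reality"']
device_terms = ['"HMD"', '"head-mounted display"', '"head mounted display"',
                '"helmet mounted display"', '"helmet-mounted display"',
                '"HoloLens"', '"egocentric"', '"glass"', '"headset"']
hand_terms = ['"hand gesture"', '"hand interaction"', '"hand manipulation"',
              '"hand grasp"', '"mid-air interactions"']

def or_group(terms):
    """Wrap a list of keywords into a parenthesized OR group."""
    return "(" + " OR ".join(terms) + ")"

# The three OR groups are ANDed together, mirroring the filters above.
query = " AND ".join(or_group(g) for g in [context_terms, device_terms, hand_terms])
```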
<p>In the first part of the filters, we decided to include augmented reality because the border between mixed reality (MR) and augmented reality (AR) is fluid, and Milgram&#x2019;s Virtuality-Reality Continuum (<xref ref-type="bibr" rid="B28">Milgram et al., 1995</xref>) describes AR as part of MR. In this context, we consider experiences in which the user is mainly in the real world and where virtual content is added and interacts with the user and the physical world. The second part of the filters targets hand interactions. We also included egocentric point-of-view research because work on hand interactions that uses sensors in an egocentric point of view can be adapted to commercial mixed reality headsets, as they are endowed with similar sensors. The scope of the research was limited to 2017 to 2022 because the goal of this review is to explore the recent research that bloomed with the release of commercial products in 2017. The request was last updated on 23 October 2022 and resulted in 112 articles, from which we removed four duplicates. <xref ref-type="fig" rid="F2">Figure 2</xref> illustrates the year distribution of the articles.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Year distribution of the articles reviewed.</p>
</caption>
<graphic xlink:href="frvir-04-1171230-g002.tif"/>
</fig>
<sec id="s3-1-1">
<title>3.1.1 First screening</title>
<p>To filter the articles, we started with a screening of titles and abstracts using three inclusion criteria, of which at least one must be met:<list list-type="simple">
<list-item>
<p>&#x2022; the article describes an interaction technique;</p>
</list-item>
<list-item>
<p>&#x2022; the article describes a technical solution to support interactions;</p>
</list-item>
<list-item>
<p>&#x2022; the article is comparing different interaction techniques.</p>
</list-item>
</list>
</p>
<p>Furthermore, we also defined three exclusion criteria, which invalidate any article that meets one or several of them:<list list-type="simple">
<list-item>
<p>&#x2022; the article describes a technical solution for low-cost AR/MR such as Google Cardboard;</p>
</list-item>
<list-item>
<p>&#x2022; the article describes human-robot interactions;</p>
</list-item>
<list-item>
<p>&#x2022; the article describes multi-user collaboration applications.</p>
</list-item>
</list>
</p>
<p>The first exclusion criterion aims more precisely at filtering out articles that focus on algorithms tackling the computing and streaming cost issues of cheaper HMD devices rather than on the interactions themselves, which are the focus of this review. Of the included articles, 50% describe an interaction technique, 24% describe a technical solution to implement interactions and 12% compare different techniques.</p>
<p>This first screening resulted in 60 articles that were kept for this scoping review.</p>
</sec>
<sec id="s3-1-2">
<title>3.1.2 Second screening</title>
<p>For the second screening, made on the content of the articles, the exclusion criteria are:<list list-type="simple">
<list-item>
<p>&#x2022; the article describes a use case for natively supported interactions on commercial devices;</p>
</list-item>
<list-item>
<p>&#x2022; the article does not describe any prototype.</p>
</list-item>
</list>
</p>
<p>The first criterion allows us to keep articles focused on interaction techniques and implementations instead of use cases. It filtered out 24 articles, the largest share of the articles rejected in this second screening. We then removed one article that described a concept of architecture without a prototype. We also removed five articles that did not describe any hand interaction. Finally, we kept only the most recent version of the work among <xref ref-type="bibr" rid="B2">Bautista et al. (2018)</xref> and <xref ref-type="bibr" rid="B3">Bautista et al. (2020)</xref>, which resulted in 28 articles in total, or 26% of the articles from the initial request.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4 Analysis</title>
<sec id="s4-1">
<title>4.1 Extending the Frutos-Pascual and Macaranas taxonomy</title>
<p>In this literature review, we came across two articles, from <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref> and <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref>, that use a taxonomy introduced by Macaranas to compare two popular hand interaction modalities in commercial mixed reality HMDs.</p>
<sec id="s4-1-1">
<title>4.1.1 Macaranas taxonomy</title>
<p>
<xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref> introduced this taxonomy to classify strategies for making interactions intuitive, using as criterion the type of mental scheme used to learn them. The three classes are:<list list-type="simple">
<list-item>
<p>&#x2022; Metaphoric mappings, which are based on images that link repeated outcomes from everyday interactions to conceptual metaphors. An example given by the authors is that the height of a pile can be associated with the concept of a large quantity. As such, interactions based on going up and down to increase or decrease an output use this image and are hence metaphoric mappings;</p>
</list-item>
<list-item>
<p>&#x2022; Isomorphic mappings, which are based on a one-to-one spatial relation between the user input and the system output. The main focus of this mapping is the correlation between the spatial movement of the input and the effect produced. The output can be physical or abstract. The authors give the example of a User Interface (UI) element made of empty ticks horizontally lined up that is mapped to the sound volume. Spatial movement along the horizontal line fills the ticks, which are isomorphically mapped to the sound volume, as each tick represents a quantity of volume;</p>
</list-item>
<list-item>
<p>&#x2022; Conventional mappings, which are based on interactions adapted from previous interfaces the user has used. The authors underline that they exclude from this mapping the interactions grounded in image schema-based metaphors and one-to-one mappings, so as to differentiate it from the former two classes. In their third figure, the authors illustrate this with the rotation direction convention, learned through reading a clock and using a screw, being applied to learn how to use a control knob to increase the sound volume.</p>
</list-item>
</list>From our understanding of <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref>&#x2019;s model, the three strategies to make an interface intuitive are not mutually exclusive. Indeed, the rotation of the knob is mapped to a volume quantity and can thus also be considered an isomorphic mapping. On top of that, the rotation in the correct direction, defined across those three objects, is associated with the concept of an increase (the increase of time for the clock, and the increase in progress towards closing the jar or driving the screw for the jar cap and the screw, respectively) and can also be considered a metaphoric mapping.</p>
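<p>The authors&#x2019; volume-ticks example of an isomorphic mapping can be made concrete in a few lines of code (a minimal sketch with hypothetical names; each filled tick stands for a fixed quantity of volume):</p>

```python
def ticks_filled(x, widget_width, n_ticks=10):
    """Isomorphic mapping: the horizontal input position x is mapped
    one-to-one to a number of filled ticks, each tick standing for a
    fixed quantity of sound volume."""
    ratio = max(0.0, min(1.0, x / widget_width))  # clamp to the widget
    return round(n_ticks * ratio)

# Moving halfway along a 200-px-wide widget fills half of the ticks.
half_volume = ticks_filled(100.0, 200.0)  # -> 5
```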
</sec>
<sec id="s4-1-2">
<title>4.1.2 Application of the Macaranas taxonomy to hand interactions for commercial HMDs</title>
<p>
<xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref> and <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref> apply <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref>&#x2019;s model of strategies for making interactive systems intuitive to compare HoloLens 1 interactions with, respectively, Meta 2 and Magic Leap One interactions. Both designate HoloLens 1 interactions as Metaphoric Mappings, as opposed to Meta 2 and Magic Leap One interactions, which they consider Isomorphic Mappings. The latter are considered isomorphic mappings because, when an object is grabbed using the Meta 2 headset, the object position becomes linked to the hand position by means of a virtual representation of the hand. The hand tracking interaction creates a one-to-one spatial mapping between the hand joint movements and the feedback from the simulated physics in the virtual environment. Conversely, HoloLens 1 is considered by both authors as using a metaphoric mapping because it is reminiscent of mouse clicks on a desktop computer. However, in our opinion, it can be argued that the whole HoloLens 1 interaction relies not on the image of a mouse but on the cursor-based interaction from the WIMP paradigm, and is thus a conventional mapping. Besides, the mouse cursor image is not really associated with any abstract concept as described in the metaphorical association image scheme in <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref>&#x2019;s paper. The research of <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref> and <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref> highlights the need for a classification model for hand gestures unbound by speech, as suggested by the review of <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref>, because they use a model designed to classify learning strategies rather than hand interactions. Moreover, we have not found other work using the <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref> model to classify hand interactions, which implies that it is not yet standardized.</p>
</sec>
</sec>
<sec id="s4-2">
<title>4.2 Our extension for the specific case of hand interactions</title>
<p>The model we propose classifies interactions only by the way the user performs them and the way users learn how to use them. When we need to describe the use case of the interactions in our review, as defined in the classification of <xref ref-type="bibr" rid="B18">Koutsabasis and Vogiatzidakis (2019)</xref>, we keep the model of <xref ref-type="bibr" rid="B12">Hand (1997)</xref>, which is more general. As the model classifies the interactions using two criteria, we propose two axes:<list list-type="simple">
<list-item>
<p>&#x2022; The Interaction Behaviour axis defined by Isomorphic and Gestural in the extremes;</p>
</list-item>
<list-item>
<p>&#x2022; The Metaphor axis defined by Conventional and Realistic in the extremes.</p>
</list-item>
</list>
</p>
<p>On the first axis, Isomorphic follows the definition proposed by <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref>, a one-to-one spatial mapping between the user&#x2019;s hand movement and the effect output in mixed reality, with the added idea that it is unprescribed. Gestural, in opposition, encompasses the association of a motion with a punctual predefined effect, the extreme being pose-based interaction, where the user only needs to form a specific hand shape to trigger an action. For example, the HoloLens 1 head-pointing and air-tap selection interaction is on the gestural side of the spectrum because it relies mainly on the air-tap gesture as a predefined trigger of the action. In contrast, Meta 2 and Magic Leap One hand tracking manipulations are more isomorphic, as they rely on direct contact between the virtual hand representation and the mixed reality content. The better the isomorphy between the real hand and the virtual hand serving as an effector, the better the system reacts to the user&#x2019;s input. It can be noted, though, that categorizing headsets globally is not totally accurate, as it depends on the interactions used. For example, on Meta 2 and Magic Leap One, the user can grab an object: the grab, which triggers the selection of the object, is a gestural interaction, but the manipulation of the object is isomorphic. To compare with <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref>&#x2019;s gesture classifications described in 2.2, the isomorphic extreme would correspond to free-form gestures, while the gestural extreme would be static gestures. Deictic and dynamic gestures would be on the gestural side of the continuum, but closer to the boundary, as they can map a motion of the user to a continuous output.</p>
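<p>The grab example can be sketched in code: the pinch that triggers the selection is a discrete, gestural event, while the manipulation that follows is isomorphic, the object position being linked one-to-one to the hand position. This is a minimal illustration with hypothetical names, not the API of any particular headset:</p>

```python
class GrabbableObject:
    """Virtual object combining a gestural trigger with isomorphic manipulation."""

    def __init__(self):
        self.position = (0.0, 0.0, 0.0)
        self.held = False

    def on_pinch_start(self):
        # Gestural: a predefined hand pose triggers a punctual effect (selection).
        self.held = True

    def on_hand_moved(self, hand_position):
        # Isomorphic: while held, the object follows the hand one-to-one.
        if self.held:
            self.position = hand_position

    def on_pinch_end(self):
        self.held = False

cube = GrabbableObject()
cube.on_hand_moved((0.2, 1.0, 0.5))   # not held yet: no effect
cube.on_pinch_start()
cube.on_hand_moved((0.2, 1.0, 0.5))   # held: the position now tracks the hand
```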
<p>In the second axis, metaphor refers to the mental image that helps the user learn to execute the interaction. We define Conventional Metaphor similarly to <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref>&#x2019;s Conventional Mapping, but broadened to include all mental models that are defined for a specific context and learned through repetition. In contrast, a Realistic Metaphor is based on the user&#x2019;s everyday experience and on repeated patterns in the real world. For example, if the user mimics the shape of binoculars to zoom in on content, the metaphor is realistic. Conversely, if they use a two-finger pinch, referring to its touchscreen counterpart, the metaphor is conventional. We defined this classification as a continuum because a gesture can be more or less conventional compared to others. To explain this aspect, we will use a comparison with the model of <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref>. In <xref ref-type="fig" rid="F3">Figure 3</xref>, we have placed the gesture subcategories that can be used independently from speech on the Metaphor axis. We also placed metaphoric gestures, even though <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref> originally included them in speech-related gestures, because their definition is close to what <xref ref-type="bibr" rid="B25">Macaranas et al. (2015)</xref> designate as Metaphoric Mapping. The most realistic gestures are Pantomimic gestures because they represent exactly what a user would do in reality. The most conventional gestures are Semaphoric gestures because they have no semantic background and are learned solely through repetition, from zero, for the context of usage. In between, on the conventional side, we have Symbolic gestures, which have an acknowledged meaning. On the realistic side, we have Metaphoric gestures, which are tied to a similarity between a pattern in the real world and an abstract concept. 
Furthermore, a Symbolic gesture is in fact a Semaphoric gesture that has been culturally ingrained in the gesture language through repeated usage and whose meaning has become commonly accepted. As such, in our model, a conventional interaction shifts toward the realistic end of the continuum as it becomes a standard in society. To put it more simply, as an interaction designed for a specific system is adopted at the scale of society, it becomes a daily interaction that people can use as a reference mental model when learning new interactions. As a result, the placement of an interaction on the Metaphor continuum can evolve over time. This is why we position the HoloLens 1 air-tap based interaction on the conventional side of the metaphor spectrum, but close to the frontier with realistic metaphors.</p>
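<p>As an illustration of the two axes, the taxonomy can be sketched as a small data model in which each interaction receives a position on the behavior continuum (isomorphic to gestural) and on the metaphor continuum (realistic to conventional). The following Python sketch uses hypothetical names and illustrative coordinates; it is not part of the reviewed works.</p>

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    """One interaction placed on the two continua of the taxonomy."""
    name: str
    behavior: float  # 0.0 = fully isomorphic .. 1.0 = fully gestural
    metaphor: float  # 0.0 = fully realistic .. 1.0 = fully conventional

    def quadrant(self) -> str:
        # Coarse quadrant label; the continuum positions remain available
        # for finer comparisons (e.g., "close to the frontier").
        b = "gestural" if self.behavior >= 0.5 else "isomorphic"
        m = "conventional" if self.metaphor >= 0.5 else "realistic"
        return f"{b}/{m}"

# HoloLens 1 air-tap: gestural, conventional but near the realistic frontier.
air_tap = Interaction("HoloLens 1 air-tap", behavior=0.9, metaphor=0.55)
# Direct grab with simulated physics: isomorphic and realistic.
grab = Interaction("direct grab", behavior=0.1, metaphor=0.1)
```

<p>The coordinates above only encode the relative placements discussed in the text; the exact values carry no meaning.</p>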
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Independent communicative gesture categories placed in the proposed taxonomy.</p>
</caption>
<graphic xlink:href="frvir-04-1171230-g003.tif"/>
</fig>
<p>Our taxonomy is also illustrated in <xref ref-type="fig" rid="F4">Figure 4</xref> with, for each quadrant, an example of an interaction aimed at navigating virtual text content. In the top left corner of the graph, the Isomorphic Interaction based on a Conventional Metaphor is represented by the usage of a scroll bar widget. The user is able to grab the handle of the widget, which creates the one-to-one spatial relation between the widget and the hand. The user is also familiar with similar widgets found on computers and mobile phones. In the top right corner, the widget is replaced by a mid-air scrolling gesture, which consists of repeating a vertical translation down or up. Each translation is registered as a gesture and translated into a movement of the content. The metaphor is conventional as it is a recurrent gesture in touchscreen interfaces. In the bottom left corner, the user manipulates a virtual book. The pages of the book have a direct spatial relation with the finger representations of the user, and the metaphor is correlated to the manipulation of a real book. Finally, in the bottom right corner, the user mimics turning the pages of a book. The gesture of turning one page is recognized and translated into browsing of the content. The metaphor is the same as in the previous interaction, but the behavior is gestural because only the punctual movement is registered and interpreted by the system.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>Illustration of the Taxonomy of interactions by behaviour and metaphor (The position relative to the extremity of the axis is not relevant in this figure).</p>
</caption>
<graphic xlink:href="frvir-04-1171230-g004.tif"/>
</fig>
<p>The model we propose does not aim to be exhaustive in the classification of interactions, but to give a more global idea of what types of interactions have been designed, with a simple description of the learning cue and of the relation between the hand movement and the output.</p>
</sec>
<sec id="s4-3">
<title>4.3 Analysis of the reviewed paper using the metaphor-behavior taxonomy</title>
<p><xref ref-type="fig" rid="F5">Figure 5</xref>, together with <xref ref-type="table" rid="T1">Tables 1</xref>, <xref ref-type="table" rid="T2">2</xref>, summarizes the classification of the selected articles. The distribution of the screened articles in the taxonomy is illustrated in <xref ref-type="fig" rid="F6">Figure 6</xref>. We can observe that in terms of the metaphor used, the distribution is almost even. When it comes to the behaviour of the interaction, there are more gestural interactions than isomorphic ones. For both axes, there are papers that design and evaluate interactions on both sides of the spectrum. For example, gesture authoring tools such as the one in the paper of <xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref> can be used to design interactions based on both types of metaphors. We decided not to classify the article of <xref ref-type="bibr" rid="B31">Mueller et al. (2017)</xref> as they propose a hand tracking technical solution to the issue of hand-object occlusion, which could be used to design any type of interaction from the classification.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>Classification of the screened articles. Articles numbered using the <xref ref-type="table" rid="T1">tables 1</xref> and <xref ref-type="table" rid="T2">2</xref>.</p>
</caption>
<graphic xlink:href="frvir-04-1171230-g005.tif"/>
</fig>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Table of screened articles (1/2).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Number</th>
<th align="left">Authors</th>
<th align="left">Title</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">1</td>
<td align="left">
<xref ref-type="bibr" rid="B34">Plasson et al. (2020)</xref>
</td>
<td align="left">3D Tabletop AR: A comparison of mid-air, touch and Touch &#x2b; Mid-Air interaction</td>
</tr>
<tr>
<td align="left">2</td>
<td align="left">
<xref ref-type="bibr" rid="B48">Zhang et al. (2020)</xref>
</td>
<td align="left">ARSketch: Sketch-Based User Interface for Augmented Reality Glasses</td>
</tr>
<tr>
<td align="left">3</td>
<td align="left">
<xref ref-type="bibr" rid="B1">Ababsa et al. (2020)</xref>
</td>
<td align="left">Combining hololens and leap-motion for free hand-based 3d interaction in mr environments</td>
</tr>
<tr>
<td align="left">4</td>
<td align="left">
<xref ref-type="bibr" rid="B19">Lee and Chu. (2018)</xref>
</td>
<td align="left">Dual-MR: Interaction with mixed reality using smartphones</td>
</tr>
<tr>
<td align="left">5</td>
<td align="left">
<xref ref-type="bibr" rid="B7">Chang et al. (2017)</xref>
</td>
<td align="left">Evaluating gesture-based augmented reality annotation</td>
</tr>
<tr>
<td align="left">6</td>
<td align="left">
<xref ref-type="bibr" rid="B22">Lin and Yamaguchi. (2021)</xref>
</td>
<td align="left">Evaluation of Operability by Different Gesture Input Patterns for Crack Inspection Work Support System</td>
</tr>
<tr>
<td align="left">7</td>
<td align="left">
<xref ref-type="bibr" rid="B24">Lu et al. (2019)</xref>
</td>
<td align="left">FMHash: Deep Hashing of In-Air-Handwriting for User Identification</td>
</tr>
<tr>
<td align="left">8</td>
<td align="left">
<xref ref-type="bibr" rid="B47">Yu et al. (2017)</xref>
</td>
<td align="left">Geometry-aware interactive AR authoring using a smartphone in a wearable AR environment</td>
</tr>
<tr>
<td align="left">9</td>
<td align="left">
<xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref>
</td>
<td align="left">Gesture Knitter: A Hand Gesture Design Tool for Head-Mounted Mixed Reality Applications</td>
</tr>
<tr>
<td align="left">10</td>
<td align="left">
<xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref>
</td>
<td align="left">Head Mounted Display Interaction Evaluation: Manipulating Virtual Objects in Augmented Reality</td>
</tr>
<tr>
<td align="left">11</td>
<td align="left">
<xref ref-type="bibr" rid="B20">Lee et al. (2019)</xref>
</td>
<td align="left">HIBEY: Hide the keyboard in augmented reality</td>
</tr>
<tr>
<td align="left">12</td>
<td align="left">
<xref ref-type="bibr" rid="B14">Jailungka and Charoenseang. (2018)</xref>
</td>
<td align="left">Intuitive 3D model prototyping with leap motion and microsoft hololens</td>
</tr>
<tr>
<td align="left">13</td>
<td align="left">
<xref ref-type="bibr" rid="B41">Sun et al. (2019)</xref>
</td>
<td align="left">MagicHand: Interact with iot devices in augmented reality environment</td>
</tr>
<tr>
<td align="left">14</td>
<td align="left">
<xref ref-type="bibr" rid="B15">Jang et al. (2017)</xref>
</td>
<td align="left">Metaphoric Hand Gestures for Orientation-Aware VR Object Manipulation With an Egocentric Viewpoint</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Table of screened articles (2/2).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Number</th>
<th align="left">Authors</th>
<th align="left">Title</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">15</td>
<td align="left">
<xref ref-type="bibr" rid="B46">Xiao et al. (2018)</xref>
</td>
<td align="left">MRTouch: Adding touch input to head-mounted mixed reality</td>
</tr>
<tr>
<td align="left">16</td>
<td align="left">
<xref ref-type="bibr" rid="B31">Mueller et al. (2017)</xref>
</td>
<td align="left">Real-Time Hand Tracking under Occlusion from an Egocentric RGB-D Sensor</td>
</tr>
<tr>
<td align="left">17</td>
<td align="left">
<xref ref-type="bibr" rid="B6">Caputo et al. (2021)</xref>
</td>
<td align="left">SHREC 2021: Skeleton-based hand gesture recognition in the wild</td>
</tr>
<tr>
<td align="left">18</td>
<td align="left">
<xref ref-type="bibr" rid="B16">Kim et al. (2018)</xref>
</td>
<td align="left">SWAG Demo: Smart Watch Assisted Gesture Interaction for Mixed Reality Head-Mounted Displays</td>
</tr>
<tr>
<td align="left">19</td>
<td align="left">
<xref ref-type="bibr" rid="B3">Bautista et al. (2020)</xref>
</td>
<td align="left">Usability test with medical personnel of a hand-gesture control techniques for surgical environment</td>
</tr>
<tr>
<td align="left">20</td>
<td align="left">
<xref ref-type="bibr" rid="B29">Min et al. (2019)</xref>
</td>
<td align="left">VPModel: High-Fidelity Product Simulation in a Virtual-Physical Environment</td>
</tr>
<tr>
<td align="left">21</td>
<td align="left">
<xref ref-type="bibr" rid="B8">Choudhary et al. (2021)</xref>
</td>
<td align="left">Real-Time Magnification in Augmented Reality</td>
</tr>
<tr>
<td align="left">22</td>
<td align="left">
<xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al. (2022)</xref>
</td>
<td align="left">The Gesture Authoring Space: Authoring Customised Hand Gestures for Grasping Virtual Objects in Immersive Virtual Environments</td>
</tr>
<tr>
<td align="left">23</td>
<td align="left">
<xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref>
</td>
<td align="left">An empirical evaluation of two natural hand interaction systems in augmented reality</td>
</tr>
<tr>
<td align="left">24</td>
<td align="left">
<xref ref-type="bibr" rid="B39">Su et al. (2022)</xref>
</td>
<td align="left">A Natural Bare-Hand Interaction Method With Augmented Reality for Constraint-Based Virtual Assembly</td>
</tr>
<tr>
<td align="left">25</td>
<td align="left">
<xref ref-type="bibr" rid="B13">Hu et al. (2018)</xref>
</td>
<td align="left">3D separable convolutional neural network for dynamic hand gesture recognition</td>
</tr>
<tr>
<td align="left">26</td>
<td align="left">
<xref ref-type="bibr" rid="B40">Su et al. (2021)</xref>
</td>
<td align="left">Smart training: Mask R-CNN oriented approach</td>
</tr>
<tr>
<td align="left">27</td>
<td align="left">
<xref ref-type="bibr" rid="B38">Shrestha et al. (2018)</xref>
</td>
<td align="left">Computer-Vision based Bare-Hand Augmented Reality Interface for Controlling an AR Object</td>
</tr>
<tr>
<td align="left">28</td>
<td align="left">
<xref ref-type="bibr" rid="B21">Lee et al. (2022)</xref>
</td>
<td align="left">Virtual Keyboards with Real-Time and Robust Deep Learning-Based Gesture Recognition</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Distribution of the classification in the reviewed articles.</p>
</caption>
<graphic xlink:href="frvir-04-1171230-g006.tif"/>
</fig>
<p>In this section, we answer RQ1 using the proposed taxonomy.</p>
<sec id="s4-3-1">
<title>4.3.1 Comparison between isomorphic interactions and gestural interactions</title>
<p>The article of <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref>, on which this taxonomy is built, is among the articles screened in this review. As mentioned above, <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref> compare the interactions of the HoloLens 1 and Meta 2 headsets. Similarly, <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref> compare HoloLens 1 and Magic Leap One. The authors designate the interactions on HoloLens as Metaphoric Mappings and the interactions on Meta 2 and on Magic Leap as Isomorphic Mappings. However, following our model, the HoloLens 1 head-pointing cursor and pinch-gesture selection-based interactions are Gestural Interactions using a Conventional Metaphor, while the Meta 2 and Magic Leap One trajectory and simulated physics-based interactions are Isomorphic Interactions based on Realistic Metaphors. The user studies from <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref> show that the Meta 2 interactions are preferred, qualified as more natural and usable, and require less cognitive load according to the participants. Several other articles from the screening (<xref ref-type="bibr" rid="B14">Jailungka and Charoenseang, 2018</xref>; <xref ref-type="bibr" rid="B16">Kim et al., 2018</xref>; <xref ref-type="bibr" rid="B1">Ababsa et al., 2020</xref>; <xref ref-type="bibr" rid="B3">Bautista et al., 2020</xref>) also highlight the good usability of grasping objects. HoloLens 1 interactions only outperform Meta 2 interactions in terms of precision for scaling, which is a manipulation task. On the other hand, <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref> show that although Magic Leap One is preferred in the subjective questionnaires, there are no statistically significant differences between HoloLens 1 interactions and Magic Leap One interactions in either the objective data (accuracy, number of mistakes, completion time) or the subjective criteria (usefulness, preference and recommendation).</p>
<p>In our opinion, the studies, instead of comparing the behavior of the interactions, reflect the importance of the metaphors used by both interactions. Indeed, <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref> explicitly use the realism of the interactions to oppose HoloLens 1 and Magic Leap One. In both articles (<xref ref-type="bibr" rid="B10">Frutos-Pascual et al., 2019</xref>; <xref ref-type="bibr" rid="B37">Serrano et al., 2022</xref>), the Meta 2 and Magic Leap One interactions are favored because the realistic metaphor of grasping objects to explore the world is well ingrained in the knowledge of the users. Besides, even if HoloLens 1 interactions use the conventional metaphor of the WIMP paradigm, the translation of the interface from 2D screen interactions to 3D mid-air interactions requires the user to adjust their spatial perception from the 2D mental image. According to <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref>, the high performance of HoloLens 1 interactions for scaling might be due to their similarity to the interactions learned on desktop computers and to the lack of realistic metaphors for this kind of manipulation task. To come back to the behavior of the interaction, the cursor pointing is gestural because of the selection trigger gesture, but it also has an isomorphic component since the user interacts with 3D handles spatially constrained to the head cursor once selected. The reason why HoloLens 1 interactions perform better than Meta 2 for this specific task can be related to the fact that a common practice in human-computer interaction research to increase the precision of an interaction is to change the mapping of the movement: Meta 2 has, by nature of its default interaction, a one-to-one mapping, while the HoloLens 1 mapping can be given a different scaling to meet the precision required for the task. <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref> also support the idea that nonrealistic interactions can work better since, according to them, a realistic interaction that is not realistic enough can suffer from a problem similar to the uncanny valley (<xref ref-type="bibr" rid="B27">McMahan et al., 2016</xref>) found in the field of robotics.</p>
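<p>The remapping practice mentioned above is commonly known in human-computer interaction as control-display (CD) gain: hand displacements are scaled before being applied to the virtual handle, so that a gain below one trades speed for precision while a gain of one reproduces a one-to-one isomorphic mapping. The following Python sketch is a generic illustration of the idea, not the HoloLens 1 implementation.</p>

```python
def apply_cd_gain(hand_delta, gain):
    """Scale a hand displacement (dx, dy, dz) by a control-display gain.

    gain == 1.0 reproduces a one-to-one (isomorphic) mapping;
    gain < 1.0 makes the handle move less than the hand (precision mode).
    """
    return tuple(gain * d for d in hand_delta)

one_to_one = apply_cd_gain((0.1, 0.0, 0.0), gain=1.0)  # handle follows the hand
precise = apply_cd_gain((0.1, 0.0, 0.0), gain=0.25)    # 10 cm of hand -> 2.5 cm
```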
<p>To finish with the comparison between isomorphic and gestural interactions, the article of <xref ref-type="bibr" rid="B3">Bautista et al. (2020)</xref> compares Meta 1 interactions with the gesture interface of the Myo armband. The authors use a differentiation similar to our taxonomy by qualifying the Meta interactions as Manipulation Control, which corresponds to our isomorphic interactions, and the Myo armband interface as Gesture Control, which corresponds to our gestural interactions. The two interaction techniques are compared in user studies through an evaluation of the task of controlling the application. The study shows that compared to Meta 1, the Myo armband is more comfortable, produces fewer errors, requires less help from the researcher for the participants to complete the task, and has a better completion time. However, the article does not detail the steps executed by the participants, which makes it hard to understand what kinds of gestures were used and to analyze why they perform better. Furthermore, the Myo armband might be preferred because of the uptime and accuracy of the recognition of the interactions. Meta 2 is limited to sensors with an egocentric point of view, which have a very limited tracking volume in front of the user, as opposed to the Myo armband, which detects gestures using electromyography data and works as long as the user is wearing the armband. This factor can significantly impact the number of errors, the completion time and the perceived comfort.</p>
</sec>
<sec id="s4-3-2">
<title>4.3.2 Interactions mainly isomorphic</title>
<p>On the isomorphic side of the graph in <xref ref-type="fig" rid="F5">Figure 5</xref>, there are five groups.<list list-type="simple">
<list-item>
<p>1. Grab and direct hand manipulation (<xref ref-type="bibr" rid="B14">Jailungka and Charoenseang, 2018</xref>; <xref ref-type="bibr" rid="B16">Kim et al., 2018</xref>; <xref ref-type="bibr" rid="B10">Frutos-Pascual et al., 2019</xref>; <xref ref-type="bibr" rid="B1">Ababsa et al., 2020</xref>; <xref ref-type="bibr" rid="B3">Bautista et al., 2020</xref>; <xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al., 2022</xref>; <xref ref-type="bibr" rid="B37">Serrano et al., 2022</xref>; <xref ref-type="bibr" rid="B39">Su et al., 2022</xref>);</p>
</list-item>
<list-item>
<p>2. Manipulation with a virtual tool as a proxy (<xref ref-type="bibr" rid="B15">Jang et al., 2017</xref>);</p>
</list-item>
<list-item>
<p>3. Mid-air drawing (<xref ref-type="bibr" rid="B7">Chang et al., 2017</xref>; <xref ref-type="bibr" rid="B48">Zhang et al., 2020</xref>);</p>
</list-item>
<list-item>
<p>4. Physical surface touchscreen (<xref ref-type="bibr" rid="B46">Xiao et al., 2018</xref>; <xref ref-type="bibr" rid="B34">Plasson et al., 2020</xref>);</p>
</list-item>
<list-item>
<p>5. Manipulation using a smartphone as a proxy (<xref ref-type="bibr" rid="B47">Yu et al., 2017</xref>; <xref ref-type="bibr" rid="B19">Lee and Chu, 2018</xref>);</p>
</list-item>
</list>The first three groups use Realistic Metaphors with a one-to-one spatial relation between the movement of the hand and the feedback of the system. More specifically, the first and second groups allow the user to manipulate virtual content directly, using the hands or a virtual tool in contact with the content, and are intuitive because of their counterpart interactions in the real world. In the first group, <xref ref-type="bibr" rid="B39">Su et al. (2022)</xref> and <xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al. (2022)</xref> go further in the implementation of the interaction with a more realistic design of the grab action to select an object. The former considers realistic collisions and physical constraints for the grab and for the assembly of virtual components on virtual and physical objects, while the latter proposes an authoring tool that records custom hand grab poses for each virtual object. In <xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al. (2022)</xref>&#x2019;s paper, the user studies show that the custom grab is perceived as more usable than the pinch and the standard closing-hand grab. Making the behavior of virtual content realistic through physics models improves the experience, as the user is less prone to mistakes and understands more easily how to manipulate the content thanks to their prior experiences. The third group is the reproduction of drawing and writing in the virtual environment. The articles from <xref ref-type="bibr" rid="B48">Zhang et al. (2020)</xref> and <xref ref-type="bibr" rid="B7">Chang et al. (2017)</xref> enable, respectively, 2D mid-air sketching and 3D annotations on the physical world. The former emphasizes the usage of convolutional neural networks (CNN) for gesture recognition and sketch auto-completion. The latter shows the benefit of the common practice of limiting the degrees of freedom in interactions. 
Indeed, in their user studies, 2D-mapped drawing was preferred and more precise than simple 3D mid-air drawing. Furthermore, according to <xref ref-type="bibr" rid="B7">Chang et al. (2017)</xref>, cleaning the visual feedback for the annotations by showing a beautified version makes the drawing process faster. The works of <xref ref-type="bibr" rid="B48">Zhang et al. (2020)</xref> and <xref ref-type="bibr" rid="B7">Chang et al. (2017)</xref> show that it is possible to go a step further in the design of a more intelligent interaction by analyzing the trajectories of the hand and improving the effect returned by the system. To end with mid-air drawing, <xref ref-type="bibr" rid="B24">Lu et al. (2019)</xref> propose user identification using a hash code of the mid-air handwriting signature in mixed reality. As the system uses a CNN, the main problem is the requirement to retrain the network for each additional signature.</p>
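<p>The degree-of-freedom limitation discussed above can be illustrated by projecting 3D mid-air fingertip samples onto a fixed drawing plane, discarding the depth component so that the stroke becomes a 2D drawing. The sketch below, which assumes a unit-length plane normal, is a generic illustration rather than the implementation of <xref ref-type="bibr" rid="B7">Chang et al. (2017)</xref>.</p>

```python
def project_to_plane(point, plane_origin, normal):
    """Orthogonally project a 3D point onto a plane (normal must be unit length)."""
    v = [p - o for p, o in zip(point, plane_origin)]
    # Signed distance from the point to the plane, along the normal.
    dist = sum(vi * ni for vi, ni in zip(v, normal))
    # Remove the out-of-plane component, keeping only the two in-plane DOF.
    return tuple(p - dist * n for p, n in zip(point, normal))

# A fingertip sample 3 units in front of the z = 0 plane lands on the plane.
flat = project_to_plane((1.0, 2.0, 3.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```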
<p>Groups 4 and 5 use Conventional Metaphors derived from the usage of smartphones and tablets. Indeed, in the fourth group, <xref ref-type="bibr" rid="B34">Plasson et al. (2020)</xref> and <xref ref-type="bibr" rid="B46">Xiao et al. (2018)</xref> use the spatial understanding of mixed reality to implement an intelligent adaptive user interface by converting physical surfaces into touchscreens. The haptic feedback makes the interaction more natural. These works (<xref ref-type="bibr" rid="B46">Xiao et al., 2018</xref>; <xref ref-type="bibr" rid="B34">Plasson et al., 2020</xref>) contribute to reducing the gap between the virtual environment and the real environment. Finally, the fifth group takes advantage of the familiarity users have with smartphones for interacting with digital content, as well as of the extra sensors of the smartphones, to improve the spatial understanding of mixed reality. <xref ref-type="bibr" rid="B19">Lee and Chu (2018)</xref> propose to capture virtual objects on the screen of a smartphone. The user is then able to manipulate the content by moving the smartphone or by using touchscreen interactions. <xref ref-type="bibr" rid="B47">Yu et al. (2017)</xref> use the smartphone as a pointer to select virtual objects, as if the user were manipulating a laser. The strong point of their work is the ubiquity of smartphones, which makes the interfacing between the user and the mixed reality environment more flexible.</p>
</sec>
<sec id="s4-3-3">
<title>4.3.3 Interactions mainly gestural</title>
<p>On the gestural side of the graph in <xref ref-type="fig" rid="F5">Figure 5</xref>, we identified ten groups.<list list-type="simple">
<list-item>
<p>1. Finger count pose (<xref ref-type="bibr" rid="B38">Shrestha et al., 2018</xref>; <xref ref-type="bibr" rid="B41">Sun et al., 2019</xref>; <xref ref-type="bibr" rid="B3">Bautista et al., 2020</xref>; <xref ref-type="bibr" rid="B48">Zhang et al., 2020</xref>; <xref ref-type="bibr" rid="B6">Caputo et al., 2021</xref>; <xref ref-type="bibr" rid="B8">Choudhary et al., 2021</xref>);</p>
</list-item>
<list-item>
<p>2. Symbolic gesture pose (<xref ref-type="bibr" rid="B6">Caputo et al., 2021</xref>; <xref ref-type="bibr" rid="B22">Lin and Yamaguchi, 2021</xref>; <xref ref-type="bibr" rid="B30">Mo et al., 2021</xref>);</p>
</list-item>
<list-item>
<p>3. Uncommon dynamic gestures for discrete outputs (<xref ref-type="bibr" rid="B13">Hu et al., 2018</xref>; <xref ref-type="bibr" rid="B3">Bautista et al., 2020</xref>; <xref ref-type="bibr" rid="B6">Caputo et al., 2021</xref>; <xref ref-type="bibr" rid="B30">Mo et al., 2021</xref>);</p>
</list-item>
<list-item>
<p>4. Symbolic dynamic gestures for discrete outputs (<xref ref-type="bibr" rid="B6">Caputo et al., 2021</xref>; <xref ref-type="bibr" rid="B30">Mo et al., 2021</xref>)</p>
</list-item>
<list-item>
<p>5. Pantomimic gestures for discrete outputs (<xref ref-type="bibr" rid="B15">Jang et al., 2017</xref>; <xref ref-type="bibr" rid="B29">Min et al., 2019</xref>; <xref ref-type="bibr" rid="B6">Caputo et al., 2021</xref>; <xref ref-type="bibr" rid="B30">Mo et al., 2021</xref>);</p>
</list-item>
<list-item>
<p>6. Finger-pointing (<xref ref-type="bibr" rid="B40">Su et al., 2021</xref>);</p>
</list-item>
<list-item>
<p>7. Head-pointing cursor and pinch (<xref ref-type="bibr" rid="B10">Frutos-Pascual et al., 2019</xref>; <xref ref-type="bibr" rid="B37">Serrano et al., 2022</xref>);</p>
</list-item>
<list-item>
<p>8. Virtual continuous keyboard (<xref ref-type="bibr" rid="B20">Lee et al., 2019</xref>; <xref ref-type="bibr" rid="B21">Lee et al., 2022</xref>);</p>
</list-item>
<list-item>
<p>9. Uncommon dynamic gestures for continuous outputs (<xref ref-type="bibr" rid="B41">Sun et al., 2019</xref>);</p>
</list-item>
<list-item>
<p>10. Pantomimic for continuous outputs (<xref ref-type="bibr" rid="B15">Jang et al., 2017</xref>; <xref ref-type="bibr" rid="B34">Plasson et al., 2020</xref>);</p>
</list-item>
</list>
</p>
<p>The first two groups describe gestural-only interactions and differ slightly in the metaphor used to learn them, even if both can be categorized as conventional. More specifically, in both groups, users learn to perform a specific pose, which is static. However, the first group corresponds to what <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref> call semaphoric gestures, since the gestures have no semantic background, which is the extreme of the conventional side of the spectrum. Indeed, the poses are based on finger extension: depending on the number of fingers extended, a different functionality is mapped to the pose. The main advantage of this technique is the ease of detection by computer vision algorithms, as the different poses are distinguishable. In contrast, the second group uses symbolic metaphors, which means that the poses have cultural semantic values. For example, the &#x201c;ok&#x201d; gesture found in the works of <xref ref-type="bibr" rid="B22">Lin and Yamaguchi (2021)</xref> and <xref ref-type="bibr" rid="B6">Caputo et al. (2021)</xref>, which consists of making a circle with the index finger and the thumb, is recognized in both articles and commonly carries the cultural meaning of agreement in English-speaking countries. This type of metaphor is a double-edged sword, as it can make the interaction more intuitive for one population but can also be confusing or error-inducing for other populations where the meaning is different. Indeed, the &#x201c;ok&#x201d; gesture is actually offensive in Brazil and got Richard Nixon booed by the crowd in Rio de Janeiro in the 1950s, while in Japan, it represents money (<xref ref-type="bibr" rid="B35">Reuters, 1996</xref>). It can be noted that the papers of <xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref> and <xref ref-type="bibr" rid="B6">Caputo et al. 
(2021)</xref> are classified in all the groups described above, as they are, respectively, a modular architecture for gesture design and a summary of the results of a hand gesture recognition competition. As such, they support gestures in all of the mentioned categories.</p>
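<p>The ease of detection of finger-count poses (group 1) can be illustrated with a simple heuristic on hand-tracking landmarks: a finger is counted as extended when its tip is farther from the wrist than its middle joint by some margin. The landmark layout and threshold below are assumptions for illustration, not taken from the reviewed systems.</p>

```python
import math

def dist(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def count_extended(wrist, fingers, margin=0.02):
    """Count extended fingers.

    fingers: list of (middle_joint, tip) 3D position pairs, in meters.
    A finger counts as extended when its tip is at least `margin` farther
    from the wrist than its middle joint.
    """
    return sum(1 for mid, tip in fingers
               if dist(wrist, tip) > dist(wrist, mid) + margin)

wrist = (0.0, 0.0, 0.0)
extended_finger = ((0.0, 0.05, 0.0), (0.0, 0.09, 0.0))  # tip well past the joint
curled_finger = ((0.0, 0.05, 0.0), (0.0, 0.055, 0.0))   # tip folded back
```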
<p>Groups 3, 4 and 5, which cover dynamic gestural interactions, are slightly more isomorphic than the first two groups, as illustrated in our comparison with <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref>&#x2019;s model in <xref ref-type="fig" rid="F3">Figure 3</xref>. The difference between the three groups is again the learning metaphor. In terms of metaphor, groups 3 and 4 have a relationship similar to the one between groups 1 and 2, as both use a conventional metaphor but rely more specifically on semaphoric and symbolic gestures, respectively. We named group 3 &#x201c;uncommon gestures&#x201d; because the articles describe a variety of unrelated gestures, such as doing a circle motion (<xref ref-type="bibr" rid="B38">Shrestha et al., 2018</xref>; <xref ref-type="bibr" rid="B30">Mo et al., 2021</xref>). Group 5, however, is at the opposite end of the spectrum in terms of metaphor, as it uses pantomimic gestures, which sit at the extreme of the realistic side of the metaphor spectrum. In this group, <xref ref-type="bibr" rid="B15">Jang et al. (2017)</xref> propose to invoke virtual objects by reproducing hand grasp poses as if the real object were in hand, while <xref ref-type="bibr" rid="B29">Min et al. (2019)</xref> augment an inert 3D-printed camera prototype with visual feedback when the user mimics the usage of a real camera. Following the definition given by <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref>, a pantomimic gesture is by nature dynamic, as it is a sequence of actions.</p>
<p>Going farther to the left on the isomorphic-gestural spectrum, we have group 6, which is composed of a single article from <xref ref-type="bibr" rid="B40">Su et al. (2021)</xref>. The authors designed a deictic gesture to select a physical object in mixed reality. This gesture is as realistic as pantomimic gestures because it uses the exact motion that we use in reality to point at objects. It is a more realistic way to select an object compared to group 7, which designates the head-pointing cursor-based interactions found in HoloLens 1. For the latter, a real-time spatial mapping between the user&#x2019;s head movement and a virtual content can be established after the user selects an item with the &#x201c;pinch&#x201d; selection gesture, which consists of touching the index finger with the thumb to simulate a mouse &#x201c;click&#x201d;. As mentioned in 4.2, the HoloLens 1 interactions, while using a conventional metaphor, are placed close to the frontier with realistic metaphors because they are based on the WIMP paradigm, which has become part of our daily lives across multiple devices, from desktop computers to mobile devices. The reason why group 6 is more gestural than group 7 is that the finger pointing described in <xref ref-type="bibr" rid="B40">Su et al. (2021)</xref>&#x2019;s work has a discrete output, as its only task is selection.</p>
<p>The last three groups have the same position as group 7 on the behaviour spectrum. Groups 8 and 9 in particular have the exact same position on the metaphor spectrum as well, because they use semaphoric gestures. Even though group 8 designs character input interactions based on keyboards, the trigger gestures do not have any semantic value. <xref ref-type="bibr" rid="B20">Lee et al. (2019)</xref> design a symbol typing interaction where the position of the hand during the movement allows the user to select a letter, type in a letter, delete a letter, and select a word from the auto-completion system. The gesture is totally different from the writing or typing activity in the physical world. Similarly, <xref ref-type="bibr" rid="B21">Lee et al. (2022)</xref> use a specific gesture that relies on the number of extended fingers, as well as a thumb motion, to trigger the typing. In group 9, the only article is the work of <xref ref-type="bibr" rid="B41">Sun et al. (2019)</xref>; it describes a dynamic gesture that is triggered by a finger-extension-based pose and then maps the movement of the hand to a continuous value on a smart connected object. An example given in the article is the control of the sound volume with a horizontal translation while three fingers are extended. Finally, group 10 encompasses dynamic pantomimic gestures that are used to control continuous outputs. <xref ref-type="bibr" rid="B34">Plasson et al. (2020)</xref> use the metaphor of pulling the string of a floating helium balloon: the distance between the hand and the base of a virtual quantified stack regulates the value of the output on the stack, as if it were the height of a floating balloon. The article of <xref ref-type="bibr" rid="B15">Jang et al. (2017)</xref> is also in this group because the user can use the invoked tools to manipulate virtual objects using the properties and functionalities of the invoked tool.</p>
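<p>At the level of the mapping, the balloon metaphor of group 10 reduces to a clamped linear function from the vertical hand position to the controlled value. The sketch below is a generic illustration with hypothetical parameter names, not the implementation of <xref ref-type="bibr" rid="B34">Plasson et al. (2020)</xref>.</p>

```python
def balloon_value(hand_y, base_y, max_height, v_min=0.0, v_max=100.0):
    """Map the height of the hand above the stack base to an output value.

    The normalized "string length" is clamped to [0, 1] and then mapped
    linearly onto the output range [v_min, v_max].
    """
    t = (hand_y - base_y) / max_height
    t = min(1.0, max(0.0, t))  # clamp: the balloon cannot go below the base
    return v_min + t * (v_max - v_min)

# Halfway up a 1 m stack yields the midpoint of the default 0-100 range.
mid = balloon_value(hand_y=0.5, base_y=0.0, max_height=1.0)
```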
</sec>
</sec>
<sec id="s4-4">
<title>4.4 Algorithms and devices</title>
<p>In this section, we answer RQ2 and RQ3.</p>
<sec id="s4-4-1">
<title>4.4.1 Devices</title>
<p>In terms of devices, <xref ref-type="fig" rid="F7">Figure 7</xref> summarizes the different devices used in the reviewed articles. Thirteen articles implemented their prototypes on the HoloLens 1. 50% of those articles use external devices and/or sensors with the HMD to improve the capture of the interactions. This trend is due to the limitations of the first version of the HoloLens, which only allowed limited head-pointer cursor-based interactions and voice commands. As mentioned previously, the articles from <xref ref-type="bibr" rid="B19">Lee and Chu. (2018)</xref> and <xref ref-type="bibr" rid="B47">Yu et al. (2017)</xref> use smartphones and their embedded sensors to complement mixed reality interactions. The other articles improve the hand interactions with better hand tracking using external sensors such as depth cameras, with the LeapMotion being the most popular, and motion capture systems such as OptiTrack (<xref ref-type="fig" rid="F7">Figure 7</xref>). The articles based on the Rokid Glass and the Oculus Rift headsets also use depth cameras to support hand interactions. In contrast, the articles on the Meta, HoloLens 2 and Magic Leap One do not use any external sensors because of their native support of real-time hand tracking. Six articles present research aimed at MR HMDs but have yet to be tested on actual devices.</p>
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Distribution of devices in the reviewed articles.</p>
</caption>
<graphic xlink:href="frvir-04-1171230-g007.tif"/>
</fig>
</sec>
<sec id="s4-4-2">
<title>4.4.2 Algorithms</title>
<p>In this review, ten articles describe the algorithms used to implement their interaction techniques. The remaining articles mostly implement interactions built on the interactions and hand tracking natively supported by the headset and sensors after calibration. Following the trend of CNNs, four articles are based on this solution. In this context, CNNs are used to:<list list-type="simple">
<list-item>
<p>&#x2022; track the hand joints (<xref ref-type="bibr" rid="B31">Mueller et al., 2017</xref>; <xref ref-type="bibr" rid="B48">Zhang et al., 2020</xref>);</p>
</list-item>
<list-item>
<p>&#x2022; detect gestures (<xref ref-type="bibr" rid="B13">Hu et al., 2018</xref>; <xref ref-type="bibr" rid="B41">Sun et al., 2019</xref>; <xref ref-type="bibr" rid="B48">Zhang et al., 2020</xref>; <xref ref-type="bibr" rid="B6">Caputo et al., 2021</xref>; <xref ref-type="bibr" rid="B40">Su et al., 2021</xref>; <xref ref-type="bibr" rid="B21">Lee et al., 2022</xref>);</p>
</list-item>
<list-item>
<p>&#x2022; process trajectories (<xref ref-type="bibr" rid="B24">Lu et al., 2019</xref>; <xref ref-type="bibr" rid="B48">Zhang et al., 2020</xref>).</p>
</list-item>
</list>More specifically, <xref ref-type="bibr" rid="B31">Mueller et al. (2017)</xref> propose two CNNs that use both RGB images and depth data to compute hand joint trajectories in hand-object occlusion situations, by localizing the hand center and regressing the joint positions relative to this center.</p>
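<p>The decomposition used by <xref ref-type="bibr" rid="B31">Mueller et al. (2017)</xref> amounts to adding the regressed hand center back to the center-relative joint offsets; the sketch below only illustrates this final decoding step, with placeholder inputs standing in for the two networks&#x2019; outputs.</p>

```python
import numpy as np

def decode_hand_pose(center_pred, offsets_pred):
    """Combine the two regression stages into absolute joint positions.

    center_pred  : (3,) hand-center location from the localization CNN.
    offsets_pred : (J, 3) per-joint offsets from the regression CNN,
                   expressed relative to the hand center.
    Both inputs are stand-ins for the networks' actual outputs.
    """
    return np.asarray(offsets_pred, float) + np.asarray(center_pred, float)
```
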
<p>Concerning gestures, <xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref>, <xref ref-type="bibr" rid="B15">Jang et al. (2017)</xref>, <xref ref-type="bibr" rid="B6">Caputo et al. (2021)</xref>, <xref ref-type="bibr" rid="B29">Min et al. (2019)</xref>, <xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al. (2022)</xref> and <xref ref-type="bibr" rid="B38">Shrestha et al. (2018)</xref> present alternative solutions to CNNs. Firstly, <xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref> represent gestures using Hidden Markov Models (HMMs). In their solution, Gesture Knitter, each gesture is split into two components. The first one, called the gross gesture component, is a primitive gesture that describes the movement of the hand palm. It is represented using a 12-state HMM. The second component is the fine gesture, which describes the movement of the fingers in the palm-center referential and is represented using an 8-state HMM. To infer the gesture, the authors use the Viterbi algorithm to determine the most likely sequence of hidden states. Secondly, <xref ref-type="bibr" rid="B15">Jang et al. (2017)</xref> use an innovative method that consists of describing the hand pose with a voxel grid, where each voxel is activated if enough 3D points from the hand&#x2019;s point cloud fall in that cell. The encoding of a voxel plane, and the change of that encoding over time, form a pattern that is used to discriminate the gestures in random decision trees. Finally, the article by <xref ref-type="bibr" rid="B6">Caputo et al. (2021)</xref> summarizes the results of a hand gesture recognition competition held in 2021, based on hand joint data from the LeapMotion. Several of the solutions presented are based on popular algorithms such as Transformers or Recurrent Neural Networks (RNNs). As opposed to traditional CNNs, those solutions are adapted to the spatial-temporal data needed to represent a gesture. <xref ref-type="bibr" rid="B6">Caputo et al. (2021)</xref> highlight that Long Short-Term Memory algorithms, a popular variant of RNNs, have not been explored in the competition even though they have shown good results in the literature. The winning solution of the competition is a modified version of the CNN called the spatial-temporal CNN. The hand movement is represented using a spatial-temporal graph where the joints of the hands are nodes. In the graph, the nodes are:<list list-type="simple">
<list-item>
<p>&#x2022; linked according to the structure of the hand skeletons, which encapsulates the spatial information;</p>
</list-item>
<list-item>
<p>&#x2022; linked if they represent the same nodes in consecutive images of the movement, which encapsulates the temporal information.</p>
</list-item>
</list>Each graph representing the movement over a time window is sent to a spatial-temporal CNN to localize and detect a gesture. Then, for the classification, each movement is represented with a gradient histogram, which is compared using cosine similarity to a set of histograms from known gestures. The article from <xref ref-type="bibr" rid="B29">Min et al. (2019)</xref> also uses gradient histograms but classifies with a Support Vector Machine (SVM) algorithm. <xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al. (2022)</xref> use a more basic solution to recognize the custom grab gesture recorded for their virtual objects. If the hand of the user is close to a virtual object, a frame of the hand joints is compared to the existing gesture data using Euclidean distance. Similarly, <xref ref-type="bibr" rid="B38">Shrestha et al. (2018)</xref> also use a very simple gesture classifier, calculating the number of fingertips outside a circle centered on the palm to establish how many fingers are extended.</p>
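<p>The Viterbi decoding step mentioned for <xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref>&#x2019;s Gesture Knitter can be illustrated with a textbook implementation for a discrete-observation HMM; the toy two-state model in the usage example is ours, not the 12-state and 8-state models of the original work.</p>

```python
import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state sequence for a discrete-observation HMM.

    obs     : sequence of observation indices
    start_p : (S,) initial state probabilities
    trans_p : (S, S) transition probabilities, row = previous state
    emit_p  : (S, O) emission probabilities
    """
    S, T = len(start_p), len(obs)
    log = lambda a: np.log(np.asarray(a, float) + 1e-12)  # avoid log(0)
    le, lt = log(emit_p), log(trans_p)
    dp = np.zeros((T, S))                 # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)    # backpointers to the previous state
    dp[0] = log(start_p) + le[:, obs[0]]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + lt  # rows: previous state, cols: next
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + le[:, obs[t]]
    # Backtrack from the best final state.
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

<p>In a gesture-decoding setting, the hidden states would correspond to gesture primitives and the observations to the quantized hand features; the decoded state sequence is then matched against the gesture definitions.</p>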
<p>The last article, from <xref ref-type="bibr" rid="B46">Xiao et al. (2018)</xref>, uses short-range depth and infrared data to create touch surfaces in the physical world. Their in-house algorithm DIRECT, presented in their previous work (<xref ref-type="bibr" rid="B45">Xiao et al., 2016</xref>), detects physical planes and the contacts between the fingers and those planes.</p>
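<p>Without reproducing DIRECT, the principle of depth-based touch detection against a fitted physical plane can be sketched as follows; the 3&#x2013;15&#x2009;mm band below is illustrative and not the algorithm&#x2019;s actual thresholds.</p>

```python
import numpy as np

def touch_mask(depth, plane_depth, touch_min=0.003, touch_max=0.015):
    """Flag pixels whose depth lies just above a detected physical plane.

    depth       : (H, W) depth map in meters
    plane_depth : (H, W) per-pixel depth of the fitted background plane
    A fingertip 'touch' shows up as points a few millimeters in front of
    the surface; points on the plane or far above it are rejected.
    """
    height = plane_depth - depth          # positive = in front of the plane
    return (height >= touch_min) & (height <= touch_max)
```
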
<p>Finally, in terms of computing solutions, most of the papers use the CPU and GPU of a personal computer, except for the works of <xref ref-type="bibr" rid="B48">Zhang et al. (2020)</xref>, <xref ref-type="bibr" rid="B46">Xiao et al. (2018)</xref> and <xref ref-type="bibr" rid="B29">Min et al. (2019)</xref>, where the CPU embedded on the headset (HoloLens 1) or glasses (Rokid) is used. Network solutions to distribute the computing power have not been proposed in the reviewed articles.</p>
</sec>
</sec>
</sec>
<sec id="s5">
<title>5 Summary and discussion</title>
<p>The results of this review give insight into the behavior of the interactions designed in academia since commercial MR headsets have been made available for research and development, and into their technical implementations. <xref ref-type="table" rid="T3">Tables 3</xref>, <xref ref-type="table" rid="T4">4</xref> give an overview of the articles analyzed in this paper.</p>
<table-wrap id="T3" position="float">
<label>TABLE 3</label>
<caption>
<p>Summary of the articles reviewed (1/2).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Article</th>
<th align="left">Interaction behavior</th>
<th align="left">Interaction metaphor</th>
<th align="left">Headset(s)</th>
<th align="left">External sensor(s)</th>
<th align="left">Key Algorithm(s)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B34">Plasson et al. (2020)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Realistic</td>
<td align="left">HoloLens 1</td>
<td align="left">OptiTrack</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B48">Zhang et al. (2020)</xref>
</td>
<td align="left">Isomorphic, Metaphoric</td>
<td align="left">Realistic</td>
<td align="left">Rokid Glass</td>
<td align="left">pmd flexx</td>
<td align="left">CNN</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B1">Ababsa et al. (2020)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Realistic</td>
<td align="left">HoloLens 1</td>
<td align="left">LeapMotion</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B19">Lee and Chu. (2018)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Conventional</td>
<td align="left">HoloLens 1</td>
<td align="left">iPhone</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B7">Chang et al. (2017)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Realistic</td>
<td align="left">HoloLens 1</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B22">Lin and Yamaguchi. (2021)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Conventional</td>
<td align="left">HoloLens 2</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B24">Lu et al. (2019)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Realistic</td>
<td align="left"/>
<td align="left">LeapMotion</td>
<td align="left">CNN</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B47">Yu et al. (2017)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Conventional</td>
<td align="left">HoloLens 1</td>
<td align="left">Smartphone, Smartwatch</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Realistic, Conventional</td>
<td align="left">HoloLens 2</td>
<td align="left"/>
<td align="left">multivariable HMM, Viterbi algorithm</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref>
</td>
<td align="left">Isomorphic, Gestural</td>
<td align="left">Realistic, Conventional</td>
<td align="left">HoloLens 1, Meta 2</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B20">Lee et al. (2019)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Conventional</td>
<td align="left">HoloLens 1</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B14">Jailungka and Charoenseang. (2018)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Realistic</td>
<td align="left">HoloLens 1</td>
<td align="left">LeapMotion</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B41">Sun et al. (2019)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Conventional</td>
<td align="left">HoloLens 1</td>
<td align="left">Depth camera</td>
<td align="left">CNN</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B15">Jang et al. (2017)</xref>
</td>
<td align="left">Isomorphic, Gestural</td>
<td align="left">Realistic</td>
<td align="left">Oculus Rift</td>
<td align="left">Intel Realsense SR3000</td>
<td align="left">Voxel encoding, Random forest</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B46">Xiao et al. (2018)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Conventional</td>
<td align="left">HoloLens 1</td>
<td align="left">Kinect</td>
<td align="left">DIRECT</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B31">Mueller et al. (2017)</xref>
</td>
<td align="left"/>
<td align="left"/>
<td align="left"/>
<td align="left">Intel RealSense SR300</td>
<td align="left">CNN</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B6">Caputo et al. (2021)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Realistic, Conventional</td>
<td align="left"/>
<td align="left">LeapMotion</td>
<td align="left">Transformers, CNN, GRU</td>
</tr>
</tbody>
</table>
</table-wrap>
<table-wrap id="T4" position="float">
<label>TABLE 4</label>
<caption>
<p>Summary of the articles reviewed (2/2), (&#x2a; only used HoloLens 1 for making the dataset).</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Article</th>
<th align="left">Interaction behavior</th>
<th align="left">Interaction metaphor</th>
<th align="left">Headset(s)</th>
<th align="left">External sensor(s)</th>
<th align="left">Key Algorithm(s)</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<xref ref-type="bibr" rid="B16">Kim et al. (2018)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Realistic</td>
<td align="left">Oculus Rift Development Kit 2</td>
<td align="left">Ovrvision, Creative Senz3D, Samsung Gear Live</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B3">Bautista et al. (2020)</xref>
</td>
<td align="left">Isomorphic, Gestural</td>
<td align="left">Realistic, Conventional</td>
<td align="left">Meta 1</td>
<td align="left">Myo armband</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B29">Min et al. (2019)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Realistic</td>
<td align="left">HoloLens</td>
<td align="left">LeapMotion</td>
<td align="left">SVM</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B8">Choudhary et al. (2021)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Conventional</td>
<td align="left">HoloLens 2</td>
<td align="left">Logitech 4K Brio HDR</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al. (2022)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Realistic</td>
<td align="left">Meta (Oculus) Quest 2</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref>
</td>
<td align="left">Isomorphic, Gestural</td>
<td align="left">Realistic, Conventional</td>
<td align="left">HoloLens 1, Magic Leap One</td>
<td align="left"/>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B39">Su et al. (2022)</xref>
</td>
<td align="left">Isomorphic</td>
<td align="left">Realistic</td>
<td align="left">HoloLens</td>
<td align="left">LeapMotion</td>
<td align="left"/>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B13">Hu et al. (2018)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Conventional</td>
<td align="left">(HoloLens 1)<sup>&#x2a;</sup>
</td>
<td align="left"/>
<td align="left">Frame Diff, CNN</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B40">Su et al. (2021)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Realistic</td>
<td align="left">Moverio BT-300</td>
<td align="left"/>
<td align="left">CNN</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B21">Lee et al. (2022)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Conventional</td>
<td align="left"/>
<td align="left">Webcam</td>
<td align="left">CNN</td>
</tr>
<tr>
<td align="left">
<xref ref-type="bibr" rid="B38">Shrestha et al. (2018)</xref>
</td>
<td align="left">Gestural</td>
<td align="left">Conventional</td>
<td align="left"/>
<td align="left">Webcam</td>
<td align="left">YCB segmentation, k-curvature angle threshold</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s5-1">
<title>5.1 Interaction behaviors</title>
<p>When it comes to comparing the behaviors of the interactions, isomorphic interactions are perceived as more natural and intuitive because the one-to-one spatial relation between the manipulated content and the body movement is similar to the daily interactions the user has with objects in the real world. The other advantage of isomorphic interactions is the diversity of the possible designs. Indeed, a variety of virtual proxy objects, inspired by the behavior of real-world objects, can be created to complement the interactions. However, this type of interaction is more prone to requiring close-contact interactions and still needs a 3D version of the WIMP paradigm to trigger intangible basic application control functionalities. On the opposite side, gestural interactions have the advantage of empowering the user with interactions beyond what the physical world can offer, such as mid-air distance interactions. Besides, gestural interactions do not require visual cues to function, which reduces the cognitive load of the user. Nevertheless, by design, this type of interaction relies on a limited vocabulary that translates gestures into system output, which makes the system inflexible.</p>
</sec>
<sec id="s5-2">
<title>5.2 Learning metaphors</title>
<p>For both types of interactions, the learning curve and the perception of naturalness depend on the metaphor used for the mechanism of the interaction. In general, realistic metaphors make the interaction more intuitive as it is more relatable to the user. However, <xref ref-type="bibr" rid="B37">Serrano et al. (2022)</xref> warn that insufficiently realistic metaphors can break the naturalness, similar to the uncanny valley phenomenon in the field of robotics. Conventional metaphors are often used because of the ease of recognition by algorithms and, in some specific contexts, to make the interaction more efficient for the aimed task. As shown by <xref ref-type="bibr" rid="B10">Frutos-Pascual et al. (2019)</xref>, some specific tasks, such as scaling, require more precision, which can come at the cost of naturalness. This opposition between usability and efficiency reminds us of the notion of flexibility and efficiency of use from Nielsen&#x2019;s usability heuristics (<xref ref-type="bibr" rid="B32">Nielsen, 2022</xref>), which were initially devised for web design. This notion promotes the design of advanced shortcuts, hidden from novice users, that can speed up the interaction for experts. In our situation, the interaction itself is not necessarily a shortcut but a more efficient way to manipulate an object that is harder to learn for a new adopter of mixed and augmented reality. Thus, we believe that the design of hand interactions should factor in the targeted users and the context of use to weigh usability against efficiency. The steep learning curve when the user first encounters a conventional metaphor implies that the interaction must offer significant efficiency and usefulness to justify this kind of metaphor. Factoring in the learning need is even more important for gestural interactions with conventional metaphors, because each gesture must be learned individually. 
In practice, a mixed reality application might need a set of natural interactions to facilitate the adoption of the medium and a more advanced set of interactions that are akin to shortcuts on desktop and mobile computers. In addition, long-term usage also needs to be evaluated, as muscle fatigue and health impacts are important factors in the ergonomics of a user interface. These factors have not been explored in this review.</p>
</sec>
<sec id="s5-3">
<title>5.3 Apparatus</title>
<p>Before the review, in relation to our RQ3, we expected that the availability of commercial headsets would make prototyping easier thanks to the native support of hand tracking and the progress of the sensors equipping newly released products. However, as highlighted, for an older headset like the HoloLens 1, only simple gestures like the air tap were supported, which made researchers use external sensors to be able to design more complex gestures. Furthermore, we also expected more use of RGB-D data in research using more modern HMDs that have such sensors, like the Magic Leap One, Meta 2 and HoloLens 2. For example, the work of <xref ref-type="bibr" rid="B31">Mueller et al. (2017)</xref> could be applied to support hand-object occlusion, provided that RGB-D pairs could be streamed. However, the articles that use those HMDs are limited to the hand joint data from the native hand tracking. This can be due to the fact that it is difficult to access raw sensor data on commercial devices. The HoloLens 2, for example, has the API HoloLens2ForCV <xref ref-type="bibr" rid="B42">Ungureanu et al. (2020)</xref> to give access to all sensor data, and more recently a publication provided a streaming architecture for HoloLens 2 sensor data with code samples <xref ref-type="bibr" rid="B9">Dibene and Dunn. (2022)</xref> to ease the process. However, a known issue is the difficulty of retrieving a matching pair of frames from the RGB camera and the real-time depth camera used for hand tracking, which is a requirement for RGB-D computer vision algorithms. This observation highlights the need for more research-and-development-friendly MR headsets that offer simple access to, and streaming of, the common data used in most state-of-the-art algorithms.</p>
</sec>
<sec id="s5-4">
<title>5.4 Trends</title>
<p>During our screening, we observed that a significant number of papers were proposing solutions to add hand tracking to low-cost AR devices. This trend shows that there is an interest in making AR more affordable, as the MR HMDs used in the reviewed articles cost more than a thousand dollars. A practical use case for low-computing-cost, RGB-only hand tracking and hand gesture recognition is the support of hand interactions for Google Cardboard, for example.</p>
<p>In the industry, commercial devices are trending toward supporting isomorphic interactions. Many examples can be found not only in mixed and augmented reality, such as the Magic Leap One, Meta 2 or HoloLens 2, but also in virtual reality, such as the Meta Quest (formerly Oculus Quest). We believe that the interest in isomorphic interactions comes from the need to appeal to new adopters. Besides, isomorphic interactions are also a good technical showcase of the prowess of hand-tracking technology. The interest in gestural interactions has however not dwindled, as about half of the reviewed articles propose this type of hand interaction. In our opinion, both types of interactions will be used jointly in the future, as hybrid interactions or complementary interactions for different usages. The progress of hand tracking will also benefit the development of gestural interactions, as shown in the SHREC competition (<xref ref-type="bibr" rid="B6">Caputo et al., 2021</xref>), where the algorithms solely used data from hand joints.</p>
<p>When it comes to the technical implementations, researchers are using diverse data to track hand skeletons and/or detect hand gestures. Even if there are various algorithms, including those found in prior research as noted in <xref ref-type="bibr" rid="B43">Vuletic et al. (2019)</xref>&#x2019;s review, such as HMM and SVM, in coherence with the popularity of neural networks a lot of articles use solutions based on CNNs and RNNs. The issue with those solutions is their lack of flexibility, because of the significant amount of data required to train the network and the necessity to retrain the network, or at least part of it through transfer learning, before adding new gestures. As such, the work of <xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref> and <xref ref-type="bibr" rid="B36">Sch&#xe4;fer et al. (2022)</xref> on gesture authoring tools that are modular and that break down hand gestures into small primitives is a promising perspective for prototyping interactions as well as making adaptive user interactions. With the same idea of being able to recognize gestures that were not originally in the training dataset, a recent machine learning field called Zero-Shot Learning also addresses this concern. The idea of this technique is to train a model to distinguish a set of seen classes using semantic descriptions. After training, the model is able to create a new classifier for an unseen class if given a semantic description of it. The work of <xref ref-type="bibr" rid="B26">Madapana and Wachs. (2019)</xref> or <xref ref-type="bibr" rid="B44">Wu et al. (2021)</xref>, for example, applies this technique to hand gesture recognition. The accuracy on unseen classes still needs improvement, but the technique might become a promising tool for prototyping gestures in mixed reality without retraining models. On top of the training-time issue, CNNs and RNNs also require significant computing power for real-time inference. 
As such, most of the prototypes use client-server-based solutions where the computing is done remotely from the headset. This approach tackles issues related to hardware limitations: computation latency, network bandwidth and battery power.</p>
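<p>The Zero-Shot Learning idea described above can be sketched with a nearest-semantic-description classifier; the attribute vectors and class names below are hypothetical, and real systems such as <xref ref-type="bibr" rid="B26">Madapana and Wachs. (2019)</xref> learn a mapping from gesture data into the semantic space rather than receiving it directly.</p>

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(gesture_embedding, class_descriptions):
    """Assign a gesture to the class with the closest semantic description.

    class_descriptions maps a class name (seen or unseen) to an attribute
    vector, e.g. [is_dynamic, n_fingers_extended, moves_horizontally].
    Adding a new gesture class only requires a description, not retraining.
    """
    return max(class_descriptions,
               key=lambda n: cosine(gesture_embedding, class_descriptions[n]))
```
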
</sec>
<sec id="s5-5">
<title>5.5 Challenges</title>
<p>As mentioned above, gesture recognition and hand tracking have demanding computing power needs. As such, when designing hand interactions, one must decide which part of the computation is embedded in the mixed and augmented reality headset and which part is distributed. The different solutions for distributed computing are also a big consideration, from the computing architecture, such as edge computing or cloud computing, to the communication protocols between the headsets and the remote computing machines. The distribution of the computing power also implies that the interaction recognition algorithms might need mechanisms to support spreading the calculations over several devices and graphics cards. Currently, hand tracking specifically is embedded on headsets; however, those lightweight algorithms are subject to stability issues, especially when it comes to occlusion. It can be noted that hand-object occlusion has been little explored in the reviewed articles, as only <xref ref-type="bibr" rid="B31">Mueller et al. (2017)</xref> propose an algorithm that tackles hand-object occlusion, while the Myo armband, used by <xref ref-type="bibr" rid="B3">Bautista et al. (2020)</xref>, works in those constrained situations due to the nature of its sensors. Both solutions require an external sensor or computing machine that communicates with the headset. In terms of external sensors, another alternative to vision-based tracking solutions is the data glove, which works in hand-object occlusion situations. The disadvantages often raised for this type of device are the price and the discomfort, especially in contexts where users are also using their hands to manipulate physical objects on top of the virtual content. <xref ref-type="bibr" rid="B11">Glauser et al. (2019)</xref> propose a hand tracking glove called the Stretch Sensing Soft Glove that can be fabricated cheaply, which can be a solution to prototype gesture recognition with an object in hand. 
Another challenge that the Myo armband tackles is the tracking volume. Indeed, as mixed and augmented reality headsets mainly use front cameras to spatialize the user and track their hands, the volume of interaction is very limited. The immersion of the user is broken every time their hands leave the hand-tracking coverage. One way to circumvent this issue with the current limited hardware is to give the user awareness of the tracking volume through visual cues. Solutions similar to the Myo armband might become the next replacement for physical controllers to complement hand tracking. Meta (formerly Facebook), for example, is working on an EMG armband for AR/VR input<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref>. It should be noted that the Myo armband was used in the reviewed article only for gesture detection; the hand tracking aspect was not explored. The work of <xref ref-type="bibr" rid="B23">Liu et al. (2021)</xref> is an example of the future use of EMG armbands towards complete tracking of the user&#x2019;s hand skeleton.</p>
</sec>
<sec id="s5-6">
<title>5.6 Literature gap</title>
<p>During our screening, we observed that only a small portion of the articles compare different interactions (12% of the articles). In our opinion, this is related to the lack of a standard method to evaluate an interaction in the literature. Indeed, in the reviewed articles, the designed interactions were often evaluated using qualitative metrics (mainly usability and learnability questions using Likert-scale questionnaires) and quantitative metrics (such as completion time, precision and errors). However, each article uses its own metrics and questionnaires, as popular standard usability tests like the System Usability Scale are not appropriate for the specific case of hand interactions. In <xref ref-type="bibr" rid="B18">Koutsabasis and Vogiatzidakis. (2019)</xref>&#x2019;s review, a similar variety of metrics across the reviewed papers can be found, which supports the observation that a standard evaluation method is lacking. As such, for a comparison to be made, the authors need to implement multiple different interactions as a benchmark, which is time consuming. An interesting research topic could be the establishment of a formal evaluation method for hand interactions on the basis of the gesture evaluation metrics for both usability and performance grouped in <xref ref-type="bibr" rid="B18">Koutsabasis and Vogiatzidakis. (2019)</xref>&#x2019;s review. We believe that the formal evaluation should also contextualize the interaction. The task targeted by the interaction should be factored into the evaluation, using for example the model of <xref ref-type="bibr" rid="B12">Hand. (1997)</xref> or a more granular classification such as the one <xref ref-type="bibr" rid="B18">Koutsabasis and Vogiatzidakis. (2019)</xref> propose. External factors such as the hardware limitations (volume of tracking, occlusion handling, &#x2026; ) should also be considered in the rating of the interactions. 
Indeed, when <xref ref-type="bibr" rid="B3">Bautista et al. (2020)</xref> compare the Meta 2 headset interactions with the Myo armband, we mentioned that the limited range of hand tracking on the headset might have had an impact on the usability questionnaire results. Thus, the results of the evaluations are strongly dependent on the time when the research was conducted. A standard specification of hardware limitations and contextual use cases would allow an easier comparison across papers and highlight the necessity of further testing once the hardware has improved.</p>
<p>As mentioned in the section Challenges, hand-object occlusion is still a technical problem for mixed reality interactions. As such, there are currently few research publications considering an object in hand while performing hand interactions. Furthermore, beyond simply factoring hand-object occlusion into the recognition process, a step further could be to move towards multimodal interactions by mixing hand interactions with tangible interactions. An example would be the work of <xref ref-type="bibr" rid="B49">Zhou et al. (2020)</xref>, which uses the shape of the hand grip to create a virtual interactable surface. A breakthrough in hand tracking or the usage of affordable data gloves would allow more research of this type to flourish, as it would require a less cumbersome apparatus.</p>
<p>Finally, in this review, bimanual interactions have been little explored, except for <xref ref-type="bibr" rid="B30">Mo et al. (2021)</xref>&#x2019;s work, which supports two-handed gestures in their authoring tool. Since in the real world we are used to manipulating objects with both hands, bimanual interaction can be a promising lead for designing more natural interactions.</p>
</sec>
</sec>
</body>
<back>
<sec id="s6">
<title>Author contributions</title>
<p>RN is a PhD student and the principal author of the article. He worked on the whole process of the scoping review, proposed the classification model and wrote the article. CG-V is the supervisor of RN. He guided RN in the review methodology, suggested relevant information to extract from the review, discussed the classification model and reviewed the article. MA is an industrial collaborator from VMware, working on the project. She reviewed the article. All authors contributed to the article and approved the submitted version.</p>
</sec>
<sec id="s7">
<title>Funding</title>
<p>This project is funded by the Canadian funding organization MITACS (project IT27213) and the company VMware Canada.</p>
</sec>
<ack>
<p>The authors would like to acknowledge Guillaume Spalla and Fran&#xe7;ois Racicot-Lanoue who provided helpful insights and documents to write this review article.</p>
</ack>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of interest</title>
<p>MA was employed by VMware Canada.</p>
<p>The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
<p>The authors declare that this study received funding from the company VMware Canada. The funders had the following involvement in the study: the reviewing of the manuscript and the decision to publish.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>Augmented reality headset from the company Meta, which is no longer operating (not to be confused with the current Meta, formerly Facebook). <ext-link ext-link-type="uri" xlink:href="https://tech.facebook.com/reality-labs/2021/3/inside-facebook-reality-labs-wrist-based-interaction-for-the-next-computing-platform/">https://tech.facebook.com/reality-labs/2021/3/inside-facebook-reality-labs-wrist-based-interaction-for-the-next-computing-platform/</ext-link>
</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ababsa</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chardonnet</surname>
<given-names>J. R.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Combining hololens and leap-motion for free hand-based 3d interaction in mr environments</article-title>,&#x201d; in <source>Augmented reality, virtual reality, and computer graphics</source>, <conf-loc>Lecce, Italy</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>315</fpage>&#x2013;<lpage>327</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-58465-8_24</pub-id>
</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bautista</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Maradei</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Pedraza</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Augmented reality user interaction to computer assisted orthopedic surgery system</article-title>,&#x201d; in <source>ACM international conference proceeding series</source>, <conf-loc>Merida, Mexico</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>). <pub-id pub-id-type="doi">10.1145/3293578.3293590</pub-id>
</citation>
</ref>
<ref id="B3">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bautista</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Maradei</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Pedraza</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Usability test with medical personnel of a hand-gesture control techniques for surgical environment</article-title>,&#x201d; in <source>International journal on interactive design and manufacturing (IJIDeM)</source> (<publisher-loc>Paris, France</publisher-loc>: <publisher-name>Springer-Verlag France</publisher-name>), <fpage>1031</fpage>&#x2013;<lpage>1040</lpage>. <pub-id pub-id-type="doi">10.1007/s12008-020-00690-9</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Billinghurst</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Clark</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A survey of augmented reality</article-title>. <source>Found. Trends Hum.-Comput. Interact.</source> <volume>8</volume>, <fpage>73</fpage>&#x2013;<lpage>272</lpage>. <pub-id pub-id-type="doi">10.1561/1100000049</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bouchard</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Bouzouane</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bouchard</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Gesture recognition in smart home using passive RFID technology</article-title>,&#x201d; in <source>Proceedings of the 7th international conference on PErvasive technologies related to assistive environments</source>, <conf-loc>Rhodes, Greece</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1145/2674396.2674405</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Caputo</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Giachetti</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Soso</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Pintani</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>D&#x2019;Eusanio</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pini</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Shrec 2021: Skeleton-based hand gesture recognition in the wild</article-title>. <source>Comput. Graph.</source> <volume>99</volume>, <fpage>201</fpage>&#x2013;<lpage>211</lpage>. <pub-id pub-id-type="doi">10.1016/j.cag.2021.07.007</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Nuernberger</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Luan</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>H&#xf6;llerer</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Evaluating gesture-based augmented reality annotation</article-title>,&#x201d; in <conf-name>2017 IEEE Symposium on 3D User Interfaces</conf-name>, <conf-loc>Los Angeles, CA, USA</conf-loc>, <conf-date>18-19 March 2017</conf-date> (<publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/3DUI.2017.7893337</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Choudhary</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Ugarte</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bruder</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Welch</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Real-time magnification in augmented reality</article-title>,&#x201d; in <source>Symposium on spatial user interaction</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>2</lpage>. <pub-id pub-id-type="doi">10.1145/3485279.3488286</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dibene</surname>
<given-names>J. C.</given-names>
</name>
<name>
<surname>Dunn</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>HoloLens 2 sensor streaming</article-title>. <comment>arXiv</comment>. <pub-id pub-id-type="doi">10.48550/arXiv.2211.02648</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Frutos-Pascual</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Creed</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Williams</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Head mounted display interaction evaluation: Manipulating virtual objects in augmented reality</article-title>,&#x201d; in <source>IFIP conference on human-computer interaction</source>, <conf-loc>Paphos, Cyprus</conf-loc> (<publisher-loc>Heidelberg</publisher-loc>: <publisher-name>Springer-Verlag Berlin</publisher-name>). <pub-id pub-id-type="doi">10.1007/978-3-030-29390-1_16</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Glauser</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Panozzo</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hilliges</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Sorkine-Hornung</surname>
<given-names>O.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Interactive hand pose estimation using a stretch-sensing soft glove</article-title>. <source>ACM Trans. Graph.</source> <volume>38</volume>, <fpage>41</fpage>&#x2013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1145/3306346.3322957</pub-id>
</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hand</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>1997</year>). <article-title>A survey of 3D interaction techniques</article-title>. <source>Comput. Graph. Forum</source> <volume>16</volume>, <fpage>269</fpage>&#x2013;<lpage>281</lpage>. <pub-id pub-id-type="doi">10.1111/1467-8659.00194</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Kurfess</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>3d separable convolutional neural network for dynamic hand gesture recognition</article-title>. <source>Neurocomputing</source> <volume>318</volume>, <fpage>151</fpage>&#x2013;<lpage>161</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2018.08.042</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Jailungka</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Charoenseang</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Intuitive 3d model prototyping with leap motion and microsoft hololens</article-title>,&#x201d; in <source>Human-computer interaction. Interaction technologies</source>, <conf-loc>Las Vegas, NV</conf-loc> (<publisher-loc>Heidelberg</publisher-loc>: <publisher-name>Springer-Verlag Berlin</publisher-name>), <fpage>269</fpage>&#x2013;<lpage>284</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-91250-9_21</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Jeon</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>T. K.</given-names>
</name>
<name>
<surname>Woo</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Metaphoric hand gestures for orientation-aware VR object manipulation with an egocentric viewpoint</article-title>. <source>IEEE Trans. Hum.-Mach. Syst.</source> <volume>47</volume>, <fpage>113</fpage>&#x2013;<lpage>127</lpage>. <pub-id pub-id-type="doi">10.1109/THMS.2016.2611824</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>H. I.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yeo</surname>
<given-names>H. S.</given-names>
</name>
<name>
<surname>Quigley</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Woo</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>SWAG demo: Smart watch assisted gesture interaction for mixed reality head-mounted displays</article-title>,&#x201d; in <source>Adjunct proceedings - 2018 IEEE international symposium on mixed and augmented reality</source>, <conf-loc>Munich, Germany</conf-loc> (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>Institute of Electrical and Electronics Engineers Inc.</publisher-name>), <fpage>428</fpage>&#x2013;<lpage>429</lpage>. <pub-id pub-id-type="doi">10.1109/ISMAR-Adjunct.2018.00130</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kim</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>S. H.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>K. B.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>J. Y.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>User interactions for augmented reality smart glasses: A comparative evaluation of visual contexts and interaction gestures</article-title>. <source>Appl. Sci.</source> <volume>9</volume>, <fpage>3171</fpage>. <pub-id pub-id-type="doi">10.3390/app9153171</pub-id>
</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Koutsabasis</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Vogiatzidakis</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Empirical research in mid-air interaction: A systematic review</article-title>. <source>Int. J. Hum.-Comput. Interact.</source> <volume>35</volume>, <fpage>1747</fpage>&#x2013;<lpage>1768</lpage>. <pub-id pub-id-type="doi">10.1080/10447318.2019.1572352</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>C. J.</given-names>
</name>
<name>
<surname>Chu</surname>
<given-names>H. K.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Dual-MR: Interaction with mixed reality using smartphones</article-title>,&#x201d; in <source>Proceedings of the ACM symposium on virtual reality software and technology</source>, <conf-loc>Tokyo, Japan</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>). <pub-id pub-id-type="doi">10.1145/3281505.3281618</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Yung Lam</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yau</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Braud</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Hui</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Hibey: Hide the keyboard in augmented reality</article-title>,&#x201d; in <conf-name>2019 IEEE International Conference on Pervasive Computing and Communications</conf-name>, <conf-loc>Kyoto, Japan</conf-loc>, <conf-date>11-15 March 2019</conf-date> (<publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/PERCOM.2019.8767420</pub-id>
</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lee</surname>
<given-names>T. H.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>H. J.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Virtual keyboards with real-time and robust deep learning-based gesture recognition</article-title>. <source>IEEE Trans. Hum.-Mach. Syst.</source> <volume>52</volume>, <fpage>725</fpage>&#x2013;<lpage>735</lpage>. <pub-id pub-id-type="doi">10.1109/THMS.2022.3165165</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lin</surname>
<given-names>Y. C.</given-names>
</name>
<name>
<surname>Yamaguchi</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Evaluation of operability by different gesture input patterns for crack inspection work support system</article-title>,&#x201d; in <source>2021 60th annual conference of the society of instrument and control engineers of Japan</source>, <conf-loc>Tokyo, Japan</conf-loc> (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>Institute of Electrical and Electronics Engineers Inc.</publisher-name>), <fpage>1405</fpage>&#x2013;<lpage>1410</lpage>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>WR-hand: Wearable armband can track user&#x2019;s hand</article-title>. <source>Proc. ACM Interact. Mob. Wearable Ubiquitous Technol.</source> <volume>5</volume>, <fpage>1</fpage>&#x2013;<lpage>27</lpage>. <pub-id pub-id-type="doi">10.1145/3478112</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lu</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Rai</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>FMHash: Deep hashing of in-air-handwriting for user identification</article-title>,&#x201d; in <conf-name>IEEE International Conference on Communications</conf-name>, <conf-loc>Shanghai, China</conf-loc>, <conf-date>20-24 May 2019</conf-date> (<publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/ICC.2019.8761508</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Macaranas</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Antle</surname>
<given-names>A. N.</given-names>
</name>
<name>
<surname>Riecke</surname>
<given-names>B. E.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>What is intuitive interaction? Balancing users&#x2019; performance and satisfaction with natural user interfaces</article-title>. <source>Interact. Comput.</source> <volume>27</volume>, <fpage>357</fpage>&#x2013;<lpage>370</lpage>. <pub-id pub-id-type="doi">10.1093/iwc/iwv003</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Madapana</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wachs</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Database of gesture attributes: Zero shot learning for gesture recognition</article-title>,&#x201d; in <conf-name>2019 14th IEEE International Conference on Automatic Face &#x26; Gesture Recognition (FG 2019)</conf-name>, <conf-loc>Lille, France</conf-loc>, <conf-date>14-18 May 2019</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/FG.2019.8756548</pub-id>
</citation>
</ref>
<ref id="B27">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>McMahan</surname>
<given-names>R. P.</given-names>
</name>
<name>
<surname>Lai</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Pal</surname>
<given-names>S. K.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Interaction fidelity: The uncanny valley of virtual reality interactions</article-title>,&#x201d; in <source>Virtual, augmented and mixed reality</source>, <conf-loc>Toronto, Canada</conf-loc>. Editors <person-group person-group-type="editor">
<name>
<surname>Lackey</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shumaker</surname>
<given-names>R.</given-names>
</name>
</person-group> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>), <fpage>59</fpage>&#x2013;<lpage>70</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-39907-2_6</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Milgram</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Takemura</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Utsumi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kishino</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>1995</year>). &#x201c;<article-title>Augmented reality: A class of displays on the reality-virtuality continuum</article-title>,&#x201d; in <source>Proceedings volume 2351, telemanipulator and telepresence technologies</source> (<publisher-loc>Boston, MA, United States</publisher-loc>: <publisher-name>SPIE</publisher-name>), <fpage>282</fpage>&#x2013;<lpage>292</lpage>. <pub-id pub-id-type="doi">10.1117/12.197321</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Min</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhuang</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>VPModel: High-fidelity product simulation in a virtual-physical environment</article-title>. <source>IEEE Trans. Vis. Comput. Graph.</source> <volume>25</volume>, <fpage>3083</fpage>&#x2013;<lpage>3093</lpage>. <pub-id pub-id-type="doi">10.1109/TVCG.2019.2932276</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mo</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dudley</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kristensson</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Gesture knitter: A hand gesture design tool for head-mounted mixed reality applications</article-title>,&#x201d; in <source>Conference on human factors in computing systems - proceedings</source>, <conf-loc>Yokohama, Japan</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>). <pub-id pub-id-type="doi">10.1145/3411764.3445766</pub-id>
</citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mueller</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Mehta</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Sotnychenko</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Sridhar</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Casas</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Theobalt</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Real-time hand tracking under occlusion from an egocentric RGB-d sensor</article-title>,&#x201d; in <source>Proceedings of the IEEE international conference on computer vision</source>, <conf-loc>Venice, Italy</conf-loc> (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>Institute of Electrical and Electronics Engineers Inc.</publisher-name>), <fpage>1163</fpage>&#x2013;<lpage>1172</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.131</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<collab>Nielsen</collab> (<year>2022</year>). <article-title>10 usability heuristics for user interface design</article-title>. <comment>Nielsen Norman Group</comment>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Page</surname>
<given-names>M. J.</given-names>
</name>
<name>
<surname>McKenzie</surname>
<given-names>J. E.</given-names>
</name>
<name>
<surname>Bossuyt</surname>
<given-names>P. M.</given-names>
</name>
<name>
<surname>Boutron</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hoffmann</surname>
<given-names>T. C.</given-names>
</name>
<name>
<surname>Mulrow</surname>
<given-names>C. D.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>The PRISMA 2020 statement: An updated guideline for reporting systematic reviews</article-title>. <source>Br. Med. J. Publ. Group Sect. Res. Methods &#x26; Report.</source> <volume>372</volume>, <fpage>n71</fpage>. <pub-id pub-id-type="doi">10.1136/bmj.n71</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Plasson</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Cunin</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Laurillau</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Nigay</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>3d tabletop AR: A comparison of mid-air, touch and touch&#x2b;mid-air interaction</article-title>,&#x201d; in <source>PervasiveHealth: Pervasive computing technologies for healthcare</source>, <conf-loc>Salerno, Italy</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>). <pub-id pub-id-type="doi">10.1145/3399715.3399836</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="book">
<collab>Reuters</collab> (<year>1996</year>). <source>What&#x2019;s a-o.k</source>. <publisher-loc>USA</publisher-loc>: <publisher-name>lewd and worthless beyond</publisher-name>.</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sch&#xe4;fer</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Reis</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Stricker</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>The gesture authoring space: Authoring customised hand gestures for grasping virtual objects in immersive virtual environments</article-title>. <source>Mensch Comput.</source> <volume>2022</volume>, <fpage>85</fpage>&#x2013;<lpage>95</lpage>. <pub-id pub-id-type="doi">10.1145/3543758.3543766</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Serrano</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Morillo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Casas</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Cruz-Neira</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>An empirical evaluation of two natural hand interaction systems in augmented reality</article-title>. <source>Multimedia Tools Appl.</source> <volume>81</volume>, <fpage>31657</fpage>&#x2013;<lpage>31683</lpage>. <pub-id pub-id-type="doi">10.1007/s11042-022-12864-6</pub-id>
</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shrestha</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Chun</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Computer-vision based bare-hand augmented reality interface for controlling an AR object</article-title>. <source>Int. J. Comput. Aided Eng. Technol.</source> <volume>10</volume>, <fpage>1</fpage>. <pub-id pub-id-type="doi">10.1504/IJCAET.2018.10006394</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Su</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Du</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Teng</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). &#x201c;<article-title>A natural bare-hand interaction method with augmented reality for constraint-based virtual assembly</article-title>,&#x201d; in <conf-name>IEEE Transactions on Instrumentation and Measurement</conf-name> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1109/TIM.2022.3196121</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Su</surname>
<given-names>M. C.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>J. H.</given-names>
</name>
<name>
<surname>Trisandini Azzizi</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Chang</surname>
<given-names>H. L.</given-names>
</name>
<name>
<surname>Wei</surname>
<given-names>H. H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Smart training: Mask R-CNN oriented approach</article-title>. <source>Expert Syst. Appl.</source> <volume>185</volume>, <fpage>115595</fpage>. <pub-id pub-id-type="doi">10.1016/j.eswa.2021.115595</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Armengol-Urpi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Reddy Kantareddy</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Siegel</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Sarma</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>MagicHand: Interact with IoT devices in augmented reality environment</article-title>,&#x201d; in <conf-name>26th IEEE Conference on Virtual Reality and 3D User Interfaces, VR</conf-name>, <conf-loc>Osaka, Japan</conf-loc>, <conf-date>23-27 March 2019</conf-date> (<publisher-name>IEEE</publisher-name>). <pub-id pub-id-type="doi">10.1109/VR.2019.8798053</pub-id>
</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ungureanu</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Bogo</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Galliani</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sama</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Meekhof</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>HoloLens 2 research mode as a tool for computer vision research</article-title>. <comment>arXiv</comment>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vuletic</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Duffy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hay</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>McTeague</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Campbell</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Grealy</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Systematic literature review of hand gestures used in human computer interaction interfaces</article-title>. <source>Int. J. Human-Computer Stud.</source> <volume>129</volume>, <fpage>74</fpage>&#x2013;<lpage>94</lpage>. <pub-id pub-id-type="doi">10.1016/j.ijhcs.2019.03.011</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>A prototype-based generalized zero-shot learning framework for hand gesture recognition</article-title>,&#x201d; in <source>2020 25th international conference on pattern recognition</source>, <conf-loc>Milan, Italy</conf-loc> (<publisher-loc>Piscataway, NJ</publisher-loc>: <publisher-name>Institute of Electrical and Electronics Engineers Inc.</publisher-name>), <fpage>3435</fpage>&#x2013;<lpage>3442</lpage>. <pub-id pub-id-type="doi">10.1109/ICPR48806.2021</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Xiao</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Hudson</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Harrison</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Direct: Making touch tracking on ordinary surfaces practical with hybrid depth-infrared sensing</article-title>,&#x201d; in <source>Proceedings of the 2016 ACM international conference on interactive surfaces and spaces</source>, <conf-loc>Niagara Falls, Ontario</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>), <fpage>85</fpage>&#x2013;<lpage>94</lpage>. <pub-id pub-id-type="doi">10.1145/2992154.2992173</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Xiao</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Schwarz</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Throm</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wilson</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Benko</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>MRTouch: Adding touch input to head-mounted mixed reality</article-title>. <source>IEEE Trans. Vis. Comput. Graph.</source> <volume>24</volume>, <fpage>1653</fpage>&#x2013;<lpage>1660</lpage>. <pub-id pub-id-type="doi">10.1109/TVCG.2018.2794222</pub-id>
</citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jeon</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Park</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>H. I.</given-names>
</name>
<name>
<surname>Woo</surname>
<given-names>W.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Geometry-aware interactive AR authoring using a smartphone in a wearable AR environment</article-title>. <source>Lect. Notes Comput. Sci.</source> <volume>2017</volume>, <fpage>416</fpage>&#x2013;<lpage>424</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-58697-7_31</pub-id>
</citation>
</ref>
<ref id="B48">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Q.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>ARSketch: Sketch-based user interface for augmented reality glasses</article-title>,&#x201d; in <source>MM 2020 - proceedings of the 28th ACM international conference on multimedia</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>), <fpage>825</fpage>&#x2013;<lpage>833</lpage>. <pub-id pub-id-type="doi">10.1145/3394171.3413633</pub-id>
</citation>
</ref>
<ref id="B49">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Sykes</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Fels</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kin</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Gripmarks: Using hand grips to transform in-hand objects into mixed reality input</article-title>,&#x201d; in <source>Conference on human factors in computing systems - proceedings</source>, <conf-loc>Honolulu, Hawaii</conf-loc> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>). <pub-id pub-id-type="doi">10.1145/3313831.3376313</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>