- 1CNRS, UMR 7253 Heudiasyc, Université de Technologie Compiègne, Compiègne, France
- 2CNRS, UMR 7222 ISIR, Sorbonne Université, Paris, France
Social touch is one of the many modalities of social communication. It can take many forms (from handshakes to hugs) and can convey a wide variety of emotions and intentions. Social touch is also multimodal: it comprises not only haptic feedback (both tactile and kinesthetic), but also visual feedback (gestures) and even audio feedback (the sound of the hand moving on the body). Virtual agents (VA) can perceive and interact with users by making use of multimodality to express attitudes and complex emotions. Despite a growing interest in haptic interactions within immersive virtual environments (IVE), few studies have investigated how to integrate touch into VAs’ social abilities. While prior work has examined haptic feedback, auditory substitution, or agent-based touch in isolation, no study has systematically disentangled the respective and combined contributions of visual, auditory, and tactile cues to the perception of being socially touched by a virtual agent. To address this gap, we conducted three experiments that progressively isolate and combine modalities, revealing how each shapes touch recognition, believability, and agency attribution. Taken together, our results show that multisensory feedback improves the experience of social touch, in line with previous research in IVEs. They also show that visual feedback contributes the most to guiding the recognition of the stimulus, while audio and tactile feedback further help disambiguate the touch gesture in particular cases. The results also show that both the level of animation and the interpersonal distance are essential to the extent to which the VA is perceived as a social agent when initiating touch. While visual and tactile feedback are the main contributors to participants feeling touched by the VA, audio feedback also has a significant impact.
1 Introduction
Social touch is a central modality in human communication. It plays a key role in regulating emotions, signaling intentions, and shaping interpersonal bonds from early development to adulthood (Montagu, 1984; Hertenstein et al., 2009; Hauser et al., 2019). Touch can fulfill multiple functions: ritualistic (e.g., a handshake), functional (e.g., capturing attention), or affective (e.g., comforting through a caress) (Jones and Yarbrough, 1985; Héron et al., 2024). The ability of touch to convey meaning and emotion is thus integral to how people establish and maintain social relationships.
With the growing sophistication of embodied virtual agents (VAs), integrating social touch into human–agent interactions has become an important research frontier. Equipping VAs with the ability to perform touch gestures could enhance their social and emotional credibility, increasing trust, engagement, and perceived empathy. However, simulating social touch in immersive virtual environments (IVEs) remains a challenge: while haptic technologies such as vibrotactile or force-feedback devices can provide sensations on the skin, these rarely capture the richness of human touch in terms of timing, texture, or intent. Moreover, touch is inherently multimodal, combining physical sensations with visual and auditory cues. Leveraging this multimodality through carefully designed combinations of visual animation, tactile stimulation, and sound may offer a way to create more convincing and believable social touch interactions.
In this work, we investigate how different sensory modalities and their combinations contribute to the perception of social touch initiated by a VA in VR. Prior research has explored sensory substitution for touch (de Lagarde et al., 2023), social touch in human–robot interaction (Teyssier et al., 2020), or human–agent touch in non-immersive settings (Huisman et al., 2016), but the systematic evaluation of visual, audio, and tactile cues in immersive VR remains largely unexplored. To our knowledge, this study provides the first controlled comparison of these modalities (alone, matched, and mismatched) within an immersive setup. Across three experiments, we assess not only whether participants feel touched, but also how they recognize the gesture and attribute agency to the VA. This exploratory approach clarifies the conditions under which multimodal feedback enhances the subjective experience of being touched in VR, and offers a foundation for designing richer and more credible virtual social touch interactions.
2 Related work
2.1 Social and affective touch
Social touch can be understood as signals used to communicate between a sender and a receiver. The sender encodes an emotion or an intention within a touch (“I am going to show that I care about you”), and the receiver performs a sensory judgment based on the information they perceived through their senses (“The sender wanted to show their care for me”). Following this notion, McIntyre et al. (2022) recorded different touch gestures by varying several parameters: type of motion, points of contact, velocity, duration, and the associated emotion as perceived by the receiver. This allowed them to build a taxonomy of touch gestures and their perceived meanings, and confirmed that human participants can recognize both these touch gestures and the associated socio-emotional intentions, including attention grabbing, love, and happiness. Recent work from the neuro-physiological field on CT-afferents (a specific class of unmyelinated cutaneous mechanoreceptors (Schirmer et al., 2023a)) has demonstrated that certain stroking speeds can maximize the pleasantness of affective touch (Schirmer et al., 2023b) and can induce psycho-physiological regulation responses (Püschel et al., 2022). However, the acceptability, pleasantness, and meaning of human-human social touch can be influenced by individual preferences, the relationship between the toucher and the touchee (Suvilehto et al., 2015), and socio-cultural context (Van Erp and Toet, 2015). Thus, as with all the other social abilities of humans, social touch on its own is an ambiguous communication modality. This can lead to misunderstandings or even feelings of invasion, as touch is also an intimate modality, which involves people’s bodies directly.
2.2 Virtual social touch and virtual agents
Simulating touch via technology (also called virtual touch in the context of IVEs) for social interactions usually relies on some form of haptic feedback, typically generated through haptic devices (Huisman et al., 2013b; Huisman et al., 2016; Ju et al., 2021). Alternatively, several studies have examined zoomorphic (Geva et al., 2020) and/or social (Song and Yamada, 2017) robots, which are tangible and physically embodied. Touching social robots can lead to strong physiological responses (Geva et al., 2020), and while some evidence suggests that people prefer touching a robot rather than being touched by it (Hirano et al., 2018), other studies show that robot-initiated touch can approximate the pleasantness of human touch when performed at the right velocity (Willemse et al., 2016).
The design space for simulating social touch is immense: parameters such as spatiotemporal profile, force, texture, or temperature can all be manipulated, yet no single technology currently reproduces the full richness of human touch (Teyssier et al., 2020). Vibrotactile feedback, despite its “unnatural” feel, has become the most common technology because it is affordable, customizable, and capable of producing tactile illusions such as caresses (Huisman et al., 2013b; Huisman et al., 2016; Ju et al., 2021). Studies using vibrotactile stroking have shown that pleasantness peaks at speeds between 1 and 10 cm/s, in line with CT-afferent activation in human-human touch (Huisman et al., 2016).
Huisman et al. (2013a) introduced one of the first systems enabling a VA to touch users via a vibrotactile sleeve composed of eccentric rotating mass (ERM) actuators. Subsequent work has refined this approach: Ju et al. (2021) designed vibrotactile patterns with the TECHTILE toolkit to convey emotions; Boucaud et al. (2021) and Boucaud et al. (2023) integrated vibrotactile cues with a VA in immersive VR; and other studies have examined human responses to VA touch in controlled settings (Hoppe et al., 2020). A common thread is that vibrotactile touch can replicate some benefits of human touch, but its relative importance compared to visual or auditory cues remains unclear.
Two recent surveys provide comprehensive overviews of this emerging field. Jacucci et al. (2024) systematically reviewed the past decade of research on social haptics in VR and highlighted three main research foci: how mediated touch can change behavior, how it expresses emotions, and how it serves as a notification channel. They also organized prior work by technology type and study context, pointing to the need for in-the-wild validation and for tools that enable richer design exploration. Olugbade et al. (2023) examined affective touch across HCI, human–robot, and human–virtual human interactions, discussing not only applications but also datasets, recognition methods, and ethical concerns. Together, these reviews provide valuable syntheses of technologies and approaches, and we refer readers to them for broader coverage of the field.
Recent experimental studies illustrate this trajectory. Kim and Choi (2024) employed multimodal haptic feedback (including force, tactile, and thermal cues) via a glove for user-initiated VA touch, showing that passive haptics (touching a silicone limb) often outperformed active devices in terms of embodiment and copresence. Sun et al. (2024) highlighted the importance of multisensory richness by showing that experimenter-delivered multimodal touch produced stronger effects than unimodal cues.
As social touch is inherently multimodal, a natural research question emerges: does supplementing vibrotactile feedback with additional cues (e.g., visual and auditory) enhance the realism and effectiveness of simulated touch in IVEs?
2.3 Supplementing touch with other sensory cues
A signal can substitute another signal (e.g., a gesture can indicate a referenced object without having to mention it verbally) or supplement it to disambiguate and/or complement the conveyed meaning (e.g., saying ‘the red box’ while pointing at a set of boxes (Lenay and Tixier, 2018)). de Lagarde et al. (2023) demonstrated that, given enough context (participants being told that the sounds are produced in human-human interaction), audio stimuli can effectively be recognized as social touch gestures and associated with socio-emotional intentions, which is consistent with haptics’ results in the social touch literature. Their study focused on four gesture types: tapping, hitting, caressing (stroking), and rubbing. The stimuli were created by recording vibrations produced by a touch performed directly on the skin, using a violin microphone attached to a human arm. These findings, along with previous research on touch and visual perception (Ravaja et al., 2017), highlight how closely interwoven the modalities of social touch are, reinforcing the importance of studying each modality’s contribution to eliciting a feeling of being socially touched by a VA.
3 Research questions
The overarching objective of this study is to disentangle the respective contributions of visual, audio, and tactile feedback in shaping the perception of social touch gestures performed by a VA in immersive VR. Specifically, we address both the sensory and the social dimensions of touch: not only whether participants feel a sensation, but also whether they interpret it as a socially meaningful action initiated by the agent.
Our research is guided by the following questions.
• RQ1: How does each sensory modality (visual, audio, tactile) influence the recognition of specific social touch gestures?
• RQ2: To what extent does each modality contribute to the subjective feeling of being socially touched by a virtual agent?
• RQ3: Does the attribution of agency to the VA, that is, perceiving the VA as the intentional source of the touch, depend on the available sensory modalities?
To answer these questions, we conducted three complementary experiments. The first two focus on RQ1 and RQ2 by systematically manipulating modality combinations, while the third expands these investigations to RQ3 by introducing variations in interpersonal distance and non-verbal behavior, thus probing how sensory and social cues jointly affect the perception of touch.
4 Methodology
The following section describes the rationale and methodological choices guiding the design of our stimuli across experiments. To investigate our research questions, we designed three experiments aimed at evaluating how different combinations of the three sensory modalities affect the perception of social touch in interactions with a VA. The experimental protocol was developed with two primary goals: (1) to isolate the contribution of each sensory modality to the perception of being touched and (2) to assess the impact of multimodal combinations on the realism and believability of the touch experience. This study was approved by our local IRB (INSEAD IRB approval no 2024-46). For each experiment, a different sample of participants was recruited, and there was no overlap of participants between experiments. They were all recruited in France, but we did not enforce a strict cultural background screening.
Our design leverages a controlled virtual environment to simulate touch gestures performed by a VA and incorporates precise synchronization of sensory feedback across modalities. Participants interact with a female-looking VA, who performs social touch gestures validated in prior studies (Boucaud et al., 2019), namely, tapping, hitting, and caressing. These gestures were chosen for their relevance to human-human social touch and adapted to match the capabilities of the haptic and audio systems employed in the study described further down in this section.
The protocol not only evaluates the effectiveness of individual modalities by integrating measures of gesture recognition and sensory perception, but also explores the influence of their interaction on the subjective experience of being touched. This methodology ensures a systematic and replicable approach to exploring the multisensory nature of social touch in virtual environments.
4.1 Methodology rationale
Our three experiments were designed to systematically explore the role of sensory matching through multimodal combinations in shaping the perception of social touch. Simulating social touch interactions involves navigating an expansive design space, as noted by Teyssier et al. (2020), who demonstrated that device-initiated touch can convey diverse emotions depending on subtle variations in sensory parameters. Their work underscores the complexity of simulating touch, where even minor changes in timing, intensity, or modality alignment can significantly alter user perception. Similarly, Delis et al. (2016) tackled this complexity in the domain of facial expressions by combining multiple parameters for emotion categorization. Their findings showed that isolating and comparing parameter effects leads to a deeper understanding of which elements are most critical in creating believable and emotionally resonant facial expressions for virtual agents.
In our study, we extend this principle to social touch by isolating and combining sensory modalities to determine their individual and synergistic contributions to touch recognition and perception. We thus say that the sensory modalities are matched when each modality’s stimulus presented to the user is intended to represent the same gesture (e.g., the audio, the tactile pattern, and the visual animation of a tap gesture). On the other hand, we say that the sensory modalities are mismatched when the stimuli presented are not intended to represent the same gesture (e.g., the audio of a tap gesture combined with the vibrotactile pattern designed to simulate a hit and the visual animation of a caress). Finally, we say that the modalities are partially matched when the three sensory modalities are present but only two of them are matched and the third one features a mismatched stimulus. Each trial could thus be a complete match, a complete mismatch, or a partial match. Including mismatched trials enables us to investigate how mismatched stimuli disrupt perception, an approach supported by Teyssier et al.’s (2020) observation that multisensory congruence plays a pivotal role in establishing emotional coherence in touch. By studying both matched and mismatched feedback across modalities, we aim to identify the conditions necessary for achieving realistic and compelling social touch interactions with virtual agents.
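As an illustration, these definitions can be expressed as a short classification rule. The following minimal Python sketch (an illustration only; the experimental software itself was implemented in Unity, see Section 5.1) labels a trial from the gesture intended by each available modality:

```python
def classify_trial(visual, audio=None, tactile=None):
    """Classify a trial by how well its sensory channels agree.

    Each argument is a gesture label ("tap", "hit", "caress") or None when
    the corresponding modality is absent from the trial.
    """
    cues = [g for g in (visual, audio, tactile) if g is not None]
    distinct = len(set(cues))
    if distinct == 1:
        return "complete match"      # every present channel conveys the same gesture
    if distinct == len(cues):
        return "complete mismatch"   # every present channel conveys a different gesture
    return "partial match"           # two channels agree, the third differs

print(classify_trial("tap", "tap", "tap"))     # complete match
print(classify_trial("caress", "tap", "hit"))  # complete mismatch
print(classify_trial("tap", "tap", "hit"))     # partial match
```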
4.2 Vibrotactile device
The device used in all three experiments, an MTX-Prototype designed by CAYLAR SAS, is made up of a sleeve consisting of a matrix of ERM vibration motors arranged in four rows around the arm (see Figure 1c; the actuator matrix is shown in Figure 2d).
Figure 1. Experimental setup for the experiments: (a) In-game view of the experiment, with the virtual agent (highlighted in purple here) touching the participant on the upper arm of their avatar. (b) Illustration of the experimental setup with the vibrotactile sleeve (as shown here on a member of the research team for illustration purposes). (c) The haptic sleeve.
4.3 Designing the visual and tactile stimuli from audio recordings
Few studies have validated the combination of tactile and auditory stimuli for social touch (Huisman et al., 2013b; Teyssier et al., 2020). For our study, we relied on the dataset by de Lagarde et al. (2023), who showed that recordings of human-human social touch (tap, caress, rub, hit) could be recognized above chance level when presented unimodally. To maintain experimental feasibility while still covering a range of social intentions, we restricted our set to three gestures: Tap, Hit, and Caress, selected for their higher recognition rates in the original validation. This choice reflects the exploratory nature of the present study, which cannot exhaustively cover the large design space of social touch.
4.3.1 Tactile stimuli design
The vibrotactile patterns were implemented on our ERM-based sleeve (see Section 4.2). Although ERMs lack the precision of voice-coils or grounded haptic devices, they remain practical, robust, and easy to integrate into immersive VR setups, making them suitable for our exploratory study (Remache-Vinueza et al., 2021). To derive tactile patterns, we analyzed the recordings in terms of gesture duration, intensity, and rhythm. Rhythm was prioritized as the most critical parameter for cross-modal alignment (Jack et al., 2015), ensuring that the onset and offset of vibrations corresponded to the temporal structure of the audio. In addition to directly audio-aligned patterns, we systematically varied stimulus duration (five levels) and intensity (three levels), yielding 15 candidate patterns per gesture.
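The sketch below illustrates how such a candidate grid can be enumerated; the duration scales and intensity levels shown are illustrative placeholders, since the actual levels were derived from the audio-aligned analysis described above.

```python
from itertools import product

# Placeholder parameter levels (assumptions for illustration only).
duration_scales = [0.50, 0.75, 1.00, 1.25, 1.50]  # relative to the audio-aligned duration
intensity_levels = [0.30, 0.50, 0.70]             # fraction of the maximum actuator drive

candidates = [
    {"duration_scale": d, "intensity": i}
    for d, i in product(duration_scales, intensity_levels)
]
assert len(candidates) == 15  # five duration levels x three intensity levels per gesture
```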
4.3.2 Pilot study
A pilot study with 12 participants (6 males, 6 females) evaluated the candidate audio–tactile combinations on the following criteria:
• Synchronization between audio and vibrations,
• Accuracy in reproducing the intended gesture,
• Perceptual coherence of the combined stimulus.
No significant differences emerged between the participant-chosen patterns and the audio-aligned ones. We therefore retained the audio-aligned tactile patterns for the main study, ensuring consistency across participants by presenting the same tactile patterns irrespective of individual preferences identified in the pilot study.
4.3.3 Final tactile stimuli
The final patterns, summarized in Figure 2, were as follows (a minimal encoding sketch is given after the list):
• Hit: a single 40 ms square-wave pulse across all actuators at 70% intensity.
• Tap: five 40 ms pulses at 40% intensity on 8 central actuators, arranged as two quick taps, a 500 ms pause, and three quick taps (160 ms gaps).
• Caress: the actuator layout of the sleeve (four rows of motors, see Figure 1) did not allow us to reproduce a continuous linear stroke along the length of the arm. During early design exploration, we therefore compared trajectories that progressed longitudinally (proximal–distal) with trajectories that progressed around the arm, i.e., a smooth, linear sweep of activations along the arm’s circumference from the front toward the back. Participants did not report systematic differences in how these variants were interpreted as a caress, but they perceived motion better with the circumferential pattern. We consequently implemented the around-the-arm sweep, which provided the smoothest and most continuous sensation given the hardware constraints. The final pattern lasted 1,300 ms at 10% intensity (approx. 13 cm/s stroking speed).
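The sketch below encodes these final patterns as simple event schedules of the form (onset in ms, duration in ms, intensity, actuators). Actuator indices and the exact inter-pulse timing are assumptions made for illustration; Figure 2 shows the actual signal profiles and actuator matrix.

```python
# Assumed actuator layout: 16 ERMs in four rows of four (indices 0-15).
ALL = range(16)
CENTRAL_8 = range(4, 12)
ROWS = [range(0, 4), range(4, 8), range(8, 12), range(12, 16)]

# Hit: a single 40 ms pulse on all actuators at 70% intensity.
HIT = [(0, 40, 0.70, ALL)]

# Tap: two quick pulses, a 500 ms pause, then three quick pulses at 40% intensity,
# reading the "160 ms gaps" as the time between one pulse's offset and the next onset.
TAP = [(t, 40, 0.40, CENTRAL_8) for t in (0, 200, 740, 940, 1140)]

# Caress: a 1,300 ms sweep around the arm at 10% intensity, approximated here
# as four successive row activations of 325 ms each.
CARESS = [(i * 325, 325, 0.10, row) for i, row in enumerate(ROWS)]
```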
Figure 2. Sound spectra and tactile pattern profiles, shown as signals and on the actuator matrix, for the (a) caress, (b) tap, and (c) hit gestures. (d) The actuator matrix, indicating the actuators involved in each tactile pattern.
4.3.4 VA gesture animations
The visual animations were adapted from the GRETA platform (Rosis et al., 2003; Boucaud et al., 2019), which provides validated social touch gestures performed by an embodied conversational agent. We preserved the core kinematics of these animations so that the three gestures remained comparable to prior work, while adjusting their timing and endpoint to align with our tactile and auditory patterns. In particular, Tap and Caress share a similar reaching phase of the forearm toward the participant’s upper arm, after which their trajectories diverge: Tap ends with a brief tapping motion, whereas Caress involves a slower, sliding hand movement. By contrast, Hit includes a more pronounced preparatory swing and a sharper impact, making it visually distinctive from the outset.
Because our goal was to investigate how modalities combine rather than to maximize photorealistic realism, we prioritized plausible and internally consistent movements over ultra–high-fidelity, motion-captured animation. Using GRETA-based gestures ensured continuity with validated social-touch datasets and provided tight control over rhythm and amplitude across conditions. As we discuss in Sections 5.7 and 6.5, the main ambiguity between Tap and Caress emerges from the intrinsic subtlety of their endpoint motions and their shared reaching phase, structural properties that would persist even with motion-captured animations. This ambiguity is further amplified by the limited field of view of the headset, which can partially occlude the fine hand movements that differentiate these two gestures.
It is worth noting that, in immersive VR, animation fidelity does not necessarily entail photorealistic motion capture. Prior research on presence and virtual humans shows that users’ responses depend more on the plausibility and coherence of an agent’s behavior than on graphical realism (Slater, 2009; Slater, 2018; Pan and Hamilton, 2018). When the kinematic structure of an action is preserved, stylized or simplified characters can convey emotional and social meaning as effectively as highly realistic ones (McDonnell et al., 2008; McDonnell et al., 2012). Moreover, extremely high visual realism may expose small animation imperfections and reduce perceived believability (McDonnell et al., 2012). In this context, our use of GRETA-based gestures reflects a deliberate methodological choice to prioritize controlled and plausible social motion which is critical for multisensory experiments, over maximizing visual realism for its own sake.
5 Experiment 1
From here on, we use the term gestures to refer to the emblematic types of social touch rendered through at least one of the three sensory modalities. In this first experiment, we explore how coupling the visual animation of a gesture performed by a VA with either an audio or a tactile stimulus can foster the feeling of being socially touched in naive participants, and how the recognition of the touch gestures is affected by the different combinations.
5.1 Apparatus for experiment 1
Using Unity 3D 2020.3.48f1, we set up a basic VR environment similar to the literature (Boucaud et al., 2023), using a Vive Pro 2 headset (HMD) combined with a Leap Motion Controller for tracking the user’s head and hands, respectively, allowing for basic presence and a sense of embodiment towards a given avatar (see Figure 1a). This avatar consists of a white outline allowing users to see their virtual hands (detected by the Leap Motion Controller) and their virtual arms (controlled through inverse kinematics). The Vive Pro 2 HMD provides a field of view (FOV) of approximately 116° and a refresh rate of 90 Hz. This relatively wide FOV ensured good visibility of the VA’s gestures, though some peripheral movements may still have extended beyond the participant’s immediate visual field. As discussed in Section 4.3, we used the GRETA platform to control the visual behavior of our VA (the tap, hit, and caress animations). The activation of the audio and tactile feedback is synchronized with the moment the agent’s hand reaches the user for each gesture. The virtual agent kept a neutral facial expression throughout the experiment. Participants were seated in front of a desk in the real world with a matching virtual desk in the IVE. In front of them, behind the desk, they could see the VA positioned at arm’s reach (see Figure 1a). As touch from a woman is usually more acceptable for both female and male touchees (Suvilehto et al., 2015), we used a female VA. The tactile feedback was rendered through the tactile sleeve described in Section 4.2. To play the audio and prevent participants from hearing the vibrations, participants were equipped with noise-cancelling DT770 Pro headphones from Beyerdynamic. All the questionnaires could be answered directly in the IVE through direct manual interaction with virtual panels. Examples of trials and of the virtual environment can be seen both in Figure 1a and in the video submitted as supplementary material.
5.2 Hypotheses
We have three hypotheses for the first experiment.
• H1: adding either audio or tactile feedback will increase the feeling of being touched by the VA compared to the condition with visual feedback only (Melo et al., 2020) (RQ2).
• H2: mismatched multisensory feedback will lead to a decreased perception of being socially touched (Jeunet et al., 2018; Richard et al., 2021) (RQ2).
• H3: visual animations will be the strongest factor in the recognition rate of the social touch gestures (RQ1).
5.3 Measures
The central dependent measure in Experiment 1 was participants’ subjective sense of being touched by the virtual agent (VA). We constructed a short questionnaire (Table 1) combining four items that targeted both the perception of a physical stimulus and the attribution of that stimulus to the VA. Items included, for example, whether the participant felt physically touched and whether the touch was perceived as caused by the agent.
While several validated instruments exist for evaluating user experience with virtual agents (e.g., the Artificial Social Agent Questionnaire, ASAQ, for believability and social presence, published after we ran our experiments (Fitrianie et al., 2025)) or attitudes toward social touch in human–human interaction (e.g., the Social Touch Questionnaire, STQ (Lapp and Croy, 2021)), none to our knowledge assess the combined experience of multimodal (visual–auditory–haptic) social touch or the recognition of touch gestures in immersive environments. We therefore designed exploratory composite indices tailored to the present study’s aims. The resulting FeelingTouched index drew on wording and conceptual inspiration from embodiment and social-presence research (e.g., Jeunet et al., 2018; Huisman et al., 2016; Kilteni et al., 2012) and achieved excellent internal reliability.
Responses were given on 7-point Likert scales and aggregated into a single score representing participants’ overall feeling of being touched by the agent. Although reliability was high, the four items encompass partially distinct constructs (e.g., tactile sensation, social attribution, spatial presence). The aggregated score should therefore be interpreted as an exploratory composite index rather than a validated unidimensional scale. This limitation motivated refinements introduced in Experiment 3, where the constructs of perceived TouchAgency and TouchBelievability were measured separately.
In addition, gesture recognition was assessed by asking participants to identify which gesture (hit, tap, or caress) they believed the VA had performed, chosen from a predefined list. This categorical outcome allowed for subsequent analyses of recognition accuracy across modality conditions.
5.4 Experimental design and conditions
To keep the experiment at a reasonable length for participants, we settled on three experimental conditions for this first experiment: only visual feedback (agent animations), which serves as a control condition (VisuOnly); visual and haptic (vibrotactile) feedback (VisuHaptic); and visual and audio feedback (VisuAudio). We chose to always have visual feedback present as our research is aimed at interactions with VAs that always feature some virtual body, regardless of other modalities. The experiment followed a within-subject design. For our three kinds of gestures (tap, hit, and caress) we used the audio and vibrotactile stimuli and the animations described in Section 4.3 to produce all possible combinations for each condition, both matched and mismatched. There are thus 3 trials for the VisuOnly condition and 9 trials each for the VisuAudio and VisuHaptic conditions (all 3 × 3 combinations of visual animation with an audio or tactile stimulus), for a total of 21 trials.
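The trial set follows directly from these combinations; the short sketch below simply makes the 3 + 9 + 9 = 21 count explicit (it is not the trial-generation code used in the experiment, which ran in Unity).

```python
from itertools import product

gestures = ["tap", "hit", "caress"]

visu_only   = [("VisuOnly", v) for v in gestures]                             # 3 trials
visu_audio  = [("VisuAudio", v, a) for v, a in product(gestures, repeat=2)]   # 9 trials
visu_haptic = [("VisuHaptic", v, h) for v, h in product(gestures, repeat=2)]  # 9 trials

trials = visu_only + visu_audio + visu_haptic
assert len(trials) == 21
```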
5.5 Procedure
After being seated in front of a plain office desk, participants read information about the experiment through an information notice and signed an informed consent form. Participants were instructed that the virtual agent would perform a series of touch gestures directed toward their avatar. They were told that the gestures might vary in their sensory characteristics (visual, auditory, tactile) and that their task was to (i) report how touched they felt by the agent and (ii) identify which gesture the agent performed when applicable. No information was given about congruence or mismatch between modalities. The experimenter then helped them put the vibrotactile device on their upper arm and adjust the head-mounted display (HMD) and the headphones. Participants were immersed in the IVE described in Section 5.1 and familiarized themselves with how to answer the questionnaires with their virtual hands using the Leap Motion Controller through a short tutorial. Once ready, they pressed a button to start the trials, in which the agent would touch them on the arm visually and with either vibrotactile or audio feedback depending on the trial condition. They answered the questionnaire shown in Table 1 for each trial, right after experiencing the stimulus. This was repeated for each of the 21 trials. The presentation order of the trials was fully randomized.
5.6 Results
Twenty-five participants took part in Experiment 1 (16 male, 9 female).
The experiment comprised 21 trials: 3 VisuOnly trials and 18 multisensory trials that combined vision with either haptics or audio in matched or mismatched form. For analysis, we grouped trials into five matching groups: VisuOnly, VisuAudioMatch, VisuAudioMismatch, VisuHapticMatch, and VisuHapticMismatch. Distributions per group are shown in Figure 3.
Figure 3. Box plot of participants’ average FeelingTouched scores (Likert scale, 1–7) across multisensory matching groups, as described in Section 5.6. The red asterisk indicates that all pairwise comparisons involving this group reached statistical significance. Within each modality combination, matched conditions elicited higher FeelingTouched ratings than mismatched ones.
5.6.1 Perception of the agent’s touch
We computed the aggregated FeelingTouched score by averaging items Per-1, Per-2, Per-3, and Per-4 (reverse-scored). The scale showed excellent internal reliability (Cronbach’s α).
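The scoring can be sketched as follows, assuming the ratings sit in a pandas DataFrame df with one column per item (column names are illustrative, and Per-4 is assumed to be the reverse-scored item); Cronbach’s α is computed with the standard formula.

```python
# df: one row per participant x trial, with 7-point Likert ratings in columns
# "Per-1" ... "Per-4" (assumed column names; see Table 1).
items = df[["Per-1", "Per-2", "Per-3", "Per-4"]].copy()
items["Per-4"] = 8 - items["Per-4"]        # reverse-score the negatively worded item

df["FeelingTouched"] = items.mean(axis=1)  # exploratory composite index

# Cronbach's alpha for the four items.
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))
```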
5.6.2 Recognition of the gesture
Figure 4 reports recognition rates by matching group. Because the visual animation was present in all conditions, we recoded responses into a binary outcome indicating whether the selected gesture matched the intention of the visual animation (yes/no) (Figure 4a). We then fit a mixed-effects logistic regression with participant ID as a random effect (525 observations; 25 participants; the random-effect variance is reported with the full model in the supplementary materials). Selected fixed-effect results are given in Table 2.
Figure 4. Gesture recognition rates across multisensory matching conditions. (a) Proportion of trials in which participants correctly recognized the gesture relative to the visual animation (“yes” if their choice matched the intended animation, “no” otherwise). (b) Distribution of chosen gestures across matching conditions. For matched groups, responses are classified as either the matched gesture or an absent gesture (i.e., a gesture not present in the stimuli, such as recognizing a tap when the actual feedback, audio and visual, was a caress). For mismatched groups, responses are attributed to the modality that drove the choice (visual, tactile, or audio) or to absent. Overall, recognition was higher in matched conditions than in mismatched ones.
Table 2. Selected results from fixed effects (condition groups) of mixed-effects logistic regression (full results available in supplementary materials). The outcome variable (whether the gesture chosen by the participants corresponds to the intended visual animation) is coded as Yes = 0 and No = 1.
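As a reference for the model structure, one way to fit an analogous mixed-effects logistic regression in Python is sketched below (column names are assumptions; an equivalent model could also be fit with lme4::glmer in R).

```python
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# df: one row per trial, with (assumed) columns
#   no_match:    1 if the chosen gesture differed from the visual animation, 0 if it matched
#   group:       matching group (VisuOnly, VisuAudioMatch, VisuAudioMismatch, ...)
#   participant: participant identifier, used as a random effect
model = BinomialBayesMixedGLM.from_formula(
    "no_match ~ C(group, Treatment('VisuOnly'))",
    vc_formulas={"participant": "0 + C(participant)"},
    data=df,
)
result = model.fit_vb()   # approximate (variational Bayes) fit
print(result.summary())
```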
To examine which cue participants followed in mismatched trials, we ran chi-squared tests on the distribution of chosen gestures within each mismatched group. In VisuAudioMismatch, the distribution of choices differed significantly between the visually intended and the audio-intended gestures.
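As an illustration of this kind of test, the sketch below runs a goodness-of-fit chi-squared test on hypothetical counts of mismatched-trial choices that followed the visual cue, the audio cue, or neither; the numbers are placeholders, not the study’s data.

```python
from scipy.stats import chisquare

# Placeholder counts for one mismatched group (followed visual, followed audio, neither).
observed = [60, 35, 15]
stat, p = chisquare(observed)   # goodness-of-fit against a uniform distribution
print(f"chi2 = {stat:.2f}, p = {p:.4f}")
```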
5.7 Discussion
The results of Experiment 1 show that participants reported a stronger feeling of being touched by the VA whenever additional sensory feedback was present, compared to visual feedback alone. This effect held even for incongruent trials, confirming that multisensory stimulation enhances the experience of social touch. Tactile feedback led to the highest ratings overall, and congruent conditions consistently outperformed incongruent ones. These findings provide clear support for H1 and H2, and are consistent with previous work highlighting the importance of congruence for user experience (Jeunet et al., 2018; Richard et al., 2021).
Gesture recognition revealed a more nuanced pattern. The VisuAudioMatch condition significantly improved recognition compared to VisuOnly, while tactile feedback did not significantly alter recognition rates. In incongruent trials, audio cues influenced participants’ answers more strongly than tactile ones. Taken together, these findings suggest that audio feedback can be more effective than haptic feedback for disambiguating gestures, even though haptics contributes more strongly to the subjective sensation of being touched. This provides only partial support for H3.
Qualitative feedback provided additional context for these findings. Several participants described the Hit gesture as particularly easy to recognize, largely due to its distinctive preparatory motion and sharp impact. By contrast, Tap and Caress were sometimes reported as visually similar. Both gestures involved the agent leaning forward and extending the forearm toward the participant’s upper arm, with differences mainly in the final hand movement at the point of contact (a brief tap versus a slower caress). In the HMD, participants often focused on the VA’s face, and the limited field of view meant that this endpoint motion was not always fully visible in peripheral vision. As a result, the shared reaching phase may have made Tap and Caress appear more alike, prompting participants to rely more heavily on additional sensory cues for recognition. Importantly, this reduced recognizability of the Caress gesture was already present in the visual-only trials, indicating that the ambiguity stems from the intrinsic subtlety of the gesture and viewpoint constraints rather than from limitations of the tactile rendering.
In summary, Experiment 1 demonstrates that multimodal feedback significantly increases the subjective feeling of being touched, with tactile feedback playing the strongest role. Audio feedback, however, appeared more influential than haptic cues in guiding recognition. These findings suggest a functional dissociation between modalities: tactile feedback enhances the affective dimension of touch, while audio feedback supports gesture recognition. Importantly, the study did not examine how audio and haptic feedback might interact when combined. Addressing this gap motivated the design of Experiment 2, which directly investigates tri-modal (visual–audio–haptic) stimulation and explores whether adding the third modality produces additive or interaction effects beyond those observed for bimodal feedback.
6 Experiment 2
Experiment 2 was originally conceived as part of the same experimental study as Experiment 1. The initial design included every possible combination of visual, auditory, and haptic feedback (both matched and mismatched) within a single comprehensive framework. However, piloting revealed that completing all conditions in one session would lead to excessive duration and participant fatigue, potentially compromising data quality.
To preserve methodological rigor and participant engagement, we therefore divided the original design into two complementary experiments. Experiment 1 focused on unimodal and bimodal combinations to isolate pairwise contributions, whereas Experiment 2 extended this framework to full tri-modal integration. This division maintained the integrity of the overall research design while allowing independent analyses and feasible session lengths. The present experiment thus specifically investigates whether adding a third sensory channel yields additive or interactive effects on perceived realism and gesture recognition.
6.1 Measures
Measures were identical to those of Experiment 1 to allow direct comparison. The FeelingTouched questionnaire (Table 1) was again aggregated into a single exploratory index, and gesture recognition was assessed using the same forced-choice task.
6.2 Experimental design and conditions
The experimental setup is identical to the one from Section 5, in terms of the animations, sounds and tactile patterns used, as well as apparatus, measures and procedure. Only conditions differ.
Experiment 2 introduced a tri-modal configuration combining all three sensory feedback modalities (visual, auditory, and tactile) alongside a visual-only control condition. This design enabled us to examine how congruence across all sensory channels influences both the subjective feeling of being touched and gesture recognition.
Two high-level experimental conditions were tested.
• Visual-only (VisuOnly): participants saw the gesture animation without any accompanying audio or tactile feedback (3 trials);
• Visuo–Audio–Haptic (VisuAudioHaptic): participants received simultaneous visual, auditory, and tactile feedback, which could vary in congruence across modalities.
Within the VisuAudioHaptic condition, each modality could represent one of the three intended gestures (Hit, Tap, or Caress). All possible combinations were presented, resulting in matched, partially matched, and completely mismatched trials.
• CompleteMatch–all three modalities correspond to the same gesture (3 trials);
• AudioHapticMatch, VisuHapticMatch, and VisuAudioMatch–two modalities match and one differs (6 trials each);
• CompleteMismatch–each modality represents a different gesture (6 trials).
In total, participants experienced 3 VisuOnly trials and 27 VisuAudioHaptic trials (3 complete matches, 18 partial matches, and 6 complete mismatches), for a total of 30 trials.
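This breakdown of the 27 tri-modal trials follows directly from the combinatorics, as the short sketch below illustrates.

```python
from itertools import product

gestures = ["tap", "hit", "caress"]
counts = {"complete match": 0, "partial match": 0, "complete mismatch": 0}

for visual, audio, tactile in product(gestures, repeat=3):  # 27 tri-modal combinations
    distinct = len({visual, audio, tactile})
    label = {1: "complete match", 2: "partial match", 3: "complete mismatch"}[distinct]
    counts[label] += 1

print(counts)  # {'complete match': 3, 'partial match': 18, 'complete mismatch': 6}
```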
6.3 Hypotheses
In this second experiment, a new level of mismatch is introduced, as multisensory trials combine three sensory feedback channels. Our hypotheses are as follows:
• H3: Completely matching trials will elicit a higher sense of being socially touched compared to partially matching trials, which will be better than completely mismatched trials (RQ2).
• H4: For partially or completely mismatched trials, visual feedback will be predominant for gesture recognition (RQ1).
• H5: When visual feedback is more ambiguous, audio feedback will be predominant over tactile feedback for gesture recognition (RQ1).
Hypothesis H5 was formulated as a result of the first experiment, where audio feedback seemed to drive recognition better than tactile feedback.
6.4 Results
Twenty-four participants took part in Experiment 2 (10 male, 14 female).
Figure 5. Box plot of participants’ average FeelingTouched scores (Likert scale, 1–7) across multisensory matching groups, as described in Section 6.4. The red asterisk indicates that all pairwise comparisons involving this group reached statistical significance. Overall, matched groups elicited higher FeelingTouched ratings than mismatched ones.
6.4.1 Perception of the agent’s touch
A Friedman test on the FeelingTouched scores revealed a significant effect of sensory congruence. Pairwise post-hoc comparisons showed the following:
• The CompleteMatch condition differed significantly from all other groups.
• The VisuOnly condition also differed significantly from all other groups.
• The CompleteMismatch condition differed significantly from AudioHapticMatch.
The full set of comparisons is reported in the supplementary material.
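For reference, an analysis of this kind can be reproduced in Python as sketched below, assuming a dictionary scores mapping each matching group to per-participant mean FeelingTouched scores (same participant order in each array); this is an illustrative pipeline rather than the exact analysis scripts used for the study.

```python
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

groups = list(scores)                                      # e.g., CompleteMatch, VisuOnly, ...
stat, p = friedmanchisquare(*(scores[g] for g in groups))  # omnibus test of congruence

# Post-hoc pairwise Wilcoxon signed-rank tests with Holm correction.
pairs = [(a, b) for i, a in enumerate(groups) for b in groups[i + 1:]]
raw_p = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
reject, p_corrected, _, _ = multipletests(raw_p, method="holm")
```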
6.4.2 Recognition of the gesture
As in Experiment 1, responses were recoded as yes/no depending on whether participants’ choices matched the intended visual animation (Figure 6a). This binary transformation enabled the use of a mixed-effects logistic regression (Table 3), with participant ID as a random effect (720 observations; 24 participants; the random-effect variance is reported with the full model in the supplementary materials).
Figure 6. Recognition rates across multisensory matching groups. (a) Proportion of correct recognitions relative to the visual animation (y-axis = proportion of trials; “yes” = chosen gesture matched the animation, “no” = otherwise). (b) Distribution of recognized gestures across matching groups. For matched groups, responses are classified as either the intended gesture or an absent gesture (i.e., a gesture not present in the stimuli). For mismatched groups, responses are classified by the modality selected (visual, tactile, audio) or as absent. Visual feedback generally dominated recognition in matched conditions, whereas responses in mismatched conditions varied by modality.
Table 3. Selected fixed-effect (condition group) results from the mixed-effects logistic regression (full results available in the supplementary materials). The outcome variable (whether the gesture chosen by the participants corresponds to the intended visual animation) is coded as Yes = 0 and No = 1.
6.4.3 Influence of individual modalities
To further analyze mismatched trials, a chi-squared test was conducted on gesture distributions within the CompleteMismatch condition. No significant differences were found across modalities.
6.4.4 Role of the hit animation
Recognition rates across all 27 multisensory trials (excluding VisuOnly) are presented in Figure 7. The hit animation consistently dominated recognition, overriding incongruent sensory cues in most cases. The only exception occurred when a tap tactile pattern was combined with a non-hit audio cue, leading to split recognition between tap and hit. After removing all hit trials, a chi-squared test showed that in the CompleteMismatch condition, modalities were equally likely to guide responses.
Figure 7. Recognition rates of the three gestures across the 27 multisensory trials of Experiment 2 (excluding VisuOnly trials). Columns correspond to the visual animation, rows to the audio feedback condition, and stacked bars within each cell to the tactile feedback condition. Colors indicate the gesture recognized. The hit animation achieved consistently higher recognition than tap or caress, regardless of the presence of additional sensory cues.
6.5 Discussion
Reflecting on the FeelingTouched scores, CompleteMatch trials elicited the strongest sense of being socially touched, followed by partially matching conditions, and then CompleteMismatch. Even mismatched trials, however, produced higher ratings than VisuOnly. These results support H3, align with the findings of Experiment 1, and reinforce prior evidence that multisensory integration enhances immersion in IVEs (Sallnäs, 2004; Kilteni et al., 2012; Melo et al., 2020), with congruence playing a key role (Richard et al., 2021; Jeunet et al., 2018).
Turning to H4, recognition results were more nuanced. In CompleteMismatch conditions, participants distributed their responses equally across visual, audio, and haptic cues, showing no dominant modality. However, in AudioHapticMatch trials, participants more often relied on the mismatched visual cue than in VisuAudioMatch or VisuHapticMatch, suggesting that visual feedback can still dominate even when incongruent with other modalities.
Qualitative feedback offers a partial explanation: the Hit animation was consistently recognized due to its distinctive preparatory motion and clearly defined impact, while Tap and Caress animations were harder to differentiate. As in Experiment 1, participants noted that for Tap and Caress the agent first performed a similar reaching movement toward the upper arm, with the gestures diverging only in the more subtle hand motion at contact. Given the headset’s field of view, this endpoint difference was not always prominent, which likely contributed to the visual ambiguity between these two gestures. This pattern matches the recognition data (Figure 7), where Hit reached over 80% accuracy regardless of sensory cues. For Tap and Caress, participants reported depending more on the additional modalities, highlighting the role of multisensory congruence when visual cues are ambiguous. As in Experiment 1, this ambiguity also appeared in the visual-only trials, showing that the limited recognizability of the Caress gesture originates from the gesture’s subtle visual form and shared reaching phase rather than from constraints of the tactile feedback design.
Regarding H5, results did not confirm that audio feedback drives recognition more strongly than haptic feedback. VisuAudioMatch and VisuHapticMatch showed comparable performance, and mismatched conditions revealed no advantage for either modality. One possible explanation is the ecological validity of the audio stimuli. Several participants noted that some sounds, especially Caress, felt unconvincing and even resembled environmental noises, raising the possibility that realistic recordings may appear less believable in immersive contexts (Marini et al., 2012).
A few limitations should be noted. The questionnaire continued to rely on the term “touch,” which may have biased participants toward tactile sensations. In addition, the audio feedback, despite being validated in prior work (de Lagarde et al., 2023), may have been less effective in VR due to context-specific factors. These limitations motivated refinements in Experiment 3.
Taken together, Experiments 1 and 2 show that congruent multimodal feedback strengthens both the feeling of being touched and gesture recognition, with tactile input playing a central role. Yet, the relative contributions of audio and tactile cues remain inconclusive, as both improved recognition only when visual input was ambiguous. To further address the role of agency and the social interpretation of touch, we revised the protocol in Experiment 3 to manipulate the VA’s non-verbal behavior and interpersonal distance.
7 Experiment 3
7.1 Experimental and stimuli design
For this third experiment, we refined our approach to better isolate the respective contributions of each sensory modality to the feeling of being socially touched. While Experiments 1 and 2 provided initial insights, they also highlighted limitations in stimulus realism and interpretability, particularly for the auditory and visual channels. In Experiment 3, we therefore introduced a new set of stimuli with more controlled visual conditions and revised audio recordings, focusing exclusively on the tap gesture.
This choice was guided by both practical and conceptual considerations; in particular, including only one gesture kept the number of conditions manageable (4 visual feedback levels × 4 multisensory feedback levels, i.e., 16 trials per participant).
7.1.1 Visual feedback conditions
To better control the VA’s presence and movement, we defined four distinct visual feedback levels forming one independent variable.
• FullAnimation: the VA is positioned within peripersonal space (70 cm away from the participant) and performs a tap animation (Figure 8a).
• Fixed: the VA is within reach, with its arm extended and hand close to the participant’s arm, but without animation (Figure 8b).
• Close: the VA stands motionless within reach, facing the participant (Figure 8c).
• Far: the VA stands motionless outside of peripersonal space (public proxemic distance), 7 m away (Figure 8d).
Figure 8. Visual feedback conditions used in Experiment 3. (a) FullAnimation: the VA within peripersonal space, performing the tap gesture (blur indicates motion). (b) Fixed: the VA within reach, with the arm extended but no animation. (c) Close: the VA standing motionless within reach. (d) Far: the VA standing motionless outside peripersonal space.
This structure allowed us to examine the extent to which movement and proximity contributed to perceptions of agency and believability, independently of other modalities.
7.1.2 Audio stimuli
Feedback from earlier studies suggested that direct skin-to-skin recordings (de Lagarde et al., 2023) lacked believability, as they captured only contact vibrations without the airborne qualities by which humans normally perceive touch sounds. In response, we recorded a new tap sound using a microphone positioned to capture air-propagated transmission. The stimulus was generated by gently tapping a clothed arm to mirror the physical context of the experiment. Rather than aiming for perfect ecological fidelity, this design prioritized believability: a sound that listeners would plausibly associate with a tap on their own arm in VR, even if simplified compared to real-world acoustics.
7.1.3 Tactile stimuli
For consistency, we reused the tactile pattern for tap defined in Experiments 1 and 2: five 40 ms square-wave pulses at 40% intensity, applied to the eight central actuators, with a temporal structure matching the audio rhythm (two pulses separated by 160 ms, a 500 ms pause, followed by three pulses with 160 ms gaps). This ensured comparability across experiments.
7.1.4 Animation design
A new tap animation was created using motion capture to improve movement naturalism and to reduce the mechanical appearance reported in earlier experiments. The rhythm of the tap motion was aligned with the revised audio and tactile feedback. To simplify interaction and free the participants’ right arm for responding, the tap animation was applied to the participants’ left arm. Despite these improvements, we acknowledge that the animation remained limited in expressiveness, as it did not include facial or non-verbal cues that are typically integral to social interactions.
7.1.5 Multisensory feedback combinations
In addition to visual feedback, we defined a second independent variable with four levels of multisensory feedback.
• NoFeedback: no supplementary audio or tactile stimuli.
• Haptic: tactile feedback only.
• Audio: audio feedback only.
• AllModalities: combined tactile and audio feedback.
Participants experienced all 16 combinations of visual and multisensory feedback (4 × 4 levels).
7.2 Procedure
The experiment comprised all 16 combinations of the two independent variables. Each trial followed the same sequence: the scene first faded to black, during which the VA was positioned according to the condition (e.g., close, far, arm fixed near the participant, etc.). The scene then faded back in, and the stimulus was delivered with the corresponding sensory feedback (audio, tactile, both, or none). After stimulus presentation, the screen faded to black again, and participants completed the questionnaire described in Section 7.3. They could then initiate the next trial by pressing a designated button. Trial order was fully randomized.
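The resulting 4 × 4 design and its randomization can be summarized with the following sketch (illustrative only; the experiment itself was driven by the Unity application).

```python
import random
from itertools import product

visual_levels = ["FullAnimation", "Fixed", "Close", "Far"]
feedback_levels = ["NoFeedback", "Haptic", "Audio", "AllModalities"]

trials = list(product(visual_levels, feedback_levels))  # 4 x 4 = 16 combinations
assert len(trials) == 16
random.shuffle(trials)                                  # fully randomized presentation order
```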
7.3 Measures
In Experiments 1 and 2, our questionnaire on the feeling of being socially touched relied heavily on the word touch. This phrasing may have unintentionally biased participants toward interpreting their experience primarily in terms of vibrotactile sensations, rather than as a broader social phenomenon involving multiple modalities. For Experiment 3, we therefore revised the questionnaire to reduce this bias and to disentangle different dimensions of the experience more clearly.
Specifically, we introduced two subscales.
• Touch Believability: the extent to which participants felt that the stimulus was convincing as a touch-like event, regardless of modality. This captures the sensory plausibility of the experience.
• Touch Agency: the extent to which participants attributed the stimulus to the VA as an intentional, social actor. This reflects whether participants identified the VA as the cause of the stimulus and perceived it as a social act rather than a purely external event.
To our knowledge, no validated questionnaire currently exists for assessing perceived interpersonal touch or social attribution in immersive human–agent interaction. Consequently, both subscales were custom-designed for the present study, with item phrasing and conceptual inspiration drawn from prior work on embodiment, social presence, and human–agent behavior (e.g., Jeunet et al., 2018; Huisman et al., 2016; Kilteni et al., 2012). Items were written to avoid direct use of the word touch, emphasizing instead believability, causality, and agency. Each item used a 7-point Likert scale, and three items (A2, T1, and T5) were reverse-scored. Internal consistency was satisfactory for both subscales.
The resulting questionnaire is presented in Table 4, with items grouped under the two subscales. Although this questionnaire has not yet been validated as a psychometric instrument, it represents a methodological step toward mitigating lexical bias and explicitly targeting the constructs of sensory believability and social attribution.
Unlike Experiments 1 and 2, gesture recognition was not assessed in Experiment 3, as the focus shifted to the role of interpersonal distance and non-verbal behavior in shaping participants’ attributions of believability and agency. By refining the questionnaire in Experiment 3 to reduce lexical bias and separate the constructs of believability and agency, we obtained a more precise account of how participants attributed both the sensory plausibility of the stimulus and the intentionality of the VA, strengthening the interpretation of findings across experiments.
7.4 Hypotheses
The main motivation of this third experiment is to distinguish and quantify the contribution of each sensory modality to the feeling of being socially touched by a VA. Based on related work (Section 2) and the results of Experiments 1 and 2, we hypothesize the following:
• H6: Visual, tactile, and audio feedback will contribute differently to the sense of being touched (RQ2).
• H6.1: Multisensory feedback will increase the perception of being socially touched, and tactile feedback will increase it more than audio feedback (RQ2).
• H7: Visual feedback will enable feeling socially touched when the VA is inside the participants’ peripersonal space (RQ3).
7.5 Results
Thirty-one participants took part in Experiment 3 (16 female, 13 male, 2 preferred not to disclose).
From the post-trial questionnaire (see Section 7.3; Table 4), we computed three outcome metrics: TouchAgency, TouchBelievability, and FeelingTouched. The first two were obtained by averaging the relevant items (with reverse-scored items handled appropriately), while FeelingTouched was calculated as the mean of both scales combined. Distributions of these scores across conditions are shown in Figure 9.
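The scoring can be sketched as follows, assuming the ratings are stored in a pandas DataFrame df; item names (A1–A5 for TouchAgency, T1–T5 for TouchBelievability) are assumptions based on Table 4, and FeelingTouched is computed as the mean of the two subscale scores, following the description above.

```python
agency_items = ["A1", "A2", "A3", "A4", "A5"]
believability_items = ["T1", "T2", "T3", "T4", "T5"]

scored = df.copy()
for item in ["A2", "T1", "T5"]:
    scored[item] = 8 - scored[item]  # reverse-score on the 1-7 Likert scale

scored["TouchAgency"] = scored[agency_items].mean(axis=1)
scored["TouchBelievability"] = scored[believability_items].mean(axis=1)
scored["FeelingTouched"] = scored[["TouchAgency", "TouchBelievability"]].mean(axis=1)
```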
Figure 9. Box plots of all 16 trials in Experiment 3, grouped by visual feedback conditions. (a) FeelingTouched. (b) TouchAgency. (c) TouchBelievability. Red bars indicate significant differences in pairwise comparisons.
7.5.1 Main effects of visual and multisensory feedback
Levene’s tests indicated significant violations of the homogeneity of variances for both the visual and the multisensory feedback factors; we therefore relied on non-parametric tests for the main and post-hoc analyses.
7.5.2 Post-hoc comparisons
Wilcoxon signed-rank tests, corrected for multiple comparisons with the Holm–Bonferroni method, further specified these effects. For both FeelingTouched and TouchBelievability, all pairwise comparisons between conditions were significant, except between AllModalities and Haptic (for all visual levels) and between NoFeedback and Audio (for the Far, Close, and Fixed visual conditions). For TouchAgency, 13 out of 24 possible pairwise comparisons reached significance. Full results are reported in Figure 9 and in the supplementary analysis repository.
7.5.3 Item-level analyses
To further examine how visual feedback shaped the social attribution of the agent, we conducted Wilcoxon pairwise comparisons across the four visual conditions for each of the five items comprising TouchAgency. This involved 30 comparisons (6 per item). Of these, 27 were significant. The three non-significant results were: (i) item A2, comparing Close vs. Fixed (
7.6 Discussion
Experiment 3 extended the first two studies by examining how the context of interaction, rather than only sensory congruence, influences the perception of being touched by a VA. Specifically, we manipulated the interpersonal distance between participants and the VA to test whether proximity conditions the attribution of agency and the believability of touch. In doing so, this experiment directly addressed RQ3, which asked how visual cues and multisensory feedback shape participants’ perception of the VA’s social ability.
The results show that distance strongly modulated participants’ responses. When the VA was positioned within reach, tactile events were consistently attributed to her, and the feedback was experienced as intentional touch. By contrast, when the VA stood further away, participants were less likely to link the stimuli to her actions and instead described the sensations as external or system-driven. This difference highlights that the spatial configuration of the interaction is critical for grounding touch as a social act, even when the same tactile patterns are used.
Gesture animation further reinforced agency attribution: static poses such as Fixed or Close elicited lower scores than the FullAnimation condition, where movement clearly conveyed intentionality. Multisensory feedback amplified this distinction, with haptics again being the dominant contributor and audio adding smaller but consistent support. These results indicate that participants do not automatically attribute agency to the VA whenever tactile feedback is present; instead, agency emerges when proximity, movement, and sensory cues jointly signal the VA as the plausible source of the touch.
Experiment 3 therefore adds an important dimension to our investigation. While Experiments 1 and 2 emphasized how visual, auditory, and tactile modalities combine to create convincing sensations, the present results show that contextual factors, especially distance and animation, determine whether these sensations are integrated as part of an intentional social gesture. Together, the three experiments indicate that successful simulation of social touch requires not only multimodal congruence but also a plausible context, especially regarding interpersonal distance, that affords agency attribution.
8 Conclusion
This work provides the first systematic exploration of how visual, auditory, and tactile feedback (alone and in combination) contribute to the perception of social touch in IVEs. Building on earlier studies of sensory substitution for touch (de Lagarde et al., 2023), social touch in robotics (Teyssier et al., 2020), and human–agent touch in non-immersive contexts (Huisman et al., 2016), we present one of the first controlled investigations of how these modalities interact when embodied in a virtual agent. While multimodal interaction has been widely studied in other domains, the specific question of how visual, audio, and tactile channels jointly shape believability, gesture recognition, and agency attribution in virtual social touch has not previously been addressed. By structuring three experiments around matched and mismatched feedback, we provide novel insights into the design space of virtual social touch, offering a foundation for more ecologically rich and socially contextualized investigations.
Across three experiments, our findings demonstrate that multimodal feedback enhances the perception of social touch, but with distinct contributions from each modality. Haptic feedback consistently played a central role in eliciting the sensation of touch. Visual feedback strongly influenced gesture recognition, although its effect depended on the prominent features of the animation and the immersive setup. Auditory feedback had a more limited role: while it sometimes aided recognition, its contribution to believability and agency was smaller than that of tactile or visual cues. Experiment 3 further revealed that proxemics and animation detail modulate whether participants attribute social agency to the VA, underscoring the importance of contextual factors in touch interactions. This directly addressed RQ3, demonstrating that agency attribution does not arise automatically from multimodal feedback, but depends on whether the interaction context makes the VA a plausible and intentional source of the touch.
While our work deliberately explored only a subset of the vast design space of social touch gestures and stimuli, it provides a foundation for understanding how modality combinations affect social touch in VR. By isolating matched and mismatched cues, we highlight both the robustness of haptic feedback and the fragility of believability when stimuli lack congruence. Taken together, these insights inform the design of future VR systems aiming for richer, more socially meaningful touch interactions.
9 Limitations and perspectives
Our findings underscore the promise of multimodal integration for simulating social touch, but they also highlight important limitations and opportunities for future research. These can be grouped into three broad categories: technological, methodological, and socio-cultural considerations.
9.1 Technological limitations
The first set of limitations arises from the technical means by which we simulated social touch. Visual feedback in our experiments was primarily varied through gesture animations and interpersonal distance. Participants, however, frequently remarked on the absence of accompanying nonverbal cues such as facial expressions, which are central to conveying intention and affect.
Similarly, while vibrotactile ERMs provided controlled and repeatable stimuli, they represent only a narrow part of the rich spectrum of haptic experiences that social touch entails. For the caress gesture, the actuator layout of the sleeve (four rows of motors arranged around the arm) constrained our ability to render a continuous stroke along the arm’s length. In early design exploration, we therefore compared different trajectories for the vibration sweep: strokes that progressed longitudinally (proximal–distal) and strokes that progressed around the arm’s circumference, with actuators firing sequentially from the front toward the back of the arm. Participants did not report meaningful differences in how these variants were interpreted as caresses, and the around-the-arm sweep produced the smoothest and most continuous pattern given the hardware limitations. Although this simplification reduces the naturalism of the caress stimulus, recognition difficulties were already present in the visual-only trials of Experiments 1 and 2, indicating that the ambiguity stems from the inherent subtlety of the gesture rather than from the tactile rendering. As such, the device constraint affects realism but does not undermine the interpretation of the tactile modality’s role in our findings.
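To make the around-the-arm sweep concrete, the sketch below sequences four motor rows from the front toward the back of the arm with a short overlap between consecutive rows; the driver call set_intensity is hypothetical (the sleeve's real interface is not shown here) and the timing constants are arbitrary.

import time

N_ROWS = 4        # rows of ERM motors arranged around the arm circumference
STEP_S = 0.20     # onset-to-onset delay between consecutive rows
OVERLAP_S = 0.10  # how long a row keeps vibrating after the next one starts

def set_intensity(row: int, level: float) -> None:
    """Hypothetical driver call: set the vibration amplitude (0.0-1.0) of one motor row."""
    print(f"row {row}: intensity {level:.1f}")

def circumferential_caress(level: float = 0.6) -> None:
    """One sweep around the arm, front to back, approximating a continuous stroke."""
    previous = None
    for row in range(N_ROWS):
        set_intensity(row, level)          # start the current row
        if previous is not None:
            time.sleep(OVERLAP_S)          # brief overlap with the previous row
            set_intensity(previous, 0.0)   # then release it
        time.sleep(STEP_S - OVERLAP_S)     # remaining dwell time for this row
        previous = row
    set_intensity(previous, 0.0)           # release the last row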
Audio feedback proved particularly challenging: although Experiment 3 introduced improved air-propagated recordings, participants still found them less convincing than visual or tactile cues. This suggests that realism alone may not suffice; instead, sound design techniques from film or games, which emphasize believability over literal reproduction, may be more effective.
Finally, technical constraints also affected immersion. The limited field of view of the HMD used in Experiments 1 and 2 may have restricted participants’ ability to perceive wide or peripheral animations, potentially altering how visual cues were integrated with other modalities. While animation fidelity may always impose some constraints, the subtle motion differences between Tap and Caress would likely remain challenging to distinguish even with higher-end motion capture or more expressive facial animation, as these gestures are intrinsically less visually distinct. Future studies using wider-FOV headsets or CAVE systems could provide clearer insights into these interactions. For these reasons, we do not interpret the limited photorealism of our agent’s animations as a primary confound for gesture recognition. The Hit gesture, which included a distinct preparatory swing, was recognized reliably even in visual-only conditions, whereas Tap and Caress shared a similar reaching phase and differed mainly in a subtle endpoint motion. This structural similarity, combined with the headset’s FOV constraints and participants’ natural tendency to focus on the VA’s face, is sufficient to explain the observed ambiguity, and would persist even with more recent motion-captured animations. Consistent with prior work emphasizing plausibility over graphical realism in VR (Slater, 2009; Pan and Hamilton, 2018; McDonnell et al., 2008), our results indicate that the critical factor for recognition is the distinctiveness of the underlying kinematics rather than the absolute realism of the animation.
9.2 Methodological limitations
Beyond technology, methodological choices also shaped our findings. In Experiment 3, we deliberately restricted the design space to tap gestures in order to improve experimental control and reduce participant burden. This decision necessarily limited ecological variety, leaving unexplored the distinctive social meanings of other gestures such as hits or caresses. Similarly, while our questionnaires captured dimensions such as agency and believability, they remain exploratory and require further validation as robust measures of social touch.
Another omission concerns presence and co-presence. Although these constructs are closely tied to immersion and social realism in VR, we did not include standardized scales to measure them. As a result, we cannot determine the extent to which participants’ perception of touch was shaped by their sense of “being there” in the virtual environment or of “being with” the agent. Incorporating validated presence and co-presence measures would strengthen future studies by clarifying these relationships.
9.3 Socio-cultural perspectives
Finally, social touch cannot be reduced to technical or methodological parameters alone: it is inherently embedded in cultural and relational contexts. In our study, gestures were treated mainly as controlled sensory stimuli to isolate modality effects. Yet touches such as hits, taps, or caresses carry distinct affective and relational meanings, such as aggression, attention, or intimacy, that participants may have interpreted differently depending on their background or expectations. By omitting these meanings to focus on sensory integration, we gained control but sacrificed some ecological validity.
Cultural variation adds another layer: our participant pool was drawn from a single cultural setting, while norms of touch vary widely across societies. Future work should explore how cultural background, demographics, and relational context (e.g., perceived gender or role of the agent) shape the interpretation and acceptability of virtual social touch.
These considerations naturally raise ethical questions. Because social touch in VR involves intimacy and potential power asymmetries, respecting personal boundaries is crucial. Designing systems that allow users to calibrate their comfort levels (for example, by specifying the type, location, or frequency of touch) will be essential for the acceptability of such interactions.
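One possible shape for such a calibration mechanism is a user-defined comfort profile that any planned touch must satisfy before being executed; the sketch below is purely illustrative, and its fields and defaults are assumptions rather than a design proposed in this work.

from dataclasses import dataclass, field

@dataclass
class TouchComfortProfile:
    """User-specified boundaries that a planned touch must respect."""
    allowed_gestures: set = field(default_factory=lambda: {"tap"})       # e.g., exclude "hit", "caress"
    allowed_locations: set = field(default_factory=lambda: {"forearm"})  # body areas open to touch
    max_touches_per_minute: int = 2                                      # frequency cap

    def permits(self, gesture: str, location: str) -> bool:
        """Return True only if the gesture type and body location are both allowed."""
        return gesture in self.allowed_gestures and location in self.allowed_locations

# profile = TouchComfortProfile()
# profile.permits("tap", "forearm")      -> True
# profile.permits("caress", "shoulder")  -> False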
Overall, our study provides an initial but systematic step toward disentangling how different modalities contribute to social touch in IVEs. By combining tactile, visual, and auditory cues with controlled variations in proximity and animation, we highlight both the potential and the challenges of designing believable, respectful, and meaningful virtual social touch interactions.
Data availability statement
The original contributions presented in the study are publicly available. The complete statistical analyses for this manuscript can be found on the Open Science Framework (OSF) repository: https://osf.io/buf4d/overview?view_only=56899eb8b0714ed6bad18d09938c3ae4.
Ethics statement
The studies involving humans were approved by the Comité d’éthique de l’INSEAD (IRB), approval no. 2024-46. The studies were conducted in accordance with the local legislation and institutional requirements. The participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.
Author contributions
GR: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. FB: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. CP: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review and editing. IT: Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review and editing.
Funding
The authors declare that financial support was received for the research and/or publication of this article. This work was funded by the ANR-22-CE33-0010 (ANR MATCH) project grant. This work was also supported by French government funding managed by the National Research Agency under the Investments for the Future program (PIA) with the grant ANR-21-ESRE-0030 (Equipex+CONTINUUM project), the grant ANR-22-EXEN-0004 (PEPR-Ensemble PC3 project) and the Idex Sorbonne Université as part of the French government’s support for the Investissements d’Avenir program.
Acknowledgements
We thank INSEAD and its members for supporting the organization of the studies, and Yohan Bouvet for his help with recording the motion capture animations.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The authors declare that generative AI was used in the creation of this manuscript. Generative AI was used to reformulate content written by the authors and to tighten phrasing, as English is not the authors’ first language. No content was created through the use of AI.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
References
Boucaud, F., Tafiani, Q., Pelachaud, C., and Thouvenin, I. (2019). “Social touch in human-agent interactions in an immersive virtual environment,” in HUCAPP, 129–136.
Boucaud, F., Pelachaud, C., and Thouvenin, I. (2021). “Decision model for a virtual agent that can touch and be touched,” in 20th international conference on autonomous agents and MultiAgent systems (AAMAS 2021), 232–241.
Boucaud, F., Pelachaud, C., and Thouvenin, I. (2023). “‘It patted my arm’: investigating social touch from a virtual agent,” in Proceedings of HAI ’23 (New York, NY: ACM), 72–80. Available online at: https://dl.acm.org/doi/proceedings/10.1145/3623809.
de Lagarde, A., Pelachaud, C., Kirsch, L. P., and Auvray, M. (2023). Audio-touch: tactile gestures and emotions can be recognised through their auditory counterparts.
Delis, I., Chen, C., Jack, R. E., Garrod, O. G., Panzeri, S., and Schyns, P. G. (2016). Space-by-time manifold representation of dynamic facial expressions for emotion categorization. J. Vis. 16, 14. doi:10.1167/16.8.14
Fitrianie, S., Bruijnes, M., Abdulrahman, A., and Brinkman, W.-P. (2025). The artificial social agent questionnaire (ASAQ)—development and evaluation of a validated instrument for capturing human interaction experiences with artificial social agents. Int. J. Human-Computer Stud. 199, 103482. doi:10.1016/j.ijhcs.2025.103482
Geva, N., Uzefovsky, F., and Levy-Tzedek, S. (2020). Touching the social robot PARO reduces pain perception and salivary oxytocin levels. Sci. Reports 10, 9814. doi:10.1038/s41598-020-66982-y
Hauser, S. C., McIntyre, S., Israr, A., Olausson, H., and Gerling, G. J. (2019). “Uncovering human-to-human physical interactions that underlie emotional and affective touch communication,” in 2019 IEEE world haptics conference (WHC) (IEEE), 407–412.
Héron, R., Safin, S., Baker, M., Zhang, Z., Lecolinet, E., and Détienne, F. (2024). Touching at a distance: the elaboration of communicative functions from the perspective of the interactants. Front. Psychol. 15, 1497289. doi:10.3389/fpsyg.2024.1497289
Hertenstein, M. J., Holmes, R., McCullough, M., and Keltner, D. (2009). The communication of emotion via touch. Emotion 9, 566–573. doi:10.1037/a0016108
Hirano, T., Shiomi, M., Iio, T., Kimoto, M., Tanev, I., Shimohara, K., et al. (2018). How do communication cues change impressions of human–robot touch interaction? Int. J. Soc. Robotics 10, 21–31. doi:10.1007/s12369-017-0425-8
Hoppe, M., Rossmy, B., Neumann, D. P., Streuber, S., Schmidt, A., and Machulla, T.-K. (2020). “A human touch: social touch increases the perceived human-likeness of agents in virtual reality,” in Proceedings of CHI ’20 (New York, NY: ACM), 1–11. Available online at: https://dl.acm.org/doi/proceedings/10.1145/3313831.
Huisman, G., Bruijnes, M., Kolkmeier, J., Jung, M., Frederiks, A. D., and Rybarczyk, Y. (2013a). “Touching virtual agents: embodiment and mind,” in 9th international summer workshop on multimodal interfaces (eNTERFACE) (Springer), 114–138.
Huisman, G., Frederiks, A. D., Van Dijk, B., Heylen, D., and Kröse, B. (2013b). “TaSSt: tactile sleeve for social touch,” in World haptics conference (IEEE), 211–216.
Huisman, G., Frederiks, A. D., Van Erp, J. B., and Heylen, D. K. (2016). “Simulating affective touch: using a vibrotactile array to generate pleasant stroking sensations,” in EuroHaptics ’16 (Springer), 240–250.
Jack, R., McPherson, A., and Stockman, T. (2015). “Designing tactile musical devices with and for deaf users: a case study,” in Proceedings of ICMEM, Sheffield, UK, 23–25.
Jacucci, G., Bellucci, A., Ahmed, I., Harjunen, V., Spape, M., and Ravaja, N. (2024). Haptics in social interaction with agents and avatars in virtual reality: a systematic review. Virtual Real. 28, 170. doi:10.1007/s10055-024-01060-6
Jeunet, C., Albert, L., Argelaguet, F., and Lécuyer, A. (2018). “Do you feel in control?”: towards novel approaches to characterise, manipulate and measure the sense of agency in virtual environments. IEEE Transactions Visualization Computer Graphics 24, 1486–1495. doi:10.1109/TVCG.2018.2794598
Jones, S. E., and Yarbrough, A. E. (1985). A naturalistic study of the meanings of touch. Commun. Monogr. 52, 19–56. doi:10.1080/03637758509376094
Ju, Y., Zheng, D., Hynds, D., Chernyshov, G., Kunze, K., and Minamizawa, K. (2021). “Haptic empathy: conveying emotional meaning through vibrotactile feedback,” in Extended abstracts of CHI 21, 1–7.
Kilteni, K., Groten, R., and Slater, M. (2012). The sense of embodiment in virtual reality. Presence Teleoperators Virtual Environ. 21, 373–387. doi:10.1162/pres_a_00124
Kim, H., and Choi, S. (2024). “Expressing the social intent of touch initiator in virtual reality using multimodal haptics,” in 2024 IEEE international symposium on mixed and augmented reality (ISMAR) (IEEE), 416–425.
Lapp, H. S., and Croy, I. (2021). Insights from the german version of the social touch questionnaire: how attitude towards social touch relates to symptoms of social anxiety. Neuroscience 464, 133–142. doi:10.1016/j.neuroscience.2020.07.012
Lenay, C., and Tixier, M. (2018). “From sensory substitution to perceptual supplementation,” in Living machines: a handbook of research in biomimetic and biohybrid systems, 552–559.
Marini, D., Folgieri, R., Gadia, D., and Rizzi, A. (2012). Virtual reality as a communication process. Virtual Real. 16, 233–241. doi:10.1007/s10055-011-0200-3
McDonnell, R., Jörg, S., McHugh, J., Newell, F., and O’Sullivan, C. (2008). “Evaluating the emotional content of human motions on real and virtual characters,” in Proceedings of the 5th symposium on applied perception in graphics and visualization, 67–74.
McDonnell, R., Breidt, M., and Bülthoff, H. H. (2012). Render me real? Investigating the effect of render style on the perception of animated virtual humans. ACM Trans. Graph. (TOG) 31, 1–11. doi:10.1145/2185520.2335442
McIntyre, S., Hauser, S. C., Kusztor, A., Boehme, R., Moungou, A., Isager, P. M., et al. (2022). The language of social touch is intuitive and quantifiable. Psychol. Sci. 33, 1477–1494. doi:10.1177/09567976211059801
Melo, M., Gonçalves, G., Monteiro, P., Coelho, H., Vasconcelos-Raposo, J., and Bessa, M. (2020). Do multisensory stimuli benefit the virtual reality experience? A systematic review. IEEE Transactions Visualization Computer Graphics 28, 1428–1442. doi:10.1109/TVCG.2020.3010088
Montagu, A. (1984). The skin, touch, and human development. Clin. Dermatology 2, 17–26. doi:10.1016/0738-081x(84)90043-9
Olugbade, T., He, L., Maiolino, P., Heylen, D., and Bianchi-Berthouze, N. (2023). Touch technology in affective human–, robot–, and virtual–human interactions: a survey. Proc. IEEE 111, 1333–1354. doi:10.1109/jproc.2023.3272780
Pan, X., and Hamilton, A. F. d. C. (2018). Why and how to use virtual reality to study human social interaction: the challenges of exploring a new research landscape. Br. J. Psychol. 109, 395–417. doi:10.1111/bjop.12290
Püschel, I., Reichert, J., Friedrich, Y., Bergander, J., Weidner, K., and Croy, I. (2022). Gentle as a mother’s touch: C-tactile touch promotes autonomic regulation in preterm infants. Physiology and Behav. 257, 113991. doi:10.1016/j.physbeh.2022.113991
Ravaja, N., Harjunen, V., Ahmed, I., Jacucci, G., and Spapé, M. M. (2017). Feeling touched: emotional modulation of somatosensory potentials to interpersonal touch. Sci. Reports 7, 40504. doi:10.1038/srep40504
Remache-Vinueza, B., Trujillo-León, A., Zapata, M., Sarmiento-Ortiz, F., and Vidal-Verdú, F. (2021). Audio-tactile rendering: a review on technology and methods to convey musical information through the sense of touch. Sensors 21, 6575. doi:10.3390/s21196575
Richard, G., Pietrzak, T., Argelaguet, F., Lécuyer, A., and Casiez, G. (2021). Studying the role of haptic feedback on virtual embodiment in a drawing task. Front. Virtual Real. 1, 573167. doi:10.3389/frvir.2020.573167
Rosis, F. d., Pelachaud, C., Poggi, I., Carofiglio, V., and Carolis, B. D. (2003). From Greta’s mind to her face: modelling the dynamics of affective states in a conversational embodied agent. Int. J. Human-Computer Stud. 59, 81–118. doi:10.1016/S1071-5819(03)00020-X
Sallnäs, E.-L. (2004). The effect of modality on social presence, presence and performance in collaborative virtual environments. Stockholm: KTH. Ph.D. thesis. Available online at: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A9557&dswid=9319.
Schirmer, A., Croy, I., and Ackerley, R. (2023a). What are C-tactile afferents and how do they relate to “affective touch”? Neurosci. and Biobehav. Rev. 151, 105236. doi:10.1016/j.neubiorev.2023.105236
Schirmer, A., Lai, O., Cham, C., and Lo, C. (2023b). Velocity-tuning of somatosensory eeg predicts the pleasantness of gentle caress. NeuroImage 265, 119811. doi:10.1016/j.neuroimage.2022.119811
Slater, M. (2009). Place illusion and plausibility can lead to realistic behaviour in immersive virtual environments. Philosophical Trans. R. Soc. B Biol. Sci. 364, 3549–3557. doi:10.1098/rstb.2009.0138
Slater, M. (2018). Immersion and the illusion of presence in virtual reality. Br. Journal Psychology 109, 431–433. doi:10.1111/bjop.12305
Song, S., and Yamada, S. (2017). “Expressing emotions through color, sound, and vibration with an appearance-constrained social robot,” in Proceedings of the 2017 ACM/IEEE international conference on human-robot interaction, 2–11.
Sun, W., Banakou, D., Świdrak, J., Valori, I., Slater, M., and Fairhurst, M. T. (2024). Multisensory experiences of affective touch in virtual reality enhance engagement, body ownership, pleasantness, and arousal modulation. Virtual Real. 28, 1–16. doi:10.1007/s10055-024-01056-2
Suvilehto, J. T., Glerean, E., Dunbar, R. I. M., Hari, R., and Nummenmaa, L. (2015). Topography of social touching depends on emotional bonds between humans. Proc. Natl. Acad. Sci. 112, 13811–13816. doi:10.1073/pnas.1519231112
Teyssier, M., Bailly, G., Pelachaud, C., and Lecolinet, E. (2020). Conveying emotions through device-initiated touch. IEEE Trans. Affect. Comput. 13, 1477–1488. doi:10.1109/taffc.2020.3008693
Van Erp, J. B., and Toet, A. (2015). Social touch in human–computer interaction. Front. Digital Humanities 2, 2. doi:10.3389/fdigh.2015.00002
Willemse, C. J., Huisman, G., Jung, M. M., van Erp, J. B., and Heylen, D. K. (2016). “Observing touch from video: the influence of social cues on pleasantness perceptions,” in Haptics: perception, devices, control, and applications: 10th international conference, EuroHaptics 2016, London, UK, July 4-7, 2016, proceedings, part II 10 (Springer), 196–205.
Keywords: virtual reality, haptic feedback, social touch, human-agent interaction, tactile and visuo-tactile, tactile and audio-tactile
Citation: Richard G, Boucaud F, Pelachaud C and Thouvenin I (2026) “Did she just tap me?“: qualifying multisensory feedback for social touch during human-agent interaction in virtual reality. Front. Virtual Real. 6:1718198. doi: 10.3389/frvir.2025.1718198
Received: 03 October 2025; Accepted: 27 November 2025;
Published: 06 January 2026.
Edited by: Alex Wong, Yale University, United States
Reviewed by: Yun Wang, Beihang University, China; Anna Lomanowska, University Health Network (UHN), Canada
Copyright © 2026 Richard, Boucaud, Pelachaud and Thouvenin. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Grégoire Richard, gregoire.richard@hds.utc.fr