The Egocentric Nature of Action-Sound Associations

Navolio, Nicole; Lemaitre, Guillaume; Forget, Alain; Heller, Laurie M.

doi:10.3389/fpsyg.2016.00231

ORIGINAL RESEARCH article

Front. Psychol., 23 February 2016

Sec. Perception Science

Volume 7 - 2016 | https://doi.org/10.3389/fpsyg.2016.00231

1. Auditory Perception Lab, Department of Psychology, Carnegie Mellon University, Pittsburgh PA, USA
2. Department of Human-Computer Interaction, Carnegie Mellon University, Pittsburgh PA, USA
3. CyLab Usable Privacy and Security Research Group, Carnegie Mellon University, Pittsburgh PA, USA

Abstract

Actions that produce sounds infuse our daily lives. Some of these sounds are a natural consequence of physical interactions (such as a clang resulting from dropping a pan), but others are artificially designed (such as a beep resulting from a keypress). Although the relationship between actions and sounds has previously been examined, the frame of reference of these associations is still unknown, despite it being a fundamental property of a psychological representation. For example, when an association is created between a keypress and a tone, it is unclear whether the frame of reference is egocentric (gesture-sound association) or exocentric (key-sound association). This question is especially important for artificially created associations, which occur in technology that pairs sounds with actions, such as gestural interfaces, virtual or augmented reality, and simple buttons that produce tones. The frame of reference could directly influence the learnability, the ease of use, the extent of immersion, and many other factors of the interaction. To explore whether action-sound associations are egocentric or exocentric, an experiment was implemented using a computer keyboard’s number pad wherein moving a finger from one key to another produced a sound, thus creating an action-sound association. Half of the participants received egocentric instructions to move their finger with a particular gesture. The other half of the participants received exocentric instructions to move their finger to a particular number on the keypad. All participants were performing the same actions, and only the framing of the action varied between conditions by altering task instructions. Participants in the egocentric condition learned the gesture-sound association, as revealed by a priming paradigm. However, the exocentric condition showed no priming effects. This finding suggests that action-sound associations are egocentric in nature. 
A second part of the same session further confirmed the egocentric nature of these associations by showing no change in the priming effect after moving to a different starting location. Our findings are consistent with an egocentric representation of action-sound associations, which could have implications for applications that utilize these associations.

Introduction

It has been well established that environmental sounds portray information about our surroundings, such as event properties (Ballas, 1993; Houix et al., 2012) or as symbolic icons for nouns and verbs (Keller and Stevens, 2004; Giordano et al., 2010). Although it is clear that objects and actions can be represented by their accompanying sounds, it seems that the action, rather than the object, is most important in sound event perception. When asked to identify environmental sounds in a free identification task, people generally describe the actions that generated the sounds (Vanderveer, 1979). Additionally, a recent study found that listeners are better at identifying the action that caused a sound than they are at identifying the object properties, such as material (Lemaitre and Heller, 2012). In fact, Lemaitre and Heller (2012) found that listeners were faster at identifying the action of a sound, even for a selection of sounds in which the actions and materials were equally identifiable. Neuroimaging studies also suggest that there are interactions between actions and sound processing in that action sounds activate more motor and pre-motor areas compared to control sounds (e.g., meaningless noise) (Aziz-Zadeh et al., 2004; Lewis et al., 2005; Pizzamiglio et al., 2005).

Because the associations between actions and sounds are important to human perception, behavioral studies have sought to uncover the nature of these associations. Castiello et al. (2010) showed that playing a priming sound before grasping an object sped up the execution of the grasping motion if the priming sound was the same as the sound produced by grasping the object. We recently performed a related series of experiments (Heller et al., 2012; Lemaitre et al., 2015), but with a paradigm that measured reaction time to cues that prompted different gestures. Participants were cued to initiate one of two gestures (e.g., tapping or scraping). Performing the gestures resulted in a response sound that was either naturally created (such as when a tapping gesture creates a tapping sound) or was artificially produced via an interface. Immediately before the gesture-instructing cue, a prime sound was played. The prime could be congruent, incongruent, or neutral with regard to the gesture. For example, a tapping sound being played before a tap cue would be congruent, while a scraping sound being played before a tap cue would be incongruent. Relative reaction times were significantly faster for congruent trials than incongruent trials, indicating that gestures can be primed by associated sounds.

Action-sound relationships have been examined to some extent, but little is known about the spatial frame of reference in which this particular association is made. In general terms, perceptual representation of spatial location can have an egocentric or an exocentric frame of reference (Klatzky, 1998). To describe the location of objects in space, an egocentric reference frame describes an object’s location with respect to the perceiver’s perspective. Conversely, an exocentric reference frame describes an object’s location independently of the perceiver’s perspective or location. For example, referring to a fellow automobile driver as being on your left side uses an egocentric frame of reference. However, referring to the location of the driver relative to the surface of the road would be an exocentric reference frame. Applying this distinction to action-sound associations, the frame of reference could in principle be egocentric by representing the action relative to the observer’s body, or it could be exocentric by representing the action relative to the environment, the external sound itself, or the artifact being used. For our purposes, action-sound associations that are represented egocentrically will be viewed in terms of self-generated gestures and thus integrated into the person’s body schema (Holmes and Spence, 2004), whereas action-sound associations that are represented exocentrically will be viewed in terms of motions applied to an object that produce a sound, represented relative to any external point of reference. Basic research into this distinction will help reveal a fundamental property of the psychological representation of actions and sounds. Additionally, the answer to this question could help guide the design of interfaces that utilize action-sound associations, as illustrated in the following examples.

Much of today’s technology makes use of the relationship between sound and gesture. When people press a button, swipe a screen, or plug in a device, they expect to hear something in response. If the response sound deviates from expectations (by perhaps being an “error”-type sound), users can tell that something has gone wrong. Likewise, if no sound is presented, individuals may question if the action was successfully performed. For example, delays in auditory feedback have been shown to impair the performance of musicians (Finney, 1997) as well as impair natural, complex movements, such as running (Kennel et al., 2015). This important link between sound and gesture has been utilized by the technology industry to create user-friendly products, and it has been studied by researchers in multiple fields. For example, Caramiaux et al. (2014) showed that gestural descriptions of sound sources were more likely to involve actions (such as a crumple gesture) when the sound source was easy to identify (such as the crumpling of a piece of paper); such insights could lead to improved gestures in wearable computing if the gestures are matched with clearly identifiable sounds. Although distinguishing egocentric and exocentric viewpoints is important in usability, it has not yet been shown how they are manifested in action-sound associations.

The distinction between egocentric and exocentric is important for designing and understanding interfaces. Milgram and Kishino (1994) proposed a three-dimensional hierarchy of mixed reality virtual displays, in which one of the dimensions is Extent of Presence Metaphor, or simply how immersive the environment feels. This dimension directly corresponds to whether the virtual display is egocentric or exocentric, with the egocentric displays being more immersive, whereas more traditional interfaces such as the monitor-based “windows on the world” displays are completely exocentric and less immersive. Salzman et al. (1999) found that an egocentric frame of reference is beneficial for learning local, immersive details, but exocentric perspectives are better for more abstract, global concepts. Thus, they argue that a bicentric experience, which allows for alternating between the two, is superior. Likewise, Ferland et al. (2009) performed a study in which participants were asked to navigate a robot through various obstacles using an egocentric or exocentric 3D interface. Although egocentric viewpoints are useful for navigation, the exocentric reference frames are helpful in understanding the overall structure of the environment, and thus, they found that having access to both perspectives was beneficial to the task.

Whether action-sound associations are egocentric or exocentric has many implications for technology. First, if an immersive augmented reality is desired, action-sound associations should only be included if they are egocentric in nature, as exocentricity may make the experience feel less immersive (Milgram and Kishino, 1994). Additionally, if associations are egocentric, teaching action-sound associations should be done egocentrically (such as “use your thumb to play an F note on the clarinet” vs. “press the F key on the back of the clarinet”). As smartphones are now able to rotate their orientation, it is important to consider whether to design an interface egocentrically (relative to how a person is holding the phone) or exocentrically (relative to the phone). For example, swiping in an “up” gesture on a phone’s screen could raise the phone’s sound level. This is a simple association, but it is not immediately clear what should happen when the phone is rotated on its side or upside down. If action-sound associations are egocentric, then the phone should use its rotation sensor to account for the phone’s rotation and increase the sound level when swiped “up” relative to how the user is holding the phone (which might actually be to the left on the phone’s screen). However, if action-sound associations are exocentric, then the interaction should be relative to the phone’s screen. Finally, gestural interfaces should be designed with the frame of reference in mind. Consider designing a musical device that generates pitches based on hand location. The hand location could be specified relative to the distance from the user’s body (egocentric) or relative to the distance from the floor (exocentric). If action-sound associations are egocentric, the first method would result in a more learnable and successful interface.
Because the frame of reference is important for basic scientific understanding as well as for applications that utilize action-sound associations, we examined whether the action-sound relationships for computer keyboard users are egocentric or exocentric.
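The phone-rotation example above can be made concrete with a small sketch. It is an illustration only, not part of the study: the function names, the restriction to quarter-turn rotations, and the convention that rotation is measured clockwise from the user's point of view are all assumptions introduced here.

```python
# Swipe directions in clockwise order, so that rotating the device clockwise
# by 90 degrees shifts each screen direction one step along this list.
DIRECTIONS = ["up", "right", "down", "left"]


def to_user_frame(screen_direction, device_rotation_cw_deg):
    """Map a swipe direction on the screen to a direction relative to the user.

    device_rotation_cw_deg is a hypothetical sensor reading: how far the phone
    is rotated clockwise (as seen by the user), in multiples of 90 degrees.
    """
    steps = (device_rotation_cw_deg // 90) % 4
    return DIRECTIONS[(DIRECTIONS.index(screen_direction) + steps) % 4]


def volume_change(screen_direction, device_rotation_cw_deg, frame):
    """Return +1, -1, or 0 volume steps for a swipe under a given frame."""
    if frame == "egocentric":
        # Interpret the swipe relative to how the user holds the phone.
        direction = to_user_frame(screen_direction, device_rotation_cw_deg)
    else:
        # Exocentric: interpret the swipe relative to the screen itself.
        direction = screen_direction
    return {"up": 1, "down": -1}.get(direction, 0)
```

With the phone turned 90 degrees clockwise, a swipe toward the screen's left edge points up from the user's perspective, so `volume_change("left", 90, "egocentric")` raises the level while the exocentric mapping ignores it.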

To address this question, a simple priming paradigm on a computer’s keypad was performed in which action-sound associations are created by pairing an action (keypress) with a sound (tone). For half of the participants, egocentric associations were introduced, and for the other half of participants, exocentric associations were introduced. All participants were executing the same action, and only the framing of the action varied, by altering the task instructions and directional cue. A priming paradigm was used to determine whether the association was learned in each condition. The egocentricity or exocentricity of action-sound associations was indicated by whether or not participants showed priming in each condition (i.e., if only the egocentric condition shows priming, we can conclude action-sound associations are egocentric in nature, and vice versa).

Part 2 further tests whether action-sound associations are egocentric or exocentric. The participants who showed an action-sound association halfway through the session (after part 1) were asked to switch to a different starting location (during the second half, part 2). If the association is purely egocentric, then changing to a different starting location will not change the results. Moving a finger “right,” for example, will be associated with the same sound, regardless of the finger’s starting location. On the other hand, if the associations are exocentric, moving to a new starting location will lower the effect size, as the new action-sound association would compete with the one that was just learned during part 1.

Part 1

Part 1 of this experiment tests whether action-sound associations are created in egocentric or exocentric conditions. The frame of reference is varied by altering task instructions in half of the participants, and the strength of the associations is measured using a priming paradigm.

Method

Participants

Participants were two groups of Carnegie Mellon University students recruited through an online psychology participant pool. Thirty-two English-speaking participants (17 female, 15 male) between the ages of 18 and 22 (median 19 years old) were in the egocentric experimental condition. Thirty-two participants (23 female, 9 male) between the ages of 18 and 21 (median 19 years old) were in the exocentric condition. The data from one 60-year-old participant were discarded in response to a reviewer’s request for our sample to match the customary age ranges used in RT experiments in the cognitive psychology literature; this removal did not affect the overall results.

All participants were right-handed with self-reported normal hearing and provided written informed consent prior to testing in accordance with procedures approved by the Carnegie Mellon University Institutional Review Board.

Interface and Apparatus

This experiment used an Apple USB keyboard (Model No: A1243), with tasks confined to the number keypad. Figure 1 shows the general layout of the task. Digital sound files were converted to analog signals by an Audiofire 4 audio interface. All audio was presented over Sennheiser HD 600 open circumaural headphones.

FIGURE 1

Stimuli

Prime sounds consisted of a short low-pitched tone (534-Hz sinusoid) and high-pitched tone (1730-Hz sinusoid). Both sounds were enveloped using an Attack-Decay-Sustain-Release technique, with an attack time of 5 ms, decay time of 10 ms, sustain duration of 50 ms, and release time of 5 ms (total duration = 70 ms). The tones remained at a constant amplitude during the sustain portion. Sounds were presented at a 44100-Hz sample rate with 16-bit resolution.
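As a rough illustration of the stimulus description above, the following sketch synthesizes an enveloped sinusoid with the stated segment durations and sample rate. The linear ramp shapes, the sustain level of 0.8, and the function names are assumptions; the text specifies only the durations, the two frequencies, and the 16-bit, 44100-Hz format.

```python
import math

SAMPLE_RATE = 44100  # Hz, as stated in the text


def adsr_tone(freq_hz, attack_ms=5, decay_ms=10, sustain_ms=50, release_ms=5,
              peak=1.0, sustain_level=0.8):
    """Return float samples in [-1, 1] for an ADSR-enveloped sinusoid.

    Linear ramps and sustain_level are assumptions made for this sketch.
    """
    def n_samples(ms):
        return int(round(SAMPLE_RATE * ms / 1000))

    n_a, n_d = n_samples(attack_ms), n_samples(decay_ms)
    n_s, n_r = n_samples(sustain_ms), n_samples(release_ms)

    envelope = []
    # Attack: rise linearly from 0 to the peak amplitude.
    envelope += [peak * i / n_a for i in range(n_a)]
    # Decay: fall linearly from the peak to the sustain level.
    envelope += [peak - (peak - sustain_level) * i / n_d for i in range(n_d)]
    # Sustain: constant amplitude, as described in the text.
    envelope += [sustain_level] * n_s
    # Release: fall linearly from the sustain level toward 0.
    envelope += [sustain_level * (1 - i / n_r) for i in range(n_r)]

    return [e * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
            for i, e in enumerate(envelope)]


def to_int16(samples):
    """Quantize float samples to 16-bit signed integers."""
    return [max(-32768, min(32767, int(round(s * 32767)))) for s in samples]


low_tone = to_int16(adsr_tone(534))    # low-pitched prime/response tone
high_tone = to_int16(adsr_tone(1730))  # high-pitched prime/response tone
```

Because each segment length is rounded to a whole number of samples, the synthesized tone may differ from the nominal 70 ms by a sample.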

Response sounds were identical to the prime sounds, with the low-pitched tone occurring when participants pressed the “2” key (via finger movement to the “right”), and the high-pitched tone occurring when participants pressed the “4” key (via finger movement in the “up” direction).

Directional cues were recorded via an Audio-Technica AT3525 30 Series microphone in an IAC double-walled sound-attenuating booth. They consisted of the vocal recording of an American English-speaking male saying the directions “right,” “up,” “two,” and “four.” The onsets of these directional cues were matched perceptually based on piloting, rather than by examining the waveform, to account for differences in the slopes of the onset ramps (Tuller and Fowler, 1980). The onsets of the prime and response sounds were perceptually and acoustically identical. All sounds were selected to have perceptually equal loudness.

Procedure

The structure of a trial is represented in Figure 2. Each trial started with the participants in the “home” position, which required holding down the “1” key on the number pad with their right index finger. After a short delay (400 ms), the prime sound was presented. This prime could be the high-pitched tone, the low-pitched tone, or a period of silence for the neutral condition. After a delay of 10 ms, the prime was followed by the vocal directional cue indicating which gesture to execute. The directional cue was “up” or “right” for the egocentric condition and “2” or “4” for the exocentric condition. When the participants responded, a response sound was played that always matched the gesture that was performed (but the response sound did not always match the prime sound). When participants moved their finger “up” (i.e., to the “4” key), the high-pitched tone was played, and when they moved their finger “right” (i.e., to the “2” key), the low-pitched tone was played. Participants were instructed to respond as rapidly as possible without sacrificing accuracy. Reaction times were measured from the onset of the directional cue.

FIGURE 2

It is important to note that the prime sounds were, by design, never predictive of which gesture would be required (while the response tone did always match the gesture). Half of the trials required an “up” gesture, while the other half required a “right” gesture. One-third of the trials used a congruent prime, one-third used an incongruent prime, and one-third had a neutral prime (silence that lasted the same duration as the tones). A congruent prime was one that matched the resulting response sound (for example, a high-pitched prime followed by an “up” response cue, as shown in Figure 2). Therefore, there were six types of trials (two response gestures × three prime-types). A total of 324 trials were presented to each participant in 18 blocks of 18 trials each. Each block was guaranteed to have three instances of each of the six trial types presented in different random orders for each block and participant. Following each trial, a recorded vocal message indicated whether the response was correct. Likewise, after each block, vocal recordings were provided to encourage faster reaction times, and visual feedback was displayed on the computer screen revealing the percent of correct answers and average reaction time.
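The trial structure described above can be sketched as follows. This is a minimal illustration rather than the software used in the experiment; the dictionary keys, function names, and seeding scheme are assumptions.

```python
import random

GESTURES = ("up", "right")
PRIMES = ("congruent", "incongruent", "neutral")
# Response tone tied to each gesture, as described in the Stimuli section.
GESTURE_TONE = {"up": "high", "right": "low"}


def prime_sound(gesture, prime_type):
    """Return which tone is played as the prime, or None for silence."""
    if prime_type == "neutral":
        return None
    own_tone = GESTURE_TONE[gesture]
    if prime_type == "congruent":
        return own_tone
    return "low" if own_tone == "high" else "high"


def make_block(rng, repeats=3):
    """One block: `repeats` instances of each of the six trial types, shuffled."""
    trials = [
        {"gesture": g, "prime_type": p, "prime_sound": prime_sound(g, p)}
        for g in GESTURES
        for p in PRIMES
        for _ in range(repeats)
    ]
    rng.shuffle(trials)
    return trials


def make_session(seed, n_blocks=18):
    """A full session: 18 blocks of 18 trials, each in its own random order."""
    rng = random.Random(seed)  # a per-participant seed is an assumption
    return [make_block(rng) for _ in range(n_blocks)]
```

Since every block contains each gesture equally often after each prime sound, the prime carries no information about the upcoming gesture, which matches the design constraint stated in the text.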

Before beginning the main session, each participant watched a short, 12-trial demonstration of the experimenter performing the task. Next, participants familiarized themselves with the procedure in a preliminary training session of 72 trials (four blocks of 18 trials) in the presence of the experimenter. During this training session, the participants interacted with the experimenter to clarify the procedure. The experimenter ensured that the participants were executing the correct gestures and were responding correctly and as quickly as they could. The response sounds were audible during the training session, but the prime sounds did not begin until the main session.

Results

Both accuracy and reaction time (RT) were measured. Raw RT data are available at https://zenodo.org/record/35563. A trial was considered incorrect if an incorrect key was pressed. RTs were measured from the onset of the directional cue and reflected the initiation of the movement away from the home position.

The preprocessing of RTs involved multiple steps. First, incorrect trials were removed. Next, outlier RTs were removed. The outlier cutoff was adjusted so that less than 0.5% of trials were excluded, based on the method described in Masson et al. (2011). The cutoff was established at 900 ms in the egocentric condition (0.491% of trials) and 990 ms in the exocentric condition (0.465% of trials). After preprocessing, each participant’s mean RT for each type of trial was calculated.
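The preprocessing steps above might be sketched as follows. This is not a reproduction of the procedure in Masson et al. (2011): the 10-ms search grid, the starting value, the use of correct trials as the denominator, and the field names `rt_ms` and `correct` are all assumptions made for illustration.

```python
def choose_cutoff(rts_ms, max_excluded=0.005, step_ms=10, start_ms=500):
    """Return the smallest cutoff on a grid that excludes fewer than
    max_excluded (as a proportion) of the supplied RTs."""
    if not rts_ms:
        raise ValueError("need at least one RT")
    cutoff = start_ms
    # Raise the cutoff until the share of RTs above it drops below the limit.
    while sum(rt > cutoff for rt in rts_ms) / len(rts_ms) >= max_excluded:
        cutoff += step_ms
    return cutoff


def preprocess(trials):
    """Remove incorrect trials, then remove outlier RTs above the cutoff.

    trials is a list of dicts with hypothetical keys 'rt_ms' and 'correct'.
    Returns the retained trials and the cutoff that was applied.
    """
    correct = [t for t in trials if t["correct"]]
    cutoff = choose_cutoff([t["rt_ms"] for t in correct])
    kept = [t for t in correct if t["rt_ms"] <= cutoff]
    return kept, cutoff
```

In the study the cutoff was set per condition (900 ms and 990 ms), so a function like `preprocess` would be applied to the pooled trials of each condition separately.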

The neutral condition (silent prime) was used to correct for the inherent speed differences between the two gestures. Silence was chosen because pilot attempts at finding a “neutral” cue sound failed to reveal a sound that was not biased toward one prime or the other. The choice to have a silent neutral condition prevents us from separating facilitation and inhibition effects, since the silence does not alert participants to the timing of the upcoming directional cue, and thus results in faster reaction times for primed trials compared to neutral trials. To account for the inherent differences between the gestures, we adjusted RTs by subtracting out the RTs for the neutral condition for each gesture and for each participant. This resulted in a measure of reaction time that was independent of gesture execution time, but this value was systematically negative (because neutral RTs were larger). Therefore, to appropriately characterize the relative reaction time between the two primed conditions, we added to this value the mean RT for the two primed conditions, averaged across all participants and conditions. The goal of this step was to produce positive numbers with the same average as the unprocessed RTs, which is easier to interpret than negative relative measures. The resulting relative RT is the average RT from trials with a prime for a given gesture and a given prime minus the RT for the baseline for the same gesture plus the average RT for any prime. This transformation allowed our analysis to be consistent with our previous research (Lemaitre et al., 2015). Note that, by definition, relative RT and raw RT produce the same statistics for the congruency variable (which was the main variable of interest). Relative RT does affect the gesture variable by subtracting out the baseline RT for each gesture, thus making the plots of congruency effects generalizable across a variety of gesture types (e.g., key presses, taps, and scrapes).
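The relative RT transformation described above can be sketched as follows. The data layout and names are assumptions, and the numbers in the example are made up purely to show the arithmetic; they are not data from the study.

```python
from statistics import mean


def grand_primed_mean(all_cell_means):
    """Mean RT over the primed (non-neutral) cells of all participants.

    all_cell_means is a list with one dict per participant, mapping
    (gesture, prime_type) to that participant's mean RT in ms.
    """
    return mean(rt
                for cells in all_cell_means
                for (_, prime), rt in cells.items()
                if prime != "neutral")


def relative_rts(cells, grand_mean):
    """Relative RT = primed RT - neutral RT for the same gesture + grand mean."""
    return {
        (gesture, prime): rt - cells[(gesture, "neutral")] + grand_mean
        for (gesture, prime), rt in cells.items()
        if prime != "neutral"
    }


# Hypothetical cell means (ms) for a single participant, for illustration only.
example = {
    ("up", "congruent"): 400, ("up", "incongruent"): 420, ("up", "neutral"): 450,
    ("right", "congruent"): 380, ("right", "incongruent"): 395, ("right", "neutral"): 430,
}
rel = relative_rts(example, grand_primed_mean([example]))
```

Within each gesture, the same constant is subtracted from and added to both primed cells, so the incongruent-minus-congruent difference is unchanged (20 ms for “up” in this made-up example). This is why the congruency statistics are the same for raw and relative RTs.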

The relative RTs for the two gestures (“up” or “right”) and the two prime congruencies (congruent or incongruent) can be seen in Figure 3. Relative RTs were submitted to a repeated-measures ANOVA with congruency and gesture as within-participant factors, the reference frame (egocentric or exocentric) as a between-participant factor, and the relative RTs as the dependent variable. There was a significant main effect of congruency [F(1,62) = 16.088, p < 0.01, η² = 0.0655]. This shows that relative RTs were significantly longer for incongruent primes than for congruent primes (i.e., priming was observed). There was a significant main effect of reference frame [F(1,62) = 12.334, p < 0.01]. Analysis also revealed that there was a significant interaction between congruency and reference frame [F(1,62) = 20.758, p < 0.01, η² = 0.0845]. Figure 3 illustrates that the effect of congruency is larger for the egocentric condition than the exocentric condition. There was also a significant main effect of gesture [F(1,62) = 4.828, p < 0.05, η² = 0.0337]. There were no significant interactions between gesture and reference frame [F(1,62) = 0.768, p = 0.384], between gesture and congruency [F(1,62) = 0.219, p = 0.641], nor between gesture, congruency, and reference frame [F(1,62) = 2.906, p = 0.093].

FIGURE 3

Because of the significant interaction between congruency and reference frame, it is important to look at the two reference frames separately. A repeated-measures ANOVA was performed for just the egocentric condition, with gesture and congruency as within-participant factors and relative RT as the dependent variable. There was a main effect of congruency [F(1,31) = 43.772, p < 0.01, η² = 0.3508]. There was no significant main effect of gesture [F(1,31) = 1.537, p = 0.224] and no significant interaction between congruency and gesture [F(1,31) = 1.135, p = 0.295].
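For readers who want to see how a congruency main effect of this kind can be computed, the sketch below uses a standard identity rather than the authors' analysis code: in a fully within-participant design with two-level factors, the F statistic for a main effect equals the squared t statistic of a one-sample test on each participant's contrast score. Here the contrast is the incongruent-minus-congruent RT difference averaged over the two gestures. The function name and the input values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def congruency_f(effects_ms):
    """F statistic and degrees of freedom for the congruency main effect.

    effects_ms holds one value per participant: the incongruent-minus-congruent
    RT difference, averaged over the two gestures.
    """
    n = len(effects_ms)
    if n < 2:
        raise ValueError("need at least two participants")
    # One-sample t-test of the mean difference against zero.
    t = mean(effects_ms) / (stdev(effects_ms) / sqrt(n))
    return t * t, (1, n - 1)


# Made-up difference scores for three hypothetical participants.
f_value, df = congruency_f([10.0, 20.0, 30.0])
```

With 32 participants per condition, the same calculation has (1, 31) degrees of freedom, matching the tests reported for each reference frame.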

Likewise, a similar ANOVA was completed for just the exocentric condition. Here there was not a significant main effect of congruency [F(1,31) = 0.128, p = 0.723]. There was also not a significant main effect of gesture [F(1,31) = 3.297, p = 0.079] nor was there an interaction between congruency and gesture [F(1,31) = 1.780, p = 0.192].

Overall, accuracy was high, with an average of 98.1% (SD = 2.1%) and a minimum of 95% across all conditions. The uniformly high accuracies suggest that a substantial speed–accuracy tradeoff is unlikely.

Discussion

Because priming was observed in the egocentric condition, we can conclude that an egocentric association existed between the sounds (high-pitched and low-pitched tone) and the gestures (“right” and “up”). However, priming was not observed in the exocentric condition, providing no evidence for exocentric key-sound associations. This suggests that these action-sound associations are egocentric in nature.

Part 2

To further test whether action-sound associations are egocentric or exocentric, part 2 explores how altering the starting location affects the strength of the associations. If the associations are egocentric, then changing to the new starting key will not affect the results. Moving a finger “right,” for example, will be associated with the same sound, regardless of the finger’s starting location. Conversely, if the associations are exocentric, moving to a new starting location will lower the effect size, as the new action-sound association would compete with the one that was just learned during part 1.

To see whether the change in starting location lowered the effect size, it was necessary to include only participants who showed an individual priming effect in part 1. Each participant’s data were analyzed to determine whether there was an individual priming effect. Because the egocentric (but not the exocentric) condition in part 1 showed an overall priming effect, it was expected that most participants who did show an individual priming effect would be within the egocentric condition.