Crossmodal Constraints on Human Perceptual Awareness: Auditory Semantic Modulation of Binocular Rivalry

Chen, Yi-Chuan; Yeh, Su-Ling; Spence, Charles

doi:10.3389/fpsyg.2011.00212

ORIGINAL RESEARCH article

Front. Psychol., 12 September 2011

Sec. Perception Science

volume 2 - 2011 | https://doi.org/10.3389/fpsyg.2011.00212

This article is part of the Research Topic: Updates on multisensory perception: from neurons to cognition.

Yi-Chuan Chen¹*

Su-Ling Yeh²

Charles Spence¹

¹ Department of Experimental Psychology, University of Oxford, Oxford, UK
² Department of Psychology, National Taiwan University, Taipei, Taiwan

We report a series of experiments utilizing the binocular rivalry paradigm designed to investigate whether auditory semantic context modulates visual awareness. Binocular rivalry refers to the phenomenon whereby, when a different figure is presented to each eye, observers perceive each figure as being dominant in alternation over time. The results demonstrate that participants report a particular percept as being dominant for less of the time when listening to an auditory soundtrack that happens to be semantically congruent with the alternative (i.e., the competing) percept, as compared to when listening to an auditory soundtrack that was irrelevant to both visual figures (Experiment 1A). When a visually presented word was provided as a semantic cue, no such semantic modulatory effect was observed (Experiment 1B). We also demonstrate that the crossmodal semantic modulation of binocular rivalry was robustly observed irrespective of participants’ attentional control over the dichoptic figures and the relative luminance contrast between the figures (Experiments 2A and 2B). The pattern of crossmodal semantic effects reported here cannot simply be attributed to the meaning of the soundtrack guiding participants’ attention or biasing their behavioral responses. Hence, these results support the claim that crossmodal perceptual information can serve as a constraint on human visual awareness in terms of its semantic congruency.

Introduction

When viewing a scene, visual background context provides useful semantic information that can improve the identification of a visual object embedded within it, such as when the presentation of a kitchen scene facilitates a participant’s ability to identify a loaf of bread, say (e.g., Biederman, 1972; Palmer, 1975; Davenport and Potter, 2004; though see Hollingworth and Henderson, 1998). Importantly, however, our environments typically convey contextual information via several different sensory modalities rather than just one. So, for example, when we are at the seaside, we perceive not only the blue sea and sky (hopefully), but also the sound of the waves crashing onto the beach, not to mention the smell of the salty sea air. Do such non-visual contextual cues also influence the visual perception of semantically related objects? In the present study, we investigated whether the semantic context provided by stimuli presented in another sensory modality (in this case, audition) modulates the perceptual outcome in vision; namely, visual awareness.

The phenomenon of binocular rivalry provides a fascinating window into human visual awareness (e.g., Crick, 1996). Binocular rivalry occurs when two dissimilar figures are presented to corresponding regions of the two eyes. Observers typically perceive one of the figures as dominant (while often being unaware of the presence of the other figure); after a while, the dominance of the figures may reverse and then keep alternating over time. This perceptual alternation has been attributed to the fact that the visual system receives ambiguous information from the two eyes and tries to find a unique perceptual solution, and therefore the information presented to each eye competes for control of the current conscious percept (see Alais and Blake, 2005, for a review). The fact that a constantly presented dichoptic figure induces alternating perceptual experiences in the binocular rivalry situation demonstrates the dynamic way in which the brain computes sensory information, a process that gives rise to a specific percept (e.g., Leopold and Logothetis, 1996).

Several researchers have tried to understand how visual awareness emerges in the binocular rivalry situation. According to an early view put forward by Helmholtz (1962), the alternation of perceptual dominance is under voluntary attentional control. Subsequently, researchers suggested that the phenomenon occurs as a result of competition between either two monocular channels (Levelt, 1965; Tong and Engel, 2001) or else between two pattern representations, one presented to each eye (Leopold and Logothetis, 1996; Logothetis et al., 1996; Tong et al., 1998). More recent models (e.g., Tong et al., 2006) have suggested that the mechanisms underlying binocular rivalry include not only competition at multiple levels of information processing (for reviews, see Tong, 2001; Blake and Logothetis, 2002), but also some form of excitatory connections that facilitate the perceptual grouping of visual stimuli (Kovacs et al., 1996; Alais and Blake, 1999), as well as top-down feedback, including attentional control and mental imagery (Meng and Tong, 2004; Mitchell et al., 2004; Chong et al., 2005; van Ee et al., 2005; Pearson et al., 2008). That said, the underlying mechanisms giving rise to conscious perception in the binocular rivalry situation, while starting from interocular suppression, extend to a variety of different neural structures throughout the visual processing hierarchy.

Given that the phenomenon of binocular rivalry is, by definition, visual, one might have expected that the perceptual outcome for ambiguous visual inputs should thus be generated entirely within the visual system (cf. Hupé et al., 2008). On the other hand, however, some researchers have started to investigate whether visual awareness can be modulated by the information presented in another sensory modality. So, for example, it has recently been demonstrated that concurrently presented auditory cues can help to maintain the awareness of visual stimuli (Sheth and Shimojo, 2004; Chen and Yeh, 2008). Similar evidence has emerged from a binocular rivalry study demonstrating that the dominance duration of a looming (or rotating) visual pattern can be extended temporally when the rate of change of the visual stimulus happens to be synchronous with a series of pure tones or vibrotactile stimuli (or their combination, see van Ee et al., 2009). In addition, the directional information provided by the auditory modality can enhance the dominance duration of the moving random-dot kinematogram which happens to be moving in the same direction (Conrad et al., 2010).

Considering the seaside example outlined earlier, the meaning of a background sound (or soundtrack) plausibly provides a contextual effect on human information processing, which may, as a result, modulate the perceptual outcome that a person is aware of visually. Semantic congruency, which relies on the associations picked-up in daily life, provides an abstract constraint other than physical consistency between visual and auditory stimuli (such as coincidence in time or direction of motion mentioned earlier). This high-level factor has started to capture the attention of researchers interested in multisensory information processing (e.g., Greene et al., 2001; Molholm et al., 2004; van Atteveldt et al., 2004; Taylor et al., 2006; Iordanescu et al., 2008; Noppeney et al., 2008; Schneider et al., 2008; Chen and Spence, 2010; for a recent review, see Spence, 2011). On the other hand, modulations resulting from the presentation of semantically meaningful information have recently been documented by researchers studying unimodal binocular vision (Jiang et al., 2007; Costello et al., 2009; Ozkan and Braunstein, 2009). In the present study, we therefore investigated whether the semantic context provided by an auditory soundtrack would modulate human visual perception in the binocular rivalry situation.

Our first experiment was designed to test the crossmodal semantic modulatory effect on the dominant percept under conditions of binocular rivalry, while attempting to minimize or control any possible response biases elicited by the meaning of the sound. After first establishing this crossmodal effect, we then go on to explore the ways in which auditory semantic context modulates visual awareness in the binocular rivalry situation. Two visual factors, one high-level (selective attention) and one low-level (stimulus contrast), both of which have been shown to modulate visual perception in the binocular rivalry situation (Meng and Tong, 2004), are used to probe behaviorally the mechanisms by which auditory semantic context modulates visual awareness, in terms of current models of binocular rivalry (Tong et al., 2006).

Experiment 1

In our first experiment, we investigated whether the semantic context of a background soundtrack would modulate the dominance of two competing percepts under the condition of binocular rivalry. The participants viewed a dichoptic figure consisting of a bird and a car (see Figure 1) while listening to a soundtrack. When studying audiovisual semantic congruency effects, the possibility that participants’ responses are based on their utilizing a strategy designed to satisfy a particular laboratory task has to be avoided (see de Gelder and Bertelson, 2003). That is, there is a danger that the participants might merely report the stimulus that happened to be semantically congruent with the soundtrack rather than the percept that happened to be more salient (or dominant). In order to reduce the likelihood that the above-mentioned response bias would affect participants’ performance, a novel experimental design was used in Experiment 1A: the participants only had to press keys to indicate the start and the end time of the perceptual dominance of the pre-designated figure (e.g., “bird”) during the test period, while they listened to either the soundtrack that was incongruent with the visual target (i.e., a car soundtrack, in this case) or the sound that was irrelevant to both figures (i.e., a soundtrack recorded in a restaurant). A parallel task in which the pre-designated target figure was the car was also conducted. In that task, the participants listened to either the bird soundtrack in the incongruent condition or to the restaurant soundtrack in the irrelevant condition. Thus, the soundtrack was never congruent with the visual target that participants had to report. This aspect of the experimental design was introduced in order to reduce the likelihood that participants would simply report their perceptual dominance in accordance with whichever soundtrack they happened to hear. On the other hand, if the auditory semantic context can either prolong the dominance of the visual percept that happens to be semantically congruent with the soundtrack, or else shorten the dominance duration of the percept that happens to be semantically incongruent with the soundtrack, in the binocular rivalry situation, the dominance duration of the visual target should be shorter in the incongruent than in the irrelevant condition.

FIGURE 1

Figure 1. The trial sequence in Experiment 1A. An example of the dichoptic stimulus pairs used in the present study is shown in the third frame. The soundtrack was presented from the start of the blank frame until the end of the visual stimuli (i.e., for a total of 65 s).

Experiment 1A

Participants

Twelve volunteers (including the first author, three males, with a mean age of 26 years) took part in this experiment in exchange for a £10 (UK Sterling) gift voucher or course credit. The other 11 participants were naïve as to the specific purpose of the study. They all had normal or corrected-to-normal vision and normal hearing by self-report. The participants were tested using depth-defined figures embedded in red–green random-dot stereograms to ensure that they had normal binocular vision. The study was approved by the ethics committee and human participant recruitment system of the Department of Experimental Psychology, University of Oxford. All of the participants were informed of their rights in accordance with the ethical standards laid down in the 1990 Declaration of Helsinki and signed a consent form.

Apparatus and stimuli

The visual stimuli were presented on a 15-inch color CRT monitor (75 Hz refresh rate). The participants sat at a viewing distance of 58 cm from the monitor in a dimly lit experimental chamber. The visual test stimuli consisted of the outline drawings of a bird (4.44° × 2.76°) and a car (4.41° × 2.27°) taken from Bates et al. (2003). The two figures were spatially superimposed, with the bird presented in red [CIE (0.621, 0.341)] and the car in cyan [CIE (0.220, 0.347)], or vice versa, against a white background [CIE (0.293, 0.332)]. These two color versions of the visual pictures (bird in red and car in cyan, or bird in cyan and car in red) were used to balance the influence of participants’ dominant eye when viewing dichoptic figures. The participants wore glasses with a red filter on the left eye and a cyan filter on the right eye during the course of the experiment.
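As an illustration of how the reported visual angles relate to physical size on the screen, the following minimal sketch converts an angle in degrees into an on-screen extent at the 58 cm viewing distance given above. The function name and the use of the standard relation size = 2 × distance × tan(angle / 2) are our own illustrative choices, not a description of the software used in the experiment.

```python
import math

# Viewing distance reported in the text (cm).
VIEWING_DISTANCE_CM = 58.0


def angle_to_size_cm(angle_deg, distance_cm=VIEWING_DISTANCE_CM):
    """Return the on-screen extent (cm) that subtends angle_deg at distance_cm.

    Hypothetical helper: uses the standard geometric relation
    size = 2 * distance * tan(angle / 2).
    """
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


if __name__ == "__main__":
    # Reported angular sizes of the two outline drawings (width, height).
    for name, (w_deg, h_deg) in {"bird": (4.44, 2.76), "car": (4.41, 2.27)}.items():
        print(name, round(angle_to_size_cm(w_deg), 2), "x",
              round(angle_to_size_cm(h_deg), 2), "cm")
```

Under these assumptions, the width of the bird figure (4.44°) corresponds to roughly 4.5 cm on the screen.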

Three sound files, bird (consisting of birds singing in a forest), car (consisting of car horn and engine-revving sounds in a busy street), and restaurant (consisting of the sound of tableware clattering together in a restaurant), which had been recorded in realistic environments (downloaded from www.soundsnap.com on 06/11/2008) were used as the auditory soundtracks. The sound files were edited so that the auditory stimulus started from the beginning of the sound file and lasted for 65 s. The sounds were presented over closed-ear headphones and ranged in loudness from 55 to 68 dB SPL.

Design and procedure

Two factors, semantic congruency (incongruent or irrelevant) and visual target (bird or car), were manipulated. Each participant reported the dominance of either the bird or the car percept in separate sessions in a counterbalanced order. Under those conditions in which the visual target was the bird, the participants were instructed to press the “1” key as soon as the image of the bird became dominant. The participants were informed that the criterion for responding that the bird was dominant was that they were able to see every detail, such as the texture of the wings, of the figure of the bird. As soon as any part of the bird figure became vague or else started to be occupied by the features of the car figure, they had to press the “0” key as soon as possible, to indicate that the image of the bird was no longer completely dominant. This criterion enabled us to estimate the dominance duration of the bird percept more conservatively, since it excluded those periods of time when the car percept was dominant as well as those when participants experienced a mixed percept. Similarly, under those conditions in which the visual target was the car, the participants had to press “1” and “0” to indicate when they started and stopped perceiving the car percept as being dominant.

The participants initiated each trial by pressing the “SPACE” bar. A blank screen was presented for 5 s, followed by the presentation of the dichoptic figures for a further 60 s. The participants were instructed to fixate the area of the bird’s wing and car door and to start reporting the dominance of the target figure as soon as the dichoptic figures were presented. They had to monitor the dominance of the target picture continuously during the test period. The participants were also instructed to pay attention to the context of the sound as well (in order to ensure that the soundtracks were processed; see van Ee et al., 2009, Experiment 4). At the end of the trial, the question “What sound did you just hear?” was presented on the monitor, and the participants had to enter their answer (free report) using the keyboard. The sound was presented from the onset of the blank frame until the offset of the visual stimuli, in order to allow participants sufficient time to realize what the semantic context conveyed by the soundtrack was.

In both visual tasks (i.e., when the visual target was a bird and when it was a car), a block of 12 trials was presented (consisting of two sound conditions × two color versions of the visual pictures, with each condition repeated three times). The order of presentation of these 12 trials was randomized. Prior to the experimental block of trials, a practice block containing six no-sound trials was presented in order to familiarize the participants with the task. The participants were instructed to establish their criterion for reporting the exclusive dominance of the target picture, and to try and hold this criterion constant throughout the experiment. The experiment lasted for approximately 1 h.
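The structure of one experimental block can be sketched as follows. This is a minimal illustration only, assuming a simple full shuffle of the 12 trials; the condition labels, the function name, and the optional seed are hypothetical and are not taken from the original experiment software.

```python
import itertools
import random

# Factor levels described in the text; the string labels are illustrative.
SOUND_CONDITIONS = ("incongruent", "irrelevant")
COLOR_VERSIONS = ("bird_red_car_cyan", "bird_cyan_car_red")
REPETITIONS = 3  # each sound x color combination is tested three times


def make_block(seed=None):
    """Return a randomized list of 12 trials (2 sounds x 2 color versions x 3)."""
    rng = random.Random(seed)  # seed is an assumption, added for reproducibility
    trials = [
        {"sound": sound, "colors": colors}
        for sound, colors in itertools.product(SOUND_CONDITIONS, COLOR_VERSIONS)
        for _ in range(REPETITIONS)
    ]
    rng.shuffle(trials)  # randomize the order of presentation
    return trials


if __name__ == "__main__":
    for i, trial in enumerate(make_block(seed=1), start=1):
        print(i, trial["sound"], trial["colors"])
```

One such block would be generated for each visual task (bird target and car target).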

Results

The proportion of time for which the target percept was dominant was calculated by dividing the sum of the dominance durations of the target percept by 60 s. Note that the participants may have occasionally pressed the “1” or “0” key twice. In such cases, the shorter duration (i.e., the duration from the second “1” keypress to the first “0” keypress) was used.
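The scoring rule just described can be expressed as a short sketch. The event format (a chronological list of time and key pairs, timed from the onset of the dichoptic figures) and the function name are hypothetical. The treatment of a dominance period that is still open at the end of the trial is not specified in the text; counting it up to the end of the 60-s period is an assumption made here purely for illustration.

```python
from typing import List, Optional, Tuple

TEST_PERIOD_S = 60.0  # duration of the dichoptic viewing period (from the text)


def dominance_proportion(events: List[Tuple[float, str]],
                         test_period: float = TEST_PERIOD_S) -> float:
    """Proportion of the test period for which the target was reported dominant.

    events: chronological (time_in_s, key) pairs, where key is "1"
    (dominance starts) or "0" (dominance ends).
    """
    total = 0.0
    start: Optional[float] = None  # latest "1" not yet closed by a "0"
    for t, key in events:
        if key == "1":
            # A repeated "1" replaces the earlier one, so the second press is used.
            start = t
        elif key == "0" and start is not None:
            # Only the first "0" after a "1" closes the period; repeats are ignored.
            total += t - start
            start = None
    if start is not None:
        # ASSUMPTION (not in the text): an unclosed period runs to the trial end.
        total += test_period - start
    return total / test_period


if __name__ == "__main__":
    presses = [(2.0, "1"), (3.0, "1"), (8.0, "0"), (9.0, "0"),
               (20.0, "1"), (30.0, "0")]
    print(dominance_proportion(presses))  # (8 - 3) + (30 - 20) = 15 s of 60 s
```

In this example the doubled keypresses yield the shorter interval (3 s to 8 s), in line with the rule stated above.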

A two-way analysis of variance (ANOVA) was conducted with the factors of semantic congruency (incongruent or irrelevant) and visual target (bird or car; see Figure 2A)¹. The results revealed significant main effects of both semantic congruency [F(1,11) = 25.68, MSE = 0.0005, p < 0.0005, ηp² = 0.71] and visual target [F(1,11) = 11.50, MSE = 0.01, p < 0.01, ηp² = 0.51]. There was, however, no interaction between these two factors [F(1,11) = 0.60, MSE = 0.001, p = 0.46, ηp² = 0.07]. The planned simple main effect of the semantic congruency factor revealed that the proportion of dominance of the target picture was lower when listening to the incongruent soundtrack than when listening to the irrelevant soundtrack both when the visual target was the bird [F(1,22) = 6.34, MSE = 0.001, p < 0.05, ηp² = 0.15], as well as when it was the car [F(1,22) = 13.96, p < 0.005, ηp² = 0.33]. In addition, the magnitude of the auditory modulatory effect (incongruent vs. irrelevant) was not significantly different in the bird and car target conditions [F(1,11) = 0.60, MSE = 0.002, p = 0.45, ηp² = 0.06].

FIGURE 2

Figure 2. Results of the proportion of time for which the target percept was dominant (either bird or car, proportion of the 60-s viewing period) in Experiments 1A and 1B [(A,B), respectively]. Error bars represent ±1 SE of the mean.

Experiment 1B

Two further possibilities regarding the crossmodal semantic modulation reported in Experiment 1A need to be considered. First, the presented soundtrack may have accessed its associated abstract semantic representation and then modulated the dominant percept in the binocular rivalry situation. In this case, the semantic modulation constitutes a form of top-down semantic modulation rather than a form of audiovisual interaction. Second, even though the design of Experiment 1A effectively avoids the bias whereby participants strategically report the percept that is congruent with the meaning of the soundtrack as being dominant, a second type of bias should also be considered. That is, it could be argued that the presentation of the incongruent soundtrack may have provided a cue that discouraged the participants from reporting the target percept as being dominant, as compared to the presentation of the irrelevant soundtrack.

Experiment 1B was designed to control for the possibility that the crossmodal semantic modulation effects observed thus far might simply have resulted from the participants holding an abstract concept in mind, as well as the response bias elicited by the presentation of a cue that was incongruent with the identity of the visual target. Rather than presenting a soundtrack, the name of one of the soundtracks was presented on the monitor for 5 s prior to the presentation of the dichoptic figures (during this period, a blank frame had been presented in Experiment 1A). That is, the participants were provided with a word (the associated name of the soundtracks used in Experiment 1A) that was either incongruent with or irrelevant to the visual target, while they were tested in silence during the subsequent test period. The participants were instructed to retain the word in memory and to report it at the end of each trial, in order to ensure that they had maintained this semantic cue during the course of the test period. The word therefore provided an abstract semantic cue to the participants. In addition, the presentation and retention of this semantic cue in memory by participants would be expected to elicit a similar response bias in the incongruent (as compared to the irrelevant) condition. Our prediction was that if an abstract semantic cue or the response bias elicited by the incongruent cue (rather than the audiovisual semantic interaction) was sufficient to induce the semantic effect in the binocular rivalry situation, the significant difference between incongruent and irrelevant conditions should still be observed.

Two factors, semantic congruency (incongruent or irrelevant) and visual target (bird or car), were manipulated in this experiment. When the visual target was the bird, the words “car” and “restaurant” were presented in the incongruent and irrelevant conditions, respectively. Similarly, when the visual target was the car, the words “bird” and “restaurant” were presented in the incongruent and irrelevant conditions, respectively. The other experimental details were exactly the same as in Experiment 1A.