Visual cognition during real social interaction

Skarratt, Paul A

doi:10.3389/fnhum.2012.00196

REVIEW article

Front. Hum. Neurosci., 29 June 2012

Sec. Cognitive Neuroscience

Volume 6 - 2012 | https://doi.org/10.3389/fnhum.2012.00196

This article is part of the Research Topic: Towards a neuroscience of social interaction

Paul A. Skarratt¹*

Geoff G. Cole²

Gustav Kuhn³

¹Department of Psychology, University of Hull, Hull, UK
²Centre for Brain Science, University of Essex, Essex, UK
³Department of Psychology, Brunel University, Uxbridge, UK

Laboratory studies of social visual cognition often simulate the critical aspects of joint attention by having participants interact with a computer-generated avatar. Recently, there has been a movement toward examining these processes during authentic social interaction. In this review, we focus on three areas of research, namely attention to faces, attentional misdirection, and a phenomenon we have termed social inhibition of return (Social IOR), that have revealed aspects of social cognition that were hitherto unknown. We attribute these discoveries to the use of paradigms that allow for more realistic social interactions to take place. We also point to an area that has begun to attract a considerable amount of interest, that of Theory of Mind (ToM) and automatic perspective taking, and suggest that this too might benefit from adopting a similar approach.

Social Attention: The need for Real Social Interaction

Ever since its inception during the late 1960s (e.g., Neisser, 1967), research into human visual attention has largely examined the behavior of individuals as they perform tasks alone. In the standard experiment a single observer is seated in front of a visual display and performs a required task. Clearly, this laboratory-based paradigm has been instrumental in uncovering many of the fundamental properties of visual cognition (e.g., Eriksen and Eriksen, 1974; Posner, 1980; Duncan, 1984; Tipper, 1985; Raymond et al., 1992; Watson and Humphreys, 1997; Simons and Rensink, 2005). However, humans are social animals and the majority of people spend some part of each day interacting with others. As this issue of Frontiers demonstrates, a growing number of visual cognition studies are beginning to reflect this and examine how attention is deployed when a person interacts with another individual or individuals. In this article we first present our own assessment of why the new field of social neuroscience can be considered as more than an attempt to improve the ecological validity of our experiments. We then go on to show how the social neuroscience method not only informs us about the mental processes involved in social interaction but has also revealed the existence of visual mechanisms that were previously unknown. Indeed, we cite particular cases where the method has revealed effects previously thought not to occur. Examples are drawn from our own research examining attention to faces, attentional misdirection, and a phenomenon we have previously labeled social inhibition of return (SIOR). Finally, we point to one recent debate that could likely benefit from this new approach: the question of whether Theory of Mind (ToM) and perspective-taking are automatic processes.

The study of social attention is often considered to have begun in the late 1990s with the first report of the gaze cueing effect (Friesen and Kingstone, 1998) in which a person's attention is oriented on the basis of another person's direction of gaze. However, it is more accurate to say that developmental psychologists have been studying these types of phenomena for decades and, as with the developing field of social neuroscience, their methods involved measuring behavior during interaction between real people. An early example (Scaife and Bruner, 1975) involved young infants sitting with a caregiver and looking directly ahead toward an experimenter. The experimenter would then turn their head 90° to the left or right and fixate an object. The infant's propensity for gaze following would then be recorded. This line of work was subsequently placed within the context of ToM in which infants were assessed for their ability to understand others as intentional agents (Tomasello et al., 1993).

The lead that developmental psychologists have taken in studying infant cognition in the real world is beginning to find favor amongst those advocating a cognitive ethological approach to adult cognition (Kingstone et al., 2008). Ethology, the study of animal behavior, was developed by a number of naturalists in Europe during the 1930s. Its central position was that animal “routines” and “patterns” should be examined in as natural an environment as possible. Ethology's basic philosophy explicitly contrasted with that of the American-led Behaviorists during the same period. Their models of behavior were derived from laboratory studies of animals, usually rats and pigeons. Although many influential models of behavior were developed from the behaviorist approach, the field was often criticized for its lack of ecological validity. In the same way, cognitive ethologists emphasize the importance of ecological factors in human cognition and consider social interaction as being central. Although social neuroscience does not advocate a naturalistic setting per se, the field does employ paradigms that take into consideration the social situations in which cognition occurs.

A number of attention researchers have therefore recently examined adult visual attention in scenarios where participants perform tasks in conjunction with other individuals. For instance, Brennan et al. (2008) made the point that many everyday situations involve joint visual search, such as when an adult and child look through a picture book together. With this in mind, Brennan et al. examined whether joint visual search might be more efficient than that of a solitary observer. Pairs of participants were asked to search arrays for a target letter appearing amongst distractors, with cursors allowing each participant to see the location of their partner's gaze at any point during the search. Results showed that searches were almost twice as efficient when made jointly as when made alone. Furthermore, Brennan et al. showed that observers coordinated their search with their partner's without explicit training. The authors concluded that joint gaze can be used spontaneously to minimize collective effort and optimize search success.

The acknowledgment that psychological models can benefit from ecological consideration can be seen in other areas of vision research, as in the case of vision and action. The study of visual perception has been dominated by the view that the function of vision is to generate a representation of the external environment. That is, to provide a percept. However, a number of authors have pointed out that vision is most often accompanied by action (e.g., Milner and Goodale, 1995; Prinz, 1997; Jeannerod, 1999; Hommel, 2009) and paradigms have therefore been developed with this in mind. For instance, an abundance of work has shown that the appearance of a new object is particularly effective in attracting attention (e.g., Cole et al., 2004; Cole and Liversedge, 2006; Cole et al., 2007; Davoli et al., 2007; Cole and Kuhn, 2009, 2010; Yantis and Jonides, 1984). However, Welsh and Pratt (2006) demonstrated that the propensity with which new onsets capture attention is influenced by the type of action an individual makes when responding to them. More specifically, the authors showed that task-irrelevant offsets interfere with new object capture when a standard keyboard press is required but do not when a reaching response is made. Thus the act of reaching toward an object enables attention to be focused more effectively. As is the case with social neuroscience, proponents of the vision-for-action perspective argue that a phenomenon in question may be better elucidated when consideration is given to its functional significance in the real world.

Social neuroscience is clearly grounded in the notion that humans are social animals and this ought to be reflected in our experimental paradigms. In the field of visual attention the social aspect of this process has typically been implemented by presenting participants with social stimuli in the form of static images or video clips of people, and then measuring their effects on visual attention. The use of these often well controlled, yet rather reductionist depictions of real world stimuli offers a valuable tool to investigate social attention in the laboratory. This is particularly the case in the field of neuroscience, where experimental protocols are often limited by the logistical constraints of the apparatus (e.g., MRI, EEG; but for recent developments see Schippers et al., 2010; Guionnet et al., 2012). Progress in overcoming these constraints has allowed face-to-face interaction between people by channeling a live video feed inside a scanner. In one such experiment, Redcay et al. (2010) recorded functional MRI data as participants interacted with an experimenter via a video screen in one of three cooperative scenarios: a “live” interaction, a recording of one of their earlier interactions, or a recording of the experimenter's interaction with a different participant. Hence, all three conditions contained identical visual information but differed according to the mental state imputed to the experimenter's actions. Results showed that the live feed elicited greater activation in the ventral striatum, amygdala, and anterior cingulate cortex (ACC), areas activated in studies of social reward (e.g., Walter et al., 2005), and the right posterior superior temporal sulcus (rpSTS), a region implicated in social perception and social cognition (Allison et al., 2000; Saxe, 2006). That these brain regions are differentially activated on the basis of the authenticity of face-to-face interactions leads one to enquire as to their importance. Are such areas central or peripheral to the processes we as experimenters attempt to measure in our studies of social visual cognition? How does their involvement impact upon processes occurring elsewhere in the brain? Do they scale up to produce measurable differences in behavior? And the question most pertinent to the current review: Are the processes that mediate perception of these social stimuli the same as those involved in perceiving a real person? Hence, it is not merely a matter of improving the ecological validity of our experiments; it is about the extent to which the findings from social attention studies translate to real person interaction. If these processes differ, it is likely to have serious implications for our understanding of social attention, and social neuroscience in general. In the next section we compare examples of what we have learned about social attention from classical behavioral studies and from those in which real people interact. We find that the former can yield very different results from the latter.

Attentional Orienting Toward Other People

Rather than processing all of the available sensory input, the visual system selects only that which is likely to be behaviorally important. Metaphors such as the attentional spotlight (Posner et al., 1980; Broadbent, 1982) or zoom-lens (Eriksen and St. James, 1986) describe the way in which attention is oriented around the field of vision, selecting for further processing any objects or locations falling within the “illuminated” boundary. Attentional orienting can be overt (i.e., where people look), or covert (i.e., where people attend without moving their eyes or head). Given the importance that attention plays in mediating what we see, it is not surprising that there has been much interest in determining how we select the information deemed to be “important” (Henderson, 2003). Some have argued that this selection process is largely driven by bottom-up stimulus features (e.g., Itti and Koch, 2000). According to these models, certain stimulus features, such as luminance contrasts, are particularly salient and thus automatically capture attention. Detailed computational models are remarkably accurate in predicting people's eye movements as they view images of natural scenes (Itti and Koch, 2001). However, due to the complexity of the processes they attempt to simulate, these models necessarily simplify humans as passive observers of the world (Findlay and Gilchrist, 2003). In reality, however, vision is an active process that enables us to carry out multifarious tasks in which required objects might not be the most salient aspect of the visual scene (Land, 2006). Consequently, others have argued that eye movements are driven by our top-down goals rather than by salient aspects of the visual scene (Land et al., 1999; Hayhoe and Ballard, 2005; Land, 2006). This reinforces the view outlined in the previous section that vision and action are intricately linked.

In addition to task goals and bottom-up salience, it has become clear that the attention system is strongly influenced by social factors. Some of the earliest eye tracking studies, by Yarbus (1967), showed that whilst viewing images of social situations our eyes are particularly attracted by the people in the scene. More recent studies have replicated this observation and shown further that attention is strongly drawn toward faces, and in particular the eyes (Yarbus, 1967; Kuhn and Land, 2006; Birmingham et al., 2008b, 2009; Fletcher-Watson et al., 2008; Kuhn et al., 2009). Indeed, such is the appeal of eyes that observers still tend to look at them even when faces are presented in isolation (Walker-Smith et al., 1977; Pelphrey et al., 2002; Itier et al., 2007).

However, to what extent do these findings generalize to more complex social interactions? Most studies of face perception involve faces that are presented in isolation and are, as a consequence, already attended (Walker-Smith et al., 1977; Pelphrey et al., 2002). Hence these studies may demonstrate a preference for the eyes because these are often the most complex or salient component of a pre-selected face. The true measure of eye-preference is whether eyes are able to summon attention when faces are embedded within a complex scene. Evidence has shown that this is indeed the case (Birmingham et al., 2008a,b, 2009). Although those studies used static scenes, others have investigated where observers look in dynamic scenes. For instance, Kuhn et al. (2009) had participants observe a magician performing a magic trick. In spite of the trick involving the magician's hands, the proportion of fixations on his head and eyes was close to 70%. Likewise, when participants were asked to watch videos of other students engaging in conversation, 77% of fixations were directed to the people in the clips (Foulsham et al., 2010).

Much empirical work has therefore demonstrated that as humans, we generally prioritize other humans, their faces and, in particular, their eyes when viewing natural scenes. Whilst these studies vary in terms of their ecological validity, there remain questions as to whether these studies capture the true nature of social cognition. Indeed, social cognition involves more than passively observing images of people; it involves interaction with real people. Interestingly, there is evidence that even the potential for real social interaction can influence behavior. For example, people will often meet the gaze of an approaching stranger that is depicted in an image (Henderson et al., 2005; Itier et al., 2007) but will avoid direct eye contact when the same event occurs in real life (Ellsworth et al., 1972). Hence the presence of a real person clearly elicits a different behavioral response. The main difference between these settings is that the latter involves the potential for social interaction whilst the former does not. A recent study by Laidlaw et al. (2011) directly examined the effect of those two scenarios. In the study, participants' eye movements were monitored as they sat in a waiting room. The crucial manipulation in the experiment was whether participants were joined by a real confederate posing as another research participant or the confederate appearing on a video screen in the waiting room. Results showed that whereas participants frequently looked at the confederate on the video screen, they rarely did so in person. Moreover, ratings of the participants' social skills correlated positively with the amount of time spent looking at the live confederate, yet did not in the video condition. Similar conclusions concerning the difference between real and artificial social interactions have been drawn from studies examining eye movements in response to social cues in autism (e.g., Nation and Penny, 2008).

In sum, it is clear that attempts to measure aspects of social cognition can yield different results depending on whether the social context is real or merely simulated. The studies reviewed above suggest that our willingness to look at others is strongly influenced by whether or not they are physically present. Although traditional, controlled, computer-based tasks are important in examining some of the mechanisms involved in social attention, its underlying mechanisms may only be fully understood in more naturalistic settings that take into account how we interact with other people.

Attentional Orienting Away from Other People: Gaze Following

Eyes are not only highly effective in attracting attention, but also in orienting attention to other parts of the visual field indicated by their gaze direction. This orienting response to where other people look has been termed gaze following or gaze cueing and has been studied extensively since the late 1990s (Friesen and Kingstone, 1998; Driver et al., 1999). In these experiments, participants are typically presented with a face in the centre of a display with its eyes and/or head directed to the left or right. A target is then presented at either the gazed-at location or in the opposite hemifield. The characteristic result is that response times are reduced for targets appearing in the gaze-indicated position, a facilitatory effect arising from the gaze cue having automatically shifted the observer's attention. In the years since this discovery, many variations of this paradigm have been developed to determine the parameters of gaze cueing and its underlying neural bases (Williams et al., 2005; Frischen et al., 2007; Materna et al., 2008).
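The facilitation described above is usually quantified as a difference between mean response times on uncued and cued trials. The following is a minimal sketch of that arithmetic, not the analysis of any study cited here; the trial values, the tuple layout, and the function name `gaze_cueing_effect` are hypothetical and chosen purely for illustration.

```python
from statistics import mean

# Hypothetical trials: (gaze_direction, target_side, response_time_ms).
# All values are invented for illustration only.
trials = [
    ("left", "left", 310),    # cued: target appears where the face looks
    ("left", "right", 340),   # uncued: target appears in the opposite hemifield
    ("right", "right", 300),  # cued
    ("right", "left", 330),   # uncued
    ("left", "left", 320),    # cued
    ("right", "left", 350),   # uncued
]


def gaze_cueing_effect(trials):
    """Return mean uncued RT minus mean cued RT, in milliseconds.

    A positive value indicates faster responses at the gazed-at
    location, i.e. the facilitatory effect described in the text.
    """
    cued = [rt for gaze, target, rt in trials if gaze == target]
    uncued = [rt for gaze, target, rt in trials if gaze != target]
    if not cued or not uncued:
        raise ValueError("need at least one cued and one uncued trial")
    return mean(uncued) - mean(cued)


effect = gaze_cueing_effect(trials)
print(f"Gaze cueing effect: {effect:.1f} ms")
```

With these invented values the cued mean is 310 ms and the uncued mean is 340 ms, giving a 30 ms facilitation. A real analysis would typically also exclude errors and outliers and compute the effect separately for each cue-target interval.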

Even though gaze cues are intended to represent real human faces, it has been argued that the paradigm may not necessarily capture the true nature of social attention (Kingstone et al., 2003, 2008; Kingstone, 2009). Although some researchers have tried to address this concern by improving the realism of the images used in their experiments (e.g., Hermens and Walker, 2010), the very nature of simulating social interaction via a computer monitor is questionable (Kingstone, 2009). Hence other researchers have begun to examine gaze following in more naturalistic settings whereby target locations are cued by real people. For instance, Gallup et al. (2012a) used a hidden video camera to record the number of glances received by an attractive stimulus as pedestrians walked by. The critical measure concerned whether a pedestrian's gaze would increase the likelihood of other passers-by glancing toward the stimulus. This was indeed the case. Moreover, this likelihood was greater for those who walked behind the pedestrian than for those who approached from the front. This finding is consistent with the notion of gaze avoidance by approaching strangers (Ellsworth et al., 1972), and demonstrates that the effectiveness of a visual cue in directing attention can be modulated by the social context (see also Gallup et al., 2012b, for gaze following in large crowds of people). Kuhn and colleagues have adopted a similar ethological approach to examining visual cognition by recording the eye movements of observers as they watch magic tricks (for reviews see Kuhn et al., 2008; Macknik et al., 2008; Kuhn and Martinez, 2012). Magicians are highly skilled in directing, and misdirecting, the attention of observers. Social cues play a crucial role in misdirection, and numerous studies have now demonstrated that gaze cues are instrumental in successfully achieving this (Kuhn and Land, 2006; Tatler and Kuhn, 2007; Kuhn et al., 2009; but see Cui et al., 2011). For instance, Kuhn et al. (2009) found that the magician's gaze influenced where people looked, and consequently the likelihood of successful detection (see also Tatler and Kuhn, 2007). The advantage of this over the standard gaze cueing paradigm is not only that the cues are generated in a more naturalistic way, but also that they compete against other salient features in the visual scene as well as the participant's intention to discover the trick. These paradigms therefore offer a significant step toward investigating attention in a more realistic social context. Importantly, however, attempts have been made to improve the ecological validity still further by comparing the likelihood of trick detection when observed on a video or in a live setting (Kuhn and Tatler, 2005; Tatler and Kuhn, 2007). These have indicated that misdirection experienced during a face-to-face interaction is more effective, suggesting that social cues are stronger when presented by a real person. Moreover, the instructions concerning what participants would expect to see in the face-to-face scenario did not influence their eye movement behavior, nor did they improve detection of the trick (Kuhn and Tatler, 2005). However, when the trick was viewed on a computer monitor, prior instructions influenced both detection and eye movement behavior (Kuhn et al., 2008).

In sum, whilst the gaze cueing paradigm has been immensely valuable in investigating different aspects of social attention, there is a clear difference in the way attention operates in the presence of real people as opposed to simulated people. We now turn to further evidence of this in the next section.

Social Inhibition of Return (SIOR)

Another example of how the presence of others influences attention is the way inhibition of return (IOR) is expressed during individual and joint visual search tasks. Indeed, the differences are such that Skarratt et al. (2010) have proposed that IOR and its social counterpart, social IOR, may even be independent processes rather than facets of the same process as revealed in social and solitary search contexts.

IOR refers to the slowing of responses to targets appearing in previously attended locations (Kingstone and Pratt, 1999; Taylor and Klein, 2000; Godijn and Theeuwes, 2002). It has been proposed to have evolved as a means of expediting visual search (Klein and MacInnes, 1999). To this end, inhibitory mechanisms serve to bias attention from returning to previously inspected locations, and to discourage successive eye-movements being programmed to the same spatial location (Rafal et al., 1989). One can imagine the utility of such mechanisms during a search for a friend in a crowd of people. The search is less likely to yield a successful outcome if attentional and saccadic resources are repeatedly realigned with spatial locations that have recently been searched. However, as social animals, we are likely to have carried out many of our predatory and defensive search behaviors in conjunction with other individuals. This raises the interesting question of whether one might inhibit a spatial location knowing that another person has previously searched there. This very rationale motivated Welsh and colleagues (Welsh et al., 2005) to examine whether IOR can be socially “transferred” between different individuals. This was investigated by having pairs of participants sit across a table from one another. Each took turns at reaching out to one of two targets as they appeared on the workspace. The basic social IOR phenomenon is the observation that participants are slower to initiate a reaching action to a location previously responded to by a partner. That is to say, one inhibits a location on the basis that another person has searched there. As such, it can be said that IOR can indeed be “transferred” between two individuals. This effect is clearly a visual phenomenon based on real social interaction. However, not only has this paradigm revealed information concerning such interaction but, as the following two sections show, the procedure also reveals aspects of visual attention previously unknown.
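The basic social IOR measure described above can be expressed as a simple contrast: reach initiation times to the location a partner has just responded to, minus initiation times to the other location. The following is a minimal sketch of that contrast only, assuming a strictly alternating two-person task with two target locations; the response log, its values, and the function name `social_ior` are hypothetical and are not taken from Welsh et al. (2005) or Skarratt et al. (2010).

```python
from statistics import mean

# Hypothetical alternating-response log for two partners, "A" and "B".
# Each entry: (actor, target_side, reach_initiation_time_ms).
# All values are invented for illustration only.
responses = [
    ("A", "left", 400),
    ("B", "left", 430),   # same side as A's preceding reach
    ("A", "right", 395),  # different side from B's preceding reach
    ("B", "right", 435),  # same side
    ("A", "left", 400),   # different side
    ("B", "right", 405),  # different side
    ("A", "right", 425),  # same side
]


def social_ior(responses):
    """Return mean same-location RT minus mean different-location RT (ms).

    Each response is classified by whether it was made to the location
    the partner responded to on the immediately preceding turn. A
    positive value indicates slower initiation at the partner's
    previous location, i.e. social IOR.
    """
    same, different = [], []
    for previous, current in zip(responses, responses[1:]):
        prev_actor, prev_side, _ = previous
        actor, side, rt = current
        if actor == prev_actor:
            continue  # only partner-to-partner transitions are of interest
        (same if side == prev_side else different).append(rt)
    if not same or not different:
        raise ValueError("need both same- and different-location trials")
    return mean(same) - mean(different)


print(f"Social IOR: {social_ior(responses)} ms")
```

With these invented values the same-location mean is 430 ms and the different-location mean is 400 ms, a 30 ms inhibitory effect. The first response in the log has no preceding partner response and so contributes to neither condition.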

Social IOR and New Insights into Human Cognition

The Role of Visible Transients

Skarratt et al. (2010) sought to investigate the extent to which social IOR is generated on the basis of social information rather than the visual information carried by another person's responses. If it is the former, then it ought to occur when participants simply know where their partner has responded without having seen it take place. To address this possibility, the view each participant had of their partner was restricted to a central portion measuring 12° across. All peripheral information was occluded, meaning participants could not see their partner's targets, response buttons and, consequently, the completion of their responses. In other words, all the visual information that could give rise to IOR at the response location was eliminated. White noise also masked the sound of the response buttons being pressed, thus precluding the possibility of IOR occurring due to auditory stimulation (cf. Spence and Driver, 1998). This meant that participants could infer a response location only from their partner's eye gaze (signaling their intention to respond), or their initial hand movement toward the target. Results showed that social IOR emerged even under these restricted viewing conditions. Moreover, it was of the same magnitude as the corresponding IOR effect observed under free-viewing conditions, to which all the sensory information had contributed. Thus, simply knowing where a person had responded was as effective as seeing the complete reaching response.

The implication of these findings is that a visual cognition effect, mediated in this case by inhibitory processes, is initiated by inferred events occurring in the external world. Importantly, this contrasts markedly with what was previously assumed about IOR from classical precueing studies in which participants perform alone. For instance, Cole et al. (2011a) showed that even when observers are aware that an occluded visual event has taken place in a spatial location, they do not inhibit it. The experiment involved participants having to detect a target appearing in a cued or uncued location. On some trials, a luminance cue indicated the possible target location. On others, a pattern mask briefly occluded the cue onset, but the indicated target location was revealed after the mask was removed. Thus, in both cases a precue indicated the potential target location, but only in one case did the participant see the cue generated. Results showed that IOR emerged only when the cue transient was visible, indicating that localized sensory input is required for inhibition to occur. Indeed, these findings concur with a great deal of evidence suggesting that local transients are necessary for attention to be marshaled at all. Using a similar method to occlude lower level visual transients, Franconeri et al. (2005) examined whether new perceptual objects can capture attention without an abrupt visual onset. To this end, they used a standard irrelevant singleton paradigm (Egeth and Yantis, 1997) comparing search slopes yielded by new versus already present targets. An annulus shape was presented around the perimeter of the array, which then contracted during the course of each trial. As the shape contracted, the new object appeared behind it and was then revealed in a location previously seen to be unoccupied. As with Cole et al. (2011a), participants were aware that a scene change had taken place, but had not seen the accompanying transient that signaled its arrival. Results showed that new objects failed to capture attention under these conditions, yet they did attract attention when the annulus was seen to move behind the search items, thus rendering a visible onset (but see Cole et al., 2011a, Experiment 6). In a similar vein, Skarratt et al. (submitted) have shown that attention is captured by objects that loom toward or recede away from the observer (see also Skarratt et al., 2009). These objects began their motion paths in far and near depth planes, respectively, before moving into alignment with objects remaining static throughout the trial. However, when the motion sequence was replaced with a blank frame, thus removing the transients associated with the objects' movement path, these objects were no longer capable of attracting attention. In the case of social IOR, however, spatial locations undergo inhibitory tagging on the basis of knowing rather than seeing that a stimulus event has taken place.

Do Central Cues Elicit IOR?

That social IOR occurs under restricted viewing conditions indicates that peripheral locations can be inhibited on the basis of centrally presented information. This contrasts with IOR, whose emergence in a peripheral location requires a localized peripheral cue, and which is not reliably observed when a peripheral location is indicated by a central arrow cue. Like peripheral cues, central arrows can facilitate processing when they precede a target by a short interval (Ristic et al., 2002; Tipples, 2002) yet they do not give rise to later IOR (Posner and Cohen, 1984; Abrams and Dobkin, 1994). Hence there is a clear difference in the effects of central cueing in a conventional precueing paradigm and in a social IOR paradigm. Indeed, this discrepancy becomes even more apparent when examining the effects of gaze cues. As we described earlier, attention can be oriented by the gaze direction of centrally presented faces. Gaze cues too result in prolonged facilitation that rarely gives way to subsequent IOR (McKee et al., 2007; Greene et al., 2009). As far as we are aware, only two studies have shown IOR in response to gaze cues. According to Frischen et al. (2007; see also Frischen and Tipper, 2004), gaze cues do give rise to IOR but at much later cue-target intervals than can be observed with peripheral cues (around 2400 ms rather than 300 ms), and only when attention is disengaged from the gazed-at location prior to target presentation. The highly specific circumstances in which gaze cues elicit IOR are in contrast with those in which social IOR occurs. For instance, Skarratt et al. (2010; Experiment 3) ensured that each participant saw only their partner's face as they performed the alternating response task. This provided a very close approximation of the classic gaze-cueing method in that the partner looked toward their response location after which the participant's own target appeared in the same or opposite location. The results showed reliable social IOR occurring much earlier (i.e., between 1300 and 1700 ms) than the IOR effect found by Frischen et al., and without a controlled attempt to remove attention from the gazed-at location. These findings suggest that the mechanisms underlying attention and inhibition respond differently to real and simulated biological behavior.

Real Versus Animated Biological Behavior

This point is further underscored by our attempts to induce social IOR using a realistic animation of a person's response behavior. In this experiment (Skarratt et al., 2010, Experiment 1), individual participants performed the alternating response task in conjunction with an animated partner. This was achieved by projecting a movie of a male partner onto a screen such that he appeared to be seated opposite. In keeping with our other experiments, we disambiguated the social and visual information conveyed by the partner's response by manipulating the participants' view of him. Results showed the inhibitory effect occurred only when participants had an unrestricted view of the animated partner's responses, indicating that IOR was elicited by the associated lower level visual transients. The absence of social IOR in the restricted viewing condition suggests that the visuomotor system is less sensitive to simulated biological behavior than it is to the same behavior performed by a real person. Hence a critical factor in the generation of social IOR appears to be that the observed behavior must demonstrate agency. This view is supported by several recent findings revealing different neural substrates for the perception of real and virtually real biological behavior. For instance, Gobbini et al. (2011) compared the BOLD responses of participants observing either human or robot faces performing basic emotional expressions. Perhaps unsurprisingly, both face types activated face-specialized regions such as the fusiform gyrus and the superior temporal sulcus. More interesting, however, was that human faces evoked stronger activations in the medial prefrontal and the anterior temporal cortices, and the right amygdala. The latter system has long been associated with emotion (Breiter et al., 1996; Morris et al., 1996), suggesting participants were less sensitive to automated displays of emotion, whilst the former regions are thought to be involved in the representation of others' mental states and ToM (Leibenluft et al., 2004; Mitchell et al., 2005; Amodio and Frith, 2006; Frith and Frith, 2006). These findings can be interpreted as observers empathizing more with sentient than with automated beings. This claim is corroborated by the observation that robotic faces elicited stronger activation in three gyri associated with the perception of inanimate objects and automated motion: the medial fusiform, the lingual, and the mid temporal gyri (see Beauchamp et al., 2002). In a similar study by Perani et al. (2001), positron emission tomography (PET) was used to record the neural responses of participants whilst they observed scenes of a hand grasping various geometrical objects. Responses to scenes involving a real hand were compared to those evoked when the same scenes were rendered in 3D virtual reality or 2D movie clips. Results showed that observation of the real hand was associated with greater activation in the inferior temporal cortices and the right inferior parietal cortex. These are regions that have been implicated, respectively, in the perceptual representation of actions and motor planning (e.g., Decety, 1996; Decety et al., 1997), and these stronger activations may reflect greater sensitivity to more realistic depictions of behavior. Finally, when the same brain regions are activated by live-action and computer-animated behavior, overall activation is stronger for live-action images (Mar et al., 2007).

Future Directions

The findings described above demonstrate that classical attention paradigms can not only underestimate effects but may also fail to reveal aspects of human cognition. Throughout this review we have pointed out cases in which important theoretical advances in visual cognition have been made by implementing social interaction into experimental manipulations. In the remainder of this article, we focus on one such phenomenon that until recently has been the exclusive domain of developmental researchers, but which is now enjoying increased interest within our own discipline. We propose that a recent development concerning the ToM phenomenon is perfectly suited to experiments employing real social interaction.

Theory of Mind is the ability to impute mental states to oneself and to others. Although the concept was originally employed in the context of animal cognition (Premack and Woodruff, 1978), a number of developmental psychologists applied the idea to human infants (e.g., Wimmer and Perner, 1983). Indeed, ToM has now been applied within various contexts including, for instance, schizophrenia (Harrington et al., 2005), autism (Baron-Cohen, 2000), Alzheimer's disease (Gregory et al., 2002), decision-making (Torralva et al., 2007), and evolutionary psychology (Povinelli and Preuss, 1995). Real-person interaction studies have also been employed in work on ToM. For instance, Stuss et al. (2001) examined the ability of frontal lobe patients to infer the visual experience of others; that is, to take another's perspective. Rather than depict an individual on a computer monitor, the patients were asked to consider the perspective of a real person who was seated opposite. Using both depicted and real-person interaction, Rilling et al. (2004) examined whether economic decision making in conjunction with another individual is subserved by cortical areas known to be involved in ToM (e.g., anterior paracingulate cortex and posterior superior temporal sulcus). Not only did Rilling et al. find this to be the case, but the activation observed was greater when the decision maker was interacting with a real person.

Although adult humans are adept at considering others' mental states when required to do so, a number of authors have recently argued that ToM attributions occur automatically, that is, without conscious effort. Evidence for automatic ToM has come from a number of different paradigms including gaze cueing (Nuku and Bekkering, 2008; Teufel et al., 2009, 2010a,b). Gazing agents have been employed in the context of ToM because when a person looks to a location, a mental state such as intention can be assumed to be occurring. As Calder et al. (2002) point out, gaze implies that the person may have some intention or goal toward a fixated object. Similarly, Nuku and Bekkering (2008) argued that gaze cueing occurs because the observer infers that the agent is physically able to attend to the target. They based their conclusion on results from a gaze cueing procedure in which the targets would or would not be visible to the agent from his vantage point. The authors found larger cueing effects when the targets were visible to the agent. This suggests that inferring the agent's mental state (i.e., seeing versus not seeing) influenced the degree to which the agent shifted the observer's attention.

Langton (2009), however, has urged caution in concluding that mental state attribution modulates gaze cueing. For instance, objects that have no mental state (e.g., a glove) but which incorporate a pair of eyes are effective in shifting attention toward the “looked-at” direction (e.g., Quadflieg et al., 2004). Moreover, Cole et al. (2011b) found that gaze cueing was not influenced by whether or not the inducing agent's view of a peripheral target was blocked. These findings suggest that gaze cueing is largely controlled by bottom-up mechanisms with little contribution from the higher processes that are responsible for mental state attribution.

Apperly et al. (2006) have also examined whether ToM can occur automatically. Adult participants were shown a video sequence in which an agent marked a box that she knew contained an object. After she was seen to leave the room, a second agent then secretly placed the marker on a different box, meaning that when she returned, the first agent would hold a false belief about which box contained the object. Participants were then given true/false statements assessing their own perspective (e.g., “the object is in the left box”) or occasionally that of the female agent (“she thinks the object is in the left box”). Apperly et al. reasoned that if participants automatically infer and encode another's perspective, then judgments about the agent's beliefs should be made as quickly as judgments about their own. Results showed, however, that participants were relatively slow to indicate the agent's belief when unexpectedly asked to do so. This finding challenges the notion that ToM can occur automatically. By contrast, German et al. (2004) have provided support for the automatic ToM hypothesis using neuroimaging. They found that brain areas known to be concerned with inferring another person's intentions (medial prefrontal, inferior frontal, and temporoparietal cortex; Frith and Frith, 2006) are also recruited when participants view videos of social situations but are not required to make judgments about mental states; a phenomenon the authors refer to as automatic engagement of the intentional stance. Given that the issues relating to the automaticity of ToM are, by definition, concerned with the attribution of mental states to real people, we suggest that the real-person interaction paradigm we have advocated in this review is particularly suited to its investigation. Furthermore, the present review shows how much more sensitive the paradigm can be to cognitive phenomena.

Conclusions

In this review we have emphasized how real-person social interaction research can yield very different results when compared with paradigms in which the social context is merely depicted. Indeed, new information concerning visual cognition is being derived from the method. It is clear from the work described in this review that fresh insight into human cognitive abilities can be gained from experiments that allow for more realistic social interaction. The development of such paradigms is particularly timely given the burgeoning interest in issues such as ToM and automatic perspective taking. Indeed, we suggest that if any debate within cognition could benefit from real-person interaction paradigms, it is this one. In the same way that the processes underlying attentional orienting and IOR can be elucidated during social interaction with other people, one might hypothesize that those underlying the perception of others' thoughts, intentions, goals and actions might also be better understood. The adoption of such an approach can only increase our understanding of these fascinating processes.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors would like to thank Andrew Gallup and R. Nathan Spreng for their helpful comments on the manuscript.

References

Abrams, R. A., and Dobkin, R. S. (1994). Inhibition of return: effects of attentional cuing on eye movement latencies. J. Exp. Psychol. Hum. Percept. Perform. 20, 467–477.