Face Processing is Gated by Visual Spatial Attention

Human perception of faces is widely believed to rely on automatic processing by a domain-specific, modular component of the visual system. Scalp-recorded event-related potential (ERP) recordings indicate that faces receive special stimulus processing at around 170 ms poststimulus onset, in that faces evoke an enhanced occipital negative wave, known as the N170, relative to the activity elicited by other visual objects. As predicted by modular accounts of face processing, this early face-specific N170 enhancement has been reported to be largely immune to the influence of endogenous processes such as task strategy or attention. However, most studies examining the influence of attention on face processing have focused on non-spatial attention, such as object-based attention, which tends to have longer-latency effects. In contrast, numerous studies have demonstrated that visual spatial attention can modulate the processing of visual stimuli as early as 80 ms poststimulus, substantially earlier than the N170. These temporal characteristics raise the question of whether this initial face-specific processing is immune to the influence of spatial attention. This question was addressed in a dual-visual-stream ERP study in which the influence of spatial attention on the face-specific N170 could be directly examined. As expected, early visual sensory responses to all stimuli presented in an attended location were larger than responses evoked by those same stimuli when presented in an unattended location. More importantly, a significant face-specific N170 effect was elicited by faces that appeared in an attended location, but not in an unattended one. In summary, early face-specific processing is not automatic, but rather, like the processing of other objects, strongly depends on endogenous factors such as the allocation of spatial attention. Moreover, these findings underscore the extensive influence that top-down attention exercises over the processing of visual stimuli, including those of high natural salience.


INTRODUCTION
Faces are of undeniable ecological significance and commonly evoke behavioral and physiological responses that differ from responses to many other kinds of stimuli. Neuropsychological and behavioral evidence of the special character of face processing, available for many years, includes the existence of patients with selective deficits in face perception (Bodamer, 1947), the ontogenetically early appearance of preference for face-like stimuli (Goren et al., 1976) and the finding that image inversion impairs perception of faces more than perception of other objects (Yin, 1969).
Correspondingly, numerous studies with both human and nonhuman primates have reported that visually-evoked neurophysiological responses to faces often differ from responses evoked by other kinds of objects. Over thirty years ago, neurophysiological studies with macaque monkeys revealed that a population of neurons in the inferotemporal cortex respond preferentially to images of faces (Gross et al., 1972). Subsequently, functional imaging studies have identified certain regions in the human brain, including the fusiform gyrus and the superior temporal sulcus, that respond more strongly to faces than other objects (Clark et al., 1996; Kanwisher et al., 1997; McCarthy et al., 1997; Puce et al., 1995; Sergent and Signoret, 1992). Distinctive patterns of neural activity associated with face processing have also been observed in human electrophysiological recordings (Allison et al., 1999; Bentin et al., 1996; George et al., 1996; Lu et al., 1991; Sams et al., 1997). In particular, relative to other visual objects, faces elicit an enhanced negative-polarity event-related potential (ERP) component over lateral occipital scalp peaking about 170 ms after stimulus presentation (Bentin et al., 1996). Furthermore, in agreement with functional imaging studies, ERP source analysis has indicated that the activity underlying the face-specificity of the N170 responses likely arises from a combination of activity in the fusiform gyrus and the superior temporal sulcus (Itier and Taylor, 2004).
Based on the unique behavioral and physiological responses elicited by faces, many investigators have concluded that the processing of faces is qualitatively distinct from the processing given to other types of objects. According to this view, faces are processed by an anatomically well-localized modular system that is highly specialized for analyzing images of faces (Farah et al., 1998; Tovee, 1998). Proponents of the modular view have argued that certain regions of the brain, most especially the fusiform gyrus (Allison et al., 1999; Kanwisher et al., 1997; McCarthy et al., 1997), are highly specialized for face processing. In this context, it is interesting to note that a number of studies have reported that the processing of faces appeared to be relatively impervious to the influence of the allocation of attention (e.g., Cauquil et al., 2000; Lueschow et al., 2004). If true, the processing of faces would be a striking exception to the robust attention dependence broadly exhibited throughout the visual system and would strongly underscore the unique character of face processing.
The idea that face processing is immune to top-down influence, however, does not accord with our increasing appreciation of the prevalence of attentional modulation in the visual system. If face stimuli obligatorily receive special processing, the early neurophysiological responses selectively elicited by images of faces, such as the N170 ERP effect, should be about the same whether the faces are attended or not. It is now clear, however, that even the very early stages of cortical visual processing can be influenced by attention (e.g., Heinze et al., 1994; Mangun, 1995; Moran and Desimone, 1985; Motter, 1993; Poghosyan et al., 2005; Posner and Gilbert, 1999; Smith et al., 2006; Woldorff et al., 1997). Thus, the early analysis of complex objects like faces seems likely to be sensitive to the allocation of attention as well. Therefore, in order to determine whether the early cortical discrimination of faces depends on the allocation of attention, we have compared the neurophysiological responses of human observers to images of faces and other objects when they were presented in an attended location with the responses evoked by those same images when they were presented in an unattended location.

Participants
Nineteen right-handed adults (seven females and twelve males) ranging in age from 18 to 37 years (mean 22.4 years) participated in the study. Four subjects were excluded due to poor performance and/or excessive physiological artifacts such as eye blinks, eye movements, or muscular activity. The study protocol was approved by the Duke University Health System Institutional Review Board, and written informed consent was obtained from all participants.

Stimuli and paradigm
Subjects were seated comfortably in an electrically shielded, sound-attenuated, dimly illuminated chamber facing a computer monitor. Stimulus presentation was controlled by a personal computer running the "Presentation" software package (Neurobehavioral Systems, Inc., Albany, CA, USA). All stimuli were displayed on a 15" CRT screen refreshed at 60 Hz.
Throughout each recording block, subjects were required to fixate on a small cross in the center of the screen while streams of visual stimuli were presented above and below the fixation cross (Figure 1). The upper stream consisted of a rapid serial visual presentation (RSVP) of alphanumeric characters, approximately two degrees square, centered about two degrees above the fixation cross. The characters in this alphanumeric stimulus stream were replaced every 150 ms. Simultaneously, a series of images of faces (obtained from the Psychological Image Collection at Stirling; http://pics.psych.stir.ac.uk) and houses, each approximately five degrees wide and six degrees high, were presented in randomized order at a point centered nine degrees below the fixation point. These images were presented for 100 ms each and the intervals between image onsets were randomly varied between 600 and 900 ms (in increments of the frame rate).
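The timing figures above follow from the 60 Hz refresh rate, and the short Python sketch below works through the arithmetic. It assumes that every duration is an integer number of video frames and that onset intervals are drawn uniformly from the allowed frame counts; neither detail is stated in the text, and the function names are illustrative only.

```python
import random

REFRESH_HZ = 60.0  # CRT refresh rate stated in the text


def ms_to_frames(duration_ms, refresh_hz=REFRESH_HZ):
    """Whole number of video frames closest to a duration in ms."""
    return round(duration_ms * refresh_hz / 1000.0)


def frames_to_ms(n_frames, refresh_hz=REFRESH_HZ):
    """Duration in ms of a whole number of video frames."""
    return n_frames * 1000.0 / refresh_hz


rsvp_frames = ms_to_frames(150)           # frames per RSVP character
rsvp_rate_hz = REFRESH_HZ / rsvp_frames   # character rate of the RSVP stream
image_frames = ms_to_frames(100)          # frames per face/house image

# Allowed onset-to-onset intervals for the image stream, in frames.
soa_frames = list(range(ms_to_frames(600), ms_to_frames(900) + 1))


def draw_soa_ms(rng=random):
    """Draw one onset interval (assumed uniform over the allowed frames)."""
    return frames_to_ms(rng.choice(soa_frames))
```

Under these assumptions each character lasts 9 frames, which gives the 6.67 Hz stream rate used later for the steady-state analysis, and the image onset intervals span 36 to 54 frames.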
Each subject performed 16 blocks of trials, with each block lasting approximately 2.5 minutes. Prior to the start of each block, subjects were instructed to attend either the upper stream (the alphanumeric character stimuli) or the lower stream (the face and house images) and to indicate the appearance of an occasional target in the designated stream by pressing a button on a key pad. When attending the alphanumeric stream, subjects attempted to detect the appearance of infrequently presented numerals (approximately 2% of the characters in the stream) amongst mostly uppercase alphabetical characters. Targets in the face and house stream were blurred images of these objects and comprised 20% of the images presented. In order to avoid biasing the attention of the subjects toward one type of image in the lower stream, the blurry target images were equally likely to be either a face or a house. Thus, whether the images contained a face or a house was irrelevant to the performance of the task. The order of experimental conditions (i.e., attend to alphanumeric stream or attend to face/house image stream) was randomized across blocks.

Figure 1. Prior to each recording block, subjects were instructed to attend to one or the other of the two locations and detect occasional target stimuli in the stream at that location. In the attend-RSVP condition, participants attended to the stream of alphanumeric characters to detect an occasional digit (target) amongst mostly letters (non-targets, or "standards"). In the attend-images condition, participants attended to the stream of face and house images, most of which were in focus (standards), to detect the occasional occurrence of a blurred image (targets). Note that in this condition all the stimuli in the lower image stream (i.e., faces and houses) were attended, but the image content itself (i.e., whether it was a face versus a house) was completely orthogonal to the task of detecting blurred images of either type.

Electrophysiological recording and analysis
The EEG (electroencephalogram) was recorded from 64 electrodes in a customized elastic cap (Electro-Cap International, Inc.) and referenced to the right mastoid during recording. Electrode impedances were maintained at less than 2 kΩ for the mastoids and the ground electrode, less than 10 kΩ for the vertical and horizontal eye electrodes, and less than 5 kΩ for the remaining electrodes. The 64 channels of EEG/EOG were continuously recorded with a band-pass filter of 0.01-100 Hz and a gain of 1000 (SynAmps, Neuroscan Inc.). The raw signal was continuously digitized with a sampling rate of 500 Hz.
Eye blinks and eye movements were monitored by horizontal and vertical electrooculogram (EOG) electrodes for later rejection of trials with such artifacts. Vertical eye movements and eye blinks were detected by two electrodes placed below the orbital ridge of each eye, each referenced to the electrodes above the eye. Horizontal eye movements were monitored by two electrodes located at the outer canthi of the eyes. During recording, subjects were also monitored using a closed circuit video monitoring system to detect gross eye and/or head movements. Subjects who displayed an excessive degree of eye movement or blinking were excluded from further participation in the study, and any data collected from such subjects were discarded. Artifact rejection was performed off-line by discarding trials in which the EEG/EOG were contaminated by eye movements, eye blinks, excessive muscle activity, drifts or amplifier blocking. ERP averages to the various trial types were extracted by time-locked averaging from 500 ms before to 1000 ms after stimulus presentation and then digitally low-pass filtered with a nine-point moving average (which heavily filters out activity at and above 56 Hz at our 500-Hz digitization rate) and re-referenced to the algebraic average of the two mastoid electrodes. The analyses focused on the ERPs on the non-target trials, for both the RSVP and image streams, thereby avoiding the presence of any significant target-detection or motor-related activity in the ERPs.
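The frequency behavior of the nine-point moving average can be checked directly. For an N-point boxcar at sampling rate fs, the gain is |sin(πfN/fs) / (N sin(πf/fs))|, which first reaches zero at fs/N, about 55.6 Hz for N = 9 and fs = 500 Hz. The Python sketch below evaluates this expression and applies the filter to a test sinusoid; the function names and the handling of epoch edges (only fully covered samples are returned) are illustrative assumptions, not details taken from the study.

```python
import math


def boxcar_gain(freq_hz, n_points=9, fs_hz=500.0):
    """Magnitude response of an n-point moving average at freq_hz."""
    w = math.pi * freq_hz / fs_hz
    if abs(math.sin(w)) < 1e-12:   # DC (or a multiple of fs): unit gain
        return 1.0
    return abs(math.sin(n_points * w) / (n_points * math.sin(w)))


def moving_average(samples, n_points=9):
    """Centered moving average; returns only fully covered samples."""
    half = n_points // 2
    return [sum(samples[i - half:i + half + 1]) / n_points
            for i in range(half, len(samples) - half)]


FS_HZ = 500.0
first_null_hz = FS_HZ / 9   # about 55.6 Hz

# A sinusoid at the first null is removed almost entirely.
test_wave = [math.sin(2 * math.pi * first_null_hz * k / FS_HZ)
             for k in range(90)]
residual = max(abs(v) for v in moving_average(test_wave))
```

The same gain function shows that slow ERP activity is left largely intact: at 10 Hz the gain is still close to 0.95.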
To evaluate the effect of attention on the steady-state modulation induced by the RSVP letter stream, average EEG traces were computed for each subject on the channels of interest and the envelope of the SSVEP signal was extracted by complex demodulation (Draganova and Popivanov, 1999; Makeig et al., 1996; Muller et al., 1998). More specifically, the averaged EEG epoch was multiplied by a complex sinusoid at the frequency of the RSVP stream (6.67 Hz), and the resultant waveform was then low-pass filtered with a zero phase-shift filter and a cutoff of 2 Hz. For each condition, mean amplitude was subsequently calculated from the complex demodulated waveforms between 0 and 500 ms after stimulus onset. The difference of the oscillation amplitude between the attended condition and the unattended condition was tested using within-subject repeated-measures analyses of variance (ANOVAs). To evaluate the significance of the effect of attention on the P1 component elicited by the images, the mean amplitudes of the image ERP waves between 60 and 140 ms were measured for each subject and condition, and ANOVAs were performed on these amplitude measures.
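Complex demodulation as described here can be sketched in a few lines of Python. The sketch multiplies the signal by a complex sinusoid at the carrier frequency, low-pass filters the product, and doubles the magnitude to recover the amplitude envelope. One substitution should be noted plainly: a symmetric 75-sample moving average (one carrier period at 500 Hz) stands in for the 2 Hz zero-phase filter, whose design is not specified in the text. The function name and the synthetic test signal are likewise assumptions for illustration.

```python
import cmath
import math


def complex_demodulate(signal, fs_hz, carrier_hz, win):
    """Amplitude envelope of `signal` at `carrier_hz`.

    `win` is the (odd) length of a symmetric moving average used as a
    zero-phase low-pass stand-in; only fully covered samples are returned.
    """
    # Shift the carrier frequency down to 0 Hz.
    shifted = [s * cmath.exp(-2j * math.pi * carrier_hz * k / fs_hz)
               for k, s in enumerate(signal)]
    half = win // 2
    envelope = []
    for i in range(half, len(shifted) - half):
        mean = sum(shifted[i - half:i + half + 1]) / (2 * half + 1)
        envelope.append(2.0 * abs(mean))   # factor 2 restores the amplitude
    return envelope


FS_HZ = 500.0
CARRIER_HZ = 60.0 / 9.0   # 6.67 Hz RSVP rate (one character per 9 frames)

# Synthetic steady-state response: amplitude 3, arbitrary phase.
ssvep = [3.0 * math.cos(2 * math.pi * CARRIER_HZ * k / FS_HZ + 0.4)
         for k in range(500)]
envelope = complex_demodulate(ssvep, FS_HZ, CARRIER_HZ, win=75)
```

Because the window spans exactly one carrier period, the component at twice the carrier frequency averages out, and the envelope of the synthetic signal comes back flat at its true amplitude.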
Image-type difference waves were calculated for each attention condition by subtracting the average ERP evoked by the houses from the average ERP evoked by the faces. The ERPs and ERP difference waves for the individual subjects were grand averaged across subjects. Repeated-measures ANOVAs were performed on mean amplitudes of the ERP waveforms and difference waves in specific latency windows across subjects, relative to a 200 ms prestimulus baseline. In particular, activity at several occipital sites in a window around 160-170 ms (the hallmark N170 component) was analyzed for significant differences as a function of the factors of Attention (attended vs. unattended) and Object Type (face vs. house).
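The measurement steps in this paragraph (baseline correction, face-minus-house subtraction, and mean amplitude in a latency window) can be illustrated with a minimal Python sketch. The waveforms below are synthetic and purely illustrative, not data from the study; the helper names are hypothetical, windows are treated as inclusive at both ends, and the 135-185 ms window is the N170 latency range quoted in the Results.

```python
def mean_amplitude(erp, times_ms, t_start, t_end):
    """Mean of the samples whose latency lies in [t_start, t_end]."""
    vals = [v for v, t in zip(erp, times_ms) if t_start <= t <= t_end]
    return sum(vals) / len(vals)


def baseline_correct(erp, times_ms, baseline_ms=200):
    """Subtract the mean of the prestimulus baseline window."""
    base = mean_amplitude(erp, times_ms, -baseline_ms, 0)
    return [v - base for v in erp]


def difference_wave(face_erp, house_erp):
    """Face-minus-house difference wave, sample by sample."""
    return [f - h for f, h in zip(face_erp, house_erp)]


# Epoch from -500 to 998 ms sampled every 2 ms (500 Hz).
times_ms = list(range(-500, 1000, 2))

# Synthetic ERPs (illustrative only): a 1 uV offset plus a negative
# deflection between 150 and 170 ms that is larger for faces.
face = [1.0 + (-4.0 if 150 <= t <= 170 else 0.0) for t in times_ms]
house = [1.0 + (-1.0 if 150 <= t <= 170 else 0.0) for t in times_ms]

face_bc = baseline_correct(face, times_ms)
house_bc = baseline_correct(house, times_ms)
n170_effect = mean_amplitude(difference_wave(face_bc, house_bc),
                             times_ms, 135, 185)
```

In the actual analysis one such value per subject and attention condition would enter the repeated-measures ANOVA.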

RESULTS
Subjects performed both the RSVP digit-detection task and the blurry-image detection task well (detecting an average of 93.5 ± 8.2% of the targets in the image stream and 85.7 ± 11.6% of the targets in the RSVP stream) and showed similar reaction times on both tasks (an average of 465.3 ± 40.7 ms for targets in the image stream and 477.3 ± 40.8 ms for targets in the RSVP stream).
The presentation of the constant-rate RSVP stream of characters above fixation induced a steady-state oscillation in the EEG traces over bilateral occipital scalp (Figure 2A, blue trace). As expected (Muller and Hillyard, 2000), when attention was directed toward this stream, the amplitude of this oscillation was much larger (F(1,14) = 25.3, p < 0.0005; Figure 2A, red trace), reflecting the enhanced sensory processing of stimuli in an attended region of space. The images of the faces and houses in the other stream evoked the occipital P1 component 100 ms poststimulus that is characteristic of ERPs to visual stimuli (Figure 2B, blue trace). Also as expected (reviewed in Mangun, 1995), when subjects were attending to these images, the amplitude of the P1 to all the stimuli in the stream was greatly magnified (F(1,14) = 7.11, p < 0.02; Figure 2B, red trace), demonstrating the strong influence of spatial attention on processing in the early visual sensory pathways.
More importantly, spatial attention had a profound influence on the face-specific processing reflected in the difference between the ERP responses to faces and the ERP responses to houses in the N170 latency range (135-185 ms; Figure 3). When subjects were attending to the image stream, faces evoked a substantially larger negative wave over lateral occipital cortex in the N170 latency range than houses (Figure 3A). This early face-specific activity was spatially focal, relatively right lateralized, and peaked at approximately 160 ms (Figure 3B), highly consistent with previously reported characteristics of the N170 component (e.g., Bentin et al., 1996). In contrast, when subjects were attending away from the image stream (i.e., attending to the RSVP stream), the difference between the N170-latency activity evoked by faces and houses was essentially eliminated (Figure 3C-D).

Figure 2. Grand average waveforms (n = 15) over occipital (visual) cortex demonstrating that the processing of stimuli at the attended location was enhanced. (A) Stimuli in the letter/digit stream, which were presented at a regular rate (6.67 Hz), produced a steady-state oscillation in the EEG trace. The amplitude of this oscillation was strongly enhanced when the letter stream was attended. (B) ERPs to non-target stimuli in the face/house image stream. When attention was directed to this stream, all the images in the stream evoked larger sensory responses, including a strongly enhanced sensory P1 component at 100 ms poststimulus.
The observed influence of attention on this early face-selective processing was reflected statistically in several ways. First, there was a statistically significant interaction between Attention and Object Type (F(1,14) = 4.92, p < 0.05), revealed by a two-way repeated-measures analysis of variance (ANOVA) of the mean amplitude of the activity in the latency window around the N170. In addition, specific comparisons within the two attention conditions confirmed that the face-house difference in the N170 latency range in the attended condition was highly significant (F(1,14) = 8.82, p = 0.01), whereas there was no significant difference in the unattended condition (F(1,14) = 1.37, p = 0.26).
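In a fully within-subject 2 × 2 design, the interaction F statistic with (1, n − 1) degrees of freedom equals the square of a one-sample t statistic computed on each subject's difference of differences, here (attended face − attended house) − (unattended face − unattended house). The Python sketch below illustrates that equivalence. The three-subject amplitudes are invented for the example and are not data from the study, and the function name is hypothetical.

```python
import math
import statistics


def interaction_f(att_face, att_house, unatt_face, unatt_house):
    """F(1, n-1) for the Attention x Object Type interaction.

    Each argument holds one mean amplitude per subject, in the same
    subject order. Returns the F value and its degrees of freedom.
    """
    # Per-subject difference of differences.
    dd = [(af - ah) - (uf - uh)
          for af, ah, uf, uh in zip(att_face, att_house,
                                    unatt_face, unatt_house)]
    n = len(dd)
    t = statistics.mean(dd) / (statistics.stdev(dd) / math.sqrt(n))
    return t * t, (1, n - 1)


# Invented amplitudes (uV) for three hypothetical subjects.
att_face = [-3.0, -4.0, -5.0]
att_house = [-2.0, -2.0, -2.0]
unatt_face = [-1.0, -1.0, -1.0]
unatt_house = [-1.0, -1.0, -1.0]

f_value, dof = interaction_f(att_face, att_house, unatt_face, unatt_house)
```

With the 15 subjects retained in the study, the same calculation yields the (1, 14) degrees of freedom reported above.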
Our analyses focused on the ERP responses derived relative to the standard averaged-mastoid reference. Because of the lateral occipital focus of the N170 activity, however, this reference may have been somewhat less sensitive to the N170 effects than might be optimal. Accordingly, to ensure that our choice of reference did not bias or otherwise limit our results, we also derived ERP averages for all subjects and conditions with respect both to a fully averaged reference (i.e., referenced to the average of all the channels) and to a frontal reference (forehead sites). Although the N170 effect in the attended channel appeared to be slightly larger with these derivations, the analyses of these data were completely consistent with those using the mastoid-referenced data, namely, a robust face-specific N170 effect when the images were attended, no such significant effect when attention was directed toward the letters, and a significant two-way interaction between Attention and Object Type.
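Re-referencing amounts to subtracting a reference waveform from every channel. The Python sketch below shows the averaged-mastoid and average-reference derivations on a toy recording. The channel labels (RM, LM, PO8) and sample values are hypothetical, and the online reference (right mastoid) is represented explicitly as a channel of zeros so that it can take part in either reference.

```python
def rereference(data, ref_channels):
    """Subtract the mean of `ref_channels` from every channel.

    `data` maps channel name -> list of samples (equal lengths).
    """
    n_samples = len(next(iter(data.values())))
    ref = [sum(data[ch][i] for ch in ref_channels) / len(ref_channels)
           for i in range(n_samples)]
    return {ch: [v - r for v, r in zip(vals, ref)]
            for ch, vals in data.items()}


# Toy recording referenced online to the right mastoid (RM = 0).
recorded = {
    "RM": [0.0, 0.0],     # online reference, zero by definition
    "LM": [2.0, 4.0],     # left mastoid
    "PO8": [-5.0, -3.0],  # hypothetical lateral occipital site
}

mastoid_ref = rereference(recorded, ["LM", "RM"])   # averaged mastoids
average_ref = rereference(recorded, list(recorded))  # all channels
```

After the averaged-mastoid derivation the two mastoid channels are equal and opposite, and after the average-reference derivation the channels sum to zero at every sample, which is a quick sanity check on either transformation.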

DISCUSSION
In the present study, we investigated the influence of spatial attention on the face-specific N170 effect, believed to reflect the earliest stage at which face processing clearly and consistently diverges from the processing of other types of objects. To examine such face-specific processing, we compared the ERPs elicited by faces with the ERPs evoked by other objects (i.e., houses), under different spatial attention conditions. In agreement with previous reports, when subjects attended to the images, we found that the occipital N170-latency negative-wave response to faces was much larger than the response to houses. However, when attention was focused on a demanding task in another location, there was no significant difference between the ERPs to faces and houses in the N170 latency range. Thus, in contrast to various prior reports, our results indicate that face-specific processing is not automatic but requires the allocation of spatial attention.
Prior reports in the ERP literature have generally shown, in contrast to the results reported here, little or no effect of attention on the N170 elicited by faces (Carmel and Bentin, 2002; Cauquil et al., 2000; Holmes et al., 2003). MEG studies have yielded similar findings (Downing et al., 2001; Furey et al., 2006). These findings had been interpreted as indicating that face processing is relatively immune to the effects of endogenous processes such as the allocation of attention, thereby reinforcing modular accounts of face processing. On the other hand, hemodynamically based neuroimaging studies have suggested that face-specific processing is modulated by attention. For example, several fMRI studies found that the activity evoked in the fusiform gyrus was enhanced when subjects selectively attended to the faces when watching a display containing images of faces and houses (Wojciulik et al., 1998) or watching a display containing superimposed transparent faces and houses (Carmel and Bentin, 2002; Cauquil et al., 2000; Holmes et al., 2003; O'Craven et al., 1999). This evidence therefore argues against the fully automatic nature of face-specific processing. The discrepancy between findings from electrophysiological studies using ERP and MEG and neuroimaging studies using fMRI may have arisen from the differing temporal resolution of these methods. Hemodynamically based studies cannot resolve the time course of such attentional modulation and thus leave open the possibility that the influence of attentional allocation or task is limited to the later stages of face processing while the early processing of faces is fully automatic.
Nevertheless, there exists a discrepancy between our findings and those of most previous ERP studies, which may result from differences in the types of attentional manipulation employed. Prior studies (Carmel and Bentin, 2002; Cauquil et al., 2000; Downing et al., 2001; Furey et al., 2006; Lueschow et al., 2004) mainly focused on manipulating object-based attention rather than spatial attention. In other words, the stimuli used in those studies were all presented in attended locations, while the task relevance of faces was manipulated. Thus, the potentially highly robust influence of spatial attention on early face processing was not examined. Given the findings of numerous ERP reports that early sensory processing components, including the P1 at 100 ms, are strongly modulated by spatial attention, it seems quite reasonable that the N170 component, with its later onset, would also be affected by spatial attention. Therefore, the null effects of attentional modulation on N170 or M170 in previous studies mainly demonstrate that "object-based" attention has relatively little influence on early-latency face-specific processing.
A couple of prior ERP studies have more specifically investigated the effect of spatial attention on face processing. In a recent ERP study focusing on the effects of attention on emotional face expression, a small enhancement of the N170 component was observed when subjects attended to a pair of face images in a display containing other objects relative to attending to a pair of house images in that display (Holmes et al., 2003). While this finding suggests that attending to faces can induce some modulation of the early responses evoked by images of faces, it does not address the question of whether the specific processing that faces receive requires attention. More specifically, the design of that experiment did not allow the assessment of face-selective activity (e.g., an N170 effect) in an unattended location, nor its comparison with face-selective activity evoked by attended images. Whether face processing is largely automatic, therefore, was not resolved in that study.
A more recent study (Jacques and Rossion, 2007) showed larger effects of spatial attention on the amplitude of a negative component peaking at around 170 ms poststimulus; however, the ability to attribute the effect in this study to an attentional modulation of face-specific processing was rather limited. More specifically, these authors manipulated the difficulty of a centrally presented discrimination task and showed that a negative wave in the N170 latency evoked by peripherally presented faces was strongly reduced when the central task was very demanding. However, it is well-known that spatial attention can enhance not just the P1 wave at 100 ms poststimulus, but also the occipital N1 component at 180 ms; this enhancement occurs for all visual stimuli (including faces). Accordingly, in order to isolate the influence of attention on "face-specific" processing, it is necessary to first extract face-specific processing by comparing ERP responses to faces with ERP responses to non-face objects. Face-specific activity evoked by stimuli presented in an attended location can then be directly compared to the face-specific activity evoked by the same stimuli when they are presented in a spatially unattended location. We have performed this analysis in the present study, using a paradigm in which spatial attention was manipulated while object-based attention was controlled (i.e., the content of the images, whether they contained a face or a house, was irrelevant to the performance of the blurry-image detection task). Furthermore, by extracting the face-specific component in terms of the differential processing effects between faces and houses, we were able to demonstrate a clear and robust modulation of this early face-specific activity due to spatial attention.
We note further that the essential elimination of a significant face-specific N170 effect in the unattended channel in the present study occurred under circumstances where attention was strongly directed toward a very demanding task in the attended channel (the RSVP task). It may be that lower-load conditions in the task-relevant channel would allow additional processing in an unattended channel (Lavie, 2006), such that significant levels of early face-house discrimination activity (such as that reflected in the N170) could be elicited. Future studies will be important for delineating the relationship between the degree of attentional load and the ability of the brain to rapidly discriminate faces from other visual objects.
In our study, the effects of spatial attention on the basic early sensory processing of the stimuli were clear, as evidenced by the strong enhancement of the P1 response at 100 ms for attended images (as well as by the large attentional enhancement of the steady-state responses to the letter stream). The P1 effect was followed in the image ERPs by a large attentional modulation of the face-specific N170 effect. This pattern fits the hypothesis that the attentional modulation of the early visual sensory responses ramifies forward to substantially gate the differential processing of faces shortly thereafter in visual cortical processing. Such a result also argues against there being much input to the N170 face-specific brain activity from any highly automatic, alternate pathway specific for face processing information (e.g., Morris et al., 2001; Ohman, 2002) that circumvents the feedforward early sensory cortical pathways in extrastriate visual cortex and thereby in turn circumvents the pervasive influence that spatial attention has been shown to exercise on these pathways (e.g., Heinze et al., 1994; Mangun, 1995; Moran and Desimone, 1985; Motter, 1993; Poghosyan et al., 2005; Posner and Gilbert, 1999; Smith et al., 2006; Woldorff et al., 1997).

CONCLUSION
In summary, our results clearly rule out any account of early face discrimination mechanisms that stipulates independence from the allocation of spatial attention. When faces appeared in an unattended spatial location, even the initial face-specific processing indexed by the differential N170 response was essentially absent. The early processing which differentiates faces from non-face objects thus strongly depends on endogenous factors such as the distribution of spatial attention. These results further suggest that the processing of faces may in fact be more similar to that applied to other highly significant stimuli than the current prevailing view indicates. Moreover, these findings underscore the extensive reach of visual attention in influencing the sensory processing of all stimuli in our environment, including at early stages of that processing.