Role of Temporal Processing Stages by Inferior Temporal Neurons in Facial Recognition

In this review, we focus on the role of temporal stages of encoded facial information in the visual system, which might enable the efficient determination of species, identity, and expression. Facial recognition is an important function of our brain and is known to be processed in the ventral visual pathway, where visual signals are processed through areas V1, V2, V4, and the inferior temporal (IT) cortex. In the IT cortex, neurons show selective responses to complex visual images such as faces, and at each stage along the pathway the stimulus selectivity of the neural responses becomes sharper, particularly in the later portion of the responses. In the IT cortex of the monkey, facial information is represented by different temporal stages of neural responses, as shown in our previous study: the initial transient response of face-responsive neurons represents information about global categories, i.e., human vs. monkey vs. simple shapes, whilst the later portion of these responses represents information about detailed facial categories, i.e., expression and/or identity. This suggests that the temporal stages of the neuronal firing pattern play an important role in the coding of visual stimuli, including faces. This type of coding may be a plausible mechanism underlying the temporal dynamics of recognition, including the process of detection/categorization followed by the identification of objects. Recent single-unit studies in monkeys have also provided evidence consistent with the important role of the temporal stages of encoded facial information. For example, view-invariant facial identity information is represented in the response at a later period within a region of face-selective neurons. Consistent with these findings, temporally modulated neural activity has also been observed in human studies. These results suggest a close correlation between the temporal processing stages of facial information by IT neurons and the temporal dynamics of face recognition.

Neurons in the IT cortex show selective responses to complex visual images (Baylis et al., 1987; Kobatake and Tanaka, 1994; Tsunoda et al., 2001; Brincat and Connor, 2004; Yamane et al., 2006, 2008), such as faces or animals (Gross et al., 1972; Bruce et al., 1981; Desimone et al., 1984; Perrett et al., 1985; Hasselmo et al., 1989; Nakamura et al., 1994; Tanaka, 1996; Tsao et al., 2006). Face-responsive/selective neurons give stronger responses, often twice as strong, to face images as compared with other images such as objects, geometric shapes, or scrambled face images. Information about faces is sparsely represented by a population of face-responsive neurons (Young and Yamane, 1992). The responses of these neurons are sensitive to the configuration of facial parts, such as the height of the forehead from the left eye to the hairline, the distance between the eyes and the mouth, or a combination of these parameters (Yamane et al., 1988; Freiwald et al., 2009). A possible role for IT face-responsive neurons in face recognition has been reported. The responses of the majority of face-responsive neurons are correlated with the animal's perceptual experiences Logothetis, 1997, 2001). IT neuronal responses have been shown to represent the configuration of facial parts that are useful for a categorization task in monkeys that are well experienced with that task (Sigala and Logothetis, 2002). In the anterior IT cortex, the response latency of neurons encoding view-invariant face-identity information correlates with the monkey's behavioral response latency during identification of a face (Eifuku et al., 2004). Performance in face discrimination tasks based on the configuration of facial parts is affected by cooling of the temporal cortex (Horel, 1993).

Time Course of Neuronal Information Processing of Faces
At each stage along the ventral pathway, namely areas V2, V4, and the IT cortex, stimulus selectivity of neural responses to complex images becomes sharper during the later portion of the response as compared with selectivity during the initial transient response. Van Essen (2004, 2006) studied neuronal responses to gratings (sinusoidal, hyperbolic, and polar gratings) and contour stimuli (bars, crosses, and angles) in areas V2 and V4. They found that the population response of the neurons was better able to categorize the stimuli into broad groups, e.g., gratings vs. contour, during the initial phase of the response (area V2, 40-80 ms after stimulus onset; area V4, 80-100 ms), and was able to distinguish between individual stimuli within the stimulus groups after the initial phase Van Essen, 2004, 2006). Brincat and Connor (2006) examined neuronal responses in the posterior part of the IT cortex to simple shape stimuli combining convex, concave, and straight contours. They found that information about individual contour fragments is carried in the initial transient response (90% maximum at 122 ms after stimulus onset) and information about the specific multipart contour configurations emerges gradually (90% maximum at 184 ms after stimulus onset). Tamura and Tanaka (2001) showed that the stimulus selectivity of neuronal responses to photographs of natural objects and geometric shapes in the anterior part of the IT cortex is sharper in the later portion of the response (after 240 ms from stimulus onset) than in the initial transient response (130 ms from stimulus onset).
The importance of temporal firing patterns during information coding of visual stimuli by IT neurons was first described by Optican, Richmond, and colleagues (Optican and Richmond, 1987) and has been confirmed since then. Tovee et al. (1993) showed that IT neuronal responses represent the greatest information about the test stimuli identity during the first 100-200 ms after stimulus onset. The robustness of the temporal firing pattern across days and weeks of visual experience was also shown (Bondar et al., 2009).
We examined neuronal responses in the anterior IT gyrus and the anterior part of the superior temporal sulcus in the IT cortex to visual stimuli, including geometric shapes and monkey/human faces with various expressions, using a fixation task (recording location shown in Figure 1A; Sugase et al., 1999). Some neurons showed initial transient responses to faces, but not to shapes, and maintained a sustained response to only a particular facial expression or identity, indicating sharper stimulus selectivity in the later portion of the response (e.g., Figure 1 of Sugase et al., 1999, and Figure 2 of Matsumoto et al., 2005b). Using a moving time window, we calculated the time course of the transmitted facial stimuli information with respect to both the global category (human faces vs. monkey faces vs. shapes) and sub- or fine categories within each category member (monkey facial expression, monkey identity, human facial expression, or human identity). We found that the latency and peak of information for the global categories were earlier than those of the fine categories (measured as the middle of the 50-ms analysis window; latency, 91 and 142 ms after stimulus onset; peak, 152 and 179 ms, for global and fine categories, respectively; Figure 1B). This result indicates that the temporal firing pattern is important for information processing of faces and suggests that the initial transient response and the later response encode different aspects of facial characteristics, i.e., information about global categories and about fine categories. The result that both the peak of the information for the global categories and the peak of the information for the fine categories were observed within 100-200 ms after stimulus onset is consistent with the finding by Tovee et al. (1993) and supports the notion that neuronal responses within this time period represent the greatest information about the stimuli identity.
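The moving-window analysis can be illustrated with a minimal sketch. It assumes a plug-in mutual-information estimate between a category label and the spike count in a 50-ms window; the function names, the 10-ms step, and the toy spike times are illustrative assumptions and are not taken from Sugase et al. (1999), whose estimator and any bias correction are not described here.

```python
import math
from collections import Counter


def mutual_information(labels, counts):
    """Plug-in estimate (bits) of the information between labels and spike counts."""
    n = len(labels)
    joint = Counter(zip(labels, counts))
    n_label = Counter(labels)
    n_count = Counter(counts)
    info = 0.0
    for (lab, cnt), k in joint.items():
        p_joint = k / n
        # p(l, c) / (p(l) * p(c)) simplifies to k * n / (n_l * n_c)
        info += p_joint * math.log2(k * n / (n_label[lab] * n_count[cnt]))
    return info


def windowed_information(trials, labels, window_ms=50, step_ms=10, t_end_ms=300):
    """Slide a window over spike times (ms from stimulus onset), one list per trial.

    Returns (window centre, information) pairs; the centre mirrors the text's
    convention of reporting the middle of the analysis window.
    """
    curve = []
    for start in range(0, t_end_ms - window_ms + 1, step_ms):
        counts = [sum(start <= t < start + window_ms for t in spikes)
                  for spikes in trials]
        curve.append((start + window_ms / 2, mutual_information(labels, counts)))
    return curve


def peak_time(curve):
    """Centre of the first window with maximal information."""
    return max(curve, key=lambda point: point[1])[0]


# Hypothetical toy neuron: every face evokes an early burst, and only
# identity "A" evokes an additional late burst. Shapes evoke no spikes.
face_a = [95, 105, 160, 170]
face_b = [95, 105]
trials = [face_a, face_a, face_b, face_b, [], [], [], []]
global_labels = ["face"] * 4 + ["shape"] * 4
fine_labels = ["A", "A", "B", "B"]  # fine category, face trials only

global_curve = windowed_information(trials, global_labels)
fine_curve = windowed_information(trials[:4], fine_labels)
print(peak_time(global_curve), peak_time(fine_curve))
```

In this toy case the global-category curve peaks in an earlier window than the fine-category curve, which is the qualitative pattern described above. With so few trials the plug-in estimate is strongly biased, so the sketch shows the shape of the analysis rather than realistic information values.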
Because each neuron displays a different temporal firing pattern (as shown in Figure 2 of Matsumoto et al., 2005b), we further analyzed the responses of a population of neurons and found that the separation of human vs. monkey vs. shapes was maximized at the [90, 140]-ms period after the stimulus onset, and that the separation of monkey facial expressions and human facial identities was maximized at the [140, 190]-ms period (Figure 1C; Matsumoto et al., 2005b). This result indicates that global categorization occurs earlier than finer categorization at the population level.
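One way to picture a window-by-window population comparison is sketched below. The separation measure used here (mean Euclidean distance between category centroids of population rate vectors) is a stand-in chosen for simplicity and is an assumption; it is not the analysis of Matsumoto et al. (2005b). The three hypothetical neurons and their firing rates are invented toy values.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length population vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def separation(groups):
    """Mean Euclidean distance between the centroids of every pair of categories."""
    cents = [centroid(vectors) for vectors in groups.values()]
    pairs = [(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy population responses (spikes/s of three hypothetical neurons),
# two stimuli per category, in two analysis windows (ms after onset).
window_data = {
    (90, 140): {
        "human": [[10, 2, 1], [11, 2, 1]],
        "monkey": [[2, 10, 1], [2, 11, 1]],
        "shape": [[1, 1, 1], [1, 1, 2]],
    },
    (140, 190): {
        "human": [[6, 3, 1], [7, 3, 1]],
        "monkey": [[3, 6, 1], [3, 7, 1]],
        "shape": [[1, 1, 1], [1, 2, 1]],
    },
}

# Window in which the global categories are farthest apart in this toy data.
best_window = max(window_data, key=lambda w: separation(window_data[w]))
print(best_window)
```

Applying the same function to fine-category groups (for example, expressions within monkey faces) would give the window of maximal fine-category separation, which is how the two time periods in the text would be compared under this simplified measure.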
Recent single-unit studies in non-human primates provide evidence that is consistent with our findings. Neurons in the anterior IT cortex respond to human faces on average 15 ms earlier (103 ms mean latency) than to non-primate animal faces (118 ms; Kiani et al., 2005), showing that the initial transient response is modulated with respect to the global category, i.e., human vs. animal. Tsao, Freiwald, and colleagues made remarkable advances in identifying face-processing systems along the ventral visual pathway, identifying six interconnected cortical regions that consist of face-selective neurons, i.e., face patches (Figure 2A; Tsao et al., 2003, 2006, 2008; Moeller et al., 2008; Freiwald et al., 2009; Freiwald and Tsao, 2010). In the middle face patch (ML and MF in Figure 2A), which is located within the superior temporal sulcus, categorization (faces, bodies, fruits, hands, gadgets, or scrambled images) performance by the neuronal response (133 ms) precedes identification performance (192 ms; Tsao et al., 2006), showing an earlier representation of the global category information than of the subcategory image identity information. Freiwald and Tsao (2010) found that in the anterior medial patch (AM in Figure 2A), which is located in the anterior part of the IT gyrus, information about facial identity across different facial views emerges gradually and peaks at around 300 ms after stimulus onset (Figure 2B). The fact that view-invariant facial identity information is represented in the response at a later period (∼300 ms) indicates that additional information processing is necessary to compute facial identity across different facial views, which may take a considerable amount of time.
This type of coding seems to be a plausible mechanism underlying the temporal dynamics of face recognition, e.g., the process of face detection/categorization followed by facial identification of an individual. The causal relationship between signals from face-responsive neurons in the IT cortex and face perception was addressed by Afraz et al. (2006). Microstimulation within face-selective sites around the time of the initial transient response, i.e., [50, 100] ms after stimulus onset, biased the monkey's decision toward faces during a face/non-face categorization task (Afraz et al., 2006). This result indicates that the initial transient response of face-selective neurons has a causal relationship with face perception, i.e., whether an image contains a face, and the perception of a global category. The causal relationship between the later portion of the response and face recognition, e.g., recognition of facial identity, remains to be elucidated.
There are at least three different possibilities for the underlying mechanism that governs how initial transient responses and later responses encode different aspects of facial characteristics. One possibility is that the initial transient response encodes achromatic information through the magnocellular pathway, which arrives at the IT cortex earlier than chromatic information through the parvocellular pathway. However, this is unlikely. Edwards et al. (2003) found that colored images evoked larger responses from IT face-responsive neurons than did achromatic images during the earliest part of the response. However, the possibility of the magnocellular and parvocellular pathways playing different roles in encoding fine/coarse facial information remains open for future studies.
The initial transient response may encode information about the low spatial frequency components of face images through the magnocellular pathway, and the later portion of the response may encode information about the high spatial frequency components through the parvocellular pathway. This idea is supported by human fMRI studies (Goffaux et al., 2010, see later section). The relationship between responses at different temporal domains and the high and low spatial frequency components of images remains to be elucidated at the single-unit level in non-human primate studies.
The second possibility is that the representation of each facial part is followed by complete representation of multiple facial parts. The third possibility is that a visual stimulus is processed at different scales, from global-to-local, with the passage of time. Several studies support the second and third possibilities. With respect to the second possibility, individual object parts of abstract shapes are represented during the initial transient responses (peak, 122 ms after stimulus onset), and multipart configurations are represented in the later portion of the responses (peak, 184 ms; Brincat and Connor, 2006), in the posterior part of the IT cortex (area TEO and posterior TE), which sends efferent projections to the anterior temporal cortex (Saleem et al., 2000; Borra et al., 2010). Consistent with the suggestion by Brincat and Connor (2006), neurons in the anterior IT cortex may represent a diagnostic area that is useful in detecting facial features, such as the eyes (Lewis and Edmonds, 2003), during the initial transient responses and represent information about the configuration of facial parts that is useful for recognition of facial identity (Tanaka and Farah, 1993) during the later portion of the response. The third possibility has been investigated using Navon (1977) figures, e.g., a large letter N consisting of small letter H's (Tanaka and Fujita, 2000; Tanaka et al., 2001; Sripati and Olson, 2009, 2010a,b). The behavioral response time to discriminate the global form is 20-30 ms earlier than that to discriminate the local form, for both monkey and human subjects (Tanaka and Fujita, 2000; Sripati and Olson, 2009). The global form is represented during the initial transient response, and the local form is represented approximately 30 ms later in the IT cortex (Sripati and Olson, 2009). Differences in the global-local latency of the neuronal response are related to the large-small latency difference.
These results indicate that the global signal emerges earlier, because shapes at a larger scale elicit discriminative neuronal activity earlier. Therefore, characteristics of faces at a larger scale, e.g., the outline of the face, may elicit differential neuronal responses earlier than do facial characteristics at a smaller scale, e.g., face parts.
With respect to the cellular mechanisms underlying the representation of global-to-fine category information, we speculated that signals related to fine categories emerge through intra-areal contribution or feedback from other areas (Sugase et al., 1999). The amygdala is one of the candidate areas that send feedback signals and affect the later portion of the responses related to fine category information, since the IT cortex receives projections from the amygdala (Amaral and Price, 1984; Amaral et al., 1992) and neurons in the amygdala respond to specific facial identity or expression (Leonard et al., 1985; Nakamura et al., 1992; Kuraoka and Nakamura, 2006; Gothard et al., 2007). With respect to the contribution of intra-areal connections, the fine category information might emerge through their recurrent connections. We tested this possibility using an attractor network model that consisted of excitatory and inhibitory neurons (Matsumoto et al., 2005a). Since the model reproduced the temporal dynamics of the responses of the face-responsive neurons, we speculated that the recurrent processing within the IT cortex might be sufficient to give rise to the later portion of the response. We proposed two physiological experiments to investigate whether the attractor network model is a plausible mechanism: one using noisy, fine feature degraded images, and the other using weakening connections between excitatory neurons within the anterior IT cortex. Although inhibition by γ-aminobutyric acid within the IT cortex contributes to neuronal stimulus selectivity (Wang et al., 2000), the relationship between the intra-areal inhibition (contribution of both the recurrent and lateral inhibition) and the temporal firing patterns of IT neurons remains unclear. In addition, the correlation between the responses of neuronal pairs in the IT cortex is higher during the presentation of face-like drawings than during the presentation of non-face-like drawings as early as [100, 300] ms after the stimulus onset (Hirabayashi and Miyashita, 2005). The result indicates that connections between neurons in a local circuit may be strengthened during information processing of face images.
There is another line of research, which suggests that the processing of visual stimuli occurs in a rapid feed-forward pass during the early portion of the response, with no role in the basic performance of recurrent/feedback processing during the later portion of the response (although it likely plays a role in top-down effects, such as attentional biases), i.e., the feed-forward hypothesis (Riesenhuber and Poggio, 1999; Serre et al., 2007). As supporting evidence for this hypothesis, Hung et al. (2005) showed that information about object categorization and identification can be read out from an unbiased sample of the IT neuronal site within a narrow time window (12.5 ms) at the earliest part of the response (starting from 125 ms after stimulus onset) using a support vector machine classifier. Since this study did not focus on face-responsive neurons, the time course to the readout of detailed information specific to human/monkey faces, facial identity and expression, is not reported.

Human Studies
At the behavioral level, visual recognition speed has been examined using large sets of images consisting of faces, animals, or objects. Thorpe and colleagues have shown that human observers are able to categorize briefly presented natural scenes with a reaction time of less than 400 ms and approximate accuracy of 94% using an animal vs. no-animal go/no-go task with manual responses (Thorpe et al., 1996; VanRullen and Thorpe, 2001; Rousselet et al., 2002). The reaction time of human subjects during the task was reported to be 100-180 ms (median reaction time) slower than monkey subjects (Fabre-Thorpe et al., 1998). Recently, these authors used saccadic eye movement for a choice task, and showed that subjects were especially rapid at detecting human faces. In the task, two images were simultaneously presented and the subjects made a saccade to an image of a particular target category, e.g., human faces or motor vehicles. The reaction time (start time) of the saccade was substantially shorter toward human faces than toward the other category images, e.g., motor vehicles (in their third experiment, human faces, 159 ms mean, 100 ms minimum; vehicles, 183 ms mean, 170 ms minimum; Crouzet et al., 2010). The reaction time was longer during the choice task than during a simple saccadic detection task, with only one stimulus being presented, both for the human face targets (at least 20 ms longer) and for the vehicle targets (80 ms longer). These results suggest that a lower number of processing steps is required to select faces than to select other objects. It remains to be clarified whether signals in the IT cortex, e.g., early global category information (face vs. non-face), are required for the saccadic choice. The face bias during the saccadic choice task is at least partially related to the low-level physical characteristics of the images, as scrambling the orientation contents of the images but not scrambling their relative positions disrupts the face bias (Honey et al., 2008).
The detection of an object or categorization of an object at a basic categorical level (bird, car, dog, etc.) is approximately 65 ms faster than within-category identification (Grill-Spector and Kanwisher, 2005). Face detection using synthetic faces occurs 24 and 31 ms earlier than viewpoint and face identification, respectively (Or and Wilson, 2010). This result suggests that face recognition is a process of face detection/categorization followed by viewpoint/face identification and that additional processing is required to obtain the latter information. On the other hand, face identification becomes as fast as face categorization depending on the familiarity of the face. For example, identifying a face such as that of Bill Clinton is as fast as categorizing the face as human or monkey (Tanaka, 2001; Anaki and Bentin, 2009). Studies examining whether information about each facial part is analyzed, accumulated, and integrated over time during face recognition remain controversial (Singer and Sheinberg, 2006; Anaki et al., 2007; Cheung et al., 2011).
Humans and macaque monkeys have similar brain systems for face-processing (Bell et al., 2009; Pinsk et al., 2009). The temporal dynamics of the neural correlates of facial information processing have also been studied in human subjects or patients using non-invasive brain-imaging techniques or invasive techniques. The results of these studies are consistent with studies on non-human primates and human behavioral studies. Studies using magnetoencephalography (MEG) showed that a face-selective response 100 ms after the stimulus onset (M100) was related to categorization of the stimuli as a face (Liu et al., 2002; Meeren et al., 2008) and that the later response at 170 ms after the stimulus onset (M170) was related to both facial categorization and recognition of individual faces (Liu et al., 2002). Both M100 and M170 were observed over the occipito-temporal cortex, though M100 was distributed slightly posteriorly. Since the existence of facial parts (eyes, nose, and mouth) is important for M100 regardless of their configuration, and the part-configuration is important for M170, Liu et al. (2002) speculated that M100 and M170 responses reflect processing related to the detection of diagnostic facial parts and to processing related to the analysis of the part-configuration, respectively, instead of global-to-local processing. The familiarity of faces affected the M170 response (Kloth et al., 2006) and other later responses (latency around 250-400 ms, M400) from the occipitotemporal sensors (Harris and Aguirre, 2008).
Event-related potential (ERP) studies have shown that the representation of facial categories and face-inversion effects occurs during an early negative component that peaks at around 170 ms after stimulus onset (N170, starting at 130-150 ms after stimulus onset; Rousselet et al., 2008). The strength of face selectivity of the N170 response (calculated from the peak amplitude between 140 and 200 ms) has been shown to be highly correlated across subjects with those of the hemodynamic response in the temporal lobe, i.e., the fusiform face area (FFA) and the posterior part of the right superior temporal sulcus, by a simultaneous ERP-fMRI study (Sadeh et al., 2010). A later response (e.g., a negativity between 300 and 500 ms from a parietocentral electrode) was shown to be associated with the familiarity of faces, suggesting that the late signals are related to familiar facial identity (Eimer, 2000). Electrocorticography (ECoG) recordings indicate that an enhanced activation in gamma power and evoked responses that occur 150-200 ms after stimulus onset (similar timing as N170) are associated with successful recognition of the stimulus category, i.e., face, house, or object (Fisch et al., 2009). Studies using intracranial or subdural electroencephalography (EEG) recordings showed that face-specific ERPs (face-N200) within the fusiform and IT gyri were affected neither by semantic priming and face-name learning/identification nor by selective attention, but that subsequent slow evoked potential was affected (P290 and N700, or ∼240 ms after stimulus onset, respectively; Puce et al., 1999; Engell and McCarthy, 2010).
These results indicate that the face-N200 is largely insensitive to cognitive task manipulations or attention, unlike the later responses, suggesting that the face-N200 reflects an initial and obligatory neural response that is related to the visual analysis of faces and that the later responses are susceptible to cognitive demands and the top-down control of attention. Quiroga et al. (2005) suggested an important role for the human medial temporal lobe (MTL; the hippocampus, amygdala, entorhinal cortex, and parahippocampal cortex) in recognizing facial identity. Some neurons in the MTL respond exclusively to images of a particular famous person regardless of different views, e.g., the actress Jennifer Aniston. The responses to a famous person are primarily observed during the later period, i.e., between 300 and 600 ms after stimulus onset (Quiroga et al., 2005; Mormann et al., 2008). The results indicate that the higher selectivity of the MTL neurons was found with longer latencies, in other words, in the later responses. Furthermore, these data suggest that interactions between the IT cortex and MTL may play an important role in the recognition of facial identity regardless of view variations.
A recent fMRI study revealed a role for temporal dynamics in face information processing, i.e., coarse-to-fine processing (Goffaux et al., 2010). By presenting face images that preserve either low spatial frequency or high spatial frequency for a 75, 150, or 300-ms duration, Goffaux et al. (2010) found that the coarse structure of a face, which is carried by low spatial frequency, is processed prior to the fine details transmitted by high spatial frequency in the FFA. At a 75-ms exposure, the responses to low spatial frequency face images are greater than the responses to high spatial frequency face images. In contrast, at 150 and 300-ms exposure, the responses to high spatial frequency face images, which contain local fine information useful for identification, are stronger than the responses to low spatial frequency face images. Another study suggests that coarse signals about visual stimuli, i.e., achromatic but with luminance-contrast stimuli (magnocellular biased stimuli), are derived rapidly through the dorsal visual pathway via the occipital visual cortex and orbitofrontal cortex to the IT cortex (Kveraga et al., 2007). These results support the hypothesis that the initial and later portions of neuronal responses in the IT cortex play important roles in the coarse and fine processing of face images, respectively.

Comparison of Monkey and Human Studies
Monkey and human studies suggest at least two temporal processing stages underlying the recognition of faces. The correlation between the two processing stages and neuronal activity has been shown by human MEG studies, indicating that M100 is related to face detection/categorization and M170 is related to face identification. The timing of the M100 response (averaged response peak time, 105 ms after stimulus onset, range 84.5-130.5 ms) and that of the M170 response (160 ms) overlaps with the peak time of global information (mean ± SD, 152 ± 57 ms) and that of fine information (179 ± 49 ms) reported by Sugase et al. (1999), respectively. The M100 response appears slightly earlier compared to the peak time of the global information, most likely due to M100 being distributed slightly more posteriorly than M170, thus including signals from posterior parts of the brain, i.e., the occipital.
The N170 response of ERP studies and the face-N200 of intracranial or subdural EEG recordings seem to reflect at least processing of the detection/categorization of faces. It is not apparent whether the later response (M400 for MEG, 300-500 ms for ERP, P290, and N700 for EEG) that is correlated with the familiarity of faces parallels the fine information reported by Sugase et al. (1999), or a slowly emerging signal reflecting the view-invariant identity information reported by Freiwald and Tsao (2010). Future studies at the single-unit level using non-human primates should address which temporal domains are influenced by the familiarity of faces.

Conclusion
The accumulated body of evidence on non-human primates shows that temporal firing patterns in IT cortex neurons are important for information coding of visual features of faces. It has been shown that the initial transient response represents the face category, and that the later response represents facial identity and facial expression. Future studies should investigate the role of signals in the later response of face recognition to reveal the causal relationship between the later response and face detection/categorization, recognition of facial identity, or even face familiarization. It should also be examined whether the temporal processing stages are specific to face images or are general properties of the visual system.

Acknowledgment
This work was supported by Grant-in-Aid for Scientific Research on Innovative Areas, "Face perception and recognition" from MEXT KAKENHI (21119528, 23119732), and MEXT KAKENHI (22700161).