What are the Visual Features Underlying Rapid Object Recognition?

Crouzet, Sébastien  M; Serre, Thomas

doi:10.3389/fpsyg.2011.00326

REVIEW article

Front. Psychol., 15 November 2011

Sec. Perception Science

volume 2 - 2011 | https://doi.org/10.3389/fpsyg.2011.00326

This article is part of the Research TopicThe timing of visual recognitionView all 12 articles

What are the visual features underlying rapid object recognition?

Sébastien M. Crouzet*

Thomas Serre

Cognitive, Linguistic, and Psychological Sciences Department, Institute for Brain Sciences, Brown University, Providence, RI, USA

Research progress in machine vision has been very significant in recent years. Robust face detection and identification algorithms are already readily available to consumers, and modern computer vision algorithms for generic object recognition are now coping with the richness and complexity of natural visual scenes. Unlike early vision models of object recognition that emphasized the role of figure-ground segmentation and spatial information between parts, recent successful approaches are based on the computation of loose collections of image features without prior segmentation or any explicit encoding of spatial relations. While these models remain simplistic models of visual processing, they suggest that, in principle, bottom-up activation of a loose collection of image features could support the rapid recognition of natural object categories and provide an initial coarse visual representation before more complex visual routines and attentional mechanisms take place. Focusing on biologically plausible computational models of (bottom-up) pre-attentive visual recognition, we review some of the key visual features that have been described in the literature. We discuss the consistency of these feature-based representations with classical theories from visual psychology and test their ability to account for human performance on a rapid object categorization task.

1. Introduction

Object recognition is concerned with determining the identity of an object in our visual field of view. Such process relies on visual representations that need to be both selective (recognizing our friend among many other faces) and invariant (recognizing our friend irrespective of drastic changes in visual appearance due to changes in position, size, viewpoint, illumination, or even facial expression; Ullman, 1996; Riesenhuber and Poggio, 1999). The computations carried out on these representations feel effortless and almost immediate (our subjective experience suggests that we know what it is that we are looking at as soon as we see it).

Progress in our understanding of the computational mechanisms underlying visual object recognition has been significant, with converging evidence from neuroscience, psychology, and computer science (Serre and Poggio, 2010). Shape and object category information has been traditionally associated with processing in the ventral stream of the visual cortex. A long-standing metaphor for the underlying processes is that of filtering. The princeps discovery was made by Hubel and Wiesel(1959, 1968), who first reported the existence of bar and edge detectors in the primary visual cortices of the mammalian brain. They further proposed the first cortical model of visual processing thereby suggesting that such selectivity for oriented bars could be achieved via selective pooling mechanisms from the spatial arrangements of center-surround ganglion cells in the Lateral Geniculate Nucleus (LGN; Hubel and Wiesel, 1962). These ideas later formed the basis of Marr’s primal sketch in his prominent computational theory of visual processing (Marr, 1982). Today, edge detection and spatial frequency analysis as the building block of early vision remains the dogma. However, our understanding of subsequent stages of processing along the visual hierarchy remains a matter of debate.

Marr famously postulated that the next stage of visual processing was concerned with the building of intermediate 2(1/2)D representations for surfaces toward the explicit construction of 3D representations for matching stimuli to internal representations of objects and/or storage into memory. These ideas motivated a subsequent theory by Biederman (1987), the recognition-by-components (RBC), which emphasizes the role of figure-ground segmentation and explicit encoding of spatial relations between 3D object parts. These 3D parts, named geons, are analogous to syllables in linguistics and constitute a generic vocabulary for representing objects with different combinations and spatial arrangements of these elements. A typical processing pipeline is sketched on Figure 1 (left) with key stages of visual processing including: edge detection → grouping → segmentation → matching.

FIGURE 1

Figure 1. (Left) Overview of the successive steps involved in classical theories of object recognition (see text for details). (Right) An alternative view is that the bottom-up activation of a loose collection of hardwired feature detectors via a hierarchy of increasing complex processing stages may provide a coarse initial visual representation for more complex routines and several feedforward/feedback iterations to solve specific tasks including contours detection, grouping, figure segregation as well as the computation of spatial relations between parts, and more generally, the parsing and interpretation of complex visual scenes (see for instance: Hochstein and Ahissar, 2002; Zheng et al., 2007; Epshtein et al., 2008; Serre and Poggio, 2010 for a review).

Around the same time, several psychophysical studies suggested that a coarse image analysis based on simple feature detectors could be done very rapidly in parallel across the visual field (Treisman and Gelade, 1980; Julesz, 1981; Bergen and Julesz, 1983). The study of what can be seen “at first sight” has since been intensively pursued using the visual search paradigm. Two prominent theories seem to account for most experimental data: the Feature Integration Theory by Treisman and Gelade (1980) and the Guided Search Theory by Wolfe (2006). Both suggest that simple image features such as color, orientation, motion, or size (see Wolfe and Horowitz, 2004 for an extensive review) can be processed pre-attentively and in parallel. However, any search for more complex combinations of features (e.g., T among Ls) for which hardwired feature detectors are not readily available will lead to reaction times that are dependent on the number of distractors in the display; a phenomenon consistent with a serial attentional process.

Studies conducted on natural visual scenes came to challenge some of these ideas by demonstrating the incredible speed and accuracy of our visual system for some of the most challenging visual recognition tasks in natural scenes. For instance, the rapid serial visual presentation (RSVP, Potter and Levy, 1969) and the rapid visual categorization (Thorpe et al., 1996) paradigms showed that human subjects are able to recognize (and remember) objects presented very rapidly in the absence of eye movements and potentially, shifts of attention. Further EEG studies measuring event related potentials (ERPs) directly on the scalp showed robust differential activity between target and distractor images within 150 ms after stimulus presentation (VanRullen and Thorpe, 2001). Recent studies using backward-masking (Bacon-Macé et al., 2005) and saccadic responses (Kirchner and Thorpe, 2006; Crouzet et al., 2010) suggest that recognition is possible under even more severe time constraints, possibly via a single feedforward sweep through the visual system (Lamme and Roelfsema, 2000; VanRullen and Koch, 2003). The underlying visual representation remains relatively coarse as it was shown that participants frequently fail to localize targets that they had correctly detected in an RSVP stream (Evans and Treisman, 2005). In particular, this seems inconsistent with recognition processes that rely on explicit encoding of spatial relationships between parts and suggest instead that rapid recognition may rely on the detection of an “unbound” collection of image features.

Consistent with this idea, the rapid recognition of natural object categories such as animals does not seem to require attention: The level of performance of human observers remains high even when two images are flashed simultaneously (Rousselet et al., 2002) and when stimuli are presented in the periphery while an attention-demanding (letter discrimination) task is performed at the fovea (dual-task paradigm, Li et al., 2002). To account for these results, VanRullen suggested that the recognition of natural object categories must be based on a dictionary of features that are hardwired in the visual system (VanRullen, 2007).

This idea is, in fact, consistent with the unsupervised learning mechanisms of feature hierarchies postulated by current models of the visual cortex (see Serre et al., 2007a for a review) and provides a compelling explanation for why ecologically important stimuli such as animals can be recognized in a dual-task paradigm but artificial stimuli such as a bicolor disks cannot: Through development, our visual system learns a dictionary of features that forms the basis of the position and scale tolerant representation which is found in higher level visual areas (Serre et al., 2007a). One hypothesis is that the underlying visual representation is well adapted to natural object categories but poorly adapted to artificial ones due to lack of training.

Overall, the argument above leaves open the question of the nature of the visual features that form the coarse image representation underlying rapid categorization tasks, which will be discussed in the next section. The goal of the present study is: (1) to investigate the ability of current biologically plausible models of rapid recognition to perform object recognition in natural scenes and (2) to compare the performance of these models with the state-of-the-art in computer vision as well as human observers on a rapid visual categorization task.

2. Computational Models of Visual Recognition

Progress in computer vision over the past decade has been significant. Challenging visual recognition tasks such as the recognition of objects in natural scenes are no longer considered to be beyond the reach of artificial vision systems. Face detection systems are now readily available on consumer-grade digital cameras and automated face identification algorithms are being integrated in digital photo library suites. Automated pedestrian detection and computer systems for driver assistance are already available in selected vehicles, and these will become standard equipment on most models by 2014¹.

Beyond domain-specific applications, computer vision systems for the generic recognition of objects are becoming increasingly robust, as reflected by their performance on competitions such as the Pascal Challenge². The number of object categories to be recognized has been increased steadily every year as the performance of the top computer vision systems has continued to improve. As it started in 2005, the challenge contained only four object categories. This year, the ImageNet Large Scale Visual Recognition Challenge³ involved the recognition of a thousand object categories and millions of images. Overall, computer vision databases have been growing rapidly over the years with systems now being routinely tested on hundreds of object categories (Russell et al., 2007; Torralba et al., 2008, 2010; Deng et al., 2009; Everingham et al., 2010; Xiao et al., 2010).

Similarly, progress in our understanding of the computational mechanisms underlying visual recognition in cortex has been significant. Computational models have been described that have been quantitatively fitted to monkey electrophysiology data for both the processing of shape information in the ventral stream (Rust et al., 2005; David et al., 2006; Cadieu et al., 2007; Lee et al., 2007; Serre et al., 2007c; Zoccolan et al., 2007; Li et al., 2009; Cao et al., 2011; Grossberg et al., 2011; but see also Kayaert et al., 2005; Kriegeskorte et al., 2008; Op de Beeck et al., 2008; Kayaert et al., 2011) and motion information in the dorsal stream (Rust et al., 2006).

In addition, several computational models of rapid categorization have been described. Figure 2 shows how computational models of visual recognition can be organized along three main dimensions: Feature complexity, sparse vs. dense representations, and statistical vs. non-statistical models. All the models included in this review have been shown to perform well on various visual tasks (e.g., natural scene classification, object detection, texture analysis), but the focus of the present study is limited to the ability of these models to perform well on specific visual task, namely a rapid animal categorization task (Thorpe et al., 1996). It is common to find the terms model and features used interchangeably in the literature. For clarity, in this review, we will refer to a model as a whole architecture (i.e., HMAX) and features for specific layers or components of a model (i.e., the C₁, C₂, C_2b, and C₃ features of the HMAX model).

FIGURE 2

Figure 2. Various computational models of visual recognition organized along three main dimensions: feature complexity, sparse vs. dense representations, and statistical vs. non-statistical models.

2.1. Feature Complexity

Virtually all biological models of visual processing start by assuming an initial filtering stage. The simplest image feature that has been suggested for natural images is the WEIBULL image contrast statistics, which measures the distribution of contrast values for an image that is readily available from the response of the X and Y cells in the LGN (Ghebreab et al., 2009; Scholte et al., 2009). WEIBULL image contrast statistics were described as a model for natural scene identification and were shown to provide a good model of the ERP selectivity observed in EEG data.

The bottom-up SALIENCY algorithm by Itti and Koch (Itti et al., 1998; Itti and Koch, 2001) and the GIST algorithm by Oliva and Torralba (2001) are two examples of algorithms based on relatively low-level image features. Compared with the WEIBULL image contrast statistics model described above, these two models correspond to processing in the next stage of the visual hierarchy. Such models are built on the output of filter pyramids such as Gabor or steerable filters that model processing by simple and complex cells as found in the primary visual cortex (Hubel and Wiesel, 1962).

In addition to orientation, the bottom-up SALIENCY algorithm also includes simple feature dimensions such as contrast and color information (although only gray-value stimuli were used in the present study). While there is no a priori reason for the image saliency to be predictive of the presence or absence of an object category, Elazary and colleagues have shown that objects in natural scenes tend to be more salient than the background (Elazary and Itti, 2008). The performance of the SALIENCY model for the animal categorization task thus constitutes an interesting baseline.

Mid-level TEXTON features corresponding to combinations of oriented linear filter responses were shown to account for the level of performance of human observers for the classification of visual scenes (Renninger and Malik, 2004). The task tested involved the classification of natural scenes in ten categories (beach, forest, mountain, city, farm, street, bathroom, bedroom, kitchen, and living-room). Features of similar complexity were also used in the TEXTSYNTH texture synthesis algorithm by Portilla and Simoncelli (2000) and were shown to predict human performance in crowding experiments (Balas et al., 2009; see also Freeman and Simoncelli, 2011).

At the top of this hierarchy are visual features of higher complexity corresponding to multiple stages of visual processing computed by the HMAX hierarchical model of visual processing (see Serre et al., 2007a for details). Here we consider four stages of this model: (1) The C₁ stage, which corresponds to the output of V1-like oriented complex cells (and similar to the GIST features); (2) the C₂; and (3) C_2B stages, which have been matched to the tuning properties of cells in intermediate areas of the ventral stream of the visual cortex (area V4: Cadieu et al., 2007; Serre et al., 2007a; and PIT: Serre et al., 2007a) and correspond to features tuned to combinations of V1-like complex (C₁) units at multiple orientations, exhibiting some tolerance to changes in the position and scale of the preferred stimulus; and (4) the C₃ stage that corresponds to combinations of units from the C₂ stage. From the C₁, C₂, and C_2B to the C₃ layer, the visual architecture builds a hierarchical feature representation that is both increasingly complex and invariant to 2D transformations such as changes in position and scale.

To provide a baseline for the biological models, we further considered a state-of-the-art machine vision system. This popular algorithm originally introduced by Lazebnik et al. (2006) is called the Spatial Pyramid (SPATIALPYR). The complexity of the features used in this algorithm is similar to the C₂/C₃ stage of the HMAX described above. The overall approach has been shown to perform well on a number of visual recognition tasks (Lazebnik et al., 2006, 2009; Bosch et al., 2007; Varma and Ray, 2007; Yang et al., 2009, 2010; Boureau et al., 2010a; Gao et al., 2010; Wang et al., 2010; Zhou et al., 2010). We believe this algorithm is representative of the current state-of-the-art in computer vision and is certainly one of the most popular.

2.2. Dense vs. Sparse

Another useful dichotomy between the various feature representations described above corresponds to the sparsity of the underlying visual representation. The WEIBULL, GIST (and C₁ stage), as well as the TEXTON, and the SPATIALPYR are based on dense representations whereby features are matched at every location of an image.

This can be contrasted with sparse representations such as the C_2B and C₃ stages of the HMAX. Rather than measuring the degree of similarity between an input image and a stored representation at every position and scale, the underlying similarity in such model is based on the best match between a stored template and the whole image (as computed by a max operation computed across all locations and scales). Such pooling mechanisms allow the underlying representation to be tolerant to changes in position and scale.

The TEXTSYNTH algorithm probably falls somewhere in between these two extremes as it computes the statistical mean of the match across an image. For a strongly peaked distribution (as is the case for a salient object), we expect the statistical mean to closely approximate a max pooling operation and therefore behave like a sparser representation. Conversely for more textured images, one expects a broader distribution and thus the approach to behave more like a dense representation.

2.3. Statistical vs. Non-Statistical Models

Non-statistical models here refer to algorithms that are based on features computed via a simple template matching operation. Such an operation encodes the similarity between an image patch and a stored representation. In the HMAX model, an image feature at the top of the hierarchy corresponds to the best match between every patch of an input image and a stored template via a max operation.

Similarly, the GIST and the SALIENCY algorithms as well as some of the features of the TEXTSYNTH algorithm are based on the response of feature detectors. The WEIBULL image contrast statistics, the TEXTON algorithm, and the SPATIALPYR are based on first order statistics over the computed features (i.e., histograms of the count of the index of the closest image feature over locations and scales). The TEXTSYNTH model also computes higher order statistics such as the skewness and kurtosis of the feature distributions.

3. Results

3.1. Models Performance in a Categorization Task

Animals in natural scenes constitute a challenging class of stimuli because of the very large intra-class variations that they present. This includes large changes in appearance (terrestrial, aerial, and water animals all with a large spectrum of possible sizes) and non-rigid changes in pose, as well as clutter and changes in size and position in the visual scene. The human data presented here appeared in a study by Serre et al. (2007b). To vary the difficulty of the task, four sets of balanced image categories were used (150 animals and 150 matching distractors in each set, i.e., 1,200 total stimuli; see Materials and Methods), each corresponding to a particular viewing distance from the camera, from an animal head to a small animal or groups of animals in cluttered natural backgrounds (see Serre et al., 2007b for details).

To minimize the role of cortical feedback by forcing visual processing to be based on a single feedforward sweep, as well as to try to minimize possible attentional shifts across the image, a backward-masking protocol (1/f dynamic noise image lasting 80 ms) was used with a long 50-ms stimulus onset asynchrony (20-ms stimulus presentation followed by a 30-ms interstimulus interval). It was found (Bacon-Macé et al., 2005) that increasing the SOA on a similar animal vs. non-animal categorization task beyond this value only has a minor effect on performance (accuracy scores for longer SOA conditions were not significantly different). At the same time, for this duration, the mask is expected to block significant feedback effects from higher level visual areas through back-projections.

Figure 3A provides an overview of the performance of the various models computed as d′ for a zero threshold value. Baseline performance by human observers (error rate calculated across all subjects; n = 22) is indicated with a black vertical line (the 95% confidence interval of the bootstrapped distribution is indicated with a gray shaded area). The estimated confidence intervals reveal that HMAX and TEXTSYNTH reached a higher level of performance than all other models (including the SPATIALPYR, a state-of-the-art computer vision system). Most importantly, the average performance of human participants fell within the confidence intervals of these two models. The performance of the remaining models decreased from the SPATIALPYR and GIST to TEXTON, WEIBULL, and SALIENCY.

FIGURE 3

Figure 3. Models performance in the animal vs. non-animal classification task. (A) Models performance measured with the d′ value (at a zero discrimination threshold value, see Materials and Methods for details). Average values (bold) as well as upper and lower limits of the corresponding confidence intervals (95% CI obtained over independent cross-validations) are reported with each bar. The performance of human observers is indicated with a black horizontal line (95% CI calculated across all subjects is indicated with a gray shaded area; see text for details). (B) Receiver Operating Characteristic (ROC) curves for all models tested. This shows the sensitivity (True Positive rate vs. False Alarm rate) of the different models as the discrimination threshold is being varied. Curves were averaged across multiple splits of the stimulus database used for training and testing the models. Shaded areas indicate 95% CI. Black dots correspond to the performance of individual human subjects (n = 22). (C) Performance of the various feature types in the TEXTSYNTH and HMAX models. A classifier was trained and tested independently on each feature type.

The performance of individual subjects is shown on Figure 3B overlaid with the Receiving Operator Characteristic (ROC) curves of the computational models (corresponding to their level of performance for all possible discrimination threshold values). Each of the black dots (n = 22) corresponds to the performance of one of the human participants. The best four models (HMAX, TEXTSYNTH, SPATIALPYR, and GIST) capture relatively well the variety of behaviors exhibited by human participants: Some participants seem closer to the GIST, others to the HMAX or TEXTSYNTH). Also, it seems that the TEXTSYNTH algorithm tends to perform best at regimes with higher false alarm rates while the HMAX tends to perform better at lower false alarm rates, which also seems to be the regime that best correspond to most human participants.

The HMAX and the TEXTSYNTH algorithms both rely on different types of features (see Materials and Methods). Do all features contribute equally to the reported classification results? To answer this question, we trained a classifier for each type of features separately for the two models. Figure 3C shows the classification performance of the resulting systems. Overall this analysis suggests that the key features are those encoding the correlations of the magnitude of responses of oriented filters for the TEXTSYNTH and the C_2B features for the HMAX. These two types of features indeed exhibit a similar level of complexity and tolerance to position and perform at a very similar level of performance (d′ ∼ 1.8). The C_2B features have been shown previously to contain a similar amount of category information (and similar tolerance to changes in position and scale) as a representative population of IT neurons (Serre et al., 2007a).

Figure 4 shows, for each model, the six images that were classified as most animal-like and most non-animal-like (as measured by the confidence of the classifiers trained on the features from each model and averaged over all random splits). From visual inspection, it appears that the most non-animal-like images for the three models that are based on first order statistics and higher (i.e., TEXTON, TEXTSYNTH, and the SPATIALPYR) are mostly repeated textured patterns typically associated with urban scenes (e.g., buildings). Consistent with this idea, animal images that are most similar to non-animal images correspond to far groups of similar animals (e.g., flock of birds). The most animal-like images correspond to animal heads and bodies on relatively simple, near-uniform backgrounds (grass, water, snow). This is consistent with the fact that these types of features have been traditionally used for the recognition of textures (Julesz, 1981).

FIGURE 4

Figure 4. The six most animal-like and non-animal-like images for each model. This was measured through the probability output of the classifier for each image and averaged over multiple random splits. When the six extreme images did not contain an error (e.g., an animal considered as very non-animal), the first error was added next with the corresponding score. Saliency was excluded because of its poor level of performance in the task.

Interestingly the WEIBULL image contrast statistics seem to capture well the complexity of the stimuli with the least/most animal-like images corresponding to subjectively more complex/simpler cluttered backgrounds. However, this led to relatively poor classification performance for the present animal vs. non-animal categorization task. The behavior of the GIST and HMAX algorithms seem a bit more complex to interpret. While the GIST seems to rely heavily on the presence of vertical elements in an image, typically associated with urban and man-made scenes, for classifying an image as non-animal, no simple pattern seems to explain what constitutes representative animal-like images for either ones of the algorithms.

How do the computational models predict human performance on an individual image basis? Figure 5 shows the correlation between the various models and human observers computed using an “animalness” score as described in Serre et al. (2007b). For human observers, this index was computed as the fraction of human observers that classified a specific image as an animal irrespective of whether its true label is animal or distractor. A score of 1.0/0.0 means that all participants classified this image as animal/non-animal. Any value in between reflects some variability across subjects.

FIGURE 5

Figure 5. (A) Correlation between the “animalness” scores (computed for each image) from human participants vs. computational models. For each model, the main bar corresponds to the correlation between human participants and models (a single value is computed over the whole image dataset hence no error bar shown). The diamond inserted in the bar corresponds to the correlation obtained from a standard permutation test (indicating chance level), the circle corresponds to a restricted permutation test (randomization was done for animal and non-animal stimuli separately). The most interesting difference is the one between the circle and the main bar, indicating the ability of the models to capture some of the intrinsic difficulty of individual images beyond simply classifying animal vs. non-animal images (see text for details). (B) Agreement matrix between each participant and the various computational models. Participants are sorted according the the highest agreement value. The correlation value for the best model and those that fall within its confidence interval are all highlighted in white.

Similarly for the computational models, a confidence score for each image was computed every time the image was selected as part of the test set (this score reflected the normalized distance to the separating decision function for this particular image on this particular split). Averaging these confidence scores across all splits resulted in one score per image that we correlated with the average scores obtained from the human observers. The bars in Figure 5A reflect the correlation coefficients obtained by correlating the score from human observers with the score given by each model for every image. The black horizontal bar corresponds to the inter-subject correlation between half sets of participants (procedure repeated 1,000 times with random half splits to get the 95% confidence interval indicated with the shaded area).

We found that the relative ranking of algorithms in terms of their ability to explain human performance at the single image level was indeed similar to the one based on the absolute performance as shown on Figure 3. To estimate how well correct classification alone for individual images impacts this score, we performed a standard permutation analysis (Good, 2000) where we shuffled all the scores randomly (diamond inset for each model) as well as a restricted permutation procedure where we shuffled the scores for the animal and non-animal images separately (circle inset). The correlations obtained from these two restricted permutation procedures were significantly lower than those obtained for all models, except for the WEIBULL (p = 0.058). This suggests that with this one exception, the computational models are all able to capture some of the intrinsic difficulty of individual images. However, the correlation between even the best models and human participants remains significantly lower than the inter-subject correlation (dark horizontal bar).

We further computed the agreement between each model and individual participants. In order to obtain the matrix shown in Figure 5B, predicted labels from each model for every cross-validation were compared to behavioral responses from each individual participant. This allowed to get an estimate of the agreement for each participant and each computational model (over 40 cross-validations). As shown in Figure 5B, for every subject, HMAX was either picked as the best model for each subject or fell within the confidence interval of the actual best model when not selected as the winner (all models that fall within the confidence interval of the best model for each subject, highlighted in white). However, the GIST also seems to match at least as well or slightly better for three of the participants (subjects 4, 10, and 19). Between-participant differences could thus reflect different visual processing strategies for the task (one possibly faster but less accurate strategy based on lower-level features and one possibly slower and more robust based on higher level features).

3.2. Robustness to Image Manipulations

Several image modification procedures are commonly used in experimental studies to assess the “high-levelness” of a visual process. Among them, the Fourier amplitude spectrum normalization and image inversion are two of the most common. It is generally believed that high-level visual processes should not be disturbed after normalization of the Fourier amplitude of the stimuli (amplitude information being low-level), but should suffer with image inversion (this modification perturbating high-level but not low-level information). It is thus interesting to test how the computational models presented here cope with these two images modifications. Considering what we know about these models and the features they extract, the results are also informative for the relevance of these image modifications in terms of how well they differentially impact high- vs. low-level visual processes.

3.2.1. Fourier amplitude normalization

It has been shown that the Fourier amplitude spectrum contains diagnostic information about the category of objects in natural scenes (Torralba and Oliva, 2003). Evidence regarding the use of this type of information by human participants seems restricted to specific object categories (e.g., face (VanRullen, 2006; Honey et al., 2008; Crouzet and Thorpe, 2011) but not animal category (Gaspar and Rousselet, 2009; Wichmann et al., 2010)). Computational models and the corresponding statistical classifiers are likely to take advantage of any bias in the statistics of the image sets and we thus verified whether such low-level cues could be driving the categorization performance in the previous experiment. We created a new set of images by mixing the original phase information of the original set with the averaged amplitude spectrum from all images (target and distractor images mixed as in Crouzet and Thorpe, 2011). This procedure allows to normalize the amplitude content (generally considered as lower-level visual information) while preserving the phase information (generally considered as higher-level).

The results (Figure 6A) first show that the performance of all the features is reduced by this image modification. However none of the features performance falls to chance level. The fact that simple low-level features like GIST, TEXTON, or the C₁ features still perform very well in the normalized condition highlights severe shortcomings associated with this procedure and its ability to completely remove low-level biases in a stimulus set. Looking more precisely at the difference between models, features like the MagnitudeStat (TEXTSYNTH) as well as the C_2B and C₃ (HMAX) cope very well with such image modification as observed with human participants (Gaspar and Rousselet, 2009).

FIGURE 6

Figure 6. Classification performance (d′) of the different models (features classified separately for the different components of Hmax and Textsynth) with common image modifications, compared to the performance on original images (A) Original image vs. Fourier amplitude normalized images. The normalized set was generated after averaging the amplitude spectrum content in the Fourier domain over all images (targets and distractors, a procedure identical to the one used in Crouzet and Thorpe, 2011). (B) Upright images vs. inverted images. Models were trained on upright images and then tested on upright (0°) or inverted (180° rotated) images. Human data from Serre et al. (2007b).

3.2.2. Rotation

It is often assumed that rotating images upside-down degrades high-level information while maintaining low-level cues (which is true at least for statistics like luminance distribution and contrast). Figure 6B demonstrates that the effect of rotation on models performance is consistent among most of the features, irrespective of their underlying complexity. Overall we found that the observed drop in performance for most models was indeed consistent with the pattern observed for human observers in rapid categorization tasks (Rousselet et al., 2003; Guyonneau et al., 2006; Serre et al., 2007b). This suggests that perhaps this image transformation does not quite produce the effect usually intended by experimenters.

3.3. On the Benefit of Hierarchical Models

From the results presented above, the MAGNITUDESTAT features (as part of the TEXTSYNTH model) and the features from the higher stages of the HMAX (specifically the C_2B units) remain the two key contenders. As discussed above, these two types of features do indeed share some similarities as they try to capture local combinations of orientations. One key difference between these two types of features remain their invariance properties with respect to 2D transformations. While the HMAX model was designed with the goal of explaining the invariance properties of IT cells (Riesenhuber and Poggio, 1999), TEXTSYNTH was developed as a general model of texture perception without any particular focus on the problem of invariant recognition. It is thus expected that the corresponding features will exhibit significantly less tolerance to changes in position and scale.

Here to assess the invariance properties of the MAGNITUDESTAT and the C_2B features, we used a methodology similar to Logothetis and Pauls, 1995; see also Riesenhuber and Poggio, 1999 as well as Pinto et al., 2011 and Pinto, 2011 for a recent treatment). Here invariance is measured by first estimating a “tuning curve” (obtained by correlating a feature vector corresponding to one object at a given scale with the same feature vector obtained for the same object at different scales). An average tuning curve is then obtained by averaging tuning curves across a set of objects. Similarly a distractor response for each object is obtained by estimating the maximum correlation between the feature vector corresponding to the reference object at a given scale with the same feature vector obtained for all other objects in the set at the same scale. These responses are then averaged across all distractors and invariance to scale is then defined at the range of scales for which the correlation between the original object and its rescaled values remains significantly higher than the response to the distractors. Using 17 linearly spaced scales and 100 real-world isolated objects, Figure 7 shows that the invariance level increases for the HMAX features throughout the hierarchy from C₁ to C₂ to C_2B/C₃. As seen on the Figure, the invariance properties of the C_2B and C₃ features remain larger than those of the TEXTSYNTH features. The fact that these two models exhibit almost identical levels of performance on the animal categorization task described above reflects a limitation of the dataset in taping in these invariant mechanisms.

FIGURE 7

Figure 7. Scale invariance in Hmax and Textsynth. “Tuning curves” were obtained by correlating feature vectors computed at one reference scale for one object from a set with feature vectors computed for the same object across multiple scales. Shown are tuning curves averaged over a set of 100 stimuli (colored curves, error bars correspond to the 95% CI). Thin black lines correspond to the average correlation for the best distractor (i.e., the distractor for which the correlation with the feature vector of the target object is highest (thin black line with 95% CI in shaded gray). The invariance properties of each model is then estimated by computing the “tuning width” such that the corresponding feature vector for a reference object remains more strongly correlated with feature vectors for the same object across scales than the best distractor at the reference scale (width displayed with the thick black line at the bottom of each graph).

As discussed in Serre and Poggio (2010), an HMAX-like representation, with built-in tolerance to position and scale of the stimulus, should, in principle, lead to a simpler classification function (such as a linear classifier as opposed to a higher order polynomial, for instance) that requires fewer training examples to achieve a specific level of performance, thus lowering the sample complexity of the recognition problem.

4. Discussion

We have reviewed current computational models of rapid categorization and compared their performance to human performance on a rapid animal vs. non-animal categorization task (Serre et al., 2007b). This performance comparison revealed that HMAX and TEXTSYNTH can reach a level of performance similar to the average human observer. The GIST, which (much like the TEXTSYNTH) was not originally designed for the recognition of objects, also stands as a realistic model of visual processing and, in particular, seemed to capture well the performance of several individual participants.

Our results also suggest that features of intermediate complexity (e.g., the C_2B features of the HMAX model and the MAGNITUDESTAT features of the TEXTSYNTH) seem to perform better than lower-level features and on par with higher level features (the C₃ features). This is consistent with earlier proposals that features of intermediate complexity are optimal for object classification (Ullman et al., 2002). At the same time, the performance of low-level models remains relatively high on this database suggesting that despite researchers best effort to build a difficult database with changes in the position and scale of animals in these images, the dataset does exhibit some biases.

This point was already raised by Pinto et al. (2008, 2011) who showed that “natural” image databases such as the popular CalTech-101, because of biases for position and scale, may sometimes favor simpler models that are void of invariance properties (see also Gintautas et al., 2010; Sanbonmatsu et al., 2010). As discussed by Riesenhuber and Poggio, 1999; see also Geman, 2006), object recognition requires a difficult trade-off between invariance and selectivity. In short, more invariant features tend to be less selective and vice-versa. For instance, in Pinto et al. (2011), the C_2B features from the HMAX described above were shown to perform worse than V1-like features as well as other computer vision benchmarks on the CalTech-101 dataset but significantly outperform all other approaches on an artificial dataset that exhibited more variations in the position and scale of the objects thus requiring a higher level visual representation.

This idea seems consistent with the somewhat good level of performance obtained with low-level feature representations such as the C₁ features or the GIST model. Future work should address this question by selecting image subsets that are easy/difficult for lower-level vs. higher-level representations (that are tolerant to 2D transformations) and correlating the performance of human observers vs. features of various levels of complexity on these subsets.

Looking at finer level correlation between computational models and human observers for individual images, we found that all tested models (with the exception of the WEIBULL) were able to capture, to some extent, some of the intrinsic difficulty of individual images beyond what can be predicted from mere performance and potentially reflect computational mechanisms similar to those used by humans. However, all models exhibited a correlation with human observers significantly lower than the inter-subject agreement, suggesting that a significant fraction of the variance in the data remains unexplained and that existing computational models do not yet fully account for the pattern of behaviors observed from human participants.

Would simple extensions of these models allow to account fully for the pattern of performance of human observers? One promising direction would involve exploring parameters (such as receptive field sizes and connectivity) by, for example, using computer-intensive parameter screening. Such an approach was shown to lead to significant improvements in classification accuracy on a face identification task using hierarchical HMAX-like/convolutional architectures (Pinto and Cox, 2011).

Another possibility could involve more efficient learning algorithms that the random sampling procedure currently implemented in the HMAX architecture. There is currently much work on the topic of feature learning both within the context of biological vision (Brumby et al., 2009) and computer vision. In computer vision, architectures related to the HMAX include deep learning networks Hinton (2010), convolutional networks (Kavukcuoglu et al., 2010; Zeiler et al., 2011), and grammar-based approaches (Fidler and Leonardis, 2007; Zhu et al., 2008). General coding strategies based on local feature pooling (Boureau et al., 2010b) and sparse coding (Mairal et al., 2008; Yang et al., 2009) constitute yet another possible avenue for improving the performance of the computational models.

Furthermore, all of the models were trained on relatively small datasets (in comparison to the number of parameters for the models) and were not explicitly optimized to match the performance of human observers (they were simply trained for the animal vs. non-animal categorization task). It is possible that higher correlation with human observers could be obtained by training these models explicitly on the pattern of responses from human observers.

Hierarchical models of the visual cortex are complex. Several layers of non-linearity coupled with the high-dimensionality of the inputs to the final classifier (due to the large number of features) makes it very hard to interpret what is driving classification accuracy in these models (Landecker et al., 2010; He et al., 2011). In particular, it has been suggested that classification of natural image categories by some of the models described above may rely more on contextual information, i.e., features computed from the background rather than the foreground (He et al., 2011). These types of behavior most likely results from the relative small number of images used to train and the comparatively high-dimensionality of the feature vectors used for classification.

He et al. (2011) described a hierarchical (probabilistic) model whereby natural object categories are represented by a coarse hierarchical probability distribution over object geometry and spatial configuration of object parts. Because of this and the need for (manual) segmentation of animal images for training the model suggests that this class of models might however require attentional mechanisms and cortical feedback. While it might be incompatible with the severe time constraints imposed by rapid categorization tasks, such models do however suggest possible avenues for improving the performance of the computational models beyond the first initial feedforward sweep. Similarly Chikkerur et al. (2010) have shown that an extension of the HMAX model with feature-based and spatial attention was able to further improve recognition performance of the model on the task.

However, it is important to realize the intrinsic limitations of the specific computational framework we have described and why it is at best a first step toward understanding the visual cortex. Some important limitations for these types of object representations based on a loose collection of image features is that they remain sensitive to the presence of background clutter (Serre et al., 2007b; Chikkerur et al., 2010), do not explicitly encode spatial relations between parts that are known to play a key role in object recognition (Biederman, 1987) and do not explicitly distinguish figure from ground (Lamme and Roelfsema, 2000). Most importantly, given enough time, humans use eye movement to scan images, and performance in many object recognition tasks improves significantly over that obtained during quick presentations.

While these models remain simplistic models of visual processing, they do suggest an alternative to the classical visual pipeline sketched on Figure 1 (left), which places an emphasis on bottom-up computations for grouping, Figure-ground segmentation, and spatial relations. Instead this alternative view suggests that the bottom-up activation of a loose collection of hardwired feature detectors via a hierarchy of increasing complex processing stages may provide a coarse initial visual representation for more complex routines and several feedforward/feedback iterations to solve specific tasks including edge detection, grouping, figure segregation, and the computation of spatial relations between parts, among others, and more generally the parsing and interpretation of complex visual scenes (see for instance, Hochstein and Ahissar, 2002; Bar, 2004; Zheng et al., 2007; Epshtein et al., 2008; Serre and Poggio, 2010 for a recent review).

5. Materials and Methods

5.1. Computational Models

Below we describe in greater detail the models used in this study. Most of the softwares were publicly available from the authors web sites and/or provided by the authors. We tried to equalize as much as possible the various model parameters when possible (e.g., number of frequency bands and orientations, etc.,). When the benefit on performance was not significant, default parameters were kept.

Saliency

Models of bottom-up saliency compute the local conspicuity of an image region with respect to its surround as measured, for instance, by local contrast, color, or orientation. Here we used the matlab Saliency Toolbox 2.1 (Walther and Koch, 2006). Low-level features (pixel intensity, orientations) were extracted at multiple scales and local conspicuity maps were computed using local center-surround mechanisms. Note that color was not used here because the stimuli used were grayscale. The resulting conspicuity maps were then combined to form a saliency map to predict the location of the highest saliency value for the whole image (Itti et al., 1998; Itti and Koch, 2001). This intensity value (single feature) was then used to try to predict the presence or absence of an animal in images.

Weibull

The distribution of local contrast in an image can be well fitted with the so-called WEIBULL distribution (Ghebreab et al., 2009; Scholte et al., 2009). Such distribution can be estimated from the output of zero-crossing detectors similar to the center-surround cells found in the LGN (Scholte et al., 2009). These authors further hypothesized that this information could be available very rapidly for the visual system and as such be used for rapid categorization. Indeed, they showed that the β and γ parameters of the WEIBULL distribution correlate well with EEG activity and could even allow identification of the precise image presented to a human subject among a small set of natural images (Ghebreab et al., 2009). More precisely, the β and γ parameters of the WEIBULL distribution define a space where images are ordered according to their level of clutter/complexity and texture similarity (Scholte et al., 2009). Here we used the code provided by the authors which uses the simple β and γ parameters from the fitted WEIBULL distribution. We found the performance of the two models presented in Scholte et al. (2009) to be very similar and only report here the performance of the simpler abstract one.

Gist

The GIST features correspond to the model by Oliva & Torralba, who have shown that the amplitude spectrum of images in the Fourier domain could be predictive of scene category, leading later to the concept of spatial envelope or global image signature (Oliva and Torralba, 2001, 2006; Torralba and Oliva, 2003). To create this representation, global features (Torralba and Oliva, 2003) were computed by convolving each image in the database with a Gabor filter pyramid (8 levels and 8 orientations) and further down-sampling the resulting filtered image to produce a 4 × 4 × 64 (=1,024) dimensional vector, which is then used for classification.

Texton

The TEXTON features were described in Renninger and Malik (2004). They were computed with a filter pyramid (M = 96 Gabor filters at 8 orientations, 6 scales, and 2 phases). A large number of random patches were extracted from hundreds of M-dimensional edge-response images (from the pre-training set) and subsequently clustered using k-means to find 100 cluster centroids. Each of these centroids then became a TEXTON feature. For every image, an image of TEXTON counts was computed by finding the index of the nearest TEXTON to the vector of filter responses at each pixel location and accumulating the counts over the whole image leading to 100-dimensional histogram vector used for classification as done in Renninger and Malik (2004).

TextSynth

TEXTSYNTH was originally presented as a model of parametric texture analysis/synthesis⁴ (Portilla and Simoncelli, 2000). Its success at producing new texture images from random noise that seem to be perceptually similar to a seed image makes it an interesting addition to our study. Recent work by Balas et al. (2009) and Freeman and Simoncelli (2011) suggests that the model accounts well for the representation of the early visual system, and in particular can stand as a good model for crowding in the visual periphery if filter size is made larger in the periphery.

This model first measures the responses of oriented linear filters, which are computed using a complex-valued steerable pyramid decomposition. This approximates the response of V1 complex cells tuned to different orientations and scales (4 orientations and 4 scales, higher values for these parameters were tested but did not improve performance) which tile all positions in the image. Then, the model computes joint statistics of these features to capture intermediate-level image structure. The statistics used in this model fall into four main categories: (1) MARGINAL STATISTICS: the marginal distribution of luminance in the image (i.e., mean, variance, skew, and kurtosis as well as skewness and kurtosis of the low-pass image); (2) LUMINANCE AUTOCORRELATION (as a proxy for the detection of periodic structures in the stimulus); (3) MAGNITUDE STATISTICS: the correlations of the magnitude of the responses of oriented wavelets across differences in orientation, neighboring positions, and scales (to capture simple structures in the image such as lines, edges, corners, and junctions); and (4) PHASE STATISTICS: phase correlation across scales (in order to capture the alignment of phase structure in local features).

Hmax

The HMAX model of object recognition combines a hierarchical build-up of invariance and complexity (inspired by Fukushima, 1980) with the idea of view-based recognition of 3D objects (Riesenhuber and Poggio, 1999, 2000). Here we used the extended model described by Serre et al. (2007b,c). Over the years, several related hierarchical models have been developed (Mel, 1997; Wallis and Rolls, 1997; LeCun et al., 1998; Riesenhuber and Poggio, 1999; Ullman et al., 2002; Amit and Mascaro, 2003; Wersing and Koerner, 2003; Masquelier and Thorpe, 2007; Mutch and Lowe, 2008; Jarrett et al., 2009; Pinto et al., 2011; Saxe et al., 2011). We here focus on the HMAX because the underlying parameters of the architecture were explicitly derived from available neuroscience data. This system-level computer model seems consistent with monkey electrophysiological data in different cortical areas of the ventral visual pathway (Serre et al., 2007a) as well as human behavioral data during rapid categorization tasks with natural images (Serre et al., 2007b). These findings suggested that bottom-up processes may provide a satisfactory description of the very first pass of information in the visual cortex.

Here, we used the GPU implementation developed by Mutch et al. (2010), with the default parameters of the HMAX demo with the exception of the number of orientations and scales that were set to 8 (to better match the parameters of the other models). The dictionary was learned/extracted on a pre-training set of 128 natural images (different from the ones used for the training and test of the recognition stage). Here we used 2,048 C₁ and C₂ features selected at random. All 2,048 C_2B and 1,024 C₃ features were used.

SpatialPyr

This state-of-the-art computer vision system (Lazebnik et al., 2006; Yang et al., 2009, 2010; Boureau et al., 2010a; Gao et al., 2010; Wang et al., 2010; Zhou et al., 2010) provides a useful baseline for the biologically inspired models described above. The approach is based on increasingly fine sub-divisions of an image into partitions and the computation of local features histograms within these sub-regions. The resulting spatial pyramid has been shown to provide a simple and computationally efficient representation as demonstrated by the high-level of performance of the system on several image classification tasks (Lazebnik et al., 2006; Yang et al., 2009, 2010; Boureau et al., 2010a; Gao et al., 2010; Wang et al., 2010; Zhou et al., 2010).

5.2. Image Database

Here we consider the animal and non-animal dataset from the study by Serre et al. (2007a). The dataset contains 1,200 images (600 animals and 600 non-animals). As a pre-processing step, images were converted to grayscale and resized to be 256 × 256. Two of the computational models tested (TEXTON and HMAX) required the learning of a dictionary of features. As done in Serre et al. (2007a), we considered an additional set of 128 natural images containing various categories (from animals and vehicles to landscapes and human faces) specifically for the extraction of features and the learning of codebooks in these two models.

5.3. Human Data

Here we compare the models presented above to the human behavioral data collected by Serre et al. (2007b). In this rapid categorization task, the images were flashed for 20 ms on the screen, followed by a blank screen for 30 ms, and then a mask appeared for 80 ms (the Stimulus Onset Asynchrony was thus 50 ms). The participants had to respond as quickly as possible, indicating whether they saw an animal or a distractor image by pressing one of two keys (see Serre et al., 2007b) for a more detailed description. A total of twenty-four human observers participated in the original study. Two of the participants were excluded because of their overall poor level of performance.

5.4 Classification

To assess the diagnosticity of the various visual features described above for the categorization of animal and non-animal images, a linear Support Vector Machine (SVM) classifier was used (Fan et al., 2008, LIBLINEAR 1.7). The procedure runs as follow: First, the 1,200 image from the animal/non-animal database were equally split in a training set and a test set that contains an equal proportion of target (300) and distractor images (300). Second, an optimal cost parameter C was determined through line search optimization using 8-fold cross-validation on the training set of images. An SVM classifier was then trained and tested on the various types of features (the exact number of features used depended on the type of models considered, see above). For each model, the reported results correspond to the average performance (and corresponding 95% confidence intervals) using a cross-validation procedure (n = 40) whereby different training and test sets were selected each time at random.

5.5 Invariance Test

Here we considered the database of images used in Konkle and Oliva (2011) containing 100 isolated real-world objects. Features were extracted from all images at seventeen different scales and the middle scale was selected as reference. For each object and computational model, we computed a “tuning curve” based on the correlation between the feature vector obtained for the reference scale and the feature vector obtained for all remaining scales for the same object. We also computed a “distractor curve” based on the maximum correlation between the feature vector obtained for the reference object at the reference scale with all remaining (distractor) objects at the same reference scale (see Logothetis and Pauls, 1995 for details). A t-test (Two-sample, False Detection Rate correction for multiple comparison, α corrected from 0.05 to 0.024) was performed at every scale in order to compare the values obtained for the “tuning curve” and for the “distractor curve.” This allows to obtain the “tuning width” of the representation (thick black bar on Figure 7).

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors would like to thank the original authors of the computational models for providing the code or having made their code available on the internet. This work was supported by DARPA (DARPA-BAA-09-31). Part of this research was conducted using computational resources and services at the Center for Computation and Visualization, Brown University.

Footnotes

References

Amit, Y., and Mascaro, M. (2003). An integrated network for invariant visual detection and recognition. Vision Res. 43, 2073–2088.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Bacon-Macé, N., Macé, M. J.-M., Fabre-Thorpe, M., and Thorpe, S. J. (2005). The time course of visual processing: backward masking and natural scene categorisation. Vision Res. 45, 1459–1469.

Pubmed Abstract | Pubmed Full Text | CrossRef Full Text

Balas, B., Nakano, L., and Rosenholtz, R. (2009). A summary-statistic representation in peripheral vision explains visual crowding. J. Vis. 9, 1–18.