Original Research ARTICLE
Object Categorization in Finer Levels Relies More on Higher Spatial Frequencies and Takes Longer
- 1Department of Computer Science, School of Mathematics, Statistics, and Computer Science, University of Tehran, Tehran, Iran
- 2School of Biological Sciences, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
- 3CERCO UMR 5549, Centre National de la Recherche Scientifique, Université de Toulouse 3, Toulouse, France
The human visual system contains a hierarchical sequence of modules that take part in visual perception at different levels of abstraction, i.e., superordinate, basic, and subordinate levels. One important question is to identify the “entry” level at which the visual representation is commenced in the process of object recognition. For a long time, it was believed that the basic level had a temporal advantage over the other two. This claim has been challenged recently. Here we used a series of psychophysics experiments, based on a rapid presentation paradigm, as well as two computational models, with bandpass filtered images of five object classes to study the processing order of the categorization levels. In these experiments, we investigated the type of visual information required for categorizing objects at each level by varying the spatial frequency bands of the input image. The results of our psychophysics experiments and computational models are consistent. They indicate that different spatial frequency information has different effects on object categorization at each level. In the absence of high frequency information, subordinate and basic level categorization are performed less accurately, while superordinate level categorization is performed well. This means that low frequency information is sufficient for the superordinate level, but not for the basic and subordinate levels. These finer levels rely more on high frequency information, which appears to take longer to be processed, leading to longer reaction times. Finally, to rule out a ceiling effect, we evaluated the robustness of the results by adding different amounts of noise to the input images and repeating the experiments. As expected, the categorization accuracy decreased and the reaction time increased significantly, but the trends were the same. This shows that our results are not due to a ceiling effect.
The compatibility between our psychophysical and computational results suggests that the temporal advantage of the superordinate (resp. basic) level over the basic (resp. subordinate) level is mainly due to computational constraints (the visual system processes higher spatial frequencies more slowly, and categorization in finer levels depends more on these higher spatial frequencies).
An object can be categorized in different levels of abstraction, including the superordinate (e.g., animal), basic (e.g., bird), and subordinate (e.g., duck) levels. The processing order of these levels is still debated. Several studies suggest that categorization in the basic level is completed prior to the superordinate level (Tanaka and Taylor, 1991; Collin and Mcmullen, 2005; Rogers and Patterson, 2007; Dehaqani et al., 2016). On the other hand, the advantage of the basic level has been challenged by studies showing faster visual processing for the superordinate level in rapid-presentation experiments. Using two-forced-choice behavioral experiments with long (i.e., 500 ms) and short (i.e., 50 ms) presentation times, Bowers and Jones (2008) showed that superordinate level categorization (object/texture images) is completed before the basic level (e.g., dog/bus). Also, Macé et al. (2009) found that rapidly presented (26 ms) natural images were categorized faster at the superordinate level than at the basic level. Using a forced-choice saccadic task, Wu et al. (2014) found that humans can accurately perform superordinate level categorization at 120 ms, while the accuracy of basic level categorization is around chance level. Although Mack and Palmeri (2015) challenged the rapid presentation paradigm for studying the processing order of categorization levels, studies by Poncet and Fabre-Thorpe (2014) and Vanmarcke et al. (2016) showed that the advantage of the superordinate level is not affected by the stimulus duration (25–500 ms) and diversity. Also, Praß et al. (2013) showed that the background context and animacy have no effect on the superordinate level advantage.
There is evidence indicating that the visual system processes visual input in order from low to high spatial frequencies (HSFs) (Schyns and Oliva, 1994; Macé et al., 2005; Kauffmann et al., 2015), or from general shape to fine details. However, it is still disputed how the brain processes different spatial frequencies (Kauffmann et al., 2014). Bar et al. (2006) suggest that low spatial frequencies (LSFs) are analyzed quickly and provide an initial and general guess about the object, which then facilitates the object categorization. Indeed, it is suggested that the LSFs are rapidly conveyed by the magnocellular pathways into the high cortical areas (e.g., orbitofrontal cortex). There, a coarse object representation is formed and then back-projected to the inferior temporal cortex to refine the subsequent processing of HSFs conveyed by the parvocellular pathways through the ventral visual cortex (Bar et al., 2006; Kauffmann et al., 2015). There are also theoretical studies that provide mathematical frameworks, like the hierarchical Bayesian models (Lee and Mumford, 2003), suggesting how such an integration of top-down contextual priors and bottom-up information can help the visual cortex to implement a probabilistic inference about the observed objects.
Therefore, studying the impact of spatial frequencies on humans' accuracy and reaction time (RT) in object categorization tasks at different levels can help to resolve the question of the entry categorization level. Using speeded category verification tasks, Collin and Mcmullen (2005) showed that basic level categorization is completed earlier than the superordinate and subordinate levels. The main criticism of these types of experiments is the use of semantic labels, which involves semantic processing in the brain (Wu et al., 2014). Binding the visual representation of an object to its name takes time, which may be different for each categorization level (Macé et al., 2009).
Here, we used a rapid presentation paradigm with frequency-filtered images to study the processing order of the categorization levels. Indeed, subjects were asked to determine the category of the object image presented for a duration of 12.5 ms in one of the superordinate, basic, and subordinate levels, when the image was intact or bandpass filtered into one of the LSF, HSF, or intermediate spatial frequency (ISF) bands. For each categorization level, we performed several psychophysics tasks in which images of different object categories were used. This way, we could check whether the results are independent of target object categories.
The results of our psychophysics experiments indicated that superordinate level categorization mainly relies on LSFs, while the basic and subordinate levels require higher spatial frequencies. Indeed, for the superordinate level, the human categorization accuracy peaks at the LSF band and drops with the ISF and HSF bands. On the contrary, for the basic and subordinate levels, the accuracy increases with increasing spatial frequency (with a greater slope in the subordinate level). However, the RT always increases with the spatial frequency, whatever the categorization level, which is compatible with the processing order of spatial frequencies (from low to high). Also, RTs are shortest at the superordinate level and longest at the subordinate level, whatever the frequency band. These findings support the temporal advantage of the superordinate over the basic level, as well as of the basic over the subordinate level.
Computational models can be used to investigate whether human behavior is caused by information constraints or by specific neural processing in the brain. Yu et al. (2016) used a computational approach to find the set of category-specific features suitable for object categorization in each of the abstraction levels and showed that their model can explain the human behavior. We also evaluated two object categorization models on the same set of experiments in different categorization levels using images in different frequency bands. This helps to investigate whether the results of our psychophysics experiments are due to the specific processing mechanisms of the visual system or are forced by the information content in each frequency band.
Interestingly, the categorization accuracies of both models strongly correlate with human accuracies in all categorization levels and frequency bands. This suggests that, from a computational point of view, the LSF band carries sufficient visual information to perform superordinate level categorization, while for the basic and subordinate levels, higher frequencies are required. Thus, since lower spatial frequencies are mainly processed earlier than higher ones (Schyns and Oliva, 1994; Macé et al., 2005; Kauffmann et al., 2015), the superordinate level appears to be the entry categorization level, and subsequently, the basic and subordinate levels are completed.
Further, we added different amounts of phase noise to the images and repeated all the psychophysics and computational experiments, to check for any possible ceiling effect. By increasing the noise level, the accuracies (resp. RTs) in all categorization levels and frequency bands are decreased (resp. increased). However, at each noise level, the trend of the accuracies and RTs is the same as the noise-less experiments. Again, this confirms that our results are caused by the visual information at each frequency band.
2. Materials and Methods
We used images of five object categories, including ducks, pigeons, cats, dogs, and cars (200 images per category). Most of the images were picked from the Imagenet dataset (Russakovsky et al., 2015), and others were gathered from the web. Each image contains the side view of a different instance of one of the object categories. For each categorization level, a different combination of categories was used. For the superordinate level (animal vs. non-animal), there were four different sets of images in two categories: one of the four animals and the car category. For the basic level (bird vs. non-bird animals), there were also four sets, each containing one bird (duck or pigeon) and one non-bird animal (cat or dog). In the subordinate level, only the duck and pigeon categories were employed. Figure 1 shows some examples from each category as well as the hierarchy of the categorization levels. Note that the images were converted to grayscale and cropped to 300 × 300 pixels.
Figure 1. (A) Examples of image stimuli and the hierarchy of three categorization levels. Levels are specified by different colors, i.e., superordinate in blue, basic in red and subordinate in green borders. (B) A sample image bandpass filtered into different spatial frequency bands and contaminated with different amounts of phase noise. Columns specify different frequency bands: full-band (unfiltered image), LSF, ISF, and HSF bands respectively from left to right. Rows correspond to the noise levels: 0% (without noise), 20, 30, 40, and 50% phase noise respectively from top to bottom.
All the images were also filtered into LSF, ISF, and HSF bands (Figure 2). To produce the filtered images, each original image was first Fourier transformed into the frequency domain, then multiplied by a 2D frequency filter to keep the desired frequencies, and finally transformed back to the spatial domain using the inverse Fourier transform. Here, we used a 2D Gaussian low-pass function to construct the desired frequency filters. The general forms of the Gaussian low-pass, HLP, and Gaussian high-pass, HHP, filters are as follows:
where u and v are the frequency coordinates, M and N are correspondingly the maximum frequency component at each frequency dimension, and 0 ≤ Fl ≤ 1 is the frequency cut-off rate. Using two Gaussian filters, say HLP1 and HLP2 with the corresponding cut-off rates of Fl1 and Fl2 (Fl1 < Fl2), the band-pass filter, HBP, is calculated as:
To prepare the LSF-, ISF-, and HSF-filtered images, we used respectively HLP1, HBP and HHP2 (= 1 − HLP2) filters with cut-off frequency rates of Fl1 = 0.25 and Fl2 = 0.60.
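The filtering pipeline described above (Fourier transform, multiplication by a 2D filter, inverse transform) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the exact Gaussian form, the normalization of each frequency axis by its maximum, and the band-pass built as HLP2 − HLP1 are assumptions made for this sketch. Only the cut-off rates (0.25 and 0.60) and the relation HHP2 = 1 − HLP2 come from the text. The function names are hypothetical.

```python
import numpy as np

def gaussian_lowpass(shape, cutoff):
    """Gaussian low-pass transfer function on the FFT frequency grid.

    Assumed form: exp(-((u/M)^2 + (v/N)^2) / (2 * cutoff^2)), where u/M and
    v/N are frequencies normalized by the maximum frequency of each axis.
    """
    rows, cols = shape
    # fftfreq spans [-0.5, 0.5) cycles/pixel; dividing by 0.5 normalizes to [-1, 1).
    u = np.fft.fftfreq(rows) / 0.5
    v = np.fft.fftfreq(cols) / 0.5
    uu, vv = np.meshgrid(u, v, indexing="ij")
    return np.exp(-(uu ** 2 + vv ** 2) / (2.0 * cutoff ** 2))

def band_filters(shape, fl1=0.25, fl2=0.60):
    """LSF, ISF, and HSF transfer functions (band-pass form is an assumption)."""
    lp1 = gaussian_lowpass(shape, fl1)
    lp2 = gaussian_lowpass(shape, fl2)
    return {"LSF": lp1, "ISF": lp2 - lp1, "HSF": 1.0 - lp2}

def apply_filter(image, transfer):
    """Multiply the image spectrum by the filter and return to the spatial domain."""
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * transfer))

# Stand-in for a 300 x 300 grayscale stimulus.
rng = np.random.default_rng(0)
image = rng.random((300, 300))
filters = band_filters(image.shape)
bands = {name: apply_filter(image, h) for name, h in filters.items()}
```

Under these assumptions the three transfer functions sum to one at every frequency, so the three filtered images add up to the original. That property belongs to this sketch and is not necessarily true of the filters used in the study.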
Figure 2. A sample image filtered into different spatial frequency bands using Gabor filters and Gaussian bandpass filters. Images filtered into the LSF, ISF, and HSF bands are shown from left to right, respectively. (A) The image of each frequency band is the accumulation of the outputs of Gabor filters in four orientations (0, 45, 90, and 135°). (B) Gaussian bandpass filtered images using HLP1, HBP, and HHP2 filters (see Section 2.1).
We also prepared a noisy version of the original and frequency-filtered images. We added phase noise at different levels (20, 30, 40, and 50%) to each image. Unlike other noise generating methods (e.g., simple white noise), the phase noise produces a noise signal that is proportional to the energy of the image at each spatial frequency. Indeed, it consists of frequency components of the image that have been displaced; therefore, the phase noise has exactly the same energy distribution as the image itself. We used a noise addition mechanism analogous to that of Ales et al. (2012).
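The key property of phase noise stated above, a noise signal with the same energy at each spatial frequency as the image, can be illustrated with a short sketch. The exact procedure of Ales et al. (2012) is not reproduced here. Drawing random phases from white noise and mixing image and noise linearly by the noise level are assumptions of this illustration, and the function names are hypothetical.

```python
import numpy as np

def phase_scrambled(image, rng):
    """Noise image with the amplitude spectrum of `image` and random phases."""
    amplitude = np.abs(np.fft.fft2(image))
    # Phases of real-valued white noise are conjugate-symmetric, so the
    # inverse transform below is real up to floating-point error.
    random_phase = np.angle(np.fft.fft2(rng.standard_normal(image.shape)))
    return np.real(np.fft.ifft2(amplitude * np.exp(1j * random_phase)))

def add_phase_noise(image, level, rng):
    """Blend the image with its phase-scrambled copy (assumed linear mixing rule)."""
    noise = phase_scrambled(image, rng)
    return (1.0 - level) * image + level * noise

rng = np.random.default_rng(1)
image = rng.random((64, 64))                    # stand-in for a stimulus
noise = phase_scrambled(image, rng)
noisy = {lvl: add_phase_noise(image, lvl, rng) for lvl in (0.2, 0.3, 0.4, 0.5)}
```

The scrambled image keeps the amplitude spectrum of the original, which is the sense in which the noise energy follows the image energy across frequencies.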
2.2. Psychophysics Experiments
We performed 12 rapid two-forced-choice object categorization experiments covering three categorization levels (superordinate, basic, and subordinate), two image types (i.e., original and frequency-filtered images), and two noise conditions (i.e., images with and without noise). Each trial started with a fixation point presented on a uniform gray background for 500 ms. Then, a stimulus image was shown for 12.5 ms (one frame on an 80 Hz monitor) followed by a uniform gray screen, presented for another 12.5 ms, as an inter-stimulus interval (ISI). Immediately afterward, a 1/f noise mask was shown for 150 ms. Finally, subjects reported the category of the stimulus image by pressing the corresponding key on a keyboard. Each experiment session started with a training phase in which subjects learned to do the categorization task at the desired level, followed by a recording phase in which we recorded the subjects' RT and performance. The training phase of each session contained 20 randomly selected images (10 images per category). When subjects reported their decision, feedback was shown indicating whether they had responded correctly. The recording phase contained 240 trials (120 images per category) without any feedback. Images used in the training phase were not shown in the recording phase.
Subjects were seated on a comfortable chair in a dark room and were instructed to respond as quickly and accurately as possible. Stimuli were presented using Matlab Psychophysics Toolbox (Brainard, 1997) on a 17” CRT monitor with a resolution of 800 × 600 pixels, a frame rate of 80 Hz, and a viewing distance of 60 cm. Therefore, each stimulus covered 11 × 11° of visual angle. In our psychophysics experimental setting, the cut-off values of the LSF and HSF filters correspond to ~2 and ~5 cycles per visual degree, which are compatible with the study of Guyader et al. (2017). Notably, from a 60 cm distance, our monitor has a resolution of ~7 cycles per visual degree.
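As a worked check of the trial timing, the durations given above can be converted into frame counts at the 80 Hz refresh rate. This is only an arithmetic sketch. The dictionary keys and the helper name are hypothetical, and the code does not represent the authors' presentation script.

```python
REFRESH_HZ = 80  # monitor refresh rate reported in the text

# Durations of the trial phases, in milliseconds, as given in the text.
TRIAL_MS = {"fixation": 500.0, "stimulus": 12.5, "isi": 12.5, "mask": 150.0}

def to_frames(duration_ms, refresh_hz=REFRESH_HZ):
    """Convert a duration to a whole number of frames, rejecting non-integer counts."""
    frames = duration_ms * refresh_hz / 1000.0
    if abs(frames - round(frames)) > 1e-9:
        raise ValueError(f"{duration_ms} ms is not a whole number of frames")
    return int(round(frames))

frames = {phase: to_frames(ms) for phase, ms in TRIAL_MS.items()}
# Time from stimulus onset to mask onset (stimulus + ISI), in milliseconds.
stimulus_to_mask_ms = TRIAL_MS["stimulus"] + TRIAL_MS["isi"]
```

Every phase maps onto a whole number of frames (40, 1, 1, and 12), and the mask follows stimulus onset by 25 ms.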
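The relation between stimulus size, viewing distance, and visual angle used above can be written out explicitly. The sketch below takes the reported 11° and 60 cm as inputs and derives the physical stimulus width they imply. The physical width is not reported in the text, so it is a derived quantity under the standard visual-angle formula, and the function names are hypothetical.

```python
import math

VIEW_DISTANCE_CM = 60.0   # reported viewing distance
STIMULUS_DEG = 11.0       # reported stimulus extent in degrees of visual angle
STIMULUS_PX = 300         # reported stimulus size in pixels

def size_for_angle(angle_deg, distance_cm):
    """Physical extent that subtends `angle_deg` at `distance_cm`."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

def angle_for_size(size_cm, distance_cm):
    """Visual angle (degrees) subtended by `size_cm` at `distance_cm`."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))

stimulus_cm = size_for_angle(STIMULUS_DEG, VIEW_DISTANCE_CM)   # about 11.6 cm
pixels_per_degree = STIMULUS_PX / STIMULUS_DEG                 # about 27 px/deg
```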
All subjects voluntarily participated in the experiments and gave their written consent prior to participation. All experimental protocols were approved by the ethical committee of the University of Tehran. All experiments were carried out in accordance with the guidelines of the Declaration of Helsinki and the ethical committee of the University of Tehran.
For the superordinate level experiments, there were four animal/non-animal tasks (one of the four animals vs. car). Also, there were four bird/non-bird tasks (duck or pigeon vs. cat or dog) for the basic level, and one bird categorization task (duck vs. pigeon) for the subordinate level experiments. Regarding the image type and noise condition, we classify all the experiments into four groups:
• Original images (i.e., full-band):
In these experiments, the original images were used as stimuli. 40 subjects participated in the superordinate level experiment (10 for each task), 40 subjects performed the basic level experiment (10 for each task), and 20 subjects did the subordinate level experiment. Images were randomly shuffled and shown in different trials.
• Frequency-filtered images:
For these experiments, we used filtered images in LSF, ISF, and HSF bands (see Section 2.1). The number of subjects who participated in superordinate, basic, and subordinate level experiments was the same as in the “Original images” case. For each frequency band, 40 images per category were used in the recording phase of each experiment session (2 [category] × 3 [frequency bands] × 40 [images] = 240 images). For the training phase, we also used frequency-filtered images. Notably, images were presented in a random order.
• Noisy images:
In these experiments, we used the noisy version of the original images (see Section 2.1) with four different noise levels (20, 30, 40, and 50 %). Here again, we had 40 subjects for each of the superordinate and basic level experiments and 20 subjects for the subordinate level. We used 30 images for each noise level, and thus, in total 120 images per category were presented in each task (2 [category] × 4 [noise levels] × 30 [image] = 240 images). Similar to the previous case, the order of images was random.
• Frequency-filtered noisy images:
Frequency-filtered images contaminated with noise (Section 2.1) were used in this series of experiments. Here again, the number of subjects in the superordinate, basic, and subordinate level experiments was the same as in the “Original images” case. For each frequency band and noise level, 10 images per category were used in the recording phase of each experiment session (2 [category] × 3 [frequency bands] × 4 [noise levels] × 10 [images] = 240 images). For the training phase, we also used frequency-filtered images at different noise levels. Images were presented in random order.
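The trial counts of the four experiment groups listed above can be checked by enumerating the condition cells of each design. The sketch below only reproduces the arithmetic stated in the text. The dictionary layout and names are hypothetical.

```python
from itertools import product

N_CATEGORIES = 2  # every task is a two-category decision

# Frequency bands, noise levels (%), and images per (category, band, noise) cell.
DESIGNS = {
    "original":       {"bands": ["full"],              "noise": [0],              "per_cell": 120},
    "filtered":       {"bands": ["LSF", "ISF", "HSF"], "noise": [0],              "per_cell": 40},
    "noisy":          {"bands": ["full"],              "noise": [20, 30, 40, 50], "per_cell": 30},
    "filtered_noisy": {"bands": ["LSF", "ISF", "HSF"], "noise": [20, 30, 40, 50], "per_cell": 10},
}

def recording_trials(design):
    """Number of recording-phase trials: condition cells times images per cell."""
    cells = list(product(range(N_CATEGORIES), design["bands"], design["noise"]))
    return len(cells) * design["per_cell"]

trial_counts = {name: recording_trials(d) for name, d in DESIGNS.items()}
```

All four designs yield 240 recording trials per session, so session length is matched across experiment groups.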
2.3. Computational Models
To investigate the information content in each frequency band for the different categorization levels, we used two computational models. Each model was evaluated on categorization tasks similar to those performed in our psychophysics experiments. Indeed, we performed the superordinate, basic, and subordinate level categorization tasks with the full-band, frequency-filtered, noisy, and frequency-filtered noisy images, separately. Thus, we had 12 experiments (3 [categorization levels] × 2 [image types] × 2 [noise conditions]). Again, for the superordinate level experiments, we performed four animal/non-animal tasks. Also, for the basic and subordinate level experiments, we performed four bird/non-bird animal tasks and one bird categorization task, respectively. For each categorization task, we used 400 training images (200 per category) for extracting features and training the classifier, and 400 test images to evaluate the model. In each task, both the training and testing images belonged to the same group. For instance, if the experiment was performed on LSF-filtered images, then all the training and testing samples were filtered in the LSF band. However, the frequency filtering mechanism depends on the structure of each model (see below). It should be mentioned that for both models, we used grayscale images that were rescaled to 140 pixels in height.
2.3.1. Model I
The first model is largely inspired by the HMAX model (Serre et al., 2007), which is widely used in the visual neuroscience studies as a model of object recognition. In this model, the input image is first filtered in the S1 layer, by applying various Gabor filters with different spatial frequencies and orientations. Then, in the C1 layer, a local max operation is performed over the output of S1 layer. Afterward, random patches are extracted from the output of C1 layer over the training images. These patches are considered as object representative visual features corresponding to the object categories. For each test image, all extracted patches are convolved with the output of C1 layer in different positions (S2 layer) and the maximum convolution values, corresponding to the extracted visual features, provide the object representation in C2 layer. Finally, a classifier detects the category of the input image based on its C2 representation.
The standard HMAX model uses Gabor filters with 16 different spatial frequencies and four orientations in the S1 layer (i.e., 0, 45, 90, and 135°), which are compatible with the recordings in area V1; see Serre et al. (2007) for more details. In the C1 layer of HMAX, the local max operation is performed over a neighborhood of two adjacent frequencies, i.e., the C1 layer compresses every two consecutive S1 maps into one C1 map. We used the first two C1 maps as high, the next four as intermediate, and the last two as LSF bands. Hence, for each frequency band, the original images are fed into the model and the corresponding Gabor filters are applied on them in the S1 layer. Then, the C1 maps are computed by performing a local max operation over the output of two consecutive S1 maps. For instance, in the HSF band, we used Gabor filters in the first four spatial frequencies, while the other frequencies are totally neglected.
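The mapping from the 16 S1 spatial frequencies to the eight C1 maps and three frequency bands can be made concrete with a small sketch. It assumes that index 0 is the highest spatial frequency, as implied by "the first two C1 maps as high". It pools only across adjacent frequencies, omitting the local spatial max and the four orientations, and it uses random arrays in place of Gabor responses. All names are hypothetical.

```python
import numpy as np

N_S1_FREQUENCIES = 16
# C1 map indices per band: first two are HSF, next four ISF, last two LSF.
C1_BANDS = {"HSF": range(0, 2), "ISF": range(2, 6), "LSF": range(6, 8)}

def pool_pairs(s1_maps):
    """Element-wise max over every two consecutive S1 maps (simplified C1 pooling)."""
    return [np.maximum(s1_maps[i], s1_maps[i + 1])
            for i in range(0, len(s1_maps), 2)]

def s1_indices_for_band(band):
    """S1 frequency indices that feed the C1 maps of the given band."""
    return [s for c in C1_BANDS[band] for s in (2 * c, 2 * c + 1)]

rng = np.random.default_rng(0)
s1_maps = [rng.random((32, 32)) for _ in range(N_S1_FREQUENCIES)]  # stand-in S1 outputs
c1_maps = pool_pairs(s1_maps)
hsf_c1 = [c1_maps[i] for i in C1_BANDS["HSF"]]
```

With this indexing, the HSF band uses S1 frequencies 0 to 3, which matches the statement that only the first four spatial frequencies are used in that band.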
In the S2 layer, we picked random patches from each C1 map, where the size of these patches varied from 4 × 4 to 24 × 24 with a step of 2. Then, for each patch size, 1,000 random patches are extracted from different training images. Therefore, for instance, the S2 layer crops 22,000 random patches (2 [C1 maps] × 11 [patch sizes] × [1,000 patches per size]) for the HSF band. In C2 layer, we perform a global max operation over each S2 map and put them together as the representative feature vector. Finally, we employed a 1-nearest neighbor (1-NN) classifier to categorize the input test image based on the label of the closest training sample in C2 feature space. Figure 3A shows the overall sketch of Model I for the HSF band.
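The S2/C2 stage and the 1-NN decision described above can be sketched as follows. This is a simplified illustration under stated assumptions: the patch response is a plain sliding dot product (the text says "convolved"; the exact similarity function is not specified here), a single C1 map per image is used, and the demonstration uses one patch per size instead of 1,000. All function names are hypothetical.

```python
import numpy as np

PATCH_SIZES = list(range(4, 25, 2))          # 4x4 up to 24x24 in steps of 2
N_HSF_PATCHES = 2 * len(PATCH_SIZES) * 1000  # C1 maps x patch sizes x patches per size

def random_patch(c1_map, size, rng):
    """Crop a size x size patch at a random position of a C1 map."""
    r = rng.integers(0, c1_map.shape[0] - size + 1)
    c = rng.integers(0, c1_map.shape[1] - size + 1)
    return c1_map[r:r + size, c:c + size].copy()

def c2_response(c1_map, patch):
    """Global max of the patch response over all valid positions (S2 then C2)."""
    windows = np.lib.stride_tricks.sliding_window_view(c1_map, patch.shape)
    responses = np.einsum("ijkl,kl->ij", windows, patch)  # assumed dot-product response
    return responses.max()

def c2_vector(c1_map, patches):
    """C2 feature vector: one global-max response per stored patch."""
    return np.array([c2_response(c1_map, p) for p in patches])

def predict_1nn(train_x, train_y, test_x):
    """Label each test vector with the label of its nearest training vector."""
    dists = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[dists.argmin(axis=1)]

# Tiny demonstration on random stand-in C1 maps.
rng = np.random.default_rng(0)
train_maps = [rng.random((32, 32)) for _ in range(6)]
train_y = np.array([0, 0, 0, 1, 1, 1])
patches = [random_patch(train_maps[i % 6], s, rng) for i, s in enumerate(PATCH_SIZES)]
train_x = np.stack([c2_vector(m, patches) for m in train_maps])
pred = predict_1nn(train_x, train_y, train_x)  # each training vector is its own nearest neighbor
```

The patch-count constant reproduces the 22,000 patches stated for the HSF band (2 × 11 × 1,000).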
Figure 3. (A) The overall sketch of Model I for processing the HSF band. In the S1 layer, the input image is convolved with Gabor filters (four orientations) in the first four high spatial frequency sizes (i.e., HSF). The C1 layer pools each pair of consecutive S1 maps. During learning, the S2 layer extracts patches of different sizes from the C1 output over different input images. During the test phase, the S2 patches are convolved with the output of the C1 layer. Then the C2 layer applies a global max operation over each S2 output map. Finally, a K-NN classifier detects the category of the input image based on the location of its C2 vector and the nearest training C2 vectors. (B) The overall sketch of Model II for processing the HSF band. The input images are first filtered into the HSF band and then S2 patches are extracted from the filtered images. The remaining parts are similar to Model I. For both models, the same procedures but with different filters in the first layers are used for the ISF and LSF bands.
Therefore, for the experiments with the full-band images (i.e., original images) we used all the eight C1 feature maps. But, for the LSF, ISF, and HSF cases, we only used the corresponding C1 maps. In the noisy image experiments, first we added the noise to the input image, and then fed it to the model.
2.3.2. Model II
The frequency filtering mechanism applied in the Model I (using Gabor filters with different spatial frequencies) is different from what was used in the psychophysics experiments. Therefore, using Model II, which replaces S1 and C1 layers with directly frequency-filtered images, we could verify that the results are independent of the frequency filtering mechanism.
In this model, we discarded the first two layers of the standard HMAX. Thus, for frequency-filtered image experiments, we used the same procedure as explained in Section 2.1 to filter images into the desired frequency bands. For the full-band, we extracted 11,000 patches (11 [patch sizes] × 1,000 [patches per size]) directly cropped from the original images. For the frequency-filtered (noisy) images, the patches were extracted from the filtered (noisy) images. Afterward, to construct the feature vectors, we convolved these patches with the input images and performed a global max operation. Finally, the same type of classifier (1-nearest neighbor) was used to decide the category of the input image. Figure 3B shows the overall sketch of Model II for the HSF band.
In this section, we present the results of the psychophysical and computational experiments. Section 3.1 provides the accuracy and reaction time of human subjects performing the superordinate, basic, and subordinate level psychophysics experiments with the original and frequency-filtered images. Then, in Section 3.2, the recognition accuracy of both computational models over the same experiments is presented. Finally, the robustness of the results to different amounts of noise is examined in Section 3.3.
3.1. Humans' Accuracy and Reaction Time Depend on Spatial Frequency Information
The recognition accuracies of human subjects for the psychophysics experiments (see Section 2.2 for the details) with full-frequency band (i.e., original) images as well as the frequency-filtered images (i.e., LSF, ISF, and HSF bands) are shown in Figure 4. Figure 4A shows the subjects' mean accuracy for the different tasks in each of the superordinate, basic, and subordinate level experiments. We performed a three-factor ANOVA using spatial frequency, task, and categorization level as factors. This allows us to study the effect of each factor and their interactions on the categorization accuracy.
Figure 4. Subjects' mean accuracy for different categorization levels and spatial frequency bands. (A) Subjects' accuracy for the different tasks of each categorization level. (B) Average accuracies over the different tasks of the superordinate (duck vs. car, pigeon vs. car, cat vs. car, and dog vs. car), basic (duck vs. cat, duck vs. dog, pigeon vs. cat, and pigeon vs. dog) and subordinate (duck vs. pigeon) levels. The p-value matrix presents the significant and strongly significant values using + and * signs, respectively. Error bars represent standard error of the mean (SEM). The between (within) group degree of freedom (df) is 11 (108).
As seen in the Superordinate column (left part) of Figure 4A, there is no significant difference between the accuracies in the four superordinate animal/non-animal tasks (p = 0.542 > 0.05, F = 0.720). Also, it can be seen that the categorization accuracies are very high in all the frequency bands. This means that the coarse information at LSF band is sufficient for the superordinate categorization level. The subjects' mean accuracy over each of the four (bird/non-bird) basic level tasks is presented in the Basic column (middle part) of Figure 4A. Here again, there is no significant difference among the four basic level tasks (p = 0.958 > 0.05, F = 0.103). However, the accuracy in basic level tasks dropped, with respect to the superordinate level. The maximum accuracy drop has occurred in the LSF band, while it is not changed much for the HSF band. The same effect is observed for the subordinate task, with higher accuracy drop in the LSF and ISF bands.
Since there was no significant difference between the tasks corresponding to each categorization experiment, we also report the average accuracy for each categorization level and each frequency band (see Figure 4B). We performed a two-factor ANOVA using frequency band and categorization level as factors. The interaction between these two factors was statistically significant (p = 0.0001, F = 10.503). For the superordinate level, the accuracies decreased smoothly from the low to the high frequency bands (this decrease is not statistically significant). In addition, the accuracy with the full-band images is significantly higher than with the ISF and HSF bands, while the difference is not significant for the LSF band (see p-value matrix in Figure 4). These together suggest that the LSF information is sufficient and necessary for superordinate level categorization. For the basic and subordinate levels, the LSF band does not carry the information required for human subjects to perform the categorization task precisely. However, as the frequency band shifts toward higher frequencies, the accuracy increases steadily. Compared to the subordinate level, the basic level has higher accuracy with lower frequency bands. Interestingly, for the HSF band, the accuracies corresponding to different categorization levels become closer to each other. Therefore, it can be said that the lower frequencies are suitable for higher categorization levels (e.g., superordinate), while performing low categorization levels (e.g., subordinate) requires higher frequency information.
We also recorded the subjects' RT during the experiments. Figure 5A presents the mean RT, independently for each psychophysics task. As for the accuracies, we performed a three-factor ANOVA using the categorization level, frequency band, and task as independent factors. Again, there is no significant difference among the tasks in the superordinate (p = 0.795 > 0.05; F = 0.342) and basic (p = 0.919 > 0.05; F = 0.166) level experiments. Thus, in Figure 5B, we averaged the RTs of each categorization level and frequency band over the different tasks. For all categorization levels, the RT increases from the low to the high frequency bands. We also performed a two-factor ANOVA using frequency band and categorization level as factors. The interaction between these two factors was statistically significant (p = 0.0001, F = 61.886). Also, we found that RTs of ISF experiments are significantly higher than those of LSF, and RTs of HSF experiments are significantly higher than those of ISF (see p-value matrix in Figure 5). These results are compatible with the previous findings indicating that lower frequencies are processed earlier than higher ones (Macé et al., 2005; Kauffmann et al., 2015).
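To make the interaction test concrete, the following sketch computes the interaction F statistic of a balanced two-factor ANOVA with NumPy. It is a generic fixed-effects, between-subjects formulation applied to synthetic data, and it is not claimed to be the analysis pipeline used in the study. The 3 × 4 × 10 layout (levels × bands × replicates) is an assumption chosen because it gives 108 error degrees of freedom, the within-group df reported in the Figure 4 caption. The function name is hypothetical.

```python
import numpy as np

def interaction_f(data):
    """Interaction F statistic for a balanced two-factor design.

    data has shape (a, b, n): a levels of factor A, b levels of factor B,
    and n replicates per cell. Returns (F, df_interaction, df_error).
    """
    a, b, n = data.shape
    grand = data.mean()
    cell = data.mean(axis=2)              # mean of each (A, B) cell
    mean_a = data.mean(axis=(1, 2))       # marginal means of factor A
    mean_b = data.mean(axis=(0, 2))       # marginal means of factor B
    ss_inter = n * ((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
    ss_error = ((data - cell[:, :, None]) ** 2).sum()
    df_inter = (a - 1) * (b - 1)
    df_error = a * b * (n - 1)
    f_stat = (ss_inter / df_inter) / (ss_error / df_error)
    return f_stat, df_inter, df_error

# Synthetic accuracies: 3 categorization levels x 4 bands x 10 replicates.
rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, size=(3, 4, 10))
level_effect = np.arange(3)[:, None, None] * 2.0
band_effect = np.arange(4)[None, :, None] * 1.5
additive = level_effect + band_effect + noise                       # no interaction
crossed = additive + 5.0 * (np.arange(3)[:, None, None] * np.arange(4)[None, :, None])

f_add, df_i, df_e = interaction_f(additive)
f_int, _, _ = interaction_f(crossed)
```

The p-value would follow from the F distribution with (df_i, df_e) degrees of freedom. A large F, as in the crossed example, indicates that the effect of frequency band differs across categorization levels.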
Figure 5. Subjects' mean RT for different categorization levels and spatial frequency bands. (A) Subjects' RT for the different tasks of each categorization level. (B) Average RT over the different tasks of the superordinate (duck vs. car, pigeon vs. car, cat vs. car, and dog vs. car), basic (duck vs. cat, duck vs. dog, pigeon vs. cat, and pigeon vs. dog) and subordinate (duck vs. pigeon) levels. The p-value matrix presents the significant and strongly significant values using + and * signs, respectively. Error bars represent standard error of the mean (SEM). The between (within) group df is 11 (108).
On the other hand, several studies suggest that the initial guess about the object is made based on the LSF information, which facilitates the recognition process in higher visual areas (Fenske et al., 2006; Craddock et al., 2015). In agreement with these studies, the RTs of full-band experiments in all three categorization levels are significantly shorter than those of the HSF and ISF experiments, but close to the LSF ones (the differences are not statistically significant; see p-value matrix in Figure 5). Generally, the RTs of the superordinate level are shorter than those of the basic level, and RTs of the basic level are shorter than those of the subordinate level (see Figure 5B). In addition, the RT differences are larger in the LSF experiments.
By considering both RTs (Figure 5B) and accuracies (Figure 4B), three main conclusions can be drawn. First, using just LSF information, humans could quickly and accurately accomplish the superordinate categorization. In contrast, the RT is much longer with HSF information, despite the reasonable accuracy. These together suggest that the superordinate categorization is mainly done using the LSF information. Second, although categorization in basic and subordinate levels using the LSF information is completed faster, the accuracies are very low. Therefore, LSF information is not sufficient for the basic and subordinate levels. Intuitively, in the superordinate level, there is a high inter-category dissimilarity, and therefore, LSF information is sufficient for performing the task. For the subordinate tasks, with higher inter-category similarity, higher frequency information is required, which carries more details about the object. Third, to complete the basic and subordinate level categorization, subjects needed HSF information. However, higher frequencies are processed later than lower ones. Hence, it can be concluded that the superordinate level is the entry object categorization level, and that categorization in the basic and subordinate levels is accomplished afterward.
3.2. Computational Models Account for Human Behavior
As mentioned in Section 2.3, we employed two computational models to study whether the changes in human performance over the frequency bands are due to the changes in the information content in each band or due to the way the visual system processes different frequency information. Therefore, we assessed the models on the same categorization tasks as human subjects performed in the psychophysics experiments. The details of the models and the way they are trained and tested in each task are fully explained in Section 2.3. Briefly, Gabor filters with different spatial frequencies are used in Model I to filter the input images into LSF, ISF, and HSF bands. Gabor filters with low (high) spatial frequencies act as low-pass (high-pass) filters and extract coarse (fine) information from the image. In Model II, images were directly filtered into the different frequency bands.
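To make the idea of splitting an image into frequency bands concrete, the following is a minimal sketch of direct band filtering in the Fourier domain, in the spirit of the directly filtered inputs of Model II. It is not the filtering procedure used in the paper: the hard-edged radial mask, the function name `bandpass_filter`, and the cut-off values in `BANDS` are all assumptions chosen for illustration, and the random array stands in for a real stimulus.

```python
import numpy as np

def bandpass_filter(image, low_cpi, high_cpi):
    """Keep spatial frequencies whose radius r satisfies low_cpi <= r < high_cpi.

    Frequencies are expressed in cycles per image. This is an ideal
    (hard-edged) filter, used here only for illustration.
    """
    h, w = image.shape
    # Frequency coordinates along each axis, in cycles per image.
    fy = np.fft.fftfreq(h) * h
    fx = np.fft.fftfreq(w) * w
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    mask = (radius >= low_cpi) & (radius < high_cpi)
    spectrum = np.fft.fft2(image)
    return np.real(np.fft.ifft2(spectrum * mask))

# Hypothetical cut-offs (cycles per image); NOT the values used in the paper.
BANDS = {"LSF": (0, 8), "ISF": (8, 32), "HSF": (32, np.inf)}

rng = np.random.default_rng(0)
image = rng.random((64, 64))  # placeholder for a grayscale stimulus

filtered = {name: bandpass_filter(image, lo, hi) for name, (lo, hi) in BANDS.items()}

# The three bands partition the spectrum, so they add back up to the image.
reconstructed = sum(filtered.values())
```

Because the LSF band contains the zero-frequency component, it keeps the mean luminance and the coarse layout of the image, while the HSF band has zero mean and retains only edges and fine texture.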
Figure 6A shows the accuracy of Model I over the different categorization tasks and levels, as well as frequency bands. Here again, we performed a three-factor ANOVA using categorization level, task, and frequency band as independent factors. As for humans, there was no significant difference among the different tasks of each categorization level experiment (superordinate: p = 0.091 > 0.05, F = 2.199; basic level: p = 0.060 > 0.05, F = 2.525). Therefore, in Figure 6B, we averaged the accuracies across the categorization tasks. Interestingly, the overall trend in the accuracies of Model I is very similar to that of humans (see Figure 4B): the accuracy of this model at the superordinate level drops when moving from the LSF band to the higher ones. Computationally, this means that lower frequencies contain more information about superordinate categories than the higher ones. In contrast, at the basic and subordinate levels, higher frequency bands lead to higher accuracies. However, compared to the subordinate level, the basic level has higher accuracies in the LSF and ISF bands. Surprisingly, similarly to humans, the accuracies of all categorization levels become close to each other at the HSF band. We also performed a two-factor ANOVA on the accuracy of Model I using frequency band and categorization level as factors. The interaction between these two factors was statistically significant (p = 0.0001, F = 93.182).
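The interaction test reported above can be sketched for a balanced design as follows. This is only an illustration of how the interaction F statistic of a two-factor ANOVA is formed, not the authors' analysis code. The function name `interaction_f` and the synthetic accuracies are assumptions, and the 3 x 4 x 10 layout (three categorization levels, four band conditions including full-band, ten runs) is inferred from the degrees of freedom given in the figure captions.

```python
import numpy as np

def interaction_f(data):
    """Interaction F statistic of a balanced two-factor ANOVA.

    data has shape (a, b, n): a levels of factor A, b levels of factor B,
    and n replicates per cell.
    """
    a, b, n = data.shape
    grand = data.mean()
    mean_a = data.mean(axis=(1, 2))   # marginal means of factor A
    mean_b = data.mean(axis=(0, 2))   # marginal means of factor B
    cell = data.mean(axis=2)          # cell means
    # Interaction: what remains of the cell means after removing both main effects.
    resid = cell - mean_a[:, None] - mean_b[None, :] + grand
    ss_inter = n * np.sum(resid ** 2)
    ss_within = np.sum((data - cell[:, :, None]) ** 2)
    df_inter = (a - 1) * (b - 1)
    df_within = a * b * (n - 1)
    f_value = (ss_inter / df_inter) / (ss_within / df_within)
    return f_value, df_inter, df_within

# Synthetic accuracies, NOT the paper's data: rows are categorization levels,
# columns are band conditions (full, LSF, ISF, HSF), with 10 runs per cell.
rng = np.random.default_rng(0)
base = np.array([[0.95, 0.95, 0.90, 0.85],   # superordinate
                 [0.90, 0.65, 0.80, 0.85],   # basic
                 [0.88, 0.55, 0.72, 0.84]])  # subordinate
data = base[:, :, None] + rng.normal(0.0, 0.02, size=(3, 4, 10))

f_value, df_inter, df_within = interaction_f(data)
print(f"F({df_inter}, {df_within}) = {f_value:.2f}")
```

A large interaction F means that the effect of the frequency band on accuracy differs between categorization levels, which is the pattern described above: accuracy falls with higher bands at the superordinate level and rises at the other two.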
Figure 6. Mean accuracies of Model I (averaged over 10 independent runs) for different categorization levels and spatial frequency bands. (A) Model's accuracy for the different tasks of each categorization level. (B) Average accuracies over the different tasks of the superordinate (duck vs. car, pigeon vs. car, cat vs. car, and dog vs. car), basic (duck vs. cat, duck vs. dog, pigeon vs. cat, and pigeon vs. dog) and subordinate (duck vs. pigeon) levels. The p-value matrix presents the significant and strongly significant values using + and * signs, respectively. Error bars represent standard error of the mean (SEM). The between (within) group df is 11 (108).
The accuracies of Model II are presented in Figure 7, where Figure 7A contains the accuracies on each task, and Figure 7B shows the accuracies averaged over the tasks. We performed a two-factor ANOVA on the average accuracy of Model II using frequency band and categorization level as factors. The interaction between these two factors was statistically significant (p = 0.0001, F = 42.050). Compared with Model I, the accuracies of Model II are lower, which is due to the elimination of the preprocessing stages (S1 and C1 layers) in this model. However, what matters is the trend of accuracies within and between the categorization levels. As seen, the results of Model II are similar to those of Model I. Again, the LSF band leads to high accuracy at the superordinate level, while it results in the lowest accuracy for the other two categorization levels. Also, for accurate categorization at the basic and subordinate levels, HSF information is necessary.
Figure 7. Mean accuracies of Model II (averaged over 10 independent runs) for different categorization levels and spatial frequency bands. (A) Model's accuracy for the different tasks of each categorization level. (B) Average accuracies over the different tasks of the superordinate (duck vs. car, pigeon vs. car, cat vs. car, and dog vs. car), basic (duck vs. cat, duck vs. dog, pigeon vs. cat, and pigeon vs. dog) and subordinate (duck vs. pigeon) levels. The p-value matrix presents the significant and strongly significant values using + and * signs, respectively. Error bars represent standard error of the mean (SEM). The between (within) group df is 11 (108).
Because the results of the two computational models are consistent with human behavior, the observed human accuracy pattern appears to be mainly due to the information content of the different frequency bands. In the LSF band, where the overall shape of the object is preserved and other details are removed, categorization at the superordinate level can be done with high accuracy. This coarse information is not useful for the other two categorization levels (basic and subordinate), where the categories are more similar in shape. Therefore, the visual system needs more detailed information, which lies in the higher frequency bands. As stated before, lower spatial frequencies are processed faster than higher ones, and therefore it is computationally difficult for the visual cortex to perform basic and subordinate categorization before superordinate categorization.
Interestingly, similarly to humans, the accuracies of both models dropped at the superordinate level when moving from the LSF band to the higher ones. Clearly, at the superordinate level, there is a huge variation among the objects in each category. Therefore, lower frequencies, which maintain the overall shape of the objects, contain the information required for superordinate level categorization. However, at the basic and subordinate levels, the higher frequency bands are more informative. Compared to the subordinate level, the basic level has higher accuracies in the LSF and ISF bands. Surprisingly, similarly to humans, the accuracies of all the categorization levels converge at the HSF band.
3.3. Results Are Robust to Noise
We repeated all the behavioral (see Section 2.2) and computational (see Section 2.3) experiments with noisy images described in Section 2.1. Adding noise to the input image will increase the errors, and therefore, it will avoid any potential ceiling effects in the performances. Also, it allows us to check whether the obtained results are still valid under more difficult visual conditions.
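As a rough illustration of how graded noise can be added to a stimulus, the sketch below replaces a given fraction of pixels with random intensities. The actual noise procedure is the one described in Section 2.1 and may differ; the replacement scheme, the function name `add_pixel_noise`, and the uniform gray placeholder image are assumptions made only for this example.

```python
import numpy as np

def add_pixel_noise(image, fraction, rng):
    """Return a copy of `image` with `fraction` of its pixels replaced
    by uniform random values in [0, 1).

    Assumed noise model for illustration; not necessarily the paper's.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    noisy = image.copy()
    n_noisy = int(round(fraction * image.size))
    # Pick distinct pixel positions so exactly n_noisy pixels are replaced.
    idx = rng.choice(image.size, size=n_noisy, replace=False)
    noisy.flat[idx] = rng.random(n_noisy)
    return noisy

rng = np.random.default_rng(0)
image = np.full((50, 40), 0.5)  # placeholder for a grayscale stimulus in [0, 1]

# One noisy version per noise level used in the experiments (in percent).
noisy_versions = {p: add_pixel_noise(image, p / 100, rng) for p in (20, 30, 40, 50)}
```

Under this assumed scheme, a higher percentage leaves fewer original pixels intact, which makes every task harder and moves accuracies away from ceiling.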
Figure 8A,B show the human subjects' categorization accuracy and RT, respectively, for the full-band and frequency-filtered images with different amounts of noise. Note that, for each categorization level experiment, we averaged the accuracies and RTs corresponding to each task. For instance, for the superordinate level experiment, there were four animal/non-animal tasks (see Section 2.1).
Figure 8. The effect of adding different amounts of noise to the image stimuli (20, 30, 40, and 50%) on humans' accuracy (A) and RT (B). Error bars represent standard error of the mean (SEM). The between (within) group df is 59 (540).
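The degrees of freedom quoted in the figure captions can be checked with a short calculation. The sketch below assumes that each combination of conditions forms one group (cell) and that every cell holds the same number of observations. The cell counts and the ten observations per cell are inferred from the reported df values, not stated in this section, and the helper name `anova_df` is hypothetical.

```python
def anova_df(n_cells, n_per_cell):
    """Between-group and within-group degrees of freedom for a balanced design."""
    df_between = n_cells - 1
    df_within = n_cells * (n_per_cell - 1)
    return df_between, df_within

# Figures 6 and 7 (assumed layout): 3 categorization levels x 4 band
# conditions (full, LSF, ISF, HSF), 10 observations per cell.
print(anova_df(3 * 4, 10))      # (11, 108)

# Figures 8 and 9 (assumed layout): 5 noise levels (0, 20, 30, 40, 50%)
# x 3 categorization levels x 4 band conditions, 10 observations per cell.
print(anova_df(5 * 3 * 4, 10))  # (59, 540)
```

Both results match the captions, which suggests that the noise-free condition is counted as a fifth noise level in the analyses of the noisy experiments.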
As shown in Figure 8A, the accuracy dropped as the noise level increased, but the overall trend of accuracies over the frequency bands and categorization levels remained the same. For the superordinate categorization, the LSF band has higher accuracy than the intermediate and high bands, even at the 50% noise level. For the basic and subordinate levels, the accuracy increases when moving from the LSF to the ISF and HSF bands. To statistically investigate the effect of noise on categorization accuracies, we again performed a three-factor ANOVA using noise, categorization level, and frequency band as independent factors. Our analysis indicates that there is no significant interaction between the categorization and noise levels, meaning that adding noise has no effect on the accuracy trend over the categorization levels. However, the noise level has a significant effect on the accuracy (p = 0.0001 < 0.05, F = 243.759). All the other effects and interactions were also significant (frequency: p = 0.0001 < 0.05, F = 74.261; categorization level: p = 0.0001 < 0.05, F = 99.628; frequency × categorization level: p = 0.0001 < 0.05, F = 10.636; frequency × noise: p = 0.038 < 0.05, F = 1.858).
Adding noise also increases the RT (see Figure 8B). Similarly to the noise-free experiments, for all categorization levels, the RT increases when moving from the LSF to the ISF and HSF bands. Here again, the subordinate (basic) level has longer RTs than the basic (superordinate) level. Interestingly, for all categorization levels, as the amount of noise increases, the pattern of RTs is maintained but shifted toward longer times. We performed a three-factor ANOVA to study the impacts of noise, categorization level, and frequency band on RTs. Similarly to the accuracies, adding noise had a significant effect on RTs (p = 0.0001 < 0.05, F = 156.275), and there was no significant interaction between the categorization level and noise. All the other effects and interactions were also significant (frequency: p = 0.0001 < 0.05, F = 327.685; categorization level: p = 0.0001 < 0.05, F = 202.089; frequency × categorization level: p = 0.0001 < 0.05, F = 4.144).
We also evaluated the two computational models on the noisy images. This allowed us to study, from a computational point of view, how adding noise affects the information at each frequency band, i.e., the accuracy at each categorization level. The categorization accuracies of Models I and II for different levels of noise are shown in Figure 9A,B, respectively. Similarly to the humans, the accuracies dropped as the amount of noise increased, while the trend of accuracies over the frequency bands was maintained. In addition, for all noise levels, the LSF band leads to higher accuracies at the superordinate level than the ISF and HSF bands, while higher frequencies are more suitable for the basic and subordinate levels.
Figure 9. The effect of adding different amounts of noise to the image stimuli (20, 30, 40, and 50%) on the accuracy of Model I (A) and Model II (B). Error bars represent standard error of the mean (SEM). The between (within) group df is 59 (540).
4. Discussion
The aim of our study was to investigate the effect of spatial frequencies on object categorization in different levels (i.e., superordinate, basic and subordinate levels). We constructed an object dataset containing images of cars and four animals (ducks, pigeons, cats, and dogs). Images were also filtered into LSF, ISF, and HSF bands. We performed several rapid-presentation psychophysics experiments at different categorization levels, and recorded the subjects' accuracy and RT. The same categorization experiments were also performed by two computational models.
Although the relation between global (resp. local) visual processing and LSF (resp. HSF) information has been debated (Morrison and Schyns, 2001; Boutet et al., 2003; Loftus and Harley, 2004; Goffaux et al., 2005; Goffaux and Rossion, 2006), coarse information is obviously excluded from the HSF-filtered images, whereas fine details such as sharp edges and textures are absent in the LSF-filtered images. The results of our psychophysics experiments reveal that, using just LSFs, humans could accurately perform superordinate level categorization. For the basic and subordinate levels, in contrast, higher frequency information is required, and humans could not reach high accuracy using just LSF information. In addition, the accuracy at the basic level was greater than at the subordinate level for the LSF and ISF bands. Together, these indicate that the basic level is not fully dependent on either the HSF or the LSF band. Also, at the HSF band, the human accuracy at all categorization levels is almost equal. This suggests that the HSF band carries the same amount of useful information for each categorization level.
However, these results are contrary to the findings of Collin and Mcmullen (2005), who suggested that the highest accuracy with LSF information is reached at the basic level, while the superordinate and subordinate levels rely on higher frequencies. This contradiction could be due to their speeded category verification task, which involved more analysis than object detection alone. Specifically, they presented a category name followed by an object image, and subjects had to report whether they matched. Reading and language cortical areas might be activated during the visual process, which can give an advantage to the basic level over the superordinate level. Therefore, this type of experiment has been criticized because it involves the semantic processing of the brain (Rosch et al., 1976; Macé et al., 2009; Wu et al., 2014).
Our results are compatible with the theory of coarse-to-fine temporal processing in the visual system (Navon, 1977; Schyns and Oliva, 1994; Hughes et al., 1996; Macé et al., 2005), where lower frequencies are processed earlier than higher ones, independently of the categorization level. In addition, the subjects' RT for the superordinate (resp. basic) level was shorter than for the basic (resp. subordinate) level. These findings suggest that the superordinate level is the entry categorization level, and the subordinate level is the last one. However, these results contradict the findings of some earlier studies (Jolicoeur et al., 1984; Murphy and Brownell, 1985; Murphy and Wisniewski, 1989; Gauthier et al., 1997; Large et al., 2004) suggesting a temporal advantage for basic level categorization. In particular, Collin and Mcmullen (2005) suggested that, whatever the spatial frequency band, the basic level is completed earlier than the superordinate and subordinate levels. Again, this could be due to the speeded category verification paradigm and the long stimulus presentation time in their experiments.
From a computational point of view, our results could be due to the information content at each frequency band, or to the underlying neural processes involved in object categorization at each level. We used two computational models to perform the same categorization experiments as humans did. These models employ different frequency filtering mechanisms: one uses Gabor filters ranging from low to high spatial frequencies, and the other one uses directly filtered images. Both models reached accuracy patterns similar to those of humans over the different categorization levels and frequency bands. Therefore, computationally, neither the basic nor the subordinate level can be the entry level, due to the lack of the required visual information in the LSF band.
The computational models we used in this study only explain the human categorization accuracy and do not take the processing time into account. Employing temporal models such as recurrent networks could help in this regard. However, to be compatible with the findings on the processing order of frequency information (i.e., from low to high) in the visual cortex, such a model should process the LSF components earlier than the HSF ones. Based on our results, it is expected that the model will accomplish the superordinate level categorization much faster than the basic and subordinate levels. Also, mathematical frameworks like the hierarchical Bayesian model (Lee and Mumford, 2003) can be used to explain the timing as well as the accuracy of the human visual system in object categorization at different abstraction levels. Based on the early input information, an initial probabilistic inference is made, and it is then updated by the upcoming inputs and feedback. For instance, using LSF information, a general guess about the animacy of the seen object is made, and it is completed by the upcoming HSF information to detect the species of the animal.
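The coarse-to-fine inference described above can be illustrated with a toy sequential Bayesian update over the five object classes. This is a flat Bayes rule applied twice, not the hierarchical model of Lee and Mumford (2003), and all likelihood values are hypothetical numbers chosen only to show the mechanism: the LSF evidence separates animals from cars, and the later HSF evidence separates the species.

```python
import numpy as np

CLASSES = ["car", "duck", "pigeon", "cat", "dog"]

def bayes_update(prior, likelihood):
    """Posterior over classes after observing evidence with the given likelihood."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Uniform prior before any visual evidence arrives.
prior = np.full(len(CLASSES), 1.0 / len(CLASSES))

# Hypothetical likelihoods (illustrative only, not fitted to any data).
# Early LSF evidence: the coarse shape looks animal-like but not species-specific.
lsf_likelihood = np.array([0.05, 0.25, 0.25, 0.20, 0.25])
# Later HSF evidence: fine details favor "duck".
hsf_likelihood = np.array([0.10, 0.60, 0.15, 0.10, 0.05])

after_lsf = bayes_update(prior, lsf_likelihood)
after_hsf = bayes_update(after_lsf, hsf_likelihood)

# Superordinate decision (animal vs. non-animal) is available after LSF alone.
p_animal_after_lsf = after_lsf[1:].sum()
# Finer decision needs the HSF evidence.
best_after_hsf = CLASSES[int(np.argmax(after_hsf))]
```

With these made-up numbers, the probability of "animal" is already high after the LSF step while the four animal classes remain nearly tied, and only the HSF step singles out one species. This mirrors the ordering of superordinate before finer levels discussed in the text.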
Here, we only used five object classes (dog, cat, pigeon, duck, and car) in our psychophysics and computational experiments. Increasing the number of classes, especially at the basic and subordinate levels, would substantially increase the number of human subjects needed to perform the psychophysics tasks. Instead, we increased the number of exemplars in each of the five categories and employed more subjects for each task. Although we only had the car class as the non-animal category at the superordinate level, each subject participated in only one superordinate task, in which he/she was asked to distinguish animals (one of the four animals) from non-animals (cars). Hence, subjects were not biased to detect cars only.
This study was carried out in accordance with the recommendations of the Declaration of Helsinki and the ethical committee of the University of Tehran, with written informed consent from all subjects. The protocol was approved by the ethical committee of the University of Tehran.
MA, SK, TM, and MG sketched the experiments, contributed to the analysis of the data and the preparation of the manuscript, and reviewed and finalized the manuscript. MA performed most of the experiments. MA and SK wrote the first draft of the manuscript.
Conflict of Interest Statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References
Ales, J. M., Farzin, F., Rossion, B., and Norcia, A. M. (2012). An objective method for measuring face detection thresholds using the sweep steady-state visual evoked response. J. Vis. 12, 18–18. doi: 10.1167/12.10.18
Bar, M., Kassam, K. S., Ghuman, A. S., Boshyan, J., Schmid, A. M., Dale, A. M., et al. (2006). Top-down facilitation of visual recognition. Proc. Natl. Acad. Sci. U.S.A. 103, 449–454. doi: 10.1073/pnas.0507062103
Collin, C. A., and Mcmullen, P. A. (2005). Subordinate-level categorization relies on high spatial frequencies to a greater degree than basic-level categorization. Attent. Percept. Psychophys. 67, 354–364. doi: 10.3758/BF03206498
Craddock, M., Martinovic, J., and Müller, M. M. (2015). Early and late effects of objecthood and spatial frequency on event-related potentials and gamma band activity. BMC Neurosci. 16:6. doi: 10.1186/s12868-015-0144-8
Dehaqani, M.-R. A., Vahabie, A.-H., Kiani, R., Ahmadabadi, M. N., Araabi, B. N., and Esteky, H. (2016). Temporal dynamics of visual category representation in the macaque inferior temporal cortex. J. Neurophysiol. 116, 587–601. doi: 10.1152/jn.00018.2016
Fenske, M. J., Aminoff, E., Gronau, N., and Bar, M. (2006). Top-down facilitation of visual object recognition: object-based and context-based contributions. Prog. Brain Res. 155, 3–21. doi: 10.1016/S0079-6123(06)55001-0
Gauthier, I., Anderson, A. W., Tarr, M. J., Skudlarski, P., and Gore, J. C. (1997). Levels of categorization in visual recognition studied using functional magnetic resonance imaging. Curr. Biol. 7, 645–651. doi: 10.1016/S0960-9822(06)00291-0
Goffaux, V., Hault, B., Michel, C., Vuong, Q. C., and Rossion, B. (2005). The respective role of low and high spatial frequencies in supporting configural and featural processing of faces. Perception 34, 77–86. doi: 10.1068/p5370
Goffaux, V., and Rossion, B. (2006). Faces are “spatial”–holistic face perception is supported by low spatial frequencies. J. Exp. Psychol. Hum. Percept. Perform. 32:1023. doi: 10.1037/0096-1523.32.4.1023
Guyader, N., Chauvin, A., Boucart, M., and Peyrin, C. (2017). Do low spatial frequencies explain the extremely fast saccades towards human faces? Vis. Res. 133, 100–111. doi: 10.1016/j.visres.2016.12.019
Kauffmann, L., Bourgin, J., Guyader, N., and Peyrin, C. (2015). The neural bases of the semantic interference of spatial frequency-based information in scenes. J. Cogn. Neurosci. 27, 2394–2405. doi: 10.1162/jocn_a_00861
Loftus, G. R., and Harley, E. M. (2004). How different spatial-frequency components contribute to visual information acquisition. J. Exp. Psychol. Hum. Percept. Perform. 30:104. doi: 10.1037/0096-1523.30.1.104
Macé, M., Joubert, O., and Fabre Thorpe, M. (2005). “Entry level at the superordinate level in visual categorization,” in 9th International Conference on Cognitive and Neural systems, Vol. 52 (Boston, MA), 2007–2018.
Macé, M. J.-M., Joubert, O. R., Nespoulous, J.-L., and Fabre-Thorpe, M. (2009). The time-course of visual categorizations: you spot the animal faster than the bird. PLoS ONE 4:e5927. doi: 10.1371/journal.pone.0005927
Murphy, G. L., and Brownell, H. H. (1985). Category differentiation in object recognition: typicality constraints on the basic category advantage. J. Exp. Psychol. Learn. Mem. Cogn. 11:70. doi: 10.1037/0278-7393.11.1.70
Murphy, G. L., and Wisniewski, E. J. (1989). Categorizing objects in isolation and in scenes: what a superordinate is good for. J. Exp. Psychol. Learn. Mem. Cogn. 15, 572–586. doi: 10.1037/0278-7393.15.4.572
Poncet, M., and Fabre-Thorpe, M. (2014). Stimulus duration and diversity do not reverse the advantage for superordinate-level representations: the animal is seen before the bird. Eur. J. Neurosci. 39, 1508–1516. doi: 10.1111/ejn.12513
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252. doi: 10.1007/s11263-015-0816-y
Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., and Poggio, T. (2007). Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell. 29, 411–426. doi: 10.1109/TPAMI.2007.56
Vanmarcke, S., Calders, F., and Wagemans, J. (2016). The time-course of ultrarapid categorization: the influence of scene congruency and top-down processing. i-Perception 7:2041669516673384. doi: 10.1177/2041669516673384
Keywords: spatial frequencies, object categorization, categorization levels, psychophysics, rapid object presentation
Citation: Ashtiani MN, Kheradpisheh SR, Masquelier T and Ganjtabesh M (2017) Object Categorization in Finer Levels Relies More on Higher Spatial Frequencies and Takes Longer. Front. Psychol. 8:1261. doi: 10.3389/fpsyg.2017.01261
Received: 29 March 2017; Accepted: 11 July 2017;
Published: 25 July 2017.
Edited by: Alan A. Stocker, University of Pennsylvania, United States
Reviewed by: Xue-Xin Wei, Columbia University, United States
H. Steven Scholte, University of Amsterdam, Netherlands
Copyright © 2017 Ashtiani, Kheradpisheh, Masquelier and Ganjtabesh. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Mohammad Ganjtabesh, email@example.com