Attention to local and global levels of hierarchical Navon figures affects rapid scene categorization

Brand, John; Johnson, Aaron P.

doi:10.3389/fpsyg.2014.01274

ORIGINAL RESEARCH article

Front. Psychol., 02 December 2014

Sec. Perception Science

Volume 5 - 2014 | https://doi.org/10.3389/fpsyg.2014.01274

John Brand¹*

Aaron P. Johnson¹,²

¹Department of Psychology, Concordia University, Montreal, QC, Canada
²Centre for Interdisciplinary Research in Rehabilitation of Greater Montreal, Montreal, QC, Canada

In four experiments, we investigated how attention to local and global levels of hierarchical Navon figures affected the selection of diagnostic spatial scale information used in scene categorization. We explored this issue by asking observers to classify hybrid images (i.e., images that contain low spatial frequency (LSF) content of one image, and high spatial frequency (HSF) content from a second image) immediately following global and local Navon tasks. Hybrid images can be classified according to either their LSF or HSF content, making them ideal for investigating diagnostic spatial scale preference. Although observers were sensitive to both spatial scales (Experiment 1), they overwhelmingly preferred to classify hybrids based on LSF content (Experiment 2). In Experiment 3, we demonstrated that LSF-based hybrid categorization was faster following global Navon tasks, suggesting that LSF processing associated with global Navon tasks primed the selection of LSFs in hybrid images. In Experiment 4, we examined this hypothesis by replicating Experiment 3 while suppressing the LSF information in the Navon letters by contrast balancing the stimuli. As in Experiment 3, observers preferred to classify hybrids based on LSF content; however, in contrast to Experiment 3, LSF-based hybrid categorization was slower following global than local Navon tasks.

Introduction

The ability to perceive a scene under increased attentional load is often cited as evidence of pre-attentive scene perception. This evidence is typically indexed using dual-task paradigms in which a secondary scene categorization task is unaffected by a concurrent, cognitively demanding primary task. Researchers argue that scene perception is pre-attentive as it is immune to inattentional blindness (Mack and Rock, 1998), unimpaired under dual-task conditions (Li et al., 2002; Rousselet et al., 2002), susceptible to Stroop interference (Greene and Fei-Fei, 2014), and impervious to change blindness if the object's removal does not change the meaning of the scene (Rensink et al., 1997; Simons and Levin, 1997).

However, other researchers question the evidence in support of the automaticity of scene perception. Cohen et al. (2011) argued that previous studies falsely demonstrated pre-attentive scene perception because they failed to use sufficiently demanding primary tasks, thereby allowing attentional resources to be allocated to the scene stimuli. By increasing the primary task difficulty, Cohen and colleagues demonstrated that concurrently completing multiple-object tracking and rapid serial visual presentation (RSVP) tasks impairs scene categorization. Together with previous research in which deficits in scene perception were indexed using attentional blink (Marois et al., 2004; Evans and Treisman, 2005; Slagter et al., 2010), inattentional blindness (Mack and Clarke, 2012), and dual-task (Walker et al., 2008) paradigms, Cohen and colleagues concluded that conscious scene perception requires attention.

Although concluding that attention is necessary for a scene to reach conscious awareness, Cohen et al. (2011) acknowledged that some higher-level aspects of scene processing occur in the absence of attention. One of the strongest findings in support of this hypothesis is the presence of scene-related behaviors that occur so rapidly that attention is thought to play little or no role. Kirchner and Thorpe's (2006; see also, Crouzet et al., 2010) study illustrates this point. They showed that when two natural images are presented concurrently, observers are able to make an ultra-rapid saccade to the image that contains an animal in as little as 120–130 ms. Consistent with this view, Thorpe et al. (1996; see also, Fabre-Thorpe et al., 2001) showed that observers are able to remove their finger from a button box within 300 ms in response to the presence of an animal. Critically, simultaneous event-related potentials revealed differential frontal lobe activity between target and non-target displays approximately 150 ms after stimulus onset. This suggests that scene categorization is made prior to this time point. Researchers (VanRullen and Thorpe, 2002) cite such results as evidence that scene categorization is accomplished, in part, by an automatic feed-forward mechanism, a conclusion corroborated by simulation evidence (Serre et al., 2007).

The rapid ability to categorize scenes suggests that a scene's semantic content is based on information originating from early visual processes. Consistent with this idea, Schyns and Oliva (1994) suggested that rapid scene categorization is based on a scene's global layout. Highways, for example, tend to have fewer vertical straight lines compared to city landscapes that have many dense, vertical orientations. Although these global image properties can vary from one scene to another (e.g., some cities are less dense than others), the consistency of spatial organization across different scenes is thought to activate a scene schema that can be used for rapid scene categorization. Schyns and Oliva tested this hypothesis by introducing a new type of scene stimulus, termed a hybrid image. A hybrid image contains information from two separate sources at different spatial frequencies: for example, the low spatial frequency (LSF) content of one picture (e.g., a city scene) and the high spatial frequency (HSF) content of a second picture (e.g., a highway scene). Of particular importance to Schyns and Oliva was not spatial frequency per se, but rather the information that each spatial scale conveyed for scene recognition. Converging evidence from neurophysiological and psychophysical studies suggests that visual information is organized into spatial frequency channels in which global information is conveyed by LSFs and finer information is conveyed by HSFs (for a review, see Morrison and Schyns, 2001). Consequently, the authors reasoned that if scene recognition is based on coarse information, then observers should prefer to categorize hybrid images based on LSF content.

To test their hypothesis, Schyns and Oliva (1994) asked observers to indicate whether a briefly presented (30 or 150 ms) sample image matched a subsequent target image. The sample image was either a hybrid, low-pass filtered (i.e., contained only LSFs), high-pass filtered (i.e., contained only HSFs), or a full broadband spatial frequency scene (i.e., an unaltered original image). The target image was always a broadband image. Of critical importance here was the association between hybrid samples and target images. On LSF-hybrid trials, the hybrid's LSF content matched the target scene. On HSF-hybrid trials, the hybrid's HSF content matched the target scene. When presentation duration was short, LSF-hybrid trials were more accurate than HSF-hybrid trials; conversely, when presentation duration was long, HSF-hybrid trials were more accurate than LSF-hybrid trials. Critically, categorization performance was high for all control conditions, suggesting that differences in spatial frequency availability cannot account for the differential processing of hybrid images. Schyns and Oliva attributed this result to a coarse-to-fine processing bias in which the early availability of a scene's global layout activates a scene schema from memory. Finer details emerge later and fill in the details of the scene's content (e.g., object recognition).

Oliva and Schyns (1997) modified the coarse-to-fine hypothesis to reflect the fact that either global, or fine scale information can be used for scene recognition. They asked observers to first complete a sensitization phase during which they were briefly presented natural images that were meaningful at only one spatial frequency (e.g., a LSF version of a highway scene with HSF structured noise). A test phase immediately followed in which observers were asked to classify hybrid images. Observers were more likely to categorize hybrids based on LSF and HSF content, respectively, if they were first sensitized to the same frequencies during the sensitization phase. Interestingly, observers claimed to be aware of only a single spatial scale within the hybrid images, suggesting that diagnostic scale selection was based on the scale that was previously the most informative.

To explain this flexibility in spatial scale selection, Oliva and Schyns (1997) suggested that attention is driven to diagnostic spatial frequencies in which recognition is based on scale specific cues of a scene category (e.g., natural landscapes contain LSFs at a horizontal orientation that correspond to the horizon). This idea dovetails with Chong and Treisman's (2005) notion that different distributions of attention facilitate the extraction of different types of information within a scene. According to Chong and Treisman, a scene's layout is organized hierarchically and attention can be deployed either locally, globally, or distributed over a set of similar items. When attention is focused locally, features are bound together resulting in the identification of an object. In contrast, when attention is distributed globally, the gist or semantic meaning of a scene is extracted based on its global layout. Finally, when attention is distributed over a set of similar items, summary representations of set properties are automatically extracted (e.g., average size; Ariely, 2001).

Global and local distributions of attention are typically studied using hierarchical Navon stimuli (e.g., a large “A” comprised of smaller “Cs”). Navon (1977) reported a global precedence effect that is characterized by two robust findings. First, global letters are identified faster than local letters; and second, global recognition interferes with local recognition but not vice versa. Several researchers (Shulman and Wilson, 1987; Badcock et al., 1990) explained the global precedence effect using the coarse-to-fine processing framework. Similar to the identification of coarse and fine information, the hypothesis is that the identification of global and local information is based on LSF and HSF information, respectively. In addition, Flevaris et al. (2011) showed that adopting different attentional distributions facilitates the selection of different spatial scales. They asked participants to classify the orientation of either the LSF or HSF component of a compound sine-wave grating immediately following global, or local Navon tasks. When discriminating the orientation of the LSF component, observers were faster following global Navon tasks; conversely, when asked to discriminate the orientation of the HSF component, observers were faster following local Navon tasks.

Flevaris et al.'s (2011) result suggests that attending to global and local levels should differentially affect scene categorization by facilitating the selection of LSFs and HSFs, respectively. In the present research, we tested this hypothesis by asking participants to categorize briefly presented hybrid images following global, or local Navon tasks. However, because hybrid images contain competing sources of categorization content, it was important that we first demonstrated the ability of our observers to extract both sources of information. Additionally, it was also important that we understood the spatial frequency that our observers preferred to use for categorization, irrespective of any attention manipulation. Thus, in Experiment 1 we assessed spatial scale sensitivity and in Experiment 2 we assessed diagnostic spatial scale preference.

Experiment 1 was a probe design similar to Schyns and Oliva (1994) in which observers were asked to indicate whether a probe word matched a briefly presented (32 or 150 ms) hybrid image. The probe word matched either the hybrid's LSF or HSF content. In a control condition, the probe word matched neither spatial frequency. We computed d prime (d′) to measure observers' sensitivity to both LSFs and HSFs. d′ values were above 1.5 in each condition, suggesting that both LSFs and HSFs are available in our hybrid images, at both short and long durations. Experiment 2 was a replication of Experiment 1, with the exception that we used an all-alternative forced choice paradigm in which observers were asked to choose the image category from a list of all possible target categories. Critically, this design allowed us to compute an objective measure of preferred diagnostic spatial scale. Results indicated that observers preferred to categorize hybrid images based on LSF content, at both short and long durations. Together with the results of Experiment 1, Experiment 2 demonstrated that our observers preferred to base categorization on LSF content, despite the fact that both LSFs and HSFs were perceptually available.

The fact that our observers preferred to base hybrid categorization on LSF content suggests that attending globally facilitates scene categorization. A consequence of this prediction is that LSF-based hybrid categorization should be faster following global compared to local Navon tasks. In Experiment 3, we directly tested this hypothesis by asking observers to classify hybrid images immediately following global and local Navon tasks. Similar to Experiment 2, observers preferred to categorize hybrid images based on LSF content. Furthermore, and consistent with our hypothesis, LSF-based hybrid image categorization was faster following global Navon tasks. In Experiment 4, we directly tested whether this facilitation effect was the result of processing LSFs associated with a Navon figure's global structure. We thus replicated Experiment 3 with the exception that we contrast balanced the Navon stimuli in order to suppress their LSFs (see Supplementary Material). Similar to Experiment 3, observers preferred to classify hybrid images based on LSF content, irrespective of the Navon task completed; however, and in contrast to Experiment 3, LSF-based hybrid image categorization was slower following global than local Navon tasks.

Experiment 1

The goal of Experiment 1 was to demonstrate the availability of both spatial frequencies in our hybrid images. We asked observers to complete a classification task in which they were required to indicate whether a probe word corresponded to a previously presented low-pass, high-pass, broadband, or hybrid image.

Methods

Observers

Eight undergraduate students from Concordia University participated in this study in return for partial course credit. All observers self-reported normal or corrected-to-normal vision. The University Human Research Ethics Committee at Concordia University approved all experiments reported in this article and all observers provided written consent.

Stimuli and apparatus

Stimuli were presented on a 21-in. Viewsonic 225fb CRT monitor (1024 × 768 resolution; 100 Hz refresh rate) controlled by a Dell Precision T3400 core2 quad processor running Microsoft Windows 7. Experiment Builder (SR Research, Ottawa, Ontario) was used to display the stimuli and record the responses. All participants were seated 60 cm away from the screen, and their head position was controlled using a table-mounted chinrest.

Stimuli were 128 natural images (32 unique images of highways, cities, living rooms, and valleys, respectively) taken from the SUN image database (Xiao et al., 2010). All images were equalized for mean luminance and RMS contrast (as described in Appendix B of Loschky et al., 2007) and were presented on a gray background (RGB values = [128, 128, 128]; luminance of 52 cd/m²). These images were the same categories used by Schyns and Oliva (1994), who showed that their overall contrast was similar (i.e., the Fourier amplitude spectra of the images are highly correlated with one another). Images were broadband, low-pass (below 2 cycles deg⁻¹ of visual angle), high-pass (above 6 cycles deg⁻¹ of visual angle), or hybrid images. Hybrids were constructed by combining the low frequency components of one scene (e.g., a city) with the high frequency components of another scene (e.g., a highway). MathWorks MATLAB (ver. 2011b) was used to create the images. A total of 32,768 possible hybrid images were constructed by taking every possible combination of the four scene categories. All images were converted to grayscale, located in the center of the screen, and were 256 × 256 pixels.
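The hybrid construction described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB code. It assumes ideal (hard-edged) Fourier-domain filters, whereas the filter shape actually used is not stated here, and it assumes a hypothetical display scale of 32 pixels per degree of visual angle, which is not a value reported in the article. The function names `radial_frequency_cpd` and `make_hybrid` are likewise illustrative only, and random arrays stand in for the scene photographs.

```python
import numpy as np

def radial_frequency_cpd(size, pixels_per_degree):
    """Radial spatial frequency (cycles/degree) of every 2-D FFT coefficient."""
    f = np.fft.fftfreq(size)            # cycles per pixel along one axis
    fx, fy = np.meshgrid(f, f)
    return np.hypot(fx, fy) * pixels_per_degree

def make_hybrid(lsf_source, hsf_source, pixels_per_degree,
                low_cut=2.0, high_cut=6.0):
    """Combine content below low_cut (cycles/degree) of one square grayscale
    image with content above high_cut of another (ideal filters: an assumption)."""
    size = lsf_source.shape[0]
    freq = radial_frequency_cpd(size, pixels_per_degree)
    low_mask = freq < low_cut           # keeps the mean (DC) of lsf_source
    high_mask = freq > high_cut
    low = np.fft.ifft2(np.fft.fft2(lsf_source) * low_mask).real
    high = np.fft.ifft2(np.fft.fft2(hsf_source) * high_mask).real
    return low + high

# Stand-in 256 x 256 "scenes"; real stimuli would be grayscale photographs.
rng = np.random.default_rng(0)
city = rng.random((256, 256))
highway = rng.random((256, 256))
hybrid = make_hybrid(city, highway, pixels_per_degree=32.0)  # hypothetical scale
```

Under these assumptions, the hybrid's spectrum below 2 cycles deg⁻¹ comes entirely from the first image and its spectrum above 6 cycles deg⁻¹ entirely from the second, with nothing retained in between.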

Procedure

A trial schematic is presented in Figure 1A. Each trial began with a fixation cross located in the center of the screen presented for 250 ms, followed by a single image presented for either 32 or 150 ms. A white noise mask (amplitude spectrum slope = 0; orientation magnitude = 0) immediately followed offset of the image and was presented for 64 ms. The image was a broadband, low-pass, high-pass, or hybrid image. Immediately following offset of the mask, observers were presented with a display screen in which they were asked to indicate whether a probe word (e.g., highway, city, living room, or valley) corresponded to the category of the previously presented image. On 50% of trials, the probe word corresponded to the image category. Of those 50% of trials on which the image was a hybrid, the probe word matched the hybrid's LSF and HSF content 25% of the time, respectively. We instructed observers to press “1” on the keyboard number pad if they believed the probe word matched the previously presented image and the “2” key if they believed that it did not. The probe word was displayed in the center of the screen and stayed visible until a response was made. Trial-to-trial feedback was not provided.

Figure 1. (A) Trial sequence in Experiment 1; (B) Trial sequence in Experiment 2.

Observers completed 16 blocks of 48 trials for a total of 768 trials. Image type and presentation duration varied from trial-to-trial within a block, and the order of images and presentation duration was chosen at random by the program. Observers completed 32 practice trials prior to beginning the experiment. The scene categories used during the practice trials were not used in the experimental trials (e.g., forest and barn scenes) and practice trials were not analyzed.

Results

Sensitivity

The sensitivity measure d′ was calculated for each condition. Condition varied according to image type (broadband, low-pass, high-pass, and hybrid) and presentation duration (32 and 150 ms). Because hybrid images contained both LSF and HSF content, we further separated these trials into those on which the probe word matched the hybrid's LSF content (Hybrid-LSF) and those on which it matched the hybrid's HSF content (Hybrid-HSF). As can be seen in Figure 2A, d′ values were high (d′ > 1.5) in all conditions, suggesting that observers were sensitive to all image types at both presentation durations. We entered d′ values into a 5 (image type) × 2 (presentation duration) repeated measures Analysis of Variance (ANOVA). There were significant main effects of image type, F_{(4, 28)} = 8.09, p < 0.001, η² = 0.54, and presentation duration, F_{(1, 7)} = 34.47, p < 0.001, η² = 0.83. The image type × presentation duration interaction was also significant, F_{(4, 28)} = 4.65, p < 0.001, η² = 0.39.
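For readers unfamiliar with the measure, d′ in a yes/no probe task is conventionally the difference between the z-transformed hit rate and false-alarm rate. The sketch below illustrates that standard formula. The trial counts are hypothetical, and the log-linear correction (adding 0.5 to each cell) is an assumption made here to keep the rates away from 0 and 1; the article does not say whether any correction was applied.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate) for a yes/no probe task."""
    # Log-linear correction: an assumption, not taken from the article.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf            # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts for one observer in one condition:
# "match" responses on probe-present trials count as hits,
# "match" responses on probe-absent trials count as false alarms.
example = d_prime(hits=40, misses=8, false_alarms=6, correct_rejections=42)
```

With these made-up counts the value is roughly 2, which would fall above the d′ > 1.5 level reported for every condition.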

Figure 2. The results of Experiment 1. (A) d′ values for each image type at each presentation duration. The error bars represented here, and throughout the manuscript, are the 95% within-subject confidence intervals described by Loftus and Masson (1994); (B) Mean scene categorization RTs for each image type at each presentation duration; (C) Mean RTs for Hybrid-LSF and Hybrid-HSF categorization at each presentation duration.

Because Experiment 1 was designed to determine the availability of spatial frequencies in our hybrid images, we were particularly interested in comparisons between Hybrid-LSF and Hybrid-HSF trial types. However, before investigating the competing spatial scale information within hybrid images, we first compared performance between control images (low-pass, high-pass, and broadband), to ensure that our observers were sensitive to all spatial scales. We first computed a 3 (image type) × 2 (presentation duration) planned contrast on the d′ values. This contrast was not significant, suggesting that there was no statistical difference in spatial frequency processing as a function of presentation duration, F_{(1, 7)} = 1.38, p > 0.279, η² < 0.01. We then compared sensitivity between control images using a series of contrasts. Specifically, we computed contrasts comparing d′ values between broadband images and high-pass (Ψ₁) and low-pass (Ψ₂) filtered images, respectively. d′ statistics and the results of these contrasts are displayed in Table 1. Observers were more sensitive to broadband images (M = 3.55; SD = 0.45) than both high-pass (M = 2.88; SD = 0.62) and low-pass filtered images (M = 2.33; SD = 0.40). Observers were equally sensitive to low-pass and high-pass filtered images (Ψ₃). The effect size measures in Experiment 1 paralleled the significance results. The largest effect sizes were between broadband images and low-pass (η² = 0.76) and high-pass (η² = 0.47) filtered images. The effect size between low-pass and high-pass filtered images was relatively smaller in comparison (η² = 0.26).

Table 1. d prime statistics for each image type at each presentation duration in Experiment 1.

Following the control image type analysis, we computed the contrast comparing hybrid trial types (Hybrid-LSF and Hybrid-HSF) as a function of presentation duration. This was not statistically significant, F_{(1, 7)} = 0.137, p > 0.722, η² < 0.01. We followed up this analysis by comparing sensitivity between hybrid trial types using a planned contrast, collapsing over presentation duration (Ψ₄). Observers were more sensitive to Hybrid-HSF image types (M = 2.71; SD = 0.49) than Hybrid-LSF image types (M = 2.14; SD = 0.26). Furthermore, the associated effect size (η² = 0.66) was similar to the effect sizes reported for the significant control contrasts, suggesting that observers were in fact more sensitive to HSFs than LSFs in the hybrid images.

Reaction time

We calculated mean reaction time (RT) measures for each trial type as a function of presentation duration. These means are displayed in Figure 2B. We entered these means into a 4 (image type) × 2 (presentation duration) repeated measures ANOVA. Unlike the calculation of d′ statistics, hybrid images were not separated further because target absent trials are the same between Hybrid-LSF and Hybrid-HSF trial types. The main effect of image type was significant, F_{(3, 21)} = 3.29, p < 0.04, η² = 0.3. However, the main effect of presentation duration and the image type × presentation duration interaction were not: F_{(1, 7)} = 0.368, p > 0.563, η² < 0.05 and F_{(3, 21)} = 0.009, p > 0.899, η² < 0.001.

Similar to the sensitivity analysis, we were primarily interested in differences between Hybrid-HSF and Hybrid-LSF image types, but first report the results related to the control images. Specifically, we computed contrasts that paralleled the sensitivity comparisons. Reaction time statistics and mean difference contrasts are displayed in Table 2. Observers were faster to respond to broadband images (M = 950.04; SD = 58.18) than both high-pass (M = 1005.26; SD = 36.67) (Ψ₁) and low-pass filtered images (M = 1007.03; SD = 48.75) (Ψ₂). There was no RT difference between low-pass and high-pass filtered images (Ψ₃). Consistent with the sensitivity analysis, the largest effect size was between broadband images and low-pass filtered images (η² = 0.52) followed by the effect size for the difference between broadband images and high-pass filtered images (η² = 0.38). The effect size between low-pass and high-pass filtered images was negligible (η² < 0.01).

Table 2. Reaction time statistics for each image type at each presentation duration in Experiment 1.

Reaction times on target present trials were compared between Hybrid-LSF and Hybrid-HSF image types and are displayed in Figure 2C. We entered these means into a 2 (hybrid trial) × 2 (presentation duration) planned contrast. Consistent with the sensitivity analysis, this contrast was not significant, suggesting that RTs did not differ between hybrid image types as a function of presentation duration, F_{(1, 7)} = 0.617, p > 0.458, η² < 0.08. We then compared RTs between Hybrid-LSF and Hybrid-HSF image types, collapsing over presentation duration. This contrast was significant, F_{(1, 7)} = 7.58, p < 0.028, η² = 0.52. Observers were faster to respond to Hybrid-LSF image types (M = 1013.81; SD = 16.37) than Hybrid-HSF image types (M = 1068.62; SD = 41.90). This was a difference of approximately 54.81 ms (SD = 52.65; 95% CI [11.91, 97.71]). It is interesting to note that the associated effect size was similar to the effect size reported in the parallel sensitivity analysis (η² = 0.66), suggesting that the effect of hybrid trial type is robust across dependent variables.
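As a reading aid, several of the effect sizes reported with the F statistics in this experiment are consistent with partial eta squared recovered from F and its degrees of freedom. This is an inference from the reported numbers rather than something the text states, and the helper name `partial_eta_squared` below is illustrative only.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F statistics reported in the Results of Experiment 1.
hybrid_rt_contrast = partial_eta_squared(7.58, 1, 7)    # reported eta^2 = 0.52
image_type_effect = partial_eta_squared(8.09, 4, 28)    # reported eta^2 = 0.54
duration_effect = partial_eta_squared(34.47, 1, 7)      # reported eta^2 = 0.83
```

Rounded to two decimals, the three computed values match the reported ones, which suggests that the η² values should be read as partial effect sizes, assuming this interpretation is correct.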

Discussion

The critical result from Experiment 1 is that we corroborated Oliva and Schyns' (1997) finding that both spatial scales are available to form the basis for hybrid image categorization. Observers in our study were sensitive to both sources of spatial frequency content and there was no significant interaction with presentation duration, although observers were overall more sensitive to HSFs than LSFs in the hybrid images. An interesting finding from Experiment 1 is that d′ values were overall high, which is suggestive of weak masking effects. The most likely explanation for this result is that we constructed our masks so that their amplitude spectrum slope (i.e., the slope that conveys amplitude and orientation information in an image) would have a value of 0. Hansen and Loschky (2013) found that white noise masks with this property are the least effective at masking natural scene stimuli, whereas noise masks whose amplitude spectrum slope most closely resembles that of a natural scene (e.g., α ≈ 1; Hansen et al., 2008) are the most effective. This suggestion is consistent with previous studies that showed that the most effective mask for a particular spatial frequency is one whose amplitude spectrum information is most similar to that of the target stimuli (Stromeyer and Julesz, 1972; Losada and Mullen, 1995; Mullen and Losada, 1999).

Experiment 2

Experiment 2 is an extension of Experiment 1. Whereas Experiment 1 assessed the availability of spatial scale information, Experiment 2 assessed diagnostic spatial scale preference between competing sources of LSF and HSF information. Thus, Experiment 2 is a replication of Experiment 1, with the exception that we assessed scene categorization using an all-alternative forced choice paradigm. We asked observers to choose which of all possible target categories corresponded to the previously presented hybrid image. Because a hybrid image's LSFs and HSFs convey information related to different categories, forcing observers to choose between all possible target categories indexes their preferred diagnostic spatial scale.