Global scene layout modulates contextual learning in change detection

Conci, Markus; Müller, Hermann J.

doi:10.3389/fpsyg.2014.00089

ORIGINAL RESEARCH article

Front. Psychol., 10 February 2014

Sec. Perception Science

Volume 5 - 2014 | https://doi.org/10.3389/fpsyg.2014.00089

Markus Conci ^*
Hermann J. Müller

Allgemeine und Experimentelle Psychologie, Department Psychologie, Ludwig-Maximilians-Universität München, München, Germany

Abstract

Change in the visual scene often goes unnoticed – a phenomenon referred to as “change blindness.” This study examined whether the hierarchical structure, i.e., the global–local layout of a scene, can influence performance in a one-shot change detection paradigm. To this end, natural scenes of a laid breakfast table were presented, and observers were asked to locate the onset of a new local object. Importantly, the global structure of the scene was manipulated by varying the relations among objects in the scene layouts. The very same items were either presented as global-congruent (typical) layouts or as global-incongruent (random) arrangements. Change blindness was less severe for congruent than for incongruent displays, and this congruency benefit increased with the duration of the experiment. These findings show that global layouts are learned, supporting detection of local changes with enhanced efficiency. However, performance was not affected by scene congruency in a subsequent control experiment that required observers to localize a static discontinuity (i.e., an object that was missing from the repeated layouts). Our results thus show that learning of the global layout is particularly linked to the local objects. Taken together, our results reveal an effect of “global precedence” in natural scenes. We suggest that relational properties within the hierarchy of a natural scene are governed, in particular, by global image analysis, reducing change blindness for local objects through scene learning.

INTRODUCTION

Complex natural environments require the visual system to provide structure to the visual input, integrating fragmentary parts into coherent objects that are segregated from the background and other objects. The available, individuated objects of a given natural scene are usually represented hierarchically, with multiple levels of representation ranging from more global to more local instantiations. For example, a forest has trees and the trees in turn have leaves, illustrating a hierarchical relationship between parts and wholes at different levels of perceptual resolution. Similar hierarchical relations have also been demonstrated for a variety of composite figures that require global- and local-level structural elements to be integrated (see, e.g., Hübner and Volberg, 2005). For instance, the Navon letter (Navon, 1977) depicted in Figure 1A consists of an arrangement of the local letters “H,” which combine to form the global letter “U.” Similarly, Kanizsa figures (Kanizsa, 1976) comprise a local arrangement of circular inducers, which yield the impression of a salient, global square shape (Figure 1B). Comparable relationships can also emerge from other hierarchical shape stimuli, where, for instance, local squares combine to form a global triangle (Figure 1C; Kimchi and Palmer, 1982).

FIGURE 1

While object information can be accessed at various levels, attentional orienting by default exhibits “global precedence” (Navon, 1977). A number of studies have demonstrated that visual search is based on global-level representations that comprise integrated objects, while fragmentary (local) object parts are not accessible with comparable efficiency (Pomerantz et al., 1977; Donnelly et al., 1991; Rensink and Enns, 1995; Rauschenberger and Yantis, 2001; Conci et al., 2007a,b, 2011; Deco and Heinke, 2007). For instance, processes of target detection in visual search prioritize complete global shape representations in Kanizsa figures or Navon letters, while the corresponding local parts are more difficult to access (e.g., Conci et al., 2007b; Deco and Heinke, 2007). Taken together, these findings suggest that the “forest comes before the trees” (Navon, 1977).

Priority of global over local levels of representation may also be related to a phenomenon referred to as “change blindness,” which demonstrates a striking inability to detect changes that occur in the visual ambient array (see Simons and Levin, 1997; Rensink, 2002, for reviews). For example, in the flicker paradigm (Rensink et al., 1997), an original image and a slightly modified image of a natural scene are presented in rapid alternation with a blank screen interposed between them. Given this sequence of images, observers are usually relatively poor at identifying the object that changes across both views of the same scene. For instance, observers quite frequently do not detect a change within the first alternation of the images. In many cases, they even fail to detect a change after 1 min of continuous image flicker, or when the eyes fixate relatively close to the changed location (O’Regan et al., 2000). By contrast, when the two images are presented without a blank, changes are easily detected, due to the transients that mark the changed object. Given this pattern of results, it appears that the internal representation of the outer world is rather sparse, providing only the rough “gist” of a scene that is carried over from one instance to the next (e.g., Simons and Chabris, 1999). The sparse gist of the available information after a change may in turn be attributable to global precedence. That is, global, “forest”-like scene properties are registered more readily, whereas local changes (e.g., to trees and leaves) go relatively unnoticed. In this view, scene memory is primarily reliant upon global image analysis, thus preventing detection of local change.

Given this weak ability to register changes across views of a scene, orienting attention to the changed (local) objects is a key requirement for actually detecting what has changed (Rensink et al., 1997; Simons and Levin, 1997; see also Simons and Chabris, 1999). One way to increase the detectability of a change is to provide top-down information, for instance, by giving observers a verbal cue (Rensink et al., 1997). In addition, change detection can also be improved to a certain extent by the degree to which a changed object is accompanied by salient bottom-up signals, such as luminance, color, or motion changes (Cole et al., 2004; Arrington et al., 2006; Cole and Liversedge, 2006). Compared to such changes within an existing object, the sudden appearance of a new object (or the sudden disappearance of an object) is most effective for reducing change blindness (Mondy and Coltheart, 2000; Scholl, 2000; Cole et al., 2004).

Besides bottom-up and top-down factors enhancing target detection, relational scene properties that potentially alter the basic structure of a scene have also been found to modulate change blindness. For example, when varying relational grouping cues (Jiang et al., 2004) or the assignment of figure and ground (Landman et al., 2004) between two successive images, change detection performance is severely impaired. By contrast, a reduction of change blindness is obtained when a given perceptual change alters the scene gist (Sampanes et al., 2008). Consequently, these results show that the relational properties of a scene (i.e., the spatial relations of objects among each other) critically influence change detection performance.

Invariant relational properties of a scene not only support detection of a change at a given instance, but can also influence long-term adaptive processes. For example, when a given target object is repeatedly paired within a consistent surround of contextual items, change detection performance is improved (Jiang and Song, 2005). Thus, repetition of the spatial contextual layout increases the detectability of a changed item. Jiang and Song (2005; see also Chun and Jiang, 1998) interpreted their findings in terms of an (implicit) mechanism that automatically associates a given target object with the statistically invariant relations given by the scene (i.e., the repeated context). Consequently, when observers are presented with an arrangement of items, associations will be formed between the target object and its surrounding context of neighboring objects, thus facilitating detection of the target on future occasions when it is presented within the same configuration. Of note, contextual learning also manifests in naturalistic environments, with scene memory linking target locations to invariant configurations, such as the spatially distributed arrangements of objects in indoor scenes (Brockmole et al., 2006). Accordingly, contextual memory may provide ecologically valid cues in predicting potential target locations, and learning the co-occurrence relations of objects in the environment can help guide behavior.

The aim of the present study was to investigate in further detail how relational structures of a scene can be acquired through contextual learning. Previous studies (described above) have shown that change blindness can be reduced when the change is associated with the global scene gist. In the current study, we investigated how learning of such global hierarchical structures in a natural scene can influence change detection. Observers were required to detect a local-object change within a natural (breakfast table) scene that presented global arrangements with either a congruent or an incongruent global scene structure (see Figure 2). Importantly, the same scene was presented repeatedly throughout the experiment to enable learning of the invariant object layout. Thus, we examined whether the structure provided by the global layout would affect learning of contextual regularities and, in turn, what effect learning would have on the detection of (local) changes.

FIGURE 2

EXPERIMENT 1

Experiment 1 was designed to test whether learning of hierarchical structure in a natural scene context can influence change blindness. To this end, observers were presented with a sequence of two search displays (i.e., a “one-shot” procedure; e.g., Cole et al., 2004; Jiang and Song, 2005; Cole and Liversedge, 2006) that consisted of photographs depicting a laid breakfast table (see Figure 2). As illustrated in Figure 2A, on each trial, one item was added to the second search display, and observers were required to indicate the side on which the change had occurred (left vs. right half of the screen). Importantly, there were two types of displays: for the global-congruent display layout, all items on the breakfast table were placed such that the global structure of the table resembled a prototypical, meaningful arrangement (Figure 2B). In contrast, for the global-incongruent display layout, the very same local items were presented at comparable eccentricities and orientations, but the arrangement was pseudo-random such that the typical, “meaningful” structure of the scene was missing (Figure 2C). Thus, in congruent displays, the relational structure among the objects provided a coherent global organization of the scene – so that to-be-detected changes of local objects were accompanied by an “intact” global configuration. The identical local objects were also available in the incongruent condition – however, the coherent global scene structure was lacking. Consequently, comparisons of both types of layout would indicate whether the analysis of global image properties influences change detection performance over and above the (salient) signal provided by the onset of the local target object itself. Moreover, we examined whether detectability of the local target varies over the duration of the experiment, to assess whether the concomitant learning of contextual regularities is modulated by the congruency of the global scene structure.

MATERIALS AND METHODS

Participants

Sixteen observers (eight male; mean age = 27 years; normal or corrected-to-normal visual acuity) participated in the experiment.

Ethics statement

The present study, including the consent procedure, was approved by the ethics board of the LMU Munich Psychology Department and conducted according to the principles of the Declaration of Helsinki. Participants received information about the study and their rights and gave informed verbal consent.

Apparatus and stimuli

The experiment was controlled by an IBM-PC compatible computer using Matlab routines and Psychophysics Toolbox extensions (Brainard, 1997; Pelli, 1997). Stimuli were presented on a 17-inch monitor (at a frame rate of 85 Hz). On each trial, two successive search displays were presented for 1200 ms each, separated by a blank interval of 600 ms (see Figure 2A for an example trial sequence). Search displays were color pictures of a laid breakfast table (1024 × 768 pixels), subtending approximately 34° × 28° of visual angle, and photographed from a bird’s eye view. Each picture consisted of 26 objects, placed on the background of a tablecloth with a red and white checkerboard pattern. All objects were typical breakfast items (four plates, two glasses, two cups, three spoons, three forks, three knives, milk, butter, jam, bread, egg, salt, sugar, apple, bread basket) that were distributed equally across the left and right halves of the screen. Displays on a given trial could be from either the global-congruent or the global-incongruent display layout condition. Congruent displays (see Figure 2B) presented all local objects in a prototypical global arrangement, with the breakfast items placed at “standard” locations (e.g., forks and knives were presented next to the plate, butter and jam were located in the central region of the table). By contrast, with incongruent layouts (see Figure 2C), the very same local items were presented at comparable eccentricities and orientations; however, the global organization of the breakfast table was lacking its typical arrangement (e.g., forks and knives were presented at the center of the table, whereas butter and jam were located next to the plate).

On a given trial, the two sequential displays were either both from the global-congruent or both from the global-incongruent condition (Figure 2A). In the first display, one object was missing on the breakfast table, whereas the second display always consisted of the complete layout of 26 objects (Figures 2B,C). Thus, there was always an onset of a new object in the second display. There were 20 objects that served as onset objects, while six of the 26 objects (four plates, sugar, basket; distributed equally across both halves of the screen) always remained stationary across the display sequence. Subjects were instructed to indicate, as accurately as possible, whether a change (i.e., the onset of a new object in the second display) had occurred on the left or the right side of the screen.

Trial sequence

Each trial started with the presentation of a central fixation cross for 500 ms. Next, the first display (with one object missing) was presented for 1200 ms. Following a blank screen of 600 ms, the second (complete) display was presented for another 1200 ms. Finally, a blank screen was presented during which the participants were instructed to give a non-speeded response via mouse keys. Participants’ task was to indicate the side of the screen (left or right) on which a change (i.e., an object onset) had occurred. In case of an erroneous response, feedback was provided by an alerting sign (“–”) presented for 1000 ms at the center of the screen. Trials were separated from each other by an interval of 1000 ms. Figure 2A illustrates a typical trial sequence with an object onset (slice of bread) occurring in the bottom right quadrant of the screen.
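The fixed-duration part of the trial sequence described above can be laid out as a simple timeline. The following Python sketch only restates the durations given in the text; the names `TRIAL_EVENTS` and `event_onsets` are illustrative and are not taken from the original Matlab routines. The response period is omitted because the response was non-speeded and therefore has no fixed duration.

```python
# Durations (ms) of the fixed-length events in one trial, as stated in the text.
# The event labels are illustrative names, not identifiers from the original code.
TRIAL_EVENTS = [
    ("fixation", 500),
    ("display_1_object_missing", 1200),
    ("blank", 600),
    ("display_2_complete", 1200),
]


def event_onsets(events):
    """Return (name, onset_ms, offset_ms) for each event, relative to trial start."""
    timeline = []
    t = 0
    for name, duration in events:
        timeline.append((name, t, t + duration))
        t += duration
    return timeline


if __name__ == "__main__":
    for name, onset, offset in event_onsets(TRIAL_EVENTS):
        print(f"{name}: {onset}-{offset} ms")
```

With these durations, the second (complete) display starts 2300 ms and ends 3500 ms after trial onset.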

Design and procedure

A two-factor within-subjects design was used. The independent variables were display type and block. Display type had two levels: global-congruent and global-incongruent. For global-congruent displays, all items were arranged such that the image of the laid table had a prototypical layout. In contrast, for the global-incongruent trials, the same items were arranged at comparable eccentricities and orientations, though with a pseudo-random stimulus configuration such that all items were clearly visible but lacked the global structure given in the congruent display condition. The second variable, block, simply divided the experiment into five consecutive bins, to permit examination of possible learning effects over the course of the experiment.

At the beginning of the experiment, participants completed one block of 40 practice trials to become familiarized with the task. All subsequent experimental blocks contained 20 global-congruent and 20 global-incongruent trials, presented in randomized order, such that in each block, all 40 different target objects were presented once. There were five blocks in the experiment, yielding 200 experimental trials in total.
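The block structure can be illustrated with a short Python sketch. It rests on one interpretation of the text, namely that the 40 targets per block are the 20 onset objects crossed with the two display types. The function name `build_trials`, the object indices, and the fixed random seed are hypothetical and serve only to make the sketch reproducible; the actual randomization procedure is not described in the source.

```python
import random

LAYOUTS = ("global-congruent", "global-incongruent")
N_ONSET_OBJECTS = 20  # objects that could serve as the onset target
N_BLOCKS = 5          # experimental blocks (practice block not included)


def build_trials(seed=0):
    """Return a list of blocks; each block is a shuffled list of
    (layout, object_index) pairs, with every pair occurring once."""
    rng = random.Random(seed)  # assumption: seed fixed only for reproducibility
    blocks = []
    for _ in range(N_BLOCKS):
        block = [(layout, obj)
                 for layout in LAYOUTS
                 for obj in range(N_ONSET_OBJECTS)]
        rng.shuffle(block)  # randomized order within the block
        blocks.append(block)
    return blocks


blocks = build_trials()
total_trials = sum(len(block) for block in blocks)  # 5 blocks x 40 trials = 200
```

Under this reading, each block contains 20 trials per display type and the experiment comprises 200 trials, matching the totals given above.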

RESULTS AND DISCUSSION

Accuracy of detecting the changes across display sequences was relatively high, with an average of 76% of correct responses across all conditions.

The mean correct responses were analyzed by means of a two-way repeated-measures analysis of variance (ANOVA) with main terms for display type (global-congruent, global-incongruent) and block (1–5). This analysis revealed significant main effects for display type, F(1,15) = 14.76, p < 0.003, and block, F(4,60) = 7.63, p < 0.001. The main effect of display type was due to change detection being more accurate, by 5.4%, for global-congruent as compared to global-incongruent display layouts (Figure 3A). The main effect of block indicated that responses became, in general, more accurate as the experiment progressed (70%, 73%, 76%, 79%, and 82% for blocks 1–5). Importantly, the display type by block interaction was also significant, F(4,60) = 2.67, p < 0.05, owing to the fact that the advantage for congruent display types was not evident right from the beginning, but became reliable only as the experiment progressed: As can be seen from Figure 3B, there were no significant differences in blocks 1 and 2 (mean difference: 0.8%, all ps > 0.7); but from block 3 onwards, changes were detected more accurately with global-congruent compared to global-incongruent display layouts (mean difference: 8.5%, all ps < 0.03). Thus, this analysis shows that global-congruent scene layouts attenuated change blindness. In addition, accuracy increased in general with repeated exposure of the changed items across blocks, but this performance gain was particularly pronounced for global-congruent display layouts. This suggests that learning of the global-congruent displays gradually facilitated detection of the object onsets.
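The degrees of freedom reported above follow from the design (16 observers, two display types, five blocks). The Python sketch below applies the standard formulas for a fully within-subjects two-way ANOVA in which each effect is tested against its own subject-by-effect error term, without any sphericity correction; that this is exactly how the authors computed their tests is an assumption. The function name `rm_anova_df` is illustrative. The sketch also checks that the five block means average to the overall accuracy reported in the text.

```python
def rm_anova_df(n_subjects, levels_a, levels_b):
    """Degrees of freedom (effect, error) for a two-way, fully
    within-subjects ANOVA with uncorrected error terms."""
    df_a = levels_a - 1
    df_b = levels_b - 1
    df_ab = df_a * df_b
    n_minus_1 = n_subjects - 1
    return {
        "A": (df_a, df_a * n_minus_1),
        "B": (df_b, df_b * n_minus_1),
        "AxB": (df_ab, df_ab * n_minus_1),
    }


# Design values taken from the text: 16 observers, 2 display types, 5 blocks.
df = rm_anova_df(n_subjects=16, levels_a=2, levels_b=5)
# df["A"]   -> (1, 15)  display type
# df["B"]   -> (4, 60)  block
# df["AxB"] -> (4, 60)  display type x block

# Block means (percent correct) as reported in the text.
block_means = [70, 73, 76, 79, 82]
grand_mean = sum(block_means) / len(block_means)  # 76.0
```

Because every block contained the same number of trials, the unweighted mean of the block means equals the overall mean accuracy.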

FIGURE 3

In a subsequent step, an additional analysis was performed to further investigate what was actually learned in the display layouts: it could be the case that the increase in accuracy across blocks was due to the repeated exposure to the scene layouts, which might have facilitated object detection through learning the relational scene structure (e.g., Chun and Jiang, 1998). Alternatively, it could be that on each trial, observers registered and assigned an “inhibitory tag” (e.g., Klein and MacInnes, 1999) to the location of the changed target object; as a result, on subsequent trials, already registered target locations would then be inhibited – effectively biasing attention towards locations that so far (within a given block) had not contained a target. Thus, on this account, observers learn to inhibit locations at which a change had occurred on previous trials, permitting them to anticipate the locations of the upcoming targets with increasing validity across the trials within a block (as the number of alternative locations decreases). To examine whether this form of inhibitory tagging might explain the current results, blocks 3–5 (which showed evidence for a benefit for global-congruent layouts) were each separated into two halves. Evidence for inhibition of already detected changes should then result in an increase in performance for the second half of each block, as the number of potential target locations decreases with each additional trial. A repeated-measures ANOVA performed on the collapsed data from blocks 3–5 with main terms for display type (global-congruent, global-incongruent) and block half (first, second) revealed the display type effect to be significant, F(1,15) = 37.52, p < 0.001, essentially mirroring the above difference between global-congruent and -incongruent displays. However, there was no main or interaction effect that included block half (all ps > 0.2), at variance with inhibitory location tagging across trials in the current experiment (accuracies were 79% and 78% for the first and second block half, respectively). Note that this null-effect also rules out accounts on which internal search “templates” for already detected target objects, rather than their scene locations, are inhibitorily tagged (e.g., along the lines of Houghton and Tipper, 1994). Thus, by default, the performance benefit for congruent displays is likely attributable to contextual learning of the global scene layout (rather than a particular “anticipation” strategy).
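The block-half comparison amounts to splitting the responses of each block, in presentation order, into a first and a second half and computing accuracy separately for each. A minimal Python sketch of this step is given below. The function name `accuracy_by_half` and the input are hypothetical; in particular, `toy_block` is made-up illustration data, not data from the experiment. An inhibitory-tagging account would predict higher accuracy in the second half than in the first.

```python
def accuracy_by_half(correct):
    """Split one block's responses (booleans in presentation order) into a
    first and a second half and return the proportion correct in each."""
    mid = len(correct) // 2
    first, second = correct[:mid], correct[mid:]
    return sum(first) / len(first), sum(second) / len(second)


# Toy input, NOT data from the study: 40 made-up responses (True = correct).
toy_block = [True, True, False, True] * 10
first_half, second_half = accuracy_by_half(toy_block)  # (0.75, 0.75)
```

In the toy example the two halves are equally accurate, which is the qualitative pattern (no effect of block half) that argues against a tagging account.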

EXPERIMENT 2

Having established that global scene structure reduces change blindness after several exposures, in particular for congruent layouts, Experiment 2 was designed to further examine possible causes of this effect. One possibility is that detection of the change in global-congruent layouts is facilitated by scene memory; that is, learning of the scene context may lead to more efficient encoding and, as a result, enhanced detection of the changed (target) object. Alternatively, however, familiarity with a given layout after a few trials might also facilitate localization of the target just by inspecting the pre-change display. Thus, for instance, by searching for the object that is missing from the pre-change display, observers might be able to detect an inconsistency in the scene without actually comparing the pre- and post-change displays. That is, observers might come to use – over repeated encounters – their acquired knowledge of what the search layout should look like, giving rise to an effect of familiarity on the detection of the missing object without contextual learning playing a role. Thus, on this alternative account, the results of Experiment 1 are attributable to a relatively general effect of contextual familiarity on search, rather than a specific influence of contextual learning on the localization of a target. It should be noted that this explanation would not necessarily require inhibitory tagging within a block of trials (see above), as scene inconsistencies might be detected without registration of previous targets.

Experiment 2 was performed to decide between these two possible explanations. Observers were presented with a single image of a scene (identical to the first displays in Experiment 1) and asked to indicate the side on which an object was missing within the scene. Thus, if observers detect the missing object in repeated layouts on the basis of familiarity, then scene congruency should modulate performance in a way comparable to Experiment 1. Conversely, no influence of scene congruency should be evident if contextual learning is specifically associated with detection of a target object.