Scene-object semantic incongruity across stages of processing: From detection to identification and episodic encoding

Ortiz-Tudela, Javier; Jiménez, Luis; Lupiáñez, Juan

doi:10.3389/fcogn.2023.1125145

ORIGINAL RESEARCH article

Front. Cognit., 28 February 2023

Sec. Attention

Volume 2 - 2023 | https://doi.org/10.3389/fcogn.2023.1125145

This article is part of the Research Topic Insights in Attention: 2022.

Javier Ortiz-Tudela¹,²*

Luis Jiménez³

Juan Lupiáñez¹

¹Centro de Investigación Mente, Cerebro y Comportamiento (CIMCYC), Universidad de Granada, Granada, Spain
²Department of Psychology, Goethe University Frankfurt, Frankfurt Am Main, Germany
³Facultad de Psicología, Universidad de Santiago de Compostela, Santiago de Compostela, Spain

Visual processes are assumed to be affected by scene-object semantics throughout the stream of processing, from the earliest processes of conscious object detection to the later stages of object identification and memory encoding. However, very few studies have jointly explored these processes in a unified setting. In this study, we build upon a change detection task to assess the influence of semantic congruity between scenes and objects across three processing stages, as indexed through measures of conscious detection, object identification, and delayed recognition. Across four experiments, we show that semantically incongruent targets are easier to detect than their congruent counterparts, but that the latter are better identified and recognized in a surprise memory test. In addition, we used eye-tracking measures, in conjunction with these three behavioral indexes, to further understand the locus of the advantage observed in each case. The results indicate that (i) competition with other congruent objects modulates the effects of congruity on target detection, but it does not affect identification or recognition memory; (ii) the detection cost of scene-congruent targets is mediated by earlier fixations on incongruent targets; (iii) neither fixation times, dwell times, nor pupil dilatation are related to the effects obtained in identification and recognition; and (iv) even though congruent targets are both better identified and remembered, the recognition benefit does not depend on the identification demands. The transversal approach taken in this study represents a challenging but exciting perspective that holds the potential to build bridges between the seemingly different but related fields of conscious detection, semantic identification, and episodic memory.

Introduction

The amount of information with which our cognitive system is continuously faced is overwhelming. Of all the information that gets through our senses, only a small portion reaches a state in which we actually become aware of it. In turn, an even smaller fraction of that information is stored into memory and can eventually be remembered. Understanding what sorts of transformations that information undergoes across the stream of processing is thus a very important, but often neglected, aspect of the study of human cognition. Analyzing the course of the same information across different processing stages can provide new insights into the underlying mechanisms and processes at play throughout this course.

One of the key modulators at several stages of that multiple-filter operation is semantic information. For instance, previous knowledge about the world may bias the information that gets access to our conscious awareness, by anticipating the most likely stimuli given a set of priors (Rao and Ballard, 1999; Summerfield et al., 2006). Similarly, the semantic features of a scene can also determine which objects will actually be attended, even beyond the biases imposed by other lower-level perceptual features (Peelen and Kastner, 2014; Santangelo et al., 2015; Henderson and Hayes, 2017). Moreover, previous knowledge can help us to interpret and give meaning to seemingly meaningless stimuli (Mooney, 1957; Gorlin et al., 2012) and it can even adjust which information gets stored into memory and which does not (Henson and Gagnepain, 2010; Van Kesteren et al., 2012). In this study, we will use prior semantic knowledge of real-world visual scenes to jointly characterize three key stages in the processing of information: detection, identification, and episodic encoding.

Object detection

The unspecific report of the detection of a visual stimulus can be studied by means of many different paradigms. Most of them require participants to press a given key in response to the detection of a target stimulus independently of features such as its location, color, or identity. These seemingly unimportant features are often used as independent variables that either speed up or slow down detection times and can even facilitate or impair detection accuracy, leading to positive and negative effects like priming (Kroll and Potter, 1984), change blindness (Simons and Rensink, 2005) or inhibition of return (Posner et al., 1985), which are often interpreted as the result of a detection cost (Lupiáñez et al., 2013).

The semantic features of an image are also thought to bias detection responses during scene processing. Hollingworth and Henderson (2000) showed that the detection of a changing target improves when the to-be-detected object is embedded in a semantically incongruent context. Moreover, LaPointe and Milliken (2016) showed that incongruent objects had shorter first fixation latencies. This variable represents the time elapsed from the start of the trial until the object is fixated for the first time, and it has often been used as a measure of pre-attentional processes influencing attentional capture.

Object identification

Even though detection and identification of an object appear to be two seamless stages of perception, LaPointe et al. (2013) showed that semantic information can be used to dissociate both processes, as they were affected in opposite ways by semantic congruity. They used a change detection task in which the identity of the to-be-detected object either matched or mismatched the gist of the surrounding scene, and they asked participants to detect and subsequently identify the changing object. Their results replicated the previously reported congruity detection cost, but they showed a simultaneous benefit for congruent targets on the identification task. This congruity identification benefit thus refers to facilitated access to the semantic features of a target when it is presented in the context of other semantically related objects. This finding is in line with research on prior knowledge and expectations, which shows that object identification is improved when the visual input matches what the observer is expecting (Eger et al., 2007; Esterman and Yantis, 2010). Importantly, at least one previous study has looked at on-target dwell time (i.e., the sum of time spent fixating the target region) as a proxy for total target processing time in the context of the identification benefit (LaPointe and Milliken, 2016). This study found no differences in dwell time between congruent and incongruent objects, thus supporting the notion that this benefit does not merely reflect increased processing time.

Long-term storage and retrieval

Both the detection cost and the identification benefit are immediate measurable consequences of embedding an object in a semantic context. However, surrounding semantic information can also have long-term consequences by impacting how the object is encoded into memory. As a consequence, the ability to distinguish a previously seen object from one never seen before (i.e., recognition memory) will be modulated by the semantic context in which the object was presented. For instance, a congruent background can facilitate later access to a given object by easing its integration into existing schemas (Gronau and Shachar, 2015; Kaiser et al., 2015; Ortiz-Tudela et al., 2016; Brod and Shing, 2019; Wynn et al., 2019). Conversely, an incongruent background can also render a given object memorable by signaling it as salient or unexpected (Henson and Gagnepain, 2010; Van Kesteren et al., 2012). These seemingly incompatible findings are currently the focus of active research (Ortiz-Tudela et al., 2018b; Greve et al., 2019; Quent et al., 2021), and considering the role of the adjacent processes can provide important insights into the debate.

Previous research using gaze measures to study memory phenomena (Võ et al., 2008; Otero et al., 2011; Kafkas and Montaldi, 2012) has largely relied on pupil dilation, that is, the variation in the diameter of the pupil, which has often been used as a measure of the cognitive effort devoted to the task. These studies consistently observe larger pupil dilation at retrieval for successfully remembered items. This effect is generally assumed to be a consequence of either increased mental effort that leads to better memory or a subjective feeling of familiarity with the correctly identified items; either of these interpretations must be ascribed to processes taking place at the moment of retrieval. In our study, we focused on semantic congruency effects during encoding (i.e., during visual processing of the stimuli) and on how these relate to eventual memory performance.

The present study

Because much of the abovementioned research has focused exclusively on one or a subset of these three different stages, it remains largely unknown whether they rely on independent mechanisms. We argue that a simultaneous study of these different phenomena might provide a more realistic picture of the hierarchical nature of this continuous stream that would have the potential to reveal existing interactions and dependencies between them. Thus, in this study, we intend to better explore how the semantic relatedness between an object and its scene context may affect different stages in the perceptual processing of the object, and ultimately determine its encoding in memory. We designed four experiments with a change detection task in which we manipulated the semantic congruity of the targets with the gist of the scenes in which they were embedded and assessed which of these changing targets were more efficiently detected, identified, and recognized. In Experiments 1A and 1B we compared two presentation procedures and two types of scenes differing in the number of objects presented on the scenes by assessing the indices of detection, identification, and recognition. In Experiment 2, we removed the identification task and replicated the setup for detection and recognition, to assess whether the effects obtained in recognition were independent of explicit identification demands. Finally, Experiment 3 typified the gaze patterns associated with each of these three processes, analyzing separately the amount of time elapsed from the start of the trial to the first fixation on the target, the amount of time spent fixating the target region, and the average pupil dilatation measured on each trial. 
Because each of these measures has been taken to reflect different cognitive functions such as attentional capture (first fixation), total processing time (dwell time) or cognitive effort (pupil dilatation), we surmise that this study might reveal important information on the impact of semantic relatedness at each of these three processing stages and illustrate a potentially useful approach to the study of how semantic congruity may affect the full stream of processing.

Experiment 1

Whether the semantic effects described in the introduction (i.e., detection cost, identification benefit, and recognition benefit) are a consequence of priming or of object competition mechanisms is still unresolved. Stein and Peelen (2015) recreated a situation in which detection took place with no competition from other objects (i.e., the target was presented alone in the context of visual noise). Their study included a cue which could either match or mismatch the category of an object suppressed under continuous flash suppression (CFS) conditions (Tsuchiya and Koch, 2005). With this paradigm, participants benefited from congruent cues. In these conditions, and in the absence of potential competitors, mechanisms such as priming (Kroll and Potter, 1984) or top-down inferences over ambiguous stimuli (Bar, 2003; Gorlin et al., 2012) are most likely responsible for guiding behavior. In contrast, the conditions imposed by change detection paradigms can be considered the opposite situation: responding to cluttered images relies heavily on object competition, since the participants' goal is to selectively detect a changing target among many distracters. Under these conditions, the presence of many different but semantically related objects hinders the detection of the specific (changing) target (Hollingworth and Henderson, 2000; LaPointe et al., 2013; LaPointe and Milliken, 2016; Ortiz-Tudela et al., 2016, 2018a). In Experiment 1 of the present study, we attempted to recreate an intermediate situation, using LaPointe et al.'s task but reducing the presence of distracters to prevent competition. We presented participants with two types of natural scenes: cluttered scenes, in which the images included many non-target objects together with the target one, and sparse scenes, in which only the target object was presented against a background image.

If semantic effects take place as a consequence of priming-like or top-down inferential mechanisms, they ought to be present in both types of scenes, since the propagation of semantic properties from the scenes to the individual objects can equally occur in both conditions. Conversely, if the aforementioned effects arise as a consequence of stimulus competition, they should appear selectively in cluttered trials, where there are many objects that compete with each other. More specifically, we hypothesized that, in the present experiment, the detection cost ought to be present only for cluttered trials. In contrast, the identification benefit, which arguably relies on spreading activation from the context image to the object (Palmer, 1975; Davenport and Potter, 2004; Eger et al., 2007), ought to be present in both cluttered and sparse trial types. Lastly, given that the recognition benefit has been previously hypothesized to be driven by schema-integration processes (Ortiz-Tudela et al., 2016), and those rely solely on the availability of a contextual schema and not on the presence of other objects, we hypothesized that the recognition benefit should also be observed for both stimulus types.

Finally, because including qualitatively different sets of images in a task might entail not only the differential processing of those images but an overall change in participants' task set and strategies, we conducted two separate but complementary experiments. In Experiment 1A, the order of presentation of the two stimulus types was randomized so that it was impossible to anticipate the nature of the upcoming trial and to be specifically prepared for it in advance. In Experiment 1B, stimuli from the same set of images (i.e., cluttered vs. sparse) were grouped into blocks, so that all the trials from one group were presented together; this blocked setup allows participants to adjust their strategy to the corresponding block so that the optimal task set can be prepared before the onset of every trial.

Material and methods

Participants

Twenty students (18 female; mean age: 21.84; SD: 6.30) from the Universidad de Granada participated in Experiment 1A; another 20 students (18 female; mean age: 20.45; SD: 5.65), drawn from the same pool, participated in Experiment 1B. All of them volunteered in exchange for course credit and signed an informed consent form approved by the local ethics committee. The sample size was determined based on previous studies using a similar paradigm (LaPointe et al., 2013; Ortiz-Tudela et al., 2016, 2018a), and a sensitivity analysis was conducted to estimate the smallest detectable effect size. This analysis revealed that, with the available sample size, we would be able to detect effect sizes of at least d = 0.58, with 80% power and an alpha level of 0.05 (one-tailed matched samples t-test). All experiments in this paper, which are part of a larger research project approved by the Universidad de Granada Ethical Committee (175/CEIH/2017), were conducted according to the ethical standards of the 1964 Declaration of Helsinki (last update: Seoul, 2008).
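
The sensitivity figure reported above can be approximated by simulation. The following sketch is illustrative only and is not the procedure used in the study: it assumes n = 20 paired difference scores drawn from a normal distribution with unit standard deviation, it uses the tabulated one-tailed critical value t(19) = 1.729 for an alpha level of 0.05, and the function name `simulated_power` is hypothetical.

```python
import math
import random


def simulated_power(d, n=20, t_crit=1.729, n_sims=20000, seed=0):
    """Monte Carlo power of a one-tailed matched-samples t-test.

    d      : assumed standardized effect size of the paired differences
    n      : number of participants (assumed to be 20 here)
    t_crit : one-tailed critical t for alpha = 0.05 and df = n - 1 = 19
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        # Simulate one experiment: n difference scores with mean d and SD 1.
        diffs = [rng.gauss(d, 1.0) for _ in range(n)]
        mean = sum(diffs) / n
        var = sum((x - mean) ** 2 for x in diffs) / (n - 1)
        t = mean / math.sqrt(var / n)
        if t > t_crit:
            rejections += 1
    return rejections / n_sims


power = simulated_power(0.58)
print(f"Estimated power for d = 0.58: {power:.2f}")
```

Under these assumptions, the estimate for d = 0.58 should fall close to the 80% power stated above; setting `d` to 0 instead returns the false-positive rate, which should be close to the alpha level.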

Stimuli

All of the stimuli included in this and subsequent experiments in this study were either borrowed from previous publications (LaPointe et al., 2013; LaPointe and Milliken, 2016; Ortiz-Tudela et al., 2016, 2018a) or specifically built to match the needs of our experiment (see also below). All the stimuli consisted of scene-object combinations, and both scenes and objects depicted real-world content (e.g., the image of a forest with a deer as an object). All the scene images were 850 × 565 pixels and the original object images were 500 × 500 pixels in size. All the objects were digitally resized and embedded in the scenes using Adobe Photoshop CS6. Each object was paired with two images, one congruent and one incongruent (Supplementary Table S1). Although the size of the objects was adjusted for each individual scene, an attempt was made to keep the size relatively similar across the two versions. We provide probability maps of the area covered by the objects in both congruency conditions as well as a statistical analysis of the differences in size between conditions and a correlation of each object's size across conditions (Supplementary Figure S1). The analysis confirmed the lack of differences in object size between conditions (BF01 = 4.327) and a strong within-object correlation of the small differences (Pearson's r = 0.846, p < 0.001). In addition, we also computed pixel-wise saliency (Supplementary Figure S2) and luminance (Supplementary Figure S3) metrics and ran a Bayesian t-test between congruency conditions. The results also supported the lack of differences in either of the measures (BF01 = 5.968 and BF01 = 7.951, respectively).

Procedure

Each participant completed three sequential phases: the first one consisted of a change detection task. This phase was followed by 10 min of mathematical operations that served as a distracter task. Finally, memory of the target objects from the change detection task was assessed via a surprise recognition test. The duration of the entire session was ~45 min.

The overall structure of the session was identical for Experiment 1A and 1B with the sole exception of the order of presentation of the cluttered vs. sparse trial types of the change detection task (i.e., randomized for Experiment 1A and blocked for Experiment 1B). In Experiment 1B randomization was applied within each block so that the sequence of trials within that block was different for each participant; the order of the blocks was counterbalanced across participants.
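
The two presentation schemes can be sketched as follows. This is an illustrative outline rather than the experiment code: the function name `trial_order`, its parameters, and the way the first block is passed in (to be alternated across participants for counterbalancing) are assumptions.

```python
import random


def trial_order(cluttered, sparse, blocked, first_block="cluttered", seed=None):
    """Return a trial sequence for one participant.

    blocked=False : cluttered and sparse trials fully intermixed (as in 1A).
    blocked=True  : trials shuffled within each block, blocks presented one
                    after the other (as in 1B); `first_block` is the block
                    shown first and would be alternated across participants.
    """
    rng = random.Random(seed)
    if not blocked:
        trials = list(cluttered) + list(sparse)
        rng.shuffle(trials)
        return trials

    blocks = {"cluttered": list(cluttered), "sparse": list(sparse)}
    for block in blocks.values():
        rng.shuffle(block)  # randomize order within each block
    second_block = "sparse" if first_block == "cluttered" else "cluttered"
    return blocks[first_block] + blocks[second_block]


# Hypothetical trial labels, for demonstration only.
cluttered_trials = [f"cluttered_{i}" for i in range(5)]
sparse_trials = [f"sparse_{i}" for i in range(5)]
print(trial_order(cluttered_trials, sparse_trials, blocked=True,
                  first_block="sparse", seed=1))
```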

Change detection task

Each trial consisted of a rapid alternation of two versions of the same image, each displayed for 250 ms. The two versions represented scenes which were identical to each other except for the presence or absence of a key object. Participants were required to press the space bar on a QWERTY keyboard as soon as they noticed any detail that was different between the two versions of the scene. To prevent the changing object from popping out, an intervening blank screen was displayed for 250 ms between the two presentations. This intervening screen rendered the standard flickering appearance of the paradigm (Rensink et al., 1997). Critically, we manipulated the congruity between the to-be-detected object and the background scene. On half of the trials, the target identity matched the gist of the scene (i.e., congruent trials) and on the other half, it corresponded to an object that was not expected or frequent in that context (i.e., incongruent trials). After the detection response, or after a maximum of nine alternation cycles, the sequence stopped and a new screen prompted participants to identify the changing object with a few words (e.g., black dog) or by locating it on the screen (e.g., bottom-left) if identification was not possible (Figure 1). To ensure participants' engagement in the task, 10% of no-change trials were included (i.e., catch trials). Participants were not informed of the presence of these no-change trials, since previous studies have shown that being aware of the presence of those trials can change participants' response bias (Ortiz-Tudela et al., 2016). A total of 90 object-image combinations were used.
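
The timing of a single trial can be laid out as a simple event schedule. The sketch below is a reading of the description above, not the presentation script: it assumes that one alternation cycle comprises both scene versions, each followed by a blank screen, and that the version containing the object is shown first. The function name `flicker_schedule` and the screen labels are hypothetical.

```python
def flicker_schedule(max_cycles=9, image_ms=250, blank_ms=250):
    """Return a list of (onset_ms, screen, duration_ms) events for one trial.

    Assumption: each cycle is with_object -> blank -> without_object -> blank.
    In the task the sequence stops early at the detection response; here the
    full schedule up to the maximum number of cycles is generated.
    """
    cycle = (
        ("with_object", image_ms),
        ("blank", blank_ms),
        ("without_object", image_ms),
        ("blank", blank_ms),
    )
    events = []
    onset = 0
    for _ in range(max_cycles):
        for screen, duration in cycle:
            events.append((onset, screen, duration))
            onset += duration
    return events


schedule = flicker_schedule()
last_onset, _, last_duration = schedule[-1]
print(len(schedule), "events; sequence ends at", last_onset + last_duration, "ms")
```

Under these assumptions, a trial with no response lasts 9 cycles × 1,000 ms = 9,000 ms before the identification prompt appears.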

FIGURE 1

Figure 1. Trial structure for the change detection task in Experiments 1A, 1B, and 3. Participants sequentially performed a detection task followed by an identification task (see Ortiz-Tudela et al., 2016, for a similar procedure).

More importantly for our purposes, we included two sets of trials. The cluttered set was built so that the target object (i.e., the changing one) was one among many other presented objects. Conversely, in the sparse set scenes, the target object was presented in isolation against an open background image (Figure 2). For the cluttered set complex natural scenes were selected such as a busy city street, a park with children and trees or a big city skyline; for the sparse set, rather empty scenes were selected such as a wide prairie, a desert, or an open sky. Cluttered and sparse set scene trials were intermixed within the same block of trials in Experiment 1A and in different blocks of trials in Experiment 1B.

FIGURE 2

Figure 2. Example of stimuli used in Experiment 1A and 1B. Scenes in the cluttered set were taken from Ortiz-Tudela et al. (2016); for the sparse set, scenes with none or just a few non-target objects were selected.

Distracter task

Participants completed paper and pencil math operations for a maximum time of 10 min. None of the participants completed the entire set of proposed operations. The exact operations used are available at https://github.com/ortiztud/three_indices.

Recognition memory test

All the target objects from the change detection task, together with 90 new objects, were used in the memory test. Each object was presented alone (i.e., stripped from any scene context) at the center of the screen, covering ~10° of visual angle. Participants performed an old vs. new judgment without any time restriction. Correct responses to old objects were coded as hits and incorrect responses to new objects were coded as False Alarms (FAs).
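
The scoring of the old vs. new judgments can be illustrated with a short sketch. It follows the standard signal-detection convention, in which an "old" response to an old object is a hit and an "old" response to a new object is a false alarm; the function name `score_recognition` and the data layout are assumptions made for illustration.

```python
def score_recognition(responses):
    """Score old/new judgments.

    responses : iterable of (said_old, is_old) boolean pairs, one per test item.
    Returns the outcome counts, the hit rate and the false-alarm rate.
    """
    counts = {"hit": 0, "miss": 0, "false_alarm": 0, "correct_rejection": 0}
    for said_old, is_old in responses:
        if is_old:
            counts["hit" if said_old else "miss"] += 1
        else:
            counts["false_alarm" if said_old else "correct_rejection"] += 1

    n_old = counts["hit"] + counts["miss"]
    n_new = counts["false_alarm"] + counts["correct_rejection"]
    hit_rate = counts["hit"] / n_old if n_old else float("nan")
    fa_rate = counts["false_alarm"] / n_new if n_new else float("nan")
    return counts, hit_rate, fa_rate


# Hypothetical responses: two old items and three new items.
demo = [(True, True), (False, True), (True, False), (False, False), (False, False)]
counts, hit_rate, fa_rate = score_recognition(demo)
print(counts, hit_rate, fa_rate)
```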

Results

Experiment 1A

Participants (N = 5) who reported a change in more than 40% of catch trials were excluded from the analyses. The three dependent variables of interest were analyzed separately using 2 × 2 repeated measures ANOVAs with scene-object congruity (congruent vs. incongruent) and trial type (cluttered vs. sparse) as within-subjects factors.
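
The exclusion rule can be expressed as a simple filter. The sketch below is illustrative: the function name `excluded_participants`, the participant identifiers, and the data layout are hypothetical, and only the 40% threshold comes from the text.

```python
def excluded_participants(catch_reports, threshold=0.40):
    """Return ids of participants reporting a change on more than
    `threshold` of their catch (no-change) trials.

    catch_reports : dict mapping participant id to a list of booleans,
                    True meaning a change was reported on that catch trial.
    """
    excluded = []
    for participant, reports in catch_reports.items():
        false_report_rate = sum(reports) / len(reports)
        if false_report_rate > threshold:  # strictly more than 40%
            excluded.append(participant)
    return excluded


# Hypothetical data for three participants.
demo_reports = {
    "p01": [True] * 5 + [False] * 4,   # 56% of catch trials reported
    "p02": [False] * 9,                # 0%
    "p03": [True] * 4 + [False] * 6,   # exactly 40%, retained
}
print(excluded_participants(demo_reports))
```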

Detection

Performance on the detection task was evaluated by combining detection times with the proportion of correct responses in an overall detection index (proportion of correct responses/detection times; Ortiz-Tudela et al., 2018b). The analysis of the detection index revealed a significant trial type by congruity interaction, F_{(1, 14)} = 5.954, p = 0.029, ηp² = 0.40, showing that on the cluttered set responding to congruent targets was less efficient than responding to incongruent targets, t_{(14)} = −3.41, p = 0.004, ηp² = 0.43, but there were no differences in the sparse set, t_{(14)} = −1.32, p = 0.208, ηp² = 0.01.
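
A minimal sketch of how such an index could be computed for one participant and condition is given below. It assumes that detection times are averaged over correct trials only and expressed in seconds; neither detail is specified above, and the function name `detection_index` is hypothetical.

```python
def detection_index(correct, detection_times):
    """Proportion of correct responses divided by mean detection time.

    correct         : list of booleans, one per change trial
    detection_times : list of detection times (assumed to be in seconds)
    Assumption: only correct trials contribute to the mean detection time.
    Higher values indicate more efficient detection.
    """
    proportion_correct = sum(correct) / len(correct)
    correct_times = [t for t, ok in zip(detection_times, correct) if ok]
    mean_time = sum(correct_times) / len(correct_times)
    return proportion_correct / mean_time


# Hypothetical data for one participant in one condition.
index = detection_index([True, True, False, True], [2.0, 4.0, 9.0, 3.0])
print(index)
```

With these illustrative values, three of four trials are correct and the mean detection time on correct trials is 3.0 s, giving an index of 0.75 / 3.0 = 0.25.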

Identification

Only correctly detected objects for each participant were included in the following analyses. The analysis of the proportion of correctly identified objects replicated previous findings of higher identification scores for congruent objects, F_{(1, 14)} = 10.981, p = 0.005, ηp² = 0.47. Importantly, the trial type by scene-object congruity interaction was not significant in this measure, F < 1, suggesting that the identification benefit was present in both trial types, t_{(14)} = 2.49, p = 0.026, ηp² = 0.36 and t_{(14)} = 3.24, p = 0.006, ηp² = 0.43 for cluttered and sparse trials, respectively.

Recognition

Trials that were correctly detected and correctly identified were passed along to the recognition analyses. Overall d' was 1.27 and beta 1.84. Since it was not possible to assess independent FA rates for congruent and incongruent trials, overall hit rates were used as a measure of memory performance. The analysis did not show a significant effect of trial type, F_{(1, 14)} = 3.082, p = 0.101, ηp² = 0.15, even though we measured numerically higher recognition scores for objects in the sparse set (0.78) compared to those in the cluttered set (0.75). The numerical pattern also showed higher memory rates for congruent than for incongruent objects, at least for the cluttered scenes (see Table 1), but neither this difference nor the two-way congruity × trial type interaction was close to statistical significance, Fs < 1.
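
For reference, d' and beta are conventionally derived from the hit and false-alarm rates under the equal-variance Gaussian signal-detection model. The sketch below shows that standard computation with illustrative rates; it is not claimed to be the exact routine used for the values above, and the function name `dprime_beta` is hypothetical.

```python
import math
from statistics import NormalDist


def dprime_beta(hit_rate, fa_rate):
    """Sensitivity (d') and likelihood-ratio bias (beta).

    d'   = z(hit rate) - z(false-alarm rate)
    beta = exp((z(FA)^2 - z(hit)^2) / 2)
    Rates of exactly 0 or 1 have no finite z-score and must be corrected
    beforehand (no correction is applied in this sketch).
    """
    z = NormalDist().inv_cdf
    z_hit, z_fa = z(hit_rate), z(fa_rate)
    d_prime = z_hit - z_fa
    beta = math.exp((z_fa ** 2 - z_hit ** 2) / 2)
    return d_prime, beta


# Illustrative rates, not taken from the experiments.
d_prime, beta = dprime_beta(0.80, 0.20)
print(round(d_prime, 2), round(beta, 2))
```

A beta above 1 indicates a conservative criterion (a tendency to respond "new"), whereas symmetric hit and false-alarm rates, as in this example, yield a beta of 1.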

TABLE 1

Table 1. Mean RT and percentage of accurate detection responses (in parenthesis) for object detection, and percentage of accurate responses for object identification and delayed recognition, for each of the four experiments.

Experiment 1B

The same approach as in Experiment 1A was adopted for the analyses of Experiment 1B. Data from three participants were excluded from the analysis for poor performance in the detection task.

Detection

The analysis of detection efficiency replicated that of Experiment 1A. The trial type by congruity interaction was close to significance for the detection index, F_{(1, 16)} = 3.977, p = 0.063, ηp² = 0.20. In other words, responses were again more efficient on incongruent than on congruent trials in the cluttered set, t_{(16)} = −3.89, p = 0.001, ηp² = 0.43, but no differences between congruent and incongruent target objects were obtained on sparse trials, t_{(16)} = −1.56, p = 0.139, ηp² = 0.13.

Identification

The pattern of the identification scores in Experiment 1B mimicked that of Experiment 1A. Consistent with an identification benefit effect, congruent target objects were better identified than incongruent objects, F_{(1, 16)} = 4.746, p = 0.045, ηp² = 0.21. There was no indication of an effect of trial type, or of an interaction between trial type and congruity, Fs < 1.

Recognition

The memory pattern in Experiment 1B also resembled that of Experiment 1A. Overall d' was 1.35 and overall beta was 2.05. The main effect of trial type was close to significance, F_{(1, 16)} = 4.92, p = 0.05, ηp² = 0.23, with better memory for objects in the sparse trials (0.75) than in the cluttered ones (0.66). Neither a significant effect of congruity nor an interaction between trial type and congruity was observed, both Fs < 1.

Discussion

The aim of Experiments 1A and 1B was to test whether the semantic congruity effects reported in the literature on the detection, identification and delayed recognition of objects could rely on different combinations of semantic facilitation and object competition. To that end, we used a change detection paradigm that reliably produces the expected indexes (i.e., a detection cost, identification benefit, and recognition benefit; LaPointe et al., 2013; Ortiz-Tudela et al., 2016, 2018a), and we compared two stimulus sets which either included the target among many distracter objects or presented the target embedded in a sparse background. Because we reasoned that participants' responses can be affected by the adoption of a specific mindset evoked by surrounding trials, Experiments 1A and 1B also explored the potential effect induced by presenting these two types of contexts either in a random order (Experiment 1A) or grouped into blocks (Experiment 1B).

The results of the two experiments showed that while the identification benefit is present when using both cluttered and sparse stimuli, the detection cost is only found in the presence of stimulus competition. This result suggests that the detection cost arises only when a number of coactive stimuli compete for attentional resources, whereas the benefits found for identification seem to depend on semantic facilitation, which might arise either from the activation of a group of semantically related objects or from the overall meaning of the background scene (Eger et al., 2007; Esterman and Yantis, 2010). The absence of differences in detecting congruent and incongruent targets in the sparse set is consistent with the idea that sparse scene contexts represent an intermediate situation between Stein and Peelen's minimalistic setup (in which better detection followed a category-matching cue) and the cluttered arrangement of LaPointe et al.'s (2013) paradigm (in which a detection cost was obtained).

Lastly, and surprisingly, we were not able to measure a statistically significant recognition benefit in spite of having arranged conditions very similar to those presented in Ortiz-Tudela et al. (2016). This unexpected result may be due to the inclusion of the sparse trials within the list of items to be retrieved at the memory test. Indeed, performance in any memory test is highly dependent not only on the processes taking place at encoding but also on those taking place during consolidation and retrieval, and these can be affected by the amount and nature of the elements to be held in memory. Thus, before jumping to speculative conclusions about the recognition benefit, we decided to further explore and characterize these processes in another experiment.

The purpose of Experiment 2 was, therefore, two-fold: first, to replicate the recognition benefit by measuring it only with the standard cluttered scenes (as used in previous studies); second, to further characterize this memory process by dissociating the recognition benefit from the identification task.

Experiment 2

LaPointe et al. (2013) used the detection cost and the identification benefit to claim that a clear dissociation could be behaviorally established between the detection and identification processes. Ortiz-Tudela et al.'s (2018a) later report of the recognition benefit followed the same direction as the identification benefit. However, the dual-task conditions arranged in this latter study, in which participants were required to detect and then identify the changing object, made it impossible to separate the influence of each of these two tasks on the memory results. Thus, it is possible that the recognition benefit arises as a consequence of the offline elaboration required to respond to the identification question rather than of the mechanisms at play while the scene was being processed.

Therefore, in Experiment 2 we eliminated the identification question altogether to avoid any effects of this post-response task on later recognition. In addition, in order to ensure the detection cost and to improve the chances of measuring the recognition benefit effect, we used only cluttered scenes as in previous reports (LaPointe et al., 2013; Ortiz-Tudela et al., 2016, 2018a; Spaak et al., 2020).