When Couches Have Eyes: The Effect of Visual Context on Children's Reference Processing

Cooper-Cunningham, Rebecca; Charest, Monique; Porretta, Vincent; Järvikivi, Juhani

doi:10.3389/fcomm.2020.576236

ORIGINAL RESEARCH article

Front. Commun., 13 November 2020

Sec. Psychology of Language

Volume 5 - 2020 | https://doi.org/10.3389/fcomm.2020.576236

When Couches Have Eyes: The Effect of Visual Context on Children's Reference Processing

Rebecca Cooper-Cunningham¹

Monique Charest¹

Vincent Porretta²

Juhani Järvikivi³^*

¹Department of Communication Sciences and Disorders, University of Alberta, Edmonton, AB, Canada
²Department of Linguistics and Languages, McMaster University, Hamilton, ON, Canada
³Department of Linguistics, University of Alberta, Edmonton, AB, Canada

We examined the effects of semantic and visual cues to animacy on children's and adults' interpretation of ambiguous pronouns, using the visual world paradigm. Participants listened to sentences with object relative clauses that varied the animacy of potential referents, followed by test sentences beginning with a referentially ambiguous subject pronoun he. Participants viewed images of the referents, with semantically inanimate objects (e.g., a TV) shown with or without added facial features. Results from offline verbal report and online gaze data revealed consistent effects of both semantic animacy and visual context on pronoun resolution in both groups: There was a preference for semantically animate referents as antecedent, but this preference decreased or disappeared when semantically inanimate referents had facial features. The results indicate that the use of animacy as a linguistic cue is flexible and responsive to the visual context. They further suggest that like adults (Nieuwland and Van Berkum, 2006), 4-year-olds can already use fictional, here visual, context to adjust their online and offline language comprehension preferences.

Introduction

Animacy, or whether an entity is alive and volitional in the real world, is a semantic property with well-documented effects on language processing in both adults and children. The semantic animacy of objects, and their representation in various media, however, do not always align. It is commonplace in child-directed books and media for otherwise perfectly inanimate everyday objects, like the famous Brave Little Toaster (Disch, 1980; Kushner et al., 1987), Cars' Lightning McQueen (Anderson and Lasseter, 2006), or Beauty and the Beasts' Mrs. Teapot (Hahn et al., 1991), to have features such as eyes and mouths that signal animate agency, intentions, goals, and even personalities. Adult language comprehension quickly adapts to story contexts with inanimate objects acting as sentient agents (Nieuwland and Van Berkum, 2006), but the extent to which this is true for young children is not known. Of particular interest in the current study is whether and how visual cues to animacy (hereafter visual animacy) interact with semantic animacy to affect children's and adults' comprehension of reference. More specifically, we examine how visual and semantic animacy interact to influence processing of ambiguous pronouns.

In general, language users tend to take animate referents as representing syntactic subjects and semantic agents, especially if they occur in typical subject positions (see also Lowder and Gordon, 2015, for forces of nature as agents). Sentence subjects tend to be animate more often than inanimate (Clark and Begun, 1971). Thus, animate entities make better grammatical subjects, “doers” or “feelers,” than inanimate ones. Within linguistic theory, thematic role relations rely heavily on the concept of animacy: only animate entities can be true agents and experiencers (Jackendoff, 1978), and prototypical agents are higher on the animacy hierarchy than prototypical patients, with properties corresponding to sentience, volition, and intentionality (Dowty, 1991; Kako, 2006). More generally, human referents are taken to be more prominent - and accessible - than animal referents, which, in turn, are taken as more prominent than inanimate referents (e.g., Comrie, 1989). In line with this expectation, corpus studies have shown that animate entities, in addition to being more frequently mentioned than inanimate entities (e.g., Givón, 1983), are much more likely to occur as sentence subjects and inanimate entities as objects (Clark and Begun, 1971; Dahl and Fraurud, 1996).

The match between such animacy-based expectations and the sentence structure (grammatical or thematic role assignment) affects sentence processing ease. ERP studies clearly demonstrate listeners' sensitivity when animacy expectations related to subjecthood or agency are violated (Weckerly and Kutas, 1999; Szewczyk and Schriefers, 2011; Nieuwland et al., 2013). Kuperberg et al. (2003), for example, showed that sentences with animacy violations (e.g., For breakfast the eggs would only eat toast and jam) elicited a strong P600 effect as compared to baseline sentences (e.g., For breakfast the boys would only eat toast and jam), whereas sentences with pragmatic violations (e.g., For breakfast the boys would only bury toast and jam) elicited only a weak (non-significant) N400 effect (see Kuperberg et al., 2007, for further evidence). More relevant to the present study, object relatives like There is the snake/carrot that the bunny found tend to show a processing cost compared to subject relatives (Gordon et al., 2001, 2002; Traxler et al., 2002, 2005). However, object relatives become easier to process when the head noun is inanimate than when it is animate (e.g., Ford, 1983; King and Just, 1991; Trueswell et al., 1994; Mak et al., 2002; Traxler et al., 2002; Clifton et al., 2003; Lowder and Gordon, 2014), with older adults showing this advantage more strongly than younger adults (DeDe, 2015)

Children are sensitive to animacy during language processing from an early age. Although English-speaking children rely strongly on word order to guide their interpretation of who-did-what-to-whom in standard Subject-Verb-Object sentences, animacy influences interpretations when word order is less available as a cue – such as for young children or non-canonical word orders (e.g., Bates et al., 1984; Thal and Flores, 2001). Animacy relationships also influence children's processing of relative clauses. Early studies suggested that processing of object relative clauses was of particular difficulty for children in comparison to subject relative clauses (Corrêa, 1995; Kidd and Bavin, 2002; Arnon, 2005). Indeed, in an act out study, accurate identification of the agent in centrally embedding object relatives like The sheep that the pig pushed eats some grass, was as low as 55% for 5-year-old children, in contrast to identification rates of 83% for centrally embedded subject relatives, The horse that jumped over the fence knocks down the chicken (Corrêa, 1995). Differences in reliability of interpretation between subject and object relative clauses has been observed with sentences containing two animate referents (Dick et al., 2004) and more unusual sentences containing two inanimate referents (e.g., The watch that had hugged the truck behind the kite was bright; The box that the kite had splashed behind the shoe was dry; Montgomery et al., 2016). Comprehension differences between subject and object relatives disappear, however, when the animacy relations of the referents match with expectations for thematic and grammatical role assignment. For example, Kidd et al. (2007) examined children's abilities to imitate presentational-style right-branching relative clauses, which are the most common type of relative clause produced by children (Diessel and Tomasello, 2005). In a sentence imitation task, they found that for both English and German children, processing of object relatives with inanimate heads, e.g., Here is the food that the cat ate in the kitchen today, was as easy as processing of subject relatives with animate heads, e.g., Here is the lady that helped the girl at school today. Thus, effects of animacy on processing are wide ranging and, on the whole, consistent – both adults and children expect animate but not inanimate beings to do, think, feel, experience, and, importantly, serve as grammatical subjects and semantic agents.

However, as we obviously can and do imagine possible worlds where inanimate objects exhibit all these qualities, then what does it mean, from a language comprehension perspective, to be animate in a visually- or discourse-situated context? Nieuwland and Van Berkum (2006) demonstrated that linguistic representations of animacy can be highly flexible and rapidly responsive to the discourse context. They had adult participants listen to five-sentence stories with semantically anomalous characters (Experiment 1). In these stories, normally inanimate characters such as peanuts and yachts were treated as animate, in violation of normal processing assumptions. For example, they outlined a story about a yacht receiving psychotherapy over his fear of water. Using event related potentials, they found that as the stories progressed, the strength of the N400 component, which responds to semantically anomalous language (Kutas and Federmeier, 2011, for a review), decreased (and was completely absent by the third mention), showing adaptation to the appearance of an inanimate agent. In Experiment 2, they created more elaborate stories, for example, a story about a peanut performing a song and a dance about his romantic relationship with an almond. They found that at the end of a story centered around an inanimate agent (a dancing peanut), the N400 was stronger when the character was described with a semantically fitting yet pragmatically incongruent characteristic (salted) than when the character was described with a pragmatically fitting yet semantically incongruent characteristic (in love). That is, listeners were more accepting of a pragmatically appropriate descriptor that fit the discourse context than a semantically appropriate descriptor that would be in accordance with everyday experience. Thus, the context provided by the preceding story overruled default effects of animacy on processing. When debriefing with participants, Nieuwland and Van Berkum (2006) found that while listening to the test stories, the participants visualized the inanimate nouns (e.g., the peanut) in anthropomorphic ways, such as cartoon-like, with faces and limbs.

The findings from Nieuwland and Van Berkum (2006) are consistent with the body of research showing that listeners take immediate advantage of the linguistic and non-linguistic context to shape their interpretations during comprehension (e.g., Trueswell et al., 1993; Spivey-Knowlton and Sedivy, 1995; Ferreira et al., 2013; Coco and Keller, 2015). It has been well-established that adult listeners can make immediate use of the visual context to constrain syntactic interpretation and resolve temporary syntactic (Tanenhaus et al., 1995; Spivey et al., 2002; Chambers et al., 2004; Ferreira et al., 2013), and referential (e.g., Altmann and Kamide, 1999; Sedivy et al., 1999; Knoeferle and Crocker, 2006) ambiguities in sentences. This literature indicates that the visual referential context and visual affordances incrementally inform sentence parsing strategies and referential choices (Chambers et al., 2004; Knoeferle and Crocker, 2007; Knoeferle et al., 2007). The results of Nieuwland and Van Berkum further show that discourse context can rapidly shape comprehenders' discourse representations and override normal effects of world knowledge on language processing (e.g., Hagoort et al., 2004; Hagoort and Van Berkum, 2007). The participants' reported visualizing strategies, moreover, suggest that visual input during comprehension could have effects on processing that may be similar to the kinds of effects observed for linguistic story context.

The present study asks whether visual context in the form of animate facial features influences processing of inanimate nouns in sentential contexts that have animate and inanimate nouns behaving in animate ways, and whether visual context of this kind might be sufficient to eliminate differences in referential processing expectations between animate noun and inanimate noun referents. This possibility is particularly germane for children's language processing. Children are consistently exposed to media portraying inanimate characters. It may seem self-evident to assume that children adapt to such storylines easily. However, when one considers the research surrounding children's more general use of visual context to guide linguistic interpretations, it is not clear how children's language processing adapts to these manipulations of visual features. Despite the evidence that adults use visual context to constrain language processing, children's ability to use visual information for the same purpose is less certain. Studies on the processing of temporarily ambiguous sentences suggests that 5-year-olds might not be sensitive to visual cues to the same extent as adults (Trueswell et al., 1999; Hurewitz et al., 2000; Snedeker and Trueswell, 2004; Weighall, 2008; Kidd et al., 2011). For example, when manipulating objects in the visual environment according to instructions like Feel the frog with the feather, where with the feather can either be analyzed as an instrument or as a modifier of the preceding noun phrase, the frog, children, unlike adults (e.g., Tanenhaus et al., 1995; cf. Ferreira et al., 2013), do not seem to benefit from the visual environment to constrain their syntactic parsing choices. Instead, they are influenced by sentence-internal, linguistic and distributional factors, such as whether the verb in question, (e.g., feel vs. choose vs. tickle), is followed with equal frequency (feel) or greater frequency by a modifier (choose) or instrument (tickle) in everyday language use (Snedeker and Trueswell, 2004; Kidd et al., 2011).

Other studies, however, have shown that children are sensitive to visual contextual information when it serves referential processing rather than parsing (Weighall and Altmann, 2011; Zhang and Knoeferle, 2012; Huang and Snedeker, 2013; Hughes and Allen, 2013; Van Rij et al., 2016). Huang and Snedeker (2013), for example, found that 5-year-olds took into account visual scene information when interpreting the scalar adjectives “big” and “tall.” They had 5-year-olds and adults listen to instructions such as Point to the big (vs. small) coin while they tracked participants' gaze to displays with one or two coins. In this task, both children and adults used the visual scene and looked to the correct item more quickly when the scene contained two possible referents compared to just one – even though children were delayed compared to adults. In relation to pronoun resolution, Van Rij et al. (2016) showed that young children's comprehension of object pronouns was dependent on whether the visual context showed a self- or other directed action. In the same vein, Järvikivi and Pyykkönen-Klauck (2019) showed that 4-year-olds were more likely to choose as the referent of an ambiguous subject pronoun the character that was co-present in the visual context at the time of the pronoun than the one that left the screen before the pronoun onset, whereas co-presence did not change adult participants' pronoun resolution preferences. These studies suggest that young children in particular may be influenced by visual contextual information when it comes to referential processing, and especially when assigning reference to ambiguous pronouns.

As research demonstrates that animacy expectations in particular are contextually flexible and rapidly affected by discourse manipulation, it is of interest to ask whether visual contextual cues to animacy have a similar impact. This is especially relevant with respect to children's language processing: while their ability to recruit grammatical and semantic information for real-time language comprehension is in development, their daily experience is filled with story books and cartoons where inanimate objects play the role of sentient agents. Yet, whether children's use of contextual information in animacy processing is as flexible as adults' has not been previously examined. We simply assume that there is no conflict and that for both children and adults, visual cues to animacy would allow semantically inanimate referents to be processed more similarly to semantically animate ones when these behave in ways that contradict our perceived daily experience and the lexical meaning of the referents, thus aiding comprehension in accordance with the linguistic pragmatic context. Particularly relevant to children's language processing, inspecting the impact of visual vs. linguistic cues to animacy may show abilities to use visual context to guide processing that have been less evident with other aspects of language.

Present Study

In the present study we used the visual world eye tracking paradigm and referent selection to examine how children's and adults' offline and online reference resolution preferences are affected by the animacy of the potential referent nouns, and perhaps more interesting, by a visual context that suggests that an inanimate noun can behave in animate ways. We manipulated the visual animacy context by adding facial features (eyes and mouths) to cartoon images of inanimate objects. The neural basis of face recognition develops within the first 6 months of life, within which time faces also develop into a separate class of perceptual objects (Nelson, 2001). Prior research has established that the addition of eyes to inanimate objects serves as a visual cue to animacy even in infancy, affecting infants' categorization decisions (Welder and Graham, 2006; Anderson et al., 2018). Eye gaze signals speaker's attention to the listener (Langton et al., 2000), interlocutors attend closely to each other's faces (Argyle and Cook, 1976), and the effects of eye gaze are reflexive and present even when eye gaze is manipulated to not be an informative cue (e.g., Friesen and Kingstone, 1998). Eye-gaze has also been shown to affect pronoun resolution preferences when manipulated during (Nappa and Arnold, 2014) and preceding the pronoun (Hawthorne et al., 2016). As previous research suggests that linguistic story context can override normal assumptions about animacy and agentivity, we asked whether the same is the case with visual cues, and whether pronoun resolution in children and adults is affected by these cues to the same degree. A further question of interest was whether these effects appear immediately when visual-contextual information becomes available, or whether processing adapts to visual context more gradually.

Children and adults listened to sentence pairs that consisted of an object relative clause with a semantically animate or inanimate object noun plus a semantically animate or inanimate subject noun, followed by a sentence that began with the ambiguous pronoun he (e.g., There is the snake/couch that the bunny/TV phoned. He was excited to go to the movie that night). While listening, the participants viewed scenes depicting these characters, with semantically inanimate objects (couch, TV) depicted as either having eyes and mouths or not. We tracked their eye gaze to the pictures following the onset of the ambiguous pronoun he. We expected that animate nouns would be selected as the preferred referent for the pronoun he over inanimate nouns. However, we also expected this preference to be mitigated by the presence of facial features on images of the inanimate nouns.

Pronoun resolution provides an ideal context to examine questions about semantic and visual animacy for several reasons. English singular pronouns code explicitly for animacy, as in most cases, he and she will only refer to animate nouns. Pronouns refer to entities active in the listener's discourse representation and available as antecedents (Gernsbacher and Hargreaves, 1988; Gundel et al., 1993; Foraker and McElree, 2007). Their interpretation is inherently context-bound, making use of syntactic, semantic, and discourse information, as well as prior knowledge in building a coherent mental representation of the discourse event (e.g., Gernsbacher, 1989; Gordon et al., 1993; Almor, 1999; Pyykkönen and Järvikivi, 2010; Kehler and Rohde, 2013; Engelen et al., 2014; Van Dijk, 2014; Schumacher et al., 2017). Psycholinguistic studies have shown that pronoun processing preferences are facilitated by properties of the antecedents, in particular syntactic role and position, the subject or first-mentioned entity of the immediately preceding clause/sentence often enjoying a more privileged status than other positions and roles (e.g., Gernsbacher et al., 1989; Gordon et al., 1993; Gundel et al., 1993; Arnold et al., 2000; Järvikivi et al., 2005; Kaiser and Trueswell, 2008). Pronouns are used to refer to human entities more frequently than non-human entities (Dahl and Fraurud, 1996). In line with this, animacy has been shown to modulate the selection of referring expressions when people continue sentences and stories: the likelihood of using a pronoun (vs. a full NP) is higher when continuations refer to animate entities than to inanimate entities (Fukumura and Van Gompel, 2011; Vogels et al., 2013, 2014). Moreover, recent processing studies using the visual world eye tracking paradigm have shown that children as young as 3–5 years of age already show many of these adult-like preferences, albeit often later in the time course (Song and Fisher, 2005, 2007; Arnold et al., 2007; Pyykkönen et al., 2010; Clackson et al., 2011; Järvikivi et al., 2014; Hartshorne et al., 2015), making pronoun resolution an appropriate tool to use. Bittner and Kuehnast (2012) have further shown that 5-year-old German and Bulgarian children resolve pronouns toward the subject more often when the subject is animate (“the monkey is hugging the dog”) rather than inanimate (“the ball is touching the bear”). However, whether children of the same age benefit from the visual context during online pronoun resolution in general, and whether it incrementally modulates their use of animacy to determine subjecthood during processing, is not known.

Some research suggests that pronoun resolution effects may differ within and between sentences. Whether within and between sentence pronoun resolution is taken to rely on the same (Gordon and Hendrick, 1998) or different underlying mechanism (Miltsakaki, 2002), some research suggests that at least information structural effects, such as focusing, might affect anaphor resolution to a lesser extent within than between sentences (e.g., Colonna et al., 2012, 2015; Järvikivi et al., 2014). As animacy, the topic of the present study, has been shown to be a factor affecting referent prominence in 3-year old children's pronoun resolution (Pyykkönen et al., 2010), we chose to use intersentential rather than intrasentential materials. As our main question pertains to whether visual cues to animacy would affect referent prominence and thus pronoun resolution preferences, we wanted to give these cues a fair chance.

In the current study, the context sentences were presentational style right-branching object relative clauses. We chose presentational style relative clauses for the experimental stimuli due to the interpretability of the sentence frame for the age range of the children in this study (Kidd and Bavin, 2002; Diessel and Tomasello, 2005; Kidd et al., 2007). Presentational style refers to sentences where the subject of the matrix clause is a demonstrative pronoun (this), and the matrix verb is a copula. Right-branching relative clauses are those that modify the object of the matrix verb, rather than the subject. Presentational-style right-branching relative clauses are the most common type of relative clause produced by children (Diessel and Tomasello, 2005).

Crucially, object relative clauses separate the grammatical subject and the first-mentioned noun. For example, in the sentence This is the boy that the girl teased at school yesterday, while the girl remains the grammatical subject, the boy is the first-mentioned noun. As discussed above, both first-mention and subjecthood (or agenthood) have been shown to guide pronoun resolution preferences in adults and children (e.g., Frederiksen, 1981; Gernsbacher and Hargreaves, 1988; Crawley et al., 1990; Gordon et al., 1993; Carreiras et al., 1995; Järvikivi et al., 2005, 2014, 2017; Song and Fisher, 2005, 2007; Kaiser and Trueswell, 2008; Fukumura and Van Gompel, 2015; Hartshorne et al., 2015; Van Gompel and Järvikivi, 2016). By choosing a form that puts these two preferences into conflict, we increased the likelihood of uncovering effects of semantic and visual animacy, which may have otherwise been overshadowed by both subject preferences and first-mention preferences pointing toward the same noun.

Methods

This study received research ethics approval from the University of Alberta Research Ethics Board, Project Name “Manipulating the Constraint of Animacy with Visual Context,” No. Pro00049753.

Participants

Two groups of participants – adults and young children – participated in the experiment. Forty-three adult undergraduate university students were recruited through a participant database within the University of Alberta Linguistics Department and received course credit for participation. All adult participants completed the experiment at the Center for Comparative Psycholinguistics. Five of the adult participants were excluded from the analysis due to bilingual language background, and two were excluded due to technical difficulty with maintaining eye tracking. After exclusions, there were 36 participants in the adult group (18–52 years old, M = 21.2; 30 female). All included participants reported being monolingual English speakers with normal or corrected-to-normal vision, and normal hearing. Informed consent in writing was collected from adult participants and parents of child participants prior to participating in the experiment.

Forty-eight children aged 4;0–5;5 participated in the study. We recruited the children through community preschools and daycares, the Child and Adolescent Research Group database at the University of Alberta, and word of mouth. All included participants were monolingual English speakers, had normal corrected-to-normal vision, normal hearing, and typical language development according to parent report. Children reported as having a first language other than English, or who were rated by their parents as presenting any fluency in a second language were excluded. Children who were reported as being exposed to another language but not yet presenting with any fluency in the second language were included. Child participants received a t-shirt for participation.

Typical language development of the child participants was determined primarily by parental report. In addition, each child completed the Recalling Sentences subtest of the Clinical Evaluation of Language Fundamentals – Preschool 2 (CELF-P2) (Wiig et al., 2004) as an additional screen of language abilities. The Recalling Sentences subtest is standardized with a mean of 10 and a standard deviation of 3. No child scored more than 1 SD below the subtest mean (Range: 7–18, M = 12.2). One participant declined to participate in the sentence repetition task. Given parent report of typical language development, this child's data were retained for analysis.

Participants were also excluded from the study if they did not correctly answer at least 14/20 comprehension questions from the filler items (described below), as we could not be certain that they had reliably attended to or understood the task. Four of the child participants were excluded for this reason. In addition, three participants chose to discontinue the experiment before completion, two participants were outside the specified age range, one child was excluded due to technical difficulty maintaining eye tracking, one child was excluded for not looking at the screen during the experiment, and one child was excluded due to bilingual language background. After exclusions, there were 36 participants (22 girls) in the child participant age group, ranging in age from 4;0 to 5;3 years old (M = 4;6).

Materials

Linguistic Stimuli

The linguistic stimuli consisted of sentence pairs. The first sentence, the “context sentence” began with a demonstrative pronoun (here, there, this, that) and contained a presentational style right-branching object relative clause. The second sentence of each pair, the “test sentence” began with the ambiguous pronoun, he and made a general statement that was not specific to either of the referents. An example of a sentence pair is, There is the snake that the bunny phoned at school. He was excited to go to the movie that night. In the context sentence, the bunny is the subject of the relative clause and the snake is the object. The context sentence always ended with a prepositional phrase referring to a location in order to pull eye gaze away from the images of the subject and object before the pronoun in the test sentence was presented. Altogether 20 experimental and 20 filler pairs were created using 40 different transitive verbs (20 for experimental and 20 for filler items). The 20 verbs used in the experimental pairs were taken from Pyykkönen et al. (2010; see Supplementary Material). The subject (S) and object (O) noun of the relative clause were manipulated for semantic animacy (animate animal vs. inanimate object) in four conditions, resulting in four versions of the context sentence for each verb (O Inanimate-S inanimate, O inanimate-S animate, O animate-S inanimate, and O animate-S animate). Table 1 presents one set of four context sentence versions plus the associated test sentence. The full list of experimental stimuli can be found in the Supplementary Material. All subject and object nouns had a minimum frequency of 50 occurrences per million words for the age range of 48–66 months in the ChildFreq database (Bååth, 2010).

TABLE 1

Table 1. Example experimental materials.

The 20 filler items were similar to the experimental items, but did not include relative clauses or pronominal reference in the test sentence. The first sentence was a simple active declarative sentence ending with a mention of a location, similar to the experimental items. All subject and object characters in the filler sentences were animate, either human or animal. The second sentence directly referred to the subject or object of the previous sentence using a full noun phrase, for example, The boy swam with the man at the lake. The man found it very cold. The full list of filler pairs is available in the Supplementary Material.

The stimulus sentences were recorded by an adult female North American English speaker in a sound attenuated booth at a sampling frequency of 44.1 kHz. The speaker maintained a broad focus prosodic contour for each item (used when answering a “what happened?” question) and was coached to speak as if speaking to a preschool aged child. A silent pause of 800 ms between sentences 1 and 2 was programmed into the experiment.

Visual Stimuli

Each sentence pair was accompanied by stylized (cartoon) images of the subject and object characters (e.g., couch/snake and TV/bunny) and the location (e.g., school). We manipulated the visual animacy of the pictures of the semantically inanimate nouns (e.g., couch). This was done by taking the original picture of the inanimate object and adding eyes and a mouth. In one half of the trials, participants were shown the original images (visually inanimate: no face). In the other half of the trials they saw the manipulated image (visually animate: with a face). Figure 1 provides an example of these manipulations. The full set of visual stimuli is available in the Supplementary Material (Experimental Images).

FIGURE 1

Figure 1. Example Visual Manipulations. Example visual manipulations of test item “There is the couch that the TV phoned at school. He was excited to go to the movie that night.” for each condition and sentence type.

The position of the three images on the screen was counterbalanced between items. The images appeared in the top center, bottom left corner, and bottom right corner of the screen, equidistant from the center of the screen. Each image was placed inside an interest area 370 pixels wide by 330 pixels high. An example of the Visual World display where objects have been manipulated to be visually animate is available in Supplementary Figure 1.

Apparatus

All children were tested with an arm-mounted SR Research Eyelink 1000 Plus head-free eye-tracker, sampling at 500 Hz. Monitor size was 1,280 × 1,024 pixels. Participants' right eye was always tracked. The experiment was presented with SR Research Experiment Builder software (Version 1.10.1025). The linguistic stimuli were presented over table-top speakers. Before the experiment began, the eye tracker was calibrated with a nine-point calibration display, and a comfortable listening volume was established for the participant.

Seventeen of the adult participants were tested on the same eye-tracker as the child participants, and 19 of the adult participants were tested on a table-mounted SR Research Eyelink 1000 eye-tracker with a headrest, also sampling at 500 Hz. Screen size was 1,920 × 1,080 pixels.

Design and Procedure

Each participant completed two blocks of 20 trials, with 10 experimental and 10 filler trials per block. Visual animacy of the semantically inanimate nouns was blocked within the experiment and block order was counterbalanced between participants. That is, images with faces added on to the inanimate noun (Face Block) were presented in either Block 1 or Block 2. Images of the semantically animate nouns (e.g., bunny), did not vary between the two blocks. The order of the items was randomized within block, with the exception of the very first item of the experiment, which was always a filler item.

For both experimental and filler trials, stimulus items were assigned to Block 1 or 2 in a counterbalanced manner across participants. For the experimental items, each child was presented with one of the four possible versions created for each of the 20 verbs (i.e., O Inanimate-S inanimate, O inanimate-S animate, O animate-S inanimate, and O animate-S animate). For example, each participant was presented with only one of the four possible combinations of semantic animacy for the verb phoned, as presented in Table 1, with the visual features of the stimuli varied according to whether the trial appeared in the Face or NoFace Block. As there was only one sentence version created for each filler verb, the 20 filler items were the same for all participants.

Each of the four semantic animacy conditions was presented 5 times across the 20 experimental trials in the experiment for each participant. Within each block, 2 of the 4 semantic animacy conditions were presented twice, and 2 of the 4 semantic animacy conditions were presented three times for a total of 10 trials. The semantic animacy conditions that appeared twice in Block 1 appeared 3 times in Block 2, and vice-versa, to ensure 5 presentations of each semantic condition per participant. Six counterbalanced lists were created to account for all possible combinations of these distributions. In this way, we maintained equal counts for each semantic animacy condition across the full sample.

Participants were tested in quiet rooms/areas within their preschools and daycares, or at the Center for Comparative Psycholinguistics at the University of Alberta. Six adults and six children were assigned to each list, and block order was counterbalanced within those participants (i.e., three participants saw the visually inanimate block first, and three participants saw visually animate block first). The child participants were seated in front of a computer screen. The experimenter told the child participants that they were going to hear some silly stories and see pictures on the screen. Each trial started with a cartoon star as a fixation point presented in the middle of the screen. When the child fixated the star, the experimenter started the trial. The visual display was presented for 1,000 ms, after which the corresponding sentence pair was played over speakers. After each trial, the experimenter verbally asked the participants a comprehension question formulated from the second sentence, by simply replacing the pronoun he with who. For example, for the sentence pair There is the couch that the TV phoned at school. He was excited to go to the movie that night, the comprehension question was, Who was excited to go to the movie that night? Filler comprehension questions were designed in the same way. The comprehension questions served two purposes: in the case of the test items, it allowed us to gather offline information on the participants' final interpretation of the pronoun. In the case of the filler items, comprehension questions had specific correct/incorrect responses, which were tallied to guarantee a minimum level of attention and comprehension as mentioned in the exclusion criteria.

Responses were recorded in writing by the experimenter and coded with a key press as referring to the subject or object of the first sentence, or “other.” Reasonable synonyms for the subject and object nouns were accepted as long as the referent was unambiguous, for example, “boot” for “shoe.” Ambiguous responses (e.g., “the guy”), non-ambiguous responses that did not match either the subject or the object, and “I don't know” responses were coded as “Other.” Between block 1 and 2, an optional movement break was offered to participants. After the experiment, the child participants completed the CELF-P2 Recalling Sentences subtest.

The procedure for adult participants was similar to the child participants, with some differences. Adult participants were told they would be completing the same experiment as children, and that they would listen to sentences while looking at pictures on a screen. However, instead of hearing the comprehension questions spoken by the experimenter, adult participants read the comprehension questions on the screen and provided a verbal response.

Results

We first report the data from the offline judgments of pronoun antecedent, followed by the online gaze data. For both data types, we report the effects of semantic and then visual animacy on pronoun resolution for the children first, followed by the adults.

Offline Response Data

Responses coded as “Other” (7.2% of child responses and 0.6% of adult responses) were removed prior to statistical analysis. The remaining responses were coded as referring to either the subject or the object of the first sentence. We analyzed the resulting binomial data (subject vs. object responses) with generalized linear mixed models using lme4 in R (Bates et al., 2015), with the subject preference as the dependent variable. Models included only random effects for subject and item intercepts, due to lack of convergence when by-subject and by-item slopes were included in the models. We used backward elimination, starting with a model that included the main contrasts and interaction terms for sentence type and image condition, as well as the random intercepts for participants and items. The resulting model was tested against the model without the interaction term, and used function anova() to compare the models. We then tested for by-participant and by-item random slopes for sentence type and image condition.

To investigate a potential between-groups effect, we re-ran the offline response analysis on the combined dataset including a three-way interaction between sentence type, image condition, and group. Group was not statistically significant, neither in interaction [X²(7) = 4.4738, p > 0.5], nor as a main effect [X²(1) = 0.570, p > 0.1]. Thus, to reduce model complexity and to investigate effects specific to each group, separate analyses were conducted.

Children

Figure 2A summarizes the distribution of children's verbal responses for each condition. Data are expressed as the raw difference in subject vs. object responses (raw number of object responses subtracted from subject responses). Visual inspection reveals an apparent trend for a subject advantage in the verbal responses of the children that appears to be stronger in the NoFace condition (dark gray bars). The exception to this trend is the animate-inanimate sentence type (e.g., There is the snake that the TV phoned at school). In this sentence type, there appears to be an object advantage when the subject (e.g., the TV) did not have a face on the image, and neither a subject nor an object advantage when the image on the subject contained facial features.

FIGURE 2

Figure 2. (A) Child verbal response data. (B) Adult verbal response data. Bar height indicates the subject advantage in responses (raw number of subject responses minus object responses).

The verbal response data were analyzed with a binomial mixed-effects model including sentence type (four levels: animate-animate, inanimate-inanimate, inanimate-animate, and animate-inanimate) and image condition (two levels: NoFace, Face) as fixed predictors. Model comparison showed that adding the interaction between sentence type and image condition significantly improved model fit (X² = 16.474, p < 0.001). Table 2 presents the summarized results. Positive estimates and z-values reflect a greater number of subject responses, while negative estimates and z-values reflect a greater number of object responses.

TABLE 2

Table 2. Child verbal response results: Mixed-effects modeling of verbal response to comprehension questions indicating subject noun vs. object noun interpretation of ambiguous pronoun.

Semantic animacy

As the positive sign of the intercept shows, for the animate-animate sentences in the NoFace block, children showed a significant preference for the subject as antecedent in their verbal responses. Both the inanimate-animate and animate-inanimate sentence types were significantly different from the animate-animate intercept in the NoFace block. These differences are illustrated in Figure 2A. In comparison to the animate-animate sentences (e.g., There is the snake that the bunny…), the inanimate-animate (No Face) sentences (e.g., There is the couch that the bunny…) generated more responses indicating the subject (the bunny) as antecedent, as indicated by the positive sign of the estimate, whereas the animate-inanimate sentences (e.g., There is the snake that the TV…) generated more responses indicating the object (the snake) as antecedent, as indicated by the negative sign. This suggests that the semantic animacy of the nouns affected the offline resolution of pronouns. Children were more likely to consider animate nouns as pronoun referents in their verbal responses.

Visual animacy

To investigate the interaction between sentence type and image condition, we ran pairwise comparisons between the NoFace and Face conditions for each sentence type, using package emmeans in R (Lenth, 2018). Table 3 presents the results.

TABLE 3

Table 3. Child verbal response results: Pairwise contrasts, comparing No Face and Face conditions for each sentence type.

For both inanimate-animate (e.g., …couch that the bunny…) and animate-inanimate (e.g., …snake that the TV…) sentence types, adding faces significantly affected the offline responses to pronoun resolution. For inanimate-animate sentences, the subject preference seen without the facial features added is very strong, and while maintained, decreases in strength significantly when faces were added, indicating that when the object (the couch) has animate visual characteristics, it is more likely to be considered as a pronoun referent. For animate-inanimate sentences, there is a strong object preference when no faces are added, indicating that the animate noun, that is, the syntactic object (the snake), is considered more likely to refer to the pronoun than the inanimate noun (the TV). When facial features are added, the object preference disappears, and both nouns are equally considered for pronoun reference.

Adults

The verbal responses to the comprehension questions for adults were analyzed in the same manner as the child data. Figure 2B summarizes the distribution of adults' verbal responses for each condition. Visual inspection reveals similar trends in the adult data to that of the children. There remains a trend toward a subject advantage in the verbal responses. However, in contrast to the children, there does not appear a subject or object advantage in the animate-animate sentence type. Similarly to the children, animate-inanimate sentence type suggests an object advantage when the subject (e.g., the TV) did not have a face on the image, and there was neither a subject or object advantage when the image on the subject contained facial features.

The verbal response data were again analyzed with a binomial mixed-model including sentence type and image condition as fixed predictors. Model comparison [function anova()] showed that adding the interaction term significantly improved model fit (X² = 9.933, p = 0.019). Table 4 presents the summarized results.

TABLE 4

Table 4. Adult verbal response results: Mixed-effects modeling of verbal response to comprehension questions indicating subject noun vs. object noun interpretation of ambiguous pronoun.

Semantic animacy

In contrast to the children's verbal responses, the adult participants showed no significant preference for object or subject antecedents in the reference condition, animate-animate NoFace. Similar to the child verbal response data, in the No Face condition, inanimate-animate (e.g., “…couch that the bunny…”) sentences showed significant differences to the intercept, illustrating that there is an effect of semantic animacy on pronoun resolution when measured offline through a verbal response (Table 4). There was an increase in participants indicating the animate subject (the bunny) as the antecedent in comparison to the animate-animate sentence type. Despite the visual trend in Figure 2B suggesting that participants indicated that the animate object (the snake) was the antecedent more often than in the animate-animate sentence type, this difference was not significant.

Visual animacy

The same pairwise comparisons of the verbal responses were completed for the adult group to explore the differences between image condition within the individual sentence types. Table 5 presents the results.

TABLE 5

Table 5. Adult verbal response results: Pairwise contrasts, comparing No Face, and Face conditions for each sentence type.

Again, both inanimate-animate (e.g., “…couch that the bunny…”) and animate-inanimate (e.g., “…snake that the TV…”) sentence types showed significant differences in pronoun resolution with the addition of faces. For inanimate-animate sentences, the subject preference, and for animate-inanimate sentences the object preference, are decreased in strength significantly when faces were added, indicating that when the inanimate noun within a given sentence has animate visual characteristics, it is more likely to be considered as a pronoun referent.

Eye Tracking Data

The sample data were exported using SR Research Data Viewer (v 1.11.900), relative to the onset of the pronoun. Further processing of the data was done using the R package VWPre (Version 0.9.6; Porretta et al., 2016). The data were converted to proportion of samples falling within and outside each of the predefined interest areas in 20 ms windows (10 data points per 20 ms window). These proportions were then converted to empirical logits using 10 samples per bin and a constant of 0.5 (see Barr, 2008). Of interest here is the preference in looking behavior to the subject over the object of the relative clause. Therefore, the difference between looks to subject and looks to object was calculated, which we henceforth refer to as subject advantage. As it takes ~200 ms before eye gaze begins to reflect processing of an acoustic signal (Fischer, 1992; Matin et al., 1993), we used 200–2,000 ms post pronoun onset as the window for the subsequent analysis.

Model Fitting

Visual world eye tracking data are inherently a time series and the effect over time is typically non-linear. Of interest in visual world studies is how the time course is influenced by other variables (e.g., animacy of the objects on screen). We therefore used generalized additive mixed modeling (GAMM) (Hastie and Tibshirani, 1990; Wood, 2006; Baayen et al., 2017; Porretta et al., 2018). GAMM does not assume a linear relationship between continuous predictors (e.g., time) and the response variable (subject preference), and it allows us to model the time course of the effects without the need to aggregate over a series of shorter time windows. Additionally, GAMM also allows for the control of autocorrelation in the time series data (Baayen et al., 2018). Autocorrelation refers to the correlation between data points in a time series; a measurement at time point t is correlated to differing degrees with a measurement at time point t-i, depending on the lag. The presence of autocorrelation can lead to overconfidence of the model estimates. GAMM models were fitted in R using the package mgcv (Wood, 2016).

For both the child data and the adult data separately, a model was fitted for the response variable, subject advantage (see above). The model was fitted to the data using a backward step-wise procedure (see Zuur et al., 2009) in order to evaluate the contribution of each predictor variable. First, we fitted a full model, that is, a model with all the predictors and interactions of interest, including random effects. Second, autocorrelation of the error was estimated from the model. The model was refitted with this parameter in order to adjust the confidence of the estimates. Third, we evaluated the contribution of the individual predictors in the model. For this, two criteria were used: the p-value of the term (indicating whether a given effect is not zero) and Maximum Likelihood (ML) score comparison between model variants (indicating whether the inclusion of the predictor improved the fit of the model). This process was done iteratively until the model contained only predictors that were statistically significant and contributed to the model fit.

For the child and adult models the following input variables were considered: animacy-by-face condition (the combination of animacy and face conditions of the visual scene), time, trial, subject, item, and event (the combination of subject and trial, indexing each unique time-series). The random structure consisted of random intercepts for events, factor smooths for time by subjects, and factor smooths for time by items. Factor smooths allow for the shape of the average time-course to vary by subject and item, as the time-course of the preference may vary for each. Random intercepts for event allow each unique time-course to have its own intercept in the model. For trial, a non-linear functional relation with the response variable was allowed for using a smooth function (Wood, 2006). For the interaction between time and animacy-by-face, a non-linear functional relation was allowed for using a smooth interaction (Wood, 2006). This interaction fits a curve for time for each level of animacy-by-face. In fitting the models (using the procedure mentioned above), only trial was removed for both the child and the adult data.

To investigate a potential between-groups effect, we ran the eye-tracking analysis on the combined dataset including an interaction between Manipulation and Group. Group was not statistically significant, neither in interaction [X²(7.0) = 4.243, p > 0.1], nor as a main effect [X²(1.0) = 0.113, p > 0.5]. Thus, to reduce model complexity and to investigate effects specific to each group, separate analyses were conducted. Below we present first the child data and model, followed by the adult data and model.

Children

Figure 3 displays the subject advantage scores for the eye tracking data, calculated by subtracting the proportion of looks to the object from the proportion of looks to the subject, and presents the gaze data from the Face and NoFace conditions, separately for each sentence type. The x-axis displays from the onset of the pronoun in the test sentence. Positive values indicate more looks to the subject image, while negative values indicate more looks to the object image. The dark gray lines in the figure depict the sentences presented in the No Face block and the light gray lines the sentences presented in the Face block. Thus, for semantically inanimate antecedent objects (e.g., sofa, phone) these lines show the difference in looks between when no eyes or mouth were added (dark gray lines) and with eyes and mouths added (light gray lines). For animate antecedents (snake, bunny), these lines depict the difference between whether they were presented in the No Face block (dark gray lines) or Face block (light gray lines). Note, in the case of the Animate_Animate sentences, with two semantically animate antecedents, the only difference is the block they were presented in.

FIGURE 3

Figure 3. Child proportional eye tracking data by sentence type and image condition. Subject Advantage values represent looks to the subject interest areas minus looks to the object interest areas. Error bars represent standard error. (A) Animate-animate sentence type. (B) Inanimate-inanimate sentence type. (C) Inanimate-animate sentence type. (D) Animate-inanimate sentence type.

Visual inspection of the data from the NoFace condition (dark gray lines) suggests that, in the absence of visual modification, semantic animacy of the referents had a strong effect on pronoun resolution: There is a notable subject preference for inanimate-animate sentences which starts rising immediately after pronoun onset. In contrast, animate-inanimate sentences show a strong preference for object antecedents beginning around 750 ms. Visual inspection further suggests that there was no reliable subject or object preference for either the animate-animate or inanimate-inanimate sentence types, which suggests that there was no systematic preference for either the subject or object noun in animate-animate NoFace sentences (dark gray line is not significantly different from the zero-line representing 50% or chance).

With the addition of facial features to the inanimate images (Face condition, light gray lines) the subject preference for inanimate-animate sentences, and the object preference for animate-inanimate sentences both decrease. That is, the subject advantage becomes closer to zero, and looking patterns appear more similar to the animate-animate and inanimate-inanimate sentences.

The results of the child model are reported in the Supplementary Material. Supplementary Table 1 contains a summary of the model output. Part A of the table reports parametric coefficients related to adjustments to the intercept for each condition. Part B reports smooth terms related to time curves for each condition as well as the random effects.

Importantly, two of the curves for time were statistically different from zero, namely those for Animate_Animate_Face (F = 3.9894, p = 0.003) and Animate_Inanimate_No_Face (F = 5.0748, p < 0.0001). Additionally, as indicated by an EDF > 1 (3.072 and 4.474, respectively), these time curves were estimated to be non-linear. The p-value associated with each of these curves only indicates that the curve is not a zero line. In order to assess possible significant differences between the conditional curves predicted by the model (i.e., smooths for Time), visual inspection is needed, thus, it was necessary to calculate difference curves between the conditions. This was done using the R package itsadug (Van Rij et al., 2015). Below we report these separately for the effect of semantic animacy and visual animacy.

Semantic animacy

Figure 4 presents two difference curves related to the effect of semantic animacy. Each curve represents the difference between two model-predicted time curves and are shown with 99% confidence bands. The vertical lines indicate the statistically significant portion of the curve (i.e., the portion for which zero is not included in the confidence bands). The left panel displays the difference between Inanimate_Animate and Animate_Animate in the No Face condition. As can be seen, there were significantly more looks to the subject after the Inanimate_Animate than the Animate_Animate sentences during the whole analyzed time course until about 1,450 ms from the onset of the pronoun. The right panel displays the difference between Animate_Inanimate and Animate_Animate in the No Face condition. This curve shows that between 950 and 1,750 ms from the pronoun onset, there were significantly more looks to the object, or fewer looks to the subject, in the Animate_Inanimate condition as compared to the Animate_Animate one.

FIGURE 4

Figure 4. Child data sentence type difference curves calculated from model-predicted curves with 99% confidence bands. Vertical lines indicate time points for which zero is not included in the confidence bands. (A) Inanimate-animate minus animate-animate (NoFace conditions). (B) Animate-inanimate minus animate-animate (NoFace conditions).

Visual animacy

Figure 5 presents two difference curves related to the effect of visual animacy. Each curve is presented with 99% confidence bands and vertical lines indicating statistical significance. The left panel displays the difference between Face and No Face in the Inanimate_Animate condition. As can be seen, adding visual features of animacy to pictures accompanying the Inanimate_Animate sentences resulted in significantly fewer subject looks starting at ~500 ms after pronoun onset and lasting until about 1,400 ms after. The right panel displays the difference between Face and No Face in the Animate_Inanimate condition. This curve shows that after about 400 ms from the pronoun onset until at ~1,950 ms after, there were significantly more looks to the subject antecedent for Animate_Inanimate sentences with the added visual features of animacy than without.

FIGURE 5

Figure 5. Child data image condition difference curves calculated from model-predicted curves with 99% confidence bands. Vertical lines indicate time points for which zero is not included in the confidence bands. (A) Inanimate-animate Face minus NoFace conditions. (B) Animate-inanimate Face minus NoFace conditions.

Adults

Similar to the child data, Figure 6 displays the subject advantage scores for the adult eye tracking data and presents the gaze data from the Face and NoFace conditions, separately for each sentence type. Positive values indicate more looks to the subject image, while negative values indicate more looks to the object image. Dark gray lines depict the sentences presented in the No Face block and light gray lines depict the sentences presented in the Face block.

FIGURE 6

Figure 6. Adult proportional eye tracking data by sentence type and image condition. Subject Advantage values represent looks to the subject interest areas minus looks to the object interest areas. Error bars represent standard error. (A) Animate-animate sentence type. (B) Inanimate-inanimate sentence type. (C) Inanimate-animate sentence type. (D) Animate-inanimate sentence type.

Visual inspection of the data from the NoFace condition (dark gray lines) suggests that, similar to the child data, semantic animacy of the referents had a strong effect on pronoun resolution: The subject preference for inanimate-animate sentences starts rising about 250 ms after pronoun onset. As well, the object preference for animate-inanimate sentences begins around 500 ms. Also similar to the child data, visual inspection suggests no reliable subject or object preference for either the animate-animate or inanimate-inanimate sentence types, which suggests that there was no systematic preference for either the subject or object noun in animate-animate NoFace sentences (dark gray line is not significantly different from the zero line representing 50% or chance).

Adding facial features to the inanimate images (Face condition, light gray lines) had the same apparent effect in adult participants as in the child participants. The subject preference for inanimate-animate sentences, and the object preference for animate-inanimate sentences both decrease. The subject advantage becomes closer to zero, and looking patterns appear more similar to the animate-animate and inanimate-inanimate sentences.

The full results of the adult model are reported in the Supplementary Material, Supplementary Table 2 containing a summary of the model output (both parametric coefficients and smooth terms). Here, two of the curves for time were statistically different from zero, namely those for Animate_Inanimate_Face (F = 4.3596, p = 0.037) and Inanimate_Animate_No_Face (F = 7.6981, p = 0.005). Additionally, as indicated by EDF values of approximately one, these time curves were estimated to be linear. Again, the p-value associated with each of these curves only indicates that the curve is not a zero line. In order to assess possible significant differences between curves also accounting for differences in height (i.e., adjustments to the intercept), it was necessary to calculate difference curves between the conditions, as above. Below we report these separately for the effect of semantic animacy and visual animacy.

Semantic animacy

Figure 7 presents two difference curves related to the effect of semantic animacy. Each curve is presented with 99% confidence bands and vertical lines indicating statistical significance. The left panel displays the difference between Inanimate_Animate and Animate_Animate in the No Face condition. As with children, the adult data showed more looks to the subject for Inanimate_Animate than the Animate_Animate sentences. This effect was statistically significant starting at about 750 ms and continuing until 2,000 ms from the pronoun onset. The right panel displays the difference between Animate_Inanimate and Animate_Animate in the No Face condition. This curve shows significantly more object looks in Animate_Inanimate than Animate_Animate sentences. This effect was significant between 800 and 2,000 ms after pronoun onset.

FIGURE 7

Figure 7. Adult data sentence type difference curves calculated from model-predicted curves with 99% confidence bands. Vertical lines indicate time points for which zero is not included in the confidence bands. (A) Inanimate-animate minus animate-animate (NoFace conditions). (B) Animate-inanimate minus animate-animate (NoFace conditions).

Visual animacy

Figure 8 presents two difference curves related to the effect of visual animacy. Each curve is presented with 99% confidence bands and vertical lines indicating statistical significance. The left panel displays the difference between Face and No Face in the Inanimate_Animate condition. As the left panel indicates, adding facial features to pictures accompanying the Inanimate_Animate sentences resulted in significantly increased looks to the object. This effect is visible from 1,200 ms after the pronoun onset and lasting until 2,000 ms after. The right panel displays the difference between Face and No Face in the Animate_Inanimate condition. This curve shows that after about 900 ms from the pronoun onset until 2,000 ms after, there were significantly more looks to the subject antecedent for Animate_Inanimate sentences with the added facial features than without.

FIGURE 8

Figure 8. Adult data image condition difference curves calculated from model-predicted curves with 99% confidence bands. Vertical lines indicate time points for which zero is not included in the confidence bands. (A) Inanimate-animate Face minus NoFace conditions. (B) Animate-inanimate Face minus NoFace conditions.

General Discussion

The purpose of this study was to investigate how children would use semantic animacy and visual indicators of animacy during language comprehension, and to investigate whether similar effects are observed for children and adults. For both children and adults, we expected to see effects of both semantic and visual animacy on the proportion of looks to the subject noun vs. object noun during the presentation of the pronoun in the test sentence. This expectation was met. The results revealed that both semantic and visual animacy guided pronoun interpretation in offline, verbal response data and online, eye gaze data.

The question for the first analysis was: would semantic animacy affect pronoun resolution? It has already been established that semantic animacy affects processing within comprehension of the relative clause itself (e.g., Trueswell et al., 1994; Clifton et al., 2003; Brandt et al., 2009; Lowder and Gordon, 2014). However, it has not been established how semantic animacy affects the online resolution of an animate, ambiguous pronoun, such as he, nor have the effects of using semantically anomalous nouns been established. Other semantic factors such as gender, verb transitivity, and agentivity have been shown to affect pronoun resolution in adults and children (Arnold et al., 2000, 2007; Rose, 2005; Pyykkönen et al., 2010; Schumacher et al., 2017). Like gender, animacy can be a distinguishing feature, as inanimate nouns should be referred to as it, and animate nouns should be referred to as he/she. Moreover, like verb transitivity, animacy can affect semantic prominence, as animate nouns are more likely to be agents, while inanimate nouns are more likely to be patients (Trueswell et al., 1994). Factors that increase semantic prominence increase the saliency of certain nouns within sentences, which increases the likelihood that they will be referred to by pronouns in proceeding sentences. These two factors – providing distinguishing information and increasing saliency - suggest that animacy information should be used during pronoun resolution by both adults and children. In fact, this was found in offline and eye tracking results for both the children and adults.

The reference condition, animate-animate sentences, showed a significant subject preference in the child offline data, but not in the adult offline data. Even though in English, subjecthood and first-mention most often coincide, evidence from languages with less strict word order suggests that these two preferences are at least partly independent (e.g., Järvikivi et al., 2005). As the subject was the second, or most recent, mentioned antecedent in our stimuli, preference for the first-mentioned antecedent conflicted with preference for subject antecedent, predictably resulting in the pattern seen in the adult-data, when both the subject and object were animate. That children still showed a preference for the subject is in line with prior literature showing that the first-mention preference is less stable in children than in adults, at least until about 5 years of age, possibly longer and when it is observed, it is much weaker and appears later in time course than in adults (see Hartshorne et al., 2015, for an overview).

Inanimate-inanimate sentences never differed significantly from the animate-animate sentences, suggesting a “default” approach to pronoun resolution until there is a contrast in animacy between the two nouns. Only then do we see that in general, pronoun reference is directed toward the animate noun.

The effect of semantics established, we then looked at the individual sentence types to compare the images without faces to the images with faces added to determine whether the manipulation of visual animacy modulated pronoun interpretation. For both the child and adult eye tracking data and offline verbal response data, inanimate nouns with faces were more likely to be considered as referents than the same nouns without, for both inanimate-animate and animate-inanimate sentence types.

This study differs from previous research on how visual context can affect language comprehension in children in two important ways. First, many previous studies investigated whether visual context affected children's syntactic parsing strategies (Trueswell et al., 1999; Hurewitz et al., 2000; Snedeker and Trueswell, 2004; Weighall and Altmann, 2011; Zhang and Knoeferle, 2012). In contrast, this study investigated whether visual context affected semantic interpretation of transitive sentences and of the subsequent anaphoric reference to the respective actors in the event. Other studies that focused on semantic interpretation also found effects of visual context on processing (Huang and Snedeker, 2008; Van Rij et al., 2016). While children's use of visual cues during syntactic processing remains under debate, it appears much clearer that semantic processing is affected by visual context. Second, previous studies used linguistic items that had only one correct interpretation to observe possible effects of visual information, whereas this study used ambiguous pronoun resolution, where structurally, both subject and object reference may have been possible. Using ambiguous pronouns allowed us to observe the effects of various cues on language processing without requiring participants to formulate and revise hypotheses about sentence structure.

Semantic information is learned initially through connecting words to our real-world experiences that are accessed through visual and other sensory processes (Gleitman et al., 2005). It would be advantageous then, for semantic interpretations to be flexible to changes in non-linguistic context. Nieuwland and Van Berkum (2006) have suggested that our semantic representations of lexical items are not separate from pragmatic knowledge, how words are used in different situations. They used discourse context, text setting the stage for a fictional situation, to illustrate this. The current study supports and extends their suggestion, by showing that non-linguistic visual information affects how semantic information is used during comprehension. Perhaps the visual information helps constrain the pragmatic context to one of a fictional, cartoon scenario in the same way discourse context did so for the Nieuwland and Van Berkum (2006) study, where a fictional scenario changed what was considered semantically appropriate in linguistic interpretation. The ability for “semantic appropriateness” to change with context suggests that it is not inherent to the language system but tied closely to pragmatic and non-linguistic context. However, semantic animacy remained a strong cue within and of itself in the offline data, as seen by significant differences between the animate-animate and inanimate-animate sentences within the faces block.

In Nieuwland and Van Berkum (2006), the adaptation happened gradually. Our results suggest that when out-of-the-normal markers of animacy are presented visually, at least when these directly mark the actors in the story, participants adapt to it immediately. This is indicated by the absence of an effect of experimental trial or experimental block in the analyses.

However, even though block order did not significantly contribute to the model fit, if we inspect the effects of block order on children and adults, an interesting trend emerges. When participants received the Face block before the No Face block, their processing in the latter clearly reflects this fact. As Supplementary Figure 1 shows, participants' subject preference for animate-animate and animate-inanimate sentences in the No Face block mirrors that in the Face block when the latter precedes the former. However, this is not the case the other way around. This suggests that even when visually animate features on the inanimate objects go away in the second block, both children and adults continue to uphold the animacy assumptions as if they would be the same as in the first block. This suggests that the presence of visual cues in the first block changed how adults and children processed the sentence pairs even when these cues were no longer present. This raises the question of what would happen if linguistic story context were pitted against a visual context. Whether one or the other would take precedence, whether the effects would be the same for adults and children in that case, and whether the effects of adaptation would be the same for these two type of contexts, are interesting questions that deserve to be answered in future research (see Lee et al., 2017).

There are some limitations to this study that deserve a mention. First, both the child and adult participant groups were fairly homogenous in terms of socioeconomic status (SES), and further, most of the parents had a high educational background. SES may affect the type and frequency of language children are exposed to, both through interactions with people and media such as books and television (Hoff, 2003). Inclusion of a measure of exposure to fictional cartoon situations would be of interest for both children and adult participants. Children with limited exposure to cartoon television and literature may show differences in processing to their highly exposed counterparts. Such findings would help distinguish if differences between adults and children are due to the relevance of the visual information to the individuals' lives, or if the differences are due to children's inability to inhibit the irrelevant visual information. Ours was also a fairly homogenous group in terms of age and language background. This leaves open when exactly children start showing adult-like sensitivity to animacy. Thus, further research might shed light at the age at which children start exhibiting adult-like sensitivity for animacy – semantic and visual – as a pronoun resolution cue; and, further, whether and the extent to which general language development (or experience) would predict their performance.

Second, coming back to the issue of inter- vs. intrasentential pronoun resolution, it might be that our use of between-sentence materials facilitated the effects of visual animacy, and a visual context might have less of an effect within sentence. However, these effects in both adults and children were relatively rapid, appearing well-within the time in which effects of subject-/agenthood have been shown in prior experiments with ambiguous pronouns in adults (e.g., Järvikivi et al., 2005; Clackson et al., 2011; Schumacher et al., 2017), and even earlier than what has been usually observed in children (Hartshorne et al., 2015, for an overview), suggesting that pronoun resolution processes were as immediate as could have been expected if the antecedent had been in the same sentence. This suggests that while the overall pattern of preferences could conceivably have been different had we opted for testing pronoun resolution within- rather than between sentences (but see Colonna et al., 2015), there is less reason to believe that the effects of visual animacy would have changed. However, as our aim was to see if visual animacy would have an effect in the first place and when during processing, this is a question for future research to answer. Lastly, a secondary finding to this study, the early online first-mention preference and the offline subject preference in children, and the overall lack of preference in adults for typical, animate-animate sentences, highlights need for continued research on pronoun resolution preferences in both children and adults. Relative clauses offer a unique opportunity in English to separate the two constraints, first-mention and subjecthood, which otherwise generally go hand-in-hand in English. Further research where animacy is not manipulated in surrounding test sentences, and subject relative clauses are included, may aid in exploring how subjecthood and order-of-mention interact in adult how it develops in child pronoun resolution.

Conclusion

This study aimed to shed light on how adults and children use visual and semantic animacy when resolving pronouns following right-branching object relative clauses. We found effects of semantic animacy on pronoun resolution in both children and adults. These effects occurred for sentences with either subject or object animacy variations, over different timeframes in the eye-tracking data. This suggests that already at 4 years of age, children use the semantic and visual cues to animacy similarly to adults to guide their pronoun interpretation. All observed effects in this study suggest that animate nouns, whether semantically or visually so, are preferred over inanimate nouns during pronoun resolution. Thus, when faced with decisions about ambiguous pronouns, both adults and children still use semantic animacy to some degree separately from visual animacy.

The findings of this study suggest that, much like adults, already 4-year-old children are capable of using semantic and visual cues to animacy to aid subsequent pronoun resolution information during language comprehension, and that in both age groups, use of animacy as a linguistic cue is flexible and responsive to the visual context.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics Statement

The studies involving human participants were reviewed and approved by University of Alberta Research Ethics Office. Written informed consent to participate in this study was provided by the participants' legal guardian/next of kin.

Author Contributions

MC, RC-C, and JJ conceptualized and designed the study. RC-C collected the data. MC, RC-C, JJ, and VP analyzed, interpreted the data, wrote, and commented on the manuscript. All authors contributed to the article and approved the submitted version.

Funding

This research was supported by a Social Sciences and Humanities Research Council of Canada (http://www.sshrc-crsh.gc.ca/) Insight Grant (Understanding Children's Processing of Reference in Interaction, 435-2017-0692) to JJ.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

We would like to thank Paulo Arago, Katrina Bartel, Thomas Pasterfield, Jillian Smith, and Megan Wilson for their help with data collection. Preliminary analyses and some of the content of this manuscript has been published as part of the Master's thesis of Cooper (2016).

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomm.2020.576236/full#supplementary-material

References

Almor, A. (1999). Noun-phrase anaphors and focus: the informational load hypothesis. Psychol. Rev. 106, 748–765. doi: 10.1037/0033-295X.106.4.748