Similarity of wh-Phrases and Acceptability Variation in wh-Islands

Atkinson, Emily; Apple, Aaron; Rawlins, Kyle; Omaki, Akira

doi:10.3389/fpsyg.2015.02048

ORIGINAL RESEARCH article

Front. Psychol., 12 January 2016

Sec. Psychology of Language

Volume 6 - 2015 | https://doi.org/10.3389/fpsyg.2015.02048

This article is part of the Research Topic: Encoding and Navigating Linguistic Representations in Memory

Department of Cognitive Science, The Johns Hopkins University, Baltimore, MD, USA

In wh-questions that form a syntactic dependency between the fronted wh-phrase and its thematic position, acceptability is severely degraded when the dependency crosses another wh-phrase. It is well known that the acceptability degradation in wh-island violation ameliorates in certain contexts, but the source of this variation remains poorly understood. In the syntax literature, an influential theory – Featural Relativized Minimality – has argued that the wh-island effect is modulated exclusively by the distinctness of morpho-syntactic features in the two wh-phrases, but psycholinguistic theories of memory encoding and retrieval mechanisms predict that semantic properties of wh-phrases should also contribute to wh-island amelioration. We report four acceptability judgment experiments that systematically investigate the role of morpho-syntactic and semantic features in wh-island violations. The results indicate that the distribution of wh-island amelioration is best explained by an account that incorporates the distinctness of morpho-syntactic features as well as the semantic denotation of the wh-phrases. We argue that an integration of syntactic theories and perspectives from psycholinguistics can enrich our understanding of acceptability variation in wh-dependencies.

Introduction

Much work in syntax has investigated the acceptability of English sentences that involve multiple wh-phrases, as in (1):

(1) a. Who __ wondered who bought the car?

b. *What did you wonder who bought __ ?

Despite the superficial resemblance of sentences in (1), native speakers of English perceive (1a) as a more acceptable sentence of English than (1b). This example illustrates the so-called wh-island constraint (Chomsky, 1964, 1977; cf. Ross, 1967): the grammar disallows dependency formation between the fronted wh-phrase (e.g., what) and its thematic position when there is another intervening wh-phrase (who). The discovery of this constraint raised a number of empirical and theoretical questions that remain unresolved: what types of representational or derivational constraints underlie the wh-island phenomenon? Are all wh-islands created equal, such that they all produce a similar degree of degradation? If not, what types of linguistic or cognitive factors affect the acceptability variation in wh-island violation?

The present paper aims to shed light on these questions through experimental tests of a recent, influential theory of wh-islands, called Featural Relativized Minimality (henceforth Featural RM; Friedmann et al., 2009; Belletti et al., 2012; Rizzi, 2013; for related proposals, see also Starke, 2001; Boeckx and Jeong, 2003). As the review below illustrates, there are two reasons why this theory deserves ample attention from syntacticians and psycholinguists. First, unlike many syntactic theories that only distinguish grammatical from ungrammatical sentences, Featural RM predicts fine variations in acceptability across different types of wh-islands, in particular, how the acceptability of wh-island violations can ameliorate depending on the similarity of wh-phrases. Second, as noted by Rizzi (2013), Featural RM resembles memory constraints on sentence processing, where the similarity of competing words in the sentence often predicts comprehension difficulties. As such, empirical investigations of wh-island amelioration effects provide a unique opportunity to explore the link between Featural RM and memory constraints in parsing. We report four experiments that explore the empirical predictions of Featural RM, and demonstrate that the theory needs refinement by incorporating aspects of memory encoding and retrieval constraints that guide the real-time computation of syntactic representations.

Featural Relativized Minimality and Similarity Interference in Parsing

The definition of the Featural RM constraint can be summarized as in (2), which is slightly modified from Rizzi (2013) for expository purposes:

(2) In the configuration [… X … Z … Y …], X and Y cannot form a dependency if Z c-commands Y, and Z is the same structural type as X.

The syntactic condition as stated in (2) ensures that a wh-dependency cannot be established when there is a competing intervener [Z in (2)] that is structurally closer to the thematic position (Y) than the fronted wh-phrase (X). In Featural RM, the definition of the structural type that constitutes a violation of RM is stated in terms of morpho-syntactic features of those constituents.

A critical empirical observation that led to the use of morpho-syntactic features in Featural RM is the amelioration of wh-island violations with a D(iscourse)-linked wh-phrase (Pesetsky, 1987). While D-linked wh-phrases have been intuitively characterized as linked to previous discourse in some way, we will primarily use it here as a cover-term for which-phrases that denote a set of individuals. In the syntax literature, it has been reported that extracting the bare wh-phrase what from the wh-island, as in (3a), results in an ungrammatical sentence, but the extraction of the D-linked wh-phrase which problem in (3b) is considered marginally grammatical. This suggests that the wh-island violation in (3b) is somewhat ameliorated, though its acceptability is still degraded compared to the grammatical wh-extraction in (3c).

(3) a. *What do you wonder who solved __?

b. ?Which problem do you wonder who solved __?

c. Which problem do you think that John solved __?

Assuming the acceptability pattern indicated in (3), Rizzi and colleagues proposed that the degree of overlap in morpho-syntactic features of wh-phrases accounts for the acceptability variation (Friedmann et al., 2009; Belletti et al., 2012; Rizzi, 2013). For example, the feature relation between the two wh-phrases can be characterized as identity (3a), inclusion (3b), and disjunction (3c). In (3a), the extracted constituent and the intervener both contain only a [+Q(uestion)] feature, and hence the feature sets are identical. This identity relation results in a severe degradation in acceptability. In (3b), the intervener only contains [+Q], whereas the feature set for the D-linked wh-phrase contains [+Q] as well as [+N(oun)], the latter of which represents the “referential status” of the D-linked wh-phrase (see Cinque, 1990). This configuration is called an inclusion configuration, as the extracted constituent is more richly specified, and its feature set is a superset of that of the intervener. This inclusion relation leads to a less severe degradation in acceptability, and the wh-island effect is ameliorated relative to (3a), but the sentence is not necessarily judged as fully acceptable. Finally, in (3c) the embedded clause contains no [+Q] feature, and hence the feature specifications for the extracted constituent and the (potential) intervener are distinct. This is termed a disjunction configuration, which leads to no violation of Featural RM. These three feature set relations and their well-formedness statuses are summarized in Table 1.

TABLE 1. Taxonomy of feature set and well-formedness in Featural RM.

In summary, a key property of Featural RM is that it is concerned with the similarity of the fronted constituent and intervener in terms of morpho-syntactic features: the overlap of features causes degradation, and amelioration is observed when the extracted constituent has a richer or distinct set of morpho-syntactic features than the intervener.
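The three-way taxonomy can be sketched as a comparison of feature sets. The Python fragment below is only an illustration under stated assumptions: feature bundles are modeled as plain sets of strings, the names Relation and feature_relation are hypothetical, and the potential intervener in (3c) is modeled as carrying none of the relevant features. Configurations that this section does not discuss fall into a catch-all OTHER case.

```python
from enum import Enum


class Relation(Enum):
    IDENTITY = "identity"        # same features on both phrases
    INCLUSION = "inclusion"      # extracted phrase properly contains the intervener's features
    DISJUNCTION = "disjunction"  # no shared features
    OTHER = "other"              # not covered by the three-way taxonomy in the text


def feature_relation(extracted, intervener):
    """Classify the relation between two morpho-syntactic feature sets."""
    extracted, intervener = frozenset(extracted), frozenset(intervener)
    if not extracted & intervener:
        return Relation.DISJUNCTION
    if extracted == intervener:
        return Relation.IDENTITY
    if extracted > intervener:  # proper superset
        return Relation.INCLUSION
    return Relation.OTHER


# Feature bundles assumed from the discussion of (3a)-(3c).
BARE_WH = {"Q"}            # who, what
D_LINKED_WH = {"Q", "N"}   # which problem
NO_RELEVANT = set()        # potential intervener in (3c), modeled as featureless

print(feature_relation(BARE_WH, BARE_WH).value)          # (3a): identity
print(feature_relation(D_LINKED_WH, BARE_WH).value)      # (3b): inclusion
print(feature_relation(D_LINKED_WH, NO_RELEVANT).value)  # (3c): disjunction
```

On this encoding, two D-linked wh-phrases with the features [+Q, +N] also come out as an identity relation, since the comparison looks only at the feature labels and not at the content of the noun.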

The data discussed above concern the acceptability of sentences, but related observations have been made in adult and child sentence processing research on comprehension of filler-gap dependencies. For example, children experience greater comprehension difficulties with object wh-questions like Which dog did the cat bite __ ? than Who did the cat bite __ ?, possibly due to the overlap of the [+N] feature in the fronted wh-phrase which dog and the intervening NP the cat (Friedmann et al., 2009; Belletti et al., 2012; for counter-arguments, see Goodluck, 2010; Bentea and Durrleman, 2014). In adult sentence processing, object relative clauses with two definite Noun Phrases (NPs) like The banker that the barber praised __ pose greater comprehension difficulties than sentences in which the intervening NP is replaced by a pronoun or a name, as in The banker that you/John praised __ (Gordon et al., 2001, 2002, 2004, 2006; Warren and Gibson, 2002, 2005). This adult finding may be compatible with Featural RM if we expand the relevant morpho-syntactic features to include features that distinguish definite NPs from pronouns or names.

An alternative explanation, which has received much support from sentence processing as well as domain-general working memory research, is that these observations reflect constraints on memory encoding and retrieval mechanisms, which are subject to so-called similarity-based interference (Lewis and Vasishth, 2005; for a review, see Van Dyke and Johns, 2012). There are two ways in which similarity-based interference could occur. The first and better-known type of similarity-based interference is retrieval interference. Comprehension of relative clauses or wh-questions requires the parser to retrieve the fronted wh-phrase and relate it to its thematic position. According to these memory accounts, this retrieval mechanism uses a cue-based search process, and activates all NPs that meet (some of) the search cues. The retrieval competition among candidates with similar features results in comprehension difficulties. The second type is called encoding interference. This type of interference is observed when the parser encounters words or phrases that are similar to one another, and the process of encoding and storing them as distinct items in memory is disrupted. The resulting representations that are stored in memory may be less precise or robust, and may require more cognitive resources to retrieve later in the sentence (see Gordon et al., 2002).

This raises questions about whether the variation of acceptability judgments in (3) may also be an instance of similarity-based interference: the identity relation in (3a) causes greater similarity-based interference than the inclusion configuration in (3b), which in turn causes more interference than (3c). In fact, it may even be possible to reduce Featural RM (Table 1) to constraints on working memory. However, as noted by Rizzi (2013), one key difference between Featural RM and memory retrieval accounts is that Featural RM is strictly concerned with the overlap of morpho-syntactic features, whereas similarity-based interference is typically sensitive to a variety of similarities, including semantic features (Van Dyke and McElree, 2006; Hofmeister, 2011; Hofmeister and Vasishth, 2014; Kush et al., 2015). Thus, further investigations of the role of semantic overlap in wh-island amelioration could shed light on the link between Featural RM and similarity-based interference.

The Present Study

The present study uses acceptability judgment experiments to explore the role of morpho-syntactic and semantic features in amelioration of wh-island violations. Specifically, we will explore the acceptability of the inclusion configuration (4a), and how it compares to the acceptability of the D-linked identity configuration (4b).¹

(4) a. Which athlete did she wonder who would recruit __? (Inclusion)

b. Which athlete did she wonder which coach would recruit __? (D-linked identity)

In (4a) the extracted wh-phrase is D-linked and the intervener is a bare wh-phrase, whereas in (4b), both the extracted wh-phrase and the intervener wh-phrase are D-linked. Under Featural RM, the dependency in (4b) should be classified as an identity configuration, since both wh-phrases have features [+Q, +N]. We will refer to this configuration as D-linked identity, to distinguish it from the typical identity configuration [e.g., (3a)] that only includes bare wh-phrases. The dependency in (4a) is an inclusion configuration, since the intervening wh-phrase only has the feature [+Q]. Given these assumptions about the morpho-syntactic features, Featural RM predicts that (4b) should be less acceptable than (4a). On the other hand, both wh-phrases in the D-linked identity configuration (4b) are semantically more specific, as they characterize distinct sets of individuals: a set of athletes and a set of coaches. The wh-phrases in (4a) are less distinct because they do not denote distinct sets: the set of athletes is a proper subset of the set of people denoted by who. Thus, if semantic distinctness plays a role in dependency formation, the D-linked identity configuration (4b) may cause less similarity-based interference and lead to wh-island amelioration, possibly more so than in the inclusion condition (4a).
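The semantic contrast between (4a) and (4b) can be illustrated with a toy model of denotations. The sketch below assumes a tiny hypothetical domain of individuals (the names are invented for illustration) and a hypothetical helper, denotation_relation, that compares the sets two wh-phrases range over. It is not a claim about how similarity is computed during parsing.

```python
def denotation_relation(a, b):
    """Describe how two denotation sets relate to each other."""
    a, b = set(a), set(b)
    if a == b:
        return "identical"
    if a.isdisjoint(b):
        return "disjoint"
    if a < b or b < a:
        return "proper subset"
    return "overlapping"


# Hypothetical toy domain; in this model no coach is also an athlete.
ATHLETES = {"ann", "bo"}
COACHES = {"cy", "di"}
PEOPLE = ATHLETES | COACHES | {"eve"}  # "who" ranges over all people

# (4a) which athlete ... who: the athletes are a proper subset of the people.
print(denotation_relation(ATHLETES, PEOPLE))   # proper subset

# (4b) which athlete ... which coach: two distinct sets of individuals.
print(denotation_relation(ATHLETES, COACHES))  # disjoint
```

Under this toy encoding, the morpho-syntactic features rank (4b) as the more similar pair, while the denotations rank (4a) as the more similar pair, which is the tension that the experiments are designed to test.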

Informal judgment data reported in the syntax literature (Pesetsky, 1987, 2000; Comorovski, 1996; Shields, 2008) suggest that the D-linked configuration in (4b) should be more acceptable than the inclusion configuration in (4a); in fact, Pesetsky originally annotated them as fully grammatical, in contrast to non-D-linked identity examples. This may challenge the predictions of Featural RM, but it may reflect the fact that differences such as (4a) vs. (4b) are extremely subtle, and the reliability of the data in (4) may be in question. Although D-linked wh-phrases are reported to ameliorate wh-island violations, those sentences are still often described as unacceptable or ungrammatical to some degree. In other words, sentences like (4a) differ from non-D-linked identity sentences only in the severity of degradation, which is not guaranteed to be readily distinguishable in informal judgments. While D-linked identity examples are often (but not uniformly) annotated as fully grammatical in the linguistics literature, there is evidence that they have a different status than non-D-linked identity examples (Pesetsky, 2000; Shields, 2008). For example, Pesetsky (2000) demonstrates that they, unlike regular grammatical multiple-wh examples, e.g., (1a), show intervention effects, e.g., *Which book didn’t which person read? Because the contrasts are empirically subtle and complex, we will use acceptability judgment experiments with a 7-point scale that provide a quantitative measure of acceptability variation. Such experiments have proven useful for a variety of syntactic phenomena that involve subtle contrasts in acceptability intuitions (e.g., McDaniel and Cowart, 1999; Featherston, 2005; Alexopoulou and Keller, 2007; Hofmeister and Sag, 2010; Sprouse et al., 2012; Sprouse and Hornstein, 2013).

In fact, several experimental studies have provided preliminary evidence that semantic information may indeed play a role in island amelioration (Alexopoulou and Keller, 2013; Goodall, 2015; see also Fanselow et al., 2011). Alexopoulou and Keller (2013) investigated the acceptability of extraction out of whether-islands (e.g., What does Claire wonder whether we will watch __ at the cinema?) while manipulating the animacy and D-linking status of the wh-phrase (e.g., what, who, which movie, which colleague). Here, it was found that the bare inanimate wh-phrase what was less acceptable than the other three wh-phrase types, which did not differ from each other. This may suggest that inanimate nouns may be easier to extract out of an island, but this result is difficult to relate to the present study for two reasons. First, the animacy effect did not hold for the D-linked wh-phrases, suggesting that this may not be a robust effect. Second, whether-islands are different from the wh-islands in (4), since the intervener (i.e., whether) itself does not relate to another (distant) thematic position. Goodall (2015) found clear evidence that D-linked wh-phrases ameliorate wh-islands that are more similar to those used in the present study. However, his D-linking manipulation compared a bare wh-phrase against a partitive wh-phrase (What / Which of the cars do you wonder who might buy __ ?). We note that, potentially, this partitive wh-phrase may have inflated the amelioration effect for a variety of reasons; for example, it contains richer semantic content, which is known to facilitate retrieval processes in general (Hofmeister, 2011; Hofmeister and Vasishth, 2014). For this reason, our experiments will focus on a D-linking manipulation that does not involve the partitive, in line with the D-linking manipulation that has been used more widely in the syntax literature.

Before presenting the experiments, it is important to clarify the scope of the present paper. The similarity-based interference accounts provide the motivation for the present study, as well as the critical predictions that semantic similarity should also play a role in acceptability variation in wh-islands. However, the offline acceptability judgment data that we report here does not necessarily shed light on whether the observed acceptability variation in wh-islands actually reflects working memory constraints on encoding and retrieval processes during real-time sentence processing. As such, our aim is not to investigate how acceptability variation unfolds during real-time sentence processing, but rather to test whether the ultimate acceptability judgment data is compatible with the predictions of the similarity-based interference accounts.²

Experiment 1

This experiment investigates the acceptability of wh-island violations with D-linked identity and wh-island violations with an inclusion configuration, where only the extracted phrase is D-linked. We test this using a 2 × 2 design with movement from within a wh-island (non-island vs. island) and feature relation (non-identity vs. identity) as factors, as in Table 2. The extraction conditions contain extractions out of wh-islands. The non-extraction counterparts do not contain wh-island violations and, hence, serve as baseline conditions.

TABLE 2. Sample item set from Experiment 1.

Featural RM predicts that the D-linked identity condition should be severely degraded because the sets of features on the two D-linked wh-phrases (which NP, [+Q, +N]) are identical. On the other hand, the inclusion configuration should be less degraded than D-linked identity, because the features on the fronted phrase (which NP, [+Q, +N]) are a superset of the features on the intervener (who, [+Q]).

Method

Participants

Twenty-five self-reported native English speakers were recruited on the internet via Amazon Mechanical Turk, which has proven to be a useful venue in which participants provide reliable acceptability judgment data (Gibson et al., 2011; Sprouse, 2011). They were paid $0.30 for their participation. The data from three additional participants were excluded from the analysis, as they used only the extreme ends of the scale in the pre-test phase (see below). This and the following experiments were approved by the Johns Hopkins University Institutional Review Board, and all participants provided informed consent.

Materials

The stimuli for this experiment consisted of 16 sets of bi-clausal wh-questions (Table 2). These 16 items were counter-balanced across four lists, so that each participant saw only one version of each target item. Forty-eight filler items of comparable length and varying acceptability were randomly interspersed with these target items for a total of 64 items. Based on our informal judgments and acceptability judgment data in the literature, we manipulated the acceptability of filler items to create three groups of fillers: those that are expected to receive high acceptability rating (good fillers), those that are expected to receive low rating (bad fillers), and sentences whose acceptability was expected to fall in between (middle fillers). Fillers consisted of both declaratives and questions, which were included to ensure that the target items were not the only questions in the experiments. Having filler items with varying acceptability serves two purposes. First, this encourages the participants to use a large portion of the scale, which is critical for revealing subtle contrasts. Second, the data from fillers can serve as a baseline measure that can be used to estimate the magnitude of amelioration effects in target sentences. Stimuli from all four experiments, including the fillers, are provided in Supplementary Materials.
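The counterbalancing described above can be sketched as a rotation of conditions over items. This is a minimal illustration, assuming a standard Latin square rotation (the text states only that the items were counterbalanced across four lists, not the exact procedure); the function name build_lists and the condition labels are hypothetical.

```python
from collections import Counter
from itertools import product

ISLANDHOOD = ("non-island", "island")
FEATURE_RELATION = ("non-identity", "identity")
CONDITIONS = list(product(ISLANDHOOD, FEATURE_RELATION))  # the 2 x 2 design


def build_lists(n_items=16):
    """Return one item-to-condition assignment per list (Latin square rotation)."""
    n_lists = len(CONDITIONS)
    return [
        {item: CONDITIONS[(item + list_idx) % n_lists] for item in range(n_items)}
        for list_idx in range(n_lists)
    ]


lists = build_lists()

# Each list contains every item exactly once, with four items per condition.
for assignment in lists:
    assert Counter(assignment.values()) == Counter({c: 4 for c in CONDITIONS})

# Across the four lists, every item appears once in each condition.
for item in range(16):
    assert {assignment[item] for assignment in lists} == set(CONDITIONS)
```

With 16 items and four conditions, each participant sees four target items per condition, and adding the 48 fillers yields the 64 items per list.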

Procedure

All of the acceptability judgment experiments in this paper have the same basic procedure. Participants were instructed to rate sentences on a scale from 1 (bad) to 7 (good). Before beginning the experiment, participants were provided with detailed instructions and examples to illustrate that the task is not about stylistic considerations, prescriptive norms, or the plausibility of the event described. This was followed by additional examples with varying degrees of acceptability to illustrate what type of sentence corresponded to different parts of the scale. None of these example sentences used the same structure as the target sentences shown in (5).

Additionally, the first six experimental trials were identical for all participants and served as a pre-test phase. These six trials consisted of two highly acceptable sentences, two highly unacceptable sentences, and two marginal ones. These sentences were included to encourage participants to use the entire scale. The use of a large range of points on the scale was critical for the present study, because the target comparison involves two unacceptable sentence conditions. The acceptability contrast between such sentences may not be revealed if participants used, for example, only the two extreme ends of the scale and treated the task as a binary judgment task. If participants restricted their judgments to the extreme ends of the scale (i.e., 1 and 7) on these initial items, the data from these participants were excluded from further analyses, as it suggests that the participants are treating the scale as if it is a binary choice, which may skew the acceptability ratings in unexpected ways.³
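The exclusion criterion can be stated compactly in code. The sketch below assumes that a participant is excluded when all six pre-test ratings are 1 or 7; the function names and the participant identifiers are hypothetical.

```python
def treats_scale_as_binary(pretest_ratings, low=1, high=7):
    """True if every pre-test rating is one of the two scale endpoints."""
    if not pretest_ratings:
        raise ValueError("no pre-test ratings supplied")
    return all(rating in (low, high) for rating in pretest_ratings)


def retained_participants(pretest_by_participant):
    """Return the ids of participants whose data enter the analysis."""
    return {
        pid
        for pid, ratings in pretest_by_participant.items()
        if not treats_scale_as_binary(ratings)
    }


# Hypothetical pre-test data: six ratings per participant.
pretest = {
    "p01": [7, 7, 1, 1, 7, 1],  # endpoints only: excluded
    "p02": [7, 6, 1, 2, 4, 3],  # uses intermediate points: retained
}
print(retained_participants(pretest))  # {'p02'}
```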

Data Analysis

All experiments in this paper use the same data analysis procedure. First, the raw judgment ratings, including both targets and fillers, were converted to z-scores within participants (Schütze and Sprouse, 2013). The z-score transformation converts a participant’s scores to units that represent the number of standard deviations a particular rating is from that participant’s mean rating. This procedure corrects for the potential that individual participants treat the scale differently, e.g., using only a subset of the available ratings, because it standardizes all participants’ results to the same scale. We also ran the reported analyses with the raw ratings and the results were unchanged in all experiments, although we will only report data and analyses based on z-scores.
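The within-participant z-score transformation can be sketched as follows. This is an illustration only: it assumes the sample standard deviation (the text does not say whether the sample or the population formula was used), and the function name and data are hypothetical.

```python
from statistics import mean, stdev


def zscore_within_participant(ratings_by_participant):
    """Convert each participant's raw ratings to z-scores using that
    participant's own mean and (sample) standard deviation."""
    z_scores = {}
    for pid, ratings in ratings_by_participant.items():
        m = mean(ratings)
        s = stdev(ratings)  # assumption: sample SD
        if s == 0:
            raise ValueError(f"participant {pid} gave the same rating throughout")
        z_scores[pid] = [(rating - m) / s for rating in ratings]
    return z_scores


# Two hypothetical participants who use the scale differently:
raw = {
    "wide": [1, 4, 7],    # uses the full 1-7 range
    "narrow": [3, 4, 5],  # uses only the middle of the scale
}
z = zscore_within_participant(raw)
print(z["wide"])    # [-1.0, 0.0, 1.0]
print(z["narrow"])  # [-1.0, 0.0, 1.0]
```

The two participants give different raw ratings but identical z-scores, which is the sense in which the transformation puts all participants on the same scale.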

Linear mixed-effects models were used to analyze the data; these models allow the simultaneous inclusion of random participant and random item variables (Baayen et al., 2008). Each model was fit using the maximal random effects structure that converged (Barr et al., 2013). These models were run in the R environment (R Core Development Team, 2015) using the lme4 package (Bates et al., 2015). P-value estimates for the fixed and random effects were calculated using the Satterthwaite approximation in the lmerTest package (Kuznetsova et al., 2015). When the results showed a significant interaction, planned pairwise comparisons were also performed to determine significance between individual conditions. These pairwise comparisons used separate linear mixed-effects models with maximal random effects structure; unlike other statistical analysis methods, mixed-effects models are robust to multiple comparisons.

Results

Figure 1 presents the z-score transformed average ratings for each condition and for each filler type. Good filler sentences were rated as most acceptable (mean z-score = 0.80), while bad fillers were rated as least acceptable (mean z-score = -0.75). Middle fillers received ratings near participants’ mean rating (i.e., near a z-score of 0, mean = -0.21). This pattern of acceptability for the fillers is common across all four experiments.

FIGURE 1. Mean z-score acceptability rating of target questions by wh-phrase combination and islandhood, and mean z-score acceptability rating of filler sentences by filler type. Error bars indicate ± 1 SE.

For the target items, we found that the island conditions were rated as less acceptable than the non-island conditions (island mean z-score = -0.71, non-island mean z-score = -0.05). Within the island conditions, the D-linked identity condition was rated as more acceptable than the inclusion condition (-0.58 vs. -0.84). In the non-island conditions, average z-scored ratings were around zero (means -0.04 and -0.07), suggesting that they were rated close to individual participants’ mean ratings. This likely reflects the fact that sentences with two wh-phrases are generally uncommon and difficult to process out of context.

Table 3 presents the estimated coefficients and standard errors for the linear mixed-effects model with islandhood and feature relation as fixed effects and random intercepts and slopes for participants and items. Significant effects are marked by their beta estimates.

TABLE 3. Fixed effects summary for Experiment 1 with maximal by-participant and by-item random effects.

There is a main effect of islandhood such that wh-island violations are significantly less acceptable than non-island violating questions. There is no main effect of feature relation, but there is a significant interaction of islandhood and feature relation. The estimated coefficient of this interaction indicates that the feature combination had a significant effect in the island conditions, but not in the non-island conditions. This is supported by planned pairwise comparisons: the two non-island conditions are not significantly different from one another (β = -0.02, SE = 0.12, p > 0.1), while the D-linked identity condition is rated as significantly more acceptable than the inclusion condition (β = 0.26, SE = 0.09, p < 0.01).

Discussion

The results indicate that movement out of a wh-island generally results in severe degradation of acceptability. More importantly, this degradation is modulated by the feature relation between the two wh-phrases: the D-linked identity condition shows greater acceptability than the inclusion condition. These results replicate informal acceptability judgments in the literature that D-linking ameliorates wh-island effects, as well as judgment contrasts that D-linked identity leads to greater acceptability than inclusion (Comorovski, 1996; Shields, 2008). However, these results are not easily explained by the current formulation of Featural RM, which predicted that an identity configuration should be more degraded than an inclusion configuration. In fact, our results indicate that the D-linked identity configuration leads to a greater amelioration of the wh-island violation than an inclusion configuration.

We have so far focused only on the D-linked identity configuration. No items in this first experiment involve an identity configuration with bare wh-phrases, even though Rizzi’s (2013) proposal critically relies on an acceptability difference between an identity configuration with bare wh-phrases and an inclusion configuration with a fronted, D-linked wh-phrase. In order to confirm the presence of wh-island amelioration in the inclusion configuration, as predicted by Featural RM, Experiment 2 compares the inclusion condition against a D-linked identity condition as well as a bare identity condition, where both the fronted wh-phrase and the intervener are bare wh-phrases.