Referential Choice: Predictability and Its Limits

Kibrik, Andrej A.; Khudyakova, Mariya V.; Dobrov, Grigory B.; Linnik, Anastasia; Zalmanov, Dmitrij A.

doi:10.3389/fpsyg.2016.01429

ORIGINAL RESEARCH article

Front. Psychol., 23 September 2016

Sec. Psychology of Language

Volume 7 - 2016 | https://doi.org/10.3389/fpsyg.2016.01429

1. Department of Typology and Areal Linguistics, Institute of Linguistics, Russian Academy of Sciences Moscow, Russia
2. Department of Theoretical and Applied Linguistics, Lomonosov Moscow State University Moscow, Russia
3. Neurolinguistics Laboratory, National Research University Higher School of Economics Moscow, Russia
4. Consultant Plus Moscow, Russia
5. Linguistics Department, University of Potsdam Potsdam, Germany

Abstract

We report a study of referential choice in discourse production, understood as the choice between various types of referential devices, such as pronouns and full noun phrases. Our goal is to predict referential choice, and to explore to what extent such prediction is possible. Our approach to referential choice includes a cognitively informed theoretical component, corpus analysis, machine learning methods and experimentation with human participants. Machine learning algorithms make use of 25 factors, including referent’s properties (such as animacy and protagonism), the distance between a referential expression and its antecedent, the antecedent’s syntactic role, and so on. Having found the predictions of our algorithm to coincide with the original almost 90% of the time, we hypothesized that fully accurate prediction is not possible because, in many situations, more than one referential option is available. This hypothesis was supported by an experimental study, in which participants answered questions about either the original text in the corpus, or about a text modified in accordance with the algorithm’s prediction. Proportions of correct answers to these questions, as well as participants’ rating of the questions’ difficulty, suggested that divergences between the algorithm’s prediction and the original referential device in the corpus occur overwhelmingly in situations where the referential choice is not categorical.

Introduction

As we speak or write, we constantly mention various entities, or referents. The process of mentioning referents is conventionally called reference. When the speaker’s/writer’s decision to mention a referent is in place, another discourse phenomenon becomes relevant: referential choice, that is, the process of choosing an appropriate linguistic expression for the referent in question. The question of reference per se, that is, of how and why a speaker/writer decides which referent to mention at a given place in discourse, is outside the scope of this paper (cf. the point of Gatt et al., 2014, p. 903, that referential choice is not directly related to the likelihood with which a referent is mentioned). The focus of this study is the phenomenon of referential choice: we explore what guides a speaker/writer in choosing a linguistic expression when s/he has already made a decision to mention a certain referent.

The approach to referential choice adopted in the present study relies on earlier work by Chafe (1976, 1994), Givón (1983), Fox (1987), Tomlin (1987), Ariel (1990), and Gundel et al. (1993). These and other theoretical approaches assumed some kind of a cognitive characterization of a referent that underlies referential choice, such as givenness, topicality, focusing, accessibility, salience, prominence, etc. In terms of the cognitive model developed by Kibrik (1996, 1999, 2011) referential choice is governed by activation in working memory. In that model reference per se is claimed to be associated with a distinct cognitive phenomenon of attention. Attention and working memory are two related but distinct neurocognitive processes (Cowan, 1995; Awh and Jonides, 2001; Engle and Kane, 2004; Awh et al., 2006; Repovš and Bresjanac, 2006; Shipstead et al., 2015). Accordingly, reference and referential choice, as linguistic manifestations of attention and activation, are related but distinct processes (see Kibrik, 2011, Chap. 10).

As is widely held since Chafe (1976) and Givón (1983), the more given (or salient, accessible) a referent is to the speaker at the moment of reference, the less coding material it requires. In terms of the cognitive model we assume, the main law of referential choice can be formulated as follows:

• If the referent’s activation in the speaker’s working memory is high, use a reduced referential device. If the referent’s activation in the speaker’s working memory is low, use a lexically full referential device.

Thus the basic, coarse-grained referential choice is between reduced (or attenuated) and lexically full referential devices. In the case of English, it is the distinction between pronouns (personal and possessive), on the one hand, and a variety of full noun phrases, on the other. This distinction is the first level of granularity in the domain of referential options, and all scales and hierarchies that relate givenness (or equivalent concepts) to referential forms (Givón, 1983; Ariel, 1990; Gundel et al., 1993) acknowledge this basic distinction, even though they involve greater detail in the taxonomy of referential devices. The second level distinction in the domain of referential options is between proper names and descriptions (Anderson and Hastie, 1974; Ariel, 1990; McCoy and Strube, 1999; Poesio, 2000; Heller et al., 2012). There are also further levels of distinction related to varieties of proper names and especially descriptions. In the present study, we mostly concentrate on the first level distinction between pronouns and full noun phrases, and will look briefly into the second level distinction between proper names and descriptions. Our focus is thus different from most work in the current tradition of referring expression generation (REG or GRE, beginning from Dale, 1992 and reviewed in Krahmer and van Deemter, 2012), primarily addressing various types of descriptions. Interestingly, however, Reiter and Dale (2000) recognize that the choice of the “form of referring expressions” (that is, the choice between pronouns, proper names, and descriptions) is the primary one. Krahmer and van Deemter (2012, p. 204) also suggest that first “the form of a reference is predicted, after which the content and realization are determined”.

This study is based on a corpus of written English, specifically newspaper (Wall Street Journal) texts. The corpus is annotated in accordance with the MoRA (Moscow Reference Annotation) scheme, detailed in Section “Materials and Methods” below. We assume that written media texts are a good testing ground for our approach. Specific aspects of referential processes differ across various discourse modes and types (see e.g., Fox, 1987; Toole, 1996; Strube and Wolters, 2000; Efimova, 2006; Garrod, 2011), but the basic cognitive principles of referential choice must be shared by all users of a given language and apply to various discourse types.

Example (1) (from the WSJ corpus we explore) illustrates the major referential options.

(1)
But beyond this decorative nod to tradition, Ms. Bogart and company head off in a stylistic direction that all but transforms Gorky’s naturalistic drama into something akin to, well, farce. The director’s attempt to Ø force some Brechtian distance between her actors and their characters frequently backfires with performances that are unduly mannered. Not only do the actors stand outside their characters and Ø make it clear they are at odds with them, but they often literally stand on their heads.

Two referents recur a number of times in (1). They are emphasized with two different kinds of underlining: Ms. Bogart and the actors. The first referent is mentioned with a proper name (title plus last name), a description (the director), as well as with a pronoun (her) and a zero (in an infinitival construction). The second referent is mentioned by two different descriptions (company and the actors), pronouns (their, they), and a zero (in a coordinate construction). (In written English, zeroes are not a part of discourse-based referential choice, but they can serve as antecedents; see discussion in Section “Materials and Methods”.)

What factors influence actual referential choices in discourse? In usual face-to-face conversation, an entity sometimes becomes visually available to the interlocutors (via shared attention), and that may be enough for using an exophoric pronoun without any antecedent (see e.g., Cornish, 1999). In written discourse, however, factors affecting referential choice are mostly associated with (i) the referent’s internal properties and (ii) the discourse context. A referent’s internal properties vary from the most inherent, such as animacy, to more fluid ones, such as being or not being the protagonist of the current discourse. The factors of discourse context are diverse and include the following groups:

• those related to a prospective anaphor, such as the ordinal number of the given mention in the given discourse
• those related to the antecedent’s properties, such as its grammatical role (subject, object, etc.)
• those related to discourse structure, such as the distance between the anaphor and the antecedent, measured in the number of clauses or paragraphs.

Referential choice thus belongs to a large family of multi-factorial processes, generally characteristic of language production. Most of the factors employed in our study, such as animacy, grammatical role, or distance to antecedent, have been proposed in prior literature, in particular (Paducheva, 1965; Chafe, 1976; Grimes, 1978; Hinds, 1978; Clancy, 1980; Marslen-Wilson et al., 1982; Givón, 1983; Brennan et al., 1987; Fox, 1987; Tomlin, 1987; Ariel, 1990; Gernsbacher, 1990; Gordon et al., 1993; Dahl and Fraurud, 1996; Kameyama, 1999; Yamamoto, 1999; Strube and Wolters, 2000; Arnold, 2001; Stirling, 2001; Tetreault, 2001; Arnold and Griffin, 2007; Kaiser, 2008; Fukumura and van Gompel, 2011, 2015; Fukumura et al., 2013; Fedorova, 2014; Rohde and Kehler, 2014, i.a.). There is no room here to review this literature in detail, but many of these studies are discussed in Kibrik (2011); see also recent reviews in van Deemter et al. (2012) and Gatt et al. (2014). In some of the above-mentioned studies one of the factors was emphasized, while others were ignored or shaded. We find it important to take as many relevant factors as possible into account, as they actually operate in conjunction.

Within the cognitive model we assume, these factors are interpreted as activation factors, contributing to the cumulative current referent’s activation. This cognitive model of referential choice is depicted in Figure 1 (see further specification of the model in Sections “Discussion: Referential Choice Is Not Always Categorical” and “Experimental Studies of Referential Variation”). Two kinds of activation factors operate in conjunction and determine a referent’s current degree of activation, which in turn predicts referential choice.

FIGURE 1

In Kibrik (1996, 1999) a simple mathematical model was developed, capturing the multiplicity of factors and their relative contributions to referent activation and, therefore, to the ensuing referential choice. In those studies referent’s current activation level was assessed numerically, as a so-called activation score ranging from a minimal to a maximal value. In this paper, in contrast, we present a study based on machine learning techniques, in which we supply activation factors’ values to algorithms and obtain predictions of referential choice as an output. Therefore, the activation component remains hidden within the algorithm, and only mappings of activation factors upon referential options are explicit. In this respect this study is similar to most other studies of referential choice cited above, as well as to the studies based on annotated referential corpora, such as Poesio and Artstein (2008) and Belz et al. (2010). Still we find it important to keep the larger picture in mind and recognize that in the human cognitive system referent’s activation level mediates between the relevant factors and the actual referential choice.
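
The notion of an activation score that mediates between factors and referential choice can be illustrated with a minimal sketch. It assumes, purely for illustration, that factor contributions are additive and that a single threshold separates reduced from full devices. The factor names, weights, score range, and threshold below are hypothetical and are not taken from Kibrik (1996, 1999).

```python
# Hypothetical contributions of factor values to activation (illustrative only).
WEIGHTS = {
    ("animacy", "animate"): 0.2,
    ("antecedent_role", "subject"): 0.4,
    ("rhetorical_distance", 1): 0.7,
    ("rhetorical_distance", 2): 0.4,
    ("paragraph_distance", 1): -0.3,
}
MIN_SCORE, MAX_SCORE = 0.0, 1.5  # assumed range of the activation score
THRESHOLD = 1.0                  # assumed cut-off between low and high activation


def activation_score(factors):
    """Add up the contributions of all factor values and clamp to the range."""
    total = sum(WEIGHTS.get(item, 0.0) for item in factors.items())
    return min(MAX_SCORE, max(MIN_SCORE, total))


def referential_choice(factors):
    """High activation -> reduced device; low activation -> full device."""
    return "pronoun" if activation_score(factors) >= THRESHOLD else "full NP"


close_mention = {"animacy": "animate", "antecedent_role": "subject",
                 "rhetorical_distance": 1}
distant_mention = {"animacy": "animate", "rhetorical_distance": 2,
                   "paragraph_distance": 1}
print(referential_choice(close_mention))
print(referential_choice(distant_mention))
```

In the machine learning setting described in this paper, no such explicit score is computed; the sketch only makes the mediating role of activation concrete.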

We pursue two goals in this paper. The first goal is to predict referential choice as reliably as possible. We explore a corpus of English written discourse and use machine learning techniques to predict referential choice maximally close to the original texts. This part of the study is reported in Section “Corpus-Based Modeling”. In the course of this work it is found that even well-trained algorithms sometimes diverge from the original referential choices in the corpus texts.

That brings us to the second goal of our research: is 100% accurate prediction of referential choice possible in principle? In addressing this question, we consider the possibility that certain instances of divergence between the predicted and original forms may be due to the incomplete categoricity of referential choice. In Section “Experimental Studies of Referential Variation”, we submit the instances of divergence to an experimental assessment by human participants, in order to see whether people accept referential variation in the spots where divergences take place.

The discussion of our findings and concluding remarks follow in Section “General Discussion”.

Corpus-Based Modeling

Related Work

During the last twenty years or so a number of corpus resources for studies of coreference and reference production have appeared, including MUC-6/-7 (Chinchor and Sundheim, 1995; Grishman and Sundheim, 1995; Chinchor and Robinson, 1997), the ASGRE challenge (Gatt and Belz, 2008), the GNOME corpus (Poesio, 2000, 2004), the ARRAU corpus (Poesio and Artstein, 2008), and the GREC-08, -09, -10 challenges (Belz and Kow, 2010; Belz et al., 2008, 2009). Among these, the series of studies conducted for the GREC (Generating Referring Expressions in Context) challenges were somewhat similar in their goals to the present study: they predicted the form of a referring expression (common noun, name/description, pronoun, or “empty” reference) in Wikipedia articles about cities, countries, rivers, and people. One of the successful algorithms, a memory-based learner (Krahmer et al., 2008), was able to predict the correct type of referring expression in 76.5% of the cases. Krahmer et al. (2008) used automatic language processing tools to mark the following parameters for every entity: competition, position in the text, syntactic and semantic category, local context (POS tags), distance to the previous mention in sentences and NPs, main verb of the sentence, and syntactic patterns of three previous mentions. The systems in the 2010 GREC challenge used various sets of factors and machine learning techniques; for example, Greenbacker and McCoy (2009) used such features as competition, parallelism, and recency. The best system’s precision in the prediction task reached 82-84%. Zarrieß and Kuhn (2013) report a similarly high prediction accuracy in their study inspired by the GREC tasks on a corpus of German robbery reports. Crucial differences of the present work from the GREC studies are that, first, all referents are considered, not just the main topic referent of each article, and, second, semantic discourse structure is taken into account. Recent reviews providing detailed accounts of corpus-based studies of reference production can be found in Krahmer and van Deemter (2012) and Gatt et al. (2014).

Early modeling studies by Kibrik (1996, 1999) were mentioned in Section “Introduction”. Grüning and Kibrik (2005) applied the neural networks method of machine learning to the same small dataset as in Kibrik (1999); that study showed that machine learning is in principle appropriate for modeling multi-factorial referential choice and raised the question of creating a much larger and statistically valid corpus designed for referential studies. Several studies of our group addressed a corpus of Wall Street Journal texts, somewhat larger than the one used in the present paper (Kibrik and Krasavina, 2005; Krasavina, 2006) and used the annotation scheme proposed in Krasavina and Chiarcos (2007). More recently we developed the MoRA (Moscow Reference Annotation) scheme and conducted machine learning studies on the corpus data, looking into the basic referential choice (two-way choice between pronouns and full NPs) and the three-way choice between pronouns, proper names, and descriptions (Kibrik et al., 2010; Loukachevitch et al., 2011). Compared to our previous publications, in the present study we have substantially improved the quality of corpus annotation and modified the annotation scheme and the machine learning methods.

A number of studies emphasized the role of discourse structure in referential choice. In his classical work, Givón (1983) introduced the concept of linear distance from an anaphor back to the antecedent, measured in discourse units such as clauses. Other studies (Hobbs, 1985; Fox, 1987; Kibrik, 1996; Kehler, 2002) underlined the contribution of the semantic structure of discourse, including the hierarchical structure. Several models of discourse-semantic relations have been proposed in recent decades (see Hobbs, 1985; Polanyi, 1985; Wolf et al., 2003; Miltsakaki et al., 2004; Joshi et al., 2006, i.a.), one of the best known being Rhetorical Structure Theory (RST) (Mann and Thompson, 1987; Taboada and Mann, 2006). RST represents text as a hierarchical structure, in which each node corresponds to an elementary discourse unit (EDU), roughly equaling a clause. Fox (1987) demonstrated a possible connection between reference and RST-based analysis of discourse, and Kibrik (1996) introduced the measurement of rhetorical distance (RhD) that captures the length of path between an anaphor EDU and the antecedent EDU along the rhetorical graph; see Section “Materials and Methods”. In a neural networks-based study (Grüning and Kibrik, 2005) it was also found that RhD was an important factor. Experimental studies of Fedorova et al. (2010b, 2012) demonstrated that RhD is a relevant factor affecting referent activation in working memory, as well as reference resolution in the course of discourse comprehension.

The WSJ MoRA 2015 corpus employed in this paper (we used the name “RefRhet corpus” for earlier versions in previous publications) is based on a subset of texts of the RST Discourse Treebank, developed by Daniel Marcu and his collaborators (Carlson et al., 2002). This allows us to combine our own annotation (see Section “Materials and Methods”) with the rhetorical annotation produced by Marcu’s team, and to compute RhD on the basis of their annotation. To the best of our knowledge, corpora intended for referential studies and containing discourse semantic structure annotation are few; cf. the German corpus of Stede and Neumann (2014). An English language resource comparable to ours in using discourse semantic structure as a part of referential annotation is the so-called C-3 corpus outlined in Nicolae et al. (2010). As these authors correctly state,

“the most widely known coreference corpora < … > are annotated with relations between entities, not between discourse segments. The most widely known coherence corpora are Discourse GraphBank (Wolf & al., 2003), RST Treebank (Carlson & al., 2002), and Penn Discourse Treebank (Prasad & al., 2008), none of which was annotated with coreference information.” (Nicolae et al., 2010, p. 136).

Nicolae et al.’s (2010) project is similar to ours in that they picked an already existing corpus annotated for discourse semantic relations and added further annotation for the purposes of modeling reference. Unlike us, however, they chose not the RST Discourse Treebank but the Discourse GraphBank of Wolf et al. (2003). The latter corpus is based on a less constrained kind of discourse representation compared to RST; see discussion in Marcu (2003), Wolf et al. (2003), and Wolf and Gibson (2003).

Referential annotation added by Nicolae et al. (2010) includes primarily types of entities (persons, organizations, locations, etc.), referential status (specific, generic, etc.) and referential form (pronoun, proper name, description, etc.). The number of entity types is greater than in our annotation scheme, but in general there are far fewer parameters involved. In particular, it seems that the syntactic role of anaphors and antecedents is not annotated. Generally Nicolae et al. (2010) followed the coreference annotation principles of the ACE (Automatic Content Extraction, 2004) guidelines. They developed their own annotation tool. We are not aware of specific modeling studies based on the C-3 corpus.

A variety of algorithms have been used in computational studies of referential choice. One of the well-known early algorithms is the so-called incremental algorithm that was used by Dale and Reiter (1995) to predict the choice of attributes in descriptions. Modifications of this algorithm include the ones developed by van Deemter (2002) and Siddharthan and Copestake (2004), i.a. In the 2000s, with the development of corpora for referential studies, researchers began to use classical machine learning algorithms and methodology to analyze some features of referential expressions. For example, in Cheng et al. (2001) the classification task was to determine the NP type, and the corpus annotation was used to train a classifier. The authors used the CART (Classification and Regression Trees) classifier and achieved 67 and 75% accuracy on different text sets in a cross-validation procedure. Early corpus- and machine learning-based studies similar to ours in design are Poesio et al. (1999) and Poesio (2000). In the studies related to the GREC challenges (Belz and Varges, 2007; Belz and Kow, 2010), the algorithms had to identify the correct referring expression from a provided set. Participants used various methods and features to perform the task. For example, the 2008 systems included Conditional Random Fields with a set of features encoding the attributes given in the corpus, information about intervening references to other entities, etc. (UMUS system); a set of decision tree classifiers that checked the length of referring expressions and correctness of pronouns (UDEL system); and the XRCE system, which used a great number of features with levels of activation. Other studies applying machine learning specifically to discourse reference include Jordan and Walker (2005), Viethen et al. (2011), and Ferreira et al. (2016). Also, there are a number of studies in which machine learning was used in other language generation tasks, such as prediction of adjective ordering (Malouf, 2000), content selection (Kelly et al., 2009), accent placement (Hirschberg, 1993), sentence planning (Walker et al., 2002), automated generation of multi-sentence texts (Hovy, 1993), as well as other tasks (e.g., Dethlefs and Cuayáhuitl, 2011; Dethlefs, 2014; Stent and Bangalore, 2014).

Materials and Methods

The Corpus

The WSJ MoRA 2015 corpus explored in this study consists of Wall Street Journal articles from the late 1980s, including broadcast news, analytical reviews, cultural reviews, and some other genres. Text length varies from 70 words to about 2000 words, the average length being 375 words. A general quantitative characterization of the WSJ MoRA 2015 corpus appears in Table 1.

Table 1

Feature	Comment	Number in corpus
Texts		64
Paragraphs		511
Sentences		976
Elementary discourse units (EDU)	EDU segmentation of texts is automatically extracted from the RST Discourse Treebank	2928
Words		23952

The WSJ MoRA 2015 corpus: a quantitative characterization.

Referential annotation of the corpus consists of two parts: annotation of referential devices and annotation of candidate activation factors. We consider these two kinds of annotation in turn.

Annotation of referential devices

Referential devices are technically named markables, that is, those referential expressions that can potentially corefer. Coreferential expressions form a referential chain. Non-first members of a referential chain are termed anaphors below. The breakdown of markables by type is shown in Table 2.

Table 2

	Type of markable	Comment	Number in corpus
1.	Reduced referential devices	Sum of #2 to #7	1373
2.	Personal pronouns		495
3.	Possessive pronouns		264
4.	Zeroes		375
5.	Demonstratives		67
6.	Relative pronouns		135
7.	Other		37
8.	Full noun phrases	Sum of #9 and #18 minus #27∗	5042
9.	Descriptions	Sum of #10 to #15	3517
10.	The-descriptions		1241
11.	A-descriptions		420
12.	Bare descriptions		1200
13.	Demonstrative descriptions	E.g. this house	88
14.	Possessive descriptions	E.g. his house, the company’s shares	490
15.	Other		78
	Special subtypes
16.	Attributive descriptions	E.g. the American president; the first American president who was elected…	1458
17.	Numeral descriptions	E.g. the two books	136
18.	Proper names	Sum of #19 to #25∗	1681
19.	First names		21
20.	Last names		229
21.	First plus last names		193
22.	Initials plus last names	E.g. G.W.Bush	1
23.	Non-persons	Names of countries, organizations, units, etc.	915
24.	Acronyms	E.g. GE, the US	277
25.	Other		45
	Special subtype
26.	Titled proper names	E.g. Mr. Bush	162
27.	Mix: description plus proper name	E.g. President Bush	156
	TOTAL		6415

Types and numbers of markables (referential expressions).

∗ Special subtypes in lines 16–17 and 26 cross-cut the mutually exclusive subtypes appearing in lines 10–15 and 19–25, respectively, and therefore are not summed with those in the counts shown in lines 9 and 18.
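
The sums stated in the Comment column of Table 2 can be checked with a few lines of arithmetic. The sketch below only re-adds the counts printed in the table. The reading that mixed expressions (line 27) are subtracted because they would otherwise be counted under both descriptions and proper names is our interpretation of the comment in line 8, not something the table states explicitly.

```python
# Counts copied from Table 2, keyed by the table's own line numbers.
counts = {
    2: 495, 3: 264, 4: 375, 5: 67, 6: 135, 7: 37,    # reduced devices
    10: 1241, 11: 420, 12: 1200, 13: 88, 14: 490, 15: 78,  # descriptions
    19: 21, 20: 229, 21: 193, 22: 1, 23: 915, 24: 277, 25: 45,  # proper names
    27: 156,                                          # description plus proper name
}

reduced = sum(counts[i] for i in range(2, 8))         # line 1
descriptions = sum(counts[i] for i in range(10, 16))  # line 9
proper_names = sum(counts[i] for i in range(19, 26))  # line 18
full_nps = descriptions + proper_names - counts[27]   # line 8
total = reduced + full_nps                            # TOTAL

print(reduced, descriptions, proper_names, full_nps, total)
```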

Note that not every markable in the corpus is actually used for analysis. First, there are 2580 singleton markables that are not linked to any other markable by a coreference relation and are not pertinent to referential choice. (They are nevertheless annotated, as they are taken into account when the values for the factor “distance in markables” are calculated.) In the modeling task we only use those markables that form referential chains. Second, certain types of referential expressions are only considered as antecedents, but not as anaphors in our analysis of referential choice. This concerns the following categories:

‒ indefinite descriptions (introduced by indefinite determiners, such as a(n), some, few, etc.);
‒ bare descriptions;
‒ all types of pronouns other than personal and possessive;
‒ first and second person pronouns;
‒ zero references.

In particular, quite common zero references in English only appear in fixed syntactic contexts, such as coordinate, gerundial, and infinitival constructions; at least this applies to the kind of written English we explore (cf. Scott, 2013). Syntactically induced zeroes should not be treated as a discourse-based referential option on a par with third person pronouns and full NPs. At the same time, zeroes make bona fide antecedents, so they must be annotated as markables in a referential corpus¹. Similar reasoning applies to relative pronouns. In written discourse, nominal demonstratives such as that typically refer to situations rather than entities.

In the corpus, there are 777 referential chains that comprise at least one anaphor meeting the above-listed requirements (i.e., one that is not a bare description, a zero, etc.). Such chains include 3199 markables used in the modeling tasks. Average chain length is 4.1 markables, and the maximum length of a chain is 52 markables.

We thus address the basic referential choice between third person personal/possessive pronouns and full noun phrases. Table 3 shows the numbers of anaphors in the corpus.

Table 3

Anaphor type	Number used for analysis
Third person pronouns (personal or possessive)	585 (26.0%)
Descriptions	856 (38.1%)
Proper names	807 (35.9%)
Total	2248 (100%)

Anaphor types.

Annotation of candidate activation factors

The second part of referential annotation addresses candidate activation factors, that is, parameters that are potentially useful for the prediction of referential choice. The complete list of candidate factors used in this study is shown in Table 4. For each factor, its values included in the study are listed after a colon. Most of the factors’ values are derived from the MoRA scheme annotation, but some are computed automatically.

Table 4

(1) Referent’s factors

• Animacy: animate, inanimate, collective (for such entities as organizations)
• Gender (for animate referents only): masculine, feminine, mixed (for groups of people with various or unspecified gender)
• Person: 1, 2, 3
• Number: singular, plural
• Protagonism: numeric value

(2) Anaphor’s factors

• Ordinal number of referent mention in the referential chain: integer
• Type of phrase: noun phrase, prepositional phrase
• Grammatical role: subject, direct object, indirect object, oblique (with preposition), attribute, ’s-genitive, of-genitive, postpositive specification

(3) Antecedent’s factors

• Type of phrase (values same as in the section “Anaphor’s factors”)
• Grammatical role (values same as in the section “Anaphor’s factors”)
• Referential form:
  ∘ pronoun: personal, possessive, demonstrative, relative, zero
  ∘ description: a-description, the-description, bare description, demonstrative description, possessive description
    ∘ attributive
    ∘ numeral
  ∘ proper name: first, last, first and last, initials and last, non-person, acronym
• Antecedent length, in words: integer

(4) Distances between anaphor and antecedent

• Distance in words: integer
• Distance in all markables: integer
• Number of markables in chain from the anaphor back to the nearest full NP antecedent: integer
• Linear distance in EDUs: integer
• Rhetorical distance (RhD) in elementary discourse units: integer
• Distance in sentences: integer
• Distance in paragraphs: integer

Candidate factors of referential choice.

In Table 4, the factors are listed in four groups. In terms of Figure 1, the group 1 factors roughly correspond to the “Referent’s internal properties” activation factors, while the group 2–4 factors correspond to the “Discourse context” activation factors. For the sake of brevity, the logic of the factors is somewhat simplified in Table 4. In particular, most factors include the value “other” that we omit here. Several of the factors call for clarifying comments.

Protagonism means a referent’s centrality in discourse. Two models of protagonism were used (Linnik and Dobrov, 2011): in the first one, each referent is assigned the ratio of its referential chain length to the maximal length of a referential chain in the text; in the second model, each referent is assigned the ratio of its chain length to the gross number of markables in the text. In both instances, the most frequently mentioned referent is the same, but relative weights of referents may be different.
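
The two protagonism ratios can be sketched as follows. The function name, the referent labels, and the counts are hypothetical. We also assume here that the gross number of markables includes singletons.

```python
def protagonism_scores(chain_lengths, total_markables):
    """Return both protagonism measures for every referent in one text.

    chain_lengths: referent label -> length of its referential chain
    total_markables: gross number of markables in the text (assumed to
        include singleton markables)
    """
    longest = max(chain_lengths.values())
    # Model 1: chain length relative to the longest chain in the text.
    by_longest = {ref: n / longest for ref, n in chain_lengths.items()}
    # Model 2: chain length relative to all markables in the text.
    by_total = {ref: n / total_markables for ref, n in chain_lengths.items()}
    return by_longest, by_total


# Invented chain lengths for a toy text with 40 markables.
chains = {"referent A": 8, "referent B": 4, "referent C": 2}
by_longest, by_total = protagonism_scores(chains, total_markables=40)
print(by_longest)
print(by_total)
```

Both measures rank the referents identically. Under the first, the most frequently mentioned referent always receives the value 1, whereas under the second its value depends on how many other markables the text contains.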

Regarding the “Type of phrase” factor, it is important to explain why we consider prepositional phrases (such as of the president or with her) a particular type of phrase, rather than a combination of a preposition with a referential device (noun phrase). First, referential choice may depend on whether the antecedent or the anaphor is a plain noun phrase, or a noun phrase subordinate to a preposition (that is, constitutes a prepositional phrase); so this information must be retained. Second, consider English ’s- and of-genitives. The former are inflectional word forms and cannot be divided into a referential device and a separate unit, and it is reasonable to treat the two different kinds of genitives in the same way. More generally, in many languages, equivalents of English prepositions would be case endings, and nobody would detach these from referential expressions.

Most of the distance factors are identified for the closest linear antecedent. In contrast, RhD is computed from the anaphor back to the nearest rhetorical antecedent along the hierarchical graph. Figure 2 presents an example of the RST Discourse Treebank annotation and illustrates the difference between the linear and the rhetorical antecedents, and the corresponding distances. Principles of RhD computation were outlined in Kibrik and Krasavina (2005).

FIGURE 2

In all, 25 potentially relevant activation factors are extractable from the annotated WSJ MoRA 2015 corpus; these are independent variables in the computational models discussed below. The parameter anaphor’s referential form is the predicted, or dependent, variable.

Each text of the WSJ MoRA 2015 corpus was annotated by two different annotators, and each pair of annotations was compared with the help of a special script that identified divergences. All problematic points were fixed by an expert annotator. The corpus was subsequently cross-checked with a variety of techniques and corrected by the members of our team.

Figure 3 provides a screenshot from the MMAX2 annotation tool (Müller and Strube, 2006) for the same text excerpt that was used as Example (1) in Section “Introduction”. Here, all expressions that refer to “Ms. Bogart” are highlighted and grouped into one referential chain with lines that mark coreference.

FIGURE 3

A special property of the MoRA scheme is the annotation of groups. A group is a set of markables that, collectively, serve as an antecedent of an anaphor. In Figure 3, two groups are present, marked with curly brackets and with italics: {[Ms. Bogart] and [company]} and {between [[her] actors] and [[their] characters]}. Later on in the text, there is indeed the markable [of the ensemble], the antecedent of which is {[Ms. Bogart] and [company]}.

Computational Modeling

In this study we use the Weka² system (see Hall et al., 2009), which includes many machine learning algorithms, as well as automated means of evaluating them. Several types of algorithms, or classifiers, are used. We consider the wide variety of algorithms used to be an important methodological property of our study, distinguishing it from most other studies in reference production.

First, we use a logical algorithm (decision trees C4.5) as it lends itself to natural interpretation. Second, we use logistic regression because its results often exceed those of logical algorithms in quality. In addition, we use the so-called classifier compositions: bagging (Breiman, 1996) and boosting (Freund and Schapire, 1996). These composition algorithms use, as a source of their parameters, another machine learning algorithm that we will call the base algorithm. Using the base algorithm, composition algorithms construct multiple models and combine their results. As was shown in several experimental studies (for example, Schapire, 2003), composition algorithms or their modifications “performed as well or significantly better than the other methods tested” (Schapire, 2003, p. 162).

In the boosting algorithm the base algorithm undergoes optimization. An adaptation of classifiers is performed, that is, each additional classifier applies to the objects that were not properly classified by the already constructed composition. After each call of the algorithm the distribution of weights is updated. (These are weights corresponding to the importance of the training set objects.) At each iteration the weights of each wrongly classified object increase, so that the new classifier focuses on such objects. Among the boosting algorithms, AdaBoost was used in our modeling with C4.5 as the base algorithm.
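The weight update described above can be sketched as a single reweighting step. The sketch assumes the AdaBoost.M1 formulation of Freund and Schapire (1996); the function name and the toy weights are hypothetical, and no claim is made about the exact settings used in the study.

```python
def adaboost_m1_reweight(weights, correct):
    """One AdaBoost.M1 reweighting step.

    weights: current weights of the training objects (summing to 1).
    correct: booleans, True where the latest classifier was right.
    Returns the renormalized weights and the factor beta.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if error <= 0.0 or error >= 0.5:
        raise ValueError("this step needs 0 < weighted error < 0.5")
    beta = error / (1.0 - error)
    # Shrink the weights of correctly classified objects ...
    updated = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(updated)
    # ... so that after renormalization the misclassified ones weigh more.
    return [w / total for w in updated], beta


# Toy example: four equally weighted objects, the last one misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, beta = adaboost_m1_reweight(weights, correct)
```

After the step the single misclassified object carries half of the total weight, which is what makes the next classifier focus on it.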

Bagging (from “bootstrap aggregating”) algorithms are also algorithms of composition construction. Whereas in boosting each algorithm is trained on one and the same sample with different object weights, bagging randomly selects a subset of the training samples in order to train the base algorithm. So we get a set of algorithms built on different, even though potentially intersecting, training subsamples. A decision on classification is made through a voting procedure in which all the constructed classifiers take part. In the case of bagging the base algorithm was also C4.5.
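A minimal sketch of the bagging procedure follows. It assumes sampling with replacement (the usual bootstrap) and uses a trivial majority-label learner as a hypothetical stand-in for C4.5; all function names and the toy data are illustrative only.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(labels):
    """Return the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def bagging_fit(data, train_base, n_models, seed=0):
    """Train n_models base classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_base(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Let all constructed classifiers vote on the label of x."""
    return majority_vote([model(x) for model in models])


def train_majority(sample):
    """Hypothetical base learner: always predict the sample's majority label."""
    label = majority_vote([y for _, y in sample])
    return lambda x: label


# Toy data: (features, label) pairs; features are ignored by this base learner.
toy = [({"distance": d}, "full NP") for d in range(8)]
toy += [({"distance": 0}, "pronoun")] * 2
models = bagging_fit(toy, train_majority, n_models=5)
prediction = bagging_predict(models, {"distance": 3})
```

Any base learner that takes a training sample and returns a callable classifier can be passed as train_base, which mirrors the way composition algorithms take the base algorithm as a parameter.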

In order to control the quality of classification, the cross-validation procedure was used:

(1)
The training set is divided into ten parts.
(2)
A classifier is trained on nine parts.
(3)
The constructed decision function is tested on the remaining part.

The procedure is repeated for all possible partitions, and the results are subsequently averaged. The criterion for choosing both an optimal set of features and an algorithm is accuracy, that is, the ratio of correctly predicted referential expressions to the total number of referential expressions. As was pointed out above, all the independent variables contained in Table 4 were treated as candidate factors of referential choice and included in our machine learning studies.
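The cross-validation steps and the accuracy criterion can be sketched as follows. The sketch assumes a plain, unshuffled split into ten folds (toolkits such as Weka may additionally shuffle or stratify the data), and it again uses a hypothetical majority-label learner in place of a real classifier.

```python
from collections import Counter


def k_fold_accuracy(data, train, k=10):
    """Average accuracy over k folds; data is a list of (features, label)."""
    folds = [data[i::k] for i in range(k)]  # step 1: divide into k parts
    accuracies = []
    for i, test_fold in enumerate(folds):
        # Step 2: train on the other k - 1 parts.
        training = [item for j, fold in enumerate(folds) if j != i
                    for item in fold]
        model = train(training)
        # Step 3: test on the remaining part.
        hits = sum(1 for x, y in test_fold if model(x) == y)
        accuracies.append(hits / len(test_fold))
    return sum(accuracies) / k


def train_majority(sample):
    """Hypothetical learner: always predict the most frequent training label."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label


# Toy data: 15 full NPs followed by 5 pronouns (features left empty).
toy = [(None, "full NP")] * 15 + [(None, "pronoun")] * 5
mean_accuracy = k_fold_accuracy(toy, train_majority)
```

With the majority-label learner the averaged accuracy equals the share of the most frequent class (0.75 in this toy set), which is exactly how the baseline is defined in the Results section.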

Results

Predicting Basic Referential Choice

The results of modeling the basic choice between reduced and full referential devices are given in Table 5. The baseline is the frequency of the most frequent referential option, that is, the full noun phrase. If an algorithm always predicted the most frequent option, its accuracy would equal that option’s frequency. Table 5 also includes three additional measures assessing the quality of classification: precision, recall, and F1 (the harmonic mean of precision and recall).

Table 5

Algorithm	Accuracy	Full NP: Precision	Full NP: Recall	Full NP: F1	Pronoun: Precision	Pronoun: Recall	Pronoun: F1
Baseline	74.0%	74.0%	100%	85.0%	0%	0%	0%
C4.5 algorithm	88.9%	91.7%	92.0%	91.9%	77.3%	76.7%	77.0%
Logistic regression	88.6%	91.5%	92.6%	92.1%	78.5%	76.0%	77.2%
Bagging	89.4%	91.9%	93.6%	92.7%	81.0%	76.8%	78.9%
Boosting	89.8%	92.2%	93.6%	92.8%	80.9%	77.4%	79.1%

Prediction of the basic referential choice.

The results yielded by all of the algorithms surpass the baseline substantially. At the same time, with the given set of factors all the algorithms demonstrate very close results; in particular, the accuracy rate is in the vicinity of 89–90%. The boosting algorithm fares somewhat better than the others, but its difference from the other algorithms is not statistically significant. (We performed McNemar’s test of statistical significance, in accordance with the method described in Salzberg, 1997.)
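For readers unfamiliar with McNemar’s test, the following sketch shows how two classifiers evaluated on the same items can be compared. The counts of discordant items are hypothetical (the article does not report them), and the continuity-corrected chi-square form used here is one common variant, not necessarily the exact one applied in the study.

```python
def mcnemar_statistic(only_a_correct, only_b_correct):
    """Continuity-corrected McNemar statistic for two classifiers.

    only_a_correct: items classified correctly by A but not by B.
    only_b_correct: items classified correctly by B but not by A.
    """
    b, c = only_a_correct, only_b_correct
    if b + c == 0:
        return 0.0
    return max(abs(b - c) - 1, 0) ** 2 / (b + c)


# Critical value of the chi-square distribution with 1 degree of freedom
# at the 0.05 significance level.
CHI2_CRITICAL_1DF_05 = 3.841

# Hypothetical discordant counts for two classifiers.
statistic = mcnemar_statistic(only_a_correct=30, only_b_correct=21)
significant = statistic > CHI2_CRITICAL_1DF_05
```

Only the items on which the two classifiers disagree enter the statistic; with the hypothetical counts above the difference would not reach significance at the 0.05 level.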

The confusion matrix (i.e., information on the number of divergent predictions made by a classifier) for the boosting algorithm appears in Table 6. The model predicts over 93% of full NPs correctly, but is less effective with respect to pronouns: only 77% are predicted correctly. This difference in performance can be explained by the class imbalance in the task: machine learning algorithms “prefer” to predict the most frequent class (full NP in our case) and thus achieve higher overall accuracy (Longadge et al., 2013). It is hardly possible to avoid class imbalance in a corpus-based study, in which relative frequencies of tokens constitute an inherent part of the data.

Table 6

	Predicted full NP	Predicted third person pronoun	Total
Original full NP	1556 (93.6%)	107 (6.4%)	1663 (100%)
Original pronoun	132 (22.6%)	453 (77.4%)	585 (100%)

Confusion matrix for the boosting algorithm, basic referential choice.
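As an illustration of how the quality measures relate to a confusion matrix, the sketch below recomputes them from the counts in Table 6. The variable and function names are ours. Values obtained from these pooled counts need not coincide to the last digit with Table 5 (accuracy, for instance, comes out near 89.4%); we assume, without confirmation from the source, that this reflects how results were averaged or rounded.

```python
# Counts from Table 6: keys are (original, predicted).
confusion = {
    ("full NP", "full NP"): 1556,
    ("full NP", "pronoun"): 107,
    ("pronoun", "full NP"): 132,
    ("pronoun", "pronoun"): 453,
}
classes = ["full NP", "pronoun"]
total = sum(confusion.values())


def recall(cls):
    """Share of original items of class cls that were predicted as cls."""
    return confusion[(cls, cls)] / sum(confusion[(cls, p)] for p in classes)


def precision(cls):
    """Share of items predicted as cls that were originally cls."""
    return confusion[(cls, cls)] / sum(confusion[(o, cls)] for o in classes)


def f1(cls):
    """Harmonic mean of precision and recall."""
    p, r = precision(cls), recall(cls)
    return 2 * p * r / (p + r)


accuracy = sum(confusion[(c, c)] for c in classes) / total
# Baseline: always predict the most frequent original class.
baseline = max(sum(confusion[(c, p)] for p in classes) for c in classes) / total
```

The recall values correspond to the row percentages in Table 6, and the baseline is simply the share of original full NPs among all items.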

Interpreting Decision Trees

Among the machine learning algorithms, decision trees may be particularly telling in explicitly specifying the concrete role of certain factors. For our corpus, a decision tree was generated that comprised 110 terminal nodes each corresponding to a specific prediction rule. Consider the following branch from the decision tree: if the anaphor is a prepositional phrase and its antecedent lies within the same sentence, then it is most probable that a full noun phrase will be chosen, not a pronoun. Of 100 instances observed, only 8 display pronominalization. A typical example can be seen in (2).

(2)
Israel has launched a new effort to prove the Palestine Liberation Organization continues to Ø practice terrorism, and thus to persuade the U. S. to break off talks with the group.

This finding is quite surprising, given the closeness of the anaphor to the antecedent. The specific explanation of the finding is yet to be determined, but it is clear that the decision tree algorithms provide a source of new cause-effect generalizations about referential choice that would otherwise remain unrevealed.

Factors’ Contribution

What do individual factors contribute to the success of prediction? To evaluate this, we applied the boosting algorithm to different subsets of factors and measured the contribution of individual factors or their combinations. The results are provided in Table 7.

Table 7

Factors	Accuracy (%)
All factors	89.8
— without animacy	89.4
— without protagonism	89.7
— without the anaphor’s grammatical role	88.3
— without the antecedent’s grammatical role	89.2
— without grammatical role	87.7
— without the antecedent’s referential form	89.4
All non-distance factors only	75.5
— plus distance in all markables	82.5
— plus distances in words and paragraphs	87.2
— plus RhD, distance in words, and distance in sentences	88.7
All distance factors only	83.2

The significance of factors in modeling the basic referential choice (boosting with 50 iterations).
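The ablation rows of Table 7 can be read as accuracy drops relative to the full factor set. The short sketch below computes and ranks these drops; the numbers are copied from Table 7, while the variable names are ours.

```python
ALL_FACTORS = 89.8  # accuracy (%) with all factors, from Table 7

# Accuracy (%) after removing one factor or factor group, from Table 7.
without = {
    "animacy": 89.4,
    "protagonism": 89.7,
    "anaphor's grammatical role": 88.3,
    "antecedent's grammatical role": 89.2,
    "grammatical role": 87.7,
    "antecedent's referential form": 89.4,
}

# Drop in percentage points, largest first.
drops = sorted(
    ((round(ALL_FACTORS - acc, 1), name) for name, acc in without.items()),
    reverse=True,
)
largest_drop, largest_name = drops[0]
```

Ranked this way, removing grammatical role as a whole costs the most (2.1 percentage points), while removing protagonism costs the least (0.1).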

We used a number of distance measurements in this study. The data in Table 7 suggest that this group of factors is essential for successful prediction. As the distance factors are highly correlated, using any of them increases accuracy dramatically. Accuracy increases further if two or three distance factors are included. The non-distance factors have a complex impact on accuracy: eliminating them one by one does not impair prediction significantly, but removing all of them results in a significant decrease in accuracy and is therefore inadvisable.

An earlier study of our group (Loukachevitch et al., 2011) specifically looked into the selection of factors and explored the relationships between them. Models based on various subsets of the factors were tested, and it was demonstrated that none of those models surpassed the full set of factors in classification quality. Removal of each individual factor led to some deterioration of prediction. This makes us believe that the full set of factors used in our studies can hardly be reduced without detriment to the quality of prediction.

Modeling the Three-way Referential Choice

The set of candidate activation factors employed in this study is derived from the vast tradition of studies on basic referential choice. We have achieved considerable success in predicting the basic choice. Now, what governs the second-order choice between the types of full noun phrases, that is, proper names and descriptions? Studies of these issues are relatively few (cf. Anderson and Hastie, 1974; Arutjunova, 1977; Seleznev, 1987; Ariel, 1990; Vieira and Poesio, 1999; Enfield and Stivers, 2007; Helmbrecht, 2009; Heller et al., 2012). We have experimentally applied our set of factors to the three-way choice between third person pronouns, proper names, and descriptions. The results can be seen in Table 8. The baseline is the frequency of descriptions, the most frequent referential option.

Table 8

Algorithm	Accuracy (%)
Baseline	38.1
C4.5 Decision tree algorithm	72.3
Logistic regression	73.5
Bagging	73.1
Boosting	75.7

Prediction of the three-way referential choice.

The fairly high accuracy of prediction we have obtained for the three-way task is intriguing. Apparently, the factors responsible for the choice between proper names and descriptions substantially intersect with our basic set of factors. This issue requires further investigation.

Note that in the three-way task boosting again demonstrates the highest results, as it did in the two-way task. Even though the advantage of boosting over the other methods again is not statistically significant, its consistently good performance motivates our decision to employ this method in the subsequent part of this study. (However, if we used another of the algorithms included in our study, the difference would be minimal.)

Discussion: Referential Choice Is Not Always Categorical

Even though the machine learning modeling was quite successful, the accuracy of prediction of the basic referential choice still falls well short of 100%. An important question arises: if we continue improving our annotation (e.g., by extending the set of factors) and tuning the modeling procedure, can referential choice ultimately be predicted with accuracy approaching 100%? In other words, is the 10% difference between the algorithm’s prediction and the original texts due to certain shortcomings of our methods or to some more fundamental causes? We propose that complete accuracy may not be attainable due to the nature of the process of referential choice.

Referential choice appears to not be a fully categorical and deterministic process. True, there are many instances in which only a pronoun or only a full noun phrase is appropriate, but there are also numerous instances in which more than one referential option can be used. This issue was explored in Kibrik (1999, p. 39), and the basic referential choice was represented as a scale comprising five potential situations:

(3)
i. full NP only
ii. full NP, ?pronoun
iii. either full NP or pronoun
iv. pronoun, ?full NP
v. pronoun only.

In (3), situations i and v are fully confident, or categorical, in the sense that language speakers would only use this particular device at the given point in discourse. Situations ii and iv suggest that, in addition to a preferred device, one can marginally use an alternative question-marked device. Finally, situation iii means free variation. In Kibrik (1999) specific referent mentions were attributed to five categories via an experimental procedure. Participants were offered modified versions of the original text, in which referential options were altered – for example, a full noun phrase was replaced by a pronoun or vice versa. Participants were asked to pinpoint infelicitous elements in the text and edit them. As a result of this procedure, some referential devices were assessed as categorical (types i, v). Other referential devices were judged partly (types ii, iv) or fully (type iii) alterable, or non-categorical. (Refer to the original publication for further details.)

From the cognitive perspective, this can be interpreted as a mapping from the continuous referent activation to the binary formal distinction, as shown in Figure 4. That is, the formulation of the main law of referential choice, as offered in Section 1, suggests an overly categorical representation. It only captures correctly the two poles of the activation scale, but there are intermediate grades of activation in between that lead to less than categorical referential choice. The model of referential choice that we propose, as shown in Figure 4, differs from the well-known hierarchies of Givón (1983), Ariel (1990), and Gundel et al. (1993) in two respects. First, it explicitly recognizes a continuous cognitive variable, and second, it only focuses on the highest level distinction between full and reduced referential devices.
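The idea of mapping a continuous activation value onto the five situations in (3) can be illustrated with a toy sketch. Everything numeric here is hypothetical: the 0 to 1 activation range and the equally spaced cut points are placeholders chosen for illustration and are not taken from the study or from Figure 4. The sketch also assumes that higher activation favors the reduced device.

```python
# Hypothetical upper bounds of activation for each situation in (3).
SCALE = [
    (0.2, "i. full NP only"),
    (0.4, "ii. full NP preferred, pronoun marginal"),
    (0.6, "iii. either full NP or pronoun"),
    (0.8, "iv. pronoun preferred, full NP marginal"),
    (1.0, "v. pronoun only"),
]


def situation(activation):
    """Map an activation value in [0, 1] to one of the five situations."""
    if not 0.0 <= activation <= 1.0:
        raise ValueError("activation is assumed to lie between 0 and 1")
    for upper_bound, label in SCALE:
        if activation <= upper_bound:
            return label


def is_categorical(activation):
    """Only the two poles of the scale allow a single device."""
    return situation(activation)[:2] in ("i.", "v.")
```

Under this reading, a divergence between an algorithm and the original text is an error only when the activation falls at one of the two categorical poles; in the intermediate zone both outputs can be acceptable.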

FIGURE 4

The non-categorical and/or probabilistic nature of referential choice has previously been addressed in a number of studies (e.g., Viethen and Dale, 2006a,b; Belz and Varges, 2007; Gundel et al., 2012; Khan et al., 2012; van Deemter et al., 2012; Engonopoulos and Koller, 2014; Ferreira et al., 2016; Hendriks, 2016; Zarrieß, 2016). For example, the well-known scale of Gundel et al. (1993) is implicational in its nature, and that is a way to partly account for the incomplete categoricity of referential choice. Krahmer and van Deemter (2012), noting that the deterministic approach dominates the field, discuss the studies by Di Fabbrizio et al. (2008) and Dale and Viethen (2010) that proposed probabilistic models accounting for individual differences between speakers. van Deemter et al. (2012, p. 18) remark that the probabilistic approach can be extended to a within-individual analysis:

Closer examination of the data of individual participants of almost any study reveals that their responses vary substantially, even within a single experimental condition. For example, we examined the data of Fukumura and van Gompel (2010), who conducted experiments that investigated the choice between a pronoun and a name for referring to a previously mentioned discourse entity. The clear majority (79%) of participants in their two main experiments behaved non-deterministically, that is, they produced more than one type of referring expression (i.e., both a pronoun and a name) in at least one of the conditions.

Overall, there is accumulating evidence suggesting that human referential choice is not fully categorical. There are certain conditions in which more than one referential option is appropriate and, in fact, each one would fare well enough. Under such conditions human language users may act differently on different occasions. If so, an efficient algorithm imitating human behavior may legitimately perform referential choice in different ways, sometimes coinciding with the original text and sometimes diverging from it. Therefore, ideal prediction of referential choice should not be possible in principle.

We have designed an experiment in which we attempt to differentiate between the two kinds of the algorithm’s divergences from the original referential choices. Of course, there may be instances due to plain error. But apart from that, there may be other instances associated with the inherently non-categorical nature of referential choice.