False perspectives on human language: Why statistics needs linguistics

A sharp tension exists about the nature of human language between two opposing camps: those who believe that statistical surface distributions, in particular as characterized by measures like surprisal, provide a better understanding of language processing, and those who believe that discrete hierarchical structures implementing linguistic information, such as syntactic structures, are a better tool. In this paper, we show that this dichotomy is a false one. Relying on the fact that statistical measures can be defined on the basis of either structural or non-structural models, we provide empirical evidence that only models of surprisal that reflect syntactic structure are able to account for language regularities.

One-sentence summary: Language processing does not rely only on statistical surface distributions; these must be integrated with syntactic information.

A sharp tension exists about the nature of human language between two opposing camps: those who believe that statistical surface distributions, in particular as characterized by measures like surprisal, provide a better understanding of language processing, and those who believe that discrete recursive hierarchical structures implementing linguistic information are a better tool, more specifically syntactic structures, the core and unique characteristic of human language (7). In this paper, we show that this dichotomy is a false one. Relying on the fact that statistical measures can be defined on the basis of either structural or non-structural models, we provide empirical evidence that only models of surprisal that reflect syntactic structure are able to account for language regularities.

On four different models of surprisal
It is a truism that during language processing the brain computes expectations about what material is likely to arise in a given context. The natural next step from this observation, and one that characterizes much work in psycholinguistics, is to formulate a hypothesis about differences in processing load: in general, the less expected a piece of linguistic material is, the more difficult it is to process (8,13). Expectation can be quantified in terms of the information-theoretic notion of surprisal (2), where the surprisal of a word $w$ in context $c$ is defined as:

$$\mathrm{surprisal}(w \mid c) = -\log P(w \mid c)$$

If a word is highly unlikely in a context, its surprisal will be very high. In contrast, if the word is highly likely, its surprisal will approach 0.
Surprisal serves as a very useful linking hypothesis between patterns of behavior and brain response on the one hand and a single numerical quantity, namely the probability of a form, on the other. And because surprisal does not make explicit reference to linguistic structure, it is often thought to provide an alternative perspective on language processing that avoids the need to posit such structure. This view is incorrect, however. Surprisal depends crucially on a particular characterization of a word's probability. Such a characterization, a probability model, may or may not make reference to linguistic structure. In this section, we describe two dimensions along which language probability models can vary, and then use these dimensions to characterize four distinct probability models. Each of these models can be used as the input to the surprisal equation given above, so that different values of surprisal can result depending on the assumptions behind the probability model (see Fig. 1).
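To make the dependence on the probability model concrete, the following minimal sketch computes surprisal from a conditional probability. The base-2 logarithm (surprisal in bits), the function name `surprisal`, and the two probability values are assumptions chosen for illustration; they are not taken from the models used in this paper.

```python
import math

def surprisal(probability: float) -> float:
    """Surprisal in bits: the negative base-2 log of a conditional probability."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # log2(1/p) equals -log2(p) and avoids returning -0.0 when p == 1.
    return math.log2(1.0 / probability)

# Hypothetical probabilities assigned to the same word in the same context
# by two different probability models.
p_under_model_a = 0.5
p_under_model_b = 0.125

print(surprisal(p_under_model_a))  # 1.0
print(surprisal(p_under_model_b))  # 3.0
print(surprisal(1.0))              # 0.0
```

The same word thus receives different surprisal values depending solely on which probability model supplies its probability.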

Dimension 1: Sequences vs. Hierarchical Structure
Our first dimension concerns the structure that is assumed in the generation of language. The simplest conception views language as a concatenative system. In this view, a sentence is simply a sequence of words generated one after another in a linear fashion. To account for which sentences are well-formed and which are not, constraints are imposed on adjacent elements, or bigrams. For example, in a context where the preceding word is 'the', a linear model of English will permit words like 'cat' or 'magazine' to occur, but not 'of'. To build a probabilistic language model, we can simply assign a probability to a word $w_i$ in a given context defined by the previous word $w_{i-1}$, so that the probabilities for all of the words sum to 1 for each context. Given a sufficiently large corpus, we can estimate these probabilities by taking the ratio of the number of occurrences of the context-word bigram to the number of occurrences of the context:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}\, w_i)}{\mathrm{count}(w_{i-1})}$$

This model can be extended to an n-gram model, where the length of the context is increased to include more material: in an n-gram model, the conditioning context will include n-1 words. A 3-gram model could thus assign a higher probability to 'magazine' than to 'cat' in the context 'read the' while doing the reverse in the context 'fed the'. A bigram model could not assign distinct probabilities in the two contexts, since the single adjacent word, namely 'the', is identical in both. For this reason, an n-gram model gives a more refined assessment of likelihood as the value of n grows. However, because the number of conditioning contexts expands exponentially with the length of the context, it becomes increasingly difficult to accurately estimate the values of the probability model. A variety of methods have been proposed to integrate the information from longer contexts with information in shorter contexts. We use such a composite model for our model of N-gram surprisal.
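The count-ratio estimate and the 'read the' vs. 'fed the' contrast can be sketched as follows. The toy corpus, the function names, and the interpolation weight are hypothetical. Linear interpolation is used here only as one simple example of combining longer and shorter contexts; it is not necessarily the composite model used in this paper.

```python
from collections import Counter

# Hypothetical toy corpus; realistic estimates require a large corpus.
corpus = ("she read the magazine . he fed the cat . "
          "she read the magazine . he fed the cat").split()

def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of words) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def mle_probability(tokens, context, word):
    """Relative-frequency estimate: count(context + word) / count(context)."""
    context = tuple(context)
    joint = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    # With an empty context the denominator is simply the corpus size.
    total = ngram_counts(tokens, len(context))[context] if context else len(tokens)
    return joint / total if total else 0.0

def interpolated_probability(tokens, context, word, weight=0.7):
    """Mix the full-context estimate with the estimate from a context that is
    one word shorter. The weight 0.7 is an arbitrary illustrative value."""
    long_p = mle_probability(tokens, context, word)
    short_p = mle_probability(tokens, context[1:], word)
    return weight * long_p + (1 - weight) * short_p

# The bigram model cannot separate the two nouns after 'the' ...
print(mle_probability(corpus, ["the"], "magazine"))          # 0.5
print(mle_probability(corpus, ["the"], "cat"))               # 0.5
# ... whereas the 3-gram model can.
print(mle_probability(corpus, ["read", "the"], "magazine"))  # 1.0
print(mle_probability(corpus, ["fed", "the"], "cat"))        # 1.0
```

In this toy corpus 'cat' never follows 'read the', so its 3-gram estimate is 0. Interpolating with the bigram estimate assigns it a small non-zero probability instead, which illustrates why shorter contexts are useful as a fallback.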
Chomsky (1957) (4) famously argued that linear models are inadequate models of natural language, as they are incapable of capturing unbounded dependencies. To illustrate, consider the likelihood of the word 'is' or 'are' in the context 'The book/books that I was telling you about last week during our visit to the zoo'. This will depend on whether the word 'book' or 'books' appears in the context. Because the distance between this contextual word and the predicted verb can grow without bound, no specific value of n will yield an n-gram model that can correctly assign probability in such cases.
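The argument can be illustrated with a small sketch. The helper names `ngram_context` and `smallest_distinguishing_n` are hypothetical. The code finds the smallest n for which the conditioning context of an n-gram model differs between the 'book' and 'books' versions of a sentence prefix, and shows that this n grows with the amount of intervening material.

```python
def ngram_context(words, n):
    """Return the last n-1 words, i.e. all that an n-gram model conditions on."""
    if n <= 1:
        return ()
    return tuple(words[max(0, len(words) - (n - 1)):])

def smallest_distinguishing_n(words_a, words_b):
    """Smallest n whose conditioning context differs between the two prefixes,
    or None if the prefixes are identical."""
    for n in range(1, max(len(words_a), len(words_b)) + 2):
        if ngram_context(words_a, n) != ngram_context(words_b, n):
            return n
    return None

short_sg = "the book that I was telling you about".split()
short_pl = "the books that I was telling you about".split()
long_sg = ("the book that I was telling you about last week "
           "during our visit to the zoo").split()
long_pl = ("the books that I was telling you about last week "
           "during our visit to the zoo").split()

print(smallest_distinguishing_n(short_sg, short_pl))  # 8
print(smallest_distinguishing_n(long_sg, long_pl))    # 16
```

Any fixed n can be outrun by adding more words between the noun and the verb, at which point the model sees identical contexts for 'book' and 'books'.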
Chomsky's suggested alternative generates language using a hierarchically organized process. In this way, linearly distant elements can be structurally close. One simple model for this involves context-free grammars (CFGs), sets of rules that specify how a unit in a sentence tree can be expanded, for example:

S → NP VP
NP → Det N
VP → V NP

where S is the sentence, NP is a noun phrase, VP is a verb phrase, Det is a determiner, N is a noun and V is a verb.
Generating a sentence with such a grammar starts at the start symbol S. A rule whose left-hand side matches this symbol is then selected to expand the symbol. Each element of this expansion is in turn expanded with an appropriately matching rule, until the only remaining unexpanded symbols are words. The result of this CFG derivation is a tree-structured object T, whose periphery consists of the words of the sentence that is generated, called the yield of T. A CFG can be used as the basis of a probability model by assigning probability distributions over the possible expansions of each symbol (i.e., a value between 0 and 1 is assigned to each rule, with the values for the rules that share the same left-hand side summing to 1). In such a probabilistic CFG (PCFG), derivations proceed as with CFGs, but the choice of expansions is determined by the probabilities. In PCFGs, the probability of a tree structure is the product of the probabilities of each of the expansions. Because a sequence of words S might be generated by different trees, the probability of S is the sum of the probabilities of all of the trees T with yield S. Hale (2001) (9) shows how to use PCFGs to calculate the surprisal for a word given a context: the conditional probability of the word is the summed probability of all trees whose yield begins with the context-word sequence (i.e., the prefix probability for context-word) divided by the summed probability of all trees whose yield begins with the context (i.e., the prefix probability for context), and the surprisal is the negative logarithm of this ratio.
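The prefix-probability computation can be sketched on a toy grammar. Everything specific here is an assumption made for illustration: the rules, their probabilities, the vocabulary, and the function names. The brute-force enumeration of all trees is a stand-in that works only because the toy grammar is not recursive; practical systems rely on incremental parsing algorithms instead.

```python
import math
from itertools import product

# Hypothetical toy PCFG. Each left-hand side maps to (right-hand side, probability)
# pairs whose probabilities sum to 1. Symbols without an entry are words.
PCFG = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("Det", "N"), 1.0)],
    "VP": [(("V", "NP"), 0.5), (("V",), 0.5)],
    "Det": [(("the",), 1.0)],
    "N": [(("cat",), 0.5), (("magazine",), 0.5)],
    "V": [(("sleeps",), 0.5), (("reads",), 0.5)],
}

def derivations(symbol):
    """Yield (words, probability) for every tree rooted in `symbol`.
    Terminates only because this toy grammar is not recursive."""
    if symbol not in PCFG:  # a word yields itself with probability 1
        yield (symbol,), 1.0
        return
    for rhs, rule_probability in PCFG[symbol]:
        # Combine every choice of subtree for each right-hand-side symbol.
        for parts in product(*(list(derivations(s)) for s in rhs)):
            words = tuple(w for part_words, _ in parts for w in part_words)
            probability = rule_probability
            for _, part_probability in parts:
                probability *= part_probability  # product over all expansions
            yield words, probability

def prefix_probability(prefix):
    """Summed probability of all trees whose yield begins with `prefix`."""
    prefix = tuple(prefix)
    return sum(p for words, p in derivations("S") if words[:len(prefix)] == prefix)

def pcfg_surprisal(context, word):
    """log2 of prefix(context) / prefix(context + word), in bits.
    Assumes both prefix probabilities are non-zero."""
    ratio = prefix_probability(list(context) + [word]) / prefix_probability(context)
    return math.log2(1.0 / ratio)

print(prefix_probability(["the", "cat"]))       # 0.5
print(pcfg_surprisal(["the"], "cat"))           # 1.0
print(pcfg_surprisal(["the", "cat"], "reads"))  # 1.0
```

Because the probability of each word is obtained by summing over whole trees, the prediction at any position is sensitive to the hierarchical structure built so far rather than to a fixed window of preceding words.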
PCFGs of this form suffer from being unable to encode dependencies between lexical items: the choice of the verb in a VP is made independently of the choice of the noun in the verb's NP object. A body of work in natural language processing has addressed this shortcoming by adding 'lexicalization' to a PCFG, and this is the approach we adopt, following Roark et al. (11).

Dimension 2: Word vs. Category prediction
As already noted, n-gram models with longer contexts suffer from an estimation problem: it is impossible to get accurate estimates of the likelihood of relatively infrequent words in contexts that are defined by sequences of, say, 5 words. We can avoid this problem by incorporating another aspect of abstract linguistic structure: the categorization of words into part-of-speech (POS) classes. We can define a POS n-gram model as one where both the context and the predicted element are POS labels (e.g., noun, verb, determiner, etc.). To compute the surprisal of a word $w$, then, equation (2) becomes:

$$\mathrm{surprisal}(w \mid c) = -\log P(t_w \mid t_c)$$

where $t_c$ is the POS of the context, and $t_w$ is the POS of the target word.
This is what we use for our model of POS surprisal.
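A POS-based bigram version of this computation can be sketched as follows. The tag dictionary, the toy corpus, the tag names, and the base-2 logarithm are all assumptions made for illustration. The sketch shows that two nouns receive identical surprisal after a determiner, whereas a preposition is treated differently.

```python
import math
from collections import Counter

# Hypothetical word-to-POS dictionary and toy corpus.
POS_OF = {"the": "DET", "cat": "NOUN", "magazine": "NOUN",
          "reads": "VERB", "furry": "ADJ", "of": "ADP"}
corpus = "the cat reads the furry magazine".split()

tags = [POS_OF[word] for word in corpus]
tag_counts = Counter(tags)
tag_bigrams = Counter(zip(tags, tags[1:]))

def pos_surprisal(previous_word, word):
    """Surprisal in bits of the POS of `word` given the POS of `previous_word`."""
    context_tag, target_tag = POS_OF[previous_word], POS_OF[word]
    if tag_counts[context_tag] == 0:
        return math.inf  # context tag never observed in the corpus
    probability = tag_bigrams[(context_tag, target_tag)] / tag_counts[context_tag]
    return math.log2(1.0 / probability) if probability > 0 else math.inf

print(pos_surprisal("the", "cat"))       # 1.0
print(pos_surprisal("the", "magazine"))  # 1.0
print(pos_surprisal("the", "of"))        # inf
```

Since only the category sequence matters, any two words sharing a POS label are indistinguishable for this model, regardless of how plausible each is in the context.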
With a small set of POS labels, the probability values for longer n-grams can be accurately estimated. Note, though, that a POS n-gram model is insensitive to the meaning of individual words, so it will be unable to distinguish the probability of 'cat' and 'magazine' occurring in any context, as they are both nouns, but it could distinguish their likelihood from that of prepositions like 'of' or adjectives like 'furry'. As a result, this model's predictions for surprisal will differ from those of a word-based surprisal model. Roark et al. (11) propose a method for separating word prediction from category prediction in the context of hierarchy-sensitive probability models. Specifically, for the category predictions, the prefix probability of the context-word sequence omits the probability of the generation of the word. Following Roark et al., we call the resulting surprisal predictions Syntactic Surprisal. For word predictions, on the other hand, the context includes not only that contributed by the preceding words, but also the structure up to, but not including, the generation of the word. Again following Roark et al. (11), we call the surprisal values computed in this way Lexical Surprisal.

Fig. 1: The two dimensions of language models (linear vs. hierarchical structure and word vs. category prediction). A choice in each dimension yields a distinct model of language, from which we can extract probability values.

Challenging data
In order to test different types of surprisal models, a new set of stimuli was designed, building on Artoni et al. (2020) (1). In that work, linguistic structures were decoded from neural activity using carefully controlled data in which confounding factors such as acoustic information were factored out, distinguishing it from previous work in the field (3,6,12). Specifically, the stimuli involved pairs of sentences sharing strings of two words with exactly the same acoustics (homophonous phrases, hence HPs) but with completely different syntax. This strategy was made possible by the properties of the Italian language. An HP could be either a Noun Phrase (NP) or a Verb Phrase (VP), depending on the syntactic structure involved. More specifically, each HP contained two words, such as la porta [laˈpɔrta]: a first monosyllabic word (e.g., la), which could be interpreted either as a definite article (Eng. 'the', fem. sing.) or as an object clitic pronoun (Eng. 'her'); and a second polysyllabic word (e.g., porta), which could be interpreted either as a noun (Eng. 'door') or as a verb (Eng. 'brings'). The whole HP could be interpreted either as an NP ('the door'), as in Pulisce la porta con l'acqua ('S/he cleans the door with water'), or as a VP ('brings her'), as in Domani la porta a casa ('Tomorrow s/he brings her home'), depending on the syntactic context within the sentence in which it was pronounced. Unfortunately, it was not possible to fully disentangle surprisal from syntactic information, since the linguistic material preceding HPs differed between the NP and VP interpretations, as in Pulisce la porta ('S/he cleans the door') vs. Domani la porta ('Tomorrow s/he brings her').

Building on sentences of this kind, three experimental conditions were generated here by modulating the syntactic context preceding HPs, which predicts the syntactic type of the HP, in order to modulate the relation between syntactic and surprisal information:

(i) Unpredictable HPs (UNPRED): the syntactic context preceding HPs allows both NPs and VPs, since it is an adverb. Therefore, the syntactic type of the HP is not predictable at the beginning of the sentence, but only after the HP: if the HP is followed by a verb (as in Forse la porta è aperta, 'Maybe the door is open'), it realizes an NP; otherwise it realizes a VP (Forse la porta a casa, 'Maybe s/he brings it home'). Since the context preceding HPs is exactly the same for both NPs and VPs, no differences in the surprisal value can be detected at the HP.

(ii) Strong predictable HPs (Strong_PRED): the syntactic context preceding HPs allows either NPs or VPs (but not both) and, therefore, the syntactic type of the HP is predictable at the beginning of the sentence: if the HP is preceded by a verb, it realizes an NP (as in Pulisce la porta con l'acqua, 'S/he cleans the door with water'); if the HP is preceded by a noun, it realizes a VP (as in La donna la porta domani, 'The woman brings her tomorrow'). This was the kind of stimuli exploited in Artoni et al. (2020), where the lexical context preceding HPs differed between NPs and VPs, allowing different surprisal values in the two cases.

(iii) Weak predictable HPs (Weak_PRED): the syntactic context preceding HPs allows both NPs and VPs, as in the unpredictable HPs, so the first word of the HP (la) could be either an article or a clitic pronoun, but the second word of the HP (porta) can only be analyzed as a noun ('door'), as in strong predictable HPs, since the temporal adverb introducing the sentence (such as ieri, 'yesterday') requires a past tense whereas the verbal form of the HP displays a present tense ('brings') (as in Ieri la porta era aperta, 'Yesterday the door/*brings it was open'). As in the unpredictable class, any difference in surprisal due to the lexical material preceding HPs is eliminated, since that material is the same for both NPs and VPs (only the morphosyntactic shape of the second HP word forces the interpretation toward the NP).

Statistical Analysis
Here we compared the N-gram, Lexical, POS and Syntactic surprisal of the five classes of stimuli (Strong_PRED-NP, Strong_PRED-VP, Weak_PRED-NP, UNPRED-NP, UNPRED-VP) at the first and the second word of the HPs. Kruskal-Wallis tests revealed significant differences across the surprisal values associated with all five classes for all notions of surprisal. For the nouns and verbs, the difference was significant only for the POS surprisal and the syntactic surprisal. We further investigated these differences using Conover post-hoc tests with Holm-Bonferroni correction. For the articles and clitics, only the syntactic surprisal captured the difference across all three classes of predictable items (p<0.0001, Fig. 2, top row). The POS and N-gram surprisal values of the articles were lower than those of the clitics (p<0.05), while the lexical surprisal values of the articles of the Strong_PRED-NP sentences were lower than the lexical surprisal values of the articles of Weak_PRED-NP sentences and the clitics of Strong_PRED-VP sentences. For nouns and verbs, both the POS surprisal and the syntactic surprisal showed a difference between all three stimuli classes (p<0.05, Fig. 2, bottom row). There was no difference between the N-gram surprisal values or lexical surprisal values of nouns and verbs.

Discussion and conclusion
In this paper, four different probability models of surprisal have been compared along two contrasting dimensions: words vs. parts of speech, and sequences vs. hierarchical structures. In order to test these models, three experimental conditions were generated by modulating the context that determines surprisal: those where the phrase was completely unpredictable from the context (unpredictable phrases), those where the phrase was immediately predictable at the first word of the phrase (strong predictable phrases), and those where the phrase was predictable only after the second word of the phrase (weak predictable phrases). Notably, all confounding factors, including acoustic information, were factored out, distinguishing our work from previous work in the field (3,6,12). We found that only those models combining hierarchical structures and part-of-speech categories successfully fit the three classes. By contrast, surprisal models that consider only sequences, whether of words or of parts of speech, fail to replicate the expectations associated with the three classes. More specifically, statistical surface distributions prove to be largely insufficient when it comes to language structure.

Fig. 4: Decoding results. Boxplots of the accuracies for the distinct classification tasks using different sets of features. Each data point is the accuracy of 1 fold in a 10-fold cross-validation procedure. The red dashed lines are the chance levels. Strong Predictable N vs. V: classification task (i). (Strong and weak) Predictable N vs. V: classification task (ii). Unpredictable N vs. V: classification task (iii). Predictable vs. unpredictable: classification task (iv). For each set of features, both the surprisal of the article/clitic and of the noun/verb were considered. The sets of features are: ngram, N-gram surprisal; lex, Lexical surprisal; pos, POS surprisal; syn, Syntactic surprisal; tot, all of the above. * p < 0.05, ** p < 0.01, *** p < 0.001, **** p < 0.0001.

[…] Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 (EMNLP) (Association for Computational Linguistics, 2009), pp. 324-333.
12. C. Shain, I. A. Blank, M. van Schijndel, W. Schuler, E. Fedorenko, fMRI reveals language-specific predictive coding during naturalistic sentence comprehension. Neuropsychologia 138, 107307 (2020).
13. W. L. Taylor, "Cloze procedure": a new tool for measuring readability. Journalism Quarterly 30(4), 415-433 (1953).