Schrödinger's tree—On syntax and neural language models

In the last half-decade, the field of natural language processing (NLP) has undergone two major transitions: the switch to neural networks as the primary modeling paradigm and the homogenization of the training regime (pre-train, then fine-tune). Amidst this process, language models have emerged as NLP's workhorse, displaying increasingly fluent generation capabilities and proving to be an indispensable means of knowledge transfer downstream. Due to the otherwise opaque, black-box nature of such models, researchers have employed aspects of linguistic theory in order to characterize their behavior. Questions central to syntax—the study of the hierarchical structure of language—have factored heavily into such work, shedding invaluable insights about models' inherent biases and their ability to make human-like generalizations. In this paper, we attempt to take stock of this growing body of literature. In doing so, we observe a lack of clarity across numerous dimensions, which influences the hypotheses that researchers form, as well as the conclusions they draw from their findings. To remedy this, we urge researchers to make careful considerations when investigating coding properties, selecting representations, and evaluating via downstream tasks. Furthermore, we outline the implications of the different types of research questions exhibited in studies on syntax, as well as the inherent pitfalls of aggregate metrics. Ultimately, we hope that our discussion adds nuance to the prospect of studying language models and paves the way for a less monolithic perspective on syntax in this context.


Introduction
Syntax, or how words are combined to form sentences in natural language, has perhaps never garnered as much attention from NLP researchers as it does in the present day. Naturally, its recent relevance at conferences is owed to the deep learning paradigm, which the NLP community has embraced with open arms since the midpoint of the last decade. Prior to this paradigm shift, questions central to syntax were often restricted to the parsing domain. There, researchers were largely interested in developing supervised algorithms for processing structured input, usually in the form of annotated constituency or dependency treebanks. Beyond parsing, syntax also often factored into researchers' hypotheses about what information models may need to succeed in a given task. Feature engineering was a pivotal component of pre-neural NLP, where text was filtered through hand-crafted feature templates that emphasized parts of speech, morphology, and tree structure, so as to inform simple, often linear models about the underlying syntax of sentences.
The deep learning revolution of the mid 2010s quickly obviated the need for feature engineering, which was widely considered a time-consuming and painstaking process.
Embeddings (dense vectors representing the distributional properties of words) quickly replaced the sparse, hand-crafted vectors of yore and boosted performance dramatically (Mikolov et al. 2013; Pennington, Socher, and Manning 2014). Such progress presented a trade-off, however: accuracy at the expense of interpretability. Indeed, without the guiding hand of the feature engineer, it became difficult to ascertain what properties of natural language the new neural models, highly complex and non-linear, had come to rely on.
It was this uncertainty that inspired a new line of inquiry within NLP, concerning what exactly models know and how they come to learn it. Early insights from this domain intimated that neural networks could capture facets of the hierarchical structure of language, beyond the linear order of words in a sentence. The Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber 1997) featured prominently in such studies, where researchers employed linguistic minimal pairs (mostly based on agreement phenomena) in order to demonstrate the model's sensitivity to syntactic hierarchy (Linzen, Dupoux, and Goldberg 2016a; Gulordava et al. 2018). Such findings were deemed exciting mainly due to the LSTM's design as a sequence processor, which lacked the sort of structural supervision or inductive bias that one might encounter in the parsing literature.
Amidst skyrocketing research budgets and the continued advancement of processing hardware, NLP faced another paradigm shift in 2018-19. Researchers began realizing that representations for input words need not be fixed to a single static vector per type (as with word embeddings), but can instead be computed dynamically, with each word contextualized with respect to the rest of the sentence (Peters et al. 2018). Per this logic, it also became apparent that models capable of generating such representations could be fine-tuned with respect to downstream tasks, with impressive gains in performance thereafter (Howard and Ruder 2018). Language models, the basis of classic word embedding algorithms, were a natural fit for this paradigm and became NLP's backbone going forward.
In the modern day, models like BERT (Devlin et al. 2019), GPT (Radford et al. 2019) and their successors feature prominently in NLP research, showcasing the efficacy of the pretrain-and-finetune paradigm. Naturally, the human-like generation capability of such models, as well as their success on natural language understanding (NLU) benchmarks (Wang et al. 2018, 2019), makes the question of what the models know about language and how they acquire such knowledge an ever-pressing one. Increasingly, we find, NLP researchers turn to the field of syntax, with its decades of research, theory, and debate, in order to answer such questions. In this paper, we attempt to take stock of the ever-growing literature on the syntactic capabilities of neural language models. In doing so, we observe a lack of clarity across numerous dimensions, which influences the hypotheses that researchers form, as well as the conclusions they draw from their findings. We argue that this failure of articulation results in a body of work whose hypotheses, methodologies, and conclusions comprise many conflicting insights, giving rise to a paradoxical picture reminiscent of Schrödinger's cat, where syntax appears to be simultaneously dead and alive inside the black box models. In particular, by framing studies around aggregate metrics and benchmarks, syntax is often reduced to a monolithic phenomenon, which fails to do justice both to the complex interplay between different manifestations of hierarchical structure in natural language and to the substantial variation that exists across typologically different languages.
Our goal in this article is not to criticize earlier studies, which all provide valuable pieces of evidence for understanding the role of syntax in contemporary NLP, particularly language models. Instead, we propose a number of conceptual distinctions, the consideration and articulation of which, we argue, can help us better understand the seemingly conflicting results, resolve some of the apparent contradictions, and pave the way for a more nuanced and articulated research agenda. To provide the necessary background for this analysis, we begin by introducing the concept of syntax from a bird's eye perspective. We then review a representative sample of investigations into the syntactic capabilities of neural language models, which we categorize as belonging to three different paradigms. We supplement this review by discussing what we perceive to be important distinctions about syntax left implicit in this body of work. This leads to a discussion of different classes of research questions underlying the surveyed literature, and the role of aggregate metrics in addressing these research questions. We conclude with some thoughts on how our analysis can inform our research methodology for the future.

Background: Aspects of Syntax
Syntax is usually described as the way that words are combined into larger expressions like phrases and sentences. On one hand, syntax can then be contrasted with morphology, which is concerned with the internal structure of words. On the other hand, it can be contrasted with semantics, which deals with the meaning of words, phrases and sentences, as opposed to their form. In reality, however, syntax is concerned with the complex mapping between form and meaning at the phrase and sentence level. It is therefore important to make a distinction between syntactic structure (an abstract hierarchical structure that determines or constrains semantic composition) and coding properties (expressive devices such as word order, function words and morphological inflection that are used to partially encode the syntactic structure). To illustrate this point, let us consider two equivalent sentences in Finnish and English:
(1) koira jahtasi kissan huoneesta
    dog-NOM chase-PRS cat-ACC room-ELA
    'a/the dog chased a/the cat from a/the room'
(2) the dog chased the cat from the room
Most linguists would agree that (1) and (2) not only mean (roughly) the same thing but also have a similar syntactic structure, where the main verb (jahtasi, chased) takes a subject (koira, the dog), a direct object (kissan, the cat) and a locative modifier (huoneesta, from the room). However, the encoding of this syntactic structure is quite different in the two languages. In English, the subject and object are primarily identified through their position relative to the verb, while the locative modifier is introduced by a preposition (from). In Finnish, the role of all three dependents of the verb is indicated by morphological case inflection, and constituent order is not significant. Note also that the overt coding properties (word order, function words, morphological inflection) do not (always) uniquely determine the syntactic structure. For example, in the English example, the phrase from the room could also function as a modifier of the noun phrase the cat, although this is a less likely interpretation in most contexts.
While coding properties are concrete aspects of the sentence, the syntactic structure is essentially an abstract concept that is not directly observable. Nevertheless, linguists have over the years accumulated compelling evidence for the existence of a hierarchical structure over and above the sequential order of words. The most obvious type of evidence is perhaps the occurrence of structural ambiguity, where a single sequence of words can be assigned multiple compositional interpretations, exemplified in the following classic examples:
(3) she saw the man with the telescope
(4) old men and women
(5) flying planes can be dangerous
Other types of evidence come from substitution and permutation tests (see, e.g., Matthews 1981). However, while the existence of a hierarchical structure is hardly contested today, the linguistic theories developed to account for this structure vary in their theoretical assumptions as well as in their mathematical representations of syntactic structure. A broad distinction can be made here between theories based on phrase structure (constituency) (Bloomfield 1933; Chomsky 1957) and theories based on dependency structure (Tesnière 1959; Mel'čuk 1988), but there are also other approaches and considerable variation within each family of theories. To some degree, it is possible to convert syntactic representations from one theoretical framework to another, but the conversion is usually heuristic and lossy and, therefore, the different representations are not commensurable, strictly speaking.
The existence of a wide range of syntactic theories arises from contested views on how a diverse range of communicative principles, including the use of different coding properties, can come to exist across languages. For example, the Chomskyan tradition posits that an innate human grammar (a set of rules and processes that govern human cognition) is privy to a series of language-specific transformations that result in such idiosyncrasies (Chomsky 1965, 1981, 1995). Other accounts argue that syntax itself is shaped by functional or cognitive constraints (Zipf 1949; Givón 1995; Hawkins 2004; Jaeger and Tily 2011; Gibson et al. 2019), such as managing memory load by preferring dependencies of shorter length (Gibson 1998; Gibson et al. 2000), a process which can also influence coding properties like word order (Hahn, Jurafsky, and Futrell 2020; Futrell, Levy, and Gibson 2020). Cultural differences across languages are likewise theorized to play a large role (Tomasello 2009; Evans and Levinson 2009), with complex morphosyntactic processes like polysynthesis being largely observable in small, nonindustrial communities with dense social-network structures (Trudgill 2017). Directly or not, such debates revolve around the controversial poverty of the stimulus argument (Lasnik and Lidz 2017), linguistics' own spin on psychology's nature vs. nurture debate, where the human capacity to acquire and generalize across structures is perceived as either predominantly learned or predominantly innate.
Neural networks, especially large-scale language models, have recently assumed an interesting place in this discussion. Primarily, syntactic theory has offered a useful toolkit for more fine-grained evaluation of language models, which have shown an ability to generate coherent, grammatical output, resembling that of humans. To this end, researchers have employed well-studied coding properties like subject-verb agreement (Linzen, Dupoux, and Goldberg 2016b) or phenomena like filler-gap dependencies (Wilcox et al. 2018) to articulate exactly on which grounds a model's output might be judged as grammatical or not. Such studies have served as a welcome complement to the ubiquitous, yet opaque perplexity metric, a measure of how predictable sentences or documents are, given a model's parameterization. In a sense, however, they can likewise be perceived as a means of sanity-checking models' behavior (Baroni 2021), with paper titles often framed interrogatively: do neural language models learn ? Nonetheless, answering such questions is useful, and a concrete understanding of the ability of neural networks to generalize with respect to natural language (as well as the algorithmic processes that underlie this capacity) could, at the least, provide interesting perspectives on the age-old debates mentioned above (Linzen and Baroni 2021).

Review: The Quest for Syntax
In this section, we review work belonging to what we perceive as the three dominant paradigms for attesting language models' knowledge of syntax: targeted syntactic evaluation, probing, and (downstream) NLU evaluation. Though comprehensive surveys of such studies can be found, for example, in Linzen and Baroni (2021) or Manning et al. (2020), our aim here is to relate them to the concepts and distinctions discussed in the previous section. Readers interested in a more detailed description and analysis are referred to the aforementioned work.

Targeted Syntactic Evaluation
Targeted syntactic evaluation (TSE) is arguably the most popular framework for assessing neural networks' ability to make syntactic (and therefore hierarchical) generalizations. At its core, TSE is a black-box testing approach concerned with measuring model output (typically probabilities) with respect to a curated set of stimuli. Such stimuli are typically based on minimal pairs motivated by phenomena in the syntax literature. For example, consider the (by now classic) example in sentences (6a) and (6b).
(6) a. The keys to the cabinet are on the
The literature dictates that a competent English speaker would rely on a structural analysis of the keys to the cabinet to infer number agreement between the plural subject (keys) and the copula verb (are). On the other hand, a purely sequential processing of the sentence would arrive at the opposite conclusion in (6b): is agrees with the adjacent singular noun (cabinet). To ascertain whether or not a language model $M$ follows the former logic, one could, for example, compare the probabilities assigned to the target verb be in (6a) and (6b), given the context $C$ = the keys to the cabinet, and examine whether $P_M(\text{are} \mid C) > P_M(\text{is} \mid C)$. This can also be extended to full paradigms, where, in the case of (6), $M$ has to assign higher probabilities to both (6a) and (6d) with respect to (6b) and (6c). TSE (per this formulation) can thus be seen as based on an accuracy metric, which, if returning a high value over $n$ stimuli, implies that $M$ is able to generalize with respect to the relevant syntactic phenomenon. It should be noted that the probability assigned to the word form $x$ is, per various theoretical justifications, sometimes replaced with surprisal, e.g., $S = -\log_2 P_M(x \mid C)$, as in Wilcox et al. (2018) and Futrell et al. (2019). Furthermore, in situations where the locus of ungrammaticality does not lie on a single word (as in English subject-verb agreement), but is dependent on the interaction of several words (e.g., as in negative polarity items), it is common to compare the probabilities or perplexities of entire sentences (Marvin and Linzen 2018; Jumelet and Hupkes 2018).
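The accuracy and surprisal formulations above can be made concrete with a minimal sketch. It assumes a hypothetical lookup table `TOY_MODEL` in place of a real language model's output distribution; the probability values, the extra stimuli, and all function names are invented for illustration and are not taken from any of the cited studies.

```python
import math

# Hypothetical next-word probabilities P_M(x | C). In practice these would be
# read off a language model's output layer; the values here are made up.
TOY_MODEL = {
    "the keys to the cabinet": {"are": 0.06, "is": 0.02},
    "the key to the cabinet": {"are": 0.01, "is": 0.08},
    "the authors of the book": {"are": 0.03, "is": 0.04},
}


def prob(context, word):
    """Return the toy P_M(word | context)."""
    return TOY_MODEL[context][word]


def surprisal(context, word):
    """Surprisal in bits: S = -log2 P_M(word | context)."""
    return -math.log2(prob(context, word))


def tse_accuracy(stimuli):
    """Fraction of minimal pairs where the grammatical form is more probable.

    Each stimulus is a tuple (context, grammatical_word, ungrammatical_word).
    """
    hits = sum(prob(c, good) > prob(c, bad) for c, good, bad in stimuli)
    return hits / len(stimuli)


# Illustrative stimuli: one context per minimal pair.
STIMULI = [
    ("the keys to the cabinet", "are", "is"),
    ("the key to the cabinet", "is", "are"),
    ("the authors of the book", "are", "is"),
]

if __name__ == "__main__":
    print(f"accuracy = {tse_accuracy(STIMULI):.2f}")
    for word in ("are", "is"):
        s = surprisal("the keys to the cabinet", word)
        print(f"surprisal({word}) = {s:.2f} bits")
```

Because surprisal is a monotonically decreasing function of probability, comparing surprisals at the target word yields the same ranking of a minimal pair as comparing probabilities.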
The TSE framework also allows for flexibility in integrating more complex sets of stimuli, as in the study on syntactic state by Futrell et al. (2019):
(7) a. As the doctor studied the textbook, the nurse walked into the office.
    b. *As the doctor studied the textbook.
    c. ?The doctor studied the textbook, the nurse walked into the office.
    d. The doctor studied the textbook.
With respect to (7), Futrell et al. (2019) formulate a set of hypotheses, whereby they posit (1) that the surprisal at the matrix clause after the comma (... the nurse walked into the office.) should be lower for (7a) than for (7c) (the network knows it is in a subordinate clause per the subordinator as), and (2) that the surprisal at the matrix clause should be higher for (7b) than for (7d) (the network expects a matrix clause per the subordinator).
Though the aforementioned accuracy approach could likewise be appropriated here as a summary statistic, researchers also often employ significance testing in order to accept or reject their hypotheses. For example, Futrell et al. (2019) apply a linear mixed-effects model on their models' stimulus-level predictions in order to accept hypothesis (1) on behalf of all models, but reject hypothesis (2) for all but two. This formulation, in line with common paradigms in psycholinguistics, leads them to conclude that, while all models are partially capable of tracking syntactic state across subordinate and main clauses, certain training conditions are required (large data or explicit structural objectives) in order to fully capture the structural expectations induced by subordinators.
A similar methodology is employed in Wilcox et al. (2018) for investigating filler-gap dependencies.
The popularity of the TSE framework has precipitated the creation of challenge suites, which offer holistic measures of models' performance across a variety of linguistic phenomena. Marvin and Linzen (2018) were among the first to introduce such datasets, employing a context-free grammar to procedurally generate minimal pair sentences (such as 6a and 6b) for a variety of phenomena: agreement (of various kinds), reflexive anaphora, and negative polarity items. Warstadt et al. (2020) later presented a similar, automatically generated dataset of minimal pairs (BLiMP), albeit with wider coverage: 1,000 sentences for each of 67 paradigms belonging to 12 different phenomena. The authors used BLiMP to study various popular language model architectures (LSTM, Transformer), whereby they associated average accuracy across phenomena with models' linguistic knowledge. A similar suite was contemporaneously introduced by Hu et al. (2020), albeit employing 2 × 2 templates like (6) for hand-curated stimuli culled from syntax textbooks. Like Warstadt et al. (2020), Hu et al. (2020) used their suite to study language model architectures, most notably relating language models' syntactic generalization (SG) score (measured in aggregate across phenomena) to their test set perplexity.
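To make the notion of an aggregate score concrete, the following minimal sketch averages per-phenomenon accuracies into a single number. The model names, phenomenon labels, and accuracy values in `SCORES` are invented for illustration and do not come from BLiMP or from the suite of Hu et al. (2020).

```python
from statistics import mean

# Hypothetical per-phenomenon TSE accuracies for two models (made-up numbers).
SCORES = {
    "model_a": {"agreement": 0.95, "npi_licensing": 0.55, "filler_gap": 0.75},
    "model_b": {"agreement": 0.75, "npi_licensing": 0.75, "filler_gap": 0.75},
}


def aggregate(per_phenomenon):
    """Unweighted mean accuracy across phenomena (one score per model)."""
    return mean(per_phenomenon.values())


def spread(per_phenomenon):
    """Gap between the best and worst phenomenon for one model."""
    values = list(per_phenomenon.values())
    return max(values) - min(values)


if __name__ == "__main__":
    for name, scores in SCORES.items():
        print(f"{name}: aggregate={aggregate(scores):.2f} spread={spread(scores):.2f}")
```

In this toy setting the two models receive practically the same aggregate score, although one of them is near chance on a phenomenon with a binary choice; the per-phenomenon breakdown is what reveals the difference.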

Probing
Probing is another popular paradigm for attesting NLP models' acquisition of syntax.
The key distinction between TSE and probing is that, while the former is concerned with model behavior, the latter focuses explicitly on model representation. In this context, behavior is likened to the probabilities assigned to certain outputs (extracted, typically, from the output layer of a language model), while representation refers to the intermediate hidden state vectors computed by the model. Mainly, probing is motivated as being necessary due to deep learning's end-to-end nature: features are learned with respect to a given task, not engineered like in traditional systems. Due to this fact, neural models' representations are wholly uninterpretable to the human interlocutor and thus require intervention in order to understand what they portray.
Formally speaking, probing is concerned with representations $h$ extracted from a model $M$ for a given input $x$: $h = M(x)$. A representation $h \in \mathbb{R}^{1 \times d}$ is typically a fixed-length dense vector corresponding to input word $x$ (e.g., keys in 6a), where $d$ is the hidden-state dimensionality of $M$. A probe $f$ for a given linguistic property $A$ is a classifier fit on $h$ to produce output $y \in Y$, where $Y$ is a finite label set: $y = f_A(h)$. For properties that can be decoded from single words, such as part-of-speech (POS) tags, a trained probe $f_{\text{POS}}$ must be able to assign the correct label to $h$ with respect to the ground truth, e.g., $\hat{y} = \text{NOUN}$ for $M(\text{keys})$. For properties concerning two or more words, such as dependencies or phrases, a concatenation of hidden states corresponding to discontiguous tokens $x_i, x_j$ or a contiguous span of tokens $x_i, \ldots, x_j$ is applied. In this latter formulation, deemed edge-probing by Tenney et al. (2019), one might expect a probe $f_{\text{DEP}}$ to decode $\hat{y} = \text{NSUBJ}$ for $M(\text{keys}, \text{are})$ and $f_{\text{CON}}$ to decode $\hat{y} = \text{PP}$ for $M(\text{on}, \text{the}, \text{table})$. Though probing models vary widely in terms of architecture, parameters, optimization, etc., the vast majority of them assume a training set $D_A$ representative of property $A$ on which $f$'s parameters $\Theta$ can be fit, like a treebank. Such probes are then evaluated in standard supervised learning fashion via accuracy on a held-out test set. If such accuracy is high, it can then be said that $A$ is decodable from $h$, i.e., that $M$ learns it. This framework was notably employed by Liu et al. (2019a) and Tenney et al. (2019), who concurrently demonstrated that representations extracted from popular contextual embedding models (ELMo, BERT, GPT) yielded exceedingly good performance on suites of linguistic tasks. Also noteworthy is Tenney, Das, and Pavlick (2019)'s study, which showed that BERT's representations appear to evolve in capturing properties with increasing levels of complexity, from lexical features to syntax and semantics.
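The supervised probing setup described above can be sketched in a few lines. The sketch makes several loud assumptions: the vectors are synthetic stand-ins for hidden states rather than outputs of an actual model, the label set is reduced to two POS tags, and a multiclass perceptron is swapped in as the simplest possible linear probe; none of this reproduces the probes used in the cited studies.

```python
import random

LABELS = ["NOUN", "VERB"]
# Hypothetical class prototypes standing in for hidden states h = M(x), d = 4.
PROTOTYPES = {"NOUN": [1.0, 0.0, 0.0, 0.0], "VERB": [0.0, 1.0, 0.0, 0.0]}


def make_dataset(n, rng):
    """Sample n (h, y) pairs: a class prototype plus small uniform noise."""
    data = []
    for _ in range(n):
        y = rng.choice(LABELS)
        h = [v + rng.uniform(-0.2, 0.2) for v in PROTOTYPES[y]]
        data.append((h, y))
    return data


def score(weights, h, label):
    """Dot product between the weight vector of a label and h."""
    return sum(w * v for w, v in zip(weights[label], h))


def predict(weights, h):
    """Linear probe f(h): return the label with the highest score."""
    return max(LABELS, key=lambda label: score(weights, h, label))


def fit_probe(train, dim, epochs=50):
    """Fit a multiclass perceptron with one weight vector per label."""
    weights = {label: [0.0] * dim for label in LABELS}
    for _ in range(epochs):
        for h, y in train:
            y_hat = predict(weights, h)
            if y_hat != y:  # mistake-driven update
                weights[y] = [w + v for w, v in zip(weights[y], h)]
                weights[y_hat] = [w - v for w, v in zip(weights[y_hat], h)]
    return weights


def accuracy(weights, data):
    """Share of examples whose predicted label matches the gold label."""
    return sum(predict(weights, h) == y for h, y in data) / len(data)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
train, test = make_dataset(40, rng), make_dataset(20, rng)
probe = fit_probe(train, dim=4)
print(f"held-out accuracy: {accuracy(probe, test):.2f}")
```

For edge probing, the same classifier could be fit on the concatenation of two or more such vectors, which only changes the input dimensionality `dim`.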
While the aforementioned word-level approach is by far the most popular probing setup, other methods for decoding the syntactic structure of entire sentences have been proposed. One model that is of particular interest is that of Hewitt and Manning (2019), who attempt to decode dependency structure from models' vector spaces. To this end, they propose to learn transformations over model representations, such that (1) the squared $\ell_2$ distance between any vectors $h_i, h_j$ reflects the distance between their corresponding words $x_i, x_j$ in a parse tree, and (2) the $\ell_2$ norm of any vector $h_i$ reflects the depth of its corresponding word $x_i$ in a parse tree. They find that this method is particularly effective for decoding Stanford Dependencies trees (de Marneffe, MacCartney, and Manning 2006) from ELMo and BERT representations, with respect to several lexical-only baselines. Beyond Hewitt and Manning (2019)'s method, which can be imagined as doing parsing by proxy, other work has directly employed (underparameterized) dependency parsers as probes. For example, see Hewitt and Liang (2019), who employ a graph-based bilinear probe; Maudslay et al. (2020), who investigate the relation between probing and parsing; and Pimentel et al. (2020), who advocate for adding full dependency parsing to the probing task suite.
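The two geometric targets of such a structural probe can be illustrated with a small sketch. Everything here is an assumption made for illustration: the transformation `B` is fixed to the identity rather than learned, the vectors in `VECTORS` are hand-built so that each tree edge occupies its own dimension, and the sketch reads depth off the squared norm so that the numbers match the tree exactly. It is not the training procedure of Hewitt and Manning (2019).

```python
def matvec(B, v):
    """Multiply matrix B (a list of rows) with vector v."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]


def squared_distance(B, h_i, h_j):
    """Squared L2 distance between the transformed vectors B h_i and B h_j."""
    diff = matvec(B, [a - b for a, b in zip(h_i, h_j)])
    return sum(x * x for x in diff)


def squared_norm(B, h):
    """Squared L2 norm of B h, read here as the predicted depth of a word."""
    return sum(x * x for x in matvec(B, h))


def depth(heads, i):
    """Number of edges from word i up to the root (whose head is -1)."""
    d = 0
    while heads[i] != -1:
        i, d = heads[i], d + 1
    return d


def tree_distance(heads, i, j):
    """Number of edges on the path between words i and j in the tree."""
    ancestors, node, steps = {}, i, 0
    while node != -1:  # record i and all of its ancestors
        ancestors[node] = steps
        node, steps = heads[node], steps + 1
    node, steps = j, 0
    while node not in ancestors:  # climb from j to the lowest common ancestor
        node, steps = heads[node], steps + 1
    return steps + ancestors[node]


# "the dog chased the cat": head index per word, -1 marks the root (chased).
HEADS = [1, 2, -1, 4, 2]
# Hand-built vectors: one dimension per tree edge on the path from the root.
VECTORS = [
    [1.0, 1.0, 0.0, 0.0],  # the (dependent of dog)
    [1.0, 0.0, 0.0, 0.0],  # dog
    [0.0, 0.0, 0.0, 0.0],  # chased (root)
    [0.0, 0.0, 1.0, 1.0],  # the (dependent of cat)
    [0.0, 0.0, 1.0, 0.0],  # cat
]
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]

if __name__ == "__main__":
    for i in range(len(HEADS)):
        row = [int(squared_distance(IDENTITY, VECTORS[i], VECTORS[j]))
               for j in range(len(HEADS))]
        print(row, "depth:", depth(HEADS, i))
```

With real model representations no such exact match is expected; the probe is trained to make the transformed distances and norms approximate their tree-based counterparts as closely as possible.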
At this stage, probing can be considered a field of inquiry in its own right, with researchers presenting new models, metrics, and criticisms for every conference cycle. Naturally, the use of intermediary models trained on top of extracted representations warrants caution from the interlocutor. Some concerns expressed in the literature include, but are not limited to: the use of smaller, linear models vs. larger, nonlinear ones; appropriate baselines and evaluation metrics; properties being learned by the probe vs. occurring in representations; properties being employed by the model in the original task vs. simply being decodable, etc. Though a full consideration of these methodological concerns is outside the scope of this article, we refer the interested reader to Belinkov (2021), a comprehensive review of the paradigm, open issues, and alternative approaches like attention analysis.

NLU Evaluation
Outside of TSE and probing, another technique that has recently attracted much attention is the evaluation of models (imbued with or deprived of syntactic knowledge) on downstream tasks. The logic inherent to this line of inquiry is as follows: if a model has come to rely on human-like knowledge of language to solve complex NLP tasks, then it should (1) perform poorly on such tasks when the surface form of an utterance has been corrupted beyond (human) comprehension, and (2) perform identically when imbued with the exact abstract structure theorized by linguists as governing the surface form. Such tasks are typically taken from the GLUE benchmark, a suite of natural language understanding (NLU) datasets "designed to favor and encourage models that share general linguistic knowledge across tasks" (Wang et al. 2018). GLUE has served as the principal point of comparison for pretraining architectures, where, as of writing, 15 models have surpassed the published human performance on the same tasks.
In terms of input corruption, many studies have investigated the effect of word order on NLU task performance. Indeed, word order is the primary means of encoding syntactic argument structure in English, and such work often hypothesizes that sensitivity to this particular property should result in lower NLU scores. Gupta, Kvernadze, and Srikumar (2021) demonstrate that this is not the case for BERT when fine-tuning on various GLUE tasks: sequences corrupted at test time by means of shuffling, sorting, duplicating, and dropping tokens still retain 70-90% of the performance obtained on non-perturbed input. Moreover, models appear to be as confident in assigning labels to perturbed inputs as they are to naturalistic ones. These results are corroborated by Pham et al. (2020), who show that models predominantly seek salient words in sequences, with numerous attention heads specializing themselves for this exact purpose. Sinha et al. (2020) report similar findings for various NLI datasets (in English and Chinese) across a variety of model architectures. They show that models are insensitive to word reorderings, some of which can actually result in improved task performance. Perhaps most strikingly, Sinha et al. (2021) show that pre-training full-scale RoBERTa models on perturbed sentences (across n-grams of varying lengths) and fine-tuning them on unaltered GLUE tasks leads to negligible performance loss. They also report that a popular probe for dependency structure, that of Pimentel et al. (2020), is able to decode trees from the perturbed representations (even a unigram baseline with resampled words) with considerable accuracy.
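The kinds of input corruption discussed above can be sketched as simple token-level functions. The function names, the probability parameter `p`, and the chunking strategy in `shuffle_ngrams` are assumptions of this sketch; the exact perturbation procedures of the cited studies may differ in their details.

```python
import random


def shuffle_tokens(tokens, rng):
    """Return the tokens in a random order."""
    out = list(tokens)
    rng.shuffle(out)
    return out


def sort_tokens(tokens):
    """Return the tokens in alphabetical order."""
    return sorted(tokens)


def duplicate_tokens(tokens, rng, p=0.3):
    """Repeat each token once with probability p (p is a made-up default)."""
    out = []
    for tok in tokens:
        out.append(tok)
        if rng.random() < p:
            out.append(tok)
    return out


def drop_tokens(tokens, rng, p=0.3):
    """Drop each token with probability p, keeping at least one token."""
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept or list(tokens[:1])


def shuffle_ngrams(tokens, n, rng):
    """Shuffle contiguous chunks of n tokens, keeping order inside each chunk."""
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng.shuffle(chunks)
    return [tok for chunk in chunks for tok in chunk]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    sentence = "the dog chased the cat from the room".split()
    print(shuffle_tokens(sentence, rng))
    print(sort_tokens(sentence))
    print(duplicate_tokens(sentence, rng))
    print(drop_tokens(sentence, rng))
    print(shuffle_ngrams(sentence, 2, rng))
```

Note that `shuffle_ngrams` with larger `n` preserves more local word order than `shuffle_tokens`, which makes the two useful for contrasting global and local corruption.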
As a conceptual counterpoint to the permutation-based line of research, several studies have posed the opposite question: does explicitly injecting syntactic structure into models' representations or training objectives lead to better downstream performance? The observations in such studies are similar to the aforementioned work, albeit slightly more subtle: models that factor syntax into their decisions generally do not benefit in performance via its injection, which is taken to imply that such structure is redundant to the model, or not needed at all. Most notably, Glavaš and Vulić (2021a) fine-tune BERT and RoBERTa (Liu et al. 2019b) as dependency parsers, before fine-tuning the same models again on NLI, paraphrase detection, and commonsense reasoning tasks. They show that, while intermediate parsing training (IPT) can produce near state-of-the-art parsers, repurposing these parameters for NLU tasks leads to negligible improvement. A similar trend is shown in Kuncoro et al. (2020), who train a BERT model distilled from an RNNG teacher (Dyer et al. 2016). They, too, find that, while their syntactically aware model achieves top marks on a suite of parsing and otherwise syntactic tasks, the benefits for fine-tuning on GLUE are scant, if any. Swayamdipta et al. (2019) corroborate these findings for ELMo models conditioned on chunked input derived from phrase structure trees.

Discussion: A Call For Clarity and Caution
After our general discussion of syntax, as well as our review of work exploring its role in contemporary language models, we are now in a position to make a few basic distinctions. In this section, we attempt to situate the findings of the aforementioned studies along several dimensions that we deem important towards the advancement of our research agenda.

Coding Properties are not Syntax
First, we would like to highlight the need to be clear about whether a study is concerned with abstract syntactic structure, overt coding properties, or with some relation between the two. A typical fallacy that may arise from not observing this distinction is to conflate a particular coding property with the abstract syntactic structure that it partially encodes. Naturally, if we fall victim to this fallacy when interpreting certain findings, we risk drawing conclusions based on insufficient or irrelevant evidence. This applies to situations where we may be tempted to employ coding properties as proxies of syntactic structure, either for attesting models' sensitivity to the latter or refuting it.
For example, it is important to acknowledge that studying agreement via TSE gives us a glimpse into how language models capture the syntactic relationship between selected words, such as verbs and their subjects. Per this view, high performance, even in the presence of various types of attractors, does not necessarily entail that a model has learned the grammar of a language. Rather, it has simply shown itself to be particularly sensitive to a single coding property, grammatical relation, or dependency type. Notably, English agreement is limited to expressing the number or person of the subject on the finite main verb (when in the present tense). This amounts to being, in the vast majority of cases, a binary distinction between correct and incorrect inflections, which carries a strong random-choice baseline of 50% in the case of TSE. Thus, when one considers types of agreement manifested in other languages (such as number, gender, and case agreement between nouns and their modifying adjectives, e.g., in German and Russian, or polypersonal agreement between a verb and multiple arguments, e.g., in Basque and Georgian), it becomes difficult to judge agreement as the primary mechanism by which syntax is encoded in English. Indeed, studies have shown that models tend to struggle with more expressive agreement mechanisms in morphologically rich languages (Ravfogel, Goldberg, and Tyers 2018). Such insights call not only for a typologically driven research agenda, but also for nuance in interpreting positive findings for singular properties in selected languages.
We must also note that the above logic can apply in reverse: a model's lack of sensitivity to a single coding property, for example, word order (Dryer 1992), does not imply that the model has failed to acquire syntax as a byproduct of its training objective. Even in a language like English, where word order is very salient, it is not the only coding property that signals syntactic structure. Consider chases the cats the dog as a permutation of the dog chases the cats: it is not unreasonable for an English speaker to decode the argument structure of this permutation using subject-verb agreement alone. Indeed, recent research in psycholinguistics has intimated that humans are relatively robust to permutations of linguistic form (Traxler 2014). In the context of word order, Mollica et al. (2020) show that humans are able to process permuted sentences similarly to naturalistic ones, albeit when local structure (measured via pointwise mutual information) is preserved. Recently, this has been corroborated for models fine-tuned on GLUE as well, with performance therein strongly correlated with the extent of local structure corruption (Clouatre et al. 2021). With this in mind, one can see that order perturbation studies do not provide enough evidence to conclude that models (or humans) are insensitive to syntax. Instead, when conducting such studies, we must recall that word order (or agreement, for that matter) is simply a single coding property in a mosaic of such properties, all of which are privy to underlying processes that drive composition and comprehension.

Syntactic Representations are not Linguistic Data
As a second point, if a study is concerned with syntactic structure, we need to clarify whether it assumes a specific type of syntactic representation, since the choice of representation may affect the results. Other things being equal, we may therefore prefer methods that do not presuppose specific syntactic representations, since conclusions will otherwise be valid only on the assumption that the chosen representation correctly captures syntactic structure. This consideration is even more important when we make use of automatically parsed data - as opposed to manually annotated sentences from treebanks - where otherwise sound syntactic representations may give misleading results due to parsing errors. At the same time, it is important to note that avoiding syntactic representations altogether may be limiting in another way, as it may restrict our methodological repertoire. Thus, as long as we maintain a critical attitude towards representation-dependent methods, they may still provide us with valuable results that cannot be obtained with other methods.
To illustrate the importance of representations in the context of probing, we can start by asking: does high UAS on a particular treebank imply that those trees are indeed the structures encoded by a given model? Or can alternative, linguistically plausible structures be decoded with comparable accuracy? Kulmizev et al. (2020) explore this question when probing various models for UD, a dependency formalism which prioritizes content-word heads (de Marneffe et al. 2021), and Surface-Syntactic UD (SUD), which assumes a traditional function-word head style analysis (Gerdes et al. 2018).
They find that, while the difference in decoding UAS between the two formalisms is minimal for some treebanks, other treebanks exhibit strong preferences for either UD or SUD. They attribute such preferences to a complex interplay between the formalisms' inherent graph properties (e.g., average tree height), the probe employed for decoding (that of Hewitt and Manning (2019), in their case), annotation factors like tokenization, and morphology. Though preliminary, the study by Kulmizev et al. (2020) is a cautionary tale in tree-based probing, where the choice of representation directly affects what conclusions one may draw about models.
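To make the dependence on the reference formalism concrete, the following is a minimal sketch of how unlabeled attachment score (UAS) could be computed. The toy sentence, its two head annotations, and the "decoded" tree are illustrative assumptions, not data from Kulmizev et al. (2020), and the function name `uas` is hypothetical. Note also that the cited study trains a separate probe per formalism, whereas this sketch merely scores one fixed tree against two references.

```python
from typing import Sequence


def uas(gold_heads: Sequence[int], pred_heads: Sequence[int]) -> float:
    """Unlabeled attachment score: share of tokens whose predicted head equals the gold head."""
    if len(gold_heads) != len(pred_heads):
        raise ValueError("gold and predicted head lists must have the same length")
    if not gold_heads:
        raise ValueError("cannot score an empty sentence")
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)


# Toy sentence (assumed for illustration): "the dog has barked".
# Heads are 1-indexed token positions; 0 marks the root.
# Content-word-head analysis (UD style): "barked" is the root.
ud_gold = [2, 4, 4, 0]
# Function-word-head analysis (SUD style): the auxiliary "has" is the root.
sud_gold = [2, 3, 0, 3]

# A hypothetical tree decoded from a model's representations.
decoded = [2, 4, 4, 0]

uas_ud = uas(ud_gold, decoded)    # every head matches the UD-style reference
uas_sud = uas(sud_gold, decoded)  # only the determiner's head matches
print(uas_ud, uas_sud)
```

The same decoded tree looks perfect against one reference and poor against the other, which is why a UAS figure is only interpretable relative to the formalism it was computed against.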
We can ask similar questions when attempting to imbue models with syntactic structure. For example, is the injection of UD trees into a model's architecture enough to draw conclusions about the role of syntax in downstream performance? Or do alternative, linguistically plausible representations exist that models might yet benefit from? Beyond this, what privileges one particular injection method, say intermediate parsing training (Glavaš and Vulić 2021a), over another, such as knowledge distillation from an RNNG teacher (Kuncoro et al. 2020)? A template for exploring such considerations can be found in Wu, Peng, and Smith (2021), who report that infusing BERT with semantic dependencies can provide modest gains on GLUE. In that study, they compare the DM representation, which focuses explicitly on predicate-argument structure (Ivanova et al. 2012), with the more syntactically oriented UD, finding that the former leads to slightly better performance. Furthermore, they compare their chosen infusion method - semantic graph embeddings learned via a relational graph convolution encoder (Schlichtkrull et al. 2018) - with other means of injecting structure into representations, and find that their method performs best in most cases.

Theory, Model, and Task
As a further means of promoting clarity, we consider it vital, for any study investigating the syntactic capabilities of language models, to motivate the evaluation with a direct aspect of syntax in mind. Indeed, TSE studies on coding properties and probing studies on tree representations satisfy this desideratum by design. Studies that investigate models' knowledge of syntax through the prism of downstream tasks, however, do not.
For example, if we observe models achieving high performance on NLU tasks, we may be tempted - by virtue of the NLU branding - to assume that they are, in some fashion, similar to humans in their decision-making. To ascertain the extent to which this is true, we must first attempt to articulate exactly how humans behave when choosing a particular label for a sample over another. Indeed, this has, to an extent, been achieved for some textual entailment datasets, where annotators were asked to justify their labeling decisions with free-text rationales (Camburu et al. 2018; Rajani et al. 2019). Such datasets, however, introduce a third variable into the mix (input, output, and explanation), which complicates architecture design and further divorces the original model (one that may be evaluated, presumably, for its syntactic capabilities) from the task on which it is now being fine-tuned and for which it offers explanations. Furthermore, and perhaps more importantly, the existence of a third variable complicates the interpretation of the self-rationalizing model itself, as noted by Wiegreffe, Marasović, and Smith (2020): for example, how might we ensure that the model's explanation is indeed faithful to the label it produced? To its input? Given that we typically lack an explicit characterization of how humans make labeling decisions at the sample level, we are then left to interpret the behavior of the model itself. For example, if working with Transformer models, we may be tempted to examine their self-attention weights after fine-tuning, as in Clark et al. (2019) or Htut et al. (2019). However, the prospect of treating attention as explanation has received its fair share of criticism in recent years, due to (1) a lack of input token identifiability in deep Transformer models (Brunner et al. 2019; Abnar and Zuidema 2020) and (2) an unclear correspondence between a model's attention patterns and its outputs (Jain and Wallace 2019; Wiegreffe and Pinter 2019; Serrano and Smith 2019). Due to these drawbacks, we may then attempt to employ alternative methods of explanation, such as gradient-based saliency (Simonyan, Vedaldi, and Zisserman 2013; Sundararajan, Taly, and Yan 2017), Shapley value-based feature attributions (Lundberg and Lee 2017), or model-agnostic approaches (Ribeiro, Singh, and Guestrin 2016), inter alia. Here, we are met with a space of methods whose utility in NLP tasks has been surveyed with mixed results (Poerner, Schütze, and Roth 2018; Arras et al. 2019; Atanasova et al. 2020). Hence, in order to draw insights about a model's syntactic capabilities vis-a-vis its performance on a downstream task, we must not only make a principled choice of explanation model, but also relate that model's explanations - which are computed with respect to predictions downstream - to the particular aspect of syntax we are interested in. Again, questions central to faithfulness (Jacovi and Goldberg 2021) become vital: how can we guarantee that an explanation of our model's behavior represents what it actually did? Further yet, how can we use explanations pertaining to one type of behavior to shed light on another, unobserved one?
Certainly, numerous caveats are endemic to the process of interpreting model behavior (Lipton 2018). This may lead us to put our full trust in the task, which we can employ as a prism through which to opine on the model. This requires that the task is indeed well-motivated, well-designed, and difficult to exploit via heuristics. If we believe this to be true, we can hypothesize that, by performing well on such tasks, our models possess whatever latent ability humans do in solving them - see, e.g., Sinha et al. (2020): "models should have to know the syntax first, . . . if performing any particular NLU task that genuinely requires a humanlike understanding of meaning". Unfortunately, in the context of NLI (which Sinha et al. (2020) study), this is a highly dubious claim: the crowdsourced nature of such datasets makes them prone to annotation artefacts (e.g., subsequence overlap between premise and hypothesis, lexical choice across inference classes, sentence length, etc.), which models often exploit as heuristics, thus leading to highly inflated performance metrics (Gururangan et al. 2018; Poliak et al. 2018; McCoy, Pavlick, and Linzen 2019).
Ultimately, the extent of trust we place in the model (performing as hypothesized) over the task (being correctly expressed) may influence not only our hypotheses, but also the conclusions we draw from our findings. For example, consider Pham et al. (2020) as a counterpoint to Sinha et al. (2020). Though the observations regarding BERT-based models' insensitivity to word order are largely similar, the former are more critical of the task ("GLUE does not necessarily require syntactic information or complex reasoning"), and the latter of the model ("current models do not yet 'know syntax' in the fully systematic and human-like way we would like them to"). Naturally, for this and the aforementioned reasons, disentangling the complex web of theory, model, and task remains a gargantuan undertaking. However, if we formulate our hypotheses using the exact behavior we would like to investigate and evaluate our models along those same lines, then the conclusions we draw from our findings should be much clearer to interpret.

What are the Research Questions?
In addition to specifying which aspect of syntax a study is concerned with, we also need to be clear about what our research questions are. For example, given a model M, a task T, some training data D, and some aspect of syntax A, we may ask the following three questions:

1. To what degree does M learn A when trained on D to perform T?

2. To what degree can M learn A when trained on D to perform T?

3. To what degree does M need to learn A when trained on D to perform T?

Questions of type 1 are the most straightforward to investigate as long as we have a valid and reliable method for measuring the degree to which M learns A in the context of D and T. This is quite a big assumption in itself, and one that we will return to shortly, but we will focus first on the logic for answering different research questions. Questions of type 2 are modal in nature and therefore hard to investigate empirically, except indirectly by investigating questions of type 1. For example, in the pioneering study by Linzen, Dupoux, and Goldberg (2016a), discussed in Section 3, the authors were primarily interested in whether an LSTM (M) can learn "syntax-sensitive dependencies" (A) - a question of type 2. To investigate this, they examined the actual learning behavior of the model in two specific settings (questions of type 1): (a) when trained on unlabeled text (D_U) for the task of language modeling (T_LM), and (b) when trained on labeled sentences (D_L) for a specific agreement decision task (T_A). The results were largely negative in the first case and positive in the second. From the positive result, they could conclude that the model can learn the relevant dependencies when trained on D_L for T_A; from the negative results, however, they could only conclude that there was no evidence that the model was capable of learning the relevant aspect of syntax when trained on D_U for T_LM. This illustrates the fundamental asymmetry between positive and negative results when it comes to generalizations about possibility. A single positive result - if interpreted correctly - is sufficient to establish that something is possible, while any number of negative results are in principle inconclusive. Indeed, as discussed in Section 3, the later study by Gulordava et al. (2018) managed to obtain positive results also in a setting similar to the first scenario of Linzen, Dupoux, and Goldberg (2016a), from which they concluded that LSTMs are capable of learning at least one aspect of syntactic structure without explicit supervision. Questions of type 3 are more complex still, because they involve causality as well as modality. More precisely, they combine the question of whether learning A results in better performance of M on T (causality) with the question of whether learning A is necessary to achieve better performance (modality). A typical example is the study of Glavaš and Vulić (2021b), discussed in Section 3, where the authors study the effect of intermediate parser training of a pre-trained language model later fine-tuned for various language understanding tasks. The underlying research question is whether knowledge of syntax is needed for language understanding - a question of type 3 - and the lack of improvement may suggest a negative answer, but this conclusion is only warranted if it can also be shown (a) that the model has actually learned (some aspects of) syntax and (b) that this knowledge causally affects the model's behavior on the downstream task (and still fails to improve performance). Note, however, that an improvement would not be more conclusive in this case, because it would only show that improvement is possible, not that it is necessary. This illustrates the complexity involved when relating experimental results to research questions and again points to the need for careful meta-analysis.

Aggregate Metrics may be Misleading -but are Necessary
Let us finally turn to the question of how we can measure the degree to which a model M learns some aspect of syntax A when trained on data set D to perform task T - a question that is crucial to all studies in this area, regardless of what the more general research questions are. As we have seen in Section 3, the answer usually involves measuring performance on an appropriate task T′, although the exact solution depends on the type of study. In TSE studies, T′ is typically the task of discriminating positive from negative instances of some grammatical pattern, for example by assigning higher probability to the positive instance in a minimal pair. In probing, T′ is a supervised classification task assumed to reflect syntactic knowledge. And in NLU evaluation, T′ is simply the downstream language understanding task and thus normally coincides with the main task T. Each of these paradigms comes with its own methodological pitfalls, which have been extensively discussed in particular in the case of probing, but we will focus here on the complexities that are common to all of them.
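As an illustration of the TSE setting, the sketch below scores minimal pairs and counts how often the grammatical member receives the higher score. The function name `tse_accuracy`, the example sentences, and the log-probabilities in `TOY_LOGPROB` are all assumptions made for illustration; in practice the scoring function would be a real language model's sentence (or target-word) log-probability.

```python
from typing import Callable, Iterable, Tuple


def tse_accuracy(pairs: Iterable[Tuple[str, str]],
                 score: Callable[[str], float]) -> float:
    """Share of (grammatical, ungrammatical) pairs where the grammatical sentence scores higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    hits = sum(score(good) > score(bad) for good, bad in pairs)
    return hits / len(pairs)


# Hypothetical log-probabilities standing in for a language model (assumed values).
TOY_LOGPROB = {
    "the dog chases the cats": -12.1,
    "the dog chase the cats": -15.4,
    "the dogs near the cat bark": -19.0,
    "the dogs near the cat barks": -18.2,  # the attractor "cat" misleads the toy scorer
}

minimal_pairs = [
    ("the dog chases the cats", "the dog chase the cats"),
    ("the dogs near the cat bark", "the dogs near the cat barks"),
]

accuracy = tse_accuracy(minimal_pairs, TOY_LOGPROB.__getitem__)
print(accuracy)
```

With these assumed scores, the toy scorer gets the simple pair right and the attractor pair wrong, so the resulting accuracy sits exactly at the random-choice level for a binary decision.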
To begin with, we note that performance on T′ is almost always measured by averaging over individual test instances. In the simplest case, this may just be the arithmetic mean of a 0-1 loss metric, such as the accuracy reported for a probing classifier predicting part-of-speech tags. In other cases, it may be a more or less sophisticated macro-average, like an average over different grammatical patterns in a TSE study. In all cases, however, such aggregate measures need to be interpreted carefully. First of all, how do we know whether a given metric value indicates presence or absence of syntactic knowledge? Does a value of 0.5 mean that the glass is half full or half empty? This highlights the importance of relevant and informative baselines, a point that has been made in the literature before but that has perhaps not been fully appreciated.
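One simple way to relate an observed score to a baseline is an exact one-sided binomial test against chance. The sketch below is an illustrative assumption rather than a procedure proposed in the studies discussed here: the function name `p_value_above_chance` and the example counts are hypothetical, and the 0.5 default reflects the binary random-choice baseline of a minimal-pair decision.

```python
from math import comb


def p_value_above_chance(correct: int, total: int, chance: float = 0.5) -> float:
    """Exact one-sided binomial p-value: P(X >= correct) if every decision were a chance-level guess."""
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    if not 0.0 < chance < 1.0:
        raise ValueError("chance must lie strictly between 0 and 1")
    return sum(
        comb(total, k) * chance ** k * (1.0 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical outcome: 15 of 20 minimal pairs decided correctly.
p = p_value_above_chance(15, 20)
print(round(p, 4))  # 0.0207
```

A test of this kind only says whether a score is distinguishable from guessing; it does not by itself say whether the score reflects syntactic knowledge, which is why more informative baselines (for example, control tasks or heuristic-only models) remain important.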
Second, it is in the nature of aggregate metrics that they can easily be misleading by hiding important variation, especially if the distribution of different types of phenomena is heavily skewed. For example, in the related field of syntactic parser evaluation, Rimell, Clark, and Steedman (2009) have shown that parsers with very respectable performance according to standard aggregate metrics like EVALB can have close to zero accuracy on certain types of unbounded dependency constructions. Moreover, aggregation may hide important variation in a number of different ways. If we use naturally occurring text in our test sets, certain words and constructions will inevitably be much more frequent than others and therefore dominate the aggregate scores in the same way as for syntactic parser evaluation. As a result of this, Newman et al. (2021) argue that standard metrics used in TSE overestimate the systematicity of language model behavior. If in addition we aggregate over different syntactic phenomena, we may hide the fact that different phenomena are captured to different degrees. And if we aggregate over multiple languages - or only report results for a single language - we may neglect important language-specific properties and risk over-generalization.
Lastly, we must consider the role aggregation plays in the interpretation of models' performance on benchmarks like BLiMP, SyntaxGym, or GLUE. At its core, such an enterprise entails that all aspects of syntax or language understanding - at least those of particular salience - have been successfully enumerated. Given the abstract nature of these notions, and the extent of debate regarding them, it is naturally doubtful that such an enumeration could ever be attained. Relaxing this somewhat, and assuming that a salient set of aspects has indeed been collected, one must likewise assume - before aggregating - that a principled weighting of such aspects exists. This is especially relevant when dealing with a space of tasks or phenomena where fine-grained categorizations are likewise included - for instance, the 6 subject-verb agreement settings attested in BLiMP. In such cases, one must not only choose between micro- and macro-averaging across phenomena and their fine-grained attestations, but also articulate whether or not all phenomena lie on an equal playing field - in other words, whether they are all equally (1) difficult to attest and (2) salient for evaluation. Certainly, in the vast majority of cases we assume a uniform weighting of classes when aggregating, since hand-selected weights may introduce bias that we would otherwise prefer to avoid. However, we must not fail to acknowledge that benchmarks, in themselves, are influenced by designers' theories on what component parts adequately represent abstract notions like syntax or language understanding.
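The difference between micro- and macro-averaging, and the way either can hide variation across phenomena, can be sketched in a few lines. The phenomenon names and outcome counts below are invented for illustration and do not come from BLiMP or any other benchmark; `aggregate` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def aggregate(results: Dict[str, List[bool]]) -> Tuple[Dict[str, float], float, float]:
    """Return per-phenomenon accuracy, micro-average (over instances), and macro-average (over phenomena)."""
    if not results or any(len(v) == 0 for v in results.values()):
        raise ValueError("every phenomenon needs at least one test instance")
    per_phenomenon = {name: sum(v) / len(v) for name, v in results.items()}
    n_correct = sum(sum(v) for v in results.values())
    n_total = sum(len(v) for v in results.values())
    micro = n_correct / n_total
    macro = sum(per_phenomenon.values()) / len(per_phenomenon)
    return per_phenomenon, micro, macro


# Invented outcomes: one frequent, easy phenomenon and one rare, hard phenomenon.
toy_results = {
    "frequent_phenomenon": [True] * 90 + [False] * 10,  # 90 of 100 correct
    "rare_phenomenon": [True] * 1 + [False] * 9,        # 1 of 10 correct
}

per_phenomenon, micro, macro = aggregate(toy_results)
print(per_phenomenon, round(micro, 3), round(macro, 3))
```

In this toy setting the micro-average is dominated by the frequent phenomenon, while the macro-average weights both phenomena uniformly; neither single number reveals that one phenomenon is almost never handled correctly, which is only visible in the per-phenomenon breakdown.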
There is unfortunately no simple remedy to the complexities discussed in this section. In particular, giving up aggregate metrics is definitely not an option, since they are necessary for statistical significance testing and generalization. However, to make further progress in this area, we need to be aware of their inherent limitations, be open to considering alternative metrics, and make sure to complement them with more specific forms of analysis whenever possible.

Conclusion
The rapid progress in NLP thanks to deeper and larger neural network models trained on very large data sets with little or no linguistic supervision raises a number of questions concerning the status of traditional linguistic notions and theories in this landscape. Is there still a role for traditional techniques like supervised syntactic parsing? If not, is this because neural language models learn the relevant generalizations about linguistic structure without explicit supervision, or because language understanding does not really depend on such generalizations in the way traditionally assumed? If the latter, does this hold only for language understanding by machines, or does it also have implications for human language understanding? These are exciting questions and it is therefore not surprising that we have seen a considerable body of research in this area recently. They are also hard questions, and the methodology for tackling them is still under development, so it is also not surprising that results so far have been inconclusive and sometimes contradictory. As stated in the introduction, the goal of this article has not been to criticize previous efforts, but to contribute to our understanding of methods and results by articulating and discussing some of the inherent complexities in this research area. Without pretending to have any solutions, we want to conclude with some recommendations for future research, echoing the main points made throughout the paper.
Our general recommendation can be summed up as a plea for clarity and caution. Clarity means being clear about what conception of syntax underlies our investigations, which aspects of syntax are being studied, and whether we make specific assumptions about syntactic representations. Clarity also implies that we explicitly discuss what research questions are being asked, and how they can be elucidated by the specific experiments we perform. Caution means being careful when interpreting aggregate results, always looking for alternative metrics and additional analysis, and making sure to consider evidence from multiple languages if we want to draw conclusions about natural language in general. Caution finally means resisting the temptation to draw strong conclusions from any single study, which is usually impossible given the complex interplay of research questions, methodology, and data. It is only by piecing together all available evidence that we can hope to see the forest for the trees.