How Much Data Does Linguistic Theory Need? On the Tolerance Principle of Linguistic Theorizing

Mendívil-Giró, José-Luis

doi:10.3389/fcomm.2018.00062

CONCEPTUAL ANALYSIS article

Front. Commun., 09 January 2019

Sec. Psychology of Language

Volume 3 - 2018 | https://doi.org/10.3389/fcomm.2018.00062

This article is part of the Research Topic Theoretical Syntax at the Crossroads: Big Data, Citizen Science and Crowdsourcing

José-Luis Mendívil-Giró

Department of General and Hispanic Linguistics, University of Zaragoza, Zaragoza, Spain

Yang's (2016) Tolerance Principle describes with remarkable precision how many exceptions the mechanisms of child language acquisition can tolerate to induce a productive rule, and is a notable advance in the long-standing controversy as to the amount of data necessary for the acquisition of language. The present contribution addresses a different but related issue, that of the amount of data on variation in languages needed by a linguist to develop a theory of language. Using as a model the perennial question of how many languages should be considered to formulate a general theory of language, I will show that discussions about the type and amount of data needed for linguistic theorizing cannot be fruitful without taking into account the type of linguistic theory and its goals. Moreover, the type of linguistic theory itself depends on the way in which the object of study is conceived. I propose that the two main types of current linguistic theory (functionalism and formalism) correlate broadly with different scientific methods: the inductive one (which proceeds from languages to language) and the deductive one (which proceeds from language to languages), respectively. My aim is to show that the type of data that can falsify a certain linguistic theory is different depending on whether the theory is deductive or inductive. That is, the two types of theory have a different “tolerance threshold” regarding the sparseness of data. Hence, the expectation of progress that new sources of data on language variation can provide for linguistic theory should be modulated according to the objectives and assumptions of each language theory.

Introduction: Two Ways of Relating Data and Theory

It is impossible to predict whether the rapid development of new sources of data on linguistic variation, as a result of the expanding breadth and scope of information technologies, will have a comparably large impact on linguistic theory (and especially on syntactic theory). In principle, it seems safe to say that obtaining new evidence can only be beneficial for any science. I do not intend to question this general statement in any way, yet I would like to qualify it in the context of contemporary linguistic theory. My goal is to show that the degree of the impact of these new sources of information will be different depending on the type of linguistic theory involved: the impact can be notable for linguistic traditions based on the inductive method, but will surely have a more modest effect (although not necessarily an irrelevant one) for traditions that adopt a deductive methodology.

The remainder of this section deals with the specific senses in which I use the expressions inductive model and deductive model and how both models are related in general to the data that they use. In the following section, The Nature of the Object of Study and the Methodology of Linguistic Theory, I consider how these models are instantiated in contemporary theories of linguistic diversity. The How Many Languages do We Need to Formulate a Theory of the Faculty of Language? section discusses ways in which the two theoretical models diverge markedly in their conceptions of the object of study, using an analogy with research on the role of environmental stimuli in language acquisition to justify my claim that the two models have different “tolerance thresholds” regarding data on linguistic variation.

The relationship between data and theory in science is different depending on whether an inductive or a deductive methodology is employed. Following Dougherty's (1994) characterization, we can say that in an inductive model there exists a certain set of procedures and operations with which the scientist uses the data to develop a theory to adequately describe the phenomena under investigation. This theory is derived from the data by inductive processes. If the methodology is followed correctly, the scientist will arrive at an empirically motivated theory to describe the phenomena under consideration. So, in this model “the empirical motivation for accepting (or rejecting) a theory stems from the data which give rise to the theory, i.e., the data which played a role in its discovery. In this view, the discovery of a theory and the justification of a theory are a single process; discovery and justification cannot be distinguished” (Dougherty, 1994: 331). On the contrary, in the deductive model there does not exist a set of procedures and operations with which the scientist works on the data to discover a theory. Rather, in this model the theory is a product of human creativity. A theory is a conjecture advanced as a possible explanation of the phenomena under investigation. According to Dougherty, “the means by which a theory is arrived at are irrelevant in determining its empirical adequacy. The theory derives its total empirical motivation from the comparison of the consequences deduced from the theory with observable experimental phenomena. In this view, the discovery of a theory and the justification of a theory are two different processes” (Dougherty, 1994: 331). The history of modern science is a clear illustration of the primacy, in the realm of the natural sciences, of the deductive method, generally known as the hypothetico-deductive model (Hempel, 1966 remains an excellent exposition of this model).

As we know, Chomsky's naturalistic conception of language implies the adoption of the hypothetico-deductive method for linguistic theory. And it can also be argued (as discussed in the next section) that a good part of the criticisms of Chomskyan linguistic theory, both past and present, are based on the conviction that the only way to construct a theory of language is through an inductive model. In this sense, Cohen (1955) explained the differences between Einstein and the physicist Ernst Mach (defender of the inductive model):

“Einstein said he had always believed that the invention of scientific concepts and the building of theories upon them was one of the great creative properties of the human mind. His own view was thus opposed to Mach's, because Mach assumed that the laws of science were only an economical way of describing a large collection of facts” (Cohen, 1955: 73).

In fact, many opponents of the Chomskyan conception of linguistic theory conceive of the study of language as the economical systematization of a large collection of linguistic facts. But in contemporary natural science, scientific theories are not inductive generalizations from the data (although these are necessary), but are theoretical constructs (formulated in terms of hypotheses) that must predict the data. The increase in the quantity and/or quality of data does not necessarily imply a radical change in the theory, but may offer a greater opportunity for its empirical falsification. Therefore, any improvement in the quantity and quality of observational data will necessarily imply an improvement in any type of theory, but it is evident that it will have a far greater impact on an inductive than a deductive theory.

Of course, a hypothetico-deductive theory is not immune to data (in such a case it would be unfalsifiable and therefore unscientific). When I affirm that a deductive theory has a higher threshold of tolerance to the scarcity of data, I mean that it does not need the same amount of data for the formulation of a hypothesis as the inductive method does. If we consider the specific issue that concerns us here, that is, the possibility of collecting and manipulating enormous amounts of data on linguistic variation, the impact on a deductive language theory will again be lower, in this case due to the very nature of the object of study of the deductive theory of language: the faculty of language, and not the languages generated by it.

The Nature of the Object of Study and the Methodology of Linguistic Theory

The object of study of Chomskyan linguistic theory is not human languages, but the faculty of language (FL). Of course, no one speaks FL: people either speak a specific language or they do not speak at all. FL determines part of the structure of languages, and therefore languages must be studied in order to discover the structure and properties of FL, but languages are not the ultimate object of study. Chomsky expressed this very clearly:

“Thus, what we call ‘English,' ‘French,' ‘Spanish,' and so on, even under idealizations to idiolects in homogeneous speech communities, reflect the Norman Conquest, proximity to Germanic areas, a Basque substratum, and other factors that cannot seriously be regarded as properties of the language faculty. Pursuing the obvious reasoning, it is hard to imagine that the properties of the language faculty—a real object of the natural world—are instantiated in any observed system. Similar assumptions are taken for granted in the study of organisms generally” (Chomsky, 1995: 11, fn. 6).

And for this reason, the inductive model is simply insufficient to discover the truth about FL. There is, of course, an inductive phase in all hypothetico-deductive theories, and therefore, to a large extent, inductive linguistic theories (such as those developed by Greenberg and many others) are useful for deductive linguistic theory, but these two types of theories do not have the same goals, nor the same objects of study (see Mendívil-Giró, 2012 for a discussion here).

The differences that these two main traditions show in the way they approach the issue of the diversity of languages are not ultimately based on different conceptions of science; rather, the different conceptions of science are inspired by different conceptions of the object of study. From a generativist point of view, language is conceived of as a natural phenomenon, and languages are understood as particular environmentally conditioned (and historically modified) manifestations of that phenomenon. That is, we proceed deductively from language to languages. One of the clearest examples of this procedure is parametric theory (Chomsky, 1981; Baker, 2001). Regardless of the specific formulations that it might take (see Gallego, 2011), the basic logic of parametric theory remains strong: from common design principles, the various emerging systems respond to variations in development processes that have systematic implications, just as happens in the development of natural organisms.

In contrast, from a functionalist point of view, we proceed inductively from languages to language. This model implies that languages exist in themselves and that language is a secondary concept induced from the descriptive generalizations obtained from the study of languages. Echoing Bloomfield's (1933: 20) assertion (“the only useful generalizations about language are inductive generalizations”), Dixon considers it to be an error to think that “linguistic theorizing should be largely deductive,” arguing that “the most profitable theoretical work is inductive” (Dixon, 1997: 137). Indeed, there is not much difference here with what Bloomfield observed more than 60 years earlier (and that Dixon quotes): “when we have adequate data about many languages, we shall have to return to the problem of general grammar and to explain these similarities and divergences, but this study, when it comes, will not be speculative but inductive” (Bloomfield, 1933: 20).

Authors in the broad area of functionalism (and also so-called cognitive linguistics) favor an inductive model of linguistic theory for one clear reason: they do not consider FL to be a legitimate object of study. In general, such authors conceive of languages as cultural objects or institutions that are not the instantiation of a biologically determined FL, but are objects that must be studied in themselves and for themselves. As Evans and Levinson (2010) recommend, “first analyze a language in its own terms, then compare” (Evans and Levinson, 2010: 2734). This externalist view explains the adoption of the inductive model when it comes to relating the theory of language to the study of languages; it also explains one of the most frequent criticisms of Chomsky's hypothetico-deductive model, that of starting from a reduced sample of data to formulate a theory of language: “We have no quarrel with abstract analyses per se, but we would like to see these arise inductively, and not be derived deductively from a model based on English and familiar languages.” (Evans and Levinson, 2010: 2754).

It is therefore legitimate to ask (especially in the context of the present Research Topic) how much data is required to develop a theory of language. To my knowledge, this issue has not been discussed widely in the history of our discipline, yet there is a long tradition of discussing an analogous question (that is, a qualitatively similar one), as expressed in the title of the following section.

How Many Languages do We Need to Formulate a Theory of the Faculty of Language?

In strictly logical terms, this question has only two answers: (i) a sufficient number of languages or (ii) all languages (the possible answer “none” is not acceptable, since we would no longer be in the field of empirical science). And, again in strictly logical terms, the deductive model would have to choose (i) as a response, and the inductive model should choose (ii). However, it is clear that answer (ii) is unworkable, since studying all languages is not possible: thousands (perhaps tens of thousands) of languages have become extinct without a trace, and many of those that remain are undocumented. As Dixon points out, “there are 2,000 or 3,000 languages, for which we have no decent description” (1997: 138). Therefore, the truly relevant question is what is meant by “a sufficient number” for each of the models. Given the impossibility of option (ii), the inductive approach has developed protocols to determine representative samples, such as in the case of typological studies (usually in the direction of maximizing both genealogical and areal diversity). But we should not ignore the fact that any selection will be arbitrary and incomplete (and, therefore, potentially destructive to the inductive model). From the logic of the deductive method it can be stated that, if it is not possible to consider all languages, then it is not necessary to study more than one, so the answer to the question could be: the more the better, but at least one.

Perhaps this is the reason why Chomsky has argued that, theoretically, FL could be studied from a single language. The arguments offered by him and others here have to do, on the one hand, with practical aspects and, on the other, with theoretical aspects. On the practical side, it is argued that the first generativist studies were pioneers of this type of study, and that focusing on the in-depth analysis of one language to take the first steps toward understanding the problem was more profitable than a shallower analysis of a greater number of languages. See Rizzi (1994) and Newmeyer (1983: 332 ff.) for further developments of this line of argument. The theoretical arguments are more relevant and, also, more controversial:

“I have not hesitated to propose a general principle of linguistic structure on the basis of observation of a single language […] The inference is legitimate, assuming that humans are not specifically adapted to learn one rather than another human language […] Assuming that the genetically determined language faculty is a common human possession, we may conclude that a principle of language is universal if we are led to postulate it as a ‘precondition’ for the acquisition of a single language. To test such a conclusion, we will naturally want to investigate other languages in comparable detail. We may find that our inference is refuted by such investigation” (Chomsky, 1979: 48).

Note that although Chomsky admits that it will be necessary to investigate other languages (in comparable detail), to confirm or falsify hypotheses, in fact (and again speaking theoretically) this would not be necessary if we were able to distinguish in the study of a specific language those of its elements which derive from the environment and those which emerge from the organism itself (and which are, therefore, “a ‘precondition’ for the acquisition”). But, of course, we have no way of doing this directly, and hence, for such an objective the consideration of language diversity is essential as a means of refining the theory. Verifying the formal properties in which languages (or dialects) differ has a very direct bearing on what aspects of language are not fixed by nature.

In any case, there is one important point to note here: whereas it is clear that the consideration of linguistic typology (and of linguistic diversity in general) is crucial for the development of a theory of FL, this does not imply that we should accept, as functionalists do, that the theory of language must be inductive.

Actually, Chomsky (1985: 40 ff.) has argued, from a deductive point of view, that the study of one language can provide crucial data concerning the structure of another, that is, if we continue to accept the plausible assumption that the ability to acquire language is common to the species. Thus, according to Chomsky, a study of English is a study of the realization of the initial state S₀ under particular conditions. The study therefore involves assumptions regarding S₀ that must be made explicit. But if S₀ is constant, Japanese, for example, is an instantiation of the same initial state under different conditions. Research on Japanese can show that the assumptions about S₀ derived from the study of English were incorrect, as these assumptions may give conflicting results for Japanese. Therefore, after correcting these assumptions, we may be forced to modify the grammar postulated for English. Since the consideration of Japanese data is relevant in terms of the adequacy of a theory of S₀, it can have an indirect weight on the choice of grammar adopted in an attempt to characterize English; indeed this is a common practice in the tradition of generative grammar. Thus, Rizzi (1994: 404) analyses Chomsky's consideration of English to establish the so-called “finite sentence condition.” This restriction stipulates that finite sentences have certain island properties that would explain, for example, that in (1) the anaphoric expression in the subject position of the embedded finite sentence cannot have a noun phrase outside of the sentence as an antecedent, but that this does happen in the non-finite sentence in (2):

(1) *Mary saw [that herself won the prize] on TV

(2) Mary saw [herself win the prize] on TV

Rizzi observes that the same happens in Italian and other languages. However, he also points out that later work on Portuguese and Turkish led to the refinement of this condition, which does not seem unique to finiteness, but depends on whether there is agreement between subject and verb. Given that in Portuguese and Turkish, tense and agreement do not necessarily coincide, it can be observed that, for example in Portuguese, there are infinitives that agree with the subject and behave like finite sentences in English such as (1). Rizzi concludes: “The correct generalization could not have been determined on the basis of data from English alone, since in this language it is obscured by the essential overlap of the two notions of finiteness and agreement” (Rizzi, 1994: 404).

Even so, this does not imply that a theory of language must necessarily be inductive, as Comrie (1981) suggests. For Comrie the idea that the study of a single language can serve to discover universal properties of language is unacceptable, and he defends Greenberg's position that in order to establish something as universal in language it would be necessary to consider a wide variety of languages. Comrie recognizes the coherence of Chomsky's position, and makes a useful comparison with other sciences:

“[I]f one wanted to study the chemical properties of iron, then presumably one would concentrate on analyzing a single sample of iron, rather than on analyzing vast numbers of pieces of iron, still less attempting to obtain a representative sample of the world's iron. This simply reflects our knowledge (based, presumably, on experience) that all instances of a given substance are homogeneous with respect to their chemical properties” (1981: 6).

According to Comrie, this assumption of uniformity cannot be applied in the study of linguistic universals. He rejects the comparison with iron as being inadequate, and proposes another one, which is very symptomatic:

“On the other hand, if one wanted to study human behavior under stress, then presumably one would not concentrate on analyzing the behavior of just a single individual, since we know from experience that different people behave differently under similar conditions of stress, i.e., if one wanted to make generalizations about over-all tendencies in human behavior under stress it would be necessary to work with a representative sample of individuals” (Comrie, 1981: 6).

Which example best fits linguistic theory, that of the study of iron or that of the study of human behavior under stress? It seems that the choice depends on the way in which the object of study is conceived: the faculty of language (“a real object of the natural world”) or languages themselves (its manifestations). As Newmeyer suggests, taking the example of stress, “Comrie has unwittingly suggested an even more appropriate analogy: generativists study the neurophysiology of stress, typologists its behavioral manifestations” (Newmeyer, 1983: 337). In fact, as I have already pointed out, they are not incompatible conceptions, but complementary ones.

Comrie justifies his preference by assuming that if what “we want to find out in work on language universals is the range of variation found across languages and the limits placed on this variation, it would be a serious methodological error to build into our research programme aprioristic assumptions about the range of variation” (Comrie, 1981: 6). Yet we must note here that the goal of Chomskyan linguistic theory is not to discover the range of variation found across languages and the limits placed on this variation, since it does not constitute an inductive approach. As Chomsky himself pointed out, any theory of Universal Grammar (UG) must meet two conditions:

“On the one hand it must be compatible with the diversity of existing (indeed, possible) grammars. At the same time, UG must be sufficiently constrained and restrictive in the options it permits so as to account for the fact that each of these grammars develops in the mind on the basis of quite limited evidence” (Chomsky, 1981: 3).

The assumption of uniformity is not therefore a methodological error, but one of the factors that must restrict the form of a hypothetico-deductive theory of language. It is important to recall that another Chomskyan idealization that has been misunderstood, and that is directly related to the problem at hand, is the notion of the ideal speaker-hearer as an object of study:

“Linguistic theory is concerned primarily with an ideal speaker-listener, in a completely homogeneous speech-community […]. This seems to me to have been the position of the founders of modern general linguistics, and no cogent reason for modifying it has been offered. To study actual linguistic performance, we must consider the interaction of a variety of factors, of which the underlying competence of the speaker-hearer is only one. In this respect, study of language is no different from empirical investigation of other complex phenomena” (Chomsky, 1965: 3–4).

The rejection of this idealization (see, for example, Botha, 1989) is once again based on not distinguishing between the two models described. From the deductive point of view, individual variation is irrelevant and a homogeneous speaking community (which obviously does not exist) is assumed, precisely because what one wants to discover does not depend on individual linguistic performance. As Chomsky (1980) pointed out in his own defense, the idealization would only be inadequate if it were shown that people cannot acquire their language in a homogeneous linguistic community or that linguistic variation is an essential key to the process of language acquisition. Neither of these ideas seems to make sense. Note that the Saussurean concept of langue, defined as “la somme des images verbales emmagasinées chez tous les individus” (de Saussure, 1916: 30), also implies an idealization that eliminates individual differences in a linguistic community. As we have seen, from the deductive point of view the difference between two dialects of a language and between two different languages is in fact a matter of degree, not of kind. And for that reason I believe that this old controversy is relevant to the subject that concerns us here.

Comrie explicitly assumes in the excerpt quoted above that the goal of typologists is to discover “the range of variation found across languages and the limits placed on this variation,” that is, an inductive study from a set of facts; meanwhile, Chomsky's stance implies that variation is only of interest as a source of empirical testing for the theory of FL, in a deductive sense.

Conclusions

An inductive theory is essentially determined by the data from which it is obtained. The more detailed the description of linguistic variation, the more complex the theory becomes. A deductive theory, rather, is by definition less dependent on the data, although obviously it must have empirical support. As a consequence, inductive models tend to emphasize diversity to the detriment of language uniformity, as the programmatic article by Evans and Levinson (2009) explicitly shows. In contrast, deductive models, such as generative grammar, tend to consider linguistic diversity as superficial and largely confined to the components of language externalization (see, for example, Berwick and Chomsky, 2011).

Undoubtedly, a greater knowledge of the range of (intra-linguistic and inter-linguistic) variation provided by new technologies may provide greater opportunities for the formulation of specific hypotheses within the theory of language and, especially, a greater empirical basis for its falsification (see Garzonio and Poletto, 2018 for a suggestive approach). But the advent of Big Data is unlikely to involve a revolution in syntactic theory analogous to the one witnessed in the application of the hypothetico-deductive model of the natural sciences to language.

It may be appropriate to recall that the development of generative grammar introduced a new perspective in the study of language. Since Chomsky's first contributions (e.g., Chomsky, 1957) a grammar is no longer understood as a more or less systematic description of a language, but is a theory of a language, and, as such, that theory is subject in its construction and evaluation to the same restrictions and principles as any other scientific theory. What Chomsky pushed was, therefore, a radical change of perspective in linguistics (and in cognitive science in general): from the study of behavior and behavioral outcomes, to the study of mental systems of computation and representation. And, as one reviewer suggests, it is interesting to highlight the relationship between the controversy I have reviewed within linguistic theory and the recent replication crisis within psychology and cognitive science (see Open Science Collaboration, 2015). Thus, Smith and Little (2018) argue that the strategy against the replication crisis should not necessarily be to increase the size of samples (for example with much larger samples of participants), but to favor studies with smaller and qualitatively significant samples:

“We argue that some of the most robust, valuable, and enduring findings in psychology were obtained, not using statistical inference on large samples, but using small-N designs in which a large number of observations are made on a relatively small number of experimental participants” (Smith and Little, 2018: 2084).

Yang's (2016) equation shows that the mechanisms of child language acquisition seem to be designed to optimize learning in a context of limited exposure to data, since the smaller the amount of data in the learner's linguistic experience, the greater the tolerance of exceptions for the induction of productive rules. On the other hand, in a curiously analogous way, a deductive syntactic theory has a greater ability to overcome the data on linguistic variation in looking for the invariant principles of the human faculty of language, its primary object of study.
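
The equation itself is not reproduced in this article. As it is usually stated, the Tolerance Principle says that a rule applicable to N items remains productive as long as its exceptions do not exceed N / ln N. The following Python sketch is only an illustration under that assumption; the function names are hypothetical and are not taken from Yang (2016). It shows why the tolerated proportion of exceptions, 1 / ln N, shrinks as the number of items grows.

```python
import math


def tolerance_threshold(n: int) -> float:
    """Largest number of exceptions tolerated by a rule over n items.

    Assumes the usual statement of the Tolerance Principle: N / ln N.
    """
    if n < 2:
        raise ValueError("n must be at least 2, since ln(1) = 0")
    return n / math.log(n)


def is_productive(n: int, exceptions: int) -> bool:
    """True if a rule over n items survives the given number of exceptions."""
    return exceptions <= tolerance_threshold(n)


def tolerated_proportion(n: int) -> float:
    """Share of the n items that may be exceptions: (N / ln N) / N = 1 / ln N."""
    return 1 / math.log(n)


if __name__ == "__main__":
    # Smaller vocabularies tolerate a larger share of exceptions.
    for n in (10, 100, 1000):
        print(n, round(tolerance_threshold(n), 1), round(tolerated_proportion(n), 2))
```

On this formulation a learner with 10 relevant items can tolerate about 43% exceptions, whereas one with 1,000 items can tolerate only about 14%, which is the sense in which less data goes together with greater tolerance.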

Author Contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Funding

The present research was funded by the Spanish AEI and FEDER (EU) under grant FFI2017-82460-P.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Baker, M. C. (2001). The Atoms of Language. New York, NY: Basic Books.