# THE EVOLUTION OF MUSIC

EDITED BY: Leonid Perlovsky and Aleksey Nikolsky

PUBLISHED IN: Frontiers in Psychology, Frontiers in Neuroscience and Frontiers in Sociology

#### Frontiers eBook Copyright Statement

The copyright in the text of individual articles in this eBook is the property of their respective authors or their respective institutions or funders. The copyright in graphics and images within each article may be subject to copyright of other parties. In both cases this is subject to a license granted to Frontiers. The compilation of articles constituting this eBook is the property of Frontiers.

Each article within this eBook, and the eBook itself, are published under the most recent version of the Creative Commons CC-BY licence. The version current at the date of publication of this eBook is CC-BY 4.0. If the CC-BY licence is updated, the licence granted by Frontiers is automatically updated to the new version.

When exercising any right under the CC-BY licence, Frontiers must be attributed as the original publisher of the article or eBook, as applicable.

Authors have the responsibility of ensuring that any graphics or other materials which are the property of others may be included in the CC-BY licence, but this should be checked before relying on the CC-BY licence to reproduce those materials. Any copyright notices relating to those materials must be complied with.

Copyright and source acknowledgement notices may not be removed and must be displayed in any copy, derivative work or partial copy which includes the elements in question.

All copyright, and all rights therein, are protected by national and international copyright laws. The above represents a summary only. For further information please read Frontiers' Conditions for Website Use and Copyright Statement, and the applicable CC-BY licence.

ISSN 1664-8714 ISBN 978-2-88966-286-9 DOI 10.3389/978-2-88966-286-9

Topic Editors: Leonid Perlovsky, Northeastern University, United States; Aleksey Nikolsky, Braavo! Enterprises, United States

Citation: Perlovsky, L., Nikolsky, A., eds. (2020). The Evolution of Music. Lausanne: Frontiers Media SA. doi: 10.3389/978-2-88966-286-9

# Table of Contents


Peter E. Keller, Rasmus König and Giacomo Novembre


Psyche Loui, Sean Patterson, Matthew E. Sachs, Yvonne Leung, Tima Zeng and Emily Przysinda


Andrea Ravignani, Bill Thompson and Piera Filippi


Massimo Lumaca, Andrea Ravignani and Giosuè Baggio


Henkjan Honing, Fleur L. Bouwer, Luis Prado and Hugo Merchant


Aleksey Nikolsky, Eduard Alekseyev, Ivan Alekseev and Varvara Dyakonova

*The Pastoral Origin of Semiotically Functional Tonal Organization of Music*

Aleksey Nikolsky

# Editorial: The Evolution of Music

#### Aleksey Nikolsky <sup>1</sup> \* and Leonid Perlovsky <sup>2</sup>

*<sup>1</sup> Braavo Enterprises, Los Angeles, CA, United States, <sup>2</sup> Department of Psychology, Northeastern University, Boston, MA, United States*

Keywords: interdisciplinary approach, scientific/humanitarian methods, ethnomusicology, empirical musicology, musilanguage, musicological analysis, organology, geomusicology

#### **Editorial on the Research Topic**

#### **The Evolution of Music**

Two decades ago, Wallin et al. (1999) opened a new page in the quest for the origins of music by integrating the methodologies and data coming from numerous disciplines:


• organology

• historical linguistics

• musicological analysis

• geomusicology

• demography

• information theory

• statistical modeling

• developmental psychology

This volume updates this multi-disciplinary approach by further advancing the fields defined by Wallin et al. and by introducing and specifying new fields of inquiry:

Edited and reviewed by: *Bernhard Hommel, Leiden University, Netherlands*

\*Correspondence: *Aleksey Nikolsky aleksey@braavo.org*

Specialty section: *This article was submitted to Cognition, a section of the journal Frontiers in Psychology*

Received: *16 August 2020* Accepted: *18 September 2020* Published: *28 October 2020*


In addition, this volume addresses a number of interdisciplinary problems that were identified by the contributing authors<sup>1</sup>. The latter issue has recently become critical: the very idea of multidisciplinary study of music has been questioned. There is a growing conviction amongst Western scholars of ethnomusicological background that humanities and sciences are fundamentally split, and the scientific approach somehow introduces an "anti-humanitarian" bias (Parncutt, 2017). According to this view, specialists in sciences should adjust their methodologies to comply with the conventions of political correctness currently adopted by many Western specialists in musicology and social sciences. This view exploits the argument that the public trust in science supports the "scientific hegemony" in an ongoing cultural "warfare" between disciplines of art and science, thereby precluding the advance in human knowledge (Cohen, 2001). This argument was introduced half a century ago in a popular book by Snow (1964). Despite being convincingly debunked by Wilson (1998) and Gould (2003), it resurfaced in the ideology of "new mysterianism," propagated by Chomsky (2016) and McGinn (2015). Their intellectual weight has made the call for "humanizing" science more appealing to scientists.

Citation: *Nikolsky A and Perlovsky L (2020) Editorial: The Evolution of Music. Front. Psychol. 11:595517. doi: 10.3389/fpsyg.2020.595517*

<sup>1</sup>This includes not only those papers that were submitted and published as part of our Research Topic, but also the ones that were not published (for various reasons).

Mysterianist and anti-scientific sentiments found fertile ground amongst many musicians and musicologists who regard a scientific approach as being "bound to the science lab" and therefore irrelevant to music practice, even detrimental to the expressive efficacy of music-makers (Woody, 2004). Indeed, differences in jargon between musicological<sup>2</sup> and psychoacoustic disciplines set a barrier against co-understanding, which preserves the popular myths about perception/production of music (Juslin et al., 2012). Many believe that music is incommensurable, irrational, and mystic: unsuitable for objective investigation (Dubal, 1985) and demanding an intuitive approach (Woody, 2000).

A big role in the spread of this attitude has been played by the ongoing trend of "scientism" in influential Western schools of composition, and the critical acclaim they have been receiving from well-established music critics and musicologists (Regelski, 2014). Overwhelmingly, the musical works composed after WWII and esteemed by academia bear a strong flavor of scholasticism which resembles that of religious scholarly traditions of preindustrial cultures. The New Music of the West has deliberately exploited the "scientific" image in employing an invention-like approach, equating the method of strictly following a set of newly invented abstract compositional principles to mathematics (Babbitt, 2003). Such music usually violates the psycho-physiological restraints (Thomson, 2010), contributing to the public impression of its fundamental indigestibility (Lerdahl, 1992).

However, in toto, the situation is far from a global scientific/humanitarian schism: science does not defy humanities; it only corrects erroneous traditional beliefs. The only humanitarian discipline that consistently defies scientific methodology is Western ethnomusicology. In the past 40 years it has adopted the view that there is no such thing as "music" but a myriad of "musics" (Becker, 1986), each requiring its unique frame of investigation (Nettl, 2010). The very application of a scientific method is viewed here as exercising Eurocentric political power over non-European cultures (Messner, 1993), seen even in merely calling a non-European artifact "music" (Bohlman, 1999). This highly politicized philosophy resulted in abandoning comparative studies of music systems (Savage and Brown, 2013), of the origins and evolution of music (Nettl, 2005), and of music analysis (Nattiez, 2012), especially pronounced in the US and UK ethnomusicological schools (Zemtsovsky, 2002).

The general shift of Western ethnomusicology away from comparative musicology to a fractured sociomusicology of isolated musical communities was inspired by concerns for compensating for the Eurocentric bias in ethnographic research prior to the mid-twentieth century. Late-twentieth-century ethnomusicologists avoid direct comparisons between different music cultures altogether, especially those involved in establishing cultural evolution (Nettl, 2010, pp. 70–92). This revisionism relies on a systemic over-evaluation of the scope and limitations of a purely emic approach to the study of music, in combination with a drastic under-evaluation of the advantages of the etic approach (see Nikolsky et al.).

The special issue of "The World of Music," the journal of the International Music Council (Vol. 22, No. 3, 1980), was dedicated to the discussion of the possibility of drawing a general history of music, exploring its origins, and formulating generalities in the evolution of different musical cultures. Its overwhelmingly negative conclusions reflected the emerging trend of replacing the comparative studies that had been conducted within the field of systematic musicology by the disconnected studies of each musical culture, presumably entitled to its own unique "history" (Myers, 1993). Accordingly, the official name of the discipline was changed from "systematic musicology" to "ethnomusicology." Noteworthy is the radical inversion of the views on comparative studies by the "old-timers," such as Bruno Nettl, from positive in 1968 (Nettl and Blum, 1968) to negative in 2005 (Nettl, 2005).

Responding to Gourlay's call for ethnomusicology to abandon the "pretense of objectivity" in favor of "humanization" (Gourlay, 1982), Western ethnomusicologists started viewing their mission as "the study of people making music" rather than "the study of music" as the term "musicology" suggests (Titon, 2015). They substituted the study of text (which has traditionally been considered the "primary reality" for studying the arts) with the study of people's behaviors, causing a methodological shift away from the analysis of music to sociology (Zemtsovsky, 1997), exemplified by the following quotation: "Unless the formal analysis begins as an analysis of the social situation that generates the music, it is meaningless" (Blacking, 1974, p. 71). Analysis per se has acquired the reputation of a form of "composing" the analyzed piece of music on the part of an analyst, thereby distorting the original meaning of that music (Agawu, 2004). In effect, such a position deprived Western ethnomusicology of an empirical and objective foundation.

This trajectory is diametrically opposed to the trend of equipping musicology with means of scientific research, prevalent in the former Soviet Bloc countries (Myers, 1993)<sup>3</sup>. Integration of scientific and humanitarian knowledge has been a long tradition there since Lobachevsky<sup>4</sup> (Grigoryan, 2011), affecting scholarship of all orientations: materialistic (Aleksei Losev), neo-Christian (Pavel Florensky), and esoteric (Peter Ouspensky). Music-related studies were no exception: noteworthy were such figures as Losev<sup>5</sup>, Maykapar<sup>6</sup>, and Samoilov<sup>7</sup>. All major Russian researchers of psychoacoustics could interpret/perform music at a professional level and/or compose or arrange music: Sofia Beliayeva-Ekzempliarskaya, Nikolai Garbuzov, Boris Teplov, Aleksey Ogolevets, Alexander Volodin, Yevgenii Nazaikinsky, and Yuri Rags. In 1944, the Institute of History of Arts at the USSR Academy of Sciences was founded for scientific investigation of the arts (including music).

<sup>2</sup>This refers not only to the matters of music theory and the performance practices of Western forms of music, but to the undocumented implicit music theories transmitted orally by music-makers within non-Western traditional music cultures. Confused by scientific terminology, they too often see the scientific approach as essentially foreign to their music practice, despite their appreciation of scholarly attention to their respective traditions.

<sup>3</sup>Perhaps a good demonstration of the unity of scientific and humanitarian approaches to music, common in the Soviet and modern Russian academia, is the partnership between both authors of this article: A. Nikolsky started his career as a professional composer, music theorist, and pianist, whereas L. Perlovsky started his as a nuclear physicist. Somehow both authors met in their research on the origins and evolution of music.

In the West too, there are ongoing attempts to bridge humanities and science by making musicology "empirical" (Clifton, 1983; Deutsch, 1996; Gjerdingen, 1999; Clarke and Cook, 2004; Honing, 2006; Baily, 2009; Schneider and Ruschkowski, 2011; Kendall and Lipscomb, 2013).

After all, every musician routinely conducts informal experiments: performers evaluate alternative ways of rendering music, studio musicians experiment with various arrangements, and ear-trainers test students. Experimental trial of the premises of musical theory was discussed at the year-long seminar at Stanford, involving representatives of humanitarian and scientific disciplines (e.g., Lerdahl, Narmour, Gjerdingen, Bharucha, Palmer, and Krumhansl), with the outcomes published in a special issue (1996) of "Music Perception." Since then, this idea has attracted the attention of many scholars.

This is the direction we pursue in this collection of papers. It starts with Harvey overviewing the origins of music and presenting a theory of music that reflects a "society of selves" through social cooperation. Harvey attributes the invention of music to the promotion of group coherence and personal well-being.

Montagu informs readers without music education about the capacities of early humans, their possible musical behavior, and the overall evolution of musicality. Special attention is given to the development of musical instruments, which provides a window to the reconstruction of the musicking practices of the past.

Malloch and Trevarthen present an account of music in terms of human cognition and biology, with emphasis on musical education. They show how cultural life and learning depend on the motivation for sharing projects of thought and action, musically. Music empowers the transmission of the narratives of one's "inner life" in bodily movements. This ongoing practice must have transformed the primate brain for the affective regulation of social learning, thereby determining the evolution of human musical mind.

Brown updates his widely acknowledged musilanguage theory (Brown, 2000) by proposing the joint prosodic origin of protolanguage and proto-music, where both shared specialization in emotional communication and neither featured scaled pitches. Brown introduces a "prosodic scaffold" model, i.e., specific vocal articulations and accompanying mimics/gestures forming signs for "acoustic pantomimes," designed to express one's affective state. According to Brown, combinatorial and compositional mechanisms of utterances generated the affective prosody, designating characteristic patterns of global acoustic expression for common emotional states. This forged into national prosody that branched into proto-language and proto-music based on, respectively, dialogic and chorusing formats of communication distinguished by different approaches to timing. At this point, music acquired concise temporal organization and synchronicity of collective production/perception.

Nikolsky introduces the term "isophony" to refer to the tonal and rhythmic properties of the musilanguage system. This amendment, endorsed by Brown, corrects the mismatch between the psychoacoustic and musicological taxonomies of musical texture. Nikolsky formulates a set of clear structural distinctions between the most common textural types: heterophony, homophony, and polyphony—in comparison to "isophony."

van der Schyff and Schiavio overview the key positions in evolutionary musicology, identifying the problems of the nature-or-culture antithesis. They elaborate on the "biocultural" approach exemplified in the work by Tomlinson (2015), to resolve these problems. The authors examine a range of supporting evidence for this approach vs. the "embodied" approach that regards music as a bio-cultural process governed by interaction of the enacted and social aspects. The authors cross-relate the current developments in evolutionary musicology, such as "enactivism" and "4E cognition," and suggest how the biocultural and "enactivist" approaches can be improved.

Nikolsky brings together insights from semiotics, musicology, psychoacoustics, evolutionary biology, anthropology, ethology, linguistics, and geomusicology to coin a new line of inquiry revolving around the concept of expressive aspects of music in contradistinction to phonetics and prosody in natural languages. Central in this approach is the role of tonal organization and the comparative analysis of the acoustic features of indigenous musical traditions and animal communication calls. Nikolsky offers a new method of multifactorial modal analysis of tonal organization and its graphic representation ("musogram") and examines its pros and cons in light of the emic/etic antithesis. He argues that music evolved from animal communication through gradual substitution of "one-ended" communication with "two-ended," where each of multiple aspects of expression acquired a repertory of proprietary signs for effective communication of emotional information, thereby remapping the animal-like instinctive correspondences between acoustic traits and affective states. The complete "semiotization" of all principal aspects of expression must have occurred no earlier than during the Neolithic "revolution" within the framework of the emerging bi-specific communication between humans and domestic animals, exemplified in the surviving pastoral culture of kulning.

<sup>4</sup>Nikolai Lobachevsky outlined the principles of the ideal scholarship in his speech "On the most important subjects of education" (1828), which he delivered upon his election as the rector of the Kazan University.

<sup>5</sup>Losev's semiotic studies eventually became most influential in the Soviet academia, laying out an encyclopedic Aristotelean-like theoretic foundation for all the disciplines in the empirical sciences and the arts. Losev's books "Music as a Subject of Logic" (1927) and "Dialectics of Artistic Form" (1927) reflected his lectures at the Moscow Tchaikovsky Conservatory, where he taught aesthetics, and synthesized the neo-Platonic concept of music as a sounding number with Romantic ideas of symbolism in coining rational definitions of the notions of rhythm, melody, and harmony.

<sup>6</sup>Maykapar (1900) was a renowned composer, an outstanding concert pianist, and a distinguished pedagogue. He pioneered Helmholtz' approach in Russia to validate and reform the traditional way of teaching and making music, which included the psychoacoustic investigation of attention, melodic intonation, rhythm, timbre, tonality, and modality (Maykapar, 1900).

<sup>7</sup>Alexander Samoilov, assistant of Ivan Pavlov, and one of the developers of electrocardiography, was a virtuoso pianist and a musicologist who authored a number of publications on music theory (i.e., "Natural numbers in music," "Musical ethnographic museum instruments and their musical tuning," "Musical notation and its history"). He founded the Scientific-musicological Circle at Moscow University in 1902, which was later transferred to the Moscow Conservatory, where Sergei Taneyev and Samuel Maykapar subsequently took over his directorship. After moving to Kazan, Samoilov taught courses in math, physics, acoustics, music theory, and music history at Kazan University.

Jan proposes yet another method of musicological analysis for tracking the lineage of diachronic evolution of specific musical structures based on the memetic approach to cultural changes. Elaborating on the research by Savage (2016), Jan combines the quantitative corpus-analysis techniques, adapted from molecular biology, with a qualitative method of identifying perceptual-cognitive elements of music ("musemes") revealed by music's motivic organization. This novel humanitarian-scientific integration promises a compelling and potentially testable means for studying the cultural evolution of music.

Lumaca et al. define a new interdisciplinary field of research: the contribution of neural constraints and biases to the cultural evolution of musical structures through the chain of cultural transmissions. Capacities of the human brain constrain acquisition, production, and reproduction of music. To illustrate this, the authors demonstrate a progressive diatonization of tonal organization in the multi-generational signaling game settings, which suggests that the smaller the information-processing bottleneck in individuals, the larger the pressures to regularize the music material. This poses new intriguing questions, such as the role of neural variability in music diversity, which is of greatest value for folk forms of music that entirely rely on oral transmission.

Podlipniak explores Baldwinian evolutionary modeling in search of a compelling and testable theory for the phenomenon of human musicality, hypothesizing that it might constitute an adaptive phenomenon. In the Baldwin effect, animals learn new behaviors that allow them to survive and reproduce in the changing environment. Podlipniak argues that the increasing group size of ancestral hominins required new mechanisms of "social consolidation" in the form of "collective imitation." There is then a natural selection for the evolution of new circuitry for vocal learning to maintain social bonds, thereby reducing the cost of learning.

Nikolsky et al. describe and explain the "timbre-based music" as a special system of musicking, communication, psychological, and social usage, along with corresponding beliefs, placing it in the timeline of the evolution of music. Timbral music opposes conventional Western music by its personal orientation: musicking here occurs primarily for oneself and/or for close relatives/friends. Throughout northeastern Eurasia, "personal song" serves as an important means of individual identification and territorial marking, akin to a passport, supporting an individual's mental health under harsh environmental conditions. The authors use demographic, geomusicological, paleoenvironmental, organological, and paleophonological data to argue that Siberian timbre-oriented music is a remnant of the pan-Eurasian prehistoric music tradition that originated from the Last Glacial Period (LGP). Western frequency-oriented music, with its reliance on collective production/perception, might have been exported from Africa (where the population was much denser) before the LGP.

Ravignani et al. draw a parallel between researching the evolutions of music and language. They call for a systemic revaluation of empirical and scientific approaches as opposed to claims from Chomskyan and anthropological perspectives against scientific study. Given the intellectual influence of Chomsky, this point is extremely important. It would benefit musicologists to follow the lead of phonologists in establishing the evolutionary chain of developments. Realigning musicological and linguistic methodologies can allow developmental psychologists and ethologists to make better choices between musical and linguistic paradigms in their research.

Fenk-Oczlon reports an intriguing phenomenon: the number of vowels in a native language and the number of pitch-classes in the corresponding native music tend to match. This is most pronounced in cases of simple systems (tritonic musical vs. 3-vowel vocal systems) but also noticeable in very complex 12-element systems. The mean values also match at 5–7 elements. Such correspondence supports Brown's model by revealing the shared ground between musical and vowel pitches.

Ravignani and Madison analyze the phenomenon of isochrony across human music and speech against animal communication, integrating the data from mathematics, physics, signal processing, physiology, and neuroscience. They define the concept of isochrony and propose an evolutionary hypothesis to explain why amongst all animals it is only humans that possess superb isochronous perception which does not confer evolutionary advantage to modern humans.

Honing et al. use an EEG oddball paradigm to assess the neural sensitivity to isochronous or arrhythmic beats in two monkeys. This non-invasive EEG methodology enables a direct comparison of the perception of monkeys, non-human primates, and humans. The authors found the MMN responses to the isochronous pattern but no strong evidence for beat sensitivity, confirming the Gradual Audiomotor Evolution model (Merchant and Honing, 2014) that holds metric organization as a biological marker of human music.

Loui et al. raise an intriguing question of musical anhedonia through cross-examination of its rare case against a panel of neurotypical participants. Their findings demonstrate the categorically different decreased connectivity between auditory and reward systems, supporting the Mixed Origins of Music model (Altenmüller et al., 2013). The authors identify neural pathways engaged in music's operation as an affective signaling system.

Based on his extensive teaching experience and research, Crickmore asserts his earlier experimental study and the test settings for measuring "aesthetic emotions"—listeners' response to detected "musical emotions" expressed by music creators. His revised test paves the road for clarifying the relations between human emotions, genres, and personality.

Masataka demonstrates that young people with autism spectrum disorder display more interest in dissonant music than a typically developed matched group. This indicates that neural diversity within the autism spectrum might have played a role in the evolution of dissonant sad music, which is important for overcoming negative aesthetic emotions of cognitive dissonances (Masataka and Perlovsky, 2013).

Trulla et al. explain consonance of musical intervals based on "second order beats,"<sup>8</sup> described by an approach borrowed from dynamical systems analysis—a quantitative index obtained from Recurrence Quantification Analysis. The novelty of this method is that it accounts for frequency ratio relationships plus temporal behavior. The authors confirm that musical consonance/dissonance has a mathematical foundation and that music perception, in general, and harmonic intervals, in particular, are a consequence of the entrainment of the nervous system with sound excitation.

Keller et al. provide a piece of evidence to support the Darwinian hypothesis of music's origin in sexual selection. In their study, a professional boys' choir was found to exhibit musical behavior essentially similar to male chorusing in many animal species. In female presence, bass singers instinctively emphasized and raised their singing formant, despite the conventional non-acceptance of such a technique in choir singing. This alteration, however, resulted in a more expressive performance. The authors explain this by covert competition between sexually mature males for female attention, which inadvertently maximizes the choir's collective output.

We hope that this volume will bring us closer to answering Darwin's quest (Darwin, 1890) to disclose the mystery of music's origin.

#### AUTHOR CONTRIBUTIONS

AN wrote the first draft of the manuscript. LP provided critical revision of the manuscript and important intellectual contributions. All authors contributed to the article and approved the submitted version.

#### ACKNOWLEDGMENTS

The authors would like to thank all the contributors of this Research Topic for their work and the high quality of the submitted articles.


<sup>8</sup> Second order beats are perceptible when two pure tones slightly deviate from simple, small-integer ratio relationships, e.g., 2:1 (octave), 3:2 (perfect 5th), and 4:3 (perfect 4th). Deviation from simple mathematical relationships results in some degree of chaos, perceived in dissonant intervals, e.g., tritones, minor 7ths.


**Conflict of Interest:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Nikolsky and Perlovsky. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Measurement of Aesthetic Emotion in Music

Leon Crickmore\*

*Retired, London, UK*

Keywords: enjoyment of music, growth in appreciation, evolution of music, factor analysis, aesthetic emotion, dynamic gestalt

In an earlier Opinion Article, Perlovsky challenged cognitive scientists to measure aesthetic emotions (Perlovsky, 2014). The present author claims to have done exactly that almost 50 years ago (Crickmore, 1968, 1973). Details of supportive factor analyses (Thomson, 1951) may be viewed in the attached Supplementary Material.

## A SYNDROME TEST OF MUSIC APPRECIATION

My syndrome hypothesis serves both as a functional definition of music appreciation, and as an operational definition of a dynamic gestalt. The hypothesis states that when a musical composition has been assimilated aesthetically, it will leave the listener feeling interested, happier, more relaxed, with a desire to remain quiet, satisfied and without any particular mental pictures. To cite a typical example of such a response from a student:

Record 5 (Whistlin' Rufus: Chris Barber's Jazz Band) L7<sup>1</sup>, I+, M+, T−, V−, S+, P0 "Mind completely at rest and relaxed. Very enjoyable piece of music."
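As an illustration only, the profile notation in this record can be checked mechanically. The sketch below is not part of the original test materials. It assumes that the scale codes quoted for Record 5 (I+, M+, T−, V−, S+, P0) represent the full syndrome. The function names `parse_profile`, `matches_syndrome`, and `deviations` are hypothetical, and the discarded Liking scale (L) is ignored.

```python
import re

# Target profile copied from the Record 5 example above. Treating it as the
# "full syndrome" is an assumption made for this illustration.
FULL_SYNDROME = {"I": "+", "M": "+", "T": "-", "V": "-", "S": "+", "P": "0"}


def parse_profile(record: str) -> dict:
    """Extract scale responses such as 'I+' or 'P0' from a profile string."""
    normalized = record.replace("\u2212", "-")  # typographic minus -> ASCII hyphen
    return dict(re.findall(r"\b([IMTVSP])([+\-0])", normalized))


def deviations(profile: dict, target: dict = FULL_SYNDROME) -> list:
    """Return the scales on which a profile differs from the target."""
    return [scale for scale, value in target.items() if profile.get(scale) != value]


def matches_syndrome(profile: dict, target: dict = FULL_SYNDROME) -> bool:
    """True when every scale in the target profile is matched."""
    return not deviations(profile, target)


record_5 = parse_profile("I+, M+, T\u2212, V\u2212, S+, P0")
print(matches_syndrome(record_5))  # full match on all six scales

partial = parse_profile("I+, M+, T-, V-, S+, P+")
print(deviations(partial))  # lists the scales that depart from the target
```

A response that departs from the target on a single scale is reported by `deviations`, which keeps a near-complete profile distinguishable from an unrelated one.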

#### Edited by:

*Leonid Perlovsky, Harvard University, USA*

#### Reviewed by:

*Stephan Thomas Vitas, Independent Researcher, Silver Spring, USA*

#### \*Correspondence:

*Leon Crickmore crickmore176@talktalk.net*

#### Specialty section:

*This article was submitted to Cognition, a section of the journal Frontiers in Psychology*

Received: *06 January 2017* Accepted: *12 April 2017* Published: *03 May 2017*

#### Citation:

*Crickmore L (2017) The Measurement of Aesthetic Emotion in Music. Front. Psychol. 8:651. doi: 10.3389/fpsyg.2017.00651*

The psychological experience indicated by this syndrome may differ in intensity according to the profundity of the music, or to the listener's emotional state at the time. Occasionally, the experience can even reach the sublimity which T. S. Eliot describes in the third of his Four Quartets:

"Music heard so deeply That it is not heard at all, but you are the music While the music lasts."

Lest any readers should consider that such an intense experience would be unlikely to be within the grasp of an 18-year-old technical student, let me quote an instance from a student. After listening to a recording of the second movement of Tchaikovsky's Sixth Symphony—a piece which another student described as the "worst record played yet"—this student responded with the full syndrome, adding:

"I wasn't listening to the music but it was me inside. I was living it–this was especially noticeable with the climaxes."

<sup>1</sup>Liking 7 or 6 was included in the earliest versions of the test but later discarded.

Various factor analyses of my experimental data were carried out. These showed the Syndrome Test to be independent of the Maudsley Personality Inventory, the Wing Standardized Test of Musical Intelligence and Raven's Progressive Matrices (A–E), a non-verbal test related to general intelligence. Also, when a number of the musical items were repeated later, the increase in the scores of students proved to be statistically significant, indicating that the test can also be used to measure growth in appreciation. This feature raises an issue that may be interpreted as a demonstration of the statistical unreliability of the test, as has been suggested by Clive Gabriel in our joint article (Gabriel and Crickmore, 1977). This article describes the first use of the test with a totally random population, rather than with a group of students who had chosen to participate in a course of musical appreciation. Alternatively, such a result could be cited as evidence of the inappropriateness of traditional psychometric methodology for measuring the formation of dynamic gestalten, since such a process involves the integration of sub-wholes at a moment of ripeness (Maritain, 1959, pp. 132–134)<sup>2</sup>. The most frequent pattern of responses which Gabriel found was the postulated profile (G) with the exception of an incorrect response on scale P (P+ instead of P−). He called this response "a relaxed G." Yet on another occasion one of my students, after listening to the first movement of Beethoven's Fifth Symphony, wrote:

"It is surprising too that when one's mind 'wanders' into mental pictures the actual beauty of the music escapes one and only surface music is absorbed."

## AESTHETIC EMOTIONS

Perlovsky states that "emotions related to knowledge have been called aesthetic since the time of Kant." I would contend, however, that since Kant held knowledge to be exclusively "discursive," that is to say the opposite of receptive and contemplative (Pieper, 1959, p. 32), this dogmatic assertion has contributed to the formulation of a methodology in psychology which has constrained cognitive scientists from attempting to measure aesthetic emotions. This postulate led Kant (1790) to rank music in the "lowest place" amongst the arts, and Pinker (1997) to dismiss music as "mere auditory cheese." Cross (1999, 2013) and Bannon (2016), however, both cite evidence indicating that Darwin believed, but deemed it inappropriate to say so at the time, that music's origins preceded those of language, which later emerged as an offshoot.

The concept of aesthetic emotion in academic literature seems to have been first posited by the aesthetician Bell (1914) and incorporated into cognitive psychology by Payne (1961, 1965). Later it was developed philosophically by Langer (1957). Langer argues that music is both "a symbolism without assigned connotation" and "our myth of the inner life," a view that is wholly compatible with my syndrome hypothesis. Nevertheless, I prefer the greater metaphysical precision of the older explanation offered by Thomas Aquinas, who distinguished between conceptual knowledge (per cognitionem) and knowledge from experience (per connaturalitatem). The aesthetic enjoyment of music belongs to the latter category (Crickmore, 1966).

Probably the best psychological theory that matches such philosophical views lies in Gestalt psychology, in the concept of a "dynamic gestalt," as expounded by Perls, Hefferline and Goodman in Gestalt Therapy (Perls et al., 1951, pp. 403–404). Their book maps a process of contact, conceived as a single whole but which can also be conveniently divided into a series of grounds and figures: Fore-contact; Contacting; Final contact, when a listener would be "all ears" (Crickmore, 1972); and Post-contact.

## MUSICAL EMOTIONS

Cooke (1959) has described music as a "language of the emotions," and has demonstrated how certain tonal patterns have become conventionally associated with particular basic emotions. The idea that an aesthetic response to music can be defined as a match between the emotions of a composer and those of the listener is widely accepted. However, people do in fact feel different emotions when listening to the same music, while a single listener may feel different emotions when listening to the same music at different times. I suggest, therefore, that musical emotions which do not lead to physical action are best considered as "as if" emotions. Langer has noted (Langer, 1957, p. 238) that music reflects "only the morphology of feeling." Musical emotions are more a matter of stipulation than of logical necessity.

Despite all that, a number of cognitive psychologists have assembled evidence for the measurability of a number of basic musical emotions (Juslin, 2013). With a view to future research, I have therefore added a means for registering these in a revised form of my Syndrome Test. The Revised Syndrome Test is shown in **Table 1**. The test in its original form is displayed in the upper part of the Table, with the exclusion of the opportunity for listeners to record their "Liking" for the music on a scale of 1–7. This was omitted when it was discovered how highly the L. scale correlated with the I. scale. A new lower part of the test now provides opportunities for listeners to record the musical emotions they experience.

It has subsequently been found that the empirically verifiable musical emotions in several different cultures tend to be conventionally associated with a particular musical genre: for example, sadness with funeral music, or pride with a national anthem. During the evolution of each of these genres, it seems likely that music would initially have been introduced as part of a communal ritual, thereby heightening the intrinsic emotions of a particular social activity. Only later would a few well-manicured examples of these genres have come to be listened to in contemplative silence. Recently, some evidence has been found for a link between musical genre and personality (Neuman et al., 2016).

This evolutionary consideration brings to mind two even more significant pieces of recent pioneering research, the first by Perlovsky (2008, 2010, 2014) and the second by Nikolsky (2015, 2016). For Perlovsky, music has played a far greater part in human evolution than has so far been generally recognized. He defines its main evolutionary function as the resolution of the cognitive dissonances brought about by the split in the vocalizations of proto-humans into two types, one evolving into language, and the other into music. He also postulates the existence of a "knowledge instinct" through which the brain is able to match top-down and bottom-up signals in a manner for which he has coined the term "dynamic logic." Furthermore, his dynamic logic is computable, and could well share something in common with the psychological idea of a dynamic gestalt.

Working within Perlovsky's theoretical framework, and employing a methodology for melodic analysis little known outside the world of Russian-speaking musicologists, Nikolsky has managed to define, together with appropriate audio illustrations, 11 stages in the evolutionary development of music, from the time of Steven Mithen's "Singing Neanderthals" (Mithen, 2005) and the earliest pre-modal music, via pentatonic and heptatonic stages and culminating in modern tonality. Could it be that human evolution has a direction, namely the establishment of a phylum of "direct cerebralization," as Pierre Teilhard de Chardin proposed (de Chardin, 1959; Crickmore, 1978), in contrast to the majority of modern scientists, who tend to consider the biosphere the result of a largely fortuitous series of events? Once this threshold of reflection has been crossed, evolution would continue in a new form, that of humanly invented combinations: logic, mathematics, art, and music. Could it also be possible that Nikolsky's model of tonal evolution might be taken to suggest a direction for musical evolution: toward the establishment of the diatonic system? Gary Tomlinson's A Million Years of Music (Tomlinson, 2015) is probably the best available survey of current knowledge in this field.

<sup>2</sup> I would justify the inclusion in the factor analyses of derived scores, and rotation to discover the most helpful viewing point, on the grounds that loadings are simply mathematical entities (entia rationis).

#### TABLE 1 | Revised syndrome test.


Any other comments:

Then please tick any of the emotions listed below which you associate with the music you have just heard, and describe any others you have felt but which are not listed:

*This revision of the syndrome test is characterized by the addition of a possibility for each listener to tick emotions felt during the music from a list of six emotions for which researchers have already found validating evidence. In this form the test has not yet been used experimentally. Until experiment provides some initial evidence concerning which emotions listeners actually do tick, and how frequently, it is not possible to decide whether, and if so how, such fresh data might be incorporated into a Factor Analysis of scores on the scales of the syndrome, together with those from some appropriate personality test. If those who believe that music is in some way a language of the emotions are right, such an analysis should be feasible. If my hypothesis that music is not a language of the emotions is true, any emerging data is unlikely to be compatible with factor analysis.*

Each age in the evolution of music would have required a different style of listening. For us, in the twenty-first century, any description of this must take account of the effects of popular music with all its associated digitization and consumerist technologies. Our traditional definition of music in the West, based as it is largely on the world of concert halls and opera houses, therefore needs to be extended, possibly even to include listening to a Walkman and Muzak. Philip Tagg's book Music's Meanings (Tagg, 2013) could help us clear away some of the intellectual brushwood which has been accumulated over the years by musicologists and historians of music. Tagg has suggested an operational re-definition of music as "that form of human communication in which humanly organized nonverbal sound can, following culturally specific conventions, carry meaning relating to emotional, gestural, tactile, kinetic, spatial and prosodic patterns of cognition."

#### CONCLUSION

There is a need for a programme of further research using my Revised Syndrome Test in **Table 1**. Such a programme should seek to clarify further the nature of the relationships between aesthetic emotions as measured by my test, musical emotions, genres, and personality. It should also include some fMRI scanning of the participants, both as they enjoy familiar music and as they grow to enjoy that which is less familiar to them. As long as Homo sapiens survives, music will survive too. In his poem The Metalogue to the Magic Flute, W. H. Auden has highlighted a matter which future researchers will have to bear in mind:

"The history of music as of Man Will not go cancrizans, and no ear can Recall what, when Archduke Francis reigned, Was heard by ears whose treasure-hoard contained A Flute already but as yet no Ring, Each age has its own mode of listening."

## ETHICS STATEMENT

The experiment mentioned took place in the 1960s, before the Helsinki agreement, with the consent of the students involved.

## AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2017.00651/full#supplementary-material

## REFERENCES


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Crickmore. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# How Music and Instruments Began: A Brief Overview of the Origin and Entire Development of Music, from Its Earliest Stages

#### *Jeremy Montagu\**

*University of Oxford, Oxford, United Kingdom*

Music must first be defined and distinguished from speech, and from animal and bird cries. We discuss the stages of hominid anatomy that permit music to be perceived and created, with the likelihood of *Homo neanderthalensis* and *Homo sapiens* both being capable. The earlier hominid ability to emit sounds of variable pitch with some meaning shows that music at its simplest level must have predated speech. The possibilities of anthropoid motor impulse suggest that rhythm may have preceded melody, though full control of rhythm may well not have come any earlier than the perception of music above. There are four evident purposes for music: dance, ritual, entertainment (personal and communal), and above all social cohesion, again on both personal and communal levels. We then proceed to how instruments began, with a brief survey of the surviving examples from the Mousterian period onward, including the possible Neanderthal evidence and the extent to which they showed "artistic" potential in other fields. We warn that our performance on replicas of surviving instruments may bear little or no resemblance to that of the original players. We continue with how later instruments, strings, and skin-drums began and developed into instruments we know in worldwide cultures today. The sound of music is then discussed, scales and intervals, and the lack of any consistency of consonant tonality around the world. This is followed by iconographic evidence of the instruments of later antiquity into the European Middle Ages, and finally, the history of public performance, again from the possibilities of early humanity into more modern times. This paper draws the ethnomusicological perspective on the entire development of music, instruments, and performance, from the times of *H. neanderthalensis* and *H. sapiens* into those of modern musical history, and it is written with the deliberate intention of informing readers who are without special education in music, and providing necessary information for inquiries into the origin of music by cognitive scientists.

Keywords: music, rhythm, dance, instruments, development, social cohesion, performance, tonality

## HOW DID MUSIC BEGIN? WAS IT *VIA* VOCALIZATION OR WAS IT THROUGH MOTOR IMPULSE?

But even those elementary questions are a step too far, because first we have to ask "What is music?" and this is a question that is almost impossible to answer. Your idea of music may be very different from mine, and our next-door neighbor's will almost certainly be different again. Each of us can only answer for ourselves.

Mine is that it is "Sound that conveys emotion."

#### *Edited by:*

*Aleksey Nikolsky, Braavo! Enterprises, United States*

#### *Reviewed by:*

*Dean Falk, Florida State University, United States Josef Verba, The Muse of Life Society, Inc., United States*

*\*Correspondence: Jeremy Montagu jeremymontagu@gmail.com*

#### *Specialty section:*

*This article was submitted to Evolutionary Sociology and Biosociology, a section of the journal Frontiers in Sociology*

*Received: 01 March 2017 Accepted: 23 May 2017 Published: 20 June 2017*

#### *Citation:*

*Montagu J (2017) How Music and Instruments Began: A Brief Overview of the Origin and Entire Development of Music, from Its Earliest Stages. Front. Sociol. 2:8. doi: 10.3389/fsoc.2017.00008*

We can probably most of us agree that it is sound; yes, silence is a part of that sound, but can there be any music without sound of some sort? For me, that sound has to do something—it cannot just be random noises meaning nothing. There must be some purpose to it, so I use the phrase "that conveys emotion." What that emotion may be is largely irrelevant to the definition; there is an infinite range of possibilities. An obvious one is pleasure. But equally another could be fear or revulsion.

How do we distinguish that sound from speech, for speech can also convey emotion? It would seem that musical sound must have some sort of controlled variation of pitch, controlled because speech can also vary in pitch, especially when under overt emotion. So music should also have some element of rhythm, at least of pattern. But so has the recital of a sonnet, and this is why I said above that the question of "What is music?" is impossible to answer. Perhaps the answer is that each of us in our own way can say "Yes, this is music," and "No, that is speech."

Must the sound be organized? I have thought that it must be, and yet an unorganized series of sounds can create a sense of fear or of warning. Here, again, I must insert a personal explanation: I am what is called an ethno-organologist; my work is the study of musical instruments (organology) worldwide (hence the ethno-, as in ethnomusicology, the study of music worldwide). So to take just one example of an instrument, consider the ratchet or rattle: a blade, usually of wood, striking against the teeth of a cogwheel as the blade rotates round the handle that holds the cogwheel. This instrument is used by crowds at sporting matches of all sorts; it is used by farmers to scare the birds from the crops; it was and still is used by the Roman Catholic church in Holy Week when the bells "go to Rome to be blessed" (they do not of course actually go but they are silenced for that week); it was scored by Beethoven to represent musketry in his so-called Battle Symphony, a work more formally called *Wellingtons Sieg oder die Schlacht bei Vittoria*, Op. 91, that was written originally for Maelzel's giant musical box, the Panharmonicon. Beethoven also scored it out for live performance by orchestras and it is now often heard in our concert halls "with cannon and mortar effects" to attract people to popular concerts. And it was also, during the Second World War, used in Britain by Air-Raid Precaution wardens to warn of a gas attack, thus producing an emotion of fear. If it was scored by Beethoven, it must be regarded as a musical instrument, and there are many other noise-makers that, like it, must be regarded as musical instruments.

And so, to return to our definition of music, organization may be regarded as desirable for musical sound, but it cannot be deemed essential, and thus my definition remains "Sound that conveys emotion."

#### SO NOW WE CAN ASK AGAIN, "HOW DID MUSIC BEGIN?"

But then another question arises: is music only ours? We can, I think, now agree that two elements of music are melody, i.e., variation of pitch, plus rhythmic impulse. But almost all animals can produce sounds that vary in pitch, and every animal has a heart beat. Can we regard bird song as music? It certainly conveys musical pleasure for us, it is copied musically (Beethoven again, in his *Pastoral Symphony*, No. 6, Op. 68, and in many works by other composers), and it conveys distinct signals for that bird and for other birds and, as a warning, for other animals also. Animal cries also convey signals, and both birds and animals have been observed moving apparently rhythmically. But here, we, as musicologists and ethnomusicologists alike, are generally agreed to ignore bird song, animal cries, and rhythmic movement as music even if, later, we may regard it as important when we are discussing origins below. We ignore these sounds, partly because they seem only to be signals, for example alarms, etc., or "this is my territory," and partly because, although they are frequently parts of a mating display, they do not seem to impinge on society as a whole, a feature that, as we shall see, can be of prime importance in human music. Perhaps, too, we should admit to a prejudice: that we are human and animals are not…

So now, we can turn to the questions of vocalization versus motor impulse: which came first, singing or percussive rhythms? At least we can have no doubt whatsoever that for melody, singing must long have preceded instrumental performance, but did physical movement have the accompaniment of hand- or body-clapping and perhaps its amplification with clappers of sticks or stones, and which of them came first?

Here, we turn first to the study of the potentials of the human body. There is a large literature on this, but it has recently been summarized by Iain Morley in his *The Prehistory of Music* (Morley, 2013). So far as vocalization is concerned, at what point in our evolution was the vocal tract able to control the production of a range of musical pitch? For although my initial definition of music did not include the question of pitch, nor of rhythm, once we begin to discuss and amplify our ideas of music, one or other of these does seem to be an essential: a single sound with no variation of pitch nor with any variation in time can hardly be described as musical.

Studies based on fossil remains of the cranium and jaw formation of the early species of *Homo* suggest that while *Homo ergaster* from between two million and a million and a half years ago could produce some variation of pitch but perhaps without much breath control, *Homo erectus* may have had greater ability, and *Homo heidelbergensis*, and certainly its later development from around a million years ago into the common ancestor of *Homo neanderthalensis* and *Homo sapiens*, could certainly "sing" as well as we can, though of course we can have no evidence of whether they could control such ability, whether they used it, and if so to what extent. So we can say that vocalization, while absent from the capability of our cousins the great apes and of the early forms of *Homo*, could be as old as at least a million years. It would seem that *Homo heidelbergensis* had the muscular abilities, but perhaps not the full mental capacities and that it was not until *H. sapiens* arrived that all the requirements for vocalization were in place, both exported and imported, and possibly not even in the earliest stages of the evolution of *H. sapiens*. It is here that there is controversy over the relative musical abilities of *H. neanderthalensis* and *H. sapiens*, to which in due course we shall return.

Much of this work also discusses the origins of speech as well as that of music. The two processes seem to have much the same physiological requirements, the ability to produce the various consonants and vowels that enable speech, and the ability to control discrete musical pitches. But this capacity goes far beyond the ability to produce sounds.

All animals have the ability to produce sounds, and most of these sounds have meanings, at least to their ears. Surely, this is true also of the earliest hominins. If a mother emits sounds to soothe a baby, and if such sound inflects somewhat in pitch, however vaguely, is this song? An ethnomusicologist, one who studies the music of exotic peoples, would probably say "yes," while trying to analyze and record the pitches concerned. A biologist would also regard mother–infant vocalizations as prototypical of music (Fitch, 2006). There are peoples (or have been before the ever-contaminating influence of the electronic profusion of musical reproduction) whose music has consisted only of two or three pitches, and those pitches not always consistent, and these have always been accepted as music by ethnomusicologists. So we have to admit that vocal music of some sort may have existed from the earliest traces of humanity, long before the proper anatomical and physiological developments enabled the use of both speech and what we might call "music proper," with control and appreciation of pitch.

In this context, it is clear also that "music" in this earliest form must surely have preceded speech. The ability to produce something melodic, a murmuration of sound, something between humming and crooning to a baby, must have long preceded the ability to form the consonants and vowels that are the essential constituents of speech. A meaning, yes: "Mama looks after you, darling," "Oy, look out!" and other non-verbal signals convey meaning, but they are not speech.

The possibilities of motor impulse are also complex. Here, again, we need to look at the animal kingdom. Both animals and birds have been observed making movements that, if they were humans, would certainly be described as dance, especially for courtship, but also, with the higher apes, in groups. Accompaniment for the latter can include foot-slapping, making more sound than is necessary just for locomotion, and also body-slapping (Williams, 1967). Can we regard such sounds as music? If they were humans, yes without doubt. So how far back in the evolutionary tree can we suggest that motor impulse and its sonorous accompaniment might go? I have already postulated in my *Origins and Development of Musical Instruments* (Montagu, 2007, p. 1) that this could go back as far as the earliest flint tools, that striking two stones together as a rhythmic accompaniment to movement might have produced the first flakes that were used as tools, or alternatively that interaction between two or more flint-knappers may have led to rhythms and counter-rhythms, such as we still hear between smiths and mortar-and-pestle millers of grains and coffee beans. This, of course, was kite-flying rather than a wholly serious suggestion, but the possibilities remain. At what stage did a hominin realize that it could make more sound, or could alleviate painful palms, by striking two sticks or stones together, rather than by simple clapping? Again we turn to Morley and to the capability of the physiological and neurological expression of rhythm.

The physiological must be presumed from the above animal observations. The neurological would again, at its simplest, seem to be pre-human. There is plenty of evidence for gorillas drumming their chests and for chimpanzees moving rhythmically in groups. However, apes' capacity for keeping steady rhythm is very limited (Geissmann, 2000), suggesting that it constitutes a later evolutionary development in hominins. Perceptions of more detailed appreciation of rhythm, particularly of rhythmic variation, can only be hypothesized by studies of modern humans, especially of course of infantile behavior and perception.

From all this, it would seem that motor impulse, leading to rhythmic music and to dance could be at least as early as the simplest vocal inflection of sounds. Indeed, it could be earlier. We said above that animals have hearts, and certainly, all anthropoids have a heartbeat slow enough, and perceptible enough, to form some basis for rhythmic movement at a reasonable speed. Could this have been a basis for rhythmic movement such as we have just mentioned? This can only be a hypothesis, for there is no way to check it, but it does seem to me that almost all creatures seem to have an innate tendency to move together in the same rhythm when moving in groups, and this without any audible signal, so that some form of rhythmic movement may have preceded vocalization.

## BUT WHY DOES MUSIC DEVELOP FROM SUCH BEGINNINGS? WHAT IS THE PURPOSE OF MUSIC?

There are four obvious purposes: dance, personal or communal entertainment, communication, and ritual.

Dance we have already mentioned, though we can never know whether rhythmic motion led to the use of accompaniment, or whether the use of rhythm for any work led to people moving rhythmically in a way that became dance. It is well accepted in anthropology that when people are working, or moving together, their movements fall into a rhythm, that people may grunt and make other noises into that rhythm. The grunts may move into something that verges on or morphs into song; the other noises may be claps or beating pairs of objects together (concussive) or beating one object on another (percussive). Such objects can only be idiophonic, such as sticks, stones, and other solid objects that require no additional features to help them make a sound, in the classificatory system for instruments (Hornbostel and Sachs, 1914). This is simply because to create a drum with a skin (membranophones) is a complex process, because a skin will not produce sound unless it is under tension.

There is no doubt whatsoever that rhythmic sound without any melodic input must be regarded as music. It appears in many cultures, even if rarely, and we have Varèse's *Ionization* to take as an example from our modern orchestral repertoire.

Our second purpose was personal or communal entertainment. Communal entertainment, to some extent, overlaps with dance and with rhythmic work; personal entertainment overlaps for the mother and baby, mentioned above, with communication, as does the traveler using an instrument to indicate to people or villages that he passes that his purpose is peaceful and that he is not a robber intent on purloining their property, a well-known practice anthropologically but one whose antiquity we have no way to measure.

Our third purpose, communication by musical means, is again widespread. We have the "bush telegraph" in Africa and other parts of the world with slit drums and other instruments, the alphorn in Switzerland and in other mountainous or marshy regions, the conch in Papua New Guinea, as random examples of the use of an instrument to pass messages. We have the whistling language of the Canary Islands (*silbo*) and many other parts of the world, and the high vocal calls of other peoples as examples of non-instrumental music for the same purpose.

Our fourth purpose, ritual, is a well-known trap in archeology and anthropology. Any object, any practice that cannot otherwise be explained, is assigned as "ritual." But there seems to be no form of religion, to use that word in its widest sense, that does not attract music to its practices. And here, we have another conflict, again that between music and speech. Schönberg's "invention" of *Sprechgesang*, an interface between speech and music, was nothing new. Many forms of ritual chants would be difficult to notate precisely in pitch; the words are spoken but they are inflected up and down quasi-melodically. Some bardic narrative is also an example of this, while often breaking intermittently into song. In both cases, the musical inflection renders the text less boring and helps the speaker with his or her memory of the text. It is undoubtedly speech, for the meaning of the words is the essential part, but there is also the element of pitch variation that would make an ethnomusicologist claim it to be music even while the practitioner would often vehemently deny any such claim, especially within the stricter forms of Islam, those in which music is forbidden.

Seemingly more important than these fairly obvious reasons for why music developed is one for why music began in the first place. This is something that Steven Mithen mentions again and again in his book, *The Singing Neanderthals* (Mithen, 2005): that music is not only cohesive for society but almost adhesive. Music leads to bonding, bonding between mother and child, bonding between groups who are working together or who are together for any other purpose. Work songs are a cohesive element in most pre-industrial societies, for they mean that everyone in the group moves together and thus increases the force of their work. Even today "Music while you Work" has a strong element of keeping workers happy when doing repetitive and otherwise boring work. Dancing or singing together before a hunt or warfare binds the participants into a cohesive group, and we all know how walking or marching in step helps to keep one going. It is even suggested that it was music, in causing such bonding, that created not only the family but society itself, bringing individuals together who might otherwise have led solitary lives, scattered at random over the landscape.

Thus, it may be that the whole purpose of music was cohesion, cohesion between parent and child, cohesion between father and mother, cohesion between one family and the next, and thus the creation of the whole organization of society.

Much of the above can only be theoretical; we know of much of its existence in our own time but we have no way of estimating its antiquity other than by the often-derided "evidence" of the anthropological records of isolated, pre-literate peoples. So let us now turn to the hard evidence of early musical practice, that of the surviving musical instruments.<sup>1</sup>

This can only be comparatively late in time, for it would seem to be obvious that sound makers of soft vegetal origin should have preceded those of harder materials that are more difficult to work, whereas it is only the hard materials that can survive through the millennia. Surely natural materials such as grasses, reeds, and wood preceded bone? That this is so is strongly supported by the advanced state of many early bone pipes—the makers clearly knew exactly what they were doing in making musical instruments, with years or generations of experiment behind them on the softer materials. For example, some endblown and notch-blown flutes, the earliest undoubted ones that we have, from Geissenklösterle and Hohle Fels in Swabia, Germany, made from swan, vulture wing (radius) bones, and ivory in the earliest Aurignacian period (between 43,000 and 39,000 years BP), have their fingerholes recessed by thinning an area around the hole to ensure an airtight seal when the finger closes them. This can only be the result of long experience of flute making.

So how did musical instruments begin? First a warning: with archeological material, we have what has been found; we do not have what has not been found. A site can be found and excavated, but if another site has not been found, then it will not have been excavated. Thus, absence of material does not mean that it did not exist, only that it has not been found yet. Geography is relevant too. Archeology is a much older science in Europe than elsewhere, so that most of our evidence is European, whereas in Africa, where all species of *Homo* seem to have originated, site archeology is in its infancy. Also, we have much evidence of bone pipes simply because a piece of bone with a number of holes along its length is fairly obviously a probable musical instrument, whereas how can we tell whether some bone tubes without fingerholes might have been held together as panpipes? Or whether a number of pieces of bone found together might or might not have been struck together as idiophones? We shall find one complex of these later on here which certainly were instruments. And what about bullroarers, those blades of bone, with a hole or a constriction at one end for a cord, which were whirled around the player's head to create a noise like thunder or the bellowing of a bull, or if small and whirled faster sounded like the scream of a devil? We have many such bones, but how many were bullroarers, how many were used for some other purpose?

So how did pipes begin? Did someone hear the wind whistle over the top of a broken reed and then try to emulate that sound with his own breath? Did he or his successors eventually realize that a shorter piece of reed produced a higher pitch and a longer segment a lower one? Did he ever combine these into a group of tubes, either disjunctly, each played by a separate player, as among the Venda of South Africa and in Lithuania, or conjointly lashed together to form a panpipe for a single player? Did, over the generations, someone find that these grouped pipes could be replaced with a single tube by boring holes in it, with each hole representing the length of one of that group? All this is speculation, of course, but something like it must have happened.

<sup>1</sup>All the known archaeological instruments that we have, up to the end of the Neolithic period, are listed in tables by Morley (2013), and many are illustrated and described in his text.

Or were instruments first made to imitate cries? The idea of the hunting lure, the device to imitate an animal's cry and so lure it within reach, is of unknown age. Or were they first made to imitate the animal in a ritual to call for the success of tomorrow's hunt? Some cries can be imitated by the mouth; others need a tool, a short piece of cane, bits of reed or grass or bone blown across the end like a key or a pen-top. Others are made from a piece of bark held between the tongue and the lip (I have heard a credit card used in this way!). The piece of cane or bone would only produce a single sound, but the bark, or in Romania a carp scale, can produce the most beautiful music as well as being used as a hunting call. The softer materials will not have survived and with the many small segments of bone that we have, there is no way to tell whether they might have been used in this way or whether they are merely the detritus from the dining table.

We have many whistles made from an animal phalange or toe bone, blown between a pair of protrusions at one end, across a sound hole near the center. Two of them come from the Mousterian period of the Middle Paleolithic, over 50,000 years ago, and there are many from the Aurignacian down to the Magdalenian and later; most, but not all, are reindeer phalanges. D'Errico has warned us, though, that the "sound hole" on many of these looks as though it was made by a carnivore bite (D'Errico et al., 2003). It was in the Mousterian period that the Neanderthals co-existed with *Homo sapiens*; the latter arrived in Europe between fifty and forty thousand years ago (though far earlier in the Near East), whereas Neanderthals had long been established in Europe, perhaps as long as 200,000 years before. Whether any that were blown by humans were used for signaling, or whether they were also used for music we cannot know, but whistles are certainly regarded as musical instruments.

More controversial in this Mousterian period, and certainly associated with other Neanderthal remains, is the young cave bear femur from the Divje Babe cave in Slovenia, dated to around 60,000 years Before the Present (BP). This has two holes in it and what might be three others at the broken-off ends, two on one side and one on the other. The fragment of bone is just over 10 cm long and while many people have claimed it as a flute, for it can certainly produce several pitches when reproductions of it are blown, many others have claimed that the holes are the result of other carnivores gnawing it, especially at the ends. As for the two complete holes, some writers have claimed that they are just the right size, shape, and spacing to have been produced by bears, for whose presence in the cave there is ample evidence, nor does there seem to be any trace of any possible human work on the bone. There is a very considerable literature on this possible instrument, well summed up and cited by Morley and by D'Errico et al., and the general consensus had been that it was not a musical instrument but simply the result of animal action. Nevertheless, the original discoverers have returned to the attack with a recent publication (Turk, 2014) which sets out to show that human agency not only could have but did pierce those holes. For now, we can only leave this question open, with all the problems of a unicum; there are convincing arguments on both sides, with at present rather greater weight on the "yes" side, partly due to this recent publication, and partly to the evidence in the following paragraph. What we really need are more examples from the Mousterian period.

This bone does raise the whole question of whether *H. neanderthalensis* knew of or practised music in any form. For rhythm, we can only say surely, as above: if earlier hominids could have, so could *H. neanderthalensis*. Could they have sung? A critical anatomical feature is the position of the larynx (Morley, 2013, 135ff); the lower the larynx in the throat the longer the vocal cords and thus the greater flexibility of pitch variation and of vowel sounds (to put it at its simplest). It would seem to have been with *H. heidelbergensis* and its successors that the larynx was lower and thus that singing, as distinct from humming, could have been possible, but "seems to have been" is necessary because, as so often, this is still the subject of controversy. However, it does seem fairly clear that *H. neanderthalensis* could indeed have sung. It follows, too, that while the Divje Babe "pipe" may or may not have been an instrument, others may yet be found that were instruments. There is evidence that the Neanderthals had at least artistic sensibilities, for there are bones with scratch marks on them that may have been some form of art, and certainly there are a number of small pierced objects, pieces of shell, animal teeth, and so forth, found in various excavations that can only have served as beads for a necklace or other ornamentation, or just possibly as rattles. Pieces of pigment of various colors have also been found, some of them showing wear marks and thus that they had been used to color something, and at least one that had been shaped into the form of a crayon, indicating that some reasonably delicate pigmentation had been desired. Burials have been found, with some small deposits of grave goods, though whether these reveal sensibilities or forms of ritual or belief, we cannot know (D'Errico et al., 2003, 19ff). Many bone awls have also been found, including some very delicate ones which, we may presume, had been used to pierce skins so that they could be sewn together. All this leads us to the conclusion that the Neanderthals had at least some artistic and other feelings, were capable of some musical practices, even if only vocal, and were clothed, rather than being the grunting, naked savages that have been assumed in the past.

It is in the Middle and Upper Paleolithic, from the Aurignacian period, which starts around 43,000 BP in eastern Europe and around 40,000 in the west, to the Magdalenian and later, ending around 10,000 BP, that we have a very considerable number of instruments, plus a few representations. Many of them, like those from Geissenklösterle above, are end-blown flutes made of bone, most commonly of large birds such as vultures and swans. Some of them are blown *via* a notch; some appear to be duct flutes, similar to our recorders, though of course the block made of wood, pith, or fiber has not survived (more probably, they were tongue-duct flutes, using the tongue in the end instead of a block, and some are listed as such in Morley's tables); and others may have been plain-end blown, diagonally across the top, like the Arab *nay*. With these last, though, it is possible that a reed was used as the sound generator, either a double reed like that of our oboe or a split-cane single reed like that of many Arab instruments, or possibly that they were even lip-blown (trumpeted), though the narrowness of the bore makes this seem less likely. It is, therefore, probably better to refer to this last group as pipes, rather than as flutes.

Reproductions of many can be and have been played, but there is little to be learnt from this practice. We know what pitches and sounds *we* can get out of them, but unless we know their playing techniques, which of course we do not know, we cannot tell what sort of pitches and tone qualities they would have obtained in antiquity. Every recorder and tin-whistle player knows of a number of ways to inflect the pitch and the tone; every Arab *nay* player knows even more, and ethnomusicologists have produced evidence for even more, and our experimental musicians have shown that quite extraordinary pitches and sounds can be obtained from many of our orchestral instruments, sounds that their makers or normal players never conceived. Thus the archeologists (who are seldom trained musicians), who publish the scales and pitches of the pipes that they have found, can give us no more than conjecture and the experience of their own musicality. I have a collection of musical instruments from all over the world; I know the sounds that I can get out of them, but without the presence of the original player, or a field recording of the original player on that very instrument, I have no way to tell what sounds or pitches he or she produced. So much less can we have any idea what sounds and pitches were heard in the Paleolithic times.

However, there is one salient point, emphasized by D'Errico: a significant number of these pipes have varied spacing of finger holes. While it would seem that the majority have the finger holes evenly spaced along the tube, there are certainly some that have a wider gap between the second and third holes. There are two fairly obvious possible reasons for this: one is that their "scale" of pitches had intervals similar to wholetones and minor thirds; the other that it was convenient or comfortable to have a wider gap between the two hands. This latter suggestion is raised because such a gap was a standard feature of our flutes from the later Middle Ages right through into the early nineteenth century, and this was not only because from around 1700 the middle joint of the Baroque flute was divided into an upper and lower joint at this point; the earlier one-piece flutes also showed this gap. There are also some Aurignacian flutes or pipes that have one hole closer to another, showing that a semitone or a small wholetone was desired. Thus, these details emphasize not only that these were well-developed instruments, with the bodies well-scraped and smoothed, the finger holes with secure seating for the fingers, and a certain amount of incised decoration, but also that there was a desire for precise tuning, and that they were not just made to produce fairly random pitches.

In addition, there is the point that many of these features appear in Geissenklösterle in Germany, in Isturitz in France, in Spain, and also elsewhere, and over long periods of time, strongly suggesting that populations were not isolated but that there were links between them. This is not so surprising. If *H. sapiens* had traveled across Africa and into Europe, surely they could also travel between these areas and elsewhere.

There is little point in listing all these pipes; all the Paleolithic examples from Europe, or close by, found before 2013 are listed by Morley in his Appendices.

Were there other instruments? There is at least one conch trumpet, found in the Marsoulas cave, in the Haute-Garonne area of southern France, dating from around 20,000 years BP. Shell is a hard material that survives the ages, and although we have so far only this one example from the Upper Paleolithic, we have a very considerable number from the Neolithic times, some of them much further from the sea, so it is fair to assume a continuous use (Montagu, in press).<sup>2</sup> So what about animal horns? Here the material is soft, and only in very dry conditions such as desert sands do any survive; none of those that I have heard of or seen were blowing horns, but it seems likely that they existed. For blowing, the horn must be naturally hollow, such as those of the cow family, sheep and goats, antelopes, elephant tusks, hollow wood, gourds, and wide-bore bamboo, with the tip broken or cut off, or a hole bored in the side; such were surely blown in high antiquity (Montagu, 2014). There are several bullroarers from the Magdalenian period that we can be certain were instruments. There are many phalange whistles later than the Mousterian ones noted above. There are rasps, usually bones notched along their length, which would have been scraped with another bone or a stone for rhythmic music.

There is the complex of mammoth bones dating from around 20,000 BP, found in the Ukraine and published by Bibikov (1981). Many of the bones show signs of wear, almost certainly from repeated striking, and others, though this is not mentioned in the English summary, have striations similar to those of rasps, suggesting that some were scraped whereas others were struck. It is claimed that this was an ensemble, and although it would be difficult to prove that this was so, it would be even more difficult to show that each of these bones was struck only singly as an individual solo instrument. So here perhaps we have the first evidence of an "orchestra."

From the Magdalenian period, some 12,000 years BP, there are the caves themselves, where not only were stalactites struck but the caves were used as resonators for sounds; both Lucie Rault and Lya Dams have brought together a number of convincing reports of this (Dams, 1985; Rault, 2000). Resonant stones must also have been struck outside the caves, the so-called rock gongs, boulders struck on resonant points, and these are of unknown antiquity but many bear well-worn cup marks on their surfaces. Rock gongs were first reported by Bernard Fagg in Nigeria, and following his article (Fagg, 1956), many more have been reported from around the world (Fagg, 1997).

There is no evidence in the Paleolithic period for stringed instruments nor for skin drums.

At what point in history did someone discover that cupping the hands together and blowing between the knuckles of the thumbs produced a sound? This is a vessel flute or ocarina whose pitch is varied by moving the fingers to alter the area of open hole. Many peoples have long used gourds and other hollow vegetal objects, and today pottery, to play music in this way, and have also used the cupped hands as hunting lures, but since there are no animal bones of such a shape, we can have no evidence of vessel flutes earlier than the Neolithic, in which period pottery first came into use.

<sup>2</sup>Montagu, J. (in press). *The Conch Horn*.

Did voice changers precede instruments? Did someone sing into a hollow object to change his voice from that of a human into that of a spirit or a deity? Was a shell sung into before ever a shell was blown? This precedence is something that has at times been suggested, but it can never be more than a hypothesis for we have no evidence to prove it. We do know that certain Greek statues had voice changers built in, usually a tube with a skin over one end, our kazoo, and there are many African masks with such a device.

Stringed instruments probably originated by the Mesolithic period, and certainly by the Neolithic, for it is in those periods that we begin to find flint arrow-heads, and the archer's bow and the musical bow are symbiotic as we shall see below (Balfour, 1899).

Skin drums (membranophones), as we said above, need the skin to be under tension to function. At what stage could there have been frames to which a skin could have been fastened securely enough to be tight enough to play? One can only say as early as skins were dressed, wetted, and dried on a frame, but since neither skins nor wooden frames, nor hollow logs, can ever have survived, this is simply an unknown; ceramic bodies rigid enough to support the skins can only have been available in or shortly before the Neolithic period.

So far, we have been discussing instruments only from Europe or its immediate environment. This is simply because that is where the evidence is. Archeology has been going on longer in Europe than elsewhere, as we have said. Much is being found now in China, but since most of it has been published in Chinese, much of this information is inaccessible, at least to me.

All the instruments that we have discussed above continued through the Neolithic and, with archery and pottery available, many others have joined them.

The earliest stringed instrument is probably the musical bow (Balfour, 1899). The one string instrument that might possibly be earlier is one that is identical with an animal trap: a noosed cord, presumably gut or sinew, running from a bent stick or branch to a peg in the ground. When an animal puts its head or leg into the noose, the cord is jerked from the peg and the stick or branch springs up and traps the animal. It has been suggested by Sachs, Balfour, and others that the hunter may also have plucked the string, so creating the ground bow (also called the ground harp), varying the tension of the cord, and thus the pitch, by bending the stick or branch. The ground bow is of unknown antiquity; our only evidence for the existence of the instrument is nineteenth-century reports from anthropologists.

Bows themselves, of course, never survive, but the presence of arrowheads in the lithic evidence proves their existence. Whether the archer's bow preceded the musician's or vice versa is arguable, but man's addiction to warfare, and even more to hunting, makes the archer's the more likely. We have ethnographic evidence for the use of the same bow for both purposes by the same person, but each developed in different ways, the archer's for strength and the musician's for producing musical sounds in different ways. The string of the musical bow is most commonly tapped by a light stick, initially presumably by an arrow, and is held to the player's mouth where, by changing the shape of the mouth, different overtones are sounded as with the jew's harp (better and less prejudicially called trump, which is the earlier English name). Dividing the string with a loop of cord linking the string to the stave, or shortening the string at one end with the thumb of the holding hand, gives two fundamentals, each with its own overtones, and so makes a much greater range of pitches available. Attaching a gourd resonator to the stave creates greater volume, and opening or closing the mouth of the gourd against the player's chest will again elicit overtones. Both these forms survive to the present day in various modifications and many parts of the world, especially in Africa south of the Sahara (Kirby, 1934). A third form consists of attaching several bows to one resonator to form a pluriarc, as is still found in Central Africa.

One can postulate developments from both the gourd bow and the pluriarc. The gourd, eventually of wood, can be built on to one end of the stave to create both the category of instruments called lutes, with a straight stave as the neck, and of harps, with a curved stave. If the two outermost bows of the pluriarc become rigid, with a cross bar running between them to hold the distal ends of the strings of the inner bows, which then become redundant, the instrument is then much more stable and is called a lyre. Whether such developments took place, or whether lutes, harps, and lyres were independently invented, we can never know, but my own guess, based partly on various intermediate forms in various cultures, is for this process of development.

As for drums, frame drums are still ubiquitous around the world today: not only our own tambourine, but drums with wooden or pottery bodies of manifold shapes exist almost everywhere. One possible early source for another type of drum is the fixing of the skin of the animal just eaten over the top of the pot in which it had been cooked, so creating the instrument very appropriately called the kettledrum, using the word kettle in the sense of a caldron.

Another very common use of pottery is to create a rattle, a vessel containing seeds, pebbles, or nodules of pottery. Such vessel rattles must have been long preceded by gourds or woven leaves or baskets, all of which are still common today.

Once humanity entered the metal ages, the potentialities of instruments become infinite.

We can never know to what extent any groups of instruments or voices played together in high antiquity, though the existence of the group of mammoth bones above does strongly suggest an ensemble. Not until the days of representational iconography, as in Mesopotamia and Egypt, or with the introduction of literacy, such as our Bible, do we have any real evidence. We have plenty of information from these sources.

What then did music sound like? We have early notations from Sumeria (Galpin, 1936) and Ancient Greece, the well-known hymn to Apollo, covering a wide range of pitches; Hickmann tried to derive a notation from hand-signals, called cheironomy, portrayed in Egyptian paintings and carvings (Hickmann, 1961). It has been thought by ethnomusicologists that cultures less advanced than those used pentatonic scales (five steps to the octave) such as we can still hear today in some areas, and perhaps even fewer steps with or without knowledge of the octave. But for these, naturally there is no evidence. Even with Sumerian, Greek, and Egyptian systems, the various transcriptions of which are all controversial, we cannot know the actual sounds, for not until the later classical Greek period do we have written evidence of the sizes of scalar steps.

We do know, from the transcription of cuneiform tablets, that it was the Babylonians, and very possibly the Sumerians before them, who cataloged the skies and their constellations, establishing thus the basics of the calendar and of time that we use today, and who invented the sexagesimal system of mathematics. They turned their attention to sound also, and the Sumerians developed a system of diatonic scales based on alternating fourths and fifths. The Greeks, who took such knowledge from them, devised a diatonic scale based on the ratios of the harmonic series, starting from the eighth partial, a scale today called Just Intonation, one that is still used today by unaccompanied voices and sometimes by bowed string players or wind instruments playing without keyboards. For other instruments, such as lyres and harps, Just Intonation could also serve well, but only until the players wished to change key; as soon as they did so, for reasons more complex than are needed here but are discussed below, chaos would ensue. Nevertheless, despite the purity of such a scale, we know that even the Greeks used other and more complex scales (Barbour, 1951) as, from the anthropological record, did many other peoples. Therefore, despite such transcriptions as we have of the ancient texts above, we can have no certain knowledge of what the music sounded like, for we do not know the exact sizes of the steps of the scales.
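The difficulty with changing key in a scale of pure ratios can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes the familiar five-limit just major scale on C (the ratio table `JUST_MAJOR` is our assumption, not a reconstruction of any ancient tuning) and compares the fifth C–G in the home key with the fifth D–A that would be needed if D became the new tonic.

```python
from fractions import Fraction
from math import log2

# Assumed five-limit just major scale on C, as ratios to the tonic.
# This table is an illustration, not a documented ancient tuning.
JUST_MAJOR = {
    "C": Fraction(1, 1), "D": Fraction(9, 8), "E": Fraction(5, 4),
    "F": Fraction(4, 3), "G": Fraction(3, 2), "A": Fraction(5, 3),
    "B": Fraction(15, 8),
}

def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents to the octave)."""
    return 1200 * log2(ratio)

def interval(low, high):
    """Ratio of the note `high` to the note `low` within the scale."""
    return JUST_MAJOR[high] / JUST_MAJOR[low]

pure_fifth = Fraction(3, 2)
c_to_g = interval("C", "G")      # the fifth above the original tonic
d_to_a = interval("D", "A")      # the fifth above a new tonic on D
shortfall = pure_fifth / d_to_a  # how far D-A falls short of a pure fifth

print("C-G:", c_to_g, f"{cents(c_to_g):.1f} cents")
print("D-A:", d_to_a, f"{cents(d_to_a):.1f} cents")
print("shortfall:", shortfall, f"{cents(shortfall):.1f} cents")
```

Under these assumptions the fifth on C is pure (3/2), while the fifth on D comes out as 40/27, narrower by the ratio 81/80, roughly a fifth of a semitone. On a lyre or harp whose strings are fixed at the first set of pitches, that shortfall is clearly audible.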

Even within Europe the 11th partial, the so-called alphorn fa, halfway between F and F-sharp, appears in vocal music and on bagpipes as well as on natural horns and trumpets; the neutral third, between E and E-flat, also appears, and as we shall see, the third is the most mutable interval in our classical music. In the Balkans, people sing in close seconds rather than wider intervals or unisons.
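The position of the alphorn fa can be checked directly from the harmonic series. The sketch below assumes a fundamental on C, folds each of the first sixteen partials into a single octave, and looks for the one lying closest to the midpoint between an equal-tempered F (500 cents above C) and F-sharp (600 cents); the function name and the choice of equal-tempered reference points are ours.

```python
from math import log2

def folded_cents(partial):
    """Cents of a partial above the fundamental, reduced to one octave."""
    return (1200 * log2(partial)) % 1200

# Equal-tempered F and F-sharp above a C fundamental (assumed reference).
F_ET, F_SHARP_ET = 500, 600
midpoint = (F_ET + F_SHARP_ET) / 2

# Search the first sixteen partials for the one nearest that midpoint.
closest = min(range(2, 17), key=lambda n: abs(folded_cents(n) - midpoint))

print("closest partial:", closest)
print(f"position: {folded_cents(closest):.1f} cents above C")
```

The search settles on the 11th partial, at about 551 cents, almost exactly halfway between the two keyboard notes.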

One thing that the ethnomusicologists can tell us is that either humanity has no inbuilt sense of consonant tonality, or that other people's sense of consonance is different from ours. The musical bow will by its nature produce the pitches of Just Intonation, for all its pitches are the overtones of the harmonic series, but despite this some peoples who use the bow will sing in seven equal steps to the octave. The one interval that does seem to be common to almost all peoples is the octave; this most probably originates with men and women singing in "unison" together, for women's voices tend to be an octave higher than men's. It is also a natural step to recognize when any piece of music extends beyond the range of one octave, and this repetition of scalar steps beyond the octave is built into many woodwind fingering systems.

We have many other examples of other scales that do not use what we, in our culture, may consider to be pure tuning. Let us take just one example that may be familiar to many of us today, the Javanese gamelan. This uses two different scales, *slendro* and *pelog*. Both employ the octave, but neither uses a pure fifth or third, the notes that make up our "common chord." *Slendro* has five almost equal steps to the octave; *pelog* has seven rather less equal steps. Not one of the steps of *slendro* is the same as those of *pelog*. Nor were the *slendro* or *pelog* in Java exactly the same between one gamelan and another, though similar, before the recent days when almost all gamelans are tuned to the pitches used by Radio Yogyakarta.
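To see why a scale of five roughly equal steps cannot contain a pure fifth or third, consider an idealized *slendro* of exactly five equal steps to the octave. This is a simplification made only for the arithmetic: as noted above, real gamelans have steps that are "almost" equal and differ from one set to another.

```python
from math import log2

def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents to the octave)."""
    return 1200 * log2(ratio)

# Idealized slendro: exactly five equal steps (an assumption for illustration).
STEP = 1200 / 5
slendro = [i * STEP for i in range(6)]  # 0, 240, 480, 720, 960, 1200 cents

def nearest(target, scale):
    """Scale pitch (in cents) lying closest to the target pitch."""
    return min(scale, key=lambda pitch: abs(pitch - target))

pure_fifth = cents(3 / 2)  # about 702 cents
pure_third = cents(5 / 4)  # about 386 cents

fifth_error = nearest(pure_fifth, slendro) - pure_fifth
third_error = nearest(pure_third, slendro) - pure_third

print(f"nearest to a pure fifth: {nearest(pure_fifth, slendro):.0f} cents "
      f"({fifth_error:+.1f})")
print(f"nearest to a pure third: {nearest(pure_third, slendro):.0f} cents "
      f"({third_error:+.1f})")
```

In this idealized case the closest pitch to a pure fifth is some 18 cents sharp, and nothing comes within 90 cents of a pure major third, while the octave is exact.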

Nor are the scales of the Near and Middle East compatible with ours (Wizārat al-Tarbiyah wa-al-Ta'līm, 1934). Nor even, save for the octave, are the pitches of Just Intonation the same as those of the Equal Temperament that we use on our pianos today. Each culture develops the tuning system that best suits its ideas of musicality. It is up to the cognitive scientists to determine why this should be so, but they have to admit, if they are willing to listen to the exotic musics of the world, that these differences exist.
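The difference between Just Intonation and Equal Temperament is easily tabulated. The following sketch assumes the usual five-limit ratios for a just major scale (our choice of illustration) and prints, for each degree, how many cents it lies above or below the corresponding piano pitch.

```python
from fractions import Fraction
from math import log2

# Assumed five-limit just major scale, paired below with the number of
# equal-tempered semitones for the same scale degree.
JUST_MAJOR = [
    ("C", Fraction(1)), ("D", Fraction(9, 8)), ("E", Fraction(5, 4)),
    ("F", Fraction(4, 3)), ("G", Fraction(3, 2)), ("A", Fraction(5, 3)),
    ("B", Fraction(15, 8)), ("C'", Fraction(2)),
]
ET_SEMITONES = [0, 2, 4, 5, 7, 9, 11, 12]

def cents(ratio):
    """Size of a frequency ratio in cents (1200 cents to the octave)."""
    return 1200 * log2(ratio)

# Deviation of each just pitch from its equal-tempered counterpart.
deviations = {
    name: cents(ratio) - 100 * semitones
    for (name, ratio), semitones in zip(JUST_MAJOR, ET_SEMITONES)
}

for name, deviation in deviations.items():
    print(f"{name:>2}: {deviation:+6.2f} cents")
```

Only the tonic and its octave coincide; the just third and sixth lie roughly 14 and 16 cents below their equal-tempered equivalents, and even the fifth differs by about 2 cents.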

Let us now return to the history of music and of the instruments on which it was played.

At least we do know what instruments some peoples used in the later millennia BCE, for not only do we have a few survivals in our museums from the Sumerian, Babylonian, Egyptian, Greek, Etruscan, and Roman periods, and also from the Orient, but we also have a wealth of iconography, much of it published in the *Musikgeschichte in Bildern* series by the Deutscher Verlag für Musik in Leipzig from the 1960s onward. This series is, alas, incomplete, for its publication ceased with the reunification of Germany.

We see among the Sumerians and Babylonians lyres and harps of various kinds, the latter quite small, a horizontal or vertical sound box with, at the distal end, a forepillar standing up at 90°, whereas in Egypt harps were normally curved, some of them as tall as the player, others, called the bow harp, were small enough to be held on the shoulder, and these last gradually passed into Central Africa where they are still found today. We see also lutes, a hollowed sound box like a small trough, with the open top covered with a skin to form the belly. A rod acts as the neck and passes through slits in the skin to hold it in place. These also still appear in Africa today. All these instruments were plucked, either with the fingers or a plectrum; the bow, such as we use on our fiddles, was as yet far in the future. There were pipes, usually double, held one in each hand, though sometimes, especially later in Egypt, lashed together so that the fingers of each hand could reach across both pipes. There were occasional drums, some very large, and many forms of rattles. We also see many of these instruments combined into what appear to be ensembles. This use of bands of instruments is confirmed in literature, for example in chapter 3 of the book of Daniel in our Bible where, when all the instruments play together, all those present bow down to the deity. Again in the Bible (II Samuel 6), a band of instruments escorts the Ark of the Covenant to David's city, with David dancing before them to the scorn of his queen.<sup>3</sup> Beware, however, of the huge choirs and groups of instruments in the two books of Chronicles; this is a late account, written long after any of the events it records, and smacks strongly of a child's playground exaggeration: "my brother is bigger and better than yours."

<sup>3</sup>For descriptions of all the instruments, see Montagu (2001).

In ancient Greece, the lyre and the double pipe, the *aulos*, predominated. Lyres came in three forms. The simplest, the *chelys* or *lyra*, had a tortoise-shell body with two vertical curved wooden rods or horns, set in the shell with a third rod running horizontally as the cross bar. The strings were attached at one end to the bottom of the shell and at the other were twisted with *kollopes*, strips of skin, and wound round the horizontal bar. These *kollopes* set firmly enough on the bar to hold a tuning, but could be turned on the bar to retune. This type of lyre was taught to, and used at after-dinner symposia by, all educated people. It traveled up the Nile to the Meroitic people, probably in the Hellenistic period, and eventually throughout East Africa, where it is still used today, with the skin *kollopes* replaced with strips of cloth and the tortoise-shell with a gourd or wooden body as the resonator, and a skin belly. A more elegant form of Greek lyre, with longer curved arms, was called the *barbiton*. The professional musician's version, the *kithara*, was much more elaborate, with a wooden box-body and with what appears to be some form of semi-mechanized tuning devices. All three had gut strings that were normally plucked with a plectrum of wood, bone, or ivory, and all three are seen on many Greek vases and statues.

The *aulos* was a reed-pipe, shorter and somewhat stouter than the Sumerian and Egyptian; whether with a double reed like that of the oboe or a single reed like that of early folk clarinets as in the Near East today, is much argued, but Schlesinger's illustrations clearly show both types, though probably more often with the double reed (Schlesinger, 1939). The *aulos* passed on to Rome, where it was known as the *tibia*, to which quite elaborate tuning mechanisms were applied, with rings that could be turned to close off one hole and open another slightly differently placed, so as to play in a different key or mode. There was also a single pipe, the *monaulos*, and that is still found today, with a large double reed, all down the Silk Road, from Turkey, Kurdistan, and Armenia to China, Korea, and Japan. Whether it traveled east from Greece, or whether it originated in Central Asia like a number of other instruments and then traveled both east and west, is debatable.

That several instruments originated in Central Asia, probably somewhere between Persia and the Caspian Sea, is undoubted. The gong started there and was known in the Near East by St Paul (I Corinthians 13:1) as *chalkos ēchon* (Montagu, 2001, 123). The Chinese encyclopedias said that they got the gong from the West, which also suggests a Central Asian origin. The long trumpet seems to have started there also and it spread across the whole of Asia and to Greece, Etruria, and Rome, and in the Middle Ages through to North Africa as *alnafir* and, with the Moors, up into Spain as the *añafil*, and thence into the rest of Europe, and with the Hausa down into Ghana and Nigeria as the *kakaki*.

According to Al Farabi the Arab *'ud*, that became the lute in medieval Europe, also originated there, and so, around the eighth century CE, did the fiddle bow (Bachmann, 1969). Initially, this was a rough stick or reed scraping the string, but it was not long before it was modified with the strands of horsehair that we still use today.

This at last allowed stringed instruments to produce a sustained sound, something that could emulate the human voice, as all wind instruments had been able to do ever since their introduction.

In the early thirteenth century, and probably a little earlier, there came a revolution in the instruments used in Europe. This seems to have been due to the often-interrupted symbiosis of Moorish, Jewish, and Christian cultures in Spain, and possibly also with some effect from returning Crusaders from the Holy Land. A flood of new instruments appeared, as can be seen in the many miniatures of the *Cántigas de Santa Maria*, a series of poems written by Alfonso X, called El Sabio, the wise.<sup>4</sup> We see there the Arab *'ud*, which became our lute; the small bowed fiddle, the *rebab*, which became our rebec; the reed-blown pipe, the *zamr*, which became our shawm, the ancestor of our oboe; several types of bagpipe; harps with a forepillar; various zithers such as the *qanun* that became our canon and then the psaltery; the transverse flute; other types of lute that became our gitterns and eventually citterns and guitars; the *alnafir* that became the Spanish *añafil* and our long trumpet; pipe and tabor, the pipe played with one hand and the tabor struck with the other, which became a standard one-man band from the Middle Ages into the sixteenth century; the timbre, a frame drum that became our tambourine; and the *naqqere*, two small kettledrums, our nakers, that hung low from the belt in front of the player and eventually became our timpani. Within the ensuing century, these spread all over western Europe and can be seen in a great many medieval manuscripts, church carvings, and other sources.

We know little of the extent to which these instruments played together. There are some group scenes in the *Cántigas*, but mostly, the miniatures show either one instrument or two of the same sort tuning or playing to each other. We do see large groups of instruments in manuscripts of the following centuries, but these are mostly portrayals of biblical scenes or of texts such as psalm 150 and may not represent anything that actually happened in the Middle Ages.

Then, in the fourteenth century, came another revolution, this time an industrial one (Gimpel, 1988). All over Europe, there had been windmills and watermills, primarily for grinding grain, but often also for minor industrial purposes. Now came the idea of siting watermills under the arches of bridges on major rivers, where the flow of water, restricted by the pillars of the bridge, produced far greater force. This powered mills for working metals and, for our purposes, for drawing brass and iron wire to standard quality and in much finer gages than had been available earlier except in softer, and more costly, metals such as silver and gold. The result was strings for harps, psalteries, and dulcimers and thence for keyboard instruments, first the clavichord, which was a keyed development of the monochord, and then the harpsichord. All, as can be seen in the manuscript of Arnault de Zwolle from around 1440, were established by that date (Le Cerf and Labande, 1932).

<sup>4</sup>Escorial Library, Madrid, Ms. T I 1 (sometimes T. J. 1).

The use of keyboards led to a revision of musical pitch and tuning. Just Temperament had served well for unaccompanied voices and some solo instruments, but its inadequacies had now become more apparent. If one depends on the partials of the harmonic series, their ratios make it obvious that the step from 8 to 9 is greater than that of 9 to 10. To avoid using sharps and flats, let us take these pitches as C for 8, D for 9, and E for 10. And for clarity let us use the musicologist's interval-measuring system of cents, analogous to the general use of millimeters for linear measurement. The major tone of 8–9 is 204 cents; the minor tone of 9–10 is 182 cents, and together these make up the third, C to E, of 386 cents. Now if we want to play in C major, all is well, but if instead, we want to start a scale on D, we are in trouble, for where we need a major tone we have only a minor tone. Voices have no trouble with this for they simply shift the D and the E, but for any instrument with strings such as those of a lyre, a harp, or keyboards, the player has to stop and retune all his strings. The problem was already recognized by the ancient Greeks, and it was allegedly Pythagoras who solved the problem and who decided to make all the whole tones the same size, with 204 cents for each. However, adding those together produces a wildly sharp third of 408 cents from C to E, which when used in a common chord with C and G was so intolerable that in the Middle Ages it was regarded as a dissonance. Thus the Pythagorean Temperament was intolerable on the new keyboard instruments, and the music theorist Pietro Aron devised a new temperament in 1523. He returned to the natural third of 386 cents and, taking its mean or average of 193 cents for each whole tone, created the Quartercomma Meantone Temperament. To the modern ear, accustomed to the Equal Temperament of our piano, with its whole tones of 200 cents and semitones of 100 cents, these differences may seem small, but if one listens to music played in other temperaments, it really does sound different; even today a 400-cent third still sounds quite badly out of tune.
This whole subject is quite complex; Barbour (1951) or the article on Temperaments in the *New Grove Dictionary of Music* will give fuller details.<sup>5</sup> The basic problem is that the natural fifth of 702 cents is incompatible with the octave of 1200 cents; if one piles up a sequence of fifths, C to G, G to D, D to A and so on, the series will never return to C, only to a B-sharp about 24 cents higher than C. Somehow those 24 cents, called a comma, have to be brought back into the octave, and this is done, with greater or lesser success, by using one of the various so-called irregular temperaments.
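
The interval sizes quoted above can be checked directly, since the size of a frequency ratio in cents is 1200 times its base-2 logarithm. The short Python sketch below is offered only as an illustration for the reader; the function and variable names are assumptions chosen for clarity and do not come from any of the cited sources. It recomputes the tones and thirds discussed above and the excess of twelve pure fifths over seven octaves, which with exact 3:2 fifths comes to roughly 23.5 cents.

```python
import math


def cents(ratio: float) -> float:
    """Return the size of a frequency ratio in cents (1200 cents = one octave)."""
    return 1200 * math.log2(ratio)


# Steps between partials 8, 9, and 10 of the harmonic series (C, D, E).
major_tone = cents(9 / 8)    # partial 8 to partial 9
minor_tone = cents(10 / 9)   # partial 9 to partial 10
just_third = cents(10 / 8)   # partial 8 to partial 10, the natural third

# Pythagorean tuning: two equal major tones make the third.
pythagorean_third = 2 * major_tone

# Quarter-comma meantone: split the natural third into two equal whole tones.
meantone_tone = just_third / 2

# Twelve pure fifths (ratio 3:2) overshoot seven octaves by a small comma.
pure_fifth = cents(3 / 2)
fifths_comma = 12 * pure_fifth - 7 * 1200

if __name__ == "__main__":
    for name, value in [
        ("major tone", major_tone),
        ("minor tone", minor_tone),
        ("natural third", just_third),
        ("Pythagorean third", pythagorean_third),
        ("meantone whole tone", meantone_tone),
        ("pure fifth", pure_fifth),
        ("comma from twelve fifths", fifths_comma),
    ]:
        print(f"{name}: {value:.2f} cents")
```

Rounded to the nearest cent, the first five values give the 204, 182, 386, 408, and 193 cents used in the discussion above, and the pure fifth rounds to 702 cents.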

We have been neglecting vocal music. This has continued unchecked through the ages. When and how choral music, in our modern sense of song, evolved we do not know, but it had certainly appeared by biblical times and by that of the Greek dramatists. While we have mentioned some early suggested musical notations, music was normally taught by rote or simply by listening to others and joining in. What, if any, types of harmony were used, other than singing in octaves, we cannot know for we have no notation system, other than those early ones mentioned above for a basic melody, until we reach the early church chants. Here, we meet Gregorian and other church chants. These appear initially to have been purely monophonic, with everyone singing in unison. The earliest notation, called neumes, shows musical movement rather than precise pitches, and can only have served as a reminder of how music, already learned by rote, was to proceed. What pitch the music started on would depend on the preferred vocal range of the singers. Not until the thirteenth century do we start to see music written on a staff, then usually on only four lines rather than our present five-line stave, and with a symbol to tell us which line is C, similarly to our own alto or tenor clefs.

By the end of the twelfth century, we have composers such as Perotin writing organum, two or more parallel lines a fifth, fourth, or octave apart, with some slight freedom for each line to ornament a little. Organum probably derives from the organ itself, for while the first organs, which appeared in Alexandria in the second century BCE, were purely monophonic, though with the ability to play a chord, the larger church organs of the ninth or tenth centuries CE, used a system called *Blockwerk*. This meant that each key, when depressed, sounded a chord, a group of fourths or fifths and octaves. We have vivid descriptions of the tenth-century organ of Winchester Cathedral in Britain (Perrot, 1971), and we have surviving pipes from the organ of Bethlehem from the eleventh century of the Latin Kingdom of the Crusaders; the groups of lengths of these pipes show that this organ must also have used *Blockwerk* (Montagu, 2005).

What about secular music? Here, our earliest manuscripts seem to be from the thirteenth century with Adam de la Halle and his contemporaries writing motets for singers, and with anonymous, usually monophonic, dance music. Early polyphony, music in more than one part, was normally based on a cantus firmus, or tenor, often derived from a church chant, around which other, more elaborate parts, were woven. Polyphony of this sort seems to have been a purely European development; other cultures then, and in many cases still, prefer a single line or monophony, or if singing in groups or a single line with accompaniment, using heterophony, people all singing much, but by no means exactly, the same. Later motets might have three or four independent lines, sometimes each with their own text, woven together. These, in the early Renaissance, led to the madrigals and thence to our various styles of choral music today.

How do we define public performance, and how far back does it go? If one defines it as making music where other people can hear you, it must be as early as music ever existed. Any dance, whether Australian corroborees, war or hunting dances, people dancing on the village green, or any other similar occasions, must have involved music of some sort—how else could people keep their movement together? Here, we return to the use of rhythm, and surely to that of concussion or percussion of some sort, whether just body or hand clapping or that of instruments.

<sup>5</sup>There is also a comparatively simple explanation available on my website, jeremymontagu.co.uk, as a download: Montagu (1990).

The shaman has always used music of some sort, often to help to throw him- or herself into the necessary trance. The bard has always been a valued member of society, and has always chanted and sung his lays, and always to self-accompaniment on an instrument. All these were "public" performances, either deliberately or at the very least where other people could hear them. At what stage was music deliberately performed to a public? Dance again, of course, and in religious ceremonies. The Christian church could be considered to be the first concert hall, with all free to enter and to hear the chant and, as time went on, listening to the deliberately composed music for the Mass. The medieval mystery plays were enacted in front of or within the church, and these always included music and were designed deliberately to draw in the public and to show them aspects of their religion.

When did people pay to hear music? Surely, this is part of our definition of public performance. Bards were certainly paid, domestic ones with board and lodging and presumably some cash, and itinerant ones certainly with cash or its portable equivalent, and shamans and medicine-men or -women always with cash or its equivalent, for that was the only way to be sure of a cure rather than a curse.

Formal concerts are said to have begun in Italy with the *Accademia*, meetings of intellectuals and musicians, in the fifteenth century, and private groups of musicians and musically interested people proliferated in many places, coming together to hear their own members playing and/or singing, for example the German *Collegia*. Aristocratic courts had their own orchestras, often merely for prestige, but sometimes, because the prince was himself a composer and musician. All these were private occasions, with admission confined to their members, their friends, and their guests.

Public concerts, with people paying for admission, began first in England perhaps as extensions of the Elizabethan theaters, where again people paid for admission, and which had often included musical performances along with the plays. England had no princely courts such as were common in continental Europe, and it was the first country to grow a middle class educated enough at the many grammar schools to appreciate musical culture and wealthy enough to pay for its pleasures. John Banister, himself a musician, was the first to invite the public to come, pay, and hear his concerts in 1673, and he was famously followed by Thomas Britton, "the small coal man," who opened a room above his shop to paying customers in 1678 and continued to provide weekly concerts for 36 years. Very shortly afterward, the first hall designed for musical performance was opened in London. It seems that in other countries such public performances did not take place until into the eighteenth century, and then in theaters and other improvised places, or out of doors. It was not until 1781 that the Leipzig Gewandhaus was built, the first public concert hall on the Continent.

A more elaborate form of music, the opera, began also as a court entertainment, but it rapidly became a public entertainment for which people paid for admission, probably because the costs of mounting an opera are far greater than chamber or orchestral concerts, and the first public opera house opened in Venice in 1637.

This is as far as we need to go for Europe, but what of the rest of the world? We have historical records and encyclopedias of music for the high cultures of China and India. We have, through archeology, surviving instruments such as the great assembly of Marquis Yi of Zeng in Suizhou, Hubei Province of China (Falkenhausen, 1993; So, 2000).<sup>6</sup> This was found in his tomb of around 433 BCE, and elsewhere in China a set of Neolithic bone flutes was found and widely published. Through the treasures of the great Depository of the Shōsōin in Nara (Shōsōin Office, 1967), we know how the instruments of the Chinese Tang court passed to Japan, and through the work of Laurence Picken and his successors how the music of that court changed in Japan (Picken et al., 1981 ff). All this tells us nothing further of how music began, but it does tell us that music progressed and developed, analogously with our own, in the high cultures of the world.

But, we have little knowledge of how, or even whether, music developed and changed in the rest of the world. We have glimpses, patchily, through the ages due to the iconographical records of some areas that we have mentioned above. We know much that goes on today, thanks to those ethnomusicologists who have been working around the world since the latter part of the nineteenth century, and we are dependent on their work for evidence of any possible sort simply because much of the music and the performances they recorded or described has vanished within our own lifetimes due to the globalized transmission of music. But even with that evidence, to what extent can we project any of it back in time? We could suggest that before the days of European exploration of the rest of the world, from the fifteenth century onward, peoples in sub-Saharan Africa were so isolated within their individual areas that their musics never changed from one generation to another. But that is a nineteenth-century attitude, of the time when Europeans refused to believe that sites such as Great Zimbabwe could ever have been built by African peoples, before the recognition of the great metal workers of West Africa and the high artistic levels of the Nok people or of Benin. I believe that any form of back-projection would be dangerous, whether in Africa or anywhere else in the world. I think that we simply have to say that we do not know and to admit that if *H. sapiens* could progress to such an extent as we know that it did in Europe and the Middle and Far East, so it could have done elsewhere.

<sup>6</sup>This was published fairly briefly as So (2000), and in much greater detail as Falkenhausen (1993).

We do have to say that much traditional music is dying out around the world, driven out by the perceived "superiority" of so-called "Western" music. Throughout the world now, there are symphony orchestras, and even more widely there are all the manifestations of pop and other such musics. Yes, Bach, Beethoven, Mozart, Stravinsky, and others produced great works of music, but so did those of other cultures, and those musics are vanishing and their cultural contexts are dying out and treasures are being lost. And yet tradition manages to cling on, especially in the areas of pop music. West African versions of all the manifold varieties of popular musics do not sound the same as the New York versions. What we hear as "World Music," although heavily influenced by Western instruments and practices, still retains its local connotations and styles. The Soviet idea was that the individual solo performer from the eastern provinces should be replaced with groups on a concert platform, with orchestras of alto, tenor, and bass versions of his or her instrument, yet these still played their own musics in modified versions of their own styles. Music is and always has been created by people. It changes with time, and the ease of travel from the days of trains and steamships, and especially now globalization, has accelerated the rate of change from the nineteenth century onward. But travel, even on foot and in log canoes, has been with us since the Paleolithic and so has inventiveness. Change in music and change in instruments will always be with us, but traditions, however changed, will always survive.

## AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

#### ACKNOWLEDGMENTS

The author is grateful to the reviewers for their insightful comments.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at http://journal.frontiersin.org/article/10.3389/fsoc.2017.00008/full#supplementary-material.

#### REFERENCES

Schlesinger, K. (1939). *The Greek Aulos*. London: Methuen.


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2017 Montagu. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Simultaneous Cooperation and Competition in the Evolution of Musical Behavior: Sex-Related Modulations of the Singer's Formant in Human Chorusing

#### Peter E. Keller<sup>1,2</sup>\*, Rasmus König<sup>2</sup> and Giacomo Novembre<sup>3</sup>

<sup>1</sup> MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Sydney, NSW, Australia, <sup>2</sup> Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany, <sup>3</sup> Department of Neuroscience, Physiology and Pharmacology, University College London, London, United Kingdom

#### Edited by:

Leonid Perlovsky, Harvard University and Air Force Research Laboratory, United States

#### Reviewed by:

McNeel Gordon Jantzen, Western Washington University, United States; Jessica Phillips-Silver, Georgetown University Medical Center, United States

\*Correspondence:

Peter E. Keller p.keller@westernsydney.edu.au

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 21 March 2017; Accepted: 28 August 2017; Published: 14 September 2017

#### Citation:

Keller PE, König R and Novembre G (2017) Simultaneous Cooperation and Competition in the Evolution of Musical Behavior: Sex-Related Modulations of the Singer's Formant in Human Chorusing. Front. Psychol. 8:1559. doi: 10.3389/fpsyg.2017.01559

Human interaction through music is a vital part of social life across cultures. Influential accounts of the evolutionary origins of music favor cooperative functions related to social cohesion or competitive functions linked to sexual selection. However, work on non-human "chorusing" displays, as produced by congregations of male insects and frogs to attract female mates, suggests that cooperative and competitive functions may coexist. In such chorusing, rhythmic coordination between signalers, which maximizes the salience of the collective broadcast, can arise through competitive mechanisms by which individual males jam rival signals. Here, we show that mixtures of cooperative and competitive behavior also occur in human music. Acoustic analyses of the renowned St. Thomas Choir revealed that, in the presence of female listeners, boys with the deepest voices enhance vocal brilliance and carrying power by boosting high spectral energy. This vocal enhancement may reflect sexually mature males competing for female attention in a covert manner that does not undermine collaborative musical goals. The evolutionary benefits of music may thus lie in its aptness as a medium for balancing sexually motivated behavior and group cohesion.

Keywords: music, vocal expression, singer's formant, evolution, non-verbal communication

## INTRODUCTION

Music is a social communicative art form whose pervasiveness across human cultures suggests convergent evolutionary origins (McDermott and Hauser, 2005; Fitch, 2006; Merker et al., 2015). Although these origins remain as mysterious today as they were in antiquity (McDermott, 2008; Morley, 2013), the comparative study of communicative signaling in human and non-human species has identified two broad classes of factors, competitive and cooperative, that could have driven the evolution of musical behavior (Brown, 2000; Mithen, 2005; Cross and Morley, 2009; Hagen and Hammerstein, 2009; Merker et al., 2015).

Competition for mates and territory in a variety of non-human animal species (e.g., songbirds and humpback whales) has selected for music-like signaling behavior that increases an individual's prospects for reproduction and survival by attracting sexual partners and repelling rivals (Catchpole and Slater, 1995; Marler, 2000; Fitch, 2006; Smith et al., 2008). In the case of birdsong, playback experiments with European starlings, wrens, and flycatchers have shown that females are attracted to audio recordings of conspecific male vocalizations (Eriksson and Wallin, 1986; Mountjoy and Lemon, 1991; Johnson and Searcy, 1996). Related work with dunnocks has found that female responsiveness is greatest when the recording is of an alpha male, the song is complex, and the female is in her fertile period (Wiley et al., 1991). Furthermore, studies of song production in great reed warblers have demonstrated that males sing longer, more complicated songs when advertising for females than when already paired (Catchpole, 1983). Playback of such complex male songs elicits increased sexual display behavior in females (Catchpole et al., 1986), and males who sing more complex songs attract more females and produce more offspring (Catchpole, 1986). Song-like signaling may also play a role in attracting potential mates in some marine mammals. For example, the incidence of singing in male humpback whales is greater in the presence of unescorted mother-calf pairs (in which the adult female provides a mating opportunity) than for mother-calf pairs who are already escorted by another male (Smith et al., 2008). By analogy, it has been proposed—originally by Darwin (1871)—that the genesis of human music lies in courtship displays that evolved through sexual selection fuelled by competition between individuals (Miller, 2000).

Cooperative accounts provide an alternative by focusing on benefits to the individual gained through group membership. Group music making is a vital part of social life across the world's cultures, with individuals frequently coming together to communicate emotions and expressive intentions through rhythmically coordinated body movements and sounds (Merriam, 1964; Blacking, 1973; Nettl, 1983). In other social animals, the demands of cooperating with conspecifics have selected for collective displays where the inter-individual coordination of signals in space and time facilitates pair bonding [e.g., gibbon "duets" (Geissmann, 2000) and whistle matching in dolphins (Janik, 2000)] and group cohesion [e.g., call-and-response acoustic signaling in bats (Chaverri et al., 2012)]. Social accounts of the origins of human music hold that rhythmic coordination between individuals engaged in collaborative musical activities similarly facilitates cooperative behavior by promoting interpersonal bonding and feelings of mutual affiliation, trust, and commitment (Brown, 2000; Huron, 2001; Kirschner and Tomasello, 2010; Tarr et al., 2014; Weinstein et al., 2016). Practices involving interacting with others through music thus buttress human society and culture by exploiting the basic capacity for rhythmic interpersonal coordination to foster pro-social behavior (Lomax and Berkowitz, 1972; Mithen, 2005; Cross, 2012; Morley, 2013; Launay et al., 2016).

Accounts based on competition or cooperation alone, however, are not sufficient for dealing with the extensive palette of musical behaviors in the human repertoire, ranging from soothing lullabies, through festive folksongs and magnificent ceremonial and "art" music, to aggressive war dances and rap battles (Merriam, 1964; Nettl, 1983; Clayton, 2009; Clayton et al., 2012). An alternative is that human music serves both cooperative and competitive functions (Brown, 2000). On this view, music does not function solely in sexual or natural selection at the level of the individual or as a vehicle for group selection, but may rather play roles at both the level of the individual and the group.

The motivation for this unifying view comes from multiple angles. To begin with, it has been argued that sexual selection alone is not viable because music is rarely used as an obvious courtship display and is not a sexually dimorphic trait in humans (Brown, 2000). Natural selection is more generally problematic because the precise sense in which music serves to adapt individuals to their environment, and thereby enhance their chance of survival, is obscure (Fitch, 2006). It has, however, been countered that pure cooperative accounts are questionable because they rely on group selection, which some evolutionary theorists consider to be a weaker explanatory principle than individual selection (Miller, 2000).

To circumvent these drawbacks, unifying accounts of human musical behavior postulating multilevel selection of both cooperative and competitive functions have been advanced. One such account holds that coordinated displays in music and dance are beneficial in communicating information about coalition quality within a group to potential allies and competitors outside the group (Hagen and Bryant, 2003). Another unified account posits that music can function as a reward system that reinforces vital ritualistic group behaviors under some circumstances and as courtship or fitness displays in others (Brown, 2000). A strong version of the hypothesis that music functions both cooperatively and competitively, which we advance here, is that music is capable of doing so simultaneously, thus supporting different forms of communication in parallel at the level of the group and the individual. This dual-function hypothesis is motivated by work on rhythmically coordinated communal displays in non-human species (Merker, 2000).

Although group music making can be viewed as an exalted human activity, it is only one amongst several stunning examples of rhythmic communal signaling in nature (Bailey, 1991; Merker, 2000; Ravignani et al., 2014; Merker et al., 2015). Bioluminescent fireflies flash together (Buck, 1988; Branham and Greenfield, 1996), fiddler crabs collectively wave their oversized claws (Murai and Backwell, 2006), fish drum their swim bladders in concert (Locascio and Mann, 2011), and cicadas (Hoppensteadt and Keller, 1976), crickets (Walker, 1969; Snedden and Greenfield, 1998), and frogs (Whitney and Krebs, 1975; Partridge and Krebs, 1978) produce choruses of sounds wherein coordination ranges from strict synchrony to relatively flexible alternation (Greenfield, 2005; Ravignani et al., 2014).

These non-human displays, which are typically produced by congregations of males, fulfill functions related to courtship and conflict avoidance (Greenfield, 1994). By coordinating their signals, group members maximize the amplitude of their collective broadcast and thereby increase its salience and geographic reach, resulting in a "beacon" effect (Buck and Buck, 1966) that enhances the ability to attract distant females, repel territorial adversaries, and confuse predators (Greenfield, 2005). While such coordinated behavior appears cooperative, research on communal sexual displays suggests that, in a number of species, it arises from the operation of competitive mechanisms by which individual males jam (i.e., block or interfere with) rival signals (Whitney and Krebs, 1975; Greenfield and Roizen, 1993; Greenfield, 1994).

On this account, males who signal faster than their neighbors when competing for female attention gain an advantage by producing relatively early signals that effectively mask following signals (Otte and Loftus-Hills, 1979; Greenfield, 1994). Precise temporal coordination between males nevertheless ensues when leading calls trigger a call in a neighbor after a fixed phase delay or a rebound interval associated with inhibitory resetting of a neural pacemaker (Buck, 1988; Greenfield and Roizen, 1993), with globally stable patterns of interaction occurring when variations in individual signaling rates cause switches in momentary leadership (Greenfield et al., 1997). Group coordination under such circumstances is thus an epiphenomenon resulting from competitive mechanisms selected by general female preferences for male signals that contain greater energy due to delivery at a fast rate or high intensity (Greenfield, 2005).

The current case study tested the hypothesis that human music is likewise capable of simultaneously serving cooperative and competitive functions. To this end, we investigated the effect of introducing females into an otherwise male audience during repeated performances of a short concert performed by an internationally acclaimed male choir, the St. Thomas Choir of Leipzig, famed for its 800-year tradition of excellence under directors including J. S. Bach. Consistent with the dual-function hypothesis, it was expected that the presence of females might elicit vocal embellishments that allow individual males to compete for female attention in a manner that does not undermine ensemble cohesion. Moreover, this effect should be most prevalent in sexually mature males who have already undergone puberty-related maturational changes that deepen the voice, that is, in members of the tenor and bass sections of the choir.

To detect the anticipated vocal embellishments, we recorded individual singers from the choir and then conducted acoustic analyses of spectral energy, sound intensity, performance tempo, and tone onset timing. It was expected that spectral energy (which affects vocal tone quality) would be a likely locus of effects of female presence, as vocal spectrum can be varied without influencing performance parameters such as tuning and timing that could compromise ensemble balance and coordination. Furthermore, aspects of performance that affect ensemble cohesion in terms of timing and balance may vary across repeat performances due to practice effects. To take these effects into account, we measured the degree of tempo matching across parts, interpersonal asynchrony, and balance in terms of relative loudness. Modulations of these indices of ensemble cohesion were not expected, given the extensive experience of the St. Thomas Choir with group performance in general and with the specific pieces that were sung.

Instead, we were particularly interested in a high frequency band of the voice's spectrum, known as the "singer's formant" (∼2,500–3,500 Hz). Increased energy in the singer's formant adds carrying power and brilliance to the voice by emphasizing a frequency region to which human hearing is maximally sensitive (Sundberg, 1974, 1999). In accompanied solo singing, a prominent singer's formant helps the voice to stand out from background instrumental sounds, thus allowing opera singers to project their voices above loud orchestras, which contain relatively little energy in this frequency region (Sundberg, 1987). The use of the singer's formant in choral singing is a matter of debate. Although enhancement of the singer's formant is not typically advocated in choral singing (where the focus is on the blending of voices; Ternström, 1991, 2003), it can nevertheless be observed in choral performance under some circumstances (especially when the singers are also experienced soloists; Rossing et al., 1986).

## MATERIALS AND METHODS

#### Participants: The St. Thomas Choir

Sixteen members of the St. Thomas Choir of Leipzig, a boys' choir in Germany, participated in the study (N = 16): Four sopranos, four altos, four tenors, and four basses. Sopranos were aged 12 (n = 2) and 13 years (n = 2); altos were aged 12 (n = 2), 13 (n = 1), and 16 years (n = 1, singing falsetto); tenors were aged 16 (n = 2) and 18 years (n = 2); basses were aged 16 (n = 1), 17 (n = 2), and 19 years (n = 1). All singers were naïve to the purpose of the study. A senior member of the choir (a tenor, aged 18 years) who held the position of prefect and was also naïve to the purpose of the study served as the conductor during the recording session.

All participants had thorough musical training, high levels of vocal skill, and extensive experience singing together. The St. Thomas Choir is an acclaimed ensemble that rehearses daily, tours globally, produces commercial recordings, and maintains a schedule of three performances per week that attract crowds of tourists to the St. Thomas Church in Leipzig (http://www.thomanerchor.de). The choir was established in 1212 and has run continuously since then under the leadership of esteemed cantors (musical directors), including Johann Sebastian Bach. Members of the choir all attend the St. Thomas School, a public boarding school, from the fourth to twelfth class. The music training that choir members receive during this period includes weekly singing and instrumental lessons, music theory classes, instruction in phonation, daily individual practice sessions, and rehearsal of the choir in full and in sections.

At the time of data collection for the current study, the St. Thomas Choir included 103 members in total. The 16 students who participated in our study were recruited following earlier consultation with the cantor's assistant, who was asked to recommend the best available choristers. Thus, although the participant sample is small, it comprises elite individuals with highly honed vocal and ensemble skills. Participants' parents provided written informed consent for their sons to take part in the study, which was run in accordance with the Declaration of Helsinki and a protocol approved by the ethics committee of the University of Leipzig.

#### Materials

The materials consisted of two distinctive pieces of choral music composed by Johann Sebastian Bach. One (a chorale) requires strict synchrony between soprano, alto, tenor, and bass sections of the choir and the other (a fugue) is characterized by greater rhythmic independence between voices. Both pieces are in the standard repertoire of the St. Thomas Choir, and all participants were familiar with the pieces and had sung them on numerous previous occasions.

The chorale "Du heilige Brunst, süßer Trost [You holy zeal, sweet consolation]" is from the motet "Der Geist hilft unser Schwachheit auf [The Spirit gives aid to our weakness]" (BWV 226) composed by Bach in Leipzig in 1729. The chorale setting is 24 bars long, in quadruple meter (with 4 quarter-note beats per bar), homophonic in texture (although parts contain passing notes), and, with its stately rhythm and holy text (from a Pentecostal hymn written by Martin Luther), may be described as "reverent" in character. The fugue comes from the motet "Singet dem Herrn ein neues Lied [Sing unto the Lord a new song]" (BWV 225), first performed in Leipzig around 1727. The fugue, a setting of the biblical text "Alles, was Odem hat, lobe den Herrn [All that have voice, praise the Lord]" from Psalm 150:6, spans 112 bars (from bar 256 to the end of the work), is in triple meter (with 3 eighth-note beats per bar), is polyphonic in texture, and has a "lively" character featuring rapid runs of sixteenth notes cascading through the choral parts.

Both pieces are written in the key of B-flat major with four vocal parts: soprano, alto, tenor, and bass. For the chorale, the soprano part contains 95 notes ranging in pitch from F4 to G5 (349–784 Hz), the alto part has 104 notes ranging from C4 to D5 (262–587 Hz), the tenor part has 110 notes ranging from E-flat3 to F4 (156–349 Hz), and the bass part has 120 notes ranging from B-flat2 to D4 (117–294 Hz). For the fugue, the soprano part has 320 notes (F4 to B-flat5; 349–932 Hz), the alto part has 334 notes (B-flat3 to D5; 233–587 Hz), the tenor part has 369 notes (F3 to A4; 175–440 Hz), and the bass part has 419 notes (G2 to E-flat4; 98–311 Hz).
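The frequency values quoted for each pitch follow from twelve-tone equal temperament referenced to A4 = 440 Hz. The following is a minimal sketch under that assumption (the source does not state the pitch standard actually used by the choir); the function name `note_to_hz` and the note spelling (`Bb2` for B-flat2) are illustrative choices, not part of the study.

```python
# Semitone offsets of the natural notes within an octave (C = 0).
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_hz(name, a4_hz=440.0):
    """Convert a note name such as 'F4' or 'Bb2' to a frequency in Hz.

    Assumes equal temperament and a hypothetical spelling in which a
    'b' after the letter marks a flat and '#' marks a sharp.
    """
    letter, rest = name[0].upper(), name[1:]
    offset = SEMITONES[letter]
    if rest.startswith("b"):
        offset -= 1
        rest = rest[1:]
    elif rest.startswith("#"):
        offset += 1
        rest = rest[1:]
    octave = int(rest)
    midi = 12 * (octave + 1) + offset  # MIDI note number (A4 = 69)
    return a4_hz * 2 ** ((midi - 69) / 12)

# Range of the chorale soprano part and lowest note of the chorale bass part.
chorale_soprano_range = (round(note_to_hz("F4")), round(note_to_hz("G5")))
chorale_bass_low = round(note_to_hz("Bb2"))
```

Under these assumptions, F4 and G5 round to 349 and 784 Hz, and B-flat2 rounds to 117 Hz, in line with the chorale ranges given for the soprano and bass parts.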

#### Procedure

The study took place in the chamber music auditorium at Castle Colditz in Germany, where the St. Thomas Choir was staying for their annual summer camp. The auditorium contained a stage area, which was large enough to accommodate the choir comfortably, and six rows of tiered seating. The procedure was completed in a single 1-h session. Prior to commencing the session, the 16 members of the choir and the conductor were assembled and introduced to the researchers. They were told that the researchers were studying choral singing and were specifically interested in the level of balance, timing precision, and intonation achieved in a top choir. Members of the choir and the conductor were given the musical scores for the two pieces and were told that the upcoming task involved singing three repeat performances of the pieces.

Choir members were then fitted with individual head-worn wireless microphones (Sennheiser ME 3). Signals picked up by the microphones were sent via radio transmitters (Sennheiser SK 500) to a receiver (Sennheiser EW 500) connected to a stage unit (Roland S-1608) with built-in analog-to-digital converter and microphone preamplifiers. The stage unit routed the audio to a digital mixer (Roland M-300), which was used to record each singer's voice onto a separate channel with digital audio software (Cakewalk Sonar X1 Producer) running on a laptop computer (Dell Inspiron with REAC driver). The equipment was configured to record audio in each channel in mono at a sampling rate of 48 kHz in .WAV format. In addition to this multi-track audio recording, two digital video cameras (Sony HDR-HC7) were used to record the performances. One camera was positioned at the rear of the auditorium and was oriented in such a way that it captured a frontal view of the choir. The second camera, which was placed to the side of the choir, recorded a frontal view of the audience.

Once the audio setup was complete, choir members were asked to stand in two rows, as per their conventional formation, with the front row consisting of the four sopranos on the left and the four altos on the right, and the back row consisting of the four tenors on the left (behind the sopranos) and the four basses on the right (behind the altos). The conductor was positioned in front of the choir, facing the singers (and could therefore not see the audience while conducting). Once the choir was in position, a sound check was performed to verify that each channel was receiving a clear signal and to optimize recording levels. The conductor was then asked to run through each piece once with the choir in preparation for the recording session.

The choir was informed that there would be researchers around throughout the recording session, as well as a small audience to create a natural performance situation. The choir was also told that a "school group" was due to arrive for a castle tour and would join the audience for a performance and would be taking notes for a school magazine article. This school group consisted of four females who constituted our manipulation (described below). The choir was instructed to sing as they normally would in a concert performance.

In the recording session, the choir performed the two pieces, one after the other, three times, with each run-through of the two pieces constituting a separate "concert." Recordings of the performances are provided as Supplementary Audio. The chorale was performed before the fugue in each concert. Successive concerts were separated by an intermission lasting 1–3 min, during which the composition of the audience was changed. The first of the three concerts was sung to a small audience composed of adult male listeners. Four adolescent females were added to the audience for the second concert (which also included the males). The females then departed before the third concert, which was sung to the original male audience. (While we would have liked to vary the type of additional audience members systematically in a larger study design, including younger and older females and males of corresponding ages, such extra manipulations were precluded by the limited availability of the St. Thomas Choir.)

The male audience members were four staff members of the St. Thomas School (aged between 35 and 55 years). Also present in the auditorium were two adult male researchers (aged 28 and 42 years), two adult female research assistants (aged 26 and 24 years), and an adult male recording engineer (aged 27 years) and photographer (aged 28 years). The adolescent female audience members were aged 15 (n = 2) and 16 years (n = 2). These females were relatives of one of the female research assistants, who invited them to attend the recording session. The females were naïve with regard to the goals of the study. Prior to entering the auditorium, each female was given a notepad and pen and instructed to make a written record of their impressions of the choir's performance during the concert. This task was intended to occupy the females and decrease the likelihood that they would engage in behavior that could possibly distract the choir. Examining the video recordings revealed no such behavior during the performances.

At the conclusion of the recording session, each member of the choir completed a questionnaire to evaluate his subjective impressions of the performances and views on choral singing. Questionnaire items addressed (1) the level of concentration achieved while singing, (2) the focus of attention (on self, neighboring singers, whole choir, conductor, and audience), (3) the degree to which the audience is focused upon generally in concerts, (4) the degree to which the audience proved distracting during the recording session, (5) whether the presence of girls in the audience led to changes in the individual's own performance, and (6) which of the three concerts went the best (see **Appendix**). Item 5, on effects of the girls, included several options that could be selected, including one that directly probed whether the respondent attempted to attract the girls' attention. The conductor was asked which performance he thought went the best.

#### Data Analysis

Prior to analysis, the audio recordings of individual voices for each performance of each piece were extracted from the continuous recordings of the full session. The extraction process involved opening the recordings of individual voices simultaneously in separate channels in Cubase LE 6 digital audio software, and then cutting the 16 tracks in such a way that they all started 1 s prior to the onset of the first vocal sound and ended 1 s after the offset of the final vocal sound in each performance for each piece. These individual voice recordings, thus synchronized across choir members, were then subjected to acoustic analysis using MIRtoolbox 1.6.1 (Lartillot et al., 2008), a Matlab toolbox for the extraction of musical features from audio files. The following features were analyzed: energy in the singer's formant spectral region (2,500–3,500 Hz), sound intensity (to assess ensemble balance), global performance tempo, variability in local tempo, and relative note-onset timing (to assess the accuracy and precision of ensemble coordination). This latter analysis also made use of a full choir version for each performance created by mixing the 16 individual voice channels into a single mono file in .WAV format. The feature extraction process is described below.

#### Spectral Energy

For data visualization, the spectral envelope was extracted from each individual voice file using the "mirspectrum" function of MIRtoolbox to decompose the signal into its frequency components via Fast Fourier Transform, and then using the "mirenvelope" function to fit a curve tracing the amplitude of each frequency component of this spectrum. The "Terhardt" option was used with "mirspectrum" to modulate the energy following the Terhardt (1979) outer ear model, which emphasizes frequencies around 2,000–5,000 Hz, at which human hearing is particularly sensitive.

For statistical analysis, the proportion of total energy in the 2,500–3,500 Hz spectral region of each individual voice file was computed using the "mirbrightness" function of MIRtoolbox. This function calculates the proportion of energy (0–1) above a user-specified cut-off frequency. We applied the "mirbrightness" function twice to each file, once with a cut-off frequency of 2,500 Hz and then with a cut-off frequency of 3,500 Hz, and subsequently computed the difference to yield the proportion of total energy between 2,500 and 3,500 Hz. A proportional measure was chosen because the relative amount of energy in high frequency regions is a reliable correlate of perceived timbre in vocal expression (von Bismarck, 1974; Laukka et al., 2005).
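The two-cut-off subtraction described above can be sketched as follows. This is not MIRtoolbox code: it is a pure-Python illustration that assumes energy is measured as squared DFT magnitude (MIRtoolbox's internal definition may differ), and the names `power_spectrum`, `brightness`, and `band_proportion` are hypothetical helpers. A naive DFT is used only to keep the sketch dependency-free; a practical analysis of full recordings would use an FFT.

```python
import cmath
import math

def power_spectrum(signal, sample_rate):
    """Return (frequencies, power) for the non-negative-frequency DFT bins."""
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2)
    return freqs, power

def brightness(freqs, power, cutoff_hz):
    """Proportion (0-1) of total spectral energy at or above cutoff_hz."""
    total = sum(power)
    above = sum(p for f, p in zip(freqs, power) if f >= cutoff_hz)
    return above / total if total > 0 else 0.0

def band_proportion(signal, sample_rate, low_hz=2500.0, high_hz=3500.0):
    """Energy proportion between low_hz and high_hz, as a difference of
    two brightness values (one per cut-off)."""
    freqs, power = power_spectrum(signal, sample_rate)
    return brightness(freqs, power, low_hz) - brightness(freqs, power, high_hz)

# Synthetic test signal (not study data): equal-amplitude tones at 1,000 Hz
# and 3,000 Hz, sampled at 48 kHz for 10 ms.
SR = 48000
tone = [math.sin(2 * math.pi * 1000 * i / SR) +
        math.sin(2 * math.pi * 3000 * i / SR) for i in range(480)]
proportion = band_proportion(tone, SR)
```

For this synthetic signal, half of the energy lies at 3,000 Hz, so the computed proportion in the 2,500–3,500 Hz band is approximately 0.5.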

#### Sound Intensity

Global sound intensity was measured by calculating the root mean square of the amplitude profile of each individual voice file using the "mirrms" function from MIRtoolbox. This analysis takes into account energy at all frequencies of the sound spectra, and therefore allows overall ensemble balance in terms of relative intensity of different voice parts to be assessed. The ANOVAs on these data allowed us to test whether female presence affected ensemble balance by changing intensity in some voices relative to others. This could occur if, for example, distraction caused by female presence disturbed the so-called "self-to-other ratio," which reflects the degree to which an individual can hear their own sounds amongst co-performers' sounds (Ternström, 2003).
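As a minimal illustration of the intensity measure, the root mean square of a sampled waveform can be computed as below. The function name `rms` and the synthetic sine tone are assumptions for demonstration only; this sketch does not reproduce any framing or normalization that "mirrms" may apply.

```python
import math

def rms(samples):
    """Root mean square of a sequence of amplitude samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

# Synthetic example (not study data): ten full periods of a 1,000 Hz sine
# with peak amplitude 0.5, sampled at 48 kHz.
SR = 48000
tone = [0.5 * math.sin(2 * math.pi * 1000 * i / SR) for i in range(480)]
tone_rms = rms(tone)  # for a sine, RMS = peak amplitude / sqrt(2)
```

Comparing such values across voice sections gives the relative intensities on which an assessment of ensemble balance can be based.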

#### Global and Local Tempo

Global tempo was estimated for each individual voice file using the "mirtempo" function in MIRtoolbox. This function detects note onsets based on the amplitude envelope of the audio signal, then searches for periodicities in the note-onset time series for a range of tempi (40–200 beats-per-minute [bpm]), and finally selects the tempo yielding the maximum periodicity score. Variability in local tempo was assessed by calculating the standard deviation of time points in the tempo map for each file. The "Metre" option of "mirtempo" was employed in this analysis to take into account variability at metrical levels other than the beat period (Lartillot et al., 2013).

#### Relative Note-onset Timing

Note-onset times were estimated for each individual voice file, as well as for the mixed full choir file for each performance, using the "mironsets" function in MIRtoolbox. This function applies a peak-picking algorithm to an onset detection curve computed based on the audio signal's amplitude envelope. The "SpectroFrame" option was used in this analysis, with the frame length and hop factor both set to 150 ms to avoid the detection of spurious onsets. This value was chosen based on the assumption that, given the notated rhythms in the pieces and the tempi at which they were performed, successive note onsets would be separated by at least 150 ms.

To assess ensemble coordination, asynchronies between note onsets in each individual voice file and onsets in the mixed full choir file were computed for each performance. These asynchronies represent temporal deviations of estimated note onsets produced by each singer from estimated onsets in the choir's collectively produced sound, which served as a temporal frame of reference for the current analysis. Asynchrony series were generated with an algorithm that, for each note onset in a given individual voice file, searched for the nearest onset in the mixed full choir file and, if these two onsets were separated by <150 ms, recorded the asynchrony. For each asynchrony series, the median asynchrony was taken as a measure of synchronization accuracy and the standard deviation of the asynchronies was taken as a measure of synchronization precision.
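The nearest-onset matching described above can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: the function name `asynchronies`, the sign convention (voice onset minus choir onset, so that negative values mean the singer leads), the use of the sample standard deviation, and the onset times in the example are all assumptions.

```python
import statistics
from bisect import bisect_left

def asynchronies(voice_onsets, choir_onsets, window_ms=150):
    """Signed asynchronies (ms) between each voice onset and the nearest
    choir onset, keeping only pairs separated by less than window_ms."""
    choir = sorted(choir_onsets)
    result = []
    for t in voice_onsets:
        i = bisect_left(choir, t)
        # The nearest choir onset is either just before or just after t.
        candidates = choir[max(i - 1, 0):i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda c: abs(c - t))
        diff = t - nearest
        if abs(diff) < window_ms:
            result.append(diff)
    return result

# Hypothetical onset times in milliseconds (not study data).
choir_onsets = [0, 500, 1000, 1500]
voice_onsets = [-10, 495, 1020, 1900]  # the last onset has no match within 150 ms

series = asynchronies(voice_onsets, choir_onsets)
accuracy = statistics.median(series)   # synchronization accuracy
precision = statistics.stdev(series)   # synchronization precision
```

In this example the unmatched onset at 1,900 ms is discarded, leaving three asynchronies whose median is negative, indicating a singer who tends to lead the choir slightly.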

#### Statistical Tests

Data for spectral energy were entered into a (2 × 2) × 4 mixed Analysis of Variance (ANOVA) with female presence (present, absent) and piece (chorale, fugue) as within-participant factors, and voice type (bass, tenor, alto, soprano) as a between-participants factor. The criterion for statistical significance was set at α = 0.05 in all analyses reported here. Significant interaction effects were followed up with two-tailed paired-samples t-tests comparing spectral energy when females were present vs. absent for each voice type separately. Given the stylistic differences between the two pieces, affecting musical texture, dynamics, rhythm, and tempo, data for each of the remaining features were entered into a separate (2) × 4 ANOVA, with factors female presence (present vs. absent) and voice type (bass, tenor, alto, soprano), for each piece.
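For readers unfamiliar with the follow-up test, the paired-samples t statistic is the mean of the within-singer differences divided by its standard error, with n − 1 degrees of freedom. The sketch below computes only the statistic; the function name `paired_t` and the input values are made up for illustration and are not data from the study.

```python
import math
import statistics

def paired_t(present, absent):
    """Paired-samples t statistic and degrees of freedom.

    Each position in the two lists refers to the same participant
    measured under the two conditions.
    """
    diffs = [p - a for p, a in zip(present, absent)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

# Hypothetical scores for four singers in one voice section (arbitrary units).
present = [5, 7, 6, 9]
absent = [4, 4, 5, 5]
t_value, df = paired_t(present, absent)
```

With four singers per section, each such test has 3 degrees of freedom; a two-tailed p value would then be obtained from the t distribution, which is omitted here to keep the sketch within the standard library.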

## RESULTS AND DISCUSSION

The results reported below address the effects of the presence of female audience members on individual voices recorded from the soprano, alto, tenor, and bass sections of the St. Thomas Choir during performances of two distinctive musical pieces. Results for spectral energy in individual boys' voices (focusing on the 2,500–3,500 Hz "singer's formant" region) are presented first, followed by sound intensity (to assess ensemble balance), global performance tempo, variability in local tempo, and relative tone-onset timing (to assess the accuracy and precision of ensemble coordination).

#### Spectral Energy

Spectral energy was computed for individual singers from audio signals for each performance of each piece. Time-averaged mean spectral magnitudes for sopranos, altos, tenors, and basses (4 singers per part), averaged across the two pieces (chorale and fugue), for performances with female audience members present vs. absent are presented in **Figure 1**. For display, the plot shows output of the Terhardt (1979) outer ear model for audio signal inputs at frequencies up to 4,000 Hz. Here it can be seen that there is a relatively high peak in energy in a high frequency band (∼2,500–3,500 Hz, marked by vertical dotted lines) of the vocal spectrum, corresponding to the singer's formant, in the bass section.

FIGURE 1 | Effect of female presence on spectral energy in male voices singing in chorus. (A) Time-averaged spectra for choir voice sections singing with adolescent females present in the audience. Output of the Terhardt outer ear model (see Section Spectral Energy) is shown for audio signal inputs at frequencies up to 4,000 Hz in sopranos, altos, tenors, and basses. Audio was recorded with individual head-worn microphones as the choir performed two distinctive pieces. Mean spectral magnitude is shown for each voice section (4 singers per section), averaged across the two pieces. There is a relatively high peak in energy in the singer's formant region (2,500–3,500 Hz) for basses. (B) Average spectra for voices when singing the same pieces to a male audience with females absent. The energy peak in basses in the singer's formant region is weaker than when females were present. (C) The difference between spectral magnitudes when females were present vs. absent, highlighting the singer's formant increase in basses.

For statistical analysis, the proportion of total energy in the 2,500–3,500 Hz singer's formant region of each individual voice file was calculated (on the raw audio signal, without applying the Terhardt model). **Figure 2A** shows the proportion of energy in this region for individual singers in each voice section for the three performances of the two pieces. Performance 1 was sung to the male audience, performance 2 to the audience with additional females, and performance 3 to the male audience again. For both pieces, there is an increase in energy from performance 1 to 2 (when females were present), followed by a decrease in performance 3, for each of the four basses, but no such consistent pattern in other voice sections. **Figure 2B** highlights the main result by displaying the proportion of total energy in the singer's formant region for each voice part (averaged across individual singers and pieces) when females were present vs. absent.

An ANOVA testing for effects of female presence (present, absent), piece (chorale, fugue), and voice type (bass, tenor, alto, soprano) on proportion of energy in the singer's formant range revealed statistically significant main effects of female presence [F(1, 12) = 15.191, p = 0.002], piece [F(1, 12) = 25.180, p < 0.001], and voice type [F(3, 12) = 5.634, p = 0.012], as well as a significant interaction between female presence and voice type [F(3, 12) = 8.440, p = 0.003]. The remaining interaction effects were not statistically significant (ps > 0.05).

The main effect of female presence indicates that energy in the 2,500–3,500 Hz singer's formant region was generally higher when females were present than absent. This effect was, however, qualified by the interaction between female presence and voice type, suggesting that, as can be seen in **Figure 2B**, the bass section was primarily responsible for the increase in energy observed when females were present. Paired-samples t-tests confirmed that the effect of female presence was statistically significant for the basses [t(3) = 4.917, p = 0.016] while reliable effects were not observed for tenors (p = 0.462), altos (p = 0.487), or sopranos (p = 0.554).

FIGURE 2 | Effects of female presence on spectral energy (based on audio signals) in the voices of individual male singers for two distinctive pieces of choir music. (A) Proportion of energy in the singer's formant region for individual singers in each voice section for three performances of each piece, a chorale (left) requiring rhythmic synchrony between choir sections and a fugue (right) requiring rhythmic independence between sections. Performance 1 was sung to the male audience, performance 2 to the audience with additional females, and performance 3 to the male audience again. Separate plots are shown for basses (blue circles), tenors (green squares), altos (black diamonds), and sopranos (red triangles). For both pieces, there is an increase in energy from performance 1 to 2 (when females were present), followed by a decrease in performance 3, for each of the four basses, but no such consistent effect in other voice sections. (B) Proportion of total energy in the singer's formant region for each voice part (averaged across individual singers and pieces) when females were present vs. absent. Proportion of energy is higher when females were present than absent only in basses (female presence × voice type interaction, p = 0.003; asterisk indicates p = 0.016 for simple effect of female presence for basses only). Error bars are 95% confidence intervals based on within-participants s.e.m. (bars occluded by data marker symbols for sopranos and altos).

The specificity of the effect to members of the bass section, that is, to singers with the deepest voices, is noteworthy. A prominent singer's formant is conventionally considered to be more desirable in solo than choral singing (Ternström, 2003). However, because enhancing the singer's formant does not affect other performance parameters such as tuning and timing, the observed effect may reflect sexually mature males competing for female attention in a manner that does not threaten group cohesion.

The main effect of piece indicates that proportion of energy in the 2,500–3,500 Hz range was generally greater for the chorale than the fugue, which may be due to differences in compositional texture and rhythmic structure between the pieces. Specifically, enhancement of the singer's formant might have been elicited by the homophonic texture of the chorale, where separate voice sections sing predominantly in rhythmic unison, more so than by the polyphonic fugue, where the vocal lines sung by separate voice sections are differentiated in terms of rhythm. Furthermore, longer note durations in the chorale (mainly quarter and eighth notes at a steady tempo) than the fugue (eighth and sixteenth notes at a rapid tempo) may generally allow more scope for modulating spectral properties of vocal sound.

The main effect of voice type indicates that singer's formant energy increased across choir sections from sopranos through altos and tenors to basses. This increase may be due in part to differences in the perceptual salience of separate voice parts, as well as to changes in vocal tract size and shape associated with maturational development. With regard to perceptual salience, the general tendency for superior processing of high voices in multi-part musical textures (Palmer and Holleran, 1994; Crawley et al., 2002; Trainor et al., 2014) implies little need to employ additional techniques to enhance the soprano part. With regard to vocal tract configuration, the lowering of the larynx and widening of the pharynx during adolescent development (Fitch and Giedd, 1999) increases the size of a resonance cavity that causes clustering of high formants in the voice spectrum (Sundberg, 1987). Boys who have undergone growth associated with puberty are therefore more likely to display a prominent singer's formant and, in adults, enhancement of the singer's formant is more commonly observed in male tenors and basses than in female sopranos and altos (Sundberg, 1999). Given that participant age was related to voice type (though note that tenors and basses were similar in age), it is not surprising that energy in the 2,500–3,500 Hz region increased across voice types.

It can also be noted that age differences were not responsible for the presence of the effect in basses and absence of the effect in tenors. Although two tenors exhibited an enhanced singer's formant during the second performance (see **Figure 2A**), these individuals were in fact younger (aged 16 years) than the other two tenors who did not show an effect (aged 18 years).

#### Sound Intensity

Global sound intensity was measured by calculating the root mean square of the amplitude profile of each individual voice file. Average data for each voice section for each performance of each piece are shown in **Table 1**. Separate ANOVAs, testing for effects of female presence (present vs. absent) and voice type (bass, tenor, alto, soprano) on sound intensity, for each piece yielded no statistically significant effects (ps > 0.05). The absence of significant interactions involving voice type suggests that ensemble balance, in terms of relative intensity of different voice parts, was consistent across performances.

#### Global and Local Tempo

Global tempo was estimated for each individual voice file and data for this measure are displayed in **Table 1**. The average estimated global tempo across voices was 80.34 bpm for the chorale and 161.07 bpm for the fugue. An ANOVA on global tempo data for the chorale revealed a statistically significant main effect of female presence [F(1, 12) = 5.357, p = 0.039] and a significant interaction of female presence and voice type [F(3, 12) = 4.009, p = 0.034]. The main effect of voice type was not significant. These results reflect an overall slightly faster performance tempo when females were present (81.12 bpm) than absent (79.56 bpm), with this effect being more pronounced in sopranos and basses than in altos and tenors (although t-tests for each voice type failed to reveal any significant effects). In contrast, an ANOVA on tempo data for the fugue yielded only a significant main effect of female presence [F(1, 12) = 13.453, p = 0.003], reflecting an overall slightly faster performance tempo when females were present (161.93 bpm) than absent (160.20 bpm). The effects of female presence on global tempo were therefore weak and inconsistent across the two pieces. Variability in local tempo was assessed by calculating the standard deviation of time points in a tempo map for each file. The ANOVAs on local tempo variability data (**Table 1**) yielded no statistically significant effects for either piece (ps > 0.05).

TABLE 1 | Measures of sound intensity (root mean square of amplitude profile), global tempo (bpm), local tempo variability (SD of onsets in tempo map), median asynchrony (in ms, relative to note onsets in mixed full choir file), and SD of asynchronies (ms) computed from individual audio files for three performances (females present in audience for performance 2) of two pieces (chorale and fugue). See Data Analysis for details. Values averaged across the four singers in each voice section (soprano, alto, tenor, and bass; s.e.m. in parentheses).

### Ensemble Coordination

To assess ensemble coordination, asynchronies between note onsets in each individual voice file and onsets in the mixed full choir file were computed for each performance. These asynchronies represent temporal deviations of estimated note onsets produced by each singer from estimated onsets in the choir's collectively produced sound. The median of each individual singer's asynchrony series was calculated as a measure of his synchronization accuracy and the standard deviation of the asynchrony series was taken as a measure of synchronization precision (**Table 1**; Rasch, 1988).
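The two summary measures described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' analysis pipeline: the function name `asynchrony_stats`, the assumption that voice and choir onsets have already been paired note by note, the use of the sample standard deviation, and the example onset values are all hypothetical.

```python
from statistics import median, stdev

def asynchrony_stats(voice_onsets_ms, choir_onsets_ms):
    """Return (median asynchrony, SD of asynchronies) in ms.

    Assumes the two onset lists are already matched one-to-one
    (the source does not describe the matching step).
    Negative asynchronies mean the voice leads the choir mix.
    """
    if len(voice_onsets_ms) != len(choir_onsets_ms):
        raise ValueError("onset lists must be paired one-to-one")
    asynchronies = [v - c for v, c in zip(voice_onsets_ms, choir_onsets_ms)]
    accuracy = median(asynchronies)   # synchronization accuracy
    precision = stdev(asynchronies)   # synchronization precision (sample SD, an assumption)
    return accuracy, precision

# Made-up onset times (ms), for illustration only
voice = [0.0, 495.0, 1010.0, 1490.0]
choir = [5.0, 500.0, 1000.0, 1500.0]
accuracy, precision = asynchrony_stats(voice, choir)
```

In this toy example the median is negative, which under the sign convention in the text would indicate a voice that tends to lead the collective sound.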

The sign of the median asynchrony (negative or positive) is informative about whether note onsets in a particular voice part occur relatively early or late. In the current performances, the sopranos showed a numerical tendency to lead (median asynchrony = −4.33 ms, averaged across the four singers), followed by altos (−2.89 ms), tenors (1.17 ms), and basses (2.66 ms). These differences were, however, not statistically significant. ANOVAs on synchronization accuracy did not reveal significant effects for either piece (ps > 0.05). With regard to synchronization precision, the mean standard deviation of asynchronies, averaged across all singers, was 59 ms for the chorale and 48 ms for the fugue, which is commensurate with levels typically observed in expert ensembles (Keller, 2014). ANOVAs on synchronization precision did not yield statistically significant effects (ps > 0.05), apart from a main effect of voice type in the fugue [F(3, 12) = 3.961, p = 0.036], reflecting a general decrease in precision from sopranos to basses (which may be a consequence of the increase in number of notes across these parts; see Section Materials).

#### Subjective Evaluations

After the recording session, each member of the choir completed a questionnaire addressing his evaluation of the performances and views on choral singing in general. Analysis of the questionnaire items revealed no reliable differences between voice types in terms of (1) self-reported level of concentration, (2) focus of attention on self, neighboring singers, whole choir, conductor, or audience, (3) general tendency to focus on the audience during concerts, (4) the degree to which the audience proved distracting during the recording, (5) whether the presence of girls led to changes in the individual's own performance. For this last item, none of the boys selected the response option indicating that they attempted to attract the girls' attention. In response to a final item probing views on which of the three concerts went the best, all tenors and basses thought that the one sung in the presence of females was best, while only one out of the four sopranos and no altos thought that this was the case. The conductor (who could not see the audience while conducting) indicated that the performance with females present went the best.

## GENERAL DISCUSSION

The present case study tested the hypothesis, motivated by research on non-human "chorusing" displays (such as those produced by congregations of male insects and frogs to attract females), that group music making serves cooperative and competitive functions simultaneously. In support of this dual-function hypothesis, acoustic analyses of individual voices from the St. Thomas Choir of Leipzig revealed increased energy in the singer's formant spectral region (2,500–3,500 Hz) in members of the bass section, that is, boys with the deepest voices, when females were included in an otherwise male audience during repeated performances of a concert. This selective effect of female presence generalized across different musical pieces (a chorale requiring synchrony between sections of the choir and a fugue characterized by greater rhythmic independence between voices) and did not disrupt ensemble cohesion in terms of balance or temporal coordination. Enhancing the singer's formant generally modifies vocal tone quality by imbuing the voice with what Helmholtz (1875, p. 116) described as "a clear tinkling of little bells". Observing such subtle embellishment in the context of choral singing suggests that, under conditions where group cohesion is strongly favored, sexually mature males have at their disposal a covert means to compete for female attention.

Evidence that the enhanced singer's formant was motivated by sexual competition lies in the finding that the effect of female presence was only reliably observed in members of the bass section of the choir. The presence of adolescent females (aged 15–16 years) may have elicited competitive behavior exclusively in bass singers (aged 16–19 years) due to relatively high levels of the sex hormone testosterone in males with the deepest voices (Zitzmann and Nieschlag, 2001). Testosterone lengthens the vocal tract by stimulating a secondary descent of the larynx in males during puberty (Fant, 1975). This lengthening lowers the fundamental frequency of the voice and reduces dispersion in formant frequencies (Bruckert et al., 2006), which increases perceived attractiveness and dominance (Evans et al., 2008). The boost in the singer's formant that we observed may constitute an attempt by the basses to establish a privileged social communication channel with female listeners by drawing attention to these appealing vocal qualities.

The fact that female presence did not lead to a reliable enhancement of the singer's formant in tenors (aged 16–18 years) suggests that the effect was not related to a general desire to lift performance standards in the presence of female peers. An increase in the desire to perform well could also be expected to affect other performance parameters, even in sopranos (12–16 years) and altos (12–16 years), but no such additional effects were observed. Furthermore, as revealed by post-concert questionnaire responses, members of the tenor and bass sections were unanimous in thinking that the performance sung in the presence of females was the best, but none admitted to attempting to attract the girls' attention. Therefore, while tenors and basses converged in terms of their subjective evaluations of the performances, their objectively measured behavior diverged.

Our results point to an analogy between human choral singing and chorusing displays in non-human species (crickets, cicadas, and frogs) where congregations of males produce rhythmically coordinated signals that collectively serve as a beacon to attract female mates. While non-human chorusing appears cooperative to the extent that inter-individual coordination maximizes the intensity and geographic reach of the collective broadcast, these communal displays can arise via competitive mechanisms through which individual males jam rival signals (Greenfield, 1994, 2005). By extension, the selective enhancement of the singer's formant observed in our study suggests that, in human music, an individual beacon motivated by competition can safely nest within a group beacon arising through cooperation.

Apart from enhancing the singer's formant in basses, female presence did not otherwise impact reliably upon performance. Features including balance in terms of overall vocal intensity, global tempo, local tempo fluctuations, and degree of temporal coordination between voices were generally commensurate across performances with and without female audience members. This indicates that individual singers did not attempt to attract attention by obvious means such as producing louder or earlier sounds than their co-performers—unlike the case in non-human chorusing. Clearly, the tradition of excellence and discipline in the St. Thomas Choir and the presence of a conductor (who was blind to the manipulation) precluded variations in rhythmic precision or vocal blend that would compromise ensemble cohesion. Indeed, the fact that we observed any vocal embellishment at all is remarkable given that the music was composed for religious celebration in the Lutheran Protestant tradition, where flamboyance is eschewed (in contrast to opera, gospel, and pop choirs, for example).

The absence of effects for measures of ensemble timing and balance also argues against the possibility that the enhancement of the singer's formant for the second of three performances represents a practice effect. Aspects of performance that are central to the explicit goal of achieving ensemble cohesion, such as timing and balance, are presumably more likely to be susceptible to practice effects than unsanctioned behaviors such as the observed spectral modifications. In any case, practice effects were not expected to occur with the St. Thomas Choir due to their extensive experience with the material performed. Fatigue effects were likewise not expected due to the choir's routine experience singing much longer programs, including masses and oratorios lasting several hours.

As noted earlier, our approach entails a single case study of an elite musical ensemble rather than a fully controlled experiment addressing general music making in a random sample of individuals. Conducting a case study with experts allowed us to provide a hard test of the hypothesis that female presence selectively impacts upon spectral rather than temporal or intensity aspects of male vocal production, and that this effect is specific to males who have undergone voice changes associated with puberty. These goals would not be readily achieved with alternative study designs. Amateur choirs typically comprise sections in which individuals are mixed in terms of age and experience (especially amongst tenors and basses), making these variables difficult to place under experimental control. Forming an ad hoc choir for the purpose of the study would likewise be problematic due to the fact that new ensembles usually spend considerable time (several weeks of regular rehearsal) learning to produce a cohesive sound. Moreover, true ensemble excellence is a hallmark of years of working together (Murnighan and Conlon, 1991; Blank and Davidson, 2007). A choir lacking such extensive preparation would most likely produce a high degree of variation in performance timing and balance, which could mask subtle spectral effects associated with the singer's formant.

Of course, investigating a single elite group has the obvious limitation of a small sample size. For this reason, caution should be taken in drawing generalized conclusions, especially since the current study focused on a single, culture-specific musical style. Indeed, one may ask whether effects observed with a highly drilled European boys' choir would generalize to everyday forms of group music making in different cultures. While we cannot answer this question based on present findings, work in the field of ethnomusicology suggests that such generalizability is feasible. Group singing is vital to the existence of many indigenous cultures. For example, the male Amazonian Mekranoti Indians sing daily before dawn in a defensive vigil that maintains arousal to guard against attacks from neighboring tribes (Werner, 1984; Huron, 2001), while the nearby Suya people use song as a routine communicative device that complements and overlaps with speech (Seeger, 1987). It is possible that acoustic analyses of these everyday group vocal displays would reveal spectral modulations based on social communicative goals.

Notwithstanding caveats related to the case study approach, our results have several implications for future studies aimed at understanding human non-verbal communication from the perspectives of evolutionary biology, psychology, music cognition, and bioacoustics. With regard to evolutionary biology, the concept of simultaneous cooperation and competition points to a relationship between human and non-human "chorusing" that goes beyond the fact that both involve rhythmic coordination. Building on this link, the hypothesis that the evolutionary benefit of music lies in balancing sexually motivated behavior and group cohesion advances theory by bringing together opposing accounts of the origins of music, thus highlighting a promising avenue for cross-species comparative research. With regard to psychology, our study highlights music's potential to advance knowledge about the dynamics of non-verbal communication during social interaction. In particular, demonstrating how cooperative and competitive mechanisms operate in parallel (when differing goals are pursued at the level of the group and the individual) showcases music as an ecologically valid domain in which to investigate this social balancing act under controlled conditions (D'Ausilio et al., 2015).

Finally, with regard to music cognition and bioacoustics, the current study is informative about the role of sound quality, in particular the singer's formant, in communicating performers' intentions. In addition to its well-documented function in adding carrying power to the voice in operatic soloists (Sundberg, 1987), our results suggest that the singer's formant may be used to draw the listener's attention to a particular auditory stream in a multipart choral texture. It is important to note that this effect was observed in an elite ensemble under conditions that ostensibly favor the blending of voices. This emphasizes the need for models of musical expression, which to date focus predominantly on variations of timing and intensity (Todd, 1985, 1992; Clarke, 1988; Palmer, 1997; Gabrielsson, 2003; Keller, 2012), to take into account the role of spectral properties in signaling performers' intentions in ensemble music.

In conclusion, the findings of the present study suggest that, analogously to non-human chorusing, human music making may enable the coexistence of cooperative and competitive behavior that potentially mediates social communication simultaneously at the level of the group and the individual. Even during highly formalized types of musical interaction (exemplified here by Western church music), performers are free to introduce unsanctioned behavioral embellishments that allow individuals to enter into covert competition for the attention of potential mates without undermining collaborative goals related to emotional communication and artistic expression. The fact that we observed such subtle modifications in the singing voice, an ancient and universal means of musical expression (Lomax and Berkowitz, 1972; Nettl, 1983; Mithen, 2005; Hagen and Hammerstein, 2009), raises the possibility that balancing sexual competition with group cohesion in musical contexts is a human capacity with deep biological roots. Accounts of the origins of human music in social cohesion vs. sexual selection can therefore be unified in the hypothesis that the evolutionary benefit of music lies in its ability to serve cooperative and competitive functions simultaneously, with this duality augmenting music's communicative power.

## ETHICS STATEMENT

This study was carried out in accordance with the recommendations of the ethics committee of the University of Leipzig with written informed consent from all subjects. All subjects gave written informed consent in accordance with the Declaration of Helsinki. The protocol was approved by the ethics committee of the University of Leipzig.

## AUTHOR CONTRIBUTIONS

RK and PK designed the study with input from GN. RK and PK collected and analyzed the data. All authors discussed the results. PK wrote the manuscript with input from RK and GN.

## FUNDING

Funding from the Max Planck Society supported the study. PK is supported by a Future Fellowship grant from the Australian Research Council (FT140101162).

#### ACKNOWLEDGMENTS

We thank the members of the St. Thomas Choir, their management team, including Roland Weise, Thoralf Schulze, and Titus Heidemann, and former cantor Georg Christoph Biller for making the study possible. We also thank Christoph Wonneberger, Marko Kronberg, Markus Eckardt, and Stephan Liebig for technical assistance, and Maria Bader and Theres König for assistance with data collection.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2017.01559/full#supplementary-material

Supplementary material includes audio recordings of the full choir's three performances of the two pieces:

Audio S1 | chorale\_1\_females\_absent.mp3

Audio S2 | chorale\_2\_females\_present.mp3

Audio S3 | chorale\_3\_females\_absent.mp3

Audio S4 | fugue\_1\_females\_absent.mp3

Audio S5 | fugue\_2\_females\_absent.mp3

Audio S6 | fugue\_3\_females\_absent.mp3

Note that the 16 separate channels (one per choir member) are mixed in these files to give an overall impression of the choir's performances, and the files have been compressed for presentation purposes. Uncompressed files are available upon request from PK (p.keller@westernsydney.edu.au).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Keller, König and Novembre. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## APPENDIX

Note that this questionnaire has been translated into English from the German original used in the study.



o Other: \_\_\_\_\_\_\_\_

6. In your opinion, which part (block) went best for all of you today? o Part 1 o Part 2 o Part 3

# Neurodiversity, Giftedness, and Aesthetic Perceptual Judgment of Music in Children with Autism

#### Nobuo Masataka\*

Primate Research Institute, Kyoto University, Inuyama, Japan

#### Edited by:

Leonid Perlovsky, Harvard University and Air Force Research Laboratory, United States

#### Reviewed by:

Robin W. Wilkins, Joint School for Nanoscience and Nanoengineering Gateway MRI Center, United States

Alexander Ovsich, Boston College, United States

\*Correspondence: Nobuo Masataka masataka.nobuo.7r@kyoto-u.ac.jp

#### Specialty section:

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

Received: 22 April 2017 Accepted: 31 August 2017 Published: 22 September 2017

#### Citation:

Masataka N (2017) Neurodiversity, Giftedness, and Aesthetic Perceptual Judgment of Music in Children with Autism. Front. Psychol. 8:1595. doi: 10.3389/fpsyg.2017.01595

The author investigated the capability of aesthetic perceptual judgment of music in male children diagnosed with autism spectrum disorder (ASD) when compared to age-matched typically developing (TD) male children. Nineteen boys between 4 and 7 years of age with ASD were compared to 28 TD boys while listening to musical stimuli of different aesthetic levels. The results from two musical experiments using the above participants are described here. In the first study, responses to a Mozart minuet and a dissonant altered version of the same Mozart minuet were compared. In this first study, the results indicated that both ASD and TD males preferred listening to the original consonant version of the minuet over the altered dissonant version. With the same participants, the second experiment included musical stimuli from four renowned composers: Mozart and Bach's musical works, both considered consonant in their harmonic structure, were compared with music from Schoenberg and Albinoni, two composers who wrote musical works considered exceedingly harmonically dissonant. In the second study, when the stimuli included consonant or dissonant musical stimuli from different composers, the children with ASD showed greater preference for the aesthetic quality of the highly dissonant music compared to the TD children. While children in both of the groups listened to the consonant stimuli of Mozart and Bach music for the same amount of time, the children with ASD listened to the dissonant music of Schoenberg and Albinoni longer than the TD children. As preferring dissonant music is more aesthetically demanding perceptually, these results suggest that ASD male children demonstrate an enhanced capability of aesthetic judgment of music. Subsidiary data collected after the completion of the experiment revealed that absolute pitch ability was prevalent only in the children with ASD, some of whom also possessed extraordinary musical memory. The implications of these results are discussed with reference to the broader notion of neurodiversity, a term coined to capture potentially gifted qualities in individuals diagnosed with ASD.

Keywords: autism spectrum disorder, music, aesthetic judgment, consonance, neurodiversity, Spielmann (wandering minstrel)

## INTRODUCTION


Neurodiversity refers to the notion that seemingly 'impaired' cognitive as well as emotional properties characteristic of developmental disorders such as autism spectrum disorder (ASD), a neurodevelopmental disorder with unusual sensory processing, are not necessarily deficits, but fall within the normal behavioral variation exhibited by humans. Stated more formally, this notion was recently described as "a concept where neurological differences are to be recognized and respected as any other human variation" (Armstrong, 2012). The term was first coined in the late 1990s by New York journalist Harvey Blume and Australian autism activist Judy Singer, and has become an important component of the civil rights movement for those with neurologically based disabilities. While this is indeed a paradigm shift from a "deficit-oriented view of ASD" to a "strength-oriented view of ASD" (Silberman, 2015), experimental and empirical scientific evidence confirming this conceptual notion and paradigm has been meager.

From the archeological perspective, however, a recent review mentioned the enhanced perceptual abilities of individuals with this disorder and argued for the possibility that these abilities might have contributed to the survival of such individuals in a Paleolithic context (Spikins et al., 2016). From a cognitive science perspective, on the other hand, the author here presents experimental evidence indicating a gifted cognitive capability in such individuals and suggests the historical presence of a group of professional individuals who relied upon this capability to survive, being incorporated into human societies. The capability in question is the aesthetic judgment of music.

As noted by Temple Grandin (Grandin, 1996; Grandin and Cook, 2004), a strong appreciation for music is commonly observed in children with ASD. Kanner (1943) had already reported extraordinary musical memory in 6 of 11 individuals who were diagnosed with this disorder. The first systematic attempt to identify superior performance on a musical task found that reproduction of atonal melodies was superior in children with ASD, as compared with IQ-matched, typically developing (TD) children (Applebaum et al., 1979). Since that study, increased sensitivity to musical pitch and timbre (Heaton et al., 1998), including absolute pitch (AP), has been documented frequently in children with ASD (see Heaton, 2009 for review).

So far, such atypical musical skills have been interpreted in terms of the enhanced perceptual processing of lower-order musical structure that is characteristic of this disorder, on the assumption that the superiority should not extend to the cognitive sense of music in children with ASD (Mottron et al., 2000, 2006). However, neurological evidence against this assumption has recently been presented (Gebauer et al., 2014). In that study, neural correlates of emotional response to music were compared between adults with ASD and neurotypical controls, using functional magnetic resonance imaging. The results revealed that both groups engaged similar neural networks during processing of emotional music. However, in the ASD group, increased activity in response to happy compared to sad music was observed in dorsolateral prefrontal regions as well as in the rolandic operculum/insula, indicating enhanced cognitive processing and physiological arousal in response to emotional musical stimuli in this group. These findings suggest the possibility that an aesthetic emotional response per se somehow occurs atypically upon listening to music in individuals with ASD, among whom there are known to be some remarkably musically skillful individuals. This is the issue pursued in the current study.

As a first step, the author compared the aesthetic judgments of consonant/dissonant melody between children with ASD and TD children, because this is the issue that has been most intensively investigated with regard to spontaneous preference for specific patterns of music. In the Western world as well as in Japan, even newborns are known to show a preference for a Mozart minuet, most of which consists of consonant intervals, over a modified version of it that mostly consists of dissonant intervals (Trainor and Heinmiller, 1998; Masataka, 2006). Thus, the identical test was undertaken first here with children with ASD and TD children.

While consonant intervals are most often used in popular music around the world, it is also true that some pieces of Western classical music are highly dissonant (Boulez, 1971; Masataka, 2003). Nevertheless, they are appreciated as a source of pleasure. Darwin (1871) already mentioned this fact and was cautious about the connection between musical consonance and hedonicity. For instance, sad and dissonant music like Schoenberg's piano pieces and the Adagio by Albinoni certainly evokes some aesthetic emotions in individuals exposed to it (Boulez, 1971; Thompson et al., 2001; Masataka and Perlovsky, 2013). However, such a pleasurable experience may require a sensitivity that needs to be cultivated as it develops in a particular music culture, unlike the preference for a Mozart minuet over its modified version with more dissonant intervals. Given this possibility, the degree of preference for these pieces in a child, if it is measurable, would reveal the degree to which the musical intelligence defined by the theory of multiple intelligences (Gardner, 2011) has developed in the child. Following this reasoning, in the second experiment of the current study, preferences for such works as well as for typical consonant music such as Mozart's and Bach's pieces were investigated in the two groups of participants.

In order to confirm the major findings of previous studies about unusual musical sensitivity of children with ASD, subsidiary data were collected after the completion of the above experiments from all the participants about AP, the ability to identify the frequency or musical name of a specific tone, or to identify a tone without comparing it with any objective reference tone (Masataka, 2011). The parents of the participants were also interviewed to ask about instances of extraordinary musical memory in their children, which has been reported to occur often in association with ASD (Kanner, 1943).

#### MATERIALS AND METHODS

This investigation was conducted according to the principles expressed in the Declaration of Helsinki. All experimental protocols were consistent with the Guide for Experimentation with Humans, and were approved by the Institutional Ethics Committee of the Primate Research Institute, Kyoto University (#2011-150). The author obtained written informed consent from parents of all participants involved in the study.

## Participants


A group of 19 male children with ASD aged 4–7 years (M = 5.8; SD = 1.2) and 28 TD male children aged 4–7 years (M = 5.6; SD = 1.4) participated in the current study. All participants were musically untrained. There was no significant difference between the mean age of the two participant groups [t(45) = 1.10, p = 0.28]. All participants were Japanese, right-handed, naïve as to the purpose of this study, and had normal hearing.

Nineteen children with ASD were recruited for the current study. Based on direct clinical observation of each child by an independent child psychiatrist, a diagnosis of autism was made according to ICD-10 (World Health Organization, 1994) as well as DSM-IV (American Psychiatric Association, 1994). On the basis of such criteria, each participant in the group of children with ASD was diagnosed as either F84.0, F84.9, or F84.8. Moreover, such diagnoses were also confirmed by the Autism Diagnostic Interview-Revised (ADI-R), an extensive, semi-structured parental interview (Lord et al., 1994) that was conducted by an independent psychiatrist. The ADI-R provides information about the presence of verbal language skills, defined as daily, functional and comprehensive use of spontaneous phrases of at least three words and occasionally a verb. All of the participant ASD children were found to express verbal language. All of the TD children were recruited via the board of education in a small city in Japan. All participants attended normal classes characteristic of their chronological age level. None of the participants included in the groups of TD children met any diagnostic criterion for autism or any other pervasive developmental disorder.

## Procedure

Throughout the current study, two experimental trials were executed to investigate music preference of each participant (referred to as "Experiment 1" and "Experiment 2" below), using the same experimental protocol but different materials as stimuli. None of the participants had been exposed to any of the materials prior to the current study.

The materials used in Experiment 1 were the original, harmonic version of a simple Mozart minuet (C major K.#1f) and its modified, inharmonic version. They were essentially the same as used previously (Trainor and Heinmiller, 1998; Masataka, 2006; Masataka and Perlovsky, 2012, 2013). Both the original and the modified versions were digitally generated with a piano timbre. They were made up of 60 intervals. In the original, harmonic version, only three of them were dissonant, and all three were tritones (6-semitone intervals). In the modified, inharmonic version, all Gs were changed to F#s and all Ds to C#s. This had the effect of creating 21 additional dissonant intervals, including a total of 12 of the two most dissonant intervals, i.e., seven tritones and five minor ninths (13 semitones). In the present stimuli, the upper voice and the lower voice were separated by more than an octave in each interval. The tempo was identical across the two versions.

The materials used in Experiment 2 were four musical pieces, namely, Mozart's piano sonata K.448, Bach's toccata in G major BWV 916, Schoenberg's Klavierstück op. 33a, and the Adagio in G minor for organ and strings by Albinoni. While listening to these pieces by Mozart and Bach has been reported to enhance performance on spatial tasks, referred to as the Mozart effect (Cooper, 2001), the pieces by Schoenberg and Albinoni used here are well known as being among the most popular pieces of highly dissonant Western classical music.

For the protocol, the current study adapted the child contingent head-turn preference procedure to allow musical preferences to be measured by the participant pressing a key in order to hear a given extract of music (from a choice of either of two keys/musical pieces in Experiment 1 and from a choice of four possible keys/musical pieces in Experiment 2). The data thus show discrimination among a number of musical styles, and preferences can also be measured. Participants were tested individually, sitting in a quiet room under daylight conditions in front of a table. There was a toy keyboard equipped with eight keys on the table. When a participant entered the room and was seated, an experimenter who was sitting apart from him instructed him that some of the keys would start flashing and that he could hear something by pressing either of them. The keyboard was connected to a portable computer with specially written software, enabling the presentation of each of the prepared stimuli to be linked to each of the flashing keys and to be heard only when the key was pressed. Ten seconds after the instruction, two of the keys started flashing and preliminary trials began. When the participant pressed either of them, a popular Japanese playsong was played. It continued to be played as long as he kept pressing. However, nothing was played when he pressed other keys or more than one key simultaneously. By conducting such trials for a maximum of a 5-min period, each participant learned to perform what was necessary to listen to the prepared music stimuli. After the experimenter had confirmed the learning, Experiment 1 was started after a 1-min break. It was 5 min long, and was followed by Experiment 2, which was 10 min long, after another 1-min break. Thereafter, the participant was instructed to leave the testing chamber. During each break, none of the keys flashed.

In Experiment 1, only two of the eight keys flashed, while four of them flashed in Experiment 2. The assignment of each stimulus to a key was fixed throughout a given participant's session, randomly determined across participants, and counterbalanced. The keys were kept flashing during Experiments 1 and 2. During that period, the participant was allowed to press any key. As long as he kept pressing one of the flashing keys, he was exposed to the music assigned to that key (sound pressure level: 60 dB). However, the music was never played for a duration longer than 1 min. If he kept pressing the key longer than 1 min, the computer ceased to present the stimulus. Key presses were recorded in msec, and preference for each music stimulus was calculated as a proportion of the total time allocated to the stimulus.
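As a minimal sketch of how such a preference score could be derived from logged key presses, the Python snippet below sums the listening time per stimulus, caps each press at the 1-min playback limit, and divides by the total listening time. The function name, the `(stimulus, press_ms, release_ms)` record layout, and the reading of "proportion" as a share of total listening time are assumptions made for illustration; they do not come from the software used in the study.

```python
MAX_PRESS_MS = 60_000  # playback stopped after 1 min of continuous pressing


def listening_proportions(presses, stimuli):
    """Return each stimulus' share of the total listening time.

    presses: iterable of (stimulus, press_ms, release_ms) records
             (hypothetical log format, assumed for this sketch).
    stimuli: names of all stimuli offered in the experiment.
    """
    totals = {s: 0 for s in stimuli}
    for stimulus, press_ms, release_ms in presses:
        # Music was audible only while the key was held, up to the 1-min cap.
        totals[stimulus] += min(release_ms - press_ms, MAX_PRESS_MS)
    grand_total = sum(totals.values())
    if grand_total == 0:
        # No listening at all: report zero preference for every stimulus.
        return {s: 0.0 for s in stimuli}
    return {s: totals[s] / grand_total for s in stimuli}


# Illustrative (invented) log for one participant in a two-key session.
example_log = [
    ("harmonic", 0, 30_000),
    ("inharmonic", 40_000, 50_000),
    ("harmonic", 60_000, 130_000),  # held 70 s, counted as 60 s
]
proportions = listening_proportions(example_log, ["harmonic", "inharmonic"])
```

With this invented log, the harmonic stimulus accounts for 90 s of 100 s of audible music, so its share is 0.9.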

In testing of AP ability (referred to as Experiment 3), the participants heard 36 pure sine wave tones, presented in pseudorandomized order, which ranged from A3 (tuning: A4 = 440 Hz) to A5, with each tone being presented once. Each tone of the AP test had a duration of 1 s, with a 6-s interstimulus interval. The participants had to name the tonal label after hearing the corresponding tone. Prior to the testing, each participant had been confirmed to have acquired sufficient verbal knowledge of the labels to answer. The whole test unit and its components were created with Adobe Audition 1.5. Accuracy was evaluated by counting correct responses. The participants were not asked to identify the octave of the presented tones because, for AP, identifying the correct chroma is the most notable prerequisite.
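The tuning reference and the chroma-only scoring described above can be illustrated with a short Python sketch. It assumes twelve-tone equal temperament (the text specifies only A4 = 440 Hz) and uses hypothetical helper names; it is not the procedure by which the original stimuli were generated in Adobe Audition.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def frequency_hz(name, octave, a4=440.0):
    """Frequency of a note under equal temperament (an assumption here)."""
    semitones_from_a4 = (
        NOTE_NAMES.index(name) - NOTE_NAMES.index("A") + 12 * (octave - 4)
    )
    return a4 * 2 ** (semitones_from_a4 / 12)


def chroma_score(presented, answered):
    """Count responses whose note name (chroma) matches, ignoring octave."""
    return sum(1 for p, a in zip(presented, answered) if p == a)


# Range limits named in the text: A3 and A5.
low, high = frequency_hz("A", 3), frequency_hz("A", 5)

# Invented example: two of three labels are correct.
score = chroma_score(["A", "C#", "F"], ["A", "C", "F"])
```

Because only the note name is compared, an answer in the wrong octave would still count as correct, which mirrors the decision not to ask for octave identification.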

In interviews with the parents of the participants, episodes of two children with ASD documented by Kanner (1943) were mentioned; Charles N. was able to distinguish among 18 symphonies before he was 2 years old, and when his mother played his favorite records, he could correctly name the symphony being played. Another boy, John F., was similarly talented at recognizing melodies. If his father whistled a tune, he often identified it, for instance, as Mendelssohn's violin concerto. In addition, he was able to recite many prayers and nursery rhymes from memory. After such episodes were mentioned to them, the parents in the present study were asked whether they had experienced similar instances in their children.

#### RESULTS

The overall results of Experiment 1 are summarized in **Figure 1**, which shows the overall mean duration of listening to the original, harmonic version of Mozart's minuet and to its modified, inharmonic version with many dissonant intervals in the group of TD children and that of ASD children. When the collected data were analyzed using a 2 (ASD/TD, PARTICIPANT) × 2 (original version versus modified version, STIMULUS) analysis of variance (ANOVA), one of the two main effects (STIMULUS) was statistically significant [F(1,45) = 49.53, p < 0.0001, η<sub>p</sub><sup>2</sup> = 0.299]. The other main effect (PARTICIPANT) was not significant [F(1,45) = 0.84, p = 0.46, η<sub>p</sub><sup>2</sup> = 0.033]. However, the interaction between PARTICIPANT and STIMULUS was significant [F(1,45) = 4.25, p = 0.028, η<sub>p</sub><sup>2</sup> = 0.35].

Subsequent analyses of simple main effects (Bonferroni correction) revealed that the mean listening time was shorter in response to the inharmonic version than to the harmonic version in both the children with ASD and TD children (p < 0.001). Moreover, the children with ASD listened to the harmonic version longer than the TD children (p = 0.03), who listened to the inharmonic version longer than the children with ASD (p = 0.04).

The overall results of Experiment 2 are summarized in **Figure 2**, which shows the overall mean duration of listening to the pieces of music from the four classical works in the group of TD children and that of ASD children. When the data were analyzed using another ANOVA, both of the two main effects were statistically significant [F(1,45) = 48.33, p < 0.0001, η<sub>p</sub><sup>2</sup> = 0.287 for STIMULUS and F(1,45) = 9.12, p = 0.009, η<sub>p</sub><sup>2</sup> = 0.133 for PARTICIPANT]. The interaction between PARTICIPANT and STIMULUS was also significant [F(3,135) = 4.25, p = 0.012, η<sub>p</sub><sup>2</sup> = 0.30].

FIGURE 1 | Mean preference (error bars: SDs) for an original simple harmonic minuet of Mozart (Harmonic) and its modified inharmonic version containing many dissonant intervals (Inharmonic) in children with autism spectrum disorder (ASD) and in typically developing (TD) children.

Pair-wise comparisons of the mean listening time across the four different music stimuli in the TD children and the children with ASD revealed that the TD children listened longer to both Mozart's piece and Bach's piece than to either Schoenberg's piece or Albinoni's piece (ps < 0.01). Besides these, no such differences were found with regard to any pair among the four stimuli (ps > 0.10). The mean listening time of the children with ASD to any of the four stimuli did not significantly differ from that to any of the remaining three stimuli (ps > 0.10). Overall, the children with ASD listened to Schoenberg's piece and Albinoni's piece longer than the TD children (ps < 0.01), whereas the mean listening time did not differ significantly between the two participant groups with regard to Mozart's piece or Bach's piece (ps > 0.10).

The overall results of the subsequent AP testing (Experiment 3) are presented in **Figure 3**. While the accuracy scores of all of the TD children remained within the level of chance, those of 15 of the 19 children with ASD were found to exceed the level of chance. During the interviews with the parents of the participants, instances of extraordinary musical memory in their children were reported by six parents of the children with ASD, but there was no such report by any parent of the TD children.

#### DISCUSSION

Regarding TD children, the results of Experiment 1 of the current study are consistent with those reported previously (Trainor and Heinmiller, 1998; Masataka, 2006): these participants exhibited a preference for the original, harmonic version of a Mozart minuet over its modified version containing many dissonant intervals. The results of this experiment also showed that such a preference was present in children with ASD. Moreover, the preference was more robust in children with ASD than in TD children, indicating that children with ASD are more sensitive than TD children with regard to the aesthetic judgment of consonance and dissonance per se in music.

The results of Experiment 2 seemingly contradict those of Experiment 1. Namely, children with ASD listened to the music stimuli indiscriminately, whether the music consisted of many dissonant intervals or not, whereas TD children listened to the pieces of music consisting of many consonant intervals longer than to those consisting of dissonant intervals, which the TD children actually avoided listening to. Due to such avoidance, the total duration of listening to the stimuli was longer in children with ASD than in TD children. However, it should be noted that the pieces of music containing many dissonant intervals that were used as the stimuli in Experiment 2 were by famous composers of Western classical music. These pieces contained such dissonant intervals as a consequence of the composers' attempt to aesthetically express the emotion of sadness, whereas in Experiment 1, Mozart's original minuet was merely modified mechanically to create its dissonant version.

This difference differentially influenced the responses to the stimuli in children with ASD, but not in TD children. This fact appears to deserve further analysis. Possibly, it may point to a question about the validity of the general assumption that the relative proportion between consonance and dissonance in music corresponds to a hedonic division between positive and negative that, apparently, in turn corresponds to the distinction between major and minor in music. The present results suggest that children with ASD possess enhanced capability of aesthetic judgment and appreciated the presented pieces of classical music even though they contained many dissonant intervals. This, in turn, indicates that it would be difficult to explain the more robust response to consonance/dissonance difference in children with ASD than in TD children found in Experiment 1 merely in terms of an enhanced perceptual processing of lower-order structure of music that is associated with ASD.

Among a number of acoustical dimensions constituting music, that of consonance/dissonance is the developmentally earliest that can be judged. During early childhood, neurotypical individuals are likely to listen to harmonic music in preference to music with many dissonant intervals, even if those intervals in the latter music were a consequence of an artistic attempt to express the emotion of sadness aesthetically (Winner, 2006). Appreciation of sad music develops later. While this was indeed experimentally confirmed in Experiment 2 of the current study, the present results also raise the possibility that children with ASD who are age-matched to TD children possess a more developed capability of aesthetic judgment of music even though they are musically untrained.

In line with the theory of multiple intelligences (Gardner, 2011), the cognitive characteristics revealed in children with ASD can be explained as a greater-than-typical strength of musical intelligence, although these children experience difficulty with interpersonal communication (Lord et al., 1994; World Health Organization, 1994). While the pattern of operation of an individual person's mind can be categorized according to the domain toward which that individual is more oriented, individuals with ASD, overall, do not rely upon their social relationships but rather are predisposed to process perceived non-social objects in more depth (Masataka, 2017). For instance, these individuals "exhibit enhanced discrimination between auditory stimuli, more accurate local target detection of auditory stimuli, and diminished global interference with auditory processing." These abilities, referred to as "naturalist intelligence" by Gardner (2011), are highly adaptive for living in nature and, in particular, would seem to have conferred an evolutionary advantage upon individuals during prehistoric times, as well as upon individuals in later times who would prefer to live outside any community (Armstrong, 2017).

The enhanced capability of aesthetic judgment of music found here was associated with some atypical musical skills characteristic of ASD such as AP and extraordinary musical memory, as revealed by the subsidiary data collected after the completion of Experiments 1 and 2. These skills are known to have been possessed by an outstanding artist of the 18th century, Wolfgang Mozart, who could name the note on hearing a bell toll, or a clock or even a pocket watch strike, and could memorize a 10-min-long melody that he heard only once (Deutsch, 2006). It should also be noted that his family toured continually throughout Europe from the time he was 5 years until 21 years of age, which contributed to establishing the young composer's reputation as a musical prodigy. Such a lifestyle was indeed in accord with the lifestyles of individuals who were documented in historical records in Europe as professional musical performers (Hartung, 2003).

The documentation of such individuals in southern Europe, the region between the Alps and the Pyrenees, appeared as early as the 10th century, with them being referred to as "Spielmann (plural: Spielleute)" in German and as "wandering minstrels" in English. Spielleute had no settled abode, and instead roamed from place to place over a broad range of regions across a number of communities. In the 13th century, such individuals came to be distributed over all regions of Europe. While all other individuals who were integrated into feudal society were occupied with predetermined, inherited work, Spielleute were the first who had chosen their occupation of their own will. Born in urban regions, most of them found it difficult to stay there and decided to live as outcasts. As a result, they were treated as dishonorable by members of society. They were talented in a variety of rudimentary forms of the performing arts of the time, including music and dancing, without having received any form of professional training. In particular, they were well known for mimicking a variety of sounds heard in the landscape (such as bird songs) while simultaneously playing stringed and/or woodwind instruments (Hartung, 1982). When one of their performances was successful, the melody was memorized, no matter how long it was, without being recorded in musical notation (in fact, they were unable to read music), and it became part of their routine repertoire. Such a seemingly mysterious endowment seems also to have been possessed by some famous modern artists who contributed to the subsequent development of classical music, such as Mozart, Bach and Beethoven (Deutsch, 2006), and also by the children with ASD in the current study.

Spielleute were allowed by the churches and the lords of communities to travel around as they liked, their daily income depending entirely upon their performances. They were apparently regarded as exceptional individuals (outsiders). The official position of the Catholic church was that they lived in a state of sin (Stuart, 1999). Nevertheless, they used to play an important role, namely that of 'negotiator' when conflict arose between regional communities within feudal society, because they were familiar with members of both of the disputing communities. This indicates that despite being socially marginalized, such individuals were socially functional. While "Spielleute" or "wandering minstrels" were neurodivergent individuals who ceased to be recorded after the 18th century as Europe became industrialized, it should be noted that similar outcast groups of individuals have been commonly documented among a variety of cultures and that all of them were talented in musical performance without exception (Hartung, 2003). This again strongly suggests the possibility that they possessed an enhanced aesthetic sense of music such as the giftedness that was revealed here in the performance of children with ASD.

As a clinical implication of the findings, one can also mention the possibility that this early-onset and/or potentially predispositional demonstration of heightened aesthetic judgment of music might eventually lead to new job-related opportunities for individuals with ASD in uniquely useful and professionally applicable settings in new fields. No doubt, this is an issue that should be clinically addressed in the next step of research.

## AUTHOR CONTRIBUTIONS

NM designed the study, collected and analyzed the data and drafted the manuscript.

## FUNDING

The study was supported by Japan Society for the Promotion of Science London and the grant-in-aid number is #25285201.

## ACKNOWLEDGMENTS

The author is greatly indebted to Thomas Armstrong for his many thoughtful comments on the earlier version of the manuscript. The author is also grateful to Hiroki Koda for assistance in conducting the experiments and to Elizabeth Nakajima for proofreading the English of the manuscript.

#### REFERENCES

Armstrong, T. (2012). Neurodiversity in the Classroom: Strength-based Strategies to Help Students with Special Needs Succeed in School and Life. Alexandria, VA: ASCD, 1–183.


Cooper, J. S. (2001). The Mozart effect. J. R. Soc. Med. 94, 170–172.



Grandin, T. (1996). Thinking in Pictures. New York, NY: Vintage, 1–177.


Kanner, L. (1943). Autistic disturbances of affective contact. Nerv. Child 2, 217–250.


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Masataka. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# What Vowels Can Tell Us about the Evolution of Music

#### Gertraud Fenk-Oczlon\*

Alpen-Adria-Universität Klagenfurt, Klagenfurt, Austria

Whether music and language evolved independently of each other or whether both evolved from a common precursor remains a hotly debated topic. We here emphasize the role of vowels in the language-music relationship, arguing for a shared heritage of music and speech. Vowels play a decisive role in generating the sound or sonority of syllables, the main vehicles for transporting prosodic information in speech and singing. Timbre is, beyond question, the primary parameter that allows us to discriminate between different vowels, but vowels also have intrinsic pitch, intensity, and duration. There are striking correspondences between the number of vowels and the number of pitches in musical scales across cultures: an upper limit of roughly 12 elements, a lower limit of 2, and a frequency peak at 5–7 elements. Moreover, there is evidence for correspondences between vowels and scales even in specific cultures, e.g., cultures with three vowels tend to have tritonic scales. We report a match between vowel pitch and musical pitch in meaningless syllables of Alpine yodelers, and highlight the relevance of vocal timbre in the music of many non-Western cultures, in which vocal timbre/vowel timbre and musical melody are often intertwined. Studies showing the pivotal role of vowels and their musical qualities in the ontogeny of language and in infant directed speech will be used as further arguments supporting the hypothesis that music and speech evolved from a common prosodic precursor, where the vowels exhibited both pitch and timbre variations.

#### Edited by:

Aleksey Nikolsky, Braavo! Enterprises, United States

#### Reviewed by:

John G. Neuhoff, College of Wooster, United States Bahia Guellai, University of Paris Ouest Nanterre, France

#### \*Correspondence:

Gertraud Fenk-Oczlon gertraud.fenk@aau.at

#### Specialty section:

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

Received: 25 May 2017 Accepted: 29 August 2017 Published: 22 September 2017

#### Citation:

Fenk-Oczlon G (2017) What Vowels Can Tell Us about the Evolution of Music. Front. Psychol. 8:1581. doi: 10.3389/fpsyg.2017.01581

Keywords: vowels, timbre, musical scale, non-Western cultures, ethnomusicology, musical protolanguage, evolution of music, linguistic typology

## INTRODUCTION

The evolution of music, the evolution of language and possible common evolutionary pathways of these achievements remain a matter of debate. Is music only a non-adaptive by-product of language as suggested by Pinker (1997) or did "previously developed musical powers" (Darwin, 1871, p. 12) precede oratory? An intriguing hypothesis is that music and language share a common origin in the form of a musical protolanguage in which song-like strings had holistic meanings (e.g., Jespersen, 1894; Mithen, 2005; Fitch, 2006).

Here, we emphasize the role of vowels in the music-language relationship<sup>1</sup>, and present arguments that reinforce the hypothesis of a prosodic protolanguage or 'protomusic' (Fitch, 2010). Vowels play a decisive role in generating the sound or sonority of syllables, the main vehicles for transporting prosodic information in speech and singing. In tone languages, which represent more than half the world's languages, vowels carry the pitch modulations that convey grammatical and lexical information. The tight bond between vowels and pitch is supported by experimental findings suggesting strong interactions in the processing of vowels and melody, but not between consonants and musical information: "Vowels sing but consonants speak" (Kolinsky et al., 2009, p. 1). Likewise, a mismatch negativity (MMN) study by Lidji et al. (2010) revealed a close processing relationship between vowels and pitch even at a pre-attentive level.

<sup>1</sup>There are, of course, many other areas where music and language show similarities, for example, striking rhythmic relationships between music and speech (Patel and Daniele, 2003; Neuhoff and Lidji, 2014) or correspondences in the length of 'intonation units' (Fenk-Oczlon and Fenk, 2009b; Lehmann and Goldhahn, 2016).

This perspective paper starts with a brief theoretical analysis of the sound systems of language and music, highlights general parallels in sound inventories, and reports coincidences between vowel systems and musical scales even in specific cultures. Moreover, we demonstrate a close match between vowel pitch and musical pitch in nonsense syllables of Alpine yodelers and in the yodeling of African Pygmies. We will discuss these findings in the context of other ethnomusicological findings, frequently used (Wiora, 1962; Nikolsky, 2015) to shed light on prehistoric music or on the origin of music. Studies showing the pivotal role of vowels and their musical qualities in the ontogeny of language and in infant directed speech will be used as further arguments for the musical protolanguage hypothesis.

## SOUND SYSTEMS IN MUSIC AND LANGUAGE: PITCH AND TIMBRE

Jackendoff (2009, p. 198) emphasizes the differences in the sound system of language and music: "In phonological structure, the repertoire of speech sounds forms a structured space of timbres /. . ./. In music, by contrast, the notes are distinguished by the way they form a structured space of pitches. . ."

But the distinction between timbre contrasts in speech on the one hand and pitch contrasts in music on the other is not always as clear-cut. Especially in some non-Western music, vocal timbre and musical melody are often intertwined. Walker (1997) provides evidence that Canadian Inuit, West Coast Native people, and Australian Aborigines do not distinguish between melody and vocal timbre when discussing their music. In the language of the Inuit, for instance, there does not even exist a word that differentiates between language and music: nipi means music and sound of the spoken voice (Nattiez, 1990). And microtonal pitch intervallic movements that could be related to vocal timbre changes (Walker, 1997) are characteristic for Australian Aboriginal music. Vocal timbre and timbral variations are also the main idea behind khasmatonal music whose tonal organization can be characterized as "half-spoken/half-sung with intense timbral/pitch modifications" (Nikolsky, 2015, p. 13). In addition to vowel timbre, which seems to be pivotal in timbre-driven music, a person's voice timbre (Sheikin, 2002) is an important and often exaggerated element in the music of many Siberian ethnicities.

## Vowels: Timbre, Intrinsic Pitch, Intensity, Duration

Timbre is, beyond question, the primary parameter that allows us to discriminate between different vowels and to some degree also between different consonants (e.g., nasals, liquids). The human voice, as well as the sounds of most musical instruments, is "made up of many nearly harmonically related frequency components, or partials" (Pierce, 1999, p. 8). The timbre of sounds is determined by the distribution of the sound energy among partials (overtones) of different frequencies.<sup>2</sup> Vowels differ from each other by specific peaks or formants in their sound spectra, with the formants F1 and F2 being most relevant for their identification (Peterson and Barney, 1952). The formants correspond to the resonances of the vocal tract or oral cavity<sup>3</sup>; the main articulatory parameters responsible for vowel timbre are tongue height, front-to-back position of the tongue, and lip rounding.

But vowels differ not only in their timbre but also in their intrinsic pitch, intensity, and duration. It is generally postulated that open (low) vowels have a longer intrinsic duration and a higher intrinsic intensity than close (high) vowels (for more details see Möbius, 2003). Concerning vowel intrinsic pitch, it has been known since Meyer (1896) that high vowels such as [i] have a higher intrinsic fundamental frequency (IF0) than low vowels such as [a]. While the mechanism which determines IF0 is still a subject of debate, there seems to be general agreement that vowel pitch depends primarily on the frequency of the second formant F2 (Marks, 1978; Traunmüller, 1986). Vowels with a high F2 (e.g., [i], [y]) have a higher intrinsic pitch than vowels with a low F2 (e.g., [a], [o]). The close association between F2 or spectral energy allocation and vowel intrinsic pitch indicates that timbre and intrinsic pitch of vowels are closely interrelated and cannot be separated.

## Vowel Timbre and Musical Melody

In Fenk-Oczlon and Fenk (2009b) we hypothesized that in songs containing strings of meaningless syllables the vowels might be connected to melodic direction in close correspondence to their timbre or intrinsic pitch. We tested this assumption based on all monophonic Alpine yodelers (n = 15) in Pommer's (1893) collection. The test revealed a surprisingly uniform pattern: the melody descended in 118 out of 121 [i]→[o] successions and ascended in 132 out of 133 [o]→[i] successions. A similar assumption was tested in Austrian traditional songs, which include successions of nonsense syllables. In 24 out of 26 songs in the Dawidowicz (1980) collection we found the expected coincidence between the vowel [i] and the highest pitch in melody (Fenk-Oczlon and Fenk, 2009a).
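To give a sense of how uniform these counts are, the following Python sketch computes the observed proportions and an exact two-sided binomial p-value against a chance level of 0.5. The choice of a binomial test, the chance level, and the function name are assumptions made purely for illustration; no such test is reported in the text, and successions within the same song may not be statistically independent.

```python
from math import comb


def binom_two_sided_p(k, n):
    """Exact two-sided binomial p-value for k hits in n trials at p = 0.5.

    For a symmetric null, this is twice the probability of the more
    extreme tail, capped at 1.
    """
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1))
    return min(1.0, 2 * tail / 2 ** n)


# Counts quoted in the text.
descending = (118, 121)  # melody descends on [i] -> [o]
ascending = (132, 133)   # melody ascends on [o] -> [i]

for label, (k, n) in (("descending", descending), ("ascending", ascending)):
    print(f"{label}: {k}/{n} = {k / n:.3f}, p = {binom_two_sided_p(k, n):.2e}")
```

Under these assumptions both p-values are vanishingly small, which is consistent with the description of the pattern as surprisingly uniform.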

A strong relationship between vowel timbre and musical pitch in meaningless syllables is also reported in the yodeling of African Pygmies (Fürniss, 1991; Demolin, 2013): front vowels are associated with high pitch and back vowels with low pitch.

<sup>2</sup>In fact, timbre does not only depend on spectral aspects of sound but also on time-varying aspects, i.e., the amplitude envelope (e.g., Caclin et al., 2005).

<sup>3</sup>For a detailed description of the human vocal tract, its relationship to vowel formant frequencies and potential relevance in the evolution of human language, see, e.g., Fitch (2010), as well as a recent discussion of this interesting topic by Fitch et al. (2017) and Lieberman (2017).

Vowel timbre, moreover, plays a key role in transforming spoken information into whistled languages (Meyer, 2008), in 'talking khomus' using the jew's harp to transmit verbal information in Yakut traditional music (Alexeyev and Spiridon, 2004), as well as in many mnemonic systems for transmitting or representing musical melodies (Hughes, 2000).

## SOUND SYSTEMS IN LANGUAGE AND MUSIC: PARALLELS IN THE INVENTORY SIZE

Are there parallels in the sound inventory size of language and music? Authors looking for parallels in the sound inventories of language and music often compared the whole phonemic inventory to musical pitches per octave and found that the number of phonemes across languages varies to a much greater extent ["from 11 in Polynesian to 141 in the languages of the Bushmen" (Besson and Schön, 2001, p. 235)] than the number of pitches per octave. And Rakowski (1999) argues that the number of phonemes in languages is much higher than the number of musical intervals, which roughly corresponds with Miller's magical number seven. An alternative approach was provided in Fenk-Oczlon and Fenk (2009b), comparing only the vowel inventories with the number of intervals in a musical scale, instead of the whole phonemic inventory.

Concerning the vowel inventories across languages there is unanimous consensus that five-vowel systems are the most frequent ones, followed by six and seven vowel-systems (Crothers, 1978; Schwartz et al., 1997; Ladefoged, 2005; Maddieson, 2005). For instance, Maddieson's (2005) sample in the World Atlas of Language Structures (WALS) comprises 563 languages; the smallest vowel quality inventory is 2 and the largest is 14. Most of the languages have five vowels, followed by six and seven. Four languages have two contrasting vowel qualities, only one language (German) has 14 basic vowels, and only two languages (British English; Bété) make use of 13 vowels. Note, however, that in the respective samples only basic vowel qualities are counted, and variations of basic vowel qualities such as nasalisation, pharyngalisation, length, etc. are not considered.

Unfortunately, statistical databases for scale types across cultures, analogous to those of vowel inventories, do not seem to exist. Musical scales "are classified according to the number of tones used, their range, and their intervals" (Nettl, 1956, p. 46). The simplest scales are ditonic (Nettl, 1956), and Burns (1999) argues that 12 pitches per octave represent a practical limit. The existence of a higher number of tones, e.g., through "intervals that bisect the distance between the Western chromatic intervals", is a rather controversial question, at least as a standard in the culture in question (the Arab-Persian system) (Burns, 1999, pp. 217–218). There is a general agreement that five-tone (pentatonic) scales are the most frequent ones amongst traditional forms of music, and that musical scales across cultures typically have five to seven pitches (Trehub et al., 1999, p. 965). In contrast to language, where six-vowel systems appear to be slightly more frequent than seven-vowel systems, the seven-tone scale (heptatonic scale) seems to be more frequent than the six-tone (hexatonic) scale. Concerning the lower number of six-note scales in comparison to five- or seven-note scales, Gill and Purves (2009) argue that six-note variants of five- or seven-note scales are very frequent, e.g., in blues scales, or in melodies using only six out of the seven tones of the heptatonic scale, or using passing tones not included in the pentatonic scale. They assume that six-note scales "are simply not recognized as formally as their five- and seven-note counterparts in Western music theory" (p. 8). In non-Western musical cultures pentatonic scales are, as in Western music, the most frequent ones; but it is interesting to see that they use, according to Nettl (1956, p. 60), hexatonic scales more frequently than heptatonic scales.

We here state some striking coincidences in the sound inventories of language and music: an upper limit of roughly 12 elements, a lower limit of 2 elements, and a frequency peak at 5–7 elements. It should be noted that in our comparison between vowel systems and musical scales we only considered the number of elements, and we did not consider the patterning of intervals in a scale or the distribution of vowels in the vowel space.

## Are There Correspondences between Vowels and Scales Even in Specific Cultures?

In Fenk-Oczlon and Fenk (2009a) we speculated that there might be coincidences between vowels and scales even in specific cultures. Consistent with this, most of the Australian Aboriginal languages have three vowels; only those of the Northern Territory are reported to have more vowels (Butcher and Anderson, 2008). And it is exactly for those cultures in the Northern Territory that Lauridsen (1983) reports the use of a higher number of pitches in song. The most difficult part in testing our assumption is to establish which use of certain musical features is indigenous, and which comes as cultural borrowing. Therefore, looking for coincidences between the number of vowels and number of pitches may be particularly informative in cultures before they had extended contact with other musical traditions. Further evidence we have found so far:

The indigenous cultures of the Americas – such as the Arapaho, Blackfoot, Cheyenne, Comanche, Klamath, Modoc, Navaho, Pawnee – frequently have only four or three vowels (Maddieson, 2005), and tritonic and tetratonic scales are frequently used in their music (Rhodes, 1954; Nettl, 1956). Notably, Nettl (1956, p. 112) reports that in Pueblo music, which includes the music of the Hopi, Zuni, and Taos, "the scales tend to have more tones than they do elsewhere on the continent." And Hopi, Zuni, and Taos belong exactly to those rather rare indigenous languages of the Americas which have five or six vowels.

Correspondences between three-vowel systems and tritonic scales appear in the Inca musical tradition and in the music of Greenland. Pre-Hispanic herranza ritual music of the Andes is generally tritonic rather than "incomplete pentatonic" (Holzmann, 1980), and Incan Quechua as well as other Quechuan varieties have just three vowels. Greenlandic, like the vast majority of Inuit languages, has only three vowels, and tritonic modes are still common in the music of East Greenland (Olsen, 1972). Moreover, tritonic patterns are supposed to have been characteristic of Inuit from Greenland to Siberia.

The correspondences found between vowel inventories and musical scales are, of course, highly tentative. Nevertheless, it would be interesting to investigate whether these correspondences can also be found in other autochthonous cultures.

## VOWELS IN ONTOGENY, PHYLOGENY, AND MOTHERESE

Vowels and their musical qualities play a pivotal role in the ontogeny of language and in infant-directed speech. Vowel-like sounds are the first speech sounds that children produce and are already present in the cooing stage, which appears at about 6–8 weeks of age. "The first coos that infants make sound like one long vowel" and the infant's learning process to be undertaken in this stage is to "produce a series of different vowel-like sounds strung together but separated by intake of breath" (Hoff, 2009, p. 143). The coos exhibit pitch variations, which according to Halliday (1975, cited in Masataka, 2007) are already connected with pragmatic functions. For instance, requests for objects are associated with a rising pitch, and labeling is associated with falling pitch. Masataka (2007) argues that pitch variations of this sort might be strengthened by caregivers using infant-directed speech or motherese, which is characterized above all by high pitches and exaggerated pitch contours. Thus, infant-directed speech is generally considered to have many musical qualities and to provide a scaffold for language acquisition (Falk, 2004). Again it is the vowels that are acoustically exaggerated: the more caregivers exaggerate the vowels, and the larger the vowel space they use, the better the infant's language performance (Liu et al., 2003). The importance of vowel pitch or vowel timbre in early communication is demonstrated by Papousek's (1992) findings (cited in Walker, 1997), in which infants' abilities to discriminate pitch variations in the mother's voice are linked to "vowel" pitch, not "musical" pitch.

But vowels or vowel-like sounds may play a prominent role not only in the ontogeny of human language, but also in its early evolution. Notably, recent work by Boë et al. (2017) demonstrated astonishing similarities between human vowels and "vowel like segments" produced by Guinea baboons. Combining acoustical analyses of baboon vocalizations with an anatomical study of the animals' vocal tract, the authors showed that baboons are capable of producing and combining five vowel-like sounds, despite their high larynx, suggesting that a proto-vocalic system was already present in the last common ancestor of humans and baboons, at least 25 million years ago (Boë et al., 2017).

## DISCUSSION AND CONCLUSION

We started with a theoretical analysis of the sound systems of language and music and demonstrated a closer relationship and more parallels between these systems than generally assumed. Vowels show all the core properties of music, i.e., timbre, intrinsic pitch, intensity, and duration. We revealed general correspondences between vowel systems and musical scales across cultures, and presented correspondences even in specific cultures. While the general parallels in the sound inventories may reflect cognitive and physiological constraints of our auditory and articulatory apparatus, the coincidences found between number of vowels and number of tones in specific cultures indicate a tight bond between vowels and music. The match between vowel pitch and musical pitch in meaningless syllables of Alpine yodelers and yodelers of African Pygmies and the relevance of vocal timbre in the music of many non-Western cultures, in which vocal timbre/vowel timbre and musical melody are often intertwined, further support a close relationship between vowels and music. How can these findings be related to the evolution of music?

The evidence we provided for a close relationship between vowels and musical pitch stems from studies of ethnic or non-Western music. And ethnomusicological findings are often assumed to be particularly revealing with regard to the origin of music. For example, Nikolsky (2015) uses ethnic music to reconstruct prehistoric music and assumes that the earliest music "was organized not by pitch, but by timbre" (p. 25). This assumption implies that at this early stage, music and speech must have been together and that it was the vowels which displayed – besides voice timbre – timbre and pitch modulations.

Taking further into account the pivotal role of vowels in ontogeny – phylogeny in some respects and to some degree resembles ontogeny and vowel-like sounds are the first speech sounds that children produce – it is tempting to speculate that the earliest human vocal communication started with vowels or vowel syllables strung together, which were connected by semivowels or glides such as [w], [h], [j] or the glottal stop [ʔ]. The sequences of vowels exhibited pitch and timbre modulations, which were combined for different pragmatic and social functions, and were probably propositionally meaningless (cf. also Fitch, 2010). In a later stage, more 'real' consonants such as obstruents emerged and were combined with vowels into consonant-vowel syllables. This was likely the emergence of articulated speech (Jordania, 2006) and of utterances which could express propositional meanings.

We still encounter vocal music in which mere vowel sounds are connected. For example, Lewis (2013) reports that the songs of the BaYaka Pygmies rarely have words, but many songs are based on vowel-sound melodies or hocketed vowel sounds.

In conclusion, we have demonstrated a close relationship between vowels and music in non-Western cultures which may shed light on the earliest human vocal communication and may strengthen the idea of a musical protolanguage. Although our findings are preliminary, we are convinced that future research combining ethnomusicological findings with those from linguistic typology will provide further insight into the evolution of music, and in music as such.

#### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

### REFERENCES

Jespersen, O. (1894). Progress in Language. London: Swan Sonnenschein & Co.



**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Fenk-Oczlon. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# White Matter Correlates of Musical Anhedonia: Implications for Evolution of Music

Psyche Loui<sup>1</sup>\*, Sean Patterson<sup>1</sup>, Matthew E. Sachs<sup>2</sup>, Yvonne Leung<sup>1,3</sup>, Tima Zeng<sup>1</sup> and Emily Przysinda<sup>1</sup>

<sup>1</sup> Music, Imaging and Neural Dynamics Lab, Department of Psychology, Program in Neuroscience and Behavior, Wesleyan University, Middletown, CT, United States, <sup>2</sup> Department of Psychology, Brain and Creativity Institute, University of Southern California, Los Angeles, CA, United States, <sup>3</sup> The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Penrith, NSW, Australia

Recent theoretical advances in the evolution of music posit that affective communication is an evolutionary function of music through which the mind and brain are transformed. A rigorous test of this view should entail examining the neuroanatomical mechanisms for affective communication of music, specifically by comparing individual differences in the general population with a special population who lacks specific affective responses to music. Here we compare white matter connectivity in BW, a case with severe musical anhedonia, with a large sample of control subjects who exhibit normal variability in reward sensitivity to music. We show for the first time that structural connectivity within the reward system can predict individual differences in musical reward in a large population, but specific patterns in connectivity between auditory and reward systems are special in an extreme case of specific musical anhedonia. Results support and extend the Mixed Origins of Music theory by identifying multiple neural pathways through which music might operate as an affective signaling system.

#### Edited by:

Aleksey Nikolsky, Braavo! Enterprises, United States

#### Reviewed by:

Christopher I. Petkov, Newcastle University, United Kingdom John G. Neuhoff, College of Wooster, United States Chris Baber, University of Birmingham, United Kingdom

> \*Correspondence: Psyche Loui ploui@wesleyan.edu

#### Specialty section:

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

Received: 19 May 2017 Accepted: 11 September 2017 Published: 25 September 2017

#### Citation:

Loui P, Patterson S, Sachs ME, Leung Y, Zeng T and Przysinda E (2017) White Matter Correlates of Musical Anhedonia: Implications for Evolution of Music. Front. Psychol. 8:1664. doi: 10.3389/fpsyg.2017.01664

Keywords: music, evolution, auditory, affective, communication, diffusion tensor imaging

## INTRODUCTION

Music is celebrated and valued in every human culture, yet we know relatively little about why music exists, or what functions music might serve for humankind. The evolutionary function of music has been a subject of debate since Darwinian times (Darwin, 1871). On the one hand, some scholars espouse views that music is an evolutionary byproduct that confers no cognitive advantage, i.e., that music is "auditory cheesecake" (Pinker, 1997). On the other hand, most researchers in the field of music perception and cognition posit that music serves many adaptive functions (Huron, 2001; Honing et al., 2015). For each of these functions, musical sounds function as an auditory channel for interpersonal communication, possibly preceding speech and language (Mithen, 2007). Thus, the need for interpersonal communication through an auditory channel is at the core of evolutionary pressures that are thought to shape music.

This need for interpersonal communication likely changes the cognitive system by virtue of lasting effects that music exerts upon our species. The notion that music is a human invention but transforms our experience, i.e., that music is a transformative technology of the mind (TTM) (Patel, 2008, 2010), is attractive for two reasons. Firstly, the TTM view reconciles the debate between more traditional adaptationist and exaptationist views (cf. Justus and Hutsler, 2005; Trainor, 2006) by pointing out the false dichotomy between these two views. Secondly, TTM brings to the forefront the idea that brains can change as a result of musical experience. Thus, studies that relate musical experience to inter-individual variability within our species may be informative of how music came to be valued in our species. Another important evolutionary role of music is in its value as an emotional signal: music has power to communicate and evoke strong emotions through an auditory channel (Snowdon et al., 2015). Building on Patel's TTM theory, the Mixed Origins of Music (MOM) theory posits that music transforms the brain through an affective signaling system that is common to many socially living animals (Altenmüller et al., 2013b; Snowdon et al., 2015). Specifically, the neural mechanisms through which chills occur in response to music may be informative of the evolution of music as an affective communication tool (Altenmüller et al., 2013b). Music elicits a variety of emotions: from abstract, aesthetic experiences to strong, more physiologically measurable emotional responses (Scherer, 2004). Several underlying mechanisms have been proposed through which emotional responses to music might be elicited. For instance, the BRECVEMA model provides a comprehensive account of emotional mechanisms for music (Juslin and Västfjäll, 2008; Juslin, 2013); they include (among others) brainstem responses and evaluative conditioning mechanisms, which involve brain areas within the dopaminergic reward system.

Music that is rewarding is processed by functional connectivity between auditory areas [superior temporal gyrus (STG)] and reward system areas such as the nucleus accumbens (NAcc) (ventral striatum), caudate (dorsal striatum), and areas in the classic limbic system including the amygdala and anterior insula (AIns) (Salimpoor et al., 2011, 2013). Individual differences in the tendency to derive chills, i.e., measurable psychophysiological responses, from music are associated with structural connections from auditory regions (STG) to the AIns, which is consistently activated during the experience of strong emotions, and the medial prefrontal cortex (mPFC), which is important for computing social value; furthermore, this association is modulated by connectivity through the NAcc, a hub in the dopaminergic reward system (Sachs et al., 2016). On the extreme end of the spectrum of individual differences in musical reward, recent work has found evidence for specific musical anhedonia, a rare but intriguing condition where individuals derive no reward responses from their musical experience (Mas-Herrero et al., 2013, 2014). The underlying brain mechanisms are similar to those reviewed above, in that they involve functional connectivity between auditory regions and reward regions, notably the dopaminergic pathway centering around the NAcc (Martínez-Molina et al., 2016).

Although differences in functional connectivity and general brain structure have both been observed in subjects with musical anhedonia (Martínez-Molina et al., 2016; Belfi et al., 2017), it is not yet known whether, and to what extent, these functional connectivity differences identified in musical anhedonics might also be structurally detectable. Furthermore, although structural brain differences in white matter connectivity between auditory and emotion and reward areas have been related to individual differences in reward responses to music (Sachs et al., 2016), it is unknown whether specific musical anhedonia simply reflects the low end of a continuum of normal individual differences in brain connectivity and reward responses to music, or whether musical anhedonia is a categorically distinct disorder that reflects anatomically dissociated neural substrates from normal variations in reward sensitivity. If the former is the case (i.e., musical anhedonia represents the low end of a single continuum), then one would expect that the differences in auditory-to-reward connectivity found in musical anhedonics also extend to the rest of the population. Conversely, if the latter is the case (i.e., musical anhedonia is different from normal variability in musical reward), one would expect that musical anhedonics have different patterns of reward and auditory-to-reward connectivity from the variations that are generally observed within the population and that reflect reported differences in the experience of reward from music.

Here we test the primary hypothesis that musical anhedonia reflects specific differences in white matter connectivity within the reward system, and between the auditory and reward systems. Secondly, we test the hypothesis that the same patterns of white matter connectivity reflect individual differences in the normal variations of reward experiences in music. Using combined behavioral and diffusion tensor imaging (DTI) methods, we compare the white matter connectivity of a musically anhedonic subject, BW, to a group of normal controls (n = 46) who report a range of reward from music. Results will identify the neuroanatomical networks that predispose the human brain toward successful affective communication through music.

## MATERIALS AND METHODS

### Subjects

Subject BW (male, age 53 years, right-handed) presented with a self-reported, socially debilitating lack of reward experience from music despite intact reward responses to visual art. **Table 1** shows demographic information and information about musical training. Screening measures including the Montreal Battery for Evaluation of Amusia (MBEA) (Peretz et al., 2003) and the nonverbal measure of the Shipley Institute of Living Scale (Shipley, 1940) were used to rule out any differences due to amusia or general intellectual impairment, respectively.

Control subjects (n = 46, 17 females, all right-handed) consisted of Wesleyan students and community members. Subjects reported a variety of musical training, and tested within normal ranges for MBEA and Shipley (**Table 1**). Among the control subjects, 85% (39 subjects) completed the BMRQ. All subjects gave written informed consent as approved by the Institutional Review Boards of Wesleyan University and Hartford Hospital.

### Stimuli

In addition to screening tools reported above, 39 of the 46 subjects completed the Revised Physical Anhedonia Scale (PAS) (Chapman et al., 1976) and the Barcelona Music Reward Questionnaire (BMRQ) (Mas-Herrero et al., 2013). The PAS is a self-report scale used to measure anhedonia, the lowered ability to experience pleasure (Chapman et al., 1976). It consists of 61 statements that describe pleasurable experiences (e.g., "I have usually found lovemaking to be intensely pleasurable."). Subjects are asked to indicate whether each statement is true or false as it applies to them. Among the 61 statements, 10 items pertain to sounds (e.g., "The sounds of a parade have never excited me.") whereas the others are non-sound items that include other sensory and social pleasures (e.g., "I have often found walks to be relaxing and enjoyable." "I have often enjoyed receiving a strong, warm handshake.").

The BMRQ (Mas-Herrero et al., 2013) was used to assess how BW experienced reward associated with music, in comparison with the control group. The BMRQ is a 20-item questionnaire designed to measure musical reward experiences as a combination of five factors: musical seeking, emotion evocation, mood regulation, sensory-motor, and social reward.

### Procedures

After informed consent procedures, subjects completed surveys to report their demographic and musical training data. They also completed the MBEA and Shipley tests as screening measures for amusia and intellectual impairment. They then completed the PAS and BMRQ to assess possible general and musical anhedonia.

In addition to behavioral data, high-resolution T1 and DTI images were acquired in a 3T Siemens Skyra MRI scanner at the Olin Neuropsychiatry Research Center at the Institute of Living. Anatomical images were acquired using a T1-weighted, 3D, magnetization-prepared, rapid-acquisition, gradient echo (MPRAGE) volume acquisition with a voxel resolution of 0.8 mm × 0.8 mm × 0.8 mm. Diffusion images were acquired using a diffusion-weighted, spin-echo, echo-planar imaging sequence (TR = 4.77 s, voxel size = 2.0 mm × 2.0 mm × 2.0 mm, axial acquisition, 64 noncollinear directions with b-value of 1000 s/mm<sup>2</sup>, 64 noncollinear directions with b-value of 2000 s/mm<sup>2</sup>, 1 image with b-value of 0 s/mm<sup>2</sup>).

TABLE 1 | Demographic information, baseline tests, and scores on Barcelona Music Reward Questionnaire and Physical Anhedonia Scale for BW and control subjects.

### Data Analysis

All MR images were processed using FMRIB's Software Library (FSL) (Jenkinson et al., 2012). The diffusion images were first corrected for eddy current distortions using the eddy\_correct function. Non-brain structures were removed from each participant's images with the Brain Extraction Tool (BET). A diffusion tensor model was fit at each voxel in the extracted brain using the dtifit function to obtain a fractional anisotropy (FA) image for each participant. In preparation for probabilistic tractography, Bayesian Estimation of Diffusion Parameters Obtained using Sampling Techniques (bedpostX) was used to determine the probable directions of each fiber for each brain voxel (Behrens et al., 2007).
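
The order of the preprocessing steps described above can be sketched as a list of FSL command lines. This is a minimal illustration only: the file names (`dwi.nii.gz`, `nodif.nii.gz`, `bvecs`, `bvals`), the helper name `fsl_preprocessing_commands`, and the choice of volume 0 as the reference are assumptions, not details reported by the authors, and the commands are only built here, not executed.

```python
from pathlib import Path
from typing import List


def fsl_preprocessing_commands(subject_dir: str) -> List[List[str]]:
    """Build (but do not run) FSL command lines for one subject.

    All file names below are hypothetical placeholders.
    """
    d = Path(subject_dir)
    return [
        # 1. Eddy-current correction; the last argument is the reference volume.
        ["eddy_correct", str(d / "dwi.nii.gz"), str(d / "data.nii.gz"), "0"],
        # 2. Brain extraction on a b=0 image (assumed to be already extracted
        #    as nodif.nii.gz); -m additionally writes nodif_brain_mask.
        ["bet", str(d / "nodif.nii.gz"), str(d / "nodif_brain"), "-m"],
        # 3. Voxelwise tensor fit; writes dti_FA among other outputs.
        ["dtifit",
         "-k", str(d / "data.nii.gz"),
         "-o", str(d / "dti"),
         "-m", str(d / "nodif_brain_mask.nii.gz"),
         "-r", str(d / "bvecs"),
         "-b", str(d / "bvals")],
        # 4. Bayesian estimation of fiber orientations for tractography;
        #    bedpostx reads data, nodif_brain_mask, bvecs, bvals from the folder.
        ["bedpostx", str(d)],
    ]


commands = fsl_preprocessing_commands("sub01")
for cmd in commands:
    print(" ".join(cmd))
```

Each inner list could be passed to `subprocess.run` on a machine with FSL installed; keeping the commands as data makes the sequence easy to inspect before anything is executed.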

Probabilistic tractography was conducted to determine structural connectivity in each hemisphere between each pair of the following regions of interest (ROIs): STG, AIns, mPFC, and NAcc. The same regions were used as in our previous study (Sachs et al., 2016), as they were specifically identified to include white matter regions within the reward system (mPFC, NAcc, and AIns) and the auditory system (STG). The STG and NAcc were extracted from the Harvard-Oxford Cortical atlas (Desikan et al., 2006), and masked with a standardized FA image. The AIns was extracted from the LONI atlas (Shattuck et al., 2008). Then, using previous literature as a reference (Uddin and Menon, 2009), the anterior portion was defined anatomically within the lateral sulcus. As atlases varied in their delineation of the prefrontal cortex, the mPFC was hand drawn on coronal slices in the anterior portion of the corona radiata (Marchina et al., 2011). Each ROI was extracted or hand drawn on the standardized FA template by a first coder, and verified by a second coder. Each ROI was then warped to each individual subject's FA image in native space and binarized.

Tractography was then initiated from each ROI as a seed toward each other ROI as a waypoint mask; and then tractography was initiated again using the original waypoint mask as the seed and the original seed as the waypoint mask; these two directions of probabilistic tractography were then averaged to yield a single tract between each pair of regions. Each resultant tract was averaged and then thresholded at 10% of its robust intensity level to minimize extraneous tracts. Tract volume and mean FA of the normalized tracts were exported for statistical comparisons. Additionally, to enable visualization all subjects' tracts and FA images were aligned and normalized to the FSL 1 mm FA template using both linear registration (FLIRT) (Jenkinson and Smith, 2001) and nonlinear registration (FNIRT) tools, and canonical tract images were created by averaging each binarized tract across subjects in the control group, and thresholding voxels below the median.

One-sample z-tests were used to compare tract volume and normalized FA between BW and the control group. Furthermore, to test for brain–behavior relationships within the control group, we ran two separate multiple regression models, both using Music Reward (overall score from the BMRQ) as the dependent variable. The first regression used FA values from each tract as predictor variables; the second regression used volumes from each tract as predictor variables. Collinearity for all variables in both regressions was minimal (Tolerance > 0.1, VIF < 8). For tracts that were significant predictors of the Music Reward score, we also conducted follow-up tests for correlations between tract FA and each of the five subscores from BMRQ, while applying the Bonferroni statistical correction for the five subscores.
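
The two collinearity criteria quoted above are linked by the standard definitions tolerance = 1 − R<sub>j</sub><sup>2</sup> and VIF = 1/tolerance, where R<sub>j</sub><sup>2</sup> comes from regressing predictor j on the remaining predictors. The sketch below illustrates that relationship together with a Bonferroni-adjusted threshold for the five BMRQ subscores. The function names and the value R<sub>j</sub><sup>2</sup> = 0.875 are illustrative assumptions, and the 0.05 family-wise alpha is assumed rather than stated in this paragraph.

```python
def tolerance(r2_j: float) -> float:
    """Tolerance of predictor j, given R^2 from regressing it on the others."""
    return 1.0 - r2_j


def vif(r2_j: float) -> float:
    """Variance inflation factor, the reciprocal of tolerance."""
    return 1.0 / tolerance(r2_j)


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Illustrative value only: a predictor sharing 87.5% of its variance with
# the others sits exactly at VIF = 8 (tolerance = 0.125).
example_r2 = 0.875
print(tolerance(example_r2), vif(example_r2))

# Five follow-up correlations (one per BMRQ subscore), assumed alpha of 0.05.
print(bonferroni_alpha(0.05, 5))
```

Under these definitions, a VIF below 8 is the stricter of the two stated criteria, since it corresponds to a tolerance above 0.125.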

## RESULTS

### Behavioral Results

The Barcelona Music Reward Questionnaire showed that BW had a low reward response to music in all categories of musical reward. While controls had an average factor score of 50 (SD = 10) on the BMRQ (Music Reward overall score), BW had an overall factor score of −9, which was 5.89 standard deviations below controls. BW scored more than 2.5 standard deviations below controls on all subscales of the BMRQ (**Figure 1A**).

The Physical Anhedonia Scale showed that BW was not generally anhedonic, except for items that pertain to sound. Control subjects scored an average of 17% of responses in the anhedonic ("pathological") direction (SD = 9%). BW scored a total of 39% of responses in the anhedonic direction. Item analysis of the PAS was done by separately analyzing sound and non-sound categories. While the control subjects showed similar proportions of anhedonic scores for sound items and non-sound items (M = 19.6%, SD = 14% anhedonic responses for sound items; M = 16.8%, SD = 11% anhedonic responses for non-sound items), BW showed 21.7% anhedonic scores for non-sound items (within 1 SD of the mean) but 90.9% anhedonic responses for sound items (more than 5 SD above the mean). This striking dissociation (**Figure 1B**) suggests that BW does not have general anhedonia, but is specifically anhedonic toward sounds, especially to music.
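
The standard-deviation distances quoted in the behavioral results can be checked with a simple single-case z-score, (case value − control mean) / control SD. The sketch below uses only the rounded means and SDs printed in the text; the function name `case_z` is illustrative. Because the inputs are rounded, small departures from the published figures are to be expected (for example, the BMRQ values give −5.9, against the reported 5.89 SD below controls).

```python
def case_z(case_value: float, control_mean: float, control_sd: float) -> float:
    """Distance of a single case from the control mean, in control SD units."""
    return (case_value - control_mean) / control_sd


# BMRQ overall score: controls M = 50, SD = 10; BW = -9.
bmrq_z = case_z(-9.0, 50.0, 10.0)

# PAS, percent anhedonic responses.
pas_sound_z = case_z(90.9, 19.6, 14.0)      # sound items
pas_nonsound_z = case_z(21.7, 16.8, 11.0)   # non-sound items

print(f"BMRQ z = {bmrq_z:.2f}")
print(f"PAS sound z = {pas_sound_z:.2f}")
print(f"PAS non-sound z = {pas_nonsound_z:.2f}")
```

The sound-item z-score comes out above 5 and the non-sound z-score below 1, which matches the dissociation described in the paragraph above.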

### DTI Results

#### Musical Anhedonic vs. Controls

**Figure 2** compares tract FA and volume between BW and control subjects, showing some differences in auditory–reward connectivity in the subject with musical anhedonia. BW had significantly lower tract volume than controls in tracts between the left STG and left NAcc (z = −2.16, p = 0.03) and between the left AIns and left NAcc (z = −1.98, p = 0.04) at the uncorrected p < 0.05 level. No other tracts showed statistically significant differences in volume between BW and controls according to z-tests. Mean FA (after normalizing for volume to enable a direct comparison of FA values) was greater for BW than controls between left STG and left NAcc (z = 3.08, p = 0.002), the same tract in which he showed lower volume than controls; this difference survived Bonferroni correction at p < 0.05/10. No other tracts showed significant differences in FA according to the z-test.
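
As a rough check on how the reported z-values map onto p-values and the Bonferroni threshold, the following sketch converts a z-score into a two-tailed p-value under a standard normal distribution and compares it with 0.05/10. It assumes two-tailed tests, which the text does not state explicitly, and the function names are illustrative.

```python
import math


def two_tailed_p(z: float) -> float:
    """Two-tailed p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def survives_bonferroni(p: float, alpha: float = 0.05, n_tests: int = 10) -> bool:
    """True if p falls below the Bonferroni-adjusted threshold alpha / n_tests."""
    return p < alpha / n_tests


# z-values as reported for BW versus controls.
for label, z in [("volume, z = -2.16", -2.16),
                 ("volume, z = -1.98", -1.98),
                 ("FA, z = 3.08", 3.08)]:
    p = two_tailed_p(z)
    print(f"{label}: p = {p:.4f}, survives correction: {survives_bonferroni(p)}")
```

Under these assumptions only the FA difference falls below the corrected threshold of 0.005, while both volume differences fall below the uncorrected 0.05 level only, in line with the description above.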

#### Individual Differences within Control Group

A multiple regression model with the dependent variable of Music Reward score, with tract volume (in mm<sup>3</sup>) of each tract as predictor variables, accounted for 38% of the variability (R<sup>2</sup> = 0.38), but was not significant after accounting for the number of predictors (adjusted R<sup>2</sup> = 0.15, F = 1.69, p = 0.13). Among the controls, the Music Reward score was significantly predicted by the volume of tracts between LSTG and LAIns (β = 1.11, t = 2.76, p = 0.01, bivariate correlation r = 0.26, partial correlation r<sub>p</sub> = 0.463), between RSTG and RNAcc (β = −0.81, t = −2.33, p = 0.027, r = 0.036, r<sub>p</sub> = −0.40), and between RSTG and RMPFC (β = 0.74, t = 2.10, p = 0.045, r = 0.193, r<sub>p</sub> = 0.37). Although these tract volumes were significant predictors of Music Reward at the p < 0.05 level, they did not survive correction for multiple comparisons across the 10 tested tracts. **Figure 3** shows these tracts and scatterplots of their bivariate correlations with the Music Reward score.

A multiple regression model with the dependent variable of Music Reward score, with FA values of each tract as predictor variables, accounted for 26% of the variability (R<sup>2</sup> = 0.26, adjusted R<sup>2</sup> = −0.002, F = 0.99, p = 0.47). None of the tracts emerged as significant predictors (all p > 0.05).
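
The relationship between R<sup>2</sup>, adjusted R<sup>2</sup>, and the overall F statistic in the two regressions can be illustrated with the textbook formulas. The sketch below assumes n = 39 observations (the controls who completed the BMRQ) and k = 10 predictors (the 10 tested tracts); both values are inferred from the text rather than stated alongside the models. Because it starts from the rounded R<sup>2</sup> values, it lands close to, but not exactly on, the published figures.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2: float, n: int, k: int) -> float:
    """Overall F statistic with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


N_SUBJECTS = 39   # assumption: controls with BMRQ data
N_TRACTS = 10     # assumption: one predictor per tested tract

for label, r2 in [("tract volume model", 0.38), ("tract FA model", 0.26)]:
    adj = adjusted_r2(r2, N_SUBJECTS, N_TRACTS)
    f = f_statistic(r2, N_SUBJECTS, N_TRACTS)
    print(f"{label}: adjusted R^2 = {adj:.3f}, F(10, 28) = {f:.2f}")
```

The FA model shows why an R<sup>2</sup> of 0.26 is uninformative here: with 10 predictors and fewer than 40 subjects, the penalty for model size pushes the adjusted R<sup>2</sup> to roughly zero.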

#### Predicting Musical Anhedonic Brain and Behavior from Control Group Data

To assess whether BW falls along the same continuum of brain–behavior relationships as predicted by controls, we first used the regression model from all tract volume data to generate a prediction for BW's Music Reward score. Given the multiple regression model obtained from tract volume data above (see section "Individual Differences Within Control Group"), BW's tract volume data predicted his Music Reward score to be 0.29, which was much higher than his actual score (−9). However, pairwise correlations between behavior and tract volume (scatterplots in **Figure 3**) showed that BW is a predictable outlier from the control subjects' data, with low volume in tracts between LSTG and LAIns and between RSTG and RMPFC, as predicted by his low Music Reward score and by control subjects' data. To assess whether BW's tract volumes belonged to the same continuum as controls, we used the slope and intercept of the trend line that best fit the bivariate relationship among control subjects to predict BW's tract volumes using his Music Reward score (**Table 2**), thus extrapolating control subjects' data to predict BW's tract volumes. The prediction fits BW's actual data with 7.6% error for the LSTG\_LAIns tract, with 9.4% error for the RSTG\_RNAcc tract, and with 1.0% error for the RSTG\_RMPFC tract, suggesting that for these three tracts, BW falls on the extreme end of the same continuum as the control subjects.
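
The extrapolation step described above can be sketched as follows: fit a least-squares line to the controls' (Music Reward score, tract volume) pairs, evaluate it at the case's score, and express the mismatch as a percentage of the observed volume. All numbers except the case score of −9 are hypothetical placeholders, since Table 2 is not reproduced here, and defining percent error relative to the observed value is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def percent_error(predicted: float, actual: float) -> float:
    """Absolute prediction error as a percentage of the observed value."""
    return abs(predicted - actual) / abs(actual) * 100.0


# HYPOTHETICAL control data: Music Reward scores and tract volumes (mm^3).
control_scores = [30.0, 40.0, 50.0, 60.0, 70.0]
control_volumes = [1000.0, 1200.0, 1400.0, 1600.0, 1800.0]

slope, intercept = fit_line(control_scores, control_volumes)

case_score = -9.0            # BW's Music Reward score as reported in the text
case_volume_actual = 200.0   # HYPOTHETICAL observed tract volume for the case

case_volume_predicted = slope * case_score + intercept
error = percent_error(case_volume_predicted, case_volume_actual)
print(f"predicted = {case_volume_predicted:.1f} mm^3, error = {error:.1f}%")
```

A small percent error under this scheme indicates that the case lies near the line extrapolated from controls, which is the sense in which BW is described as falling on the extreme end of the same continuum.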

## DISCUSSION

Individual differences in brain and behavior can be demonstrated by the normal variance within the general population, as well as by extreme cases where substantial variations in brain and behavior give rise to striking deviations from the general population. For the study of the evolution of music, musical anhedonia presents one such model of a striking dissociation, in which some individuals lack reward responses specifically to sound. Here, we see that patterns of white matter connectivity in the auditory and reward systems reflect individual differences in the tendency to perceive reward from music. Auditory–reward connectivity differences are observed in our extreme case of musical anhedonia, and also reflect individual differences in music reward sensitivity within the control group.

BW, a subject with severe musical anhedonia, had decreased white matter volume but higher FA between auditory and reward areas, specifically between left STG and left NAcc. The left STG is a cortical hub of the auditory system: it includes auditory belt and parabelt areas which are important for analyzing temporal content of sounds, including speech-specific content (Overath et al., 2008, 2015). The NAcc is central to the mesolimbic pathway of the dopaminergic reward system, with its known role in reward and reinforcement (Wise, 2006), and is the crucial waystation of a reward network activated during the peak experience of music-related reward (Salimpoor et al., 2011; Zatorre and Salimpoor, 2013; Koelsch, 2014). Although the left NAcc showed lower volume of connectivity to the ipsilateral AIns as well as STG, the volume results were only significant at the uncorrected level; in contrast, the increased FA between left NAcc and STG was significant at the Bonferroni-corrected level in BW. FA, the main outcome variable in DTI, is an index of white matter integrity which includes myelination and coherence of axonal bundles. Probabilistic tractography requires FA values of each voxel to be above the white matter threshold, in order to derive tract volume (Behrens et al., 2007). Here, the pattern of simultaneously increased white matter integrity and decreased volume may suggest increased myelination and/or decreased crossing fibers in BW's anatomical connections between LSTG and LNAcc, which could result in increased inhibition from LSTG to LNAcc. Functionally, the increased inhibition from LSTG could lead to a downregulation of the activity of LNAcc, resulting in deactivation of the NAcc as observed in recent functional MRI work in musical anhedonics (Martínez-Molina et al., 2016). Although these results are correlative rather than causal, the finding that BW had decreased volume but increased white matter integrity between these two regions adds to existing literature on the role of auditory–reward connectivity in affective responses to music (Salimpoor et al., 2013; Sachs et al., 2016); the implications of this data pattern for the evolution of music will be considered again later in this Discussion.

The PAS showed that BW was anhedonic to all sound items, including non-music items (e.g., "the sounds of a parade"; "the crackling of fire in a fireplace"). Upon further interview, BW stated: "The crackle of a fireplace, the rustle of leaves, the swish of ocean waves – I just don't appreciate them." It remains to be seen whether musical anhedonics in other studies also report anhedonia toward sound items from the PAS, or whether BW is unique in his lack of appreciation of all auditory stimuli. If BW is different from other musical anhedonics in this regard, then one might expect that his auditory–reward disconnection is also more general than in other cases of musical anhedonia.

Regarding the lack of appreciation for sounds, an interesting related question concerns whether BW could have misophonia, another auditory disorder where an individual reacts aversively to trigger sounds (Kumar et al., 2017). While more research is needed in the future to determine the extent of overlap or shared traits between misophonia and musical anhedonia, our study identifies BW as having musical anhedonia rather than misophonia, mainly because BW's main complaint is that he feels no enjoyment from music, rather than being angered or anxious in response to trigger sounds as is common among misophonics (Edelstein et al., 2013). According to his self-report: "Music doesn't particularly change my mood or give me an emotional response." "Music never disgusts me. (The taste of cheese disgusts me. The smell of rotten eggs disgusts me. The sight of gore disgusts me.) Mostly I'd say that I'm neutral about music, because I just don't care (and I don't care that I don't care!), and I mostly tune it out." He also reports normal responses to speech and nonverbal vocal sounds. In contrast, misophonics most commonly report feeling disgusted, trapped, and/or anxious in response to trigger sounds, which are typically sounds produced by other people (Edelstein et al., 2013). From our findings, BW shows an abnormal pattern of connectivity from the NAcc; this was not observed in misophonics (Kumar et al., 2017). Thus, at present results suggest that musical anhedonia pertains more to a lack of reward, whereas misophonia pertains more to the experience of negative emotions such as anger and irritation in reaction to trigger sounds.

bivariate correlations (r) between Music Reward score and the volume of each tract, as well as partial correlation coefficients (r<sub>p</sub>) from the regression for purposes of comparison against bivariate correlations. BW's data are also shown on scatterplots for purposes of comparison.

Within our control group, the volume of some tracts between auditory and reward regions, specifically between LSTG and LAIns, between RSTG and RNAcc, and between RSTG and RMPFC, was predictive of musical reward at the 0.05 (uncorrected) level. Although these results do not survive correction for multiple comparisons, it is noteworthy that only tracts from left or right STG (the only auditory regions in our model) emerged as significant predictors, suggesting that individual differences in music reward do pertain to auditory-specific access to the reward system. It is also noteworthy that BW's tract volume data can be predicted by extrapolating the trend line that best fits the bivariate relationship between music reward and volume of the significant predictor tracts. In contrast, the multiple regression model obtained from control subjects did not accurately predict BW's music reward score. Thus, control subjects' data can predict BW's tract volumes but not his behavioral scores. This may be because BW's music reward score, at 5.89 SD below controls, is much more of an outlier than his brain measures; thus, the brain predictors of behavior derived from control subjects do not apply to BW's very unusual behavioral data, but BW's tract volume data appear to lie at the low end of a normal distribution. The fact that BW is a very extreme outlier on the BMRQ also suggests that true musical anhedonia, at least as represented by the case of BW, is probably very rare. This is consistent with the observation that across patients with many types of brain damage, few report musical anhedonia (Belfi et al., 2017). Future studies might rely on more targeted strategies to identify more such cases of musical anhedonia.
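To make concrete what it means for a single case to lie several standard deviations below a control group, the following minimal sketch computes such a distance. It is an illustration only: the function name `single_case_z`, the control scores, and the case score are hypothetical, and the use of the sample standard deviation is an assumption rather than a description of the analysis reported here.

```python
import statistics


def single_case_z(case_score, control_scores):
    """Distance of one case from the control mean, in control SD units."""
    mean = statistics.mean(control_scores)
    # Sample standard deviation (n - 1 denominator); an assumption of this sketch.
    sd = statistics.stdev(control_scores)
    return (case_score - mean) / sd


# Hypothetical numbers for illustration only; these are not the study's data.
controls = [70, 75, 80, 85, 90]
case = 40

z = single_case_z(case, controls)
print(round(z, 2))  # prints -5.06
```

A negative value indicates that the case scores below the control mean; the larger its magnitude, the more extreme the outlier.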

The tract between LSTG\_LAIns shows a continuum in volume that best reflects our range of behavioral data: its volume is reduced in the musical anhedonic as well as positively correlated with music reward. Connections between AIns and STG likely include the arcuate fasciculus, part of the auditory dorsal pathway that connects superior temporal and inferior frontal regions and is related to musical ability (Loui et al., 2009, 2011; Halwani et al., 2011; Loui, 2015). Furthermore, AIns is reduced in functional connectivity to auditory cortex in singers (Kleber et al., 2013), and functional connectivity between LSTG and LAIns is correlated with lexical retrieval in spontaneous speech (Chai et al., 2016). In addition to its role in vocal–motor integration and speech, the AIns is part of the classic limbic system and is implicated in the quartet theory of emotions due to its importance in interoception and emotional regulation (Koelsch et al., 2015). Given these diverse roles of AIns in the auditory–motor system, the present finding of increased tract volume between left AIns and LSTG in controls who experience high musical reward may relate to auditory–motor behavior especially as it applies to vocal–motor behavior. This auditory–insula connectivity may be related to the differentiation of vocalization repertoire as posited in the MOM theory (Altenmüller et al., 2013a). The MOM theory states that differentiation of vocalization repertoire, as driven by chill experiences, led to the capacity for fine-grained rhythmic–melodic discrimination. In our evolutionary history, it is possible that individuals with high LSTG\_LAIns connectivity, who were highly reward-sensitive to music (e.g., frequently experiencing chills in response to music), then went on to acquire fine-grained auditory discrimination skills, which then gave rise to language and music.
Since the AIns is an evolutionarily older part of the brain than its neighbor the inferior frontal gyrus (which is a classic endpoint of the arcuate fasciculus) (Galaburda and Pandya, 1982; Semendeferi and Damasio, 2000), the LSTG\_LAIns connection could have predated the arcuate fasciculus, thus serving as a pathway for the differentiation of vocalization response that preceded the hypothesized bifurcation of auditory information into music and language (Mithen, 2007).

Superior temporal gyrus connections to NAcc and mPFC may include the arcuate as well as the uncinate fasciculus, the latter being part of the auditory ventral pathway that connects the temporal and frontal lobes (Wakana et al., 2004) and is involved in processing local syntactic structures (Friederici, 2009). mPFC is also part of the default mode network and is involved in social, self-referential, and emotional processing (Fox et al., 2005; Mason et al., 2007; Jenkins and Mitchell, 2010; Kim and Johnson, 2015). As the mPFC is a waystation of the dopaminergic system that probably emerged later in evolution (Galaburda and Pandya, 1982; Semendeferi and Damasio, 2000), the finding that connections to it correlate with musical reward suggests a further involvement of an evolutionarily younger part of the dopaminergic system in music processing beyond the NAcc. Interestingly, while the LSTG\_LAIns and LSTG\_LMPFC tracts show positive bivariate as well as significantly positive partial correlations to music reward, the RSTG\_RNAcc tracts show no significant bivariate correlation with music reward, but a significant negative partial correlation after partialling out the effects of the other predictors. This is especially intriguing when considered alongside data from the musical anhedonic subject: BW had a lower volume but higher FA in LSTG\_LNAcc; highly hedonic controls had lower volume in RSTG\_RNAcc. Together these results suggest that auditory access to the mesolimbic pathway is hemispherically asymmetric, with normal variations in reward sensitivity occurring on the right but abnormal lack of reward on the left. This is consistent with hemispheric asymmetry to attractive vs. aversive stimuli in animals, but only in learned responses (Besson and Louilot, 1995; Molochnikov and Cohen, 2014). In light of the MOM theory, which posits that chill responses were initially a reward to novel auditory patterns prior to their driving of differentiated vocalization repertoire as discussed above, the present findings link the STG\_NAcc pathway to this very early step in the evolution of music.

#### TABLE 2 | Predicting musical anhedonic from control data.

BW's music reward score was not successfully predicted (103% error) from the overall regression model of DTI predictors. In contrast, BW's tract volume of RSTG\_RMPFC, LSTG\_LAIns, and RSTG\_RNAcc tracts was successfully predicted (10% error) from bivariate correlations between music reward and tract volume. <sup>a</sup>[(Actual−Prediction)/Actual].
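The error measure given in the note to Table 2, (Actual − Prediction)/Actual, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `relative_error` and all numeric values are hypothetical and are not taken from the study, and whether the table reports the signed value or its magnitude is not stated in the note.

```python
def relative_error(actual, prediction):
    """Signed relative prediction error: (Actual - Prediction) / Actual."""
    return (actual - prediction) / actual


# Hypothetical values for illustration only; these are not the study's data.
# A prediction close to the observed value gives a small error.
print(f"{relative_error(50.0, 45.0):.0%}")  # prints 10%

# A prediction more than twice the observed value gives an error above 100%
# in magnitude; the sign shows that the prediction overshoots.
print(f"{relative_error(40.0, 81.2):.0%}")  # prints -103%
```

Read this way, an error near 10% means the prediction fell close to the observed value, whereas an error above 100% means the prediction missed by more than the size of the observed value itself.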

While this study cannot tease apart when or how these individual differences emerged, the pattern of results can be considered in the context of known steps in brain evolution as well as development, which together provide support for the MOM theory. Our rare case of musical anhedonia possesses a different pattern of white matter pathways between auditory regions and reward-sensitive regions, possibly due to abnormal neuronal migration in utero or early in development. In the multiple regression analysis to predict musical reward scores from diffusion measures, we tested pairwise connections between regions in the auditory and reward networks, which necessarily resulted in an elevated number of statistical comparisons. The brain–behavior relationships within the control group are only significant at the uncorrected level. Thus, although the current results are interesting, they should be interpreted cautiously until further verification. Nevertheless, the FA difference between BW and the controls in the LSTG–NAcc tract survives correction for comparisons across the 10 tested tracts; this gives us higher confidence in a structural difference between auditory and reward areas that is linked to musical anhedonia.
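The distinction drawn above between results that are significant only at the uncorrected level and a result that survives correction across the 10 tested tracts can be illustrated with a Bonferroni adjustment, the correction named earlier for the FA comparison. The sketch below is illustrative only: the function names and the example p-values are hypothetical, and only the 0.05 level and the count of 10 tracts come from the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


def survives_bonferroni(p_value, alpha=0.05, n_tests=10):
    """True if p_value stays significant after correcting for n_tests."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# Hypothetical p-values for illustration only; these are not the study's results.
p_uncorrected_only = 0.03   # below 0.05, but above 0.05 / 10
p_corrected = 0.001         # below 0.05 / 10

print(survives_bonferroni(p_uncorrected_only))  # prints False
print(survives_bonferroni(p_corrected))         # prints True
```

With 10 comparisons, a nominal level of 0.05 becomes a per-test threshold of about 0.005, which is why a finding can be significant before correction and still fail to survive it.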

A remaining question concerns whether musical anhedonia is likely to be a spectrum disorder. The answer to this question depends on how we define musical anhedonia. Considering that the BMRQ is for now the only diagnostic tool explicitly in use to identify musical anhedonia, and it yields a continuum of scores when administered to a large population (Mas-Herrero et al., 2013), the lack of musical reward appears to be continuously distributed. On the other hand, if we define musical anhedonia by self-identification of a socially debilitating lack of reward experiences specific to music, then it might not be a spectrum. However, defining musical anhedonia by self-identification would mean that identification depends upon the subject's awareness of their own condition, which would in turn depend on their social environment. For instance, if BW had not heard about musical anhedonia, or if he lived in an environment where music was less celebrated, then he might not have become aware of his condition. Thus, large-scale testing of musical reward sensitivity across different cultures may be helpful for future definitions of cultural norms against which we define musical anhedonia.

Results show that musical anhedonia is related to different patterns of connectivity from auditory to emotion and reward centers of the brain. This auditory access to the reward system informs the evolutionary basis of music: perhaps music evolved as a direct auditory pathway toward social and emotional reward centers in the brain.

With regard to the shared evolutionary basis of music with language, it is worth noting that in contrast to music, language does not seem to achieve the same set of evolutionary functions; that is, although language and music both involve connectivity between auditory, motor, and cognitive systems, language has more direct and specific sound-to-meaning mappings, but music more readily establishes aesthetic or emotional connections such as chills (Silvia and Nusbaum, 2011). Thus, language and music may have shared evolutionary origins as a protolanguage (Mithen, 2007), but their divergence led to different evolutionary functions and outcomes.

Successful musical communication depends on an auditory channel through which reward and emotional areas can be accessed. This is consistent with the view of music as having mixed origins, which posits that music evolved from evolutionarily ancient chill reactions to affiliative sounds (Altenmüller et al., 2013b) that then transform the mind (Patel, 2008). Evolutionarily, the emotional content of sound might have accessed these auditory–reward pathways, which then predisposed the brain toward developing reward sensitivity and thus the need for successful emotional communication. In that regard, results suggest that other species that have connectivity between auditory and reward systems would also be able to enjoy music given the appropriate exposure.

Previous work on congenital amusia has been discussed in terms of its implications for the evolution of music (Patel, 2008); in particular, white matter connectivity in congenital amusia supports the hypothesis of a shared basis of music and language (Loui et al., 2009; Loui, 2015). Similarly, white matter connectivity in musical anhedonia informs the evolutionary basis of music in emotion. While reward pathways and auditory perception–action pathways are conventionally seen as separate and dissociable systems in the brain, the present study suggests that they operate in concert, and that this concert of brain systems may be important for the evolution of music: in fact, they may provide support for the MOM as tools to transform the mind (Kleinman, 2015).

Individual differences in structural connectivity between the auditory and reward networks likely represent normal variation in musical reward sensitivity, with some additional patterns that give rise to extreme cases such as musical anhedonia. While increased connectivity between auditory and reward networks is indicative of intense emotional responses to music such as frissons (Harrison and Loui, 2014; Sachs et al., 2016), decreased volume coupled with increased myelination or coherence between specific nodes of these networks reflects the striking lack of specific emotional responses as observed in musical anhedonia. By distinguishing between common variations and rare extremes in individual differences in musical reward sensitivity, the present study attempts to extend the MOM theory by identifying distinct neural pathways through which music might operate as an affective signaling system.

## AUTHOR CONTRIBUTIONS

PL conceptualized the idea, designed the study, and wrote the first draft of the manuscript. SP, MS, and EP acquired and analyzed the data. YL and TZ contributed DTI data analyses. All authors revised the manuscript and approved the submission.

## FUNDING

This work was supported by the Imagination Institute, Grammy Foundation, and NSF STTR grant 1720698 to PL.

## ACKNOWLEDGMENT

We thank BW and our control subjects for participating in the study.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Loui, Patterson, Sachs, Leung, Zeng and Przysinda. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Evolutionary Musicology Meets Embodied Cognition: Biocultural Coevolution and the Enactive Origins of Human Musicality

#### Dylan van der Schyff<sup>1,2</sup>\* and Andrea Schiavio<sup>3,4,5</sup>

*<sup>1</sup> Faculty of Education, Simon Fraser University, Burnaby, BC, Canada, <sup>2</sup> Faculty of Music, University of Oxford, Oxford, United Kingdom, <sup>3</sup> Institute for Music Education, University of Music and Performing Arts, Graz, Austria, <sup>4</sup> Department of Music, The University of Sheffield, Sheffield, United Kingdom, <sup>5</sup> Centre for Systematic Musicology, University of Graz, Graz, Austria*

#### Edited by:

*Aleksey Nikolsky, Braavo! Enterprises, United States*

#### Reviewed by:

*L. Robert Slevc, University of Maryland, College Park, United States*

*Tom Froese, National Autonomous University of Mexico, Mexico*

\*Correspondence:

*Dylan van der Schyff dbvanderschyff@gmail.com*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

Received: *23 February 2017* Accepted: *04 September 2017* Published: *29 September 2017*

#### Citation:

*van der Schyff D and Schiavio A (2017) Evolutionary Musicology Meets Embodied Cognition: Biocultural Coevolution and the Enactive Origins of Human Musicality. Front. Neurosci. 11:519. doi: 10.3389/fnins.2017.00519*

Despite evolutionary musicology's interdisciplinary nature, and the diverse methods it employs, the field has nevertheless tended to divide into two main positions. Some argue that music should be understood as a naturally selected adaptation, while others claim that music is a product of culture with little or no relevance for the survival of the species. We review these arguments, suggesting that while interesting and well-reasoned positions have been offered on both sides of the debate, the nature-or-culture (or adaptation vs. non-adaptation) assumptions that have traditionally driven the discussion have resulted in a problematic *either/or* dichotomy. We then consider an alternative "biocultural" proposal that appears to offer a way forward. As we discuss, this approach draws on a range of research in theoretical biology, archeology, neuroscience, embodied and ecological cognition, and dynamical systems theory (DST), positing a more integrated model that sees biological and cultural dimensions as aspects of the same evolving system. Following this, we outline the enactive approach to cognition, discussing the ways it aligns with the biocultural perspective. Put simply, the enactive approach posits a deep continuity between mind and life, where cognitive processes are explored in terms of how self-organizing living systems enact relationships with the environment that are relevant to their survival and well-being. It highlights the embodied and ecologically situated nature of living agents, as well as the active role they play in their own developmental processes. Importantly, the enactive approach sees cognitive and evolutionary processes as driven by a range of interacting factors, including the socio-cultural forms of activity that characterize the lives of more complex creatures such as ourselves. We offer some suggestions for how this approach might enhance and extend the biocultural model. To conclude we briefly consider the implications of this approach for practical areas such as music education.

Keywords: origins of music, biocultural coevolution, music cognition, enactive cognition, dynamical systems theory

## INTRODUCTION

The debate over the origins and meaning of music for the human animal is one of the most fascinating areas of inquiry across the sciences and humanities. Despite the diversity of perspectives on offer, however, this field has traditionally been guided by approaches that see adaptation by natural selection as the central mechanism driving evolutionary processes (Huron, 2001; for a discussion see Tomlinson, 2015). This extends to the brain, which is often understood as a computing machine that evolved to solve the kinds of problems faced by our prehistoric ancestors in their everyday lives (see Anderson, 2014). Importantly, this "adaptationist" orientation posits a rather strict separation between the products of natural selection (i.e., adaptations) and those of culture. Because of this, evolutionary musicologists have often been faced with something of a dichotomy: Music tends to be seen either as a naturally selected adaptation that has contributed directly to our survival as a species, or as a product of culture with little or no direct connection to our biological heritage (see van der Schyff, 2013a; Tomlinson, 2015; Killin, 2016a, 2017). Various arguments have emerged in support of each position (more on this below; see Pinker, 1997; Huron, 2001; Mithen, 2005; Patel, 2008; Honing et al., 2015). Moreover, the influence of the computational model of mind has tended to focus research and theory in music cognition toward a complex information-processing hierarchy limited to the brain (Sloboda, 1985; Deutsch, 1999; Huron, 2006; Levitin, 2006). This is sometimes discussed in terms of discrete cognitive modules that have been naturally selected to perform specific tasks related to the survival of the species (Fodor, 1983; Pinker, 1997; Coltheart, 1999), leading some scholars to postulate 1:1 mappings between anatomical brain regions and musical functions (Peretz and Coltheart, 2003; cf. Altenmüller, 2001). 
While this research has indeed produced a number of important insights, it has arguably tended to downplay the role of the environmentally situated body in the development of musicality as a cognitive domain (see Clarke, 2005; Johnson, 2007).

In recent years, new perspectives have emerged that place more focus on the embodied, ecological, and dynamical dimensions of musical cognition (e.g., Borgo, 2005; Clarke, 2005, 2012; Reybrouck, 2005, 2013; Leman, 2007; Jones, 2009; Krueger, 2013; Maes et al., 2014; Moran, 2014; Laroche and Kaddouch, 2015; Godøy et al., 2016; Schiavio and van der Schyff, 2016; Schiavio et al., 2016; Lesaffre et al., 2017). Recent research has also tended to weaken the modular hypothesis by emphasizing the plastic and self-organizing properties of the (musical) brain (Jäncke et al., 2001; Pantev et al., 2001; Münte et al., 2002; Lappe et al., 2008; Large et al., 2016). The past two decades have also seen the development of a "biocultural" hypothesis for the origins and nature of the musical mind that looks beyond the traditional nature-culture dichotomy (Cross, 1999, 2003; Killin, 2013, 2016a,b, 2017; van der Schyff, 2013a,b; Tomlinson, 2015). This approach draws on a range of research in theoretical biology, neuroscience, embodied and ecological cognition, and dynamical systems theory (DST), positing a more integrated model that sees biological and cultural dimensions as aspects of the same evolving system. Here the origin of music is not understood within a strict adaptationist framework. Rather, it is explained as an emergent phenomenon involving cycles of (embodied) interactivity with the social and material environment.

Our aim in the present article is to contribute to the theoretical discussion supporting the biocultural hypothesis by considering it through the lens of the enactive approach to cognition. This perspective first emerged in the work of Varela et al. (1991) and has been developed more recently across a range of contexts (Thompson, 2007; Stewart et al., 2010; Colombetti, 2014; Di Paolo et al., 2017). Most centrally, the enactive approach posits a deep continuity between mind and life, where cognitive processes are explored in terms of how self-organizing living systems enact relationships with the environment that are relevant to their survival and well-being. It highlights the embodied and ecologically situated nature of living agents, as well as the active role they play in their own developmental processes. Importantly, the enactive approach sees cognitive and evolutionary processes as driven by a range of interacting factors, including the sociocultural forms of activity that characterize the lives of more complex creatures such as ourselves (Malafouris, 2008, 2013, 2015). We suggest, therefore, that it may help to extend the biocultural hypothesis in various ways.

We begin by providing a brief overview of some key positions in the field of evolutionary musicology, discussing how many tend to adhere to the "nature-or-culture" dichotomy mentioned above. We then outline the biocultural hypothesis, reviewing supporting research and theory in theoretical biology, neuroscience, and ecological and embodied cognition. Here we place a special focus on Tomlinson's (2015) approach as, for us, it represents the current state of the art in the field. While we are largely in agreement with his position, we suggest that future work could benefit from exploring a wider range of perspectives in embodied-ecological cognition. With this in mind, we then discuss the enactive approach and consider how it might enhance the biocultural perspective. More specifically, we suggest that the enactive view could offer theoretical support and refinement to Tomlinson's claim that the origins of the musical mind should be sought in the embodied dynamics of coordinated action that occurred within the developing socio-material environments of our ancestors, and not first in terms of cognitive processes involving (quasi-linguistic) representational mental content. Following this, we consider how the recently emerged 4E approach, which sees cognition as embodied, embedded, enactive, and extended, aligns with the biocultural perspective, offering some tentative possibilities for how this framework might guide future research associated with the biocultural approach. To conclude we briefly consider the implications this perspective may have for thought and action in practical musical contexts (e.g., music education).
Before we begin, we would also like to note that although the enactive approach is being explored across several disciplines (see Stewart et al., 2010), it has only recently been adopted in musical contexts (Borgo, 2005; Silverman, 2012; Krueger, 2013, 2014; Matyja and Schiavio, 2013; Elliott and Silverman, 2015; Loaiza, 2016; Schiavio et al., 2016). Therefore, this article may also contribute to the development of the enactive perspective for musical research and theory more generally.

## EVOLUTIONARY MUSICOLOGY AND THE DICHOTOMY OF ADAPTATION

An important point of discussion in evolutionary musicology concerns whether musicality can be considered as a bona fide adaptation, or if it is better understood as a product of culture (Huron, 2001; Davies, 2012; van der Schyff, 2013a; Lawson, 2014; Honing et al., 2015; Killin, 2016a, 2017). Some researchers (including Darwin, 1871) have drawn on comparisons with music-like behavior in other animals, suggesting an adaptive function for music in mate selection and territorial display in our prehistoric ancestors (see Miller, 2000). It has been argued, however, that although music-like behavior in nonhuman animals (e.g., bird song) may well be a product of natural selection, these traits are not homologous with human music making, but rather are analogous (Pinker, 1997; Hauser and McDermott, 2003). Because of this, it is claimed that comparative studies involving more phylogenetically distant species may not have great relevance for understanding the biological origins of human musicality (McDermott and Hauser, 2005; but see Fitch, 2006). Additionally, evidence of "musical" behaviors in our closest primate relatives is often understood to be sparse. For some scholars, this suggests there was no properly musical phenotype prior to modern humans in the hominin line (Huron, 2001; Justus and Hutsler, 2005; Patel, 2008).

Such arguments have been used to support claims that music should not be conceived of as an adaptation, but rather as a product of culture (e.g., Sperber, 1996; Pinker, 1997). Here it is posited that music is dependent on cognitive structures (e.g., modules) and abilities that evolved to support properly adaptive functions in our ancestors (e.g., language, auditory scene analysis, habitat selection, emotion, and motor control; for a discussion see Trainor, 2015). Perhaps the strongest version of this approach is found in Pinker (1997), who argues that music is an "invention" designed to "tickle" these naturally selected aspects of our cognitive and biological nature. Music itself, however, has no adaptive meaning: From an evolutionary point of view, it is the auditory equivalent of "cheesecake": a cultural invention that is pleasurable, but biologically useless. In line with this, it is suggested that music might be a kind of exaptation, where the original (i.e., adapted) function of a trait becomes co-opted to serve other purposes<sup>1</sup> (Davies, 2012). Thus, as Sperber (1996) posits, music may be understood as "parasitic on a cognitive module the proper domain of which pre-existed music and had nothing to do with it" (p. 142).

By contrast, other researchers have suggested the existence of cognitive modules that appear to be specialized for musical functions. For example, Peretz's (1993, 2006, 2012) research in acquired amusia has led her to (cautiously) posit an innate music-specific module for pitch processing, suggesting that music may be as "natural" as language (Peretz, 2006). Such claims are countered by Patel (2008), who argues that evidence indicating the existence of adapted music-specific modules may in fact be explained by (ontogenetic) developmental processes, whereby cortical areas become specialized for certain functions through experience (e.g., via processes of "progressive modularization"; see Karmiloff-Smith, 1992). However, while Patel (2008, 2010) maintains that musicality in humans is not a "direct target" of natural selection, he also acknowledges the profound biological and social benefits associated with musical activity, claiming that music is a powerful "transformative technology of the mind" (Patel, 2008, p. 400–401). Here Patel discusses how musical experience may lead to long-lasting changes in brain structure and processing (e.g., through neuroendocrine effects). Interestingly, he also notes that the phenomenon of infant babbling, the anatomy of the human vocal tract, and the fixation of the FOXP2 gene, might be indicative of adaptations that originally supported both language and vocal music (Patel, 2008, p. 371–372). However, he suggests that because language appears to emerge more quickly and uniformly in humans, and because the lack of musical ability does not appear to entail significant biological costs, these factors are better understood to support the adaptive status of language. In brief, he posits that musical processing is a "by-product" of cognitive mechanisms selected for language and other forms of complex vocal learning (see also Patel, 2006, 2010, 2012).

These last claims are questioned by those who argue that they may reflect a rather narrow perspective on what musicality entails, e.g., the assumption that musical activity necessarily requires special forms of training, or that music is a pleasure product to be consumed at concerts or through recordings (for discussions see Small, 1999; Cross, 2003, 2010; van der Schyff, 2013a,b; Honing et al., 2015). With regard to this point, ethnomusicological and sociological research has revealed musical activity around the world to be central for human well-being: it is inextricable from work, play, social life, religion, ritual, politics, healing, and more (Blacking, 1973, 1995; Nettl, 1983, 2000; DeNora, 2000). Moreover, in many cultural environments music is highly improvisational in character, and the acquisition of musical skills begins in infancy and develops rapidly, often without the need for formal instruction (Blacking, 1973; Cross, 2003; Solis and Nettl, 2009). It has also been suggested that because certain physical and cognitive deficits need not hinder survival and well-being in modern Western society, certain "musical" impairments may go almost completely unnoticed (van der Schyff, 2013a). Likewise, music's relevance for human survival across evolutionary time has been considered in terms of its importance for bonding between infants and primary caregivers, and between members of social groups (Benzon, 2001; Tolbert, 2001; Dissanayake, 2010; Dunbar, 2012). Musical developmental processes appear to begin very early on in life (Parncutt, 2006) and researchers have demonstrated the universal and seemingly intuitive way caregivers create musical (or music-like) environments for infants through prosodic speech and lullabies (Dissanayake, 2000; Trehub, 2003; Falk, 2004). Along these lines, Trevarthen (2002) has proposed that humans possess an in-born "communicative musicality" that serves the necessity for embodied inter-subjectivity in highly social beings such as ourselves (see also Malloch and Trevarthen, 2010).

<sup>1</sup>The term "exaptation" refers to changes in the function of a given physiological or behavioral trait in the process of the biological evolution of an organism. The classic example is bird feathers, which originally evolved for thermoregulation, but were later co-opted for mating-territorial display, catching insects, and then flight. The developmental systems approach discussed below complicates the causal relation of adaptations and exaptations. Here they stand not in a linear sequence, but rather in a cyclical relationship, where the new uses of an adaptation associated with the exaptation may lead to secondary adaptations and so on (see Gould and Vrba, 1982; Anderson, 2007). Referring to the relationship between adaptations and exaptations, Tomlinson (2015) writes, "the first are not necessarily prior to the second, since behaviors originating as exaptations might alter selective pressures in ways leading to new adaptations" (p. 36).

In all, it is argued that the wide range of activities associated with the word "music" may have immediate and far-reaching implications for survival and socialization for many peoples of the world, as it may have had for our prehistoric ancestors (see Blacking, 1973; Mithen, 2005). And indeed, the archeological record shows evidence of musical activity (i.e., bone flutes) dating back at least 40,000 years (Higham et al., 2012; Morley, 2013). Such concerns drive the "musilanguage" theory put forward by Mithen (2005) and others (Brown, 2000; Lawson, 2014), where both music and language are understood to have developed from a "proto-musical ancestor" that evolved due to selective pressures favoring more complex forms of social behavior, e.g., enhanced types of communication associated with foraging and hunting, mate competition, increased periods of child rearing (soothing at a distance), and more complex forms of coordinated group activity (Dunbar, 1996, 2003, 2012; Cross, 1999, 2003; Falk, 2000, 2004; Balter, 2004; Bannan, 2012). Here it is also suggested that musical behavior may have contributed to the development of shared intentionality and Theory of Mind (ToM) in modern humans, which in turn permitted the rapid development of cultural evolution and the emergence of modern human cognition (Tomasello, 1999; Tomasello et al., 2005).

## THE BIOCULTURAL HYPOTHESIS

Thus far, we have offered only a brief outline of some of the main positions in the discussion over the status of music in human evolution. We would like to suggest, however, that although many important and well-reasoned accounts have emerged on both sides of the debate, the nature-or-culture perspective that appears to frame this discussion renders both sides somewhat problematic. On one hand, arguing that music is primarily a product of culture may tend to downplay its deep significance for human well-being, as well as the rather rapid and intuitive ways it develops in many cultural contexts. Indeed, as we have just considered, these manifold developmental and social factors are taken to be indicative of the biological relevance of music for the human animal. On the other hand, arguments for music as an adaptation (e.g., Mithen, 2005; Lawson, 2014) often tend to posit a singular adaptive status for what is in fact a complex phenomenon that spans a wide range of biological, social, and cultural dimensions (Tomlinson, 2015).

In line with such concerns, other scholars (Cross, 1999, 2001, 2003; Killin, 2013, 2016a; van der Schyff, 2013a,b; Currie and Killin, 2016) have offered alternative "biocultural" approaches to the nature and origins of human musicality. Here the question of whether biology or culture should account for deeply social and universal human activities that require complex cognitive functions (e.g., music) is replaced by a perspective that integrates the two. For example, Cross (1999) suggests that musicality is an emergent activity (or "cognitive capacity") that arises from a more fundamental human proclivity to search for relevance and meaning in our interactions with the world. It is claimed that because of its "multiple potential meanings" and "floating intentionality" music provides a means by which social activity may be explored in a "risk free" environment, affording the development of competencies between different domains of embodied experience and the (co)creation of meaning and culture (Cross, 1999, 2003). Tomlinson (2015) develops similar insights, arguing that what we now refer to as "language" and "music" began with more basic forms of coordinated socio-cultural activity that incrementally developed into more sophisticated patterns of thought, activity, and communication (see also Morley, 2013). Moreover, through recursive cycles of feedback and feedforward effects, such activities are understood to have transformed environmental niches over time (Sterelny, 2014; Killin, 2016a, 2017) and with them the behavioral possibilities (affordances) of the hominins who inhabited them.

In all, this orientation suggests a way through the traditional nature-or-culture dichotomy discussed above. In doing so, however, it necessarily draws on models of evolution and cognition that differ from those that have traditionally guided evolutionary musicology. In line with this, Tomlinson's (2015) approach develops Neo-Peircean perspectives in semiotics (e.g., Deacon, 1997, 2010, 2012), exploring how embodied and indexical forms of communication may in fact underpin our linguistic and musical abilities both in evolutionary and ontogenetic terms. As we discuss below, this is further supported by work in theoretical biology associated with developmental systems theory, studies of musical and social entrainment (rhythm and mimesis), and insights from ecological psychology and embodied cognition.

## Looking Beyond Adaptation

Tomlinson (2015) argues that although music-as-adaptation perspectives all reveal important aspects of why music is meaningful for the human animal, they are also problematic when they tend to assume a "unilateral explanation for a manifold phenomenon" (p. 33; see also Killin, 2016a). That is, because music takes on so many forms, involves such a wide range of behavior, and serves so many functions, it seems difficult to specify a single selective environment for it. And thus, these traits sit "uneasily side by side, their interrelation left unspecified" (p. 33). To be clear, this does not in any way negate the claims regarding the social and developmental meanings of music. These biologically relevant traits do exist, but they are just too numerous and complex to be properly described in terms of an adaptation (at least not in the orthodox sense of the term). Because of this, Tomlinson (2015) claims that we must be careful about how we frame evolutionary questions (especially those regarding complex behaviors such as music and language) lest we fall into the reductive theorizing associated with "adaptationist fundamentalism." He thus argues that dwelling on the question of the adaptive status of music has had the effect of "focusing our sights too narrowly on the question of natural selection alone—and usually a threadbare theorizing of it, at that" (p. 34).

With this in mind, the developmental systems approach to biological evolution posits a useful alternative perspective (see Oyama et al., 2001). In contrast to the one-directional schema that characterizes more traditional frameworks (where evolution is understood to involve adaptation to a given environment), developmental systems theory presents a more recursive and relational view, where organism and environment are understood as mutually influencing aspects of the same integrated system. Here evolutionary processes do not entail the adaptation of a species' phenotype to a fixed terrain, but rather "a dynamic interaction where other species and the non-living environment take part" (Tomlinson, 2015, p. 35). In other words, this approach explores the complex ways genes, organisms, and environmental factors (including behavior and socio-cultural experience) interact with each other in guiding the formation of phenotypes and the construction of environmental niches (Moore, 2003; Jablonka and Lamb, 2005; Richerson and Boyd, 2005; Malafouris, 2008, 2013, 2015; Laland et al., 2010; Sterelny, 2014). As such, it eschews the classic nature-nurture dichotomy, preferring instead to examine the interaction between organism and environment as a recursive or "dialectical" phenomenon (Lewontin et al., 1984; Pigliucci, 2001), where no single unit or mechanism is sufficient to explain all processes involved.

Importantly, the organism is understood here to play an active role in shaping the environment it coevolves with: its activities feed back into and alter the selective pressures of the environmental niche. This, in turn, affects the development of the organism, resulting in a co-evolutionary cycle that proceeds in an ongoing way. Socio-cultural developments add additional epicycles involving patterns of behavior that can sometimes hold stable over long periods of time (see **Figure 1**). These are passed on inter- and intra-generationally through embodied mimetic processes (more on this below; see also Sterelny, 2012). While such epicycles necessarily emerge from the coevolution cycle, they may, once established, develop into self-sustaining patterns of behavior that develop relatively independently. However, the effects of these cultural epicycles may feed forward into the broader coevolutionary system, resulting in additional alterations to environmental conditions and shifts in biological configurations (e.g., gene expression and morphological changes; see Wrangham, 2009; Laland et al., 2010; Skinner et al., 2015; Killin, 2016a).

The making and use of tools is offered as a primary example of what such cultural epicycles might entail (Tomlinson, 2015). The archeological record contains many examples of bi-face stone hand axes that were made by our Paleolithic ancestors. These tools are remarkably consistent in their functional and aesthetic qualities, implying method and planning in their manufacture (Wynn, 1996, 2002). However, it is now thought that the production of these axes entailed a "bottom up" process based on the morphology and motor-possibilities of the body, unplanned emotional-mimetic social interaction, and the affordances of the environment (Gamble, 1999; Davidson, 2002). In other words, it is argued that the emergence of Paleolithic technologies did not involve abstract or representational forms of thought (e.g., a mental template, or "top down" thinking)—a capacity these early toolmakers did not possess (but see Killin, 2016b, 2017). Nor were they the result of genetically determined developmental programs. Rather, they are thought to have originated, developed, and stabilized primarily through the dynamic interaction between living systems and the material environments they inhabited and shaped (Ingold, 1999). It is suggested that such self-organizing forms of social-technological behavior provided the grounding from which more complex cultural activities like music emerged much later (Tomlinson, 2015). To better understand how this could be so, we now consider the mimetic nature of these pre-human social environments, and how this may give clues to the origins of music in coordinated rhythmic behavior.

## Mimesis, Entrainment, and the Origins of Music in Rhythm

In social animals, attention tends to be turned "outwards" toward the world and the activities of others (McGrath and Kelly, 1986). This entails the capacity to observe, understand, and emulate the actions of conspecifics. It is suggested that in our Paleolithic ancestors these mimetic processes allowed increasingly complex chains of actions to be passed on from one individual or generation to the next (Leroi-Gourhan, 1964/1993; Gamble, 1999; Ingold, 1999). This involved the enactment of culturally embedded "action loops" (see Donald, 2001; Tomlinson, 2015) that depended on a basic proclivity for forms of social entrainment.

The phenomenon of entrainment may be observed in many ways and over various timescales in both biological and nonbiological contexts (de Landa, 1992; Clayton et al., 2005; Becker, 2011; Knight et al., 2017). Most fundamentally, it is understood in terms of the tendency for oscillating systems to synchronize with each other<sup>2</sup>. Accordingly, biological and social systems can be conceived of as dynamically interconnected systems of oscillating components (from metabolic cycles to life cycles, from single neuron firing to regional patterns of activity in the brain, from individual organisms to social groups and the broader biological and cognitive ecology; McGrath and Kelly, 1986; Oyama et al., 2001; Varela et al., 2001; Ward, 2003; Chemero, 2009). Importantly, the components of such systems influence each other in a non-linear or recursive way. As such, organism and environment are not separate domains, but rather aspects of "one non-decomposable system" that evolves over time (Chemero, 2009, p. 26). Moreover, the development of coupled systems is guided by local and global constraints that allow the system to maintain stability, that is, to be resistant to perturbations or to regain stability once a perturbation has occurred. This is, of course, crucial for living systems, which must maintain metabolic functioning within certain parameters if they are to survive.
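As a purely illustrative aside, the tendency of oscillating systems to synchronize can be sketched numerically. The example below is not drawn from the works cited here: it assumes a standard Kuramoto-style model of two mutually coupled phase oscillators, and the function names, natural frequencies, and coupling strengths are hypothetical choices made only for the illustration.

```python
import math


def simulate_entrainment(freq_a=1.0, freq_b=1.3, coupling=0.5, dt=0.01, steps=5000):
    """Euler-integrate two mutually coupled phase oscillators (assumed Kuramoto form).

    Returns the phase difference (wrapped to [-pi, pi]) after every step.
    """
    phase_a, phase_b = 0.0, 2.0  # arbitrary starting phases (assumption)
    differences = []
    for _ in range(steps):
        # Each oscillator runs at its own frequency but is pulled toward the other.
        rate_a = freq_a + coupling * math.sin(phase_b - phase_a)
        rate_b = freq_b + coupling * math.sin(phase_a - phase_b)
        phase_a += rate_a * dt
        phase_b += rate_b * dt
        raw = phase_b - phase_a
        differences.append(math.atan2(math.sin(raw), math.cos(raw)))
    return differences


def is_phase_locked(differences, tail=500, tolerance=1e-3):
    """Treat the pair as entrained if the phase difference stops changing."""
    recent = differences[-tail:]
    return max(recent) - min(recent) < tolerance


coupled = simulate_entrainment(coupling=0.5)
uncoupled = simulate_entrainment(coupling=0.0)
print("coupled pair entrained:", is_phase_locked(coupled))
print("uncoupled pair entrained:", is_phase_locked(uncoupled))
```

In this toy model the two oscillators settle into a constant phase relationship only when the coupling is strong relative to the difference between their natural frequencies; without coupling, the phase difference drifts indefinitely.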

<sup>2</sup>A simple example of this is found in the way wall mounted pendulums mutually constrain one another, resulting in synchronization or "entrainment" over time (see Clark, 2001).

Such self-organizing processes result in "emergent properties": relationships, structures, and patterns of behavior that may remain consistent over long temporal periods, or that may be subject to transformation due to shifts in local and global constraints of the system. The mathematical techniques associated with dynamical systems theory (DST) have aided researchers in modeling such phenomena. Here patterns of convergence (stability) in the state of the system are contrasted with areas exhibiting entropy (instability; de Landa, 1992). This is often represented as a topographic "phase-space" that describes the possible states of a given system over time; periods of convergence in the trajectories of the system are represented as "basins of attraction" (Abraham and Shaw, 1985; Chemero, 2009). A "phase transition" occurs when new patterns of convergence arise (i.e., new attractor layouts). Researchers associated with developmental systems theory (above) use DST methods to model the evolutionary trajectories of coupled organism-environment systems, mapping dynamic patterns of stability and change as functions of constraint parameters (see Oyama et al., 2001).
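The notions of attractor layout and phase transition can be made concrete with a toy one-dimensional system. The sketch below assumes the textbook equation dx/dt = r·x − x³, which is not taken from the works cited here; `r` stands in for a single constraint parameter, and `settle` and `attractor_layout` are hypothetical helper names.

```python
def settle(x0, r, dt=0.01, steps=20000):
    """Follow dx/dt = r*x - x**3 from the state x0 (Euler steps) and return the final state."""
    x = x0
    for _ in range(steps):
        x += (r * x - x ** 3) * dt
    return x


def attractor_layout(r, starts=(-2.0, -0.5, 0.5, 2.0)):
    """Map each starting state to the (rounded) state it converges to."""
    return {x0: round(settle(x0, r), 3) for x0 in starts}


# Below the critical value of r, every trajectory converges on one attractor.
print(attractor_layout(-1.0))
# Above it, the layout changes: two attractors, each with its own basin.
print(attractor_layout(1.0))
```

For negative `r` all starting states fall into a single basin of attraction around zero. Once `r` becomes positive, that state loses its stability and two new attractors appear, with negative and positive starting states falling into separate basins. This is one minimal sense in which a shift in a constraint parameter produces a new attractor layout.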

DST is also used to examine how social animals bring their actions in line with those of other agents (and with other exogenous factors) by "dynamically attending" to the environment through sight, sound, movement, and touch (McGrath and Kelly, 1986; Large and Jones, 1999). This results in the enactment of coordinated forms of behavior that can occur both voluntarily and involuntarily. Emotional-affective aspects may also come into play here. For example, when a stable pattern is disrupted, entropy emerges in the system and a negative affect may result. The (living) system then self-organizes toward regaining stability, resulting in a positive affect. It is suggested that the action loops associated with Paleolithic toolmaking emerged from these forms of social entrainment where dynamic couplings between various trajectories in the social environment led to increasingly stable patterns of behavior (basins of attraction) in the cultural epicycle. This permitted the mimetic transmission of cultural knowledge without the need for symbols, referentiality, or representation (see Tomlinson, 2015, p. 75).

Interestingly, the idea of dynamic attending has been explored empirically in the context of musical (i.e., metrical, rhythmic) entrainment (Large and Jones, 1999; Jones, 2009; Large et al., 2015). Tomlinson (2015) suggests that such dynamical models may help to reveal the distant origins of musical rhythm in the mimetic, emotional, and sonic-social environments jointly enacted by the coordinated (entrained) motor patterns of early toolmakers. This insight is supported by a range of current research into the evolution of rhythmic behavior (Fitch, 2012; Merchant and Bartolo, 2017; Ravignani et al., 2017). Indeed, evolutionary musicology has often tended to explore the origins of music in terms of its vocal dimensions (i.e., music as pitch/song production and its relationship to spoken language), and has thus had to wrestle with the issues associated with complex vocal learning, and its apparent absence in other primates. The focus on rhythm, however, has shown similarities between animal and human behavior (Fitch, 2010; Patel and Iversen, 2014; Merchant et al., 2015; Bannan, 2016; Iversen, 2016; Wilson and Cook, 2016). A large number of papers have also explored the deep relationship between rhythmic behavior and social cohesion in both human and non-human subjects (e.g., Large and Gray, 2015; Yu and Tomonaga, 2015; Tunçgenç and Cohen, 2016; Knight et al., 2017). Additionally, recent studies by Ravignani et al. (2016a) have modeled the cultural evolution of rhythm in the lab. This research shows how, when presented with random percussive sounds, participants tend to develop structured and recurrent rhythms from such information, and that these patterns continue to develop through subsequent generations of participants who are asked to imitate the rhythms of previous generations. Interestingly, the rhythmic patterns that emerged in this study display six statistical universals found across different musical cultures and traditions. 
This aligns with the conception of cultural transmission based on mimesis and entrainment just discussed. It also implies that the enactment of musical (or music-like) behavior may not be traceable solely to the genome, but rather arises due to a more general propensity to structure acoustical experience in certain ways (see also Fitch, 2017).

Here it should be noted that the biocultural approach also develops a theory about the origins of vocal musicality, albeit one that is deeply connected to the rhythmic factors just described. This entails the development of a repertoire of "gesture-calls" similar to those found in modern primates and many mammalian species (grunts, pant-hoots, growls, howls, barking, and so on; see Tomlinson, 2015, p. 89–123). These do not involve the abstract, symbolic-representational, and combinatorial properties employed by modern languages. Rather, they are tightly coupled with the same mimetic, emotional, and embodied forms of communication that characterize pre-human toolmaking. It is suggested that the vocal expressions associated with these gesture-calls reflected the sonic aspects (rhythmic and timbral) of these environments, the motor-patterns of production, as well as the gestural and social rhythms (e.g., turn taking, social entrainment) that developed within the cultural ecology. In line with this, studies show connections between rhythmic capacities and the development of vocal forms of communication, including language (Cummins and Port, 1996; Cummins, 2015; Bekius et al., 2016; Ravignani et al., 2016b). As an aside, it is also posited that the process of knapping may have resulted in specific forms of listening (Morley, 2013, p. 120), and that the resonant and sometimes tonal qualities of stones and flakes may have afforded music-like play with sound (Zubrow et al., 2001; Killin, 2016a,b)<sup>3</sup>. In brief, these rhythmic forms of behavior may have led to proto-musical and proto-linguistic forms of communication that arose simultaneously.

However, as Tomlinson (2015) notes, "half a million years ago there was no language or musicking" (p. 127). While many music-relevant anatomical features were in place by this period, there is no evidence that these hominins possessed the more complex forms of combinatorial thinking required for the hierarchical structuring of rhythm, timbre, and pitch associated with musical activity (i.e., the kind of thinking that is also needed to build tools specifically intended for musical use, such as bone flutes). Rather, it is posited that proto-musical and proto-linguistic communications were initially limited to deictic co-present interactions (in-the-moment face-to-face encounters that integrated gesture and a limited number of vocal utterances) that incrementally developed into more complex sequences of communicative behavior. Over time, this led to the enactment of increasingly sophisticated forms of joint action and social understanding (Dunbar, 1996, 2003; Knoblich and Sebanz, 2008; Sterelny, 2012). Such developments in the cultural loop fed forward into the coevolutionary cycle, allowing the environmental niche to be explored in new ways, affording previously unrecognized modes of engagement with it. This, in turn, altered selective pressures, leading to incremental phase transitions in the dynamics of the system, where previous constraints were weakened and new behavioral-cognitive phenotypes became possible. By the Upper Paleolithic period, the growing influence of the cultural epicycle favored an enhanced capacity to understand the actions and intentions of others and the related capacity to think "offline," "top down," or "at a distance" from immediate events (Bickerton, 1990, 2002; Carruthers and Smith, 1996; Tomasello, 1999). These developments allowed for the marshaling of material and social resources in new ways, leading to the creation of more complex artifacts (e.g., musical instruments), as well as more sophisticated types of cultural activity (e.g., ritual) and communication, including the hierarchical and combinatorial forms required for language and music as we know them today<sup>4</sup>.

## Plastic Brains

The biocultural approach sees (musical) cognition as an emergent property of situated embodied activity within a developing socio-material environment. Because of this, it requires a rather different view of cognition than the information-processing model associated with an adapted (modular) brain (e.g., Fodor, 1983, 2001; Tooby and Cosmides, 1989, 1992; Pinker, 1997; Barrett and Kurzban, 2006). Indeed, if evolutionary processes do not involve adaptation to a pre-given environment, but rather require the active participation of organisms in shaping the environments they coevolve with (where "selection" and "adaptation" are now understood in a contingent and dynamically cyclical context), then it seems reasonable to suggest that cognitive processes might not depend on genetically programmed responses or be reducible to a collection of fixed information-processing mechanisms in the brain. Rather, they might entail more plastic and perhaps non-representational characteristics that reflect the dynamic integration of brains, bodies, objects, and socio-cultural environments (for similar arguments see Malafouris, 2008, 2013, 2015).

In line with such concerns, scholars are questioning whether the notion of modularity continues to have much relevance for understanding the complexities of the human brain (e.g., Uttal, 2001; Doidge, 2007; Anderson, 2014). For example, it is suggested that brain regions that appear to consistently correlate with specific processes, such as Broca's area and syntax, represent vast areas of the cortex that may in fact develop multiple overlapping or interlacing networks, the manifold functions of which may appear ever more fine-grained and plastic as neural imaging technology becomes more refined (Hagoort, 2005; Poldrack, 2006; Tettamanti and Weniger, 2006; Grahn, 2012). In relation to this, recent research suggests the existence of "global systems" that function in a flexible and context-dependent manner (see Besson and Schön, 2012, p. 289–290). These do not work independently of any other information available to the brain and are thus non-modular (i.e., they are not discrete). Additionally, research into various levels of biological organization is showing that biological and cognitive processes develop in interaction with the environment, e.g., that epigenetic factors play a central role in the expression of genes, and that the formation of neural connections unfolds as a function of context (Sur and Leamey, 2001; Uttal, 2001; Van Orden et al., 2001; Lickliter and Honeycutt, 2003; Panksepp, 2009). In short, the idea that brain and behavior are best understood as linear systems decomposable into discrete modules and corresponding functions is being replaced by more plastic<sup>5</sup> and dynamically interactive perspectives. Such insights have contributed to the growing view that music cognition is the result of non-modular cognitive developmental processes that are driven by a more general attraction to coordinated forms of social behavior (Trehub, 2000; Trehub and Nakata, 2001-2002; Trehub and Hannon, 2006; see also Drake et al., 2000).

<sup>3</sup>Readers may be interested to consider the studies that examine the "musical" properties of stone artifacts and acoustical characteristics of Paleolithic environments (see Blake and Cross, 2008).

<sup>4</sup>This involves the integration of phonemes and words into grammatical structures and the development of a generative syntax that provides the "rules" for such processes, or likewise the organization of discrete sets of sounds, tones, and pitches into rhythmic/formal hierarchies that could be consciously repeated or manipulated (e.g., melodies and drumming patterns).

Because of this, recent decades have seen researchers turn to "connectionist" models to account for essential cognitive functions such as (musical) perception and learning (see Desain and Honing, 1991, 2003; Griffith and Todd, 1999; Clarke, 2005). Likewise, Tomlinson discusses the connectionist approach as a way of understanding how the embodied-ecological processes of mimesis and social entrainment contributed to the development of music and language. Put simply, the connectionist strategy does not rely on the idea of fixed modules, but rather on the fact that when simple devices (such as individual neurons) are massively interconnected in a distributed way, such connections may change and grow through "experience": when neurons tend to become active together, their connections are reinforced, and vice versa (Hebb, 1949). Such connectivity is thought to result in the emergence of complex sub-systems of activity as well as global convergences that produce system-wide properties. This is often modeled using DST and can also be understood in terms of the oscillatory dynamics mentioned above (see Chemero, 2009).
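The Hebbian principle invoked here can be illustrated with a deliberately minimal sketch. The update rule below is a simplified assumption rather than a model taken from the works cited: a connection is strengthened when two units are active together and weakened when only one of them is active (one possible reading of "vice versa"). The function names, learning rate, and activity patterns are hypothetical.

```python
def hebbian_update(weights, activity, rate=0.1):
    """Adjust the connection weights in place for one pattern of unit activity (0 or 1)."""
    n_units = len(activity)
    for i in range(n_units):
        for j in range(n_units):
            if i == j:
                continue  # no self-connections in this sketch
            if activity[i] and activity[j]:
                weights[i][j] += rate  # active together: reinforce
            elif activity[i] != activity[j]:
                weights[i][j] -= rate  # only one active: weaken (assumed rule)
    return weights


def train(patterns, n_units, rate=0.1):
    """Start from unconnected units and apply the update once per pattern."""
    weights = [[0.0] * n_units for _ in range(n_units)]
    for activity in patterns:
        hebbian_update(weights, activity, rate)
    return weights


# Units 0 and 1 repeatedly fire together; unit 2 only fires on its own.
experience = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 0]]
learned = train(experience, n_units=3)
print("connection 0-1:", learned[0][1])
print("connection 0-2:", learned[0][2])
```

After this short run of "experience," the connection between the two co-active units has grown while their connections to the third unit have weakened. No structure was specified in advance; the pattern of connectivity reflects only the history of joint activity.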

## Embodied Minds

While the connectionist approach was initially seen as an alternative to the computational orientation, more recent modeling has revealed the ability of complex connectionist networks to simulate syntactic, representational, and combinatorial cognitive processes (see Smolensky, 1990; Bechtel, 2008)—i.e., those required by the "adapted brain" hypothesis. Such developments are attractive for some researchers as they allow for the assumed computational-representational nature of cognition to remain while accommodating the growing evidence around brain plasticity and dynamism (Chalmers, 1990; Smolensky, 1990; Dennett, 1991; Clark, 1997; on compositionality see van Gelder, 1990). However, others maintain that because the brain's connectivity cannot be separated from its dynamic history of coupling with the body and the environment, living cognition is not best understood as strictly limited to in-the-brain computations and representational content (Varela et al., 1991; Thompson, 2007; Chemero, 2009; Hutto and Myin, 2012).

To better understand what this means for the biocultural approach to music's origins, it may be useful to consider Tomlinson's (2015, p. 129–139) reading of Cheney and Seyfarth's (2008) research into the social lives of baboons. As Tomlinson notes, observations of baboon vocal and gestural interactions lead Cheney and Seyfarth to suggest that the social behavior of these animals is indicative of an underlying hierarchical and syntactic-representational cognitive structure, one that is continuous with the Fodorian notion of "the language of thought" or "mentalese" (a process of non- or pre-conscious symbolic manipulation in the brain according to syntactic rules). This, they suggest, may reveal a deep evolutionary connection between linguistic processing and social intelligence, where linguistic-computational processes are thought to underpin social cognition even if no spoken or symbolic language is present (as with baboons and our pre-human ancestors; cf. Barrett, forthcoming). However, Cheney and Seyfarth also hint at another possibility, where a more plastic and dynamic connectionist framework comes into play. The idea here is that once a system learns to organize itself in various ways, the patterns it develops can be recognized by the system in association with various things and relationships and thus may be said to "represent" them<sup>6</sup>. For this reason, connectionist processes are sometimes thought to be "sub-symbolic" in that they provide a link between biological processes at lower levels and representational processes at higher ones (Varela et al., 1991, p. 100; Smolensky, 1988). In line with this, Cheney and Seyfarth (2008) suggest that as animals engage with their environments neural networks could be reinforced, leading to multimodal forms of "distributed neural representation" (p. 241; see also Barsalou, 2005; Tomlinson, 2015, p. 133). As Tomlinson (2015) points out, this implies something less abstract and more concretely embodied and ecological:

[A] quite literal re-representing, a solidifying, affirming, salience-forming set of neural tautologies. There is no reliance on abstracted social identities such as those humans conceive, on a mysterious language of mind that does the representing, or on baboon comprehension of causality, proposition, and predication. In their place are the accretion of intrabrain and interbrain networks and the responses they enable in face of situations that are both familiar and less so. Networks are, within sheer biological constraints, products of environmental affordances, forged through the repeated patterns of an organism's interaction with the socio-material surroundings. [...] All the intricacy Cheney and Seyfarth find in baboon sociality may well be explained [...] without recourse to anything like mentalese (p. 135–136; italics original).

<sup>5</sup>For studies on music and brain plasticity, see Large and Jones, 1999; Jäncke et al., 2001; Pantev et al., 2001; Schlaug, 2001; Münte et al., 2002; Gaser and Schlaug, 2003; Lappe et al., 2008. Additionally, clinical studies have demonstrated music's deep effects on the body as well as its capacity to transform or reorganize neural structures (e.g., Bunt, 1994; Standley, 1995; Nayak et al., 2000; Tomaino, 2009; Jovanov and Maxfield, 2011).

<sup>6</sup>See Toiviainen (2000) for a discussion of this in the context of music AI.

Similarly, when Tomlinson (2015) refers to the mimetic nature of the developing proto-musical environments, he clarifies that the action loops associated with this may indeed be representational, but not in the sense of mental templates or propositions. Rather, following Donald (2001), Tomlinson comments that the notion of "representation" employed here may entail little more "than the rise to salience of an aspect of a hominin's environment—in this case an enacted sequence of physical gestures imprinting itself in neural networks that fire again when repeated. Or [...] a set of interconnected neural oscillations" (p. 73–74).

It is suggested that this revised conception of representation might be more conducive to understanding cognition across a wider range of developmental and phylogenetic contexts. The problem with applying the more traditional approach associated with computational psychology is that it tends to encourage a kind of "reverse engineering, retrospectively projecting human capacities onto earlier hominins or onto nonhuman species understood as proxies for our ancestors" (Tomlinson, 2015, p. 138). This critique resonates with the work of Barrett (2011), who discusses our tendency to construct highly anthropomorphic views of other life forms and how this can lead to false understandings—not only of their cognitive capacities, but also of the nature and origins of human minds. Similarly, it is argued that the traditional assumption that "cognition" necessarily involves some form of linguistic competence (syntax, propositional thought, symbolic representation, and other forms of abstract "mental gymnastics") has tended to overshadow the more fundamental embodied and emotional aspects of living meaning making in human cognition (Johnson, 2007). This extends to music, which over the past three decades has been examined with a special emphasis on its relationship to linguistic capacities in cognitive and evolutionary contexts (Patel, 2008; Rebuschat et al., 2012; van der Schyff, 2015).

Now, all of this is not meant to imply that research into the (cognitive and evolutionary) relationship between music and language should be abandoned. This is an important area of inquiry and should continue to be investigated. However, other developmental and socio-cultural factors are receiving growing attention from researchers. This includes accounts that explore the dynamic, ecological, and embodied nature of musical experience (e.g., Large and Jones, 1999; Reybrouck, 2005; Leman, 2007; Krueger, 2013; van der Schyff, 2015; Godøy et al., 2016). As we began to consider above, while music and language both involve hierarchical and combinatorial forms of thought, it may be that both emerge from more domain general capacities and proclivities related to the ways embodied-affective relationships are generated within socio-material environments (Johnson, 2007). For some scholars, this implies that the symbolic-representational and propositional forms of cognition associated with language may be derivative rather than primary (see Hutto and Myin, 2012, 2017). As such, the origins of cognition might not be found in brain-bound computations and symbolic representations, but rather in the self-organizing dynamics associated with biological development itself—in the cycles of action and perception that are directly linked to an organism's ongoing history of embodied engagement with its environment. This recalls the coevolution cycle discussed above, but it may also be considered in the context of ontogenesis—e.g., how infants enact meaningful realities through embodied and affective interactivity with their socio-material niche (see Bateson, 1975; Service, 1984; Dissanayake, 2000; Reddy et al., 2013).

Such insights are not lost on Tomlinson (2015), who highlights the continuity between the embodied activities of Paleolithic tool makers and cognition as such—where cognition might in fact be rooted in interactions with the environment that over time result in increasingly complex extensions of individual embodied minds into the broader cognitive ecology (e.g., via mimesis and social "rhythmic" entrainment). Here Tomlinson also entertains the possibility that the self-organizing (or "self-initiating" as he sometimes refers to it) nature of the activities discussed above might not need to be understood in representational terms at all. However, he does not go much further than this general suggestion. This is perhaps somewhat surprising as he does, here and there, draw on the notion of "affordances" and the field of ecological psychology it is associated with—an explicitly non-representational approach to cognition in its original version (Gibson, 1966, 1979; more on this shortly).

Once Tomlinson outlines the deeply embodied, ecological, and socially interactive precursors of musical behavior, he then turns to explain music cognition using generative (e.g., Lerdahl and Jackendoff, 1983) and prediction- or anticipation-based models (e.g., Huron, 2006) that focus on the (internal) processing of musical stimuli and the behavioral responses they lead to. These approaches are relevant to the discussion as they focus on the more abstract and combinatorial ways the modern human mind processes musical events. We would like to suggest, however, that future contributions might benefit by exploring a wider range of perspectives drawn from embodied cognitive science and related perspectives in music cognition. With this in mind, we now turn to discuss how insights associated with the enactive approach to cognition might help to support and advance many of the claims made by Tomlinson (2015) and the biocultural approach more generally.

## THE ENACTIVE PERSPECTIVE

The enactive approach to cognition was originally introduced by Varela et al. (1991) as a counter to the then dominant information-processing model of mind and the adaptationist approach to biological evolution.<sup>7</sup> Like the biocultural model, it develops the insights of developmental systems theory and DST, and is inspired by the work of Gibson (1966, 1979). Gibson's "ecological psychology" asks us to rethink the relationship between cognitive systems and their environment. As Chemero (2009) discusses, this can be understood in terms of three main tenets. The first posits that perception is direct (i.e., it is not mediated by representational mental content). The second argues that perception is not first and foremost for information gathering, but is for the guidance of action—for actively engaging with the world. Following from these, the third tenet claims that perception is of "affordances"—or the possibilities for action offered by the environment in relation to the corporeal complexity of the perceiving organism (e.g., a chair affords sitting for a child or an adult, but not for an infant or a fish; Gibson, 1979).

<sup>7</sup>It should be noted that enactivism also has an antecedent in work by Bruner (1964), who coined the term.

While sympathetic with the three core tenets of the Gibsonian approach, some scholars suggest that the conception of affordances associated with it is problematic when it implies that they are intrinsic features of the environment (e.g., Varela et al., 1991, p. 192–219; for a discussion see Chemero, 2009, p. 135–162). This, it is argued, does not give enough attention to the active role living creatures play in shaping the worlds they inhabit, leading "to a research strategy in which one attempts to build an ecological theory of perception entirely from the side of the environment. Such a research strategy ignores not only the structural unity of the animal but also the codetermination of animal and environment" (Varela et al., 1991, p. 204–205). In brief, the enactivist perspective posits a revised interpretation of affordances that more clearly integrates corporeal dimensions and the engaged perceptual activity of cognitive agents (Varela et al., 1991; see also Noë, 2006; Chemero, 2009; Barrett, 2011; Schiavio, 2016). As we discuss next, this approach allows for a view of cognition that is not wholly driven by the environment nor by internal representations—but rather by the embodied activity of living agents. As such, it may allow us to further develop the corporeal and ecological concerns that drive the biocultural model.

## Where There Is Life There Is Mind

One of the most central claims of the enactive perspective concerns the deep continuity between mind and life, where cognition is understood to originate in the self-organizing activity of living biological systems (Maturana and Varela, 1980, 1984; Varela et al., 1991; Thompson, 2007; Di Paolo et al., 2017). Most primarily, this involves the development and maintenance of a bounded metabolism (Jonas, 1966; Bourgine and Stewart, 2004; Thompson, 2007), but it also requires the (meta-metabolic) ability of the organism to move and interact with the environment in ways that are relevant to its survival (van Duijn et al., 2006; Egbert et al., 2010; Di Paolo et al., 2017; Barrett, forthcoming). Furthermore, because such fundamental life-processes occur under precarious conditions (Kyselo, 2014), they cannot be fully understood in an indifferent way. Rather, basic cognitive activity is characterized by a "primordial affectivity" that motivates relevant action (Colombetti, 2014). By this view, a living creature "makes sense" of the world through affectively motivated action-as-perception and, in the process, constructs a viable niche (Weber and Varela, 2002; Di Paolo, 2005; Reybrouck, 2005, 2013; Colombetti, 2010; Di Paolo et al., 2017). This involves the enactment of affordances—which are conceived of as emergent properties associated with the dynamic (evolutionary and ontogenetic) history of structural coupling between organisms and their environments<sup>8</sup> (Varela, 1988; Varela et al., 1991; Chemero, 2009; Barrett, 2011; Schiavio, 2016). Importantly, such basic sense-making processes do not involve the representational recovery of an external reality in the head (i.e., mental content). Rather, they are rooted in direct embodied engagement with the environment (Varela et al., 1991; Thompson, 2007).<sup>9</sup>

In brief, the enactive approach explores cognition in terms of the self-organizing and adaptive sense-making activities by which organisms enact survival-relevant relationships and possibilities for action (i.e., affordances) within a contingent milieu (Thompson, 2007). This constitutes the fundamental cognitive behavior of living embodied minds. Moreover, this perspective traces a continuity between the basic affectively motivated sense-making of simpler organisms and the richer manifestations of mind found in more complex biological forms (Di Paolo et al., 2017). In other words, where the meaningful actions of single-celled and other simple creatures are associated with factors related to nutrition and reproduction, more complicated creatures will engage in ever richer forms of sense-making activity and thus exhibit a wider range of cognitive-emotional behaviors (Froese and Di Paolo, 2011). For social animals, this may include "participatory" forms of sense-making that involve the enactment of emotional-affective and empathic modes of communication between agents and social groups (mimesis), and that coincide with the development of shared repertoires of coordinated action (entrainment; see De Jaegher and Di Paolo, 2007; Di Paolo, 2009). With this in mind, we suggest that an enactive framework may provide a useful way of understanding human musical activities as continuous with, but not reducible to, the fundamental forms of self-organizing and emotionally driven action-as-perception that characterize living (participatory) sense-making more generally (van der Schyff, 2015; Loaiza, 2016; Schiavio and De Jaegher, 2017).<sup>10</sup> As such, it appears to be well positioned to support and extend the biocultural model.

<sup>8</sup>The symbiotic and co-emergent relationship between honeybees and flowers is an excellent example of this. Here autonomous organisms exist as environments to each other—the development of their phenotypes is inextricably enmeshed over evolutionary time (Varela et al., 1991; Hutto and Myin, 2012).

<sup>9</sup>This, of course, is not to say that the brain does not play an important role in cognitive processes. However, from the enactive perspective, cognition is not limited to the brain—brain, body, and world are different aspects of an integrated cognitive system that functions in a non-linear way. Barrett (2011, p. 57–93) offers many examples that show how creatures with simple neural organizations are nevertheless able to engage in complex intelligent behaviours by using their bodies and environmental features as part of their cognitive systems (see also Brooks, 1991). In line with this, DST research into forms of problem solving and cognitive development associated with coordinated bodily activity has revealed that many of these processes can also be accurately described without necessarily having to recruit representational content (Thelen and Smith, 1994; Kelso, 1995; Chemero, 2009). Indeed, the DST equations employed to model such phenomena are neutral regarding representations. It is argued, therefore, that evoking representation may introduce unnecessary complications (see Chemero, 2009, p. 68–75).

<sup>10</sup>Among other things, this orientation has begun to offer insights into the ways the basic goal directed and self-organizing dynamics discussed above might be extended to living musical situations that are not life threatening in the literal sense, but that nevertheless require constant care and attention to maintain. For example, think of a performing string quartet. Each member must continuously adapt to the evolving musical environment, drawing on different

## Enactivism Meets the Biocultural Perspective

The enactive approach to cognition aligns with the biocultural model in several ways. Both draw on developmental systems theory and DST. And both embrace a circular and co-emergent view of organism and environment, as well as a deeply embodied perspective on cognition. Because the enactive approach traces cognition to the fundamental biological concerns shared by all forms of life, it may also help us avoid the anthropomorphizing tendencies noted above (e.g., imposing language-like capacities on non- or pre-human animals; but see De Jesus, 2015, 2016; Cummins and De Jesus, 2016), and thus better understand how cognitive capacities rooted in bodily action might ground the development of music and other cultural activities (Barrett, 2011; Tomlinson, 2015).

In connection with this, researchers drawing on enactivist theory are using DST models to examine bio-cognitive processes in terms of the non-linear couplings that occur between:


This approach is being explored across a range of areas (see Fogel and Thelen, 1987; Laible and Thompson, 2000; Hsu and Fogel, 2003; Camras and Witherington, 2005), including, for example, emotion research (Lewis and Granic, 2000; Colombetti, 2014), studies of social cognition and intersubjectivity (for a detailed discussion see Froese, forthcoming), and musical creativity (Walton et al., 2014, 2015). We suggest that similar approaches might be employed in conjunction with existing knowledge of early hominin anatomical and social structure, evidence from the archeological record, as well as comparative studies with other species and existing musical activities. This could also be developed alongside recent studies of how musical environments and behavior affect the expression of genes and gene groups, and how this might recursively influence behavioral and ecological factors (see Bittman et al., 2005, 2013; Schneck and Berger, 2006; Laland et al., 2010; Kanduri et al., 2015; Skinner et al., 2015).
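
As a rough illustration of what a "non-linear coupling" in a DST model can look like, the sketch below simulates two phase oscillators (think of two drummers with slightly different preferred tempi) that each adjust toward the other. It is a generic Kuramoto-type model chosen here for illustration only; it is not the model used in any of the studies cited above, and the function name `phase_differences`, the tempi, and the coupling strengths are all assumptions.

```python
import math


def phase_differences(omega1, omega2, coupling, dt=0.01, steps=5000):
    """Euler-integrate two mutually coupled phase oscillators.

    Returns the phase difference theta2 - theta1 after every step.
    """
    theta1, theta2 = 0.0, 1.0  # arbitrary starting phases (radians)
    diffs = []
    for _ in range(steps):
        # Each oscillator runs at its own rate and is nudged toward the other.
        d1 = omega1 + coupling * math.sin(theta2 - theta1)
        d2 = omega2 + coupling * math.sin(theta1 - theta2)
        theta1 += d1 * dt
        theta2 += d2 * dt
        diffs.append(theta2 - theta1)
    return diffs


# Hypothetical tempi: 2.0 Hz and 2.2 Hz, expressed in radians per second.
slow, fast = 2 * math.pi * 2.0, 2 * math.pi * 2.2

uncoupled = phase_differences(slow, fast, coupling=0.0)
coupled = phase_differences(slow, fast, coupling=1.0)

# Drift of the phase difference over the last simulated second (100 steps).
print("uncoupled drift:", uncoupled[-1] - uncoupled[-101])
print("coupled drift:  ", coupled[-1] - coupled[-101])
```

Without coupling the phase difference keeps growing. With sufficient coupling (in this symmetric model, whenever the difference in natural frequencies is no larger than twice the coupling strength) it settles to a constant, which is one simple formal reading of entrainment.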

Additionally, while recent theory associated with "radical enactivism" (Hutto and Myin, 2012) argues that so-called "basic minds" do not themselves possess any form of representational content, it also suggests that culture and language impose certain constraints that result in cognitive activities that may be understood as content bearing (this echoes the suggestion introduced above regarding the possible non-primary or "secondary" status of representational cognition; see Hutto and Myin, 2017). The explanatory advantages of this approach are currently a subject of debate. Nevertheless, the insights that arise from this discussion might shed new light on the cultural epicycles discussed above. As Tomlinson (2015) points out, although musical activity is not fundamentally symbolic or representational itself, it necessarily occurs and develops within cultural worlds of symbols and language. Put simply, the debate surrounding radical enactivism could offer new perspectives on how, over various developmental periods, cultural being might simultaneously constrain, and be driven by, the nonsymbolic, social-affective, and embodied forms of cognition that characterize musical activity.

Another important possibility for how the enactive orientation might contribute to the biocultural approach involves the recently developed 4E framework, which sees cognition in terms of four overlapping dimensions—embodied, embedded, enactive, and extended (Menary, 2010a; Newen et al., 2017). The embodied dimension explores the central role the body plays in driving cognitive processes. This is captured, for example, in the description of the early Paleolithic tool making societies, where the reciprocal influences of sight, sound, and coordinated movement lead to the production of artifacts with specific characteristics. Such forms of embodied activity also formed the basis from which more complex forms of thought and communication emerged later. As we also considered, the biocultural model explores how such embodied factors arise in specific environments, leading to stable and recurrent patterns of activity where bodily, neural, and ecological trajectories converge. This highlights the embedded dimension, which concerns the ecological and socio-cultural factors that co-constitute situated cognitive activity. The biocultural model explores this in terms of the sonic, visual, tactile and emotional-mimetic nature of the niches enacted by our early ancestors, as well as the growing influence of the cultural epicycle on the cognitive ecology. The enactive dimension, as we have seen, concerns the self-organizing nature of living systems, and describes the active role organisms play in shaping the environments they inhabit. Such modes of activity (which are described as "sense-making") are explored over a range of timescales (brief encounters, ontogenesis, evolutionary development), closely aligning with the coevolutionary feedback cycle discussed above.
As enactivists equate "sense-making" with "cognition" (Thompson, 2007; De Jaegher, 2013), it may be argued that mental life cannot be limited to the brains or bodies of organisms: It extends into the environments in which cognitive processes play out. In line with this, the extended dimension explores how many cognitive processes involve coupling with other agents (mimesis, social entrainment, participatory sense-making) or with non-biological objects or cultural artifacts (tools, notebooks, musical instruments; see Menary, 2010b; Malafouris, 2013, 2015). While Tomlinson (2015) makes no mention of enactivism or this 4E framework, he does, as we have seen, discuss how cognitive processes emerged and developed in our Paleolithic ancestors through embodied activity that was situated within a milieu that they actively shaped. He also argues that such activity necessarily involved the coordination of multiple agents and the "extension" of individual minds into the socio-material environment. We suggest, therefore, that a 4E approach might be useful in terms of organizing theoretical concepts and for framing and interpreting relevant empirical research.

forms of embodied, emotional-affective/cognitive capacities to communicate, develop shared affordances, and maintain the musical ecology they co-create (this example is developed in detail by Salice et al., 2017; see also Krueger, 2014; Schiavio and Høffding, 2015). Similar studies by Walton et al. (2014, 2015) draw on enactive and dynamical systems theory to better understand the real-time dynamics of interacting musical agents in creative improvisational contexts.

The 4E framework is currently being developed by a handful of scholars in association with musical cognition (e.g., Krueger, 2014, 2016; Schiavio and Altenmüller, 2015; van der Schyff, 2017; Linson and Clarke, forthcoming). It is also explored in biological contexts by Barrett (2011, 2015a,b, forthcoming) as an alternative to the brain-bound (and arguably anthropomorphizing) approach of traditional computationalism. Additionally, the 4E approach aligns with, and could be used to integrate, the corporeal, neural, and environmental levels of investigation associated with contemporary DST research in musical contexts. Therefore, it could help model how these factors contributed to the development of musical behavior in pre- and early human societies. Likewise, this approach might also have interesting implications for the laboratory modeling of cultural rhythmic transmission. As we began to discuss above, experiments by Ravignani et al. (2016a) examine how individuals trying to imitate random drumming sequences learn from each other in independent transmission chains—where the attempts of one participant become the training set for the next subject. This research aligns with the biocultural and enactive perspectives when it suggests that cultural development is not the product of genetic programming, but is guided by more general dynamical processes and constraints that allow for a range of possibilities. A 4E approach might develop the parameters of such studies to include the manipulation of social-environmental (i.e., embedded + extended) factors, possibly exploring how groups of participants (rather than chains of individual drummers) collaboratively make sense of their sonic environments and develop rhythmic patterns in real time, and how the shared environments that result are transmitted and developed (enacted) by the following cohort.
Additionally, it might be interesting to introduce different instruments and methods of sound-making into the environment to see how this affects the results. Lastly, a 4E approach could also include the analysis of video and audio recordings to better understand the relationship between the (embodied) motor, sonic, and socio-material factors involved in the enactment of "rhythmic cultures."<sup>11</sup> If it is indeed the case that it is joint bodily action that drove cognitive and cultural processes in our ancestors, then it would be interesting to see how drumming movements shape shared learning environments, and how they develop into new more structured ones (more efficient and easier to imitate) as the rhythmic patterns are transmitted.
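
The transmission-chain design described above, in which each participant's attempt becomes the training set for the next, can be sketched as a toy simulation. This is a minimal sketch under stated assumptions, not the procedure or analysis of Ravignani et al.: the simulated "drummer", the pull toward the mean interval (a crude stand-in for whatever constraints shape real imitation), the noise level, and the names `imitate`, `spread`, and `run_chain` are all hypothetical.

```python
import random


def imitate(intervals, rng, pull=0.3, noise=0.02):
    """One simulated drummer reproduces a list of inter-onset intervals.

    ASSUMPTION: each interval is pulled a little toward the sequence mean
    (a crude regularizing bias) and perturbed by Gaussian noise.
    """
    mean = sum(intervals) / len(intervals)
    copy = []
    for x in intervals:
        y = x + pull * (mean - x) + rng.gauss(0.0, noise)
        copy.append(max(0.05, y))  # keep every interval positive
    return copy


def spread(intervals):
    """Standard deviation of the intervals: lower means more regular."""
    mean = sum(intervals) / len(intervals)
    return (sum((x - mean) ** 2 for x in intervals) / len(intervals)) ** 0.5


def run_chain(generations=8, n_intervals=12, seed=0):
    """Pass a random rhythm down a chain; each output trains the next drummer."""
    rng = random.Random(seed)
    rhythm = [rng.uniform(0.2, 1.0) for _ in range(n_intervals)]
    history = [rhythm]
    for _ in range(generations):
        rhythm = imitate(rhythm, rng)
        history.append(rhythm)
    return history


history = run_chain()
for generation, rhythm in enumerate(history):
    print(generation, round(spread(rhythm), 3))
```

A group version of the kind proposed above could, for instance, let several simulated drummers hear one another within a generation before the result is passed on; that extension is left out of this sketch.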

## CONCLUSION

We have offered here only a few tentative possibilities for how the enactive and 4E orientation might extend the biocultural approach to the origins and nature of human musicality. We hope that the ideas we have discussed here will inspire future work that explores this relationship more fully. Along these lines, readers may be interested to consider recent work by Malafouris (2008, 2013, 2015), who develops enactive and 4E principles to better understand how brains, bodies, and objects interact to form cognitive ecologies. Malafouris expands the idea of neural plasticity discussed above to include the domain of objects, tools, and culture. In doing so he posits a notion of "metaplasticity" that demands an "historical ontology" of different forms of material engagement (Malafouris, 2013, 2015). This is considered at the intersection of neuroscience, archeology, 4E cognition, and approaches to biological evolution that are closely aligned with developmental systems theory. In many ways, Malafouris' perspective sums up the interests and aspirations of the biocultural approach. He writes,

I propose to accept the fact that human cognitive and emotional states literally comprise elements in their surrounding material environment. Our attention, therefore, should shift from the distinction of "mind" and "matter" or "in" and "out," toward developing common relational ways of thinking about the complex interactions among brain, body, and world. If we succeed, traditional ways of doing cognitive science should change, and the change will stretch far beyond the context of cognitive archaeology and human evolution (Malafouris, 2015, p. 366).

With this in mind, we would like to close by briefly mentioning some ontological and ethical implications an enactive-biocultural model might have for practical areas like music education. If music is neither a pleasure technology, nor the result of some strict adaptationist process—but rather a biocultural phenomenon rooted in the dynamics of joint action—then the ways we approach it in practice (e.g., music education, musicology, performance, music therapy, and so on) should reflect this fundamental existential reality. In other words, this approach opens a perspective on what it means to be and become musical that is no longer based in prescriptive developmental processes, adapted cognitive modules, and correspondence to pre-given stimuli (e.g., music as the reproduction of a score; see Small, 1999). Instead, it highlights the plastic, creative, situated, participatory, improvisational, embodied, empathic, and world-making nature of human musicality. It may therefore offer support to a growing number of theorists who argue that we have tended to rely on disembodied, depersonalized, and highly "technicist" approaches to musical learning (Regelski, 2002, 2016; Borgo, 2007; Elliott and Silverman, 2015), and that this orientation has reduced the ontological status of music students, teachers, listeners, and performers to mere responders, consumers, and reproducers (van der Schyff et al., 2016). Although this cannot be explored in detail here, it is an example of how alternative perspectives on the evolution and nature of human (musical) cognition could inspire new ways of thinking in practical areas. In all, then, we hope that the biocultural and enactive approaches will continue to be developed in musical contexts to gain richer understandings of the origins and meaning of musicality for the human animal.

<sup>11</sup>A relevant example of approaches involving the integration of video and audio documentation, and DST/4E analysis, may be found in the recent work by Walton et al. (2014, 2015) that examines perceptions of creativity in interacting musical improvisers (see also Borgo, 2005; Laroche and Kaddouch, 2015). Note that these studies also include a phenomenological dimension that incorporates first-person accounts of the participants.

#### AUTHOR CONTRIBUTIONS

DvdS developed the main body of text. AS provided suggestions and comments that were implemented in the final version.

#### FUNDING

DvdS is supported by the Social Sciences and Humanities Research Council of Canada.

#### ACKNOWLEDGMENTS

We thank Luca Barlassina and Richard Parncutt for inspiring discussions concerning this topic. We also thank the Action Editor, Aleksey Nikolsky, and the reviewers for their helpful comments.

#### REFERENCES

Jonas, H. (1966). The Phenomenon of Life. Chicago, IL: University of Chicago Press.

Kelso, S. (1995). Dynamic Patterns. Cambridge, MA: MIT Press.

Reybrouck, M. (2013). From sound to music: an evolutionary approach to musical semantics. Biosemiotics 6, 585–606. doi: 10.1007/s12304-013-9192-6

Sperber, D. (1996). Explaining Culture. Oxford: Blackwell.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 van der Schyff and Schiavio. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Role of the Baldwin Effect in the Evolution of Human Musicality

#### Piotr Podlipniak\*

*Institute of Musicology, Adam Mickiewicz University in Poznań, Poznań, Poland*

From the biological perspective, human musicality refers to a set of abilities which enable the recognition and production of music. Since music is a complex phenomenon which consists of features that represent different stages of the evolution of human auditory abilities, the question concerning the evolutionary origin of music must focus mainly on music-specific properties and their possible biological function or functions. What usually differentiates music from other forms of human sound expressions is a syntactically organized structure based on pitch classes and rhythmic units measured in reference to musical pulse. This structure is an auditory (not acoustical) phenomenon, meaning that it is a human-specific interpretation of sounds achieved thanks to certain characteristics of the nervous system. There is historical and cross-cultural diversity in this structure, which indicates that learning is an important part of the development of human musicality. However, the fact that there is no culture without music, the syntax of which is implicitly learned and easily recognizable, suggests that human musicality may be an adaptive phenomenon. If the use of syntactically organized structure as a communicative phenomenon were adaptive, it would be so only in circumstances in which this structure is recognizable by more than one individual. It is therefore difficult to explain the adaptive value of an ability to recognize a syntactically organized structure that appeared accidentally, as the result of mutation or recombination, in an environment without a syntactically organized structure. A possible solution is offered by the Baldwin effect, in which a culturally invented trait is transformed into an instinctive trait by means of natural selection. It is proposed that in the beginning musical structure was invented and learned thanks to neural plasticity.
Because structurally organized music turned out to be adaptive (phenotypic adaptation), e.g., as a tool of social consolidation, our predecessors started to spend a lot of time and energy on music. In such circumstances, an individual was accidentally born with the genetically controlled development of new neural circuitry which allowed him or her to learn music faster and with less energy use.

Keywords: Baldwin effect, human musicality, cortico-subcortical loops, pitch structure, musical rhythm

## INTRODUCTION

Human musicality can be understood as a set of abilities which enable people to recognize and produce music (Fitch, 2015). Although the worldwide diversity of music reveals the cultural flexibility of Homo sapiens in musical behavior, the fact that people in all known cultures sing (Nettl, 2000) and can recognize music without explicit learning (Tillmann, 2005) suggests that

#### Edited by:

*Aleksey Nikolsky, Braavo! Enterprises, United States*

#### Reviewed by:

*Nicholas Bannan, University of Western Australia, Australia Edward Hagen, Washington State University Vancouver, United States*

> \*Correspondence: *Piotr Podlipniak podlip@poczta.onet.pl*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

Received: *24 May 2017* Accepted: *19 September 2017* Published: *06 October 2017*

#### Citation:

*Podlipniak P (2017) The Role of the Baldwin Effect in the Evolution of Human Musicality. Front. Neurosci. 11:542. doi: 10.3389/fnins.2017.00542*



musicality is a part of human biological endowment (Blacking, 1973; Fitch, 2006, 2015). In fact, a variety of theories have been proposed which try to explain the possible adaptive value of music (Cross and Morley, 2008). Many of the possible biological values of music have been put forward such as increasing sexual attractiveness (Darwin, 1871; Miller G. F., 2000), facilitating mother-infant bonds (Dissanayake, 2008), enhancing group consolidation (Roederer, 1984; Storr, 1992; Harvey, 2017), reducing cognitive dissonance (Perlovsky, 2010, 2012), and alerting outsiders about a group cohesiveness (Hagen and Bryant, 2003; Hagen and Hammerstein, 2009), to name only the most popular ideas. However, music is a complex communicative phenomenon (Zimmermann et al., 2013) which is composed of many features. Some of these features are shared with other communicative phenomena, which raises the question about music specificity and in consequence, about the biological character of human musicality. For example, the manipulation of sound intensity, stress and tempo is an important part of all human songs but also of speech. Moreover, a similar use of these features is observed among the vocal expressions of many mammalian species (Zimmermann et al., 2013) including our closest relative—the chimpanzee (Pan troglodytes) (Slocombe et al., 2009). The patterns of continuously modulated sounds used as a tool of expression and induction of emotions in other individuals are called "expressive dynamics" (Merker, 2003) or "affective prosody" (Zimmermann et al., 2013) and seem to be evolutionarily more ancient than any species of hominins. Of course, music is also usually composed of such elements which are not shared with other forms of human sound expressions. The interpretation of musical stimuli in terms of pitch classes and temporal isochrony are at least two musical elements which are absent in speech (Fitch, 2013a) and other human vocalizations. 
Although pitch and the duration of vowels can serve as discrete phonological units in tonal and time-sensitive languages respectively (Remijsen and Gilley, 2008; Wong et al., 2012; Remijsen, 2014), both pitch and vowel length in speech lack hierarchical ordering based on mental reference points. Such a hierarchy is evident in music, where pitch classes are organized in reference to a pitch center (Podlipniak, 2016) and rhythm measures in reference to musical pulse (London, 2012). Since neither of these features is present in the vocal expressions of chimpanzees, it is reasonable to assume that they were also absent in the vocal repertoire of the common ancestor of chimpanzees and humans and thus are evolutionarily relatively young innovations.

The coexistence of features of different evolutionary age in our musical expression encourages us to think of human musicality in a way analogous to the faculty of language (Hauser et al., 2002), namely in terms of musicality in the broad and narrow sense. While musicality in the broad sense can encompass a set of abilities which develop at an early stage of human ontogenesis, and which allow the identification of pitch contour, changes in tempo, and dynamics as parts of the affective prosody (Zimmermann et al., 2013), only abilities which constitute musicality in the narrow sense enable the recognition of musical structure. From this perspective the actual question about the origin of music is related to the origin of those abilities which constitute musicality in the narrow sense. What makes music a phenomenon distinguishable from other experiences of sounds is the interpretation of sounds by the human nervous system not just as unrelated sound events but as psychologically unified musical structures. Although it has been proposed that musical structure (e.g., melody) was discovered by early humans due to the resemblance of musical sounds to the acoustic characteristics of human vocalizations (Purves, 2017), the real problem is why the nervous system of early humans started to interpret certain acoustic parameters as discrete pitch and rhythm units organized into syntactically based sequences. After all, despite the fact that a sound stimulus is usually a very complex phenomenon composed of an enormous number of spectral and temporal cues, different species recognize their species-specific vocalizations using only the subsets of features which are distinctive solely to their song cognition (Bregman et al., 2016; Shannon, 2016). In other words, every species is sensitive to specific acoustic cues due to the proclivities of its own nervous system. In addition, speech and music, two complex hierarchical cognitive systems of H. 
sapiens (Fitch, 2014), operate with a restricted number of different distinctive features (Patel, 2008). Therefore, there must be something which predisposes humans to focus their attention on particular acoustic features whilst ignoring others. What is more, the vast majority of musical structures, especially songs of tribal communities (Blacking, 1973), are organized according to certain syntactic rules (Koelsch, 2013 but see London, 2011), which suggests that musical syntax is a natural trait of human vocalization. Apart from this, musical syntax based on pitch classes and rhythmic units measured in reference to musical pulse is a music-specific feature (Fitch, 2013a). This raises the question about the evolution of the abilities which allow the recognition of musical syntax, which seems to be the core of human musicality.

## MUSICAL SYNTAX AS A MUSIC SPECIFIC FEATURE

The human ability to organize and interpret stimuli as syntactically complex sequences is often regarded as a milestone in the evolution of human cognition (Hauser et al., 2002; Fitch, 2014). Even though syntax, understood in the broad sense as rules combining discrete elements into sequences (Patel, 2008), can be attributed to certain animal songs (Okanoya, 2013; Suzuki et al., 2016), both musical and language syntaxes seem to be exceptional in terms of their complexity and function (Fitch and Jarvis, 2013). Musical syntax, however, is often thought of as a derivative of language syntax (Patel, 2008 but see Jackendoff and Lerdahl, 2006), or as a product of domain-general structural computation (Fitch, 2014; Van de Cavey and Hartsuiker, 2016) rather than a functionally separate phenomenon. In fact, there are good reasons to assume a general role of structural computation in the processing of language and musical syntaxes. Many neuroimaging studies show an overlap in the activation of cortical structures (located in the inferior frontal gyrus) during the processing of musical and language syntaxes (Patel et al., 1998; Maess et al., 2001). Moreover, there is also research which reveals that the same structures (e.g., Broca's area) play a certain role in performing other tasks involving sequential ordering (Tettamanti and Weniger, 2006; Friedrich and Friederici, 2009; Higuchi et al., 2009; Wakita, 2014). Also, the cognitive deficits observed after lesions in Broca's area include not only deficits in the production and recognition of language syntax but also deficits in action execution and observation as well as in musical syntax processing (Fadiga et al., 2009).

However, from a behavioral point of view, language and music are functionally different phenomena. The former communicates referential meaning by means of intersubjectively understandable concepts (Bickerton, 2010), whereas the latter exchanges information which is at least more ambiguous in terms of its semantic content (Cross, 2005). Since natural selection operates on phenotypes, we can assume that syntactical language evolved because it gave an advantage to our ancestors over those individuals who could not grasp the syntactic rules present in the language of our ancestors. This means that in the case of cognitive abilities natural selection acts directly upon the behavioral effects of the activity of brain structures. Therefore, when searching for the evolutionary origin of a particular cognitive trait it seems more reasonable to look at it from Tinbergen's perspective (Tinbergen, 1963), namely by considering the possible ultimate function of the brain products (e.g., syntactical language), rather than the proximate function which a particular brain structure fulfills in the processing of a certain task or tasks. From this point of view, it is the circuitries which perform a whole task, such as the processing of syntactical language, that should be regarded as the result of natural selection, rather than the isolated brain structures which take part in the operations of these circuitries (e.g., Broca's area). Since evolution usually optimizes organisms by means of adjusting existing traits to new functions (Jacob, 1977), it is not surprising that the same brain structures are often parts of functionally different circuits. Of course, the human brain is characterized by great plasticity, thus the general ability to attribute "tree structures" (Fitch, 2014) (complex syntax) to stimuli is not restricted solely to language and music. It is well known that people are able to implicitly learn the artificial syntaxes of different stimuli (Reber et al., 1999). Nonetheless, in comparison to artificial grammars, both language and music seem to be exceptional with respect to the rate and ease of the implicit learning of their syntactic rules by children (Jablonka and Lamb, 2005; Tillmann, 2005). The fact that there is no culture without language and music additionally strengthens the view that musicality, like language abilities, is a natural part of human behavior rather than a very old cultural invention similar to writing or playing chess.

## Musical Syntax vs. Language Syntax

There are many similarities between language and musical syntaxes. Both are compositional and hierarchical (Merker, 2002) and both generate long-distance dependencies (Bickerton, 2009; Woolhouse et al., 2016). Speech, the default mode of language, operates like music in the auditory domain, and it has been observed that the processing of music and speech syntactic tasks activates the peri-Sylvian network which connects the inferior frontal gyrus with sensory cortices located in the temporal lobes (Fitch, 2014). However, musical and language syntaxes are also quite different in many respects. First of all, music is composed of different units from those of language. The basic units of speech are phonemes, which are experienced in our internal world as unique qualities hardly comparable with our experience of pitch class. Their discrimination is also based on different spectral cues. Pitch classes are recognized by the fundamental frequency of harmonic sound (F0) (Stainsby and Cross, 2008) whereas phonemes are recognized mainly by the spectral and temporal shape of sound (Xu et al., 2005). Although the processing of certain characteristics of spectral shape is important for the discrimination of timbre in music (McAdams and Giordano, 2008), the role of timbre in musical syntax is doubtful at the very least. Admittedly, timbre can play an important role in the structural organization of music, as is observed in certain musical styles such as the deep throat singing of Tuva and Mongolia (Levin and Süzükei, 2006), the music of the Jew's harp (Fox, 1988), and tabla music in India (Patel, 2008). There are also musical cultures (e.g., Yakut culture in Siberia) in which the structure-forming function of pitch is extremely reduced whereas timbre seems to be the dominant factor which structures the sound order. Nevertheless, in all these cases timbral structure is hardly comparable to pitch and rhythm structure, mainly because of the multidimensional perceptual character of timbre (Lerdahl, 1987). Also, our mental images of timbre in music and phonemes in speech differ, although both are based on the interpretation of the spectral shape of sound. Therefore, even though language grammar (Lerdahl and Jackendoff, 1983), prosodic structure (Heffner and Slevc, 2015) as well as phonotactics and morphonotactics are to some extent comparable to musical syntax (Lerdahl, 2013), the perceptual salience of these phenomena is disparate. The crucial difference between language and musical syntax, however, is related to the function which these syntaxes fulfill in music and language as different communicative phenomena. In language, syntax is mapped onto conceptual meaning (propositional semantics; Hilliard and White, 2009), which allows a concatenation of meaning, i.e., putting together two or more units and thereby creating a new meaning in comparison to the meanings of those units alone (Bickerton, 2009). But this process of mapping does not seem to be unidirectional, as semantics can influence syntactic rules, as in the case of some verbs in which the meaning determines grammatical patterns (Dor and Jablonka, 2000). Thus, the function of language syntax is strictly related to the communication of specific conceptual meanings (Dor and Jablonka, 2001). In contrast, the function of syntax in music has nothing to do with such a complex interdependence between syntax and concepts observed in language. Although there is an endless dispute over the existence and character of musical semantics (Patel, 2008; Koelsch, 2013; Reybrouck, 2013; Seifert et al., 2013), even if one admits that music can communicate referential meaning, both the type of this meaning (Dor and Jablonka, 2001) and its relation to musical syntax (Lerdahl, 2013) are definitely different from those in language.

## Musical Rhythm and Pitch as the Basis for Musical Syntax

There is currently no agreement about what musical syntax actually is (Patel, 2003; London, 2011; Koelsch, 2013; Lerdahl, 2013; Asano and Boeckx, 2015; Heffner and Slevc, 2015). The majority of research on musical syntax has been conducted on Western artistic music, especially on the functional relations between chords (Koelsch, 2013). But Western artistic music seems to be an inadequate example of human musical expression as the manifestation of H. sapiens musicality (Jackendoff and Lerdahl, 2006). After all, functional harmony is an exception within the wide variety of world music. Although music based on functional harmony has become more and more widespread in the last century, its history is very young in comparison to the ancient history of music without functional harmony. Nevertheless, the experience of even a simple melody without any accompaniment necessitates the recognition of syntactic relations. These relations are hierarchical, meaning that the sequence of sounds is interpreted by the nervous system as being composed of units (sounds perceived as belonging to a particular pitch class and having a particular rhythmic measure) which possess different prominence. This prominence attributed to sounds as elements of metrical (London, 2012; Fitch, 2013b) and pitch (Lerdahl and Jackendoff, 1983; Huron, 2006) patterns is a mental construct, as is the prominence of grammatical categories in language. However, in the case of music the metrical or tonal prominence is "felt" rather than "conceptually known" as in language cognition. This preconceptual character of musical hierarchy is especially evident when music is experienced by non-musicians who, in contrast to professional musicians (Burns and Ward, 1978), do not always recognize pitch intervals by means of categorical perception (Smith et al., 1994). 
However, even musically trained listeners experience musical hierarchy in such a preconceptual way despite the fact that they are additionally able to recognize musical structure in terms of precise mental categories. The source of this preconceptual experience of musical hierarchy seems to be related to motor and emotional brain processes. The recognition of meter in music can be understood as a kind of entrainment which exists in the connection between our auditory and sensorimotor systems (London, 2012). It has even been proposed that human metrical interpretation of music is based on hidden sensorimotor activity (Repp, 2007). This sensorimotor activity during listening to music in turn leads to emotional reactions (Sievers et al., 2013). Also, the recognition of pitch hierarchy causes measurable emotional reactions (Steinbeis et al., 2006; Koelsch et al., 2008; Mikutta et al., 2015), and the perception of pitch changes (often described as "leaps" and "steps") can lead to sensorimotor interpretation (Nikolsky, 2015). Since emotion is an evolutionarily old motivational mechanism (Toates, 1988), the function of which is to assess the potential danger or attractiveness of perceived stimuli (Panksepp, 1998), the tight connection between musical structure and emotions suggests the biological importance of this human-specific interpretation of sound in terms of pitch and metric hierarchy.

## Musical Syntax and Emotions

During the processing of musical syntax, people experience a set of subtle emotional reactions (emotional qualia) dependent on the position of a note in each syntactic context (Huron, 2006; Margulis, 2014). This observation is supported by the fact that the bilateral amygdalae and the orbitofrontal cortex are differently activated during listening to music depending on the syntactic relations of musical sequences (Mikutta et al., 2015). Although affective experience accompanies listening to musical syntax, the emotional reaction to musical syntax is often regarded as the mere result of the cognitive recognition of syntactic structure (Koelsch et al., 2008) rather than an integral part of this recognition. For example, Huron (2006) has proposed that the association of emotions with particular pitch classes is the result of the so-called "misattribution effect." In the case of music it is the misattribution of limbic reward or punishment (caused by fulfilling or not fulfilling predictions about which pitch class will be next) to the pitch classes themselves depending on the general mechanism of prediction. However, in the original misattribution effect (Dutton and Aron, 1974) the feeling experienced in response to a stimulus (e.g., the instability of a bridge perceived by the vestibular and visual systems) is misattributed to another stimulus (e.g., a woman perceived by sight). But in Huron's example there is only one stimulus: sound. Because prediction is the ultimate function of the nervous system (Llinás, 2001) one can assume that every perception is based on prediction. The limbic reward (or punishment) in response to well predicted (or falsely predicted) stimuli is an evolutionarily old mechanism of the assessment of stimuli (Panksepp, 1998), inseparable from every perception. In other words, both the emotional reaction to a particular stimulus and the prediction of stimuli are parts of cognition. Therefore, the prediction processes of the nervous system cannot be treated similarly to external stimuli as a source of emotions. The actual source of emotion is an external stimulus, and a prediction process is one of the mechanisms the function of which is to deliver information about the external world that is assessed by emotions. Moreover, if the emotional effect were solely the result of prediction then the emotional reactions to equally predicted stimuli should be the same, independent of whether they are parts of e.g., musical syntax, speech phonotactics or the sequence of timbres in music (Gorzelańczyk et al., 2017). Yet the emotional experience of musical syntactic relations seems qualitatively different from the experience of phonotactics and other syntactically organized sequences. This difference is evident if we compare singing with speech. Although both of these vocal expressions are composed of syntactically organized sounds, the variations of pitch in time occur much more slowly in singing than in speech (Zatorre and Baum, 2012), and the emotional impact of singing on listeners seems to be greater than that of speech, beginning in childhood (Nakata and Trehub, 2004) and lasting throughout our lives.
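The idea of predicting "which pitch class will be next" can be illustrated with a small sketch. The article does not specify any computational model of prediction; the code below simply assumes that expectations can be approximated by first-order transition statistics learned from exposure, and it measures how unexpected a continuation is as surprisal in bits. The toy melody and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def transition_probabilities(melody):
    """Estimate P(next pitch class | current pitch class) from one sequence."""
    counts = defaultdict(Counter)
    for current, following in zip(melody, melody[1:]):
        counts[current][following] += 1
    return {
        current: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for current, nexts in counts.items()
    }


def surprisal_bits(probabilities, current, following):
    """Surprisal, log2(1/p), of a continuation; infinite if it was never heard."""
    p = probabilities.get(current, {}).get(following, 0.0)
    return math.inf if p == 0.0 else math.log2(1.0 / p)


# Hypothetical toy melody written as pitch classes (0 = C, 2 = D, 4 = E).
melody = [0, 2, 4, 2, 0, 2, 4]
model = transition_probabilities(melody)
```

In such a sketch, two events with the same probability receive the same surprisal whether they are pitch classes, phonemes, or timbres. This domain neutrality is exactly what the argument above finds insufficient to account for the distinct emotional quality of musical syntax.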

Taking into account the behavioral specificity of singing, both prediction and emotional reactions to successive sounds are in this case better seen as integral parts of a mental tool dedicated to processing musical structure. From this perspective, different subtle emotional reactions to a variety of possible syntactic relations are the elements of a functionally specific form of communication, similar to how semantics is strictly connected with grammar in natural language. In other words, music can be understood as a mapping system in which syntactical relations are mapped onto preconceptual emotional cues. For the majority of the human population, the recognition of musical syntax is a solely preconceptual experience. Of course, the conceptual level of musical syntax recognition is achievable by professional musicians. However, in contrast to the tacit syntactic knowledge implicitly learned by non-musicians, this additional level of competence necessitates strenuous explicit learning. Changes in the neural architecture of musicians in response to environmental influences (explicit learning) could perhaps represent a kind of phenotypic adaptation within the cognitive domain which is possible due to the brain's plasticity (Moreno and Bidelman, 2014; Strait and Kraus, 2014). In fact, it has been observed that music performance affects the transcription of genes that are related to dopaminergic neurotransmission, motor behavior, neuronal plasticity, and neurocognitive functions such as learning and memory (Kanduri et al., 2015), which may be responsible for the observed differences between musicians and non-musicians. Therefore, the fact that musical syntax can be recognized by musically trained individuals at a conceptual level shows how much cultural influence can extend cognitive skills rather than saying something about the primordial nature of musical syntax. 
Since for average humans the syntactic relations in music, both metrical prominence (especially evident when they dance or tap), and pitch prominence (when they sing), are somehow felt but are difficult to express conceptually, it is reasonable to assume that motor, emotional and cognitive processing are an integral part of the ability to recognize and produce syntactically organized music.

## MUSIC AND THE BALDWIN EFFECT

The main problem concerning the evolution of the abilities to use syntax as a part of any communicative system is related to the question about the cause of "syntax genes" proliferation (Dor and Jablonka, 2001). If musical or language syntax is adaptive due to extending communicative capabilities, then this means that it must be used by at least two individuals. After all, as long as one individual cannot use syntax in communication with another individual, the use of syntax by the latter is useless. Even with the use of music in the form of self-communication, as in the case of "personal song" in the musical cultures of Siberia, the Far East, and Amerindian tribes (Nikolsky, 2015), the development of the abilities to organize music syntactically necessitates learning from another individual. However, the appearance of a new genetic trait in the population is usually a result of accidental mutation or recombination. Because the probability of the coincidence of identical mutations of the same allele in the same generation is very low, the appearance of a genetically based predisposition to organize vocal expression syntactically seems puzzling. In other words, all advantages of syntax are useless in a population in which only one individual is able to produce and recognize syntactically organized sequences. A possible solution to this problem can be the Baldwin effect (Baldwin, 1896a; Simpson, 1953).
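The improbability invoked here can be given a rough numerical sense. The sketch below assumes, purely for illustration, that each of `n_births` newborns in one generation independently carries a given new mutation with probability `mu`, so that the number of carriers is binomially distributed. Neither the parameter values nor the function names come from the article.

```python
def prob_at_least_one_carrier(n_births: int, mu: float) -> float:
    """P(at least one of n_births newborns carries the new mutation)."""
    return 1.0 - (1.0 - mu) ** n_births


def prob_at_least_two_carriers(n_births: int, mu: float) -> float:
    """P(at least two newborns independently carry the same new mutation)."""
    p_none = (1.0 - mu) ** n_births
    p_exactly_one = n_births * mu * (1.0 - mu) ** (n_births - 1)
    return 1.0 - p_none - p_exactly_one


# Hypothetical values: 1000 births per generation, mutation probability 1e-6.
one = prob_at_least_one_carrier(1000, 1e-6)
two = prob_at_least_two_carriers(1000, 1e-6)
```

With these hypothetical values, a single carrier arises in about one generation in a thousand, whereas two simultaneous carriers arise in roughly five generations in ten million. A trait that pays off only when a partner shares it would therefore almost never find its first pair of users by mutation alone.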

## The Baldwin Effect

The Baldwin effect is an evolutionary mechanism which transforms a culturally invented and acquired trait into an instinctive trait by means of natural selection (Baldwin, 1896a; Simpson, 1953; Hall, 2001). Although this mechanism was independently proposed at the end of the nineteenth century by at least three different people (Baldwin, 1896a,b; Morgan, 1896; Osborn, 1896), it was forgotten, with some exceptions (Simpson, 1953; Waddington, 1953a,b), for more than half of the next century. Only in the last few decades of the twentieth century did the Baldwinian idea start to inspire scientists and philosophers again and regain popularity (Godfrey-Smith, 2003). The core of this concept is very simple. Some animals, due to their cognitive flexibility, learn new adaptive behaviors in response to environmental changes. If a particular behavior is adaptive, lasts many generations, and its learning is strenuous and time-consuming, sooner or later a genetically based predisposition appears and starts to be favored by natural selection (Dor and Jablonka, 2000, 2010; Godfrey-Smith, 2003; Jablonka and Lamb, 2005). Therefore, the Baldwin effect is a combination of learning and the genetic assimilation of a learned trait (Dor and Jablonka, 2000, 2001; Godfrey-Smith, 2003). The process of Baldwinian evolution occurs in three stages: (i) the appearance of a new environmental challenge, (ii) the invention of a new behavior as a response to the new environmental challenge and its proliferation by means of learning (at this stage natural selection favors cognitive plasticity), and (iii) the appearance of a new genetically based predisposition (canalization; Jablonka and Lamb, 1995; Dor and Jablonka, 2010), at which stage natural selection favors individuals who are less flexible but faster at exhibiting a particular adaptive behavior (Godfrey-Smith, 2003).
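Stages (ii) and (iii) can be illustrated with a deliberately simplified sketch. It is not a model taken from the article: it assumes haploid inheritance, an infinite population, and fixed fitness values, and every parameter value is hypothetical. A rare "canalized" variant pays only a fraction of the learning cost. When the learned behavior is already established (`culture_present=True`), everyone gains its benefit and the variant is favored because it learns more cheaply. When no one practises the behavior, there is nothing to learn and no benefit, so the variant is selectively neutral.

```python
def canalized_frequency(generations, p0, benefit, learning_cost,
                        cost_reduction, culture_present):
    """Frequency of a faster-learning variant over time (haploid selection).

    All arguments are hypothetical illustration values, not estimates.
    """
    # The behavior pays off, and costs effort to learn, only if it is practised.
    payoff = benefit if culture_present else 0.0
    cost = learning_cost if culture_present else 0.0
    w_plastic = 1.0 + payoff - cost
    w_canalized = 1.0 + payoff - cost * (1.0 - cost_reduction)

    p = p0
    history = [p]
    for _ in range(generations):
        mean_fitness = p * w_canalized + (1.0 - p) * w_plastic
        p = p * w_canalized / mean_fitness  # standard selection recursion
        history.append(p)
    return history


with_culture = canalized_frequency(100, 0.001, 0.5, 0.3, 0.5, True)
without_culture = canalized_frequency(100, 0.001, 0.5, 0.3, 0.5, False)
```

Under these assumed values the variant rises from one in a thousand to near fixation within a hundred generations when the culture is present, and stays at its starting frequency when it is absent. The sketch leaves out the frequency of mutation, drift, and the loss of flexibility that canalization entails, so it shows only the direction of selection, not realistic rates.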

Since an important factor in the Baldwinian evolution of human behavior is the social character of our species, the aforesaid environmental challenge can be a part of the socio-cultural products of hominins. This specific socio-cultural environment, often described as a "cultural niche" (Godfrey-Smith, 2003), has been proposed as a crucial element in the evolution of natural language (Bickerton, 2010; Deacon, 2010). In some regards, the evolution of language can represent a niche construction (Deacon, 1997, 2003), that is, a kind of niche extension or elaboration. The importance of this "cultural niche" seems evident when taking into account that at least during the last 2 million years hominins have become more and more socially complex animals in comparison to other primates (Dunbar, 2014). Living in a complex social group definitely poses new challenges which influence both natural selection and niche construction. In the process of niche construction hominin brains and hominin culture can be considered as a specific environment in which language evolved (Deacon, 2003). From this perspective the Baldwinian process is a part of gene-culture coevolution (Lumsden and Wilson, 1982; Richerson et al., 2010; Gintis, 2011). Since speech (the "default mode" of language) is similar to music in many respects (both are transmitted in the acoustic domain, both are syntactical systems composed of discrete elements, etc.), and language is assumed to be a crucial factor in the cultural evolution of H. sapiens, it seems reasonable to assume that the evolution of human musicality is somehow related to the evolution of language. However, language is a very elaborate signal related to the exchange of conceptual meaning. The presence of non-symbolic and non-conceptual culture in many other species (Cantor and Whitehead, 2013; van de Waal et al., 2013; Fehér et al., 2016) indicates that the beginning of cultural niche construction can be based on the exchange of preconceptual meaning. Therefore, music, as an example of a communication system operating on preconceptual meaning, is a good candidate for being part of a communicative tool more ancient than language, and so the proposed Baldwinian scenarios of language origin must differ from the possible Baldwinian processes that led to the emergence of music.

## The Baldwinian Evolution of Music

Human musicality seems to be a very good example of the potential effects of Baldwinian evolution (Podlipniak, 2015). Huron has suggested that apart from certain reflexes, the human auditory experience is mainly influenced by learning (Huron, 2006). The predominance of learning in shaping human auditory cognition implies, according to Huron, that the auditory environment of hominins must have been very semiotically unstable. Following Huron's reasoning, such semiotic instability led to the great variety of music found around the world. However, music as a product of human musicality is characterized not only by culture-specific features but also by universals (Nettl, 2000; Bispham, 2009; Brown and Jordania, 2011; Savage et al., 2015), which suggests that apart from the cultural (environmental) influence, music-specific genetic constraints also shape the musical mind of every human. This means that on the one hand, musicality develops spontaneously and effortlessly but on the other hand, the learning of more sophisticated musical skills, as in the case of professional musicians, is time-consuming and necessitates a lot of effort. The transmission of musical information has its roots in the human ability of vocal learning, which is exceptional among primates (Janik and Slater, 1997; Fitch and Jarvis, 2013). However, human vocal learning is canalized into an imitation of selected acoustic characteristics rather than the literal copying of every heard sound (Jackendoff and Lerdahl, 2006). People are very skillful at imitating the distinctive features of phonemes, the temporal order of sound sequences, and the fundamental frequency of harmonic sounds but not very talented when they try to simulate the barking of a dog or environmental sounds such as the noise of a refrigerator, which seems a very simple task for many parrots. 
This canalization suggests that apart from the aforementioned environmental instability, certain circumstances related to hominin vocal expressions must have remained stable over enough generations for natural selection to have promoted an instinct to learn only selected sound features (Briscoe, 2000; Gibson and Tallerman, 2011). As a result, speech and music, similar to many songbirds' songs, are examples of so-called ritual culture (Merker, 2009). The most important characteristic of ritual culture is its transmission by means of imitative social learning (Merker, 2005, 2012). In contrast to non-imitative social learning, learning by imitation consists of copying the behavior of other individuals (Jablonka and Lamb, 2005). Therefore, what is important in the transmission of ritual is not the result of a particular action but the action itself (Merker, 2009). In the case of music, the transmitted units are pitch classes and rhythm measures (Merker, 2002, 2003). After all, a melody is recognized independent of whether it is played slower or faster, on the flute or piano, or when sung. What is important for the recognition of a melodic pattern is its pitch and rhythm structure, not timbre or dynamics. In this respect music seems to be an even more striking example of ritual culture than speech in which, apart from poetry, what is crucial is the transmission of the semantic content of an utterance and not its literal form. However, the thing that makes melody easier to remember is musical syntax, the evolution of which is most easily explained by the Baldwin effect.
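The claim that a melody survives changes of tempo and instrument can be made concrete with a minimal sketch. It assumes an encoding that is not part of the article: each note is a pair of a MIDI note number and a duration in seconds, pitch classes follow twelve-tone octave equivalence, and the beat length of each performance is known. Timbre and dynamics are simply not represented, so they cannot affect the comparison.

```python
from typing import List, Tuple

Note = Tuple[int, float]  # (MIDI note number, duration in seconds); assumed encoding


def structural_signature(notes: List[Note], beat_seconds: float) -> List[Tuple[int, float]]:
    """Reduce a performance to (pitch class, duration in beats) pairs."""
    return [
        (midi_note % 12, round(duration / beat_seconds, 3))
        for midi_note, duration in notes
    ]


# Two hypothetical performances of the same three-note tune (C, D, E).
slow_low = [(60, 1.0), (62, 1.0), (64, 2.0)]    # 60 beats per minute
fast_high = [(72, 0.5), (74, 0.5), (76, 1.0)]   # one octave higher, 120 beats per minute

same_tune = structural_signature(slow_low, 1.0) == structural_signature(fast_high, 0.5)
```

Here `same_tune` is true because both performances reduce to the same sequence of pitch classes and beat-relative durations, while a change in any pitch class or rhythmic proportion would break the match.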

The Baldwin effect may promote a particular trait due to different adaptive functions. For example, Morgan proposed that organic evolution (the term which Morgan used to describe the process known today as the Baldwin effect) can explain the origin of bird songs which evolved as a result of sexual selection (Morgan, 1891, 1920). Since bird songs are similar to human music in many respects, it is tempting to explain the origin of musical syntax by the means of the Baldwin effect in which the adaptive function of music is to attract sexual partners. If this is true, musicality could have been used by hominins as a mating handicap (a mark of the quality of a mate, Zahavi, 1975; Zahavi and Zahavi, 1997; Miller G. F., 2000) since the production and recognition of musical syntax is costly in terms of energy (necessary to process the perceived sounds and to control the vocal production of songs) and time spent on singing (which for example can be used for foraging instead). In such a scenario the syntactical complexity of a hominin song should attract females more than a song that lacks such a complexity due to the costliness of the song's complex structure being an indicator of fitness (Miller G., 2000). If these female preferences had been stable enough throughout many generations, the Baldwinian mechanism should have transformed the learning of culturally invented rules of musical structure into an instinct to learn and organize musical sounds in a syntactic way. This emerged instinct to learn the distribution of music-specific discrete elements based on intuitive recognition of their probability of occurrence would have left space for idiosyncratic song modifications similar to those observed in songbirds' behavior. Such a leeway in creating new songs allows the sustaining of the process of sexual selection based on the songs' complexity. 
However, a study of female preferences toward musical complexity showed that women do not have a tendency to prefer more complex music during and around ovulation (Charlton et al., 2012), which does not support the Baldwinian scenario of music origin based on sexual selection. Similarly, research shows that musical aptitude and achievements are not a predictor of mating success (Mosing et al., 2014). Of course these studies are not conclusive and more studies are necessary to test a possible role of sexual selection in music evolution. Nevertheless, so far the Baldwinian scenario in the sexual selection of human musicality needs more empirical support to be convincing.

Another possible scenario of the Baldwinian origin of music is related to the idea that music can serve as a tool of social consolidation. This scenario started the moment a new social challenge first appeared. The increasing size of the hominin population caused an increase in inter-individual and intergroup competition for food and other resources (Dunbar, 2014). One way to cope with this problem was to form alliances between individuals belonging to a group. This strategy has been observed in other primates, including our closest relatives—the chimpanzee (Mitani, 2009; Gilby et al., 2013), which suggests that hominins could use a similar strategy. Dunbar has proposed that as group size increased, grooming as the main tool to sustain social alliances became insufficient. Instead, hominin vocalizations started to serve as a tool of social consolidation (Dunbar, 1996). While this idea seems to be unconvincing as far as the origin of language is concerned (Dunbar and Lehmann, 2013; Grueter et al., 2013), its validity as an explanation of the origin of music still remains an open question. An increasing number of studies suggest that communal singing can facilitate social bonds (Dunbar et al., 2012; Tarr et al., 2014; Pearce et al., 2015, 2016, 2017), which supports Dunbar's idea. However, the precise mechanism of how music could act in this way remains a puzzle. Although the results of the aforementioned studies need not necessarily be the effects of the adaptive value of social bonding, being for example a byproduct of sexually selected behavior, certain characteristics of musical syntax seem to bespeak the hypothesis of music as a tool of social consolidation. A big problem which afflicts individuals living in groups is the situation in which some individuals use resources obtained by other individuals—the so-called "free-riding problem." Communal song rituals demand that all participants know the musical structure of the song. 
The learning of a particular song's structure is time-consuming and necessitates strenuous imitation of the vocal behavior of others especially in the case when hominins did not possess an instinct to learn musical syntax. Therefore, by devoting equal effort in order to learn a ritualized song, communal singing can serve as a good test of being prone to act together with others. After all, poor singers can be easily recognized which can lead to ostracism. In other words, the consolidation effect observed after communal singing can be a product of the unconscious assessment of other individuals in terms of their proclivity to being free-riders. From this perspective, a lack of synchronization hinders consolidation which can be a result of detecting potential free-riders and can lead to looking for new allies.

A proximal explanation of the mechanism responsible for the consolidating power of music can also be related to observed characteristics of music processing by the human nervous system. It is possible that music consolidates individuals by means of temporal and spectral synchronization between the brain states of co-performers (Bharucha et al., 2011). If this is true, our predecessors had to simultaneously imitate their vocalizations in order to sustain social trust (Podlipniak, 2016). This collective imitation became the beginning of a consolidating vocal ritual. Without any predisposition which canalized vocal learning so that hominins would have been sensitive to certain acoustic features, the process of the learning of vocal rituals would have been very strenuous and time-consuming. During this time, the second stage of Baldwinian evolution began in which natural selection preferred individuals who were characterized by the most flexible learning. In order to learn new melodies and sing them together, appropriate predictions of what (which particular pitch class) and when (the position of a particular rhythm measure in relation to musical pulse) would happen in the near future were necessary. Syntax is exactly what makes the successful predictions of sound events during singing easier and, as a consequence, facilitates collective singing. At this stage, the learning of simple syntactic rules would have been accessible to hominins in a similar way to people that learn artificial syntaxes today. Because the costs of ritual learning were high, an individual who was accidentally endowed with proclivities to predict the melody better than others gained an advantage over the rest of a group. In the long run, the progeny of this individual came to dominate the whole population.

In a similar vein the Baldwin effect could have contributed to the origin of music if its adaptive function advertised the defending skills of a group (Hagen and Bryant, 2003; Hagen and Hammerstein, 2009; Jordania, 2014). However, in case an acoustic aposematism (instrumental music, singing) had been directed against predators (Jordania, 2014) a possible role of the Baldwinian mechanism in the origins of human musicality would have been restricted solely to the canalization of the elements of musical display which are recognizable by predators. These elements are a part of musicality in a broad sense such as pitch contour, changes in tempo, and dynamics rather than the syntactic relations specific to the discrete structure of human music. In the Baldwinian scenario, the initially invented complexity of musical syntax which became a part of the hominin cultural niche was most probably accessible (comprehensible) only to the hominin species. Therefore, predator reactions did not depend on the subtleties of musical structure and had not been a selective factor which could have influenced the process of canalization of musical-syntactical abilities which actually define musicality in a narrow sense. In contrast, if music was a coalition signaling display directed toward conspecifics (Hagen and Bryant, 2003), the Baldwinian scenario could have been very similar to that presented above in the case of consolidation as the adaptive function of music. The only difference is that this time the selective pressure which would have been responsible for the canalization of human musicality had been induced by the reactions of enemies. 
However, while rhythm syntax seems to contribute to the signaling/deteriorating function of music (despite the fact that people from different cultures can recognize different units of musical pulse in the same piece of music (London, 2012), a well synchronized rhythm is perceivable independent of whether the rules of rhythm syntax are familiar to us or not), the origin of pitch syntax is more problematic as a result of its possible adaptive signaling/deteriorating function.

First of all, even today when people are most probably endowed with an instinct to learn pitch system and pitch syntax, well spectrally synchronized music which is based on an unfamiliar pitch system can be perceived as being out-of-tune (Ellis, 1885), which can be a sign of a poor performance rather than a signal of a performers' coalition or consolidation. Additionally, the recognition of pitch syntax by contemporary humans depends on tacit knowledge about the statistical distribution of pitch classes in a particular music (Tillmann et al., 2000; Tillmann, 2005; Huron, 2006). Foreigners who listen to unfamiliar music usually experience different tonal qualia than people familiar with this music (Castellano et al., 1984; Kessler et al., 1984; Stevens, 2004, 2012; Curtis and Bharucha, 2009). Although this situation looks similar to the aforesaid difference in musical pulse perception, it differs in one important aspect. Pitch syntax is based on pitch hierarchy in which the most prominent place is pitch center, the experience of which is accompanied by the emotional qualia of completeness, resolution etc. There is nothing resembling pitch center in rhythm hierarchy. The misrecognition of actual tonal relations in reference to pitch center by foreigners can lead to divergence between the observed emotional expression of performers (also by the means of expressive dynamics) and tonal qualia felt by those foreigners. Such a divergence can also be a signal of poor performance. Importantly, without the canalized strategy to learn pitch syntax the differences in musical dialects between hominin groups would have been even greater than between modern geographically distant musical cultures causing the aforesaid divergence to be even greater. 
Therefore, while the Baldwinian mechanism can explain the origin of musical rhythm in the scenario in which music is a coalition signaling system, it is difficult to imagine a similar role for the Baldwin effect in the origin of pitch syntax as a result of its coalition signaling function.

It is worth mentioning however, that the possible different adaptive functions of music are not mutually exclusive and the Baldwin effect could have played an important role at different stages of the gradual process of the evolution of human musicality. Nevertheless, it seems that Baldwinian evolution was necessary at least in the process that led to the emergence of the complex musical syntax as a part of hominins' singing behavior.

## THE EVOLUTION OF NEW CIRCUITRY

The appearance of ritualized singing behavior among hominins required the development of new abilities and recruitment of existing skills. An important ability which had to be a necessary condition of the development of culturally variable vocal communication is the aforesaid vocal learning (Janik and Slater, 1997). This ability had to be present at the first stage of the Baldwinian evolution of human musicality. The similarity of vocal pathways in vocal learning birds to cortical–basal ganglia–thalamic–cortical loops in humans suggests the role of the latter in the processing of speech (Jarvis, 2007). In fact, there is an increasing number of studies which emphasize the role of the basal ganglia in the processing of language (Booth et al., 2005), especially in the learning of language during childhood (Krishnan et al., 2016). The fact that language impairment can be the result of neurodevelopmental deficits of the corticostriatal loops (Krishnan et al., 2016) shows that corticostriatal connectivity might have been an important element of evolutionary change leading to the evolution of vocal communication among our predecessors. It is not surprising that the basal ganglia contribute to the processing of sound sequences specific to vocal communication. It is known that the basal ganglia are important in reinforcement learning (Bar-Gad et al., 2003), which is necessary to acquire the majority of culturally transmitted information. This ability is strictly connected to predictive functions which are related to internal timing (Dreher and Grafman, 2002). It is reasonable to assume that in the beginning hominin vocal communication was composed of simple vocal expressions which must have been vocally learned.

In the process of vocal learning the recognition and prediction of distinctive acoustic features is necessary and so the specific connections between the basal ganglia and auditory cortices had to be favored by natural selection during the second stage of the Baldwinian evolution of human musicality. Because contemporary humans are characterized by the ability to sustain and reproduce F<sub>0</sub>, which is crucial for singing (Bannan, 2012) but not for speaking, the sequencing of sounds based on their F<sub>0</sub> characteristics must have become an important trait of hominin sound rituals. Their consolidating function did not necessitate any referential conceptual meaning. Instead, simple emotional cues were enough to establish close social relations. The emotional reinforcement of a well-predicted sound, i.e., a sound which is perceived as possessing a particular pitch and which happens at an exactly predicted point of time, started to act as a cue for social acceptance. The positive emotional reaction in response to music is in fact an evolutionarily old cue of social acceptance. It has been observed that this emotional reward occurring in response to music is related to the corticostriatal interactions involving auditory cortices and the nucleus accumbens (Salimpoor et al., 2013). Also the perception of beat in music is based on corticostriatal interactions (Grahn and Rowe, 2009, 2013). It is suspected that the role of the cortico-basal ganglia circuits in speech evolution is related to the positive selection of the FOXP2 gene variant (Enard, 2011). However, because the mutation of FOXP2 also impairs rhythm processing in music leaving intact pitch processing (Alcock et al., 2000; Tan et al., 2014) the evolution of the abilities to recognize pitch syntax must have been related to other genetic factors. 
Independent of what particular genetic factors influence pitch and rhythm processing in music (Tan et al., 2014), human perceptive preferences to recognize pitch classes and rhythm measures as parts of syntactically organized sequences suggest that at the last stage of the Baldwinian scenarios natural selection started to prefer individuals endowed with these canalized perceptive proclivities.

## CONCLUSION

The proposed possible Baldwinian scenarios of the evolution of human musicality solve the problem of the specificity of musical syntax which is functionally and structurally different from language syntax. This specificity suggests that all theories which explain the syntactic characteristic of music as a byproduct seem unconvincing. After all, music seems to be the only spontaneously emerging syntactic system apart from speech, dance (Opacic et al., 2009), and some drum (Winter, 2014) and whistled languages (Carreiras et al., 2005; Güntürkün et al., 2015; Meyer and Busnel, 2015). Although the details of these proposed scenarios are speculative, future research can elucidate which particular elements of these scenarios are more probable. The most promising would be studies which concentrate on the comparison between the functions of cortico-striatal loops (Gorzelańczyk, 2011) in speech and music processing. In order to investigate which particular loop is mostly involved in the processing of musical syntax, neuroimaging studies could be conducted in which the activity of limbic and dorsolateral prefrontal loops can be compared during performing tasks related to the recognition of musical and language syntaxes. Although the complex interactions between genetic, epigenetic, and cultural information which occur in evolution have not so far been explained in detail (Jablonka and Lamb, 2005), it seems important to consider them in future models of the evolution of human musicality. The rapidly advancing development of genomics and transcriptomics allows us to expect that the details of these complex interactions will be better understood in the near future.

Another possible study which could be conducted to test the proposed Baldwinian origin of musicality is to compare the results of singing syntactically simple with syntactically complex (more demanding in terms of explicit learning) tonal melodies. If singing syntactically complex melodies leads to a greater consolidation of singers or if the singing group is assessed by others as being more consolidated than singers of the syntactically simple melodies then it would suggest that the tendency which was proposed as the main source of the Baldwinian evolution of musicality is still present in the human population. Also the comparison of singing tonal melodies by a group of people with other collective sound expressions such as simultaneously reading prose, reciting poetry, drumming, and singing atonal melodies in free rhythm (without musical syntax) would be informative as far as the question of what particular features of human sound expressions are responsible for the observed effects. If the proposed consolidating function of human musicality is tenable, then the consolidating effect of singing tonal melodies and drumming should be greater than other collective behaviors. Nevertheless, in order to better understand the origin of music the broad holistic view of human musicality is necessary. This view should be based on integrated knowledge taken from such disciplines as genetics, evolutionary biology, paleoanthropology, neuroscience, psychology, archeology, ethnomusicology and cognitive musicology.

### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Podlipniak. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Joint Prosodic Origin of Language and Music

#### Steven Brown\*

*Department of Psychology, Neuroscience & Behaviour, McMaster University, Hamilton, ON, Canada*

Vocal theories of the origin of language rarely make a case for the precursor functions that underlay the evolution of speech. The vocal expression of emotion is unquestionably the best candidate for such a precursor, although most evolutionary models of both language and speech ignore emotion and prosody altogether. I present here a model for a joint prosodic precursor of language and music in which ritualized group-level vocalizations served as the ancestral state. This precursor combined not only affective and intonational aspects of prosody, but also holistic and combinatorial mechanisms of phrase generation. From this common stage, there was a bifurcation to form language and music as separate, though homologous, specializations. This separation of language and music was accompanied by their (re)unification in songs with words.

#### Keywords: language, music, speech, song, evolution, prosody, intonation, emotion

#### Edited by:

*Aleksey Nikolsky, Independent Researcher, United States*

#### Reviewed by:

*Reyna L. Gordon, Vanderbilt University, United States; Elizabeth Hellmuth Margulis, University of Arkansas, United States*

> \*Correspondence: *Steven Brown stebro@mcmaster.ca*

#### Specialty section:

*This article was submitted to Cognition, a section of the journal Frontiers in Psychology*

Received: *19 June 2017* Accepted: *12 October 2017* Published: *30 October 2017*

#### Citation:

*Brown S (2017) A Joint Prosodic Origin of Language and Music. Front. Psychol. 8:1894. doi: 10.3389/fpsyg.2017.01894*

Theories of the origins of language generally fall into the two broad categories of vocal and gestural models (Corballis, 2002; MacNeilage and Davis, 2005; Armstrong and Wilcox, 2007; Arbib, 2012; McGinn, 2015). Given that humans have evolved species-specific capacities for both vocal imitation and gestural imitation (Donald, 1991), a central question is whether language evolved initially as a system of vocalization or one of gesture, since imitative mechanisms are critical to evolutionary accounts of language acquisition. Gestural theories of language have grown in popularity in recent years due to their association with mirror-neuron-based models of action observation (Arbib, 2012). However, vocal theories have a far deeper grounding in historical models of language, going back to the ancient Greeks. During the Enlightenment, not only was it commonplace to talk about the evolutionary connection between language and music, but both functions were seen as being clearly rooted in the vocal expression of emotion (Condillac, 1746; Rousseau, 1781; Thomas, 1995), a trend that continued into Darwin's day (Spencer, 1857, 1890; Darwin, 1871, 1872) and through to the early twentieth century (Wallaschek, 1891; Newman, 1905; Nadel, 1930; Sachs, 1943).

While contemporary vocal accounts of language origin do not deny the linkage between speech and emotion, they do not consider it to be central to their models, focusing instead on the articulatory innovations of speech—such as complex phonemic repertoires, the nature of syllable structure, vocal learning, descent of the human larynx, among others—or the origins of symbolization per se, separate from emotional communication. Some models talk about the origins of speech in "singing" (Darwin, 1871; Jespersen, 1922), but there are problems associated with this invocation of singing. Singing as a human behavior implies something musical, but the musicality of the posited ancestral singing mechanism is not specified. Singing simply becomes a counter-state to speaking (i.e., language-based vocalizing), rather than being something truly musical, as predicated on the tonal principles of scale structure. When Jespersen (1922) claimed that our ancestors "sang out their feelings long before they were able to speak their thoughts" (p. 436), his notion of singing included such diverse vocalizations as the singing of birds, the roaring of mammals, and the crying of babies. Likewise, Fitch (2010) referred to music as being an example of "bare phonology," viewing music as basically a counter-state to propositional speech. The aim of this essay is to propose a joint prosodic model of the origins of language and music, but to avoid the pitfalls of talking about a singing or phonological mechanism that has no musical specifications. As with my earlier writings on the topic (Brown, 2000a,b, 2007), my focus here will be on phylogenetic issues of cognitive structure, rather than on Darwinian issues of adaptiveness and selection mechanisms (which I have discussed in detail in Brown, 2000b).

While vocal and gestural models have generally been placed in opposition to one another, it is far more reasonable instead to see vocalization and gesture as complementary communicative specializations (McNeill, 2005; Arboitiz and Garcia, 2009; Arbib, 2012; Arboitiz, 2012; Garcia et al., 2014), as suggested in **Figure 1**. Gesture seems particularly well-suited to iconically convey information about the spatial properties of objects and actions through pantomimic gestures (Armstrong and Wilcox, 2007), which the vocal system cannot easily achieve through iconic means. By contrast, vocal prosody seems better suited to convey information about the emotional meaning of a perceived object for the communicator, in other words its consequentiality. To my mind, the co-speech gestures of modern speech are essentially pantomimes (Beattie, 2016), and might therefore comprise "fossils" of an early gestural stage of language evolution that was pantomimic. Another potential fossil consists of what I will call "acoustic pantomimes," namely iconic sounds, such as onomatopoeias. Such pantomimes are able to represent the sound-generating properties of objects and actions—as in the "ruff" of a dog barking—as well as non-vocal object-properties like size, height, velocity, and temperature that are correlated with the acoustics of objects (Nygaard et al., 2009; Dingemanse et al., 2015, 2016; Perlman et al., 2015; Svantesson, 2017). **Figure 1** suggests that pantomimes in both the vocal and gestural domains served as parallel precursor stages on the road to symbolization for each route of communication.

Beyond the intermediate iconic stage of pantomime, the bulk of the symbolic function of language resides with acoustic symbols, rather than gestural symbols, with the prominent exception of sign languages among the deaf. While I am quite sympathetic to gestural models of language origin, I will put them aside from this point onward in order to examine the basic question raised above, namely whether language and speech arose as an offshoot of the vocal expression of emotion. Most evolutionary speech models ignore prosody altogether and instead focus on the anatomical changes to the human articulatory system that permitted the emergence of syllable structure, including descent of the larynx. An interesting example is MacNeilage's (1998) "frame/content" model, which proposes that mandibular oscillations in great apes (e.g., lip smacking) provided a scaffold upon which syllable structure in speech may have arisen (see also MacNeilage and Davis, 2005; Ghazanfar et al., 2012). What is central to this model is that such oscillations are voiceless in non-human primates, and that the key innovation for speech would be the addition of phonation to such oscillations so as to create alternations between vowels (open vocal tract) and consonants (closed or obstructed vocal tract).

A central tenet of syllabic accounts of language evolution is the notion of "duality of patterning" (Hockett, 1960; Ladd, 2012), which argues that the acoustic symbols of speech (i.e., words) are built up from meaningless constituents. Words are comprised of fundamental units called phonemes, but none of the phonemes themselves have intrinsic meanings (although sound-symbolic accounts of language origin argue that phonemes have non-random occurrences across word classes and hence may have some minor symbolic content; Monaghan et al., 2016). Through the kinds of mandibular-oscillatory mechanisms that MacNeilage's (1998) speech model highlights, phonemes get combined to form syllables, most universally as alternations between consonants and vowels. This process of syllable formation is not merely oscillatory but is combinatorial as well. From a small and fixed inventory of consonants and vowels (generally a few dozen in any given language), these phonemes become combined to form syllables. Such syllables may constitute words in and of themselves ("bee"), or they may be combined with other syllables to form polysyllabic words ("being," "Beatrice"). Finally, through a different type of combinatorial mechanism, words can be combined with one another to form phrases and sentences through compositional syntactic mechanisms, as described below in the section "Syntax evolution and the 'prosodic scaffold'."

## THE PROSODIC SCAFFOLD

I would like to reflect on what is missing in the standard syllabic account of speech and language just presented. Much of it comes down to whether one thinks of language evolution as serving a purely cognitive function for an individual (Berwick, 2011) or instead a communicative function for a group of individuals (Tomasello, 2003; Robinson, 2005; Scott-Phillips, 2017). In a later section of this article about syntax, I will describe this as a contrast between a "monologic" view (focused on internal thought) and a "dialogic" view (focused on social interaction) of language evolution. If one thinks about language and speech as a dialogic system of social communication, as many theorists suggest, then vocal prosody is critically missing from the syllabic account.

Prosody is characterized by a number of expressive melodic and rhythmic features of an utterance that convey information about emotion, intention, attentional focus, and communicative stance (Scherer, 2003). It is quite different from what has been described for syllables. It is not combinatorial, but is instead holistic, conveying emotional meanings according to rules of expression that govern emotional modulations of vocal pitch, loudness, timing/duration, and timbral features, as based on the valence and intensity of the expressed message. An influential cross-species account of this is found in Morton's (1977) set of "motivation-structure rules," themselves based on thinking going back to Darwin's (1872) treatise on The Expression of Emotions in Man and Animals. For example, aggression is conveyed with harsh, low-frequency sounds, whereas fear and submission are conveyed with more tone-like, high-frequency sounds. According to a prosodic account, it is not sufficient to think of "bee" as an arbitrary acoustic symbol for an insect that is generated through the combination of meaningless phonemes. In actual interpersonal communication, "bee" will be vocalized in such a manner as to convey the consequentiality of that insect for the speaker and his/her audience, as governed by prosodic rules of expression that communicate the emotional valence and intensity of that insect for the communicators. In other words, the vocal expression "Bee!!" during interpersonal communication conveys as much about the speaker's emotions and intentions as it does about the object itself. The holistic nature of prosody operates such that the declamation "Bee!" constitutes a complete utterance; it is essentially a one-word sentence.

It is important to consider that prosody is not some add-on to the combinatorial phonemic mechanism for generating syllable strings and sentences, but instead the reverse: it is the foundation of vocal communication, not least speech. Phonemic mechanisms must be superimposed upon it; it is not the case that a monotone string of phonemes becomes "melodized" by prosody after the fact. Prosody is intrinsic to the generative mechanism of speech and is in fact the primary consideration in communication, namely the speaker's emotional intent and message. While there is ample evidence for a "prosody first" model of speech planning when applied to the linguistic-prosodic levels of phonological and phonetic encoding (Keating and Shattuck-Hufnagel, 2002; Krivokapic, 2007, 2012), there is still minimal experimental work regarding the generative aspect of the expression of affect in speech. Instead, most language models place "conceptual structure" at the highest level of the communicative hierarchy (Levelt, 1999), implicating the domain of semantics and thus words. What is missing here is an overarching "emotional semantics" of communication in which emotion is a primary component in the production of speech, preceding word selection. Other theorists have made similar claims with reference to "ostensive communication" or the communication of intent (Scott-Phillips, 2017). Consider the sentence "It's a bee." That same string of words would not only be uttered in a dramatically different manner between seeing a photograph of a bee in a magazine as compared to seeing an actual bee on one's dinner plate, but the behavioral consequences would, in theory, be quite different. So, while linguists are able to support a "prosody first" model when it comes to linguistic prosody vis-à-vis syntax, I would argue that we need to expand this to have the planning of affective prosody occur at an even earlier stage in the process.

Based on this reasoning, I would like to propose a "prosodic scaffold" model in which overall communicative (intentional, emotional) meaning is the primary factor being conveyed in speech and in which the combinatorial and compositional mechanisms of speech's words and utterances act to "fill out" a prosodic scaffold. This scaffold is comprised of (1) affective prosody, which refers to the vocal expression of emotions, usually acting on the utterance at a global level; and (2) linguistic prosody, which refers to a set of both local and global mechanisms for conveying emphasis (stress, prominence, focus), modality (e.g., question vs. statement), among other features (Cruttenden, 1997; Ladd, 2008). Because I am going to apply concepts about linguistic prosody to music in this article, I will avoid confusion in nomenclature by referring to it as "intonational" prosody from this point on.

## Affective Prosody

There are affective mechanisms that modulate the overall pitch height, loudness, and tempo of a spoken utterance in order to convey the emotional valence and intensity of the communicator's meaning. For any given sentence, happiness is typically conveyed by high, loud, and fast prosody, while sadness is conveyed by the opposite profile (Banse and Scherer, 1996). The same is true for the expression of these two emotions in music (Juslin and Laukka, 2003). These types of affective prosodies often work in a global fashion, affecting the entire scaffold of the phrase. For example, happiness both moves the scaffold to a higher vocal register and compresses it in time, while sadness moves it to a lower register and expands it in time. In both cases, the holistic formula of a declarative sentence is preserved, but its acoustic properties are modified by emotional intent.
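The global character of this modulation can be illustrated with a toy numerical sketch. It is not a model taken from the cited studies: the contour representation (time in seconds, pitch in semitones relative to a reference), the function names, and the shift and scaling values are all hypothetical choices made for illustration.

```python
# Toy sketch: affective prosody as a global transform of a prosodic scaffold.
# All names and numeric values here are illustrative assumptions.

def apply_affect(contour, register_shift, time_scale):
    """Shift every pitch by register_shift semitones and scale every time point."""
    return [(t * time_scale, p + register_shift) for t, p in contour]

def is_descending(contour):
    """True if pitch never rises from one point to the next."""
    pitches = [p for _, p in contour]
    return all(a >= b for a, b in zip(pitches, pitches[1:]))

# A descending (declarative-like) contour: (time in s, pitch in semitones).
declarative = [(0.0, 4.0), (0.5, 3.0), (1.0, 1.0), (1.5, 0.0)]

# Happiness: higher register, compressed in time.
happy = apply_affect(declarative, register_shift=3.0, time_scale=0.8)
# Sadness: lower register, expanded in time.
sad = apply_affect(declarative, register_shift=-3.0, time_scale=1.25)
```

In this sketch both transformed contours remain descending, which mirrors the point that the holistic formula of the declarative is preserved while its register and duration change.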

## Intonational Prosody

A majority of spoken utterances are declarative and are characterized by stereotyped intonations having either arched or descending pitch contours (Halliday, 1967; Cruttenden, 1997). This is exemplified in the top row of **Figure 2** for a single sentence using musical notation, which is taken from Chow and Brown's (under review) analysis of the relative-pitch profiles of spoken sentences, as averaged across a group of 19 speakers. One can readily observe the basic pattern of declination (i.e., falling pitch) characteristic of declarative sentences. The holistic nature of this formula is shown by the fact that, when the sentence is lengthened through the addition of words at the end, the declination process is suspended until a later point in the sentence. What this indicates is that the holistic prosodic scaffold of a declarative sentence, with its descending contour, is "filled out" in the longer sentence by suspending the point of declination until the terminal word.

If we contend that the vocal expression of emotion was the precursor to speech, then the evolution of the phonemic combinatorial mechanism had to find a way to create words (strings of segmental units) and phrases in the context of communicating emotional meanings by filling out a prosodic scaffold. The alternative idea, namely that prosody is some type of post-hoc affective modulation of a previously-established linguistic message, seems untenable and should not serve as the basis for evolutionary models of language and speech. While there are clearly non-prosodic means of conveying emotion and intention in speech, such as through word selection and syntactic constructions, these do not circumvent the need to be externalized through a prosodic-scaffold mechanism during vocal communication.

Before moving on to present my evolutionary model of a prosodic precursor to speech and music, I will summarize the model briefly so as to facilitate the presentation (which is outlined in **Figure 6** below, as described in the section "Bifurcation to form language and music"). I will argue that there was not one but two shared stages that served as joint precursors in the evolution of language and music: (1) the first was a system of affective prosody, and (2) the second was a system of intonational prosody. In other words, affective and intonational prosodies evolved through a sequential process as two linked evolutionary stages. In addition, while the first stage was made up of innate calls, the second capitalized on the newly-evolved capacity for vocal learning in humans. Following this second joint stage, language and music branched off as reciprocal specializations, each one retaining certain key features of their joint precursor stages. The model attempts to account for modern-day similarities between music and language/speech by loading the precursor stages with as many shared features as is theoretically justified. I argued in Brown (2000a) that, given that language and music possess both shared and distinct features, it would be most parsimonious to propose that their shared features evolved first, and that their domain-specific features evolved later as part of a branching process (see also Mithen, 2005), making language and music homologous functions (Brown, 2001). This idea would stand in contrast to models contending that music evolved from speech (Spencer, 1857), that speech evolved from music (Darwin, 1871; Jespersen, 1922; Fitch, 2010), or that music and language's similarities arose independently by convergent evolution.

Before proceeding to describe the model, I want to point out that, given the absence of any clear definition of music, I am going to adopt a view of music (and singing) that leans heavily on the side of pitch and most especially on the use of scaled pitches, even if there is imprecision in the scale degrees and/or their execution by a voice or instrument (Nikolsky, 2015). As a result, I am going to distinguish music from both speech prosody and emotive vocalizations. In addition, while rhythm is a critical feature of music, there is no aspect of music's rhythmic system that is not potentially shared with either dance or metric forms of speech, like poetry. Hence, if the development of an evolutionary model of music requires that we identify domain specificity for music, then I see tonality as the principal distinguishing feature of music (see Savage et al., 2015 for ethnographic support for this). While I am familiar with myriad examples of musics that fail this definition (e.g., they are based on unpitched percussion sounds, they are pitched but are not based on fixed scales, they are more concerned with timbral changes than pitch changes, they contain emotive vocalizations, prosodic speech, and/or whispers), I cannot see the utility of developing an evolutionary account of music based on either non-tonal or metrical features. Instead, my model posits that a non-tonal prosodic precursor was the source for the tonal system that came to characterize much music as we know it.

The first evolutionary stage of the prosodic model of language origin is proposed to be a system of affective calling derived from the mechanisms of emotional vocalizing found in mammals more generally (Briefer, 2012). I have argued previously (Brown, 2007) that, not only was this particular affective system a joint precursor for language and music, but that it was a group communication system, particularly one that operated in ritualized contexts, such as territory maintenance, and that acquired its group force through emotional contagion (see also Hagen and Bryant, 2003). Using a wolf chorus as a model of such a precursor, I argued that this evolutionary stage was characterized by an imprecise overlapping of parts among the group members, showing little to no synchronization of parts. This performance arrangement is referred to musically as heterophony<sup>1</sup>, which is when "different parts are performing the same melody at the same time but with different versions" (Malm, 1996, p. 15). Starting from this common, asynchronous precursor, an evolutionary branching process would occur to create two different forms of coordination during communication such that (1) music would evolve to achieve a tight temporal integration of parts through the evolution of the capacities for both rhythmic entrainment and vocal imitation, and that (2) speech would evolve to achieve an alternation of parts, as occurs in standard dialogic exchange (**Figure 3**). This functional and structural bifurcation reflects the fact that music retains the precursor's primary function as a device for group-wide coordination of emotional expression, most likely for territorial purposes, while language evolves as a system for dyadic information exchange in which an alternation of parts, rather than simultaneity, becomes the most efficient means of carrying out this exchange. These distinctive communicative arrangements of music and speech come together in a performance style that is found quite widely across world cultures, namely call-and-response singing (**Figure 3**), where the call part is informational and is textually complex (typically performed by a soloist, as in speech) and the response part is integrated and textually simple (typically performed by a chorus, as in music). Call-and-response is an alternating (turn-taking) exchange, but one between an individual and a group, rather than two individuals.

<sup>1</sup> I am using the term heterophony in this article so as to be consistent with similar ideas about the evolution of musical texture that I described in Brown (2007) in an article entitled "contagious heterophony". However, the editor of this research topic, Aleksey Nikolsky, takes issue with my usage, and suggests that his own term "isophony" (Nikolsky, 2016, Appendix V) is a better description of the choral texture that I am alluding to. He defines isophony as a "brief call, continuously reproduced by multiple performers with irrational deviations in timing and pitch, where each participant retains idiosyncrasy of the rhythmic, timbral, and directional attributes of the pitch contour – altogether producing a 'jumbled' effect." This strikes me as an excellent description of the phenomenon that I am referencing here. Readers are encouraged to see Nikolsky's commentary on the current article for a more detailed description of isophony as applied to the evolutionary origins of musical/linguistic communication in humans.

The idea that speech evolved from a group-wide communication system has a distinct advantage over individualist accounts of language origin in that it can provide a solution to the "parity" problem (Arbib, 2012). The evolution of communication systems is constrained by the fact that meanings have to be mutually interpretable in order to be adopted by a community of users. Any person can devise a word to mean bee, but unless everyone in the community understands it, it is useless as anything more than an unintelligible device for self-expression. A group-communication system obviates this problem, since it is produced collectively. In addition, and following along the lines of the wolf example, making the group-communication system something ritualized helps in achieving meaning through the use of context specificity and the signaling of consequentiality. Communication will occur in situations that have shared emotional meanings and shared consequences for all members of the group, such as during territory defense. Having language be group-level from the start provides a solution to a number of evolutionary obstacles to achieving parity in communication.

## "MUSILANGUAGE" AS A JOINT PROSODIC PRECURSOR

This first precursor stage of group-affective vocalizations that I have just described would be a ritualized territorial chorus of emotional communication. It would be neither speech-like nor music-like in its acoustic features, but instead something similar, functionally and structurally, to a non-human form of group chorusing, like a wolf chorus or a pant hoot chorus in chimpanzees. While speech and music do indeed have shared mechanisms of emotional expression (affective prosody), the affective precursor just described is lacking in many additional features that are shared between speech and music and that should be reasonably found in a joint evolutionary precursor. Hence, my model requires the existence of a second joint precursor-stage before the bifurcation occurred to generate language and music as distinct and reciprocal specializations emanating from it. While the first stage focused on the shared features of affective prosody, this second stage should now contain the shared features of intonational prosody that are found in speech and music.

My characterization of this second precursor stage will comprise a revised and corrected account of what I called the "musilanguage" system in a previous publication (Brown, 2000a) and which was fleshed out in book form by Mithen (2005). Hence, it will comprise my Musilanguage 2.0 model. The core idea of the model is that those features of language and music that are shared evolved before their domain-specific features did due to the presence of a joint precursor—what I call the musilanguage system—that contained those shared features. While the initial joint precursor described above would be a system of affective prosody, this second stage would achieve the next important level of intonational prosody. In other words, it would embody those features of intonation that are shared between language and music, but without requiring lexicality, syntax, or musical scale structure (tonality), an idea also developed by Fitch (2010). If I had to summarize the properties of this stage, I would argue that it is a "grammelot," in other words a system of nonsense vocalizing or pure prosody, as was prominent as a tool for traveling theater companies in the days of the Commedia dell'Arte during the Renaissance (Jaffe-Berg, 2001; Rudlin, 2015). The only modification that I would propose is that the musilanguage precursor was a group-level grammelot, produced through chorusing. In what follows, I will outline a number of key properties of the proposed joint prosodic precursor, with an eye toward defining those features that can be thought of as shared between speech and music and hence that can be most reasonably attributed to a joint evolutionary stage. **Table 1** lists a dozen such features.

## Voluntary Control of Vocalization and Vocal Learning

I contend that the transition from the first affective stage to this second stage corresponds to the transition from the non-human-primate system of involuntary control of stereotyped calls to the appearance in humans of both voluntary control over the vocal apparatus and vocal production learning. Belyk and Brown (2017) proposed a co-evolutionary account for the joint appearance of these two capacities, although other theorists have suggested sequential models (Ackermann et al., 2014). Hence, the advent of the musilanguage stage would mark a transition from innate to learned vocalizing, accompanied by the complexification of communication sounds that learning makes possible. This similarity between speech and music as learned vocal-communication systems is an extremely important one to keep in mind. It places the emergence of vocal learning firmly upstream of the separation between language and music in human evolution.



## Breath Phrases

Phrase structure in both speech and music approximates the length of a breath phrase (Pickett, 1999). This may seem like a trivial similarity between speech and music as communication systems, but it also makes them natural partners when it comes to setting words to music (discussed below). There were significant changes in the voluntary control of respiration during human evolution (MacLarnon and Hewitt, 1999, 2004; Belyk and Brown, 2017), and it would seem that such changes impacted speech and music in comparable manners to influence the structural features of phrases in both domains. When people take breaths while either speaking or singing, they tend to do so at phrase boundaries, rather than in the middle of a phrase (Grosjean and Collins, 1979). In addition, the depth and duration of an inhalation correlate with the length of a produced sentence (Fuchs et al., 2013). Finally, extensive work on the analysis of pause duration as a function of the length and/or syntactic complexity of sentences points to a role of respiratory planning in speech production (Krivokapic, 2012). Hence, there is clear motor planning for speech at the level of respiration. Provine (2017) proposed that the nature of human breathing, and thus vocalization, may be a direct product of the transition to bipedal locomotion.

## Level Tones and Level Transitions

An important feature of human vocalization that is virtually never mentioned in evolutionary accounts of speech or music is the fact that humans can produce level tones when vocalizing. Much of primate vocalizing is based on pitch glides, as can be heard in the pant hoot of chimpanzees and the great call of gibbons. While such glides are still present in human emotional vocalizations, such as in cries, both speech and music are based on transitions between relatively discrete tones. These tones are generally longer and more stable in music than they are in speech, but level tones seem to be present in speech to a large extent (Mertens, 2004; Patel, 2008; Chow and Brown, submitted). The defining feature of much music is not only that the transitions are level but that they are scaled and recurrent as well. Hence, instead of having an imprecise sequence of tones, the tones become digitized to create a small set of pitches that are used recurrently across the notes of a melody, in both ascent and descent. When this does not occur, we get a melodic pattern that is speech-like (Chow and Brown, submitted), although such a pitch pattern sounds increasingly music-like the more it is tandemly repeated (Deutsch et al., 2011).
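The idea of tones becoming "digitized" into a small recurrent set can be sketched as a nearest-neighbor quantization. This is only an illustration under stated assumptions: the five-note scale, the input values, and the function name `snap_to_scale` are hypothetical, and nothing here reproduces the analysis methods of the cited studies.

```python
# Toy sketch: snapping an imprecise pitch sequence onto a small recurrent scale.
# The scale and the input sequence are arbitrary illustrative choices.

PENTATONIC = [0, 2, 4, 7, 9]  # scale degrees, in semitones above a reference pitch

def snap_to_scale(pitch, scale=PENTATONIC):
    """Return the scale degree nearest to the given pitch."""
    return min(scale, key=lambda degree: abs(degree - pitch))

# An imprecise, speech-like sequence of level tones (semitones above the reference).
speech_like = [0.3, 2.6, 4.4, 6.8, 4.1, 1.7, 0.2]

# The same sequence after quantization: pitches now recur in ascent and descent.
music_like = [snap_to_scale(p) for p in speech_like]
print(music_like)  # [0, 2, 4, 7, 4, 2, 0]
```

Seven distinct input values collapse onto four scale degrees, so the same pitches are reused on the way up and on the way down, which is the sense of recurrence intended in the paragraph above.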

## Levels-and-Contours (L&C)

Building on the last point, a related acoustic feature of the musilanguage system is that it would be based on an imprecise (non-recurrent) and coarse-grained mechanism of pitch signaling that involved a basic sense of both pitch levels (e.g., high vs. low) and pitch contours (e.g., rising vs. falling). Importantly, this system would be pitched but not tonal. In other words, it would not be based on the scaled pitches that are found in the majority of musical systems, and would thus not be, in my view, a true form of singing. This idea is a modification of an incorrect proposal that I made in the original publication based on a limited database at the time about the pitch properties of speech (Brown, 2000a), about which Fitch (2010) was quite justified in raising objections. Instead, I now see the precursor system as having an imprecise relative-pitch system based on optimizing the contrast between relatively high/rising and relatively low/falling pitches, what I will refer to as a levels-and-contours (L&C) system. In twentieth-century British theorizing about intonation, the term "tonicity" was used to characterize this type of pitch system (Halliday, 1967, 1970), where different types of pitch contours are used to signal intonational meaning. A key acoustic feature of this system that is shared between speech and music is that melodic movement tends to be based on pitch proximity, rather than large leaps (Huron, 2006; Patel, 2008; Chow and Brown, under review). I will propose below that, after the separation of language and music, speech retained the imprecise levels-and-contours system of the precursor, while music increased the precision of the pitch relationships by introducing tonality through scale structure, making the pitches recurrent in the formation of melodies and thereby making music into a combinatorial system for pitch. Hence, the coarse-grained pitch production mechanism of the levels-and-contours system of the musilanguage stage provides a reasonable joint precursor for the pitch properties of both speech and music.
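One way to make the coarse-grained nature of a levels-and-contours description concrete is to code a pitch sequence only by relative level and direction of movement. The sketch below is an assumption-laden illustration: the use of the sequence mean as the high/low boundary and the function name `levels_and_contours` are hypothetical choices, not part of the proposal itself.

```python
# Toy sketch: coding a pitch sequence by coarse levels and contours only.
# The mean-based high/low split is an illustrative assumption.

def levels_and_contours(pitches):
    """Return (levels, contours) for a sequence of pitch values."""
    mean = sum(pitches) / len(pitches)
    levels = ["high" if p > mean else "low" for p in pitches]
    contours = []
    for prev, cur in zip(pitches, pitches[1:]):
        if cur > prev:
            contours.append("rising")
        elif cur < prev:
            contours.append("falling")
        else:
            contours.append("level")
    return levels, contours

# Two utterances with different exact pitches (in semitones)...
utterance_a = [5.0, 6.5, 3.0, 1.0]
utterance_b = [4.2, 7.1, 2.5, 0.4]

# ...that receive the same coarse description.
coding_a = levels_and_contours(utterance_a)
coding_b = levels_and_contours(utterance_b)
```

Because the exact intervals are discarded, the two utterances are treated as equivalent in this coding, whereas a scale-based (tonal) description would distinguish them.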

## Phonemic Combinatoriality

One of my major contentions is that the evolution of phonemic combinatoriality is a feature that should be placed upstream of the divide between speech and music, comprising a key property of the joint musilanguage precursor (see **Figure 6** below). This conforms with Fitch's (2010) claim that proto-language was a system of "bare phonology." Tallerman (2013) takes issue with the concept of "phonology" being applied to anything other than meaningful words and thus true language, although I would point out that proto-language models do not present any kind of specification of the phonetic properties of their protowords (Bickerton, 1995; Jackendoff, 1999). Hence, there was most likely a proto-phonology in place before language evolved. I mentioned MacNeilage's (1998) frame/content theory above, which seems to be as good a model as any for the origin of syllable structure through phonemic combinatoriality. Most mandibular oscillations in non-human primates are voiceless, and so a critical feature of the MacNeilage model is that the open vocal-tract configuration of the oscillation should become phonated, making syllables into pitch-bearing units. As per the point raised in the previous two paragraphs, this should permit the formation of level tones as well as glides. As such, this would favor the use of open syllables at this stage, so as to maximize information due to pitch variation. Importantly, phonemic combinatoriality would provide one mechanism for creating phrase structure by the musilanguage system, such that the vocalic part of the syllable would serve as the locus of melodic and rhythmic variation. I could imagine that the musilanguage system of proto-phonology was composed of a repertoire of such syllabic units (see **Figure 4**). Given that this stage preceded the evolution of lexicality, these syllables were vocables, or nonsense syllables, in keeping with the musilanguage's status as a grammelot. As in many forms of birdsong, there could have been a large diversity of such units, even though each unit would be devoid of intrinsic meaning (Marler, 2000; Slater, 2000).

## Meaningful Melodies: Holistic Intonational Formulas

Beyond the localist mechanism underlying phonemic combinatoriality, there would be a more global and holistic system of pragmatic intonational melodies that had categorical meanings (Fernald, 1989; Papousek, 1996), just as they do in modern speech. The most basic contrast would be between (1) phrases with descending contours that end in low tones, as in typical declaratives, conveying a sense of certainty, stability, and/or finality, and (2) phrases that proceed and/or end in high tones, conveying a sense of uncertainty, continuity, suspense, or incredulity (Halliday, 1967, 1970). The latter are perhaps the first questions of human communication (Jordania, 2006), conveyed via intonation alone through a grammelot, much the way that filtered speech retains the intonational meanings of sentences (Fernald, 1989). It would be hard to estimate how many melodies would exist in the repertoire of this system, but these would be pragmatically-distinct melodies that operated more or less in a categorical manner. Hence, this could be a first step in achieving phrase-level meanings and prosodic scaffolds before lexicality was introduced (Fitch, 2010), in which case the system would be better characterized as one of pure prosody than pure phonology. The resultant phrases could be thought of as "holophrases." However, these would not be the symbolic holophrases discussed by people like Wray (1998), Mithen (2005), and Arbib (2012), but instead prosodic holophrases that conveyed affective and pragmatic meanings in a holistic manner, much the way that speech prosody often does. This relates to some models of speech and/or music evolution that posit a central role for mother/infant communication (Dissanayake, 2000; Falk, 2009).

#### Affective Prosody

By inheriting the innate expressive mechanisms from the first evolutionary stage (something that itself is phylogenetically derived from primate communication), the musilanguage system would have additional expressive modulation of phrases according to the valence and intensity of the communicated emotion, providing yet another influence on the melody and rhythm of the phrases. This occurs with regard to global and local changes in the pitch (register), loudness, and tempo of the phrases. I called these "expressive phrasing" mechanisms in the original publication (Brown, 2000a) and now see the initial affective precursor as being a specialized version of affective prosody occurring as a group territorial chorus.

#### Stress Groups

The system should show similarities to features of stress timing seen in speech and music, whereby syllabic units often occur in 2- or 3-unit groupings, with a sense of stress on the initial syllable (Brown et al., 2017). This conveys what linguists call "headedness" (Jackendoff, 2011), which is a hierarchical differentiation of the elements within a grouping, where emphasis is generally placed on the first element. These groupings can themselves be organized hierarchically and can potentially be embedded in one another in a recursive fashion (Jackendoff, 2011; Tallerman, 2015). The musilanguage system is thus proposed to have hierarchical phrase organization, an important feature shared between music and speech (Lerdahl and Jackendoff, 1983; Lerdahl, 2001).

#### Heterometric Rhythms

The rhythmic properties of this system would not be the isometric type of rhythm found in much music, but instead the "heterometric" type of rhythm that is characteristic of speech (Brown et al., 2017). Instead of having a single, fixed meter, the rhythm might involve changes in stress patterns, while still maintaining the primacy of 2- and 3-unit groupings and patterns.

#### Repetitive Form

To the extent that such a communication system would be both ritualized and performed in groups, it might have a strongly repetitive type of form (a so-called "ostinato" form), as in much music in indigenous cultures (Lomax, 1968) and beyond (Margulis, 2013, 2014). Hence, the same phrases would be uttered repeatedly. **Figure 4** presents a highly speculative account of what a musilinguistic phrase might look like in terms of (1) phonemic combinations to diversify the number of syllable types, (2) the predominant use of open syllables, (3) the overall melodic contour of an arching intonational formula (as one example of such a formula), and (4) the local grouping-structure of the rhythmic units, but with a non-metric rhythm overall. Such a phrase might be uttered repeatedly by a given individual during group chorusing. Compared to the vocalizations of the first evolutionary stage, this would be a learning-based system that permitted voluntary control of vocalizing, although still occurring in a ritualized manner at the group level.

## Polyphonic Texture

Finally, in order to think about the performance arrangement of the musilanguage system, **Figure 5** presents an overview of the major types of "textures" (i.e., multi-part performance arrangements) found in both human and animal chorusing. The figure is organized as a 2 × 2 scheme, where one dimension relates to pitch (whether the melodic lines are the same or not) and the other dimension relates to rhythm (whether the various parts are either synchronized in time or not). I argued in Brown (2007) that the initial evolutionary stage of group-level affective prosody was characterized by a "heterophonic" texture in which each individual of the group performed a similar melodic line but in which the parts were asynchronous in onset, as seen in a wolf chorus (see also **Figure 3** above). There are many examples of such chorusing in animals and humans (Filippi, 2016). In order to make the musilanguage stage more language-relevant, I would argue that, in addition to the presence of heterophony, this system would show the new texture of polyphony. Polyphony allows for two significant changes in the structural properties of performance compared to a heterophonic system: (1) there is a diversification of the vocal parts and hence the possibility of differentiation according to communicative roles, and (2) there is some degree of alternation of parts. The musilanguage stage would start to show some signs of alternation, which is a defining feature of conversation and a key feature of call-and-response musical forms. This is a first step toward having a differentiation of parts, both in terms of content and presentation, hence permitting a leader/follower dynamic. However, instead of having the seamless separation of parts that occurs in conversation, this stage would most likely have an imprecise type of exchange, in which the alternating parts overlapped with one another, as seen in a number of primate and avian duets, for example in gibbons and duetting birds (Dahlin and Benedict, 2013). One implication of the proposal that I am making here is that the capacity for vocal learning arose before the capacity for rhythmic entrainment and integration (contra the proposal of Patel, 2014). Hence, the musilanguage system, while voluntary and learned, would still have a relatively poor capacity for the synchronization of parts. I will return to this important point in the next section about the evolutionary changes that made music possible.

FIGURE 5 | Textures of chorusing. The figure presents an overview of the major types of textures (i.e., multi-part performance arrangements) found in chorusing, both animal and human. The figure is organized as a 2 × 2 scheme, where one dimension relates to pitch (whether the melodic lines are the same or not) and the other relates to rhythm (whether the various parts are synchronized in time or not). Each cell indicates a principal type of choral texture. The right-side cells are found in both animals and humans, while the left-side cells are principally found in humans. It is proposed that the initial affective precursor of language and music was heterophonic, while the musilanguage stage was potentially polyphonic as well. The aligned textures of unison and homophony required the emergence of the human capacity for metric entrainment in order to evolve.
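The 2 × 2 scheme of choral textures can be restated compactly as a lookup on its two dimensions. The following is a minimal sketch of that classification as I read it from the text and the caption of Figure 5; the function and parameter names are hypothetical, and real performances will of course fall between these idealized cells.

```python
# Toy sketch of the 2 x 2 texture scheme: pitch dimension x rhythm dimension.
# Function and parameter names are illustrative assumptions.

def choral_texture(same_melody: bool, synchronized: bool) -> str:
    """Name the texture for a combination of the two dimensions."""
    if synchronized:
        return "unison" if same_melody else "homophony"
    return "heterophony" if same_melody else "polyphony"

def requires_entrainment(texture: str) -> bool:
    """The aligned textures are the ones proposed to depend on metric entrainment."""
    return texture in ("unison", "homophony")

# The first (affective) precursor: similar melodic lines, asynchronous onsets.
affective_stage = choral_texture(same_melody=True, synchronized=False)

# The texture added at the musilanguage stage: diversified, unaligned parts.
musilanguage_addition = choral_texture(same_melody=False, synchronized=False)
```

Under this reading, neither of the two precursor textures requires entrainment, which is consistent with the proposal that vocal learning arose before the capacity for rhythmic synchronization.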

What would be the function of the musilanguage system as a group-level grammelot compared to the first stage of innate affective expression? In keeping with my attempt to optimize the shared prosodic features of language and music before their separation, I would say that the system could be involved in group communication but in more dyadic functions as well. For example, a simple call-and-response interaction might be a novel arrangement of this system, showing some basic capacity for the alternation of parts and thus the roots of the information exchange that occurs in dialogue. While the syllabic units would be meaningless, the intonational melodies might be able to be used referentially to convey emotional meanings about objects in the environment or the actions of others, hence communicating consequentiality in a non-lexical and prosodic fashion.

## BIFURCATION TO FORM LANGUAGE AND MUSIC

With this description in mind of two sequential precursor stages shared by language and music, we can now examine the bifurcation process to form full-fledged language and music as distinct, though homologous, functions, as well as their (re)unification in the form of songs with words, including call-and-response chorusing. **Figure 6** presents an overview of the model, starting with the innate group calling of affective prosody, followed by the musilanguage system of intonational prosody. The figure highlights the important proposal that phonemic combinatoriality is a shared feature of language and music, and this forms a critical part of what will be jointly carried over during the bifurcation process. I will first talk about language (lower part of the figure) and then move on to discuss music (upper part of the figure).

#### Language

In thinking about the birth of lexicality in acoustic symbols, I am going to propose that we consider two unconventional though long-established ideas, namely sound symbolism and lexical tone, as well as their union through a "frequency code" in which lexical tones could operate in a sound-symbolic manner (Ohala, 1984, 1994). I find it reasonable to consider the notion of sound symbolism as a potential origin of symbols, a timeworn idea that dates back to the ancient Greeks. Just as gestural theories of language origin are predicated on the idea that gestural pantomimes were the road to achieving gestural symbols (Armstrong and Wilcox, 2007; Arbib, 2012), so too acoustic pantomimes could have been the road to achieving acoustic symbols (**Figure 1**). While much research on sound symbolism focuses on phonemic effects related to vowels and consonants (e.g., front vowels connoting small size vs. back vowels connoting large size), a small amount of research relates to what Ohala (1984, 1994) referred to as a "frequency code," in which pitch could be used to iconically convey symbolic meanings. Such a pitch-based code serves as the foundation nonsymbolically for affective communication in many animal species and in infant-directed speech, but also has a limited potential to iconically encode spatial features of objects. For example, Nygaard et al. (2009) demonstrated that pitch was effective as a cue to perceive not only emotional valence in speech, but also size and temperature. Perlman et al. (2015) showed that participants could modulate pitch in non-linguistic vocalizations to convey information about vertical position, speed, and texture. Interestingly, sound symbolism has been found to apply to lexical tone as well (Ohala, 1984, 1994), with high tones being associated with words conveying small size, and low tones being associated with words conveying large size (so-called size symbolism). 
While there is no question that arbitrariness ultimately came to dominate the lexicon, it seems reasonable to hypothesize that language evolution got its start by capitalizing on the processing advantages that iconicity could offer (Kita, 2008; Perniss et al., 2010; Imai and Kita, 2011; Perlman and Cain, 2014).

A second hypothesis that I would like to present is that spoken language evolved as a system of lexical tones from its inception (cf. Jespersen, 1922), rather than tone being a late emergence. My original model (Brown, 2000a) mistakenly argued that the musilanguage precursor had the property of lexical tone and thus lexicality, an objection well pointed out by Fitch (2010). I now firmly reject that idea in favor of lexical tone being a purely linguistic feature that emerged after the separation of language from the musilanguage precursor. From a crosslinguistic perspective, we know that the majority of spoken languages in the world today are lexical-tonal, although they are concentrated into a handful of geographic hotspots, mainly sub-Saharan Africa, southeast Asia, Papua New Guinea, and the Americas (Yip, 2000). These languages, despite their absence in the well-studied Indo-European language family, represent the dominant mode by which people communicate through speech. Non-tonal languages are the exception, not the rule.

In proposing that language started out as a lexical-tonal system from its origins, I am claiming that the vocal route for developing acoustic symbols involved not just a combinatorial mechanism for phonemes but a combinatorial mechanism for lexical tones as well (**Figure 6**). While lexical tone is not conceived of as a combinatorial system by linguists, it seems reasonable to me to think about it this way. Each tone language has a discrete inventory of lexical tones, either level tones (e.g., high, low), contour tones (e.g., rising, falling), or some combination of the two (Yip, 2000). The majority of syllables receive one of these possible tones. Importantly, tone languages contain a large number of homonyms that vary only in tone but in which the phonemes are identical; the four tonal homonyms of /pa/ in Mandarin are a well-cited example of this, where the four words mean eight, to pull out, to hold, and father, respectively (Lee and Zee, 2003). Hence, lexical tone seems to operate similarly to the phonemic combinatorial mechanism, but instead works on the pitch levels and/or pitch contours of the vocalic part of the syllable. In other words, while phonemic combinatoriality is principally an articulatory phenomenon, tone combinatoriality is mainly phonatory. An important feature of this hypothesis is that speech inherited and maintained the imprecise levels-and-contours melodic system of the musilanguage system. Lexical tone operates using general pitch contours with imprecise pitch targets (most commonly rising and falling) and likewise with level tones having equally imprecise pitch targets (most commonly high and low). It is critical to keep in mind that, given that lexical tone is absent in one third of contemporary languages, tone is clearly a dispensable feature of a language.
However, according to the hypothesis I am offering here, lexical tone is the ancestral state of spoken language, and the loss of tone is a derived feature of non-tonal languages, rather than the reverse progression (Brown, 2000a).

The last feature about the road to language indicated in the lower part of **Figure 6** is the emergence of alternating textures associated with dyadic exchange (see also **Figure 3**). Given that language is about communicating information symbolically, alternation is a much more efficient means of effecting this transmission than simultaneous production, in contrast to music, where simultaneous production is central to its coordinative function and efficacy. I will return to this point about alternation in the section below about the evolution of syntax, since recent work on interactional linguistics demonstrates not only the prosodic relatedness of interacting speakers (Couper-Kuhlen, 2014; Bögels and Torreira, 2015; Filippi, 2016; Levinson, 2016) but their syntactic relatedness as well, leading to models of "dialogic syntax" (Du Bois, 2014; Auer, 2015).

#### Music

The road to music is characterized by a complementary set of features emerging from the joint musilanguage precursor. The imprecise nature of the pitch-targets for lexical tone is contrasted with their precision in music and its system of tonality using scaled pitch-levels, which comprises the second major branching from the musilanguage system (**Figure 6**). As paradoxical as it might sound, speech's lexical tones are not the least bit tonal in the musical sense, although both lexical tone and music operate using relative pitch-changes, rather than absolute pitches. Tonality in the musical sense involves a discrete inventory of (relative) pitches making up a musical scale, thereby establishing fixed interval-classes among these pitches, where the same pitches are generally used in both melodic ascent and descent, what I refer to as the recurrence of pitches.

Importantly, music is a third type of combinatorial system in human vocal communication (beyond phonemic combinatoriality and lexical-tonal combinatoriality), however in this case involving specific pitch combinations, similar to certain forms of melodious birdsong (Marler, 2000). As mentioned above, while phonemic combinatoriality focuses mainly on articulation, music's pitch combinatoriality focuses mainly on phonation, as with lexical tone. What makes music "musical," and what makes it acoustically different from lexical tone, is that the pitches are scaled and recurrent, whereas in speech, whether for a tone language or an intonation language, they are not. In addition, this scaling of pitch occurs both in the horizontal dimension of melody and in the vertical dimension of harmony (another manifestation of recurrence), since music retains the complex group textures of the precursor stages, although the evolution of rhythmic entrainment mechanisms provides music with a wide diversity of texture types, including human-specific forms of unison and homophony (see **Figure 5** above).

I propose that the coarse-grained levels-and-contours system that was ancestral to speech and music, and that was retained by speech after the bifurcation process, ultimately gave rise to the musical type of tonality, by making a shift from the imprecise pitch-targets of the precursor to the precise intervallic pitch-targets of music. The road to music occurred by a digitization of the pitch properties of the prosodic precursor to produce a scaling of pitches, which serves as the basis for tonality and thus music. For example, as pointed out by the early comparative musicologists (Sachs, 1943), there are simple chants in indigenous cultures that alternate between only two pitches. So, I can imagine scenarios in which the imprecise system of the precursor became quantized so as to settle on recurrent pitch targets through scaling principles.
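The notion of quantization invoked here can be made concrete with a minimal computational sketch. It is an illustration under stated assumptions rather than a model of the evolutionary process itself: the pentatonic scale, the 220 Hz reference pitch, and the function names are all hypothetical choices. The sketch snaps imprecise, continuous pitches onto a small inventory of discrete, recurrent pitch targets.

```python
import math

# Hypothetical scale (an assumption for illustration): major pentatonic,
# expressed in semitones above the reference pitch.
PENTATONIC = (0, 2, 4, 7, 9)


def hz_to_semitones(freq_hz, ref_hz=220.0):
    """Distance of freq_hz from ref_hz in equal-tempered semitones."""
    return 12.0 * math.log2(freq_hz / ref_hz)


def quantize_to_scale(semitones, scale=PENTATONIC):
    """Snap a continuous pitch (in semitones) to the nearest scale slot."""
    octave = math.floor(semitones / 12.0)
    # Candidate slots: this octave's scale degrees plus the first degree
    # of the next octave, so pitches near the octave boundary snap upward.
    candidates = [12 * octave + s for s in scale]
    candidates.append(12 * (octave + 1) + scale[0])
    return min(candidates, key=lambda c: abs(c - semitones))


def quantize_contour(freqs_hz, ref_hz=220.0, scale=PENTATONIC):
    """Quantize a sequence of frequencies (Hz) to scale slots (semitones)."""
    return [quantize_to_scale(hz_to_semitones(f, ref_hz), scale) for f in freqs_hz]
```

In this sketch, two slightly different renditions of a two-pitch chant, such as `quantize_contour([221.0, 247.5])` and `quantize_contour([219.0, 246.0])`, settle on the same pair of pitch slots. This is the sense of "recurrence" used above.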

The final point about the road to music that is indicated in the upper part of **Figure 6** is the emergence of integrated textures associated with group-wide production (see also **Figure 3**). As shown in **Figure 5**, the most integrated textures in human communication are unison and homophony, due to the joint onset of parts. These are the most domain-specific and species-specific textures of music, compared to both human conversation and animal forms of group vocalizing, where heterophony and polyphony predominate. The emergence of integrated forms of chorusing is due to the advent of mechanisms of not just vocal imitation but metric entrainment (Brown, 2007). While entrainment is often discussed in the literature as the synchronization of movement to some external timekeeper (as in a person tapping their finger to a metronome beat), it occurs comparably as mutual entrainment among individuals engaged in chorusing or related forms of synchronized body movement, like marching (Chauvigné et al., 2014).

A major hypothesis of this article is that music evolved to be a dual coordination system, using both tonality and metric entrainment to engender integration (**Figure 7**). Scale structure, by digitizing the occurrence of usable pitches, creates pitch slots for coordination among chorusing individuals, as manifested in the vertical grid-like pattern of a musical staff, with its discrete pitch levels. Likewise, metrical structure, by creating discrete beat locations for onsets, creates time slots for coordination among individuals. The extreme case of integration in music occurs in unison texture—as in the group singing of "Happy Birthday"—where all performers converge on the same pitches at the same time points. However, while music is indeed a dual coordination system, I contend that scale structure is music's defining feature, with metrical structure being something that is shared with dance and even with speech in the cases of poetic verse and the rhythmic chanting that occurs at political rallies. In support of this, it is clear that tonality and metrical structure can each work in isolation, as seen both in non-metrical melodies and in metrical structures that are unpitched, such as a tap dance.
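The idea of "pitch slots" and "time slots" can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not an implementation of the model: the one-octave major scale, the 0.5-second beat period, and the names `to_slots` and `in_unison` are hypothetical. Each sung event is an (onset in seconds, pitch in semitones) pair, and each performer's imprecise output is snapped to the nearest beat and the nearest scale degree.

```python
# Hypothetical pitch slots (an assumption): one octave of a major scale,
# in semitones above the tonic.
SCALE = (0, 2, 4, 5, 7, 9, 11, 12)


def to_slots(events, beat_s=0.5, scale=SCALE):
    """Map (onset_seconds, pitch_semitones) events to (beat_index, scale_pitch)."""
    slots = []
    for onset, pitch in events:
        beat_index = round(onset / beat_s)  # nearest time slot
        scale_pitch = min(scale, key=lambda s: abs(s - pitch))  # nearest pitch slot
        slots.append((beat_index, scale_pitch))
    return slots


def in_unison(part_a, part_b, beat_s=0.5, scale=SCALE):
    """True if two parts occupy the same pitch slots at the same time slots."""
    return to_slots(part_a, beat_s, scale) == to_slots(part_b, beat_s, scale)


# Two singers who are each slightly off in both timing and pitch.
singer_a = [(0.02, 0.2), (0.49, 1.8), (1.03, 4.1)]
singer_b = [(-0.03, -0.1), (0.55, 2.3), (0.97, 3.7)]
```

Under these assumptions, `in_unison(singer_a, singer_b)` evaluates to `True`. Neither performance is exact, but both fall into the same grid of slots, which is what the dual coordination system is proposed to provide.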

An important evolutionary question is how tonality and meter came together to create the dual coordination system that we associate with music. I am inclined to think that entrainment evolved primarily in the context of whole-body synchronization through dance (Brown et al., 2006a; Brown and Parsons, 2008; Chauvigné et al., 2014), and that musical chorusing later co-opted this whole-body entrainment system to create musical integration. This jibes with the fact that tonality is domain-specific but that metrical structure is used in a cross-domain fashion, being multi-effector (voice and body) and multi-sensory (we do not need pitch at all for metrical structure or entrainment). Hence, the integrated nature of music emerges from the marriage of a domain-specific pitch system of tonality and a domain-general timing system. Patel (2014) has argued that the trait of metric entrainment evolved jointly with vocal learning, and that the two are causally related to one another. However, I do not agree with that perspective. Since vocal learning is a shared feature between music and speech, I believe it should be placed at the level of the joint vocal precursor described above. In contrast, I see entrainment as emerging outside of this vocal nexus as a system for whole-body coordination through dance, which later gets co-opted by the musical system for use in vocal chorusing and its instrumental analogs (**Figure 7**). To my mind, the relevant co-evolutionary question is not that between entrainment and vocal learning (as per Patel), but instead that between entrainment and tonality to create music's dual coordination system.

Before concluding this discussion about the evolution of music, I would like to point out that the evolutionary mystery of music is not just the generation of scale structure per se, but the cognitive perception that different scale-types have different emotional valence connotations (Huron, 2006, 2008; Bowling et al., 2012; Parncutt, 2014). I will refer to this as "scale/emotion associations." In Western music theory, there is an association between the major scale and positive emotional valence, and between the minor scale and negative emotional valence. Scale types can be used in a contrastive manner by composers in a narrative context to convey different emotional meanings, much as contrastive facial expressions can be used by actors to convey different emotions to audiences.

FIGURE 7 | Music as a dual coordination system. Music evolved to be a dual coordination system, using both tonality and metric entrainment to engender integration. Scale structure (tonality) provides "pitch slots" for coordination, while metric structure provides "time slots" for coordination. As shown at the top, music inherited vocal learning from the musilanguage precursor, and achieved scale structure through a digitization of the levels-and-contours pitch system of that stage. Metrical structure is proposed to be a cross-modal system that originated as mutual entrainment of whole-body movement during group dancing. Music is able to co-opt this cross-modal system as its rhythmic mechanism.

Did scale structure and scale/emotion associations evolve as a unitary phenomenon or instead as two sequential emergences? I could imagine scale structure as serving a coordinative function for group integration all on its own, separate from valence coding, for example as in a musical version of a wolf chorus. Scale/emotion signaling would be more important for emotional expression, by creating a musical language of emotion based on valence coding, involving the contrastive use of two or more scale types. This would be important for group-wide emotional communication. **Figure 8** compares two possible evolutionary models for the emergence of scale structure and scale/emotion processing.

The end result of the musilanguage precursor and its branching to form speech and music is the emergence in humans of a "combinatorial triad" (**Figure 6**) comprised of (1) phonemic combinatoriality for both speech and music, derived from the musilanguage precursor, (2) lexical-tone combinatoriality specific to tone-speech, and (3) pitch combinatoriality specific to music. These systems routinely come together, combining the phonemic and pitch domains. Common examples are the singing of songs with words, such as "Happy Birthday," in which music's tonal properties are combined with speech syllables to musicalize these syllables (discussed in more detail in the last section). But even in the case of singing using vocables (like la-la-la), which is predominant in many world cultures (Lomax, 1968), all singing has to occur using some phoneme or another as the articulatory vehicle, even if it is just a single vowel (as in chanting on /a/) or a nasal consonant (as in humming), although I think that Fitch's (2010) claim that music is bare phonology misses the critical point about tonality.

## SYNTAX EVOLUTION AND THE "PROSODIC SCAFFOLD"

In the introductory section, I argued against a strictly syllabic interpretation of the origin of speech and instead suggested that we need to put emotion, prosody, and communicative intent front and center in our evolutionary thinking, leading me to propose a "prosodic scaffold" perspective. This is the idea that the production of speech is embedded in prosody, rather than prosody being an add-on to the compositional and combinatorial levels of speech after the symbolic level of sentence formation has been completed. Importantly, prosody transcends the level of the individual speaker, influencing the process of alternation that characterizes speech's performative arrangement (Robinson, 2005). Recent work on both interactional linguistics and interactional prosody demonstrates the profound influence of this interaction on what people think and say (Couper-Kuhlen, 2014; Du Bois, 2014; Auer, 2015). Speech is not just a process of communication but a process of coordination, and prosody serves as both a cause and an effect of this coordination.

**Figure 9** is an expansion of the material shown in **Figure 6** but which now adds the symbolic components of words and sentences. (For ease of interpretation, material from **Figure 6** unrelated to speech is removed). The prosodic scaffold is graphically represented by showing words and sentences (orange color) embedded in prosody (dark red color). At the lowest level of the linguistic hierarchy, the acoustic symbols that comprise individual words are embedded in the context of word-level prosody. This would occur through a modulation of the pitch, loudness, duration, and timbral features of constituent syllables to convey both linguistic prosody (e.g., lexical tone, the relative stress of syllables in polysyllabic words) and affective prosody (the valence and intensity of the communicated emotions).

The combination of words to form phrases and sentences brings us to the domain of syntax, without question the most contentious issue in the study of language evolution (Berwick, 2011; Tallerman, 2013). In the previous section, I talked about a "combinatorial triad" for the phonological aspects of speech and music. Syntax too is based on a combinatorial mechanism, but one that is quite different from the ones for phonemes, syllables, lexical tones, and pitches. In this case, it is words that get combined to form sentences, a process of combinatoriality that is referred to as compositionality (also productivity). Compared to the small pool of phonemes that go into the combinatorial systems for phonology, the compositional system operates with a pool of tens of thousands of words in the lexicon, organized into word classes that get combined through rule-based syntactic operations to achieve a meaningful ordering of words (Tallerman, 2015).

The field of syntax evolution has witnessed an interesting debate between two contrasting perspectives. The first is the idea of compositionality rooted in the concatenation of symbols, exemplified by "proto-language" models such as Bickerton's (1995). The core idea is that, starting from a basic lexicon of individual symbols, these symbols can be combined to form more-complex meanings. At the proto-language stage, the ordering of these symbols is merely associational, and does not suggest any kind of temporal ordering of events or causal relationships, although Jackendoff (1999, 2002) has suggested that Agent First might be a mechanism operative at the protolanguage stage. Later stages in the evolution of syntax are thought to add grammaticalization onto the proto-language system to develop word classes that have syntactic functions, not just semantic meanings. Hence, word order and morphology develop, both of which affect the combinability of words as well as their ability to be displaced within sentences. An alternative theory is that speech began from its inception as holistic utterances, called holophrases, and that the evolution of syntax proceeded by fractionating these holophrases into words and phrases (Wray, 1998; Mithen, 2005; Arbib, 2012). The idea is that holophrases conveyed complex but holistic meanings, which could be later broken down into constituent words by decomposing their complex meanings. Hence, language evolution proceeded from the holistic level to the unit level, rather than the reverse.

A critical discussion of evolutionary syntax models is beyond the scope of this article. The only point that I will add to the debate is that the "prosodic scaffold" model has the potential to synthesize elements of the two aforementioned classes of theories. The model presented in **Figure 9** integrates combinatorial and holistic processes through the mechanism of prosodic embedding, as shown by "sentence-level prosody" in the figure. Prosody operates in a holistic fashion and is thus inherently holophrastic. The proposal of a prosodic scaffold is that the compositional mechanisms of syntax are embedded in this holistic prosodic scaffold. Therefore, instead of arguing for the idea of symbolic holophrases, I am arguing for the existence of prosodic holophrases that serve as the scaffold for compositional syntactic mechanisms.

Note that this proposal of prosodic embedding is precisely opposite to the way that most linguists think about language and its origin. Tallerman (2013:476) states: "Put simply, in syntax words must come first; phrases are built up around single words. . . Thus, suggesting that phrases evolved in protolanguage before there were word classes is once again entirely the wrong way round" (emphasis in the original). This is a difficult argument to address since structural linguists do not consider prosody to be a core component of language. The dispute between linguists and people like me, Mithen and Fitch might have far less to do with our proposals of a co-evolutionary stage uniting language and music than with how prosodic processes are situated with respect to the core linguistic processes of semantics, syntax, and phonology. If prosody is linguistically ancillary, then there is no point in discussing a prosodic proto-language that preceded semantics and syntax. If it is primary, then it makes sense to do so. I do not believe that the fields of either linguistics or language evolution have actually had a discussion on this topic.

Leaving aside the idea that language is primarily a vehicle of thought, such as in the monologue that makes up inner speech, language is routinely generated in a discursive manner through the alternating performance arrangement of speech. Sentences must therefore be generated in an interactive manner. But it would be wrong to think of a dialogue as simply a pair of monologues punctuated by interruptions. Sentences are generated during conversation in response to what has been said by others (Auer, 2015), not least through the exchange of questions and answers (Jordania, 2006). Language production during conversation, therefore, is a balancing act between two competing needs. On the one hand is the "monologic" driving force of leadership that aims to get one's personal information and perspective across to other people through persuasion, including statements of demands, commands, suggestions, desires, opinions, values, norms, etc. On the other hand is a "dialogic" driving force of mutuality that fosters exchange by adapting to one's conversation partner through an ongoing alternation between follower and leader roles. A huge literature on the pragmatics of language (Robinson, 2005) indicates the great extent to which people modify all aspects of linguistic and paralinguistic production so as to adapt to their conversational partners. This occurs at the levels of topic, word choice, syntax, pitch, loudness, tempo, and beyond. The end result of this mutual adaptation is that there is a strong sense of matching, mirroring, and mimicry between conversational partners (Couper-Kuhlen, 2014), impacting not only language and speech but posture, facial expression, gesture, as well as all aspects of prosody (Szczepek Reed, 2012, 2013; Couper-Kuhlen, 2014; Auer, 2015). Hence, the scaffold of one speaker is clearly influenced by that of another in constructing sentences during conversation.

## (RE)UNIFICATION: THE MUSILINGUISTIC CONTINUUM

A previous section described the bifurcation process to generate language and music as separate, though homologous, functions emanating from a joint prosodic precursor that I called the musilanguage system. As a last step, I now need to consider the (re)unification of language and music (Brown, 2000a), which occurs ubiquitously in the performing arts and religious rituals. The most general interaction between language and music is unquestionably songs with words. The potential for direct and seamless coupling between musical pitches and the syllables of words is one of the strongest pieces of evidence for a joint origin of music and language.

However, this coupling does not occur in a singular manner. The comparative musicologist Curt Sachs presciently argued that there was not a unique origin of music but instead multiple (Sachs, 1943). In particular, he proposed a distinction between (1) a type of music derived from melody ("melogenic") and (2) a type of music derived from words and text ("logogenic"). Quite separate from evolutionary considerations per se, we can think about Sachs' distinction from a purely structural standpoint and define the melogenic style of singing—whether it occurs with or without words—as being the conventional version of music using scaled intervals and metrically-organized beats. While most forms of melogenic singing in Western culture use words, many others in world cultures do not use words, but instead use vocables (i.e., meaningless syllables like "la" or "heya") as the syllabic vehicles for vocal production.

In contrast to this melogenic style, there are many forms of word-based singing that sound like stylized versions of speech. The focal point of communication is the text, and melody is a means of accentuating it emotively. This logogenic style of singing words is basically a chanting of speech in which the melody and rhythm are closer to speech's intrinsic melody and rhythm than to the melogenic style of scaled pitches and metric rhythms. My interpretation of Sachs' multi-origins hypothesis is that the melogenic style, most notably when using vocables instead of text, arises during the divergence of music from the joint prosodic precursor, and that the logogenic style is something that follows the full-fledged emergence of speech as a cognitive function, where chanting is a means of stylizing the linguistic message.

The argument just described leads me to propose, as I did in Brown (2000a), that the evolutionary processes that generated language and music as reciprocal specializations from a joint precursor ultimately resulted in a "musilinguistic continuum" containing the poles of language and music as well as a number of interactive and intermediate functions (see also Savage et al., 2012). **Figure 10** presents a model of this. At the extremes of the continuum are language and music, represented both vocally (as speech and vocable-based singing, respectively) and instrumentally. The latter includes speech surrogates, such as drummed and whistled languages (Stern, 1957), as well as conventional instrumental music. In the middle of the continuum is the most interactive function of songs with words, where language and music most universally come together (Savage et al., 2015). As shown in the figure, this can be accomplished in a logogenic manner that sounds like a stylized version of speaking, or it can occur in a more melogenic manner, employing musical scales and metric rhythms.

Sitting in between standard speech and songs with words are intermediate functions in which the lexicality of speech is maintained but in which the acoustic properties of the production lean in the direction of music. This can occur with respect to rhythm, melody, or both. Rhythmic speech is a common form of this (Cummins, 2009, 2013), as occurs in poetic verse, rap, and the group chanting that routinely permeates political rallies and marches. Prosodic speech includes the emotionally-accentuated speaking style of an actor, poet, or public speaker, or of a mother interacting with her baby (Fernald, 1989; Papousek, 1996). It also includes logogenic musical forms, such as sprechstimme, recitative, and parlando-style chanting, for example cantillation of the Torah. It is important to point out that there are meaningless forms of speech that still have normal speech-like prosodic contours. Examples include the filtered speech used in psychology experiments (Fernald, 1989), grammelots used in theatrical performance (Jaffe-Berg, 2001; Rudlin, 2015), or simply me watching a film in a foreign language. The point I want to make here is that the elimination of the lexicality of these forms of speech does not suddenly convert them into music. I reject the idea that prosody is a form of music and that prosodic contours divorced of words are a type of singing. Vocalizations like grammelots are better thought of as "de-lexicalized speech" than as music.

While it is easy to talk about intermediate forms of speech that are more or less "musical," I contend that we cannot do the same thing on the musical side of the continuum. As shown by the white box in the figure, I argue that there is a "linguistic wall" that ensures that lexicality is a categorical feature, rather than a continuous acoustic feature like musicality. The right side of the figure shows the purely non-lexical functions of vocable singing and instrument playing. If the singing were now to include words instead of vocables, it would immediately jump the linguistic wall to become a song with words. It is difficult to imagine functions in which musicality is maintained but in which lexicality is intermediate between words and non-words. There is really nothing intermediate between words and nonwords (e.g., vocables, pseudowords, grunts). So, the linguistic wall creates a categorical divide between vocable singing and word singing, making the musilinguistic continuum both asymmetric in structure and partially discontinuous.

This brings me full circle to the critique that I raised in the opening section of this article of models of language evolution that posit a singing-based precursor stage (Darwin, 1871; Jespersen, 1922). Some people might think that this stage should be identical to what I am calling the musilanguage system in this article. However, I do not see things that way. In particular, the musilanguage stage is proposed to lack both tonality and meter, since it is pre-musical. It is a grammelot, hence making it acoustically far more similar to prosodic speech than to music (i.e., it is comprised of levels-and-contours, not scaled intervals). So, while musilanguage might sit next to vocable singing in terms of its absence of lexicality, it would definitely sit next to prosodic speech in terms of its acoustic properties. That is why I find it inappropriate to refer to this precursor as "singing" and why I find it problematic that singing-based theories of language fail to make a distinction between music and prosody. To my way of thinking, tonality (scale structure) is a novel, domain-specific feature of music not shared with speech. That is why I far prefer a neologism like musilanguage to the term singing, since the musical features of singing-based precursors are not specified by people who use the term singing. What they are generally implying is a prosodic vocalization system, rather than a true musical system. I have argued that such a precursor embodies the shared prosodic features of language and music, but not the scales that are specific to music.

## TESTABILITY OF THE MODEL

**Table 1** above lists a dozen proposed features of the musilanguage system. Some of them represent features shared by speech and music, while others do not. For example, much research has shown that affective prosody is conveyed in a parallel manner in speech and music, capitalizing on the same types of dynamic cues (Juslin and Laukka, 2003). By contrast, experimental work from my lab has explored the potential musical properties of speech, and has found that speech is atonal (Chow and Brown, submitted) and based on heterometric rhythms (Brown et al., 2017), both of which conform with properties of the proposed precursor. Likewise, work on singing by Pfordresher and Brown (2017) suggests that music might in fact be a derivative of a coarse-grained levels-and-contours system, as shown by the highly imprecise nature of sung intervals in everyday singers, not to mention children (Welch, 1979a,b, 2006).

Regarding brain localization, the bifurcation model suggests that music and speech/language should show their greatest similarities at the sensorimotor (phonological) level (Brown et al., 2006b, 2009), but the least similarity with regard to domain-specific features like lexicality and tonality. For example, semantic areas in the inferior and middle temporal gyri are frequently activated during language tasks (Xu et al., 2009; Visser et al., 2012; Krieger-Redwood et al., 2015), but not during music tasks, although these areas can be modulated by music when the task is specifically focused on semantic properties (Koelsch et al., 2004). While the brain network for semantics has been well-studied, that for tonality has been much more poorly explored. A key objective for future research will be to examine the neural basis of what I have called scale/emotion processing, not least the emotional-valence connotations of different scale types. This will unquestionably lead to an exploration of limbic and paralimbic areas associated with emotion perception (Tabei, 2015).

Work on infant development supports the bifurcation model in that the first year of life appears to comprise a shared stage in which prosody, speech, and music are relatively undifferentiated, followed by a separation of language/speech and music/singing as distinct audiovocal functions (Papousek, 1996). Of course, parental singing to/with children in Western culture almost invariably involves the use of songs with words. As a result, most children are taught music through its coupling with language.

The aspect of the model that needs the most verification is the proposal of a "prosodic scaffold" in the production of speech. Work on speech perception demonstrates a strong influence of prosodic cues on comprehending speech (Filippi et al., 2017), but very little work of this type has occurred at the generative level. While numerous studies of affective prosody have used trained actors to convey different basic emotions in speech (Scherer, 2003), no studies have looked at this in spontaneous speech. A mood-induction procedure (Scherer, 2003; Van Dyck et al., 2013) might be one manner to address the influence of affect on speech production, especially if the content of the speech could be controlled for, say through a pre-learned text.

## CONCLUSIONS

The account of language evolution that I have presented in this article is vocal (rather than gestural), prosodic (rather than articulatory or syllabic), group-level (rather than individual or dyadic), committed to a joint origin of language and music, and rooted in the idea that syntax-based phrase generation emerged, from its origin, as the filling out of a prosodic scaffold during speech production. I propose a two-step evolutionary process: first an involuntary but ritualized system of affective prosody, followed by a learning-based system of intonational prosody grounded in phonemic combinatoriality. From there, language and music branched out as separate, though homologous, functions through the emergence of lexicality and tonality, respectively, and through the adoption of the contrasting communicative arrangements of alternation and integration, respectively. After their separation, language and music are perennially reunited in songs with words, occurring in both melogenic (more musical) and logogenic (more speech-like) styles. This potential for direct and seamless coupling between words and musical pitches is one of the strongest pieces of evidence supporting a joint origin of language and music.

#### AUTHOR CONTRIBUTIONS

SB conceived of the ideas and wrote the manuscript.

#### ACKNOWLEDGMENTS

I am grateful to Michel Belyk and Piera Filippi for critical reading of an earlier version of the manuscript and for their useful suggestions for improvement. I thank Aleksey Nikolsky for a detailed and insightful critique of the manuscript, as well as the reviewers for their comments. This work was supported by a grant from the Natural Sciences and Engineering Research Council (NSERC) of Canada.




**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Brown. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Paradox of Isochrony in the Evolution of Human Rhythm

Andrea Ravignani<sup>1,2,3</sup>\* and Guy Madison<sup>4</sup>

<sup>1</sup> Language and Cognition Department, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands, <sup>2</sup> Veterinary and Research Department, Sealcentre Pieterburen, Pieterburen, Netherlands, <sup>3</sup> Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium, <sup>4</sup> Department of Psychology, Umeå University, Umeå, Sweden

Isochrony is crucial to the rhythm of human music. Some neural, behavioral and anatomical traits underlying rhythm perception and production are shared with a broad range of species. These may either have a common evolutionary origin, or have evolved into similar traits under different evolutionary pressures. Other traits underlying rhythm are rare across species, only found in humans and a few other animals. Isochrony, or stable periodicity, is common to most human music, but isochronous behaviors are also found in many species. It appears paradoxical that humans are particularly good at producing and perceiving isochronous patterns, although this ability does not conceivably confer any evolutionary advantage to modern humans. This article will attempt to solve this conundrum. To this end, we define the concept of isochrony from the present functional perspective of physiology, cognitive neuroscience, signal processing, and interactive behavior, and review available evidence on isochrony in the signals of humans and other animals. We then attempt to resolve the paradox of isochrony by expanding an evolutionary hypothesis about the function that isochronous behavior may have had in early hominids. Finally, we propose avenues for empirical research to examine this hypothesis and to understand the evolutionary origin of isochrony in general.

#### Edited by:

Leonid Perlovsky, Harvard University and Air Force Research Laboratory, United States

#### Reviewed by:

Shinya Fujii, Keio University, Japan; Enrico Glerean, Aalto University, Finland

#### \*Correspondence:

Andrea Ravignani andrea.ravignani@gmail.com

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 25 May 2017; Accepted: 30 September 2017; Published: 06 November 2017

#### Citation:

Ravignani A and Madison G (2017) The Paradox of Isochrony in the Evolution of Human Rhythm. Front. Psychol. 8:1820. doi: 10.3389/fpsyg.2017.01820

Keywords: synchrony, prediction, interaction, coordination, turn-taking, evolution of music, evolution of speech

## CONTENTS

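This paper deals with isochronous temporal patterns. The emphasis is on the quantitative properties of isochronous patterns, and their perception and production in humans. The paper is organized in seven sections, namely:

1. What is isochrony?
2. The relevance of isochrony to human music and speech
3. Mathematics, physics, and signal processing
4. Physiology and neuroscience
5. Comparative cognition: non-human animals
6. Isochrony in interaction
7. Evolutionary hypotheses and future empirical work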

The aim of this paper is not to provide an exhaustive review of each of these areas. Rather, we attempt to establish a first connection between as many explanatory levels of isochrony as possible, across scientific disciplines and research traditions.

## WHAT IS ISOCHRONY?

Music is a complex phenomenon composed of interdependent parts. While a holistic approach is always important, the analytic, scientific method works by first analyzing constituent components individually (Fitch, 2015). Rhythm is a crucial dimension of human music. In common language, but also in scientific publications, different meanings are often conflated into the word 'rhythm' (see **Table 1** for definitions). In the most general definition, rhythm denotes a pattern of events in time (McAuley, 2010). An isochronous pattern is a rhythm where all intervals between events are equal, like those of a metronome. Hence, all isochronous sequences are rhythmic, but not vice versa (**Figure 1**). A third related concept is the beat, namely the psychological tendency to superimpose an isochronous grid onto a rhythmic sequence (a.k.a. pulse or beat perception, **Figure 2C**). The focus of the present paper is the evolutionary significance of the human perception and production of rhythmic sequences that are physically isochronous, henceforth simply 'isochronous.' We deliberately avoid discussing pulse and beat perception, as these have been the object of much empirical research and many theoretical frameworks. Pure isochrony has received comparatively less attention.

Humans are particularly good at producing and perceiving both rhythmic and isochronous patterns (Bolton, 1894; Fitch, 2009, 2015; Motz et al., 2013; Ravignani et al., 2017). Yet, while general rhythm capacities could be biologically useful for foraging, mating, navigating the environment and predicting events, propensity for isochrony does not seem to confer any evolutionary advantage to modern humans (Fitch, 2009). Humans' aptitude for isochrony contrasted with its apparent lack of evolutionary function constitutes the paradox of isochrony. Below we offer some perspectives on the presence of isochrony in nature and in modern humans' everyday life.

TABLE 1 | Definitions of key concepts discussed in the paper (Expanded and modified from Ravignani and Norton, 2017).

Humans have extraordinary abilities to deal with isochronous behaviors. We can detect deviations from isochrony on the order of 20 ms or 4% of the interval to be timed, but we can also perceive the underlying isochrony even in the face of random deviations and gradual increases/decreases in intervals (Bolton, 1894; Madison and Merker, 2002, 2004; Max and Yudman, 2003; Repp, 2005; Madison, 2009; Motz et al., 2013; Repp and Su, 2013). When confronted with isochronous sequences where intervals have been slightly jittered, humans will tend to regularize the intervals and perceive the whole sequence as isochronous (for psychophysical thresholds see Friberg and Sundberg, 1995; Madison and Merker, 2002; Merker et al., 2009). Likewise, when producing isochronous sequences, the variability is around 4%, but it can be decreased to about 2% of the interval when synchronizing to a sound sequence with faster metrical levels (Madison, 2014). Listening to an isochronous sequence typically induces a beat (**Figure 2C**), which in turn generates expectations that future events will fall into a multiple or integer subdivision of the beat. A violation of the expectation, such as slightly changing the onset of one event, can be measured by mismatch negativity (Motz et al., 2013). The processes underlying this reaction are subliminal (Madison and Merker, 2004) and require no learning, which indicates that the beat is a very basic, inherited phenomenon.
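Percentages such as the 4% and 2% figures above are most naturally read as the variability of the produced intervals relative to their mean. As a minimal sketch, assuming that reading, the following Python snippet computes the coefficient of variation of inter-onset intervals. The onset times are hypothetical values chosen for illustration, not data from the cited studies, and the function names are our own.

```python
import statistics

def inter_onset_intervals(onsets_ms):
    """Return the intervals between consecutive event onsets (in ms)."""
    return [later - earlier for earlier, later in zip(onsets_ms, onsets_ms[1:])]

def coefficient_of_variation(intervals_ms):
    """Population standard deviation of the intervals divided by their mean."""
    return statistics.pstdev(intervals_ms) / statistics.mean(intervals_ms)

# A metronome-like sequence: every interval is exactly 500 ms.
metronome_onsets = [0, 500, 1000, 1500, 2000]

# Hypothetical human taps aiming at the same 500 ms interval.
tapping_onsets = [0, 520, 1000, 1490, 2010]

metronome_cv = coefficient_of_variation(inter_onset_intervals(metronome_onsets))
tapping_cv = coefficient_of_variation(inter_onset_intervals(tapping_onsets))

print(metronome_cv)  # 0.0 for a perfectly isochronous sequence
print(tapping_cv)    # about 0.036, i.e., variability of roughly 3.6% of the interval
```

A value of zero corresponds to physical isochrony; the small non-zero value for the hypothetical taps is of the same order as the production variability quoted above.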

In evolutionary biology, a behavioral trait can appear for a number of reasons. It can be an evolutionary adaptation, namely a trait which evolved to increase a species' fitness in a given environment. As such, it may more or less have lost its adaptive value due to changes in the environment, while still prevailing in the population because it has not been selected against. It can also be a by-product of other evolutionary processes, a so-called exaptation. As such, isochrony might have been recruited for purposes unrelated to the pressures which caused its early emergence (Merker et al., 2009). In the present context, isochrony refers to humans' perception and production of isochronous event sequences within the bounds and constraints reviewed above.

## THE RELEVANCE OF ISOCHRONY TO HUMAN MUSIC AND SPEECH

The roots of the human propensity for isochrony are clearly found in our biology, specifically in some preparedness of our neural system (Buzsaki, 2006; Arnal and Giraud, 2012; Fujioka et al., 2012; Fujii and Wan, 2014; Merchant et al., 2015). For example, newborn babies react differently to isochronous than to anisochronous sequences (Honing et al., 2009). Similarly, children aged 2–4 years show motoric isochronous behavior with clear periodicities, though little tempo adjustment (Eerola et al., 2006). Although isochrony in music is a human universal (**Figure 2C**), there is considerable variation across the world's music cultures. Western musical cultures appear to employ isochrony most thoroughgoingly, for example with rhythmic sequences being composed of isochronous subsequences (e.g., Bach). When rhythmic patterns are not isochronous, they are based on a psychologically induced sense of beat (Merker et al., 2009). In these cases, notes that are played continuously confirm or violate the induced pulse (**Figure 2D**), either in a structural or expressive manner (see Merker, 2014 for a novel perspective). African music is also isochronous at some descriptive level (though see **Figure 3**), while Asian music tends to be less so. Some North-American Indian, Javanese Gamelan, and Western electro-acoustic traditions exhibit no isochrony at all, but it might be argued that they do not fulfill reasonable definitions of music. For comparisons of timing in different musical cultures, see Arom (1991), Polak et al. (2016), and Neuhoff et al. (2017). Although humans are cognitively biased toward isochrony in music (Ravignani et al., 2016a; Fitch, 2017), this bias is apparently modulated by enculturation (Jacoby and McDermott, 2017, though see Bowling et al., 2017). Finally, isochrony is often associated with motor synchronization in the literature. However, a recent medical case study has found a dissociation between perception of isochrony, among others, and audio-motor synchronization abilities (Bégel et al., 2017).

Speech is another human activity which may involve isochrony (**Figure 1**). The research field investigating rhythmic regularities in speech has been split for decades (Lehiste, 1977; Roach, 1982; Kotz and Schwartze, 2010, 2016; Fujii and Wan, 2014). Some scholars argue that world languages can be classified in groups exhibiting isochrony at the stress, mora, or syllable levels (called, respectively, stress-timed, mora-timed, and syllable-timed languages, see **Table 1** and Grabe and Low, 2002; Fabb and Halle, 2012). Other researchers argue the opposite, namely that the speech signal is inherently anisochronous (**Figures 2A,B**), and the feeling of isochrony derives, for instance, from perceptual regularization rather than physical properties of the signal (Tuller and Fowler, 1980; Dauer, 1983; Jadoul et al., 2016; Brown et al., 2017). Without entering this debate here, some empirical findings are worth noting. In particular, no matter the theoretical perspective adopted, human vocalizations can be experimentally driven toward isochrony (Jacoby and McDermott, 2017), especially when two individuals are asked to speak synchronously (Bowling et al., 2013) or perform turn-taking (Schultz et al., 2016). If speech recordings are experimentally manipulated, so that the syllable timing follows heterogeneous rhythmic patterns, isochronous speech is more intelligible than anisochronous speech (Aubanel et al., 2016). Finally, a recent experiment found evidence for isochronous timing in children's handwriting (Pagliarini et al., 2017).

A third common human activity where isochrony is hypothesized to play a role is dance (Fitch, 2016; Laland et al., 2016; Richter and Ostovar, 2016; Su, 2016a,b). Similarly to music, a series of isochronous events, such as a drum line, may provide anchor points in time used to structure dance movements (Fitch, 2016; Laland et al., 2016). Likewise, biophysical constraints on movement produce isochronous or integer ratio temporal intervals (e.g., Merker et al., 2009; Su, 2016a). This isochrony-centered perspective might however be quite specific for dance in humans inhabiting the Western world. A more inclusive approach considers dance a polyhedric behavior present in all human cultures, and whose precursors can be found in other animal species (Fink and Shackelford, 2016; Ravignani and Cook, 2016). If this approach is adopted, then isochrony might not be such an indispensable pillar of dance (Ravignani and Cook, 2016).

All the above suggests that, while isochrony might not be crucial in dance or speech, it is present in human everyday musical behavior (**Figure 1**). So, why is isochrony so common if it doesn't appear to serve any particular function (Fitch, 2009; Merker et al., 2009), at least in modern humans? Below we will try to analytically decompose isochrony even further in its constituent parts across disciplines.

## MATHEMATICS, PHYSICS, AND SIGNAL PROCESSING

Rhythms, including isochronous ones, can be formalized mathematically (Cohen, 1962; Toussaint, 2013). From a purely information-theoretic perspective (MacKay, 2003), when producing a signal over time, isochrony minimizes the signal's entropy. A pattern of time intervals (**Figure 4A**) can be described by a set of interval durations and the probability of occurrence of each interval (**Figure 4B**). A more refined model features conditional transition probabilities (**Figures 4C,D**), where the duration of the upcoming interval is determined probabilistically from the duration of the previous n intervals (Cohen, 1962). An example could be a specific pattern for which, given that the first and second intervals are short, there is a high probability that the third interval is long. This is not the case for isochronous sequences: no matter which interval is to be predicted, and how many past intervals are taken into account, the upcoming interval will be a constant value equal to past intervals with probability 1. This property makes isochronous sequences, from an information-theoretic perspective, purely deterministic and predictable, granting the lowest possible entropy. In other words, conditional on a known repetition rate (i.e., tempo), isochrony minimizes entropy in rhythmic sequences.
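The entropy argument can be illustrated numerically. The sketch below computes the Shannon entropy of the empirical distribution of interval durations, i.e., only the simplest, zero-order description of a rhythm (durations and their probabilities), not the conditional transition model. The interval values are hypothetical and the function name is our own.

```python
import math
from collections import Counter

def interval_entropy(intervals_ms):
    """Shannon entropy (in bits) of the empirical distribution of interval durations."""
    counts = Counter(intervals_ms)
    total = len(intervals_ms)
    # p * log2(1 / p) summed over the distinct durations, with p = count / total
    return sum((count / total) * math.log2(total / count) for count in counts.values())

isochronous = [500, 500, 500, 500]        # a single duration with probability 1
two_durations = [250, 500, 250, 500]      # two equiprobable durations
four_durations = [250, 500, 750, 1000]    # four equiprobable durations

print(interval_entropy(isochronous))      # 0.0
print(interval_entropy(two_durations))    # 1.0
print(interval_entropy(four_durations))   # 2.0
```

Under this simplified measure, the isochronous sequence is the only one of the three whose next interval carries no information, which is the sense in which isochrony grants the lowest possible entropy.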

Physics offers prime examples of isochronous processes (Strogatz, 2000). For instance, atomic clocks are based on isochronous oscillations, i.e., atomic activity regularly occurring at known frequencies (Strogatz, 2003). Events like these, reliably repeating at regular time intervals, can serve as a benchmark (**Table 1**): they are used by mankind to synchronize their clocks and, for our purposes, they represent the highest level of isochrony achievable by a system. In other words, empirical isochrony can be defined as synchrony, i.e., perfect co-occurrence (Ravignani, 2017a), of an empirical sequence with respect to an isochronous reference grid.

Likewise, mathematical and computational models of isochrony are quite straightforward. In the simplest case, isochrony can be mathematically generated by trigonometric functions, such as sine and cosine (**Figure 5**). More realistic models of isochronous human behavior involve, for instance, long-range correlations (Madison and Delignieres, 2009), and fractal scaling (Madison, 2004). Isochronous synchrony between two or more entities can be modeled using phase resetting (Sismondo, 1990; Greenfield and Roizen, 1993), period bisection (Ravignani, 2014; Ravignani et al., 2014b), coupled oscillator models (Strogatz and Stewart, 1993; Large and Kolen, 1994; Strogatz, 2000; Rouse et al., 2016; Ravignani, 2017a), and a number of other techniques (reviewed in Ravignani and Norton, 2017). In dynamical systems, isochrony ranks amongst the best understood non-linear processes. The take-home message is that isochrony, in many of its forms and variants, can be comprehensively defined by very simple mathematical expressions (and visualized geometrically, see Toussaint, 2013; Ravignani, 2017b).
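As one illustration of the coupled oscillator approach, the following sketch simulates two phase oscillators with slightly different natural frequencies and symmetric sinusoidal coupling (a Kuramoto-type model). This is a generic textbook formulation used here as an assumption; it is not a reimplementation of any of the specific models cited above, and all parameter values and names are hypothetical.

```python
import math

def simulate_phase_difference(freq1_hz, freq2_hz, coupling,
                              duration_s=20.0, dt=0.001, initial_difference=2.0):
    """Euler-integrate two mutually coupled phase oscillators and
    return the final (unwrapped) phase difference theta2 - theta1 in radians."""
    omega1 = 2 * math.pi * freq1_hz
    omega2 = 2 * math.pi * freq2_hz
    theta1, theta2 = 0.0, initial_difference
    for _ in range(int(round(duration_s / dt))):
        # Each oscillator is pulled toward the phase of the other.
        rate1 = omega1 + coupling * math.sin(theta2 - theta1)
        rate2 = omega2 + coupling * math.sin(theta1 - theta2)
        theta1 += rate1 * dt
        theta2 += rate2 * dt
    return theta2 - theta1

# Two oscillators at 2.0 Hz and 2.1 Hz (hypothetical tempi).
uncoupled = simulate_phase_difference(2.0, 2.1, coupling=0.0)
coupled = simulate_phase_difference(2.0, 2.1, coupling=1.0)

# Without coupling the phase difference keeps growing; with sufficient coupling
# it settles at a constant value, asin(delta_omega / (2 * coupling)).
predicted_lock = math.asin((2 * math.pi * 0.1) / (2 * 1.0))
print(uncoupled, coupled, predicted_lock)
```

In this formulation the two oscillators lock to a common period whenever the difference between their natural angular frequencies does not exceed twice the coupling strength, which is one simple way to express isochronous synchrony between two entities.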

## PHYSIOLOGY AND NEUROSCIENCE

Human heartbeat, respiration, and locomotion all have an element of isochrony (**Figure 6**) in that they exhibit more regularity than random patterns but less regularity than periodic patterns in physical systems (Winfree, 1986). On the one hand, these processes can be quite regular within a short enough window of measurement (Larsson, 2012, 2014, 2015; Teie, 2016). In fact, the most commonly used measures of these physiological variables (beats, breaths or gaits per minute) assume local isochrony for the sequence analyzed. So, for example, most finger tapping studies have collected sequences of only 20–50 intervals in order to avoid the complicating drift (Wing and Kristofferson, 1973a,b; Madison, 2001). On the other hand, heartbeat, respiration and locomotion are highly dynamic and mutable in order to be functional (**Figure 6**); in other words, acceleration, deceleration, and phase shifts – all disrupting perfect isochrony – are quite common (Strogatz, 2003).

Isochrony and synchrony are also emergent properties of the nervous system. Synchronous groups of neurons, each oscillating isochronously, are common in the brain (Buzsaki, 2006). At a higher level, cortico-subcortical networks are usually recruited to produce and perceive external isochronous events (Kotz and Schwartze, 2010, 2016; Nozaradan et al., 2011; Fujioka et al., 2012). Finally, some pathological states of the central nervous system are known to disrupt intentional isochrony, for instance Parkinson's disease (Grahn, 2012).

What is the neurophysiological basis for behavioral isochrony? Interesting connections between timing of vocalizations and neurophysiology have been discovered by physiologists working on non-human animals, for instance, amphibians. In some frog species, the temporal structure of courtship vocalizations is modulated by hormones (Zornik and Kelley, 2011). An outstanding question is, of course, whether these connections between temporal behavior and hormones can be found in humans, whose ethogram might be more complex than some amphibians. Recent findings in the human neurogenetics of music make this line of research quite promising (Granot et al., 2007; Ukkola et al., 2009; Ukkola-Vuoti et al., 2013; Kanduri et al., 2015a,b).

## COMPARATIVE COGNITION: NON-HUMAN ANIMALS

Some neural, behavioral and anatomical traits underlying isochronous rhythm perception and production are shared with a broad range of species (Wilson and Cook, 2016). These may either have a common evolutionary origin, or have evolved into similar traits under different evolutionary pressures (Ravignani et al., 2014b, 2016b). For instance, timing processes involving the basal ganglia and isochronous oscillations in the brain are shared with other primates and probably other animal taxa. Other traits are rare across species, only found in humans and a few other animals (Patel et al., 2009a,b). For instance, motor entrainment to an external isochronous pulse is only found sparingly in the animal kingdom (Fitch, 2015; Iversen, 2016; Ravignani et al., 2016b; Wilson and Cook, 2016).

A first, crucial difference separates human isochronous behavior from the examples of isochrony in nature provided above. This is the extent to which isochronous pattern production is driven and affected by external factors. A decaying isotope and a person walking at regular pace do not need an external oscillatory stimulus to keep producing isochronous behavior. These are cases of endogenous isochrony (Pikovsky et al., 2003), corresponding to self-sustained oscillators in physics. Conversely, the isochronous behavior exhibited by humans dancing or tapping to music is mostly exogenous: an internal pacemaker is partially corrected by externally perceived oscillatory activity (Merker et al., 2009), corresponding to forced or coupled oscillators in physics.
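The idea of an internal pacemaker that is partially corrected by an external stimulus can be sketched with a first-order linear phase-correction rule, a common simplification in the sensorimotor synchronization literature. The version below is deterministic (no timing noise), and the gain, periods, and function name are hypothetical choices for illustration rather than values taken from the studies cited here.

```python
def simulate_asynchronies(internal_period_ms, stimulus_period_ms,
                          correction_gain, initial_asynchrony_ms, n_taps):
    """Return tap-minus-stimulus asynchronies under linear phase correction.

    On every tap, a fraction `correction_gain` of the previous asynchrony is
    removed, and any mismatch between the internal and the stimulus period
    is added.
    """
    asynchronies = [initial_asynchrony_ms]
    for _ in range(n_taps - 1):
        previous = asynchronies[-1]
        period_mismatch = internal_period_ms - stimulus_period_ms
        asynchronies.append(previous - correction_gain * previous + period_mismatch)
    return asynchronies

# Exogenous case: matching periods, half of each error corrected per tap.
corrected = simulate_asynchronies(500, 500, correction_gain=0.5,
                                  initial_asynchrony_ms=80, n_taps=5)
print(corrected)  # [80, 40.0, 20.0, 10.0, 5.0]

# Purely endogenous case: no correction, internal period 10 ms too long.
drifting = simulate_asynchronies(510, 500, correction_gain=0.0,
                                 initial_asynchrony_ms=0, n_taps=4)
print(drifting)   # [0, 10.0, 20.0, 30.0]
```

With a gain of zero the pacemaker runs on its own period and drifts away from the stimulus, as an endogenous oscillator would; with a non-zero gain the external sequence pulls the taps back into alignment.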

How about all shades of gray between these two extreme cases? Those animal species exhibiting isochrony are sparsely and heterogeneously divided among cases of endogenous and exogenous isochrony. This makes animal research key to understand the nature of human isochrony: for every particular type of isochrony found in a species, its neural mechanisms and resulting behaviors can be mapped and compared between that species and humans (Merchant et al., 2015).

Within exogenous isochrony, another distinction is between prediction and reaction (**Table 1**), depending on whether a timing event is produced by predicting when the next event should occur, or reacting to a previous event (Patel et al., 2009b). Humans exhibit exogenous predictive timing (Fujioka et al., 2012), while crickets exhibit exogenous reactive timing (Greenfield and Roizen, 1993). Finally, the isochronous tail wagging of dogs is quite likely to be endogenous, as external oscillatory stimuli are unlikely to affect its period or phase (cf. Buxton and Goodman, 1967; Fitch, 2009). While some species can be readily classified along the dimensions above (Rouse et al., 2016), for other species some data are available (**Figure 5**), though still not enough for classification into isochrony types (e.g., Schusterman, 1977). Finally, for the majority of species, no systematic investigation of isochronous behavior has been performed. In other words, we still lack data on how most species produce and perceive isochrony under a wide range of different conditions, which would be diagnostic of the underlying mechanisms and limitations.

In animal research, isochrony has been investigated using two main methods: observing natural behavior and training animals to produce specific temporal sequences. Isochrony as natural behavior in other animal species has long been studied, though its relevance to human rhythm has been pointed out only recently (Madison, 2004; Merker et al., 2009; Ravignani et al., 2014b). In fact, many animals signal over time in a precisely isochronous fashion (e.g., see **Figure 5**). 'Isochronous species' span crickets, frogs, fireflies, birds, crabs, and marine mammals (Schusterman, 1977; Sismondo, 1990; Greenfield and Roizen, 1993; Strogatz, 2003; Merker et al., 2009; Kahn et al., 2014; Norton and Scharff, 2016). As these studies are often purely behavioral and observational, rarely targeting neurobiological brain mechanisms, it is difficult to know whether isochrony is endogenous, exogenous, predictive or reactive.

An alternative is to test animals' capacities to produce isochrony in a controlled experimental setup. This is often done in conjunction with synchronization experiments. The animals are trained to produce specific isochronous behaviors, often with the purpose of entraining to a musical beat, and are then tested on their ability to generalize to different tempi and levels of jitter (Wilson and Cook, 2016). Irrefutable evidence of exogenous predictive isochrony exists for only three cases: humans, a sea lion and a cockatoo (Patel et al., 2009a; Cook et al., 2013). Trained isochrony due instead to reactive timing might be more common: several species appear capable of producing series of temporal intervals of equal duration (Hasegawa et al., 2011; Hattori et al., 2015).

## ISOCHRONY IN INTERACTION

In human communication, two opposing functions affect the structure of the signal: expressivity and compressibility (**Table 1** and Kirby et al., 2015). Expressivity influences the amount of information content, hence semantics, which ideally should be maximized (Kirby et al., 2015). Compressibility refers to the density of information transmitted: intuitively, it is cost efficient to transmit the same amount of information in its shortest or maximally compressed form (MacKay, 2003). In other words, signalers would ideally broadcast the maximum quantity of information, using the least possible amount of signal. This tradeoff is further modulated by redundancy: a maximally compressible communication system with no redundancy can be irreversibly corrupted by a minimal transmission error. Hence, a signaler might not want to completely minimize entropy in order to leave room for redundancy.

To what extent can compressibility, expressivity, and redundancy account for human isochrony (**Figure 7**)? Mathematically, isochronous signals maximize redundancy and minimize entropy, but leave almost no room for expressivity. Comparatively, when human participants develop signaling systems in communication experiments, no expressivity leads to maximum compressibility (Kirby et al., 2008, 2015). This, by analogy, would dismiss isochronous pattern production as an expressive communication system (**Figure 7**), i.e., a system where signals are mapped to meanings (but see Bharucha and Pryor, 1986; Horr and Di Luca, 2015). However, the meaning of the transmitted message could lie in the signal emission per se, rather than in any content broadcast through the signal. This is the concept behind 'signaling signalhood' (Scott-Phillips et al., 2009): the message of an isochronous pattern is its 'isochronicity,' instead of being used in referential communication. Consider a hypothetical example. Take rhythmic sequences composed of only two durational intervals (such as the first two intervals of a sequence in **Figure 4A**). The set of all such two-interval sequences could be used for communicative purposes in two main ways. In the first, more common case of 'referential communication,' the duration of the two intervals could encode different conceptual properties. For instance, the first interval could be used to encode the size of a referent, while the second its brightness. Hence, a rhythm composed of a short and a long interval would communicate a small, dark object, while a long and a short interval would refer to a large bright object. This variability in the lengths of the intervals would grant expressivity. However, if isochronous sequences were the signals most frequently transmitted, this way of encoding signals could not be expressive, because all objects would end up being encoded as having an average size and brightness. In contrast, in the second case of 'signaling signalhood,' the two intervals would be used to communicate precision in isochronous pattern production, i.e., to signal isochronicity. Hence, a pattern such as the one at the bottom of **Figure 4A** would signal high precision in isochrony, while the top pattern in **Figure 4A** would signal poor isochrony. From this point of view, human perception and production of isochronous patterns might better fit the second, signaling signalhood framework, rather than the first, referential communication framework.
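The two readings of a two-interval rhythm can be contrasted in a toy example. The sketch below is purely illustrative: the binary short/long threshold, the mapping of durations to meanings, and the isochronicity index are all assumptions introduced here, not quantities defined in the studies cited above.

```python
def decode_referential(intervals_ms, threshold_ms=400):
    """Referential reading: first interval encodes size, second encodes brightness.

    Assumed mapping (following the example in the text): a short first interval
    means 'small', a long one 'large'; a long second interval means 'dark',
    a short one 'bright'.
    """
    first, second = intervals_ms
    size = "large" if first >= threshold_ms else "small"
    brightness = "bright" if second < threshold_ms else "dark"
    return size, brightness

def isochronicity(intervals_ms):
    """Signalhood reading: 1.0 for two equal intervals, lower as they diverge."""
    first, second = intervals_ms
    return 1 - abs(first - second) / (first + second)

print(decode_referential((200, 800)))  # ('small', 'dark')
print(decode_referential((800, 200)))  # ('large', 'bright')
print(isochronicity((500, 500)))       # 1.0
print(isochronicity((200, 800)))       # well below 1.0
```

Under the referential reading, a signaler restricted to isochronous pairs can only ever send two equal intervals and so loses access to the 'small, dark' and 'large, bright' meanings of this toy code. Under the signalhood reading, the same restriction is precisely what maximizes the signal.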

There is a close match between the most precise levels of isochrony that humans are capable of producing and those they are capable of perceiving (Madison and Merker, 2002; Merker et al., 2009). This match also offers some support for the hypothesis that isochrony might have been shaped for communicative purposes. In other words, a communication system, and in particular one that takes advantage of, and evolves from, perceptual biases (Ryan, 1998), will show a match between features of the signal and the capacities to perceive those features. For example, the plumages of many bird species reflect ultraviolet light, which humans and other species cannot see, while conspecific birds can readily perceive and use to select a mate (Andersson and Amundsen, 1997; Vorobyev et al., 1998; Eaton, 2005). We hypothesize that an analogous process might have resulted from isochrony (expanding on Merker, 1999, 2000), if this were a communicative trait. In particular, a communication system employed to transmit information about deviations from an isochronous pulse would evolve toward levels of precision comparable between production and perception (Merker, 1999, 2000). This comparable precision is exactly what can be observed in human motoric and perceptual isochrony (Madison and Merker, 2002; Merker et al., 2009), offering some preliminary, indirect support for a possible communicative function of isochrony.

Isochrony does not appear to be used in the overt communication of modern humans, but might have played a role in some form of communication employed by our ancestors. In fact, isochrony is the optimal way to establish synchronized group signaling because it makes the duration of the next interval perfectly predictable by another person or conspecific (Merker et al., 2009). This musical perspective on the evolution of isochrony connects to turn-taking, which is a crucial component of human language (**Figure 8**). Turn-taking allows speakers to interact effectively in conversation: it prevents speakers' utterances from overlapping, while still enabling utterances to occur within a reasonable amount of time from each other. Interestingly, turn-taking in language is both predictive and exogenous, but seems to lack isochrony, except maybe in a few special cases. Still, turn-taking exhibits a particular temporal structure (Stivers et al., 2009; Levinson and Torreira, 2015). This structure appears to arise from a constant 200 ms lag (**Figure 9C**) between the ends and starts of utterances across cultures (Stivers et al., 2009), rather than a lag between the starts of consecutive utterances. This fixed-interval delay contrasts with the slightly positive or negative lags found in animal synchronization experiments (**Figures 9A,B**), and the anticipatory reaction in human musical synchronization. So, in modern humans, turn-taking is far from isochrony (except when it is a by-product of utterances having the same duration within and between speakers), but it might promote isochrony (Schultz et al., 2016). This makes turn-taking in modern organisms a potential approach to understanding the evolution of isochrony (see **Figure 8**).
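The difference between a constant end-to-start lag and isochrony of onsets can be made concrete with a small sketch. It assumes an idealized conversation in which each utterance starts exactly 200 ms after the previous one ends; the utterance durations and the helper names `onset_times` and `inter_onset_intervals` are hypothetical, chosen only to illustrate the point.

```python
def onset_times(durations, gap=200):
    """Onset times (ms) of successive utterances separated by a constant
    end-to-start gap, as in idealized turn-taking."""
    onsets, t = [], 0
    for d in durations:
        onsets.append(t)
        t += d + gap  # the next turn starts `gap` ms after this one ends
    return onsets


def inter_onset_intervals(onsets):
    """Intervals between the starts of consecutive utterances."""
    return [b - a for a, b in zip(onsets, onsets[1:])]


# Hypothetical utterance durations, in milliseconds.
equal_durations = [800, 800, 800, 800, 800]
unequal_durations = [600, 1500, 400, 900, 1200]

ioi_equal = inter_onset_intervals(onset_times(equal_durations))
ioi_unequal = inter_onset_intervals(onset_times(unequal_durations))

print(ioi_equal)    # [1000, 1000, 1000, 1000]: onsets are isochronous
print(ioi_unequal)  # [800, 1700, 600, 1100]: same gap, no isochrony
```

With a fixed gap, the onsets are isochronous only in the special case where all utterances have the same duration, which is the by-product case mentioned above.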

## EVOLUTIONARY HYPOTHESES AND FUTURE EMPIRICAL WORK

In conclusion, isochrony does not appear to relate to any current selection pressure. This is not surprising: a large number of evolved, heritable traits do not readily map to clear selection pressures in extant species. Instead, the pressures giving rise to isochrony might be sought in ancient humans, operating at some point between now and the split between our ancestor and that of chimpanzees/bonobos. The most articulate hypothesis to date proposes a multistage model (Merker, 1999, 2000; Merker et al., 2009). According to this, a recent ancestor to modern Homo sapiens would have been exposed to a selection pressure to attract migrating conspecific females. Accordingly, group vocalizations would have provided a conspicuously loud signal. The more individuals that managed to synchronize their calls, the greater the sound intensity, and the farther its reach in the terrain. In turn, the easiest way to achieve synchronization is to produce an isochronous signal, which is maximally predictable by the other callers, leading to an isochronous, synchronous chorus (Merker et al., 2009). We further extend this idea, suggesting that early stages of vocal coordination could be precursors to two modern human traits: isochronous signaling in music and anisochronous turn-taking in language (**Figure 8**).

Our perspective focuses on the function of isochrony (Merker et al., 2009; Ravignani et al., 2014b), rather than its underlying ontogeny and phylogeny (Tinbergen, 1963). Hence, most ideas presented here are in principle compatible with other hypotheses focused on developmental trajectories or mechanisms (Iversen, 2016; Teie, 2016). The functional reason why isochrony appeared in human evolution might therefore not be directly derivable from how isochrony appears in development, or from its neurobehavioral mechanisms. The mechanisms underlying isochrony which manifest through development (Eerola et al., 2006) might also be reflected in evolution, though they do not need to be (see Gould, 1977, and subsequent debate).

Research on modern humans provides some grounds, in principle, for isochrony to have been an evolutionary selected trait. In particular, isochronous timing seems quite variable across individuals, and enhanced by learning (Max and Yudman, 2003; Manning and Schutz, 2016; Tierney et al., 2017). Individual differences and learning plasticity are neither necessary nor sufficient conditions to show that a trait, such as isochrony, underwent evolution by natural selection. However, individual differences are often a prerequisite for natural selection to act on a trait. Likewise, learning plasticity can be an outcome of an evolutionary process acting on behavior and cognition, rather than on a physical trait.

For the purpose of the present paper, it would be interesting to find the genetic and neuro-hormonal biological substrates responsible for perception and production of isochronous behavior both in humans and other animals. In fact, as isochrony appears as a relatively simple behavioral trait, study of its neurogenetic and hormonal substrates might prove an initial building block to understand rhythm more in general.

Our suggestions can be readily tested along several strands of empirical research in humans. Temporal interactions in groups of animals are known to lead to isochrony as one of the equilibrium outcomes (Sismondo, 1990; Greenfield and Roizen, 1993; Kahn et al., 2014). Human data on turn-taking may be re-analyzed, asking whether the isochrony of each partner entails the best predictability of turn-taking vs. constant-lag alternation. Likewise, the large body of research on isochrony perception and production across modalities and domains may be synthesized (Iversen et al., 2015; Celma-Miralles et al., 2016), to examine (1) the limits and boundaries of the human sense of isochrony, and (2) which experiments are lacking that would allow comparability across domains and modalities. In general, the field of rhythm would benefit from a tighter connection between individual and group processes: individual behavioral traits do not evolve in a vacuum, and individual timing might be modulated by social factors (Ravignani et al., 2014a,b; Ravignani and Cook, 2016; Schirmer et al., 2016). For instance, in some primate and avian species, singing is accurately timed with the group depending on the sex and social status of each individual singer (Mann et al., 2006; Gamba et al., 2016). With this logic in mind, we can list a number of specific outstanding questions:


• How do different individual group interaction modes, for instance coordination and competition, map on to temporal patterns produced, such as isochrony (Ravignani et al., 2014a)?

These questions are not only relevant from the perspective of human cognitive neuroscience and animal behavior. They are also key to test evolutionary hypotheses, where the fitness landscape might be influenced by social factors and cultural niches (Tomasello, 2009; Boyd et al., 2011; Kendal, 2011).

Finally, while selection pressures in our ancestors are difficult to reconstruct, their effects might still be observable in the behavioral tendencies, genome and neuroendocrine system of modern humans (Holmquist and Vestin, 2010; Madison, 2011; Björk, 2013; Madison et al., 2017, in press). For instance, recent studies have mapped musical and rhythmic phenotypes to genes and hormonal profiles (Mosing et al., 2015; Miani, 2016a,b); more focused studies linking biology and psychology are needed for the specific trait(s) underlying isochrony.


#### AUTHOR CONTRIBUTIONS

AR wrote a first draft of the manuscript. GM provided references, guidance, and advice, and edited the manuscript.

#### FUNDING

AR has received funding from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No. 665501 with the Research Foundation Flanders (FWO) (Pegasus<sup>2</sup> Marie Curie fellowship 12N5517N awarded to AR), a visiting fellowship in Language Evolution from the Max Planck Society (awarded to AR), and ERC grant 283435 ABACUS (awarded to Bart de Boer).

#### ACKNOWLEDGMENT

AR is grateful to Steve Levinson for advice and support, and to Bill Thompson and Heikki Rasilo for the stimulating discussions in Tennessee, which inspired this paper.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Ravignani and Madison. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Evolution of Musicality: What Can Be Learned from Language Evolution Research?

Andrea Ravignani<sup>1,2,3</sup>\*, Bill Thompson<sup>1,2</sup> and Piera Filippi<sup>1,4,5,6</sup>

<sup>1</sup> Department of Language and Cognition, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands, <sup>2</sup> Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium, <sup>3</sup> Research Department, Sealcentre Pieterburen, Pieterburen, Netherlands, <sup>4</sup> Institute of Language, Communication and the Brain, Aix-en-Provence, France, <sup>5</sup> Laboratoire Parole et Langage LPL UMR 7309, Centre National de la Recherche Scientifique, Aix-Marseille Université, Aix-en-Provence, France, <sup>6</sup> Laboratoire de Psychologie Cognitive LPC UMR7290, Centre National de la Recherche Scientifique, Aix-Marseille Université, Marseille, France

Language and music share many commonalities, both as natural phenomena and as subjects of intellectual inquiry. Rather than exhaustively reviewing these connections, we focus on potential cross-pollination of methodological inquiries and attitudes. We highlight areas in which scholarship on the evolution of language may inform the evolution of music. We focus on the value of coupled empirical and formal methodologies, and on the futility of mysterianism, the declining view that the nature, origins and evolution of language cannot be addressed empirically. We identify key areas in which the evolution of language as a discipline has flourished historically, and suggest ways in which these advances can be integrated into the study of the evolution of music.

#### Edited by:

Aleksey Nikolsky, Independent Researcher, Los Angeles, CA, United States

#### Reviewed by:

Mark Reybrouck, KU Leuven, Belgium; Steven Brown, McMaster University, Canada

\*Correspondence: Andrea Ravignani andrea.ravignani@gmail.com

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

Received: 15 September 2017; Accepted: 10 January 2018; Published: 06 February 2018

#### Citation:

Ravignani A, Thompson B and Filippi P (2018) The Evolution of Musicality: What Can Be Learned from Language Evolution Research? Front. Neurosci. 12:20. doi: 10.3389/fnins.2018.00020

Keywords: evolution of music, evolution of language, cultural transmission, cultural evolution, universals, music cognition, comparative cognition, nature and nurture

## INTRODUCTION

Language and music are typical human behaviors, absent in our closest living animal relatives. Since behavior and cognition do not fossilize, the earliest stages of language and music in our ancestors can only be reconstructed indirectly. Here, we argue for the value of taking into account the evolution of language when studying the evolution of music. There are multiple reasons for this. First, from a meta-scientific perspective, the empirical investigation of language evolution predates research on the evolution of music. Second, methodologically, the fields of music and language evolution show several commonalities (**Table 1**). For instance, in both fields, corpus-based research is complemented by laboratory-based psychological testing, electrophysiology and neuroimaging studies, and comparative experiments on animals. Third, many hypotheses concerning the origins of music also involve language, and vice-versa. Studying language and music within a common framework provides key insights and testable hypotheses in both disciplines. Here, we argue that anti-empiricist views on language have detrimental effects on understanding its origins and evolution. We identify equivalent anti-scientific tendencies in musicology. We suggest a few ways to counteract such effects that have been established for linguistics, proposing that similar approaches should be adopted in musicology.

Language is defined here as the ability to produce and understand verbal units within interactional communication acts. A major issue in musicology is finding an operational definition of music. Cross (2003, p. 79) defines music as "embodying, entraining and transposably intentionalising time in sound and action." Following Honing et al. (2015), we distinguish the notions of "musicality" (a set of traits that evolve under the constraints of our cognitive and biological systems and shape musical behaviors across cultures) and "music" (a socio-cultural artifact building on the biological inclination for musicality).

TABLE 1 | Disciplines which can contribute to understanding the evolution of musicality, and their correspondence with related disciplines in the evolution of language.

#### LANGUAGE AND MUSIC: DIFFERENCES AND COMMONALITIES

In language, meaningless phonemes are concatenated into larger discrete units, such as morphemes and words, in accordance with phonological and morpho-syntactic rules. These units are arbitrarily linked to meanings and conceptual representations, which may be culturally transmitted (though sound-meaning mapping in language is not always arbitrary: Monaghan et al., 2012; Parise and Spence, 2012; Imai and Kita, 2014; Dingemanse et al., 2015, 2016). In contrast, while musical tones can be concatenated into phrases or melodies according to structural rules, they are not usually arbitrarily linked to external meanings. Furthermore, while musical melodies are typically made of discrete pitches at fixed interval scales and tonal centers (in the case of tonal music), spoken language involves continuous pitch rises and falls (Jackendoff, 2009).

Despite these differences, language and music share several cognitive underpinnings. A number of studies have identified common cognitive mechanisms involved in the production and perception of structural relations in both instrumental music and propositional morpho-syntax (Fedorenko et al., 2009; Patel, 2010). For instance, timing principles are used in both language and music (Ravignani et al., 2017), where longer and louder units tend to be perceived as stressed, and changes in pitch modulation orient perception of boundaries between stressed and unstressed units (Cutler et al., 1997; Curtin et al., 2005). In addition, empirical evidence from brain imaging research indicates that amusic participants show deficits in fine-grained perception of pitch (Peretz and Hyde, 2003); patients fail to distinguish a question from a statement solely on the basis of changes in pitch direction (Patel et al., 2008; Liu et al., 2010). This observed difficulty in a sample of amusic patients supports the hypothesis that music and speech intonation share specific neural resources for processing pitch patterns (but see Ayotte et al., 2002). Further brain imaging studies report a considerable overlap in the brain areas involved in the perception of pitch and rhythm patterns in words and songs (Zatorre et al., 2002; Merill et al., 2012), and in sound pattern processing in melodies and linguistic phrases (Brown et al., 2006). In adults and children, musical training facilitates syllabic and pitch processing in language (Schön et al., 2004; Besson et al., 2007).

Similarly, in both music and verbal language, emotions are expressed through similar patterns of pitch, tempo and intensity (Scherer, 1995; Juslin and Laukka, 2003; Bowling et al., 2012). For instance, in both channels, happiness is expressed by fast speech rate/tempo, medium-high voice intensity/sound level, medium-high frequency energy, high fundamental frequency (F0)/pitch level, high F0/pitch variability, rising F0/pitch contour, and fast voice onsets/tone attacks (Juslin and Laukka, 2003). Importantly, the use of voice modulation to express emotional information within interpersonal communication might have had adaptive value in the early species of our genus Homo, improving their ability to respond appropriately to survival opportunities (Mithen, 2005; Filippi, 2016; Frijda, 2016; Filippi et al., 2017a,b).

Finally, in addition to potentially shared cognitive foundations, music and language share the feature of being socially learned from the behavioral outputs of other individuals. It is well-established that the process of transmission from one generation to another is an important part of the cultural evolutionary process that shapes languages (Kirby et al., 2014). It is becoming clear that cultural evolutionary processes may play a similar role in shaping music (Cross, 2001, 2009; Reybrouck, 2013; Ravignani et al., 2016a; Fitch, 2017a; Ravignani and Verhoef, 2018). Cultural evolution has emerged as a unifying framework in the language sciences (Christiansen and Chater, 2008), linking the cognitive bases for language with the diversity of languages observable throughout the world. A similar approach in the study of the evolution of music is desirable (Trehub, 2015).

## RESEARCH ON THE EVOLUTION OF MUSICALITY: PITFALLS TO AVOID

In the last decades, researchers from multiple disciplines have joined forces to unveil the origins and evolution of language. Despite patent progress in methodologies and applications (as attested by highly influential publications), a group of scientists, linguists and philosophers have strongly criticized the whole field of language evolution. This is particularly interesting because the "attack" has been led by one of the fathers of modern linguistics and cognitive science, Noam Chomsky.

A central theme of Chomsky's critiques of the empirical study of language is mysterianism: the idea that scientific knowledge is not always attainable (Chomsky, 2009, 2015; Piattelli-Palmarini et al., 2009). This would be due to the architecture of our minds imposing hard limits on what we can discover scientifically. In other words, Chomsky argues that the limits of human cognition hinder our capacity to unveil scientific mysteries through scientific investigation (Ravignani and Thompson, 2017). This perspective is not new, and had already been proposed in the cognitive sciences in general (McGinn, 1989).

Chomsky makes a distinction between I-language and E-language (Chomsky, 2015). I-language refers to internal linguistic representations, a universal language of thought also known as universal grammar, which, according to Chomsky, is innate. In his view, I-language has a dedicated brain area and is the only aspect of language worth studying, since it is not subject to variation through time and cultures. From his perspective, the ease with which infants acquire any of the hugely diverse set of grammatical structures observable across natural languages points to the existence of this innate universal grammar underlying language acquisition. Instead, E-language encompasses the multiple languages, i.e., the strings of sounds uttered in the outside world, which vary across individuals and cultures. Chomsky claims that processes underlying development and evolution of the wide variety of E-language instances cannot be investigated empirically. In fact, his ideas of mysterianism applied to language lead Chomsky to a simple conclusion: the most important questions about language (his I-language), including its cultural nature, its origins and development over time, and its acquisition are potentially unanswerable (Chomsky, 2015; Ravignani and Thompson, 2017). Notably, the I-language vs. E-language distinction parallels that between musicality, the human cognitive-biological innate predispositions underlying music perception and production, vs. music, intended as a cultural product (Honing et al., 2015; Honing, 2017).

Chomsky has a strong influence on debates about language evolution, and there are clear parallels between the fields of music and language evolution. Here we suggest in the strongest possible terms that research on the evolution of music should avoid Chomskyan mysterianism. By definition this perspective is scientifically stagnant, and the theoretical commitments driving Chomsky's mysterianism are widely rejected in the language sciences at large (Boeckx and Theofanopoulou, 2015; Corballis, 2017; Fitch, 2017b; Kirby, 2017; Ravignani and Thompson, 2017). Other schools of thought show some commonalities to mysterianism (McClelland, 2013; Coleman, 2016).

Another potential attack on the approach we propose comes from a field far away from Chomskyan thought. In particular, generations of ethnomusicologists and cultural anthropologists have strongly opposed the very idea of investigating evolutionary and cross-cultural features of music (Vandor, 1980; Nettl, 2005; Nattiez, 2012). These scholars often object that musical cultures from different parts of the world cannot be compared because comparison would occur through the eyes of a scholar bound to a specific culture. They argue that, across cultures, the word "music" maps to different meanings, and likewise what we call music in the Western world can be translated in many different ways across cultures and languages. According to most of these scholars, the concept of a "universal" (a feature found across cultures more often than not, above chance, e.g., music often entailing percussion instruments) is pointless because it conflicts with cultural specificity. More generally, several scholars argue there is an irreconcilable divide between humanities and sciences (Gourlay, 1984; Cohen, 2001).

Both Chomskyan and anthropological schools of thought, while departing from opposite philosophical stances, reach the same theoretical conclusion: the nature of language and music is mostly unknown, and empirical efforts to unveil it are pointless. We strongly disagree with this conclusion. Even a cursory glance at the contemporary literature on language and cognition reveals astounding progress in what we know about these topics (Fitch, 2017c). Here we simply reiterate that this progress mostly results from broad contemporary adoption of the scientific method. Our view is that music-related disciplines can benefit equally from rejection of mysterianist skepticism and continued adoption of an integrated experimental approach, namely: (1) observing behaviors and the environment in which they occur, (2) formulating a hypothesis, (3) testing the hypothesis by performing an experiment or collecting new data, (4) using the results to build a model of the phenomenon of interest, (5) employing the model to generate a more refined hypothesis to be tested empirically.

Against the Chomskyan and anthropological perspectives above, we argue for an empirical approach to the origin and evolution of music and musicality. The new-born discipline of music evolution will benefit, and has already benefitted, from unifying the following approaches into one research framework: (1) empirical investigations, as opposed to armchair speculation, on ontogenetic and phylogenetic evolution of music; (2) comparative research, addressed by probing for the presence of proto-musical behaviors in other animal species; (3) cross-cultural work, recognizing the diversity of world musical behaviors while comparing them to find common patterns; (4) proposing alternatives to the classical nature-nurture dichotomy. Below we discuss these four points succinctly.

## REPLACING ARMCHAIR SPECULATION WITH MODELS

First, centuries of scientific practice have shown that tightly integrating theory and empirics typically leads to scientific progress. The study of the evolution of music or language is not an exception to this. Theoretical frameworks should provide testable empirical questions (Iversen, 2016), insights for good experimental design and conceptual frameworks to interpret statistical results and generate new testable hypotheses. Ideally, theoretical contributions should be formulated as mathematical models and computer simulations. One advantage of modeling, as compared to constraint-free theorizing, is modeling's potential for falsifiability: Models rely on assumptions to make predictions, which can be empirically falsified. Another related advantage of models is their potential integration with experiments: model assumptions and predictions can be promptly translated into experimental constraints, in turn testable on humans or other living organisms. For these reasons, we strongly support the use of quantitative models of cognitive (Perfors et al., 2011; Fitch, 2014), cultural (Tamariz and Kirby, 2016), and evolutionary (Thompson et al., 2016) processes in music evolution research.

## COMPARATIVE COGNITION CAN INFORM HUMAN EVOLUTION

Second, the comparative approach to animal cognition can be useful in reconstructing the evolution of human behaviors. Proto-musical behaviors may emerge in other species by (1) homology, i.e., our last common ancestor with that species was endowed with a predisposition toward the behavior under scrutiny, or (2) convergent evolution, i.e., similar evolutionary pressures gave rise to similar genetic predisposition for proto-musical behaviors in humans and other species (Fuhrmann et al., 2014; Ravignani et al., 2014, 2016b; Wilson and Cook, 2016). For instance, recent studies found evidence for beat perception and production, relative pitch and tonal encoding (Hoeschele et al., 2015; Hoeschele and Bowling, 2016), octave generalization (Crickmore, 2003), and consonance (Cook and Fujisawa, 2006) in animals. Based on theoretically driven empirical research (Honing et al., 2015), we argue that, if musical tasks designed for humans are adapted (by modifying their form, not their substance) to the specific species under inquiry, many "unthinkable tasks" sensu Chomsky (2015) may become manageable. In contrast, based on purely theoretical introspection, Chomsky argues that the cognitive differences between humans and other animals are a matter of quality, not quantity. For instance, he claims that "rats cannot deal with a prime number maze" (Chomsky, 2015; p. 105). To this, we respond that cicadas show behavioral patterns based on prime numbers (Grant, 2005; Tanaka et al., 2009); hence it is tenable that these insects could solve a cicada-adapted prime maze, say over an evolutionary timescale. Likewise, if precursors to music are studied across species with an open-minded attitude, suggestive parallels can be found across humans and other species.

## CROSS-CULTURAL COMPARISONS

Third, an equally important sort of comparative approach consists in cross-cultural research on music. Acceptance of cross-cultural work exhibits one of the starkest contrasts between the study of language and that of music. Cross-cultural comparison of languages has proceeded unhindered for centuries, with few roadblocks. The comparative study of music, instead, experienced a golden period before a crashing halt in the 1960s. Music research can learn from evolutionary linguistics, by performing more cross-cultural work. Its purpose is to account for the uniformity and diversity of musical forms across cultures, in turn to find patterns of music and musicality which are truly generalizable to mankind (the so-called universals; Savage et al., 2015).

## BEYOND THE NATURE-NURTURE DICHOTOMY

Fourth, language research has been historically characterized by an overreliance on the nature-nurture dichotomy. In our view, the evolution of language as a scientific endeavor has long been plagued by a spurious dichotomy that need not be imported into the evolution of music (Fitch, 2011). We see this opposition as reflecting unnecessarily strong but historically entrenched theoretical divisions. On the one hand, generative linguists in the Chomskyan school have traditionally sought an exclusively mind-internal explanation for the complex tapestry of structures and operations observable in languages (although Chomsky's contemporary mysterianism questions the future of this endeavor). On the other hand, empiricists have privileged the mechanisms of human interaction and cultural transmission as explanations for the diversity and complexity of languages. These opposing standpoints have fuelled much debate but little consensus (e.g., Carruthers et al., 2005).

FIGURE 1 | … by behaviors of the species populating it. For a given generation or time period, genes affect behavioral patterns. Behavioral patterns also adapt to, and modify, their environmental medium. In turn, behaviorally-driven changes in the environment might affect the fitness landscape of a species, influencing in turn which genes will be passed on to the next generation.

Contemporary nativist-empiricist dialogues in the cognitive sciences at large now focus on developing paradigms that generalize this dichotomy into a continuous range of possibilities, with traditional nativism and empiricism at the extremes. One example of such a paradigm is Bayesian cognitive modeling, in which bio-cognitive constraints and empirical learning are integrated by a theory of subjective inference based on the principles of conditional probability. The Bayesian approach provides a framework in which both empirical evidence and innate (or earlier learned) biases can be expressed as influences on how individuals learn. While earlier approaches have treated these influences as distinct alternatives, the Bayesian approach allows the two to be balanced in a formally explicit way. Several scholars have argued that approaches like this, in which any flavor of theory can be formally instantiated, interrogated, and tested against empirical data, are the future of nativist-empiricist dialogues (e.g., Spelke and Kinzler, 2009). We hope this sort of integrative approach can be imported into the evolution of music (Trehub, 2015; Jacoby and McDermott, 2017; Ravignani et al., accepted). For example, the generative program applied to music (Lerdahl and Jackendoff, 1985; Rohrmeier, 2011) has generated a range of predictions which have been tested empirically (e.g., Koelsch et al., 2013).
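How a Bayesian learner balances prior bias against evidence can be sketched with a toy example. The sketch assumes a Beta-Binomial learner, a standard textbook model chosen here only for illustration and not taken from the studies cited above; the prior parameters, the data, and the function name `posterior_mean` are all hypothetical.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a Beta(prior_a, prior_b) learner after observing
    `successes` occurrences of a pattern in `trials` observations."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Hypothetical data: the pattern occurs in 2 of 10 observed cases.
successes, trials = 2, 10

# A nearly unbiased learner: the estimate follows the data closely.
weak_bias = posterior_mean(1, 1, successes, trials)     # 3 / 12 = 0.25

# A learner with a strong prior bias toward the pattern: the same
# data move the estimate much less.
strong_bias = posterior_mean(20, 2, successes, trials)  # 22 / 32 = 0.6875

# With 100 times more data in the same proportion, evidence dominates.
strong_bias_more_data = posterior_mean(20, 2, 200, 1000)

print(weak_bias, strong_bias, round(strong_bias_more_data, 3))
```

The two prior parameters play the role of a bias (innate or previously learned), and the observation counts play the role of empirical evidence. Strict nativism and strict empiricism then correspond to the limits of a very strong or a very weak prior, with a continuum in between.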

In the language sciences, a new generation of researchers is pushing this approach, integrating models and empirics, even further toward larger-scale evolutionary questions (Blasi et al., 2017). This effort focuses on process-based explanations for behavior, allowing us to understand our species as a unique interaction of bio-cognitive and cultural processes (**Figure 1**) via evolutionary modeling (Smith and Kirby, 2008). Modeling the evolution of minds as part of cultural systems enables us to formalize theories concerning optimal divisions of labor between specialized individual minds and the cultural processes that connect them (Thompson et al., 2016; de Boer and Thompson, 2018). Free from theoretical commitments to one evolutionary process dominating another, contemporary language scientists have at their explanatory disposal an evolutionary framework more powerful than exclusively biological or cultural explanations for behavior: co-evolution, in which evolving minds shape new behaviors, and evolving behaviors shape new minds. Like the Bayesian paradigm, co-evolutionary approaches to the origins of our abilities allow us to move past the idea that only biology or only culture is relevant to unique human behaviors, in a formally explicit way. Hence, modern approaches to the study of language evolution (e.g., Kirby et al., 2014; Thompson et al., 2016) show how biological and cognitive theory can be used to develop process-based, experimentally testable hypotheses about the emergence of behavior among culturally interacting individual minds (Trehub, 2015; Ravignani et al., accepted).

## CONCLUSIONS

We hope the evolution of music can hit the ground running by adopting the inclusive approach described above (see Bowling et al., 2017). The main hypothesis under scrutiny is: does our species' unique blend of biological and cultural features underpin remarkable human behaviors like language and music (**Figure 1**)?

The distinction between music and musicality provides a practical advantage in designing and interpreting experiments. Still, empirical research on the origins of music should adopt a hybrid approach, complementing experiments in tightly controlled settings (hence targeting only music or musicality) with research which integrates the two domains and explanatory levels.

Above we only describe the importance of one possible flow of ideas, from language to music. However, the inverse is also needed: More linguists would need to learn about music evolution and cognition research. For instance, phoneticians and phonologists should capitalize on music findings when investigating tone, prosody, rhythm, etc. (Brown, 2017).

In sum, recent debates provide a bird's-eye view of how the science of language has historically developed, and partly branched into the stormy study of language evolution. Scientists addressing music can benefit from historical breakthroughs and dead-ends in the study of language evolution, and use these insights to accelerate discovery in one of the most exciting topics in contemporary cognitive science, the evolution of music.

## AUTHOR CONTRIBUTIONS

All authors listed have made substantial, direct, and intellectual contributions to the work, and approved it for publication.

## FUNDING

AR was supported by funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 665501 with the Research Foundation Flanders (FWO) (Pegasus<sup>2</sup> Marie Curie fellowship 12N5517N awarded to AR) and ERC grant 283435 ABACUS (awarded to Bart de Boer). PF was supported by grants ANR-16-CONV-0002 (ILCB), ANR-11-LABX-0036 (BLRI) and the Excellence Initiative of Aix-Marseille University (A\*MIDEX). All authors were supported by a visiting fellowship in Language Evolution from the Max Planck Society.

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Ravignani, Thompson and Filippi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Commentary: The 'Musilanguage' Model of Language Evolution

#### Aleksey Nikolsky\*

Braavo Enterprises, Los Angeles, CA, United States

Keywords: texture, heterophony, polyphony, homophony, musilanguage, asynchrony, isophony, meter

#### **A commentary on**

#### **The 'Musilanguage' Model of Language Evolution**

by Brown, S. (2000). The Origins of Music. eds S. Brown, B. Merker, and N. L. Wallin (Cambridge, MA: MIT Press), 271–300. doi: 10.1037/e533412004-001

The model of musilanguage (Brown, 2000, 2017) requires a new musicological term to refer to its texture. Like choral singing, and unlike speech, musilanguage is based on simultaneous vocalization of multiple participants who reproduce the same signal (call) at random time intervals and pitch levels, akin to a wolf "chorus."<sup>1</sup> Voiced utterances produce multiple pitches, generating a "jumbled" texture, similar to polyphony and heterophony, but not fully qualifying as such.

The term "heterophony" was introduced by Stumpf (1897) to refer to the fusion of sounds whose components retained their singular identity. Stumpf discovered this word in Plato's Laws (Plato, 2013, p. 203), where it referred to pitch and rhythm discrepancy between the vocals and the lyre performing the same tune. Four years later, Stumpf (1901) reused this term to describe Thai music. He characterized heterophony as the looping of simultaneous melodic paraphrases [Umspielen], where parts generally followed the same melodic contour while differing in detail, so that minute discrepancies would meet again in unison.

Adler (1908) conceptualized heterophony as a style alternative to homophony and polyphony, applicable across Siamese, Japanese, Javanese, and Russian musics. He specifically found Russian heterophony to present a paradigm of heterophonic arrangement, designed to make melodic repetition less monotonous and more idiosyncratic for each singer's voice.

The Russian Musical Encyclopedia defines heterophony as a multi-part music generated by the collective performance of the same melody, where parts contain deviations from the principal melodic formula (Mueller, 1973). Such organization is regarded as a general textural type of ornamental, harmonic, and/or polyphonic variation that can complicate classification. Indeed, already in 1911, Stumpf criticized Adler for misapplication of the term (Stumpf and David, 2012). The keynote of heterophony is an ongoing melodic repetition with numerous intermittent variations, which seems to apply to musilanguage chorusing (Brown, 2007). However, such chorusing contains no synchronization, whereas heterophony implies prevalent synchronization of parts.

Swan (1943) defined heterophony as "a principal melody improvised simultaneously by several singers, retaining its main outline in each voice, yet showing enough independence to result in places in 2- and 3- and even 4-part harmony."<sup>2</sup> The Grove Dictionary (Cooke, 2001), following Swan's definition, emphasizes a collective synchronized execution. Although the notation example provided in the Grove article shows a consistent misalignment of 5 parts (Knudsen, 1968), such heterophony is rare and does not sound "jumbled."<sup>3</sup> Its asynchronicities remain minimal (<half-a-beat). Longer asynchronies (≥beat) generate polyphonic imitations, where the same melody becomes deliberately distributed between multiple parts to produce juxtapositions at certain temporal intervals.<sup>4</sup>

<sup>1</sup>The sample of a wolf pack howling can be heard at: http://chirb.it/kFLxIC. It is characterized by the reproduction of the same call at various speeds and pitch levels, without any synchronization of the onset and termination points of the calls.

<sup>2</sup>I have italicized those words in the definition that imply the general coordination in time between the performances of all the participants.

#### Edited by:

Timothy L. Hubbard, Arizona State University, United States

#### Reviewed by:

Steven Brown, McMaster University, Canada

> \*Correspondence: Aleksey Nikolsky aleksey@braavo.org

#### Specialty section:

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

Received: 09 November 2017 Accepted: 18 January 2018 Published: 26 February 2018

#### Citation:

Nikolsky A (2018) Commentary: The 'Musilanguage' Model of Language Evolution. Front. Psychol. 9:75. doi: 10.3389/fpsyg.2018.00075

#### TABLE 1 | Feature comparison of 4 types of musical texture in a multi-part ensemble.

Polyphony is "a style of simultaneously combining a number of parts, each forming an individual melody and harmonizing with each other" (Oxford Dictionary). Despite its association with Western art-music, polyphony penetrated Western popular (Bukofzer, 1940) and traditional music (Ahmedaja, 2011), prompting research on non-European polyphony (Arom, 2004; Jordania, 2006). Many ethnomusicologists prefer alternative terms (diaphony, disphony), while others treat "polyphony" as an umbrella term for any multi-part music, creating terminological confusion (Cooke, 2001). Current consensus defines polyphony as "a mode of expression based on simultaneous combination of separate parts, perceived and produced intentionally in their mutual differentiation, in a given formal order" (Agamennone, 1996).<sup>5</sup>

<sup>3</sup>An example of the Hebrides Psalm can be heard at: http://chirb.it/raeaOr. It presents a clear single melodic line that is "smudged" by consistent delays between different parts—presenting a rare style of melodic variation by means of a deliberate "reverberation" effect induced by the continuous sub-beat time lagging between parts.

<sup>4</sup>Well-known examples of musical texture in monothematic imitational polyphony in folk music are round-songs. In art-music, Bach's "Musical Offering" BWV 1079 contains an excellent demonstration of the polyphonic arrangement of a single melody in the so-called "riddle canons," each notated as a solo melody that is to be performed by a few musicians who enter one by one at specific time intervals.

<sup>5</sup>My italics emphasize points that distinguish polyphonic texture from heterophonic.

Polyphonic and heterophonic textures differ in orderliness: heterophonic parts are inadvertent, unlike polyphonic parts (Tallmadge, 1984). Polyphony induces individualization of parts by means of sharpening their functional contrast in texture. Hence, synchronization is even more important for polyphony: parts must align in pitch and time throughout the entirety of the music. This makes polyphonic performance metrically stricter than heterophonic performance.

Even stricter is synchronization in another "classic" texture, homophony: "music in which all melodic parts move together at more or less the same pace" (Hyer, 2001). Contrary to common belief, homophony is not bound to European music alone (Nikolsky, 2016). Its reliance on chords and harmonic intervals demands high precision in tone onsets: on the order of under 100 msec (Huron, 2001) and typically 30–50 msec (Rasch, 1988).
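The timing figures quoted for the "classic" textures (sub-half-beat lags in heterophony, lags of a beat or more in polyphonic imitation, and onset spreads under 100 msec in homophony) mix beat-relative and absolute units. The following is a minimal sketch, assuming a fixed tempo, of how a measured lag could be converted and compared against those figures. The function names and the example values are hypothetical, and the returned labels merely restate the ranges given in the text; this is not a texture classifier.

```python
from typing import Sequence


def beat_ms(tempo_bpm: float) -> float:
    """Duration of one beat in milliseconds at the given tempo."""
    return 60_000.0 / tempo_bpm


def onset_spread_ms(onsets_ms: Sequence[float]) -> float:
    """Gap between the earliest and latest onset of a nominally shared event."""
    return max(onsets_ms) - min(onsets_ms)


def describe_lag(lag_ms: float, tempo_bpm: float) -> str:
    """Place a lag relative to the figures quoted in the text (illustrative only)."""
    beats = lag_ms / beat_ms(tempo_bpm)
    if lag_ms < 100.0:
        return "under 100 msec"            # window cited for homophonic onsets
    if beats < 0.5:
        return "under half a beat"         # range cited for heterophonic lags
    if beats >= 1.0:
        return "a beat or longer"          # range cited for polyphonic imitation
    return "between half a beat and a beat"  # not addressed by the text


# Hypothetical onsets (msec) of three parts attacking the same chord.
spread = onset_spread_ms([0.0, 35.0, 20.0])
print(spread, describe_lag(spread, tempo_bpm=60.0))
```

At 60 beats per minute a beat lasts 1000 msec, so a 300 msec lag is 0.3 of a beat; at faster tempi the same absolute lag covers a larger fraction of the beat, which is why the two kinds of threshold cannot be compared without fixing a tempo.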

All "classic" textures rely on harmonic, metric, and thematic integrity of parts. Performers attune their performance to the pitch of their partners, the manifestation of beat in their rhythms, and the distribution of musical material across parts—what musicologists call "thematic material" and consider an expressive point of a musical work by which it can be remembered (Drabkin, 2001). In this semiotic sense (Réti, 1951), the notion of thematicity is applicable to folk and non-Western music (Mazel, 1960). However, harmonicity, metricity, and thematicity are inadmissible for musilanguage. Even modifying "classic" terms (e.g., "jumbled heterophony") would constitute a misnomer: musilanguage inherently lacks any form of arrangement of parts.

Since musilanguage occupies an evolutionary position between the "natural" animal vocalizations and the simplest human oral communication, it predates mode, scale, and meter, and therefore heterophony and polyphony. This situation calls for a new term, isophony: a texture that uses brief calls, continuously reproduced by multiple performers with irrational deviations in timing and pitch, where each participant retains idiosyncrasy of the rhythmic, timbral, and directional attributes of the pitch contour, altogether producing a "jumbled" effect (Nikolsky, 2016, Appendix-5).<sup>6</sup> Vocalization can be considered "isophonic" if it maintains a single call as a unit of texture, scalable shorter/longer and higher/lower through the continuum of duration and frequency, with every participating part consistently reproducing that call out-of-sync in relation to the moment of its onset or termination.

Isophony contrasts with "classic" textures in its tendency to expose each participant's identity without enmeshing it into the ensemble. Isophony involves the assembly of individuals, rather than a single entity ("choir"). Isophonic tones never meet in unison or in beat, and are devoid of any form of harmonization.<sup>7</sup> Isophony's only feature of tonal organization is the uniformity of the melodic and timbral characteristics of a call. The function of isophonic texture is to attract attention to each participant's expression of the same state of mind. The important features that distinguish isophony from heterophony, polyphony, and homophony are summarized in **Table 1**.

Conceptualization of isophony as a primordial texture that predated music establishes the lineage in the morphological evolution of music, allowing comparative cross-examination of musical structures in multi-part music.

#### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

#### REFERENCES


<sup>6</sup>This appendix contains an overview of the evolution of musical texture, following the model of "histories of fine art"—see Table 2 (p. 13–16) in it.

<sup>7</sup>An example of isophony is an Akia tribe song (courtesy of Anthony Seeger): http://chirb.it/ryANbJ. The leader coins the call (vocable "Tete"), and the tribe repeats it at various pitch levels and duration values, where each participant displays their specific social status through the relative duration of the same call (Seeger, 2004).


**Conflict of Interest Statement:** AN was employed by the company Braavo Enterprises as a technical and creative director. Braavo Enterprises specializes in developing content for educational and edutainment programs for children 5–11 years of age.

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Nikolsky. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# "The Two Brothers": Reconciling Perceptual-Cognitive and Statistical Models of Musical Evolution

#### Steven Jan\*

*Music and Drama, University of Huddersfield, Huddersfield, United Kingdom*

While the "units, events and dynamics" of memetic evolution have been abstractly theorized (Lynch, 1998), they have not been applied systematically to real corpora in music. Some researchers, convinced of the validity of cultural evolution in more than the metaphorical sense adopted by much musicology, but perhaps skeptical of some or all of the claims of memetics, have attempted statistically based corpus-analysis techniques of music drawn from molecular biology, and these have offered strong evidence in favor of system-level change over time (Savage, 2017). This article argues that such statistical approaches, while illuminating, ignore the psychological realities of music-information grouping, the transmission of such groups with varying degrees of fidelity, their selection according to relative perceptual-cognitive salience, and the power of this Darwinian process to drive the systemic changes (such as the development over time of systems of tonal organization in music) that statistical methodologies measure. It asserts that a synthesis between such statistical approaches to the study of music-cultural change and the theory of memetics as applied to music (Jan, 2007), in particular the latter's perceptual-cognitive elements, would harness the strengths of each approach and deepen understanding of cultural evolution in music.

## Edited by:

*Aleksey Nikolsky, Independent Researcher, Los Angeles, CA, United States*

#### Reviewed by:

*Piotr Podlipniak, Adam Mickiewicz University in Poznań, Poland; Stephan Thomas Vitas, Independent Researcher, Washington, DC, United States; Ollie Bown, University of New South Wales, Australia*

#### \*Correspondence:

*Steven Jan s.b.jan@hud.ac.uk*

#### Specialty section:

*This article was submitted to Cognition, a section of the journal Frontiers in Psychology*

Received: *05 December 2017* Accepted: *28 February 2018* Published: *04 April 2018*

#### Citation:

*Jan S (2018) "The Two Brothers": Reconciling Perceptual-Cognitive and Statistical Models of Musical Evolution. Front. Psychol. 9:344. doi: 10.3389/fpsyg.2018.00344*

Keywords: qualitative, quantitative, perceptual-cognitive, statistical, memetics, phylomemetics, cultural evolution

## 1. INTRODUCTION: APPROACHES TO THE STUDY OF CULTURAL EVOLUTION

The dichotomy, even tension, between qualitative and quantitative research methods aligns to some extent with the "two cultures" (the artistic/humanistic and the scientific, respectively) famously outlined by Snow (1964).<sup>1</sup> While this is certainly an oversimplification, given that the two approaches often blend and that both may be deployed in the service of falsifiability (Popper, 1959), the acid test of a scientific theory, it has, until quite recently, largely been the norm, certainly in western musicology.<sup>2</sup> Nevertheless, the explosive growth in computer power, and its increasing accessibility, has, over the last two decades, put systematic approaches in the hands of scholars in the arts and humanities. In music research, such approaches are typified by the interest in "empirical [experimental, data-rich] musicology" (Cook, 2004) and, more broadly, by the current attention paid in the humanities to the promises of "big data" (Sharma et al., 2014), which allows, for instance, for large-scale statistical analysis of music-related bibliographical data (Rose et al., 2015).

<sup>1</sup>I am grateful to Valerio Velardo, Alexey Nikolsky, and the anonymous reviewers for their helpful comments on earlier versions of this article.

<sup>2</sup>In eastern Europe, there has arguably been a more thoroughgoing synthesis, the integration of artistic/humanistic and scientific methodologies being long established, for example, in Russian scholarship (Grigoryan, 2011).

Conversely, a number of research traditions in the sciences have used music data in quantitative studies, including the Music Information Retrieval Evaluation eXchange (MIREX) project.<sup>3</sup> This work stems partly from an interest in how technology can expedite music research, particularly in the fields of pattern-finding and data-retrieval, and partly from a recognition that the inherent complexity of music makes it a singular challenge for the design and implementation of computerized analytical tools. A similar motivation underpins cognitive science in music: often, its music-orientated practitioners pursue it in order to try to unravel the mysteries of the art form; whereas its science-orientated researchers wish to understand the deep embeddedness of music in multiple brain and body systems (Schulkin, 2013). Linking data-searching and analysis and cognitive science, the recent development of systems which autonomously create music (what might be termed the computer simulation of musical creativity) is testament to the power of computers to bring together research in music, artificial intelligence and cognitive science in the service of understanding what still seem to be the mysteries of creativity (Miranda et al., 2003; Boden, 2004), whether this research is motivated by artistic/humanistic or by scientific impulses.<sup>4</sup>

The study of cultural evolution has been approached from both of Snow's perspectives. From the scientific, there is a tradition of research at the interface of anthropology, sociology and evolutionary biology which uses broadly Darwinian methods to understand the spread of cultural items, including ideas, artistic traditions and artifact-manufacturing technologies (Cavalli-Sforza and Feldman, 1981), this cultural transmission sometimes being correlated with genetic transmission (Shennan, 2002). From the artistic/humanistic, there is a long tradition of research (conducted broadly under the rubric of historical musicology) of referring to change in music as in some sense evolutionary (Perry, 2000). But this ascription is largely metaphorical; that is, it documents morpho-stylistic changes in the outputs of composers, in the development of genres, or in the cultures of places or times, but it does not argue for a Darwinian (or any other algorithmic) basis as the mechanism driving this change. As an artistic/humanistic field, it clearly does not want to deny the agency of the composer, however that is understood to arise (Blackmore, 2010), just as composers do not want to deny it of and for themselves.<sup>5</sup>

By contrast, many would argue that because musical patterns, however defined, manifestly demonstrate variation, inheritance (transmission) and selection, "principles [which] apply equally to biological and cultural evolution" (Savage, 2017, p. 9), they conform to Darwin's theory of evolution by (natural) selection.<sup>6</sup> That is, such patterns (Dawkins' memes) instantiate the evolutionary algorithm, but are sequences of elements in cultural media, such "phemotypic" (extra-somatic) products as "tunes, ideas, catch-phrases, clothes fashions, ways of making pots or of building arches" (Dawkins, 1989, p. 192), which devolve to "memotypic" patterns of neuronal interconnection (Calvin, 1996; Jan, 2011; Mhatre et al., 2012), rather than biological-medium (DNA) sequences.<sup>7</sup> In this sense, "music literally evolves . . . [because] musical evolution follows patterns and processes that are similar, but not identical, to [those of] genetic evolution" (Savage, 2017, pp. 38, 22).

Accepting the memetic formalization of cultural evolution as real and not metaphorical, and using a small case study which, it is hoped, can be scaled and generalized, this article attempts to reconcile approaches drawn from the perceptual-cognitive and the statistical domains as they apply to the evolution of music.<sup>8</sup> It regards these two domains as broadly aligning, respectively, with the qualitative/quantitative distinction discussed above, although it recognizes that the perceptual-cognitive is of course formalizable and measurable (and thus partly quantitative/statistical) using the methodologies of cognitive science. In this sense, the article emphasizes the perceptual-cognitive/statistical dichotomy as arguably more meaningful for the understanding and advancement of memetics than the qualitative/quantitative.

Section 2 discusses some of the criticisms of memetics, arguing in its defense that its central claims, grounded as they are in important psychological principles, cannot be lightly dismissed. Section 3 discusses how relationships between musical patterns can be formalized using a combination of perceptual-cognitive and statistical approaches in ways that offer a robust model for the development of memetics. Section 4 follows up some implications of memetic similarity measurements, considering the representation of evolutionary relationships using taxonomic trees. Section 5 looks forward to the future integration of perceptual-cognitive and statistical approaches using computer technology.

The article offers two principal claims. The first of these is as follows: a purely statistical approach based on counting note-edits without consideration of perceptual-cognitive aspects gives an incomplete account of cultural evolution. A second, derived, claim will be outlined at the start of section 3.

<sup>3</sup>http://www.music-ir.org/mirex/wiki/MIREX\_HOME

<sup>4</sup> See also the Journal of Creative Music Systems (http://jcms.org.uk/).

<sup>5</sup> It should be remembered that the (musicological) conception of music as a series of discrete structures/objects (works, Goehr, 1992) produced by named and celebrated author-composers and notated unambiguously is a relatively recent western-European phenomenon, and that most human music is (from an ethnomusicological standpoint) communal, processive and deeply enmeshed with other media, such as dance and poetry, and with worship. In Taruskin's (1995) phrase, this is the distinction between music as text and music as act.

<sup>6</sup>Because the evolutionary algorithm (Dennett, 1995, p. 343) is substrate-neutral, it makes little sense to distinguish between "natural" and "cultural" selection—this being the principle underpinning Universal Darwinism (Dawkins, 1983b).

<sup>7</sup>Thus, and at the risk of multiplying terminology, I use "phemotype" as the memetic counterpart to the genetic "phenotype". By extension, I use "memotype" as the counterpart to "genotype" (Jan, 2007, p. 30, Table 2.1).

<sup>8</sup>Thus, it takes as its starting point the assumptions that (i) culture evolves; (ii) that this evolution is broadly Darwinian; and (iii) that memetics offers the best formalization of this cultural-evolutionary process.

To demonstrate the nature-culture similarities he hypothesizes, Savage (2017) uses techniques drawn from molecular genetics (discussed more fully in section 3) to compare the basic mutational-editing operations of note conservation, substitution, insertion and deletion (Savage, 2017, p. 53) in corpora of folk-song melodies with protein modification in biological transmission. He argues that an advantage of a "rigorously quantitative approach modeled on molecular genetics is that such quantitative approaches have shown success in rehabilitating cultural-evolutionary theory after much criticism of earlier incarnations such as Dawkins' 'memetics'" (Savage, 2017, p. 45).
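As a rough illustration of the four operations named above (conservation, substitution, insertion and deletion), the sketch below counts them along one minimum-edit alignment of two short melodies. It assumes unit costs for every edit and represents notes as plain pitch-name strings. The function name `align_ops` and the example melodies are hypothetical, and this is a generic Levenshtein-style alignment, not the scoring scheme Savage (2017) actually uses.

```python
from typing import Dict, List


def align_ops(a: List[str], b: List[str]) -> Dict[str, int]:
    """Count note edits along one minimum-edit alignment of melody a to melody b."""
    n, m = len(a), len(b)
    # dist[i][j] = minimum number of edits turning a[:i] into b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # keep or substitute
                             dist[i - 1][j] + 1,         # delete from a
                             dist[i][j - 1] + 1)         # insert from b

    # Trace back from the bottom-right corner, labelling each step.
    ops = {"conserved": 0, "substituted": 0, "inserted": 0, "deleted": 0}
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                ops["conserved" if cost == 0 else "substituted"] += 1
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["deleted"] += 1
            i -= 1
        else:
            ops["inserted"] += 1
            j -= 1
    return ops


# Hypothetical variants of one tune, as pitch names only.
ancestor = ["C", "D", "E", "G"]
variant = ["C", "E", "E", "G", "A"]
print(align_ops(ancestor, variant))
# {'conserved': 3, 'substituted': 1, 'inserted': 1, 'deleted': 0}
```

Note that such a count treats every note as equally weighted and ignores rhythm and grouping, which is the kind of limitation the article goes on to discuss.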

Criticism of memetics, which Gould called a "meaningless metaphor" (in Blackmore, 1999, p. 17; see also Kuper, 2000), has arguably been counterbalanced by as much endorsement (Dennett, 2007), or at least by the acceptance that some problems in cultural studies are readily addressed by recourse to memetics. Yet Savage is to some extent correct in his implication that a fault with memetics (assuming one accepts its fundamental premises) is that it has hitherto been formulated in a somewhat imbalanced way, with too much emphasis on the qualitative and too little on the quantitative (but see McNamara, 2011). In the terms of section 1, it might therefore be believed that it has not (yet) been formulated in such a way as to be falsifiable. Yet this is to ignore the work of several scholars who have attempted to use the insights of memetics in quantitative studies (Adamic et al., 2014); and also, perhaps more importantly, to discount the work of Lynch (1998), who has arguably made the greatest contribution to the formalization of memetics, even though his models, to my knowledge, have not yet been systematically applied or tested.<sup>9</sup>

If Savage's (2017) criticism of memetics as insufficiently orientated toward quantitative methodologies is accepted, then it is surely valuable that the more qualitative insights of memetics, often based upon introspective evaluation of the nature of certain musical patterns and their transmission across cultural time and space, are supported by quantitative work which counts and measures such phenomena systematically. This, by its very nature, implies statistical studies of large corpora. Nevertheless, the danger with such approaches, particularly the type of molecular-genetics approach adopted by Savage and his collaborators, is that they risk being focused on too low a descriptive level and may arrive at statistical generalities rather than meaningful particularities, the former being an approach not dissimilar to the "beanbag genetics" criticized by Mayr (Dronamraju, 2010; but see also Juhász and Sipos, 2010). Savage and Atkinson (2015) concede this, arguing for the importance of taking into account:

higher-level units of musical structure and meaning. In music, as in genetics, the individual notes that make up the sequences have little meaning in themselves. The phylogenetic analysis of sequences is thus merely the starting point from which to understand how and why these sequences combine to form higher-level functional units (e.g., motives, phrases) that coevolve with their song texts and cultural contexts of music-making as they are passed down from singer to singer through centuries of oral tradition (Savage and Atkinson, 2015, p. 167).

In this sense, it is important to consider, in the terms of the long-running debate in biology, the relevant units of selection (Lewontin, 1970), which requires a degree of nature-culture mapping.<sup>10</sup> While the protein sequences which Savage (2017) takes as analogous to musical sequences are useful exemplars of mutational operations, they have little evolutionary meaning in themselves. This is because genes are selected for, not nucleotides, nor, in Savage's case, the amino acids which make up the proteins whose production genes code for. Concomitantly, by focusing on discrete pitches (equated by Savage with the component amino acids of proteins) one is neglecting psychologically meaningful groups of pitches, these, in Savage's terms, equating to genes, which Dawkins regards as "any portion of chromosomal material that potentially lasts for enough generations to serve as a unit of natural selection" (Dawkins, 1989, p. 28). The mappings posited by Savage (2017) are summarized in **Table 1**, the first and second columns representing Savage's molecular-genetic mapping of (bio)chemical and musical structure, and the third and fourth columns representing a mirror-image, memetically motivated set of mappings (see also Jan, 2013, p. 152, Figure 1).

Thus, Savage's (2017) positing that amino acids are equivalent (in some abstract sense) to individual pitches and that proteins are equivalent to melodies is problematic because melodies are often made up of a number of discrete intermediate-level patterns, called musemes (music-memes) in my terminology and motives in Savage's (2017), a crucial cognitive level which is not explicitly accounted for (hence the "?" in **Table 1**) in his approach. By "museme," a particularly salient example of which is the opening four notes of Beethoven's Fifth Symphony, is meant a perceptually-cognitively-demarcated melodic/horizontal (pitch-rhythm) and/or harmonic/vertical collection which is capable of being retained in short-term memory and which possesses "just sufficient copying-fidelity to serve as a viable unit of [cultural] selection" (Dawkins, 1989, p. 195).<sup>11</sup>

<sup>9</sup>One might draw a distinction between formalization and quantitative studies: the former is an abstract attempt to theorize the terrain and dynamics of a system; the latter is a concrete attempt to measure a system using various metrics, perhaps using some formalization as a guide.

<sup>10</sup>These are not absolute correlations, but simply attempts to align phenomena at analogous structural levels within their parent "ontological category" (Velardo, 2016, p. 104, Figure 3).

<sup>11</sup>I alighted upon the term "museme" independently of Tagg (2016), conflating "music" and "meme" in an example of convergent evolution (homoplasy; see also

Such groups of pitches—the gene-equivalent patterns theorized by memetics—are much stronger candidates for the units of selection in cultural evolution than Savage's isolated pitches. This is because a m(us)eme is not a m(us)eme unless, as Dawkins states, it can act as a unit of selection. To serve this function it has to have a discrete identity; that is, it must (i) be discrete (demarcated to some extent from the patterns surrounding it, even if it partially overlaps with them; Jan, 2007, p. 74); and it must (ii) have an identity (it must have some attribute(s) which distinguish it to some extent from other, similarly demarcated, patterns and which motivate(s) its copying). These two points allow us to understand memetic selection as success in the competition for the finite attention and memory resources of a m(us)eme's potential human hosts.

There is very strong evidence from the cognitive-psychological literature that music is perceived in terms of such melodic/harmonic groups; and it would appear that they derive, in part, from the phenomenon of expectation (anticipation, prediction) (Huron, 2006; Husserl, 2013). As with many music-related perceptual-cognitive processes, this is a consequence of both bottom-up (innate/genetically determined) and top-down (learned/memetically determined) factors (Narmour, 1990). While subject to innate constraints, often considered under the rubric of Gestalt psychology, much of our perception of music (and indeed language) relies upon the statistical learning of conventions as a result of enculturation (Gjerdingen, 1988; Byros, 2009). This process has been modeled in a number of computer simulations: discussing their Information Dynamics of Music (IDyOM) model, Pearce and Wiggins (2012) argue that violation of expectations leads not only to affective responses (Meyer, 1956), but is a significant force in imposing grouping boundaries. Moreover, both bottom-up and top-down factors regulate the selective environment of musemes, for the former dictate the constraints a museme must satisfy in order to be perceived, cognized and memorized (Lerdahl, 1992; Velardo and Vallati, 2016); while the latter include the totality of musemes within a cultural community (the museme-pool), against which a given museme must compete (in the sense outlined at the end of the previous paragraph).

To expand upon the foregoing, one can make the following points:

• Bottom-up: Evolutionarily selected predispositions to vocal learning (Merker, 2012) make humans very good at attending to musilinguistic sounds (Brown, 2000; Mithen, 2006; Fitch, 2010) and abstracting statistical regularities from them (Kirby, 2013). This abstraction is fostered by the imposition of grouping boundaries, which "are perceived before events for which the unexpectedness of the outcome (h) and the uncertainty of the prediction (H) are high" (Pearce and Wiggins, 2012, p. 638). Such grouping boundaries create the "chunking" (Snyder, 2000, pp. 53–56) necessary for processing by short-term memory.

• Top-down: Suitably packaged, this musical information is retained in individual and collective memories; indeed, it would not be retained if it were not delineated. It might be termed, after Chomsky, "I-music" (internal, brain-stored, music) and "E-music" (external, culture-stored, music), respectively (Fitch, 2010, p. 32). Chunked musical patterns also influence the perception of other patterns, including their grouping, because "that which is copied [retained in memory] may serve to define the pattern" (Calvin, 1996, p. 21; see also Jan, 2011, section 4.1).
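The boundary criterion quoted in the first point can be illustrated with a deliberately small sketch. It estimates the unexpectedness h of each note from a first-order (bigram) model trained on the melody itself, and proposes a grouping boundary before any note whose h is unusually high. This is a toy stand-in and not the IDyOM model: the bigram estimator, the add-one smoothing, the threshold parameter `k` and the pitch string are all assumptions made for illustration, and the entropy term H is left out.

```python
import math
from collections import Counter


def information_content(seq):
    """Return h = -log2 p(note | previous note) for every note after the first.

    Probabilities come from add-one-smoothed bigram counts over `seq` itself
    (a toy assumption; this is not how IDyOM estimates probabilities).
    """
    alphabet_size = len(set(seq))
    bigrams = Counter(zip(seq, seq[1:]))
    contexts = Counter(seq[:-1])
    h = []
    for prev, cur in zip(seq, seq[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + alphabet_size)
        h.append(-math.log2(p))
    return h


def boundaries(seq, k=1.0):
    """Indices i such that a boundary is proposed immediately before seq[i].

    A boundary is proposed where h exceeds mean(h) + k * standard deviation(h).
    """
    if len(seq) < 2:
        return []
    h = information_content(seq)
    mean = sum(h) / len(h)
    sd = math.sqrt(sum((x - mean) ** 2 for x in h) / len(h))
    return [i + 1 for i, x in enumerate(h) if x > mean + k * sd]


# Hypothetical pitch string: a repeated c-d-e figure interrupted by one leap to a.
melody = ["c", "d", "e", "c", "d", "e", "c", "d", "a", "c", "d", "e"]
print(boundaries(melody))
```

With this input the only proposed boundary falls before the unexpected a, the one event that the repeated figure does not predict.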

More broadly, the bottom-up/top-down duality raises the issue of gene-meme coevolution (Durham, 1991), because it pits biological replicators against their cultural equivalents. At the highest level, system-orientated research in coevolution encompasses the evolution of the human capacity for musicality and other phenotypic attributes (Blackmore, 2000, pp. 31–34; Jablonka and Lamb, 2014; Podlipniak, 2017); but replicator-orientated research in this area is generally not conducted with a specifically memetic orientation (but see Shennan, 2002), tending to focus on gene-level changes driven by (often generic) cultural pressures (Richerson et al., 2010). Thus, future research in coevolution might attempt to investigate meme-level changes driven by (specific) genetic pressures, and the interactions between specific memes and genes.

Given the foregoing, while Savage's (2017) statistical data on edits in folk-song corpora are strong evidence in favor of cultural evolution, they should be regarded as epiphenomena of musemic-evolutionary processes—consequences of the changes which occur when discrete musical patterns are transmitted with copying errors and are differentially selected. To gain a deeper understanding of such statistical data, one must regard the mutational changes (conservation, substitution, insertion and deletion) not only as forces driving musemic mutation and, ultimately, musico-stylistic evolution (Jan, 2015), but also as forces constrained by the psychological realities of pattern-formation and propagation. That is, one must take into account two countervailing forces: (i) susceptibility to mutational pressures (perhaps engendered by weak perceptual-cognitive demarcation and/or low intra-museme coherence) may distort a museme (resulting in high entropy; Margulis and Beatty, 2008), but may introduce a variant which has a higher perceptual-cognitive salience (Berlyne, 1971; Martindale, 1986), and therefore potentially greater replicative prospects, than its antecedent—Dawkins' "fecundity" (Dawkins, 1989, p. 194); and (ii) resistance to mutational pressures (perhaps engendered by strong perceptual-cognitive demarcation and/or high intra-museme coherence) may preserve multiple copies of a museme (resulting in low entropy), and may therefore foster an increase in its representation in the museme-pool over time—Dawkins' "copying-fidelity" (Dawkins, 1989, p. 195).

Lastly, one might argue that a memetic orientation erodes the qualitative-quantitative distinction—or, rather, that it allows us to understand it as a continuum—because it supports a range of methodologies from (qualitative) assessments of the aesthetic effects of certain musemes in particular musical contexts to (quantitative) measurements of museme frequency and transmission relationships.

section 3). While there are alignments between our uses of the term, mine is distinguished from Tagg's by its specifically evolutionary, as opposed to semiotic, focus—as a unit of cultural selection in music.

#### 3. QUANTIFICATION OF EVOLUTIONARY DISTANCE IN MUSEMES

To the first claim outlined in section 1—that a purely statistical approach based on counting note-edits without consideration of perceptual-cognitive aspects gives an incomplete account of cultural evolution—a second has arisen from section 2: that statistical data derived from measuring mutational changes, while illuminating, are epiphenomena of musemic evolution.

To investigate this, I consider some of Savage's (2017) specific data in a small case study, attempting to relate them to the musical patterns from which they arise. It is important to note at this stage that the tracking of conservations, substitutions, insertions and deletions is done partly in the service (in one of his studies) of grouping folk songs into tune-families (Cowdery, 1984), and I will focus on examples from one sub-family which will hopefully serve as a microcosm of more general issues. This focus is perhaps characteristic of the qualitative ("less is more")/quantitative ("more is more") distinction.

**Figure 1** shows one such melody, "The Two Brothers", no. 49 of the "Child Ballads", two variants of which are incorporated by Savage in his dataset. The Child Ballads are a collection of British folk ballads (specifically, their lyrics), assembled (some from American sources) by Child (1904). The (often diverse) melodies associated with these lyrics were later collated and categorized by Bronson (1959). This particular ballad, originally from Scotland, concerns the death—variously accidental or intentional—of one of the eponymous school-age brothers by the other's knife, and the deceased boy's subsequent interment<sup>12</sup>.

What I label the "Antecedent" in **Figure 1A** (**Figure 1Bii**) was transcribed in Bronson's (1959) sources from a rendition by "Mrs. Ellie Johnson (23), Hot Springs, N.C., September 16, 1916" (Bronson, 1959, p. 391, no. 16); and the "Consequent" in **Figure 1A** (**Figure 1Biv**) from a rendition by "Mrs. Lucindie (G.K.) Freeman, Marion, N.C., September 3, 1918" (Bronson, 1959, p. 390, no. 15). Phrase-ending marks (represented by continuous vertical lines in **Figure 1B**) are Bronson's and are retained by Savage (2017). Being clear points of articulation, these marks are equivalent to the terminal nodes of the four musemes—Musemes (hereafter "M") a–d—which constitute these melodies (labeled under **Figure 1Biv**)<sup>13</sup>. While Savage is correct in labeling these two versions as "older" and "younger" (in terms of date of collection), respectively, there are actually four melodies in this group (six if one includes the variants in the second halves of two of them), and his "older" is not the "oldest": this status goes, by one day, to **Figure 1Bi**<sup>14</sup>. **Figure 1Bv** represents the implied harmony of these melodies, which may or may not have been realized in some performances, perhaps on guitar.

While it makes sense methodologically for Savage (2017) to think in terms of "older" (antecedent) and "younger" (consequent) patterns, the facts that (i) the time intervals between the recording of these phemotypic forms are so short (three days, in the case of **Figures 1Bi–iii**); (ii) the individuals concerned would presumably have assimilated these melodies months or years before the date of collection; and (iii) the geographical area from which they were collected is relatively constrained (the western counties of North Carolina, with two of the four melodies being collected in the same town, Hot Springs), all suggest that a model of linear transmission in collection-date order, with clearly demarcated, sequential mutations, is highly improbable. This conclusion is further reinforced by the fact that the variants in **Figures 1Bi,iii** were presumably recorded on the same occasions as the ostensibly "principal" form. Given these points, references in what follows to "earlier"/"antecedent" and "later"/"consequent" forms of melodies and musemes must be understood as relating only to the dates of collection and to the resultant numeration in **Figure 1B**, and not as hypotheses of evolutionary descent-order.

An arguably more realistic model would be of an ecosystem in which a relatively stable framework—defined by balanced and rhyming periodicity, implied harmony, cadence patterns and axial pitches—was generated by means of a number of interchangeable musemes being repeatedly co-replicated. This framework is eight bars in duration, with a I–V; V–I two-phrase/four-sub-phrase structure and a "middle cadence . . . on the supertonic [$\hat{2}$, supported by an implied V]" (Bronson, 1959, p. 384). It is clearly not unique to this set of song variants: it forms the basis, much expanded, of "two-phrase"/"balanced" binary form (Rosen, 1988, p. 22; Hepokoski and Darcy, 2006, p. 355), as well as of numerous other folk-song melodies (Bronson, 1959, p. xii)<sup>15</sup>. It serves as a container for a set of musemes which were interchangeable in ways which did not compromise the integrity of the melody, as understood by members of the cultural community which replicated it in conjunction with a similarly variable set of verbal-conceptual (lyric/text) memes.

In this sense, "The Two Brothers" is a higher-order structure re-instantiated/generated by the repeated re-conglomeration of a set of functionally equivalent musemes, each of which serves to articulate a specific node of the structure. The notion of functionally analogous musemes is essentially that of the replicator allele (Dawkins, 1983a, p. 283). This concept, when used in the context of cultural evolution, refers to musemes which are similar in their basic structure and/or function, such that members of the same museme allele-class

<sup>12</sup>Verses 4 and 6 of one variant of this ballad read: "4: Brother took out his little penknife, / It was sharp and keen. / He stuck it in his own brother's heart, / It caused a deadly wound. 6: He buried his bible at his head, / His hymn book at his feet, / His bow and arrow by his side, / And now he's fast asleep." (Bronson, 1959, p. 391).

<sup>13</sup>The segmentation of these melodies is largely unproblematic, being guided by Gestalt-psychological segmentation criteria (Deutsch, 1999; Snyder, 2000). While a phrase is not necessarily the same as a museme, in the case of this melody the four short phrases are. The distinctive ♩.– rhythm straddling Mc–Md in most of these melodies also acts (residually) in the two examples where the junction is ♩–♩, i.e., the variant forms of **Figures 1Bi,iii.**

<sup>14</sup>While Bronson categorizes these six as belonging to "Group B" of this tune-collection (Bronson, 1959, pp. 387–393, nos. 9–20), others in this group are often significantly different to the homogeneous six which are shown in **Figure 1B**.

<sup>15</sup>Such similarities suggest a deep commonality between song and dance melodies arising from the imperatives of symmetry, balance and an arch-shaped (low-high-low) tension-curve.

are interchangeable—in the sense of being equally viable and coherent—in a specific context (such as a certain point in a phrase or a particular modulatory juncture, etc.) (Jan, 2016). The framework/higher-order structure referred to above might be termed a musemeplex—i.e., a complex formed by the repeated co-replication of a set of musemes (in the case of **Figure 1B**, Ma–Md) which are nevertheless also individually replicated (Jan, 2007, p. 80). Automatically, the replication of a musemeplex results in the replication of what might be termed a musemesatz—i.e., a shallow-middleground-level structure, the "skeleton" of a musemeplex, generated by the tendency of a set of allelically related musemes to conglomerate in broadly similar ways in two or more contexts (Jan, 2010). As represented in **Figure 1**, allele-identifiers are shown as superscript boxed Arabic numbers (assigned according to date of collection), so that (for example) bb. 1–2 of **Figure 1Bi** is labeled Allele <sup>1</sup> of Ma, symbolized hereafter in the text as "Ma<sup>1</sup>"<sup>16</sup>.

Given this nexus of similarity relationships linking six melodies assembled from a set of fourteen alleles, how might we understand the connections between the component musemes and attempt to reconstruct their transmission relationships? Perhaps it is necessary to concede that one cannot ultimately reconstruct the nexus of transmission that gave rise to these six melodic variants, simply because human culture is so interconnected—and was even when these songs were current, in the pre-internet age—and the cultural interactions with which we are concerned were largely undocumented. But one might still try to sketch out possible evolutionary trajectories and develop methodologies which might be applicable to these and other cases. One way is to attempt to quantify the differences between them, in terms of measuring the mutational changes that separate them. Savage proposes the percent identity (PID) as a measure of evolutionary distance, this being defined as "the

<sup>16</sup>Such numeration is, naturally, undertaken "vertically" (i.e., as an intra-museme-allele-class system), and not "horizontally" (i.e., as an inter-museme system). The latter approach would indicate a degree of similarity, for example, between Ma<sup>3</sup> and Mc<sup>4</sup> (**Figure 1Biii**). Nevertheless, this particular connection is a contour-based similarity, and not one inhering in the intervallic and scale-degree recurrences which I employ here to define similarities and differences between musemes and their alleles.

number of aligned positions (i.e., amino acids, DNA nucleotides, musical notes, etc.) that are identical (ID) divided by the sequence length (L). . . . We have chosen to use the average length of both sequences [L1, L2], as this appears to be the most consistent measure of percent identity" (Savage, 2017, pp. 53–54). This metric is represented in the following equation:<sup>17</sup>

$$PID = 100 \times \left( \frac{ID}{\frac{L_1 + L_2}{2}} \right) \tag{1}$$
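As a minimal sketch of this definition, the function below counts identical positions in two sequences that have already been aligned and divides by the mean of the two unaligned lengths. The function name, the use of `"-"` as a gap symbol and the example pitch strings are assumptions made for illustration; the alignment step itself, which Savage's method performs algorithmically, is taken as given.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """PID = 100 * ID / ((L1 + L2) / 2) for two pre-aligned sequences.

    `aligned_a` and `aligned_b` must have equal length; `gap` marks positions
    where one sequence has no counterpart in the other.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # ID: aligned positions holding the same (non-gap) element.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    # L1 and L2: sequence lengths with gaps excluded.
    len_a = sum(1 for a in aligned_a if a != gap)
    len_b = sum(1 for b in aligned_b if b != gap)
    return 100 * identical / ((len_a + len_b) / 2)


# Hypothetical aligned pitch sequences (not taken from Figure 1).
x = ["d", "e", "f", "g", "a", "g"]
y = ["d", "e", "-", "g", "a", "a"]
print(round(percent_identity(x, y), 1))
```

Here four positions are identical and the sequence lengths are 6 and 5, so the result is 100 × 4/5.5.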

Savage (2017) uses the PID as an index of the mutational distance between two variant melodies in order to assess a tune's membership of a particular tune-family—the larger the PID, the greater the likelihood of the melodies' belonging in the same tune-family. But there is no reason why this metric cannot also be used at the level of the museme, in order to quantify mutational distance between such patterns. Used this way, the PID may be used to assess membership of a museme allele-class (or, indeed, to investigate a relationship of presumed mutation which moves a museme from one allele-class into another). Membership of a museme allele-class implies—provided the musemes are of a comparable length—that the musemes in question are related by homology ["a character shared between two or more species that was present in their common ancestor" (Ridley, 2004, pp. 427, 480); what Darwin termed "descent with modification" (Darwin, 2008, p. 129)], rather than homoplasy ["a character shared between two or more species that was not present in their common ancestor" (Ridley, 2004, pp. 427–428, 480)]; that is, a relationship resulting from cultural transmission, rather than from "convergent evolution" (Ridley, 2004, p. 429), respectively. Nevertheless, as with comparable cases in biology, it is not always possible to decide with certainty which category specific cases belong in. While determination of a suitable PID threshold for perceptually-cognitively significant similarity might be achieved by means of empirical studies—whereby test musemes with various degrees of mutation are ranked by listeners according to their perceived relatedness—this would not necessarily permit the assignment of threshold-exceeding patterns to the same allele-class without fuller knowledge of the context of transmission.

A related metric is mutation rate, which is the number of "observed mutations per year" (Savage, 2017, p. 56), where the number of mutated pitches (x) is compared with the total number of pitches (y) over time (t). This is represented in the following equation:

$$MR_{(t)} = (x/y)/t \tag{2}$$
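The same definition can be written as a short function. This is only a sketch: the function and parameter names are my own, and the input values 3, 8 and 2 are those used in the illustrative calculation for the case study.

```python
def mutation_rate(mutated, total, years):
    """MR = (x / y) / t: proportion of mutated pitches per year."""
    if total <= 0 or years <= 0:
        raise ValueError("total and years must be positive")
    return (mutated / total) / years


# Three mutated pitches out of eight over a notional two-year interval.
print(round(mutation_rate(3, 8, 2), 3))
```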

Again, there is no reason why this metric cannot also be used at the museme level, in order to quantify the mutation rate between two museme alleles. While cultural evolution occurs at an absolute rate many orders of magnitude faster than biological evolution (Dawkins, 1989, p. 192), and indeed occurs at highly variable absolute rates (Savage, 2017, p. 107), if cultural evolution is scaled to biological evolution (i.e., if some relative rather than absolute mutation rate is considered), then the two processes may be broadly comparable. Mutation rate is inversely related to "transmission fidelity" (Savage, 2017, p. 111), in that the lowest mutation rates are found in repertoires with high copying-fidelity, and vice versa (Dawkins, 1989, pp. 18, 194); these repertoires tend, unsurprisingly, to be notationally (as opposed to orally) transmitted musics. In the case of these particular melodies, however, the time interval is so constrained, and the transmission nexus so unclear, that the mutation-rate metric is of limited use (despite the illustrative calculation below) in the present context.

On this basis, the PID and MR values (the latter over a notional 2-year period, the time interval separating the collection of **Figures 1Bii,iv**) for Ma<sup>2</sup> and Ma<sup>4</sup> in **Figure 1** are as follows:

$$PID = 100 \times \left( \frac{5}{\frac{8+7}{2}} \right) = 71.4 \tag{3}$$

$$MR_{(t)} = (3/8)/2 = 0.188 \tag{4}$$

Because the musemes under investigation are components of a larger melody—they are, as argued above, independently replicated elements of a musemeplex which is transmitted, isosequentially ordered, as a collective—when the melody is copied from source to source, it is clear that the order and identity of musemes are either retained or obviously altered<sup>18</sup>. Such cases of musemic transmission are therefore more tractable (Ma<sup>2</sup> in one melody is clearly analogous to Ma<sup>4</sup> in a variant of that melody) than situations in which an isolated museme is potentially copied from an antecedent context (a piano sonata, for example) to a non-analogous consequent context (a symphony, for example). In the latter case, however, the PID and MR metrics might usefully be employed in order to assess the likelihood that a given pattern is indeed being transmitted from one context to another.

Such sequential-mapping constraints allow one to circumvent the fact that, at 71.4%, the PID value of Ma<sup>2</sup>–Ma<sup>4</sup> in **Figures 1Bii,iv** is lower than the 85% Savage takes as an index of two melodies being "highly related" (Savage, 2017, p. 54)<sup>19</sup>. It is conceivable, however, that two melodies with a PID of this order of magnitude may not actually bear any obvious musemic relationships, owing to the insensitivity of the PID metric to museme similarity when the PID is calculated at the musemeplex (phrase) level (one might address this by calculating the PID at the musemeplex level using musemes rather than individual pitches as the units of measurement)<sup>20</sup>. Because Savage's (2017)

<sup>17</sup>While Savage (2017, p. 51) argues for, and operationalises, the primacy of pitch over rhythm in his melodic-similarity determinations—yet usefully takes into account the distinction between accented and unaccented pitches—future research in this area might usefully integrate both parameters in a more sophisticated PID metric.

<sup>18</sup>This attribute of independent replication is assumed for the sake of argument, but it is not difficult to envisage easily finding coindexes (Jan, 2007, p. 71) of the individual musemes of "The Two Brothers", replicated separately from the assemblage of which they form a part in the ballad.

<sup>19</sup>A PID <85% may still indicate a relationship of (partial) transmission, in which one or more musemes from one melody are assimilated by another, largely dissimilar, melody.

<sup>20</sup>This is a consequence of the phenomenon famously summed up by the comedian Eric Morecambe, who said to André Previn—after a shambolic start by Morecambe to Grieg's Piano Concerto in A minor—"I'm playing all the right notes—but not necessarily in the right order" (McCann, 1999, p. 234).

≥85% criterion applies to melodies, not musemes, and because his algorithm has paired the 71.4%-related Ma<sup>2</sup> and Ma<sup>4</sup> in **Figures 1Bii,iv**, there must by definition be a >85% similarity between the other musemes of the phrase, Mb<sup>n</sup>–Md<sup>n</sup>, in order to compensate for the <85% of the Ma<sup>2</sup>–Ma<sup>4</sup> relationship. Indeed, Mb<sup>2</sup> and Md<sup>1</sup> are replicated (as their symbology implies) without mutation (= 100% relation).

**Table 2** shows PID values for each museme allele-class in "The Two Brothers", comparing alleles of Ma–Md against others in the same allele-class<sup>21</sup>.

Without the anchor of the sequential-mapping constraint, many of these patterns would not, on the basis of their PID values, appear to be related. The similarities between Ma<sup>2</sup> and Ma<sup>4</sup>, for example, inhere in relatively tenuous pitch connections—the 28.6% "PnID" (Percent non-IDentity = 100% − 71.4%) puts quite an expanse of clear blue water between them. In the case of the Mc<sup>1</sup>–Mc<sup>4</sup> relationship, the considerably smaller 14.3% PID value (and therefore considerably greater 85.7% PnID) would not even suggest membership of the same allele-class<sup>22</sup>. In both cases, and as is often the case in musemic similarity relationships, it is the rhythm, contour and harmonic implication—the latter a prolongation of the tonic and dominant chords, respectively (**Figure 1Bv**)—which additionally bind these alleles together (and which would have to suffice in the absence of the sequential-mapping constraint). In the case of Ma<sup>2</sup> and Ma<sup>4</sup>, the rise from the initial c<sup>1</sup> to the apical a<sup>1</sup> in b. 2 followed by a fall to the dominant g<sup>1</sup> at the end of the first half-phrase is the common, unifying contour feature of the allele-class.

Measures of similarity have a bearing on the related issues of museme transmission and of museme resolution/subdivision. In general, cultural transmission is significantly more error-prone (in an informational sense) than biological transmission, so it may be presumed that most inter-museme PID values will be lower than 100%<sup>23</sup>. Below a certain context-specific threshold, a low PID value might be taken as evidence that any similarities are the consequences of homoplasy, not homology. But the converse may not always hold true: a very high PID might be associated with a pattern so generic and so commonplace that the two instances may have been independently generated (homoplasy), rather than directly transmitted (homology). In Cope's terms, such entities are "commonalities": a category of "patterns which, by virtue of their simplicity—scales, triad outlines, and so on—appear everywhere. In a sense, commonalities seem to disappear in a sea of similarity" (Cope, 2003, p. 17). By contrast, and at the opposite end of a continuum of similarity categories (Jan, 2014, p. 4, Figure 1), longer and more distinctive patterns are termed "quotations": a category which "often involve exact note and/or rhythm duplication" (Cope, 2003, p. 11). Quotations are more likely than commonalities to be homologous as opposed to homoplasious, and vice versa. Thus, one must also take into consideration the issue of museme length, in addition to the PID value, when attempting to determine whether two coindexes are related by homology or by homoplasy.

On this last point, and as noted in section 2, museme perception and cognition is contingent upon both bottom-up and top-down processing. The former to some extent tracks the sonic-acoustic regularities governed by the laws of physics. Given that these regularities include the harmonic series, it is perhaps not surprising that certain musical structures derived from this series—triads and particular (5–7-note, unequal-interval) scale-types—are common across many (but not all) musical cultures (Patel, 2008, pp. 19–21). Such structures are thus to some extent acoustically privileged and will (ceteris paribus) naturally constitute the "connective tissue", the commonalities, of much music—which is not to say that the particular (rhythmic/harmonic) form they take in a given piece of music is not derived (memetically) from a specific antecedent coindex. Moreover, such commonalities are often useful in expediting the connection of more "characteristic" musemes (i.e., those closer to the "quotations" end than the "commonalities" end of Cope's (2003) continuum) and, in this capacity, they therefore serve as evolutionary "good tricks" (Dennett, 1995, pp. 77–78).

As a further complication, similarity values are often not helpful in trying to order musemes chronologically/sequentially in a nexus of transmission. As will be discussed further in section 4, evolution is not invariably associated with increasing complexity, however measured; in certain circumstances, adaptation might result in decreasing complexity. Moreover, the PID value measures editorial differences (it is not, strictly, an edit-distance metric; Levenshtein, 1966), which might result in no net change in absolute or relative complexity between two or more musemes; nor does it indicate the direction of change (toward greater simplicity or greater complexity), so a high PID might be associated with operations which result in the simplification of a museme, such as occurs between Ma<sup>2</sup> and Ma<sup>4</sup>. Of course, this relationship is only one of simplification if Ma<sup>2</sup> is regarded as the antecedent and Ma<sup>4</sup> as the consequent; seen the other way round, it is a process of increasing complexity. If evolution were only taken to be a process of increasing complexity, then Ma<sup>4</sup> would be a candidate for the antecedent of Ma<sup>2</sup>—which it might nevertheless still be, even though this specific (simplicity-complexity) justification is invalid.
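For comparison with the PID, a strict edit-distance measure in Levenshtein's sense can be sketched as follows. The implementation is the standard dynamic-programming one; the function name and the pitch sequences are hypothetical and are not drawn from Figure 1.

```python
def levenshtein(a, b):
    """Minimum number of single-element insertions, deletions and
    substitutions needed to turn sequence `a` into sequence `b`."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(
                min(
                    previous[j] + 1,         # delete x
                    current[j - 1] + 1,      # insert y
                    previous[j - 1] + cost,  # substitute x by y, or match
                )
            )
        previous = current
    return previous[-1]


# Hypothetical pitch sequences.
a = ["c", "d", "e", "g", "a"]
b = ["c", "e", "g", "g", "a"]
print(levenshtein(a, b), levenshtein(b, a))
```

Both calls return the same value: like the PID, the edit distance is symmetric, so by itself it likewise cannot indicate which of two musemes is the antecedent.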

Hitherto, these alleles have been treated as unitary, but if we hypothesize that three notes is the realistic lower threshold for a melodic museme to have perceptual-cognitive validity (Jan, 2007, p. 61), then the a<sup>1</sup>–a<sup>1</sup>–g<sup>1</sup> melodic triad of b. 2 is the only common contiguous element between Ma<sup>2</sup> and Ma<sup>4</sup>. (One might, however, regard Musemes Ma<sup>2</sup> and Ma<sup>4</sup> as being identical at the shallow-middleground level—having a c<sup>1</sup>–a<sup>1</sup>–g<sup>1</sup>

<sup>21</sup>The bracketed anacrusis c<sup>1</sup> (Bronson, 1959, p. 391, no. 16) in **Figure 1Bii** is included here, as it is in Savage's mutation calculation, represented in **Figure 1A**.

<sup>22</sup>Perhaps criteria might be devised which would conclude that they are not actually in the same allele-class, or that they are only members of an "allele-super-class", perhaps one defined by harmony but not including scale-degree factors. While the present focus is largely upon melodic (linear pitch plus rhythm) patterning, one could vary the number of parameters taken into consideration in order to narrow or broaden the definition of a museme. In this way, a museme would be seen as a multiparametric complex (a "style structure", in Narmour's terminology) made up of several uniparametric simplexes ("style forms/shapes") (Narmour, 1977, pp. 173–174; 1990, p. 34), although this runs the risk of blurring the distinction, if one truly exists, between a museme and a musemeplex.

<sup>23</sup>One concomitant of the dichotomy expressed in Note 5 is that a work-centric view of music attempted, until relatively recently, to enforce a single correct and objective text, whereas a process-centric view accepts the diversity of different acts, be these interpretations of "classical" works or variants of folk musics.

TABLE 2 | PID values for museme alleles in "The Two Brothers".


structure; but a full consideration of the structural-hierarchic location of the musemes under consideration is beyond the scope of the present article.) The first part of the museme—(c<sup>1</sup>)–c<sup>1</sup>–c<sup>1</sup>–e<sup>1</sup>–g<sup>1</sup> in Ma<sup>2</sup>, []–c<sup>1</sup>–c<sup>1</sup>–c<sup>1</sup>–c<sup>1</sup> in Ma<sup>4</sup>—is sufficiently dissimilar (despite the two common c<sup>1</sup>s) for one to envisage various scenarios to account for the etiology of the material of bb. 1–2 in these two song-variants, scenarios which may be generalized to other musemes in these six melodies and, indeed, more widely.

To contextualize these scenarios, it is useful to make a distinction between two ways of viewing these melodies and the alleles which constitute them, which might be conceived as extreme points on a "continuum of influence". On the one hand (the imaginary left-hand ("closed") side of the continuum), one could see these six melodies as an essentially secure ecosystem, impervious to perturbation by musemes external to its constituent allele classes. On the other hand (the imaginary right-hand ("open") side of the continuum), one could see them as entirely receptive to influence by external factors (immigration of, or influence by, external musemes). In the case of "The Two Brothers", it seems sensible to ascribe priority to intra-tune-family relationships, given the nature of this repertoire's transmission, while not ruling out the possibility that musemes from other sources (other tune-families, other repertoires) might have influenced the transmission relationships within this group of six melodic variants. It is also important to note that in such repertoires as the folk ballad there is obviously textual as well as musical replication, but this does not necessarily guarantee that, when a textual phrase is replicated from one context to another, the museme associated with the earlier text is the source of that associated with the later text—as other instances of "The Two Brothers" tune-family attest.

For Ma<sup>n</sup> and the multitude of comparable cases:

1. One could regard bb. 1–2 of "The Two Brothers" as consisting of only one museme (Ma<sup>2</sup> and Ma<sup>4</sup>). If so, then given the similarities between the second halves of each variant (the a<sup>1</sup>–a<sup>1</sup>–g<sup>1</sup> triad), which act as a kind of "anchor" (and given, of course, the sequential-mapping constraint), one would take the first halves, b. 1, as being edit-heavy, homology-associated mutations: to get from the antecedent to the consequent form (whichever is which), a fair amount of "earth moving" is required (Typke et al., 2007; see also Jan, 2014).



then replication of the latter might have been mediated by the memory of a melody containing a repeated-note museme.

Given that **Table 2** shows intra-museme-allele-class PID values, what is not considered are inter-museme-allele-class values. One of the latter is, however, shown (italicized), namely that between Ma<sup>1</sup> and Mb<sup>2</sup>, the relatively high value of 66.7% (higher, of course, than some intra-museme-allele-class values) indicating the presence of rhyme/symmetry within the first half of the melody<sup>24</sup>. The higher the intra-museme-allele-class ("vertical") PID values of any tune-family, the greater the perceived synchronic unity (its coherence as a collection of melodies) of the family; whereas the higher the inter-museme-allele-class ("horizontal") values of any individual melody, the greater the perceived diachronic unity (its coherence as a collection of musemes) of that melody—and vice versa. Both forms of unity might act as musemic selection pressures: the higher the perceived unity, synchronic or diachronic, the easier it is for listeners and singers to remember these melodies and therefore the more evolutionarily successful their constituent musemes may tend to be, if success is measured in terms of the number of copies of a given museme in a museme-pool. This selection pressure might be operative in many musemeplexes, and might be a factor driving the musemic collaboration which gives rise to them.
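The distinction between "vertical" and "horizontal" PID values can be made concrete with a small sketch. The two melodies below are invented and each contains only two short musemes; the position-by-position `pid` helper assumes that the sequences are already aligned from their first note, which is a simplification of the alignment Savage's method performs.

```python
from itertools import combinations


def pid(a, b):
    """Position-wise PID; assumes both sequences are aligned from the start."""
    identical = sum(1 for x, y in zip(a, b) if x == y)
    return 100 * identical / ((len(a) + len(b)) / 2)


def mean_pairwise_pid(sequences):
    """Mean PID over every unordered pair of sequences."""
    pairs = list(combinations(sequences, 2))
    return sum(pid(a, b) for a, b in pairs) / len(pairs)


# Hypothetical tune-family: each melody is a list of two musemes.
family = {
    "melody_1": [["c", "e", "g"], ["c", "e", "a"]],
    "melody_2": [["c", "e", "g"], ["d", "e", "a"]],
}

# "Vertical": the same museme slot compared across melodies.
vertical = [mean_pairwise_pid(list(slot)) for slot in zip(*family.values())]

# "Horizontal": different musemes compared within a single melody.
horizontal = {name: mean_pairwise_pid(m) for name, m in family.items()}

print(vertical)
print(horizontal)
```

In this invented family the first slot is identical across melodies, while the second melody shows less internal rhyme between its two musemes than the first.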

## 4. PHYLOMEMETICS AND CULTURAL TAXONOMIES

The reference to "phylogenetic analysis" in the quotation in section 2 (page 3) is significant, in that just as the long-term outcomes of biological selection can be represented in terms of branching lineages on (by convention) a tree diagram—where species bifurcate to give rise to sub-species, etc. (Darwin, 2008, p. 90)—so can those of cultural evolution. In the case of the group of museme alleles constituting the particular subset of

As a first word of caution, attempting to calculate cultural phylogenies—what might be termed phylomemies—from such a small group of short melodies risks falling foul of what might be termed the distinction between real and virtual phylogen/memies. A real phylogen/memy is one which is objectively evolutionarily correct, indicating the transmission relationships between the replicators at various positions on the cladogram. A virtual phylogen/memy is one which arrives (perhaps as a consequence of a restricted sample size) at a "pseudo-cladogram" which, while a logical and (perhaps more importantly) parsimonious representation of the patterns under investigation, is nevertheless (potentially) not evolutionarily true (and is therefore not properly cladistic) because it does not take into account patterning "external" to the sample under consideration. This external patterning, if included, might alter the relationships represented by the cladogram. It would appear considerably easier to arrive at a real phylogeny (where groups of potentially related organisms are often relatively geographically localized, morphologically distinct and, nowadays, genetically tractable) than it is to arrive at a real phylomemy (where groups of potentially related cultural forms are often scattered across space and time).

Yet this enterprise is worth pursuing, if only to illustrate the possibilities of the approach, one which Howe and Windram (2011) term "phylomemetics", the cultural equivalent of phylogenetics. As they acknowledge (Howe and Windram, 2011, p. 1), this is by no means a new methodology in the humanities, where philologists in both linguistic and musical research have long attempted to reconstruct stemmata showing relationships of transmission and mutation in sources as diverse as biblical texts and medieval music manuscripts (Cook, 2015). Conducted under (or, some might fear, annexed by) the rubric of phylomemetics, such research can incorporate all the intellectual infrastructure of Darwinism—the notions of variation, replication and selection; concepts of fitness; and ideas of lineage bifurcation and divergence—in tracing connections between the phenomena under investigation<sup>26</sup>.

Using the phylogeny-calculation software Phylip (Felsenstein, 2016), the six forms of "The Two Brothers" in **Figure 1B** were

<sup>24</sup>Bronson argues for the primacy of musical over textual rhyme (Bronson, 1959, p. xii).

<sup>25</sup>There are various different approaches to taxonomy, and biologists often argue testily as to their relative merits—in Dawkins' view, taxonomy is "one of the most rancorously ill-tempered of biological fields. Stephen [Jay] Gould has well characterized it with the phrase 'names and nastiness' " (Dawkins, 2006, p. 275). But a cladistic approach, particularly one where genetic evidence is employed, is the one most likely to be evolutionarily "correct" in biological taxonomy (Ridley, 2004, p. 489).

<sup>26</sup>It might be argued that phylomemies differ from phylogenies in their potential for "cross-fertilization", whereby two lineages may share material, or even rejoin, after bifurcation. But this is also true, perhaps to a lesser extent, in nature, where gene-transfer between recently bifurcated lineages remains possible for a limited time.

FIGURE 3 | Input data for "The Two Brothers".
analyzed. This used the input file shown in **Figure 3A**, which is a date-ordered list—based on **Figure 1B** and in which "v" represents the variant forms of **Figures 1Bi,iii**—of the melodies consisting of a sequence of their constituent pitches, grouped into museme alleles<sup>27</sup>. It should be stressed that this is an illustrative calculation only, designed to outline a methodology which might be adopted (as discussed in section 5) in larger studies. The highly restricted dataset naturally limits the scope of the conclusions that can be drawn. The phylomemetic tree shown in **Figure 4A** was generated using the Pars utility, which "is a general parsimony program which carries out the Wagner parsimony method (Eck and Dayhoff, 1966) with multiple states. Wagner parsimony allows changes among all states. The criterion is to find the tree which requires the minimum number of changes" (Felsenstein, 2016). For ease of comparison, the text-based output of Pars (strictly, that of the graphics-generating utility DrawGram) has been replaced in **Figure 4** by images of the relevant melodies<sup>28</sup>.
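To make the parsimony criterion concrete, the following minimal sketch counts the minimum number of state changes that one fixed tree requires for a set of aligned character strings, using Fitch's algorithm for unordered multistate characters. It is not the Pars program itself, which additionally searches over tree shapes. The pitch strings, the taxon labels, and the two candidate trees are hypothetical and are not the data of **Figure 3**.

```python
def fitch_changes(tree, seqs):
    """Minimum number of state changes on a fixed rooted binary tree.

    tree: nested 2-tuples whose leaves are taxon names (hypothetical labels).
    seqs: dict mapping taxon name -> aligned string, all of equal length.
    """
    length = len(next(iter(seqs.values())))
    total = 0
    for pos in range(length):
        def state_set(node):
            nonlocal total
            if isinstance(node, str):          # leaf: its observed state
                return {seqs[node][pos]}
            left, right = (state_set(child) for child in node)
            common = left & right
            if common:                         # children agree: no change
                return common
            total += 1                         # children disagree: one change
            return left | right
        state_set(tree)
    return total


# Hypothetical aligned pitch strings for four melody variants.
seqs = {"A": "ggab", "B": "ggab", "C": "gaab", "D": "gabb"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(fitch_changes(tree_1, seqs), fitch_changes(tree_2, seqs))
# prints: 2 3
```

Under the parsimony criterion the first grouping would be preferred, because it explains the same strings with fewer changes.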

Such cladograms represent descent with modification, whereby items located to the left (bottom/past) are hypothesized to be evolutionarily earlier than those located to the right (top/present), and where proximity to points of bifurcation (branch-length) represents relative evolutionary distance. While parsimony does not invariably align with evolutionary reality (a parsimonious tree is not necessarily a "real" tree, in terms of the binarism referred to above), it is a powerful constraint on evolutionary possibilities. Given this, it is reasonable to infer that both real and virtual lineages will generally proceed from left to right by the minimal mutational distances (this is not to deny the possibility of more radical, saltational, change). As suggested in section 3, evolution is fundamentally a process of adaptive change (Ridley, 2004, p. 4) and not necessarily one where that change leads to an increase in "the logarithm of the total information content of the biosystem (genes plus memes)" (Ball, 1984, p. 154)<sup>29</sup>. In the light of this, and of the proviso made in section 3 that date of collection does not necessarily align with the evolutionary chronology of these melodies, one must reiterate that, when undertaking phylomemetic analysis, melodic simplicity does not necessarily correlate with chronological anteriority, any more than melodic complexity corresponds with chronological posteriority.

As a second word of caution—one which applies more broadly to any attempt to analyse music by means of the kinds of symbolic representations used in Phylip—in order to perform the phylomemetic analysis, the musical patterning of these songs, already converted to their traditional western letter-name notation in **Figure 1**, was rendered as a series of ASCII characters to form the input to Pars. In this way, the melodies of these ballads are treated as a text. This means that the analysis is operating on a representation two stages removed from the living performances recorded over a century ago: not only has the rendition been regularized and shoehorned into western notation, a form of "lossy" compression; but this representation has itself been further divorced from its connection with sound by its reduction to a mere symbol-set, an abstract series of Mx <sup>n</sup> patterns. Perhaps more fundamentally, while the Phylip software to some extent "understands" genetics, in that it is based on a formalization of the dynamics of the biochemistry underpinning it, it has little conception of music and the dynamics of pitch and rhythm combination underpinning it. Nevertheless, the symbols offered as input bear at least some connection with their long-distant musical antecedents, and so permit a provisional phylomemetic analysis based on parsimony relationships to be conducted.

In addition to analyzing relationships between song melodies as a whole, this type of analysis may also be conducted at the level of the museme allele, as represented in **Figures 3B**, **4B**, which show only the four alleles of Mc. Importantly, if cladograms generated from complete song melodies are different from those derived from specific museme alleles within a melody, then this affords evidence in support of the second claim, made in section 3: that statistical data derived from measuring mutational changes, while illuminating, are epiphenomena of musemic evolution.

While there are many complex relationships represented within the cladograms of **Figure 4**, not all of which can be elaborated upon here, the following points may be made in summary (again reiterating that the Pars utility is operating on a deprecated, symbolic representation of music without any knowledge of music theory):

<sup>27</sup>This might be further developed by incorporating rhythmic values, whereby "bbb" = ♩. and "b" = ♪.

<sup>28</sup>Note that these are "rooted" phylomemies: there is assumed to be an unidentified common ancestor to the left of the tree (Ridley, 2004, p. 439).

<sup>29</sup>This may often be the case with oral transmission, where the principle of lectio difficilior potior—"the more difficult reading is the stronger" (Robinson, 2001) might support one in ascribing chronological anteriority to a more complex form.


**Figures 1Bi,iii** (variant). This indeed affords evidence in support of the second claim: that statistical data derived from measuring mutational changes (**Figure 4A**) are epiphenomena of musemic evolution (**Figure 4B**), because Mc (and indeed any museme) is arguably more meaningful, perceptually-cognitively and evolutionarily, than the larger melody of which it forms a part.

5. In terms of chronology, this second cladogram is (quasi-) anachronistic, in that it ascribes evolutionary (co-)primacy to the "latest" (and also "earliest") of these musemes, Mc 1 . As specified by the provisos in the third ("chronology") point above, this cladogram does not constitute hard evidence in favor of a phylomemy which runs counter to the collection-date ordering.

This consideration has only scratched the surface of the complex relationships inherent in **Figure 4**, itself only a small case study. For one thing, while these melodies would normally have been performed unaccompanied, their implied harmony (**Figure 1Bv**) may have acted as a selection pressure<sup>30</sup>. Given the tendency for harmonic changes to coincide with points of metrical accentuation—Temperley's "HPR [Harmonic Preference Rule]

<sup>30</sup>Given that unaccompanied melodies in western music normally have clear harmonic implications (a phenomenon arguably most richly developed in the solo violin music of J.S. Bach), the perceptual-cognitive salience of mutations will tend to be evaluated in the light of the silent musemes constituting the underpinning chord progressions. Implied harmony therefore constitutes a selection pressure because it motivates an assessment of the altered conformity of (elements of) a mutant museme with the associated chord vis-à-vis the alignment of its antecedent. In non-western cultures, no such implicative coadaptation exists between melodic and harmonic musemes.

2 (Strong Beat Rule)" (Temperley, 2001, p. 151)—it may be the case that Ma 1+3 y , with their implied shift to the tonic chord on the second (weak) rather than the third (strong) crotchet beat of the bar (as in Ma 2+4 y ), have either a selective advantage or (paradoxically) a selective disadvantage, depending on context<sup>31</sup>.

But the overriding issue here is that the dichotomy identified above between real and virtual phylomemies is clearly problematic, for while Savage and Atkinson (2015, p. 167) are laudable in their injunction that statistical-phylomemetic analysis is (only) a stepping stone toward the understanding of "higher-level units of musical structure and meaning", the statistical data—even considered in conjunction with musemic organization—do not always permit the reconstruction of higher-level-unit phylomemies with any real certainty, as is demonstrated by the present study. Perhaps we might simply hypothesize that, in the absence of detailed knowledge of the transmission events under investigation, the cladograms in **Figure 4** predict the true temporal ordering of (phrase- or museme-level) events. Thus, we are taking the most parsimonious phylomemy to be the most plausible, and assuming that, when the historical record is obscure, this criterion should be primary when attempting to reconstruct cultural-evolutionary histories.

## 5. CONCLUSION: TWO BROTHERS?

While the lyrics of "The Two Brothers" are decidedly grim, the spirit of this article is optimistic, in that it holds that perceptual-cognitive and statistical models of musical evolution are also brothers (or sisters), and that, unlike the brothers of the ballad texts, they can go on not to do violence to each other but to grow together and to complement each other, developing to be cooperative adults working for a two-fold common cause: the understanding of cultural evolution as a subset of a wider Darwinian view; and the development of methodologies along the perceptual-cognitive–statistical continuum to investigate its operation.

To return to the two claims underpinning the argument here—(i) that a purely statistical approach based on counting note-edits without consideration of perceptual-cognitive aspects gives an incomplete account of cultural evolution; and (ii) that statistical data derived from measuring mutational changes, while illuminating, are epiphenomena of musemic evolution—we might assert that both have been supported by the (admittedly limited) case study outlined here. That is [apropos claim (i)], Savage's (2017) statistical data on "The Two Brothers" are arguably contextualized, enriched and elucidated by considering the musemic structure of the tune-family music-analytically, music-psychologically and music-phylomemetically; and [apropos claim (ii)] the discussion conducted under the third of these rubrics suggests a strong regulatory role for museme-level (as opposed to note-level) processes.

This case study—a small-scale empirical example of how to pursue a novel methodological strategy—is arguably scalable (by means of more systematic use of computer technology) in ways which would foster perceptual-cognitive–statistical collaboration in research on cultural evolution. The methodology for this, which is essentially a formalization and expansion of what is discussed here, is summarized as follows. As will be clear, many of the relevant technologies already exist and so, as is often the case with advances in research, it is largely a matter of synergistic interconnection for this to become a reality.


While the four points above seem clear in outline, their connection is likely to prove difficult to implement in practice, given the recalcitrant complexity of music and the intricacy of the programming tasks required. Yet success in this venture offers a rich promise: that of reconstructing how music may have been perceived and transmitted across time and place in various human societies; and therefore of offering synchronic overviews and simulacra of once-vibrant, diachronic musical cultures.

## AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

<sup>31</sup>This context includes the likelihood that, for some listeners, such harmonic-rhythm disruptions might be appealing (and therefore selectively advantageous from the museme's perspective), whereas for other listeners the opposite might be the case.



**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Jan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Computational Approach to Musical Consonance and Dissonance

#### Lluis L. Trulla<sup>1</sup>, Nicola Di Stefano<sup>2</sup>\* and Alessandro Giuliani<sup>3</sup>

<sup>1</sup> Centre de Recerca Puig Rodó, Girona, Spain, <sup>2</sup> Institute of Philosophy of Scientific and Technological Practice and Laboratory of Developmental Neuroscience, Università Campus Bio-Medico di Roma, Rome, Italy, <sup>3</sup> Environment and Health Department, National Institute of Health, Rome, Italy

#### Edited by:

Aleksey Nikolsky, Independent Researcher, Los Angeles, CA, United States

#### Reviewed by:

Juan G. Roederer, American Association of Retired Persons, United States

Susan Elizabeth Rogers, Berklee College of Music, United States

> \*Correspondence: Nicola Di Stefano n.distefano@unicampus.it

#### Specialty section:

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

Received: 28 August 2017 Accepted: 08 March 2018 Published: 04 April 2018

#### Citation:

Trulla LL, Di Stefano N and Giuliani A (2018) Computational Approach to Musical Consonance and Dissonance. Front. Psychol. 9:381. doi: 10.3389/fpsyg.2018.00381

In the sixth century BC, Pythagoras discovered the mathematical foundation of musical consonance and dissonance. When auditory frequencies in small-integer ratios are combined, the result is a harmonious perception. In contrast, most frequency combinations result in audible, off-centered by-products labeled "beating" or "roughness;" these are reported by most listeners to sound dissonant. In this paper, we consider second-order beats, a kind of beating recognized as a product of neural processing, and demonstrate that the data-driven approach of Recurrence Quantification Analysis (RQA) allows for the reconstruction of the order in which interval ratios are ranked in music theory and harmony. We take advantage of computer-generated sounds containing all intervals over the span of an octave. To visualize second-order beats, we use a glissando from the unison to the octave. This procedure produces a profile of recurrence values that correspond to subsequent epochs along the original signal. We find that the higher recurrence peaks exactly match the epochs corresponding to just intonation frequency ratios. This result indicates a link between consonance and the dynamical features of the signal. Our findings integrate a new element into the existing theoretical models of consonance, thus providing a computational account of consonance in terms of dynamical systems theory. Finally, as it considers general features of acoustic signals, the present approach demonstrates a universal aspect of consonance and dissonance perception and provides a simple mathematical tool that could serve as a common framework for further neuro-psychological and music theory research.

Keywords: beating, recurrence quantification analysis, complex systems, non-linear signal analysis methods, Devil's staircase

## INTRODUCTION


Beating is the sensation that typically occurs when two sounds with similar frequencies mutually interfere, giving rise to a waveform with a rhythmic oscillation in amplitude. Following the fundamental contribution of Helmholtz's treatise, On the Sensations of Tone (1954), first published in 1863, contemporary explanations of consonance are grounded in the notions of beating and complex tones—i.e., sounds displaying a broad array of sinusoidal components (harmonics).

Roederer (2008, p. 35) provides an illuminating classification of the effects of superposing two pure tones depending on where in the listener's auditory system the sounds become entangled. The above-mentioned beating is labeled by Roederer as "first-order beating," because it is processed mechanically in the cochlear fluid and along the basilar membrane. Evidence of the physiological basis of first-order beating stems from the fact that its effect disappears when sounds are played separately in different ears—i.e., dichotically. Another kind of first-order beating effect is known as combination tones, which are produced by the non-linear interaction of waves in narrow spaces, such as the body of musical instruments or the inner ear. Combination tones can be considered as the product of two sine waves. A common example is the terzo suono theorized by Giuseppe Tartini (see Lohri, 2016). If a and b are two frequencies with a > b, then the terzo suono is a tone at frequency a − b that is discernible only by the listener, because it is produced inside the inner ear rather than being caused by external air vibrations. Combination tones can be heard across the octave at sound pressure levels (SPLs) of 80 dB or higher, and across part of the octave at 50 dB SPL and above.

At 80 dB (or higher), while maintaining the interval around the octave, a distinct beating can be perceived. This disappears when $f_2 = 2f_1$ (where $f_1$ and $f_2$ represent the two frequencies) and reappears as soon as the octave is mistuned by a small amount $\varepsilon$ (i.e., $f_2 = 2f_1 + \varepsilon$). The beating frequency turns out to be $\varepsilon$ (Plomp and Levelt, 1965). Beating "is created by the relatively quick changes produced by modulation frequencies in the region between about 15 to 300 Hz" (Fastl and Zwicker, 2006, p. 257). Unlike first-order beats, the beating persists when tones are fed dichotically, implying that, in this case, beat perception is the result of neural processing. Hence, these beats are defined as "second-order beats" (Roederer, 2008). Second-order beating shows a modulation in the vibration pattern, i.e., a periodic change in phase difference between the two sounds that form the interval (Roederer, 2008, p. 49), although no amplitude modulation is present. Second-order beats are also called "beats of mistuned consonances" because they are audible when pure tones are superposed to form a fifth (Plomp, 1976). In fact, whereas the vibration pattern of a correctly tuned fifth ($f_2 = \tfrac{3}{2} f_1$) or fourth ($f_2 = \tfrac{4}{3} f_1$) is static, the mistuned cases $f_2 = \tfrac{3}{2} f_1 + \varepsilon$ and $f_2 = \tfrac{4}{3} f_1 + \varepsilon$ cause the vibration pattern to change periodically in form, but not in amplitude. From the octave to the fifth and to the fourth, the second-order beats become faster (beating frequency being $\varepsilon$ for the octave, $2\varepsilon$ for the fifth, and $3\varepsilon$ for the fourth) as the vibration pattern grows in complexity (see **Figure 1**).
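The beat rates quoted above follow a simple pattern that can be checked numerically. The sketch below assumes that, for a consonance with frequency ratio p:q, the second-order beat rate equals |q·f2 − p·f1|. This formula is an assumption made for illustration, chosen because it reproduces the values ε, 2ε, and 3ε stated in the text. The function name, the 400 Hz reference tone, and the 3 Hz mistuning are illustrative.

```python
from fractions import Fraction


def second_order_beat_rate(f1, ratio, eps):
    """Beat rate for f2 = ratio * f1 + eps, assuming rate = |q*f2 - p*f1|."""
    p, q = ratio.numerator, ratio.denominator
    f2 = ratio * f1 + eps
    return abs(q * f2 - p * f1)


f1, eps = 400, 3  # Hz; illustrative values
for name, ratio in [("octave", Fraction(2, 1)),
                    ("fifth", Fraction(3, 2)),
                    ("fourth", Fraction(4, 3))]:
    print(name, second_order_beat_rate(f1, ratio, eps))
# prints: octave 3, fifth 6, fourth 9 (one per line)
```

Exact fractions are used so that the ratio 4:3 does not pick up floating-point rounding error.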

Their neural origin makes second-order beats an excellent phenomenon for investigating the link between the mathematical description of the signals and their neural processing, and consequently allows us to shed light on their perceived "pleasantness." To achieve a consistent picture of second-order beats, it is fundamental to overcome the frequency–time space representation trade-off and the related problem of non-stationary signal characteristics.

Graphic representations of sound typically plot the course of amplitude over time or report the relative amplitudes of the different frequencies computed by the Fourier Transform. Thus, there is no mention of time in the latter, and no mention of frequency in the former. However, in the actual hearing process, time and frequency are strictly intermingled, because specific frequencies are processed at specific moments. This fact suggests that we should focus on the simultaneous analysis of time/frequency dimensions (Roads, 2001).

To determine the frequency of an oscillatory phenomenon, we must count the number $n$ of vibrations that occur within a set time interval $\Delta t$. As $n$ is an integer, the minimum error in the count is one, which gives rise to a kind of uncertainty principle of the form $\Delta f \geq 1/\Delta t$. Increasing the precision of the frequency estimate requires a wider time window in which to count, thus increasing the uncertainty about the instant at which the specific frequency occurs.
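As a numerical illustration of this trade-off, the sketch below evaluates the bound stated above (a frequency uncertainty no smaller than the reciprocal of the window duration) for a few window lengths at 8000 samples/s, the sampling rate used for the sound files described in the Software section. The helper name and the chosen window lengths are hypothetical.

```python
def min_frequency_uncertainty(n_samples, sample_rate):
    """Lower bound on the frequency error (Hz) when counting whole cycles
    within a window of n_samples, i.e., 1 / window duration in seconds."""
    window_seconds = n_samples / sample_rate
    return 1.0 / window_seconds


for n in (480, 4800, 48000):
    bound = min_frequency_uncertainty(n, 8000)
    print(f"{n} samples -> at least {bound:.2f} Hz")
```

A tenfold longer window tightens the frequency bound tenfold, at the cost of locating the frequency ten times less precisely in time.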

It is possible to neglect the explicit consideration of time and visualize tone relationships within the octave by computing the ratio of two simultaneous frequencies and then plotting the interval ratio against the amplitude. This is achieved by forming a linear combination of two pure tone waves, a glissando from the unison ($f_1$) to the octave ($2f_1$) and a fixed wave at frequency $f_1$. Similar stimuli were previously adopted by Helmholtz (1954) and Kameoka and Kuriyagawa (1969a,b). More recently, Piana (2007) provided a phenomenological explanation of consonance and dissonance starting from the glissando and ruling out intervals and harmonics. In the following, we propose a numerical approach to Helmholtz's glissando. Note that this approach maintains the time dimension in terms of the determined sequence of interactions between the glissando and the fixed frequency. Focusing on these interactions allows us to overcome the trade-off in the frequency–time representation. For this purpose, we base our approach on the concept of recurrence, a property of signals that is simpler and more fundamental than oscillation frequency (Eckmann et al., 1987; Marwan et al., 2007). The degree of recurrence of a series is estimated by the number of times a signal comes back to an already visited state (see section "Materials and Methods"), and can be computed by the application of recurrence quantification analysis (RQA) (Marwan et al., 2007). Estimating the recurrence rate avoids any stationarity assumption, as the estimate is obtained by a "computation window" sliding along the signal; the result is a profile of recurrence values relative to subsequent epochs along the original signal. This provides a model-free, discrete, and local estimation of the recurrent properties of the series, enabling a quantitative description of second-order beats. The recurrence peaks exactly match the values of the interval ratios corresponding to just intonation and are proportional to the order of consonance of the intervals, thus providing a link between consonance and the dynamical features of the signal.

## MATERIALS AND METHODS

## Recurrence Quantification Analysis

The original idea of describing non-stationary signals (which are not amenable to classical Fourier analysis) by means of recurrence dates back to the work of Ruelle's group (Eckmann et al., 1987). The authors introduced recurrence analysis as a purely graphical technique in the form of recurrence plots (RP). Webber and Zbilut (1994) then converted the RP approach into a quantitative technique (RQA) by defining some non-linear descriptors of the RP. RQA has been adopted for the assessment of time series structures in fields ranging from molecular dynamics to physiology and text analysis (Manetti et al., 1999; Orsucci et al., 2006; Marwan et al., 2007). In the field of music research, RQA has been successfully applied to song recognition (Serra et al., 2009) and in the definition of an objective basis of consonance of pure tones (Trulla et al., 2005). In general, this non-linear technique is especially useful for quantifying transient behavior far from the equilibrium (Trulla et al., 1996).

RQA builds upon the computation of a distance matrix between the rows (epochs) of the embedding matrix of the signal of interest, with the lag defined by the method of the first minimum of Mutual Information (Kennel et al., 1992). Given a scalar time series $\{x(i),\ i = 1, 2, 3, \ldots\}$, an embedding procedure generates the vectors $X_i = (x(i), x(i+L), \ldots, x(i+(m-1)L))$, where $m$ is the embedding dimension and $L$ is the lag. The sequence $\{X_i,\ i = 1, 2, 3, \ldots, N\}$ then represents the multi-dimensional process of the time series (signal) as a trajectory in $m$-dimensional space. RPs are symmetrical $N \times N$ matrices in which a point is placed at $(i, j)$ whenever a point $X_i$ on the trajectory is close to another point $X_j$. The relative closeness between $X_i$ and $X_j$ is estimated by the Euclidean distance between these two vectors. If the distance falls below a threshold radius ($r$), the two vectors (epochs, windows) are considered to be recurrent, and this is graphically indicated by a dot. The value of $r$ is usually set to 5–10% of the average pairwise distances between epochs. Therefore, RPs correspond to the symmetrical distance matrix between the epochs (rows of the embedding matrix) of the signal transformed into a binary 0/1 matrix by the action of a threshold.
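The embedding and thresholding steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the software used in this study; the function names are hypothetical, and setting the radius to a fixed fraction of the mean pairwise distance is only one reading of the rule of thumb given above.

```python
import numpy as np


def embed(x, m, lag):
    """Rows are the delay vectors (x(i), x(i+L), ..., x(i+(m-1)L))."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * lag
    return np.column_stack([x[k * lag: k * lag + n] for k in range(m)])


def recurrence_plot(x, m, lag, radius_fraction=0.1):
    """Binary N x N matrix: 1 where two delay vectors lie within the radius."""
    X = embed(x, m, lag)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # Euclidean distances
    upper = dist[np.triu_indices(len(X), k=1)]     # distinct pairs only
    radius = radius_fraction * upper.mean()        # assumed radius rule
    return (dist < radius).astype(int)


# A 5 Hz sine sampled at 200 samples/s revisits each state every 40 samples.
t = np.arange(200) / 200
rp = recurrence_plot(np.sin(2 * np.pi * 5 * t), m=3, lag=10)
print(rp.shape, rp[0, 40], rp[0, 20])
```

For a periodic signal such as this one, the recurrent points line up along diagonals spaced one period apart, which is the visual signature that RQA descriptors later quantify.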

As an example, consider a time series A made up of 10 consecutive values: 7, 8, 10, 15, 6, 7, 9, 11, 10, 8. To observe the recurrence structure of the series at the level of subsequent epochs of length 3, we transform A into the embedding matrix AE:
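Assuming a lag of one sample, each row of AE holds three consecutive values of A, which gives eight overlapping epochs. The unit lag is an assumption; it is the choice consistent with the recurrences reported for this example.

```latex
AE =
\begin{pmatrix}
 7 &  8 & 10 \\
 8 & 10 & 15 \\
10 & 15 &  6 \\
15 &  6 &  7 \\
 6 &  7 &  9 \\
 7 &  9 & 11 \\
 9 & 11 & 10 \\
11 & 10 &  8
\end{pmatrix}
```

Eight epochs yield $8 \times 7 / 2 = 28$ distinct pairs of rows.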


Thus, the original series has been projected into a threedimensional space in which the variables (columns) are the time-lagged original series and the statistical units (rows) are the overlapping epochs. The second step is to compute the Euclidean distances between the epochs. This generates the following distance matrix AD:


As the AD elements correspond to the Euclidean distances between corresponding epochs, the diagonal values are 0, and the symmetric character of the distances implies the matrix can be written in lower-triangular form.

We now specify that two epochs are recurrent if their distance is less than 95% of all the between-epoch distances. The average value of the below-diagonal elements of AD is 6.48, and their standard deviation is 2.74. Thus, it is estimated that 95% of distances are greater than 1.74. This implies we have only two recurrences, corresponding to the epoch1–epoch5 and epoch1–epoch6 couples (bolded in the table).

Therefore, example series A has a recurrence rate of 0.071 (two recurrences out of 28 distinct distances) or, equivalently, a recurrence percentage equal to 7.1. The AD matrix corresponds to an RP with only two dots, at coordinates (1, 5) and (1, 6). Note that the recurrences can be identified without the need for any frequency estimation, thus resembling the hearing process that receives sounds as they occur in time.
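The worked example can be reproduced with the short script below. It is a sketch under two assumptions: the embedding uses a lag of one sample, and the threshold of 1.74 is taken directly from the text rather than re-derived from the distribution of distances. All names are illustrative.

```python
from itertools import combinations
from math import dist

A = [7, 8, 10, 15, 6, 7, 9, 11, 10, 8]
EPOCH_LENGTH = 3
THRESHOLD = 1.74  # value quoted in the text

# Overlapping epochs (rows of the embedding matrix), assuming unit lag.
epochs = [A[i:i + EPOCH_LENGTH] for i in range(len(A) - EPOCH_LENGTH + 1)]

# Euclidean distance for every distinct pair of epochs (1-based labels).
distances = {(i + 1, j + 1): dist(epochs[i], epochs[j])
             for i, j in combinations(range(len(epochs)), 2)}

recurrent = sorted(pair for pair, d in distances.items() if d < THRESHOLD)
rate = len(recurrent) / len(distances)

print(len(distances), recurrent, round(rate, 3))
# prints: 28 [(1, 5), (1, 6)] 0.071
```

The two recurrent pairs have distances of √3 ≈ 1.73 and √2 ≈ 1.41; every other pair is at least 3 apart.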

To provide a quantitative measure of the recurrence, numerical RP descriptors were developed (Marwan et al., 2007). We now consider the proportion of recurrent points (dots) in a plot, called the recurrence. Going back to the music domain, **Figure 2** reports the data relative to **Figure 1** as RPs.

## Software

Files were generated using the sound editor Cool Edit Pro and saved in ASCII format before being fed to the Visual Recurrence Analysis (VRA) software. For the plots in **Figure 1**, we loaded a stereo file of 8000 samples/s to the audio editor, and sent a fixed pure tone of 400 Hz lasting 6 s through the left channel and a fixed pure tone of 403 Hz (**Figure 1A**), 803 Hz (**Figure 1B**), or 603 Hz (**Figure 1C**) for 6 s through the right channel. The sample type was then converted from stereo to mono. **Figure 3** was generated by loading a stereo file of 8000 samples/s to the audio editor, and sending a linearly increasing sound from 360 to 840 Hz lasting 6 s to the left channel and a fixed pure tone of 400 Hz lasting 6 s to the right channel. Finally, the sample type was again converted from stereo to mono.
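A comparable test signal can also be synthesized directly in code. The sketch below is a stand-in for the procedure just described, not a reproduction of it: it builds the linear sweep by integrating the instantaneous frequency, and it takes the stereo-to-mono conversion to be a simple average of the two channels, which the text does not specify. Variable and function names are illustrative.

```python
import numpy as np

SAMPLE_RATE = 8000   # samples/s
DURATION = 6.0       # s


def glissando_mix(f_start=360.0, f_end=840.0, f_fixed=400.0):
    """Mono mix of a linear frequency sweep and a fixed pure tone."""
    t = np.arange(int(SAMPLE_RATE * DURATION)) / SAMPLE_RATE
    sweep_rate = (f_end - f_start) / DURATION            # Hz per second
    # Phase of a linear sweep: 2*pi * integral of (f_start + sweep_rate * t).
    sweep = np.sin(2 * np.pi * (f_start * t + 0.5 * sweep_rate * t ** 2))
    fixed = np.sin(2 * np.pi * f_fixed * t)
    instantaneous_ratio = (f_start + sweep_rate * t) / f_fixed
    return 0.5 * (sweep + fixed), instantaneous_ratio     # assumed mono mix


signal, ratio = glissando_mix()
print(len(signal), ratio[0], ratio[-1])
```

With these defaults the interval ratio runs from 0.9 to just under 2.1, so both the unison and the octave fall inside the sweep.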

**Figure 2** shows RPs for the data in **Figure 1**, i.e., 1 s (8000 points) of a mistuned unison, octave, and fifth. The plots were generated by calculating the global recurrence using RQA, as there is no change along the sample. The recurrence of the data shown in **Figure 3** was calculated using a windowing version of an RP, whereby the recurrence is calculated repeatedly for a window that is continuously shifted along the whole sample. Among the RQA parameters, we chose the simplest one, Percent Recurrence, a descriptor that gives the percentage of recurrent points with respect to the non-trivial maximum [equal to (N × (N−1))/2 for an N-point series]. The window for recurrence analysis was 480 points long and the shift was 48 points. The embedding dimension was 5 and the delay was 3 points.
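As an illustration of the Percent Recurrence descriptor, the sketch below embeds a series with the stated dimension (5) and delay (3) and counts the distinct pairs of embedded points that lie within a radius. It is not the VRA implementation: the function names, the Euclidean norm, and the `radius` threshold are assumptions, since the text does not report them, and N is taken here to be the number of embedded vectors.

```python
import numpy as np

def embed(series, dim=5, delay=3):
    """Time-delay embedding: row i is (x[i], x[i+delay], ..., x[i+(dim-1)*delay])."""
    x = np.asarray(series, dtype=float)
    n_vectors = len(x) - (dim - 1) * delay
    if n_vectors < 2:
        raise ValueError("series too short for this embedding")
    return np.column_stack([x[j * delay : j * delay + n_vectors] for j in range(dim)])

def percent_recurrence(series, radius, dim=5, delay=3):
    """Percentage of distinct point pairs whose distance is within `radius`."""
    vectors = embed(series, dim, delay)
    n = len(vectors)
    diffs = vectors[:, None, :] - vectors[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))   # assumed Euclidean norm
    upper = np.triu_indices(n, k=1)             # the n*(n-1)/2 distinct pairs
    return 100.0 * np.count_nonzero(dist[upper] <= radius) / len(upper[0])
```

A constant series gives 100% and a steep ramp gives 0%, the two extremes between which real signals fall. Eight embedded vectors produce 8 × 7 / 2 = 28 distinct pairs, which is the denominator of the non-trivial maximum for N = 8.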

MATLAB programs were obtained from http://sethares.engr.wisc.edu/consemi.html for Sethares' dissonance curve and http://courses.theophys.kth.se/5A1352/mfiles/devils.m for the theoretical Devil's staircase (see Discussion).

#### RESULTS

A non-stationary signal exploring all interval combinations within the octave can be generated by merging the course of two sounds into a single waveform. The first sound is set at constant frequency f<sub>1</sub> for the full duration of the course, while the second follows an ascending glissando from f<sub>1</sub> to f<sub>2</sub> = 2f<sub>1</sub>. **Figure 3** shows an instance of the above procedure.

The most conspicuous singularity (recurrence peaks, see below) in the graph occurs when lines cross themselves, i.e., when f<sub>2</sub> = f<sub>1</sub> (unison, interval ratio of 1:1). A second relevant case occurs at the interval ratio of 2:1, which corresponds to the octave. Less evident events occur at 3:2 (fifth) and 4:3 (fourth), as can be seen in the zoomed inset in **Figure 3**. Singularities in the waveform are thus localized where the frequency ratios are expressed by lower integers and with an apparent amplitude (or degree of singularity) matching the accepted ranking of consonance. In our representation, second-order beats appear as a zone of relative calm centered in rational numbers, surrounded by the tempestuous region of irrationals that Roederer (2008) called "beat holes."

Following the numerical solution of Helmholtz's glissando, we explore the glissando/constant frequency signal through an RQA windowing procedure called Recurrence Quantification of Epochs (RQE). RQE scans the whole signal by sequentially selecting small windows (here, epochs of 480 points) and applying the RQA algorithm to each, which yields a recurrence rate per epoch. Successive windows are shifted by 48 points, and the process is repeated throughout the entire file. For each iteration, we retain both the recurrence value and the interval ratio at which this value occurs, calculated as the mean of the interval ratios in the window. **Figure 4** represents the degree of recurrence along the continuum of interval ratios within the octave.
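The epoch bookkeeping of this procedure can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' RQE code: `scan_epochs` is a hypothetical helper, the per-epoch descriptor is passed in as a function, and a root-mean-square value stands in for the recurrence measure so that the example stays self-contained. The 360 to 840 Hz glissando against a 400 Hz tone follows the stimulus description in the Software section.

```python
import numpy as np

def scan_epochs(signal, ratio, descriptor, window=480, shift=48):
    """Apply `descriptor` to each epoch and pair it with the epoch's mean interval ratio."""
    signal = np.asarray(signal, dtype=float)
    ratio = np.asarray(ratio, dtype=float)
    starts = range(0, len(signal) - window + 1, shift)
    mean_ratio = np.array([ratio[s : s + window].mean() for s in starts])
    values = np.array([descriptor(signal[s : s + window]) for s in starts])
    return mean_ratio, values

fs, duration = 8000, 6.0
t = np.arange(int(fs * duration)) / fs
f_fixed = 400.0
f_gliss = 360.0 + (840.0 - 360.0) * t / duration   # instantaneous glissando frequency
ratio = f_gliss / f_fixed                          # interval ratio at each sample

# Mono mix of the fixed tone and the glissando (phase = integral of frequency).
signal = 0.5 * (np.sin(2 * np.pi * f_fixed * t)
                + np.sin(2 * np.pi * (360.0 * t + 0.5 * 480.0 * t**2 / duration)))

# Placeholder descriptor (epoch RMS); a recurrence measure would be used instead.
mean_ratio, values = scan_epochs(signal, ratio, lambda w: np.sqrt(np.mean(w**2)))
```

With 48,000 samples, a 480-point window, and a 48-point shift, the scan yields 991 epochs, each tagged with the mean interval ratio it covers.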

Emergent features of the glissando are evident in **Figure 4**. Firstly, the higher peaks exactly correspond to the places of just intonation (see Trulla et al., 2005), thus establishing a link between pleasantness and the dynamical features (i.e., recurrence) of the signal. As expected from the numerical model, all peaks correspond to rational numbers. Secondly, it is worth noting the symmetry of the peaks around the perfect fifth. Moreover, the correlation between the extent of recurrence and the rank order of consonance derived from the literature evidences the link between the present model based on signal analysis and results from psychological approaches (see Schwartz et al., 2003, i.e., U > P8 > P5 > P4 > M6 > M3 > m3 > m6 > m7 > M7, in decreasing order of consonance; see **Table 1**).

In summary, RQA allows us to establish a natural link between the signal properties and the consonance judgment of the listeners without any a priori hypothesis or frequency estimation. The reasons why integer numbers play such an important role in harmony have recently been addressed in the literature, with many different recipes presented for calculating the simplicity of the intervals. We use the consonance index provided by Frova (1999) to demonstrate the close relationship between the proposed recurrence index and the bare numerical characteristics of the intervals. If m/n is the rational number in its lowest terms, Frova's index is (m+n)/(m × n) (Frova, 1999, p. 178). **Figure 5** illustrates the correlation of this index with the notion of simplicity (i.e., degree of recurrence).
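Frova's index as defined above is simple to compute exactly. The sketch below is only a worked illustration: the function name `frova_index` is hypothetical, and the four interval ratios are the ones named in the Results (unison 1:1, octave 2:1, fifth 3:2, fourth 4:3).

```python
from fractions import Fraction

def frova_index(ratio):
    """(m + n) / (m * n) for the ratio m/n reduced to lowest terms."""
    r = Fraction(ratio)                      # Fraction reduces to lowest terms
    m, n = r.numerator, r.denominator
    return Fraction(m + n, m * n)

intervals = {
    "unison": Fraction(1, 1),
    "octave": Fraction(2, 1),
    "fifth": Fraction(3, 2),
    "fourth": Fraction(4, 3),
}

# Rank the intervals from highest to lowest index.
ranked = sorted(intervals, key=lambda name: frova_index(intervals[name]), reverse=True)
for name in ranked:
    print(name, frova_index(intervals[name]))
```

The resulting values are 2, 3/2, 5/6, and 7/12, so the index orders these four intervals as unison, octave, fifth, fourth, in agreement with the consonance ranking quoted earlier.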

Whereas Frova's index is derived from the energy of the partials forming a complex sound, the percentage recurrence is a purely bottom–up phenomenological descriptor of a pure tone signal, relating recurrence (and consonance) to secondary beating and thus providing a natural (albeit roughly phenomenological) link between the signal properties and neural processing.

Note that the computation of recurrences gives very similar results with respect to models based on primary beating, such as the Plomp and Levelt model reported in **Figure 6**.

## DISCUSSION

In this paragraph, we relate the self-similar appearance of the recurrence graph in **Figure 4** to the mathematical fractal structures generated by physical processes. **Figure 7** shows the empirical cumulative recurrence distribution (obtained by adding consecutive points) and a formal Devil's staircase in the [1, 2] interval: the similarities between the two graphs are remarkable.

The Devil's staircase pattern is a fingerprint of dynamical systems characterized by the mode-locking phenomenon (Schroeder, 1990, p. 171), which is crucially important in both music generation and perception. In the 17th century, Christiaan Huygens studied mode-locking and discovered the phenomenon of resonance. He noticed that, after a time, the pendulums of two clocks fixed on the same mounting swung synchronously. The synchronization of two coupled oscillators starting from (slightly) different frequencies is called resonance. A more general case of resonant behavior appears when a specific constant frequency is periodically driven by an external power to oscillate at a different frequency; the so-called Devil's staircase pattern refers to the behavior of forced quasilinear oscillators. In the glissando, the constant frequency is the intrinsic frequency and the glissando the external periodic force. Every plateau in the Devil's staircase relates to a particular phase-locked solution (stable state), and its relative width forms a hierarchy that follows the explained property of rational numbers. The mathematical model for this case is the circle sine map (McCauley, 1994):

$$\theta_{n+1} = \theta_n + \frac{p}{q} + \frac{k}{2\pi}\sin(2\pi\theta_n)$$

where k is a coupling strength parameter that controls the degree of non-linearity. Without coupling (k = 0), the behavior of the system is expressed by the ratio p/q (often called Ω, the bare winding number). When k > 0, the system locks into rational frequency ratios, preferably with small denominators. In this case, the long-term description of the system corresponds to w, the dressed winding number. For the critical value k = 1, the infinite number of locked frequency intervals corresponding to all the rational numbers between 0 and 1 covers the entire range.
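The locking behavior of the sine circle map can be checked numerically. The sketch below is not the `devils.m` program mentioned in the Software section: `dressed_winding_number` is a hypothetical helper, `omega` stands for the bare winding number (the ratio p/q), the sign of the sine term follows the equation as printed, and the dressed winding number is estimated as the average phase advance per iteration of the un-wrapped map.

```python
import math

def dressed_winding_number(omega, k, n_iter=5000, theta0=0.0):
    """Average phase advance per iteration of the un-wrapped sine circle map."""
    theta = theta0
    for _ in range(n_iter):
        theta = theta + omega + (k / (2 * math.pi)) * math.sin(2 * math.pi * theta)
    return (theta - theta0) / n_iter

# Without coupling the dressed and bare winding numbers coincide.
w_free = dressed_winding_number(0.3, 0.0)

# At the critical coupling k = 1, nearby bare values fall onto plateaus.
w_zero = dressed_winding_number(0.1, 1.0)   # inside the plateau at 0
w_half = dressed_winding_number(0.5, 1.0)   # centre of the plateau at 1/2

# A coarse staircase: dressed winding number over a grid of bare values.
staircase = [dressed_winding_number(i / 200, 1.0) for i in range(201)]
```

Plotting `staircase` against the grid of bare values gives a stepped curve whose widest plateaus sit at the ratios with the smallest denominators. The finite number of iterations limits the accuracy of each estimate to roughly 1/`n_iter`.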

In our terms, Ω is the cumulative recurrence and w is the interval ratio. In other words, the system is locked at any rational number (indicated as the interval ratio), but the width or extent of the lock comes from the cumulative recurrence. Thus, the most relevant consonances, such as the unison or octave, have extended areas around the simplest rationals, and a strong attraction exists toward these exact ratios. This is perfectly sound in terms of music theory.

The above considerations can be summarized in three main points:


Numerous studies have confirmed the adequacy of concepts from non-linear dynamics for music perception and construction (e.g., Cartwright et al., 2001, 2002, 2010), and for the study of synchronization among sound sources (Abel et al., 2009). Additionally, neuroscientific research has adopted non-linear dynamical models to describe phase-locked neural populations (Bidelman and Krishnan, 2009; Large and Almonte, 2012) and build in silico neuronal models (Lots and Stone, 2008).

FIGURE 5 | Linear relationship between the degree of recurrence (Figure 4) and Frova's index of consonance (Frova, 1999). Note the almost perfect overlap between the a posteriori statistics of actual signals (i.e., recurrence) and the theoretically motivated a priori consonance index (i.e., Frova's index).

Bolded values are intervals most used in Western harmony.

Taken together, our work and previous results support the idea that the production and perception of sound are intimately linked, the perceived pleasantness of intervals being an intrinsic property of the signal (in terms of the degree of recurrence), and not only a secondary effect of the signal on the listener. In turn, this allows us to speculate on the auditory system. Second-order beats have been attributed to the central auditory nervous system, and neuronal webs are known to support phase-locking, as in the mammalian auditory system, in which neural activity in areas including the cochlear nucleus, inferior colliculus, and primary auditory cortex is phase-locked to the stimulus waveform (Large and Tretakis, 2005). The mode-locking model was proposed by Lots and Stone (2008) as the basis for musical consonance, leading to the development of a dynamical system model based on stylized neural oscillators producing both synchronization and mode-locking. These results support the idea that both parts of the communication system (the sender and the receiver of sounds) are similarly "wired." Bidelman and Heinz (2011) applied a waveform to a computational model of the acoustic nerve and, after deriving the autocorrelation function for the nerve fibers, generated the pitch salience profile for the different intervals, giving rise to a distribution that could be superimposed onto the recurrence rate (**Figure 4**). Using an artificial neural network model, Pankovski and Pankovska (2017) recently demonstrated that a specific auditory spectral distribution caused by non-linearities and Hebbian neuroplasticity are sufficient phenomena for a system to generate the consonance pattern.

In line with the literature on music perception (Benade, 1973; Roederer, 2008), we believe that the link between music generation and perception could rely on the fact that the vibrating elements of musical instruments undergo mode-locking into stationary complex vibration patterns. In turn, these can be recognized as the "best fit" to a harmonic template (resident in a properly wired neural circuit). Though this explanation stems from empirical correlations, we are convinced that the simplicity and versatility of the RQA approach could pave the way for neuro-psychological studies with the great advantage of considering the acoustic signal and the perceiver from the same mathematical perspective.

The origins of the distinction between consonance and dissonance have been hotly debated in recent years. As the phenomenon of consonance represents a key element of Western music theory, it has mainly been investigated in terms of Western science (i.e., mathematics, physics, psychoacoustics, and neuroscience). For this reason, Parncutt and Hair (2011) called for studies on the use of consonance and dissonance in non-Western cultures to be conducted in terms of local indigenous musicians, rather than in terms of Western science. In this direction, a relevant study published in Nature by McDermott et al. (2016) compares the harmonic preferences of people who have different degrees of exposure to Western music. An indigenous population from Bolivia (the Tsimané) was assumed to have no exposure to Western music, and their preferences were compared with groups of city residents in Bolivia and the United States with different degrees of exposure to Western music. The results show that the subjective preferences of Tsimané participants differ from those of the comparison groups; in particular, they failed to rate consonance as being more pleasant than dissonance. The authors state that, as the Tsimané are able to hear the acoustic distinctions associated with consonance and dissonance, the lack of a measurable preference for consonance appears to reflect a difference in their aesthetic response to this contrast (McDermott et al., 2016, p. 549). Correctly, they state that the observed cross-cultural variation suggests that consonance preferences are unlikely to be innate, and so preference is probably acquired. However, the fact that the preference for consonance co-varies with presumptive exposure to Western culture is not sufficient to conclude that consonance perception is not biologically determined.
Though preferences vary with cultures, the discrimination of consonance is a prerequisite for preference and has a biological basis, as supported by a large number of neurobiological studies (Koelsch and Mulder, 2002; Koelsch et al., 2005; Minati et al., 2008; Perani et al., 2010; Park et al., 2011; Wang, 2013). Investigating whether consonance perception is biologically determined or shaped by culture is likely to be misleading, as it conceives enculturation as a non-biologically constrained process. Harmonic intervals are a consequence of the entrainment of the nervous system with the sound excitation. This forms a universal biological foundation under any musical culture, determining the distinction between acoustic consonance and dissonance and leaving it to each culture to determine exactly how to employ these acoustic distinctions. However, the existence of different musical cultures and systems does not imply the lack of a shared natural/biological basis for music production. The interaction between nature and culture is much more complex, and cross-cultural variations in musical systems only show that biology does not rigidly determine music aesthetics. Similar considerations have led to a more adequate definition of music as a "biocultural phenomenon" (Cross, 2003).

## CONCLUSION

The main contribution of this paper stems from the numerical solution of Helmholtz's glissando. Though the standard modern theory of consonance is based on first-order beating, we have shown that similar results can be obtained starting from second-order beats. The recent interest in second-order beating has been fruitful for models of pitch recognition or neural circuitry (see Roederer, 2008), but not for theories on consonance.

Scholars have started to consider music from the perspective of dynamical systems, both in neurobiological and physical terms, showing that mode-locking models can explain how the nervous system manages sound and is engaged in the ranking of consonances. The resemblance between the formal Devil's staircase model and the cumulative recurrence distribution strengthens this idea.

From a methodological perspective, the main contribution of this work is to provide neuroscience scholars with an extremely simple and model-free tool (RQA) that approaches the acoustic signal and the listener's perception system with the same mathematical method. Different RQA applications have been reported in research on otoacoustic emission (see, for example, Zimatore et al., 2002, 2003). We are therefore confident that the use of a simple statistical approach will foster interactions between music theory and neuro-psychological approaches.

Finally, our results support the idea of natural roots of consonance perception, and are thus in line with several studies published in recent years (see, for example, Wang, 2013; Bowling and Purves, 2015; Nikolsky, 2015; Foo et al., 2016; González-García et al., 2016; Di Stefano et al., 2017). However, as proved by McDermott et al. (2016), the role of perception in the formulation of aesthetic judgment remains unclear. Therefore, musical consonance and dissonance remain a hotly debated topic (see Bowling et al., 2017), in need of further research to merge different approaches into a consistent theory.

#### AUTHOR CONTRIBUTIONS

LT originally conceived the idea of the paper, elaborated the stimuli, provided all the figures, and significantly contributed to the results and discussion. NDS prepared the manuscript, co-authored the introduction and the results with LT, contributed to the discussion, wrote the conclusion, and finally revised the entire draft. AG wrote the section "Recurrence Quantification Analysis," reviewed the entire manuscript, and suggested useful ideas for the discussion. All authors equally contributed to the revision of the manuscript before agreeing on the final version.

#### FUNDING

This work was funded by the Institute of Philosophy of Scientific and Technological Practice, Campus Bio-Medico University of Rome, under a 2015 Grant on "Embodiment."

#### ACKNOWLEDGMENTS

LT is grateful to Universitat Pompeu Fabra (UPF) of Barcelona for access to their library.

#### REFERENCES

Fastl, H., and Zwicker, E. (2006). Psychoacoustics: Facts and Models. Berlin: Springer.

Foo, F., King-Stephens, D., Weber, P., Laxer, K., Parvizi, J., and Knight, R. T. (2016). Differential processing of consonance and dissonance within the human superior temporal gyrus. Front. Hum. Neurosci. 10:154. doi: 10.3389/fnhum.2016.00154

Frova, A. (1999). Fisica nella Musica. Bologna: Zanichelli.

unstructured note sequences. Neuroreport 19, 1381–1385. doi: 10.1097/WNR.0b013e32830c694b


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Trulla, Di Stefano and Giuliani. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Music Evolution in the Laboratory: Cultural Transmission Meets Neurophysiology

Massimo Lumaca<sup>1</sup>\*, Andrea Ravignani<sup>2,3,4</sup> and Giosuè Baggio<sup>5</sup>

*<sup>1</sup> Center for Music in the Brain, Department of Clinical Medicine, Aarhus University and The Royal Academy of Music Aarhus/Aalborg, Aarhus, Denmark, <sup>2</sup> Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium, <sup>3</sup> Research Department, Sealcentre Pieterburen, Pieterburen, Netherlands, <sup>4</sup> Language and Cognition Department, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands, <sup>5</sup> Language Acquisition and Language Processing Lab, Department of Language and Literature, Norwegian University of Science and Technology, Trondheim, Norway*

#### Edited by:

*Aleksey Nikolsky, Independent Researcher, Los Angeles, CA, United States*

#### Reviewed by:

*McNeel Gordon Jantzen, Western Washington University, United States; Laura Verga, Maastricht University, Netherlands; Vera Kempe, Abertay University, United Kingdom; Seana Coulson, University of California, San Diego, United States*

#### \*Correspondence:

*Massimo Lumaca massimo.lumaca@gmail.com*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

Received: *21 September 2017* Accepted: *29 March 2018* Published: *16 April 2018*

#### Citation:

*Lumaca M, Ravignani A and Baggio G (2018) Music Evolution in the Laboratory: Cultural Transmission Meets Neurophysiology. Front. Neurosci. 12:246. doi: 10.3389/fnins.2018.00246*

In recent years, there has been renewed interest in the biological and cultural evolution of music, and specifically in the role played by perceptual and cognitive factors in shaping core features of musical systems, such as melody, harmony, and rhythm. One proposal originates in the language sciences. It holds that aspects of musical systems evolve by adapting gradually, in the course of successive generations, to the structural and functional characteristics of the sensory and memory systems of learners and "users" of music. This hypothesis has found initial support in laboratory experiments on music transmission. In this article, we first review some of the most important theoretical and empirical contributions to the field of music evolution. Next, we identify a major current limitation of these studies, i.e., the lack of direct *neural* support for the hypothesis of cognitive adaptation. Finally, we discuss a recent experiment in which this issue was addressed by using event-related potentials (ERPs). We suggest that the introduction of neurophysiology in cultural transmission research may provide novel insights on the micro-evolutionary origins of forms of variation observed in cultural systems.

Keywords: cultural transmission, diffusion chains, signaling games, iterated learning, music universals, music diversity, neural predictors, mismatch negativity (MMN)

## INTRODUCTION

There has recently been a surge of interest in the biological and cultural origins, and evolution of music (Wallin et al., 2001; McDermott and Hauser, 2005; Patel, 2010). Music is prominent in virtually all human societies, and in its most sophisticated versions it is only attested in humans. This fact raises two important questions: how did music originate? And how did it evolve into its current forms? One intriguing issue here, especially in relation to the cognitive and neural bases of music evolution (Honing et al., 2015), is that of the evolution of musical structure. Musical systems are structured at several levels, from melody and harmony to rhythm and composition, in ways that may resemble the organization of other human generative systems, such as language (Jackendoff and Lerdahl, 1983; Jackendoff, 2009). The analogy between language and music may be pushed further, if one considers aspects of music that may be understood "semantically." Listening to music can evoke a wide range of extra-musical experiences, from emotional feelings (e.g., the sadness suggested by Albinoni's Adagio in G minor) to the mental imagery of specific referents (e.g., characters or ideas in Wagnerian Leitmotifs) (Patel, 2010). Musical structures can and often do relate to a world of possible experiences and non-musical phenomena (Lerdahl, 2003) expressively (by being associated with internal affective states, e.g., emotional qualities), if not representationally (via relations of reference and truth, as language does) (Patel, 2010).

In this work, we focus on the cultural origins of musical syntax: the set of principles governing the combination of melody and rhythm into "well-formed" sequences (for a discussion on the evolution of semantic structures see Lumaca and Baggio, 2017, 2018; Ravignani and Verhoef, 2018). Some aspects of musical syntax, such as the organization of temporal structure and pitch intervals, display widespread distribution and striking cross-cultural similarities. For example, the tendency to use small intervals in non-polyphonic melodic phrases, or "proximity," has been observed across several musical traditions of the world, including indigenous tunes from North America, Europe, and Asia (Dowling, 1968; Von Hippel, 2000). Despite some exceptions, such as Scandinavian and Swiss yodeling music, proximity is a prominent feature of melodic structure. These shared attributes are known as "musical universals." Nevertheless, their form and frequency differ across and within different musical traditions of the world (Lomax, 1977; Rzeszutek et al., 2012; Savage et al., 2015). How can we explain both the invariance and the variation of structure in music? Which processes underlie the cross-cultural convergence toward common music traits or their diversification? In this paper, we suggest that neuroscience can provide critical methodological and theoretical tools for testing and generating hypotheses on this complex matter.

This article is organized as follows. We start by presenting a recent theoretical perspective in which music is understood as an evolving cultural system, adapting to the human brain [sections Linking Biological and Cultural Levels of Analysis and From Cultural Transmission to Neurophysiology (and Back)]. In section The Cognitive Level: Diffusion Chains and the Evolution of Musical Regularities in the Lab, we describe studies that support this view using data from behavioral experiments. In section The Neural Level: Constraints Imposed by a Neuronal Niche Drive the Emergence of Regularities, we transpose our analysis of cultural adaptation to the neural level. Partly using the "neuronal recycling hypothesis" as a theoretical framework (Dehaene and Cohen, 2007), we argue that music can adapt to a "neuronal niche" defined by the specific information processing constraints imposed by neural circuits originally evolved for auditory streaming.

To our knowledge, no one until recently has investigated this hypothesis by means of brain imaging or neurophysiology. In section Neural Predictors in Cultural Evolution Research, we describe a recent experiment in which this hypothesis was tested combining behavioral and neurophysiological methods. Finally (section The Neural Origins of Cultural Variation), we suggest that the introduction of concepts and methods from neuroscience in music evolution, and cultural evolution in general, can provide new insights on the process of cultural variation.

## Linking Biological and Cultural Levels of Analysis

Music may be seen as a complex adaptive system, shaped by various biological, environmental, and cultural factors. This has made it difficult for musicologists and cognitive scientists to analyze the evolutionary origins of musical structure. The predominant view during the last century was the cultural account, where music was seen as an entirely socio-cultural construct, free to vary with virtually no biological and environmental constraints on its structure and content (Nettl, 1983; Repp, 1991; Blacking et al., 1995). The striking diversity of musical forms, as attested across and within cultures, and over human history, seems to support this notion (Lomax, 1968; Henry, 1976). Yet, this account has been challenged by experiments in psychology and neuroscience, together supporting a broadly biological account of the origins of music. Several studies point to the existence of perceptuo-cognitive biases and constraints in music processing and production (e.g., Trehub, 2000; Drake and Bertrand, 2001; Zatorre, 2001; Peretz and Zatorre, 2005; Deutsch, 2012) with some parallels in other species (Fitch, 2015). On this view, prototypical properties of music, such as a relatively steady beat, smooth melodic contours, tonality, and a narrow distance between adjacent tones (or "pitch proximity"), derive from built-in functional properties of the brain (McDermott and Hauser, 2005), which tend to manifest themselves in most human cultures (Lerdahl, 1992; Savage et al., 2015).

A recent view is that neither the "cultural account" nor the "biological account" can independently provide a satisfactory theory of the origins and evolution of musical structure (Trainor, 2015). Cultural accounts typically focus on the evolution of musical systems, while biological accounts investigate the evolution of the human capacity to perceive, appreciate, and produce music (also including musicality; Honing et al., 2015). These different accounts, however, may be connected within a more complete explanatory framework, if one accepts that music is neither an entirely arbitrary cultural construct nor strictly a biological product. Much like natural language, music is a cultural construct, which nonetheless rests upon, and is partly shaped by, human neurobiology. Our neurobiological makeup determines the scope and constraints of human auditory memory capacity, hierarchical sequence processing, attention, perceptual hearing threshold, and auditory scene analysis (Snyder, 2008; Deutsch, 2012). This is now a central tenet in the field of music cognition, and it is becoming increasingly accepted in cultural analyses of music, too. The open question is how neurobiological capacities, biases and constraints manifest themselves in actual musical systems (Trainor, 2015).

## From Cultural Transmission to Neurophysiology (and Back)

Answering this question requires theories, models, and empirical data that can effectively bridge the gap between the classical chasms of (cultural) evolutionary science: between individual-level and population-level processes, micro-evolutionary and macro-evolutionary processes (Mesoudi, 2011). Specifically, one important question is how the individual's neurobiological endowment manifests itself in music at the population level. This issue was already known in linguistics as the "problem of linkage" (Kirby, 1999). A possible answer is "through cultural transmission." Music, much like language, is not only a richly structured symbolic system, but also a set of behaviors that is maintained over time by intergenerational transmission (Morley, 2013; Le Bomin et al., 2016).

During intergenerational transmission, cultural information must survive a "memory bottleneck" (Deacon, 1997): the set of all neurobiological biases or constraints that bind our capacity to infer (and store) the "rules" that govern a system of information<sup>1</sup>. The properties of the cultural system that fit best the human neurobiological filter (e.g., those that make information easier to process, encode, and recall) will have greater likelihood of being passed on to the next generation. If this view is correct, in the long run the neurobiological endowment of individuals should be reflected in the musical corpus at the population level.

This view of transmission, emphasizing adaptation of fast-changing cultural systems to a largely stable neurocognitive architecture, was developed in evolutionary linguistics to account for the emergence of structure in human languages, including putative linguistic universals (Christiansen and Chater, 2008). Recent methodological advances (Mesoudi, 2015; Edmiston et al., 2018) have provided support for this view in controlled laboratory conditions. In most experiments, groups of individuals engage in simple, controlled forms of knowledge transmission, for example from a participant (a sender) to another (a receiver), along a diffusion chain. Each participant represents a "generation," and each interaction between participants allows for the passage of information across generations (Esper, 1925; Bartlett, 1932). The set of items transmitted along a diffusion chain (e.g., linguistic or musical phrases) is a finite sample drawn from the (infinite) set of items that learners have to generalize from. A challenge for research on cultural transmission is to show that core properties of the artificial systems being transmitted are also properties of the actual cultural systems being modeled and that the mechanisms at work in artificial conditions are also at work in real cultural evolution. In a landmark study, Kirby et al. (2008) showed how miniature "languages" emerge in the course of transmission from initial random associations of signals and meanings. When these pairings are transmitted across "generations" of participants, some regularities emerge, including compositionality (Hockett, 1960), as observed in human language. This result supports the view that core properties of language can be explained by the interplay of individual cognitive biases (sensu Brighton et al., 2005) and iterated cultural learning and transmission.
Recent studies on animal models of cultural learning further support this conclusion (e.g., for non-human primates see Claidière et al., 2014; for a seminal study on zebra finches see Fehér et al., 2009).

One way to start bridging this gap in the musical domain is to assume that music, like language, is a complex adaptive cultural system, shaped for thousands of years by cycles of transmission, acquisition, and use (Morley, 2013). Following this view, neurobiological biases and constraints, as discussed above, brought out through cultural transmission, would exert effects on the form and structure of music (Merker et al., 2015; Trainor, 2015; Mehr et al., 2018). This mechanism could explain some properties of temporal (rhythm, meter) and spectral (melody, harmony) dimensions of musical structure, which are likely to be the result of adaptations to the combined pressures of neural constraints and various socio-cultural forces (Merker, 2006; Merker et al., 2015; Trainor, 2015). This would in principle apply to both invariants (putative cultural universals shared by musical systems or traditions; Savage et al., 2015) and variation among individuals, generations, and traditions.

This point is not new. Lévi-Strauss (1960) had already observed that some structural regularities observed across cultures (e.g., the fact that symbolic material tends to be organized in binary oppositions) are reflections of principles of brain organization. Therefore, neuroscience is expected to contribute to explanations of the emergence and evolution of structural regularities, including their convergence and diversity. However, to date this issue has been addressed only by behavioral studies, and only to explain some invariant aspects of musical structure. In the next section, we summarize three of these lines of experimental work in the field of music evolution.

## THE COGNITIVE LEVEL: DIFFUSION CHAINS AND THE EVOLUTION OF MUSICAL REGULARITIES IN THE LAB

In recent experiments, a diffusion chain method was used to study how music evolves in the lab (Ravignani et al., 2016). This study aimed to test whether human psychological biases, amplified by cultural transmission, can explain the emergence of rhythmic universals (Trehub, 2015). In this experiment, participants were given a drumstick and an electronic drum pad. Participants in the first generation listened to 32 randomly generated, hence a-rhythmic, patterns of beats (the input), and were asked to reproduce each of them to the best of their abilities (the output). The "imperfect" output produced by this first generation of participants became the input for the next generation, whose task was to perform the rhythm they heard, and so on, along a diffusion chain. This paradigm is known as "iterated learning" (IL) (Kirby et al., 2008). Given the difficulty of memorizing these patterns, errors were introduced in the emerging system of drumming sequences, slightly modifying the original patterns at each generation. Across generations, patterns became increasingly structured and easier to learn. After 8 generations, at the end of each diffusion chain, patterns showed regularities similar to those found across musical traditions of the world. These universal rhythmic regularities included a tendency toward small integer ratios (e.g., 1:1 and 2:1) of intervals between beat durations, and a relatively steady beat, also termed "isochrony" (Savage et al., 2015). This study represents the very first attempt to "grow" musical universals in the lab (Fitch, 2017), and sheds light on the cognitive and cultural mechanisms underlying the creation and vertical transmission of music (Le Bomin et al., 2016).

<sup>1</sup>Our definition of "memory bottleneck" includes constraints on perceptual grouping; capacity and temporal limits of auditory memory, serial processing, and attention; constraints on the neurodynamics of the auditory system; perceptual hearing thresholds. We limited this list to constraints "directly" related to basic aspects of perception and cognition. We acknowledge that constraints of a different nature might have a formative power over musical structures (e.g., motoric, motoric-expressive, physiological, cross-modal, and semantic).
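The logic of the drumming diffusion chain can be illustrated with a toy simulation. The sketch below is not the procedure or analysis of Ravignani et al. (2016): the learner model, the `pull` and `noise` parameters, and the `integer_ratio_error` measure are all assumptions introduced here for illustration. Each simulated "generation" copies the intervals it hears with a small reproduction error and a weak bias toward whole-number multiples of the shortest interval, and the output of one generation is the input of the next.

```python
import random

def integer_ratio_error(intervals):
    """Mean distance of each interval (in units of the shortest one)
    from the nearest whole number; 0 means pure integer ratios."""
    unit = min(intervals)
    return sum(abs(iv / unit - round(iv / unit)) for iv in intervals) / len(intervals)

def reproduce(intervals, pull, noise, rng):
    """One hypothetical learner: copy every interval with a weak pull toward
    the nearest integer multiple of the shortest interval, plus copying noise."""
    unit = min(intervals)
    copy = []
    for iv in intervals:
        target = max(1, round(iv / unit)) * unit   # nearest integer multiple
        biased = iv + pull * (target - iv)         # weak regularizing bias (assumed)
        copy.append(biased * (1 + rng.gauss(0, noise)))  # reproduction error
    return copy

def run_chain(seed_pattern, generations=8, pull=0.3, noise=0.02, seed=None):
    """Diffusion chain: the output of generation n is the input of n + 1."""
    rng = random.Random(seed)
    history = [list(seed_pattern)]
    for _ in range(generations):
        history.append(reproduce(history[-1], pull, noise, rng))
    return history

if __name__ == "__main__":
    pattern = [0.31, 0.47, 0.83, 0.22, 0.58]       # arbitrary intervals in seconds
    for gen, p in enumerate(run_chain(pattern, seed=1)):
        print(gen, round(integer_ratio_error(p), 3))
```

With `noise=0.0` the weak bias alone shrinks the distance from integer ratios a little at every generation, which is the amplification-by-transmission idea in its simplest form; with noise, the trend is only expected on average.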

An IL study by Verhoef (2012) investigated the cultural evolution of combinatorial structures in musical systems. Participants were first exposed to a set of 12 whistles that they had to imitate immediately after listening by using a slide whistle (training phase). Next, they were asked to reproduce the whole set of signals as they remembered it (recall phase). The sequences generated by a participant were used to train the next one in the diffusion chain, and so on, until the end of the chain. In the course of transmission, structural regularities emerged, as predicted by previous computer simulations (de Boer, 2000). In the last generations, fewer discrete units were reused by individuals in concatenations, repetitions, or mirror forms to produce the entire vocabulary of whistles. Combinatoriality is a "design feature" of human language (Hockett, 1960) and it applies to musical structure, too. For instance, the authors observed that two distinct whistles were often combined into a single pattern by the next generation of individuals. Also, participants tended to produce mirror forms out of single patterns, so that more elements were shared between signals of the same set. With fewer units to memorize, organized in this manner, the set of signals was more structured, more compressed, and easier to learn and reproduce.

A more recent attempt to study music evolution in the lab is the work by Lumaca and Baggio (2017). The authors used a different model of cultural transmission than IL: multigenerational signaling games (MGSGs) (Moreno and Baggio, 2015; Nowak and Baggio, 2016). MGSGs are in essence an iterated variant of signaling games (Lewis, 1969; Skyrms, 2010) that combine basic aspects of semiotic models of coordination and communication (e.g., horizontal transmission; Galantucci and Garrod, 2011) with the intergenerational transmission of IL (Kirby et al., 2008). Two-person signaling games were organized in diffusion chains of 8 generations each. In each game, the sender and receiver were expected to converge, through repeated interactions, on a common code: a signaling system where 5 isochronous melodic riffs were associated with basic or compound emotions. This design can help model different aspects of music transmission: first, a degree of alignment of internal states between musical senders (e.g., composers) and receivers (e.g., an audience) at two main levels, the structural and the affective (Temperley, 2004; Bharucha et al., 2011); second, a partial asymmetry in information flow from senders to receivers, which is present in language and music transmission (e.g., from composers to listeners, from teachers to pupils, etc.). In each signaling trial, the sender was presented on the screen with one of the 5 equiprobable emotions (visualized as human facial expressions) and was asked to compose a 5-note isochronous riff on the computer keyboard. The receiver, after listening to the riff via headphones, was asked to choose one of the 5 expressive faces displayed on the screen (i.e., the one possibly seen by the sender). Feedback was then presented simultaneously on both participants' screens, showing the expressive face seen by the sender and the one chosen by the receiver for the same melodic signal. This procedure was repeated at each successive trial.
At the end of the game, the receiver (generation n) became the sender in the next game, with the same structure and a new participant as a receiver (generation n + 1), and so on, until the chain was completed. Senders were always asked to transmit the code they had learned in the previous game. Therefore, recall errors in the melodic signals (possibly "innovations") were introduced throughout the experiment. The authors observed the gradual evolution over generations of several structural features of musical phrases: pitch proximity and continuity, symmetry, and motivic structure.
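The structure of a multigenerational signaling game can be sketched in code. The example below is a simplified stand-in, not the experimental procedure of Lumaca and Baggio (2017): human participants are replaced by simple reinforcement learners (urn-style learning of the kind discussed by Skyrms, 2010), the melodic riffs are reduced to 5 abstract signal labels, and the function names `play_game`, `receiver_to_sender`, and `run_chain` are hypothetical. What it preserves is the design: equiprobable states, shared feedback after each trial, and the receiver of generation n becoming the sender of generation n + 1.

```python
import random

def play_game(n_states=5, n_signals=5, trials=2000, sender=None, rng=None):
    """One two-player signaling game with reinforcement learning.
    sender[state][signal] and receiver[signal][state] are propensities."""
    rng = rng or random.Random()
    sender = sender or [[1.0] * n_signals for _ in range(n_states)]
    receiver = [[1.0] * n_states for _ in range(n_signals)]   # naive receiver
    successes = 0
    for _ in range(trials):
        state = rng.randrange(n_states)                        # equiprobable states
        signal = rng.choices(range(n_signals), weights=sender[state])[0]
        guess = rng.choices(range(n_states), weights=receiver[signal])[0]
        if guess == state:                                     # shared feedback
            sender[state][signal] += 1.0
            receiver[signal][state] += 1.0
            successes += 1
    return sender, receiver, successes / trials

def receiver_to_sender(receiver, n_states):
    """Turn a learned receiver table into a sender table for the next game:
    the propensity to send a signal for a state is how strongly that
    signal evoked the state for this player as a receiver."""
    return [[receiver[sig][st] for sig in range(len(receiver))]
            for st in range(n_states)]

def run_chain(generations=8, seed=0, **game_options):
    """Diffusion chain of signaling games; returns one success rate per game."""
    rng = random.Random(seed)
    sender, rates = None, []
    for _ in range(generations):
        _, receiver, rate = play_game(sender=sender, rng=rng, **game_options)
        sender = receiver_to_sender(receiver, len(receiver[0]))
        rates.append(rate)
    return rates

if __name__ == "__main__":
    print([round(r, 2) for r in run_chain()])
```

The asymmetry noted in the text is built into this sketch: the sender of each game after the first starts from a learned code, whereas the receiver always starts naive and has to adapt to it.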

Despite differences in their assumptions and methods, those three experiments have reached similar conclusions: the immediate effects of psychological constraints on the musical systems may be weak, but they are amplified in the course of inter-generational transmission (Boyd and Richerson, 1988; Kalish et al., 2007; Kirby et al., 2007; Thompson et al., 2016) or iterated reproduction (Jacoby and McDermott, 2017), leading the evolution of musical structures along non-random paths. If principles of auditory organization and memory constraints operate in similar ways also in the production and perception of actual music, they could similarly shape the evolution of historical systems in the course of iterated transmission. Convergence toward some of the musical structures found across populations (Savage et al., 2015) could be then explained, to some extent, by adaptation to a special niche, constituted by a restricted set of low-level perceptual and memory processes. In the rest of the paper we will refer to this special niche as "neuronal niche" (Dehaene and Cohen, 2007).

## THE NEURAL LEVEL: CONSTRAINTS IMPOSED BY A NEURONAL NICHE DRIVE THE EMERGENCE OF REGULARITIES

In recent years, there has been an increasing interest in how the brain accommodates and shapes novel cultural symbolic systems (Dehaene and Cohen, 2007). A leading hypothesis is that some cortical circuits, initially evolved as a result of specific selective pressures, are later "recycled" to accommodate novel cultural functions (Dehaene and Cohen, 2007; Simon et al., 2013; Dehaene et al., 2015; Skeide et al., 2017). Therefore, the acquisition of novel functions is constrained, however weakly, by prior human evolution. Once "culturally recycled," preexisting systems and mechanisms maintain some of their original capacities and limitations, providing a neuronal niche within which culture may adapt and evolve. This also means that the variability observed in cultural systems is limited by brain structure and function across individuals and groups.

If this hypothesis is correct, near-universal characteristics of music (Savage et al., 2015) may be traced back to the computational infrastructure of human auditory cortex and other (e.g., motor, attentional etc.) areas of the brain. Trainor (2015) interpreted certain invariant musical features as adaptations to bottom-up neural mechanisms of auditory scene analysis (ASA), such as sequential sound segregation and the integration of within-stream elements (Bregman, 1994). These mechanisms have evolved specifically to detect and localize multiple sources of auditory objects and to extract regularities from the acoustic environment. They often involve the perceptual grouping of single-event auditory stimuli into auditory streams and operate following Gestalt principles of proximity, similarity, and continuity (Deutsch, 1999). They are automatic (pre-attentive), they emerge early in human development (Demany, 1982; Winkler et al., 2003), and they are widely conserved across species (Fay, 2008). This suggests that the ASA neural circuitry is likely phylogenetically older than human music. Thus, the exaptation (or evolutionary reuse) (Gould and Vrba, 1982) of this more ancient biological mechanism by music should impose constraints on the way music is stored and organized in the brain, and accordingly, on the way it is recalled during transmission. In this regard, perceptual and memory recall advantages have been reported for tone streams that conform to Gestalt principles of organization (Bendixen et al., 2010; Loui, 2012; Rohrmeier and Cross, 2013).
The cross-cultural tendency to organize music following these principles (Huron, 2001), in addition to the findings reported by cultural transmission research (Verhoef, 2012; Ravignani et al., 2016; Lumaca and Baggio, 2017), may support the idea that the neurocomputational constraints of the human auditory system constitute a filter through which musical material must pass, adapt, and eventually evolve.

It is surprising that up until recently, no one has attempted to find (counter-) evidence of cultural adaptation using neural measures. Research has shown that even recently-encoded information is shaped by perceptual or memory constraints into more compressed and abstract forms (Tamariz, 2017). Yet, the neural mechanisms underlying this phenomenon remain unknown. One reason is arguably our limited understanding of how information is represented in the brain (Mesoudi et al., 2006). Current whole-brain methods, such as functional magnetic resonance imaging (fMRI), are not well-suited to investigate the precise basis of mental representations (but see Haynes and Rees, 2006; Johnson and Johnson, 2014; Zadbood et al., 2017). Another issue is to establish a link between neural constraints on learning (neural activity underlying specific, fast, and accurate encoding processes; Sadtler et al., 2014) and cultural adaptation. Electrophysiological methods, such as multiunit recordings, seem ideal for this purpose, but they are too invasive to be performed on healthy individuals. Various animal models of social learning (in songbirds, primates, and other species) have provided useful information in this respect (Araki et al., 2016; Gadagkar et al., 2016; Tchernichovski and Lipkind, 2016). None of these species possesses cultural behaviors as rich and complex as human music. However, some of their behaviors exhibit structured patterns, which are maintained over time through inter-generational transmission. Cultural transmission, in turn, can shape animal vocal behavior so as to fit species-specific learning constraints (Fehér et al., 2009; Fitch, 2009).

The application of techniques and models used in language evolution allows researchers of animal behavior to explore the biology of culturally transmitted systems in simpler and more controlled conditions, and to answer questions about cultural adaptation that cannot be directly answered in humans using current methods (but see next section for indirect answers). For example, Araki et al. (2016) used cellular recordings to demonstrate the existence in zebra finches of constraints on neuronal temporal coding that limit song acquisition to certain species-specific temporal features. Juvenile birds acquire their songs by imitating adult tutors. Although zebra finches are not bound to learn only specific sequences, they do show significant consistencies in their vocal repertoires (Lachlan et al., 2016). Do these consistencies result from adaptation of song material to the zebra finch neural constraints on learning? Araki et al. (2016) found that a subset of neurons in the zebra finch auditory cortex responds synchronously and selectively to patterns of intersyllable silent gap durations, which are typical of their songs. The same cell population was unresponsive to other species' songs. Temporal coding mechanisms like this are thought to preserve the species-specific song identity from any random drifts that may be introduced during cultural transmission.

Critically, the same mechanisms might underlie learning behaviors that resemble cultural adaptation in humans. When presented with the songs of other species, zebra finches tend to gradually adjust the duration of inter-syllable intervals toward their own (species-specific) songs' temporal structures, in a way similar to the human adjustment of random auditory stimuli toward Gestalt features. To our knowledge, this work provides the first cellular-level support of the idea of a neurobiological basis of cultural adaptation. It remains to be determined to what extent their findings can be generalized to other species. Would similar neuronal constraints operate in humans? Could they explain perceptual predispositions for some musical features (e.g., for small intervals and isochronous beat)? Are those neuronal constraints species-specific or, instead, are they shared with other species (Nicolai et al., 2014)? Another critical question is whether inter-individual variability in the neural filter is reflected in forms of cultural variation, for example in participant behavior during transmission, or in the shape taken by cultural systems as a result of it. Cross-individual variability is typically regarded as a source of noise in cultural transmission research, and is often removed by means of various procedures. The idea of linking individual neural variability with cultural variation may lend itself well to investigations using brain imaging and electrophysiology, but no one until recently has adopted this approach in cultural transmission research.

## Neural Predictors in Cultural Evolution Research

In a recent experiment, Lumaca and Baggio (2016) addressed some of these issues using a neural predictors approach (Berkman and Falk, 2013). This entails use of neuroimaging (fMRI, PET) or electrophysiological methods (EEG/ERPs, MEG) to identify neural predictors of behavior (for examples in the music domain, see Golestani et al., 2002; Zatorre et al., 2012; Zatorre, 2013). Lumaca and Baggio (2016) used neural predictors of signaling behavior as a first approach to examine whether and how symbolic systems adapt to human neural information processing systems, and to assess the effects of inter-individual variation in neural information processing on three core cultural behaviors: social learning, transmission, and regularization of signal sequences. To this purpose, the authors used one of the best-investigated brain signatures of auditory processing, the mismatch negativity (MMN) (Näätänen et al., 1978).

The MMN is a fronto-central negative wave, evoked by violations of some perceptual regularity (Paavilainen, 2013) which is picked up by the brain in a visual or auditory stimulus stream. The limited influence of attentive processes on the MMN (Paavilainen, 2013) and its onset (∼200 ms from the relevant stimulus) suggest that the MMN is a low-level marker of auditory processing. The encoding of regularities from an auditory input, possibly through the same ASA mechanisms reported above, is an antecedent condition for the elicitation of the MMN (Näätänen et al., 2001). The efficiency of these mechanisms is revealed by the MMN latency and amplitude (Näätänen et al., 1993; Tervaniemi et al., 2001). Larger amplitudes or shorter latencies are typically associated with more accurate representations of the input material and, thus, they are taken as proxies of more efficient encoding mechanisms. The MMN has been used to study how efficiently an individual's auditory system extracts and encodes regularities from acoustic inputs, and how this process may affect linguistic and musical behaviors. For example, differences in ERP responses in infants have been successfully used in various studies to predict cognitive and linguistic development (Molfese and Molfese, 1997; Choudhury and Benasich, 2011). Overall, these studies open up the possibility of using low-level neural markers to predict individual behavior during transmission and acquisition of language, music, and cultural material more generally. Structural properties of symbolic systems may thus be understood as adaptations to information processing bottlenecks during cultural transmission (Kirby, 2001; Tamariz and Kirby, 2015). It should then be possible, for example, to find a relationship between individual brain processing capabilities or limitations, and the degree of regularization imposed by each individual on the cultural material that is being transmitted and acquired.

Neurophysiological (ERP) evidence for this type of effect was provided by Lumaca and Baggio (2016) in the domain of melodic structure. The authors combined ERPs with diffusion chains on two successive days. On day 1, they identified a neural correlate of extracting regularities from 5-tone sequences in musically naïve individuals in a classical auditory oddball paradigm. ERPs were recorded while participants were presented with randomly interleaved standard (80%) and deviant (20%) stimuli: there was no task for the participants, who were watching a silent movie throughout the session. On day 2, participants played a reduced version of MGSGs, with melodic systems of the same kind used by Lumaca and Baggio (2017). Each participant played the first signaling game as receiver (learner) and the second as sender (transmitter)<sup>2</sup>. The main question addressed by the authors was whether constraints and biases on auditory processing could drive the melodic material toward known Gestalt principles of perceptual organization (Lumaca and Baggio, 2017). The results showed that inter-individual variation in neural information processing, as revealed by the latency of the MMN on day 1, predicted learning and transmission of melodic signaling systems in the MGSGs on day 2. Specifically, individuals with longer MMN latencies performed "worse" in the MGSGs, showing lower coordination, transmission, and accuracy. Yet, these participants introduced more innovations than participants with shorter MMN latencies. Inter-individual variation in neural auditory processing (or regularity encoding) may be sufficient to discriminate "better" from "worse" transmitters, as observed in the cultural transmission of music (Sawa, 2002).
However, perhaps the most interesting finding was that participants with longer MMN latencies introduced more regularities in the artificial tone system, reproducing more often melodic structures that were more compressed (signals from the same set became more similar), more proximal (temporally adjacent elements in the signals were closer in pitch), and smoother (the sequences showed a coherent melodic direction) than the sequences they originally received. To our knowledge, this study is the first demonstration that three essential processes underlying cultural evolution (i.e., social learning, transmission, and innovation), and three near-universal properties of melodic structure (i.e., proximity, continuity, and compression) are constrained by the organization of sensory and memory systems in the brain. The MMN is only "the tip of the iceberg" here. The MMN is likely to reflect auditory scene analysis and encoding mechanisms. Constraints on these mechanisms, as revealed (among others) by MMN latencies, may represent a "neuronal niche" through which cultural material must pass, adapt, and evolve (see below). In a cultural evolutionary context, this finding may provide clues to the origins of forms of variation observed in cultural symbolic systems. We discuss this point in the next paragraph.
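To make the day-1 measure concrete, the sketch below shows one simple way an MMN latency and amplitude could be read off an oddball recording: average the epochs for standards and deviants, subtract, and take the most negative point of the difference wave within a search window. This is a minimal illustration on synthetic numbers, not the authors' ERP pipeline; the 100 to 250 ms window, the function names, and the unconstrained random ordering of stimuli are assumptions.

```python
import random

def oddball_sequence(n_trials, p_deviant=0.2, rng=None):
    """Randomly interleave standards (about 80%) and deviants (about 20%)."""
    rng = rng or random.Random()
    return ["deviant" if rng.random() < p_deviant else "standard"
            for _ in range(n_trials)]

def average(epochs):
    """Point-by-point mean across equally long epochs (lists of microvolts)."""
    return [sum(samples) / len(epochs) for samples in zip(*epochs)]

def mmn_peak(standard_epochs, deviant_epochs, times_ms, window=(100, 250)):
    """Return (latency_ms, amplitude) of the most negative point of the
    deviant-minus-standard difference wave inside the search window."""
    diff = [d - s for d, s in zip(average(deviant_epochs),
                                  average(standard_epochs))]
    in_window = [(amp, t) for amp, t in zip(diff, times_ms)
                 if window[0] <= t <= window[1]]
    amplitude, latency = min(in_window)        # most negative amplitude wins
    return latency, amplitude

if __name__ == "__main__":
    times = list(range(0, 400, 10))            # one sample every 10 ms
    standards = [[0.0] * len(times) for _ in range(4)]
    # Synthetic deviant response: a negative deflection peaking at 180 ms.
    deviants = [[-max(0.0, 3.0 - abs(t - 180) / 20) for t in times]
                for _ in range(2)]
    print(mmn_peak(standards, deviants, times))
```

A per-participant latency obtained this way is the kind of neural predictor that could then be related to that participant's behavior in the signaling games.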

## The Neural Origins of Cultural Variation

Human cultural traits show a myriad different forms across world cultures. Music, like language, provides an excellent example of this diversity, within and between populations (Lomax, 1959; Rzeszutek et al., 2012). For instance, the tendency toward the use of intervals of small size or the division of the octave (2:1) into a limited number of tones (or "discreteness") as observed in several cultures (Merriam et al., 1956; Dowling, 1968) is counterbalanced by significant diversity, within and between those cultures, in the relative frequency of such traits (Savage et al., 2015). The frequency distribution of proximal intervals (<700 cents; Savage et al., 2015) differs across musical traditions, with variation being mostly confined to the interval range 0 (unison) to 6 semitones (Huron, 2001). A similar diversity was found in the "tonal material" of musical cultures (i.e., the total set of discrete pitches within an octave), which spans from the 12 semitones of the Western musical scale to the 22–24 microtonal steps of North Indian and Arabic scales (Malm, 1967; Ayari and McAdams, 2003).
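The interval sizes mentioned above are expressed in cents, a logarithmic unit in which an octave (2:1) spans 1200 cents and an equal-tempered semitone 100 cents. The short sketch below works through the conversion; the helper names are hypothetical, and the equal-step calculation is purely illustrative, since the microtonal steps of the traditions cited are not necessarily equal in size.

```python
import math

def cents(f_low, f_high):
    """Size of the interval between two frequencies, in cents."""
    return 1200 * math.log2(f_high / f_low)

def equal_step_cents(steps_per_octave):
    """Step size if the octave were divided into equal steps (an idealization)."""
    return 1200 / steps_per_octave

if __name__ == "__main__":
    print(cents(220.0, 440.0))        # octave, frequency ratio 2:1
    print(round(cents(2, 3), 1))      # just fifth (3:2), close to 700 cents
    print(equal_step_cents(12))       # equal-tempered semitone
    print(equal_step_cents(24))       # quarter-tone step under equal division
```

On this scale, the 0 to 6 semitone range cited from Huron (2001) corresponds to 0 to 600 cents in equal temperament, which falls under the 700-cent threshold used for proximal intervals.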

<sup>2</sup>In signaling games with fixed roles, including all MGSGs, the receiver tends to learn the code transmitted by the sender. In other words, there is asymmetry in the division of coordination labor between the sender and the receiver, with most coordination work (most code changes) falling to the latter (Nowak and Baggio, 2016).

The evolutionary mechanisms that affect the relative frequency of musical characters, such as random cultural drifts and biased selection, have been extensively studied in recent years (Mesoudi, 2015). For example, MacCallum et al. (2012) used a biologically-inspired evolutionary system to explore the effects of "aesthetic" selection on the frequency distribution of musical characters. A population of listeners was asked to rate the pleasantness of randomly generated tunes. The top-rated tunes recombined or mutated into novel variants that were in turn evaluated by a new generation of consumers. The authors reported an over-time increase of characters classically regarded as "musical," such as isochrony and chordal clarity. This work was the first of its kind to show that consumers' preferences can deeply shape the evolution of music in the near absence of learning and memory pressures. It is still controversial whether aesthetic preferences are just a social construct, changing over time, or if instead they are themselves stable information processing biases (for an in-depth discussion on this topic see Hodges, 2009; Huron, 2009). In a recent model, Reber et al. (2004) combined the two proposals. Specifically, the authors put forward the hypothesis that aesthetic preferences result from an interaction between knowledge-dependent stylistic rules and information processing fluency for certain stimulus properties (e.g., symmetry, clarity, and the amount of information content) (Nieminen et al., 2011). This may explain the evolution of music toward specific features, such as symmetry and chordal clarity (MacCallum et al., 2012; Verhoef, 2012; Lumaca and Baggio, 2017). A similar proposal was made by Haiman (2011) to explain the emergence of symmetric compounds in language. These arguments are still hypothetical, but we are now starting to understand the effects of these biases on the cultural evolution of music (Savage and Brown, 2007). 
Specifically, we know that these processes can enhance the diversity of musical behaviors and forms, but they can also produce local homogeneity<sup>3</sup>. While those mechanisms can explain how musical variants spread over time in a population, the sources of variability remain to a large extent elusive.

Up until now, only four main mechanisms of variation have been considered in music: creative innovation (e.g., via original musical composition), borrowing (through blending or syncretism), translation (from one tonal system to another; Alekseyev, 1986), and random mutation (errors in music copying or performance) (Savage and Brown, 2007). Lumaca and Baggio (2016) provided evidence for an additional mechanism: individual neural variability. One could argue that every individual in a population represents a distinct and unique "neuronal niche" (Dehaene and Cohen, 2007), through which cultural material is filtered and to which it may eventually adapt. Minor inter-individual differences in neural information processing can manifest themselves in differences in musical behavior. Moreover, they can be amplified and spread via different cultural evolutionary mechanisms. Small differences in learning or information processing can have large system-level effects, if they are amplified by cultural transmission.

One tenet of cultural transmission research is that cultural systems evolve toward certain prior distributions, known as "cognitive attractors" or "inductive biases" (Sperber, 1996; Griffiths et al., 2008). Strong versions of this account have been challenged by recent modeling work (Navarro et al., 2017). The convergence toward priors holds in the (implausible) scenario where all learners are endowed with the same identical prior. However, when learners instantiate (slightly) different constraints, the emerging cultural systems may reflect the more idiosyncratic biases of some individuals. In light of our findings, one could suggest that individuals with "tighter bottlenecks" exert a disproportionately large effect on the evolution of musical structures (see Ravignani et al., 2018 for some issues concerning this view). Similarly, differences between populations in brain function and anatomy may, at least in part, be reflected in differences in the structure of the symbolic systems in use. This account has recently found some support in language evolution research. Dediu and Ladd (2007) have shown that the population-level frequencies of two human genes involved in brain growth, Microcephalin and ASPM, are reliably associated with the presence or absence of linguistic tones in that population. The authors' proposal is that variants of these genes may determine small biases at the individual level in the processing and acquisition of linguistic tones, which may in turn give rise to distinct language variants. Those variants are hardly detectable in individual subjects, because tonal and non-tonal languages can be acquired by any individual, independently of genetic variants (Ladd et al., 2008). But when their effects are amplified by intergenerational transmission (Kirby et al., 2008), these variants may give rise to measurable, large-scale population differences.

Dediu and Ladd (2007) is the first study suggesting that variation, as observed in cultural traits and in their distribution, may originate in interindividual neurogenetic variability. Lumaca and Baggio (2016) provide converging neurophysiological evidence in support of this view (for the genetic bases of interindividual variation in musicality, see Gingras et al., 2015). Genetic and neural variability are not the only source of cultural variation, but they are likely to play a prominent role in any future theory of the biological roots of culture. For example, Brown et al. (2014) have shown that musical and genetic diversity may correlate to some degree. After sampling a set of traditional songs from 9 indigenous populations in Taiwan, they measured the relative distance for 41 properties of song structure and performance-style. Music and genetic distance among the populations were significantly correlated. A similar relation was found in Eurasian populations (Pamjav et al., 2012). The study of genetic and neural variability may help address questions that were considered taboo in ethnomusicology until fairly recently: for example, whether a causal relationship exists between the distribution of some gene variants and aspects of musical systems and behaviors (Jordania, 2006, p. 101; Nikolsky, 2015). Such a theory requires the synergic and coordinated effort of genetics, neuroscience, and research on cultural evolution. The recent drive toward a "grand synthesis" of the latter discipline (Brewer et al., 2017) makes this possibility somewhat more likely.
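A correlation between two sets of pairwise distances among the same populations cannot be tested with an ordinary correlation test, because the entries of a distance matrix are not independent. One common approach is a Mantel-style permutation test, sketched below. This is an illustration of the general technique under our own assumptions, not necessarily the analysis used by Brown et al. (2014); the function names and the toy distance matrix are hypothetical.

```python
import math
import random

def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mantel(dist_a, dist_b, permutations=999, rng=None):
    """Correlate two distance matrices over the same populations and estimate
    a one-sided p-value by shuffling the population labels of dist_a."""
    rng = rng or random.Random()
    n = len(dist_a)
    flat_b = upper_triangle(dist_b)
    observed = pearson(upper_triangle(dist_a), flat_b)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)                     # relabel populations at random
        permuted = [[dist_a[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(upper_triangle(permuted), flat_b) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

if __name__ == "__main__":
    positions = [0, 1, 3, 7]                   # toy data, not real populations
    toy = [[abs(a - b) for b in positions] for a in positions]
    print(mantel(toy, toy, permutations=99, rng=random.Random(0)))
```

Rows and columns are permuted together so that each shuffled matrix is still a valid distance matrix over the same set of populations.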

<sup>3</sup>The re-use of Wagner's musical ideas by other composers in Nazi Germany and the emergence and maintenance of stylistic clusters in contemporary pop music are clear examples of biased selection.

## CONCLUSIONS

In this paper, we have argued that some of the most fundamental (and still unresolved) issues in music evolution can be addressed using the methods of cognitive neuroscience. This approach so far suggests a novel hypothesis on the mechanisms behind forms of cultural variation in musical systems. This line of work can also shed light on the "problem of linkage" (Kirby, 1999). Up until recently, this problem has been framed at only two levels of explanation. At the behavioral level, individual behaviors (e.g., code changes) that serve coordination and communication are linked to population-level patterns. At the cognitive level, sensory or memory constraints in individuals are identified in order to account for properties (e.g., structural features) of cultural systems. We suggest that a third level, the neural level, should be taken into consideration when developing accounts of the origins and evolution of structure in cultural systems, as is the case for accounts of the organization and function of information processing systems (Marr, 1982; Baggio et al., 2012, 2014, 2016). Thus, we can address questions in the cultural domain such as: which sources produce cultural diversity (computational level); through which mechanisms it may arise (e.g., inter-individual variation; algorithmic level); and which physical substrates, if any, those mechanisms exploit (i.e., the human brain; implementational level). We believe that explanations at all three levels are necessary to understand human cultural transmission. This requires (1) analyzing the structural and dynamic properties of the cultural systems (or codes) themselves, (2) determining how those are shaped by perceptual and cognitive biases and constraints, and (3) identifying the biological roots of such biases and constraints using neural and genetic data. This proposal generates several new questions, such as: to what extent do neural processes drive cultural evolution? How does inter-individual variation in brain function and structure affect variation in cultural behaviors? How does the distribution of neural traits in a population affect the structure of the symbolic system itself? How do these traits interact with aesthetic processing biases and the environment at large in the cultural evolution of music? How specific and accurate can neuroprediction get in the context of cultural evolution? Here, we hope to have shown that these questions are worth asking, and are largely amenable to scientific inquiry.

## AUTHOR CONTRIBUTIONS

ML wrote the article. AR and GB made additional contributions and edited the manuscript. All authors approved the manuscript for publication.

## FUNDING

AR was supported by funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 665501 with the Research Foundation Flanders (FWO) (Pegasus<sup>2</sup> Marie Curie fellowship 12N5517N awarded to AR), and a visiting fellowship in Language Evolution from the Max Planck Society (awarded to AR).

## ACKNOWLEDGMENTS

We are grateful to Monica Tamariz, Bruno Gingras, and Aleksey Nikolsky for their helpful comments during the revision of the manuscript. Center for Music in the Brain is funded by the Danish National Research Foundation (DNRF117).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Lumaca, Ravignani and Baggio. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Music and the Meeting of Human Minds

#### Alan R. Harvey\*

School of Human Sciences, The University of Western Australia, Perron Institute for Neurological and Translational Science, Perth, WA, Australia

Over tens of thousands of years of human genetic and cultural evolution, many types and varieties of music and language have emerged; however, the fundamental components of each of these modes of communication seem to be common to all human cultures and social groups. In this brief review, rather than focusing on the development of different musical techniques and practices over time, the main issues addressed here concern: (i) when, and speculations as to why, modern Homo sapiens evolved musical behaviors, (ii) the evolutionary relationship between music and language, and (iii) why humans, perhaps unique among all living species, universally continue to possess two complementary but distinct communication streams. Did music exist before language, or vice versa, or was there a common precursor that in some way separated into two distinct yet still overlapping systems when cognitively modern H. sapiens evolved? A number of theories put forward to explain the origin and persistent universality of music are considered, but emphasis is given, supported by recent neuroimaging, physiological, and psychological findings, to the role that music can play in promoting trust, altruistic behavior, social bonding, and cooperation within groups of culturally compatible but not necessarily genetically related humans. It is argued that, early in our history, the unique socializing and harmonizing power of music acted as an essential counterweight to the new and evolving sense of self, to an emerging sense of individuality and mortality that was linked to the development of an advanced cognitive capacity and articulate language capability.

#### Edited by:

Aleksey Nikolsky Independent Researcher, Los Angeles, CA, United States

#### Reviewed by:

Guy Madison, Umeå University, Sweden Steven Robert Livingstone, University of Wisconsin–River Falls, United States

> \*Correspondence: Alan R. Harvey alan.harvey@uwa.edu.au

#### Specialty section:

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

Received: 15 January 2018 Accepted: 30 April 2018 Published: 16 May 2018

#### Citation:

Harvey AR (2018) Music and the Meeting of Human Minds. Front. Psychol. 9:762. doi: 10.3389/fpsyg.2018.00762

Keywords: music, evolution, altruism, social cooperation, language, motherese, dopamine, oxytocin

In the twenty-first century, there are many different types and varieties of music, sung in different ways and played on many different types of musical instruments, and there are also many different types of language with different alphabets and modes of writing. Despite this biological, cultural, and behavioral diversity, the fundamental elements of music and language communication seem to be universals, present in all human cultures and social groups. There is an ever-increasing literature that focuses on the neuroscience, evolution, psychology, and sociobiology of music (e.g., Levitin, 2006; Patel, 2008; Morley, 2013; Harvey, 2017). In addressing the issue of "why music?", a number of key questions arise, including: (i) when did our species evolve musical behaviors, (ii) what is the evolutionary relationship between music and language, and (iii) why do humans universally continue to possess two complementary but distinct communication streams, streams that are processed by some common but also some distinct pathways and circuits within the human brain?

Language is the main way we learn and pass on skills, ideas, plans, and acquired knowledge within, and across, generations. It structures thought and permits intuitive reasoning; it is expandable in content and allows the creation of virtual realities and the manipulation of symbols. But language, similar to music, also possesses emotional or prosodic content, involving changes in pitch, voice quality, loudness, and timing of delivery. Such changes convey information about the emotional state of the speaker and/or change the meaning of a word or phrase, and when vocalizing there are elements in common in the generation of phrases in music (singing) and speech (Patel, 2008). Interestingly then, studies using functional magnetic resonance imaging (fMRI) and other methods to monitor brain activity have revealed that elements of both language and music, including for example aspects of pitch and rhythm processing (Rogalsky et al., 2011; Perrachione et al., 2013), and perhaps even some elements of syntax (e.g., Sammler et al., 2013; Kunert et al., 2015, but see Bigand et al., 2014), are processed in closely related regions within the brain.

There are also, however, differences in brain circuitry. Spoken language, at least in right handers, is mostly left hemisphere biased, whereas music has a right hemisphere bias, although with training the networks involved with music processing generally show a leftward shift (Ellis et al., 2013; Thaut et al., 2014). Recent research has identified specific regions in the temporal lobe that appear to respond only to music or to speech (Angulo-Perkins et al., 2014; Norman-Haignere et al., 2015). In addition, music activates pathways and regions within the limbic system (Koelsch, 2014), a system associated with a number of functions including learning, memory, motivation, and emotional responsiveness. Music we appreciate and subjectively find arousing is associated with dopaminergic activity and activates reward centers such as the nucleus accumbens, located in the ventral striatum (Menon and Levitin, 2005; Salimpoor et al., 2011; Zatorre and Salimpoor, 2013; Mueller et al., 2015). The pathways that mediate interactions with other sensory modalities and links to motor systems controlling vocalization and body movement also show some left versus right bias for speech and music, respectively (Callan et al., 2006; Özdemir et al., 2006).

From an evolutionary perspective, the origins of modern language and articulate speech in Homo sapiens, and the extent to which language is encoded in the brain and/or influenced by cultural experience, remain controversial (e.g., Pinker, 1994; Hauser et al., 2002; Bickerton, 2007; Patel, 2008). Nonetheless, while obviously speculative, it seems likely that linguistic-related advances in neural hardware and software were superimposed on pre-existing neural capabilities, the latter including prosodic utterance as well as non-verbal behaviors such as hand movement (gesture), tool making, and facial expression (Mithen, 2005; Arbib et al., 2008; Aboitiz and Garcia, 2009; Fay et al., 2014). There was also likely a need for enhanced neural plasticity as well as improved attention mechanisms, greater capacity for abstract thought, a vastly improved speed of learning, and enhanced working memory and storage capability (Harvey, 2017). But if language was a key partner of our newly evolved cognitive power, and a primary driver of human cultural evolution, why did we also evolve, and why do we continue to use and enjoy, another communication stream, music? The presence of at least some shared linguistic and musical neural circuitry is not inconsistent with the proposal that, during the evolution of modern H. sapiens, language and music emerged from a common precursor possessed by our immediate ancestors, sometimes called a musilanguage (Brown, 2000) or protolanguage. But what does music do for us, as individuals and as a community, and where did the production and appreciation of music come from, and when? Can music have had meaningful evolutionary significance even though "it neither plows, sows, weaves nor feeds" (Cross, 2001)?

In this brief review, the focus is on speculation about the earliest origins of music in H. sapiens, rather than discussing the subsequent genetic/cultural evolution of different musical genres, styles, and social practices: "Distinctions between the surface complexity of different musical styles and techniques do not tell us anything useful about the expressive purposes and power of music, or about the intellectual organization involved with its creation" (Blacking, 1976). Fossil evidence suggests that our species, H. sapiens, first appeared about 200,000 years ago, perhaps even as far back as 300,000 years ago (Hublin et al., 2017). Pieces of fossilized bone provide only limited information about the evolution of the modern human brain (Sherwood et al., 2008). More meaningful archeological evidence, strongly suggestive of relatively recent cognitive and behavioral advances in our species, comes from the discovery and analysis of cultural artifacts and tools (Henshilwood et al., 2002; Bouzouggar et al., 2007). Scientists can also trace lineage relationships in modern H. sapiens, to determine how genes, and especially the subtle regulation of those genes, have changed over the millennia. Coalescent modeling allows researchers to work backward from DNA samples from members of different human population groups in order to develop gene trees that identify the most likely common or shared ancestral population from which all living humans are derived. Lineage tracing based on changes to the Y chromosome genotype permits a study of paternal ancestry, and changes in the DNA within mitochondria allow an analysis of maternal lineage relationships (Underhill and Kivisild, 2007).
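As a rough illustration of the coalescent idea, the Python sketch below simulates a standard textbook coalescent for a sample of gene copies, working backward in time until all lineages merge into a single common ancestor. It is a toy model under strong assumptions (a constant-size, randomly mating population, with time expressed in abstract coalescent units), not the method of any study cited here; the function names, sample size, and number of replicates are hypothetical choices made for the example.

```python
import random

def simulate_tmrca(n_samples, rng):
    """Time to the most recent common ancestor of n_samples lineages.

    Toy model: while k lineages remain, the waiting time to the next
    merger is exponential with rate k * (k - 1) / 2 (coalescent units).
    """
    t = 0.0
    k = n_samples
    while k > 1:
        rate = k * (k - 1) / 2.0
        t += rng.expovariate(rate)
        k -= 1  # two lineages merge into one ancestral lineage
    return t

def mean_tmrca(n_samples, n_reps=20000, seed=0):
    """Average the simulated time over many independent gene trees."""
    rng = random.Random(seed)
    total = sum(simulate_tmrca(n_samples, rng) for _ in range(n_reps))
    return total / n_reps

# In this toy model the expected time is 2 * (1 - 1 / n_samples).
estimate = mean_tmrca(10)
expected = 2.0 * (1.0 - 1.0 / 10)
```

Comparing `estimate` with `expected` gives a quick sanity check of the simulation. Real analyses additionally model mutation, changes in population size, and migration, which is one source of the margins of error associated with such studies.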

There are, of course, significant margins of error associated with such studies. Nonetheless the genetic data, when combined with evidence of cultural change, do suggest that the modern version of H. sapiens, the version with an advanced cognitive architecture and newly developed language capabilities, evolved within the last 100,000 years, perhaps even more recently. A study looking at the four major variants of African mitochondrial DNA has found that there was a rapid expansion of the group carrying one of these haplotypes (L3) between 61,000 and 86,000 years ago, and that all humans living outside Africa share this haplotype (Atkinson et al., 2009; Soares et al., 2012). Within Africa, L3-carrying populations originally in Eastern Africa also increased, migrating into central and southern parts of the continent. Overall, while there were earlier migrations of H. sapiens out of Africa (Reyes-Centeno et al., 2014), it seems likely that the last major dispersal out of Africa by our founder population occurred somewhere between 60,000 and 70,000 years ago. Since then, human evolution has been driven by complex and dynamic interactions between genes, environment, and culture, yet both within and beyond Africa language and music remain universals, found in all living human populations and cultures. For me, the most parsimonious explanation for the universality of both communication streams is that members of our founder population possessed both systems from the very beginning (Harvey, 2017). What is clear is that by at least 43,000 years ago our ancestors were already modifying bone to make flute-like instruments (Higham et al., 2012).

Numerous theories have been put forward in an attempt to explain the seeming universality of music, from its earliest evolutionary origins through to the present day, and the possible adaptive benefits that music brought, and continues to bring, to our species (Hauser and McDermott, 2003; Huron, 2003; Mithen, 2005; Harvey, 2017). These ideas are necessarily speculative, but scientific and psychological research over the past few decades does provide some useful clues and pointers. Music and its partner dance may play an important role in mate attraction and selection; music can help structure time, and it facilitates long-distance communication. It also enhances cognitive and motor skill development, has mnemonic power, encourages group cooperation and social interactions, helps to define cultural identity, and perhaps even facilitates conflict resolution. Of course, there may have been multiple, and intersecting, advantages for our founder population to possess musical communication as part of its make-up. Here, I wish to emphasize the early evolutionary importance of music in the context of cooperative interactions and prosocial behavior in H. sapiens.

One intriguing idea is that music-like capabilities may have evolved to enable mothers to interact with infants during the first few months after birth, encouraging parent-infant attachment before babies learn to talk and use language (DeCasper and Fifer, 1980; Trehub, 2000; Falk, 2004; Levitin, 2006; Trevarthen, 2008). This infant-directed communication, important in driving the behavioral, perceptual, cognitive, emotional, and social development of children, is generally known as motherese. Motherese has particular characteristics: it is rhythmic, usually delivered slowly and at higher pitch with exaggerated pitch changes that often slide from one to another, it is calming and reassuring, and it is associated with other key social and emotional interactions that involve gesture, eye contact, and facial expression. However, while motherese may sound musical, the scientific evidence currently available does not, in my view, clearly establish whether it is "protomusical" or "protolinguistic" in nature (Harvey, 2017). This prosodic maternal-infant communication system may in fact be more akin to the more "primitive" system used by our ancestors, prior to the more recent evolution of the two communication streams of articulate speech and music.

In adult life, speech usually involves turn taking: one person talks and another listens (Levinson, 2016), whereas music is capable of integrating and coordinating feelings and emotions within even a large group. Music, and its ally dance, require both energy and time commitment, so from an evolutionary perspective: "..if music had no value whatsoever, one might expect strong selection against musical behavior. Recent evidence from subjects with congenital amusia indicates that the necessary genetic variance is present in human populations, albeit at a very low frequency. So why have not quiet, better-rested non-musical humans out-reproduced and replaced their musical conspecifics?" (Fitch, 2006).

A possible reason for the continued universality of music, as argued in detail elsewhere (Harvey, 2017), is that the evolution of language and the modern human mind, and a new capacity for transgenerational communication and mental time travel (Suddendorf and Corballis, 2007), led to a more sharply defined sense of individuality. Our species became, as others have described it, a "society of selves" (Humphrey, 2008), with greater levels of social complexity and uncertainty, and increasingly aware of our own mortality and finite life span: "Only human beings guide their behavior by a knowledge of what happened before they were born and a preconception of what may happen after they are dead; thus only humans find their way by a light that illuminates more than the patch of ground they stand on" (Medawar, 1977). The relatively indeterminate nature of music allowed humans to interact, infer, and actively share experiences even though at the individual level each person may have been experiencing different emotions and thoughts: "Music allows participants to explore the prospective consequences of their actions and attitudes toward others within a temporal framework that promotes the alignment of participants' sense of goals" (Cross, 2009). In other words the rewarding, harmonizing power of music and music-related activities such as dance may have acted as important and essential counterweights to the individualization experienced by increasingly intelligent, articulate, and apprehensive members of our species. As Steven Brown wrote: ". . .the straightforward evolutionary implication is that human musical capacity evolved because groups of musical hominids out-survived groups of non-musical hominids due to a host of factors related to group-level cooperation and coordination" (Brown, 2000).

If music did indeed play a key evolutionary role in facilitating and rewarding social bonding and mutual cooperation in H. sapiens, is there any evidence of this relationship at the neurological level? Neuroimaging studies have revealed that positive, optimistic thoughts activate specific regions such as the anterior cingulate cortex, the amygdala and parts of the prefrontal cortex (Sharot et al., 2007), similar to some of the regions associated with positive emotional responses to music (e.g., Salimpoor et al., 2009; Zatorre and Salimpoor, 2013; Koelsch, 2014). Altruistic acts are a form of social behavior that result in some form of benefit for the receiver and usually, but not always, at least some negative consequences for the giver. In humans, reciprocal altruism, involving mutual cooperative advantage (Trivers, 1971; Fehr and Fischbacher, 2003), has a beneficial impact not only on the viability and welfare of individuals but can also lead to better cooperation with kin as well as promote pro-social attitudes within larger groups of culturally compatible but unrelated individuals (Boyd and Richerson, 2009; Kurzban et al., 2015). Significantly, when subjects are performing mutually cooperative tasks that are rewarding and require empathy and an appreciation of social context and another person's feelings, there is increased activity in areas such as the nucleus accumbens, amygdala, anterior cingulate cortex, superior temporal cortex, temporoparietal junction, and several specific regions in prefrontal cortex (Rilling et al., 2002, 2008; O'Doherty, 2004). Once again, these regions overlap many of the regions that are active when listening to familiar and emotionally rewarding music, music that can reinforce the sharing of affective states.

There is also neurochemical and physiological evidence to support the idea that music acts to promote empathy and cooperative behavior. Music alters the expression of a number of agents that modulate brain activity and that are associated with reward and a sense of well-being, including not only neurotransmitters such as dopamine but also several hormones (Chanda and Levitin, 2013; Zatorre and Salimpoor, 2013). Pain sensitivity, thought to be a surrogate for altered endogenous opioid levels, is lowered during participation in group music activities, including choral singing and dance (Dunbar et al., 2012; Tarr et al., 2015), and music can reduce levels of the stress hormone cortisol (Khalfa et al., 2003).

In humans the hormone oxytocin has been shown to promote the expression of positive emotions and prosocial altruistic attitudes (Hurlemann et al., 2010; Marsh et al., 2015; Leppanen et al., 2017). When applied as a spray via the nose, oxytocin reduces anxiety and fosters trust and emotional empathy (Kosfeld et al., 2005; Baumgartner et al., 2008), and genetic variants of the oxytocin receptor are linked to differences in the way humans empathize and engage in prosocial activities (Rodrigues et al., 2009; Aspé-Sánchez et al., 2016). In a study that also analyzed stress and cortisol levels in participants, it was recently reported that oxytocin levels measured in saliva are decreased during choral but not solo singing (Schladt et al., 2017). However, measurements of salivary oxytocin levels do not necessarily reflect levels within the specific regions of the brain that possess oxytocin receptors (Mitre et al., 2016; Lin et al., 2018), and there is a clearer and more consistent relationship between blood plasma levels of oxytocin and levels found in cerebrospinal fluid (Valstad et al., 2017). Interestingly then, circulating levels of the hormone oxytocin were found to be increased only under conditions in which singers were asked to improvise together (Keeler et al., 2015). Musical improvisation, presumably comparable to the type of activity experienced by early humans, requires that participants innovate, engage, and socially interact with other members of the group. Consistent with this, compared to other creative group activities, singing encourages more rapid interindividual bonding, apparently short-circuiting the usual way that social relationships are established: "The capacity of singing to bond groups of relative strangers in humans may have played a crucial role in allowing modern humans to create and maintain much larger social networks than their evolutionary relatives. . ." (Pearce et al., 2015).



For many thousands of years, the evolution and development of musical styles and practices, as well as individual musical abilities, have been driven by a complex interplay between genes and culture. Evidence for the vertical transmission of at least some aspects of musical behavior in human populations comes from studies showing that genetic variants in a number of neuromodulator receptors are linked to musical memory, aptitude, and creativity (Liu et al., 2016; Oikkonen et al., 2016; Mariath et al., 2017). Furthermore, lineage studies have revealed a correlation between variation in folk musical styles and genetic variance within populations (Pamjav et al., 2012; Brown et al., 2013; Le Bomin et al., 2016), all lending support to the proposal that musical traits are heritable.

In summary, it is argued that music and music-related behaviors, operating alongside language and articulate speech, were important in helping to promote emotional synergy and social bonding, and to foster group-level cooperation and coordination early in human evolution. As suggested elsewhere (Cross, 2003), music may have been especially suited for this purpose because it is usually risk free and of indeterminate meaning. In a group context, music allows humans to interact and share experiences during a particular musical activity even though each person may have very different outlooks and goals, and be involved in a range of real or imagined relationships. "Wherever humans live, and however they have organized their societies, they exhibit a behavioral peculiarity of gathering from time to time to sing and dance together in a group. By featuring both human song and entrainment (in the dancing movements and perhaps clapping performed in synchrony with the singing/music), such behavior qualifies as human music. Indeed, the fact that it occurs in every human culture, and indeed subculture, without exception, unless deliberately suppressed by severe sanctions against it, marks this phenomenon as the most universal human behavior of a musical kind on record" (Merker et al., 2015).

An appreciation of the evolutionary significance of music, and the impact that the universal of music has had in promoting social harmony and an individual's sense of well-being in our species, can only help when advocating for the importance of music in education and for therapeutic purposes in the clinic (Harvey, 2012). Music and associated synchronized behaviors remain core human attributes, promoters of prosocial, empathic behaviors, and cultural cohesion, and it is to be hoped that humans will continue to harness music's power, for the good of society.

## AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

#### REFERENCES

temporal lobes: differences between musicians and non-musicians. Cortex 59, 126–137. doi: 10.1016/j.cortex.2014.07.013

social and psychiatric traits. Front. Neurosci. 9:510. doi: 10.3389/fnins.2015.00510

Blacking, J. (1976). How Musical is Man? London: Faber and Faber.

of peak emotion in music. Nat. Neurosci. 14, 257–264. doi: 10.1038/nn.2726


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Harvey. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


# Rhesus Monkeys (Macaca mulatta) Sense Isochrony in Rhythm, but Not the Beat: Additional Support for the Gradual Audiomotor Evolution Hypothesis

#### Henkjan Honing<sup>1</sup>\*, Fleur L. Bouwer<sup>1</sup>, Luis Prado<sup>2</sup> and Hugo Merchant<sup>2</sup>

<sup>1</sup> Amsterdam Brain and Cognition, Institute for Advanced Study, Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, Netherlands, <sup>2</sup> Department of Cognitive Neuroscience, Instituto de Neurobiología, Universidad Nacional Autónoma de México, Santiago de Querétaro, Mexico

Charles Darwin suggested the perception of rhythm to be common to all animals. While experimental research has only recently begun to find some support for this claim, there are also aspects of rhythm cognition that appear to be species-specific, such as the capability to perceive a regular pulse (or beat) in a varying rhythm. In the current study, using EEG, we adapted an auditory oddball paradigm that allows for disentangling the contributions of beat perception and isochrony to the temporal predictability of the stimulus. We presented two rhesus monkeys (Macaca mulatta) with a rhythmic sequence in two versions: an isochronous version, which was acoustically accented such that it could induce a duple meter (like a march), and a jittered version, which used the same acoustically accented sequence but presented it in a randomly timed fashion, thereby disabling beat induction. The results reveal that monkeys are sensitive to the isochrony of the stimulus, but not its metrical structure. The mismatch negativity (MMN) was influenced by the isochrony of the stimulus, resulting in a larger MMN in the isochronous as opposed to the jittered condition. However, the MMN for both monkeys showed no interaction between metrical position and isochrony. So, while the monkey brain appears to be sensitive to the isochrony of the stimulus, we find no evidence in support of beat perception. We discuss these results in the context of the gradual audiomotor evolution (GAE) hypothesis (Merchant and Honing, 2014) that suggests beat-based timing to be omnipresent in humans but only weakly so or absent in non-human primates.

Keywords: music, rhythm, beat perception, ERP, MMN

#### Edited by:

Aleksey Nikolsky, Braavo! Enterprises, United States

#### Reviewed by:

Christopher I. Petkov, Newcastle University, United Kingdom Aniruddh Patel, Tufts University, United States

> \*Correspondence: Henkjan Honing honing@uva.nl

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

Received: 30 September 2017 Accepted: 22 June 2018 Published: 16 July 2018

#### Citation:

Honing H, Bouwer FL, Prado L and Merchant H (2018) Rhesus Monkeys (Macaca mulatta) Sense Isochrony in Rhythm, but Not the Beat: Additional Support for the Gradual Audiomotor Evolution Hypothesis. Front. Neurosci. 12:475. doi: 10.3389/fnins.2018.00475

## INTRODUCTION

The interest in rhythm cognition in non-human animals is motivated by the search for signs of musicality as a means to gain insight into the evolutionary and causal processes underlying human musicality (Trehub, 2003; Honing and Ploeger, 2012; Hoeschele et al., 2015; Honing et al., 2015; Trehub et al., 2015; Honing, 2018b)<sup>1</sup>. Most animals show at least some sort of rhythmic behavior, like walking, flying, crawling, or swimming. It is hence not unnatural to think that the perception (and enjoyment) of rhythm might be shared by most animals, as was argued by Darwin (1871) and Patel (2014). While experimental research has only recently begun to find some support for this claim (Wilson and Cook, 2016), there are also aspects of rhythm cognition that appear to be species-specific (Fitch, 2013), such as the capability to perceive a regular pulse in a varying rhythm (i.e., one level of a metrical structure) and consequently being able to synchronize to it (i.e., rhythmic entrainment), referred to as beat-based timing (Merchant and Honing, 2014).

<sup>1</sup>Musicality, in all its complexity, can be defined as a natural, spontaneously developing set of traits [designed for the perception and production of music] based on and constrained by our cognitive abilities and their underlying biology. As such, music, in all its variety, can be defined as a social and cultural construct based on that very musicality (Honing, 2018a).

Beat-based timing in humans is a complex neurocognitive phenomenon that depends on a dynamic interaction between auditory and motor systems in the brain (Grahn and Brett, 2009; Morillon et al., 2014; Patel and Iversen, 2014; Hoeschele et al., 2015; Merchant and Yarrow, 2016; Ross et al., 2017). It is hypothesized to be facilitated by bidirectional, and potentially causal links between the auditory and motor areas in the brain, including the motor cortico-basal-ganglia-thalamocortical (mCBGT) circuit, that appear to be more developed in humans as opposed to non-human primates and related species (Patel et al., 2009; Mendoza and Merchant, 2014; Patel and Iversen, 2014; Petkov and Jarvis, 2014; Merchant et al., 2015a; Wilson and Cook, 2016).

These observations lead to the gradual audiomotor evolution (GAE) hypothesis (Merchant and Honing, 2014), which suggests beat-based timing to be gradually developed in primates, peaking in humans but present only with limited properties in non-human primates, while humans share interval-based timing with all non-human primates and related species. Thus, the GAE hypothesis accommodates the fact that the performance of rhesus monkeys is comparable to that of humans in single interval tasks (such as categorization, interval reproduction, and interception), but differs in multiple interval tasks, such as synchronization, continuation, and rhythmic entrainment (Honing and Merchant, 2014; Merchant and Honing, 2014).

In the current paper, we will focus on beat and isochrony perception as two key components of musicality (Merchant et al., 2015b; Honing, 2018a), and provide further evidence for the GAE hypothesis by studying rhythm perception in two rhesus monkeys (Macaca mulatta). For this study we used an existing ERP paradigm that allows for testing and disentangling the contributions of beat perception and isochrony to the temporal predictability of the stimulus (Bouwer et al., 2016).

## EARLIER WORK

Most existing animal studies on beat-based timing and rhythmic entrainment (Wilson and Cook, 2016) have used behavioral methods to probe the presence of beat perception, such as tapping tasks (Zarco et al., 2009; Hasegawa et al., 2011; Hattori et al., 2015) or measuring head bobs (Patel et al., 2009; Schachner et al., 2009; Cook et al., 2013). However, if the production of synchronized movement to sound or music is not observed in certain species, this is not evidence for the absence of beat perception. It could well be that while certain species are not able to synchronize their movements to a regular beat, they are capable of beat perception (i.e., the ability to perceive a regular pulse in a temporally and/or acoustically varying rhythm; Honing, 2012). With behavioral methods that rely on overt motoric responses it is difficult to separate the contributions of perception and action. More direct, electrophysiological measures such as auditory event-related brain potentials (ERPs) make it possible to test for neural correlates of rhythm cognition, including beat perception (Honing et al., 2014).

While the vast majority of previous studies on animals have used implanted electrodes to record electroencephalograms (EEG) (Javitt et al., 1994; Laughlin et al., 1999; Pincze et al., 2001), non-invasive electrophysiological techniques such as scalp recorded evoked potentials (EP) and event-related potentials (ERP) are considered an attractive alternative. Next to being a mandatory requirement for studying some non-human primates such as chimpanzees (Fukushima et al., 2010; Hirata et al., 2013), these methods allow for a direct comparison between human and non-human primates. As such they have contributed to establishing animal models of the human brain and human brain disorders (Godlove et al., 2011; Gil-da-Costa et al., 2013), a better understanding of the neural mechanisms underlying the generation of human evoked EP/ERP components (Fishman and Steinschneider, 2012), as well as delineating cross-species commonalities and differences in brain functions, including rhythm cognition (Ueno et al., 2008, 2010; Fukushima et al., 2010; Reinhart et al., 2012; Hirata et al., 2013; Itoh et al., 2015). We will describe the most relevant ERP components for rhythm perception below.

## USING ERPS IN MEASURING BEAT PERCEPTION

The mismatch negativity (MMN) is an auditory event-related component that was shown to be sensitive to rhythmic violations in both humans and monkeys (see Honing et al., 2014 for a review). The MMN can be used as an index of a violation of temporal expectation using an oddball paradigm, by identifying a negative peak shortly after the deviant (the "oddball") that is maximal at fronto-central midline electrode sites and has sources in the auditory cortices and in the inferior frontal gyrus (i.e., not primarily reflecting motor cortex activity; Gil-da-Costa et al., 2013). The larger the violation of rhythmic expectations, the larger is the amplitude of the MMN (Näätänen et al., 2007; Winkler, 2007). The MMN has been shown to be indicative of beat perception in humans, with deviants on the beat within a repeating metrical auditory pattern eliciting a larger MMN than deviants off the beat (Ladinig et al., 2009, 2011; Winkler et al., 2009; Bouwer et al., 2014, 2016; Honing et al., 2014; Bouwer and Honing, 2015; Mathias et al., 2016).

The P3a, thought to reflect the redirection of attention to a deviant stimulus (Polich, 2007) and possibly an index of the conscious perception of a deviant (cf. Mathias et al., 2016; Peretz, 2016), often emerges just after the MMN, and has a latency of 200–250 ms in humans and between 100 and 250 ms in rhesus monkeys (Picton et al., 1974; Polich, 2007). Gil-da-Costa et al. (2013) provided functional evidence that the neural generators of both MMN and P3a may be homologous in humans and monkeys, despite the observed differences in latency (see **Table 1** for an overview).

In addition to the ERP responses that reflect the detection of a deviant stimulus, the P1 and N1 responses, two early auditory event-related components, have been shown to be sensitive to the timing of the stimulus presentation (Costa-Faidella et al., 2011; Schwartze et al., 2013; Pereira et al., 2014; Teichert, 2016). In general, both P1 and N1 have an inverse relationship of ERP amplitude with temporal predictability (Javitt et al., 2000; Costa-Faidella et al., 2011; Schwartze et al., 2013). As such they are not indicative of beat perception per se, but of the temporal predictability of the stimulus. For instance, Schwartze et al. (2013) showed the P1 and N1 to be smaller in an isochronous as opposed to a jittered rhythmic sequence. Generally, increasing the predictability of the auditory stimulation (both in stimulus timing and stimulus probability) leads to a pronounced N1 attenuation (Costa-Faidella et al., 2011). Both components are maximal over fronto-central electrodes and there is some consensus on their homolog in rhesus monkeys, most notably on the N1 (Teichert, 2016) (see **Table 1** for an overview).

## RHYTHM COGNITION IN MONKEYS

Recently, Honing et al. (2012) were able to show, for the first time, that an MMN-like response can be measured in rhesus monkeys (M. mulatta). [See (Ueno et al., 2008) using a similar method in a chimpanzee (Pan troglodytes), and (Gil-da-Costa et al., 2013) for a recent study comparing humans and macaques (Macaca fascicularis)].

In addition, Honing et al. (2012) showed a sensitivity of the MMN in response to pitch deviants and infrequent omissions, showing that it was possible, in principle, to use an identical paradigm for human and non-human primates to probe beat perception. However, and contrary to what was found in human adults and infants (Winkler et al., 2009; Bouwer et al., 2014), no difference was found in the MMN in response to omissions in beat and offbeat positions. This led to the conclusion that rhesus monkeys are unable to sense the beat (Honing et al., 2012). In addition, a strong response was found for onsets of rhythmic groups, suggesting a sensitivity to rhythmic structure (a similar result was reported in Selezneva et al., 2013, showing large responses to changes in a repeating temporal pattern, while measuring gaze and facial expressions in monkeys).

However, Bouwer et al. (2014) pointed out that the earlier paradigm (Winkler et al., 2009; Honing et al., 2012) needs additional controls to be certain that any effects (or the lack thereof) are due to beat perception, and not, for instance, a result of pattern matching, acoustic variability or sequential learning. While rhesus monkeys have apparently little or no ability to perceive a beat, they are able to detect the regularity of an isochronous visual or auditory metronome (Zarco et al., 2009; Merchant et al., 2015b; Gámez et al., 2018). This suggests a capacity for making temporal predictions which most likely depends on absolute interval perception (Merchant and Honing, 2014). As such monkeys might not have beat perception, but could still be able to sense the regularity in an isochronous stimulus.

To examine the perception of isochrony, several studies have compared the responses to temporally regular, isochronous sequences with the responses to temporally irregular, jittered sequences (Schwartze et al., 2011; Teki et al., 2011; Fujioka et al., 2012). The prediction of events in jittered sequences has been suggested to rely on absolute interval perception, while the prediction of events in isochronous sequences has been suggested to be based on beat perception (Schwartze et al., 2011; Fujioka et al., 2012). However, it is possible to predict events in isochronous sequences on the basis of absolute interval perception alone, which may explain why macaques, with little or no ability to perceive a beat (Honing et al., 2012; Merchant and Honing, 2014), respond more accurately to temporally regular than jittered sequences (Zarco et al., 2009), based on their isochrony, rather than on beat-based perception (Merchant et al., 2015a; Merchant and Bartolo, 2017).

Based on these and related neurobiological observations (e.g., Zarco et al., 2009; Merchant et al., 2011) the GAE hypothesis was proposed (Merchant and Honing, 2014), arguing that the integration of sensorimotor information throughout the mCBGT circuit and other brain areas during the perception or execution of single intervals is similar in human and non-human primates, but different in the processing of multiple intervals. While the mCBGT circuit was also shown to be involved in beat-based mechanisms in imaging studies (e.g., Teki et al., 2011), direct projections from the medial premotor cortex (MPC) to the primary auditory cortex (A1) via the inferior parietal lobe (IPL), which is involved in sensory and cognitive functions such as attention and spatial sense (see **Figure 1**), may be the underpinning of beat-based timing as found in humans, and possibly apes. The GAE hypothesis suggests that beat-based timing is more developed in humans than in apes and monkeys, and that it evolved through a gradual chain of anatomical and functional changes to the interval-based mechanism to generate an additional beat-based mechanism, instead of claiming a categorical jump from single-interval to multiple-interval abilities (i.e., rhythmic entrainment; Patel, 2006; Patel and Iversen, 2014). As such, the GAE hypothesis suggests that beat perception and entrainment have emerged gradually in the primate order. This observation is in line with Rauschecker and Scott's (2009) earlier suggestion that "the privileged access of the humans' auditory system to the sequential and temporal machinery of the mCBGT circuit emerged gradually in the course of evolution from precursors of the great ape lineage." Some recent behavioral studies support such a gradual interpretation (Hattori et al., 2015; Large and Gray, 2015), suggesting at least some beat-based timing capabilities in apes that are absent in rhesus monkeys.
Finally, the GAE hypothesis is in line with Patel and Iversen (2014), who argue for a causal link between auditory and motor planning regions needed for human beat perception. However, the GAE hypothesis differs from the latter proposal in that it (a) does not claim that the neural circuit engaged in beat-based timing is deeply linked to vocal learning, perception, and production, even if some explicit overlap between these neural circuits exists, and (b) proposes that beat-based timing gradually evolved in primates, instead of being solely present in humans as the only primate capable of vocal learning (Honing and Merchant, 2014).

Time range in ms; alternative naming in square brackets (Adapted from Picton et al., 1974; Javitt et al., 2000; Ueno et al., 2008; Honing et al., 2012; Gil-da-Costa et al., 2013; Itoh et al., 2015; Teichert, 2016).

However, in the current study we will not make claims about the underlying neural mechanisms, nor will we present a systematic comparative study (this is a topic of ongoing research). Instead, we will focus on whether rhesus monkeys are able to sense isochrony and/or the beat in a rhythmic stimulus.

## CURRENT STUDY

In the current study we adapted an auditory oddball paradigm that was previously used in humans (Bouwer et al., 2016) and that allows for testing and dissociating the contributions of beat perception and isochrony to the temporal predictability of the stimulus.

We presented two rhesus monkeys (M. mulatta) with a rhythmic sequence that was made up of a pattern of loud and soft percussive sounds such that the acoustic stimulus could induce a simple binary metrical structure (duple meter), with an accented beat on every other metrical position (see combinations of S1 and S2 in **Figure 2**). This rhythmic sequence was presented in two conditions: an isochronous condition, in which the sounds were presented with a constant inter-onset interval (IOI) of 225 ms, allowing a beat to be induced (i.e., one metrical level of a duple meter), and a jittered condition, in which the IOIs were randomly selected between 150 and 300 ms, thereby disabling the perception of a beat. Furthermore, we used intensity decrements to be able to compare the ERP response to deviants in both the isochronous and jittered conditions (since omissions, as used in Honing et al., 2012, would not be recognized as deviants in the jittered condition). By introducing unexpected intensity decrements (i.e., deviants) in both beat and offbeat positions, in both the isochronous and jittered conditions, we could probe the effect of metrical position (beat vs. offbeat), as well as the effect of isochrony (isochronous vs. jittered) on the amplitude of the MMN and the P3a. Additionally, we examined the effects of metrical position and isochrony on P1 and N1 responses to standard sounds.

This design allows for testing several hypotheses. First of all, we expected an MMN and P3a for all deviants in both the isochronous and the jittered conditions, irrespective of isochrony or metrical position. This serves as a check that the auditory system of monkeys is sensitive to unexpected amplitude decrements (deviants) in a rhythmic stream. For this we predicted an effect of Type (standard vs. deviant).

Second, we did not expect to find evidence for beat perception, in line with earlier findings (Honing et al., 2012; Merchant and Honing, 2014). As such, we predicted no interaction between Position (Beat vs. Offbeat) and Isochrony (Isochronous vs. Jittered). To show beat perception, the difference between the MMN responses to deviants on the beat and offbeat in the isochronous condition should be more pronounced than the difference between the MMN responses to deviants on beat and offbeat positions in the jittered condition, in which beat perception is disabled. Without such an interaction between metrical position and isochrony, the differences between the MMN responses to different metrical positions should be interpreted as a result of sequential learning instead of beat perception (see Bouwer et al., 2016).
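The logic of this prediction can be written as a difference of differences. The following is only a descriptive sketch: the function names and the amplitude values are invented for illustration and are not taken from the study, and the actual test in the paper is the ANOVA interaction term, not this contrast.

```python
def beat_effect(mmn_beat, mmn_offbeat):
    """Beat minus offbeat MMN amplitude (µV) within one condition.

    The MMN is a negativity, so a larger MMN on the beat gives a
    negative value here.
    """
    return mmn_beat - mmn_offbeat


def interaction_contrast(iso_beat, iso_offbeat, jit_beat, jit_offbeat):
    """Position x Isochrony contrast: beat effect in the isochronous
    condition minus the beat effect in the jittered condition."""
    return beat_effect(iso_beat, iso_offbeat) - beat_effect(jit_beat, jit_offbeat)


# Hypothetical amplitudes (µV), chosen only to show the arithmetic.
contrast = interaction_contrast(-3.0, -1.0, -2.0, -1.5)
print(contrast)  # -1.5: the beat effect is larger in the isochronous case
```

A contrast near zero corresponds to the predicted absence of an interaction; any remaining beat vs. offbeat difference would then be attributed to sequential learning.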

Third, we did expect the MMN and the P3a to be affected by the isochrony of the stimulus, with both having a higher amplitude in the isochronous condition (isochronous rhythm) as compared to the jittered condition (random rhythm). If monkeys are sensitive to the temporal regularity of the isochronous stimulus (Merchant et al., 2015b; Gámez et al., 2018), this should help in predicting the next event (i.e., increasing its temporal predictability) and enhance its processing (Schwartze et al., 2011; Bouwer et al., 2014). Hence, it can be expected that the amplitude of the MMN and P3a in response to deviants in the isochronous condition is larger than in the jittered condition. As such, we predicted an interaction between Isochrony (Isochronous vs. Jittered) and Type (Standard vs. Deviant). In addition, we expected an inverse relationship between the amplitude of the P1 and N1 and temporal predictability. If the amplitude of the N1 and P1 is attenuated by the isochrony of the stimulus, this can be used as additional evidence for isochrony perception (as was shown for humans in Schwartze et al., 2013).

## METHODS

## Ethics Statement

All animal care, housing, and experimental procedures were approved by the National University of Mexico Institutional Animal Care and Use Committee and conformed to the principles outlined in the Guide for Care and Use of Laboratory Animals (NIH, publication number 85–23, revised 1985). Both monkeys were monitored daily by the researchers and the animal care staff, and every second day by the veterinarian, to check their health and welfare. To improve their living conditions we routinely introduced environmental toys (often containing items of food that they liked) into the home cage (1.3 m<sup>3</sup>) to promote their exploratory behavior. The researcher who tested the animals spent half an hour interacting with the monkeys directly, for example by giving them new objects to manipulate. We think that this interaction with humans, in addition to the interaction that was part of the task performed, can help to reduce potential stress related to the experiment. Food and water were given ad libitum.

## Participants

Two rhesus monkeys participated in the ERP measurements. Monkey A is an 11-year-old male, Monkey B a 9-year-old male. Both monkeys have normal hearing. They were awake (i.e., not sedated) during the measurements, sitting in a quiet room [3 (l) × 2 (d) × 2.5 (h) m] with dimmed lighting and two loudspeakers in front of them. The ERP measurements were performed after a morning session of unrelated behavioral experiments. The animals were seated comfortably in a monkey chair where they could freely move their head, hands and feet. No head fixation was used and the EEG electrodes were attached to the monkey's scalp using tape. To ease the fixation of the electrodes, the monkey's hair on the scalp and reference ear was shaved.

## Stimuli

The stimuli were identical to those used in Bouwer et al. (2016). Rhythmic sequences were composed of two sounds that differed in timbre, intensity and duration to induce a simple binary metrical structure (duple meter) with acoustic accents. The sounds were made with QuickTime's (Apple, Inc.) drum timbres. The first sound consisted of a simultaneously sounding bass drum and hi-hat, and will be referred to as accented (or A for short). The second sound was a hi-hat, which was 16.6 dB softer than the accented sound and lasted 70 ms instead of 110 ms. This sound will be referred to as unaccented (or U for short). The deviant sound was created by attenuating the accented sound by 25 dB (using Praat software; www.praat.org), leaving timbre and duration intact. This sound will be referred to as attenuated (or T for short; see **Figure 2**).

The accented, unaccented and attenuated sounds (A, U, and T) were combined into a rhythmic stream in which 60% of the time an accented sound was followed by an unaccented sound (see S1 in **Figure 2A**), and 30% of the time an accented sound was followed by another accented sound (see S2 in **Figure 2A**), as such inducing a duple meter, with always an accented sound on the beat. In the remaining 10% of the time a deviant was inserted (the "oddball"). This was, randomly chosen, either an attenuated sound followed by an accented sound (5%; see D1 in **Figure 2A**) or an accented sound followed by an attenuated sound (5%; see D2 in **Figure 2A**). Examples of an isochronous and a jittered rhythmic stream are given in the Supplementary Material.

The resulting sequence was used in both the isochronous and jittered conditions (see **Figure 2B**). In the isochronous condition, all sounds were presented with a constant inter-onset interval of 225 ms. In this condition, the probabilistic pattern of alternating accented and unaccented sounds was expected to induce a beat (or duple meter) with an inter-beat interval of 450 ms, within the optimal range for beat perception in humans (London, 2002). Sounds in odd positions of the sequence (including deviant D1) can be considered on the beat, while all sounds in even positions (including deviant D2) are offbeat. In the jittered condition, the inter-onset intervals were randomly distributed between 150 and 300 ms with an average of 225 ms (uniform distribution), using the same sequence as in the isochronous condition, making it impossible to induce a regular beat (London, 2002; Honing, 2012). However, the inter-onset interval just before and after a deviant tone was kept constant at 225 ms. This was done to make the acoustic and temporal context in which a deviant occurs identical between both conditions.
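The jittering rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' stimulus code: the function names, the event labels as list items, and the fixed random seed are all hypothetical. It draws each inter-onset interval uniformly from 150 to 300 ms, except for the intervals immediately before and after a deviant (T), which stay at 225 ms.

```python
import random

BASE_IOI_MS = 225.0    # isochronous inter-onset interval from the text
JITTER_MIN_MS = 150.0  # lower bound of the jittered range
JITTER_MAX_MS = 300.0  # upper bound of the jittered range


def jittered_iois(events, rng):
    """Return one IOI (ms) per gap between consecutive events.

    `events` is a list of sound labels ('A', 'U', 'T'); gap i separates
    events[i] and events[i + 1]. Gaps touching a deviant ('T') are fixed
    at 225 ms, all other gaps are drawn uniformly from 150-300 ms.
    """
    iois = []
    for i in range(len(events) - 1):
        if events[i] == "T" or events[i + 1] == "T":
            iois.append(BASE_IOI_MS)
        else:
            iois.append(rng.uniform(JITTER_MIN_MS, JITTER_MAX_MS))
    return iois


def onsets_from_iois(iois):
    """Accumulate IOIs into onset times (ms), starting at 0."""
    onsets = [0.0]
    for ioi in iois:
        onsets.append(onsets[-1] + ioi)
    return onsets


rng = random.Random(0)  # fixed seed only so the sketch is reproducible
events = ["A", "U", "A", "A", "T", "A", "A", "U"]
iois = jittered_iois(events, rng)
onsets = onsets_from_iois(iois)
```

Because the uniform range is symmetric around 225 ms, the expected duration of a jittered sequence equals that of its isochronous counterpart.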

Four additional constraints were applied to the construction of the sound sequences. To optimize the possibility of inducing a beat in the isochronous condition, S2 (containing two consecutive accented sounds) was never presented more than once in a row, and only a maximum of four consecutive S1 patterns was allowed. Furthermore, a deviant on the beat (D1) was always preceded by an accented sound offbeat (S2), ensuring the acoustical context to be identical for all deviants. Finally, at least five standard patterns occurred between two deviant patterns. For schematic examples of both the isochronous and jittered sequences, see **Figure 2B**. Note that D1 and D2 are referred to as on the beat or offbeat in both conditions for comparison, while they can only be perceived as such in the isochronous condition.

The statistical properties of the sequences used are visualized in **Figure 3** as a transition network, with the three basic sounds A, U, and T as nodes. Note that this is a simplification in the sense that it does not include the four constraints mentioned above.
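One way to picture the sequence construction is as a weighted draw over the four patterns, restricted at each step to the patterns the four constraints allow. The sketch below is an assumption about how such a generator could look, not the procedure used for the original stimuli; the function names are hypothetical, and because disallowed patterns are excluded before drawing, the realized proportions will deviate somewhat from the nominal 60/30/5/5%.

```python
import random

# Patterns as pairs of sounds: A = accented, U = unaccented, T = attenuated.
PATTERNS = {"S1": ("A", "U"), "S2": ("A", "A"), "D1": ("T", "A"), "D2": ("A", "T")}
WEIGHTS = {"S1": 0.60, "S2": 0.30, "D1": 0.05, "D2": 0.05}  # nominal proportions


def allowed(name, prev, s1_run, standards_since_deviant):
    """Check the four constraints described in the text for one candidate."""
    if name == "S2":
        return prev != "S2"            # S2 never twice in a row
    if name == "S1":
        return s1_run < 4              # at most four consecutive S1
    enough_gap = standards_since_deviant >= 5
    if name == "D1":
        return enough_gap and prev == "S2"  # D1 always preceded by S2
    return enough_gap                  # D2


def generate_patterns(n, rng):
    """Draw n pattern names; the gap counter starts at 0 (an assumption),
    so the first deviant also follows at least five standards."""
    seq, prev, s1_run, gap = [], None, 0, 0
    for _ in range(n):
        names = [p for p in PATTERNS if allowed(p, prev, s1_run, gap)]
        name = rng.choices(names, weights=[WEIGHTS[p] for p in names])[0]
        seq.append(name)
        s1_run = s1_run + 1 if name == "S1" else 0
        gap = 0 if name.startswith("D") else gap + 1
        prev = name
    return seq


def to_events(seq):
    """Flatten pattern names into the stream of individual sounds."""
    return [sound for name in seq for sound in PATTERNS[name]]


patterns = generate_patterns(1300, random.Random(1))
events = to_events(patterns)  # 2,600 sound events, as in one sequence
```

At every step at least one standard pattern is allowed (S1 after an S2 or a deviant, S2 after a run of four S1), so the draw never runs out of candidates.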

## Procedure

Stimuli were presented in blocks consisting of 3 isochronous and 3 jittered sequences (1,300 patterns, i.e., 2,600 sound events, per sequence), in randomized order. One block (of 6 sequences) lasted 9 min and 45 s, separated by a silent interval of about 15 s. Isochronous and jittered blocks were presented in semi-random order, with a maximum of two blocks from the same condition following each other. This resulted in a total session length of 60 min, one session per day.
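The timing figures can be checked with a few lines of arithmetic. The sketch below assumes six sequences per session with five pauses of 15 s between them (the number of pauses is an assumption); under these assumptions the reported 9 min 45 s matches the duration of one 2,600-event sequence at 225 ms, and six such sequences come to roughly 60 min.

```python
IOI_MS = 225                # inter-onset interval (mean IOI when jittered)
EVENTS_PER_SEQUENCE = 2600  # 1,300 patterns of two sounds each
SEQUENCES_PER_SESSION = 6   # 3 isochronous + 3 jittered
PAUSE_S = 15                # approximate silent interval between sequences

# Duration of one sequence, in integer milliseconds to avoid float rounding.
sequence_ms = EVENTS_PER_SEQUENCE * IOI_MS       # 585000 ms
sequence_s = sequence_ms // 1000                 # 585 s
minutes, seconds = divmod(sequence_s, 60)        # 9 min 45 s

# Six sequences plus the pauses between them (five pauses assumed).
session_s = SEQUENCES_PER_SESSION * sequence_s + (SEQUENCES_PER_SESSION - 1) * PAUSE_S
session_min = session_s / 60                     # 59.75 min

print(minutes, seconds, session_min)
```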

Sound stimuli were presented through 2 loudspeakers placed 1.1 meters away from the subject (and 1 meter apart from each other). The sound intensity measured at the subject position was 80 dB SPL. The monkeys participated in one recording session per day, to a total of 14 sessions for Monkey A and 12 sessions for Monkey B. All measurements were completed in about 5 weeks per monkey.

Overall, the design of the study was identical to the unattended condition presented in Bouwer et al. (2016), except that in the human study participants watched a silent video with subtitles, whereas in the current study monkeys were not given any visual stimuli to focus on.

## EEG Recording and Analysis

The EEG was recorded from electrodes (Grass EEG electrodes; #FS-E5GH-60) attached to five scalp positions (Fz, Cz, Pz, F3, F4) according to the 10–20 system (see **Figure 4**).

The electrodes were connected to a Tucker-Davis Technologies (TDT) headstage (#RA16LI) for low impedance electrodes. This headstage was connected to a TDT RA16PA preamplifier, which in turn was connected to a TDT RZ2 processor. RZ2 was programmed to acquire the EEG signals with a sampling rate of 498.25 Hz and the bandpass filters were set at 0.01–100 Hz.

All electrodes were attached using Ten20 Conductive EEG Paste and medical tape, and were referenced to the right ear (fleshy part of the pinna). In the offline analysis, a 0.1–30 Hz band-pass FIR filter (Kaiser-window) was applied.

#### MMN and P3a

For the analyses of MMN and P3a, epochs of −150 to 300 ms were extracted for the four deviant patterns (D1<sub>i</sub>, D1<sub>j</sub>, D2<sub>i</sub>, D2<sub>j</sub>; with subscripts referring to isochronous and jittered conditions). Epochs of the same length were extracted for accented sounds from the standards in the isochronous condition, both on the beat (from S1, but only if preceded by S2) and offbeat (from S2). Thus, the acoustic context preceding all sounds that were used in the analysis of the MMN and P3a was identical (i.e., they were preceded by an accented sound at −225 ms). Epochs were baseline corrected using the average voltage of the 150 ms prior to the onset of the tone and averaged to obtain ERPs for each condition and monkey. We obtained difference waves by subtracting the ERP responses to the accented sounds from the standard patterns from the ERP responses to the deviant tones at the same position (beat or offbeat). Epochs that exceeded ±300 µV amplitude were excluded from the statistical analysis. The numbers of epochs accepted for analysis are given in **Tables 2**, **3**.
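The epoching steps described above can be sketched in plain Python. This is an illustration only, not the analysis pipeline used in the study: the function names are hypothetical, the signal is assumed to be a single-channel list of voltages in µV sampled at 498.25 Hz, and applying the ±300 µV criterion after baseline correction is an assumption, since the text does not state the order.

```python
FS_HZ = 498.25  # sampling rate reported for the recordings


def ms_to_samples(ms):
    """Convert a duration in ms to the nearest whole number of samples."""
    return int(round(ms * FS_HZ / 1000.0))


def extract_epoch(signal, onset_idx, pre_ms=150, post_ms=300):
    """Cut an epoch from -pre_ms to +post_ms around a sound onset and
    subtract the mean of the pre-onset interval (baseline correction)."""
    pre, post = ms_to_samples(pre_ms), ms_to_samples(post_ms)
    start, stop = onset_idx - pre, onset_idx + post
    if start < 0 or stop > len(signal):
        return None  # epoch would run past the edges of the recording
    epoch = signal[start:stop]
    baseline = sum(epoch[:pre]) / pre
    return [v - baseline for v in epoch]


def accept(epoch, limit_uv=300.0):
    """Keep an epoch only if no sample exceeds +/-300 µV."""
    return all(abs(v) <= limit_uv for v in epoch)


def average(epochs):
    """Average equally long epochs sample by sample to obtain an ERP."""
    n = len(epochs)
    return [sum(column) / n for column in zip(*epochs)]


def difference_wave(deviant_epochs, standard_epochs):
    """ERP to deviants minus ERP to standards at the same position."""
    deviant_erp = average([e for e in deviant_epochs if accept(e)])
    standard_erp = average([e for e in standard_epochs if accept(e)])
    return [d - s for d, s in zip(deviant_erp, standard_erp)]
```

In use, `extract_epoch` would be called once per deviant or standard onset, and `difference_wave` once per position (beat or offbeat) and condition.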

We defined the amplitude of the MMN as the average amplitude from a 30 ms window (relatively small to avoid overlap with the N1) centered around the average peak latency across conditions on Cz (the electrode that was shown to be maximally indicative of MMN in rhesus monkeys; Honing et al., 2012; Gil-da-Costa et al., 2013). The MMN peaked at Cz on average at 72 ms for Monkey A and at 110 ms for Monkey B. See the caption of **Table 2** for the time windows used in the statistical analyses.

We defined the amplitude of the P3a as the average amplitude from a 50 ms window centered around the average peak latency across conditions on Cz (electrode that was shown to be maximally indicative of P3a in rhesus monkeys; Gil-da-Costa et al., 2013). The P3a peaked at Cz on average at 151 ms for Monkey A and at 203 ms for Monkey B. See the caption of **Table 3** for the time windows used in the statistical analyses.
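The window definitions follow directly from the peak latencies. As a small sketch (function names are hypothetical), centering a 30 ms window on the MMN peaks and a 50 ms window on the P3a peaks gives the time windows listed in the captions of Tables 2 and 3, and the component amplitude is the mean of the samples that fall inside the window.

```python
def window_bounds(peak_ms, width_ms):
    """Start and end (ms) of a window of width_ms centered on peak_ms."""
    half = width_ms / 2.0
    return peak_ms - half, peak_ms + half


def mean_amplitude(wave, times_ms, peak_ms, width_ms):
    """Mean of the samples of `wave` whose time stamps fall in the window."""
    lo, hi = window_bounds(peak_ms, width_ms)
    values = [v for v, t in zip(wave, times_ms) if lo <= t <= hi]
    return sum(values) / len(values)


# Peak latencies at Cz reported in the text.
print(window_bounds(72, 30))   # MMN, Monkey A: (57.0, 87.0)
print(window_bounds(110, 30))  # MMN, Monkey B: (95.0, 125.0)
print(window_bounds(151, 50))  # P3a, Monkey A: (126.0, 176.0)
print(window_bounds(203, 50))  # P3a, Monkey B: (178.0, 228.0)
```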

(Adapted from Honing et al., 2012).

#### TABLE 2 | MMN.

#### Monkey A


Mean amplitudes of standard and deviant waves at Cz in the isochronous and jittered conditions for Monkey A and Monkey B. Mean amplitudes (µV) are indicated with SE values in parentheses. S, values for standard stimuli; D, values for deviant stimuli; n, number of epochs. The time window used in the statistical analyses is 57–87 ms for Monkey A and 95–125 ms for Monkey B.

#### P1 and N1

For the analysis of P1 and N1, epochs of −100 to 300 ms were extracted for accented standards on the beat and offbeat, including only those standards that were preceded by an accented sound. A filter of 5–75 Hz was applied to eliminate slow drift (as is commonly used in studying these components in humans; see Schwartze et al., 2013; Bouwer and Honing, 2015). Hence, baseline correction was not needed. Epochs that exceeded ±300 µV amplitude were excluded from the statistical analysis.

#### TABLE 3 | P3a.


Monkey B


Mean amplitudes of P3a at Cz in the isochronous and jittered conditions. Mean amplitudes (µV) are indicated with SE values in parentheses. Number of epochs: n between brackets. The time window used in the statistical analyses is 126–176 ms for Monkey A and 178–228 ms for Monkey B.

The peak amplitude of both the P1 and N1 was defined on the average of the waveforms collapsed over conditions, using a 20 ms window centered around the average peaks at Cz (the electrode that was also shown in other studies to be maximally indicative of P1 and N1 in monkeys; see Itoh et al., 2015). A 20 ms window was chosen around the average P1 and N1 peaks to avoid overlap between the P1 and N1 windows. **Table 4** shows the average amplitudes for all four conditions, the number of epochs accepted for analysis, and the time windows used for statistical analyses.

#### Statistical Analysis

The amplitudes extracted from the difference waves were entered into ANOVAs with factors Position (Beat vs. Offbeat), Isochrony (Isochronous vs. Jittered), and Type (Standard vs. Deviant) for the MMN and P3a analyses, and the factors Position (Beat vs. Offbeat) and Isochrony (Isochronous vs. Jittered) for the P1 and N1 analyses. Partial eta squared (η<sup>2</sup><sub>p</sub>) was used as a measure of effect size. All statistical analyses were conducted in SPSS (Version 22).
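Partial eta squared can be recovered from a reported F statistic and its degrees of freedom through the standard identity η<sup>2</sup><sub>p</sub> = (F · df<sub>effect</sub>) / (F · df<sub>effect</sub> + df<sub>error</sub>). The short sketch below (the function name is hypothetical, and this is a reader-side check rather than the SPSS procedure) applies it to two of the F values reported in the Results, which shows why highly significant effects can still have very small effect sizes when the number of epochs is large.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom.

    Follows from eta_p^2 = SS_effect / (SS_effect + SS_error) and
    F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of Type on the MMN, as reported for each monkey.
eta_a = partial_eta_squared(83.790, 1, 61345)  # roughly 0.0014
eta_b = partial_eta_squared(38.906, 1, 48191)  # roughly 0.0008
print(round(eta_a, 4), round(eta_b, 4))
```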

#### RESULTS

#### MMN and P3a

**Figures 5**, **6** show that, for both monkeys, the ERPs elicited by the standard (dotted lines) and the deviant (solid lines) are different, with peaks in the interval 60–110 ms for the MMN and 100–250 ms for the P3a, consistent with earlier studies (see **Table 1**).

For both monkeys the ANOVA with factors Type (Standard vs. Deviant), Isochrony (Isochronous vs. Jittered) and Position (Beat vs. Offbeat) revealed a significant main effect of Type (Monkey A: F(1, 61345) = 83.790, p < 0.0005, η<sup>2</sup><sub>p</sub> = 0.001; Monkey B: F(1, 48191) = 38.906, p < 0.0005, η<sup>2</sup><sub>p</sub> < 0.001), showing that the evoked brain response to deviants was significantly more negative than the response to standards in all conditions. In addition, there was an interaction of Isochrony and Type (Monkey A: F(1, 61345) = 3.896, p < 0.048, η<sup>2</sup><sub>p</sub> = 0.0005; Monkey B: F(1, 48191) = 7.819, p < 0.005, η<sup>2</sup><sub>p</sub> < 0.0005), showing an effect of isochrony on the size of the MMN, which was larger in the isochronous condition. This suggests a sensitivity to the isochrony of the stimulus. However, no effect of Position and no interaction between Position and Isochrony were found. Hence, there is no support for beat perception. See **Table 2** for all MMN measurements and **Figure 8** for a summary.

#### TABLE 4 | P1 and N1.

Mean amplitudes of P1 and N1 at Cz in the isochronous and jittered conditions. Mean amplitudes (µV) are indicated with SE values in parentheses. Number of epochs: n between brackets. The time window used in the statistical analyses for Monkey A is 20–40 ms for the P1 and 40–60 ms for the N1. For Monkey B the time window used is 18–38 ms for the P1 and 40–60 ms for the N1.

With regard to the P3a there is a significant main effect of Type (Monkey A: F(1, 61345) = 292.069, p < 0.0005, η<sup>2</sup><sub>p</sub> = 0.005; Monkey B: F(1, 48191) = 76.793, p < 0.0005, η<sup>2</sup><sub>p</sub> < 0.002) and an interaction of Isochrony and Type (Monkey A: F(1, 61345) = 17.032, p < 0.0005, η<sup>2</sup><sub>p</sub> = 0.0005; Monkey B: n.s.). For Monkey A there was also a significant interaction between Position and Type (F(1, 61345) = 3.884, p < 0.049, η<sup>2</sup><sub>p</sub> < 0.0005). Note that this is in a direction opposite to what was found in humans (Bouwer et al., 2016), which makes this result difficult to interpret. See **Table 3** for all P3a measurements and **Figure 8** for a summary.

#### P1 and N1

**Figure 7** shows the P1 and N1 responses to the standard in the isochronous condition (solid lines) and jittered condition (dotted lines) for both monkeys. **Table 4** shows the average amplitudes for all four conditions and the time windows used for statistical analyses.

For both monkeys the ANOVA with factors Isochrony (Isochronous vs. Jittered) and Position (Beat vs. Offbeat) revealed a significant main effect of Isochrony for both the P1 (Monkey A: F(1, 50572) = 31.306, p < 0.0005, η<sup>2</sup><sub>p</sub> = 0.001; Monkey B: F(1, 41014) = 27.912, p < 0.0005, η<sup>2</sup><sub>p</sub> = 0.001) and the N1 (Monkey A: F(1, 50572) = 30.822, p < 0.0005, η<sup>2</sup><sub>p</sub> = 0.001; Monkey B: n.s.), showing that the P1 is significantly different in the jittered as compared to the isochronous condition for both monkeys, and the N1 only for Monkey A. However, in the case of the P1 the direction of the effect differs between the two monkeys. In Monkey A both the P1 and the N1 are larger for the jittered as compared to the isochronous sounds. This is in line with the idea that unpredictable stimuli receive more processing, hence resulting in larger amplitude responses (Schwartze et al., 2011, 2013). However, in Monkey B the effect of Isochrony on the P1 response is in the opposite direction, with larger responses to the sounds in the isochronous as opposed to the jittered sequences. As such, this weakens the interpretation that the P1 is indicative of predictability. Additionally, we found an interaction between Isochrony and Position in Monkey A (F(1, 50572) = 4.585, p < 0.032, η<sup>2</sup><sub>p</sub> < 0.0005). However, the latter is in the opposite direction as compared to humans (Baldeweg, 2007; Costa-Faidella et al., 2011) and hence hard to interpret. All other effects were not significant.

indicate the onset of the next sound event (at 225 ms). See Table 3 for details on time ranges used.

## DISCUSSION

In the current study we examined the role of beat perception and isochrony perception in two rhesus monkeys using the same stimuli in an oddball paradigm that was previously used in humans (Bouwer et al., 2016). Similar to humans, we found MMNs to all deviants as opposed to standards in both the isochronous and jittered conditions. From this we conclude that the monkey brain is sensitive to unexpected amplitude decrements (deviants). This is in line with earlier studies that used either omissions (Honing et al., 2012) or intensity deviants (Gil-da-Costa et al., 2013).

However, and contrary to what was found in humans (Bouwer et al., 2016), there are no significant differences in the MMN between deviants in beat positions and deviants in offbeat positions, in either the isochronous or the jittered condition (i.e., no effect of metrical position, nor an interaction with regularity; see **Figure 8**). So while in humans the beat appears to modulate the amplitude of the MMN, in monkeys there was no such effect. This suggests the absence of beat perception in monkeys, in line with an earlier study (Honing et al., 2012).
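The logic of this comparison can be sketched in a few lines: the MMN is read from a deviant-minus-standard difference wave, averaged over a time window, and a beat effect would appear as a difference between the MMN at beat and offbeat positions. The sketch below uses made-up toy values; the sample times, amplitudes, and the 100 to 150 ms window are hypothetical placeholders, not the data or analysis windows of this study.

```python
from statistics import mean


def difference_wave(deviant, standard):
    """Sample-by-sample deviant-minus-standard difference wave."""
    return [d - s for d, s in zip(deviant, standard)]


def mean_amplitude(erp, times_ms, window):
    """Mean of the samples whose time stamp lies inside the (inclusive) window."""
    start, end = window
    return mean(v for v, t in zip(erp, times_ms) if start <= t <= end)


# Hypothetical toy ERPs (arbitrary units), sampled every 50 ms.
times_ms = [0, 50, 100, 150, 200]
standard = [0.0, 0.5, 1.0, 0.5, 0.0]
deviant_beat = [0.0, 0.5, -1.0, -0.5, 0.0]
deviant_offbeat = [0.0, 0.5, -1.0, -0.5, 0.0]
window = (100, 150)  # assumed MMN window, for illustration only

mmn_beat = mean_amplitude(difference_wave(deviant_beat, standard), times_ms, window)
mmn_offbeat = mean_amplitude(difference_wave(deviant_offbeat, standard), times_ms, window)

# A beat effect would show up as a non-zero difference between the two MMNs.
beat_effect = mmn_beat - mmn_offbeat
print(mmn_beat, mmn_offbeat, beat_effect)
```

In this toy case both deviants yield the same MMN, so the beat effect is zero, mirroring the pattern described for the monkeys: a clear MMN to deviants but no modulation by metrical position.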

With regard to the P3a, Monkey A did show an interaction between Type and Isochrony, but this was small and in the opposite direction to that found in humans (see **Figure 8**). This makes the result difficult to interpret. The effect might be a result of overlapping components (e.g., an interference with the strong response to the next onset; marked with dashed lines in **Figures 5**, **6**), thereby affecting the overall amplitude of the ERP signal. It could also be a result of attentional fluctuations that modulate the P3a, as was shown in human studies (cf. Bouwer et al., 2016). Future work should manipulate attention to establish whether and how attention influences rhythm perception.

#### TABLE 5 | Summary of the results.


The effect of isochrony on MMN and P3a (enhancement), and temporal predictability on P1 and N1 (attenuation). \*, significant; n.s., non-significant; –, significant in opposite direction.

By contrast, both monkeys appear to be sensitive to the isochrony of the stimulus, as is supported by an overall larger MMN response to deviants in the isochronous as opposed to the jittered condition (see **Figure 8**). Note, however, that we cannot make the claim that monkeys detect isochrony (hence the use of the term "sense"). It could well be that the monkey brain is able to sense isochrony, but that the monkey is not aware of this, as was recently shown in two beat-deaf humans (Mathias et al., 2016). The latter study suggested the P3a to be associated with conscious perception or "awareness," comparable to what is observed in tone-deaf humans (Peretz, 2016).

With regard to the N1 and P1 the results are mixed and not consistent between Monkey A and Monkey B. While the attenuation of the N1 of Monkey A supports isochrony perception, the P1 effect is in opposite directions for the two monkeys. This difference could again be caused by attention, which was not controlled for in this study. As such we have to be cautious in our interpretation of the P1 and N1 in the standards, and will base our conclusions mainly on the MMN in response to the deviants (see summary in **Table 5** and main results in **Figure 8**).

In short, the ERPs of both monkeys appear to be influenced by isochrony, but not by the induced beat. Where humans show a clear interaction between metric position and isochrony (Bouwer et al., 2016), and as such show evidence for beat perception, the current results provide no evidence for beat perception in rhesus monkeys. Hence, the hypothesis that beat-based timing is common to all animals (Darwin, 1871; Wilson and Cook, 2016) is not supported by these results.

While the underlying neural mechanism (be it interval-based timing or otherwise; Teki et al., 2011, 2016) is as yet unclear, we take this result as further support for the GAE hypothesis (Merchant and Honing, 2014). The GAE hypothesis suggests beat-based timing to be a result of bidirectional bottom-up and top-down interactions between the auditory and motor areas of the brain, including the mCBGT circuit and parietal areas such as the IPL (see **Figure 1**), connections that are quite developed and efficient in humans and that emerged gradually in the course of evolution from precursors of the great ape lineage (Rauschecker and Scott, 2009; Honing and Merchant, 2014; Merchant and Honing, 2014; Morillon et al., 2014; Merchant and Yarrow, 2016).

Parts of the audiomotor circuit, including areas such as the putamen in the basal ganglia or the parietal cortex, also process information from the dorsal stream of visual processing (Kimura, 1992; Merchant et al., 2004). Therefore, it could well be that the areas involved in the strong visuomotor coupling in monkeys partially overlap with the beat-entrainment audiomotor system, in line with the predictive and entrainment abilities of monkeys with visual stimuli (Takeya et al., 2017; Gámez et al., 2018). Thus, where humans show a preference for auditory metronomes, monkeys have a clear preference for visual stimuli (Gámez et al., 2018). Applying the current paradigm to the visual modality (e.g., with different intensity flashes rather than different intensity tones) might be able to show beat-entrainment for visual stimuli.

Gámez et al. (2018) also provide evidence that monkeys can predictively entrain to an isochronous metronome, even when it accelerates or decelerates. These findings suggest that the beat-based mechanisms of macaques might not be as restricted as previously thought (Patel, 2014). Thus, it is crucial that future experiments focus on finding the limits in beat perception and entrainment capabilities of monkeys, using gradually more complex levels of metrical periodicity in their stimuli (Schwartze and Kotz, 2015; Bouwer et al., 2016).

Finally, and as predicted by the GAE hypothesis, we expect beat-based timing in the auditory modality to be present in some rudimentary form in apes (not monkeys). Two recent behavioral studies support this interpretation (Hattori et al., 2015; Large and Gray, 2015) for a chimpanzee (Pan troglodytes) and a bonobo (Pan paniscus). Applying the same oddball paradigm to a chimpanzee (cf. Ueno et al., 2008) might find further support for the GAE hypothesis (Merchant and Honing, 2014). In addition, the paradigm could be extended to other species, for example, pigeons, rodents, cats and dogs (e.g., Nelken and Ulanovsky, 2007; Howell et al., 2012; Schall et al., 2015; Harms et al., 2016), where the MMN can be measured to probe the potentially shared capabilities of isochrony perception and/or beat perception.

While musicality is likely made up of many components, it appears to be a good strategy to start with a focus on core aspects, like beat perception (cf. Honing, 2018b). The core aspects of musicality are well suited for comparative studies, both cross-cultural and cross-species, and the nature and extent of their presence in non-human animals have attracted considerable debate in the recent literature. These recent discussions, combined with the availability of suitable experimental techniques for tracking these phenomena in human and non-human animals, make this a timely and feasible enterprise. Of course, we need to remain cautious about making claims on music-specific modes of processing until more general accounts have been ruled out. It still has to be demonstrated that the constituent components of musicality, when identified, are indeed domain specific. In contrast, the argument that music is a human invention (Patel, 2018) depends on the demonstration that the components of musicality are not domain specific, but each cognitively linked to some nonmusical mental ability. So while there may be considerable evidence that components of musicality overlap with nonmusical cognitive features (Patel, 2018), this is in itself no evidence against musicality as an evolved biological trait or set of traits. As in language, musicality could have evolved from existing elements that are brought together in unique ways, and that system may still have emerged as a biological product through evolutionary processes, such as natural or sexual selection. As such there is no need for musicality to be modular or show a modular structure. Alternatively, based on the converging evidence for music-specific responses along specific neural pathways, it could be that brain networks that support musicality are partly recycled for language, thus predicting more overlap than segregation of cognitive functions. In fact, this is one possible route to test the Darwin-inspired conjecture that musicality precedes music and language (Honing, 2018a).

## AUTHOR CONTRIBUTIONS

HH and FB conceived and designed the experiments. HM and LP performed the experiments. FB and HH analyzed the data. HH, FB, and HM wrote the paper.

## REFERENCES


## ACKNOWLEDGMENTS

Thanks to Germán Mendoza for assisting in running the monkey experiments. Yaneri Ayala, Yonathan Fishman, Yukiko Kikuchi, Sonja Kotz, Ani Patel, István Winkler and four reviewers are thanked for their constructive criticisms on an earlier version of the manuscript. The first author HH is supported by a Horizon grant (317-70-010) of the Netherlands Organization for Scientific Research (NWO). HM is supported by CONACYT: 236836, CONACYT: 196, and PAPIIT: IN201214-25.

## SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnins.2018.00475/full#supplementary-material

Audio 1 | Example of a rhythmic pattern used in the isochronous condition. [1\_isochronous.wav].

Audio 2 | Example of a rhythmic pattern used in the jittered condition. [2\_jittered.wav].



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Honing, Bouwer, Prado and Merchant. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Human Nature of Music

Stephen Malloch<sup>1,2</sup>\* and Colwyn Trevarthen<sup>3</sup>\*

<sup>1</sup> Westmead Psychotherapy Program, Sydney Medical School, The University of Sydney, Sydney, NSW, Australia,

<sup>2</sup> The MARCS Institute for Brain, Behaviour, and Development, Western Sydney University, Sydney, NSW, Australia,

<sup>3</sup> Department of Psychology, School of Philosophy, Psychology and Language Sciences, The University of Edinburgh, Edinburgh, United Kingdom

Music is at the centre of what it means to be human – it is the sounds of human bodies and minds moving in creative, story-making ways. We argue that music comes from the way in which knowing bodies (Merleau-Ponty) prospectively explore the environment using habitual 'patterns of action,' which we have identified as our innate 'communicative musicality.' To support our argument, we present short case studies of infant interactions using micro analyses of video and audio recordings to show the timings and shapes of intersubjective vocalizations and body movements of adult and child while they improvise shared narratives of meaning. Following a survey of the history of discoveries of infant abilities, we propose that the gestural narrative structures of voice and body seen as infants communicate with loving caregivers are the building blocks of what become particular cultural instances of the art of music, and of dance, theatre and other temporal arts. Children enter into a musical culture where their innate communicative musicality can be encouraged and strengthened through sensitive, respectful, playful, culturally informed teaching in companionship. The central importance of our abilities for music as part of what sustains our well-being is supported by evidence that communicative musicality strengthens emotions of social resilience to aid recovery from mental stress and illness. Drawing on the experience of the first author as a counsellor, we argue that the strength of one person's communicative musicality can support the vitality of another's through the application of skilful techniques that encourage an intimate, supportive, therapeutic, spirited companionship. Turning to brain science, we focus on hemispheric differences and the affective neuroscience of Jaak Panksepp. 
We emphasize that the psychobiological purpose of our innate musicality grows from the integrated rhythms of energy in the brain for prospective, sensation-seeking affective guidance of vitality of movement. We conclude with a Coda that recalls the philosophy of the Scottish Enlightenment, which built on the work of Heraclitus and Spinoza. This view places the shared experience of sensations of living – our communicative musicality – as inspiration for rules of logic formulated in symbols of language.

#### Keywords: musicking, motor intelligence, gestural narrative, infant musicality, cultural learning

"There are certain aspects of the so-called 'inner life' – physical or mental – which have formal properties similar to those of music – patterns of motion and rest, of tension and release, of agreement and disagreement, preparation, fulfilment, excitation, sudden change, etc." Langer (1942, p. 228).

#### Edited by:

Aleksey Nikolsky, Braavo! Enterprises, United States

#### Reviewed by:

Antonio Damasio, University of Southern California, United States Marc Leman, Ghent University, Belgium Jerome Lewis, University College London, United Kingdom

#### \*Correspondence:

Stephen Malloch stephen.malloch@heartmind.com.au Colwyn Trevarthen c.trevarthen@ed.ac.uk

#### Specialty section:

This article was submitted to Cognition, a section of the journal Frontiers in Psychology

Received: 18 February 2018 Accepted: 21 August 2018 Published: 04 October 2018

#### Citation:

Malloch S and Trevarthen C (2018) The Human Nature of Music. Front. Psychol. 9:1680. doi: 10.3389/fpsyg.2018.01680


"The function of music is to enhance in some way the quality of individual experience and human relationships; its structures are reflections of patterns of human relations, and the value of a piece of music as music is inseparable from its value as an expression of human experience" Blacking (1995, p.31).

"The act of musicking establishes in the place where it is happening a set of relationships, and it is in those relationships that the meaning of the act lies. They are to be found not only between those organized sounds which are conventionally thought of as being the stuff of musical meaning but also between the people who are taking part, in whatever capacity, in the performance" Small (1998, p.9).

## PRELUDE

We present a view that places our ability to create and appreciate music at the center of what it means to be human. We argue that music is the sounds of human bodies, voices and minds – our personalities – moving in creative, story-making ways. These stories, which we want to share and listen to, are born from awareness of a complex body evolved for moving with an imaginative, future-seeking mind in collaboration with other human bodies and minds. Musical stories do not need words for the creation of rich and inspiring narratives of meaning.

We adopt the word 'musicking' (as used above by Christopher Small) to draw attention to the embodied energy that creates music, and which moves us, emotionally and bodily. Further, we argue that music comes from the way in which knowing bodies (Merleau-Ponty, 2012 [1945], p. 431) prospectively explore the environment using habitual 'patterns of action,' which we have identified as our innate 'communicative musicality,' observed while infants are in intimate communication with loving caregivers (Malloch and Trevarthen, 2009a). In short case studies of infant interactions with micro analyses of video and audio recordings, we show communicative musicality in the timings and shapes of intersubjective vocalizations and body movements of adult and child that improvise with delight shared narratives of meaning.

Following a survey of the history of discoveries of infant abilities, we propose that the gestural narrative structures of voice and body seen as infants communicate with loving caregivers, 'protonarrative envelopes' of expression for ideas of activity (Panksepp and Bernatzky, 2002; Panksepp and Trevarthen, 2009), are the building blocks of what become particular cultural instances of the art of music, and of dance, theater and other temporal arts (Blacking, 1995).

As the child grows and becomes a toddler, she or he eagerly takes part in a children's musical culture of the playground (Bjørkvold, 1992). Soon more formal education with a teacher leads the way to the learning of traditional musical techniques. It is at this point that the child's innate body vitality of communicative musicality can be encouraged and strengthened through sensitive, respectful, playful, culturally informed teaching (Ingold, 2018). On the other hand, it may wither under the weight of enforced discipline for the sake of conforming to pre-existing cultural rules without attention to the initiative and pleasure of the learner's own music-making.

The central importance of our abilities for music as part of what sustains our well-being is supported by evidence that communicative musicality strengthens emotions of social resilience to recover from mental stress and illness (Pavlicevic, 1997, 1999, 2000). Drawing on the experience of the first author as a counselor, we argue that the strength of one person's communicative musicality can support the vitality of another's through the application of skilful techniques that encourage an intimate, supportive, therapeutic, spirited companionship.

Turning to brain science, focussing on hemispheric differences in performance and in response to music, and the affective neuroscience of Jaak Panksepp (1998), we emphasize that the psychobiological purpose of our 'muse within' (Bjørkvold, 1992) grows from the integrated rhythms of neural energy for prospective, sensation-seeking affective guidance of vitality of movement in the brain (Goodrich, 2010; Stern, 2010).

We conclude with a Coda – an enquiry into the philosophy of the Scottish Enlightenment, which built on the work of philosophers Heraclitus and Spinoza. This view of living in community gives innate sympathy or 'feeling with' other humans a fundamental role within a duplex mind seeking harmony in relationships by attunement of motives (Hutcheson, 1729, 1755). It places the shared experience of sensations of living – our communicative musicality – ahead of logic formulated in symbols of language.

## MUSIC MOVES US – EMBODIED NARRATIVES OF MOVEMENT

Small (1998) calls attention to music as intention in activity by using the verb musicking – participating as performer or listener with attention to the sounds created and the appreciation and participation by others. The compelling quality of music comes from the relationships of sounds, bodies and psyches. 'Musicking' points to our musical life in active 'I-Thou' relationships. Only in this intimacy of consciousness and its interests can we share 'I-It' identification and use of objects, giving things we use, including musical compositions, meaning (Buber, 1923/1970).

'I-Thou' relationships are entered into through the body. The philosopher Merleau-Ponty (2012 [1945]) writes that "The subject only achieves his ipseity [individual personality, selfhood] by actually being a body, and by entering into the world through his body. . . The ontological world and body that we uncover at the core of the subject are not the world and the body as ideas; rather, they are the world itself condensed into a comprehensive whole and the body itself as a knowing-body" (p. 431; italics added). Musicking is knowing bodies coming alive in the sounds they make. Scores and other tools that record the product of musicking, performed or imagined, aid the retention of ideas, as semantics of language does, and they serve discussion and analysis – but they are not the same as the breathing, moving, embodied experience of human musicking (Mithen, 2005).

Musicking is the expression of the sensations of what we call our communicative musicality, for the purpose of creating music that is enlivening and 'beautiful' (Malloch, 1999; Malloch and Trevarthen, 2009a; Trevarthen, 2015; Trevarthen and Malloch, 2017b). We define communicative musicality as our innate skill for moving, remembering and planning projects in sympathy with others through time, creating an endless variety of dramatic temporal narratives in song or instrumental music. We describe this life-sharing in movement as having three components:

Pulse – a regular succession through time of discrete movements (which may, for example, be used to create sound for music, or to create movement with music – dance) using our felt sense of acting which enables the 'future-creating' predictive process by which a person may anticipate or create what happens next and when.

Quality – consisting of the contours of expressive vocal and body gesture, shaping our felt sense of time in movement. These contours can consist of psychoacoustic attributes of vocalizations – timbre, pitch, volume – or attributes of direction and intensity of the moving body perceived in any modality.

Narratives of individual experience and of companionship, built from sequences of co-created gestures which have particular attributes of pulse and quality that bring aesthetic pleasure (Malloch and Trevarthen, 2009b; Trevarthen and Malloch, 2017b).

With music we create memorable poetic events in signs that express in sound our experience of living together in the creating vitality of 'the present moment' (Stern, 2004, 2010). The anthropologist and ethnologist Claude Lévi-Strauss draws the attention of linguists to the structured 'raw' emotive power of music (beyond what words may be 'cooked' to say).

"In the first volume of the Mythologiques, Le Cru et le Cuit. Lévi-Strauss refers to music as a unique system of signs possessing 'its own peculiar vehicle which does not admit of any general, extramusical use'. Yet he also allows that music has levels of structure analogous to the phonemes and sentences of language. The absence of words as the connecting level is an obvious and pertinent fact in the structuring of meaning within music as a sign system." (Champagne, 1990, p. 76).

Goodrich (2010), in her appreciation of the contribution of neuroscientists Llinás (2001) and Buzsáki (2006) to the science of the mind for skilled movement, cites Llinás' evidence on the role of intuitive structural 'rules,' seen also in a musical performance.

"Llinás describes another method of keeping movement as efficient as possible: motor 'Fixed Action Patterns' (FAPs), distinct and complicated 'habits' of movement built from reflexes, habits that we develop to streamline both neural action and muscle movement. These are not entirely fixed, despite their name; they are constantly undergoing modification, adaptation, refinement, and they overlap each other... Llinás even argues that the extraordinarily precise motor control of Jascha Heifetz playing Tchaikovsky's violin concerto is composed of highly elaborated and refined FAPs, a description most instrumentalists would find absolutely plausible" (Goodrich, 2010, p. 339).

#### As Llinás himself writes,

"Can playing a violin concerto be a FAP? Well, not all of it, but a large portion. Indeed, the unique and at once recognisable style of play Mr. Heifetz brings to the instrument is a FAP, enriched and modulated by the specifics of the concert, generated by the voluntary motor system" (Llinás, 2001, p. 136).

We add that skilled FAPs are not "composed of reflexes" as separate automatic responses. Rather they are purposeful projects that are animated to be developed imaginatively, and affectively, with exploration of their biomechanical "degrees of freedom," as in Nikolai Bernstein's detailed description of how a toddler learns to become a virtuoso in bipedal locomotion, which he calls The Genesis of the Biodynamical Structure of the Locomotor Act (Bernstein, 1967, p. 78). The testing of these locomotor acts is with an immediate and essential estimation by gut feelings (Porges, 2011) of any risks or benefits, any fears or joy, they may entail within the body.

## MUSIC REFLECTS THE FELT-SENSE OF OUR FUTURE-EXPLORING MOTOR INTELLIGENCE

Consciousness is created as the ongoing sense of self-in-movement with which we experience and manipulate the world around us. Its origin is in our evolutionary animal past, evolved for new collaborative, creative projects, regulated between us by affective expressions of feelings of vitality from within our bodies (Sherrington, 1955; Panksepp, 1998; Damasio, 2003; Mithen, 2005; Stern, 2010; Eisenberg and Sulik, 2012).

Using the philosopher and psychologist James (1892/1985) as a starting point for exploring the intimacy of feeling that supports and guides psychotherapy, Russell Meares (2005, p.18), following the 'conversational model' of therapy developed with his collaborator psychiatrist/psychotherapist Hobson (1985), identifies five dimensions of the self:


These sensuous qualities of the experienced self are expressed in music, and in other temporal arts, as 'the human seriousness of play' (Turner, 1982). Music, as Susanne Langer says so clearly in the quote at the start of this paper, has qualities of this inner life described by Meares as shape, ongoingness and flow, connectedness and unity. The notion of music as expressive of the movements of our inner life has also been explored by music theorists, most notably Ernst Kurth (1991). Likewise, in his book Self Comes to Mind, Antonio Damasio likens all our emotion and feeling to a 'musical score' that accompanies other ongoing mental process (Damasio, 2010, p.254).

The ultimate motivation for creating music can even be traced to the cellular level. In Man On His Nature (Sherrington, 1955), in a chapter entitled The Wisdom of the Body, the creator of modern neurophysiology Charles Sherrington called the coming together of communities of cells into the integrated body, nervous system and brain of a person "an act of imagination" (p. 103). Neuroscientist Rodolfo Llinás also grants subjectivity, a sense of self, to all forms of life. "Irritability [i.e., responding to external stimuli with organized, goal-directed behavior] and subjectivity, in a very primitive sense, are properties originally belonging to single cells" (Llinás, 2001, p.113). "Thinking", writes Llinás (2001, p.62), "ultimately represents movement, not just of body parts or objects in the external world, but of perceptions and complex ideas as well."

Intrinsic to the sociability of this intelligence of movement is sensitivity for the exploration of the future, which is woven into our creation and experience of music. Karl Lashley (1951), reflecting on the evolution of animal movement, proposed that the ability to predict what might come next, and to plan the 'serial ordering' of separate actions, may be understood as the foundation for our logical reasoning as an individual, as it is for the grammar or syntax and prosody of our communication in language. It is essential for musicking. A restless future-seeking intelligence, with our urge to share it, inspires us to express our personalities as 'story-telling creatures,' who want to share, and evaluate, others' stories (Bruner, 1996, 2003).

All animal life depends on motivated movement – the urge to explore with curiosity – to move towards food with anticipation, to move away from a predator with fear, to interact playfully with a trusted friend (Eibl-Eibesfeldt, 1989; Panksepp and Biven, 2012; Bateson and Martin, 2013). A great achievement of modern science of the mind was the discovery by a young Russian psychologist Nikolai Bernstein of how all consciously made body movements depend upon an 'image of the future' (Bernstein, 1966, 1967).

Bernstein applied the new technology of movie photography to make refined 'cyclographic' diagrams of displacements of body parts, from which he analyzed the forces involved to fractions of a second. His findings reported in Coordination and Regulation of Movements became widely known in English translation in 1967, at the same time as video records of infant behaviors were described more accurately (see next Section The Genesis of Music in Infancy – A Short History of Discoveries), revealing their anticipatory motor control adapted for intelligent understanding of how objects may be manipulated, as well as for communication and cooperation (Trevarthen, 1984b, 1990b).

Our musical creativity and pleasure come from the way our body hopes to move, with rhythms and feelings of grace and biological 'knowing' (see Merleau-Ponty). The predicting, embodied self of a human being experiences time, force, space, movement, and intention/directionality in being. Together, these form the Gestalt of 'vitality' (Stern, 2010), the 'forms of feeling' (Hobson, 1985) by which we sense in ourselves and in others that movement, be that movement of the body or of a piece of music, is 'well-done' (Trevarthen and Malloch, 2017b).

#### THE GENESIS OF MUSIC IN INFANCY – A SHORT HISTORY OF DISCOVERIES

The ability to create meaning with others through wordless structured gestural narratives, that is, our communicative musicality, emerges from before birth and in infancy. From this innate musicality come the various cultural forms of music.

Any attempt to understand how human life has evolved its unique cultural habits needs to start with observing what infants know and can do. Organisms regulate the development of their lives by growing structures and processes from within their vitality, by autopoiesis that requires anticipation of adaptive functions. And they must develop and protect their abilities in response to environmental affordances and dangers, with consensuality (Maturana and Varela, 1980; Maturana et al., 1995). Infants are ready for human cultural invention and collaboration as newly hatched birds are ready for flying – within 'the biology of love' (Maturana and Varela, 1980; Maturana and Verden-Zoller, 2008). All organisms reach out in time and space to make use of the 'affordances' for thought and action (Gibson, 1979).

Infants have no language to learn what other humans know, or what ancestors knew. But the vitality of their spontaneous communicative musicality, highly coordinated and adapted to be shared through narratives with sympathetic and playful companions, enables meaningful communication in the 'present moment' (Stern, 2004; **Figure 1**, Upper Right), which may build serviceable memories extended in space and time (Donaldson, 1992).

In this section we review changes in understanding of infant abilities over the past century which can help explain the peculiar way music has in the past been seen by some leading psychologists and linguists as a relatively insignificant epiphenomenon, learned for play, not, as we argue in this paper, a source of all talents for communication, rooted in our innate human communicative musicality of knowing bodies (Merleau-Ponty, 2012 [1945]) moving with prospective intuition to engage the world in company, 'intersubjectively,' from the start (Zeedyk, 2006).

Two leading scholars in medical science and the science of child development in the past century, Freud (1923) and Piaget (1958, 1966), declared that infants must be born without conscious selves conceiving an external world, and unable to adapt their movements to the expressive behavior of other people. The playful and emotionally charged behaviors of mothers and other affectionate carers were considered inessential to the young infant, who needed only responses to reflex demands for food, comfort and sleep.

Then René Spitz (1945) and Bowlby (1958) revealed the devastating effects on a child's emotional well-being of separation from maternal care in routine hospital care with nursing directed only to respond to those reflex demands. Spitz observed that babies develop smiling between 2 and 5 months to regulate social contacts (Spitz and Wolf, 1946), and he went on to study the independent will of the baby to regulate engagements of care or communication, by nodding the head for 'yes' or shaking for 'no' (Spitz, 1957).

In the 1960s a major shift in understanding of the creative mental abilities of infants was inspired by a project of the educational psychologist Bruner (1966, 1968), and the pediatrician Brazelton (1961, 1979), who perceived that infants are gifted with sensibilities for imaginative play and ready to start cultural learning from the first weeks after birth. Supported by insights of Charles Darwin and by new findings of anthropology and animal ethology, they studied infant initiatives to perceive and use objects, and they were impressed by the intimate reciprocal imitation that develops between infants and affectionate parents and caregivers who offer playful collaboration with the child's rhythms and qualities of movement. Film studies showed that young infants make complex shifts of posture and hand gestures that are regulated rhythmically, similar to those of adults (Bruner, 1968; Trevarthen, 1974).

FIGURE 1 | Inborn musicality shared in movement. Upper Left: Infant less than one hour after birth watches her mother's tongue protrusion and imitates. Upper Right: On the day of birth, a baby in a hospital in India shares a game with a woman who moves a red ball. The baby tracks it with coordinated movements of her eyes and head, both hands and one foot. Lower: A two-month premature girl, Naseera, with her father, who holds her against his body in 'kangaroo' care. They exchange short 'coo' sounds with precisely shared rhythm. Upper Left and Upper Right: Photos for own use of second author from colleagues Vasudevi Reddy and Kevan Bundell. Part reproduced from Trevarthen (2015), Figure 1, p. 131. Lower: Photo from Trevarthen (2008), Figure 2, p. 22. Original spectrograph in Malloch (1999), Figure 3, p. 37.

While this new appreciation of infant abilities was developed at the Center for Cognitive Studies at Harvard, radically transforming the 'cognitive revolution' that was announced there by George Miller, Noam Chomsky and Jerome Bruner in 1960, nearby at the Massachusetts Institute of Technology, a project initiated by Bullowa (1979) sought evidence on the behaviors that regulate dialog before language. Bullowa used information from anthropology to draw attention to the measured dynamics of communication.

"For an infant to enter into the sharing of meaning he has to be in communication, which may be another way of saying sharing rhythm.... The problem is how two or more organisms can share innate biological rhythms in such a way as to achieve communication which can permit transmission of information they do not already share." (Bullowa, 1979, p. 15, italics added).

Wanting to understand how the rhythmic flow of dialog can be shared with a child too young for speech, she pointed a way to the appreciation of the role of human communicative musicality as inspiration for the development of behaviors for carrying meaning in language – thinking and communicating in words that will be acquired to specify facts, and to describe and think about how these facts are related or may be used.

The importance of rhythm and the graceful narratives of movement displayed by infants as they communicate purposes and feelings was revealed sixty years ago by a psychobiological approach using photography and movie film, then video. Discoveries were made that challenged the theory that infants had no minds, no sense of self, and therefore no sense of others (Zeedyk, 2006; Reddy, 2008). Most astonishing, and dismissed with derision by convinced rational mind-separate-from-body constructivists, was the finding that infants activate the many parts of their body with an exquisite sense of time, and that they can use the rhythms of expression skilfully to imitate in intersynchrony with attentive responses from an adult (**Figure 1**).

In his work as a pediatrician, Brazelton (1961), developing his now famous Neonatal Assessment Scale (Brazelton, 1973; Brazelton and Nugent, 1995), accepted and encouraged the natural love mother and father felt for their new baby, and showed how appreciative the baby could be of their actions to each other and to the baby. This welcoming of the newborn as a person with intelligence and sociable impulses confirmed the parents' belief that they could communicate feelings and interests by responding to their baby's exquisitely timed looks, smiles, hand gestures and cooing with their own exquisitely timed gestures of voice and body. It transformed medical concern for the baby. As Brazelton declared in Margaret Bullowa's book, "The old model of thinking of the newborn infant as helpless and ready to be shaped by his environment prevented us from seeing his power as a communicant in the early mother-father-infant interaction. To see the neonate as chaotic or insensitive provided us with the capacity to see ourselves as acting 'on' rather than 'with' him" (Brazelton, 1979, p.79).

New attention to newborns within hours of their delivery, with the aid of films, led to confirmation that the baby could imitate adult expressions with careful timing of movements of eyes, face, mouth and hands (Maratos, 1973, 1982; Meltzoff and Moore, 1977; Kugiumutzakis, 1993; Nagy and Molnár, 1994, 2004; Nagy, 2011; Kugiumutzakis and Trevarthen, 2015; **Figure 1**, Upper Left). The findings proved that the baby is born with an altero-ceptive awareness of another person's body parts as having feelings in movement like their own proprio-ceptive ones. It also became clear that this consciousness appreciates the balance and drama of a collaborative narrative flow with shared rhythms – essential to our ability for musicking (Trevarthen, 1974, 1977, 2005a).

The pediatrician Sander (1964, 1975; republished in Sander, 2008) recognized that an infant and caregiver create a coherent system of actions regulated with feelings of vitality in shared time. This dynamic collaboration was also discovered by Daniel Stern when he examined recordings of a mother playing with her three-month-old twins (Stern, 1971).

A stimulating contribution to this new approach came from the work of anthropologist and linguist Mary Catherine Bateson, daughter of anthropologists Gregory Bateson and Margaret Mead. In 1969 Bateson had her first child after beginning postgraduate studies at MIT with Margaret Bullowa, researching language development using statistical analysis of vocal expressions. Observing a film in Bullowa's collection, together with the experience of rich exchanges with her own infant, opened her awareness of the form and timing of communication that developed in the first 3 months, which she called 'protoconversation' (Bateson, 1971). She benefitted from attention to the field studies of Albert Scheflen on the stream of conversation (Scheflen, 1972) and Ray Birdwhistell on body movements in natural conversation (Birdwhistell, 1970).

Reviewing her work in Bullowa's book, she said:

". . . the mother and infant were collaborating in a pattern of more or less alternating, non-overlapping vocalization, the mother speaking brief sentences and the infant responding with coos and murmurs, together producing a brief joint performance similar to conversation, which I called 'proto conversation'. The study of timing and sequencing showed that certainly the mother and probably the infant, in addition to conforming in general to a regular pattern, were acting to sustain it or to restore it when it faltered, waiting for the expected vocalization from the other and then after a pause resuming vocalization, as if to elicit a response that had not been forthcoming. These interactions were characterized by a sort of delighted, ritualized courtesy and more or less sustained attention and mutual gaze. Many of the vocalizations were of types not described in the acoustic literature on infancy, since they were very brief and faint, and yet were crucial parts of the jointly sustained performances." (Bateson, 1979, p. 65).

Bateson's work confirmed Bruner's realization that Noam Chomsky's hypothesis of a specific Language Acquisition Device (LAD) or innate ability of the child to construct syntax, the grammatical order of words, to formulate ideas (Chomsky, 1965), paid no heed to the vital importance of a complementary ability of a parent to encourage enrichment of reference in communication with a young child, a Language Acquisition Support System (LASS). As Bruner, expressing his psychology of education, put it, "the LAD needs a LASS" (Bruner, 1983). Child and adult share rules of imagination for all kinds of movement, including spoken propositions.

A follower of Chomsky's theory of the evolution of language as reasoning, Steven Pinker, in his perhaps over-ambitiously titled How the Mind Works (Pinker, 1997), claimed, "As far as biological cause and effect are concerned, music is useless." He gave no attention to movement and time, the communication of infants, playful children, affectionate parents, the poetry of music, or Einstein's theory of his mathematical invention as "sensations of bodily movement" (Hadamard, 1945), thus misunderstanding the origins and purpose of rational discourse (see Sections Communicative Musicality and Resilience of the Human Spirit and Musical Affections of the Embodied Human Brain).

We now have evidence from many studies analyzing behaviors that demonstrate that infants show a rich spectrum of expressive movements of the upper parts of their bodies (Trevarthen, 1984a), not just the 'categorical emotions' identified by Paul Ekman (Ekman and Friesen, 1975), but the 'complex social emotions' that Damasio (1999, 2010) describes as regulators of well-being in intimate interpersonal relations, and expression of a moral personality in society – expressions of such feelings as embarrassment, shame, guilt, contempt, compassion, and admiration.

Most importantly, study of recordings reveals that modulations of timing, of rhythms, and of the flow of vitality forms shared with infants have the characteristics recognized as musical (Trevarthen, 1990a). These have been precisely defined by acoustic analysis of vocalizations of adult and infant in dialogs and games (Malloch et al., 1997; Malloch, 1999; Trevarthen, 1999; Trehub, 2003).

## CASE STUDIES OF INFANT MUSICALITY

We summarize here key findings related to the growth of musical abilities from studies of infant individuals that we have reported previously.

First there is evidence from a recording made by Saskia van Rees in an intensive care unit in Amsterdam (van Rees and de Leeuw, 1993) that rhythms corresponding to those of human locomotion are present in vocalizations of a premature infant which are precisely coordinated with simple vocal exchanges with a caring father (**Figure 1**, Lower).

The recording of a two-month premature girl with her father, who was holding her under his clothes against his body in 'kangaroo' care, shows that they exchange short 'coo' sounds, the father imitating her sounds, with precise timing based on a comfortable walking rhythm of andante – one step every 0.7 s. Father (F) and the baby Naseera (N) are equally precise in their timing, which also shows what a phonetician would recognize as a 'final lengthening' characteristic of a spoken phrase – when they are ready to stop the dialog the interval lengthens to 0.85 s. Following the shared phrase with its syllable-length durations, they exchange single sounds separated at 4 s intervals, the normal duration of a short spoken phrase. The recording supports our contention that even a prematurely born baby is skilled in sharing a musical pulse (Trevarthen, 2009).
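The timing measures described for this recording can be illustrated with a minimal sketch. It is not the authors' analysis: the onset times below are hypothetical values chosen only to reproduce the reported 0.7 s pulse and the 0.85 s final interval, and the function names are our own. The sketch computes the intervals between successive vocal onsets, converts the pulse to beats per minute, and expresses final lengthening as a ratio.

```python
def inter_onset_intervals(onsets):
    """Return the gaps (in seconds) between successive vocalization onsets."""
    return [round(later - earlier, 3) for earlier, later in zip(onsets, onsets[1:])]


def beats_per_minute(interval_s):
    """Convert a pulse interval in seconds to a tempo in beats per minute."""
    return 60.0 / interval_s


def final_lengthening_ratio(intervals):
    """Ratio of the last interval to the mean of all preceding intervals."""
    body = intervals[:-1]
    return intervals[-1] / (sum(body) / len(body))


# Hypothetical onset times (s) of alternating father/infant 'coo' sounds,
# invented to match the reported 0.7 s pulse and 0.85 s closing interval.
onsets = [0.0, 0.7, 1.4, 2.1, 2.8, 3.65]

intervals = inter_onset_intervals(onsets)
tempo = beats_per_minute(intervals[0])
lengthening = final_lengthening_ratio(intervals)

print(intervals)
print(round(tempo, 1), round(lengthening, 2))
```

On these illustrative values the pulse corresponds to roughly 86 beats per minute, and the closing interval is about 1.2 times the preceding ones.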

A recording with a blind 5-month-old girl illustrates intermodal attunement between the heard melody of a mother's song and the proprioceptive feelings in the body of the baby of a gesturing left arm and hand (**Figure 2**, Upper). The human ability to sense the shape of a melody within the body is intrinsic to our enjoyment of music as human communication (Stern, 2010). Maria is totally blind and has never seen her hand. Maria and her mother were assisting in a project of Professor Gunilla Preisler in Stockholm to aid communication with blind and deaf infants.

While her baby is lying down during bottle feeding, the mother sings two baby songs including "Mors Lille Olle," well-known throughout Scandinavia. It was not realized until later when the video was viewed that Maria was 'conducting' the melodies with delicate expressive movements of her left hand, while the right hand was making unrelated movements, stroking her body. When Professor of Music at Edinburgh University, Nigel Osborne, saw the film he said, "Yes she is conducting using the conventional movements of a professional conductor, describing a phrase with a sweeping movement, pointing up for a higher pitch, and dropping her wrist at the close of a verse – and she is making the movements with some anticipation." Microanalysis supported what he observed. At certain points in the course of the melody Maria's finger moves 300 milliseconds before the mother's voice. She knows the song well, and leads the 'performance' (Trevarthen, 1999; Schögler and Trevarthen, 2007).
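The kind of lead-time measurement that such a microanalysis yields can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the original study: the voice onsets at 4, 16 and 28 s are taken from the caption of Figure 2, the gesture onsets are hypothetical values placed 0.3 s earlier, and the nearest-event pairing rule is our own simplification.

```python
def lead_times(gesture_onsets, voice_onsets):
    """For each voice onset, find the nearest gesture onset and return
    how far the gesture precedes the voice (positive = gesture first)."""
    leads = []
    for voice in voice_onsets:
        nearest = min(gesture_onsets, key=lambda g: abs(g - voice))
        leads.append(round(voice - nearest, 3))
    return leads


# Voice onsets (s) where lines of the lullaby begin, per the Figure 2 caption.
voice_onsets = [4.0, 16.0, 28.0]
# Hypothetical finger-movement onsets (s), assumed for illustration only.
gesture_onsets = [3.7, 15.7, 27.7]

leads = lead_times(gesture_onsets, voice_onsets)
print(leads)
```

A consistently positive lead, as in these illustrative values, is what would distinguish anticipation from a reaction to the sound already heard.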

Although blind, Maria knows the feelings of anticipated movement of her hand, and uses them to sense and share the human vitality dynamics in her mother's voice. This kinematic sensibility was identified by Olga Maratos in her pioneering research in imitation as foundational for the ability of a young infant to reproduce another person's expression seen or heard (Maratos, 1973, 1982). Indeed, vocal perception, detecting the modulation of pitch and timing in an adult's voice sounds, develops much faster than vocal production. The infant may be tracking sound with reference to the kinesics of the fastest and most complex gestural movements of her hands.

When taking part in a nursery song, infants demonstrate sensitivity for melodic phrase structure, attending to the rhyming vowels at the ends of lines, and by 5 months the infant can vocalize a matching vowel in synchrony with the mother (Trevarthen, 2008). For our final example, acoustic analysis combined with observations of the gestural behaviors of infants in the middle of the first year suggests that melodic patterns, common to different cultures, define four-line verses (Imberty, 2000), with a pattern of Introduction, Development, Climax and Resolution, identified in proto-conversations with two-month-olds (**Figure 2**, Lower, and **Figure 3**). We note similarities with the sections recognized in classical Roman rhetoric or speech-making – exordium, narratio, confutatio, and confirmatio. In spite of very different conventions in musical performances in different communities, a parent, or a child, wanting to share the pleasure of songs and action games with a baby, naturally adopts the intuitive formula of a poetic verse to share a story of body movement.

Lastly, we present the work of Katerina Mazokopaki, a developmental psychologist who is a pianist and teacher of piano playing. She made a study of babies in Crete with her professor, Giannis Kugiumutzakis, an expert in analysis of imitative games with newborns (Mazokopaki and Kugiumutzakis, 2009). The babies were left alone in a familiar place at home amusing themselves. Then a recording of a Greek baby song came on. Between 3 and 10 months old they all reacted in the same way.

FIGURE 2 | Communicating in the rhythm of narratives with baby songs. Upper: A five-month-old girl, who is totally blind, 'conducts' her mother's singing with her left hand. The graph of her hand movements with her mother's voice marks, with black bars, three moments in the narrative, at 4, 16, and 28 s, where the infant anticipates the mother's voice by 0.3 s. These are times, 12 s apart, when lines of the slow lullaby commence. The arrows identify moments when there is perfect synchrony between the mother's voice and the baby's index finger. The circled index finger movement is 300 msec ahead of a lift in the pitch of the mother's voice. Lower: Two infants share the poetry of action songs with their mothers. A four-month-old enjoys her mother's singing of a song and accompanies the narrative with movements of her baby's hand. And a six-month-old imitates clapping movements in synchrony with the song. Both songs show an iambic rhythm of short and long syllables, and rhyming vowels, 'bear' with 'there' and 'well' with 'bell,' which the babies imitate with their voices. (Upper: Analyses of video from Professor Gunilla Preisler, University of Stockholm. Previously published in Trevarthen (1999) and Schögler and Trevarthen (2007). Lower: From Trevarthen (2015), modified from Figure 4, p.137.)

First they looked surprised; then they looked about as if someone had come into the room; and finally they smiled with delight and started performing with the music, inspired by the pulse and melody, joining the music with their different abilities to dance and sing (**Figure 4**).

## COMMUNICATIVE MUSICALITY AND EDUCATION INTO THE CULTURE OF MUSIC

Mastery of a musical culture, and of language, starts with the intuitive vocal interactions between caregiver and infant (Vygotsky, 1966). Our innate communicative musicality is the 'raw material' for cultural forms of music and the rules of grammar and syntax. A child makes stories in sound as an active participant whose pride to belong to the rich musical traditions of society propels them into learning and creating. This is the cultivation of communicative musicality to music, from innate self-expression to cultural practice and a musical identity (MacDonald et al., 2002; MacDonald and Miell, 2004). It is brought to life, as language is, with the enthusiastic support of more experienced companions (Bruner, 1983).

The desire for cultural participation is evident in informal learning in which children's own musical culture grows from the vitality of The Muse Within (Bjørkvold, 1992). It is nurtured from the music, live and recorded, that the child hears all around and contributes to spontaneously, along with the invention of talking and verse-making with playmates, often accompanied by rhythmic stamping, hopping and jumping (Chukovsky, 1963). Our earliest shared signing of communicative musicality in infancy becomes dialogic 'musical babbling' from around 2 months old (Trevarthen, 1990a). Already at 2 months the infant is learning the cultural gestures and preferences that become the tools through which cultural meaning will be created and exchanged (Moog, 1976; Tafuri and Hawkins, 2008). Lullabies are sung more often in cultures that value quiet infants, playsongs in cultures that value lively enthusiasm in infants (Trehub and Trainor, 1998). A musical 'proto-habitus' is created (Gratier and Trevarthen, 2008).

FIGURE 4 | Katerina Mazokopaki's baby dancers: Upper Left: Georgos, 3.5 months lying on a comfortable bed, responds with a big smile and gestures of hands and feet to accompany the music. Upper Right: Katerina, 9 months, smiles, bounces and extends her arms to 'fly' into action. Lower Left: Panos, 9 months, smiles his greeting, then beats the floor rhythmically with his right hand. Lower Right: Anna, 10 months, standing in her cot, smiles and starts dancing vigorously, swinging her bottom. All four sing with the music (Photos supplied to the CT).

The infant is born ready to interact and discover her musical culture. Hearing responds to musical sounds from the third trimester of pregnancy (Busnel et al., 1992), and infants can recognize music they heard before birth (Hepper, 1991). They recognize musical contours and rhythmic patterns (Trehub and Hannon, 2006), and 'dance' to music before they are one year old (Zentner et al., 2010). Infant-inclusive singing is preferred, like infant-inclusive speaking<sup>1</sup>.

From about 3 months of age in many (probably all) cultures, mothers start to sing baby songs (Trevarthen, 1986) to their infants. In a previous publication the second author summarized their characteristics:


<sup>1</sup>We regard the term 'infant directed,' usually used in this context, to be a misnomer that does not acknowledge the co-operative and shared nature of the interaction. We use 'infant-inclusive' to refer to the style of speech and singing that spontaneously occur between caregiver and infant.


More recognizable musical forms grow with the spontaneous singing of young children as they play alone or with others (Barrett, 2011), practicing their musical craft. The Norwegian musicologist Jon-Roar Bjørkvold (1992) collected and studied the songs of 4–7-year-olds in three kindergartens in Oslo. He observed how they gave voice to emotion, conveyed information, and established relationships through learning and creating their own children's musical culture. He identified two types of children's singing – 'egocentric' for private pleasure, which, as the child matures, gives way to more social or 'communicative' music making.

As young children mature, they use their voice with a singing kind of expression in progressively more 'symbolic' ways. Fluid/Amorphous Songs "evolve in a completely natural way from the infant's babbling as part of its first playful experiments with voice and sound. This type of spontaneous song, with its fanciful glissandi, micro-intervals, and free rhythms, is quite different from what we adults traditionally identify as song." (Bjørkvold, 1992, p. 65). Song Formulas, such as teasing songs, are symbolic forms for communicating and they flourish after the child begins to play with peers, typically at two or three. Elements of musically more complex Standard Songs are picked up from play with adults and hearing them sing, and are adapted to fit what the child is doing. This progressive 'ritualisation' of vocal creativity clarifies the adaptive motives for learning to sing, and how they express increasing narrative imagination for sharing ideas in culturally specific ways (Gratier and Trevarthen, 2008; Eckerdal and Merker, 2009), paralleling the way language is mastered (Chukovsky, 1963).

All through the development of children's singing, repetition and variation, basic tools of any piece of music (for example, see Ockelford, 2017), are primary features as children explore the possibilities of musical form. Repetition and variation between the vocalizations of infant and caregiver feature from the very first shared vocalizations, regulating feelings in social interactions (Malloch, 1999). Later, the growing child will continue to play with how music can convey affect and change their own and others' mood, the four-part structure of Introduction, Development, Climax and Resolution, identified above in the structure of a proto-conversation, becoming the basis of large scale musical works, as well as verbal argument (for example, see the sections of classical rhetoric).

How the child's spontaneous musicality, as it grows in group practice without formal training (MacDonald and Miell, 2004), is received by the surrounding educational culture is a vital ingredient in the child's emerging 'musical identity.' Musical identity and self-efficacy or mastery of skill in music making inform each other, in reciprocal relationship. A child who sees themselves as a competent musician may attempt to learn a difficult piece of music, and their success at performing this piece will further bolster their sense of competence. And the way a child is welcomed into their musical culture is of vital importance as to whether this child thrives playfully with the musical tools at her disposal, developing her skill in the use of these tools, or shrinks away in lack of interest because her own intrinsic musicality is not being heard or valued. If education does its job well, with the child as collaborative artist and thinker (Trevarthen et al., 2018), our rich inner narrative of affective life, generated with our prospective awareness of body movement, is expressed in our social group to create a life-affirming, inclusive culture of shared artful rituals that celebrate the aesthetic grace and moral graciousness of joy in performance (Frank and Trevarthen, 2012; Trevarthen and Bjørkvold, 2016). The InCanto project (Tafuri and Hawkins, 2008) is a wonderful example of infants and parents being encouraged to cultivate their expression of music in such a way that the infant grows into a child who shows greater ability to sing in tune, a greater range of musical expression, and overall more enthusiasm for music participation.

Problems from introducing an emphasis on enforced cultural learning too early are demonstrated by Bjørkvold (1992) who studied the musical games of children in Oslo, Moscow, St Petersburg, and Los Angeles where educational, cultural, social and political practices are very different. In all three countries children showed spontaneous musicality, but in the nations of Russia and the US, where formal training in music was given greater value than it was in Norway, he found reduced spontaneous music making. He insists, "It is critically important for children to master spontaneous singing, for it is part of the common code of child culture that gives them a special key to expression and human growth" (Bjørkvold, 1992, p. 63). A comparable inhibitory effect of conventions of schooling has been recorded on the spontaneous expression of religious feelings and spirituality in the early years (Hay and Nye, 1998). These innate sources of human imagining in collaborative, moral ways give value and meaning to the later cultivation of advanced cultural ideas and skills (Valiente et al., 2012).

The importance of valuing both the child's innate musical creativity and introducing a child into his musical culture so that he may thrive within it and contribute to it can be conceptualized as a balance between two educational necessities – providing a social environment where a child's own skills and abilities are nurtured, and a place where training is provided into the ways of a particular culture (Rogoff, 2003). Both build enthusiasm for cultural participation. This balance has been presented by Bowman (2012) and others as a consideration of two Latin roots for the English word 'education.' One, educare, means to train or to mold. Its motivation is the initiation of a person into cultural conventions, without which a person is left unable to live effectively within a particular culture, using its tools to communicate. The other, educere, means to lead out, or draw out. Without this more responsive nurturing, the person is left unable to engage with situations and solve problems not yet imagined. Their ability for creativity is compromised. These very different concepts of what education means are often experienced in schooling as being in tension, with educare often winning out, leaving the child with dry knowledge rather than living abilities supported by their own innate skills.

We propose that teachers and students of music at all levels learn how best to do their work by deliberately invoking the rhythms of the student's innate creative vitality while demonstrating cultural conventions that make rich use of this talent (Flohr and Trevarthen, 2008). Infants and toddlers make imaginative musical play in affectionate friendships with parents or peers (Custodero, 2009); primary school children build relationships with the invention of stories in groups with free instrumental play and dance (Fröhlich, 2009); and an advanced music student is assisted to master their instrument through their teacher encouraging their playing to be like a dance representing a narrative, rich in expressive feelings (Rodrigues et al., 2009). In all instances the motives of the learner, and how they may change with development of the body and experiences gained, are of crucial importance (Bannan and Woodward, 2009; Ingold, 2018). As with all education, the success of teaching depends on recognition of how children's 'zest for learning' (Whitehead, 1929; Dewey, 1938) changes with age and the development of body and mind.

We end this section with a quote from Bowman on the broader role of music in education. In times when the arts are often considered of marginal importance in education, it speaks to the richness of our engagement with music in nurturing all learning experiences:

"The distinctive educational and developmental potential of music lies, I submit, in dynamic, bodily, and social natures, and distinctly ethical, responsive, and responsible kinds of know-how these afford. Practical knowledge is action embedded knowledge, quite distinct from theoretical knowledge and technical know-how. It is a kind of character-based sense of how best to proceed in situations where best courses of action cannot be determined by previous ones. This ability to discern the right course of action in novel, dynamic situations is precisely the kind of human asset required in today's rapidly changing world. And musical engagements may, under the right circumstances, nurture this capacity in ways unmatched by any other human endeavour." (Bowman, 2012, p. 31).

## COMMUNICATIVE MUSICALITY AND RESILIENCE OF THE HUMAN SPIRIT

As Daniel Stern has written (Stern, 2010), the human body has a rich range of gestural 'forms of vitality' – we move in musical ways. And within each actor there is both 'self-sensing' and 'other-sensing' of the degree of grace, or biological efficiency (Bernstein, 1967) and hopefulness (Trevarthen and Malloch, 2017a) in the gestural narratives of our projects. These qualities of vitality, or well-being, transmitted to others, become the qualities of relationships and social activities – their moral values (Kirschner and Tomasello, 2010; Narvaez, 2014; Trainor and Cirelli, 2015). They convey relational feelings for the degree of consensuality or sharing of expression in moving. Effort to manage the grace and morality of movements can be cultivated to assist well-being of those whose actions are confused or fearful – that is, the making or 'poetry' (from the Greek poiein, to make) of their movements may be enhanced to provide responsive and relational care or therapy (Hobson, 1985, ch. 3; Stern, 2000, p. xiv; Osborne, 2009b; Meares, 2016).

At times our healthy ability for graceful gesturing is met with circumstances that do not allow it to be expressed with its natural healthy vitality. For example, failure to gain a sympathetic appreciation of their musicality can cause an infant to express withdrawal and distress (Murray and Trevarthen, 1985). Instead of joyful pride in sharing play they show sadness and shame (Trevarthen, 2005b). However, an infant's communicative musicality can also be expressive of resilience and determination.

In the example presented in **Figure 5A** we see a consistent rigidity of expression and a lack of self-confident invention on the part of a mother suffering from BPD (borderline personality disorder). She repeats the same up-and-down vocal gesture again and again, with almost no vocal participation on the part of the infant. Where the infant does participate (shown by vocalizations with either a square or circle around them), the infant appears to be setting up the possibility for a dialog – vocalizing exactly on the 'bar-line' (bar 5, shown by a square) and then around the mother's pitch (shown by a circle). Indeed, the infant's vocalizations persuade the mother out of her repetitiveness – the mother momentarily takes notice of her infant and responds to her infant's conversational offering by ceasing her unresponsive repetition and vocalizing once more at the infant's pitch. But the dialog almost immediately breaks down, and the mother returns to her stereotypical, repetitive vocal gesturing.

As well as showing inflexible, 'non-graceful' behavior of a mother who is suffering from BPD, this example also shows the resilience and hope of the infant in the face of her mother's rigid communicative style. This will of our communicative musicality, evident from the earliest of interactions with infants (Papoušek, 1996), is utilized in therapies that employ in-the-moment intersubjective interactions as means for healing (Stern et al., 1998; Stern, 2000, p. xiv). This immediately responsive interaction may be through talking, through music-making, through dance, or touch. The common element is an individual who is highly skilled in attuning to nuances of interpersonal timing and gesture, and who aims to lead back to health another whose personal mindedness of time-in-the-body has become compromised through hardship, suffering, or biological disruption, perhaps leading to a sense of isolation and misunderstanding (Trevarthen and Malloch, 2000). The therapist joins with the person who needs help, leading them back to health and wellbeing through their own therapeutic sense of the 'minute particulars' in that moment of meeting (Hobson, 1985, part 2).

Olga Maratos, a developmental psychologist who pioneered recognition of the ability of young infants to imitate expressions of an attentive and supportive adult (Maratos, 1982) took part in the establishment in Athens of a residential school for children with autism, called Perivolaki, meaning 'a little garden.' It is a day-care center with facilities in beautiful surroundings that invite playful out-door activities, games with toys and creative shared occupations such as listening to and making music. They use the psychoanalytic concept of 'transference' of feelings or 'unconscious desires' to encourage sensitive intimate and consistent relations, each child having a trusting relationship with a particular member of the staff. Stability of activities is maintained, with close engagement with the parents, who are seen weekly during the first two years of the child's stay at Perivolaki and every fortnight thereafter. The average length of stay is 4–5 years (Maratos, 1998).

Olga explains that staff are trained to observe the children, think about them and discuss their behavior at staff meetings "with a view to understanding whether their autistic behaviour is defensive, refusing interaction and relations because they don't make sense for them or because they are painful, or whether there is a pervasive lack of motivation for relating and communicating. We find both conditions present, at different times, in all our children" (Maratos, 1998, p. 206). This approach, adapting a psychoanalytic treatment which avoids diagnosis of the cause of the disorder, leads to a form of active and intimate 'relational therapy,' which does not rely on verbal formulation of anxieties and lack of trust.

Affect attunement has been defined as qualities of vocal and body gesture that carry meaning in parent–infant communication – it is, "the performance of behaviours that express the quality of feeling of a shared affect state, but without imitating the exact behavioural expression of the inner state" (Stern, 1985, p. 142). This largely unconscious 'recasting' of events is necessary to "shift the focus of attention to what is behind the behaviour, to the quality of feeling that is being shared" (Stern, 1985, p. 142). We say the relationship is now one of 'companionship,' a word from Latin meaning 'to break bread with' and defined here as the wish to be with an other for a mutually beneficial 'inner' purpose, apart from reasons of immediate survival, procreation or material gain. Companionship involves exchanging affect through sharing the quality or virtue of impulses of motivation, which is the original rich meaning of 'sympatheia' in Greek (Trevarthen, 2001).

The therapeutic relationship, even in talking therapies focussed on what the language says about feelings, is underpinned by the manner or graciousness of our gestural exchanges – whether those qualities are carried in vocal prosody or bodily movements (Stern, 2000, p. xiv; Malloch, 2017). It is about the direct and desired sharing of feelings in human vitality. Stern (2010) writes eloquently on the importance of listening for the forms of vitality (the forms of feeling as Hobson, 1985, called them) that are being expressed through prosody – "without the dynamic vitality features of the intention-unfolding process we would not experience a vital human being behind the words that are being said" (Stern, 2010, p.124). He also writes of the use of metaphors as carriers of images of our in-the-moment vitality (also see Hobson, 1985, ch.4). For example, in the work of Stephen, the first author, as a therapist, here is the text of an exchange between him and a client discussing the client's sense that his life has never measured up to his notion of what a successful life should look like:

Client: My father thought I was stupid. He'd call me 'stupid boy.'

Therapist: The way you said that – perhaps a sort of exhaustion and emptiness – reminds me of the tide going out in a bay.

Client: Empty. . . like something's slowly leaving.

Here the focus on the vitality that the client associates with his father's dismissing statement leads to a new experience to explore therapeutically – something is slowly leaving. Further discussion focussed on the experience of what 'slowly leaving' feels like. When the therapist works like this, "we move from an enquiry about intentions, means, and goal states to an enquiry about processes of creation, emerging, and becoming" (Stern, 2010, p.126). We move from distanced recollection or speculation to life in the present moment (Stern, 2004).

The prosody of the client's voice sometimes sums up the therapeutic change itself. In the example below of Stephen's work a client is talking of an emerging 'new me' in contrast to an 'old me.'<sup>2</sup> The 'old me' was marked with 'a lack of self-respect.' 'I blame myself when things go wrong, I believe I'm not working hard enough.' The voice is drone-like, body hardly moving. **Figure 5B** shows a four-second section of a pitch plot of 'old me' voice.

After describing 'old me' the client's body relaxed, they looked up from the floor, hands lifted from their lap, the volume of their voice increased, its pitch lifted, and they began talking of 'new me.' 'New me is more rational about life. This part says, "Well, I was uncommunicative this morning – that's all right, that's OK. That's just the way I was. Doesn't make me a bad person. Other times I communicate really well!"' **Figure 5C** is a four-second pitch plot of 'new me.' The shift in the vitality of the musicality is clear. Stephen felt the distinct difference in the vitality of the two "me's" of the client, and continued the session exploring how the new "me" might express itself in the world (see Malloch, 2017, for further discussion of communicative musicality in therapy).

The role of our communicative musicality in supporting our wellbeing lies at the very heart of the practice of using music therapeutically. Music therapy covers a wide range of ways of using musical experiences – stories in sound – to heal and improve people's lives. The Australian Music Therapy Association defines Music Therapy as: "a research-based practice and profession in which music is used to actively support people as they strive to improve their health, functioning and wellbeing." It is the compassionate use of music to engage another emotionally, interpersonally, cognitively, and culturally. "Music is therapeutic because it attunes to the essential efforts that the mind makes to regulate the body, both in its inner neurochemical, hormonal and metabolic processes, and in its purposeful engagements with the objects of the world, and with other people" (Trevarthen and Malloch, 2000). This is particularly so during improvisational music therapy, where the therapist supports the client towards change – greater integration of experience and freedom in communication (Pavlicevic, 2000, ch.6; also see Malloch et al., 2012, on the effectiveness of improvisational music therapy with neonates).

However, the practice of music therapy is more than the therapeutic use of preverbal protomusic, however important this is. Reflecting our discussion above on music and education, music therapy also makes use of the cultural forms of our musicality, and the power these cultural forms have within our psyche (for example, Donald, 1991; MacDonald et al., 2002; Stern, 2010).

A relationship between our communicative musicality and our culturally made music for the practice of music therapy is proposed by Pavlicevic and Ansdell (2009). They emphasize the peculiarly musical relationship established within music therapy practice – that is, that the cultural elaboration of communicative musicality relates to our communal, social lives. Music therapy engages our shared communicative musicality, and welcomes us into the shared cultural, communal experience of musicking, using the tools of a particular cultural type of music – one of many musics in the world (Stige, 2002).

<sup>2</sup>Written consent was obtained from the client by the first author.

Thus, part of the reason music therapy 'works' is its invitation for cultural collaboration – we exercise what has been called our 'deep social mind' (Whiten, 2000) following particular cultural forms. This wish to learn the forms of culture, our 'conformal motive' (Merker, 2009), comes to life within an environment where we sense our communicative gestures are being valued by another or others through the creation of shared narratives of vitality forms (Stern, 2010). We feel ourselves to be both a companion in our shared narrative of communicative embodied gestures and a companion in a particular shared cultural collaboration. It is within this dual companionship that our deep yearning to belong is met and satisfied, and where healing can occur.

## MUSICAL AFFECTIONS OF THE EMBODIED HUMAN BRAIN

In our opinion brain science has been most insightful into the nature of the self and what makes us human, and how we share the joy and pains of life, when it investigates 'Primary Process' emotional guidance of brain growth for regulation of vitality in body movement and its 'seeking' awareness (Hess, 1954; Solms and Panksepp, 2012). It offers us insight when it investigates how experiences develop by generating expectations of well-being in companionship and by enriching it with cultural meaning (Trevarthen, 1990b). It shows that differences in the rate of development of cognitive processes in left and right hemisphere at different ages are caused by different affections (Thatcher et al., 1987; Chiron et al., 1997), from which arise correlations with musical behaviors and other creative forms of play, as well as Piagetian stages of rational mastery of the body and objects it uses, and development of language.

The neuroscientist Jaak Panksepp, who studied the emotions of mammals who do not cultivate music, but who use patterns of movement including signs with sound to regulate their social lives in ways that anticipate our richer experience of the sounds of our movement using the tools of culture, offers insight into why we are musical beings (Panksepp and Bernatzky, 2002). And in a recent synopsis he agrees that mastery of language in early years depends on the sense of purpose we share in musical ways:

"Human languages are coaxed into the brain, initially by the melodic intonations of motherese by which emotional communication becomes the vehicle for propositional thought." (Panksepp et al., 2012, p. 11).

Like the brain of any animal, the human brain grows to represent and regulate a body form in movement (Trevarthen, 1980, 2001, 2004). And from even before birth, the self-formation of a personal self in the brain of the fetus is led by manifestations of movement.

"The first generalized movements occur in week 8 (de Vries et al., 1984), but already in week 5 monoamine transmission pathways grow from the brainstem to animate the primordial cerebral hemispheres. Key components of the Emotional Motor System (hypothalamus, basal ganglia and amygdala) are in place when the neocortex is unformed." (Trevarthen, 2001, p. 26).

After the baby is born and seeks intimate communication of all motives with a parent, the affective system remains as the director of learning and appreciation of what is gained by new awareness.

Psychiatrist and literary scholar Iain McGilchrist in The Master and His Emissary (McGilchrist, 2009) has presented a brilliant review of behavioral and brain research, and a clear conception of complementary consciousness in the two cerebral hemispheres. "Music", he writes, "being grounded in the body, communicative of emotion, implicit, is a natural expression of the nature of the right hemisphere" (p. 72).

McGilchrist's research leads him to the political view that we are living in a society that grants too much power to the special refined perceptuo-motor and scientific skills of the left brain, while failing to appreciate how the right brain gives purpose and value to all that we do, thus pointing to the importance of closer cooperation between science and the fundamental values of the humanities. He draws on anthropological information about the universal principles of social understanding at very different levels of technical ability and manufacture, and on the importance in all social groups and cultures of musical performances. He concludes, from a wide range of evidence, that music evolved before language and contributed to the formulation of its syntax and prosody (see Sachs, 1943).

Music, he says, was not,

"an irrelevant spin-off from something with more of a competitive cutting edge – namely, language.... rather the reverse. If language evolved later, it looks like it evolved from music.... Rousseau in the eighteenth century, von Humboldt in the nineteenth century and Jespersen in the twentieth, have thought it likely that language developed from music.... That we could use non-verbal means, such as music, to communicate is, in any case, hardly surprising. The shock comes partly from the way we in the West view music: we have lost the sense of the central position that music once occupied in communal life, and still does in most parts of the world today.... We might think of music as an individualistic, even solitary experience, but that is rare in the history of the world." (p. 104).

And he quotes neurologist Oliver Sacks, who said:

"This primal role of music is to some extent lost today, when we have a special class of composers and performers, and the rest of us are often reduced to passive listening. One has to go to a concert, or a church or a musical festival, to recapture the collective excitement and bonding of music. In such a situation, there seems to be an actual binding of nervous systems, the unification of an audience by a veritable 'neurogamy' (to use a word favoured by early Mesmerists)" (Sacks, 2006, p. 2528).

Twenty-six years before McGilchrist published his book, the anthropologist Victor Turner, famous for his book From Ritual to Theatre, drew on knowledge of the different functions of the hemispheres to identify play with a collaboration between them, in an article entitled "Play and drama: The horns of a dilemma":

"Current ideas about differences between the left and right hemispheres of the brain provide a basis for speculating about the nature of play. Play encompasses both the rationality and order of the left hemispheric orientation, and the improvisation and creativity of the right. But play also transcends these oppositions, running rings about them as it encircles the brain's consciousness" (Turner, 1983, p. 217).

Good, beautiful and enjoyable, music is created out of poetic play. Indeed, we play music (Trevarthen and Malloch, 2017b).

A sense of time in the mind is the fabric from which movements of all kinds are woven into ambitious projects that value elegance with efficiency. It is a manifestation of the 'biochronology' that is essential to the vitality of all forms of life (Osborne, 2017). In Rhythms of the Brain, György Buzsáki (2006) presents a wealth of evidence that the brain functions as a coherent rhythmic system, always in synch, and with a rich array of rhythms that are organized to collaborate.

"At the physiological level, oscillators do a great service for the brain: they coordinate or 'synchronize' various operations within and across neuronal networks. Syn (meaning same) and chronos (meaning time) together make sure that everyone is up to the job and no one is left behind, the way the conductor creates temporal order among the large number of instruments in an orchestra" (Buzsáki, 2006, p. viii).

From Panksepp and Trevarthen (2009), p. 114:

"Music is performed with the measure of expressive movements in time, and with tensions created by combining rhythms (Osborne, 2009a). The 'architecture' and 'narration' of moving in psychological time is also displayed with emotional qualities related to vital functions of the body (Trevarthen, 2009). These psychobiological elements of vitality are charted in three bands or ranges of physical or scientific time: (1) for the felt and imagined 'extended present' (from 10 seconds to years); (2) for the conscious 'psychological present' (Stern, 2004), with its serially ordered steps of motor control coupled to the physiological rhythms of breathing and variations in heart rate (0.3 to 7 seconds); and (3) for 'reflex experiences' and 'just noticeable differences' too fast to be regulated by movements that are prospectively controlled in awareness (5 to 200 milliseconds). (For detail and the sources of this description see Trevarthen, 1999).

The time of musical narrative, which Imberty (2000) calls the macrostructure or 'story-without-words' of music, is related to the times of expressive behavior that form 'protonarrative envelopes' of intuitive vocal and gestural play between infants and their mothers (Stern, 1985, 1995; Malloch, 1999). The period corresponding to a stanza or verse of 20 to 40 seconds may be manifested in the brain, as gamma waves or parasympathetic cycles, which control autonomic functions of the heart and breathing. It continues to be active through sleep to produce fluctuating rates of breathing and heartbeat, as well as electrical activity of the cerebral cortex that might be related to the rehearsal and consolidation of memories in dreaming (Delamont et al., 1999). In wakefulness the narrative cycle is charged and modulated for intersubjective meaning with the 'microtonal' and 'microtemporal' variations of emotion that express urgency and facility in skilled control of moving within the voice of a singer or the playing fingers of an instrumental performer, and in the hearing of a listener (Imberty, 1981, 2000; Gabrielsson and Juslin, 1996; Juslin, 1997, 2001; Kühl, 2007; Osborne, 2009a). Music can assist the synchronization of physiological functions of respiration and heart activity and bring improvement in locomotor activity, and it can improve cognitive and memory processes by brain synchronization."

Rhythmic co-ordination by the Intrinsic Motive Pulse (IMP) of the brain holds body movements together in composition of intentions and experiences (Trevarthen, 1999, 2016). It is the medium for all shared experiences and purposes, and for the convivial vitality of music making.

## CODA: THE PHILOSOPHY OF HUMAN VITALITY

In her review of the role of movement and sense of time in the creation of intelligence, Barbara Goodrich, as a philosopher, traces a history of ideas supporting the view that consciousness is founded on emotions for agency, which we argue are the sine qua non for music. In opposition to the "implicit philosophical presuppositions inherited from the canon of Plato, Aristotle, Descartes, and Kant, e.g., that consciousness is self-reflective, passive, and timeless," she proposes a natural science view.

"Western philosophy, however, also includes what might be described as a counter-tradition—and one that is more compatible with empirical biological science than the usual canon. Heraclitus, Spinoza, Schopenhauer, Nietzsche, and especially the 20th century French philosopher and psychologist, Merleau-Ponty, all anticipated aspects of Llinás's and Buzsáki's approaches... sketching out a notion of consciousness emerging from motility, and generating new hypotheses for neurophysiological research." (Goodrich, 2010, p. 331).

We have argued that music comes from this very foundation of consciousness in motivated motility, and we underline the importance of a philosophy that acknowledges the motives and feelings of our life, as well as the intelligence we show in relating to persons, other life forms, and objects in our environment, by recalling the achievements of the philosophers of the Scottish Enlightenment – Hutcheson, Hume, Smith and Reid. In line with Goodrich's "counter tradition," their work anticipates the new understanding of the human BrainMind pioneered by Panksepp and Damasio, which gives primary importance to feelings of vitality in movement, and to emotions that express positive and negative affections in sympathetic communication. This is the science of communicative musicality which underpins the music we create and enjoy.

The Scottish philosophers of the 18th Century, led by Francis Hutcheson, held that relationships and social life depend upon a universal human capacity for "innate sympathy," which generates a conscience, a sense of beauty, a public 'common sense' that values happiness and is disturbed by misery, and a moral sense that perceives virtue or vice in ourselves or others (Hutcheson, 1729, 1755; Hume, 1739–1740).

Smith (1759) in his Theory of Moral Sentiments took 'sympathy' to designate any kind of 'moving and feeling with,' whether motivated positively or negatively, and including posturing and acting in the same expressive way as another's body (cf. the work of Stern on 'affect attunement' quoted earlier), and he also imagined experiences of relating and being sensed, as, for example, interrogating one's conscience. He said:

"How selfish soever man may be supposed, there are evidently some principles in his nature, which interest him in the fortune of others, and render their happiness necessary to him, though he derives nothing from it except the pleasure of seeing it."

"Sympathy... may..., without much impropriety, be made use of to denote our fellow-feeling with any passion whatever." Part I – Of the Propriety of Action; Section I – Of the Sense of Propriety, Chapter I – Of Sympathy.

He examined his conscience to understand being a person in relations.

"When I endeavour to examine my own conduct, when I endeavour to pass sentence upon it, and either to approve or condemn it, it is evident that, in all such cases, I divide myself, as it were, into two persons; and that I, the examiner and judge, represent a different character from that other I, the person whose conduct is examined into and judged of. The first is the spectator, whose sentiments with regard to my own conduct I endeavour to enter into, by placing myself in his situation, and by considering how it could appear to me, when seen from that particular point of view. The second is the agent, the person whom I properly call myself, and of whose conduct, under the character of a spectator, I was endeavouring to form some opinion."

This picture of a duplex mind regulated by motives of sympathy anticipates the distinction made by William James in 1892 and by Martin Buber in 1923 between a fundamental "I-Thou" state of awareness and the objective "I-It" relations with the physical world we acquire in communication.

Otteson, joint professor of philosophy and economics, and chair of the Philosophy Department, at Yeshiva University, and adjunct Professor of Economics at New York University, has proposed that Adam Smith's Theory of Moral Sentiments (1759) has a more profound message for commerce and industry than The Wealth of Nations.

"Smith's picture thus has a clear anti-Freudian thrust: it denies the hydraulic picture of human emotions according to which emotions build up "pressure" that must be "released." Instead, and more plausibly, it conceives of emotions as things that can be controlled and trained by exercising what Smith calls "self-command." The activity of reciprocal adjustment is then repeated numberless times in every person's lifetime, as it is between and among the people in one's community, resulting in the creation of an unintended and largely unconscious system of standards. These standards then become the rules by which we determine in any given case what kind of behavior is, as Smith calls it, "proper" in a situation and what "improper"—meaning what others can reasonably be expected to enter into." Otteson (2000, November 01).

Smith was a great admirer of the messages of music and wrote about the communication of its poetic messages in his essay Of the Nature of that Imitation which takes place in what are called the Imitative Arts, published in 1777 (Smith, 1777/1982).

Beginning with his A Treatise of Human Nature (1739), David Hume strove to create a natural science of human psychology in opposition to René Descartes' rationalism. He concluded that desire rather than reason motivates our behavior. Anticipating Merleau-Ponty's phenomenology he also argued against the existence of innate ideas, concluding that we know only what we directly experience. He held that inductive reasoning and causality cannot be justified rationally; instead, we follow custom and constant relations between ideas, not logic. He concluded that we do not have a 'conception of the self,' only sensations of being alive. Following his teacher Hutcheson (1729), he believed that ethics are based on feelings rather than abstract moral principles.

Finally, there is a bold clarity in the work of Thomas Reid, the third great follower of the teachings of Hutcheson, and a vigorous debating companion to David Hume. He wrote An Inquiry into the Human Mind on the Principles of Common Sense (Reid, 1764).

Reid founded the Scottish School of Common Sense. For him 'common sense' is based on a direct experience of external reality, experience that becomes internal in language, which is based on an innate capacity pre-dating human consciousness, and acting as an instrument for that consciousness. He distinguished the acoustic element from the meanings which seem to have nothing to do with the sounds as such, a state of language, which he calls 'artificial,' that cannot be the primeval one, which he terms 'natural.' He described the way a child learns language by imitating sounds, becoming aware of them long before he or she understands the meaning in the artificial state of contemporary adult speech. If, says Reid, children were to understand immediately the conceptual content of the words they hear, they would never learn to speak at all. Here Reid distinguishes between natural and artificial signs.

'It is by natural signs chiefly that we give force and energy to language; and the less language has of them, it is the less expressive and persuasive.... Artificial signs signify, but they do not express; they speak to the intellect, as algebraic characters may do, but the passions and the affections and the will hear them not: these continue dormant and inactive, till we speak to them in the language of nature, to which they are all attention and obedience.' (Reid, 1764, p. 52).

'Language of nature' we equate with our embodied moving consciousness – our communicative musicality. An excess of 'artificial signs,' perhaps aimed at increasing productivity, leads to loneliness and ruthless rationality. However, the cultivation of our communicative musicality, in ourselves and others, through playful music, dance, ritual and sympathetic companionship, makes our communal life of shared work of the body and mind creative in more hopeful ways. It restores our common humanity and our connection with all living things.

## ETHICS STATEMENT

Informed consent was gained for all data presented in this paper.

## AUTHOR CONTRIBUTIONS


SM contributed to all sections of the paper, particularly sections Communicative Musicality and Education into the Culture of Music and Communicative Musicality and Resilience of the Human Spirit. CT contributed to all sections of the paper, particularly sections The Genesis of Music in Infancy – A Short History of Discoveries, Case Studies of Infant Musicality, and Musical Affections of the Embodied Human Brain.

#### REFERENCES



Hobson, R. F. (1985). Forms of Feeling. London: Tavistock.


Educationally Productive Collaborative Work, eds K. Littleton and D. Miell (New York, NY: Nova Science), 133–146.



Trehub, S. E. (2003). The developmental origins of musicality. Nat. Neurosci. 6, 669–673. doi: 10.1038/nn1084


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2018 Malloch and Trevarthen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Overlooked Tradition of "Personal Music" and Its Place in the Evolution of Music

Aleksey Nikolsky<sup>1</sup>\*, Eduard Alekseyev<sup>2,3</sup>, Ivan Alekseev<sup>4,5</sup> and Varvara Dyakonova<sup>6</sup>

*<sup>1</sup> Braavo Enterprises, Los Angeles, CA, United States, <sup>2</sup> Independent Researcher, Boston, MA, United States, <sup>3</sup> The State Institute for Art Studies of the Ministry of Culture of the Russian Federation, Moscow, Russia, <sup>4</sup> Experimental Laboratory of the North-Eastern Federal University, Yakutsk, Russia, <sup>5</sup> International Jaw Harp Music Center, Yakutsk, Russia, <sup>6</sup> Department of Art Studies, Arctic State Institute of Arts and Culture, Yakutsk, Russia*

#### Edited by:

*Margarita L. Mazo, The Ohio State University, United States*

#### Reviewed by:

*Jaan Ross, Estonian Academy of Music and Theatre, Estonia; Theodore Levin, Dartmouth College, United States*

> \*Correspondence: *Aleksey Nikolsky aleksey@braavo.org*

#### Specialty section:

*This article was submitted to Cognition, a section of the journal Frontiers in Psychology*

Received: *28 June 2019* Accepted: *24 December 2019* Published: *18 February 2020*

#### Citation:

*Nikolsky A, Alekseyev E, Alekseev I and Dyakonova V (2020) The Overlooked Tradition of "Personal Music" and Its Place in the Evolution of Music. Front. Psychol. 10:3051. doi: 10.3389/fpsyg.2019.03051*

This is an attempt to describe and explain so-called timbre-based music as a special system of musicking, communication, and psychological and social usage, which along with its corresponding beliefs constitutes a viable alternative to "frequency-based" music. Unfortunately, the current scientific research into music has been skewed almost entirely in favor of the frequency-based music prevalent in the West. Subsequently, whenever samples of timbre-based music attract the attention of Western researchers, these are usually interpreted as "defective" implementations of frequency-based music. The presence of discrete pitch is often regarded as the structural criterion that distinguishes music from non-music. We would like to present evidence to the contrary—in support of the existence of indigenous music systems based on the discretization and patterning of aspects of timbre, rather than pitch. This evidence comes mainly from extensive ethnographic research systematically conducted in eastern European and Asian parts of Russia from the 1890s. It involved the efforts of thousands of specialists and was coordinated by dozens of research institutions, and it has included not just ethnomusicology but linguistics, philology, organology, archaeology, anthropology, geography, and religious, and social studies. Much of the data has not been translated into Western languages. Although some Soviet-era publications were tainted by Marxist ideology, many researchers strove to provide accurate information (despite at times having been prosecuted for their work), and post-1990 research undertook a substantial revision of ideologically compromised concepts. Timbre-based tonal organization (TO) differs from that based on frequency in its personal orientation: musicking here occurs primarily for oneself and/or for close relatives/friends. Collective music-making is rare and exceptional. 
The foundation of timbre-based music seems to have vocal roots and rests on "personal song"—a system of personal identification through individualized patterns of rhythm, timbre, and pitch contour, utilized like a "human voice"—whose sound enables the recognition of a particular individual. The instrumental counterpart of the personalized singing tradition is the jaw harp tradition. The jaw harp is the principal musical instrument for at least 21 ethnicities in Russia, who occupy over half the territory of the country. The evolution of its TO forms the backbone for the development of timbre-based music art. Here, we provide the acoustic, socio-cultural, geographic, and chronological overview of timbre-based music.

Keywords: timbre-based music, personal song, Jaw harp (aka Jew's harp), musicality, arctic hysteria, music evolution, thematicism, musical texture

In the past few decades, a lively discussion on matters concerning the origin and evolution of music has finally begun to move toward a consensus among specialists (Cross and Morley, 2009): the biological importance of music is being seen in its capacity to foster and sustain social interactions within a group, to the mutual benefit of its members. Here, music stands as an important counterpart to language—another biological marker of Homo sapiens—specializing in managing the emotional aspect of human interaction. Without diminishing the importance of this perspective, we wish to cast light as well on the personal function of music—its capacity to organize and sustain the psychological identity of an individual. This function must have factored in the evolution of music since at least the Upper Paleolithic.

The need for this update grows as more scholars lean toward regarding music as an exclusively collective phenomenon. Thus, Lewis (2013) concludes: "In most parts of the world, and for most of human history, music exists only because of the social relations that enable its performance." Levinson (2013) infers that in the evolution of music its structural complexity "may have its origins in joint action rather than in abstract representations or solitary mentation"<sup>1</sup>. We will argue the contrary: that at least in the traditional lifestyle of numerous native Siberian ethnoses, music serves primarily as a means of solitary mentation and abstract representation of reality. There is no reason to believe that this manner of "musicking" (Small, 1998) is a recent development, and there are valid reasons to model prehistoric North Eurasian music upon this type of music.

## DICHOTOMY BETWEEN TIMBRE-BASED AND FREQUENCY-BASED MUSIC

Almost everything that we know about the perception of music and what has served as a scientifically established foundation for modern views on the origin of music comes from a musical tradition based on frequency discrimination of musical sounds. It is this tradition that currently prevails in the world. Its prevalence probably started with the rise of Bronze Age urban civilizations, whose palace and temple music traditions relied on math-based theory (Nikolsky, 2016). Rationally defined pitches have made the corresponding music practices rely on the frequency aspect. Civilizations that cultivated frequency-based music imposed their influence on the music cultures of neighboring peoples. On a global scale, this must have resulted in a steady decline of the alternative form of TO that was based on timbre. This process is evident in the music cultures of many native Siberian peoples, e.g., the Nganasan, whose timbre-based tradition has been recently overtaken by the Russian frequency-based tradition (Bicheool, 2009). Schneider (2001) qualifies such development as "pitch reductionism"—the replacement of timbre-based tuning standards by frequency-based standards, most obvious in the indigenous gong/bell and xylophone music. Timbral vocal music is no less vulnerable to pitch reductionism.

#### **Examples-1/2**<sup>2</sup>

Ex.1. Timbre-based vocal music: throat-singing song, "Seagull," in archaic style, by Anna Ankhani, a 70-year-old Koryak woman from a remote reindeer-breeding settlement, Khailino, in Northern Kamchatka (reachable only by aerial transportation). Singing involves throat-rasping and double phonation that allows the singer to produce sounds an octave below her speaking voice range (http://chirb.it/n5JNvk).

Ex.2. Modern style Koryak "Festive song" by Maria Appolon, a 48-year-old woman from a little town, Ossora, a seaport on the Bering Sea. Maria is a legislator in the local district Council and a member of the folk ensemble, "Agya," that often performs at international festivals, special events, and for tourists. Her singing exhibits traits of "frequency-based music": no timbral effects; clear discrete pitches, interconnected without the gliding that is so typical of traditional singing in Kamchatka; a hexatonic mode consisting of 2 motifs (a descending trichord and an ascending tetrachord) that retain amazing precision in intonation, without noticeable fluctuations; and a strict formulaic structure (http://chirb.it/FK7A8A).

In the literature in English, the distinction between "pitch-centered" and "timbre-centered" musics was drawn by Levin and Suzukei (2006, p. 45–72). Fales (2002) discussed the downsides of "pitch-centrism" and "timbre-deafness" in approaching timbre-based music. In the Russian literature, the dialectics of timbre and pitch was acknowledged much earlier. The pioneer of ear-training and music psychology in Russia, Maykapar (1900), noticed that the musical ear developed from a timbre- to a frequency-reference frame, and proposed to begin musical education with timbral exercises. Since then, ear-training has become an obligatory part of music education in Russia, and has attracted many researchers (Blium, 1977). Adoption of a state-controlled, unified system of early music education by the Bolsheviks created an exceptionally diversified cross-cultural pool of subjects for the investigation of music skills acquisition (Arakelova, 2012). Based on such data, Teplov (1947)<sup>3</sup> formulated his theory of interference between "timbral" (speech-like) and "pitch" (music-like) hearing, the latter typically replacing the former by the beginning of a child's primary education, provided there was cultural exposure to frequency-based music<sup>4</sup>.

**Abbreviations:** PS, personal song; JH, jaw harp; AH, arctic hysteria; TO, tonal organization; FF, fundamental frequency; f1, first harmonic; F1, first formant; dBFS, decibels relative to full scale; BP, before present.

<sup>1</sup>The emphasis in both quotations is added to the original.

<sup>2</sup>If the internet links provided for the audio examples are broken, the archive with all of the audio examples (named "Audio for the article") is available in the **Supplementary Material**.

Teplov's theory was substantiated by Leontyev (2009, p. 115–36), Vygotsky's pupil, in a series of experiments that disclosed a "timbre-centered" estimation of pitch in adults who lacked music schooling<sup>5</sup>. To demonstrate that pitch-ear is a "functional organ"<sup>6</sup> (absent at the moment of birth and forming as a consequence of cognitive development), Leontyev built a vibrator machine and successfully taught subjects to discriminate pitch exclusively by touch. Another Vygotskian, Zaporozhets, headed the research on the emergence of frequency-oriented hearing in preschool children (Endovitskaya, 1959, 1964; Lisina, 1966; Mukhina and Lisina, 1966; Repina, 1966a,b; Zaporozhets, 2003)<sup>7</sup>. His group concluded that the genesis of frequency discrimination reflected a fundamental re-organization of motoric, visual, and spatial orientations throughout childhood, embodying a form of abstraction "of a number of perceptual activities, directed at the exploration of objects and phenomena of reality, identification and fixation of their perceptual attributes and interrelations" (Zaporozhets, 1985, p. 1:18).

The frequency-based "musical ear" corresponds to the urban Western environment, where systemically organized straight parallel lines and right angles are widespread and dominant. But these do not exist in the steppe or tundra (Nikolsky, 2016, Appendix-7 in **Supplementary Material**). Life in the tundra promotes music systems based on indefinite (ekmelic or khasmatonal) pitch (Alekseyev, 1986) and fine timbral distinctions, corresponding to life in an open terrain that lacks landmarks. There, orientation occurs via qualitative (nonquantitative!) evaluations of wind, light, snow, distance, etc., rather than the incremental and reversible pathfinding in the urban landscape. Most likely, timbre-centrism originates from life in such natural sound-stages where "definite" rhythm and frequency are mostly bound to human activities, and are overall less important than timbre-differentiated environmental sounds (Fales, 2002). The opposition of indefinite/definite pitch systems in Russian musicology finds its match in the opposition of "smooth" and "striated" pitch in Western ethnomusicology (During and During, 2015).

Timbre-centeredness distinguishes the vocalizations of newborns: crying, screaming, rasping, grunting, whining, sobbing, whimpering, etc. (Loewy, 1995). Even prenatally, fetuses recognize parental voices (Lee and Kisilevsky, 2014), which involves at least some timbral discrimination. Newborns distinguish different musical instruments by timbre (McAdams and Bertoncini, 1997). Timbre discrimination is so crucial for a newborn's life that it appears innate (Simons, 1986, p. 43). Perhaps the ontogenetic progression from timbre to frequency (Teplov, 1947) takes the same course as cultural evolution, and "indefinite-pitch" music systems precede "definite-pitch" systems (Alekseyev, 1986, p. 14–15), following the general paradigm suggested by Foster (1994). Unfortunately, published research on the acquisition of musical skills overwhelmingly focuses on the experience of children in Western societies<sup>8</sup>. Garfias (1990) seems to share Alekseyev's conviction that in indigenous music cultures infants start their musical development from speech-like timbral hearing. Tuvan children learn early to vocally imitate typical environmental sounds with amazing precision, adopting learned timbral distinctions for the creation of their own music (Levin and Suzukei, 2006, p. 85–7), much as Western children model their vocal improvisations upon commonly heard tunes (Bjørkvold, 1992). Although Tuvans also use frequency-based music, it remains secondary in importance for them.

In Western classical music, the opposite pertains: it is timbre that is secondary (Scruton, 1997, p. 77–8). Even its standard definition, the "attribute of sensation in terms of which a listener can judge that two sounds having the same loudness and pitch are dissimilar" (ASA, 1951), is culturally skewed toward prioritizing frequency. Western "frequency-centrism" is most evident in the tendency to ascribe pitch values to clearly non-musical sounds (e.g., car brakes), including sounds of non-Western timbral music, such as Inuit (Walker, 1997, p. 323). Unsurprisingly, Westerners' attempts to reproduce indigenous timbral music introduce distortions and are rejected by native users (Ojamaa, 2005).

<sup>3</sup>Recently, considerable material has emerged that fits Teplov's theory well. Timbre and pitch perception seem to engage the same brain resources, and therefore require selective attention (Melara and Marks, 1990a). Analytic extraction of pitch affects extraction of timbre properties, and vice versa (Melara and Marks, 1990b). Timbre affects judgements of simple pitch discrimination (Singh and Hirsh, 1992). Melodic motion can be compromised by timbral shift so that non-musicians could perceive a melodic tritone as larger in size than a perfect 5th (Russo and Thompson, 2005). Timbre can interfere with perception of melodic direction (Allen and Oxenham, 2014). Mistakes in melodic direction due to changes in timbre and loudness are especially pronounced in children as opposed to adults (Williams, 1990).

<sup>4</sup>The detrimental influence of usage of Western frequency-based music on timbral perception has been experimentally demonstrated in a number of studies. A tonal reference can reduce timbre's interference with pitch extraction (Warrier and Zatorre, 2002). This interference is much greater for Western non-musicians (less experienced in pitch analysis) than for musically trained listeners (Pitt, 1994). On the other hand, simplification and uniformity of spectral content seems to enhance frequency-interval discrimination in Western music (Zarate et al., 2013).

<sup>5</sup>Recently, this claim was substantiated by the evidence for two alternative pitch encoding mechanisms based on frequency resolution or non-resolution of harmonics of the fundamental tone (Grimault et al., 2002). This difference seems to lie at the heart of the distinction between "timbre-centered" and "frequency-centered" listening modes.

<sup>6</sup>Leontyev borrowed the concept of "functional organ" from Aleksey Ukhtomsky (1978) who held that "organ" did not necessarily have to constitute a morphologically distinct anatomical part, but may have constituted an organization of a specific function with dynamic rather than static attributes.

<sup>7</sup>Some support for Leontyev's and Zaporozhets' conclusion of post-natal formation of pitch discrimination skills comes from more recent Western research. Thus, McAdams and Bertoncini (1997) found that at birth infants can discriminate between rising and falling melodic contours only when they are presented with timbrally and spatially contiguous sounds. Schwarzer (1997) found that 5–7 year old children, unlike adult listeners, generally favor tracking loudness and timbre in analytic listening tasks over tracking melodic contour.

<sup>8</sup>At present, cross-cultural investigation of general patterns of acquisition of music skills is still in its infancy. Only a handful of publications in English discuss the issue of musical development in non-Western societies (Blacking, 1967; Garfias, 1990; Fernald, 1992; Papoušek, 1996; Minks, 2002; Stige, 2002; Nettl, 2005).

The reverse bias affects the use of frequency-centered music by timbre-centered musicians. Soviet researchers of the collectivization/likbez era discovered that children (Beliayeva-Ekzempliarskaya, 1925) and adolescents (Antoshina, 1939) who lacked exposure to classical music could not detect a harmonic mismatch when a well-known melody was performed against accompaniment in a wrong key. Transference of popular tunes across different music systems usually involves systemic pitch conversion. Thus, the Russian-Ukrainian diatonic heptatonic song "Provody" becomes anhemitonic pentatonic in Buryat and whole-tone tetratonic in Yakut reproductions, severely distorting the original intervallic structure (Alekseyev, 1986, p. 148–55). Both Buryats and Yakuts cultivate timbral music. It might be appropriate here to speak of the intuitive substitution of the "pitch classes" of the original foreign song (which was composed circa 1918 in Ukraine within the framework of Western tonality) by the "timbral classes" of traditional Buryat and Yakut music that sounded the closest to the original. The "strange" and complex-sounding tonality of the foreign song, which for some reason attracted the attention of the indigenous musicians, was replaced by the TO that was habitually "normal" and easy for them.

Timbre-oriented musical cultures seem to base their musical modes on a set of pitch levels that are joined together according to some common trait(s) of timbral coloration and/or sound production technique(s). Thereby a number of such related "timbre-classes" become united into a "timbral mode." This is in contrast to the specific intervallic distance in pitch that distinguishes one "degree" of a musical mode and/or key from another "degree" in frequency-oriented music cultures. The inevitable conversion from a pitch-based to a timbre-based frame of reference must be responsible for the intervallic distortions discovered by Alekseyev in a Russian-Ukrainian tonal song upon its appropriation by the indigenous Buryat and Yakut musicians.

## PERSONAL NATURE OF TIMBRE-BASED MUSIC

Often overlooked is the fundamental unsuitability of timbre-centered music for collective use, in sharp contrast to frequency-centered music. Even before the crystallization of Western tonality, musicians realized that the more instruments played the same harmony, the more euphonic it appeared (Mersenne, 1957, p. 270). However, the more speakers collectively pronounce the same sentence, the less articulate it sounds. People sing together, but take turns speaking (Brown, 2007). Fusion of sounds dominates production and perception of music (Huron, 2001) because pitches readily conjoin in unisons, "double-notes" and chords. Timbre-centered music, however, resembles speech in its soliloquacity: for a native listener, simultaneous throat-singing by a dozen singers would make the song unintelligible.

Unlike the simple dimensionality of height for pitch, the multi-dimensionality of timbre considerably complicates its categorization (Krumhansl, 1989). Timbre-based music also lacks the systemic rationality of frequency-based music systems. The interrelation of pitches cross-refers any given pitch value, facilitating the interpretation of pitch by making it "rational," whereas timbre stands as "irrational," lacking anything akin to intervallic ratios (Balzano, 1986). To add to the confusion, the temporal dimensions of timbre interact with its spectral dimensions (Caclin et al., 2007), preventing multiple timbres from synchronizing perfectly. Different timbres, doubled in unison, either blend into a new timbre, intensifying one of the constituents, or repel each other (Sandell, 1995). Blending depends primarily on onset synchrony, similarity of attacks, and/or spectral centroids, all traditionally studied within the field of orchestration (Tardieu and McAdams, 2012). Orchestration is limited to frequency-centered music, where musical instruments are perfected to produce clear pitch. Even many popular instruments exhibit registrally fixed formants, and can therefore be considered "pitch-generalized," which greatly influences their timbral blend (Lembke and McAdams, 2015). Unsurprisingly, unison blending is greater than non-unison blending, inverse to the identifiability of constituent timbres (Kendall and Carterette, 1993). The fusion of multiple "pitch-generalized" instruments generates orchestral tutti: an indistinguishable mass of timbres that nevertheless remains "pitch-clear."
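The notion of a spectral centroid, mentioned above as one factor in timbral blend, can be made concrete with a small calculation. The sketch below is purely illustrative and is not drawn from the cited studies: it assumes that a tone can be summarized as a short list of harmonic partials, and the helper names `harmonic_tone` and `spectral_centroid`, as well as the amplitude values, are hypothetical. It shows two tones that share a fundamental (and hence a nominal pitch) while differing in centroid, which is one simple acoustic correlate of a difference in timbre.

```python
def harmonic_tone(fundamental_hz, amplitudes):
    """Build a list of (frequency_hz, amplitude) partials.

    The k-th amplitude is assigned to the k-th harmonic of the fundamental.
    This is an idealized, hypothetical description of a sustained tone.
    """
    return [(fundamental_hz * (k + 1), a) for k, a in enumerate(amplitudes)]


def spectral_centroid(partials):
    """Return the amplitude-weighted mean frequency of the partials, in Hz."""
    total_amplitude = sum(a for _, a in partials)
    if total_amplitude == 0:
        raise ValueError("spectrum has no energy")
    return sum(f * a for f, a in partials) / total_amplitude


# Two tones on the same fundamental (220 Hz), i.e. the same nominal pitch.
# The amplitude profiles are made-up values chosen only for illustration.
dull = harmonic_tone(220.0, [1.0, 0.5, 0.25, 0.125])   # energy near the fundamental
bright = harmonic_tone(220.0, [0.25, 0.5, 1.0, 1.0])   # energy in upper harmonics

print("dull centroid (Hz):", spectral_centroid(dull))
print("bright centroid (Hz):", spectral_centroid(bright))
```

Under these assumptions the "bright" tone has the higher centroid even though both tones would be notated as the same pitch. A single number of this kind captures only one of the several dimensions of timbre discussed above; it says nothing about attacks or onset synchrony.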

There is no tutti in indigenous timbre-centered music<sup>9</sup>. Such music cultivates very different instruments (whip, buzzer, cane, flask), distinguished by a characteristic timbre and indefinite pitch (Mazepus and Galitskaya, 1997). In Russian musicology, these are qualified as "phono-instruments"<sup>10</sup>: sound-producing tools manufactured for some common application other than music-making (Yesipova, 2008). The most important of them is the Jaw Harp (JH)<sup>11</sup>. In the contiguous area from the Urals to the Okhotsk Sea, all indigenous ethnicities have their JH traditions (Emsheimer, 1986). A century ago, every ethnicity of Asiatic Russia possessed the JH (Jochelson, 1928, p. 217). The JH can be qualified as the archetypical instrument of Siberia (Sheikin, 1996). It and other "mouth-resonating instruments" (mirlitons, bows, "singing-pipes") specialize in making "music for oneself" (Sheikin, 2002, p. 116–66). The JH's solitary use prevails over the Volga Plateau (Zagretdinov, 1997), Tuva/Altai (Suzukei, 1989), Afghanistan (Koskoff, 2008, p. 1062), Sakhalin (Mamcheva, 2005), Taiwan (Blench, 2004), Indochina (Sam, 2008), Indonesia (Matusky, 2008), and New Guinea (Pugh-Kitingan, 1977).

<sup>9</sup>This generalization applies only to the indigenous folk traditions of timbre-based music, and not to the styles of Western classical music that are based on sonoristic effects rather than conventional "melodies" or "chords," such as "sonorism" (e.g., Krzysztof Penderecki) or "spectralism" (e.g., Tristan Murail). Such styles generally tend to follow an "experimental" approach forged by Western modernistic composers through "inventing" their own original means of tonal organization rather than implementing already existing folk timbre-based music systems that have been "naturally" formed over a long period of time within an indigenous cultural tradition.

<sup>10</sup>The term "phono-instrument" was introduced by Sheikin (1996) to supplement the Sachs/Hornbostel structural classification of musical instruments by the classification of functionality of their use, where "phono-instruments" represent the archaic forms of musicking, preceding the invention of instruments designed specifically to generate a particular type of music (Sheikin, 2002).

<sup>11</sup>This lamellophone musical instrument is often called "Jew's harp," which is somewhat confusing, since it is connected neither to Jews nor to Judaism (Wright, 2015, XIV). A more appropriate name seems to be "jaw harp" (Crane, 1982), as suggested earlier by Sachs (1940, p. 58), for the characteristic manner of its sound production. The player plucks the lamella while holding the instrument close to his/her mouth, which is used as a resonance chamber for amplifying the desired partials of the vibrating lamella (Ledang, 1972; Adkins, 1974; Trias, 2010).

Such a wide geographic distribution indicates an intercultural, acoustic reason for the JH's personal application: except for the recently invented JH models<sup>12</sup>, traditional instruments are barely audible beyond a few meters. Even in social settings, the JH remains **private**. Such are the JH's romantic responsorial duets between a young man and a young woman, known, other than in Russia, in Southern China (Picken, 1957, p. 154), Tibet (Arcones, 2013, p. 216), Hainan (Hsu, 2001), Taiwan (McGovern, 1922), Laos (Simana and Preisig, 2003), Vietnam (Rault and Brenton, 2000, p. 83), Indonesia (Kartomi, 2012, p. 160), New Guinea (Pugh-Kitingan, 1977), and Western Europe (Kolltveit, 2006, p. 111).

#### **Example-3**

"Serenading" khomus (Yakut JH): a romantic duet in traditional style, performed by Erkin Alekseyev and Tokuiaana Nikolayeva. Music proceeds in a responsorial setting, but ends with simultaneous playing that reflects the union of feelings (http://chirb.it/rH6bFD).

Germane in this context is that "serenading" duets involve taking turns, like interlocution: one player reproduces the sonic material suggested by another player (Haid, 1999).

The JH is also a child's favorite plaything, another cross-cultural sphere of private use, in Yakutia (Dyakonova, 2017), Ural (Aleksandrova, 2017), Sakhalin (Mamcheva, 2012, p. 197), Uzbekistan (Beliayev, 1933), Kyrgyzstan (Vinogradov, 1958, p. 180), Afghanistan (Slobin, 1976, p. 53), Mongolia (Pegg, 2001), Japan (Ishi, 1916), southeastern China (Picken, 1957), Polynesia (McLean, 1999), Indonesia (McPhee, 1955), and Western Europe (Kolltveit, 2006, p. 109). Toying with a JH is typical during the long hours of herding (Shchurov, 1995), a task entrusted to children in nomadic Asiatic societies (Stépanoff et al., 2017). In Altai, the JH is regarded as a herding instrument (Dorina, 2004). Children start working in pastures at ages 5–6 (Yekeyeva, 2011). In Yakutia, mothers teach their children to make JHs from tree splinters (Tchakhov, 2012). Nivkh 5-year-olds learn to play and make bamboo and grass JHs, guided by parents, grandparents, or older siblings<sup>13</sup>.

Self-manufacturing is an important marker of timbral instruments; personal use accompanies personal manufacturing (Dorina, 2004)<sup>14</sup>. Even metallic JHs are often self-made, such as by flattening brass rifle cartridge cases (Mamcheva, 2012, p. 50). Forging of metallic JHs by a metalsmith is a relatively recent historic development, but instruments made that way are no less personal. Even in modern Western societies, JH gatherings promote private musicking: mostly solo playing, without electronic amplification, and submerging into a meditative state (Morgan, 2017). For Siberian vocal and instrumental traditions such musicking constitutes the norm (Zabolotskaya, 2009).

Despite the JH's tuning to a certain fundamental frequency (**FF**), indigenous techniques obstruct standard Western notation (frequency-based), since the production of discrete pitches constitutes only a fraction of the possibilities (Alekseyev, 1991b).

#### **Example-4**

"Mary had a little lamb," produced on khomus by Erkin Alekseyev. This music demonstrates inauthentic, Western-style treatment of the JH as a "frequency-based" musical instrument whose purpose is to accurately reproduce the pitches of "tunes" (http://chirb.it/HNcBsq).

The bulk of the JH's autochthonous repertory consists of speech-like articulations and special sound effects (Alekseev, 1988). The verbal characterizations of JH sounds common among players are all timbre-oriented (Zagretdinov, 1997). Each timbre-distinguished device usually carries a particular semantic value (Shishigin, 1995)<sup>15</sup>.

A JH tutti would blur the clarity of articulation, preventing the parsing of meaningful musical elements such as tabygyr (staccato) (Alekseyev, 1986). Therefore, autochthonous styles are strictly soliloquy-based. A good example is the Yakut syyia tardy<sup>16</sup> ("moderate playing"), nicknamed "talking khomus"<sup>17</sup> (Grigoryan, 1957), which is closely related to the vocal toyuk (Shishigin, 1995). Toyuk (song) is the most ancient form of dieretii yrya: a smooth-flowing, drawn-out improvisational singing style, based on a limited set of melodic intonations, and timbrally individualized (Alekseyev, 2016). Syyia's closeness to toyuk enables the simultaneous singing and playing of a JH.

<sup>12</sup>During the second half of the twentieth century, a number of master blacksmiths in the USSR, especially in Yakutia, attempted to enhance the construction of the JH by maximizing the instrument's dynamic capacities to make it more suitable for concert stage performance (Dyakonova, 2011). Modernized instruments differ substantially from traditional ones, to such an extent that some experts consider them "not simply an augmented khomus, but a new instrument in terms of acoustics and art" (Kunanbayeva, 1987).

<sup>13</sup>Personal communication with N. Mamcheva.

<sup>14</sup>Notable exceptions are cases where the making of an instrument has to pass through a specific magic ritual, such as the divination of the tambourine for shamanic use, which might involve up to 9 specialists and take a few days.

<sup>15</sup>For instance, the choice of ysyakh (the summer solstice celebration) as a subject by a khomus-player usually involves performance of the okhuokhai rhythmoformula (festive round dance), the crescendo effect (as if gradually approaching the festival), kuoregei (imitation of a larch-call, associated with sunrise), kebe (cuckoo call, announcing the onset of summer), tabygyr (associated with the sound of the horse trot and droplets of melting icicles), and uyuu (articulation of words representing shouting of excited people and singing of dance-leaders) (Shishigin, 1995; Alekseev, 1988).

<sup>16</sup>Syyia in Yakut literally means "slowly, staidly," and tardy means "to yank, pull," as in plucking the JH's lamella. Syyia tardy signifies a particular playing technique, in which silent mouth articulations seem to speak out syllables while the plucked lamella keeps vibrating. This technique is also used in other styles that require increased clarity of articulation (Zhirkova, 1991). This underlines the priority of clear diction, precluding simultaneous engagement of many JHs.

<sup>17</sup>"Khomus" ("Khamys" in archaic spelling) is a Yakut Jaw Harp. The alternative meaning of this word is "Scirpus"—grass-like plant that grows next to rivers and lakes (Pekarsky, 1959, p. 3293). Most probably, the original instrument was manufactured of this grass—similar to the way modern Nivkhi and Ulchi make their grass JHs (Mamcheva, 2012, p. 50).

#### **Example-5**

"Tuluktan doborum" ("Friendly bullfinch"): spontaneously improvised simultaneous singing and playing on khomus by Agrafena Ptitsyna from the Megino-Kangalasskii district of Yakutia. The performer starts in the style of degeren (rhythm-oriented, metrically regular), but after 30 s switches to another style, dieretii (smooth, metrically free style), and thereafter keeps alternating between the two. Instrumental and vocal parts interfere with each other: the khomus seems to obstruct the voice's attempts to carry out a coherent song (http://chirb.it/2D3pJh).

Yakuts do not consider the khomus to be frequency-based. They evaluate its sound not as "pitch," but as "coloration." Therefore, concurrent singing-playing constitutes "coloring" the same song. All Yakut music is predominantly solo-vocal, where "vocal" is understood as "mouth-driven"—incorporating both vocals and JHs (Alekseyev, 1991a).

The inclusion of the JH in bands is uncommon in indigenous traditions. It is bound to the South: i.e., India and Indonesia (Morgan, 2008). In northern Eurasia, JH playing is usually solitary, and an additional player can participate only if he/she is a close relative<sup>18</sup>. The JH "combo" is a recent phenomenon<sup>19</sup>, inspired by Western chamber music (Alekseyev, 1988, p. 185–7)<sup>20</sup>. Still, Siberian ensembles resemble the jazz combo by limiting tutti to only the start/finish of the music. Superimposition of JHs is understood here as a timbral mixture rather than as "frequency intervals," unlike the counterpoint-style mentality of Austrian and German performers who group up to five or six JHs to produce "chords."

### TRANSMISSION AND PARTICIPATION IN PERFORMANCE OF TIMBRE-BASED MUSIC

Frequency- and timbre-based musics fundamentally differ from one another with respect to transmission. It is much harder to reproduce someone's timbre than their pitch. Timbre is usually personalized. We recognize a person's voice and a musical instrument by its timbre. Consequently, timbre-based music is designed to reflect the state of a unique individual rather than that of a group. Timbre (Saitis and Weinzierl, 2019) and sonority (McAdams, 2019) have been inherently connected to the display of emotions—even within the Western classical music tradition (Maddox, 2009).


The inherent individualism of timbre-centered music originates from mother-infant vocal communication. Their vocalizations imitate each other in pitch-contour and timbre to establish a personal bond (Malloch, 2000). These imitation games are deeply private, defining conventions of a "**mini-culture**" for each caretaker-infant pair (Trevarthen, 2008). Such a mini-culture opposes "culture" by circumnavigating social conventions for the sake of securing effective communication with any child, no matter how anomalous he/she may be.

Participation in a group always imposes obligations, and group music makes no exception. However, timbral music performance is fundamentally "free." The performer plays to his/her own satisfaction rather than the satisfaction of others. This is also true with a mini-culture: the caretaker and infant lock in a sympathetic communication that preserves individual freedom.

For timbral music cultures, infant **mini-cultures** develop into an adolescent "**maxi-culture**"—a maximally inclusive culture that allows everyone to remain himself/herself, rewarding them with a positive experience.

Timbral music acts like self-directed speech in problem-solving situations: it encourages behavioral reorganization and compensates for negative experience. And here the JH exemplifies timbral music. Notably, in many indigenous Eurasian cultures the JH is renowned for being a female and children's instrument (Mazepus and Galitskaya, 1997). The JH creates a mini-culture for each player within the ethnic maxi-culture.

## PERSONAL SONG (PS) AS A FORM OF TIMBRE-BASED VOCAL MUSIC

A peculiar form of "timbre-based" music is "**personal song**"<sup>23</sup> (**PS**)—a traditional custom of assigning a specific "tune" to each individual that represents his/her identity and satisfies the required sonic uniqueness: i.e., it does not resemble

<sup>18</sup>JH duets occur either during dating or in joint entertainment between two female friends. Thus, in Bashkortostan, leisure sessions might engage up to 3 women playing together (Shchurov, 1995).

<sup>19</sup>One of the first accounts of such practice comes from the time of collectivization campaigns in Middle Asia after the establishment of the Communist rule. Vinogradov (1938, p. 88) describes the performance of a JH orchestra of Kyrghyz young women as a "haphazard juxtaposition" of multiple instruments "without any coordination," where each performer played according to their own whims.

<sup>20</sup>The most salient features that distinguish this new use of JH from the autochthonous traditions of the past are its presentation from the concert stage for a large audience, use of microphones and continuous concurrent playing of more than two instruments. Noteworthy is the globalization tendency of this new format—it attracts artists from many diverse ethnic backgrounds, including those that have not retained their autochthonous JH traditions (e.g., Russians).

<sup>21</sup>The empowering effect of collective frequency-centered performance is most evident in such genres as the work-song, military music and the national anthem—all called forth to increase group cohesion and to strengthen each of the participants.

<sup>22</sup>The propensity for self-communication, which unites self-reflection while performing with the memory of oneself perceiving the same lyrical expression at some earlier time(s), was identified by Yury Lotman in regard to poetry (Lotman, 2001, p. 25). Musical expression can also follow suit. In fact, this is exceedingly common in traditional cultures of almost all indigenous people of Siberia (Sheikin, 2002, p. 304).

<sup>23</sup>This term originates from writings by Merriam (1964, p. 83). In English language literature, the term has been used in reference to "private songs" of North American Indians, e.g., Sioux (Austin, 1930), who, along with Canadian indigenous population, share the institute of PS with Siberian ethnoses (Ojamaa, 2002). In Russian research literature the most common equivalent term is "lichnaya pesnia."

the timbre and melody of another person's singing (Novik, 1999). A "personal tune" incorporates a particular melodic contour and timbre, and is used to coin multiple variations that the singer spontaneously forges while singing/humming during everyday activities (Novik, 2003). Most of one's day is accompanied by such semi-conscious singing. A comparison of different versions of one's PS recorded on different occasions shows great variability in text and emotional states along with stability in the melodic structure (Ojamaa and Ross, 2004), suggesting a link between "self" and melody. Furthermore, when a person is absent and missed by relatives, they will sing that person's PS as a substitute for his/her presence (Dobzhanskaya, 2017). The unauthorized performance of a PS by strangers is prohibited, and is punishable by the offended party (Grachiova, 1983, p. 56)<sup>24</sup>.

Parents create PSs for their children as though "soundpainting," employing acoustic parameters to express the visual-motoric traits of the child (hyperactive/calm, headstrong/social, big/small).

#### **Examples-6/7**

Ex.6. Dinimiaku's children's personal song, created and directed to Dinimiaku by her father, Tubiaku Kosterkin, from the settlement Ust'-Avam in Taimyr, in Nganasan. The song uses the timbral markers: contrasting "clean" high and "dirty" low registers connected together by sliding, and embellished with the "tremolo" effect. According to the performer, the song expresses tenderness and playful teasing. The lyrics question if the girl is upset at the old grouchy woman (Dobzhanskaya, 2014, p. 159) (http://chirb.it/7htOzK).

Ex.7. Derkuptie's children's personal song, performed as if "coming from him" to the house guests, by his mother, Valentina Kosterkina, in Nganasan. The song is "infantilized": raised in pitch and "whiny"—it emphasizes the descending sliding intonations, as though gently "complaining." The lyrics imply that the infant-boy is overwhelmed by the visitors' attention and wishes to be left alone (Dobzhanskaya, 2014, p. 156) (http://chirb.it/zsxAEm).

After reaching adulthood, one might create or accept as a gift a second PS. Such renewal addresses the discrepancies between "infantile" and grownup states. The new PS often emulates the PS of a relative who is considered a role model<sup>25</sup>.

#### **Example-8**

Adult personal song of Ver'a Nenyang, performed by his daughter, Liubov' Nenyang. Nenets masculine personal songs are characterized by praising one's own strength, luck, and smartness (Nenyang, 2006). Nenets adult personal songs usually oppose two contrasting vocal registers, joined by strong portamento, use of few "degrees" (3 in this example), and exaggerated intonations; the melodic intervals between the degrees are stretchable within the range of about 700 cents (Dobzhanskaya, 2017) (http://chirb.it/dgenGN).

The third<sup>26</sup> PS is permissible at old age to reflect age-induced personality changes (Sheikin, 1996, p. 67).

#### **Example-9**

Murun Yrya ("nasal" song). Old age personal song of Maria Sleptsova-Kustui Maaia, of Yakut-Evenk ancestry, reproduced by her granddaughter, Marina Vasilyeva. Her grandmother used to sing this particular arrangement during knitting. The lyrics list many different garments made by Maria in the past and explain what they are good for (Dyakonova, 2014). The song is characterized by nasalization and kylysakh, applied to the simple formula of 3 pitches ("degrees") (http://chirb.it/4Fc24f).

A PS resembles a renewable passport photograph, identifying principal temperamental shifts throughout life<sup>27</sup>. In this way a PS maintains personal integrity, reminding the singer of an anchor-state to return to after an emotional perturbation. This is most obvious in lengthy "autobiographical" PSs, where singers list dramatic events of their life (Ojamaa, 2002). A PS is replaced if the return to the anchor-state somehow becomes impossible.

PS "anchoring" is invaluable in the solitary lifestyle of Northern nomads, who are infamous for psycho-pathological disorders known as "arctic hysteria" (AH) (Tseng, 2003). In the colonial past, this diagnosis reflected the Eurocentric evaluation of the indigenous population by Western physicians (Tseng, 2007), but throughout the twentieth century AH captured the attention of anthropologists (Foulks, 1985). Eventually, AH fell within the scope of cross-cultural ethno-psychiatry as a "cultural syndrome" (Kirmayer, 2018). Such qualification somewhat undermines the life-threatening power of AH (Czaplicka, 1914, p. 307–25), perhaps because the English literature on AH focuses on the Inuit, whose symptomatology appears milder and rarer than in the Russian North. Mitskevich (1929, p. 21) reported severe disabilities (and even deaths) in 60% of AH cases in Upper Kolyma. He considered this to be an underestimate, however, since many concealed AH (13), and locals regarded it as a "bliss" not to be medically treated (16). According to a confidant of Mitskevich, 100% of women in his settlement suffered from AH (21). AH ran through many families (24) and sometimes broke into mass "epidemics," affecting up to a third of the

<sup>24</sup>The PS taboo can be illustrated by the following case: after a newly installed system in the settlement's club-house played back a PS recording of a recently deceased local old woman, the residents of a Chukchi settlement in dismay demanded that its public broadcasting be stopped immediately (Novik, 1999).

<sup>25</sup>One often imitates a relative's PS when moving away from home, due to suffering from nostalgia. Reminiscing about a dear relative by singing a similar PS on a daily basis, then, serves to combat nostalgia.

<sup>26</sup>In the past, for many, if not most, Siberian ethnoses, multiple personal songs were common. However, during the communist rule the institute of PS was severely shaken by campaigns directed against "shamanism," replacement of traditional lifestyles with modern ones, and the growing influence of "metropolitan" Russian culture, empowered through the education system. Thus, traditional taboo on non-personal use of PS was undermined by public broadcasting of those popular songs that are based on "personal motifs" of their creators, such as those by the distinguished Chukchi musician, Gennadii Pananto (Vensten-Tagrina, 2008).

<sup>27</sup>Russian ethnomusicologists often nickname PS a "musical passport." Passports in the USSR differed from Western passports in that they indicated one's permanent residential address (propiska)—in addition to identifying a person's face, name, and family status—very much what the PS usually indicates.

population (25–29)<sup>28</sup>. Many feared contracting AH from others. Even in modern Russia, ethnographers are afraid to research AH-related matters<sup>29</sup>.

Earlier researchers explained AH by the negative influence of extreme cold and Polar night (Novakovsky, 1924). Later research has identified such contributing physiological factors as calcium and vitamin D deficiencies (Wallace and Ackerman, 1960). Similar symptoms were observed among Southern neighbors (Mongols), along with phobias and "copycat" syndrome attributed to a sufferer's excessive submissiveness (Aberle, 1952). Alcoholism was also considered a contributing factor (Foulks, 1985), although alcohol was exceedingly expensive and thus unaffordable to the indigenous Siberian population before 1917 (Mitskevich, 1929, p. 16). More convincing is Mitskevich's connection of AH with chronic starvation, constant stress, excessive pre-pubertal sexual activities, and sensory deprivation, which affect the Northern population much more than the Southern (45) and are environmental rather than cultural—they are as common among local Russians as among Yakuts (9). Such etiology and epidemiology are confirmed by numerous authors (Tokarsky, 1893; Gamov, 1894; Sieroszewski, 1896; Sakaki, 1903; Bogoras, 1910; Vitashevsky, 1911; Anuchin, 1926; Shreibler, 1927; Shirokogoroff, 1929; Shternberg, 1936; Petrov, 1960; Grigoryeva, 1996).

The reduction in AH must be attributed to the considerable improvement in living conditions throughout the twentieth century<sup>30</sup>. Prehistoric life in a tundra-like climate, then, must have been compromised by even more severe AH.

What remains unknown outside the Russian literature is the connection between AH and music. AH sufferers sing, first quietly, then more excitedly, swinging hands and shivering (Jochelson, 1910, p. 31). Their songs comprise a special genre—menerik yryata (crazy songs)<sup>31</sup>—characterized by a confusion of identity and consciousness: the singer haphazardly switches between multiple identities in dialogic singing, with incoherent words and intense timbral modulations (tremolo, rasping, falsetto), raving-like, occasionally shrieking, moaning, and clapping in metric disarray (Alekseyev, 2008). Sufferers report hearing songs of evil spirits, and can ameliorate their suffering by singing the evil spirits' songs, to vent the spirits out (Vitashevsky, 1911, p. 188).

#### **Examples-10/11**

Ex.10. Reproduction of menerik yrya, usually performed during the attacks of meneriyi by an anonymous old Yakut from Tattinskii ulus. He later described his singing as a reaction to seeing an evil spirit, abaasy, in the corner of his yurt, which caused him to shriek in deep fear while trying to scare the spirit away. His yelling is interspersed with singing out the lines supposedly pronounced by abaasy. Characteristic is the "dialogic" representation of at least two characters (http://chirb.it/tIandL).

Ex.11. Another menerik yrya, imitated by Vissarion Gavrilyev from the Maar settlement in Niurbinskii ulus, based on his experience of frequently witnessing the meneriya of his neighbor, an old woman. The lyrics are more comprehensible than in the example above and present an argument between the patient and the spirit ichchi. The song is characterized by the alternations of a recitative-like excited singing/talking and brief tremolo motifs in a free metric setting, interrupted (rather than accompanied) by spontaneous clapping (http://chirb.it/mr28fk).

AH singing drastically contrasts with the stability of PS and its regular, endless repetitions of the same formula in a characteristic "personal" timbre. This opposition reveals the orderly power of PS and how crucial it is for survival in Arctic conditions. Continuous singing during long, solitary travel in the tundra prevents the rider from falling asleep and losing track of direction, or helps in surviving snowstorms (Krushanov, 1987, p. 234). PS keeps one's mind present under critical pressures.

In contrast, menerik loss of personal identity corresponds to suicidal behavior—e.g., running away and freezing to death (Nissen and Haggag, 1988)—supporting the indigenous beliefs that meneriya occurs when evil spirits invade one's soul. According to anamnesis, singing starts precisely during the attacks of such "possession" (Mitskevich, 1929). Although pathological singing clearly results from AH, there might be some feedback between the disruption of "normal" PS use and AH incidents. The Nanai, another PS ethnos, regard any mental disease as a "personality disorder" caused by the invasion of multiple spirits (Shimkevich, 1896)<sup>32</sup>. Culture is known to construct methods for handling common mental illnesses, which often include music (Robertson-DeCarbo, 1974). If music can heal, it should also have the power to aggravate—PS dysfunction could contribute to AH.

<sup>28</sup>The documented evidence indeed often describes cases where witnesses who had observed the AH behavior shortly developed the same or very similar symptoms themselves. Sometimes these symptoms did not disappear with time, but gradually worsened into regular fits. This was especially common amongst females.

<sup>29</sup>Thus, Yevdokiya Sergina, a professor of the Arctic State Institute of Art and Culture in Yakutsk, in a personal communication, admitted to having deliberately avoided contact with people that suffered from AH disorders for fear of their detrimental influence. To illustrate the potential dangers involved, she reported the suicide of a young researcher who specialized in shamanic mysteries, after having attempted to investigate such disorders.

<sup>30</sup>The modern reduction in severity and incidence of AH could also be explained by the growing acceptance of AH by the local population, who tend to avoid medical assistance—in combination with the lack of qualified scholars capable of establishing rapport with the local population to obtain confidential information. Most likely, the abundance of anamnestic information in nineteenth to twentieth century Russia became possible because of the custom of condemning political prisoners to settle in Siberia. Most physicians and ethnographers who left accounts of AH were sentenced to exile, at first by Imperial and later by Bolshevik courts, a long-lasting tradition that was instituted at the time of the Russian conquest of Siberia (Hartley, 2014). The length of their sentences secured their bond with the indigenous people. Additionally, political convicts of the nineteenth century (as well as Bolshevik officials appointed to investigate local living conditions) used to be atheistic—unsusceptible to fear of evil spirits.

<sup>31</sup>Menerik yryata singing that includes moaning, shouting and occasional clapping with lyrics as though coming in alternation from presumable "evil spirits" and the singer himself/herself, is still encountered not only amongst Yakuts but also their neighbors, who have their own names for such singing: cármoriel amongst Yukagirs, haujan amongst Evenki (Jochelson, 1910, p. 31).

<sup>32</sup>Shimkevich describes a traditional Nanai wooden idol (seon) and an iconographic image drawn on a piece of cloth, paper or wood (girki) of a deity (burkhan) that has multiple heads, representing the disarrayed and conflicting mentality (p. 49). Nanai call such multi-headed burkhans "segemi." A similar tradition was described in neighboring Nivkh culture (Shrenk, 1903, p. 118).

Another musical-psychological anomaly, common in Siberia even now, is tyyl yryata<sup>33</sup>—sleep-singing, caused by extreme exhaustion or distress, especially when these are chronic. It resembles sleep-talking and can last for hours until the sleep-singer awakes (Jochelson, 1910, p. 13:37). Locals do not consider this a disease (Mitskevich, 1929, p. 10).

#### **Examples-12/13**

Ex.12. Genuine tyyl yrya, captured by Eduard Alekseyev from an overnight recording of Prokopii Sleptsov from the settlement Druzhina, Abyiskii ulus. Upon listening to this recording, Sleptsov remembered his dream of hunting a moose, but could not recognize the language of his singing (presumably, Yukaghir, the native tongue of his mother, that he later forgot). Singing is based on a brief descending gliding motif consisting of 2 pitches (degrees), possibly with a third complementary degree (http://chirb.it/086zkG).

Ex.13. Reproduction of tyyl yrya of an old woman, a relative of the famous Yakut singer, Luka Turnin, who had overheard her singing on numerous occasions. The song is based on the repetitions of a brief formula of 6 tones, engaging 3 pitches (degrees) - most likely a personal motif of the old woman. The lyrics complain about disappointing the barnmaster spirit, and promise to please the spirit with a gift (http://chirb.it/cg0cvL).

The formulaic structure of tyyl yrya suggests that it constitutes a deeply ingrained PS, "automatically" reproduced by the sleeper. Perhaps tyyl yrya is a byproduct of the hyper-activation of the self-identifying circuit that attempts to cope with a stressful exertion, whereas menerik yrya reflects the failure of such coping. This would confirm Jochelson's observation that sleep-singing relates to meneriya.

In indigenous Yakut culture, there are numerous vocal genres dedicated to meditative psychological self-regulation in stressful situations: kögus yryata (pain-reducing songs), enelgen yrya (complaining songs), and sulanyy yryata (dying songs), which are collectively described as uiulga yrya (singing for mental endurance) (Dyakonova and Grigoryeva, 2017).

In Arctic conditions, the significance of PS is truly "Cartesian": "I sing; therefore I am" (Sheikin, 2002, p. 330). Loss of PS amounts to "disintegration."

Musically, a PS reflects one's affiliation with a kin and territory (Sheikin, 1996, p. 12). Specific intonations, rhythms, and timbres characterize specific geographic locations and kin groupings (Novik, 2004, p. 80). Comparative musicological analysis of PSs allows for identification of one's genealogical tree in Samoyed cultures (Niemi et al., 2004). The need for geographic tagging originates in the custom of taking a wife from neighboring ethnoses (Goltsova et al., 2005); territorial PS markers prevent incestuous marriage (Aizenshtadt, 1982). Patriarchic lineage determines the wife's social identification through her symbolic "rebirth," in which she loses membership in her birth kin upon adoption by her husband's kin (Sagalayev and Oktiabr'skaya, 1990, p. 18). Kin membership legitimizes a human being (Tyukhteneva, 2015). In such a system, the absence of PS amounts to "excommunication" from traditional society and the patronage of supernatural forces.

## THE CONCEPT OF MUSICALITY IN TIMBRE-BASED MUSIC

The existential cardinality of PS increases the value of musicality. In Western societies, a lack of musical abilities bears no biological cost; amusiacs do not suffer damage to their social life (Patel, 2010, p. 371–379). In PS cultures, however, amusia is costly.

Most traditional indigenous societies engage all members in active music production, since musical deficiencies pose serious obstacles to "normal" social interaction (van der Schyff, 2013). Like speech, music-making fuels **autopoiesis**—ongoing biological self-organization that optimizes the antithesis "I"-"non-I" to support one's autonomy (Maturana and Varela, 1980). Essentially, this is no different from the formative role of musical communication in the cognitive development of an infant (Cross, 1999). The absence of such communication endangers the psycho-emotional well-being of the child (Malloch and Trevarthen, 2009).

Inability to sing one's PS is perceived in indigenous societies as a handicap that requires family assistance. That is why, in Chukotka, those incapable of singing have their relatives sing their PS for them (Dyakonova, 2015). According to Siberian mythology, aphonia distinguishes dead souls from the living, and losing one's voice to a spirit equals death (L'vova et al., 1989, p. 90). Through its association with breathing, voice is considered an immanently live object capable of "invading" a person and "subduing" his/her mind (Pashina et al., 2005, p. 50). This is of special concern for shamans, who frequently imitate the voices and personal songs of dead people. Hence, using a "correct" voice is crucial for personal and societal well-being, and in some communities traditions call for the very old and sick to stay silent during public rituals (Dorokhova, 1995).

Unsurprisingly, in PS cultures the standard of vocal musical proficiency is decidedly low. Sieroszewski expressed discontent at "voiceless" Yakuts, who were always singing their "endlessly repeated monotonous tunes," "pleasant only to [the] performer," and "annoying like [a] mosquito buzz" (Sieroszewski, 1896, p. 569). His annoyance is evidence that contemporaneous Polish and Russian folksong standards, influenced by the classical bel canto, exceeded the Yakuts' in aesthetic demands. Frequency orientation and the absence of PS in Russian folksong enabled the bel canto influence, propping up its underlying idea of the necessity for "musical gift" and "proper" education. Such an attitude pushes toward exclusiveness of musicking and its professionalization. PS, on the contrary, promotes inclusiveness<sup>34</sup>.

<sup>33</sup>Tyyn yryata ("night songs," in Yakut), or tyyl yryata ("sleeping songs") is unconscious singing in sleep, often sung in foreign tongues that the singer does not speak when awake. This is still relatively common amongst Northern Siberian ethnoses, especially men. Evenks call this singing náyani, and Yukaghirs—yendójenut yáxtei (Jochelson, 1910, p. 13:37).

<sup>34</sup>To illustrate this point, in 1957, one of the authors of this paper, Ivan Alekseev, irretrievably damaged his vocal cords by the excessive singing of his PS in nostalgia, after having moved away from his homeland. Nevertheless, his "bad" voice in no

Timbral music does not observe "wrong notes": informants are puzzled by questions about musical mistakes, as they believe that any expression is "right" (Ojamaa, 2003). In addition, intervals in timbral-based cultures are not fixed, as they are in frequency-based cultures, but are stretchable (ekmelic) depending on the singer's emotional state (Nikolsky, 2015). A performer, when repeating a song, sets the same lyrics to varying pitches, unaware of pitch discrepancies despite a fine musical ear that is evident whenever he/she sings other styles of music. Indefinite pitch certainly reduces to a minimum the demand on musical skills necessary for a "decent" performance.

A PS can be exceedingly simple, performable by virtually anyone.

#### **Example-14**

Personal song of a Nenets old woman, Utchi, from the Kazym river region, covered with taiga (courtesy of Triinu Ojamaa). The melodic contour of this song's formula is exceedingly simple, engaging only 3 degrees within the range of only 274 cents: the lowest (228–231 Hz), the middle (249–252 Hz), and the highest (c. 254–267 Hz) degrees (http://chirb.it/0wM28B).
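The relation between the frequencies in hertz and the range in cents reported for Example-14 can be checked with the standard conversion, 1200 cents per octave. The following is a minimal sketch, not the authors' measurement procedure; the function name `cents` and the dictionary `degrees` are illustrative assumptions, and only the Hz values come from the example above.

```python
import math


def cents(f_low_hz: float, f_high_hz: float) -> float:
    """Interval size in cents between two frequencies (1200 cents = 1 octave)."""
    return 1200.0 * math.log2(f_high_hz / f_low_hz)


# Frequency bounds (Hz) of the three degrees as reported for Example-14.
# The dictionary layout is an assumption made for this illustration.
degrees = {
    "lowest": (228.0, 231.0),
    "middle": (249.0, 252.0),
    "highest": (254.0, 267.0),
}

# Overall range: bottom of the lowest degree to the top of the highest degree.
total_range = cents(degrees["lowest"][0], degrees["highest"][1])
print(f"overall range: {total_range:.1f} cents")

# Spread of each degree, i.e., how far a single "degree" itself stretches.
spreads = {name: cents(lo, hi) for name, (lo, hi) in degrees.items()}
for name, spread in spreads.items():
    print(f"{name} degree spread: {spread:.1f} cents")
```

Under these assumptions the overall range comes out within about a cent of the reported 274 cents. The highest degree alone spans somewhat less than a semitone, which is consistent with the stretchable (ekmelic) intervals described in the text.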

Structurally, a PS consists of "leitmotif" and "leittimbre"<sup>35</sup>—a reused motif and timbral quality. One family's PS can be distinguished from another's by its timbral style (Ojamaa, 2002). Musical elements that stay intact in all uses of a PS represent the PS's owner (musical "foreground"). Elements that change represent the circumstances of his activities ("background"): e.g., riding engages a different rhythm than fishing in an otherwise identical PS. Together, foreground and background define the position of an individual in space and time—indispensable in the solitary conditions and open spaces of Siberia.

Discrimination between foreground and background features requires the perception of **thematicity**—musical material that constitutes a **theme** (Mazel, 1979, p. 150–5)<sup>36</sup>. Although thematic analysis evolved within Western classical music (Réti, 1951)<sup>37</sup>, it should nevertheless be extended to all types of music (Val'kova, 1992)—like the Greek thema (proposition), which emerged in rhetoric theory and therefore suits any application related to semiotic representation where the perceiver needs to remember a particular expression essential for a musical work (Drabkin, 2001). Matters are complicated by the theme's salience. It might fall anywhere between two poles: "concentrated" (e.g., symphony) or "dispersed" (prelude) (Val'kova, 1992, p. 33), as is the case with centonization in plainchant (Treitler, 2007). Timbral music also utilizes themes of various salience; JH players "compose" by developing selected thematic material<sup>38</sup>.

PS culture relies on **monothematicism**: "personal leitmotif-leittimbre" represents the same person. PS users identify themes intuitively, without conscious differentiation between melody and lyrics, but nevertheless associate musical sameness with personality, and verbal diversity with environmental circumstances (Ojamaa and Ross, 2004). Worth noting is that altering a PS's melodic contour constitutes a change of ownership between family members, but altering a PS's words does not (Ojamaa, 2002). A person sings his/her PS without words when no contextual reference is needed, only a personal reference, such as missing a dear relative. However, changing words per se is considered changing the music. Typically, asking one to sing "another song" results in retexting the same **monotheme**. Only upon request to "sing like someone else" would a singer switch from that monotheme to the PS of the person mentioned (Alekseyev, 1988, p. 163–65).

"Timbral musicians" parse songs primarily by lyrics—not surprisingly, since phonemic oppositions constitute the timbral domain (Ojamaa and Ross, 2011)—rather than by pitch, like "frequency musicians" and non-musicians (Bonnel et al., 2001)<sup>39</sup>. Lyric-based parsing is common for Western infants (Lebedeva and Kuhl, 2010). Centering on lyrics generally characterizes the initial stage of acquisition of vocal musical skills (Welch, 1994). The earliest children's non-imitative singing surprisingly resembles an ekmelic PS with its stretchable intervals, formulaic structure with retexted lyrics, and private musicking to accompany various activities (Moog, 1976; Dowling, 1984; Bjørkvold, 1992; Campbell, 1998; Barrett, 2011; Koops, 2012). By 7 months, infants can differentiate the timbre of complex tones (Trehub et al., 1990) and remember timbre-specific information (Trainor et al., 2004).

If PS processing requires skills available to normal 1 to 2-year-old children, then amusia in timbral music societies

way affected the public judgment of his PS singing. This is in stark contrast to "bad" singing of Western classical compositions, which requires sustaining the pitch without fluctuations, retaining pleasant timbre, etc. (Sundberg, 1987).

<sup>35</sup>Although these terms originate from Western musicology, they are applicable to non-Western timbre-based music too. PS users reserve certain attributes of the melody and the timbre to make PS recognizable across all modifications. This is quite similar to what classical composers do in stage works (operas, ballets) and program music (tone poems and overtures). In both cases, the retained "leit-nucleus" marks the features of the "protagonist," whereas the changeable elements represent the "background" features, characterizing the context of action/happening.

<sup>36</sup>In Russian systematic musicology, the classic definition of "musical theme" is: "Musical idea that is fixed in musical structures, characterized by considerable completeness, salience, and originality—allowing one to recognize these structures upon repeated hearing—and that are often used continuously or reused within the same musical work" (Mazel, 1979, p. 151). Structurally, "thematicism" is the "totality of all themes employed in a work or in a group of works, which characterizes a particular music style" (153). "Thematic material" is the complex of musical structures employed toward making a given musical work, disregarding their completeness and salience (152).

<sup>37</sup>The notion of "musical theme" is often described as peculiar to Western classical music. This is not true. For instance, in South Indian Carnatic tradition, a musical composition is often segmented into 3 sections (Pallavi, Anupallavi, and Caranam) based on the treatment of the musical theme (Agrawal, 2018). The gamelan tradition not only utilizes the variation form, but also employs various forms of accompaniment to the principal theme (Miller and Williams, 2008).

<sup>38</sup>JH players choose a specific articulation (vowel or syllable), melodic intonation (a succession of pitches), imitation of a natural sound (birdcall) or a special technical device (staccato) for their specific expression. They use it either for repetition, variation, or contrast to some other thematic element – generating a sort of expressive "scenario" for a musical piece.

<sup>39</sup>Although Western listeners do integrate melodic and verbal information in lyrics prior to parsing a song's lyrics, they process verbal and musical structures separately while listening to familiar songs (Sammler et al., 2010). In contrast, for indigenous Siberian listeners, autonomous processing of pitch and text does not seem to occur under any circumstances. Of course, a final conclusion should be drawn only after experimental research on song perception by representatives of well-preserved "timbral cultures," such as the nomadic Nganasan or Evenk.

must be virtually non-existent. Indeed, native PS-processing skills are widespread and effective. There are accounts of Siberian non-musicians retaining memory of a once-heard PS for decades (Novik, 2003). Timbral music culture is definitely designed to support this personal thematicism. Niemi et al. (2004) consider the absence of collective singing and instrumental pitch-based accompaniment among Northern autochthons a consequence of the identification functionality of PS<sup>40</sup>.

## JAW HARP (JH) AS A PRINCIPAL MUSICAL INSTRUMENT OF EURASIAN TIMBRE-BASED MUSIC

PS provides a model for the "musicalization" of other sounds. Animal calls, wind, rain, etc., all are regarded as the "personal voices" of natural objects, and are characterized by specific timbres and pitch contours (Novik, 1999). Like human PSs, their monotheme could take different shapes to provide information about their "whereabouts": e.g., Evens distinguish between the sad, angry, and happy "talk" of the fire in a fireplace. Moreover, "fire-talking" is believed to interact with human speech (i.e., respond to human conversation). Every environmental/household object could interact with humans. And this subjects them to the same "personalization" as humans.

Most Siberian **phono-instruments** (Sheikin, 2002, p. 46–67) possess their own "PS": flask, spoon, or cane, each is easily recognizable by its sound. Environmental sounds too represent "personal owner-spirits."

**Examples-15/16**

Ex.15. Vyvko—the Nenets buzzer used to imitate wind. In the past this had to do with the rituals of calling on rain, but now it is primarily a children's toy, promptly made from a thread and a button (http://chirb.it/mcGarg).

Ex.16. Symysky—the Khakass male maral call, made from a piece of birch bark (http://chirb.it/8zt1tw).

To a human, such instrumental "PSs" form an "objective" sound-scene collectively representing the surroundings.

And, here, the JH emerges as the simplest archaic instrument capable of producing multiple "PSs" of natural objects—it is universally renowned for its onomatopoeia capacity (Maslov, 1911; Beliayev, 1933; Le Roux, 1950, p. 2:507–8; Picken, 1957, p. 186; Koizumi et al., 1977; Alekseyenko, 1988; Zagretdinov, 1997; Bulgakova, 2001; Sermier, 2002, p. 103; Alexeyev and Shishigin, 2004; Mamcheva, 2005; Canave-Dioquino et al., 2008; Yesipova et al., 2008; Suzukei, 2010; Kartomi, 2012, p. 159).

#### **Example-17**

Temir-khomus (Altaic JH) is imitating something that the human voice cannot—the sound of the water stream, Tuva (http://chirb.it/aa35p2).

Onomatopoeic versatility explains JH's popularity amongst children, which reflects their desire (typical in traditional societies) to resemble adults<sup>41</sup>.

Onomatopoeic vocalizations play an important role in early verbal acquisition (Menn and Vihman, 2011). Onomatopoeic words are easier to spot, remember, and comprehend (Laing, 2017). Similarly, onomatopoeic sounds facilitate comprehension of JH music, help in mastering the JH, and model new expressive means. The sounds one makes in learning the JH are no different than the babbling that comes from an infant learning to talk; both are playful explorations of the phonemic variants of a selected articulation (Davis and MacNeilage, 1995). The frame/content theory explains not only the acquisition of speech (Davis and Zajdo, 2008) but also of JH playing. If the vocal apparatus constitutes the infant's first "sound-toy" (Papoušek and Papoušek, 1995), for many timbral-based cultures the JH is the second. The "frame" of each JH syllable is filled with selective spectral content according to the same principles of phonetic symbolism that govern verbal acquisition (Shinohara and Kawahara, 2010).

The universal semantic values of ideophones and phonaesthemes (Svantesson, 2017), especially in vowels (Fischer-Jørgensen, 1978), apply to JH articulations. Thus, [i] implies nearness, narrowness, forwardness; [a] broadness, openness; front vowels tenderness, clarity, luminosity; back vowels firmness and concreteness<sup>42</sup>. Many associate sounds with colors/tints.

#### **Example-18**

The comparative demonstration of the principal khomus articulations: (a) front vs. (b) back, (c) high vs. (d) low vowels performed by Ivan Alekseyev. Each of these 4 "poles" of JH articulations is characterized by salience of a particular register in generating a respective "JH formant": 2.4–3.1 kHz for "front," 0.3–1.6 kHz for "back," 1.2–2.5 kHz for "high," and 0.6–0.9 kHz for "low" vowels. The most similar are the "back" and "low" vowels, distinguished by greater intensity of the lowest 1st, 2nd, 3rd, and 4th harmonics of the "back" vowel versus the "low" vowel which has much narrower bandwidth of its lowest formant (http://chirb.it/Az51G7).
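The four formant bands listed in Example-18 can be related to the harmonic series of the lamella with a short calculation. The following is only an illustrative sketch: the lamella fundamental of 100 Hz (`ASSUMED_F0`) and the helper name `harmonics_in_band` are assumptions made for the example and do not come from the recordings.

```python
# Formant bands (Hz) reported in Example-18 for the four khomus articulation "poles".
JH_FORMANT_BANDS = {
    "front": (2400, 3100),
    "back": (300, 1600),
    "high": (1200, 2500),
    "low": (600, 900),
}

# Hypothetical lamella fundamental; the source does not state this value.
ASSUMED_F0 = 100.0


def harmonics_in_band(fundamental_hz, band):
    """Return the harmonic numbers n for which n * fundamental_hz lies inside the band."""
    lo, hi = band
    n_max = int(hi // fundamental_hz)
    return [n for n in range(1, n_max + 1) if lo <= n * fundamental_hz <= hi]


# Harmonics that each articulation would emphasize for the assumed fundamental.
emphasized = {
    name: harmonics_in_band(ASSUMED_F0, band)
    for name, band in JH_FORMANT_BANDS.items()
}

for name, harmonics in emphasized.items():
    print(f"{name}: harmonics {harmonics[0]} to {harmonics[-1]}")
```

Under this assumption the "low" band selects a subset of the harmonics selected by the "back" band, which is consistent with the remark that these two articulations are the most similar.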

"Front" syllables are described as tense and unpleasantly "tart."

#### **Example-19**

<sup>40</sup>In Siberia and the Russian Far East, exceptions to strict monophony are scarce: only the practice of collective repetition of the leader's verse by the chorus of dancers in Yakut osuokhai and similar dances of the neighboring peoples (Alekseyev, 1967), and, perhaps, the singing of the shaman's sidekicks during his kamlaniye—a ritual of reaching spirits necessary to fulfill the client's need (Dobzhanskaya, 2008).

<sup>41</sup>Thus, Khalkha Mongol women teach their children to play JH using imitations of horse and camel gaits. This is closely related to the herding family business in which children become involved at an early age (Pegg, 2001).

<sup>42</sup>Ivan Alekseyev emphasizes that this generalization relies on the same synesthetic association of "smallness" with front vowels and "largeness" with back vowels that underlies phonetic symbolism of speech. JH articulations require constant control of all the constituents of the vocal apparatus, and reconfigurations of larynx and pharynx are usually done automatically—based on the reflexes established by speech.

"Front" syllables alone. They require tension in the throat, face, lips, jaw, cheeks and tongue—isolating the mouth chamber. Erkin Alekseyev, a distinguished khomus player and the director of the research department at the International Museum of Jaw Harp at Yakutsk, describes his sensations while playing or hearing this articulation as though tasting an extremely sour apple (http://chirb.it/HNwHpq).

"Back" syllables feel pleasantly relaxed.

**Example-20**

"Back" syllables. They require relaxation in the vocal apparatus. Erkin Alekseyev describes his sensations as "comfortable" to the extent of feeling "lazy" (http://chirb.it/MCzDkN).

"High" syllables resemble "smiling," from joyful to sarcastic.

**Example-21**

"High" syllables. They resemble "front" articulation, except that facial muscles remain relaxed. Erkin Alekseyev experiences this configuration as though "smiling" to oneself as in a situation when finding something funny. Emotionally, this state can be charged with joyfulness/playfulness (positive) or sarcasm (negative) – depending on how strained the larynx is (http://chirb.it/n7z749).

"Low" syllables are experienced as "sublime," yet constrained.

**Example-22**

"Low" syllables. They strongly activate the soft palate, configuring it into a "cupola." The majority of khomus performers associate it with the sound of a big church-bell—and imagine something sublime and lofty while playing or listening to it. Like "high syllables," sensing "low syllables" can appear "positive" or "negative"—depending on whether it is accompanied by the exertion of larynx and the discomfort resulting from this (http://chirb.it/6Iy1z8).

JH musicking explores a subject by chaining such elementary "meanings," interspersed with the referential meanings of onomatopoeic devices. In syyia tardy, expressing oneself via the JH is preferable to speaking/singing whenever one feels unclear about something and tries to figure out his/her attitude toward it. JH proficiency enables one to pick any thematic idea and track it, exploring one's emotional state—akin to self-directed speech, yet addressing subjects that are too intangible for verbal expression.

#### **Example-23**

Demonstration of a typical session of "playing for oneself" on Yakut khomus by Ivan Alekseyev (http://chirb.it/bg32m9).

Numerous North Asiatic ethnicities did not develop frequency-based musical instruments, probably because the JH satisfies the need for an objective reference frame and the PS for a subjective frame, thus leaving no void to fill.

## COMPLEMENTARITY OF PS AND JH PLAYING IN TIMBRE-BASED MUSIC CULTURES

The JH presents a counterpart to the PS. Onomatopoeia, citations of popular melodies, the use of conventional genre and stylistic features, and a "storytelling" compositional approach, common among JH players, all elaborate on environmental objects. Moreover, the JH is "anti-personal": it replaces the player's natural voice.

The JH's lamella makes women, men, the old, and the young sound the same despite all their physical differences. Its capacity to conceal the most telling source of personal information, the human voice, while supporting diverse mimicking, reflects the JH's power to "objectivize" sound. The camouflaging capacity of the JH goes a long way—it even allows aphoniacs to emit sounds (Shimomura, 2016).

Despite its "chameleonic" nature and different constructions, the JH remains easily recognizable (Ledang, 1972). Its "proprietary" timbre is evident in the practice of vocally imitating the JH with help from the fingers, a popular activity among Altai children (Fomin, 2018). Paradoxically, the "proprietary" timbre does not impede the imitation of other timbres; the JH makes an effective **vocoder** (Leipp, 1963). The lamella-generated tone provides a "carrier signal," and the vocal apparatus, a "modulator filter." What is unusual is the hybridization of different domains: the JH's modulator is human, while the carrier is instrumental.
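The carrier/modulator division described above can be illustrated numerically. The sketch below is a deliberately crude stand-in, not a model of a real jaw harp: the 100 Hz carrier, the equal-amplitude harmonics, the brick-wall band gain, and all function names are assumptions introduced for the example. Only the two band limits are taken from Example-18.

```python
import numpy as np

SAMPLE_RATE = 16000   # Hz, assumed for the sketch
CARRIER_F0 = 100.0    # Hz, hypothetical lamella fundamental
DURATION = 0.5        # seconds


def lamella_carrier(f0=CARRIER_F0, sr=SAMPLE_RATE, duration=DURATION):
    """Harmonic-rich tone standing in for the lamella (equal harmonics up to 4 kHz)."""
    t = np.arange(int(sr * duration)) / sr
    n_harm = int(4000 // f0)
    return sum(np.sin(2 * np.pi * f0 * n * t) for n in range(1, n_harm + 1))


def mouth_filter(signal, band, sr=SAMPLE_RATE, floor=0.1):
    """Crude 'modulator': full gain inside the band, attenuation to `floor` elsewhere."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
    gain = np.where((freqs >= band[0]) & (freqs <= band[1]), 1.0, floor)
    return np.fft.irfft(spectrum * gain, n=len(signal))


def spectral_centroid(signal, sr=SAMPLE_RATE):
    """Magnitude-weighted mean frequency, used here as a rough proxy for timbre."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))


def lowest_partial(signal, sr=SAMPLE_RATE):
    """Frequency of the lowest partial exceeding 1% of the strongest partial."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1 / sr)
    return float(freqs[np.argmax(mag > 0.01 * mag.max())])


carrier = lamella_carrier()
front = mouth_filter(carrier, (2400, 3100))  # "front" band from Example-18
back = mouth_filter(carrier, (300, 1600))    # "back" band from Example-18

print(lowest_partial(front), lowest_partial(back))
print(spectral_centroid(front), spectral_centroid(back))
```

In this toy setting the lowest partial stays at the carrier frequency for both filter settings, while the spectral centroid shifts with the band: the instrumental part fixes the frequency and the vocal part reshapes the timbre.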

• The JH is not an ordinary musical instrument that produces a desired tone; it is a musical "centaur": an organically indivisible human-instrument (Alekseyev, 1991b).

All JH constructions are "centauric": their instrumental component determines the frequency, while the vocal component—its amplification/attenuation (Dournon-Taurelle and Wright, 1978, p. 21). Since personalization comes largely from the vocal component, the JH becomes "impersonal." Totally unlike the PS, the JH hides the player's identity, whether in romantic serenading, shamanic rites, animistic hunting rituals, or children's pretend-games.

Vocoding generally relates to hiding. Electric vocoders were invented for ciphering military telecommunications to conceal the speaker's identity (Tompkins, 2010). In popular music, vocoders surfaced in urban genres that replaced a "humanistic" expression with a "robotic" one; they also penetrated other genres to conceal a singer's personal traits (gender, ethnos, class, age) that were perceived as potential vulnerabilities (Dickinson, 2001). Likewise, in animistic societies the JH reduces the risk of the player being identified and hurt while coming in touch with "spirits" (Popov, 1949, p. 265). In modern societies, the JH protects against eavesdropping (Morgan, 2008).

• The JH remedies the undesired ramifications of PS usage.

However, the JH offers a workaround for its camouflaging: it supports personalization through pitch contours rather than timbres. In the Russian Far East, a player engages individualized patterns, each of which exposes the JH "voice" in its own way—often enabling the recognition of a particular player (Mamcheva, 2012, p. 223). This applies to other Siberian instruments as well. The Udege perceive typical instrumental patterns as the "personal songs" of a specific instrument (Sheikin, 1982). JH patterns constitute "motifs" of "timbral music," equivalent to PS motifs; the equivalence becomes apparent whenever the JH borrows thematic material from a song (Sheikin, 2002, p. 7).

#### **Examples-24/25**

Ex.24. "Hyttya-hyttya, syrdyk kγ mmγ t" (Summer is coming), Yakut folk song, sung by Fedora Gogoleva, describes a bright sun looking out from the skies and sending its warmth to mark the beginning of summer. The song is based on a simple 3-degree formula with regular dance-like rhythm in the degeren style. However, timbrally, the music stays in the tangalai yryata style — "palatal singing," where the tongue is abruptly pushed to the soft palate while taking a loud inspiration (http://chirb.it/vpnhv1).

Ex.25. "Hyttya-hyttya, syrdyk kγ mmγ t" improvisation on khomus, Fedora Gogoleva. The same performer takes the melodic formula of the song above (Ex.24) and elaborates its motifs/articulations (naigryshi), arranged to form a proprietary JH composition that is intended to express joy and happiness. The JH version loses the traits of "palatal singing," but keeps the degeren style rhythm (http://chirb.it/dp391O).

JH personalization must have emerged after its vocoding capacities became established. As archaic cultures lose their animistic ideology, musical instruments turn into producers of a specific musical arrangement, purposed for a specific application, thereby obtaining distinct semantic value and becoming "informative" about their creator (Novik, 1998). Thus, Mansi use instrumental patterns along with PS to display the birthplace/kin of Bear Festival participants (Tchernetsov, 1971, p. 110). Mansi and Khants conduct annual riverboat races where musician crew-members perform patterns representing the locations through which the boat passes (Sheikin, 1990, p. 8–9).

Each musical instrument embodies a particular gesture abstracted from the performing motion, plus the timbre abstracted from that instrument's sound—to form a new modus of musical thinking (Zemtsovsky, 1987). Like any instrument, a musical instrument is a technological tool for the (re)-production of a desired sound/texture<sup>43</sup>. To earn the reputation of a folk instrument for a certain ethnos, the instrument must execute a culturally important function (Stockmann, 1987).

Ainu, Ulch, Evenki, and Indonesian JHs are surprisingly similar in construction, sound, and technique (Duvan, 2003). So are Udmurt, Bashkir, Tatar, and Komi JHs of the Kama region (Aleksandrova, 2017). Nearly identical are the ancient Japanese and modern Nivkh and Karelian JHs (Wright, 2001). Intercultural contacts can hardly explain the commonalities among areas so remote from one another, as attempted by Fischer (1986, p. 156). A more likely explanation is the psychoacoustic properties of the JH. There are four transcultural traits of JH music:


Common to all four traits is the representation of multiple entities by a single JH "voice."

The JH acts like a wild card, adopting different identities to benefit the player. For hunters, the JH attracts prey by imitating their sound (Alekseyenko, 1988), or by pleasing the master-spirit of a particular place to get his favor (Galdanova, 1987). For sweethearts, the JH unites them by appropriating the other's articulations (Haid, 1999). For children, the JH presents unlimited make-believe impersonations (Yekeyeva, 2014). All these applications disregard the player's identity and focus on the impersonation

<sup>43</sup>Other than having a peculiar timbre, each instrument is characterized by a specific pattern of sounds convenient for playing on it. Thus, mandolin is characterized by plucking tremolo, guitar by strumming chords, and cimbalom by hammering passages—each of which can be successfully imitated by other instruments (as in orchestrations of Sicilian, Spanish, or Hungarian style music). In musicological literature such characteristic patterns are usually put under the umbrella of "musical texture" (Frayonov, 1981). Most popular musical instruments have developed "glossaries" of such patterns that make it possible to tell a musical composition conceived for one instrument from one conceived for another (Kholopova, 1979).

<sup>44</sup>For example, maidens of the Bontoc tribe sing inside their houses in response to young men's JH playing outside, trying to guess the identity of the player (Canave-Dioquino et al., 2008, p. 437). Amongst the Li people of Hainan, by contrast, girls use JH to camouflage their voices in their confession of love to young men (Hsu, 2001).

<sup>45</sup>The capacity of JH to faithfully reproduce vowels and some consonants of a language is often employed to deliver words and phrases at the closest distance, so that the more remote occasional witnesses would not be able to eavesdrop on an intimate conversation. Thus, in Hmong tradition JH not only imitates linguistic phonemes, but also engages tone intonations (Poss, 2012).

<sup>46</sup>The extension of traditional beliefs in JH talismanic protection can be seen in the modern-day use of JH music for therapeutic purposes by the local Siberian population. Such is the program of "Meditative-imagery Jaw Harp therapy" at the Goriachii Kliuch Sanatorium in Krasnodar administrative district, where about 60% of patients report improvement from such problems as panic attacks, posttraumatic stress, acute depression, sleep disorders and irritable bowel (Kalinichenko and Alekseyev, 2009).

act using the JH as a "fit-all" entity. The JH's vocoding nature empowers it.

In archaic cultures, the JH opposes the PS mainly through the implementation of privacy:


Self- and world-orientations complement each other. Not accidentally, the JH and PS share the same territory (**Figures 1**, **2**).

## ANIMISTIC ROOTS OF TIMBRE-BASED MUSIC

Initially, the JH worked like a talisman, connecting its owner with a supernatural force via "magic contract" (Novik, 1998). The first JHs were most likely made from grass, e.g., the Nivkh konga-chnyr (Mamcheva, 2005); tree twigs, such as the Altaian/Tuvan taia/yyash-khomus (Modorov and Dvornikov, 2015); and splinters, such as the Yakut maskhomus (Sheikin, 2002, p. 119). It is exceedingly easy to make such instruments, even for a child; the materials are gathered rather than manufactured. The antiquity of JHs can be deduced from myths ascribing the invention of the JH to a bear—a totemic ancestor for numerous Siberian ethnoses (Startsev, 2017)<sup>47</sup>.

The JH might have originated from the "aeolian harp," the eerie sounds produced by wind hitting a splintered tree (Duvan, 2003). Siberians believe that lightning purifies trees that it strikes, and items made of their wood have protective power (Suzukei, 1991). Additional protection came from totemic phytomorphism: thus, Nivkhi drew their ancestry from larch, Ulchi from cedar, Oroks from birch, and Ainu from fir (Duvan, 2003). Plants were believed to breathe and, therefore, to possess a soul, possibly to even have PSs, just like animals (L'vova et al., 1989, p. 89). Live creatures could have PS.

**Example-26**

"Personal song" of reindeer Urdy, performed "on behalf of him" by his owner, Valentina Kosterkina, in Nganasan. The lyrics describe the reindeer's exhaustion from work, complaints about a dog that likes to bite his legs, and anticipation of a good rest (Dobzhanskaya, 2014, p. 124) (http://chirb.it/M46DrK).

Personalized songs and instrumental patterns are also assigned to spirits to attract their attention, and even musically reflect the mythological family relations between spirits (Gemuyev and Sagalayev, 1986, p. 68). "Spiritual" Ugric PSs were transcribed by Sheikin (1990). They were used to call a patronizing spirit for healing (Voldina, 2017).

Ancestor cults dominated most of Asia. Like mountains, rivers, and animals, plants too were believed superior to humans and could be totemic ancestors. Even today, Khakassian and Altaic kin groups consider larch and birch their ancestors (Sagalayev and Oktiabr'skaya, 1990, p. 50–59). Myths tell of trees nourishing or giving birth to human kin-founders (Tadysheva, 2018). A dying nearby tree is seen as a prediction of death for someone from the corresponding kin: fir for Jyc, pine for Kuzen, birch for Komdosh, aspen for Tubalar, and honeysuckle for Todosh (Kypchakova, 2006). Cutting ancestral trees is still taboo in Altai. Trees were actually used to trace genealogical "trees": thus, Irkit kin drew their origin from father-honeysuckle and mother-birch (Potanin, 1883, p. 7).

Zoo/phytomorphic ancestors are depicted on the Uralic metallic disks, 1st millennium BCE, from the Khanty-Mansi Museum (Gemuyev, 2000, p. 63–64); one of them depicts a "plant-woman," identified by ethnographers as mythological Por's ancestress, birthed by a bear after eating poryg (Heraclium sibiricum) (Tchernetsov, 1971, p. 91–93). Poryg and similar herbs are commonly used for making musical instruments in Siberia (Sheikin, 2002, p. 4–7); these are sacred for Por people. Such instruments probably served as kin talismans before becoming regional musical instruments (Vasilyev, 2016). Kongon (Leymus), used for grass JH-making, was also mythologically anthropomorphized into a woman (Kreinovich, 1928, p. 192), and was possibly ancestral to some kin.

Trees serve as kin markers for South Siberian Turks (Sagalayev and Oktiabr'skaya, 1990, p. 43), who distinguish an individual by membership in seok<sup>48</sup> (kin) and believe that the bones of co-members are made of the same "wood" type. Following the tree/forest paradigm, each person corresponds to some tree, and his kin to the same tree-species—in the forest that represents the entire ethnos (53–54). The correspondence of people-grouping to tree-grouping is also observed among Ugric ethnicities (57). In the 2010 census, 82 seoks were registered in the Altai Republic (Tyukhteneva, 2015). Standard Altaic identification includes a totemic animal, tree, and mountain (Tadysheva, 2016).

Within this identification system, a plant-made JH would represent an individual, kin, or birthplace by the instrument's "voice." Initially purely timbral, such identification eventually became "melodic," engaging personal patterns. This development likely came about through vocoding. Since a wife is considered

<sup>47</sup>Thus, Yenisei ethnicities consider JH to be the instrument of Kaigus'—a bear-like deity, master of all animals, who uses JH made of birch splinter to imitate voices of all animals and thereby exercises power to dictate to them (Sheikin, 2002, p. 126). According to myths, in order to secure prey, Kaigus' taught human hunters to play JH prior to hunting.

<sup>48</sup>The word "seok" in languages of Altaians literally means "kin," "generation," "bone," and has been used in reference to a specific form of societal exogamic patrilineal organization (Verbitsky, 1865). Seok's meaning of "kin" relates to its secondary meaning of "bone" through a more general meaning of "remains" (cemetery), where "bone" is understood as "the quintessence of live matter, capable of securing future births"—a trans-Siberian convention (Sagalayev and Oktiabr'skaya, 1990, p. 39). Hence, "bone" is something that is left from the deceased ancestor and that ties him/her with posterity: native Siberians believe that members of the same kin share the same bone ingredients (40).

estranged from her husband's kin even after marriage, and is forbidden by taboo to call her husband's totemic entities by name<sup>49</sup> (Tadysheva, 2018), articulating names on a JH or onomatopoeically hinting at them would circumvent the taboo. This could explain the female affiliation of the JH among many Ural-Altaic ethnicities. As a rule, men married foreign women (Pakendorf, 2007), who then became restrained by the taboo. Most archaeological finds of JHs in Ural vicinities come from personal belongings in female burials (Aleksandrova, 2017). Across Eurasia, the JH is one of few artifacts typical for burials, indicating the JH's connection to the supernatural and ancestral (Oleszczak et al., 2018). In the Far East, the JH is still used in lamentations for deceased ancestors (Sheikin, 2002, p. 364).

The ubiquity of the JH in Asia can be explained by its shamanic use (Emsheimer, 1986). Shamanism constitutes a unified religious system in the entire Uralo-Siberian area (Alekseyev, 1992). Samoyedic female shamans still use a JH instead of a tambourine (Alekseyenko, 1988), the primary shamanic instrument across North Asia. Around 2000 BCE, the entire area from Bactria to China was strongly influenced by shamanism (Francfort, 1994).

In Altai, novice shamans are restricted to playing the JH until they achieve spiritual maturity (for some never achievable) and receive a tambourine—a pattern reflecting the antiquity of the JH in shamanic practice (Vainshtein, 1991, p. 273). In Mongolia, shamans use a JH to initiate shamanic rituals and engage the tambourine only upon reaching a trance-state (Chuluunbaatar, 2016). This suggests that the JH's onomatopoeic and vocoding capacities make it more socially inclusive than the shamanic tambourine, as they do not require exceptional "supernatural" powers from the JH's user (Sheikin, 2002, p. 69–134). Surviving beliefs in the personal protecting capacities of the JH reveal its origin as an egalitarian "magic" tool, preceding the professionalization of

<sup>49</sup>A stranger uttering names of a person, his relatives, sacred objects of his kin, including sacred trees, is believed to potentially harm that kin. This taboo has imposed heavy restrictions on wives, who despite bearing their husbands' children, remained estranged from their husbands' kin—thus, until the XX century, an Altaic widow did not receive inheritance after her husband's death (unlike her daughter) (Yenchinov, 2009).

*yuukara*, Chukchi *chinitkin grep*, Koryak *sinkin ulikul* (nominal song), and *ikoleyavan kuli* (kin-song) (Sheikin, 1996), Kerek *kuligul*, Even *tiinmei* (singer's PS), *alma* (someone else's PS) and *ikaan* ("cover" PS that has turned into a popular song), northern Khanty *ar*, central Khanty *arae*, eastern Khanty *lulpany*, Nenets *khybants* (Sheikin, 2002). Comparison of Figure 1 and this figure demonstrates that PS and JH are shared by Mansi, Khanty, Selkups, Kets, Yughs, Kamasins, Evens, Dolgans, Evenks, Yakuts, Udege, Yukagirs, Chukchi, Kereks, Chuvans, Koryaks, Itelmens, Nivkhs, Ulchi, Nanai, Negidals and possibly Ainu. Of them all, Mansi, Khanty, Evens, Udege, Nivkhs, Itelmens and Koryaks have retained both traditions well. Area of their traditional habitat—the Urals, the Arctic Circle, Chukotka, Kamchatka, Okhotsk Coast, Sakhalin and Primorye—must delineate the stronghold of "timbre-oriented music." Prior to the Russian colonization and Chinese cultural influences, which have both been spreading out the "frequency-oriented music," this stronghold most probably included also Eastern, Western and Southern Siberia, where today both traditions, of JH and PS, are relatively weak.

shamanism and the establishment of the tambourine as one of its attributes.

JH musicking surpasses PS in its accessibility. Across northeastern Asia, toothless people use a hammer or ax to play JH (Tadagawa, 2017b).<sup>50</sup> In Yakutia, the ax and JH are even welded together<sup>51</sup>. The JH is used as an aid in alalia (Everstova et al., 2019). Complete aphoniacs play the JH (Shimomura, 2016)—and probably have been doing so for a long time in the Far East. The usage of JH-like devices (koouchyntzyy) to aid in aphonia was recorded during the Song dynasty (Picken, 1957). Such "prosthetics" highlight JH's great importance in native cultures.

Together, a JH and a PS provide "axes of coordinates" in the animistic worldview. The PS defines the "subjective" aspect as follows (Sheikin, 2002, p. 255):


These hierarchic levels also reflect an evolutionary progression, inferred from cross-examination of all existing Siberian traditions. Sheikin adopts the Even culture for a model, because it has retained all three

<sup>50</sup>Video recording of M. Angin, the elder of the village Auri, Ulchi district, playing JH with the help of an axe can be seen in the documentary film "Ulchi," made in 1967 by the Moscow Studio of Documentary Films (Duvan, 2003). This technique requires superimposition of axe's blade on the ring of a metallic JH. The same performance technique was also utilized by Nivkhs in the beginning of the twentieth century (Mamcheva, 2012, p. 54).

<sup>51</sup>Yakut master Piotr Krivoshapkin from the settlement, Khomustaakh (Namskii ulus), used to manufacture the JH-axe hybrid—which was filmed by a group of Czech filmmakers in 1967.

levels (263–8)<sup>52</sup>. In the past, Chukchi observed a similar hierarchy<sup>53</sup>.

An analogous three-stage model applies to the JH, defining its "objective" aspect:


This model might apply to other Siberian mouth- and phono-instruments.

The intra-cultural interaction of the PS and the JH promotes the emergence of JH patterns. Thus, Nivkh players often engage the same pattern for different compositions (Mamcheva, 2012, p. 222).

#### **Examples-27/28**

Ex.27. Nivkh JH PN by Vera Khein, as presented in her original improvisation which she named "The flatfish dance." This piece, like many others that were recorded by Natalia Mamcheva from Vera Khein, is based on her personal motif consisting of two descending intervals: of a 3rd and a 2nd (e.g., E-C-Bb), where the latter is often marked by the shorter rhythm of the middle tone (C) and its multiple repetitions, marking the lowest tone (Mamcheva, 2012, p. 297–300) (http://chirb.it/aIrCvG).

Ex.28. Nivkh JH PN by V.M. Persina, as presented in her improvisation (No. 130 from Appendix 1, Mamcheva, 2012, p. 301). It uses 2 motifs: of an ascending 2nd (passing C-D-E or auxiliary C-D-C) and of an ascending auxiliary 3rd (E-G-E), where C and E are marked as anchors, and the middle tones (D and E) are often given shorter rhythmic values (http://chirb.it/2L65H3).

Nivkhs also differentiate JH patterns by specific timbral words matched to a genre<sup>54</sup> (Sheikin, 2002, p. 132).

**Example-29**

The Nivkh rhythmic formula "Kan Vai," typical for various instruments and vocal music, in JH music is characterized by the onomatopoeic dog-like articulation of "khav-khav" (Mamcheva, 2012, p. 119). This formula distinguishes the genre of a dog-racing music, used in festive competitions that are held by Nivkhi, Ulchi, and Negidals. This genre is also used during the sacrifice of a dog in an annual bear festival (103) (http://chirb.it/q2xxbJ).

The amalgamation of non-onomatopoeic JH devices must have accompanied the historic development of PS: timbral themes prototyped **timbral words**. Many indigenous Siberian cultures employ specific vocables<sup>55</sup> as ethnic/territorial markers (Sheikin, 2002, p. 251). Their timbral and rhythmic makeup could direct JH improvisation toward elaborating standards. Thus, Yakut khomusists frequently engage the timbral vocable "hyttya." Timbral words comprise "timbral phrases," employed as in magic incantations (Tchernetsov, 1987, p. 35)<sup>56</sup>. Animistic hierarchy somehow transforms into syntactic hierarchy.

As the traditional lifestyle modernized, animism and totemism lost relevance, along with the need for anonymity. Subsequently, playing the JH acquired melodic/rhythmic/timbral patterns idiosyncratic for a player. They differ from a PS by engaging a repertory of playing devices rather than a single monothematic device.


#### JAW HARP ARTICULATIONS VS. SINGING AND SPEECH

Musical and verbal vocalizations adopt opposite parsing strategies (Alekseyev, 1993):


<sup>52</sup>Structurally, Even classification is based on the relation between music and words. Tiinmei ("to speak to oneself") is an improvisatory song, created strictly for oneself, using timbral words mixed with spontaneous prosaic mundane texts. Ikaan ("song") can be performed for others. It uses cliché fragments of poetic speech which, however, do not comprise poetic verses. Alma ("recollection") is a "cover-song"—it reproduces someone else's words and/or musical themes and is retained as a ready-made "plot" in the memory of the following generations. Alma consists mostly of poetic speech.

<sup>53</sup>Chukchi classification is based on social grouping. Chinitkin grep conveys one's personal identity ("I, who sings my song"), roiyr'yn grep reflects one's membership in the family ("We, who share a home, sing as I do"), chygrymngyat grep presents the entire clan of related families ("We, who share fire, sing songs to ancestors as I do") (Sheikin, 2018, p. 257, 41). The latter type disappeared in the mid-twentieth century.

<sup>54</sup>JH personal instrumental pattern for dog-riding imitates barking with short tones articulated on syllables "khai t'ok khai-nak-nak" or "khavr-khavr khaf-khaf-khaf," while JH pattern for self-entertainment is based on sustained monotones on "ko lyo-lyo-lyo," with vibrato of the lower lip or tip of the tongue.

<sup>55</sup>For example, all melodies of Ude songs include words "yava-yava," whereas Nivkh songs—"anga-anga" (Sheikin, 2002, p. 250–1). Such words differ from "regular" meaningful words by being verbally meaningless and conveying semantic information through phonic symbolism alone. Often, they require a specific rhythm and melodic contour, thereby exercising a formative influence on the melody.

<sup>56</sup>Thus, in his diary from 1926 to 1927, Tchernetsov cites the meaningless sentence "Gav-ri-ke, gav-ri-ke, vas pyrish, vas pyrish!" that he heard continuously repeated on the JH (tumran) by a Northern Mansi girl in a remote Sos'va settlement.

<sup>57</sup>For example, a syyia tardy can be dedicated to a subject such as the representation of a young man and can present an "aesthetic" contemplation on such a "topic." The JH player envisages a protagonist (here, a young man) and fantasizes a storyline, more or less cohesive and tangible, that is not always describable in words. Such a composition is "about someone else" ("objective"), as opposed to the typical PS, which is "about the singer" himself/herself ("subjective").

The JH's vocalization introduces tonal homogeneity and monotony. Both are considered detrimental to speaking and singing, reducing the intelligibility of their prosody. JH prosody is unaffected by this. It relies on contrasts only of the mid/high-frequency spectrum, confining monotony to FF.

In frequency-oriented cultures, singing requires timbral uniformity while stressing the FF changes (Titze, 1988). Voicing every consecutive utterance with a definite pitch per se homogenizes the spectrum of sung tones<sup>58</sup>. Like singing, verbal intonation engages FF changes even in non-tonal languages (Ladd, 1996), while stressing spectral contrasts, as JH does.

• The JH **combines singing and verbal approaches**. Its monotony conceals the player's identity yet preserves all articulations: "timbral words" become monotonously "flattened." Such "flattening" presents a form of timbral equalization similar to singing.

Both singing and speech vowels stress FF (**Figure 3**). But speech's clarity depends more on frequency **changes** of F1 and F2 (Zsiga, 2013) that are sufficient for distinguishing vowels (Carlson et al., 1970). Singing **freezes** F1 at about the same frequency level for all vowels (Sundberg, 1974).

• Singers intuitively match vowels. Speakers oppose them. So do JH players—but only in relation to the "**residual** tone," not the "**fundamental** tone."

The concept of "residual pitch" was introduced by Schouten (1940) and adopted by modern theories of harmonic perception (Goldstein, 1973; Terhardt et al., 1982; Moore et al., 1984, 1985, 1986; Houtsma and Smurzynski, 1990; Moore, 2012; Norman-Haignere et al., 2013). The sum of the unresolved partials that comprise "pitch residue" is perceived as a pitch identical to FF, but harsher in timbre (Schouten et al., 1962).

Unresolvability does not prevent pitch perception, because pitch can be analyzed by an alternative mechanism: through the autocorrelation of the nerve impulses that the partials activate in phase (Licklider, 1951). Despite its complexity, such hearing is exceedingly common (Moore, 2012, p. 217). However, unlike FF recognition, it requires learning (Terhardt, 1974). Hence, periodicity recognition constitutes a cultural phenomenon.
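The autocorrelation idea can be illustrated at the level of the signal itself. The following is a minimal sketch, not a model of neural processing: it assumes a hypothetical "residue" made of harmonics 3, 4, and 5 of 100 Hz, with no energy at 100 Hz itself, and searches for the lag at which the waveform best matches a delayed copy of itself. The sample rate, search range, and function names are illustrative choices and do not come from the cited studies.

```python
import math


def synthesize(partials_hz, sample_rate, n_samples):
    """Sum of equal-amplitude sine partials; no energy at any other frequency."""
    return [
        sum(math.sin(2 * math.pi * f * n / sample_rate) for f in partials_hz)
        for n in range(n_samples)
    ]


def autocorrelation_pitch(signal, sample_rate, min_hz, max_hz):
    """Return the frequency whose period yields the strongest autocorrelation."""
    min_lag = int(sample_rate / max_hz)
    max_lag = int(sample_rate / min_hz)
    window = len(signal) - max_lag  # same number of products for every lag
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(signal[n] * signal[n + lag] for n in range(window))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


SAMPLE_RATE = 8000  # Hz, an arbitrary illustrative value
# Hypothetical residue: harmonics 3, 4 and 5 of 100 Hz; the 100 Hz partial is absent.
residue = synthesize([300.0, 400.0, 500.0], SAMPLE_RATE, 800)
estimate = autocorrelation_pitch(residue, SAMPLE_RATE, min_hz=60.0, max_hz=500.0)
print(estimate)
```

Under these assumptions the strongest lag corresponds to the common period of the three partials, so the estimate falls on the missing 100 Hz fundamental even though the signal contains nothing at that frequency. The search range is deliberately limited to a single period, since every multiple of the period would match equally well.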

• JH prosody engages monotony to facilitate residual pitch analysis<sup>59</sup>.


JH users distinguish one articulation from another most likely by learning and memorizing a number of "harmonic templates," pairing them with the perceived sounds in a search for the best match (Goldstein, 1973). Such templates are formed by a long-term accumulation of statistical data, and tend to rely on the harmonic series (Shamma and Klein, 2000).
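The search for a best-matching template can be sketched in a few lines. This is only a toy illustration of the general idea, assuming that a template is an ideal harmonic series on a candidate fundamental and that the fit is scored by how far each observed partial lies from its nearest harmonic. The scoring rule, the candidate grid, and the partial frequencies are hypothetical and are not taken from Goldstein's model.

```python
def template_mismatch(partials_hz, f0_hz):
    """Mean distance of each partial from the nearest harmonic of f0, in units of f0."""
    total = 0.0
    for partial in partials_hz:
        nearest_harmonic = max(1, round(partial / f0_hz))
        total += abs(partial - nearest_harmonic * f0_hz) / f0_hz
    return total / len(partials_hz)


def best_template(partials_hz, candidates_hz):
    """Pick the candidate with the smallest mismatch.

    Ties go to the higher fundamental, because every subharmonic of a
    perfectly fitting fundamental also fits perfectly.
    """
    return min(
        candidates_hz,
        key=lambda f0: (round(template_mismatch(partials_hz, f0), 9), -f0),
    )


# Hypothetical observed partials and an illustrative 1 Hz candidate grid.
observed = [500.0, 600.0, 700.0]
candidates = [float(f) for f in range(50, 301)]
best = best_template(observed, candidates)
print(best)
```

Here both 50 Hz and 100 Hz fit the three partials exactly, and the tie-break selects 100 Hz. A listener's learned templates would presumably tolerate mistuned partials as well, which this sketch handles only through the graded mismatch score.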

Musical templates might be influenced by phonological templates that are generated by the pitch analysis of the tonal attributes of speech (Schwartz et al., 2003). JHs made of organic materials usually deviate from the harmonic series in the tuning of partials. A JH can also omit a harmonic (usually f2), or generate a non-serial partial or a subharmonic (see Appendix-1 in **Supplementary Material**).

• JH's "centauric" nature makes JH's prosody **hybrid** (Alekseyev, 1991b).

The **equalized** FF portion of the JH spectrum resembles the singing formant (Sundberg, 1974), while JH's "residue" portion, in contrast, marks every **opposition** of phonemes. Hence, the JH can "talk" AND "sing" **sequentially** as well as **simultaneously**. JH prosody engages two different domains, vocal and instrumental, where the vocal controls the dynamics and the instrumental the frequency of the harmonic components (Trias, 2010). Each domain introduces its own formant(s) (**Figure 3**):


JH vowels retain F1 at the same level (per instrument), but move vocal formants. Each vowel has its unique harmonic signature comprised of:


The border between the "instrumental" and "vocal" ranges lies at ≈f6–f8. Therefore, we must introduce a new concept of "**harmonic base**" to encompass f1–f4(f7) in contrast to "**harmonic residue**."

<sup>58</sup>We must underline that the homogeneous nature of singing manifests itself only in frequency-based musical traditions. Vocalizations of timbre-based cultures, including PS and vocal "games," such as Inuit katajjaq or Ainu rekkukara (Nattiez, 1983), abide by principles of timbral contrast, and are considerably richer than conventional frequency-based vocal music. Such timbrally-rich vocalization styles are used by Chukchi, Nganasan, Koryak, Kerek, Even, Enets, Yukagir, Nanai, Oroch, and Ude (Sheikin, 1996). As Sheikin demonstrates in his analysis of these styles, their TO is based on the binary opposition of two "timbral classes": inspiratory and expiratory rasping, each of which is varied by employing different syllables and intonations. There can be as many as 20 "timbral classes" (e.g., the pilche'ingen vocal system of Chukchi).

<sup>59</sup>That is why multiple attempts to modernize the traditional JH constructions by adding extra lamellas or joining a few JHs on a single "JH farm" did not really catch on with the majority of JH players. Western European players of multi-pitched JHs subscribe to the model of Western tonality borrowed from classical music. Chinese players follow the model of the traditional Chinese music system based on yayue. Like Western tonality, this constitutes a type of **tonality**, that is, frequency-based music (as opposed to timbre-based indigenous JH traditions).

FIGURE 3 | Characteristic acoustic features of the JH vowel articulations. This figure demonstrates the importance of harmonic structure for the JH vocal system. The numbers on the curves mark the harmonics. The vertical axis indicates amplitude, and the horizontal axis frequency. (A) Power spectra of the vowel [a] performed on khomus, sung and spoken. (1) Khomus. There are 145 harmonics discernible above the noise floor, peaking at f9. (2) Singing (bass). There are 23 active harmonics, peaking at f1. (3) Speaking. There are 19 active harmonics, peaking at f1. Weakness of formants seems to be part of the indigenous Yakut prosody, judging by the recordings of native Yakut speakers we had at our disposal. In their number of active harmonics, position of the loudest harmonic and distribution of peaks, (2) and (3) are harmonically closer to each other than to (1). JH reserves higher intensity for odd harmonics, as opposed to speaking and singing. The bandwidth of harmonics is significantly narrower for khomus than for singing and speaking. If the bandwidth is measured at the harmonic's baseline level, the khomus f2 occupies a range between 141 and 151 Hz, making 117 cents (a semitone). The f2 of the singing "A" ranges between 243 and 297 Hz (349 cents, a neutral third). And the f2 of the speaking "A" ranges between 197 and 238 Hz (325 cents, a narrower third). The last notable difference is the much lower noisiness of the khomus "A," easily detectable in the absence of the "furry" looking jiggles in the contour line. These jiggles are very prominent in the singing of "A" and even more so in the speaking of "A," especially >f7. The recordings were made on the recorder Ritmix RR-989, with the sampling rate of 44.1 kHz and the bit depth of 16 bit, and analyzed by the software application RX Pro by iZotope with the following settings: 262144 samples FFT size, Hann window, 4x time overlap, average channel mode, 80 dB/s decay, extended log frequency scale.
The peak at 15.6 kHz reflects a permanent background noise in the room. The analysis was the average of 4 versions of (1), (2) and (3). (B) Power spectra of "active harmonics" of all Yakut vowels articulated on Yakut khomus. The vowels are numbered according to their perceived "pitch values" in an ascending scale (Ogotoyev, 1988), according to the convention common amongst Yakut JH players—which generally coincides with the comparative vowel's height. JH's "instrumental formant," F1, marked by the yellow rectangle, falls on f5 in all vowels except "Ü" and "I," where F1 broadens to f3–f5. In contrast, "vocal formant," F2, marked by the red rectangle, falls on different harmonics in 6 vowels: f13 for "O," f17 for "A," f23 for "Ö," f19 for "Y," f21 for "Ä," and f21-f23 for "I." Only 2 vowels, "U" and "Ü," engage the same harmonics as, respectively, "O" and "Ö." However, "U" differs from "O" by a smoother shape and consistent dynamic prevalence of odd over even harmonics (>f13). "Ü" differs from "Ö" by an even greater contrast of odd/even harmonics and louder high harmonics >f47. Altogether, the harmonic composition of F2 varies by the number of active harmonics, the ratio of the intensity of odd to even harmonics and the magnitude of dynamic "dropouts" amongst the adjacent harmonics in the harmonic series. Therefore, each vowel possesses its unique harmonic configuration of F2. The recording was made as in (A). The audio clip representing each vowel was selected out of a pool of 15 performances by 3 performers on 3 JHs based on averaging the dynamic values for the lowest 11 harmonics and finding the closest match to the average harmonic profile for each of the vowels.
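The bandwidths quoted in cents in panel (A) follow from the standard conversion of a frequency ratio into cents, 1200 · log2(f_high / f_low). The short sketch below applies that conversion to the band edges given in the caption. The dictionary keys and function name are illustrative. Because the caption reports the band edges in whole hertz, the recomputed widths can be expected to agree with the quoted values only to within a few cents.

```python
import math


def interval_cents(low_hz, high_hz):
    """Width of the interval between two frequencies, in cents."""
    return 1200.0 * math.log2(high_hz / low_hz)


# Band edges of f2 for the vowel "A", as quoted in the Figure 3 caption (Hz).
bands = {
    "khomus": (141.0, 151.0),
    "singing": (243.0, 297.0),
    "speaking": (197.0, 238.0),
}

widths = {name: interval_cents(low, high) for name, (low, high) in bands.items()}
for name, cents in widths.items():
    print(f"{name}: {cents:.1f} cents")
```

The same function can be used to compare the bandwidth of any other harmonic once its band edges have been read off a power spectrum.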

JH formants differ from speech/singing formants for the same vowels (Leipp, 1967). A JH vowel engages six times more "active" (i.e., salient) harmonics than singing, and almost eight times more than speaking. F1 in the JH is much softer than in speaking and singing. The JH's F2 is much stronger. The JH's harmonics display a much greater odd/even contrast than singing/speaking vowels do. JH harmonics are much narrower in bandwidth and are compromised by less noise.

**Harmonicity** (Plack, 2010), which in the JH has no direct parallel in singing and is altogether absent from speech, is definitely crucial for JH prosody (Nikolsky et al., 2017). Very likely, harmonicity/inharmonicity forms the principal axis in the discrimination of JH articulations (Nikolsky, 2017). Indigenous users must have high sensitivity to the harmonic structure of sounds. Thus, Nanai believe that everything sung will certainly be heard by spirits who do not recognize human speech (Bulgakova, 2000). The harmonicity of singing could be responsible for this.

The hybrid nature of JH prosody poses the question: is it derivative of speech, like a grammelot (Jaffe-Berg, 2001); of singing, like a kazoo; or is it ancestral to both language and music?<sup>60</sup>

<sup>60</sup>This matter is thoroughly examined in the acoustic and ethnographic study that is scheduled for publication by the state People of the World Khomus Museum and Center in Yakutsk, Russia next year.

## ESTABLISHING THE TIME AND PLACE OF ORIGIN OF INDIGENOUS EURASIAN JH MUSIC TRADITIONS

Archaeological evidence points to Inner Mongolia as JH's homeland (**Figure 4**).

The oldest JH belongs to the Lower Xiajiadian culture, 2146–1029 BCE (Kolltveit, 2016). Its proximity to the Manchurian sites where the Neolithic "singing masks" were found (Okladnikov, 1971) indicates that JH articulations might have originated from timbral singing (**Figure 5**). "Singing masks" always stress the configuration of mouth, nose, and eyes in a single face, omitting the body; in contrast, the animalistic art of the neighboring Siberian cultures presents the entire animal, and usually in groups (Okladnikov and Mazin, 1979, p. 86). Each "mask" depicts a specific vocal articulation charged with an emotional expression that may represent an ancestral attribute (Sheikin, 2002, p. 259).

Modern Nanai tell myths about a Skull Horse-rider and paint burial idols' faces to house dead souls, closely resembling skull-like Sikachi-Alyan petroglyphs (Okladnikov, 1979). The Nanai habitat includes Manchuria, and the JH remains their principal instrument. "Singing masks" could capture the symbolic meaning of a particular articulation and assign it to an ancestral figure as its "call"—like an animal call ascribed to a totemic ancestor. The model for this could be military calls, urany, used by Turkic peoples to distinguish a kin (e.g., Yakut "urui!") up until the nineteenth century (Sagalayev and Oktiabr'skaya, 1990, p. 21). Such calls were believed to possess supernatural power, making them suitable for the vocoding "protection."

Vowel harmony characterizes most languages of the Altaic Sprachbund (Ko et al., 2014), especially of the Turkic language family (Gadzhiyeva, 1997). There, the Yakut vocal system (Antonov, 1997) is considered the most ancient, representative of proto-Turkic (Tenishev, 1984) and proto-Altaic languages (Tenishev, 1997). The acknowledged paleolinguistic studies date the breakup of proto-Altaic to 8000–10000 (Andreyev and Sunik, 1982), 7000 (Starostin, 1991), or 5000 BCE (Robbeets, 2015, p. 506), earlier than the earliest of the Sikachi-Alyan images (Okladnikov, 1971, p. 83–89). "Singing masks," then, could capture the articulation representing a totemically important meaningful "harmonic" word, such as "urui"<sup>61</sup>.

"Singing masks" spread over Amur/Ussuri, Baikal, Lena, and Aldan, dated 3000–100 BCE (Okladnikov and Zaporozhskaya, 1972; Okladnikov, 1974; Devlet, 1976, 1980; Leontyev, 1978; Arkhipov, 1989). This distribution mostly coincides with the oldest archaeological JH finds and modern areas of greatest JH popularity. The northern Far East became culturally isolated around the Bronze Age and retained Neolithic traits until the seventeenth century Russian colonization (Andreyeva, 1987), effectively conserving the archaic traditions.

"Singing masks" are also depicted on Malyshevo ceramics, 2000 BCE (Okladnikov, 1968, p. 50). Similar Neolithic stone and clay masks were excavated in Primorye (Brodianskii, 1978). In Yenisei burials, near Sayan, "real" gypsum masks were found, manufactured ≈2000 BP, probably following some earlier tradition of mask-making from easily decomposing materials (Devlet and Devlet, 2011, p. 310–12). Nenets, Khanty, Mansi, Evenk, Udege, Kumandin, Shor, and Buryat shamanic masks likely represent the same tradition.

Closely related are tsam masks (335–37). Tsam is a theatrical ritual dance, performed in Lamaistic monasteries across Buryatia, Tuva, Tibet, Nepal, and Bhutan. To distinguish the tsam protagonists from each other, their masks express different emotions, assisted by melodies/rhythms that are specific for each protagonist. Shamans too accompany mask-wearing with vocal and theatric impersonations of evil or good spirits (Zabolotskaya, 2011). Instead of masks, Tuvan shamans use maskoids—small, mask-like plates, mounted on a hat or knitted, called "human face" (Devlet and Devlet, 2011, p. 313). Maskoids usually reproduce conventional face-masks and represent the ancestral spirits guarding the maskoid's owner (Avdeyev, 1960).

The mask's protective function (Ivanov, 1975) resembles JH's "masking": both are employed by shamans to repel evil spirits.

"Singing masks" likely belong to the cultural Neolithic/Mesolithic "confederacy," stretching from the Volga-Ural region to Tibet and China (Leontyev et al., 2006). Its influence may have extended farther to Southeast Asia and Oceania (Okladnikov, 1968, p. 52–64). In this entire area the bamboo/wooden JH remains the principal musical instrument.

Archaeological data confirms the vegetative origin of the JH. All BCE finds, except for Perm's, occurred in Asia (**Table 1**). The JH crossed the Urals into the Volga plains ≈200 BCE (Aleksandrova, 2017), thereafter heading toward Hungary and Ukraine. Finno-Ugric peoples might have carried the JH to Northern Europe.

Archaeological finds form 6 clusters: Mongolia→China, Mongolia→Baikal, Baikal→Ural/Volga, Volga→Ukraine, Manchuria→Japan, and Western Europe. Bamboo/bone JHs comprise two of the oldest clusters. Ural/Volga and Manchuria/Japan combine both organic and metallic JHs. European JHs are exclusively metallic. This clustering is exceedingly important, because metallic and organic JHs significantly differ in **spectral texture**.

## SPECTRAL TEXTURE AND ITS DEPENDENCE ON TECHNOLOGICAL EVOLUTION

The concept of musical texture (Skrebkova-Filatova, 1985) has not been applied to JH music. However, JH players definitely arrange spectral components in a way similar to parts in polyphonic music. Polyphonic thinking is common in a great many indigenous music cultures (Jordania, 2006), timbral music included. Tuvan vocal "solo-polyphony"

<sup>61</sup>In Yakut language exclamation "urui," especially common for shamanic verses, is used in prayers for well-being, repelling evil, expressing a jolly feeling, calling for fighting, or announcing an end to something (Pekarsky, 1959, p. 3070).

FIGURE 4 | Worldwide geographic distribution of archaeological finds of JH, dated prior to the fourteenth century, and of artifacts that depict faces performing "vocal articulations," dated c.3000–100 BCE. This map is designed to reflect JH's distribution pattern throughout Eurasia prior to the fourteenth century, the beginning of JH mass production, trade inclusion and, subsequently, export to European colonies. Numbers in red reflect the chronological order of JH finds. Their dating and publication sources are referenced in Table 1. Circles mark the geographic clustering of finds and are accompanied by their dating range. Blue marks JHs made of natural organic materials (bone, bamboo, wood), brown marks metallic JHs, and green marks mixed finds (organic and metallic). The most ancient finds cluster within Inner Mongolia and are all organic. Next in the timeline are Southern China and Altai, both of which also house organic JHs. Next again is the cluster covering the Ural Mountains and Volga plains. It includes mixed materials. Two subsequent areas to the West, Ukraine and Northern Europe, feature exclusively metallic instruments. Manchuria and Japan form another cluster, concurrent with the Ukrainian, and feature mixed materials. The black face icon marks the archaeological sites containing petroglyphs, sculptures, masks or ceramics that depict human faces with open mouth, the "singing masks" (Okladnikov, 1968, 1971; Okladnikov and Zaporozhskaya, 1972; Devlet, 1976, 1980; Okladnikov and Mazin, 1976; Brodianskii, 1978; Leontyev, 1978; Arkhipov, 1989; Kochmar, 1994; Devlet and Devlet, 2011) exemplified in Figure 5. Most singing-mask petroglyphs precede JH finds and concur with the Shuiquan JH in Inner Mongolia. The oldest petroglyphs are located at Sikachi-Alyan, Amur, 3rd millennium BCE (Okladnikov, 1971, p. 83–89), and the most recent at the northernmost location in Yakutia, at the Olenek River (Arkhipov, 1989; Kochmar, 1994).
Geographically, the locations of "singing masks" coincide with the JH finds in two regions, Tuva and Primorye, but chronologically precede the JHs excavated in those locations, suggesting that JH articulations grew out of timbral singing, probably of PS. The absence of "vocal articulations" in Inner Mongolia could be explained by a lack of attention from Chinese archaeologists, who may not distinguish such pictures from other images [in the same vein, Chinese archaeologists had difficulties recognizing JHs in the unearthed materials of a number of sites (Shulga, 2015; Kolltveit, 2016)]. On the other hand, the absence of JH finds in Central and Southern Yakutia and Kamchatka, where images of "vocal articulations" were found, might be due to the wet climate that makes preservation of organic materials unlikely.

is well-known. JH traditions also engage 2- and even 3-part polyphony (Brodsky, 1972). Dobzhanskaya (1991) pioneered the polyphonic notation<sup>62</sup> of JH field recordings; an earlier attempt had been made by Pugh-Kitingan (1977) for a Huli JH excerpt. Polyphonic analysis must be used to investigate JH textural typology. It is hardly possible to comparatively study JH traditions without accounting for the distribution of thematic material within the spectral texture.

The JH's texture (i.e., the structure of the harmonic residue and harmonic base) depends on the instrument's material (**Figure 6**, see the archive "23 figures for Appendix," in **Supplementary Material**, for higher resolution images).

Grass and metallic bow-shaped JHs form two poles of textural arrangement. A grass JH generates **maximal polyphony**: five undifferentiated melodic parts moving simultaneously and only partially controlled<sup>63</sup>. A metallic JH generates **homophony**: a single melody with accompaniment. Bamboo/wood/bone

<sup>62</sup>She distinguished between 4 textural layers in the repertoire of Chuvan JH music for frame-shaped JHs: (1) the FF; (2) its subharmonic chest resonance an octave below; (3) the middle melodic layer above FF controlled by the player's breathing and tongue position, exhibiting "talking" qualities; and (4) a high melodic layer that changes the sustained phonemic formants which resembles singing.

<sup>63</sup>There is experimental evidence that non-musicians can simultaneously track no more than 3 concurrently running polyphonic parts, and trained musicians up to 4 (Stoter et al., 2013). Although this was established in relation to frequency-based music, it can also apply to timbral-based music since most known traditions of vocal and instrumental timbral music too use no more than 3–4 parts. It is reasonable to expect the JH players to consistently and reliably control no more than 2–3 melodic parts. This assumption agrees with the anecdotal information from JH players.

FIGURE 5 | "Singing masks" from Sikachi-Alyan, Amur river, dated c. 3000 BCE (Okladnikov, 1971, p. 83–89) and their phonetic interpretation. These 9 images were selected by Sheikin (2002, p. 259), who believed that they depicted pronunciation of vowels in order to perpetuate a model for articulating sacred names, important for the ancestor cult. Such images are found near Amur river along the border between modern Russia and China, and differ from most East-Northern Siberian petroglyphs by focusing not on animals, but on humans. Specifically, these faces emphasize the configuration of open mouth and the emotional expression in an individualized manner where no one image reproduces any other image (Okladnikov, 1971, p. 85). Similar in style, "singing masks" are found in few isolated sites to the West of Primorye, in Tuva, and to its north, in Yakutia and Kamchatka, suggesting the enormous expansion of a Neolithic Far Eastern culture during the Bronze–Iron Ages (Andreyeva, 1987). Each mask could hint at a particular sacred word tabooed for public use, similar to the current practice in the Altai region. The vowel harmony, omnipresent in Uralic and Siberian languages (Kiyekbayev, 1972), especially strong in Turkic languages (Gadzhiyeva, 1997), in effect, encodes the configuration of the mouth for the entire word, since all its syllables are bound to reproduce the same vowel (e.g., Yakut military call "*urui*"). Sheikin's set of images is identified as pictorial representation of a vocal system of a hypothetical "Amuric" Paleo-Asiatic language by I. Seliutina, N. Urtegeshev, T. Ryzhykova and A. Dobrinina from the Laboratory of Experimental Phonetic Research at the Institute of Philology of the Siberian Department of the Russian Academy of Science. Based on the methodology of the founder of this institute, Nadeliayev (1960), this lab has collected an extensive database of articulations of basic phonemes of Siberian languages (Seliutina et al., 2012). Although the lack of uniformity in Neolithic images limits the researcher to estimating the gradations in labialization and only a gist of its palatability, Nadeliayev's classification of labialization (Nadeliayev, 1980) allows the phonologist to identify vowels based on the ratio of mouth width to height. The "Amuric" vocal system is presented in the phonetic alphabet developed after Nadeliayev by 9 of his pupils: I. Seliutina, A. Urtegeshev, A. Letiagin, A. Shevela, A. Dobrinina, G. Esenbayeva, A. Savelov, M. Rezakova, and Yu. Ganenko (Seliutina et al., 2012). The basic vowels from this alphabet are shown in a table at the bottom of the figure. The "singing masks" are numbered in the order of the ascending JH "articulatory scale" as formulated by Ogotoyev (1988), with the correction by Shishigin (2015) that places "o" rather than "a" at the bottom of the scale. The mask images (Sheikin, 2002) and the phonetic table (Seliutina et al., 2012) are reproduced by permission of the authors.

#### TABLE 1 | The earliest Jaw Harps found by archaeologists.


*The numbering reflects the chronological order from the oldest dates to the newest. "JH type" shows the number of JHs found in a given location, the material from which they were made and the type of construction they use. "Source" cites the publications that contain descriptions of each find.*

engage **well-differentiated polyphony**: bi-melodic for bamboo and tri-melodic for bone and wood (see Appendix in **Supplementary Material**).

The order of the increasing discretization of parts in these textures corresponds to the order of the growing complexity of manufacturing a JH from a given material (**Table 2**).

Materials fall into three classes according to texture, timbre, and playing technique (**Table 3**).

This order suggests the development toward simpler/clearer typologies where each textural type constitutes a stage that builds on a previous stage.

Obviously, the "metallic stage" comes after the "organic." Based on a comparative ethnomusicological perspective, Sheikin (2002, p. 132) offers further distinctions. He defines the "bamboo stage"<sup>64</sup> as representative of Oceanic cultures (exemplified in

<sup>64</sup>Sheikin speaks of the "bamboo period," but his choice of the term "period" is not the most accurate for designating the timeline of JH's evolution. The word "period" implies a "hard" end—one period replacing another. However, none of the JH materials have been really "discontinued" in the production of JH. Once mastered, a material usually remains in use. Therefore, the term "stage" seems more appropriate here, since developmental stages are cumulative in nature and built on top of one another.


FIGURE 6 | Spectral textures of music performed on JHs made of 5 different materials. On the left, spectrograms show the spectra of JHs made of each of the 5 different materials, most commonly used in Northeastern Eurasia by indigenous players. The horizontal axis indicates time, the vertical axis frequency, and the color-coding amplitude (from black for silence to bright yellow for maximal intensity). In most cases, the texture spreads from about 70 Hz up to 3–5 kHz. On the right, the entire texture is broken in "parts" (bands), each capturing the frequency range within which the "active" harmonics form a continuous (as much as possible) melodic stream. The procedure for defining these parts is described in the Appendix in Supplementary Material. Arrows point to the original position of parts. To the right of the parts are their names, following the standard nomenclature of Western choral music. Each part's bandwidth, amplitude and number of pitch classes (i.e., pitch set) are listed on the right. A pitch class is defined by a dynamic spike repeatedly marking the same frequency level (plus/minus a quartertone) throughout the clip. The dBu value reflects the greatest amplitude for a given part of the texture within the entire clip. The red color marks the most intense part that presumably serves as the principal melodic line for the JH player. Whenever there are a few parts with many pitch classes that reach amplitudes that are close together, they are considered "polyphonic" melodies. The green color marks the "melodic" parts. Each of the complete textures and their constituent parts can be auditioned by downloading the corresponding audio file from the provided web link or looked up in the zip archive in the Supplement. For the description of Audio Examples Nos. 31-55, see the file "List of Audio Examples.docx" in the Supplements. (A) Grass JH (Nivkh). 5 Components, each melodic (including the bass). Each part uses similar number of pitches. 
The leading melody is in tenor. The other 4 melodies are of similar intensity, except the highest one. This texture can be called *scattered "quintuplum" polyphony*, since all of its parts are constantly moving in a similar way, making it hard to track their changes by ear. The only controlled part here is likely to be the tenor. (B) Bamboo JH (Ainu). 4 Components, of which the upper 2 are melodic. Bass sustains a pedal tone. Tenor is restricted to an ongoing figuration of 2 adjacent pitches ("principal" belonging to the harmonic series, and "auxiliary" that does not belong to the harmonic series but is a step away from the "principal" pitch). All parts are nearly equal dynamically. Soprano contains the most embellished hemitonic melody which uses twice more pitches than does the anhemitonic alto. This is a *well-differentiated "duplum" polyphony* of soprano and alto, with the drone accompaniment of tenor and bass. (C) Wooden JH (Itelmen). 5 Components, of which 3 middle ones are melodic. Bass sustains a pedal tone. Descant alternates between a few ostinato cluster-chords (one principal, others auxiliary). Three lowest parts are nearly equally intense. The polyphonic contrast is tonal (4 anhemitonic pitches for tenor vs. 14 chromatic pitches for alto). This is a *fairly differentiated "triplum" polyphony* of tenor, alto and soprano, accompanied by the pedal bass and pedal-like descant. (D) Bone JH (Mansi). 4 Components, of which the upper 3 are melodic. Dynamically, their relation resembles (C), but the stronger tenor is likely to lead. However, tonally, there is little differentiation between 3 melodic parts. This is a poorly differentiated "triplum" polyphony, accompanied by the bass pedal. (E) Bronze JH (Magyar). 3 Components with a single melody in tenor. Tenor is the most dynamically and melodically salient part. Its melody is accompanied by the pedal bass and the ostinato "cluster-chords" in alto. This is a *well-differentiated homophony*. 
Black horizontal lines break 5 textures in 3 groups based on their structural similarity. (A) Stands out by its maximal polyphony and the absence of differentiation of parts. (E) Stands out by its absent polyphony and maximal differentiation of parts (pedal-melody-ostinato). (B–D) Have limited selective polyphony. Of the three, (B) features the strongest, whereas (D)—the weakest differentiation. (C) Features the greatest complexity. The complete spectrograms were calibrated to blur the display of the most intense parts of the spectrum in order to better indicate the melodic continuity across pauses, along the timeline. Blurring was achieved by maximizing the FFT size up to 65,536 samples, with the multiresolution STFT, no frequency overlap and time overlap 4x-16x, depending on the audio material. The window is Gauss 200 dB. Frequency is displayed in mels. The spectrogram of a complete texture zooms in from about 0–100 Hz to 4,000–9,000 Hz, depending on the audio material. Each clip was noise-reduced and dynamically normalized. The clips of textural "parts" have slightly increased frequency zoom and reduced blurriness (greater frequency and time overlap) to display finer detail. The audio of "parts," uploaded for audition, is normalized in order to make the parts comparable.

mukkuri) and preceding the "bone/wood stage" of Siberian peoples. Sheikin's periodization can be enhanced. Wooden JHs likely replaced bamboo as the JH spread north. Bone must have replaced wood in the further expansion to the tundra, where wood is rare. Hence the "bone stage" is likely to have succeeded the "wood stage."

Geometry/construction was another factor that determined the rise of homophony: only the bow-shaped metallic JHs are homophonic. Nevertheless, both factors nearly always<sup>65</sup> coincide (Fox, 1994; Sheikin, 2002; Kolltveit, 2016). In contrast, frame-shaped idioglot JHs do not seem to support homophony in indigenous traditions. Therefore, the adoption of bow-shaped metallic JHs ended up replacing JH polyphony with homophony.

It is imperative to determine the timeline of the distribution of bow-shaped instruments (**Figure 7**).

Organic JHs emerged in Altai/Baikal/Mongolia/Primorye during the pre-Metal Age and spread north, south, and west. Grass and chip probably succeeded twig, followed by chip-in-frame and frame-idioglottic instruments corresponding to the advance from purely "sonoric" effects to the pitch-oriented manipulation of overtones (Sheikin, 2002, p. 124). An example of an intermediate "chip-in-frame" construction is the Tuvan charty-komus (Suzukei, 1989, p. 65–6).

In the Volga/Kama/Ural region the westward expansion of frame-shaped JHs met a wave of eastward expansion of metallurgy, which introduced metallic JHs (Aleksandrova, 2017). Initially, these seem to have imitated the popular organic constructions (Golubkova and Ivanov, 1997). By the Sarmatians' time, a single standard for frame-shaped instruments (10.3–12.5 × 1.4–1.7 × 0.1–0.2 cm) was established across Central Asia and southern Siberia (Borodovskii, 2017). This must have corresponded to a single standard of polyphonic texture.

Farther westward propagation of JHs involved only metallic instruments; wood/bone JHs were used in Europe exceedingly rarely and only before the heteroglot metallic constructions became established (Kolltveit, 2006, p. 83)<sup>66</sup>. The bow-shaped construction was invented somewhere in Ukraine/Balkans/Alps/Karelia, providing a new, simplified texture that co-existed with the older textural standard. However, the exclusive cultivation of bow-shaped JHs must have established a new textural standard that spread all over Europe and its overseas colonies<sup>67</sup>. On land, bow-shaped JHs spread eastward over the same territories where organic JHs had already settled, eventually reaching the Far East<sup>68</sup>. This brought about the co-existence of two autonomous traditions among Nivkhi, Ainu, Evenks, Evens, Kets, Yughs, Selkups, Chukchi, Itelmens, Koryaks, Kereks, Yukagirs, Khanty, Mansi, Tuvans, Yakuts, Dolgans (Sheikin, 2002, p. 125–6), Mongols (Pegg, 2001), and Chinese (Li, 1956). The two traditions differ in playing techniques, sound quality, and texture, to the extent of bearing different names among the same people (Li, 1956; Yakovlev, 2001; Sheikin, 2002; Mamcheva, 2012). The closer to the Far East, the greater the discrepancies. This geographic distinction is supported by the gender/age distinction. Frame-shaped JHs constitute the female/children's sphere of use, bow-shaped JHs the male/adults' (Tadagawa, 2001; Sheikin, 2002; Dyakonova, 2017)<sup>69</sup>.

Significantly, the traditions differ in their mythological status. For Yenisei peoples frame-shaped JHs represent the "voice" of a local deity in charge of successful hunting (126), whereas metallic bow-shaped JHs are an attribute of power and prestige, entitling the owner to protection (131). The ideological divide also involves an aesthetic aspect. Across the entire southeastern end of Russia, all cultures that contain both frame-shaped and bow-shaped JHs consider the former suitable for learning the JH, but not for "serious" music-making (Sheikin, 2002, p. 131)<sup>70</sup>.

Hence, each model follows its own course. Around Kama, bow-shaped JHs supplanted frame-shaped JHs (Yakovlev, 2001). Generally, the archaic polytheistic belief in sacred places yielded to a monotheistic cosmology under the influence of Christianity after Russian colonization (Alekseyev, 1992, p. 215). This weakened the kin-tree/kin-bone correspondences that provided ideological support to the frame-shaped tradition. The strengthening of top-deity cults, the Christian-like dichotomy of good/evil, and divine protection supported the "ownership-based" (not kin-based) protection of bow-shaped JHs.

The autonomy of frame- and bow-shaped traditions is corroborated by the late arrival of metallurgy to the north Pacific coast (**Figure 8**). The Chinese ideograph for the metallic JH (tieyehuang) confirms that the bamboo JH (huang) preceded it (Tadagawa, 1991).

The scenario in which the bow shape was the descendant of the frame shape also received organological support (Sachs, 1917; Dournon-Taurelle, 1975; Sheikin, 2002; Suzukei, 2010). Organological analysis of European archaeological finds indicates that European bow-shaped JHs form their own lineage of morphological development from earlier Asiatic samples brought to Europe through trade (Kolltveit, 2006). Currently, the archaeological consensus holds the organic

<sup>65</sup>In Taiwan and Indonesia there are very few examples of indigenous musical traditions that use bamboo frame-shaped JHs with metallic lamella (Kolltveit, 2016). This is also the case in Vietnam (Wright, 2001). However, judging by available recordings of such instruments, this hybrid construction does not substantially differ in spectral texture from the frame-shaped instruments that are made just out of bamboo. On the territory of Russia, bow-shaped JHs are always metallic (Sheikin, 2002, p. 129).

<sup>66</sup>There are some vague ethnographical references indicating that wooden JHs have been used in Flanders, Hungary and Ireland; however, such cases seem to be exceptional, leaving no archaeological or historic traces (Kolltveit, 2009). It is likely that wooden instruments such as the Slovenian drumlja represent a relatively modern development, intended to reduce costs and thereby increase sales of JHs in modern market conditions.

<sup>67</sup>The low quality of mass-produced instruments is known to negatively affect the standards of JH music produced on imported JHs, as compared to hand-crafted instruments of indigenous JH traditions (Morgan, 2017). Metallic instruments, less responsive to the player's technique, must have further limited the spectral content of JH texture, increasing the dominance of a single melodic "part" even more.

<sup>68</sup>Such a scenario has been confirmed in relation to Yakutia, where organic frame-shaped JHs were found to precede the metallic bow-shaped JHs (Dyakonova, 2017). The same applies to the peoples of Volga-Urals (Yakovlev, 2001), Kama (Golubkova and Ivanov, 1997), and Sakhalin (Mamcheva, 2012).

<sup>69</sup>Thus, amongst Khanty and Mansi, frame-shaped tumran is strictly female, related to wedding rituals of fortune-telling and "avoiding" men (Sheikin, 2002, p. 127). Across Southeastern Siberia bow-shaped JHs are used exclusively by men in funeral lamentations, whereas females use bowed pikolute instead (131).

<sup>70</sup>The tradition of the Ude JH presents an interesting case. Although Ude hunters consider frame-shaped JHs to be children's instruments, they report that hunters must play one after setting their hunting traps to ensure that the prey will be trapped (132). This testifies to a more ancient origin of the frame-shaped construction in hunter-gatherer societies, in contrast to the bow-shaped construction that did not exist before the Metal Age. The same applies to their respective musical textures.



*The numbering in the table reflects: the progressive increase in complexity of the manufacturing technology, the decrease in availability of the material to the JH manufacturer, the increase in discretization and variability of the components of the spectral texture, and the increase in timbral richness, fullness and homogeneity of the sound that characterize each of the materials [for more complete information see* Table 1 *in the Appendix and "Textural typology for Appendix.xml" (*Supplementary Material*)]. The identified pattern of evolution could potentially apply not only to JH, but to other musical instruments made of the same materials. This possibility calls for additional research.*

TABLE 3 | Classification of JHs based on similarities and differences in sound quality, spectral texture types and playing techniques between the JHs made of different materials.


*In general, a different hand action in the production of sound on a JH corresponds to a different sound quality (Mamcheva, 2012, p. 53–55) and textural typology (see the Appendix and listen to the audio examples in the archive "Audio for Appendix" in* Supplementary Material*). Their correspondence is somewhat complicated by the cumulative nature of indigenous JH traditions, exemplified in Nivkh JHs (Mamcheva, 2005): newer technologies of manufacturing JHs tend to adopt the standards of playing from older, previously existing technologies before the novel constructions allow new standards to be forged (see Appendix in* Supplementary Material *for the discussion of this issue). Therefore, metallic JHs are used to generate textures characteristic of bamboo, wood or bone JHs, as well as proprietary "metallic" textures that cannot be produced on JHs made of organic materials.*

JHs as the prototype for the metallic via westward expansion from northeast Asia (Fox, 1988; Wright, 2004; Kolltveit, 2006; Honeychurch, 2015; Aleksandrova, 2017; Turbat, 2017; Oleszczak et al., 2018). Another expansion is likely to have followed from South China to Austronesia (Blench, 2004).

The time frame for the genesis of organic JHs can be established by dating the migration of the Siberian population to America. Many North Amerindians use the PS (Ojamaa, 2002).

#### **Example-56**

The topahti—a personal Nootka song of inherited origin, performed by Joe Titian. This topahti was given at the inter-tribal marriage between Nootka and Kwakiutl as a dowry, and permitted for performance only by its owner and her children (Halpern, 1974) (http://chirb.it/NvahDq).

But no JH usage is known before the Western colonization (Wright, 2011)<sup>71</sup>. This is hard to reconcile with the evidence for the common genetic ancestry of Native Americans and

<sup>71</sup>JHs, rather popular amongst modern Inuits (Nattiez, 1976), were introduced in Alaska, Canada and Greenland from Western Europe (Whitridge, 2015). The JH-like use of the feather of an arctic bird, the common eider (Somateria mollissima), was observed amongst the Inuits of the Belcher Islands and northern Quebec (Oakes, 1991). They hold the feather in their mouths and beat it with one hand. A photograph of this is published in the LP "Inuit Throat And Harp Songs" (Green and Hodge, 1980). Most likely, the feather is used as a more affordable substitute for the imported metallic JH. The possibility that the feather use is a local invention that might have preceded JH trade cannot be completely discarded. Even in this case, however, such feather use remains totally isolated.

FIGURE 7 | World distribution of JH made of organic and metallic materials prior to the Industrial Period. This map combines archaeological and ethnographic data on local production and consumption of JH across the world, excluding mass international trade from Europe, in order to establish the routes of the spread of metallic and non-metallic JH indigenous traditions. Yellow color marks those countries where JHs were produced and widely consumed by the local population. Such countries were selected based on the integration of findings of linguistic research on the presence of native words for JH in local languages (Fox, 1994; Sheikin, 2002, p. 411–503; Bakx and Crane, 2017), of archaeological finds (de Ramón and Rivera, 1982; Beck et al., 1983; Barr, 1994; Kungurov, 1994; Yakovlev, 2001; Pignocchi, 2002, 2004; Crane, 2007; Wright and Impey, 2007; Wright, 2011; Honeychurch, 2015, p. 20; Whitridge, 2015) [plus references from Table 1], and of geo-ethnographic research (Vertkov et al., 1963; Galaiskaya, 1973; Fox, 1988, 1994; Sheikin, 2002; Wright, 2004; Yesipova et al., 2008; Suzukei, 2010; Mamcheva, 2012). Geo-ethnographic data was also used to distinguish between the areas of distribution of bow-shaped metallic and frame-shaped organic JHs. Five icons with corresponding pictures mark the geographic locations of musical cultures that use JHs made of different materials. Red numbers indicate the zones of distribution for different types of JHs, encircled in different colors. (Number 1) Marks the blue oval that shows the area where organic-made JHs have been prevalent (see Figure 4). This area most likely constitutes the cradle of JH music. (Number 2) Marks the green ovals that encircle the areas where metallic JHs have coexisted with organic ones. It is presumably in these areas that the transition from organic to metallic instruments took place. Two areas [(2a) and (2b)] correspond to the oldest centers of metallurgy (see Figure 8).
The transition probably occurred first in (2a), where the access for import of organic JHs from zone (1) was geographically broader and easier than in (2b), and organic JHs were likely to have been established prior to the wide spread of metallic artifacts. Transition in area (2c) most likely occurred much later because of the later arrival of metallurgy (see Figure 8). (Number 3) Marks the brown oval that encircles the area where exclusively metallic JHs were cultivated. The color-coded arrows show the probable direction of the spread of the organic and metallic JH technologies. Green arrows indicate the distribution of the initial handicraft metallic JHs that might have emulated popular wooden/bone/bamboo types. Brown arrows indicate the distribution of the mass-produced bow-shaped metallic construction, prevalent in Europe. Across the ocean, these arrows connect the European dominions with their offshore colonies. Over the land, brown arrows indicate the chain of cultural contacts that led to the establishment of local production centers of metallic JHs. Evidently, one of the routes of this spread went contrary to the original spread of the organic JHs, from the European West to the Far East, over the Steppe Belt's borderline with the northern forests via the chain of neighboring Turkic cultures that all favor metallic JHs. This route earned the nickname of "Fur Route" (Rubinson, 1992; Bunker, 1993). Another route went from Central Asia to China via the Silk Road, historic as well as prehistoric, as recent research suggested (Christian, 2000; Kuzmina, 2008; Barinova, 2013).

south Altaians 23000–18000 BCE (Schurr, 2015). How could an instrument so important for Siberian peoples be totally missing from their American descendants? The Altai-Sayan region is home to the JH. According to genetic and linguistic evidence, Yakuts, whose national symbol is the khomus, descend from there (Pakendorf, 2007). Siberia and Alaska remained connected until 8000 BCE, but glaciers blocked Alaska from the rest of the continent until 11000 BCE (Dixon, 2015). Sequencing mitochondrial genomes from pre-Columbian South American skeletons 8,600–500 BP indicates that a small population entered the Americas through a coastal route 16,000 BP (Reich et al., 2016). This date comes close to the earliest archaeological evidence of a human presence on the North American continent: 13,000 BP (Anderson and Bissett, 2015).

The most plausible scenario is that the JH tradition did not exist along the northeast Asian coast before 10,000 BP, when access to America was open. Throughout the Quaternary, Northeastern Asia enjoyed a remarkably stable climate, and lowlands remained ice-free (Zamoruyev, 2004). The prevailing landscape of the non-glaciated area of Altai was desert-steppe, covered with grass, with patches of woody vegetation at stream valleys, changing into coniferous forest or forest-steppe only in the Holocene (Hais et al., 2015). Similar vegetation covered Mongolia and Tuva. Modern forest-steppe regions, e.g., Ob'-Irtysh Baraba, remained steppe throughout the early Holocene until 5500 BP (Zhilich et al., 2017). In highlands such as the Chuya Alps, the first forestation is dated 7000 BP (Agatova et al., 2012). Forestless landscape stretched toward North

FIGURE 8 | Historic spread of copper metallurgy in Eurasia. This map displays the locations of the earliest regional centers of smelting copper ores, according to the available archaeological research on the earliest metallurgy (Chernykh, 1966, 1992, 2012; Sunchugashev, 1975; Zhuravlev, 1977; Sergeyeva, 1981; Prakash and Tripathi, 1986; Rybakov, 1987; Kon'kova, 1989; Thiel, 1989; Mishra, 1994; Reedy, 1997; Kiriushin, 2002; Tylecote, 2002; Nguyen, 1986; Baipakov and Taimagambetov, 2006; Simukhin, 2006; Yanin, 2006; Ciarla, 2007; Hauptmann, 2007; Kaniuth, 2007; Park and Gordon, 2007; Subbotina, 2008; Herva et al., 2009; Roberts et al., 2009; White and Hamilton, 2009; Radivojević et al., 2010; Erb-Satullo, 2011; Higham et al., 2011; Wan, 2011; Potts, 2012; O'Brien, 2014; Garner, 2015; Gelegdorj, 2015; Mei et al., 2015; Hung and Chao, 2016; Huo, 2016; Tripathi, 2018). The routes and timeline of its spread suggest the spread of metallic JHs along with the new homophonic tradition of JH music. The timeline is indicated by color-coding of the circle-icons that mark the location of smelting sites: the darker, the older. The epicenter of metallurgy is in modern Turkey and Iraq, although a concurrent independent center might have existed in the Vinča culture at Belovode, in the Balkans (Radivojević et al., 2010). Another independent center emerged in Fennoscandia (Herva et al., 2009) and the neighboring Karelia (Zhuravlev, 1977). Copper metallurgy spread from the Balkans to Ukraine, thereon bifurcating, with one branch heading to the northeast forest zones of Russia (up to the Ural Mountains), while the other moved southeast through the Steppe Belt all the way to Zabaikalye and Mongolia. There, the route bifurcated again: north, going to Yakutia, and south to Primorye.
The other wave from the epicenter reached Iran, branching in 3 directions: toward the Indus valleys, the Himalayas, and Middle Asia, through Turkmenistan, Uzbekistan and Kazakhstan to the Uyghur territories via the route that later became known as the Silk Road (Kuzmina, 2008). Around Lop Nor this branch met the offshoot of the Steppe Belt route and proceeded through modern China to its North and South kingdoms. The southern branch eventually reached Indochina, then headed to Indonesia and the Pacific islands. This entire area, Manchuria and Japan were the last to develop metallurgy, no earlier than the Middle Ages. Therefore, whereas in the Volga-Ural area frame-shaped JHs co-evolved with the metallic JHs, for cultures at the eastern end of Eurasia bow-shaped JHs came from the West and formed their own special niche, different from the already established frame-shaped JH cultures. The bow-shaped tradition must have taken a long time to develop, as metallic artifacts were becoming more affordable to the local population. Another zone where frame- and bow-shaped JHs form different cultural traditions is the vast area from southeastern Indochina to Papua New Guinea. The two zones where metallic bow-shaped JHs must have penetrated concurrently with organic frame-shaped JHs, and where both might have shared the same or similar forms of use, are the Himalayas-Pamir-Tian-Shan mountains and the Yunnan-Myanmar area.

America, including Beringia (Hoffecker and Elias, 2007). The tundra belt extended to 57°N, then turned into steppe that covered most of Eurasia (Tarasov et al., 2000). Forestation was prominent only much farther south, at Taiwan's latitude, during the late Pleniglacial (Hope et al., 2004), too distant from Beringia.

The scarcity of trees would have prevented the formation of the "person/kin/ethnos=tree/species/forest paradigm" that underlies the Siberian timbral music culture. The emergence of "sacred" tree cults depends on trees' importance in sustaining a human population, which requires that they be available in abundance. Otherwise, plants' proprietary "voices" cannot be discovered by population groups large enough to institute a musical tradition.

## CONCLUSION

Not all known music systems abide by the principle of discrete changes in pitch. In fact, many music cultures rely on timbral changes in their TO. One trait such cultures share is the prevalence of the "personal" over the "collective" use of music, upholding the opposition between "timbre-centered" (definite timbre/indefinite pitch) and "frequency-centered" (definite pitch/indefinite timbre) music systems. Current psychoacoustic research seems to support this opposition.

The large concentration of "timbral" cultures in the former USSR prompted research (often emic<sup>72</sup>) on their TO, promoted by the centralized infrastructure of research institutions, "top-down" funding (official ideology favored "people's music"), and the intense development of ear-training research. The accumulation of vast ethnographic and archaeological data over much of the twentieth century made it possible to create a "historic" perspective on the evolution of folk music. The integration of sciences within Soviet academe supported a multi-disciplinary approach that combined fieldwork with experimental testing. The gathered evidence suggests that "frequency-" and "timbre-centered" musics contrast each other in their TO, patterns of usage, and their social implementation.

Timbral TO relies on PS in vocal, and JH musicking in instrumental domains. Every member of such a traditional society possesses at least one PS that serves for personal identification. Musical elements usually indicate family relations and birthplace through similarities with other individuals' PSs. Locals have a keen ear and memory for cross-relating thousands of PSs throughout their life (better than for remembering faces)<sup>73</sup>.

PS contributes to mental health in the traditional Siberian lifestyle. Ongoing singing throughout one's day secures consistency in self-consciousness under environmental stress, common in severe Siberian conditions, thus protecting one from psychological disorders. Pathological singing accompanies multiple personality disorder (meneryia) and sleep-singing disorder (tyyl yryata), most probably related to the dysfunction of the PS circuit.

In traditional societies, PS lowers the standard of musicality to maximize singing accessibility. Close relatives sing their PS for aphoniacs. An indefinite intervallic structure and an absence of the notion of "wrong notes" make a PS performable by almost anyone. Playing JH is even more accessible.

The JH opposes PS by "de-personalizing" one's natural voice via the JH's vocoding mechanism. While camouflaging one's voice, the JH "impersonates" a wealth of environmental sounds. Its exceptional mimicry allows for re-creating the "real world" musically through the timbral abstraction of sounds and their arrangement according to principles of phonetic symbolism.

The JH addresses the objective aspects of reality, PS the subjective. Together, they form a coherent musical worldview, which explains why numerous North Asiatic ethnicities have so few "pitched" musical instruments: They simply did not have much need for them.

Indefinite-in-pitch rhythmo-timbral "themes" effectively represent:


Both rest on the traditional hierarchic paradigm: "person"-"kin"-"ethnos"="tree"-"tree-species"-"forest." Such a paradigm must have formed in southeastern Siberia/Manchuria ≈10,000 BP. JHs made of ancestral plants were initially used as talismans. The accumulation of onomatopoeic devices and conventions of phonetic symbolism created "timbral words," "timbral phrases," instrumental patterns emulating PSs, and, eventually, proprietary JH TO, where the key role belongs to a set of "harmonic templates." Each template carries a specific semantic value, enabling linguistic-like semiosis: meaningful elements comprise meaningful components, but convey emotional rather than referential information. This development must have directed the entire evolution of Eastern Eurasian timbre-based music systems in opposition to the frequency-based music of the Eurasian West<sup>74</sup>. The JH was for the prehistoric East what the bone pipe was for the prehistoric West.

The personal nature of PS/JH traditions stems from their reliance on timbre, which is fundamentally unsuitable for collective musicking. Collective performance of PSs or JHs would make their message unrecognizable. Only frequency-based music allows the tradition of collective performance to form and continue.

Western frequency tradition, exemplified by Aurignacian pipes (Morley, 2013), might have African roots. Africa underwent two major demographic expansions prior to the Aurignacian, enabled by its tropical ecology (Lahr and Foley, 2001). The second expansion (86,000–61,000 BP) carried haplogroup L3 outside of Africa (Atkinson et al., 2009). Genetic evidence suggests that non-Africans descend primarily from this migration, whose maximum falls on 70,000 BP, coinciding with the improved climate in East/Central Africa (Soares et al., 2012). At that time, the East African effective population size was at least 10,000 people (Relethford, 1998), vs. the census maximum of 3,700 in Gravettian Europe<sup>75</sup>.

Tropical environments generally support greater population densities than those at higher latitudes (Layton and O'Hara, 2012). Environmental conditions impact demographic density

<sup>72</sup>Thus, three of the authors of this paper were raised in the native cultural environment of Yakutia, and therefore combine a first-hand "emic" knowledge about its indigenous music (and language) with an expertise in musicological analysis of such music.

<sup>73</sup>Long-term memory for melodic structures is known for remarkable longevity and capacity (Halpern and Bartlett, 2010). Accurate memorization of music can be life-long (Bartlett and Snelus, 1980). Music seems to provide superior encoding, evident in the phenomenon of earworms, unique to the domain of music (Halpern and Bartlett, 2011).

<sup>74</sup>For more information on the role of flute-like instruments in determining a music system see "Tonal organization in tuning of Paleolithic and Neolithic pipes" (Nikolsky, 2015, Appendix-2).

<sup>75</sup>The median ratio of effective population size to census size remains exceedingly low across 43 mammalian species, constituting on average 0.003 (Nei and Graur, 1984). The method of mismatch analysis, introduced by Rogers and Harpending (1992), allows for estimation of the size and timing of ancient demographic expansions based on genetic data. Population growth at the time of expansion is estimated to be 100-fold or even greater (Rogers, 1997), and its timing clusters around 50,000 BP (Sherry et al., 1994). The African expansion of this time suggests a demographic picture with a sparse population outside of Africa and a rapid population explosion inside Africa (Relethford, 1998).

the most, and even the tropical desert supports a denser population than the polar biome (Tallavaara et al., 2018). Intergroup connectedness also drops at higher latitudes; the large effective population of connected groups in the African Middle Stone Age contrasts with stochastic variation without linear trajectories in the contemporaneous European Mediterranean region (Malinsky-Buller and Hovers, 2019). Sparse groups' migration leads to frequent losses of gained cultural skills. Steady post-Gravettian demographic growth triggered the cumulative cultural complexity that characterizes behavioral modernity (Shennan, 2001). Part of this was the consistent increase of foragers' group size (Grove, 2012). By 45,000 BP, the median effective population size in Europe equaled that which sub-Saharan Africa had reached 101,000 BP alongside the markers of modern behavior (Powell et al., 2009).

Population growth promotes group cohesion, territoriality, ethnogenesis, and language formation (Robb, 1993). "Frequency music" likely followed suit. Steady demographic growth accompanied the Neolithic "revolution" and civilizations' rise (Hassan, 1981)—together with "frequency-based" music (Nikolsky, 2016). Collective music-making was part of the "demographic expansion package" designed to consolidate and empower the "tribe" to grab and hold its territory.

Radically different is the demography of northeastern Eurasia. Siberia is famous for its immense land (13,100,000 km<sup>2</sup>), sparse population (200,000 before Russian colonization), and the so-called everlasting importance of hunting/gathering for sustenance (Naumov, 2006). Such a population density (0.065 person/km<sup>2</sup>) approximates the 0.036 person/km<sup>2</sup> maximum of Magdalenian Europe (Maier, 2017). Harsh living in the traditional lifestyle makes landholding not a viable strategy. Constant migration by small "packs" requires marking and regulating the sharing of territory between all neighbors, helping each other to survive (Funk, 2006). Therefore, local beliefs assign power not to the "tribe" but to the spirit-masters of landmarks on whose disposition human "tenants" must rely. This has laid the foundation for the PS/JH pan-Siberian framework. And since the climate in Northeastern Asia remained remarkably stable throughout the Quaternary (Zamoruyev, 2004), it is reasonable to believe that the institution of PS/JH characterizes the music culture of local prehistoric people.

It can hardly be a coincidence that the area where the JH remains the principal musical instrument in a scarce instrumentarium is identical with the area of the greatest concentration of Denisovan genomes. The highest levels of Denisovan ancestry are found in Oceanic populations (Vernot et al., 2016). Denisovan genomes are also present in Eastern Eurasians and Native Americans (Qin and Stoneking, 2015). Denisovans may have interbred with early humans over the territory of Northern China (Martinón-Torres et al., 2017), where the oldest JHs were unearthed. The longevity of the JH dominance might constitute a distant remnant of the Denisovan timbral music tradition, preserved in the refugium of isolated Pacific islands, north-Chinese deserts and the Altai mountains. Neanderthal heritage could entail the "frequency music," carried from Africa by Homo heidelbergensis. The latter adapted to the northern latitudes, as opposed to the southern Homo erectus, and the same holds for Neanderthals as opposed to Homo heidelbergensis (Grove et al., 2012). Either of them might have adapted the southern collective "frequency music" to the northern ecosystems, generating a new<sup>76</sup> personal "timbral music."

The sparse Neanderthal (French, 2016) and Denisovan (Meyer et al., 2012) populations of the Pleistocene Altai (Buzhilova et al., 2017) might have also subscribed to "timbral music". Homo's "timbral music" either descended from Neanderthals and Denisovans, or "downgraded" from the European "frequency music" carried by "Ancient North Siberians" from the West ≈38,000 BP (Sikora et al., 2019).

## AUTHOR CONTRIBUTIONS

EA and AN conceived the presented idea. VD organized the recording of Jaw Harps at the Jaw Harp Museum in Yakutsk, collected all the necessary ethnographic materials and provided the most recent data of the fieldwork research in Siberia and Russian Far East. She verified the information on traditional musical instruments in Siberia. IA performed on different Jaw Harps and had them recorded and acted as a consultant in relation to the indigenous Siberian traditional music and matters of prosody and phonology of Jaw Harp playing and singing. EA provided his archive of the recordings and supervised the application of his methodology of the analysis of tonal organization of indigenous folk music. AN conducted the acoustic and musicological analysis of the provided material, conceived the method of analysis of spectral textures, corroborated all the findings with the research in the former USSR, modern Russia, and Western countries, wrote the manuscript and created the figures and tables for it. EA edited the manuscript. AN translated it into English.

## ACKNOWLEDGMENTS

We are grateful to Valentina Suzukei, Natalia Mamcheva, Oksana Dobzhanskaya, Yurii Sheikin, Tatyana Ignatyeva, Gjermund Kolltveit, Erkin Alekseyev, Ricardo Eichmann, Leon Crickmore, Jeremy Montagu, Triinu Ojamaa, Jaan Ross, Iraida Seliutina, Nikolai Urtegeshev, Tatiana Ryzhykova, Albina Dobrinina, Tatiana Daineko, Josef Jordania, Alexandra Khaltanova, Yevdokiya Sergina, Spiridon Shishigin, Alevtina Everstova, and Vyatcheslav Shchurov for providing us with the necessary materials for this project. We appreciate the thorough critical input from Theodor Levin that was instrumental for shaping the scope of our research. Special thanks to Erkin Alekseyev, Nastya Petrova, and Aksenty Beskrovny for their recordings of Jaw Harp, and to Sheila Bazleh for editing the text of this manuscript. We would also like to thank Brian C. J. Moore, Viktor Reznik, Stephen McAdams, Pantelis N. Vassilakis, Steven Brown, Jaan Ross, and Frank Scherbaum for their help in elaborating the optimal approach to taking acoustic measurements of Jaw Harp recordings and interpreting their results.

<sup>76</sup>If the Divje Babe pipe is indeed not a byproduct of the hyena bite (Diedrich, 2015), but rather a Neanderthal musical instrument (Turk, 2014), then the Neanderthal expansion to Altai would have created a Paleolithic border between the frequency- and timbre-based music.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2019.03051/full#supplementary-material


**Conflict of Interest:** AN was employed by the company Braavo Enterprises.

The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Nikolsky, Alekseyev, Alekseev and Dyakonova. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Pastoral Origin of Semiotically Functional Tonal Organization of Music

#### Aleksey Nikolsky\*

Independent Researcher, Austin, TX, United States

This paper presents a new line of inquiry into when and how music as a semiotic system was born. Eleven principal expressive aspects of music each contain specific structural patterns whose configuration signifies a certain affective state. This distinguishes the tonal organization of music from the phonetic and prosodic organization of natural languages and animal communication. The question of music's origin can therefore be answered by establishing the point in human history at which all eleven expressive aspects might have been abstracted from the instinct-driven primate calls and used to express human psycho-emotional states. Etic analysis of acoustic parameters is the prime means of cross-examination of the typical patterns of expression of the basic emotions in human music versus animal vocal communication. A new method of such analysis is proposed here. Formation of such expressive aspects as meter, tempo, melodic intervals, and articulation can be explained by the influence of bipedal locomotion, breathing cycle, and heartbeat, long before Homo sapiens. However, two aspects, rhythm and melodic contour, most crucial for music as we know it, lack proxies in the Paleolithic lifestyle. The available ethnographic and developmental data lead one to believe that rhythmic and directional patterns of melody became involved in conveying emotion-related information in the process of frequent switching from one call-type to another within the limited repertory of calls. Such calls are usually adopted for the ongoing caretaking of human youngsters and domestic animals. The efficacy of rhythm and pitch contour in affective communication must have been spontaneously discovered in new important cultural activities.
The most likely scenario for music to have become fully semiotically functional and to have spread widely enough to avoid extinction is the formation of cross-specific communication between humans and domesticated animals during the Neolithic demographic explosion and the subsequent cultural revolution. Changes in distance during such communication must have promoted the integration between different expressive aspects and generated the basic musical grammar. The model of such communication can be found in the surviving tradition of Scandinavian pastoral music: kulning. This article discusses the most likely ways in which such music evolved.

#### Keywords: tonal organization, animal communication, honest signal, semiosis, aspects of expression, domestication, kulning vs. yodel, motherese

**Abbreviations:** AC, animal call (plural ACs); AE, aspect of expression (plural AEs); FF, fundamental frequency (plural FFs); ky, thousand years; kya, thousand years ago; SFTO, semiotically functional tonal organization; TO, tonal organization.

#### Edited by:

Danielle Sulikowski, Charles Sturt University, Australia

#### Reviewed by:

Colwyn Trevarthen, University of Edinburgh, United Kingdom
Mark Reybrouck, KU Leuven, Belgium

#### \*Correspondence:

Aleksey Nikolsky aleksey@braavo.org

#### Specialty section:

This article was submitted to Evolutionary Psychology, a section of the journal Frontiers in Psychology

Received: 21 November 2019 Accepted: 22 May 2020 Published: 23 July 2020

#### Citation:

Nikolsky A (2020) The Pastoral Origin of Semiotically Functional Tonal Organization of Music. Front. Psychol. 11:1358. doi: 10.3389/fpsyg.2020.01358

Since antiquity, scholars have been puzzled by the origins of music. Their quest remains largely unanswered, impeded by the shortage of available data. The current consensus holds that some kind of musilanguage (Brown, 2000) must have preceded the bifurcation of music and language, marking the emergence of behavioral modernity in humans (Cross, 1999). Pitch orientation is seen as the primary structural marker of music, followed by rhythmometric organization (Brown, 2017)<sup>1</sup>. This unnecessarily oversimplified view can and should be expanded, since in reality music is organized not in two but in eleven aspects of expression (AEs<sup>2</sup>), each providing its autonomous information channel (**Table 1**):


The problem is that in investigating music, cognitive scientists rely on "standards" of Western musical theory, produced by Western civilization and therefore specific to certain historic periods and geographic regions. Although the Western music system has proved to be the most widespread and the oldest surviving tradition, with its theoretic foundation rooted in the 3rd millennium BC (Dumbrill, 1998; Mathiesen, 1999; Jorgensen, 2003; Christensen, 2008; Crickmore, 2009; Nikolsky, 2016), there are other civilizations that abide by their own musical theories, explicit and/or implicit, documented and/or orally transmitted (Nettl, 2005). The need to formulate a "metatheory" applicable to all varieties of musics was realized only in the 1890s and has been dealt with by the discipline of systematic musicology (Bader, 2018). However, this discipline too inherited the framework of Western "classical music," which is just one of many (Nikolsky, 2015b, 2016, 2020; Nikolsky et al., 2020). Since this framework is tailored to incremental frequency changes, the pitch-related AEs have been prioritized in Western musicology, covered by the dedicated disciplines of harmony, counterpoint, and musical form (Christensen, 2008). The other AEs have only recently received attention, after the traditional discipline of musical form was approached semiotically (Bobrovsky, 1978; Mazel, 1979; Ratner, 1980; Nazaikinsky, 1982, 1988, 2013; Lerdahl and Jackendoff, 1985; Berry, 1987; Ruwet and Everist, 1987; Beliayev, 1990b; Molino, 1990; Nattiez, 1990; Aranovsky, 1991, 1998; Monelle, 1992, 2000, 2006; Narmour, 1992; Tarasti, 1994, 1995, 2012; Kholopova, 2002; Arom, 2004; Bonfeld, 2006; Medushevsky, 2010; Tagg, 2012; Turino, 2014; Benjamin et al., 2015; Yust, 2018). Cross-examination of syntactic, pragmatic, and semantic use of conventional musical idioms has revealed that they break into 11 different AEs (**Table 1**).
Nine of them are used in monophonic music (without harmony and texture)<sup>4</sup>. Each AE is distinguished by its unique perceptual substrate and idiomatic expressions.

Interspecific comparison of human music to vocalizations of different animal species along these aspects promises a better understanding of the qualitative leap in the emergence of music. The Moscow school of "integrative analysis"<sup>5</sup> presents a methodology for such interspecific analyses, which I have adapted to identify those typological patterns in AEs of human music that contrast with animal calls (ACs). These contrasts should be examined to reveal what exactly in human cultural evolution could be responsible for the emergence of new AE patterns that are unique to humans.

Human music is distinguished by its incremental structure (Bresin and Friberg, 2011)—requiring the ability to discriminate

<sup>1</sup>Metric organization of rhythm accompanies and supports pitch organization in music (Jones and Large, 1999), jointly supporting the "musical" manner of interpretation of sounds (Huron, 2006). Tonally important pitch-classes are usually stressed by longer durations and/or dynamic accents. However, compared to rhythm, pitch organization is much more common in the world's known music cultures (found even in music for only percussive instruments, e.g., African talking drums): there are many forms of music that are characterized by ametric and arrhythmic free timing, but there are very few non-pitch forms. Therefore, pitch organization is a more reliable marker that distinguishes music from language than rhythmo-metric organization.

<sup>2</sup> I will use the abbreviation "AE" when speaking of a single aspect of expression, and "AEs" when speaking of multiple aspects of expression.

<sup>3</sup>The matter of choosing different timbres for different musical expressions has traditionally been handled by the discipline of instrumentation in Western classical music (Banshchikov, 1997). The term "instrumentation" here is somewhat misleading, because it covers not only the qualia of the timbres of musical instruments and their ensembles (trio, quartet, orchestra, orchestral group) but also various types of voices (soprano, tenor, bass), vocal ensembles (duet, trio, choir) and the rules of combining vocals with instruments (Kreitner et al., 2001). Arabic maqam, Persian dastgah, and Indian raga also observe similar rules in their respective practices.

<sup>4</sup>Technically speaking, monophonic music can still engage some idioms that relate to harmony and texture. A melody solo often features a pronounced "harmonic rhythm" (Swain, 2002)—i.e., periodic changes of implied chords (e.g., the "Blue Danube Waltz" theme by J. Strauss Jr.) that can stay regular (as in a metric pulse), be patterned (as in rhythm), or elaborated by expansion or contraction of a pulse period. Monophonic music can also implement changes in texture by patterning a stream of sounds into familiar textural idioms (e.g., the "Alberti figuration" or tremolo on a single tone) which then carry their specific semantic expression, different from other textural components, such as a melodic theme (Skrebkova-Filatova, 1985). However, overall, harmony and texture play a secondary role in monophonic compositional practices, limited to Western classical music alone.

<sup>5</sup>The "integralist school" of structural analysis of music was founded by the father of systematic musicology in Russia, Viktor Beliayev, in the 1920s, during his tenure in the Moscow Tchaikovsky Conservatory (Beliayev, 1990b). Beliayev's approach was further developed by two leading Moscow theorists, Leo Mazel and Viktor Tzukkerman (Mazel and Tzukkerman, 1967). They sought to integrate thorough structural analysis of a musical work with the psychological and sociological analyses of the expressive means employed in the analyzed musical work (Khannanov, 2005). It was especially Mazel who was concerned with broadening the framework of analysis to encompass not only the domains of melody and harmony, traditional for Western musicology, but also aspects of rhythm, meter, texture, articulation, dynamics, and timbre. After Yevgeny Nazaikinsky's death in 2006, the leading Russian "integralists" are Valentina Kholopova and Vyacheslav Medushevsky.

#### TABLE 1 | AEs of music.



This table summarizes the structural characteristics of each of the principal AEs according to the treatises on music theory and relevant psychoacoustic research during the last 70 years. In the footnotes to the table, I have provided the relevant sources for those readers who are interested to find out more. All AEs feature incremental organization. Most AEs are quantifiable: melodic and harmonic intervals, textural elements and components, themes, metric pulses, rhythms, and articulation groups. Of 11 AEs, 9 (melody, harmony, form, rhythm, meter, articulation, dynamics and register) exhibit gradual inflections that are systemically used to increase the expressiveness of music.

<sup>I</sup>(Noorden, 1975, 40–67); II(Nikolsky, 2015b); III(Garbuzov, 1948); IV(Benson, 2007); <sup>V</sup>(McDermott et al., 2010); VI(Jones, 2016); VII(Clarke, 2007); VIII(Fabian and Schubert, 2008); IX(London, 2004); <sup>X</sup>(Levitin, 1994); XI(Fallows, 2001); XII(Nazaikinsky, 1972); XIII(Garbuzov, 1950); XIV(Keller, 1973); XV(Jerkert, 2003); XVI(Repp, 1995); XVII(Repp, 1998); XVIII(Dean et al., 2011); XIX(Berndt and Hähnel, 2010); XX(Garbuzov, 1955); XXI(Todd, 1992); XXII(Drabkin, 2001); XXIII(Miller, 2014); XXIV(Titze, 1988); XXV(Patterson et al., 2010); XXVI(Yemelyanov, 2000, 46); XXVII(Ravens, 2014); XXVIII(Garbuzov, 1956); XXIX(Nazaikinsky and Rags, 1964); XXX(Rags, 1980); XXXI(Fraisse, 1982); XXXII(Harding, 1983); XXXIII(Madison and Paulin, 2010); XXXIV(Gabrielsson, 1999); XXXV(Friberg and Sundberg, 1999); XXXVI(Chew, 2001); XXXVII(Thiemel, 2001); XXXVIII(Réti, 1951); XXXIX(Mazel, 1979); XL(Kholopov, 2006); XLI(Val'kova, 1992); XLII(Braudo, 1961); XLIII(Huron, 1989); XLIV(Benward and Saker, 2009); XLV(Skrebkova-Filatova, 1985); XLVI(Berry, 1987); XLVII(Cambouropoulos, 2010); XLVIII(Rags, 1999); XLIX(Sandell, 1995); <sup>L</sup>(Kholopova, 2002); LI(Meyer, 2009); LII(Banshchikov, 1997); LIII(Kendall and Carterette, 1993); LIV(Kreitner et al., 2001); LV(Rothstein, 1989); LVI(Cambouropoulos, 2008).

changes in at least 9 AEs (**Table 1**). Their categorization into "classes" seems to be modeled after pitch. A music-maker breaks the range between the lowest and the highest pitch classes (i.e., ambitus) within a music work into "degrees," forming a set of pitch classes to construct music. Similarly, other AEs divide the continuum between their marginal values into step-like increments, the assortment of which can structurally characterize a musical work. Pitch-class sets receive their analogs in sets of the following classes, intuitively selected by a music-maker for a particular expression per composition:


Such discrete classes coexist with gradual inflections for each class (**Table 1**). Evidently, music is designed to integrate multiple AEs in a complex admixture of their patterns of expression. Music defaults to the integration of concurrent tones, in contrast to the segmentation tendency of speech (Bregman, 1994): people can sing together, yet when speaking, they always take turns (Brown, 2007). Here, AC sides with music rather than speech, as is evident in the widespread animal chorusing. The integrative power of music makes the concept of "musical mode" indispensable for understanding the rise of music. The reduction of "mode" to "scale," adopted by some researchers (e.g., Pfordresher and Brown, 2017), constitutes a fundamental error in confusing the purely quantitative and formalistic concept of "scale" with the qualitative and content-oriented concept of "mode" (see Nikolsky, 2015b). Musical mode is more than a mere set of pitch-classes selected to make music: it also encapsulates the rules for their interconnection and the semantic range of suitable expressions (Wulstan, 1971; Alekseyev, 1976; Kholopov, 1976, 2005; Bytchkov, 1987, 1997; Lester, 1989; Beliayev, 1990a; Porter et al., 2001; Powers and Wiering, 2001; Straehley and Loebach, 2014; Winnington-Ingram, 2015).
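The coexistence of discrete classes with gradual inflections can be made concrete with a small numerical sketch. The Python function below is purely illustrative and is not part of the analytical method proposed in this article: it assumes an equal division of the octave into 12 steps and a 440 Hz reference (both are assumptions, and many music systems use other tunings), and the name `quantize_pitch` is hypothetical. It splits a measured frequency into a discrete class index and a residual inflection in cents.

```python
import math


def quantize_pitch(freq_hz, ref_hz=440.0, steps_per_octave=12):
    """Split a frequency into (class index, inflection in cents).

    Assumption: an equal division of the octave anchored at ref_hz.
    This is an illustrative simplification, not a model of "mode".
    """
    if freq_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    # Continuous distance from the reference, measured in scale steps.
    steps = steps_per_octave * math.log2(freq_hz / ref_hz)
    # Discrete part: the nearest step, folded into a single octave.
    nearest = round(steps)
    pitch_class = nearest % steps_per_octave
    # Gradual part: the deviation from that step, in cents
    # (1200 cents per octave).
    inflection_cents = (steps - nearest) * (1200.0 / steps_per_octave)
    return pitch_class, inflection_cents


if __name__ == "__main__":
    for f in (440.0, 450.0, 415.3):
        cls, cents = quantize_pitch(f)
        print(f"{f:7.1f} Hz -> class {cls:2d}, inflection {cents:+.1f} cents")
```

The class index captures only the "scale" side of the picture. The rules of interconnection and the semantic range that define a mode are not represented in such a computation.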

In essence, "mode" constitutes the generalization of a particular melodic typology, characteristic for a given musical genre, which supplies that mode with semantic denotations (Nazaikinsky, 2013). Nothing similar exists in speech. Music is unique in its holistic appreciation of sounds per se (Patel, 2010). Hence, the idea of euphony—pleasant concordance of sounds in specific expressions—is quintessential for "mode," as emphasized by Russian theorists.

The same principles apply to "rhythmic modes," conceptualized within Western (Roesner, 2001) and some non-Western civilizations (Clayton, 2000). Rhythmic divisions, utilized in a composition, complement one another in expression of musical movement and in combinatory rules. A rhythmic modus in Western medieval theory, Arabic maqam, Iranian dastgah, or Indian raga incorporates not only a specific progression of rhythmic values but a specific "ethos"— an abstracted emotional quality projected by music on society at large (Shestakov, 1975). Each rhythmic modus in the abovementioned music systems is characterized semantically by its affiliation with a certain ethos and structurally by certain proportions between the duration values used in a music work. Rhythmic modus resembles pitch modus by incorporating a set of rules. Just as pitch-classes are allowed to follow or not follow one another, or require an alteration for ascending or descending motion, rhythm-classes are restricted to certain ratios which can be altered in a certain way (e.g., a dotted rhythm can be "over-dotted" in a suitable context).

The idea of concordance and appreciation that underlies the overwhelming majority of known traditional music cultures justifies the conceptualization of each AE as a carrier of its proprietary "mode." Every musical piece can be defined by identifying its melodic, harmonic, rhythmic, metric, tempo, articulation, textural, and timbral modes.

Together, these modes constitute "tonal organization" (TO) in music. Conceptualized by François-Joseph Fétis (1840), TO is a method of joining musical tones together according to the sensibility of music-users (Fétis, 1994, XXV). Unlike tonemes of tonal languages, musical TO affects all tones, generates complex functional relations between them, and involves rhythmo-metric, dynamic, articulatory, and registral arrangements. Speech might also use similar arrangements (Patel, 2006). But music requires a special analytic attention where changes in the melodic contour are quantized into pitch-classes that are continuously cross-compared, unlike the linguistic "vowel pitch" (Walker, 1997, 322–3). Such syntactic pitch-parsing is as imperative for music as word-parsing is for language. Semantics provides yet another distinction: verbal syntax specializes in conveying referential meaning, whereas music specializes in emotional expression<sup>7</sup> (Gabrielsson and Lindström, 2001; Juslin, 2001, 2005, 2011, 2013; Cook, 2002; Krumhansl, 2002; Gabrielsson and Juslin, 2003; Dissanayake, 2008; Johnson-Laird and Oatley, 2010; Trainor, 2010; Perlovsky, 2012; Altenmüller et al., 2013b; Eerola and Vuoskoski, 2013; Eerola et al., 2013; Peretz, 2013; Nikolsky, 2015a, 2020; Schiavio et al., 2016). Such distinction has been

<sup>6</sup>The word "movement" here refers to a principal division of a longer music work into sizeable sections, each distinguished by its own metric organization and tempo: e.g., a 4-movement symphony or a 3-movement sonata. The concept of movement emerged in 16th-century Western classical music to reflect on the old practice of switching from one tempo to another within the same piece of music (Sadie, 2001). However, the use of multiple movements within the same work is by no means exclusive to Western civilization. Well known are non-Western genres of music that employ cyclic arrangement, such as Arabo-Andalusian nubah (Touma, 1996) or Javanese court gamelan music (Sutton, 1991).

<sup>7</sup>Ontologically, it is necessary to distinguish between "meaning" in a natural language and "meaning" in a cultural system of symbols (such as music) especially in light of the difference in their acquisition: thus, under experimental conditions non-human primates can acquire some symbolic systems but not a full-fledged human language (Balari et al., 2011). It seems that the verbal combinatorial semiosis of referential meaning is fundamentally different from retrieving imagery, be it emotional or motivational information assigned to cultural symbols. This distinction is crucial for the investigation of origins of human language and music. Here, music, despite its combinatorial nature, occupies a place closer to signal-like semiotic systems, which makes music more accessible to hominins than language.

fundamental for the musical practices and theories of most musical traditions before Western classical music was swept away by the 20th century modernistic "revolution." This distinction was revived after emotion and music attracted intense neuropsychological research in the 1980s.

Music's social nature, evident in entrainment<sup>8</sup> (Tarr et al., 2014), and emotionality, evident in chills (Altenmüller et al., 2013a), are critical for distinguishing music: neither entrainment nor chills characterize verbal communication. And both are closely related through emotional contagion (Trost et al., 2017). This music/language distinction must have been already present in musilanguage, since in AC referential and motivational information is coded differently (Manser, 2010). However, music differs from ACs by encoding affective information according to the conventional modes of numerous AEs, as we shall see. Hence, the structural definition of music should be:

TO of multiple AEs that entrains listeners and performers and transposes performers' intentions to emotionally stir listeners through vocal and/or instrumental performance.

Pitch contour, rhythm/meter, and dynamics (the most salient AEs) together constitute the principal structural criteria of music.

## EMIC AND ETIC APPROACHES TO TONAL ORGANIZATION

The proposed definition is instrumental for engaging an additional source of evidence in the quest for the origins of music: the comparative structural analysis of the world's archaic indigenous musics, the earliest forms of music-making by human infants, and animal vocalizations. The modern advances in computer science support the acoustic and statistical analyses of vast datasets unavailable before. Such investigation could radically update the evolutionary theory while resolving the current situation in comparative ethnomusicology that is nothing short of a crisis (Savage and Brown, 2013).

Many cognitive scientists remain unaware of the profound ideological shift in Western ethnomusicology that occurred during the last half-century. In essence, the study of "text" was replaced by the study of "people" (Zemtsovsky, 1997)<sup>9</sup>. The turning point was marked by Gourlay (1982) at the 1979 Oslo Conference of the IFMC by a call for "humanizing ethnomusicology" to abandon "the pretense of objectivity." Timothy Rice reflected this departure in his influential article "Remodeling Ethnomusicology" (Rice, 1987). At the heart of this transformation lies the emic/etic antithesis, introduced by Pike (1967) in 1957 to oppose the "insider's" versus the "outsider's view" in the researcher's position toward an object of study. Ever since, this opposition has grown into a schism between Western social and cognitive scientists (Headland, 1990). Harris (1964) adapted Pike's approach for social sciences, conceptualizing "emic" as a specific culture, mentally "native" to an "insider," and "etic" as cultures experienced not mentally but behaviorally, due to their "foreignness" to an "outsider." Hence Harris' claim that an outsider is capable of grasping only the superficial behavioral patterns through direct observation. Harris' followers wanted to abstain from any "mentalization" of observed facts to avoid their misrepresentation (Harris, 1990). Pike's followers, by contrast, interconnected mental and behavioral aspects, holding that etics and emics present respectively physical and cultural aspects of analysis, so that an outsider can learn to analyze like an insider, and vice versa (Pike, 1990).

For ethnomusicology, the emic/etic problem was discussed at the 32nd ICTM Conference, 1993, Berlin. The consensus recognized that insider and outsider perspectives were inseparable and complementary to each other: emic data was to be fit into etic categories, disregarding whether they were actually recognized by the insiders (Baumann, 1993). However, in the following decade Western ethnomusicology became progressively politicized against a supposed "Western bias," equated with any form of etic evaluation. Some authorities went as far as viewing cross-cultural scientific investigation of music as "cultural colonialism" (see Agawu, 2003).

The purist emic approach replaces the scientific method of investigation with the insider's description of a native culture in a social context (Myers, 1993, 222–3). The reason for this is that the scientific method by itself is a product of Western civilization (Messner, 1993). Thus, Gourlay (1984) explicitly defies any objective inquiry about music by means of scientific investigation<sup>10</sup>. Becker (1986) declares musical systems as being "incommensurable," and any scientific study of non-Western music as being "immoral." She insists that each musical culture should be investigated only in its own native terms and not evaluated against another culture: the only way for a researcher to study music is to merge with the indigenous community, learn its language and jargon, and collectively make music. In effect, this utilitarian ethno-unilateral approach to music precludes the study of its origins (Dobzhanskaya, 2012). No wonder, in the West, comparative musicology became abandoned,

<sup>8</sup>Entrainment (from French "en-" + "traîner," to drag something along) is the term used in physics to address a wide range of phenomena where two oscillators are coupled, and one of them gradually comes into synchrony with the other, becoming locked in a phase. Entrainment of two pendulum clocks was discovered by Christiaan Huygens in 1666 but was explained only a few centuries later. In the early 20th century, other manifestations of entrainment were unveiled in acoustics (coupling of the organ pipes) and biology (glimmering fireflies), until it was generalized as a universal physical phenomenon (Pikovsky et al., 2001). Its biomusicological manifestations were identified in the 1990s, at first in relation to music therapy, and thereafter as an integral part of perception of rhythm and meter (Large and Kolen, 1994), of great importance to the evolution of music (Fitch, 2012).

<sup>9</sup>Thus, Titon (2015), one of the leading Western ethnomusicologists of today, goes as far as defining the discipline of ethnomusicology as "the study of people making music"—rather than "the study of music" as the term "musicology" indicates (the study of human societies is conducted by another discipline—"anthropology," reflected in the etymology of its name). Paradoxically, modern Western "people's ethnomusicology" still shuns the Soviet ethnomusicology which shared the same approach, holding music as "belonging" to people and "reflecting" people's mentality, while remaining totally free of the anti-textual bias (Panteleeva, 2019).

<sup>10</sup>Gourlay argues that no musicological study of African music by outsiders is justified, because "in no African language about which we have information, and in many used by other peoples who have oral rather than written traditions, is there a word corresponding to the English term 'music'." So, according to Gourlay, "where the term 'music' is unknown to the people in question, one can conclude only that what we are presented with is the investigating scholar's concept of his/her 'music'."

musical universals denied, and music history fragmented into a bunch of disconnected "histories" (Savage and Brown, 2013). Unfortunately, despite its severe shortcomings, the "emic bias" has penetrated into psychoacoustics (e.g., see Parncutt and Hair, 2011)<sup>11</sup>.

Certainly, not all Western ethnomusicologists abstain from musicological analysis (Arom, 2010) or deny the validity of the objective etic approach (Alvarez-Pereyre and Arom, 1993). Nevertheless, the anti-analytical trend<sup>12</sup> has taken its toll, establishing a conviction that any research of structural universals is inevitably ethnocentric and inadmissible for ethnomusicology (Nattiez, 2012). Disregarding musical text for the sake of musical behavior is symptomatic of a shift away from comparative musicology to fractured sociomusicology of isolated musical communities (Nettl, 2010, 70–92). Many contemporary American ethnomusicological papers are published without a single example of structural analysis to support the author's claims, basing their claims on entirely behavioral, and not musicological, data, paradoxically conducting musicological research without looking into music per se (Zemtsovsky, 2002)<sup>13</sup>. Consequently, cognitive scientists interested in comparative music theory and musicological analysis have no choice but to rely on the old publications in English and new ones in other languages (especially those coming from Eastern Europe and Asia, where the influence of politicization is weaker).

The summary of etic/emic arguments, crucial for the investigation of TO, demonstrates that proponents of the emic approach strongly overvalue it while writing off its fundamental flaws (**Table 2**).

TO is identifiable based on the etic information alone, and its few potential shortcomings are easily amendable by emic references (Dasen, 2012). A purely etic approach has been the status quo in organology, where musical instruments are identified according to etic principles, disregarding emic views (Baumann, 1993). There is no reason why the entire field of ethnomusicology should not be treated in the same way. The etic approach is unique in enabling a "progressive" accumulation of knowledge where the mistake of one researcher can be corrected by another. Etic self-sufficiency is evident in the fields of ethology and developmental psychology. Neither human babies nor animals can provide emic information, which by no means invalidates the acoustic analysis of their communication.

In light of this, studying TO is paramount for establishing the objective ground for interdisciplinary scientific research of the evolution of music across the synchronic and diachronic varieties of music systems. TO's role for musicology is comparable to the role of phonology in linguistics: TO specifies a set of acoustic attributes and their oppositions to encode and convey information. Together, they form the "surface level" that underlies the musical syntax and semantics, and provide the material base for any music culture (Cambouropoulos, 2010).

## TONAL ORGANIZATION DISTINGUISHES HUMAN MUSIC FROM ANIMAL COMMUNICATION

The very ability to enjoy "harmonious" sounds most likely emerged as a byproduct of satisfying the need to bring individual emotions in accordance with the interests of a social group (Panksepp and Bernatzky, 2002). Musical anhedonia in humans is exceedingly rare, indicating that music evolved as a direct auditory pathway toward the emotional reward centers in the brain (Loui et al., 2017). Music is probably a human invention that came into being to shape important brain functions through the hedonistic effect of appreciating sounds (Patel, 2010). Patel's (2008) theory of "transformative technology of the mind" reconciled the adaptionist (Darwinian) and the non-adaptionist (Spencerian) approaches, based on the latest cognitive research, and provided the foundation for the theory of "mixed origins of music" (Altenmüller et al., 2013b) that explains how the human affective signaling system has transformed the human brain and created music. Emotive specialization and emergence of "musical emotions" must have followed the formation of human auditory-affective circuitry (Altenmüller et al., 2013a).

Centrality of affective signaling brings animal communication closer to music than to speech (Fitch, 2006). Animal signals usually express affective states according to their innate "vocabulary," are volitionally produced, and are actually felt (Fitch, 2010, 179–81). TO shares more similarities with animal vocalizations than with phonetics, since consonants, crucial for verbal parsing, are unique to human speech—unlike vowels that are more similar to singing and ACs (Kolinsky et al., 2009). Vowels determine verbal prosody which is the primary means of conveying emotions through speech.

Most likely, the musilanguage's TO resembled the model of vocal production common to primates and human infants: a reflex-like vocalization (e.g., pain-shrieking), triggered by specific stimuli, and hard-wired for animals but modifiable for humans (Jürgens, 1995). Humans start developing the repertory of cries by differentiating timbral and contour features just a few months after birth (Wermke and Mende, 2009), whereas for most animals, call structure is not modifiable by acoustic experience

<sup>11</sup>Parncutt and Hair subscribe to Gourlay's defiance of a scientific investigation for those phenomena that do not find a corresponding term in a native language. They categorically insist that the research of consonance and dissonance be constrained only to music of such cultures that define the concepts of consonance and dissonance: "if musicians in that culture do not talk directly or indirectly about C/D [consonance/dissonance], it is considered irrelevant." By this logic, there is no gravity in those countries whose native people do not have a word translatable in English as "gravity." Parncutt and Hair see the goal of studying music in "documenting the musical and music-theoretical discourses of the insiders about which tones and rhythms should be played together and why, and considering the political and psychological mechanisms that are allowing Western music to dominate world music"—undoubtedly, a controversial and a politically biased agenda.

<sup>12</sup>To substantiate this criticism that is rarely voiced in modern Western literature, I shall quote one of the biggest authorities in ethnomusicology (the emphasis is added by me): "Functional analyses of musical structure cannot be detached from structural analyses of its social function: the function of tones in relation to each other cannot be explained adequately as part of a closed system without reference to the structures of the sociocultural system of which the musical system is a part, and to the biological system to which all music makers belong" (Blacking, 1974, 30–31).

<sup>13</sup>One of the main reasons for the drop in standards of musicological and ethnomusicological analyses is that in the US and UK academic curricula, music theory in general, and music analysis in particular, have been offered as rudimentary undergraduate courses (Agawu, 2004). In contrast, in countries of the former Soviet Union, music analysis has been taught at the highest level of scholarship that requires at least 10 years of study before attaining a level of training where an analyst is expected to capture and interpret the totality of expressive means employed in a music work (Khannanov, 2005).

#### TABLE 2 | Pros and cons (P/C) of purely etic, emic, and combined "etic + emic" approaches to analyzing music structures.


Pros are colored blue, and cons red. The number of cons for the emic approach (11) is more than double the number of cons for the etic approach (5). Emic cons are more detrimental for the outcome of the analysis. The etic approach, even at its worst, still allows the researcher to infer valid principles of TO in a sufficiently large pool of samples of musical styles/genres, which, in the long run, secures correction of mistakes by subsequent researchers. At its worst, the emic approach precludes any comparative study and invalidates the study of TO in an isolated music culture where its members do not regard certain sound production as "music" (i.e., incantations, spells, herding vocalizations). The combined etic/emic approach effectively corrects the shortcomings of a purely etic approach, but in most cases, it fails to correct the shortcomings of a purely emic approach.

The Pastoral Origin of Music

(Hauser, 1996, 315). Call-learning occurs in a few songbird species, but for most birds, songs are innately encoded, and life experience only activates their retrieval (Marler, 1997).

A call serves as the basic unit in animal communication<sup>14</sup> and usually conveys specific affective information (Hauser, 2000). Different calls are combinable in "mixed bouts" that differ from "pure bouts" (a single call) by triggering a sequence of emotion-based behavioral responses in other animals. Each call's significance is hard-bound to its acoustic structure. Despite their superficial similarity with music, "mixed bouts" lack transposability of intentions: each call comes only in response to the actual stimulus present in the environment (Zuberbühler, 2017). Transposability is the hallmark of music: the same structural pattern is intended to express the same idea across different instances of use, without which musical genres would be impossible. For example, most lullabies are recognized cross-culturally by their set of structural features (Trehub et al., 1993). Genres are based on reproduction and transposability, and usually form genre systems to support important social practices (Samson, 2001), which enables music to reflect perceptual reality. Animal-learned vocalizations lack such comprehensiveness and generalization. They are limited to:


Syntactically, AC overall lacks a combinatorial organization<sup>15</sup>. It resembles the one-word holophrasic communication of human infants by depending on a directly observable context and on an "analog" signal-emotion correspondence (Johansson, 2005). The same applies to animal "phonocoding"<sup>16</sup> (Marler, 2001): it excludes categorical perception, rhythm, hierarchical structure, and adjacent transitional probabilities (Yip, 2006).

Indispensable for speech and music, compositionality completely eludes ACs, along with the listener's capacity to continually (re)-organize behavior as the song unveils. Non-human communication, as a rule, employs a "one-ended" system: a signaling animal emits a signal unconsciously, not for any specific receiver but as a physiological reflex conditioned to a particular type of stimuli (Hauser, 2000). Such intention-free transmission precludes semiosis<sup>17</sup>, since sender and receiver must share signs and codes to actually transmit information.

A cumulative "two-ended" semiosis, where the receiver signals in response to the sender and vice versa, is unique to humans, and emerges as a result of technological complexity of human life. Dennett (1983) called this "second-order intentionality"—i.e., the receiver's beliefs and desires about the sender's beliefs and desires—in distinction from the "first-order intentionality" that is limited to the receiver alone.


Subsequently, the state of knowledge is changed on both ends of such communication, which, so far, has not been found in any non-human animal. Most common for ACs is zero-order intentionality: the signaler does not consciously intend to convey a piece of information, but instinctively engages a specific signal structure, triggering a similarly automatic response of the receiver.

Two-ended communication generates an unlimited diversity of structure due to infinite recombinations of a finite set of discrete elements that do not carry meaning on their own, which Abler (1989) calls the "particulate principle." It is peculiar to human language and music, finding only embryonal equivalents in a few animal species (Hauser, 2000). Complexity comparable to that of humans is evident in some birdsongs, but it serves to impress mates and intimidate competitors rather than to convey a specific message (Marler and Slabbekoorn, 2004), likely forming a parallel (not a prototype) to human evolution (Fitch, 2010, 184).
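The combinatorial growth implied by the particulate principle can be illustrated with a minimal sketch. It assumes a hypothetical inventory of three discrete, individually meaningless elements (the labels `a`, `b`, `c` are placeholders, not units taken from any cited study) and simply counts how many distinct sequences such an inventory yields as the permitted length grows.

```python
from itertools import product


def enumerate_sequences(inventory, length):
    """List every ordered sequence of the given length over the inventory."""
    return ["".join(p) for p in product(inventory, repeat=length)]


def count_sequences(inventory_size, max_length):
    """Count all sequences of length 1..max_length over an inventory of given size."""
    return sum(inventory_size ** n for n in range(1, max_length + 1))


# Hypothetical inventory of three discrete elements (placeholder labels).
elements = ["a", "b", "c"]

# All two-element sequences: 3 ** 2 = 9 of them.
print(enumerate_sequences(elements, 2))

# Sequences of up to four elements: 3 + 9 + 27 + 81.
print(count_sequences(len(elements), 4))
```

The count grows exponentially with the permitted length, which is the sense in which a finite inventory supports an open-ended repertoire. The sketch says nothing about which sequences would be musically or linguistically well-formed.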

<sup>14</sup>In some songbirds, the innate encoding consists of smaller elements, resembling syllables, and following simple rules for how to order them, so that a bird actually learns to "assemble" its song. However, the assortment of such elements is very limited, making songs signal-like, restrained to a single species. Playback of isolated syllables of such songs either does not elicit response or produces a weak reaction in other conspecific birds (Searcy, 1992). Perhaps the rearrangement of elements constitutes not a pragmatic, but a "syntactic" production unit: zebra finches, for instance, were found to stop at syllabic breaks in a song when distracted (Cynx, 1990). Rearrangement of "syllables" is also used by a few primate species (gibbon) to disclose the identity of a caller for conspecific animals (Marler and Mitani, 2008).

<sup>15</sup>Although it is not uncommon for ACs to form a sequence according to a rule-based structure, noticeable by conspecific animals (Fitch, 2010, 182), changes in such structures apparently do not result in the changes of meaning of the entire song (Hauser, 2000). The most syntactically elaborated bird and whale songs use combinatorial features, albeit minimal. However, despite having a componential structure, such animal song in its entirety presents a single piece of information learned from the animal's parent holistically rather than incrementally, element by element (in contrast to how humans learn), and is therefore highly stereotypical in form (Hurford, 2012, 3–99).

<sup>16</sup>The concept of phonocoding (i.e., "phonological coding") was introduced to oppose "lexicoding" of human speech (Marler, 2001). Phonocoding refers to the capacity to generate new sound patterns by recombining the constituent elements and components of known conventional signals. This capacity is minimal in nonhuman primates, but common in learned vocalizations of songbirds and whales, which, however, remain primarily non-symbolic and affective.

<sup>17</sup>The term "semiosis" here refers to the Peircean concept of conveying information by encoding it into signs by one party and decoding it by another party—a "two-ended" system. A "one-ended" call can be somehow interpreted in relation to the situational context by the listening animal, but this interpretation can radically differ from the actual state of the sender: e.g., bird's mating call might be interpreted by a nearby cat not as a signal of readiness for mating but as a signal for hunting. Then, the integrity of the information passed from sender to receiver is not preserved. Within this context, the use of the term "meaning" in regard to an AC, adopted in biosemiotics (Sebeok, 1994, 111), is confusing, since "meaning" implies that someone "means" something by displaying a specific sign. More accurate here would be to employ the term "significance" (as in "to signify") instead of "meaning."

descending for decelerations. Form reflects the thematic organization of the material, indicated by horizontal brackets and letters: thinner brackets and lowercase letters for motifs, and thicker brackets and uppercase letters for phrases. Each new material is marked by a new letter, and variation—by a subscript number. Register is represented by the coloration of the grainy filling of the ambitus: from a deeper green for the darkest timbre to yellow for the lightest timbre. In this example, oboe uses its darkest register, bassoon—its faintest register, whereas guitar—its medium register. Harmonicity (see Table 3) is indicated by the relative thickness and the geometric shape in representation of tones: the greater the harmonic richness, the thicker the rectangular bars, whereas the noisier the sound, the more irregular the fuzzy shapes (not present in this particular example). For thorough explanation of this method of visualization see Appendix 1 in Supplementary Material.

The structural criterion for emergence of the Semiotically Functional TO (SFTO)<sup>18</sup> in music is therefore manifested in the introduction of particulate organization in phonocoding.

## THE TIMEFRAME OF TONAL ORGANIZATION OBTAINING FULL SEMIOTICALLY FUNCTIONAL CAPACITY

The current consensus holds that music was gradually formed since the appearance of Homo heidelbergensis about 600,000 BP, leading to an artistic "explosion" circa 40,000 BP, when the earliest bone "flutes"<sup>19</sup> were produced "en masse" (Morley, 2013, 219–25). Although flutes prove the existence of TO in the Aurignacian culture, this tells nothing of whether their sounds served a one- or two-ended communication. In all likelihood, TO did

<sup>18</sup>By "semiotically functional," I mean that a music-maker selects the elements and components of tonal organization for each of the aspects of expression in music based on their efficacy in conveying specific affective information ("musical emotion") to his/her listeners and/or partners in performance. In this sense, the AC can be considered "semiotically dysfunctional"—not supporting a successful two-ended communication (delivery of the intended message) between the sender and the receiver.

<sup>19</sup>The word "flute" here is used informally: there is not enough archeological evidence to conclude if the earliest instruments were flutes or clarinets. The oldest artifact is a bone fragment from Haua Fteah, Libya, with a single hole, dated 90–110,000 years ago (Blench, 2013). Most archeologists do not recognize it as man-made. Next in line is the 47,000-year-old 3-hole artifact from Divje Babe, Slovenia, uncovered in 1995. It was interpreted as a bone bitten by a carnivore (D'Errico et al., 1998). However, experimental testing has demonstrated that none of the cave bear, wolf or hyena dentition could punch two holes without cracking and splitting the bone (Turk et al., 2001). Nevertheless, this argument was not accepted by the supporters of non-human origin of the Divje Babe artifact (Morley, 2006). Subsequent tomographic analysis has concluded that the Divje Babe artifact was man-made (Tuniz et al., 2012). Slovenian researchers have presented additional reasons for its man-made origin (Turk, 2014). In spite of this, another recent British study has restated the bite origin hypothesis (Diedrich, 2015), though without addressing the arguments of the 2012 and 2014 studies. The third in the timeline, and unequivocal in its provenance, is the 5-hole Hohle Fels-1 flute, 35,000 years old (Conard et al., 2009).

not communicate musical emotions but merely accompanied the behavioral display of actual real-life emotions—as it happens in reflex-driven animal vocalizations (Seyfarth and Cheney, 2017). Their acoustic form is shaped by the physiological impact of emotion on the vocal organs plus Pavlovian-style priming.

Semiosis originates in an ongoing interaction between signalers and receivers within the reference-framework of the same environment—forging communication rules through the dialectics of ritualization and devaluation (Wiley, 1983). Ritualized signals establish conventions via encoding/decoding interaction between the acquainted individuals. Once established, convention becomes "devalued"—abused by "bluffing calls" of the unacquainted signalers trying to take advantage of the established reactions of the receivers. Increase of dishonest signaling causes the signaler to substitute the signal or modulate it along a single acoustic dimension until an "evolutionary stable strategy" is formed, marking a stationary equilibrium within the population—which ultimately fixes the convention (Maynard-Smith, 1976). Here, "signaling efficacy" obtains its formative power: as natural selection optimizes a signal to support the signaler's visual display, successful decoding starts relying on whatever the receiver finds most comfortable to detect, discriminate, and remember (Guilford and Dawkins, 1991). Together, strategic design and efficacy determine the ultimate structure of a signal.

The road from animal call to musical phrase goes through the ritualization of innate physiological and behavioral cues that animals use to exchange information (Maynard-Smith and Harper, 2003)<sup>20</sup>. Ritualized signals differ from cues by being more conspicuous, redundant, stereotypical, and containing alerting components (p. 72). Nevertheless, they remain "concrete" (bound to a single context) like cues (Fitch, 2010, 184) and unlike "transposable" music. For a ritualized signal to evolve into a musical phrase, its meaningful features must be abstracted to become non-signal-specific and form an AE of TO, that is, a conventional dimension of gradient change along some axis.

The end result of such abstraction is the multifactorial nature of music communication (**Figure 1**): each emotional/motivational state is represented not by a dedicated signal but by the configuration of numerous AEs (Juslin, 2005). Conventional musical notation is poorly suited for incremental representation of AEs other than rough indications for melody/harmony, rhythm/meter, and form. Waveforms display rhythm and dynamics in finer detail, but miss other AEs. Spectrograms decently represent melody, rhythm, articulation, register, harmonicity, and dynamics, but miss harmony, tempo, meter, and texture. This necessitates the use of a special notation, such as the prosogram, developed by Mertens (2004) for analyzing speech. Although applicable to monophonic vocal music in visualizing pitch, rhythm, articulation, dynamics, harmonicity, and register, the prosogram ignores harmony, tempo, meter, texture, and form. To overcome these limitations, I propose a similar approach to music: the "musogram."<sup>21</sup> Its advantages over conventional notation in capturing 11 AEs are demonstrated in the simplest case of classical music (**Figure 1**). It introduces the conventions necessary to read the upcoming figures.

Multifactorial visualization reveals the expressive contribution of all AEs. Each AE features structural patterns representing specific emotional states across cultures, genres, and styles, at least for basic emotions (**Table 3**)<sup>22</sup>. Configuration of such patterns distinguishes one emotional expression from another. If multiple expressions share the same pattern of AE (e.g., legato characterizes both sadness and tenderness), the combination of a few aspects (e.g., "articulation + meter") differentiates them.

Multifactorial particulate semiosis shapes musical signs: each AE features SFTO, which enables "natural selection" for the most effectively communicated expressions. AC can be multifactorial but lacks particulate semiosis. Verbal semiosis is particulate but mostly unifactorial: phonetic organization is its primary source<sup>23</sup>.

Basic emotions can be recognized across musical cultures (Mohn et al., 2010) and can be acoustically described (Eerola and Vuoskoski, 2013). Therefore, at least some of their musical markers share biological roots with mammalian ACs (Zimmermann et al., 2013). The birth of SFTO is trackable by comparing the multi-cultural markers of typical musical expressions of basic emotions to equivalent AC expressions and by inferring their differences and commonalities (**Table 4**). Common traits indicate music's inheritance from ACs, whereas contrasting traits—innovations brought about by cultural evolution.

Music and ACs have in common only regularity/irregularity and articulation. They both find a perfect match between human music and AC (5 out of 5 emotional states). The next closest match (4 out of 5) is "harmonicity." That is why these two aspects of TO (articulation and harmonicity) must be the most ancient, possibly retained from prehuman times. In contrast, "register" shows a nearly perfect

<sup>20</sup>Maynard-Smith and Harper give an example of such ritualized physiological cues as thermoregulation that causes animals to raise their feathers/hair to reduce body temperature, heightened in social interaction—which makes an animal appear larger and promotes dishonest signaling of increased body size in instances of confrontation (p. 68). Other physiological cues are respiration, urination/defecation, pupil dilation, and yawning (p. 69). The ritualized behavioral cues include "intention to move" which signals the beginning of a significant action (a bird taking a few false starts before flying), "protective movement," and "displacement behavior" (p. 70).

<sup>21</sup>For thorough explanation of the visual representation of the multifactorial organization of music, a way of its quantification, and its difference from the prosogram approach by Mertens, see Appendix 1 "A New Method of Modal Multifactorial Analysis of Tonal Organization in Music" in **Supplementary Material**.

<sup>22</sup>Musicological literature identifies many more structural patterns of different AEs than the patterns listed in **Table 3**, and their semantic references include many more affective states than merely five basic emotions. Much of this information is dispersed in the treatises on music theory, some of which are cited in the beginning of this paper. There are very few books that list such structural patterns in the manner of the 18th century treatises of "musical lexicon" (Cooke, 1959; Mattheson and Harriss, 1981; Bartel, 1997; McCreless, 2002; Vashkevich, 2006). However, only isolated patches of such literature have attracted the attention of psychoacousticians and received experimental trial (Kaminska and Woolf, 2000). For this reason, the metareviews on research in "musical emotions" tend to focus exclusively on 5 basic emotions.

<sup>23</sup>Although tempo, rhythm, prosodic contours, and registers contribute meaningful motivational and attitudinal information to verbal communication, by no means can they be regarded as its primary semiotic aspects. Without knowing the lexic meaning of words of a particulate language, inferred from phonetic structures of auditioned speech, no adequate understanding of that speech is possible. This is in polar opposition to musical semiosis, where tempo, rhythm, melodic contour, and register directly convey the most important information, whereas keeping the referential meaning optional.

mismatch, testifying that humans cardinally reorganized the use of registers in music. The rest of the AEs display mixed results. If we generalize by emotional states rather than by expressive aspects, then none of the emotions display a full match or a full mismatch. Evidently, coding of emotions in human music has developed its own proprietary acoustic attributes. This confirms that ACs are mostly conspecific. Heterospecific<sup>24</sup> generalities support only a rough distinction between "positive" versus "negative" emotions (Snowdon et al., 2015). Human communication inherits from ACs just 2 general semiotic oppositions: (1) positive/negative affectation and (2) low/high intensity of an affective state (Brudzynski, 2013). High-intensity "strong emotions" (Grewe et al., 2005) have evolved into chill-like experiences of music, in contradistinction to the "mundane" use of language (Silvia and Nusbaum, 2011). However, "strong emotions" per se could not support musical semiosis because the stimulus-response relationship between chill and music structure has not been experimentally reproducible: music chills seem to occur intermittently (Altenmüller et al., 2013a).

Both incremental and gradual changes in multiple AEs (**Table 1**) are peculiar to human music, whereas holistic tempo, dynamics, rhythm, and melodic contours are common to music and ACs. Musical meter, articulation, and harmony are also traceable to, respectively, ACs' regularity/irregularity, pausing/continuing, and periodicity/harshness.

However, the cross-examination of TO in the expression of 5 basic emotions in music versus ACs reveals that many AE patterns are unique to music (**Table 5**). Moreover, humans completely invert the acoustic characteristics of animals' affective states:


This indicates massive remapping of the instinctive vocal encoding of affective states, achieved throughout the cultural evolution of Homo.

#### What could have caused such changes?

For many AEs, their cultural origin is obvious: metric pulses usually break into a default binary pulse (Potter et al., 2009), following the left/right paradigm instituted by bipedalism (London, 2004). Rubato patterns (ritenuto/accelerando) also relate to bipedal locomotion (Honing, 2003), as does tempo, which is synchronizable to gait or heartbeat (Fraisse, 1982). Melodic intervals follow another locomotive paradigm of stepping/leaping (Nikolsky, 2015b): each successive tone either "stands" (unison), "steps" (2nds and fast 3rds), or "leaps" (>3rd). Harmonic intervals, by contrast, are factored by consonance/dissonance relations (a much later historic semiotic development). Articulation grouping relies on yet another biological factor, the breathing cycle (Alekseyev, 1976, 130). Taking a breath terminates a phrase, imposing a "clausal structure" on the melody (Fenk-Oczlon and Fenk, 2009b). The "breath group" prototypes the "articulation group" via a "breathing pulsation" (Etzel et al., 2006). Notably, the breathing pulse takes over metric control in ametric forms of music-making (Wallin, 1983). Locomotive and respiratory AEs must have formed long before Homo.
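The standing/stepping/leaping distinction for melodic intervals can be sketched in a few lines of code. This is only an illustration under stated assumptions: pitches are given as semitone numbers (MIDI-style), 2nds are taken as 1 or 2 semitones and 3rds as 3 or 4 semitones, the `fast` flag is a hypothetical stand-in for the tempo-dependent judgment of "fast 3rds", and slow 3rds are treated as leaps, which the text does not state explicitly.

```python
def classify_motion(semitones, fast=False):
    """Label one melodic interval as 'stand', 'step', or 'leap'.

    Assumptions (not from the source): 2nds span 1-2 semitones, 3rds span
    3-4 semitones, and a 3rd counts as a step only when `fast` is True.
    """
    size = abs(semitones)
    if size == 0:
        return "stand"      # unison
    if size <= 2:
        return "step"       # minor or major 2nd
    if size <= 4:
        return "step" if fast else "leap"   # minor or major 3rd
    return "leap"           # anything wider than a 3rd


def classify_melody(pitches, fast=False):
    """Classify every successive interval in a list of semitone pitch numbers."""
    return [
        classify_motion(later - earlier, fast)
        for earlier, later in zip(pitches, pitches[1:])
    ]


# Hypothetical five-tone melody: unison, major 2nd, minor 3rd, perfect 5th.
melody = [60, 60, 62, 65, 72]
print(classify_melody(melody))             # slow tempo: the 3rd counts as a leap
print(classify_melody(melody, fast=True))  # fast tempo: the 3rd counts as a step
```

A single boolean for tempo keeps the sketch readable. A fuller treatment would need an explicit tempo threshold, which the text does not supply.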

The rhythmic aspect of music possibly emerged from the quantification of verbal rhyming, following the language development (Kharlap, 1972)<sup>25</sup>. Melodic contours also relate to verbal prosody. The timeline of language formation remains controversial: the "saltational" scenario regards language as a sudden mutation 50–100 kya, whereas the "gradual" scenario qualifies it as part of evolution throughout millions of years (Hillert, 2015). Paleoneurology points to the Middle Pleistocene as a birthtime of language (Quam et al., 2017). Since musical rhythm and melodic contours rely on fine vocal control, their addition to TO must have followed the accumulation of extensive lexic vocabulary within a phonological organization of language (Tallerman, 2013). This ties the emergence of multifactorial TO (which is hardly possible without engaging melodic contour and rhythm) to Homo sapiens and the Upper Paleolithic, as indicated by the proliferation of bone "flutes." During 1995–2009, over 120 bone pipes were recovered across Europe, dated 36–30 kya and concentrated up to 3 "flutes" per cave (Conard et al., 2009). Evidently, melodic music suddenly became popular in the Aurignacian.

Discreteness of pitch is evident in the construction of Paleolithic "flutes": holes are drilled in particular spots in order to generate sound of a particular pitch, and there is evidence of common patterns in the intervallic distances between the placement of the holes, suggestive of the commonality of certain melodic intervals in Aurignacian music-making (Nikolsky, 2015b, Appendix II). Discreteness of pitch was very likely accompanied by discreteness of rhythm, since stressing a pitch as a rule relies on extending its time-value relative to other pitches. Pitch hierarchy is supported by rhythmic contrasts between the shorter timing of modally insignificant pitch-classes and the longer timing of modally important pitch-classes (Krumhansl, 1990).

However, Aurignacian music most certainly lacked SFTO: semiotization of rhythm and directionality requires an extensive period of exploration. This is obvious in the acquisition of musical skills throughout infancy: infants babble (engage in meaningless play with melodic contours) before learning to compose musically expressive vocalizations (Moog, 1976; Dowling, 1984; Swanwick et al., 1986; Holahan, 1987; Hargreaves,

<sup>24</sup>The opposition of conspecific and heterospecific distribution of acoustic features that characterize the vocal expression of a particular affective state in AC allows a researcher to identify those patterns of AEs that match cross-cultural features of corresponding affective states in "musical emotions" of human music. The patterns of expression that are present across multiple animal species are more likely to form the equivalents of "universal" traits of human "musical emotions" than those patterns that are found only within the very same animal species.

<sup>25</sup>However, the idea of rhyming seems to have a precursor in ACs. Thus, humpback whales match the constituent syllables in some of their songs (Payne, 2001). A similar organization was noticed in mockingbird songs (Thompson et al., 2000). Its underlying cause is perhaps simplification of memorizing a complex song. Yet another cause could be the employment of repetition of a particular syllable in a song for a certain number of times as a conspecific marker for certain bird species (Fitch, 2010, 183). Hearing such birdsongs might have prompted humans to invent rhyming.

#### TABLE 3 | The configuration of structural patterns for each AE, typically used to express five basic emotions.


This table is compiled based on a number of meta-reviews of experimental research on emotional responses to listening to music (Gabrielsson and Lindström, 2001; Gabrielsson and Juslin, 2003; Juslin and Laukka, 2003; Juslin, 2005). The data is categorized according to the musicological nomenclature: all acoustic attributes are broken into 10 AEs across 4 acoustic domains. The aspect of texture is missing, because it was not controlled for in the experimental studies of the acoustic structural patterns that characterize "musical emotions". The aspect of harmonicity constitutes an organic part of the aspect of instrumentation, listed in the beginning of this paper. This potentially confusing mismatch occurs as a result of the discrepancy in musicological and psychoacoustic scholarships: as a rule, musicians are ignorant of harmonicity, while psychoacousticians are ignorant of instrumentation. Harmonicity can be defined as the extent to which the spectrum of a complex tone is made of its component frequencies that are integer multiples of its fundamental frequency (FF). This is usually measured as the ratio of harmonics to noise. Slow attack and great vibrato generally tend to reduce harmonicity in a monophonic tone.
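The definition of harmonicity as a ratio of harmonics to noise can be illustrated with a minimal sketch. It assumes that a tone has already been reduced to a list of spectral partials, each a (frequency in Hz, amplitude) pair, and that the fundamental frequency is known. The `tolerance` parameter and the example partials are hypothetical, and this simplified spectral proxy is not the waveform-based procedure used by acoustic analysis software.

```python
import math


def harmonicity_db(partials, fundamental, tolerance=0.03):
    """Estimate a harmonics-to-noise ratio in decibels from spectral partials.

    partials    -- iterable of (frequency_hz, amplitude) pairs
    fundamental -- fundamental frequency (FF) in Hz
    tolerance   -- hypothetical allowed deviation from an integer multiple,
                   expressed as a fraction of the fundamental
    """
    harmonic_energy = 0.0
    noise_energy = 0.0
    for frequency, amplitude in partials:
        energy = amplitude ** 2
        multiple = round(frequency / fundamental)
        deviation = abs(frequency - multiple * fundamental)
        if multiple >= 1 and deviation <= tolerance * fundamental:
            harmonic_energy += energy   # lies on an integer multiple of the FF
        else:
            noise_energy += energy      # inharmonic component, counted as noise
    if noise_energy == 0.0:
        return math.inf                 # purely harmonic spectrum
    if harmonic_energy == 0.0:
        return -math.inf                # no harmonic energy at all
    return 10.0 * math.log10(harmonic_energy / noise_energy)


# Hypothetical tone on 220 Hz with three harmonics and one inharmonic partial.
tone = [(220.0, 1.0), (440.0, 0.5), (660.0, 0.25), (300.0, 0.1)]
print(round(harmonicity_db(tone, 220.0), 2))
```

Higher values indicate a spectrum dominated by integer multiples of the fundamental, while values near or below zero indicate a noisy tone.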

fpsyg-11-01358 July 22, 2020 Time: 21:48 # 14

Nikolsky

The Pastoral Origin of Music

#### TABLE 4 | Acoustic attributes of typical animal vocalizations used by different species to display their affective state, grouped according to AEs of human music.


The data for this table are compiled from numerous meta-reviews (Morton, 1977; Peters, 1984; August and Anderson, 1987; Snowdon, 2003; Briefer, 2012; Altenmüller et al., 2013; Zimmermann et al., 2013; Snowdon et al., 2015). According to the classification scheme of Brudzynski (2013), human and animal affective states are equated in the following ways: human "happiness" is equated to animal "pleasure" (satisfaction); human "sadness" to animal "dissatisfaction" (social isolation from a bonded party); human "anger" to animal "aggression" (agonistic behavior, conflict with display of threat or combat); human "fear" to animal "alarm/disturbance" (anxiety at the presence of threat or intimidation by a novel environment); and human "tenderness/love" to animal "appeasing" (affiliation: physical contact without agonistic behavior, e.g., grooming, and play). Those acoustic features that agree between human and animal expressions of the same affective state are marked blue, whereas the disagreeing features are marked red. Features that are not covered in research literature are marked "n/a." The aspect of "harmony" is clearly not applicable to animal vocalization. The aspect of "form" bears only distant relation to "musical form": AC's compactness loosely corresponds to simplicity of structure, whereas lengthiness corresponds to complexity. The aspect of "meter" also finds only partial correspondence in regularity or irregularity of call units in the AC bouts. The timbral coloration is reflected by the aspect of "harmonicity" rather than "instrumentation" that manages timbre in human music.

#### TABLE 5 | The acoustic attributes of typical expression of 5 basic emotions in human music that find no correspondences in animal communication (based on Tables 3, 4).


These attributes constitute a stock of TO features developed in the process of evolution of human music from hominin musilanguage. This includes changes in vertical harmony, in metric pulse, and in complexity of musical form; contrasts in melodic contour, in directionality of melodic intervals (sharpening for ascending, flattening for descending dyads), and in thematic material; diversity of rhythm, articulation and tempo; and ambitus size. Animal vocalizations do not seem to engage these categories in meaningful differentiation of calls.

1996). Most children pass through a music-babbling stage when 12–18 months old (Gembris, 2006). Universality of babbling suggests the universality of prolonged sensorimotor trials in music-making before semiotic rules are formed. Babbling abstracts melodic directions and intervals, allowing an infant to master particulate semiosis. Similarly, early humans had to experiment at length with meaningless melodic play for the SFTO conventions to emerge.

## CROSS-CULTURAL "SCRIPTS" IN THE FORMATION OF SEMIOTICALLY FUNCTIONAL TONAL ORGANIZATION

Tool-making technologies (Ambrose, 2001) and "social scripts," i.e., fixed generalized patterns of social behavior (Aiello, 1998), most likely served as syntax precursors by providing explicit models for combining numerous elements into a structured sequence (Wildgen, 2004). Paleolithic proxies for syntactical language include composite tools (Ambrose, 2010), fire (Brown et al., 2009), knot-making (Camps and Uriagereka, 2006), cooperative hunting (Chase, 2006, 52), symbolic behaviors (McBrearty and Brooks, 2000), and burials (Mellars, 2004). The same proxies apply to syntax-related features of musical TO. All the AEs of music listed above (except, perhaps, harmonicity) are engaged in the syntactic organization of music. Phrasal ends are usually marked by descending pitch, lower register, more concordant harmony, slowing of tempo, longer rhythmic value(s) placed on metrically strong time, reduction in loudness, and clear caesuras in articulation which separate the end of one formal unit (phrase, sentence) from the beginning of the following unit. In addition, there is evidence of a link between structures of tonal and social organization in indigenous societies (Blacking, 1967; Davidson, 1970; Lomax, 1977; Berliner, 1993; Arom and Voisin, 1997; Kubik, 1999), which indicates that social structures might have also served as proxies for music syntax.

Making bone "flutes" was extremely tedious, demanding skills and expertise (Münzel and Conard, 2009). Why invest in a "pitch toy" rather than merely vocalize?

Cave-inhabitants must have supported flute-makers in the same way as they supported cave-artists: their exquisite labor required narrow specialization, precluding participation in hunting/gathering. In animistic ideology, depictions linked hunters to prey, providing means to benefit the outcome of hunting (Hauser, 1999, 1–4). Magic, not aesthetics, governed rock art, turning depiction into a shamanic occupation<sup>26</sup>. Shamanic music resembles shamanic depiction by cross-linking the signified to the signifier (Hubbard, 2003). In northern shamanic traditions, both melodic and pictorial contours are believed to affect the corresponding real objects (Novik, 2004, 67–85). Archeological evidence also links the most resonant locations in caves with rock art in Paleolithic sites, suggesting the combined ritualistic use of images and music (Reznikoff, 2008; Morley, 2013; Mills, 2016). Hence, a Paleolithic "flute" was most likely a talisman used in rituals (Marshack, 1990). Its manufacturing from the bone of a particular animal (Wyatt, 2016) must have carried more significance for Aurignacians than the pitches it produced.

For melodic semiosis to occur, rhythm and directionality must first be abstracted into AEs. Abstraction of directionality probably followed rhythm: salience of the melodic direction depends on rhythmic values, but not vice versa. Tracking the melodic contour within the tonal "grid" constitutes the backbone of melodic organization (Deutsch, 2013), just like tracking the rhythmic grouping within the metric grid supports the temporal organization (Large, 2008). Reference to tonal hierarchy interferes with rhythmo-metric perception by biasing the attention toward pitch (Prince et al., 2009). Their conflict indicates that users of non-Western music discriminate rhythm/meter better than users of Western tonality (which agrees with the observations of ethnomusicologists). This suggests that the frequency reference frame emerged later than the rhythmo-metric one.

Developmentally, acquisition of rhythmic hearing usually precedes melodic hearing (Shatkovsky, 1986). Infants seem to acquire rhythm-discrimination skills earlier than pitch-discrimination skills (Trehub and Hannon, 2006)<sup>27</sup>. The perceptual foundations of rhythm/meter are manifested just a few days after birth, as a part of developmentally crucial rhythmic interaction between infants and caregivers, occurring spontaneously and requiring little experience, which reflects its evolutionary importance for bonding (Trainor and Hannon, 2013). In verbal acquisition, rhythm too obtains semantic functionality earlier than prosodic contour (Shvachkin, 1948). According to the vast data collected through administration of early musical education in the USSR, rhythmic hearing lays the foundation for vocal musical skills, followed by learning to reproduce melodic contours (Kirnarskaya et al., 2003, 168–170). Impressions that not only can rhythm influence melodic perception by directing the attention to longer tones, but that melodic features also exert the reverse influence on rhythm, are based on a confusion of rhythm with meter (McAuley, 2010). Melodic intervals, contours, and "tonal accents" help to infer meter, but play no major role in identification of rhythmic values. On the contrary, judgments of melodic similarities are significantly affected by rhythm, especially in folk music (Eerola et al., 2001)<sup>28</sup>. Even for experienced Western musicians the distinction between

<sup>26</sup>Thus, newer paintings often covered the older ones: hiding the underlying image did not matter—once painted, an image was "brought to life," and stayed "alive," even if masked—just as a person who disappears from our sight does not die (Uspensky, 1995, 173–181).

<sup>27</sup>The earliest age when infants show the ability to recognize changes in pitch contour is 5 months (Chang and Trehub, 1977). The majority of studies demonstrate such capacities in older children, 6 months and up (Trainor and Hannon, 2013). The ability to recognize changes in rhythmic values of familiar music seems to emerge considerably earlier, at 2 months of age (Demany et al., 1977).

<sup>28</sup>Metricality, along with tonality, influences primarily the Western musicians: nonmusicians process melodic contours mostly according to the distribution of longer rhythmic values (Monahan et al., 1987). Non-trained listeners simply cannot ignore rhythm, as it governs their melodic recognition (Jones and Ralston, 1991). The majority of young and inexperienced listeners at first parse melody by rhythm and only then by pitch contour and mode (Halpern et al., 1998). Tempo/rhythm descriptors are much more prevalent in listeners' judgments of thematic similarity than descriptors of pitch contour (Addessi and Caterina, 2000; McAdams, 2004).

rhythms is more salient than the distinction between pitches (Monahan and Carterette, 1985)<sup>29</sup>.

Important Upper Paleolithic cultural proxies promote the abstraction of rhythm, not of melodic contour. Metric pulse is transposable from bipedal gait into such a common Paleolithic activity as stone-knapping. Each knapper prefers his own tempo and rhythm (Whittaker, 1994, 81), much like individual gait preferences (Whittle, 2007). Knappers' heartbeat provides a metric reference (Zubrow and Blake, 2006). Two knappers might have accidentally discovered the expressive capacity of rhythm through their entrainment, thereby forming the world's first musical instrument (Montagu, 2004). Group "musical" knapping was observed amongst Aboriginal women in Queensland (Duncan-Kemp, 1952, 27). Rock slides and gongs are drummed across the globe in rituals related to fertility cults (Fagg, 1997, 38). The ritualistic context provides a feeling of contentment or awe, abstractable into a semantic value for the knapping/grinding sound, turning its rhythm into a sign, and archeological evidence for collective stone-knapping is present in Neolithic sites at Sanganakallu-Kupgal, India (Boivin et al., 2007). Even earlier, stationary lithophones were drummed in Solutrean-Magdalenian caves (pecked rock surfaces were found in Africa) suggestive of the existence of portable lithophones (Blake, 2011). The weird-sounding cave echo might have prompted specific affective connotations (Cross and Watson, 2006).

Unlike rhythm, pitch directionality finds no proxies in the Paleolithic<sup>30</sup>. A set of meaningful pitch contours could have originated in verbal prosody, but paleolinguists connect the development of the fully phonemicized semantic languages to population growth after the Last Glacial Maximum (Robb, 1993). Deeply social, language is imperative for accumulation of knowledge, which depends on population density to avoid "bottlenecks" due to climate changes and extinctions. Cultural evolution stabilized only after 50 kya, most certainly because of the advancement of language (Klein, 2009). In all of prehistory, the transition to the Holocene stands out as the grand leap in innovation, called upon to sustain an ever-growing population (Richerson et al., 2009). Powell et al. (2009) developed a demic model to estimate the critical population density capable of sustaining the innovation growth to offset the innovation loss: for Europe it was 45 kya. Prior to 20 kya, prehistory consisted of a chain of major discontinuities in cultural transmission (d'Errico and Stringer, 2011). Technically, the archeological concept of "culture" applies only starting from the Neolithic (Probst, 1991, 227).

The first archeological symbolic "culture" of pan-European scale is the Gravettian, whose common trans-European traits are both socio-economic and spiritual, with regional differences confined to the material techno-complex (Kozłowski, 2015). The continent-wide cultural unity is evident in the omnipresence of "Gravettian Venuses" over most of Europe (Soffer et al., 2000)<sup>31</sup>. Denser population turns language from means of inter-group cooperation that compensates for local ecological deficits into a life-long ethnic marker, akin to the cranial configuration (Robb, 1993). Personal ornaments in Gravettian burials manifest a similar function of the "ethnic badge," differentiating age classes across the puberty threshold (Zilhão, 2014).

Social restructuring by ethnos and age hardly occurred without the involvement of music, closely affiliated with funeral and puberty rites. The Gravettian funerary practice strongly suggests the existence of burial rituals regulating the emotive interaction between the group's members, the dead, and the landscape as part of a greater ritual system, underpinned by cosmological beliefs (Pettitt, 2010). The remnant of such socio-eco-cosmological interconnection with TO, providing its semantic foundation, is the ancient doctrine of **ethos**<sup>32</sup> renowned in Hellenic civilization (Mathiesen, 1984), but certainly much older (Farmer, 1965) and geographically wider (Manuel and Blum, 2011). The roots of ethos must lie in the Gravettian trans-European spiritual unity.

## CONTRIBUTION OF MULTI-DIMENSIONAL AND MULTI-EMOTIVE SEMIOSIS TO THE EVOLUTION OF MUSIC

Human melodic universals remap animals' universals. Animal anger is characterized by descending contour, whereas animal appeasing is characterized by ascending contour. Music reverses the registers for happiness, sadness, fear, and anger from low to high. Why?

Music contributes to the conservation of knowledge by bonding social groups and incentivizing linguistic communication. This capacity came into play after the Younger Dryas (11 kya), when global warming enabled colonization of Eurasia. Widely dispersed populations created a few flexibly bounded

<sup>29</sup>Of course, the influence of rhythmic features on the judgment of melodic similarity is far from being simple and direct. Other factors, such as tempo and harmonization, can affect the extent of autonomy of temporal and frequency-related aspects of music (Prince, 2014).

<sup>30</sup>There are accounts of "tone-painting" where the contour of the hills is represented through the melodic contour in songs of indigenous hunter/gatherers of the Northern Hemisphere (Krushanov, 1987, 234) whose life style is comparable to that of Aurignacians. However, the idea of such representation was most probably inspired by the need for a mnemonic aid in long-distance navigation during migrations with reindeer herds, which is unlikely to have existed earlier than a few thousand years ago (see the last chapter). Such a tradition had chances to survive the ongoing extinctions in harsh climate only as a part of a reliable subsistence strategy for a fairly large population.

<sup>31</sup>Broad-scale technological clustering originated in the earlier Aurignacian tradition, attributed to the long-term influence of the ethnolinguistic variation. Forming of a continental culture during the Gravettian indicates the increased language contacts between different "clusters," establishing pan-European networks of informational exchange (Zilhão, 2014).

<sup>32</sup>The term "ethos" was coined in Archaic Greece, where it originally meant "custom," but by Classic times it obtained the meaning of a certain affective "character," associated with a particular musical melodic mode. "Ethos" embodied the consensus within a community as to which affective states would be generally "good" or "bad" for that community. The doctrine of "ethos" is closely related to the concept of "harmony of spheres," attributed by Hellenic sources to Pythagoras, who presumably learned it from Babylonians. The discussion of ethical value of this or that musical emotion and its suitability for astrological dispositions constituted an important part of public discourse in Ancient civilizations of Near and Far East, as well as Central Asia.

"social territories<sup>33</sup>," developing the dialect continuums by linkages among groups due to intermarriages during population shortfalls (Robb, 1993). Population growth and sedentism accompanied rapid neolithization, promoting ethnogenesis and thereafter fissioning language into language families as regional cultural differences cumulated (Robb, 1991). Such a line of development benefited from the social bonds established by music.

The absence of music-like particulate emotional communication must be one of the reasons why chimpanzees do not accumulate cultural traditions. Some chimpanzees acquire a culture of tools but due to the lack of transposability and abstraction cannot transmit it (Whiten, 2011). However, it is music, not language, that engages reproduction, transposability, and abstraction of idiomatic patterns of each of its AEs.

Human remapping of pitch encoding most probably originates from the continuous practice of:


content, projecting agitation (**Table 2**). AC's anger does not engage such interaction. It conserves a unifactorial timbral quality<sup>34</sup> (**Table 3**).

All AEs differ in musical expression of love (**Figure 2**) and anger (**Figure 3**), as evident in musograms<sup>35</sup> of indigenous Siberian songs that Russian theorists believe to represent the earliest forms of TO (Alekseyev, 1976, 1986; Brodsky, 1976; Zemtsovsky, 1983; Mazepus, 1993; Mazepus and Galitskaya, 1997; Novik, 1999; Zabolotskaya, 2009; Dobzhanskaya, 2011, 2016; Nikolsky, 2015b; Sheikin, 2017, 2002).

Unlike the expression of love, anger engages a wider ambitus, greater leaps, contrasting registers, harsh timbres, loudness, shorter and richer rhythms, reduced regularity and tonal stability, increased tempo fluctuations, staccato articulation, and thematic complexity (**Figures 2**, **3**). However, gorillas express anger differently: "call-motifs" always remain isolated and slow-paced, featuring neither a clear melodic contour (due to their enormous bandwidth) nor rhythm (**Figure 4**).

Whereas humans consciously manipulate numerous learned expressive parameters in music, animals instinctively "center" on a single biologically "hard-wired" parameter to reflect their emotional intensity. Human infants start their development at the same level where animal cubs start theirs, but quickly advance. Newborns employ just 2 vocalization types: negative and positive (Loewy, 1995). Cries of hunger, cold, and distress come first as biological reflexes (Zeskind, 1985). However, the similarity of an infant's supralaryngeal vocal tract to that of a primate cub does not stop the infant from trying to imitate his/her caretaker's vocalizations (Lieberman, 1985)<sup>36</sup>. Infant cries start varying in temporal and frequency characteristics as the infant ages (Papoušek and Papoušek, 1995). Loudness, timbre, register, attack speed, FM range, and harmonicity are progressively mastered as markers of different cry-types (Golub and Corwin, 1985). An infant builds a repertory of melodic contours assigned to specific situations and used as building blocks to inform the caretaker about his/her state and to receive a desired treatment (Wermke and Mende, 2009). Such ongoing two-ended communication lies at the heart of musicality (Trevarthen, 2019).

Call/cry-repertory building appears to be universal in human development (Wermke et al., 2007), very likely paralleling the phylogenetic evolution of music (Foster, 1994). Similarities between the structure and function of human and nonhuman vocalizations were discovered in crying, motherese, and babbling (Snowdon, 2003). Fluent switching from one cry-type to another, corroborated by the caretaker's response,

<sup>33</sup>Thus, Peter Bogucki counts as few as 14 Mesolithic "social territories"—i.e., regions differentiated by the material culture as manifested by archaeological evidence—spread out over the entirety of Western Europe during its transition from the Boreal to the Atlantic periods, c. 7500 kya (Bogucki, 1988, 41–46).

<sup>34</sup>It could be said that an animal "centers" (i.e., focuses) on a single aspect of vocal expression, conserving the extent of increase or decrease in intensity of the psychophysiological state that is associated with that vocal expression. This is yet another parallel between AC and the vocalization of a sensorimotor human infant. This is in contrast to the ability of an adult human to simultaneously conserve multiple dimensions of changes in multiple AEs in music.

<sup>35</sup>For a comprehensive analysis of those musical examples that were selected for musograms in **Figures 2**, **3**, **4**, and **7**, see Appendix 2 "A Comparative Structural Analysis of Musograms" in **Supplementary Material**.

<sup>36</sup>Similarity of the anatomy of the supralaryngeal vocal tract of the human baby and the ancestors of Homo sapiens provides yet another justification for seeking the TO model of hypothetical Paleolithic music in the musical babbling of 1–2-year-old infants.

FIGURE 2 | Characteristic patterns of AEs in expression of love in a Yakut traditional lyrical song "Sae Dyige" (may be auditioned at http://chirb.it/sNegG1). By Juslin's (2005) classification this song fits the "love" music category, in agreement with its lyrics, describing how a woman is anticipating visits of her multiple lovers (Alekseyev and Nikolayeva, 1981, 86). The musogram follows the same conventions as Figure 1, with minor additions due to the less definite use of pitch in the purely vocal music. Tones of low spectral periodicity (noisy or spoken-like) are represented by fuzzy strips in contrast to high periodicity, represented by rectangular bars. The number under each pitch displays its frequency value in bold, its duration in italic, and its maximal amplitude (the highest value of any of its spectral constituents) in regular font. The lyrics are given in the phonetic transcription. There are two contrasting motifs: "a," a sustained long anchor tone (tonic function), followed by rapid alternation of steps with rising intonation; and "b," two descending intonations, the first of which leaps to the alternative anchor (dominant function to mark a cadence), while the second steps down and then gently rises. These two motifs make up a call-like phrase that is regularly repeated. The song is characterized by a narrow ambitus (half-octave), mid-low register, high harmonicity, low complexity, moderate tempo (102 bpm) with little rubato (11%), diverse rhythm (usage of four rhythmic values), regular meter, overwhelming legato (97%), and scarce dynamic changes. For more detailed discussion, see Appendix 2 "A Comparative Structural Analysis of Musograms."

FIGURE 3 | Characteristic patterns of AEs in expression of anger in a song of the underworld virgin from the olonkho "Djiribina Djirilatta" (http://chirb.it/sCq02k). This excerpt from the traditional Yakut epic expresses anger of the evil sorcerer toward the heroine, challenging her to a fight (Alekseyev and Nikolayeva, 1981, 35). Structural descriptors of most aspects of this song fall in the category of "angry" music (Juslin, 2005). The acoustic markers of all AEs contrast with those in Figure 2. The ambitus is more than twice as wide. There are two registers instead of one (low singing and high "shouting"), both higher than in Figure 2. The share of well-pitched sounds in the overall duration of music is reduced by 34%. The share of staccato articulation is increased (by 142% in the duration of silence and 40% in the number of pauses). Tones are overall shorter and 50% more diverse in time values, with contrasts between rhythmic groups. The tempo contains abrupt switches, the fastest of which is 66% faster and 73% more variable (rubato) than in Figure 2. Intonations feature wide leaps, on average 70% wider than in Figure 2. Thematically, the music is more diverse and complex, using two contrasting materials, "A" and "B" (Figure 2 had only one). Timbre is harsh (a heightened larynx and intensified pressure).

prompts the cross-examination of the cries' acoustic parameters. The intensity of temporal expression usually matches pitch expression (frequent leaps require faster tempo to convey excitement and emergency—otherwise the caretaker is not "convinced" to respond urgently enough). Together, the projection of feedback and memorization/cross-relation of cry-types establish the acoustic oppositions between AEs of common musical emotions.

What diverts music from AC is the radical change in communication framework. Animals communicate "face-to-face" in situations that demand immediate action, which selects signals effective in expressing rapidly changing motivational states, with clear gradations in their intensity (Morton, 1977). Such signaling prioritizes ease of detection, speed of interpretation, signal's briefness, and a single salient gradient AE (Maynard-Smith, 1976). High redundancy and stereotypicity of selected signals often "fix" them (Simpson, 1997). This precludes combinability of AEs and calls, enabling "dishonest" calling.

Unlike animal calls, traditional indigenous music normally never "lies" (Nikolsky, 2016, Appendix III). A performer, as a rule, expresses emotions he actually feels—even when

multifactorial analytical method as human music reveals important differences in TO. The most noticeable is the complete absence of harmonious sounds with clear FF and legato articulation. The share of silence more than doubles: 43% (versus 17% in Figure 3). The form is simpler: no motifs conjoin into a phrase. Calls (voiced roar, non-voiced growl, and snort) remain detached except for a few instances of joining snort and growl together. The same disconnectedness characterizes all temporal AEs. The onset of each of the calls exposes a sort of an irregular pulse. However, the rate of this pulse is less than half that of the angry human music (Figure 3) and its deviation from a regular pulse is nearly twice as great, exceeding even the slow and flexible "loving music" (Figure 2). In essence, it would be accurate to characterize these vocalizations as rhythmically irregular, ametric, and undifferentiated in pitch. None of the calls generate a clear pitch contour due to their very broad band (up to 4.2 octaves). The calls' bandwidth was calculated by taking measurements of the frequency of that portion of the spectrum which stood out from the rest of the signal. Unlike music, gorilla's call-motifs do not break the ambitus into registers but timbrally recolor the entire ambitus for each of the calls, thereby increasing their separation.
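A bandwidth expressed in octaves, as in the caption above, follows from the standard convention that one octave is a doubling of frequency. The sketch below shows only that conversion; the function name and the example frequencies are hypothetical, and the way the lower and upper edges of the salient spectral portion were picked in the original analysis is not specified here.

```python
import math


def bandwidth_octaves(f_low_hz, f_high_hz):
    """Width of a frequency band in octaves: log2 of the ratio of its edges."""
    if f_low_hz <= 0 or f_high_hz < f_low_hz:
        raise ValueError("expected 0 < f_low_hz <= f_high_hz")
    return math.log2(f_high_hz / f_low_hz)


# Illustrative edges only: a band from 110 Hz to 880 Hz spans three doublings.
print(bandwidth_octaves(110.0, 880.0))
```

By the same formula, a band of 4.2 octaves corresponds to an upper edge roughly 18 times higher than the lower edge, which is why such calls cannot trace a clear pitch contour.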

impersonating an epic protagonist or a spirit, the singer becomes temporarily "possessed" by them (Novik, 2004, 272). "Putting on an act" is a prerogative of post-Renaissance Western classical performance tradition, and even there the performance canon demands "method-acting" to convince the audience of the realism of musical emotions (Nikolsky, 2015a)<sup>37</sup>. A non-western traditional song usually appears "westernized" to the indigenous audience when "acted out" formally (Zemtsovsky, 1983). Folk "cover-songs" necessarily engage the performer's "direct" rather than "indirect" or "scripted" speech (Zemtsovsky, 1979)<sup>38</sup>.

Insincerity and falsehood in musical expression did not present a critical issue prior to the 1760s (Charlton, 2009). They both attracted public discourse as a systemic aberration peculiar to a specific class of music (rather than a "defective" sample) only after the entertainment industry became institutionalized (Dahlhaus, 1989, 314). Rise of mass production made "emotional faking" a norm for commercial popular music, explicitly codified in Irving Berlin's composition standards (Suisman, 2009)<sup>39</sup>. So, music started as a decidedly "honest signal" (Levitin, 2009, 141–6) and only recently adopted "acting," albeit hardly enough to declare music fundamentally "dishonest<sup>40</sup>."

Jointly, multi-dimensionality of music and emotional contagion make lying difficult. Music always integrates listeners and performers, and this togetherness promotes sincerity. The particulate structure of musical semiosis effectively reveals dishonesty: at least some of AEs' insincere expressions are bound to contradict each other, prompting a resolving interpretation. But what in the cultural evolution could have spurred the inclination for aspect-matching?

<sup>37</sup>Demonstration of musical "method-acting" can be found in the video clip of Andrei Gavrilov performing Rachmaninov's Prelude in g minor, op.23, No.5 https://www.youtube.com/watch?v=T3AEfMMyH6A. Especially telling are the pianist's facial expressions as he is getting up from the piano bench after completing his performance: he continues to remain in his "role."

<sup>38</sup>Some indigenous traditions have developed professional forms of musical art which require aesthetic evaluation (e.g., Tatar, Kazakh, Mongolian). However, they still fundamentally differ from Western classical music by not taking a musical work as a "script" created by the composer for the performer to adhere to (Zemtsovsky and Kunanbayeva, 2011). Only the Western musician is trained as part of his occupation to accurately "execute" the composer's script while being aware of the fictiveness of its emotional content. However, application of such treatment to a folk cover song is most likely to come across as fundamentally "inauthentic" and detrimental to the song (Moore, 2011).

<sup>39</sup>Berlin's rules Nos. 3, 6, and 9 call for the composer to please the consumer at the cost of insincerity: "the ideas and lyrics must suit either a male or a female, so both sexes want to buy a song," "music and lyrics must have to do with things common to everyone," and, most explicitly—"songwriter must look upon the song as a mere business, not take music to heart." Berlin's rules break away from the Western composer's "canon," established since the introduction of "musica reservata" in the 16th century (Meier and Dittmer, 1956). For this reason, Berlin's approach provoked criticism of the American popular music in toto, seen by connoisseurs of art music as a "sweet lie" sold (for profit) to the mass audience to replace music that is "truthful" yet unpleasant in revealing "social truth" (Adorno, 1942).

<sup>40</sup>As far as I know, Trehub (2008) remains the only scholar who believes that music, in general, operates by having the performer emotionally deceive the audience. Other scholars who point out that a professional performer can evoke emotions that he/she does not actually feel, realize that this discrepancy is possible only in music that segregates the listener, the performer, and the composer. This solely happens in Western classical music. And even within this tradition "deceiving" the audience is still regarded as a fault to be avoided. Noteworthy, Trehub did not respond to Juslin and Västfjäll's (2008) objection to her criticism.

FIGURE 5 | Hybridization of characteristic patterns of ACs and human music in encouraging and prohibiting commands by human trainers to their dogs (McConnell and Baylis, 1985; McConnell, 1990, 1991, 2002; Miklosi, 2015). (A) Typical expression of tenderness in human music. This diagram extracts the key features of Table 2 and Figure 2: very few pitch-classes with a low rate of change within a narrow ambitus, wave-like melodic contours filled by stepwise motion in the low register, slow tempo, with long tones and tendency to decelerate, and regular meter yet rhythmic diversity. Articulation is mostly legato, with occasional pauses. Dynamics is soft, stressing the anchor tones. (B) Typical expression of anger in music (according to Table 2 and Figure 3): many pitch-classes with high rate of change and wide ambitus, ascending contours, and leaping zigzagging motion in high register. The tempo is fast, with short tones, often accelerating, with irregular pulse, and strong rhythmic contrasts. Dynamics is mostly loud, and accents fall on metrically weak tones. (C) Typical expression of appeasing disposition in primate vocalizations (Table 3). Many pitch-levels have a high rate of change, following a gradually ascending melodic contour within a relatively narrow ambitus. Tempo is fast, with short tones and long groupings. These features strongly contrast (A), whereas metric regularity, legato articulation, low registration, and soft dynamics resemble (A). (D) Typical expression of aggressive disposition in primate vocalizations (Table 3 and Figure 4). There are relatively few pitch changes due to an extremely broad bandwidth, precluding frequent leaping. Long tones are embedded in fast motion with a descending contour in low register. These features oppose (B), whereas meter, articulation, dynamics, and harmonicity resemble (B). (E) Typical expression of growing encouragement in fetch-whistles for dogs. 
This expression combines a tender disposition of a human (A) with the appeasing disposition of a dog (C). Therefore, fetch-command has to reconcile the contradictions between AEs' expressions of (A) and (C). To accomplish this, the ascending contour becomes steeper, each signal and the time interval between signals become shorter, the ambitus of each signal grows and reaches higher register, and the groupings grow in size (from 2 to 4). Temporal and pitch AEs are co-adjusted, merging traits from (A) and (C). (F) Typical expression of growing prohibition in stop-whistles for dogs. This expression combines the display of human displeasure, like (B), with the appeasing disposition of the dog (C), while structurally and semantically opposing (E). (F) subverts a single long tone to the contrasting gradual flections in pitch, where the descending portion receives the greatest significance. The increase in intensity of prohibition is signified by extending the time values and reducing the steepness of the descending curve—in contrast to (E). Dynamics provides yet another axis of opposition: loud for (E) versus soft for (F). Most importantly, the (E,F) opposition involves a compensatory interaction of the temporal, dynamic, and pitch patterns of AEs. Thus, whenever (F) is used in isolation, its softness, slowness, and ametricity might project the impression of passiveness—contrary to the categorical nature of a "stop" command. To avoid this, (F)'s melodic curve combines ascending and descending curves whose conflicting relation generates extra tension.

## DOMESTICATION OF ANIMALS SETS THE NEED TO MAKE TONAL ORGANIZATION SEMIOTICALLY FUNCTIONAL

The need to command domestic animals underlay the population explosion of both humans and livestock during the Neolithic Revolution. Animals benefited from human support, while humans benefited from animal produce. Both had to establish common patterns in their existing codes of vocal communication and adopt new patterns wherever the old patterns were deficient. Aspect-matching of pitch and rhythm was part of the "bi-specific translation" of human commands (**Figure 5**). Rhythm reflects the "motion" pattern characteristic of a given "emotion" (Amaya et al., 1996), while pitch reflects the exertion/effort required by such motion; together they define a "sound gesture" (de Götzen, 2004). Perception of pitch and rhythm relies on biological components common to mammals, thereby supporting heterospecific communication. There is fMRI evidence of shared emotional vocalization systems across species (Belin et al., 2008).

An account of pitch-rhythm interaction comes from dog training. A long, continuous low or descending pitch is universally used to stop a dog, whereas repetitions of short, rhythmic high tones are used to encourage it, which might constitute a mammalian generality (McConnell and Baylis, 1985). Dog trainers identify pitch contour, rhythm, repetition rate, and amplitude as the AEs effective in dog commands.

Stop/fetch opposition reflects a multi-dimensional compensatory interaction of pitch, rhythm, and dynamics, common to both humans and canines. Some of the animal acoustic "universals" became appropriated into this bi-specific communication, while others were overruled. Thus, across mammals, greater amplitude generally corresponds to a higher level of arousal (Briefer, 2012). However, only the fetch-command follows this rule, whereas the stop-command, on the contrary, adopts soft dynamics to subdue a dog (McConnell, 2002, 49–63). This overriding of the natural association between dominance and loudness highlights the fundamental difference between human and animal communications (Owren and Rendall, 2001):


Human-to-animal communication integrates both strategies:

• Humans address animals, treating them like humans, but perfect the encoding to secure the desired response. Thus, "doggerel" (Hirsh-Pasek and Treiman, 1982) constitutes a dog-directed adaptation of human motherese (Mitchell R. W., 2001).

Pitch contour is a primary AE for most human cultures. Melody is the only aspect that differentiates between the basic musical emotions completely on its own (**Table 2**)<sup>41</sup>. In ACs, pitch does not provide such differentiation (**Table 3**). Pitch's importance for music pushes human melodies higher in register. This is because low frequencies appear softer (Oxenham, 2013), making low contours less salient than high contours. The same applies to the hearing of primates and, possibly, other mammals (Stebbins and Moody, 2011). Domestic animals, too, should follow suit. This incentivizes humans to raise the contours characteristic of basic emotions above 1 kHz, where pitch changes are more salient. The only exception is the affection/love signals: intimacy requires close-distance communication, where the softness of low frequencies poses no problems.

Social animals share an affective signaling system with humans (Snowdon et al., 2015). This enables effective musical communication between humans and domestic animals, all of which are "social" (Stricklin, 2001). SFTO in all likelihood evolved gradually, following the schemata of human-to-dog communication. The earliest archeological evidence of domesticated dogs dates back to 15 kya (Larson et al., 2012), but signs of domestication were found in a Gravettian site at Předmostí (Germonpré et al., 2012). DNA analysis indicates that a dog-like 33-kya-old fossil from Altai is closer to modern dogs than to wolves (Druzhkova et al., 2013). Dog domestication must have been slow, preceded by feeding dogs with leftovers in exchange for their following humans and alerting them to approaching predators. Dogs are genetically adapted to digest starch, which constituted part of the human diet (Axelsson et al., 2013). A similar adaptation occurred in the dog's communication system: it adopted traits of human TO. Compared to wolves, dogs use more vocal signals, especially bark-based ones, and barks feature co-modulation of two expressive aspects, amplitude and rhythm (Simpson, 1997). Alerting and territorial barking both vary in intensity and rate depending on the distance of the dog from the conspecific or heterospecific intruder and the extent of the dog's arousal. At near distances barks become louder and more rapid. Such signaling and the manner of its modification most likely evolved in response to humans' selective pressure on dogs to bark territorially at strangers (Simpson, 1997).

Human-to-dog communication most likely prototyped communication to later domesticates: cows, sheep, and goats. The surviving Nordic tradition of kulning provides the gist of the Neolithic pastoral music-making.

## THE SCANDINAVIAN TRADITION OF KULNING AS A MODEL OF NEOLITHIC MUSICAL SEMIOSIS

Animal husbandry in Scandinavia started ≈1800 BC and reached its "golden age" by 1200 BC. This is when owning larger stocks became prestigious while climate warming enabled outdoor animal maintenance almost year-round (Tesch, 1992). However, winter grazing was hard on bushes and trees, depleting local resources. This, along with subsequent climate cooling, brought about a new housing style, designed to shelter animals together with humans for the winter, which came to characterize Scandinavian pastoralism (Armstrong Oma, 2013). Sharing the house with animals led to the acceptance of animals as household members, equal to humans, and categorically "clean": even animal dung was used to make wattle and daub walls. Sharing is known to increase bonding. Human dependence on milk products, and animals' dependence on humans' "room and board," promoted mutual trust and attraction (Armstrong Oma, 2010). From being "products," animals turned into "producers" of dairy. This brought about a psychological "revolution" in human-animal relationships, in which music acquired the leading role.

<sup>41</sup>Prevalence of an ascending contour characterizes happiness, anger, and fear. Happiness differs from anger and fear by employing a variety of melodic contours that serve to diversify the ascending contour. Anger differs from fear by using sharp rather than wave-like contours and by the dominance of staccato articulation in pitch changes (fear mixes staccato and legato articulations). Prevalence of a descending contour characterizes both sadness and love. They can be distinguished solely by intonation: flattened with stepwise falling contours for sadness, and sharpened with occasional ascending leaps for love.

Milking required concordance. An irritated animal or milkmaid reduced the milk yield, and with it human nutrition. Humans had to maintain mutual affection with their animals, as is evident in the taboos on swearing or screaming at cattle that are widespread across Eurasia (Plotnikova, 1999b). Music ritualized and fortified this union across different cultures (Shevtsov, 1988; Wallin, 1991; Alekseyev, 1995; Ivarsdotter, 1995, 2004; Novik, 1999; Dorina, 2004; Dissanayake, 2005; Kolltveit, 2008; Cheng, 2009; Yoon, 2018), as is especially evident in the surviving traditions of milking songs (Nielsen, 1997; Pegg, 2001; Gioia, 2006b), animal lullabies (Kondratyeva, 1989; Kyrgys, 2002; Tchotchkina, 2003; Kan-ool, 2012), and spells (Kondratyeva, 1996; Kyrgys, 2002; Bordzhanova, 2007; Sodgerel, 2012, 2016; Tiukhteneva, 2017), all of which share the union of musicality and love/care that characterizes human motherese (Trevarthen, 2019).

Principal traits of such music can be extracted from the current practice of Scandinavian herders' music-making. Its chief task is to control the behavior of the grazing livestock during the warm seasons at distant pastures (Ivarsdotter, 2004). The herder aims to influence the animals' emotional state over a range of distances, up to a few kilometers. Long-distance transmission requires a special vocal technique and musical instruments. The same musical signals convey different information to livestock and humans: they command the animals while informing the animal owners at the farmstead of their animals' wellbeing. This dual communication has taken shape within a transhumance system known as shieling in England (Cheape, 1996) and fäbod in Scandinavia (Svensson, 2015), which emerged during the late Bronze Age in response to the scarcity of local winter fodder (Tesch, 1992). In Sweden, the shieling standard was set in Dalarna, and the alternative local traditions are considered its variations (Svensson, 2015). Traces of shieling are spotted across Europe, from the Hebrides to the Carpathians, becoming widespread by the Iron Age (Cheape, 1996). In Norway, the earliest fossil fields of lynchets show signs of cultivation during the late Bronze Age (Skrede, 2005), confirmed by palaeobotanic and archeological dating (Kvamme, 1988).

Shieling is characterized by seasonal migration to a summer station where herders spend their daytime supervising animals and preparing fodder for the coming winter, and their evenings producing dairy (Cabouret, 1984). Since milking, butter-making, and cheese-making traditionally constituted women's work, shieling and its music became female prerogatives in Scandinavia. There, milking could dishonor a man, and shieling was managed exclusively by young women (Svensson, 2015). In Ireland, shieling was a family business, whereas in Spain, France, and Switzerland dairy work and herding were conducted by men.

The gender difference undoubtedly played a role in shaping the European pastoral musical traditions. The Scandinavian, Icelandic, Alpine, Jura, Pyrenean, Apennine, Sardinian, Balkan, Turkish, and Caucasian mountains have sheltered singing styles that originated in the herding culture and shared a peculiar singing technique based on forceful, high-laryngeal, falsetto-like sound production (Wallin, 1991, 510). Wallin (pp. 511–23) summarizes the archeological, anthropometric, and genetic research to support the ethnographic findings of Carl-Allan Moberg (1971). Moberg outlines the core traits of the archaic Fäbodväsendet music: "head-voice" vocal technique, the utilitarian function of long-distance signaling, and ideological roots in pagan magic.

The centerpiece of the Fäbodväsendet tradition is its "maximal-distance" style, "kula," which I distinguish from "kulning," an umbrella term for the entire Fäbodväsendet<sup>42</sup>. Local names for kulning (e.g., lockrop) imply the alluring of animals by the magic properties of sound in order to suggest certain behavior to the herd and to avert evil trolls and predators, following the shamanic tradition of maiden singing (Mitchell R. W., 2001). In Swedish mythology, forest spirits possessed their own cattle, and herdswomen (kulerska) learned kulning from skogsrå, "sirens of the woods" (Johnson, 1990). The suggestive power of kulning was deemed so high that women lived in fäbods alone without any weapons. Folk beliefs attributed this power to beauty. Indeed, the well-ornamented high "warbling" register of a distant female voice made men and women pause their work and enjoy the sounds (Ivarsdotter, 1986). For humans, kula clearly presented an aesthetic object despite bearing the utilitarian status of "non-music" (Frödin, 1929)<sup>43</sup>. For animals, kula constituted a "safety call." Both attitudes focus on positive rather than negative emotions, not only to keep the cattle under human control and prevent panic, but also to boost the kulerska's confidence and alertness (Wallin, 1991, 420)<sup>44</sup>. SFTO must have emerged as a set of sonic attributes, perception of which was directly "wired" to reward circuits in the brains of humans and domestic animals.

<sup>42</sup>There is a wealth of terms used in Scandinavian countries to refer to herding vocalizations (Rosenberg, 2003, 8). Although the term "kulning" (kolning) is most commonly used in English in relation to the special technique of long-distance vocal calling, I follow Wallin (1991, 387) in reserving the term "kula" (he uses the alternative spelling "kola"), which in Swedish means "to make a distant call," exclusively for long-distance communication. This is necessary because long-distance "kula" calls are routinely inserted in mid-distance and close-distance vocalizations, while it is the long-distance "kula" style that distinguishes shieling vocalizations from other forms of traditional Scandinavian music.

<sup>43</sup>It should be noted that the peculiar status of pastoral music as a form of heterospecific communication is responsible for the emic views on kulning as **non-music**. This is yet another confirmation of the need for the etic approach. Across Eurasia, herder-made music is distinguished from "normal" music as a form of "magic." The profession of the herder is traditionally associated with sorcery: herders are believed to sign a contract with the evil forest spirits, receiving magic power for vocal and instrumental music-making in exchange for not using their gifts publicly, under the threat of death (Plotnikova, 1999b). At the eastern end of Eurasia, in Altai, supernatural beliefs are even stronger, reserved not only for professional herders (chabans) but for all livestock-owners who use pastoral spells (Kondratyeva, 1996). All vocalizations of this type are considered non-music, to the extent that informants perceive any request to "sing" a spell as being ridiculous.

<sup>44</sup>Notably, despite a 16-hour-long workday and the insecurity of living alone without any weapons, shieling jobs were always highly sought after, since women remained in charge of their summer life and enjoyed freedom unavailable to them at the farmstead (Rosenberg, 2014).

Wallin (1991, 420) rightfully stresses that matriarchy influenced early pastoralism: "the maternal instinct and care" instilled the social holding of attachment to stabilize and reinforce the animal-human affiliation. Distinctively female, the Fäbod tradition must have prehistoric roots (Johnson, 1990). Motherese undoubtedly prototyped close-range kulning. Animal-directed vocalizations acoustically and functionally resemble lullabies by commanding calmness/happiness, not just in Sweden (Wallin, 1991, 392) but also on the other side of Eurasia, in Altai (Kondratyeva, 1996). Common traits include prolonged singing, formulaic regularity, vocables, smooth contours, motherese-talking, and caressing (Tiukhteneva, 2017). In animistic societies, both infant-lulling (Kondratyeva, 1989; Farber, 1990; Tchotchkina, 2003; Gioia, 2006a; Milne, 2017; Garroway, 2019) and domestication rites for newborn cattle (Aksyonov, 1964; Johnson, 1990; Kondratyeva, 1996; Plotnikova, 1999b; Kan-ool, 2012; Tiukhteneva, 2017) are associated with magic, achievable by female "charms."

Similar to lullabies are milking songs (Nielsen, 1997), used across Eurasia, from Scotland to Mongolia (Gioia, 2006b, 71). Remarkably, when milking, Mongolian herdsmen switch to motherese-like "musical talk," based on animal onomatopoeia (Yoon, 2018). Known cases of male pastoral calling engage falsetto to imitate the female model (Uttman, 2002). Similarly, in the surviving pastoral traditions of Altai, lulling is reserved for women and requires throat-singing if sung by men (Tiukhteneva, 2017). Pastoral spells in the Altaic tradition constitute a female prerogative but are occasionally performed by men (Kondratyeva and Kopytov, 2017), engaging throat-singing (Kyrgys, 2002, 64). Like falsetto, throat-singing emphasizes harmonics that make melodies appear registrally higher (closer to the female range) and that, like female kula, resemble pure tones.

The same applies to whistling signals, used across Eurasia by herdsmen to stimulate and/or safeguard animals (Levin and Suzukei, 2006, 134–40). Just like kulning, whistling in pastoral societies is associated with sorcery (Plotnikova, 1999a) and is thoroughly regulated by taboos (Dzenzelevskii, 1984). Acoustically, whistling comes closest to "kula" in distance range, loudness, and tonal quality (Eklund and Mcallister, 2015). To command their animals, Altaic herdsmen produce whistles audible over 4–5 km, and throat-singing audible over 3 km (Pegg, 2001, 236). Curiously, the female "head voice" required by kula is called "whistle register" (Sundberg, 1987, 50). And xöömii (throat-singing) is considered a form of whistling in Mongolia (Pegg, 1992).

Wallin (1991, 523) sees shieling music as part of the prehistoric expansion of a novel herding culture northwest of Anatolia/Balkan/Caucasus toward Iceland, with its base in Jamtland (**Figure 6**). Jamtland's "forest barrow" marked the end of tundra after the glaciers' retreat, attracting hunters and supporting a mixed pastoral economy that survived at the coldest outskirt of Europe practically unchanged until the late Middle Ages. Geographic and chronological distribution of cattle-herding across Europe, quite well-studied, provides timing references for Wallin's model. The outcome of this geomusicological<sup>45</sup> correlation is presented in **Figure 6**.

Domesticated cattle spread East-to-West along the Mediterranean coastline, encapsulating most of the "yodeling" territories ≈6000 BC. The South-to-North expansion took much longer: Central Sweden became pastoralized in the 2nd millennium BC. Dissemination of cattle and Indo-European languages went hand in hand. The Indo-European language family covers most of Europe, except for the Finno-Ugric languages of Fennoscandia and Russia. Another notable exception is Turkey, whose Indo-European languages (Hittite, Luwian, Palaic, Lydian) died out during Antiquity. Formation of each new Indo-European language seems to have followed the adoption of husbandry. The yodeling areas correspond to the earlier stages in the expansion of the Indo-European languages, conserved by the mountain systems: Taurus, Pontic, and Armenian Highland in Turkey, the neighboring Caucasus, Balkan, and more remote Carpathian, Alps, Jura, Apennine, Sardinian, Corsican, and Pyrenean. The dissemination routes either curve around the mountains or cross them by riverbeds. The oldest routes ran by the Mediterranean coastline along the 40°N latitude, supporting the conclusion of Diamond and Bellwood (2003) that the domesticates and languages spread faster along the East-West axis than along the South-North axis. This explains the divergence of the pastoral music tradition into two types: Southern yodeling versus Nordic kulning and kulning-likes<sup>46</sup>, distinguished by different bovine genomes. Studies of Y-chromosomal variation have identified two primary taurine haplogroups in Europe, split in two homogenous regions alongside cultural, historic, religious, and linguistic boundaries between the pied or red cows of the Nordic and Baltic/Slavic lands, on the one hand, and the spotted yellow or brown breeds of Switzerland and southern territories, on the other hand (Edwards et al., 2011).

Kulning and yodel form respectively Northern and Southern "dialects" of a cattle-directed "language," a satellite of the proto-Indo-European. The main role in the Indo-European "domestication package" belonged to cattle, the largest meat- and milk-source of all domesticates. The emergence of cattle-related mythology reflects the importance of cattle and explains the sudden proliferation of cattle burials across Northern Europe ≈3000 BC (Sjögren and Price, 2013)<sup>47</sup>. Symbolic elevation of cattle could characterize the entire Neolithic "revolution" in Eurasia, more noticeable in Scandinavia, where ox symbolism replaced red-deer symbolism after ox overtook deer as the most important food source (Tilley, 1996, 183–4). If wild deer opposed the human sphere as a utilitarian object of desire, domesticated ox was included into the human sphere as the emotional object of desire. And music is indispensable in supporting emotionality.

<sup>45</sup>The scope and the method of geomusicology were introduced by George Carney (Nash and Carney, 1996). Izaly Zemtsovsky formulated an analogous approach in his proposal to establish a new discipline of ethnogeomusicology (Zemtsovsky, 2005).

<sup>46</sup>Thus, Finnish "ringing calls" present a form of vocalization that acoustically and culturally resembles Swedish kulning while featuring a few unique traits (Uttman, 2002). Occasionally, ringing calls are performed by men (falsetto), utilize a peculiar lip technique (generating the "phui"-like tonal quality), and engage "darker" vowels.

<sup>47</sup>Cattle definitely carried special symbolic significance in Neolithic England (Ray and Thomas, 2003). Cattle received the same funeral treatment as humans in Danube winter burials as part of the Sun cult which thrived throughout the 4th millennium BC, probably because of drastic swings in solar activity (Horváth, 2012). The second millennium BC Linear-B tablets from Knossos testify that, unlike sheep/goats, cattle were given names, bestowed with individuality, and associated with royalty and sacrificial rites (McInerney, 2010, 50–53).

shieling pastoralism, dark green marks the "core" Fäbod regions, and crème marks the area where yodel-like vocalizations survived within pastoral cultures (Moberg, 1955, 1971; Baumann, 1976; Leuthold, 1981; Ivarsdotter, 1986; Wallin, 1991; Mitchell S. A., 2001; Uttman, 2002; Plantenga, 2004). The origin of the latter can be dated by the timeline of the spread of domesticates over Europe, which is well studied. Animal icons show the approximate place and time of origin of domesticated cow, goat, sheep, and pig, based on available archeological data (Zeder, 2008; Driscoll et al., 2009; Peters et al., 2017). Color-filled thick arrows show the timeline and main routes of dissemination of domesticated cattle during the Neolithic and early Bronze Age according to the archeological and genetic data (Caramelli, 2006; Lõugas et al., 2007; Zeder, 2008; Rowley-Conwy, 2011, 2013; Tresset and Vigne, 2011; Bläuer and Kantanen, 2013; Marciniak, 2013; Saña, 2013; Schulting, 2013; Sjögren and Price, 2013; Berthon, 2014; Cramp et al., 2014; Felius et al., 2014; Sørensen and Karg, 2014). The darker the arrow's color, the older the date. The double-dotted black line approximates the border between the Northern and Southern European bovine genetic funds. Colored ovals and outlined arrows indicate the hypothetical origin and the spread of Indo-European languages according to the computational methods, based on Bayesian logic and phylogenetic analysis algorithms (Diamond and Bellwood, 2003; Gray and Atkinson, 2003; Atkinson et al., 2005; Atkinson and Gray, 2006; Bellwood, 2008; Gray et al., 2011; Anthony and Ringe, 2015; Chang et al., 2015; Heggarty, 2015). The brown oval marks the area of genesis of Proto-Indo-European language according to the "Anatolian hypothesis" (Renfrew, 1987), whereas the orange oval marks it according to the earlier "steppe hypothesis" (Gimbutas, 1993; Anthony, 1995).
The dashed outlined arrows show the earliest stages of dissemination of the Indo-European languages from the Yamnaya epicenter. Both hypotheses generally agree in defining the later stages (Gray et al., 2011), represented by solid outlined arrows.

Divinization of music (Franklin, 2006) and ox (Campbell, 2017), so prominent in Indo-European tradition, could have a single origin in Indo-Iranian lands, bound to the concept of non-violence (Tull, 1996). Cattle sacrifice is depicted in prehistoric Sujanpura petroglyphs (Brooks and Wakankar, 1976). The ritual use of burnt cow dung is still common in Hinduism, traceable to the 3000 BC Ashmounds (Boivin, 2004). The Shiva-bull affiliation is evident in the Bronze Age Harappan "Proto-Shiva" (Hiltebeitel, 2011). Harappan symbolism clearly elevates cattle over other domesticates, evident in the buffalo figurine amulets and seals that are likely to assimilate the west-bound Indo-Iranian cult of Mother Goddess, eventually forming the "Sacred Cow" concept (Lodrick, 2005). This corresponds to veal and cow-milk becoming primary foods during Rigvedic and Vedic times: there were people at that time who lived on milk alone (Prakash, 1961, 12). Milk products were used in rituals and offerings to gods, certainly accompanied by music, promoting the transformation of the cow into the symbol of femininity and fecundity in Vedic literature (Brown, 1964). Consecration of the cow gave it purity: even its urine and dung were used for healing and cleansing (Korom, 2000).

The cultural context of kulning and the tradition of home-sharing with cattle strongly resemble the Vedic cultural blend of non-violent femininity, cow-worship, and magic. It is not accidental that kula finds a nearly perfect match in Tibetan traditional pastoral songs with long, rhythmically free undulating phrases, an extremely tense timbre of high quasi-falsetto voice, generous ornamentation, and ongoing variation (Stuart, 2008, XXIV). This is the most ancient of the three major forms of Tibetan music, peculiar to a nomadic pastoral culture, and originating from cattle calls (Crossley-Holland, 1967). Like kulning, it incorporates parlando and recitative for close-distance vocalization to animals, and also includes milking songs (Plantenga, 2004, 113).

Introduction of milk revolutionized the Neolithic lifestyle, supporting the psychological revolution in human-animal relations and bi-specific musical communication, especially in Northern Europe, where milk quickly replaced fish as the main food, as manifested by the widespread adoption of milk-storing pottery (Cramp et al., 2014). The archeological evidence agrees with the genetic evidence of the time of emergence of lactase persistence<sup>48</sup>. Lactase persistence reflects the adaptation to diet (Hancock et al., 2010), without which adults have lactose intolerance and nutritional loss (Campbell et al., 2005). Ill effects of malnutrition coexisted with milk-bound diseases during the adoption of the milk-based diet. Mycobacterium tuberculosis existed 40,000 years ago, but became pathological for humans only from 6200–5500 BC onward (Hershkovitz et al., 2015), by the time when the spread of husbandry reached Central Europe. Seemingly "the same" milk could either kill or nurture life, which must have promoted new supernatural beliefs and rituals to "exorcize" milk production in replacement of the earlier hunter/gatherer rituals. Music, so common for religious applications, most certainly supported this reform.

For Europe, the geographic distribution of Indo-European languages<sup>49</sup> (Heggarty, 2015) goes hand in hand with the distribution of taurine mtDNA that descends from the Fertile Crescent (Caramelli, 2006). And the subdivision of the bovine European genetic pool into Northern/Southern genotypes (Edwards et al., 2011) matches the distribution pattern of lactase persistence: 40% of adults in Greece versus 90% in Scandinavia/England (Curry, 2013). Those populations that consumed more dairy have a higher occurrence of lactase persistence (Bersaglieri et al., 2004). Evidently, milk dependence was more than twice as high in the North. The Indo-European expansion occurred through the farmers' immigration and interaction with local foragers rather than by technological import alone (Rowley-Conwy, 2011). Greater lactase persistence in the North reflects dairy's effectiveness in providing nutrients, the convenience of its storage in a cold climate, the insurance against bad harvests (Gerbault et al., 2013), and the health benefits of increased vitamin D consumption in low-sunlight conditions (Flatz and Rotthauwe, 1973).

Kulning emerged to nourish the symbiotic co-dependence of humans and cattle in harsh Nordic conditions that demanded stronger bonding than those of more diverse pastoral economies of Southern yodel territories, therefore employing a **female pastoral model**.

The biggest contender for the Indo-European language family in Northern Europe, the Uralic family (Diamond and Bellwood, 2003), relates to another domesticate: the reindeer. Reindeer hunting was essential for the colonization of the Eurasian Arctic/Subarctic (Gordon, 2003). However, reindeer domestication still remains in its early phase (Reimers and Colman, 2009). The distinction between reindeer-hunting and reindeer-herding remains vague: even reindeer owners often do not know if a particular reindeer is "wild" or "domestic" (Ventsel, 2006)<sup>50</sup>. Leading fences and corrals have been used for hunting wild reindeer and only recently have they become "domestic" accessories (Aronsson, 1991). Reindeer pastoralism emerged gradually from taming individual reindeer for transportation and decoy-hunting, compensating for the depletion of the wild reindeer population (Vorren, 1973) that occurred during the 13–16th centuries (Hansen and Olsen, 2014, 175)<sup>51</sup>. Reindeer domestication must have started in parallel with cattle domestication in Norway/Sweden but lingered into the Middle Ages, absorbing cultural traits of human-to-cattle communication.

The principal psychological trait of kulning is the "humanization" and child-like patronizing of cattle. Similar attitude characterizes reindeer pastoralism: animal is treated like a family member whose life is valued and its attitudes are respected (Ingold, 1986). Kulning, yodel, and reindeercommunication should all be regarded as various "**languages of domestication**," generated by borrowing "acoustic traps and snares"—i.e., onomatopoeic decoy calls—from hunters

<sup>48</sup>Lactase persistence was completely absent in early Neolithic population 5500 BC (Burger et al., 2007), making its first appearance in Scandinavia in 3400 BC (Malmstrom et al., 2010 ¨ ), by 3000 BC in Iberia (Plantinga et al., 2012) and taking over Europe thereafter (Marciniak and Perry, 2017). This timeframe agrees with the scenario represented in **Figure 6**.

<sup>49</sup>The Indo-European family contains 144 languages divided amongst 11 distinct branches—with even more languages most certainly having existed in the past but gone extinct (Diamond and Bellwood, 2003). In Europe, non-Indo-European languages are limited to merely 11 documented languages (only 8% of the total number of languages): Etruscan, Basque, Iberian, Tartessian, Estonian, Finnish, Urartian, Sumerian, Hurrian, Hattic, and Mitannian—plus 3 undocumented languages: Pictish, Lepontic, and Ligurian (Robb, 1993).

<sup>50</sup>Herders routinely let their reindeer graze unsupervised for rather extended periods of time. Inevitably, many animals become lost, turn wild, and can then be hunted (Stépanoff et al., 2017). Also, the herder's strategy of searching for his lost animals strikingly resembles that of hunting.

<sup>51</sup>This caused the import of non-native reindeer via the emerging Russo-Finno-Scandinavian markets and the transition to pastoralism (Røed et al., 2018). Genetic evidence points to 3 epicenters of reindeer domestication: Fennoscandia, Western and Eastern Russia (Røed et al., 2008). Reindeer domestication took about 6000 years. Its earliest evidence comes from the 4000 BC petroglyphs (Helskog, 2012), a 1510–1130 BC burial (Murashkin et al., 2016), and the paleolinguistic tracking of words for reindeer that date back to 1500–1000 BC (Aikio, 2006).

and syntactically reorganizing them into "animal-directed" words to control the herd, its leader, and the individual animals (Alekseyev, 1995).

Kulning and yodel are Indo-European musical "cow-languages," later adapted for goats/sheep as they became personalized like cows<sup>52</sup>, whereas reindeer-vocalizations make a Finno-Ugric "reindeer-language."

Kulning's SFTO was forged by long-distance delivery of the desired subharmonic structure. Kula is characterized by dynamic maximization (80–100 dB SPL at 50 cm)<sup>53</sup> while fixing 4 formants at FF, 1,700, 3,000, and 4,000 Hz throughout all frequency changes, restraining vibrato, and raising the larynx above the resting position (Johnson et al., 1982). Elevating the larynx by up to 4 cm increases the sub-glottal pressure tenfold as compared to talking (Ivarsdotter, 1986). Somehow, this causes no distortions, and kula's "harmonic signature" remains virtually unchanged at close and mid distances (1–11 m), in contrast to the "classic" falsetto (Eklund and McAllister, 2015). Harmonic conservation is still observable at 22 m in kulning, albeit varying between different performers (Eklund et al., 2019). Evidently, kula is designed to transmit the kulerska's harmonic and melodic "signatures" to the herd at distances common in herding (Rosenberg, 2014).
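The role of source level in reaching the herd can be illustrated with a rough free-field estimate. The sketch below is not taken from the cited studies: it assumes idealized spherical spreading from a point source (a loss of about 6 dB per doubling of distance) and ignores atmospheric absorption, wind, terrain, and vegetation, all of which matter outdoors. The function name `spl_at_distance` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def spl_at_distance(spl_ref_db: float, r_ref_m: float, r_m: float) -> float:
    """Estimate SPL (dB) at distance r_m, given SPL measured at r_ref_m.

    Assumption: ideal spherical spreading in a free field, i.e.,
    L(r) = L_ref - 20 * log10(r / r_ref). Real outdoor propagation
    loses additional energy at long range.
    """
    if r_ref_m <= 0 or r_m <= 0:
        raise ValueError("distances must be positive")
    return spl_ref_db - 20.0 * math.log10(r_m / r_ref_m)


if __name__ == "__main__":
    # 100 dB SPL at 50 cm is the upper figure quoted in the text for kula.
    for distance in (1.0, 11.0, 22.0, 1000.0):
        level = spl_at_distance(100.0, 0.5, distance)
        print(f"{distance:7.1f} m: {level:5.1f} dB SPL")
```

Under these assumptions, a call measured at 100 dB at 50 cm drops by roughly 66 dB over 1 km, which gives a sense of why dynamic maximization matters for long-distance calling.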

Long-distance spectral optimization is known in intergroup communication of some primates (Waser and Waser, 1977). However, optimization to preserve subharmonic structures is unique to kula.

Kula's sounds are supposed to stand out in the environmental soundstage by featuring an unnaturally hyper-periodic, noise-free spectrum. Kula's harmonicity aligns with "**pleasantness**" following the cow-bell paradigm. Animal bells were used in Scandinavia at least from the 1st–4th centuries (possibly, from the beginning of the Bronze Age) to repel evil spirits, mark a human-controlled territory, and decorate the herd's leading animal (Kolltveit, 2008). For cattle, the bell signified human control, the herd-leader's authority, and safety. Humans associated bells with nature, peacefulness, goodness, and protection, employing bells to "borrow" the land from the forest spirits (Emsheimer, 1991, 43). Therefore, overall harmonicity signifies strongly positive values, in line with kulning's perceived beauty and safety/care. Across the animal world, too, harmonicity (pure-tonedness) and inharmonicity are meaningful along the friendliness/fear opposition (Morton, 1977).

Long-distance transmission requires high intensity and register. For 1 km, the most effective transmission occurs at ≈2 kHz (= C7) (Graf, 1980; Gray and Atkinson, 2003)—the range of a piccolo flute. Perhaps, whistling prototyped kula. Whistles are common in communication with dogs and the herd. And whistles exceed calling and yodeling in long-distance intelligibility: correct identification of whistles at 170 m distance is 95% versus 58% for yodeling and 70% for calling (Titze et al., 2018). Bi-factorial changes of rhythm/pitch-contour in whistling signals would pave the road for tri-factorial changes of rhythm/pitch/phrase-length in kula.
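The pitch equivalence quoted above, and the cent values used for pitch discrepancies in **Figure 7**, can be checked with a short calculation. The sketch assumes 12-tone equal temperament with A4 = 440 Hz and MIDI note numbering (C7 = 96); neither convention is stated in the text, and the function names are illustrative.

```python
import math

A4_HZ = 440.0   # assumed tuning reference
A4_MIDI = 69    # MIDI note number of A4


def midi_to_hz(midi_note: int) -> float:
    """Frequency of an equal-tempered note, given its MIDI number."""
    return A4_HZ * 2.0 ** ((midi_note - A4_MIDI) / 12.0)


def cents_between(f_low_hz: float, f_high_hz: float) -> float:
    """Interval from f_low_hz up to f_high_hz in cents (1200 per octave)."""
    return 1200.0 * math.log2(f_high_hz / f_low_hz)


if __name__ == "__main__":
    c7 = midi_to_hz(96)  # C7 in MIDI numbering
    print(f"C7 = {c7:.1f} Hz")
    print(f"2 kHz to C7: {cents_between(2000.0, c7):.0f} cents")
```

With these conventions, C7 comes out at about 2,093 Hz, somewhat less than a semitone above 2 kHz, so the "≈2 kHz (= C7)" equivalence holds as an approximation.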

Long-distance communication eliminates facial expressions and gestures from semiosis, making it rely exclusively on acoustic attributes and demanding long-term memory (Wallin, 1991, 390). Exclusion of visual cues promotes the prolongation of a musical expression to facilitate its recognition and memorization. Therefore, phrase length reflects the distance: longer distances require longer phrases (p. 391). Changes in distance generate musical syntax (**Figure 7**).

Close distance promotes short phrases of multi-registral motherese-like recitative where only the "reciting tones" are pitched, and exaggerated leaps employ legato and portamento (**Figure 7B**). Pitches tend toward monotony in the low register at phrasal ends, which generates tonicity. Vocalizations are mostly stimulating and diverse in their referential/propositional content.

Middle distance makes motherese inaudible, instead requiring a different approach. Vocalizations become euphonized: engaging "parlando" rather than recitative<sup>54</sup>, "smoothening" the leaps, increasing the share of pitched tones, and stressing rhythmic patterning and ordering. The calming effect of these adjustments, inappropriate for stimulating applications that are mostly common for mid-distance communication, is compensated by intensifying dynamics, structural contrasts, and staccato articulation (**Figure 7A**). Notwithstanding diversification, the highest-register "peak-tones" at motivic beginnings are often monotonous, prototyping the musical "leading-tone" by requiring some sort of continuation (as in a melodic resolution).

Longer distance further increases the share of musicality and pleasantness in herding vocalizations. They prioritize audition over visualization by engaging "call-phrases," made of exclamatory imperatives and summoning, free from referential/propositional context (Wallin, 1991, 417). Verbalized vocalization is replaced by a wordless kula (p. 410). Simple phrase-sentences consist of motif chains akin to incipits, climaxes, and cadences of Gregorian tunes (Helmer,

<sup>52</sup>Ivarsdotter describes how goat-calling follows the model of cow-calling, adapting it to the livelier nature of goats, notorious for their proneness to naughtiness (Ivarsdotter, 2004). Similarity of cow-calling (Kolock), goat-calling (Getlock), and sheep-calling (Fårlock) is obvious from listening to their archive recordings published by Swedish radio (Ivarsdotter, 1995). The same similarity is retained in pastoral incantations and spells that survive in Altai region: all three types of calling differ primarily in the prevalence of different phonemes for each of these three animals (Tiukhteneva, 2017). The musical characteristics of all three types of calling closely resemble one another (Kondratyeva and Mazepus, 1999). This suggests that similarity between cow, goat, and sheep pastoral communication is a wide-spread Eurasian phenomenon.

<sup>53</sup>The highest SPL level is reached at a 30 cm distance from the sound source (125 dB) which exceeds the ear's pain threshold at 120 dB (Rosenberg, 2014). The average SPL of kula at 1000 Hz is 113 dB. This is dynamically comparable to an operatic soprano singing fortissimo, except that the soprano's technique requires maintaining a fixed larynx configuration at a low position. However, the maximal SPL of the soprano does not exceed 90 dB near the lips and does not change much from modulating the pitch (Johnson, 1984).

<sup>54</sup>The term "parlando" was adopted by Anna Johnson in her report (Johnson, 1979), even though the term traditionally refers to Western operatic singing that imitates speech and engages speaking "voice registers" (Sicoli, 2015), and even though the kulerska has no such intention. Sung-out words of close-distance kulning surprisingly resemble the operatic "parlando" sound. The kulning parlando contrasts with the recitative kulning, which minimizes voicing and remains much closer to talking than to singing, especially in its dynamics. The opposition of kulning parlando and kulning recitative resembles the opposition of operatic parlando and secco recitative, on the one hand, and the genre of melodrama that became popular in Western classical music in the 19th century, on the other hand.

FIGURE 7 | Patterns of TO in four main types of vocalization in the vocal tradition of kulning. Since kulning is essentially ametric and averbal (except for the closest range recitative), its analytic charts omit lyrics. Unlike the previous figures, the vertical dash lines indicate the onset of motifs. The colored arc-line symbol represents an ornamental melismatic shake. (A) Stimulative medium-distance kulning: parlando (a), exclamation (b), and onomatopoeia (c) motifs (http://chirb.it/ntIxfM). This style is designed to compel the entire herd to move in the desired direction and, most probably, sets a model of interaction with animals for the other three styles. The three motifs achieve stimulation, each in a different way, contrasting one another in register, harmonicity, rhythm, and articulation. Motif "a" alerts by its staccato zigzag leaping between two registers. Motif "b" combines stimulation (staccato leap up to the "shrieking" register) with relaxation (legato leap down to the long tone). The "shrieking" peak-tones maintain the same pitch level (melodic regularity)—reflected by the dotted double-arrows (numbers indicate the frequency discrepancy in cents). Motif "c" teases the cattle by imitating dog's barking. The stimulative specialization of (A) is manifested in its prevalence of staccato, loud dynamics, three registers within a wide ambitus, exuberance of leaps, and briefness of motifs and tones. Noteworthy, the motifs "a2" and "c" resemble the "fetch-command" archetype (Figure 5E). (B) Stimulative close-distance kulning: recitative (a) and motherese (b) motifs (http://chirb.it/8K3Lqg). (B), like (A), is stimulative but dynamically gentler due to closer distance (≈9 dB softer). This allows for diverse motherese-like prosodic exaggerations in motivating individual animals. Motif "a" expresses love/care by greatly prolonging the "recitative tone," sustaining its pitch and harmonicity. 
Motif "b" stimulates animals by briefly stressing the upper "head-voice" register with a shake-like embellishment, then sliding it all the way down to the low talking voice. Compared to (A), (B) is smoother: fewer registers, less staccato, and longer motifs and tones. (B) tends to support a monotone (a predecessor of tonicity), most noticeable at phrasal ends. (C) Inhibitive longer-distance kulning: simple kula (a), exclamation (b), and parlando (c) motifs (http://chirb.it/n6f0sv). This style functionally opposes (B) by commanding the herd to stop grazing and to go home, implying that it is no longer safe to stay out. The chief function of "a" (kula) is to instill confidence in the herder's control over the animals. "Kula" typically consists of a chain of motifs stitched together to form a characteristic shape of steep ascension to the crest point and thereafter a gradual fall-off. However, motifs might differ according to their phrasal functions: initiation, climax, decay, and cadence. The resulting kula receives a basic modal TO: anchor tones constitute "degrees" of the mode, forming a fifth between the marginal degrees and dividing it in wider upper and narrower lower parts. The Roman numerals indicate degrees (I = stable is marked as T = "tonic"). The "b" motif presents "exclamation": a gradual sliding down (≈4th), softer than in (A), and shaped like the "stop-command" (Figure 5E). Similarly shaped is the parlando "c" motif, much smoother than (B) due to its prevailing legato, freer rhythm, more homogenous registers, and longer motifs and tones. (D) Tropotrophic maximal-distance kulning: exclusive use of complex kula sentences (http://chirb.it/gpyC7t). Delivering signals over a kilometer requires taking multiple short caesuras throughout the span of the kula's descending formula, which distinguishes (D) from (C) by making kula complex. 
Motifs make up phrases, and phrases make up sentences, all of which create modal complexity: anchor-tones form intervallic relations that define degrees within a mode (usually, 5–7 degrees). Upper degrees open kula, forming an antecedent cadence (marked by the letter "D" for the "dominant" function). Lower degrees end kula with the consequent cadence (marked by "T" for "tonic"), providing resolution. Compared to (C), sentences in (D) are longer, rhythmically freer, and more homogenous (maintaining legato, a single register, the narrowest ambitus of all kulning styles, and no leaps). Relaxation, secured by modal resolution, is supported by beautification: exclusive use of legato in smoothly shaped phrases and exquisite ornamentation (shakes, trills). (D) differs from (C) by sacrificing dynamic shaping on a phrasal level and, instead, reproducing the same dynamic contour on a motivic level: the final long tone is almost always the loudest in a motif (i.e., stable). Increased homogeneity and melodic consonance (i.e., absence of leaps) are meant to motivate the herd not to depart any further beyond the range of hearing kula.

1975). Each phrase is distinguished by a wavelike melodic-dynamic "envelope" with an abrupt quick rise and a gradual prolonged fall. Kula pushes vocalizations higher, squeezing their ambitus, homogenizing timbres and legato articulation, while loosening the rhythm (**Figure 7C**). This triggers the modal genesis: kula's anchor-tones turn into degrees, with more-or-less sustained pitch values. The lowest degree becomes "tonic," in contrast to the unstable upper degrees, thereby forming tetrachord-based modes.

Maximum-range communication complicates kula by introducing hierarchic structuring (motifs-phrases-sentences) and by engaging the contrasting phrasal functions (initiation/climax/interruption/termination). The stimulating effect of the increased syntactic contrasts, undesirable for maximum-range communication that focuses on keeping the animals calm, is compensated by greater melodic homogeneity: maximizing legato, sentence-length, and dynamics, while minimizing melodic-intervallic, rhythmic, and registral diversity (**Figure 7D**). The longer span necessitates inter-phrasal caesuras, marking multiple phrases within long sentences, joined by stereotypical declining inter-phrasal melodic and dynamic "envelopes." Melody relies on a pentachordal skeleton, divided into upper major and lower minor 3rds, often supported with a quartal/quintal infrafix (Johnson, 1979). Kula breaks into a series of antecedent-consequent sentences that engage different pentachord/tetrachord(s), usually conjunct. This produces heptatonic modes (**Figure 8**).

The ongoing unveiling of musical structures makes kulning particulate by stacking up certain phrasal types while avoiding certain other combinations. This establishes syntactic rules and an implicit music theory of TO for herders and herds. Herders perceive kulning as an improvised "musical work in progress" (akin to jazz improvisation) that elaborates a specific "theme" selected by the kulerska (Rosenberg, 2014). Herded animals probably perceive kulning as a series of programmed Pavlovian-conditioned routines. In both cases, compositionality promotes particulate semiosis: the meaning of a streak of phrases consists of the sum of the meanings of each of the constituent phrases. In effect, kulning tells a "continuing story" of the day, going through an elaboration of a musical theme (Rosenberg, 2014).

The herd's daily movement generates SFTO by stitching/restitching phrases of 5 syntactic-semantic types (**Table 6**):


Genesis of SFTO follows the path of human-to-dog whistling communication. Noteworthy, kulning's exclamations and onomatopoeic calls engage stop- and fetch-whistle features (see **Figures 5E,F**).

The proof for SFTO's pragmatic efficacy is in the herd's fulfilling of the shepherd's commands (Wallin, 1991, 410).

Yet another source of semiosis for kulning was phonemic symbolism. The complete absence of words in kula and the minimal wording of motherese suggest the prelingual existence of kulning. Wallin (1991, 410–413) rightfully emphasizes that there is no reason to label kula's sounds as "phonemes": they are mere homologues to vowels and consonants, shaped by the anatomic-physiological conditions of breathing and acting while uttering. The same applies to traditional Alpine yodel (Fenk-Oczlon and Fenk, 2009a). Yodel and kulning vocables are formed not by phonological oppositions of local languages but by the communication distance and the extent of the desired stimulation/inhibition for a given call. Thus, the highest larynx and intensity at the onset of long-distance inhibitive kula-phrases generate a semantically "negative" [i], whereas a relaxed post-climactic position in the mid-distance tropotrophic kula generates a "positive" [å]. Similarly, glottal stops at phrasal beginnings and endings range from a gentle [h] to a harsh [tj], depending on the needed attack and tenuto decay (Rosenberg, 2014). The choice of the most common kulning syllables (Ahlbäck, 2007) can be explained by human/animal's natural selection for effective distant communication (Wallin, 1991, 390).

**Monodization** of kulning was imperative in the genesis of SFTO.

Animal communication usually employs male "chorus," male-female "duetting," or "antiphonal" formats (Yoshida and Okanoya, 2005). Musicologically, this corresponds to a special type of texture, "isophony": the ongoing out-of-sync multi-part reproduction of the same thematic material (Nikolsky, 2018). Isophonic jumble precludes SFTO. For multifactorial patterning to emerge, each vocalizer must clearly hear his/her own voice in order to track spectral changes without any contamination by a partner. This is how infants learn to make their own songs and how children acquire a "musical ear" (Nikolsky, 2020). Even in non-European traditions that are exclusively polyphonic, such as Aka Pygmy, motherese and children-made music remain monophonic (Rouget, 2011). This is because an auditory stimulus must be objectified to become accessible for reproduction: a relation of 2 tones in a certain aspect must be realized as an auditory constant to lay the foundation for the construction of a musical mode (Nazaikinsky, 1973). For perception, the listener must discover the permanence of the foreground "sound-object" against the background of a sound-stage, and memorize it in order to relate to it all of the subsequent changes in the thematic material.

Just as one cannot learn prosody of a language by listening to the crowd, one cannot learn SFTO by listening to isophony. And

#### TABLE 6 | Acoustic traits of main motif types and their semantic values in kulning.


Ten AEs (in rows) are used in five types of phrases (in columns), each characterized by a unique combination of AE patterns, the most distinctive of which are pitch, rhythm, articulation, dynamics, and register. Each type is also distinguished by its semantic specialization: kula is a safety signal for grazing; exclamation calls are social "grooming talk"; onomatopoeia is playful teasing; parlando is commanding and convincing; and motherese is endearing and trusting. Except for the long-distance kula (whose sentences can reach up to 15 s), all other types are quite brief (usually, 0.3–2 s) and are intermixed with the same-distance or shorter-distance motif types (i.e., kula phrases can be included in motherese recitative, but motherese cannot be included in kula). The maximal distance squeezes the ambitus into an octave confined to a single highest register. This compresses degrees into steps of a smaller or larger intervallic size, depending on their phrasal position. The climactic step tends to constitute the interval of a major 2nd, whereas the cadential step tends to constitute a minor 2nd, to emphasize and facilitate resolution of tension ("major" for a peak in tension, "minor" for relaxation). The framework of a breath cycle sets the basis for the traditional association of major with happiness (climax = inspiration = maximal power) and of minor with sadness (cadence = expiration = collapse). Mid-distance kula transposes heptatonic structures to lower registers, fitting them into a tetrachord (pentachord, if a climactic motif is added). Octave equivalence secures heptatony. Closer distances enable alterations, flattening or sharpening of the unstable degrees, and timbral recoloring. Stacking phrases of contrasting TO and semantic values, learnable by humans and domesticated animals, generates the SFTO. See a fuller version of this table in Appendix 2 "A Comparative Structural Analysis of Musograms," in Supplementary Material.


herding music promotes monodic application: herding demands hours of solitary interaction with animals, ideal for testing their response to music-making.

## CONCLUSION

Homo heidelbergensis was already anatomically capable of practicing proto-music, which was most probably isophonic, lacking the combined coding of pitch/rhythm, without which conventionalization of the semiotically functional melody-making was hardly possible. Isophony supports only group communication of zero- and first-order intentionality, limited and conditioned by the genetically embedded instinctive responses to isophonic formulas. Learning of multi-factorial particulate expression and second-order intentionality requires monophonic production. An AE's pattern becomes fully semiotic only when many senders/receivers remember it as the bearer of the same semantic value that connotes a certain affective state, "binding hearer to speaker" through "tying of some social sentiment" (Wallin, 1991, 420).

Emotional contagion is possible in isophonic signals, but it is primed to a single most salient AE, provided all communicators share the necessary neuro-anatomical substrates (Snowdon et al., 2015). Harmony, meter, texture, and form are not supported by non-human brains; neither is a premeditated "construction" of an intended message. Animal interpretation of auditory signals is inherently circumstantial, determined by the signaling context (Zuberbühler, 2017). Therefore, human music is often "misunderstood" by animals, requiring music's "translation" into the animal's "sonic templates of recognition" (Snowdon and Teie, 2013).

For ACs to evolve into music, a repertory of patterns of AEs had to be extracted from proto-musicking practice and abstracted into elemental signs to continuously inform someone(s) of the communicator's affective state, intentions, and needs. Such use emerged in communication with domesticated dogs, thereafter, adapted for herding. Hunting/gathering does not demand such communication. Instead, it prioritizes collective collaboration: bringing participants emotionally "in-tune," binding them into a group to increase one's powers. Such use makes sense in situations of using loud complex sounds while hunting large prey and repelling human predators in open savannah space (Jordania, 2011). Large groups of big-game foragers tend to prioritize collective music-making over personal, confining the latter to prepubertal age, like Aka Pygmies (Rouget, 2011). Homo probably exported isophonic proto-music from Africa to Europe.

The Last Glacial Maximum greatly reduced the European population by the Gravettian, until the Magdalenian repopulation (Maier, 2017) enabled the rise of symbolic cultures (Kozłowski, 2015) and ethnolinguistic genesis (Zilhão, 2014). Low-density foraging groups usually form alliances, cemented by linguistic commonalities and intermarriage (Marlowe, 2005). Music surpasses language in its bonding capacities (Nakata and Trehub, 2004). Gravettian proto-music must have adjusted isophony for new cultural applications, especially religious. Smaller groups generate a smaller sonic "jumble," facilitating the recognition of specific musical elements. Smaller groups also promote honesty in communication (Richerson and Boyd, 2005). Honest musical expression enables and validates the person-to-person musical communication. This opens doors to the cultural development of a motherese communicative model. Small groups are likely to promote motherese-like duetic and babbling-like solitary music-making. Thus, collective music-making is exceedingly rare in Northern Siberia (Alekseyev, 1967), which has always remained underpopulated (Sikora et al., 2019), closely resembling life in glaciated Europe.

Motherese talk, lullabies, onomatopoeia, and instinctive utterances supplied the initial material for the formation of bi-specific SFTO. Changes in distance while continuously communicating with the herd put in place the musical modes. The closest distance promotes low-register monotony; middle distance, high-register monotony; long distance, tetrachord-based tonicity; and maximal distance, conjunct pentachord/tetrachord octave-equivalent modes with dominant-tonic functionality. Monotony increases the tuning accuracy of anchor-tones, firstly defining principal degrees (tonic, supertonic, dominant), and then additional unstable degrees (Alekseyev, 1976). Characteristic modal intonations of different phrasal styles and varying position within a breathing cycle charge modal degrees with specific functionality, which directs the formation of semantic values for each of the common modal intonations. This triggers the process of modal evolution as outlined by Beliaev (1963) and elaborated by Nikolsky (2015a, 2016).

Nordic kulning is probably a vestige of an archaic cattle-oriented "domestication language" which descended from yodel, accompanying the northerly spread of Indo-European languages throughout Europe. Other Eurasian domestication languages accompanied the spread of the Uralic and Turkic language families, and were optimized, respectively, for reindeer and horse. Rémy Dor cross-analyzed vocalizations/whistles of herders speaking 20 Turkic languages, from Anatolia to Yakutia, and inferred their syntactic organization (Dor, 2005), identifying their common utterances (Dor, 1993). Like Wallin and Alekseyev, Dor too found continuity between vocalizations of hunters and herders: "somatotropic" vocalizations, designed to make the prey come closer, evolved into "fetch" or "home-return" calls, while "somatofugal" vocalizations evolved into "stop" calls to repel predators. The new class of "somatoneutral" vocalizations emerged in order to keep an animal at a constant distance (like safety-call kula). The strong biological foundation of this distance-governed communication made it well-conserved (practically indestructible), unlike languages or music systems (Dor, 2008).

Domestication languages could underlie modern languages and musics, as traditional beliefs suggest. Swedish rural informants considered kulning an ancient "language" (Moberg, 1971, 145). And on the opposite end of Eurasia, Mongolian herders believe that their music-making is derivative of the "large language," superior to human language and designed to communicate with animals, nature, and spirits (Pegg, 2001, 235). Altaic xöömii most likely constitutes yet another "domestication language."

Capacity to simultaneously control numerous AEs and second-order intentionality enabled humans to create a heterospecific semiotic system of communicating desirable affective states, which gave humans control over domestic animals, resolved human sustenance needs, and put in place music as we know it. The semiotically functional tonal organization that distinguishes music from speech might have emerged no earlier than during the Neolithic "revolution" as a result of forging new conventions of human-to-animal vocal communication.

## DIRECTIONS FOR FUTURE RESEARCH

Comparative examination of human-to-animal signaling for different domesticated animals across different geographic regions can confirm whether the paradigm of "musical domestication language," divisible in "dialects" and integrable into "language families," is applicable here.

Collecting a database of patterns of human-to-animal communication would be analogous to building a lexicon of a newly discovered natural human language or to establishing a stock of typical idioms in the musical communication within a novel musical culture. Once established, such a database can be statistically analyzed and cross-examined in relation to other databases, e.g., of emotional expressions in music. This could substantiate or invalidate my conclusions.

The perception of specific elements and patterns of human-to-animal communication by humans and animals can be experimentally tested. This could identify syntactic and pragmatic rules that cannot be assessed by acoustic analysis alone. Together, both approaches can evaluate the semiotic efficacy of TO in pastoral signaling. This, in turn, can establish whether the introduction of herding communication during the Neolithic Revolution was capable of generating SFTO in music to make it break away from the basics of animal communication.

Experimental archaeo-ethnomusicology could provide yet another way of verifying this hypothesis. Members of isolated tribes that maintain a hunter/gatherer lifestyle and use no domestic animals can be introduced to domestic animals and "taught" to use music-like signals to command them. Their progress can be analyzed and compared to patterns of conspecific acquisition of music skills by human infants, as well as to the available archaeological, genetic, and paleo-physiological data.

## AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and has approved it for publication.

## ACKNOWLEDGMENTS

I am grateful to CT and MR for reviewing the manuscript for this paper, and to Sheila Bazleh for copy-editing it. My special thanks to Leonid Perlovsky, Steven Brown, Piotr Podlipniak, Leon Crickmore, Theodor Levin, Margarita Mazo, and Philipp Tagg for their critical input in relation to matters of semiotics of music, and to Isaly Zemtsovsky, Eduard Alekseyev, and Frank Scherbaum for reviewing my approach to modal analysis.

#### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2020.01358/full#supplementary-material

DATA SHEET S1 | Appendix 1 – A new method of modal multifactorial analysis of tonal organization in music and music-like sounds.

This technical paper contains instructions for identifying the tonal organization in a music work, a music-like vocalization (e.g., infant's babbling) or music-like animal signals (e.g., bird's song) – including sounds that are indefinite or modulating in pitch.

DATA SHEET S2 | Appendix 2 – A comparative structural analysis of musograms used in Figures 3, 4, 7 of this article.

This document contains a comprehensive analysis of the characteristic traits of tonal organization in the examples of human musical communication, animal vocal communication, and bi-specific communication between domestic animals and their human guardians.

#### REFERENCES









Sundberg, J. (1987). The Science of the Singing Voice. DeKalb, IL: Northern Illinois University Press.


Devouche, and M. Gratier (Cham: Springer International Publishing), 1–18. doi: 10.1007/978-3-030-04769-6_1



**Conflict of Interest:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2020 Nikolsky. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.