# THE EVOLUTION OF RHYTHM COGNITION: TIMING IN MUSIC AND SPEECH

EDITED BY: Andrea Ravignani, Henkjan Honing and Sonja A. Kotz

PUBLISHED IN: Frontiers in Neuroscience, Frontiers in Human Neuroscience and Frontiers in Psychology

#### Frontiers Copyright Statement

© Copyright 2007-2018 Frontiers Media SA. All rights reserved. All content included on this site, such as text, graphics, logos, button icons, images, video/audio clips, downloads, data compilations and software, is the property of or is licensed to Frontiers Media SA ("Frontiers") or its licensees and/or subcontractors. The copyright in the text of individual articles is the property of their respective authors, subject to a license granted to Frontiers.

The compilation of articles constituting this e-book, wherever published, as well as the compilation of all other content on this site, is the exclusive property of Frontiers. For the conditions for downloading and copying of e-books from Frontiers' website, please see the Terms for Website Use. If purchasing Frontiers e-books from other websites or sources, the conditions of the website concerned apply.

Images and graphics not forming part of user-contributed materials may not be downloaded or copied without permission.

Individual articles may be downloaded and reproduced in accordance with the principles of the CC-BY licence subject to any copyright or other notices. They may not be re-sold as an e-book.

As author or other contributor you grant a CC-BY licence to others to reproduce your articles, including any graphics and third-party materials supplied by you, in accordance with the Conditions for Website Use and subject to any copyright notices which you include in connection with your articles and materials.

All copyright, and all rights therein, are protected by national and international copyright laws.

The above represents a summary only. For the full conditions see the Conditions for Authors and the Conditions for Website Use. ISSN 1664-8714 ISBN 978-2-88945-500-3 DOI 10.3389/978-2-88945-500-3


# THE EVOLUTION OF RHYTHM COGNITION: TIMING IN MUSIC AND SPEECH

Topic Editors:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium; Sealcentre Pieterburen, Max Planck Institute for Psycholinguistics, Netherlands

Henkjan Honing, University of Amsterdam, Netherlands

Sonja A. Kotz, Maastricht University, Netherlands; Max-Planck Institute for Human Cognitive and Brain Sciences, Germany

Image: ioat/Shutterstock.com

Human speech and music share a number of similarities and differences. One of the closest similarities is their temporal nature as both (i) develop over time, (ii) form sequences of temporal intervals, possibly differing in duration and acoustical marking by different spectral properties, which are perceived as a rhythm, and (iii) generate metrical expectations.

Human brains are particularly efficient in perceiving, producing, and processing fine rhythmic information in music and speech. However, a number of critical questions remain to be answered: Where does this human sensitivity for rhythm come from? How did rhythm cognition develop in human evolution? How did environmental rhythms affect the evolution of brain rhythms? Which rhythm-specific neural circuits are shared between speech and music, or even with other domains?

Evolutionary processes' long time scales often prevent direct observation: understanding the psychology of rhythm and its evolution requires a close-fitting integration of different perspectives. First, empirical observations of music and speech in the field are contrasted and generate testable hypotheses. Experiments exploring linguistic and musical rhythm are performed across sensory modalities, ages, and animal species to address questions about domain-specificity, development, and an evolutionary path of rhythm. Finally, experimental insights are integrated via synthetic modeling, generating testable predictions about brain oscillations underlying rhythm cognition and its evolution.

Our understanding of the cognitive, neurobiological, and evolutionary bases of rhythm is rapidly increasing. However, researchers in different fields often work on parallel, potentially converging strands with little mutual awareness. This research topic builds a bridge across several disciplines, focusing on the cognitive neuroscience of rhythm as an evolutionary process. It includes contributions encompassing, although not limited to: (1) developmental and comparative studies of rhythm (e.g. critical acquisition periods, innateness); (2) evidence of rhythmic behavior in other species, both spontaneous and in controlled experiments; (3) comparisons of rhythm processing in music and speech (e.g. behavioral experiments, systems neuroscience perspectives on music-speech networks); (4) evidence on rhythm processing across modalities and domains; (5) studies on rhythm in interaction and context (social, affective, etc.); (6) mathematical and computational (e.g. connectionist, symbolic) models of "rhythmicity" as an evolved behavior.

Citation: Ravignani, A., Honing, H., Kotz, S. A., eds. (2018). The Evolution of Rhythm Cognition: Timing in Music and Speech. Lausanne: Frontiers Media. doi: 10.3389/978-2-88945-500-3

# Table of Contents

### 1. PERSPECTIVES

#### 1.1. GENERAL

*A Citation-Based Analysis and Review of Significant Papers on Timing and Time Perception*

Sundeep Teki

#### 1.2. ANIMAL RHYTHMS

*What Pinnipeds Have to Say About Human Speech, Music, and the Evolution of Rhythm*

Andrea Ravignani, W. Tecumseh Fitch, Frederike D. Hanke, Tamara Heinrich, Bettina Hurgitsch, Sonja A. Kotz, Constance Scharff, Angela S. Stoeger and Bart de Boer

#### 1.3. DANCE AND MOVEMENT

### 2. EMPIRICAL WORK

#### 2.1. ANIMAL RHYTHMS

*Can Birds Perceive Rhythmic Patterns? A Review and Experiments on a Songbird and a Parrot Species*

Carel ten Cate, Michelle Spierings, Jeroen Hubert and Henkjan Honing

*"Bird Song Metronomics": Isochronous Organization of Zebra Finch Song Rhythm*

Philipp Norton and Constance Scharff

#### 2.2. DANCE AND MOVEMENT

#### 2.3. DEVELOPMENT

*Measuring Neural Entrainment to Beat and Meter in Infants: Effects of Music Background*

Laura K. Cirelli, Christina Spinelli, Sylvie Nozaradan and Laurel J. Trainor

Ruth Cumming, Angela Wilson, Victoria Leong, Lincoln J. Colling and Usha Goswami

Erin E. Hannon, Yohana Lévêque, Karli M. Nave and Sandra E. Trehub

#### 2.4. MUSICAL RHYTHM

*Beat Perception and Sociability: Evidence From Williams Syndrome*

Miriam D. Lense and Elisabeth M. Dykens

*Both Isochronous and Non-Isochronous Metrical Subdivision Afford Precise and Stable Ensemble Entrainment: A Corpus Study of Malian Jembe Drumming*

Rainer Polak, Justin London and Nori Jacoby

Alexandre Celma-Miralles, Robert F. de Menezes and Juan M. Toro

*The Impact of Instrument-Specific Musical Training on Rhythm Perception and Production*

Tomas E. Matthews, Joseph N. L. Thibodeau, Brian P. Gunther and Virginia B. Penhune

*Rhythm Facilitates the Detection of Repeating Sound Patterns*

Vani G. Rajendran, Nicol S. Harper, Khaled H. A. Abdel-Latif and Jan W. H. Schnupp

#### 2.5. SPEECH AND LANGUAGE

Annike Bekius, Thomas E. Cope and Manon Grube

*The Enhanced Musical Rhythmic Perception in Second Language Learners*

M. Paula Roncaglia-Denissen, Drikus A. Roor, Ao Chen and Makiko Sadakata

*Preliminary Experiments on Human Sensitivity to Rhythmic Structure in a Grammar With Recursive Self-Similarity*

Andreea Geambaşu, Andrea Ravignani and Clara C. Levelt

# Editorial: The Evolution of Rhythm Cognition: Timing in Music and Speech

#### Andrea Ravignani<sup>1,2,3</sup>\*, Henkjan Honing<sup>4</sup>\* and Sonja A. Kotz<sup>5,6</sup>\*

*<sup>1</sup> Veterinary and Research Department, Sealcentre Pieterburen, Pieterburen, Netherlands, <sup>2</sup> Language and Cognition Department, Max Planck Institute for Psycholinguistics, Nijmegen, Netherlands, <sup>3</sup> Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium, <sup>4</sup> Music Cognition Group, Amsterdam Brain and Cognition, Institute for Logic, Language, and Computation, University of Amsterdam, Amsterdam, Netherlands, <sup>5</sup> Basic and Applied NeuroDynamics Lab, Faculty of Psychology and Neuroscience, Department of Neuropsychology and Psychopharmacology, Maastricht University, Maastricht, Netherlands, <sup>6</sup> Department of Neuropsychology, Max-Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany*

Keywords: rhythm, meter, synchrony, interval timing, time perception, beat perception, evolution of speech, evolution of cognition

**Editorial on the Research Topic**

**The Evolution of Rhythm Cognition: Timing in Music and Speech**

### OVERVIEW OF THIS PAPER

This editorial serves a number of purposes. First, it summarizes and discusses the 33 accepted contributions to the special issue "The evolution of rhythm cognition: Timing in music and speech." The major focus of the issue is the cognitive neuroscience of rhythm, understood as a neurobehavioral trait undergoing an evolutionary process. Second, this editorial provides the interested reader with a guide to navigate the interdisciplinary contributions to this special issue. For this purpose, we have compiled **Table 1**, where methods, topics, and study species are summarized and related across contributions. Third, we briefly highlight research relevant to the evolution of rhythm that appeared in other journals while this special issue was being compiled. Altogether, this editorial constitutes a summary of rhythm research in music and speech spanning two years, from mid-2015 until mid-2017.

### TIMING IN MUSIC AND SPEECH

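Human speech and music differ in many respects but also share similarities. One of the main similarities lies in their temporal nature. In fact, both music and speech:

1. develop over time,
2. form sequences of temporal intervals, possibly differing in duration and acoustical marking by different spectral properties, which are perceived as a rhythm, and
3. generate metrical expectations.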


#### Edited and reviewed by:

*Carol Seger, Colorado State University, United States*

#### \*Correspondence:

*Andrea Ravignani, andrea.ravignani@gmail.com; Henkjan Honing, honing@uva.nl; Sonja A. Kotz, sonja.kotz@maastrichtuniversity.nl*

Received: *29 April 2017* Accepted: *26 May 2017* Published: *13 June 2017*

#### Citation:

*Ravignani A, Honing H and Kotz SA (2017) Editorial: The Evolution of Rhythm Cognition: Timing in Music and Speech. Front. Hum. Neurosci. 11:303. doi: 10.3389/fnhum.2017.00303*

TABLE 1 | Papers in this issue categorized along methodological and conceptual dimensions.

Humans seem to be particularly rhythmic animals. Decades of research have shown that human brains are tuned in to the fine degrees of rhythmic information in music and speech (Bolton, 1894; Fraisse, 1981, 1982, 1984; Longuet-Higgins and Lee, 1982, 1984; Povel, 1984, 1985; Essens and Povel, 1985; Povel and Essens, 1985; Shmulevich and Povel, 2000). This human propensity to perceive, produce, and process rhythm is increasingly well understood, though its evolutionary origins remain largely unknown. Compare this to what we know about the eye. This organ has evolved in animals as a complex photoreceptor to meet the need to sense light (Fitch, 2015a). In addition, color vision in humans and many other species appears particularly useful to assess the ripeness of food or the quality of a potential mate, hence conferring an evolutionary advantage. Unfortunately, we are still far from providing similar answers for a complex neurobehavioral trait such as rhythm. However, we firmly believe that rhythm needs to be anchored in an evolutionary perspective.

A number of critical questions spurred this special issue. When did the sensitivity for rhythm arise in human evolutionary history? How did rhythm cognition develop in human evolution? How does this evolutionary path relate to rhythm ontogeny? What is the biological function of rhythm in the millisecond to second range? Do environmental rhythms affect the evolution of brain rhythms, and how? Do speech and music share rhythm-specific neural circuits and cognitive modules? Are these circuits shared with other domains and even across species?

### RHYTHM: A MULTIDISCIPLINARY FIELD

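In general, the long time scales involved in evolutionary processes prevent direct observation. Sometimes the evolutionary dynamics of simple traits can be replicated in the lab: for instance, the evolution of learning in fruit flies can be directly observed (Mery and Kawecki, 2002). In contrast, the evolution of human behavior and neurobiology requires a more indirect scientific method. This is why understanding the cognitive neuroscience of rhythm and its evolution calls for a tight integration of different perspectives (Fitch, 2015b; Honing et al., 2015; Ravignani, 2017a). In particular, complementary approaches include but are not limited to:

1. developmental and comparative studies of rhythm (e.g., critical acquisition periods, innateness);
2. evidence of rhythmic behavior in other species, both spontaneous and in controlled experiments;
3. comparisons of rhythm processing in music and speech (e.g., behavioral experiments, systems neuroscience perspectives on music-speech networks);
4. evidence on rhythm processing across modalities and domains;
5. studies on rhythm in interaction and context (social, affective, etc.);
6. mathematical and computational (e.g., connectionist, symbolic) models of "rhythmicity" as an evolved behavior.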


The cognitive and neurobiological bases of rhythm are increasingly well understood (Honing et al., 2015; Merchant et al., 2015). While researchers in many fields are interested in rhythm, there is little awareness of how related and potentially converging their research strands are. This special issue builds a bridge across a large number of scientific disciplines; the focus lies in the cognitive neurosciences of rhythm, conceptualizing rhythm as a neurobehavioral trait undergoing an evolutionary process.

### DEVELOPMENTAL EVIDENCE

A good proportion of the papers in this issue deal with developmental aspects of rhythm (Abboub et al.; Bedoin et al.; Cirelli et al.; Cumming et al.; Hannon et al.; Lense and Dykens; Teie). Among those, one theoretical contribution raised the intriguing possibility that an individual's fetal environment may already affect their future rhythmic repertoire (Teie). Two contributions tested rhythmic abilities in infants ranging from 7 to 15 months of age, with a focus either on beat perception (Cirelli et al.) or speech meter and grouping (Abboub et al.). A corpus-based approach investigated rhythmic regularities in children's songs and found a connection between rhythms in song and non-song speech features (Hannon et al.). Two contributions focused on the interaction between musical beat and language in 9-year-old children with specific language impairments (Bedoin et al.; Cumming et al.). Finally, Lense and Dykens tracked rhythmic abilities over the lifespan in a sample of 74 children and adults affected by Williams syndrome.

### CROSS-CULTURAL EVIDENCE

The study of rhythm in speech and music is increasingly adopting a global, cross-cultural perspective (Abboub et al.; Bekius et al.; Polak et al.; Roncaglia-Denissen et al.; Teie). The field seems to be expanding beyond learners of English as a first language or musically enculturated Westerners. In speech, several contributions explored the relationship between rhythmic capacities and either different native languages or learning a foreign language (Abboub et al.; Bekius et al.; Hannon et al.; Roncaglia-Denissen et al.). In music, the focus is on biologically driven rhythmic universals (Teie) and experiments involving cross-cultural comparisons (Polak et al.).

### EEG AND FREQUENCY TAGGING

Along a methodological dimension, empirical papers adopted three alternative approaches: corpus analyses, behavioral experiments, or brain imaging/electrophysiology. It is interesting to note that all experimental papers in this issue that employed EEG also adopted a frequency-tagging approach (Celma-Miralles et al.; Cirelli et al.; Teki and Kononowicz), rather than a grand-average ERP method (but see Henry et al., 2017 for a note of caution).
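The logic of frequency tagging can be illustrated with a minimal sketch. It is not the analysis pipeline of any paper cited here; the sampling rate, the beat and meter frequencies, the component amplitudes, and the function name `amplitude_at` are all assumptions chosen for illustration. A simulated, noise-free signal contains components at an assumed beat frequency and at an assumed meter frequency, and the amplitude spectrum is read out at exactly those frequencies and at an untagged control frequency.

```python
import cmath
import math

FS = 100        # sampling rate in Hz (assumed for illustration)
DURATION = 10   # recording length in seconds (assumed)
BEAT_HZ = 2.4   # hypothetical beat frequency
METER_HZ = 1.2  # hypothetical meter frequency (every second beat)
CONTROL_HZ = 1.8  # frequency at which the signal has no component


def amplitude_at(signal, fs, freq):
    """Amplitude of the sinusoidal component of `signal` at `freq` (Hz).

    Computes a single discrete Fourier coefficient; this is exact when the
    recording contains a whole number of cycles of `freq`.
    """
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * math.pi * freq * k / fs)
                for k, x in enumerate(signal))
    return 2 * abs(coeff) / n


def simulate_response():
    """Toy 'neural' signal: beat, meter, and an unrelated 7 Hz component."""
    n = FS * DURATION
    return [1.0 * math.sin(2 * math.pi * BEAT_HZ * k / FS)
            + 0.5 * math.sin(2 * math.pi * METER_HZ * k / FS)
            + 0.3 * math.sin(2 * math.pi * 7.0 * k / FS)
            for k in range(n)]


signal = simulate_response()
beat_amp = amplitude_at(signal, FS, BEAT_HZ)
meter_amp = amplitude_at(signal, FS, METER_HZ)
control_amp = amplitude_at(signal, FS, CONTROL_HZ)
print(f"beat: {beat_amp:.3f}  meter: {meter_amp:.3f}  control: {control_amp:.3f}")
```

In real EEG data the signal is noisy, so amplitudes at the tagged frequencies are typically compared against neighboring frequency bins rather than against zero.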

### MUSIC, SPEECH, AND SYNTAX

The relationship between music, language, and speech continues to be of great interest in the scientific community. This continued interest is also reflected in the papers in this issue (Bedoin et al.; Bekius et al.; Cumming et al.; Geambaşu et al.; Norton and Scharff; Ravignani et al.; Roncaglia-Denissen et al.). In particular, one paper investigated how beat keeping and phonological patterning are related (Bekius et al.). Another study focused on recursion, a topic of great debate in linguistics, and showed how human adults are sensitive to recursive structures in rhythmic patterns (Geambaşu et al.).

### MODALITY

Modality-specificity and domain-specificity were also explored in this issue (Celma-Miralles et al.; Matthews et al.; Richter and Ostovar; Su). Findings about rhythm in vision (Celma-Miralles et al.; Su) and movement (Su) suggested that some circuits for rhythmic timing may coincide across modalities.

### RHYTHM IN INTERACTION

Complementary to meticulously controlled individual experiments, rhythm can be investigated by taking a more holistic approach, and probing rhythmic behaviors in interaction (Benichov et al.; Gamba et al.; Hartbauer and Römer; Lense and Dykens; Richter and Ostovar; Woolhouse et al.). Vocal coordination behavior in groups of primates (Gamba et al.), songbirds (Benichov et al.), and insects (Hartbauer and Römer) can offer insights for human interactional timing. Connections between internal rhythms and group behaviors can be investigated in healthy adults (Richter and Ostovar; Woolhouse et al.) and individuals with specific syndromes affecting musicality and sociality (Lense and Dykens).

### DANCE

Three papers discussed rhythm from the perspective of dance (Richter and Ostovar; Su; Woolhouse et al.). Rhythm and dance should be thought of as a tightly connected pair (Richter and Ostovar), which can be empirically investigated (Su) and can shed light on other aspects of cognition (Woolhouse et al.).

### QUANTITATIVE MODELS

Two papers in the issue were devoted to mathematical and computational modeling (Forth et al.; Jadoul et al.). These approaches are complementary. On the one hand, rhythm and timing can be investigated using top-down abstract models (Forth et al.). On the other hand, different aspects of speech timing can be statistically modeled with different degrees of precision and under different assumptions (Jadoul et al.).

### ANIMAL RESEARCH

This issue also contains ample evidence on rhythm from a comparative approach (Benichov et al.; Dufour et al.; Gamba et al.; Hartbauer and Römer; Hoeschele and Bowling; Norton and Scharff; Ravignani et al.; Rouse et al.; Spierings and ten Cate; ten Cate et al.). Songbirds continue to be a frequently used model in the study of rhythm (Benichov et al.; Hoeschele and Bowling; Norton and Scharff; Spierings and ten Cate; ten Cate et al.). For instance, important advances have been made by confirming that the subjective "feeling of rhythm" experienced when listening to a songbird has a quantitative, isochronous counterpart in the animal's song (Norton and Scharff). Rhythmic behaviors in two primate species were also explored in this issue. These works examined either the closest primate to humans, the chimpanzee (Dufour et al.), or one of the phylogenetically most distant groups, the lemurs (Gamba et al.). This suggests that some components of human rhythmicity may be due to evolutionary homology (common descent from our last common ancestor with chimpanzees), while other traits may be due to analogy (convergent evolution in humans and singing lemurs). A taxonomic group emerging as particularly promising for future rhythm research is the pinnipeds, which include harbor seals, sea lions, and walruses (Ravignani et al.; Rouse et al.).

### GENERAL TIMING AND OTHERS

Other papers discussed general issues related to timing and time perception (Rajendran et al.; Sameiro-Barbosa and Geiser; Teki; Teki and Griffiths). Two contributions tested general aspects of the relationship among timing, rhythm, and cognitive functions (Rajendran et al.; Teki and Griffiths). A theoretical paper reviewed neural entrainment mechanisms (Sameiro-Barbosa and Geiser). Finally, Teki provided a useful overview of timing papers since 2000, ordering them by number of citations so as to identify community trends and overall research interests.

### RHYTHM IN OTHER JOURNALS SINCE LATE 2015

Since the launch of this Frontiers Research Topic, a number of publications on rhythm have appeared in other journals. Among those, some strands are particularly relevant to research in the evolution of rhythm. Far from attempting a comprehensive overview, we mention these papers and summarize some of them below.

### Evolutionary Hypotheses for Rhythm Origins

Some review papers have focused specifically on the evolutionary origins of musical rhythm and on animal species showing human-like rhythmic traits (Bannan, 2016; Iversen, 2016; Wilson and Cook, 2016). Bannan (2016) recounted Charles Darwin's thoughts on music and how he thought human musicality may have emerged via sexual selection. Iversen (2016) summarized and compared many evolutionary hypotheses on the origins of rhythm in humans. Wilson and Cook (2016) discussed which animal species are capable of synchronizing to a beat, either spontaneously or after being trained, and how this evidence relates to evolutionary hypotheses. Some of these evolutionary hypotheses on music and rhythm have been tested via genetics (Mosing et al., 2015), behavioral experiments (Miani, 2016), electrophysiology (Bouwer et al., 2016), or animal comparative work (ten Cate et al.; van der Aa et al., 2015).

### Speech Rhythm and Comparative Anatomy of Vocal Tracts

In the evolution of speech, several studies have shown how vocal tracts in non-human primates are more flexible than previously thought. Other primates' vocal tracts are capable of producing a human-like range of vowels (Fitch et al., 2016; Boë et al., 2017) and consonants (Lameira et al., 2015, 2016, 2017). The overall conclusion is that the complexity of human speech, including its rhythmical nuances, must have neural, rather than morphological, bases (Ravignani et al., 2014b; Fitch et al., 2016; Belyk and Brown, 2017).

### The Social Roots of Rhythm

The relationship between rhythm and sociality has seen a steady increase in research and has probably been the most investigated topic over the last 2 years (Large and Gray, 2015; Yu and Tomonaga, 2015; Ellamil et al., 2016; Gebauer et al., 2016; Greenfield et al., 2016; Moore et al., 2016; Reddish et al., 2016; Rennung and Göritz, 2016; Schirmer et al., 2016; Tunçgenç and Cohen, 2016; Wallot et al., 2016; Bishop and Goebl, 2017; Chang et al., 2017; Cirelli et al., 2017; Hannon et al., 2017; Knight et al., 2017; Mogan et al., 2017; Murphy and Schul, 2017; Rorato et al., 2017; Myers et al.). Common foci are the relationship between synchronization and prosociality (Gebauer et al., 2016; Reddish et al., 2016; Rennung and Göritz, 2016; Tunçgenç and Cohen, 2016; Cirelli et al., 2017), and different forms of rhythmic behaviors in interaction (Large and Gray, 2015; Ravignani, 2015; Yu and Tomonaga, 2015; Ellamil et al., 2016; Gebauer et al., 2016; Greenfield et al., 2016; Moore et al., 2016; Schirmer et al., 2016; Wallot et al., 2016; Murphy and Schul, 2017).

### Speech, Music, and Prosody

Another topic of broad interest centers on the relationship between speech, prosody, and music (Toro and Nespor, 2015; Vanden Bosch der Nederlanden et al., 2015; Chang et al., 2016; Filippi, 2016; Frühholz et al., 2016; Kotz and Schwartze, 2016; Schwartze and Kotz, 2016; Weidema et al., 2016; Carr et al., 2017; Ding et al., 2017; Spierings et al., 2017; Toro and Hoeschele, 2017). An intriguing hypothesis is that speech prosody may be the "missing link" between music and language (Filippi, 2016) or that music and language may be preceded by musical prosody (Fitch, 2013; Honing, 2017). This may inform us about proto-musical and proto-linguistic behaviors in our early hominid ancestors.

### Cultural Evolution and Cognitive Biases

Rhythm research seems to be slowly overcoming the classical nature-nurture debate, which is built on a false dichotomy. Along these lines, recent research has focused on the cultural evolution of musical rhythm and perceptual priors (Savage et al., 2015; Trehub, 2015; Hansen et al., 2016; Le Bomin et al., 2016; Ravignani et al., 2016; Fitch, 2017; Jacoby and McDermott, 2017). Statistical universals found in musical rhythms all over the world (Savage et al., 2015) can emerge via the combined effect of human cognitive biases and cultural transmission (Ravignani et al., 2016). Interestingly, these biases seem at least partly modulated by enculturation (Jacoby and McDermott, 2017).

### The Evolution of Dance

The field of musical rhythm is increasingly expanding to encompass the scientific study of dance (Ellamil et al., 2016; Fitch, 2016; Laland et al., 2016; Ravignani and Cook, 2016; Su, 2016; Fink and Shackelford, 2017). In 2016 alone, three papers introduced conceptual frameworks for the evolutionary study of dance (Fitch, 2016; Laland et al., 2016; Ravignani and Cook, 2016). We believe the field would benefit from connecting these theoretical frameworks with recent empirical findings on dance (Ellamil et al., 2016; Su, 2016).

### Timing and Time Perception

The science of timing and time perception has been a major research area in the last century. After a less active period, this field is again experiencing an increase in research efforts. A whole special issue of Current Opinion in Behavioral Sciences was recently devoted to "Time in perception and action" (Meck and Ivry, 2016). In addition, a "Timing Research Forum" was established in 2016, to spur and connect research on timing and time perception across disciplines.

### Measuring Rhythm

Finally, new methods to model (van der Weij et al., 2017) and measure rhythmicity have been proposed, either quantitatively from data (Daniele, 2017; Malisz et al., 2017; Ravignani, 2017b; Ravignani and Norton, 2017) or as a test battery on human participants (Dalla Bella et al., 2016).
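To give a concrete sense of what measuring rhythmicity from data can look like, the sketch below computes the normalized Pairwise Variability Index (nPVI), a widely used summary of how much successive durations in a sequence differ from one another. It is offered only as an illustration and is not presented as the method of any paper cited above; the function name `npvi` and the example durations are hypothetical.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index of a sequence of durations.

    For each pair of successive durations, the absolute difference is divided
    by the pair's mean; the result is averaged over pairs and scaled by 100.
    A perfectly isochronous sequence yields 0.
    """
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two durations")
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)


# Hypothetical inter-onset intervals in seconds.
isochronous = [0.5, 0.5, 0.5, 0.5]
long_short = [0.6, 0.3, 0.6, 0.3]

print(npvi(isochronous))
print(round(npvi(long_short), 2))
```

Because each difference is normalized by the local mean duration, the index is insensitive to overall tempo: doubling every duration leaves the value unchanged.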

### FINAL CONSIDERATIONS

As in other fields, the study of the evolution of rhythm must build on a tight integration of experiments, theory, and modeling. Ideally, empirical observations of rhythm in music and speech are first recorded in the field. Observations are then contrasted to generate testable hypotheses. Based on these hypotheses, experiments on linguistic and musical rhythm are performed. Experimental factors and variations can encompass sensory modalities, ages, and animal species, to name a few, in order to address questions about domain-specificity, development, and evolutionary phylogeny. Finally, experimental insights should be integrated via synthetic modeling. The advantage of models is that they generate predictions that are quantitatively testable. Following these predictions, new empirical observations should be collected and compared, continuing the incremental loop of scientific investigation.

This journal issue contains novel empirical findings and state of the art reviews of hot topics in each discipline. We hope it will be useful as a reference volume on the evolution of rhythm cognition. Combining well-established findings and novel results on the evolution of rhythm, it should serve as an introductory reference for newcomers, a source of novel findings for researchers more familiar with one of the areas, and an interdisciplinary overview of progress in neighboring disciplines.

All contributions discussed so far show the many sides of rhythm. From this volume, rhythm emerges not as a monolithic concept, but as a multifaceted phenomenon for research. We hope that exciting future research will be ignited by this multifaceted display of rhythm across domains and species.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct, and intellectual contribution to the work, and approved it for publication.

### FUNDING

AR was supported by funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 665501 with the Research Foundation Flanders (FWO) (Pegasus<sup>2</sup> Marie Curie fellowship 12N5517N awarded to AR), a visiting fellowship in Language Evolution from the Max Planck Society (awarded to AR), and ERC grant 283435 ABACUS (awarded to Bart de Boer). HH was supported by a Distinguished Lorentz fellowship granted by the Lorentz Center for the Sciences and the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (NIAS), and a Horizon grant (317-70-10) of the Netherlands Organization for Scientific Research (NWO). SK was supported by a BBSRC award (BB/M009742/1), a British Academy Skills Acquisition Award (SQ140010), and a Marie Skłodowska Curie Action (MSCA: 707727) under the Seventh Framework Programme (FP7).

### ACKNOWLEDGMENTS

We are grateful to the editors of Frontiers in Human Neuroscience, Frontiers in Psychology - Auditory Cognitive Neuroscience, Frontiers in Neuroscience - Auditory Cognitive Neuroscience (Isabelle Peretz, Robert J. Zatorre, Hauke R. Heekeren, and Srikantan S. Nagarajan), all members of the editorial teams, and the guest editors of some of the papers in this issue: Virginia Penhune, Mikhail Lebedev, Huan Luo, Angela D. Friederici. We would like to thank all referees who have offered their time and expertise in reviewing the papers.

### REFERENCES

Voiceless Orangutan Call. PLoS ONE 10:e116136. doi: 10.1371/journal.pone.0116136


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Ravignani, Honing and Kotz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Citation-Based Analysis and Review of Significant Papers on Timing and Time Perception

Sundeep Teki\*

*Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK*

Time is an important dimension of brain function, but little is yet known about the underlying cognitive principles and neurobiological mechanisms. The field of timing and time perception has witnessed tremendous growth and multidisciplinary interest in recent years with the advent of modern neuroimaging and neurophysiological approaches. In this article, I used a data mining approach to analyze the timing literature published by a select group of researchers (*n* = 202) during the period 2000–2015 and highlight important reviews as well as empirical articles that meet the criterion of a minimum of 100 citations. The qualifying articles (*n* = 150) are listed in a table along with key details such as number of citations, names of authors, year and journal of publication as well as a short summary of the findings of each study. The results of such a data-driven approach to literature review not only serve as a useful resource to any researcher interested in timing, but also provide a means to evaluate key papers that have significantly influenced the field and to summarize recent progress and popular research trends in the field. Additionally, such analyses provide food for thought about future scientific directions and raise important questions about improving organizational structures to boost open science and progress in the field. I discuss exciting avenues for future research that have the potential to significantly advance our understanding of the neurobiology of timing, and propose the establishment of a new society, the Timing Research Forum, to promote open science and collaborative work within the highly diverse and multidisciplinary community of researchers in the field of timing and time perception.

#### Edited by:

*Andrea Ravignani, Vrije Universiteit Brussel, Belgium*

#### Reviewed by:

*Warren H. Meck, Duke University, USA Marshall Gilmer Hussain Shuler, Johns Hopkins University, USA*

#### \*Correspondence: *Sundeep Teki*

*sundeep.teki@gmail.com*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

Received: *12 April 2016* Accepted: *30 June 2016* Published: *15 July 2016*

#### Citation:

*Teki S (2016) A Citation-Based Analysis and Review of Significant Papers on Timing and Time Perception. Front. Neurosci. 10:330. doi: 10.3389/fnins.2016.00330*

Keywords: timing, time perception, rhythm perception, music perception, interval timing, temporal processing, citations, bibliometrics

### INTRODUCTION

Natural sounds have a rich temporal structure, in the form of sequences of sounds that rapidly change over time and result in dynamic states of perceptual organization. Natural sound sequences like speech and music form sequences of temporal intervals, often evoking the percept of a rhythm. How the brain processes time intervals and rhythmic sound sequences is an unresolved and challenging problem, given the absence of dedicated neural systems for encoding time.

William James was one of the first psychologists to recognize time as a "sensation," and heralded a longstanding interest and debate on the nature of time perception and its underlying representation in the brain (James, 1890). William Gooddy recognized the importance of motor structures for timing from a neurological perspective and suggested that they act as "observers" of time (Gooddy, 1958). Braitenberg (1967) proposed the cerebellum as an internal timekeeper and hypothesized that parallel fibers act as delay lines and provide a means to represent temporal patterns. In the 1970s and 1980s, electrophysiological studies led by Llinas, Cohen and colleagues revealed the specialization of the olivocerebellar circuits for temporal representation (Llinas et al., 1974; Llinás and Yarom, 1981; Welsh et al., 1995; see Yarom and Cohen, 2002 for a review). At the same time, fundamental properties of timing behavior like the scalar property provided a theoretical foundation that formal models of an internal clock must address (Church, 1984; Gibbon et al., 1984). In the 1980s and 1990s, neuropsychological work in patients with disorders of the cerebellum and basal ganglia (e.g., Ataxia, Parkinson's) began to provide causal evidence for a role of these brain regions in perceptual and motor timing (Ivry et al., 1988; Ivry and Keele, 1989; Artieda et al., 1992; Pastor et al., 1992; Ivry, 1993; Nichelli et al., 1996).

In the last two decades, however, scientific interest and progress in understanding the neural codes and mechanisms underlying temporal processing have advanced rapidly, aided by technological developments in functional neuroimaging techniques like magnetic resonance imaging and magnetoencephalography; brain stimulation techniques like transcranial magnetic stimulation and transcranial current stimulation; as well as progress in neural recording methods with the development of dense multi-electrode arrays, two-photon calcium imaging, and genetic and molecular biology tools, including the use of novel experimental animal models and optogenetic targeting of specific cell-types for causal investigations, amongst others. Our understanding of the neural mechanisms and circuits involved in temporal computations has advanced significantly through the use of these new technologies, which continue to shed light on the underlying brain bases.

However, paralleling the recent advancements in the field is an exponential growth in research output in terms of more research articles, conference proceedings, and new journals. Therefore, unlike in the previous decades, a synthesis of the research advances in the field poses a significant challenge. Discovery of knowledge represents an acute problem with a low "signal-to-noise" threshold, and it is a veritable challenge for a new or even a current investigator in the field to assimilate new ideas and apply these concepts for designing innovative experimental paradigms.

In order to make sense of the progress in the field of timing and time perception in the last fifteen years, I have adopted a data-mining approach to identify key review articles and empirical papers, from a select group of authors that have significantly impacted research on the cognitive and neural principles of time perception. The process involved shortlisting a group of established researchers in the field of timing, and identifying articles published during the period 2000–2015 that have received a minimum of 100 citations. Each qualifying article (n = 150) from this group of authors (n = 202) is listed in **Table 1** along with the number of citations, the rank of each article in terms of number of citations as well as number of citations normalized by time since publication, the names of the authors, the name of the journal, the year of publication, whether the article was an empirical study or a review, and a short summary of each article.

### KEY PAPERS ON TIMING AND TIME PERCEPTION

To obtain a representative picture of the field, I examined research articles by a select group of experts on timing and time perception. These authors were selected on the basis of their contribution to the recent special issue on "Interval timing and skill learning: the multi sensory representation of Time and Action" published in Current Opinion in Behavioral Sciences (Meck and Ivry, 2016; 75 authors) as well as on the basis of membership of the recently concluded European COST Action, Timely (http://www.timelycost.eu/?q=members\_list; 127 authors). These 202 authors represented research groups all over the world (see Supplementary material B for the complete list of authors), and covered various aspects of timing research including psychophysics, neuroimaging, modeling, and electrophysiology in both humans and experimental animal models.

A number of metrics are commonly used to evaluate the quality and impact of research articles, including impact factor, h-index, and i10-index, amongst others. Although none of these bibliometrics represents an unbiased estimate of research impact, nor is any of them accepted as standard across the scientific community, the number of citations represents a useful metric as it indicates the impact of a paper and how well the reported findings are accepted and circulated in the field. It is not an ideal measure, for the number of citations an article receives is often skewed by the impact factor of the journal. In order to draw reasonable conclusions about recent progress in the field, articles that were published from 2000 to 2015 and indexed in Google Scholar were considered eligible. Furthermore, to identify the most impactful papers (ideas), a threshold of a minimum of 100 citations was applied. As such a metric may be biased toward older papers over more recent articles, a measure based on the number of citations normalized by the number of years since publication was also considered. Although it is possible to design a better multivariate measure of research impact (based on number of citations, impact factor of journal or novel altmetrics including number of downloads, number of views and circulation in social media amongst other variables), that is not the motivation of the paper.
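The selection and normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual pipeline: the `Paper` class, the function names, and the choice of 2016 as the reference year for "years since publication" are assumptions made here, and the data are only a handful of rows copied from Table 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    reference: str
    year: int
    citations: int
    is_review: bool


# A handful of rows copied from Table 1 (not the full list of 150 papers).
PAPERS = [
    Paper("Patel", 2008, 1305, True),
    Paper("Buhusi and Meck", 2005, 1192, True),
    Paper("Boroditsky", 2001, 1054, False),
    Paper("Boroditsky", 2000, 1036, False),
    Paper("Rao et al.", 2001, 719, False),
    Paper("Casasanto and Boroditsky", 2008, 623, False),
]


def citations_per_year(paper, reference_year=2016):
    """Citations divided by years since publication.

    Assumption: the reference year is 2016, the year the counts were collated.
    """
    years = max(reference_year - paper.year, 1)
    return paper.citations / years


def select_and_rank(papers, min_citations=100, start=2000, end=2015,
                    reference_year=2016):
    """Apply the eligibility criteria, then rank by raw and normalized counts."""
    eligible = [p for p in papers
                if start <= p.year <= end and p.citations >= min_citations]
    by_total = sorted(eligible, key=lambda p: p.citations, reverse=True)
    by_rate = sorted(eligible,
                     key=lambda p: citations_per_year(p, reference_year),
                     reverse=True)
    return by_total, by_rate


if __name__ == "__main__":
    _, ranked_by_rate = select_and_rank(PAPERS)
    for paper in ranked_by_rate:
        print(f"{paper.reference} ({paper.year}): "
              f"{citations_per_year(paper):.1f} citations per year")
```

Within this small subset, normalizing by age moves Casasanto and Boroditsky (2008) ahead of the two older Boroditsky papers, which is the kind of reordering the second rank in Table 1 is meant to capture.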

Using the above criteria, 150 papers were identified as listed in **Table 1** (references of these papers in Supplementary material A; up-to-date as of April 10, 2016). These papers covered topics related to perception of time, rhythm, music, and inter-sensory synchrony, amongst others, and used techniques including psychophysics, neuroimaging, electrophysiology and modeling. Out of the 150 papers, 52 papers were review articles (34.7% of all articles; marked with an asterisk next to the number of citations) that received an average of 271.7 citations (median: 183), i.e., one out of three prominent articles on timing over this period was a review article that either summarized the current state of research or presented new hypotheses to drive the field forward. The remaining empirical papers, 98 in all (65.3% of all articles), received an average of 208 citations per paper (median: 157). Normalizing the number of citations by the number of years since publication to remove the bias due to the "age" of each article revealed a similar trend: review articles receive more citations (mean: 30.0; median: 21.8) than empirical papers (mean: 20.7; median: 16.5). A brief one-sentence summary of each study is also presented in the last column of **Table 1**, to provide the reader an informed basis to select relevant papers for more in-depth review.

#### TABLE 1 | List of 150 papers on timing and time perception from 2000 to present sorted according to the number of citations (minimum of 100 citations) in Google Scholar collated on 10 April, 2016 (see Section Key Papers on Timing and Time Perception for more details).

| Citation and rank | Reference | Year | Journal | Summary |
|---|---|---|---|---|
| 1305\*, [1, 1] | Patel | 2008 | Oxford Uni Press | A book that analyses music cognition in relation to language from the standpoint of cognitive neuroscience. |
| 1192\*, [2, 2] | Buhusi and Meck | 2005 | *Nat. Rev. Neurosci.* | Time is represented in a distributed manner through coincidental activation of cortico-striatal neuronal populations. |
| 1054, [3, 4] | Boroditsky | 2001 | *Cogn. Psychol.* | Native language shapes how we think about time. |
| 1036, [4, 6] | Boroditsky | 2000 | *Cognition* | Time structure is shaped by metaphorical mapping from experiential domains like space. |
| 719, [5, 13] | Rao et al. | 2001 | *Nat. Neurosci.* | Cortical-subcortical network mediates different components of temporal processing. |
| 623, [6, 3] | Casasanto and Boroditsky | 2008 | *Cognition* | Spatial information affects judgments about duration but not vice versa. |
| 622\*, [7, 14] | Lewis and Miall | 2003 | *Curr. Opin. Neurobiol.* | Timing is measured by automatic (motor) system and cognitive (prefrontal and parietal) systems. |
| 587\*, [8, 12] | Mauk and Buonomano | 2004 | *Ann. Rev. Neurosci.* | Temporal processing depends on state-dependent changes in network dynamics. |
| 569\*, [9, 15] | Matell and Meck | 2004 | *Cogn. Brain. Res.* | Striatal beat frequency model proposes basal ganglia as coincidence detector of cortical and thalamic input. |
| 551\*, [10, 16] | Ivry and Spencer | 2004 | *Curr. Opin. Neurobiol.* | Cerebellum mediates precise timing and basal ganglia mediates decisions for longer intervals. |
| 512, [11, 11] | Wittmann et al. | 2006 | *Chronobiol. Int.* | Social jetlag, i.e., the discrepancy between social and biological timing, affects wellbeing and stimulant consumption. |
| 469, [12, 10] | Grahn et al. | 2007 | *J. Cogn. Neurosci.* | Basal ganglia and Supplementary Motor Areas mediate beat perception, in addition to motor production. |
| 450, [13, 23] | Coull et al. | 2004 | *Science* | Attention to time is mediated by a corticostriatal network. |
| 410\*, [14, 45] | Matell and Meck | 2000 | *Bioessays* | Coincidence detection of neural activity represents a fundamental mechanism of timing. |
| 379\*, [15, 47] | Grondin | 2001 | *Psychol. Bull.* | Weber's law provides a framework for psychological models of time. |
| 364\*, [16, 25] | Ivry et al. | 2006 | *Ann. N. Y. Acad. Sci.* | Cerebellum provides an explicit representation of time. |
| 364, [17, 50] | Coull et al. | 2000 | *Neuropsychologia* | Temporal orienting depends on sensory events and top-down expectations. |
| 360\*, [18, 8] | Grondin | 2010 | *Att. Percept. Psychophys.* | Review of recent behavioral and neuroscientific studies of timing. |
| 346, [19, 41] | Spencer et al. | 2003 | *Science* | Cerebellar patients can produce continuous rhythmic movements but not discontinuous movements. |
| 338\*, [20, 19] | Ivry and Schlerf | 2008 | *Trends Cogn. Sci.* | Dedicated models of timing are preferred over intrinsic models. |
| 333, [21, 24] | Karmarkar and Buonomano | 2007 | *Neuron* | Cortical networks can read out time as a result of intrinsic network dynamics. |
| 332\*, [22, 5] | Coull et al. | 2011 | *Neuropsychopharmacology* | Review of neuroimaging, neuropsychological and psychopharmacological aspects of timing. |
| 320, [23, 21] | Chen et al. | 2008 | *Cereb. Cortex* | Passively listening to rhythms recruits motor regions of the brain. |
| 318\*, [24, 28] | Droit-Volet and Meck | 2007 | *Trends Cogn. Sci.* | Review of how emotional arousal and valence modulate attentional time-sharing and clock speed. |
| 318, [25, 29] | Shuler and Bear | 2006 | *Science* | Primary sensory cortex, like V1, mediates reward-timing activity. |
| 315\*, [26, 62] | Lewkowicz | 2000 | *Psychol. Bull.* | Temporal relations emerge in a hierarchical and sequential fashion. |
| 306, [27, 17] | Patel et al. | 2009 | *Curr. Biol.* | Snowball, a cockatoo, can spontaneously synchronize its movements to a musical beat. |
| 296, [28, 39] | Morrone et al. | 2005 | *Nat. Neurosci.* | Short intervals of time between two successive perisaccadic visual stimuli (but not auditory) are underestimated. |
| 289, [29, 51] | Lewis and Miall | 2003 | *Neuropsychologia* | Distinct brain areas encode time in the sub- and supra-second range. |
| 287\*, [30, 26] | Wittmann and Paulus | 2008 | *Trends Cogn. Sci.* | Review of how impulsivity affects perception of time and decision making. |
| 283, [31, 77] | Penney et al. | 2000 | *J. Exp. Psychol. Hum. Perc. Perf.* | Attention modulates the internal clock at different rates for auditory and visual signals. |
| 268, [32, 22] | Winkler et al. | 2009 | *Proc. Natl. Acad. Sci. U.S.A.* | Newborn infants show beat perception. |
| 267, [33, 9] | MacDonald et al. | 2011 | *Neuron* | Hippocampal time cells encode successive moments during a sequence of events. |

*Asterisks next to the number of citations denote review articles as opposed to empirical papers. The number of citations, name(s) of authors, year and journal of publication as well as a brief summary is presented for each qualifying article. The authors' names are hyperlinked to the corresponding article's web page on Google Scholar. The numbers in the square brackets next to the number of citations denote the rank of each article in terms of overall number of citations and the rank according to the number of citations normalized by years since publication, respectively. References of all articles in this table are provided in supplementary information.*

There are several conclusions to be drawn from **Table 1**. For instance, review articles tend to dominate the field in terms of number of citations, while only an average of six significant empirical papers are published every year (also see Supplementary material C, D, and E). Although many of these reviews are now "classic" in the field, even the most recent article in the table is a review (Merchant et al., 2013a; 184 citations). Among other things, this suggests that either the field is still in an embryonic stage where review articles by established researchers are needed to set the precedent on certain topics, or that the field of timing is too diverse, representing the intersection of various sub-fields including time perception, rhythm perception, music perception, temporal coding, inter-sensory asynchrony, motor timing and coordination, a diversity that is reflected in the topics covered by the review articles.
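The proportions and the "six empirical papers per year" figure follow from the counts reported in this section. The short Python check below uses only those reported counts; the variable names are illustrative, and counting 2000–2015 as 16 publication years (inclusive) is an assumption made here.

```python
# Counts reported in the text for the 2000-2015 window.
TOTAL_PAPERS = 150
REVIEW_PAPERS = 52
EMPIRICAL_PAPERS = TOTAL_PAPERS - REVIEW_PAPERS  # 98
FIRST_YEAR, LAST_YEAR = 2000, 2015

# Assumption: both end years count, giving 16 publication years.
YEARS_COVERED = LAST_YEAR - FIRST_YEAR + 1


def share(part, whole):
    """Percentage of `whole` made up by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


REVIEW_SHARE = share(REVIEW_PAPERS, TOTAL_PAPERS)
EMPIRICAL_SHARE = share(EMPIRICAL_PAPERS, TOTAL_PAPERS)
EMPIRICAL_PER_YEAR = EMPIRICAL_PAPERS / YEARS_COVERED

if __name__ == "__main__":
    print(f"Reviews: {REVIEW_SHARE}% of qualifying papers")
    print(f"Empirical: {EMPIRICAL_SHARE}% of qualifying papers")
    print(f"Empirical papers per year: {EMPIRICAL_PER_YEAR:.2f}")
```

Under that assumption the empirical papers average just over six per year, in line with the figure quoted above.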

It is not clear whether a similar analysis of the most recent and highly cited papers in other prominent fields like memory, vision, or decision-making will yield similar trends (e.g., in the ratio of reviews to empirical studies), but one could hypothesize that such a ratio may be smaller than for the highly diverse and multidisciplinary field of timing. Alternatively, compared to research topics like vision and memory that have been intensely studied for several decades, the field of timing is still in a nascent stage and does not boast a large research community, as evidenced by the number of specialist journals on timing or the number of exclusive workshops and meetings dedicated to timing research.

### FUTURE DIRECTIONS—SCIENTIFIC

Apart from organizational considerations, there are several new scientific directions that the field can and should embrace to achieve a more comprehensive understanding of the neurobiology of natural timing behavior. Animal models of timing focused on core timing networks including the basal ganglia, cerebellum, premotor and parietal cortex (Grahn, 2012; Schneider and Ghose, 2012; Teki et al., 2012; Merchant et al., 2013a; Allman et al., 2014; Hayashi et al., 2015) will be key to understanding the encoding of time by neuronal ensembles. Such a line of work has been recently pioneered by Merchant and colleagues in rhesus macaques that combines timing behaviors and the examination of the underlying neuronal code in the basal ganglia (Merchant et al., 2011, 2013b; Bartolo et al., 2014; Bartolo and Merchant, 2015). Recent work by Mello et al. (2015) and Gouvêa et al. (2015) further demonstrated that a population code for time exists in the striatum that scales with the interval being timed and multiplexes information about action as well as time. Optogenetic approaches in specific identified cells in animal models will yield crucial insights into the causal role of such mechanisms and their impact on timing behavior (Grosenick et al., 2015). For instance, a recent study by Chen et al. (2014) reported rapid modulation of striatal activity by the cerebellum via a disynaptic pathway that has implications for the coordinated processing of temporal information in these two core timing areas.

The other dominant view of timing is that time is not based on the computations in dedicated circuits but rather represents the output of intrinsic neuronal dynamics (Karmarkar and Buonomano, 2007). In this respect, the activity of sensory areas including auditory, visual, and somatosensory cortices merits further attention. Combining optogenetics and single-unit recordings in primary visual cortex (V1), Hussain Shuler and colleagues have recently provided novel insights into how basal forebrain cholinergic input to V1 provides a teaching signal to modulate the response dynamics of V1 so that cues predictive of given delays to future reward produce responses that express those learned delays (Chubykin et al., 2013; Liu et al., 2015), and have shown that those responses reflect learned reward timing (Shuler and Bear, 2006; Zold and Hussain Shuler, 2015) and inform visually-cued timing (Namboodiri et al., 2015). Similar work in other sensory domains such as audition will enable us to decipher the multi-sensory representation of time and action during adaptive behaviors such as speech and movement. Further neurophysiological work using high channel-count electrophysiology (n ∼ 400–1000) with new silicon probes based on CMOS technology (e.g., Berényi et al., 2014; Lopez et al., 2016) or mesoscopic analysis of timing behavior across different cortical layers and multiple brain areas using multi-plane calcium imaging may shed new light on the underlying circuit-level cortical computations (Yang et al., 2016).

Apart from adopting the latest technological tools and genetic probes, a fundamental understanding of timing can be obtained by designing more naturalistic tasks that use ecological stimuli that are meaningful to the experimental subject in the real world. Naturalistic sequences with variable temporal structure (Teki et al., 2011; Teki and Griffiths, 2014, 2016) that go beyond the traditional use of single intervals may yield novel insights into the encoding of time as well as associated motor behaviors (Kornysheva and Diedrichsen, 2014). **Table 1** and the reviews therein highlight that timing is not mediated by a single brain area but rather involves a distributed network (Meck, 2005) in cortical and subcortical areas including prefrontal, parietal, premotor and sensory cortices, insula, basal ganglia, cerebellum, and inferior olive, amongst others. To formulate a unified theory of how timing is mediated by these structures, it is also important to understand the core functions of these areas and what particular aspect of timing they mediate, whether it is related to perception, attention, or memory. The use of comparative paradigms in healthy human volunteers as well as clinical populations that show timing deficits, such as patients with Parkinson's disease, Huntington's disease, and schizophrenia, amongst others, will provide a more uniform understanding of timing functions and dysfunctions in health and disease (Allman and Meck, 2012). An identical approach (and even the use of similar paradigms) in animal models via use of control animals as well as lesion or knock-out models will complement findings from the human literature and provide a more generic understanding of the neural computations and circuits that underlie timing.

### FUTURE DIRECTIONS—ORGANIZATIONAL

In order to drive more impactful experimental work, the field of timing needs to attract young researchers, which would require more concerted efforts from the entire timing community. A recent positive step in this direction was marked by the launch of a specialist journal for timing, Timing and Time Perception (Meck et al., 2013), as well as its corresponding review journal, Timing and Time Perception Reviews. Another step forward would be the launch of an academic society exclusively for researchers in timing that would promote interdisciplinary exchange of ideas amongst researchers with diverse interests in timing via annual conferences that draw on a range of methods from purely behavioral to neurophysiological and neuroanatomical measures; share pertinent news and information like grant funding calls, new papers, job opportunities for doctoral and postdoctoral candidates, workshops and training opportunities; and promote the career development of young researchers through grants for short cross-disciplinary collaborations or exchange visits and funding for attending conferences and mentoring support.

Although there already exist a few scientific societies and communities relevant to timing, like the Society for Music Perception and Cognition (SMPC: http://www.musicperception.org), Rhythm Perception and Production Workshop (RPPW: http://rppw.org), European Society for Cognitive Sciences of Music (ESCOM: http://escom2015.org), Society for Education, Music and Psychology Research (SEMPRE: http://www.sempre.org.uk), Deutsche Gesellschaft für Musikpsychologie (DGM: http://www.music-psychology.de), Asia-Pacific Society for the Cognitive Sciences of Music, and Fondazione Mariani (http://fondazione-mariani.org/) that organizes the NeuroMusic conferences, their scope is limited to music perception and psychology, and they do not cover all aspects of timing and time perception. Society for Neuroscience (SfN) represents the primary venue where timing researchers gather for structured symposia on human and animal timing research, but the scientific discussions are limited given the busy nature of SfN meetings. A recent example of a successful academic organization for a diverse topic of research is the Society for the Neurobiology of Language (http://www.neurolang.org/) funded by the National Institutes of Health, which, since its inception in 2009, attracts more than 400 researchers for its annual conferences. To address the absence of an association of researchers working on all aspects of timing, Argiro Vatakis and I have established a new timing society to promote open science and collaboration, the "Timing Research Forum" (http://timingforum.org).

Irrespective of the present state of affairs, the field of timing and time perception represents a promising and exciting field of research that is growing every year in terms of number of researchers and scientific output, and one where new students and researchers may find a relatively unexplored topic of research and make a significant impact on the field.

### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

### FUNDING

ST is funded by the Wellcome Trust (WT106084/Z/14/Z; Sir Henry Wellcome Postdoctoral Fellowship).

### ACKNOWLEDGMENTS

I thank Anu Chowdhry for help with compiling the list of papers in **Table 1**.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00330

### DATA

Metrics data presented in **Table 1** are available to download as a .mat file from Figshare:


**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Teki. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Entraining IDyOT: Timing in the Information Dynamics of Thinking

#### Jamie Forth, Kat Agres, Matthew Purver and Geraint A. Wiggins \*

*Computational Creativity Lab, Computational Linguistics Lab, Cognitive Science Group, School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK*

We present a novel hypothetical account of entrainment in music and language, in the context of the Information Dynamics of Thinking model, IDyOT. The extended model affords an alternative view of entrainment, and its companion term, pulse, from earlier accounts. The model is based on hierarchical, statistical prediction, modeling expectations of both what an event will be and when it will happen. As such, it constitutes a kind of predictive coding, with a particular novel hypothetical implementation. Here, we focus on the model's mechanism for predicting when a perceptual event will happen, given an existing sequence of past events, which may be musical or linguistic. We propose a range of tests to validate or falsify the model, at various levels of abstraction, and argue that computational modeling in general, and this model in particular, can offer a means of providing limited but useful evidence for evolutionary hypotheses.

#### Edited by:

*Andrea Ravignani, Vrije Universiteit Brussel, Belgium*

#### Reviewed by:

*Makiko Sadakata, Radboud University Nijmegen, Netherlands Rie Asano, University of Cologne, Germany Johan Loeckx, Vrije Universiteit Brussel, Belgium*

#### \*Correspondence: *Geraint A. Wiggins*

*geraint.wiggins@qmul.ac.uk*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

Received: *07 March 2016* Accepted: *28 September 2016* Published: *18 October 2016*

#### Citation:

*Forth J, Agres K, Purver M and Wiggins GA (2016) Entraining IDyOT: Timing in the Information Dynamics of Thinking. Front. Psychol. 7:1575. doi: 10.3389/fpsyg.2016.01575*

Keywords: rhythm, entrainment, cognition, information dynamics, cognitive modeling

## 1. INTRODUCTION

We propose a hypothetical anticipatory model of the perception and cognition of events in time. A model of sequence learning and generation from statistical linguistics has been adapted to handle the strongly multidimensional aspects of music, including musical time (Gabrielsson, 1973a,b; Jones, 1976, 1981; Conklin and Witten, 1995; Pearce, 2005; Pearce et al., 2005). Multidimensionality is also a property of language that can usefully be captured (Meck, 2005; Kraus et al., 2009; Wiggins, 2012a). The model is called IDyOT (Information Dynamics of Thinking). IDyOT is a cognitive architecture, after Baars' (1988) Global Workspace Theory: the aim is to capture as much as possible of the framework of basic cognitive function in one uniform processing cycle.

We approach the perception and cognition of musical and linguistic timing from two perspectives. Firstly, in the context of music, we discuss a conceptual space (Gärdenfors, 2000) representation of metrical time (Forth, 2012). The approach enables precise specification of metrical structures, hypothesized as patterns of entrainment that guide attention in musical listening (London, 2012). This perspective can be understood as a top-down specification of a theoretical notion of meter. Our second perspective is bottom-up: a mechanism that we hypothesize is capable of learning such a hierarchical representation of metrical time from exposure to the statistical regularity inherent in music and everyday perceptual experience. Our argument is that musical listening is coordinated by attentional patterns, which arise from a process involving both endogenous generation and induction from perceptual information. Furthermore, we argue that the same process underlies the temporal regulation of cognition in general, and we consider evidence from the domain of natural language to substantiate this claim.

IDyOT computes anticipatory distributions at multiple levels of granularity with respect to the surface sequence, and we hypothesize that the requirement for the temporal predictions of the different levels to coincide is what creates the human tendency toward cyclic (even if non-isochronous) meters in music and poetry. Thus, the work presents a new perspective on the debate between oscillatory and timer-based models (Hass and Durstewitz, 2014). Further, we propose that the combination of coinciding expectations at different levels of granularity is responsible for the percept of meter, explaining the effect modeled by London's (2012) additive cycle approach to metrical strength.
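As a rough intuition for how coinciding expectations at several levels could yield a cyclic pattern of metrical strength, consider the toy Python sketch below. It is not IDyOT's mechanism and not London's formalism: it assumes, purely for illustration, that each level predicts events at a fixed isochronous period (in units of the fastest pulse), and it scores each position by the number of levels whose predictions fall on it. The function names are hypothetical.

```python
from functools import reduce
from math import gcd


def cycle_length(periods):
    """Smallest span after which all levels' predictions realign (their LCM)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def metrical_strength(periods, length=None):
    """Count, for each position, how many levels predict an event there.

    Assumption: every level is isochronous and starts at position 0.
    """
    if length is None:
        length = cycle_length(periods)
    return [sum(1 for period in periods if position % period == 0)
            for position in range(length)]


if __name__ == "__main__":
    # Three nested levels (pulse, every 2nd pulse, every 4th pulse).
    print(metrical_strength([1, 2, 4], 8))
    # Two non-nested levels realign only every 6 pulses.
    print(cycle_length([2, 3]), metrical_strength([2, 3]))
```

Positions where more levels agree receive higher counts, so a repeating strong-weak profile emerges from nothing more than the alignment of the levels; non-isochronous levels would need an explicit list of predicted onsets rather than a single period.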

### 2. THEORY AND HYPOTHESES

Since IDyOT is a multi-faceted theory, we must decompose it. Introductory descriptions are given by Wiggins (2012b) and Wiggins and Forth (2015), and summarized below. Here, we list our specific hypotheses, to map out the subsequent argument.


reliably observed in other species. This hypothesis may also be tested by comparison with extant humans coupled with analysis of brain volumes in other mammalian species, and with evidence from the fossil record, using the methodology in Section 5.

• Finally, the methodology espoused in Section 5 may be applied to any aspect of the model with support from empirical study of current biology, to hypothesize about evolution, in two ways: first, the relationships between known developments in species (e.g., cortical volume) and parameters of the model may be investigated; and, second, differently parameterized versions of the model may be allowed to compete in a simulated environment, testing the evolutionary value of its various features.

In the following sections, we lay out the details of our motivation and of the temporal aspects of IDyOT.

### 3. RHYTHM AND TIMING IN SEQUENTIAL PERCEPTION

### 3.1. Prediction in Temporal Perception: Concepts and Terminology

The key idea of IDyOT is that one route to evolutionary success is for an organism to predict what is likely to happen next in its environment, and that the ability to learn an appropriate model of experience to inform such predictions is an important cognitive ability of higher animals. Further, we propose that the value of such prediction is increased if the prediction of what is to happen is coupled with the prediction of when it will happen. Playing music, alongside many survival traits, requires the ability to judge precisely where in time an action should be placed, usually anticipating the exact moment with motor preparation so that timing of sound and/or movement is correlated with other activity in the world. It is self-evident that organisms without human-scale cortical development are capable of impressive feats of prediction coupled with synchronization: for example, chameleons catching fast-flying insects, and dogs catching balls; what is not evident in these organisms is the voluntary maintenance and repetition of such behaviors in rhythmic synchronization with external stimuli.

Fitch (2013) surveys usages of terms relating to general and musical timing. Our taxonomy describes the same broad phenomena, but is different, and we must clarify our usage, and how it differs from Fitch's. Fitch argues that timing is an example of hierarchical cognition, and we agree. However, as will become clear later, in our model, concepts such as pulse and meter, while certainly hierarchical, are explicated in terms of the underlying predictive mechanism, and do not require separate explanations of their own. Particularly problematic are Fitch's notions of pulse and entrainment. Pulse is introduced thus:

First, rhythmic cognition typically involves extracting a "pulse" or "tactus" at a particular rate (the tempo) that serves as a basis for organizing and structuring incoming sonic events. (Fitch, 2013, p. 2)

and, later,

An important characteristic of musical rhythm, . . . , is isochronicity. (Fitch, 2013, p. 2)

Fitch (2013) also notes, later, that it is not the case that all musics display a(n isochronous) pulse—South Asian and Middle Eastern musics often do not have a pulse in this simplified Western sense. This is problematic, because (in the absence of a stated alternative) it implies that these musics have no "basis for organizing and structuring incoming sonic events." We believe a different definition is required.

For us, entrainment is the key concept. Fitch's definition runs thus, in terms of his notion of pulse:

When listeners extract a pulse from the acoustic surface, and adjust their own behavior to it (whether their own acoustic output, in ensemble playing, or their movements, as in dance) this is called **entrainment**. (Fitch, 2013, p. 3)

In this definition, entrainment is dependent on the presence of a pulse, that is extracted and that is, by definition, isochronous. Therefore, the movement of Indian physical performers that is correlated with their culture's non-isochronous music is not entrained. Evidence suggests that this is a narrow definition (Clayton, 2007).

The problem arises from the Western-centric notion that the perception of pulse precedes rhythm and meter perception. We suggest that the experience of pulse (isochronous or otherwise) is not primary, but an epiphenomenon of the statistical structure of music. Therefore, we define entrainment differently, allowing our Indian dancers to be entrained:

Entrainment is the capacity to sustainedly synchronize with the placement of extrinsic patterns of events in time. Autonomic entrainment is the capacity of an organism to entrain without intentional involvement (e.g., in fireflies); this is the kind of entrainment that cannot be switched off by the organism exhibiting it. Voluntary entrainment is the capacity of an organism to entrain at a non-autonomic level (e.g., Snowball the cockatoo: Patel et al., 2009; Schachner et al., 2009). Sustained voluntary entrainment is the capacity of an organism to entrain non-autonomically without extrinsic encouragement or reward (humans are the only known example).

Musical entrainment can be extremely complicated, with irregular rhythmic structures spanning cycles of several seconds (e.g., Greek folk music), or with simultaneous multiple levels of synchronization at different speeds and with very subtle deviations from a relatively simple regular beat which are highly musically salient (e.g., funk and rap). Equally complicated, though different, entrainment is required for production and comprehension of speech. It follows that entrainment is extremely advanced in humans, even though it needs individual development to reach the degrees of hierarchy and precision found in musicians and dancers. Given that such rhythmic sophistication is hard to motivate from a purely biological evolutionary perspective (see e.g., Merchant and Honing, 2014; Merchant et al., 2015, for discussion of such biological evolution), either it must arise de novo from social evolutionary pressures; or from a mechanism capable of capturing the simpler rhythms experienced in the world, which is then able to construct complexity as needed; or perhaps a combination of the two (Bown, 2008; Bown and Wiggins, 2009; Merker et al., 2009; Fitch, 2012; Bowling et al., 2013; Ravignani et al., 2013).

We propose that the last of these options is the case: a bottom-up hierarchical perceptual construction of temporal sequence accounts for rhythm and meter in music and language. It has been selected for because it promotes predictive power which enhances information processing and the action that results from it. The enhancement is achieved by attentional orientation, which we discuss next.

### 3.2. Attentional Orientation

The orientation of attention toward specific spatial locations, objects or moments in time to optimize behavior has been extensively investigated. Coull et al. (2000) describe two distinct forms of attentional shift: endogenous, a top-down mechanism initiated to meet cognitive demands, and exogenous, a bottom-up mechanism stimulated by unexpected events.

Cherry (1953) investigated auditory selective attention (the "cocktail party effect") in experiments designed to reveal the extent to which, and under what conditions, listeners could disambiguate simultaneously spoken, but spatially-separated, dialogues recorded by the same speaker. Wearing headphones, subjects were asked to attend only to the speech signal delivered to their right ear and to repeat the words while doing so. Subjects could reproduce the spoken dialogue perfectly, and when subsequently questioned, were largely unable to report any detail from the unattended source, beyond general characteristics such as speech vs. non-speech, and male vs. female speaker. However, in a subsequent experiment, subjects performed the same task but with stimuli consisting of a single speech signal delivered independently to each ear with decreasing inter-ear time delay. In this case, nearly all subjects reported that they recognized that the two signals were the same when the delay was in the region of 2–6 s, suggesting that unattended signals are processed to some degree, and under certain conditions, are able to impact on conscious awareness. This behavior is a necessary consequence of the IDyOT architecture.

Cherry concluded that this mechanism was statistical in nature, and that the brain stored transition probabilities, to be able to estimate maximum-likelihood to guide perception and overcome noisy signals. This was assumed to account for the fact that even dialogues spoken by the same speaker, presented simultaneously but non-spatially separated, could eventually be disambiguated after multiple hearings (up to 10–20 times). Further evidence is provided by a variant of the previous experiment involving the recognition of cliché phrases. The dialogues consisted entirely of concatenated cliché phrases. Participants were reliably able to detect whole phrases at a time with relative ease, presumably relying on highly likely word transitions inherent to cliché phrases. However, between phrases, no expectations could be generated, and participants were equally likely to switch between dialogues at such phrase boundaries, and were therefore unable to completely disambiguate the two dialogues. The IDyOT prediction mechanism accords with this.

In addition to spatial information, listeners can also use other stimulus features, such as pitch, to orientate attention toward particular events (Woods et al., 1991; Woods and Alain, 1993). Semantic salience impacts selective attention, and top-down attention is also mediated by location, pitch, timbre, and intensity (Shinn-Cunningham, 2008).

In the visual modality, Posner et al. (1980) developed a reaction time paradigm to provide evidence in support of a theoretical attentional framework consisting of a limited-capacity attentional mechanism coupled with adaptive expectation of where signals were likely to appear in the visual field.

Crucial to the temporal aspect of IDyOT developed below, time itself is also a modulatory factor for attentional orientation. Coull (2004, p. 217) distinguishes between temporal attentional orienting ('how attentional processing varies as a function of time') and temporal selective attention ('how time perception varies as a function of attentional selectivity').

ERP evidence demonstrates that sounds presented at attended times elicit a larger N1 than sounds at unattended times (Lange et al., 2003; Sanders and Astheimer, 2008). Coull and Nobre (1998) report the first direct comparison between the neural correlates of spatial vs. temporal cues, revealing that both temporal cues (when a target will appear) and spatial cues (where it will appear) similarly improve reaction time, but that hemispheric asymmetry is evident between the two conditions. Similar findings are also reported by Nobre (2001) and Griffin (2002). Nobre (2001, p. 1320) demonstrates that there is no hardwired cue interval, but that "the utility of a warning cue depends upon the specific temporal information it carries and the degree of certainty." A hypothesized relationship between temporal uncertainty and attentional focus has long been the subject of empirical investigation. Early work by Klemmer (1956, 1957) proposed a model of the relationship between reaction time and an information-theoretic measure of time uncertainty. In processing language, preschool children and adults employ temporally selective attention to preferentially process the initial portions of words in continuous speech. Doing so is an effective listening strategy since word-initial segments are highly informative (Astheimer and Sanders, 2009, 2012).

### 3.3. Entrainment

Our argument, then, is that patterns of events in the world afford entrainment, which in turn affords attention-orienting behavior, if there is a perceptible regularity to the patterns' occurrence, across a range of time-scales. Regularity and periodicity are therefore invariant qualities in perception over time, a fact which sits neatly with the general principle that sequential grouping of events enhances prediction and leverage and/or understanding of causality. In music, the notion of perceptual invariance is reflected in the language used to describe highly periodic rhythms, which are sometimes referred to as stationary (Shmulevich and Povel, 2000). More generally, the occurrence of invariance in the natural world is highly suggestive of intentional behavior, such as the distinctive footfalls of a predator or chosen mate. Furthermore, an argument for the evolutionary adaptive quality of entrainment can be made in terms of social interaction and cohesion. To interact and to co-operate successfully in the world, humans must be able to synchronize movement. Synchronization requires accurate temporal prediction to engage the necessary motor control prior to an anticipated timepoint: successful co-ordination cannot be based on reactivity (Trevarthen, 1999–2000; Clayton et al., 2004). Crucially, our perception of temporal invariance and capacity for entrainment allow us to direct attentional resources toward probably-salient moments of time; thus, we better predict events in the world and accordingly act more efficiently. Efficiency, in this context, is a survival trait.

Entrainment capacity in non-humans has been supposed to correlate with the capacity for vocal learning (Patel, 2006; Schachner et al., 2009), though this is now contested (Wilson and Cook, 2016). Even so, from this, and other evidence from linguistics, it may be that entrainment is related to the process of vocal imitation. This, in turn, is implicated in learning to speak (Speidel and Nelson, 2012), which entails speech perception (even prior to the development of semantic association and of speech production). A reason for entrainment to be related to all these things would be cognitive efficiency, according with the underlying principle in IDyOT. Attending to speech, as to anything else, is energetically expensive. If periods of attention can be appropriately timed, by predicting when the next unit of information from an interlocutor will appear, such as orienting attention toward initial portions of words in continuous speech (Astheimer and Sanders, 2012), the efficiency of attending is optimized (Large and Jones, 1999). Further, it is easy to imagine situations where the capacity for physical synchronization would be of survival benefit to early humans: for example, the ability to walk in step, but with irregular paces, to minimize the audible traces of a hunting party. Further, shared entrainment would be a necessary feature of effective sustained conversation, because synchronized prediction in a listener greatly increases the likelihood of successful information transmission. Models of musical and language entrainment are similar, though language seems to be more tolerant of expectation breach: an equally hierarchical system of beats for linguistic synchronization is a given in phonology (e.g., Hawkins and Smith, 2001; Hawkins, 2003).

Humans can entrain to a beat, even when it is irregular or variable, and many find it difficult not to do so, when presented with music that they find engaging. The phenomenon is studied extensively in the music cognition literature, along with timing and rhythm (e.g., Patel and Daniele, 2003; Cross and Woodruff, 2008; Cross, 2009; Repp, 2011; Fitch, 2012, 2013; London, 2012; Merchant et al., 2015). Some non-human species exhibit temporary entrainment to music when encouraged to do so (Patel et al., 2009; Schachner et al., 2009), and others, such as crickets, exhibit synchronization via reflex response (e.g., Hartbauer et al., 2005), but sustained self-motivated active entrainment seems to be unique to humans (Wilson and Cook, 2016). Grahn (2012) gives a useful survey of related research in neuroscience.

The question of whether such control is achieved by oscillators or by interval timers remains open: Grahn (2012) presents evidence for timer-based control, while Large and Jones (1999) argue for oscillators. See Hass and Durstewitz (2014) for a wider survey of contending models. Evidence from music, beyond the Western tendency toward regular binary or ternary divisions, clearly undermines a naïve oscillator model in which phase-locked oscillators simply oscillate to determine meter: otherwise, fairly simple, naturally divisible meters such as 7/8 and the jazz favorite 3/4 + 3/8 would be at best problematic, and the long cycles of groupings of irregular length found in Greek, Arabic and Indian music would be inexplicable. Because our model operates at a fairly high level of abstraction from the neurophysiology, it is relevant to note that an oscillator can be implemented as a timer, repeatedly triggered. Thus, at more abstract levels of modeling, the distinction is only semantic, and the effect can reasonably be simulated without addressing the detail of the neural implementation. Then, a given temporal interval may be represented by a parameter, forming a closed system with the oscillator or timer that accepts it.
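The observation that an oscillator can be implemented as a repeatedly triggered timer can be illustrated with a minimal sketch. It is not part of IDyOT and makes no claim about neural implementation; the function names, and the choice of a 2+2+3 grouping for 7/8, are assumptions made purely for illustration. A single interval parameter yields an isochronous pulse, while a short list of intervals yields a cyclic but non-isochronous pattern from the same mechanism.

```python
from itertools import islice
from typing import Iterator, Sequence


def timer(start: float, interval: float) -> float:
    """A one-shot interval timer: return the time at which it fires."""
    return start + interval


def retriggered_timer(intervals: Sequence[float], start: float = 0.0) -> Iterator[float]:
    """Yield the firing times of a timer that is re-triggered whenever it fires.

    The interval parameters are cycled: one value gives an isochronous
    oscillation, several values give a periodic, non-isochronous cycle.
    (Hypothetical helper, for illustration only.)
    """
    if not intervals or any(i <= 0 for i in intervals):
        raise ValueError("intervals must be a non-empty sequence of positive numbers")
    now = start
    index = 0
    while True:
        now = timer(now, intervals[index % len(intervals)])
        yield now
        index += 1


# Isochronous pulse with a period of 0.5 s.
pulse = list(islice(retriggered_timer([0.5]), 4))

# One assumed grouping of 7/8 (2+2+3), in eighth-note units:
# the beats are unevenly spaced, but the pattern repeats every 7 units.
seven_eight = list(islice(retriggered_timer([2, 2, 3]), 6))

print(pulse)        # [0.5, 1.0, 1.5, 2.0]
print(seven_eight)  # [2.0, 4.0, 7.0, 9.0, 11.0, 14.0]
```

At this level of abstraction the "oscillator" and the "timer" are the same object seen from two sides, which is the sense in which the distinction is only semantic.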

An important class of approaches to these issues lies in the literature on Predictive Coding (e.g., Friston, 2010) and Bayesian Inference (e.g., Tenenbaum et al., 2011). These approaches have been investigated on the neuroscientific level by, for example, Vuust et al. (2009), Vuust and Witek (2014), Vuust et al. (2014), and Honing et al. (2014). Vuust et al.'s work, in particular, presents neuroscientific evidence for a theoretical model with a similar motivation to that presented here. As such, the present work may offer a more detailed explanatory account of the observed neurophysiological responses, as suggested by Maloney and Mamassian (2009) and Wiggins (2011).

Next, we discuss rhythm and meter in language and music, and the affective effects of expectation in pitch and rhythm, in the context of our definition of entrainment. There has been further debate in the literature over the relationship between language and music (e.g., Patel, 2008; Jackendoff, 2009; Fabb and Halle, 2012), which there is not space to survey here.

### 3.4. Rhythm and Meter in Language

Speech naturally shows regularities in timing, their nature varying across languages. Until recently the view was that these rhythmic differences stem from isochrony—an even distribution of certain segment types over time—with individual languages either syllable-timed, mora-timed or stress-timed (e.g., Abercrombie, 1967). For example, whereas Italian speakers appear to maintain approximately equal durations for each syllable, English speakers tend to adjust their speech rate to maintain approximately equal durations between stressed syllables, even when multiple unstressed syllables are interposed:

$$\text{(1)} \quad \text{LOOK at that WEIRD THING in the FRIDGE}$$

However, empirical evidence does not uphold this strict typological division, with some languages falling somewhere between syllable- and stress-timed (e.g., Dimitrova, 1997). Instead, research suggests that all languages are effectively stress-timed and that the apparent typological differences can be accounted for via differences in stress prominence, syllable complexity, and variability of duration of vowels and consonants (Dauer, 1983; Grabe and Low, 2003; Patel and Daniele, 2003; Patel, 2008). These differences lead to the impression of different rhythmic classes and perhaps, via their effects on perception and predictability, to the segmentation unit naturally used by speakers and acquired by infants (Nespor et al., 2011). Indeed, psycholinguistic evidence shows that rhythm and timing play a role in perception, with rhythmic stress affecting attention given to phonemes (Pitt and Samuel, 1990), expectations set up by syllable stress or intonation patterns early in a sentence affecting the perceived identity of ambiguous words later on (Dilley and McAuley, 2008), and regularity in timing speeding up processing (Quené and Port, 2005). Effects are also seen in production: even infant babbling shows syllable timing patterns characteristic of the language being learned (Levitt and Wang, 1991).

Rhythm and timing are, of course, not fixed, and here expectation and predictability play a significant role. Information content has effects both globally, with average speech rate decreasing as information density increases across languages (Pellegrino et al., 2011), and locally, with local speech rates and prosodic prominence observed to vary with the predictability of the current segment, both for syllables (Aylett and Turk, 2004) and words (Bell et al., 2003).

A similar picture emerges when we look at timing effects between speakers in dialogue. First, speakers affect each other as regards the word- or segment-level timings discussed above: both speech rate and information density converge amongst interlocutors (Giles et al., 1991, give a summary), with some evidence that degree of convergence is related to high-level interpersonal factors such as the level of cooperation (Manson et al., 2013). Second, conversational participants are apparently experts in timing at the level of utterances or turns (segments during which one speaker holds the conversational floor). Sacks et al. (1974) show that turn-taking is far from random: the floor can be taken or surrendered at specific transition relevance places, and speakers and hearers are apparently aware of these and able to exploit them. Stivers et al. (2009) show that these abilities are cross-linguistic and cross-cultural: speakers and hearers manage the timing of these transitions to avoid overlap, and minimize silences; and experiments suggest that disruptions in natural interaction timings are noticed by infants as young as 3 months (Striano et al., 2006). Heldner et al. (2013) extend this to the more specific idea of backchannel relevance spaces, showing that even simple feedback vocalizations (e.g., "uh-huh") are governed by constraints of appropriate timing.

Crucially, studies of turn-taking show that inter-speaker transition times are too short for this behavior to be reactive: if we waited for the end of the previous turn to react, we simply wouldn't have enough time to plan, select lexical items and begin to speak (requiring of the order of 600 ms) within the durations observed empirically (c. 200 ms). We must therefore predict the end (and content) of turns as we hear them, to begin our own response (see Levinson and Torreira, 2015; Levinson, 2016). Expectation is therefore key to turn-taking: EEG experiments show correlates of turn-end anticipation (Magyari et al., 2014), and models have been proposed based on syllable-timed oscillators (Wilson and Wilson, 2005). However, experiments suggest that this expectation is driven by factors at many levels. Grosjean and Hirt (1996) show that prosody helps listeners predict when a turn is going to end, although its utility depends on language and on position in the sentence. However, De Ruiter et al. (2006) asked participants to predict end-of-turn times with various manipulated versions of recorded speech: their predictions were accurate when hearing the original recordings, and when the intonation information was removed; but accuracy dropped when only intonation information was present and words could not be understood. Magyari and De Ruiter (2012) showed similar results when asking participants to predict the words remaining in a sentence. In machine classification tasks, Noguchi and Den (1998) and Ward and Tsukahara (2000), among others, show success in predicting backchannel points using prosody; but in a general turn-end detection task, Schlangen (2006) showed that combining acoustic, lexical and syntactic information improved accuracy, and Dethlefs et al. (2016) show that people's tolerance of speaker overlap depends on information density as well as syntactic completeness. While prosody contributes information, then, lexico-syntactic or higher levels must contribute as much if not more.

It is clear, then, that rhythmic structure pervades language, but that its perception and production are governed by expectation both within and between speakers—with this expectation based on information at a variety of levels. IDyOT theory proposes that this expectation is generated by the same general mechanism as that which affords musical meter perception, which is the topic of the next section.

### 3.5. Rhythm and Meter in Music

A distinction commonly made in the literature is that between musical meter and rhythm, although there is debate over the extent to which they can be treated independently (Cooper and Meyer, 1960; Benjamin, 1984; Hasty, 1997). London (2012, p. 4) defines rhythm as involving "patterns of duration that are phenomenally present in the music." Duration here refers not to note lengths, but to the inter-onset interval (IOI) between successive notes. Rhythm therefore refers to the arrangement of events in time, and in that sense can be considered as something that exists in the world and is directly available to our sensory system.
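The distinction between note length and inter-onset interval can be made concrete with a small sketch. The `(onset, length)` note representation and the function name are assumptions for illustration only; they are not drawn from London (2012). Two performances with the same onsets but very different note lengths have the same rhythm in the IOI sense.

```python
from typing import List, Sequence, Tuple

# Assumed representation: (onset time, sounding length), both in seconds.
Note = Tuple[float, float]


def inter_onset_intervals(notes: Sequence[Note]) -> List[float]:
    """Return the intervals between successive note onsets, ignoring note lengths."""
    onsets = sorted(onset for onset, _length in notes)
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


# Same onsets, different articulation: sustained notes vs. very short notes.
legato = [(0.0, 0.5), (0.5, 0.25), (0.75, 0.25), (1.0, 1.0)]
staccato = [(0.0, 0.1), (0.5, 0.1), (0.75, 0.1), (1.0, 0.1)]

print(inter_onset_intervals(legato))    # [0.5, 0.25, 0.25]
print(inter_onset_intervals(staccato))  # [0.5, 0.25, 0.25]
```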

Meter can be thought of as the grouping of perceived beats or pulses, simultaneously extracted from and projected on to a musical surface, into categories, which is typically expressed as the "regular alternation of strong and weak beats" (Lerdahl and Jackendoff, 1983, p. 12). London strongly situates meter as the perceptual counterpart to rhythm:

[M]eter involves our initial perception as well as subsequent anticipation of a series of beats that we abstract from the rhythmic surface of the music as it unfolds in time. In psychological terms, rhythm involves the structure of the temporal stimulus, while meter involves our perception and cognition of such stimuli. (London, 2012, p. 4)

The experience of meter can, therefore, be considered as a process of categorical perception, where the surface details of the temporal stimuli, such as the particular structure of the rhythmic pattern, or any expressive performance timing, are perceived with reference to a hierarchical organization of regular beats. The sensation of meter is induced from a stimulus in conjunction with both innate and learned responses to periodic or quasi-periodic stimuli.

Extending the notion of categorical perception, London (2012) argues that meter is a form of sensorimotor entrainment, that is a "coupled oscillation or resonance," afforded by the temporal invariances commonly present in musical structure. For listeners, this is one mechanism by which attentional resources can be directed toward predicted salient timepoints to efficiently process complex auditory stimuli. For musicians, and indeed any form of movement associated with musical stimuli, entrainment is necessary for the co-ordination of physical action.

London (2012) provides empirical support for his theory of meter as entrainment from recent advances in neuroscience, which shed light on the underlying biological mechanism of rhythmic perception. Neuroimaging studies have discovered patterns of neuronal activity sympathetic with metrical entrainment, providing convincing evidence that metrical perception is both stimulus driven and endogenous. Differing EEG responses to trains of identical pulses are reported by Brochard et al. (2003) and Schaefer et al. (2011) as evidence for subjective metricization. Snyder and Large (2005) and Iversen et al. (2009) both present findings that lend support to endogenous neural responses correlating with accents that are only loosely coupled with external stimuli, and in the latter study it is also demonstrated that the priming of an endogenous meter has a predictable effect on subsequent auditory responses. Nozaradan et al. (2011) present evidence of measurable neural entrainment to perceived and imagined meter.

The degree to which listeners can induce a sense of meter from a rhythmic surface has also been shown to strongly affect ability in reliably processing rhythmic information (Grube and Griffiths, 2009). Where a stronger sense of meter is induced, participants could more accurately detect rhythmic deviations. In the same experiment, the authors also provided evidence suggesting the importance of closure at the endings of rhythmic stimuli in order for listeners to report a stronger sense of perceived rhythmicality. Open endings were shown to leave listeners feeling uncertain about the structure of rhythmic stimuli, demonstrating how the ends of sequences can influence the perception of the whole.

Composers have long exploited our capacity to maintain a metrical context (i.e., our capacity for sustained voluntary entrainment), which is possible even in the presence of conflicting musical stimuli. Syncopation is the intentional rhythmic articulation of less salient metrical timepoints, which in itself is evidence for our strong tendency for entrainment, since if we could not independently maintain a sense of meter the concept of "off-beat" would be meaningless. The notion of a continuous oscillation in attentional energy provides an account, importantly one with an empirically grounded underlying mechanism, of the commonly held view that meter concerns regular patterns of strong and weak beats.

### 3.6. Affective Responses to Expectation in Timing

Huron (2006) argues that prediction, experienced as expectation, is a driver of musical affect. Huron proposes that the feeling of uncertainty, which corresponds with entropy in a predictive distribution (Hansen and Pearce, 2014), makes a substantive contribution to the aesthetic of music: changes in tension due to changes in uncertainty resolving into expected certainty, or denial of expectation, are sometimes called the "ebb and flow" of music. Empirical evidence of this relationship is supplied by Egermann et al. (2013): correspondence was found, by direct and indirect response, between affective change and change in information content as predicted by Pearce's Information Dynamics of Music (IDyOM) model (Pearce, 2005; Pearce and Wiggins, 2012).
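The two information-theoretic quantities mentioned here can be sketched as follows: entropy measures uncertainty before an event is heard, and information content measures how unexpected the event turns out to be. The distributions below are hand-written, hypothetical values, and the function names are our own for this sketch; this does not reproduce IDyOM, which learns its predictive distributions from data.

```python
import math
from typing import Dict


def entropy(distribution: Dict[str, float]) -> float:
    """Shannon entropy in bits: uncertainty of the prediction before the event."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)


def information_content(distribution: Dict[str, float], event: str) -> float:
    """Information content in bits, -log2 p(event): unexpectedness of the outcome."""
    p = distribution[event]
    if p <= 0:
        raise ValueError("event has zero probability under this distribution")
    return -math.log2(p)


# Hypothetical predictive distributions over the next note.
uncertain = {"C": 0.25, "D": 0.25, "E": 0.25, "G": 0.25}
confident = {"C": 0.5, "D": 0.25, "E": 0.125, "G": 0.125}

print(entropy(uncertain))                   # 2.0
print(entropy(confident))                   # 1.75
print(information_content(confident, "C"))  # 1.0 (expected outcome)
print(information_content(confident, "G"))  # 3.0 (unexpected outcome)
```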

However, anticipation of what is coming next (followed by the outcome and its concomitant affect) is only one aspect of this response. Another key aspect is the entrainment that allows groups of humans to perform music together, in perfect but flexible, consistent time, in ways which have never been demonstrated in other species.

An open question is why the act of entraining should produce positive affect, as it does (Hove and Risen, 2009; Tarr et al., 2015). One possible answer is that, because cognitive entrainment is necessary for efficient speech communication (see Section 3.4), mutations that select for entraining capacity, and also for exercising of that capacity are favored. Thus, a capacity which is, presumably, grounded in fundamental cyclic behaviors such as locomotion (Fitch, 2012), might be exapted to support communication through speech, but also social bonding through shared musical activity. Since speech and social bonding are interlinked, and social bonding is crucial to human survival in the wild, one can postulate a tight feedback loop between these various factors, leading to the advanced capacity for musical and speech rhythm in modern humans. This account places neither music nor language as the progenitor: it would be the basis of an evolutionary theory in which they develop in parallel from a common root, possibly through shared mechanisms and/or resources.

There remains something of a lacuna in the literature on musical affect, with respect to specific small-scale deviations, as in groove. It is to be hoped that a model like IDyOT will render hypothesis formation in this area more readily achievable, and thence empirical study may be enabled. However, in both speech and music, affect is manipulated, intentionally or otherwise, by both time and pitch, as in the frustrating denial of expectation by a speaker who pauses too much, or by a performer whose timing is poor. Kant (1952) proposes a theory of incongruity for positive affective response in humor, and something similar to this may apply here; however, we reserve this discussion for future work.

Here, what is important is that the expectations generated in time form a predictable, if locally irregular, structure, and small variations in that structure are desirable, giving rise to affective responses such as "feeling the groove" in music (Witek et al., 2014) and "pause for emphasis" in language (Cahn, 1990). This entails a representation in which a norm (the standard beat, isochronous or otherwise) is directly implied, but in which variation may be quantified so that further prediction and associated affect may be modeled. Such a representation is the subject of the next section.

### 3.7. A Conceptual Space of Rhythm and Meter

### 3.7.1. The Theory of Conceptual Spaces

Gärdenfors (2000) proposes a theory of conceptual spaces as a geometric form of representation, situated between subsymbolic and symbolic representation. The theory proposes that concepts—entirely mental entities—can be represented using sets of dimensions with defined geometrical, topological or ordinal properties. The formalism is based on betweenness, from which a notion of conceptual similarity is derived.

Gärdenfors' theory begins with an atomic but general notion of betweenness, in whose terms is defined similarity, represented as (not necessarily Euclidean) distance. This allows models of cognitive behaviors to apply geometrical reasoning to represent, manipulate and reason about concepts. Similarity is measured along quality dimensions, which "correspond to the different ways stimuli are judged to be similar or different" (Gärdenfors, 2000, p. 6). An archetypal example is a color space with the dimensions hue, saturation, and brightness. Each quality dimension has a particular geometrical structure. For example, hue is circular, whereas brightness and saturation correspond with measured points along finite linear scales. Identifying the characteristics of a dimension allows meaningful relationships between points to be derived; it is important to note that the values on a dimension need not be numbers, though how an appropriate algebra is then defined is not discussed.

Quality dimensions may be grouped into domains, sets of integral (as opposed to separable) dimensions, meaning that every dimension must take a value to be well formed. Thus, hue, saturation, and brightness in the above color model form a single domain. Each domain has a distance measure, which may be a true metric, or otherwise, such as a measure based on an ordinal relationship or the length of a path between vertices in a graph. Thence, Gärdenfors' definition of a conceptual space is "a collection of one or more domains" (Gärdenfors, 2000, p. 26). For example, a conceptual space of elementary colored shapes could be a space comprising the above domain of color and a domain representing the perceptually salient features of a given set of shapes.

Since the quality dimensions originate in betweenness, similarity is directly related to (not necessarily Euclidean) proximity. Such spatial representations naturally afford reasoning in terms of spatial regions. For example, in the domain of color, a region corresponds with the concept red. Boundaries can be adaptive, providing the formalism with an elegant means of assimilating new knowledge, and the space itself can be subject to geometrical transformation, such as scaling of constituent dimensions, modeling shifts in salience. For purely numerical dimensions, Gärdenfors (2000, pp. 24–26) tentatively suggests Euclidean distance for similarity in integral dimensions, and the city-block metric for separable dimensions.
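The distance measures just described can be made concrete with a small sketch. It is illustrative only: the function names, the rescaling of hue to a unit range, and the use of per-dimension weights to stand for salience are assumptions made here, not part of Gärdenfors' formalism. It shows a circular hue dimension, a weighted Euclidean distance within the integral color domain, and a city-block combination across separable domains.

```python
import math

def hue_distance(h1, h2):
    """Shortest angular distance in degrees between two hues (hue is circular)."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)

def color_distance(c1, c2, weights=(1.0, 1.0, 1.0)):
    """Weighted Euclidean distance within the integral color domain.

    Colors are (hue in degrees, saturation in [0, 1], brightness in [0, 1]).
    Hue distance is rescaled so that opposite hues are 1.0 apart; this
    rescaling and the salience weights are illustrative assumptions.
    """
    dh = hue_distance(c1[0], c2[0]) / 180.0
    ds = c1[1] - c2[1]
    db = c1[2] - c2[2]
    wh, ws, wb = weights
    return math.sqrt(wh * dh ** 2 + ws * ds ** 2 + wb * db ** 2)

def space_distance(domain_distances):
    """City-block combination of distances from separable domains."""
    return sum(abs(d) for d in domain_distances)

# Two reds on either side of the 0/360 degree wrap-around are close together.
print(hue_distance(350.0, 10.0))
```

Changing the `weights` argument rescales a constituent dimension, which is one simple way to model the shifts in salience mentioned above.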

### 3.7.2. A Geometrical Formalization of Meter and Rhythm

Forth (2012) formalized London's theory of meter (London, 2012, Section 3.5), seeking quality dimensions to express all the ways in which metrical structure may be variable in perception. Forth (2012) specifies two conceptual space representations of metrical structure, denoted meter-p and meter-s, to enable geometrical reasoning over metrical-rhythmic concepts. The simpler space, meter-p, represents the periodic components of well-formed hierarchical structures that correspond with metric entrainment. It can accommodate all theoretically possible forms of metrical structure, while entrainment itself is bounded by fundamental psychological and physiological constraints (London, 2012). The principal affordance of the geometry is direct computation of similarity between musical rhythms with respect to a metrical interpretation. In a genre classification task, exemplars of a range of dance music styles were projected as points in each space. Applying simple nearest-neighbor clustering over the points in each space, classification accuracy of 76% and 81% was achieved for meter-p and meter-s respectively, compared to the naïve classification baseline of 22%.

The overall general spaces are quite high-dimensional, but current thinking is that any individual actually uses a subspace, attuned to their enculturation. Thus, someone enculturated purely in Western rock would not have in their conceptual space the dimensions required to capture, say, the Yoruba timeline, explaining why even accomplished Western musicians must learn to relate to such non-Western metrical structures. The dimensionality of the spaces depends, also, on the number of metrical levels instantiated in the overall metrical structure. Therefore, musically less structured rhythms inhabit a lower dimensional space; and, conversely, each space may be extended by the addition of new dimensions corresponding to higher-level groupings or lower-level beat subdivisions.

An important aspect of this representation is its ability to abstract metrical structure from the tempo and expressive variation of individual performances. While it is possible to instantiate the representation to the point at which specific real times are included, and thus actual performances are represented, these times may be abstracted out. In this case, a point in the abstracted subspace represents a schematic, regularized version, which may capture multiple performances of a given rhythm: and so the region that the individual performances inhabit constitutes a concept under Gärdenfors' notion of convexity. The geometry of the space then allows us to distinguish groove, inconsistent timing errors, and tempo change because of their different statistical properties: the first is a tightly defined point slightly away from the regularized rhythm, the second is a cloud around a regularized rhythm, and the third is a monotonic trajectory around the regularized rhythm. These diagnostic properties both provide support for the hypothetical representation and afford a useful facility in the wider theory proposed below.
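A minimal sketch of these diagnostic properties follows. It assumes that each performance has already been aligned with the regularized rhythm, so that one metrical position yields a sequence of signed deviations in seconds across successive repetitions. The function name and both thresholds are hypothetical values chosen for illustration, not values drawn from Forth (2012).

```python
from statistics import mean, pstdev

def classify_deviation(devs, tight=0.010, offset=0.015):
    """Label the timing deviations (seconds) of one metrical position.

    devs   : signed deviations from the regularized onset, in performance order
    tight  : hypothetical spread below which deviations count as consistent
    offset : hypothetical displacement above which a shift counts as deliberate
    """
    diffs = [b - a for a, b in zip(devs, devs[1:])]
    monotonic = len(diffs) >= 2 and (
        all(d > 0 for d in diffs) or all(d < 0 for d in diffs)
    )
    spread = pstdev(devs)
    if monotonic and abs(devs[-1] - devs[0]) > offset:
        return "tempo change"         # monotonic trajectory
    if spread <= tight and abs(mean(devs)) >= offset:
        return "groove"               # tight point displaced from the norm
    if spread > tight:
        return "inconsistent timing"  # cloud around the norm
    return "regular"

print(classify_deviation([0.030, 0.031, 0.029, 0.030, 0.031]))
print(classify_deviation([0.02, -0.03, 0.01, -0.02, 0.03]))
print(classify_deviation([0.00, 0.01, 0.02, 0.03, 0.04]))
```

A fuller treatment would work on points in the conceptual space itself rather than on one position at a time, but the same three signatures (displaced point, cloud, trajectory) would be the quantities of interest.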

## 4. IDYOT: THE INFORMATION DYNAMICS OF THINKING

### 4.1. A Predictive Cognitive Architecture

We now outline the IDyOT architecture. The aim of the current section is to explain enough detail to allow the reader to follow our account of the timing aspects. Further explanation is given by Wiggins (2012b) and Wiggins and Forth (2015).

IDyOT implements Baars' Global Workspace Theory (GWT; Baars, 1988), affording a computational model of hypothetical cognitive architecture. GWT is primarily intended to account for conscious experience, and that is relevant to some aspects of IDyOT theory. However, it is the underlying mechanism that is of interest here, in our references to both theories. A number of generators sample from a complex statistical model of sequences, performing Markovian prediction from context (Wiggins and Forth, 2015). Conceptually, each generator maintains a buffer of perceptual input which may include misperceptions and alternative perceptions due to the possibility of multiple predictions matching ambiguous or noisy input, expressed as symbols, whose origin is explained below. Each buffer serves as a context for prediction of the next (as yet unreceived) symbol; predictions are expressed as distributions over the alphabet used to express the input. A buffered sequence is flushed into the Global Workspace when it meets a chunking criterion as described below. **Figure 1** gives an overview; see Wiggins (2012b), Wiggins and Forth (2015) for more detail.

IDyOT maintains a cognitive cycle that predicts what is expected next, from a statistical model, expressed in terms of self-generated symbols that are given semantics by perceptual experience. IDyOT is focused on sequence, and this is in part due to the musical focus of its ancestor, IDyOM (Information Dynamics of Music: Pearce, 2005; Pearce and Wiggins, 2012). IDyOM models human predictions of what will happen in an auditory sequence, and takes account of information about musical time in making its predictions. It is the most successful model of musical pitch expectation in the literature (Pearce and Wiggins, 2006), but it cannot predict when the next event will fall in a statistically defensible way, and it is a static model, operating over a body of data viewed as a fixed corpus: it has no interaction with the world; it has no real-time element. The focus of this paper is to extend the IDyOT model with timing, to show how it accounts for musical meter, potentially in real time.

**Figure 1** illustrates the cyclic (and hence dynamical) nature of the IDyOT model. The generators sample from statistical memory, synchronized by its own expectations of the perceptual input, if some exists, that it receives. If there is no input, the generators freewheel (Fink et al., 2009; Wiggins and Bhattacharya, 2014), conditioned only by prior context, and this is where creativity is admitted. In the current paper, we focus on the perceptual input and synchronization. Perceptual input is matched against generators' predictions, and where a match leads to a larger increase in uncertainty than other current matches, the corresponding generator's buffer is emptied into the Global Workspace, which is in fact IDyOT's memory adorned with buffers along its leading edge (**Figure 2**). The previous buffer now forms a perceptual chunk, linked in sequence with the previous chunk. The model entails that at least some generators must be working in all perceptual modalities at all times, including sensory ones; otherwise nothing would be predicting for new input in a given modality to match against. The process of structure generation is explained in **Figure 2**.

As in parsing by competitive chunking (e.g., Perruchet and Vinter, 1998; Servan-Schreiber and Anderson, 1990), IDyOT's chunking process breaks percept sequences into statistically coherent groups, which tend to correspond with structurally coherent sub-phrases, though not necessarily with traditional linguistic categories. Chunking is the basic process by which IDyOT manages its information, by analogy with human perceptual chunking (Gobet et al., 2001). Once a chunk has entered the Global Workspace, it is added to the memory and becomes available to the generators for prediction. This generates a positive feedback loop in which the chunks inform the statistical model that in turn causes chunking, reinforcing the model.
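The flavor of uncertainty-driven chunking can be conveyed by a deliberately simplified sketch. It is not IDyOT's criterion, which compares competing generators (see Wiggins and Forth, 2015). Instead it assumes a single first-order (bigram) model and closes a chunk whenever the entropy of the prediction for the next symbol rises relative to the previous step. The function names and the toy symbol stream are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_bigrams(stream):
    """Count, for each symbol, which symbols have followed it."""
    counts = defaultdict(Counter)
    for a, b in zip(stream, stream[1:]):
        counts[a][b] += 1
    return counts

def entropy(counter):
    """Shannon entropy in bits of the distribution implied by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counter.values())

def chunk(seq, counts):
    """Close a chunk after each symbol at which predictive entropy rises."""
    chunks, current, prev_h = [], [], None
    for sym in seq:
        current.append(sym)
        h = entropy(counts.get(sym, Counter()))  # uncertainty about what follows sym
        if prev_h is not None and h > prev_h:
            chunks.append(current)
            current = []
        prev_h = h
    if current:
        chunks.append(current)
    return chunks

# Toy stream built from the recurring units "abc" and "de" in varying order.
model = train_bigrams("abcdeabcabcdedeabc")
print(chunk("abcdeabc", model))
```

Within a recurring unit the next symbol is fully predictable, so entropy stays low; at the end of a unit several continuations are possible, entropy rises, and a boundary is placed there.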

Each chunk, having been recorded, is associated with a symbol in the next-higher-level of the model, which in turn adds to the overall predictive model, and each higher level is subject to chunking. Each symbol corresponds with a point in a conceptual space associated with its own layer, and each such point corresponds with a region or subspace of the conceptual space (Gärdenfors, 2000) of the layer below, defined by the lower-level symbols in the chunk. Thus, two representations grow in parallel: the first symbolic and explicitly sequential, driven by data, providing evidence from which the second is derived; and the second geometrical, mostly continuous, and relational higher layer, providing semantics for the symbols of the first.

For symbol tethering (Sloman and Chappell, 2005), very low-level conceptual spaces are a priori defined by the nature of their sensory input (inspired by human biology); higher-level ones are inferred from the lower levels using the information in the sequential model. The exact nature of the conceptual spaces involved is an interesting future research area. A measure of similarity, borrowed from conceptual space theory (Gärdenfors, 2000), allows structures to be grouped together in categories, giving them semantics in terms of mutual interrelation at each layer, and tethering to the level below, eventually bottoming out in actual percepts. Using this, a consolidation phase allows membership of categories to be optimized, by local adjustment, in terms of the predictive accuracy of the overall model. Theoretically, the layering of models and its associated abstraction into categories can proceed arbitrarily far up the constructed hierarchy. For clarity here, we restrict our example to the number of layers necessary to describe simple musical rhythms.

In summary, IDyOT's memory consists of multiple structures, of which those in **Figure 3** are simplified examples, in parallel, tied together by observed co-occurrences of feature values expressed in multidimensional perceptual input sequences. The whole constitutes a Bayesian Network, stratified in layers determined by the chunking process, and constrained to predict only to the subsequent symbol at each level and in each modality. Note, however, that the subsequent symbol may represent something arbitrarily far in the perceptual future, because higher-level, more abstract models predict in parallel with, and conditioned by, more concrete ones, and each higher-level symbol will subtend more than one lower-level symbol. From this model, IDyOT's generators make predictions and their outputs are selected on the basis of probabilistic matching with input. The differences between generator outputs are caused either by their predicting from different parts of the memory structure (e.g., at different levels in the hierarchy), or by stochastic choices licensed by the distributions with which they work.

FIGURE 2 | An illustration of the process of IDyOT structure building. (A) The input to IDyOT is a sequence of values, with three different features (thought of as viewpoints after Conklin and Witten, 1995). In reality, the voice input would be an audio signal, but for the purposes of example, we start at the (abstract and approximate) phoneme level. The sentence perceived here is "John loves Mary." (B) The structure eventually developed by IDyOT, showing the hierarchical model created by information-theoretic chunking, and the individual times associated with each chunk (as used in Figure 4) and the higher-level symbol that labels it. (C) Five steps in the construction of the memory structure shown in (B). A generator is associated with each level of each viewpoint, and with each alternative reading of the structure (though ambiguity is not shown here: see Wiggins and Forth, 2015, for details). Rather than move data around, new input, once matched perceptually, is added directly to the memory, which serves as the substrate of the Global Workspace. As each chunk is constructed, there is a peak of information content, which constitutes attentional energy in the system. Thus, as larger chunks are produced up the hierarchy, larger segments of text (and of the meanings with which they are associated) enter the Workspace; this accords with the "spotlight" analogy of Wiggins (2012b).

### 4.2. Rhythm and Timing Expectation in IDyOT

**Figure 3** illustrates the pattern of structures that is learned as a result of exposure to a broad range of 4/4 rhythms. (The model will, of course, be much more complicated than this in general, because other meters will be represented in the same network.) The binary structure results because of a combination of musical practice, in which event occurrences on metrically strong pulses are more frequent than on weak ones, and because a balanced tree representation of the structures is more information-efficient as a representation than other kinds of representation. Thus, the properties of the data to which IDyOT is exposed conspire with its information-based criteria to provide a theoretical account for the development of meter in humans. Any rhythm that IDyOT encounters is processed in context of this background model. The figure illustrates how the temporal expectations of the different metrical levels fit together to produce weaker and stronger temporal expectations at different stages in the meter, with the perceived effect shown in the Metrical Structure and Combined effect strength illustrations.

The IDyOT generators make predictions of what will be perceived next, expressed as distributions over the relevant alphabet. Each generator also makes a prediction of when the relevant symbol will appear. Because more predictors from different levels predict (what would musicologically be) strong beats, the prediction at these points is correspondingly stronger, and, in terms of qualia, this affords the experience of metrical, hierarchical rhythm.
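The way in which coinciding predictions from several levels yield graded metrical strength can be sketched in a few lines. The sketch assumes one 4/4 measure on a sixteenth-note grid and one predictor per binary metrical level, each contributing equally. In IDyOT the contributions would be learned probabilities rather than unit counts, so the equal weighting, like the function name, is an assumption made for illustration.

```python
def metrical_strength(n_positions=16, periods=(16, 8, 4, 2, 1)):
    """Number of metrical levels predicting an event at each grid position.

    periods are in grid units: measure, half note, quarter note, eighth note,
    and sixteenth note for a 4/4 measure on a sixteenth-note grid.
    """
    return [
        sum(1 for period in periods if position % period == 0)
        for position in range(n_positions)
    ]

strengths = metrical_strength()
for position, strength in enumerate(strengths):
    # A crude text rendering of the combined expectation profile.
    print(f"{position:2d} {'#' * strength}")
```

The downbeat receives a contribution from every level, the third beat from all but the measure level, and so on down to the weakest sixteenth-note positions, reproducing the familiar alternating profile of strong and weak beats.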

Section 3.7.2 outlines how the conceptual space of meter and rhythm proposed by Forth (2012) affords generalization away from the details of particular performances, to corresponding patterns of entrainment, and allows the analysis of variation in terms of its geometrical properties. Once such a space is established, new time intervals can be represented within it, and thence abstraction away from time interval to tempo becomes a straightforward projection operation on the space, rather than a matter of timing from the raw data alone, which would be difficult to handle without the prior knowledge encoded in the metrical model.

Importantly, the mechanism is required for effective linguistic communication with multiple individuals, who may have variation in speed or in regularity in their own speech, and who will certainly vary in speech speed from one to another; and for combining expectations driven by information at multiple levels, to allow accurate anticipation of lexical timing and sentence or turn-end timing simultaneously. Exactly the same hierarchical process can apply in both modalities.

Thus, IDyOT affords a method by which sequences of time periods may be derived from a base level of measured temporal units, which allows the construction of the metrical space from tabula rasa. Exposure to sufficient metrical data will cause the construction of hierarchical representations of meter, the hierarchies summing the durations of their subtended sequences, summarizing the rhythms in the data, as illustrated in **Figure 3**. The relationship between the basic unit, and the structures composed upon it by chunking, may be expressed by locating a rhythm as a point in the conceptual space defined in Section 3.7. Because the perceptual tendency is to integrate all concurrent rhythmic input, even when it is not obviously coherent, into a percept of one single rhythm (as in polyrhythms) the entire rhythmic structure that is audible at any point in time may be represented as exactly one point—or, if it is sufficiently uncoordinated as not to be perceptible as a rhythm, then as no point at all. Thus, the entire IDyOT Global Workspace resonates with the resultant temporal beat of its input, or descends into confusion when multiple conflicting rhythmic inputs are present.

The abstract, static representation afforded by the conceptual space, however, does not account for the on-going, dynamic percept of rhythmic beat: rather, it provides the parameters that configure it. In IDyOT, the on-going experience is accounted for, instead, by the predictive anticipation of the generators that use the memory at any point in a perceptual sequence to generate expectations. Consider a regular, Western rock beat, as illustrated in **Figure 3**, as processed by an IDyOT with extensive exposure to this kind of relatively foursquare rhythm.

First, we discuss predictions at the metrical level. At this level, the predictions of the part of the model representing the current rhythm are mostly in line with those of the more general metrical predictions, and therefore the expectations are reinforced: evidence confirms the estimate of the basic unit, and the predictions can be correspondingly more certain. This corresponds with a human feeling the beat strongly. However, there is one place in this rhythm where a specific musical effect is noticeable, that does not accord with simple prediction: on the third beat of the first measure, a strongly expected beat is not present in the rhythm (marked with ⋆ in **Figure 3**). Affectively, this loud rest (London, 1993) lies in strong contrast to the second measure of the rhythm, where the expectation is fulfilled. This rhythm, therefore, creates its musical effect by subverting the metrical expectation of the listener, and IDyOT is able to predict this effect: unexpected occurrences draw attention, and thus the listener is kept interested in the beat.

### 4.3. A Hypothetical Mechanism Underlying Entrainment

Entrainment in IDyOT is a direct consequence of attentional dynamics. Following Nobre (2001), IDyOT embodies a multifaceted view of attention, in which there is no "unitary homuncular attention system" (Nobre, 2001, p. 1326). The understanding of attention becomes distributed activation in neural assemblies, not a single function of the brain.

The mechanism with which IDyOT makes predictions of time is the novel contribution of this paper. We consider temporal predictions to be generated by the same kind of statistical process that governs the prediction of other attributes, such as the likelihood of particular musical pitches or phonemes of speech. However, temporal predictions are integral to the behavior of the cognitive system itself, in time. Temporal predictions are hypothesized as drivers or regulators, coordinating, but also influenced by, the generation of predictive distributions in other domains, which collectively constitutes the generation of expectations. The interaction between generated expectations and sensory input leads to the construction of representations in memory, which in turn conditions subsequent expectations.

Measuring time necessitates the ability to relate distinct moments across time, and a mechanism by which the distance between such markers can be determined. Although the actual mechanism is the subject of much debate (for an overview see Hass and Durstewitz, 2014), we assume a neuronal representation of the passing of time to be available in the brain. We assume a functional means by which moments in time can be related with respect to this underlying clock, and that the neural encoding forms the basis for the estimation of time intervals, which may be related to activation in brain areas such as the pre-supplementary motor area and frontal operculum (Coull, 2004).

Hypothesizing an intervallic representation of time underlying the cognitive processing of temporal information may appear obvious. However, considering the question of why and how this may be the case illustrates and supports our wider position regarding the importance of prediction and efficiency of representation in perception and cognition. Analogous to the derivation of intervallic representations of pitch from absolute representations of pitch, an intervallic representation of time is more compact in terms of both alphabet size and resulting statistical model than a monotonic time-line. Furthermore, intervallic representations are invariant under translation, directly affording comparison, forming the basis for the identification of higher-level structure. We conjecture that the same mechanisms of chunking and representation learning, previously described as the core mechanisms underlying the processing of symbol sequences within the IDyOT cognitive architecture, are directly applicable to the modeling of time, and in turn, underlie the real-time temporal dynamics of the cognitive system.
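Both properties claimed for the intervallic representation, compactness of alphabet and invariance under translation, can be checked directly on a toy example. The onset times below are hypothetical and given in integer milliseconds to avoid rounding effects.

```python
def intervals(onsets):
    """Inter-onset intervals of a sequence of onset times."""
    return [b - a for a, b in zip(onsets, onsets[1:])]

# Hypothetical onsets (ms) of a short rhythm, and the same rhythm heard 3 s later.
onsets = [0, 500, 1000, 1250, 1500, 2000]
shifted = [t + 3000 for t in onsets]

print(intervals(onsets))
print(intervals(onsets) == intervals(shifted))       # translation invariance
print(len(set(onsets)), len(set(intervals(onsets))))  # alphabet sizes
```

On an absolute time-line every onset is a new symbol, so the alphabet grows with the length of the sequence, whereas the intervallic alphabet here contains only two symbols, and the two presentations of the rhythm become directly comparable.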

Multiple independent IDyOT generators continuously predict sensory input, at each level of the metrical hierarchy induced by the chunking process. There must be a sufficient number, making predictions at sufficient frequency, to be useful to the organism in any given situation, subject to the constraint of available cognitive resources. In the auditory domain, we take the lower bound of 20 ms (the approximate minimum IOI at which listeners can reliably discern the correct ordering of two successive onsets: Hirsh, 1959), to determine the highest frequency at which the architecture must run—but note that this could result from generators running at this frequency, or from sufficiently many generators running more slowly, but coordinated to support this temporal resolution.
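The two alternatives in the final clause can be illustrated with a short sketch. A 20 ms resolution corresponds to 1/0.020 s = 50 predictions per second; this can come from one generator running at that rate, or from n generators each running n times more slowly with evenly staggered phases. The function names and the even staggering scheme are assumptions made for illustration; the text above does not commit to any particular coordination scheme.

```python
def generator_ticks(n_generators, resolution_ms=20, horizon_ms=200):
    """Prediction times (ms) of n evenly staggered generators.

    Each generator runs with period n * resolution_ms, that is, at 1/n of
    the rate needed to achieve the target resolution on its own.
    """
    period = n_generators * resolution_ms
    return {
        k: list(range(k * resolution_ms, horizon_ms, period))
        for k in range(n_generators)
    }

def combined_ticks(ticks):
    """All prediction times across generators, in temporal order."""
    return sorted(t for times in ticks.values() for t in times)

single = combined_ticks(generator_ticks(1))
staggered = combined_ticks(generator_ticks(4))
print(single == staggered)  # same 20 ms resolution either way
```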

IDyOT generators exhibit weakly coupled behavior, because they infer their timings from the single hierarchical memory; however, no direct coupling mechanism is assumed between individual generators. Following the global workspace theory, we hypothesize that coupling behavior emerges as the phenomenal experience of meter via the role played in the architecture by the global workspace itself, through which all communication between generators is mediated.

It is parsimonious to argue that temporal expectations, in whatever modality, are generated in accordance with general predictive principles, which are sensitive to the statistical regularities, or invariances, of sensory input. The finite resources of cognition act as a global constraint on temporal structure, which in the limit tends toward maximizing efficiency. Therefore, we argue that the same kind of predictive temporal dynamics exists in both music and language, following the temporal structure of intentional and communicative behavior. In both cases, time is used to optimize attention and maximize communicative potential. In both cases the features of the stimuli condition temporal prediction, which in turn drives the prediction of these features in time.

Thus, conceptual space representations are learned because they are efficient, and they are constrained by embodiment, and therefore take a common form across a species, but are variable across culture. Thence, we hypothesize that the mechanism underlying entrainment is a process of modeling observable patterns, which may (in a natural organism) be associated with the cause of the patterns, and thus given meaning.

The specific mechanism proposed is an extension of the event-by-event prediction used in extant statistical models of music and language. As each event is detected, the next one is predicted, the prediction being expressed as a distribution over the symbols of the dimension being predicted. In IDyOT, differently, this distribution changes with time, time being substantially more granular than the inter-event interval. It can be calculated as follows. Instead of merely determining the observed likelihood of each of the possible symbols in context, IDyOT treats each piece of evidence differently, counting not only the symbols, but also the expected time of occurrence. The result may be viewed as an overlay of distributions in time, one for each symbol, with the overall distribution across the alphabet at any point calculated by looking up the value of each symbol at that point. This is illustrated in **Figure 4**; it affords one of the means of testing the IDyOT model, laid out in the next section.
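The calculation just described can be sketched as follows. The sketch assumes that the evidence for each symbol in the current context is summarized by a count, a mean expected time, and a spread, and that each symbol's temporal profile is Gaussian. The Gaussian shape, the function names, and the example values are all assumptions made for illustration; the text specifies only that there is one distribution in time per symbol and that the distribution over the alphabet at a given moment is read off from them.

```python
import math

def gaussian(t, mu, sigma):
    """Normal density, used here as an assumed temporal profile."""
    z = (t - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def distribution_at(t, evidence):
    """Distribution over symbols at time t (seconds since the last event).

    evidence maps each symbol to (count, mean expected time, spread).
    """
    weights = {s: n * gaussian(t, mu, sd) for s, (n, mu, sd) in evidence.items()}
    total = sum(weights.values())
    if total == 0.0:
        # Numerically far from every expected time: fall back to plain counts.
        weights = {s: float(n) for s, (n, _, _) in evidence.items()}
        total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Hypothetical evidence: "x" tends to arrive after 200 ms, "y" after 300 ms.
evidence = {"x": (6, 0.20, 0.03), "y": (4, 0.30, 0.03)}
for t in (0.20, 0.25, 0.30):
    print(t, distribution_at(t, evidence))
```

The point of the example is that the most likely symbol depends on when the question is asked: the same context and the same counts favor one symbol early and the other later, which is what distinguishes this scheme from purely event-by-event prediction.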

To summarize: in IDyOT, the experience of pulse, defined by Fitch (2013) as a primary cognitive construct, emerges as an epiphenomenon of our more general notion of entrainment: it results from the superposition of multiple, regular strong expectations. Importantly, this theory explains how pulse can be imagined, rather than being a response elicited by actual sound, and how the intrinsic experience of pulse can continue beyond an audio stimulus. IDyOT's mechanism also accounts for loud rests (see **Figure 3**) and other effects such as the jolt experienced by listeners enculturated into simple Western rhythms when presented with simple non-isochronous time signatures such as 7/8. Perhaps most important, it explains how untrained children from Middle Eastern cultures can clap easily along to rhythms that advanced Western musicians sometimes find challenging: because the rhythms are learned, and the learned model affords the entrainment, not some simple oscillatory mechanism.

Thus, entrainment in IDyOT is a more general concept: it emerges epiphenomenally from hierarchical time prediction over sequential structures. The strength of predictions is determined by memorized hierarchical information, leading to the multiple different strengths of expectation required to explain the experienced complexity of rhythm in both music and language, from simple pulse up to the extreme rhythmic complexity found in Arabic, Indian and African musics, and the complexity of rhythm in language from everyday argot to the most carefully performed poetry or rap.

## 5. METHODOLOGY: STUDYING EVOLUTION THROUGH COMPUTATIONAL SIMULATION

A perennial problem for evolutionary accounts of biological development is that of distinguishing them from Just-So Stories (Kipling, 1993), because they are untestable. Here, we propose a methodology in which computational models of cognitive process afford a means of testing hypotheses about evolutionary development. While it is clear that in silico simulation is not the same as running in vivo experiments over evolutionary time, it can help to supply evidence for argument, if it is done rigorously.

To see this, one must understand that the computational model in question is not merely a predictor from data. That is, it is not an attempt to neutrally machine-learn structure in data and classify on that basis, or to search for arbitrary correlations. Rather, it is in its own right an overarching theory about the functional process of mind, which may be decomposed into several related aspects, one of which (timing) is the current topic. Different aspects of the theory are testable in different ways, and only through a comprehensive programme of experimentation (first concentrating on individual aspects in isolation, then in combination) can a full understanding of the wider theory be established. From the current perspective, then, IDyOT is a theory; the aspect under scrutiny is its timing mechanism, and this drives our current hypothesis formation.

Given adequate evidence that the model is correct with respect to current biology, the evolutionary affordance of the approach becomes available. Once the model has been shown to be an acceptable predictor of empirical observations of the behaviors it claims to capture, its parameters may be changed so as to simulate the effect of changes known to have occurred in the relevant species over evolutionary time: e.g., size of organism and/or nervous system, availability of food, or other intrinsic or extrinsic factors.

To be clear: we do not claim that this methodology can directly simulate evolution in all its complexity, but we do claim that it can supply useful answers to carefully posed questions that have a bearing on the evolution of the aspects of present-day organisms that the model is shown to simulate.

## 6. TESTABLE HYPOTHESES

The IDyOT model affords more than one opportunity for exploration of human rhythmic behavior in language and music, and its evolution. First, the model must predict human behavior as currently observed, in both modalities. Because IDyOT is multidimensional, it is also possible in principle to study the effects of combining music with language, for example, in lyrics. Second, the model should be used to generate behavioral predictions, from which surprising examples can be extracted (Honing, 2006). These can then be tested against human behavior, further developing the model and adding to knowledge of that behavior. Third, and more important in the context of the current paper, parametric constraints may be placed on the model to explore hypothetical evolutionary pressures and help understand their effects. (Of course, this is only a valid approach if the model is demonstrated to be a good model of current humans.)

### 6.1. Methods to Validate IDyOT As a Model of Current Cognition

There is a variety of empirical tests for music and language which may serve as validation of the IDyOT approach. For example, an IDyOT with greater hierarchical depth of processing, or more training examples, may be used to predict listeners with differing degrees of expertise or development, respectively; one hypothesis, for example, would be that there is a cutoff in terms of hierarchical memory depth beyond which language will be dysfunctional. In music, an IDyOT exposed to a large corpus may process musical structure at a higher level than an IDyOT exposed to a small corpus, in the same way that expert listeners tend to perceive music in terms of more semiotic structure; in this case, IDyOT's behavior could be compared with existing results on human behavior. In addition to modeling listeners with more or less musical training, IDyOT may be used to model the musical perception and expectations of listeners with different cultural backgrounds. Further, IDyOT may be used to model subjective metricization, to test whether an encultured IDyOT exhibits the same subjective metricization behavior as similarly encultured humans.

In contrast to specifically modeling listeners with divergent expectations (afforded from different cultural backgrounds or degrees of musical expertise), IDyOT may be used to simulate interaction between "average" listeners, or those of a comparable hypothetical listening background. Exposing trained IDyOTs to conversational dialogue should afford predictions of the timings associated with observed turn-taking and of human judgments in end-of-turn prediction experiments. Similarly, it should be able to generate expressive timing for synthesized speech that correlates with human affective response to timing deviations.

In similar vein, timing may be used to disambiguate language incrementally, as follows. Consider the following discourse fragments<sup>1</sup> :


In fragment 1, the onset of the final /k/ phoneme of "bank" will appear somewhat earlier than the initial /k/ of "catching" in fragment 2, and thus the predicted meaning of the two sentences may change at this very low level, as in the very eagerly predictive Cohort Theory (Marslen-Wilson, 1984) and its descendants. IDyOT theory predicts this and models the effect of the change in time explicitly, as illustrated in **Figure 4**. Note, however, that, on balance, semantic implication is usually somewhat stronger in disambiguating, as discussed by Wiggins and Forth (2015).

In the domain of music, IDyOT may be run as a participant in a tapping synchronization study, with the hypothesis that human-human pairings are indistinguishable from human-computer or computer-computer pairings. This sort of experiment would not only confirm the accuracy of the underlying mechanisms of IDyOT, but demonstrate the validity of the model when scaled up to behavioral interaction. More generally, we would hope that other known effects, such as the scaling of timing errors proportionally to duration magnitude, would be an emergent property of IDyOT's processing of sensory input, or that IDyOT can model how temporal predictions are modulated by non-temporal factors such as surprise, attention, and high-level expectation from top-down knowledge.
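The proportional scaling of timing errors mentioned above can be made concrete with a small simulation. The sketch below is not part of IDyOT: it simply assumes that produced intervals are normally distributed with a standard deviation equal to a fixed fraction of the target duration, and checks that the coefficient of variation then stays roughly constant across targets. The function names and the value of `WEBER_FRACTION` are hypothetical, chosen only for illustration.

```python
import random
import statistics


def simulate_intervals(target_ms, weber_fraction, n, rng):
    """Draw n produced intervals whose spread grows in proportion to the target."""
    sd = weber_fraction * target_ms  # assumed: error sd is proportional to duration
    return [rng.gauss(target_ms, sd) for _ in range(n)]


def coefficient_of_variation(samples):
    """Standard deviation divided by mean; constant if errors scale with duration."""
    return statistics.stdev(samples) / statistics.mean(samples)


rng = random.Random(0)
WEBER_FRACTION = 0.05  # hypothetical value, for illustration only
cv_by_target = {
    target: coefficient_of_variation(
        simulate_intervals(target, WEBER_FRACTION, 5000, rng)
    )
    for target in (250, 500, 1000)
}
for target, cv in cv_by_target.items():
    print(f"{target} ms: CV = {cv:.3f}")
```

A model in which this property is emergent would show a similarly flat coefficient of variation without the proportionality being built in, as it is here.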

In addition to modeling production and synchronization, as in a tapping study, IDyOT may be used to simulate human perceptual characteristics, such as the perception of similarity. Hypotheses could examine the formation of the model's geometrical space and probabilistic scaling of dimensions by testing whether the high level patterns captured by IDyOT are reflective of schematic perception of rhythmic variations, or of generalization and classification of linguistic information.

Another avenue of research with regard to language would be to test anomalies in perception and/or in the signal itself. And because IDyOT theory proposes that attention is regulated by information contained within the signal, its predictions can be experimentally validated with methods such as EEG (e.g., ERP analyses of the Mismatch Negativity, MMN) or eyetracking measurements, as these techniques capture the real-time dynamics of information processing. In one such experiment, IDyOT should be able to reliably detect a deviant item within a repetitive sequence, and therefore should accurately predict the MMN response in oddball paradigms (Näätänen et al., 2007), for example. Rather than predicting neural response to anomalies, one may also predict human cognition at the behavioral level, by exposing a trained IDyOT to garden-path sentences, as discussed previously, or to semantically equivalent sentences which vary in hierarchical periodic temporal structure. In this latter case, one would test whether IDyOT produces temporal responses comparable to humans (e.g., who make different end-of-sentence predictions). And finally, rather than testing ambiguous or unexpected sentence endings, one may also expose a trained IDyOT to nonsense words, to see whether the model, like humans, creates perceptual chunks, perceptually imposing more regularity in time than exists in the signal.
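As a rough illustration of how information content can flag a deviant in an oddball sequence, the following sketch computes the surprisal (in bits) of each item from the smoothed frequencies of the items that preceded it. This is a deliberately minimal frequency-counting model, far simpler than IDyOT; the function name `surprisal_trace`, the smoothing constant, and the example sequence are all assumptions made for the example.

```python
import math
from collections import Counter


def surprisal_trace(sequence, alphabet_size, alpha=1.0):
    """Information content (bits) of each item, given counts of earlier items only."""
    counts = Counter()
    trace = []
    for i, symbol in enumerate(sequence):
        # Add-alpha smoothed probability estimated from the i items seen so far.
        p = (counts[symbol] + alpha) / (i + alpha * alphabet_size)
        trace.append(-math.log2(p))
        counts[symbol] += 1
    return trace


# Hypothetical oddball sequence: nine standards, one deviant, five standards.
sequence = ["A"] * 9 + ["B"] + ["A"] * 5
trace = surprisal_trace(sequence, alphabet_size=2)
deviant_index = max(range(len(trace)), key=trace.__getitem__)
print(deviant_index, round(trace[deviant_index], 2))
```

The position with the highest surprisal coincides with the deviant item; a validation study would ask whether a model's information profile of this kind tracks the measured MMN response.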

In the music information retrieval literature (see www.ismir.net), there is significant interest in so-called "beat tracking": the automated detection of beat in (mostly popular) music, for the purpose of finding similar music for listeners. This not unsuccessful literature (e.g., Dixon, 2001, 2007; Davies and Plumbley, 2007) affords a rich vein of models against which to compare IDyOT's entrainment mechanism. Similarly, psychological (Povel and Essens, 1985) and neuroscientific (Patel and Iversen, 2014) comparators exist.
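To give a sense of the simplest kind of baseline against which an entrainment mechanism might be compared, the sketch below estimates a beat period from a list of onset times by scoring candidate periods against all pairwise inter-onset intervals. It is a toy scheme written for this illustration and does not reproduce the method of any of the systems cited above; the function name, candidate range, and tolerance are assumptions.

```python
def estimate_period(onsets_ms, min_period=200, max_period=1500, step=10, tol=50):
    """Return the candidate period (ms) best supported by pairwise onset intervals."""
    best_period, best_score = None, -1.0
    for period in range(min_period, max_period + 1, step):
        score = 0.0
        for i, a in enumerate(onsets_ms):
            for b in onsets_ms[i + 1:]:
                # Triangular weighting: full credit for an exact match,
                # falling linearly to zero at a mismatch of `tol` ms.
                score += max(0.0, 1.0 - abs((b - a) - period) / tol)
        if score > best_score:
            best_period, best_score = period, score
    return best_period


# Hypothetical isochronous onsets at 120 beats per minute (500 ms apart).
onsets = [i * 500 for i in range(8)]
print(estimate_period(onsets))
```

A scheme like this has no notion of metrical hierarchy or expectation, which is precisely where comparison with a predictive model would be informative.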

### 6.2. Predicting Behavior from IDyOT

Following Honing (2006), once a model has been validated, the researcher should push the model toward the extremes of its parameters, to discover unexpected predictions about human behavior. This is a valuable step in testing and exploring a model's performance, because surprising predictions (1) may inform us about hitherto unknown (or not well understood) human cognitive mechanisms, and (2) will further validate the model in a broader range of behavioral contexts, by pushing the boundaries of what is known, not simply modeling expected behavior.

### 6.3. Correspondence with Neural Function

Our methodology is to model cognitive function abstracted from its substrate. However, it is useful to consider cognitive predictions in the context of their hypothetical neurophysiological implementation, even though they are separated from it.

The function of IDyOT, however abstract, entails memory representations that increase in size with time. These representations, though not literal recordings of sensory experience, are very high-dimensional, because they connect all aspects of all features of sensory input together, where correlated. Unless one admits mysticism or quantum theory at the physical level of the brain (which we do not), this very dense interconnectedness entails the availability of brain volume which is strongly supralinear with respect to time, because every neural assembly (Hebb, 1949) has to be connected to every other relevant neural assembly, across modalities, between senses, and so on. This, we claim, is a necessary requirement of the established ability for veridical memory: we could not remember detail of a sonata or soliloquy unless it were so at some level of abstraction (notes/chords and words, respectively). Various antidotes to this effect may be proposed. For example: the low level detail of the memory may be discarded in favor of more abstract representations; or the layering of structures may be restricted to a given number of layers; or the connections between correlated sensory features may be limited; and so on. This affords a rich plethora of detailed hypotheses that may be tested in relation to comparative brain size of extant species with various cognitive capacities. This, in its own right, may be expected to elucidate the quality of the model in respect to these capacities; subsequently, in careful comparison with similar extinct species, it may be possible to chart a path relating increase in brain size with the development of successively more advanced cognitive capacities.

<sup>1</sup>This example is oversimplified from a phonologist's perspective, because the /aN/ in "bank" and "bang" would in reality change subtly to reflect what is coming next, but it serves to make the point here.

For example, it is known that dogs can perceive, remember, and associate meaning with words (that is, sequences of phonemes). But there is no evidence that they can compose words into meaningful phrase interpretations; indeed, quite the contrary. Our model would produce this effect when limited to only a few layers of chunking above the audio: sequences of phonemes, such as "walkies" would be memorable, but longer composed phrases and sentences would not. On the other hand, our theory affords much deeper construction when more layers are allowed (Wiggins and Forth, 2015).

Given this evolutionary account, one can formulate experiments based on IDyOT's ability to learn sequential structures (such as language or music) in which dependent variables relate to cortex volume: for example, the depth of layering can be limited, or the alphabets of the various layers can be limited, or both. These restrictions would be expected to limit the ability of the system to learn, and thence to predict. This approach, in particular, allows us to distinguish IDyOT from models whose parameters (e.g., node number in neural networks) are less specifically related to the function of the theory.
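The expectation that restricting a learner limits its ability to predict can be illustrated with a much cruder system than IDyOT. In the sketch below, the length of the context available to a simple table-based predictor stands in, loosely, for a restriction on depth; IDyOT's layers are hierarchies of chunks rather than fixed-length contexts, so this is an analogy only. The function names and the toy sequence are hypothetical, and the model is scored on the same sequence it was trained on purely to keep the example short.

```python
from collections import Counter, defaultdict


def train(sequence, order):
    """Count which symbol follows each context of `order` preceding symbols."""
    table = defaultdict(Counter)
    for i in range(order, len(sequence)):
        table[tuple(sequence[i - order:i])][sequence[i]] += 1
    return table


def accuracy(table, sequence, order):
    """Fraction of symbols predicted correctly by the most frequent continuation."""
    hits = total = 0
    for i in range(order, len(sequence)):
        context = tuple(sequence[i - order:i])
        if context in table:
            guess = table[context].most_common(1)[0][0]
            hits += guess == sequence[i]
        total += 1
    return hits / total


# Toy sequence in which the symbol after "A" depends on what preceded the "A".
sequence = list("ABAC" * 50)
scores = {order: accuracy(train(sequence, order), sequence, order) for order in (1, 2)}
for order, score in scores.items():
    print(order, round(score, 3))
```

With one symbol of context the continuation of "A" is ambiguous, so prediction is imperfect; with two symbols it is fully determined. Experiments of the kind proposed above would vary the analogous restrictions within IDyOT itself.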

### 6.4. Evolving IDyOTs

From a modeling perspective, evolution may be thought of as a long-term parameter search within the IDyOT architecture and processing framework. When multiple IDyOTs exist in a genetic system, evolving freely, some will discover parameterization that allows for more efficient, evolutionarily adaptive behavior than others. Studies could be constructed such that IDyOTs with different temporal-predictive capacities will compete to survive, while the parameters of well-synced models are passed on to future generations by simulated breeding. The algorithmic parameters and probabilistic weightings underlying predictive processing may be randomly varied across agents to see which variations yield the most adaptive IDyOTs. Then, again, those whose predictions facilitate accurate communication or behavior may pass their algorithmic idiosyncrasies on to their IDyOT children.

In particular, parameters such as depth of hierarchy and retention of detail in symbol creation can be varied, and their effect on the predictions of the system studied. The most interesting possibility here is modeling the evolution of the neocortex: in the style of Bown and Wiggins (2005), an evolutionary computation system may be set up that allows simulation of not only cognitive function, but also the behavior of populations. Thus, evolution may be simulated quite literally in silico, albeit at a functional level, and the relationship between biological affordances and effects studied in ways that are not accessible in vivo.
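The idea of evolution as a parameter search can be sketched with a very small evolutionary loop. The example below is a hedged illustration only: each agent is reduced to a single hypothetical parameter (the beat period it predicts), fitness is the negative error against a fixed signal period, and reproduction is truncation selection with Gaussian mutation. It is not the system of Bown and Wiggins (2005), and none of the names or settings come from IDyOT.

```python
import random


def fitness(period_ms, true_period_ms=500.0):
    """Higher is better: agents predicting closer to the signal period score higher."""
    return -abs(period_ms - true_period_ms)


def evolve(pop_size=30, generations=40, mutation_sd=20.0, seed=1):
    """Evolve a population of predicted periods; return the best final agent."""
    rng = random.Random(seed)
    population = [rng.uniform(200.0, 1500.0) for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: the better-synchronised half become parents.
        parents = sorted(population, key=fitness, reverse=True)[: pop_size // 2]
        # Each parent survives and also leaves one randomly mutated child.
        children = [p + rng.gauss(0.0, mutation_sd) for p in parents]
        population = parents + children
    return max(population, key=fitness)


best = evolve()
print(round(best, 1))
```

Because parents are retained alongside their children, the best agent never gets worse from one generation to the next; a fuller study would replace the single number with parameters such as depth of hierarchy and the fixed target with interaction between agents.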

### 7. SUMMARY

In this paper, we have presented a novel model of timing in a predictive cognitive architecture. We have described in some detail how the temporal predictions allow efficient processing of ambiguous and/or noisy perceptual signals, and we have related the mechanisms to both linguistic and musical rhythm. Finally, we have proposed methods by which the approach will be evaluated, which constitutes the future work of the IDyOT project.

### AUTHOR CONTRIBUTIONS

JF invented the conceptual space representation and wrote about it, and also proposed much of the testing section; KA underpinned the theory with empirical research from psychology and neuroscience, and also contributed a lot to the testing section; MP supplied the linguistic grounding; GW invented the core model and wrote the sections summarizing it, and those on timing, using illustrations suggested by MP, and neural implementation. The rest of the writing was a team effort.

### ACKNOWLEDGMENTS

The authors are supported by the projects Lrn2Cre8 and ConCreTe, which acknowledge the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET grant numbers 610859 and 611733, respectively. Thanks to Sarah Hawkins and Dan Stowell for stimulating related discussions and useful information.




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer JL and the handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Forth, Agres, Purver and Wiggins. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Commentary: Beta-Band Oscillations Represent Auditory Beat and Its Metrical Hierarchy in Perception and Imagery

Sundeep Teki <sup>1</sup> \* and Tadeusz W. Kononowicz <sup>2</sup>

<sup>1</sup> Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK, <sup>2</sup> CEA.DSV.I2BM.NeuroSpin - Institut National de la Santé et de La Recherche Médicale Cognitive Neuroimaging Unit, Gif sur Yvette, France

Keywords: timing and time perception, rhythm perception, music perception, magnetoencephalography, predictive coding, beta oscillations, beat perception

#### **A commentary on**

#### **Beta-Band Oscillations Represent Auditory Beat and Its Metrical Hierarchy in Perception and Imagery**

by Fujioka, T., Ross, B., and Trainor, L. J. (2015). J. Neurosci. 35, 15187–15198. doi: 10.1523/JNEUROSCI.2397-15.2015

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Hugo Merchant, National Autonomous University of Mexico, Mexico Takako Fujioka, Stanford University, USA

> \*Correspondence: Sundeep Teki sundeep.teki@dpag.ox.ac.uk

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 04 March 2016 Accepted: 09 August 2016 Published: 23 August 2016

#### Citation:

Teki S and Kononowicz TW (2016) Commentary: Beta-Band Oscillations Represent Auditory Beat and Its Metrical Hierarchy in Perception and Imagery. Front. Neurosci. 10:389. doi: 10.3389/fnins.2016.00389

The ability to predict the timing of natural sounds is essential for accurate comprehension of speech and music (Allman et al., 2014). Rhythmic activity in the beta range (12–30 Hz) is crucial for encoding the temporal structure of regular sound sequences (Fujioka et al., 2009, 2012; Bartolo et al., 2014; Teki, 2014; Bartolo and Merchant, 2015). Specifically, the power of induced beta oscillations in the auditory cortex is dynamically modulated according to the temporal pattern of beats (Fujioka et al., 2012), such that beat-related induced beta power decreases after the beat and then increases preceding the next beat as depicted in **Figure 1A**. However, it is not known whether beta oscillations encode the beat positions in metrical sequences with physically or subjectively accented beats (i.e., "upbeat" and "downbeat") and whether this is accomplished in a predictive manner or not.

In a recent study, Fujioka et al. (2015) used magnetoencephalography to examine the role of induced beta oscillations in representing "what" and "when" information in musical sequences with different metrical contexts, i.e., a march and a waltz. Musically trained participants listened to 12-beat sequences of metrically accented beats, where every second (march) or third (waltz) beat was louder, along with unaccented beats at the same intensity. The paradigm consisted of two phases: a perception phase, where accented beats were presented in march or waltz contexts and participants were required to actively perceive the meter, followed by an imagery phase, where unaccented beats were presented at a softer intensity and participants had to subjectively imagine the meter.

Similar to their previous study (Fujioka et al., 2012), the authors found that event-related beta desynchronization (ERD) follows the beat, i.e., the beta ERD response showed a sharp decrease after the stimulus, attained a minimum with a latency of ∼200 ms, and subsequently recovered with a shallow slope (see Figure 2 in Fujioka et al., 2015). This result has been previously demonstrated (Fujioka et al., 2009, 2012) and extended by other groups using electrophysiological recordings in humans (e.g., Iversen et al., 2009) and macaques (Bartolo et al., 2014; Bartolo and Merchant, 2015); other groups have also demonstrated a role for induced beta oscillations in time production (Arnal et al., 2015; Kononowicz and van Rijn, 2015). This result was valid for both the march and waltz conditions in the perception phase and, even more importantly, in the imagery phase, implying a top-down mechanism.


FIGURE 1 | Schematic depiction of the time course of induced beta oscillatory activity for a hypothetical sound sequence (indicated by vertical bars in gray, in the order, upbeat, downbeat, and upbeat), in accordance with the "predictive timing" and "event tagging" mechanisms. The presented pattern is based on previous studies such as Fujioka et al. (2012). (A) Predictive timing theory (e.g., Arnal and Giraud, 2012) suggests that beta power should peak before each sound, such that the rebound of beta power could be predictive of the timing of the upcoming sound, regardless of the salience of the sound. (B) A hypothesized predictive code that also encodes the identity of the salient events in a sequence may show modulation of the stereotypical beta ERD response in panel (A), expressed in terms of differential magnitude (here, greater beta suppression) before the salient event. As opposed to panel (A), beta power is not modulated in the same manner before upbeats and downbeats, allowing the encoding of "what" and "when" information in a manner consistent with the predictive timing framework (e.g., Arnal and Giraud, 2012). (C) Event tagging proposal (Iversen et al., 2009; Hanslmayr and Staudigl, 2014) suggests that beta power encodes accented events and should peak after the accented sounds, which is in contradiction with the predictive coding of "what" information depicted in panel (B).

Significantly, the authors claimed that the beta ERD response in the auditory cortex differentiates between the positions of the downbeat and the following beat (see Figures 3, 4 in Fujioka et al., 2015).

The novel result reported by Fujioka et al. (2015) is that the beta ERD response in auditory cortex can distinguish between accented beat positions in metrical sequences. However, the underlying mechanisms are far from clear. Fujioka et al. (2015) explain their results using predictive coding theory (Bastos et al., 2012), but an alternative "event tagging" mechanism (Iversen et al., 2009; see Repp, 2005 and Repp and Su, 2013 for a review on mechanisms for metrical processing) may also account for metrical interpretation of beat-based sequences.

We consider the results of Fujioka et al. (2015) in the light of these two mechanisms, i.e., predictive timing and event tagging (**Figure 1**). The predictive coding framework posits that beta oscillations are associated with anticipatory behavior and predictive coding (Arnal and Giraud, 2012). An internal model is established that conveys top-down predictions based on modulation of beta power with the goal of predicting the next event as shown in **Figure 1A**. However, if the predictive code were to also represent the identity of the salient event (i.e., what) in addition to its timing (i.e., when), one may hypothesize a modulation of the beta ERD response before the downbeat, which might be expressed in terms of differential magnitude as shown in **Figure 1B**. Such a response, specific to the downbeat, would predict both the identity and timing of accented beats.

According to the event tagging framework (Hanslmayr and Staudigl, 2014), beta oscillations encode salient events (e.g., downbeat) as depicted in **Figure 1C**. Specifically, desynchronization of induced beta power may reflect memory formation (Hanslmayr and Staudigl, 2014) or an active change in sensorimotor processing (Pfurtscheller and Lopes da Silva, 1999). During rhythm perception, where encoding the beat in memory is critical (Teki and Griffiths, 2014, 2016), the structure of the metrical accents and salient events may be represented by the depth of beta desynchronization. Therefore, the largest beta desynchronization may be expected to occur after the downbeat (**Figure 1C**). In the present study, the amount of beta desynchronization was found to be largest after the accented tones. The reported results are thus consistent with the event tagging framework, suggesting that subjectively and physically accented events invoke changes in the encoding of these events (Pfurtscheller and Lopes da Silva, 1999; Repp, 2005).

However, it is plausible that the predictive timing and event tagging mechanisms may operate in concert. To confirm this hypothesis, one needs to assess whether any prediction, implemented as beta rebound, occurs before the accented tones (**Figure 1B**). In the current study, it is difficult to determine whether there is robust beta synchronization before the accented tones. Careful observation of the results (Figures 3, 4 in Fujioka et al., 2015) suggests that beta power is not modulated before the downbeat in either of the two metrical conditions, in either the perception or the imagery phase, except for a weak effect for the waltz-perception condition in right auditory cortex.

It is therefore not evident whether beta ERD carries predictive information about salient events, in addition to their timing. Future studies should also focus on other (nonsensory) brain regions, such as the supplementary motor area or basal ganglia, that are implicated in encoding rhythmic patterns (Grahn and Brett, 2007; Teki et al., 2011; Crowe et al., 2014; Merchant et al., 2015), as it is possible that different regions may recruit distinct mechanisms.

Overall, this study has provided significant insights about the neural representation of musical sequences. Beta oscillations have repeatedly been shown to track the timing of events in sound sequences, but the present study highlights the question of whether they can also differentiate between beat positions, i.e., encode categorical information about the events. It is important to build upon the current results and identify the precise role of beta oscillations with respect to encoding of "when" and "what" information in natural sound sequences, and future research may benefit highly from the current study.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct, and intellectual contribution to the work, and approved it for publication.

### ACKNOWLEDGMENTS

ST is supported by the Wellcome Trust (WT106084/Z/14/Z; Sir Henry Wellcome Postdoctoral Fellowship). TK is supported by ERC-YSt-263584 awarded to Virginie van Wassenhove.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Teki and Kononowicz. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# A Comparative Analysis of the Universal Elements of Music and the Fetal Environment

#### David Teie\*

School of Music, University of Maryland, College Park, College Park, MD, USA

Although the idea that pulse in music may be related to human pulse is ancient and has recently been promoted by researchers (Parncutt, 2006; Snowdon and Teie, 2010), there has been no ordered delineation of the characteristics of music that are based on the sounds of the womb. I describe features of music that are based on sounds that are present in the womb: tempo of pulse (pulse is understood as the regular, underlying beat that defines the meter), amplitude contour of pulse, meter, musical notes, melodic frequency range, continuity, syllabic contour, melodic rhythm, melodic accents, phrase length, and phrase contour. There are a number of features of prenatal development that allow for the formation of long-term memories of the sounds of the womb in the areas of the brain that are responsible for emotions. Taken together, these features and the similarities between the sounds of the womb and the elemental building blocks of music allow for a postulation that the fetal acoustic environment may provide the bases for the fundamental musical elements that are found in the music of all cultures. This hypothesis is supported by a one-to-one matching of the universal features of music with the sounds of the womb: (1) all of the regularly heard sounds that are present in the fetal environment are represented in the music of every culture, and (2) all of the features of music that are present in the music of all cultures can be traced to the fetal environment.

Keywords: origin of music, womb, pulse, music, rhythm

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Colwyn Trevarthen, University of Edinburgh, UK Matz Lennart Larsson, Örebro University Hospital, Sweden

\*Correspondence: David Teie dteie@umd.edu

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 13 February 2016 Accepted: 20 July 2016 Published: 09 August 2016

#### Citation:

Teie D (2016) A Comparative Analysis of the Universal Elements of Music and the Fetal Environment. Front. Psychol. 7:1158. doi: 10.3389/fpsyg.2016.01158

## THE QUESTION OF THE ORIGINS OF MUSIC

It is reasonable to postulate that the characteristics of music that are common to the music of all cultures must have a common origin. Is there a correlation between the acoustic features of the common elements of music and the acoustic features of sounds that are present in the natural world? If so, is there a consistent and universal aspect of human development that would allow those sounds from the environment to be implanted as templates of recognition in the brains of peoples of diverse and widely separated cultures? Finally, if that common environment is prenatal, does the neurological development of the fetus allow for the absorption and retention of information from the womb? If the prenatal acoustic environment contains characteristic sounds that are the bases for some of the elements of music, then it should be possible to match, one-for-one, the sounds of the womb to those elements that can be found in the music of all cultures.

This hypothesis is intended to be neither complete nor exclusionary. The scope is restricted to the elements of music that may be traced to the sonic environment during fetal development; it does not attempt to address the various expressions and uses of music or narrative expression such as those that are found in social connections, communicative support, kinetic synchrony, and motivation. A variety of well-supported explanations for our enjoyment of music such as expectation, maternal and social bonding, and development of narrative through shared activity may be understood as parallel to and not contrary to the bases outlined below. The proposition that the origin of the common elements of music may be found in the sonic environment of the womb is not inconsistent with a host of other valid premises and observations concerning the developments of the many culturally diverse languages of music.

### OUTLINE OF HYPOTHESIS

This section outlines a proposal that there is a fetal origin of the fundamental building blocks of music. The sounds heard by the fetus for four months before birth may permanently etch the foundations of music into the collection of brain structures that form what is commonly known as the limbic system. These structures are primarily responsible for our emotions and are almost fully formed at birth (Huang et al., 2006). The features of fetal development that make it possible for the fetus to hear sounds, remember (subconsciously) those sounds in adulthood, and respond to them emotionally have been demonstrated in studies that are cited below. When these features are added together and compared to the construct of music, it is reasonable to conclude that the part of our brains that is responsible for emotions retains information from and recognizes the sonic environment of the womb.

The prevalent sounds of the fetal environment are pulse, respiration, footfalls, and the mother's voice. McDermott (2008) identified several universal properties of music: pulse, hierarchical organization of scales (tonality), infant-directed song, dance, and meter. To McDermott's list I propose adding: amplitude contour of pulse instrument, prevalence of discrete single-frequency units (musical notes), varied pitches and rhythms in the melodies (prosody), the 200–900 Hz frequency range of melodic instruments, and continuity. The following is an outline of the musical elements that are based on the sounds of the womb: pulse (pulse is understood as the regular, underlying beat that defines the meter), amplitude contour of pulse, meter, musical notes, elements related to prosody (syllabic contour, melodic rhythm, melodic accents, phrase length, and phrase contour), melodic frequency range, and continuity. All of these universal features of music can be traced to the womb and all of the sounds heard in the womb are represented in the music of all cultures. **Table 1** outlines the sounds heard in the womb and the corollary features of music.

### LIMBIC SYSTEM DEVELOPMENT MEMORY

In light of anatomical studies that have emphasized the interconnections between ventral limbic circuits and the motor control connections between striatum and motor cortex (Gunnar and Nelson, 1992), I propose that the acoustic information that pervades the development of structures responsible for our emotions, as well as structures near the brainstem responsible for repetitive movement, is the source and origin of pulse, meter, and rhythm in music. The elements of music that are most universal are musical representations of sounds that are heard during the time when the limbic structures are formed in the developing brain. The following conditions allow for the formation of lasting fetal acoustic memories: The human fetus is able to hear at 24 weeks, providing 4 months of constant sound exposure (Birnholz and Benacerraf, 1983) prior to birth. The sound of the maternal heartbeat is 25 dB above basal noise, dominating the fetal environment (Querleu et al., 1988). The maternal voice is heard in the uterus nearly four times more strongly than it is heard externally (Richards et al., 1992). In utero research and analysis has shown consistent evidence that the fetus responds to the sound of the mother's heartbeat (Porcaro et al., 2006) and that infants also respond to the prosodic features of speech (see below).

#### TABLE 1 | Sounds of the womb and their corollaries in music.

The combination of three features of human fetal development makes it possible for the sounds of the womb to provide a lasting template of recognition: (1) the dearth of competing sensory information in the fetal environment allows sound to be a primary source of varied and ever-present information entering the developing brain, (2) information that is well-organized when it arrives while a brain structure is plastic will tend to remain organized in the brain, and (3) the limbic structures are well-developed at birth. 3D brain imaging technology used by scientists in Baltimore (Huang et al., 2006) showed that the structures of the limbic system are almost completely formed at birth. The limbic fibers, the cingulum and the fornix, are two of the most dominant tracts in the fetal brain and their entire trajectories are already developed at 19 gestational weeks. "Early formation of the limbic system is well known, and it is expected that limbic fibers are well formed at 19–20 gestational weeks." – Huang et al. (2006, p. 36). The brain structures responsible for our emotions are able to retain information long before the upper cortical structures. A logical conclusion from these combined effects is that brain structures that are responsible for emotions and are well developed at birth may remember and later respond to sounds that resemble those of the fetal environment.

A regular and repeated pulse is one of the universal traits of music even though it is not found in human vocalizations. Intrauterine recordings taken in humans and animals have shown that the sounds of the mother's vocalizations, breathing, heartbeat, body movements, footfalls, and digestion are all audible to the fetus (Parncutt, 2006). In utero research and analysis has consistently shown that the fetus responds to the sound of the mother's heartbeat. Sound measurements taken in the womb have shown that the sound of the mother's heartbeat is 25 dB above the baseline noise (Querleu et al., 1988).
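To give a sense of scale for the decibel figures quoted above, the following is a minimal sketch that converts a level difference in dB into the corresponding intensity and sound-pressure ratios, and converts an intensity ratio back into dB. It uses only the standard definitions of the decibel; the function names are illustrative and do not come from the cited studies.

```python
import math


def intensity_ratio_from_db(delta_db: float) -> float:
    """Intensity (power) ratio for a level difference given in dB."""
    return 10 ** (delta_db / 10)


def pressure_ratio_from_db(delta_db: float) -> float:
    """Sound-pressure (amplitude) ratio for a level difference given in dB."""
    return 10 ** (delta_db / 20)


def db_from_intensity_ratio(ratio: float) -> float:
    """Level difference in dB for a given intensity ratio."""
    return 10 * math.log10(ratio)


# 25 dB is the heartbeat-over-baseline margin reported in the text.
heartbeat_margin_db = 25.0
print(round(intensity_ratio_from_db(heartbeat_margin_db), 1))  # 316.2
print(round(pressure_ratio_from_db(heartbeat_margin_db), 1))   # 17.8

# A fourfold difference in intensity, expressed in dB.
print(round(db_from_intensity_ratio(4.0), 2))                  # 6.02
```

Read this way, a 25 dB margin corresponds to roughly a 300-fold difference in intensity, while a fourfold intensity difference amounts to about 6 dB.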

The sound and sensation of maternal footfalls as experienced by the fetus may help explain the origin of the deep connection between dance and music. The interconnections between motor control and limbic structures mentioned above may provide the framework for retention of the tactile sensations of walking as felt in the womb in combination with the pulsing sounds of footfalls. The sonic-tactile association in the fetal brain will tend to be strong in keeping with the famous principle of neurological development that "neurons that fire together wire together". Since adults take about 10,000 steps per day (Tudor-Locke and Myers, 2001), the presentation of the sound/sensory stimuli to the fetus is ubiquitous under normal circumstances. Music is often played at tempos similar to walking (Changizi, 2011) and it is reasonable to propose that one of the bases for this connection results from the concurrent stimuli of the tactile and sonic sensations of maternal footfalls informing the developing brain of the fetus.

The regular and constantly repeated sounds of maternal respiration create a repetitive acoustic framework that is broader and less specific than the heartbeat and possibly provides a basis for the recognition of more generalized pulse in music. Instruments that create amplitude contours resembling that of a heartbeat, such as drums made from stretched material over a resonating cylinder and struck with a beater, commonly keep the musical pulse.

The narrow spectrum of types of sounds that we humans respond to becomes apparent when we compare our time scale to other species. The ruby-throated hummingbird has a resting heart rate of 615 beats per minute (Odum, 1945). If hummingbirds had their own music, it seems likely that it would be incomprehensibly fast to our ears. The repetition rates of the pulses found in human music, on the other hand (40–240 beats per minute), coincide with the slowest (respiration) and fastest (footfalls of running) pulses that can be heard in the womb. Human music seems to be built to the human scale.
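The comparison of time scales can be made concrete by converting each rate into the time between successive beats. The sketch below is a simple illustration using only the rates quoted in the text; the function name and labels are illustrative.

```python
def beat_period_seconds(bpm: float) -> float:
    """Time between successive beats, in seconds, for a rate in beats per minute."""
    return 60.0 / bpm


rates = [
    ("slowest musical pulse", 40),
    ("fastest musical pulse", 240),
    ("hummingbird resting heart rate", 615),
]

for label, bpm in rates:
    period_ms = beat_period_seconds(bpm) * 1000
    print(f"{label}: {bpm} bpm -> {period_ms:.0f} ms per beat")
```

At 615 beats per minute the interval between beats is under a tenth of a second, well below the 250 ms interval of the fastest pulse in the range given for human music.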

The fetal environment provides little information that is significantly varied for the senses of sight, smell, or taste. Hearing is the exception to this dearth of information; sounds are always present. The patterns of speech heard in the womb are constantly changing and are available to be learned by the fetus. The heartbeat, although more consistent than speech or footfalls, also provides varied information that is constantly available to the fetus.

### THE DRUM

The drum is an example of an instrument that was invented and adopted for use in music in every culture. The similar construction of drums from widely separated cultures demonstrates the common recognition of the sound of pulse that informed the development of the features of the instrument. The amplitude contour of the drum is such that it conforms to the sound of the pulse as heard from the womb. Over the centuries of the development of instruments, musicians found that stretching a skin over something round makes the sound die away more slowly (longer decay). It was also discovered that elongating the round skin holder into a cylinder made the sound more resonant and could give it a pitch (the tube resonates and lengthens the decay even more). And if it were struck with a cushioned beater the beginning of the sound would be less sudden (longer onset). Each of these modifications was made in order to create a pulse sound that we perceive as "good". Recognition is the key point here. Why does a drum sound "better" than a stick hitting a table? The answer may be that we recognize the drum sound: each of us who can hear heard it constantly for four months before we were born.

The construction of a drum enables it to create a heartbeat-like amplitude contour of the pulse instrument. Drums have been similarly constructed in many different cultures. (1) The onset of sound is graduated by a cushioned beater, a stretched animal skin, or both. (2) The decay of sound is elongated with a resonating chamber.

Geographically separated peoples came up with the same basic drum design. Three thousand years ago as the Greeks were drawing images on vessels of people playing the tympanon, the Chinese of the Shang Dynasty were making drums out of clay and stretched alligator skin. These parallel developments occurred well before the silk route had established a connection between the cultures of the East and West. Meanwhile the cross-rhythms of the Niger-Congo peoples were being played on Djembe drums that would not be seen by Europeans until the 15th century. That was around the time Westerners began invading the Americas where they found that drums had been made by all of the aboriginal Americans: from the Inuit and Ojibwe in the North, to the Lakota in the plains, the Aztecs between the continents and the Incas in the South.

The parallel development, in widely separated cultures, of a musical instrument with a singular, basic design that was meant to be struck in repeating patterns of pulse and meter suggests a common root among all people. This commonality must be fundamental enough to supersede all linguistic, racial, and cultural differences. Furthermore, since all emotional responses begin with recognition, and since the preference for the sound of the drum is universal, it is most logical to conclude that the template of sound that is recognized was formed in an environment that we all share. Emotional responses follow salient recognition; it would seem that the sound created by a drum triggers such recognition.

Below are a few figures and descriptions that outline the similarities between the amplitude contour of the human heartbeat and the drum:

**Figure 1A** above shows the spike of a transient sound made by one stick hitting another; this sounds "not so good" to us. For comparison, **Figure 1B** shows the amplitude contour of a human heartbeat. The amplitude contour of **Figure 1C** is created by a cushioned beater striking a drum. When skin is stretched over a round collar and placed over a resonating cylinder, the resonance of the cylinder amplifies and elongates the decay of the sound. Striking the skin with a cushioned beater creates a graduated onset. The resulting amplitude contour resembles that of the heartbeat as heard in the womb (onset 0.02 s, decay 0.06 s). In its completed form, when the onset is graduated as well as the decay, this sounds "just right" to us. The feelings of "good", or "just right", like all emotional responses, begin with recognition. It is possible that we recognize the sound that was imbued in our growing brains.
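The contrast between an abrupt transient and a graduated contour can be sketched numerically. The example below is only an illustration: it assumes a piecewise-linear envelope (a straight rise followed by a straight fall), which is a simplification and not the measured shape. The 20 ms onset and 60 ms decay are the heartbeat values quoted above; the 1 ms onset and 5 ms decay used for the stick-like transient are hypothetical values chosen purely for contrast.

```python
def envelope(t_ms: int, onset_ms: int = 20, decay_ms: int = 60) -> float:
    """Amplitude (0.0 to 1.0) at t_ms milliseconds after the sound begins.

    Assumes a linear rise over onset_ms followed by a linear fall over decay_ms.
    """
    if t_ms < 0 or t_ms > onset_ms + decay_ms:
        return 0.0
    if t_ms < onset_ms:
        return t_ms / onset_ms
    return 1.0 - (t_ms - onset_ms) / decay_ms


times = range(0, 90, 10)

# Heartbeat-like contour: onset 0.02 s, decay 0.06 s (values from the text).
heartbeat = [round(envelope(t), 2) for t in times]

# Stick-like transient: hypothetical 1 ms onset and 5 ms decay.
transient = [round(envelope(t, onset_ms=1, decay_ms=5), 2) for t in times]

print("t (ms):   ", list(times))
print("heartbeat:", heartbeat)
print("transient:", transient)
```

Sampled every 10 ms, the heartbeat-like contour rises to its peak at 20 ms and fades out by 80 ms, whereas the hypothetical transient is already over before the first 10 ms sample.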

Tactile response is present in contemporary music that is presented in high decibel levels. The musicians of these genres are seeking sound levels that are not only able to be heard, but are also able to be felt. In an interview with the author, Dickie Peterson, one of the original band members of a predecessor of heavy metal music, reported that they wanted to create music that could be felt. The desire for tactile sense of music may be traced to the sensations of the womb. The maternal heartbeat creates a pressure wave that not only can be clearly heard but may also be felt by the fetus. The dense, liquid-filled environment of the womb allows for a pressure wave emanating from the maternal heartbeat to travel through the tissue of her womb into and through the body of the fetus.

### METER

The differently paced and constantly overlapping pulses of heartbeats, inhalation, and exhalation create patterns of strong and weak beats. In the womb the inhalation is louder than the exhalation. When the inhalation sounds simultaneously with the heartbeat, the combination forms the strongest sound, exhalation with the heartbeat is somewhat strong, and the heartbeat alone is weakest. As a result, it is reasonable to conclude that music has evolved to include repeating patterns of strong and weak beats that resemble those in the fetal environment. The combination of heartbeats and breathing heard in the womb may be the basis for musical meters.

From this perspective, the combinations of strong and weak pulses found in the primary meters would be derived from the sounds of respiration combined with the sound of the heartbeat. Strong-weak is duple meter. **ONE** two THREE four (1 strongest − 2 weak − 3 strong − 4 weak) is known as "common time" in Western music. When respiration and heartbeat are combined: 1 inhalation + heartbeat – 2 heartbeat alone − 3 exhalation + heartbeat − 4 heartbeat alone, the result is common time and this is consistent with normal human heart and respiratory rates (approximately four heartbeats/respiratory cycle).
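The derivation of common time from overlapping heartbeat and respiration can be sketched as a small calculation. In the example below, the numeric weights are hypothetical and serve only to rank the sounds as described (inhalation louder than exhalation, each adding to the heartbeat); the sketch also assumes an even number of heartbeats per respiratory cycle, with inhalation on the first beat and exhalation halfway through.

```python
from typing import List

# Hypothetical loudness weights, chosen only to preserve the ordering in the text.
HEARTBEAT = 1
EXHALATION = 1
INHALATION = 2


def accent_pattern(beats_per_breath: int = 4) -> List[int]:
    """Relative strength of each heartbeat within one respiratory cycle."""
    halfway = beats_per_breath // 2
    pattern = []
    for beat in range(beats_per_breath):
        strength = HEARTBEAT
        if beat == 0:
            strength += INHALATION   # inhalation coincides with the first beat
        elif beat == halfway:
            strength += EXHALATION   # exhalation coincides with the middle beat
        pattern.append(strength)
    return pattern


print(accent_pattern())  # [3, 1, 2, 1]
```

With four heartbeats per breath the result is strongest, weak, strong, weak, which matches the ONE two THREE four stress pattern of common time.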

The prevailing duality of pulse in Western music is the same duality found in the human rhythms of heartbeats, breathing, and walking. When respiration and heartbeat are combined we have (**Table 2**):




Perhaps common time is common to us all. Theoretically, the possible combinations of tempos and stresses that are available are infinite, yet the spacing, stresses, and tempos of common time are the very same as the combined pulsations that are often heard in the womb. Other meters that are used in music can also be traced to other combinations of the strong and weak beats created intermittently by the varied combinations of heartbeat and respiration. The ratio of (1) STRONG – (2) weak is 2/4 time, and (1) STRONG – (2) weak – (3) weak is 3/4 time.

Footfalls also create pulses that can be heard and felt by the fetus and may combine with the heartbeat to create overlaps that also augment the strength of some beats. Naturally, not all mothers have the same heart/respiration ratios and the synchronicity of the beats is constantly changing. The heartbeat is faster when walking and the pressure waves created by the footsteps of a pregnant mother are audible to the fetus (Parncutt, 2006). The faster pace of the heartbeat combined with the even pulses of footfalls creates a relatively quick 2/4 meter. Even the rarely heard footfalls of running may be found in the developing brain's list of recognized combined pulses.

A weak beat placed where the silence occurs between the duple pulses of the heart creates a triple meter. Here is a visual approximation using "lubb, dub", the traditional vocal approximation of the sound of the heartbeats used by physicians:


These mixed meters may also be traceable to the fetal sonic environment. 6/8 is a combination of triple/duple (**1** 2 3 **4** 5 6) combining the LUBB – dub – silence of a quickened heartbeat with inhalation and exhalation.

It has been noted that the music of some cultures, such as some Balkan cultures, does not use common meters. Irregular meters may also be derived from the combination of heartbeat and respiration. It should be noted that asymmetrical rhythms are not common, even in Balkan music; the rhythms of Balkan music are primarily duple. Generally there is a distinction between underlying pulse rhythm and melodic rhythm, but this may be a case where they mingle. The Balkan exception may be due to the melodic rhythm in the language, since Greek allows for more irregularities between stressed and unstressed syllables than English (Patel, 2008).

### MELODY

The maternal voice heard by the fetus in the womb may provide the foundation of musical notes and melody. The pitches created by the vocal cords of the mother are distinct and loud in the fetal environment. A team of researchers from the University of Florida headed by Douglas Richards managed to convince eight bedridden mothers-to-be to have microphones inserted into the uterus and placed near the head of the fetus. The mothers were asked to speak in a loud voice as the intrauterine sound level was recorded. They found that the average mother's voice in the womb is 77.2 dB, nearly four times greater than the intensity measured in the air at a distance of 24 inches (Richards et al., 1992). A spoken sentence is heard in the womb as a pattern of discrete pitches in a variety of melodic contours and rhythms.

### DISCRETE, SINGLE-FREQUENCY SEGMENTS (MUSICAL NOTES)

Discrete, single-frequency segments (notes) are found in the music of all cultures. The mother's speech that is heard in the womb consists primarily of single-frequency segments created by the vowels between the consonants (Querleu et al., 1988). These units may provide the singular basis for notes in music. We do not find music from any culture that consists primarily of sliding pitches. Mammalian vocalizations generally consist of syllables that have contoured frequencies (sliding pitches), such as a cat's meow or a dog's submissive whimper, as well as human vocalizations such as moaning and weeping (Parvizi et al., 2001). Despite this preference for contoured frequencies in emotional vocalizations, human music contains a preponderance of discrete single-frequency units.

This variance between the characteristics of emotionally generated vocalizations and the characteristics of music might be explained by the womb origin of music. The prevalence of discrete single-frequency segments is a feature of the music of all cultures. One of the prominent features of Schubert's melodic style is that he gave each syllable in the lyrics only one note. Irving Berlin, who was described by George Gershwin as "America's Schubert", used the same one-to-one note/syllable standard as Schubert. Accordingly, most languages are made up of predominantly single-pitch segments separated by consonants.

The preference for discrete single-frequency segments in music may be accounted for when acoustical properties of the womb are considered. The middle and lower frequencies of the mother's speech emanating from the maternal vocal cords are carried directly to the ears of the fetus through the medium of the liquid and tissue of the womb, which is approximately five times more efficient than air. The high frequency sounds of the consonants such as "ch", "t", "s", and "sh" formed at the opening of the mouth are significantly attenuated when transferred from the air to the tissue and further attenuated by the absorption of sound by the surrounding tissues in the womb. Due to this attenuation, the consonants of speech are nearly inaudible in the womb but the "melody" of the pitches created by the vowels between the consonants is clearly audible and would sound something like humming discrete single-frequency pitches.

When we compare the features of maternal speech as it is heard in the womb to the features of melodies that are found in the music of a wide variety of cultures we find compelling similarities. Indeed, it is a supportable assertion that the contours and prosody of specific languages as heard by the fetus are nearly identical to the melodic contours of the music associated with that language. All of the commonly found features of musical melodies are present in the mother's voice as it is heard in the womb. Speech is produced in predominantly consonant intervals and contains implied tonalities (Schwartz and Purves, 2004; Bowling et al., 2009). As a consequence, the melodies heard in the womb are primarily harmonically consonant.

The prosody of languages may form the bases for melodic treatment in music. Newborns of French mothers prefer the sound of the French language to Russian (Mehler et al., 1988). The newborns still prefer the French language when the speech is filtered to remove the consonant and vowel sounds in order to present maternal speech as it is heard from inside the mother, retaining only the melody, but they do not show a preference for the melody of the French language when played backward, implying that a fetus is able to recognize intervallic relationships and melodic contours.

Evidence for fetal learning of the melodic contours of maternal speech is also found in the cries of newborns that emulate the melodic contours of the mother's language (Mampe et al., 2009). Kathleen Wermke of the University of Würzburg in Germany studied five-day-old infants and discovered that they use the contours of their own mothers' language in their cries (Mampe et al., 2009). Words and combinations of words create recognizable rhythms that are found in the melodic rhythms of musical motives. Cultures whose languages have accented syllables also have corollary accents in their melodies. For example, the definite articles in the Germanic and Romance languages (the sea, die See, la mer) are heard in the musical upbeats at the beginning of many melodies. The music of cultures whose languages do not contain definite articles rarely has musical upbeats to its melodies. Note the preference for beginning melodies on the beat in the music of Mussorgsky (Russian) and Dvorak (Czech). A number of other commonalities have been found between the melodic rhythms of a culture and the speech rhythms in its language (Huron and Ollen, 2003; Patel and Daniele, 2003).

### FREQUENCY RANGE OF MELODIC INSTRUMENTS

The womb origin of melody might also explain the range of the most commonly used melodic instruments. The frequency range of melodic instruments in a wide variety of cultures is roughly 200–900 Hz, the same as the frequency range of an adult human female voice. The following instruments have strings tuned to pitches in this range: the West African kora, the Chinese guqin, the Indian sitar, the North American Apaches' tsii'edo'a'tl, and the European violin. The flute may be even more ubiquitous. Fossilized bone flutes have been found that date back more than 30,000 years (Atema, 2004). Many musical instruments have been invented and modified over the years to create the kinds of resonance-enhanced richness that remind us of the human voice.

### CONTINUITY

The underlying beats may also provide another hidden universal characteristic of music: nearly all of the music heard in all cultures is continuous. While speech stops and starts, music rarely presents anything other than an unbroken stream of sound. Continuity is key to musical appreciation, attention, and subconscious recognition. The acoustic environment of the womb also provides a constant stream of sound that could provide the foundation of recognition that music builds upon. Even when melodies are presented in phrases that are separated from one another, there is continuity in the underlying pulse and accompanimental patterns. In the fetal environment, the maternal voice comes and goes, but the pulse and meter always remain. There are exceptions to the continuity of music, but the silence is typically brief and the tempo of the pulse usually remains.

### TESTS OF THE PRENATAL ROOTS OF MUSIC

To test whether the elements of music are based upon the sounds of the environment during brain development, I conceptualized two studies that compared the responses of a species to music composed for humans and music composed specifically for that species. These studies were carried out by Dr. Charles Snowdon and colleagues at the University of Wisconsin-Madison. If the concept is correct, it should be possible to create music that would be effective for another species that is based upon the sounds that are present as the brain structures of the limbic system of that species are being formed.

The first study compared the responses of cotton-top tamarin monkeys to four types of music: aggressive human music, calming human music, aggressive tamarin music, and calming tamarin music (Snowdon and Teie, 2010, 2013). In addition to the use of instrumental music that was based on vocalizations of each species, I included in the calming music for the tamarin monkeys a pulse that was equal in pacing to the resting heart rate of an adult monkey (200–220 beats per minute). Since the brain of a monkey is 60% of its adult size at birth, I included instrumental representations of the sounds of the womb in the music for the tamarins. As had been the case in most other tests of the effects of human music on other species, the tamarins were indifferent to the human music (McDermott and Hauser, 2007). The calming tamarin music was effective in calming the monkeys and the arousing music led to increased movement and behavior indicative of anxiety. There was one significant exception to the tamarins' indifference to the human music: the only response that was elicited by the human control music was that they were calmed by the aggressive human heavy metal music that has a pulse of 200 beats per minute. This apparent anomaly supports the proposition that the sonic characteristics of the maternal pulse may have a calming effect when presented in music.

The second study was conducted on cats (Snowdon et al., 2015). Since the brain of a newborn cat is only 1/8 the size of what it will be at 10 weeks, the sounds of the womb were presumed to be not salient for cats. A reward-related sound that is present during the development of the cat's brain is the sound of suckling. The music for cats, therefore, included musical instruments designed to resemble the sound of suckling as it would be heard by a nursing kitten. The data on the effect of species-specific music on cats from this study were even stronger than in the tamarin monkey study.

### PROBABILITY THAT THE FEATURES COMMON TO THE WOMB AND MUSIC ARE COINCIDENTAL

A demonstration of the statistical probability that these parameters are connected in a causal relationship would strongly support the theory of the prenatal roots of music. In order to assess that probability it is necessary to deduce the probability that the similarities between the elements of music and the sounds present in the fetal environment are due to coincidence. The first consideration to be accounted for is the number of variables. How many variable features are there in each element? Below is a brief outline of the variables for each element.

Pulse:


Meter:

• The combinations of the accented beats that comprise meters create an infinite number of possible patterns.

Frequency range of melodic instruments:

• The frequency range of melodic instruments would most likely center on our most sensitive hearing range that is used to produce consonants: from 2−4 kHz. Despite this sensitivity, our melodies are usually in the fairly narrow range of 200 to 800 Hz. (This is noted to refute the common assumption that the most common melodic instruments sound in the frequency range of the treble register because it is easier for people to hear in that range).

The determination of coincidence would also need to include the probability that the combination of the above-listed variables is coincidental. Finally, it would be necessary to assess the probability that these elements are similar not only in the music of one culture, but in the music of every culture.

Since it is impossible to provide data on the probabilities that a given sonic element would or would not be present in either music or the fetal environment, it is also impossible to present the possibility of the coincidence of the elements as a statistical probability. However, given the large number of variables and the ubiquity of the commonalities between the sounds of the womb and the music of all cultures, the probability that they are similar by coincidence would seem to be astronomically remote.

### REFERENCES

Atema, J. (2004). Old bone flutes. Pan J. Br. Flute Soc. 23, 18–23.

### CONCLUSION

The questions of the origins of music presented at the beginning of this article are plausibly answered by the prenatal environment. There appears to be a correlation between the acoustic features of the common elements of music found in diverse and widely separated cultures and the acoustic features of sounds that are present in the womb. The neurological development of the fetus allows for the absorption and retention of those sounds to be implanted as templates of recognition in the brain.

I propose that the acoustic parameters of the sounds in the womb are the same as the parameters of universal characteristics of music. It is possible to match, one-for-one, the sounds of the womb to those elements that can be found in the music of all cultures.

### AUTHOR CONTRIBUTIONS

The author confirms being the sole contributor of this work and approved it for publication.

### FUNDING

No institutions have provided funding for any of the work leading to this submission. All of the research and development of the theory was engaged in privately, without compensation or funding.



**Conflict of Interest Statement:** The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Teie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# What Pinnipeds Have to Say about Human Speech, Music, and the Evolution of Rhythm

Andrea Ravignani<sup>1,2</sup>\*, W. Tecumseh Fitch<sup>3†</sup>, Frederike D. Hanke<sup>2†</sup>, Tamara Heinrich<sup>2†</sup>, Bettina Hurgitsch<sup>4†</sup>, Sonja A. Kotz<sup>5,6†</sup>, Constance Scharff<sup>7†</sup>, Angela S. Stoeger<sup>3†</sup> and Bart de Boer<sup>1</sup>

<sup>1</sup> Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium, <sup>2</sup> Sensory and Cognitive Ecology, Institute for Biosciences, University of Rostock, Rostock, Germany, <sup>3</sup> Department of Cognitive Biology, University of Vienna, Vienna, Austria, <sup>4</sup> Chemnitz Zoo, Chemnitz, Germany, <sup>5</sup> Basic and Applied NeuroDynamics Lab, Department of Neuropsychology and Psychopharmacology, Maastricht University, Maastricht, Netherlands, <sup>6</sup> Department of Neuropsychology, Max-Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany, <sup>7</sup> Department of Animal Behavior, Institute of Biology, Freie Universität Berlin, Berlin, Germany

#### Edited by:

Virginia Penhune, Concordia University, Canada

#### Reviewed by:

Hugo Merchant, Universidad Nacional Autónoma de México, Mexico

Aniruddh Patel, Tufts University, USA

#### \*Correspondence:

Andrea Ravignani andrea.ravignani@gmail.com

† These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

Received: 04 March 2016; Accepted: 31 May 2016; Published: 20 June 2016

#### Citation:

Ravignani A, Fitch WT, Hanke FD, Heinrich T, Hurgitsch B, Kotz SA, Scharff C, Stoeger AS and de Boer B (2016) What Pinnipeds Have to Say about Human Speech, Music, and the Evolution of Rhythm. Front. Neurosci. 10:274. doi: 10.3389/fnins.2016.00274

Research on the evolution of human speech and music benefits from hypotheses and data generated in a number of disciplines. The purpose of this article is to illustrate the high relevance of pinniped research for the study of speech, musical rhythm, and their origins, bridging and complementing current research on primates and birds. We briefly discuss speech, vocal learning, and rhythm from an evolutionary and comparative perspective. We review the current state of the art on pinniped communication and behavior relevant to the evolution of human speech and music, showing interesting parallels to hypotheses on rhythmic behavior in early hominids. We suggest future research directions in terms of species to test and empirical data needed.

Keywords: evolution of speech, evolution of music, evolution of language, vocal learning, entrainment, timing, synchronization, seal

### THE HUMAN SENSE OF RHYTHM FROM A COMPARATIVE PERSPECTIVE

Humans are particularly vocal and musical animals. They flexibly learn new vocalizations and easily perceive and move to rhythm (Bolton, 1894; Fitch, 2009)<sup>1</sup>. Why do humans show these two traits that have only been described in relatively few other animals? Previous research led to conflicting hypotheses on how evolution has shaped human brains and physiology to produce complex vocalizations (Richman, 1993; Fitch, 2000; Galantucci et al., 2006; Fitch and Jarvis, 2013; Manson et al., 2013). Several contrasting hypotheses also exist on how and why human and other animals' brains can perceive complex rhythmic patterns (Merker et al., 2009; Honing et al., 2012; Merchant and Honing, 2013; Patel and Iversen, 2014; Ravignani et al., 2014a). Crucially, these hypotheses differ on assumptions about social structure, ecological conditions, and audiomotor abilities present in early hominids, also providing discordant predictions on rhythm and vocal learning skills in different living species (for reviews see Ravignani et al., 2013a, 2014a; Iversen, 2016; Wilson and Cook, 2016). An influential hypothesis in the field, the vocal learning beat perception and synchronization hypothesis (Patel, 2006), states that vocal production learning (VPL) is a prerequisite for species to be able to extract a pulse from periodic acoustic events (like an internal metronome), and use this inferred pulse to synchronize movements to these external events in a predictive and flexible way (rhythmical entrainment). In fact, neural pathways between auditory and motor areas of the brain, which originally evolved for VPL, would also enable precisely timed movements to sounds (Kuypers, 1958, 1973; Jürgens et al., 1982). Only a few species are capable of VPL: that is, to modify existing vocalizations and to imitate novel sounds not belonging to their innate repertoire (Janik and Slater, 2000; Van Parijs et al., 2003). Humans, bats (Boughman, 1998; Knörnschild et al., 2010; Vernes, 2016), elephants (Poole et al., 2005; Stoeger et al., 2012), seals (Ralls et al., 1985), dolphins (Reiss and McCowan, 1993; Favaro et al., 2016), and whales (Foote et al., 2006), together with many bird species (Marler, 1970; Todt, 1975; Marler and Peters, 1977; Scharff and Nottebohm, 1991), have been shown capable of vocal learning (Schusterman, 2008; Petkov and Jarvis, 2012; Nowicki and Searcy, 2014).

<sup>1</sup>Rhythm is defined as a "serial pattern of durations marked by a series of events" (McAuley, 2010, p. 166).

Model species can be used to test hypotheses on how our ancestors evolved the neuropsychological prerequisites underpinning speech and music (see also Vernes, 2016). One can either pick model species that are closely related to humans, and hence should share a specific trait by common ancestry (homology), or species that have a similar socioecology to humans, and hence independently evolved a similar trait by convergent evolution (analogy). If a living animal (i) shares much of its evolutionary history with humans, or (ii) was exposed to environmental conditions and evolutionary pressures similar to those of early hominids, then commonalities in selected behavioral traits may exist between the two (Fitch, 2010, 2014). This comparative approach is a powerful way of addressing questions such as (a) how humans acquired complex rhythmic and vocal imitation capacities, and (b) why distantly related species, but not our closest primate relatives, evolved these capacities. Several biological factors may provide an answer to these questions, including brain anatomy, body morphology, social structure, habitat, and ecology. Hence, suitable model species to investigate rhythm and VPL in our human lineage should, first and foremost, exhibit rhythm and VPL, and ideally be as close as possible to humans in anatomical, ecological, and evolutionary terms. To test the vocal learning beat perception and synchronization hypothesis against alternative ones, we suggest below why pinnipeds (including vocal and less vocal species) provide an excellent group of model species.

### PINNIPEDS: MORE VOCALLY FLEXIBLE THAN PRIMATES, PHYLOGENETICALLY CLOSER TO HUMANS THAN BIRDS

Traditionally, VPL and rhythmic behavior have been investigated in primates, parrots or songbirds. Monkeys and non-human apes, like chimpanzees, are evolutionarily and cognitively close to humans, but exhibit limited vocal imitation and rhythmic patterning skills (Janik and Slater, 1997; Ravignani et al., 2013a; Repp and Su, 2013; see Gamba et al., 2016, for timing in lemur singing). In contrast, many bird species are excellent at learning to imitatively produce new vocalizations (Petkov and Jarvis, 2012). Moreover, when tested on nonvocal rhythmic tasks requiring precise temporal coordination, birds outperform primates, although direct primate-avian comparisons on identical tasks are lacking at present (Nagasaka et al., 2013; Hoeschele et al., 2015; Benichov et al., 2016; ten Cate et al., 2016). However, the last common ancestor of birds and humans lived about 300 million years ago (Kumar and Hedges, 1998), and birds have evolved a vocal production system (the syrinx) quite different from the human larynx (Fitch, 2010; Elemans et al., 2015). Hence, primates and birds each have only one of the desirable features to understand rhythm and VPL: non-human primates are evolutionarily close to humans but exhibit scarce rhythm and VPL capacities, while birds have rhythmic capacities and VPL but are evolutionarily distant from humans.

A third taxonomic group, previously overlooked in comparative research on human evolution (cf. Cook et al., 2013; Rouse et al., 2016), may be the solution to this conundrum. Pinnipeds exhibit VPL and rhythmic abilities (**Table 1**), and as mammals they are evolutionarily closer to humans than birds: the last common ancestor of humans and pinnipeds lived about 65 MY ago (O'Leary et al., 2013). This clade includes more than 30 species of semiaquatic mammals divided into three families: Phocidae (e.g., harbor and gray seals), Otariidae (e.g., California sea lions and Cape fur seals), and Odobenidae (walruses). Pinniped phylogeny is controversial. However, recent molecular evidence suggests that the first split, separating Phocidae from other pinnipeds, occurred 33 MY ago (Arnason et al., 2006). This relatively old common origin (compare it with the 33 MY between humans and, e.g., capuchin monkeys; Glazko and Nei, 2003) has provided ample time to adapt to many different ecological niches and environmental constraints. Accordingly, pinniped species exhibit variation in VPL capacities, social organization, mating systems, and habitats (**Table 1**). These dimensions conveniently have anthropological equivalents, each of them deemed crucial for at least one hypothesis on the evolution of speech and music (Fitch, 2000; Hagen and Bryant, 2003; Patel, 2006; Hagen and Hammerstein, 2009; Merker et al., 2009; Petkov and Jarvis, 2012; Merchant and Honing, 2013; Patel and Iversen, 2014; Ravignani, 2014; Ravignani et al., 2014a,b; for a comparative definition of speech).

Notably, among the pinnipeds, harbor seals (Phoca vitulina) exhibit an excellent trade-off between VPL abilities and phylogenetic proximity to humans: among vocal learners, harbor seals have the vocal apparatus closest to that of humans (Schneider, 1962; Schneider et al., 1964; Ralls et al., 1985; Fitch, 2000; **Table 1A,B**). A human-raised harbor seal has even learned to imitate some human words and phrases (Ralls et al., 1985; **Table 1C**). So far, harbor seals have not been tested for rhythmic entrainment abilities; however, another pinniped species, the California sea lion (Zalophus californianus), was shown capable of non-vocal audiomotor synchronization with precision previously exhibited only by avian species and humans (Cook et al., 2013; Rouse et al., 2016; **Table 1G**). With these few exceptions, pinniped communication, rhythm, and human speech have mostly remained unconnected areas of research until now. However, a lot of information is available on pinnipeds' natural vocal behavior, making the comparative study of pinniped communication and human speech a field ripe for research. We suggest that pinnipeds are ideal species to understand human speech, rhythm, and complex VPL at different levels (including physiology, behavior, neurobiology, and genetics). Pinnipeds' vocal anatomy, brain evolutionary history, socio-ecology, and broad range of environmental conditions conveniently map to human biology (Schneider, 1962; Ralls et al., 1985; Riedman, 1990; Van Parijs et al., 1999, 2003; Schusterman, 2008; Cook et al., 2013; Sauvé et al., 2015a,b; **Table 1**).

TABLE 1 | Features of human speech and music (first column) are related to findings in pinniped biology (second column) to draw comparative conclusions and suggest further research (third column).

Then, why do humans and harbor seals produce flexible vocalizations? Taking ultimate and proximate causes into account and adopting a comparative approach (**Table 2**), we suggest several strands of empirical research in pinnipeds, which can shed light on the evolution of human rhythmicity.

### FUTURE RESEARCH: WHAT SPECIES TO TEST NEXT, AND IN WHICH TASKS?

### Vocal Production Learning

Pinnipeds produce many types of vocalizations, which can be recorded in air, enabling acoustic data collection with precise individual identification. Research in harbor seals, building on existing evidence on vocal imitation (Ralls et al., 1985), should investigate their ability to learn vocalizations (i) over developmental phases, and (ii) from each other in a social network (Janik and Slater, 2000; Tyack, 2008; **Table 1A–F**). This will reveal how seal vocalizations are imitated and transformed (Fitch, 2015b), and whether this resembles human speech. In parallel, vocal flexibility in otariids should be investigated across species, testing their ability to imitate new sounds. This should provide evidence for or against VPL capacities in this pinniped family, considered until now the least vocally flexible. While performing this research, it will be important to keep an open-minded attitude toward vocal learning, as this seems to be a graded ability rather than an all-or-none trait (Petkov and Jarvis, 2012; Fitch, 2015a).

Comparative vocal and brain anatomy in pinnipeds can be fruitful strands of research (**Table 1B,C,F**). The angle of the vocal folds with respect to the tracheal air stream is 76° in harbor seals but only 17.5° in sea lions (Schneider et al., 1964). Sea lions' vocal fold angle is thus closer to that of elephants (45°), whereas harbor seals' angle is closer to that of humans (90°) than to that of sea lions (Herbst et al., 2013). Does this difference in vocal anatomy map to a difference in the types of sounds produced, or just in the modalities of sound production?

Neuroanatomy may constitute a fruitful research avenue to understand the mechanisms behind successful entrainment in California sea lions. Although the shape of their brain is similar to that of other carnivores, analyses of brain folding show remarkable differences. In particular, California sea lions have more secondary folds and sulci, and a radically different pattern of folds and fissures, than other carnivores such as canids (e.g., dogs, wolves, coyotes) and mustelids (e.g., minks; Montie et al., 2009). This suggests that evolutionary pressures, and potentially similar mechanisms, increased the size of the neocortex in sea lions, showing an interesting parallel to human evolution. A further open question is how the evolution of different brain structures relates to VPL (Patel, 2014) and social organization across pinniped species. Comparative brain anatomy and imaging will elucidate whether evolutionarily old brain circuits subserving VPL are still present in vocally inflexible pinnipeds, such as sea lions (Patel, 2014).

### Interval Timing and Synchronization

Timing experiments often investigate the attentional and cognitive processes involved in perceiving or estimating single time intervals, either independently or by comparison with a second interval (Grondin, 2010). These experiments have, for instance, shown similarities and differences between humans and other primates in estimating single interval durations in the visual and auditory modality (Merchant et al., 2003; Zarco et al., 2009; Mendez et al., 2011). In pinnipeds, recent data show that a harbor seal and a Cape fur seal (Arctocephalus pusillus) can accurately discriminate time intervals in the visual modality (Heinrich, 2013; **Table 1I**). In contrast, rhythm refers to the structure of multiple durational events, i.e., sequences of time intervals. Hence, single-interval timing research is essential (Merchant and Honing, 2013) though not enough to understand rhythm perception: in fact, perception of one interval influences perception of adjacent intervals (McAuley, 2010). Studying perception, reproduction, and entrainment to isochronous (metronome-like) sequences is the first step when moving from timing to rhythm research. In entrainment experiments, humans and other animals are tested on their ability to synchronize their movements to an external visual or auditory metronomic stimulus. Synchronization can arise spontaneously or be trained by the experimenter. Crucial experimental criteria for successful synchronization are: (i) flexibility, i.e., comparable performance at different tempos, (ii) multimodality, i.e., ability to synchronize one's behavior in a sensory modality different from that of the external stimulus, and (iii) predictive rather than reactive behavior, i.e., zero or negative asynchrony, and unperturbed performance when one beat is missing (Patel et al., 2009a,b).

TABLE 2 | The question of why a particular behavioral trait, such as vocal production learning, exists in a species can be answered taking ultimate and proximate causes into account (Tinbergen, 1963).

Extending previous entrainment studies in otariids (Cook et al., 2013; Rouse et al., 2016), harbor seals' and walruses' ability to entrain should be tested (**Tables 1G,L**). Successful synchronization in one of these vocal learners (Reichmuth and Casey, 2014) would provide an important data point in support of the VPL—rhythm link (Patel, 2006). Useful out-groups for synchronization experiments could be non-pinniped Canoidea, like dogs, exhibiting almost no VPL (Janik and Slater, 1997; Taylor et al., 2009). Harbor seals' and walruses' inability to synchronize would not refute Patel's hypothesis. However, failure to synchronize would refute alternative hypotheses, postulating individual territorial advertisement or lek displays as crucial factors for the evolution of rhythm (Hagen and Hammerstein, 2009; Ravignani, 2014).
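As a rough illustration of how the predictive-behavior criterion for synchronization might be quantified, the sketch below computes signed asynchronies and a circular measure of phase locking from a list of response times. The function names, the 600 ms tempo, and the response times are hypothetical and chosen only for illustration; this is not the analysis used in the cited studies, which typically add statistical tests omitted here.

```python
import math

def asynchronies(responses, beats):
    """Signed asynchrony (s) of each response relative to its nearest beat.
    Negative values mean the response preceded the beat (anticipation)."""
    return [r - min(beats, key=lambda b: abs(r - b)) for r in responses]

def vector_strength(responses, period, phase=0.0):
    """Mean resultant length R (0 = no phase locking, 1 = perfect locking)
    and mean phase angle (rad) of responses relative to an isochronous pulse."""
    angles = [2 * math.pi * ((r - phase) % period) / period for r in responses]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    return math.hypot(c, s), math.atan2(s, c)

# Hypothetical data: metronome every 600 ms, responses about 30 ms early.
period = 0.6
beats = [i * period for i in range(10)]
responses = [b - 0.03 for b in beats[1:]]

asyn = asynchronies(responses, beats)
mean_asyn = sum(asyn) / len(asyn)          # negative: anticipatory responses
r, theta = vector_strength(responses, period)
```

A negative mean asynchrony together with a resultant length close to 1 would be consistent with predictive rather than reactive behavior. Repeating the computation at several tempos would address the flexibility criterion.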

### Natural Isochronous Behavior and Perception of Isochrony

As flexible synchronization requires the ability to represent an isochronous pulse (Iversen and Balasubramaniam, 2016), pinnipeds should be tested on their ability to discriminate between isochronous and non-isochronous temporal patterns. In birds, the ability to recognize isochronicity in acoustic sequences seems to positively correlate with VPL: pigeons, which are not vocal learners, perform much worse (Hagmann and Cook, 2010) than birds capable of VPL, like zebra finches and starlings (Hulse et al., 1984; van der Aa et al., 2015). If this can be generalized, one would analogously expect harbor seals and walruses tested in comparable setups to outperform, e.g., California sea lions and Cape fur seals. Finally, pinniped species naturally showing isochronous vocal behavior may be particularly promising to test in order to ascertain how VPL and natural isochronous behavior affect the ability to entrain. While vocalizations in the vocally inflexible Australian and California sea lions can be quite regular, the vocally flexible harbor seals vocalize with much less temporal regularity (Schusterman, 1977; Charrier et al., 2011).
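One simple way to quantify how isochronous a stimulus or a natural call sequence is, offered here only as a sketch, is the coefficient of variation of its inter-onset intervals. The function names and the onset times below are hypothetical, and other regularity measures could be used instead.

```python
import statistics

def inter_onset_intervals(onsets):
    """Durations between successive event onsets."""
    return [b - a for a, b in zip(onsets, onsets[1:])]

def interval_cv(onsets):
    """Coefficient of variation of inter-onset intervals.
    0 means perfectly isochronous; larger values mean less regular timing."""
    iois = inter_onset_intervals(onsets)
    return statistics.pstdev(iois) / statistics.mean(iois)

# Hypothetical onset times (s): same mean interval, different regularity.
isochronous = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
irregular = [0.0, 0.3, 1.0, 1.2, 2.1, 2.5]

cv_isochronous = interval_cv(isochronous)
cv_irregular = interval_cv(irregular)
```

Because the coefficient of variation is normalized by the mean interval, it allows sequences recorded at different call rates to be compared on the same scale.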

### Meter Perception, Grouping, and Auditory Experience

Meter provides an additional dimension to rhythmic patterns, where individual events in time have different perceptual or acoustic "weights." Meter is defined as hierarchical organization of temporal events (McAuley, 2010). Meter corresponds to hearing events in time as related, forming structured patterns, e.g., the alternation of weak/strong beats in music and stressed/unstressed syllables in speech (Fabb and Halle, 2012). Meter perception can occur in sequences of stimuli that are acoustically identical (Brochard et al., 2003), or instead based on stimuli that alternate in duration, frequency, or amplitude (McAuley, 2010; Toro and Nespor, 2015; Geambasu et al., 2016; Hoeschele and Fitch, 2016).

Humans can perceive a range of metrical patterns but are biased toward specific metrical grouping patterns, partially depending on their native language (Iversen et al., 2008). In particular, a few perceptual laws, such as the iambic-trochaic law (de la Mora et al., 2013), may explain most of rhythmic grouping in speech and music (Figure 1 in Supplementary Material). Rats, for instance, exhibit experience-modulated grouping biases: like humans, they spontaneously group sequences when sounds alternate in pitch, but do not when sounds alternate in duration (de la Mora et al., 2013). However, rats can learn to group sounds of alternating durations: if exposed to short-long sequences, they will show the corresponding iambic bias when tested; if familiarized with long-short sequences, they will prefer trochaic grouping (Toro and Nespor, 2015).

Meter perception should be investigated across pinnipeds (**Table 1M**). As grouping is influenced by auditory experience, we would expect pinnipeds with a varied conspecific auditory input, like harbor seals, to require little training to discriminate metrical patterns. After probing pinnipeds' predictive timing by having them produce behavioral responses, temporal expectations could be explored by directly tapping into perception. Adapting noninvasive electrophysiology originally developed for humans and non-human primates, one could record event-related potentials corresponding to click sounds repeating at a constant rate, and compare these potentials to those evoked by click trains containing missing clicks or metrically-structured (accented) clicks (Rothermich et al., 2010; Schmidt-Kassow et al., 2011; Schwartze et al., 2011; Honing et al., 2012; Selezneva et al., 2013; Celma-Miralles et al., 2016; Cirelli et al., 2016).

### Percussive Behavior in Harbor Seals

Empirical evidence from human archeology, ethnomusicology and African apes' behavior suggests that percussion may have been the first form of musical expression in our hominid ancestors (Arcadi et al., 1998; Morley, 2003; Fitch, 2009). What was the function of rhythmic drumming in early hominids? A behavioral display in harbor seals may help answer this question: accompanying vocalizations, harbor seals "drum" on the water, repeatedly slapping their flippers on the sea surface (Riedman, 1990; Wahlberg et al., 2002). Once again, hypotheses on the function of this slapping behavior mimic hypotheses proposed for human drumming (e.g., Kirschner and Tomasello, 2009). Slapping in harbor seals may function as a signal in agonistic sexual displays (Riedman, 1990), or as a form of intrasexual competition to attract females (Nikolich, 2015). Another hypothesis regards water drumming as a form of territorial advertisement in agonistic contexts: in fact, during the breeding season, male seals produce slaps in response to other males, either when intruding into a territory or when challenging an intruder (Hayes et al., 2004). Water slapping may hence indirectly play a role in establishing and maintaining dominance hierarchies, similar to chimpanzees' drumming (Arcadi et al., 1998; Ravignani et al., 2013b).

One hypothesis we suggest is that vocal displays and drumming displays may have the same territorial function but be used complementarily. Seals' slaps cover a different frequency band than, and have dramatically different durations from, roars. Slaps last about 0.002 s, contain most of their energy between 5 and 20 kHz, and have an (in-water) source intensity of 166–199 dB (Wahlberg et al., 2002). In contrast, roars last 2–3 s, are centered at frequencies of 200–300 Hz and have an intensity of 150 dB (Hayes et al., 2004). How far can each of these sounds travel so that they are still audible by seals? At 200 Hz, seals' hearing threshold is 32 dB (82 dB underwater); the sensitivity is much higher between 5 and 20 kHz, reaching 1–29 dB (60–62 dB in water; Reichmuth et al., 2013). Hence (1) slaps carry much farther than roars, (2) even if a slap and a roar reach a seal with the same sound intensity, a slap will be more conspicuous: a slap might be perceived as up to 30 times louder than a roar, and (3) slaps could in principle be perceived visually (Nikolich, 2015). Seals' water slaps hence seem to mimic many features of early humans' territorial advertisement, which have been hypothesized to underlie the evolution of human musicality (Hagen and Hammerstein, 2009).
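The audibility argument can be made concrete with a small worked example that uses only the in-air thresholds quoted above. The sketch assumes, purely for illustration, that a slap and a roar arrive at the listener at the same received level (a hypothetical 60 dB) and expresses each sound as a sensation level, i.e., decibels above the hearing threshold at that frequency. The function name and the received level are assumptions, and no conversion to perceived loudness is attempted.

```python
def sensation_level(received_db, threshold_db):
    """Level of a sound above the listener's hearing threshold (dB)."""
    return received_db - threshold_db

# In-air thresholds quoted in the text (Reichmuth et al., 2013).
roar_threshold = 32            # dB at 200 Hz
slap_thresholds = (1, 29)      # dB, lowest and highest between 5 and 20 kHz

received = 60                  # hypothetical received level, same for both sounds

roar_sl = sensation_level(received, roar_threshold)
slap_sl = [sensation_level(received, t) for t in slap_thresholds]

# How many dB further above threshold the slap is than the roar.
advantage = [s - roar_sl for s in slap_sl]
```

The advantage does not depend on the chosen received level: it equals the difference between the two thresholds, here between 3 and 31 dB in air.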

Future research should record individuals over time to: (i) analyse the fine-grained temporal structure of series of slaps (Babiszewska et al., 2015); (ii) test whether drumming and its temporal parameters are socially learnt, and if so (iii) compare the social dynamics of two transmitted rhythmic behaviors across modalities (vocalizations vs. slapping), and (iv) relate water slapping to similar percussive behaviors present in humans and chimpanzees (Fuhrmann et al., 2014; Whiten, 2015; **Table 1N**). Collecting slapping data will make it possible to test hypotheses postulating group and mating displays as necessary evolutionary steps toward human musicality (Fitch, 2009; Merker et al., 2009). In fact, if harbor seals' slaps show strong temporal interdependence between individuals, successful entrainment experiments in this species would support the hypothesis that rhythm may have evolved in humans as a by-product of temporally-intertwined group displays (Merker et al., 2009).

### CONCLUSIONS

Researchers of human evolution and pinniped communication have been suggesting, unbeknownst to each other, similar hypotheses for the evolution of human speech and music, on the one hand, and pinnipeds' vocal displays and non-vocal communication, on the other hand. Advocating the comparative method and the distinction between proximate and ultimate questions, we have shown how animal research can help formulate and test hypotheses about the evolution of human speech and music. We have briefly reviewed previous findings in pinniped biology, explicitly pointing out their relevance to the human sense of rhythm in music and speech. We have discussed crucial questions that pinniped research should address empirically, possibly using comparable stimuli, tasks, and analysis techniques across species, ultimately shedding light on the origins of rhythmic behaviors in humans.

### AUTHOR CONTRIBUTIONS

Andrea Ravignani wrote the manuscript. All authors provided ideas and edited the manuscript.

### FUNDING

Andrea Ravignani was supported by FWO grant V439315N (to Andrea Ravignani), and European Research Council grant 283435 ABACUS (to Bart de Boer).

### ACKNOWLEDGMENTS

AR is grateful to Peter Cook, Guido Dehnhardt, Maxime Garcia, Alina Gaugg, John Iversen, Vincent Janik, Lars Miersch, Benedikt Niesterok, Ana Rubio Garcia, Ruth Sonnweber, Amanda Stansbury, and Sonja Vernes for helpful discussions, comments, and insights.

### REFERENCES

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00274



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Ravignani, Fitch, Hanke, Heinrich, Hurgitsch, Kotz, Scharff, Stoeger and de Boer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Zebra Finches As a Model Species to Understand the Roots of Rhythm

Michelle J. Spierings<sup>1,2</sup>\*† and Carel ten Cate<sup>1,2</sup>†

*<sup>1</sup> Behavioural Biology, Institute Biology Leiden, Leiden University, Leiden, Netherlands, <sup>2</sup> Leiden Institute for Brain and Cognition, Leiden University, Leiden, Netherlands*

Keywords: zebra finches, rhythm perception, cognition, song system, vocal learning

### VOCAL TIMING IN ZEBRA FINCHES

Zebra finches are a widely used model species for neurobehavioral research, in particular in relation to song development and auditory processing. Males learn their songs from a tutor. Females don't sing, but do develop learned song preferences. Regardless of the differences, both sexes exchange calls in social interactions. Two fascinating recent studies looked at different aspects of rhythmicity in the production of zebra finch vocalizations (Benichov et al., 2016a; Norton and Scharff, 2016), and together with several studies on the perception of rhythms, they make the zebra finch a promising model species to unravel the roots of rhythm production and perception.

#### Edited by:

*Andrea Ravignani, Vrije Universiteit Brussel, Belgium*

#### Reviewed by:

*Yukiko Kikuchi, Newcastle University, UK; Philipp Norton, Free University of Berlin, Germany*

#### \*Correspondence:

*Michelle J. Spierings m.j.spierings.2@biology.leidenuniv.nl*

*† These authors have contributed equally to this work.*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

> Received: *06 May 2016* Accepted: *08 July 2016* Published: *22 July 2016*

#### Citation:

*Spierings MJ and ten Cate C (2016) Zebra Finches As a Model Species to Understand the Roots of Rhythm. Front. Neurosci. 10:345. doi: 10.3389/fnins.2016.00345*

Norton and Scharff (2016) analyzed the intervals from one note onset to the next one in both isolate and directed zebra finch songs. For each male they were able to derive an isochronous sequence of "time stamps" of which a subset aligned with all note onsets of a male's song. Moreover, these time stamps often also coincided with the transitions between phonetically different gestures within a complex note. This indicates that an isochronous rhythm might underlie zebra finch songs. Benichov et al. (2016a,b) showed that both males and females can dynamically adjust the mutual timing of their calls. Individual birds were housed with a robotic zebra finch that emitted an isochronous call pattern. Within minutes the zebra finches adjusted their call rate to create a regular back-and-forth exchange with the robotic finch. The robot was then set to emit "jamming" calls that were produced at the moment when the zebra finch was most likely to respond. All zebra finches adjusted their call pattern to avoid the jamming calls, either by calling earlier, later or both earlier and later. Furthermore, when the robotic finch produced a pattern of alternating single and paired calls, the zebra finches timed their calls differently for the single compared to the paired calls. This indicates that they detected the alternating pattern of the robot calls and used it to anticipate whether the next call would be single or paired. Interestingly, females performed better than males at these tasks (Benichov et al., 2016a).
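The idea of an isochronous sequence of time stamps, a subset of which aligns with all note onsets, can be illustrated with a simple grid search. The sketch below is not the procedure used by Norton and Scharff (2016); the function names, the onset times, and the candidate periods are hypothetical. Deviations are expressed as a fraction of the pulse period, because otherwise ever-faster pulses would trivially fit any set of onsets.

```python
import math

def pulse_deviation(onsets, period, offset):
    """RMS distance of onsets from the nearest pulse, as a fraction of the period."""
    total = 0.0
    for t in onsets:
        phase = ((t - offset) / period) % 1.0
        distance = min(phase, 1.0 - phase)   # distance to the nearest pulse, in periods
        total += distance * distance
    return math.sqrt(total / len(onsets))

def best_pulse(onsets, candidate_periods):
    """Grid search for the period whose pulse, anchored on the first onset,
    lies closest to all note onsets."""
    offset = onsets[0]
    return min(candidate_periods, key=lambda p: pulse_deviation(onsets, p, offset))

# Hypothetical note onsets (ms): all fall on a 40 ms grid, but not every
# grid position carries a note.
onsets_ms = [0, 80, 120, 240, 280, 400]
candidates_ms = list(range(25, 65, 5))       # 25, 30, ..., 60 ms

period_ms = best_pulse(onsets_ms, candidates_ms)
fit = pulse_deviation(onsets_ms, period_ms, onsets_ms[0])
```

With real recordings the onsets would be noisy, so the best-fitting period would have a small but non-zero deviation, and it would need to be compared against the deviation expected for randomly timed onsets.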

Benichov et al. (2016a) next examined the role of the forebrain song system in the timing of calls. The female song system is reduced compared to the male system and lacks, for instance, the largest song nucleus, Area X. However, some nuclei of the song system are present in both sexes, like the RA, a nucleus related to the temporal aspects of zebra finch song learning. Lesioning the RA nucleus in zebra finches made them unable to correctly adjust the timing of their calls, although they were still responsive to the robotic finch. This indicates that for both males and females the RA is actively involved in call synchronization and vocal coordination. Further experiments showed that input to the RA from another forebrain nucleus, the HVC, was necessary to maintain call synchronization (Benichov et al., 2016a). While Norton and Scharff (2016) did not examine the neural basis of the isochronous patterns they observed in male songs, they also suggest that it might originate from the activation pattern of HVC neurons firing to the RA. These HVC<sub>RA</sub> neurons fire in a rhythmic, clock-like pattern (Long and Fee, 2008). However, as the rhythms observed by Norton and Scharff are 3 to 10 times slower than the firing of these HVC neurons, they propose that additional mechanisms must operate to translate the clock-like firing into the higher complexity observed in the songs.


Whether animals can perceive beat and rhythmic patterns is a prominent question in relation to understanding the evolution of human musicality (Patel, 2006, 2014; Fitch, 2013; Hoeschele et al., 2015; Honing et al., 2015). Benichov et al. (2016a,b) suggest that the link between the neural system involved in vocal learning and call synchronization provides the relation between beat perception and auditory-motor coordination shown in a number of animal species. So, how do the above findings on the presence of rhythms in vocal production in zebra finches relate to studies that addressed the perception of rhythmic auditory patterns?

### RECOGNIZING REGULARITY

Rhythm can be defined as a regular repeated pattern. In its simplest form it is an isochronous series of pulses, like the one produced by the robotic finch. More complicated rhythms can be created by repeating a heterochronous pattern, for example the repetition of single and double pulses as used in Benichov et al. (2016a). Humans are skilled in perceiving various rhythms. We can easily detect whether a pulse or beat pattern is regular or irregular by integrating temporal information over a series of auditory events (e.g., Geiser et al., 2014). We thus have the cognitive ability to abstract a general, global, pattern from a string of sounds, enabling us to classify patterns as regular or not. However, the question whether or to what extent non-human animals are also able to integrate temporal information over a longer series of sounds to classify strings as being regular is still open. A telling example is a study by Hagmann and Cook (2010) that showed that pigeons were unable to learn to discriminate between an isochronous and an irregular pulse pattern. Pigeons are vocal non-learners and this might be the reason for their inability, as it has been suggested that vocal learning and vocal non-learning species might differ in this respect (Patel et al., 2009; Schachner et al., 2009; see also ten Cate et al. (2016) for a review on this relationship for birds). The rhythm perception by zebra finches in the study by Benichov et al. seems to support this view, as they show that the (vocal learning) zebra finches extracted the regularity of the call pattern and used this to time their calling. However, when we combine these findings with several other recent studies on zebra finch perception of regularity the picture becomes more complex.

Nagel et al. (2010) trained female zebra finches to discriminate between two different songs and showed that the females maintained the discrimination over a range of tempo changes. The songs were still categorized correctly up to a 25% speed increase or decrease. These results might indicate rhythm generalization by zebra finches. However, Nagel et al.'s conclusion was that zebra finches maintained the discrimination by attending to the spectral envelope of the songs. Attending to the sequence of local spectral features, rather than any timing pattern may have enabled the discrimination. Whether zebra finches do attend to regularity in the timing of songs was examined by Lampen et al. (2014). They compared ZENK expression in response to playback of rhythmic zebra finch songs, i.e., where all songs in a string had identical inter-element intervals, with the expression in response to arrhythmic songs, i.e., a string of songs in which inter-element intervals vary. Arrhythmic songs resulted in stronger ZENK expression in the caudomedial nidopallium (NCM), the caudomedial mesopallium (CMM), and the nucleus taeniae (Tn). This increased activity in auditory areas of the zebra finch brain might be related to the finding by Benichov et al. (2016a) as, similar to their study, the repeated pattern in the regular song may have initiated predictive timing of the next song rendition, which would be lacking with the arrhythmic song.

### REGULARITY OR INTERVAL DETECTION?

The above-mentioned studies may indicate that zebra finches perceive "regularity" as such. However, as also noted by Benichov et al. (2016b), this need not be the case. The various findings may arise because repeated events with a fixed interval may give rise to a prediction for a next interval of the same absolute duration. So, when the birds respond to the robotic finch they can do so by attending to, and learning about, the absolute interval between successive robot calls, or by detecting other local contiguities of events (Benichov et al., 2016b), without having formed some concept of "regularity." The same holds for the study by Lampen et al. (2014), in which the zebra finches could also have responded to identical consecutive intervals in a sound string.

If zebra finches can perceive regularity as such, one would expect that they are not only able to distinguish a regular from an irregular string, but also to transfer this distinction to strings with modified tempos. This has been tested in experiments in which van der Aa et al. (2015) trained zebra finches to discriminate between a set of regular, isochronous pulse strings and a set of irregular pulse strings. The irregularity in these strings was created by varying the duration of inter-pulse-intervals within a string. As expected from the studies discussed above, zebra finches of both sexes could learn to discriminate these strings. But this discrimination broke down when the zebra finches were tested with probe strings with novel tempo transformations. The birds seemed to distinguish and discriminate the different strings based on their specific inter-pulse-interval durations, suggesting that the zebra finches focused on local features, without attending to or learning about the global pattern of regularity-irregularity of the strings (van der Aa et al., 2015).
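The logic of the tempo-transfer test can be sketched with two hypothetical listening strategies: a "local" one that has learned the absolute inter-pulse interval of the training strings, and a "global" one that only checks whether successive intervals are equal. The strategies, tolerances, and stimulus values below are illustrative assumptions and are not taken from van der Aa et al. (2015).

```python
def intervals(onsets):
    """Inter-pulse intervals of a pulse string."""
    return [b - a for a, b in zip(onsets, onsets[1:])]

def local_strategy(probe, trained_interval, tolerance=0.02):
    """Accepts a string as 'regular' only if every interval matches the
    absolute duration learned during training."""
    return all(abs(i - trained_interval) <= tolerance for i in intervals(probe))

def global_strategy(probe, tolerance=0.05):
    """Accepts a string as 'regular' if successive intervals are equal,
    whatever the tempo (ratios of adjacent intervals close to 1)."""
    iv = intervals(probe)
    return all(abs(b / a - 1.0) <= tolerance for a, b in zip(iv, iv[1:]))

def scale_tempo(onsets, factor):
    """Slows down (factor > 1) or speeds up (factor < 1) a pulse string."""
    return [t * factor for t in onsets]

# Hypothetical stimuli (s): training string at 200 ms, probe at 300 ms.
trained = [i * 0.2 for i in range(8)]
probe = scale_tempo(trained, 1.5)
irregular = [0.0, 0.2, 0.5, 0.6, 1.0]

local_on_trained = local_strategy(trained, 0.2)   # True
local_on_probe = local_strategy(probe, 0.2)       # False: fails to transfer
global_on_probe = global_strategy(probe)          # True: transfers across tempo
```

A listener relying on the local strategy discriminates the training strings but fails on tempo-shifted probes, which is the pattern that would suggest attention to specific interval durations rather than to regularity as such.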

Another recent experiment on zebra finches used isochronous pulse strings with pulses of two types. These were present in a ratio of 1:3, with the rare type raised in both frequency and amplitude compared to the other (ten Cate et al., 2016). This variation was used to create a string with a fixed number of low tones between two high ones, creating a regular beat pattern, and a string with different numbers of low tones in each interval, making the beat pattern irregular. Again, the zebra finches discriminated these strings. They were next tested with strings in which the location of the beat within the strings or the duration of pulses and inter-pulse-intervals was changed. This revealed that here also, the zebra finches seemed to use different local features, like the inter-beat- or inter-pulse-intervals, to distinguish the strings. However, some birds seemed to combine a sensitivity to such local features with one for the more global regularity (ten Cate et al., 2016). Nevertheless, the behavioral experiments so far do not indicate that zebra finches perceive the global pattern of regularity as such. We should therefore be cautious in concluding that the ability of zebra finches (and other non-human species) to distinguish regular from irregular sound patterns is similar to the ability of humans to detect rhythm. However, there may be a continuum among species, ranging from those unable to discriminate regular from irregular sounds up to those capable of beat detection in more complex rhythms (ten Cate et al., 2016).

### CONCLUSION AND OUTLOOK

The various studies discussed above show that zebra finches are very good at detecting fixed interval durations and can use preceding intervals to predict the next one in a string of sounds. This matches their ability to call and to produce songs with a fixed rhythmic periodicity. Furthermore, specific nuclei of the zebra finch forebrain song system (HVC, RA, CMM, and NCM) seem involved in producing and detecting rhythmicity. However, the results also call for further behavioral and neural studies on the links between perception and production of rhythmic patterns in the zebra finch. Are receivers sensitive to the rhythms underlying songs? And which brain areas are involved in the discrimination of the rhythmic patterns in the experiments of van der Aa et al. (2015) and ten Cate et al. (2016)? Are these perceptual abilities and also the production of the song rhythms observed by Norton and Scharff (2016) affected by interfering with HVC and RA? The zebra finch has proven to be an excellent model species, and is very suitable for addressing these questions. However, we also need comparative studies on other species to understand the types of temporal patterns that birds can detect, how they do so and how this is related to vocal learning (see also Benichov et al., 2016b). Ultimately, this may also shed light on the building blocks from which our human ability for rhythm perception and beat entrainment may have evolved.

### FUNDING

This work was supported by NWO-GW, grant no. 360.70.452.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Spierings and ten Cate. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Finding the Beat: From Socially Coordinated Vocalizations in Songbirds to Rhythmic Entrainment in Humans

#### Jonathan I. Benichov <sup>1</sup> \*, Eitan Globerson <sup>2, 3</sup> and Ofer Tchernichovski <sup>1</sup>

*<sup>1</sup> Department of Psychology, Hunter College, City University of New York, New York, NY, USA, <sup>2</sup> Gonda Multidisciplinary Brain Research Center, Bar-Ilan University, Ramat-Gan, Israel, <sup>3</sup> Jerusalem Academy of Music and Dance, Jerusalem, Israel*

Humans and oscine songbirds share the rare capacity for vocal learning. Songbirds have the ability to acquire songs and calls of various rhythms through imitation. In several species, birds can even coordinate the timing of their vocalizations with other individuals in duets that are synchronized with millisecond-accuracy. It is not known, however, if songbirds can perceive rhythms holistically nor if they are capable of spontaneous entrainment to complex rhythms, in a manner similar to humans. Here we review emerging evidence from studies of rhythm generation and vocal coordination across songbirds and humans. In particular, recently developed experimental methods have revealed neural mechanisms underlying the temporal structure of song and have allowed us to test birds' abilities to predict the timing of rhythmic social signals. Surprisingly, zebra finches can readily learn to anticipate the calls of a "vocal robot" partner and alter the timing of their answers to avoid jamming, even in reference to complex rhythmic patterns. This capacity resembles, to some extent, human predictive motor response to an external beat. In songbirds, this is driven, at least in part, by the forebrain song system, which controls song timing and is essential for vocal learning. Building upon previous evidence for spontaneous entrainment in human and non-human vocal learners, we propose a comparative framework for future studies aimed at identifying shared mechanism of rhythm production and perception across songbirds and humans.

#### Edited by:

*Andrea Ravignani, Vrije Universiteit Brussel, Belgium*

#### Reviewed by:

*Manfred Hartbauer, Karl-Franzens University Graz, Austria Michelle Spierings, Leiden University, Netherlands*

#### \*Correspondence:

*Jonathan I. Benichov jonathan.benichov@gmail.com*

Received: *07 March 2016* Accepted: *17 May 2016* Published: *06 June 2016*

#### Citation:

*Benichov JI, Globerson E and Tchernichovski O (2016) Finding the Beat: From Socially Coordinated Vocalizations in Songbirds to Rhythmic Entrainment in Humans. Front. Hum. Neurosci. 10:255. doi: 10.3389/fnhum.2016.00255*

Keywords: songbird vocalizations, zebra finch, social coordination, rhythm, vocal learning, predictive timing, entrainment, rhythm perception

Almost all animals behave in reference to physical and biological rhythms. From the entrainment of a cricket's circadian cycles, to a sandpiper's repeated chasing and retreating from the waves on a shoreline, rhythms and synchronization are ubiquitous in animal behavior (Strogatz, 2003). Animals not only adapt to rhythms, but can also generate coordinated rhythmic patterns, as in the synchronous flashing of fireflies or the antiphonal calling of marmosets (Moiseff and Copeland, 1995; Takahashi et al., 2013). Although rhythms, entrainment, and coordination appear to be widespread, some highly intelligent animals, such as dogs and apes, appear limited in their ability to spontaneously synchronize their actions to a given beat (Merker, 2000; Fitch, 2011), whereas most humans can dance and can synchronize their movements to a broad range of beats with ease. What is it that makes entrainment so easy for a few animal species (Large and Gray, 2015; Wilson and Cook, 2016), including humans, and difficult or impossible for others?

Many animals communicate by exchanging rhythmic calls. Such rhythms might simply arise from stereotyped back and forth responses to individual stimuli (**Figure 1A**). Alternatively, animals might respond to sequences of events (**Figure 1B**) or to temporal patterns (**Figure 1C**), that is, to the overall periodicity of events. In the case of sequences, one may ask if the animal has learned and responded to a simple or a complex string of contiguous events (**Figure 1B**). In the case of rhythm learning, the experimental question is whether the animal is capable of anticipating the timing of events, which may be either equally spaced in time (i.e., isochronous events, **Figure 1C**, top) or complex (i.e., hierarchically organized events corresponding to a musical meter, **Figure 1C**, bottom).

Here we review emerging evidence from studies of rhythm generation and vocal coordination across songbirds and humans. We start with a brief review of rhythm learning assessment in human subjects, and present the difficulties in using comparable approaches in animal studies of rhythm learning. We then discuss recent approaches to studying how songbirds can coordinate their vocalization in reference to an external beat.

### CHARACTERISTICS OF RHYTHM ENTRAINMENT IN HUMANS

In humans, contrary to many other species, rhythmic entrainment is a universal feature of behavior, observed from a very early age (Fraisse, 1966). Typically, when human subjects are asked to synchronize their movement to a beat, they tend to anticipate it. That is, they tend to act just prior to the onset of the beat, an effect called negative asynchrony (**Figure 1C**, top). Negative asynchrony can be observed after just a few (2–3) introductory beats and typically precedes the stimulus by tens of milliseconds (Fraisse, 1966). This effect can be observed over a fairly broad range of inter-stimulus intervals (ISIs), between 100 and 1800 ms. The lower ISI bound is determined by motor constraints, while the higher bound is believed to reflect our limited ability to detect periodicity beyond certain tempi (e.g., due to memory and attention constraints). Above the 100–1800 ms range, negative asynchrony partly gives way to positive asynchrony, which reflects responsive, rather than predictive, synchronization to the beat (Fraisse, 1982). Another typical human trait in beat perception is the tendency to create a hierarchy of beats, commonly termed meter (**Figure 1C**, bottom). Grouping of beats is also typical of human motor entrainment to a beat. When moving to music, humans tend to perceive levels of hierarchy beyond the basic beat and synchronize their movement to these groups of beats (Palmer and Krumhansl, 1990).
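As a concrete illustration of how asynchrony can be quantified, the sketch below computes the mean signed difference between each response and the nearest beat; a negative mean indicates anticipation and a positive mean indicates reaction. The function name `mean_asynchrony`, the nearest-beat pairing rule, and the tap times are hypothetical choices made for illustration and are not taken from the cited studies.

```python
def mean_asynchrony(tap_times_ms, beat_times_ms):
    """Mean signed tap-minus-beat difference in ms.

    Each tap is paired with its nearest beat (an assumed pairing rule).
    Negative values mean the taps lead the beat; positive values mean they lag.
    """
    asynchronies = []
    for tap in tap_times_ms:
        nearest_beat = min(beat_times_ms, key=lambda beat: abs(tap - beat))
        asynchronies.append(tap - nearest_beat)
    return sum(asynchronies) / len(asynchronies)


# Isochronous beat with a 600 ms inter-stimulus interval (within the 100-1800 ms range).
beats = [i * 600 for i in range(8)]

# Hypothetical taps: the first two beats are skipped, then each tap leads by 30 ms.
anticipatory_taps = [beat - 30 for beat in beats[2:]]

# Hypothetical reactive responses: each follows its beat by 150 ms.
reactive_taps = [beat + 150 for beat in beats[2:]]

print(mean_asynchrony(anticipatory_taps, beats))  # negative: predictive
print(mean_asynchrony(reactive_taps, beats))      # positive: responsive
```

The same measure can in principle be applied to animal responses, provided the stimulus elicits responses reliably enough to pair them with beats.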

### EVALUATING RHYTHM LEARNING IN NON-HUMAN ANIMALS

When reviewing animal studies of rhythm learning we will broadly consider cases where animals can learn to adjust their behavior with respect to a given rhythm. The choice of behavioral indicators of rhythmic perception, however, is nontrivial. The basic supposition underlying behavioral paradigms is that perception drives behavior, and, in some cases, is modulated by behavior. In human studies, one can affect behavior by directly instructing subjects to respond to perception with certain informative actions. A non-human animal, however, will respond to a stimulus only if it corresponds to a meaningful event according to some species-typical standards. Lack of response can reflect perceptual or behavioral limitations, but also the lack of motivation to respond. Therefore, in testing animal rhythm perception, it is critical to find auditory stimuli that are salient enough for the animal to respond in an informative manner. In cases where it is possible to elicit reliable responses, studies typically focus on responses as individual events, rather than ongoing patterns.

Entrainment to rhythms can be tested in many presumably reflexive responses which may exhibit the signatures of rhythm perception or recognition of a beat (**Figure 1C**). With behaviors that appear to be periodic, this can be tested simply by shifting the phase of a repeated stimulus or by observing the persistence of the periodic behavior after the removal of the entraining rhythmic stimuli (**Figure 1C**). Such manipulations are, in fact, standard when studying circadian rhythms in animals (Panda, 2002), but rare in animal communication studies. Singing behavior in oscine songbirds is one of the most studied systems of communication. Birdsong is learned, complex, and often highly rhythmic. Remarkably, some songbird species can even coordinate their songs during duets in which they alternate song syllables with millisecond accuracy (Yoshida and Okanoya, 2005; Fortune et al., 2011; Templeton et al., 2013; Rivera-Cáceres, 2015). Inspired by this rich behavior, several recent studies have involved manipulation of rhythmic stimuli to examine rhythm perception and entrainment in songbirds (Lampen et al., 2014; van der Aa et al., 2015; Benichov et al., 2016).

### VOCAL LEARNING AND COORDINATION IN SONGBIRDS

There are about 4000 species of oscine songbirds, many of which produce songs, which are learned, culturally transmitted behaviors (Brenowitz and Beecher, 2005). Songs are extremely diverse in their spectro-temporal features, complexity, and usage across species (Beecher and Brenowitz, 2005). A European nightingale male, for example, typically learns hundreds of different songs, and sings them in an enormously complex succession. A male zebra finch, on the other extreme, typically learns only a single song motif during development. Within those species-specific constraints, each individual bird can be recognized by its unique song, which is often partly learned through imitation, and partially improvised. Interestingly, even the songs of birds raised in complete isolation possess individual rhythm signatures (Fehér et al., 2009).

### The Forebrain Song System Is a Generator of Complex Learned Rhythms

The neuronal mechanisms of song learning and production have been studied in great detail (Brainard and Doupe, 2002; Nottebohm, 2005). It appears that song patterns (Amador et al., 2013) originate in a highly localized brain center, nucleus HVC (used as proper name), which is located in the bird's posterior forebrain (Nottebohm, 2005). Premotor HVC neurons, which project downstream to primary motor centers, are active during singing, and their spikes are extremely sparse and accurate (Kozhevnikov and Fee, 2007). For example, the zebra finch song is composed of a repeated sequence, e.g., ABCD, ABCD..., where each letter represents a syllable type and the repeated unit [ABCD] is called a motif. Each premotor HVC neuron produces only a single short burst of action potentials during a motif, "ticking" at a specific "moment" (e.g., in the middle of syllable C). Collectively, the ticks of these neurons span the entire duration of the song motif. Cooling HVC while the bird sings slows the song, with an almost perfectly uniform reduction in tempo across its duration (Long and Fee, 2008). This result suggests that nucleus HVC is the principal generator of song structure. In a recent study (Okubo et al., 2015), the activity of HVC neurons was tracked during developmental song learning. Interestingly, during early development, HVC neurons generate much faster rhythms, often time-locked to a single prototype syllable, which the bird produces in rapid succession (Tchernichovski et al., 2001; Aronov et al., 2008). The prototype syllable then gradually differentiates into several mature syllable types. For example, a chain of prototype syllables XXXX may transform into XX′XX′, and finally into ABAB. As this differentiation takes place, HVC neurons double their ticking period, such that they gradually shift from bursting once per syllable, to bursting once every other syllable (on either A or B), until eventually, HVC neurons spike only once per song motif (e.g., ABCD...).

Is the emergence of rhythmic patterns simply mirroring the process of learning to imitate sequences of syllable types? Alternatively, are rhythms the primary skeleton of song production and perception? We do not know. The song system could be either a sequence generator that appears to be rhythmic or a developing rhythm generator, where HVC neurons are, in effect, entrained by the auditory memory of perceived rhythms. If the latter is correct, then through the capacity of song learning, songbirds are endowed with neuronal mechanisms that are specialized for acquiring rhythms. Interestingly, nuclei in the zebra finch auditory association cortex, which are known to be involved in song recognition, are highly sensitive to rhythmic song patterns (Lampen et al., 2014). The authors presented birds with modified songs, where the sequential order of song elements remained unchanged, but song rhythm was perturbed by randomly varying inter-syllable intervals (arrhythmic songs). Hearing arrhythmic songs strongly increased activity in auditory brain areas compared to rhythmically natural songs, supporting the notion that rhythmic structure is a salient feature of birdsong, both for males and non-singing females.

Songbirds are, perhaps, a rare example of animals in which predictive auditory-motor synchronization has evolved. If song learning is indeed a neural entrainment of HVC to memories of perceived rhythms, duet singing could be interpreted as real-time coupling of song rhythms between two birds. For example, in the plain-tailed wren, both females and males sing. They learn and perform impressive duets, alternating song syllables in perfect synchrony as if one bird were singing (Rivera-Cáceres, 2015). Notably, premotor HVC neurons are sensitive to the intervals of the dueting partner (Fortune et al., 2011). In sum, the vocal learning capacities of songbirds enable them to create highly complex song patterns, and employ them in social communication, but we do not know if these patterns are primarily perceived and produced as sequences (song syntax), as rhythms (temporal structures), or perhaps as both.

There are few conclusive studies of entrainment to rhythms in non-human animals, most famously in the form of "dancing" in parrots (Patel et al., 2009; Schachner et al., 2009; Hasegawa et al., 2011; Laland et al., 2016). As noted earlier, birdsong is often highly rhythmic, and the song system can be thought of as a sophisticated generator of learned rhythmic behavior. However, once learned, song rhythms become highly stereotyped and difficult to manipulate. In contrast, it is much easier to assess rhythmic vocal abilities when birds are exchanging calls. Zebra finches, for example, rapidly exchange innate short calls and coordinate the timing of their short calls in a pair-specific manner while in a group of calling birds (Elie et al., 2010; Anisimov et al., 2014; Ter Maat et al., 2014). Recently, the vocal coordination capacity of zebra finches has been tested under controlled conditions, in terms that potentially allow for direct comparison to human rhythmic entrainment studies. These experiments showed that zebra finches can coordinate the timing of simple unlearned calls with an imposed beat in a manner that is predictive (Benichov et al., 2016).

In an initial task, individual birds were presented with equally spaced (isochronous) calls (ICs) from a vocal robot (**Figure 2A**) and answered the robot calls with stereotyped latencies. The robot then generated pairs of calls, with intervals that matched the bird's typical response time, thereby maximizing the likelihood of jamming (**Figure 2A**, bottom). Within seconds, birds learned to alter the timing of their responses to avoid jamming (**Figure 2B**). We showed that timing adjustments were predictive, that is, birds anticipated the jamming and shifted timing accordingly (**Figure 2C**). This was verified with "catch" trials, or occasional cycles in which birds hear a single call within a session consisting primarily of jamming call pairs. Further, like humans anticipating a beat, birds typically adjusted their call timing after hearing only a few cycles of the pattern.
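The logic of the jamming manipulation can be sketched numerically. The example below is a hedged illustration only: the source states that stereotyped response latencies were used to define a window of maximum jamming probability, but the specific rule used here (median baseline latency plus or minus a fixed half-width), the function names, and all latency values are assumptions introduced for illustration.

```python
import statistics


def jamming_window(baseline_latencies_ms, half_width_ms=50):
    """Assumed rule: center the window on the median baseline response latency."""
    center = statistics.median(baseline_latencies_ms)
    return (center - half_width_ms, center + half_width_ms)


def jamming_fraction(latencies_ms, window):
    """Fraction of responses whose latency falls inside the jamming window."""
    low, high = window
    hits = sum(low <= latency <= high for latency in latencies_ms)
    return hits / len(latencies_ms)


# Hypothetical latencies (ms after robot call onset) in an isochronous session.
baseline = [240, 250, 260, 255, 245, 250]
window = jamming_window(baseline)

# Hypothetical latencies after the robot starts placing a second call in the window:
# most responses now fall earlier or later than the window.
with_jamming = [150, 160, 170, 340, 155, 250]

print(window)
print(jamming_fraction(baseline, window))
print(jamming_fraction(with_jamming, window))
```

A drop in the fraction of responses inside the window is the signature of jamming avoidance; whether the shift is predictive rather than reactive is what the catch trials, which contain only a single robot call, are designed to reveal.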

Exchanges of calls with the vocal robot typically take the form of antiphonal duets, as opposed to in-phase synchrony. Jamming avoidance can then be thought of as a mechanism for maintaining antiphony. In comparison to human beat perception, call anticipation underlying jamming avoidance may be analogous to the expectation of a beat that underlies temporal shifts in syncopated rhythms (Fitch and Rosenfeld, 2007; Velasco and Large, 2011; Nozaradan et al., 2016). In both cases, events do not occur on the beat, but rather, they are shifted relative to the expectation of the beat. In music this is employed and perceived as accenting, whereas in zebra finches, this anticipation appears to guide antiphonal coordination. These results make sense from an ecological perspective given that zebra finches typically exchange thousands of short calls daily, and their colonies tend to be dense and busy acoustic environments (Elie et al., 2011).

### A COMPARATIVE APPROACH FOR STUDYING BEHAVIORAL MECHANISMS OF RHYTHM LEARNING

FIGURE 2 | Vocal robot and jamming avoidance (from Benichov et al., 2016). (A) A bird interacts with isochronous calls (ICs) generated by a vocal robot at a rate of 1 Hz. The bird's stereotyped response latencies are used to determine a window of maximum jamming probability (yellow). (B) A bird's responses (blue) across 1000 ms robot IC cycles (gray) and responses (red) across a subsequent session containing jamming robot calls (yellow). The bird shifts its response probability distribution to avoid jamming. (C) Cumulative response distributions across 12 birds, aligned to their window of maximum jamming probability (yellow), for ICs (blue), and for jamming catch trials (green) that contain only a single robot call.

Jamming avoidance has been thoroughly studied in several species of weakly-electric fishes (Bullock et al., 1972; Heiligenberg et al., 1996; Zupanc and Bullock, 2006) and frogs (Zelick and Narins, 1982, 1985). These animals minimize signal overlap with their neighbors by adjusting their intrinsic pacemaker intervals, cycle-by-cycle (Zelick, 1986). Generalized phase resetting and shifting mechanisms constitute responsive forms of coordination. For example, phase adjustment mechanisms can explain how the coqui frog can avoid jamming by preferably calling during brief periods of silence (Zelick and Narins, 1985). Are these animals learning to synchronize their signals (i.e., to cooperate)? Can synchrony arise as an epiphenomenon of competitive interactions (i.e., by suppression)? In the case of chorusing katydid bush crickets, females prefer the "leader" male that starts to signal just prior to his competitors, in a manner that resembles negative asynchrony (Greenfield and Roizen, 1993; Fertschai et al., 2007; Hartbauer et al., 2014). In this case, sexual selection for competitive inhibitory mechanisms, which are primarily responsive, may account for the apparent synchrony of the chorus, without the need for prediction.

In zebra finches, vocal robot experiments have shown that birds predictively adjust the timing of their calls when presented with complex rhythms (Benichov et al., 2016). Beyond shifting call timing for repeated jamming call pairs, zebra finches also make anticipatory adjustments for alternating jamming and non-jamming cycles. Birds appear to predict the pattern and reduce response latencies specifically for cycles in which jamming calls occur. These non-generalized (i.e., context-sensitive) shifts in response latencies cannot be explained by responsive mechanisms alone. Rather, they would require mechanisms that can operate on longer time scales (e.g., sequences) or multiple temporally hierarchical levels (e.g., grouping of beats). To our knowledge, such context-sensitive plasticity has not been observed in signaling insects, electric fish, or frogs.

Despite the impressive context-sensitive plasticity in songbird call timing, there is no conclusive evidence that they can perceive rhythms holistically, as humans do. The human ability to perceive rhythms and exhibit spontaneous sensorimotor entrainment has certain hallmarks: it is predictive, occurs across multiple hierarchical timescales, and exhibits predictive negative asynchrony enabled by endogenous representation of an isochronous beat (Semjen et al., 1998; Merker et al., 2009; Nozaradan et al., 2016). It also occurs within a specific range of tempi (Fraisse, 1982). These features can provide a starting point for comparative studies. Along these lines, van der Aa et al. tested rhythm perception in songbirds, and found that zebra finches could not generalize a distinction between isochronous and irregular beats across tempi, namely, they failed to categorize rhythms based on their common global temporal patterns (van der Aa et al., 2015). Humans, in contrast, can easily perform such tasks without any prior training regardless of cultural background (Merker et al., 2009). These results suggest that songbirds attend to local timing events in a sequence but not to global rhythm patterns.

A "sequence-based" explanation could potentially account for predictive call timing plasticity (Benichov et al., 2016). In this scenario birds detect local contiguities of events and adjust their call timing according to a rule of succession. For example, a bird might learn to answer more quickly after hearing a long interval and more slowly after hearing a short interval (**Figure 1B**). Even though zebra finches attend to local acoustic patterns during passive listening, it remains possible that they can synchronize their calls to a given beat in the context of vocal interactions. Indeed, preliminary evidence may suggest a rhythm entrainment mechanism: when zebra finches interact with a vocal robot, a surprising proportion of calls occur just before the next anticipated robot call, as in negative asynchrony or anticipatory "leading" (**Figure 2C**, secondary peak in IC responses). To further test if call timing reflects predictive entrainment to the previously heard robot rhythm, we are currently analyzing persistent call patterns produced by birds after a robot call pattern has been terminated (**Figure 1C**).

### A COMPARATIVE APPROACH FOR STUDYING BRAIN MECHANISMS OF RHYTHM LEARNING

MEG and EEG studies in humans have shown neural entrainment to an external beat (Honing et al., 2014; Doelling and Poeppel, 2015; Nozaradan et al., 2016). No similar phenomenon has been reported in non-human animals that exhibit spontaneous rhythm synchronization or in songbirds. The lack of evidence in songbird studies might mirror technical difficulties: as discussed earlier, the forebrain song system is a highly specialized vocal learning network. However, song learning proceeds over weeks, and is difficult to manipulate from moment to moment. The vocal robot approach for studying rhythm adaptation, and possibly entrainment of calls, could facilitate such comparative experiments.

Recent studies have identified zebra finch brain areas that drive call timing and interestingly, the forebrain song system appears to play a major role. The first evidence for song system involvement came from electrophysiological studies, showing the final premotor output nucleus, RA (robust nucleus of the archopallium), is active during the exchange of unlearned short calls (Ter Maat et al., 2014). These findings were surprising given that birds can exchange such calls even after the output of the forebrain song system has been blocked (Simpson and Vicario, 1990; Aronov et al., 2008). However, performing jamming avoidance experiments while the song system is lesioned or blocked results in complete loss of a bird's ability to synchronize its calls with a robot partner (Benichov et al., 2016). While birds remain responsive to the robot calls, their latencies become significantly less stereotyped. This was accompanied by the dramatic loss of the ability to avoid jamming. The precise timing of call coordination, therefore, relies on forebrain circuits that also underlie song learning.

What could this mean? Interestingly, female zebra finches, who do not sing, are extremely good at avoiding jamming. In fact, their jamming avoidance behavior is more accurate than that of male zebra finches (Benichov et al., 2016). Could the female "song system" be involved in vocal coordination? The female zebra finch forebrain vocal nuclei are not well developed (Nottebohm and Arnold, 1976; Wade and Arnold, 2004), yet blocking nucleus RA disrupted call timing and jamming avoidance, as it did in males. Therefore, the female song system, which was assumed to be vestigial, functions in call coordination and is perhaps more highly specialized for the task than the male's forebrain vocal pathway. Together, these findings suggest that the song system involvement in the coordination of unlearned calls reflects the interplay between sensory prediction and motor control. Consequently, blocking the cortical output of the song production pathway results in the temporal uncoupling of the birds' calls from the robot's, as measured by a loss of response precision and predictive timing adjustments. This occurred without affecting the birds' tendency to respond to robot calls. As the search for the neural mechanisms of call coordination narrows down, it should be possible to test if single neurons can be entrained to an imposed beat in songbirds, and to compare the results directly to human studies of neuronal entrainment.

In comparison to humans, the behavioral results obtained after blocking the song system may in some ways be analogous to the rhythm deficits seen in some human subjects who have difficulty synchronizing to an external beat (Amos, 2013) or have been identified as "beat deaf" (Phillips-Silver et al., 2011). Understanding the roles of sensorimotor networks underlying temporal deficits in songbirds may provide insights for related human research. At this point, it is too early to judge the extent to which the control of adaptive call timing is localized to the song system. Other brain areas, particularly the descending auditory pathway (Mello et al., 1998), which surrounds the song system, are likely to be involved as well. This would be consistent with reports of top-down modulation of auditory processing in human subjects (Tervaniemi et al., 2009). However, since the song system has the capacity to generate and perhaps entrain to song rhythms, an extension of this capacity to call timing adjustments would be a reasonable explanation for the anatomical convergence of the two. In sum, it should now be possible to test if neuronal activity in any of the forebrain song nuclei can be entrained to rhythms produced by a vocal robot in behaving birds. If successful, such experiments should allow for direct comparisons to human rhythm learning experiments, both at neuronal and behavioral levels.

### AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

### FUNDING

This work was funded by grants to OT from the National Institute on Deafness and Other Communication Disorders (National Institutes of Health: 5R01DC004722-17), the National Science Foundation (IOS-1261872), and the Professional Staff Congress of the City University of New York (66810-00 44).




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Benichov, Globerson and Tchernichovski. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Rhythm Generation and Rhythm Perception in Insects: The Evolution of Synchronous Choruses

Manfred Hartbauer\* and Heiner Römer

Behavioural Ecology and Neurobiology, Institute of Zoology, University of Graz, Graz, Austria

Insect sounds dominate the acoustic environment in many natural habitats such as rainforests or meadows on a warm summer day. Among acoustic insects, usually males are the calling sex; they generate signals that transmit information about the species-identity, sex, location, or even sender quality to conspecific receivers. Males of some insect species generate signals at distinct time intervals, and other males adjust their own rhythm relative to that of their conspecific neighbors, which leads to fascinating acoustic group displays. Although signal timing in a chorus can have important consequences for the calling energetics, reproductive success and predation risk of individuals, still little is known about the selective forces that favor the evolution of insect choruses. Here, we review recent advances in our understanding of the neuronal network responsible for acoustic pattern generation of a signaler, and pattern recognition in receivers. We also describe different proximate mechanisms that facilitate the synchronous generation of signals in a chorus and provide examples of suggested hypotheses to explain the evolution of chorus synchrony in insects. Some hypotheses are related to sexual selection and inter-male cooperation or competition, whereas others refer to the selection pressure exerted by natural predators. In this article, we summarize the results of studies that address chorus synchrony in the tropical katydid Mecopoda elongata, where some males persistently signal as followers although this reduces their mating success.

Keywords: insect choruses, chorus synchrony, female choice, rhythm generation, pattern recognition, cooperation

## ACOUSTIC COMMUNICATION IN INSECTS

Grasshoppers, crickets, and katydids usually produce sound by stridulation, that is, by using a striated, file-like body structure and associated structures that vibrate when the file is rubbed across a sclerotized plectrum (peg). While crickets and katydids rub their forewings against each other, grasshoppers move their hind legs across a peg located at the base of their wings. The sound signals generated can be as short as 0.5 ms (i.e., the female acoustic reply in Phaneropterine species) or can last for many minutes or even longer (e.g., the calling songs of trilling katydids). Acoustic signals can also be classified according to the responses they evoke from conspecific receivers: signals that are generated in aggressive interactions with conspecific rivals are termed aggressive songs, whereas calling songs are used to attract mates (Heller, 1988). When in close range of females, males often generate courtship songs with reduced amplitudes, different temporal patterns, and

#### *Edited by:*

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### *Reviewed by:*

Bjorn Hellmut Merker, Formerly affiliated with Mid Sweden University, Sweden Michael Greenfield, Université François Rabelais Tours, France

#### *\*Correspondence:*

Manfred Hartbauer manfred.hartbauer@uni-graz.at

#### *Specialty section:*

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> *Received:* 04 February 2016 *Accepted:* 06 May 2016 *Published:* 31 May 2016

#### *Citation:*

Hartbauer M and Römer H (2016) Rhythm Generation and Rhythm Perception in Insects: The Evolution of Synchronous Choruses. Front. Neurosci. 10:223. doi: 10.3389/fnins.2016.00223


carrier frequencies. In most species, only males generate acoustic signals, and the mute females approach the singing males (phonotaxis). In duetting species, females reply to signals produced by distant males by emitting a short acoustic signal, which then elicits male phonotaxis (Heller and von Helversen, 1986; Zimmermann et al., 1989). A general feature of acoustic signals in insects is their high degree of stereotypy and redundancy. Since acoustic signals serve as effective premating isolation barriers, they are highly diverse among species. The temporal signal pattern is particularly important for species recognition among grasshoppers (von Helversen and von Helversen, 1975, 1998), katydids (e.g., Morris et al., 1975; Keuper and Kühne, 1983), and crickets (e.g., Walker, 1957, 1969; Popov and Shuvalov, 1977; Mhatre et al., 2011; Schmidt and Römer, 2011; Schmidt and Balakrishnan, 2015). The carrier frequencies can range from 1–2 kHz to far into the ultrasonic range, and signals can be broadband (as in many katydids) or fall within a narrow frequency band (as in most crickets). The selective advantage of using either broadband or narrow-band acoustic signals for sound transmission and perception in a noisy environment has been previously described (Rheinlaender and Römer, 1980; Schmidt and Römer, 2011; Schmidt et al., 2011, 2013; Schmidt and Balakrishnan, 2015).

After successfully detecting signals, receivers evaluate the temporal signal pattern to obtain information about the species identity of the signaler. When signal period is rather variable or males advertise themselves by producing long-lasting trills, the period of syllables (for definition, see **Table 1**) usually contains information about the species identity (e.g., Walker, 1957; Popov and Shuvalov, 1977; Doherty and Callos, 1991; Simmons, 1991; Cade and Cade, 1992). However, when males generate a group of syllables (termed chirps) at fixed time intervals, the signal period could be a cue that indicates species identity (e.g., Walker, 1969). With reference to the current topic of timing in music and speech, the latter is particularly important. The intrinsic signal period of males shows little variability in some acoustic insect species, and males listen and respond to the signals of conspecific neighbors. As a result, the signal timing of chorus members strongly deviates from random, whereby synchrony and signal alternation are extreme forms of temporal patterns that emerge from acoustic interactions. Since signal timing in a group can have important consequences for calling energetics, mate choice, and predation, researchers have been asking questions about the evolution of chorusing for decades. Before going into detail about the various causes and consequences of synchronous insect choruses, we will provide a brief review of recent advances in our understanding of the neuronal basis of signal pattern generation and rhythm perception in insects, both of which are basic requirements for acoustic communication.

### Rhythm-Generating Neural Circuits

The temporal patterns of acoustic signals are generated by rhythm-generating networks of the central nervous system. Acoustic insects are valuable model organisms for the study of these networks because their song rhythms are rather simple and their nervous systems are far less complex than those of vertebrates. Another advantage is that neurons can be identified on the basis of their response properties and unique anatomy. This allows the function of identified homologous neurons that are part of pattern-generating networks to be compared across species, which provides important insights into the evolution of both temporal signal patterns and song diversification.

TABLE 1 | Definition of bioacoustic terms.

In order to attract females from a distance, males of the Mediterranean field cricket Gryllus bimaculatus emit calling songs that are characterized by aperiodic chirps consisting of about 4–5 syllables. Recently, the network involved in pattern generation was identified in this species. Schöneich and Hedwig (2011) located its position in the CNS by systematically dissecting the connection between abdominal ganglia (for a similar method, see Hennig and Otte, 1996). After transecting the connectives between the third thoracic ganglion (metathoracic ganglion complex) and the first abdominal ganglion, singing behavior was immediately and permanently terminated. Later, four neurons in these ganglia that showed rhythmic activity in phase with the syllable pattern were identified (Schöneich and Hedwig, 2012). Interestingly, a similar, characteristic neuroanatomy of the song pattern generator was found in the metathoracic-abdominal ganglion complex in grasshoppers, where songs are produced through rhythmic movements of hind legs (Gramoll and Elsner, 1987; Hedwig, 1992; Schütze and Elsner, 2001). Even more surprising, the neuronal circuits for courtship song production in Drosophila (Clyne and Miesenböck, 2008; von Philipsborn et al., 2011) and rhythmic sound production via tymbals in arctiid moths (Dawson and Fullard, 1995) were also located in thoracic-abdominal ganglia. This suggests a common evolutionary origin for early thoracic-abdominal motor control networks, which may have been linked to ventilation (cf. Robertson et al., 1982; Dumont and Robertson, 1986). By gathering knowledge about the location and function of interneurons that constitute part of the central pattern generator, a framework for further comparative studies can be constructed. In such an attempt it would be worthwhile to investigate the neuronal basis that is responsible for rhythm adjustment in chorusing insects (see below).

### Rhythm Perception and Associated Neuronal Correlates

Mate choice experiments performed with various field cricket and katydid species have revealed that the signal traits evaluated by receivers for species recognition are as diverse as the signals (e.g., Heller and von Helversen, 1986; Shaw et al., 1990; Simmons, 1991; Hennig and Weber, 1997; Hennig, 2003, 2009; Poulet and Hedwig, 2005; Greenfield and Schul, 2008; Hartbauer et al., 2014; Hennig et al., 2014). It has been generally accepted that temporal pattern recognition is both hardwired and genetically determined, in contrast to olfaction and visual orientation, where learning also plays an important role (Bazhenov et al., 2005; Papaj and Lewis, 2012). To understand the principal mechanisms of species recognition and mate choice in insects, it is necessary to unravel the response properties both of auditory neurons that convey information about acoustic signals to the brain, and the filter network in the brain itself. The expectation in this research was to find a neuronal network and describe synaptic mechanisms that result in selective responses to the conspecific temporal song pattern that match the selectivity observed in behavior. Two model organisms were used for this approach: the grasshopper Chorthippus biguttulus and the field cricket G. bimaculatus.

Male Ch. biguttulus grasshoppers generate temporally structured signals via stridulation and females respond to the temporal pattern of syllable-pause combinations of attractive songs by emitting a short acoustic reply (von Helversen and von Helversen, 1998; Meckenhäuser et al., 2013). Females in this species prefer short pauses and a strong onset accentuation of song elements (von Helversen, 1972; Balakrishnan et al., 2001). Stumpner et al. (1991) studied the response of several neurons to conspecific song models and showed that, of various local neurons in the thorax, one neuron (BSN1) responded to varying syllable-pause combinations in a way that matched behavior. Two other thoracic neurons (SN6, AN4) responded to gaps in the verse of conspecific song models in a highly reliable manner (Stumpner and Ronacher, 1994). By selectively heating individual body segments, the brain was identified as the location where pattern recognition takes place, whereas the oscillator for song production was localized in the thoracic ganglia (Bauer and von Helversen, 1987; Gramoll and Elsner, 1987; Hedwig, 1992; Schütze and Elsner, 2001; Schöneich and Hedwig, 2012). The brain neurons involved in pattern recognition still need to be characterized in this species.

As already mentioned above, male G. bimaculatus attract distant females by producing calling songs that are made up of aperiodic chirps, each consisting of about four syllables. As in many other cricket species, the syllable period represents a crucial parameter for species recognition. Behavioral experiments revealed that song pattern recognition in G. bimaculatus relies on two computations with respect to time (Grobe et al., 2012). Using a modern modeling approach, Hennig et al. (2014) were able to simulate the response of females that listened to various calling song models with different temporal patterns by using a short integration time window that operated as a filter for the pulse rate and a longer integration time window that allowed the evaluation of song energy over time.

Recently, the neuronal network that enables pulse rate recognition in the brain of G. bimaculatus has been identified. It turned out that this complex task depends on the detection of the coincidence of successive pulses in a delay line network (Schöneich et al., 2015). Subsequent sound pulses are encoded in the bursting activity of a neuron that receives sensory input at the thorax and ascends to the brain (AN1). In the brain, the sensory information of this neuron is split into two parallel pathways, one involving two other neurons (LN2 and LN5). The processing of sensory information in these neurons leads to a moderate delay and, thus, to the coincidence of the bursting response of AN1 and LN5 in the postsynaptic neuron LN3 when pulses are separated by a syllable interval of more than 20 ms (see **Figure 1A**). While LN3 operates as a coincidence detector, LN4 represents a feature detector that exhibits temporal band pass characteristics that are highly similar to those of the pulse period tuning of female phonotaxis (**Figure 1B**; Kostarakos and Hedwig, 2012). This feature-detection mechanism enables recognition of the species-specific temporal song pattern in this field cricket, and is a principal mechanism that evaluates the pulse period of calling songs.
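The delay-line principle described above can be illustrated with a toy sketch. It is not the published network model: it treats each sound pulse as a single onset time, and the internal delay (`delay_ms`), the coincidence window (`window_ms`), and the function name are hypothetical values chosen only for illustration. A pulse counts as a coincidence when it arrives together with the delayed copy of the immediately preceding pulse.

```python
def coincidences(period_ms, n_pulses=5, delay_ms=35.0, window_ms=5.0):
    """Count pulses that coincide with the delayed copy of the preceding pulse.

    delay_ms and window_ms are illustrative placeholders, not measured values.
    """
    onsets = [i * period_ms for i in range(n_pulses)]
    hits = 0
    for prev, cur in zip(onsets, onsets[1:]):
        delayed_prev = prev + delay_ms  # copy of the earlier pulse after the delay line
        if abs(cur - delayed_prev) <= window_ms:
            hits += 1
    return hits


# Response as a function of pulse period: only periods close to the
# internal delay produce coincidences, giving a band-pass-like tuning.
tuning = {p: coincidences(p) for p in (20, 30, 35, 40, 60)}
```

In this sketch the "preferred" pulse period is set entirely by the assumed delay, which is the core idea of a delay-line coincidence detector; the real circuit involves bursting responses and inhibition that are not modeled here.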

## INSECT CHORUSES

In some insect species, males congregate in groups where they form acoustic leks (also referred to as "spree" in the temporal domain; Walker, 1983; Kirkpatrick and Ryan, 1991; Höglund and Alatalo, 1995). These male aggregations offer females the opportunity to compare the calling songs of several males simultaneously, which is principally different from sequentially comparing potential mating partners (Kokko, 1997). An analysis of signal timing in males within these aggregations revealed various forms of collective broadcasting where signal timing was non-random. Greenfield (1994a) reviewed various mechanisms for a joint display of signals in groups. These included: (1) Changing light conditions trigger the simultaneous activity of senders in dusk and dawn choruses. (2) Unison bout singing, triggered by males who initiate calling and are then joined by most other signalers. Participants in such choruses usually maintain a high signal rate for several minutes, after which the calling effort gradually decreases to zero. Then, the cycle is repeated after variable intervals of silence. (3) Periodic signal production can be controlled through a central pattern generator that leads to high precision of signal timing, if individuals in a group slowly adapt their signal period to the rhythm of others who exhibit similar, intrinsic, "free-running" signal periods (as in some synchronizing firefly species). (4) In some chorusing species, males are able to maintain a constant phase relationship between their signals and those of other males by responding with a phase shift to the signal produced by a neighbor. Depending on certain properties of signal oscillators and the number of participants, signals are either broadcast in collective synchrony or in a kind of alternation.

When singing within the hearing range of one another, males of the same species often time their signals precisely. Depending on chorus size and inter-male distance, males either alternate (e.g., Jones, 1963; Latimer, 1981; Meixner and Shaw, 1986; Tauber et al., 2001) or synchronize their periodic signals (Walker, 1969; Shaw et al., 1990; Sismondo, 1990; Greenfield and Roizen, 1993; Nityananda and Balakrishnan, 2007; Greenfield and Schul, 2008; Schul et al., 2014). Synchrony is often found in species that emit signals relatively rapidly (with a period of <1 s), whereas alternation normally involves slower signal rhythms (a period of >1 s) (Greenfield, 1994b). In principle, alternation in periodic signals is restricted to only two signalers, whereas the number of individuals engaged in synchronous signaling is theoretically unlimited. Depending on the properties of song oscillators, synchrony can either lead to a significant overlap in signals or temporally-fixed delays of signals produced by different males. At close range, synchrony can be rather precise, so that even the syllables within the chirps are synchronized with those of neighboring males: when singing in close proximity, males of the chorusing species Amblycorypha parvipennis tend to synchronize the syllable pattern of their signals (Shaw et al., 1990). Synchronous signal displays are not restricted to the acoustic world, but can also be found in other modalities. Aggregating firefly species collectively broadcast visual displays in almost perfect synchrony, which results in fascinating group displays (Buck and Buck, 1968; Otte and Smiley, 1977; Buck et al., 1981). Furthermore, the vibratory communication signals of wolf spiders (Kotiaho et al., 2004) and the visual communication system of fiddler crabs (Backwell et al., 1998) are characterized by their high degree of synchrony.

Is there a common proximate mechanism that is responsible for synchronous signaling in these different systems? The oscillator properties that lead to synchronous signal displays were first described for fireflies, where a "phase delay model" was suggested to explain flash synchrony in these organisms (Hanson, 1978; Buck et al., 1981). Greenfield (1994b; see also Greenfield et al., 1997) modified this model, hypothesizing the existence of an inhibitory resetting mechanism of signal oscillators to explain the diversity of alternating and synchronous choruses observed among members of the different species. In this model, in the absence of a stimulus, the oscillator level constantly rises to a point where the production of a signal is triggered with a minor delay (effector delay). One important characteristic of this model is that the oscillator level is reset for the duration of the stimulus, which leads to a phase delay. However, the neuronal basis of this model has not yet been described.
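The inhibitory-resetting idea can be sketched in a few lines. This is a deliberately simplified illustration rather than the model of Greenfield (1994b): the oscillator level is represented as a millisecond counter that rises linearly, it is held at its basal level for as long as a stimulus is present, and it resumes rising at the free-running rate afterwards. The function name and all parameter values are assumptions made for the example.

```python
def call_times(period_ms, stim_onsets_ms, stim_dur_ms, effector_delay_ms, t_end_ms):
    """Simulate a resettable signal oscillator in 1 ms steps.

    Returns the onset times (ms) of the calls it produces.
    All names and values are illustrative assumptions.
    """
    calls = []
    level = 0  # oscillator level, counted in ms since the last reset
    for t in range(t_end_ms):
        stimulus_on = any(s <= t < s + stim_dur_ms for s in stim_onsets_ms)
        if stimulus_on:
            level = 0  # inhibitory resetting: level held at baseline during the stimulus
        else:
            level += 1
            if level >= period_ms:
                # threshold reached: the call follows after the effector delay
                calls.append(t + 1 + effector_delay_ms)
                level = 0
    return calls


solo = call_times(2000, [], 300, 50, 7000)
perturbed = call_times(2000, [1000], 300, 50, 7000)
```

With these hypothetical numbers, a 300 ms stimulus that starts 1000 ms into the cycle postpones the next call by 1300 ms (elapsed phase plus stimulus duration), which is the phase delay the text refers to.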

While inhibitory resetting can lead to the rapid synchronization of signals in a chorus (e.g., Mecopoda elongata: Sismondo, 1990; Hartbauer et al., 2005), the degree of synchrony is much higher when the signalers mutually adjust their intrinsic signal rates. Mutual rhythm adjustment has been observed to lead to the attainment of almost perfect flash synchrony in firefly individuals (Ermentrout, 1991). Furthermore, a combination of inhibitory resetting and period adjustment is responsible for the high degree of signal overlap among chorusing katydids (Walker, 1969; Nityananda and Balakrishnan, 2007; Murphy et al., 2016). In the same way, perfect synchrony of humans has been attributed to both "phase correction" and "period adjustment" mechanisms (e.g., Semjen et al., 1998; Repp, 2001, 2005; see also Merker et al., 2009).
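The difference between phase correction alone and phase correction combined with period adjustment can be illustrated with a generic linear error-correction sketch. It is not taken from any of the studies cited above; the gains `alpha` (phase correction) and `beta` (period adjustment), the function name, and the numerical values are assumptions. On each cycle the signaler measures its asynchrony relative to the stimulus, shifts its next signal by a fraction `alpha` of it, and changes its intrinsic period by a fraction `beta` of it.

```python
def simulate_asynchronies(stim_period, own_period, alpha, beta, n_cycles=60):
    """Return the asynchrony (own signal time minus stimulus time) on each cycle.

    alpha: phase-correction gain, beta: period-adjustment gain (both hypothetical).
    """
    t, s, period = 0.0, 0.0, float(own_period)
    asynchronies = []
    for _ in range(n_cycles):
        a = t - s
        asynchronies.append(a)
        t = t + period - alpha * a   # phase correction shifts the next signal
        period = period - beta * a   # period adjustment changes the intrinsic rate
        s = s + stim_period
    return asynchronies


# A signaler with a 2.1 s intrinsic period following a 2.0 s stimulus.
phase_only = simulate_asynchronies(2000, 2100, alpha=0.5, beta=0.0)
both = simulate_asynchronies(2000, 2100, alpha=0.5, beta=0.2)
```

In this toy model, phase correction alone leaves a constant lag of (own period minus stimulus period) divided by `alpha`, here 200 ms, whereas adding period adjustment lets the asynchrony shrink toward zero. This is consistent with the statement that synchrony is tighter when signalers also adjust their intrinsic rates.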

### Evolution of Chorus Synchrony

How synchrony among different individuals could evolve in the absence of a central controlling agent within the group (i.e., an individual that would play a role similar to that of a conductor in an orchestra) is puzzling. Mechanisms that would ultimately favor the evolution of chorus synchrony are thought to be diverse and may have evolved in response to selective forces either driven by other chorus members, through female choice (see Section Female Choice and the Evolution of Chorus Synchrony) or natural predators (see Section Cooperation, Competition, and a Trade-Off between Natural and Sexual Selection). Males that advertise themselves in a chorus may gain one or more of the following mutual (group) benefits by timing signals (reviewed in Greenfield, 1994b): (1) Synchrony preserves a species-specific rhythm or a distinct call envelope that is offset by silent gaps (Walker, 1969; Greenfield and Schul, 2008). (2) In contrast, alternation ensures that females can detect and discriminate critical signal features during mate choice. (3) Synchrony maximizes the peak signal amplitude of group displays, which is an emergent property also known as the "beacon effect" in the firefly literature (Buck and Buck, 1966, 1978). This property increases the conspicuousness of signals in a group of males as compared to that of a lone singer if females evaluate the peak signal amplitude rather than average signals over a longer period of time. This hypothesis states that males in a group can attract females from a greater distance by timing their signals to achieve nearly perfect synchrony. As a consequence, individuals in a chorus potentially increase their fitness as compared to isolated singing individuals. However, empirical evidence for the existence of a "beacon effect" in acoustic insects is rare and has been restricted to evidence from computer-model simulations of chorus synchrony evolution in an Indian Mecopoda species (Nityananda and Balakrishnan, 2009). A strong increase in the amplitude of synchronous acoustic signals was described in M. elongata (Hartbauer et al., 2014). For a description of other suspected "beacon effects" in bullfrog choruses see Bates et al. (2010) and in the vibratory communication of a treehopper, see Cocroft (1999). Whereas the hypotheses described above are based on sexual selection, the timing of communal displays may also be shaped by natural selection. For example, predators eavesdropping on the calling songs of signalers may have difficulty localizing an isolated signaler in a group of synchronously-signaling individuals due to their cognitive limitations (Otte, 1977; Tuttle and Ryan, 1982). In this way, males may benefit from a reduced per-capita rate of predation by signaling in groups (Lack, 1968; Wiley, 1991; Alem et al., 2011; Brunel-Pons et al., 2011).
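The "beacon effect" argument can be made concrete with a small sketch that compares the peak level of a synchronous chorus with that of a staggered one. The assumptions are mine, not the authors': every male produces a rectangular chirp of equal level at the receiver, and overlapping chirps from different males add incoherently (in power), so that n overlapping chirps raise the peak level by 10·log10(n) dB. Function names and chirp timings are hypothetical.

```python
import math


def peak_overlap(onsets_ms, chirp_ms):
    """Largest number of chirps that are on at the same moment."""
    events = []
    for onset in onsets_ms:
        events.append((onset, 1))              # chirp starts
        events.append((onset + chirp_ms, -1))  # chirp ends
    events.sort()  # at equal times, ends (-1) sort before starts (+1)
    current = best = 0
    for _, step in events:
        current += step
        best = max(best, current)
    return best


def peak_gain_db(onsets_ms, chirp_ms):
    """Peak level relative to a lone singer, assuming incoherent summation.

    Requires at least one chirp onset.
    """
    return 10.0 * math.log10(peak_overlap(onsets_ms, chirp_ms))


synchronous = peak_gain_db([0, 0, 0, 0], chirp_ms=300)
staggered = peak_gain_db([0, 500, 1000, 1500], chirp_ms=300)
```

Under these assumptions, four perfectly synchronized males gain about 6 dB in peak level, while four males whose chirps never overlap gain nothing in peak level, even though the average sound energy over time is the same in both cases.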

The "rhythm conservation" hypothesis and the "beacon effect" hypothesis are not mutually exclusive in that they both explain the evolution of chorus synchrony in male assemblages as a result of inter-male cooperation. The first hypothesis assumes a low amount of variability in the signal period on a species level and suggests that this signal parameter includes important information about species identity, whereas the temporal pattern of syllables that make up chirps is considered to be less relevant. This assumption was recently tested using the katydid species M. elongata from Malaysia, males of which synchronize their periodic signals with a period of about 2 s in small choruses (Sismondo, 1990). Calling songs in this species consist of regular chirps that are made up of about 10 syllables increasing in amplitude. When individual males were allowed to synchronize with periodic white noise signals that lacked any fine-temporal pattern, about 80% of males succeeded as long as the signal period was limited to about 2 s (Hartbauer et al., 2012a). Similarly, males synchronized with a periodic stimulus that consisted of only three syllables. In another experiment, individual males were allowed to either signal in synchrony with a conspecific signal or an artificial, unstructured white noise signal, both of which were presented at 2 s intervals and of equal intensity. Interestingly, 65% of the males generated chirps in synchrony with the conspecific signal, whereas only 35% synchronized with the unstructured signal (see example in **Figure 2**). However, after introducing a phase transition by delaying the stimulus for 1 s, only 56% of chirps were produced in synchrony with the conspecific stimulus. These results demonstrate that males of this species responded primarily to the signal period and more or less ignored the fine temporal signal patterns. This may be adaptive when considering the potential masking of the fine syllable pattern during transmission.

Evidence for rhythm as an important signal parameter for species recognition was provided in the same species in female choice experiments. When given a choice between conspecific signals broadcast at different periods, females showed a preference for a fixed signal period of 2 s (Hartbauer et al., 2014). However, in choice tests with song models of periods <1.5 s, females rarely approached any speaker. This is remarkable because the solo signal rate positively correlates with the energetic costs associated with song production (Hartbauer et al., 2012a). That is, if females selected males with higher signal rates they would thereby select males that invest more energy in mating displays. Their low rate of positive phonotaxis toward speakers with higher signal rates suggests stabilizing selection for the conspecific signal period.

## FEMALE CHOICE AND THE EVOLUTION OF CHORUS SYNCHRONY

As noted above, chorus synchrony can be a by-product of species recognition if signalers in a group preserve a species-specific temporal pattern (Greenfield, 1994a). The "rhythm conservation hypothesis" is exemplified by Neoconocephalus nebrascensis, where the male song requires strong amplitude modulations in order to elicit a phonotactic response in females (Deily and Schul, 2009). Thus, males are forced to synchronize the amplitude modulations of their signals when in male assemblages. A similar argument for the cooperative, synchronous display of mating signals has been put forward for the synchronously-flashing firefly Photinus carolinus. In this case, synchrony presumably reduces the visual "clutter" caused by randomly-timed, flashing signals (Copeland and Moiseff, 2010).

Darwin (1871) noted that female preference may promote the evolution of exaggerated mating displays. The evolution of such traits could be the result of a Fisherian process in which stronger preferences and more exaggerated traits coevolve (Fisher, 1915, 1930). In most communication systems, females prefer males that advertise themselves by producing conspicuous signals that are energetically expensive to produce. This is called "Zahavi's handicap principle" after Zahavi (1975), who explained the existence of such a preference by claiming that signals are reliable indicators of male quality when their production is expensive for the signaler, and that prolonged signaling lowers the fitness of the sender (reviewed in Johnstone, 1995). The energetic costs associated with the production of acoustic signals are usually determined by at least three signal parameters: duration, amplitude, and signal rate (Prestwich, 1994; Reinhold et al., 1998; McLister, 2001; Robinson and Hall, 2002). In the context of mate choice, these signal parameters are regarded as "condition-dependent handicaps," which indicate the quality of a sender (West-Eberhard, 1979; Andersson, 1994). Furthermore, signal traits that provide true information about the phenotypic and genetic qualities of the senders and exclude the possibility of cheating are known as "revealing handicaps" (Maynard Smith, 1985, 1991).

On the other hand, preferences for certain signal traits may be the outcome of a sensory bias in receivers that already existed before signalers evolved the traits to exploit it. In a mating context, this hypothesis suggests that, when confronted with a choice situation, females do not necessarily select males on the basis of their acoustic signal traits (indicative of male quality). Instead, certain signals can more strongly stimulate the sensory system in receivers, increasing the likelihood of mating (Ryan, 1990; Ryan et al., 1990; Kirkpatrick and Ryan, 1991; Ryan and Keddy-Hector, 1992; Arak and Enquist, 1993). For example, males of lebinthine crickets generate unusually high-frequency calls that elicit a startle response in females. In response to these calls, females generate vibratory signals that allow males to locate them (ter Hofstede et al., 2015). Arak and Enquist (1995) provided some examples in which the sensory bias in receivers creates competition between senders, with the result of more conspicuous and costly signals.

In male aggregations of anurans and katydids, females often select males on the basis of relative signal timing rather than other signal features (Greenfield, 1994b; Gerhardt and Huber, 2002). Such mating systems are especially interesting for evolutionary biologists since, by choosing males on this basis, there are no obvious direct or indirect fitness benefits for females (Alexander, 1975; Greenfield, 1994b). Any preference for a certain temporal relationship between competing signals drives the evolution of mechanisms that enable the exact timing of signals generated in a group. This "receiver bias" hypothesis suggests that synchrony or alternation has emerged as a consequence of inter-male rivalry due to inter-sexual selection (e.g., Alexander, 1975; Arak and Enquist, 1993; Greenfield, 1994a,b, 1997; Greenfield et al., 1997; Snedden and Greenfield, 1998; Gerhardt and Huber, 2002; Copeland and Moiseff, 2010). Therefore, by studying signal interactions among males in a chorus and their evaluation by receivers, one can study traits and selection at different levels. In feedback loops, traits emerge at the group level and influence the evolution of signal timing mechanisms at the individual level (Greenfield, 2015; Party et al., 2015).

### Leader Preference

In male assemblages, the synchronicity of calls is usually limited in precision, with some signals leading others. Relative signal timing can enhance or reduce male attractiveness if the females exhibit a preference for a certain temporal relationship between signals displayed in imperfect synchrony. Indeed, some anurans prefer signals that are timed in advance of others (leader signals) (reviewed in Klump and Gerhardt, 1992), a preference that was also observed in many orthopteran species (Shelly and Greenfield, 1991; Greenfield and Roizen, 1993; Minckley and Greenfield, 1995; Galliart and Shaw, 1996; Greenfield et al., 1997; Snedden and Greenfield, 1998). Such a preference constitutes a precedence effect, which is defined as the preference for the leading signal when two closely-timed, identical signals are presented from different directions [humans (Zurek, 1987; Litovsky et al., 1999); other mammals, birds, frogs, and insects (Cranford, 1982; Wyttenbach and Hoy, 1993; Greenfield et al., 1997; Dent and Dooling, 2004; Lee et al., 2009; Marshall and Gerhardt, 2010)]. This preference may be due to the fact that the leading signal suppresses the echo (reverberation) of subsequent signals that reach the receiver in a complex acoustic environment and, thus, improves sound localization.

Neoconocephalus spiza is a well-studied example of a synchronizing katydid species in which females display a strong leader preference. As a consequence, individual males compete in an attempt to jam one other's signals, with synchrony emerging as an epiphenomenon (Greenfield and Roizen, 1993; Snedden and Greenfield, 1998). The observation that males regularly switch between leader and follower roles in duets, exhibiting similar "free-running" chirp periods, provides support for the hypothesis that an ongoing competition for leadership exists (Greenfield and Roizen, 1993). In this species, males stop producing unattractive follower signals within a certain critical period of time after perceiving the signals from competitors (the so-called "forbidden interval"). Unlike N. spiza males, males of M. elongata establish mostly fixed temporal relationships for their signals over long periods of time, so that individual males assume either leader or follower roles during the duet (Hartbauer et al., 2005). Even in small four-male choruses, individuals often maintain either the leader or follower role over long periods of time (Hartbauer et al., 2014). The relative timing of synchronized chirps of different males strongly influences female choice. In two-choice experiments, M. elongata females showed a strong preference for those chirps leading by only 70–140 ms (Fertschai et al., 2007; Hartbauer et al., 2014). There is also a trade-off between time and intensity: the advantage of a signal leading by 140 ms can be compensated by an increase in loudness of follower signals by 8 dB (for similar trade-offs in other synchronizing insects and some anuran species, see Klump and Gerhardt, 1992; Greenfield, 1994b; Howard and Palmer, 1995; Grafe, 1996; Greenfield et al., 1997; Snedden and Greenfield, 1998; Höbel, 2010). 
The relatively high intensity value that is necessary for leader compensation implies that females must be in close proximity to the follower to prefer this male from a chorus. As a consequence, males who persistently signal as followers in a chorus should have a reduced fitness, posing an intriguing question about the evolutionary stability of follower roles. Before discussing hypotheses that may provide an answer to this question (see Section Cooperation, Competition, and a Trade-Off between Natural and Sexual Selection), we describe an oscillator property that favors the ability of males to attain call leadership in a chorus, and results obtained from a realistic computer model of a M. elongata chorus.
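The spatial consequence of the 8 dB trade-off can be illustrated with a short calculation. The sketch below is not part of the original study: it assumes idealized free-field spherical spreading (about 6 dB loss per doubling of distance), equal source levels for both males, and no excess attenuation by vegetation. The function name is hypothetical.

```python
def distance_ratio_for_gain(gain_db: float) -> float:
    """Factor by which a receiver must be closer to one of two equally loud
    sources for that source's signal to arrive gain_db louder.

    Assumes spherical spreading, where the received level falls by
    20 * log10(distance ratio).
    """
    return 10 ** (gain_db / 20.0)


# 8 dB is the follower advantage reported to offset a 140 ms lead.
ratio = distance_ratio_for_gain(8.0)
print(f"Follower must be about {ratio:.1f} times closer than the leader")
```

Under these assumptions, a female would need to be roughly 2.5 times closer to the follower than to the leader for the follower to be preferred, which is in line with the close proximity described above.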

## An Oscillator Property Responsible for Attaining Leadership

Sismondo (1990) demonstrated that synchrony and alternation in M. elongata are consequences of song oscillator properties, which can be illustrated in the form of phase response curves. In entrainment experiments and using realistic computer models, we demonstrated that males could establish stable synchrony and bi-stable alternation of signals over a broad range of stimulus periods, covering the whole spectrum of solo chirp periods found in a male population (1.7–2.4 s; Hartbauer et al., 2005). However, the synchrony observed was not perfect, and males tended to produce their chirps as a leader only if interacting with a male that exhibited a slower intrinsic signal rate. The member of the duet with the shorter chirp period (i.e., a difference of more than 150 ms in the intrinsic signal period duration) had an increased probability of attaining leadership (Hartbauer et al., 2005). This correlation between the intrinsic signal period and lead probability has also been described in the firefly P. cribellata (e.g., Buck et al., 1981) and two other katydid species (Meixner and Shaw, 1986; Greenfield and Roizen, 1993).
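How a difference in intrinsic period can translate into leadership may be sketched with two pulse-coupled phase oscillators. This is not the model of Hartbauer et al. (2005): the two-branch phase response (a delay when a chirp is heard in the first half of the cycle, an advance in the second half), the coupling strength, the 60 ms hearing delay, and all function names are illustrative assumptions.

```python
def reset_phase(phase, coupling):
    """Illustrative phase response: advance late in the cycle, delay early."""
    if phase >= 0.5:
        return phase + coupling * (1.0 - phase)  # pushed toward the next chirp
    return phase * (1.0 - coupling)              # pushed back toward cycle start


def simulate_duet(period_a=1.9, period_b=2.1, coupling=0.8,
                  delay=0.06, t_end=60.0, dt=0.001):
    """Return the chirp times (s) of two mutually coupled song oscillators."""
    periods = (period_a, period_b)
    phase = [0.0, 0.0]
    chirps = ([], [])
    inbox = ([], [])  # arrival times of the partner's chirps
    for step in range(int(round(t_end / dt))):
        t = step * dt
        for i in (0, 1):
            # Each chirp that has arrived resets the listener's phase.
            while inbox[i] and inbox[i][0] <= t:
                inbox[i].pop(0)
                phase[i] = reset_phase(phase[i], coupling)
            phase[i] += dt / periods[i]
            if phase[i] >= 1.0:  # oscillator peaks: chirp and restart
                phase[i] = 0.0
                chirps[i].append(t)
                inbox[1 - i].append(t + delay)
    return chirps


def lead_lags(chirps):
    """Lag (s) of male B behind male A per chirp pair; positive means A leads."""
    chirps_a, chirps_b = chirps
    return [tb - ta for ta, tb in zip(chirps_a, chirps_b)]


if __name__ == "__main__":
    lags = lead_lags(simulate_duet())
    print(f"A leads in {sum(lag > 0 for lag in lags)} of {len(lags)} interactions")
```

With these illustrative settings the male with the shorter solo period leads every interaction by a stable lag of less than 100 ms, and the roles reverse when the two periods are swapped.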

### A REALISTIC MODEL OF A *M. ELONGATA* CHORUS

Once a realistic model of male duets had been established (Hartbauer et al., 2005), the model was extended to simulate a chorus that consisted of 15 artificial males (Hartbauer, 2008). A major advantage of this approach is that manipulations of receiver properties and chorus composition could be performed that greatly exceeded those possible in behavioral experiments. In particular, parameters such as chorus density, selective attention paid to a neighbor subset, and temporal variability of synchrony due to males joining or leaving a chorus could be modified.

The results of chorus simulations revealed that synchrony in M. elongata is the outcome of an ongoing phase resetting process that propels song oscillators forward and backward during every cycle. Therefore, synchrony in M. elongata seems to be maintained on a chirp-to-chirp basis and does not depend on the mutual adjustment of intrinsic signal periods, as in a firefly (Ermentrout, 1991) or a katydid species (Murphy et al., 2016). Even in rather complex chorus situations, in which the signal oscillators and inter-male distances between nearest neighbors varied, agents that signaled at faster intrinsic rates established the leadership position more often than other chorus members. These simulation results were confirmed in real M. elongata choruses that consisted of 3–4 equally spaced males. In this situation, a single male led more than 50% of all signal interactions in 68% of choruses (Hartbauer et al., 2014). A correlation could also be drawn between the intrinsic signal period and the likelihood of producing leader signals in an Indian Mecopoda species (Nityananda and Balakrishnan, 2007). Unlike the Malaysian M. elongata species, males of the Indian species also altered their intrinsic signal period to match that of their competitors, a behavior that did not allow for the establishment of consistent leader and follower roles (Nityananda and Balakrishnan, 2008).

### Manipulation of Chorus Density

An analysis of data from computer simulations also revealed that removing two or three agents from a synchronous chorus had only a minor effect on chorus synchrony, whereas adding agents who initially signaled at random phases greatly disturbed synchrony (Hartbauer, 2008). Therefore, in order to avoid a temporary loss of synchrony, males joining a synchronous chorus should already be phase-locked with other chorus members. Empirical evidence for such synchronous initiation of songs has recently been provided for Neoconocephalus ensiger (Murphy et al., 2016). Males of this katydid species seem to adjust the intrinsic signal period of their song oscillators prior to initiating the song in order to match the rate of periodic signals. Phase-locked song initiation behavior was also observed in males that were stimulated with a periodic pacer (Hartbauer, 2008). This behavior may be regarded as an adaptation to counteract the vulnerability of a synchronous chorus.

### Selective Attention

Based on the results of computer simulations, Greenfield et al. (1997) argued that selective attention must be paid to a subset of males before synchrony and, especially, alternation can become an evolutionarily stable signaling strategy. Selective attention can be gained at the neuronal, behavioral and ecological level and restricts the receivers' attention to signals broadcast by neighbors. Evidence for selective attention at the behavioral level has been provided from playback experiments conducted with alternating grasshopper and katydid species (Greenfield and Snedden, 2003). Individuals of these species need to pay selective attention to close neighbors when alternating in a chorus because, in principle, strict signal alternation is limited to only two acoustically interacting males. Evidence for selective attention at the neuronal level has been found by studying the membrane properties of individual interneurons; when signals that differ in loudness compete, the representation of the softer signal is suppressed (Pollack, 1988; Römer and Krusch, 2000). This enables receivers in a chorus to pay selective attention to the loudest signaler. Similarly, inhibitory mechanisms may result in a stronger representation of leader signals in imperfect synchrony (Nityananda et al., 2007). Despite the neuronal evidence for selective attention to leading signals, field studies indicate that the spacing of males appears to play a more important role in restricting the attention of a receiver to close neighbors (Nityananda and Balakrishnan, 2008). Simulating selective attention to only three nearest neighbors in a chorus model did not alter the likelihood that males with higher intrinsic signal rates attained call leadership, but waves of synchronized signaling spread out among the agents (Hartbauer, 2008). This phenomenon, which is known as "wave-synchrony," has also been observed in fireflies that flash in synchrony. It has inspired the development of a Mecopoda-based controller that enables the navigation of a swarm of autonomous micro-robots (Hartbauer and Römer, 2007).
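The restriction of attention to the three nearest neighbors can be expressed as a simple neighbor-selection step. The sketch below is an illustration only, not the implementation used in Hartbauer (2008); the two-dimensional coordinates, the Euclidean distance criterion, and the function name are assumptions.

```python
import math


def nearest_neighbors(positions, k=3):
    """Map each agent index to the indices of its k nearest neighbors.

    positions: list of (x, y) coordinates in meters (hypothetical layout).
    Ties in distance are broken by the lower agent index.
    """
    listen_to = {}
    for i, (xi, yi) in enumerate(positions):
        others = [
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(positions)
            if j != i
        ]
        others.sort()
        listen_to[i] = [j for _, j in others[:k]]
    return listen_to


# Five males spaced 2 m apart along a transect (illustrative values).
chorus = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0), (8.0, 0.0)]
print(nearest_neighbors(chorus))
```

In a chorus simulation, each agent would then apply phase resetting only to chirps from the agents in its own list, so that timing information has to propagate from neighbor to neighbor.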

### IS CHORUS SYNCHRONY IN *M. ELONGATA* THE OUTCOME OF A SENSORY BIAS?

One proximate explanation for the preference of females for leading signals in behavior is based on a sensory bias in receivers. In the auditory system of insects, as in vertebrates, direction-sensitive interneurons receive excitatory and inhibitory input from opposite auditory sides (reviewed in Hedwig and Pollack, 2008). Thus, for a female receiver located between two acoustically interacting males, the signals of leader and follower males are asymmetrically represented in the auditory pathway, depending on the timed interaction of excitation and inhibition (Römer et al., 2002). Given that the leader signal has a temporal advantage, it may effectively suppress the representation of the follower signal, and the different representation of otherwise identical signals may bias the orientation of the female to the leader. The interaction of excitatory and inhibitory input may also explain quantitative values in time-intensity trading (Römer et al., 2002; Fertschai et al., 2007). In the auditory system of katydids, two interneurons that have properties favoring leading signals in a choice situation have been examined and may convey leader-biased bilateral information (Römer et al., 2002; Siegert et al., 2011). Depending on the strength of inhibition, the response to lagging signals was almost completely suppressed during the presentation of leading signals. Time-intensity-trading experiments revealed that follower signals needed a 15–20 dB advantage to compensate for the follower role, depending on the magnitude of the time difference.

However, the crucial question in the context of a possible sensory bias is whether the leader-biased response of auditory neurons evolved before or after male synchrony. It has been commonly accepted that a sensory bias can be the by-product of a sensory mechanism that evolved in a non-sexual context (Endler and McLellan, 1988; Ryan, 1990; Ryan et al., 1990; Kirkpatrick and Ryan, 1991; Ryan and Keddy-Hector, 1992; Arak and Enquist, 1993; Boughman, 2002; Arnqvist, 2006) and, therefore, that it already existed before signalers evolved traits to exploit it ("sensory exploitation" hypothesis) (Ryan and Rand, 1990, 1993; Ryan et al., 1990; Ryan, 1999). Ultimately, any bias in sensory processing with respect to closely timed signals has the potential to drive the evolution of communal signal displays toward synchrony or alternation (Greenfield, 1994a).

Strong support for the "sensory bias" hypothesis in Mecopoda would be the demonstration that in distantly related orthopteran species, where synchrony does not occur, the responses to lagging signals in directionally sensitive interneurons are also suppressed. The results of experiments conducted with locusts and field crickets have, thus far, been ambiguous (**Figure 3**). A recent phylogenetic study conducted in the genus Neoconocephalus, in which, with the exception of one species, discontinuously calling species synchronize their calls (Greenfield, 1990; Greenfield and Schul, 2008; Deily and Schul, 2009), revealed that females do not always show a strong leader preference, which does not support the "sensory bias" hypothesis (Greenfield and Schul, 2008). The most parsimonious explanation for imperfect synchronous chorusing in M. elongata is that the phase change mechanism in males enables them to synchronize their chirps, and females choose leading males as a passive consequence of the precedence effect in the auditory system (see also Party et al., 2014). However, it is also possible that a feedback loop, which originated from a sensory bias, exists that gradually strengthened the leader preference once imperfect chorus synchrony had been established.

### The Adaptive Nature of a Sensory Bias

Whether a sensory bias can be adaptive or not is still a matter of debate. Female choice based on a sensory bias may provide the females with fitness benefits due to lower search costs, even if the choice does not result in offspring with superior genes that are associated with positive fitness consequences (Kirkpatrick, 1987; Guilford and Dawkins, 1991; Hill, 1994; Dawkins and Guilford, 1996). This seems to hold true for M. elongata females, since positive phonotaxis lasted three times longer when identical chirps were presented in strict alternation, as compared to a leader-follower situation (Fertschai et al., 2007). Such delayed responses to alternating chirps can be explained at the neuronal level, since alternating chirps elicit identical—and, thus, ambiguous—neuronal excitation on both sides, whereas leading signals cause asymmetrical responses in favor of the leader, which would allow females to reliably choose between two similar, alternative signals. Therefore, females that quickly choose from among males may enjoy fitness benefits by reducing the risk of predation that is associated with a prolonged search for mates (e.g., Belwood and Morris, 1987; Siemers and Güttinger, 2006).

The solo chirp rate of M. elongata is an important predictor for leadership in acoustic interactions between males. If this parameter were correlated with traits that indicated male quality such as body size or fertility, females would gain fitness benefits by choosing the leader from among a group of males. However, neither male age, body size, spermatophore volume, nor the number of living offspring correlated with the solo chirp period of individual males (Hartbauer et al., 2015), corroborating the results of a nutritional study in which the solo chirp rate was shown to be a poor predictor of nutritional status (Hartbauer et al., 2006). Similarly, in the European tree frog Hyla arborea, the quality of males did not correlate with signal timing, although females preferentially oriented toward the first of two identical calls that overlapped in time (Richardson et al., 2008). In this frog species and in the katydid Ephippiger ephippiger, call leadership and overall energetic investment in acoustic signals correlated positively (Berg and Greenfield, 2005). In this respect, the systems in H. arborea and E. ephippiger are analogous to that of M. elongata, where the probability of producing leader signals depends on a trait (intrinsic signal period) that is associated with calling energetics (Hartbauer et al., 2006) but does not correlate with indicators of male fitness. In the same way, female E. diurnus do not gain any obvious benefits by preferring leading calls, although males are able to adjust the song oscillator phase to establish leadership (Party et al., 2014).

### COOPERATION, COMPETITION, AND A TRADE-OFF BETWEEN NATURAL AND SEXUAL SELECTION

Why do some M. elongata males participate in a chorus although they are less attractive for females as followers and probably would be more successful singing in isolation? One possible explanation may be that, in some species, females prefer signals that emerge from group displays over signals produced by lone singing males, which forces males to congregate [insects (Morris et al., 1978; Cade, 1981; Doolan and Mac Nally, 1981; Shelly and Greenfield, 1991), Hyla microcephala (Schwartz, 1994); but see Party et al., 2015]. Choice tests performed with M. elongata females confirmed their preference for conspicuous group displays (Hartbauer et al., 2014). However, this result does not explain why leader and follower roles were maintained by individuals in M. elongata choruses, where followers were at a disadvantage due to the strong female preference for signals from leaders (Fertschai et al., 2007). Below, several alternative, although not mutually exclusive, hypotheses are presented to explain why persistent followers still exist in M. elongata:

(1) Signaling as a follower may be beneficial when resulting from inter-male cooperation because overlapping chirps in a chorus may amplify the peak amplitude of the signals that are displayed synchronously (**Figure 4A**), and the resulting "beacon effect" may help distant receivers detect communal displays (see **Figure 4B**). In this case, females seem to evaluate the peak signal amplitude of communal displays, rather than average acoustic power. Interestingly, sound recordings revealed an elevated sound pressure level in the order of 6 dB in a chorus consisting of 3–4 acoustically interacting M. elongata males (2 m nearest-neighbor distance; Hartbauer et al., 2014). Despite imperfect synchrony, the high degree of signal overlap found in this chorus situation resulted in an average increase of the root-mean-square amplitude that is almost identical to that found during the simultaneous playback of four identical, conspecific signals that perfectly overlapped in time. Given the fact that syllables comprising male chirps are interrupted by brief pauses, this result is surprising and may be attributed to signal plasticity, which is known to increase the probability of temporal overlap among the loud syllables of leader and follower signals (Hartbauer et al., 2012a). As a result, signal overlap in "four male choruses" is so high that the average duration of jointly produced signals is only 1.4 times longer (343 ms) as compared to the average signal duration of solo singing males (250 ms). It is also interesting to note that the increased signal amplitude of communal signal displays was a prerequisite for the successful simulation of the evolution of chorus synchrony in an Indian Mecopoda chirper, where females also preferred "leader males" (Nityananda and Balakrishnan, 2009). This observation is in contrast to results gathered for Achroia grisella (wax moth) leks, for which such a prerequisite does not exist (Alem et al., 2011).

FIGURE 4 | Signal overlap in *M. elongata* and model of the extension of acoustic space as the result of chorus synchrony. Four males singing in synchrony overlapped their periodic signals to a high degree. This led to a strong increase in signal amplitude (A) and to the enlargement of acoustic space (B). In this way, a group of synchronized males can attract females from a greater distance as compared to lone singing males. In the case of signal alternation, the area in which a single male signals at higher amplitude as compared to its competitors is strongly reduced (shown as areas with different colors).

An inherent problem encountered when interpreting many group effects is the dilution of per capita mating success as compared to that of lone singing males. However, the increased amplitudes of group displays may enhance the mating probabilities of individual males if one considers the noisy background against which acoustic communication often takes place. Given these complex acoustic conditions, overlapping signals may allow individuals to increase the conspicuousness of their rhythmic signals in a group. Additionally, enhanced group signals were more attractive for females as compared to the solo song of a male (Hartbauer et al., 2014). These data suggest that chorus synchrony in M. elongata is the outcome of inter-male cooperation, whereby even follower males may benefit from higher mating opportunities (but see the next argument).
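The size of the amplitude gain reported above for overlapping chirps can be checked with a back-of-the-envelope calculation. The sketch below assumes that the chirps of n males arrive at the receiver with equal level, overlap fully in time, and are not phase-coherent, so that their powers add. These are simplifying assumptions, and the function name is hypothetical.

```python
import math


def level_gain_db(n_sources: int) -> float:
    """Level increase (dB) when n equally loud, uncorrelated signals overlap."""
    return 10.0 * math.log10(n_sources)


for n in (3, 4):
    print(f"{n} males: +{level_gain_db(n):.1f} dB")

# Ratio of joint to solo chirp duration reported in the text (343 ms vs. 250 ms).
print(f"duration ratio: {343 / 250:.2f}")
```

For three to four males this gives roughly 5 to 6 dB, which is in the range of the level increase measured in real choruses.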


fly of an unknown *Tachinid* species homing in on *M. elongata* males. Arrow indicates the position of the fly's ear. This fly belongs to the tribe Ormiini of an unknown genus (potentially Therobia, Phasioormia, or Homotrixa).

This fly belongs to one of 13 different species of Ormiini parasitoid flies that parasitize crickets and katydids in Asia (Lehmann, 2003). Lee et al. (2009) showed that Ormia ochracea (Diptera, Tachinidae), a tachinid fly that parasitizes field crickets, selectively orients toward the leading of two otherwise identical sound sources, while the lagging source had a minimal influence on the orientation of the fly. Therefore, the parasitoid fly homing in on M. elongata males may exhibit a leader preference similar to that of Mecopoda females, and these males would consequently suffer higher costs when signaling as leaders (reviewed in Zuk and Kolluru, 1998). Because parasitoids are detrimental to survival and reproduction in crickets (Cade, 1975; Zuk et al., 1998), katydids (Lehmann and Heller, 1998), and cicadas (Lakes-Harlan et al., 2000), this hypothesis requires further testing. Ultimately, the existence of a leader preference in parasitoid flies would suggest that the maintenance of follower singing in M. elongata is an evolutionarily stable signaling strategy that trades lower attractiveness against a reduced risk of parasitism. Further studies are needed to quantify the selection pressure of this parasitoid fly on the signaling system of M. elongata.

A summary of various selection pressures that favor chorus synchrony in M. elongata is illustrated in **Figure 6**. Females prefer males that signal at a conspecific period of about 2 s, which forces males to synchronize their signals in a group in order to maintain this species-specific rhythm. Since females also prefer leading signals, males in a group compete for the leader role, whereby chorus synchrony emerges as a by-product (Hartbauer et al., 2014). However, chorus synchrony is imperfect and leader and follower roles often remain stable for long periods of time. The natural selection exerted by parasitoid flies that infest singing leader males may stabilize persistent follower roles. Signaling as a follower is disadvantageous in terms of reproductive success, but results in a lower risk of falling victim to a parasitoid fly (selfish strategy). Additionally, followers that persistently signal can benefit from the "beacon effect," which extends the acoustic space in such a way as to allow females to detect conspicuous group signals. Since females more frequently approached groups producing conspicuous group signals in a choice situation as opposed to a lone singing male producing a quieter song (Hartbauer et al., 2014), males that join a synchronous chorus may increase both their mating chances and the chances of all chorus members. Additionally, computer simulations have been used to demonstrate an increase in the per capita mating possibilities for chorus members advertising themselves in a noisy acoustic environment due to strongly-operating "beacon effects" (chorus size = 4 males, inter-male distance = 10 m; Hartbauer et al., 2014). Therefore, sexual selection favors synchronous group displays, but follower roles are evolutionarily stabilized as a consequence of emergent group properties (beacon effect) and natural selection.

### ETHICS STATEMENT

Insects that were used in this study were taken from a laboratory breed and do not belong to endangered species. Neurophysiological experiments were performed in accordance with Austrian animal welfare laws.

### AUTHOR CONTRIBUTIONS

MH drafted and wrote this manuscript. HR contributed helpful comments and corrections.

### FUNDING

This research was funded by the Austrian Science Fund (FWF) [P21808-B09].

### ACKNOWLEDGMENTS

We thank H. Rosli, University of Malaya, Kuala Lumpur, for his generous help with the establishment of Mecopoda breeds. Sincere thanks are given to two reviewers for valuable suggestions and comments.

## REFERENCES


Höglund, J., and Alatalo, R. V. (1995). Leks. Princeton: Princeton University Press.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Hartbauer and Römer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# "It Don't Mean a Thing if It Ain't Got that Swing"– an Alternative Concept for Understanding the Evolution of Dance and Music in Human Beings

#### Joachim Richter\* and Roya Ostovar

Institute of Tropical Medicine and International Health, Charité Universitätsmedizin, Berlin, Germany

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Alexandre Celma-Miralles, Pompeu Fabra University, Spain Lisa Horn, University of Vienna, Austria Vittoria Spinosa, University of Barcelona, Spain

> \*Correspondence: Joachim Richter joachim.richter@charite.de

Received: 31 October 2015 Accepted: 13 September 2016 Published: 07 October 2016

#### Citation:

Richter J and Ostovar R (2016) "It Don't Mean a Thing if It Ain't Got that Swing"– an Alternative Concept for Understanding the Evolution of Dance and Music in Human Beings. Front. Hum. Neurosci. 10:485. doi: 10.3389/fnhum.2016.00485

The functions of dance and music in human evolution are a mystery. Current research on the evolution of music has mainly focused on its melodic attribute, which would have evolved alongside (proto-)language. Instead, we propose an alternative conceptual framework which focuses on the co-evolution of rhythm and dance (R&D) as intertwined aspects of a multimodal phenomenon characterized by the unity of action and perception. Reviewing the current literature from this viewpoint, we propose the hypothesis that R&D co-evolved long before other musical attributes and (proto-)language. Our view is supported by increasing experimental evidence, particularly in infants and children: beat is perceived and anticipated already by newborns, and rhythm perception depends on body movement. Infants and toddlers spontaneously move to a rhythm irrespective of their cultural background. The impulse to dance may have been prepared by the susceptibility of infants to be soothed by rocking. Conceivable evolutionary functions of R&D include sexual attraction and transmission of mating signals. Social functions include bonding, synchronization of many individuals, appeasement of hostile individuals, and pre- and extra-verbal communication enabling embodied individual and collective memorizing. In many cultures R&D are used for entering trance, a base for shamanism and early religions. Individual benefits of R&D include improvement of body coordination, as well as painkilling, anti-depressive, and anti-boredom effects. Rhythm most likely paved the way for human speech, as supported by studies confirming the overlaps between cognitive and neural resources recruited for language and rhythm. In addition, dance encompasses visual and gestural communication. In future studies, attention should be paid to which attribute of music is focused on, and the close mutual relation between R&D should be taken into account. The possible evolutionary functions of dance deserve more attention.

Keywords: beat, rhythm, dance, embodied music cognition, embodied communication, rocking, human universals, human evolution

## THE EVOLUTION OF DANCE AND MUSIC, CURRENT CONCEPTS

The origin of dance and music, beautiful and powerful universals of humankind, is a mystery. All over the world there are myths about how humankind received dance and music. In Hindu mythology, the god Shiva Nataraj created the world by dancing. In most traditional cultures dance plays a pivotal role (Métraux, 1958; Lapassade, 1976; Verger, 1982; Ginn, 1990; Christoph and Oberländer, 1996; Foix, 2007; Kouam and Mofor, 2011).

In the Western world, attention has usually been paid to the origin of language and to its relation to the melodic attribute of music, whereas dance and rhythm have long been neglected. This might partly be an unintentional consequence of the body-mind dualism of Cartesian philosophy as well as the historical hostility of the Roman Catholic and Protestant Churches toward dance. This has led to the omission of percussion instruments in European classical music, thus diverting primary attention from rhythm and dance to melody (Redmond, 1997; Wagner, 1997; Foix, 2007).

The evolution of music has become an important research topic rather recently (Falk, 2000; Cross, 2005; McDermott and Hauser, 2005; Mithen, 2007; Cross and Morley, 2008; Dissanayake, 2008; Patel, 2008; Patel et al., 2008; Tomasello, 2008; Panksepp, 2009; Merker et al., 2015). This interest has been driven partly by neuroscience, which seeks to identify the core components of human cognition (Oberzaucher and Grammer, 2008; Thaut et al., 2008; Costa-Faidella et al., 2011; Nozaradan et al., 2011a,b, 2012; Alluri et al., 2012; Arnal and Giraud, 2012; De Guio et al., 2012; Teki et al., 2012; Leman and Maes, 2014) in comparison to animals (Fitch, 2006, 2012, 2013; Ravignani et al., 2013, 2014). The evolutionary role of dance is even more enigmatic than that of music, considering that a dancer expends considerably more energy than a singer or a musician. The evolutionary functions of dance have received more attention only recently (Dean et al., 2009; Hanna, 2010; Whitehead, 2010; Grammer et al., 2011; Neave et al., 2011; Davidson and Emberly, 2012; Fitch, 2012; Christensen et al., 2014; Morley, 2014; Woolhouse and Lai, 2014; Wang, 2015).

### Definition of Dance

Neither the term dance nor the term music as such is precise. "Dance" in Oxford's dictionary is defined as: "move rhythmically to music, typically following a set sequence of steps." For our purpose, we define "dance" as body movements coordinated to a basic rhythm. Rhythm is constituted by a pulse or sequence of beats which are organized hierarchically. There are four main sub-constituent elements of rhythm: (1) tactus: identical short-duration periods subdivided into strong beats ("downbeats") and weaker beats ("offbeats"); (2) tempo: the frequency of the tactus; (3) meter: cyclical groupings of beats into units marked by accents; (4) patterns: sequences of time intervals that may or may not extend across meter units (Fitch, 2013; Thaut et al., 2014). Dance differs from mere synchronization to a regular pulse, because dance offers the possibility to vary steps with respect to beats inside the tactus. Nevertheless, the dancer has to respect a basic groove (Janata et al., 2012; Fitch, 2013; Oota, 2016). We would therefore not consider soldiers marching or harvesters working in synchrony with a beat as dancers: they do not differentiate down- and offbeats of a rhythm and they have a defined purpose. We also see modern nonrhythmic expressive dance as theater rather than dance. Nor would we describe as dance repeated steps performed without keeping a regular basic pulse, as described for some birds (Ota et al., 2015).
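The four sub-constituent elements of rhythm listed above can be made concrete in a small data structure. The sketch below is a deliberate simplification for illustration: the class and field names are hypothetical, the accent is placed on the first beat of each meter unit, and pattern intervals are expressed in beats.

```python
from dataclasses import dataclass


@dataclass
class Rhythm:
    tempo_bpm: float     # tempo: frequency of the tactus, in beats per minute
    beats_per_unit: int  # meter: number of beats grouped into one cyclical unit
    pattern: tuple       # pattern: successive time intervals, in beats

    def beat_period(self):
        """Duration of one tactus beat in seconds."""
        return 60.0 / self.tempo_bpm

    def beats(self, n):
        """(time in s, accented) for the first n beats of the tactus."""
        return [
            (k * self.beat_period(), k % self.beats_per_unit == 0)
            for k in range(n)
        ]

    def onsets(self):
        """Onset times (s) of the pattern events; they may cross meter units."""
        position, times = 0.0, []
        for interval in self.pattern:
            times.append(position * self.beat_period())
            position += interval
        return times


# A four-beat meter at 120 beats per minute with a 3 + 3 + 2 grouping of
# half-beats, written here as intervals of 1.5, 1.5, and 1 beats.
groove = Rhythm(tempo_bpm=120, beats_per_unit=4, pattern=(1.5, 1.5, 1.0))
print(groove.beats(5))
print(groove.onsets())
```

In this representation, a dancer who varies steps against the beat changes the pattern while the tempo and meter fields stay fixed.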

On the other hand, we extend our definition of dance to the beat-keeping movements of music performers. Embodied perception is the physiological fundament of this phenomenon. Unintentional body movements to a beat reflect the role of our body in rhythm perception. For dancing, the capacity of beat anticipation and of embodied rhythm perception are required (Phillips-Silver and Trainor, 2005; Honing, 2012; Bouwer et al., 2014).

### Conceptions of Music and Theories on the Origin of Music

"Music" and "dance" encompass overlapping spectral and temporal attributes. Spectral attributes are pitch, intervals, and harmony. Temporal attributes are covered by rhythm, which consists of its sub-constituent elements tactus, meter, tempo, and pattern (Fitch, 2013; Janata and Parsons, 2013; Thaut et al., 2014; Lee et al., 2015; **Table 1**). Loudness and dynamics are, albeit important, not specific to music since these are also expressive means of other arts such as theater, poetry, rhetoric or cinema. For music we discern its three main specific attributes: rhythm, melody, and harmony. Rhythm is music's central organizing structure. Rhythm is indispensable for both dance and music (Honing, 2012; Thaut et al., 2014). Whereas rhythm can exist without melody or harmony, melody or harmony cannot exist without rhythm. The concepts of melody and harmony are partly defined by their temporal rhythmic fundament: melody is defined as a series of sounds with a different pitch over time. Harmony must be subdivided into "sequential harmony," which means the defined pitch intervals in a melodic time line, and "polyphony," which means simultaneous sounds with different pitches following different melodic lines. Authors of publications on the evolution of music usually do not mention which of these attributes are precisely meant. Usually, only the melodic and, to a lesser extent, the harmonic attribute of music have been focused on. On the other hand, rhythm has been relatively neglected. Also in the past, scholars who reflected on the origin of music referred to its melodic attribute. The philosopher Rousseau (1781) argued that ancestral humans would have used a protomusilanguage and that people would have communicated by singing. The German poet Heine (1822) interpreted melodic music as a precursor of language. Darwin (1871, p. 572) in "The descent of man" noted: "as neither the enjoyment nor the capacity of producing musical notes are faculties of the least use to man in reference to his daily habits of life, they must be ranked amongst the most mysterious with which he is endowed."


Red: dance attributes; red and violet: dance-beat-module; violet and blue: musical attributes. For studies on the evolution of music the main attributes should be considered separately as follows: dance, rhythm / melody (singing) / harmony (subdivided into sequential harmony as the musical scale constituting the base for a melody, and simultaneous polyphony). Human universals (in yellow letters) would be of particular interest in this respect. Rocking a baby, clapping hands in response to rhythmical music, and whistling are probable human universals, but have so far never been investigated in this respect. Interestingly, only some of the melodic and harmonic attributes of music constitute human universals, including the capacity to transpose a musical scale to another tonality (relative pitch), the recognition of the octave as "same," and the recognition of the interval of the fifth as "harmonically closely related." Chorusing means that several or many individuals perform an activity in synchrony. Modal scales constitute the melodic framework of a number of musical styles such as Classical Indian music, ancient Greek music, or Gregorian chants and have also been adopted by Jazz musicians. Bending means the stepless modulation of a sound, which is widely used, e.g., in Classical Indian and Oriental music, Blues and the popular music of Hawaii. Polyphony means the performance of different melodic lines at the same time. Examples, among others, are canon singing and the fugue used in Western Classical music. Antiphony means the call-response pattern between a lead singer and a refrain sung by a choir, as present in many musical cultures. Polyharmony is a sophisticated method which exploits the ambiguity of harmonic chords in modern Classical and Jazz music. Entrainment means the capacity to synchronize body movements to an external beat. Groove constitutes the aspect of music which compels the individual to move. Embodiment is the mutual interaction between body movement and perception or communication.

For extensive musicological definitions and human universals, see Nettl (2000), Jordania (2005), Bispham (2006), Janata et al. (2012), Brown and Jordania (2013), Fitch (2013), Levitin (2013), Ravignani et al. (2014), and Savage et al. (2015).

Most anthropological, psychological, and musicological references focus on the evolution of melodic music but not on that of rhythm or dance. A favorite theory on the evolution of music has been that it evolved merely as a by-product of language evolution (Pinker, 1997; Jourdain, 2001; Brown and Jordania, 2013).

### Current Hypotheses on the Evolution of Dance

Would dance, too, constitute a mere by-product of language evolution? It has been proposed that the capacity to move in time with an auditory pulse, i.e., entrainment, evolved as a by-product of vocal mimicry (Schachner, 2013). We believe that R&D are not mere evolutionary by-products of language. Scientific studies have shown that infants are able to extract and anticipate a rhythmical pulse and that they have a strong impulse to move spontaneously when exposed to an external rhythm (Hannon and Johnson, 2005; Hannon and Trehub, 2005; Phillips-Silver and Trainor, 2005; Winkler et al., 2009; Fujii et al., 2014). We hypothesize that R&D are part of an inborn series of physiological reflexes universal in all humans. The observation that children at a certain age inevitably dance when they are exposed to a rhythm has been supported by a cross-cultural questionnaire study conducted on three continents by our group (Ostovar, 2016). If such a reflex were constant and genetically determined, it could join the list of physiological reflexes used in developmental psychology and pediatric neurology. Confirmation of the existence of such an innate reflex would also support the concept that dance has had considerable importance in the evolution of humankind.

This prompted us to perform a literature search in the fields of medicine and developmental psychology, philosophy, archeology, anthropology, ethnology, and musicology focusing on the evolution of dance both in children and in humankind. We were surprised by the scarcity of references. Neither textbooks of pediatrics nor those of developmental psychology mention a physiological dance reflex or dance reaction to a beat occurring during infantile development (Ringwalt, 2008; Berk, 2014). A PubMed search on "dance reflex" yielded a total of 22 hits, none of which dealt with our topic.

### Music and Dance as Pre-verbal Communication Tools

What could be the evolutionary functions of rhythm and dance? One has to bear in mind that people who dance, stomp, and clap hands are noisy and less aware of predators (Hart and Sussmann, 2009). Such behavior must at least not have carried survival disadvantages if it was to be conserved in evolution.

Donald (2001) formulated the theory that long before the evolution of language humans were communicating by extraverbal means which he called mimesis. Mimesis can be imagined as how we communicate in a foreign country without any knowledge of the local language. This theory was later taken up by Mithen (2007) who called extra-verbal communication the "hmmmm"-communication (hmmmm stands for holistic multimodal manipulative musical). In this context Mithen (2007) argued that music would have been a means of pre-verbal communication, calling his book "The Singing Neanderthals." This title reflects the wide-spread opinion that melody is the main attribute which characterizes music.

Since we believe that rhythm is the fundament of music we would propose for the next book the title "The Swinging Neanderthals." Masataka (2009) sustains the view that both language and music evolved from a pre-linguistic communication system which was neither language nor music. Similarly, Brown (2007) proposes that music has evolved from a pre-linguistic precursor present also in animals which he called "contagious heterophony." An alternative account is proposed by Livingstone and Thompson (2009) where music originated from a more general adaptation known as the "Theory of Mind" which would allow an individual to recognize the mental and emotional state of conspecifics. Underpinned by the mirror neuron system of empathy and imitation, music would achieve engagement by drawing from pre-existing functions across multiple modalities (Rizzolatti et al., 1996; Wu et al., 2016). This, in our opinion, applies even more to dance because of its strong interaction between perception and motor response. Whitehead (2010) puts forward the hypothesis that song and dance would have even preceded mimesis in hominid evolution. Since most authors looked for the origin of music in melody and singing, some of them interpreted as their precursors the interaction between pre-musical utterances of the infant by modulation of crying and melodic modulation of language by the caretaker, also called "motherese." The origins of "motherese" would already be established during prenatal development of the fetus (Parncutt, 2009). "Motherese" has been proposed as proto-melody paving the way for the evolution of music. Rhythm in this respect has not been considered (Dissanayake, 2004; Falk, 2005; Hagen and Hammerstein, 2009; Panksepp, 2009; Wermke and Mende, 2009; Trevarthen, 2011; Brown and Jordania, 2013). 
Furthermore, although all mothers of all cultures and all ages know that babies are soothed by being rocked, this is, if ever, only exceptionally mentioned and has also never been proposed as a human universal (Brown, 1991; Antweiler, 2015).

In summary, the functions of R&D as powerful pre-verbal and extraverbal communication tools have received little attention.

### DEFINING A DIFFERENTIATED FRAMEWORK OF MUSIC AND DANCE

For defining the evolutionary functions of music a more fine-grained concept is required (Hagen and Bryant, 2003; Fitch, 2006; Levitin, 2013). Dance is usually conceived as being distinct from music. The close relationship between rhythm and dance has been acknowledged only recently (Phillips-Silver and Trainor, 2005; Dean et al., 2009; Phillips-Silver et al., 2010; Janata et al., 2012; Phillips-Silver and Keller, 2012; Stevens, 2012; Morley, 2014). The pivotal importance of rhythm, synchronized body movements and dancing as prerequisites for the emergence of all attributes of music has received relatively little attention. Whereas rhythm and dance can exist without melody, there is no music without rhythm (Honing, 2012). Music has also long been considered a unimodal phenomenon (auditory processing), whereas it is multimodal (through action–perception coupling; Lesaffre and Leman, 2013). There is no reason to assume that the different attributes of music evolved at the same time and pace. Only two attributes of music constitute human universals: all cultures have rhythm and almost all have melody (songs; **Table 1**). Polyphonic harmony developed only some thousands of years ago and has been explored only since historic times, starting in Ancient Greece (Jordania, 2005). Among the attributes of melody and sequential harmony only the octave, the perfect fifth and the building of a melodic scale by dividing the octave into unequal intervals are human universals (Nettl, 2000; Savage et al., 2015; see also **Table 1**). Many cultures, such as those of the Arabic world, Turkey, Persia or India, have, instead of polyphony, developed melody by adding particular intervals and bending melody sounds in a sequential harmonic way (modal music).

On the other hand, in most cultures worldwide, the concepts and terms of R&D are not separated from each other (e.g., the terms "Samba," "Salsa," "Guaguancó," "Tango," and "Waltz" apply to both the rhythm and the dance).

In line with this, we propose an alternative conceptual framework which focuses on the mutual co-evolution of R&D (Phillips-Silver and Trainor, 2005; Panksepp, 2009; Lesaffre and Leman, 2013; Levitin, 2013; Maes and Leman, 2013). In other words, R&D are two sides of the same coin. Not only do we move to what we hear but what we hear depends on how we move (Phillips-Silver and Trainor, 2005). The unintentional body movements when we perceive a "groove" (defined as the aspect of music which compels us to move), among all body parts primarily involve the lower limbs confirming the close relation to dance (Janata et al., 2012). The mutual connection between R&D is also reflected by the musical terms "downbeat" and "offbeat," where the downbeat indicates the dance step which carries the weight of the body when it comes back down onto the legs.

Movement, rhythm and emotional well-being have particular neural pathways involving cerebellar structures which coordinate sensory neuronal inputs with motoric responses (Molinari et al., 2003; Levitin, 2006; Thaut et al., 2008; Lehmann, 2010; Nozaradan et al., 2011a,b, 2012; Grahn, 2012; Teki et al., 2012; see the comprehensive reviews of Levitin, 2013; Repp and Su, 2013; Chauvigné et al., 2014). Even deaf children have the impulse and the capacity to dance by perceiving the beat through their cutaneous pallesthesic and visual receptors (Phillips-Silver et al., 2015).

There are relatively few references on the specific evolutionary functions of rhythm in humans, mainly in comparison with nonhuman animals (Honing, 2012; Fitch, 2013; Ravignani et al., 2013, 2014). Rhythm cognition depends on body movement and vice versa (Lesaffre and Leman, 2013; Christensen et al., 2014). Fitch (2013) emphasizes that a necessary prerequisite of research on rhythm is an unambiguous definition of the term "rhythm" and of its sub-constituent elements "beat," "pulse," "meter," "tempo," and "tactus" (**Table 1**). Dancing requires the perception of such hierarchically structured rhythms in order to coordinate and differentiate those steps which carry the whole bodyweight (usually the downbeats) from steps carrying less or no weight (offbeats). Sequences of arbitrary pulses are therefore spontaneously classified into a rhythmical tactus by the dancer, a capacity observed already in infants (Phillips-Silver and Trainor, 2005).

For people of traditional societies the importance of dance and the tight connection between R&D are beyond any doubt (Ginn, 1990; Morley, 2014). In societies with strong dance traditions, singing out of tone is tolerated more easily than drumming only slightly out of beat. This is what the composer Duke Ellington meant with his song: "it don't mean a thing if it ain't got that swing." Instead, passive listening to music without moving, as seen in listeners to Western classical music, requires an educational effort, as it can easily be seen when children are obliged to sit still in a concert. In fact, spontaneous unintentional body movements when listening to a rhythmical beat are difficult to suppress (Molinari et al., 2003; Levitin, 2013).

Moreover, dance is a comprehensive art encompassing attributes which go beyond music such as its external visual signals. The dancer is seen by others, acting as a moving picture (Hanna, 2010; Lee et al., 2015). This also applies to a widespread and very old form of dance, i.e., round dance, where the group dances and different dancers enter into the round to perform their solo before rejoining the round, as is also commonly observed in children. Furthermore, dance may encompass gestural and dramatic codes and, thus, has paved the way for the development of theater. The comprehensiveness of dance already led Curt Sachs to argue that dance is the mother of all arts (Sachs, 1933/1980).

### CONNECTING STUDIES IN VARIOUS SCIENTIFIC DISCIPLINES WITH HYPOTHESES ON THE EVOLUTION OF DANCE AND MUSIC IN HUMANKIND

### Evidence of Dance and Music in Archeological Records

When we look for archeological proof for music and dance we must bear in mind that archeology depends on the finding of artifacts indicating the presence of a certain human behavior at a certain time. The earliest artifacts confirming musical activities are around 45,000 years old. Examples are preserved instruments such as flutes in the Paleolithic caves of "Hohle Fels" and "Geissenklösterle" (Higham et al., 2012). Before we assume that music and dance did not develop earlier than this, we must acknowledge that in a hunter-trapper-gatherer society not only does any artifact constitute an additional weight to carry, but leaving no traces also forms part of the survival strategy. Moreover, many instruments are natural objects, such as conch shells, or are made of perishable materials, such as wood and animal skins, which are not preserved for long (Ginn, 1990). Furthermore, humans have always possessed a versatile instrument that does not need to be manufactured, i.e., their own body (Barbagiovanni, 2006). Some examples of human behavior appear so omnipresent and obvious to us that it has, in fact, never been investigated whether they are universal in all human societies. One of these is accompanying a beat by clapping hands, a behavior which is observable in humans of all ages (Brown, 1991; Antweiler, 2015; Savage et al., 2015). Another current example of body percussion and dance is "Flamenco," which in its original form was performed only by singing, clapping hands and stepping on the ground without the use of any musical instruments (Caballero, 1992). This practice is also illustrated by the "Akonhoun" dance tradition of Benin, where dancers perform percussion on their body while dancing, as well as by the "Schuhplattler" in German folk music. To support our view, one may consider by analogy the development of painting, where the human body was the first canvas, as confirmed by ca. 200,000-year-old red ochre findings in several places where Neanderthals were living (Wrescher, 1980; Roebroeks et al., 2011). Cave paintings and sculptures confirming dancing are relatively recent. The oldest cave painting possibly representing a dancer is the around 35,000-year-old "magician" (French: "sorcier") of the "Trois Frères" cave in Southern France, a zoomorphic figure with animal and human characteristics. Such mask dancing is still practiced in traditional societies aiming at being possessed by an animal spirit (Ginn, 1990; Christoph and Oberländer, 1996). Unequivocal dancing scenes are represented in paintings in the "Valcamonica" and "Addaura" caves in Italy which are no older than 10,000 years (Anati, 1995).

### Studies on Infants and Children

Studies on infants, toddlers, and children contribute to elucidate the evolution of dance and music. The study group of Henkjan Honing showed in a very elegant experiment that newborns already perceive and anticipate musical pulses, a phenomenon which was called "beat induction" (Winkler et al., 2009). Beat processing has been shown to be pre-attentive for metrically simple rhythms with clear accents (Bouwer et al., 2014). Three- to four-month-old infants demonstrate spontaneous limb movements coordinated to a musical pulse (Fujii et al., 2014). Furthermore, infants are able to stratify musical pulses into meters (Phillips-Silver and Trainor, 2005). This capacity is linked with body movement. Infants use meter to categorize rhythms and melodies and learn more readily to tune into musical rhythms than adults (Hannon and Johnson, 2005; Hannon and Trehub, 2005). Human infants spontaneously engage in significantly more rhythmic movement to music and other rhythmically regular sounds than to language (Zentner and Eerola, 2010). The precocity of beat induction already being observed in newborns, the ability to stratify musical beats into tactus and meters, the efficacy of rocking, the unintentional movements to rhythm, the pre-attentive characteristic of beat processing, and the intensity of the emotional impact of R&D on humans, support our view that in human evolution communication through R&D preceded verbal and melodic communication (Bergeson and Trehub, 2005; Winkler et al., 2009; Hagen et al., 2010; Whitehead, 2010; Nozaradan et al., 2011a,b, 2012; Honing, 2012; Lesaffre and Leman, 2013; van Noorden, 2013; Bouwer et al., 2014; Leman and Maes, 2014; Morley, 2014; Ravignani et al., 2014; Miura et al., 2016). Not only environmental but also genetic factors have been shown to play a role in our ability to perceive rhythm (Seesjärvi et al., 2015).

### WHAT COULD THE EVOLUTIONARY FUNCTIONS OF RHYTHM AND DANCE BE?

With this differently defined framework one may reformulate the question "what was the function of music in human evolution?" in a more particularized way: "what was the function of rhythm and dance?"

Did this module pave the way for further evolution of other attributes of music and of language? Future studies may elucidate to what extent R&D are exclusively human and which capacities non-human animals possess in this respect. Some birds and a captive sea lion are also able to anticipate a beat and move to it to some extent, but apparently animals do not subdivide rhythm into more and less accentuated beats (Patel, 2008; Patel et al., 2008, 2009; Kirschner and Tomasello, 2009; Cook et al., 2013; Ravignani et al., 2013, 2014). Chimpanzees display spontaneous rhythmical behaviors (drumming and carnival display; Arcadi et al., 1998, 2004; Fitch, 2006; de Waal, 2013; Ravignani et al., 2013) and an experiment has shown that a trained chimpanzee was able to anticipate a rhythmical beat (Hattori et al., 2014). Dancing, however, contrary to marching in lockstep or synchronized working to a beat, requires a sophisticated hierarchical perception of rhythm (Fitch, 2013). Therefore it is, in our opinion, very unlikely that dancing constitutes a mere by-product of entrainment. What are R&D in humans good for? Why is the impulse to dance so powerful? What could be the evolutionary functions of R&D?

### REPRODUCTIVE FITNESS AND SEXUAL ATTRACTIVENESS

Contrary to other scholars of his time, Darwin (1871, p. 880) also had the rhythmic aspect of music in mind as well as its reproductive fitness advantage, as he wrote: "we may assume that musical tones and rhythm were used by our half-human ancestors, during the season of courtship." "Dancing is the vertical expression of a horizontal desire," a quote attributed to Robert Frost, to which George Bernard Shaw added "legalized by music," confirms this observation in poetry. Sexual attractiveness has long been hypothesized to be the main evolutionary function of dance but has become a scientific research focus only recently (Dissanayake, 2008; Oberzaucher and Grammer, 2008; Dean et al., 2009; Hanna, 2010; Grammer et al., 2011). Rhythmicity has been proposed as an indicator of mate quality (van den Broek and Todd, 2009). Furthermore, dancers are able to communicate subtle non-verbal signals (Oberzaucher and Grammer, 2008; Hanna, 2010; Grammer et al., 2011). The Latin-derived French word "emotion" does not by mere chance contain the word "motion" (Harper, 2001–2016). R&D move us profoundly. Motion alone can effectively communicate emotion, charisma and sex appeal (Oberzaucher and Grammer, 2008).

To say it with the words of Hanna (2010, p. 2): "Dance and sex both use the same instrument — namely, the human body — and both involve the language of the body's orientation toward pleasure. Thus, dance and sex may be conceived as inseparable even when sexual expression is unintended. The physicality of dance imbued with "magical" power to enchant performer and observer, threatens some people (Wagner, 1997; Karayanni, 2005; Shay and Sellers-Young, 2005). The dancing body is symbolic expression that may embody many notions. Among these are romance, desire, and sexual climax."

Movement quality not only seems to indicate mate quality, but also the interest of a potential partner, which could denote the probability of successful mating (Grammer et al., 2011; Neave et al., 2011).

### SOCIAL FITNESS

### Synchronization of Many Individuals

Synchronization is a behavior not limited to humans (Ravignani et al., 2014). It may have a direct effect on predators or reflect the general advantages of cooperation via positive social interactions, a finding also observed in macaques (Nagasaka et al., 2013). Rhythm enables the synchronization of thousands of dancing human beings such as in a rock concert (Canetti, 1960). The dynamics of rhythmic synchronization differ fundamentally from those of a swarm: a swarm is coordinated by an energy wave passing very quickly but consecutively through many individuals, each sensing the movement of the adjacent individual. This confuses a predator as to which individual prey to catch. Humans, presenting the simultaneous movement of a stomping crowd screaming and armed with fire, may deceive a predator by producing the impression of being one homogeneous, enormous animal which would be too powerful to attack. This effect may also be taken advantage of in hunting battues (Hagen and Bryant, 2003; Bispham, 2006; Trevarthen, 2011; Phillips-Silver and Keller, 2012; Repp and Su, 2013).

It has recently been argued that self-generated sounds of locomotion and ventilation interfere with the perception of the surroundings. The synchronization of the movement of a number of individuals would thus increase the duration of the intervals in which the surroundings can be heard better (Larsson, 2014). This means that synchronization would constitute a by-product of hunting abilities. To support this hypothesis, one would need to observe that traditional hunters pursue their prey in a kind of lockstep, an observation that has not been reported so far.

### Social Bonding

In humans, synchronous movement leads to "muscular bonding," which helps to overcome emotional boundaries between individuals and thus strengthens the community (Wiltermuth and Heath, 2009).

A pivotal fitness strategy of hominids is cooperation (Nowak, 2006; Nowak et al., 2010). Drumming and dancing are profoundly social activities (van Noorden, 2013). Some of the countless examples of this way of social engagement are "Samba de Roda," "Flamenco," and Senegalese "Sabar," where the audience supports the musicians and dancers rhythmically, members of the audience enter into the round for dancing and the drummers and/or other musicians interact directly with the dancer. In many musical cultures the dancer is a percussionist at the same time, as may be observed not only in traditional societies but also on ancient Egyptian and Greek frescoes or in present-day tap-dancing (Caballero, 1992; Redmond, 1997). In many societies dancing is an integral part of important group ceremonies such as initiation rites or weddings. In hunter-gatherer societies, groups may be limited to 40–50 people. The future spouse has to leave her or his group after the wedding in order to join the partner's group. By dancing, future spouses demonstrate their ability, strength and elegance not only to the future partner but also to other members of the group which will admit the spouse as a new member. In other words, promised spouses also need to catch the eyes of the mothers- and fathers-in-law or other group members who have a say. This latter aspect has, to our knowledge, not yet been explored. Bonding by R&D strengthens the community. Musicians delight dancers. They offer the fundament for the joy of the dancers. This profoundly emotional type of embodied extra-verbal communication increases the group's cohesion and the identification with the group (Kirschner and Tomasello, 2009, 2010; Boer et al., 2011, 2012, 2013; Davidson and Emberly, 2012; Boer and Abubakar, 2014; Kirschner and Ilari, 2014). Music and especially rhythm constitute a deeply rooted signaling system for extra-verbal communication evoking emotional reactions of other potentially cooperating individuals (Bryant, 2013). The propensity to move in time to rhythmic percussive sounds is manifest from an early age on, as seen in children's impulsive body movement in response to music (Zentner and Eerola, 2010). Joint drumming facilitates synchronization in preschool children (Kirschner and Tomasello, 2009).
Interpersonal synchrony increases helpfulness already in 14-month-old toddlers, and the promotion of prosocial behavior by interpersonal rhythmic synchrony has been confirmed in cross-cultural studies in 4-year-old children as compared to matched controls (Kirschner and Tomasello, 2010; Cirelli et al., 2014; Kirschner and Ilari, 2014; Trainor and Cirelli, 2015). R&D are also potent collective mood synchronizers (Hagen and Bryant, 2003; Wiltermuth and Heath, 2009; van Noorden, 2013). The emotional impact of the synchronization of many individuals in military drill has been impressively described by McNeill (1995).

### Keeping Peace

There is evidence of intra- and intergroup aggression in primates such as chimpanzees (de Waal, 2000), and hominids (Kelly, 2000; Zollikofer et al., 2002; Kelly, 2005). Hominids possessed spears for more than 400,000 years (Thieme, 1997). The advent of tools of potential use as weapons among hominids required even more effective reconciliation means (Wilkins et al., 2012).

To say it with the words of Zollikofer et al. (2002, p. 6447): "The intentional use of implements in the context of intragroup conflict must have had a major impact during hominid evolution because the availability of highly effective hunting and or foodprocessing tools in interpersonal conflict created a new and considerable potential for intragroup damage, a potential that required specific behavioral adjustments with which to cope. Intragroup aggression in primate societies must be understood as one specific behavioral option in a complex network of social interactions, which is typically balanced by active reconciliatory behavior [...]."

This ability is confirmed by the relative scarcity of traces of violence in prehistoric bone findings as compared to skeletons from historic times (Haas and Piscitelli, 2013). Dancing as an effective means of reconciliation has been well described among potentially hostile Andaman groups by Kelly (2005). Dancing made it possible to appease our most dangerous enemies: other men of other tribes or even of one's own group (Kelly, 2005; Evans Pim, 2013). Similar to the symbolic fights present in many non-human animal species, dance may serve to establish who is stronger before undertaking a fight, thus reducing the risk of injury and preventing casualties (Evans Pim, 2013). As a current example, ghetto dance battles may help to avoid deadly duels (McDermott and Hauser, 2005).

### Dance Rituals, Trance, Shamanism, and Religion

Nietzsche (1883–1885) argued that he would not believe in any god unless this god was able to dance. Dance in many societies is not only a delight but also a means of entering into contact with spirits and gods (Métraux, 1958; Lapassade, 1976; Verger, 1982; Ginn, 1990; Christoph and Oberländer, 1996; Jilek, 2009; Herbert, 2011). Although trance may in some cultures also be reached without dancing, rhythmical techniques including breathing, hyperventilation and dance, as on the Indonesian island of Bali, are the means used in the majority of societies for entering trance. Some historical and current examples of trance dances include the medieval European St. Vitus' dance, the Italian Tarantella, the Brazilian Candomblé, the Cuban Santería, the Japanese Nô, the Senegalese N'doep or the Sufi Dervish dances. Trance dance serves as catharsis reached through ecstasy. An ancestor, a spirit or a god drives the dancer; the dancer is possessed. Mask dances are common in societies worldwide, including Malian Dogon, Japanese Kabuki, Dan acrobats in Ivory Coast, and Egungun in Benin and Nigeria. Pre-Christian religious mask dances are the origin of present-day Carnival traditions. Dancers moving like puppets on strings, such as in Indian Kathakali and Japanese Kabuki, are the precursors of theater and pantomime. In this respect, it is interesting that a 15,000-year-old marionette puppet with moveable limbs has been found in a grave of an adult man believed to be a shaman in Brno, Czech Republic (Williams, 2011). In Ethiopian and Greek Orthodox Churches people dance for God. It is still a matter of debate whether religion is an adaptive complex itself or a by-product of adaptive behaviors in other nonreligious contexts. Since there is no evidence of "natural" nonreligious control populations, it cannot be excluded that religious beliefs, at least in hunter-gatherer societies, might have provided evolutionary advantages (Boyer, 2001; Dow, 2008; Antweiler, 2015).

### Embodied Pre-verbal Memorizing and Transfer of Traditions

In a pre-verbal context the importance of dance for individual and collective memorizing cannot be overemphasized. Dance in many traditional societies is an instrument to memorize hunting techniques and to preserve traditions by telling stories about the past of the community. In South India Kathakali is danced to tell tales of the Mahabharata epic (Ginn, 1990). Since the mirror motor neurons of those who observe dancers are activated, dance is an excellent method to train children and adolescents and to communicate experiences and skills which are later internalized by imitation (Rizzolatti et al., 1996). In this function, too, dance is the predecessor of theater (Sachs, 1933/1980).

### Paving the Way for Verbal Communication

Language might have evolved alongside melody, possibly passing through a "musilanguage" stage as already argued by Rousseau (Rousseau, 1781; Brown and Jordania, 2013). However, the evolution of language requires an underlying rhythmic and gestural understanding, i.e., embodied communication (Oberzaucher and Grammer, 2008; Phillips-Silver et al., 2010; Honing, 2012; Gillespie-Lynch et al., 2014). Rhythm perception makes it possible to discern words and is necessary to encode and decode language. Observing a dancer helps to recapitulate and decode gestures (Patel and Daniele, 2003; Patel, 2008; Hausen et al., 2013; Fujii and Wan, 2014; Magne et al., 2016). Thus, it is likely that R&D paved the way for the evolution of language.

### INDIVIDUAL FITNESS

### Individual Psychological Fitness

The individual benefits from R&D in several ways. R&D have anti-depressive effects and divert thoughts from sorrows and boredom. Fetuses are able to hear their mother's physical functions from the middle of pregnancy onward (Trehub, 2003; Parncutt, 2009; Grahn, 2012). The mother's breathing and heartbeat may produce an incessant conditioning effect which one could describe as a "soothing fetal brainwash." Up to this point, there is no difference between humans and other mammals. Human babies are, however, especially immature at birth compared to other animals. Therefore, human infants may require more specific soothing efforts such as rocking. To rock the baby one needs free arms. Soothing a baby by rocking is probably a human universal which, however, has not been investigated in this respect. A recent study comparing cultural effects on rocking a baby for soothing showed more similarities than differences between cultures (Vinall et al., 2011). There is some research on the effects of rocking in the medical literature. In PubMed we found 157 hits from 1948 to 2014. Premature babies in particular benefit from rocking (Malcuit et al., 1988; Clark et al., 1989; Sammon and Darnall, 1994). Rocking has a positive effect on the entrainment of respiration as well as on neuromuscular development of infants (Malcuit et al., 1988; Clark et al., 1989). Intuitively, one may assume that experienced caretakers know that rocking is an effective means to soothe a baby, but if we look very carefully at infants' behavior we appreciate that the infants themselves induce their caretakers to rock them, since other means are less effective. Infants and toddlers also exhibit active physiological stereotypic movements (Sallustro and Atwell, 1978; Thelen, 1979; Barry et al., 2011; Lutz, 2014). Interestingly, physiological rhythmical stereotypies not only have a self-soothing effect, as reflected by heart rate reduction, but frequently involve the legs, a behavior that could be the starting point of dancing (Soussignan and Koch, 1985). In fact, later spontaneous unintentional movements to a musical beat also most frequently involve the lower limbs, reflecting an unconscious proneness to dance (Woods and Miltenberger, 1996; Janata et al., 2012). Whether or not rocking a baby is a behavior strictly confined to humans is an interesting research question which deserves to be explored by evolutionary biologists. We did not find any report of animals or non-human primates rocking their offspring. Moreover, whereas rhesus monkeys have not been found to be good detectors of beat (Honing et al., 2012), chimpanzees display rhythmical behaviors (Ravignani et al., 2013).

It is conceivable that the sensitivity of babies to being rocked and physiological stereotypies paved the way for the evolution of R&D in humans (Soussignan and Koch, 1985). Rocking may also promote the ability of infants to stratify rhythm (Phillips-Silver and Trainor, 2005). Dance enables one to self-induce the soothing effect of being rocked. Dance appeases the tormented soul and leads to the secretion of hormones like dopamine and endorphins (Sutoo and Akiyama, 2004; Harris, 2007; Salimpoor et al., 2011; Dunbar et al., 2012). The particularly strong emotional impact of R&D is underscored by recent applications in medicine. Their capacity to influence mood, to reach autistic patients otherwise refractory to any emotional involvement, and to make patients with Parkinson's disease start moving is exploited clinically (Hayakawa et al., 2000; Sacks, 2007; See, 2012; Moore, 2013; Nombela et al., 2013; Boehm et al., 2014; Ashoori et al., 2015). Playing musical instruments and dancing reduce the risk of dementia in the elderly (Verghese et al., 2003).

R&D make it possible to divert the otherwise unstoppable flow of thinking (Steiner, 2006). Dance and music playing enable psychological "flow" experiences that wipe away unpleasant thoughts, sorrows and boredom (Thomson and Jaque, 2012; Chirico et al., 2015). Boredom may be a phenomenon not only of modern societies but also of traditional societies. Men seem to be more prone to both boredom and violence, which are also associated with suicide (Wrangham and Peterson, 1996; Heinsohn, 2003). R&D help to overcome boredom and thus contribute to keeping the peace and saving lives (Sundberg et al., 1991; Choquet et al., 1993; Wexler and Goodwin, 2006).

R&D are particularly powerful means of expressing the essential "joie de vivre" (joy of life), i.e., the pure "raison d'être" (reason to exist), a philosophical aspect which has been particularly emphasized by Latin-American and African authors (Giglio and Giglio, 1980; Foix, 2007; Kouam and Mofor, 2011). As Jean Massoulier wrote: "Je danse donc je suis" (I dance therefore I am).

### Individual Physical Fitness

Dance, rhythm, music and being rocked have been shown to have painkilling effects (Lehmann, 2010; Pillai Riddell et al., 2011; Dunbar et al., 2012; Johnston et al., 2014). The capacity of music to reduce the dosage of painkilling medication in intensive care patients is documented in medicine (Lehmann, 2010). A recent study has shown that rhythmical music reduces the perceived exertion induced by strenuous physical performance, an observation which was well known to the cotton harvesters in the USA and is reflected in specific working songs. This effect occurs not only on a psychological but also on a proprioceptive level (Fritz et al., 2013). The pain threshold is elevated more by active drumming, dancing or singing than by passive music listening (Dunbar et al., 2012). Rhythmic movements or breathing to the point of hyperventilation are effective means for entering trance, an effect that Hindu yogis take advantage of when they perforate their skin, tongue, or lips before starting their processions.

Furthermore, active and passive rhythmical movements improve body coordination (Trainor and Cirelli, 2015). Although a major evolutionary advantage is to be expected from cooperation, at certain moments preparation for fighting may help a group to prevail against enemies and thereby improve its access to resources (Kelly, 2000). Individual and collective coordination skills are trained in martial dances, for example in Brazilian Capoeira and Maculelê, Sicilian Taratatà, and Indian Kalaripayattu (Phillips-Silver et al., 2010).

From an evolutionary perspective, all these more or less overlapping aspects are likely to have played a role although these are not equally important at the same time and age. We would tentatively rank reproductive fitness, cooperation and bonding as the driving evolutionary forces whereas the individual aspects may have further contributed to the evolutionary functions of dance in specific age, gender, and prehistoric contexts (Nowak, 2006; Nowak et al., 2010). Survival is particularly important for children in traditional societies with high infant mortality (Carter and Mendis, 2002; Hart and Sussmann, 2009). Reproductive fitness applies to sexually mature individuals who may even risk their lives in order to find potential partners. Peace-keeping and martial dancing could have been particularly important for young men during periods of high violence. On the other hand, martial dances are not human universals and high violence periods have been more widespread in historic times than in prehistory (Kelly, 2000; Haas and Piscitelli, 2013).

In summary, dance offers evolutionary advantages to humans by contributing to sexual reproduction signaling, cooperation, social bonding, infant care, violence avoidance as well as embodied individual and social communication and memorization. Anticipating one consequence of our R&D concept, we would expect not only that beat induction is innate but also that, during their development, infants and toddlers spontaneously start to dance earlier than they produce other musical utterances such as singing, and that this behavior does not depend on the cultural background of their parents. To further investigate the specific functions of R&D in humans, it would be highly interesting to compare the timing of their emergence during the human lifespan with the emergence of synchronic behavior in non-human animals.

### CONCLUSION

The main intention of this article is to provide a refined concept for further interdisciplinary research on the evolution of dance and music in humankind. It is proposed that in future studies on the evolution of music, attention should be paid to precisely which attribute of music is in focus, whether rhythm, melody, or harmony. The same applies to rhythmical attributes, i.e., pulse of beats, stronger or weaker beats (downbeats, offbeats), tactus, tempo, meter, and patterns. The evolutionary functions of dance have been relatively neglected. The close mutual relationship between rhythm and dance and embodied rhythm perception should be fully acknowledged in future research.

### AUTHOR CONTRIBUTIONS

JR did the literature search, developed the hypothesis and wrote the manuscript; RO contributed to the literature search, to developing the hypothesis and to writing the manuscript.

### ACKNOWLEDGMENTS

We are deeply indebted to Prof. Marc Leman of the Institute for Psychoacoustics and Electronic Music (IPEM), Department of Musicology, Ghent University, Ghent, Belgium, as well as to the reviewers, who have read several drafts of this manuscript and contributed substantially to improving this article.

### REFERENCES

Bergeson, T. R., and Trehub, S. (2005). Infants perception of rhythmic patterns. Music Percept. 23, 345–360.


Christoph, H., and Oberländer, H. (1996). Voodoo. Cologne: Taschen Verlag.


enhances repetition suppression. J. Neurosci. 31, 18590–18597. doi: 10.1523/JNEUROSCI.2599-11.2011





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Richter and Ostovar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Sensory Entrainment Mechanisms in Auditory Perception: Neural Synchronization Cortico-Striatal Activation

#### Catia M. Sameiro-Barbosa<sup>1</sup> and Eveline Geiser<sup>1, 2, 3</sup>\*

<sup>1</sup> Service de Neuropsychologie et de Neuroréhabilitation, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland, <sup>2</sup> The Laboratory for Investigative Neurophysiology, Department of Radiology, Centre Hospitalier Universitaire Vaudois, Lausanne, Switzerland, <sup>3</sup> Department of Brain and Cognitive Sciences, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA

#### Edited by:

Sonja A. Kotz, Maastricht University, Netherlands; Max-Planck Institute for Human Cognitive and Brain Sciences, Germany

#### Reviewed by:

Jessica A. Grahn, University of Western Ontario, Canada

Johanna Maria Rimmele, Max-Planck-Institute for Empirical Aesthetics, Germany

> \*Correspondence: Eveline Geiser eveline.geiser@chuv.ch

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 12 April 2016 Accepted: 20 July 2016 Published: 10 August 2016

#### Citation:

Sameiro-Barbosa CM and Geiser E (2016) Sensory Entrainment Mechanisms in Auditory Perception: Neural Synchronization Cortico-Striatal Activation. Front. Neurosci. 10:361. doi: 10.3389/fnins.2016.00361

The auditory system displays modulations in sensitivity that can align with the temporal structure of the acoustic environment. This sensory entrainment can facilitate sensory perception and is particularly relevant for audition. Systems neuroscience is slowly uncovering the neural mechanisms underlying the behaviorally observed sensory entrainment effects in the human sensory system. The present article summarizes the prominent behavioral effects of sensory entrainment and reviews our current understanding of the neural basis of sensory entrainment, such as synchronized neural oscillations, and potentially, neural activation in the cortico-striatal system.

Keywords: entrainment, neural oscillations, striatum, auditory, regularity, beat, phase-locking, predictive coding

## INTRODUCTION

Two pendulum clocks positioned on the same table synchronize over time; this is a process called "entrainment" (Huygens, 1893). Many scientific fields have adopted this terminology for conditions in which two dynamic systems align. This review focuses on sensory entrainment, that is, the behaviorally observed temporal alignment of the sensory system with its environment. In everyday situations, motor actions, such as clapping in synchrony with music or alignment of walking pace in a group of people, are the result of sensory entrainment (for a review, see Ross and Balasubramaniam, 2014; see Merchant et al., 2015). However, sensory entrainment is relevant beyond motor behavior. Our sensory environment is unimaginable without its temporal structure. Tuning in to this temporal structure is thought to be a fundamental mechanism required for efficient auditory and speech perception (for a review see Giraud and Poeppel, 2012; Golumbic et al., 2013; Zoefel and VanRullen, 2015). Such sensory entrainment is, for example, evidenced through facilitated sensory perception in the context of temporal regularity (Jones et al., 2002; Geiser et al., 2012). We review neural correlates that potentially underlie the behaviorally observed alignment of the sensory system to a temporally regular or quasi regular environment.

### BEHAVIORAL EVIDENCE OF SENSORY ENTRAINMENT

The behavioral effects of sensory entrainment are typically shown in the context of temporally regular, ideally isochronous, environmental stimulation in which the occurrence of the next sensory input can be temporally predicted. For example, to measure sensory-motor synchronization, listeners tap to temporally regular auditory stimulation (Nozaradan et al., 2015). Synchronization to auditory cues is more precise than to visual cues (Hove et al., 2013), although synchronization to visual and even tactile cues is also used to measure entrainment (Lange and Roeder, 2006; Fernandez Del Olmo et al., 2007; Elliott et al., 2010, 2011; Ruspantini et al., 2011). Sensory-motor synchronization tasks include not only sensory but also motor entrainment.
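
The tapping measure described above can be made concrete with a minimal sketch. It computes signed tap-to-tone asynchronies for an isochronous sequence. The function name `tap_asynchronies`, the nearest-onset matching rule, and the example numbers are illustrative assumptions and are not taken from the cited studies.

```python
from statistics import mean, stdev


def tap_asynchronies(tone_onsets, tap_times):
    """Return the signed asynchrony (tap minus nearest tone onset, in seconds) per tap.

    Negative values mean the tap preceded the tone.
    Matching each tap to its nearest onset is a simplifying assumption.
    """
    asynchronies = []
    for tap in tap_times:
        nearest = min(tone_onsets, key=lambda onset: abs(onset - tap))
        asynchronies.append(tap - nearest)
    return asynchronies


# Hypothetical data: tones every 0.5 s, taps slightly ahead of each tone.
tones = [i * 0.5 for i in range(6)]
taps = [-0.03, 0.46, 0.98, 1.47, 1.96, 2.48]

asyn = tap_asynchronies(tones, taps)
print("mean asynchrony (s):", round(mean(asyn), 3))
print("asynchrony SD (s):", round(stdev(asyn), 3))
```

In a sketch like this, the mean asynchrony summarizes how far taps lead or lag the tones, and its standard deviation summarizes how precise the synchronization is.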

Pure sensory entrainment is measured in perceptual tasks. These tasks typically show facilitated perception of stimuli when they are presented in a temporal context that allows entrainment compared to a context that does not allow entrainment. In the auditory domain, auditory temporal regularity, compared to temporal irregularity, results in faster reaction times to tones in various tasks (Lange, 2009; Rimmele et al., 2011), as well as better discrimination of differences in pitch (Jones et al., 2002), intensity (Geiser et al., 2012), and duration (Barnes and Jones, 2000; McAuley and Jones, 2003). Similar effects are observed in the visual domain (Rohenkohl et al., 2012; Marchant and Driver, 2013) and cross-modally, as in cases of auditory regular temporal grids facilitating saccadic eye movement (Bolger et al., 2013; Miller et al., 2013) and improving visual word recognition and discrimination (Bolger et al., 2013; Brochard et al., 2013) and of rhythmic movement facilitating sound perception (Morillon et al., 2014). Sensory facilitation is even observed against competing task demands (Cutanda et al., 2015). Most importantly, sensory entrainment effects are observed not only when the target stimulus is presented in the context of temporal regularity but also when temporal regularity precedes the target stimulus and the target appears at a predictable point in time as defined by the preceding sequence (Ellis and Jones, 2010; Sanabria et al., 2011; Cason and Schön, 2012; Sanabria and Correa, 2013; Cason et al., 2015). For example, sound signal detection is modulated at the rate of a previously presented amplitude modulated signal (Hickok et al., 2015). Thus, a variety of experimental tasks show the temporal context sensitivity of the sensory system, indicating facilitated perception through temporal regularity. Critically, sensory entrainment is behaviorally evidenced by the internal perpetuation of previously entrained excitability of the sensory system.

Outside of the research context, strictly regular, isochronous stimulation is the exception; it is found in music, in which temporal regularity is a defining feature (Geiser et al., 2014). However, there is emerging evidence that auditory sensory entrainment is present even in the absence of strict temporal regularity. Although behavioral effects are greatest in the context of temporal isochrony, sound perception is facilitated by varying degrees of temporal expectation (Herrmann et al., 2016). The capacity of the sensory system to detect and to synchronize to the average frequency of a stream of sounds and to perpetuate this synchronization, resulting in temporal predictions, is one of the preconditions allowing the use of entrainment for processing natural stimuli such as speech.
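
As a rough illustration of synchronizing to the average rate of a quasi-regular stream and perpetuating it as a temporal prediction, the sketch below estimates the mean inter-onset interval and extrapolates the next onset. This is a deliberately simple stand-in rather than a model proposed in the literature reviewed here; the function name `predict_next_onset` and the jittered onset times are hypothetical.

```python
from statistics import mean


def predict_next_onset(onsets):
    """Extrapolate the next onset from the mean inter-onset interval (IOI).

    Returns (predicted_onset_s, average_rate_hz).
    """
    if len(onsets) < 2:
        raise ValueError("need at least two onsets to estimate an interval")
    intervals = [later - earlier for earlier, later in zip(onsets, onsets[1:])]
    mean_ioi = mean(intervals)
    return onsets[-1] + mean_ioi, 1.0 / mean_ioi


# Hypothetical quasi-regular stream: roughly 2 Hz with temporal jitter.
jittered = [0.00, 0.52, 0.97, 1.51, 2.00]
next_onset, rate_hz = predict_next_onset(jittered)
print(round(next_onset, 3), round(rate_hz, 3))
```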

## NEURAL CORRELATES OF SENSORY ENTRAINMENT

The temporal context in which sounds are perceived influences neural activity. Although attention might have a modulatory effect (Hsu et al., 2014), event-related potentials (ERPs) are typically attenuated in the context of temporal regularity (Lange, 2009; Schmidt-Kassow et al., 2009; Lecaignard et al., 2015). Effects of temporal regularity are observed in the auditory N1 (Lange, 2009, 2010; Costa-Faidella et al., 2011; Rimmele et al., 2011; Sanabria and Correa, 2013) and its electromagnetic correlate N1m (Okamoto et al., 2013). Moreover, the reduction in N1 amplitude to isochronously presented tones shows the suppression of early signals, indicating a modulation of activation in secondary auditory cortices, namely the planum temporale (PT), through temporal regularity (Costa-Faidella et al., 2011). The sensitivity of sensory responses in the PT to temporal regularity is paralleled in an fMRI study on speech regularity, in which activation in the PT was modulated by temporal regularity (Geiser et al., 2008). Such modulation of neural activation by temporal regularity in primary and secondary cortices could be the result of sensory entrainment. Two mechanisms underlying sensory entrainment have been suggested, both of which may or may not be independent from each other: (1) synchronized neural oscillations in sensory and motor cortices and, potentially, (2) cortico-striatal brain activation (**Figure 1**). The neural correlates supporting these suggestions are reviewed in the following sections.

The first neural correlate of sensory entrainment is synchronized neural oscillation. Neuronal populations in the living brain show intrinsic fluctuations of excitability at the level of the cell membrane (Fiser et al., 2004; Lakatos et al., 2005). These fluctuations can be measured as periodic waves intracranially or on the scalp, via local field potentials or electroencephalograms, respectively. They can be characterized by their frequency, amplitude, and phase and are defined as delta (2–4 Hz), theta (4–8 Hz), alpha (8–12 Hz), beta (12–30 Hz), and gamma (30–100 Hz) bands. Neural oscillations typically synchronize across frequency bands, as has been shown in the auditory (Lakatos et al., 2005, 2013) and visual cortices (Lakatos et al., 2008). This hierarchical cross-frequency coupling (Schroeder and Lakatos, 2009) is suggested to influence neuronal interactions (Womelsdorf et al., 2007, for a review, see Fries, 2015). Importantly, intrinsic neural oscillations display the ability to phase-lock and thus entrain to external stimulation. This neuronal entrainment through phase-locking is observed in the visual (Montemurro et al., 2008), auditory (Luo and Poeppel, 2007; Besle et al., 2011), and somatosensory (Langdon et al., 2011; Ross et al., 2013) cortices, as well as cross-modally (Luo et al., 2010; Power et al., 2012). Thus, periodic neural oscillations synchronize to external stimulation within and across modalities.
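
The band definitions and the notion of phase-locking can be illustrated with a short sketch. The band edges are those listed in the text; treating the lower edge as inclusive and the upper edge as exclusive is an assumption. Phase-locking is quantified here as the length of the mean resultant vector of phases across trials, one common measure that is chosen for illustration and is not necessarily the one used in the cited studies. The phase values are made up.

```python
import cmath
import math

# Band edges in Hz as listed in the text; inclusive lower edge is an assumption.
BANDS = [
    ("delta", 2, 4),
    ("theta", 4, 8),
    ("alpha", 8, 12),
    ("beta", 12, 30),
    ("gamma", 30, 100),
]


def band_of(freq_hz):
    """Return the band name for a frequency, or None if it falls outside all bands."""
    for name, low, high in BANDS:
        if low <= freq_hz < high:
            return name
    return None


def phase_locking(phases):
    """Length of the mean resultant vector of phases (radians).

    0 means phases are spread evenly around the cycle, 1 means identical phases.
    """
    return abs(sum(cmath.exp(1j * p) for p in phases) / len(phases))


# Hypothetical phases at stimulus onset across trials.
locked = [0.10, 0.12, 0.09, 0.11]                       # tightly clustered
spread = [k * 2 * math.pi / 8 for k in range(8)]        # evenly spread

print(band_of(3), band_of(20))
print(round(phase_locking(locked), 3), round(phase_locking(spread), 3))
```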

**Figure caption (fragment):** … Lakatos et al., 2013) and activation in the putamen (Geiser et al., 2012). Figures adapted from Calderone et al., 2014 and Geiser et al., 2012.

The intrinsic oscillatory state of neuronal activity can affect whether a sensory cue is detected. Both a change in amplitude (power modulation) and the point in the cycle of a neural oscillation (phase) can influence target detection in the visual (Busch et al., 2009; Mathewson et al., 2009) and the auditory domains (Ng et al., 2012). Because the intrinsic oscillatory state can influence perception, entrained oscillations should likewise facilitate perception. Indeed, the phase of entrained neural delta oscillation predicts sound gap detection (Henry and Obleser, 2012; Henry et al., 2014). Thus, there is a strong link between the intrinsic or entrained oscillatory state of neural activity and behavioral performance.
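
One simple way to examine whether oscillatory phase relates to detection is to sort trials by the phase at target onset and compare hit rates across phase bins. The sketch below assumes that a phase estimate and a detection outcome are already available for each trial. The function name `hit_rate_by_phase`, the choice of four bins, and the trial data are hypothetical and do not reproduce the analyses of the cited studies.

```python
import math


def hit_rate_by_phase(trials, n_bins=4):
    """Compute the hit rate per phase bin.

    trials: iterable of (phase_rad, detected) pairs.
    Returns a list of hit rates, with None for bins that contain no trials.
    """
    hits = [0] * n_bins
    counts = [0] * n_bins
    width = 2 * math.pi / n_bins
    for phase, detected in trials:
        # Wrap the phase into [0, 2*pi) before assigning it to a bin.
        idx = int((phase % (2 * math.pi)) / width) % n_bins
        counts[idx] += 1
        hits[idx] += int(detected)
    return [h / c if c else None for h, c in zip(hits, counts)]


# Hypothetical trials: (phase at target onset in radians, target detected?).
trials = [
    (0.2, True), (0.5, True),
    (1.8, True), (2.0, False),
    (3.3, False), (3.6, False),
    (5.0, True), (5.5, False),
]
rates = hit_rate_by_phase(trials)
print(rates)
```

A hit rate that varies systematically across bins would be consistent with phase-dependent detection, although real analyses require many more trials and appropriate circular statistics.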

Some components of neural oscillations, namely aspects of beta-band oscillations, seem to underlie the predictive or sustaining aspect of sensory entrainment. Synchronization of neural activity to auditory cues has been observed most strongly in the low frequencies, particularly the delta and theta frequency bands (Kayser et al., 2009; Howard and Poeppel, 2012; Ding et al., 2014), but also in higher frequencies, including the beta and gamma frequency bands (Snyder and Large, 2005; Fujioka et al., 2012). Beta power decreases rapidly after each tone and increases before the next tone in the context of temporal regularity. Importantly, the increase depends on the tempo of the presented stimuli, with a rapid increase for fast tempi and a slower increase for slower tempi (Fujioka et al., 2012). Moreover, when an expected stimulus is omitted, the decrease in beta power is absent, but the increase before the next tone is nevertheless present (Fujioka et al., 2009). Both findings indicate that the increase in beta power is not simply following amplitude modulations in the entraining stimulus but might represent the endogenous encoding of the predicted time interval. This modulation of the beta band by passive listening to isochronous sounds has been replicated in adults (Fujioka et al., 2015) and in children (Cirelli et al., 2014; Etchell et al., 2016). Thus, although evidence linking the predictive nature of beta-band modulations to behavior is still missing, existing electrophysiological evidence supports the idea that beta-band activity carries predictive value in the context of sensory entrainment.

Not only do neural oscillations in the sensory cortex entrain to auditory stimuli, such entrainment is also observed in other areas of the brain (i.e., motor-related brain regions). Sensorimotor cortices (the precentral and postcentral gyri), anterior cingulate cortex, cerebellum, inferior-frontal gyrus, supplementary motor area (Fujioka et al., 2012), and medial and lateral premotor cortex displayed modulation of beta oscillation in response to an external stimulation (Fujioka et al., 2015). While beta modulation in motor regions is frequently observed during movement (for review, see Khanna and Carmena, 2015), the beta activity reported here is observed in the absence of movement and must therefore relate to the temporal processing of sensory stimuli, potentially involving predictive mechanisms. It is, however, an open question whether beta oscillation in motor-related brain regions can have a predictive value, thus underlying sensory entrainment, as is assumed for the beta oscillation in the sensory cortex.

In response to more ecological stimuli, such as speech, neural oscillations can synchronize in time ranges from the level of phonemes to the level of the syllables (for a review, see Ahissar et al., 2001; Giraud and Poeppel, 2012; Saoud et al., 2012; Power et al., 2013), with differential synchronization abilities of hemispheres potentially underlying the hemispheric specialization for speech (Giraud et al., 2007). Although such neural entrainment occurs across various oscillatory frequencies (Gross et al., 2013; Peelle et al., 2013), it is most frequently observed for low frequencies (Luo and Poeppel, 2007; for review see, Peelle and Davis, 2012). Moreover, synchronization seems to depend on previous exposure to a speech cue. The degree of familiarity with speech can facilitate entrainment (Lidji et al., 2011) and modulate oscillatory responses. Power synchronization in the theta band was observed when listening to the native language only (Pérez et al., 2015) and increased gamma-band power was observed when listening to the native language compared to a foreign language (Peña and Melloni, 2011). This indicates that neural oscillations might help to assess the meaning of speech.

Another potential neural correlate of sensory entrainment is neural activation in the dorsal striatum. Several studies manipulating the temporal context of auditory sequences have reported activation in the putamen. Typically, this activation was observed when experimental subjects listened to sound sequences comprising temporal regularity. These studies examined explicit processing of timing by applying perceptual tasks, such as regularity detection (Grahn and Rowe, 2009) and duration discrimination in the context of a temporally regular sequence (Teki et al., 2011a), motor tasks such as the reproduction of a rhythm comprising temporal regularity or motor synchronization with the beat (Riecker et al., 2003; Chen et al., 2008), or simply listening to a rhythmic beat (Grahn and Brett, 2007). Hence, models of auditory perception have attributed a central role to the basal ganglia, for example, as a brain region tracking temporal modulations in acoustic signals including speech (Kotz et al., 2009; Teki et al., 2011b; Schwartze et al., 2012) or integrating predictive coding in speech perception (Lim et al., 2014).

Although the above evidence indicates that activation in the putamen plays a role in temporal regularity perception, it does not reveal whether the putamen plays a role in sensory entrainment. We measured activation in the putamen in a typical sensory entrainment task (Geiser et al., 2012). Participants had to detect an intensity change in a sequence of tones that were either temporally regular (isochronous) or temporally irregular. As expected, temporal regularity enhanced auditory perception for tone intensity, and there were two associated patterns of brain activation. First, there was decreased activation in bilateral regions of the temporal lobe in response to temporally regular sequences compared to irregular sequences. Second, there was increased activation in the putamen in response to temporally regular sequences relative to irregular sequences. Thus, striatal activation is not only involved when participants encounter temporal regularity but is also observed in a typical sensory entrainment task. Importantly, across individuals, the reduced activation in primary and secondary auditory cortices in response to temporal regularity perception, which yielded better behavioral performance, was linearly correlated with increased activation in the putamen. This correlation could indicate that the striatum dynamically interacts with the sensory cortex either directly or through a mediating brain area to facilitate perception in the context of sensory entrainment.

The functional role that the striatum could play in sensory entrainment remains elusive. One could imagine that the putamen simply detects temporal regularity or the average tempo of a sequence. Alternatively, the putamen may crucially underlie sensory entrainment by internally perpetuating temporal regularity and predicting future acoustic events. Evidence demonstrating the latter is still lacking. However, when participants explicitly tracked temporal regularity in the second of two sequences in which the tempo either changed or did not change between the two sequences, greater activation in the putamen was found when a sequence repeated the tempo of a previously heard sequence than when the tempo changed (Grahn and Rowe, 2013). This indicates that the striatum responds when a tempo prediction is confirmed by the external stimulus. The authors suggest that this reflects the encoding of predictive aspects of temporal regularity perception. This is in line with an earlier study suggesting that the putamen encodes prediction, at least in motor learning (Haruno and Kawato, 2006). Further studies will need to test whether putamen activation in the context of sensory entrainment is related more to the confirmation of a prediction or to the generation of a prediction.

Whether the two neural correlates of sensory entrainment, neural oscillations and striatal activation, are functionally linked remains to be investigated. However, evidence from motor studies suggests a potential link. At least in some putaminal recording sites, the spectral power of beta oscillations increases when monkeys perform self-generated tapping at a previously learned tempo compared to when they tap in response to an irregularly appearing cue (Bartolo et al., 2014; Bartolo and Merchant, 2015). This indicates that some striatal circuits might play a role in the internal generation of temporal regularity, at least in the context of motor processing. Thus, it is possible that increased putamen activation as measured in the BOLD response is driven by enhanced putaminal beta activity.

### IS ATTENTION NECESSARY FOR SENSORY ENTRAINMENT?

It has long been known that "dynamic attending" induced by temporally regular stimuli can lead to faster reaction times at temporally expected points in time (Jones and Boltz, 1989; Barnes and Jones, 2000; London, 2004). Most recent experimental paradigms measuring sensory entrainment comprise active tasks in which participants focus their attention on the entraining stimulus, allowing stimulus-driven attending that involves temporal expectancy (Jones et al., 2002; Sanabria and Correa, 2013). Sensory attenuation and putaminal activation in the context of sensory entrainment are observed in the presence of endogenous attention (Lange, 2010; Costa-Faidella et al., 2011; Geiser et al., 2012), and synchronization of neural oscillations to sensory stimuli is particularly strong when attention is directed toward the entraining sound (Besle et al., 2011; Horton et al., 2013).

While the sensory effect of temporal context in the presence of endogenous attention is well investigated, less is known about temporal expectancy in the absence of endogenous attention. Evidence from visual studies suggests that temporal expectation and attention might influence neural activation in opposite ways (Summerfield and Egner, 2009; Kok et al., 2012; see also Arnal and Giraud, 2012). In the auditory domain, orthogonal manipulation of expectation and attention showed an attenuation effect on the N1 in the attended condition only (Hsu et al., 2014). Based on this finding, one could hypothesize that the attenuating effect of a regular temporal context might depend on the presence of endogenous attention.

However, neural effects of entrainment are also observed in the absence of endogenous attention. In passive oddball paradigms, temporal predictability influences auditory ERPs to acoustic (Geiser et al., 2010) or higher-level deviants (Tavano et al., 2014). Moreover, neural oscillations entrain to auditory stimuli when participants' endogenous attention is directed to a concurrent visual (Fujioka et al., 2009, 2012) or auditory stimulus (Golumbic et al., 2013; Horton et al., 2013; Rimmele et al., 2015). Moreover, in an unattended condition, expectation modulates auditory beta-band synchronization to tones (Todorovic et al., 2015). Thus, attention networks use oscillatory phase entrainment for both enhancement and suppression of auditory signals (for a review, see Calderone et al., 2014).

The above evidence indicates that sensory entrainment is influenced by attention but that neural effects of entrainment are present in both attended and unattended processing conditions. Further studies will need to investigate the behavioral effects and the cortico-striatal mechanisms related to sensory entrainment as a function of attention.

In summary, sensory entrainment is essential for auditory perception. It drives perception to be best at temporally expected moments in time. Neural oscillations and, potentially, striatal brain activation underlie sensory entrainment. Whether these two correlates are part of the same mechanism and the way in which attention interacts with mechanisms of sensory entrainment remain to be investigated.

### AUTHOR CONTRIBUTIONS

Conceptualization, EG. Writing-Original Draft, EG, CS. Writing, Review, and Editing, EG, CS. Visualization, CS.

### FUNDING

Swiss National Science Foundation: PZ00P1\_148184/1 awarded to EG and FN320030-159708 awarded to Stephanie Clarke.

### ACKNOWLEDGMENTS

We would like to thank the two reviewers for their helpful comments on our manuscript.

### REFERENCES



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Sameiro-Barbosa and Geiser. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Can Birds Perceive Rhythmic Patterns? A Review and Experiments on a Songbird and a Parrot Species

Carel ten Cate<sup>1†</sup>, Michelle Spierings<sup>1†</sup>\*, Jeroen Hubert<sup>1</sup> and Henkjan Honing<sup>2</sup>

*<sup>1</sup> Behavioural Biology, Institute of Biology Leiden and Leiden Institute for Brain and Cognition, Leiden University, Leiden, Netherlands, <sup>2</sup> Amsterdam Brain and Cognition, Institute for Logic Language and Computation, University of Amsterdam, Amsterdam, Netherlands*

#### Edited by:

*Angela Dorkas Friederici, Max Planck Institute for Human Cognitive and Brain Sciences, Germany*

#### Reviewed by:

*Erich David Jarvis, Duke University Medical Center, USA; Yoshimasa Seki, Aichi University, Japan*

\*Correspondence:

*Michelle Spierings m.j.spierings.2@biology.leidenuniv.nl*

*† These authors have contributed equally to this work.*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

Received: *10 December 2015* Accepted: *29 April 2016* Published: *19 May 2016*

#### Citation:

*ten Cate C, Spierings M, Hubert J and Honing H (2016) Can Birds Perceive Rhythmic Patterns? A Review and Experiments on a Songbird and a Parrot Species. Front. Psychol. 7:730. doi: 10.3389/fpsyg.2016.00730*

While humans can easily entrain their behavior with the beat in music, this ability is rare among animals. Yet, comparative studies in non-human species are needed if we want to understand how and why this ability evolved. Entrainment requires two abilities: (1) recognizing the regularity in the auditory stimulus and (2) adjusting one's own motor output to the perceived pattern. It has been suggested that beat perception and entrainment are linked to the ability for vocal learning. The presence of some bird species showing beat induction, and also the existence of vocal learning as well as vocal non-learning bird taxa, make birds relevant models for comparative research on rhythm perception and its link to vocal learning. Also, some bird vocalizations show strong regularity in rhythmic structure, suggesting that birds might perceive rhythmic structures. In this paper we review the available experimental evidence for the perception of regularity and rhythms by birds, such as the ability to distinguish regular from irregular stimuli over tempo transformations, and report data from new experiments. While some species show a limited ability to detect regularity, most evidence suggests that birds attend primarily to absolute and not relative timing of patterns and to local features of stimuli. We conclude that, apart from some large parrot species, there is limited evidence for beat and regularity perception among birds and that the link to vocal learning is unclear. We next report the new experiments in which zebra finches and budgerigars (both vocal learners) were first trained to distinguish a regular from an irregular pattern of beats and then tested on various tempo transformations of these stimuli. The results showed that discrimination was reduced in both species after tempo transformations. This suggests that, as was found in earlier studies, they attended mainly to local temporal features of the stimuli, and not to their overall regularity. However, some individuals of both species showed an additional sensitivity to the more global pattern if some local features were left unchanged. Altogether, our study indicates both between- and within-species variation, in which birds attend to a mixture of local and global rhythmic features.

Keywords: rhythm perception, songbirds, parrots, perceptual bias, local vs. global information

## INTRODUCTION

In 1871, Darwin wrote: "The perception, if not the enjoyment, of musical cadences and of rhythm is probably common to all animals and no doubt depends on the common physiological nature of their nervous systems" (Darwin, 1871). At the time, this thought was understandable as many animal species show behaviors that are characterized by some form of rhythmicity. It can be found in invertebrates, like the flashing patterns of fireflies, which can even be synchronized (Buck, 1988), as well as in vertebrates, like the strong rhythmicity characterizing some bird vocalizations. For instance, the cooing of the collared dove (Streptopelia decaocto) consists of a series of repeated "coos," each consisting of three vocal elements of different duration separated by brief pauses, also of unequal duration. This temporal pattern, and hence the structure of the coo as a unit, is highly stereotyped (Ballintijn and ten Cate, 1999), resulting in a characteristic rhythmic pattern for a series of coos. Receivers are very sensitive to the overall regularity of the coo: if the temporal structure is changed, the responses are strongly reduced (Slabbekoorn and ten Cate, 1999). The question is whether, as Darwin implied, such examples indicate a sensitivity to rhythmicity in general (ranging from a sensitivity for rhythmic pattern, pulse, and meter, as well as the influence of tempo and timing; Honing, 2013) or whether this sensitivity is confined to particular species-specific behaviors. Below, we will first review this topic, with particular attention to the "vocal learning and rhythmic entrainment hypothesis" formulated by Patel et al. (Patel, 2006; Schachner et al., 2009; Patel et al., 2009a,b). In doing so, we focus on studies on the perception of rhythmic patterns in birds, which for various reasons provide an ideal group for comparative studies on this topic. 
Next we present experimental data on pattern perception and the responsiveness to tempo changes in a songbird (zebra finch) and a parrot species (budgerigar).

### The Vocal Learning and Rhythmic Entrainment Hypothesis

The interest in rhythm perception in animals is part of the more general quest for searching for signs of musicality in non-human animals, as a means to get more insight in the evolutionary and causal processes underlying human musicality (Hoeschele et al., 2015; Honing et al., 2015). The specific question whether animals can detect regularity in a stimulus and synchronize their own behavior to arbitrary rhythmic patterns got sudden attention with the discovery of Snowball, a sulfur-crested cockatoo that could synchronize head and body movements with the beat in several popular songs. Even though Snowball's behavior was only synchronized with the music for part of the time, he could adjust his movements to tempo changes of the songs (Patel et al., 2009a,b; Schachner et al., 2009). Parrots, such as Snowball, are vocal learners and vocal learning is associated with evolutionary modifications to the forebrain, which plays a key role in mediating a link between auditory input and motor output during learning (Petkov and Jarvis, 2012). As such linkage between auditory and motor areas in the brain is also required for beat induction (the ability to perceive a regular pulse in a varying rhythm, or real music; Honing, 2013) and audio-motor entrainment, Patel et al. (2009a,b) suggested that only vocal learning species might be able to show beat entrainment. A survey of YouTube movies searching for evidence of animal species that could entrain their behavior to music (Schachner et al., 2009) seemed to confirm this hypothesis: entrainment was only observed among those species that showed vocal learning, suggesting that vocal learning was a necessary, albeit not a sufficient, requirement for beat induction. However, further studies have shown the picture to be more complicated. Convincing evidence of entrainment with a musical beat has now also been established in a Californian sea lion, named Ronan (Cook et al., 2013). 
Although sea lions belong to a clade of mammals (pinnipeds) that contains some vocal learners (Reichmuth and Casey, 2014), there is currently no evidence of vocal learning in this specific species, which potentially falsifies the generality of the hypothesis (Wilson and Cook, 2016). There is also some evidence of chimpanzees adapting their finger tapping to an external beat, although this seems limited to frequencies close to their spontaneous motor tempo (Hattori et al., 2013, 2015). Chimpanzees are considered vocal non-learners, although it can be argued that they show some vocal plasticity and adjustment (Watson et al., 2015), and hence that their limited abilities to synchronize match their limited abilities for vocal learning. In addition, recent evidence for temporally coordinated rhythmic movements between a bonobo (also a vocal non-learning species) and a human drummer (Large and Gray, 2015) suggests that the link between vocal learning and beat induction may be less clear than initially anticipated. However, the most intriguing feature of the survey of Schachner et al. (2009) is that of those taxa that show vocal learning (for mammals: dolphins and whales, seals, bats, and elephants; for birds: parrots, songbirds, and hummingbirds—Janik and Slater, 1997; Petkov and Jarvis, 2012), evidence for beat induction was only present in several parrot species and elephants. With respect to the latter, the evidence that elephants are vocal learners originates from captive Asian elephants, which imitated truck sounds (Poole et al., 2005), and words of the caretaker (Stoeger et al., 2012). However, in this latter study, the speech sounds were produced by inserting the trunk into the mouth, i.e., in a way quite different from how elephants usually produce vocalizations. 
As it is possible to teach elephants to perform behavior patterns well outside their natural range by operant conditioning, it might well be that the speech imitations also arose by operant shaping of the vocalizations by the human caretaker, hence being based on a different mechanism from the auditory imitative vocal learning in other species. Taken together, this leaves the parrots as the only group showing both imitative vocal learning and beat induction. This calls for a re-examination of the link between vocal learning and beat perception and induction.

### Vocal Learning and Beat Perception Revisited: Are Parrots Special?

The YouTube survey (Schachner et al., 2009) shows a remarkable contrast between the parrots and other vocal learning birds (hummingbirds and songbirds). Seven different parrot species all show evidence of beat induction. Another parrot species that is not listed as observed to synchronize with music, the budgerigar, has since been shown to be able to peck a key in synchrony with a flashing light and a metronome, and could learn to adjust this pecking to some tempo changes (Hasegawa et al., 2011), although the adjustment to each new tempo was not spontaneous but was trained specifically. In striking contrast, the list of vocal learners contains 10 different songbird species and a hummingbird, none of which provided evidence for beat induction. In addition, three songbirds were erroneously classified under "vocal nonmimics" (nuthatch, bulbul, and babbler), with none of them showing beat induction. Thus, perhaps the question should be: why is it that various parrots, but no other vocal learning (or nonlearning) birds, show beat induction? One possibility is that this difference is accidental. For instance, the total number of parrot movies is higher than that for the other bird species together, hence there may be a sampling bias. Or the difference might be related to behavioral differences between parrots and songbirds. Many parrot species show head bobbing or other body movements in their social interactions with conspecifics. If they are hand reared, as happens often with parrots, much of their social behavior will be directed to their human caretakers as a result of sexual or social imprinting (ten Cate and Vos, 1999) and a possible scenario might be that if their caretakers are dancing and moving on the beat, the parrots might be induced to do the same thing. Songbirds often lack such conspicuous rhythmic body movements in their natural behavior and may have less strong bonds with their caretakers as even captive ones are usually raised by their parents. 
Hence, they may be less likely to provide evidence for beat detection and induction, even though they might be able to do so. But it may also be that there is a more fundamental difference between parrots and other species. Showing beat induction requires at least two abilities: first the detection of a rhythmic pattern or beat in an external stimulus and next adjusting the frequency of some motor pattern to this input. Lack of beat induction may indicate lack of either or both of these abilities in other species. Thus, perhaps other bird species can detect rhythmic patterns in external stimuli, but lack the ability to synchronize their behavior with them (see also Patel and Iversen, 2014).

Alternatively, non-parrot bird species might lack the ability for detecting rhythmic patterns in auditory stimuli altogether, due to the differences in brain pathways. It has been claimed that parrots have an enhanced vocal learning system, due to an extra song system that surrounds the song system shared with songbirds (Chakraborty et al., 2015). The motor pathway system that surrounds this shell song system shows gene expression profiles similar to the song system. These non-vocal motor brain regions are active during hopping and head bobbing movements (Feenders et al., 2008) and it is therefore suggested that this motor system is involved in entrainment (Chakraborty et al., 2015). Finally, the observed relationship between vocal learning and beat induction in parrots may be coincidental: parrots are vocal learners and show beat induction, but both may not be causally related or the relation may be due to some shared third factor underlying both.

To conclude: it remains an open question why beat induction among birds has only been observed for parrots, calling for a further exploration of the topic. The observed contrast between parrots and songbirds makes birds a particularly interesting group for comparative studies. Also, there are many vocal non-learning bird species, such as doves and pigeons, that show strong rhythmicity in their vocalizations and, finally, there are species not showing such rhythmicity. So comparing different species belonging to various avian groups may help to clarify the relation between vocal learning, beat and rhythm perception and beat induction. This may also reveal whether there is a categorical jump from synchronizing to pulses in natural behavior such as in flashing fireflies, to showing beat perception and synchronization, as implied by Patel et al. (2009a,b).

Recently, Arriaga et al. (2012) and Petkov and Jarvis (2012) proposed a vocal learning continuum hypothesis to accommodate the different levels of vocal learning, ranging from vocal non-learners (like doves) through limited vocal learners to complex vocal learners (like parrots and songbirds). It may well be that there is also a more fine-grained spectrum in rhythmic patterns (see also Ravignani et al., 2014), in particular when we shift the research focus from the production to the perception of rhythmic patterns. As the detection of rhythmic patterns, such as a pattern of repetitive identical inter-pulse-intervals or a higher order repetitive regularity in a rhythm, seems a first requirement for being able to move in synchrony with a beat, the central question in this paper concerns whether and which birds can detect such regularities in auditory patterns. And if so, is there a difference between species or groups of species in this ability? Or between vocal learners and non-learners?

### Can Birds Detect Rhythmic Patterns and Regularity in Auditory Patterns?

In the first instance, asking whether birds, or any animal species, can detect regularity in an auditory pattern of, say, repeated pulses seems trivial. Studies on habituation have shown that various animals habituate more quickly to isochronous pulse series than to heterochronous ones (e.g., mice—Herry et al., 2007; zebra fish—Shafiei Sabet et al., 2015). However, this need not imply that they detect "isochrony" as such, as the distinction can be achieved by attending to local features, such as the differences in pause duration of one or a few inter-pulse intervals, and by predicting the timing of a next event from the preceding interval, i.e., it can be based on a sensitivity to absolute, and not relative, timing (Honing and Merchant, 2014; Merchant and Honing, 2014; Merchant et al., 2015). Detecting isochrony as such, or rhythmic patterns more generally, involves a global process: detecting that events are regularly distributed over a longer series, irrespective of, for instance, the precise duration of the inter-pulse-intervals (see also Geiser et al., 2014). Rhythmic pattern detection thus concerns detecting a relational property. So, if an animal can distinguish between a regular, isochronous, pattern and an irregular, heterochronous one, a critical test to see whether this is based on having detected the global difference in regularity of the pattern is to see whether the discrimination is maintained after tempo transformations. So, what is known about such perceptual abilities in birds? To date, only a handful of bird species have been examined experimentally for their discrimination between isochronous and heterochronous sound patterns and/or for whether this discrimination is maintained with tempo changes of structured sound patterns. These are the domestic pigeon and several songbirds (starling, jackdaw, zebra finch). We briefly review these studies below.
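The contrast between absolute (local) and relative (global) timing can be illustrated with a short sketch. The Python below is not taken from any of the reviewed studies: the function names, the 200 ms base interval, the jitter range, and the choice of the coefficient of variation as a stand-in for "global regularity" are all assumptions made for the example.

```python
import random
import statistics

def isochronous(n_intervals, interval_ms):
    """Inter-pulse intervals of a perfectly regular pulse train."""
    return [float(interval_ms)] * n_intervals

def heterochronous(n_intervals, mean_ms, jitter, rng):
    """Irregular intervals, each drawn uniformly within +/- jitter (a fraction) of mean_ms."""
    return [mean_ms * (1.0 + rng.uniform(-jitter, jitter)) for _ in range(n_intervals)]

def scale_tempo(intervals, factor):
    """Tempo transformation: factor 2.0 doubles the tempo, so every interval is halved."""
    return [iv / factor for iv in intervals]

def coefficient_of_variation(intervals):
    """Relative measure of irregularity (SD / mean); unchanged by proportional tempo scaling."""
    return statistics.pstdev(intervals) / statistics.fmean(intervals)

rng = random.Random(1)  # fixed seed so the illustration is repeatable
regular = isochronous(12, 200)                 # hypothetical 200 ms intervals
irregular = heterochronous(12, 200, 0.5, rng)  # hypothetical +/- 50 % jitter

for name, train in (("regular", regular), ("irregular", irregular)):
    fast = scale_tempo(train, 2.0)
    print(name,
          "| first interval (ms):", round(train[0]), "->", round(fast[0]),
          "| CV:", round(coefficient_of_variation(train), 3),
          "->", round(coefficient_of_variation(fast), 3))
```

A listener that memorized a single interval duration (a local, absolute cue) would find that cue changed after the tempo transformation, whereas a listener tracking a relational property such as the variability of intervals relative to their mean would find it unchanged. This is why maintained discrimination after tempo changes is treated as the critical test.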

### Pigeons

Two studies examined whether pigeons could discriminate and generalize across different tempos. The first one (Farthing and Hearst, 1972) showed that pigeons subjected to a non-differential training in which they had to peck a response key to get food while being exposed to a regularly spaced train of pulses generalized their response to slower and faster pulse trains. This has also been demonstrated for quail chicks (Schneider and Lickliter, 2009), but it tells us little about the ability to detect regularity or even about tempo generalization in general, as the birds may have attended to the mere presence of any sound. Differential training, in which responses to one pulse rate but not to another were rewarded, resulted in discrimination between the two rates, and the differentiation was maintained with stimuli showing either higher or lower pulse rates than the training ones. However, this study did not examine the ability to discriminate regular from irregular rhythms.

A more recent study examined, among others, whether pigeons are able to detect and discriminate different meters (Hagmann and Cook, 2010). Using two sounds, different meters were constructed using the same pulse rate (180 bpm). The pigeons were able to discriminate the meters, but only if these differed substantially from each other (8/4 vs. 3/4). Further tests suggest that the pigeons might not have attended to the meter, but to the time difference between the beats. They did not transfer the discrimination to similarly structured stimuli consisting of other sounds, suggesting that their responses were also tied to the nature of the sounds. A second experiment on meter discrimination showed that the discrimination was maintained with faster tempos (200 and 220 bpm), but not with reduced tempos of the pulse (140 and 160 bpm). In a next experiment, the same birds were tested for their ability to discriminate an isochronous from an irregular pulse pattern. The pigeons did not succeed in this discrimination. Finally, it was examined whether they could discriminate between two different isochronous pulse rates (with pulse and pause durations scaled proportionally), similar to the study by Farthing and Hearst (1972). Three out of the four pigeons managed this discrimination and this was again generalized to slower and faster pulse trains. From these experiments, Hagmann and Cook (2010) conclude that, on the whole, pigeons were most likely attending to the intervals between pulses, rather than to the overall metric or regular structure of the sound strings.
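To see how large these tempo changes are in absolute terms, the reported pulse rates can be converted to onset-to-onset intervals. The sketch below is only a worked conversion; it assumes that the bpm values refer to the rate of successive pulse onsets, and the function name is hypothetical.

```python
def beat_interval_ms(bpm: float) -> float:
    """Onset-to-onset interval in milliseconds for a pulse rate given in beats per minute."""
    return 60_000.0 / bpm

TRAINING_BPM = 180                # pulse rate used to construct the meters
TEST_BPM = (140, 160, 200, 220)   # slower and faster test tempos mentioned in the text

base = beat_interval_ms(TRAINING_BPM)
print(f"{TRAINING_BPM} bpm: {base:.1f} ms per beat (training)")
for bpm in TEST_BPM:
    interval = beat_interval_ms(bpm)
    print(f"{bpm} bpm: {interval:.1f} ms per beat, {interval - base:+.1f} ms relative to training")
```

Under this assumption, the slower test tempos lengthen each interval by more milliseconds than the faster tempos shorten it, so the test tempos are not symmetric around the training tempo in absolute time.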

### Starlings

Starlings (songbirds) were tested for the perception of regularity and rhythm in a series of studies by Hulse S. H. et al. (1984), Hulse S. et al. (1984), Humpal and Cynx (1984). The birds were able to discriminate an isochronous (pulse duration 100 ms, intervals 100 ms) as well as a hierarchical pattern (four regularly spaced pulses followed by a longer pause and next followed by repetitions of this pattern) from a randomly generated heterochronous pattern with fluctuating pulse and pause durations. The discrimination was maintained with tempo changes in which pulse durations and intervals were extended or reduced proportionally (ranging from halving to doubling the tempo), although the strength of this discrimination was reduced for slower tempos (Hulse S. H. et al., 1984; Hulse S. et al., 1984). The discrimination was reduced when the inter-pulse-interval remained constant, and pulse durations varied, but not the other way around (Hulse S. H. et al., 1984; Hulse S. et al., 1984). The discrimination was also affected if pulse duration, but not the interval, was randomized or inverted, although it remained above chance level (Humpal and Cynx, 1984). Changing the pitch of the sounds affected the discrimination only slightly. Finally, the studies showed that starlings could discriminate two different rhythmic patterns consisting of four notes of different durations (50–50–300–300 ms vs. 50–300–300–50 ms), separated by longer pauses. Tempo transformation affected this discrimination, although it remained above chance in most cases (Hulse S. H. et al., 1984; Hulse S. et al., 1984). These experiments suggest that starlings are better than pigeons in attending to more global patterns of pulse trains, although most experiments show some loss of discrimination with various tempo transformations.
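A proportional tempo transformation of the two four-note patterns can be sketched as follows. The note durations are those given above; the scaling factors, the function names, and the idea of summarizing each pattern by its "relative profile" are assumptions made for illustration, not the procedure of the original studies.

```python
from fractions import Fraction

PATTERN_A = (50, 50, 300, 300)   # note durations in ms, as given in the text
PATTERN_B = (50, 300, 300, 50)

def scale(durations_ms, factor):
    """Proportional tempo change: factor 2 doubles the tempo, halving every duration."""
    return tuple(Fraction(d) / factor for d in durations_ms)

def relative_profile(durations_ms):
    """Each duration as a fraction of the pattern total; invariant under proportional scaling."""
    total = sum(durations_ms)
    return tuple(Fraction(d) / total for d in durations_ms)

for factor in (Fraction(1, 2), 1, 2):   # hypothetical factors: half, original, double tempo
    a = scale(PATTERN_A, factor)
    b = scale(PATTERN_B, factor)
    print("tempo x", float(factor),
          "| A:", [float(x) for x in a],
          "| B:", [float(x) for x in b],
          "| A profile kept:", relative_profile(a) == relative_profile(PATTERN_A),
          "| A and B still differ:", relative_profile(a) != relative_profile(b))
```

The absolute durations change with every factor, but the order and ratios of the durations do not, so a bird relying on the relational structure could in principle keep the two patterns apart at any tempo. Exact fractions are used here only to avoid floating-point noise in the equality checks.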

### Jackdaws

A pioneering study on rhythmic perception in jackdaws (corvid songbirds) was done by Reinert (1965). He showed that a jackdaw could discriminate between two different auditory patterns with the structure ABAB and ABB respectively (A and B being different sounds). Among a series of other manipulations, he showed that the discrimination was maintained under tempo transformations (tempo training stimuli: 84 bpm; test stimuli: 66–192 bpm). A second jackdaw, trained to discriminate two other patterns, also maintained the discrimination with tempo transformations. Furthermore, the study demonstrated that the jackdaws maintained the discrimination between the patterns when the sounds making up the patterns were changed (varying timbre or pitch), suggesting that the jackdaws used relative and global, rather than local features like specific interval durations or tone characteristics, to distinguish the patterns. However, the jackdaws have not been tested with isochronous stimuli in different tempos, nor whether they could discriminate an isochronous from an irregular pattern. Also, the stimuli used in this study were always very short strings. So, although suggestive, conclusive evidence that jackdaws are really sensitive to an overall rhythm formed by a repeated pattern is still lacking.

### Zebra Finches

Zebra finches are a model songbird species for behavioral (e.g., Slater et al., 1988; Jones et al., 1996; Tchernichovski et al., 2001; Lipkind et al., 2013) and neural studies on song learning (e.g., Jarvis and Nottebohm, 1997; Haesler et al., 2004; Zeigler and Marler, 2008) as well as for comparative studies examining their abilities to discriminate various (speech) sounds or artificial grammar patterns (e.g., ten Cate, 2014). Also, a few studies examined their abilities for detecting or discriminating rhythm-like structures. Nagel et al. (2010) showed that zebra finches can distinguish the songs of two different males across various tempo transformations. Zebra finches can also detect prosodic patterns of edited speech sounds (Spierings and ten Cate, 2014) and can discriminate song elements arranged in an ABAB structure from an AABB structure (van Heijningen et al., 2009), and ABA structures from AAB structures (van Heijningen et al., 2013; Chen et al., 2015). Finally, exposure to a repeated series of regularly or irregularly spaced song elements induced differences in ZENK expression in two nuclei of the auditory system (NCM, CMM) (Lampen et al., 2014). These observations suggest that zebra finches might also be able to discriminate between different rhythmic patterns or between regular and irregular pulsed sounds and maintain this with tempo transformations. However, although zebra finches can discriminate a regular isochronous from an irregular stimulus, this discrimination was strongly reduced with tempo transformations (changing the inter-pulse-intervals, but not the pulse durations), even if the training consisted of several tempo variants of the isochronous and irregular stimuli (van der Aa et al., 2015). These data suggested that the zebra finches, like pigeons, attended strongly to specific local features of the individual stimuli, such as the exact duration of inter-pulse intervals, rather than the overall regularity of the stimuli. 
Whether they are able to use more overall features still remains to be demonstrated (see van der Aa et al., 2015, and below for a discussion).

To summarize the above overview: both pigeons and starlings are able to discriminate between two isochronous patterns in different tempos and maintain this discrimination with slower and faster tempos. However, this ability does not require perception of regularity as such, but can be achieved by attending to the duration of just one or a few intervals and generalizing from this to intervals that are more extreme to either one or the other end of the spectrum than the training ones. Although the currently available evidence is still limited, it suggests that all tested species can solve discrimination tasks when this can be done by attending to such local temporal features of the sounds, suggesting this is the "default" state birds use for auditory pattern recognition. This is also how pigeons discriminate (some) different metric patterns, generated by alternation of two types of sounds and how zebra finches discriminate regular from irregular pulse patterns (van der Aa et al., 2015). Starlings also attend to the durations of pulses and intervals when discriminating between isochronous and randomly spaced sounds varying in duration. However, their ability to maintain the discrimination over at least some tempo changes suggests that they might also be sensitive to the larger pattern. This may also be true for the jackdaws.

It can be concluded that the evidence that birds can attend to some more global "regularity" or "rhythmicity" as such is still very limited. However, whatever evidence there is suggests that this ability may differ between species. The studies of Snowball, as well as some data of a gray parrot and the YouTube survey (Schachner et al., 2009), indicate that at least some parrot species have a quite well developed perceptual sensitivity for rhythm. The above review suggests the jackdaw as a possible additional songbird candidate, but suggests also that this ability is poorer or even marginal in other songbird and non-songbird species. However, the experiments on all species are still equivocal on the issue, and more systematic comparative studies, focusing in particular on the discrimination between, and the responses to tempo transformations of, regular vs. irregular stimuli are urgently required. Our experiments described below are meant to shed more light on such perceptual abilities.

### Can Zebra Finches and Budgerigars Perceive Structural Regularity?

In our experiments we compare a songbird species, the zebra finch, with a parrot species, the budgerigar, for their abilities to discriminate a regular, hierarchically structured stimulus from an irregular one. We chose the budgerigar because the study of Hasegawa et al. (2011) suggests that they are able to entrain their behavior to an audiovisual stimulus. As budgerigars are a parrot species, we expect that they might also be able to attend to the more global features of temporal patterns in a perceptual discrimination task. For zebra finches, the current evidence for detecting pattern regularities is ambiguous: the study of van der Aa et al. (2015) suggests they attended only to local temporal features, but in the song discrimination study by Nagel et al. (2010) they maintained discrimination under tempo transformations. However, two songs changed in tempo might still be discriminable by other features that remained largely invariant after a tempo change, such as differences in the phonology of specific elements.

In the current study, we use training patterns that are similar in their (regular) inter-pulse-intervals, but differ in which pulses are accented, hence in their beat pattern. They thus show some hierarchical structure, providing the opportunity to examine how various local as well as more global temporal parameters affect the discrimination between the stimuli and whether this differs between the species. The birds are trained to discriminate between two hierarchical pulse strings, a regular one with one beat in each four pulses and an irregular string with the beats located at irregular positions. Subsequently, we test whether they generalize this discrimination to strings with modifications in the position and rate of the beat. This approach tests various hypotheses about which local and/or global features might be used when discriminating between regular and irregular strings.
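The two kinds of training string described above can be sketched as accent patterns over equally spaced pulses. Only the "one beat in each four pulses" rule comes from the text; the string length of 16 pulses, the matched number of accents, the random placement of irregular accents, and all function names are assumptions for illustration and may differ from the actual stimuli.

```python
import random

def regular_accents(n_pulses, period=4):
    """Accent (True) on the first pulse of every group of `period` pulses."""
    return [i % period == 0 for i in range(n_pulses)]

def accent_gaps(pattern):
    """Number of pulses from each accent to the next one."""
    idx = [i for i, accented in enumerate(pattern) if accented]
    return [b - a for a, b in zip(idx, idx[1:])]

def irregular_accents(n_pulses, n_accents, rng):
    """Same number of accents at random positions; redraw if the accents end up evenly spaced."""
    while True:
        positions = set(rng.sample(range(n_pulses), n_accents))
        pattern = [i in positions for i in range(n_pulses)]
        if len(set(accent_gaps(pattern))) > 1:
            return pattern

def render(pattern):
    """Text view of a string: 'X' = accented pulse (beat), '.' = unaccented pulse."""
    return "".join("X" if accented else "." for accented in pattern)

rng = random.Random(7)                     # fixed seed so the illustration is repeatable
regular = regular_accents(16)              # hypothetical 16-pulse string
irregular = irregular_accents(16, 4, rng)  # same number of beats, irregular positions

print("regular  :", render(regular), accent_gaps(regular))
print("irregular:", render(irregular), accent_gaps(irregular))
```

In this sketch the inter-pulse-intervals are identical in both strings, so the only difference lies in where the accents fall. That is the property that lets position and rate of the beat be manipulated separately from local interval durations in later tests.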

### MATERIALS AND METHODS

### Subjects

Six male zebra finches and three female budgerigars were tested in this experiment. All zebra finches were between 120 and 321 days post hatching; the budgerigars were between 2 and 3 years old at the start of the experiment. The zebra finches were not subjected to previous experiments. The budgerigars had been used in a discrimination task with human speech and zebra finch sounds. Before the experiment, the animals were housed in group living facilities on a 13.5/10.5 L/D schedule and had food, water, and cuttlebone ad libitum. During the experiment, the L/D schedule was maintained, except for short dark periods as part of the experimental procedure. Water and cuttlebone were still available ad libitum; food availability was part of the experimental procedure and was monitored daily to ensure a sufficient level of food intake. The experiments were conducted in accordance with the animal experimentation guidelines of Leiden University. The protocol was approved by the Leiden committee for animal experiments, under DEC number 14071.

### Apparatus

All experiments were conducted in an operant conditioning cage [zebra finches: 70(l) × 30(d) × 45(h) cm, budgerigars: 70(l) × 60(d) × 60(h) cm]. Each operant cage was in a separate sound attenuated chamber and was illuminated by a fluorescent tube that emitted a daylight spectrum on a 13.5 L: 10.5 D schedule. A speaker (Vifa 10BGS119/8) was located 1 m above the center of the cage. The sound level was set to 70 dB at the location of the bird at the start of a trial (in front of sensor 1). The cage walls were made from wire mesh except for a plywood back wall which supported two pecking keys with LED lights. A food hatch was located in between these two keys, easily accessible to the birds. Pecking the left key (sensor 1) elicited a stimulus and illuminated the LED light of the key on the right (sensor 2). Depending on the sound, the bird had to peck sensor 2 or had to withhold its response. A correct pecking response resulted in access to food for 10 s, and an incorrect response led to 15 s of darkness. Pecks during the sound presentation were not recorded as a response.

## Experimental Design

### Shaping

All birds started the experiment with a shaping procedure to get acquainted with the apparatus and the Go/No-go paradigm. This consisted of a 24-h acclimatization period with an opened food hatch, followed by a Go/No-go shaping procedure with one zebra finch song (Go sound) and one song element (No-go sound). Shaping lasted until the birds reached the standard discrimination ratio (response to Go sounds >75%, response to No-go sounds <25%) for three consecutive days.
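The discrimination criterion described above can be sketched as a simple check over daily response rates. This is a minimal illustration, not the software used in the study: the function names and the example rates are hypothetical, and only the thresholds (>75% Go, <25% No-go, three consecutive days) come from the text.

```python
def meets_criterion(go_rate, nogo_rate, go_min=0.75, nogo_max=0.25):
    """True when one day's response rates satisfy the discrimination ratio."""
    return go_rate > go_min and nogo_rate < nogo_max


def day_criterion_reached(daily_rates, run_length=3):
    """Return the 1-based day on which the criterion has been met for
    `run_length` consecutive days, or None if that never happens.

    daily_rates: list of (go_rate, nogo_rate) tuples, one per day.
    """
    streak = 0
    for day, (go_rate, nogo_rate) in enumerate(daily_rates, start=1):
        streak = streak + 1 if meets_criterion(go_rate, nogo_rate) else 0
        if streak == run_length:
            return day
    return None


# Hypothetical daily (Go, No-go) response rates, for illustration only.
example_rates = [(0.60, 0.40), (0.80, 0.20), (0.78, 0.30),
                 (0.90, 0.10), (0.85, 0.20), (0.88, 0.15)]
print(day_criterion_reached(example_rates))
```

In this hypothetical sequence the streak is broken on day 3 (No-go rate of 0.30), so the count of consecutive days restarts from day 4.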

### Discrimination Training

After shaping, all birds were trained to discriminate between one regular and one irregular string in the Go/No-go procedure (**Figure 1**). This training phase lasted until the bird reached the standard discrimination criterion for at least three consecutive days, after which it proceeded to the test phase.

### Test Phase

During the test phase, test strings were played at random on 20% of the trials, whilst the other 80% of the trials remained training strings (see **Figure 2**). Feedback in the form of food access and darkness was only given during training trials, never during test trials. All test strings were presented in random order and the test phase lasted until each test string had been presented 40 times.

### Stimuli

Stimuli were strings consisting of two different tones, an (accented) X-element (4000 Hz, 40 ms, and 80 dB) and a y-element (2500 Hz, 40 ms, and 70 dB, created in Praat version 5.4.01), separated by a short silent interval, the pause (40 ms). The elements were concatenated to form two hierarchically organized training strings: one regular (the Go sound) and one irregular (the No-go sound, **Figure 1**), each lasting 3.5 s in total. For both strings the interval between the elements was identical; what differed between them was the position of the X-elements, which affected the number of y-elements between two X-elements. The regular string was a concatenation of 10 equal units, where each unit consisted of one X-element, followed by three y-elements (Xyyy), all spaced equally. This concatenation created a stable 320 ms inter-X-interval (IXI), measured from onset to onset. The irregular string contained the same number of X- and y-elements as the regular string, but differed from it in that the number of y-elements between two X-elements varied. This variation ranged between one and five y-elements, creating a variation in the IXI between 160 and 480 ms. Both strings started with an additional three y-elements and had a fade-in and fade-out of 800 ms.
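The timing structure of the two training strings can be reproduced with a short sketch. The element and pause durations are taken from the text; the helper names are illustrative, and the particular ordering of y-element counts in the irregular string is a hypothetical example, since the text specifies only the range (one to five) and the total number of elements.

```python
ELEMENT_MS = 40                   # duration of each X- or y-element (from the text)
PAUSE_MS = 40                     # silent interval after each element (from the text)
SLOT_MS = ELEMENT_MS + PAUSE_MS   # onset-to-onset spacing of neighbouring elements


def build_string(y_counts, lead_in="yyy"):
    """Concatenate X-elements, each followed by the given number of y-elements."""
    return lead_in + "".join("X" + "y" * n for n in y_counts)


def inter_x_intervals(pattern):
    """Onset-to-onset intervals (ms) between successive X-elements."""
    onsets = [i * SLOT_MS for i, element in enumerate(pattern) if element == "X"]
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


regular = build_string([3] * 10)
# Hypothetical ordering: 10 X-elements and 30 y-elements, 1-5 y-elements per gap.
irregular = build_string([1, 5, 2, 4, 3, 5, 1, 4, 2, 3])

print(inter_x_intervals(regular))     # every interval equals 320 ms
print(inter_x_intervals(irregular))   # intervals range from 160 to 480 ms
print(len(regular) * SLOT_MS / 1000)  # 3.44 s, close to the reported 3.5 s
```

With 43 elements spaced 80 ms apart, the computed length of 3.44 s is consistent with the reported string duration of approximately 3.5 s.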

Test strings were created in a similar fashion as the training strings. They contained modifications in the duration and number of elements and pauses, thereby modifying the IXI and string length, while leaving the regular or irregular structure intact. Three main test sets were designed to systematically assess the effect of (1) modifications in the presence and location of elements, (2) the duration of elements and pauses, and (3) the number of elements on pattern detection (**Figure 2**, all regular strings used in this study are added as Supplementary Material).

### **Beat recognition: The role of number and presence of X- and y-elements**

In these tests the IXI was kept identical to the training strings, whilst the number of y-elements in the string varied. If the birds discriminated the training strings by attending only to the duration between the accented pulses, i.e., the IXI, it is expected that varying the number of y-elements between the X's would not affect the discrimination. The test stimulus pair 1a (**Figure 2.1**) had an additional y-element between every two X-elements (four instead of three in the regular string). Test stimulus pairs 1b and 1c had a reduction of one and three y-elements, respectively, between two X-elements, creating regular strings with two and zero y-elements per IXI, respectively. In the irregular strings each IXI was modified by adding or removing the same number of y-elements, with the limitation that there were never more than five or fewer than one y-element between two X-elements. Shortening or lengthening of the pause durations compensated for these modifications and kept the IXI identical to the training strings. An additional test (pair 1d) was run with only y-elements and prolonged pauses to compensate for the lack of X-elements (**Figure 2.1**).
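The pause compensation in the regular strings of test set 1 can be illustrated with a small calculation. This sketch rests on two assumptions that the text does not state explicitly: that elements kept their 40 ms duration, and that all pauses within an IXI were adjusted equally. The function name is hypothetical.

```python
ELEMENT_MS = 40        # element duration, assumed unchanged in test set 1
TRAINING_IXI_MS = 320  # IXI of the regular training string (from the text)


def compensated_pause(n_y, ixi_ms=TRAINING_IXI_MS, element_ms=ELEMENT_MS):
    """Pause (ms) after each element so that one X-element plus n_y y-elements,
    each followed by an equal pause, exactly fill one IXI."""
    n_elements = n_y + 1
    return ixi_ms / n_elements - element_ms


# Number of y-elements per IXI in the regular strings (from the text).
for label, n_y in [("training", 3), ("pair 1a", 4), ("pair 1b", 2), ("pair 1c", 0)]:
    print(label, round(compensated_pause(n_y), 1))
```

Under these assumptions, adding a y-element (pair 1a) shortens each pause, whereas removing y-elements (pairs 1b and 1c) lengthens it, in line with the description above.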

### **Proportional scaling: The role of element and pause duration and IXI**

In this test the number of X- and y-elements was identical to the training stimuli (**Figure 2.2**). However, the IXI was varied because the stimuli for this experiment had modified durations of the elements and pauses, both shorter and longer than those in the training stimuli. The regularity and irregularity of the training strings stayed intact by equally modifying all elements or all pauses in a string. If the birds are attending to the regularity of always having three y-elements between the X-elements in the regular string, the discrimination should be maintained. Reduced responding would indicate that the zebra finches attend to finer temporal details of the stimuli. Two versions of this modification were created.

For stimulus pair 2a both the elements and the pauses were lengthened by 25%. Pair 2b had the elements and pauses shortened by 25% (**Figure 2.2**). The strings of pair 2c had the pauses shortened by 50%, but the elements stayed identical to those in the training strings. For pair 2d the elements were shortened by 50%, whilst the pauses stayed identical to those in the training strings. This reduced the IXI of pairs 2c and 2d by 25%, similar to test pair 2b (**Figure 2.2**).
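The resulting IXI of each proportional-scaling pair follows directly from the element and pause durations. The sketch below is illustrative only; it assumes that the regular unit keeps its four elements (Xyyy), each followed by one pause, and the names used are hypothetical.

```python
def ixi_ms(element_ms, pause_ms, elements_per_unit=4):
    """IXI of a regular string whose unit holds `elements_per_unit` elements,
    each followed by one pause."""
    return elements_per_unit * (element_ms + pause_ms)


TRAINING_IXI = ixi_ms(40, 40)  # 320 ms

# (element duration, pause duration) in ms for each test pair, from the text.
pairs = {
    "2a": (40 * 1.25, 40 * 1.25),  # both lengthened by 25%
    "2b": (40 * 0.75, 40 * 0.75),  # both shortened by 25%
    "2c": (40, 40 * 0.5),          # pauses shortened by 50%
    "2d": (40 * 0.5, 40),          # elements shortened by 50%
}

relative_change = {
    label: ixi_ms(element, pause) / TRAINING_IXI - 1
    for label, (element, pause) in pairs.items()
}
print(relative_change)
```

Halving only the pauses or only the elements removes 20 ms from each 80 ms slot, which is why pairs 2c and 2d end up with the same 25% shorter IXI as pair 2b.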

### **Numerical scaling: The role of the number of y-elements and IXI**

In this test the IXIs were extended and compressed to the same length as in test 2, by adding (pair 3a) or removing (pair 3b) one y-element per X-interval (**Figure 2.3**). This manipulation created strings identical in the numbers of y-elements between X-elements to test stimuli 1a and 1b, but in this case the duration of the elements and pauses remained identical to those of the training stimuli, creating strings with a larger or smaller IXI (**Figure 2.3**). These stimuli thus maintain the finer details of element and pause durations from the training strings and only move the location of the X-element within the string.
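Because element and pause durations are fixed in this test, the IXI depends only on the number of elements per unit. A minimal sketch of this relation for the regular strings, with hypothetical names, is given below.

```python
SLOT_MS = 40 + 40  # training element duration plus training pause, in ms


def ixi_with_fixed_slots(n_y):
    """IXI (ms) of a regular string with n_y y-elements between X-elements,
    when every element and pause keeps its training duration."""
    return (n_y + 1) * SLOT_MS


training = ixi_with_fixed_slots(3)  # 320 ms
pair_3a = ixi_with_fixed_slots(4)   # one y-element added
pair_3b = ixi_with_fixed_slots(2)   # one y-element removed

print(pair_3a, pair_3a / training - 1)
print(pair_3b, pair_3b / training - 1)
```

The values of 400 and 240 ms correspond to a 25% increase and decrease, matching the IXIs obtained by proportional scaling in test 2.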

An assumption underlying the training and test procedures and stimuli is that humans exposed to these stimuli would recognize the regularity of the stimuli without being explicitly told to do so, and that, after training, they would classify all test stimuli appropriately. To validate this assumption, we trained a group of 24 adult human participants to discriminate between the training strings and tested them with test sets 1 and 2. The participants convincingly discriminated the regular from the irregular strings of all test pairs (average response to the regular stimulus = 0.88, average response to the irregular stimulus = 0.08, pairwise comparisons per test, all p < 0.01, see Supplementary Material). This indicates that, at least for humans, the regularity of the IXI is recognizable, discriminable and generalizable. Thus far, we know that the sensitivity to temporal changes in birds, in the form of discriminating differences in duration or minimum integration time, is of a comparable level to that in humans (Dooling et al., 2000; Dooling, 2004).

#### Analyses

The response data of the zebra finches and budgerigars were recorded as binomial measurements (number of Go and No-go responses). For the analysis, these measurements were converted to fractions between 0 and 1, calculated as the cumulative Go responses toward the Go or No-go strings, divided by the total number of trials. For the zebra finches, these fractions were analyzed with a generalized linear model (glm) with test item (all test strings and the training Go and No-go strings) as fixed effect and the individual as the random measure. This gave a significant effect of the test item on the Go-fraction (t = 2.9, p = 0.004). Pairwise comparisons were made between the fractions of responses to the Go and the No-go string of each test set and between the responses to the training and test strings by using a Tukey's post-hoc test, corrected for multiple testing. All results shown in the Results Section originate from these post-hoc tests. Furthermore, we ran a glm on all individual data to analyze the response pattern of each zebra finch by using the binomial response measures to each of the 40 trials per test string per bird. Results of this glm showed a significant effect of test item on the test scores (t = 3.09, p < 0.001) and results shown further are from the pairwise Tukey's post-hoc tests, restricted to only pairwise comparisons within each individual.
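The conversion from binomial responses to Go-fractions can be sketched as follows. This covers only the conversion step, not the glm or the Tukey's post-hoc tests, which were run in R; the record layout, function name, and example data are hypothetical.

```python
from collections import defaultdict


def go_fractions(trials):
    """Convert trial records to Go-fractions.

    trials: iterable of (bird, test_item, responded_go) tuples, where
    responded_go is True for a Go response and False otherwise.
    Returns {(bird, test_item): Go responses / total trials}.
    """
    counts = defaultdict(lambda: [0, 0])  # [Go responses, total trials]
    for bird, item, responded_go in trials:
        counts[(bird, item)][0] += int(responded_go)
        counts[(bird, item)][1] += 1
    return {key: go / total for key, (go, total) in counts.items()}


# Hypothetical records: 40 presentations of each string of one test pair.
records = (
    [("Z1", "3a_regular", True)] * 30 + [("Z1", "3a_regular", False)] * 10
    + [("Z1", "3a_irregular", True)] * 8 + [("Z1", "3a_irregular", False)] * 32
)
fractions = go_fractions(records)
print(fractions[("Z1", "3a_regular")], fractions[("Z1", "3a_irregular")])
```

Keeping the raw per-trial records alongside the fractions allows the same data to feed both the group-level analysis on fractions and the individual-level analysis on binomial responses.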

As only three budgerigars were tested, these data were only analyzed at the individual level. Like the zebra finch results, the responses of each budgerigar to each test string were measured as a binomial response. With a glm it was tested whether these scores differed over the test strings. The glm again showed a significant effect of test item on the test score (t = 2.76, p = 0.005). This was followed by a Tukey's post-hoc test with pairwise comparisons between the responses to the Go and No-go strings and between training and test strings within each individual. Results shown in the Results Section are from these Tukey's post-hoc tests. All statistics were performed in RStudio (version 0.98.1103).

### RESULTS

### Training

The zebra finches required on average 10,245 trials to accurately discriminate between the regular and the irregular stimulus and complete the training. The three budgerigars learned the discrimination in 8,495 trials on average.

### Beat Recognition

Maintaining the IXI but varying the number and presence of y-elements reduced the discrimination between the regular and irregular test stimuli (**Figure 3**). Zebra finches showed a trend toward discrimination between a regular and an irregular string when these strings had one additional y-element within each IXI compared to the training strings (4 y-elements, pair 1a: z = −5.08, p = 0.08). A discrimination bordering on significance was shown when the strings had one y-element fewer in each IXI (2 y-elements, pair 1b) with elongated pauses (z = −3.37, p = 0.05).

Strings consisting of only X-elements with an identical IXI as the training strings (pair 1c) resulted in a reduction in the responses to the regular string, which was no longer discriminated from the irregular string (z = 0.02, p = 0.81). A similarly low number of responses and no discrimination was recorded when only the y-elements of the training strings were present in the test strings (pair 1d, z = −1.56, p = 0.22).

Two zebra finches (Z2 and Z5) correctly discriminated the strings with one y-element more in each IXI (pair 1a: Z2: z = −3.98, p < 0.01; Z5: z = −4.65, p < 0.01), whilst the other four zebra finches did not discriminate (all z > −2.13, p > 0.25). None of the individuals discriminated between the regular and irregular string of pair 1b (one y-element fewer per IXI), pair 1c (no y-elements), or pair 1d (no X-elements) (all z > −1.33, p > 0.5).

The budgerigars also showed reduced responses to test stimuli with an identical IXI to training, but with modified numbers of y-elements (**Figure 3**). Nevertheless, one budgerigar (B1) correctly discriminated the regular and irregular strings both with 4 y-elements (pair 1a) and 2 y-elements (pair 1b) between the X-elements (pair 1a: z = −7.23, p < 0.01; pair 1b: z = −6.14, p < 0.01). The other two budgerigars (B2 and B3) did not discriminate these strings (pair 1a and pair 1b: both z > 0.8, p > 0.9).

**Figure 3** | [...] pairs of test set 1. Horizontal bold lines show the average fraction of Go responses of the six zebra finches. + symbols indicate a trend (0.05 < *p* < 0.10) toward a difference between the responses to the Go and to the No-go strings, ns indicates no significant difference (data for zebra finches only). Individual budgerigar results are shown with shaded circles (regular strings) and open circles (irregular strings). They were not tested at group level.

Similarly to the zebra finches, budgerigars did not discriminate between the regular and irregular string when only the X-elements (pair 1c) or only the y-elements (pair 1d) were present (pair 1c: all z > 0.3, p > 0.9; pair 1d: all z > 0.04, p > 0.9).

These results show that discrimination between regular and irregular strings was only partially maintained when the IXI remained constant whilst the number of y-elements varied. Only two zebra finches and one budgerigar discriminated between strings with one extra y-element in the IXI. None of the birds maintained the discrimination when only the y-elements and their intervals were present. It is clear that both element types were required and that whatever the birds might have used to discriminate the training strings, it was not just regularity, nor exact duration of the IXI.

#### Proportional Scaling

Modifications in the duration of both pauses and elements evoked different effects depending on the direction of the modification (**Figure 4**). Zebra finches showed no discrimination between the regular and irregular strings when both elements and pauses were elongated by 25% (pair 2a: z = −1.73, p = 0.14). However, they did make a correct discrimination between the regular and irregular string when both elements and pauses were shortened by 25% (pair 2b: z = −6.31, p = 0.03).

Keeping the element duration identical to the training stimuli but shortening the pauses in the strings by 50% (pair 2c) yielded a discrimination by the zebra finches that bordered on significance (z = −3.98, p = 0.05). A similar trend in the responses was found when the elements were shortened but the pauses were kept identical to the training stimuli (pair 2d: z = 3.29, p = 0.06).

Two zebra finches (Z2 and Z5) discriminated the regular and irregular string with shorter pauses and elements (pair 2b: Z2, z = −6.86, p < 0.01; Z5, z = −6.92, p < 0.01, **Figure 4**). These were the same individuals that discriminated the regular and irregular string from pair 1a (4 y-elements per IXI). One zebra finch made a correct discrimination when only the pauses were shortened (pair 2c, Z5, z = −5.47, p < 0.01). Two other zebra finches correctly discriminated when the elements were shortened (pair 2d, Z2 and Z3, z = −4.98, z = −4.09, both p < 0.01).

Budgerigars hardly responded to strings with modified pause and element durations (**Figure 4**). Irrespective of the type of modification and whether the elements, the pauses, or both were modified, none of the budgerigars discriminated between the regular and the irregular strings (pair 2a: all z > 0.03, p > 0.9; pair 2b: all z > 0.07, p > 0.6; pair 2c: all z > −1.4, p > 0.2; pair 2d: all z > −2.8, p > 0.4).

The duration of the elements and pauses influenced the birds' discrimination abilities differently in the two species. While budgerigars failed to discriminate between proportionally scaled strings, zebra finches' discrimination was maintained with shortened elements and pauses, although it was lost for all individuals when elements and pauses were elongated. There was no clear indication that reductions of elements, of pauses, or of both differ in their effect. The results suggest that the zebra finches showed at least some generalization of the discrimination when the number of y-elements between X-elements is left intact.

#### Numerical Scaling

The zebra finches maintained their discrimination between regular and irregular pulse strings when each IXI contained 4 y-elements, one y-element more than in the training strings, and thus had a 25% increase in duration of the IXI compared to training (pair 3a, **Figure 5**). Overall, zebra finches showed more Go-responses to the regular string than to the irregular string (z = −9.88, p < 0.001). The same discrimination ability was found when there were 2 y-elements in each IXI, creating a decrease in duration of the IXI by 25% (pair 3b: z = −6.61, p = 0.002). The level of discrimination between the regular and irregular string did not differ between these two manipulations (z = 0.06, p = 0.45).

**Figure 4** | [...] shortened (2b) elements and pauses, as well as only shortened pauses (2c) and only shortened elements (2d), creating a 25% increase or decrease of the IXI. Black lines show the average fraction of Go responses of the six zebra finches. Asterisks indicate a significant difference between the responses to the Go and to the No-go strings, + symbols indicate a trend, ns indicates no significant difference. Individual budgerigar results are shown with shaded circles (regular strings) and open circles (irregular strings).

All but one zebra finch discriminated between the strings with 2 y-elements and a shorter IXI (pair 3b, Z1, Z2, Z3, Z4, and Z5). Also, two zebra finches made this discrimination when there were 4 y-elements and the IXI was elongated (pair 3a: Z1, z = −6.23, p < 0.01, and Z4, z = −8.45, p < 0.01).

One budgerigar (B1) discriminated correctly when each IXI contained an extra y-element, creating an IXI increase of 25% (z = −5.45, p < 0.01), whilst the other two budgerigars did not discriminate these strings (both z > −2.24, p > 0.7, **Figure 5**). When the IXI was reduced by 25% by removing 1 y-element between the X-elements, again one budgerigar (B2) made a correct discrimination (z = −4.33, p < 0.01), while the other two budgerigars did not discriminate (both z > 2.17, p > 0.15).

These results confirm that the IXI did not need to be identical to the training strings for the birds to correctly discriminate between a regular and an irregular pulse string. In these test strings, the durations of the elements and pauses were maintained, but the number of y-elements varied. This also demonstrates that in this test the birds did not use the exact number of y-elements between two X-elements, nor the location of the X-element to discriminate between the training strings. Rather it seems that generalization to longer and shorter regular patterns was at its best if the element and pause durations were kept identical to the training stimuli.

#### Comparing Responses to Training and Test Strings

A comparison between the responses of the zebra finches to the training and to the test strings revealed that although there were differences in the responses toward regular and irregular strings in the various tests, the average fraction of Go responses to the regular test strings was always lower than the responses to the regular training strings (pairwise comparisons regular test strings ∼ regular training string, all z < −9.92, p < 0.01). Nevertheless, the zebra finches always responded more often to the regular test strings than they did to the irregular training strings (pairwise comparison regular test strings ∼ irregular training, all z < −4.54, p < 0.01). The irregular test strings of pair 1a (increased number of y-elements, identical IXI), pair 2c (pauses shortened by 50%), and pair 3a (IXI elongated by extra y-element) were the only stimuli to which the birds responded more often with a Go response than they did to the irregular training string (pairwise comparisons irregular test strings ∼ irregular training string, pair 1a, pair 2c and pair 3a: z < −5.68, p < 0.01, all other z > −1.7, p > 0.1).

**Figure 5** | [...] responses to the Go and to the No-go strings. Individual budgerigar results are shown with shaded circles (regular strings) and open circles (irregular strings).

The budgerigars also responded less to all regular and irregular test strings than they did to the regular training strings (all z < −5.78, p < 0.01). However, some regular test strings got more Go responses than the irregular training strings. When one y-element was added between two X-elements and the IXI was increased correspondingly (pair 3a), all budgerigars responded with more Go responses to the regular string than to the irregular training string (all z < −6.38, p < 0.01). Additionally, one budgerigar (B2) also responded more strongly to regular test strings than to irregular training strings when one y-element was removed (pair 3b), when there were no y-elements (1c) or when both pauses and elements were elongated (2a) (all z < −3.73, p < 0.02). Budgerigar B3 responded more strongly to the regular test string than to the No-go training string when one y-element was added or removed, but the IXI stayed identical to training (pairs 1a and 1b, z = −5.78, z = −5.21, both p < 0.01).

### DISCUSSION

Zebra finches and budgerigars can learn to discriminate between regular and irregular pulse strings in a Go/No-go operant training procedure. If the birds, like humans (see Supplementary Material), made the discrimination on the basis of the presence or absence of regularity, one would expect that all regular test stimuli would obtain Go-scores similar to those of the regular training stimulus, and be preferred consistently over the irregular test stimuli. This was not the case. Responding was considerably lower to regular test stimuli than to regular training stimuli, and there was no consistent preference for the regular over the irregular test stimulus. However, several regular test stimuli got more responses than their irregular counterparts. So, what might underlie the differential responding?

Our three test sets (see **Figure 2**) provide insights into the features of the regular and irregular training strings that zebra finches and budgerigars used when discriminating between them. The first test set showed that the birds did not discriminate the regular and irregular strings by attending exclusively to the IXI, nor by attending to the pattern of the y-elements. Apparently both element types are required to make the discrimination. However, some individuals maintained the discrimination with an increased number of y-elements and constant IXI. Test set two revealed that the zebra finches, but not the budgerigars, tended to maintain discrimination between the regular and irregular strings if the number of y-elements remained constant, but the duration of pauses and/or elements was shortened. Discrimination was absent in both species when both elements and pauses were longer than in the training strings. Finally, the third set showed that both zebra finches and budgerigars can discriminate between regular and irregular strings in which the number of y-elements and the IXI are varied, provided that the duration of elements and pauses is maintained.

Concentrating on the statistically significant findings of the different individuals shows the presence of three main patterns. (1) Memorization without generalization: one budgerigar (B3) and one zebra finch (Z6) did not discriminate between any of the test string sets, suggesting that they memorized the training strings providing a food reward and discarded all deviating strings. (2) Generalization across varying IXI when local features of the test strings, like element and pause length, were identical to the training strings. One budgerigar (B2) and three zebra finches (Z1, Z3, and Z4) discriminated strings with more or fewer y-elements between the X-elements and therefore a longer or shorter IXI (pair 3a and 3b). (3) Generalization with local variation: One budgerigar (B1) and two zebra finches (Z2 and Z5) discriminated strings with longer or shorter elements and IXI's, indicating that they were able to generalize regularity beyond local features. However, each individual had a specific subset of test strings which it discriminated, showing that there are still some specific local features that played a role during discrimination.

The individual variation among zebra finches, ranging from a focus on the exact structure of the stimuli to one with additional attending to a more global structure, has also been found in experiments in which zebra finches had to distinguish among string sets based on different artificial grammar patterns (van Heijningen et al., 2009, 2013; Chen et al., 2015) and may hence reflect a variation in more general cognitive abilities. Our current results are also in line with the suggestion arising from reviewing earlier studies (see Section Introduction) that birds have a primary strategy to pay attention to local temporal features, in this case the duration of the elements and the pauses between them, for auditory pattern recognition. However, also in the present study it is clear that this initial strategy might be accompanied by a sensitivity to more global features, like the regularity of the pulse strings, as is shown by the correct discrimination between strings in which the IXI is modified by adding y-elements, but keeping identical element and pause durations (see pair 3a in **Figure 5**). Some sensitivity to regularity is also suggested by the finding that zebra finches responded more to the regular test strings than to the irregular training strings. The differentiation among the test stimuli of each type also suggests that they most likely based their responses on comparing test strings with both the regular and the irregular training string. In our experiment we used only a single regular and a single irregular training string. While this was sufficient for humans to classify novel strings as being regular or irregular, this was not the case for the birds. However, it may be that if the birds had been trained on a set of regular and irregular stimuli they might have shifted more clearly from using local features to using the global feature of regularity.

Our zebra finch results seem somewhat in between those obtained by Nagel et al. (2010) and those of van der Aa et al. (2015) for the discrimination between two stimuli of which the temporal parameters were varied compared to the training stimuli. The study by Nagel et al. (2010), using songs from two different males, showed that discrimination of manipulated stimuli was similar to that of the training stimuli with changes in song duration of even >25%. The study by van der Aa et al. (2015) used one type of pulse, separated by isochronous or heterochronous intervals, and showed that discrimination between regular and irregular test stimuli disappeared with a 25% tempo change. Our stimuli were more complex than those of van der Aa et al. by using two types of elements, but lacked the phonological features present in full songs. In the present study, a 25% tempo change did affect some, but not all of the discriminations. Zebra finch songs differ in many features, such as the pitch contours, element length, amplitude modulations and formant patterns. Some of these features might have remained recognizable in the study of Nagel et al. (2010) where the songs were proportionally scaled, allowing the zebra finches to use these features, instead of the rhythmic ones. Hence we cannot be sure that the rhythmic structure of the songs was used in maintaining the discrimination in that study. The results of van der Aa et al. suggested that zebra finches attend in particular to local features, in that case the exact duration of inter-onset intervals. Our current results support this partly, as discrimination seems most affected when durations of pauses and elements were manipulated, but also show that some discrimination was maintained with a shortening, but not with a lengthening of element and pause durations. Maintenance of some discrimination between regular and irregular stimuli with proportional scaling of both pauses and elements has also been shown for starlings (Hulse S. H. et al., 1984; Hulse S. et al., 1984) and pigeons (Farthing and Hearst, 1972). It is of interest that for both of these species a decrease in tempo resulted in a stronger reduction of discrimination than an increase, similar to what is observed in the current experiment. The starlings appeared more sensitive to changes in tone length than changes in inter-onset interval, while the zebra finches in our study seemed to give equal weight to both.

The reduction in discrimination resulting from proportional scaling was, for both zebra finches and budgerigars, stronger than that for starlings, which maintained good discrimination with a 40% tempo change (Hulse S. et al., 1984). Hulse S. et al. (1984) interpreted their findings as indicating at least some sensitivity to rhythmicity for starlings. Our results are less conclusive on this issue. They suggest that both zebra finches and budgerigars showed some sensitivity to stimulus regularity, but only when some local features remained invariant. Similar ambiguous findings were observed in other studies of rhythm perception in birds, as discussed in the introduction. For example, pigeons could discriminate between meters with different pulse rates and between different regular pulse strings, but not between a regular and an irregular pulse string (Hagmann and Cook, 2010). Furthermore, they could not generalize the meter discrimination to pulse strings with similar rhythmic features, but different sound items. These results indicate that the discrimination by the pigeons was based on local phonological and temporal features, such as the absolute inter-pulse-intervals, and not on the global regularity of the strings. In a follow-up experiment using some of the same birds as used in the current experiment (Spierings et al., unpublished), we also found that both species hardly responded when X- and y-elements were replaced by elements of the same duration but differing in phonetic structure. In contrast to the pigeons, starlings (and possibly jackdaws—(Reinert, 1965), see Section Introduction) were able to discriminate between regular and random pulse strings, and could generalize this discrimination to some modifications of these strings, indicating that they might have attended to the global rhythmic feature of the pulse strings (Hulse S. H. et al., 1984; Hulse S. et al., 1984; Humpal and Cynx, 1984). 
However, just like the zebra finches, the starlings also discriminated best if some local features, such as pulse duration, remained identical to the training strings, whilst changes in others, such as the pitch of the pulses, did not affect the discrimination.

Abilities related to auditory-motor rhythm production have so far mainly been shown in avian species belonging to the parrot clade. Not only can larger parrots, to a certain extent, synchronize their body movements with a beat, but the smaller budgerigars have also shown rhythmic entrainment (Hasegawa et al., 2011). Nevertheless, this particular experiment might not have required regularity perception from the budgerigars. They were required to peck on a key at a certain regular interval, indicated by a light and a sound. Repeating the previously heard or seen interval, i.e., attending to absolute timing, might have allowed the birds to correctly entrain to the presented rhythm. The budgerigars in the present study did not use the general regularity of the strings to discriminate, since they only discriminated between specific regular and irregular strings.

So, both zebra finches and budgerigars were in general not using the global regularity when discriminating these strings, but both could attend to some aspects of regularity. This is in contrast to another study, in which budgerigars and zebra finches were tested on their rule learning strategies (Spierings and ten Cate, in revision). That study showed that zebra finches used local, positional information to discriminate song element triplets (XYX and XXY), whilst budgerigars used a global strategy and attended to the structure of the strings. This resulted in a generalization of the structural rules by the budgerigars, but not by the zebra finches. One noticeable difference between that study and the current one is that Spierings and ten Cate (in revision) used a set of exemplars of the XYX and the XXY string during training, whereas in the current study the animals were trained with one regular and one irregular string. Less variation in training strings might have reduced the attention given to the general regularity-irregularity difference. Nevertheless, if the difference in regularity of the strings was the most prominent strategy to discriminate them, this strategy should also be employed with only one exemplar of each, as shown by the human subjects (see Supplementary Material).

One way of interpreting the existing literature and the current results is to distinguish between at least three types of perceptual biases that might characterize time and rhythm perception in birds and other animals. These three types are a bias for local features of auditory elements (such as pitch, amplitude, duration), a bias for more global prosodic features (such as pitch contour or amplitude contour), or a bias for the temporal structure, such as inter-beat-intervals. In the current study, most individuals seem to use local temporal features as their primary strategy in solving the discrimination task. We refer to this as the local feature bias hypothesis. This hypothesis suggests a preference in birds for local elements (such as duration, inter-onset interval, pitch, amplitude, or timbre) in perception and discrimination tasks and a lower sensitivity to whether they are part of a more global temporal structure, be it isochronous, heterochronous or metrical. This is not to say that zebra finches and budgerigars cannot take advantage of the global structure; it is just not their preferred strategy in solving this type of discrimination task.

To summarize the results of the current experiment and those reviewed in the introduction of our study: there is between- and within-species variation in how well different birds are able to detect regularity of pulse strings. However, while the vocal nonlearning pigeons seem to perform poorest on this, there is only a gradual difference with vocal learners such as zebra finches and budgerigars, which in turn show a gradual difference with starlings and jackdaws. Also, if there is, as our review suggested, a difference between parrots and other bird species in sensitivity to regularity and rhythm, it does not hold for the budgerigar. Also, the currently available data show no systematic differences among vocal learners and non-learners. So, we suggest, similar to what Merchant and Honing (2014) suggested for primates, that the current data show a continuum (instead of a categorical jump) in the ability to detect regularity and rhythmicity. This idea is similar to the continuum hypothesis suggested for vocal learning by Arriaga et al. (2012) and Petkov and Jarvis (2012). However, it should be realized that the number of species tested for their abilities to perceive regularity or rhythm is still limited and the test methods and stimuli varied. Hence, there is a need to extend experiments to other avian groups, both vocal nonlearners as well as some vocal learning groups that are considered to be more advanced in their cognitive abilities (such as large parrots and corvids) and therefore may be expected to have more elaborate rhythm perception.

## AUTHOR CONTRIBUTIONS

CtC, MS, JH, and HH designed research; MS and JH performed research; MS and JH analyzed data; CtC and MS wrote the paper and JH and HH improved the paper.

### ACKNOWLEDGMENTS

We thank Guus van der Velden and Sissy Bijsterbosch for help with the animal work. Furthermore, we thank the two referees for their constructive comments. This research was supported by NWO-GW, grant 360.70.452.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fpsyg.2016.00730


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 ten Cate, Spierings, Hubert and Honing. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# "Bird Song Metronomics": Isochronous Organization of Zebra Finch Song Rhythm

Philipp Norton and Constance Scharff\*

AG Verhaltensbiologie, Freie Universität Berlin, Berlin, Germany

The human capacity for speech and vocal music depends on vocal imitation. Songbirds, in contrast to non-human primates, share this vocal production learning with humans. The process through which birds and humans learn many of their vocalizations as well as the underlying neural system exhibit a number of striking parallels and have been widely researched. In contrast, rhythm, a key feature of language and music, has received surprisingly little attention in songbirds. Investigating temporal periodicity in bird song has the potential to inform the relationship between neural mechanisms and behavioral output and can also provide insight into the biology and evolution of musicality. Here we present a method to analyze birdsong for an underlying rhythmic regularity. Using the intervals from one note onset to the next as input, we found for each bird an isochronous sequence of time stamps, a "signal-derived pulse," or pulse<sup>S</sup>, of which a subset aligned with all note onsets of the bird's song. Fourier analysis corroborated these results. To determine whether this finding was just a byproduct of the duration of notes and intervals typical for zebra finches but not dependent on the individual duration of elements and the sequence in which they are sung, we compared natural songs to models of artificial songs. Note onsets of natural song deviated from the pulse<sup>S</sup> significantly less than those of artificial songs with randomized note and gap durations. Thus, male zebra finch song has the regularity required for a listener to extract a perceived pulse (pulse<sup>P</sup>), as yet untested. Strikingly, in our study, pulses<sup>S</sup> that best fit note onsets often also coincided with the transitions between sub-note elements within complex notes, corresponding to neuromuscular gestures. Gesture durations often equaled one or more pulse<sup>S</sup> periods. This suggests that gesture duration constitutes the basic element of the temporal hierarchy of zebra finch song rhythm, an interesting parallel to the hierarchically structured components of regular rhythms in human music.

Keywords: zebra finch, birdsong, rhythm, pulse, music, gestures

### INTRODUCTION

Rhythm is a key element in the structure of music and can be defined as the "systematic patterning of sound in terms of timing, accent and grouping" (Patel, 2008, p. 96). These patterns can be either periodic (i.e., regularly repeating) or aperiodic. A special case of a periodic pattern is an isochronous one, where the time intervals between successive events share the same duration. In many types of music across the world, including the Western European (Patel, 2008, pp. 97–99) and African (Arom, 1991, p. 211) traditions, the timing of sonic events, mostly note onsets, is structured by a perceptually isochronous pulse (Nettl, 2001). This pulse is a cognitive construct that is usually implicit rather than being materialized in the acoustic signal itself (Arom, 1991, p. 230; Fitch, 2013). For the purpose of this article we will call this the "perceived pulse," or pulse<sup>P</sup>. In all but the simplest of rhythms not all notes fall on the pulse and some pulses occur in the silence between notes. Therefore, the intervals between the notes in a piece are rarely isochronous, but many note onsets align to an isochronous pulse. In some musical styles, variations of tempo (and therefore pulse) are used for artistic effect (e.g., accelerando and ritardando in classical music), while in others the tempo remains constant throughout a piece or performance (e.g., Central African music; Arom, 1991, p. 20). Often the pulse is further organized by a metrical structure, the recurring hierarchical patterning of strongly and weakly accented events. In a waltz, for example, the pulse is perceptually divided into groups of three, of which the first one, the so-called downbeat, is perceived as more strongly accented than the following two ("one, two, three, one, two, three"). In this example, pulses on the lower level of the metrical hierarchy, i.e., every pulse, happen at three times the tempo of the higher level, consisting of only the strong pulses. The process of finding the pulse and frequently the subsequent attribution of meter allow us to infer the beat of a piece of music.

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Carel Ten Cate, Leiden University, Netherlands

Gabriel Mindlin, University of Buenos Aires, Argentina

> \*Correspondence: Constance Scharff scharff@zedat.fu-berlin.de

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 07 March 2016 Accepted: 20 June 2016 Published: 06 July 2016

#### Citation:

Norton P and Scharff C (2016) "Bird Song Metronomics": Isochronous Organization of Zebra Finch Song Rhythm. Front. Neurosci. 10:309. doi: 10.3389/fnins.2016.00309

If you have ever danced or clapped your hands along to music, you have already encountered one function of a regular pulse: it facilitates the coordination of synchronized movements through a process called "beat perception and synchronization." It also provides musicians with a common temporal reference that is necessary for coordinated ensemble performance (Arom, 1991, p. 179; Patel, 2008, pp. 99–100). Furthermore, expectations and the interplay of successful anticipations and surprises emerging from these expectations are thought to drive the "emotive power" of human music (Huron, 2006). Pulse and meter, as well as deviations thereof, can build anticipations in the time-domain that subsequently are either fulfilled or violated.

How did such an apparently universal aspect of human music evolve? Several authors have stressed the importance of a cross-species comparative approach to gain insights into the evolution of music (Hulse and Page, 1988; Carterette and Kendall, 1999; Hauser and McDermott, 2003; Fitch, 2006; Patel and Demorest, 2013). Crucial to this endeavor is the realization that the music faculty hinges on a variety of interacting perceptual, cognitive, emotional, and motor mechanisms that may follow different evolutionary trajectories. It is therefore helpful to break down the music faculty into these different components and investigate which of them are present, either by homology or analogy, in non-human animals (Fitch, 2006, 2015; Ravignani et al., 2014; Honing et al., 2015).

One critical component is our capacity for vocal learning. It allowed us to develop speech as well as song, which is assumed to be universal to human music (Nettl, 2001; Trehub, 2001; Brown and Jordania, 2011). Of the many species that produce vocalizations or other acoustic signals of varying complexity, only a few are well known to rely on developmental learning to acquire some of their adult vocalizations, e.g., songbirds, hummingbirds, and parrots as well as several species of bats, some marine mammals, and elephants (reviewed by Petkov and Jarvis, 2012).

Birdsong in particular has caught the interest of researchers for its putative musical features (Kneutgen, 1969; Dobson and Lemon, 1977; Marler, 2001; Baptista and Keister, 2005; Taylor, 2013; Rothenberg et al., 2014). It has frequently inspired human music and prompted composers to incorporate it into their compositions (Baptista and Keister, 2005; Taylor, 2014). Birdsong and music might also share similar mechanisms and functions. For instance, the same regions of the mesolimbic reward pathway that respond to music in humans are active in female white-throated sparrows listening to conspecific song (Earp and Maney, 2012). Many bird species also coordinate their vocalizations by simultaneous or alternating chorusing (reviewed by Hall, 2009) or have been shown to temporally coordinate bodily movements in a dance-like manner with song during courtship (e.g., Prum, 1990; Patricelli et al., 2002; DuVal, 2007; Scholes, 2008; Dalziell et al., 2013; Ota et al., 2015; Soma and Garamszegi, 2015). Whether zebra finches (Taeniopygia guttata) coordinate singing among individuals has not been studied, but they do integrate song and dance during courtship in a non-random choreography (Williams, 2001; Ullrich et al., 2016). As in human ensemble music and dance, an isochronous pulse might serve as a temporal reference for duetting and dancing birds, facilitating the temporal coordination of vocalizations and movements. A recent study by Benichov et al. (2016) showed that zebra finches are also adept at coordinating the timing of unlearned calls in antiphonal interactions with a robot producing isochronously spaced calls. When the robot produced some additional calls, timed to coincide with the bird's response, both males and females quickly adjusted their calls to avoid jamming, successfully predicting the regular call pattern of the robot. The forebrain motor pathway that drives learned song production in male zebra finches seems to play a major role in this precise and flexible temporal coordination, not only in males but also in females that do not sing and have a much more rudimentary song system (Benichov et al., 2016). The capacity for "beat perception and synchronization" that enables humans to extract the pulse from a complex auditory signal and move to it has so far been found only in several species of parrots (Patel et al., 2009; Schachner, 2010; Hasegawa et al., 2011) and a California sea lion (Cook et al., 2013). Since human music was used as a stimulus in these studies, it is not clear how these findings relate to the animals' own vocalizations: is there regularity in any learned natural vocalization signal that permits extraction of a regular pulse?

Song production in zebra finches has been successfully used as a model for studying vocal learning and production for several decades, motivated by its parallels to speech acquisition at behavioral, neural, and genetic levels (reviewed by Doupe and Kuhl, 1999; Bolhuis et al., 2010; Berwick et al., 2012). Therefore, a large body of knowledge exists about zebra finch song structure and development as well as their neurobiological basis. Zebra finch song learning and production is controlled by a neural network of specialized song nuclei (Nottebohm et al., 1976; Bolhuis et al., 2010). The nucleus HVC, cortical in nature, significantly contributes to the coding of song. Different ensembles of neurons fire short, sparsely occurring bursts of action potentials which, through a series of downstream nuclei, translate into a motor code controlling particular ensembles of muscles of the vocal organ (Hahnloser et al., 2002; Fee et al., 2004; Okubo et al., 2015). The level of resolution of our knowledge about how behavior is neurally coded is much finer grained in songbirds than in humans. So, while the present study in songbirds is guided by what we know about rhythm from human music it has the potential to shape our inquiry into the neural basis of human rhythm production and perception. The highly stereotypic structure of zebra finch song and the fact that it remains largely unchanged in the adult bird contributes to making it a good target for first investigations of periodicity, compared to more complex singers. We therefore analyzed zebra finch song rhythm, asking whether an isochronous pulse can be derived from the timing of its notes (signal-derived pulse; pulse<sup>S</sup> ).

### MATERIALS AND METHODS

### Birds

This study used 15 adult male zebra finches, aged between 384 and 1732 days at the time of song recording. They were bred and raised at the Freie Universität Berlin breeding facility. Before entering this study, they were housed together with conspecific males, either in a large aviary or in a cage sized 90 × 35 × 45 cm. In both cases they had acoustic and visual contact to female zebra finches held in other cages or aviaries in the same room. The rooms were kept under an artificial 12 h/12 h light/dark cycle at 25 ± 3 °C. The birds had access to food, water, grit and cuttlebone ad libitum at all times. Birds in this study were solely used for song recording, a procedure for which the local authorities overseeing animal experimentation do not require a permit because it does not cause pain or discomfort. Information on the degree of relationship between the test subjects was only available for some of the birds. Of those, none were siblings, or had been raised by the same parents (3534, 4295, 4306, 4523, and g13r8). We cannot exclude dependencies in song structure arising from the possibility that pairs of birds were influenced by the same tutors.

### Recording

For song recording, each male was transferred into a separate cage (40 × 30 × 40 cm) inside a sound attenuation box (60 × 60 × 80 cm), kept under a 12 h/12 h light/dark cycle. Audio was recorded through cardioid microphones, mounted at about 2 cm distance from the center of the cage's front wall in each box. These were connected to a single PC through an external audio interface. Audacity 2.0.3 was used to record a single-channel audio track (WAVE file, 44.1 kHz, 16-bit) for each bird. Recording took place over a period of 3 years: 2013 (10 birds), 2014 (4) and 2015 (1) at varying times between 8 a.m. and 6 p.m. In addition to song recorded in isolation ("undirected song") we also solicited so-called "directed" song from 9 of the 15 males by exposing them to the sound and sight of a female finch in a transparent plastic box placed in front of the recording cage. Directed and undirected songs of the same bird were recorded within 1–3 days.

### Labeling

Recordings were segmented into smaller files of up to 10,000 s (2 h 46 min) length and for each bird the segment containing most song was used for the analyses. Then, an IIR Chebyshev high-pass filter with a 1 kHz cutoff was applied to remove low-frequency noise, using Avisoft Bioacoustics SASLab Pro 5.2.07 (henceforth SASLab). Note on- and offsets were determined by automatic amplitude threshold comparison in SASLab and saved as timestamps. All measurements obtained through this procedure were reviewed by visual examination of the song sonogram and corrected by hand where necessary. Timestamps of falsely identified elements (i.e., above-threshold noise) were removed. In the rare cases in which notes could not be reliably measured by hand (due to overlapping noise or recording artifacts), all timestamps from the entire song were discarded. Introductory notes as well as calls preceding or following song were measured but not included in the subsequent analysis. All remaining timestamps were exported to MathWorks MATLAB R2012b 8.0.0.783 (henceforth Matlab), which was used for the rest of the analysis.

Song of zebra finches is composed of different notes, separated by silent intervals resulting from inhalation gaps. Notes consist of one or more sub-note elements, corresponding to neuromuscular gestures (hence called "gestures"; Amador et al., 2013). For the analysis, we labeled notes with alphabetical letters. A string of recurrent note sequences is called a motif. Slight changes of note order can result in motif variants. For each bird, notes with the same bioacoustic features within a motif were labeled with the same alphabetical letter (for examples see **Figure 1**). The number of different notes sung by each individual ranged from four to seven, labeled a through g. The most commonly sung motif received the note labels in alphabetical order. The interval between notes (hence called "gap") following note a was labeled a', following note b b' etc. Gaps were associated with the preceding syllable, as note duration correlates more strongly with the duration of the subsequent than the preceding gap (Glaze and Troyer, 2006). Introductory notes and calls were assigned different letters, and the corresponding timestamps were subsequently filtered out. When the first note of a motif was similar or identical to the introductory notes, it was considered the first note of the motif if it was present in each motif repetition and the gap between this note and the next was in the range typical of gaps within the motif. We used this criterion to distinguish between introductory notes and motif notes, because the former are separated by gaps of variable duration and the latter are not.

Rhythm analyses were performed on "chunks," i.e., songs containing 1–10 continuously sung motifs. A new chunk started when a pause between two motifs lasted 300 ms or more. Chunks containing fewer than four notes (e.g., abc) or fewer than three bioacoustically distinct notes (e.g., ababab) were discarded in order to avoid "false positives," i.e., finding a regular pulse<sup>S</sup> as a mathematical consequence of few notes or low complexity. For each bird we analyzed between 12 and 68 undirected song chunks, consisting of 4–34 notes each (9.1 ± 4.5; mean ± std). Recordings of directed song contained 15–107 chunks, consisting of 4–42 notes (8.6 ± 5.9; mean ± std).
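The chunking rule can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the `(label, onset, offset)` tuple layout and the function names are hypothetical, and for simplicity any silent interval of 300 ms or more between consecutive notes is treated as a chunk boundary, on the assumption that within-motif gaps are always shorter than that.

```python
from typing import List, Tuple

# Hypothetical note record: (label, onset in s, offset in s).
Note = Tuple[str, float, float]


def split_into_chunks(notes: List[Note], max_pause: float = 0.300) -> List[List[Note]]:
    """Start a new chunk whenever the silence before a note lasts max_pause or more."""
    chunks: List[List[Note]] = []
    for note in notes:
        # Silence = onset of this note minus offset of the previous note.
        if chunks and note[1] - chunks[-1][-1][2] < max_pause:
            chunks[-1].append(note)
        else:
            chunks.append([note])
    return chunks


def is_analyzable(chunk: List[Note], min_notes: int = 4, min_types: int = 3) -> bool:
    """Keep chunks with at least four notes and at least three distinct note types."""
    distinct_labels = {label for label, _, _ in chunk}
    return len(chunk) >= min_notes and len(distinct_labels) >= min_types
```

Under these assumptions, a chunk such as `abc` (three notes) or `ababab` (two note types) would be rejected by `is_analyzable`, mirroring the exclusion criteria described above.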

### Pulse Matching

We used a generate-and-test (GAT) approach to find the pulse<sup>S</sup> (signal-derived pulse) that best fitted the note onsets. Essentially, isochronous pulses, i.e., strings of timestamps of equal intervals, were created for a range of different frequencies. To assess the goodness of fit of each of those pulses to a particular recorded song, the root-mean-square deviation (RMSD) of all notes in the song chunk from their nearest single pulse (i.e., timestamp) was calculated. Specifically, we aimed to determine the slowest regular pulse that could coincide with all note onsets of the particular song under investigation.

To numerically determine the lower range of pulse intervals we therefore used the shortest measured inter-onset interval (IOI) for each tested song chunk and added 10% to account for variability. Lower frequency limits calculated this way ranged from 5.5 Hz (bird 4042) to 14.9 Hz (bird 4669). Starting there, the pulse frequency was incremented in 0.01 Hz steps up to 100 Hz. Preliminary investigation revealed that the best fitting pulses very rarely had frequencies above 100 Hz.
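On our reading of this paragraph (an interpretation, not a formula stated by the authors), the longest admissible pulse period is the shortest inter-onset interval plus 10%, so the lower frequency limit for a chunk would be

$$
f_{\min} = \frac{1}{1.1 \cdot \mathrm{IOI}_{\min}} .
$$

As a purely illustrative value, a shortest inter-onset interval of 165 ms would give $f_{\min} = 1 / (1.1 \times 0.165\ \mathrm{s}) \approx 5.5$ Hz, which is of the same order as the lowest limit reported above.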

For each chunk, the pulse that fitted note onsets best was determined in the following way. Each of the pulses of incrementing frequency (by 0.01 Hz steps) was displaced from the beginning of the recording by offsets ranging from zero to one period in 1 ms steps. For each offset of each pulse the RMSD was calculated. The offset at which the RMSD was minimal was regarded as the "optimal offset." The result of this process was a list of pulses of different frequencies (e.g., 5.50, 5.51, ..., 100 Hz) for each chunk and their respective minimal RMSD.

Because pulse frequency is mathematically related to RMSD, i.e., faster pulses are associated with lower RMSDs, we normalized the RMSD by multiplication with the pulse frequency, resulting in the "frequency-normalized RMSD" (FRMSD). The FRMSD, unlike the RMSD, does not exhibit this long-term frequency-dependent decrease (Supplementary Figure 1). The RMSD on its own is an absolute measure of deviation. In contrast, the FRMSD, which was used in this study, measures the deviation relative to pulse frequency. Essentially it indicates how well the pulse fits, taking into account its tempo. We selected the pulse with the lowest FRMSD as the best fitting pulse for each chunk.
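The generate-and-test procedure described in this section can be sketched in Python as follows. This is a simplified illustration, not the original Matlab implementation: the function names are hypothetical, onsets are given in seconds, and the pulse is treated as an infinite isochronous grid so that the distance to the nearest pulse can be computed with a modulo operation.

```python
import math
from typing import List, Tuple


def rmsd_to_pulse(onsets: List[float], freq: float, offset: float) -> float:
    """RMSD (in s) of each onset from its nearest pulse of an isochronous grid."""
    period = 1.0 / freq
    total = 0.0
    for t in onsets:
        phase = (t - offset) % period
        distance = min(phase, period - phase)  # nearest pulse may lie before or after
        total += distance * distance
    return math.sqrt(total / len(onsets))


def best_pulse(onsets: List[float], f_min: float, f_max: float = 100.0,
               f_step: float = 0.01, offset_step: float = 0.001) -> Tuple[float, float, float]:
    """Return (frequency, FRMSD, RMSD) of the pulse with the lowest FRMSD."""
    best = None
    n_freq = int(round((f_max - f_min) / f_step)) + 1
    for i in range(n_freq):
        freq = f_min + i * f_step
        period = 1.0 / freq
        n_offsets = int(period / offset_step) + 1
        # Optimal offset: the shift (0 to one period, 1 ms steps) minimizing RMSD.
        rmsd = min(rmsd_to_pulse(onsets, freq, k * offset_step) for k in range(n_offsets))
        frmsd = rmsd * freq  # frequency-normalized RMSD (dimensionless)
        if best is None or frmsd < best[1]:
            best = (freq, frmsd, rmsd)
    return best
```

With the step sizes given in the text (0.01 Hz and 1 ms) a pure-Python loop of this kind is slow; a coarser `f_step` is advisable when experimenting with the sketch.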

### Fourier Analysis

A Fourier analysis was performed to confirm the results of the GAT pulse matching method (Saar and Mitra, 2008). To this end the note onset timestamps of each song were used to generate a point process, i.e., a number string with a 1 ms time resolution, which was 1 at note onsets and 0 elsewhere. After performing a fast Fourier transform (FFT) on this string, we took the frequency of maximum power for each chunk (within the same Hz limits as above) and compared it to the frequency given by the GAT method.
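A minimal Python sketch of this cross-check is given below. Instead of building the 1 ms binary string and applying an FFT, it evaluates the Fourier sum of the point process directly at each candidate frequency; for a train of unit impulses at the onset times this yields the same kind of spectral power, but the substitution is ours and the function names are hypothetical.

```python
import cmath
import math
from typing import List


def point_process_power(onsets: List[float], freq: float) -> float:
    """Spectral power at `freq` of unit impulses placed at the note onsets."""
    # Onsets are rounded to 1 ms to mimic the time resolution of the point process.
    total = sum(cmath.exp(-2j * math.pi * freq * round(t, 3)) for t in onsets)
    return abs(total) ** 2


def peak_frequency(onsets: List[float], f_min: float, f_max: float,
                   f_step: float = 0.01) -> float:
    """Frequency of maximum power within [f_min, f_max]."""
    n_freq = int(round((f_max - f_min) / f_step)) + 1
    freqs = [f_min + i * f_step for i in range(n_freq)]
    return max(freqs, key=lambda f: point_process_power(onsets, f))
```

The frequency returned by `peak_frequency` would then be compared with the best fitting pulse frequency obtained from the generate-and-test procedure for the same chunk.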

### Gesture Transitions

Examination of the sonograms showed that not only note onsets, to which the pulses were fitted, but also onsets of distinct bioacoustic features within notes, corresponding to neuromuscular gestures, coincided with the pulse remarkably often. We identified possible time points of these gesture transitions quantitatively through a previously published algorithm that determines significant local minima in the amplitude envelope (Boari et al., 2015). Amplitude minima occur not only on gesture transitions, but also within gestures and notes of quasi-constant frequency (e.g., note e of 3534, **Figure 1**). Thus, we selected from the time points produced by the algorithm only those as gesture transitions that corresponded to clear discontinuities in the frequency trace, identified by visual examination of the sonograms. The percentage of gesture transitions that fell within certain ranges around the pulse, namely one tenth, one sixth, and a quarter of the pulse period, were calculated. In **Figures 1** and **3** gestures with a distance of less than one sixth of the pulse period to the nearest pulse are highlighted.
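The coincidence measure can be sketched as follows. The code is illustrative only: the function names are hypothetical, the tolerance is interpreted as a maximum distance to the nearest pulse (one tenth, one sixth, or one quarter of the period), and the chance level assumes that transition times are uniformly distributed relative to the pulse, which may differ from the null expectation the authors used.

```python
from typing import List


def fraction_near_pulse(transitions: List[float], freq: float, offset: float,
                        tolerance_fraction: float) -> float:
    """Share of gesture transitions lying within tolerance_fraction * period of a pulse."""
    period = 1.0 / freq
    tolerance = tolerance_fraction * period
    hits = 0
    for t in transitions:
        phase = (t - offset) % period
        if min(phase, period - phase) <= tolerance:
            hits += 1
    return hits / len(transitions)


def chance_level(tolerance_fraction: float) -> float:
    """Expected share under uniformly distributed transitions (assumption)."""
    # The window extends on both sides of each pulse, hence the factor 2.
    return min(1.0, 2.0 * tolerance_fraction)
```

Under the uniform assumption, the chance levels for the three ranges would be 0.2, one third, and 0.5, respectively.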

### Clustering

Visual examination revealed that the frequencies of the best fitting pulses of all song chunks from each bird tended to form clusters with individual values scattered between clusters (Supplementary Figure 2). To quantify this impression we used agglomerative hierarchical clustering in ten birds, taking the group average of frequency distances as a dissimilarity measure. The dissimilarity threshold was set at 0.025 for all datasets. There was a significant positive correlation between cluster frequency mean and standard deviation (Linear regression; R<sup>2</sup> = 0.21; p < 0.001; n = 78), i.e., pulse frequency clusters were more tightly packed, the lower their frequency and vice versa. In order to obtain comparable clusters, different frequency transformations (square root, log<sub>e</sub>, and log<sub>10</sub>) were applied pre-clustering and their effect on this correlation was tested. Clustering in this study was done on the basis of log<sub>10</sub>-transformed frequency data because log<sub>10</sub>-transformation led to clusters with the least frequency-dependent standard deviation (Linear regression; R<sup>2</sup> = 0.0007; p = 0.824; n = 77).
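A compact Python sketch of average-linkage agglomerative clustering on log<sub>10</sub>-transformed pulse frequencies is shown below. It is a naive illustration rather than the authors' Matlab routine: the function names are hypothetical, and it assumes that the 0.025 threshold applies to distances between log<sub>10</sub> frequencies.

```python
import math
from itertools import combinations
from typing import List


def average_linkage_clusters(values: List[float], threshold: float = 0.025) -> List[List[float]]:
    """Merge the two closest clusters until the smallest average distance exceeds threshold."""
    clusters = [[v] for v in values]

    def average_distance(a: List[float], b: List[float]) -> float:
        # Group-average dissimilarity: mean absolute difference over all cross pairs.
        return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), distance = min(
            (((i, j), average_distance(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if distance > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


def cluster_pulse_frequencies(freqs_hz: List[float], threshold: float = 0.025) -> List[List[float]]:
    """Cluster best fitting pulse frequencies on a log10 scale; return clusters in Hz."""
    logs = [math.log10(f) for f in freqs_hz]
    return [sorted(10 ** v for v in cluster)
            for cluster in average_linkage_clusters(logs, threshold)]
```

Working on the log scale means that a fixed threshold corresponds to a fixed frequency ratio rather than a fixed difference in Hz, which is consistent with the stated aim of obtaining clusters whose spread does not depend on their frequency.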

### Modeling

To address whether the pulse frequencies found through the GAT method could also be detected with similar goodness of fit in any arbitrary sequence of notes, we developed two sparse models of song with varying degrees of randomization. These models produce sequences of timestamps comparable to the ones obtained from the song recordings and consist of on- and offsets of virtual "notes." The pulse deviation of the recorded bird songs was then compared to that of these artificial songs. We used the results to test the hypothesis that note onsets in zebra finch song align to an isochronous pulse more closely than expected by chance.

The first model, called "random sequence" model (Model R), creates virtual notes and gaps of random duration, albeit within a certain range. It ignores the note sequence of the original song, instead picking a new duration for each individual note. Therefore, the note sequence is not consistent across motif repetitions (e.g., natural song abcd abcd abcd compared to artificial song a'c'b'd' g'i'h'k' m'l'n'o'). Model R creates a pseudorandom value for each individual note in the analyzed song chunk and uses that as the duration of the corresponding modeled note. These pseudorandom values are drawn from a Pearson distribution using Matlab's pearsrnd() function. The distribution's mean, standard deviation, skewness, and kurtosis are equal to the distribution of all observed note durations from either undirected or directed song, depending on which is to be modeled. The same is done for each gap, only this time the distribution is modeled on that of the observed gap durations. To be more conservative and avoid introducing high variability into the gaps of the model songs, outlier values and durations of gaps with unusually high mean and variability (gray points in **Figure 6**) were excluded in the creation of the pseudorandom number distribution.

Like model R, the second so called "consistent sequence" model (Model C) creates notes and gaps of random duration within the range of actually observed durations. Unlike model R though, model C takes the note sequence of the original song into account, keeping the duration of individual notes and their associated gaps in their sequence consistent across motif repetitions (e.g., natural song abcd abcd abcd compared to artifical song a'b'c'd' a'b'c'd' a'b'c'd'). In the first step of creating a virtual "song," the different note types in the analyzed song chunk were determined (e.g., a, b, c, d). Then a set of 100 pseudorandom numbers were created for each note type of a bird, drawn from a standard normal distribution using Matlab's randn() function. These sets were then transformed to have their respective means equal a random value (drawn from a uniform distribution) between the minimum and maximum of the means of all durations of each observed note type. The standard deviation of all sets equals the mean of the standard deviations of the durations of each note type in the database. The same was done for each gap, only this time using the standard deviations and range of means of the gap durations as the basis for the set transformation. Model C draws a random element from the appropriate set for each individual note in the analyzed song chunk and its associated gap, and uses that value as the duration of the corresponding modeled note/gap. The note/gap type durations in this model were kept consistent not only across motif repetitions within a chunk, but also across all analyzed chunks of a bird. This was achieved through seeding Matlab's random number generator (RNG) before the creation of the duration sets during the modeling of each song chunk. The same seed value was used for all chunks of a single bird and different seed values were used for different birds. The RNG was seeded again before drawing the individual note/gap durations from the sets. 
Here, each chunk from a bird was assigned a different seed value. As a result, each modeled chunk used the same set of 100 durations for each note/gap type, but different values from that set were selected each time.
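The two-stage seeding of model C can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the function names, the seed values, and the example durations (in seconds) are hypothetical, and Python's `random` module stands in for Matlab's RNG. Only note durations are shown; gaps would be handled the same way with their own sets.

```python
import random
import statistics

def make_duration_sets(note_types, mean_range, sd, bird_seed, set_size=100):
    """Build one set of candidate durations per note type.

    Seeding with a bird-level seed makes the sets identical for every
    chunk of that bird (hypothetical stand-in for the Matlab procedure).
    """
    rng = random.Random(bird_seed)
    sets = {}
    for note in note_types:
        raw = [rng.gauss(0.0, 1.0) for _ in range(set_size)]
        m, s = statistics.mean(raw), statistics.stdev(raw)
        # Random target mean between the smallest and largest observed mean.
        target_mean = rng.uniform(*mean_range)
        # Standardize, then rescale to the shared SD and the target mean.
        sets[note] = [target_mean + sd * (x - m) / s for x in raw]
    return sets

def model_c_chunk(sequence, duration_sets, chunk_seed):
    """Draw one duration per note from the set of its note type.

    A chunk-level seed selects different elements for each chunk.
    """
    rng = random.Random(chunk_seed)
    return [rng.choice(duration_sets[note]) for note in sequence]

sets = make_duration_sets("abcd", mean_range=(0.05, 0.20), sd=0.005, bird_seed=1)
chunk_1 = model_c_chunk("abcdabcd", sets, chunk_seed=11)
chunk_2 = model_c_chunk("abcdabcd", sets, chunk_seed=12)
```

Because every chunk of a bird reuses the same sets, all renditions of note a stay close to one mean duration, while the chunk seed varies which of the 100 values is picked.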

The deviations of two songs from their best fitting pulses cannot be compared if those pulses strongly differ in frequency. Just as the RMSD depends on the pulse frequency (described above), so does the FRMSD, as it measures deviation relative to pulse frequency. We therefore repeated the pulse matching process for both the recorded songs and the artificial songs, this time restricting the matched pulses to a certain frequency range that was different for each bird and identical for all recorded and artificial songs of one bird. Since we wanted to test whether we could find equally well fitting pulses for the artificial songs as we did for the recorded songs, we chose the mean of the largest frequency cluster of each bird as the center of the range. Furthermore, the upper bound of the range was twice the frequency of the lower bound. This ensured that for any frequency outside of this range, either one integer multiple or one integer fraction of that frequency fell within the range. We then compared the FRMSD values of all recorded songs and their best fitting pulse in their frequency range to those of the artificial songs. To exclude the possibility of the models producing particularly periodic or aperiodic songs by chance, the artificial song creation and subsequent FRMSD comparison were repeated 50 times for each song.
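The property of a range whose upper bound is twice its lower bound can be checked with a short Python sketch. The function name and the example bounds are hypothetical and are not taken from the authors' analysis code.

```python
import math

def fold_into_range(freq, lower):
    """Map a frequency into [lower, 2 * lower] via an integer factor.

    Returns (k, folded): frequencies below the range are multiplied by k,
    frequencies above it are divided by k, and k is 1 inside the range.
    """
    upper = 2.0 * lower
    if lower <= freq <= upper:
        return 1, freq
    if freq < lower:
        k = math.ceil(lower / freq)   # smallest multiple that reaches the range
        return k, k * freq
    k = math.ceil(freq / upper)       # smallest divisor that drops below the upper bound
    return k, freq / k

print(fold_into_range(12.0, 25.0))    # a pulse slower than the range
print(fold_into_range(130.0, 25.0))   # a pulse faster than the range
```

The argument behind it: below the range, successive multiples are spaced less than `lower` apart, so one of them must land in an interval of width `lower`; above the range, the admissible divisors span an interval wider than 1, so it contains an integer.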

### Statistics

To test the differences in pulse deviation between bird song and model song or between song contexts (directed and undirected song), a linear mixed effects analysis was performed (linear mixed model, LMM) using the statistical programming language R 3.0.2 (R Core Team, 2013) with the package lme4 (Bates et al., 2014). FRMSD was entered into the model as a fixed effect. As FRMSD increases with the number of notes in a chunk (Supplementary Figure 3), the latter was used as a random intercept. P-values were obtained by likelihood ratio tests of the full model vs. a reduced model without the fixed effect (FRMSD). One-sample t-tests were used to test whether the percentage of gesture transitions occurring in certain ranges around the pulses was significantly higher than expected by chance.
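The final step of a likelihood ratio test can be sketched in Python. This is not the authors' R code: the log-likelihood values are made up, and the sketch assumes the full and reduced models differ by a single parameter, so the statistic is referred to a chi-square distribution with one degree of freedom.

```python
import math

def likelihood_ratio_test(loglik_full, loglik_reduced):
    """Likelihood ratio test for nested models differing by one parameter.

    For a chi-square variable with 1 degree of freedom, the upper tail
    probability of x equals erfc(sqrt(x / 2)).
    """
    statistic = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical log-likelihoods of a full and a reduced mixed model.
stat, p = likelihood_ratio_test(-100.0, -103.0)
print(round(stat, 2), round(p, 4))
```

With more than one parameter removed, the closed form above no longer applies and a general chi-square tail function would be needed.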

## RESULTS

For each of the 15 analyzed adult male zebra finches we found an isochronous pulse<sup>S</sup> (signal-derived pulse) that coincided with all note onsets, using two independent analysis methods. For both, we used a continuous undirected song sample from each bird. The analyses were performed on segments, called "chunks," that contained notes not separated by more than 300 ms. Each analyzed chunk consisted of 1–10 motifs, each composed of unique notes whose number varied between 4 and 7 depending on the bird. Using a generate-and-test approach (GAT; see Section Materials and Methods) we identified for each chunk of a bird's recording a pulse<sup>S</sup> that fitted best to the note onsets, i.e., had the lowest frequency-normalized root-mean-square deviation (FRMSD; **Figure 1**).
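The idea of a generate-and-test search can be sketched in Python. This is a simplified illustration, not the procedure defined in the Materials and Methods: it assumes that FRMSD is the root-mean-square deviation of onsets from the nearest pulse expressed in units of the pulse period (i.e., RMSD multiplied by frequency), and the candidate frequencies, offset grid, and onset times are hypothetical.

```python
import math

def frmsd(onsets, freq, offset):
    """RMS deviation of onsets from the nearest pulse, in pulse periods."""
    period = 1.0 / freq
    total = 0.0
    for t in onsets:
        phase = ((t - offset) / period) % 1.0   # position within one pulse cycle
        dev = min(phase, 1.0 - phase)           # distance to the nearest pulse
        total += dev * dev
    return math.sqrt(total / len(onsets))

def best_pulse(onsets, freqs, n_offsets=50):
    """Try every candidate frequency and offset; keep the lowest FRMSD."""
    best = None
    for f in freqs:
        for i in range(n_offsets):
            offset = i / (n_offsets * f)        # offsets spread over one period
            score = frmsd(onsets, f, offset)
            if best is None or score < best[0]:
                best = (score, f, offset)
    return best

# Hypothetical note onsets (s) lying on a 25 Hz grid with some pulses skipped.
onsets = [0.0, 0.04, 0.12, 0.20, 0.28, 0.32]
score, freq, offset = best_pulse(onsets, [20.0, 25.0, 30.0, 35.0])
print(freq, round(score, 6))
```

Any integer multiple of a fitting frequency fits the same onsets equally well under this measure, which is one reason a candidate list needs a bounded frequency range.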

### Pulse Frequencies

For all birds except one, a particular best fitting pulse dominated, i.e., between 36 and 70% of analyzed chunks from each bird clustered around a particular frequency (black circles in **Figure 2**). For 11 of 15 birds, best fitting pulse frequencies lay between 25 and 45 Hz. As a second analysis method to determine the best fitting pulses we applied a fast Fourier transform. For 91% of all chunks, the resulting frequency differed by <0.25 Hz from the pulse frequency identified by our GAT method.

In all birds a portion of songs were best fitted with pulses of different frequencies than those in the largest frequency cluster. Slight measurement inaccuracies may have led to different pulses having a lower deviation than the putative "real" pulse in some songs. Song amplitude throughout the recordings varied slightly depending on the birds' position in the cage and the orientation of their heads during singing. This is likely to have introduced some variability in the note onset measurement by amplitude threshold detection. The use of a dynamic time-warping algorithm for onset detection should provide more accurate measurements (e.g., Glaze and Troyer, 2006). Another factor that might contribute to the variability in pulse deviation is the fact that zebra finches gradually slow down by a small degree during bouts of continuous song (Glaze and Troyer, 2006).

### Gesture Transitions

Often the best fitting pulse coincided not only with note onsets, but also with onsets of particular bioacoustic features within notes, corresponding to neuromuscular gestures. This was unexpected because the pulse was determined based on note onset times and not based on gesture transitions.

To quantify this observation, we identified possible time points of gesture transitions through an algorithm that determines significant local minima in the amplitude envelope (Boari et al., 2015). Out of these time points we selected those that coincided with clear discontinuities in the frequency domain of the song sonogram as gesture transitions. We did this for one song chunk from each of the 15 birds and found that overall 50.8% of gesture transitions fell within one sixth of the pulse period around single pulses (white triangles in **Figure 1**). If the gesture transitions were randomly distributed, 33.3% would be expected to fall in this range, as the range within one sixth of the period to either side of each pulse adds up to a third of total song duration. The percentage of gesture transitions that were within this range was significantly higher than the percentage expected by chance [one-sample t-test, t(14) = 2.894, p = 0.0118]. We found that the pulses also had a significantly higher coincidence with the gestures than expected by chance when we chose other ranges. Within one tenth of the period around pulses lay 34.3% of the transitions, significantly more than the 20% expected by chance [t(14) = 2.315, p = 0.0363]. Within a quarter lay 65.9%, while 50% were expected if gesture transitions were randomly distributed [t(14) = 2.639, p = 0.0195]. Inspection of the sonogram revealed many cases in which gesture duration equaled one or multiple pulse periods (for one pulse period see e.g., note c of bird 3534; c of 4669; d of 4427; a and b of 4462; b and d of 4052; for multiple pulse periods see c of 4427; **Figure 1**). In other cases multiple successive gestures added up to one pulse period (c of 3534; c of 4669; b of 4052). Note offsets did not systematically fall on the pulse, but in some cases notes consisting of a single gesture spanned one or more pulse periods (d of 3534; b and e of 4669; c of 4462; a of 4052). These observations imply a strong relationship between gesture durations and IOI.
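The chance levels quoted above follow from simple arithmetic, which a small Python sketch makes explicit. The function names and the example event times are hypothetical; the sketch only assumes that randomly placed transitions are uniformly distributed over the pulse cycle.

```python
def chance_fraction(window):
    """Expected share of uniformly random events within `window` periods
    of a pulse, counting both sides of each pulse."""
    return min(1.0, 2.0 * window)

def observed_fraction(events, period, offset, window):
    """Share of events lying within `window` periods of the nearest pulse."""
    hits = 0
    for t in events:
        phase = ((t - offset) / period) % 1.0
        if min(phase, 1.0 - phase) <= window:
            hits += 1
    return hits / len(events)

for window in (1 / 6, 1 / 10, 1 / 4):
    print(round(chance_fraction(window), 3))   # one sixth, one tenth, one quarter

# Hypothetical transition times (s) against a 25 Hz pulse starting at 0 s.
print(observed_fraction([0.0, 0.01, 0.02, 0.04], 0.04, 0.0, 1 / 6))
```

A window of one sixth of the period on either side of each pulse covers one third of every cycle, which gives the 33.3% baseline; the same doubling yields 20% and 50% for the other two windows.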

Motivated by the unexpected finding that the pulses fitted not only note onsets but also many of the gestures, we wondered whether even shorter gestures would coincide with faster pulses, corresponding to integer multiples of the slowest fitting one. Interestingly, inspection of one bird under five additional pulse frequencies revealed increasingly higher coincidence of pulses with all observable gesture transitions (**Figure 3**).

### Directed Song

Song directed by zebra finch males at females during courtship is less variable in various ways than when males sing so-called "undirected" song (Sossinka and Böhner, 1980). During courtship, zebra finches deliver their song slightly faster than during undirected singing (Sossinka and Böhner, 1980; Cooper and Goller, 2006). In addition, notes and the sequence in which they are sung are produced in a more stereotyped manner from rendition to rendition during directed singing. Whether the duration of notes is also less variable in the directed than the undirected singing context is not known (Glaze and Troyer, 2006). To find out whether directed song had a faster pulse or whether the pulse fitted better due to lower variability (i.e., lower FRMSD), we also recorded 9 of the previously analyzed birds in a directed song context.

Mean pulse frequency of the largest cluster of undirected song was slightly lower than the nearest cluster in directed song in all birds (**Figure 4**). This is consistent with the fact that directed song is performed faster than undirected song (Sossinka and Böhner, 1980; Kao and Brainard, 2006; Woolley and Doupe, 2008), linked to a higher level of motivation during directed singing (Cooper and Goller, 2006). In 7 of 9 birds the pulse frequency best fitting most chunks was in the same range for undirected and directed songs. Interestingly, there was no significant difference in FRMSD between directed and undirected song (LMM; p > 0.05 for all 9 birds; **Figure 5**), indicating that note onsets in directed song do not appear to have a stronger or weaker periodicity than those of undirected song.

### Comparison to Randomized Model "Song"

To evaluate the fit of note onsets to the pulses, we created artificial "songs" consisting of randomized note and gap durations and compared the deviations of their note onsets from an isochronous pulse to those of the recorded birds.

The songs of the first model ("random sequence," model R) do not replicate the note sequence of the recorded song. Instead a new pseudorandom duration is picked for each individual note and gap from a distribution modeled on that of the recorded notes and gaps. Through this comparison we could answer the question of whether a similar periodicity could be found in any arbitrary sequence of an equal number of (finch-like) song elements. We modeled the durations on the population of measured values of all birds in this study (**Figure 6**).

For each chunk we created 50 artificial songs with different randomized duration values each time and compared those to the recorded song chunks (see Section Materials and Methods for details). Overall, in 99% of comparisons, bird songs had a lower pulse deviation (FRMSD) than the artificial songs created by model R (**Table 1**). In 88% of cases deviations were significantly lower compared to model songs, while the opposite never occurred (LMM; p < 0.05). The analyzed natural songs therefore match a regular pulse significantly better than expected by chance. In other words, all IOI of one bird are proportional to each other (i.e., integer multiples of the pulse period), unlike an arbitrary sequence of (finch-like) durations.

In most cases IOIs within one chunk are not completely independent of each other, as notes or whole motifs are repeated, and repetitions of notes and associated gaps are mostly very similar in duration. Thus, we compared the recorded songs to a second model ("consistent sequence," model C), that preserves the sequence of the recorded song. In all artificial songs produced by model C for one bird, for example, the notes based on note a have a similar duration. In 81% of comparisons, FRMSD was lower in the natural song than in the model C songs (**Table 1**). It was significantly lower in bird songs in 55% and significantly lower in model songs in 8% of comparisons (LMM; p < 0.05). Model C songs performed better than model R songs in terms of pulse deviation, but still worse than the natural songs in the majority of cases. This leads us to conclude that the pulse is a result of the durations of the song elements as well as their sequence.

## DISCUSSION

We showed here for the first time that the song of a passerine songbird, the zebra finch, can be fitted to an isochronous pulse<sup>S</sup> (signal-derived pulse). Note onsets coincided with pulses of frequencies between 10 and 60 Hz (25–45 Hz for most birds) and at different frequencies for each individual. In female-directed song this periodicity was not significantly different from undirected song. In addition to note onsets, many of the transitions between gestures within complex notes coincided with the same pulse as well, more so than expected by chance. Finding a pulse in zebra finch song raises questions about the underlying neural mechanism and its behavioral function. We cannot offer definitive answers, but we can offer some suggestions.

Song is coded in HVC neurons projecting to nucleus RA (HVC<sub>RA</sub>) of the motor pathway. Different ensembles of those neurons fire at particular positions of each rendition of a song motif in a single, roughly 10 ms long, burst of action potentials (Hahnloser et al., 2002). Finding no connection between temporal firing of these neurons and note on- and offsets led to a working hypothesis, according to which HVC<sub>RA</sub> neurons act together like a clock, producing a continuous string of ticks ("synfire chain") throughout song on a 5–10 ms timescale (Fee et al., 2004). Additional evidence for a clock-like signal in HVC controlling song production comes from experiments in which HVC was locally cooled (Long and Fee, 2008). This caused song to slow down up to 45% across all timescales, including gaps, while only slightly altering the acoustic structure. Since neural activity in RA gives rise to the motor code for song production (Mooney, 2009), one could expect to see the periodicity of the synfire chain reflected in the temporal structure of song. The frequency of this periodic activity would be in the range of 100–200 Hz. The best fitting pulses found in this study, however, are between 3 and 10 times slower. This suggests that the timing of song notes is organized on a slower timescale, occurring only at every nth clock tick, with n depending on the individual. Since we found these slower pulses in the songs of all birds and the songs were made up of several different notes, we propose that additional mechanisms must operate to orchestrate the timing signals of the internal clock into higher hierarchical levels giving rise to the slower pulse.

One such mechanism was proposed by Trevisan et al. (2006) to explain the diverse temporal patterns in the songs of canaries (Serinus canaria). They constructed a simple nonlinear model of respiratory control that could reproduce the air sac pressure patterns recorded during singing. This model, in which respiratory gestures emerge as different subharmonics of a periodic forcing signal, could predict the effects of local cooling of canary HVC on song notes (Goldin et al., 2013). As in zebra finches, canary song begins to slow linearly with falling temperature. At a certain point, however, notes begin to break into shorter elements, as forcing and respiration lock into a different integer ratio (e.g., from 2:1 to 1:1). Such a model might explain how a minimal time scale (e.g., in the form of an HVC synfire chain) could drive the timing of zebra finch notes on a subharmonic frequency. Zebra finch songs include more complex notes, in which several gestures of different duration are strung together in a single expiratory pulse. Our observation that gesture transitions preferentially coincided with the pulse on the note level suggests that a similar mechanism might be responsible for periodic activation of the syringeal membrane.

Another study that recorded from HVC<sub>RA</sub> neurons in zebra finches found that they fired preferentially at so-called "gesture trajectory extrema." These comprise gesture on- and offsets as well as extrema in physiological parameters of vocal motor control within gestures, namely air sac pressure and membrane tension of the syrinx (Amador et al., 2013). This suggests that gestures might be the basic units of song production and that their timing is coded early in the song-motor pathway. It cannot be ruled out in this scenario that a number of neurons continue to fire throughout the song, sustaining a clock-like functionality (Troyer, 2013). In fact, a very recent paper using a range of methods to correlate HVC ensemble neural activity with song finds that HVC projection neurons exhibit a temporal sequence that does not occur preferentially with note onsets or offsets, nor with gesture transitions (Picardo et al., 2016). Be that as it may, our results imply that gesture transitions, like note onsets, contribute to song regularity. On average around half of the gesture transitions coincided with the pulse fitted to note onsets, significantly more than expected if they were randomly distributed. Those that did not often occurred at the boundaries of gestures shorter than the pulse period, and successive short gestures often added up to one or multiple periods. These observations imply a strong relationship between gesture duration and IOI, where gestures constitute the lowest level of the temporal hierarchy. Notes are on a higher level of this hierarchy, combining one or more gestures and the intervening inhalation gaps. In this sense the rhythmic structure in zebra finch song is reminiscent of the relationship between notes and phrases in metrical rhythms of human music.

TABLE 1 | Results of the comparison between recorded undirected songs and model songs for all 15 birds.

Artificial song creation was repeated 50 times for each song chunk with different pseudorandom values for each repetition. Values are the percent of repetitions in which the pulse<sup>S</sup> deviation (FRMSD) was lower in natural songs compared to model songs and vice versa (first and third column in each block) and the percentage in which this difference was statistically significant (LMM; p < 0.05; second and fourth column in each block). Column mean is given at the bottom.

What might be the behavioral function of the periodic organization of song? Temporal regularity in an auditory signal can facilitate the anticipation of events. In the wild, zebra finches live in large colonies that provide a very noisy environment. Females have to attend to the song of a single male against a backdrop of conspecific vocalizations as well as other sources of noise. Temporal predictability of an auditory signal has been shown to enhance auditory detection in humans (Lawrance et al., 2014), a phenomenon from which zebra finches could benefit as well. Humans are also thought to possess a form of periodic attention. When asked to judge the pitch difference of the last of an isochronous sequence of 10 tones of different pitches to the first, they were more successful when the last tone was on the beat than when it came slightly early or late (Jones et al., 2002). This supports the idea that accurate expectation (i.e., when a stimulus might occur) has a facilitating effect on attention, improving the ability to assess what the characteristics of the stimulus are (Seashore, 1938; Huron, 2006). The benefit of successful anticipation of events is that it allows the optimization of arousal levels and therefore the minimization of energy expenditure (Huron, 2006). When female zebra finches were given the choice between undirected and directed song from the same individual, they preferred to listen to the latter (Woolley and Doupe, 2008). In this study the strength of this preference was negatively correlated with the variability in fundamental frequency of multiple renditions of harmonic stacks (parts of notes with clear harmonic structure and little frequency-modulation, e.g., the first two gestures of note c in 4669's song; **Figure 1**). This suggests that females attend to the pitch at specific times in a male's song and show a preference for males that are able to consistently "hit the right note." 
If that is the case, it would be advantageous for them to be able to anticipate the timing of these structures. Since these gestures seem to be periodically timed, females could benefit from a form of periodic attention. Instead of maintaining a constant high level of attention throughout the song or establishing a new set of expectations for each individual male, they could then simply adjust the "tempo" of their periodic attention to fit that of the singer. Females possess most of the nuclei of the song system, including HVC and RA, albeit much smaller. Until recently, the function of these nuclei was largely unknown, although in canaries HVC is implicated in song recognition and discrimination (Halle et al., 2002; Lynch et al., 2013). Benichov et al. (2016) showed that following disruption of the song system, the ability for precise, predictive timing of call coordination is greatly reduced in both males and females. It is therefore probable that females use some of the same structures that enable males to produce song with high temporal regularity, to either assess the quality of this regularity, or to use it for the anticipation of other song features.

Whether zebra finches perceive the apparent periodicity in song and if so, on what timescale, is still an open question that is crucial for our understanding of their function. A recent study showed that ZENK expression was elevated in several auditory nuclei after exposure to arrhythmic song, where inter-note gaps were lengthened or shortened, compared to natural song (Lampen et al., 2014). The observed differences in neural response suggest that rhythm plays a role in auditory discrimination of songs. In another study, zebra finches learned to discriminate an isochronous from an irregular auditory stimulus (van der Aa et al., 2015). The birds did not generalize this discrimination well across tempo changes, suggesting that they discriminated based on differences in absolute time intervals rather than relative differences (i.e., equal intervals in the isochronous vs. variable intervals in the irregular stimulus). Subsequently, zebra finches were asked to discriminate regular from irregular beat patterns, consisting of strongly accented tones with either a regular or a varying number of interspersed weakly accented tones. Here, some of the individuals were sensitive to the global pattern of regularity, but in general seemed to be biased toward attending to local features (ten Cate et al., 2016). The stimuli used in these studies, a series of metronome-like tones, lack features present in natural song, such as timbre, pitch, and amplitude modulation, which might be necessary for regularity detection, or for the birds to perceive it as a relevant signal. Further studies are needed to uncover whether zebra finches perceive a regular pulse<sup>P</sup> in song.

The pulses<sup>S</sup> fitted to song notes in the present study were faster by multiple factors than those humans preferentially perceive in musical rhythm. The latter are in a tempo range of around 500–700 ms, which translates to a pulse<sup>P</sup> frequency of 1.5–2 Hz (Parncutt, 1994; van Noorden and Moelants, 1999). Zebra finches do possess a higher auditory temporal resolution than humans (Dooling et al., 2002). It is important to note, however, that pulses<sup>S</sup> in the current study were fitted to all note onsets and represent the lowest level pulse in terms of note timing. In human music the perceived pulse<sup>P</sup> is usually slower than this low level pulse with notes occurring between successive beats. If birds perceive a pulse<sup>P</sup> in song, one could expect it to be on a longer timescale as well, e.g., integer multiples of the pulse<sup>S</sup> period, where some but not all notes coincide with the pulse<sup>S</sup> (see the top sonogram in **Figure 3** for an example).

Looking into the development of song regularity during song learning, especially in isolated juveniles, might provide further insights into whether periodicity is a result of song culture or whether it is neurally "hard-wired."

### AUTHOR CONTRIBUTIONS

PN recorded songs and analyzed data; CS and PN designed study, prepared figures, interpreted results, drafted, and revised manuscript.

### FUNDING

This research was funded by the BMBF project "Variable Töne" (FKZ: 01GQ0961) and the Excellence Cluster "Languages of Emotion" project "Do birds Tango?" (201).

### ACKNOWLEDGMENTS

We thank Prof. Winfried Menninghaus and Julian Klein for fruitful discussions during the inception of this project.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00309



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Norton and Scharff. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Sex Differences in Rhythmic Preferences in the Budgerigar (Melopsittacus undulatus): A Comparative Study with Humans

Marisa Hoeschele and Daniel L. Bowling\*

Department of Cognitive Biology, University of Vienna, Vienna, Austria

A variety of parrot species have recently gained attention as members of a small group of non-human animals that are capable of coordinating their movements in time with a rhythmic pulse. This capacity is highly developed in humans, who display unparalleled sensitivity to musical beats and appear to prefer rhythmically organized sounds in their music. Do parrots also exhibit a preference for rhythmic over arrhythmic sounds? Here, we presented humans and budgerigars (Melopsittacus undulatus), a small parrot species that has been shown to be able to align movements with a beat, with rhythmic and arrhythmic sound patterns in an acoustic place preference paradigm. Both species were allowed to explore an environment for 5 min. We quantified how much time they spent in proximity to rhythmic vs. arrhythmic stimuli. The results show that humans spent more time with rhythmic stimuli, and also preferred rhythmic stimuli when directly asked in a post-test survey. Budgerigars did not show any such overall preferences. However, further examination of the budgerigar results showed an effect of sex, such that male budgerigars spent more time with arrhythmic stimuli, and female budgerigars spent more time with rhythmic stimuli. Our results support the idea that rhythmic information is interesting to budgerigars. We suggest that future investigations into the temporal characteristics of naturalistic social behaviors in budgerigars, such as courtship vocalizations and head-bobbing displays, may help explain the sex difference we observed.

Keywords: rhythm, acoustic preference, human, budgerigar, music, auditory perception

#### Edited by:

Henkjan Honing, University of Amsterdam, Netherlands

#### Reviewed by:

Edward W. Large, University of Connecticut, USA; Yoshimasa Seki, Aichi University, Japan

#### \*Correspondence:

Daniel L. Bowling, dan.bowling@univie.ac.at

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 07 March 2016; Accepted: 22 September 2016; Published: 04 October 2016

#### Citation:

Hoeschele M and Bowling DL (2016) Sex Differences in Rhythmic Preferences in the Budgerigar (Melopsittacus undulatus): A Comparative Study with Humans. Front. Psychol. 7:1543. doi: 10.3389/fpsyg.2016.01543

### INTRODUCTION

Although we usually think about rhythm in the context of music, repetitive temporal patterns of acoustic events can be found throughout the animal kingdom, with familiar examples coming from the stridulations of insects, as well as the vocalizations of frogs, birds, and mammals (Wells, 1977; Haimoff, 1986; Geissmann, 2000; Greenfield, 2005; Mann et al., 2006; Hall, 2009). Many species appear specialized for rhythmic sound production, engaging in highly coordinated forms of inter-individual temporal coordination like synchrony and antiphony (Ravignani et al., 2014). In some species, there is also evidence of finely tuned sensitivity to specific temporal patterns. Coo production in Collared doves (Streptopelia decaocto), for example, is highly stereotyped in time (Ballintijn and ten Cate, 1999) and conspecifics only respond if coos strictly adhere to this form (Slabbekoorn and ten Cate, 1999; see also Doherty and Hoy, 1985; Gerhardt, 1988).

Despite widespread sensitivity to temporal patterns in the animal kingdom, only a small number of the animals tested so far appear to be able to coordinate their movements in time with a rhythmic pulse, an ability we will refer to as beat perception and motor coordination, or BPMC (Large, 2000; Patel et al., 2005; Repp, 2005; Patel, 2006; Fitch, 2013). Humans all around the world regularly engage in BPMC in response to rhythmic stimuli, often spontaneously, effortlessly and with pleasure (McNeill, 1995; Patel et al., 2005; Janata et al., 2012). Some form of BPMC has been found in every musical tradition where it has been studied (Nettl, 2000), and infant electroencephalographic and looking-preference studies suggest that our sensitivity to the connection between music and movement develops very early in life (∼7 months; Phillips-Silver and Trainor, 2005). This evidence suggests that, for humans, there is something inherently rewarding about moving our bodies in time to music.

Outside of our species, the group best known for clear examples of BPMC is parrots (order Psittaciformes). The first animal in which a capacity for BPMC was conclusively demonstrated was Snowball the sulfur-crested cockatoo (Cacatua galerita). Snowball was discovered through a YouTube video posted in 2007 in which he appeared to be timing his movements to a musical beat. Detailed temporal analyses of Snowball's dancing behavior showed that he intermittently engaged in sequences of between 12 and 36 head-bobs closely aligned with the beat (average phase relation = 3.9°; Patel et al., 2009). This behavior was observed across a range of tempos, and demonstrated to be more likely than expected by chance using simulations. Shortly after the Snowball study, a temporal analysis of "dancing" animal videos on YouTube showed that the majority in which BPMC was statistically supported were in fact of parrots (Schachner et al., 2009). These results suggest that, like humans, parrots may also be motivated to pay special attention to and/or move in time with rhythmic acoustic patterns. An appropriate note of caution here is that it is unknown whether Snowball, or any of the other YouTube parrots, were explicitly trained to dance by humans. What can be said, however, is that comparable human-pet interactions involving other species such as cats and dogs have so far failed to produce any similar evidence for BPMC (Schachner et al., 2009). This suggests that BPMC behavior may come more naturally to parrots, even if many other animals, such as dogs, could be trained to perform BPMC.

One hypothesis that might explain why BPMC occurs in humans and parrots is that both are vocal learners, that is, they belong to the few types of animals that learn their vocalizations from exposure to vocalizing conspecifics (Tyack, 2008). The "vocal learning and rhythmic synchronization" hypothesis suggests that the strong neural connectivity between auditory and motor regions required for vocal learning is a prerequisite for BPMC (Patel, 2006). In its strongest formulation, this hypothesis is not supported by data. Despite the fact that many of the species in which there is evidence for BPMC are vocal learners (e.g., humans, several parrot species, and elephants), there are exceptions. Ronan, a California sea lion (Zalophus californianus), was trained to bob her head in time with musical beats at different tempos despite not being a vocal learner (Cook et al., 2013). Another example is Ai, who is a chimpanzee (Pan troglodytes) and thus not a vocal learner. After being trained to tap her finger, Ai showed sequences of BPMC when presented with a repeating acoustic stimulus in the background during tapping (Hattori et al., 2013), as long as its rate was similar to her natural tapping rate (Hattori et al., 2015). Nevertheless, it remains possible that a vocal learning capacity makes BPMC more likely or more intrinsically motivating, making a weaker formulation of the vocal learning and rhythmic synchronization hypothesis plausible.

Regardless of whether there is a relationship between BPMC and vocal learning, it is clear that BPMC warrants further investigation in parrots. In humans, the experience of moving to a beat is often considered pleasurable and acoustic stimuli related to wanting to perform BPMC also induce positive affect (Janata et al., 2012). Here, we explore the possibility that budgerigars (Melopsittacus undulatus), like humans, find rhythmic temporal patterns to be rewarding. Most existing studies of acoustic preferences have used a place preference paradigm, in which animals can choose to spatially associate with different kinds of sounds. Typically, the animal is allowed to freely move around a space with different sounds playing in different locations (Hoeschele et al., 2015). Such laboratory studies are intended to be analogous to field studies where animals are free to move toward appetitive stimuli played through a speaker. In their study of conspecific song preferences, Leitão et al. (2006) showed that the results of place preference experiments in the lab closely match those found in the field, providing support for the validity of the place preference paradigm. Similarly, Gentner and Hulse (2000) used a place preference paradigm to provide direct evidence that female European starlings (Sturnus vulgaris) prefer longer over shorter male song bouts, for which there was previously only correlational evidence. In studies of music-related preferences, place preference experiments have been used to show that newly hatched chicks (Gallus gallus) and humans, but not cotton-top tamarins (Saguinus oedipus), prefer to associate with melodies composed of consonant as opposed to dissonant tonal relations (McDermott and Hauser, 2004; Chiandetti and Vallortigara, 2011). Further musical place preference studies suggest that cotton-top tamarins prefer to listen to silence over music (McDermott and Hauser, 2007), whereas chimpanzees prefer music over silence (Mingle et al., 2014). As far as we know, no studies have yet examined rhythmic preferences in non-human animals.

Accordingly, in this study we tested whether humans and budgerigars exhibit preferences for rhythmic vs. arrhythmic acoustic temporal patterns using a place preference paradigm similar to those in the studies described above. Budgerigars are a small Australian parrot species capable of engaging in BPMC (Hasegawa et al., 2011). Rhythmic patterns were represented by a repeating 2-bar stimulus comprised of percussion instruments. Arrhythmic patterns were represented by the same percussive instruments presented equally often but separated by random temporal intervals. Similar to the studies on consonance/dissonance (McDermott and Hauser, 2004, 2005; Chiandetti and Vallortigara, 2011), we used a paradigm with constant sound playback and a brief exposure period (5 min) to avoid habituation (see Dobson, 1973). We expected humans to spatially associate with the rhythmic pattern. If budgerigars showed similar behavior, it would provide evidence in support of the possibility that rhythmic patterns are attractive and biologically relevant in this species.

### GENERAL MATERIALS AND METHODS

The testing chambers and procedure were matched across experiments. As such, we provide a general description here.

### Ethical Statement

All procedures performed in studies involving human participants were approved by the University of Vienna Ethics Committee (Approval Number 00063) and were conducted in line with the Declaration of Helsinki (1964). All procedures performed in studies involving animals were in accordance with Austrian animal protection and housing laws and were approved by the ethical board of the behavioral research group in the faculty of Life Sciences at the University of Vienna (Approval Number 2015-005).

### Apparatus

Diagrams of the place preference test chambers used to test each species are provided in **Figure 1**. The chamber for humans and the chamber for budgerigars differed in size but were otherwise similar. In both cases, a large rectangular space was divided in two by a bisecting wall with an opening at one end that provided access between the left and right sides. Critically, the left and right sides were identical: each was empty except for an overhead lighting fixture and a single speaker (M-Audio AV 40, Cumberland, RI, USA) placed at the end opposite the entrance to the chamber. The entrance to both sides was in the middle of one of the long sides of the rectangle, directly perpendicular to the open end of the bisecting wall, such that participants were immediately faced with a choice to go left or right upon entering.

The human chamber was built inside an anechoic room to reduce the transmission of sound from one side of the chamber to the other. The chamber measured 3.5 m (width) × 2.1 m (length) × 3.2 m (height). The exterior walls, floor, and ceiling of the chamber were the walls of the anechoic room. The exception to this was the entrance wall, which was constructed out of large sheets of cardboard and heavy blankets. The bisecting wall was constructed out of heavy sheets of wood, cardboard, and blankets. A curtain was hung over the entrance itself to block visual access between the room and the holding area (3.5 m × 1.2 m × 3.2 m), where participants waited before starting the experiment.

The budgerigar chamber was essentially a smaller version of the human chamber (measuring 0.6 m × 0.5 m × 0.6 m). The outer walls and ceiling were made of wood, covered in acoustic foam on the inside to reduce reflections. To prevent budgerigars from chewing on/eating the foam, we installed wire cage material approximately 1 cm in front of the foam. Because the distance between the left and right sides of this chamber was considerably less than that in the human chamber, it was necessary to construct the budgerigar bisecting wall out of specially designed sound absorbing material (Wolf PhoneStar Tri sound dampening plates; Heilsbronn, Germany). Two such plates, separated by acoustic foam, were used. The floor of the apparatus was made of a thin layer of wood placed on top of another sound dampening plate. A small holding area (0.15 m × 0.20 m × 0.60 m) attached to the entrance with sliding doors on either side allowed us to place the bird inside and then release it into the chamber at the start of the experiment.

### Stimuli

Three sound stimuli were used in this experiment. The rhythmic stimulus (**Figure 2**) was a repeating 2-bar pattern of five percussion instruments (3 djembe drums, a clave, and a shaker) recorded from samples in Logic Pro (version 9; Apple, Cupertino, CA, USA). The most energetic frequencies in the samples were approximately 102, 475, 741, 2324 Hz, and between 2800 and 8000 Hz for the drums, clave, and shaker, respectively. All frequencies were within the audible range for both humans and budgerigars at the amplitudes used in the experiment (75 dB for humans, and 80 dB for budgerigars, see, e.g., Dooling and Saunders, 1975; Okanoya and Dooling, 1987). The two lowest frequency drum samples (djembe drums 1 and 2) occurred alternately every 375 ms, suggesting a tempo of 160 beats per minute (BPM). The arrhythmic ±200 stimulus had the same underlying pattern as the rhythmic stimulus, but each event was shifted in time by an interval randomly selected from a uniform distribution ranging from −200 to +200 ms, resulting in total disruption of any temporal regularity. Finally, a third stimulus, arrhythmic ±50, was created by changing the uniform distribution to range from −50 to +50 ms, resulting in less disruption of temporal regularity. Instrument samples were assembled into the patterns using custom Matlab (R2013a; Mathworks, Natick, MA, USA) code. For each experiment, the two stimuli to be contrasted were saved as the left and right tracks of a two-track wav file for presentation (sampling rate = 44100 Hz; bit depth = 32; side counterbalanced across participants). This allowed the sounds on both sides of the test chambers to be controlled from a single file.
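The timing manipulation described above can be illustrated with a minimal Python sketch. This is not the authors' Matlab code: the function names, the eight-event onset grid, and the fixed random seed are assumptions introduced purely for illustration, and the actual five-instrument pattern (Figure 2) is not reproduced.

```python
import random


def tempo_bpm(inter_beat_ms):
    """Convert an inter-beat interval in milliseconds to beats per minute."""
    return 60_000 / inter_beat_ms


def jitter_onsets(onsets_ms, max_shift_ms, rng):
    """Shift each onset by an independent draw from U(-max_shift_ms, +max_shift_ms)."""
    return [t + rng.uniform(-max_shift_ms, max_shift_ms) for t in onsets_ms]


# Hypothetical onset grid: eight events spaced 375 ms apart.
# The real 2-bar pattern is more complex and is not reproduced here.
rhythmic_onsets = [i * 375 for i in range(8)]

rng = random.Random(2016)  # fixed seed (an assumption) so the sketch is repeatable
arrhythmic_200 = jitter_onsets(rhythmic_onsets, 200, rng)  # analogue of arrhythmic ±200
arrhythmic_50 = jitter_onsets(rhythmic_onsets, 50, rng)    # analogue of arrhythmic ±50

print(tempo_bpm(375))  # 160.0, matching the tempo stated in the text
```

Note that with shifts of up to 200 ms on a 375 ms grid, neighboring events can swap order, which is consistent with the description of the ±200 stimulus as losing its temporal regularity entirely.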

### Procedure

The experimental procedure can be described in three steps. First, the experiment was initiated by the experimenter starting playback of the wav file. Second, the participant was permitted to enter the chamber. Third, after 5 min, playback was ended and the participant was released. While inside the chamber, the spatial behavior of each participant was recorded with overhead cameras (two C920 HD Pro Webcams for humans; Logitech, Lausanne, Switzerland; and a single Hero 3+ camera for budgerigars; GoPro, Mountain View, CA, USA). The video data were blind-coded to calculate the proportion of time participants spent on each side of the experimental chamber, which was used as a measure of the participant's preference for the corresponding stimulus (see below).

### EXPERIMENT 1: HUMANS, RHYTHMIC vs. ARRHYTHMIC ±200

Our goal in this experiment was to establish that humans show a preference for rhythmic over arrhythmic stimuli using a place preference paradigm designed to be directly comparable with the paradigm used with budgerigars (Experiment 3).

## Materials and Methods

### Participants

Twenty-five adult humans participated in the experiment (12 males, 13 females) at the University of Vienna (ages: 19–41). They were recruited either directly by a research assistant, or through an online service (Sona Systems; Tallinn, Estonia) where potential participants were registered and could sign up for experiments for monetary compensation. The majority of participants were students at the University of Vienna. None of the participants had any prior knowledge about the experiment. All participants provided informed consent before participation.

### Stimuli

In this experiment, all participants were presented with the rhythmic stimulus from one speaker, and the arrhythmic ±200 from the other speaker, side counterbalanced across participants.

### Procedure

Upon arrival at the laboratory, participants were told that the experiment consisted of entering a testing chamber with sounds being played at a comfortable level (75 dB directly in front of each speaker). They were told that they could enter the chamber through the entrance curtain as soon as they heard sound playing inside and that they were free to explore the space inside as they pleased. Lastly, they were told that when the sound ended they could come out of the chamber. Participants then filled out an informed consent form which included the provision that they could withdraw from the study at any time without further consequences. Once the participant was ready, the experimenter initiated video recording and acoustic playback (both controlled by the same computer; Mac Mini; Apple, Cupertino, CA, USA) and the participant entered the chamber. After 5 min had passed, video recording and stimulus playback were terminated. Following completion of their time in the chamber, participants completed a brief computerized survey (LiveCode Community 7.0.5; Edinburgh, Scotland) in which they listened to each stimulus again and answered the question "how much did you like this sound?" Responses were collected using a continuous scale from 0 ("not at all") to 100 ("very much").

### Video Coding

Participant location for the first 5 min after they had entered the testing chamber was coded by a human observer who was blind to which stimulus was presented on which side. Human participants were coded as being on either side of the chamber if both feet were fully on one side of the bisecting wall (see dividing line, **Figure 1A**). In all other cases, location was coded as being on neither side. We excluded the time that participants were on neither side and calculated the proportion of time spent on the rhythmic side by dividing the time spent on the rhythmic side by the total amount of time spent on both sides.
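The preference measure described above amounts to a simple ratio. The following Python sketch shows one way to compute it; the function name and the example durations are hypothetical and are not taken from the study's data.

```python
def proportion_rhythmic(rhythmic_s, arrhythmic_s):
    """Proportion of side-assigned time spent on the rhythmic side.

    Time coded as "neither side" is excluded simply by not being part of
    either argument, so the denominator is the time spent on both sides.
    """
    total = rhythmic_s + arrhythmic_s
    if total <= 0:
        raise ValueError("participant was never coded on either side")
    return rhythmic_s / total


# Hypothetical participant: of a 300 s session, 150 s were coded on the
# rhythmic side, 100 s on the arrhythmic side, and 50 s on neither side.
example = proportion_rhythmic(150, 100)
print(example)  # 0.6
```

A value of 0.5 corresponds to equal time on both sides, which is the chance level used in the analyses.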

### Results

Three participants only visited one side (all female). Because they could not have heard both stimuli without being on both sides, and because we were looking for a preference between the two stimuli, we excluded their data from the analysis. We conducted a one-sample two-tailed t-test to determine whether the proportion of time spent on the rhythmic side of the apparatus was different from chance (0.5) across the remaining 22 subjects. We found that participants spent significantly more time on the rhythmic side (M = 0.60; SD = 0.22) than would be expected by chance [t(21) = 2.111, p = 0.047]. These results are displayed in **Figure 3A** along with the data from Experiment 2 (see below). The survey data also showed a significant overall preference for the rhythmic stimulus (M = 82; SD = 11) compared to the arrhythmic [M = 42; SD = 34; t(21) = 5.47, p < 0.0001]. Furthermore, the difference between survey responses (rhythmic–arrhythmic) was positively correlated with the proportion of time spent on the rhythmic stimulus side in the place preference experiment (r = 0.486, p = 0.022), suggesting that spatial behavior in the place preference paradigm is related to subjective preference across individuals.
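For readers who want to see how the place preference test statistic is formed, the sketch below recomputes a one-sample t statistic from the summary values reported above (M = 0.60, SD = 0.22, n = 22, chance = 0.5). The function name is hypothetical, and because the inputs are rounded to two decimals the result only approximates the reported t(21) = 2.111.

```python
import math


def one_sample_t(mean, sd, n, mu0=0.5):
    """Return (t, df) for a one-sample t-test computed from summary statistics."""
    standard_error = sd / math.sqrt(n)
    t_stat = (mean - mu0) / standard_error
    return t_stat, n - 1


# Rounded summary values from Experiment 1. The raw data are not available
# here, so this is an approximate check rather than a re-analysis.
t_exp1, df_exp1 = one_sample_t(mean=0.60, sd=0.22, n=22)
print(f"t({df_exp1}) is approximately {t_exp1:.2f}")
```

The two-tailed p value would additionally require the cumulative distribution of Student's t with 21 degrees of freedom, which is omitted to keep the sketch free of third-party dependencies.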

### EXPERIMENT 2: HUMANS, RHYTHMIC vs. ARRHYTHMIC ±50

The results of Experiment 1 suggest that humans prefer rhythmic over arrhythmic temporal patterns. However, the arrhythmic stimulus we used was far removed from anything humans might experience in music because the elements were so heavily shifted that the pattern was lost completely. Our goal in Experiment 2 was to determine whether the results of Experiment 1 hold using a less temporally disturbed stimulus. This experiment was thus exactly the same as Experiment 1, except that the arrhythmic ±200 stimulus was replaced by arrhythmic ±50.

## Materials and Methods

### Participants

None of the participants from Experiment 1 participated in Experiment 2. All participants were naïve to the experimental setup. Twenty adult humans participated in the experiment (10 males, 10 females) at the University of Vienna (age: 20–30). They were recruited in the same manner as in Experiment 1.

### Stimuli, Procedure, and Coding

In this experiment, all participants were presented with the rhythmic stimulus from one speaker, and the arrhythmic ±50 from the other speaker, side counterbalanced across participants. The procedure and video coding were conducted in the same manner as Experiment 1.

### Results

Two participants only visited one side (one male, one female). As in Experiment 1, we excluded their data from the analysis. We conducted a one-sample two-tailed t-test looking at whether the proportion of time spent on the rhythmic side of the apparatus was significantly different from chance (0.5) across the remaining 18 subjects. We found that participants spent significantly more time on the rhythmic side (M = 0.61; SD = 0.19) than would be expected by chance [t(17) = 2.490, p = 0.023]. These results are displayed in **Figure 3A** along with the data from Experiment 1. The survey data also showed a significant overall preference for the rhythmic stimulus (M = 65; SD = 25) compared to the arrhythmic [M = 29; SD = 23; t(17) = 4.22, p = 0.001]. However, unlike in Experiment 1, the difference between survey responses (rhythmic–arrhythmic) was not significantly correlated with the proportion of time spent on the rhythmic stimulus side in the place preference experiment (r = −0.097, p = 0.700).

### EXPERIMENT 3: BUDGERIGARS, RHYTHMIC vs. ARRHYTHMIC ±200

Because we were able to replicate our result with humans and show that they consistently spent more time with rhythmic than arrhythmic stimuli in our place preference chamber, we decided to use a smaller but otherwise similar chamber to test budgerigars using the same stimuli we had tested with humans in Experiment 1.

## Materials and Methods

### Participants

Sixteen budgerigars participated in the task (eight males, eight females). All birds were naïve to the test chamber and the stimuli. When not in the experimental chamber, these birds are housed together in mixed-sex groups of eight in two separate aviaries (2 m × 1 m × 2 m) within the same room.

### Stimuli

In this experiment, all birds were presented with the rhythmic stimulus from one speaker, and the arrhythmic ±200 from the other speaker, side counterbalanced across participants.

### Procedure

All birds in our study were first habituated to the test chamber. We habituated birds so that only the stimuli presented, not the environment itself, would be novel during testing to increase the chance that any difference in time spent on either side was due to the sound. All birds were placed in the test chamber for 5 min sessions at least six times to allow adjustment to the test chamber. Birds continued these habituation sessions until they had explored both sides in at least three sessions. These criteria were intended to increase the chance that the budgerigars would fully explore the test chamber and thus be exposed to both sounds during testing. Habituation sessions were conducted like rhythm testing sessions (explained below) except that the rhythmic and arrhythmic stimuli were replaced with either budgerigar sounds (sampled from the CD album "Budgerigar Country"; Skeotch and Koschak, 2010) or silence.

After habituation, we tested the birds with the same rhythmic and arrhythmic stimuli that we had presented to the humans in Experiment 1. Birds were individually placed in the holding area. The experimenter then initiated video recording and acoustic playback and subsequently opened the sliding door that allowed access from the holding area into the test chamber. This sliding door was closed as soon as budgerigars had entered the apparatus. The holding area was kept dark, whereas there was light on both sides of the test chamber. This was intended to encourage the birds to leave the holding area quickly after the door was opened. As in Experiment 1, 5 min after the bird had entered the chamber, video recording and stimulus playback were terminated and the bird was taken back to its home aviary.

### Video Coding

Video coding was conducted in a similar way to Experiments 1 and 2. However, because the bisecting wall separating the two rooms in the budgerigar test chamber was considerably larger than the budgerigars themselves, and because budgerigars' ears are not directly above their legs (as in humans), we coded budgerigars as being on a particular side of the apparatus when their entire head had passed over either of the dividing lines shown in **Figure 1B**. When their head was between the two dividing lines, they were coded as being on neither side. We excluded the time that budgerigars spent on neither side and calculated the proportion of time spent on each side by dividing the time spent on one side by the total amount of time spent on both sides.

### Results

Four budgerigars only visited one side (three males, one female). Following the same procedure as Experiments 1 and 2 (for the same reasons), we excluded their data from the analysis. For the remaining 12 budgerigars, we conducted a one-sample two-tailed t-test to determine whether the proportion of time spent on the rhythmic side was different from chance (0.5) across subjects. We found no significant difference between the amount of time spent on the rhythmic side of the test chamber and chance [t(11) = 0.289, p = 0.778]. However, when looking at the raw data, we noticed that male birds appeared to perform differently than female birds. We thus compared the proportion of time spent on the rhythmic side between male and female birds using Welch's t-test because of differences in sample sizes between males and females. We found a significant difference between the amount of time spent by male and female birds on the rhythmic side [t(10) = 2.410, p = 0.038] such that male birds spent less time on the rhythmic side (M = 0.25, SD = 0.26) and female birds spent more time on the rhythmic side (M = 0.61, SD = 0.29). These results are displayed in **Figure 3B**.
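As an illustration of the kind of comparison described above, the following Python sketch computes a Welch-type t statistic and its Welch–Satterthwaite degrees of freedom from summary statistics. The function name is hypothetical. The group sizes (7 females, 5 males) are inferred from the exclusions reported above, and the means and SDs are the rounded values from the text, so the output is only approximate and need not reproduce the reported statistic exactly.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for Welch's unequal-variance t-test from summary statistics."""
    var_term1 = sd1 ** 2 / n1  # squared standard error of group 1
    var_term2 = sd2 ** 2 / n2  # squared standard error of group 2
    t_stat = (mean1 - mean2) / math.sqrt(var_term1 + var_term2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (var_term1 + var_term2) ** 2 / (
        var_term1 ** 2 / (n1 - 1) + var_term2 ** 2 / (n2 - 1)
    )
    return t_stat, df


# Females (group 1) vs. males (group 2), using rounded summary values.
t_sex, df_sex = welch_t(0.61, 0.29, 7, 0.25, 0.26, 5)
print(f"t is approximately {t_sex:.2f} with about {df_sex:.1f} degrees of freedom")
```

Unlike the pooled-variance Student's t-test, this form does not assume equal variances in the two groups, which is why its degrees of freedom are generally not a whole number.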

### DISCUSSION

The results show that humans not only prefer rhythmic over arrhythmic stimuli, they also spend more time with rhythmic stimuli than arrhythmic stimuli in an acoustic place preference paradigm. This result was replicable even when we adjusted our arrhythmic stimulus to deviate less from the rhythmic stimulus. When we studied budgerigars using highly similar methods, we found that there was no overall preference for rhythmic or arrhythmic stimuli. However, this appears to have been due to sex-dependent differences in the behavior of these birds: males spent more time with arrhythmic stimuli and females spent more time with rhythmic stimuli.

Our results support the idea that rhythmic information is interesting to budgerigars. Although we cannot determine the cause of the sex difference we found, one possibility is that it is related to sexual dimorphism in courtship behaviors and vocalizations. Similar to many songbird species, male budgerigars sing to attract females. Their "warble song" is acoustically distinct from other budgerigar vocalizations (produced by both sexes) and is also perceived as distinct by other budgerigars (Tu et al., 2011). The warble song is also often accompanied by a repetitive head bobbing display, sometimes followed by bill touching, especially in bonded pairs (Zocchi and Brauth, 1991). To a human observer, male head bobbing resembles the head bobbing observed in other parrot species engaged in BPMC. Female sensitivity to male displays is common in birds (e.g., Searcy and Marler, 1981; Borgia, 1995; Forstmeier et al., 2002; Ballentine et al., 2004; Amy et al., 2008; Hoeschele et al., 2010). Thus, it is possible that the apparently repetitive form of male budgerigar displays may underlie the female preference for rhythmic patterns observed in the present experiment. However, we emphasize that the temporal characteristics of the head bobbing display, its relation to the warble song, and female mate choice have not been characterized. The possibility that these factors are related to female preference behavior in the present study is thus necessarily speculative at present.

Additional caution in interpreting the results of Experiment 3 is advised by previous findings on how budgerigars perceive and respond to rhythmic stimuli. While budgerigars have been shown to be able to entrain to a beat (Hasegawa et al., 2011), other work suggests that they do not perceive temporal patterns in the same way humans do. Specifically, budgerigars appear to primarily attend to local features (such as the absolute length of silence between a specific pair of notes) and ignore global features (such as the relative length of time between all notes being equal) when trained to discriminate between regular and irregular temporal patterns (ten Cate et al., 2016). Similar results have been found in studies with pigeons (Hagmann and Cook, 2010) and zebra finches (van der Aa et al., 2015; ten Cate et al., 2016). However, this seemingly avian lack of attention to global features is not present in all individuals (ten Cate et al., 2016) and may be less pronounced in other species (e.g., starlings; Hulse et al., 1984; jackdaws, Corvus monedula; Reinert, 1965). Nevertheless, taken together these results suggest that overall temporal regularity may not have been the feature that attracted female budgerigars to spend more time with the rhythmic stimulus. A possible alternative is that females were interested in hearing the individual drum samples, some of which tended to overlap less in time in the rhythmic compared to arrhythmic stimuli and may have thus been easier to resolve. However, given that very few budgerigars have been tested on their perception of rhythm, and that methodologies differ across the bird species tested so far, it is difficult to draw broad conclusions about budgerigars, parrots, vocal learners, or indeed birds in general. What does seem clear though is that the perception of rhythm in birds is different from that of humans and requires further exploration.

We also believe it is important to interpret our results in a broader context. While Experiment 3 was designed to assess acoustic preferences in budgerigars, we need to be careful not to over-interpret the meaning of "preference" in this context. In Experiments 1 and 2, because we were able to directly survey participants, we were able to compare reported preference with behavior in our place preference paradigm. While reported preference was found to be significantly correlated with behavior in Experiment 1, it was not in Experiment 2. However, in Experiment 2 almost every participant spent more time on the rhythmic side (only 2/18 participants did not, whereas 6/22 did not in Experiment 1). In addition, all participants that showed more than an eight point difference in preference between rhythmic and arrhythmic stimuli (on the 100 point rating scale) preferred the rhythmic stimulus. Thus, although individual behavioral and survey data did not correlate with one another, overall humans both spent more time on the rhythmic side and preferred the rhythmic stimulus in both experiments. A correlation would only exist if, on an individual level, the degree of preference in the survey was directly related to the amount of time spent on the rhythmic side. It seems likely that while these measures are related insofar as we found the same preference in both domains, the degree of preference was not parallel in this particular experiment. Finally, we note that the difference between the rhythmic and arrhythmic stimulus in Experiment 2 was far less than in Experiment 1, which implies that the results should not necessarily be expected to be the same. Overall, because there was agreement in the place preference paradigm and the post-test survey, the place preference paradigm appears to be a reasonable method to assess acoustic preferences in humans.
An alternative interpretation is that, because we surveyed participants after they completed the place preference task, their reported preferences may be explained by a "mere exposure" effect, in which they liked the stimulus they were more familiar with (Zajonc, 1968). We believe this is unlikely, however, because if participants had not had any preference for rhythmic vs. arrhythmic stimuli before participation, we would have expected half of them to spend more time on the rhythmic side and the other half on the arrhythmic side, producing corresponding responses on the survey afterward. Also, if fundamental aspects of music found across cultures can be taken to reflect our acoustic preferences, there is considerable additional evidence that humans have a preference for rhythmic patterns (Brown and Jordania, 2013).

Even with reasonable confidence that human behavior in our paradigm reflects preference, this does not imply that the same conclusion holds for other species. In particular, it is difficult to distinguish between a preference for hearing a certain acoustic stimulus, and a functional response toward that stimulus. For example, many studies have shown that female mate choice in birds is based on acoustical information derived from male song (e.g., Searcy and Marler, 1981; Forstmeier et al., 2002; Ballentine et al., 2004; Amy et al., 2008; Hoeschele et al., 2010). Does this mean that females prefer to hear attractive male songs over unattractive male songs? Similar place preference paradigms suggest that they do (e.g., Gentner and Hulse, 2000; Leitão et al., 2006). However, it is possible that these songs are purely indicators of male status and females respond in a functional manner (moving toward the attractive stimulus) to secure an attractive mate. In the end, we believe it is not possible to make this distinction. In the present context, we thus define preference as the tendency to spatially associate with a given stimulus. Further attributions of internal states, such as "liking" or "enjoyment," may be appropriate, but cannot be justified on the basis of our results.

We based our version of the place preference paradigm on previous work studying music-related preferences in animals (McDermott and Hauser, 2004, 2005; Chiandetti and Vallortigara, 2011). One limitation of the two-sided place preference paradigm used here and in these previous studies is that preference for one acoustic pattern is always confounded with avoidance of the other. Consequently, it is not possible to determine whether behavior in this paradigm reflects attraction or repulsion. While both possibilities are consistent with preference in the sense that an animal may be more attracted to, or less repulsed by, a given stimulus, this confound presents further obstacles to claims regarding liking or enjoyment. For these reasons, we are planning to move to the use of a three-sided preference apparatus in future experiments where, in addition to associating with either of two stimuli, animals can also choose silence. Additionally, now that a human place preference for rhythmic stimuli has been established, we plan to exercise more flexibility in future studies in designing stimuli for budgerigars, increasing the frequency and tempo of events in auditory patterns to better suit their auditory perceptual abilities (Dooling et al., 2002).

On the whole, the study of rhythm perception in avian species is still in its infancy. Early studies of avian pitch perception also showed that birds primarily paid attention to local features rather than global features in laboratory experiments (e.g., Hulse and Cynx, 1985), which is very similar to recent results on rhythm perception (ten Cate et al., 2016). These early studies on pitch perception made it clear that birds tended to be better at assessing local pitch features than mammals (i.e., absolute pitch, Weisman et al., 2010). However, further studies suggested that birds can also pay attention to more global features depending on the context (Bregman et al., 2012; Hoeschele et al., 2012) and how we as experimenters break down the acoustic signal (Bregman et al., 2016). The same might well turn out to be true for rhythm.

### AUTHOR CONTRIBUTIONS

MH and DB designed the experiment and built the apparatus together. DB created the stimuli. DB was in charge of running the human participants and MH ran the budgerigars. MH analyzed the data and wrote the first draft of the article. DB contributed to data analysis and both authors worked to revise the article. DB made the figures for the article.

### FUNDING

MH was funded by a Banting Post-doctoral Fellowship awarded by the Natural Sciences and Engineering Research Council (NSERC) of Canada during the initiation of this project and is currently funded by a Lise Meitner Post-doctoral Fellowship (M 1732-B19) from the Austrian Science Fund (FWF). DB was funded by an ERC Advanced Grant SOMACCA (#230604) to W. Tecumseh Fitch during the initiation of this project and is currently funded by a Lise Meitner Post-doctoral Fellowship (M 1773-B24) from the Austrian Science Fund (FWF).

### ACKNOWLEDGMENTS

We would like to thank Jinook Oh for his technical support by designing software that would initiate playback and video recording simultaneously and name files appropriately for easy storage and analysis. We would also like to thank Pablo Graf Anchochea and Marcel Neumeier for their help in running human participants and coding the video data.

### REFERENCES

Zajonc, R. B. (1968). Attitudinal effects of mere exposure. J. Pers. Soc. Psychol. 9, 1–27. doi: 10.1037/h0025848

Zocchi, D. C., and Brauth, S. E. (1991). An experimental study of mate directed behaviour in the budgerigar Melopsittacus undulatus. Bird Behav. 9, 49–57.

**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Hoeschele and Bowling. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Extraordinary Nature of Barney's Drumming: A Complementary Study of Ordinary Noise Making in Chimpanzees

Valérie Dufour<sup>1</sup>\*, Cristian Pasquaretta<sup>1,2</sup>, Pierre Gayet<sup>1</sup> and Elisabeth H. M. Sterck<sup>3,4</sup>

<sup>1</sup> Ethology Evolutive Team, Institute Pluridisciplinaire Hubert Curien (IPHC), University of Strasbourg, CNRS, Strasbourg, France, <sup>2</sup> Research Center on Animal Cognition (CRCA), Center for Integrative Biology (CBI), Centre National de la Recherche Scientifique, University of Toulouse, UPS, Toulouse, France, <sup>3</sup> Ethology Research, Animal Science Department, Biomedical Primate Research Center, Rijswijk, Netherlands, <sup>4</sup> Animal Ecology, Utrecht University, Utrecht, Netherlands

In a previous study (Dufour et al., 2015) we reported the unusual characteristics of the drumming performance of a chimpanzee named Barney. His sound production, several sequences of repeated drumming on an up-turned plastic barrel, shared features typical for human musical drumming: it was rhythmical, decontextualized, and well controlled by the chimpanzee. This type of performance raises questions about the origins of our musicality. Here we recorded spontaneously occurring events of sound production with objects in Barney's colony. First we collected data on the duration of sound making. Here we examined whether (i) the context in which objects were used for sound production, (ii) the sex of the producer, (iii) the medium, and (iv) the technique used for sound production had any effect on the duration of sound making. Interestingly, duration of drumming differed across contexts, sex, and techniques. Then we filmed as many events as possible to increase our chances of recording sequences that would be musically similar to Barney's performance in the original study. We filmed several long productions that were rhythmically interesting. However, none fully met the criteria of musical sound production, as previously reported for Barney.

Keywords: music, drumming, object manipulation, chimpanzees, Barney's colony

### INTRODUCTION

The universality of music across human cultures is an indisputable fact: all humans make music, sing, dance, and gather to enjoy sharing emotions elicited by musical performances (Merker et al., 2015). By contrast, there is very limited evidence that our closest living relatives, the great apes, make music in the same way (Fitch, 2006). From an evolutionary perspective, this leaves us wondering about the origins of our musical skills. Archeological remains do not provide sufficient evidence of instruments being used to create music prior to 40,000 BC, nor do they suggest the presence of any other musical behaviors (Kunej and Turk, 2000). Great apes probably possess some prerequisites for musical production (Fitch, 2006; Honing et al., 2015). Drumming in chimpanzees and chest beating in gorillas are considered homologs of human music: shared ancestral traits not found in less closely related species (Fitch, 2006). Thus, taking a closer look at great ape drumming behavior may improve our understanding of our own musicality.

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Bjorn Hellmut Merker, Self, Sweden

Ruth Sonnweber, Max Planck Institute for Evolutionary Anthropology (MPG), Germany

> \*Correspondence: Valérie Dufour valerie.dufour@iphc.cnrs.fr

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 31 March 2016 Accepted: 03 January 2017 Published: 19 January 2017

#### Citation:

Dufour V, Pasquaretta C, Gayet P and Sterck EHM (2017) The Extraordinary Nature of Barney's Drumming: A Complementary Study of Ordinary Noise Making in Chimpanzees. Front. Neurosci. 11:2. doi: 10.3389/fnins.2017.00002


Chimpanzees drum on tree buttresses or other resonant structures as part of their dominance displays (Goodall, 1986). The chimpanzee Mike was observed repeatedly charging higher-ranking males whilst propelling kerosene cans in front of him (Goodall, 1986). Drumming can accompany vocal signals such as long-distance calls (e.g., climax of a pant hoot), sometimes even replacing part of the vocal phrase (Boesch, 1991; Arcadi, 1996; Babiszewska et al., 2015). This combination of vocalizations and noise making using objects is generally associated with social tension or high levels of arousal within the group (Goodall, 1986; Nishida et al., 1999). Other individuals can join the initiator by vocalizing, drumming, or both (Fedurek et al., 2013). Gorillas also drum on their own bodies and on objects during displays (Schaller, 1963). Generally, drumming in display contexts is likely to be perceived by others as a demonstration of strength.

Drumming, leaf clipping, and breaking or shaking branches are examples of how chimpanzee males create sound with objects to communicate sexual interest (Nishida, 1980; Goodall, 1986). After demonstrating such intent, it is not rare to see the targeted female approach and present for a mount. Similarly, musically qualified humans may be more successful at seduction than their non-musical counterparts, indicating that musical ability could be a sign of male health and fertility (Miller, 2000; Sluming and Manning, 2000; Charlton, 2014).

Finally, making noise can be associated with pleasurable emotion, both in humans and great apes. Humans play music. Young chimpanzees often drag branches noisily along the ground and were seen, on one occasion, to repeatedly hit a clay pot (Matsusaka, 2012), seemingly enjoying the noise. Thus, on both contextual and emotional levels, there are several links between sound making with objects in great apes and instrumental music in humans.

However, many other properties of noise making with objects by great apes do not meet criteria for music. Indeed, drumming in great apes is generally contextualized (sex, play, or display context), whereas humans produce music outside any particular context or function (Arom, 2000). Music is very often rhythmical, while drumming sequences of apes are generally too short to allow for rhythm to be detected (but see Dufour et al., 2015).

Another key component of music production is the performers' capacity to synchronize their actions to an external beat (Arom, 2000). We know that pinnipeds (Cook et al., 2013), cockatoos (Patel et al., 2009), parrots (Schachner et al., 2009), and budgerigars (Hasegawa et al., 2011) can learn or be trained, with varying levels of precision, to synchronize their body movement to a rhythm. A form of action entrainment by motor mimicry of pounding gestures has been reported in young chimpanzees watching others cracking nuts (Fuhrmann et al., 2014). The female chimpanzee Aï spontaneously pressed two keys on a keyboard in synchrony with a rhythmical auditory stimulus without any previous training (Hattori et al., 2013). Recently, a bonobo was found to occasionally match its own drumming tempo to that of a human drummer (Large and Gray, 2015), even when the tempo differed slightly from the natural pace of the bonobo. However, the tempo matching disappeared quickly despite the bonobo being encouraged and rewarded for drumming. By comparison, human children can synchronize to external drumming at the age of around 3 years, increasing in accuracy as they grow older (Honing et al., 2012). Kanzi, a language-trained bonobo, was reported to perform rhythmical drumming (Kugler and Savage-Rumbaugh, 2002), but there are no published data describing this event. In a recent study, we described a long drumming solo on an upturned plastic barrel by a chimpanzee named Barney. This solo was rhythmic, decontextualized, and fitted several criteria for human music (Dufour et al., 2015). It would be justifiable to question the significance of this unique observation: was Barney's performance a "once in a lifetime" event, i.e., a chimpanzee accidentally "discovering music"? Or was it a rare behavior that had gone unnoticed by chimpanzee specialists and had not been given the consideration it deserved, remaining unpublished due to its anecdotal nature? Most importantly, can Barney do it again?

To identify the factors leading to Barney's performance, we conducted a 2-month survey of sound making with objects in the chimpanzee facility where Barney was living. Our first aim was to gather information on factors influencing the duration of sound making using objects found in the environment. To that end we recorded the context, the medium used, and the complexity of sound production techniques. Furthermore, we noted the sex of the individual producing sound with objects. Reports on chimpanzee drumming in the wild (Nishida, 1980; Goodall, 1986; Arcadi, 1996; Nishida et al., 1999) suggest that chimpanzees, males in particular, drum primarily in socio- or emotionally negative contexts, such as displays. However, since studies on chimpanzee drumming remain scarce, formulating informed predictions for the effects of context, sex, technique, and medium on drumming durations is not straightforward. Therefore, we adopted an exploratory approach to assess potential effects of these predictor variables on the duration of sound production. Secondly, we aimed to record as many sound production events as possible in order to detect any performances resembling Barney's original drumming (i.e., long, rhythmically interesting, and decontextualized sequences). Any such cases were checked for evidence of musicality.

### METHODS

### Subjects and Study Site

This study took place at the Biomedical Primate Research Centre (BPRC) at Rijswijk, the Netherlands, in July and August 2005. The facility held a total of 54 individuals of which 28 were adult females and 26 adult males. The population was composed of six groups living in enclosures with outside areas facing the same courtyard (see **Figure 1**). Thus, some groups (those who faced other groups) could see each other and all groups could hear each other. Note that this colony was moved to the Safari Park Beekse Bergen in 2006, and that the BPRC no longer houses chimpanzees. The survey of sound production using objects or other enclosure elements as a resonating medium (sound-object use, hereafter referred to as So-U) was carried out in two phases. In Phase 1 (from 13th July to 12th August 2005) data were collected for the four groups that had at least two adult males (Dirk's group, Barney's group, Bob's group and Dennis' group). Due to low rates of So-U in one of the groups (Bob's group), this group was removed from Phase 1 data collection after a few days. In Phase 2 data collection on So-U sequence recordings took place on all six groups from the 22nd of July to the 25th of August 2005.

### Phase 1: Contexts of So-U

### Data Collection

In Phase 1 the goal was to gather as much information as possible about the factors affecting the duration of So-U. One observer (PG) stood in the central courtyard and focused on one group for 15 min at a time. We conducted a total of 162 focal observations (54 per group). Focal observations were spread equally throughout the day (8:30–10:00 a.m.: 15 observations per group; 10:30 a.m.–12:20 p.m.: 20 observations per group; 1:20–5:00 p.m.: 19 observations per group) and the groups were observed in random order within each time period. During focal group observations the observer wrote down all occurrences of So-U in the group, noting the identity of the individual producing the sound, the duration of So-U, the type of medium used, the technique used, and any information (individual posture, behavior, or external factors) that could indicate the context of occurrence (**Table 1**). Some contexts were easily identifiable (i.e., nest-building, play, sexual activity, intimidation display without aggression, aggression, see Goodall, 1986). Others were not always clearly discernible in the absence of any obvious contextual information. An example of this is "tension," which was defined as So-U when the animal had its hair erect (hunched back posture) without escalation into a display or an aggression and in the absence of any other contextual element. Indeed, hunching can sometimes also occur in courtship (but then leads to copulation or attempted copulation, accompanied by an erect penis), greetings (involving a "friendly reunion"), or excitement (upon seeing enriching food, hearing other chimpanzees, etc.) (Nishida et al., 1999). When courtship and greeting could be excluded and no external stimuli potentially triggering excitement were apparent, we assigned the context "tension" to the So-U episode.

### Data Re-coding for Phase 1

To ensure a sufficiently large and reliable dataset allowing for So-U analysis we introduced categories for contexts, drumming techniques, and objects used for sound production respectively.

As So-U occurred frequently when individuals were tense, aggressive, or displayed we regrouped these contexts into one category that we termed "socio- or emotionally negative contexts." Socio- or emotionally positive contexts comprised sexual activity, playing, teasing, nest-building, and attention seeking. Any outside-group noise and unidentified contexts (**Figure 2**) were excluded from data analyses. Note that comparing So-U durations between contexts when working with data that were not controlled for durations of the behavioral category itself might be questionable. We took this limitation into account when discussing the results.

For the noise production techniques we distinguished between "simple" (only one technique used, i.e., hitting, shaking, trailing, half-circling, pushing, or throwing) and "complex" (combination of two or more techniques) techniques.

Regarding the objects used to produce noise we differentiated between (i) metallic media (doors, fence, metallic ground), (ii) plastic containers (small plastic containers, half and large plastic barrels), and (iii) small or non-resonating objects (small plastic bottles, enclosure furniture, cardboard tubes).

#### TABLE 1 | Description of the behavioral units recorded live by the observer for contexts, type of medium, techniques, vocalizations, and postures.


activity, teasing, playing, attention seeking, and nest-building (considered as socio- or emotionally positive). Note that most So-U in sexual activity are initiated by males, except one initiated by a female who sought the male's attention by knocking on a door. So-U with unidentified contexts (context "none") are also recorded.

#### Statistical Models for Phase 1

First we used a log-likelihood ratio test (LRT) to check for individual differences in the frequency and handling duration of So-U. We compared the full model with individual nested in group as a random effect against the full model with only group as a random effect. We then built a model for the response variable "duration of So-U." We used multi-model inference and model averaging to calculate the weight of evidence (Wi) for each predictor in the model (Burnham and Anderson, 2002). We standardized the binary predictors to a common scale both to correctly estimate their influence (Engqvist, 2005) and to control for some possible degree of collinearity among them (Schielzeth, 2010). We followed an information-theoretic approach: all possible models were run and ranked based on their AIC scores (Akaike, 1985) after correction for small sample sizes (AICc) and their normalized Akaike weights (AICw, Burnham and Anderson, 2002). We computed estimates, standard errors and 95% confidence intervals (CI) for models whose cumulative model weights reached 90% of the total weights. We ran model selection and averaging using the R package MuMIn (Barton, 2009). The model was fitted with a negative binomial error distribution using the R package glmmADMB (Fournier et al., 2012).
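The ranking step described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the candidate models, log-likelihoods, parameter counts, and sample size below are invented for illustration, and the helper names (`aicc`, `akaike_weights`, `relative_importance`, `confidence_set`) are hypothetical. It shows how AICc scores translate into normalized Akaike weights, how a predictor's weight of evidence can be obtained by summing the weights of the models that contain it, and how a 90% cumulative-weight set is formed.

```python
import math

def aicc(log_lik, k, n):
    """AIC with the small-sample correction (AICc)."""
    aic = -2.0 * log_lik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Normalized Akaike weights from a list of AICc scores."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]

def relative_importance(models, weights, predictor):
    """Sum of Akaike weights over the models that contain `predictor`."""
    return sum(w for m, w in zip(models, weights) if predictor in m["terms"])

def confidence_set(models, weights, level=0.90):
    """Best-ranked models whose cumulative weight first reaches `level`."""
    ranked = sorted(zip(models, weights), key=lambda mw: -mw[1])
    kept, cumulative = [], 0.0
    for model, weight in ranked:
        kept.append(model)
        cumulative += weight
        if cumulative >= level:
            break
    return kept

# Hypothetical candidate models: log-likelihoods and parameter counts are invented.
models = [
    {"terms": {"context", "sex"}, "log_lik": -310.2, "k": 5},
    {"terms": {"context", "sex", "technique"}, "log_lik": -307.9, "k": 6},
    {"terms": {"medium"}, "log_lik": -318.4, "k": 5},
]
n_obs = 105  # assumed sample size, for illustration only

scores = [aicc(m["log_lik"], m["k"], n_obs) for m in models]
weights = akaike_weights(scores)

for term in ("context", "sex", "technique", "medium"):
    print(term, round(relative_importance(models, weights, term), 3))
print("models in 90% set:", len(confidence_set(models, weights)))
```

In this toy example the weights concentrate on the two models that contain context and sex, so those predictors receive a weight of evidence close to 1 while the medium receives almost none.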

The full model included the following predictor variables: (i) the medium, (ii) the complexity of the technique, (iii) the context, (iv) the sex, and (v) all possible interactions of these main effects except for the interaction "technique-medium." Indeed, among all the So-U performed with mixed techniques (33 events) 32 occurred with the same medium (plastic container), preventing us from investigating the effect of this interaction. Note that only one female used a small medium (and on only one occasion). Therefore, we also excluded the interaction "sex-medium" from the model.
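To make the size of such a candidate set concrete, the following Python sketch enumerates the sub-models of a full model with these four main effects. It rests on two assumptions that are not stated in the text: only two-way interactions are considered, and an interaction is allowed only when both of its main effects are present (the marginality rule). The names `MAIN`, `EXCLUDED`, and `candidate_models` are hypothetical.

```python
from itertools import combinations

MAIN = ["medium", "technique", "context", "sex"]

# Interactions left out of the full model, as described in the text.
EXCLUDED = {frozenset(["technique", "medium"]), frozenset(["sex", "medium"])}

# Two-way interactions retained (assumption: no higher-order terms).
INTERACTIONS = [
    pair for pair in combinations(MAIN, 2) if frozenset(pair) not in EXCLUDED
]

def subsets(items):
    """Yield every subset of `items` as a tuple, from empty to full."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)

def candidate_models():
    """All (main effects, interactions) pairs that respect marginality."""
    models = []
    for mains in subsets(MAIN):
        allowed = [p for p in INTERACTIONS if set(p) <= set(mains)]
        for inters in subsets(allowed):
            models.append((mains, inters))
    return models

models = candidate_models()
print(len(INTERACTIONS), "interactions retained")
print(len(models), "candidate models, including the null model")
```

Under these assumptions four interactions remain (medium-context, technique-context, technique-sex, context-sex) and the candidate set contains 49 models. A different treatment of higher-order interactions would change that count.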

### Phase 2: Video Recording of So-U for Further Rhythmical Analysis

In Phase 2 we conducted all occurrence sampling of So-U in all six groups to increase our chances of filming a long, decontextualized, and rhythmically complex So-U resembling Barney's original performance. We recorded the identity of the performer and the context of occurrence. For interesting bouts, rhythmical analysis was conducted in the same way as in Dufour et al. (2015), i.e., by checking the possibility of rhythmical patterns in a series of impacts (Ljung-Box test) and then checking the predictability of the next section using autocorrelation analysis (see Supplementary Methods). For each sequence, data were analyzed using R (R Core Team, 2013). The significance level was set at 0.05.
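As an illustration of these two steps, the Python sketch below computes sample autocorrelations and the Ljung-Box portmanteau statistic for a series of inter-beat intervals. It is a simplified stand-in for the R analysis, not a reproduction of it: the interval values are invented, the choice of two lags is arbitrary, and the p-value helper `chi2_sf_even_df` (a hypothetical name) uses a closed form that is valid only for an even number of lags.

```python
import math

def autocorrelation(x, lag):
    """Sample autocorrelation of the sequence x at a positive lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return sum(dev[t] * dev[t + lag] for t in range(n - lag)) / denom

def ljung_box_q(x, max_lag):
    """Ljung-Box statistic Q = n(n+2) * sum_k r_k^2 / (n - k), k = 1..max_lag."""
    n = len(x)
    return n * (n + 2) * sum(
        autocorrelation(x, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )

def chi2_sf_even_df(q, df):
    """Upper-tail chi-square probability; closed form for even df only."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = q / 2.0
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )

# Hypothetical inter-beat intervals in seconds: strict short-long alternation.
ibi = [0.2, 0.6] * 10

r1 = autocorrelation(ibi, 1)   # negative: a short interval follows a long one
r2 = autocorrelation(ibi, 2)   # positive: the pattern repeats every two beats
q = ljung_box_q(ibi, max_lag=2)
p = chi2_sf_even_df(q, df=2)
print(round(r1, 2), round(r2, 2), round(q, 1), p < 0.05)
```

A small p-value indicates that the intervals are not independent of one another, and the sign pattern of the autocorrelations shows at which lags the sequence is predictable.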

### ETHICS STATEMENT

The study was conducted in compliance with all relevant Dutch laws and respected international and scientific standards and guidelines. All analyses were based on the recording of spontaneous behavioral sequences initiated by the chimpanzees. Due to the observational nature of this study and the absence of discomfort for the animals no additional permission was required from the institute's animal experiment committee, as assessed by the Biomedical Primate Research Centre Animal welfare officer.

### RESULTS

### Phase 1: Contexts of So-U

During focal observations in Phase 1 we recorded a total of 123 So-U with various media (Supplementary Figure 1). So-U occurred in several contexts including aggression, display, tension, sexual activity, teasing, playing, attention seeking, nest-building, and outside-group noise. Fifteen of the 123 bouts could not be assigned to any context. Sound production in unidentified contexts occurred only in males (**Figure 2**). While So-U occurred in both males and females, males produced it more often than females (males: 105 times, females: 18 times), all contexts included. We recorded 88 So-U with simple techniques and 35 with complex techniques (16 individuals out of 24 never used complex techniques). Focusing on the two main contextual categories (thus excluding context "none" and context "outside group noise"), we recorded more So-U in socio- or emotionally negative contexts (83) than in socio- or emotionally positive contexts (22). There were significant differences between individuals both in the frequency and duration of So-U (LRT frequency model: df = 1, Δ deviance = 4.25, P = 0.039; LRT duration model: df = 1, Δ deviance = 6.254, P = 0.012).
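The reported P-values can be checked from the deviance differences alone, assuming the usual chi-square reference distribution for a likelihood ratio test. For one degree of freedom the upper tail of the chi-square distribution reduces to the complementary error function, so the short Python sketch below needs only the standard library. The function name `lrt_p_value_df1` is hypothetical.

```python
import math

def lrt_p_value_df1(deviance_diff):
    """P-value of a likelihood ratio test with 1 degree of freedom.

    For df = 1 the chi-square upper tail equals erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(deviance_diff / 2.0))

p_frequency = lrt_p_value_df1(4.25)   # deviance difference, frequency model
p_duration = lrt_p_value_df1(6.254)   # deviance difference, duration model
print(round(p_frequency, 3), round(p_duration, 3))
```

Rounded to three decimals, the two values agree with the P = 0.039 and P = 0.012 given in the text.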

The time spent handling a medium in So-U varied from 1 to 720 s, with a median duration of 6 s. Model averaging on the duration of So-U as the response variable revealed that context, sex, the interaction between context and sex, and the technique used had higher relative importance than other variables in the model (**Table 2**). So-U in socio- or emotionally positive contexts lasted longer than in socio- or emotionally negative ones (Estimate = 1.053, 95% CI: 0.052–2.055; **Table 2**). Males had longer So-U than females but this was only significant in negative contexts (in negative contexts: Estimate = −3.253, 95% CI: −4.891 to −1.615; in positive contexts: Estimate = −1.991, 95% CI: −4.151 to 0.167; **Table 2**, **Figure 3**). Complex techniques lasted longer than simple ones (Estimate = 1.053, 95% CI: 0.132–1.161; **Figure 4**).

TABLE 2 | Model averaging using an AICc-based selection approach, showing estimate, standard error (SE), 95% confidence interval (95% CI) and relative weight of evidence (Wi) for each variable in the handling duration model.

The model was run using a "poisson" distribution and included (i) the medium, (ii) the complexity of the technique, (iii) the context, (iv) the sex, and (v) all possible interactions of these main effects (except for the interactions "technique-medium" and "sex-medium," see methods) as fixed effects and individual nested in group as random effect. Estimates whose 95% CI does not overlap 0 indicate a significant influence on the response variable (highlighted in bold).
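The decision rule used here (an effect counts as significant when its 95% CI excludes 0) can be applied directly to the estimates quoted in the text. The Python sketch below does so and also back-transforms each coefficient to a multiplicative ratio. That back-transformation assumes a log link, which is the usual default for count models but is not stated in the text. The effect labels and helper names are hypothetical, and the coding of reference levels is not specified in the source.

```python
import math

# Estimates and 95% confidence limits as quoted in the text.
effects = {
    "context": (1.053, 0.052, 2.055),
    "sex in negative contexts": (-3.253, -4.891, -1.615),
    "sex in positive contexts": (-1.991, -4.151, 0.167),
}

def excludes_zero(lower, upper):
    """True when the confidence interval lies entirely on one side of 0."""
    return lower > 0 or upper < 0

def as_ratio(value):
    """Back-transform a log-scale coefficient to a ratio (assumes a log link)."""
    return math.exp(value)

for name, (estimate, lower, upper) in effects.items():
    verdict = "CI excludes 0" if excludes_zero(lower, upper) else "CI includes 0"
    print(
        f"{name}: ratio = {as_ratio(estimate):.2f} "
        f"[{as_ratio(lower):.2f}, {as_ratio(upper):.2f}], {verdict}"
    )
```

Under the log-link assumption, the context estimate of 1.053 corresponds to durations roughly 2.9 times longer in one context category than in the other, with a wide interval around that ratio.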

### Are there any Similarities with Barney's Drumming?

In Phase 1, we observed 27 So-U that lasted longer than 20 s. Eight were performed with either a half barrel or a large barrel. Only one of these performances was not clearly associated with a context, as the performer alternated between hoots and a relaxed face while shaking the barrel. The seven remaining bouts of sound production were contextualized (4 displays, 1 tension, 1 nesting, 1 sexual activity). Barney performed only six So-U, two of which involved using a small container, and four of which involved hitting a door. Five of these events occurred in a tension context and one was associated with a display.

In Phase 2, we filmed 262 events of So-U by various individuals, including 90 So-U that lasted longer than 20 s. Eighty-four of these occurred in a clearly recognizable context: socially negative (46), nest-building (10), sexual activity (9), attention-seeking (3), or play (16). Note that several sequences could sometimes be detected within one So-U bout. In total we analyzed the rhythmical properties of 10 So-U, producing a total of 20 sequences, i.e., those in which successive beats could be clearly distinguished from the background noise (**Table 3**). Rhythmical analysis concerned only seven contextualized So-U (**Table 3**). Two of these were socially negative So-U (Oscar seq 1; Oscar seq 2.1 and 2.2, **Table 3**). However, the technique used in these cases (holding a barrel and hitting it against the wall with half-circling trajectories, Supplementary Video 1) could potentially explain their rhythmicity. Three sexual displays (Paul seq 1.1, 1.2, and 1.3: Supplementary Video 2, Dennis seq 1 and Dennis seq 2.1 and 2.2) showed interesting and complex rhythmical patterns (Ljung-Box portmanteau test, Supplementary Figure 2). In Paul seq 1.2, for example, there were alternating series of short and long inter-beat durations with a remarkably long-term dependency (autocorrelation test, **Table 3** and Supplementary Figure 3). The tempo was independent of the technique or medium used, and seemed to be controlled by the chimpanzees. Finally, among the six So-U without clear contexts, two showed analyzable rhythmical properties (**Table 3**). All of them could also be rhythmically highly dependent on the technique used (see Supplementary Videos 3 and 4).

### DISCUSSION AND CONCLUSION

The evolution of musicality is shrouded in the mists of our past, but the production of sounds with objects by chimpanzees may reveal the presence of some of its prerequisites in our common ancestor. Although we did not record a second instance of decontextualized and rhythmic drumming like the one event recorded for Barney (Dufour et al., 2015), the chimpanzees of the studied colony frequently incorporated objects when making sound. Sound production occurred in a diversity of contexts. So-U produced in socially or emotionally positive contexts lasted longer than in negative contexts. The primary explanation could be that our target behaviors for socio- or emotionally negative contexts (e.g., displays) may simply not last as long as the target behaviors in socio- or emotionally positive contexts (e.g., play bouts or building a nest), with or without So-U. Therefore, the difference in duration of So-U across contexts may be an artifact of socially or emotionally positive behaviors lasting longer than socially or emotionally negative ones. An alternative hypothesis (that may be considered in further studies) is that So-U might have lasted longer in this context because individuals were involved in a relaxing activity like nesting and playing. Their attention may have been better focused on the production of sounds and its pleasurable aspects.

More than 85% of So-U were produced by males, which is in line with what we expected based on the literature (Nishida, 1980; Goodall, 1986; Arcadi, 1996; Nishida et al., 1999). As So-U in socio- or emotionally negative contexts lasted longer when produced by males than by females, we could speculate that male chimpanzees are more motivated than females to produce sound with objects in this context, but we cannot generalize to the population level. This observation fits well with the intimidating function of buttress drumming described in wild male chimpanzees (Goodall, 1986). Further work should aim at a more detailed assessment of the communicative function of So-U by assessing responses from the audience in these contexts. In socially or emotionally positive contexts, there was no difference in object handling durations between females and males.

Finally, So-U could involve complex techniques that lasted longer than So-U with simple techniques. We hypothesize that shifting from one technique to another could be a way to counter tiredness arising from multiple repetitions of the same gestures. This illustrates how chimpanzees actively engaged in producing sounds with objects, a prerequisite for the evolution of music.

Given the diversity of contexts recorded, we cannot draw conclusions about the main driving force in the production of So-U in our colony (388 events in <2 months). We therefore cannot pinpoint which factor most probably led to the discovery and spreading of music by our ancestors: the need to demonstrate strength, to attract females, or the pleasurable aspects of making noise.

One objective of this study was to check if Barney or the other chimpanzees of the colony were capable of producing a performance similar to the one reported in Dufour et al. (2015) on a regular basis. Some individuals produced long and/or elaborate So-U bouts. Most So-U were contextualized, short and "unremarkable," except maybe for some sexual displays reminiscent of human and bird courtship displays as illustrated in Supplementary Video 3. In this video, the male successfully attracted a female's attention by repeatedly hitting a large barrel with a rather slow and clearly audible tempo (with multiple repositioning of the barrel toward the female). Note that in the wild, sexual displays are more likely to involve branch breaking or leaf clipping (Nishida, 1980) rather than demonstrative drumming per se (Crockford and Boesch, 2005). In this respect, this video illustrates, potentially, an innovative use of drumming compared to wild chimpanzees. When focusing on the longest and most decontextualized So-U, we found interesting rhythmical patterns. However, most of these manipulations were constrained by the general configuration of the sound production, for example, hitting a barrel with a semicircle trajectory against the wall (see Supplementary Video 1, for an example). The rhythmical element was therefore not entirely controlled by the chimpanzee, as it was in Barney's case. If Barney's solitary drumming bout had not been recorded by chance, this unique evidence of potential rhythmicity in chimpanzees would never have been brought to light. He did not repeat this feat during the study and may never do so again, making this recording all the more valuable.

At this point, we may question the adequacy of Arom's "decontextualization" criterion when evaluating musicality in animals (Arom, 2000). Indeed, human music is often contextualized (associated with rituals and social functions). The inclusion of this criterion sets the bar very high for sound production in animals to be considered music, and excludes many vocal sophistications heard in some bird songs. It also excludes some of the rhythmical sexual displays we recorded here. A more flexible use of Arom's criteria might therefore be needed to widen our understanding of animal musicality. Nevertheless, the structure of Barney's initial performance remains undeniably and intuitively recognizable as drumming, and conforms with the "higher order" criteria proposed by Arom (2000).

The many studies that explore the origins of music (including research presented in this special issue) are hampered by the limited amount of documentation describing instrumental sound production in apes. Although music appears to be within the grasp of chimpanzees, they have not yet taken the step to music per se. This modest contribution was designed to provide additional information about the use of objects and various media for sound production by chimpanzees, thus providing a starting point for further work along these lines (see also Ravignani et al., 2013). Further studies should attempt to investigate the type of attraction that instrumental noise making has on chimpanzees, including the refinement and leisureliness expressed while doing so. This should contribute to a better understanding of how music evolved.

### AUTHOR CONTRIBUTIONS

ES and VD designed the study, analyzed the data and wrote the manuscript. PG and VD collected the data. CP supervised and designed the data analysis.

### FUNDING

CP was funded by an ANR programme blanc (ANR 12 BSV7 0013 02).

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2017.00002/full#supplementary-material

### REFERENCES

Akaike, H. (1985). "Prediction and entropy," in A Celebration of Statistics, eds A. C. Atkinson and S. E. Fienberg (New York, NY: Springer), 1–24.

Arcadi, A. C. (1996). Phrase structure of wild chimpanzee pant hoots: patterns of production and interpopulation variability. Am. J. Primatol. 39, 159–178. doi: 10.1002/(SICI)1098-2345(1996)39:3<159::AID-AJP2>3.0.CO;2-Y


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2017 Dufour, Pasquaretta, Gayet and Sterck. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Indris Have Got Rhythm! Timing and Pitch Variation of a Primate Song Examined between Sexes and Age Classes

Marco Gamba<sup>1</sup> \*, Valeria Torti <sup>1</sup> , Vittoria Estienne<sup>2</sup> , Rose M. Randrianarison<sup>3</sup> , Daria Valente<sup>1</sup> , Paolo Rovara<sup>1</sup> , Giovanna Bonadonna<sup>1</sup> , Olivier Friard<sup>1</sup> and Cristina Giacoma<sup>1</sup>

<sup>1</sup> Department of Life Sciences and Systems Biology, University of Torino, Torino, Italy, <sup>2</sup> Department of Primatology, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany, <sup>3</sup> Département de Paléontologie et d'Anthropologie Biologique, Faculté des Sciences, Université d'Antananarivo, Antananarivo, Madagascar

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Yuko Hattori, Kyoto University, Japan

Pawel Fedurek, University of Neuchatel, Switzerland

> \*Correspondence: Marco Gamba marco.gamba@unito.it

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 08 February 2016 Accepted: 20 May 2016 Published: 14 June 2016

#### Citation:

Gamba M, Torti V, Estienne V, Randrianarison RM, Valente D, Rovara P, Bonadonna G, Friard O and Giacoma C (2016) The Indris Have Got Rhythm! Timing and Pitch Variation of a Primate Song Examined between Sexes and Age Classes. Front. Neurosci. 10:249. doi: 10.3389/fnins.2016.00249

A crucial, common feature of speech and music is that they show non-random structures over time. It is an open question which other species share rhythmic abilities with humans, but in most cases the lack of knowledge about their behavioral displays prevents further studies. Indris are the only lemurs that sing. They produce loud howling cries that can be heard at several kilometers, and all members of a group usually take part in the song. We tested whether overlapping and turn-taking during the songs followed a precise pattern by analysing the temporal structure of the individuals' contributions to the song. We found that both dominants (males and females) and non-dominants influenced one another's onset timing. We found that the dominant male and the dominant female in a group overlapped each other more frequently than they did with the non-dominants. We then focused on the temporal and frequency structure of particular phrases occurring during the song. Our results show that males and females have dimorphic inter-onset intervals during the phrases. Moreover, median frequencies of the units emitted in the phrases also differ between the sexes, with males showing higher frequencies than females. We did not find an effect of age on the temporal and spectral structure of the phrases. These results indicate that singing in indris has a high behavioral flexibility and varies according to social and individual factors. The flexible spectral structure of the phrases given during the song may underlie perceptual abilities that are relatively unknown in other non-human primates, such as the ability to recognize particular pitch patterns.

Keywords: singing primates, gender differences, lemurs, pitch pattern recognition, musical abilities

### INTRODUCTION

It is an open question whether the human ability to produce and perceive sequences of rhythmic sounds arose at an early or a later stage of human evolution. Sequences of rhythmic sounds are the core of the musical melodies we listen to in our everyday life, and it is debated whether primitive forms of music can be found in other species (Brown, 2000; Geissmann, 2000; Merker, 2000). As remarked by Ravignani et al. (2014), the temporal properties of animal acoustic behavior should have a primary role in the comparison between human musicality and animal sounds. In animals, there is a wide array of displays that may be well described by McAuley's definition of rhythm (2010; see also Toussaint, 2013): "the serial pattern of durations marked by a series of events." In animal vocal sequences, these "events" are sounds (units) and silences (silent intervals).

Timing and synchronization play a crucial role in human and animal communication (Bowling et al., 2013; Ravignani et al., 2014). From katydids (Greenfield and Roizen, 1993) to fiddler crabs (Blackwell et al., 2006), to amphibians (Klump and Gerhardt, 1992), the temporal organization of acoustic signals has an important part in mediating interactions between individuals and mate choice. Previous studies have shown that the generation of rhythmic sound is common in most apes: what has been termed drumming (Schaller, 1963) has been found in chimpanzees (Pan troglodytes, Goodall, 1986; Nishida, 2011; Babiszewska et al., 2015), bonobos (Pan paniscus, de Waal, 1988; Kugler and Savage Rumbaugh, 2002), and gorillas (Gorilla gorilla, Schaller, 1963). These sounds can be produced by pounding with hands and/or feet either on external objects or on the animal's own body, and are common in both captive and wild animals (Arcadi et al., 1998, 2004). However, the ability to produce a rhythmic pattern of acoustic signals does not necessarily correspond to the capacity to coordinate sound production (Fitch, 2013). As suggested by Fitch (2006a,b) and Patel (2008), joint coordination in non-human species appears widespread in sound-mimicking birds (Cacatua galerita, Patel et al., 2009; C. galerita and Psittacus erithacus, Schachner et al., 2009; Melopsittacus undulatus, Hasegawa et al., 2011) and can extend to sea lions (Zalophus californianus, Cook et al., 2013). Studying chorusing dynamics may be of critical importance to understand the flexibility of the individual timing during group displays and the adaptive functions of rhythm (Ravignani et al., 2014).
Most studies suggest that monkeys do not perceive a beat and thus cannot synchronize their movements with it (Macaca mulatta, Zarco et al., 2009; Honing et al., 2012), although a certain degree of behavioral coordination between individuals can be found in the chorusing of wild chimpanzees (Fedurek et al., 2013a), and the ability to synchronize to an auditory stimulus has been found in captivity (Hattori et al., 2013). Observations of chimpanzees seeking objects with particular resonant properties and then using them repeatedly to drum also suggested a link between the auditory and motor systems in non-human primates (reported by Fitch, 2012).

Apart from temporal patterns, spectral properties have also played a major role in the comparison between human musicality and animal vocal behavior. Previous works focused on the fact that non-human species may have a higher capacity for the temporal processing of sounds and a lower sensitivity to spectral harmonicity (Chinchilla laniger, Shofner and Chaney, 2013; Callithrix jacchus, Pistorio et al., 2006). Studies showed that many non-human primate species, in contrast to humans, do not show considerable differences in average voice pitch between sexes (see Ey et al., 2007 for a review). Patel (2008) suggested that the lack of sex dimorphism in pitch and the limited ability of non-human primates to recognize relative pitch patterns could indicate that sensitivity to pitch changes may be uniquely human, and that it may have had a critical role in the evolution of human musical abilities.

In non-human primates, group calling may have a role in communicating group cohesiveness and in advertising the occupation of a territory (Marler, 2000). Both these functions fit well with the proposed social bonding theory of the evolution of music (Dunbar, 1996) and are crucial for the regulation of territorial ranging patterns and group dynamics (Geissmann, 2002; Gamba, 2014). Non-human primates use song to advertise resource holding potential, to reduce the probability of encounters by regulating group movements in the forest, and to resolve group encounters avoiding physical fights (Mitani, 1985; Cowlishaw, 1996). These findings suggest the existence of neural capacity of advanced sound localization processes in nonhuman primate species producing songs (Brown, 1982; Maeder et al., 2001).

A quantitative, rigorous investigation of non-human primate singing displays may shed new light on the factors affecting individual singing during chorusing. It may also help in identifying the selective pressures that may have led to the evolution of this trait only in Indriidae, Tarsiidae, Callicebinae, and Hylobatidae (Deputte, 1982; Haimoff, 1983; Geissmann, 2000) and may provide insights into the improvement of these abilities during human evolution.

We investigated the rhythmic abilities of a Strepsirrhini species. Strepsirrhines are primates whose last common ancestor with humans is currently dated back between 64 and 87.2 million years ago<sup>1</sup> . There is a single singing lemur species, Indri indri (Gmelin, 1788). The indri lives in the mountain rainforests of Madagascar, where its howling cries can be heard at a distance up to 2 km (Pollock, 1986). The social organization of indri is based on a reproductive pair where the adult female is dominant over the adult male although the level of intra-group competition is low (Pollock, 1975, 1977). Usually a male, whose relatedness with the adult pair is unknown, is present in the social group, and group size usually varies between two and six animals (Torti et al., 2013). The limited number of adult individuals in a group suggested that intrasexual dominance is age-related (Pollock, 1977, 1979).

The song of the indris is a long sequence of vocal emissions (units) separated by silent gaps and organized in phrases (**Figure 1**; Thalmann et al., 1993). The indris emit harsh roars at the start of the song, followed by long and scarcely modulated units and, finally, a pattern of descending phrases, which are series of two to six units given with a slightly descending frequency pattern (Thalmann et al., 1993; Sorrentino et al., 2013; Torti et al., 2013). Within the species' vocal repertoire, the song is the acoustic display covering the widest range of pitch, and all group members aged 2 years and above participate in the song (Maretti et al., 2010). The songs serve to inform neighboring groups about the occupation of a territory and to defend a territory actively during group encounters. They also have a cohesion function (Pollock, 1986; Torti et al., 2013) and are likely to mediate the formation of new groups (Pollock, 1986; Giacoma et al., 2010). It is not clear whether the song may attract partners, but Bonadonna et al. (2014) suggested that, given the scarcity of group encounters, singing may also mediate extra-pair copulation, the finding of a mating partner, and the formation of new groups.

<sup>1</sup> Strepsirrhines are primates whose last common ancestor with humans is currently dated back between 64 and 87.2 MYA. Estimates may differ according to the methodology used for the phylogenetic reconstruction: 67.1–97.7 MY in Steiper and Young (2006); 64 MY in Chatterjee et al. (2009); 87.2 MY in Perelman et al. (2011).

The indri songs are organized behavioral displays where each caller has a precise pattern. Following the framework proposed by Ravignani et al. (2014), we could define the indri songs as the combination of individual aperiodic songs, which shows a complex, uncoupled chorusing of two or more signallers. The calls in the song can be given alternately or simultaneously, with absent, partial, or complete overlap. These characteristics make the indri an excellent model to investigate singing coordination and rhythmic abilities in a non-human species.

Our first aim was to examine coordination during singing between male and female indris. The study of the structure of duetting displays in birds led to two alternative hypotheses. One is that temporal coordination is an honest signal of the coalition quality of the individuals involved (Hall and Magrath, 2007). A coordinated duet is likely to be emitted by an established pair and is more threatening for neighbors than an uncoordinated duet (Brumm and Slater, 2007). A second hypothesis refers to studies demonstrating that temporal coordination may arise when individuals adjust their signals to minimize overlap with conspecifics (Tobias and Seddon, 2009). As the indris form cohesive, territorial pairs, and their songs have a role in advertising territorial occupancy (Torti et al., 2013), we predicted that the reproductive pair would synchronize during singing in most of the songs. Snowdon and Cleveland (1984) showed that pygmy marmosets (Cebuella pygmaea) used calls antiphonally to maintain contact, following an individual-specific pattern and a system of rules. Few studies have concentrated on primate turn-taking and overlapping during singing. Although a universal pattern cannot be described, studies on members of the family Hylobatidae showed that in sexually dimorphic species, males and females tend to avoid overlapping of their singing, whereas in species where morphological dimorphism is absent singers tend to overlap (Deputte, 1982). From these observations, we predicted that indris, which are not sexually dimorphic and, like gibbons, live in socially monogamous groups, would overlap during singing. The degree of overlap has rarely been quantified, but the studies of Merker and Cox (1999, Nomascus gabriellae) and Koda et al. (2013, 2014; Hylobates agilis, Hylobates lar) suggested that juvenile gibbons may overlap more often with adults, especially with adult females. Therefore, we predicted that sex and dominance would affect the singing displays and, in particular, that non-adult indris would overlap more with the adults than the adults overlap with each other.

Our second objective was to identify whether the rhythmic structure of the indris differed between sexes and phrases and to show the developmental dynamics of rhythm in indris. Sasahara et al. (2015) demonstrated that rhythm development in birds shows high rates of change during early stages and then slowly refines toward maturity. Our prediction was that the rhythm of the indris' song phrases differed between age classes.

Our third objective was to investigate pitch variation within and between sexes to understand how sex affects the spectral properties of the indri's vocal signals and to complement the results on the temporal patterns. We predicted that indris, which are size monomorphic and monochromatic, would lack marked sexual differences in pitch, as has been shown in most non-human primate species (Ey et al., 2007). Thus, we expected indris not to differ markedly in fundamental frequency between sexes and the pitch patterns presented during the song to be similar in both sexes.

### MATERIALS AND METHODS

### Study Subjects and Recordings

We studied 21 groups living in four different areas of dense tropical forest in Madagascar: seven groups in the Analamazaotra Reserve (Andasibe-Mantadia National Park, 18° 56′ S, 48° 25′ E), two groups in Mantadia (Andasibe-Mantadia National Park), three groups in the Mitsinjo Station Forestière (18° 56′ S, 48° 24′ E), and nine groups in the Maromizaha Forest (18° 56′ 49′′ S, 48° 27′ 53′′ E). We collected data in the field every year between September and December, from 2004 to 2014, for a total of 30 months. We observed one group per day from 6:00 a.m. to 1:00 p.m. Natural marks allowed us to identify each indri individually. The reproductive life of indris begins at 6–7 years of age (Pollock, 1977); thus, we labeled all the indris aged six or more as "adults," and all the animals aged between two and five as "non-adults." The reproductive individuals, as reported by the guides and the genetic analyses (Bonadonna, unpublished data), were indicated as "dominant"; all other members were labeled "non-dominant."

Recordings were made using Sennheiser ME 66 and ME 67 and AKG CK 98 microphones. The microphone output signal was recorded at a sampling rate of 44.1 kHz using a solid-state digital audio recorder (Marantz PMD671, SoundDevices 702, Olympus S100, or Tascam DR-100MKII 24 bit/96 kHz). All utterances were recorded at a distance of 2 to 10 m, since all the study groups were habituated, and all efforts were made to ensure that the microphone was oriented toward the vocalizing animal. All recordings were made without the use of playback stimuli, and nothing was done to modify the behavior of the indris. We recorded "advertisement" songs (Torti et al., 2013), consisting of duets and choruses, with a maximum of six individuals singing the same song. When in the field, we had one observer per individual indri in a group. We used focal animal sampling (Altmann, 1974), which allowed the attribution of each vocalization to a signaller.

We recorded a total of 496 songs. To investigate the coordination during singing, we measured the amount of overlap between two singers of the same group (hereafter, co-singing) and the time at which each unit started being emitted during a song. For the co-singing analysis, we used 223 songs of 45 individuals (15 dominant adult males, 15 dominant adult females, 15 non-adult indris [11 males, four females]). The timing was analyzed in 119 songs and 40 individuals (18 dominant adult males, 14 dominant adult females, three non-adult males, one non-adult female). For the analysis of the rhythmic pattern of the descending phrases (hereafter, DPs), we considered phrases consisting of two (hereafter, DP2), three (DP3), and four (DP4) units extracted from 475 songs and 57 individuals: 23 dominant adult males, 20 dominant adult females, seven non-adult males, three non-adult females. We investigated pitch variation of 1919 DP2s, 2182 DP3s, and 1046 DP4s extracted from 1060 individual song contributions. The sampling included phrases emitted by 25 dominant adult males, 21 dominant adult females, and 17 non-dominant non-adult indris (10 males and 7 females).

### Acoustic Analyses

We edited segments containing indri songs using Praat 5.3.46 (Boersma and Weenink, 2008), and we saved each song in a single audio file (in WAV format). Using our field notes and video recordings, we identified and selected the individual contribution of each singer, and we saved this information in a Praat textgrid. We then merged the textgrids of all the singers of a song to quantify the co-singing between individuals and the portions of non-overlapping singing (those in which only one singer was vocalizing). In the case of co-singing of three indris, we added that percentage to each dyad involved. We expressed the overall co-singing and non-overlapping as a percentage of the total song duration (**Figure 2**). The duration of co-singing and non-overlapping segments of each song, as well as the timing of the starting points of each song unit, were saved in Praat and exported to a Microsoft© Excel spreadsheet (Gamba and Giacoma, 2007; Gamba et al., 2012). We used the duration of overlapping contributions of each particular pair of individuals to quantify the amount of co-singing between adults and non-adults of both sexes and to calculate the ratio of co-singing within the contribution of an individual to the song. We then used the timing of the starting points of each song unit to understand whether the timing of a singer influenced another indri's song timing. Following Sasahara et al. (2015), we quantified the inter-onset interval (IOI) of two adjacent units and used it as a proxy for the rhythmic structure of a phrase.
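The two timing measures described above can be illustrated with a minimal sketch in plain Python. It is not the Praat procedure used in the study: the function names, the representation of each unit as an (onset, offset) pair, and the toy values are assumptions made for the example, and the units of a single singer are assumed not to overlap one another.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (onset, offset) of one unit, in seconds


def overlap_duration(a: List[Interval], b: List[Interval]) -> float:
    """Total time (s) during which a unit of singer A and a unit of singer B co-occur."""
    total = 0.0
    for a_on, a_off in a:
        for b_on, b_off in b:
            # Length of the intersection of the two intervals (0 if disjoint).
            total += max(0.0, min(a_off, b_off) - max(a_on, b_on))
    return total


def inter_onset_intervals(units: List[Interval]) -> List[float]:
    """IOI: time between the onsets of two adjacent units."""
    onsets = sorted(on for on, _ in units)
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


# Hypothetical toy data, not taken from the recordings.
male = [(0.0, 1.5), (2.0, 3.0), (4.0, 5.0)]
female = [(1.0, 2.5), (4.5, 6.0)]
song_duration = 6.0

co_singing = overlap_duration(male, female)            # seconds of co-singing
co_singing_pct = 100.0 * co_singing / song_duration    # share of the whole song
male_iois = inter_onset_intervals(male)                # IOIs of the male's units
```

Expressing the overlap as a percentage of total song duration mirrors the way co-singing is reported in the text; dividing by the duration of one singer's contribution instead would give the ratio of co-singing within that individual's contribution.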

We processed the DPs to extract the pitch of the focal animal in Praat, discarding the contribution of other singers and the background noise. We analyzed pitch variability by setting a frequency range from the minimum to the maximum of each unit in a DP and then calculating the frequency value at the upper limit of the second quartile of energy (Q50; **Figure 2**).
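As an illustration of the Q50 measure, the sketch below finds the frequency at which the cumulative energy of a spectrum reaches half of its total. It is a simplified stand-in for the Praat measurement, which may interpolate between bins; the function name `energy_quantile_frequency` and the toy spectrum are hypothetical.

```python
from typing import Sequence


def energy_quantile_frequency(
    freqs: Sequence[float], power: Sequence[float], q: float = 0.5
) -> float:
    """Lowest frequency bin at which cumulative spectral energy reaches fraction q.

    `freqs` must be sorted in ascending order and `power` must hold the
    non-negative energy of each bin within the analysed frequency range.
    """
    if len(freqs) != len(power) or len(freqs) == 0:
        raise ValueError("freqs and power must be non-empty and of equal length")
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum has no energy")
    cumulative = 0.0
    for f, p in zip(freqs, power):
        cumulative += p
        if cumulative >= q * total:
            return f
    return freqs[-1]


# Hypothetical spectrum of one unit (Hz, arbitrary energy units).
freqs = [500.0, 600.0, 700.0, 800.0, 900.0]
power = [1.0, 2.0, 4.0, 2.0, 1.0]
q50 = energy_quantile_frequency(freqs, power)  # upper limit of the second quartile
```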

### Statistical Analyses

We ran the General Linear Mixed Models (GLMMs) using the lme4 package (Bates et al., 2015) in R (R Core Team, 2015; version 3.2.0).

The model we used to investigate IOI variation included the duration of IOI as the response variable, IOI type (IOI1, IOI2, or IOI3), sex, age cohort (adult vs. non-adult), and DP type (DP2, DP3, and DP4) as fixed factors and group ID, song ID, site ID, and individual ID as random factors.

To analyse the co-singing, we used a model where the duration of the overlap between two singers was the response variable. The predictors were the duration of the individual contribution, song duration, the number of singers, sex of the focal animal, sex of the co-singer, the status of both the focal animal and the co-singer (identified as dominant or non-dominant in their natal groups). We used group ID, song ID, individual ID (for both the focal and the co-singer), and site ID as random factors. Since, we predicted that the degree of overlap during the song of one individual would be influenced by the sex and the status of its co-singer, we included in this model two interactions: one between the sex of the focal individual and the sex of the co-singer, and another between the status of the focal and the status of the co-singer.

For both models, we verified the assumptions that the residuals were normally distributed and homogeneous by looking at a qqplot and the distribution of the residuals plotted against the fitted values (a function provided by R. Mundry). We excluded the occurrence of collinearity among predictors by examining the variance inflation factors (vif package; Fox and Weisberg, 2011). To test the significance of the full model (Forstmeier and Schielzeth, 2011) we compared it against a null model comprising the random factors exclusively, by using a likelihood ratio test (Anova with argument test "Chisq"; Dobson, 2002). Then, we calculated the P values for the individual predictors based on likelihood ratio tests between the full and the respective null model by using the R-function "drop1" (Barr et al., 2013). We used a multiple contrast package (multcomp in R) to perform all pairwise comparisons for the levels of each factor with the Tukey test (Bretz et al., 2010). We adjusted all the p-values (padj) using the Bonferroni correction. We reported estimate, standard error (S.E.), z- and p-values for the Tukey tests.
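The Bonferroni adjustment mentioned above multiplies each raw p-value by the number of comparisons and caps the result at 1. A minimal sketch follows; the function name and the raw p-values are hypothetical and are not results from the study.

```python
from typing import List


def bonferroni(p_values: List[float]) -> List[float]:
    """Bonferroni-adjusted p-values: p * m, capped at 1, for m comparisons."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical raw p-values from three pairwise comparisons.
raw = [0.001, 0.02, 0.4]
adjusted = bonferroni(raw)
```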

The predictive power of the song unit timing in one individual over another was evaluated using the Granger causality test (Granger, 1969). We computed the bivariate Granger causality test in two directions for each dyad of indris singing in a chorus (Brandt et al., 2008; Wessa, 2013), tracking whether they were males, females, or non-adults. We used a lag-4 analysis (MSBVAR package v.0.9-1 in R) and considered significant those analyses showing p-values below 0.05 (**Figure 3**). We then calculated the percentage of significant p-values out of the total number of songs, overall and for each particular dyad. We then averaged the results per type of dyad. We did not test dyads of two non-adults because of the small sample size.
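The logic of a bivariate Granger test can be sketched as a comparison of two nested least-squares regressions: one predicting a series from its own lags only, and one that also includes the lags of the other series. The code below is an illustration in plain Python, not the MSBVAR routine used in the study. The function names, the simulated series, and the choice to return only the F statistic (without a p-value) are assumptions made for the example.

```python
import random
from typing import List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def residual_ss(rows: List[List[float]], y: List[float]) -> float:
    """Residual sum of squares of an ordinary least-squares fit (normal equations)."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, y))


def granger_f(cause: List[float], effect: List[float], lag: int = 4) -> float:
    """F statistic for the hypothesis that `cause` helps to predict `effect`."""
    restricted, unrestricted, target = [], [], []
    for t in range(lag, len(effect)):
        own = [effect[t - k] for k in range(1, lag + 1)]
        other = [cause[t - k] for k in range(1, lag + 1)]
        restricted.append([1.0] + own)            # intercept + own lags
        unrestricted.append([1.0] + own + other)  # ... plus lags of the other series
        target.append(effect[t])
    rss_r = residual_ss(restricted, target)
    rss_u = residual_ss(unrestricted, target)
    df_den = len(target) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / df_den)


# Simulated series (not song data): y follows x with a delay of one step.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(80)]
y = [0.0] + [0.9 * x[t - 1] + 0.3 * rng.gauss(0.0, 1.0) for t in range(1, 80)]

f_x_to_y = granger_f(x, y)  # large: past x values predict y
f_y_to_x = granger_f(y, x)  # small: past y values do not predict x
```

Running the test in both directions for each dyad, as in the simulated pair above, is what allows the influence of one singer on another to be distinguished from the reverse influence. A p-value would be obtained by referring the statistic to an F distribution with `lag` and `df_den` degrees of freedom.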

To analyse the sex dimorphism in pitch, we used four GLMMs where the frequency at the upper limit of the second quartile of energy in the spectrum (Q50) was the response variable. We ran a model for each unit in a DP. The predictors were sex, status (dominant or non-dominant), age cohort (adult vs. non-adult), and DP type (DP2, DP3, and DP4) as fixed factors and group ID, song ID, site ID, and individual ID as random factors. We verified the assumptions and the significance of the models as explained for the models above.

FIGURE 2 | Schematic representation of a spectrogram (A) describing acoustic parameter collection on the isolated pitch of a song. Letters A and B mark different singers, letters SP mark the starting points of a unit (1, 2, 3…) in the song. The color bars indicate the starting and final points of the units given by two different indris (e.g., blue for a male; red for a female). Duration of the units is reflected in the schematized Praat textgrid as an interval of the same color, where solid colors indicate non-overlapping parts and striped patterns indicate co-sung portions. Duration of the IOIs of a descending phrase is marked by solid green bars. In the spectrum (B) of the third unit (in a descending phrase of four units), the green dotted line marks the frequency corresponding to the upper limit of the second quartile of energy in the spectrum (Q50). The sound spectrum displays sound pressure level (SPL) on the horizontal axis and frequency on the vertical axis.

We presented the average variation of IOIs and the average variation of pitch between different units by calculating average individual means, first at the song level, then at the individual level, and finally by sex.
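The stepwise averaging described above (song, then individual, then sex) prevents individuals or songs with many measurements from dominating the reported means. A minimal sketch follows, assuming hypothetical records of the form (sex, individual, song, value); the identifiers and values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

Record = Tuple[str, str, str, float]  # (sex, individual ID, song ID, IOI in s)


def mean_by_sex(records: List[Record]) -> Dict[str, float]:
    """Average within song, then within individual, then within sex."""
    per_song = defaultdict(list)
    for sex, individual, song, value in records:
        per_song[(sex, individual, song)].append(value)

    per_individual = defaultdict(list)
    for (sex, individual, _song), values in per_song.items():
        per_individual[(sex, individual)].append(mean(values))

    per_sex = defaultdict(list)
    for (sex, _individual), song_means in per_individual.items():
        per_sex[sex].append(mean(song_means))

    return {sex: mean(individual_means) for sex, individual_means in per_sex.items()}


# Hypothetical measurements, not data from the study.
records = [
    ("M", "m1", "s1", 1.0), ("M", "m1", "s1", 2.0), ("M", "m1", "s1", 3.0),
    ("M", "m1", "s2", 6.0),
    ("M", "m2", "s3", 2.0),
    ("F", "f1", "s1", 1.0), ("F", "f1", "s2", 2.0),
]
averages = mean_by_sex(records)
pooled_male = mean(v for sex, _, _, v in records if sex == "M")
```

In this toy example the hierarchical male mean is 3.0, whereas pooling all male measurements gives 2.8, because song s1 contributes three of the five values.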

### RESULTS

### Overlapping between Singers

We found a considerable amount of co-singing in all the songs (average individual mean 28.10% ± 7.64, N individuals = 45). The average total duration of the song was 113.188 ± 39.682 s, while the duration of an individual's phonation during the song was 30.132 ± 10.301 s (29.73% ± 11.24). The average total co-singing during the song was 8.019 ± 3.587 s. The full model significantly differed from the null model (χ<sup>2</sup> = 144.080, df = 9, P < 0.001). Since the interaction between the sexes of the singing pair was not significant, we ran a reduced model excluding this interaction. The results of the reduced model are in **Table 1**. The duration of co-singing increased significantly with the duration of the individual contribution, but not with song duration itself. The number of singers in a song significantly decreased the amount of co-singing between two singers in a song. Moreover, co-singers' status significantly affected the response variable, with increased co-singing when two dominant individuals sang together. We also found that the two dominant individuals in a group co-sing significantly longer than a dominant and a non-dominant indri (Tukey test, estimate = −1.1546; S.E. = 0.2070; z = −5.578; padj < 0.001) and than non-dominants singing together (Tukey test, estimate = −1.3719; S.E. = 0.3598; z = −3.813; padj < 0.001). The model did not detect any effect of the sex of the co-singers.

### Gender and Status Influence on the Singing Pattern

We asked whether the singing of a particular indri influenced the contribution of another animal to the song. Applying the Granger causality test to the onset timing of the individual contributions, we found an effect of the adult male singing on the pattern of the adult female in 68% (N = 94) of the songs (902.01 < Fstat < 1071.97; 0.001 < padj < 0.039). The timing of the adult female was useful to forecast when the adult male was singing in 73% (N = 91) of the songs (9.53 < Fstat < 10.44; 0.001 < padj < 0.043). The non-adults in a group influenced adult male and adult female singing in 94% (N = 47; 78.20 < Fstat < 10.08; 0.001 < padj < 0.036) and 75% (N = 63; 9315.05 < Fstat < 105.22; 0.001 < padj < 0.042) of the songs, respectively. We found an effect on non-adults in 81% (N = 57) of the songs for the contribution of the adult female (90.86 < Fstat < 10.00; 0.001 < padj < 0.046) and 78% (N = 46) of the songs for the adult male (9.97 < Fstat < 10.54; 0.001 < padj < 0.030). We also analyzed the data by considering each pair and dyad. We found that the effect of non-adults on the adult males was 89.10% ± 27.93 (N = 13), whereas all other combinations ranged between 72.31% ± 25.34 and 76.60% ± 34.42 (**Figure 4**).

### Rhythmic Differences between Sexes and Age Classes

We then investigated to what extent sex and age affected indris' singing rhythm. The full model significantly differed from the null model (χ<sup>2</sup> = 2966.748, df = 6, P < 0.001; **Table 2**). We found that the IOI type significantly affected its duration; in particular, both IOI2 and IOI3 were significantly longer than IOI1 (**Table 2**). IOI2 was also significantly shorter than IOI3 (Tukey test, estimate = 0.063; S.E. = 0.014; z = 4.458; p < 0.001). The IOI duration significantly decreased as the number of units in the DP increased (**Figure 5**; **Table 2**). In particular, IOIs of DP2s are longer than those in the other DP types, and IOIs in the DP3s are also significantly longer than those in DP4s (Tukey test, estimate = −0.179; S.E. = 0.010; z = −17.74; padj < 0.001). We also found a significant effect of sex, with males showing longer IOIs than females (**Table 2**). We found no effect of age cohort (**Table 2**).

TABLE 1 | Influences of the fixed factors on co-singing duration (s); results of the reduced model, including only the significant interaction (full vs. null: chisq = 144.080, df = 9, P < 0.001).

<sup>a</sup>Not shown as not having a meaningful interpretation.

<sup>b</sup>Estimate ± SE refer to the difference of the response between the reported level of this categorical predictor and the reference category of the same predictor.

<sup>c</sup>These predictors were dummy coded, with the "Focal sex (Female)," "Cosinger sex (Female)," "Focal class (Dominant)," and "Cosinger class (Dominant)" being the reference categories.

<sup>d</sup>Not shown, as the interaction between these predictors is significant.

### Pitch Variation Patterns

The pitch pattern of the units in a DP showed remarkable inter- and intra-individual frequency variation (**Figure 6**). We found that the frequency value corresponding to the second quartile of energy (Q50) was significantly higher in males than in females (**Table 3**). The Q50 of unit 1 was significantly higher than those of units 2, 3, and 4 (**Table 3**), which showed descending Q50 values along the DP (−259.485 < Estimate < −107.059; 3.819 < S.E. < 9.899; −39.92 < z < −10.81; all Ps < 0.001). The Q50 also differed significantly between DP types. DP4 and DP3 showed higher values than DP2 (**Table 3**), and DP4 also showed greater values than DP3 (Tukey test, estimate = 68.322; S.E. = 5.264; z = 12.98; p < 0.001).

FIGURE 4 | Bar plot of the average percentage of synchronized songs in the indris. Capped lines represent negative Standard Deviation. Each bar indicates the direction of the Granger causality test for each type of dyad (AF, adult females; AM, adult males; NA, non-adults).

TABLE 2 | Influences of the fixed factors on IOI duration (s); results of the full model (full vs. null: chisq = 2966.748, df = 6, P < 0.001).

<sup>a</sup>Not shown as not having a meaningful interpretation.

<sup>b</sup>Estimate ± SE refer to the difference of the response between the reported level of this categorical predictor and the reference category of the same predictor.

<sup>c</sup>These predictors were dummy coded, with the "IOI Type (1)," "Sex (Female)," "Age Cohort (Adult)," and "DP TYPE (DP2)" being the reference categories.

### DISCUSSION

### Coordination and Overlapping during Singing

Despite a majority of non-overlapping singing, an important part of the individual song was co-sung with another member of the social group, with a positive effect of the duration of the singer's contribution rather than of overall song duration. We found support for our prediction that indris, not being sexually dimorphic, would overlap during singing, in agreement with what was postulated by Deputte (1982) for the Hylobatidae. This finding appears to confirm what previous studies have shown for gibbons. The sex-specific individual song contributions may indeed serve different functions and may therefore be under different selective pressures (Cowlishaw, 1992; Geissmann, 2002). At the same time, the overlap has an adaptive value because it may have a role in signaling group cohesion and resource holding potential to conspecifics (Torti et al., 2013).

TABLE 3 | Influences of the fixed factors on Q50 frequency (Hz); results of the full model (full vs. null: chisq = 4330.685, df = 7, P < 0.001).

<sup>a</sup>Not shown as not having a meaningful interpretation.

<sup>b</sup>Estimate ± SE refer to the difference of the response between the reported level of this categorical predictor and the reference category of the same predictor.

<sup>c</sup>These predictors were dummy coded, with the "Unit (1)," "Sex (Female)," "Age Cohort (Adult)," and "DP TYPE (DP2)" being the reference categories.

Describing the temporal properties of the indris' singing, we found that the singer's and co-singer's sex did not affect co-singing duration, showing that indris of the two sexes not only participate equally in the song (Giacoma et al., 2010) but also co-sing similarly with conspecifics of the opposite sex. We found instead that being dominant or non-dominant affected co-singing rates during a group song, in agreement with what Cowlishaw (1992) suggested about the duration of solo bouts in gibbons (see also Mitani, 1984, 1987; Dallmann and Geissmann, 2009). In the indris, solo songs are exceedingly rare. Giacoma and colleagues (unpublished data) recorded three songs emitted by a single young adult indri male during a sampling time in which over 600 duets and group choruses were recorded. Thus, we can suppose that the indris' chorusing may itself play a role in the competition among paired and unmated males, and that conspecifics may assess males' (and females') characteristics from their collective singing (Torti et al., 2013).

FIGURE 6 | Density plots obtained in R (MASS package) for male descending phrases DP2s (A), female DP2s (B), male DP3s (D), female DP3s (E), male DP4s (G), and female DP4s (H). The bar plots show the average frequency corresponding to the upper limit of the second quartile of energy in the spectrum (Q50) for males and females, for DP2s (C), DP3s (F), and DP4s (I). Units within a DP are indicated by different colors. Capped lines represent ± Standard Deviation.

The fact that indris avoided overlapping between dominants and non-dominants (which are often sub-adults in our sampling) and overlapped more frequently between adult males and females marks a difference from what is known for gibbon songs. Adult male and female gibbons tend to alternate their calls, whereas immature individuals frequently overlap (Merker and Cox, 1999; Koda et al., 2013), but a different scenario emerged from our findings. The fact that adults singing together showed a significantly longer overlap falsified our second hypothesis that co-singing rates in this species are higher between non-dominants and dominants. Our results suggest that overlap between adults can indeed serve inter-group communication as suggested by previous studies (Merker, 2000). Co-singing may correspond to louder signals, and overlapping of the paired mates may serve to maintain a territory. Non-overlapping singing may provide the advantage of advertising the resource holding potential of the group, but overlapping another conspecific may represent a cost for an individual singer, which cannot broadcast its individuality. It makes sense that non-dominant individuals tend to co-sing less than paired, dominant indris. Non-dominant indris may attempt to maximize their solitary singing during the chorus, to advertise their fighting ability to conspecifics of other groups and their individuality to potential mates (Cowlishaw, 1992).

Studying chorusing dynamics, we found that differences in co-singing reflect differences in coordinating the emissions of units during the song. We demonstrated the existence of a coordination of the calls in both dominant and non-dominant individuals, with a consistent influence between the singing of different indris during the song. We found that the coordination between singers was mutual between sexes and age cohorts, but the non-dominants appeared to have an especially strong effect on dominant adult males. Indris within a group coordinated on average more than 70% of their songs to form duets, suggesting that duetting is indeed associated with pair cohesion and the strength of the pair bonds (Geissmann and Orgeldinger, 2000). In indris, as in bird species, duetting may have a crucial role in territory defense but may also have evolved for multiple functions (Dahlin and Benedict, 2013), including the localization of conspecifics (Torti et al., 2013; Bonadonna et al., 2014) and providing information about the quality of their pair bond (Merker, 2000; Hall and Peters, 2008; Hall, 2009; Dowling and Webster, 2015).

Unlike what Geissmann (2000) hypothesized for gibbons, the indris' song may also facilitate finding a mate either for an extra-pair copulation (Bonadonna et al., 2014) or to form a new pair (Torti et al., 2013). Thus, the interplay between singers can be particularly meaningful for the non-adults, which may attempt to broadcast their individuality and may affect the dominant male singing pattern. We cannot exclude that dominant male singing may contribute to the development of singing in non-dominant indris, as has been found in gibbons (Koda et al., 2013).

Acoustic analyses of indris' vocal behavior during the song may also indicate an ability for precise timing in a particular social display such as the song. A parallel with humans may be found in the study of Bowling et al. (2013) showing that speech timing is more precise when a speaker is together with a partner than when the same speaker is alone. Further studies are needed, but the investigation of songs given in different behavioral contexts showed that animals tended to take turns more precisely when in visual contact than when they were not (Torti et al., 2013). Moreover, dominant adults may indeed have a synchronization capacity that is developing in younger non-dominants.

### Rhythmic Differences in the Indris

We identified a system of distinct units produced in sequences in agreement with previous studies (Thalmann et al., 1993; Giacoma et al., 2010; Baker-Médard et al., 2013; Torti et al., 2013; Gamba et al., 2014). We analyzed short phrases consisting of two, three or four units and we found that the rhythmic structure differed within and between descending phrases. Namely, the interval between onsets decreased significantly during a DP, but also differed between DP types.

These differences in the rhythmic structure of descending phrases suggest that indris may be capable of regulating timing, unit duration, and interval duration. This ability appears similar to that shown by chimpanzees producing a "pant hoot chorus." In agreement with the findings of Fedurek and colleagues on chimpanzees, the indris appear to adjust the timing of their emissions (Fedurek et al., 2013a) during the song and to do that to interact vocally with another member of their social group (Mitani and Gros-Louis, 1998). The ability to adjust the emissions within a song has emerged when investigating contextual variation in the acoustic structure of the song (Torti et al., 2013). It may indeed play a role in social interactions within and between groups, as has been suggested for chimpanzees' joint hooting (Fedurek et al., 2013b) or agile gibbons' singing (Koda et al., 2013).

We also demonstrated that there is a remarkable difference between males and females, with females showing shorter IOIs in all DP types. This sex dimorphism in rhythm is surprising when seen in the light of the indris' social monogamy and external morphology, which would both predict little dimorphism in the size of the vocal apparatus (Dixson, 2013). Current data on indris' vocal tract morphology are poor, but we found reference to the fact that both males and females possess a dorsal air sac (Grandidier, 1875; Petter et al., 1977). The presence of larger vocal sacs in male indris could explain the longer IOIs observed in all descending phrases. Studies on apes showed that there is usually a pronounced sex dimorphism in the size of the vocal sac in the polygynous species (G. gorilla, Pongo pygmaeus), which also produce sex-specific calls (Harcourt et al., 1993; Delgado and Van Schaik, 2000). This dimorphism is apparently less marked in chimpanzees (P. troglodytes schweinfurthii), whose group cohesion pant hoots are given by both sexes (Mitani and Nishida, 1993). Recent studies on howler monkeys (Alouatta sp.) confirmed a role of vocal competition and suggested that vocal tract traits have been sexually selected in those forest-living, arboreal species (Dunn et al., 2015). Vocal competition can also occur in indris, where sexual monogamy may occur together with the presence of multiple males and females within a group and can involve extra-pair copulation (Torti et al., 2013; Bonadonna et al., 2014), and where inter-sexual selection may have played a role (Singleton et al., 2009).

We found support for our prediction that IOIs differed between males and females. The results instead falsify the hypothesis that rhythm changes during the indris' development because we failed to find clear changes in rhythm between indris of different age cohorts. These results are in disagreement with previous findings on birds (Saar and Mitra, 2008; Sasahara et al., 2015), although the analysis of the entire song instead of single phrases could lead to different results. However, we are convinced that our findings clearly show that non-adult rhythms did not substantially differ from adult rhythms. These findings also provide insight into the development of the indris' song, showing that when the animals start singing, the cognitive processes and the vocal apparatus that produce song are fully developed. Thus, the dynamics of the song, at least at the phrase level, then have limited plasticity.

### Pitch Variation

Our results showed that units emitted sequentially in the DPs differ consistently in frequency, in agreement with the qualitative observations of Thalmann et al. (1993) and Giacoma et al. (2010). The units given during the DPs have a descending frequency on average with remarkable individual variation. We demonstrated that pitch differs between sexes, despite a similar trend in frequency change.

We expected variation within individuals apparently to override sex differences, but the results falsified the prediction that indris lacked marked sexual differences in the pitch of song units. Our findings are in contrast with the general frame reported by Ey et al. (2007) and show that indris present sexual vocal dimorphism. The presence of differences in frequency variation is shown in our study across comparable series of units, and not limited to different unit types, as previously found by Sorrentino et al. (2013).

Indris are sexually monomorphic (Pollock, 1977), and group encounters are rare (Torti et al., 2013). Thus, sex recognition relying on vocal signals is potentially useful and may be indeed encoded both in the rhythmic structure and the frequency of the DP units. The use of song phrases to broadcast sex may be essential during pair formation (Torti et al., 2013) at distances where other communicative signals may be ineffective (Fletcher, 2009).

We support the conclusions of Torti and colleagues suggesting that the song, or part of the song, may be important in sex recognition and for finding mates, but the question of whether indris recognize the sex of an individual by listening to its song is still unanswered. As suggested by previous theoretical works, singing in indris is probably the result of several selective pressures that acted differently on the two sexes. Whether indris have voluntary control over their timing is still unclear and can be further investigated. However, as Gamba (unpublished data) observed in captive siamangs (Symphalangus syndactylus), the emission of harsh sounds ("barks" in siamangs, "roars" in indris) may serve to synchronize the successive emissions of group members (Giacoma et al., 2010; Torti et al., 2013). Then, the song reaches its most consistent portion, the emission of the descending phrases (Torti et al., 2013), which indeed represent an interesting case of timing and pitch variation, a crucial feature of birdsong and human speech (Levinson and Holler, 2014).

The musical ability of animals has been connected to species-specific perceptual templates, which may in some species change according to brain plasticity. However, the extensive evidence of the processes involved in learning concerns birds and humans (Maguire et al., 2000; Kilgard et al., 2001; Anderson et al., 2002) and there is no equivalent evidence for primates. Our knowledge of primates, and especially of "singing primates," is limited to behavioral observations and a few experiments. Studies on humans and other mammals demonstrated that learning corresponds to plastic changes in the auditory cortex (Metherlate and Weinberger, 1990; Norton et al., 2005), but it is still unclear whether this can also be the case for non-human primates and can indeed involve the processes underlying vocal production learning.

The indris are good candidates for further investigations of the evolution of typical speech features because the turn-taking between individuals, the constant exchange of short vocal units, and the variable degree of overlap are shared traits of modern human communication.

## AUTHOR CONTRIBUTIONS

MG, VT, GB, and CG designed research; MG, VT, VE, RR, DV, GB, and CG performed research; MG, VT, DV, OF, PR, and VE analyzed data; MG, VT, VE, and CG wrote the paper.

### ACKNOWLEDGMENTS

This research was supported by Università degli Studi di Torino and the African, Caribbean, and Pacific (ACP) Science and Technology Programme of the ACP Group of States, with the financial assistance of the European Union, through the Projects BIRD (Biodiversity Integration and Rural Development; No. FED/2009/217077) and SCORE (Supporting Cooperation for Research and Education; Contract No. ACP RPR 118 # 36) and by grants from the Parco Natura Viva—Centro Tutela Specie Minacciate. We thank Roger Mundry and Colleen Stephens for advice about the GLMMs. We are grateful to GERP (Groupe d'Etudes et des Recherche sur les Primates) and Dr. Jonah Ratsimbazafy. We thank Dr. Cesare Avesani Zaborra and Dr. Caterina Spiezio for helping us with the organization of the field station in Maromizaha. We are grateful to the researchers and the international guides, to Lanto and Mamatin, for their help and logistical support. The contents of this document are the sole responsibility of the authors and can under no circumstances be regarded as reflecting the position of the European Union. We are grateful to two anonymous Reviewers and to the Editor Andrea Ravignani for their comments on a previous version of the manuscript. We have received permits for this research, each year, from "Direction des Eaux et Forêts" and "Madagascar National Parks" (formerly ANGAP) [2004 (N° 190/MINENV.EF/SG/DGEF/DPB/SCBLF/RECH), 2005 (N° 197/MINENV.EF/SG/DGEF/DPB/SCBLF/RECH), 2006 (N° 172/06/MINENV.EF/SG/DGEF/DPB/SCBLF), 2007 (N° 0220/07/MINENV.EF/SG/DGEF/DPSAP/SSE), 2008 (N° 258/08/MEFT/SG/DGEF/DSAP/SSE), 2009 (N° 243/09/MEF/SG/DGF/DCB.SAP/SLRSE), 2010 (N° 118/10/MEF/SG/DGF/DCB.SAP/SCBSE; N° 293/10/MEF/SG/DGF/DCB.SAP/SCB), 2011 (N° 274/11/MEF/SG/DGF/DCB.SAP/SCB), 2012 (N° 245/12/MEF/SG/DGF/DCB.SAP/SCB), 2014 (N° 066/14/MEF/SG/DGF/DCB.SAP/SCB)].

### REFERENCES

Cowlishaw, G. (1996). Sexual selection and information content in gibbon song bouts. Ethology 102, 272–284. doi: 10.1111/j.1439-0310.1996.tb01125.x


Prosimian Biology, eds J. Masters, M. Gamba, and F. Génin (New York, NY: Springer Science + Business Media), 315–322.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Gamba, Torti, Estienne, Randrianarison, Valente, Rovara, Bonadonna, Friard and Giacoma. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Beat Keeping in a Sea Lion As Coupled Oscillation: Implications for Comparative Understanding of Human Rhythm

#### Andrew A. Rouse<sup>1</sup>\*, Peter F. Cook<sup>2</sup>, Edward W. Large<sup>3</sup> and Colleen Reichmuth<sup>1</sup>

*<sup>1</sup> Long Marine Laboratory, Institute of Marine Sciences, University of California Santa Cruz, Santa Cruz, CA, USA, <sup>2</sup> Department of Psychology, Emory University, Atlanta, GA, USA, <sup>3</sup> Department of Psychological Sciences, University of Connecticut, Storrs, CT, USA*

Human capacity for entraining movement to external rhythms—i.e., beat keeping—is ubiquitous, but its evolutionary history and neural underpinnings remain a mystery. Recent findings of entrainment to simple and complex rhythms in non-human animals pave the way for a novel comparative approach to assess the origins and mechanisms of rhythmic behavior. The most reliable non-human beat keeper to date is a California sea lion, Ronan, who was trained to match head movements to isochronous repeating stimuli and showed spontaneous generalization of this ability to novel tempos and to the complex rhythms of music. Does Ronan's performance rely on the same neural mechanisms as human rhythmic behavior? In the current study, we presented Ronan with simple rhythmic stimuli at novel tempos. On some trials, we introduced "perturbations," altering either tempo or phase in the middle of a presentation. Ronan quickly adjusted her behavior following all perturbations, recovering her consistent phase and tempo relationships to the stimulus within a few beats. Ronan's performance was consistent with predictions of mathematical models describing coupled oscillation: a model relying solely on phase coupling strongly matched her behavior, and the model was further improved with the addition of period coupling. These findings are the clearest evidence yet for parity in human and non-human beat keeping and support the view that the human ability to perceive and move in time to rhythm may be rooted in broadly conserved neural mechanisms.

Keywords: sensorimotor synchronization, rhythmic entrainment, neural oscillators, sea lions, music cognition and perception, non-human models

#### Edited by:

*Andrea Ravignani, Vrije Universiteit Brussel, Belgium*

#### Reviewed by:

*Shinya Fujii, The University of Tokyo, Japan Angela S. Stoeger, University of Vienna, Austria*

> \*Correspondence: *Andrew A. Rouse arouse@ucsc.edu*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

> Received: *05 March 2016* Accepted: *23 May 2016* Published: *03 June 2016*

#### Citation:

*Rouse AA, Cook PF, Large EW and Reichmuth C (2016) Beat Keeping in a Sea Lion As Coupled Oscillation: Implications for Comparative Understanding of Human Rhythm. Front. Neurosci. 10:257. doi: 10.3389/fnins.2016.00257*

### INTRODUCTION

Auditory-motoric entrainment, the coordination of motor movement with simple and complex rhythmic sounds, has a strong presence in human culture and is found across all human societies (Clayton et al., 2005). This phenomenon of "beat keeping" was believed to be unique to humans (Wallin et al., 2000; Bispham, 2006; Zatorre et al., 2007), but new findings in non-human animals have decisively put that idea to rest. Evidence for some faculty to flexibly entrain movement to simple metronome-like stimuli has been found in bonobos (Large and Gray, 2015), chimpanzees (Hattori et al., 2013), and budgerigars (Hasegawa et al., 2011). The ability to entrain to more complex musical stimuli has been shown in cockatoos (Patel et al., 2009), parrots (Schachner et al., 2009), and most reliably, a California sea lion (Cook et al., 2013). Preliminary evidence suggesting beat-keeping behavior has also been identified for elephants (Schachner et al., 2009) and horses (Bregman et al., 2013). While hypotheses have been advanced suggesting that beat keeping is dependent on specialized and relatively rare neural adaptations (Patel et al., 2009; Merchant and Honing, 2013), or exposure to auditory rhythm during critical developmental periods (Schachner, 2012), it is increasingly difficult to identify candidate traits exclusive to the phylogenetically distant species now shown capable of rhythmic entrainment. This suggests that rather than being a derived ability, this faculty is instead broadly conserved, supported by mechanisms of domain-general sensorimotor synchronization found across the animal kingdom (see Wilson and Cook, 2016).

Performance dynamics and variability in human rhythmic behavior have been extensively and carefully studied (see Repp, 2005; Large, 2008; Repp and Su, 2013 for reviews). Although beat-keeping behavior has now been ascribed to a number of non-human species, the mechanisms have not yet been explored outside of humans. If human rhythm is broadly conserved, beat keeping in other animals should be consistent with the principles governing the behavior in humans. One parsimonious and well-established theory of beat keeping is that of neural resonance. This theory proposes that the perception of pulse in simple and complex rhythms, and associated behavioral synchronization to those rhythms, arise from intrinsic properties of neural oscillation (Large and Snyder, 2009; Large et al., 2015). Unlike information-processing theories, in which beat perception and synchronization are separate computational processes that require specialized neural circuitry (Vorberg and Wing, 1996; Repp and Keller, 2004; Patel, 2006; Patel and Iversen, 2014), the theory of neural resonance states that both phenomena are byproducts of the physical principles of coupled oscillation (Large, 2008), and does not presuppose any specialized and potentially restricted neural adaptations beyond auditory-motor coupling to explain auditory motor entrainment.

Neural resonance theory is supported by the well-established finding of neural oscillation: interaction between excitatory and inhibitory neuronal populations gives rise to population rhythms throughout the brain, including across sensory and motor networks (Brunel, 2003; Börgers and Kopell, 2003; Buzsáki and Draughn, 2004; Stefanescu and Jirsa, 2008). In brief, when acoustic stimuli are presented in a periodic pattern, auditory oscillations spontaneously entrain to the structure of the stimulus stream (Will and Berg, 2007; Nozaradan et al., 2011, 2012). Presumably, these auditory oscillations then induce synchronized neural oscillations in coupled motor systems, leading to rhythmic behavior with a strong phase and tempo relationship to the auditory stimulus (e.g., Loehr et al., 2011). Models of neural resonance are neurologically plausible and fully compatible with widely accepted models of functional connectivity in the brain (see Biswal et al., 1995). The brain can be described as a complicated set of overlapping networks linking neural populations into functional units (Bullmore and Sporns, 2009), and connectivity between these units can be described in terms of synchrony of firing rates between neural populations (Biswal et al., 1995; Greicius et al., 2003). Perception and cognition are then posited to emerge out of the action and interaction of these networks (Sporns et al., 2004; Bressler and Menon, 2010). Importantly, although neural resonance does not require specialized neural mechanisms beyond linked auditory and motor networks, beat-keeping behavior is not necessarily obligate and automatic. Learning clearly changes the properties of coupling between auditory and motor networks, and attention and intention play important roles in producing or inhibiting beat-keeping behavior (Large and Jones, 1999; Repp and Keller, 2004).

An advantage of this sort of theoretical analysis is the ability to link complex oscillation of high dimensional neuronal populations with simpler lower dimensional population- and behavior-level models that capture much of the behavioral richness observed in high dimensional systems (Wilson and Cowan, 1973; Stefanescu and Jirsa, 2008) and that are amenable to theoretical and computational analysis (Aronson et al., 1990; Hoppensteadt and Izhikevich, 1996).

Neural resonance models have been used extensively to accurately describe rhythmic entrainment in humans, for both simple and complex stimuli. A common experimental approach uses behavioral paradigms that involve perturbations in both phase and tempo (Michon, 1967; Large and Palmer, 2002; Large et al., 2002; Repp and Keller, 2004; Loehr et al., 2011). Animal synchronization studies have historically attempted to demonstrate synchronization using statistical methods designed to show a nonrandom phase relationship between stimulus and movement (e.g., Patel et al., 2009, see Fisher, 1993; Pikovsky et al., 2001). However, to probe the underlying mechanisms, a different approach is required. One avenue is to perturb the stimulus and observe relaxation back to steady state behavior (a stable phase relationship).
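The statistical approach mentioned above, testing for a non-random phase relationship between stimulus and movement, is usually carried out with circular statistics. The following is a minimal Python sketch of that idea and not the analysis used in any of the cited studies. It computes the mean resultant length and a Rayleigh-type p-value using the large-sample approximation p ≈ exp(−nR²); the function names and the movement data are hypothetical and chosen only for illustration.

```python
import math

def relative_phases(movement_times, beat_times):
    """Phase of each movement relative to its nearest beat, in radians.

    Assumes an isochronous stimulus, so the inter-onset interval (IOI)
    is taken from the first two beats.
    """
    ioi = beat_times[1] - beat_times[0]
    phases = []
    for t in movement_times:
        nearest_beat = min(beat_times, key=lambda b: abs(t - b))
        phases.append(2.0 * math.pi * (t - nearest_beat) / ioi)
    return phases

def rayleigh_test(phases):
    """Return (mean resultant length R, mean phase, approximate p-value).

    R near 1 means tightly clustered phases; R near 0 means no preferred
    phase. The p-value uses the large-sample approximation exp(-n * R^2).
    """
    n = len(phases)
    c = sum(math.cos(p) for p in phases) / n
    s = sum(math.sin(p) for p in phases) / n
    r = math.hypot(c, s)
    mean_phase = math.atan2(s, c)
    p_value = math.exp(-n * r * r)
    return r, mean_phase, p_value

# Hypothetical data: 20 beats at 85 bpm, movements lagging by 10-40 ms.
ioi_s = 60.0 / 85.0
beats = [i * ioi_s for i in range(20)]
lags = [0.010, 0.025, 0.040, 0.020, 0.030]
moves = [b + lags[i % len(lags)] for i, b in enumerate(beats)]

r, mean_phase, p = rayleigh_test(relative_phases(moves, beats))
print(f"R = {r:.3f}, mean phase = {mean_phase:.3f} rad, p ~ {p:.2e}")
```

A test of this kind shows only that the phases cluster; it says nothing about how the animal returns to that phase after a disturbance, which is why the perturbation approach is needed.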

Behavioral responses to stimulus perturbations can be modeled using the same discrete-time model of coupled oscillation that has been applied to both perception of rhythmic auditory sequences (Large and Jones, 1999) and perception-action coordination with rhythmic auditory sequences (deGuzman and Kelso, 1991; Loehr et al., 2011). Conceptually, a neural/behavioral oscillation is coupled to a rhythmic auditory stimulus that consists of brief acoustic events. The model assumes that the behavioral oscillation is temporally continuous, and the stimulus sequence is temporally discrete, illustrated in Equation (1).

$$
\phi_{n+1} = \phi_n + \Omega - \alpha \sin \phi_n \tag{1}
$$

Here, $\phi_n$ is the phase of the behavioral oscillation at which acoustic event $n$ occurs. The model predicts the phase of the behavioral oscillation $\phi_{n+1}$ at the next acoustic event, given the relative frequency of the behavioral oscillation and stimulus, $\Omega = 2\pi f_{\mathrm{osc}}/f_{\mathrm{stim}}$, and the coupling between the two, $-\alpha \sin \phi_n$. Generic models such as Equation (1) are particularly powerful because they make strong predictions regarding both steady state (synchronization) and transient (relaxation) behavior and are easily implemented and analyzed. However, they have not yet been applied to examine rhythmic behavior in non-human animals.
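To make the steady-state and relaxation predictions of Equation (1) concrete, the sketch below iterates the phase map in Python. It is an illustration under stated assumptions rather than the fitting procedure used in this study: the coupling strength `ALPHA`, the starting phases, and the 3% frequency mismatch are hypothetical values, and phases are wrapped to [−π, π] for readability. It covers phase coupling only and does not include period coupling.

```python
import math

def wrap(phase):
    """Wrap a phase to the interval [-pi, pi]."""
    return math.atan2(math.sin(phase), math.cos(phase))

def next_phase(phase, omega, alpha):
    """One iteration of Equation (1), wrapped to [-pi, pi]."""
    return wrap(phase + omega - alpha * math.sin(phase))

def simulate(phase0, omega, alpha, n_events):
    """Iterate the map over n_events stimulus events; return all phases."""
    phases = [phase0]
    for _ in range(n_events):
        phases.append(next_phase(phases[-1], omega, alpha))
    return phases

ALPHA = 0.5  # hypothetical coupling strength, not a fitted value

# Matched frequencies (f_osc = f_stim): recovery from a 1 rad phase offset.
phase_recovery = simulate(phase0=1.0, omega=2.0 * math.pi,
                          alpha=ALPHA, n_events=20)

# f_osc / f_stim = 1.03: the phase settles at a constant nonzero value.
omega_fast = 2.0 * math.pi * 1.03
tempo_recovery = simulate(phase0=0.0, omega=omega_fast,
                          alpha=ALPHA, n_events=100)
# Fixed point of the map: alpha * sin(phi*) equals omega modulo 2*pi.
predicted_steady_phase = math.asin(wrap(omega_fast) / ALPHA)

print([round(p, 4) for p in phase_recovery[:6]])
print(round(tempo_recovery[-1], 6), round(predicted_steady_phase, 6))
```

With matched frequencies the phase offset shrinks at every stimulus event, which is the relaxation behavior probed by a phase perturbation. With a frequency mismatch the map settles at the phase where the coupling term cancels the mismatch, provided the mismatch (modulo 2π) is smaller in magnitude than α.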

The most reliable and precise non-human beat keeper known is Ronan, a California sea lion (Zalophus californianus) who was trained using operant methods to match her head movement to a simple isochronous stimulus, thus "bobbing" her head in time to the rhythmic beats. Once she had learned to bob in time to simple stimuli at set tempos, she successfully transferred to novel tempos and stimuli, including music at multiple tempos (Cook et al., 2013). To date, Ronan's results represent the most extensive and robust dataset of beat keeping in a non-human animal. This sea lion's ability to entrain her body movement to sound makes her a valuable candidate for cross-species testing of theories and allows the examination of the underlying mechanisms in a broader comparative approach.

Here, we apply neural resonance theory to an experimental study of Ronan's entrainment behavior, in which we tested her ability to adapt to sudden changes in both the phase and tempo of an isochronous repeating stimulus. Although, like humans, Ronan had previously shown strong entrainment to complex musical stimuli, we used simple stimuli comparable to those presented to humans in similar studies. To determine whether her performance was consistent with theories of neural resonance as is seen in humans, we evaluated her performance with a discrete-time model of coupled oscillation (see Equation 1). We hypothesized that Ronan's beat-keeping performance, and her response to phase and tempo perturbations, would be well fit by simple models of phase and period coupling.

### METHODS

### Subject

The subject was "Ronan," a 7-year-old female California sea lion (NOA0006602), who was housed at Long Marine Laboratory at the University of California Santa Cruz. Ronan was a healthy individual that was placed into captivity around age one after repeated stranding incidents and rescues. She previously participated in a study examining her ability to synchronize to auditory rhythms (Cook et al., 2013). In brief, Ronan was trained to match regular head movement to a simple isochronous stimulus at tempos of 80 and 120 beats per minute (bpm). She then successfully generalized the behavior to novel tempos of 72, 88, 96, 108, and 132 bpm with the simple stimulus, and to novel musical stimuli at tempos of 104, 108, 117, 124, 130, 137, and 143 bpm. Following data collection for Cook et al. (2013), Ronan received intermittent "practice" sessions (typically no more than one per week) with familiar simple stimuli and several novel musical stimuli. During this time, Ronan also participated in several other cognitive and perceptual studies unrelated to rhythm (Reichmuth et al., 2013; Cunningham et al., 2014a,b; Cook et al., 2015; Cunningham and Reichmuth, 2016).

The current experiment occurred from September 2015 to January 2016. During this time, Ronan received a daily diet of 5.7–6.6 kg of freshly thawed, cut herring and capelin fish. She was maintained at a healthy weight of ∼72 kg, and her diet was not constrained for experimental purposes. Ronan typically participated in five sessions per week, receiving ∼40% of her diet during these experimental sessions.

The study was conducted without harm under National Marine Fisheries Service marine mammal research permits 14535 and 18902, with the approval and oversight of the Institutional Animal Care and Use Committee at the University of California Santa Cruz.

### Apparatus

Testing occurred in a 3.6 × 5.2 m enclosure containing a 1.2 m deep, 2.25 m square pool, and surrounding deck space. The experimental setup (similar to that used in Cook et al., 2013) consisted of a 1.1 × 1.5 m painted wooden panel mounted vertically in the doorway of the enclosure. A 0.8 × 0.3 m raised wooden platform was placed on the deck facing the panel, 0.4 m away. Ronan used this platform to find and maintain a consistent stationing position prior to each trial. She rested her foreflippers on the platform while directly facing the panel, and could then move her head freely without touching the panel. An assistant sitting quietly outside the enclosure and behind the panel delivered fish rewards through a short length of PVC pipe mounted in the panel. The experimenter observed Ronan's real-time performance from behind the panel through a 9 cm diameter convex mirror placed 2 m to the side of the flipper station. Both the experimenter and the assistant were concealed from Ronan's view during all trials.

Each session was recorded on a GoPro Hero 2 camera mounted inside the enclosure, 0.25 m above the convex mirror. The auditory stimuli were projected from an Advent AV570 amplified speaker placed ∼1 m from Ronan and from the camera. The absolute broadband received level of the brief auditory stimuli presented through the speaker was ∼100 dBpeak (re 20 µPa); the equivalent sensation level was 60–80 dB at the frequency of the test stimuli based on species-typical hearing sensitivity (Reichmuth et al., 2013). The level of the stimulus was established to ensure saliency of the auditory cues in an outdoor, coastal environment.

### Stimuli

Stimuli were repetitive click tracks created in Audacity, an open-source audio editing program. The clicks comprised two overlaid pure tones of 659 and 1319 Hz for a duration of 10 ms, as was used in Cook et al. (2013). Each track began with a series of beats at a steady rate followed by a single perturbation of either phase or tempo at a magnitude of ±25, ±15, ±8, or ±3% of the inter-onset interval (IOI, in seconds equivalent to 60 divided by the tempo in beats per minute). For example, a +15% shift of the IOI in the 85 bpm condition would correspond to 73.913 bpm. The perturbations were introduced at a different beat for each condition (between 16 and 25 beats after the beginning of the trial) to prevent prediction of the onset location. Primary testing was completed at a base tempo of 85 bpm (705.88 ms IOI), to which Ronan had not previously been exposed; **Table 1** lists the perturbation tempos and their corresponding IOI values. We also tested Ronan with stimulus perturbations at two additional novel base tempos (94.444 and 77.273 bpm, i.e., IOIs 10% shorter and longer than in the 85 bpm condition, see Supplementary Materials) independently of each other and the main tempo.
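The conversion between a percentage shift of the IOI and the resulting tempo can be reproduced with a few lines of Python. This is a sketch of the arithmetic described above, with hypothetical function names; the values it prints follow from the stated base tempo and shift magnitudes, and any difference from the published table in rounding is not accounted for.

```python
BASE_BPM = 85.0
PERCENT_SHIFTS = (-25, -15, -8, -3, 0, 3, 8, 15, 25)

def ioi_ms(bpm):
    """Inter-onset interval in milliseconds for a tempo in beats per minute."""
    return 60000.0 / bpm

def perturbed_tempo(base_bpm, percent_shift):
    """Return (IOI in ms, tempo in bpm) after shifting the IOI by a percentage."""
    new_ioi = ioi_ms(base_bpm) * (1.0 + percent_shift / 100.0)
    return new_ioi, 60000.0 / new_ioi

for shift in PERCENT_SHIFTS:
    new_ioi, new_bpm = perturbed_tempo(BASE_BPM, shift)
    print(f"{shift:+4d}%  IOI = {new_ioi:7.2f} ms  tempo = {new_bpm:7.3f} bpm")
```

Because the shift is applied to the interval rather than to the tempo, a positive shift lengthens the IOI and lowers the tempo, and equal positive and negative percentages do not give tempos that are symmetric around the base tempo.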

#### TABLE 1 | Tempo and inter-onset interval perturbation values referenced to the baseline (no perturbation) condition.


### General Procedure

The auditory stimulus was started after Ronan calmly positioned at the flipper station and oriented toward the panel. The trial was ended after a predetermined performance criterion of 20 or 40 consecutive, apparently entrained bobs (termed "good" bobs) as judged by the experimenter in real time, similar to the procedure used in Cook et al. (2013). Transfer trials (i.e., trials when a perturbation was presented) were run to a criterion of 20 good bobs following the perturbation, and baseline trials (those with no perturbation) were run to a criterion of 40 good bobs starting at the beginning of the trial. At the beginning of each session, two "warm-up" trials (at the base tempo with no perturbation) were presented to confirm stimulus control of the behavior and run to a criterion of 15–30 apparently entrained beats. Each trial was terminated with a previously conditioned reinforcer (a sharp whistle blown by the experimenter that marked the last bob in the criterial run) followed by a reward of two whole capelin fish offered to Ronan through the feeding port in the panel. Ronan then entered the pool for a small fish reward and returned to the flipper station to begin the next trial.

One experimental replicate at a given base tempo encompassed 24 trials: one trial at each test condition (eight phase changes and eight tempo changes for a total of 16 perturbations) and eight unperturbed trials at the base rate. One session, equivalent to one half of a replicate series, consisted of two warm-up trials at the base rate followed by a randomly-shuffled sequence consisting of four baseline (unperturbed) trials, four tempo perturbations, and four phase perturbations. The perturbations for both phase and tempo were each counterbalanced to ensure an equal number of positive and negative shifts per session. Sessions were broken into three blocks of four stimulus presentations each (not including the two warm-up trials), with a short 30 s break between each block, in which Ronan received four to five half capelin while swimming calmly in the water.

Ronan completed 10 replicates of the 85 bpm test series during 20 sessions; that is, she completed 10 trials with each of the 16 phase and tempo perturbations (n = 160 perturbation trials, n = 80 baseline trials). Subsequently, she completed a single replicate with the two additional base tempos over a total of four sessions (n = 32 perturbation trials, n = 16 baseline trials).

During two sessions, trials were interrupted by external factors before the perturbation occurred: once by vocalizations from an animal in a neighboring enclosure, and once by beeping from a truck in close proximity to the testing facility. In these cases, the experimenter immediately stopped the interrupted trial and proceeded as though the trial had been completed. The aborted trial was retested at the end of the session, and the interrupted trials were not included in any analysis.

### Video Analysis

We recorded each session at a frame rate of 120 frames per second (equivalent to a resolution of 8.333 ms per frame). Although Ronan's behavior was continuous, the video data is necessarily binned into windows of 8.333 ms, which introduces some small margin of error into any analysis of precise timing. Nevertheless, a single frame represents, at most, 2% of the IOI.
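As a quick check of the timing resolution, the following sketch computes the duration of one video frame and its share of an IOI. The function names are hypothetical; the tempos used in the example are the base tempo and the fastest tempo reported elsewhere in this article (125.925 bpm).

```python
def frame_ms(fps):
    """Duration of one video frame in milliseconds."""
    return 1000.0 / fps


def frame_share_of_ioi(fps, bpm):
    """Fraction of one inter-onset interval covered by a single frame."""
    return frame_ms(fps) / (60000.0 / bpm)


if __name__ == "__main__":
    print(f"frame duration at 120 fps: {frame_ms(120):.3f} ms")
    for bpm in (85.0, 125.925):
        print(f"{bpm} bpm: one frame = {100 * frame_share_of_ioi(120, bpm):.2f}% of the IOI")
```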

The primary measure of Ronan's performance was selected as the coincidence of the nadir of her head position with the onset of an auditory beat. Specifically, the height of the tip of the nose was used as the marker for the inflection point. An observer using frame-by-frame analysis determined the time of the lowest point for each head bob, with trials viewed in AvsPmod, an open-source video editing program. When more than one frame appeared to show the lowest point, the first of these frames was selected.

Video footage was analyzed independently by two observers. Inter-observer reliability was determined using a common subset of practice trials. Out of 136 bobs, the observers agreed on the frame number corresponding to the lowest point on 127 of them, and the nine disagreements fell within one frame of each other; this means that 100% of the observations were within 8.333 ms. Considering the 10,777 frames over these 136 bobs, the calculated Cohen's kappa (0.933) indicated very high interobserver agreement.

To compare the observed head movements to the timing of the auditory stimuli, we combined the movement data from video analysis with the timing of beat onset determined from the corresponding camera audio, sampled at 48 kHz. For each trial, we located the onset of the first beat by visual analysis of the waveform to the nearest millisecond using Audacity, and then calculated times of subsequent beats based on the stimulus tempo and perturbation location. Because the speaker, the camera, and Ronan were approximately equidistant, we considered the sound travel time from the speaker to Ronan as approximately equal to the sound travel time from the speaker to the camera. Therefore, we used the audio from the camera for timing and did not consider the sound travel time in any subsequent calculations.
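The reconstruction of beat times from the first onset can be sketched as follows. This is an illustration, not the authors' script: the function name and arguments are hypothetical, and it assumes that a phase perturbation changes only the single interval leading into the perturbed beat, whereas a tempo perturbation changes that interval and every later one.

```python
def beat_times(first_onset_s, base_bpm, n_beats, perturb_beat, shift_pct, kind):
    """Return beat onset times in seconds.

    kind is "phase" or "tempo". perturb_beat is the zero-based index of the
    first beat whose preceding interval is shifted by shift_pct percent.
    Assumption: phase shifts alter one interval; tempo shifts alter all
    intervals from perturb_beat onward.
    """
    base_ioi = 60.0 / base_bpm
    shifted_ioi = base_ioi * (1.0 + shift_pct / 100.0)
    times = [first_onset_s]
    for k in range(1, n_beats):
        if kind == "tempo" and k >= perturb_beat:
            ioi = shifted_ioi
        elif kind == "phase" and k == perturb_beat:
            ioi = shifted_ioi
        else:
            ioi = base_ioi
        times.append(times[-1] + ioi)
    return times


if __name__ == "__main__":
    print(beat_times(0.0, 60, 5, 2, 25, "phase"))
    print(beat_times(0.0, 60, 5, 2, 25, "tempo"))
```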

### Statistical Analysis

We quantified Ronan's performance with circular statistics. As the analysis was focused on performance following perturbation rather than overall statistical similarity between movement and beat, analysis of each transfer trial was restricted to a subset of bobs: specifically, the 10 bobs preceding the perturbation and the 20 bobs following. In baseline trials, where there was no perturbation, the entire trial was included.

In several trials (n = 10), Ronan exhibited "double bobs" where she bobbed twice for a given beat. This typically occurred during the +25% tempo condition of the 85 bpm base tempo, when she continued moving at the original rate such that when the first shifted beat occurred, her head was near the highest point of the bob rather than the lowest. In response to this beat, Ronan immediately nodded her head in a smaller bobbing motion before slowing her overall movement to adapt to the new tempo. These outlier bobs were easily identified by the shift in angle of her head in relation to her neck, and were excluded from the analysis. An example of a double bob can be seen in the first trial of Supplementary Video 1.

First, the relative phase angle between each head bob and the nearest stimulus beat was calculated using Equation (2).

$$\phi\_n = 2\pi \left(\frac{t\_{bob\_n} - t\_{beat\_n}}{I\_n}\right) \tag{2}$$

The phase is expressed, in radians, as the time between a head bob (t<sub>bob<sub>n</sub></sub>) and the nearest stimulus beat (t<sub>beat<sub>n</sub></sub>), as a proportion of the inter-onset interval of the beats surrounding the head bob (I<sub>n</sub>). For all bobs, the relative phase angle was inherently restricted to a range of −π to π radians, as no bob occurred more than π radians away from its nearest beat.
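A minimal sketch of the relative phase calculation is given below. The function name is hypothetical, and the handling of bobs that fall before the first or after the last beat (reusing the adjacent interval) is an assumption made only so that the example is complete.

```python
import bisect
import math


def relative_phase(t_bob, beats):
    """Relative phase (radians) of a bob with respect to its nearest beat.

    beats must be a sorted list of at least two onset times. The interval used
    for normalization is the one between the two beats surrounding the bob.
    """
    i = bisect.bisect_left(beats, t_bob)
    if i == 0:
        # Assumption: a bob before the first beat uses the first interval.
        nearest, interval = beats[0], beats[1] - beats[0]
    elif i == len(beats):
        # Assumption: a bob after the last beat uses the last interval.
        nearest, interval = beats[-1], beats[-1] - beats[-2]
    else:
        prev_beat, next_beat = beats[i - 1], beats[i]
        interval = next_beat - prev_beat
        if t_bob - prev_beat <= next_beat - t_bob:
            nearest = prev_beat
        else:
            nearest = next_beat
    return 2.0 * math.pi * (t_bob - nearest) / interval


if __name__ == "__main__":
    beats = [0.0, 1.0, 2.0, 3.0]
    for bob in (1.25, 1.75, 2.0):
        print(bob, relative_phase(bob, beats))
```

A positive value means the bob lagged the nearest beat, and a negative value means it led the beat.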

We calculated the mean relative phase angle (φ) and mean vector length (r) for each trial using the argument (Equation 3) and modulus (Equation 4), respectively, of the sums of the angles as complex numbers.

$$\overline{\phi} = \text{arg} \left( \frac{1}{n} \sum\_{j=1}^{n} e^{i \cdot \phi\_j} \right) \tag{3}$$

$$r = \frac{1}{n} \text{abs} \sum\_{j=1}^{n} e^{i \cdot \phi\_j} \tag{4}$$

The mean vector length, which indicates the concentration of the phases around the mean angle, ranges from 0 (no mean angle) to 1 (perfect concordance of angles). Together, the mean relative phase angle and mean vector length specify the strength of Ronan's performance on a given trial. For each trial, we used the V-test to determine whether Ronan's bobs were significantly clustered around a mean relative phase of 0, which would indicate perfect synchrony with the stimulus (Zar, 1999). Again, we included both pre-perturbation (10 preceding) bobs and post-perturbation (20 following) bobs in this analysis to provide an evaluation of her synchronization across each transfer trial; all bobs within each baseline trial were included.
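The circular summary statistics can be sketched with the standard library alone. The function names are hypothetical. The V-test statistic is written in the form commonly given in circular statistics texts, u = V·sqrt(2/n) with V = n·r·cos(mean angle − expected angle); whether the authors used exactly this form is an assumption.

```python
import cmath
import math


def circular_summary(phases):
    """Return (mean angle in radians, mean vector length) of a list of angles."""
    n = len(phases)
    resultant = sum(cmath.exp(1j * p) for p in phases) / n
    return cmath.phase(resultant), abs(resultant)


def v_test_u(phases, expected=0.0):
    """V-test statistic u for clustering around an expected angle.

    Assumption: u = V * sqrt(2 / n), with V = n * r * cos(mean - expected).
    Larger u means stronger clustering around the expected angle.
    """
    n = len(phases)
    mean, r = circular_summary(phases)
    v = n * r * math.cos(mean - expected)
    return v * math.sqrt(2.0 / n)


if __name__ == "__main__":
    sample = [-0.3, -0.1, 0.0, 0.1, 0.2, -0.2, 0.05, -0.05]
    mean, r = circular_summary(sample)
    print(f"mean angle = {mean:.4f} rad, r = {r:.4f}, u = {v_test_u(sample):.3f}")
```

For perfectly aligned phases the statistic reduces to sqrt(2n), which gives a sense of how many bobs are needed before a given value of u can be reached.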

Relative phase angles for each beat were also averaged across replicates to obtain an average trial for the baseline and each perturbation type. We used these averaged trials to fit the oscillator models.

### Model Fitting

The nonlinear equations that describe rhythmic behavior are often explained using a "circle map," an equation that produces a set of phases which predict the phase of a stimulus event relative to the onset of the behavioral oscillation (Pikovsky et al., 2001; Large and Palmer, 2002).

$$\phi\_{n+1} = \phi\_n + 2\pi f\_{ronan} \left( t\_{n+1} - t\_n \right) - \alpha \sin \phi\_n \quad \text{(} mod\_{-\pi,\pi}2\pi\text{)} \tag{5}$$

Equation (5) states that the phase of each successive auditory event (φ<sub>n+1</sub>), in this case the onset of the click stimulus, is determined by three terms: the current auditory event's relative phase (φ<sub>n</sub>); the frequency of the stimulus relative to the oscillator's frequency, expressed as the product of the current period of the stimulus (t<sub>n+1</sub> − t<sub>n</sub>) and the current radian frequency of the oscillator (ω<sub>n</sub> = 2πf<sub>ronan</sub>); and a stimulus coupling term (i.e., the sine of the current auditory event's relative phase modulated by a coupling factor, α). The coupling factor indicates how strongly the relative phase of the oscillator is affected by the stimulus. Because phase is a circular value, the resulting phase is taken modulo 2π (the remainder after dividing the phase by 2π) and normalized to the range of −π to π.

If Ronan were to bob her head at exactly the same rate as the stimulus, the number of bobs to correct for being ahead of or behind the beat would depend solely on the phase coupling factor, α: a high coupling factor would mean a very quick adaptation to the stimulus, while a low coupling factor would mean a slower adaptation. The optimal value for the phase coupling factor is 1.0, which is the largest value that does not result in overcorrection. A value >2.0 causes the equation to become unstable (Pikovsky et al., 2001).
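The two thresholds quoted above can be checked with a short linearization. As a sketch, assume that the oscillator rate exactly matches the stimulus rate, so that the frequency term in Equation (5) is a whole cycle and drops out under the modulo, and that the phase error is small enough for sin φ ≈ φ to hold:

$$\phi\_{n+1} = \phi\_n - \alpha \sin \phi\_n \approx (1 - \alpha)\,\phi\_n \quad \Rightarrow \quad \phi\_n \approx (1 - \alpha)^n\, \phi\_0$$

Under these assumptions, the error vanishes in a single beat when α = 1, decays without changing sign for 0 < α < 1, alternates in sign while decaying for 1 < α < 2 (overcorrection), and grows when α > 2 because |1 − α| > 1.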

If, on the other hand, Ronan's period were different from the stimulus period, she still might be able to adapt to a steady phase, but it would be at a non-zero phase. This non-zero phase can be calculated with Equation (5) by assuming that φ<sub>n+1</sub> = φ<sub>n</sub> and solving for φ<sub>n</sub>. However, if the phase coupling factor (α) were not sufficiently large, she might never adapt to a steady phase and instead would start to "phase-wrap." Therefore, phase adaptation alone does not guarantee perfect synchronization; period adaptation is required as well. This is described by a supplemental equation to the circle map.

$$
\omega\_{n+1} = \omega\_n - \beta \sin \phi\_n \tag{6}
$$

From Equation (6), we can see that each successive oscillator radian frequency (ωn+1)—in this case, the oscillator radian frequency is 2π multiplied by the inverse of the time between two successive head bobs—is dependent on the oscillator radian frequency of the current beat (ωn) and a different stimulus coupling (the sine of the relative phase of the current beat, modulated by a different coupling factor, β). Again, this coupling factor indicates how strongly the oscillator period is affected by the stimulus.

Thus, in order to completely synchronize, an oscillator needs to adapt both phase and period. Together, Equations (5) and (6) accurately model the entrainment of human performers not only to a simple repeating stimulus, but also to stimuli with multiple changes in phase and tempo over the course of the stimulus.
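To make the joint phase and period adaptation concrete, the sketch below iterates the circle map and the frequency update over a list of beat times. It is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the oscillator frequency is given in Hz and converted to radian frequency, and the phase update uses the frequency from the current beat before that frequency is adapted.

```python
import math


def wrap(phi):
    """Map an angle onto the interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def simulate(beat_times, f0, alpha, beta, phi0=0.0):
    """Iterate the phase and radian-frequency updates over successive beats.

    Returns the model phase at every beat (stimulus relative to oscillator).
    Assumption: the phase update uses omega_n, then omega is adapted.
    """
    phi = phi0
    omega = 2.0 * math.pi * f0
    phases = [phi]
    for t_prev, t_next in zip(beat_times, beat_times[1:]):
        sin_phi = math.sin(phi)
        phi_next = wrap(phi + omega * (t_next - t_prev) - alpha * sin_phi)
        omega = omega - beta * sin_phi
        phi = phi_next
        phases.append(phi)
    return phases


if __name__ == "__main__":
    # Stimulus 5% slower than the oscillator, phase coupling only.
    beats = [k * 1.05 for k in range(40)]
    phases = simulate(beats, f0=1.0, alpha=1.0, beta=0.0)
    print(f"settled phase: {phases[-1]:.4f} rad")
```

With β = 0 and a small rate mismatch, the phase in this sketch settles at a non-zero value rather than at zero, which is the steady non-zero phase described in the text; a non-zero β lets the oscillator frequency drift toward the stimulus rate.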

We fit a deterministic version of Equations (5) and (6) to the circularly averaged trial from each condition. α and β were varied to minimize the root mean squared error (RMSE) of the test model compared to Ronan's data. The range used for fitting α was 0.4–2.0 in increments of 0.1, and the range used for fitting β was 0.01–0.70 in increments of 0.01. Because Equation (5) describes the phase of the stimulus relative to the oscillator, the phases produced by the model are opposite in sign to the measured behavioral data, which indicate the phase of the oscillator relative to the stimulus. In other words, φ<sub>model</sub> = −φ<sub>measured</sub>.
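The grid search described above can be sketched as follows. The parameter ranges and the sign flip between model and measured phases come from the text; everything else is an assumption made for illustration, including the function names, the use of wrapped angular differences in the RMSE, and the fixed starting frequency and phase.

```python
import math


def wrap(phi):
    """Map an angle onto the interval [-pi, pi)."""
    return (phi + math.pi) % (2.0 * math.pi) - math.pi


def simulate(beat_times, f0, alpha, beta, phi0=0.0):
    """Model phases from the circle map with period adaptation (see text)."""
    phi, omega = phi0, 2.0 * math.pi * f0
    phases = [phi]
    for t_prev, t_next in zip(beat_times, beat_times[1:]):
        sin_phi = math.sin(phi)
        phi_next = wrap(phi + omega * (t_next - t_prev) - alpha * sin_phi)
        omega -= beta * sin_phi
        phi = phi_next
        phases.append(phi)
    return phases


def rmse(model, measured):
    """RMSE between sign-flipped model phases and measured phases.

    Assumption: differences are wrapped so that errors near +/-pi stay small.
    """
    errors = [wrap(-m - d) for m, d in zip(model, measured)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def fit(beat_times, measured, f0, phi0=0.0):
    """Exhaustive search: alpha in 0.4..2.0 (step 0.1), beta in 0.01..0.70 (step 0.01)."""
    best = None
    for a in range(4, 21):
        for b in range(1, 71):
            alpha, beta = a / 10, b / 100
            err = rmse(simulate(beat_times, f0, alpha, beta, phi0), measured)
            if best is None or err < best[2]:
                best = (alpha, beta, err)
    return best


if __name__ == "__main__":
    # Synthetic check: recover known parameters from model-generated phases.
    beats = [k * 1.0 if k < 10 else 9.0 + (k - 9) * 1.15 for k in range(31)]
    measured = [-p for p in simulate(beats, 1.0, 0.9, 0.05)]
    print(fit(beats, measured, 1.0))
```

Integer loop counters are used for the grid so that repeated floating-point addition does not skip or duplicate a grid point.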

#### Parameter Regression

To identify any potential relationship between either of the coupling parameters and the perturbation magnitude, we regressed both α and β as a function of condition against perturbation magnitude.
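A simple linear regression of a coupling parameter against perturbation magnitude can be sketched as below. The function name is hypothetical and the numbers in the example are made-up placeholders, not values from this study; they serve only to show the calculation of the slope, intercept, and F statistic.

```python
def linear_regression(x, y):
    """Ordinary least squares fit of y on x.

    Returns (slope, intercept, F), where F has 1 and n - 2 degrees of freedom.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual and regression sums of squares.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sum((intercept + slope * xi - mean_y) ** 2 for xi in x)
    f_stat = ssr / (sse / (n - 2))
    return slope, intercept, f_stat


if __name__ == "__main__":
    # Placeholder data for illustration only (not from the study).
    magnitude = [0, 1, 2, 3]
    coupling = [1, 3, 5, 8]
    print(linear_regression(magnitude, coupling))
```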

### RESULTS

Ronan successfully entrained to all stimuli and perturbations at the base tempo of 85 bpm (see Supplementary Video 1). For all trials, the critical value of the V-test was >3.4, indicating that the distribution of bobs was significantly nonrandom (p < 0.001) with respect to 0 radians. Thus, Ronan's head movements were strongly correlated with the beat on both baseline trials and transfer trials containing a perturbation event. **Figure 1** displays the correspondence of the angular distributions of the initial presentations of each condition, and **Figure 2** displays the similarity between performances in all trials. Her subsequent performance on the two additional base tempos (94.444 and 77.273 bpm) showed similar results: she successfully entrained to all stimuli and perturbations (see Supplementary Tables 1, 2 and Supplementary Figures 1–3).

Ronan's performance on baseline and transfer trials revealed rapid entrainment to the base tempo, with performance stabilizing within the first four beats (Supplementary Video 1), as previously observed by Cook et al. (2013). Unexpectedly, her performance with all tempos and all stimuli showed a slight phase progression over the course of each trial: the average slope of relative phase per beat across all trials was −0.0206, and the slope was different from zero for the majority of trials (n = 80 baseline trials, n = 62 phase perturbation trials, n = 55 tempo perturbation trials, p < 0.05). This represents a deviation from her previous performance (Cook et al., 2013). However, this deviation was consistent across replicates and conditions.

**Figure 3** shows the model fit compared to Ronan's pooled performance for each condition. **Table 2** and **Figure 4** show the fitted coupling parameter values and final RMSE for each condition. RMSE was very low for all conditions, with an average value of 0.0518 radians. Phase coupling was strong; across all conditions, the average parameter value was 0.894. Loehr et al. (2011) found that human subjects performing a comparable task (playing a piano keyboard to a metronome with a changing tempo) had an average phase coupling parameter value of 0.875, quite close to Ronan's. Ronan's observed period coupling was much weaker, averaging 0.0471 across all conditions. This is quite low compared to the subjects in the Loehr study, who had an average period coupling parameter value of 0.450.

We also observed a significant positive linear relationship between phase coupling magnitude and absolute perturbation magnitude [RMSE = 0.2800, F(1, 15) = 35.1, p < 0.0001]. Period coupling, on the other hand, showed a significant negative linear relationship with perturbation magnitude [RMSE = 0.0288, F(1, 15) = 9.02, p < 0.01].

FIGURE 2 | Mean phase (A) and vector length (B) of Ronan's post-perturbation bobs at a base tempo of 85 bpm, grouped by condition. Each trial is plotted as a circle, and the mean for each condition is represented by the line. Mean phase shows a linear trend with tempo changes (right portion of plot A), a trend described in a prior study of Ronan's rhythmic entrainment ability (Cook et al., 2013). In all cases, mean vector length falls between 0.88 and 0.99, indicating a very high concordance of phases within the trial.

### DISCUSSION

Ronan's performance with novel tempos containing embedded tempo and phase perturbations showed a remarkable ability to adapt quickly and accurately to synchronize her body motion to the temporal features of the auditory stimulus stream. Moreover, her beat keeping (ranging from 61.818 to 125.925 bpm across the three base tempos) and adaptation (±25% of the IOI) impressively fit models of co-oscillation, drawn from physics and validated in human beat-keeping experiments. The findings show a strong similarity between dynamics of Ronan's performance and human performance, and parsimony suggests these are rooted in similar and conserved neural mechanisms rather than species-specific adaptations. However, these results by themselves are not dispositive, and more comparative data are needed to fully resolve the debate over underlying mechanisms.

coupled oscillation. On perturbation conditions, the vertical line at 10 beats indicates the onset of the tempo or phase shift indicated at the top of the plot. Phase (α) and period (β) coupling factors are noted in the upper right portion of each plot.

TABLE 2 | Phase (α) and Period (β) coupling parameter values and Root Mean Squared Error (RMSE) for model fits of Ronan's experimental data at 85 bpm.

Through the lens of neural resonance, we see that Ronan's beat-keeping behavior in response to stimulus perturbations compared strongly to that measured in humans in four ways: (1) flexibility in tempo matching was evident in her behavior throughout testing, (2) changes in phase and tempo were matched through both phase and period adaptation, (3) phase adaptation was stronger than tempo adaptation, and (4) reduced sensitivity to smaller perturbations was observed (discussed below).

Ronan's performance in this study differed from human performance in two important ways, related to (1) phase coupling, and (2) period adaptation. For most perturbation trials, phase coupling (α) varied based on perturbation magnitude, dramatically increasing for larger magnitude shifts (perturbations >8%). The significant linear relationship between α and absolute perturbation magnitude suggests that the more noticeable alterations induced a larger change in coupling. In most human studies, phase coupling has been considered more or less constant (for review see Repp, 2005; Large, 2008), so Ronan's variable coupling strength is a novel discovery. However, her performance does align with findings in humans that larger perturbations are more noticeable because they represent more significant violations of expectation of where the next beat should occur (Large and Jones, 1999). It also suggests another similarity to humans: the just-noticeable difference in humans for tempo changes of a single interval (equivalent to a change in phase) is ∼6% (Drake and Botte, 1993). Ronan's results here imply that she did not readily perceive the ±3 or ±8% phase perturbations or the ±3% tempo perturbations, similar to what might be expected in human performance based on available research.

The second divergence in Ronan's behavior relative to that of humans is decreased period adaptation. While human studies have shown typical period coupling values between 0.3 and 0.8 (Loehr et al., 2011), Ronan's period coupling values did not exceed 0.2. Again, it is important to note that the human comparison is imperfect. Human subjects played a melody on the keyboard with a metronome, as opposed to a single discrete repeating movement. Furthermore, rather than a single sudden shift, the tempo changed continuously following the shift onset.

Additionally, Ronan had a larger phase/tempo offset than typically seen in humans (e.g., Repp, 2005; Repp and Su, 2013): her starting relative phase showed a direct linear correlation with IOI, with faster tempos effecting a starting phase further behind the beat, a trend described previously in Cook et al. (2013).

These differences relative to human subjects may be rooted in behavioral aspects of Ronan's performance. The gradual phase progression on all trials and changing phase coupling strength for larger perturbations suggest that Ronan used a specific strategy to entrain to these stimuli. Although she showed reliable phase and tempo matching throughout the experiment, her precision dramatically increased following relatively large perturbations. One possible interpretation is that basic beat keeping with simple metronomic stimuli is quite easy for Ronan following her extensive training with these and more complicated stimuli. Perhaps she uses a motor heuristic to produce "good enough" entrainment without employing any significant attentive effort. However, following a perturbation, realigning her movement with the beat may require greater attention. This could then drive an up-regulation of auditory motor networks, leading to increased coupling and greater performance. There is extensive evidence that human beat-keeping performance is heavily reliant on intent and attention (see Large and Jones, 1999; Repp, 2005). Furthermore, "task-positive" networks in humans (attention-driven brain networks that up-regulate in-network functional connectivity, i.e., co-synchrony across nodes, during rigorous mental action) include motor and motor planning regions (Fox et al., 2005; Bardouille and Boe, 2012). Increased attention to the stimulus following perturbation could therefore lead to increased resonance between the neural oscillators of interest.

In most respects, Ronan's beat-keeping performance was as precise and reliable as that observed in human studies, and was well fit by models of coupled oscillation. Ronan's obvious ability to adaptively entrain her body movements to auditory rhythms extends the findings reported for this subject by Cook et al. (2013). Although the current experiment did not explicitly test adaptation to phase or tempo change in more complex stimuli, Ronan has successfully entrained to human-generated music that contains natural variability in both phase and tempo (Cook et al., 2013). Not only does this support the likelihood of shared mechanisms, it emphasizes Ronan's usefulness as a comparative model to study other aspects of rhythmic entrainment. In addition, Ronan's beat keeping did not emerge de novo—she received explicit and extensive operant (positive reinforcement) training. Therefore, she may serve as a model for training other non-human beat keepers. Supplemental testing with Ronan and with additional non-human subjects should clarify the mechanisms supporting beat-keeping ability and resolve whether these mechanisms are evolutionarily conserved. Further exploration of these results may also improve understanding of other facets of human musical ability. Resonance of neural oscillators with an external stimulus has been proposed as the foundation for many areas of music perception and cognition, including pitch and meter perception (see Large, 2008 for a review).

Patterns of neural oscillations have been observed in every nervous system examined (Glass, 2001). The basic physics of the structure of neural oscillators shows that if stimulated rhythmically, they will synchronize. The finding that non-human as well as human beat keeping is consistent with models of neural resonance supports a parsimonious explanation of beat-keeping behavior as arising from basic principles of nervous system behavior. That being said, the tendency of linked neural populations to co-oscillate could be only the beginning of an understanding of sensorimotor synchronization. While coupled oscillation between neural populations may be necessary and sufficient for supporting a general faculty for beat keeping, great potential still exists for variability in the dynamics of beat-keeping behavior. First and foremost, animals may differ in connection strengths between relevant neural populations. This could be due to differences in anatomical connectivity, or differences in functional connectivity in these brain circuits, which can change with learning, across development, and dynamically with attention, intention, and other psychological factors. To date, the field of comparative rhythm has focused on answering the question "which species can keep a beat?" If basic and conserved neural mechanisms support entrainment intrinsically, the more productive question is this: "How can we use sensorimotor synchronization paradigms as a comparative tool to better understand brain function and behavior across species and contexts?"

### AUTHOR CONTRIBUTIONS

AR, PC, and CR designed the study; AR conducted all experiments; and AR, PC, CR, and EL analyzed the data, interpreted the results, and wrote the manuscript.

### FUNDING

Support for this work was provided in part by the Special Projects Fund of the Pinniped Cognition and Sensory Systems Laboratory, and a grant to CR from the International Association of Oil and Gas Producers through the Exploration and Production (E&P) Sound and Marine Life Joint Industry Programme (Award 22-07-23).

### ACKNOWLEDGMENTS

The authors thank the dedicated research team of the Pinniped Cognition and Sensory Systems Laboratory. We are especially grateful to Lima Kayello and Jacob Linsky for their assistance with data analysis. We also thank Ronan for her enthusiastic participation and cooperation in this behavioral research.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00257


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Rouse, Cook, Large and Reichmuth. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Sensorimotor Synchronization with Different Metrical Levels of Point-Light Dance Movements

#### Yi-Huang Su \*

Department of Movement Science, Faculty of Sport and Health Sciences, Technical University of Munich, Munich, Germany

Rhythm perception and synchronization have been extensively investigated in the auditory domain, as they underlie means of human communication such as music and speech. Although recent studies suggest comparable mechanisms for synchronizing with periodically moving visual objects, the extent to which it applies to ecologically relevant information, such as the rhythm of complex biological motion, remains unknown. The present study addressed this issue by linking rhythm of music and dance in the framework of action-perception coupling. As a previous study showed that observers perceived multiple metrical periodicities in dance movements that embodied this structure, the present study examined whether sensorimotor synchronization (SMS) to dance movements resembles what is known of auditory SMS. Participants watched a point-light figure performing two basic steps of Swing dance cyclically, in which the trunk bounced at every beat and the limbs moved at every second beat, forming two metrical periodicities. Participants tapped synchronously to the bounce of the trunk with or without the limbs moving in the stimuli (Experiment 1), or tapped synchronously to the leg movements with or without the trunk bouncing simultaneously (Experiment 2). Results showed that, while synchronization with the bounce (lower-level pulse) was not influenced by the presence or absence of limb movements (metrical accent), synchronization with the legs (beat) was improved by the presence of the bounce (metrical subdivision) across different movement types. The latter finding parallels the "subdivision benefit" often demonstrated in auditory tasks, suggesting common sensorimotor mechanisms for visual rhythms in dance and auditory rhythms in music.

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Peter Keller, University of Western Sydney, Australia

Michael Hove, Harvard Medical School, USA

#### \*Correspondence:

Yi-Huang Su yihuang.su@tum.de

Received: 14 February 2016 Accepted: 12 April 2016 Published: 27 April 2016

#### Citation:

Su Y-H (2016) Sensorimotor Synchronization with Different Metrical Levels of Point-Light Dance Movements. Front. Hum. Neurosci. 10:186. doi: 10.3389/fnhum.2016.00186

Keywords: visual rhythm, sensorimotor synchronization, dance, music, biological motion

### INTRODUCTION

Musical rhythms encompass multiple metrical levels of periodicity. While listeners can tune to different levels (each one termed a pulse) and hence different tempi in the same rhythm, each individual typically identifies a most salient periodicity as the beat (Drake et al., 2000; McKinney and Moelants, 2006). Perceptual grouping of alternating strongly and weakly accented events, or stronger and weaker beats, gives rise to the musical meter, which functions as a temporal reference frame and yields a distinct percept of patterning in music (London, 2012). These temporal modules not only define the structure of musical rhythms, but more importantly engage human behaviors. People move their bodies naturally to the musical beat, and the movements often consist of regular patterns (Toiviainen et al., 2010; Su and Pöppel, 2012; Manning and Schutz, 2013; Burger et al., 2014). This is an everyday example of coordinating one's motor output with external sensory rhythms, known as sensorimotor synchronization (SMS; Repp and Su, 2013). Though better understood in humans, SMS in various forms—albeit to different degrees—has also been observed in other non-human animals (Fitch, 2013; Repp and Su, 2013). From an evolutionary point of view, synchronization behaviors serve to coordinate individuals with each other as well as in response to environmental signals. As SMS requires tracking the underlying periodicity of sensory rhythms (Merker et al., 2009), this process would be hindered if no temporal regularity can be perceived. Thus, the metrical structure of musical rhythm plays a functional role in SMS for humans, and this function may have evolutionary purposes shared by rhythmic behaviors in other species (Ravignani et al., 2014).

Most knowledge of rhythm perception and SMS originates from findings of auditory tasks for two possible reasons. For one, rhythm in human society is most readily ascribed to characteristics of auditory stimuli, such as music or speech, whose prevalent role in communication makes the research question relevant. For another, it seems unfeasible to manipulate rhythmic visual stimuli such that various metrical structures, e.g., different simultaneous periodicities, can be naturally presented as in the auditory tasks. As such, theoretical frameworks of rhythm processing, such as how the perceptual system entrains to the hierarchical periodicities, have mainly been developed and tested in the auditory domain (Large and Snyder, 2009; London, 2012). While the mechanism of temporal tracking has been postulated to be generalizable to the visual modality (Large and Jones, 1999), there has been little verification. The lack of investigation on visual synchronization overlooks the significance of rhythmic visual cues in guiding timed actions, which is supported by animal research: for example, monkeys exhibit similar temporal sensitivity to auditory and visual signals (Zarco et al., 2009); they also show signs of synchronization to ecologically-relevant visual rhythms, such as regular limb motions of another monkey (Nagasaka et al., 2013). Furthermore, visual cues as communicated by structured patterns of body movements (i.e., "dance-like" movements) are important signals to regulate inter-individual behaviors in the animal kingdom, e.g., in songbirds' courting rituals (Ota et al., 2015), and possibly also in chimpanzees' play (Oota, 2015). These behaviors give insight into how humans synchronize to visual cues derived from movement patterns, which manifest most evidently in dance (Kirsch and Cross, 2015) and may even be underpinned by mechanisms that overlap with those for synchronizing to music (Su, 2016).

When gauging human SMS of simple movements (finger tapping) with relatively simple auditory stimuli (isochronous tones), the metrical structure of auditory rhythm has been found to modulate SMS in various ways (Repp, 2005; Repp and Su, 2013). One repeatedly shown finding is that, within an interbeat interval (IBI) of 200–1800 ms, synchronization to the beat is stabilized by the presence of metrical subdivisions in tasks known as "1:n tapping" (n = 2, 3, or 4, i.e., tapping to every second, third, or fourth event, Repp, 2003; Zendel et al., 2011; Madison, 2014). This effect, termed "subdivision benefit" (Repp, 2003), has also been demonstrated in subdivisions that are mentally imposed (Repp and Doggett, 2007; Repp, 2008a). That a parallel, lower-level periodicity (either physically or mentally imposed) can influence SMS with the beat indicates that rhythm perception and synchronization involve tracking multiple periodicities simultaneously, during which temporal information across several metrical levels may be integrated (Repp, 2008b). This idea is further supplemented by findings that the same beat tempo is perceived to be slower with than without the presence of metrical subdivisions (Repp, 2008a; Su, 2016), pointing to the effect of a lower-level pulse on temporal processing of an attended beat. It is less clear, though, whether the other way around is also true, i.e., whether SMS with a less salient pulse is modulated by the presence of a higher-level, more salient beat. An earlier study suggested that adding metrical accents may assist offbeat tapping to an otherwise unaccented isochronous sequence (Keller and Repp, 2005). Beyond that, there seems to be no systematic investigation in this regard.

Human SMS studies employing comparable auditory and visual stimuli (Repp, 2003; Patel et al., 2005; Lorås et al., 2012) often demonstrated inferior synchronization in the latter. The adopted visual stimuli were, however, rather simple (e.g., flashes) and bore little resemblance to the environmental signals. Recent studies started incorporating dynamic visual stimuli that move with realistic object or biological kinematics, such as a bouncing ball (Hove et al., 2013b; Iversen et al., 2015) or a bouncing human figure (Su, 2014b). Synchronization to such periodic movements improves considerably compared to situations of repetitive flashes, suggesting better visual SMS capacity than previously believed. Nevertheless, these visual rhythms contain only one periodicity, and the obstacle remains as to how to investigate SMS to visual stimuli with even more complex rhythmic structure. It is still unclear whether visual synchronization with realistically moving stimuli—especially if they contain rich metrical information—can engage similar mechanisms to what is known for auditory rhythms. For example, if multiple metrical periodicities were simultaneously present in the visual stimuli, would such a phenomenon as the ''subdivision benefit'' be observed?

The present study addressed these issues using novel, naturalistic visual stimuli of a set of human dance movements developed in a recent study (Su, 2016). Given that the link between musical rhythms and human movements is well reflected in how humans move, or dance, to music (Burger et al., 2014), dance observation may perpetuate visual rhythm perception based on action-perception coupling. Specifically, as dance movements can embody the metrical structure of music (Naveda and Leman, 2010; Toiviainen et al., 2010), when presented as visual stimuli they may communicate multiple levels of periodicity in parallel, providing a suitable and ecological analog to auditory musical rhythms. The recent study (Su, 2016) presented stimuli of a point-light figure (PLF, Johansson, 1973) performing basic steps of Charleston and Balboa dance cyclically (**Figure 1**). An important characteristic of both dances was the regular bounce (generated by knee flexion and extension) that can be seen especially in the trunk movement pattern (**Figure 2C**). In addition, in Charleston the limbs moved with relatively large trajectories in space (regular leg and arm swinging, **Figure 2A**), whereas in Balboa the legs moved in a footstep-like manner, and the arms remained still (**Figure 2B**). A critical feature in both dances was that the trunk bounced vertically at every beat while the limbs moved laterally at every second beat, yielding two possible metrical levels in the movements. It was found that observers could tune to either periodicity flexibly, with the leg movements more often perceived as beat than the bounce. Moreover, the tempo of the leg movements (beat) was perceived to be slower with than without the trunk bouncing simultaneously (subdivisions), mirroring previous auditory findings (Repp, 2008a). From here on, it seems logical to examine visual SMS with these stimuli as a next step, in an attempt to answer the questions raised above: namely, effects of different metrical structures in the movement on SMS.

In two finger-tapping tasks, the present study investigated two different effects of visual metrical structure on SMS with the point-light dance movements: synchronizing to the pulse with or without metrical accents (Experiment 1), and synchronizing to the beat with or without metrical subdivisions (Experiment 2). In Experiment 1, participants observed the PLF performing the two dance movements and tapped to the bounce of the trunk in a synchronized manner. For the Charleston stimuli, which involved the trunk bouncing and the legs and arms moving in a symmetrical manner, movement variations were implemented to examine whether tapping to the bounce (i.e., lower-level pulse) was stabilized by the presence of the leg movement, the arm movement, or both (i.e., higher-level metrical accents). For the Balboa stimuli, which involved only the trunk and the legs, the legs could move either in the same tempo as the bounce (i.e., accents at the same level as the pulse), or twice as slow as the bounce (i.e., accents at one metrical level higher than the pulse)<sup>1</sup>. Besides the effect of leg movement on tapping to the bounce, it was of interest whether this effect was modulated by the metrical relation between the two periodicities.

In Experiment 2, participants observed the same dancing PLF and tapped synchronously to the leg movements. Both dance movements were presented either naturally, or without the trunk bouncing in the stimuli. If visual SMS engages similar mechanisms as the auditory counterpart, tapping to the leg movements (i.e., beat) should be more stable with than without the simultaneous trunk movement (i.e., metrical subdivisions, Repp, 2003; Zendel et al., 2011; Madison, 2014). Moreover, the Charleston stimuli probed whether adding another metrical accent, i.e., the arm movement, would further stabilize synchronization. Finally, as an additional variable of interest, the PLF in both experiments danced either with horizontal translational motion (TM), i.e., the whole body moving forward and backward regularly (see Su, 2016), or with the whole body remaining in place (without TM). This was meant to examine whether effects of metrical accent or metrical subdivision were modulated by horizontal spatial information in the entire movement.

<sup>1</sup>The variation of legs moving at the same tempo as the bounce was not possible for the Charleston dance due to biomechanical constraint in the movement, and was thus only implemented for Balboa.

FIGURE 2 | The trajectory and kinematics of the point-light motion stimuli. (A) The tracked trajectories, shown in green lines, of the left foot (upper panel) and the right foot (lower panel) in Charleston, plotted on the frame at Beat 2 and Beat 6, respectively. The left foot positions at Beat 1 and 3, and the right foot positions at Beat 5 and 7, are noted relative to each plotted trajectory. The tracked foot marker is shown in yellow in the respective panel. The dotted trace represents the trajectory leading up to the earliest beat in each panel. The two frames are taken from the same perspective relative to the PLF movement, and TM can be seen here as the PLF has moved forward in the lower panel relative to the upper one; without TM there is no such horizontal displacement. (B) The tracked trajectories (green lines) of the right and left foot (both in yellow) in Balboa, plotted on the frame at Beat 6. The trace between Beat 1 and Beat 5 belongs to the left foot, and the trace before Beat 3 belongs to the right foot. TM can also be seen here as the horizontal displacement of both feet relative to the starting position (the beginning of each dotted trace). (C) The velocity and position profile of the trunk movement, averaged across the four trunk markers along the time vector (X-axis). This profile is taken from one cycle of Charleston without TM at the IBI of 500 ms; the trunk kinematic pattern in other dance movement conditions is essentially the same. The green and blue circles mark the point of peak velocity and the point of lowest position in each bounce, respectively. (D) The 3D velocity (upper panel) and the Y (sagittal) position profile (lower panel) of the foot movement in one cycle of Charleston, IBI = 450 ms, with TM. The green and blue circles mark the point of peak velocity and the point of end position in each trajectory, with the corresponding beat number notated. (E) The 3D velocity (upper panel) and the vertical position profile (lower panel) of the foot movement in two cycles of Balboa, IBI = 450 ms, with TM. Points of peak velocity and end position are illustrated in the same manner as in (D); beat 9–15 occur in the second cycle.

### EXPERIMENT 1: SYNCHRONIZING TO THE BOUNCE

### Methods

#### Participants

Eighteen young, healthy volunteers (five males, mean age 26.3 years, SD = 4.8) took part in this experiment. Participants were naïve to the purpose, gave written informed consent prior to the experiment, and received an honorarium of 8 € per hour for their participation. Participants were not pre-screened for musical or dance training, which ranged from 0 to 21 years (all amateurs). Thirteen and eight participants had trained in music and dance (but none in swing dance), respectively, amongst whom six had trained in both. The mean duration of music and dance training was 5.3 years (SD = 5.2) and 2.6 years (SD = 3.4), respectively. The study had been approved by the ethics commission of Technical University of Munich, and was conducted in accordance with the ethical standards of the 1964 Declaration of Helsinki.

#### Stimuli and Materials

The visual stimuli consisted of a human PLF performing basic steps of Charleston and Balboa dance in two different tempi. The stimuli had been generated by recording a swing dancer performing these steps using a 3-D motion capture system (Qualisys Oqus, 8 cameras at a sampling rate of 200 Hz, with 13 markers attached to the joints, Johansson, 1973) paced by metronomes with an IBI of 500 and 550 ms, respectively. The stimuli were a subset of the movement sequences used in a recent study (Su, 2016), where the stimuli preparation and construction were reported in detail. The description here will thus be brief.

Each dance was performed in continuous cycles, with one cycle corresponding temporally to eight metronome beats (**Figure 1**). In both dances, the trunk bounced vertically at every beat (beat 1–8), which was conveyed by movement patterns of the shoulder and the hip markers on both sides. The limbs (legs and arms for Charleston, and legs for Balboa) moved laterally at every second beat (beat 1, 3, 5, and 7), where the leg movements were conveyed by the knee and the foot markers on both sides, **Figures 2A,B**. The PLF performed these movements either with horizontal TM, or mostly in place (no TM, see Su, 2016). The best cycle amongst the recorded ones performed in a given condition (i.e., at a given tempo, with or without TM) was looped as visual stimuli. The Charleston dance presented here was authentic of the repertoire. The Balboa dance was presented both as in the original repertoire, wherein the legs moved at the same tempo as the trunk, and in a modified version, wherein the legs moved at half of the trunk tempo (see Su, 2016). The inclusion of both versions of the Balboa was meant to compare effects of leg movements that moved at the same metrical level as the bounce (the original version), or at one metrical level higher (the modified version).

The main manipulation was the presence or absence of the limb movements in parallel to the trunk bouncing. For Charleston, four movement variations were created: (1) the whole body moved naturally as recorded (termed ''Trunk + Arms + Legs'' to reflect the moving body parts); (2) the arm movements were removed by replacing the trajectories of the elbow and hand markers on both sides with similar ones as in Balboa, i.e., the palms were placed on the hips throughout (termed ''Trunk + Legs''); (3) the leg movements were removed by replacing the trajectories of the knee and foot markers on both sides with a constant position on the X and Z dimension (taken from the first frame), while their Y positions (along the sagittal plane) were made to change in the same magnitude as the hip markers (termed ''Trunk + Arms''); and (4) both the leg and the arm movements were removed by combining manipulations in (2) and (3), leaving only trunk movements intact (termed ''Trunk only''). For Balboa, two variations were introduced: (1) natural as recorded (''Trunk + Legs'', as there was no arm movement in Balboa); and (2) leg trajectories removed in the same manner as described in Charleston (''Trunk only''). Note that all the manipulations were carried out on the first (natural) movement condition, and thus the trunk movement was identical across all conditions for each dance. Besides, in conditions where the leg movements were artificially removed, all the leg markers remained present; if the PLF moved with TM, the leg markers moved back and forth with the upper body (as if sliding on wheels), and thus the image of a humanlike figure was preserved throughout the sequence.

The 3-D motion data of each dance were presented as point-light display on a 2-D monitor, using routines of Psychophysics Toolbox version 3 (Brainard, 1997) running on Matlab® R2012b (Mathworks). The function moglDrawDots3D allowed for depth perception in a 2-D display. The PLF was represented by 13 white discs against a black background, each of which subtended 0.4° of visual angle. The whole PLF subtended approximately 5° (width) and 12° (height) when viewed at 80 cm. The PLF was displayed facing the observers, in a configuration as if the observers were watching from 20° to the left of the PLF, which served to optimize depth perception of biological motion in a 2-D environment.

#### Procedure and Design

The stimuli and experimental program were controlled by a customized Matlab script and Psychtoolbox version 3 routines running on a Linux Ubuntu 14.04 Long Term Support (LTS) system. The visual stimuli were displayed on a 17-inch CRT monitor (Fujitsu X178 P117A) with a frame frequency of 100 Hz at a spatial resolution of 1024 × 768 pixels. Participants sat at a viewing distance of 80 cm. The finger taps were registered by a customized force transducer that was connected to the Linux computer via a data acquisition device (Measurement Computing®, USB-1608FS). Data were collected at 200 Hz, which was controlled and synchronized on a trial basis by the experimental program in Matlab. Participants wore closed studio headphones (AKG K271 MKII) to avoid potential auditory distraction.

Participants self-initiated each trial by pressing the space key. On each trial, a PLF was shown performing either a Charleston or a Balboa sequence cyclically in one of the two tempi, either with or without TM. For each Tempo × TM condition, there were four movement variations regarding the moving body parts for Charleston and two for Balboa (as described in ''Stimuli and Materials'' Section). Participants' task was to observe the PLF movement as a whole and tap to the bounce of the trunk in a synchronized manner. They tapped with the index finger of their dominant hand on the force transducer. In total six complete movement cycles were presented on each trial, equaling 48 bounces.

The experiment consisted of the following conditions: 4 (moving part) × 2 (TM) × 2 (tempo) for Charleston, and 2 (movement version: original or modified) × 2 (moving part) × 2 (TM) × 2 (tempo) for Balboa. All the conditions were presented in six blocks of 36 trials each, with all the conditions balanced across blocks and the order of conditions randomized within a block. Participants underwent six practice trials before starting the experiment. The entire experiment lasted around 2 h, completed in two sessions of three blocks each either on different days or on the same day with a longer pause (at least half an hour) in between.

#### Data Analysis

The timing of each tap was extracted by identifying the time point right before the amplitude of the measured force data exceeded a predefined threshold. The tap times were temporally aligned to the start of the visual stimulus, allowing for calculation of absolute asynchronies between each tap and the corresponding visual signal. The stimulus onset time, i.e., the beat as communicated by each bounce, was derived from the kinematic profile of the four trunk markers (shoulders and hips on both sides) averaged along the time vector for each sequence. Specifically, visual beat in a periodic biological motion may be communicated by both the position and the velocity parameters, such as the recurrent lowest position or the recurrent peak velocity of the bounce (Su, 2014a). Although several studies support the role of velocity cues (Luck and Sloboda, 2009; Wöllner et al., 2012; Su, 2014a), the position information might still influence where the beat was perceived (Su, 2014a; Booth and Elliott, 2015). As such, two sets of stimulus beat onset times were extracted, one based on the peak vertical velocity (termed ''velocity beat'') and the other based on the vertical end position (termed ''position beat'') of the bounce, with the former preceding the latter in every bounce (**Figure 2C**). Tap times were first calculated relative to each beat parameter separately. The first two taps in each trial were discarded from analyses.
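The tap-extraction step described above can be illustrated with a minimal Python sketch. It is not the authors' Matlab routine: the function names, the threshold value, and the rule for re-arming the detector once the force falls back below threshold are assumptions made for illustration. Only the 200 Hz sampling rate, the "sample right before the threshold is exceeded" criterion, and the removal of the first two taps come from the text.

```python
def tap_onsets(force, fs_hz, threshold):
    """Return tap times in seconds.

    A tap is timed at the sample right before the force first exceeds
    `threshold`. The detector re-arms once the force is back at or below
    the threshold (an assumption; the source does not describe this step).
    """
    onsets = []
    above = False
    for i, f in enumerate(force):
        if f > threshold and not above:
            # Upward crossing at sample i: take the preceding sample.
            onsets.append(max(i - 1, 0) / fs_hz)
            above = True
        elif f <= threshold:
            above = False
    return onsets


def drop_initial_taps(taps, n=2):
    """Discard the first n taps of a trial (n = 2 in the text)."""
    return taps[n:]
```

Because the force data and the visual stimulus share a common trial start, the returned times can be compared directly with beat onset times expressed on the same time axis.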

The main index of synchronization was the stability of the taps relative to the beats (Repp and Su, 2013). As visual synchronization is known to be variable and the present stimuli were complex, circular statistics (Berens, 2009) were applied to analyze the tap-beat phase relations (Hove et al., 2013b; Iversen et al., 2015; see also Hove et al., 2012, for circular statistics applied to SMS with non-isochronous beats). Each tap time was converted to the phase relative to its closest beat on a circular scale (0–360° between two consecutive beats). For a given trial, the tap-beat stability was indexed by R, the mean resultant length of the relative phase vector. R ranged from 0 (taps distributed uniformly around beat onsets, suggesting no synchronization) to 1 (perfect synchronization with taps distributed unimodally relative to beat onsets, see also Kirschner and Tomasello, 2009, for a comprehensive description). The mean direction of the relative phase, θ, was also calculated for each trial, indexing the mean magnitude and direction of the tap-beat asynchronies. Both R and θ were first analyzed with respect to the velocity beat and the position beat separately.
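As a hedged illustration of these two indices, the following Python sketch converts tap times to relative phases and computes R and θ. It uses only the standard library rather than the circular statistics toolbox cited above. Scaling each asynchrony by the local inter-beat interval, and the sign convention (negative phase when the tap precedes the beat), are assumptions chosen to match how θ is reported in the Results.

```python
import cmath
import math


def relative_phases(taps, beats):
    """Phase (degrees) of each tap relative to its closest beat.

    Negative values mean the tap preceded the beat. The asynchrony is
    scaled by the adjacent inter-beat interval so that one full interval
    spans 360 degrees. Requires at least two beats.
    """
    phases = []
    for t in taps:
        k = min(range(len(beats)), key=lambda i: abs(beats[i] - t))
        if t >= beats[k] and k + 1 < len(beats):
            interval = beats[k + 1] - beats[k]   # tap lags: following interval
        elif k > 0:
            interval = beats[k] - beats[k - 1]   # tap leads: preceding interval
        else:
            interval = beats[1] - beats[0]       # tap before the first beat
        phases.append(360.0 * (t - beats[k]) / interval)
    return phases


def circular_summary(phases_deg):
    """Return (R, theta): mean resultant length and mean direction in degrees."""
    vectors = [cmath.exp(1j * math.radians(p)) for p in phases_deg]
    mean_vector = sum(vectors) / len(vectors)
    return abs(mean_vector), math.degrees(cmath.phase(mean_vector))
```

With taps that consistently lead each beat by a tenth of the interval, this yields R close to 1 and θ close to −36°. Phases spread evenly around the circle yield R close to 0, in which case θ is not meaningful.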

### Results

Analyses were carried out for the two dances separately to answer different questions. For Charleston, of interest was the effect of leg movement, arm movement, or both, on synchronization with the bounce. For Balboa, it was of interest whether the effect of leg movement differed when the legs moved at the same tempo or at half the tempo of the bounce. All analyses of variance (ANOVAs) reported in this study were repeated-measures ANOVAs.

It should be noted that the different TM × tempo stimuli were generated by recording these movements performed separately in the respective condition, and the trajectory of each marker was not further spatially or temporally adjusted (in order to present authentic biological motion stimuli). There were thus inevitable differences in deviation from isochrony, as well as variations of kinematics, across different conditions. As such, results of TM and tempo will be focused on whether they interact with the main variable of interest, the moving part, in order to verify whether the effect of moving part generalizes to different movement conditions. In case of main effects of TM and tempo, or interactions between the two, the results will not be further discussed if they may be attributed to differences in the variability of beat timing (i.e., higher or lower tapping variability associated with higher or lower variability of the beat onset times). The same rules will apply to results of Experiment 2.

#### Determining the Synchronization Target

First, in order to identify which kinematic feature participants synchronized to, the individual means of angular direction (θ) were analyzed in a full factorial ANOVA for Charleston: 4 (moving part) × 2 (TM) × 2 (tempo) × 2 (beat parameter: velocity or position), and for Balboa: 2 (movement version) × 2 (moving part) × 2 (TM) × 2 (tempo) × 2 (beat parameter). In both ANOVAs, there was a main effect of beat parameter, F(1,17) = 27.26, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.62, and F(1,17) = 89.36, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.84, both showing that taps were closer (less negative θ) to the velocity than to the position beat. For the Charleston stimuli, mean θ was −21.07° and −69.75° for the velocity and the position beat, respectively. For the Balboa stimuli, mean θ was −24.17° and −82.71°, respectively. Given that synchronization stability (R) was comparable with respect to the velocity and the position beat (both beat parameters yielded mean R = 0.70 for Charleston; both parameters yielded mean R = 0.75 for Balboa), the smaller magnitude of asynchrony was taken as evidence that the velocity beat was the preferred synchronization target in this experiment. The subsequent analyses were conducted on R with respect to the velocity beat.
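The partial eta squared values reported alongside each F test can be recovered from the F statistic and its degrees of freedom through the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The short Python sketch below is offered as a reader's check, not as part of the original analysis; the function name is an illustrative choice.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    ss_ratio = f_value * df_effect  # proportional to SS_effect / MS_error
    return ss_ratio / (ss_ratio + df_error)
```

For the two main effects of beat parameter above, F(1,17) = 27.26 and F(1,17) = 89.36, this gives approximately 0.62 and 0.84, matching the reported effect sizes.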

#### Synchronization to the Beat

For Charleston, the individual means of R were submitted to a 4 (moving part) × 2 (TM) × 2 (tempo) ANOVA. Moving part had no significant effect on R, F(3,51) = 1.30, p > 0.2, η<sub>p</sub><sup>2</sup> = 0.07, nor interaction with any other variable. The main effect of tempo was significant, F(1,17) = 9.71, p < 0.01, η<sub>p</sub><sup>2</sup> = 0.36, showing more stable synchronization for the faster tempo (IBI = 500 ms). The main effect of TM was also significant, F(1,17) = 25.16, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.60, showing greater stability for synchronizing to movements with TM than without (**Figure 3A**). There was a significant TM × tempo interaction, F(1,17) = 7.44, p < 0.02, η<sub>p</sub><sup>2</sup> = 0.30; follow-up one-way ANOVAs showed that the effect of TM was only significant for IBI = 500, F(1,17) = 40.29, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.70, and not for IBI = 550, p > 0.3. The main effect of TM as well as its interaction with tempo could, however, be due to the corresponding variability of the stimulus beat.

For Balboa, the individual means of R were submitted to a 2 (movement version) × 2 (moving part) × 2 (TM) × 2 (tempo) ANOVA. Again, the main effect of moving part was not significant, F(1,17) = 1.62, p > 0.2, η<sub>p</sub><sup>2</sup> = 0.09, nor did it interact with other variables. Significant main effects were found for movement version, F(1,17) = 8.24, p < 0.02, η<sub>p</sub><sup>2</sup> = 0.33 (greater R when the legs moved at half the bounce tempo), for TM, F(1,17) = 6.28, p < 0.03, η<sub>p</sub><sup>2</sup> = 0.27 (greater R for movements without TM than with), and for tempo, F(1,17) = 17.35, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.51 (greater R for movement at the faster tempo), **Figures 3B,C**. These patterns were, however, in the same direction as the difference in stimulus beat variability between the respective conditions.

In sum, stability of synchronizing to the bounce was not affected by the presence of lateral limb movements, which was true whether the limbs moved at the same metrical level as the bounce, or at one level higher. For the Charleston stimuli, taps were more synchronized to the bounce at the faster tempo (nominal IBI = 500 ms). As an additional note, taps generally preceded the beat, which is reminiscent of the negative mean asynchronies (NMA) typically found in SMS with auditory stimuli (Repp and Su, 2013).

### EXPERIMENT 2: SYNCHRONIZING TO THE LEG MOVEMENTS

This experiment examined whether synchronization to the beat was improved by the presence of metrical subdivisions. Participants tapped synchronously to the leg movements (beat) of both dances, in which the trunk bounced simultaneously (subdivisions) or not in the stimuli. While referred to as ''beat'' borrowing the musical terminology, the leg movements naturally deviated more from isochrony than the trunk bouncing. Nevertheless, observers were able to perceive and synchronize with the rhythm of these movements (Su, 2016).

### Methods

#### Participants

Twenty volunteers (nine males, mean age 25.3 years, SD = 5.2) took part in this experiment. Fourteen and 10 participants had trained in music and dance, respectively, amongst whom six had trained in both. The music and dance training duration ranged from 0 to 16 years (all amateurs), with a mean of 5.1 years (SD = 4.3) and 3.6 years (SD = 5.0), respectively. The participant handling and ethics procedure was the same as in Experiment 1.

#### Stimuli and Materials

The visual stimuli consisted of the same PLF performing Charleston and Balboa dance, with and without TM, in two different tempi corresponding to the metronome IBI of 400 and 450 ms (indicating the trunk tempi). The two tempi were chosen because the previous study suggested that the leg movements and the bounce could be most readily perceived in parallel at these tempi (Su, 2016). For Balboa, only the modified version was included, in which the legs moved at half of the bounce tempo. In both dances, the legs thus moved at an interval of around 800 and 900 ms, respectively. For consistency, the tempo of the movement will still be referred to by the metronome IBI, 400 and 450 ms.

The main manipulation here was the presence or absence of the trunk bouncing simultaneously to the leg movements. For Charleston, three movement variations were included: (1) natural as it was (''Arms + Trunk + Legs''); (2) the arm movements were removed in the same manner as described in Experiment 1 (''Trunk + Legs''); and (3) both the arm and the trunk movements were removed (''Legs only''), in which the trunk bounce was removed by keeping constant the vertical position of the shoulder and hip markers, while leaving their positions in the horizontal plane as natural (see Su, 2016, Experiment 2). For Balboa, two variations were included: (1) natural as it was (''Trunk + Legs''); and (2) with the trunk bounce removed (''Legs only'').

#### Procedure and Design

The setup and the procedure were the same as in Experiment 1. The task was now to tap to the leg movements in a synchronized manner. The experimenter made sure every participant understood the pattern of leg movements they should tap to. Eight complete movement cycles were presented in each trial, equaling 32 leg movements.

In total the following conditions were included: 3 (moving part) × 2 (TM) × 2 (tempo) for Charleston, and 2 (moving part) × 2 (TM) × 2 (tempo) for Balboa. All the conditions were presented in eight blocks of 20 trials each, with the conditions balanced across blocks and the order of conditions randomized within a block. Participants underwent six practice trials before starting the experiment. The entire experiment lasted around 2 h, completed in two sessions of four blocks each.
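The factorial design of Experiment 2 can be enumerated programmatically, as in the Python sketch below. The 12 Charleston and 8 Balboa conditions add up to 20, which equals the number of trials per block. The sketch therefore assumes that every block contains each condition exactly once in random order; this is an inference from those counts, not a detail stated in the text. The condition labels and function names are illustrative.

```python
import itertools
import random


def experiment2_conditions():
    """List all (dance, moving part, TM, metronome IBI in ms) combinations."""
    tm_levels = ["with TM", "without TM"]
    tempi_ms = [400, 450]
    charleston = itertools.product(
        ["Charleston"],
        ["Arms + Trunk + Legs", "Trunk + Legs", "Legs only"],
        tm_levels,
        tempi_ms,
    )
    balboa = itertools.product(
        ["Balboa"], ["Trunk + Legs", "Legs only"], tm_levels, tempi_ms
    )
    return list(charleston) + list(balboa)


def make_blocks(conditions, n_blocks=8, seed=None):
    """Return n_blocks lists, each a fresh random ordering of all conditions."""
    rng = random.Random(seed)
    blocks = []
    for _ in range(n_blocks):
        block = list(conditions)
        rng.shuffle(block)  # randomize the order within this block
        blocks.append(block)
    return blocks
```

Under this assumption each condition is presented eight times per participant, once in each block.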

#### Data Analysis

The visual beat was first defined by the peak velocity and the end position of the foot markers separately. The velocity beat was calculated as the time point of the peak 3D (absolute) velocity in each leg trajectory. The position beat was defined by the time point of the end position in the Y (sagittal) and in the Z (vertical) dimension for each trajectory in Charleston and Balboa, respectively (**Figures 2D,E**). The velocity beat always occurred prior to the position beat. The tap times, synchronization stability (R), and mean relative phase (θ) were analyzed in the same manner as in Experiment 1.
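A minimal Python sketch of the two beat definitions for a single leg trajectory follows, under stated assumptions. Speed is estimated by first differences between successive samples. The "end position" is taken as the sample at which the chosen coordinate is farthest from its value at the start of the trajectory. The source does not specify either computational detail, and the function names are illustrative.

```python
import math


def velocity_beat(positions, fs_hz):
    """Time (s) of peak 3D speed within one trajectory.

    `positions` is a list of (x, y, z) samples recorded at `fs_hz`.
    """
    speeds = [
        math.dist(positions[i + 1], positions[i]) * fs_hz
        for i in range(len(positions) - 1)
    ]
    peak = max(range(len(speeds)), key=speeds.__getitem__)
    # speeds[i] is estimated between samples i and i + 1: report the midpoint.
    return (peak + 0.5) / fs_hz


def position_beat(positions, fs_hz, axis):
    """Time (s) at which coordinate `axis` is farthest from its start value."""
    start = positions[0][axis]
    displacement = [abs(p[axis] - start) for p in positions]
    end = max(range(len(displacement)), key=displacement.__getitem__)
    return end / fs_hz
```

For a smooth, single-peaked movement such as a half-cosine displacement profile, the velocity beat falls near the middle of the trajectory and the position beat at its end, consistent with the velocity beat preceding the position beat.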

### Results

#### Determining the Synchronization Target

To examine which kinematic feature served as the synchronization target, the individual means of θ were first analyzed in a full factorial ANOVA, excluding the ''Trunk + Arms + Legs'' condition in Charleston: 2 (dance style) × 2 (moving part) × 2 (TM) × 2 (tempo) × 2 (beat parameter: velocity or position), which yielded a main effect of beat parameter, F(1,19) = 12455, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.99. It was found that taps lagged the velocity beat (mean θ = 51.49°) while leading the position beat (mean θ = −22.92°), suggesting that both parameters might have been taken into account for synchronization. In addition, the full ANOVA with the same factors was conducted on individual means of R, which revealed a significant interaction between beat parameter and dance style, F(1,19) = 1558, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.98. Partial ANOVAs showed that synchronization with Charleston stimuli was better in terms of the velocity beat, F(1,19) = 59.99, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.76 (R = 0.81 for velocity and R = 0.75 for position beat), whereas synchronization with Balboa stimuli was better in terms of the position beat, F(1,19) = 811.4, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.98 (R = 0.65 for velocity and R = 0.85 for position beat). As such, it was assumed that the velocity and the position beat served as the more effective synchronization target for the Charleston and the Balboa stimuli, respectively. R values for each dance in the subsequent analyses were calculated according to the respective synchronization target.

See **Table 1** for an overview of the timing parameters of the stimulus beat for each movement condition, as well as the observed mean R. Synchronization stability generally agreed with the variability of beat onset times, i.e., more regular velocity-defined beat for the Charleston stimuli and more regular position-defined beat for the Balboa stimuli.

#### Synchronization to the Beat

The individual means of R were submitted to a 2 (dance style) × 2 (moving part) × 2 (TM) × 2 (tempo) ANOVA, not including the ''Trunk + Arms + Legs'' condition in Charleston. All four main effects were significant: (1) dance style, F(1,19) = 62.27, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.77, showing more stable synchronization with Balboa than with Charleston; (2) moving part, F(1,19) = 29.83, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.61, showing more stable synchronization to the leg movement with than without the simultaneous trunk movement; (3) TM, F(1,19) = 234.3, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.92, showing more stable synchronization to movements with TM; and (4) tempo, F(1,19) = 34.73, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.65, showing more stable synchronization to the slower movement tempo (IBI = 450 ms; **Figures 4A,B**). The effects of TM and tempo may be associated with the variability of the stimulus beat.

Moving part was involved in a significant three-way interaction: dance style × moving part × tempo, F(1,19) = 5.69, p = 0.027, η<sub>p</sub><sup>2</sup> = 0.23. Follow-up partial ANOVAs revealed that the moving part × tempo interaction was only (about) significant for Charleston, F(1,19) = 4.77, p = 0.042, η<sub>p</sub><sup>2</sup> = 0.20, but not for Balboa, F(1,19) = 0.76, p = 0.39, η<sub>p</sub><sup>2</sup> = 0.04.

TABLE 1 | Timing parameters of the stimulus beat (leg movements) in each movement condition, and the corresponding synchronization stability measured in Experiment 2.


Mean IBI: the mean inter-beat interval calculated from the leg movements of each condition. CV: the variability of the beat onset times indexed by the coefficient of variation of the IBIs. Mean Abs. Dev.: the mean absolute deviation of the IBIs from twice the metronome interval. R: the mean synchronization stability across participants for each condition, pooling together conditions with and without trunk movement.
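The three timing parameters defined in this note can be computed from a list of beat onset times, as in the following Python sketch. The use of the sample standard deviation (n − 1 in the denominator) for the CV is an assumption, since the note does not say which estimator was used. The function name is illustrative.

```python
import statistics


def ibi_summary(beat_onsets_ms, metronome_ibi_ms):
    """Return (mean IBI, CV, mean absolute deviation) for leg-movement beats.

    The deviation is measured from twice the metronome interval, because
    the legs move at half the tempo of the trunk bounce.
    """
    ibis = [b - a for a, b in zip(beat_onsets_ms, beat_onsets_ms[1:])]
    mean_ibi = statistics.fmean(ibis)
    cv = statistics.stdev(ibis) / mean_ibi  # sample SD: an assumption
    target = 2 * metronome_ibi_ms
    mean_abs_dev = statistics.fmean([abs(ibi - target) for ibi in ibis])
    return mean_ibi, cv, mean_abs_dev
```

For example, onsets at 0, 800, 1620, 2400, and 3200 ms with a 400 ms metronome give a mean IBI of 800 ms, a CV of about 0.02, and a mean absolute deviation of 10 ms.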

Post hoc one-way ANOVAs conducted for each tempo in the Charleston conditions showed better synchronization in the presence of the bounce at the faster tempo (IBI = 400 ms), F(1,19) = 17.99, p < 0.001, η<sub>p</sub><sup>2</sup> = 0.49, but only marginally so at the slower tempo (IBI = 450 ms), F(1,19) = 3.99, p = 0.06, η<sub>p</sub><sup>2</sup> = 0.17.

The interaction in Charleston was further confirmed by contrasting the effect of moving part for each tempo, using 95% confidence intervals (CI) of the difference scores across participants (Masson and Loftus, 2003; Cumming, 2014), i.e., difference in R between conditions with and without trunk for each tempo in Charleston. As shown in **Figure 4C**, only for IBI = 400 ms was the difference between conditions greater than zero at the 95% CI.
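A confidence interval of this kind can be sketched as follows in Python. This is the ordinary t-based interval on paired difference scores, offered as an approximation of the approach described. The exact procedure used in the cited references may differ, and the function name is illustrative. The critical t value is passed in by the caller: for 20 participants (19 degrees of freedom) the two-sided 95% value is about 2.093.

```python
import math
import statistics


def difference_ci(with_trunk, without_trunk, t_crit):
    """CI for the mean per-participant difference in R (with minus without).

    `t_crit` is the critical t value for the desired confidence level and
    n - 1 degrees of freedom, e.g. about 2.093 for 95% with n = 20.
    """
    diffs = [a - b for a, b in zip(with_trunk, without_trunk)]
    mean_diff = statistics.fmean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff - t_crit * sem, mean_diff + t_crit * sem
```

An interval whose lower bound lies above zero corresponds to the pattern described for IBI = 400 ms, where synchronization was reliably more stable with the trunk bounce than without it.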

Finally, to examine the effect of the presence of arm movement compared to the other two moving part conditions in Charleston, difference scores were computed between Arms + Trunk + Legs and Legs only, by subtracting the latter from the former (**Figure 5A**), as well as between Arms + Trunk + Legs and Trunk + Legs (**Figure 5B**). The effects as indexed by the 95% CI of the difference scores were contrasted for each tempo and TM condition separately. As shown, synchronization was better with Arms + Trunk + Legs compared to Legs only at the slower tempo (IBI = 450 ms) with TM. None of the other comparisons showed an effect.

In summary, across all movement variations, synchronizing to the leg movement was more stable when the trunk was bouncing simultaneously at twice the leg tempo. For the Charleston stimuli, the effect of trunk movement was more evident at the faster tempo (IBI = 400 ms). Between the two dances, the velocity beat served as a more effective synchronization target for Charleston, whereas the position beat was more effective for Balboa. Lastly, while the presence of trunk movement assisted tapping to the leg movement, adding another lateral limb movement (the arms) around the same tempo as the legs did not further improve synchronization.

### DISCUSSION

The present study investigated visual SMS with biological motion stimuli of a dancing PLF. The dance movements were such that two metrical levels of periodicity were visually available, with the lateral leg movements being twice as slow as the vertical trunk bouncing, and the former more often perceived as beat (Su, 2016). To verify whether synchronization to metrical visual stimuli resembles that to auditory rhythms, two tapping experiments examined effects of metrical accent (leg movement) on synchronization to the pulse (bounce), as well as effects of metrical subdivision (bounce) on synchronization to the beat (leg movement). The main results show that, while metrical accents did not influence synchronization to a lower-level pulse, metrical subdivisions improved synchronization to the beat compared to the absence thereof. The latter finding replicated the subdivision benefit consistently shown in SMS with auditory rhythms (Repp, 2003; Zendel et al., 2011; Madison, 2014).

That the subdivision benefit was observed using visual dance stimuli has at least three theoretical implications. First, it extends a well-established auditory finding to the visual modality, suggesting similar rhythm processing across the two senses (Hove et al., 2013a). In auditory SMS, this effect has mainly been shown in metronomic stimuli, but not as consistently replicated in real music (Martens, 2011). One possible reason is that the rich metrical structure in music, perhaps strengthened by other cues such as pitch or melodic contour (Lerdahl and Jackendoff, 1983), may have led to a ceiling effect of SMS to the beat. The present result generalizes the subdivision effect to realistic visual stimuli across different dance styles and movement variations. At the same time, the effect can be argued to reside within the visual modality, as it seems unlikely that such complex stimuli would be recoded into auditory representation to guide behaviors (Guttman et al., 2005; Grahn et al., 2011). Secondly, building on recent findings of visual SMS with a single motion periodicity (Hove et al., 2013b; Su, 2014b; Iversen et al., 2015), the present biological stimuli contained multiple periodicities, making this effect not only ecologically plausible in the visual domain, but also comparable to music. Visual rhythms can thus be defined beyond simple stimuli to mirror their auditory counterpart. While both musical rhythms and the trajectories of dance movement often deviate from isochrony, in both cases the listeners and observers are able to extract the underlying regularity and track hierarchical levels of periodicity simultaneously, which in turn modulates motor behaviors (Large and Palmer, 2002; Large et al., 2002). This supports the idea that similar sensorimotor mechanisms may underlie auditory synchronization to music and visual synchronization to dance. 
Finally, regarding rhythm in the action-perception framework (Prinz, 1997; Maes et al., 2014), the subdivision benefit confirms how the metrical structure is visually perceived in dance movements that embody this structure (Su, 2016). This in turn suggests that rhythm perception, which is a prerequisite for SMS (Repp and Su, 2013), can be evoked by temporally structured auditory stimuli, as well as visual information of movements performed in response to these auditory rhythms. As both can engage the motor system (Schubotz, 2007; Hove et al., 2013a), observing rhythmic movements being arguably a form of motor simulation (Kirsch and Cross, 2015), the auditory rhythm of music and the visual rhythm of dance may indeed share a common sensorimotor representation.

In auditory musical rhythm as well as visual rhythm of biological motion, metrical subdivisions appear to facilitate synchronization by providing additional temporal information for the upcoming, attended beat (Madison, 2014). In the auditory stimuli, the effect can be explained by means of predictive temporal tracking (Large and Jones, 1999; Repp, 2008b). In dynamic visual stimuli such as the present ones, it may involve the kinematics of one body part (the trunk) predicting that of another (the legs). The action-observation literature proposes that human observers form internal representations of familiar movement kinematics, which allow them to predict the spatiotemporal course of an action (Stadler et al., 2012). The predictive mechanism seems to apply to the action as a whole, rather than forming separate expectations for different body parts. As such, the kinematic information of all concurrently moving parts in a dance movement might be (automatically) integrated to make the eventual prediction of each "beat". In this light, the bounce not only serves a finer temporal scale, but also provides additional kinematic cues for the leg movements. While it is beyond the current scope to elaborate on how the kinematic cues were visually integrated across different moving parts, it is worth noting that the obtained subdivision benefit could not have been confounded with participants tapping to the trunk movement when it was present. As the beat onset times of the trunk and of the legs did not coincide with each other, nor maintain a constant phase difference (see Table S1 in the "Supplementary Material" for a summary of these parameters), synchronization stability analyzed with respect to the leg movements would not have benefitted from taps synchronized to the trunk.

The other question asked in this study, i.e., whether tapping to the pulse would be stabilized by the metrical accent, was met with a negative answer. There has been little research in the auditory (and none in the visual) domain to address this issue, except for one study on offbeat tapping (Keller and Repp, 2005). The present result suggests that, visually, imposing an additional metrical frame yields no more gain on temporal coordination than what the lower-level periodicity already entails. The same might be speculated for the auditory stimuli. While the brain does respond differentially to subjectively accented and unaccented events in an isochronous auditory sequence (Iversen et al., 2009; Potter et al., 2009; Fujioka et al., 2010), there is thus far no evidence that enhanced anticipation at the metrically accented level leads to overall better motor synchronization to the lower-level pulse. One possible reason applicable to both modalities is that the metrical accent yields alternating on-beat and off-beat positions, and the effect of one level may cancel out that of the other (Repp et al., 2008). Notably, the null result of metrical accent on SMS also applies to adding accents at the same level as the pulse (i.e., the original version Balboa in Experiment 1), as well as superimposing another accent at the same level as the existing one (i.e., adding arm movement along with leg movement in Charleston in Experiment 2). As such, it seems an additional metrical periodicity has a functional impact on SMS only when it subdivides the target IBI (Madison, 2014).

Horizontal TM in the dance movement did not modulate the effect of metrical accent or metrical subdivision on SMS. This was more surprising regarding the subdivision effect for Balboa dance. As the leg movements in Balboa did not consist of large lateral trajectories, one would expect that the horizontal spatial frame imposed by TM (such that the leg trajectories were additionally marked by the regular positions on the ground) would be necessary to induce visual metrical accent. The results suggest that, regardless of the magnitude of their trajectories, the leg movements were readily differentiated from the trunk bouncing as being more accentuating<sup>2</sup>. This pattern is consistent with our recent work (Su and Salazar-López, in press), showing that regular leg movements in the Flamenco dance repertoire are perceptually prioritized as visual beat over other moving body parts. Results here thus extend this finding to different movements and dance genres. The role of leg movements in visual beat perception and synchronization is reminiscent of the finding that the preferred tempo in musical rhythms (Moelants, 2002) corresponds roughly to that in locomotion (MacDougall and Moore, 2005). In both cases, the perceptual preference of beat seems to be linked to the motor representation of the lower limbs. Evolutionarily, this finding suggests that the functional purpose of rhythmic sounds may be at least in part associated with rhythmic patterns of locomotion, or other movements generated by the legs, which is critical for survival. Depending on the affinity to different sensory modalities, in some species these cues may also be extracted via movement observation (Nagasaka et al., 2013; Kirsch and Cross, 2015).

<sup>2</sup>This effect was not likely contingent upon a different amount of motion between the legs and the trunk. Within a cycle of Charleston, the total traveled distance of the legs (M = 382.35 cm) was greater than that of the trunk (M = 236.90 cm), whereas in Balboa the traveled distance of the trunk (M = 196.36 cm) was greater than that of the legs (M = 163.17 cm).

Finally, a few other effects are worth brief discussions. First, in Experiment 2 the subdivision benefit in the Charleston stimuli was more evident for the faster movement tempo (IBI around 400 ms for the bounce and 800 ms for the legs). Interestingly, in the recent study employing the same stimuli, the effect of subdivision (bounce) on slowing tempo perception of the leg movement was also more obvious at this tempo (Su, 2016). There might be a range of tempi in which the two metrical levels of the present movements—embodied by different body parts—can be most optimally perceived in parallel. Similarly, synchronization to the bounce of Charleston dance was more stable with the IBI around 500 ms than 550 ms (Experiment 1). Future research may investigate how different movement types performed by different body parts yield the optimal tempo for visual rhythm perception and SMS. Another notable result is that the synchronization target may be served by different kinematic parameters between different kinds of movements. As observed, while both velocity and position cues might influence visual beat perception (Su, 2014a), the velocity cues are more stable and thus more useful when synchronizing with movements of large lateral trajectories, such as the leg movement of Charleston. Position cues, on the other hand, may convey the more regular beat when the movement amplitudes are smaller and the trajectories more concentrated on one dimension (vertical in the legs of Balboa).

In conclusion, the present study demonstrates that SMS with visual rhythms of dance resembles SMS with auditory rhythms of music, in that metrical subdivisions benefit synchronization to the beat. Synchronization to the pulse, on the other hand, is not further improved by a higher-level metrical accent. While biological motion yields spatiotemporally complex visual signals, which may not be as precise as a metronome, rhythmic movements as in dance can embody the metrical structure in a comparable manner as music, which by observation modulates synchronization behaviors. The present results not only highlight the similarity in rhythm processing between the two sensory modalities, but more importantly link rhythm cognition of music and dance in a common framework of action-perception coupling.

### AUTHOR CONTRIBUTIONS

Y-HS conceptualized and designed the study, prepared the stimuli, conducted the experiments (data collection carried out by the research assistant), analyzed the data, interpreted the results, and wrote the manuscript.

### FUNDING

This work and the author were supported by a grant from the German Research Foundation (DFG), SU 782/1–2. The publication of this work was supported by the German Research Foundation (DFG) and the Technical University of Munich (TUM) in the framework of the Open Access Publishing Program.

### ACKNOWLEDGMENTS

The author thanks Frank Häusler and Alan Armstrong for helping set up the Linux system, Theresa Neumayer for data collection, and the reviewers for useful suggestions.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2016.00186/abstract


**Conflict of Interest Statement**: The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Su. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Effects on Inter-Personal Memory of Dancing in Time with Others

#### Matthew H. Woolhouse<sup>1,2</sup>\*, Dan Tidhar<sup>3</sup> and Ian Cross<sup>3</sup>

<sup>1</sup> Digital Music Lab, School of the Arts, McMaster University, Hamilton, ON, Canada, <sup>2</sup> McMaster Institute for Music and the Mind, McMaster University, Hamilton, ON, Canada, <sup>3</sup> Faculty of Music, University of Cambridge, Cambridge, UK

We report an experiment investigating whether dancing to the same music enhances recall of person-related memory targets. The experiment used 40 dancers (all of whom were unaware of the experiment's aim), two-channel silent-disco radio headphones, a marked-up dance floor, two types of music, and memory targets (sash colors and symbols). In each trial, 10 dancers wore radio headphones and one of four different colored sashes, half of which carried cat symbols. Using silent-disco technology, one type of music was surreptitiously transmitted to half the dancers, while music at a different tempo was transmitted to the remaining dancers. Pre-experiment, the dancers' faces were photographed. Post-experiment, each dancer was presented with the photographs of the other dancers and asked to recall their memory targets. Results showed that same-music dancing significantly enhanced memory for sash color and sash symbol. Our findings are discussed in light of recent eye-movement research that showed significantly increased gaze durations for people observing music-dance synchrony versus music-dance asynchrony, and in relation to current literature on interpersonal entrainment, group cohesion, and social bonding.

#### Edited by:

Henkjan Honing, University of Amsterdam, Netherlands

#### Reviewed by:

Jed A. Meltzer, Baycrest Hospital and University of Toronto, Canada Roberta Bianco, Max Planck Institute, Germany

### \*Correspondence:

Matthew H. Woolhouse woolhouse@mcmaster.ca

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 27 September 2015 Accepted: 28 January 2016 Published: 23 February 2016

#### Citation:

Woolhouse MH, Tidhar D and Cross I (2016) Effects on Inter-Personal Memory of Dancing in Time with Others. Front. Psychol. 7:167. doi: 10.3389/fpsyg.2016.00167

Keywords: music and dance, interpersonal entrainment, person perception, social bonding, silent disco, memory

## INTRODUCTION

Spontaneous coupling of behavior or coordinated joint-action in the absence of explicit instruction is an important feature of social interaction (Richardson and Dale, 2005; Sebanz et al., 2006). Repetitive and rhythmic movement synchrony between individuals, sometimes referred to as "interpersonal entrainment" (Clayton et al., 2005; Phillips-Silver and Keller, 2012), has been shown to have positive effects on perceived social relationships (Hove and Risen, 2009; Miles et al., 2009; Kirschner and Tomasello, 2010; Cirelli et al., 2014), and has been identified as an important factor in everyday social interaction (Shockley et al., 2003, 2009). In a series of studies, Wiltermuth and Heath (2009) showed that acting in synchrony with others increases cooperation by strengthening social attachment among group members. The positive effects on perceived social relationships found in research on person-to-person coordination fit well with accounts of group dancing in the ethnographic and sociological literature (Cross, 2008). In short, dancing is an important way to create situations in which people move either synchronously or in "sympathy", thereby establishing and reinforcing social bonds.

The positive social effects referred to above are also consistent with cognitive and evolutionary hypotheses concerning entrainment in dance and music<sup>1</sup>, which suggest that interpersonal entrainment helps people to direct their attentional foci toward one another (Large and Jones, 1999). In turn, coordinated, joint attention may support functions such as social bonding (Freeman, 2000; Merker, 2000; Brown et al., 2004; Cross, 2011), courtship (Catchpole and Slater, 1995; Miller, 2001), and coalition signaling (Hagen and Bryant, 2003). Although the underlying cognitive and neural mechanisms that enable complex behavioral coordination are not fully understood (Fairhurst et al., 2010), social attention has been so far considered one of the most likely candidates (Macrae et al., 2008), together with action simulation (Gazzola and Keysers, 2009; Rizzolatti and Sinigaglia, 2010), and feelings of cooperation (Lieberman, 2007). Lim and Young (2006), using animal models, have proposed a three-step process to explain neurobiological mechanisms regulating the development of social relationships. First, the animal must be motivated to approach and engage another individual. Second, the animal must be able to identify the individual based on social cues through the formation of social memories; and third, with appropriate conditions, a bond can form, leading to preferential interaction with that individual. Thus, circumstances that promote interpersonal memory are necessary for social bonding.

<sup>1</sup>For many world cultures, music and dance are not separable cultural categories; for example, Merriam (1964, p. 275) quotes Gbeho as stating that in the indigenous music of the Gold Coast "If we speak of a man being musical we mean that he understands all the dances, the drums and the songs."

We propose that dancing together in groups is one way to set the conditions for the first step in promoting human social bonding, namely, the motivation to approach an individual. The second component in promoting social relationships, the formation of social memories, has been demonstrated by Macrae et al. (2008), who showed that individuals had enhanced memory for person attributes as a result of engaging in synchronous movement. These researchers speculate that the mechanism that explains enhanced memory is greater attentional focus on the interaction partner who is moving in synchrony. Evidence supporting this notion has recently been found in an eyetracking study in which participants observed people dancing either synchronously or asynchronously to a given musical beat (Woolhouse and Lai, 2014). In brief, Woolhouse and Lai (2014) found significantly increased gaze dwell times in cases where dancers' actions synchronized to a regularly pulsed audio track, as opposed to instances in which dancers' actions were asynchronous to the audio. The implications of these findings for the experiment reported here are addressed in the discussion to this paper.

Returning to the previous theme of the cognitive and neural mechanisms underpinning social bonding, recent research suggests that these processes are likely to be interconnected. Miles et al. (2010, p. 460) noted that synchrony or imitation probably leads to ". . .more interdependent, or 'other-focused' information processing"—an "attentional union"—which, in turn, enhances perceptions of intimacy. Therefore, the increased person perception found by Macrae et al. (2008) may represent one of several cognitive-neurological processes involved in social cognition. Likewise, the factors responsible for the heightened levels of compassion and altruistic behavior found by Valdesolo and DeSteno (2011) may represent a complementary process responsible for bonding together groups of individuals. However, although the social-bonding effects found in many studies of synchronized movement have been observed in different social contexts, the scientific literature on dance has yet fully to examine the mechanisms by which such bonding is actuated. By exploring the way in which group dancing creates similar effects to those identified in the synchronous- and rhythmic-movement literature, the present study was designed partly to fill this gap. Moreover, we sought to do this in a realistic, non-lab setting, which, as D'Ausilio et al. (2015) have noted, can have tangible benefits: music (and we would argue, also dance) is able to balance the requirements of ecological validity and experimental control when used to investigate human social interaction and cognition.

Specifically, we aimed to test whether dancers whose movements were temporally aligned had enhanced memory for various attributes of each other; simply put, whether people remember more about each other when they dance to the same music. In this respect, this paper develops the study of Macrae et al. (2008), which found that participants recalled more target words, and were more likely to recognize the experimenter, when the experimenter's and participant's hands waved in phase versus anti-phase, or did not wave at all. If a similar effect of improved recall for details of individuals with whom participants danced in synchrony was also found, it would indicate that dance, along with music, is an interactive medium that exploits the effects of a more general capacity for coordinated movement. Consequently, if enhanced memory between individuals can be demonstrated to arise from collective dancing, then it is possible that one motivation for dancing in groups is that it enhances memory for one another, thereby facilitating social bonding and cohesion.

Our study utilized the fact that, in general, music and dance in all world cultures have a periodic, rhythmic framework (Leman and Naveda, 2010; Naveda and Leman, 2010), a quality that facilitates interpersonal entrainment (Clayton et al., 2005). Furthermore, people who dance to the same piece of music are likely to have shared affective experiences (Koelsch, 2010), which, in turn, may enable them to attend to one another to a greater degree. As a result, we hypothesized that there would be greater memory for person attributes between those dancing to the same music than those dancing to different music. We examined the effects on memory for person-attributes in an experiment using "silent-disco" technology, which enabled us simultaneously to transmit two different songs, A and B (at different tempi), to two subgroups of dancers, one receiving Song A, the other Song B. The technology allowed us to conduct a real-world study with experimental rigor, and thus, arguably, to explore the effects of collective dancing on participants who were more likely to exhibit spontaneous "natural" behaviors than those taking part in purely lab-based studies.

### DANCE EXPERIMENT

### Method

The silent-disco technology employed in the study consisted of a multichannel wireless transmitter unit and sets of wireless headphones capable of receiving a signal on more than one channel. Here, a simple two-channel set-up was used, the wireless transmitter having a range of greater than 100 m and each headphone set having a signal-to-noise ratio of greater than 75 dB. The experiment consisted of four sessions each with ten different participants, in which the effect of group dancing on memory for person attributes was tested. Using silent-disco headphones, in each session five participants danced to music at a faster tempo, while the remaining five participants danced to music at a slower tempo. A marked-up dance floor controlled the participants' proximity to one another. This ensured that within the experiment sessions each participant was brought into close proximity with every other dancer, and allowed distance on the dance floor to be taken into account within the analysis. The memory targets used in the experiment were different colored sashes worn by the participants, half of which also prominently displayed large symbols of cats.

### Ethical Statement

Ethics approval for this study was granted by the Faculty of Music Ethics Committee, University of Cambridge, UK.

### Participants

The participants were non-professional dancers, and included 23 females and 17 males (mean age = 31.4; SD = 11.2; range = 18–60). The female/male distribution, mean age, and age SD per group within each trial were as follows. Trial 1: Group A, 3F/2M, mean age = 21.6, SD = 2.1; Group B, 1F/4M, mean age = 20.2, SD = 1.8. Trial 2: Group A, 4F/1M, mean age = 38.6, SD = 11.2; Group B, 4F/1M, mean age = 36.8, SD = 9.4. Trial 3: Group A, 3F/2M, mean age = 26.2, SD = 2.3; Group B, 2F/3M, mean age = 25.2, SD = 2.6. Trial 4: Group A, 3F/2M, mean age = 45.0, SD = 12.1; Group B, 3F/2M, mean age = 37.4, SD = 9.9. Expressed as percentages, where 100% = everyone knew everyone and 0% = no one knew anyone, known/unknown pairings per trial were as follows: Trial 1 = 35.6%; Trial 2 = 2.2%; Trial 3 = 51.1%; Trial 4 = 26.7%. Half were recruited from a weekly amateur dance class held within the host university, while the remainder were recruited on a University Open Day, and included both students and members of the general public. The experiment was run in a spacious, artificially lit dance studio.

### Procedure

On entering the foyer of the dance studio, each participant's face was photographed. Subsequently the participants were brought into the studio and randomly seated in a circle around the dance floor. Each was given silent-disco radio headphones set covertly either to Channel A or Channel B. The covert headphone setting meant that participants were unaware that they had been assigned to one of two groups, either A or B. There was then a short sound check during which headphone volumes were adjusted to comfortable levels.

Prior to participants' arrival, the dance floor had been divided into a series of 10 interlocking hexagons using white adhesive tape, each hexagon measuring 1.25 m in diameter. The hexagons were compactly configured, and a series of arrows delineated a path from one hexagon to another to be followed by participants during the experiment; see **Figure 1**. Before the experiment, participants were instructed to move between the hexagons via the arrows on the dance floor every time they heard a 2-s pause in the music. The 2-s pauses separated 10 periods of dancing, each 45 s in duration. The arrows were arranged so as to ensure that every participant was brought into a neighbor relationship with every other participant at least twice during the experiment. The marked-up dance floor and arrows made it possible to control the distances between participants and, importantly, to equalize as far as was practicable the amount of time that each participant spent in close proximity with others dancing either to the same or different music. Participants were instructed to dance freely with their eyes open, interact with one another, and, if possible, relax and enjoy themselves. Lastly, so as to eliminate physical-interaction effects, participants were also requested to dance without making bodily contact.

FIGURE 1 | Photograph of the dance floor showing the taped hexagons. The hexagons were connected by a series of arrows outlining a path to be followed by the participants.

Membership of either Group A or B depended on whether a participant's headphones were set to Channel A or B, which, in turn, determined their starting position on the dance floor. After participants were positioned on the dance floor, the studio's normal lighting was switched off and ultra violet lighting switched on. A red, yellow, blue, or green sash was then placed on each dancer in a pseudo-random order; see **Figure 2**. Ultra violet light was used during sash placement in order to restrict color observation to the dancing phase of the experiment (all sashes appeared gray under ultra violet light). Had normal lighting conditions been present during sash placement, it would not have been possible to ensure that observed memory effects had solely resulted from the dance phase of the experiment. As previously mentioned, half the sashes also carried rectangular cat symbols measuring 10 cm × 14 cm; see **Figure 2**.

Normal studio lighting was switched back on as the music and dancing commenced; the studio lighting ensured that participants could see each other's sash colors and cat symbols; see **Figure 3**. Music on Channel A was transmitted to Group A, while at the same time music on Channel B was transmitted to Group B. The two types of music were differentiated by artist, tempo, lyrics, and mood. Channel A music was Wannabe (1996) by British pop girl-band "Spice Girls"; music on Channel B was Lady in Red (1986) by Argentinian-born British singer-songwriter Chris de Burgh. The tempo of Wannabe was 121 bpm, the mood lively and upbeat. In contrast, the tempo of Lady in Red was 69 bpm, the mood subdued and sentimental. The two tempi were chosen because they lacked a simple integer ratio relationship, and therefore created asynchronous dancing between the two groups.
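The claim that 121 and 69 bpm lack a simple integer ratio can be checked with a short calculation. The sketch below is illustrative only: it assumes perfectly constant tempi and that both tracks start on a beat at the same instant, neither of which is stated in the text, and the function names are hypothetical.

```python
from fractions import Fraction

FAST_BPM = 121  # Wannabe, as reported above
SLOW_BPM = 69   # Lady in Red, as reported above


def beat_period_s(bpm):
    """Duration of one beat in seconds."""
    return 60.0 / bpm


def beats_until_realignment(fast_bpm, slow_bpm):
    """Return (fast_beats, slow_beats) elapsed before beats coincide again.

    Assumes constant tempi and a shared starting beat (idealization).
    """
    ratio = Fraction(fast_bpm, slow_bpm)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


def realignment_time_s(fast_bpm, slow_bpm):
    """Seconds between successive coinciding beats of the two tracks."""
    _, slow_beats = beats_until_realignment(fast_bpm, slow_bpm)
    return slow_beats * beat_period_s(slow_bpm)


print(beat_period_s(FAST_BPM), beat_period_s(SLOW_BPM))
print(beats_until_realignment(FAST_BPM, SLOW_BPM))
print(realignment_time_s(FAST_BPM, SLOW_BPM))
# For contrast, tempi in a simple 2:1 ratio realign on every slow beat.
print(beats_until_realignment(120, 60))
```

Because 121/69 is already in lowest terms, the beats of the two idealized tracks coincide only once every 69 slow beats, i.e., once per minute, whereas a 2:1 pairing such as 120 and 60 bpm would realign on every slow beat.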

The experiment finished after each participant had occupied all 10 hexagons on the dance floor, i.e., after one complete cycle of the dance-floor pattern. The dancing lasted approximately 8 min (10 hexagon positions × 45 s of dancing, plus the 2-s pauses between them, gives roughly 7 min and 50 s). As soon as the music stopped for the final time, the studio lights were switched off, and participants' sashes, symbols and headphones removed, again ensuring that the sash colors and symbols were only seen while the participants danced. Following the removal of the sashes, symbols and headphones, normal lighting was switched back on and the participants asked to return to their seats. During the dancing, the participants' seats were turned through 180°, forming an outward-facing circle.

During the 8-min dance phase of the experiment, questionnaires incorporating the colored photographs of the participants were prepared and printed. As soon as the participants were seated (looking away from one another), each was given a pencil and a copy of the questionnaire to complete. Participants were requested to recall the sash color and possible symbol associated with each dancer, identified by their photograph. The questionnaire also asked participants to indicate which dancers they knew prior to the experiment. The participants had 10 min in which to complete the memory task and questionnaire. In order to avoid participants adopting conscious memory strategies, only after completing the dancing were they made aware of the experiment's memory component. Participants were debriefed and informed of the hypothesis of the experiment after completing the questionnaires.

### RESULTS


A score of 1 was awarded for each correctly remembered sash color, and 1 for whether a participant wore a symbol; 0 was awarded for each incorrect response. Given that for each participant there were four same-music members and five different-music members, there was a greater probability that participants could have correctly guessed the symbol and sash color of different-music members than of same-music members (five different-music versus four same-music). Accordingly, each participant's mean same- and different-music scores for symbol and sash color were expressed as percentages prior to analysis. Missing responses were modeled at chance: 0.5 for symbol (either symbol or no symbol), and 0.25 for sash color (one of four possible colors); missing responses accounted for 86 of the 720 possible responses, approximately 12%.
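The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' analysis code: the function names, the color labels and the example responses are hypothetical, and only the chance values (0.5 for symbol, 0.25 for sash color) and the count of missing responses come from the text.

```python
# Chance levels stated in the text; everything else here is illustrative.
CHANCE = {"symbol": 0.5, "color": 0.25}


def score_response(response, truth, target):
    """Return 1 for a correct answer, 0 for an incorrect one,
    and the chance level for a missing answer (None)."""
    if response is None:
        return CHANCE[target]
    return 1.0 if response == truth else 0.0


def percent_score(responses, truths, target):
    """Mean score over a set of co-dancers, expressed as a percentage."""
    scores = [score_response(r, t, target) for r, t in zip(responses, truths)]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical sash-color answers from one participant about four same-music dancers.
truths = ["red", "green", "blue", "green"]
answers = ["red", "blue", None, "green"]   # one error, one missing response
same_music_color = percent_score(answers, truths, "color")
print(same_music_color)                    # (1 + 0 + 0.25 + 1) / 4 * 100

# Share of missing responses reported in the text: 86 of 720.
missing_share = 100.0 * 86 / 720
print(round(missing_share, 1))
```

Expressing each mean as a percentage places the four same-music targets and the five different-music targets on a common scale before they are compared.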

We investigated whether memory for sash color and symbol had been affected by dancing to the same piece of music in two separate analyses, one for sash color, the other for symbol. In addition, we took into account possible positive effects on memory of higher levels of arousal that may have resulted from dancing at the faster tempo (121 bpm versus 69 bpm; see Lambourne and Tomporowski, 2010). Accordingly, we conducted two full factorial repeated-measures ANOVAs on participants' scores (n = 40), with Music (same or different) as the within-group variable, and Tempo (fast or slow) as the between-group variable.

### Sash Color

There was a significant main effect of Music [F(1,39) = 4.229, p < 0.05, η<sup>2</sup> = 0.048], but not of Tempo (F < 1). Participants who danced to the same music remembered each other's sash colors to a greater degree than those who danced to different music; the tempo of the music did not have a significant effect on memory. With respect to sash color there was no interaction of Music with Tempo (F < 1).

### Sash Symbol

There was a significant main effect of Music [F(1,39) = 5.006, p < 0.05, η<sup>2</sup> = 0.039], but not of Tempo (F < 1). As with sash color, participants who danced to the same music remembered each other's sash symbols to a greater degree than those who danced to different music; the tempo of the music did not have a significant effect on memory. And similarly, as with sash color, for sash symbol there was no interaction of Music with Tempo (F < 1).

In sum, there was no effect of tempo on participants' memory performances, whereas participants who danced to the same music, and therefore at the same tempo, exhibited greater interpersonal memory than those who danced to different music (and at a different tempo). Memory for symbol was greater than memory for sash color. This is perhaps not surprising given that the two targets used in the memory task, color and symbol, involved different memory loads: four colors of sash were present and all participants wore sashes; half the participants wore the cat symbol, half did not. Hence, overall performance on symbol was expected to be higher than performance on color, and, indeed, this was the case: the mean score for symbol was 0.59; for color 0.35. However, the memory advantage for same music over different music held for both color and symbol: the means for same-music symbol and color memory were, respectively, 0.64 and 0.40; the respective different-music means were 0.55 and 0.30. See **Figure 4**.

### Familiarity and Proximity

In order to explore whether prior familiarity and/or proximity on the dance floor had contributed to the results, we combined the memory performances of participants for sash color and symbol. With respect to prior familiarity, scores of participants who knew each other were then excluded and the analysis rerun; which is to say, participants were not excluded in the secondary analysis, simply scores where prior familiarity existed. For example, if Participant A had scores for Participants B, C, and D, and A knew B, A's revised score would have been the average of C and D (i.e., only A's score for B would have been excluded). Again, there was a significant main effect of the key factor Music: F(1,39) = 7.894, p < 0.01. Tempo was not significant (F < 1).
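The exclusion rule in the worked example above (Participant A's score for B is dropped, and A's revised score is the average of C and D) might be expressed as follows. This is a hedged sketch only: the function name and the numeric scores are hypothetical and are not taken from the study's data.

```python
def mean_excluding_known(scores, known):
    """Average one rater's scores, skipping dancers the rater already knew.

    Returns None if every score is excluded.
    """
    kept = [s for dancer, s in scores.items() if dancer not in known]
    return sum(kept) / len(kept) if kept else None


# Hypothetical combined memory scores given by Participant A.
scores_by_a = {"B": 1.0, "C": 0.5, "D": 0.0}

original = mean_excluding_known(scores_by_a, known=set())
revised = mean_excluding_known(scores_by_a, known={"B"})   # A knew B beforehand
print(original, revised)
```

Only individual scores are removed, so every participant still contributes to the secondary analysis, as the text specifies.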

To test whether the results had been influenced by proximity on the dance floor, the original scores were linearly weighted to reflect the amount of time each participant had spent one hexagon, two hexagons or three hexagons away from every other participant throughout the experiment. As before, there was a main effect of Music [F(1,39) = 11.287, p < 0.005], but not of Tempo (F < 1), indicating that overall proximity had little noticeable effect. These and other findings reported above are now discussed.

### DISCUSSION

The results of the experiment are consistent with the hypothesis that people who dance to the same piece of (in-tempo) music are more likely to recall various attributes of one another than those who dance to different (out-of-tempo) music. An apparent motivation, therefore, for people to engage in same-tempo group dancing may be to enhance person perception, and so provide the necessary conditions under which social bonding can occur. Given that participants danced freely, rather than in a synchronized manner, it is also possible that a general, albeit loose, coupling mechanism was responsible for the enhanced memory effect, rather than one relying upon precisely matched movements, as indicated in previous synchronization studies (e.g., Macrae et al., 2008). Moreover, as participants were encouraged to interact and observe each other on the dance floor, it seems likely that this coupling mechanism must relate to vision: participants did not converse or physically touch during the experiment, leaving vision as the only means by which pertinent information could be acquired.

Despite vision seemingly being key to our results, the experiment was not able to show whether same-music participants observed each other more than different-music participants. However, if we assume that a purpose of group dancing is to increase joint attention, it is reasonable to propose that participants' gazes may have been predominantly directed towards other same-tempo dancers and away from different-tempo dancers. Indeed, results from a recent eye-tracking study exploring music-dance synchrony provide strong evidence that this is likely to have been the case.

Woolhouse and Lai (2014) investigated people's eye movements whilst observing pairs of laterally positioned dancers dancing synchronously or asynchronously to a musical beat, i.e., moving either in or out of tempo with the music the observer was hearing. Specifically, they tested two hypotheses: that enhanced memory for person attributes is the result of (1) increased gaze time between in-tempo dancers, and/or (2) greater attentional focus between in-tempo dancers. Woolhouse and Lai's (2014) findings were consistent with the first hypothesis: music-dance synchrony resulted in significantly greater gaze times than music-dance asynchrony, and thus, they inferred, was likely to lead to increased memory for the attributes of those dancing together in time. In addition, they found a preference for upper-body fixations over lower-body fixations across both synchronous and asynchronous conditions. A subsequent, single-dancer eye-tracking study, also reported by Woolhouse and Lai (2014), investigated fixations across different body regions, including face, torso, legs, and feet. Significantly greater gaze times were recorded for face and torso than for legs and feet.

Recall that in our silent-disco experiment, upon being presented with photographs of co-dancers' faces, participants had to recall two memory targets, sash color and symbol, both of which were located on the upper body. In light of Woolhouse and Lai's (2014) finding that dancers' faces and torsos attracted greater gaze times than lower body regions, it is perhaps not surprising that participants' same-tempo and different-tempo memory performances were above chance. In general, it appears to have been the case that the overall tendency for upper-body fixations led participants in our experiment to form mental associations between faces (subsequently presented in photographs) and memory targets, irrespective of synchrony. In addition, Woolhouse and Lai's (2014) study also offers a plausible explanation for our finding that participants were more likely to recall the memory targets of those with whom they danced in time rather than out of time. As per Woolhouse and Lai's (2014) experiment, in which music-dance synchrony resulted in significantly greater gaze times than asynchrony, our experiment resulted in participants exhibiting enhanced recall for those with whom they danced in time as opposed to out of time. The implication is that mutual gaze and dwell times between people who dance in time are significantly greater than between those who dance out of time. Of course, without data from mobile eye-tracking systems, which, given their physical presence around the eyes, may inhibit dancing, this proposal is conjectural to some degree; however, the evidence for the existence of a vision mechanism linking gaze, human movement synchrony, and interpersonal memory would seem to be compelling.

Although, to our knowledge, this is the first recorded instance of enhanced interpersonal memory in the context of music and dance, our findings are consistent with previous studies involving recollection of person attributes, behavioral coordination and synchrony (e.g., Macrae et al., 2008). Moreover, our results with respect to familiarity and dancer proximity suggest that the proposed vision mechanism is relatively robust. Previously established relationships and/or friendships did not result in participants recalling memory targets disproportionately. Similarly, proximity on the dance floor appears not to have had a significant effect. This may have been due to the dance floor having good lines of sight, enabling non-adjacent dancers to see one another clearly; see **Figure 3**. Moreover, the manner in which participants were required to move around the dance floor (45 s per hexagon) brought them into close proximity, and thus effectively 'mixed up' the participants during the dance. Nor was there an effect of dancing at the faster tempo on memory target recall, which might have been anticipated based on research examining the correlation between exercise and cognitive performance (Lambourne and Tomporowski, 2010). In essence, if there was an effect of physiological arousal, it appears to have affected all participants equally, irrespective of the tempo at which they danced.

While the ability of individuals to entrain, and thus attend to one another, is likely to have been most strongly affected by the meter and tempo of the music, factors other than tempo may have contributed to our results—semantic/lyric differences, for example, could have led participants to express themselves with dissimilar dance gestures. However, eliminating semantic factors in music and musical entrainment, e.g., by using the same song at different tempi, may not be that trivial, since tempo is itself a component of musical meaning (Koelsch et al., 2004).

In the post-experiment phase, we obtained only informal information concerning whether participants knew that some people within the experiment were dancing at a different tempo, and therefore there is a limited amount that can be inferred from this feedback. However, many reported not realizing that some of their fellow dancers were dancing to different music; few participants had had previous experience of silent discos, and even fewer expected there to be multi-music and multi-tempi components to the experiment. In other words, most participants appear to have assumed that everyone was dancing to the same music, which, arguably, reinforces the notion that the memory effects we did observe were, to some extent, incidental, and not the result of explicit knowledge or conscious strategies adopted by all participants. Some participants did report finding the memory task difficult, which is perhaps understandable given that the dancing lasted only 8 min, and that they had no prior knowledge of the post-experiment memory task. Nevertheless, despite these difficulties, the results suggest that something as commonplace as dancing in time with other people significantly enhances memory for person attributes.

More generally, the results of our study support the conjecture that at least one significant, and possibly evolutionarily adaptive, function of music and dance is for bonding groups that extend beyond immediate family (Dissanayake, 2000; Nettl, 2005; Cross, 2011; Kaufman Shelemay, 2011). The ecologically grounded nature of our study, achieved by using a non-lab dance environment, and employing relatively large numbers of dancers within each trial, extends the scope of previous interpersonal entrainment research into a real-world setting.

### Summary

This paper demonstrates that dancing together in time can lead to increased recall of person-related memory targets, and therefore builds on the findings of previous studies showing that enhanced memory effects can arise in situations where movements are synchronous. Furthermore, our results suggest that the experience of group dancing is motivated by similar processes to those referred to in the literature on synchronized movement (and possibly also in communicative interaction in speech; see Shockley et al., 2009). Dancing in groups is self-evidently a case where individuals synchronize and coordinate their movements in time, and, as noted at the outset of this paper, there is a substantial literature advancing the notion that a major role of collective dance is the affirmation and/or transformation of social bonds. However, to our knowledge it has not previously been proposed that group dancing should lead to (1) modifications in visual attention, and, as a result, (2) enhanced memory for attributes of those with whom we are dancing. Certainly it is the case that enhanced memory effects have been previously observed in synchronization tasks (e.g., Macrae et al., 2008), but not between individuals engaging in collective dance, an ecologically valid and relatively sophisticated cultural activity in comparison to laboratory studies. Should this finding be entirely unexpected, however? Arguably not: scholars, particularly those engaged in studying cultures other than contemporary urban Western societies, have for some time been aware that dance (and music) are more than mere pastimes. Rather, these activities underpin a range of individual and social functions, and are integral to the "healthy" functioning of the cultures they have explored. As the ethnomusicologist Thomas Turino (1999, p. 234) puts it,

> The subtle rhythmic patterns—basic to how we speak, how we walk, how we dance, how we play music—are unspoken signs of who we are, whom we resemble, and thus whom we are with. Conversely, divergences in kinesic and other features of social style directly identify outsiders, those who are not like us. . . Sonic and kinesic iconicity, or lack thereof, however, comes to the fore in participatory musical and dance occasions because in such occasions these signs are the focal point of attention.

Silent-disco technology offers considerable benefits to those wishing to construct controlled social-interaction experiments in complex real-world environments. Given the purpose for which they were originally designed (to enable large numbers of people to dance together energetically in party-like settings), silent-disco headsets are relatively robust, and thus applicable to a range of experimental setups and designs. Further studies could incorporate motion-capture in order to investigate how dancing to the same or different music translates into synchronous action; a cross-cultural element might explore the extent to which our current findings are generalizable or specific. The degree to which the former is the case, i.e., generalizable, will, we anticipate, be a major target of future research within the rapidly expanding area of rhythmic entrainment research.

### AUTHOR CONTRIBUTIONS

MW: Study design and execution, data interpretation and analysis, manuscript drafting. DT: Study design and execution, data interpretation and analysis. IC: Manuscript drafting, data interpretation and analysis.

### ACKNOWLEDGMENTS

This paper is dedicated to the memory of Leonie Marie Woolhouse (1937–2014), whose expert sewing skills provided the authors with tailor-made sashes for their experiment.

In addition to thanking our participants, we would like to express our gratitude to the following people for their assistance in running this experiment: Nick Collins, Jonathan Green, Guy Hayward, Inga Maria Klaucke, Michele Phillips, Tal-Chen Rabinowitch, Ghofur Woodruff and Sarita Woolhouse. We would also like to make special mention of Claire Slavin Stewart for her careful reading of an earlier draft of the paper and for her thoughtful amendments and recommendations. Lastly, the first author would like to thank Victor Kuperman, and his students Priscilla Ally and Matthew Ha who helped to conduct pilot studies of Woolhouse and Lai's (2014) eye-tracking dance-music synchronization study, discussed at length in this paper.

A preliminary version of this study was published in the Proceedings of the 11th International Conference on Music Perception and Cognition, 2010.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Woolhouse, Tidhar and Cross. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Measuring Neural Entrainment to Beat and Meter in Infants: Effects of Music Background

#### Laura K. Cirelli<sup>1</sup>, Christina Spinelli<sup>1</sup>, Sylvie Nozaradan<sup>2,3,4</sup> and Laurel J. Trainor<sup>1,5,6</sup>\*

<sup>1</sup> Department of Psychology, Neuroscience and Behaviour, McMaster University, Hamilton, ON, Canada, <sup>2</sup> MARCS Institute, Western Sydney University, Milperra, NSW, Australia, <sup>3</sup> Institute of Neuroscience, Université Catholique de Louvain, Louvain-la-Neuve, Belgium, <sup>4</sup> BRAMS, Université de Montréal, Outremont, QC, Canada, <sup>5</sup> McMaster Institute for Music and the Mind, McMaster University, Hamilton, ON, Canada, <sup>6</sup> Rotman Research Institute, Baycrest Hospital, Toronto, ON, Canada

Caregivers often engage in musical interactions with their infants. For example, parents across cultures sing lullabies and playsongs to their infants from birth. Behavioral studies indicate that infants not only extract beat information, but also group these beats into metrical hierarchies by as early as 6 months of age. However, it is not known how this is accomplished in the infant brain. An EEG frequency-tagging approach has been used successfully with adults to measure neural entrainment to auditory rhythms. The current study is the first to use this technique with infants in order to investigate how infants' brains encode rhythms. Furthermore, we examine how infant and parent music background is associated with individual differences in rhythm encoding. In Experiment 1, EEG was recorded while 7-month-old infants listened to an ambiguous rhythmic pattern that could be perceived to be in two different meters. In Experiment 2, EEG was recorded while 15-month-old infants listened to a rhythmic pattern with an unambiguous meter. In both age groups, information about music background (parent music training, infant music classes, hours of music listening) was collected. Both age groups showed clear EEG responses frequency-locked to the rhythms, at frequencies corresponding to both beat and meter. For the younger infants (Experiment 1), the amplitudes at duple meter frequencies were selectively enhanced for infants enrolled in music classes compared to those who had not engaged in such classes. For the older infants (Experiment 2), amplitudes at beat and meter frequencies were larger for infants with musically-trained compared to musically-untrained parents. These results suggest that the frequency-tagging method is sensitive to individual differences in beat and meter processing in infancy and could be used to track developmental changes.

Keywords: neural entrainment, rhythm, meter, electroencephalography, infancy, steady-state evoked potentials, music, frequency-tagging

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

István Winkler, University of Szeged, Hungary

Erin E. Hannon, University of Nevada, Las Vegas, USA

> \*Correspondence: Laurel J. Trainor ljt@mcmaster.ca

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

> Received: 15 February 2016 Accepted: 09 May 2016 Published: 24 May 2016

#### Citation:

Cirelli LK, Spinelli C, Nozaradan S and Trainor LJ (2016) Measuring Neural Entrainment to Beat and Meter in Infants: Effects of Music Background. Front. Neurosci. 10:229. doi: 10.3389/fnins.2016.00229

### INTRODUCTION

Mothers across cultures interact with their infants in musical ways, frequently singing them lullabies and playsongs (Trehub and Schellenberg, 1995; de l'Etoile, 2006; Trehub and Gudmundsdottir, 2015). In turn, infants respond positively to this input (Trainor, 1996). Furthermore, caregivers rock infants to the rhythms of music, and such synchronous interpersonal movement appears to increase infant social affiliative behaviors (Cirelli et al., 2014a,b, 2016; Tunçgenç et al., 2015). Yet little is known about how infants' brains encode musical rhythms. The present paper reports the results of two experiments using an original electroencephalographic (EEG) frequency-tagging approach to investigate the neural encoding of rhythms in 7- and 15-month-old infants. The results suggest that the frequency-tagging approach can be successfully used with infants, and also reveal individual differences in musical rhythm processing related to differences in infant and parent music training.

Humans are very good at organizing timing structures in music (for a review, see Repp and Su, 2013). From the rhythm (i.e., pattern of tone onsets and offsets), people can easily extract the underlying pulse, or beat. These beats are then perceptually organized into hierarchical groups to create an internal metrical structure representation through which the musical input is interpreted as alternating patterns of strong and weak beats. Some examples of common Western music meters include grouping isochronous beats into a duple metrical structure (groups of 2), a triple metrical structure (groups of 3), or a quadruple metrical structure (groups of 4). While non-musicians easily perceive meter (especially when low-level components of the rhythm make meter salient), musicians often display advantages on tasks involving meter perception and production (for example, Drake et al., 2000; Brochard et al., 2003). Perception of beat and meter is not only driven by auditory cues in the stimulus, but also shaped by top-down processes such as attention, expectation and previous experience (see for example Large and Jones, 1999; Brochard et al., 2003; Phillips-Silver and Trainor, 2007, 2008; Nozaradan et al., 2011; Schaefer et al., 2011; Manning and Schutz, 2013; Chemin et al., 2014; Butler and Trainor, 2015; Celma-Miralles et al., 2016). For example, when listening to unaccented isochronous tones, both musician and non-musician adults showed larger event-related potentials (ERPs) to every second tone, suggesting that a duple metrical structure was automatically applied (Brochard et al., 2003). Thus, meter perception can be automatic, especially when low-level information in the rhythmic stimulus increases the salience of strong over weak metrical beats, but it can also be shaped by attention and music training.

While many questions remain about the developmental timecourse of beat and meter perception, there is evidence that young infants are sensitive to this timing information. Newborns can discriminate between spoken languages that fall into different rhythmic categories (for example, English compared to Japanese) (Nazzi et al., 1998). By as early as 2 months of age, infants can detect tempo changes (Baruch and Drake, 1997) and discriminate between different musical rhythm patterns (Chang and Trehub, 1977; Demany et al., 1977). In terms of meter processing, Winkler et al. (2009) argue that such processing occurs even in newborn infants. Using electroencephalography (EEG) and a mismatch-negativity paradigm, they showed that newborns were better able to detect the omission of a metrically important beat than of a metrically unimportant beat in a rhythm pattern. Behaviorally, by as early as 7 months of age infants can categorize melodies and rhythms based on metrical structure (Hannon and Johnson, 2005), and use movement to guide meter perception (Phillips-Silver and Trainor, 2005).

Infants are also learning about metrical structure through exposure to music during the first year after birth. Hannon and Trehub (2005a,b) revealed this effect of exposure by taking advantage of the fact that musical systems in different cultures use predominantly different meter styles. For instance, Western meters tend to be simple, with metrical groupings of beats by 2 or 3, whereas music from many other places in the world contains complex meters with metrical groupings of 5 or 7 beats (for example, Bulgarian music). Hence, Western adults are much better at detecting violations in patterns with simple meters compared to patterns with complex meters, whereas adults exposed to musical systems containing complex meters (such as Bulgarian music) are equally good at detecting violations in patterns with both simple and complex meters (Hannon and Trehub, 2005a). Interestingly, at 6 months of age, Western infants are adept at detecting metrical violations in both culturally familiar and unfamiliar rhythmic patterns. However, by the time these babies are 12 months old, they perform like adults, and are only able to detect violations in patterns with simple meters (Hannon and Trehub, 2005b). This perceptual narrowing indicates that musical exposure during the first year after birth shapes how musical timing structures are processed and perceived by infants.

The influence of controlled musical exposure during infancy on music processing has not been extensively studied, but one series of experiments did find evidence for such effects (Gerry et al., 2012; Trainor et al., 2012). In this investigation, 6-month-old infants and their parents were randomly assigned to attend 6 months of one of two types of caregiver/infant classes: (1) active music classes or (2) control classes focusing on play while music was presented passively. After (but not before) the training period, infants in the active music classes displayed larger and earlier brain responses to musical sounds as measured using EEG. Interesting correlations between preferences for expressive over mechanical music performances and socio-economic status (SES) were also found, independent of class assignment. Infants from families with a higher compared to lower SES were more likely to prefer expressive music to synthesized non-expressive music (Trainor et al., 2012). While this correlation is difficult to interpret, it is possible that parents from a higher SES have the means to receive music training themselves and expose their infants to a wider variety of musical stimuli. While these results have important implications for how musical exposure in infancy shapes music perception, they do not address how experience might affect the encoding of beat and meter in infancy.

One promising method for exploring infant rhythm processing is the EEG frequency-tagging approach (see Nozaradan, 2014 for a review). This original method was initially used to investigate the neural mechanisms underlying rhythm processing in adults. Neural entrainment to the incoming rhythm is measured in the form of peaks emerging from the EEG spectrum at frequencies corresponding to the rhythm envelope (Nozaradan et al., 2012b). In an initial study, participants were asked to listen to an isochronous auditory stimulus with a 2.4 Hz beat frequency and to imagine either that beats were metrically grouped in twos (a duple meter frequency at 1.2 Hz) or threes (a triple meter frequency at 0.8 Hz) (Nozaradan et al., 2011). The sound stimulus itself did not contain any energy at either of these metric frequencies. Interestingly, compared to when they were not asked to imagine a metrical structure, participants displayed a peak of brain activity (i.e., steady-state evoked potentials, or SS-EPs) specifically located at the imagined metrical frequencies. Importantly, this result suggests that the SS-EPs elicited in response to the sound do not merely constitute a faithful encoding of the stimulus rhythm. Rather, the brain transforms the rhythmic input by amplifying frequencies that coincide with perceived beat and meter frequencies. This finding was corroborated by subsequent frequency-tagging studies showing that SS-EPs elicited at frequencies corresponding to the perceived beat and meter were influenced not only by bottom-up stimulus properties, but also by top-down processes such as movement or predictive timing (Chemin et al., 2014; Nozaradan et al., 2012b, 2015, 2016).
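The duple and triple meter frequencies quoted for the imagery study follow from dividing the beat frequency by the number of beats per group. A minimal sketch of that arithmetic is given below; the function name is hypothetical, and only the 2.4 Hz beat frequency and the groupings of two and three come from the text.

```python
def meter_frequency(beat_hz, beats_per_group):
    """Frequency at which groups of `beats_per_group` beats recur."""
    return beat_hz / beats_per_group


BEAT_HZ = 2.4  # beat frequency reported for Nozaradan et al. (2011)

duple_hz = meter_frequency(BEAT_HZ, 2)    # beats grouped in twos
triple_hz = meter_frequency(BEAT_HZ, 3)   # beats grouped in threes
print(f"duple: {duple_hz:.1f} Hz, triple: {triple_hz:.1f} Hz")
```

These are the frequencies at which an SS-EP peak would be sought in the EEG spectrum when a listener imposes the corresponding grouping on the beat.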

The purpose of the present investigation was to use the frequency-tagging approach with infants to test whether music training shapes the neural encoding of rhythms early in infancy. EEG was recorded while infants listened to rhythmic patterns. There were two main goals of the research. First and foremost, we aimed to investigate the neural entrainment to rhythmic patterns in infants by measuring SS-EPs at beat- and meter-related frequencies in the EEG spectrum. Second, we investigated how individual differences in music background correlated with individual differences in these SS-EP measurements. We expected to find enhanced neural entrainment at beat and/or meter frequencies in the infants with stronger music backgrounds. Information was collected about parents' music training, enrolment in caregiver/infant music classes, and weekly hours of music listening. Experiment 1 presents results from a large sample of 7-month-old infants. These babies listened to an ambiguous rhythmic stimulus (that could be interpreted as in either duple or triple meter) used previously in behavioral studies with infants and adults (Phillips-Silver and Trainor, 2005, 2007), as well as EEG studies with adults (Chemin et al., 2014). By this age, infants are not yet encultured to their musical environment, but do perceive beat and meter (Hannon and Trehub, 2005a). Experiment 2 presents results from 15-month-old infants, some of whom had been recently randomly assigned to attend caregiver-infant music classes. These older infants listened to an unambiguous rhythmic stimulus with a typical Western quadruple meter that had been previously used in behavioral and EEG studies with adults (Nozaradan et al., 2012a, 2016). By this age, infants should be encultured to their musical environment, and should show more adult-like responses. Having two age groups and two different rhythm patterns provides a test of the generalizability of the frequency-tagging method.

### EXPERIMENT 1

### Materials and Methods

### Participants

Sixty 7-month-old (28 males; M age = 7.56 mo, SD = 0.29 mo) normal hearing infants participated in this experiment. An additional 14 infants participated, but were too fussy to complete the procedure. These infants were recruited from the Developmental Studies Database at McMaster University. The McMaster Research Ethics Board approved all procedures and informed consent was obtained from parents.

### Stimulus

The stimulus consisted of a six-beat rhythm pattern, lasting 2 s, based on the stimulus used by Phillips-Silver and Trainor (2005) (**Figure 1B**). The rhythm pattern consisted of the following: tone-silence-tone-tone-tone-silence. Each beat had an inter-onset interval of 333 ms (180 beats per minute), which translated to a beat frequency of 3 Hz. The tones were 990 Hz pure tones lasting 333 ms with 10 ms rise and fall times synthesized using the program Audacity 2.0.5 (www.audacity.sourceforge.net). One 34-s long trial consisted of 17 repetitions of this stimulus. These trials were repeated 32 times, with no pauses between trials, so that the entire procedure lasted just over 18 min with a break at the halfway point. The stimulus was presented at a comfortable intensity level [∼60 dB SPL at the location of the infants' head over a noise floor of <30 dB(A)] using Eprime software through an AudioVideo Methods speaker (P73) located approximately 1 m in front of the infant.

To determine frequencies of interest for the SS-EP analysis, the temporal envelope of the rhythm pattern was extracted using a Hilbert function implemented in MATLAB, yielding a time-varying estimate of the instantaneous amplitude of the sound envelope. The obtained envelope was then transformed into the frequency domain using a discrete Fourier transform, yielding a frequency spectrum of acoustic energy (**Figure 1A**).
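As a rough illustration of why the stimulus envelope carries energy only at multiples of 0.5 Hz, the following Python sketch builds an idealized amplitude envelope for the six-slot pattern and evaluates its Fourier amplitude at a few frequencies. It is not the MATLAB pipeline described above: the envelope is constructed directly (with linear 10 ms ramps) rather than extracted with a Hilbert transform, and the 600 Hz envelope sampling rate, the eight repetitions, and all function names are assumptions chosen for convenience.

```python
import cmath
import math

FS = 600                      # envelope sampling rate in Hz (assumed; one slot = 200 samples)
SLOT = FS // 3                # samples per nominal 333-ms slot
RAMP = round(0.010 * FS)      # 10-ms rise/fall time in samples
PATTERN = [1, 0, 1, 1, 1, 0]  # tone-silence-tone-tone-tone-silence
REPEATS = 8                   # 16 s of signal (assumed), so 0.25-Hz steps fit whole cycles


def slot_envelope(is_tone):
    """Idealized envelope of one slot: linear ramp up, plateau, linear ramp down."""
    if not is_tone:
        return [0.0] * SLOT
    env = [1.0] * SLOT
    for i in range(RAMP):
        env[i] = i / RAMP
        env[SLOT - 1 - i] = i / RAMP
    return env


def build_envelope():
    """Concatenate the slot envelopes and repeat the 2-s pattern."""
    one_cycle = [v for slot in PATTERN for v in slot_envelope(slot)]
    return one_cycle * REPEATS


def amplitude_at(signal, freq_hz):
    """Single-sided Fourier amplitude of `signal` at one frequency (direct DFT sum)."""
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * k / FS)
              for k, x in enumerate(signal))
    return 2 * abs(acc) / len(signal)


envelope = build_envelope()
for f in (0.5, 0.75, 1.0, 1.25, 1.5, 2.0, 2.5, 3.0):
    print(f"{f:4.2f} Hz: {amplitude_at(envelope, f):.4f}")
```

Under these assumptions the amplitudes at 0.75 and 1.25 Hz are numerically zero, because the idealized envelope repeats exactly every 2 s and therefore contains only harmonics of 0.5 Hz.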

### Procedure

After the nature of the study was described, the infant's parent(s) gave written consent to participate and also filled out a questionnaire about their child's and their own hearing and musical history.

The parent sat on a chair ∼1 m in front of the speaker, and held their infant on their lap. Infants' EEG signals were recorded while they passively listened to the stimulus for 18 min, with one break at the halfway point. During the procedure, an experimenter stayed in the room and silently entertained the infant with puppets, bubbles and toys to keep them still and content. A silent video played on a monitor below the speaker. Parents were asked to not speak during the recording session and to minimize their movements.

### Data Acquisition and Analysis

EEG signals were collected using a 124-channel HydroCel GSN net with an Electrical Geodesic NetAmps 200 amplifier passing a digitized signal to Electrical Geodesics NETSTATION software (v.4.3.1). Signals were recorded online at a sampling rate of 1000 Hz with a Cz reference. Electrode impedance during recording was maintained below 50 kΩ.

The data were filtered offline using EEProbe Software with high-pass and low-pass filters set at 0.5 and 20 Hz respectively. The data were resampled at 200 Hz in order to be processed using the Artifact Blocking algorithm in MATLAB (Mourad et al., 2007). This algorithm is especially useful for improving signal to noise ratios in continuous infant data (Fujioka et al., 2011). Using EEProbe Software, recordings were then digitally re-referenced to a common average. The 32 trials were averaged from 1000 to 34,000 ms, with baseline defined between 900 and 1000 ms. The first second of each epoch was removed (i) to discard the transient auditory evoked potentials related to stimulus onset and (ii) because SS-EPs require several cycles of stimulation to be entrained (Regan, 1989; Nozaradan et al., 2011, 2012b).

A Fourier transform was applied to the averaged EEG waveforms at each electrode using Letswave5 (Mouraux and Iannetti, 2008). This yielded a frequency spectrum where the signal amplitude (µV) ranged from 0 to 500 with a frequency resolution of 0.031 Hz. To obtain valid estimates of SS-EPs, the contribution of unrelated residual background noise was removed. This was accomplished by subtracting the averaged amplitude measured at neighboring frequency bins from each frequency bin (Mouraux et al., 2011; Nozaradan et al., 2012a). The two neighboring bins ranged from −0.15 to −0.09 Hz and +0.09 to +0.15 Hz relative to each frequency bin, thus corresponding to −3 to −5 and +3 to +5 bins around each frequency bin of the spectrum. Then, SS-EP magnitudes were averaged across all scalp electrodes for each participant, to allow SS-EP amplitudes to be compared across groups while avoiding electrode selection bias (Nozaradan et al., 2011, 2012b, 2016).
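The neighboring-bin noise subtraction can be sketched in a few lines of Python. This is only an illustration of the idea, not the Letswave5 implementation: the function name is hypothetical, and the handling of bins near the edges of the spectrum (using whichever neighbors exist) is an assumption.

```python
def subtract_neighbor_noise(spectrum, near=3, far=5):
    """Subtract from each bin the mean amplitude of bins `near`..`far` steps away on both sides."""
    n = len(spectrum)
    corrected = []
    for i, amplitude in enumerate(spectrum):
        # Offsets -5..-3 and +3..+5 bins relative to the current bin.
        candidates = [i + sign * d for d in range(near, far + 1) for sign in (-1, 1)]
        neighbors = [spectrum[j] for j in candidates if 0 <= j < n]
        noise = sum(neighbors) / len(neighbors) if neighbors else 0.0
        corrected.append(amplitude - noise)
    return corrected


# Toy spectrum: flat background of 1.0 with a single peak of 3.0 at bin 10.
toy = [1.0] * 21
toy[10] = 3.0
cleaned = subtract_neighbor_noise(toy)
print(cleaned[10], cleaned[0])
```

In this toy spectrum the flat background is removed and the peak keeps only the amplitude that exceeds its surroundings. Bins that sit 3 to 5 steps from a genuine peak are slightly over-corrected, which is one reason the immediately adjacent bins are left out of the noise estimate.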

Event related potential (ERP) analyses were also performed on the filtered (0.5 Hz high pass, 20 Hz low pass), resampled, artifact corrected and re-referenced data. Epochs from −100 to 300 ms relative to the onset of the first tone in each 6-beat sequence were averaged (total trials = 544), with baseline defined as −100 to 0 ms. Waveforms from eight right frontal channels were averaged and waveforms from the corresponding eight left frontal channels were averaged to examine the response from auditory cortex. Because of the orientation of auditory cortex around the Sylvian Fissure, activity from auditory areas typically shows up at frontal electrode sites on the surface of the scalp (Trainor, 2012). We defined the time-point at which the largest magnitude peak occurred in the grand average at each electrode grouping. Area under the curve was then calculated for each individual infant for each hemisphere as the area ±50 ms around this time point.
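The peak-then-area step can be illustrated with a short Python sketch. It assumes 5-ms sample spacing (the 200 Hz resampled rate) and trapezoidal integration; the function names and the toy waveforms are hypothetical, and the integration rule is an assumption because the source does not name one.

```python
def peak_latency(times_ms, grand_average):
    """Latency of the largest-magnitude sample in the grand-average waveform."""
    idx = max(range(len(grand_average)), key=lambda i: abs(grand_average[i]))
    return times_ms[idx]


def area_around_peak(times_ms, waveform, peak_ms, half_window_ms=50):
    """Trapezoidal area under `waveform` within +/- half_window_ms of peak_ms."""
    pts = [(t, v) for t, v in zip(times_ms, waveform)
           if peak_ms - half_window_ms <= t <= peak_ms + half_window_ms]
    return sum((t1 - t0) * (v0 + v1) / 2
               for (t0, v0), (t1, v1) in zip(pts, pts[1:]))


# Toy data: 5-ms steps from -100 to 300 ms relative to tone onset.
times = list(range(-100, 305, 5))
grand = [max(0.0, 1 - abs(t - 175) / 100) for t in times]  # triangular bump peaking at 175 ms
peak = peak_latency(times, grand)

flat_infant = [2.0] * len(times)  # hypothetical infant with a constant 2 µV response
print(peak, area_around_peak(times, flat_infant, peak))
```

The peak latency is taken once from the grand average and then reused for every infant, so individual areas are measured over the same 100-ms window.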

### Results

To check for outliers, an average SS-EP amplitude score was calculated for each infant across the 5 peaks frequency-tagged from the sound stimulus (1, 1.5, 2, 2.5, and 3 Hz). The z-scores across these averages were calculated, and one infant was excluded from further analyses using a z-score cutoff of ±3. For ANOVAs using repeated measures, Greenhouse-Geisser corrections are reported where applicable.
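A minimal Python sketch of this exclusion rule is given below. The participant labels and amplitudes are invented for illustration, and the use of the sample standard deviation is an assumption, since the source does not state which estimator was used.

```python
import statistics


def split_outliers(scores, cutoff=3.0):
    """Return (kept, excluded) participant IDs, excluding those with |z| > cutoff."""
    values = list(scores.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (assumption; not specified in the source)
    kept, excluded = [], []
    for pid, value in scores.items():
        z = (value - mean) / sd
        (excluded if abs(z) > cutoff else kept).append(pid)
    return kept, excluded


# Hypothetical mean SS-EP amplitudes: 19 typical infants and one extreme value.
scores = {f"infant_{i:02d}": 1.0 for i in range(19)}
scores["infant_19"] = 10.0
kept, excluded = split_outliers(scores)
print(len(kept), excluded)
```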

### SS-EP Responses

SS-EPs averaged across all channels and scalp topographies are visualized in **Figures 1C,D**. The expected beat frequency is at 3 Hz (i.e., 333 ms long tones and rests in the rhythmic pattern). Based on previous work with this rhythm pattern (Phillips-Silver and Trainor, 2005, 2007; Chemin et al., 2014), this ambiguous stimulus pattern can be interpreted as in either duple or triple meter, although there is a bias in Western adults for the duple interpretation (Chemin et al., 2014). 1.5 Hz represents the related metrical frequency where beats are grouped in two (duple) and 1 Hz represents the metrical frequency where beats are grouped in three (triple).
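The relation between the inter-onset interval and the candidate beat and meter frequencies is simple arithmetic, sketched here in Python. The function name is hypothetical, and the interval is taken as the nominal 1/3 s rather than the rounded 333 ms reported above.

```python
from fractions import Fraction


def metrical_frequencies(ioi_s, groupings):
    """Map each grouping size (beats per metrical unit) to its frequency in Hz."""
    beat_hz = 1 / ioi_s
    return {g: beat_hz / g for g in groupings}


IOI_S = Fraction(1, 3)  # nominal inter-onset interval; reported as 333 ms
freqs = metrical_frequencies(IOI_S, groupings=(1, 2, 3, 6))
for grouping, hz in freqs.items():
    print(f"grouping of {grouping}: {float(hz):.2f} Hz")
```

A grouping of 1 gives the beat (3 Hz), groupings of 2 and 3 give the duple (1.5 Hz) and triple (1 Hz) meter frequencies, and a grouping of 6 gives the repetition rate of the whole pattern (0.5 Hz).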

To determine if the peaks in the frequency-transformed EEG occurred as expected above the noise floor, peaks of interest were first determined from the FFT of the sound stimulus (See **Figure 1A**). Amplitudes in the frequency-transformed EEG were calculated at frequencies where peaks were present in the sound stimulus (1, 1.5, 2, 2.5, and 3 Hz; 0.5 Hz was excluded due to our use of a 0.5 Hz high pass filter). These were also calculated at frequencies where no peaks were present in the sound stimulus (0.75, 1.25, 1.75, 2.25, and 2.75 Hz). These amplitudes were calculated by selecting the maximum amplitude within a 3-bin band centered on the frequency of interest.
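The amplitude extraction described here can be sketched as follows. The bin width of 0.03125 Hz is an assumption made so that the frequencies of interest fall on exact bins (the reported resolution is 0.031 Hz), and the function names and toy spectrum are hypothetical.

```python
RESOLUTION_HZ = 0.03125  # assumed bin width (the reported resolution is 0.031 Hz)

TAGGED = (1.0, 1.5, 2.0, 2.5, 3.0)         # peaks present in the sound envelope
UNTAGGED = (0.75, 1.25, 1.75, 2.25, 2.75)  # no peaks present in the sound envelope


def band_peak(spectrum, freq_hz, resolution_hz=RESOLUTION_HZ):
    """Maximum amplitude in the 3-bin band centered on freq_hz."""
    center = round(freq_hz / resolution_hz)
    return max(spectrum[max(center - 1, 0):center + 2])


def mean_band_peak(spectrum, freqs_hz):
    """Average of the band peaks over several frequencies (e.g., a noise-floor estimate)."""
    return sum(band_peak(spectrum, f) for f in freqs_hz) / len(freqs_hz)


# Toy spectrum: 0.1 µV everywhere, with a 0.8 µV response one bin above 1.5 Hz.
spectrum = [0.1] * 128
spectrum[round(1.5 / RESOLUTION_HZ) + 1] = 0.8
print(band_peak(spectrum, 1.5), mean_band_peak(spectrum, UNTAGGED))
```

Taking the maximum over three bins tolerates a response that lands one bin away from the nominal frequency, as in the toy example.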

Average noise floor amplitude was calculated as the average across 0.75, 1.25, 1.75, 2.25, and 2.75 Hz. Using paired-samples t-tests corrected for multiple comparisons using the Bonferroni correction, EEG amplitudes at each of the frequencies contained in the sound stimulus were significantly above this average noise floor (all p's < 0.010). In addition, the average amplitude of beat- and meter-related frequencies (1, 1.5, 3 Hz) was significantly greater than the average amplitude of beat- and meter-unrelated frequencies present in the sound (2 Hz, 2.5 Hz), t(58) = 9.54, p < 0.001. The significant presence of peaks at both 1.0 and 1.5 Hz in the grand average likely reflects that some infants perceived the rhythm in duple meter and some in triple meter, or that individual infants may have switched back and forth in their interpretation, but we are not able to distinguish these possibilities. In general, these results suggest that the frequency tagging SS-EP method can result in significant signal to noise ratios when used with this age group.

### Effects of Music Background

#### **Effect of infant music classes**

Thirteen infants in this sample were reported to have participated in infant music classes with their caregiver. Most (11 of the 13) reported attending the classes for 45–60 minutes per week. Classes were varied (e.g., Kindermusik, Music Together) and started at various ages (starting age ranged from 1 to 6 months).

An ANOVA with participation in music classes as a between subjects variable and frequency (five levels: beat, 3 Hz; duple meter, 1.5 Hz; triple meter, 1 Hz; unrelated, 2 and 2.5 Hz) as a within subjects variable was used to investigate SS-EP amplitudes. A main effect of infant music classes [F(1, 57) = 5.692, p = 0.02] was qualified by a significant interaction between class participation and frequency, F(2.83, 161.17) = 2.95, p = 0.037. We explored this interaction using post-hoc t-tests (using a Bonferroni correction and familywise alpha of p = 0.10) to investigate how infants with music classes compared to those without at each frequency level. While infants with music classes did not have larger SS-EPs at the beat frequency (p = 0.656), triple meter frequency (p = 0.183), or either of the unrelated frequencies (p = 0.082 for 2 Hz; p = 0.216 for 2.5 Hz), infants with music classes did have larger amplitudes at duple meter frequency (1.5 Hz) than those without training, t(57) = 2.58, p = 0.012 (See **Figure 2**).
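The logic of the Bonferroni-corrected post-hoc comparisons can be checked with a few lines of Python. The p-values are those reported above; treating the family as five comparisons (one per frequency level) is an assumption, and the function name is hypothetical.

```python
def bonferroni_flags(p_values, familywise_alpha):
    """Return per-comparison significance flags and the corrected alpha."""
    per_test_alpha = familywise_alpha / len(p_values)
    flags = {name: p < per_test_alpha for name, p in p_values.items()}
    return flags, per_test_alpha


reported_p = {
    "beat (3 Hz)": 0.656,
    "duple meter (1.5 Hz)": 0.012,
    "triple meter (1 Hz)": 0.183,
    "unrelated (2 Hz)": 0.082,
    "unrelated (2.5 Hz)": 0.216,
}
flags, alpha = bonferroni_flags(reported_p, familywise_alpha=0.10)
print(round(alpha, 3), [name for name, significant in flags.items() if significant])
```

With a familywise alpha of 0.10 spread over five comparisons, each test is evaluated at 0.02, and only the duple meter comparison falls below that threshold.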

The effect of music classes was further explored in the ERPs to the first tone in each 6-beat sequence. At this age, infant ERP responses to tones are dominated by a slow positive wave at frontal electrodes, thought to be generated in auditory cortex (e.g., Leppänen et al., 1997; Trainor et al., 2003; He et al., 2007, 2009). As can be seen in **Figure 3**, this wave peaked around 175 ms after tone onset. The area under the curve method described above was used to assess the magnitude of this auditory response. An ANOVA was conducted with between-subjects factor infant music training and within-subjects factor hemisphere (left, right). While there was no main effect of hemisphere [F(1, 57) = 0.01, p = 0.929] or interaction between hemisphere and infant music training [F(1, 57) = 1.60, p = 0.211], there was a significant main effect of infant music training, F(1, 57) = 4.09, p = 0.048. Infants who had been enrolled in music classes had larger evoked responses to these tones compared to those who had not been enrolled in such classes.

FIGURE 2 | SS-EP amplitudes (noise subtracted, averaged across all channels) across frequencies of interest for 7-month-old infants who have participated in parent-caregiver music classes (n = 13, shown in purple) and infants who have not (n = 46, shown in blue). Infants who have had music classes show larger amplitudes at the duple meter frequency (1.5 Hz) compared to those who do not, p = 0.012. Interestingly, there was no significant difference at beat frequency (3 Hz), p = 0.656.

#### **Effects of parent music training**

Years of parent music training were calculated as a combination of mother- and father-reported levels. Infants were divided into two groups based on this information: infants with parents who had ≥5 years of combined music training (n = 24), and infants with parents who had <5 years of combined music training (n = 35).

An ANOVA with parent music training group as a between subjects variable and frequency (five levels: beat, 3 Hz; duple meter, 1.5 Hz; triple meter, 1 Hz; unrelated, 2 and 2.5 Hz) as a within subjects variable was used to investigate SS-EP amplitudes. There was no main effect of parent music training [F(1, 57) = 1.523, p = 0.222] and no interaction between parent music training and frequency [F(2.73, 155.56) = 0.41, p = 0.802]. There was also no correlation between reported years of parent music training and amplitudes at beat and meter frequencies (all p's > 0.102).

#### **Reported hours of infant music listening**

Parents were asked to report how many hours a week their infants heard music (either passive or active, but while awake). These reported rates did not correlate with amplitudes at beat and meter frequencies (all p's > 0.255). Interestingly, reported hours of music listening also did not correlate with years of combined parent music training, p = 0.579, and did not differ across infant music class groups, p = 0.968.

### EXPERIMENT 2

### Materials and Methods

### Participants

Thirty-three infants between the ages of 14- and 16-months (17 males; M age = 15.45 mo, SD = 0.79 mo) participated in this experiment. An additional 2 infants participated, but were too fussy to complete the experiment and were not included in the analyses. The results reported here for Experiment 2 are subsets of results from a larger study on the effect of infant music training. Here, we only report on the EEG portion of the experiment. These infants were recruited when they were between 9- and 10-months-old, and were randomly assigned to either a music training condition or a control condition. Infants in the music training condition (n = 14) received 20 weeks (1 h a week) of caregiver-infant music classes, provided by the Royal Conservatory of Music in Hamilton, ON. Infants in the control condition (n = 19) received this training after all experimental testing procedures were complete, so they had not received music training at the time of testing. EEG data collection took place within 2 weeks following the 20-week music training period. These infants were recruited from the Developmental Studies Database at McMaster University. The McMaster Research Ethics Board approved all procedures and informed consent was obtained from parents.

### Stimulus

The stimulus consisted of a rhythmic pattern lasting 3.996 s, made up of a rhythmic combination of 12 sounds and silent intervals (**Figure 4B**). This stimulus was based on the one used by Nozaradan et al. (2012b, 2016). Each beat had an inter-onset interval of 333 ms (180 beats per minute), which translated to a beat frequency of 3 Hz. The tones were 990 Hz pure tones lasting 333 ms with a 10 ms rise and fall time synthesized using the program Audacity 2.0.5 (www.audacity.sourceforge.net). One 36-s long trial consisted of 9 repetitions of this stimulus. These trials were repeated 14 times, with no pauses between trials, so that the entire procedure lasted about 9 min with no break. The stimulus was presented at a comfortable intensity level [∼60 dB SPL over a noise floor of <30 dB(A)] using Eprime software through an AudioVideo Methods speaker (P73) located ∼1 m in front of the infant.

The envelope spectrum of this sequence was analyzed using the same procedure as in Experiment 1, in order to compare stimulus and SS-EPs frequency spectra.

#### Procedure

The procedure matched Experiment 1, except that a table was placed in front of the infant so that they understood that they could not get down and play on the floor. It takes some effort to get infants of this age to sit still, so they were given toys to play with on the table if necessary.

#### Data Acquisition and Analyses

Data acquisition and analysis matched Experiment 1 in all respects except in epoch length, due to the different stimulus lengths (14 trials of 35,000 ms). The ERPs to the first tone in each of the 12-beat sequences were also analyzed using the same procedure as in Experiment 1 (total trials = 146).

### Results

The z-score cutoff method described in Experiment 1 was employed here, using the average SS-EP amplitude across all frequencies tagged from the stimulus. No infants met the ±3 z-score cutoff criteria, and so all were included in the following analyses.

#### SS-EP Responses

SS-EPs averaged across all channels and scalp topographies are visualized in **Figures 4C,D**. The expected beat frequency is at 3 Hz (as tones and rests in the rhythm pattern are 333 ms long). Based on previous work with this rhythm pattern (Nozaradan et al., 2012b, 2016), these beats are most naturally grouped in 4's, representing the quadruple meter at 0.75 Hz. Grouping these beats in 2's (1.5 Hz) is also fairly common. To determine if the peaks in the frequency-transformed EEG occurred as expected above the noise floor, peaks of interest were first determined from the FFT of the sound stimulus (See **Figure 4A**). Amplitudes in the frequency-transformed EEG were calculated at frequencies where peaks were present in the sound stimulus (0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.5, 2.75, and 3 Hz; 0.5 Hz was excluded due to our use of a 0.5 Hz high pass filter) and at frequencies between these, where no peaks were present in the sound stimulus (0.625, 0.875, 1.125, 1.375, 1.625, 1.875, 2.125, 2.375, 2.625, and 2.875 Hz). These amplitudes were calculated by selecting the maximum amplitude within a 3-bin band centered on each frequency of interest.


Noise floor amplitude was calculated as the average across the frequencies not present in the sound. Using paired-samples t-tests corrected for multiple comparisons with the Bonferroni correction, EEG amplitudes at each of the frequencies contained in the sound stimulus were compared to this noise floor value. Amplitudes at 0.75 Hz [t(32) = 6.54, p < 0.001] and 3 Hz [t(32) = 4.49, p < 0.001] were significantly greater than the noise floor value. Amplitudes at 1.5 Hz were also greater than the noise floor at the uncorrected level [t(32) = 2.80, p = 0.009], but this difference did not reach significance at the Bonferroni-corrected alpha of 0.005 per comparison. No other amplitudes at frequencies present in the stimulus were significantly larger than the noise floor value. In addition, the average amplitude of beat- and meter-related frequencies (0.75, 1.5, 3 Hz) was significantly greater than the average amplitude of beat- and meter-unrelated frequencies present in the sound (1, 1.25, 1.75, 2, 2.25, 2.75 Hz), t(32) = 7.806, p < 0.001. Together, these results suggest that the frequency tagging SS-EP method can result in good signal to noise ratios when used with this age group.

### Effects of Music Background

#### **Effects of infant music classes**

An ANOVA with participation in music classes as a between subjects variable and frequency (only those that were above the noise floor, i.e., 0.75, 1.5, and 3 Hz) as a within subjects variable was used to investigate SS-EP amplitudes. Surprisingly, there was no main effect of infant music classes [F(1, 31) = 0.09, p = 0.761] and no interaction between music classes and frequency [F(1.48, 45.76) = 0.17, p = 0.778]. This suggests that the 20 weeks of music training provided to the experimental group may not have influenced this measure.

#### **Effects of parent music training**

Years of parent music training were calculated as a combination of mother- and father-reported levels, as in Experiment 1. Infants were divided into two groups based on this information: infants with parents who had ≥5 years of music training (n = 16), and infants with parents who had <5 years of music training (n = 17).

In an ANOVA with parent music training group as a between subjects variable and frequency (only those that were above the noise floor, i.e., 0.75, 1.5, and 3 Hz) as a within subjects variable, a significant main effect of parent music training group was found, F(1, 31) = 4.73, p = 0.037. There was also a main effect of frequency [F(1.46, 45.27) = 19.52, p < 0.001] driven by the fact that responses at 0.75 Hz were larger than responses at 1.5 or 3 Hz. Interestingly, there was no interaction between parent music training and frequency, F(1.46, 45.27) = 0.76, p = 0.473. These results suggest that infants with musically trained parents had larger average SS-EP amplitudes overall compared to infants with musically untrained parents (See **Figure 5**).

This effect was further explored in the ERPs to the first tone in each sequence. As can be seen in **Figure 6**, the frontal positive wave peaked around 160 ms after tone onset for the left frontal location, and 155 ms after tone onset for the right frontal location. The area under the curve method described above was used to assess the magnitude of this auditory response. An ANOVA was conducted with between-subjects factor parent music training (high, low) and within-subjects factor hemisphere (left, right). While there was no main effect of hemisphere [F(1, 31) = 0.01, p = 0.938] or interaction between hemisphere and parent music training [F(1, 31) = 1.24, p = 0.275], there was a significant main effect of parent music training, F(1, 31) = 5.56, p = 0.025. Infants whose parents had more music training had larger evoked responses to these tones compared to infants whose parents had less music training. Supporting this, a positive correlation was found between frontal ERP response magnitude (averaged across hemisphere) and parent years of music training, r = 0.47, p = 0.006.

#### **Reported hours of infant music listening**

There were no correlations between reported hours of infant music listening and amplitudes at any of the frequencies of interest (all p's > 0.314). Again, there was surprisingly no correlation between reported hours of music listening and years of parent music training, p = 0.236.

### DISCUSSION

Here we reported the results from two infant experiments where we measured SS-EP responses that were frequency-locked to auditory rhythm patterns. In Experiment 1, 7-month-old infants listened to a rhythm pattern that could either be interpreted as triple or duple meter. Infants showed obvious amplitude peaks at the beat level (3 Hz) as well as both potential meter levels (duple and triple meter frequencies). Interestingly, infants who had music training showed greater SS-EP amplitudes elicited at duple meter (but not beat) frequencies than infants who had not received music training. ERP analyses also revealed that these infants with music training had larger evoked responses to the first tone in the rhythmic pattern than the infants with no training. In Experiment 2, 15-month-old infants listened to a rhythm pattern that could easily be interpreted as quadruple meter. Again, in an even shorter testing period (9 min), we found clean signal-to-noise ratios with clear amplitude peaks at both the beat (3 Hz) and meter (0.75 and 1.5 Hz) frequencies. While we did not find an influence of infant music training on SS-EP amplitudes with this age group, we did find an interesting effect of parent music training. Infants with parents who had at least 5 years of combined music training showed larger amplitudes across SS-EP frequencies than infants without musically trained parents. ERP analyses also revealed that babies with musically trained parents had larger evoked responses to the first tone in the sequence than the infants with musically untrained parents. We also found a positive correlation between parent years of training and ERP magnitude.

These results highlight the usefulness of the frequency tagging method for testing rhythm processing in infants with EEG. In both age groups (7 and 15 months), with relatively short testing periods (18 and 9 min, respectively), clear responses above the noise floor were seen at frequencies corresponding to the rhythm envelope. This result provides direct evidence for the capacity of the infant's brain to entrain, that is, frequency-lock, to incoming auditory rhythms. How much of these responses is due to low-level processing of the acoustic input and how much is due to top-down perceptual processes remains to be clarified. In adults, active attentional processes and body movement have been shown to shape such responses (Chemin et al., 2014; Nozaradan et al., 2011, 2015). As well, individual differences in adults' tapping ability are reflected in the size of SS-EPs elicited at beat frequencies (Nozaradan et al., 2016). In the present study, while we could not explicitly control the active or automatic attention of the infants, we did find evidence for individual differences in rhythm processing. In Experiment 1 in particular, the effect of infant music training on duple meter, but not beat, frequency peaks implies that music training may not only selectively amplify specific frequencies, but may also enhance metrical processing. This supports the idea that the SS-EPs we measured are not simply stimulus driven, and may be influenced by higher-level processing. Overall, these experiments suggest that the frequency-tagging method is apt to investigate the mechanisms through which the neural encoding of rhythms is shaped during early development.

Importantly, the frequency-tagging approach appears promising for observing the developmental trajectory of neural entrainment across infancy. The robust signal-to-noise ratios obtained in 7- and 15-month-olds bode well for observing effects as young as the newborn period. Little is known about rhythm processing in very young infants, and the frequency tagging approach offers a potential way to study this. Furthermore, it could be used in conjunction with future experimental designs in which infant musical experience is controlled through random assignment to different types of training. The effects of infant music training in Experiment 1 (where assignment to training was not controlled) and lack of effects of infant music training in Experiment 2 (where all parents enrolled their infants in music classes, but the classes were delayed in the control group so that those infants were untrained at the time of testing) suggest that either (1) training must occur early, before musical enculturation, for clear rhythm processing differences to be measured or (2) parents who choose to enroll their children in infant music classes may be different in some related variable from parents who do not.

The individual differences observed in the current study provide new insights on how infant and parent music background might shape music listening experiences. With two age groups and two different rhythm patterns, we found two different relations between music background and neural responses to rhythms. With the younger but not older infants, infant music training was related to enhanced duple meter processing. With older but not younger infants, parent music training was related to a non-specific enhancement across beat and meter frequencies. It is difficult to compare the results across these two experiments given the methodological differences. Here we present possible ideas for why we may have found differences in experiential effects between these age groups, but all such interpretations must be treated with caution. It could be that direct musical exposure in the form of infant music classes is more likely to shape meter processing in younger infants, since they have not yet become fully encultured to the metrical structures of the music in their environment, which occurs between 6 and 12 months (Hannon and Trehub, 2005a,b). It could also be the case that parents with higher levels of music training, through an interplay of genes and environment, encourage their infants to attend to temporal information in music more than parents with less music training. Further research is needed to directly assess these possibilities. Other differences in parent-infant lifestyles across these groups (SES, parent involvement in infant's daily life) would also need to be measured and controlled.

It was surprising that a correlation between reported parent years of training and infant music listening was not found in Experiments 1 or 2. Previous work has shown that parents with more music training typically engage in more musical activities with their babies. More specifically, parental musical experience was associated with the habit of listening to music with baby (Ilari, 2005), and with the frequency of playing music to and singing to baby (Custodero and Johnson-Green, 2003). It is possible that our simple question "How many hours per week does your infant hear music" (which covers both active and passive listening, with and without the parent) was not specific enough to capture potential differences in levels of musical engagement between parent and infant. Therefore, it is possible that our measure of parent music training is a better proxy for infant exposure to music in engaged settings than the question in our questionnaire.

One limitation of this study is that parents did not wear noise-canceling headphones. It is possible that some mothers (despite being blind as to our hypotheses and despite our clear instruction to avoid movement) may have subtly moved their bodies (and therefore their baby) to the rhythms heard over the loudspeaker, which may have influenced EEG recordings. The experimenter in the room tasked with infant distraction was trained to watch for such movements and, should they occur, to communicate to the parent that they must stop. All parents complied with this instruction. We also had a second experimenter watching the parent and infant from outside the sound attenuated chamber via a live webcam feed, to ensure that instructions were being followed.

Overall, the results of these experiments provide new insights on how the processing of beat and meter may develop from the interplay between genes and environment. Specifically, we present evidence that the frequency-tagging approach is apt to measure infants' neural entrainment to rhythmic patterns. We also present evidence that the neural responses entrained to beat and meter frequencies can be influenced by individual differences in infant and parent music backgrounds. These findings raise interesting questions about how musical experiences across the lifespan, especially in the early months of infancy, shape auditory processing in general and temporal processing in particular.

### AUTHOR CONTRIBUTIONS

LC was the primary researcher and LT the senior researcher but all authors contributed to the ideas, analyses, and writing of the manuscript. LC and CS tested participants.

### ACKNOWLEDGMENTS

This research was funded by a grant from the Canadian Institutes of Health Research (MOP 42554) to LT, and by a postgraduate scholarship from the Social Sciences and Humanities Research Council to LC. We also thank the Royal Conservatory of Music in Toronto, Ontario for supporting this research and providing the caregiver/infant classes received by participants in Experiment 2. Thanks also to Elaine Whiskin, Christine Ung, Madeleine McKitrick, Ammaarah Baksh, and Sonia Gandhi for their assistance in data collection and pre-processing. We also thank Dave Thompson for technical support.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Cirelli, Spinelli, Nozaradan and Trainor. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# An Exploration of Rhythmic Grouping of Speech Sequences by French- and German-Learning Infants

Nawal Abboub<sup>1,2</sup>\*, Natalie Boll-Avetisyan<sup>3</sup>, Anjali Bhatara<sup>1,2</sup>, Barbara Höhle<sup>3</sup> and Thierry Nazzi<sup>1,2</sup>

<sup>1</sup> Laboratoire Psychologie de la Perception, Université Paris Descartes, Paris, France, <sup>2</sup> Laboratoire Psychologie de la Perception, Centre National de la Recherche Scientifique (CNRS), Paris, France, <sup>3</sup> Faculty of Human Sciences, Universität Potsdam, Potsdam, Germany

Rhythm in music and speech can be characterized by a constellation of several acoustic cues. Individually, these cues have different effects on rhythmic perception: sequences of sounds alternating in duration are perceived as short-long pairs (weak-strong/iambic pattern), whereas sequences of sounds alternating in intensity or pitch are perceived as loud-soft or high-low pairs (strong-weak/trochaic pattern). This perceptual bias, called the Iambic-Trochaic Law (ITL), has been claimed to be a universal property of the auditory system applying in both the music and the language domains. Recent studies have shown that language experience can modulate the effects of the ITL on rhythmic perception of both speech and non-speech sequences in adults, and of non-speech sequences in 7.5-month-old infants. The goal of the present study was to explore whether language experience also modulates infants' grouping of speech. To do so, we presented sequences of syllables to monolingual French- and German-learning 7.5-month-olds. Using the Headturn Preference Procedure (HPP), we examined whether they were able to perceive a rhythmic structure in sequences of syllables that alternated in duration, pitch, or intensity. Our findings show that both French- and German-learning infants perceived a rhythmic structure when it was cued by duration or pitch but not intensity. Our findings also show differences in how these infants use duration and pitch cues to group syllable sequences, suggesting that pitch cues were the easier ones to use. Moreover, performance did not differ across languages, failing to reveal early language effects on rhythmic perception. These results contribute to our understanding of the origin of rhythmic perception and perceptual mechanisms shared across music and speech, which may bootstrap language acquisition.

Keywords: language acquisition, prosody, grouping, iambic-trochaic law, perceptual biases, French-learning infants, German-learning infants

#### Edited by:

Sonja A. Kotz, Maastricht University, Netherlands

#### Reviewed by:

Milene Bonte, Maastricht University, Netherlands; Kathrin Rothermich, McGill University, Canada; Simone Falk, Ludwig Maximilians University of Munich, Germany

\*Correspondence: Nawal Abboub, n.abboub@gmail.com

Received: 15 December 2015; Accepted: 31 May 2016; Published: 14 June 2016

#### Citation:

Abboub N, Boll-Avetisyan N, Bhatara A, Höhle B and Nazzi T (2016) An Exploration of Rhythmic Grouping of Speech Sequences by French- and German-Learning Infants. Front. Hum. Neurosci. 10:292. doi: 10.3389/fnhum.2016.00292

### INTRODUCTION

Perception of complex auditory patterns such as music or speech requires the auditory system to decompose or parse the acoustic signal into smaller units. One example is the segmentation of auditory input into chunks, known as perceptual grouping. An everyday example of this phenomenon is the "tick-tock" one hears when listening to a clock (Bolton, 1894), although the signal consists of re-occurrences of identical sounds. Importantly, such grouping is influenced by the duration, intensity, and pitch variation of the sounds. If the sequence alternates in duration (i.e., long-short-long-short...), adult listeners group the sounds into iambic (weak-strong, final prominence) pairs (short-long, see **Figure 1A**). If a sequence alternates in intensity (i.e., loud-soft-loud-soft...) or pitch (i.e., high-low-high-low...), the opposite is true: adult listeners tend to group the sequence into chunks of a trochaic (strong-weak, initial prominence) pattern (i.e., loud-soft [see **Figure 1B**] or high-low [see **Figure 1C**]). This particular rhythmic grouping was initially demonstrated in the musical domain (Woodrow, 1909, 1911; Cooper and Meyer, 1960), and it was later extended to the linguistic domain and termed the Iambic-Trochaic Law (ITL; Hayes, 1995; Nespor et al., 2008). Iambic and trochaic are two possible stress patterns of words and/or prosodic constituents (such as phonological phrases), which constitute a language's rhythmic structure. In many languages, words or phrases with initial prosodic prominence are often higher in intensity and/or in pitch on the first syllable, whereas words or phrases with final prominence are often longer in duration on the last syllable (Hayes, 1995; Nespor et al., 2008). This tendency affects the way individual listeners group sequences of sounds that alternate in one of these three acoustic cues. This effect has been demonstrated in adult listeners of numerous languages (French and German: Bhatara et al., 2013, 2015; Italian: Bion et al., 2011; French and English: Hay and Diehl, 2007). Based on the similarities in rhythmic structure between language and music as well as perceptual grouping preferences of speech and non-speech material, it has been proposed that the ITL may be a general auditory mechanism, governed by abstract, universal principles (Hayes, 1995; Iversen et al., 2008; Nespor et al., 2008; Yoshida et al., 2010; Bion et al., 2011).
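The grouping predictions of the ITL described above can be summarized in a short Python sketch. This is only an illustration of the rule as stated in the text, not a model used in the study; the names `ITL_PREDICTION` and `group_pairs` and the "S"/"W" labels are hypothetical.

```python
# Predicted grouping under the Iambic-Trochaic Law (ITL), as summarized in the text.
# All names here are illustrative assumptions, not part of the original study.
ITL_PREDICTION = {
    "duration": "iambic",     # short-long pairs, final prominence
    "intensity": "trochaic",  # loud-soft pairs, initial prominence
    "pitch": "trochaic",      # high-low pairs, initial prominence
}


def group_pairs(sequence, cue):
    """Chunk an alternating sequence into the pairs the ITL predicts.

    `sequence` is a list of "S" (prominent: long, loud, or high) and
    "W" (non-prominent) labels that strictly alternate.
    """
    pattern = ITL_PREDICTION[cue]
    first = "W" if pattern == "iambic" else "S"
    start = sequence.index(first)        # skip a leading element if needed
    usable = sequence[start:]
    usable = usable[: len(usable) - len(usable) % 2]  # drop an unpaired tail
    return [tuple(usable[i:i + 2]) for i in range(0, len(usable), 2)]
```

For a stream that begins with a prominent element, a duration cue yields weak-strong pairs (the first element is left ungrouped), whereas a pitch or intensity cue yields strong-weak pairs from the first element onward.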

Nevertheless, it has been hypothesized that a listener's language environment might modulate the effects of the ITL, at least in part. Prosodic patterns as well as their acoustic correlates differ across languages. For example, in English and German, accentuation at the word level predominantly falls on the initial syllable in disyllabic words and is marked mostly by a contrast in intensity and/or pitch (trochaic pattern), whereas French has no lexical stress per se (Delattre, 1966; Dogil and Williams, 1999). Languages can also differ with respect to accent placement within the phonological phrase. For example, in English, phrases are iambic (i.e., emphasizing the last word in a phrase, e.g., to PARIS), whereas in Japanese, they are trochaic (i.e., emphasizing the first word in a phrase, e.g., TOKYO; Nespor and Vogel, 1986; Iversen et al., 2008; Nespor et al., 2008). German, on the other hand, can follow either an iambic or a trochaic pattern. This phrasal distinction is caused by cross-linguistic differences in the syntactic parameter of head direction, which determines the possible order of heads and their complements in a given language. As the complement carries the prosodic prominence, head direction is associated with the position of the most prominent element within the phonological phrase. In head-initial languages (e.g., French), the most prominent element is at the end of a phonological phrase, and in head-final languages (e.g., Japanese), it is at the beginning (see Nespor et al., 2008). Note also that the acoustic realization of accentuation varies cross-linguistically. For example, in French, phrasal stress is mainly marked by phrase-final lengthening (hence, a durational iambic pattern); it is also marked by phrase-final pitch movement, corresponding to a pitch rise (iambic) if the phrase is sentence-internal, and a pitch fall if the phrase is sentence-final (trochaic; see Delattre, 1966; Jun and Fougeron, 2002).

This hypothesis of cross-linguistic modulation of the ITL has received support from recent adult studies showing that grouping preferences vary with language experience. Iversen et al. (2008) found that both English and Japanese listeners group sequences of complex tones varying in intensity as predicted by the ITL, that is, trochaically. However, English but not Japanese listeners grouped sequences varying in duration iambically, as predicted by the ITL; in fact, Japanese listeners did not display a consistent preference. Recent findings by Bhatara et al. (2013, 2015) comparing German and French listeners' grouping of sequences of syllables provided further support for cross-linguistic modulation of rhythmic grouping. Both groups followed the ITL predictions for duration and intensity, but the German group showed more consistent performance than the French participants, reflecting a more stable bias. Additionally a pitch-based trochaic grouping preference was found only for the German listeners. The relative weakness in grouping preferences by French listeners when compared to German listeners can, however, be ameliorated when German is acquired as a foreign language in adulthood (Boll-Avetisyan et al., 2015b).

The authors of the above studies have argued that the cross-linguistic differences observed in their studies result from prosodic differences between the languages of the participants in either phrasal structure (for English/Japanese) or word stress (for German/French). If so, this raises the issue of how and when these cross-linguistic differences in the input give rise to cross-linguistic differences in perception (i.e., the nature and developmental trajectory of the mechanisms that generate rhythmic perception of speech/non-speech stimuli). Is the ITL present early in life or does it emerge later in development? How and when in infancy does it become modulated by language exposure? Does the use of all three cues (duration, intensity, pitch) follow the same developmental trajectory?

Recent developmental work has found early rhythmic grouping preferences as predicted by the ITL by 6 months of age. One of the first studies to explore this issue in English-learning infants presented them with complex non-speech tones alternating in duration or intensity and measured their detection of silences (Trainor and Adams, 2000). The results showed that 8-month-old infants perceived iambic groupings when duration was alternated, but they perceived no specific grouping when intensity was alternated. The authors interpreted these results as an indication that grouping by duration and by intensity follow different developmental trajectories. The hypothesis of different trajectories for different cues is also supported by another study, which tested English-learning 6- and 9-month-olds on a grouping task using syllables instead of complex tones (Hay and Saffran, 2012). However, Hay and Saffran's results showed the opposite pattern to those of Trainor and Adams (2000), with the 6-month-old infants grouping by intensity and not by duration and the 9-month-olds grouping by both intensity and duration. The discrepancy between these studies could be due to the materials used, but Hay and Saffran (2012) also tested the 9-month-olds on complex tones and found weaker but similar results to the syllables task; hence, it is unlikely that the difference is due strictly to the materials used. A more likely explanation of the difference between the studies lies in the task. Hay and Saffran (2012) were not testing only rhythmic grouping; they also linked the prosodic cues to statistical cues (transitional probabilities), which may have changed the weight given by infants to the different cues. Together, these results nevertheless strongly suggest early use of intensity and duration for grouping, in accordance with the ITL, but also different developmental trajectories for rhythmic grouping based on these two cues.

A study by Bion et al. (2011) also showed variation in the developmental trajectory of the cues used for rhythmic grouping, this time between duration and pitch. Using the Headturn Preference Procedure (HPP), they familiarized two groups of 7.5-month-old Italian-learning infants with sequences of six syllables, presented repeatedly in the same order for 3 min, alternating either in duration (group 1) or in pitch (group 2). All infants were then tested on their perception of pairs of these syllables, presented without any acoustic/prosodic cues. Half of the test pairs of syllables had been presented with final prominence (short-long or low-high) in the familiarization phase, and the other half had been presented with initial prominence (long-short or high-low) in the familiarization phase. Results showed a preference for the initial prominence items in the pitch condition (interpreted as a familiarity effect), but failed to show a preference for either type of grouping in the duration condition. Similar results were obtained in a study on rats presented with sequences of complex tones; the rats showed evidence of rhythmic grouping by pitch but not by duration (de la Mora et al., 2013). A second study on rats further established that the emergence of duration-based grouping in rats is dependent on the nature of exposure: indeed, after being exposed to a regular duration-based pattern (pairs with either initial or final prominence), rats consistently grouped sequences varying in duration according to the pattern to which they had been exposed (Toro and Nespor, 2015). The authors interpreted these findings as possible evidence that grouping by pitch results from a universal perceptual bias shared across species, whereas grouping by duration would be more linked to auditory experience, and therefore would emerge later in human development after some exposure to language. 
In order to better understand the interplay of potential universal biases and the role of experience in rhythmic perception, and how these factors may depend on the type of acoustic cue, studies are needed that compare the use of pitch, intensity and duration cues in different languages, using the exact same procedures and stimuli.

Only two studies have looked at cross-linguistic differences in rhythmic grouping in infants, using the same procedures and materials for each linguistic group. The first study tested monolingual infants who were 5–6 months or 7–8 months old, learning either English or Japanese (Yoshida et al., 2010). The authors presented the infants with a 2-min familiarization sequence made up of complex tones alternating in duration. In the test phase, they measured the infants' preference for pairs with final prominence (short-long) or initial prominence (long-short). Only the English-learning 7–8-month-olds showed any preference, and they preferred the pairs with an initial prominence. The Japanese-learning infants did not show a preference at either age. The authors suggest that the older English-learning infants were able to segment the sequence varying in duration into pairs with final prominence, and thus interpret the preference for pairs with initial prominence as a novelty preference. These results appear to extend the cross-linguistic differences found with Japanese- and English-listening adults to infancy (Iversen et al., 2008), and were taken as evidence that modulation of the ITL might be related to linguistic properties at the phrasal level. However, because there were no control sequences (without acoustic variation, or varying in another parameter) in the experiment, the possibility remains that this effect is due to English-learning infants' bias for prominence-initial items, which has been demonstrated around this same age (Jusczyk et al., 1993), and not to sensitivity to the ITL per se.

The second cross-linguistic study used the same methods with tone stimuli as Yoshida et al. (2010) with a shorter familiarization (1 min 30 s), testing 9-month-old bilingual infants who were dominant in either Basque or Spanish, two languages that also differ in phrasal prosody with Basque having phrase-initial and Spanish having phrase-final stress (Molnar et al., 2014, 2016). Both duration- and intensity-varied sequences were tested. For intensity, both groups showed the same perceptual grouping and no linguistic modulation was found. However, for duration, cross-linguistic differences similar to those reported by Yoshida et al. (2010) were found: the Spanish-dominant infants had a preference for sequences with initial prominence, whereas the Basque-dominant infants had no significant preference, but showed a positive correlation between the amount of exposure to Basque and a preference for sequences with final prominence. On a more methodological issue, note that in both Yoshida et al. (2010) and Molnar et al. (2014), the preferences observed were attributed to novelty effects, assuming that they had grouped the familiarization string into short-long sequences (as expected by the ITL), and subsequently preferred novel long-short groupings over familiar short-long groupings. These new findings again suggest a link between modulation of the ITL and experience with the phrasal prosody of the native language.

In summary, previous research suggests that grouping preferences as predicted by the ITL emerge between 6 and 9 months and are partly modulated by linguistic experience. These studies suggest that a general auditory mechanism may be in place very early in infancy and may also be modulated by language exposure. However, the number of studies exploring grouping in infants is limited, and the results are mixed. Indeed, there is some evidence for the use of all of the three cues in at least one language and age group, but use is not found consistently. This raises questions regarding how this use is modulated across languages and development. Moreover, even though there appears to be a strong and early effect of language exposure, it remains poorly understood. So far, evidence from cross-linguistic studies has only revealed differences for sequences varying in duration. These results could suggest that grouping by duration is more dependent on language experience, whereas grouping by pitch or intensity is based on more general auditory mechanisms. Grouping by duration could be dependent on the prosodic properties of the language at the phrasal level (Iversen et al., 2008; Yoshida et al., 2010; Gervain and Werker, 2013; Molnar et al., 2014), although it has also been proposed that some modulation of the ITL might also stem from the word level (Bhatara et al., 2013, 2015). However, it is difficult to make generalizations like this based on the literature available thus far; the only two studies to test grouping in infants cross-linguistically (Yoshida et al., 2010; Molnar et al., 2014) differ in ages tested, length of familiarization, and acoustic cues tested. As mentioned earlier, no single study so far has tested all three of the cues cross-linguistically using the same methodology and using speech stimuli. Additionally, previous cross-linguistic studies testing infants have only compared languages differing in their phrasal stress (English/Japanese; Spanish/Basque).
Here, we present a comparison of infants learning languages differing mostly in stress at the word level (French and German; see Bhatara et al., 2013), and we examine their perception of all three acoustic cues in syllable strings.

The present study is designed to answer the following questions: first, are all three acoustic cues (duration, intensity or pitch) used for grouping speech sequences at 7.5 months? If this were not the case, it would support differences in the trajectory of use for rhythmic grouping of the three cues. Second, is this grouping modulated by linguistic experience at this early age? Accordingly, we used methods similar to those of Bion et al. (2011), testing 7.5-month-old infants learning either French or German. If grouping by certain cues is modulated by linguistic experience at this age, we should see differences between these two groups that reflect the differences demonstrated in French and German adults (Bhatara et al., 2013, 2015). If, however, language has no effect on the ITL at this age, both groups should show similar patterns of grouping, that is, according to the ITL, trochaic grouping for the intensity and pitch conditions and iambic grouping for the duration condition.

In order to test these hypotheses, we used a familiarization-plus-test procedure with the Headturn Preference Procedure, following Bion et al. (2011), and tested both French- and German-learning infants. As in that study, all infants were familiarized for 3 min with a continuous stream of the same six syllables, each infant being assigned to one of four conditions: three conditions each testing the use for grouping of a given cue (duration variation, intensity variation or pitch variation) and one control condition (no duration/intensity/pitch variation), which provided a baseline preference for the test items. Then, all infants were tested with two types of syllable pairs from these streams, presented without acoustic variation, which had been presented with initial vs. final prominence in familiarization. In this specific HPP paradigm, rhythmic grouping of the familiarization sequence would be attested if infants demonstrated a differential response (measured in looking times) to the two types of stimuli in the test phase. In our case, based on Bion et al. (2011), we hypothesized that this difference in looking time would show that infants have memorized the stream as pairs of syllables that followed the rhythmic patterns predicted by the ITL (i.e., syllable pairs instantiating a short-long, loud-soft or high-low pattern in the stream).

TABLE 1 | Participant information of the four experimental conditions.

What was less clear based on the literature was whether in the present study, infants would show a preference for the syllable pairs corresponding to the familiar (ITL-based) or the novel grouping. However, Hunter and Ames (1988) suggested that preferences in infants reflect the interaction among several factors, such as age, stimulus complexity and task difficulty. They proposed that infants typically display novelty preferences if the task is relatively easy (in the present case, if a cue is easy to use for grouping since age and task were constant across conditions) and familiarity preferences if the task is relatively complex. According to previous research investigating the ITL in infants reviewed earlier, using the same procedure (HPP; Yoshida et al., 2010; Bion et al., 2011; Molnar et al., 2014) and a procedure that does not rely on any novelty/familiarity interpretation (conditioned head turn; Trainor and Adams, 2000), the emergence of rhythmic grouping preference might differ across the three acoustic cues, meaning that some cues may be processed more easily than others. For this reason, we might expect that infants would have a familiarity preference for the cue(s) that are more complex for them to use for grouping and a novelty preference for less complex cue(s).

### MATERIALS AND METHODS

### Participants

We tested a total of 211 monolingual full-term 7.5-month-old infants, learning either French in Paris, France, or German in Potsdam, Germany. There were four conditions (pitch, intensity, duration, and control) and two languages. Twenty infants were included in each condition/language combination (n = 160). We excluded 51 infants because of fussiness/tiredness (36 infants), more than three insufficient looking times (<1500 ms; 4 infants), technical error (4 infants), being an outlier (i.e., with the difference between the mean orientation times of the two item types 2 SDs above or below the group mean; 5 infants) or other inability to finish the experiment (2 infants). See **Table 1** for more details on the infants included in each condition. All parents gave informed consent before the experiment. The present experiment was approved in Paris by the ethics board "Conseil d'évaluation éthique pour les Recherches en Santé" (CERES) at the Université Paris Descartes and by the ethics board of the University of Potsdam.
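As an illustrative sketch of the outlier criterion and the exclusion tally described above, the following Python code flags infants whose difference score lies more than 2 SDs from the group mean. The function name `flag_outliers`, the use of the sample standard deviation, and the example difference scores are assumptions made for illustration; the text does not specify the estimator, and no real data are reproduced here.

```python
from statistics import mean, stdev

# Exclusion counts as reported in the text.
EXCLUSIONS = {
    "fussiness/tiredness": 36,
    "insufficient looking times": 4,
    "technical error": 4,
    "outlier": 5,
    "other inability to finish": 2,
}
N_EXCLUDED = sum(EXCLUSIONS.values())   # total number of excluded infants
N_INCLUDED = 20 * 4 * 2                 # 20 infants x 4 conditions x 2 languages


def flag_outliers(diffs, n_sd=2.0):
    """Return indices of infants whose difference score (mean looking time
    to one item type minus the other) lies more than `n_sd` SDs from the
    group mean. Uses the sample SD, which is an assumption."""
    m = mean(diffs)
    sd = stdev(diffs)
    return [i for i, d in enumerate(diffs) if abs(d - m) > n_sd * sd]
```

Applied within each group, `flag_outliers` returns the positions of the infants to exclude; the counts above sum to the 51 exclusions and 160 inclusions reported.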

### Stimuli

As in Bion et al. (2011), the stimuli were sequences of six syllables created by combining six vowels (/a:, e:, i:, o:, u:, y:/) with six consonants (/f, n, g, p, r, z/). These were selected for two reasons. First, they are phonemes that exist in both French and German, even if their realizations differ slightly; as a result, the segmental variability was the same for the two language groups. Second, the vowels and consonants vary phonologically (vowel roundness, height, and place, and consonant voicing, manner, and place). These syllables were concatenated into a stream in such a way that it contained no disyllabic words in either language: /na: zu: gi: pe: fy: ro:/. Syllables were separated by a pause of 100 ms. For the four different conditions, we created sequences in which syllables alternated in either duration, intensity, pitch, or nothing (control condition; see **Supplementary Figure S1** in Supplementary Material). The sequences were synthesized using two female voices in MBROLA (Dutoit et al., 1996), one French (fr4) and one German (de5)<sup>1</sup>.

In the familiarization phase, the six-syllable sequence was repeated 66 times, in order to last about 3 min. The acoustic variation was added to the sequence using a combination of MBROLA and PRAAT (Boersma and Weenink, 2010). The duration manipulation was applied to the vowels, and the pitch and intensity manipulations to the whole syllable. The values of duration, intensity and pitch variation were chosen in order to stay close to the values naturally present in these two languages (e.g., Bhatara et al., 2013) while also attempting to replicate Bion et al. (2011) as closely as possible (see **Table 2** for a summary of the values). Note that the baseline and control values for the intensity (70 dB) and duration (360 ms) conditions were also the means of the variation condition. However, for the pitch condition, because the baseline would have been too high and sounded unnatural if we had used the mean of the variation we chose (200–420 Hz, so 310 Hz), we decided to use 200 Hz as baseline.
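The stated familiarization length of about 3 min can be checked with a short calculation. The sketch below assumes that the 360 ms baseline is the mean syllable duration and that every syllable is followed by one 100 ms pause; the onset ramp is ignored. The constant and function names are illustrative.

```python
SYLLABLES_PER_SEQUENCE = 6
REPETITIONS = 66
MEAN_SYLLABLE_MS = 360  # assumed mean syllable duration (baseline value in the text)
PAUSE_MS = 100          # pause after each syllable


def familiarization_ms(repetitions=REPETITIONS,
                       syllables=SYLLABLES_PER_SEQUENCE,
                       syllable_ms=MEAN_SYLLABLE_MS,
                       pause_ms=PAUSE_MS):
    """Approximate stream length in milliseconds, ignoring the onset ramp."""
    return repetitions * syllables * (syllable_ms + pause_ms)


total_ms = familiarization_ms()
total_min = total_ms / 60000
```

Under these assumptions, 66 × 6 × 460 ms = 182,160 ms, i.e., roughly 3.04 min, which is consistent with the reported duration.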

<sup>1</sup>Prior experiments with duration, intensity or pitch cues were run using eight syllables rather than six syllables and without pauses between the syllables, making the task more difficult. These experiments revealed no evidence of ITL-based responses in 7-month-old French- or German-learning infants (Boll-Avetisyan et al., 2015a,b).

TABLE 2 | Acoustic variation values for each condition.


Several previous studies on rhythmic grouping found a strong influence of the onset of the sequence on perceived grouping. The first two sounds heard tend to serve as an anchor point (Woodrow, 1909; Trainor and Adams, 2000; Hay and Diehl, 2007). For this reason, we created "ramps" to mask the onset of each sequence in two ways. The first aspect of the sequence onset mask was inspired by Hay and Diehl (2007) and used in the same way as Bhatara et al. (2013): we added white noise masking over the first four repetitions of the sequence (10 s), decreasing in intensity as the sequence itself increased in intensity from silence, with both the increase and the decrease following a raised cosine function. The second aspect of the sequence onset masking ramp was inspired by Bion et al. (2011), who inserted a gradual increase of the rhythmic cue, starting on the third syllable. For example, in the duration condition, the first two syllables had equal duration (260 ms) and starting with the third syllable, every odd syllable increased by 20 ms. Thus, the third syllable was 280 ms long, the fourth 260 ms, the fifth 300 ms, the sixth 260 ms, and so on until the maximum of 460 ms was reached, so that every odd syllable was the longer one. In our study, to counterbalance the start of the increasing variation, half of the increases started on the fourth rather than the third syllable, so that every even syllable was the longer one. This resulted in two different ramp types. Similar manipulations were performed in the pitch and intensity conditions. The pitch condition started at 200 Hz and increased by 20 Hz every other syllable until it reached 420 Hz, and the intensity condition started at 66 dB and increased by 1 dB every other syllable until it reached 74 dB.
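The gradual cue ramp described above can be sketched as follows. This is a minimal reconstruction from the values given in the text, not the script used to build the stimuli; the function name `cue_ramp` and its parameters are hypothetical, and the white-noise mask is not modeled.

```python
def cue_ramp(start, peak, step, first_prominent=3, n_syllables=30):
    """Per-syllable cue values at the onset of a familiarization stream.

    Syllables are numbered from 1. Every second syllable from
    `first_prominent` onward rises by `step` until `peak` is reached;
    all other syllables stay at `start`.
    """
    values = []
    current = start
    for pos in range(1, n_syllables + 1):
        if pos >= first_prominent and (pos - first_prominent) % 2 == 0:
            current = min(current + step, peak)
            values.append(current)
        else:
            values.append(start)
    return values


duration_odd = cue_ramp(260, 460, 20, first_prominent=3)   # ms, odd syllables longer
duration_even = cue_ramp(260, 460, 20, first_prominent=4)  # ms, even syllables longer
pitch_odd = cue_ramp(200, 420, 20, first_prominent=3)      # Hz
intensity_odd = cue_ramp(66, 74, 1, first_prominent=3)     # dB
```

With these settings, the duration ramp begins 260, 260, 280, 260, 300, 260 ms, matching the example in the text, and the second ramp type shifts the same increase by one syllable.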

The items used at test corresponded to the six disyllables that had occurred during the familiarization (/na:zu:/, /zu:gi:/, /gi:pe:/, /pe:fy:/, /fy:ro:/, /ro:na:/). Crucially, during the test phase, both syllables of a test item were equal in pitch, duration, and intensity; hence, preferences observed could not depend on the acoustic/prosodic properties of the syllables at test. Half of these disyllables had been presented with final prominence (i.e., short-long, low-high, or soft-loud, depending on the condition) and the other half with initial prominence (i.e., long-short, high-low, or loud-soft) in the familiarization. Six sound files were prepared, each containing one of the six test items repeated 16 times, lasting 14.5 s. Each of the six sound files was presented twice, leading to a test phase of 12 trials, and a different random order of presentation of the trials was used for each participant. The test items were synthesized with the same MBROLA voice the participants had heard during the familiarization phase.
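To make the construction of the test items concrete, the sketch below derives the six disyllables from the cyclic syllable stream, splits them by the prominence pattern they carried in familiarization, and builds a 12-trial random order. The function names and the `prominent_parity` parameter are illustrative assumptions; the seed stands in for whatever per-participant randomization was actually used.

```python
import random

STREAM = ["na:", "zu:", "gi:", "pe:", "fy:", "ro:"]


def classify_pairs(stream, prominent_parity):
    """Split adjacent syllable pairs of the cyclic stream by prominence.

    `prominent_parity` is 1 if odd-numbered syllables carried the
    prominent value (longer, higher, or louder) in familiarization,
    and 0 if even-numbered syllables did.
    """
    n = len(stream)
    initial, final = [], []
    for i in range(n):
        pair = stream[i] + stream[(i + 1) % n]
        if (i + 1) % 2 == prominent_parity:
            initial.append(pair)   # pair starts on a prominent syllable
        else:
            final.append(pair)     # pair ends on a prominent syllable
    return initial, final


def trial_order(items, seed):
    """Each test item twice, in a random order (12 trials for six items)."""
    trials = list(items) * 2
    random.Random(seed).shuffle(trials)
    return trials
```

For infants whose odd-numbered syllables were prominent, /na:zu:/, /gi:pe:/, and /fy:ro:/ count as initial-prominence items and the remaining three as final-prominence items; the assignment reverses for the other ramp type.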

### Procedure, Apparatus, and Design

We used the HPP (Kemler Nelson et al., 1995) for this study. The infants were seated comfortably on their parents' lap in a soundproof booth. A green light was directly in front of the infant. There was a red light on each side of the room, at the same height as the green light. Speakers were hidden behind the wall underneath the red lights. The experimenter sat outside the booth and observed the infants using a camera under the green light. The experimenter controlled the stimulus presentation and the blinking of the lights according to the head movements of the infant by pressing three buttons on a button-box (one for each light, see further details below). Both the experimenter and the parent wore headphones playing music that masked the stimuli.

The experiment began with the familiarization phase. The infant heard one of the four familiarization sequences (duration, intensity, pitch, or control) from both speakers simultaneously via an amplifier. Half of the infants heard a familiarization sequence with the acoustic variation starting on the third syllable and half heard variation starting on the fourth syllable. Additionally, half of the infants in each condition and in each language heard the German voice, and half heard the French voice. During this phase, the lights blinked according to where the infant looked, but the sound was not presented contingently on the infant's behavior. In the test phase, which was the same across all four conditions, all infants heard 12 trials: three disyllables that had had initial prominence and three that had had final prominence during familiarization, each presented twice (in order to simplify the item labeling, we use the terms "initial prominence" and "final prominence" even though all disyllables were free of acoustic variation in the test phase). The trials were presented in a different random order for each participant. The side of the loudspeaker from which the stimuli were presented was randomly varied from trial to trial, with the constraint that half of the trials of each kind were presented on each side.

Each trial began with the green light blinking in order to center the infant's gaze. After the infant looked at the green light, the experimenter pressed a button to make the red light on one of the side panels start blinking. When the infant looked at the red light, the sound and the trial began. If the infant turned away from the light for more than 2 s while the sound was playing, both the blinking and the sound stopped, and the green light began blinking again. If the infant looked away and then back again within the period of 2 s, the sound continued to play. However, this time of looking away from the light was subtracted from the total looking time for that trial. Information about the duration of the head turn was stored on the computer.
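The trial-termination and looking-time rules described above can be sketched in Python as follows. This is a simplified illustration rather than the software used in the experiment: the function name, the representation of look-aways as (start, end) times in seconds, and the choice to end the trial's looking time at the onset of the terminating look-away are all assumptions.

```python
def trial_looking_time(look_aways, sound_duration, max_away=2.0):
    """Compute looking time for one trial.

    `look_aways` is a time-ordered list of (start, end) intervals, in
    seconds from trial onset, during which the infant looked away.
    A look-away longer than `max_away` ends the trial; shorter ones are
    subtracted from the total. Returns (looking_time, trial_end).
    """
    away_total = 0.0
    for start, end in look_aways:
        if start >= sound_duration:
            break                          # sound file already finished
        end = min(end, sound_duration)
        if end - start > max_away:
            # Trial stops; only time before this look-away is counted.
            return start - away_total, start
        away_total += end - start          # brief look-away: subtract it
    return sound_duration - away_total, sound_duration
```

For example, with a 14.5 s sound file, a 1 s look-away at 3 s followed by a 3 s look-away at 8 s gives a looking time of 7 s, whereas a single 1.5 s look-away gives 13 s.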

## RESULTS

Looking times for the prominence-initial and prominence-final items were averaged for each participant. Note that test items in the control condition cannot be referred to as having initial or final prominence because the syllables were all at baseline pitch, intensity and duration. However, to explore whether the onset of the sequences served as an anchor point for grouping (as in Woodrow, 1909; Trainor and Adams, 2000; Hay and Diehl, 2007), we decided in the control condition to label the pairs that would be segmented by using the first syllable to initiate the grouping process (1–2, 3–4, and 5–6) as "prominence-initial," and the other pairs, corresponding to starting the grouping from the second syllable (2–3, 4–5, and 6–1), as "prominence-final."
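The labeling convention for the control condition can be made explicit with a short sketch. It assumes that the familiarization stream cycles through six syllable positions, which is what the pair 6–1 implies; the function name is hypothetical.

```python
def label_control_pairs(n_syllables=6):
    """Label adjacent syllable pairs in the control condition (sketch).

    Pairs starting on an odd position (grouping anchored to the sequence
    onset) are labeled "prominence-initial"; pairs starting on an even
    position are labeled "prominence-final".
    """
    labels = {}
    for start in range(1, n_syllables + 1):
        following = start % n_syllables + 1  # position 6 wraps around to 1
        if start % 2 == 1:
            labels[(start, following)] = "prominence-initial"
        else:
            labels[(start, following)] = "prominence-final"
    return labels


for pair, label in label_control_pairs().items():
    print(pair, label)
```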

To determine whether infants process the two "types" of pairs (prominence-initial vs. final) as predicted by the ITL for the duration, intensity and pitch conditions but not for the control condition, the mean looking times for the two types were averaged across infants, for the first vs. last three trials of each type (in order to be able to explore block effects). We performed repeated-measures ANOVAs on these mean looking times with the within-subjects factor of block (first vs. last three trials of each item type) and the between-subjects factor of condition (duration, intensity, pitch, or control). There was a strong effect of block, F(1,156) = 81.82, p < 0.00001, η<sub>p</sub><sup>2</sup> = 0.34, and an effect of condition, F(3,156) = 2.70, p = 0.048, η<sub>p</sub><sup>2</sup> = 0.049. No other effects or interactions were significant.
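As a consistency check on the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity η<sub>p</sub><sup>2</sup> = (F × df<sub>effect</sub>) / (F × df<sub>effect</sub> + df<sub>error</sub>). The sketch below is illustrative and is not the authors' analysis code; the function name is hypothetical.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values reported for the omnibus ANOVA above.
print(round(partial_eta_squared(81.82, 1, 156), 2))  # 0.34 (block)
print(round(partial_eta_squared(2.70, 3, 156), 3))   # 0.049 (condition)
```

The same identity reproduces the effect sizes reported for the first-block and per-condition analyses, for example F(1,38) = 9.88 gives approximately 0.21.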

The lack of an effect or interactions involving item type (prominence-initial and prominence-final items) indicates a failure to show grouping. However, the results show that infants' looking times significantly decrease throughout the experiment (from a mean of 8.67 s in the first trial to 6.24 s in the last trial), independently of the condition. For this reason, we decided to run a second ANOVA restricted to the first block of each item type, since more transient effects might be observed in the earlier part of the test phase (see further discussion of this point in the ''General Discussion'' Section).

Results restricted to the first block are shown in **Figure 2**. We performed a repeated-measures ANOVA on these mean looking times with the within-subjects factor of item type (prominence-initial or -final) and the between-subjects factors of native language (French or German) and condition (duration, intensity, pitch, or control). There was a significant effect of item type, F(1,152) = 4.76, p = 0.03, η<sub>p</sub><sup>2</sup> = 0.03, as well as a significant interaction between item type and condition, F(3,152) = 3.29, p = 0.022, η<sub>p</sub><sup>2</sup> = 0.06. These results show that infants differentiate between disyllables that had initial vs. final prominence during familiarization, and that the way they differentiate them changes depending on the condition. There were no other significant effects.

Next, in order to more closely examine the interaction between item type and condition, we analyzed each condition separately. For the control condition, a simple t-test for item type was conducted. For the other three conditions, repeated-measures ANOVAs were conducted with the factors of item type and ramp type (whether the variation started on the third or fourth syllable).

### Control Condition

There was no effect of item type, t(39) = 1.41, p = 0.16. Infants looked equivalently at the "prominence-initial" items (M = 8.28 s) and the "prominence-final" items (M = 7.81 s), suggesting that during the test phase, the infants had no particular preference for specific pairs of syllables following the familiarization sequence that was neutral in terms of promoting ITL-based grouping.

### Duration Condition

There was a significant main effect of item type, F(1,38) = 9.88, p = 0.003, η<sub>p</sub><sup>2</sup> = 0.21, with infants looking longer for prominence-final items (M = 8.18 s) than for prominence-initial items (M = 7.21 s). There was also a significant interaction between ramp type and item type, F(1,38) = 6.34, p = 0.016, η<sub>p</sub><sup>2</sup> = 0.14. No other effects or interactions were significant.

To better understand the interaction between ramp and item type, we examined the effect of item type on each ramp type. It appears that the infants looked longer for prominence-final items when the ramp started on the fourth syllable (M<sub>final</sub> = 8.97 s vs. M<sub>initial</sub> = 7.21 s), t(19) = 3.83, p < 0.001, whereas there was no looking time difference if the ramp started on the third syllable (M<sub>initial</sub> = 7.20 s vs. M<sub>final</sub> = 7.39 s), t(19) = −0.46, p = 0.65.

### Intensity Condition

There were no significant effects or interactions for intensity-varied sequences, indicating that infants did not show any grouping preference.

### Pitch Condition

For the pitch-varied sequences, there was a significant effect of item type, F(1,38) = 4.78, p = 0.035, η<sub>p</sub><sup>2</sup> = 0.11. Infants looked longer for prominence-final items (M = 7.87 s) than for prominence-initial items (M = 7.01 s). There were no significant effects of ramp type and no significant interactions.

## DISCUSSION

In this set of studies, we have examined French- and German-learning infants' rhythmic grouping of linguistic sequences according to the ITL. Our first research question was whether German- or French-learning infants by 7.5 months of age would group linguistic sequences according to the ITL: that is, prominence-initial for intensity and pitch and prominence-final for duration. Second, we wanted to evaluate whether this ITL-based grouping is already modulated by native language experience at that age.

In both the duration and the pitch conditions, both German- and French-learning infants at test looked longer at the items that had been prominence-final in the familiarization. Given that there was no preference for the test items in the control condition, this shows that the preferences observed here are induced by the specific properties of the two familiarization conditions. Hence, it appears that 7.5-month-old infants learning either language can make use of duration and pitch cues to segment syllable sequences. For the intensity condition, there was no significant effect. Below, we discuss the results of the separate conditions, followed by a comparison of these results in an integrative discussion.

### Control

We did not find any evidence for a default grouping in the 7.5-month-olds in our study when no acoustic variation was present in the familiarization sequence. Previous studies with adults have found a default trochaic grouping of sequences without any acoustic variation of the relevant cues in native speakers of English (Rice, 1992; Hay and Diehl, 2007) and of German but not of French (Bhatara et al., 2013), unless native speakers of French had second language knowledge of German (Boll-Avetisyan et al., 2015b). These results suggest that under the present experimental conditions (i.e., with segmental variability and without any prosodic cues), infants at 7.5 months of age do not segment the familiarization sequence into bisyllabic chunks. The lack of a preference for any of the syllable pairs presented during the test phase is important to note because it indicates that any preferences found in the other familiarization conditions do not result from intrinsic preferences for some of these pairs (since test pairs were identical across all four conditions) but instead reflect preferences induced by rhythmic grouping. Moreover, this null result may suggest that without any acoustic cues, infants do not group the sequence at that age, and that the default trochaic segmentation bias found in English and German adults emerges later in development.

### Duration

For the duration-varied condition, we found a preference for prominence-final items over prominence-initial items in the test phase, establishing that the infants used the duration variation in the familiarization phase to group syllables into pairs. This grouping then induced a preference at test, a conclusion based on the fact that no preference was found for the same test items following familiarization in the control condition (see details above). Additionally, if we assume that the infants grouped syllables iambically during the familiarization as predicted by the ITL for duration-varied sequences, then in the test phase, they appear to listen to the syllable pairs that were "familiar" given the familiarization phase.

This familiarity effect is in the opposite direction from results of previous studies, which had shown that 7–8-month-old Canadian English-learning infants (Yoshida et al., 2010) and 9-month-old bilingual Spanish/Basque-learning infants (Molnar et al., 2014) have preferences for prominence-initial items at test. Both sets of authors interpreted these results as indicative of a novelty preference, according to the model proposed by Hunter and Ames (1988). Because familiarity preferences may arise instead of novelty preferences when infants are younger or find the tasks more difficult (Hunter and Ames, 1988), we hypothesize that the present familiarity preference is a consequence of the fact that either the task or material used in the present study might be more difficult than that of Yoshida et al. (2010) or Molnar et al. (2014). The present task included a long familiarization sequence (3 min, as in Bion et al., 2011) relative to previous studies (2 min in Yoshida et al., 2010 and 90 s in Molnar et al., 2014), which should have led, if anything, to an even stronger novelty effect. However, our task might have been more difficult because infants had to memorize the syllables during familiarization in order to show a preference in the subsequent test phase. In contrast, in Yoshida et al. (2010) and Molnar et al. (2014), the test phase (with stimuli including acoustic cues for prominence) tested for a relative preference for trochaic over iambic stress patterns, and this task did not require memorizing and recognizing the familiarization stimuli. Furthermore, the preference for the trochaic pattern found in these studies could have resulted, at least for English, from the emergence of infants' preference for trochaic words around that age (Jusczyk et al., 1993).
Moreover, regarding the material, the stimuli in both previous studies were sequences of tones, whereas we presented infants with sequences of syllables, that is, with much more acoustic variation created by the segments forming the syllables. Since more complex stimuli have been shown to be more difficult to process in ITL-related tasks for both French and German adults (Bhatara et al., 2015), it would be reasonable to assume that the same would be true for infants. Hence, both our task and stimuli might be more difficult than in Yoshida et al. (2010) and Molnar et al. (2014), which might explain why we found a familiarity rather than a novelty preference for the ITL-based short-long pattern.

Moreover, our findings also appear to differ from those of Bion et al. (2011), who did not find sensitivity to duration for grouping in Italian-learning infants at the same age, even though our method closely replicated theirs. However, there was one important difference between the two studies: in Bion et al. (2011), the prominent syllables were always the odd syllables of the syllable sequence, whereas in the current study, the position of prominent syllables was counterbalanced (on odd syllables for half of the infants, on even syllables for the other half). This counterbalancing turned out to have a marked impact on performance in our infants (in spite of the onset of the sequence being masked by fading-out white noise and fading-in intensity in addition to the step-wise increase of the alternation), replicating similar effects in both adults and infants (Trainor and Adams, 2000; Hay and Diehl, 2007). Indeed, our findings show that when the duration variation was on odd syllables (as in Bion et al., 2011), French- and German-learning infants also failed to show evidence of having grouped the syllables of the familiarization sequence. However, when the duration variation was on even syllables, infants succeeded.

How can we explain this ramp/positional effect? One possibility is that infants also use a default grouping mechanism that extracts disyllables aligned to the onset of the sequence they are presented with. When this default mechanism is aligned with ITL-based grouping (extracting short-long pairs) as is the case when stress is on even syllables, then both mechanisms would converge in their grouping results. Note however that this default mechanism is probably not very powerful, as attested by the lack of grouping effects in the control condition (where it should have given odd-even syllable sequences), and in the pitch and intensity conditions (where it should have induced larger effects in the sequences with variation on the odd syllables, which would have aligned with prominence-initial ITL-based grouping). If this interpretation is correct, then it is possible that Bion et al. (2011) would have found a grouping preference for duration if they had presented the ramp starting on the fourth syllable. Note that this ramp effect is another indication that, at least in this task, duration may be a grouping cue difficult to use at 7.5 months of age (see more on this point regarding the differing pattern of results in the pitch condition).

Hence, our study adds to the existing literature on infants with different native languages (English, bilingual Spanish/Basque [Spanish-dominant], French, and German), which have shown that infants group sound sequences by duration. This literature further shows that some infants, namely Japanese and Spanish/Basque [Basque-dominant]-exposed infants, do not show grouping preferences based on duration. These results suggest that several factors influence rhythmic grouping development, including both sequence structure and native language. In our study, it may be that the familiarity preference shows that our French- and German-learning 7.5-month-olds still found it difficult to use the duration cue present in our stimuli (as it is found only in one ramp condition).

### Pitch

In the pitch condition, we found that infants at test preferred listening to items that had had final prominence during the familiarization phase, establishing that they used pitch to group the syllables in the familiarization sequence. Considering prior experiments with adults and infants (Bion et al., 2011; Bhatara et al., 2013; Gervain and Werker, 2013) showing ITL-based trochaic grouping of pitch-varied sequences, these results would indicate a novelty preference. Given the fact that we found a familiarity effect in the duration condition, and given the Hunter and Ames (1988) model, this novelty preference would suggest that it was easier for our French- and German-learning 7.5-month-olds to group the syllables based on pitch than on duration variation. This interpretation is independently supported by the fact that the effects in the pitch condition were not modulated by the ramp used, contrary to what was found in the duration condition, suggesting more stable use of pitch than duration to group in our experiments. In addition, recall that Bion et al. (2011) found a familiarity preference in their pitch-varied stimuli among Italian infants, that is, a preference for prominence-initial items. Hence, it also appears that it was easier for the French- and German-learning infants than it had been for the Italian infants to use pitch. Since both studies used very similar materials (synthesized non-word syllable sequences) and methods, it is possible that this difference relates to the infants' native languages, but other cross-linguistic studies on grouping by pitch are needed to further explore this difference.

### Intensity

In the intensity condition, there were no significant effects in the analysis. To our knowledge, all previous studies that tested grouping by intensity used tones (Trainor and Adams, 2000; Molnar et al., 2014) or presented an artificial speech stream including cues from transitional probabilities between syllables along with the acoustic cues (Hay and Saffran, 2012). The present study used speech stimuli, and there were no cues from transitional probabilities between syllables. Hence, the reason that previous studies (Trainor and Adams, 2000; Molnar et al., 2014) but not the present study found effects could be due to methodological differences, either in complexity of material or in lack of additional cues. Another possible explanation for the present lack of grouping by intensity is that intensity is not by itself a relevant cue for infants' processing of rhythm in language. Specifically, it is difficult to tease apart the effect of pitch and the effect of intensity given the fact that sounds with a higher pitch tend to be perceived with higher intensity (Fry, 1955; van Heuven and Menert, 1996; Mattys and Samuel, 2000). It has been shown that 7.5-month-old infants are sensitive to pitch variations in a lexical recognition task but ignore intensity variations (Singh et al., 2008). Taken together, these results suggest that intensity can only be used for grouping in combination with other cues (e.g., transitional probabilities and other prosodic cues). This possibility will have to be tested in future research.

### General Discussion

The present findings are in part consistent with the predictions of the ITL. Nevertheless, our results show that there are differences in the way rhythmic cues are used for rhythmic grouping. Whereas we did not find an effect in the control or intensity conditions, we found a familiarity effect in the duration condition that was additionally modulated by the position of the ramp and a novelty effect in the pitch condition that was not affected by the position of the ramp. It is possible that such differences reflect different perceptual mechanisms being at play. Based on previous studies (Bion et al., 2011; de la Mora et al., 2013), one possible interpretation of these results is that a more stable and consistent grouping for pitch reflects a general, universal auditory processing mechanism, although this hypothesis is still under debate, because pitch in developmental populations was previously tested only in Bion et al. (2011). In contrast, grouping by duration and intensity may be more context-dependent and/or more affected by language experience. This interpretation is supported by the results of previous studies (including data on rats: de la Mora et al., 2013; Toro and Nespor, 2015). First, the finding that the use of duration as a cue for grouping depends on language background has been reported in several studies in both adults (Iversen et al., 2008; Bhatara et al., 2013, 2015; Crowhurst and Teodocio Olivares, 2014; Boll-Avetisyan et al., 2015b) and infants (Yoshida et al., 2010; Molnar et al., 2014). It is therefore possible that, at least at the beginning of life, before full mastery of an infant's native language, the specific pattern heard in their auditory environment affects rhythmic grouping based on duration but not pitch. However, our study is the first to test all three cues at the same time, and the emerging pattern of results remains difficult to fully interpret.
Future studies comparing all three acoustic cues are needed, using different familiarization times, different levels of prosodic variability (in the present study, we only used one value per cue, making it difficult to evaluate the relative weight of each cue), and different types of materials (in particular, comparing linguistic and non-linguistic stimuli, see below).

Moreover, in the present study the effect of cue-based grouping was present only in the first part of the test phase. Indeed, in our first analysis analyzing all the trials, we found no significant effect or interaction involving test item type, but a significant decrease in looking times over the course of the test phase. This decrease in looking times is not surprising for two reasons: first, our familiarization phase was quite long when compared to other grouping experiments. Hence, it is not unexpected that the infants' attention cannot be maintained over the course of this long experimental session. Second, it is not unexpected that memory for grouped items is freshest immediately after the familiarization phase. Adult segmentation studies have also reported that preferences for the ''segmented'' items are most pronounced in the initial portion of the test phase (e.g., Boll-Avetisyan and Kager, 2014). Hence, it is reasonable that infants' preferences for, for example, prominence-final items in the duration condition are stronger in the initial portion of the test phase, immediately after they have been exposed to the stream. Further studies will have to take into account this effect in the design of this type of task.

Another interesting aspect of our results is the lack of cross-linguistic differences. We found the same listening biases for all infants independently of whether they are acquiring French or German. Until now, rhythmic grouping based on duration has generally been shown to be modulated by language exposure. Therefore, our results might be surprising at first, but they can be interpreted based on prosodic properties of the languages we tested, in particular in terms of their acoustic cues and position of lexical stress. Remember that Yoshida et al. (2010) as well as Molnar et al. (2014) compared languages with prominence at the end of the phonological phrase (English, Spanish) to languages with prominence at the beginning (Japanese, Basque). In contrast, French and German are less different on this level than the languages used in these previous studies, with phrases being prominence-final in French and predominantly prominence-final in German (even if both prominence-final and prominence-initial phrases are allowed in German). In contrast, these two languages greatly differ in terms of prosody at the lexical level. Because of the overall similarity with respect to the position of phrasal prominence, French- and German-learning infants' perception of rhythmic groups might still be similar at 7.5 months of age. Hence, a comparison of infants learning two less similar languages would be more likely to demonstrate cross-linguistic differences, as Yoshida et al. (2010) and Molnar et al. (2014) have shown in Japanese/English and Basque/Spanish comparisons. Although cross-linguistic differences in ITL-based grouping were found between French and German adults (Bhatara et al., 2013, 2015), it appears that these cross-linguistic differences might be set into place later in development for languages that mostly differ in prosody at the lexical rather than the phrasal level.

A corollary to this discussion on developmental changes in the weighting of rhythmic cues has been observed in infants' processing of prosodic boundaries in sentences. Studies by Seidl (2007) and Seidl and Cristià (2008) have shown that 4- and 6-month-old English-learning infants perceived acoustic cues of prosodic phrase boundaries differently (Seidl and Cristià, 2008). At 6 months, a pitch change but no lengthening or pause was necessary to perceive the boundary whereas at 4 months, infants needed a combination of all three of these boundary cues. Furthermore, this weighting and its developmental trajectory seem to vary cross-linguistically, as 6-month-old Dutch- and German-learning infants have been found to rely more heavily on the pause than their English-learning age mates, and German-learning infants are able to detect a phrase boundary that is solely marked by pitch and lengthening at the age of 8 but not 6 months (Seidl, 2007; Johnson and Seidl, 2008; Wellmann et al., 2012). These developmental changes might be linked to the nature of linguistic exposure, but also, in some part, might be related to infant directed speech (IDS) acoustic cues. IDS relative to Adult Directed Speech (ADS) has a generally slower tempo, higher average pitch, and exaggerated intonation contours (Fernald and Simon, 1984). These cues can differ in terms of strength according to the age of the infant (Kitamura and Burnham, 2003) and can also influence developmental changes in listening preference (Panneton et al., 2006) and might impact rhythmic grouping. Based on this reasoning, it is possible that, at the beginning of language acquisition, infants are relying on the ITL and general auditory perception, and that language-specific strategies will emerge later depending on the nature of the language exposure. Hence, the present study provides further evidence for a link between basic auditory processing and speech processing in language acquisition.

These results more broadly contribute to the common auditory skills account of speech, music and sound processing (Patel, 2003; Asaridou and McQueen, 2013) and particularly rhythmic processing for linguistic and non-linguistic sounds (Hay and Diehl, 2007; François et al., 2013; Bhatara et al., 2015; Boll-Avetisyan et al., 2015b). Rhythmicity characterizes many physiological processes and is widely acknowledged to be an important feature of both speech and music. Here for the first time we show early sensitivities for grouping linguistic sequences varying in pitch and duration in French- and German-learning infants, finding results similar to those from previous studies that presented infants with non-linguistic tones. These findings suggest that speech and non-speech sounds are processed by common mechanisms, although further support for this claim would be provided by studies directly comparing the processing of linguistic and non-linguistic stimuli. Although this has not been done for early rhythmic grouping (with the exception of Hay and Saffran, 2012), several studies have shown links between musical and linguistic perception at other levels. Behavioral research has highlighted similarities in terms of structural processing for musical and linguistic sequences in adults and infants (Krumhansl and Jusczyk, 1990; Fedorenko et al., 2009). Neuroimaging studies have also found common networks for structural processing of speech and music (Patel, 2003; Abrams et al., 2011; Fedorenko and Thompson-Schill, 2014). There is also evidence that musical experience can influence rhythmic grouping of both speech and non-speech sequences by duration and intensity (Bhatara et al., 2015; Boll-Avetisyan et al., 2015b).

In summary, the present study has shown, for the first time using linguistic stimuli, grouping preferences based on pitch and duration variation in 7.5-month-old French- and German-learning infants. These results suggest that these perceptual grouping mechanisms are in place early in development. In our experiments, infants were tested in a speech segmentation task, in which they succeeded using both pitch and duration cues for segmentation. Because pitch and duration are relatively reliable cues to word and phrase boundaries in natural speech, it is possible that infants use the same mechanisms that they relied on in the present task to segment words and phrases from natural speech. Moreover, we found no cross-linguistic differences for these cues, in contrast with previous studies examining similar questions in French and German adults (Bhatara et al., 2013, 2015). This suggests that, at 7.5 months, language experience might have begun to shape these mechanisms (as directly found in Yoshida et al., 2010; and Molnar et al., 2014; and indirectly by comparing our findings with those of Bion et al., 2011), but that some cross-linguistic differences might take longer to emerge (as in the present case of French and German). Moreover, our study contributes to the view that rhythmic grouping preference for speech may emerge from general perceptual biases. Recent studies have shown that groupings similar to those formalized by the ITL may even be found across species (de la Mora et al., 2013; Toro and Nespor, 2015) and across modalities (Peña et al., 2011). These studies reinforce the idea that these auditory biases would have evolutionary bases and that these biases that infants might use to segment the speech stream into lexical and/or phrasal units would rely on auditory mechanisms not specific to language processing, which might be present very early on in development. To get an even more precise view of the development of these abilities, future studies will have to further explore the roles of pitch, duration and intensity in speech and their emergence as grouping cues, testing infants with different native languages at different ages.

### AUTHOR CONTRIBUTIONS

All authors participated in designing the experiments, interpreting the results and writing the article. NA, NB-A and AB prepared the stimuli, ran the experiments and analyzed the data.

### ACKNOWLEDGMENTS

Thanks to the infants and their parents for their kindness and cooperation, and Tom Fritzsche, Stefanie Meister, Annika Unger, Anna Richert, Katja Schneller, Anne Beyer, Carolin Jäkel, Mareike Orschinski, Marie Zieliena, and Wiebke Bruchmüller for help with recruiting and testing participants. We also thank Judit Gervain for helpful discussions regarding this study. This work was supported by the Agence Nationale de la Recherche–Deutsche Forschungsgemeinschaft grant #ANR-13-FRAL-0010 and DFG Ho 1960/15-1 to TN and BH, as well as by the Labex EFL (ANR-10-LABX-0083) to TN.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2016.00292/abstract

Figure S1 | Segment from the middle of the familiarization stream with the three test conditions: (A) Intensity condition, (B) Pitch condition, and (C) Duration condition.




**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The reviewer MB and handling Editor declared their shared affiliation, and the handling Editor states that the process nevertheless met the standards of a fair and objective review.

Copyright © 2016 Abboub, Boll-Avetisyan, Bhatara, Höhle and Nazzi. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Awareness of Rhythm Patterns in Speech and Music in Children with Specific Language Impairments

Ruth Cumming, Angela Wilson, Victoria Leong, Lincoln J. Colling and Usha Goswami\*

*Department of Psychology, Centre for Neuroscience in Education, University of Cambridge, Cambridge, UK*

Children with specific language impairments (SLIs) show impaired perception and production of language, and also show impairments in perceiving auditory cues to rhythm [amplitude rise time (ART) and sound duration] and in tapping to a rhythmic beat. Here we explore potential links between language development and rhythm perception in 45 children with SLI and 50 age-matched controls. We administered three rhythmic tasks, a musical beat detection task, a tapping-to-music task, and a novel music/speech task, which varied rhythm and pitch cues independently or together in both speech and music. Via low-pass filtering, the music sounded as though it was played from a low-quality radio and the speech sounded as though it was muffled (heard "behind the door"). We report data for all of the SLI children (*N* = 45, IQ varying), as well as for two independent subgroupings with intact IQ. One subgroup, "Pure SLI," had intact phonology and reading (*N* = 16), the other, "SLI PPR" (*N* = 15), had impaired phonology and reading. When IQ varied (all SLI children), we found significant group differences in all the rhythmic tasks. For the Pure SLI group, there were rhythmic impairments in the tapping task only. For children with SLI and poor phonology (SLI PPR), group differences were found in all of the filtered speech/music AXB tasks. We conclude that difficulties with rhythmic cues in both speech and music are present in children with SLIs, but that some rhythmic measures are more sensitive than others. The data are interpreted within a "prosodic phrasing" hypothesis, and we discuss the potential utility of rhythmic and musical interventions in remediating speech and language difficulties in children.

Keywords: SLI, phonology, auditory processing, rise time

Edited by: *Henkjan Honing, University of Amsterdam, Netherlands*

Reviewed by: *Simone Dalla Bella, University of Montpellier, France*; *Stefan Elmer, University of Zurich, Switzerland*

\*Correspondence: *Usha Goswami ucg10@cam.ac.uk*

Received: *04 May 2015* Accepted: *30 November 2015* Published: *22 December 2015*

Citation: *Cumming R, Wilson A, Leong V, Colling LJ and Goswami U (2015) Awareness of Rhythm Patterns in Speech and Music in Children with Specific Language Impairments. Front. Hum. Neurosci. 9:672. doi: 10.3389/fnhum.2015.00672*

## INTRODUCTION

There is growing interest in a range of disciplines over whether language processing and music processing draw on shared neural resources (e.g., Patel, 2010; Rebuschat et al., 2012, for overviews). In developmental psychology studies, this interest is fuelled in part by the belief that music may offer novel interventions to help children with language learning impairments (e.g., Koelsch et al., 1999; Besson et al., 2007; Elmer et al., 2012). Here we explore the awareness of rhythm patterns in speech and music in children with speech and language impairments (SLIs), adopting a theoretical focus drawn from the relationship between children's sensitivity to amplitude rise time (ART) and the accuracy of their neural entrainment to amplitude modulations (AMs) in the speech signal [Temporal Sampling (TS) theory, see Goswami, 2011, 2015, for overviews]. Based on temporal sampling theory, we have proposed a new perceptual hypothesis to explain the etiology of SLIs across languages, the "prosodic phrasing hypothesis" (Cumming et al., 2015). The prosodic phrasing hypothesis proposes that an auditory difficulty in the accurate processing of ARTs (Fraser et al., 2010; Beattie and Manis, 2012) and sound duration (Corriveau et al., 2007; Richards and Goswami, 2015) causes children with SLIs to have difficulties in detecting the prosodic or rhythmic phrasing of given utterances. Theoretically, prosodic patterning is considered to be the skeletal beat-based structure upon which human language processing builds (for both phonology and syntax, see e.g., Gleitman et al., 1988; Frazier et al., 2006; Goswami and Leong, 2013).

Although prosodic phrasing in adult-directed speech is not periodic and therefore difficult to conceive of as beat-based, it is important to note that child-directed speech is often overtly rhythmic. An example is the English "nursery rhyme," which typically has strong trochaic or iambic beat structure (Gueron, 1974). ART (or "attack time" for musical notes) is perceptually critical for perceiving beat structure (patterns of beats, see Goswami et al., 2013). The time to peak amplitude modulation (i.e., ART) governs "P-centre" perception, the subjective moment of occurrence (or "perceptual centre") of a beat, whether in language (a syllable beat), or in music (Scott, 1993). Leong and Goswami (2015) modeled the AM structure of English nursery rhymes using an AM phase hierarchy approach drawn from neural models of oscillatory speech encoding (see also Leong and Goswami, 2014). The acoustic modeling revealed three nested temporal tiers of AMs in English nursery rhymes, at approximately 2, 5, and 20 Hz. These three modulation rates enabled the model to extract phonological structure (prosody, syllables, and the onset-rime syllable division). Nursery rhymes are inherently rhythmic, but when adults deliberately spoke them rhythmically (in time with the regular beat of a pacing metronome), then the model's success in extracting phonological units improved significantly. For freely-produced nursery rhymes, the Spectral-Amplitude Modulation Phase Hierarchy (S-AMPH) model detected 72% of strong-weak syllable alternations successfully, compared to 95% when the nursery rhymes were spoken to a metronome beat. Regarding syllable finding, for freely-produced nursery rhymes the model detected 82% of syllables successfully, compared to 98% success for deliberately timed nursery rhymes. The corresponding figures for onset-rime discrimination were 78 vs. 91% (see Leong and Goswami, 2015). 
The S-AMPH acoustic modeling supports a series of behavioral studies across languages showing significant relationships between ART discrimination and phonological development in children (Goswami, 2011, 2015 for summaries).

The prosodic phrasing hypothesis for SLI (Cumming et al., 2015) builds on these documented relationships between ART, AMs and phonological development. It suggests that grammatical impairments in children with SLIs arise in part from their perceptual difficulties in discriminating ART and duration. These perceptual impairments make it more difficult for SLI children to benefit from rhythmic patterning in language and hence to extract the phonological structure that often supports syntax (see Cumming et al., 2015). Accordingly, in the current study we investigate the sensitivity of children with SLIs to rhythmic patterning in language and music. Accurate behavioral assessments should provide a guide regarding whether musical rhythmic remediation might support the perception of speech rhythm in children with SLIs, thereby facilitating the perception of prosodic phrasing.

Children with SLIs have been shown to be impaired in tapping to a metronome beat, at temporal rates that approximately reflect the occurrence of stressed syllables in speech (focused around 2 Hz, see Corriveau and Goswami, 2009). The temporal sampling framework proposes that the stressed syllable rate is a foundational AM rate for language learning. Across languages, adult speakers produce stressed syllables roughly every 500 ms (Dauer, 1983), and stressed syllables are cued by large changes in ART. These regular 2 Hz ARTs appear to provide crucial perceptual cues to prosodic structure, providing a beat-based skeleton for language learning (see Goswami and Leong, 2013; Goswami, 2015, for detail). Infants use prosody to segment the speech stream across languages (Jusczyk et al., 1999). Perceptual difficulties in discriminating AM rates could be associated with atypical oscillatory neural entrainment to the speech signal for affected infants, with associated difficulties in perceiving prosodic structure and consequently in extracting phonological and morphological information from speech. The literature on behavioral motor synchronization to the beat is a large one (Repp, 2005; Repp and Yi-Huang, 2013), and within it, motor variability in synchronization-to-the-beat tasks at rates around 2 Hz has been related to language impairments and to phonological/literacy impairments in prior studies (Thomson and Goswami, 2008; Corriveau and Goswami, 2009). Accordingly, Goswami has argued that neural entrainment and phase-phase alignment of neuronal oscillations across motor and sensory areas may offer a developmental framework for considering shared neural resources in processing language and music (e.g., Goswami, 2012a,b).

Specifically, the temporal sampling model would predict that training children with SLIs to match rhythm in music, which is overt and is typically supported by multiple cues in the melody, with rhythm in speech, may benefit linguistic processing for children with SLIs.

Studies showing links between music processing and language processing are indeed beginning to emerge in the developmental literature. For example, Francois et al. (2012) gave typically-developing (TD) French-speaking children either a music intervention or a painting intervention when they were aged 8 years, and tested their speech segmentation performance. The speech segmentation task utilized an unbroken sung sequence of triples of nonsense syllables produced without prosodic phrasing, which lasted for 5 min ("gimysy-sipygy-pogysi-pymiso..."; hyphens segment the individual nonsense "words," although no actual physical pauses occurred). Children then made a forced-choice response concerning whether a spoken item had been part of the artificial language or not [e.g., gimysy (yes) vs. sysipy (no)]. Francois et al. reported that while both groups of children were at chance in this task in a pretest, the music intervention children were significantly better at this task than the painting intervention children after both 1 and 2 years of intervention. A second study, also with TD children, asked English-speaking participants aged 6 years to make same-different judgements about tone sequences arranged in different metrical patterns, and to judge whether short monotonic melodies had the same or different rhythms (Gordon et al., 2015). The children were also given an expressive grammar test requiring them to describe pictures using different morpho-syntactic constructions. Gordon and colleagues reported a significant correlation between the children's rhythm discrimination scores and their accuracy in the expressive grammar test, which remained significant when individual differences in phonological awareness were controlled. Gordon et al. concluded that there was a robust relationship between rhythmic and syntactic abilities in TD children.

Regarding atypical development, Przybylski et al. (2013) investigated the utility of acoustic synchronization to the beat for supporting grammatical performance by French children with SLIs. They gave 9-year-olds a grammaticality judgment task, and manipulated whether a musical prime supported the grammaticality judgements. The musical prime either had a regular beat pattern based on a tempo of 500 ms (2 Hz) or an irregular beat pattern with varying tempo, produced by tam tams and maracas. Children judged sentences like "Le camera filme les danseurs" (gender agreement violation, as it should be "La camera"). Przybylski and colleagues reported that hearing the regular musical prime significantly improved grammaticality judgements, for both the children with SLI and for TD control children, and interpreted these improvements using temporal sampling theory. They argued that the predictable metrical structure of the music might entrain internal oscillators that supported temporal segmentation of speech input at the sentence level (see Large and Jones, 1999). In temporal sampling theory, these internal oscillators would be auditory cortical networks entraining to the amplitude modulation hierarchy in speech (see Goswami, 2011, 2015).

Regarding phonological awareness, Dellatolas et al. (2009) asked 1028 TD French-speaking children aged 5–6 years to produce 21 rhythmic patterns modeled by the experimenter by tapping with a pencil on a table. Dellatolas et al. reported that individual differences in rhythmic performance were a significant predictor of reading at age 7–8 years (2nd grade), even after controlling for attention and linguistic skills. Indeed, the rhythm reproduction task showed a normal distribution, and a linear relationship with attainment in reading. In a phonological training study with German-speaking children, Degé and Schwarzer (2011) offered preschoolers (4- to 5-year-olds) either musical training, a sports intervention, or phonological training. The musical training included joint singing, joint drumming, rhythmic exercises, metrical training, dancing and rudimentary notation skills, and was given to small groups for 10 min daily for 20 weeks. Degé and Schwarzer reported significant improvements in phonological awareness for the music group, equivalent to the improvements made by the phonological awareness training group. Both groups showed significantly more improvement than the children receiving the sports intervention.

Most recently, a study by Bhide et al. (2013) explored the possible benefits of a music and rhythm intervention based on temporal sampling theory for phonological and reading development. The intervention related musical beat patterns to the theoretically-related linguistic factors of syllable stress, syllable parsing and onset-rime awareness. Children aged 6–7 years participated in activities like bongo drumming to varied beat structures in music, making rhythm judgements with different tempi, marching and clapping to songs, learning poetry and playing chanting/hand-clap games. The children showed significant improvements in phonological awareness, reading and spelling, with large effect sizes (e.g., *d* = 1.01 for rhyme awareness). Circular statistics showed that individual improvements in motor synchronization to the beat and reading were significantly related, supporting temporal sampling theory.

In the current study, we extended temporal sampling theory to SLI children. We gave 45 children with SLIs a set of nonspeech rhythm tasks, in order to explore their sensitivity to beat patterns in music. The SLI children were known to have significant impairments in processing syllable stress related to auditory impairments in perceiving ARTs and sound duration (see Cumming et al., 2015). The first non-speech rhythm task measured sensitivity to patterns of musical beat distribution, and was originally devised for children with developmental dyslexia (Huss et al., 2011; Goswami et al., 2013). Children with dyslexia aged 10 years showed significant difficulties in the task, scoring on average 63% correct compared to 84% correct for their TD controls. Performance in the musical beat processing task along with age and IQ explained over 60% of the variance in single word reading in the dyslexic sample. The second non-speech task was a tapping-to-music task. The music was varied to have either 3/4 time or 4/4 time, with the former expected to be more difficult. The aim was to see whether motor synchronization to the beat would be easier for children with SLIs when a rich musical stimulus was used (a metronome beat was used by Corriveau and Goswami, 2009). We also created a novel task measuring children's sensitivity to pitch contour and rhythm patterns in both speech and music. This was an AXB task in which matching to the standard (X) was required on the basis of either rhythm, pitch contour or both rhythm and pitch contour.

Overall, we expected to find difficulties in perceiving musical rhythm in our children with SLIs, as the children had known impairments in perceiving ART and duration (Cumming et al., 2015). Studies with dyslexic children have previously shown that ART sensitivity is related to musical beat perception, in both cross-sectional and longitudinal assessments (Huss et al., 2011; Goswami et al., 2013). Of interest here was whether difficulties in rhythmic processing in the same task would be less pronounced or greater for children with SLIs compared to children with dyslexia, and whether the processing of pitch contours as well as rhythm would be affected in music and/or speech for children with SLIs. As melody depends on both rhythm and pitch, greater facility with the music tasks compared to the speech tasks would support the use of language interventions based on music for children with SLIs.

## METHODS

### Participants

Ninety-five children aged on average 9 years 6 months participated in this study, of whom 45 were referred by their schools as having a specific language impairment (SLI). All
participants and their guardians gave informed consent, and the study was approved by the Psychology Research Ethics Committee of the University of Cambridge. Only children who had no additional learning difficulties (e.g., dyspraxia, ADHD, autistic spectrum disorder, dyslexia) and English as the first language spoken at home were included. The absence of additional learning difficulties was based on the reports of teachers and speech and language therapists in schools, and our own testing impressions of the children. All children received a short hearing screen using an audiometer. Sounds were presented in both the left and the right ear at a range of frequencies (250, 500, 1000, 2000, 4000, 8000 Hz), and all children were sensitive to sounds within the 20 dB HL range. Forty-five of the children (31 male, 14 female; mean age 9 years, 6 months; range 6 years 4 months to 12 years 1 month) either had a statement of SLI from their local education authority, or had received special help for language via the teacher(s) with responsibility for special educational needs in school, and/or showed severe language deficits according to our own test battery. These children (SLI group) were drawn from a number of schools via language support units in the schools, referral to the study by speech and language therapists or referral by teachers with responsibility for special educational needs (note that SLI is more common in boys, with a ratio of ∼4:1). All children with SLI were assessed experimentally using two expressive and two receptive subtests of the Clinical Evaluation of Language Fundamentals-3 (CELF-3; Semel et al., 1995), and were included in the study if they scored at least 1 SD below the mean on two or more of these subtests. 
Individual standardized scores of the children in the SLI group for the four CELF-3 subtests administered, as well as receptive vocabulary as measured by the British Picture Vocabulary Scales (BPVS; Dunn et al., 1982), and nonverbal IQ as measured by the Wechsler Intelligence Scales for Children (WISC-III; Wechsler, 1992) or Raven's Standard Progressive Matrices (Plus version, Raven, 2008), are shown in **Table 1**. The table also shows single-word reading and spelling scores on the British Ability Scales (BAS; Elliott et al., 1996) and the Test of Word Reading Efficiency (TOWRE; Torgesen et al., 1999).

Note that in our prior studies of ART and beat perception in music in dyslexia (e.g., Goswami et al., 2013), only children with a diagnosis of dyslexia and no history of speech or language impairments were studied. Here, we studied children with a diagnosis of SLI and no history or diagnosis of reading impairments. Nevertheless, as indicated in **Table 1**, a number of the 45 children with SLI did show impaired reading on our test battery. **Table 1** also shows that IQ varied greatly within the SLI group. Therefore, from this sample of 45 SLI children, we created two sub-groups with intact IQ. Following Fraser et al. (2010), children with SLI were regarded as having non-verbal IQ within the normal range if they scored 80 or above on at least one of the two non-verbal measures (WISC, Ravens). One sub-group comprised a sample of children with pure SLI and no IQ or reading difficulties (N = 16, 11 boys), hereafter the "Pure SLI" group. The second sub-group (N = 15, 4 boys) comprised a separate sample of SLI children with preserved IQ but reading difficulties, defined as having a standard score (SS) < 85 on at least two of the standardized reading and spelling tests used. These children also showed phonological difficulties on the experimental measures of phonological processing used (described below), hence hereafter they are termed the "SLI PPR" (poor phonology and reading) group. Note that the SLI PPR children would not qualify for a diagnosis of developmental dyslexia because of their spoken language impairments. As there is no theoretical reason to expect auditory processing skills to vary with IQ (see Kuppen et al., 2011), we analyse data for the entire sample of SLI children as well as for these two sub-groups (Pure SLI, SLI PPR).
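The sub-grouping rules just described amount to a small decision procedure, sketched below in Python. This is an illustration only: the function names and the "full sample only" label are hypothetical, the thresholds are the ones stated in the text, and treating every child with intact IQ and fewer than two low literacy scores as "Pure SLI" is an assumption (the study also checked phonology on the experimental measures).

```python
def has_intact_iq(wisc_nviq, ravens):
    """Non-verbal IQ counts as intact if either available measure is 80 or above."""
    scores = [s for s in (wisc_nviq, ravens) if s is not None]
    return any(s >= 80 for s in scores)


def has_reading_difficulty(literacy_scores):
    """Reading difficulty: standard score below 85 on at least two tests."""
    available = [s for s in literacy_scores if s is not None]
    return sum(s < 85 for s in available) >= 2


def assign_subgroup(wisc_nviq, ravens, literacy_scores):
    """Return 'Pure SLI', 'SLI PPR', or 'full sample only' (assumed label)."""
    if not has_intact_iq(wisc_nviq, ravens):
        # Assumption: these children are analysed only in the full SLI sample.
        return "full sample only"
    if has_reading_difficulty(literacy_scores):
        return "SLI PPR"
    return "Pure SLI"
```

Missing scores (the N/A entries in Table 1) are represented here as `None` and simply ignored, which is also an assumption rather than a documented rule.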

Fifty chronological age (CA) matched control children from the same schools as the SLI children also participated in the study. These comprised children who returned consent forms and who were close to individual SLI participants in age. The control group included 21 males and 29 females (mean age 9 years, 4 months; range 6 years 4 months to 11 years 8 months). By selecting control children with non-verbal IQ and reading in the normal range, we created a matched sample of typically-developing children for the Pure SLI group (N = 16) and for the SLI PPR group (N = 15). Group matching for the standardized ability tasks is shown in **Table 2** for these two SLI sub-groupings. **Table 2** also includes performance on experimental tests of phonology (see below), to indicate that there were no phonological impairments in the Pure SLI group.

### Standardized Tests

Language abilities were measured through the use of two receptive subtests (Concepts and Directions, and Semantic Relations or Sentence Structure, depending on the child's age) and two expressive subtests (Formulating Sentences, and Sentence Assembly or Word Structure, depending on the child's age) of the CELF-3 (Semel et al., 1995). For all children, receptive vocabulary was measured through use of the BPVS, and single word reading was assessed using the BAS and TOWRE tests. All children also completed four subscales of the WISC-III: Block Design, Picture Arrangement, Similarities, and Vocabulary. These four scales yield an estimate of full-scale IQ (pro-rated, see Sattler, 1982), and the two non-verbal scales (Block Design, Picture Arrangement) were used to gain an estimate of non-verbal IQ following the procedure adopted by Sattler (1982, pp. 166–167). Non-verbal IQ was also assessed using the Ravens. There were no significant non-verbal IQ differences between the matched sub-groups, as shown in **Table 2**.

### Phonological Tasks

Children with SLI who also present with poor reading would be expected to have phonological processing difficulties, whereas the Pure SLI group identified here should not show phonological difficulties. Three experimental measures of phonological processing, previously used with children with dyslexia, were therefore also administered. As shown in **Table 2**, the Pure SLI group did not show phonological processing difficulties in these tasks compared to control children, whereas the SLI PPR group did show phonological difficulties.

#### Rhyme Oddity Task

Children listened to sets of three words and had to select the nonrhyme (e.g., boot, cool, root; Goswami et al., 2013). The words were presented by computer through headphones using digitized recordings of speech produced by a female native speaker of Standard Southern British English, and trials were presented in one of three fixed random orders. The task comprised 20 trials. Two practice trials with feedback were given prior to the experimental trials.

TABLE 1 | Participant details for the children with SLIs, showing sub-group membership.

*N/A means score not available, either as the child was absent or as the child refused to try the non-word component of the TOWRE, which is a timed test.*

*<sup>a</sup>British Picture Vocabulary Standard Score (M* = *100, SD* = *15).*

*<sup>b</sup>Clinical Evaluation of Language Fundamentals (CELF) Expressive and Receptive Sub-tests (M* = *10, SD* = *3).*

*<sup>c</sup>Higher Standard Score of WISC or Ravens (M* = *100, SD* = *15).*

*<sup>d</sup>Test of Word Reading Efficiency combined Standard Score (M* = *100, SD* = *15).*

*<sup>e</sup>British Ability Scales Standard Score (M* = *100, SD* = *15).*

*<sup>f</sup>Ravens SS shown instead of WISC SS.*

*Standard deviations in parentheses.* \**p* < *0.05,* \*\**p* < *0.01,* \*\*\**p* < *0.001.*

*<sup>b</sup>Clinical Evaluation of Language Fundamentals, Receptive.*

*<sup>c</sup>Clinical Evaluation of Language Fundamentals, Expressive.*

*<sup>d</sup>WISC non-verbal IQ.*

*<sup>e</sup>British Picture Vocabulary Scales.*

*<sup>f</sup>British Ability Scales single word reading.*

*<sup>g</sup>Test of Word Reading Efficiency combined score.*

*<sup>h</sup>Phonological short-term memory.*

*<sup>j</sup>Rapid Automatised Naming combined score.*

#### Phonological Short-Term Memory Task

The children heard four monosyllabic consonant-vowelconsonant words presented by computer through headphones using digitized recordings of speech produced by a female native speaker of Standard Southern British English (e.g., type, rib, nook, bud; originally used in Thomson et al., 2005). The children were required to repeat back the words as spoken. Sixteen trials were presented in total, eight comprising items drawn from dense phonological neighborhoods, and eight trials comprising items drawn from sparse phonological neighborhoods. The total number of items reported correctly out of 64 was used in the analyses.

#### Rapid Automatized Naming (RAN) Task

In the RAN task, children were asked to name line drawings of two sets of familiar objects (first set: cat, shell, knob, zip, thumb; second set: web, fish, book, dog, cup; see Richardson et al., 2004). For each set, children were first introduced to the names of the pictures and then shown a page with the same pictures repeated 40 times in random order. The children were asked to produce the names as quickly as possible. Average naming speed across the two lists in seconds was used in the analyses.

### Auditory Tasks

A set of auditory processing tasks using non-speech stimuli (sine tones) or speech stimuli (the syllable "ba," described further below) were created or adapted for this project by RC. Detailed information regarding group performance on these tasks is given in Cumming et al. (under submission) and is not repeated here. The stimuli were presented binaurally through headphones at 75 dB SPL. Earphone sensitivity was calculated using a Zwislocki coupler in one ear of a KEMAR manikin (Burkhard and Sachs, 1975). The tasks used a cartoon "Dinosaur" threshold estimation interface originally created by Dorothy Bishop (Oxford University). An adaptive staircase procedure (Levitt, 1971) using a combined 2-down 1-up and 3-down 1-up procedure was used, with a test run terminating after 8 response reversals or the maximum possible 40 trials. The threshold was calculated using the measures from the last four reversals. This indicated the smallest difference between stimuli at which the participant could still discriminate with a 79.4% accuracy rate. The children were assessed individually in a quiet room within their school or at home. A rigorous practice procedure (five trials) was applied prior to the presentation of the experimental stimuli. For all the Dinosaur tasks (unless otherwise stated below in the individual task descriptions), an AXB paradigm was used; three sounds were presented consecutively, as if they were the sounds made by three distinctive cartoon dinosaurs on screen (500 ms ISI). The middle stimulus (X) was always the standard stimulus and either the first (A) or the last (B) stimulus was different from the standard. At the start of each task, the child was introduced to three cartoon dinosaurs, and for each trial the child was asked to choose which dinosaur produced the target sound i.e., whether A or B was different from X. Feedback was given online throughout the course of the experiment. 
All the speech stimuli were based on the monosyllable [bA:], and were resynthesized from a natural [bA:] token produced by a female native speaker of Standard Southern British English. She was recorded in a sound-attenuated booth; the equipment used was a Tascam DR-100 handheld recorder with an AKG C1000S cardioid microphone. One [bA:] token was selected for manipulation and saved in .wav format. Details of the stimulus manipulations, which were done with the software Praat (Boersma and Weenink, 2010), are given below. All "ba" tasks were run with the Dinosaur program, but the cartoon animals that appeared on screen were sheep (because sheep say "baaa").

*<sup>a</sup>SS, standard score.*
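The adaptive threshold procedure described in this section can be illustrated with a short simulation. The Python sketch below is not the Dinosaur program: the function name, the one-level step size, the starting level, and the point at which the rule switches from 2-down 1-up to 3-down 1-up (`switch_after`) are all assumptions made for illustration. The stopping rule (8 reversals or 40 trials) and the use of the last four reversals follow the text. A 3-down 1-up rule converges on the level at which the probability of a correct response p satisfies p³ = 0.5, so p ≈ 0.794, which corresponds to the 79.4% figure quoted above.

```python
def run_staircase(respond, n_levels=39, max_trials=40, max_reversals=8,
                  switch_after=2):
    """Simulate one adaptive run; `respond(level)` returns True if correct.

    Levels run from 1 (smallest difference) to n_levels (largest difference).
    """
    level = n_levels            # assumed start: the easiest comparison
    n_down = 2                  # begin with the 2-down 1-up rule
    correct_streak = 0
    last_direction = 0          # -1 = task got harder, +1 = task got easier
    reversals = []

    for _ in range(max_trials):
        if respond(level):
            correct_streak += 1
            if correct_streak < n_down:
                continue        # not enough consecutive correct responses yet
            correct_streak = 0
            direction = -1
        else:
            correct_streak = 0
            direction = +1

        if last_direction and direction != last_direction:
            reversals.append(level)
            if len(reversals) == switch_after:
                n_down = 3      # assumed switch point to 3-down 1-up
            if len(reversals) == max_reversals:
                break
        last_direction = direction
        level = min(n_levels, max(1, level + direction))

    last_four = reversals[-4:]
    threshold = sum(last_four) / len(last_four) if last_four else None
    return threshold, reversals
```

For a hypothetical listener who is always correct at level 35 or above and always wrong below it, `run_staircase(lambda level: level >= 35)` oscillates between levels 34 and 35 and returns a threshold of 34.5.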

#### Amplitude Rise Time (ART) Tasks

For the non-speech task, three 800 ms sinusoid tones (500 Hz) were presented. The second tone was always a standard tone (X), with a 15 ms linear ART, 735 ms steady state, and a 50 ms linear fall time. One of the other two tones was identical to this standard, and the other tone varied in linear ART. For this variable ART, a continuum of 39 stimuli was used which increased in 7.3 ms steps from the standard to the tone with the longest ART at 293 ms. It was explained that each dinosaur would make a sound and that the child's task was to decide which dinosaur made the sound that started off more quietly and got louder more slowly than the other two dinosaurs (longer ART). In previous papers by Goswami and colleagues, this task has been called the "1 Rise" task. For the speech task, three [bA:] stimuli with a duration of 300 ms and a flat f0 at 200 Hz were presented. The second [bA:] was always a standard stimulus (X), with a 10 ms ART. One of the other two stimuli was identical to this standard, and the other stimulus varied in ART. For this variable ART, a continuum of 39 stimuli was used which increased in 3.7 ms steps from the standard to the stimulus with the longest ART at 150 ms. This continuum was created by copying the original [bA:] token 39 times, and resynthesizing each copy with a specified ART using the IntensityTier function in Praat. The standard stimulus also underwent resynthesis from the original token, but without a change of rise time. It was explained that each sheep would make a sound and the child's task was to decide which sheep didn't make a proper "b" sound at the start compared to the other two sheep (longer ART). (This instruction was decided on after pilot tests showed it was the best description and children understood what was meant as soon as they heard the practice trials).
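The structure of the non-speech ART stimuli (linear rise, steady state, linear fall) can be made concrete with a small Python sketch. It is illustrative only: the function names and the 44.1 kHz sample rate are assumptions, as is the choice to keep the total duration fixed at 800 ms so that a longer rise shortens the steady state. With the standard's 15 ms rise and 50 ms fall, the steady state is 800 − 15 − 50 = 735 ms, matching the value given above.

```python
import math


def envelope_gain(t_ms, rise_ms, total_ms=800.0, fall_ms=50.0):
    """Linear-rise, steady-state, linear-fall gain (0 to 1) at time t_ms."""
    if t_ms < 0 or t_ms > total_ms:
        return 0.0
    if t_ms < rise_ms:
        return t_ms / rise_ms
    if t_ms > total_ms - fall_ms:
        return (total_ms - t_ms) / fall_ms
    return 1.0


def synth_tone(rise_ms, freq_hz=500.0, total_ms=800.0, fall_ms=50.0,
               sample_rate=44100):
    """Return samples of a sine tone shaped by the envelope above.

    sample_rate is an assumed value; the source does not state one.
    """
    n_samples = int(round(sample_rate * total_ms / 1000.0))
    samples = []
    for i in range(n_samples):
        t_ms = 1000.0 * i / sample_rate
        gain = envelope_gain(t_ms, rise_ms, total_ms, fall_ms)
        samples.append(gain * math.sin(2 * math.pi * freq_hz * t_ms / 1000.0))
    return samples
```

Calling `synth_tone(15)` gives the standard tone and `synth_tone(293)` the tone with the longest rise; the speech stimuli were instead resynthesized in Praat and are not modeled here.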

#### Duration Tasks

For the non-speech task, three 500 Hz sinusoid tones with a 50 ms linear ART and 50 ms linear fall time were presented. The second tone was always a standard tone (X) at 125 ms (note that this is shorter than the durations used by Corriveau et al., 2007, which varied between 400 and 600 ms). One of the other two tones was identical to this standard, and the other varied in duration. For this variable duration, a continuum of 39 stimuli was used which increased in 3.2 ms steps from the standard to the longest tone at 247 ms. It was explained that each dinosaur would make a sound and that the child's task was to decide which dinosaur made the sound that was longer. For the speech task, three [bA:] stimuli with a flat f0 at 200 Hz were presented. The second [bA:] was always a standard stimulus (X) at 150 ms. One of the other two stimuli was identical to this standard, and the other stimulus varied in duration. For this variable duration, a continuum of 39 stimuli was used which increased in 3.9 ms steps from the standard to the longest stimulus at 300 ms. This continuum was created by copying the original [bA:] token 39 times, and resynthesizing each copy with a specified duration using the DurationTier function in Praat. The standard stimulus also underwent resynthesis from the original token, but without a change of duration. It was explained that each sheep would make a "baa" sound and that the child's task was to decide which sheep made the "baa" sound that was longer.
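The step sizes reported for the ART and duration continua follow from simple arithmetic if the 39 stimuli are assumed to include both endpoints, giving 38 equal intervals. The Python sketch below checks this; the helper name and the dictionary of endpoints are illustrative, and the endpoint-inclusive reading is an assumption that happens to reproduce the reported values.

```python
def linear_continuum(start_ms, end_ms, n_stimuli=39):
    """Evenly spaced values from start_ms to end_ms, endpoints included."""
    step = (end_ms - start_ms) / (n_stimuli - 1)
    return [start_ms + i * step for i in range(n_stimuli)]


# (standard, longest) values in ms, taken from the task descriptions
CONTINUA = {
    "ART, tone": (15, 293),
    "ART, speech": (10, 150),
    "duration, tone": (125, 247),
    "duration, speech": (150, 300),
}

for name, (start, end) in CONTINUA.items():
    values = linear_continuum(start, end)
    step = values[1] - values[0]
    print(f"{name}: step = {step:.1f} ms")
```

The printed steps are 7.3, 3.7, 3.2 and 3.9 ms, in agreement with the figures in the text. Dividing the range by 39 instead of 38 would give 7.1 ms for the tone ART continuum, which does not match.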

#### Frequency (Rising f0) Tasks

Three 300 ms sinusoid tones with a 5 ms linear ART and 5 ms linear fall time were presented. The second tone was always a standard tone (X) with a 10 ms fundamental frequency (f0) rise time from 295 to 500 Hz (hence dynamic f0). One of the other two tones was identical to this standard, and the other tone varied in f0 rise time. For this variable f0 rise time, a continuum of 39 stimuli was used which increased as an exponential function from the standard to the tone with the longest f0 rise time at 150 ms. It was explained that each dinosaur would make a sound and that the child's task was to decide which dinosaur made the sound that started "wobbly" compared to the other two dinosaurs (longer f0 rise time). (This instruction was decided on after pilot tests showed it was the best description and children understood what was meant as soon as they heard the practice trials). For the Speech task, three [bA:] stimuli with a duration of 300 ms were presented. The second [bA:] was always a standard stimulus (X) with a 10 ms f0 rise time from 130 to 220 Hz (hence dynamic f0). (The onset of the f0 rise was the point of vowel onset (as opposed to syllable onset), because f0 would not be perceptible during the silence of the closure and the aperiodicity of the burst releasing the plosive [b].) One of the other two stimuli was identical to this standard, and the other stimulus varied in f0 rise time. For this variable f0 rise time, a continuum of 39 stimuli was used which increased as an exponential function from the standard to the stimulus with the longest f0 rise time at 150 ms. This continuum was created by copying the original [bA:] token 39 times, and resynthesizing each copy with a specified f0 rise time using the PitchTier function in Praat. The standard stimulus also underwent resynthesis from the original token, but without a change of f0 rise time. 
It was explained that each sheep would make a sound and that the child's task was to decide which sheep made the sound that started "wobbly" compared to the other two sheep (longer f0 rise time).
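A continuum that "increased as an exponential function" from a 10 ms to a 150 ms f0 rise time could be generated as below. The exact function is not given in the text, so this sketch assumes geometric (constant-ratio) spacing with the standard as the first step; `exponential_continuum` is a hypothetical name.

```python
def exponential_continuum(start_ms, end_ms, n_stimuli=39):
    """Return n_stimuli rise times (ms) spaced by a constant ratio.

    Assumption: geometric spacing from start_ms to end_ms inclusive; the
    study only states that the continuum was exponential.
    """
    if start_ms <= 0 or end_ms <= 0:
        raise ValueError("rise times must be positive")
    ratio = end_ms / start_ms
    return [start_ms * ratio ** (i / (n_stimuli - 1)) for i in range(n_stimuli)]


if __name__ == "__main__":
    rise_times = exponential_continuum(10.0, 150.0)
    # Early steps are small and later steps large, unlike a linear continuum.
    print([round(t, 1) for t in rise_times[:3]], round(rise_times[-1], 1))
```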

### Music and Speech Tasks

#### Beat Perception in Music Task

This was the same task used with children with dyslexia by Goswami et al. (2013). It was a shortened version of the "musical meter" task originally reported by Huss et al. (2011). The task comprised 24 trials of different beat structure arrangements of a series of notes with an underlying pulse rate of 500 ms (120 bpm). Twelve of the trials delivered the identical series of notes twice ("same" trials), and 12 delivered two slightly different series of notes ("different" trials). Different trials were created by elongating the accented note by either 100 ms or 166 ms. All of the "different" trials are provided as **Figure 1**. The "same" trials were the identical arrangements without a lengthening of the accented note. The sound files were originally created by John Verney at the Centre for Neuroscience in Education, University of Cambridge, using Sibelius Version 4 from a sound set produced by Native Instruments (Kontakt Gold). Hence the "tunes" sounded musical with appropriate timbre and slow decay times. Fourteen trials (7 same, 7 different) were in 4/4 time and 10 trials (5 same, 5 different) were in 3/4 time. The delay in the rhythm structure was either short (100 ms, 7 "different" trials) or long (166 ms, 5 "different" trials). The child's task in all cases was to make a same-different judgment: were the two "tunes" the same or different? Trials were delivered in a pseudo-random order. The % of trials judged correctly was the variable used for data analysis. Further details can be found in Huss et al. (2011) and Goswami et al. (2013).

#### Tapping to Music Task

Two short pieces of instrumental music written by John Verney of the Centre for Neuroscience in Education, Cambridge, were adapted for this study. Each piece was 20 s long and was played to the children via headphones. One piece of music was composed in 4/4 time, and the second piece was in 3/4 time, and each piece was composed so that the underlying beat rate was 500 ms. The children were asked to tap along in time to the music using the computer mouse, which in this case was a single large button. We took great care to explain to the children and demonstrate how they should tap briefly on the button along to the beat (rather than pressing and holding, which could lead to beat sub-division). This was expected to result in a tap rate (Tactus) of around 500 ms; however, some children tapped once every two beats (1000 ms). The tapping was recorded by specially-written software so that inter-tap intervals could be analyzed.

To score performance for each participant, we report two methods. One is a synchronization measure that is derived from the log-transform of Absolute Error (Spray, 1986), a measure of the accuracy of motor performance. The synchronization score is an unambiguous measure of the child's accuracy in tapping at the tactus rate. The second measure represented individual differences in the children's performance using circular statistics, which are increasingly popular in motor synchronization studies (Sowinski and Dalla Bella, 2013; Falk et al., 2015). Circular statistics are most useful when there is an unambiguous pacing signal, for example a metronome beat. As our pieces of music did not include a pacing signal of this nature, we utilized the temporal offset of the dominant percussion instrument to compute circular statistics, as detailed below.

#### **Synchronization score (sync score)**

Before data processing, the first and last taps were removed from the sequence (we did not remove more taps as the sequences were only 20 s long). To compute the synchronization score, the absolute distance (in ms) from the "ideal" tap rate was computed for each child, at his/her chosen tactus level (which could be either 500 ms or 1000 ms, depending on whether children tapped on every beat or on every two beats). Therefore, this distance was |median ITI – 500| for the 500 ms tactus and |median ITI – 1000| for the 1000 ms tactus, where ITI is the inter-tap interval. The absolute value was used because children who tapped at −100 ms from the ideal rate were deemed to be as accurate as those who tapped at +100 ms from the ideal rate. Median scores were used rather than mean scores to minimize the effects of breaks in tapping (when children missed out a beat). Group variability in tapping is captured by the variance and standard deviations of these median scores. As the resulting absolute values could not be negative, resulting in a heavily skewed distribution, a logarithmic transform was applied. A negative log was used so that large distances would result in small scores and vice versa. "+1" was added to each distance before the logarithmic transform to prevent infinite values if participants were perfectly synchronized (i.e., log 0). Finally, "+6" was added to the score after the logarithmic transform so that all resulting scores would be positive. Therefore, final synchronization scores could range from 0 (very poor) to 6 (perfect; in fact they ranged between 0.35 and 6). The formula for the SyncScore is shown in Equation (1) below:

$$\text{SyncScore} = -\log\left(\left|\text{median ITI} - \text{tactus}\right| + 1\right) + 6 \tag{1}$$

These synchronization scores were then used in further data analysis.
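The scoring steps described above can be sketched as follows. Two details are assumptions rather than facts from the text: the logarithm is taken to be the natural log (which is consistent with scores near 0 for distances of a few hundred ms), and the child's tactus level is taken to be whichever of 500 ms or 1000 ms lies nearest the median inter-tap interval. The function name `sync_score` is hypothetical.

```python
import math
from statistics import median

TACTUS_LEVELS_MS = (500.0, 1000.0)  # tapping on every beat, or every two beats


def sync_score(tap_times_ms, tactus_levels_ms=TACTUS_LEVELS_MS):
    """SyncScore = -log(|median ITI - tactus| + 1) + 6 for one tap sequence."""
    trimmed = tap_times_ms[1:-1]  # drop the first and last tap
    if len(trimmed) < 2:
        raise ValueError("need at least four taps to form an inter-tap interval")
    itis = [later - earlier for earlier, later in zip(trimmed, trimmed[1:])]
    median_iti = median(itis)
    # Assumption: the chosen tactus is the candidate level nearest the median ITI.
    tactus = min(tactus_levels_ms, key=lambda level: abs(median_iti - level))
    distance = abs(median_iti - tactus)
    # Assumption: natural logarithm; the text does not state the base.
    return -math.log(distance + 1.0) + 6.0


if __name__ == "__main__":
    steady_taps = [i * 500.0 for i in range(40)]  # perfectly on the tactus
    slow_taps = [i * 600.0 for i in range(33)]    # 100 ms off the tactus
    print(sync_score(steady_taps), round(sync_score(slow_taps), 2))
```

Because the median is used, a single missed beat (one doubled interval) leaves the score unchanged, which is the behavior the text motivates.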

#### **Circular statistics**

Before calculating the circular metrics for a trial, the first and last taps were removed. Any taps with an offset deviating from the mean by more than two standard deviations were also removed. Circular statistics transform the asynchronies between the tapping and pacing signal into a point on the circumference of a unit circle. Taps that are aligned perfectly are placed at 0 radians on the circle, while taps that fall exactly between two beats of the pacing signal are placed at π/2 radians. Consideration of all the angles obtained in a particular trial enables calculation of a vector corresponding to the mean angle. The direction of this vector gives a measure of the degree to which the taps deviate, on average, from the pacing signal at 0 radians. The length of the vector gives a measure of the variability of phase alignment or phase locking strength. The direction is sometimes interpreted as a measure of synchronization accuracy (e.g., Falk et al., 2015). For the purposes of the present study, however, this is not the case. For example, if one group of children were tapping in exact antiphase to the pacing signal, this would not necessarily signify low accuracy, as the children may still be responding by producing taps with spacing that corresponds to the beat of the music (the tactus). Consequently, in the current study vector length most closely corresponds to synchronization accuracy. For example, a child who taps reliably out of phase but at the exact tactus of the music will have a high mean angle and a vector length close to 1. However, a child whose taps vary in ITI from below to above the tactus may have a mean angle of close to 0, but a much smaller vector length. The participant in the first case would also obtain a high SyncScore, while the participant in the second case would obtain a low SyncScore. 
Although we report both mean angle and vector length below, the data that we obtain from applying circular statistics should be interpreted with this in mind.
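The distinction between mean angle and vector length can be made concrete with a short sketch. It assumes one beat period maps onto a full 2π revolution (a common convention; the scaling used in the study may differ), takes the beat onsets as given, and omits the two-standard-deviation outlier exclusion for brevity. The function name and parameters are hypothetical.

```python
import cmath
import math


def circular_tapping_stats(tap_times_ms, beat_period_ms, first_beat_ms=0.0):
    """Return (mean_angle_rad, vector_length) for taps against a regular beat.

    Assumption: phase = 2*pi * (offset within the beat period / period).
    The mean angle is returned in the interval (-pi, pi].
    """
    trimmed = tap_times_ms[1:-1]  # drop the first and last tap
    if not trimmed:
        raise ValueError("need at least three taps")
    phases = [
        2.0 * math.pi * (((t - first_beat_ms) % beat_period_ms) / beat_period_ms)
        for t in trimmed
    ]
    # Average the unit vectors; the result's direction is the mean angle and
    # its length measures how tightly the phases cluster (0 = none, 1 = perfect).
    mean_vector = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return cmath.phase(mean_vector), abs(mean_vector)


if __name__ == "__main__":
    antiphase = [250.0 + 500.0 * i for i in range(20)]  # steady but off the beat
    angle, length = circular_tapping_stats(antiphase, 500.0)
    print(round(abs(angle), 3), round(length, 3))
```

A steady antiphase tapper obtains a large mean angle together with a vector length near 1, which is the case the paragraph above uses to argue that vector length, not direction, tracks accuracy here.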

### Pitch and Rhythm in Music and Speech (Music/Speech AXB Tasks)

Six tasks varying rhythm and/or pitch were created by RC by adapting a listening game based on Pingu (a cartoon penguin whose speech was difficult to understand) originally developed by Richards (2010). There were three musical tasks and three speech tasks, all of which used an AXB paradigm. In each case, one of the comparison stimuli (A or B) had the same rhythm and/or pitch contour as the standard (X), while the other did not. The children thus had to match the comparison stimuli with the standard on the basis of shared rhythm, shared pitch, or both. The music tasks were created by low-pass filtering a short tune. Children were shown a picture of a radio, and listened to the tune (standard stimulus: X). They then saw two pictures, one of a girl dressed in green playing the piano and one of a girl dressed in blue playing the piano. Two short unfiltered piano tunes were played (comparison stimuli: A—from the girl in green; B—from the girl in blue). Finally they heard the filtered radio tune again. The children were asked whether it was the same as that played by the girl in green or the girl in blue. For the speech tasks, the children saw a picture of a door, and a "muffled" voice spoke a sentence as if from behind the door (standard stimulus: X). The children were then shown two pictures, one of a lady dressed in green and one of a lady dressed in purple, and heard two sentences of natural (unfiltered) speech (comparison stimuli: A—from the lady in green; B—from the lady in purple). Finally they heard the muffled sentence again. The children were asked whether it was the same as that said by the lady in green or by the lady in purple. An overview of the tasks is given in **Table 3**.

#### **Music tasks**

The stimuli were melodies (5–7 notes) played by a single piano, created using the software Sibelius (version 6) from a sound set produced by Native Instruments (Kontakt Gold). A copy of each tune (.wav format) underwent filtering with a 0–500 Hz pass band and 50 Hz smoothing, using the software Praat (Boersma and Weenink, 2010). This left enough spectral information to hear the durational properties and the amplitude and fundamental frequency (f0) modulations, but the melodies no longer sounded like a piano (which fitted the picture story, that the radio played a poorer quality sound than a live performance). Each filtered tune became a standard stimulus (X), and the natural (unfiltered) tunes became the comparison stimuli (A, B).

##### **Music: rhythm only**

In this task, only rhythm was disrupted, hence both comparison stimuli had the same f0 contour as the standard. This meant that successful matching to the standard depended on matching rhythm. To disrupt the rhythm, the standard rhythm pattern of "strong" and "weak" notes was altered by changing a strong note for a weak note or vice versa (see **Table 3**). The strong notes had a longer duration (crotchet or quarter note) than the weak notes (quaver or eighth note), and a higher amplitude (specified as fortissimo in Sibelius) than the weak notes (no dynamic indication in Sibelius). The location of this exchange of notes (i.e., beginning/middle/end of the tune) varied between trials.

##### **Music: pitch only**

In this task, only pitch was disrupted. This meant that successful matching to the standard depended on matching pitch contour. The tunes comprised notes in the same rhythm (no strong-weak pattern), but with varying melodies (i.e., different f0 contours). For each trial, the two comparison stimuli differed by the exchange of two notes with a different f0, e.g., E B A A B compared to E B A B A (here the 4th and 5th notes were swapped). The location of this exchange of notes (i.e., beginning/middle/end of the tune) varied between trials. There were seven trials in which the f0 contour across the two notes involved in the exchange was falling in stimulus A and rising in stimulus B, and seven trials in which the f0 contour across the two notes was rising in A and falling in B.


##### **Music: rhythm and pitch**

In this task, tunes comprised notes with varying rhythms and melodies. For each trial, the two comparison stimuli differed by the exchange of a strong note for a weak note or vice versa, e.g., SSWSS compared to SSWSW. Both comparison stimuli in each trial had the same melody; however, this exchange meant that the rhythm differed between the comparison stimuli and the f0 contour differed within each stimulus. The effect was that the discrepant tune sounded different in both rhythm and pitch, which should make selection of the correct match easier. A fuller description of each task is provided in the Appendix of Supplementary Material.

#### **Speech tasks**

The stimuli were sentences (5–7 syllables) appropriate for primary school children in terms of their syntactic and lexical complexity. A female speaker of Standard Southern British English was recorded in a sound-attenuated booth producing each sentence 5 times; the equipment used was a Tascam DR-100 handheld recorder with an AKG C1000S cardioid microphone. The token of each sentence with the best fluency and stress pattern was selected to make the stimuli. This token (.wav format) underwent filtering with a 0–500 Hz pass band and 50 Hz smoothing, using the software Praat (Boersma and Weenink, 2010). Following filtering, enough spectral information remained to hear the durational properties and amplitude and f0 modulations (i.e., prosody), but it was no longer possible to decipher what was said (which fitted the idea of speech being muffled by the door). Each filtered sentence became a standard stimulus (X), and the natural (unfiltered) sentences became the comparison stimuli (A, B).

##### **Speech: rhythm only**

In this task, only speech rhythm was disrupted, as both comparison stimuli had the same (flat) f0 contour. This meant that successful matching to the standard depended on matching rhythm. The stimuli hence comprised three sentences with the same flat f0, with varying rhythm. The f0 was manipulated in Praat, by making a copy of each original sentence and resynthesising it with a flat f0 at 223 Hz (the mean f0 of all the original sentence recordings). The standard stimulus (X, filtered speech) for each sentence was also manipulated to have a flat f0 at 223 Hz using the same method. For each trial, the comparison stimuli comprised different words, but had the same number of syllables and were not perceptibly different to X in overall duration. The difference in rhythm was created by manipulating the syllable upon which the nuclear accent (or main sentence stress) fell. This nuclear-accented syllable had a perceptibly higher amplitude and longer duration, accounting for inherent phoneme length, than surrounding syllables (the large f0 excursion of the nuclear accent was lost once the f0 was flattened). The difference in the words used also contributed to the rhythmic difference, because the duration and amplitude of each individual syllable partly determined the rhythm too. (For more details, see Appendix in Supplementary Material).

##### **Speech: pitch only**

In this task, only pitch contour was disrupted and all sentences had the same rhythm. Sentences were created using words with only fully sonorant sounds. Therefore, in the filtered (X) stimuli: (a) there were no breaks in periodicity, so no syllable boundaries were detectable, hence there was no durational information for individual syllables; and (b) the filtered sonorant sounds all had a similar amplitude, hence relatively little amplitude information for individual syllables was available (except the nuclear-accented syllable). For each trial, the comparison stimuli used different words, but had the same number of syllables as X and were not perceptibly different in overall duration. The difference in pitch (f0 contour) was again created by varying which syllable received the nuclear accent. This nuclear-accented syllable had a large f0 excursion, as well as a perceptibly higher amplitude than surrounding syllables. So the primary difference between the standard and comparison sentences was carried by pitch contour, even though both f0 and amplitude modulations were varied (although durational properties were kept constant). Note that this latter variability was unavoidable, as speech is a more complex signal than a piano tune, and therefore controlling acoustic variables whilst keeping the stimuli natural-sounding is challenging. Nevertheless, as there were no detectable syllable boundaries, it was not possible to perceive amplitude modulations relative to individual syllable durations. Therefore, the correct choice was necessarily based on recognizing an overall contour of f0 (and amplitude) modulation across the standard and comparison sentences (for more details of all trials, see Appendix in Supplementary Material).

##### **Speech: rhythm and pitch**

In this task, both pitch and rhythm were disrupted, which should make correct matching easier. The stimuli comprised sentences with various rhythms and modulated f0 contours. For each trial, the comparison stimuli used different words, but had the same number of syllables and overall duration. The difference in rhythm and f0 contour was again created by letting the nuclear accent fall on different syllables. The nuclear-accented syllable had a perceptibly higher amplitude and longer duration (accounting for inherent phoneme length) than surrounding syllables, and a large f0 excursion. As the words in the standard and comparison sentences also differed, there was also a rhythmic difference. (For further details, see Appendix in Supplementary Material).

### Scoring

The tasks were presented in PowerPoint. For each condition, there were 14 trials preceded by two practice trials, presented in a fixed random order. The correct match to stimulus X was A in half of the trials and B in half of the trials. Before the first trial, the child viewed the pictures with the experimenter, who explained the task without the sounds. For each experimental trial, once the stimuli had played through in sequence (X-A-B-X, ISI 1 s), the child was given an unlimited response time. Children could ask for the trial to be repeated if necessary. Responses were scored as three points for each correct answer after one listening, two points after two listenings, one point after three or more listenings, and 0 points for each incorrect answer. This gave a maximum score out of 42 for each condition, with 42 representing ceiling performance (every answer correct after listening once). The children completed all six tasks in one of two fixed orders, with three tasks given in one test session and three given in a separate session. Each task took between 7 and 10 min to complete.
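The points scheme can be written out as a small sketch; the function names are hypothetical, and each trial is represented here as a `(correct, listenings)` pair purely for illustration.

```python
def trial_points(correct, listenings):
    """Points for one AXB trial: 3, 2 or 1 if correct after one, two, or
    three-or-more listenings; 0 if incorrect."""
    if listenings < 1:
        raise ValueError("a trial must be heard at least once")
    if not correct:
        return 0
    return {1: 3, 2: 2}.get(listenings, 1)


def condition_score(trials):
    """Sum the points over the experimental trials of one condition."""
    return sum(trial_points(correct, listenings) for correct, listenings in trials)


if __name__ == "__main__":
    ceiling = [(True, 1)] * 14  # every answer correct after one listening
    print(condition_score(ceiling))  # 14 trials x 3 points
```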

### RESULTS

Group data for the musical beat perception and tapping tasks and the speech/music AXB tasks are shown in **Table 4**, which presents group average performance for the three groupings of SLI children (Pure SLI, SLI PPR, whole group with IQ varying) and their TD controls. ANOVA was used as the main analysis technique, although note that as the TD controls for the Pure SLI grouping and the SLI PPR grouping were partly similar and partly different, we could not incorporate all three groups into one ANOVA (Pure SLI, SLI PPR, TD) as this removed the IQ-matching. Further, as there is no straightforward circular equivalent of the mixed-effects ANOVA, we used t-tests when analyzing children's synchronization to the musical beat via circular measures.

### Musical Beat Perception

Musical beat perception was significantly above chance (50%) for all groups. For the Pure SLI group and their TD controls, a one-way ANOVA by group showed no difference in performance [64.9% correct vs. 70.6% correct, F(1, 30) = 1.6, p = 0.22]. There was also no significant difference in musical beat perception for the SLI PPR children and their TD controls, F(1, 28) = 2.0, p = 0.17 (mean scores 63.9% correct vs. 71.5% correct). There was a significant group difference for the whole SLI sample with IQ varying, however, as performance was significantly better for the control children (70.7% correct) than the SLI children [61.3% correct; F(1, 94) = 9.7, p = 0.002]. Hence sensitivity to patterns of beats in music as measured by this task appears to be preserved in SLI children with intact NVIQ.

### Tapping to Music

One child was absent for the tapping tasks (CA control). For the Pure SLI children, a 2 × 2 [Group × Tempo (3/4 time, 4/4 time)] ANOVA taking the SyncScore as the dependent variable showed a significant main effect of Group only, F(1, 29) = 5.3, p = 0.03. There was no interaction between Tempo and Group, F(1, 29) = 0.5, p = 0.47, and no main effect of Tempo, F(1, 29) = 1.2, p = 0.29. Hence the children with Pure SLI showed significantly poorer motor synchronization with the beat compared to the TD controls. For the SLI PPR children, a second 2×2 ANOVA utilizing the SyncScores showed a significant main effect of Tempo, F(1, 27) = 4.5, p = 0.04, because synchronizing with the beat in 4/4 time was easier for all children. The main effect of Group approached significance, F(1, 27) = 3.0, p = 0.09, and there was no interaction between Tempo and Group, F(1, 27) = 0.7, p = 0.79. For the whole sample, the SyncScore ANOVA showed a significant main effect of Tempo, F(1, 92) = 7.5, p = 0.007, because music in 4/4 time was significantly easier to synchronize to. The main effect of Group was also significant, F(1, 92) = 9.7, p = 0.002, but there was no interaction between Tempo and Group, F(1, 92) = 0.08, p = 0.78. Hence overall the SLI children showed less synchronization to the beat than the TD control children.

*Standard deviations in parentheses.*

*<sup>a</sup>Maximum score = 42.*

A second set of analyses was also conducted using circular statistics. Circular statistics produce two metrics that can be submitted to data analysis: mean angle and vector length. Before submitting vector length to analysis, the data were first logit transformed because these data were heavily skewed. Furthermore, because no straightforward way exists to conduct mixed ANOVA on circular data, separate analyses were conducted at each level of the within-subject factor (Tempo: 4/4 time, 3/4 time) for both metrics (vector length and vector angle; note that conceptually vector length is expected to be the variable that mirrors performance as characterized by the SyncScores). For the Pure SLI children and vector length, t-tests revealed no significant differences between the tapping behavior of the SLI children and the TD controls for the 3/4 tempo, t(29) = 1.487, p = 0.148. For the 4/4 tempo, however, SLI children tapped less consistently than their TD controls, t(28) = 2.483, p = 0.019. These results are consistent with the results obtained using the SyncScore. For the SLI PPR children and vector length, no differences were found in their tapping behavior relative to controls for the 3/4 tempo, t(27) = 0.554, p = 0.584, nor for the 4/4 tempo, t(27) = 1.596, p = 0.122. These results again match those obtained with the SyncScore. For the whole sample, no differences were found between the tapping behavior of the SLI children and the TD control group using the vector length measure, for either the 3/4 tempo, t(91) = 1.68, p = 0.096, or the 4/4 tempo, t(91) = 1.6, p = 0.113. Hence for the whole group of SLI children (IQ varying), the vector length measure did not show the significant group differences revealed by the SyncScore measure; however, the Group effect did approach significance for 3/4 time.
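The logit transform applied to vector length before analysis is logit(R) = ln(R / (1 − R)). A minimal sketch follows; the clipping constant `eps`, used to keep the result finite when R is exactly 0 or 1, is an assumption and is not described in the text.

```python
import math


def logit_vector_length(r, eps=1e-6):
    """Map a vector length in [0, 1] onto the real line via ln(r / (1 - r)).

    Assumption: values are clipped to [eps, 1 - eps] so that perfectly
    consistent tapping (r == 1) does not produce an infinite value.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("vector length must lie in [0, 1]")
    clipped = min(max(r, eps), 1.0 - eps)
    return math.log(clipped / (1.0 - clipped))


if __name__ == "__main__":
    # Values bunched near 1 are spread apart, which reduces the skew.
    print([round(logit_vector_length(r), 2) for r in (0.5, 0.9, 0.99)])
```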

The final circular statistical analysis was conducted on mean angle or direction. Data were analyzed separately for each tempo using a Watson-Williams test (implemented in the MATLAB circular statistics toolbox). The Watson-Williams tests revealed no significant differences in synchronization between SLI and TD controls for either the full sample or the subsamples (Pure SLI, SLI PPR), for both tempi (3/4 time, 4/4 time; all p's > 0.50). We also explored the Spearman correlation between the SyncScore and both vector length and vector angle. Vector length was significantly correlated with the SyncScore for both the 3/4 tempo, ρ = 0.247, p = 0.017, and the 4/4 tempo, ρ = 0.403, p < 0.001. Vector angle, however, was not correlated with the SyncScore for either the 3/4 tempo, ρ = −0.073, p = 0.486, or the 4/4 tempo, ρ = 0.027, p = 0.800. Because the analyses using vector length and the SyncScore are largely in agreement, further analyses of tapping to music will focus on the SyncScore and vector length measures only.
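For readers unfamiliar with the Spearman coefficient used here, it is the Pearson correlation computed on ranks, with tied values sharing their average rank. The sketch below uses only the Python standard library and hypothetical function names; it computes ρ only and does not reproduce the p-values or the software used in the study.

```python
import math
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)
```

Working on ranks makes the coefficient insensitive to the skew of the raw scores, which suits measures such as vector length.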

### Music/Speech AXB Tasks

A series of repeated measures 2 × 2 × 3 ANOVAs [Group × Task (Music, Speech) × Condition (Match Pitch, Match Rhythm, Match Rhythm and Pitch)] were run, one for each SLI grouping. For the Pure SLI group, the ANOVA showed only a significant main effect of Condition, F(1, 30) = 5.7, p = 0.006. Neither Group, F(1, 30) = 2.1, p = 0.16, nor Task, F(1, 30) = 3.2, p = 0.08, was significant, and there were no significant interactions. Post-hoc tests (Tukey) showed that all the children found it significantly easier to match on the basis of both Rhythm and Pitch than to match on the basis of just Pitch, p = 0.003. There was no significant difference between matching on the basis of Rhythm only vs. Pitch only, nor between matching on the basis of both Rhythm and Pitch vs. Rhythm only. So sensitivity to rhythm was supporting all children's performance in both the speech and music tasks. The trend for Task arose because the music conditions were easier than the speech conditions, p = 0.08. For the SLI PPR children, the ANOVA showed a significant main effect of Condition, F(2, 56) = 5.4, p = 0.007, a significant main effect of Group, F(1, 28) = 15.2, p = 0.001, and a significant interaction between Task and Condition, F(2, 56) = 6.5, p = 0.003. Post-hoc testing (Tukey) of the interaction showed that for the Speech AXB task, the children found it significantly easier to match on the basis of both Rhythm and Pitch than to match on the basis of either cue alone (both p's = 0.001). The ease of matching on the basis of Rhythm alone or Pitch alone did not differ. For Music, by contrast, there were no significant differences between the three conditions. However, it was significantly more difficult for the children to match on Pitch alone in the Speech AXB task than to match on the basis of Rhythm alone in the Music AXB task, p = 0.003. The main effect of Group arose because the children with SLI showed significantly poorer performance than the TD children, as expected. 
The main effect of Task did not approach significance, F(1, 28) = 0.5, p = 0.48, but the interaction between Task and Group did approach significance, F(2, 28) = 3.5, p = 0.07. Post-hoc inspection of the interaction showed that it arose because the TD controls performed at a significantly higher level in each task than the SLI PPR children, with larger group differences for the music tasks (Music, p < 0.001; Speech, p = 0.03). The overall patterns of performance for the two groupings of SLI children with intact IQ are shown in **Figures 2**, **3** for the Music and Speech AXB tasks respectively. As the Figures show, the SLI PPR children tended to perform more poorly than the Pure SLI children, in the Music tasks in particular. The Music tasks also tended to be performed better by the TD children than by the SLI children.

For the whole sample with IQ varying, the 2 × 2 × 3 ANOVA showed significant main effects of Condition, F(2, 186) = 22.4, p < 0.001, and Group, F(1, 93) = 46.2, p < 0.001, while the main effect of Task approached significance, F(1, 93) = 3.7, p = 0.06. There were also significant interactions between Group and Condition, F(1, 186) = 5.3, p = 0.006, and Group and Task, F(1, 93) = 5.5, p = 0.02. The main effect of Task approached significance because overall the Music tasks were easier than the Speech tasks (p = 0.06). Post-hoc inspection of the significant interaction with Group (Tukey) showed, however, that this was the case only for the TD children, p < 0.001. The SLI children performed at similar levels in both tasks. The main effect of Condition arose because the Rhythm and Pitch conditions were significantly easier than the Rhythm only conditions, which in turn were significantly easier than the Pitch only conditions (p's < 0.001). The interaction between Condition and Group arose because this effect was again due to the TD children only. The children with SLI found all three conditions equally difficult. As would be expected, the TD children were also significantly better than the SLI children in every condition (p's < 0.001).

Overall, the AXB data suggest that even children with SLI who have intact IQ find it difficult to perceive rhythm patterns in music, but that this difficulty is greater for the SLI PPR children than for the Pure SLI children. Both SLI groups with preserved NVIQ tended to find the Musical AXB tasks more difficult than the Speech AXB tasks (see **Figures 2**, **3**). This could be due to less familiarity in listening to music vs. speech.

In order to see whether individual differences in the musical beat perception, tapping-to-music and Speech/Music AXB tasks would show the expected relations with individual differences in basic auditory processing (auditory thresholds in the ART, duration and dynamic f0 tasks described earlier), rank order correlations were computed and 1-tailed tests were applied. The correlation matrix is shown as **Table 5**.

Inspection of **Table 5** shows that most of the tasks showed significant negative correlations with the auditory processing measures (higher auditory thresholds were related to poorer task performance, as would be expected). For the Speech AXB tasks, there are some theoretically interesting exceptions. Sensitivity to ART in speech is unrelated to matching speech patterns on the basis of pitch contour, while sensitivity to rising F0 in speech is unrelated to matching speech patterns on the basis of rhythm, suggesting a cognitive dissociation. The tapping to music tasks are also unrelated to matching speech patterns on the basis of pitch contour, for both the SyncScore and vector length measures, again suggesting a dissociation between rhythmic performance and sensitivity to pitch contours. Whereas the SyncScore accuracy measures are usually significantly related to AXB rhythm matching, for both speech and music, the vector length measures are usually not. The tapping to music measures also show relatively few significant correlations with the auditory tasks, apart from with tone ART (and vector length) and speech F0 (and both SyncScore and vector length). As noted, the mean angle data are not shown in the table; here, no correlations with the auditory tasks reached significance except for one, between Ba ART and 3/4 time, r = 0.24, p < 0.05. Regarding correlations between the different Music/Speech AXB tasks (also not shown in the table), performance was significantly related for all pairs of tasks except one. No relationship was found for matching rhythm in speech and matching pitch in speech (the AXB "behind the door" tasks, r = 0.07). This interesting pattern supports the pattern of correlations discussed above in suggesting that the "behind the door" task (using filtered speech) is successfully distinguishing children's sensitivity to pitch contours in speech from their sensitivity to durational and rise time (rhythm-related) cues.

In a second set of correlations, potential associations between performance in the music and speech tasks and performance in the standardized measures of language, phonology and reading were explored. This time positive correlations were predicted, as better sensitivity in the rhythm and pitch measures should theoretically lead to better language, reading and phonology outcomes. The correlation matrix is shown as **Table 6**. As expected, the matrix shows significant positive associations between performance in virtually all of the tasks. Again, no significant correlations were found for the mean angle measures, which are not shown in the table.

In order to explore the relative strength of associations between our music tasks, overall language development (receptive vs. expressive), phonological development and reading development, multiple regression analyses were used. To assess relationships for these four outcome measures, 4 sets of 3 equations were created, a set of equations for CELF Receptive language scores, a set for CELF Expressive language scores, a set for phonological awareness (rhyme oddity) and a set for reading (BAS single word reading standard scores) respectively. The predictors in the sets of equations were chosen from the task battery to be relatively independent measures of the variables of interest, and age and NVIQ were also included, to account for general developmental variance. The predictors in the first equation in each set were thus age, NVIQ, matching rhythm in speech and matching pitch in speech (AXB task scores, recall that performance in these two speech tasks was not related). The predictors in the second and third equations in each set were age, NVIQ, performance in the musical beat perception task, and performance in the tapping-to-music task (either measured via the SyncScore or by Vector Length in 4/4 time, the easier tempo). The bootstrap function in SPSS was applied in each case, as not all predictors were normally distributed in the full sample (1000 permutations, confidence intervals 95%, bias corrected and accelerated). The results are shown as **Tables 7, 8**, which report the Beta values with standard errors and confidence intervals, the standardized Beta values, and the bootstrapped p-values.



\**p* < *0.05,* \*\**p* < *0.01,* \*\*\**p* < *0.001.*

#### TABLE 6 | Spearman's rank correlations between performance in the music and speech tasks and language skills.


\**p* < *0.05,* \*\**p* < *0.01,* \*\*\**p* < *0.001.*

#### TABLE 7 | Multiple regression equations with receptive and expressive language scaled scores as dependent variables and age, NVIQ and the music/speech tasks as predictors.


*CELF Recep, CELF Receptive Scaled Score; B (CI), unstandardized beta and confidence intervals; B SE, standard error for B; β, standardized beta coefficient; Sig, bootstrapped significance; CELF Expre, CELF Expressive Scaled Score; Speech Rhythm, Match Speech Rhythm AXB score; Speech Pitch, Match Speech Pitch AXB score; Musical Beat, Musical Beat Perception accuracy; Tapping 4/4, timing accuracy when tapping to music in 4/4 time (synchronization score); VL 4/4, timing accuracy when tapping to music in 4/4 time, vector length score.* \**p* < *0.05;* \*\**p* < *0.01.*

#### TABLE 8 | Multiple regression equations with phonological awareness (Rhyme Oddity) and single word reading (BAS SS) as dependent variables and age, NVIQ and the music/speech tasks as predictors.


*Phon Aware, oddity rhyme task; B (CI), unstandardized beta and confidence intervals; B SE, standard error for B; β, standardized beta coefficient; Sig, bootstrapped significance; BAS Read, BAS Single Word Reading Scaled Score; Speech Rhythm, Match Speech Rhythm AXB score; Speech Pitch, Match Speech Pitch AXB score; Musical Beat, Musical Beat Perception accuracy; Tapping 4/4, timing accuracy when tapping to music in 4/4 time (synchronization score); VL 4/4, timing accuracy when tapping to music in 4/4 time, vector length score.* \**p* < *0.05;* \*\**p* < *0.01.*

The regression analyses for children's receptive language scores (see **Table 7**) accounted for a significant 68% [F(4, 90) = 18.8, p < 0.001], 67% [F(4, 89) = 17.6, p < 0.001] and 68% [F(4, 88) = 19.0, p < 0.001] of the variance respectively. The first equation revealed two significant predictors, NVIQ and matching rhythm in speech. SLI children with higher NVIQ and better rhythm matching had better receptive language outcomes. The second and third equations also revealed two significant predictors, NVIQ and musical beat perception accuracy. SLI children with higher NVIQ and better performance in the musical beat perception task also had better receptive language outcomes. For expressive language scores (also **Table 7**), the analyses accounted for a significant 67% [F(4, 90) = 17.9, p < 0.001], 68% [F(4, 89) = 18.8, p < 0.001], and 68% [F(4, 88) = 18.7, p < 0.001] of the variance respectively. The first equation again revealed two significant predictors, NVIQ and matching rhythm in speech, and the second and third equations again revealed two significant predictors, NVIQ and performance in the musical beat perception task. In each case, SLI children with higher NVIQ and better rhythmic performance had higher expressive language outcomes. Inspection of **Table 7** shows that performance in the tapping-to-music task approached significance as a predictor of scores in the expressive language measure (p's = 0.085 and 0.052). Better motor synchronization to the beat related to better language outcomes.

Regarding phonological awareness (**Table 8**), each equation accounted for a significant 67% of the variance [F(4, 90) = 18.6, p < 0.001; F(4, 89) = 18.3, p < 0.001; F(4, 88) = 17.6, p < 0.001]. The first equation revealed that all four independent variables were significant predictors of phonological awareness, with older children with higher NVIQ and better rhythm and pitch matching showing better phonological awareness. In the second and third equations, the significant predictors of phonological awareness were age, NVIQ and musical beat perception. Older children with higher NVIQ and better performance in the musical beat perception task showed better phonological awareness. The tapping-to-music task did not approach significance as a predictor of phonological development in this sample for either measure (SyncScore or Vector Length). Finally, for single word reading (BAS SS, also **Table 8**), the equations accounted for a significant 70% of the variance [F(4, 90) = 21.6, p < 0.001], 68% of the variance [F(4, 89) = 19.2, p < 0.001] and 67% of the variance [F(4, 88) = 18.1, p < 0.001] respectively. All four independent variables were again significant predictors in the first equation, while the second and third equations showed two significant predictors, NVIQ and musical beat perception, with age close to significant (p = 0.051). Children with higher NVIQ, better rhythm and pitch matching performance and better musical beat perception showed better reading development. However, age was negatively related to reading development. This most likely reflects the fact that reading scores tend to plateau in older SLI children. Performance in the tapping task (4/4 time) was not significantly related to reading development. Hence the only independent variables that were consistent predictors of performance across the language, phonology and reading outcome measures were the measures of sensitivity to rhythm in speech and the musical beat perception task.
Accordingly, perceptual awareness of rhythm patterns in speech and in music appears to be an important predictor of individual differences in language, phonology and reading development. However, some of the experimental tasks used here are more sensitive measures of this relationship than others.

### DISCUSSION

Here we set out to investigate sensitivity to rhythm in music and speech for children with SLIs. Following prior studies of children with dyslexia (e.g., Huss et al., 2011; Goswami et al., 2013; Flaugnacco et al., 2014) and language impairments (e.g., Przybylski et al., 2013; Cumming et al., 2015), we expected to find rhythm perception deficits in children with SLIs. We were also interested in whether sensitivity to rhythm in music might be stronger than sensitivity to rhythm in speech for children with speech and language difficulties, in which case musical interventions might be of benefit (musical interventions may enhance shared cognitive/neural resources; see Goswami, 2012a,b; Przybylski et al., 2013; Gordon et al., 2015). The data revealed significant impairments in processing rhythm in the children with SLIs, but some tasks were more sensitive to the children's difficulties than other tasks. Contrary to expectation, the music AXB tasks were only easier than the matched speech AXB tasks for typically-developing children. The children with SLIs found it difficult to make rhythm judgements in both speech and music.

At the group level, the children with SLIs who had intact NVIQ did not show significant impairments in a musical beat perception task previously administered to children with developmental dyslexia (Huss et al., 2011; Goswami et al., 2013), although both the Pure SLI and SLI PPR groups scored more poorly than their TD controls. When IQ varied, the children with SLIs (61%) were significantly poorer than TD controls (71%). Nevertheless, the relative performance of the children with SLI studied here compared to TD controls showed less impairment compared to children with dyslexia. Comparison with prior dyslexic performance (Huss et al., 2011) reveals that older children (10-year-olds) with dyslexia and intact NVIQ averaged 63% correct in this task, while their TD controls averaged 84% correct (Huss et al., 2011). This may be suggestive of a less severe impairment in children with SLI in perceiving patterns of musical beats in this task, although longitudinal data are required. Nevertheless, individual differences in the musical beat perception task were a significant predictor of both the expressive and receptive language scores achieved by SLI children in multiple regression equations (see **Table 7**). Individual differences in musical beat perception were also a significant predictor of individual differences in both phonological awareness and reading development (**Table 8**). The latter developmental relationship is also found for children with dyslexia (Huss et al., 2011; Goswami et al., 2013).

In a tapping-to-music task created for the current study (in 3/4 time and 4/4 time), children with Pure SLI and intact IQ did show a significant impairment compared to TD children, for both tempi and for both synchronization measures (SyncScore, vector length). The group effect also approached significance for children with SLIs and phonological impairments (the SLI PPR group, p = 0.09 for SyncScores and p = 0.12 for vector length). When the whole sample was considered (IQ varying), the group effect was significant for the SyncScore measure (p = 0.002) but not for the vector length measure (3/4 time, p = 0.096; 4/4 time, p = 0.113). Overall, the data suggest that impaired motor synchronization to the beat is characteristic of children with SLIs and does not reflect low NVIQ. This finding supports our prior data showing motor variability in synchronization to the beat in children with SLIs as well as the data of others (e.g., Corriveau and Goswami, 2009; Woodruff Carr et al., 2014). However, most previous demonstrations of significant rhythmic tapping deficits in children with SLIs have utilized a metronome or drumbeat, where the pulse rate is clear. Our music task required synchronization to the pulse rate underlying the different musical instruments, potentially providing richer support to scaffold children's rhythmic timing accuracy (in music there are more auditory cues supporting the beat). The finding that motor synchronization to the beat in SLI children can be impaired for rich musical stimuli as well as for a metronome beat (cf. Corriveau and Goswami, 2009) suggests that musical interventions per se may not have much utility in enhancing neural entrainment to language (syllable) beats for children with SLIs. Rather, interventions may have to consider carefully how the musical beat supports the prosodic phrasing of target linguistic utterances.
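For readers unfamiliar with the vector length measure, the following Python sketch shows how a mean resultant vector length and mean angle are commonly computed from tap times in circular statistics. It is an illustration under stated assumptions, not the scoring procedure of the present study: the function name `vector_length`, the beat period of 0.5 s and the simulated tap times are hypothetical, and the SyncScore is not reproduced here.

```python
import numpy as np

def vector_length(tap_times, beat_period, first_beat=0.0):
    """Mean resultant length (0 to 1) and mean angle of tap phases relative to the beat."""
    taps = np.asarray(tap_times, dtype=float)
    # Express each tap as a phase angle within the beat cycle.
    phases = 2 * np.pi * ((taps - first_beat) % beat_period) / beat_period
    # Average the unit vectors: the length reflects consistency, the angle reflects lead or lag.
    resultant = np.mean(np.exp(1j * phases))
    return np.abs(resultant), np.angle(resultant)

beat_period = 0.5  # seconds; a hypothetical tempo of 120 beats per minute
rng = np.random.default_rng(0)
beats = np.arange(40) * beat_period

# Simulated tappers (hypothetical data): one consistent, one spread across the whole cycle.
steady = beats + rng.normal(0.0, 0.02, size=beats.size)
erratic = beats + rng.uniform(0.0, beat_period, size=beats.size)

r_steady, angle_steady = vector_length(steady, beat_period)
r_erratic, _ = vector_length(erratic, beat_period)
```

A vector length near 1 indicates taps clustered at a consistent phase of the beat, whereas values near 0 indicate taps scattered across the cycle, so the measure indexes the consistency rather than the absolute accuracy of synchronization.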

Unexpectedly, individual differences in the tapping-to-music tasks did not reach significance as predictors of receptive language development in this cohort of children (**Table 7**), although individual differences did approach significance as predictors of expressive language development. This contrasts with some previous findings regarding tapping to a metronome (e.g., Corriveau and Goswami, 2009; Tierney and Kraus, 2013). Rhythmic timing accuracy as measured by this music task also failed to reach significance as a predictor of phonological awareness and reading (**Table 8**). These findings differ from the associations between beat-based timing accuracy and phonological and reading development that we have found in samples of children and adults with developmental dyslexia (Thomson et al., 2006; Thomson and Goswami, 2008; see also Tierney and Kraus, 2013). One possibility is that the inconsistency in findings arises from our use of musical pieces rather than a metronome to measure variability in motor synchronization. Corriveau and Goswami (2009) did find significant relationships between tapping variability, phonology and reading in a different SLI cohort. Alternatively, the difference may arise because children with SLIs are relatively less impaired than children with dyslexia in motor synchronization tasks.

We also created a novel music AXB task for this study which required participants to match tunes to a filtered standard tune (X) on the basis of shared rhythm patterns and/or shared pitch contour. A speech analog of the musical AXB task enabled direct comparison with children's sensitivity to rhythm patterns and pitch contours in speech. The analyses showed that all children, TD and SLI, found matching on the basis of rhythm easier than matching on the basis of pitch, for both music and speech. The music tasks were also easier than the speech tasks for the TD children, but not for the children with SLIs. In fact, the children with SLIs showed poorer performance in all the rhythm tasks than the TD children, although performance was only significantly poorer for the SLI PPR group and the whole SLI group with NVIQ varying (the group effect for the Pure SLI children did not reach significance, p = 0.16). As would be expected, individual differences in the speech AXB rhythm task were a significant predictor of receptive and expressive language development (**Table 7**). In contrast, individual differences in the speech AXB pitch contour task did not predict language outcomes (**Table 7**). Individual differences in the speech AXB pitch contour task did, however, predict phonological and reading outcomes (**Table 8**).

Finally, we turn to the possible benefits of musical interventions for remediating the language deficits in children with SLIs. The fact that the children with SLIs found the music AXB tasks more difficult than the speech AXB tasks was unexpected. It suggests that musical interventions for children with SLIs need to be very carefully designed if they are to be effective. Given that the linguistic difficulties in SLI are associated with prosodic structure, it seems important that future studies investigate whether most benefit might be derived from musical sequences that match the overall prosodic phrasing of speech utterances, so that music is used to highlight precedence and prominence relations in larger lexical structures (Frazier et al., 2006). Simple beat synchronization tasks per se may not offer the same benefits for children with SLIs that are found for children with developmental dyslexia. For children with reading difficulties, phonological benefits ensue from music-based remediation that focuses on multi-modal synchronization to the beat (e.g., synchronizing spoken stressed syllables with clapping or marching actions and with a musical accompaniment, see Bhide et al., 2013). In Bhide et al.'s study, the degree of increased efficiency of motor synchronization to the beat by the children (temporal accuracy in bongo drumming to different rhythms) was significantly related to individual gains in reading. Temporal sampling theory provides a potential neural cross-modal explanation of this finding based on phase-phase coupling at delta and theta rates (Goswami, 2011, 2015). This relationship has yet to be tested for children with SLIs regarding individual gains in language. In cases of Pure SLI, where the child has morphological but not phonological deficits, motor synchronization to the beat per se may not offer significant benefits for grammatical development. 
More data are required to find out whether remediation for children with SLIs should focus primarily on the auditory and motor domains, as in developmental dyslexia (Bhide et al., 2013; Tierney and Kraus, 2013).

Indeed, multi-modal rhythmic interventions for children with SLIs (including the visual modality) may potentially offer greater benefits than simpler beat-based interventions. Children with SLIs typically have deficits in the auditory processing of both ART and duration. Grouping cues, such as those used to group precedence and prominence relations in an utterance, may rely more on sensitivity to duration. This has yet to be systematically investigated, but it is interesting to consider that SLI children's attention to "visual prosody" (the mouth, jaw, cheek, and head movements that the speaker unconsciously produces when emphasizing oral prosody, see Munhall et al., 2004) has not been quantified. Visual prosody may be important developmentally in supporting the long-range aural perception required to identify the weaker syllables that typically carry morphological information (Ghazanfar and Takahashi, 2014a,b). Attention to visual prosody is thus potentially crucial for morphological development in children. If this were to be the case, then musical interventions that involve group singing, or other musical activities that offer opportunities for visual as well as auditory and motor rhythmic synchronization, may offer the best outcomes in remediating language, syntax and phonology in children with SLIs.

### REFERENCES


### ACKNOWLEDGMENTS

We thank the children, parents, head teachers and teachers participating in this study. This project has been funded by the Nuffield Foundation (http://www.nuffieldfoundation.org), but the views expressed are those of the authors and not necessarily those of the Foundation. Requests for reprints should be addressed to UG, Centre for Neuroscience in Education, Downing Street, Cambridge CB2 3EB, U.K.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2015.00672


musical meter predicts reading and phonology. Cortex 47, 674–689. doi: 10.1016/j.cortex.2010.07.010


Patel, A. (2010). Music, Language and the Brain. Oxford: Oxford University Press.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2015 Cumming, Wilson, Leong, Colling and Goswami. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Temporally Regular Musical Primes Facilitate Subsequent Syntax Processing in Children with Specific Language Impairment

Nathalie Bedoin<sup>1</sup>, Lucie Brisseau<sup>2†</sup>, Pauline Molinier<sup>2†</sup>, Didier Roch<sup>2</sup> and Barbara Tillmann<sup>3</sup>\*

<sup>1</sup> Dynamique Du Langage Laboratory, Centre National de la Recherche Scientifique UMR 5596 and University Lyon 2, Lyon, France, <sup>2</sup> Institut Médico-Educatif Franchemont, Franchemont, France, <sup>3</sup> Lyon Neuroscience Research Center, Auditory Cognition and Psychoacoustics Team, Centre National de la Recherche Scientifique UMR 5292, INSERM U 1082, University Lyon 1, Lyon, France

#### Edited by:

Sonja A. Kotz, University of Manchester, UK; Maastricht University, Netherlands; Max Planck Institute for Human Cognitive and Brain Sciences, Germany

#### Reviewed by:

Psyche Loui, Wesleyan University, USA; Yune-Sang Lee, University of Pennsylvania, USA

\*Correspondence: Barbara Tillmann barbara.tillmann@cnrs.fr

†These authors have contributed equally to this work.

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

Received: 04 March 2016 Accepted: 17 May 2016 Published: 20 June 2016

#### Citation:

Bedoin N, Brisseau L, Molinier P, Roch D and Tillmann B (2016) Temporally Regular Musical Primes Facilitate Subsequent Syntax Processing in Children with Specific Language Impairment. Front. Neurosci. 10:245. doi: 10.3389/fnins.2016.00245

Children with developmental language disorders have also been shown to be impaired in rhythm and meter perception. Temporal processing and its link to language processing can be understood within the dynamic attending theory. An external stimulus can stimulate internal oscillators, which orient attention over time and drive speech signal segmentation to provide benefits for syntax processing, which is impaired in various patient populations. For children with Specific Language Impairment (SLI) and dyslexia, previous research has shown the influence of an external rhythmic stimulation on subsequent language processing by comparing the influence of a temporally regular musical prime to that of a temporally irregular prime. Here we tested whether the observed rhythmic stimulation effect is indeed due to a benefit provided by the regular musical prime (rather than a cost subsequent to the temporally irregular prime). Sixteen children with SLI and 16 age-matched controls listened to either a regular musical prime sequence or an environmental sound scene (without temporal regularities in event occurrence; referred to as the "baseline condition") followed by grammatically correct and incorrect sentences. They were required to perform grammaticality judgments for each auditorily presented sentence. Results revealed that performance for the grammaticality judgments was better after the regular prime sequences than after the baseline sequences. Our findings are interpreted in the theoretical framework of the dynamic attending theory (Jones, 1976) and the temporal sampling (oscillatory) framework for developmental language disorders (Goswami, 2011). Furthermore, they encourage the use of rhythmic structures (even in non-verbal materials) to boost linguistic structure processing and outline perspectives for rehabilitation.

Keywords: SLI, syntax processing, rhythm processing, temporal attention, music

## INTRODUCTION

The role of rhythm in speech processing as well as in language rehabilitation has attracted increased interest (e.g., Fujii and Wan, 2014). Rhythm with its sensorimotor coupling has been proposed to be a powerful stimulator of communication and social interactions, leading to the hypothesis that investigating the relation between rhythm and speech provides relevant insights into the origins of human communication, as well as perspectives for the rehabilitation of neurological disorders, notably by promoting rhythm and music stimulation or training as a potential tool for the rehabilitation of language disorders.

Investigating the potential influence of auditory rhythmic stimulation on language processing has been motivated by links between musical rhythm processing and speech processing for competences as well as deficits. Musical training has been shown to enhance phonological skills (Tierney and Kraus, 2014), even in dyslexic children (Flaugnacco et al., 2015). In typically developing young school-age children, Gordon et al. (2015a,b) showed a strong positive association between rhythm perception skills and expressive grammar skills. Performance in rhythm discrimination tasks predicted grammar skills in children and adults. For example, musical rhythm processing predicted the variance in performance of 6-year-old children for the production of complex syntax and the online reorganization of grammatical information. Furthermore, pre-schoolers with a good capacity to synchronize to the beat score higher on tests of early language skills (e.g., reading readiness), such as phonological processing, auditory short-term memory or rapid naming (Woodruff Carr et al., 2014), in comparison to weak synchronizers scoring low also on the language tests. The link between rhythm and language skills finds further support in data obtained for children with developmental language disorders. Indeed, impaired rhythm and meter processing has been reported in children with Specific Language Impairment (SLI; Weinert, 1992; Corriveau and Goswami, 2009) and in dyslexic children (Overy et al., 2003; Muneaux et al., 2004). SLI children's performance in a paced tapping task (i.e., tapping to a metronome) predicted their performance in word and nonword reading, rime awareness, non-word repetition, and reading comprehension (Corriveau and Goswami, 2009). Similarly, dyslexic children's performance in beat perception predicted word and non-word reading as well as phonological awareness (Muneaux et al., 2004). Congruent findings had been previously reported by Overy et al. 
(2003) asking dyslexic children to tap to the rhythm of a song (i.e., Happy Birthday), which is a form of syllable segmentation and reflects a type of phonological awareness that is of major importance for acquiring skilled reading.

These rhythm-processing deficits have been suggested to lead to difficulties in accurately processing relevant auditory cues in speech. They can lead to deficits in language perception by disrupting supra-segmental processing required to extract words and syllables from the speech stream (Thomson and Goswami, 2008; Corriveau and Goswami, 2009), and by impairing the to-be-developed phonological representations (e.g., onset-rime awareness), which are also relevant for reading (Muneaux et al., 2004). Impaired encoding of supra-segmental information (e.g., word stress, intonation, rhythm) in SLI and dyslexia also has consequences for syntactic structure processing (Weinert, 1992; Marshall et al., 2009; Sabisch et al., 2009). Syntax deficits are particularly pronounced in SLI, in addition to deficits in phonological and semantic processing (Bishop and Snowling, 2004; Catts et al., 2005).

Rhythmic and temporal processing can be understood in Jones' framework of dynamic attending (e.g., Jones and Boltz, 1989; Jones, 2008). Originally inspired by the processing of musical structures, this framework has also been applied to speech (e.g., Quene and Port, 2005; Kotz et al., 2009). The framework postulates that attention is not equally distributed over time, but develops in cycles: internal oscillators synchronize to the temporal regularities of an external stimulus. They orient attention over time and allow expectations to develop about when the next event will occur, which facilitates the processing of events at expected time points as well as segmentation and structural, temporal integration. Also referring to the dynamic attending theory (Large and Jones, 1999), Goswami (2011) proposed a temporal sampling (oscillatory) framework for developmental dyslexia and, by extension, for SLI. This framework explains phonological impairments and other observed impairments via an underlying deficit in temporal coding and attention.
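The idea of an internal oscillator that synchronizes to external regularities can be made concrete with a deliberately simplified Python sketch. It uses a linear phase and period error-correction scheme, which is an assumption made for illustration and not the oscillator model of Large and Jones (1999). The function name `entrain`, the gain values and the stimulus timings are all hypothetical.

```python
def entrain(onsets, period, phase_gain=0.5, period_gain=0.2):
    """Track event onsets with a linear phase- and period-correcting oscillator."""
    expected = onsets[0]  # the first event starts the attentional cycle
    errors = []
    for onset in onsets:
        error = onset - expected  # positive when the event arrives later than expected
        errors.append(error)
        period += period_gain * error            # adapt the internal period
        expected += period + phase_gain * error  # predict the time of the next event
    return period, errors

# Hypothetical regular prime: events every 0.5 s; the oscillator starts too slow (0.6 s).
onsets = [i * 0.5 for i in range(40)]
final_period, errors = entrain(onsets, period=0.6)
```

With a temporally regular input such as this one, the prediction errors in `errors` shrink over successive events and the internal period approaches the stimulus period, which is one simple way to picture how expectations about upcoming events could sharpen after a regular prime.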

The use of rhythmic and musical stimulation to improve language processing, and in particular syntax processing, has provided converging evidence for the role of dynamic attending and of internal attentional oscillators for speech processing. The influence of a prior rhythmic stimulation on subsequent syntax processing has been shown for four patient populations who all encounter syntax processing deficits as well as difficulties in temporal processing (including temporal processing in nonverbal materials): patients with basal ganglia lesions (Kotz et al., 2005), patients with Parkinson Disease (Kotz and Gunter, 2015), children with SLI and children with dyslexia (Przybylski et al., 2013).

For patients with basal ganglia lesions who do not show the P600 component evoked by syntactic violations (Kotz et al., 2003), Kotz and colleagues tested whether these patients may benefit from an external, temporally regular stimulation, such as a rhythmically regular (metrical) musical prime. This prime should stimulate internal oscillator set-ups and thus help subsequent speech processing. Patients first listened to a rhythmic prime (i.e., a sequence of a march) for 3 min, followed by the language testing blocks with syntactically correct and incorrect sentences. The external rhythmic stimulation showed a compensatory effect and restored the P600 to syntactic violations in patients with basal ganglia lesions (Kotz et al., 2005) and Parkinson Disease (Kotz and Gunter, 2015).

For children with developmental language disorders (SLI and dyslexia), Przybylski et al. (2013) investigated the potential influence of a musical rhythmic prime on the performance in a subsequent language task requiring syntax processing. They contrasted two musical primes (short musical excerpts played by percussion instruments), for which meter extraction was either easy or difficult (referred to as regular or irregular prime, respectively). In the experimental session, each music presentation was followed by a block of experimental trials of the language task that investigated syntax processing. Children were asked to make grammaticality judgments on auditorily presented sentences that were syntactically either correct or incorrect. Performance of all children (children with SLI, children with dyslexia and control children) in the grammaticality judgments was better after regular prime sequences than after irregular prime sequences. These findings suggest that the rhythmicity of the musical prime can influence temporal attention (e.g., via internal oscillators), which allows reinforcing processes underlying phonological processing, speech segmentation and syntax processing, and that this influence holds over the temporal delay to the language task (i.e., music and language were not presented simultaneously).

However, for these studies, no baseline condition was used for comparison even though a baseline comparison is necessary to judge the potential benefit of the rhythmic stimulation and its potential perspectives for the development of training and rehabilitation programs. For basal ganglia patients (Kotz et al., 2005), the effect of the musical prime was shown in the restoration of an ERP component (the P600 following the perception of syntactic violations) that was reported as missing in previous work (Kotz et al., 2003). For the developmental language disorders (Przybylski et al., 2013), the effect of the musical prime was shown by comparing two prime types (regular, irregular), thus showing a relative facilitation between the two conditions: regular vs. irregular. However, this comparison does not yet allow conclusions about compensatory benefits of the regular prime in comparison to children's performance without music. As in previous linguistic and musical priming research, studying relative facilitation is a first step that then leads to investigating benefits and costs in comparison to a baseline condition, which was the goal of our study.

Building on the previously observed influence of prior music stimulation on subsequent language processing (albeit shown only as relative facilitation), our present study aims to investigate the potential benefits provided by a regular musical structure of a preceding sound context on language performance by including the comparison with a baseline condition (e.g., an environmental sound scene that did not include temporal regularities). We focused on the investigation of the potential benefit of the regular prime (and not of the potential cost of the irregular prime) as this result could open avenues for using rhythmic structures (even with non-verbal materials) to boost linguistic structure processing in patient populations. We here tested a group of SLI children and a group of age-matched control children with the experimental set-up of Przybylski et al. (2013), except that we compared performance in syntax processing after a regular musical prime and after a neutral baseline condition.

As SLI children have been shown to be impaired not only in rhythm and meter processing (e.g., Corriveau and Goswami, 2009), but also in pitch processing (e.g., Mengler et al., 2005, for perception; Clément et al., 2015, for production), we also tested the SLI children's musical abilities for the processing of musical pitch and rhythm, as assessed by two subtests of the abbreviated version of the Montreal Battery of Evaluation of Musical Abilities (MBEMA; Peretz et al., 2013).

### METHODS

### Participants

The present experiment included a group of SLI children and a group of control children that were matched for chronological age (CA). For all children, French was the main language, none had benefited from musical training and none reported auditory or visual deficits. For all children, we assessed reading age (RA) with scores obtained with a standardized reading test, the Alouette test, which focuses on decoding mechanisms by requiring children to read sentences without semantic support (Lefavrais, 1965). All children and their parents had given their written informed consent, as well as the director of the institute for the SLI group, prior to the study. The experiment was conducted according to the Helsinki Declaration, Convention of the Council of Europe on Human Rights and Biomedicine, and the experimental paradigm (i.e., a musical prime followed by grammaticality judgments on aurally presented sentences) was approved by the French ethics committee Comité de Protection de Personnes for testing in children with developmental language disorders and typically developing children (see Przybylski et al., 2013).

All SLI children were recruited from the Institut Médico-Educatif (IME) Franchemont in Champigny sur Marne, a medical and pedagogical institute with a boarding school for children with language disorders who cannot be accepted in the normal school system. Diagnoses of the language deficit and general neurological assessments were made by neuropsychologists or speech therapists. The evaluations were based on a variety of French neuropsychological and language tests, with pathological scores defined as scores at least two standard deviations below the population mean. The SLI children were not diagnosed as mentally retarded (using the WISC IV), even though one child had a score two standard deviations below the population mean in one of the four subtests. None of the children had been diagnosed with additional learning difficulties (e.g., dyspraxia, ADHD, autistic spectrum disorder or other neurological or psychiatric disorders).

Sixteen SLI children (13 boys, average CA: 9 years 7 months, SD = 13 months, range: 7 years 3 months to 10 years 11 months; average RA: 6 years 11 months, SD = 6 months, range: 6 years 7 months to 8 years 0 months) participated in the experiment. They were diagnosed with a phonological-syntactic syndrome (de Weck and Rosat, 2003) with verbal expression mainly affected at phonological, syntactic, and semantic levels, as assessed by various batteries including at least word and pseudo-word repetition, naming, morphosyntactic production and phonemic fluency. In addition, two children were also diagnosed with receptive dysphasia and one with lexical-syntactic dysphasia. Before the experiment, the SLI children were tested with two further language tests: (1) ECOSSE ["Epreuve de compréhension syntaxico-sémantique" (Test of syntactic-semantic understanding); Lecoq, 1996], a French adaptation of the Test for Reception of Grammar (TROG, Bishop, 1983), evaluating the child's syntactic and semantic understanding capacities in spoken language; (2) EVIP ["Echelle de Vocabulaire en Images Peabody" (Scale of Vocabulary evaluated by images)], a French adaptation of the Peabody Picture Vocabulary Test-Revised (Dunn et al., 1993) that uses pictures to assess the child's vocabulary level.

For the ECOSSE, average performance was 76.19 (SD = 23.70; range: 10–110; with 100 being the average score of the reference population). For the EVIP, the average normalized score (out of 100) was 91.56 (SD = 16.01; range: 60–117). The children also performed Raven's Colored Progressive Matrices (Raven et al., 1998) so that we could correlate their level of non-verbal intelligence with performance in the syntax task and the MBEMA. The group's average score was 85.53 (SD = 16.31), ranging from 56.5 to 118, with three children performing more than 2 SD below the average score of the reference population (100); note, however, that none of them had been diagnosed as mentally retarded (see above). See **Table 1** for a summary of the SLI children's performance on these three tests and the reading test ("L'Alouette"). **Table 2** presents the correlations between the results of these four tests and chronological age. None of these measures correlated, except for performance on the ECOSSE and the EVIP, both of which capture aspects of children's language processing capacities.

Sixteen control children (matched to the SLI children for CA) were included in this study: 9 boys, average CA: 9 years 5 months, SD = 14 months, range: 7 years 4 months to 11 years 0 months; average RA: 9 years 10 months, SD = 17 months, range: 7 years 5 months to 13 years 0 months. None of the children in the control group reported a history of written or spoken language impairments.

TABLE 1 | SLI children's results for the additional neuropsychological tests.


For ECOSSE, EVIP, and Raven's Progressive Matrices, 100 is the average score of the reference population. For the reading test ("L'Alouette"), we indicate here the scores transformed in reading age, presented in months.

TABLE 2 | Correlations r between the SLI children's results in the neuropsychological tests [Raven's Colored Progressive Matrices, Reading test ("L'Alouette," scores transformed into reading age in months), ECOSSE, EVIP], the results on the music perception test MBEMA with its pitch and rhythm subtests, and the children's chronological age.


\*\*p < 0.01 (two-tailed); \*p < 0.05 (two-tailed).

### Materials

The regular musical sequence and the linguistic material of Przybylski et al. (2013) were used. Musical and linguistic materials were presented over headphones. The experiment was run on Psyscope software (Cohen et al., 1993).

For the regular prime condition, the musical sequence had a duration of 32 s and contained a rhythmic structure allowing for relatively easy meter extraction (**Figure 1**; see Supplementary Material). The sequence was played by two percussion instruments (i.e., a tam-tam at 175 Hz and maracas at 466 Hz): using two voices made the musical stimulus more attractive than a single line and reinforced the underlying beat (e.g., through two simultaneously played events). Each instrumental line was composed of a section of eight beats of 500 ms, which was repeated eight times to form the prime sequence. The simple rhythmic structure consisted of inter-onset intervals of 250, 500, 750, or 1000 ms and one unit of 375 ms followed by 125 ms (i.e., together creating an interval of 500 ms). To extract the metrical structure, listeners needed to find regular subdivisions of 125 ms, then 250 ms, and build a hierarchy with the main beat every 500 ms, followed by another hierarchical level at 1000 ms. The hierarchy was reinforced by the simultaneous presentation of events played by the two instruments on six of the eight beats in the pattern. We selected the tempo of 500 ms based on the developmental work by McAuley et al. (2006) on entrainment; they reported that the spontaneous motor tempo of children aged 8 to 10 years lies at about 521 ms (±61).
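The timing arithmetic of the regular prime can be checked with a short sketch. The inter-onset pattern used below is a hypothetical example built only from the interval values listed above; it is not the published rhythm (shown in Figure 1), and all function and variable names are illustrative.

```python
BEAT_MS = 500          # main beat (tempo of the prime)
BEATS_PER_SECTION = 8  # one section = eight beats
REPETITIONS = 8        # the section is repeated eight times

# Metrical levels named in the text, from fastest subdivision to slowest level.
METRICAL_LEVELS_MS = [125, 250, 500, 1000]

# Hypothetical inter-onset intervals (ms) for one instrumental line.
# This is an assumption for illustration, NOT the actual score of the prime.
EXAMPLE_IOIS_MS = [500, 250, 250, 750, 250, 1000, 375, 125, 500]


def section_duration_ms():
    """Duration of one eight-beat section in milliseconds."""
    return BEAT_MS * BEATS_PER_SECTION


def prime_duration_s():
    """Duration of the whole prime (all repetitions) in seconds."""
    return section_duration_ms() * REPETITIONS / 1000


def onsets_from_iois(iois_ms):
    """Convert inter-onset intervals into onset times, starting at 0 ms."""
    onsets, t = [], 0
    for ioi in iois_ms:
        onsets.append(t)
        t += ioi
    return onsets


def strongest_level(onset_ms):
    """Return the slowest metrical level whose grid contains this onset."""
    return max(level for level in METRICAL_LEVELS_MS if onset_ms % level == 0)


if __name__ == "__main__":
    print(prime_duration_s(), "s")
    for onset in onsets_from_iois(EXAMPLE_IOIS_MS):
        print(onset, "ms aligns with the", strongest_level(onset), "ms level")
```

In this toy pattern, the onset at 3375 ms falls only on the 125 ms grid, which illustrates why the 375 + 125 ms unit requires the finest subdivision of the hierarchy.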

For the baseline condition, the auditory sequence had a duration of 30 s and presented a recording of an outdoor environmental sound scene (a street with a playground; Supplementary Material). This environmental scene contained no temporal regularities in the occurring sounds (as shown by a Fast Fourier Transform analysis of the sound file, **Figure 2**) and no comprehensible speech (even though voices were present). The sound file was taken from the database "universal-soundbank.com."
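One way to check a sound for temporal regularities is to look for a dominant peak in the spectrum of its onset or amplitude envelope. The sketch below is only an illustration under stated assumptions, not the analysis behind Figure 2: a naive discrete Fourier transform stands in for the FFT, the two onset trains are toy data (one isochronous at 2 Hz, one hand-picked and irregular), and all names and parameter values are hypothetical.

```python
import cmath

SAMPLE_RATE_HZ = 16   # toy envelope sampling rate (assumption)
DURATION_S = 8        # toy duration, much shorter than the real primes


def onset_train(onset_times_s):
    """Binary envelope: 1.0 at samples containing an onset, else 0.0."""
    n = SAMPLE_RATE_HZ * DURATION_S
    env = [0.0] * n
    for t in onset_times_s:
        env[int(round(t * SAMPLE_RATE_HZ))] = 1.0
    return env


def magnitude_spectrum(signal):
    """Naive DFT magnitudes for bins 1..N/2 (the DC bin is skipped)."""
    n = len(signal)
    mags = {}
    for k in range(1, n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                  for i, x in enumerate(signal))
        mags[k * SAMPLE_RATE_HZ / n] = abs(acc)  # key = frequency in Hz
    return mags


def peak_ratio(signal):
    """Largest spectral magnitude divided by the number of onsets.

    A value close to 1 means all onsets line up on one periodic grid.
    """
    mags = magnitude_spectrum(signal)
    return max(mags.values()) / sum(signal)


# Isochronous onsets every 500 ms versus hand-picked irregular onsets.
regular = onset_train([i * 0.5 for i in range(16)])
irregular = onset_train([0.0, 0.3125, 1.1875, 1.5, 2.6875, 3.0625,
                         4.25, 4.9375, 5.5, 6.8125, 7.0625, 7.75])

if __name__ == "__main__":
    print("regular peak ratio:  ", round(peak_ratio(regular), 3))
    print("irregular peak ratio:", round(peak_ratio(irregular), 3))
```

For a real recording, the envelope would first have to be extracted from the waveform, and an FFT routine would replace the quadratic-time loop used here.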

The linguistic material was composed of 96 French sentences that were grammatically either correct (48) or incorrect (48). We first created 48 correct sentences, and derived from each correct sentence an incorrect sentence. The violations used were of three different types (Gunter et al., 2000), and affected gender agreement, number agreement or person agreement. Grammatical and ungrammatical sentences were composed of an average of 6.1 words (range of 4 to 8) and an average of 8.29 syllables (range of 6 to 11); their duration was on average 2300 ms (±353). Participants listened to the same sentence in either its grammatically correct version or its incorrect version. To this end, the 96 sentences were split into two lists (A and B) of 48 sentences. A grammatically correct sentence (presented in list A) was matched in number of words, number of syllables, number of letters and the words' lexical frequency (Lété et al., 2004) with another correct sentence (presented in list B). Based on these lists, two experimental sets were constructed: (1) 24 grammatically correct sentences chosen from list A, and 24 grammatically incorrect sentences from list B (each of the three syntax violation types was represented by eight sentences), (2) 24 grammatically correct sentences chosen from list B, and 24 grammatically incorrect sentences from list A. Each participant worked on one of the sets. Sentences were pronounced by a native female speaker of French with a natural speed of production.
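The crossing of the two lists into two experimental sets can be sketched as follows. The sentence identifiers and the assignment of violation types to items are placeholders (the actual sentences are not reproduced here), and the function names are illustrative assumptions.

```python
VIOLATION_TYPES = ("gender", "number", "person")
PAIRS_PER_LIST = 24  # 24 correct sentences per list, each with a derived incorrect version


def make_list(list_name):
    """Hypothetical list of sentence pairs (correct version + incorrect version)."""
    return [
        {"id": f"{list_name}{i:02d}", "violation": VIOLATION_TYPES[i % 3]}
        for i in range(PAIRS_PER_LIST)
    ]


def build_sets(list_a, list_b):
    """Cross the lists so that no participant hears both versions of a sentence."""
    set_1 = [(p["id"], "correct", None) for p in list_a] + \
            [(p["id"], "incorrect", p["violation"]) for p in list_b]
    set_2 = [(p["id"], "correct", None) for p in list_b] + \
            [(p["id"], "incorrect", p["violation"]) for p in list_a]
    return set_1, set_2


set_1, set_2 = build_sets(make_list("A"), make_list("B"))

if __name__ == "__main__":
    print(len(set_1), "sentences in set 1;", len(set_2), "in set 2")
```

Each set contains 48 sentences (24 correct, 24 incorrect, eight per violation type), and across the two sets every sentence occurs once in each version.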

For the MBEMA (abbreviated version, Peretz et al., 2013), we selected the subtests of musical pitch and rhythm. Both tests used 20 unfamiliar tonal melodies (average duration of 3.5 s) that were computer-generated and presented in different musical timbres (e.g., piano, marimba, guitar, flute). Melodies were presented in pairs for 20 trials: 10 trials with identical (same) standard and comparison melodies and 10 trials with comparison melodies that differed either by a scale-violating tone (pitch subtest) or by a change of duration of two adjacent notes (rhythm subtest). For each subtest, there were two additional practice trials. The material was downloaded from the authors' website<sup>1</sup>. However, at the time of testing, only 17 trials of the pitch subtest (10 different pairs, 7 same pairs) and its practice trials were available, and we had to run the test with this reduced version. We transformed performance into percentages of correct responses (as in Peretz et al., 2013), but to check whether this reduced version might have influenced performance, we tested an additional group of 16 control children (matched for CA; average age: 9 years 7 months, SD = 13 months, range: 7 years 2 months to 10 years 10 months) to compare their performance with that reported in Peretz et al. (2013). For that purpose, we transformed all mean scores into percentages of correct responses. The two subtests of the MBEMA were programmed with Psyscope (Cohen et al., 1993).

<sup>1</sup>http://www.brams.umontreal.ca/plab/research/Stimuli/mbea\_variety\_child\_version/child\_version\_mbea\_variety\_stimuli.html

### Procedure

The 48 sentences were presented in blocks of six sentences, with the constraint that each block contained three grammatically correct sentences and three incorrect sentences (covering violations of gender, number and person agreement, respectively). Before each of the eight blocks, one of the two prime sequences was presented (with four blocks preceded by a regular prime and four by a baseline prime). The order of the primes and the blocks as well as the order of the sentences in each block were randomized for each participant. Participants were asked to listen to the music and were shown a picture on the computer screen (a black and white drawing, which represented, for example, a guitar playing music). At the end of the prime sequence, a blue exclamation mark appeared on the screen to indicate the beginning of the sentence. Participants were asked to judge the grammaticality of the sentences. To facilitate the understanding of the required grammaticality judgment, it was explained to the children that two dragons pronounced the sentences: one who was never wrong and one who was always wrong. At the end of the sentence, two pictures of dragons were presented on the screen: a dragon who looked satisfied and a dragon who looked puzzled. Participants answered by pressing one of two buttons on the computer keyboard, one below each dragon. The next sentence was triggered by the experimenter. At the beginning, the organization of an experimental trial was illustrated with one grammatically correct sentence, and the experimenter performed one trial with the child to make sure that the instructions were understood. While children were listening to the prime sequences, the experimenter listened to different music via headphones so as to remain unaware of the type of prime sequence presented before the next set of sentences and to avoid any unconscious influence on the child's behavior.
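The block structure of the grammaticality task (eight blocks of six sentences, four blocks per prime type) can be illustrated with a small randomization sketch. It is a hypothetical reconstruction under stated assumptions, not the Psyscope script used in the study: the sentence labels are placeholders and `make_session` is an invented helper.

```python
import random

VIOLATIONS = ("gender", "number", "person")
N_BLOCKS = 8


def toy_sentence_set():
    """Hypothetical 48-sentence set: 24 correct, 8 per violation type."""
    correct = [("correct", i) for i in range(24)]
    incorrect = [(v, i) for v in VIOLATIONS for i in range(8)]
    return correct + incorrect


def make_session(sentences, seed=None):
    """Return a list of (prime, block) pairs in randomized order."""
    rng = random.Random(seed)
    pools = {label: [s for s in sentences if s[0] == label]
             for label in ("correct",) + VIOLATIONS}
    for pool in pools.values():
        rng.shuffle(pool)            # random assignment of sentences to blocks
    blocks = []
    for b in range(N_BLOCKS):
        # Three correct sentences plus one violation of each type.
        block = pools["correct"][3 * b:3 * b + 3]
        block += [pools[v][b] for v in VIOLATIONS]
        rng.shuffle(block)           # random sentence order within the block
        blocks.append(block)
    primes = ["regular"] * 4 + ["baseline"] * 4
    rng.shuffle(primes)              # random prime order across blocks
    return list(zip(primes, blocks))


if __name__ == "__main__":
    for prime, block in make_session(toy_sentence_set(), seed=1):
        print(prime, [label for label, _ in block])
```

Passing a different seed per participant would give each child an individual order while keeping the constraints on block composition.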

For the MBEMA (administered before the main experiment in a separate testing session), we followed the implementation of Peretz et al. (2013): the pitch test was always presented first, followed by the rhythm test, and for each subtest, the order of trials was fixed. Each trial was preceded by a warning tone, followed by the target melody, a 1500 ms silent interval and the comparison melody. For each trial, participants were asked to judge whether the two melodies were the same or different by pressing one of two response keys on the computer keyboard. The next trial started when a third response key was pressed. Note that because only 17 pairs of the pitch test were available, we replayed the first three pairs in order to reach 20 test pairs, even though only the first 17 were analyzed.

### RESULTS

### Grammaticality Judgments

Performance was analyzed with signal detection theory by calculating discrimination sensitivity (d') and response bias (c) for each participant and each prime condition. These analyses are based on Hits (i.e., correct responses for ungrammatical sentences) and False Alarms, FAs (i.e., errors for grammatical sentences), after regular and baseline primes, respectively<sup>2</sup>. d' is defined as z(Hits) − z(FAs), and response bias c as −0.5 [z(Hits) + z(FAs)]; see Macmillan and Creelman (1991) for more details. d' and c were analyzed by ANOVAs with prime (regular, baseline) as within-participant factor and group (SLI children, controls) as between-participants factor. To estimate effect sizes, we calculated partial η<sup>2</sup> (Cohen, 1988).
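The two signal detection measures can be computed from hit and false-alarm counts as in the following minimal sketch. It assumes 12 ungrammatical and 12 grammatical sentences per prime condition (four blocks of six sentences each) and applies the 0.01 and 0.99 limits mentioned in the footnote symmetrically to both rates, which is an assumption of this example; the function names are illustrative.

```python
from statistics import NormalDist

# z-transform: inverse of the standard normal cumulative distribution.
_z = NormalDist().inv_cdf


def clamp_rate(rate, low=0.01, high=0.99):
    """Keep a proportion away from 0 and 1 so that its z-score stays finite."""
    return min(max(rate, low), high)


def dprime_and_c(hits, n_ungrammatical, false_alarms, n_grammatical):
    """Return (d', c) from response counts.

    hits         : ungrammatical sentences correctly judged ungrammatical
    false_alarms : grammatical sentences wrongly judged ungrammatical
    """
    hit_rate = clamp_rate(hits / n_ungrammatical)
    fa_rate = clamp_rate(false_alarms / n_grammatical)
    d_prime = _z(hit_rate) - _z(fa_rate)
    c = -0.5 * (_z(hit_rate) + _z(fa_rate))
    return d_prime, c


if __name__ == "__main__":
    # Hypothetical child: 9 of 12 violations detected, 3 of 12 false alarms.
    print(dprime_and_c(9, 12, 3, 12))
    # Hypothetical child who rarely responds "ungrammatical": c is positive.
    print(dprime_and_c(6, 12, 1, 12))
```

With this sign convention, a positive c corresponds to a tendency to respond "grammatical", because the ungrammatical sentences are treated as the signal.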

For d' (**Figure 3**), the main effect of group was significant, F(1, 30) = 105.57, p < 0.0001, MSE = 0.91, η<sub>p</sub><sup>2</sup> = 0.78, reflecting, as expected, that controls performed better than SLI children. Most interestingly, the main effect of musical prime was significant, F(1, 30) = 4.92, p = 0.03, MSE = 0.34, η<sub>p</sub><sup>2</sup> = 0.14, and it did not interact with group, p = 0.32. For both participant groups, performance was better after the regular musical prime than after the baseline prime. Average performance suggests that the musical prime effect was reduced in controls due to close-to-ceiling performance. As the focus of the study was on the SLI children, we further checked that the effect was indeed significant when focusing on this participant group [F(1, 30) = 5.24, p = 0.03, partial η<sup>2</sup> = 0.15]. Note that performance for SLI children was above chance level after the regular prime (p < 0.0001) and after the baseline prime (p < 0.0001). Due to the age range among patients and their CA-matched controls, we ran correlational analyses between participants' age and the difference in d' between regular and baseline primes; these correlations were not significant over all participants, r(30) = 0.12, nor for each participant group considered separately [SLI children: r(14) = 0.13; control children: r(14) = 0.40].
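The reported effect sizes can be cross-checked from the F ratios and their degrees of freedom, since partial η<sup>2</sup> = SS<sub>effect</sub> / (SS<sub>effect</sub> + SS<sub>error</sub>) is algebraically equal to F × df<sub>effect</sub> / (F × df<sub>effect</sub> + df<sub>error</sub>). The short sketch below applies this identity to the values given above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


if __name__ == "__main__":
    # F values reported for d' in the text, each with df = (1, 30).
    for label, f_value in [("group", 105.57), ("prime", 4.92), ("prime, SLI only", 5.24)]:
        print(label, round(partial_eta_squared(f_value, 1, 30), 2))
```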

In addition, we calculated correlations between SLI children's performance (d' in the regular condition, d' in the baseline condition and the difference in performance between regular and baseline conditions) and their results on the four neuropsychological tests: Raven Matrices (testing for non-verbal intelligence), Alouette test (reading score), ECOSSE (syntactic comprehension) and EVIP (lexical knowledge). Performance in the regular condition correlated with performance on the ECOSSE, r(14) = 0.50, p < 0.01, and performance in the baseline condition correlated with performance on the EVIP, r(14) = 0.66, p < 0.01. The other correlations were not significant.

The analysis of c (**Table 3**) revealed that the effect of musical prime on d' was not accompanied by a difference in response bias. Only the main effect of group was significant, F(1, 30) = 9.83, p = 0.004, MSE = 0.29, η<sub>p</sub><sup>2</sup> = 0.25; neither the main effect of musical prime, p = 0.13, nor the interaction, p = 0.89, was significant. Note that SLI children showed a bias to respond "grammatical" (with c greater than 0) for both the baseline condition (p = 0.001) and the regular condition (p = 0.01), while control children did not show a response bias that differed significantly from 0, ps > 0.68.

### Pitch and Rhythm Subtests of the MBEMA

First, we compared the performance of our new control group with the performance of the children reported in Peretz et al. (2013), notably with the group of 8-year-old children [the oldest group tested in Peretz et al. (2013), followed by a group of adult participants]. Performance of the two groups of children was highly comparable: Peretz et al.'s children reached 76 and 84% of correct responses for the pitch and rhythm subtests, respectively, while our control group reached 75 and 81%, respectively. Based on this, we transformed the cut-off scores of Peretz et al. (2013) into percentages for the pitch subtest (54%) and the rhythm subtest (63%) to evaluate performance of the SLI children. Second, we analyzed performance with a 2 × 2 ANOVA with group (SLI children, controls) as between-participants factor and test type (pitch, rhythm) as within-participant factor (**Table 4**). The main effect of group was significant, F(1, 30) = 9.29, MSE = 378.84, p = 0.005, η<sub>p</sub><sup>2</sup> = 0.24, with better performance for controls than for SLI children. The main effect of test type was also significant, F(1, 30) = 11.08, MSE = 83.25, p = 0.002, η<sub>p</sub><sup>2</sup> = 0.27, with better performance for the rhythm subtest than the pitch subtest (as in Peretz et al., 2013). The interaction between group and test type was not significant, p = 0.46. Performance on the two subtests correlated significantly for SLI children [r(14) = 0.61, p < 0.05] and controls [r(14) = 0.71, p < 0.01]. As a group, SLI children and control children performed significantly above chance level for the pitch test [SLI: p = 0.02; controls: p < 0.0001] and the rhythm test [SLI: p = 0.0002; controls: p < 0.0001]. However, on an individual level, using the cut-off scores from Peretz et al. (2013), more detailed analyses revealed that for the SLI group, eight children were below cut-off for the pitch test and six children for the rhythm test. For the control group, no child was below cut-off for the pitch test, but two children were below cut-off for the rhythm test.

TABLE 3 | c data pattern (means, standard errors) averaged over participants, presented as a function of the prime (Regular, Baseline), and the participant groups (SLI children, control children).

TABLE 4 | Percentages of correct responses (averaged over participants) and standard errors presented as a function of the subtest of the MBEMA (pitch, rhythm) and the participant groups (SLI children, control children).

Note that this group of control children was different from that of the main experiment, albeit matched to the SLI group (see text).

<sup>2</sup>The correction of the d' and c measures used .01 for cases without false alarms and .99 for the maximum number of hits.

Based on these findings, we ran supplementary analyses combining the SLI children's results on the MBEMA, our neuropsychological testing and our main experimental task. First, SLI children's performance on the MBEMA did not correlate significantly with their scores on the Raven Matrices [r(14) = 0.41 for the pitch test, r(14) = 0.36 for the rhythm test], suggesting that their decreased performance cannot be explained by cognitive impairments. Note that it correlated neither with the other neuropsychological tests (ECOSSE, EVIP, reading age) nor with chronological age (**Table 2**). Second, MBEMA performance on either subtest did not correlate significantly with performance in the regular condition [pitch: r(14) = 0.07; rhythm: r(14) = 0.05], in the baseline condition [pitch: r(14) = 0.37; rhythm: r(14) = 0.41] or with the difference in performance between the two conditions [pitch: r(14) = 0.20; rhythm: r(14) = 0.25]. Third, the benefit of the regular condition over the baseline condition was comparable for participants performing above or below cut-off for the rhythm test (a difference of 0.41 and 0.57, respectively, p = 0.74) and also for the pitch test (a difference of 0.53 and 0.37, respectively, p = 0.74). In sum, while SLI children performed below control children in the two subtests of the MBEMA, this decreased performance did not seem to modulate their performance in the main experimental task or the benefit of the regular prime.

### DISCUSSION

The present study builds on previous research that showed the influence of an external rhythmic stimulation on subsequent language processing by comparing the influence of a temporally regular musical prime with that of a temporally irregular prime (Przybylski et al., 2013). We here introduced the necessary baseline condition to investigate whether the observed rhythmic stimulation effect is indeed due to a benefit provided by the regular musical prime. SLI children and their matched controls performed grammaticality judgments after having listened to either a regular musical prime or a rather neutral environmental sound scene. Results revealed better performance (as measured by d') after the regular musical prime than after the baseline prime, and this benefit was not accompanied by a change in response bias (as measured by c) between the two conditions. These findings suggest that the previously reported difference between the influence of a temporally regular musical prime and a temporally irregular prime (Przybylski et al., 2013) was not solely due to a processing cost created by the temporally irregular prime, but included a beneficial effect of the temporally regular prime. We here focused on the potential beneficial effect of a temporally regular stimulation because it opens new perspectives for motivating training and rehabilitation programs. We thus cannot draw conclusions about potentially disturbing effects of the temporally irregular prime on subsequent language processing (i.e., a processing cost in comparison to a neutral baseline condition).
However, comparing effect sizes across studies can give some indication of the potential cost of the irregular prime in addition to the benefit of the regular prime: when comparing regular and irregular primes, the effect size for children with SLI was 0.34 (as measured by partial η<sup>2</sup>; Przybylski et al., 2013), but when comparing regular and baseline primes, the effect size was 0.15 (partial η<sup>2</sup> reported here). This comparison suggests that the irregular prime also influences language processing, and thus that the difference reported by Przybylski et al. (2013) might have included both the benefit of the regular prime and the cost of the irregular prime. However, this cross-study comparison needs to be considered with caution, and future studies should directly implement the three experimental conditions (regular, irregular, baseline) in a within-participants design to clearly establish the costs and benefits of a temporal context with its irregularities and regularities. Along these lines, it is worth underlining that determining adequate baseline conditions is difficult (see Jonides and Mack, 1984, and Tillmann et al., 2003, for discussions of this difficulty for language and music materials), and it may well be that our baseline prime provided some general arousal, which would lead to an underestimation of the benefit of the regular prime and an overestimation of the cost of the irregular prime. Future research might use a silent baseline prime condition to further study the benefits involved, although this might be difficult because the temporal persistence of the musical prime effect is not yet known (a silent baseline condition could thus be contaminated by a preceding regular prime condition).

The findings observed here confirm that the previously reported temporal processing deficits in children with developmental language disorders, notably deficits in rhythm and meter processing (e.g., Corriveau and Goswami, 2009), did not hinder the beneficial influence of the regular prime on subsequent language processing, here requiring syntactic processing. Our additional findings on children's capacity to process pitch and time dimensions (as measured with the MBEMA) are in agreement with this observation. Even though the SLI children performed worse on this test than the control children, their performance levels in the syntax task and their benefit from the regular prime (in comparison to the baseline condition) did not correlate with their performance level in the MBEMA. Even though the MBEMA does not test fine-grained temporal processing, this observation suggests that temporal processing capacities should not be treated as a necessary condition or an exclusion criterion for benefiting from a rhythmic prime for subsequent language processing. Note that we here observed a deficit in the MBEMA not only in the rhythm subtest (as predicted by previous findings on SLI children's impairments in temporal processing tasks, e.g., Weinert, 1992; Corriveau and Goswami, 2009), but also in the pitch subtest. Neither impairment correlated with the SLI children's scores on the Raven Matrices test, suggesting that the decreased performance cannot be explained by more general cognitive impairments. Interestingly, the deficit on the pitch dimension is in agreement with other recent findings reporting impaired singing abilities in children with SLI, notably for a pitch-matching task and a melodic reproduction task (Clément et al., 2015). However, even though the SLI group performed below the control group in the pitch and rhythm subtests, the more detailed analyses revealed that on an individual level not all SLI children performed below cut-off (Peretz et al., 2013).
These findings thus suggest that verbal and musical deficits can, but do not necessarily, co-occur, pointing to the potential heterogeneity of SLI expressions and indicating that it would be premature to conclude that SLI and amusia involve shared deficits.

In addition, we correlated SLI children's performance in the grammaticality task with the results of the neuropsychological tests. While the correlations with Raven Matrices (testing for non-verbal intelligence) and the reading score were not significant, the correlations with the ECOSSE and EVIP tests, measuring syntactic comprehension and lexical knowledge, respectively, were interesting. Participants' performance in the ECOSSE test correlated significantly with their performance in the regular prime condition (as measured by d'), while their performance in the EVIP test correlated significantly with performance in the baseline condition. This finding might indicate that the baseline provided stimulation that helped children access lexical knowledge in the experimental task, whereas the regular musical prime rather helped them tap into their syntactic processing capacities.

Our findings are in agreement with previous research that has shown a beneficial effect of a temporally regular musical stimulus on syntax processing in patients with basal ganglia lesions (Kotz et al., 2005), Parkinson's disease (Kotz and Gunter, 2015) or children with developmental language disorder (Przybylski et al., 2013). For these patient populations, the deficits in temporal processing capacities might affect language processing, which requires sequencing and segmentation (such as syntax here). However, the impaired system can be activated or stimulated by the musical material with its clear metrical structure (clearer than in language material). Indeed, the regular structure provides predictable cues that might boost and entrain internal oscillators, which then benefit sequencing and temporal segmentation at the sentence level and thus favor syntax processing. This explanation of the beneficial effect can be tied back to previously proposed hypotheses that SLI children have a sequencing deficit (Weinert, 1992) or a more general procedural deficit (Ullman and Pierpont, 2005). This deficit has been attributed to impaired processing of grammatical structures and temporal sequences, whether in language (syntax, morphology, phonology) or music (Ullman, 2001; Ullman and Pierpont, 2005; Corriveau and Goswami, 2009). Together with the previous studies (e.g., Kotz et al., 2003; Przybylski et al., 2013), the findings suggest that non-linguistic stimuli with strongly regular temporal structures might help reduce this deficit, that is, that they improve cognitive sequencing.
As suggested by the dynamic attending theory of Jones (1976), which was later integrated in the temporal sampling framework proposed by Goswami (2011) to account for impaired rhythmic entrainment (in dyslexia and by extension in SLI), regular structures entrain internal oscillators and guide temporal attention, thus benefiting temporal expectancy formation and temporal sequencing more generally. This benefit is particularly relevant for speech processing as speech is tied to time and requires temporal processing and cognitive sequencing (see Kotz and Schwartze, 2010). Kotz et al. (2009) discussed two potential pathways involved in sequencing (i.e., expectancy formation, auditory stream segmentation, syntax processing) and temporal attention: a basal ganglia-preSMA circuitry, and a cerebellar-thalamic-preSMA pathway. These pathways would be involved in the perception of predictable sensory cues (such as beats in metrical structures) and the synchronization between internal oscillators and external (stimulus) regularities (as suggested by the dynamic attending theory, Jones, 1976). Deficits in one of the pathways, for example due to abnormalities in regions of the frontal cortex (in particular Broca's area and pre-motor regions), as reported for SLI (see Ullman and Pierpont, 2005 for a review), might be reduced by stimulating the system with highly regular stimuli (e.g., musical sequences), which may be more efficient for the impaired pathway, or via the alternative pathway, thus boosting sequencing capacities and compensating for the effect of a sequencing deficit on sentence processing. This has led to the proposal to use musical, rhythmic stimuli and metrical stimulation as a tool for therapeutic interventions or educational practices (e.g., Kotz et al., 2005; Goswami, 2011; Fujii and Wan, 2014), which have started to be investigated (e.g., Overy, 2000; Flaugnacco et al., 2015).
This approach could also exploit the motivational advantages and pleasantness that musical material provides in a training program, beyond its stimulating effect for impaired temporal processing networks.

The importance of sequencing in language processing has also been pointed out by Conway et al. (2009), who focused on the potential origin of the impairment of the involved processes in hearing-impaired listeners. They postulated that sound deprivation in deafness leads to an impaired development of cognitive sequencing capacities affecting speech processing and other structural processing (e.g., they tested the statistical learning of new structured systems). This may be due to the early deprivation from sound in the environment (with its temporal and rhythmic characteristics), which is an efficient source of daily training for cognitive sequencing. Interestingly, deficits in statistical learning have also been reported for SLI children (Evans et al., 2009; Hsu et al., 2014). Together with the dynamic attending theory, this suggests that training of temporal sequencing with auditory non-verbal signals (in particular music with its structural regularities on pitch and time dimensions, leading to expectations about what and when) could help speech processing not only in SLI children, but also in hearing-impaired listeners or listeners with cochlear implants. Some recent research has started to use rhythmic primes to improve subsequent language processing in hearing-impaired children either by the immediate repetition of the sentence's accent structure (Cason and Schön, 2012) or on a more abstract level (as in our present study) with a regular musical prime (Bedoin et al., in preparation). For example, Cason and colleagues have shown the effect of a rhythmic prime on subsequent language processing, notably speech perception and speech production (repetition of the presented sentence) in hearing-impaired children (Cason et al., 2015a), as well as on performance in a phoneme detection task in the healthy population (Cason et al., 2015b). Most interestingly, Cason et al.
have shown that the effect of the rhythmic prime was enhanced when it was associated to movement (participants were required to tap with their hand to the rhythmic structure), thus suggesting the influence of movement and auditory-motor coupling in temporal attention and cognitive sequencing. This finding is also in agreement with the reported beneficial influence of motor activity on the precise temporal encoding of acoustic sequences (Schmidt-Kassow et al., 2013; Chemin et al., 2014; Morillon et al., 2014) as well as the observation of activity in the motor cortex when listening to a temporally regular rhythmic sequence (Fujioka

### REFERENCES


et al., 2012, 2015). These findings thus suggest that it will be interesting to further exploit the association between regular rhythmic stimulation (in particular using music) and movement, aiming to further enhance its beneficial effects for cognitive and temporal sequencing, and in particular speech processing, in rehabilitation settings. Recently, the importance of movement for temporal processing and auditory prediction (predictive timing; temporal expectations) has further been developed in Patel and Iversen's (2014) proposed "Action Simulation for Auditory Prediction" hypothesis, situated in the perspective of evolutionary neuroscience of music perception (and musical beat perception in particular). In this theoretical paper, they propose not only further testable predictions for research investigating the influence of movement on temporal processing, but propose also some speculations about the evolution of beat processing by comparing humans and non-human primates and further motivating cross-species research.

### AUTHOR CONTRIBUTIONS

BT and NB conceived the experiment, analyzed the data and wrote the manuscript. LB, PM, and DR tested participants and analyzed data.

### FUNDING

This work was conducted in the framework of the LabEx CeLyA ("Centre Lyonnais d'Acoustique", ANR-10-LABX-0060) operated by the French National Research Agency (ANR).

### ACKNOWLEDGMENTS

We thank Alexandra Corneyllie for her help in analyzing the baseline condition and Magali Thollon for her help in testing one of the control groups.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00245




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Bedoin, Brisseau, Molinier, Roch and Tillmann. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Exaggeration of Language-Specific Rhythms in English and French Children's Songs

#### Erin E. Hannon<sup>1</sup>\*, Yohana Lévêque<sup>2</sup>, Karli M. Nave<sup>1</sup> and Sandra E. Trehub<sup>3</sup>

<sup>1</sup> Department of Psychology, University of Nevada Las Vegas, Las Vegas, NV, USA, <sup>2</sup> Centre de Recherche en Neurosciences de Lyon, Institut National de la Santé et de la Recherche Médicale, Lyon 1 University, Lyon, France, <sup>3</sup> Department of Psychology, University of Toronto Mississauga, Mississauga, ON, Canada

The available evidence indicates that the music of a culture reflects the speech rhythm of the prevailing language. The normalized pairwise variability index (nPVI) is a measure of durational contrast between successive events that can be applied to vowels in speech and to notes in music. Music–language parallels may have implications for the acquisition of language and music, but it is unclear whether native-language rhythms are reflected in children's songs. In general, children's songs exhibit greater rhythmic regularity than adults' songs, in line with their caregiving goals and frequent coordination with rhythmic movement. Accordingly, one might expect lower nPVI values (i.e., lower variability) for such songs regardless of culture. In addition to their caregiving goals, children's songs may serve an intuitive didactic function by modeling culturally relevant content and structure for music and language. One might therefore expect pronounced rhythmic parallels between children's songs and language of origin. To evaluate these predictions, we analyzed a corpus of 269 English and French songs from folk and children's music anthologies. As in prior work, nPVI values were significantly higher for English than for French children's songs. For folk songs (i.e., songs not for children), the difference in nPVI for English and French songs was small and in the expected direction but non-significant. We subsequently collected ratings from American and French monolingual and bilingual adults, who rated their familiarity with each song, how much they liked it, and whether or not they thought it was a children's song. Listeners gave higher familiarity and liking ratings to songs from their own culture, and they gave higher familiarity and preference ratings to children's songs than to other songs. Although higher child-directedness ratings were given to children's than to folk songs, French listeners drove this effect, and their ratings were uniquely predicted by nPVI. 
Together, these findings suggest that language-based rhythmic structures are evident in children's songs, and that listeners expect exaggerated language-based rhythms in children's songs. The implications of these findings for enculturation processes and for the acquisition of music and language are discussed.

Keywords: rhythm, development, infancy, music, speech, infant-directed modification

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Clément François, University of Barcelona, Spain

L. Robert Slevc, University of Maryland, College Park, USA

> \*Correspondence: Erin E. Hannon erin.hannon@unlv.edu

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

> Received: 07 March 2016 Accepted: 07 June 2016 Published: 21 June 2016

#### Citation:

Hannon EE, Lévêque Y, Nave KM and Trehub SE (2016) Exaggeration of Language-Specific Rhythms in English and French Children's Songs. Front. Psychol. 7:939. doi: 10.3389/fpsyg.2016.00939

## INTRODUCTION

Music and language are universal and uniquely human, yet they exhibit tremendous cultural diversity. One consequence of this diversity is that children must acquire culture-specific knowledge and skills without explicit instruction and within a relatively short developmental window. Young listeners must also disentangle musical from linguistic input despite many overlapping elements and features. Because rhythm is prominent in music and language but variable across cultures, it is a potentially important source of information about culture-specific content and structure.

Rhythmic behaviors are ubiquitous in the form of dancing and coordinated music-making. Simple rhythmic patterns and regular underlying beats predominate across cultures (Savage et al., 2015) and are readily perceived by young infants (Trehub and Thorpe, 1989; Baruch and Drake, 1997; Hannon and Johnson, 2005; Winkler et al., 2009; Otte et al., 2013). Nevertheless, there is considerable variation in the complexity and regularity of musical rhythm and beat across cultures (Temperley, 2000; Clayton, 2001). These cross-cultural differences have consequences for music perception and production among adult listeners, even those with no formal music training (Magill and Pressing, 1997; Hannon and Trehub, 2005a; Hannon et al., 2012a; Ullal-Gupta et al., 2014). Importantly, features of culture-specific rhythms gradually influence children's perception of music during a prolonged developmental window (Hannon and Trehub, 2005b; Gerry et al., 2010; Soley and Hannon, 2010; Hannon et al., 2011, 2012b).

Rhythm is also a basic feature of spoken language. The diversity of rhythm and stress patterning across languages of the world gives rise to the perception and production of accent (Cutler, 2012). Listeners use language-specific rhythms to segment words from fluent speech (Vroomen et al., 1998; Ling et al., 2000; Sanders and Neville, 2000), beginning in infancy (Jusczyk et al., 1999; Johnson and Jusczyk, 2001; Thiessen and Saffran, 2003; Thiessen et al., 2005). As in music, rhythm in spoken language is hierarchically structured, with alternating patterns of stressed and unstressed elements occurring at nested hierarchical levels (Liberman and Prince, 1977; Fletcher, 2010). Linguists have grouped languages into rhythmic classes, on the basis that some languages, such as Spanish, give the impression of a machine-gun rhythm, while others, such as English, have a Morse Code quality. Accordingly, syllable-timed languages like Spanish and French were thought to have regular or isochronous intervals between syllables, whereas stress-timed languages such as English and Dutch were thought to have isochronous intervals between stressed syllables (and still other mora-timed languages, like Japanese, have the mora as the isochronous unit; Abercrombie, 1967; Cummins, 2009).

Even very young listeners are sensitive to these rhythmic classes. Rhythmic input is available prenatally because of the low-pass filtering properties of the intrauterine environment (Gerhardt and Abrams, 1996; Ullal-Gupta et al., 2013). Prenatal exposure may underlie newborns' preferences for maternal speech (DeCasper and Fifer, 1980; Cooper and Aslin, 1989), their native language (Mehler et al., 1988; Moon et al., 1993), specific passages of speech (DeCasper and Spence, 1986) and specific songs (Hepper, 1991). A role for rhythm in such preferences is implicated by the finding that newborns can discriminate two languages from contrasting rhythmic classes (e.g., Spanish and English) but not from the same rhythmic class (e.g., English and Dutch) even when one is the ambient language (Nazzi et al., 1998). It would seem that language rhythms direct infants' attention to the native language.

The basis for impressions that some languages sound like a machine gun and others like Morse Code is unclear. Acoustic analyses do not support traditional notions of isochronous syllables or stress feet (Dauer, 1983; Grabe and Low, 2002). However, measures of durational contrast between vocalic and consonantal intervals capture some of the presumed differences in language rhythms (Ramus et al., 1999; Grabe and Low, 2002). One such measure is the normalized Pairwise Variability Index (nPVI), which is high in stress-timed languages, where vowel reduction is prominent, and low in syllable-timed languages, where vowel reduction is minimal (Ramus et al., 1999; Grabe and Low, 2002; White and Mattys, 2007).

Speech rhythm tends to govern poetic forms in different languages. For example, English poetic forms such as limericks are organized around the stress foot, whereas French poetic forms are organized around the syllable (Cutler, 2012). One might therefore expect the music of a particular culture to reflect the language rhythms of that culture. Indeed, when the nPVI metric is applied to sequential note durations in instrumental classical and folk music, durational contrast in music parallels speech rhythm from the same region, with, for example, higher nPVIs reported for English than for French music (Huron and Ollen, 2003; Patel and Daniele, 2003; London and Jones, 2011; McGowan and Levitt, 2011). Importantly, non-musician adults accurately classify songs according to their language of origin (French or English), and nPVI predicts how well they generalize this classification to novel songs (Hannon, 2009).

The finding that music and language have parallel rhythmic structure raises important questions about development and learning. Early biases toward familiar stimuli presumably influence what infants learn and when they learn it (Tardif, 1996; Kuhl, 2007; Imai et al., 2008). Given that rhythm may drive infants' early listening biases (Nazzi et al., 1998; Soley and Hannon, 2010), the presence of overlapping rhythms in speech and song input could influence both music and language learning. It could also have implications for their ability to differentiate music from speech, because young listeners would presumably need to use features other than rhythm to accomplish this, such as pitch (Vanden Bosch der Nederlanden et al., 2015). A major goal of the present study was to determine whether the rhythmic differences between stress- and syllable-timed languages have parallels in children's songs.

Child-directed songs from different cultures might preserve or even exaggerate the rhythmic differences between languages. When compared to adult-directed speech, infant- and child-directed speech exaggerates prosodic cues to word boundaries such as stress (Christiansen et al., 1998; Dominey and Dodane, 2004; Thiessen et al., 2005). The typical diminutive forms of the child-directed register in stress-timed languages, such as "mommy" and "doggie," increase the prevalence of stressed/unstressed syllables, which would predict higher durational contrast in child-directed than in adult-directed speech (Kempe et al., 2005). If exaggerated language-specific rhythms occurred in child-directed linguistic and musical input, this would support the hypothesis that music signals cultural or social group membership by reinforcing culturally or linguistically relevant information (Kandhadai et al., 2014; Mehr et al., 2016).

However, child-directed speech is not invariably didactic, and the acoustic features that characterize this style of speaking (and presumably singing) may instead reflect universal caregiving functions such as soothing or emotion regulation (Corbeil et al., 2015; Trehub et al., 2015). Across cultures, infant- and child-directed speech is characterized by higher pitch, greater pitch range, shorter and simpler utterances, slower speech rate, longer pauses, repetition, and rhythmic regularity (Fernald et al., 1989). Although infant-directed singing is more restricted by discrete pitches and rhythmic values, it has many of the features of infant-directed speech, in particular, rhythmic regularity and slower tempos (Trainor et al., 1997; Longhi, 2009; Nakata and Trehub, 2011). As a result, caregiving functions could decrease rhythmic contrasts in child-directed speech and song, regardless of language.

The literature is currently unclear regarding these contrasting predictions. In one study, child-directed speech, regardless of language, had lower nPVIs than adult-directed speech (Payne et al., 2009), whereas other studies found no differences in rhythmic contrast between adult- and child-directed speech for English (Wang et al., 2015) and Japanese (Tajima et al., 2013). In other instances, rhythmic differences between stress-timed and syllable-timed languages were preserved but not exaggerated in child-directed speech and singing (Payne et al., 2009; Salselas and Herrera, 2011). Importantly, prior evidence of higher nPVI in English than in French music was based primarily on instrumental music (Patel and Daniele, 2003). By contrast, one study found no nPVI differences between French and German vocal music (VanHandel and Song, 2010, but see Daniele and Patel, 2013). It is therefore unclear whether children's songs would be expected to exhibit rhythmic features consistent with their language of origin.

The present investigation asked whether children's songs originally set to French or English lyrics exhibit rhythmic patterning in line with their language of origin, as demonstrated for instrumental music. We analyzed rhythmic contrast (as nPVI) in a large corpus of songs from anthologies of children's music. Because children's songs are invariably set to text, we also analyzed a corpus of folk songs also set to text but not designated as children's songs. This allowed us to compare similar genres of vocal music that differed primarily in child-directedness. To determine whether listeners link rhythm to the child-directedness of songs, we collected ratings of instrumental renditions of each song from individuals with different linguistic and cultural backgrounds.

### CORPUS ANALYSIS

### Materials and Methods

The corpus consisted of 269 songs originally set to English or French lyrics. Approximately half were children's songs (English: n = 68; French: n = 61); the others were folk songs primarily for adults (English: n = 72; French: n = 68). Songs were collected from anthologies of folk and children's songs and from Internet sources that provided musical notation and .mid files (see Appendix A for the sources of songs in the corpus). We excluded two songs that had the same tune and rhythm as other songs in the corpus but different lyrics. We designated three songs that appeared in both folk and children's anthologies as children's songs.

The nPVI provides a measure of durational contrast or rhythmic variability, as shown in the following equation (Grabe and Low, 2002; Patel et al., 2006):

$$\text{nPVI} = \frac{100}{m-1} \times \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{\frac{d_k + d_{k+1}}{2}} \right|$$

where m is the number of elements in a sequence and d<sub>k</sub> is the duration of the kth element. In the speech literature the kth element can be defined by any unit of interest (whether vocalic or consonantal; Ramus et al., 1999; Grabe and Low, 2002). In studies of music, however, the kth element is defined exclusively according to musical note duration, or the inter-onset interval between consecutive notes (Patel and Daniele, 2003; Patel et al., 2006). Thus, in line with prior research, the absolute difference was calculated between each pair of successive inter-onset intervals and normalized by the mean duration of the pair. Values of nPVI range from 0 to 200, with 200 reflecting maximum durational contrast. Note durations were entered directly into a spreadsheet from the sheet music or score, and these duration values were used to calculate nPVI values for each song. Because many songs contained multiple sub-phrases, we omitted values for note pairs that straddled phrase boundaries.
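The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the spreadsheet procedure actually used: the function names `pair_contrasts` and `npvi` are hypothetical, durations may be in any consistent unit (e.g., quarter-note multiples), and pooling all within-phrase pairs of a song into a single average is an assumption about how phrase boundaries were handled.

```python
from typing import Sequence


def pair_contrasts(durations: Sequence[float]) -> list[float]:
    """Normalized contrast for each pair of successive durations in one phrase."""
    return [
        abs(a - b) / ((a + b) / 2)  # absolute difference over the pair's mean
        for a, b in zip(durations, durations[1:])
    ]


def npvi(phrases: Sequence[Sequence[float]]) -> float:
    """nPVI over all within-phrase pairs; pairs spanning a phrase boundary are skipped."""
    contrasts = [c for phrase in phrases for c in pair_contrasts(phrase)]
    if not contrasts:
        raise ValueError("need at least one pair of successive durations")
    return 100 * sum(contrasts) / len(contrasts)


print(npvi([[1, 1, 1, 1]]))            # isochronous sequence: 0.0
print(npvi([[1, 3]]))                  # one long-short pair: 100.0
print(round(npvi([[2, 1, 2, 1]]), 2))  # alternating 2:1 pattern: 66.67
```

Passing each phrase as a separate list means that a pair such as the last note of one phrase and the first note of the next never enters the average, which mirrors the omission of note pairs that straddle phrase boundaries.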

The songs in the corpus were arranged in a variety of meters, including 2/2 (2 English, 0 French), 2/4 (26 English, 55 French), 3/4 (29 English, 23 French), 3/8 (1 English, 3 French), 6/8 (6 English, 20 French), and 4/4 (76 English, 20 French), with 2/4, 3/4, and 4/4 predominating in English and French songs. The corpus included songs composed in all 12 major keys and a few minor keys; C major (40 English, 13 French), F major (30 English, 24 French), and G major (35 English, 49 French) were the most common.

### Results and Discussion

On average, nPVI values were higher for English songs (M = 42.04, SME = 1.4) than for French songs (M = 36.96, SME = 1.5), which is consistent with prior studies of instrumental music by English and French composers (Patel and Daniele, 2003). We also found that nPVI values were lower for children's songs (M = 37.12, SME = 1.46) than for folk songs (M = 41.88, SME = 1.4), which is consistent with the observation that infant-directed singing has simple, repetitive rhythmic structures and greater temporal regularity than songs performed alone or for adults (Trainor et al., 1997; Longhi, 2009; Nakata and Trehub, 2011).

Although both song types exhibited a trend toward higher English than French nPVI, **Figure 1** suggests that this difference was larger for children's songs than for folk songs. Independent-samples t-tests confirmed that nPVI values were significantly higher for English than for French children's songs (English M = 40.43, SME = 1.9; French M = 33.8, SME = 2.26), t(127) = 2.256, p = 0.026. For folk songs the same trend was evident but not significant (English M = 43.6, SME = 1.8; French M = 40.1, SME = 2.13), t(138) = 1.27, p = 0.10.
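The reported t statistics and degrees of freedom can be approximately recovered from the summary statistics alone. The sketch below assumes that SME denotes the standard error of the mean and that the tests were pooled-variance (Student's) independent-samples t-tests; neither assumption is stated explicitly in the text, and the function name `pooled_t_from_summary` is hypothetical.

```python
import math


def pooled_t_from_summary(m1, se1, n1, m2, se2, n2):
    """Student's t (equal variances assumed) from group means, standard errors, and sizes."""
    var1 = (se1 ** 2) * n1  # SE = SD / sqrt(n), so variance = SE^2 * n
    var2 = (se2 ** 2) * n2
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, df


# Children's songs: English (n = 68) vs. French (n = 61)
t_children, df_children = pooled_t_from_summary(40.43, 1.9, 68, 33.8, 2.26, 61)
# Folk songs: English (n = 72) vs. French (n = 68)
t_folk, df_folk = pooled_t_from_summary(43.6, 1.8, 72, 40.1, 2.13, 68)

print(f"children's songs: t({df_children}) = {t_children:.2f}")
print(f"folk songs: t({df_folk}) = {t_folk:.2f}")
```

Under these assumptions the degrees of freedom come out as 127 and 138, and the t values are roughly 2.26 and 1.26, which agree with the reported statistics to within the rounding of the published means and standard errors.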

This suggests that while English songs in our corpus generally had higher rhythmic variability than French songs, the difference was exaggerated in children's songs. This finding could arise from a tendency to exaggerate native language rhythms in children's songs, at least for syllable-timed languages, either because song creators intuitively exaggerate the rhythms of child-directed lyrics or because of caregivers' intuitive tendency to select children's songs that preserve or exaggerate native-language rhythms. Note that English children's and folk songs did not differ in rhythmic contrast, much like the absence of rhythmic differences between child- and adult-directed English speech (Wang et al., 2015). Native language prosody may constrain child-directed songs such that rhythmic features of the child-directed register, like greater rhythmic regularity and simplicity, are exaggerated in children's songs only when such features preserve native-language prosody. This is consistent with our observation that French children's songs had the lowest nPVI, a trend that was reduced for English.

Our corpus analysis revealed rhythmic differences between children's songs and folk songs, but this interpretation rests on the assumption that songs were classified accurately based on their presence or absence in anthologies of children's music. In prior studies of child-directed speech or singing, the intended audience was clearly known to the speaker or singer (e.g., Fernald, 1985; Nakata and Trehub, 2011), but the situation is less clear for corpora of transcribed songs. Furthermore, some songs occurred in both anthologies of children's and folk songs. At times, caregivers sing pop songs and their own invented songs to infants (Trehub et al., 1997). It is therefore unclear whether native-language rhythm would influence the songs that caregivers choose to perform for children. To further examine this question, we collected ratings of adults from different cultures.

### RATINGS

English- and French-speaking adults listened to instrumental versions of the entire corpus from Study 1 and rated each song's familiarity, whether or not they liked it, and whether or not it was "for children." We expected listeners to be more familiar with and to better like songs from their own culture than from another culture, but the critical question was whether they would be more likely to classify songs as "for children" if they exemplified the rhythms of their native language.

### Materials and Methods

#### Participants

All procedures were approved by, and run in accordance with the guidelines and principles of, the internal review boards/ethics committees at the University of Nevada, Las Vegas and Lyon University. Listeners were recruited from the United States and France. Adults (50 female, 50 male) with self-reported normal hearing from the University of Nevada, Las Vegas participated for partial course credit (M = 20.04 years, SD = 3.41 years, Range: 18–45 years). The majority of American participants were monolingual native speakers of English (n = 62). The remaining participants were bilingual speakers of English and Spanish (n = 24), Italian (n = 2), Tagalog (n = 5), Polish (n = 1), Arabic (n = 3), Chamorro (n = 1), Korean (n = 1), and Amharic (n = 1). Bilingual speakers acquired English simultaneously with the other language (n = 7) or learned English later as a second language (n = 31). Because we were interested in the influences of nationality (country of residence) and native exposure to a syllable-timed language, we created a group of American bilingual participants who acquired a syllable-timed language from infancy (Spanish, Italian, or Tagalog, n = 30). The remaining American participants were considered English monolinguals (n = 70)<sup>1</sup>. The amount of formal training on a musical instrument ranged from 0 to 16 years (M = 2.57 years, SD = 3.8 years) for monolingual English speakers and from 0 to 12 years (M = 2.26 years, SD = 3.26 years) for bilinguals. Dance training ranged from 0 to 14 years (M = 1.37 years, SD = 3.03 years) for monolinguals and from 0 to 10 years (M = 0.91 years, SD = 1.94 years) for bilinguals.

Forty adults (26 female, 14 male) recruited in Lyon, France received token compensation for their participation (M = 26.40 years, SD = 9.63 years, Range: 19–60 years). All French participants had self-reported normal hearing and were native speakers of French, most of whom claimed to have intermediate to high levels of competency in English (n = 38). Formal instrumental training ranged from 0 to 17 years (M = 4.68 years, SD = 5.49 years). Except for one participant, who had 48 years of dance training, formal dance training ranged from 0 to 10 years (M = 1.56 years, SD = 2.5 years). Dance training did not differ across language groups, F(2, 139) = 1.67, p = 0.19, but formal music training (in years) did, F(2, 139) = 4.08, p = 0.02, with the French participants having significantly more formal music training than monolingual, p = 0.04, or bilingual Americans, p = 0.04.

<sup>1</sup> Because we were interested in the relationship between native-language rhythm and song ratings, we chose to combine monolingual English speakers with the eight bilingual participants who spoke stress-timed languages. Although these eight participants were bilingual, they were like monolingual English speakers in that they acquired a stress-timed language from birth. All reported analyses without these eight participants yielded the same results.

#### Stimuli

All 269 songs from the corpus analysis were presented to English- and French-speaking participants. To keep the test session to 1 h or less, we randomly divided the corpus into 5 lists of 54 songs each, ensuring that each list contained at least eight of each of the four song types (English or French children's or folk songs). As a result, no participant heard all songs in the corpus, but each song received ratings from at least 28 participants.
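One way to build such lists is sketched below. The exact randomization procedure is not described in the text, so this example substitutes a simple stratified deal: the songs of each category are shuffled and dealt across the lists in turn. The function name `make_lists`, the placeholder song IDs, and the seed are all hypothetical; only the category sizes come from the corpus description.

```python
import random
from collections import Counter

# Category sizes reported in the corpus description (269 songs in total).
CATEGORY_SIZES = {
    ("English", "children"): 68,
    ("French", "children"): 61,
    ("English", "folk"): 72,
    ("French", "folk"): 68,
}


def make_lists(category_sizes, n_lists=5, seed=0):
    """Deal shuffled songs of each category across lists in turn (hypothetical procedure)."""
    rng = random.Random(seed)
    lists = [[] for _ in range(n_lists)]
    position = 0  # running counter, so list lengths differ by at most one song
    for category, size in category_sizes.items():
        song_ids = list(range(size))  # placeholder IDs standing in for real songs
        rng.shuffle(song_ids)
        for song_id in song_ids:
            lists[position % n_lists].append((category, song_id))
            position += 1
    return lists


lists = make_lists(CATEGORY_SIZES)
for number, song_list in enumerate(lists, start=1):
    counts = Counter(category for category, _ in song_list)
    print(number, len(song_list), dict(counts))
```

Because 269 is not a multiple of 5, a deal of this kind yields four lists of 54 songs and one of 53, each with at least 12 songs of every type. If the 140 listeners (100 American, 40 French) were spread evenly over the five lists, each list would be heard by 28 listeners, which is consistent with the reported minimum number of ratings per song.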

Each song was presented as a simple instrumental (flute) melody without words. Using Logic Pro (see http://www.apple.com/logic-pro/), each song was entered directly from the notation into a MIDI sequencer and converted to AIFF format using the flute instrument. All songs were transposed to C major. Quarter-note durations were set to 600 ms (100 beats per minute, bpm) unless the musical notation specified a particular tempo. This resulted in comparable tempos for children's songs (M note duration = 484 ms, SME = 18.8) and folk songs (M = 488 ms, SME = 18.1), F(1, 265) = 0.027, p = 0.87. Overall, however, English songs were slower (M = 524 ms, SME = 18.06) than French songs (M = 448 ms, SME = 18.8), F(1, 265) = 8.39, p = 0.004, η<sub>p</sub><sup>2</sup> = 0.03. Song type had no effect on tempo and did not interact with language of origin.
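Two of the figures above can be checked by hand. The default tempo follows from dividing the number of milliseconds in a minute by the beat rate, and the effect size for language of origin can be recovered from the F statistic and its degrees of freedom using the standard identity for partial eta squared (assuming, as the notation suggests, that the reported value is a partial eta squared for that effect):

$$\text{quarter-note duration} = \frac{60{,}000\ \text{ms/min}}{100\ \text{beats/min}} = 600\ \text{ms}$$

$$\eta_p^2 = \frac{F \times df_{\text{effect}}}{F \times df_{\text{effect}} + df_{\text{error}}} = \frac{8.39 \times 1}{8.39 \times 1 + 265} \approx 0.03$$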

#### Procedure

Participants, who were tested individually, were presented with instructions and stimuli over headphones by means of PsyScope software (Cohen et al., 1993). On each trial, after hearing the entire instrumental rendition of a song, participants were asked to rate its familiarity ["How familiar is the song on a scale of 1 (very unfamiliar) to 7 (very familiar)?"], and, if familiar, to provide the song name. Participants were then asked whether it was a children's song ("Do you think this is a children's song?" Yes or No) and to rate the confidence of that judgment on a scale of 1 (very confident this is NOT a children's song) to 7 (very confident this IS a children's song). Finally, participants were asked to rate how much they liked the song ["How much do you like the song on a scale of 1 (dislike very much) to 7 (like very much)?"]. Participants responded to these queries for each of 54 songs over the course of 3 blocks of 18 trials.

Following the test session, participants completed a questionnaire (in English or French) about their linguistic/ethnic background, music training, dance training, and hearing status (normal or not). All procedures were reviewed and approved by the local institutional ethics committees in the United States and France, and informed written consent was obtained from all participants.

FIGURE 2 | Mean familiarity ratings (left) and liking ratings (right) of English and French songs by American monolingual, American bilingual, and French listeners, shown separately for folk and children's songs. Error bars represent standard error.

### Results and Discussion

#### Listener Effects on Mean Ratings

In the first analysis we averaged ratings of familiarity, preference, and song type across songs in each category (French children's, English children's, French folk songs, English folk songs) and compared the performance of individuals with contrasting language background and nationality.

### **Familiarity**

Each participant's mean familiarity rating was calculated for each song category. Simple correlations showed that years of dance training were unrelated to familiarity ratings for any song category, but there was a modest correlation between music training and familiarity for French children's songs only, r(138) = 0.18, p = 0.035. Music but not dance training was therefore included as a covariate in subsequent analyses.

Overall, songs were considered moderately familiar, with most means falling just below the mid-point of the 7-point rating scale (**Figure 2**). Familiarity ratings were submitted to a 2 × 2 × 3 (Song Type [children's, folk] × Language of Origin [French, English] × Language Group [monolingual Americans, bilingual Americans, French speakers]) mixed-design analysis of variance (ANOVA), with Music Training (in years) as a covariate. All main effects and interactions for this ANOVA are shown in **Table 1**. Overall, English songs were significantly more familiar than French songs (English M = 3.21, SME = 0.08; French M = 2.9, SME = 0.07), and children's songs were rated as significantly more familiar than folk songs (children's: M = 3.33, SME = 0.08; folk: M = 2.8, SME = 0.07) (**Figure 2**).

TABLE 1 | Main effects and interactions for all ANOVAs conducted with familiarity rating as dependent variable.

Overall ANOVA results are followed by ANOVAs run separately for each Song Type. \*p < 0.05. \*\*p < 0.01.

To ascertain whether listeners were more familiar with songs from their own culture or language, we examined the interactions for Song Type, Language of Origin, and Language Group for each song type (children's or folk) by means of separate 2 × 3 (Language of Origin [English, French] × Language Group [monolingual Americans, bilingual Americans, French]) mixed-design ANOVAs, with Music Training as a covariate.

For folk songs, there was a significant main effect of Language of Origin and a significant interaction between Language of Origin and Language Group (see **Table 1**). English folk songs were more familiar than French folk songs (English M = 3.2, SME = 0.08; French M = 2.37, SME = 0.07) for all three groups [monolingual Americans, t(69) = 14.2, p < 0.001; bilingual Americans, t(29) = 6.4, p < 0.001; French, t(39) = 2.4, p = 0.02], but particularly for Americans.

For children's songs there were significant main effects of Language of Origin and Language Group, and an interaction between Language of Origin and Language Group (see **Table 1**). Higher familiarity ratings were given to French (M = 3.44, SME = 0.09) than to English (M = 3.22, SME = 0.09) songs. French speakers gave higher overall ratings (M = 3.75, SME = 0.14) than monolingual Americans (M = 3.14, SME = 0.10) or bilingual Americans (M = 3.1, SME = 0.15), p < 0.001, and the two American groups did not differ, p = 0.81. The interaction in **Figure 2** indicates that French listeners were more familiar with French songs than with English songs, t(39) = −10.38, p < 0.001, and both American groups were more familiar with English songs than with French songs [monolingual Americans, t(69) = 9.9, p < 0.001; bilingual Americans, t(29) = 3.3, p = 0.003].

TABLE 2 | Main effects and interactions for all ANOVAs conducted with liking rating as dependent variable.

Overall ANOVA results are followed by ANOVAs run separately for each Song Type. \*p < 0.05. \*\*p < 0.01.

Relatively few participants provided specific names for songs they found familiar, but the percent of correct naming roughly paralleled the observed pattern of familiarity ratings. Monolingual Americans correctly named more English folk (8%) and children's songs (15%) than French folk (<1%) and children's songs (2.5%), as did bilingual Americans (4% English folk, 12% English children's, <1% French folk, 2% French children's). French listeners, by contrast, correctly named only 1% of English folk songs and 2% of English children's songs, but they correctly named 3% of French folk songs and 28% of French children's songs.

To summarize, familiarity ratings depended primarily on country of residence. Americans found English songs more familiar than French songs, even when their native language was syllable-timed (e.g., bilingual Americans). While all listeners found English folk songs to be more familiar than French folk songs, this difference was smallest for French listeners. In general, familiarity ratings reflected listeners' country of residence, but this pattern was particularly robust for children's songs, with American listeners giving higher ratings to English than French songs and French listeners doing the opposite. This result underscores the prominence of children's songs in everyday listening experience.

TABLE 3 | Main effects and interactions for all ANOVAs conducted with derived child-directedness rating as dependent variable.

Overall ANOVA results are followed by ANOVAs run separately for each Language of Origin.

\*p < 0.05. \*\*p < 0.01.

### **Liking**

Mean liking ratings were uncorrelated with music or dance training. Liking ratings were submitted to a 2 × 2 × 3 (Song Type [children's, folk] × Language of Origin [French, English] × Language Group [English-speaking Americans, bilingual Americans, French]) mixed-design ANOVA. All main effects and interactions are shown in **Table 2**. Overall, main effects of Language of Origin and Song Type revealed that English songs received significantly higher liking ratings than French songs (English M = 3.9, SME = 0.09; French M = 3.8, SME = 0.09), and children's songs received significantly higher liking ratings than folk songs (children's: M = 3.9, SME = 0.09; folk: M = 3.8, SME = 0.09) (**Table 2**). To examine significant interactions between Language of Origin and Language Group, and between Song Type, Language of Origin, and Language Group (see **Table 2**), we ran separate 2 × 3 (Language of Origin [English, French] × Language Group [American monolinguals, American bilinguals, French]) mixed-design ANOVAs for each song type.

For folk songs, there was a significant main effect of Language of Origin and a significant interaction between Language of Origin and Language Group (**Table 2**). English songs were liked more (M = 3.9, SME = 0.1) than French songs (M = 3.72, SME = 0.1). Monolingual Americans gave significantly higher liking ratings to English than to French songs, t(69) = 5.09, p < 0.001, whereas English and French songs were liked equally by bilingual Americans, t(29) = 0.5, p = 0.14, and French speakers, t(39) = 1.02, p = 0.32. Thus, only monolingual Americans preferred folk songs from their own language/culture (**Figure 2**).

For children's songs, there was a significant interaction between Language of Origin and Language Group. Monolingual Americans gave higher liking ratings to English songs than to French songs, t(69) = 5.7, p < 0.001, as did bilingual Americans, t(29) = 2.8, p = 0.009, but French speakers liked French songs more than English songs, t(39) = −3.86, p < 0.001. Thus, listeners gave higher liking ratings to children's songs whose language of origin matched their country of residence.

**Figure 2** suggests that liking ratings paralleled familiarity ratings, which is consistent with evidence that listeners prefer familiar music (Szpunar et al., 2004). However, there were also important differences. Although there were robust effects of nationality on familiarity ratings, there was considerably less variation across groups for liking ratings. This was particularly notable for bilingual Americans and French speakers who rated English folk songs as more familiar than French folk songs but nevertheless did not necessarily like English folk songs better than French folk songs. Thus liking ratings might only partially reflect familiarity.

### **Classification**

A measure of perceived "child-directedness" was derived by calculating for each participant the proportion of songs labeled "for children" in each of the four song categories. These values were uncorrelated with dance training, but they were positively correlated with music training for French children's songs only, r(140) = 0.18, p = 0.03. Therefore music training was included as a covariate in subsequent analyses.
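The derivation of this measure can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple layout of the response records and the function name `child_directedness` are hypothetical and are not taken from the study materials.

```python
from collections import defaultdict


def child_directedness(responses):
    """Proportion of songs labeled "for children", per participant and category.

    `responses` is an iterable of (participant_id, category, for_children)
    tuples. This record layout is a hypothetical one chosen for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [n_for_children, n_total]
    for participant, category, for_children in responses:
        tally = counts[(participant, category)]
        tally[0] += int(for_children)
        tally[1] += 1
    return {key: n_child / n_total for key, (n_child, n_total) in counts.items()}


# Made-up responses for one participant in two of the four song categories.
example = [
    ("p1", "English folk", True),
    ("p1", "English folk", False),
    ("p1", "French children's", True),
]
proportions = child_directedness(example)
```

Each resulting value lies between 0 and 1, so a participant contributes four proportions (one per song category) to the subsequent analyses.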

The derived child-directedness measure was submitted to a 2 × 2 × 3 (Song Type [children's, folk] × Language of Origin [French, English] × Language Group [monolingual Americans, bilingual Americans, French]) mixed-design ANOVA, with years of Music Training as a covariate. All main effects and interactions are shown in **Table 3**. We observed a main effect of Song Type, with overall higher proportions of "for children" classifications given to children's songs (M = 0.514, SME = 0.02) than to folk songs (M = 0.42, SME = 0.016; see **Table 3**). There were also significant two-way interactions between Song Type and Language of Origin, Song Type and Language Group, Language of Origin and Language Group, and among Song Type, Language of Origin, and Language Group (see **Table 3**). Our goal with the child-directedness measure was to examine whether listeners' classifications corresponded to the traditional classifications of children's songs vs. folk songs. For this analysis, we therefore ran separate 2 × 3 (Song Type [children's, folk] × Language Group [English-speaking Americans, bilingual Americans, French]) mixed-design ANOVAs, with Music Training as a covariate. **Figure 3** displays adults' ratings of children's and folk songs for each group, separately for each language of origin.

For English songs, we observed a significant main effect of Language Group (**Table 3**), with French listeners giving lower child-directedness ratings (M = 0.37, SME = 0.03) than monolingual Americans (M = 0.53, SME = 0.02), p < 0.001, or bilingual Americans (M = 0.47, SME = 0.03), p = 0.014, who did not differ from each other, p = 0.09. There was no indication, however, that any of the groups classified English children's songs as more child-directed than English folk songs.

For French songs, we observed significant main effects of Song Type and Language Group (**Table 3**). French children's songs received higher child-directedness ratings (M = 0.57, SME = 0.02) than French folk songs (M = 0.57, SME = 0.02). Moreover, higher child-directedness ratings were given by French speakers (M = 0.57, SME = 0.03) than by monolingual Americans (M = 0.41, SME = 0.02), p < 0.001, or bilingual Americans (M = 0.44, SME = 0.03), p = 0.004, who did not differ from each other, p = 0.39. We also observed a two-way interaction between Song Type and Language Group (**Table 3**), with Bonferroni-corrected post-hoc t-tests revealing that French listeners gave far higher ratings to French children's songs than did monolingual Americans, t(108) = −7.9, p < 0.001, or bilingual Americans, t(68) = −5.8, p < 0.001, and the latter two groups did not differ, t(98) = −0.77, p = 0.44. Despite these differences, all three groups accurately rated French children's songs as more child-directed than French folk songs [monolingual Americans, t(69) = 3.6, p < 0.001; bilingual Americans, t(29) = 2.4, p = 0.02; French, t(39) = 10.1, p < 0.001].

songs. Error bars represent standard error.

To summarize, listeners' likelihood of endorsing a song as "for children" was higher for children's songs than for folk songs, which validates the traditional classification of songs in the corpus. However, even though this trend was evident for English songs (**Figure 3**), the main effect of Song Type was driven by French songs, and French listeners showed the most robust differentiation of children's songs from folk songs. This finding is surprising because American listeners, regardless of language background, did not differentiate English children's songs from folk songs, despite greater familiarity with English songs. Instead, American listeners generally rated all English songs as more child-directed than French songs (**Figure 3**). Perhaps this is not surprising in light of the observation (**Figure 1**) that French folk and children's songs are better differentiated than English folk and children's songs.

#### Regression Analysis

The aforementioned results indicate that, on the whole, songs from children's anthologies were more likely to be classified as "for children," and listeners were more likely to like songs from their own culture. It is also clear, however, that children's songs were more familiar than folk songs, and songs from one's culture were more familiar than other songs. We therefore conducted multiple regression analyses, one for each language group, to determine the relative contributions of familiarity, preference, and rhythmic features (nPVI and tempo) in predicting the perceived child-directedness of each song.

TABLE 4 | Simple (r) and R<sup>2</sup> Change predicting child-directedness endorsements from variables rhythm (nPVI), mean note duration (Tempo), Preference, and Familiarity, separately for each listener group.

\*p < 0.05. \*\*p < 0.01. <sup>a</sup> p = 0.053.

For each listener group, four variables indicative of rhythmic variability (nPVI), tempo (mean duration of each note in ms), familiarity (mean familiarity rating), and preference (mean liking ratings) were regressed onto the averaged response for each group (tendency to classify a song as "for children") for each of the 269 songs. **Table 4** presents simple correlations for each variable, separately for each group. Multiple regressions were conducted to determine how the removal of specific variables affected the predictiveness of the model. Thus, R<sup>2</sup> Change for a given variable indicates the amount by which the predictive strength of the model containing all four variables decreases when that variable is removed from the regression, reflecting the unique contribution of that variable (Darlington, 1990).
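The drop-one logic behind R<sup>2</sup> Change can be illustrated with a minimal Python sketch. It assumes an ordinary least-squares fit with an intercept and non-collinear predictors; the function names and the synthetic data are hypothetical and do not reproduce the study's data or software.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(rows, y):
    """R^2 of a least-squares fit of y on the predictor rows plus an intercept."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
                 for d, yi in zip(design, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def r_squared_change(rows, y, names):
    """Full-model R^2 and the drop in R^2 when each predictor is removed in turn."""
    full = r_squared(rows, y)
    change = {}
    for j, name in enumerate(names):
        reduced = [[v for i, v in enumerate(r) if i != j] for r in rows]
        change[name] = full - r_squared(reduced, y)
    return full, change


# Synthetic illustration only: 269 "songs" with four made-up predictors.
random.seed(0)
names = ["nPVI", "tempo", "familiarity", "liking"]
rows = [[random.gauss(0, 1) for _ in names] for _ in range(269)]
y = [0.8 * r[2] - 0.3 * r[1] + random.gauss(0, 0.5) for r in rows]
full, change = r_squared_change(rows, y, names)
```

Because each reduced model is nested in the full model, every R<sup>2</sup> Change value is non-negative (up to rounding), and a value near zero indicates that the variable adds little beyond the other three predictors.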

The four-variable models yielded moderate prediction levels for all three groups [monolingual Americans, R<sup>2</sup>(4, 264) = 0.59, p < 0.001; bilingual Americans, R<sup>2</sup>(4, 264) = 0.38, p < 0.001; French, R<sup>2</sup>(4, 264) = 0.41, p < 0.001]. As shown in **Table 4**, familiarity and liking were positively correlated with child-directedness, suggesting that listeners had a strong tendency to classify songs that were more familiar and that they liked as "for children." Of the two measures, only familiarity contributed uniquely and robustly to the models for all three groups (**Table 4**). This suggests that while liking ratings correlated with child-directedness ratings, liking did not predict child-directedness after controlling for familiarity. Tempo correlated negatively with responses and contributed uniquely to the model for all three groups. Faster songs were rated as more child-directed, a tendency that was somewhat stronger for American than for French listeners. Critically, nPVI did not correlate with the child-directedness ratings of monolingual Americans, but it correlated with the responses of bilingual American and French listeners, such that for these groups lower nPVI was associated with children's songs. Even after controlling for familiarity, nPVI contributed uniquely to the model for French listeners and marginally for American bilingual listeners (**Table 4**). This suggests that for speakers of a syllable-timed language, rhythmic features predict the perceived appropriateness of a song for children.

### GENERAL DISCUSSION

The present study is the first to demonstrate that the observed rhythmic parallels between the music and language of different cultures are not only preserved in music for children (children's songs) but also exaggerated relative to a similar genre of music (folk songs). By complementing our corpus analysis with listener ratings, we show that rhythmic differences in our corpus may reflect culture-specific intuitions about the role of rhythm in children's music. Our findings suggest that, when considering a repertoire of songs to perform for or with children, French-, and to some extent, Spanish-speaking listeners are more likely to select a song with lower rhythmic contrast, which parallels and enhances the rhythmic features of their language. By contrast, English-speaking listeners generally endorse English songs regardless of rhythm, which is consistent with the properties of English children's songs and folk songs but results in song choices that maintain the higher rhythmic contrast typical of English.

If children's music reinforces linguistically or culturally relevant information by exaggerating language-specific speech rhythm (Kandhadai et al., 2014), one might expect English children's songs to exhibit greater exaggeration (higher contrast) than English "adult-directed" folk songs, and French children's songs to exhibit greater exaggeration (lower contrast) than French folk songs. Our results are consistent with this prediction for French songs but not for English songs, which had comparable rhythmic contrast for both song types. In child-directedness ratings, moreover, French listeners robustly differentiated French children's songs from folk songs, while English speakers did not do so for English songs. Instead, English speakers, like French and Spanish speakers, classified French children's songs as child-directed even though those songs had lower rhythmic contrast (unlike English). The regression analysis indicated, however, that these decisions were driven by song familiarity and tempo rather than rhythm. In other words, English listeners did not use rhythm in their ratings of child-directedness, perhaps because of their exposure to music that is rhythmically undifferentiated across child and folk song categories.

Why is exaggeration of language-typical rhythmic patterns absent in English songs but present in French songs? This situation could represent a trade-off between the caregiving and didactic functions of child-directed input, with increased rhythmic regularity in child-directed vocalizations being universal (Fernald et al., 1989; Trainor et al., 1997) and native-language speech rhythm varying by culture (Payne et al., 2009; Wang et al., 2015). For French input, greater regularity is consistent with both caregiving and didactic functions, whereas for English these two functions are at odds. Perhaps English children's songs would be more rhythmically regular if reduced rhythmic contrast did not undermine native language rhythm.

Cultural differences in caregiving styles may provide yet another explanation for why English children's songs are not more rhythmically regular. North American caregivers engage in more stimulating and playful interaction with infants than do caregivers from other cultures, who are more likely to soothe infants and lull them to sleep (Trehub and Trainor, 1998). While rhythmic contrast in infant-directed speech varies by communicative intention (e.g., affection, disapproval or questions; Salselas and Herrera, 2011), a systematic analysis of rhythmic contrast in play songs and lullabies would shed light on this issue.

The bilingual, English-speaking Americans were expected to disentangle the influence of native language from country of residence because all of them acquired a syllable-timed language from birth yet lived in the United States and presumably had continuous exposure to American music. Indeed, familiarity ratings suggest that this group was very similar to monolingual Americans in their exposure to folk and children's songs. By contrast, their preference ratings were only partially consistent with monolingual Americans, and the regression analysis suggested that bilingual Americans were more likely to use nPVI when endorsing child-directedness in songs, although this result did not reach conventional levels of significance. Further research is needed to ascertain whether exposure to multiple languages and cultures influences the perception and use of rhythm in linguistic and musical interactions with children.

The present work has several limitations. Practical considerations resulted in unequal sample sizes across groups, notably the small group of bilingual Americans, which may have affected some outcomes by reducing power. Similarly, although our corpus size was comparable to or larger than that in several related studies of music (McGowan and Levitt, 2011; Salselas and Herrera, 2011; Temperley and Temperley, 2011), it was much smaller than that in other studies of music (Huron and Ollen, 2003; Patel and Daniele, 2003) and speech (Payne et al., 2009; Wang et al., 2015). The present study was also limited to French and English materials and to the coarse nPVI measure that may not capture the nuanced rhythmic features that differentiate many other languages (Cutler, 2012). Because of our exclusive reliance on musical notation, our nPVI values may differ in important ways from expressively sung performances, which might further enhance speech rhythms in music (Palmer, 1997). Furthermore, in view of the fact that multiple variability measures have been used in the speech literature (based on vocalic or consonantal durations, for example), it may be worthwhile to consider other units of musical time, for example, using both note duration and inter-onset interval, particularly in performed music. A future goal is to expand the corpora of child-directed music to expressively performed songs in a wider range of cultures.
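To make the point about units of musical time concrete, the following Python sketch computes nPVI from any sequence of positive durations, so the same function could be applied to notated note durations or to inter-onset intervals. It assumes the standard definition of the normalized Pairwise Variability Index used in the speech and music literature; the function name and the example values are illustrative only.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index of a sequence of positive durations.

    Averages, over successive pairs, the absolute difference between the two
    durations divided by their mean, then scales the result by 100.
    """
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two durations")
    pairs = zip(durations, durations[1:])
    total = sum(abs(a - b) / ((a + b) / 2) for a, b in pairs)
    return 100 * total / (len(durations) - 1)


# Made-up examples: an even rhythm versus an alternating long-short rhythm.
even = npvi([1, 1, 1, 1])
long_short = npvi([2, 1, 2, 1])
```

An isochronous sequence yields an nPVI of zero, and greater durational contrast between neighboring events yields higher values, which is why lower nPVI is associated with syllable-timed languages such as French.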

The rating study was also limited by training differences across groups. French participants had more music training than American participants, and training was positively correlated with some measures (familiarity and child-directedness ratings of French songs). Music training was included as a covariate whenever it was correlated with any measures, and there were no interactions with music training. Nevertheless, musicians may be more sensitive to rhythmic features that distinguish folk songs from children's songs, leading French listeners to outperform American listeners regardless of native language and country of residence. The performance of bilingual Americans casts doubt on this explanation because bilingual Americans' child-directedness endorsements were driven by nPVI, like those of French listeners, despite having less music training. Although it is desirable to balance music training across groups, imbalances in music training are often central to the cultures under consideration.

Overall, the present findings provide new insights into the role of rhythm in music development by indicating that rhythmic features of the native language not only appear in children's music from that culture but are enhanced in such music. Because rhythm is accessible from birth (Winkler et al., 2009) and drives early listening preferences (Nazzi et al., 1998; Soley and Hannon, 2010), the presence of native-language rhythm in musical input may have important implications for learning in music and language domains. In one example of generalization from music to speech processing, 9-month-old infants who participated in a 4-week intervention involving movement to music with triple meter exhibited enhanced neural processing of temporal structure in speech and music relative to infants who participated in a play intervention without music (Zhao and Kuhl, 2016). Incidental exposure to language input in verse or song may fine-tune temporal attention and enhance memory, providing a particularly effective scaffold for young children's learning (Lebedeva and Kuhl, 2010; de Diego-Balaguer et al., 2016; Kiraly et al., 2016). In sum, rhythmic input affects enculturation and cultural transmission by ensuring that young children are exposed to the communication features of their social and cultural group.

### AUTHOR CONTRIBUTIONS

EH conceived and carried out corpus analysis, EH and ST designed behavioral study, EH supervised and trained research assistants who created stimuli, experiment program, and ran the experiment in the USA, YL collected and supervised research in France, EH, YL, and KN analyzed data, EH, YL, KN, and ST wrote the paper.

### ACKNOWLEDGMENTS

This research was funded by grants from the National Science Foundation BCS-1052718 awarded to EH and Natural Sciences and Engineering Research Council of Canada to ST.

### REFERENCES


Karpeles, M., (ed.) (1956). Folk Songs of Europe. London: Novello and Co. Limited.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Hannon, Lévêque, Nave and Trehub. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

## APPENDIX A. CORPUS SONG SOURCES.

#### **English Song Sources**

Davison, A. T., and Surette, T. W. (1922b). 140 Folk-Tunes: Rote Songs: Grades I, II and III for School and Home. Boston, MA: E.C. Schirmer Music Co.

Leonard, H. (1998). The Really Big Book of Children's Songs. Milwaukee, WI: H. Leonard Corp.

Leonard, H. (1999). The Mighty Big Book of Children's Songs. Milwaukee, WI: H. Leonard Corp.

Radcliffe-Whitehead, J. B. (1903). Folk-Songs and other Songs for Children. Boston, MA: Oliver Ditson Company.

Seeger, R. C. (1948). American Folk Songs for Children. New York, NY: Doubleday & Co.

Haywood, C. (1966). Folk Songs of the World, Gathered from more than 100 Countries. New York, NY: J. Day Co.

Lomax, J. A., Lomax, A., and Kittredge, G. L. (1994). American Ballads and Folk Songs. New York, NY: Dover.

Lomax, J. A., Lomax, A., Seeger, R. C., Thompson, H. W., and Tick, J. (2000). Our Singing Country: Folk Songs and Ballads. Mineola, NY: Dover.

Moore, E., and Moore, C. O. (1964). Ballads and Folk Songs of the Southwest: More Than 600 Titles, Melodies, and Texts Collected in Oklahoma. Norman, OK: University of Oklahoma Press.

Trehune, P. (2006). Guitar Masters Quality Publications [sheet music in .PDF file format]. Guitar-Masters ©. Available online at: http://guitar-primer.com/Folk/ (Accessed November 28, 2006).

### **French Song Sources**

Davison, A. T., and Surette, T. W. (1922a). 140 Folk Tunes for School and Home. Boston, MA: E.C. Schirmer Music Co.

Momes.net (trans. Kids.net). Comptines (trans: Rhymes) [sheet music in .PDF file type]. Available online at: http://www.momes.net/comptines/ (Accessed October 20, 2006).

Bujeaud, J. (1980). Chants et Chansons Populaires des Provinces de l'ouest. (Trans. Popular chants and songs from the provinces of L'ouest). Marseille: Laffitte Reprints.

Byrd, J. (1903). Folk Songs and Other Songs for Children. Boston, MA: Radcliffe-Whitehead.

Fassio, A. (1932). French Folk Songs: Nursery Rhymes and Children Rounds. New York, NY: Marks Music Corporation.

Gerhard, R. (2002). Six Chansons Populaires Francaises: Arrangement Pour Chant et Piano. (Trans. Six French Folksongs: arranged for voice and piano). London: Boosey & Hawkes.

Karpeles, M., (Ed.) (1956). Folk Songs of Europe. London: Novello and Co. Limited.

Loffet, B. (2006). Plein de tablatures gratuites pour accordéon diatonique (Trans. Full free tablatures for diatonic accordion) [Sheet music in .PDF file type]. Available online at: http://diato.org/tablat.htm (Accessed December 5, 2006).

Macaulay, W. (2004). Five Line Skink version 1.2a4 [ABC reader program for Mac OSX].

Piron, S. (2007). 655 airs trad. de France (Trans. 655 Traditional Tunes from France) [sheet music in ABC notation]. Available online at: http://www.tradfrance.com/matf01.txt (accessed January 27, 2007).

Poire, H. (1962). Mon Premier Livre de Chansons (Trans. My first song book). Paris: Larousse.

# Beat Perception and Sociability: Evidence from Williams Syndrome

Miriam D. Lense<sup>1,2,3</sup>\* and Elisabeth M. Dykens<sup>2</sup>

<sup>1</sup> Marcus Autism Center, Children's Healthcare of Atlanta, Emory University, Atlanta, GA, USA, <sup>2</sup> Vanderbilt Kennedy Center, Vanderbilt University Medical Center, Nashville, TN, USA, <sup>3</sup> Program for Music, Mind and Society, Department of Otolaryngology, Vanderbilt University Medical Center, Nashville, TN, USA

Beat perception in music has been proposed to be a human universal that may have its origins in adaptive processes involving temporal entrainment such as social communication and interaction. We examined beat perception skills in individuals with Williams syndrome (WS), a genetic, neurodevelopmental disorder. Musical interest and hypersociability are two prominent aspects of the WS phenotype although actual musical and social skills are variable. On a group level, beat and meter perception skills were poorer in WS than in age-matched peers though there was significant individual variability. Cognitive ability, sound processing style, and musical training predicted beat and meter perception performance in WS. Moreover, we found significant relationships between beat and meter perception and adaptive communication and socialization skills in WS. Results have implications for understanding the role of predictive timing in both music and social interactions in the general population, and suggest music as a promising avenue for addressing social communication difficulties in WS.

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Jeanette Tamplin, University of Melbourne, Australia Eveline Geiser, Centre Hospitalier Universitaire Vaudois, Switzerland

#### \*Correspondence:

Miriam D. Lense miriam.lense@vanderbilt.edu

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology

Received: 06 March 2016 Accepted: 30 May 2016 Published: 20 June 2016

#### Citation:

Lense MD and Dykens EM (2016) Beat Perception and Sociability: Evidence from Williams Syndrome. Front. Psychol. 7:886. doi: 10.3389/fpsyg.2016.00886

Keywords: Williams syndrome, beat, rhythm, social, communication, meter

"When you play the guitar, do you use a metronome or do you use your heart to keep the beat?" -question posed to a songwriter by an individual with Williams syndrome attending a music camp

### INTRODUCTION

Williams syndrome (WS) is a genetic, neurodevelopmental disorder caused by the deletion of ∼28 genes on chromosome 7 (Ewart et al., 1993). WS is associated with a unique cognitive behavioral profile including mild to moderate cognitive impairment, greater verbal than spatial abilities, anxiety, and hypersociability (see Martens et al., 2008 for a review). Additionally, people with WS have pronounced auditory sensitivities and increased emotional responsiveness to music, even as they vary greatly in their musical skills (Lense et al., 2013). A better understanding of the musical profile in WS may help determine how their musical interests and abilities fit with other aspects of the WS phenotype, as well as lead to insights into gene-brain-behavior relationships involved in musical engagement. One area of particular interest is skills related to rhythm and timing, which are crucial in both music and social communication and interaction (e.g., Patel, 2008).

Many aspects of timing are incorporated into music including tempo, beat, rhythmic patterns, meter, and temporal variability or expressive timing (Honing, 2013). Rhythm refers to the pattern of durations of the musical notes. Rhythm is often perceived within the framework of a musical beat, i.e., a regular pulse marking equally spaced time intervals (Large and Palmer, 2002). The structure provided by a regular, predictable beat enhances rhythm discrimination and rhythm production abilities (e.g., Essens, 1986; Patel et al., 2005; Grahn and Brett, 2009). Additionally, the beat can be organized into different hierarchical levels of strong and weak beats, which make up the musical meter (for example the strong-weak beat pattern of a march versus the strong-weak-weak beat pattern of a waltz) and further enhance beat and rhythm perception and production (e.g., Grahn and Brett, 2007; Grube and Griffiths, 2009). Tempo refers to the pace or rate of the music, i.e., the speed of the musical beats. Temporal variability refers to the timing differences expressed in a musical performance, which might be perceived as a more mechanical versus expressive production of the music.

Most studies of musical timing in WS have focused on perception of rhythmic patterns, with variability evident both within and across studies. For example, several studies have used the Gordon Primary Measures of Music Audiation (PMMA; Gordon, 1986) to examine rhythmic abilities. In this task, participants make same/difference judgments regarding the rhythm of two consecutive musical phrases (i.e., rhythm discrimination judgments). One study reported that adolescents with WS performed worse than chronological age-matched typically developing (TD) peers on this task (Hopyan et al., 2001); indeed, the performance of the WS group did not differ from chance. Using the same rhythm task, Don et al. (1999) reported that children and adolescents with WS performed at a level generally consistent with their receptive vocabulary skills (used to estimate verbal mental age), suggesting that musical and simple language skills may both be areas of relative strength in WS. At the same time, individuals with WS demonstrated worse rhythm perception skills than TD individuals of a similar verbal mental age (who had fewer years of musical exposure). In contrast, Levitin (2005) reported that their participants with WS performed equivalently to highly trained TD music students on these rhythm discrimination items, though formal analyses were not presented. Finally, Martens et al. (2010) also reported poorer performance on a similar rhythm discrimination task in WS compared with same-aged TD peers. Variable findings across studies may relate to differences in age of participants (children versus adults), type of control group (mental age versus chronological age matched, who also have different years of musical exposure and training), and recruitment site (music camp versus clinic). Even so, this work suggests that rhythm discrimination skills are not necessarily preserved in WS though may be commensurate with language skills.

However, as noted by Levitin et al. (2004), results may also have been affected by task order: participants in Don et al. (1999) and Hopyan et al. (2001) completed the rhythm task after a melody task and attention difficulties in WS may have led to decreased performance as the assessments progressed. Levitin et al. (2004) also suggest that small manufacturing defects in the PMMA test stimuli might have disproportionately affected the participants with WS given their auditory hypersensitivities. Finally, these same/difference tasks require holding the original musical phrase in memory to compare with the second phrase. None of the studies controlled for auditory working memory, despite findings that individuals with WS may show poorer working memory than expected based on receptive language skills (Don et al., 1999; Rhodes et al., 2011).

There is also limited data on musical beat and rhythm production skills in WS. In a sample of 25 children and adults with WS, Martens et al. (2010) reported that participants performed as well as chronological age-matched TD controls on a task that required them to clap in time to the beat of musical passages, but performed worse than TD controls at reproducing rhythmic patterns by clapping or singing them. Levitin and Bellugi (1998) found that eight individuals with WS were able to clap back rhythmic patterns based on the PMMA as well as younger TD children (who were not strictly matched on mental age or musical training). Additionally, the errors made by the WS group tended to be musically compatible (i.e., fit within the overarching beat and metrical structure of the phrase). Thus, there is some evidence that individuals with WS show strengths in beat-based production tasks, while findings for specific rhythmic pattern reproduction are unclear. Differences in task designs and task order, perceived musical nature of the stimuli, and testing set-up may also contribute to these differences. For example, in the Levitin and Bellugi (1998) study, participants with WS clapped rhythms in response to those clapped by another person while the stimuli in Martens et al. (2010) were prerecorded. Given the hypersociability in WS, a social context could lead to differential performance results. Indeed, Levitin et al. (2004) reported improved performance in individuals with WS on a modified PMMA task when stimuli were presented in person rather than a recording. Assessing rhythm in a more naturalistic musical context may also impact findings as it may be more engaging and better maintain participants' attention and interest.

Overall, previous studies examining aspects of rhythm and timing in WS present a somewhat mixed set of results. Moreover, these studies have been limited by small sample sizes (8–20 individuals with WS) and task demands (for example, working memory load, attention requirements, potential stimuli defects in the PMMA). Studies have also primarily focused on the rhythmic aspect of music with limited attention given to other crucial aspects of musical timing – beat and meter – that provide a framework for structuring musical rhythms. The clapping tasks used by Levitin and Bellugi (1998) and Martens et al. (2010) suggest that individuals with WS are able to execute motoric responses in line with the musical beat. Therefore, studies that explicitly examine meter and beat perception in WS are needed to better understand the musical profile in WS. Additionally, previous studies have not fully examined individual differences in these abilities despite the variability both within and across studies.

Beat and meter perception are particularly relevant for WS when considering the WS social profile. A hallmark of WS is hypersociability, including increased empathy and motivation to interact with others (Jarvinen and Bellugi, 2013; Thurman and Fisher, 2015). Intriguingly, beat perception in music creates a salient signal for perceptual and motor entrainment (e.g., Large and Jones, 1999; Patel et al., 2005; Fujioka et al., 2012). These types of synchronized behaviors are linked with social bonding and prosocial behaviors from infancy through adulthood in TD populations. For example, adults who engage in synchronized musical activities show increased cooperation even at their own expense (Anshel and Kippler, 1988; Wiltermuth and Heath, 2009), and children are more helpful to each other following musical versus non-musical games (Kirschner and Tomasello, 2010). Infants are more helpful to an experimenter who has synchronously bounced with them to the beat of music versus an experimenter who bounced asynchronously (Cirelli et al., 2014). The ability of music to engender synchronous activities and increase social cohesion has been proposed as an adaptive value of music and may explain its ubiquity in social situations such as maternal-infant interaction, religious ceremonies, sports events, and military activities (Trevarthen et al., 1999–2000; Cross, 2003; Dissanayake, 2008; Wiltermuth and Heath, 2009). Additionally, numerous studies have documented behavioral and neural overlap in beat and rhythm processing in music and language in typical populations (e.g., Tierney and Kraus, 2013; Magne et al., 2016) and atypical populations with reading and/or language impairments (e.g., Corriveau and Goswami, 2009; Cumming et al., 2015). Indeed, beyond the potential social adaptive values of perceiving a beat in music, Patel (Patel, 2006; Patel et al., 2009) has hypothesized that humans' abilities to perceive and synchronize to a beat may be due in part to our status as vocal learners, perhaps in combination with factors such as being able to engage in non-vocal movement imitation and living in complex social groups.

Though WS is characterized by increased social motivation and relatively stronger simple verbal skills, recent research implicates an uneven linguistic and social profile in WS (Mervis and Becerra, 2007). Of particular relevance for beat and rhythm processing, onset of language is delayed in WS, which appears to be related to motor delays (Masataka, 2001). Additionally, similar to those with language impairments, individuals with WS show difficulties using stress patterns for word meaning (Plesa Skwerer et al., 2007). Moreover, difficulties in social communication and interpersonal interactions are common in WS, including impairments in initiating and maintaining appropriate conversations (Laws and Bishop, 2004; Stojanovik, 2006). These impairments lead to difficulties forming and maintaining relationships with peers (see Thurman and Fisher, 2015 for a review).

Thus, the vast individual variability in musical and social behaviors in WS provides an opportunity to examine relationships between musical and social communication behaviors. This may provide a novel window into understanding the role of timing in both music (e.g., beat, meter processing) and social communication (e.g., stress prosody; back-and-forth rhythm of a conversation; anticipating social cues). Previous research has indicated increased links between musical and social emotions in WS compared to TD populations (Ng et al., 2013; Lense et al., 2014b; Pridmore et al., 2014) and musical engagement is sometimes used as a vehicle for social engagement in WS (for example, as described in Levitin and Bellugi, 1998; Levitin et al., 2004). However, direct links between music and social behaviors within the realm of beat perception are unexplored.

In this two-part study, we first examined musical beat and meter perception in a large sample of individuals with WS, and identified how these perceptions related to cognitive abilities, musical training, and musical exposure. We also examined how individual differences in auditory processing style (i.e., reliance on the fundamental frequency versus harmonic overtones) predicted beat and rhythm skills. We specifically chose well-validated tasks that did not have a working memory requirement (i.e., tasks that did not rely on same/difference judgments) and examined beat and meter perception in the context of actual musical examples rather than isolated rhythmic excerpts. Additionally, we examined how differences in the stimuli (for example, tempo, genre, beat variability) predicted individuals' performance with the stimuli. Thus, we were able to examine both how characteristics of participants and characteristics of stimuli impacted the individual differences in beat perception in WS. We hypothesized that individuals with WS versus TD controls would show poorer beat perception abilities but significant individual variability, and would also benefit more from music that had a consistent beat and that was of a familiar genre.

In the second study (which included a subset of individuals from Study 1, as well as new participants), we conducted exploratory analyses on the relationship between musical beat, meter perception and adaptive social communication skills in WS. Adaptive skills are behaviors that people routinely perform to meet the personal and social demands of daily life. We hypothesized that greater beat and meter perception skills would be associated with improved adaptive communication and social skills, consistent with beat perception scaffolding entrainment of social engagement and interaction.

### STUDY 1: INDIVIDUAL DIFFERENCES IN BEAT AND METER PERCEPTION

### Methods

### Participants

Participants included 74 children and adults (mean age: 26.4 ± 9.6, 50.0% male) with genetically confirmed diagnoses of WS who were recruited from a residential summer camp or national convention. Both children and adults were recruited as we aimed to identify if age accounted for individual differences in beat perception. Due to exclusion of invalid data or changes in study protocol over the multi-year study period, some participants completed both meter and beat perception tests (n = 57) while others completed only one of the tests (total completing each test: Beat = 59; Meter = 72).

The beat and meter tests were also administered to a comparison group of 53 TD participants (mean age: 24.3 ± 9.4, 48.1% male); 35 of them completed both the beat and meter tests while an additional 18 completed only the meter test. As shown in **Table 1**, WS and TD participants were well-matched on age, sex, types of musical training, cumulative years of individual lessons, percent with percussion/piano training, and time spent currently listening to music. On average, however, the WS group spent more time currently playing music. As expected, TD participants had significantly higher IQ scores than WS participants.

The University Institutional Review Board approved the study, and written, informed consent was obtained from TD adult participants, and from the parents/guardians of participants with WS and TD minor participants. Participants with WS and TD minors provided verbal and written assent.

#### TABLE 1 | Demographic information for Study 1.

### Measures and Procedures

### **Musical Background**

Typically developing participants and parents of participants with WS completed a Musical Background Questionnaire (Lense and Dykens, 2012; Lense et al., 2013). Consistent with previous research (Lense et al., 2013, 2014a), musical training was quantified both as the number of types of formal music lessons received (including individual and group lessons both within school or as an extra-curricular activity, as well as ensemble participation) and as the cumulative duration of individual extracurricular music lessons. The former appears to better reflect musical training experiences in WS while the latter is a standard metric used in TD studies.

### Behavioral Assessments

### **Cognitive Assessment**

Participants were individually administered the Kaufman Brief Intelligence Test, 2nd edition (KBIT-2; Kaufman and Kaufman, 2004), which provides verbal, non-verbal, and full-scale IQ scores. The full-scale IQ was used as an index of cognitive abilities.

### **Beat Alignment Test (BAT)**

A 16-item subset of the Beat Alignment Test, version 2 (BAT; Iversen and Patel, 2008) was used to assess beat perception. Participants listened to music from different genres (rock, jazz, or orchestral pops) with a superimposed track of beeps that either aligned or misaligned with the musical beat. On misaligned tracks, beeps were phase shifted by 30% ahead of or behind the beat. Participants were given demonstration items with feedback prior to the test, and then indicated whether the beeps matched the beat of the music. Although standard administration asks that participants stay still while completing the BAT, many individuals with WS were unable to inhibit their motoric response (e.g., rocking, head shaking, or tapping hand or foot). BAT scores were converted to d' to control for response biases (i.e., a tendency to answer that beeps matched versus did not match the beat). A metric used in signal detection theory (Green and Swets, 1966), d' is computed as the z-standardized hit rate (correctly responding that beeps matched the beat) minus the z-standardized false alarm rate (incorrectly responding that beeps matched the beat when they did not match). In this way, d' provides a measure of perceptual sensitivity that considers both accuracy and response bias. The BAT stimuli were presented at ∼68 dB from either two speakers approximately 40 cm in front of the participant or from one speaker ∼60 cm above their head. A subset of 14 participants with WS completed the BAT a second time after a 1–3-year interval, with a test–retest reliability of ICC = 0.71.
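The d' computation described above can be illustrated with a minimal sketch. The function name, the example counts, and the split of items into aligned and misaligned trials are hypothetical. The log-linear correction for hit or false alarm rates of 0 or 1 is an assumption, as the text does not state how such extreme rates were handled.

```python
from statistics import NormalDist


def d_prime(hits, n_signal, false_alarms, n_noise, correct_extremes=True):
    """Return d' = z(hit rate) - z(false alarm rate)."""
    if correct_extremes:
        # Log-linear correction (an assumption, not taken from the source):
        # keeps rates away from 0 and 1, where the z-transform is undefined.
        hit_rate = (hits + 0.5) / (n_signal + 1)
        fa_rate = (false_alarms + 0.5) / (n_noise + 1)
    else:
        hit_rate = hits / n_signal
        fa_rate = false_alarms / n_noise
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical listener: 7 of 8 aligned items judged "matched" (hits),
# 2 of 8 misaligned items judged "matched" (false alarms).
sensitivity = d_prime(7, 8, 2, 8)

# A listener whose hit rate equals the false alarm rate has d' = 0,
# regardless of how often they answer "matched".
no_sensitivity = d_prime(4, 8, 4, 8)
```

Because the hit and false alarm rates are z-transformed separately, a listener who answers "matched" on most trials is not credited for the hits that this bias produces.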

### **Montreal Battery of Evaluation of Amusia Meter subtest (MBEA-m)**

The meter subtest is one of six subtests of the Montreal Battery of Evaluation of Amusia (MBEA; Ayotte et al., 2002; Peretz et al., 2003), a widely used measure for assessing musical perception skills. Participants were presented with 30 two-phrase harmonic melodies in a piano timbre, with chords emphasizing either duple or triple meter. Following each melody, participants determined if the melody was "in 2" (march) or "in 3" (waltz). Participants were given practice examples with feedback prior to the test items and shown visuals of march/in 2 and waltz/in 3. In contrast to the BAT, participants were encouraged to clap, tap, sing, or otherwise move to feel the beat. (During example items, the experimenter demonstrated moving to a march versus a waltz to encourage the use of different movements in determining the meter.) Participants' total scores (out of 30) were used as a measure of meter perception. MBEA-m stimuli were presented at 68 dB from two speakers approximately 40 cm in front of the participant. A subsample of 21 individuals with WS completed the MBEA-m a second time following a 1–3-year delay. Test–retest reliability was excellent (ICC = 0.820) based on Fleiss (1986) guidelines.

### **Sound Perception**

To assess the sound processing style of participants with WS, the 12-item version of the spectral-fundamental processing test (SFP; Schneider et al., 2005; Wengenroth et al., 2010) was administered. The SFP characterizes an individual's dominant auditory processing style along a continuum from primary spectral processing (perceive sound by decomposing it into its harmonics) to primary fundamental processing (perceive sound based on the fundamental frequency). On the SFP, participants heard pairs of 500-ms tones repeated twice. The tones varied in number, height and averaged frequency of their harmonics (see Wengenroth et al., 2010 for a full description). Participants reported if the second tone in the pair was higher or lower than the first. For each tone pair, the perceived direction of pitch change reflected either spectral or fundamental processing. An SFP index was computed [(number of spectrally perceived items – number of fundamentally perceived items)/total number of items] where scores vary from −1 to +1. Higher scores reflect greater use of spectral processing and lower scores reflect greater use of fundamental processing. Scores around 0 reflect no consistent preference.
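The SFP index formula given above can be written out as a short sketch. The function name and example counts are hypothetical. The sketch assumes every item is classified as either spectrally or fundamentally perceived, so that the two counts sum to the total number of items.

```python
def sfp_index(n_spectral, n_fundamental):
    """(spectral - fundamental) / total items, ranging from -1 to +1."""
    total = n_spectral + n_fundamental  # assumes every item is classified
    if total == 0:
        raise ValueError("at least one classified item is required")
    return (n_spectral - n_fundamental) / total


# Hypothetical responses on the 12-item version:
mostly_fundamental = sfp_index(3, 9)   # negative: fundamental processing
no_preference = sfp_index(6, 6)        # zero: no consistent preference
```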

### Analyses

Consistent with prior research, the MBEA-m scores were significantly negatively skewed while the BAT scores were more normally distributed in both the TD and WS groups. Therefore, we used non-parametric statistics to compare performances between the WS and TD groups, and to examine correlations between MBEA-m and BAT performance with age, IQ, musical training (number of types of lessons and cumulative duration of individual lessons), time spent playing and listening to music, and sound processing style in the WS participants (sound processing was not collected in TD participants). Variables that were significantly related to MBEA-m and BAT scores were then entered into linear regression analyses to assess their unique contributions to variance in beat and meter perception skills in WS. Parametric linear regressions were appropriate given the normally distributed residuals.
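The two rank-based statistics named above, the Mann–Whitney U for group comparisons and Spearman's ρ for correlations, can be sketched in plain Python as follows. This is an illustration only: the function names and data are hypothetical, p-values are not computed, and the software actually used for the analyses is not stated in the text.

```python
from statistics import mean


def midranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic for sample x; the U for y is len(x) * len(y) minus this."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


def spearman_rho(x, y):
    """Pearson correlation computed on the midranks of x and y."""
    rx, ry = midranks(x), midranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for two small groups and one paired covariate.
group_a = [1.2, 0.4, 2.0, 1.1]
group_b = [2.3, 1.9, 2.8, 2.1]
u_a = mann_whitney_u(group_a, group_b)
rho = spearman_rho([70, 85, 60, 95], group_a)
```

Both statistics depend only on the rank order of the scores, which is why they suit the negatively skewed MBEA-m distribution.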

Preliminary analyses of the BAT data suggested that for both WS and TD participants, some items were easier than others. As the BAT stimuli reflect real musical excerpts, we examined how certain stimuli-specific factors might have influenced performance in the WS and TD groups. We used multilevel logistic models that allowed us to cluster items within individual participants, thus controlling for the individual participant variability in BAT performance. We examined musical genre (rock versus jazz versus orchestral pops, with rock music used as the reference); tempo (based on the inter-onset-intervals of the beat (in ms), grand-centered); and beat variability (based on the coefficient of variability (CV; Lovie, 2005) of the inter-onset-intervals, grand-centered) as predictors of item-level performance in WS and TD groups.
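The two stimulus-level predictors described above can be derived as in the following sketch. The beat times are hypothetical. Using the sample standard deviation is an assumption, as the text does not specify sample versus population. Centering is shown as subtraction of the mean across stimuli.

```python
from statistics import mean, stdev


def beat_cv_percent(beat_times_ms):
    """Coefficient of variation (%) of the inter-onset intervals of the beat."""
    iois = [b - a for a, b in zip(beat_times_ms, beat_times_ms[1:])]
    # Sample standard deviation; an assumption, not stated in the source.
    return 100 * stdev(iois) / mean(iois)


def grand_mean_center(values):
    """Subtract the mean across all stimuli from each stimulus value."""
    m = mean(values)
    return [v - m for v in values]


# Hypothetical excerpt whose beat alternates between 490 and 510 ms.
cv = beat_cv_percent([0, 490, 1000, 1490, 2000])

# Hypothetical mean inter-onset intervals (ms) for three stimuli.
centered_tempo = grand_mean_center([450, 500, 550])
```

After centering, a model coefficient for tempo or beat variability describes the change relative to an average stimulus rather than relative to a value of zero.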

### Results

### Beat Alignment Test

Beat Alignment Test performance was highly variable in the WS and TD groups. Overall, TD participants performed significantly better on the BAT than WS participants (d' = 2.17 ± 0.75 versus 1.37 ± 0.98, Mann–Whitney U = 540.5, p < 0.001). In WS, BAT performance was associated with IQ (ρ = 0.445, p < 0.001), types of musical training (ρ = 0.320, p = 0.016), cumulative years of individual music lessons (ρ = 0.357, p = 0.007), and sound processing style (ρ = −0.401, p = 0.002) but not with age or time spent playing or listening to music. The negative correlation between BAT and sound processing style suggests that greater BAT performance is associated with greater use of a fundamental processing style. In the TD group, BAT performance was similarly associated with types of musical training (ρ = 0.372, p = 0.03) and cumulative years of individual music lessons (ρ = 0.360, p = 0.036) but not with IQ, age, or time spent playing or listening to music.

The regression analysis including the four predictor variables (IQ, sound processing, and the two musical training variables) explained 28.7% (adjusted R<sup>2</sup>) of the variance in BAT score [F(4,51) = 6.523, p < 0.001] in the WS group (**Table 2**). IQ and sound processing were the strongest predictors of BAT score, explaining 7.6 and 11.0% of the variance in BAT scores, respectively. Cumulative duration of independent musical training predicted an additional 3.8% of the variance.

#### TABLE 2 | Regression model predicting BAT performance in WS.

Significant predictors are italicized.
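As a consistency check on the reported regression statistics, the sketch below recovers the unadjusted R<sup>2</sup> and the F statistic from the adjusted R<sup>2</sup> and the degrees of freedom using the standard formulas. The function names are hypothetical. The sample sizes are inferred from the reported degrees of freedom (n = residual df + predictors + 1), not stated directly in the text.

```python
def r2_from_adjusted(adj_r2, n, k):
    """Invert adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)


def f_from_r2(r2, n, k):
    """Overall F statistic for a regression with k predictors."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


k = 4  # IQ, sound processing, and the two musical training variables

# BAT model: adjusted R^2 = 28.7%, reported as F(4,51), so n = 51 + 4 + 1.
n_bat = 51 + k + 1
f_bat = f_from_r2(r2_from_adjusted(0.287, n_bat, k), n_bat, k)

# MBEA-m model: adjusted R^2 = 25.3%, reported as F(4,64), so n = 64 + 4 + 1.
n_mbea = 64 + k + 1
f_mbea = f_from_r2(r2_from_adjusted(0.253, n_mbea, k), n_mbea, k)
```

Both recovered values fall close to the reported F statistics (6.523 and 6.757), with small differences expected from rounding of the adjusted R<sup>2</sup>.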

The multilevel binomial models confirmed the significant individual variability in BAT performance in both the WS [χ<sup>2</sup>(58) = 141.31, p < 0.001] and TD groups [χ<sup>2</sup>(34) = 66.69, p < 0.001]. Results of the models can be found in **Table 3** (WS) and **Table 4** (TD). When examining effects of genre, tempo, and beat variability, only beat variability predicted item-level performance in the WS group. Each percent increase in beat variability (compared with average beat variability across stimuli) was associated with a 70.4% chance of correctly answering an item on the BAT [t(881) = −4.599, p < 0.001; Odds Ratio = 0.78]. However, further analysis found that this was due to one particular excerpt, an orchestral pops version of the Superman theme. The beat of this stimulus was more than twice as variable (CV = 7.8%) as the next most variable stimulus (an orchestral pops rendition of Richard Rodgers waltzes; CV = 3.7%) or the average of all other stimuli included in our test battery (CV = 2.7%). When this item was excluded, beat variability was no longer a significant predictor of item accuracy. Instead, orchestral pops stimuli (versus rock stimuli) were associated with only a 71.3% chance of a correct answer [t(761) = −2.049, p = 0.041, OR = 0.62] when controlling for tempo and beat variability.

In contrast, in TD participants, tempo and jazz genre were significant predictors of accuracy. A 1 ms increase in inter-onset-interval of the beat (i.e., slower tempo) was associated with a 90.8% chance of correctly answering a BAT item while jazz stimuli (versus rock stimuli) were associated with an 83% chance of correctly answering a BAT item.
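To illustrate how an odds ratio from a logistic model translates into a probability of a correct answer, the following sketch applies an odds ratio to a baseline probability. The baseline of 0.75 is purely hypothetical. The baseline probabilities implied by the models in Tables 3 and 4 are not given in the text, so this sketch does not reproduce the percentages reported above.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Scale the odds of a correct answer by an odds ratio; return a probability."""
    odds = baseline_prob / (1 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


# Hypothetical baseline: 75% chance of a correct answer for an average item.
baseline = 0.75

# An odds ratio below 1 (here 0.78, as reported for beat variability in WS)
# lowers the chance of a correct answer relative to that baseline.
after_one_unit = apply_odds_ratio(baseline, 0.78)
```

The same odds ratio produces different changes in probability at different baselines, which is why a predicted probability cannot be recovered from the odds ratio alone.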

### Montreal Battery of Evaluation of Amusia Meter Subtest

As depicted in **Figure 1**, there were vast individual differences in meter perception in both the WS and TD groups though overall performance was significantly greater in the TD (mean = 26.92 ± 3.5, median = 28) than WS (mean = 24.43 ± 5.1, median = 26) group (Mann–Whitney U = 1333, p = 0.004). Among WS participants, greater MBEA-m performance was associated with higher IQ (ρ = 0.342, p = 0.003), number of lesson types (ρ = 0.413, p < 0.001), and cumulative years of musical training (ρ = 0.424, p < 0.001). MBEA-m performance was also associated with sound processing style (ρ = −0.287, p = 0.014), indicating that individuals with a greater fundamental processing style had better meter perception skills on the MBEA-m. Among TD participants, greater MBEA-m performance was associated with number of lesson types (ρ = 0.388, p = 0.004), cumulative years of musical training (ρ = 0.335, p = 0.004), and time spent currently playing music (ρ = 0.284, p = 0.041). No other significant relationships were found.

Tempo and beat variability are grand-centered around the mean. The reference genre is rock music.

The regression analysis including IQ, sound processing, and the two musical training variables predicted 25.3% (adjusted R<sup>2</sup>) of the variance in MBEA-m performance in WS [**Table 5**; F(4,64) = 6.757, p < 0.001]. IQ and sound processing had the greatest predictive value on MBEA-m performance, accounting for approximately 5.8 and 4.6% of the variability in MBEA-m performance when controlling for the other factors, with types of musical training accounting for an additional 3.8% of variance.

### Beat Alignment Test and Montreal Battery of Evaluation of Amusia Meter

Among the 57 participants with WS with both BAT and MBEA-m scores, performance on the two measures was highly correlated (ρ = 0.610, p < 0.001). This is an expected finding, as meter perception emerges from the hierarchical organization of beats. Therefore, we conducted a step-wise regression examining MBEA-m performance with BAT performance. At Step 1, we entered the significant predictors from the original analysis to confirm similar results on MBEA-m performance in this subset of participants (IQ, sound processing, and the two musical training variables). Results were similar to findings with the full set of participants with MBEA-m data, explaining 27.9% of the variance in MBEA-m performance in this smaller sample. As before, IQ (β = 0.263, t = 2.207, p = 0.032, sr<sup>2</sup> = 6.5%) and sound processing style (β = −0.277, t = −2.273, p = 0.027, sr<sup>2</sup> = 6.9%) were the greatest predictors of MBEA-m performance. However, the addition of BAT d' score at Step 2 explained an additional 8.8% (adjusted R<sup>2</sup> = 36.7%) of the overall MBEA-m variance and BAT d' score was the only significant predictor (β = 0.378, t = 2.825, p = 0.007, sr<sup>2</sup> = 9.4%). IQ and sound processing style (which had predicted BAT performance) were no longer unique predictors of MBEA-m once beat perception abilities were taken into account. (Results were similar when the Superman item was excluded when calculating BAT d' scores.)

Tempo and beat variability are grand-centered around the mean. The reference genre is rock music.

#### TABLE 5 | Regression model predicting MBEA-m performance in WS.

Significant predictors are italicized.

### STUDY 2: BEAT AND METER PERCEPTION AND SOCIAL COMMUNICATION SKILLS

### Methods

#### Participants

Data for Study 2 were collected from 50 adults with WS who attended a 1-week residential summer camp over a 4-year period. The camp focused on musical activities such as songwriting and performances, as well as developing social and daily living skills. Musical talent was not required to attend this program and attendees varied widely in their music abilities, as reflected in Study 1 and previous work (e.g., Lense and Dykens, 2012; Lense et al., 2013). Of the 50 participants in Study 2, 31 also participated in Study 1 and 19 were new participants.

Due to changes in study protocol across years, 37 (mean age: 26.2 ± 8.4 years, 56.8% male) of the 50 participants completed the beat test (BAT) and 40 (mean age: 26.8 ± 8.3 years, 64.0% male) completed the meter (MBEA-m) test, with 28 of these participants completing both tests. One individual who completed the MBEA-m was excluded from the BAT because they did not follow task directions. An additional participant was excluded from both BAT and MBEA testing because they did not understand directions. Results were very similar when analyses only included participants with both tests. Demographic information for these groups of participants is provided in **Table 6**.

### Measures

Participants completed the KBIT-2, BAT, and MBEA-m, described in Study 1. Adaptive functioning was assessed with the Vineland Adaptive Behavior Scales, 2nd edition (Vineland-II; Sparrow et al., 2005), which identifies adaptive functioning in three domains: Communication, Daily Living Skills, and Socialization. This semi-structured, standardized interview was conducted over the phone with participants' primary caregiver. The three domains yield standard scores (M = 100, SD = 15), which were used in analyses.

#### TABLE 6 | Demographic information, Vineland-II, and BAT/MBEA-m scores for Study 2.

### Analyses

We conducted zero-order and partial correlations controlling for IQ between the BAT, MBEA-m, and Vineland-II domain scores.
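A partial correlation controlling for one covariate can be computed from the three pairwise correlations. The sketch below applies the standard first-order formula to Spearman correlations. That choice is an assumption based on the ρ values reported in the Results, as the text does not state how the partial correlations were computed. All names and data are hypothetical.

```python
from statistics import mean


def ranks(values):
    """Midranks: tied values share the average of their 1-based ranks."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Zero-order Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Correlation of x and y after removing what each shares with z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


# Hypothetical rank-like scores for five participants.
beat_score = [1, 2, 3, 4, 5]
communication = [2, 1, 4, 3, 5]
iq = [1, 3, 2, 5, 4]

zero_order = spearman(beat_score, communication)
controlling_iq = partial_spearman(beat_score, communication, iq)
```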

### Results

#### Beat Alignment Test

**Figure 2** depicts the relationship between BAT scores and Vineland-II scores. BAT performance (d') was significantly associated with Vineland-II Communication (ρ = 0.472, p = 0.003) and Socialization (ρ = 0.370, p = 0.024) but not Daily Living Skills (ρ = 0.145, p = 0.391). This pattern of findings remained when controlling for IQ (Communication: ρ = 0.436, p = 0.008, Socialization: ρ = 0.326, p = 0.052, Daily Living Skills: ρ = 0.141, p = 0.411).

### Montreal Battery of Evaluation of Amusia Meter Subtest

As depicted in **Figure 3**, MBEA-m scores were associated with the Vineland-II Socialization domain (ρ = 0.439, p = 0.005), even when controlling for IQ (ρ = 0.307, p = 0.057), but MBEA-m performance was not associated with Communication or Daily Living Skills (ρ's = 0.21 and 0.24, respectively).

### DISCUSSION

In contrast to literature portraying preserved musical skills in WS, this study instead highlights considerable variability in beat and meter perception skills in people with this syndrome. On average, individuals with WS have poorer beat and meter perception skills than age-matched typically developing individuals with comparable musical training. Beat and meter perception was influenced by individual-level characteristics (e.g., IQ, sound processing style, musical training), as well as stimuli-level characteristics (e.g., musical genre and beat variability).

While our findings may seem inconsistent with two previous reports suggesting age-appropriate beat abilities in WS to clap to/in response to music, there are several possibilities for these differences in findings including task demands, constructs, and sample characteristics. For example, the current study examined perceptual abilities while the previous studies documented production skills. Additionally, the perception tasks in the current study required explicit answers while the clapping tasks in the production studies may have been more implicit. Indeed, Levitin and Bellugi (1998) noted that in their clapping tasks, both WS and TD participants naturally responded to the experimenter's clapping by clapping back in time themselves (i.e., implicitly preserving the overall meter). Thus, it is possible participants with WS do better on tasks that do not require explicit responses. This pattern of greater impairment in explicit versus implicit musical tasks has previously been seen in cases of otherwise TD individuals with extremely poor pitch perception abilities (Loui et al., 2008). In contrast, studies of beat perception in TD groups have documented cases where a given individual's perceptual abilities are stronger than their ability to tap to the beat (Iversen and Patel, 2008). Future studies will need to directly assess beat perception and production skills in the same individuals with WS to examine relationships between these abilities and to determine if some individuals have significant difficulties specifically on explicit perception tasks. Martens et al. (2010) tested participants with WS on a variety of melodic and rhythmic perception and production tasks and noted that a few individuals did well on the production tasks and poorly on the perception tasks and vice versa. As scores were aggregated across all tasks, it is unknown if these patterns stem from both the melodic and rhythmic tasks or discrepancies in just one of the task categories.

We selected the BAT as our test of beat perception in part because of its use of real, readily accessible music, which is more likely to represent participants' actual music listening experiences. The BAT successfully maintained most participants' attention and engagement, but for some participants, the rich auditory information, with multiple instrumental timbres, may have interfered with their ability to perceive and consciously report on the beat. In comparison, the clapping tasks used in previous studies (Levitin and Bellugi, 1998; Martens et al., 2010) are presented in only one timbre. Thus, the sound complexity of the BAT stimuli may have exerted a greater effect on WS versus TD participants, especially given their poor auditory filtering and increased sensitivities to sounds, including to musical timbres (Levitin et al., 2005; John and Mervis, 2010; Lense et al., 2012). Additionally, though the BAT stimuli (especially when the Superman theme item is excluded) generally have low beat variability, they are not strictly isochronous as are stimuli in previous studies.

Furthermore, as correct/incorrect responses in production studies involving clapping were based on examiners' judgments (Levitin and Bellugi, 1998; Martens et al., 2010), it is possible that more fine-grained analyses using acoustic measurements could have revealed group differences in either temporal precision or consistency. For example, raters did not perceive differences in the rhythmic accuracy of the singing of "Happy Birthday" by adults with WS versus chronological age-matched TD adults, but acoustic measurements demonstrated that the rhythmic patterns were less precise in the WS group (Martinez-Castilla and Sotillo, 2008). Additionally, Martens et al. (2010) noted that both WS and TD participants averaged above 90% on the clapping to the beat of the music task. Thus, a ceiling effect may have precluded finding group differences.

Beyond group differences, we found considerable individual variability in beat and meter perception skills in both the WS and TD groups. Similarly, previous studies with TD individuals have documented vast individual differences in beat perception (Iversen and Patel, 2008; Grahn and Schuit, 2012). Thus, it is not surprising to also find such broad individual differences in WS. Indeed, many participants with WS performed comparably to the TD group, including individuals scoring perfectly on the beat and/or meter measure, while others were at chance levels. IQ predicted beat and meter perception skills in the WS participants but not in the TD participants in our sample. This may be because of the wide variability in IQ in the WS participants while participants in the TD group all had IQ in the average range. Previous work in WS has indicated relationships between rhythm abilities and developmental level (e.g., Don et al., 1999). Synchronization of tapping behaviors to rhythmic stimuli also increases across development in TD children (e.g., Drake et al., 2000a).

In both WS and TD, beat and meter perception were associated with musical training. Our findings of the relationship between musical training and beat and meter skills in TD are consistent with previous studies (e.g., Drake et al., 2000a,b; Grahn and Schuit, 2012). Previous studies on musical timing perception (including rhythm, beat, and meter) in WS have not examined the role of musical training. However, studies of specific pitch perception abilities, singing skills, and musical instrument learning have all indicated a role of musical training in WS (Martinez-Castilla and Sotillo, 2008; Lense and Dykens, 2012; Lense et al., 2013). Therefore, the role of musical training in beat perception skills in WS appears to generally be consistent with findings in the TD population.

Even when controlling for IQ and musical training, sound processing style significantly predicted beat and meter performance in WS, and in the case of the BAT, was an even greater predictor than IQ. Previous studies in TD samples have not examined the role of sound processing style in beat perception, and we unfortunately were unable to collect this measure in our TD participants. However, greater use of a fundamental processing style in TD has been linked to preferences for playing percussion instruments (Schneider et al., 2005). Moreover, use of a fundamental processing style has been associated with a preference for hard rock music, which has a salient and consistent beat (while preference for jazz music, in contrast, is associated with greater use of spectral processing style; Schneider and Wengenroth, 2009). Individuals with WS exhibit a stronger fundamental processing style compared to TD individuals (Wengenroth et al., 2010). Additionally, individuals with WS primarily play percussion instruments (Lense et al., 2013) and listen to and prefer music with a strong and consistent beat such as hard rock or country rock/pop (Lense and Dykens, 2015).

The preferences for rock music in WS may therefore relate to sound processing style and a corresponding preference for music with a salient and consistent beat. Indeed, an item analysis on the BAT revealed that individuals with WS (but not TD individuals) had significant difficulty with one particular item, which was characterized by substantially greater beat variability than the other test items. It may seem surprising that beat variability did not predict accuracy in the TD group (or in WS after the exclusion of the outlier item) as temporal fluctuations influence beat perception and synchronization in TD individuals (Drake et al., 2000b; Large et al., 2002). However, the beat variability was generally quite low on the BAT items, particularly when excluding the Superman theme item. TD individuals are generally still able to track and synchronize to the beat in music despite temporal fluctuations (Large and Jones, 1999; Drake et al., 2000b). Thus, individuals with WS may have more difficulty finding and tracking the beat at lower thresholds of beat variability than TD individuals.

Once the Superman theme item was excluded, the range of beat variability in the stimuli was greatly reduced and beat variability no longer predicted item accuracy. However, controlling for tempo and beat variability, WS participants tended to do worse on orchestral pops than rock music excerpts, which may simply be a reflection of less exposure to orchestral music. In contrast, TD participants were more affected by tempo (performed better with faster items). This may be because the aligned/misaligned beeps always started 5 s into the musical track. Therefore, during faster tempo songs, participants would have heard more musical beats and thus may have better entrained to the beat of the songs to then determine the alignment of the superimposed beeps. It is not clear why the TD participants tended to do worse on the jazz items when controlling for other factors. It is possible that this genre of music was less familiar for the TD participants in our sample. While the BAT is a widely used measure of beat perception in TD, to our knowledge, thorough item level analyses have not been conducted. Future studies may want to carefully consider characteristics of the BAT stimuli when examining individual differences on this test or if choosing only a subset of items to administer.

Beyond factors that are typically thought of as being related to musical perception abilities (e.g., cognition, auditory processing, and musical training), Study 2 revealed that beat perception was significantly associated with adaptive Communication and Socialization skills, while meter perception was related to adaptive Socialization skills. These relationships, reflecting a medium effect size (Cohen, 1988), were maintained even when controlling for IQ. In contrast, there was no relationship with Daily Living Skills suggesting that beat and meter perception are specifically related to social communication abilities.<sup>1</sup>

While many studies have examined relationships between beat perception and specific linguistic and social tasks, to our knowledge, this is the first study to find relationships between beat perception and more global measures of social communication skills that reflect the performance of skills or behaviors in everyday activities. Many of the skills that are aggregated in the Vineland-II have, however, been associated with beat perception in prior studies. For example, the Communication domain assesses Receptive (e.g., following directions), Expressive (e.g., conversations; grammatical skills), and Written (e.g., reading at a certain grade level) language abilities. Prior research has found that individual differences in beat perception relate to performance on standardized tasks of receptive and expressive language (e.g., Cumming et al., 2015) and word reading and reading comprehension (Corriveau and Goswami, 2009; Tierney and Kraus, 2013). Additionally, the presence of a beat-based or metric framework enhances speech comprehension (e.g., Rothermich and Kotz, 2013; Magne et al., 2016). Similarly, the Socialization domain assesses Interpersonal Relationships (e.g., understanding indirect cues, cooperating with others), Play and Leisure Time (e.g., following rules in games, going places with friends), and Coping (e.g., managing emotions, avoiding unsafe relationships) skills. Prior studies have indicated that musical beat-based activities promote cooperation (Anshel and Kippler, 1988; Wiltermuth and Heath, 2009; Kirschner and Tomasello, 2010; Cirelli et al., 2014) and feelings of connection (Demos et al., 2012).

Further support for the relationship between beat perception and social communication skills comes from neuroimaging studies. A network of auditory and motor processing areas contributes to beat perception involving the auditory cortex, supplementary motor area, premotor cortex, and basal ganglia (e.g., Grahn and McAuley, 2009; Grahn and Rowe, 2009). In particular, different components of the basal ganglia have been associated with beat prediction (Grahn and Rowe, 2009) and sensory timing (Schwartze et al., 2015), consistent with the involvement of the basal ganglia in prediction and prediction error more broadly. People with WS show reduced volume of the basal ganglia (Faria et al., 2012), and reduced volume or dysfunction of the basal ganglia has been associated with social difficulties in WS (Campbell et al., 2009) and autism (Qiu et al., 2010). Thus, general impairments in predictive timing may contribute to both musical beat perception and social communication difficulties, as successful prediction in a dynamic world is key to successful social engagement (e.g., Sinha et al., 2014). As a metric and beat-based stimulus, music may in part be appealing to individuals with WS because it provides a structured rhythmic framework that guides attention and increases predictability.

There are several limitations to the current study that should be noted. First, future studies should examine a variety of rhythm perception and production skills in the same group of individuals with WS to elucidate relationships among different aspects of temporal skills, including directly exploring relationships between perception and production and the role of beat perception in scaffolding rhythm perception in WS. We used real music that was not strictly isochronous but that generally had low levels of beat variability. Future studies could examine beat perception abilities along a wider spectrum of beat variability (including isochrony) to determine the range of variability over which individuals with WS, compared with TD individuals, are able to perceive and synchronize to a beat.

As well, we did not directly assess the role of movement in supporting beat perception. Consistent with task instructions, we asked participants to stay still during the BAT but encouraged them to move during the MBEA-m task. Informally, we observed that individuals who struggled the most on the MBEA-m task had no to minimal musical training and were least engaged in movement during these tasks. Additional work is needed on those individuals with WS who are unable to find the beat in music and move to it.

<sup>1</sup> It is plausible that individuals with greater communication and social skills were simply better able to participate in the one-on-one testing session, leading them to perform better on the beat/meter perception tests. However, this seems unlikely for several reasons. First, all examiners were highly trained in working with individuals with WS and managing social difficulties. Additionally, a subset of 32 individuals also completed a pitch perception task during the same testing session with the same examiner. There was no relationship between pitch perception and any of the Vineland-II domains (ρ's = −0.156 to −0.003, p's > 0.3). Thus, adaptive social skills appear to be specifically related to beat/meter skills and not pitch skills.

Finally, it will be important to address whether the relationship between social and communication skills and beat perception differs in WS compared with TD populations, as studies in other domains (e.g., emotion) have found greater links between musical and social processing in WS than in TD (e.g., Lense et al., 2013; Ng et al., 2013). Relatedly, although we used a standardized measure of global communication and social skills (Vineland-II), future research could examine the role of beat perception in specific social communication tasks in WS and TD. TD individuals would be expected to score in the average range on a global measure such as the Vineland-II, which incorporates a variety of social communication skills. However, examination of specific social communication tasks with beat perception skills in WS and TD might reveal similarities and differences in specific, nuanced skills that contribute to successful social communication abilities. For example, TD individuals may be better able to compensate for a specific social communication weakness (e.g., by knowledge of how to act in a social situation) while this same vulnerability in WS might contribute to, or reflect, a broader constellation of social communication impairments. Furthermore, given the unique WS social profile of social communication deficits yet heightened social interest and motivation, it will be important to identify differences in beat perception in social versus nonsocial contexts in both WS and TD to examine how social context may differentially scaffold beat perception abilities in these populations.

Our finding that musical beat perception is related to social communication suggests that musical engagement and music therapy may be effective for social goals in WS. A growing number of studies report that group musical activities and/or music therapy promote social and emotional development in infants (Gerry et al., 2012), typically developing children with lower levels of prosocial behaviors (Schellenberg et al., 2015), and children with autism (Geretsegger et al., 2014). Increased musical interest in WS begins early in life (Levitin et al., 2004) and while music therapy is often appealing for WS given their musical interests, it is important to examine the mechanisms by which different musical experiences contribute to development in order to refine and optimize such interventions. For example, Martens et al. (2011) found that children with WS with versus without musical training did better on a verbal memory task involving novel words when it was administered via singing (to the tune of Twinkle Little Star) rather than speaking. It is likely that temporal structure differed across the singing versus speech conditions. In particular, the novel words in the singing condition all occurred on the musical beat. Children with musical training may have been more aware of the beat and metric structure of the song and thus may have been more attentive to the timing of the salient words. Of course, additional studies are needed that control for different aspects of the stimuli in order to test this hypothesis.

In summary, our study highlights the broad variability in musical beat and meter perception in WS and finds that these abilities are related to cognitive skills, sound processing style, and musical training. Moreover, we find relationships between beat and meter perception and social communication skills. To our knowledge, this is the first study to examine these relationships in a population with a unique social profile including social communication difficulties despite increased social motivation. Future studies are needed to determine if these relationships are also seen in other populations, or if this reflects closer links between predictive timing in musical and social contexts in WS.

### AUTHOR CONTRIBUTIONS

ML and ED conceived and designed the experiments, ML conducted the experiments together with members of the lab, ML analyzed the data, ML and ED wrote the manuscript.

### FUNDING

This work was supported in part by a grant from NICHD (P30 HD015052-30) and a National Science Foundation Graduate Research Fellowship, as well as grant support to the Vanderbilt Institute for Clinical and Translational Research (UL1TR000011 from NCATS/NIH). We thank the participants and their families for taking part in the study. We also thank the ACM Lifting Lives Music Camp and the Williams Syndrome Association for their assistance.





**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Lense and Dykens. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Both Isochronous and Non-Isochronous Metrical Subdivision Afford Precise and Stable Ensemble Entrainment: A Corpus Study of Malian Jembe Drumming

Rainer Polak <sup>1</sup> † , Justin London <sup>2</sup> † and Nori Jacoby <sup>3</sup> \* †

*<sup>1</sup> Institute for World Music, Cologne University of Music and Dance, Cologne, Germany, <sup>2</sup> Department of Music, Carleton College, Northfield, MN, USA, <sup>3</sup> Computational Cognitive Science Lab, Department of Psychology, University of California, Berkeley, Berkeley, CA, USA*

#### Edited by:

*Andrea Ravignani, Vrije Universiteit Brussel, Belgium*

### Reviewed by:

*Dirk Vorberg, University of Muenster, Germany Lauren Victoria Hadley, Medical Research Council Institute of Hearing Research (MRC IHR), UK*

> \*Correspondence: *Nori Jacoby nori.viola@gmail.com*

*† These authors have contributed equally to this work.*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

> Received: *06 March 2016* Accepted: *07 June 2016* Published: *28 June 2016*

#### Citation:

*Polak R, London J and Jacoby N (2016) Both Isochronous and Non-Isochronous Metrical Subdivision Afford Precise and Stable Ensemble Entrainment: A Corpus Study of Malian Jembe Drumming. Front. Neurosci. 10:285. doi: 10.3389/fnins.2016.00285*

Most approaches to musical rhythm, whether in music theory, music psychology, or musical neuroscience, presume that musical rhythms are based on isochronous (temporally equidistant) beats and/or beat subdivisions. However, rhythms that are based on non-isochronous, or unequal patterns of time are prominent in the music of Southeast Europe, the Near East and Southern Asia, and in the music of Africa and the African diaspora. The present study examines one such style found in contemporary Malian jembe percussion music. A corpus of 15 representative performances of three different pieces ("Manjanin," "Maraka," and "Woloso") containing ∼43,000 data points was analyzed. Manjanin and Woloso are characterized by non-isochronous beat subdivisions (a short IOI followed by two longer IOIs), while Maraka subdivisions are quasi-isochronous. Analyses of onsets and asynchronies show no significant differences in timing precision and coordination between the isochronously timed Maraka vs. the non-isochronously timed Woloso performances, though both pieces were slightly less variable than non-isochronous Manjanin. Thus, the precision and stability of rhythm and entrainment in human music does not necessarily depend on metric isochrony, consistent with the hypothesis that isochrony is not a biologically-based constraint on human rhythmic behavior. Rather, it may represent a historically popular option within a variety of culturally contingent options for metric organization.

Keywords: rhythmic timing, meter, beat subdivision, ensemble entrainment, audio-based corpus, African drumming, culture

### INTRODUCTION

The rhythms of human music and dance are significantly more complex, more diverse, and more flexible than the rhythmic behaviors found in any other species (see Patel et al., 2005; Bispham, 2006; Fitch, 2006, 2012, 2013; Patel, 2006, 2014; Merker et al., 2009; Bowling et al., 2013; Merchant and Honing, 2014; Ravignani et al., 2014; Merchant et al., 2015). While birds and bonobos may be able to entrain to musical or quasi-musical stimuli exhibiting a constant and acoustically obvious pulse at specific frequencies, adult humans are able to find regular pulses in irregular rhythmic patterns, and at a wider range of tempos, than any other species (McAuley et al., 2006). However, a common presumption in biomusicological studies is that the essence of this human capacity involves the extraction of an isochronous (temporally equidistant) pulse train, which provides a framework for temporal perception and action. Indeed, most approaches to musical rhythm, whether in ethnomusicology (Waterman, 1952; Arom, 1984, 1991; Kubik, 1988, 1994; Tenzer, 2011), music theory (Lerdahl and Jackendoff, 1983; Hasty, 1997; Mirka, 2009), music psychology (Longuet-Higgins and Lee, 1982, 1984; Povel and Essens, 1985; Desain and Honing, 1999; Madison and Merker, 2002), and musical neuroscience (Large and Jones, 1999; Snyder and Large, 2005; Grahn and Brett, 2007; Large, 2008; Grube and Griffiths, 2009; Grube et al., 2010; Nozaradan et al., 2012, 2015; Nozaradan, 2014), as well as biomusicology, presume that human rhythmic entrainment is based on a hierarchical organization of isochronous beats and beat subdivisions. In other words, it is commonplace to regard isochrony as a universal, constitutive feature of the regularity that entrainable rhythms require. Savage et al. (2015) show that isochronous beats represent a statistical universal of near global spread, and the authors suggest that the occurrence of such statistical universals might indicate biological constraints on cultural diversity.

In principle, presuming relative simplicity as a functional prerequisite of metric pulse appears plausible. Together with other mechanisms, such as categorical rhythm perception (Clarke, 1987; Schulze, 1989; Desain and Honing, 2003), it allows one to tell a story of rhythmic evolution along the following lines. While many creatures exhibit isochronous rhythmic behaviors (e.g., locomotive gaits and wing beating, resting respiration, etc.), and while a few can exhibit an isochronous rhythmic response to an external isochronous rhythm (e.g., primate chorusing), humans evolved a capacity for creating endogenous isochronous pulses from more complex stimuli (Merchant and Honing, 2014). Specifically, the relative simplicity of the pulse phenomenon can be understood as arising from human behavioral complexity coupled with a need for stable and predictable interpersonal interaction. The temporally predictive functionality of pulse and meter suggests that it should be structurally simpler than the rhythmic structures that give rise to it.

However, this nativist view of a natural predisposition toward isochrony resulting from biological constraints is implausible from a cross-cultural, ethnomusicologically informed perspective. The main thrust of rhythm research in comparative musicology and ethnomusicology has been to emphasize the dramatic range of cultural diversity and difference, not only in their surface rhythms, but also in the metrical systems that function as frameworks for their rhythm perception and production. It is empirically evident that music in many parts of the world makes structural usage of non-isochronous beats, including northern Europe (Kvifte, 2007; Johansson, 2009; Haugen, 2014), south-east Europe (Brăiloiu, 1984; Moelants, 2006; Goldberg, 2015; Polak, 2015), Turkey (Cler, 1997; Bates, 2011; Holzapfel, 2015; Reinhard et al., 2015), Egypt and the Arab world (Marcus, 2001, 2007), Central Asia (During, 1997), India (Clayton, 1997, 2000), and parts of Africa and its diasporas (Gerischer, 2003, 2006; Polak, 2010; Jankowsky, 2013; Haugen and Godøy, 2014; Polak and London, 2014). Both isochronous and non-isochronous beats co-exist in most, if not all, of these regions. Musicians, listeners and respondents (people dancing, singing, working, marching, trancing, or clapping to music) are typically at ease with employing different (yet appropriate) metric frameworks in different pieces of the same repertoires, genres, and styles.

In this paper, performance timings of three pieces of jembe ensemble music from Mali are analyzed to assess whether rhythms characterized by non-isochronous beat subdivisions differ with respect to their precision and stability in complex, polyrhythmic multi-part ensemble music, in comparison to rhythms characterized by isochronous beat subdivisions from the same genre and musical tradition. If isochronous meters are privileged in human rhythm perception and production, then we hypothesize that music that involves non-isochronous beat subdivisions should exhibit less precision and stability than music with isochronous beat subdivisions. In particular, we would expect rhythms produced in a non-isochronous context to display:


Within a corpus of isochronous and non-isochronous pieces displaying otherwise similar characteristics, the aforementioned hypothesis predicts that non-isochronous pieces will display greater timing variability and greater ensemble asynchrony in comparison with isochronous pieces. If, however, we find that variability and asynchrony of non-isochronous rhythms are not substantially different than isochronous rhythms, then one can no longer claim that isochronous meter has a privileged status in human rhythm perception and production.

### MATERIALS AND METHODS

### Music and Recordings Used in this Study

The music we have studied is colloquially known as "jembe music," as the jembe (also djembe) is featured as the main instrument in these ensembles. The jembe is a goblet-shaped drum beaten with bare hands, originating from Guinea and Mali. Traditionally, jembe-centered percussion ensemble music has played a central role in celebratory dance events such as weddings and other life cycle events, as well as in agricultural work tasks such as hoeing fields for weeding. In the 1960s, jembe music and dance entered programs of state-sponsored folkloric ensembles and, at the same time, became part of the urban popular culture in Bamako, Conakry, Dakar, and Abidjan, among other West African cities. Since the 1980s, West African jembe music, musicians, and instruments have migrated globally (see Charry, 1996, 2000, chapter 4; Polak, 2000, 2004, 2005, 2007, 2012). The popular, vernacular, and participatory characteristics of jembe music make it a particularly relevant case for issues in the psychology and biology of music, because these qualities, which are typical of many types of functional music, are arguably more representative of human musicality than, for instance, Western art music (Peretz, 2006).

Malian drum ensembles typically involve three distinct musical roles: a variative lead drum, a repertoire-specific timeline, and one or more ostinato accompaniment parts (Polak and London, 2014). These roles are assigned to specific instrumental "voices" or ensemble parts. In the Bamako style of jembe music performance in the 1990s and early 2000s, the minimum ensemble size was a duet of one jembe playing the lead part and one dundun, a cylindrical drum beaten with a stick, playing the timeline. Trios add a second jembe playing an ostinato accompaniment rhythm; if financial and logistic resources allow for it, a second dundun is added to further support the accompaniment section.

The set of recordings analyzed here is comprised of three different pieces: Maraka, Manjanin, and Woloso. These three are among the core repertoire of standard pieces in the Bamako style of jembe music (see Polak, 2012). The pieces in our corpus involve two different meters, and were performed by three different ensemble sizes and with four different lead drummers (see **Table 1**).

As is typical of jembe music performance, all recordings show a large-scale, nearly continuous structural accelerando; the tempo at the end of each piece is 30–45% faster than in the beginning. The average ending tempo of 185 bpm (IOI = 324 ms per beat) is very rapid, yielding an average IOI of 108 ms per metric subdivision, which is near the limit for sensori-motor synchronization (Repp, 2003). Their rhythmic textures are near-maximally saturated, that is, each time-point at the subdivision level almost always receives a note onset. Typically, no single player articulates every time-point in the metric cycle for more than a few cycles. Rather, the saturated rhythmic texture results from the interweaving phrases of various ensemble members playing together.
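The tempo figures above follow from simple arithmetic, which the sketch below reproduces. The function names are our own illustrative choices, and the subdivision value assumes an equal three-way division of the beat, which gives only the average subdivision IOI and not the timing of any individual subdivision.

```python
def beat_ioi_ms(tempo_bpm: float) -> float:
    """Inter-onset interval (IOI) of one beat in milliseconds at a given tempo."""
    return 60_000.0 / tempo_bpm


def mean_subdivision_ioi_ms(tempo_bpm: float, subdivisions_per_beat: int = 3) -> float:
    """Average IOI of one metric subdivision (ternary by default).

    This is an average only: in non-isochronous pieces the individual
    subdivisions are unequal, but they still sum to one beat.
    """
    return beat_ioi_ms(tempo_bpm) / subdivisions_per_beat


# Average ending tempo reported in the text.
print(round(beat_ioi_ms(185)))              # 324
print(round(mean_subdivision_ioi_ms(185)))  # 108
```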

The three studied pieces share a common type of metric framework: a cycle of four regular beats with ternary subdivision. Polak (2010) found two different timing patterns for the ternary subdivision timing in these three pieces. Maraka has quasi-isochronous triplets, while the non-isochronous or "swung" ternary subdivision in Manjanin and Woloso consistently showed either a short-medium-long (SML) or short-long-long (SLL) pattern, which were assumed to represent variations of a slightly more generic pattern type, short-flexible-long (SFL). These patterns appeared stable for each piece, across different recordings, players, ensemble parts, durations, phrases, and tempo changes, and thus seemed to represent repertoire-specific metric norms. They were found in other types of drum ensemble music from Mali as well (Polak and London, 2014). **Figure 1** graphically represents the basic drumstroke patterns used by each part in each piece. Note that the column widths are indicative of their characteristic timings.

### Data Collection and Preparation

In 2006/07, author RP collected a set of 15 multi-track audio and video recordings of complete live drum performances while conducting ethnographic field research in Bamako, Mali. Unidirectional microphones (AKG C-419) were clipped on to the rims of each drum. Individual parts were recorded to a mobile digital four-track studio (Edirol R4) in WAVE-file format at 16 bit/48 kHz. A mini-DV camcorder (Canon XM2) captured video footage at 25 progressive scans per second. Recording sessions took place in the open air, where there was little acoustical crosstalk of instruments and reverberation from walls. The single tracks of the multitrack recordings proved clean enough for audio analysis without the need for frequency filtering.

Audio and video recordings were combined and synchronized in Vegas Pro 11 and 12 (Sony); Soundforge Pro 10 (Sony), Wavelab 7 (Steinberg), and Cubase 7 (Steinberg) were used for onset detection and marking. Onsets were detected automatically, and then were individually checked by eye. Note onset times were exported to Excel 2013 for data organization, and then to Matlab 8 (Mathworks) for further analysis. Out of the 42,297 resulting onsets, some 1054 data points (2.5% of all onsets) were excised from the beginning and end of recordings, to exclude informal introductions and formulaic endings that do not conform to the stable polyrhythms of interest in our study.

Given the structural tempo changes in each recording, analyzing timing data as absolute durations (in seconds or milliseconds) would be disadvantageous, because the magnitudes of resulting values then would be incomparable across the greatly different tempos covered in the performances. We therefore chose the four-beat metric cycle as the basic unit of analysis and normalized ("detrended") the time-series from the tempo factor by giving temporal intervals as percentages of the local four-beat cycles. To obtain this, we performed the following process:


All normalization was done only at the four-beat cycle level; we did not normalize each beat independently.
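The idea of cycle-level normalization can be illustrated with a minimal sketch. It assumes that the start time of every four-beat cycle is already known; how those cycle starts are identified is not shown here, and the function and variable names are hypothetical rather than taken from the authors' Matlab analysis.

```python
import numpy as np


def normalize_to_cycles(onsets_s, cycle_starts_s):
    """Express onsets as percentages (0-100) of their local four-beat cycle.

    onsets_s       : onset times in seconds
    cycle_starts_s : sorted start times of consecutive cycles in seconds
                     (assumed to be given)
    Onsets before the first or at/after the last cycle start are dropped,
    because their local cycle duration is undefined.
    Returns (cycle_index, position_percent) for the retained onsets.
    """
    onsets = np.asarray(onsets_s, dtype=float)
    starts = np.asarray(cycle_starts_s, dtype=float)
    # Index of the cycle whose start is the latest one not after each onset.
    idx = np.searchsorted(starts, onsets, side="right") - 1
    valid = (idx >= 0) & (idx < len(starts) - 1)
    idx, onsets = idx[valid], onsets[valid]
    # Normalize by the duration of the whole local cycle, not by single beats.
    durations = starts[idx + 1] - starts[idx]
    position = 100.0 * (onsets - starts[idx]) / durations
    return idx, position


# Two cycles under an accelerando: 2.0 s, then 1.6 s.
cycles, pos = normalize_to_cycles(
    [0.0, 0.5, 1.0, 2.0, 2.4, 2.8, 3.7], [0.0, 2.0, 3.6]
)
```

In this toy example the second cycle is shorter than the first, yet onsets at the same metric position receive the same normalized value in both cycles (0, 25, and 50%), which is the property that makes timings comparable across tempos.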

**Figure 2** (top) shows the result of this process for one piece in the corpus. Despite the large tempo changes (in this piece from 136 bpm to 197 bpm) the onsets are organized in a highly structured fashion. **Figure 2** (bottom) shows the aggregated histogram of all onsets, each peak corresponding to one of the 12 metrical grid positions. **Figure 3** shows that these peaks were also consistent across renditions of the same piece. The strictness of adherence to the metric grid for each piece is striking, justifying the heuristic for the identification of the cycle start. Based on this structure we also defined heuristic boundaries between metric positions (displayed in red in **Figure 3**) and binned each onset to the corresponding metric bin. The exact location of the boundaries does not matter much for the binning process, as the peaks are extremely well separated. However, a small percentage (less than 3%) of all events was nevertheless positioned in ambiguous locations near the heuristic boundaries. These events

#### TABLE 1 | Set of recordings.



FIGURE 1 | Rhythmic patterns (melodic and timbral aspects omitted) for Maraka, Manjanin, and Woloso in annotated box notation. The pattern given for Jembe 1 is an example of a typical lead drum phrase.

almost exclusively represent metrically extraneous onsets by the lead-drum part. The first jembe frequently embellishes phrases by adding extra ornamental strokes. These include flams, which consist of two onsets that perceptually merge into one rhythmic event, as well as rolls that combine three or more strokes at a frequency higher than that of the metric subdivision. The approach to filtering these extraneous onsets was two-fold. First, we assumed that only one event within each subdivision "bin"

would function as the articulation of that particular subdivision pulse and hence be relevant for ensemble synchrony. Whenever one metric pulse-bin received two onsets by the lead drum, we discounted the onset that was more distant from the mean value for that metric position. Second, we defined windows of 17% of the normalized beat duration for each of the three subdivisions (that is, about half of their nominal normalized duration), spread asymmetrically (−10% to +7%) around the mean value for each of them, and discarded all onsets outside that window. Author RP, an expert in this style of music, verified by visual and audio inspection of the entire corpus that the decisions made by this heuristic corresponded to his understanding of the musical style. In any case, the number of filtered events was small, totaling merely 1170 events (2.8% of all events in the corpus).
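The two filtering steps can be sketched for a single pulse-bin as follows. This is only an illustration under stated assumptions: positions are given in percent of the local beat, the steps are applied in the order in which they are described, and the function name and example values are hypothetical.

```python
def filter_bin(onsets, bin_mean, early=10.0, late=7.0):
    """Return the single lead-drum onset kept for one pulse-bin, or None.

    onsets      -- lead-drum onsets assigned to this bin, in % of the local beat
    bin_mean    -- mean position of this metric position, in % of the beat
    early, late -- window limits around the mean (-10% to +7%, as in the text)
    """
    if not onsets:
        return None
    # Step 1: if several onsets share the bin, keep the one closest to the mean.
    closest = min(onsets, key=lambda x: abs(x - bin_mean))
    # Step 2: discard the onset if it lies outside the asymmetric window.
    if -early <= closest - bin_mean <= late:
        return closest
    return None

print(filter_bin([31.0, 38.5], 33.0))   # two onsets in one bin
print(filter_bin([45.0], 33.0))         # one onset far from the bin mean
```

Because the window is asymmetric, the order of the two steps can matter in rare cases; the ordering used here is an assumption.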

### RESULTS

### Isochronous vs. Non-Isochronous Subdivision Timings

All three pieces exhibit a meter composed of four isochronous beats that show almost no local differences in IOI. However, within each beat, the three pieces show two distinct patterns of subdivision timing (see **Figures 4**, **5**). The difference is particularly evident in the second (middle) subdivision pulse-bin. In Maraka the subdivisions are nearly isochronous, albeit with a characteristic slight compression of the middle element (see Desain and Honing, 2003; Repp, 2005; Repp and Su, 2013). By contrast, Manjanin and Woloso display a short-medium-long pattern of subdivision, with an earlier articulation of the middle element.

As can be seen in **Figure 6**, the variability of subdivision timing is very low on average; the standard deviations of all onsets in all recordings for each of the three pulse classes are approximately 2.5–3.5% of the local beat duration.

We further analyzed these variabilities with a 2-way Piece × Subdivision ANOVA, which showed significant main effects of both Piece [F(2, 36) = 10.7, p < 0.001] and Subdivision [F(2, 36) = 13.6, p < 0.001], but no significant interaction [F(4, 36) = 0.96, p = n.s.]. Post-hoc tests showed that (a) there is no significant difference in variability between the isochronous Maraka and the non-isochronous Woloso [t(31) = 0.47, p = n.s.], whereas the variability of Manjanin was significantly larger than both Woloso [t(25) = 3.05, p = 0.016] and Maraka [t(28) = 3.46, p = 0.005] (Bonferroni correction for multiple comparisons applied here and in all post-hoc tests noted below); (b) the variability of the first subdivision (onbeat) is significantly smaller than both the second subdivision [mid-beat; t(25) = 3.04, p = 0.016],

and the third [up-beat; t(28) = 3.45, p = 0.005], which were not significantly different from one another [t(31) = 0.47, p = n.s.]. This is consistent with the idea that the strong metric positions (onbeat) are more stable than weak metric positions (London, 2012; see Repp, 2003 for a similar result in a finger tapping experiment).

recording. Error bars represent the standard deviation of the subdivision position (1 beat = 100%) computed for each recording individually. Dashed lines represent idealized isochronous subdivisions.

To test the consistency of variability over the large tempo changes within each performance, we divided each recording into two parts with the same number of four-beat cycles. The average tempo of the second half of the pieces (168 BPM) was

significantly faster than the beginning half [145 BPM; t(14) = 16.3, p < 0.001]. However, the differences between the first and second half in terms of relative performance variability were extremely small: 2.7 and 2.9%, respectively. While it is to be expected that relative variability will increase with tempo (Wing and Kristofferson, 1973), a 3-way Piece × Subdivision × Part (first vs. second half) ANOVA showed only a marginally significant main effect of Part [F(1, 76) = 4.07, p = 0.05] but a significant Part × Piece interaction [F(2, 76) = 3.53, p = 0.03]. However, a post-hoc test only found a significant contrast between the end of the Manjanin pieces and all other possible parts and pieces (p < 0.05) (Bonferroni correction for multiple comparisons applied here and in subsequent post-hoc tests). Importantly, there was no statistically significant difference between isochronous Maraka and non-isochronous Woloso among all the possible tested situations, i.e., the beginning and ending of the piece and each of the three possible subdivisions (p > 0.05). These results show that (a) the basic subdivision timings (**Figures 2**, **3**) are highly stable in all three pieces, and (b) there is no significant difference in variability between the isochronous Maraka and the non-isochronous Woloso.

### Asynchronies between Ensemble Parts

To assess the precision of coordination among parts and to provide a window on the performers' use of a common metric framework, we measured the extent, pattern, and variability of the mean asynchronies between onsets by different individual ensemble members in the same metric position. Mean signed asynchronies were calculated relative to a virtual reference beat, which we defined as the mean of all onsets within each metric bin for each performance. Across all three pieces in the corpus, the value of the mean signed asynchronies is about 2% of the normalized local beat duration (see **Figure 7**). Depending on the tempo (beat IOIs from ≈300 to ≈600 ms), these mean asynchronies are in the range of 6–12 ms.
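The measure described above, and the conversion from percent of the beat to milliseconds, can be sketched as follows. The data layout (a dictionary of onsets per player for one metric bin of one performance) and the function names are assumptions made for illustration only.

```python
from statistics import mean

def mean_signed_asynchronies(onsets_by_player):
    """Mean signed asynchrony per player for one metric bin.

    onsets_by_player -- dict mapping a player name to that player's onsets in
                        the bin, in % of the local beat (hypothetical data)
    The virtual reference is the mean of all onsets in the bin; negative
    values mean that the player is ahead of the reference.
    """
    all_onsets = [x for xs in onsets_by_player.values() for x in xs]
    reference = mean(all_onsets)
    return {player: mean(xs) - reference for player, xs in onsets_by_player.items()}

def percent_to_ms(percent, beat_ioi_ms):
    """Convert a percentage of the beat duration into milliseconds."""
    return percent / 100.0 * beat_ioi_ms

# An asynchrony of 2% of the beat at the slowest and fastest beat IOIs.
print(percent_to_ms(2.0, 600.0), percent_to_ms(2.0, 300.0))
```

With beat IOIs between about 300 and 600 ms, a value of 2% corresponds to roughly 6 to 12 ms, which matches the range given in the text.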

A 2-way Piece × Instrument ANOVA shows a significant main effect of instrument [F(3, 34) = 14.1, p < 0.001] but no significant effect of piece [F(2, 34) = 0.01, p = n.s.] nor a significant interaction [F(6, 34) = 0.52, p = n.s.]. Post-hoc tests found that the lead drummer (Jembe 1) tended to play ahead of the accompanists [Jembe 2: t(25) = 7.92, p < 0.001; Dundun 2: t(17) = 6.7, p < 0.001] as well as ahead of the timeline [Dundun 1: t(28) = 3.39, p = 0.012]. Another related measure of accuracy is the absolute value of the mean asynchrony: a 2-way Piece × Instrument ANOVA did not show any significant main effect [piece: F(2, 34) = 0.59, p = 0.55; instrument: F(3, 34) = 1.47, p = 0.24] nor an interaction [F(6, 34) = 0.42, p = 0.85]. Taken together, these results show that the pattern and extent of asynchrony between players do not vary between pieces; isochronous and non-isochronous pieces do not differ in this respect.

The variability of asynchronies is also low (standard deviations range between 1.5–3.2% of the local beat duration), indicating that the minimal amount of mean asynchrony does not result from averaging out larger deviations, but represents a very stable pattern of highly precise ensemble timing (see **Figure 8**).<sup>1</sup>

Analyzing the standard deviation of the asynchronies with a 2-way Piece × Instrument ANOVA showed significant main effects of piece [F(2, 34) = 13.2, p < 0.001] and instrument [F(3, 34) = 21.9, p < 0.001] but no significant interaction [F(6, 34) = 0.19, p = 0.97]. Post-hoc analyses show that the isochronous Maraka and non-isochronous Woloso do not significantly differ from each other [t(31) = 0.96, p = n.s.], but are significantly less variable than the non-isochronous Manjanin [t(29) = 3.05, p = 0.005]. In addition, the post-hoc analysis showed that Jembe 1 has a significantly larger variability compared with Jembe 2 [t(25) = 6.12, p < 0.001] and Dundun 1 [t(28) = 4.57, p < 0.001]. However, Jembe 1 and Dundun 2 were not significantly different from one another [t(17) = 2.55, p = n.s.]. Note, however, that all these differences and nominal values are extremely small. For example, the differences are less than 1% of the beat duration, and the largest nominal value of variability (Jembe 1: 3.3%) represents a timing difference of only 10–20 ms.

### DISCUSSION AND CONCLUSION

This paper examines the assumption that isochrony is privileged in human rhythmic perception and production by testing the hypothesis that the production of non-isochronous rhythms will be associated with both greater durational variability as well as larger and less stable inter-personal asynchronies in ensemble performance. We analyzed three pieces whose rhythms are characterized by either isochronous or non-isochronous meters. Manjanin and Woloso share a similar short-flexible-long subdivision timing pattern that is different from the

<sup>1</sup>Note that the standard deviations of the asynchronies were computed similarly to the mean signed asynchrony: separately for each piece, metric position, and instrument. This computation is therefore slightly different from the standard deviation computed in **Figure 6**, in which onsets within different metric bins (in the four-beat cycle) that are associated with the same metric subdivision (onbeat, midbeat, or upbeat) were aggregated independently of whether referring to Beat 1, 2, 3, or 4 in the four-beat cycle. Note that both methods provide consistent results (compare **Figures 6**, **8**).

quasi-isochronous subdivisions in Maraka (**Figures 4**, **5**). This hypothesis predicts much smaller and less variable asynchronies among ensemble members performing the isochronous Maraka than in performances of both non-isochronous Woloso and Manjanin. However, our results are inconsistent with this prediction in three main ways:


While music based on isochronous pulses is held to represent a statistical universal (Savage et al., 2015), it remains that

(a) music based on non-isochronous pulse structures is found in many cultures (referenced in the introduction) and (b) non-isochronous pulse structures afford precise and stable rhythmic performance and entrainment, as our study above has shown. This forces one to conclude that isochrony is not an inherent, biologically-based constraint on human rhythmic behavior. Rather, it may represent a historically popular option within a variety of culturally contingent options for metric organization. A range of evidence supports this assumption. First, Hannon and colleagues have demonstrated in a series of experimental studies that enculturation overrides the mathematical complexity inherent in non-isochronous beats. Non-isochronous beat sequences such as 2+2+3 are more difficult than isochronous ones for Western adult listeners, but not for Bulgarian, Macedonian, Turkish, and Indian listeners (Hannon and Trehub, 2005a; Hannon, 2010; Hannon et al., 2012a; Kalender et al., 2013; Ullal-Gupta et al., 2014). Studies of rhythmic development have shown that 6-month-old infants can respond to isochronous and non-isochronous beats with equal facility, but by 12 months, infants already develop a bias toward the rhythms of their environment. Yet one-year-old infants can quickly learn to adapt to "foreign" (e.g., non-isochronous) rhythmic patterns through brief exposure (Hannon and Trehub, 2005a,b). Statistical learning by passive exposure quickly and strongly shapes our perception and cognition of rhythm and meter (Hannon et al., 2012b). The transition from culture-general to culture-specific patterns in beat perception starts very early in life, and the privileging of isochronous over non-isochronous beats is on the culture-specific, not on the culture-general side of the developmental divide (Hannon and Trehub, 2005b).

Second, long-term ethnographic research in Malian jembe music (author RP) reveals that local players, listeners, and dancers do not experience non-isochronous subdivisions as relatively difficult or irregular, nor do they conceptually distinguish them from isochronous patterns. For instance, professional teachers do not try to avoid non-isochrony when students show difficulties in understanding a rhythm.

Biomusical discussions of the nature of human rhythmic and entrainment capacities emphasize the diversity and flexibility of human rhythmicity, while at the same time presuming that these complex behaviors supervene upon a small number of simple underlying metrical processes. However, from our study and the other cross-cultural studies of rhythm cited above, it is evident that the human capacity for rhythm, and pulse perception and production in particular, may be more complex than previously assumed. Metric flexibility is surely limited in degree when compared to rhythmic flexibility, yet clearly metric regularity does not depend upon isochrony, though this has been supposed in many theoretical, analytical, and psychological accounts of rhythm in Western classical and popular music.

This re-characterization of the human capacity for rhythm and entrainment further emphasizes the distinction of humans from all other species. For example, fireflies have one meter/rhythm (without rhythm-meter distinction), whereas birds and great apes may have a few rhythms and one meter, within narrow limits of tempo (Schachner et al., 2009; Patel et al., 2009a,b; Patel, 2014; Ravignani et al., 2014; Large and Gray, 2015). Humans, by contrast, are able to perform a great many rhythms at many different tempos; contrary to conventional presumptions, they also perceive many more meters than time signatures in Western musical notations suggest (London, 2012). Humans are able to adapt to a much broader range of rhythmic situations and contexts partly because their capacity for meter, too, is more flexible and differentiated. One aspect of the flexibility and source for differentiation of meters is that metric pulses do not need to be isochronous—neither their beats nor their subdivisions.



Biomusicological studies of rhythm hotly contest the rhythmic abilities of non-human animals. By contrast, they seem to assume that the human capacity for rhythm and entrainment is more or less fully understood, or at least fully documented. This is premature. In particular, existing and emerging knowledge about cultural diversity has not been sufficiently integrated into music theoretic, psychological, neuroscientific, and biological discussions of human rhythmicity (for recent, surprisingly innovative insights from such a perspective in other domains such as economic behavior, visual perception, or spatial cognition, see Henrich et al., 2010a,b). This bears the risk of distortion, since the standard contexts for the evolution, history, and practice of human music and dance are marked by the encultured development of individuals and the encultured social situations and institutions of individual action and social interaction. The definition of the human capacity for rhythm needs to recognize that cultural diversity and flexibility are part and parcel of human nature.

### AUTHOR CONTRIBUTIONS

All authors contributed equally to the paper. Author RP collected the data. Authors RP, JL, and NJ analyzed the data and wrote the paper.

### FUNDING

Data collection was funded by Deutsche Forschungsgemeinschaft (DFG), research grant PO 627/3-1.


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Polak, London and Jacoby. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Brain Bases of Working Memory for Time Intervals in Rhythmic Sequences

#### Sundeep Teki<sup>1</sup>\*† and Timothy D. Griffiths<sup>1,2</sup>

*<sup>1</sup> Wellcome Trust Centre for Neuroimaging, University College London, London, UK, <sup>2</sup> Institute of Neuroscience, Newcastle University, Newcastle upon Tyne, UK*

Perception of auditory time intervals is critical for accurate comprehension of natural sounds like speech and music. However, the neural substrates and mechanisms underlying the representation of time intervals in working memory are poorly understood. In this study, we investigate the brain bases of working memory for time intervals in rhythmic sequences using functional magnetic resonance imaging. We used a novel behavioral paradigm to investigate time-interval representation in working memory as a function of the temporal jitter and memory load of the sequences containing those time intervals. Human participants were presented with a sequence of intervals and required to reproduce the duration of a particular probed interval. We found that perceptual timing areas, including the cerebellum and the striatum, were more or less active as a function of increasing or decreasing jitter of the intervals held in working memory, respectively, whilst the activity of the inferior parietal cortex was modulated as a function of memory load. We also analyzed structural correlations between gray and white matter density and behavior and found significant correlations in the cerebellum and the striatum, mirroring the functional results. Our data demonstrate neural substrates of working memory for time intervals and suggest that the cerebellum and the striatum represent core areas for representing temporal information in working memory.

#### Keywords: interval timing, time perception, working memory, rhythm, fMRI

### INTRODUCTION

Every day we are required to assess sequences of variable time intervals that occur in sounds like speech, music and environmental sounds, a process that requires us to hold multiple time intervals in memory. This work examines the neural bases for holding time intervals in working memory and the effect of changing the amount of information in these sequences determined by the temporal variability and number of intervals.

The nature of working memory in general is under debate (Ma et al., 2014). Classical visual models assume a limited working memory capacity (Miller, 1956; Cowan, 2001) where information is stored in a fixed number of discrete slots (Luck and Vogel, 1997). However, recent visual and auditory studies support a resource allocation model based on a limited working memory resource that is dynamically distributed between multiple items in natural scenes, without a slot limit (Bays and Husain, 2008; Gorgoraptis et al., 2011; van den Berg et al., 2012; Kumar et al., 2013; Ma et al., 2014; Teki and Griffiths, 2014; Joseph et al., 2015a,b, 2016). Neither of

#### Edited by:

*Sonja A. Kotz, Maastricht University, Netherlands; Max Planck Institute for Human Cognitive and Brain Science, Germany*

#### Reviewed by:

*Amy Poremba, University of Iowa, USA; Virginia Penhune, Concordia University, Canada*

\*Correspondence: *Sundeep Teki sundeep.teki@dpag.ox.ac.uk*

### †Present Address:

*Sundeep Teki, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, UK*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

Received: *21 January 2016*; Accepted: *17 May 2016*; Published: *01 June 2016*

#### Citation:

*Teki S and Griffiths TD (2016) Brain Bases of Working Memory for Time Intervals in Rhythmic Sequences. Front. Neurosci. 10:239. doi: 10.3389/fnins.2016.00239*


these models, however, has considered the question of how time intervals are held in working memory.

We designed a novel paradigm to assess working memory for sequences of intervals that systematically changed the information held in working memory and examined working-memory fidelity (Teki and Griffiths, 2014). Listeners were presented with two types of sequences: (1) sequences with a fixed number of intervals with different levels of temporal regularity, and (2) sequences with a varying number of intervals with a fixed temporal regularity. The task did not involve a binary response (e.g., shorter/longer or same/different judgment) about the probed interval change as in previous studies, but instead required the participant to reproduce the duration of a single interval that was probed after the sequence. This allowed us to examine the effects of the variability and number of intervals in the sequence on the precision (reciprocal of standard deviation) for probed interval reproduction (Teki and Griffiths, 2014). The results are consistent with a working memory model based on a fixed resource for storing time intervals so that a greater number of intervals can be stored at the expense of fidelity (Bays and Husain, 2008). The present study sought to address the neural bases for the core working memory resource, determined by both temporal variability and number of intervals.

Previous work on memory for time was either based on retention of a single interval in memory for subsequent comparison or involved multiple presentations of a standard interval that formed an isochronous sequence (Keele et al., 1989; Ivry and Hazeltine, 1995; Merchant et al., 2008). Other studies used induction sequences to study the effect of rate of presentation of those sequences (Barnes and Jones, 2000) or the temporal structure of the sequence (McAuley and Jones, 2003; Teki et al., 2011) on judgments of the duration of subsequent intervals. However, as these studies involved repetition of standard intervals, the effective memory load was limited to the interval used as the basis for the induction sequence.

Previous imaging work has shown that the putamen and caudate nucleus encode the duration of single time intervals (Rao et al., 2001; Coull et al., 2008) while recent work suggests that areas for the analysis of single intervals alter with the sequence context (Merchant et al., 2013). Timing in regular sequences relies more on a striato-thalamo-cortical network whilst timing in irregular sequences depends more on the cerebellum (Grahn and Rowe, 2009; Teki et al., 2011, 2012; Kung et al., 2013; Allman et al., 2014). The present study addresses brain bases for storing time intervals in memory as is required for natural acoustic stimuli, for which we hypothesized a striatal and cerebellar substrate.

Another motivation of the study was to examine contextual factors: the effect of task context on stimuli with the same variability and number of intervals. Previous work was mostly based on single intervals and thus could not address this crucial question. Recent reviews emphasize task-dependent activation of brain areas associated with temporal processing (Wiener et al., 2010a; Merchant et al., 2013) but there are no data suggesting that the activity of brain areas underlying memory for time intervals may also be modulated by task context.

We used functional magnetic resonance imaging to uncover the neural substrates that represent sequences of intervals in working memory. Our results highlight activity in core perceptual timing areas including the cerebellum and the striatum that varies with the amount of information in a sequence, determined by temporal regularity and number of intervals. Holding and manipulating the same interval in working memory depended on the context, the number of intervals in the sequence, in the caudate nucleus and the inferior parietal cortex. Our data support the flexible representation of time intervals in working memory where the cerebellum and caudate provide the core resource.

### MATERIALS AND METHODS

### Participants

Nineteen listeners (12 females; mean age: 27.4 ± 2.3 years) with normal hearing and no history of audiological or neurological disorders provided informed written consent and participated in the experiment. A female listener was excluded from the analysis due to excessive movement in the scanner. Two listeners could not complete the number-of-interval blocks. Thus, 18 listeners provided datasets for the jitter condition whilst 16 listeners' datasets were analyzed for the number-of-intervals condition. All but four listeners had musical experience but none of them were currently practicing music. Experimental procedures were approved by the research ethics committee of University College London.

### Stimuli

The stimulus (**Figure 1**) consists of a sequence of clicks of 0.5 ms duration and identical loudness. The inter-onset interval (IOI) was selected from a normal distribution that ranged from 500 to 600 ms. For the Jitter blocks, the stimulus comprised four time intervals. By jitter, we refer to variability in the length of a time interval around a mean value of inter-onset interval. For instance, introducing a 10% jitter for a 100 ms interval would yield an interval whose duration may vary from 90 to 110 ms. Four different levels of temporal jitter were incorporated: (i) 5–10%, (ii) 20–25%, (iii) 35–40%, and, (iv) 50–55%. Higher jitter values enhance the difference in duration between the various intervals and make each interval more unique, thereby increasing the memory load. The exact jitter values were randomly drawn from a normal distribution centered on the mean of each of the four ranges of jitter. Each sequence block was jittered by only one of the above ranges of jitter.
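A minimal sketch of how such a jittered click sequence might be generated is given below. It is not the authors' MATLAB code: the uniform draws (for the jitter value within a level's range and for each interval within the jitter bounds), the mean IOI of 550 ms, and all names are assumptions made for illustration.

```python
import random

# Four jitter levels from the text, as (low, high) percentages.
JITTER_RANGES = {1: (5, 10), 2: (20, 25), 3: (35, 40), 4: (50, 55)}

def jittered_intervals(mean_ioi_ms, jitter_percent, n_intervals, rng):
    """Draw n_intervals durations within +/- jitter_percent of mean_ioi_ms.

    Assumption: each interval is drawn uniformly inside the jitter bounds,
    e.g. 10% jitter on a 100 ms interval gives values between 90 and 110 ms.
    """
    half_width = mean_ioi_ms * jitter_percent / 100.0
    return [rng.uniform(mean_ioi_ms - half_width, mean_ioi_ms + half_width)
            for _ in range(n_intervals)]

rng = random.Random(0)                       # fixed seed for reproducibility
low, high = JITTER_RANGES[2]                 # the 20-25% level
jitter = rng.uniform(low, high)              # assumed uniform within the range
intervals = jittered_intervals(550.0, jitter, 4, rng)

# Four intervals are delimited by five clicks.
click_onsets = [0.0]
for duration in intervals:
    click_onsets.append(click_onsets[-1] + duration)
print(len(intervals), len(click_onsets))
```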

The stimuli for the Number-of-intervals blocks consisted of sequences with different numbers of time intervals, from 1 to 4, and a fixed jitter of 20–25%. The stimulus for the reaction time task consisted of a single click only.

Stimuli were created digitally using MATLAB 2012 (MathWorks Inc.) at a sampling rate of 44.1 kHz and resolution of 16 bits. Sounds were delivered diotically through MRI compatible insert earphones (Sensimetrics Corp.) and presented at a comfortable listening level between

80 and 90 dB SPL that was adjusted by each listener. The stimulus presentation was controlled using Cogent (http://www.vislab.ucl.ac.uk/cogent.php).

### Timing Task

The task was designed to assess listeners' memory for time intervals embedded in sequences in which the temporal jitter and number of intervals were varied parametrically (Teki and Griffiths, 2014). Listeners were instructed to attend to the sequence of clicks and reproduce the duration of the interval that was probed after the sequence (via text displayed on the screen, e.g., "Match time interval: 1"). The probed interval number was displayed during the entire delay period after the sequence, which lasted 1.5 s. A click was played after the delay period and indicated the start of the interval to be reproduced. The listeners' task was to press a button at a point in time (after this click) that corresponded to their memory of the duration of the probed interval. Responses made within a window of 2 s were considered valid responses while responses longer than 2 s were treated as "missed" responses. Feedback, equal to the difference between the duration of the probed interval and the listeners' response (adjusted for reaction times), was presented for 500 ms after each trial (e.g., "Shorter by 53.2 ms" or "Longer by 107.4 ms").

### Control Task

A control task was used prior to each timing block to calculate listeners' response times to a single click. The reaction times were used to regress out variance due to the motor response from the time matching responses in the experimental blocks.

### Procedure

Listeners received instructions about the task and practiced a reaction time block of 15 trials and a jitter block of 24 trials. Training was repeated until performance improved, as assessed by precision values. However, participants did not receive any explicit information or training for the number-of-interval blocks.

In order to investigate context-sensitive responses, listeners only received training on the jitter blocks and not the (later) number-of-interval blocks. It was important to not counterbalance the order of the jitter and number-of-interval blocks to ensure that listeners held only one task context in mind during the jitter block and then switched to a different task context provided by the number-of-interval blocks. Listeners received brief training on the number-of-intervals condition in the scanner after the jitter blocks were completed. This enabled us to compare brain activations for trials that were identical in structure (32 trials with 20–25% jitter and 4 intervals in a sequence) across the two task conditions.

The task of the listeners was to reproduce the duration of the cued interval from memory by pressing a button on a keypad. Responses were always made with the index finger and the use of right and left hand was counterbalanced across participants. Prior to each timing block, listeners completed a reaction time block comprising 30 trials where they pressed a button in response to a single click. Listeners were instructed to respond at a comfortable rate and maintain the same pace for both the reaction time and timing trials throughout the experiment.

The imaging experiment lasted ∼1 h and consisted of two jitter blocks (varying jitter and fixed number of intervals) followed by two number-of-intervals blocks (varying number of intervals and fixed jitter) where each block consisted of 64 trials. Field maps were acquired after the first two blocks and listeners were instructed about the change in stimulus structure and received limited training on the number-of-intervals condition whilst in the scanner. Each block lasted ∼15 min and short breaks were allowed between successive blocks. Listeners were instructed to keep their eyes open as the probed interval was indicated visually on the screen. At the end of each block, listeners received feedback specifying the number of trials on which their timing error was less than 100 ms, between 100 and 200 ms, or greater than 200 ms. A structural scan was acquired at the end of the functional imaging experiment for each participant.

### Behavioral Analysis

The median of the reaction times for the final 24 of the 30 trials was computed for each reaction time block. For the timing blocks, the error response was calculated as the difference between the time matching response and the actual duration of the cued interval. The median reaction time from the preceding control (reaction time) block was subtracted from this value. This allowed us to obtain a cleaner measure of the time matching response that was not confounded with the time taken for button press (see Teki and Griffiths, 2014). This analysis was repeated for each timing block.
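The correction described above amounts to a simple subtraction, sketched below. The function name, the argument layout, and the example values are hypothetical; only the use of the median of the final 24 reaction-time trials follows the text.

```python
from statistics import median

def corrected_errors(responses_ms, probed_ms, reaction_times_ms, n_last=24):
    """Timing errors corrected for the motor response.

    responses_ms      -- button-press latencies measured from the start click
    probed_ms         -- actual durations of the probed intervals
    reaction_times_ms -- trials of the preceding reaction-time block
    n_last            -- number of final reaction-time trials used (24 of 30)
    Negative errors mean that the reproduced interval was too short.
    """
    rt = median(reaction_times_ms[-n_last:])
    return [resp - dur - rt for resp, dur in zip(responses_ms, probed_ms)]

# Hypothetical block: 6 slow warm-up reaction times followed by 24 stable ones.
reaction_times = [300.0] * 6 + [250.0] * 24
print(corrected_errors([800.0, 700.0], [550.0, 520.0], reaction_times))
```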

The absolute value of the error responses was used to calculate precision, by computing the inverse of the standard deviation of the error responses. Precision was measured as a function of jitter and as a function of number of intervals for the corresponding blocks. Precision was used as the primary measure of interest as it captures the true variability in memory performance. This is useful to interpret variability in performance with increasing number of items and examine whether performance is fixed up to a certain number of slots (according to slot models) or scales flexibly according to the total amount of information to be remembered (according to shared resource models). The slot model would predict that the precision would be at ceiling for a set number of items such as four (see Cowan, 2001) until capacity is exceeded and would drop to floor for a set size that exceeds the working memory capacity. The shared resource model, however, predicts that precision is highest for a set size of one and decays as a function of the number of items to be remembered (Ma et al., 2014). Crucially, the precision for memory loads greater than four is predicted to be higher than that obtained by chance. Absolute error or accuracy measures do not capture such variability and are thus not ideal for comparing the two models.
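Precision as defined here can be computed in a few lines. The sketch below assumes the sample standard deviation and, following the wording of the text, takes absolute values of the errors first; the `use_absolute` flag and the function name are hypothetical additions for illustration.

```python
from statistics import stdev

def precision(errors_ms, use_absolute=True):
    """Precision as the reciprocal of the standard deviation of the errors.

    errors_ms    -- reaction-time-corrected error responses (at least two)
    use_absolute -- take absolute values first, as the text describes
    """
    values = [abs(e) for e in errors_ms] if use_absolute else list(errors_ms)
    return 1.0 / stdev(values)

narrow = precision([0.0, -2.0, 4.0])     # small spread of errors
wide = precision([0.0, -20.0, 40.0])     # tenfold larger spread
print(narrow > wide)
```

A larger spread of errors gives a lower precision, so the measure falls as reproduction becomes more variable.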

### Image Acquisition

Gradient weighted whole-brain echo planar images were acquired using a 3T Siemens Allegra system using a sparse imaging design: time to repeat (TR): 14.76 s; time to echo (TE): 30 ms; time for volume acquisition (TA): 3.36 s (70 ms to acquire one slice × 48 slices); matrix size: 64 × 72; slice thickness: 2 mm with 1 mm gap between slices; and in-plane resolution: 3.0 × 3.0 mm<sup>2</sup>. The slices were tilted by −7° (transverse > coronal) to obtain full coverage of the cerebellum. This orientation was used successfully to uncover perceptual timing responses in the inferior olive and the cerebellum in our previous fMRI timing study (Teki et al., 2011). Field maps were acquired to compensate for geometric distortions due to magnetic field inhomogeneity (Hutton et al., 2002) using a double-echo gradient echo field map sequence (TE<sub>1</sub> = 10.00 ms and TE<sub>2</sub> = 12.46 ms). A T1 weighted structural scan was acquired after the functional scans (Deichmann et al., 2004).

A sparse sampling design (**Figure 1**) was used to obtain clean auditory activations unaffected by the scanner noise (Belin et al., 1999). The total duration of the stimulus ranged from 0.5 to 2.6 s depending on the number of intervals (1–4) in the sequence. A variable silence period preceded the onset of the stimulus such that the combined duration of silence and stimulus was fixed at 7.4 s. A delay period of 1.5 s, response window of 2 s and a feedback period of 0.5 s, in that order, completed each trial with a fixed duration of 11.4 s. The latency between the onset of the delay period and scanner onset was fixed at 4 s so that the acquisition of each scan was time-locked to the onset of the delay period. This latency of 4 s was based on our previous study where we used a similar sparse imaging protocol to isolate timing responses in the cerebellum and the striatum (Teki et al., 2011). The fixed latency helped ensure that the peak of the BOLD signal captured brain activity corresponding to the manipulation and retrieval of the cued interval from memory rather than earlier stimulus-evoked or subsequent motor activity, with minimal overlap in their hemodynamic response functions (HRFs). Given the poor temporal resolution of fMRI, one cannot be completely confident about the extent to which the scan acquired at the end of each trial was contaminated by effects not related to memory processes during the delay period. However, the manipulation of keeping a fixed latency from the onset of the delay period to the onset of the acquisition of the scan is motivated by the characteristic latency of BOLD responses to sounds in sparse imaging protocols (∼4 s, Belin et al., 1999; Hall et al., 1999) and is a reliable method to obtain pseudo time-locked responses using sparse fMRI (Teki et al., 2011; Talavage et al., 2014).
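The trial timeline can be checked with simple arithmetic. The sketch below uses only the durations stated in the text; the assumption that each trial exactly fills the silent gap between two volume acquisitions is an inference from TR − TA matching the trial duration, and the variable names are illustrative.

```python
# Durations in seconds, as stated in the text
TR = 14.76                 # time to repeat
SLICE_TIME = 0.07          # time to acquire one slice
N_SLICES = 48
SILENCE_PLUS_STIM = 7.4    # variable silence + stimulus
DELAY = 1.5
RESPONSE = 2.0
FEEDBACK = 0.5

TA = SLICE_TIME * N_SLICES                 # time for one volume acquisition
trial_duration = SILENCE_PLUS_STIM + DELAY + RESPONSE + FEEDBACK
silent_gap = TR - TA                       # scanner-quiet period in each TR
# Assumption: the trial fills the silent gap, so the scan starts at trial end
latency_delay_to_scan = silent_gap - SILENCE_PLUS_STIM

print(round(TA, 2), round(trial_duration, 2),
      round(silent_gap, 2), round(latency_delay_to_scan, 2))
# prints: 3.36 11.4 11.4 4.0
```

Under that assumption, the interval from the onset of the delay period to the onset of the scan comes out at 4 s, matching the fixed latency described above.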

### Image Analysis

The analysis of brain imaging data was performed using SPM12 (Wellcome Trust Centre for Neuroimaging, Ashburner, 2012). Each block comprised 66 volume acquisitions, of which the first two volumes were rejected to control for saturation effects. The remaining 64 volumes were realigned to the first volume and unwarped using field map parameters. The structural image was segmented to obtain a bias-corrected structural image that has more uniform intensities within six different tissue classes including gray matter (GM) and white matter (WM). The resulting image was co-registered with the mean functional image obtained after realignment. DARTEL was used to create a series of templates using the GM and WM images (Ashburner, 2007). The final template from this step was affine registered with tissue probability maps (available in SPM12) to obtain spatially normalized images in MNI space (Friston et al., 1995a). The normalized images were smoothed with an isotropic Gaussian kernel of 5 mm full-width at half-maximum (FWHM).
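For readers less familiar with the FWHM convention, the following sketch converts a kernel's FWHM into the standard deviation of the corresponding Gaussian using the standard relation σ = FWHM / (2√(2 ln 2)). The function name is illustrative; the 5 mm value is the functional smoothing kernel given above and the 8 mm value is the kernel reported for the VBM analysis.

```python
import math

def fwhm_to_sigma(fwhm_mm):
    """Convert a Gaussian kernel's FWHM (mm) to its standard deviation (mm)."""
    return fwhm_mm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

for fwhm in (5.0, 8.0):
    print(f"FWHM {fwhm} mm -> sigma {fwhm_to_sigma(fwhm):.2f} mm")
# prints:
# FWHM 5.0 mm -> sigma 2.12 mm
# FWHM 8.0 mm -> sigma 3.40 mm
```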

Statistical analysis of the images was performed using a general linear model (Friston et al., 1995b). Data from the jitter and number-of-interval blocks were analyzed separately using a parametric contrast to examine brain activity that increased as a function of jitter and number-of-intervals respectively. All trials were convolved with an HRF boxcar function and missed trials were modeled as conditions of no interest (separately for each condition) to remove unwanted variance. The data were not high-pass filtered as a sparse design ensures minimal low-frequency variations in the BOLD signal.
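The idea behind a parametric contrast can be sketched as a regression of the response on the mean-centered level of the manipulated factor. This is a simplified illustration only, not the SPM12 implementation: the data are made up, the function name is hypothetical, and a single voxel with one value per trial is assumed.

```python
def fit_parametric_effect(levels, responses):
    """OLS fit of response = b0 + b1 * (level - mean level); returns (b0, b1)."""
    n = len(levels)
    mean_level = sum(levels) / n
    mean_resp = sum(responses) / n
    centered = [lv - mean_level for lv in levels]
    sxx = sum(c * c for c in centered)
    sxy = sum(c * (r - mean_resp) for c, r in zip(centered, responses))
    # With a centered regressor, the intercept equals the mean response
    return mean_resp, sxy / sxx

# Hypothetical voxel: four levels of the factor, two trials per level
levels = [1, 2, 3, 4, 1, 2, 3, 4]
responses = [0.9, 2.1, 2.9, 4.1, 1.1, 1.9, 3.1, 3.9]
b0, b1 = fit_parametric_effect(levels, responses)
print(round(b0, 3), round(b1, 3))
```

A positive slope `b1` corresponds to activity that increases with the factor (jitter or number of intervals); testing the same regressor with a negative weight corresponds to the reverse direction.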

A whole-brain random-effects model was used to account for within-subject variance (Penny and Holmes, 2004). Each subject's first-level contrast images were subjected to second-level t-tests for the primary contrasts of interest: "parametric effect of jitter" and "parametric effect of number of intervals." To examine context-dependent memory encoding for trials that were identical in the two conditions, a separate design based on difference in activations between the jitter versus number-of-interval blocks (and vice-versa) was used. Functional data were visualized on the group-averaged T1-weighted structural scan and activations specific to the cerebellum were overlaid on the high-resolution, spatially unbiased infra-tentorial template (SUIT) atlas of the human cerebellum (Diedrichsen, 2006; Diedrichsen et al., 2009).
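At a single voxel, a second-level test of this kind reduces to a one-sample t-test on the subjects' first-level contrast values. The sketch below illustrates that computation under simplifying assumptions; the contrast values and the function name are hypothetical, and it does not reproduce the SPM12 procedure.

```python
import math
from statistics import mean, stdev

def one_sample_t(values):
    """t statistic for H0: population mean == 0 (df = n - 1)."""
    n = len(values)
    return mean(values) / (stdev(values) / math.sqrt(n))

# Hypothetical first-level contrast estimates at one voxel, one per subject
contrast_values = [0.8, 1.2, 0.3, 0.9, 1.5, 0.4, 1.1, 0.7]
t = one_sample_t(contrast_values)
print(f"t({len(contrast_values) - 1}) = {t:.2f}")
```

Treating subjects as the unit of observation is what allows the result to generalize beyond the sample, since between-subject variability enters the denominator of the statistic.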

Structural brain images were analyzed using voxel-based morphometry (VBM; Ashburner and Friston, 2000). The segmented GM and WM images were imported into DARTEL and a series of template images were created by iteratively matching images to align them with the average-shaped template. The final template obtained in this procedure was normalized to MNI space through an affine registration of the template with tissue probability maps. The resultant images were smoothed with an isotropic Gaussian kernel of 8 mm FWHM. The smoothed images for each individual were entered into a second-level ANOVA to examine brain areas in which GM and WM volume varied as a function of jitter and number of intervals respectively.

## RESULTS

### Behavioral Results

Participants' performance in the scanner was measured by calculating precision, the inverse of the variance of the timing error distribution for both blocks. Precision provides a continuous measure of memory performance and has been used previously in studies of working memory based on the shared resource model (Bays and Husain, 2008; Bays et al., 2009; Kumar et al., 2013; Ma et al., 2014; Teki and Griffiths, 2014; Joseph et al., 2015a,b, 2016).

ANOVA revealed a main effect of jitter (p = 0.02, F = 3.40, η<sup>2</sup> = 0.14) but a non-significant effect of number of intervals [p = 0.36, F(3, 63) = 1.10, η<sup>2</sup> = 0.05] as shown in **Figures 1B,C** respectively. Post-hoc analysis revealed a significant difference between the precision for the least and most irregular conditions in the jitter experiment: p = 0.048, t = 2.05; and a marginal but not significant difference between the precision for the trials with lowest and highest number of intervals: p = 0.10, t = 1.69. Secondary analysis of precision as a function of serial position did not reveal a significant effect for either condition: p = 0.10, F = 2.14, η<sup>2</sup> = 0.09 (jitter block), p = 0.38, F = 1.05, η<sup>2</sup> = 0.05 (number of intervals block).

Although a significant effect of number of intervals was not observed during performance in the scanner, our previous psychophysical work did demonstrate a significant effect: [p = 0.01, F(3, 28) = 4.27, η<sup>2</sup> = 0.31, n = 8; Teki and Griffiths, 2014]. The absence of a behavioral effect in the scanner could be due to a number of reasons: (i) listeners did not receive explicit and adequate training about the number-of-intervals blocks before the experiment, (ii) the number-of-interval blocks were always run after the jitter blocks and could be associated with increased fatigue, (iii) reduced number of trials in the scanner: 2 blocks of 64 trials compared to 4–5 blocks of 96 trials in the psychophysics study, (iv) limited response time and a noisier task environment in the scanner. Further investigation of individual behavioral scores in the number-of-interval blocks revealed the opposite trend in 4 subjects who showed no significant effect: [F(3, 15) = 0.66, p = 0.59, η<sup>2</sup> = 0.14]. A similar ANOVA on the scores of the remaining 12/16 subjects revealed a significant effect of number of intervals: [F(3, 47) = 2.84, p = 0.04, η<sup>2</sup> = 0.16].

### Functional Imaging Results

We analyzed BOLD responses to examine: (i) brain areas that encode memory for time as a function of increasing and decreasing jitter, (ii) brain areas that are activated as a function of increasing and decreasing numbers of intervals, and (iii) the effect of task context in modulating brain activity in response to identical trials across the two conditions.

A priori, we predicted that both cerebellum and striatum would show increased activity as a function of increasing as well as decreasing jitter, but with opposite effects such that cerebellum would be more strongly activated for encoding temporal memory in irregular sequences and the striatum would show elevated activity for regular sequences (Grahn and Brett, 2007; Teki et al., 2011, 2012; Grahn, 2012; Merchant et al., 2013). Secondly, based on previous fMRI work on temporal memory encoding (Rao et al., 2001; Coull et al., 2008), we hypothesized that the striatum would be involved in encoding memory for time as a function of increasing numbers of intervals. Thirdly, we expected that task context would modulate brain activity such that areas that represent the structure of sequences of intervals would show differential responses for trials that were identical in structure during the jitter and number-of-intervals conditions.

#### Effect of Jitter

To answer the first question, data from the blocks with different levels of jitter were analyzed. A parametric contrast was used to examine areas that showed an increase in response as a function of increasing jitter. Results revealed significant clusters in the left cerebellum (lobules I-IV, V) including the vermis as shown in **Figure 2A**. The striatum was also significantly modulated, with clusters in the putamen and pallidum. Other brain areas whose activity was significantly modulated by increasing levels of jitter included the precuneus, the parahippocampal gyrus and the middle temporal gyrus (see **Table 1A**).

Examination of parametric responses in the opposite direction (as a function of decreasing jitter) showed maximum activation in the striatum including the caudate and putamen (**Figure 2B**). We also observed activity in the cerebellum (right posterior lobe); however, the strength of the activation in the cerebellum was weaker than the striatal response (see **Table 1B**). The frontal cortex, temporal pole and thalamus also showed significant activations with decreasing levels of jitter.

#### Table 1A | Brain areas whose activity increased as a function of jitter.


*Local maxima are shown at p* ≤ *0.001 (uncorrected).*

parahippocampal gyrus (overlaid on a coronal section of the average normalized structural scan and zoomed to 80 × 80 mm) at a threshold of *p* < 0.001 (uncorrected, for each figure). Other activations in the precuneus, MTG, and pallidum are listed in Table 1A. The strength of activations (*t*-value) is graded according to the adjacent color scheme on the right (for each figure). (B) Brain areas that encode temporal memory in the context of regular sequences. BOLD activations in the striatum including the caudate and putamen as well as the cerebellum are shown. Other activations in the thalamus, temporal pole, and frontal cortex are listed in Table 1B. The significant clusters are displayed according to the same scheme as in Figure 1A.

### Effect of Number of Intervals

The second question focused on parametric brain responses as a function of increasing numbers of intervals. Results across all subjects revealed significant activations in the bilateral inferior parietal cortex (abutting supramarginal gyrus) and the left caudate nucleus (**Figure 3A**; **Table 2A**). In the 12/16 subjects who showed a significant behavioral effect of number of intervals, similar activations in the inferior parietal cortex were observed as well (x = 33, y = −37, z = 39; t = 4.11, and x = −28, y = −52, z = 39, t = 3.97, respectively). As the number of intervals decreased, activity in the superior cerebellum increased as shown in **Figure 3B**. Other areas to encode memory for time with decreasing number of intervals included the inferior orbitofrontal cortex and the insula (also see **Table 2B**).

### Effect of Task Context

One of the key motivations of the study was to examine whether encoding of time into memory depends on contextual factors like the temporal structure and number of intervals in the sequences. The experiment used an orthogonal design with 32 identical trials in each of the jitter and number-of-interval blocks, with a jitter of 20–25% and 4 intervals in each sequence. A subtraction analysis between jitter vs. number of interval blocks revealed enhanced activity in the right anterior cerebellar lobe and the striatum (including left caudate and bilateral putamen and pallidum) as shown in **Figure 4A**. Other areas included the thalamus, Heschl's gyrus, precuneus, hippocampus, orbitofrontal cortex and the amygdala (see **Table 3A**). The reverse contrast (number of intervals vs. jitter) showed differential activation in the right cerebellar lobule VI (see **Figure 4B**; **Table 3B**).

### Structural Imaging Results

Structural imaging data were analyzed using VBM to investigate correlations between gray and white matter volume (GM; WM) and task performance. Specifically, we wanted to assess whether the key timing areas revealed by previous work (e.g., Grahn and Brett, 2007; Wiener et al., 2010b; Teki et al., 2011) and in the present study, i.e., the cerebellum and the striatum, also showed structural correlations with behavior. The correlations were performed between GM and WM density and precision (for all levels of the factor of interest, i.e., jitter and number of intervals).

found to vary as a function of decreasing number of intervals. The MNI coordinates are provided in Table 2B.

Table 2A | Brain areas whose activity increased as a function of memory load.


*Local maxima are shown at p* ≤ *0.001 (uncorrected).*

Table 2B | Brain areas whose activity decreased as a function of memory load.


*Local maxima are shown at p* ≤ *0.001 (uncorrected).*

We found a significant correlation between precision on trials with increasing jitter and GM volume in the cerebellum (see **Figure 5A**) in a similar region of the cerebellar cortex as implicated in the functional data (Table S1A). In contrast, a correlation between precision on trials with decreasing jitter and GM volume was found in sensory cortical areas including Heschl's gyrus and the superior temporal gyrus (see **Figure 5B**; Table S1B).

Similar analysis between precision on trials with increasing number of intervals and GM volume revealed significant clusters in the caudate (also activated in functional data) as shown in **Figure 5C** (also see Table S2A). The GM volume of the cerebellum was correlated with precision on trials with decreasing load (**Figure 5D**; Table S2B).

Correlation analysis of WM volume as a function of increasing jitter revealed bilateral clusters in the pallidum (**Figure 5E**; Table S3A) whilst no areas were found to be significant in the reverse contrast. The WM volume was also found to be higher in the pallidum as a function of increasing number of intervals (**Figure 5F**; Table S3B). The precuneus was the only area found to show a significant effect in the opposite direction (Table S3C).

## DISCUSSION

We investigated the neural bases of working memory for time intervals in the context of a shared resource model of working memory where the resource is flexibly distributed according to the amount of information to be encoded. We manipulated the information content in sequences by manipulating the temporal regularity and number of intervals, which we hypothesized to affect the working memory load. We examined, from first principles, whether there are core brain areas that are activated through these two manipulations of the resource even though the magnitude of the effect of temporal regularity and number of intervals may be different.

Table 3A | Brain areas activated for jitter vs. number-of-intervals condition.

*Local maxima are shown at p* ≤ *0.001 (uncorrected).*

Table 3B | Brain areas activated for number-of-intervals vs. jitter condition.

*Local maxima are shown at p* ≤ *0.001 (uncorrected).*

Behaviorally, listeners' performance decreased with greater information in the sequence, achieved by manipulating temporal jitter and the number of intervals. The fMRI activations revealed the striatum and cerebellum as core areas for encoding temporal memory as a function of increasing jitter and number of intervals. Additionally, the inferior parietal cortex was also strongly involved in representing time intervals as a function of load. We also analyzed structural correlations between gray and white matter volume and behavior that revealed correlations in the striatum and cerebellum, in line with the functional results. Furthermore, the analysis of context-specific responses for identical trials across the two conditions also revealed activations in the striatum and the cerebellum, suggesting on the whole, a critical role for these two subcortical motor areas in representing time intervals in working memory.

higher WM volume that correlated with performance as a function of increasing memory load associated with the sequences (see Table S3B).

### Effect of Jitter

Behavioral performance showed significant sensitivity to the temporal structure of the sequences (**Figure 1B**). The analyses of the underlying brain responses revealed activation of core timing areas in the cerebellum and the striatum (Buhusi and Meck, 2005; Ivry and Schlerf, 2008; Teki et al., 2011). Temporal context of the sequences of intervals provides a basis to distinguish the timing functions of the cerebellum and the striatum: whilst the cerebellum is associated with absolute, duration-based timing of intervals in irregular sequences, the striatum in coordination with fronto-striatal loops mediates relative, beat-based timing (Teki et al., 2012; Allman et al., 2014). This dissociation is supported by several lines of evidence: behavioral work (Monahan and Hirsh, 1990; Yee et al., 1994; Pashler, 2001; McAuley and Jones, 2003), neuropsychological assessment of patients (Grube et al., 2010; Cope et al., 2014a,b), motor timing studies (Schlerf et al., 2007; Spencer et al., 2007), and neuroimaging studies (Grahn and Brett, 2007; Teki et al., 2011; Grahn and Rowe, 2013). We have previously suggested a synergistic relationship between the striatum and the cerebellum whereby the striatum serves as a default clock and the cerebellum serves to encode the error in the timing activity of the striatal clock (Teki et al., 2012; Allman et al., 2014). Other timing models like the Striatal Beat Frequency model (SBF; Matell and Meck, 2004; Buhusi and Meck, 2005; Meck et al., 2008) based on coincident activity in the medium spiny neurons in the striatum do not address timing in sequences containing several intervals and the effect of temporal jitter.

The present data suggest that in addition to perception of time, the cerebellum and striatum also represent memory for time with the level of activation depending on the temporal context of the sequences. The cerebellum and vermis (see **Table 1A** for precise locations within the cerebellum) were more strongly activated as a function of increasing jitter compared to the putamen and pallidum whilst the caudate and putamen were more active relative to the cerebellum as a function of decreasing jitter. Other memory-related areas that were activated as a function of increasing jitter included the precuneus (the posteromedial portion of the parietal lobe) and the parahippocampal cortex. These two areas are involved in encoding and retrieval of episodic memory but have not been specifically implicated in temporal processing before. The activation of these areas suggests a link between subcortical timing areas and higher-order memory related areas in the medial temporal lobe that remains to be investigated.

It is important to note that sound-evoked activity is also observed in the cerebellum (e.g., Wolfe, 1972; Jastreboff and Tarnecki, 1975) and the basal ganglia (Hikosaka et al., 1989). Although it can be argued that the observed BOLD activations might capture sound-evoked responses, it is unlikely that such responses would scale as a function of jitter or number of intervals. Thus, the parametric analysis reported in the present study can be assumed to primarily reflect temporal processing activity.

### Effect of Number of Intervals

We also varied the amount of information in the sequences by manipulating the number of intervals. Although the task was based on the recall and reproduction of a single interval, the number-of-intervals condition required representation of multiple intervals in working memory. Activity in the caudate nucleus and the inferior parietal cortex systematically increased with increasing number of intervals in the sequence, consistent with previous event-related fMRI studies on memory for a single time interval (Rao et al., 2001; Coull et al., 2008).

The striatum is widely acknowledged to contribute to working memory (Postle and D'Esposito, 1999; Lewis et al., 2004; McNab and Klingberg, 2008; Darki and Klingberg, 2015) via dopaminergic interactions with frontal cortex (Goldman-Rakic, 1996; Frank et al., 2001). Consistent with this, disorders affecting the basal ganglia including Parkinson's, Huntington's and Multiple Systems Atrophy are associated with impairment on a range of working memory tasks (Robbins et al., 1992; Grahn et al., 2006; Dumas et al., 2013). The role of the striatum and frontal cortex in controlling access to working memory storage (McNab and Klingberg, 2008) is particularly significant in light of the SBF model that emphasizes the role of frontostriatal dopaminergic loops in interval timing. The SBF model posits that striatal medium spiny neurons perform coincidence detection of cortical oscillatory activity, triggered by nigrostriatal dopaminergic signals. These theoretical considerations suggest a close relationship between perception and memory for time in fronto-striatal pathways (Darki and Klingberg, 2015).

The parietal cortex is also implicated in storage of information in working memory (McNab and Klingberg, 2008; Darki and Klingberg, 2015) and shows robust load-sensitive activity in visual working memory tasks (Todd and Marois, 2004; Vogel and Machizawa, 2004; Vogel et al., 2005; Ma et al., 2014). The parametric increase in the activity of the parietal cortex suggests a common framework for working memory processing in the brain that not only applies to storage of sensory information but also to temporal information. Timing activity in the parietal cortex has been demonstrated in nonhuman primates (Leon and Shadlen, 2003; Schneider and Ghose, 2012) as well as humans (Wiener et al., 2010a, 2012; Hayashi et al., 2015). Furthermore, the parietal cortex has also been shown to encode magnitude in general, and process time, space, and number (Walsh, 2003; Bueti and Walsh, 2009). The current data provide converging evidence from the temporal domain that parietal cortex may encode "temporal" magnitude and represent multiple time intervals in working memory.

The activity of the cerebellum (lobule V) was modulated as a function of decreasing load. This is consistent with cerebellar specialization for encoding the absolute duration of single intervals (Grube et al., 2010).

### Effect of Task Context

Behaviorally, there was no difference in precision between the trials that were identical in the jitter and number-of-intervals blocks (32 trials with 25% jitter and 4 intervals): p = 0.64, t = 0.47. However, there was a significant difference in BOLD responses between the two conditions. For a contrast of jitter vs. number-of-intervals, putamen, caudate, and cerebellum (lobule V) showed significant differential activity. The reverse contrast showed enhanced responses in the cerebellum (lobule VI) only. These data suggest that brain areas involved in holding and manipulating time intervals in memory are selectively activated by different task contexts: differential striatal and cerebellar activity for the jitter condition is consistent with previous work on rhythm and time perception (Grahn, 2012; Teki et al., 2012). The activation of cerebellar lobule VI is consistent with the specific role of this cerebellar sub-region in verbal working memory (Koziol et al., 2014), which may be attributed to its role in temporal sequencing of internal motor traces representing inner speech (Marvel and Desmond, 2010).

### Structural Correlation with Behavior

VBM correlation analysis was performed to assess whether the gray and white matter volume of specific temporal processing regions correlated with behavioral performance in the jitter and number-of-intervals conditions. In the absence of previous work on correlates between brain structure and timing behavior, we did not have strong well-defined anatomical hypotheses and, therefore, examined correspondence between the functional and structural brain data.

GM volume in the cerebellum (lobule V) correlated with behavior as the jitter increased, consistent with greater functional response in the same cerebellar sub-region. On the other hand, the GM volume of Heschl's gyrus correlated with listeners' performance on regular trials. As the sequences become more regular, stronger phase-locking to the clicks at low rates (2 Hz) may explain the correlation observed in the auditory cortex. For the memory task, the GM volume of the caudate correlated with behavioral performance as the load increased. The reverse correlation was found in the cerebellum as a function of decreasing load.

Correlation between WM volume and behavior showed effects in the pallidum as a function of both increasing jitter and load. This result is consistent with recent evidence from a longitudinal study that revealed a correlation between working memory capacity and the fractional anisotropy (FA) and the WM volume of fronto-striatal tracts (Darki and Klingberg, 2015). More specifically, they found that FA in white matter tracts and activity in the caudate predict future working memory capacity. Overall, the VBM results show strong correspondence with the functional data and highlight the importance of the cerebellum and the striatum in representation of temporal memory.

## CONCLUSIONS

We have demonstrated using fMRI that working memory for time intervals is implemented in a core resource in the striatum and the cerebellum, achieved through manipulating the information content by varying the regularity and number of intervals in sequences. These results are supported by concordant structural correlations with behavior in the same areas. Our results highlight functional and structural correlates of a flexible working memory resource for time intervals in rhythmic sequences and provide a strong basis to examine the underlying neural correlates of context-dependent memory for time, e.g., beta-band oscillations in the auditory-motor pathways (Iversen et al., 2009; Fujioka et al., 2012; Teki, 2014; Bartolo and Merchant, 2015), using techniques with higher temporal resolution than fMRI.

### AUTHOR CONTRIBUTIONS

ST designed the study; ST collected and analyzed the data; ST and TG wrote the manuscript.

### ACKNOWLEDGMENTS

This work was supported by the Wellcome Trust (WT091681MA awarded to TG). ST is supported by the Wellcome Trust (WT106084/Z/14/Z). We thank the Physics and Radiology group at the Wellcome Trust Centre for Neuroimaging for technical support.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnins.2016.00239

of regular tapping in the monkey. J. Neurosci. 35, 4635–4630. doi: 10.1523/JNEUROSCI.4570-14.2015


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Teki and Griffiths. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Look at the Beat, Feel the Meter: Top–Down Effects of Meter Induction on Auditory and Visual Modalities

Alexandre Celma-Miralles<sup>1</sup> \*, Robert F. de Menezes<sup>1</sup> and Juan M. Toro<sup>1,2</sup>

<sup>1</sup> Information and Communication Technologies Engineering (ETIC), Language and Comparative Cognition Group – Center for Brain and Cognition, Universitat Pompeu Fabra, Barcelona, Spain, <sup>2</sup> Institució Catalana de Recerca i Estudis Avançats, Barcelona, Spain

Recent research has demonstrated top–down effects on meter induction in the auditory modality. However, little is known about these effects in the visual domain, especially without the involvement of motor acts such as tapping. In the present study, we aim to assess whether the projection of meter on auditory beats is also present in the visual domain. We asked 16 musicians to internally project binary (i.e., a strong-weak pattern) and ternary (i.e., a strong-weak-weak pattern) meter onto separate, but analog, visual and auditory isochronous stimuli. Participants were presented with sequences of tones or blinking circular shapes (i.e., flashes) at 2.4 Hz while their electrophysiological responses were recorded. A frequency analysis of the elicited steady-state evoked potentials allowed us to compare the frequencies of the beat (2.4 Hz), its first harmonic (4.8 Hz), the binary subharmonic (1.2 Hz), and the ternary subharmonic (0.8 Hz) within and across modalities. Taking the amplitude spectra into account, we observed an enhancement of the amplitude at 0.8 Hz in the ternary condition for both modalities, suggesting meter induction across modalities. There was an interaction between modality and voltage at 2.4 and 4.8 Hz. Looking at the power spectra, we also observed significant differences from zero in the auditory, but not in the visual, binary condition at 1.2 Hz. These findings suggest that meter processing is modulated by top–down mechanisms that interact with our perception of rhythmic events and that such modulation can also be found in the visual domain. The reported cross-modal effects of meter may shed light on the origins of our timing mechanisms, partially developed in primates and allowing humans to synchronize across modalities accurately.

Keywords: beat perception, meter induction, cross-modal timing mechanisms, music evolution

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Gábor Péter Háden, Hungarian Academy of Sciences, Hungary Makiko Sadakata, Radboud University Nijmegen and University of Amsterdam, Netherlands

#### \*Correspondence:

Alexandre Celma-Miralles alexandre.celma@upf.edu

Received: 19 December 2015 Accepted: 28 February 2016 Published: 23 March 2016

#### Citation:

Celma-Miralles A, de Menezes RF and Toro JM (2016) Look at the Beat, Feel the Meter: Top–Down Effects of Meter Induction on Auditory and Visual Modalities. Front. Hum. Neurosci. 10:108. doi: 10.3389/fnhum.2016.00108

## INTRODUCTION

Metrical structure is fundamental for our perception of rhythm in music. It allows humans to process the temporal events of music in an organized manner. Metrical structure is based on two distinct processes: beat extraction and meter induction (Fitch, 2013). The former consists of extracting an isochronous beat from a stream of events. This results in beats appearing as periodic points over time. The latter consists of the hierarchical organization of these periodic beats into sequences of strong and weak patterns. The downbeat (the perceptually prominent beat) usually occurs at a subharmonic frequency of the beat, such as 2:1, 3:1, or other more complex integer ratios. The saliency of the downbeat is usually elicited by variations of loudness, pitch, or timbre in the perceived sound. It can also be generated endogenously via an active top-down process of rhythmic perception (Nozaradan et al., 2011; London, 2012). This endogenous sense of regular alternations of strong and weak patterns is commonly termed subjective metricization (Keller, 2012). In short, meter induction organizes periodic beats in a hierarchical manner and can be modulated voluntarily.

Recent research has advanced the identification of neural correlates of beat perception and meter induction. It has been demonstrated that neural activity increases at the frequencies corresponding to the beat and those corresponding to an induced meter (see Nozaradan et al., 2011, 2012). A frequency-tagging approach by means of electroencephalography (EEG) has been used to explore these neural substrates (Nozaradan, 2014; Nozaradan et al., 2015). This method uses periodic properties of the stimuli to induce steady-state evoked potentials (SS-EPs): changes in voltage that are stable in phase and amplitude over time. These neural responses can be elicited via different modalities (Vialatte and Maurice, 2009; Vialatte et al., 2010a; Nozaradan et al., 2012) and are easily recorded with EEG. Once SS-EPs are analyzed in the frequency domain, narrow-band peaks appear at the frequencies corresponding to the external stimuli. In addition to the emergence of these peaks reflecting bottom–up processing, those frequencies representing the beat and the meter are selectively enhanced. Crucially, top–down effects of meter induction are also captured, such as when meter is internally driven and imposed on the stimuli. When binary (march-like) and ternary (waltz-like) meters were mentally projected onto an auditory stimulus, Nozaradan et al. (2011) found an enhancement at the frequencies corresponding to the subharmonics of the beat: f/2 and f/3, respectively. Thus, the increase of neural activity at selected frequencies reflects the voluntary control of meter induction, that is, the top–down processing that guides attention toward relevant points in the stimulus.

So far, the neural correlates of meter induction have primarily been studied in the auditory modality. There is an open debate about whether meter is restricted to the auditory modality or whether other modalities can access meter. Recent behavioral research has revealed that metrical structure is also apparent in visual stimuli. Lee et al. (2015) presented participants with videos of choreographies marking strong and weak beats at the same time as isochronous sounds. In their first experiment, deviant timbre sounds were presented in different metrical positions. Participants were slower in reacting to deviant sounds placed at the visually inferred strong positions, which suggests visual meter induction and a split of attention between modalities. Although beat and meter have been tested in other modalities, such as visual and tactile, these studies mostly involve motor acts, such as synchronized tapping (reviewed in Repp, 2005; Repp and Su, 2013). For example, it has been found that the act of moving to a certain meter leads to listening preferences in infants (Phillips-Silver and Trainor, 2005, 2008). Movement also shapes the internal representation of auditory rhythms by enhancing the EEG signal at the metrical frequency (Chemin et al., 2014). However, it is difficult to disentangle the relative contribution of the motor act itself from metrical effects in these sensorimotor synchronization (SMS) studies.

There are good reasons to believe that beat processing and meter induction may be amodal. Several species have been found to effectively use interval-based and beat-based mechanisms in both the auditory and the visual modalities (Repp, 2005; Repp and Su, 2013; Patel and Iversen, 2014; Ravignani et al., 2014; Merchant et al., 2015). For example, visual rhythm synchronization has been attested in the behavior of some insects, such as fireflies (Buck and Buck, 1968), and in controlled laboratory studies with non-human primates, such as macaques (Zarco et al., 2009). Moreover, the accurate synchronizations observed during dance and music across human cultures (Fitch, 2013) suggest that these timing mechanisms might not only engage the visual modality, but also make use of a non-domain-specific meter.

The possibility that rhythm synchronization engages a domain-general timing mechanism makes it necessary to explore the extent to which we can also identify neural correlates for meter induction across modalities. The present work aims to explore meter induction in the visual modality by comparing the effects of endogenously driven meter projected onto auditory and visual periodic stimuli. We apply a frequency-tagging approach to the SS-EPs elicited during meter induction without the involvement of motor behavior. Our objective is to explore the extent to which meter induction is domain-specific: whether it is tightly constrained to the auditory modality, or whether similar neural correlates can be found across modalities. As visually transmitted meter is observed in some natural settings, such as musicians following a conductor in an orchestra, or dancers synchronizing with each other, our prediction is that we should observe similar metrical effects in both the auditory and the visual modalities.

### MATERIALS AND METHODS

### Participants

Sixteen healthy musicians were included in the present study (9 females, 6 left-handers, mean age: 23.38 ± 3.85, age range: 18–35). There were 17 participants in total, but one male was excluded due to an excess of artifacts in the EEG data. All participants had extensive musical experience, starting at 6.31 ± 2.30 years of age, and six of them reported some training in dance. We chose to recruit musicians because it is clear to them what differentiates binary from ternary meter and because they have extensive experience extracting metrical cues from audiovisual sequences of rhythms. No participant reported any history of hearing, visual, motor, or psychiatric disorders, and all participants had normal or corrected-to-normal vision. All participants signed a written consent form and received payment for their participation in the study.

### Ethics Statement

All procedures were approved by the ethical committee from the Universitat Pompeu Fabra.

### Stimuli

Stimuli consisted of isochronous sequences of either tones (in the auditory condition) or flashes (in the visual condition). They were presented at a frequency of 2.4 Hz (IOI = 416.66 ms). Every sequence lasted for 35 s and comprised 84 tones or flashes. All the target frequencies fell within the ecological range of tempo perception and production (Vialatte and Maurice, 2009; Nozaradan et al., 2011). Every event (tone or flash) within the sequence progressively diminished until the next one appeared (see **Figure 1**), thus marking the onset of the beat with the maximum intensity of sound (in the auditory condition) or with the maximum intensity of light (in the visual condition). Each condition had eight auditory or visual 35-s sequences. After each 35-s sequence, the pitch or the color of the stimuli was changed in order to maintain participants' attention. Within each 35-s sequence, the stimuli were the same.
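The timing of the sequences described above can be checked with a few lines of arithmetic. The following is a minimal sketch: the presentation rate and sequence duration come from the text, whereas the linear shape of the decaying envelope and all function and variable names are assumptions made for illustration only.

```python
# Timing arithmetic for the isochronous sequences.
BEAT_HZ = 2.4          # presentation rate from the text
SEQUENCE_S = 35.0      # duration of one sequence from the text

ioi_ms = 1000.0 / BEAT_HZ                 # inter-onset interval in ms
n_events = round(SEQUENCE_S * BEAT_HZ)    # tones or flashes per sequence

# Frequencies of interest: ternary and binary subharmonics, beat, first harmonic.
targets_hz = {
    "ternary (f/3)": BEAT_HZ / 3,
    "binary (f/2)": BEAT_HZ / 2,
    "beat (f)": BEAT_HZ,
    "first harmonic (2f)": BEAT_HZ * 2,
}


def envelope(t_s, beat_hz=BEAT_HZ):
    """ASSUMED linear decay: 1.0 at each onset, falling toward 0 before the next."""
    phase = (t_s * beat_hz) % 1.0   # position within the current beat cycle, 0..1
    return 1.0 - phase


print(f"IOI = {ioi_ms:.2f} ms, events per sequence = {n_events}")
for name, hz in targets_hz.items():
    print(f"{name}: {hz:.1f} Hz, period {1000.0 / hz:.1f} ms")
```

Under these values, the binary and ternary downbeats recur every two and three events, respectively, which is why the corresponding subharmonics fall at 1.2 and 0.8 Hz.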

The auditory stimuli were presented at a comfortable hearing level through two speakers placed 70 cm in front of the participant. We converted a pure sinusoidal tone into stereo by using Audacity. These pure tones were raised by a semitone in each 35-s sequence, going from an F4 up to a C5. The entire auditory condition, therefore, consisted of eight different sequences of 84 repeated sinusoidal tones: F4 (349.2 Hz), F#4 (370.0 Hz), G4 (392.0 Hz), G#4 (415.3 Hz), A4 (440.0 Hz), A#4 (466.2 Hz), B4 (493.9 Hz) and C5 (523.3 Hz).
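The eight tone frequencies listed above are consistent with 12-tone equal temperament. A minimal sketch of that relation follows, assuming the conventional tuning reference A4 = 440 Hz; the function and variable names are illustrative and not part of the original stimulus code.

```python
# Equal-tempered frequencies for the eight tones, assuming A4 = 440 Hz.
A4_HZ = 440.0
NOTE_NAMES = ["F4", "F#4", "G4", "G#4", "A4", "A#4", "B4", "C5"]
SEMITONES_FROM_A4 = [-4, -3, -2, -1, 0, 1, 2, 3]


def equal_tempered_hz(semitones_from_a4, a4_hz=A4_HZ):
    """Frequency of a pitch n semitones away from A4 in 12-tone equal temperament."""
    return a4_hz * 2.0 ** (semitones_from_a4 / 12.0)


tone_hz = {
    name: round(equal_tempered_hz(n), 1)
    for name, n in zip(NOTE_NAMES, SEMITONES_FROM_A4)
}
print(tone_hz)
```

Each step multiplies the frequency by the twelfth root of two, so the full range from F4 to C5 spans seven semitones.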

The visual stimuli were presented on a computer screen (1280 × 1024 resolution) placed 70 cm in front of the participant. A colored circle with a radius of 25 mm was placed at the center of the screen on a black background. To create the blinking effect, we progressively diminished its luminance until it turned completely black (see **Figure 1**). These flashes changed color after every 35-s sequence of 84 flashes with the following RGB progression: red (255 0 0), orange (255 128 0), yellow (255 255 0), green (128 255 0), turquoise (0 255 128), light blue (0 255 255), dark blue (0 0 255), and violet (128 0 255). The auditory and visual stimuli were created using Matlab (v.2013, The MathWorks) and presented with Psychophysics Toolbox extensions (Brainard, 1997; Pelli, 1997).

### Procedure

Participants were seated in a comfortable armchair in a soundproof room with the keyboard placed on their lap. Our study consisted of three conditions: the passive beat condition (which served as a control), the binary imagery task, and the ternary imagery task. These three conditions were the same for the auditory and visual modalities. They were pseudo-randomly presented to each participant, with the control condition always presented first, and the visual and auditory modalities alternated, forming a total of 16 possible combinations. During the control condition, the participants were passively either looking at the flashes or listening to the sounds. In order to avoid the induction of an involuntary meter in the control condition, the well-known tick-tock effect (Brochard et al., 2003), we reminded the participants at the beginning of each 35-s sequence to perceive each beat individually, that is, as independent from the previous and the following tone or flash. During the binary and ternary tasks, in contrast to the control condition, participants were asked to mentally project a binary structure (strong–weak pattern) or a ternary structure (strong–weak–weak pattern) onto the same perceptual stimuli presented during the control condition. In other words, they were asked to silently project a metrical structure focusing on the subharmonics of the beat: f/2 (1.2 Hz) for the binary meter and f/3 (0.8 Hz) for the ternary meter. Participants were asked to start the meter imagery task as soon as the first stimulus was presented. At the end of each 35-s sequence, they had to report whether the last beat was strong or weak as a way of maintaining their attention and making sure they were focused on the task. Because the participants had to press the space bar to go on to the next block, they were allowed to take a break when needed. 
Once the study finished, a questionnaire was given to the participant to evaluate and comment on the stimuli and the difficulty of each task. Psychtoolbox was used to run the experiment.

### Electrophysiological Recording

The EEG signal was recorded using a BrainAmp amplifier and the BrainVision Analyzer Software package (v.2.0; Brain Products) using an actiCAP with 60 electrodes placed on the scalp according to the International 10/10 system (Fp1, Fp2, AF7, AF3, AF4, AF8, F7, F3, F1, Fz, F2, F4, F8, FT9, FT7, FT8, FT10, FC5, FC3, FC1, FC2, FC4, FC6, C5, C3, C1, Cz, C2, C4, C6, T7, T8, TP9, TP7, TP8, TP10, CP5, CP3, CP1, CPz, CP2, CP4, CP6, P7, P5, P3, P1, Pz, P2, P4, P6, P8, PO9, PO3, POz, PO4, PO10, O1, Oz, O2). Vertical and horizontal eye movements were monitored using two electrodes placed on the infra-orbital ridge and the outer canthus of the right eye. Two additional electrodes were placed on the left and right mastoid. The signals were referenced online to the FCz channel and all electrode impedances were kept below 25 kΩ. The signals were amplified and digitized at a sampling rate of 1000 Hz.

### Data Analyses

Preprocessing of the continuous EEG recordings was implemented using BrainVision Analyzer 2.1 (Brain Products GmbH). First, any channel that appeared flat or noisy was interpolated from the surrounding channels via spherical spline interpolation. All the channels were then filtered using a zero-phase Butterworth filter to remove slow drifts in the recordings, with a high-pass filter at 0.1 Hz (48 dB/oct) and a low-pass filter at 10 Hz (time constant 1.591549, 48 dB/oct). Channels with EEG exceeding ±100 µV, activity <0.5 µV, or voltage steps >50 µV per sampling point within intervals of 200 ms were automatically detected offline. Subsequently, eye blinks and muscular movements were corrected using Ocular Correction ICA. Finally, the filtered EEG data was segmented into epochs of 36 s (corresponding to the 35-s sequences with an extra second in the beginning) for each condition and modality. These files were then exported to Matlab. All further analyses were performed in Matlab and SPSS (version 19, IBM).
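The reported time constant of 1.591549 can be related to a cutoff frequency through the standard first-order relation τ = 1/(2πf). The sketch below applies that relation; the function name is illustrative, and the observation that the value matches the 0.1 Hz cutoff is an inference from the arithmetic rather than a statement made in the text.

```python
import math


def time_constant_s(cutoff_hz):
    """Time constant (in seconds) of a first-order filter with the given cutoff."""
    return 1.0 / (2.0 * math.pi * cutoff_hz)


# The 0.1 Hz cutoff yields the time constant quoted in the filter description.
print(round(time_constant_s(0.1), 6))
# For comparison, the 10 Hz cutoff corresponds to a much shorter time constant.
print(round(time_constant_s(10.0), 6))
```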

For each condition and modality, eight epochs lasting 32.5 s were obtained by removing the first 3.5 s of each sequence. This removal, as justified in Nozaradan et al. (2011), discards the evoked potential related to the stimulus onset and relies on the fact that SS-EPs require several repetitions or cycles to be elicited (Repp, 2005; Vialatte et al., 2010a). In order to enhance the signal-to-noise ratio and attenuate activities that are not phase locked to the auditory and visual stimuli, the EEG epochs for each participant, modality, and condition were averaged across trials. To get the signal's amplitude (in µV), we applied a fast Fourier transform (FFT), and to get its power (in µV<sup>2</sup>), we squared the modulus of the FFT. Both frequency spectra ranged from 0 to 500 Hz with a frequency resolution of 0.0305 Hz. The obtained signal is assumed to correspond to the EEG activity induced by the physical stimuli and the meter imagery. However, it may also include residual background noise due to spontaneous activity. Two different signal-to-noise techniques were applied: the subtraction method used in Nozaradan et al. (2011) for the amplitude spectrum (in µV) and a relative measure that compared each individual's binary and ternary meter values to their own beat condition as a baseline for the power spectrum (converting µV<sup>2</sup> into decibels), thus minimizing variation due to group variability.
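The relation between epoch length, sampling rate, and spectral resolution, as well as the step from amplitude to power, can be sketched as follows. This is an illustration under stated assumptions and not the analysis code used in the study: the zero-padding to 2^15 points is only a guess that happens to reproduce the reported 0.0305 Hz resolution, the 2/N amplitude scaling is a common convention not given in the text, and the signal is a synthetic sine standing in for an averaged epoch.

```python
import cmath
import math

FS_HZ = 1000.0   # sampling rate from the text
EPOCH_S = 32.5   # epoch length after trimming, from the text

n_samples = int(FS_HZ * EPOCH_S)            # 32500 samples per epoch
native_resolution_hz = FS_HZ / n_samples    # 1 / 32.5 s
padded_resolution_hz = FS_HZ / 2 ** 15      # ASSUMPTION: zero-padding to 32768 points


def fourier_coefficient(signal, freq_hz, fs_hz=FS_HZ):
    """Fourier sum of the signal evaluated directly at freq_hz."""
    step = -2j * math.pi * freq_hz / fs_hz
    return sum(x * cmath.exp(step * n) for n, x in enumerate(signal))


def amplitude_and_power(signal, freq_hz):
    """Amplitude (modulus scaled by 2/N, an assumed convention) and its square."""
    amplitude = 2.0 * abs(fourier_coefficient(signal, freq_hz)) / len(signal)
    return amplitude, amplitude ** 2


# Synthetic 2.4 Hz sine of amplitude 3 (arbitrary units), for illustration only.
epoch = [3.0 * math.sin(2 * math.pi * 2.4 * n / FS_HZ) for n in range(n_samples)]
amp, power = amplitude_and_power(epoch, 2.4)
print(f"native resolution = {native_resolution_hz:.4f} Hz")
print(f"padded resolution = {padded_resolution_hz:.4f} Hz")
print(f"amplitude = {amp:.2f}, power = {power:.2f}")
```

Because 32.5 s contains a whole number of cycles at 2.4 Hz, the synthetic sine is recovered without leakage in this toy case; real EEG spectra are not expected to be this clean.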

For the amplitude spectrum, noise was removed by subtracting the averaged amplitude of the two surrounding non-adjacent frequency bins, ranging from −0.15 to −0.09 Hz and from 0.09 to 0.15 Hz, at each frequency bin from 0.5 to 5 Hz. For the power spectrum, we assume that the beat condition serves as a baseline, since the subjects were passively listening to the stimuli, and that their EEG activity corresponds to the processing of beat without any top–down projection of meter. We took the power spectrum of each meter imagery condition for each participant and divided it by their own control condition. Subsequently, we converted these values into dB by taking the log<sub>10</sub> of each value and multiplying by 10. Following this procedure, we do not need to compare the selected frequency bins among conditions and modalities against the control, but instead check whether the obtained values at each frequency of interest differ from zero. This procedure makes the comparison between modalities more reliable, as the effect of meter is relative to the control condition of each modality.
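The two signal-to-noise procedures described above can be sketched in a few lines. This is a hedged illustration, not the original Matlab code: the neighbor offsets of 3 to 5 bins are an assumption (at about 0.0305 Hz per bin they roughly span the stated 0.09 to 0.15 Hz range), and the spectra are toy values.

```python
import math


def subtract_noise(spectrum, k, offsets=(3, 4, 5)):
    """Amplitude at bin k minus the mean amplitude of neighboring bins on both sides.

    The offsets are an ASSUMED mapping of the 0.09-0.15 Hz range onto bins.
    """
    neighbors = [spectrum[k - d] for d in offsets] + [spectrum[k + d] for d in offsets]
    return spectrum[k] - sum(neighbors) / len(neighbors)


def relative_power_db(condition_power, control_power):
    """Power of a meter condition relative to the beat control, in decibels."""
    return 10.0 * math.log10(condition_power / control_power)


# Toy amplitude spectrum: flat noise floor of 1.0 with a peak of 4.0 at bin 10.
toy_spectrum = [1.0] * 21
toy_spectrum[10] = 4.0
print(subtract_noise(toy_spectrum, 10))

# Equal power in both conditions gives 0 dB; doubled power gives about 3 dB.
print(relative_power_db(1.0, 1.0))
print(round(relative_power_db(2.0, 1.0), 2))
```

The decibel measure is zero when a meter condition and the control have equal power, which is what makes a test against zero meaningful.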

In order to correct for spectral leakage from our target frequencies in the amplitude spectrum, we averaged every target frequency bin ±0.0305 Hz, that is, the three frequency bins centered on the target frequencies, as Nozaradan et al. (2011) did. However, to compare between the peaks, we only took the values from every target frequency bin. We did not apply this technique for spectral leakage to the original power spectrum because, after using the baseline correction method, the induced activity was concentrated in a single bin. The mean of all electrodes across each individual's scalp was calculated for each condition and modality at each target frequency, revealing multiple peaks in the data. For the amplitude spectra, the values for each target frequency (0.8, 1.2, 2.4, 4.8 Hz) were separately submitted to two-way repeated measures ANOVAs with the factors Modality (auditory and visual) and Condition (beat, binary, ternary). When one of the ANOVA factors was significant, post hoc pairwise comparisons were performed with Fisher's LSD and Bonferroni. Effect sizes were expressed using the partial η<sup>2</sup>. For the power spectra, one-sample t-tests were used to see if the values at each target frequency significantly differed from zero. The significance level was set at p < 0.05 for all statistical analyses.
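The difference between the 3-bin average and the single top-of-the-peak value can be illustrated with two toy peaks. The numbers below are hypothetical and chosen only to show that a sharp peak and a wide peak can share the same 3-bin mean while differing in height; the function name is illustrative.

```python
def three_bin_mean(spectrum, k):
    """Mean amplitude of the target bin and its two immediate neighbors."""
    return (spectrum[k - 1] + spectrum[k] + spectrum[k + 1]) / 3.0


# Hypothetical spectra (arbitrary units) centered on bin 3.
sharp_peak = [0.0, 0.0, 0.5, 2.0, 0.5, 0.0, 0.0]   # energy concentrated in one bin
wide_peak = [0.0, 0.5, 0.75, 1.5, 0.75, 0.5, 0.0]  # energy spread over several bins

for name, spectrum in (("sharp", sharp_peak), ("wide", wide_peak)):
    print(name, "3-bin mean:", three_bin_mean(spectrum, 3), "peak:", spectrum[3])
```

In this toy case, the two measures would lead to different conclusions about which peak is larger, which is why both analyses are worth reporting.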

The present approach does not deal with topographical effects because our hypothesis does not predict any region of interest for a domain-general meter induction, although certain correlations could appear (such as occipital areas showing a strong connection to the visual modality of the stimuli). This is the reason why the means from every electrode were averaged across the scalp, thereby excluding selection biases.

FIGURE 2 | Six amplitude spectra depicting the amplitude (µV) of the averaged EEG signal at each frequency between 0.5 and 5 Hz. The auditory (left column) and visual (right column) modalities are split into three conditions: the beat control (first row), the binary meter imagery task (second row), and the ternary meter imagery task (third row). The mean of all participants' amplitudes (gray lines) is depicted in red for each condition and modality. The frequencies of the ternary meter, binary meter, beat, and its first harmonic are signaled with gray dotted lines and colored triangles over the abscissa: 0.8 Hz (green), 1.2 Hz (red), 2.4 Hz (blue) and 4.8 Hz (magenta).


TABLE 1 | Results of the four two-way repeated measures ANOVAs applied to the averaged amplitudes (3 bins) of each frequency of interest: 0.8, 1.2, 2.4, and 4.8 Hz.


The two levels of Modality (auditory and visual) and Condition (beat control, binary-meter, and ternary-meter) are shown accompanied by their interactions. For each level and interaction, degrees of freedom, F-statistics, p-values, and effect sizes are reported. <sup>1</sup>A Greenhouse–Geisser correction was applied to correct for violations of sphericity. <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01.

### RESULTS

### Amplitude Spectra

In **Figure 2**, the mean of all participants' amplitudes (red line) is plotted over each individual's amplitude spectra (gray lines) at the target frequencies (0.8, 1.2, 2.4, 4.8 Hz). A clear peak appears at the frequency of the stimuli (2.4 Hz) in all the conditions (beat, binary, ternary). Similarly, a peak is observed for the first harmonic (4.8 Hz) in all three conditions. However, there seem to be differences across conditions regarding the sub-harmonics. The peak at 1.2 Hz only appears in the auditory binary condition, while the peak at 0.8 Hz is found in both auditory and visual ternary conditions. Furthermore, there is a larger peak at the beat frequency for all three auditory conditions compared to their visual analogs, whereas the inverse effect occurs at the frequency of the first harmonic, depicting a larger peak for all three visual conditions.

A two-way repeated measures ANOVA [Modality (auditory and visual) × Condition (beat, binary, ternary)] was applied to each frequency of interest (0.8, 1.2, 2.4, 4.8 Hz) separately. **Table 1** reports the values obtained from the ANOVAs for Modality, Condition, and their interaction. For the frequency of the first harmonic of the beat (4.8 Hz), there was a main effect of Modality: F(1,15) = 14.945, η<sup>2</sup> = 0.499, p = 0.002. For the ternary subharmonic of the beat (f/3 = 0.8 Hz), a main effect of Condition was observed [F(2,30) = 5.018, η<sup>2</sup> = 0.251, p = 0.013]. However, for the binary subharmonic of the beat (f/2 = 1.2 Hz), no main effect of Condition was found [F(2,30) = 1.599, η<sup>2</sup> = 0.096, p = 0.219]. Post hoc pairwise comparisons (see **Table 2**) using a Bonferroni adjusted alpha indicated that (i) the averaged amplitudes of the ternary condition were greater than those of the beat control at 0.8 Hz (MD = 0.454, p = 0.046), and (ii) the averaged amplitudes of the visual modality were greater than those of the auditory modality at 4.8 Hz (MD = 1.091, p = 0.002). Interestingly, less restrictive post hoc pairwise comparisons using Fisher's Least Significant Difference (LSD) showed that the averaged amplitudes of the ternary condition were also greater than those of the binary condition at 0.8 Hz (MD = 0.434, p = 0.022).
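The reported effect sizes can be cross-checked against the F-statistics and degrees of freedom, since partial η<sup>2</sup> equals F·df<sub>effect</sub> / (F·df<sub>effect</sub> + df<sub>error</sub>). The sketch below applies this identity to the three effects quoted in the paragraph above; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F-statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (F, df_effect, df_error, reported partial eta squared), as quoted in the text.
reported = [
    (14.945, 1, 15, 0.499),  # Modality at 4.8 Hz
    (5.018, 2, 30, 0.251),   # Condition at 0.8 Hz
    (1.599, 2, 30, 0.096),   # Condition at 1.2 Hz
]

for f_value, df1, df2, eta in reported:
    recomputed = round(partial_eta_squared(f_value, df1, df2), 3)
    print(f"F({df1},{df2}) = {f_value}: recomputed {recomputed}, reported {eta}")
```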

We found modality differences regarding the first harmonic of the beat (4.8 Hz), but not regarding the beat (2.4 Hz), even though the peaks were apparently higher in the auditory modality (see **Figure 2**). This counterintuitive finding may be related to the way we averaged the amplitudes for each peak, which assumed that considerable leakage was taking place. However, the amplitude spectra show that the peak for the beat in the auditory modality was sharper and larger, while that in the visual modality was wider and shorter.

TABLE 2 | Post hoc pairwise comparisons from the two-way repeated measures ANOVAs of the averaged amplitudes (3 bins) with and without adjustments for multiple comparisons.


For the frequencies 0.8, 2.4, and 4.8 Hz, the mean differences, standard deviations, and p-values (Fisher's LSD as non-readjusted alpha, Bonferroni as readjusted alpha) are reported. Significance level was always kept below 0.05. The condition 'Beat' stands for the beat control, the condition 'Bin.' stands for the binary meter, and the condition 'Tern.' stands for the ternary meter. The modality 'Aud.' stands for audition, and the modality 'Vis.' stands for vision. <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01.



The two levels of Modality (auditory and visual) and Condition (beat control, binary-meter, and ternary-meter) are shown accompanied by their interactions. For each level and interaction, degrees of freedom, F-statistics, p-values, and effect sizes are reported. <sup>1</sup>A Greenhouse–Geisser correction was applied to correct for violations of sphericity. <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01.

A second two-way repeated measures ANOVA [Modality (auditory and visual) × Condition (beat, binary, ternary)] was applied to the top-of-the-peak values of each frequency of interest (0.8, 1.2, 2.4, 4.8 Hz) separately, as reported in **Table 3**. For the frequency of the beat (2.4 Hz), there was a main effect of Modality: F(1,15) = 14.215, η<sup>2</sup> = 0.487, p = 0.002. For the ternary subharmonic of the beat (f/3 = 0.8 Hz), a main effect of Condition was observed [F(1.345,20.178) = 5.842, η<sup>2</sup> = 0.280, p = 0.018]. For the binary subharmonic of the beat (f/2 = 1.2 Hz), a main effect of Condition [F(1.422,21.334) = 4.609, η<sup>2</sup> = 0.235, p = 0.032] and an interaction were attested [F(2,30) = 1.783, η<sup>2</sup> = 0.106, p = 0.026]. Post hoc pairwise comparisons (see **Table 4**) using a Bonferroni adjusted alpha indicated that (i) the peak amplitudes of the ternary condition were higher than those of the beat control at 0.8 Hz (MD = 1.178, p = 0.018), (ii) the peak amplitudes of the binary condition were higher than those of the beat control at 1.2 Hz (MD = 0.655, p = 0.050), and (iii) the peak amplitudes of the auditory modality were higher than those of the visual modality at 2.4 Hz (MD = 3.592, p = 0.002). Less restrictive post hoc pairwise comparisons using Fisher's LSD showed that the peak amplitudes of the binary condition were significantly higher than those of the beat control at 1.2 Hz (MD = 0.655, p = 0.017).
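The gap between the unadjusted (LSD) and Bonferroni p-values for the same mean difference follows from the number of pairwise comparisons. The sketch below assumes that the Bonferroni adjustment multiplies each unadjusted p-value by the number of comparisons, as is common in statistical packages, and that the three conditions yield three pairwise comparisons; neither detail is spelled out in the text.

```python
N_PAIRS = 3  # ASSUMED: beat vs. binary, beat vs. ternary, binary vs. ternary


def bonferroni(p_unadjusted, n_comparisons=N_PAIRS):
    """Bonferroni-adjusted p-value, capped at 1.0."""
    return min(1.0, p_unadjusted * n_comparisons)


# Binary vs. beat at 1.2 Hz: unadjusted p = 0.017 as quoted above.
print(round(bonferroni(0.017), 3))
```

The product is close to the reported adjusted value of 0.050; the small discrepancy would be expected if the adjustment was applied to the unrounded p-value.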

These results might be due to a similar top–down meter effect for both modalities in the ternary-meter condition enhancing the ternary subharmonic of the beat (f/3), the downbeat of the ternary meter. Unfortunately, this enhancement was not consistently found for the binary subharmonic (f/2) because the averaged values of both auditory and visual binary-meter conditions at 1.2 Hz might not be significantly greater than those of other conditions. In fact, the interaction reported by the second ANOVA (comparing each top-of-the-peak value) at 1.2 Hz points to the absence of the binary-meter effect in one modality. Furthermore, both binary and ternary metrical effects could also be affected by the variability across participants in projecting meter or in unconsciously grouping the beat during the control condition, despite our instructions. In order to control for inter-individual variability (see the gray lines of **Figure 2**), we explored our data using a different signal-to-noise method, namely using each individual's beat condition as a baseline in order to obtain a relative measure that takes that person's variability into account.

### Power Spectra

The power spectra for each modality and condition were obtained by taking the modulus squared of the amplitudes resulting from the FFT. Here, we used the control (beat) condition as a baseline to normalize and convert the amplitudes from the metrical conditions (binary, ternary) into decibels. This procedure consisted of applying the following operation: $10 \log_{10}\left(\frac{\text{condition}}{\text{control}}\right)$. This method gives us the opportunity to see relative distances from zero as differences between conditions with respect to the control. When positive, the values indicate more power for the metrical condition, whereas when negative, they indicate more power for the beat baseline. **Figure 3** shows all the relativized power spectra displaying the activity of all the electrodes at each frequency.

TABLE 4 | Post hoc pairwise comparisons from the two-way repeated measures ANOVAs of the top-of-the-peak amplitudes (1 bin) with and without adjustments for multiple comparisons.


For the frequencies 0.8, 2.4, and 4.8 Hz, the mean differences, standard deviations, and p-values (Fisher's LSD as non-readjusted alpha, Bonferroni as readjusted alpha) are reported. Significance level was always kept below 0.05. The condition 'Beat' stands for the beat control, the condition 'Bin.' stands for the binary meter, and the condition 'Tern.' stands for the ternary meter. The modality 'Aud.' stands for audition and the modality 'Vis.' stands for vision. <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01.

conditions. Second row: visual conditions. First column: binary meter task. Second column: ternary meter task. Black rectangles frame the relevant frequencies when their values are significantly distinct from zero. The electrodes are ordered from anterior to posterior regions, with odd and even numbers representing the left and right hemispheres, respectively. <sup>∗</sup>p < 0.05, <sup>∗∗</sup>p < 0.01.

To summarize the differences between conditions and modalities, **Figure 4** depicts the mean of all participants' electrodes contrasting each condition in both the visual and the auditory modality. The frequencies of interest are marked with a vertical line to make it easier to see the meter-induced peaks at the subharmonics of the beat. Similar to the findings in the amplitude spectra, there are larger peaks at 0.8 Hz for the ternary condition in both modalities, but there is a peak at 1.2 Hz for the binary condition in the auditory modality only. In contrast with our first analyses, no differences arise between modalities at the frequency of the beat (2.4 Hz) in the power spectra. This is due to the use of the beat condition as a baseline. Given that the beat is similarly induced in all conditions, dividing a meter condition by the control yields a ratio close to one, that is, a relative value close to 0 dB.

One-sample t-tests were applied to the spectrum of each target frequency to examine whether the values significantly differed from zero. At 0.8 Hz, the values of the ternary conditions were significantly different from zero in the auditory [t(15) = 2.778, p = 0.014] and the visual [t(15) = 3.692, p = 0.002] modalities. The same occurred at 1.2 Hz for the values of the binary condition in the auditory modality [t(15) = 3.489, p = 0.003]. These findings show that the peaks appearing in both ternary conditions at 0.8 Hz and the auditory binary condition at 1.2 Hz are significantly different from zero and thus larger than the control condition. However, no effects for the visual binary meter were attested.
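The one-sample t-tests against zero can be sketched as follows. The data are hypothetical values invented for illustration and are not the study's measurements; the critical value of 2.131 is the standard two-tailed threshold for α = 0.05 with 15 degrees of freedom, which matches the 16 participants tested.

```python
import math

T_CRIT_DF15 = 2.131  # two-tailed critical t for alpha = 0.05 and 15 degrees of freedom


def one_sample_t(values, mu=0.0):
    """t-statistic for the mean of the values against the reference value mu."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    return (mean - mu) / math.sqrt(variance / n)


# HYPOTHETICAL relative power values (dB) for 16 participants at one target frequency.
db_values = [1.2, 0.4, 2.1, -0.3, 0.9, 1.5, 0.2, 1.1,
             0.7, -0.1, 1.8, 0.6, 1.0, 0.3, 1.4, 0.8]

t_stat = one_sample_t(db_values)
print(f"t({len(db_values) - 1}) = {t_stat:.3f}, "
      f"differs from zero: {abs(t_stat) > T_CRIT_DF15}")
```

A positive and significant t-statistic in this setting indicates more power in the meter condition than in the participant's own beat control.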

### DISCUSSION

Our work is among the first studies to compare meter induction between two different modalities, audition and vision, without requiring synchronized, overt movement. Although SMS studies have also explored the degree to which participants can reliably extract the pulse from visual rhythms (Patel et al., 2005; Repp, 2005; Nozaradan et al., 2012; Repp and Su, 2013; Iversen et al., 2015), there is not much information to date regarding meter induction in other modalities which do not involve motor acts. Here, we examined whether visually induced beats can be organized in endogenously driven hierarchies. To this end, each participant was asked to internally project binary and ternary meter onto the isochronous stimuli. Their recorded EEG data was converted into the frequency domain, resulting in amplitude and power spectra. We observed an effect of ternary meter induction in both the auditory and the visual modalities, as well as an effect of binary meter induction in the auditory modality. This suggests that some degree of meter induction can also be observed in the visual modality.

We evaluated the SS-EPs elicited by the beat frequency and its natural first harmonic in both modalities. The emergence of a periodic entrainment at the frequency of the beat's first harmonic (2f = 4.8 Hz) illustrates a natural preference for integer harmonics that tend to appear involuntarily when a periodic stimulus is presented. In fact, this entrainment may be related to a predisposition to subdivide the beat into integer harmonics, like duplets (1:2) or triplets (1:3) of eighth notes in music. We observed different amplitudes at the frequency of the beat and its first harmonic depending on the stimuli's modality. There was a significantly higher peak at 2.4 Hz in the auditory modality than in the visual modality. Inversely, there was a significantly greater peak at 4.8 Hz in the visual modality than in the auditory modality. Since our visual stimuli did not abruptly appear and disappear at 4.8 Hz, but progressively vanished (see **Figure 1**), there is no reason to believe that the stimulus' offset reinforced the first harmonic in the visual domain. This difference across modalities could be related to the comfortable frequencies that have been found for vision and audition with synchronized tapping (Repp, 2003). While the upper inter-onset interval (IOI) would be the same for both modalities (around 1800 ms), the lower IOI seems to be just above 460 ms for visual flashes and 160 ms for auditory beats (Repp, 2005). These modality-specific boundaries for SMS could reflect the neural demands of extracting and keeping the beat in each modality, possibly working on different frequencies. Importantly, the difference across modalities could also be due to the fact that musicians are well-trained to deal with auditory beats, and may be able to accurately synchronize with the frequency of the auditory stimuli, as shown by the sharper peak at the auditory beat (**Figure 2**). 
In contrast, since musicians may be less used to visual beats, their synchronization with the frequency of the visual stimuli may become less accurate and more distributed around 2.4 Hz, as shown by the wider, rounder peak (**Figure 2**). In this case, experience may play an important role in determining the differences between modalities.

A key finding in the present study is the emergence of periodic responses at the subharmonics f/2 and f/3 of the beat. These effects reflect the voluntary metrical interpretation of the beat in the binary and the ternary conditions. Our results suggest that neural entrainment in both modalities not only occurs for the beat, but also for the subharmonic f/3 during the ternary condition. This finding supports top–down effects of meter induction in the visual modality. However, we did not observe any effect of the binary meter in the visual modality. This lack of effect in the binary condition suggests that there is no automatic conversion between the visual and the auditory modalities. If there were such a conversion, we should observe an effect of binary and ternary meter on both modalities (see Guttman et al., 2005; McAuley and Henry, 2010). Nevertheless, the fact that meter induction is apparent in the visual modality for the ternary, but not for the binary, meter condition suggests that meter induction applies to the visual cues independent of any auditory conversion. To confirm this idea, future research testing visual meter induction in deaf people is needed, as a visual-to-auditory conversion would not be expected in this case. Furthermore, some participants reported in the questionnaire that, instead of "counting" the pulse or thinking of metrical patterns in the visual conditions, they made use of visual cues to project the meter onto the flashes, such as imagining the downbeat brighter or slightly displaced toward one side of the screen. The alteration of these visual features to project the metrical structure may also provide evidence against an auditory-to-visual conversion.

The fact that we did not find evidence of meter induction in the visual binary condition could be due to different cognitive demands across the subharmonics of the beat. A magnetoencephalographic study conducted by Fujioka et al. (2010) revealed differences in internally inducing binary and ternary meter in the auditory domain. Distinct time courses of auditory evoked responses on distributed networks were found for binary and ternary meter. The authors also observed that the contrast between strong and weak beats was only present for the ternary meter. Thus, differences between projecting binary and ternary meter could have interacted with the modality we used to induce the beat. Although more research is needed to clarify the differences between binary and ternary meter in terms of cognitive demands, the results we observed from the visual ternary condition support the idea that meter induction can apply beyond the auditory modality.

There is a feature in our visual stimuli that could have contributed to the lack of an effect in the binary condition. We used colored flickering flashes that suddenly appeared and gradually vanished to promote a beat. Stimuli such as flashes have been used extensively to create SS-EPs (Krause et al., 2010; Vialatte et al., 2010b) and even seem to be perceived with binary meter by deaf individuals (Iversen et al., 2015). However, recent studies suggest it is easier to observe SMS in the visual domain using geometrical and spatial features (Hove et al., 2010), as well as stimuli displaying natural and biological motion (Hove et al., 2013b; Su, 2014a,b). Showing a bouncing ball on a screen, Gan et al. (2015) obtained slightly better results on visual synchronization at IOIs from 500 to 900 ms compared to using an auditory metronome. Similarly, Iversen et al. (2015) reported equally accurate performance by deaf and hearing individuals when they had to synchronize with a ball bouncing at an IOI of 600 ms. In light of these findings, we can speculate that using motion cues could have aided the neural synchronization with the beat and perhaps facilitated the projection of the binary meter in the visual modality, whose frequency actually fell within the above-mentioned IOI ranges.
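The frequencies discussed in this section can be related to inter-onset intervals with a little arithmetic. The following is a minimal sketch, assuming the 2.4 Hz beat reported above; the variable and function names are hypothetical and are not taken from the study's analysis code. It converts the beat, its first harmonic, and the binary and ternary subharmonics to periods, and checks which of them fall within the 500 to 900 ms range tested by Gan et al. (2015).

```python
# Illustrative arithmetic only; all names here are hypothetical.
BEAT_HZ = 2.4  # beat frequency f reported in the study

frequencies_hz = {
    "beat (f)": BEAT_HZ,
    "first harmonic (2f)": 2 * BEAT_HZ,
    "binary subharmonic (f/2)": BEAT_HZ / 2,
    "ternary subharmonic (f/3)": BEAT_HZ / 3,
}


def period_ms(freq_hz):
    """Convert a frequency in Hz to an inter-onset interval in ms."""
    return 1000.0 / freq_hz


periods_ms = {name: period_ms(f) for name, f in frequencies_hz.items()}

# IOI range over which Gan et al. (2015) tested the bouncing ball
in_range = {name: 500 <= p <= 900 for name, p in periods_ms.items()}

for name, p in periods_ms.items():
    print(f"{name}: {frequencies_hz[name]:.1f} Hz, period {p:.0f} ms, "
          f"within 500-900 ms: {in_range[name]}")
```

Under these assumptions, only the binary subharmonic has a period inside that range, which is the point made in the preceding paragraph.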

It has been proposed that meter consists of a cyclical fluctuation of attention over isochronous events to yield expectancies and predictions of incoming beats (Jones and Boltz, 1989; Large and Jones, 1999). A mirroring oscillatory network could explain the neuronal engagement to the external stimuli, either auditory or visual, such as the non-linear oscillator network proposed by the Resonance Theory (Large, 2008, 2010; Large et al., 2015). In the visual domain, these predictions would be fundamental to allow for dance synchrony and other coordinated social activities, such as sports. Studies using neuroimaging techniques provide further support for crossmodal meter. For instance, Hove et al. (2013a) found that the basal ganglia activity was more associated with SMS stability than with modality features. Musical meter has also been found to elicit cross-modal attention effects in the caudate nucleus (Trost et al., 2014). Coupled with our findings on the ternary visual meter, the general picture appears to be that of an amodal timing mechanism for beat and meter, allowing humans to deal with temporal information across modalities.

Comparative cognition studies also point toward the idea that timing mechanisms are more detached from a single modality than previously believed. Trained rhesus monkeys (macaques) showed an accurate performance of a synchronization-continuation task with both visual and auditory metronome at different tempos (Zarco et al., 2009), but their tapping behavior was clearly biased toward visual cues (Merchant and Honing, 2013). In addition, Japanese macaques were found to synchronize limb movements when facing each other, suggesting social coordination through visual imitation (Nagasaka et al., 2013). In our closest relatives, the chimpanzees, synchronization has been tested in the auditory domain (Hattori et al., 2013). However, likely by using both visual and auditory information, a bonobo intermittently displayed entrainment and phase matching to distinct isochronous sounds when interactively drumming with a human drummer (Large and Gray, 2015). Rhythmic behaviors, such as the drumming found in wild gorillas and chimps, seem to be tied to social functions (Ravignani et al., 2013). These findings suggest that primates may have evolved social interactive behaviors (i.e., the ancestors of our music and dance) by gradually tuning the neurodynamics of their timing mechanisms toward a more precise visuo- and audio-motor coupling so as to improve social learning, group coordination, and cohesion. Among primates, only humans are complex vocal learners, an ability that requires tighter auditory-motor connections (Petkov and Jarvis, 2012; Patel and Iversen, 2014; Merchant et al., 2015) than those found in the motor cortico-basal ganglia-thalamocortical circuit of primates (Merchant and Honing, 2013). We propose that the advantages of our evolved beat-based timing mechanism are not restricted to the auditory modality. 
Instead, even though the auditory modality has been specialized for the rhythms of speech and music through cultural experience, these same mechanisms may also be available to process rhythms in other modalities, allowing for sign language, dance, and sports.

The present results suggest that meter induction is domain-independent. A top-down projection of meter, without having any external cue to mark the metrical structure, is available for both the visual and the auditory modalities. This is informative regarding current theories on rhythm evolution and opens interesting questions. If meter induction evolved as a result of purely acoustic rhythm processing, it is then necessary to explain how it emerges in the visual modality. One possibility is that the feeling of meter emerges after linking the kinesthetic and vestibular systems to another perceptual modality, like vision or audition. In fact, the importance of the vestibular system was highlighted by the study of Phillips-Silver and Trainor (2005), which looked at 7-month-old babies' metrical preferences after being bounced at binary and ternary meter. In addition, this vestibular and kinesthetic feeling of meter could also be the base for the movements found in musicians and dancers, and may point to how they learn to extract meter from distinct modalities. Another possibility, proposed by Fitch (2013), is that the hierarchical component of meter derives from a more general computation that is specialized in building temporal hierarchies. This would explain why meter is present in both music and language and why it could also be applicable to other modalities. If that were the case, the hierarchical organization of beats would be a by-product of our linguistic mind, and the organization of rhythm in music and dance may not be much different from the organization of rhythms in speech and signing, such as those constituting stress and prosody. Finally, it is still an open question whether animals can take advantage of this rhythmic mechanism that is not constrained to the auditory modality. The lack of evidence in the animal kingdom cannot be used as evidence that only humans have meter (Fitch, 2013), as several species show coordinated rhythmic abilities across modalities and deserve to be properly studied (Zarco et al., 2009; Nagasaka et al., 2013; Large and Gray, 2015).

Finally, it is important to consider the extent to which the present findings can be generalized to other populations. We tested musicians, who have been found to show an increased sensitivity to events and physical changes occurring in strong beat positions (Geiser et al., 2010; Repp, 2010; Kung et al., 2011; but see Bouwer et al., 2014). This suggests that musical expertise enhances attention to relevant metrical positions (Grahn and Rowe, 2009; Doelling and Poeppel, 2015). Accordingly, the extensive experience with auditory rhythmic stimuli in the population we tested could help explain the variability in metrical timing observed in the visual domain. Future studies should test non-musicians in order to claim that meter induction is a cross-modal timing mechanism.

### CONCLUSION

The present study tackles the nature of musical meter, fundamental to the organization of events over time in a hierarchical way. We compared meter induction in audition and vision and found a similar effect for projecting ternary meter in both modalities. However, we only found an effect for binary meter in the auditory modality. The fact that ternary meter was successfully projected in both modalities suggests that human rhythmic abilities are more domain-general than previously believed. The existence of meter induction in the visual domain supports the idea of amodal timing mechanisms. These mechanisms seem to be at least partially present in some primates and may have developed in our species. It is the evolution of these amodal timing mechanisms that may have allowed us to master language, music, dance, and other synchronized activities that require a precise timing of actions across modalities.

### AUTHOR CONTRIBUTIONS

AC-M designed and ran the study, analyzed the data, and wrote the article; RdM contributed to the design of the study, analyzing the data, and revising the article; JT designed the study and supervised the analysis of the data and the writing of the article.

### FUNDING

This work was supported by the European Research Council (ERC) Starting Grant agreement n.312519 and by the Spanish Ministerio de Economía y Competitividad (MEC) FPI grant BES-2014-070547.

### ACKNOWLEDGMENTS

We would like to thank Xavier Mayoral for technical support, as well as Albert Compte and Mireia Torralba for helpful and instructive comments.



**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Celma-Miralles, de Menezes and Toro. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Impact of Instrument-Specific Musical Training on Rhythm Perception and Production

### *Tomas E. Matthews\*, Joseph N. L. Thibodeau, Brian P. Gunther and Virginia B. Penhune*

*Laboratory for Motor Learning and Neural Plasticity, Department of Psychology, Concordia University, Montreal, QC, Canada*

Studies comparing musicians and non-musicians have shown that musical training can improve rhythmic perception and production. These findings tell us that training can result in rhythm processing advantages, but they do not tell us whether practicing a particular instrument could lead to specific effects on rhythm perception or production. The current study used a battery of four rhythm perception and production tasks that were designed to test both higher- and lower-level aspects of rhythm processing. Four groups of musicians (drummers, singers, pianists, string players) and a control group of non-musicians were tested. Within-task differences in performance showed that factors such as meter, metrical complexity, tempo, and beat phase significantly affected the ability to perceive and synchronize taps to a rhythm or beat. Musicians showed better performance on all rhythm tasks compared to non-musicians. Interestingly, our results revealed no significant differences between musician groups for the vast majority of task measures. This was despite the fact that all musicians were selected to have the majority of their training on the target instrument, had on average more than 10 years of experience on their instrument, and were currently practicing. These results suggest that general musical experience is more important than specialized musical experience with regard to perception and production of rhythms.

Keywords: rhythm perception, rhythm production, beat perception, musical training, motor timing, expertise, tapping

## INTRODUCTION

Perceptually grouping a series of auditory events into a coherent rhythmic pattern within the context of music is a skill that develops early and is likely innate, at least in humans (Phillips-Silver and Trainor, 2005; Iversen et al., 2008; Honing, 2012). Production of a musical rhythm involves the temporally precise coordination of auditory and motor processes at a level not seen in other domains (Zatorre et al., 2007). Perception and production of rhythms are likely universal skills as most people can synchronize their movements to music, even without any formal training. However, various studies have shown that musical training can improve rhythmic perception and production (Smith, 1983; Drake, 1993; Kincaid et al., 2002; Chen et al., 2008), fine-grained temporal processing (Drake and Botte, 1993; Rammsayer and Altenmüller, 2006; Repp, 2010; Farrugia et al., 2012; van Vugt and Tillmann, 2014) and precise motor synchronization (Franek et al., 1991; Collier and Ogden, 2004; Repp and Doggett, 2007; Repp, 2010; Baer et al., 2013, 2015). These improvements are hypothesized to be driven by reinforced connections between sensory, proprioceptive, cognitive, and motor systems resulting from years of instrumental practice (Herholz and Zatorre, 2012).

#### *Edited by:*

*Sonja A. Kotz, Max Planck Institute Leipzig, Germany*

#### *Reviewed by:*

*Barbara Tillmann, Centre National de la Recherche Scientifique, France Feng Rong, University of California, Irvine, USA*

*\*Correspondence:*

*Tomas E. Matthews flepid@hotmail.com*

#### *Specialty section:*

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Psychology*

*Received: 09 September 2015 Accepted: 12 January 2016 Published: 03 February 2016*

#### *Citation:*

*Matthews TE, Thibodeau JNL, Gunther BP and Penhune VB (2016) The Impact of Instrument-Specific Musical Training on Rhythm Perception and Production. Front. Psychol. 7:69. doi: 10.3389/fpsyg.2016.00069*


These findings tell us that music training can result in rhythm processing advantages, but they do not tell us whether practicing a particular instrument could lead to specific effects on rhythm perception or production. We might hypothesize that those musicians whose training emphasizes rhythm, or whose instrument requires motor skills similar to those used in tests of rhythm production, would out-perform musicians whose training emphasizes pitch or whose instrument requires very different effectors. There are many studies showing differences between musicians and non-musicians (e.g., Chen et al., 2008; Repp, 2010); however, there are very few studies that have looked at the effects of specialized training on rhythm processing among different groups of musicians (e.g., Krause et al., 2010; Cicchini et al., 2012). These studies, which are reviewed below, suggest that subtle differences may exist between musician groups, but the findings are inconsistent. Most studies assessed relatively low-level aspects of rhythm processing, such as timing sensitivity and isochronous tapping, rather than higher-level aspects, such as beat perception or rhythm synchronization. In addition, none assessed possible systematic differences in the duration and type of musical training across instrument types. To address these issues, the current study compared four groups of musicians (drummers, singers, pianists, and string players), as well as a control group of non-musicians, on a battery of rhythm perception and production tasks. We predicted that drummers, whose training focuses on rhythm processing, and pianists, whose motor skills match the demands of rhythm production tasks, might perform better than violinists and singers, whose training emphasizes pitch processing and whose motor skills require different effectors and movements.

Many researchers have tested the effects of musical training on rhythm processing by comparing musicians and non-musicians. These studies can be divided into two main categories: those that focus on higher-level processing of meter and beat, and those concerned with lower-level motor and timing processes. These studies show that musical training improves rhythm processing in two ways. On the one hand, musical experience improves the ability to use a rhythmic framework to find the underlying pulse and parse the metrical structure (Smith, 1983; Drake, 1993; Chen et al., 2008; Grahn and Rowe, 2009). On the other hand, training improves lower-level abilities, such as fine-grained timing perception, sensorimotor synchronization, and continuation tapping (Franek et al., 1991; Collier and Ogden, 2004; Repp, 2010; Baer et al., 2013, 2015).

High-level rhythm processing was investigated in a functional magnetic resonance imaging (fMRI) study which tested the effects of metrical complexity and musical training using a rhythm synchronization task (RST; Chen et al., 2008). The authors found that the musicians were more synchronous and less variable than the non-musicians. Further, the musicians reported that they imposed a metrical structure on the rhythms, whereas the non-musicians reported a chunking strategy. During the task, musicians showed more engagement of right prefrontal brain regions, consistent with the use of a top-down, meter-based strategy. Further support for the idea that training leads musicians to adopt a meter-based strategy in rhythm tasks comes from studies showing improved rhythm reproduction (Smith, 1983; Drake, 1993) as well as differences in brain activity and ratings of beat presence in musicians compared to non-musicians (Grahn and Rowe, 2009).

Other studies have focused on improvements in lower-level timing and motor functions due to musical training. Generally, musicians show more accurate tapping and/or lower tapping variability on synchronization and continuation tapping tasks as well as a timed sequence production task, compared to non-musicians (Drake et al., 2000; Kincaid et al., 2002; Aoki et al., 2005; Repp, 2010; Baer et al., 2013). Furthermore, previous work has shown that musical expertise leads to improvements in tempo sensitivity (Drake and Botte, 1993), sensitivity to phase shifts (Repp, 2010), anisochrony detection (Friberg and Sundberg, 1995) and duration reproduction (Franek et al., 1991).

Recently, two separate groups have developed batteries of rhythm and timing tasks that assess both higher-level and lower-level aspects of rhythm perception and production. The Battery for the Assessment of Auditory Sensorimotor and Timing Abilities (BAASTA; Farrugia et al., 2012) includes four timing and rhythm perception tasks and four tapping tasks. The authors found that musicians were better than non-musicians on all four perceptual tasks which tested both high-level and low-level processing. Only non-musicians were tested on the tapping tasks. The Harvard Beat Assessment Test (H-BAT; Fujii and Schlaug, 2013) includes four beat-based tasks involving both higher-level and lower-level processing and both perceptual and productive components. Although musicians and non-musicians were not compared explicitly, training was significantly correlated with measures of synchronization consistency.

Taken together, studies comparing musicians and non-musicians show that training improves performance on both higher- and lower-level aspects of rhythm processing. However, they do not tell us about the possible effects of training on a particular instrument.

Several studies comparing rhythm experts (i.e., drummers and percussionists) to other musician and non-musician groups have shown that rhythm training generalizes to low-level timing abilities. For example, percussionists detected smaller timing deviations in a discrimination task compared to classical musicians (three pianists and one singer) and non-musicians, who did not differ (Ehrlé and Samson, 2005). Two other studies compared musician groups on low-level production tasks with both auditory and visual stimuli. The first tested drummers, string players and non-musicians on an interval reproduction task. Drummers were less variable and used a different strategy than the other groups on the visual task but not the auditory task (Cicchini et al., 2012). The second tested drummers, professional pianists, amateur pianists, singers, and non-musicians on a production task where participants synchronized taps to an isochronous signal (Krause et al., 2010). The professional pianists had approximately 25 years of experience with their instrument on average while the amateur pianists and drummers had approximately 15 years of experience on average. The singer group had approximately 22 years of singing experience on average. During auditory synchronization, drummers had significantly smaller asynchronies than the amateur pianists, and were less variable in their tapping compared to singers, amateur pianists, and non-musicians. However, there was no difference between the drummers and professional pianists. Two other studies found that experienced drummers had reduced tapping variability compared to non-drummers; however, it was not clear if the non-drummers played other instruments (Fujii and Oda, 2006; Fujii et al., 2009). Non-drummer musicians, specifically brass-players and pianists, have also shown enhanced performance compared to non-musicians on low-level perceptual and production tasks (van Vugt and Tillmann, 2014).
Unexpectedly, there were no differences in performance between the two musician groups. The authors determined that individual differences in synchronization ability were better predictors of performance on the perceptual tasks, over and above musical experience. This suggests that musical training improves synchronization abilities which generalize to better timing perception, regardless of the instrument-specific experience.

Rhythm experts have also shown enhanced performance on higher-level rhythm tasks. These tasks test one's ability to perceive and synchronize to metrical structures rather than testing timing sensitivity or synchronization to an isochronous sequence. One study compared a 'rhythm expert group,' consisting of four percussionists and one violinist who had performed particularly well in previous tapping experiments, to other well-trained musicians on a bimanual Synchronization–Continuation task using non-isochronous rhythms (Repp et al., 2013). This 'rhythm expert group' was less variable in their tapping than the other musicians for the faster rhythms only, suggesting a context-dependent effect. Another study showed that percussionists performed better on beat tapping and rhythm reproduction tasks compared to a non-percussionist group that included both musicians and non-musicians (Cameron and Grahn, 2014). Finally, a recent study showed that percussionists were better at a beat-based perceptual timing task compared to non-percussionists when synchronizing to the beat (Manning and Schutz, 2015). However, no differences were seen between groups in the no-movement condition, suggesting that motor synchronization is crucial for the beat-based timing benefits that come with rhythm-focused training. It should be noted that in the latter two studies the musicians in the non-percussionist group were not explicitly identified; therefore, conclusions cannot be made regarding the effects of specialized instrumental training.

Together, these studies suggest that drummers and percussionists may have superior low-level timing abilities. However, the results are not always consistent. Krause et al. (2010) showed that professional drummers showed enhanced synchronization abilities compared to amateur pianists and non-musicians, but not compared to professional pianists. Other studies showed enhanced performance in drummers over non-drummers but the non-drummer groups were not well defined (Fujii and Oda, 2006; Fujii et al., 2009). Further, many of these studies did not quantify the length and type of musical experience in their musician groups. Thus, it is difficult to ascertain whether it is the amount or the type of musical training that determines low-level timing ability. Only two studies have looked at higher-level rhythmic abilities among different musician groups (Repp et al., 2013; Cameron and Grahn, 2014) and they do not lead to strong conclusions regarding the effect of specialized musical training. Therefore, the current study used a battery of four rhythm perception and production tasks that were designed to test both higher- and lower-level aspects of rhythm processing. In order to address effects of training, participants were selected to have similar levels of experience with their primary instrument. Furthermore, measures related to the amount of training were collected in order to compare across groups.

In choosing which tasks to include in our battery, we focused on the two key processes involved in perceiving and producing a musical rhythm. First, the underlying beat or pulse is extracted by finding the most stable and/or salient isochronous structure within a rhythm. Secondly, elements of the rhythm are grouped into hierarchical (i.e., metrical) structures based on explicit and subjective accents as well as one's knowledge of musical patterns (Fitch, 2013). To test these higher-level processes we used the RST, the Beat Synchronization Task (BST) and a perceptual version of the Beat Alignment Test. In the RST, participants were asked to listen to rhythms and then synchronize finger taps with each note of the rhythm upon the second presentation (Chen et al., 2008). In the BST, participants were asked to synchronize finger taps with the underlying beat of a rhythm during the second and third presentations of that rhythm (Kung et al., 2013). Separate rhythms were used in the BST and the RST. The Beat Alignment Test involves judging whether a metronome is synchronized with the underlying beat of musical excerpts (Iversen and Patel, 2008). A variant of the Synchronization–Continuation Task (SCT) was used to test lower-level, motor timing processes required to synchronize one's movements to the underlying beat or individual onsets of a rhythm (Stevens, 1886). In this task participants synchronized their taps to a metronome and then continued tapping at the same rate as accurately as possible after the metronome had stopped.

Auditory short term memory (including working memory) has been correlated with rhythm abilities generally (Grahn and Schuit, 2012) and performance on the RST specifically (Bailey and Penhune, 2012). Therefore, two measures of auditory working memory [the Digit Span (DS) and Letter Number Sequencing (LNS) tasks from the WAIS-IV; Wechsler, 2008] were included to investigate whether differences in rhythm abilities among groups were related to auditory working memory.

The goal of this study was to investigate the effects of specialized music training on rhythm perception and production by comparing four musician groups (drummers, pianists, string players, and singers) and a control group of non-musicians on a battery of rhythm tasks. These tasks involved both higher-level cognitive and low-level motor timing aspects of rhythm processing in order to investigate whether these processes are affected differentially by specialized musical training. Differences between groups were expected as the musicians differed in terms of their rhythm expertise and the match between the motor skills required for their instrument and those required for the tasks.

### MATERIALS AND METHODS

### Experimental Design

All participants performed the BST, the RST, the Synchronization–Continuation task, the beat alignment perception task (BAPT) and two working memory tasks (DS and LNS). As discussed above, the four rhythm tasks were included to test four different aspects of rhythm processing including basic abilities, such as sensorimotor synchronization and higher level abilities, such as beat and meter perception and production. The tasks were administered in a counter-balanced order, except that the RST and BST were never administered consecutively. DS and LNS were administered during the break between the two blocks of the RST and BST with the DS always administered first. The whole battery of tasks took approximately 2 h. Stimuli were presented and responses recorded on an IBM-compatible computer. All stimuli were played through Sony MDR 7506 headphones at a comfortable volume.
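The ordering constraint described above can be made concrete by enumerating the admissible task orders. The sketch below is illustrative only: the text does not say how orders were assigned to participants, so the cycling rule in `order_for` is a hypothetical assumption, and the working memory tasks administered during the break are not modeled.

```python
from itertools import permutations

# The four rhythm tasks named in the text
TASKS = ("BST", "RST", "SCT", "BAPT")


def rst_bst_adjacent(order):
    """Return True if the RST and BST occupy neighbouring positions."""
    return abs(order.index("RST") - order.index("BST")) == 1


# All orders of the four tasks in which the RST and BST are never consecutive
valid_orders = [o for o in permutations(TASKS) if not rst_bst_adjacent(o)]


def order_for(participant_index):
    """Cycle through the valid orders (hypothetical assignment rule)."""
    return valid_orders[participant_index % len(valid_orders)]


print(len(valid_orders), "admissible orders out of 24")
print(order_for(0))
```

Of the 24 possible orders of four tasks, 12 place the RST and BST next to each other, which leaves 12 admissible orders to counterbalance across.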

### Participants

We tested forty-two musicians (9 drummers, 11 pianists, 10 singers, and 12 string players) and 14 non-musicians (age 18 to 35 [*M* = 23.54, *SD* = 4.46]). Musicians were selected to have at least 10 years of experience on their primary instrument and to have limited training with any of the other target instruments or singing. The non-musician group included individuals with less than 3 years of musical training or experience, who did not have any formal training in the last 5 years and were not playing an instrument at the time of testing. Participants completed an extensive Musical Experience Questionnaire (MEQ) developed in our laboratory (Bailey and Penhune, 2010). From this questionnaire, we extracted five variables that we thought were most important in characterizing the groups: age of start, number of years playing primary instrument, number of years playing any instrument, current hours of practice per week and years of lessons (see Results and **Table 1** for a detailed report of these measures).

Participants were recruited via advertisements placed online and around the McGill and Concordia University campuses. All were right-handed, free of any neurological disorders and reported no motor or hearing problems. Written informed consent was obtained in accordance with the Declaration of Helsinki and participants were compensated for their time. The study was approved by the Concordia University Human Research Ethics Committee.

### Test Battery

#### Rhythm Synchronization Task (RST)

*Stimuli and Procedures*

The RST is a variant of the task developed by Chen et al. (2008) and has been used in several previous studies with musicians and non-musicians (Bailey and Penhune, 2010, 2012). The RST requires participants to listen to, and then tap in synchrony with, a series of auditory rhythms varying in metrical complexity. Metrical complexity was defined based on the rules of metric organization of Povel and Essens (1985), who showed that as the number of sounds that fall on predicted beat points increases, the metrical strength of the rhythm increases. All rhythms were comprised of the same 11 woodblock notes: five eighth notes (250 ms), three quarter notes (500 ms), one dotted quarter note (750 ms), one half note (1000 ms), and one dotted half note (1500 ms). The duration of the woodblock sound was always 200 ms and each rhythm sequence lasted 6 s. These notes were rearranged to create six different rhythms, two at each level of complexity: metrically simple (MS), metrically complex (MC), and non-metric (NM) (see Chen et al., 2008 and Bailey and Penhune, 2010 for a more detailed description of the stimuli). Each trial had two parts: listen and synchronize. During the listen phase, participants listened to each rhythm without moving. During the synchronization phase, participants tapped in synchrony with each note of the rhythms using the right index finger on the computer mouse. The task included two blocks of 36 trials (six repetitions of each rhythm). Between the listen and synchronization phases there was a 3 s silence followed by a warning sound, followed by an interval of 750 ms. Rhythms were presented and tapping responses were recorded using Presentation v0.8 (Neurobehavioral Systems).
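The note set described above can be checked against the stated 6 s sequence length. The following sketch uses only the note counts and durations given in the text; the shuffled ordering is arbitrary and is not one of the six rhythms used in the study, and the function names are hypothetical.

```python
import random

# (note name, duration in ms, count) as listed for the RST rhythms
NOTE_SET = [
    ("eighth", 250, 5),
    ("quarter", 500, 3),
    ("dotted quarter", 750, 1),
    ("half", 1000, 1),
    ("dotted half", 1500, 1),
]


def note_pool():
    """Expand the note set into a flat list of inter-onset intervals (ms)."""
    return [dur for _, dur, count in NOTE_SET for _ in range(count)]


def onsets_ms(intervals):
    """Onset time of each note, given the inter-onset intervals in order."""
    onsets, t = [], 0
    for ioi in intervals:
        onsets.append(t)
        t += ioi
    return onsets


pool = note_pool()
print(len(pool), "notes,", sum(pool), "ms in total")

# Arbitrary reordering for illustration, NOT one of the study's rhythms
rng = random.Random(0)
example = pool[:]
rng.shuffle(example)
print(onsets_ms(example))
```

Because every rhythm reuses the same 11 intervals, any reordering keeps the total duration at 6000 ms; only the placement of onsets relative to the predicted beat points changes.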

### *Measures*

*Values outside parentheses are means and values inside parentheses are standard deviations.*

Rhythm synchronization performance was assessed using two measures: the percentage of correct taps (percent correct) and the inter-tap interval deviation (ITI deviation). A tap was considered correct if it was made within half of the onset-to-onset interval (IOI) before or after each woodblock note. For correct taps only, the ITI deviation was calculated by dividing the ITI by the IOI, subtracting this ratio from one and then taking the absolute value. This measure is indicative of how well participants have reproduced the temporal structure of the rhythms. Both measures were calculated for each trial and averaged across rhythm type and block.
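The two RST measures described above can be sketched as follows. This is a minimal illustration, not the authors' scoring code: the function name, the rule of keeping the tap closest to each note when several fall in its window, the treatment of the first and last notes (which have only one neighbouring interval), and the use of notes as the denominator for percent correct are all assumptions made for the example.

```python
def score_rst_trial(note_onsets_ms, tap_times_ms):
    """Return (percent_correct, mean_iti_deviation) for one RST trial.

    Assumptions (not from the source): the closest tap inside a note's
    window is kept, and edge notes reuse their single neighbouring IOI.
    """
    n = len(note_onsets_ms)
    matched = [None] * n
    for i, onset in enumerate(note_onsets_ms):
        # IOI before and after this note; edge notes reuse their only neighbour.
        before = onset - note_onsets_ms[i - 1] if i > 0 else note_onsets_ms[1] - onset
        after = note_onsets_ms[i + 1] - onset if i < n - 1 else onset - note_onsets_ms[i - 1]
        lo, hi = onset - before / 2, onset + after / 2
        in_window = [t for t in tap_times_ms if lo <= t < hi]
        if in_window:
            matched[i] = min(in_window, key=lambda t: abs(t - onset))

    percent_correct = 100 * sum(m is not None for m in matched) / n

    # ITI deviation = |1 - ITI / IOI|, using pairs of consecutive correct taps only.
    deviations = []
    for i in range(n - 1):
        if matched[i] is not None and matched[i + 1] is not None:
            iti = matched[i + 1] - matched[i]
            ioi = note_onsets_ms[i + 1] - note_onsets_ms[i]
            deviations.append(abs(1 - iti / ioi))
    mean_dev = sum(deviations) / len(deviations) if deviations else float("nan")
    return percent_correct, mean_dev
```

For instance, with note onsets at 0, 250, 750, and 1750 ms and taps at 0, 275, and 750 ms, three of four notes receive a correct tap and the two usable intervals deviate by 0.10 and 0.05 from their target IOIs.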

#### Beat Synchronization Task (BST) *Stimuli and procedure*

The BST was developed in our laboratory as part of an fMRI study (Kung et al., 2013). In this task, participants are required to listen to, and then tap in synchrony with the beat of musical rhythms that vary across four levels of metrical complexity. As in the RST, metrical complexity, or beat strength, was defined based on the rules of metric organization of Povel and Essens (1985).

Rhythms were either in duple (20 rhythms) or triple meter (12 rhythms), and could have two different tempos (16 each). All rhythms were comprised of the same 11 woodblock notes (100 ms). Each of the duple rhythms contained five eighth notes (195 or 260 ms in fast and slow tempi, respectively), three quarter notes (390 or 520 ms), one dotted quarter note (585 or 780 ms), and one half note (780 or 1040 ms). Rhythms at the fast and slow tempi lasted 3.51 and 4.68 s, respectively. Rhythms were divided into four levels of complexity (perfectly metric, strongly metric, metric, and weakly metric) based on the number of notes falling on predicted beat points. For the perfectly metric rhythms, there were note onsets at all predicted beat points (five or seven onsets on the beat for the duple and triple meters, respectively), whereas the weakly metric rhythms had stimulus onsets at only a subset of the predicted beat points (two or four onsets on the beat for the duple and triple meters, respectively; see Kung et al., 2013 for more detail on how the stimuli were created). *C*-scores were also calculated based on the model of Povel and Essens (1985). A *C*-score is the amount of counterevidence a rhythm supplies regarding the beat locations based on the number of silences and weakly accented notes that fall on predicted beat points (see Povel and Essens, 1985 for more detail on how *C*-scores are calculated). *C*-scores within each level of complexity were highly consistent (see **Table 2**).

Each trial contained three repetitions of each rhythm. During the first repetition participants were instructed to listen and find the beat of the rhythm without moving. During the next two repetitions, which were interleaved with warning sounds, participants were instructed to tap in synchrony with the beat of the rhythms using the right index finger on the computer mouse. The intervals preceding and following the warning sounds were twice the length of the IBI so as not to interfere with the pulse of the rhythm. Participants were not instructed as to what beat level they should tap for each rhythm (i.e., quadruple, duple, sextuple, or triple). The order of trials in triple or duple meter, and fast or slow tempi, was pseudo-randomized to prevent participants from carrying over the beat from one trial to the next. The task was split into two 11 min blocks, each consisting of 32 trials, where each rhythm was used once per block. The rhythms in the two blocks were alternated such that the slow rhythms in the first block were the fast rhythms in the second. Rhythms were presented and tapping responses were recorded using Presentation v0.8 (Neurobehavioral Systems).

### *Measures*

Beat synchronization performance was assessed by calculating the percentage of correct taps (percent correct) and the ITI deviation. A tap was considered correct if it was made within 20% of the inter-beat interval (IBI). A lower percentage was used in this task than in the RST because it was a percentage of the single-length IBI rather than the multiple-length IOIs of the RST sequences. Because participants were not instructed as to what metric level they were to tap, the first step in the analysis was to determine whether they tapped at the duple or quadruple level, thus determining the target IBI. This was done by inspecting the tap data, comparing average ITIs to target IBIs and comparing the number of taps to the expected number of taps. If the average ITI for a particular trial was greater than the target ITI plus 50%, it was determined that the participant tapped at the quadruple level rather than the duple. ITI deviation was calculated as in the RST, except that for the BST the ratio of the ITI to the IBI was used rather than the ratio of the ITI to the IOI. Percent correct and ITI deviation were calculated for each trial and averaged across meter, tempo, and metrical complexity.
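The level-selection rule and the BST measures can be illustrated with a short sketch. It is a simplified reading of the procedure, not the authors' code: the function names are hypothetical, the quadruple-level IBI is assumed to be twice the duple-level IBI, the 20% criterion is assumed to be a symmetric window around each expected beat, and the comparison of tap counts mentioned in the text is omitted.

```python
from statistics import mean


def infer_target_ibi(tap_times_ms, duple_ibi_ms):
    """Pick the target IBI from the mean ITI (needs at least two taps).

    Assumption: tapping at the quadruple level means an IBI twice the
    duple-level IBI.
    """
    itis = [b - a for a, b in zip(tap_times_ms, tap_times_ms[1:])]
    if mean(itis) > 1.5 * duple_ibi_ms:  # "target plus 50%" rule from the text
        return 2 * duple_ibi_ms
    return duple_ibi_ms


def score_bst_trial(tap_times_ms, beat_times_ms, ibi_ms, tolerance=0.20):
    """Return (percent_correct, mean_iti_deviation) relative to the target IBI."""
    window = tolerance * ibi_ms
    matched = []
    for beat in beat_times_ms:
        near = [t for t in tap_times_ms if abs(t - beat) <= window]
        matched.append(min(near, key=lambda t: abs(t - beat)) if near else None)

    percent_correct = 100 * sum(m is not None for m in matched) / len(matched)

    # ITI deviation = |1 - ITI / IBI| for consecutive correct taps.
    deviations = [
        abs(1 - (b - a) / ibi_ms)
        for a, b in zip(matched, matched[1:])
        if a is not None and b is not None
    ]
    return percent_correct, (mean(deviations) if deviations else float("nan"))
```

With a duple-level IBI of 390 ms, taps at 0, 790, and 1570 ms have a mean ITI of 785 ms, which exceeds 585 ms, so the trial would be scored against a 780 ms target IBI.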

#### Synchronization–Continuation Task (SCT) *Stimuli and procedure*

This is a variant of the commonly used synchronization–continuation task (Repp, 2005), which has been used in our lab previously to measure self-paced isochronous tapping (Baer et al., 2013, 2015). We used the same experimental setup, data cleaning, and analysis procedures as Baer et al. (2015). In the paced phase, participants were asked to tap in synchrony with a metronome (1 kHz pure tone, 20 ms in duration; 35 cycles). In the continuation phase, after the metronome stopped, participants were instructed to continue tapping at the same rate until they heard a stop cue (35 cycles). There were three tempos with IOIs of 200, 500, and 750 ms. There were six trials per tempo and tempo order was counterbalanced across participants (see Baer et al., 2015 for more details). Finger movements were recorded using an active, three-dimensional motion capture system (Visualeyez VZ3000, Phoenix Technologies, Burnaby, BC, Canada). Two infrared-sensitive cameras tracked the motion of an infrared light emitting diode (LED) that was attached to participants' right index fingernail using Velcro. The trajectory of the LED was tracked at a sampling rate of 200 Hz and to a spatial resolution of 0.015 mm. The infrared-sensitive cameras were synchronized to the metronome with a National Instruments 6221 Data Acquisition board.

### *Measures*

This task was used to measure the ability to produce and maintain an isochronous beat using internal timing processes. Preprocessing and analysis steps were identical to those used in previous studies from this lab (Baer et al., 2013, 2015), which also focused on internal timing processes. Therefore, only taps during the continuation phase were analyzed. Mean ITIs were compared to ensure that participants were able to tap out the target interval successfully. In order to characterize long-term drift away from the target interval, the tapping data were linearly detrended. The absolute slope of the detrending line was used as a measure of the magnitude of drift. The variance of the ITIs that remained in the data after detrending was used as a more accurate representation of the cycle-to-cycle tapping variability. According to the Wing and Kristofferson (1973) model, continuation tapping involves two independent processes: an internal timekeeper that acts as a clock generating timing signals, and a motor implementation process which uses input from the timekeeper to accurately time movements. Using this model, tapping variability related to the internal timekeeper and that related to motor implementation were analyzed separately. In addition to these variability measures, use of the motion capture system allowed for analysis of kinematic measures. Smoothness of tapping movement was measured using mean squared jerk (see Baer et al., 2013). This measure was used to investigate whether between-group differences in tapping performance due to specialized training may be reflected in different kinematic strategies. All measures were averaged over each tempo and compared between groups.
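The drift and variability measures can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the pipeline of Baer et al.: function names are hypothetical, the detrending is an ordinary least-squares line over tap index, and the Wing–Kristofferson split uses the standard relations (motor variance equals the negative lag-1 autocovariance of the ITIs; timer variance equals total ITI variance minus twice the motor variance), applied here to the detrended residuals with an n − 1 divisor.

```python
def detrend_itis(itis_ms):
    """Fit a least-squares line over tap index; return (|slope|, residuals)."""
    n = len(itis_ms)
    x_mean = (n - 1) / 2
    y_mean = sum(itis_ms) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(itis_ms))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (intercept + slope * x) for x, y in enumerate(itis_ms)]
    return abs(slope), residuals


def wing_kristofferson(residuals):
    """Split ITI variance into (timer_variance, motor_variance).

    Uses Var(ITI) = timer + 2 * motor and lag-1 autocovariance = -motor.
    Estimates can come out negative when the lag-1 autocovariance is positive.
    """
    n = len(residuals)
    m = sum(residuals) / n
    variance = sum((r - m) ** 2 for r in residuals) / (n - 1)
    lag1 = sum((residuals[i] - m) * (residuals[i + 1] - m) for i in range(n - 1)) / (n - 1)
    motor_variance = -lag1
    timer_variance = variance - 2 * motor_variance
    return timer_variance, motor_variance
```

The residual variance after detrending corresponds to the "detrended variance" measure, and the absolute slope to the drift magnitude described in the text.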

#### The Beat Alignment Perception Test (BAPT) *Stimuli and procedure*

The BAPT was used to measure the ability to perceive the underlying pulse of a rhythm without requiring a motor response. In this task participants listened to 17 clips of recorded music (average duration = 15.9 s) with a superimposed computer-generated metronome (1 kHz pure tone, 100 ms duration) that was either in sync or out of sync with the underlying beat. The metronome could be out of sync in one of two ways: stretched (Stretch: at a slower tempo than the music clip) or shifted (Shift: out of phase with the music clip). On each trial participants listened to the stimulus and were then asked to indicate whether the metronome was in sync with the beat or not (Yes or No), and to rate their confidence on a scale of zero to two (0 = just guessing, 1 = pretty sure, and 2 = 100% sure).

Stimuli for the BAPT were created and made available by Iversen and Patel (2008) and the version we used was created by Müllensiefen et al. (2013) for the Goldsmiths Musical Sophistication Index (Gold-MSI) v1.0. Stimuli were presented and responses were recorded with software written in Python (v2.7).

### *Measures*

Measures of interest for the BAPT were the proportion of correct yes and no responses as well as the confidence ratings. Percent correct was averaged for each condition (on, stretch, and shift). To analyze the confidence ratings, the proportions of responses corresponding to 'just guessing,' 'pretty sure,' and '100% sure' were calculated over all trials.
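A minimal sketch of this scoring is given below. The data layout (one tuple per trial) and the function name are hypothetical; the original responses were collected with the authors' own Python software, which is not reproduced here.

```python
from collections import Counter


def score_bapt(trials):
    """Score BAPT trials given as (condition, said_in_sync, confidence) tuples.

    condition is "on", "stretch", or "shift"; confidence is 0, 1, or 2.
    Returns (accuracy per condition, proportion of each confidence level).
    """
    correct, total, confidence_counts = Counter(), Counter(), Counter()
    n_trials = 0
    for condition, said_in_sync, confidence in trials:
        total[condition] += 1
        # "Yes" is correct only when the metronome really was on the beat.
        if said_in_sync == (condition == "on"):
            correct[condition] += 1
        confidence_counts[confidence] += 1
        n_trials += 1
    if n_trials == 0:
        raise ValueError("no trials to score")
    accuracy = {c: correct[c] / total[c] for c in total}
    confidence_props = {level: confidence_counts[level] / n_trials for level in (0, 1, 2)}
    return accuracy, confidence_props
```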

#### Working Memory Tasks

In order to assess possible group differences and to examine the involvement of auditory working memory in rhythm abilities, participants were tested on the DS and LNS tasks from the Wechsler Adult Intelligence Scale (WAIS-IV; Wechsler, 2008). In the DS task participants are required to recall strings of numbers and in the LNS task participants are required to recall and mentally manipulate strings of letters and numbers. Tests were scored according to the WAIS manual and age-normed scaled scores were derived.

### Data Analysis

All analyses were conducted using SPSS version 22 (PASW Inc, Chicago, IL, USA). For each task measure a mixed factor repeated-measures analysis of variance (ANOVA) was used with task level (e.g., meter, tempo, or metrical complexity) as within-subject factors and musician type as the between-subjects factor. The Greenhouse–Geisser correction was applied in cases where the assumption of sphericity was violated according to Mauchly's test. All pairwise and simple comparisons reported below have been corrected for multiple comparisons using the Bonferroni correction. In SPSS, the Bonferroni correction is applied by multiplying the *p*-value by the number of comparisons. In this way, the 0.05 significance threshold can still be used (see The calculation of Bonferroni-adjusted *p*-values, retrieved from http://www-01.ibm.com/support/docview.wss?uid=swg21476685).
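The adjustment described above amounts to a one-line transformation. The sketch below is illustrative only: the function name is hypothetical, and capping adjusted values at 1 is a common convention that is assumed here rather than taken from the text.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Three pairwise comparisons: adjusted values are compared against 0.05 as usual.
adjusted = bonferroni_adjust([0.01, 0.04, 0.5])
```

In this example only the first comparison (adjusted *p* of about 0.03) would remain below the 0.05 threshold.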

In order to examine whether individual experience impacts performance, correlations were performed between the performance measures and five musical experience measures (years of lessons, age of start, current hours of practice, years playing primary instrument, and years playing total). Additionally, we assessed the relationship between auditory working memory and task performance by examining correlations between a combined DS and LNS score and all task measures. Finally, we analyzed correlations between performance measures on all four rhythm tasks. The Benjamini and Hochberg (1995) false discovery rate procedure was used to control for multiple correlations. However, as the correlation analysis was exploratory, uncorrected correlation values are reported while values that remain significant following correction are indicated in **Tables 3** and **4**.
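For readers unfamiliar with the Benjamini and Hochberg step-up procedure, a compact sketch follows. The function name and the false discovery rate level of 0.05 are assumptions for illustration; the text does not state the level that was used.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking p-values that survive FDR control.

    Step-up rule: find the largest rank k (1-based, ascending p) such that
    p_(k) <= (k / m) * q, then flag every p-value with rank <= k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    significant = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[idx] = True
    return significant
```

Because the rule is step-up, a *p*-value that misses its own rank threshold can still be flagged when a larger *p*-value further down the ordering meets its threshold.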

#### TABLE 3 | Results of working memory tasks.


*Values represent scaled scores. Values outside parentheses are means and values inside parentheses are standard deviations.*

### RESULTS

### Musical Training and Experience

To assess possible differences in training and experience between the musician groups, we used separate one-way ANOVAs for each measure from the MEQ. Only significant or marginally significant differences between groups are reported here (see **Table 1** for all measures). There was a significant effect of age of start [*F*(3,40) = 3.24, *p* = 0.033, η<sup>2</sup> = 0.208]. None of the between-group comparisons reached significance; however, the drummer group started latest on average (*M* = 11.2, *SD* = 3.27), followed by the singers (*M* = 10.7, *SD* = 4.32), string players (*M* = 8.08, *SD* = 2.77), and pianists (*M* = 7.6, *SD* = 2.11). There was a significant main effect of years of lessons [*F*(3,40) = 3.27, *p* = 0.032, η<sup>2</sup> = 0.212], showing that pianists (*M* = 12.9, *SD* = 2.96) had taken lessons on their primary instrument for significantly longer than drummers (*M* = 5.44, *SD* = 3.61; *p* = 0.030). There were no significant differences between singers (*M* = 11.30, *SD* = 7.97) and string players (*M* = 10.50, *SD* = 5.53) or between these groups and the others. The non-musician group had played an instrument for 1.25 years on average (*SD* = 0.94).

### Rhythm Synchronization

There was a main effect of metrical complexity for the percent correct measure [*F*(2,100) = 33.93, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.404], such that participants had a significantly lower proportion of correct taps for the NM rhythms compared to both the MS (*p* < 0.001) and the MC rhythms (*p* < 0.001; see **Figure 1B**). There was a main effect of group [*F*(4,50) = 2.83, *p* = 0.034, η<sup>2</sup> = 0.185]; however, none of the follow-up comparisons survived Bonferroni correction (see **Figure 1A**). There was no metrical complexity by group interaction [*F*(8,100) = 1.27, *p* = 0.267, η<sup>2</sup><sub>p</sub> = 0.092; see **Figure 1F**]. This is consistent with the findings of previous studies using the same task (Chen et al., 2008; Bailey and Penhune, 2010, 2012), which show that, based on this global measure, metrical complexity affects performance and that even non-musicians perform adequately.

*BAPT, Beat Alignment Perception Test; RST, Rhythm Synchronization Task; BST, Beat Synchronization Task; SCT, Synchronization–Continuation Task; MSJerk, Mean Squared Jerk.* <sup>a</sup>*Correlation survived correction for multiple correlations.*

For the more specific ITI deviation measure, there was a significant effect of metrical complexity [*F*(2,100) = 18.97, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.275], such that ITI deviation was significantly lower for the MS rhythms compared to both the MC and NM rhythms (*p* < 0.001 for both comparisons). MC and NM rhythms did not differ significantly (*p* = 0.225; see **Figure 1D**). There was a main effect of group [*F*(4,50) = 3.14, *p* = 0.022, η<sup>2</sup> = 0.20], but no significant metrical complexity by group interaction [*F*(8,100) = 1.52, *p* = 0.16, η<sup>2</sup><sub>p</sub> = 0.108; see **Figure 1E**]. The main effect of group was driven by a significantly larger ITI deviation in non-musicians compared to string players (*p* = 0.022; see **Figure 1C**). There were no statistically significant differences between musician groups.
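As a reading aid, the partial eta squared values reported alongside these *F* tests can be related to the *F* statistic and its degrees of freedom through the standard identity η<sup>2</sup><sub>p</sub> = *F*·df<sub>effect</sub> / (*F*·df<sub>effect</sub> + df<sub>error</sub>). The helper below is a hypothetical illustration of that identity and is not part of the original SPSS analysis.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Metrical complexity effect on percent correct in the RST: F(2,100) = 33.93
effect_size = round(partial_eta_squared(33.93, 2, 100), 3)  # 0.404
```

Applying the same identity to the ITI deviation effect, *F*(2,100) = 18.97, gives 0.275, which matches the reported value.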

### Beat Synchronization

Two mixed factorial ANOVAs were performed with percent correct and ITI deviation as dependent variables. Meter (triple and duple), metrical complexity (perfectly metric, strongly metric, metric and weakly metric) and tempo (fast and slow) were included as within-subject factors and musician group as a between-subject factor. For percent correct, there was a main effect of metrical complexity [*F*(3,147) = 11.17, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.186]. There was also a main effect of meter [*F*(1,49) = 106.09, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.684], tempo [*F*(1,49) = 27.69, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.336], and group [*F*(4,49) = 7.94, *p* < 0.001, η<sup>2</sup> = 0.112]. There was a metrical complexity by tempo interaction [*F*(3,147) = 5.28, *p* = 0.002, η<sup>2</sup><sub>p</sub> = 0.097] such that percent correct was higher for slow rhythms compared to fast for all levels of metrical complexity (all *p* < 0.01) except metric (*p* = 0.342). There was also a metrical complexity by meter interaction [*F*(3,147) = 3.04, *p* = 0.031, η<sup>2</sup><sub>p</sub> = 0.058] and a three-way interaction with metrical complexity, meter and group [*F*(12,147) = 2.73, *p* = 0.002, η<sup>2</sup><sub>p</sub> = 0.182; see **Figures 2A,B**]. No other interactions were significant. Follow-up comparisons showed that for the triple meter, drummers had higher percent correct compared to singers (*p* = 0.003) and non-musicians (*p* = 0.007) for strongly metric rhythms. For metric rhythms in triple meter, drummers had higher percent correct compared to all other groups (Pianists: *p* = 0.015; Singers: *p* = 0.015; String players: *p* = 0.003; Non-musicians: *p* = 0.021; see **Figure 2A**). For duple meter rhythms, non-musicians had lower percent correct compared to all musician groups for perfectly metric, strongly metric and metric rhythms. For weakly metric rhythms, non-musicians had lower percent correct compared to drummers only (*p* = 0.012; see **Figure 2B**). No other comparisons were significant.

FIGURE 1 | Performance on the Rhythm Synchronization Task. (A) Percentage of correct taps across the four musician groups and non-musician control group. (B) Percentage of correct taps for each level of metrical complexity averaged over all groups. (C) ITI deviations across the four musician groups and non-musician control group. (D) ITI deviations for each level of metrical complexity averaged over all groups. (E) ITI deviations for each level of metrical complexity for each group. (F) Percent correct for each level of metrical complexity for each group. ∗*p <* 0.05, ∗∗*p <* 0.01.

For the ITI deviation measure, there was a main effect of metrical complexity [*F*(3,147) = 3.46, *p* = 0.018, η<sup>2</sup><sub>p</sub> = 0.066] such that there was higher tapping variability for weakly metric rhythms compared to metric rhythms (*p* = 0.004; see **Figure 2C**). There was a main effect of meter [*F*(1,49) = 39.90, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.449], a marginally significant main effect of group [*F*(4,49) = 2.56, *p* = 0.05, η<sup>2</sup> = 0.173] and a meter by group interaction [*F*(4,49) = 4.59, *p* = 0.003, η<sup>2</sup><sub>p</sub> = 0.272]. There was no main effect of tempo [*F*(1,49) = 1.74, *p* = 0.193, η<sup>2</sup><sub>p</sub> = 0.034]. No other interactions reached significance. Follow-up comparisons showed that in the triple meter, singers showed higher tapping variability than non-musicians (*p* = 0.024). For the duple meter, non-musicians were more variable in their tapping compared to all musician groups (Drummers: *p* = 0.042; Pianists: *p* = 0.005; Singers: *p* = 0.005; String players: *p* = 0.031; see **Figure 2D**).

### Synchronization–Continuation

Following a large number of studies using the same task (see Repp and Su, 2013 for review), we focused on data from the continuation phase, when internal timing processes, rather than synchronization with external stimuli, were likely to predominate. Mean ITIs were compared across tempi and groups to ensure that participants were able to tap accurately at the three tempi without the aid of a metronome. As expected, there was a highly significant main effect of tempo, with faster rates producing shorter ITIs [*F*(2,100) = 9559.96, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.995]. Mean ITIs for each tempo were 252.23 (*SD* = 9.27), 496.37 (*SD* = 14.68), and 747.12 ms (*SD* = 33.48). There was no main effect of group [*F*(4,50) = 0.182, *p* = 0.947, η<sup>2</sup> = 0.014] and no significant tempo by group interaction [*F*(8,100) = 0.835, *p* = 0.530, η<sup>2</sup><sub>p</sub> = 0.063]. Together these results show that all groups were able to tap the target intervals successfully, even without the metronome.

For detrended variance there was a significant main effect of tempo [*F*(2,100) = 47.01, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.485], a significant main effect of group [*F*(4,50) = 4.42, *p* < 0.004, η<sup>2</sup> = 0.261], and a significant tempo by group interaction [*F*(8,100) = 3.06, *p* = 0.019, η<sup>2</sup><sub>p</sub> = 0.197]. Follow-up comparisons showed that for the medium tempo (ITI of 500 ms) non-musicians had greater variability than the pianists and singers (*p* = 0.005 and *p* = 0.04, respectively) and that for the slow tempo non-musicians had greater variability than the drummers and the singers (*p* = 0.031 and *p* = 0.019, respectively; see **Figure 3A**). There were no significant differences between musician groups.

The magnitude of long-term drift, as measured by the absolute slope of the detrending line, was compared across groups and tempi. Consistent with previous work (Collier and Ogden, 2004), there was a significant effect of tempo [*F*(2,98) = 16.63, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.253], such that the magnitude of the slope increased as tempo decreased. There was a main effect of group [*F*(4,49) = 2.93, *p* = 0.03, η<sup>2</sup> = 0.193] and a tempo by group interaction [*F*(8,98) = 2.22, *p* = 0.054, η<sup>2</sup><sub>p</sub> = 0.153; see **Figure 3B**]. Follow-up comparisons showed that string players had a larger absolute slope compared to pianists (*p* = 0.022) at the slow tempo (ITI of 750 ms). There were no other group differences.

As discussed above, mean squared jerk (MSJerk) is a measure of the smoothness of movement, such that smooth movements have low MSJerk. For this measure, there was a main effect of tempo [*F*(2,96) = 565.17, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.922], showing that MSJerk increased significantly as the tempo decreased, consistent with previous findings (Baer et al., 2013). There was no main effect of group [*F*(4,48) = 0.398, *p* = 0.81, η<sup>2</sup> = 0.032] and no tempo by group interaction [*F*(8,96) = 0.24, *p* = 0.93, η<sup>2</sup><sub>p</sub> = 0.020; see **Figures 3C,D**].

Tapping variability was split into timer variability and motor variability using the Wing–Kristofferson model. For the timer variability, there was a significant main effect of tempo [*F*(2,96) = 39.96, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.454], a main effect of group [*F*(4,48) = 7.17, *p* < 0.001, η<sup>2</sup> = 0.374], and a significant tempo by group interaction [*F*(8,96) = 4.16, *p* = 0.004, η<sup>2</sup><sub>p</sub> = 0.257]. Follow-up comparisons showed that at the medium tempo, non-musicians showed significantly greater timer variability than drummers (*p* = 0.043) and pianists (*p* = 0.016) but not singers (*p* = 0.066) or string players (*p* = 0.23). For the slow tempo, timer variability was significantly higher in non-musicians compared to drummers (*p* = 0.002) and singers (*p* = 0.005) and near-significant when compared to pianists (*p* = 0.056) and string players (*p* = 0.052). There were no significant between-group differences at the fast tempo (see **Figure 4B**).

For motor variability, there was a significant main effect of tempo [*F*(2,92) = 18.05, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.282], a main effect of group [*F*(4,46) = 5.44, *p* = 0.001, η<sup>2</sup> = 0.321] and a tempo by group interaction [*F*(8,92) = 2.32, *p* = 0.026, η<sup>2</sup><sub>p</sub> = 0.168]. Follow-up comparisons showed that at the medium tempo non-musicians had greater motor variability than pianists and singers (*p* = 0.002 and *p* = 0.063, respectively). At the slow tempo, string players had greater motor variability than the pianist group (*p* = 0.033) while the non-musicians showed greater variability than the pianist and singer groups (*p* = 0.007 and *p* = 0.02, respectively; see **Figure 4A**).

### Beat Alignment Perception Task

One singer and one pianist did not complete the BAPT. The accuracy of "on" and "off" beat judgments was compared across groups and across the two "off" conditions (stretch and shift; see **Figure 5**). There was a main effect of the on/off variable [*F*(2,98) = 28.29, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.366], showing that participants were significantly less accurate for the shift condition compared to the on condition (*p* < 0.001) and the stretch condition (*p* < 0.001; see **Figure 5B**). There was a main effect of group [*F*(4,49) = 8.61, *p* < 0.001, η<sup>2</sup> = 0.413]. Follow-up comparisons showed that non-musicians showed lower accuracy compared to all musician groups across all conditions (*p* < 0.001, for all comparisons; see **Figure 5A**).

In order to test for differences in confidence ratings between groups, the proportions of ratings with a value of 0 (just guessing), 1 (pretty sure), and 2 (100% sure) were compared between groups. There was a significant main effect of confidence rating [*F*(2,98) = 122.09, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.714] and a significant group by confidence rating interaction [*F*(8,98) = 4.62, *p* < 0.001, η<sup>2</sup><sub>p</sub> = 0.274]. The main effect of group was not significant [*F*(4,49) = 1.11, *p* = 0.36, η<sup>2</sup> = 0.083]. Follow-up comparisons showed that non-musicians rated their confidence as 'pretty sure' for a larger proportion of trials compared to pianists (*p* = 0.009) and drummers (*p* = 0.007) and rated their confidence as '100% sure' for a smaller proportion of trials compared to pianists (*p* = 0.006) and drummers (*p* = 0.005).

### Working Memory Tasks

Scaled scores on the DS and LNS tasks were compared across the musician and non-musician groups separately. There was no main effect of group for either the DS task [*F*(4,51) = 2.05, *p* = 0.10, η<sup>2</sup> = 0.138] or the LNS task [*F*(4,51) = 1.54, *p* = 0.205, η<sup>2</sup><sub>p</sub> = 0.108; see **Table 3**].

### Correlations Between Measures of Musical Training, Working Memory, and Task Performance

Analysis of musical experience measures showed significant differences between musician groups in terms of the age at which they started playing their primary instrument and the years of formal training on their primary instrument. Therefore, a correlation analysis was performed to assess whether task performance was correlated with these factors. Years of lessons was significantly correlated with ITI deviation [*r*(39) = –0.30, *p* = 0.053] and percent correct [*r*(39) = 0.34, *p* = 0.029] on the RST as well as with motor variance [*r*(36) = –0.37, *p* = 0.022] on the Synchronization–Continuation task (see **Table 4**).

Because a number of previous studies have shown a relationship between rhythm task performance and working memory (Bailey and Penhune, 2010, 2012; Grahn and Schuit, 2012), we examined correlations between a combined score for the two working memory tasks and the main behavioral measures. The only significant correlation was found with performance on the BAPT [*r*(38) = 0.35, *p* = 0.028 for musicians only; *r*(51) = 0.33, *p* = 0.015 including non-musicians].

Finally, in order to assess how performance on the tasks in our battery related to each other, we examined correlations among the main behavioral measures across tasks for the whole sample. Nearly all task measures correlated significantly with each other and many that were not significant showed a trend towards significance (see **Table 5**).

### DISCUSSION

The purpose of this study was to compare the rhythm perception and production abilities of drummers, singers, pianists, and string players. A battery of four rhythm and beat-based tasks was used to assess the effects of specific musical training on both higher-level rhythm processing and low-level motor timing abilities.

Existing research had suggested that drummers might outperform other musicians on basic rhythm and interval timing tasks (Krause et al., 2010; Cicchini et al., 2012; Repp et al., 2013). However, our results revealed no significant differences between musician groups for the majority of task measures. This was despite the fact that all musicians were selected to have the majority of their training on the target instrument, had on average more than 10 years of experience on their instrument, and were currently practicing. Generally, musicians showed better performance compared to non-musicians on all rhythm tasks. This suggests that musical training, whether rhythm-focused as in the drummers, or melody-focused as in the singers, improves rhythm perception and production. The only differences between musician groups were found on the BST and SCT.

#### TABLE 5 | Between-task correlations.

∗*p < 0.05,* ∗∗*p < 0.01, two-tailed. BAPT, Beat Alignment Perception Test; RST, Rhythm Synchronization Task; BST, Beat Synchronization Task; SCT, Synchronization–Continuation Task.* <sup>a</sup>*Correlation survived correction for multiple correlations.*

### Similarities and Differences in Performance Across Tasks

In the BST, drummers had a higher percentage of correct taps compared to all other groups for metric rhythms in triple meter. Additionally, drummers had a higher percentage of correct taps compared to singers and non-musicians for strongly metric rhythms in triple meter. For rhythms in duple meter, all musicians performed equally well and significantly better than non-musicians. This indicates that all musicians were able to use the rhythmic structure of the more common duple meter to find and synchronize to the underlying beat. All musicians had more difficulty synchronizing to the beat of rhythms in triple meter and many performed similarly to non-musicians. The superior performance of drummers for rhythms in triple meter may indicate that the advantage imparted by rhythm-specific training is only evident in the more difficult condition. Another possibility is that drummers are more accustomed to synchronizing with rhythms in triple meter; however, this likely depends less on instrument-specific training than on genre-specific training, which was not tested here. Differences between musician groups were not seen for the ITI deviation measure. The higher sensitivity of this measure may have increased variability overall, which may have obscured more subtle between-group differences.

Continuation tapping in the SCT was used to measure low-level motor timing abilities including the ability to produce and maintain an isochronous beat. No differences between musician groups were found on the detrended variance, MSJerk and timer variance measures. However, pianists were shown to have lower motor variability and less long-term drift, but only compared to string players. Through extensive practice at making precise finger movements, pianists are likely to develop a particularly high level of finger dexterity, possibly explaining their reduced motor variability and drift. In addition, the tapping movement required for this task was very similar to the keypress movements on which pianists train. This is supported by work showing different kinematics in professional pianists compared to amateur pianists (Winges and Furuya, 2015) leading to lower tapping variability compared to non-musicians (Aoki et al., 2005). The lack of other between-group differences in long-term drift indicates that all groups including musicians were able to maintain the target interval over the course of a trial.

The lack of differences between musician groups on the other measures of the SCT may indicate that having a highly developed rhythm framework transfers to low-level tapping abilities. Another possibility is that training on using precise movements to produce music improves the ability to tap accurately, regardless of the specific movements that one is trained in. There were differences in performance on this task between pianists and string players but not for any other groups, a result that cannot be accounted for by either of the above explanations. This suggests that both top-down and bottom-up processes likely interact to determine a musician's continuation tapping ability.

In the RST, only string players showed lower ITI deviation than the non-musicians. This may be due to high within-group variability on this task. In performing this task, participants are expected to use the metrical structure of the rhythms to better encode and recall the elements of the rhythms, thus facilitating prediction and synchronization (Chen et al., 2008). Enhanced synchronization across all groups for the metric simple and metric complex rhythms compared to the non-metric rhythms supports the idea that participants were using the metric structure to predict onsets and synchronize finger taps. However, although non-musicians showed lower percent correct and higher ITI deviations, these differences generally did not reach significance (see **Figure 1**). Furthermore, performance on this task was similar to that of musicians and non-musicians tested on this task in previous studies (Chen et al., 2008; Bailey and Penhune, 2010, 2012). This further suggests that the lack of significant differences between musicians and non-musicians is likely due to high variability among the musicians rather than a failure of this particular sample of musicians to perform the task.

For the BAPT, results indicate that perceiving a phase shift relative to the underlying beat of a musical excerpt is more difficult than perceiving a tempo shift for all groups of musicians. Furthermore, musicians were shown to have more sensitive beat perception than non-musicians, and non-musicians showed lower confidence in their responses compared to drummers and pianists. However, there were no differences in beat perception between musician groups. This is despite the fact that drummers showed enhanced performance on the BST compared to the other musicians. This difference may be due to the importance of movement in beat perception, even for rhythm experts. For example, a recent study showed that percussionists showed improved performance on a beat-based perception task compared to non-percussionists when synchronizing to the beat but were not better in the no-movement condition (Manning and Schutz, 2015). Although other studies have shown improved performance on basic rhythm and timing tasks (Krause et al., 2010; Cicchini et al., 2012; Repp et al., 2013), it is possible that advantages in higher-level beat perception rely on synchronous movement. Another possibility is that this task is not challenging enough to reveal subtle differences in beat perception between the musician groups, as the average accuracy was 88%. Between-group differences in the BST were only shown for the more difficult triple meter rhythms. Perhaps with smaller differences between the metronome and the underlying beat, differences in beat perception between musicians would be revealed.

Performance measures on all tasks were highly correlated, indicating that there is strong overlap between perception and production, as well as different rhythm and timing processes. Future work using tasks that account for unique as well as overlapping aspects of rhythm and timing abilities would perhaps detect more subtle differences between musician groups.

The lack of differences between musicians on the tasks requiring tapping, and the lack of advantage for the drummer group in particular, may be due to a discrepancy between the effector and movement type required for these tasks and those used to perform on their instrument. Drummers and percussionists generally perform using drum sticks or mallets and make large, often whole-arm, movements. This discrepancy between effector used in performance and the tasks used here may have reduced the motor timing advantages imparted by drummers' rhythm-focused training. Continuation tapping with a drumstick leads to lower variability compared to finger tapping in non-musicians (Madison et al., 2013) and percussionists show larger movement-related perceptual timing benefits than non-musicians when tapping with a drumstick (Manning and Schutz, 2015). Furthermore, synchronization with a metronome is less variable when string players synchronize by playing their own instrument compared to finger tapping (Stoklasa et al., 2012). Therefore, motor timing advantages may be specific to the effector and movements that are inherently part of a musician's training. This is supported by the fact that pianists showed the lowest motor timing variability and long-term drift in the SCT; however, this difference was only significant relative to string players. It has also been suggested that, compared to tapping with a drumstick, finger tapping is more susceptible to small accumulating errors, which may increase overall variability and obscure between-group differences (Madison et al., 2013). Based on this, it would be informative to compare rhythm and timing abilities of musicians using a drumstick and/or their own instruments instead of finger tapping.

Another possible reason for the lack of differences between drummers and the other musicians on these tasks is that the drummer group, although similar to the other musicians in terms of years of experience, had less formal training overall. Although many studies have found differences in synchronization abilities between musicians and non-musicians, two did not find effects related to musical training (e.g., Essens and Povel, 1985; Hove et al., 2010). Other studies did not find differences between musician groups (van Vugt and Tillmann, 2014), or only saw differences in certain contexts (Krause et al., 2010; Carey et al., 2015; Manning and Schutz, 2015). In the study by Krause et al. (2010), drummers had more years of experience than those in the current study (∼15 years vs. ∼12 years in the current study) although they still had less experience than the other musician groups to whom they were compared. Other studies have shown varying results relating to the importance of age of start and years of formal training (Kincaid et al., 2002; Fujii et al., 2009; Bailey and Penhune, 2010, 2012). Therefore, future studies could investigate whether years playing an instrument, age of start or years of formal training contribute differentially to rhythm perception and production abilities in musicians.

### Impact of Musical Experience

There are a number of factors related to music training and experience that have been shown to be related to rhythm perception and production abilities. These include years of experience, years of formal lessons and age of start, among others (Bailey and Penhune, 2010, 2012; Grahn and Schuit, 2012). Therefore, in this study we attempted to match a number of these potentially confounding variables across our groups. However, specific patterns of experience appeared between the groups that were difficult to control. We were successful in matching the number of years of experience with the primary instrument and the weekly hours of current practice, which did not differ across groups. However, drummers started playing their instrument later and had fewer years of lessons compared to the other musician groups. Age of start was not correlated with any of the task measures, but the number of years of lessons was correlated with percent correct and ITI deviation on the RST, as well as motor variance on the SCT. This is consistent with previous studies using the RST (Bailey and Penhune, 2010, 2012) and may suggest that years of lessons is an important predictor of rhythm abilities. However, this is contrary to a study which showed that tapping stability was correlated with age of start but not years of drum training (Fujii et al., 2009). Therefore, it is possible that fewer years of formal training and/or later age of start in the drummers may have contributed to the lack of differences between them and the other musician groups, despite being matched in terms of years playing their primary instrument.

It is suggested here that musicians with a high level of musical training perform equally well on rhythm tasks, possibly due to extensive knowledge of rhythmic structure in music and strong low-level timing abilities. Therefore, it is possible that testing musicians with intermediate levels of experience would lead to between-group differences. Also, melodies generally contain rhythmic information; therefore, melody experts such as singers may become rhythm experts as a side effect of their melodic training. Perhaps by comparing musicians based on the type of music they perform (e.g., beat-based vs. not beat-based), differences in rhythmic abilities would emerge. We also show here that non-musicians are generally able to perform tasks that require higher-level rhythm processing, despite the lack of training. This supports the idea that processing even the more abstract aspects of musical rhythm is a skill that is universal among humans.

### Within-Task Comparisons

The lack of differences across musician groups could raise questions as to whether performance or task limitations might affect our results. However, comparison with previous studies using the same tasks, and examination of within-task performance measures indicate that these findings cannot be explained by floor or ceiling effects, or by problems with task manipulations. First, musicians out-performed non-musicians on virtually all task measures, and all musician groups performed in the range of other musicians tested in previous studies. Second, within-task manipulations of metrical complexity, meter and tempo affected performance in predictable ways, consistent with previous studies using the same tasks.

In both the RST and BST, increased metrical complexity led to increased tapping variability and decreased accuracy (Chen et al., 2008; Bailey and Penhune, 2010, 2012; Kung et al., 2013). Similarly, manipulation of meter in the BST and tempo in the SCT showed the expected within-group results (Baer et al., 2013; Kung et al., 2013). Participants were more variable in the BST for rhythms in the triple meter, which is consistent with Kung et al. (2013) and was expected as the majority of western music is in duple or quadruple meter. In the SCT, mean ITIs were close to the target intervals, showing that participants were able to perform the task successfully. Additionally, tapping variability, long-term drift, jerkiness of movements, as well as motor and timing variability increased as tempo decreased, all consistent with previous work (Repp and Doggett, 2007; Baer et al., 2013, 2015). Likewise, results for the BAPT were consistent with previous research (Iversen and Patel, 2008). Percent correct was higher for the "on" judgments compared to the "off" judgments and participants had more difficulty when the metronome was phase-shifted compared to when it was stretched relative to the beat. Because differences related to within-task factors were consistent with previous research for all groups on all tasks, the lack of between-group differences cannot be attributed to a failure of the task manipulations to alter performance. Finally, these results cannot be attributed to differences in auditory working memory, as no significant differences were found between groups on these tasks.

### SUMMARY AND CONCLUSION

To summarize, we tested drummers, pianists, singers, string players, and a non-musician control group on four rhythm tasks. Overall, musicians performed better than non-musicians on most tasks; however, differences between musician groups were not found on a majority of the tasks. Together these results suggest that general musical experience is more important than specialized instrument-specific experience with regards to rhythm perception and production. Only the BST and SCT showed differences between groups. Drummers were better at extracting and synchronizing to the underlying beat of rhythms in the more difficult triple meter condition in the BST. Pianists showed lower motor variability and less drift than string players on the SCT. These results indicate that higher-level and lower-level aspects of rhythm abilities interact in subtle ways such that one may obscure the other when there is a discrepancy between the effector and movements required for the task and those used in training. As only finger tapping tasks were used to measure synchronization and self-paced tapping, drummers only showed higher-level rhythm processing abilities in the most difficult condition of the BST. Conversely, lower-level motor timing advantages were only shown for pianists for the measures that reflect their effector-specific training. The lack of match between training and tasks may have masked differences between groups in all but the most difficult or training-specific conditions. Therefore, musical training improves rhythm abilities in general, whereas more fine-grained, instrument-specific differences are only seen in musicians when task requirements match particular aspects of training.

### AUTHOR CONTRIBUTIONS

TM was involved with the design, acquisition, analysis, and interpretation of this work as well as writing and revising the manuscript. BG and JT were involved in the acquisition and analysis of the data and contributed to the revising of the manuscript. VP was involved in the conception and design of this work as well as the interpretation of data and revising of the final manuscript.

### ACKNOWLEDGMENTS

The authors thank Kirsten Anderson and Lucy O'Toole for assistance in testing and recruiting. The authors also thank Larry Baer, Joyce Chen, Shu-Jen Kung, and Müllensiefen and colleagues for the use of their tasks and stimuli. During the time of the study, TM was supported by a graduate award from the Natural Sciences and Engineering Research Council of Canada (NSERC)-Create Training Program in Auditory Cognitive Neuroscience (371324). This work was also supported by an NSERC Discovery Grant to VP (238670).

### REFERENCES


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

*Copyright © 2016 Matthews, Thibodeau, Gunther and Penhune. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.*

# Rhythm Facilitates the Detection of Repeating Sound Patterns

Vani G. Rajendran, Nicol S. Harper, Khaled H. A. Abdel-Latif and Jan W. H. Schnupp\*

Auditory Neuroscience Group, Department of Physiology, Anatomy, and Genetics, University of Oxford, Oxford, UK

This study investigates the influence of temporal regularity on human listeners' ability to detect a repeating noise pattern embedded in statistically identical non-repeating noise. Human listeners were presented with white noise stimuli that either contained a frozen segment of noise that repeated in a temporally regular or irregular manner, or did not contain any repetition at all. Subjects were instructed to respond as soon as they detected any repetition in the stimulus. Pattern detection performance was best when repeated targets occurred in a temporally regular manner, suggesting that temporal regularity plays a facilitative role in pattern detection. A modulation filterbank model could account for these results.

#### Edited by:

Sonja A. Kotz, Max Planck Institute Leipzig, Germany

#### Reviewed by:

Daniel Pressnitzer, École Normale Supérieure, France

Nai Ding, Zhejiang University, China

> \*Correspondence: Jan W. H. Schnupp jan.schnupp@dpag.ox.ac.uk

#### Specialty section:

This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience

Received: 04 November 2015 Accepted: 11 January 2016 Published: 29 January 2016

#### Citation:

Rajendran VG, Harper NS, Abdel-Latif KHA and Schnupp JWH (2016) Rhythm Facilitates the Detection of Repeating Sound Patterns. Front. Neurosci. 10:9. doi: 10.3389/fnins.2016.00009

Keywords: rhythm, pattern detection, temporal regularity, noise learning, psychoacoustics, modulation filters, footsteps, auditory neuroscience models

## INTRODUCTION

Beneficial to survival in a complex and ever-changing acoustic environment is the ability to quickly identify relevant sounds that comprise the scene. One useful strategy is to detect recurring patterns over time, as these are often salient and suggestive of animate sound sources. Consider footsteps: steps on gravel sound nothing like steps through grass or through puddles, yet all of these very disparate sounds are easily recognized as the sound of footsteps if they occur in a rhythmic, repeating pattern. To recognize rhythmic patterns, the brain needs to search for recurrences of arbitrary and potentially complex sounds over timescales ranging from fractions of a second to tens of seconds.

Studies of auditory pattern detection often employ Gaussian white noise stimuli because they are spectrally broadband and devoid of prior meaning to listeners. Humans exhibit an impressive capacity to rapidly form recognition memories of frozen white noise tokens (Kaernbach, 2004; Agus and Pressnitzer, 2013), and these memories can persist for weeks (Agus et al., 2010). While human sensitivity to arbitrary repeating patterns has been well documented (Kaernbach, 2004; Chait et al., 2007; Agus et al., 2010; Agus and Pressnitzer, 2013), the question of how repetition is detected in the first place remains poorly understood. Previous noise learning studies have only explored conditions where repeating noise tokens were presented at precisely regular (isochronous) time intervals, and it is unclear whether such regularity is necessary or helpful for pattern detection. If sensory memory alone is responsible for pattern detection, then whether the sounds occur at regular or irregular intervals should have no effect on pattern detection.

However, the experiments described here reveal that detection performance does decline with increasing temporal irregularity, indicating that a sensitivity to slow temporal modulations or entrainment to the rhythmic structure of incoming sounds might facilitate pattern detection.

### METHODS

The experimental methodology was approved by the local Ethical Review Committee of the Experimental Psychology Department of the University of Oxford, and conforms to the ethical standards in the 1964 Declaration of Helsinki.

In order to investigate to what extent temporal regularity might facilitate pattern detection, we asked human subjects to detect repeating noise patterns played over headphones. We generated frozen noise "targets" and manipulated their regularity by embedding them in non-frozen "filler" noise of varying length. In this manner, we probed pattern detection in a temporally regular (REP-R) and temporally irregular or jittered (REP-J) context. REP-R stimuli were designed to measure how the detectability of a target depended on its duration relative to a fixed inter-onset interval (IOI). REP-J stimuli were designed to measure how the detectability of a target depended on the variability of IOI. Background "false alarm" detection rates were measured with a control stimulus of totally non-repeating noise (RAND). As a further control to test the subjects' ability to report changes in the quality of the noise stimuli rapidly and reliably, we also incorporated a fourth stimulus type in the experiment (PINK), in which the spectrum of the noise changed from white (flat amplitude spectrum) to pink (1/f amplitude spectrum). MATLAB was used for stimulus generation, response collection and data analysis.

All stimuli were 8 s in duration, and either remained non-repeating noise throughout (RAND condition), or started with non-repeating noise for a variable (uniformly distributed over 3–4 s) duration before transitioning to alternating noise targets and fillers (REP conditions) or to pink noise (PINK condition). The repeating section contained exactly 8 repeats of a single noise target embedded in noise fillers with the duration and jitter parameter combinations shown in **Figure 1**. REP-R stimuli had target durations T of either 500, 400, 300, or 200 ms, and filler durations F = 500 − T ms to yield a constant IOI of 500 ms. For the REP-J stimuli, T was fixed at 300 ms, and F was drawn independently from a Gaussian distribution with a mean of 200 ms and a standard deviation J of either 10, 50, or 100 ms. Thus, REP-J stimuli had normally-distributed random IOIs with a mean of 500 ms, while REP-R had a fixed IOI of 500 ms, corresponding to a repetition rate of ∼2 Hz for all REP stimuli. Expressed as percentages, REP-R explored the detection of targets that were 100, 80, 60, or 40% of the IOI, and REP-J probed temporal jitter levels of 2, 10, or 20%, when quantified as the standard deviation of the IOI as a fraction of its mean. Examples of these stimuli can be found online<sup>1</sup>.
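The construction of a single REP trial can be sketched as follows. This is a minimal illustration in Python rather than the authors' MATLAB code. The function name `make_rep_stimulus`, the 44.1 kHz sampling rate, the clipping of negative filler durations to zero, and the padding or trimming to exactly 8 s are all assumptions of the sketch, since the text does not specify them.

```python
import numpy as np

FS = 44100  # sampling rate in Hz (assumed; not stated in the text)

def make_rep_stimulus(target_ms, jitter_sd_ms, rng, total_s=8.0,
                      n_repeats=8, mean_ioi_ms=500.0, fs=FS):
    """Return (stimulus, target onset times in seconds) for one REP trial."""
    def n_samples(ms):
        return int(round(ms * fs / 1000.0))

    # One frozen noise token, reused for every repeat within this trial.
    target = rng.standard_normal(n_samples(target_ms))
    # Non-repeating lead-in, uniformly distributed over 3-4 s.
    lead_in = rng.standard_normal(n_samples(rng.uniform(3000.0, 4000.0)))

    pieces, onsets, position = [lead_in], [], lead_in.size
    mean_filler_ms = mean_ioi_ms - target_ms
    for _ in range(n_repeats):
        onsets.append(position / fs)
        # Filler duration: fixed for REP-R (sd = 0), Gaussian for REP-J.
        filler_ms = rng.normal(mean_filler_ms, jitter_sd_ms)
        filler_ms = max(filler_ms, 0.0)  # assumption: clip negative draws
        filler = rng.standard_normal(n_samples(filler_ms))  # fresh noise
        pieces += [target, filler]
        position += target.size + filler.size

    stim = np.concatenate(pieces)
    # Assumption: pad with fresh noise, or trim, to the fixed total duration.
    total = n_samples(total_s * 1000.0)
    if stim.size < total:
        stim = np.concatenate([stim, rng.standard_normal(total - stim.size)])
    return stim[:total], onsets
```

Calling `make_rep_stimulus(300, 0, np.random.default_rng(0))` would give a T300-J0 trial, and passing a non-zero `jitter_sd_ms` of 10, 50, or 100 gives the corresponding REP-J trial. Passing a fresh generator seed per trial mirrors the statement that every trial used a unique target and unique fillers.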

Subjects first underwent an instructional period during which the task was explained and examples of each stimulus type were played until the subjects reported that they could hear the repeating pattern on at least one occasion in both the REP-R and REP-J contexts. Subjects were told that the experiment would consist of four blocks, that stimuli in each block would come one after another with a short silence (∼3 s) between stimuli, and that they would be given a break between each ∼9 min block. Subjects were instructed to press a button as soon as they detected repetition or a transition in the sound.

Each data collection block contained 4 trials of each of the seven REP conditions, 16 RAND trials, and 4 PINK trials, all randomly interleaved. For each subject, over all four blocks, this amounted to 16 trials of each of the seven REP conditions, 64 RAND trials, and 16 trials of PINK. Importantly, the stimulus for each trial was generated from a different random seed and was therefore unique, with its own target, fillers, and set of jittered intervals. This eliminated the possible confound of longer-term memory effects across multiple trials (Agus et al., 2010). Additionally, in order to reduce the likelihood that any trend found could be explained by the particular noise stimuli that make up a single stimulus set, a different stimulus set was independently generated for each subject.
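As a check on the trial counts, the composition of one block can be sketched as below. The condition labels follow the T…-J… naming used in the Results (the label `T400-J0` is inferred from that pattern), and the function name `make_block` is hypothetical.

```python
import random

# Seven REP conditions: four REP-R target durations plus three REP-J jitter levels.
REP_CONDITIONS = ["T500-J0", "T400-J0", "T300-J0", "T200-J0",
                  "T300-J10", "T300-J50", "T300-J100"]

def make_block(rng):
    """Return a randomly interleaved list of trial labels for one block."""
    trials = [cond for cond in REP_CONDITIONS for _ in range(4)]
    trials += ["RAND"] * 16 + ["PINK"] * 4
    rng.shuffle(trials)
    return trials

block = make_block(random.Random(0))
# 8 s stimulus plus roughly 3 s of silence per trial.
block_minutes = len(block) * (8 + 3) / 60
```

Each block therefore holds 48 trials, and 48 × 11 s = 528 s, or about 8.8 min, which agrees with the ∼9 min block length given to subjects.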

Stimuli were played through a TDT RM1 mobile processor (Tucker Davis Technologies, Alachua, FL, USA), and presented diotically at 50 dB SPL over Sennheiser HD 650 headphones (Wedemark, Germany). The TDT device delivered the stimuli and recorded button presses, allowing precise reaction times to be measured. Experiments were conducted in a double-walled sound-proof chamber.

Two performance measures were analyzed for all conditions tested: fraction detected and reaction time. The fraction detected is the proportion of trials (out of a total of 16 in each REP or PINK condition) during which the subject pressed the button to indicate detection. Reaction time was measured from the onset of the first target noise to the time of the button press. By dividing reaction time by 0.5 s, one can determine approximately how many noise targets had been presented before detection. "Miss" trials where repetition was present but not detected were excluded from reaction time calculations.
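The two measures can be expressed compactly as follows. This is a sketch only: the function name `summarize_condition` and the convention of marking a miss with `None` are assumptions made for illustration.

```python
from statistics import mean

IOI_S = 0.5  # mean inter-onset interval of the targets, in seconds

def summarize_condition(reaction_times_s):
    """Summarize one condition.

    reaction_times_s holds one entry per trial, measured from the onset of
    the first target; None marks a miss (no button press).
    """
    hits = [rt for rt in reaction_times_s if rt is not None]
    fraction_detected = len(hits) / len(reaction_times_s)
    mean_rt = mean(hits) if hits else None  # misses are excluded
    # Approximate number of targets presented before detection.
    targets_before_detection = mean_rt / IOI_S if hits else None
    return fraction_detected, mean_rt, targets_before_detection
```

For example, a mean reaction time of 1.5 s corresponds to roughly three target presentations before the button press.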

### RESULTS

Twenty-one paid participants aged 20–40 with normal hearing were recruited for this study. Three subjects were authors on this study, and 12 had some musical training. To ensure that subjects were performing the task correctly, subjects with a false alarm rate greater than 50%, calculated as the percentage of all RAND stimuli during which an erroneous detection was reported, were excluded from further analysis, leaving a final total of 17 subjects.

The population false alarm rate, calculated as the proportion of erroneous detections during RAND trials (out of a total of 64) averaged across all 17 subjects, was 17.9%. The average reaction time for PINK trials was 588 ms.
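The exclusion criterion and the population false alarm rate can be sketched as follows. The function names and the data layout (one list of booleans per subject, `True` for an erroneous detection on a RAND trial) are hypothetical.

```python
def false_alarm_rate(rand_responses):
    """Proportion of RAND trials on which a detection was reported."""
    return sum(rand_responses) / len(rand_responses)

def apply_exclusion(subjects, max_rate=0.5):
    """Drop subjects whose false alarm rate exceeds max_rate.

    subjects maps a subject ID to that subject's list of RAND responses.
    Returns the retained per-subject rates and their average.
    """
    rates = {sid: false_alarm_rate(r) for sid, r in subjects.items()}
    kept = {sid: rate for sid, rate in rates.items() if rate <= max_rate}
    population_rate = sum(kept.values()) / len(kept) if kept else None
    return kept, population_rate
```

The population rate is computed here as the mean of the per-subject rates, following the description of averaging across the retained subjects.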

### Shorter Noise Targets are Harder to Detect

**Figures 2A,C** show REP-R detection performance and reaction times, respectively. A few subjects were near 100% detection for all REP-R stimuli, but the overall trend is for detection performance to increase with increasing target duration (**Figure 2A**), and for reaction time to decrease (**Figure 2C**). Relative to the shortest target duration (T200-J0: 200 ms target duration, 0 ms jitter), detection performance was significantly higher and reaction times significantly lower (p < 0.001 and p < 0.01 respectively, n = 17 subjects, Wilcoxon signed rank test, Holm-Bonferroni corrected) for all other target durations tested. A significant drop in reaction time was also observed between T300-J0 and T500-J0 (p < 0.05, n = 17, Wilcoxon signed rank test, Holm-Bonferroni corrected). Thus, as the duration of repeating targets makes up a larger proportion of the IOI, their repetition is more likely to be detected, and fewer target presentations are required for their detection.

<sup>1</sup>http://www.auditoryneuroscience.com/patternsinnoise

### Substantial Jitter Impairs Noise Target Detection

**Figures 2B,D** show individual detection rates and reaction times, respectively, for REP-J conditions. Subjects could detect the repeating pattern quite well for all jitter levels, but there are some systematic effects of jitter. We observed no statistically significant differences in reaction time between REP-J conditions (p > 0.05, n = 17, Wilcoxon signed rank test, Holm-Bonferroni corrected), but detection performance on T300-J100, the most jittered condition, was significantly worse than on T300-J0 and T300-J10 (p < 0.05, n = 17, Wilcoxon signed rank test, Holm-Bonferroni corrected). Thus, a substantial amount of jitter makes a repeating target more difficult to detect, but modest amounts of jitter appear to be well tolerated.
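For readers unfamiliar with the Holm-Bonferroni step-down procedure applied to these pairwise comparisons, a minimal sketch is given below. It takes a list of uncorrected p-values (which could come, for example, from `scipy.stats.wilcoxon` applied to each pair of conditions) and returns which null hypotheses are rejected. The function name and the example p-values used to exercise it are hypothetical and are not the values obtained in this study.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1), i.e. alpha / (m - rank).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject
```

The smallest p-value faces the strictest threshold (α/m), and each subsequent one a progressively looser threshold, which makes the procedure less conservative than a plain Bonferroni correction while still controlling the family-wise error rate.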

### What about Natural Sounds?

We motivated this study of pattern-in-noise detection by considering the ecological need to detect rhythmic natural sounds, such as footsteps, out of background noise. Rhythmic structure is often a hallmark of locomotion or vocal behavior of animate sound sources, and an ability to detect rhythmic patterns may have evolved to facilitate detection of another animal's activity. This could confer a competitive advantage by signaling the presence of potential mates, prey, or predators. Our experimental results indicate that pattern detection benefits from temporal regularity, and it would provide some context for our findings to explore how much temporal jitter is present in natural sounds. We analyzed step interval data from normally walking healthy humans, compiled from three separate studies (Frenkel-Toledo et al., 2005; Yogev et al., 2005; Hausdorff et al., 2007), available on the PhysioNet database (Goldberger et al., 2000). The dataset logged pressure sensor data recorded from underneath both feet, and we defined foot strikes to occur each time pressure under either foot crossed a threshold. Footstep intervals were calculated as the time between successive foot strikes. Since participants were pacing back and forth through a hallway, the need to turn around introduced some footstep intervals that were clear outliers from an otherwise tight distribution. Hence, as an outlier-proof measure of the jitter in step intervals, we calculated the median percentage deviation from the median step interval for each individual. These median deviations ranged from 1.6 to 6.6% across the 72 subjects, with a median of 1.9%. The entire range is less than the median deviation of the intermediate jitter condition (T300-J50) of 6.7%. From this we can conclude that, at least for this class of rhythmic natural sounds, the amount of temporal jitter present would be too small to impair detection performance.
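The jitter measure described above, and one way of relating it to the Gaussian jitter of the REP-J stimuli, can be sketched as follows. The function names are hypothetical. The second function rests on an assumption of this sketch: that the 6.7% figure for T300-J50 corresponds to the theoretical median absolute deviation of a Gaussian, which is its standard deviation multiplied by the 75th-percentile z-score (about 0.6745).

```python
from statistics import NormalDist, median

def median_pct_deviation(intervals):
    """Median absolute deviation from the median, as a % of the median."""
    med = median(intervals)
    return 100.0 * median(abs(x - med) for x in intervals) / med

def gaussian_median_pct_deviation(sd_ms, mean_ms):
    """Theoretical median % deviation for Gaussian-distributed intervals."""
    # For a Gaussian, half of all samples lie within 0.6745 sd of the mean.
    return 100.0 * NormalDist().inv_cdf(0.75) * sd_ms / mean_ms

# T300-J50: IOI standard deviation of 50 ms around a mean of 500 ms.
t300_j50 = gaussian_median_pct_deviation(50.0, 500.0)
```

Under that assumption, `t300_j50` evaluates to approximately 6.7%, which matches the value quoted for the intermediate jitter condition. Because the measure uses medians throughout, a single very long interval (such as one produced by turning around in the hallway) has little influence on it.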

**Figure 2** (caption fragment): … subject, and boxplots are over the 17 subjects' means. (D) Mean reaction times for REP-J.

### DISCUSSION

Firstly, we found that for a fixed IOI, a repeating target noise becomes easier to detect as its duration increases. This is consistent with the findings reported in Kaernbach (2004) and may be due to increased signal to noise for longer duration target noises. Secondly, we found that detection performance declines with substantial amounts of temporal jitter (more than the amount of jitter found in footsteps), though pattern detection was remarkably robust to levels of jitter below this level.

### Does Repetition Detection Rely on Synaptic Memory Traces of Recent Inputs?

Agus et al. (2010) suggested that memory traces that are presumably needed for repetition detection may involve spike-timing dependent plasticity (STDP). Networks incorporating STDP have been shown to quickly learn to detect a repeating pattern of afferent spiking activity amidst otherwise stochastic firing (Masquelier et al., 2008). This makes STDP an appealing candidate mechanism consistent with experimental observations made to date, with two possible caveats. First, subjects were able to recognize repetition with only two presentations of a frozen noise target (Agus et al., 2010), a performance that is so far unmatched by existing models of STDP. Secondly, a purely STDP-based model would accurately detect noise targets equally well whether they arrive at regular intervals or not, which is in contrast to our finding that temporal regularity results in better detection performance. This does not rule out that STDP may have a role to play, but it does suggest that it alone does not account for all aspects of noise learning, and indeed Agus and Pressnitzer (2013) suggested the possibility that sensitivity to amplitude modulations may also be involved.

### Modulation Filterbanks as an Alternative Mechanism

A mechanism that would potentially account for the timing aspect of our findings is a modulation filterbank, which is a set of neural filters tuned to different frequencies of modulation of the sound envelope (typically within a frequency band). Modulation filterbank models of the auditory system have shown good agreement with human psychoacoustic data on amplitude modulation detection (Dau et al., 1997) and speech intelligibility (Jørgensen and Dau, 2011), and electrophysiological evidence for modulation tuning exists at various levels of the auditory system (Schreiner and Urbas, 1986, 1988; Kilgard and Merzenich, 1999; Joris et al., 2004). We propose that the brain relies at least in part on modulation filters to detect repetitions in noise, and that the performance decrease we observe in the presence of jitter might be explained by the fact that jittered stimuli will drive modulation filters less strongly. We illustrate the plausibility of this idea through the following analysis.

### Repetition of Frozen Noise Targets Results in Distinct Peaks in the Modulation Spectrum

We calculated the modulation spectrum for each stimulus and used the standard deviation of the modulation spectrum as a measure of its "peakiness." A peaky modulation spectrum would indicate that some modulation filters are being driven more strongly than others, and we sought to investigate whether this correlated with detection performance using the method illustrated in **Figure 3A**. The first step was to transform each sound stimulus into a simple approximation of the activity pattern received by the auditory pathway by calculating a sound's log-scaled spectrogram ("cochleagram"). For each sound, the power spectrogram was taken using 20 ms Hanning windows, overlapping by 10 ms. The power across neighboring Fourier frequency components was aggregated using overlapping triangular windows comprising 43 frequency channels with center frequencies ranging from 150 to 19,200 Hz (1/6 octave spacing). Then, the log was taken of the power in each time-frequency bin, and finally any values below a low threshold were set to that threshold. These calculations were performed using code adapted from melbank.m<sup>2</sup>.

<sup>2</sup>http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html

The cochleagram was calculated over a 3 s window starting 4 s into the sound, by which time frozen noise targets must have ensued in all REP conditions. The magnitude spectrum of the activity in each frequency channel was calculated and then summed across frequency channels to get the overall modulation spectrum of the sound. We then calculated the standard deviation of the modulation spectrum (≤20 Hz) to quantify how much it deviated from a "flat" modulation spectrum. Gaussian white noise without repeating frozen noise targets (our RAND condition) should have a flat modulation spectrum and small standard deviation, while isochronously presented targets (our REP-R conditions) should introduce significant peaks in the modulation spectrum, increasing its standard deviation (**Figure 3A**, rightmost column). As illustrated in **Figure 3B**, REP stimuli with longer or more regularly spaced targets had "peakier" modulation spectra and were more reliably and more quickly detected by our subjects. Peakiness of the modulation spectrum correlates significantly with detection (p = 0.03, n = 7 conditions by 17 subjects = 119, logistic regression), and with faster reaction times (p < 10<sup>−20</sup>, n = 1649, Pearson correlation). No significant trends were found in fraction detected or in reaction times within individual conditions (p > 0.05 in all cases, Pearson correlation).
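The peakiness measure described above can be illustrated with a minimal NumPy sketch. It is not the authors' code (which was adapted from melbank.m): the 16 kHz sampling rate, the 20 rectangular log-spaced bands used in place of 43 overlapping triangular channels, the 60 dB floor, the removal of each channel's mean before the modulation transform, and all function names are assumptions made only for illustration.

```python
import numpy as np

def cochleagram(x, fs, win_s=0.020, hop_s=0.010, n_chan=20, fmin=150.0, floor_db=60.0):
    """Log-power spectrogram pooled into log-spaced bands (simplified stand-in)."""
    win = int(round(win_s * fs))
    hop = int(round(hop_s * fs))
    n_frames = 1 + (len(x) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(win)                    # 20 ms Hanning windows, 10 ms overlap
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # shape (frames, FFT bins)
    freqs = np.fft.rfftfreq(win, 1.0 / fs)
    edges = np.geomspace(fmin, fs / 2.0, n_chan + 1)     # assumed band edges
    chans = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (freqs >= lo) & (freqs < hi)
        if sel.any():                                    # skip bands narrower than one FFT bin
            chans.append(power[:, sel].sum(axis=1))
    log_p = 10.0 * np.log10(np.array(chans) + 1e-12)
    log_p = np.maximum(log_p, log_p.max() - floor_db)    # clip values below a low threshold
    return log_p, 1.0 / hop_s                            # (channels, frames), frame rate in Hz

def modulation_peakiness(x, fs, max_mod_hz=20.0):
    """Standard deviation of the summed modulation magnitude spectrum up to max_mod_hz."""
    cg, frame_rate = cochleagram(x, fs)
    cg = cg - cg.mean(axis=1, keepdims=True)             # assumption: drop each channel's DC
    mod = np.abs(np.fft.rfft(cg, axis=1)).sum(axis=0)    # sum magnitudes across channels
    mod_f = np.fft.rfftfreq(cg.shape[1], 1.0 / frame_rate)
    keep = (mod_f > 0) & (mod_f <= max_mod_hz)
    return float(np.std(mod[keep]))

rng = np.random.default_rng(0)
fs = 16000
token = rng.standard_normal(fs // 2)                     # one 500 ms frozen noise token
rep = np.tile(token, 6)                                  # token repeated back to back at 2 Hz
rand = rng.standard_normal(3 * fs)                       # fresh noise, no repetition
peak_rep = modulation_peakiness(rep, fs)
peak_rand = modulation_peakiness(rand, fs)
print(peak_rep > peak_rand)
```

In this toy case the repeated token concentrates modulation energy at 2 Hz and its harmonics, so the standard deviation across modulation frequencies is larger than for fresh noise. The actual stimuli embedded targets in ongoing noise, so the contrast there would be smaller.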

The analysis in **Figure 3B** is consistent with the idea that modulation filter type mechanisms could be responsible for the detection of repetition in noise, but it of course does not prove that physiological modulation filters are the only possible mechanism. For example, one might wonder whether autocorrelation models, which are often invoked to describe the processing of periodicities of sounds in the pitch range, might not provide equally good or perhaps even better alternative candidate mechanisms. In digital signal processing, autocorrelations are normally computed by quantifying the similarity of incoming signals to a delayed copy of the input, which is held in memory with complete accuracy for whatever delay period may be required, perhaps up to several seconds. How such highly accurate and flexible auditory short-term memory banks might be implemented using known neurobiological signal processing mechanisms is far from obvious. Nevertheless, we cannot exclude the possibility that the mechanisms that the brain uses to detect recurrent patterns in noise may operate in ways that resemble an autocorrelator more than a modulation filter bank.
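For comparison, the digital-signal-processing notion of autocorrelation mentioned above can be sketched as follows. This is only an illustration of the idealized computation with a perfectly stored delayed copy, not a model of any neural mechanism; the 1 kHz sampling rate, the back-to-back token repetition and the function name are assumptions.

```python
import numpy as np

def normalized_autocorr(x, max_lag):
    """Similarity of the signal to delayed copies of itself, for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    energy = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / energy for k in range(1, max_lag + 1)])

rng = np.random.default_rng(1)
fs = 1000                                   # assumed sampling rate (Hz)
token = rng.standard_normal(fs // 2)        # 500 ms frozen noise token
rep = np.tile(token, 4)                     # token repeated four times
rand = rng.standard_normal(2 * fs)          # non-repeating noise of equal length

ac_rep = normalized_autocorr(rep, fs - 1)
ac_rand = normalized_autocorr(rand, fs - 1)
best_lag_s = (int(np.argmax(ac_rep)) + 1) / fs   # lag of the largest peak, in seconds
print(best_lag_s)
```

The repeating signal shows a clear autocorrelation peak at the repetition period, whereas the non-repeating noise does not. The sketch also makes the memory requirement explicit: every lag needs a sample-accurate delayed copy of the input.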

### Could a Modulation Filterbank Model Account for Previous Findings?

As mentioned earlier, Kaernbach (2004) and Agus et al. (2010) both demonstrated that human listeners could detect repetition in a 1 s long stimulus where a 500 ms noise token was played only twice. A question worth asking is whether "peakiness" in the modulation spectrum could still be helpful even when there are so few repeats. **Figure 3C** shows that the standard deviation of the modulation spectrum calculated from 1 s (two period) segments taken from our analogous T500-J0 stimuli does indeed differ substantially from that of equivalent 1 s segments from non-repeating RAND stimuli, suggesting that peaks in the modulation frequency domain could have provided a useful cue in the aforementioned studies. However, modulation filters alone would not account for the observation in Agus et al. (2010) that noise memory traces can have surprisingly long-lasting effects. Thus, both modulation filter-like mechanisms and long-term plasticity are likely to be required to fully account for our ability to detect patterns in noise.

Further work is needed to confirm whether modulation filters indeed underlie the results reported here, as well as in other related psychoacoustic studies. For example, timing predictability is also an important cue during auditory scene analysis (Bendixen, 2014), and different patterns of activity across frequency channels in the modulation frequency domain could be involved in the tendency for temporal jitter to cause streams to segregate (Andreou et al., 2011; Rajendran et al., 2013). An additional consideration is the evidence for the oscillatory nature of temporal attention and its effect on task performance, which has been studied both in the visual (Correa et al., 2006; Lakatos et al., 2008) and auditory (Jones et al., 2002; Lakatos, 2005; Jaramillo and Zador, 2011; Henry and Obleser, 2012; Lakatos et al., 2013; Lawrance et al., 2014) domains.

**Figure 3** (caption fragment): (clumped) reaction times of each 16 nearest neighbors along the x-axis from the same stimulus condition are plotted. The black line is a linear regression run on all (un-clumped) data. On the secondary y-axis is a histogram showing the distribution of all (un-clumped) standard deviation values for all stimuli within each condition. (C) A histogram showing the distribution of standard deviation values calculated over 1 s intervals during RAND (gray) and T500-J0 (black), analogous to the stimuli used in Agus et al. (2010). Note the larger standard deviation values for both conditions in (C) using a 1 s window compared to the 3 s window used in (B). For our examples of RAND (n = 1088) and T500-J0 (n = 272), we see no overlap, suggesting that the modulation spectrum would contain enough information to detect repetition from a single repeat of a 500 ms frozen noise target.

## CONCLUSIONS

Our results demonstrate that the ability to detect a repeating pattern is affected by the regularity of timing with which repeated sounds are presented. Specifically, we found that at a presentation rate of 2 Hz, applying a temporal jitter of 20% to the onsets of the noise targets significantly hindered their detection. We also found that the amount of jitter present in natural sounds such as footsteps is likely too small to be detrimental to detection. Finally, we showed that aspects of perceptual performance in our study and in other noise pattern detection studies can be well accounted for by the hypothesis that the auditory system uses low frequency modulation filters to detect rhythmic patterns. Altogether, we conclude that temporal regularity aids in detecting subtle structure in sound.

### AUTHOR CONTRIBUTIONS

VR, NH, KA, and JS designed the study; VR and KA acquired the data; VR, NH, and JS analyzed the data, interpreted the results, and wrote the manuscript.

### ACKNOWLEDGMENTS

This work was supported by the Wellcome Trust (grant numbers WT099750MA and WT076508AIA).


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Rajendran, Harper, Abdel-Latif and Schnupp. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Seeking Temporal Predictability in Speech: Comparing Statistical Approaches on 18 World Languages

Yannick Jadoul†, Andrea Ravignani†\*, Bill Thompson†, Piera Filippi and Bart de Boer

*Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium*

Temporal regularities in speech, such as interdependencies in the timing of speech events, are thought to scaffold early acquisition of the building blocks in speech. By providing on-line clues to the location and duration of upcoming syllables, temporal structure may aid segmentation and clustering of continuous speech into separable units. This hypothesis tacitly assumes that learners exploit *predictability* in the temporal structure of speech. Existing measures of speech timing tend to focus on first-order regularities among adjacent units, and are overly sensitive to idiosyncrasies in the data they describe. Here, we compare several statistical methods on a sample of 18 languages, testing whether syllable occurrence is predictable over time. Rather than looking for differences between languages, we aim to find, across languages and using clearly defined acoustic (rather than orthographic) measures, temporal predictability in the speech signal that could be exploited by a language learner. First, we analyse distributional regularities using two novel techniques: a Bayesian ideal learner analysis, and a simple distributional measure. Second, we model *higher-order* temporal structure (regularities arising in an ordered *series* of syllable timings), testing the hypothesis that non-adjacent temporal structures may explain the gap between subjectively-perceived temporal regularities and the absence of universally-accepted lower-order objective measures. Together, our analyses provide limited evidence for predictability at different time scales, though higher-order predictability is difficult to reliably infer. We conclude that temporal predictability in speech may well arise from a combination of individually weak perceptual cues at multiple structural levels, but is challenging to pinpoint.

Edited by: *Mikhail Lebedev, Duke University, USA*

#### Reviewed by:

*Manon Grube, Technical University of Berlin, Germany Christina Zhao, University of Washington, USA*

#### \*Correspondence:

*Andrea Ravignani andrea.ravignani@gmail.com*

*† Authors share first authorship.*

Received: *07 March 2016* Accepted: *03 November 2016* Published: *02 December 2016*

#### Citation:

*Jadoul Y, Ravignani A, Thompson B, Filippi P and de Boer B (2016) Seeking Temporal Predictability in Speech: Comparing Statistical Approaches on 18 World Languages. Front. Hum. Neurosci. 10:586. doi: 10.3389/fnhum.2016.00586*

Keywords: speech perception, temporal structure, rhythm, Bayesian, time series, autoregressive models, nPVI, timing

### INTRODUCTION

To acquire a language, human infants must solve a range of intertwined inductive problems which, taken together, represent one of the most demanding computational challenges a child will ever face. One of the earliest and most basic of these component problems is to segment continuous speech into distinct units, such as words, syllables or phonemes. Segmentation problems recur at multiple levels of linguistic structure, and must be solved either before or in tandem with higher-level inferences or generalizations that are defined over these units (e.g., syntactic, morphosyntactic, and phonotactic rules). However, it is at present unclear, both theoretically and in terms of building speech technologies, which properties of speech allow this highly underconstrained inductive problem to be solved.

In this paper, we test whether this problem might be made more tractable by predictability in the temporal structure of speech. The key idea is that, if the timing of syllables follows any kind of pattern, this temporal pattern might be helpful for infants acquiring speech (Bialek et al., 2001; Nazzi and Ramus, 2003; Saffran et al., 2006) by providing infants with clues to predict where units begin and end (Trehub and Thorpe, 1989; Trainor and Adams, 2000). This hypothesis is corroborated by experimental evidence with adults: experiments in which simple artificial signals were taught to participants showed that when there was no temporal structure at all to the signals (i.e., signals just changed continuously over time), participants had a hard time learning to reproduce them (de Boer and Verhoef, 2012). This was true, even though the signals were based on vowels, and thus were recognizably speech-like. In an otherwise identical experiment, where signals did have clear temporal structure (i.e., there were regularly spaced building blocks separated by drops in volume), learning was much better even though the signals themselves were less speech-like (being produced with a slide whistle, Verhoef et al., 2014). Here we investigate the predictability of temporal structure of speech in a sample of 18 languages using three different statistical approaches. Specifically, we explore how well the occurrence of an upcoming syllable nucleus can be predicted on the basis of the times at which previous syllables occurred. In one of the three statistical models, we also test whether the previous syllable's intensity helps in predicting the time of occurrence of the next syllable.

We emphatically do not want to enter the debate about rhythmic classes of languages (stress-timed, syllable-timed, or mora-timed) and the ways to measure them. Much research has classified languages based on their temporal structure (Pike, 1945; Rubach and Booij, 1985; Port et al., 1987; Bertinetto, 1989; Fabb and Halle, 2012), reporting multiple acoustic correlates for language rhythmic class (Ramus et al., 1999; Patel and Daniele, 2003). Arvaniti (2012) has shown that many of the proposed measures are very sensitive to speaker, sentence type and elicitation method. In addition, she finds that groups of languages are classified differently by different measures, concluding that (p. 351) "any cross-linguistic differences captured by metrics are not robust [. . . ] making cross-linguistic comparisons and rhythmic classifications based on metrics unsafe at best." Here, we investigate how durations and intensities of preceding syllables can help to predict the position and duration of a subsequent syllable, and whether more complex patterns than a simple fixed average duration play a role. Though to our knowledge there has been little investigation of higher-order timing structures in speech, it is clear that structure in higher-order timing patterns (e.g., at the sentence level) can influence processing of smaller units (e.g., syllables) in speech: for example, Reinisch et al. (2011) show that the timing of a preceding sentence can influence how people interpret stress in a subsequent word. Results like this suggest that complex timing patterns at multiple levels in speech are salient to listeners and influence processing, motivating our analysis of these patterns.

Rhythm in language is obviously more complex than just temporal predictability of syllables (e.g., involving the way stressed and unstressed syllables are grouped into feet, Goedemans and Van der Hulst, 2005). However, most of the existing notions of rhythm in speech depend on already having some knowledge of the sound system of the language. Our notion of predictability is therefore somewhat more basic than most notions of rhythm in the phonological literature. Going back to the origins of rhythm research in psychology (Bolton, 1894), we call rhythmic the temporal regularities in sound sequences and rhythmical those patterns of temporal intervals also containing variation in loudness. Bolton's very influential work (Bolton, 1894) has, on the one hand, triggered much developmental work (e.g., Thorpe and Trehub, 1989; Trehub and Thorpe, 1989; Trainor and Adams, 2000) and, on the other, promoted empirical research on the relative importance of duration and intensity in segmenting general auditory input (Povel, 1984; Trainor and Adams, 2000; de la Mora et al., 2013; Toro and Nespor, 2015). Here, we put the emphasis on speech rhythmicity, rather than rhythmicality, hence testing the importance of durational information (rather than fine-grained spectral characteristics) in predicting future temporal regularities. In particular, we test whether the occurrence of syllable nuclei (characterized by peaks in intensity and maximum harmonics-to-noise ratio, i.e., voicedness) can be predicted from the (regularities in the) durations of the intervals between them. Therefore, we use only data about the syllable nuclei in our analysis.

In order to quantify the predictability of temporal structure in language, we investigated a small corpus of texts in 18 typologically and geographically diverse languages (listed in **Table 1**). We use a typologically and geographically diverse sample to exclude the possibility that temporal structure would somehow be an areal feature of Western European languages. As we are interested in the temporal structure of real speech, using word lists would not be useful, and therefore we use short stories. The example stories used in the illustrations of the IPA (International Phonetic Association, 1999) are ideal for this purpose. These are very short stories, either read from a text or spontaneously (but fluently) told. Although the stories are short this should not matter, because if rhythmic structure is to be of any use in acquisition, it should already be apparent from relatively short passages (Nazzi et al., 2000). Herein lies another difference with most existing literature on rhythmic measures: previous methods have been developed and used to quantify differences in rhythm between languages and hypothesized rhythmic classes (e.g., Arvaniti, 2012). Conversely, we are interested in the amount of temporal predictability that is present across languages, providing a set of clues to support the language learning process.

Story reading generally has a speaking style of its own. Analyses of Dutch (Theune et al., 2006), French (Doukhan et al., 2011), and Spanish (Montaño et al., 2013) show that compared to every-day speech, narrative speech tends to: (i) have more exaggerated pitch and intensity contours, (ii) be slower, (iii) have more pauses, (iv) include words with exaggerated pitch and intensity. Confusingly, in the literature this is often referred to as storytelling, but in fact most research is about stories that are read aloud from a prepared text. Spontaneously told stories have similar features, but more pauses and hesitations, and tend to have a slower speaking rate (Levin et al., 1982). The features of story reading and storytelling are comparable to those of infant-directed speech (Fernald and Kuhl, 1987; Fernald et al., 1989), which facilitates word learning (Fernald and Mazzie, 1991; Filippi et al., 2014). Although story reading/telling style is therefore different from adult-adult dialog style, it may be more representative of the language intake (i.e., that part of the input that infants actually use in acquiring speech; Corder, 1967).

*[Table note fragment] ...weights and sets. Results from higher-order models should be interpreted keeping in mind the low predictive power of ARMA models for small sample sizes.*

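We use three increasingly sophisticated statistical techniques to quantify the predictability of syllable durations in our speech samples. The techniques make predictions based on increasingly long sequences, namely: simple distributional measures of the INIs (order 0), distributional statistics of the relation between adjacent INIs, including a Bayesian ideal learner analysis (order 1), and time series models of ordered series of INIs (higher order).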

If temporal structure of speech is indeed predictable, this should be reflected in the outcome of our analyses. Our story-reading dataset might not be fully representative of ordinary adult-directed speech. However, the dataset is appropriate for seeking temporal predictability, given its exaggerated features and its comparability with infant-directed speech. If there is no structure at all to be found in this kind of speech, then there would be no reason to expect it in normal, less-controlled settings.

## MATERIALS AND METHODS

### Materials: Corpus

The audio files were recordings of the narrative texts used in various publications of the International Phonetic Association as illustrations of the sound systems of different languages. Most often Aesop's fable "The North Wind and the Sun" is used for this purpose, but sometimes other (native) stories are used. Crucially, all of these transcriptions and recordings have been published in the Journal of the International Phonetic Association as part of the series of "Illustrations of the IPA" and a number are also available in the IPA handbook (International Phonetic Association, 1999). Sources per language are indicated in **Table 1**. The stories consist of a median of 177 syllables (first quartile 159, third quartile 190), divided over 5–13 sentences.

### Methods: Annotations

The automatic methods for finding syllable centers we had available (Mermelstein, 1975; de Jong and Wempe, 2009) did not yield satisfactory results for the range of languages, speakers, speaking rates and speaking volumes that were in our sample. Hence we proceeded to annotate the sample manually. This has the advantage that our annotations represent the human-perceived syllable centers instead of computer-extracted data based on a predetermined set of features. Moreover, fine-tuning the parameters of automatic methods (Mermelstein, 1975; de Jong and Wempe, 2009) for each passage would introduce at least as much variability and subjectiveness as annotating the syllable centers manually. The centers of syllables were identified by ear, and their precise location was identified as the position where amplitude was highest and the harmonic structure was clearest (**Figure 1**). The transcriptions in the IPA articles were used to indicate phrase and sentence breaks, so that we could identify chunks of speech with uninterrupted rhythm. In addition, we indicated other points where the speaker paused and interrupted the rhythm. YJ re-checked all the cases where a break might have been forgotten, in order to have a more consistent dataset. Annotations were made in PRAAT versions 5.3.49–6.0.11 (Boersma and Weenink, 2013). Consistency between raters was assessed by having four of the considered languages annotated by two raters. The pairwise distance between all annotations was computed using Dynamic Time Warping (Sakoe and Chiba, 1978), a widely-used algorithm for aligning temporal sequences, where we use the absolute time difference between two annotated nuclei as the distance metric. The sum of squared errors (i.e., sum of squared differences of matched annotated nuclei timings) between annotators for the same language was at least 10 times lower than the sum of squared errors between different languages or between real and randomly-generated annotations.
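The inter-rater comparison described above can be sketched with a textbook Dynamic Time Warping implementation. This is a minimal illustration rather than the authors' pipeline: the step pattern (diagonal, vertical, horizontal moves with equal weight), the function names and the toy annotation times are assumptions.

```python
import numpy as np

def dtw_align(a, b):
    """Align two sequences of nucleus times; local cost is absolute time difference."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.abs(a[:, None] - b[None, :])
    acc = np.full((n + 1, m + 1), np.inf)    # accumulated cost with a padded border
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to recover which nuclei were matched.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(acc[n, m]), path

def sum_squared_errors(a, b, path):
    """Sum of squared timing differences over the matched nuclei."""
    return float(sum((a[i] - b[j]) ** 2 for i, j in path))

# Hypothetical annotations (seconds): rater B marked one extra nucleus at 0.31 s.
rater_a = [0.10, 0.30, 0.60]
rater_b = [0.10, 0.30, 0.31, 0.60]
total_cost, path = dtw_align(rater_a, rater_b)
sse = sum_squared_errors(rater_a, rater_b, path)
print(total_cost, path, sse)
```

Because the alignment may match one nucleus to several, annotations of different lengths can still be compared, which is what makes the approach usable when raters disagree on the number of syllables.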

### Methods: Mapping Languages to Durations

Having this set of annotated points in time for all languages, we then calculated the time distance between nuclei of adjacent syllables, i.e., the inter-nucleus-interval durations (INI), and the difference in intensity between those nuclei. Hence each language corresponds to two vectors D = (d<sub>1</sub>, d<sub>2</sub>, ..., d<sub>n</sub>) and I = (i<sub>1</sub>, i<sub>2</sub>, ..., i<sub>n</sub>), where d<sub>s</sub> is the INI and i<sub>s</sub> is the difference in intensity between syllables s + 1 and s, for s < n (**Figure 1**). Moreover, the indicated phrase breaks and pauses are used to discard the associated INIs. Note that these intervals are not removed from the time series, but replaced by a missing value (NA) that can be handled properly by each analysis.
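A minimal sketch of this mapping is given below, using NaN for the missing value. The function name, the `break_after` argument, the toy values, and the choice to mask the intensity difference together with the INI at a break are assumptions for illustration.

```python
import numpy as np

def ini_and_intensity_diffs(nucleus_times, nucleus_intensities, break_after=()):
    """Return INI durations and intensity differences between adjacent nuclei.

    break_after lists the indices s for which a phrase break or pause lies
    between nucleus s and nucleus s + 1; those entries become NaN rather than
    being deleted, so the series keeps its original ordering.
    """
    t = np.asarray(nucleus_times, dtype=float)
    level = np.asarray(nucleus_intensities, dtype=float)
    d = np.diff(t)          # d_s = time of nucleus s+1 minus time of nucleus s
    i = np.diff(level)      # i_s = intensity of nucleus s+1 minus intensity of nucleus s
    for s in break_after:
        d[s] = np.nan
        i[s] = np.nan       # assumption: mask intensity differences across breaks too
    return d, i

# Hypothetical nucleus times (s) and intensities (dB), with a pause after the third nucleus.
times = [0.00, 0.20, 0.50, 1.50, 1.70]
intensities = [62.0, 65.0, 60.0, 64.0, 63.0]
D_vec, I_vec = ini_and_intensity_diffs(times, intensities, break_after=[2])
print(D_vec, I_vec)
```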

It is mathematically convenient and cognitively plausible (Grondin, 2010; McAuley, 2010) to work with the logarithm of duration. This is cognitively plausible given Weber's (1834) law that the perceived differences between signals depend on the magnitude of the signals. It is mathematically convenient, because the difference between the logarithms is proportional to the ratio of the original numbers. The logarithm of the INIs therefore abstracts over absolute duration, accounting for variability in speed between speakers and over time: for example, for both fast and slow speakers, adjacent syllables with equivalent durations would lead to a difference of zero for the logarithms.
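The tempo invariance of log differences can be checked with a small worked example. The INI values and the slow-down factor of 1.5 are made up for illustration.

```python
import numpy as np

ini_fast = np.array([0.20, 0.30, 0.15, 0.15])   # hypothetical INIs of a fast speaker (s)
ini_slow = 1.5 * ini_fast                        # same pattern spoken 1.5 times slower

# Difference of logarithms equals the logarithm of the ratio of adjacent INIs.
log_diff_fast = np.diff(np.log(ini_fast))
log_diff_slow = np.diff(np.log(ini_slow))
log_ratio_fast = np.log(ini_fast[1:] / ini_fast[:-1])

print(np.allclose(log_diff_fast, log_diff_slow))
```

The two speakers yield the same log differences, and the final pair of equal INIs gives a difference of zero in both cases.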

### ANALYSIS AND RESULTS: SIMPLE DISTRIBUTIONAL MEASURES (ORDER 0)

We started by investigating distributional predictability in languages, namely whether information on presence and frequency of INIs provides information on the temporal organization of that language. We calculated the Kolmogorov-Smirnov D (Kolmogorov, 1933; Smirnov, 1948) statistic to quantify normality for each language. The D for each language is calculated as the difference between the empirical INI distribution for that language and a theoretical normal distribution with the same mean and standard deviation. We then tested how this measure relates to temporal variability by comparing it with a common measure of speech rhythm, the normalized pairwise variability index (nPVI, Grabe and Low, 2002). The nPVI is a measure of variability between adjacent durations, calculated as

$$\mathrm{nPVI} = \frac{100}{n-1} \sum_{t=1}^{n-1} \left| \frac{d_t - d_{t+1}}{0.5\,(d_t + d_{t+1})} \right|,$$

where n is the number of syllables, and the factor 100 normalizes the number to be between 0 and 100 (Patel and Daniele, 2003). A "metronomic language" composed of a series of similar INI will have a low nPVI (tending to zero as the INIs become identical). A language with strong temporal variability in INI, composed for instance of alternating short-long INI, will have high nPVI. Note that the nPVI measure here is calculated in a slightly different way than usual, based on INI lengths instead of the lengths of syllables.
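The formula can be transcribed directly, as in the sketch below. The function name and the handling of missing values (pairs that include a NaN are skipped and the average is taken over the remaining pairs) are assumptions, since the text does not specify how breaks enter the nPVI.

```python
import numpy as np

def npvi(durations):
    """Normalized pairwise variability index over adjacent durations (here: INIs)."""
    d = np.asarray(durations, dtype=float)
    first, second = d[:-1], d[1:]
    terms = np.abs(first - second) / (0.5 * (first + second))
    terms = terms[~np.isnan(terms)]      # assumption: skip pairs spanning a break
    return 100.0 * float(terms.mean())   # mean over the n - 1 adjacent pairs

print(npvi([0.2, 0.2, 0.2, 0.2]))   # identical INIs
print(npvi([0.1, 0.3, 0.1, 0.3]))   # alternating short-long INIs
```

Identical INIs give an nPVI of zero, while an alternating short-long pattern gives a much higher value, matching the description of "metronomic" and highly variable sequences.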

We found a significant correlation between Kolmogorov-Smirnov D and nPVI (Spearman rank correlation = 0.60, p < 0.01, **Figure 2**). This high and positive correlation between our simple measure of normality (of order 0) and the more complex nPVI (which takes into account order 1 difference between syllables) shows that they both capture some common aspects of temporal structure of the signal. Our measure is possibly the simplest metric for temporal structure, suggesting that the complexity of nPVI adds little explanatory power to straightforward distributional measures. This analysis implies that most of temporal structure in a language as captured by a common measure of rhythmicity can be equally well judged by assessing whether syllable nuclei occur at normally distributed durations. Far from proposing one additional metric to quantify structural regularities in speech, we instead suggest that many existing metrics should be used carefully and critically, as they may embody very superficial features of speech (Loukina et al., 2011; Arvaniti, 2012).
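The order-0 normality measure can be sketched as the largest gap between the empirical distribution function of the INIs and a normal distribution with matching mean and standard deviation. The function name, the use of the sample standard deviation (ddof = 1), and the synthetic test data are assumptions; the text does not state these details.

```python
import math
import numpy as np

def ks_d_normal(values):
    """Kolmogorov-Smirnov D between the data and a normal with the same mean and SD."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])                     # drop missing values, then sort
    n = len(x)
    mu, sd = x.mean(), x.std(ddof=1)                 # assumption: sample standard deviation
    cdf = np.array([0.5 * (1.0 + math.erf((v - mu) / (sd * math.sqrt(2.0)))) for v in x])
    upper = np.arange(1, n + 1) / n - cdf            # empirical CDF just after each point
    lower = cdf - np.arange(0, n) / n                # empirical CDF just before each point
    return float(max(upper.max(), lower.max()))

rng = np.random.default_rng(2)
d_normal = ks_d_normal(rng.normal(0.2, 0.05, size=2000))    # synthetic, near-normal
d_skewed = ks_d_normal(rng.exponential(0.2, size=2000))     # synthetic, strongly skewed
print(d_normal < d_skewed)
```

A sample drawn from a normal distribution yields a small D, while a strongly skewed sample yields a larger one, which is the sense in which D quantifies departure from normality.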

For some languages, such as Thai, metrics are very different from those published in previous reports: the nPVI ranges between 55 and 60 in Romano et al. (2011) vs. our 41. In other languages, values are close: in Arrernte one can compare our 35.4 with the range 39.6–51.2 found in Rickard (2006). Finally, for some languages we get almost identical numbers as in previous studies: for Italian, both our data and Romano et al.'s (2011) show nPVIs at 40 ± 1. Some issues about nPVI comparisons should be kept in mind. First, we purposely focussed on less-studied languages, and only some languages considered here had been analyzed at the level of rhythm and nPVI elsewhere. Moreover, for some languages several discordant measures of nPVI are available from different studies, making the selection of one previously-published nPVI per language quite arbitrary. In general, we do not find a strong association with previous studies, probably because, as previously remarked (Arvaniti, 2012), values for the same rhythm metric applied to different corpora of the same language can vary considerably between studies.

### ANALYSIS AND RESULTS: DISTRIBUTIONAL STATISTICS OF TEMPORAL STRUCTURE (ORDER 1)

### Why Use Distributional Methods?

We can make baseline inferences about temporal structure by quantifying the distribution of the logarithms of the ratio of adjacent INIs (i.e., the difference of the logarithm of adjacent INIs) observable among the languages in our sample. In the most temporally regular language, all adjacent syllables would have equal durations (i.e., equal INIs), and this distribution would be a point mass on 0 (the ratio between equal-length INIs is 1, whose logarithm is 0). In a language that has completely unpredictable temporal structure at this level, the duration of the preceding syllable provides no information about the duration of the following syllable, so this distribution would be uniform over a sensible range.

Standard tests for normality (D'Agostino and Pearson, 1973) suggest that we cannot reject the hypothesis that the data are drawn from an underlying Normal distribution for all 18 languages (at α = 0.05). As such it is reasonable to proceed under the assumption that the differences of the logarithm of the INIs are normally distributed. This assumption allows us to compute measures of predictability associated with normally distributed data. Many standard tools exist to estimate the shape of this distribution from a noisy sample, which is what our annotations represent. We calculated estimates for the mean µ and variance σ<sup>2</sup> of this distribution for each language. Maximum a-posteriori (MAP) point-estimates (under an uninformative prior; see below) for all languages are shown in **Table 1**. The mean always centers around 0, and the average variance is around ¼ (0.28), suggesting a moderate level of predictability across languages.

### Bayesian Inference for Distributions of Speech Timing Events: A Primer

A more satisfying approach—utilizing all the information in the distribution—is to compute the full posterior distribution P(µ, σ|R) via Bayesian inference. This approach is useful in three respects. First, it provides a more complete picture of the structure in our data at this level. Second, experiments of perception and estimation of time intervals suggest humans process temporal regularities in a Bayesian fashion, where expectations correspond to a-priori probability distributions affecting top-down perception of incoming stimuli (Rhodes and Di Luca, 2016). Third, it provides a way to model the judgements of an ideal learner who observes these data: what generalizations could an ideal learner infer from this evidence base? In intuitive terms, the posterior distribution represents an ideal observer's updated beliefs after observing evidence and combining this information with the beliefs it entertained before observing the evidence (prior beliefs). The updated posterior beliefs are said to be rational or ideal if the particular way in which the learner combines prior beliefs and observed evidence follows the principles of conditional probability captured in Bayes' theorem. This way of modeling inference aligns with human learning in many domains (Griffiths et al., 2010), and provides a normative standard that quantifies how an evidence-base could be exploited by an ideal observer—which is exactly what we wish to achieve here. Standard techniques from Bayesian statistics (e.g., Gelman et al., 2004, p. 78) allow us to formulate an unbiased prior P(µ, σ) for the inductive problem at hand. 
Specifically, the Normal-Inverse-Chi-Square conjugate model (e.g., Gelman et al., 2004), with κ<sub>0</sub> = 0, σ<sub>0</sub><sup>2</sup> = 0, ν<sub>0</sub> = −1, for arbitrary µ<sub>0</sub>, ensures the prior is uninformative: in other words, the prior expresses uniform expectations about µ and σ<sup>2</sup>, so MAP estimates correspond to maximum likelihood estimates, and ideal learner predictions are unbiased.

The posterior P(µ, σ|R) can be derived analytically under this model. We interrogate this posterior for targeted measures of predictability. For example, we can quantify the degree of predictability available to an ideal learner who is exposed to a temporal sequence of syllables: we model a learner who encounters these data, induces estimates of µ and σ<sup>2</sup> via Bayesian inference, and goes on to use those estimates to make predictions about the time of occurrence of future syllables. In Bayesian statistics, the distribution describing these predictions is known as the posterior predictive distribution, and can be calculated exactly in this model. Our analysis pipeline assumes the learner induces estimates for µ and σ<sup>2</sup> by drawing a random sample from their posterior, and makes predictions by drawing random samples from the Normal distributions defined by those estimates. To account for the randomness which underpins the learner's sampled estimates of µ and σ<sup>2</sup>, the model integrates over the posterior for these parameters, computing predictions under each parameter setting, and weighting those predictions by the posterior probability of those parameters given the data (and the prior). Even under an unbiased prior, this is a meaningful operation since it takes into account inferential uncertainty about µ and σ<sup>2</sup>, and propagates that uncertainty through to the model's predictions. In this respect, the model's predictions are conservative by admitting variance in predictions (compared to, for example, predictions computed under maximum likelihood estimates of µ and σ<sup>2</sup>). The specific form of the posterior predictive distribution in this model is Student's t. More formally, it can be shown that:

$$\begin{aligned} p\left(r^{\text{new}} \, \middle| \, R, \Phi\right) &= \iint p\left(r^{\text{new}} \, \middle| \, \mu, \sigma^2\right) p\left(\mu, \sigma^2 \, \middle| \, R, \Phi\right) \, d\mu \, d\sigma^2 \\ &= \, t_{n-1}(\overline{r}, s), \end{aligned}$$

where r<sup>new</sup> is the new interval to be estimated, n is the number of data points observed, Φ are the parameters of the prior specified above, r̄ is the mean of the observed data R, and s = (1 + n) Σ<sub>i</sub>(r<sub>i</sub> − r̄)<sup>2</sup> / (n(n − 1)). The second line of this equation reflects a standard result in Bayesian statistics (see Gelman et al., 2004). We computed these distributions for each language: **Figure 3A** shows these predictions, superimposed on (normalized) histograms of the raw data R.
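The two routes to the posterior predictive distribution described above (the closed-form Student's t, and sampling σ<sup>2</sup>, then µ, then a new observation) can be compared in a short sketch. It assumes that s is the squared scale of the t distribution and relies on the standard conjugate results for the uninformative prior; the function names and the simulated data are hypothetical, and this is not the authors' implementation.

```python
import math
import random

def predictive_params(r):
    """Degrees of freedom, location and squared scale of t_{n-1}(r_bar, s)."""
    n = len(r)
    r_bar = sum(r) / n
    ss = sum((x - r_bar) ** 2 for x in r)      # sum of squared deviations
    s = (1 + n) * ss / (n * (n - 1))           # squared scale, as in the text
    return n - 1, r_bar, s

def sample_predictive(r, draws, rng):
    """Monte Carlo draws of r_new by sampling sigma^2, then mu, then r_new."""
    n = len(r)
    r_bar = sum(r) / n
    ss = sum((x - r_bar) ** 2 for x in r)
    out = []
    for _ in range(draws):
        # sigma^2 | R: scaled inverse chi-square with n - 1 degrees of freedom
        sigma2 = ss / rng.gammavariate((n - 1) / 2, 2.0)
        # mu | sigma^2, R: Normal(r_bar, sigma^2 / n)
        mu = rng.gauss(r_bar, math.sqrt(sigma2 / n))
        # r_new | mu, sigma^2: Normal(mu, sigma^2)
        out.append(rng.gauss(mu, math.sqrt(sigma2)))
    return out

rng = random.Random(0)
# Simulated log-ratios standing in for one language (not real data).
data = [rng.gauss(0.0, 0.5) for _ in range(60)]
df, loc, s = predictive_params(data)
samples = sample_predictive(data, 20000, rng)
mc_mean = sum(samples) / len(samples)
mc_var = sum((x - mc_mean) ** 2 for x in samples) / len(samples)
# A t distribution with df > 2 has variance s * df / (df - 2).
print(loc, mc_mean, s * df / (df - 2), mc_var)
```

The sampled mean and variance should approach the closed-form values as the number of draws grows, which is what integrating over the posterior amounts to.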

FIGURE 3 | Results of Bayesian and time series analyses. (A) Distributions of the log-ratio of adjacent INIs for all languages: most languages have a wider spread, indicating less predictability; a few languages show a narrower distribution (e.g., Cantonese), indicating higher predictability at this level. Normalized histograms show the raw empirical data; solid lines show the ideal learner predictions; dashed lines show 95% confidence intervals for the ideal learner predictions. (B) The proportion of Akaike weights taken up by models that use the ratio between subsequent INI lengths (differencing order *d* = 1; as opposed to the absolute lengths, *d* = 0) shows that, in the vast majority of language samples, the relative length data provide a better ARMA fit (cf. last column in Table 1, % Akaike weight taken up by *d* = 1). (C) The accumulated Akaike weights of all fitted ARMA models for each AR-order *p* do not show a clear picture of a predominant order of the ARMA model providing the best fit.

### Bayesian Inference in Our Dataset: Results and Discussion

The similarities between these distributions across languages are intuitively clear from **Figure 3A**. We provide a quantitative measure of structure. Though various appropriate measures are available, we report the information-theoretic differential entropy of these predictive distributions, which is a logarithmic function of the variance. Differential entropy directly quantifies the information content of an unbiased, ideal learner's predictions in response to distributional information on first-order temporal regularity. Formally, differential entropy is defined for this problem as follows:

$$h\left(r^{\text{new}}\right) = -\int p\left(r^{\text{new}} \, \middle| \, R, \Phi\right) \log p\left(r^{\text{new}} \, \middle| \, R, \Phi\right) \, dr^{\text{new}}$$

**Table 1** presents the differential entropy of the posterior predictive distribution, for each language. Lower values represent higher predictability: an ideal learner could make reliable predictions about the time of occurrence of an upcoming syllable in Cantonese, for example (entropy = 0.35), but would make less reliable predictions about Georgian (entropy = 0.98). In other words, in Cantonese more than in Georgian, a few relative syllable durations provide information about the temporal structure of the rest of the language.
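To make the link between entropy and variance concrete, the following sketch approximates the differential entropy of a Student's t predictive distribution numerically and compares it with the closed form for a Normal distribution, 0.5 ln(2πe σ<sup>2</sup>). It assumes natural logarithms (entropy in nats); the function names, the degrees of freedom and the integration range are illustrative choices, not values from the study.

```python
import math

def t_logpdf(x, df, loc, scale_sq):
    """Log-density of a Student's t with location `loc` and squared scale `scale_sq`."""
    z2 = (x - loc) ** 2 / scale_sq
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi * scale_sq)
            - (df + 1) / 2 * math.log1p(z2 / df))

def differential_entropy(logpdf, lo, hi, steps=20000):
    """Approximate the integral of -p * log(p) on [lo, hi] with the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        lp = logpdf(lo + k * h)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * (-math.exp(lp) * lp)
    return total * h

def normal_entropy(variance):
    """Closed-form differential entropy of a Normal distribution, in nats."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)

# Squared scale 0.28 echoes the average variance reported earlier;
# df = 200 is an arbitrary illustrative sample size.
h_t = differential_entropy(lambda x: t_logpdf(x, 200, 0.0, 0.28), -30.0, 30.0)
h_n = normal_entropy(0.28)
print(h_t, h_n)
```

Under these assumptions a variance of 0.28 corresponds to an entropy of about 0.78 nats, and the t distribution sits slightly above the Normal value because of its heavier tails.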

We are hesitant to draw strong generalizations about predictability cross-linguistically from this small dataset. However, the distribution of predictability across the languages we have analyzed provides a window onto the variation in predictability we might expect. For example, the mean entropy across languages is 0.77; the lowest entropy is 0.3; and the highest entropy is 1.05. Of the 18 languages, 10 have entropy lower than this mean, and 14 have entropy lower than 0.9. Intuitively, this suggests most languages cluster around a moderate level of predictability at this level of analysis. Few languages are highly predictable (entropy→0) or effectively unpredictable (entropy→∞). An ideal learner who pays attention to these temporal regularities in speech will be better at predicting the location of the nucleus of an upcoming syllable than a learner who does not. Obviously, both hypothetical learners will still face uncertainty.

The ideal-learner analysis provides a range of tools for exploring learnability and predictability that could be generalized to more complex notions of temporal structure. The approach also offers potentially useful connections to language acquisition and inductive inference more generally. For example, in ideal-learner models of language acquisition, the prior distribution is often understood to represent inductive biases. These inductive biases, either learned or inherent to cognition, are imposed by the learner on the inferential problem. This perspective provides a framework to ask and answer questions about perceptual biases for temporal regularity. For instance, how strong must the prior bias of a learner be for her to reliably perceive high temporal regularity, over and above what is actually present in the data (Thompson et al., 2016)? We leave these extensions to future work, and turn instead to higher-order sequential dependencies.

### ANALYSIS AND RESULTS: TIME SERIES ANALYSIS FOR (HIGHER ORDER) SEQUENTIAL STRUCTURAL DEPENDENCE

### Structure beyond Metrics and Distributions

Our previous analyses, in line with existing research, quantified rhythmic structure using minimal temporal information: first-order pairwise temporal regularities between adjacent syllables. Given the existing metrics and results for structure at this level (Arvaniti, 2012), and the inconsistency among associated findings, a natural alternative approach is to search for higher-order temporal structure, utilizing more features and a more complex statistical representation of the data. We address the question: does the preceding sequence of N syllables provide information about the timing of the upcoming syllable?

Structure at this level cannot be captured by typical first-order measures (e.g., Chomsky, 1956) employed in the literature (Arvaniti, 2012). In light of the disparity between intuitive impressions of rhythm in speech and empirical studies that fail to recover these intuitions (e.g., Dauer, 1983), perhaps this gap is made up in part by higher-order structural regularity, not visible to first-order methods. Specifically, we test whether sequential information about duration and intensity affects the predictability of future durational information. In other words: can we predict when the next syllable nucleus will occur, knowing the intensity and time of occurrence of the previous nuclei?

### ARMA: Timing of Occurrence of Future Nuclei as Linear Combination of Past Nuclei Timing

Though there are many ways to model higher-order dependencies in sequences, a natural starting point is to approach the question using standard statistical tools from time series analysis. We model our data using a commonly used autoregressive moving-average (ARMA) process (Jones, 1980; Hamilton, 1994). In brief, an ARMA model tries to predict the next value in a time series from a linear combination of the previous values (see below for details). As explained in our introduction, the predictability of these timings may be beneficial during language acquisition: if the ARMA model is able to discover predictive regularities at this level, then in theory so could a language learner. In addition to the preceding INI lengths, we allow for an extra value (the difference in intensity of the previous syllable nucleus) to be taken into account in the prediction. Taking intensity into account in an ARMA model allows us to include a basic form of stress (in which intensity plays some role) in the predictions, which may be useful in languages where stressed and unstressed syllables alternate. Using this approach, we ask two questions: (i) is temporal predictability better captured by a linear relation between INIs or the same relationship between their ratios?; and (ii) is temporal predictability improved (with respect to zero- and first-order predictions) by basing predictions on more than just the single previous INI?
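For orientation, the textbook form of an ARMA(p, q) process is given below. The notation is an assumption introduced here for illustration: x<sub>t</sub> is the value of the series at step t, c is a constant, φ<sub>i</sub> and θ<sub>j</sub> are the autoregressive and moving-average coefficients, and ε<sub>t</sub> is a white-noise error term. The way the intensity regressor enters the fitted models is not shown.

$$x_t = c + \sum_{i=1}^{p} \varphi_i \, x_{t-i} + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j} + \varepsilon_t$$

With first-order differencing, the same equation is applied to the differenced series x<sub>t</sub> − x<sub>t−1</sub> rather than to x<sub>t</sub> itself.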

### ARMA for Speech Timing: A Short Introduction

In statistical terminology, the specific ARMA model we adopt is known as an ARMA(p, d, q) process, where p, d and q determine the window length of the time series used to make predictions. With respect to our purposes, the d parameter decides whether the ARMA models relations between absolute INI durations (d = 0) or instead between relative durations of adjacent INIs (d = 1). This is known as the degree of differencing in the series: to answer our question (i) above, we ask whether the model captures the series better with d = 0 or with d = 1. Models with d > 1 are possible, but the psychological interpretation of higher-order differencing is not straightforward, so we do not consider those models here. The parameters p and q determine how far back past the current to-be-predicted interval the model looks when calculating its prediction, which corresponds to the order in what we have been calling "higher-order" structure. The model computes predictions in two ways: by computing an "autoregressive" component and a "moving average" component. The standard technical details of this model are rigorously explained in the literature (Jones, 1980; Hamilton, 1994); it is sufficient to note that p and q determine how far back the model looks in these calculations respectively: the model performs autoregression on the p previous intervals, and calculates a moving average component for q previous intervals. Quantifying higher-order predictability corresponds to asking what combination of p and q (and d) leads to the most accurate model predictions. If the model makes better predictions by seeing more steps backward (controlling for increased model complexity, see below), this indicates the existence of predictability at higher orders. In principle p and q can grow unboundedly, but for reasons of practicality we impose a maximum depth on these parameters: specifically, the space of models we search is subject to the constraint p, q ≤ 5 (and d ≤ 1). 
In other words, we consider all ARMA models up to an order of five, where the order is the total number of previous durational observations taken into account.
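The size of the model space implied by these constraints, and the effect of the differencing parameter, can be illustrated with a short sketch. The names `model_space` and `difference` are hypothetical helpers introduced here, and the code only enumerates candidate orders; it does not fit any model.

```python
from itertools import product

MAX_ORDER = 5  # constraint p, q <= 5 stated in the text

# All (p, d, q) combinations with p, q in 0..5 and d in {0, 1}.
model_space = list(product(range(MAX_ORDER + 1), (0, 1), range(MAX_ORDER + 1)))

def difference(series, d):
    """Apply first-order differencing d times (d = 0 returns a copy)."""
    series = list(series)
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

print(len(model_space))                 # number of candidate models
print(difference([1.0, 3.0, 6.0], 1))   # differenced toy series
```

Applied to log-transformed INIs, one round of differencing turns absolute durations into log-ratios of adjacent intervals, which is the sense in which d = 1 models relative durations.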

### Akaike Weights: Ranking Models Based on Their Parsimony and Fit to the Data

We use the R library "forecast" (Hyndman and Khandakar, 2008; R Core Team, 2013) to fit the ARMA models to our data. This library can handle the missing INIs across phrase breaks, and does so by maximizing the likelihood of the model given all data that is present. Additionally, the preceding difference in intensity was fit by the ARMA model as an external regressor, adding this first-order intensity difference to the linear model. Then, for each language, we identified the model with the lowest AICc value (Akaike Information Criterion, Burnham and Anderson, 2002) as the one that fit our data the best. The AIC is the most common criterion to perform model selection in ARMA models (Brockwell and Davis, 1991): intuitively, AIC provides a score that reflects how well a model captures the data, whilst also penalizing model complexity. AICc corrects this measure for small sample sizes.
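For readers unfamiliar with these criteria, the sketch below gives the usual definitions of AIC and its small-sample correction. Here `k` is the number of estimated parameters and `n` the number of observations; exactly how the fitting library counts parameters is not specified in this section, so the functions should be read as an illustration of the formulas rather than a reproduction of the library's output.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood, k, n):
    """AIC with the small-sample correction term 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical fit: log-likelihood -100 with 3 parameters on 50 observations.
print(aic(-100.0, 3), aicc(-100.0, 3, 50))
```

The correction term shrinks toward zero as n grows, so AICc and AIC agree for long series and differ most for short ones.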

While AICc scores correct for model complexity, more complex operations such as addition, taking a mean or comparison of groups of models cannot be performed meaningfully using these values alone. Wagenmakers and Farrell (2004) describe how to calculate Akaike weights, which allow for a more advanced quantitative comparison between models. More specifically, the Akaike weights w<sub>i</sub> are a measure of a model's predictive power relative to the combined predictive power of all models considered, and can be calculated over a collection of AICc scores AICc<sub>j</sub> as follows:

$$\begin{aligned} \hat{w}_i &= \exp\left(-\frac{1}{2}\left(\mathrm{AICc}_i - \min_j(\mathrm{AICc}_j)\right)\right),\\ w_i &= \frac{\hat{w}_i}{\sum_j \hat{w}_j} \end{aligned}$$

Using these weights (which sum up to a total of 1), we identified the Akaike set: the set of all highest-ranked models summing up to a cumulative Akaike weight of at least 0.95 (Johnson and Omland, 2004; Ravignani et al., 2015), in order to provide a view on the robustness of the best-fitting model. By aggregating the Akaike weights in this way, we (i) gain the combined explanatory power of multiple models instead of just the best one, and (ii) counteract the volatility of the analysis: i.e., if there are relatively few models with a high Akaike weight in this Akaike set, and most of them share a particular feature, we have more confidence in the importance of this feature than by just exploring the single best model.
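The weight calculation and the construction of the Akaike set can be sketched as follows. The AICc scores in the example are invented for illustration, and the function names are hypothetical; models are keyed by their (p, d, q) orders.

```python
import math

def akaike_weights(aicc_scores):
    """Map each model to its Akaike weight, given a dict of AICc scores."""
    best = min(aicc_scores.values())
    raw = {m: math.exp(-0.5 * (score - best)) for m, score in aicc_scores.items()}
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}

def akaike_set(weights, mass=0.95):
    """Highest-ranked models whose cumulative weight first reaches `mass`."""
    chosen, cumulative = [], 0.0
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for model, w in ranked:
        chosen.append(model)
        cumulative += w
        if cumulative >= mass:
            break
    return chosen

# Invented AICc scores for three (p, d, q) models.
scores = {(0, 1, 0): 100.0, (1, 1, 0): 102.0, (2, 1, 1): 110.0}
weights = akaike_weights(scores)
print(weights)
print(akaike_set(weights))
```

Because only differences from the minimum AICc enter the exponent, adding a constant to every score leaves the weights unchanged.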

In our particular analysis, we can test the hypotheses above by observing how Akaike weight is spread across the 72 different model variants: the larger the weight taken up by the relevant subset of models (i.e., with p above zero, or with d = 1) in the Akaike set, the stronger the support for the hypothesis. In sum, these techniques allow us to judge the features of models that explain the data well, while favoring simpler models, and without the need to choose a single best candidate.

Inference in time-series analysis is notoriously volatile, especially for small sample sizes and for series that include missing values. In our case, these missing values are derived from phrase and sentence breaks and other disruptions to the speech rhythm. This was clear in our results: although the range of ARMA-based analyses we pursued did consistently outperform baseline random-noise based alternatives, it did not lead to strong inferences: even the best-fitting ARMA models explained only a small portion of the data. **Figures 4A,B** respectively show examples of bad and good fit to our data. We therefore report results over a variety of possible models, an approach known as multi-model inference, which smooths over uncertainty in model selection.

### Results of the Time Series Analysis on Nuclei Timing

First, to address question (i) above, we compared the combined Akaike weight, for each language, of models which represented the data as relative durations vs. absolute durations. Relative durations models (i.e., d = 1) have a notably high sum of Akaike weights, for almost all languages (see **Table 1** and **Figure 3B**).

FIGURE 4 | … predictions of the fitted model on the y-axis; perfect predictions would correspond to a diagonal (45°) line.

This suggests that the model is most powerful when looking at the data as (the logarithms of the) relative durations. This is intuitive from a psychological perspective, both in terms of the log-scaling, and in terms of the focus on relative durations rather than absolute temporal duration (Grondin, 2010; McAuley, 2010).

Second, we accumulated the Akaike weights of all models with the same value for p; we then compared these marginal Akaike sums over q and d between p, as a way of investigating the importance of the autoregressive component's order. The higher p is, the more time-steps backwards the AR component of the ARMA model can use in order to predict where the next syllable nucleus will occur. As such, the extent to which the combined Akaike weights for larger values of p exceed the equivalent weights for p = 0 or p = 1 provides a window onto higher-order structure: more specifically, an indication of how well higher-order regularities and patterns in our data are captured by the ARMA model. A higher order in the moving average portion of the model, determined by q, is less important because the MA process only captures temporal dependencies in the random error. That is, the MA can explain some variance attributable to e.g., drift in INI length (for instance, when speaking rate increases or decreases over time), but does not have a straightforward correlate in terms of predictability of syllable nuclei. As such we focus on p as an indicator of higher-order predictability, marginalizing over q.
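Marginalizing Akaike weights amounts to summing them over every model that shares a given parameter value. The sketch below does this for any position of the (p, d, q) tuple; the helper name and the toy weights are assumptions made for illustration.

```python
from collections import defaultdict

def marginal_weight(weights, index):
    """Sum Akaike weights over models sharing the value at `index` of (p, d, q)."""
    totals = defaultdict(float)
    for model, w in weights.items():
        totals[model[index]] += w
    return dict(totals)

# Toy Akaike weights for four (p, d, q) models (invented values summing to 1).
toy_weights = {(0, 1, 0): 0.5, (1, 1, 0): 0.25, (1, 0, 2): 0.125, (3, 1, 1): 0.125}

by_p = marginal_weight(toy_weights, 0)  # marginalize over d and q
by_d = marginal_weight(toy_weights, 1)  # marginalize over p and q
print(by_p)
print(by_d)
```

Index 0 gives the per-p sums used to probe higher-order structure, and index 1 gives the share of weight held by relative-duration (d = 1) models.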

**Figure 3C** depicts the marginalized Akaike weights (i.e., weights summed over possible values for d and q) for each p and language. As can be seen, this visualization reveals a less clear picture of the distribution of the Akaike weights. AICc quantifies the quality of a fit while taking into account a penalty for model complexity. Hence, a partition of identical weights for each p and language would be the least informative with respect to the best order of the model. Instead, if the higher-order dependencies were adding nothing at all to the model's predictive power, we would expect the Akaike weights to be concentrated strongly on just p = 0. Likewise, if higher-order dependencies made improvements to the model's predictions, we would expect one or some of the p > 0 models to reserve positive Akaike weight.

**Figure 3C** reveals a subtle pattern of results. On the one hand, we see that for most languages, Akaike weight is concentrated on lower-order models (p = 0, p = 1), arguing against the idea that higher-order dependencies make dramatic improvements to prediction (under the assumptions of the ARMA model). On the other hand, even among these cases, higher-order models often still reserve some Akaike weight, even after being penalized for increased complexity. This suggests that higher-order models may still be capturing meaningful structure, even where lower-order dependencies are more powerful predictors. Moreover, there are some notable cases, such as Dutch, Mapudungun, and Turkish, in which higher-order models reserve extremely strong Akaike weight, at the expense of lower-order models. This suggests that in these cases, models which are able to capture temporal dependencies at higher orders represent our best description of the data.

### Discussion: What Can Time Series Tell Us about Speech Timing?

Overall, the ARMA analysis hints at the possibility that temporal regularities exist at higher orders in at least some of our data. We take this as strong motivation to explore the possibility further in future work, but hesitate to draw strong conclusions given the limitations on the models' predictive power and the variability in results across languages. In this respect our findings mirror previous results on rhythmical structures in speech, which have also often not led to strong conclusions, and demonstrated sensitivity to idiosyncrasies of the data (Arvaniti, 2012). A conservative conclusion is that, even if there is predictability at higher orders, only some of this structure appears capturable by the ARMA analysis we undertook. This could have multiple reasons, ranging from idiosyncrasies of our data and our statistical approach, to more general questions about the presence and nexus of temporal structure in speech, as follows:


Deciding between these possibilities is a clear objective for future research. A natural starting point would be to work with more data (i) or different data (i.e., different features, iv): either more data per language, or more data from a subset of languages, or data from multiple speakers per language. Another approach would be to look in more detail at ARMA predictions, and perhaps consider generalizations or more complex time-series models that build on or relax some of the assumptions in the classic ARMA (e.g., the linearity assumption; ii, iii). Such models exist and could be explored in our data, or in new data. The final possibility (v), that there are no structures to be found at this level, could only be upheld by ruling out possibilities i–iv, which our analyses cannot do.

Together, our analyses provide reasonable evidence for first or minimal order temporal structure (i.e., for the role of relative durations in the perception of rhythm in speech), and weaker evidence for principled higher-order structure that can be captured by linear regression models such as ARMA.

### GENERAL DISCUSSION AND CONCLUSIONS

Temporal structure is a central aspect of speech processing. Multiple studies have shown that infants rely on the rhythm type of their native language as a guide for speech segmentation (Nazzi and Ramus, 2003; Saffran et al., 2006). The extent to which higher-order sequences are used in predicting subsequent events or INIs is debated. Humans perform poorly at detecting temporal structure in mildly complex patterns (Cope et al., 2012). Finding regularity across a number of intervals correlates with reading ability, while detecting gradual speeding-up/slowing-down does not (Grube et al., 2014). However, to the best of our knowledge, no studies have ever provided a quantitative analysis of how the temporal properties of the speech signal determine predictability within the speech signal. Does the temporal structure of our data portray regularities that allow the duration and location of upcoming syllables to be predicted? Our approach to this question was twofold.

### Our Approach: Alternative Metrics for Low Order Temporal Regularities

First, in line with many other studies (Arvaniti, 2012), we focused on lower order temporal regularity. Existing metrics for speech rhythm at this level of analysis tend to be applied to research objectives that are slightly different to ours (e.g., classifying languages into rhythmic groups), and have been shown to be somewhat unreliable in the sense that they are often sensitive to idiosyncrasies of the data they model. In this light, our lower-order analyses focused first on maximal simplicity, then on quantifying predictability from the perspective of an ideal observer. These approaches proved useful for quantification of predictability at this level, showing broad support for constrained, but not complete regularity in INIs across the languages in our sample. These results are in keeping with the general and well-attested idea that there is temporal regularity in syllable timing, but that this regularity is not sufficient to account for the subjective experience of rhythm in speech (Lehiste, 1977). We add to this insight that a similar ceiling appears to also constrain how well these lower-order regularities can aid speech segmentation and acquisition in terms of predictability.

### Our Approach: Introducing Time Series Analysis to Speech Timing

Second, we tried to quantify predictability that might exist at higher-order temporal resolution in our dataset, a topic that, to the best of our knowledge, has received little attention in previous work<sup>1</sup>. We chose to model INI sequences as time series, and to make inferences about the order of dependencies in those series through model-fitting. This approach is a natural generalization of existing lower-order metrics: it allowed us to leverage a range of tried-and-tested methods of analysis in spite of the complexity inherent to higher-order forecasting. However, the results of our analyses provide only weak support for higher-order predictability. We highlighted a range of possible reasons for this above. Naturally, it is possible that our data are unsuited to the problem, or that our inferential methods were simply not powerful enough given the data. We disfavor this possibility for all the reasons discussed in the introduction and materials and methods. An alternative conclusion is that these regularities are not there to be found at higher orders. Again, we are hesitant about this conclusion, though we acknowledge that it may chime with what others have claimed about speech rhythm in general (see Lehiste, 1977). The ARMA model, while widely used and a natural first contender, may be inherently unable to capture this important, though yet unknown, class of regularities: in particular, the ARMA model can only make predictions about the future on the basis of linear combinations of the past, which may be too restrictive.

<sup>1</sup>Though see Liss et al. (2010), who also examine higher-order dependencies; in particular, they used the spectrum of the intensity envelope to recognize dysarthrias in speech, a condition resulting in the perception of "disturbed speech rhythm": since peaks in the spectrum represent a linear relationship within the original time domain, ARMA could potentially capture the same kind of structure, though the ultimate goal of our article is different.

### Alternative Hypotheses: Is Predictability Contained in the Speech Signal, or Is Predictability a Top-Down Cognitive Trait?

An alternative explanation is that few regularities exist but humans hear rhythmic patterns in speech because they impose top-down expectations: for instance, humans perceive time intervals as more regular than they really are (Scott et al., 1985) and impose metric alterations to sequences which are physically identical (Brochard et al., 2003). However, exposure to strong temporal irregularities can make humans perceive regular events as irregular (Rhodes and Di Luca, 2016). Mildly regular patterns (predictable though non-isochronous) are perceived quite well, possibly based on local properties of the pattern (Cope et al., 2012). In any case it seems that human perception of rhythm is not simply a matter of determining time intervals between acoustic intensity peaks, but that it involves a more complex process, potentially integrating multiple prosodic cues such as pitch, duration, INI or intensity values.

Top-down and global/local regularity perception relates to the question of whether the ability to perceive and entrain to temporal patterns in speech may benefit language processing at both a developmental and an evolutionary scale. From an evolutionary perspective, overregularization of perceived patterns combined with mild regularities in the speech signal might hint at culture-biology co-evolutionary processes. It would suggest that humans might have developed top-down mechanisms to regularize highly variable speech signals, which would have in turn acquired slightly more regularities (for biology-culture coevolution in language and speech, see: Perlman et al., 2014; de Boer, 2016; Thompson et al., 2016).

### Future Work

All the analyses above are based on only one speaker per language. Having multiple speakers for each language would have been preferable to account for speaker variability. Ideally, 18 speakers per language (as many as the languages encompassed in this study) would have allowed a meta-analysis via an 18 × 18 repeated measures ANOVA to test whether most variance could be explained by the language or rather the speaker/annotator factor. However, as we neither find, nor claim, existence of categorical differences between languages, we believe speaker variability is not an issue in the current analysis. Had we found strong differences between languages, we would not be able to know, with only one speaker per language, whether these were due to a particular language or, rather, to the particular speaker of that language. On the contrary, all our results are quite similar across languages and, importantly, annotators. The few outliers (Cantonese, Hungarian, and Turkish) should be investigated in future research by having many speakers and many annotators for each of them. Ours is in fact just a first attempt at introducing the Bayesian and time series approaches to the world of speech timing.

While annotating the language samples, we did not use preconceived notions about the building blocks of speech based on writing systems. Rather, we used clearly defined acoustic measures to define the events. Our approach is supported by evidence from analysis of phonological processes showing that syllables have cognitive reality even without writing. Moreover, although the sample size was small, our statistical methods have been shown in the past to be powerful enough for comparable sample sizes, and for our sample could detect some regularities. Future studies with larger samples will test whether analyzing more languages, or longer samples per language, leaves our controversial results unchanged. Should a replication confirm our negative result, this would suggest that the effect size of temporal predictability of speech is so small that it is unlikely to play an important role in the acquisition of speech.

We suggest that the ARMA model we use here to model syllable timing could be used to model another aspect of speech rhythm, namely amplitude modulation. It has been suggested that modulation in the envelope of the speech signal at different time scales might provide a useful physical correlate to rhythm perception (Goswami and Leong, 2013). In particular, the timing of signal amplitude decrease/increase and phase difference between modulation rates at different scales within the same speech signal might encode much rhythmic information (Goswami and Leong, 2013), which is not captured by our temporal prediction model above. However, hypotheses on predictability in amplitude modulation could be tested across languages using the same time series approach we use here. By swapping the roles of intensity and duration in the model above, one would allow a range of past intensity values to predict the timing and intensity of the upcoming syllable. High lag order of the resulting amplitude-modulation ARMA, possibly together with a lower Akaike information criterion value than our time prediction model, would provide empirical support for the amplitude modulation hypothesis.

Further comparative research on temporal structure perception in speech with nonhuman animal species could better inform our understanding of the evolutionary path of such an ability, determining how much this ability depends on general pattern learning processes vs. speech-specific combination of cues (Ramus et al., 2000; Toro et al., 2003; Patel, 2006; Fitch, 2012; de la Mora et al., 2013; Ravignani et al., 2014; Spierings and ten Cate, 2014; Hoeschele and Fitch, 2016).

Finally, alternative algorithms and toolboxes could be tested and compared to our manual annotation results. Crucial desiderata for such algorithms are to: (1) yield more robust results than the unsatisfying automated approaches which spurred our manual annotation in the first place; (2) be at least as psychologically plausible as our manual annotation; (3) work properly across different language families and phonological patterns. These desiderata might be partially or fully satisfied by using and adapting algorithms originally developed for music analysis. In particular, interesting research directions at the boundary between experimental psychology and artificial intelligence could be: (i) performing automated annotations after adapting the "tempogram toolbox" (Grosche and Muller, 2011) to the speech signal, (ii) assessing the perceptual plausibility of the beat histogram (Lykartsis and Weinzierl, 2015) and the empirical mode decomposition of the speech amplitude envelope (Tilsen and Arvaniti, 2013), and (iii) further testing beat tracking algorithms already used in speech turn-taking (Schultz et al., 2016).

### Conclusions

Taken together, what do our analyses imply about the existence and locus of temporal predictability in speech? Others have argued that subjectively-perceived rhythm in speech may result from coupled or hierarchical series of events at multiple timescales across domains in speech (e.g., Cummins and Port, 1998; Tilsen, 2009). Our results speak only to predictability in the temporal relations between syllables. Nevertheless, these results hint at a broadly complementary perspective: within one domain, regularity in temporal structure is difficult (but not impossible) to capture with our methods, suggesting that the degree of predictability available to a learner is weak or unreliable at any individual level (e.g., first order, second order regularities). However, the following hypothesis strikes us as worthy of investigation in a statistical framework: the impression of regularity and predictability may result from the combination of cues at multiple levels, even though individually these cues may be weak.

Our results somewhat undermine a simplistic view of the usefulness of rhythm in language acquisition (Pompino-Marschall, 1988). Future research should further investigate the interaction of acoustic features underlying the perception of phonological patterns in natural languages. Research along these lines will improve our understanding of the interplay between predictability and learning, informing the debate on both language acquisition and language evolution.

### OVERVIEW OF THE DATA FILES AND THEIR FORMATS

### Raw Annotations

The data is available as Supplementary Material and at: https://10.6084/m9.figshare.3495710.v1.

The files with extension .zip, named in the format Language\_iso\_annotator.zip, contain the raw annotations as saved Praat TextGrid files. They annotate the narrative sound files of the Illustrations of the IPA, as provided by the Journal of the International Phonetic Association (https://www.internationalphoneticassociation.org/content/journal-ipa). Whenever this audio data consisted of multiple files, multiple Praat files with annotation were created.

These annotations also contain the perceived phrase and sentence breaks (marked by / and //, respectively), which interrupted the sequences of contiguously uttered speech.

The individual TextGrid files should all be readable by Praat, version 6.

### Prepared Data

The previously mentioned TextGrid annotations were enhanced by adding the intensities and were then converted into a format that was easier for our analysis scripts to read. The Language\_iso\_annotator.out files are tab-separated text files that contain 4 columns, with each row corresponding to a single syllable nucleus annotation:


all.out assembles the previously described data from all different languages, while all\_unique.out contains the data of only one annotator for each language. To distinguish between the different concatenated datasets, these two tab-separated files contain 2 extra columns:


### Python Conversion Scripts

The Python script files (.py extension) are the ones that were used to convert the Praat .TextGrid format to the tab-separated .out files. They are included as a reference for the interested, but will not be executable as they depend on a self-created (and for now unfinished and unreleased) Python library to extract the intensities with Praat. Feel free to contact the authors for further explanation or access to the analysis scripts.

## AUTHOR CONTRIBUTIONS

BdB conceived the research; YJ, BT, PF, and BdB annotated the language recordings; YJ, AR, and BT analyzed the data. All authors wrote the manuscript.

### FUNDING

This research was supported by European Research Council grant 283435 ABACUS to BdB, and by a PhD Fellowship (Aspirant) of the Research Foundation Flanders - Vlaanderen (FWO) to YJ.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2016.00586/full#supplementary-material




**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Jadoul, Ravignani, Thompson, Filippi and de Boer. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Beat to Read: A Cross-Lingual Link between Rhythmic Regularity Perception and Reading Skill

Annike Bekius<sup>1,2</sup>, Thomas E. Cope<sup>3,4</sup> and Manon Grube<sup>1,3</sup>\*

<sup>1</sup> Machine Learning Group, Technische Universität Berlin, Berlin, Germany, <sup>2</sup> Department of Human Movement Sciences, Institute Brain and Behaviour, Vrije Universiteit Amsterdam, Amsterdam, Netherlands, <sup>3</sup> Auditory Group, Newcastle University, Newcastle-upon-Tyne, UK, <sup>4</sup> Department of Clinical Neuroscience, University of Cambridge, Cambridge, UK

This work assesses one specific aspect of the relationship between auditory rhythm cognition and language skill: regularity perception. In a group of 26 adult participants, native speakers of 11 different languages, we demonstrate a strong and significant correlation between the ability to detect a "roughly" regular beat and rapid automatized naming (RAN) as a measure of language skill (Spearman's rho, −0.47, p < 0.01). There was no such robust relationship for the "mirror image" task of irregularity detection, i.e., the ability to detect ongoing small deviations from a regular beat. The correlation between RAN and regularity detection remained significant after partialling out performance on the irregularity detection task (rho, −0.41, p, 0.022), non-verbal IQ (rho, −0.37, p < 0.05), or musical expertise (rho, −0.38, p < 0.05). Whilst being consistent with the "shared resources model" in terms of rhythm as a common basis of language and music, evolutionarily as well as in individual development, the results also document how two related rhythm processing abilities relate differently to language skill. Specifically, the results support a universal relationship between rhythmic regularity detection and reading skill that is robust to accounting for differences in fluid intelligence and musical expertise, and transcends language-specific differences in speech rhythm.

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

Katja Kornysheva, University College London, UK Mari Tervaniemi, University of Helsinki, Finland Lucía Vaquero, University of Barcelona, Spain

#### \*Correspondence:

Manon Grube manon.grube@tu-berlin.de

Received: 12 April 2016 Accepted: 09 August 2016 Published: 31 August 2016

#### Citation:

Bekius A, Cope TE and Grube M (2016) The Beat to Read: A Cross-Lingual Link between Rhythmic Regularity Perception and Reading Skill. Front. Hum. Neurosci. 10:425. doi: 10.3389/fnhum.2016.00425 Keywords: regularity, rhythm, auditory, timing, beat, language, cross-lingual, reading

### INTRODUCTION

The existence of a general relationship between language skill and auditory processing is widely accepted (Lukens, 1896). We typically acquire a language by first listening to it, and then by speaking it, before developing reading and writing skills (Harris, 1947). The perception of rhythm is emerging as being particularly relevant to both the normal acquisition of language skill and disorders of language (e.g., Huss et al., 2011; Grube et al., 2012, 2013, 2014; Przybylski et al., 2013; Gordon et al., 2015; Wieland et al., 2015).

The behavioral link between speech and language skills on the one hand, and aspects of temporal processing on the other, has been attracting research interest in recent decades. Early studies used single sounds or sound pairs, demonstrating links with reading skill, or language impairments for temporal processing ability at the segmental time scale of individual phonemes (Tallal, 1980; Wright et al., 1997; Goswami et al., 2002; Walker et al., 2006; Moore et al., 2010). However, language typically comes in sentences, i.e., streams of syllables consisting of several phonemes over a period of seconds, typically with a characteristic rhythm and stress pattern. The theory that children first process the whole sentence or phrase before breaking it down into single phonemes (Metsala and Walley, 1998) underpins the need to explore the role of sequence processing at supra-segmental time scales. Speech rhythm work in infants and adults has documented the relevance of rhythmic cues and durational patterns, in particular word and phrase boundaries (Smith et al., 1989; Jusczyk et al., 1992) that manifest in the "quasi-rhythmic" (Giraud and Poeppel, 2012) temporal structure of speech (Rosen, 1992).

Recent studies on auditory processing and language or literacy skills have provided evidence for a long-neglected role for rhythm and timing per se, rather than simply for the processing of acoustic features, such as pitch, over time. Huss et al. (2011) reported group-level deficits for the detection of changes in musical rhythms of varying meter in dyslexic compared to control children, and a significant correlation between rhythm and phonological and literacy measures across groups. Using a number of rhythm and timing (as well as pitch and timbre) measures, our work in a large cohort of 11-year-olds (Grube et al., 2012) demonstrated that the correlations with language and literacy skills that were most robust, and least affected by general intellectual skill, were those in the rhythm domain. The strongest and most consistent correlation (with a Spearman's rho of about 0.3) was found for the processing of short, isochronous 5-tone sequences (1.5–2.5 s), corresponding to short sentence or phrase levels in speech. Less consistent correlations with language and literacy skill were found for the more musically oriented detection of perturbations in a longer, strongly metrical rhythmic sequence (composed of 7 tones, with an average duration of 3.2 s), created by equidistant spacing of accented tones in time (Grube and Griffiths, 2009).

In another study we looked specifically at the processing of different types of longer rhythmic sequences and reading skill in young, native English-speaking adults (Grube et al., 2013). We found a strong and consistent correlation (with a Spearman's rho of about 0.5) for the detection of a "roughly regular" beat, created by adding a parametrically varied amount of jitter to 11-tone sequences with an average length of 4.4 s, corresponding to sentence levels in speech. Notably, this correlation was not seen in the 11-year-olds, suggesting a relevance for this ability in later, higher-order language development. The second strongest correlations in both studies were found for the detection of perturbations in a strongly metrical rhythm. Whilst the metrical rhythms represented a highly simplified version of the hierarchical time structure of Western music (London, 2004), the roughly regular rhythms might similarly mimic the quasi-rhythmic structure of speech (Giraud and Poeppel, 2012).

The regularity detection task might therefore capture an ability relevant to speech. This task measures the point at which the participant cannot reliably tell which of two simple tone sequences is closer to being regular, i.e., less random. In the beginning, one sequence is perfectly regular (isochronous), the other one highly irregular (by adding a jitter of 30%). Over the course of the task, the initially regular sequence becomes less and less regular, until both sound equally irregular to the listener. The previously demonstrated correlation with reading skill in English-speaking young adults (Grube et al., 2013) suggests a role specifically for this ability to "pull out" such a just noticeable regularity at the sequence (i.e., sentence) level, beyond the sound of single phonemes, and relevant to higher-order language skills. Why would this be? We argue that the subjectively perceptible, somewhat regular rhythm of speech is similar in temporal structure to the "rather irregular" rhythms with an intermediate to high jitter used here. Specifically, sequences with a jitter of up to 15% are closest to the variability of syllable duration and inter-stress intervals in speech, and the presence of regularities in this range aids speech perception (Tsyplikhin, 2007). We demonstrate sensitivity to regularities within an irregular sequence in the same range, and our inter-onset-intervals, with an average of 400 ms, would correspond to the inter-stress intervals, equivalent to the temporal separation of every second or third syllable (Scott, 1982; Rosen, 1992; Grabe and Low, 2002; Tilsen and Arvaniti, 2013).

The aim of this work is to test for a dissociated pattern in correlation with adult reading skill for this ability to extract such a roughly regular beat, compared to the related "mirror image" ability to detect small deviations from a perfectly regular beat.

Both tasks start out with one sequence being perfectly regular, and one being highly irregular. In both tasks, the difference between the two becomes progressively smaller, in an adaptive manner according to individual performance. A listener's perceptual threshold is defined here as the point at which they are able to correctly distinguish the sequences 70.9% of the time (Levitt, 1971). Nonetheless, the tasks fundamentally differ; the irregularity detection task tests for the smallest perceivable distributed deviation from perfect isochrony, while the regularity detection task tests for the largest degree of irregularity at which the listener is able to perceive any regularity at all. Phenomenologically, in the irregularity detection task the listener attempts to distinguish increasingly regular sequences, whilst in the regularity detection task the listener chooses between increasingly irregular sequences.

With respect to the underlying mechanisms of timing, we hypothesize that performance in the two tasks relies on differential contributions of two or more complementary mechanisms of "absolute," (i.e., duration-based) and "relative," (i.e., beat-based) timing (Grube et al., 2010a,b; Breska and Ivry, 2016: "discrete vs. continuous"; Teki et al., 2011). For the irregularity detection task, toward the end of which the listener is presented with two seemingly isochronous sequences, we expect performance to rely largely on beat-based timing mechanisms, supported by the striato-thalamo-cortical subsystem (Teki et al., 2011). For regularity detection in contrast, in which the listener will be facing two highly irregular sequences, we would expect performance to depend more on duration-based timing mechanisms, supported by the olivo-cerebellar sub-system (Teki et al., 2011). The two tasks also differ in whether a comparator beat is provided to the listener or must be generated internally, reinforcing the distinction between cerebellar and basal ganglia dependence (Grahn, 2009).

The available data support a partial dissociation of the two subsystems responsible for the processing of regular and irregular sequences. Functionally, we argue that they contribute differentially to absolute vs. relative timing in the subsecond range (Chen et al., 2008; Grahn, 2009; Teki et al., 2011), relevant to language and music. In terms of underlying mechanisms, recent neurophysiological work has implicated neural oscillations in the theta, alpha, and beta ranges as playing a key role in entrainment with the temporal patterns of regular or metrical beats (Iversen et al., 2009; Fujioka et al., 2012), as well as those of pseudo-regular speech envelopes (Ghitza, 2011; Wöstmann et al., 2016). Consistent with domain-general timing functions, neuroimaging studies on shared brain bases for music and speech have demonstrated a common network involving middle and superior temporal gyri and inferior and middle frontal gyri (Schön and Tillmann, 2015). The similarities in brain bases between music and speech, especially in the temporal domain, further motivate our search for universal behavioral correlations.

In terms of our everyday lives, we hypothesize that the ability to pull out an ever-so-roughly regular beat from highly irregular sequences plays a critical role in speech perception and production. In contrast, we would expect sensitivity to small deviations from an isochronous beat (irregularity) to be less relevant to the successful processing of "quasi-rhythmic" speech.

Furthermore, we postulate that a behavioral relationship between regularity detection and reading skill would reflect a universal biological relationship. We argue that the ability to detect a roughly regular beat is a sensitive measure for a "temporal scaffolding mechanism" that supports the perception and production of any language, despite possible differences in speech rhythm. We therefore test here for a correlation between the two rhythm cognition measures of regularity and irregularity detection and rapid reading skill in a mixed cohort of different native language speakers. We assessed reading skill with the rapid automatized naming task (RAN) from the York Adult Assessment Battery of phonological and literacy skill. This is a standardized test that is a validated predictor of literacy skills (Warmington et al., 2013). We specifically chose RAN as a measure of fast reading that can be easily and comparably applied in different mother tongues (Georgiou et al., 2008).

In continuation of the findings leading up to this study (Grube et al., 2013), this work assesses the following novel aspects of the association between auditory rhythm perception and reading skill: (i) the constancy of the correlation with regularity detection across languages and a wider age range; (ii) the dissociation in correlation for an irregularity detection task using similar sequences.

### MATERIALS AND METHODS

The order of behavioral testing was the same as the order of tasks in this methods section. Total session duration was approximately 45 min. Participants were instructed in English, the one common language of proficiency between all participants and experimenters, but performed the RAN test in their mother tongue.

### Participants

The study was conducted in 26 adults (age range 20 to 40, mean age 28 ± 4.6 years; 12 male), who were native speakers of 11 different mother tongues (Danish, 1; Dutch, 3; German, 11; English, 2; French, 1; Greek, 1; Italian, 1; Romanian, 1; Slovenian, 2; Spanish, 1; Turkish, 2). They were in part professionals and in part students from different disciplines; duration of education ranged from 13 to 27 years (mean, 19.1 years ±3.7). Musical expertise ranged from none to (semi/ex) professional, summarized in a score on a scale 1–5, based on the amount of musical training: 1, no musical experience; 2, up to three years of practice; 3, more-than-three to eight years of practice; 4, more than eight years of practice; 5, professional musicianship. **Table 1** contains individual demographics and descriptive group statistics. The study was in accord with the guidelines of, and approved by, the Ethics Committee of the Department of Psychology at TU Berlin. All subjects gave written informed consent in accordance with the Declaration of Helsinki.

### Test of Fluid Intelligence

Fluid intelligence was measured using the progressive matrices from the Wechsler Abbreviated Scale of Intelligence (WASI), scored as the number of items correct. We used raw scores rather than standard scores, as all participants would be in the same age group. (Any non-linearity in the transformation into standard scores is thus not included but would in any event have no effect on the non-parametric correlation analyses employed here.)

### Reading Task

Reading skill was measured by the digit version of the standardized rapid automatized naming test (RAN) from the Revised York Adult Assessment Battery (Warmington et al., 2013). The task is to read a list of 50 digits as fast as possible, after a short practice of 7 items. The outcome measure is the time needed (in s) to read the full list of 50. Each participant performed the task in their mother tongue.

### Auditory Rhythmic Timing Tasks Setup

Testing was performed in a quiet room. Tasks were implemented in Matlab 2012b. Stimuli were created at 44.1-kHz sampling rate (16-bit resolution), delivered via an external soundcard (Edirol UA-4FX) and closed headphones (Sennheiser HD 380 pro) at approximately 80 dB rms sound pressure level.

### Stimuli

Sequences were composed of nine to eleven 300-Hz pure tones (each of 100 ms duration including 20 ms raised cosine ramps), and used one of three possible underlying tempi to avoid habituation and learning effects. The three possible tempi had mean inter-onset-intervals of 340, 400, or 460 ms; deviations from this pulse rate depended on the degree of irregularity applied to each sequence. The length and tempo of the sequences were carefully piloted and chosen to "sound right," and have been previously validated in developmental (Grube et al., 2012, 2014), neurodegenerative (Grube et al., 2010a; Cope et al., 2014a,b), and healthy adult (Grube et al., 2010b, 2013) cohorts. The sequences were sufficiently long to allow the perceptual judgment about the presence of a "roughly" regular beat, whilst being sufficiently short that listeners did not get bored. Tempi were chosen to be within the optimal range for the perception of a beat (Fraisse, 1984; Drake et al., 2000; Grondin, 2001; London, 2004).



English; Fre, French; Ger, German; Ita, Italian; Jap, Japanese; Rom, Romanian; Slo, Slovenian; Ser, Serbian; Spa, Spanish; Swe, Swedish; Tur, Turkish.

### Tasks

Auditory rhythmic timing was assessed by two tasks, one measuring the ability to detect small deviations from a perfectly regular sequence, and the other to detect a roughly regular beat within a highly irregular sequence. Both tasks were based on an adaptive two-alternative forced-choice paradigm. One trial consisted of two sequences (reference and target) presented in pseudo-randomized order. Subjects indicated the perceived target position by pressing 1 or 2 on a standard keyboard. Target-to-reference differences were supra-threshold initially, and were adaptively adjusted according to individual performance following a two-down-one-up algorithm, with a convergence level at the 70.9% correct point of the psychometric function (Levitt, 1971). The algorithm used a larger step size up to the fourth reversal and after that a smaller one.

The parametrically varied feature of interest is the degree of irregularity ("jitter"), introduced to the sequences by shifting each tone forward or backward. The jitter values used range from 0% (perfectly regular; isochronous) to 30% (highly irregular), and are realized by pseudo-randomly shortening and lengthening each individual inter-onset-interval by the desired jitter value ±50%. For a jitter value of 30% for instance, inter-onset-intervals were changed by 15–45%, in a way that the average change across the sequence was 30%. For details on additional constraints to avoid accidental interval repetitions etc. see Cope et al. (2014a); example sequences are available to listen to in Supplementary materials.
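The jitter manipulation described above can be illustrated with a minimal Python sketch. This is not the authors' Matlab implementation: the function name `jittered_iois`, the uniform sampling of change magnitudes, and the rescaling step used to fix the mean change are all assumptions made for illustration, and the additional constraints described in Cope et al. (2014a) are not implemented.

```python
import random


def jittered_iois(n_tones, mean_ioi_ms, jitter, rng):
    """Return n_tones - 1 inter-onset intervals (in ms) for one sequence.

    jitter is a proportion (0.30 for 30%). Each interval is lengthened or
    shortened by a proportion drawn from [0.5 * jitter, 1.5 * jitter]
    (assumed uniform). The magnitudes are then rescaled so that their mean
    equals jitter exactly, which can push single values slightly outside
    that range.
    """
    n_intervals = n_tones - 1
    if jitter == 0:
        # Perfectly regular (isochronous) sequence.
        return [float(mean_ioi_ms)] * n_intervals
    magnitudes = [rng.uniform(0.5 * jitter, 1.5 * jitter)
                  for _ in range(n_intervals)]
    scale = jitter * n_intervals / sum(magnitudes)
    magnitudes = [m * scale for m in magnitudes]
    # Pseudo-randomly shorten (-1) or lengthen (+1) each interval.
    signs = [rng.choice((-1, 1)) for _ in range(n_intervals)]
    return [mean_ioi_ms * (1 + s * m) for s, m in zip(signs, magnitudes)]


# Example: an 11-tone sequence at the 400 ms tempo with 30% jitter.
rng = random.Random(1)
iois = jittered_iois(11, 400, 0.30, rng)
onsets = [0.0]
for ioi in iois:
    onsets.append(onsets[-1] + ioi)
print([round(t) for t in onsets])
```

Because the direction of each change is drawn independently, the realized mean inter-onset-interval of a single sequence can drift from the nominal tempo in this sketch.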

In the irregularity detection task (**Figure 1**, top; adapted from Cope et al., 2014a,b) subjects were required to indicate which of the two sequences presented per trial was more "irregular." The reference sequence is perfectly regular (0% jitter) throughout. The target has a clearly noticeable jitter of 20% initially, which is then adaptively adjusted in steps of 3 and 1% according to performance. Over the course of the task, the target approaches the reference jitter (0%) until the point at which the subject cannot detect which of the two sequences is not perfectly regular.

In the regularity detection task (**Figure 1**, bottom; adapted from Grube et al., 2010a,b, 2012, 2013, 2014; Cope et al., 2014a,b) subjects were required to indicate which of two sequences was more "regular." In this task, the reference is always highly irregular (30% jitter). The target is initially perfectly regular (0% jitter), and is adaptively adjusted in steps of 4 and 2.5%. Over the course of the task, the target approaches the reference jitter (30%) to the point at which the subject cannot detect which sequence contains more regularity.

Total number of trials per task was 48. Thresholds were calculated as the mean of the last 6 reversals (of the small step size). Inter-stimulus intervals (from the end of the first to start of the second sequence within each trial) and inter-trial intervals (from response to the start of the first sequence of the next trial) were 1500 ms each. The tasks took about 15 min each.
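The adaptive procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the code used in the study: `run_staircase` and `simulated_listener` are hypothetical names, the psychometric function of the simulated listener is invented, and the threshold is taken from the reversals recorded after the fourth one (our reading of "of the small step size"). The two-down-one-up rule settles where two consecutive correct responses are as likely as not, i.e., at roughly 71% correct (Levitt, 1971).

```python
import math
import random


def run_staircase(respond, start, big_step, small_step, n_trials=48,
                  lower=0.0, upper=30.0, n_threshold_reversals=6):
    """Two-down-one-up staircase on the target-to-reference difference.

    respond(level) returns True for a correct response at that difference
    (in % jitter). Returns (threshold, reversals); threshold is None if no
    small-step reversal occurred.
    """
    level = start
    correct_in_a_row = 0
    last_direction = 0  # -1 = harder, +1 = easier, 0 = no move yet
    reversals = []
    for _ in range(n_trials):
        # Larger step up to the fourth reversal, smaller one afterwards.
        step = big_step if len(reversals) < 4 else small_step
        if respond(level):
            correct_in_a_row += 1
            if correct_in_a_row < 2:
                continue  # wait for a second consecutive correct response
            correct_in_a_row = 0
            direction = -1
        else:
            correct_in_a_row = 0
            direction = +1
        if last_direction != 0 and direction != last_direction:
            reversals.append(level)
        last_direction = direction
        level = min(upper, max(lower, level + direction * step))
    small_step_reversals = reversals[4:][-n_threshold_reversals:]
    if not small_step_reversals:
        return None, reversals
    return sum(small_step_reversals) / len(small_step_reversals), reversals


def simulated_listener(level, rng, scale=6.0):
    """Hypothetical listener: chance (50%) at 0, near-perfect for large levels."""
    p_correct = 1.0 - 0.5 * math.exp(-(level / scale) ** 2)
    return rng.random() < p_correct


# Irregularity detection: target starts at 20% jitter, steps of 3% and 1%.
rng = random.Random(7)
threshold, reversals = run_staircase(
    lambda level: simulated_listener(level, rng),
    start=20, big_step=3, small_step=1)
print(threshold, len(reversals))
```

For the regularity detection task the same routine would track the difference from the 30% reference (starting at 30, with steps of 4 and 2.5), and the threshold in jitter units would be 30 minus the returned value.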

Task order was fixed in the way that made the most sense in terms of leading the subject through the session: starting with matrices (progressing from easy to harder ones, and also, although being the control measure, the longest task); followed by RAN (fast and fun, but best not performed "out of the cold"); and then the two timing tasks: firstly the irregularity detection (easily grasped) and secondly the regularity detection task (a little more unusual and best understood second; see online example stimuli).

### Statistical Data Analysis

Due to significant deviations from normal distribution (**Table 2**), revealed by the Lilliefors version of the Kolmogorov-Smirnov test, correlation analysis used Spearman's rho. The one-tailed version was used based on the a-priori hypothesis of a positive correlation between performance on reading and rhythm tasks. In a second step, in order to control for potentially confounding effects of musical expertise and non-verbal IQ, scores for musical expertise and WASI matrices were partialled out (of the correlations between RAN and rhythm measures). As the two rhythm tasks shared some underlying variance, the dissociation of relationship to language skill was finally confirmed by examining each task partialled out of its counterpart.
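As an illustration of this kind of analysis, the following Python sketch computes Spearman's rho and a first-order partial Spearman correlation from rank-transformed data. It is one common way to obtain such a coefficient and is an assumption on our part, not necessarily the exact procedure used in the original analysis; the function names and the toy values are hypothetical and are not data from the study. P-values are not computed here.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_spearman(x, y, z):
    """Spearman correlation of x and y with covariate z partialled out."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy values only (hypothetical): RAN time, regularity threshold, covariate.
ran_time = [18.2, 21.5, 16.9, 24.0, 19.7, 17.3]
regularity = [19.0, 14.5, 22.1, 11.8, 16.0, 17.5]
covariate = [3, 1, 4, 1, 2, 3]
print(spearman(ran_time, regularity))
print(partial_spearman(ran_time, regularity, covariate))
```

Partialling out two covariates at once, as done for musical expertise and matrices scores together, would require applying the same formula recursively or regressing the ranks on both covariates.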

### RESULTS

### Correlations between Auditory Timing Tasks, Musical Expertise, and Fluid Intelligence

As expected, the two timing tasks of regularity and irregularity detection themselves were strongly and significantly correlated with each other (Spearman's rho, −0.5, p, 0.004). Irregularity detection thresholds ranged from 2 to 14% (median, 3.9%) jitter, regularity detection thresholds from 9.9 to 27.1% (median, 17.8%) jitter. Note that for irregularity detection, lower threshold values indicate better performance, whilst for regularity detection higher thresholds are better, hence the negative Spearman's rho correlation coefficient between the two tasks.

Performance on both timing tasks was also positively correlated with musical expertise, and this effect was somewhat stronger for regularity detection (rho, 0.49, p, 0.006) than irregularity detection (rho, −0.39, p, 0.026). After partialling out the effect of musical expertise, the correlation between the two timing tasks was somewhat reduced in strength but remained significant (rho, −0.39, p, 0.026).

Neither of the timing tasks was significantly correlated with fluid intelligence, although the relationship trended toward significance for regularity detection (rho, 0.32, p, 0.057) whilst being very weak for irregularity detection (rho, −0.19; p, 0.18). The correlation between the two timing tasks was virtually unaffected by partialling out the effect of fluid intelligence (rho, −0.48, p, 0.008).

### Correlation with Reading Skill

In support of the central hypothesis, there was a strong, statistically significant correlation between rapid automatized naming (RAN) scores and regularity detection thresholds (rho, −0.47, p, 0.008; **Table 2**, **Figure 2**). Note that the correlation coefficient is negative as better performance is indicated by larger regularity detection thresholds but lower RAN times. This correlation remains significant after partialling out the effect of musical expertise (rho, −0.38, p, 0.029) or fluid intelligence (rho, −0.37, p, 0.036), and borderline significant after partialling out both (rho, −0.31, p, 0.072). In other words, the correlation between regularity detection and RAN explained 22% of the variance before, and 10% after, partialling out the effects of both musical expertise and fluid intelligence.

TABLE 2 | Correlation strength and significance between auditory timing abilities and rapid automatized naming skill, and the effects of musical expertise and fluid intelligence.

Listed are Spearman's rho correlation coefficients and p-values for correlations of the two timing tasks with: the rapid reading measure (RAN, time needed in s), followed by the same, but with accounting for shared variance between the two timing tasks; the covariate of musical expertise (score from 1 to 5), followed by the partial one with RAN, after controlling for musical expertise; the covariate of fluid intelligence (WASI matrices scores), followed by the partial correlation with RAN, after controlling for fluid intelligence; and RAN, after partialling out musical expertise and matrices. NB: Some correlations have a negative and some a positive sign, depending on the combination of measures. Irregularity and RAN measures are "the lower the better"; regularity, musical expertise, and matrices measures are "the higher the better." Whilst not all correlations are significant, all of them, even those that only show a trend, are consistent with the direction of performance being positively correlated. Note how strong and significant the correlation with RAN time needed is for regularity compared to irregularity detection thresholds. This difference becomes even clearer when partialling out (of the correlation with RAN) the shared variance for any of the other measures. Significance level was p < 0.05. In bold, significant. In brackets, Spearman's rho < 0.22 (explaining <5% variance).

FIGURE 1 | … two highly irregular sequences (regularity detection). Depicted are one exemplar reference and one target per task.

Strikingly, there was no such strong or significant correlation between irregularity detection thresholds and rapid automatized naming (RAN) scores. There was a weak trend for positive correlation of performance (rho, 0.26, p, 0.102), which became weaker still when partialling out musical expertise (rho, 0.16, p, 0.225), matrices scores (rho, 0.19, p, 0.187) and both together (rho, 0.12, p, 0.292).

Given the strong correlation between the two timing tasks, we also tested their partial correlations with RAN time needed, whilst accounting for their shared variance. This yielded a Spearman's rho correlation coefficient of −0.41 (p = 0.022) for regularity detection, compared to a rho of 0.03 (p = 0.447) for irregularity detection thresholds. This dissociation confirms that individual RAN speed is significantly more strongly related to perceptual threshold for regularity detection than irregularity detection.

Finally, we separately tested the correlations for RAN times against fluid intelligence and musical expertise. Consistent with the effects of partialling these out of the RAN correlations with regularity and irregularity detection, there was a strong correlation with RAN for the matrix reasoning scores (rho, −0.59, p, 0.0007) and a moderate correlation of borderline statistical significance for musical expertise (rho, −0.31, p, 0.06).

### DISCUSSION

The present work comprises a test of two, complementary, "mirror image" aspects of auditory rhythm cognition, and their relationship with reading skill. Both rhythm cognition measures are based on simple tone sequences of varying degrees of regularity. One measures the ability to detect small ongoing deviations from an isochronous beat in highly regular sequences (irregularity detection); the other measures the "opposite" ability in terms of the spectrum of regularity processing; that is, the ability to extract a just noticeable, "roughly" regular beat from highly irregular sequences (regularity detection).

### The "Roughly Regular" Beat to Read

The central hypothesis in this work is that the ability to track a "roughly" regular beat is a key ability for the development of speech and language skills, which is universal across different languages. Consistent with this hypothesis, the results demonstrate a significant, robust and specific correlation between the detection of a roughly regular beat and rapid number reading (RAN). They further demonstrate a lack of such a relationship for the "mirror opposite" perceptual timing ability to detect ongoing deviations from a perfectly regular beat.

detection thresholds (significant: rho, −0.47, p < 0.01).

The regularity detection task objectively measures the listener's threshold in terms of the just noticeable degree of regularity, rather than the subjective judgment inherent in explicitly asking the listener whether or not they thought a sequence had an underlying regular beat. It uses an adaptive, criterion-free, 2-alternative, forced-choice paradigm in which the reference sequence always has a jitter of 30%, which renders the beat imperceptible (Madison and Merker, 2002). Toward the end of the task both sequences within a trial sound highly irregular to the listener and the objective is to decide which one is just that little bit closer to being regular.

The irregularity task measures the opposite ability; namely the ability to perceive small deviations from an isochronous beat, whereby one sequence is always perfectly regular. Toward the end of this task, both sequences sound very regular, and the subject is asked to decide which one contains ever-so-small deviations from perfect isochrony. Both abilities are related to the processing of regularity, and the two measures are correlated (rho, −0.5, p < 0.001), yet, importantly, they show a dissociated pattern of correlation with reading skill.
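The two kinds of stimulus described above (a perfectly isochronous sequence and a reference sequence with 30% jitter) can be sketched as follows. This is only an illustration under stated assumptions: jitter is taken here to mean a uniform perturbation of each inter-onset interval by up to the given fraction of the base interval, and the base interval of 400 ms, the sequence length and all function names are hypothetical rather than taken from the study.

```python
import random


def jittered_intervals(n_intervals, base_ms, jitter, rng):
    """Inter-onset intervals in ms, each perturbed uniformly by up to +/- jitter * base_ms.

    Assumption: this uniform-jitter definition is illustrative only.
    """
    if not 0.0 <= jitter < 1.0:
        raise ValueError("jitter must be in [0, 1)")
    return [base_ms * (1.0 + rng.uniform(-jitter, jitter)) for _ in range(n_intervals)]


def onset_times(intervals):
    """Cumulative tone onset times in ms, starting at 0."""
    onsets = [0.0]
    for interval in intervals:
        onsets.append(onsets[-1] + interval)
    return onsets


rng = random.Random(0)  # seeded so the example is reproducible
isochronous = jittered_intervals(10, 400.0, 0.0, rng)   # perfectly regular
reference = jittered_intervals(10, 400.0, 0.30, rng)    # 30% jitter reference
print(onset_times(isochronous))
print(onset_times(reference))
```

Because every call draws fresh random perturbations, each jittered sequence has its own unique pattern; an adaptive procedure would then vary the `jitter` argument of the target sequence from trial to trial.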

Our previous work has assessed the relationship between rhythm processing and sentence-level reading tasks (Grube et al., 2013). The present study extends this to RAN, a measure of reading in the wider sense (Di Filippo et al., 2005) and finds an even stronger correlation, specifically for the detection of a just noticeable degree of regularity. In other words, the ability to detect a roughly regular beat, similar to the quasirhythmic temporal structure of speech, correlates with the ability to rapidly read a page-long list of digits. When broken down into its cognitive constituents, RAN performance relies to a great extent on the strength of connection between orthographic and phonological representations, articulatory fluency, working memory, and the capacity to make rapid eye movements (saccades; Norton and Wolf, 2012). As our participants were all relatively young adults, and none had a disease of the nervous system, we would expect their saccadic latencies and velocities to be similar (Carpenter and Williams, 1995; Antoniades et al., 2007). We are interested in RAN as one measure of fast reading, and in its correlation with rhythmic processing. How far this relationship can be broken down into the factors contributing to RAN speed cannot be determined from the present data. We did not assess those factors, but would predict that further work might demonstrate a particular correlation between rhythm processing and articulatory agility, fluency, and working memory. Whether individuals who are better at detecting rhythmic regularities and able to read out loud faster also tend to read aloud in a more rhythmically regular fashion will be the subject of future work.

The data are consistent with a shared cognitive "sequencing" mechanism (Tillmann, 2012) for structuring events in time, both for rhythmic auditory input and motor speech output. Importantly, this relationship is present across languages with different rhythmic structures, supporting this being a universal mechanism for language acquisition.

Notably, and despite a significant correlation of fluid intelligence with reading skill and a marginally significant one with regularity detection, the relationship between reading skill and regularity detection could only be explained to a small degree by non-verbal intelligence as measured by progressive matrices. This is consistent with our previous findings, in which the correlations between rhythm processing and language skills were relatively independent of non-verbal intellectual skill in early adolescence and early adulthood (Grube et al., 2012, 2013). That is, these correlations were less affected by partialling out the effects of non-verbal IQ than those for pitch processing or processing speed (e.g., Deary, 1994; Stewart et al., 2015).

### Generic Regularity and Its Relationship to Musical and Speech Rhythm

The interval durations were chosen to be within the optimal range for the detection of a regular beat (Fraisse, 1984; Drake et al., 2000; Grondin, 2001; London, 2004). At the same time, these durations match the time scale between supra-segmental markers and stressed syllables or "beat intervals" in speech (Scott, 1982; Rosen, 1992; Grabe and Low, 2002; Tilsen and Arvaniti, 2013). Each sequence had a unique jitter pattern, mimicking natural speech, in which no two sentences are identical. At the same time, the chosen tempi correspond well to those used in Western music. Notably, and in contrast to speech, musical rhythm (the succession of events in time) and meter (the underlying beat) are typically precisely defined and predictable, building on an isochronous beat and featuring a hierarchically organized metrical structure with nested levels of periodicity (London, 2004). We would argue that whilst the irregularity detection task, which is based on detecting small deviations from isochronous sequences, tested an aspect of perceptual timing more relevant to musical skill, our regularity detection task is geared toward our ability to perceptually track as well as produce a roughly regular rhythm, like that in speech. We therefore propose that the regularity detection threshold reflects the capacity of the brain to facilitate the structured intake and output of speech by providing a "temporal scaffolding" for both the perceptual and the motor domain (Ivry, 1996; Tierney and Kraus, 2016). One recent piece of evidence consistent with such a shared "temporal scaffolding" mechanism is the finding of rhythm perception deficits in children who stutter (Wieland et al., 2015). In previous work we have shown a much stronger correlation with reading for the regularity detection task than for the more musically relevant task of strongly metrical sequence processing accuracy (Grube et al., 2013).

### Relationship with Musical Expertise

The present main finding confirms the relevance of auditory regularity processing in reading skill. It is proposed that this occurs through a "temporal scaffolding" mechanism to structure input and output in time. With the observed correlation between rapid reading and regularity scores comes an effect of musical expertise that accounts for part of the correlation between reading and regularity detection. Rhythm cognition is expected to correlate with musical expertise, as musical training is thought to improve perception and production of rhythm [although recent behavioral work on the effect of formal training on different musical instruments showed no difference between pianists, drummers, violinists and singers in their superior rhythm perception and production performance compared with non-musicians (Matthews et al., 2016)]. The direct correlation between musical expertise and RAN times was moderate, and of borderline statistical significance. However, whether this effect is truly independent of the correlation between rhythm cognition and reading, or can be seen as an important contributing factor, remains open for further investigation. Furthermore, the direction of causality cannot be gleaned from a cross-sectional study: the relationship with musical expertise could in part be due to a predisposition to take up music lessons if one has a good feeling for rhythm. Either way, the moderate effects found for our fairly crude measure of musical expertise in a relatively small sample suggest that the true effect may well be stronger than observed.

The effect of musical expertise lends support to the shared resources model (e.g., Patel, 2003; Gordon et al., 2011) and is in line with a number of other groups' work on correlations between language and literacy skill, and more musically oriented rhythm tasks (e.g., Overy, 2003; Huss et al., 2011; Strait et al., 2011, 2013). This effect is somewhat consistent with the findings of our previous study (Grube et al., 2013) in which we used a number of word, non-word and poetry reading measures. We demonstrated there a strong effect for regularity detection, followed by a moderate one for strongly metrical rhythm processing, but none for that of gradual tempo contour of similar complexity to the regularity detection task. Taken together, the existing findings support the interpretation that the ability to analyze temporal structures with a quasi-regular beat is particularly relevant to speech and language skills. Musical training in turn may improve rhythm cognition abilities in a top-down fashion, and thereby strengthen the behavioral link between reading and regularity detection, as seen in the present work and supported by a recent report of neural correlates of enhanced speech rhythm sensitivity and musical aptitude (Magne et al., 2016). Overall, the evidence that music and speech, two sophisticated "high-end applications" of human auditory processing, have common underlying brain mechanisms seems strong, and is further corroborated by shared patterns of learning and brain plasticity in the two domains (see review by Zatorre, 2013). The extent to which these processes, and the mechanisms they employ to analyze rhythm, are shared or specialized will be the subject of further work (c.f. Strait et al., 2011, 2013).

### Cross-Lingual Universality

The observed correlation holds across different languages, as tested here in a mixed cohort of twenty-six adult native speakers of eleven languages. We find a strong and significant relationship between the participants' abilities to detect a roughly regular beat and to rapidly read out loud a list of 50 digits in their mother tongue. The participants were all fairly highly educated, and all rated themselves as advanced speakers of at least one additional language (**Table 1**). One might hypothesize an effect of linguistic background on RAN performance or regularity detection, or a correlation between the two. We therefore conducted additional analyses to test for such effects (using the number of second languages spoken proficiently) but did not find even a trend in the hypothesized direction. It cannot be known whether this null result reflects the crudeness of the measure or a genuine absence of an effect, the latter supporting the universal nature of the link between RAN and regularity detection. This will be the subject of larger, systematic studies looking at this correlation in a language-background specific way. Whilst the scope of this study did not allow for a comparative mother-tongue specific analysis, there were no clear deviations as a function of language. Consistent with this, a 10-month longitudinal study (from the start of formal literacy instruction) by Caravolas et al. (2012) supported the RAN measure, alongside phoneme awareness and letter-sound knowledge, to tap cognitive processes that are important for learning to read in languages of all alphabetic orthographies. The authors tested this in English, Spanish, Slovak, and Czech: four languages that vary in rhythmic properties to a comparable extent with the present range of Indo-European languages with phonetic (primarily Roman) scripts. Whilst there have been demonstrations of language-specific differences in speech rhythm (Dauer, 1983; Grabe and Low, 2002; Das et al., 2008) and there is an ongoing search for metrics to best capture them (Patel et al., 2006; Nolan and Asu, 2009; Turk and Shattuck-Hufnagel, 2013; Dellwo et al., 2015), we would argue that the presence of a roughly regular beat of some kind is inherent to them all. Consistent with our cross-lingual finding for rhythmic regularity processing, Goswami and coworkers have demonstrated a universal role for aspects of sound rise-time as a fundamental, language-general feature. Specifically, Muneaux et al. (2004) reported deficits in cyclic amplitude modulation in French-speaking dyslexic compared to typically developing children, and Goswami et al. (2011) showed a corresponding consistent weakness in the sensitivity to the rate of onset of the amplitude envelope (rise time) in English, Spanish, and Chinese, three languages with distinct rhythmic properties.
Taken together, cross-lingual work undertaken here and elsewhere will be important in informing the design of training strategies for language development, regardless of language-specific differences in phonemes (Näätänen et al., 1997), speech rhythm and melody.

### Language Impairment

The present data demonstrate a clear and strong cross-lingual correlation between the processing of rhythmic regularity of generic, pre-phonemic tone sequences, and normal adult reading skill. Whether this correlation will hold in individuals with language impairments remains to be explored. Based on other groups' reports on auditory and specifically rhythm deficits (discussed above), and our finding of the same correlation with shorter, simpler rhythm processing tasks in 11-year olds with typical development (Grube et al., 2012) and dyslexic traits (Grube et al., 2014), we would expect the present correlation to be found in listeners with language impairments. Given that correlations in our studies are stronger for more generic (i.e., less musically oriented) rhythms, we hypothesize that a rhythm cognition training programme based on such "roughly" regular rhythms could be at least as efficient as musical intervention (e.g., Overy, 2003; Moreno et al., 2009; Schön and Tillmann, 2015; Habib et al., 2016), as it may tap more directly into the shared underlying mechanism.

### REFERENCES

Antoniades, C. A., Altham, P. M., Mason, S. L., Barker, R. A., and Carpenter, R. (2007). Saccadometry: a new tool for evaluating presymptomatic Huntington patients. Neuroreport 18, 1133–1136. doi: 10.1097/WNR.0b013e32821c560d

### CONCLUSION

The present results support a universal, cross-lingual role for rhythmic regularity processing in adult language (specifically, rapid automated reading) skill. The strong and robust correlation with the ability to detect a "roughly" regular beat similar to the "pseudo-regular" rhythm of speech on the one hand, and the absence of such a correlation for the detection of small deviations from a perfectly regular beat on the other, suggest a differential relevance to higher order speech and language skills, reflecting an evolutionary effect manifest in individual development.

### AUTHOR CONTRIBUTIONS

AB carried out the piloting of tasks and the behavioral testing, and contributed to the writing of the manuscript. TC contributed to the design and programming of the tasks, and the writing of the manuscript. MG designed the work, programmed the tasks, supervised the data acquisition, analyzed and interpreted the data, and wrote the manuscript.

### FUNDING

The research leading to these results has received funding from the People Programme (Marie Curie Actions) of the European Union's Seventh Framework Programme (FP7/2007-2013) under REA grant agreement no. 600209 (IPODI fellowship awarded to MG). Author TC was supported by the UK National Institute for Health Research (NIHR), the Association of British Neurologists, and the Patrick Berthoud Charitable Trust.

### ACKNOWLEDGMENTS

The research leading to these results was conducted in the laboratory facilities of the Machine Learning Group at TU Berlin. The authors thank K.-R. Mueller, Machine Learning Group, B. Blankertz, BBCI Group and S. Weinzierl, Audiocommunication Group, TU Berlin, for providing the environment for this research to be carried out.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2016.00425

Breska, A., and Ivry, R. B. (2016). Taxonomies of timing: where does the cerebellum fit in? Curr. Opin. Behav. Sci. 8, 282–288. doi: 10.1016/j.cobeha.2016.02.034

Caravolas, M., Lervåg, A., Mousikou, P., Efrim, C., Litavsky, M., Onochie-Quintanilla, E., et al. (2012). Common patterns of prediction of literacy development in different alphabetic orthographies. Psychol. Sci. 23, 678–686. doi: 10.1177/0956797611434536


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Bekius, Cope and Grube. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# The Enhanced Musical Rhythmic Perception in Second Language Learners

#### M. Paula Roncaglia-Denissen<sup>1,2</sup>\*, Drikus A. Roor<sup>3</sup>, Ao Chen<sup>4</sup> and Makiko Sadakata<sup>1,3,5</sup>

<sup>1</sup> Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, Netherlands, <sup>2</sup> Amsterdam Brain and Cognition (ABC), University of Amsterdam, Amsterdam, Netherlands, <sup>3</sup> Musicology Department, University of Amsterdam, Amsterdam, Netherlands, <sup>4</sup> Utrecht Institute of Linguistics (Uil OTS), Utrecht University, Utrecht, Netherlands, <sup>5</sup> Donders Institute for Brain, Cognition and Behavior, Radboud University Nijmegen, Nijmegen, Netherlands

Previous research suggests that mastering languages with distinct rather than similar rhythmic properties enhances musical rhythmic perception. This study investigates whether learning a second language (L2) contributes to enhanced musical rhythmic perception in general, regardless of the rhythmic properties of the first and second languages. Additionally, we investigated whether this perceptual enhancement could be alternatively explained by exposure to musical rhythmic complexity, such as the use of compound meter in Turkish music. Finally, we investigated whether an enhancement of musical rhythmic perception could be observed among L2 learners whose first language relies heavily on pitch information, as is the case with tonal languages. Therefore, we tested Turkish, Dutch and Mandarin L2 learners of English and Turkish monolinguals on their musical rhythmic perception. Participants' phonological and working memory capacities, melodic aptitude, years of formal musical training and daily exposure to music were assessed to account for cultural and individual differences which could impact their rhythmic ability. Our results suggest that mastering an L2, rather than exposure to musical rhythmic complexity, could explain individuals' enhanced musical rhythmic perception. An even stronger enhancement of musical rhythmic perception was observed for L2 learners whose first and second languages differ regarding their rhythmic properties, as the enhanced performance of Turkish in comparison with Dutch L2 learners of English seems to suggest. Such a stronger enhancement of rhythmic perception seems to be found even among L2 learners whose first language relies heavily on pitch information, as the performance of Mandarin L2 learners of English indicates. Our findings provide further support for a cognitive transfer between the language and music domain.

Keywords: music rhythm, speech rhythm, second language

#### Edited by:

Andrea Ravignani, Vrije Universiteit Brussel, Belgium

#### Reviewed by:

L. Robert Slevc, University of Maryland, USA; Anjali Bhatara, Université Paris Descartes, France; Elena Pagliarini, Universitat Pompeu-Fabra, Spain

#### \*Correspondence:

M. Paula Roncaglia-Denissen m.p.roncaglia@uva.nl; mprdenissen@gmail.com

Received: 05 October 2015; Accepted: 27 May 2016; Published: 10 June 2016

#### Citation:

Roncaglia-Denissen MP, Roor DA, Chen A and Sadakata M (2016) The Enhanced Musical Rhythmic Perception in Second Language Learners. Front. Hum. Neurosci. 10:288. doi: 10.3389/fnhum.2016.00288

### INTRODUCTION

Language and music have many features in common which could suggest a common origin (Wallin et al., 2001; Mithen, 2005). Different views on their origin have proposed that either music might be a byproduct of language (Pinker, 1997), language could have originated from music (e.g., Darwin, 1872; Falk, 2004; Fitch, 2010) or language and music could have originated from a common cognitive domain (Brown, 2000). Despite these different views, investigating possible shared features of these domains might shed more light on what could be a human innate ability, indicate an evolutionary adaptation, or be a byproduct of the other domain. Thus, investigating a possible common feature in language and music might provide the necessary tools to better understand their origin and the evolution of these features in the cognitive landscape (Patel, 2006).

The use of a common mechanism in language and music has been suggested in syntactic processing (Patel, 1998, 2003a, 2008) as well as in melodic and rhythmic organization (Lerdahl and Jackendoff, 1983; Jackendoff, 1989). Evidence of such commonalities has been provided by studies reporting a transfer effect of expertise between these two domains. On the one hand, sensitivity to pitch processing in language appears to be transferred to the music domain (Deutsch et al., 2006, 2009; Elmer et al., 2011). On the other hand, melodic aptitude positively correlates with pronunciation skills in second language (L2; Milovanov et al., 2008) and phonological perception (Slevc and Miyake, 2006; Marques et al., 2007). Furthermore, it has been suggested that musical training improves first-language reading and syntactic skills (Jentschke and Koelsch, 2009; Moreno et al., 2009; Brod and Opitz, 2012; Tierney and Kraus, 2013), while musical rhythmic training, more specifically, improves reading impairments (Bhide et al., 2013).

In both language and music, rhythm is used to organize the sound stream, grouping acoustic events such as sounds and pauses into meaningful units, e.g., words and sentences in language, and phrase and motive in music. While linguistic rhythm helps speech comprehension (Roncaglia-Denissen et al., 2013b) and creates acoustically marked boundaries inside and between words (Patel, 2003b), musical rhythm generates temporal expectation at different hierarchical levels (Patel, 2006). Perhaps, in the same way that musical training may shape one's auditory perception (cf. Vuust et al., 2012), learning a second language (L2) could enhance the perception of rhythmic variation in language, such as sound duration and intensity, which are also present in music organization. Therefore, a rhythmic perceptual enhancement could be created in the music domain via language.

In previous research, Roncaglia-Denissen et al. (2013a) reported that mastering languages with different rhythmic properties, e.g., German and Turkish, helps to enhance individuals' musical rhythmic perception. The authors argued that the rhythmic differences between the two languages could account for this enhancement: German is a stress-timed language and uses the metric foot as its unit of speech organization, i.e., a combination of one stressed syllable with at least one unstressed syllable (Nespor and Vogel, 1986). Turkish, on the other hand, is considered a syllable-timed language and uses the syllable, regardless of stress, as its speech organization unit. In terms of word-level metrical stress, Turkish is by default word final (Inkelas and Orgun, 2003), while German is trochaic or word initial (Eisenberg, 1991)<sup>1</sup>.

The current research aims to further investigate the impact of learning an L2 on musical rhythmic perception. Therefore, we tested one group of Turkish monolinguals and three groups of L2 learners (Dutch, Mandarin and Turkish L2 learners of English) on their ability to discriminate rhythmic variation in music. If learning an L2 helps to enhance one's rhythmic perception, then all L2 learner groups should be better than the monolingual group at musical rhythmic perception.

Alternatively to the suggestion that L2 learning impacts musical rhythmic perception, it has been proposed that an enhanced musical rhythmic perception could result from the exposure to music complexity, such as the use of compound meter in Turkish music (Hannon et al., 2012). If this should be the case then all Turkish participants, both monolinguals and L2 learners of English, should be better at musical rhythmic perception than non-Turkish participants. No differences between Turkish monolinguals and Turkish L2 learners of English should be found.

By testing L2 learners of English from different native languages, the current research aims to investigate how rhythmic differences between first and second languages might affect one's rhythmic perception. Regarding their rhythmic properties, Dutch and English are considered both stress-timed languages with the metric preference for the trochee, i.e., a stressed syllable followed by an unstressed one (Pike, 1945; Jusczyk et al., 1993; Vroomen and de Gelder, 1995). Therefore, Dutch L2 learners of English should have one single set of rhythmic properties as a result of the full rhythmic overlap between these two languages.

Turkish, on the other hand, is a syllable-timed language (Inkelas and Orgun, 2003; Van Kampen et al., 2008). At the word level, Turkish has a preference for word-final stress, thus, Turkish L2 learners of English could show an enhanced musical rhythmic perception as a result of encoding distinct sets of rhythmic properties of their first and second languages. This enhanced musical rhythmic perception could be reflected by better performance in musical rhythmic discrimination than Dutch L2 learners of English.

In Mandarin, which is also considered a syllable-timed language (Goswami et al., 2010), the importance of tonal variation for its lexical system is well established (Leather, 1983; Moore, 1993; Shen, 1993; Lai and Sereno, 2007). Thus, we expect Mandarin L2 learners of English to perform better in detecting melodic variation than their L2 learner peers. However, at the word level, there is no consensus regarding the lexical stress preference in Mandarin (Shen, 1993; Duanmu, 1999; Zhang et al., 2008), and lexical stress is restricted to a small percentage of words. That is, while the initial syllable carries a canonical lexical tone, a non-initial syllable carries a neutral tone, perceptually weaker in comparison to the canonical tone. The remaining non-neutral tone words cannot be categorized as either trochaic or iambic (Chao, 1968; Moore, 1993; Zhang et al., 2008).

<sup>1</sup>The use of rhythmic categories, such as syllable-timing, stress-timing and mora-timing, has been challenged in some of the literature (e.g., Grabe and Low, 2002; Nolan and Asu, 2009), which suggests that a rhythmic continuum, instead of categories, would be a more adequate characterization of language rhythms. This would be the case because one language could present characteristics of more than one category.

Regarding the Mandarin compared to Turkish L2 learners of English, two hypotheses can be made. First, it could be that Mandarin L2 learners of English perform worse than Turkish L2 learners of English. This could be the case because the lexical stress system of Mandarin relies more on pitch than on the rhythmic information of sound duration and intensity, while the lexical stress system of English is mainly based on rhythmic features. Learning a feature in L2 that is not present in the first language might result in a negative transfer with a more effortful and less native-like outcome (Ullman, 2001, 2004). Therefore, the benefits from first and L2 rhythmic differences possible for Turkish L2 learners of English could be hindered in Mandarin natives.

Second, Mandarin L2 learners of English may perform comparably to Turkish L2 learners of English. This could be the case because both the establishment of a broader lexical stress system by Mandarin natives and the reconfiguration of the default stress position by the Turkish natives might require adjusting their rhythmic perception to accommodate it to their L2 as well. In this case, both groups could be sensitive to different rhythmic properties, such as sound duration and intensity, as a result of mastering two rhythmically distinct languages. Hence, both groups could show an enhanced perception of rhythmic variation and comparable performances in rhythmic discrimination.

Regardless of whether Mandarin L2 learners of English perceive musical rhythm comparably to or worse than Turkish L2 learners of English, the perception of musical rhythmic variation of Mandarin L2 learners of English should still be higher than that of Dutch L2 learners of English. This should be the case because, even though secondary, the rhythmic features of sound intensity and duration are still used to some extent for stress perception in tonal languages (Shen, 1993; Lai and Sereno, 2007).

### MATERIALS AND METHODS

### Participants

Sixty participants<sup>2</sup>, all non-musicians, were divided into four experimental groups, i.e., 15 Mandarin L2 learners of English (8 females, Mage = 25.06 years, SD = 1.98, mean age of L2 first exposure, MAoL2FE = 9.93 years, SD = 2.31), 15 Turkish L2 learners of English (8 females, Mage = 26.33 years, SD = 3.08, MAoL2FE = 10.13 years, SD = 4.34), 15 Dutch L2 learners of English (8 females, Mage = 25.53 years, SD = 4.64, MAoL2FE = 8.80 years, SD = 3.27) and 15 Turkish monolinguals<sup>3</sup> (8 females, Mage = 18.93 years, SD = 1.94). Participants reported having little formal musical training (M = 1.61 years, SD = 2.19) and were all university students or had recently graduated. None of the participants reported any neurological impairment or hearing deficit, and all had normal or corrected-to-normal vision. This study was approved by the ethics committees of the Faculty of Humanities of the University of Amsterdam, Utrecht University and the Middle East Technical University, in Ankara. All participants gave their written informed consent for data collection, use and publication.

### Materials

### Phonological and Working Memory Measures

To measure participants' phonological memory, i.e., the ability to store and recall novel sounds (cf., Baddeley et al., 1998), the Mottier test was used (Mottier, 1951). The Mottier test is a non-word repetition task composed of six sets of nonwords, ranging from two to six syllables each. All nonwords consisted of the constant syllabic structure of one consonant followed by one vowel, i.e., CV. For the Dutch participants, the stimulus material followed the Dutch phonetic rules and was spoken by a male native speaker. For the Turkish and Mandarin participants, the stimulus material was spoken by a female native speaker of each language according to the phonetic rules of Turkish and Mandarin, respectively.

Participants' working memory capacity was measured using the backward digit span, a cognitive task involving information storage and transformation (Oberauer et al., 2000; Süß et al., 2002). The version of the backward digit span used here was composed of 14 sets of two trials, ranging from two to eight numbers. In the Dutch version, numbers were spoken by a male native speaker, while in the Mandarin and Turkish versions they were spoken by a female native speaker.

### Melodic Aptitude Test

Melodic aptitude tests have been used in the literature as an indicator of musical aptitude (Seashore et al., 1960; Gordon, 1965, 1969; Wallentin et al., 2010; Roncaglia-Denissen et al., 2013a). Participants' melodic aptitude was assessed using the melodic subset of the musical ear test (MET; Wallentin et al., 2010). The melodic aptitude test consisted of 52 pairs of melodic phrases, presenting 3–8 tones. The melodies had the duration of one measure and were played at 100 bpm. Different trials (26 pairs) contained a pitch violation and in half of them (13 pairs) the pitch violation was also a violation in the pitch contour. Twenty-five trials consisted of non-diatonic tones, 20 were in major keys and seven in minor keys. The order in which these features occurred was randomized.

#### Rhythmic Aptitude Test

The rhythmic subset of the MET (Wallentin et al., 2010) was used as a measure of musical rhythmic aptitude. The rhythmic subset comprised 52 pairs of rhythmic phrases that were either identical or different from each other. Rhythmic phrases were recorded using wood blocks and were 4–11 beats long. All phrases had the duration of one measure and were played at 100 bpm. Trials consisting of two distinct rhythmic phrases differed only with regard to one beat. Rhythmic complexity was achieved by including even beat subdivisions in 31 trials and triplets in the remaining 21 trials. Thirty-seven trials began on the downbeat and the remaining 15 trials started after it. The order in which these features occurred was randomized.

<sup>2</sup>An a priori power analysis was conducted (GPOWER; Erdfelder et al., 1996) based on the large effect size (ω<sup>2</sup> = 0.32) of the relationship between musical rhythmic ability and language group reported by Roncaglia-Denissen et al. (2013a). This power analysis indicated that a sample size of 42 L2 learners (13 participants per group) would be sufficient to detect group differences with a power of 0.95 and an alpha of 0.05. To keep group sizes comparable, the sample size of monolingual participants was also set at 13 participants.

<sup>3</sup>The younger age of Turkish monolinguals in comparison with the other participants is due to the fact that they were in their first semester as university students at the time of data collection. The monolingual participants had already been accepted as university students, but were taking a basic English course to pass an English proficiency exam required to continue their education at the Middle East Technical University, in Ankara.

### Self-Reported Language Skills and History Questionnaire

Participants were given a self-reported language skills and history questionnaire. Self-reported language skills have been shown to correlate highly with objective measures of language skills (Marian et al., 2007) and were successfully used in previous research to assess individuals' language skills (e.g., Garbin et al., 2011; Roncaglia-Denissen et al., 2015). The language skills and history questionnaire in the current study is the same one used and published by Roncaglia-Denissen et al. (2013a) in a previous study. In this questionnaire, participants' listening, writing, reading and speaking skills in their first and second languages are assessed, together with their age of first exposure to each language, situations of acquisition, and current use. Based on the results of the assessment and on participants' own perception of their language preference, English was regarded as the L2 in all L2 learner groups, and no group differences were found in terms of age of L2 first exposure, p = 0.45.

### Music Background Questionnaire

Participants were given a music background questionnaire to assess information about their formal musical training (number of years) and daily exposure to music (hours). Formal musical training was assessed for each participant in terms of the number of years they attended music lessons to learn an instrument or to learn how to sing. Whether they learned one or multiple instruments during this period, or whether an instrument was learned simultaneously with singing lessons, was disregarded. The music background questionnaire is provided in the supplementary material.

### Procedures

Participants were tested individually in a quiet room<sup>4</sup>. The tests were administered in a different pseudo-randomized order for each participant and each individual session lasted approximately 40 min. For the rhythmic and melodic aptitude tests, participants performed two practice trials prior to each test, which could be repeated until the test at hand was fully understood. Practice trials were not presented again and were not part of the experimental items. At the end of the session, participants were given a self-reported language skills and history questionnaire and a music background questionnaire.

### Mottier Test, Backward Digit Span

In the Mottier test, participants heard non-words and were instructed to repeat each word as accurately as possible immediately after hearing it. Participants' responses were computed ad hoc by the experimenter. The test was terminated when participants failed to recall a minimum of four items correctly in the same set. Participants' scores were based on the total number of correctly recalled non-words, with a maximum score of 30 non-words.

In the backward digit span, participants listened to sequences of numbers while facing away from the computer. At the end of each trial, participants were instructed to repeat the numbers in the reversed order in which they were presented. The test was terminated when participants failed to correctly recall one trial of the same set. Participants' scores were given based on the total number of trials correctly recalled with a maximum of 14 trials.
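The termination rules described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' scoring procedure: the function name, the representation of responses as nested lists of booleans, the hypothetical response data, and the choice to count correct items in the terminating set are all assumptions. The stopping threshold is left as a parameter because the wording of the digit span criterion allows more than one reading.

```python
from typing import List


def score_with_stop_rule(sets: List[List[bool]], min_correct: int) -> int:
    """Count correct items set by set, stopping after the first set in
    which fewer than `min_correct` items were recalled correctly.

    Assumption: correct items in the terminating set still count.
    """
    total = 0
    for item_results in sets:
        n_correct = sum(item_results)
        total += n_correct
        if n_correct < min_correct:
            break  # termination criterion reached; later sets are skipped
    return total


# Hypothetical Mottier-style protocol with sets of five non-words each.
mottier_responses = [
    [True, True, True, True, True],
    [True, True, True, True, False],
    [True, False, True, False, False],  # only two correct: the test stops
    [True, True, True, True, True],     # would never be administered
]
mottier_score = score_with_stop_rule(mottier_responses, min_correct=4)
print(mottier_score)
```

For the Mottier test the threshold would be four correct items per set; for the backward digit span the same function could be called with whichever threshold matches the intended stopping criterion.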

### Melodic Aptitude Test

In the melodic aptitude test, participants were presented with the stimulus material via the computer. Mandarin L2 learners of English were given an answer sheet, on which they marked whether the trial just heard was composed of identical or non-identical melodic phrases. The remaining participants performed this test via the computer and their responses were collected by pressing the corresponding answer-key on a computer keyboard. The position of the correct-response key was counter-balanced across participants.

### Rhythmic Aptitude Test

Participants were presented with rhythmic pairs containing either identical or different rhythmic phrases. At the end of each trial, participants had to decide if the rhythmic phrases of the same trial were identical or not. Mandarin L2 learners of English used an answer sheet indicating if each heard trial was composed of two identical or two different rhythmic phrases. The remaining participants performed the experiment via the computer and their responses were collected by pressing the corresponding "yes-key", in cases of identical phrases, or the "no-key", in cases of non-identical phrases. The position of the correct-response key ("yes-key") was counter-balanced across participants.

### Statistical Analysis

For the purpose of the current work, only language skills involving the explicit (i.e., speaking and listening) or implicit (i.e., reading) use of rhythm (cf., Fodor, 2002; Kentner, 2012) were taken into account. In order to compare L2 learners' L2 listening, reading and speaking skills, three separate Kruskal-Wallis tests were computed, using each skill as dependent variable and group (Mandarin, Turkish and Dutch) as a between-subjects factor.

<sup>4</sup>Mandarin L2 learners of English were tested in Utrecht, Tilburg and in Amsterdam. Turkish L2 learners of English were tested in Amsterdam, Tilburg, Nijmegen and in Ankara. Dutch L2 learners of English were tested in Amsterdam, while Turkish monolinguals were tested in Ankara.

Participants were also compared in terms of their daily exposure to music (number of hours) and years of formal musical training by means of two Kruskal-Wallis tests using group as a between-subjects factor. No statistically significant differences across groups were found regarding participants' daily exposure to music and formal musical training, ps > 0.1, hence, these two variables were no longer pursued.
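For readers unfamiliar with the Kruskal-Wallis test, the following sketch computes its H statistic from ranked data using only the Python standard library. It illustrates the general technique and is not the authors' analysis script: the function names and the example values are invented, and the p-value (obtained from a chi-square distribution with k − 1 degrees of freedom, where k is the number of groups) is deliberately left to a statistics package.

```python
from typing import Dict, List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to N, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: Dict[str, List[float]]) -> float:
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = [x for scores in groups.values() for x in scores]
    n = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for scores in groups.values():
        rank_sum = sum(ranks[start:start + len(scores)])
        weighted += rank_sum ** 2 / len(scores)
        start += len(scores)
    h = 12.0 / (n * (n + 1)) * weighted - 3 * (n + 1)
    tie_counts: Dict[float, int] = {}
    for x in pooled:
        tie_counts[x] = tie_counts.get(x, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in tie_counts.values()) / (n ** 3 - n)
    if correction == 0:  # every value identical: no between-group information
        return 0.0
    return h / correction


# Invented example values, not data from the study.
example = {"group_a": [1, 2, 3], "group_b": [4, 5, 6], "group_c": [7, 8, 9]}
h_value = kruskal_wallis_h(example)
print(round(h_value, 3))
```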

Additionally, three analyses of variance (ANOVAs) were computed using participants' mean scores in the melodic aptitude test and in the two conducted cognitive tests as dependent variables and group as a between-subjects factor. Finally, participants' mean scores in the rhythmic aptitude test were entered in an analysis of covariance (ANCOVA) as a dependent variable with group (Mandarin, Turkish and Dutch late L2 learners of English and Turkish monolinguals) as a between-subjects factor. Participants' scores in each cognitive test, i.e., the Mottier test and the backward digit span, as well as their mean scores in the melodic aptitude test, were entered in the statistical model as covariates.
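As an illustration of the one-way ANOVA design described above, the sketch below computes the F statistic, its degrees of freedom and eta squared from raw group scores. It is a minimal standard-library example under stated assumptions: the function name and the example scores are hypothetical, and the ANCOVA (which additionally adjusts for covariates) is not covered. With four groups of 15 participants, the same formulas give degrees of freedom of 3 and 56.

```python
from typing import Dict, List, Tuple


def one_way_anova(groups: Dict[str, List[float]]) -> Tuple[float, int, int, float]:
    """Return (F, df_between, df_within, eta_squared) for a one-way design.

    Assumes at least two groups and non-zero within-group variance.
    """
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_between = 0.0
    ss_within = 0.0
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        ss_between += len(scores) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in scores)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, df_between, df_within, eta_squared


# Invented example scores, not data from the study.
example = {"group_a": [1, 2, 3], "group_b": [2, 3, 4], "group_c": [3, 4, 5]}
f_value, df_b, df_w, eta_sq = one_way_anova(example)
print(f_value, df_b, df_w, eta_sq)
```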

### RESULTS

### L2 Skills

Participants' self-reported L2 listening, speaking and reading skills are shown in **Table 1**.

Regarding participants' L2 (English) skills<sup>5</sup>, no statistical differences were found among groups for L2 reading and speaking skills, ps > 0.1, hence these two variables were not pursued further. A significant group difference was found for participants' L2 listening skills, χ<sup>2</sup> = 7.23, p = 0.02, r = 0.93. Pairwise comparisons of the group means using Bonferroni correction revealed a significant difference between Mandarin (M = 80.66%, SD = 10.99) and Dutch L2 learners of English (M = 91.33%, SD = 10.60), p < 0.016. No statistically significant difference was found between Mandarin (M = 80.66%, SD = 10.99) and Turkish (M = 88.00%, SD = 12.64) and between Turkish and Dutch L2 learners of English (M = 91.33%, SD = 10.60), p > 0.016. The mean comparisons of participants' self-reported L2 listening skill are illustrated in **Figure 1**.
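The threshold of 0.016 used for the pairwise comparisons is consistent with a Bonferroni correction over the three pairwise group comparisons. The worked step below assumes a family-wise alpha of 0.05, which is not stated explicitly for this analysis:

$$
\alpha_{\text{adj}} = \frac{\alpha}{m} = \frac{0.05}{3} \approx 0.0167
$$

Here $m$ is the number of pairwise comparisons (Mandarin vs. Dutch, Mandarin vs. Turkish, Turkish vs. Dutch).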

### Mottier Test, Backward Digit Span, Melodic and Rhythmic Aptitude Tests

Participants' scores in the Mottier test, backward digit span, melodic and rhythmic aptitude tests, years of formal musical training participants received and daily exposure to music (hours) are depicted in **Table 2**.

#### Mottier Test and Backward Digit Span

Results revealed no group differences in participants' scores in the Mottier test, p = 0.13. Analysis of participants' backward digit span score showed a significant group difference, F(3,56) = 6.02, p = 0.001, η<sup>2</sup> = 0.24. Post hoc Bonferroni comparison of groups' mean scores revealed that Mandarin L2 learners outperformed the other groups. No further group differences were found.

<sup>5</sup>In addition to the self-reported L2 language (English), Turkish participants reported having very basic knowledge of Chinese (1 participant), German (5 participants), French (2 participants), Japanese (1 participant), Finnish (1 participant), intermediate knowledge of French and Dutch (1 participant) and advanced knowledge of Dutch (1 participant). Chinese participants reported having very basic knowledge of Korean (2 participants), Japanese (4 participants), Russian (1 participant), French (1 participant) and Dutch (5 participants). Dutch participants reported having very basic knowledge of French (4 participants), Spanish (3 participants), Czech (1 participant) and an intermediate knowledge of German (10 participants).

TABLE 2 | Participants' scores in the Mottier test, backward digit span, melodic and rhythmic aptitude tests, formal musical training and daily exposure to music.

### Melodic Aptitude Test

The analysis of participants' mean scores in the melodic aptitude test revealed a significant effect of group, F(3,56) = 15.47, p < 0.001, η <sup>2</sup> = 0.12. Post hoc comparisons using Bonferroni test revealed a significant difference between monolinguals and L2 groups, with worse performance found for monolinguals (M = 53.97%, SD = 9.89) than L2 learners (M = 72.60%, SD = 10.19). A marginally significant group difference was found for L2 learners groups, p = 0.06, η <sup>2</sup> = 0.05. Planned comparisons of groups' mean scores using Bonferroni correction for multiple comparisons revealed higher melodic mean scores for Mandarin (M = 76.66%, SD = 6.24) than for Dutch L2 learners of English (M = 68.07%, SD = 12.54). No statistically significant difference was found between Mandarin and Turkish L2 learners (M = 73.07%, SD = 9.50) and between Turkish and Dutch L2 learners of English. Comparisons of participants' accuracy rates in the melodic aptitude test are depicted in **Figure 2**.

### Rhythmic Aptitude Test

For participants' rhythmic aptitude test, the conducted ANCOVA revealed a significant effect of group, F(5,54) = 16.31, p < 0.001, η <sup>2</sup> = 0.39. A post hoc pairwise comparison of participants' mean scores using a Bonferroni test revealed a significant difference between the Mandarin (M = 75.64%, SD = 6.15) and the Dutch L2 learners of English (M = 66.15%, SD = 8.77); and between the Mandarin L2 learners and the Turkish monolinguals (M = 54.35%, SD = 10.31). Similarly, a statistically significant difference in mean scores was found between the Turkish (M = 73.97%, SD = 7.11) and the Dutch L2 learners of English (M = 66.15%, SD = 8.77) and between Turkish L2 learners and monolinguals (M = 54.35%, SD = 10.31). A significant group difference in mean scores was also encountered when comparing Turkish monolinguals with Dutch L2 learners of English. Hence, Turkish monolinguals performed worse than all the other groups. No statistically significant difference was found between Mandarin and Turkish L2 learners groups. Comparisons of participants' accuracy rates in the rhythmic aptitude test are depicted in **Figure 3**.

To investigate whether rhythmic performance in the three L2 learner groups could be affected by the difference in their L2 listening skill, an additional ANCOVA was computed adding L2 listening skill to the other covariates, i.e., participants' scores in the Mottier test, backward digit span and melodic aptitude test. Results revealed that L2 listening skill does not contribute significantly to participants' rhythmic performance, F(8,36) = 0.69, p = 0.42. Therefore, this variable was not pursued further.

### DISCUSSION

The current research investigated whether and how the learning of an L2 could contribute to individuals' musical rhythmic perception. Turkish monolinguals and three groups of L2 learners, namely, Mandarin, Turkish and Dutch L2 learners of English were tested on their rhythmic perception in music. Additionally, Turkish monolinguals were tested to account for the possibility that the exposure to musical rhythmic complexity could explain a possible enhancement in individuals' musical rhythmic perception. To account for individual differences in cognitive ability and musical aptitude, which could have influenced participants' rhythmic perception, participants' working memory, phonological memory and melodic aptitude were assessed, and used as covariates.

Our results showed L2 learning to be more salient to musical rhythmic perception than exposure to musical rhythm complexity, since Turkish monolinguals demonstrated worse performance than all the other groups, including Turkish L2 learners of English. Additionally, monolinguals performed worse than L2 learners in the melodic aptitude test, even though no group differences were found between monolinguals and their L2 learner peers regarding their formal musical training, daily exposure to music and phonological memory. Interestingly enough, the only group difference found with respect to the cognitive measures collected here concerned the higher working memory scores of Mandarin L2 learners of English in comparison with the other groups. According to previous research, differences in verbal working memory could be due to cultural differences (cf., Hedden et al., 2002). Thus, the use of non-verbal working memory measures in future cross-cultural studies could be an option if one wishes to avoid such differences.

The worse performances of Turkish monolinguals in musical rhythm and melody perception in comparison with the three L2 learner groups could indicate that learning an L2 might enhance the overall perception of acoustic variation, such as the variation in sound duration, intensity and pitch. Just as musical training may shape one's auditory skills (Vuust et al., 2012), learning an L2 could promote a similar effect. Furthermore, the enhanced melodic perception of Mandarin in comparison with Dutch L2 learners of English corroborates previous findings in the literature (Wong et al., 2012; Bidelman et al., 2013) that report enhanced musical pitch perception in native speakers of tonal in comparison with those of non-tonal languages. The lack of group difference between the melodic performance of Mandarin and Turkish L2 learners could be due to the size of the effect which, despite a visible trend, failed to reach significance. Therefore, future research further contrasting the melodic aptitude of tonal and non-tonal L2 learners should be carried out.

Regarding L2 learners' rhythmic performance, our findings indicate that an enhanced rhythmic perception is found for L2 learners whose first and second languages diverge in their rhythmic characteristics, as is the case for Mandarin and Turkish L2 learners of English. Perhaps Dutch L2 learners of English could be worse at musical rhythmic perception than Turkish and Mandarin L2 learners of English due to the full overlap of rhythmic properties between Dutch and English.

The comparable performances of Mandarin and Turkish L2 learners of English could indicate that the processes of reconfiguring stress position (from word-final to word-initial position) and learning a new lexical stress system could enhance one's rhythmic perception. This could be the case because having to learn rhythmic features in L2 that are different from the native language, such as sound duration and intensity, could make one more aware of variations in these rhythmic features in language and musical perception.

The observed enhanced rhythmic perception could represent another cognitive advantage of bilinguals, similar to verbal and non-verbal intelligence (Peal and Lambert, 1962), problem solving skills (Bialystok, 1999; Bialystok and Shapero, 2005), and phonological memory (Service, 1992; Cheung, 1996). An enhanced rhythmic perception could be decisive for more successful language encoding (Sundara and Scutellaro, 2011), language recognition and a more effective selection of the target language (cf., Roncaglia-Denissen et al., 2013a).

Rhythmic information is not only relevant for language, but also for music organization. Thus, a perceptual auditory enhancement in language could be transferred and used in the music domain, as our results seem to suggest. Evidence of cognitive transfer between the language and the music domains has been reported by quite a few studies. On the one hand, the use of linguistic pitch variation by tonal native speakers enhances their perception of musical pitch variation (Deutsch et al., 2006; Elmer et al., 2011), and on the other hand, musical training improves the perception of linguistic pitch variation (Slevc and Miyake, 2006; Marques et al., 2007; Milovanov et al., 2008). Regarding rhythmic skills, it has been shown that effects of rhythmic training can be transferred to the language domain (Bhide et al., 2013), and timing sensitivity in language may be predicted by musical aptitude (Milovanov et al., 2009; Marie et al., 2011; Sadakata and Sekiyama, 2011). The present study is in line with previous research, suggesting that learning languages with distinct rhythmic properties enhances individuals' perception of rhythmic variation in music (Roncaglia-Denissen et al., 2013a; Bhatara et al., 2015). The existence of a bi-directional transfer effect between the language and the music domain strongly suggests the existence of shared mechanisms and cognitive resources between them (Patel, 2008, 2014).

In the face of the reported results one may argue that, together with learning an L2, other unmeasured cultural variables could be contributing to our findings. If this should be the case, future investigations should address this matter. Additionally, a few concrete questions remain, such as which rhythmic features of speech might contribute to the enhancement of individuals' musical rhythmic perception. The processing of timing features in music has been described as having different levels, from the encoding of short timing spans (Repp, 2005) to an overall rhythmic pattern analysis of longer sound sequences (Zanto et al., 2006).

In speech, timing information provides important cues at different levels as well. At the phonological level, it helps to distinguish vowels, e.g., in Dutch (Booij, 1999), and consonants, e.g., in Japanese (Han, 1992; Sadakata and McQueen, 2013; Kawahara, 2015). At the word level, timing information manifests itself as word metric preference, e.g., the trochee or the iamb (Hayes, 1985), while beyond the word level, it helps to organize the speech flow (Grabe and Low, 2002; Roncaglia-Denissen et al., 2013b). The sensitivity to such timing cues depends on one's mastered languages (Kingston et al., 2009; Sadakata and Sekiyama, 2011; Roncaglia-Denissen et al., 2015).

Perhaps mastering languages with distinct word metric preference, e.g., word-initial vs. word-final stress, may be enough to enhance rhythmic sensitivity in music. Alternatively, perhaps being sensitive to broader features such as speech organization units, e.g., the metric foot or the syllable, as for our Mandarin learners of English, may be enough to enhance rhythmic sensitivity. It could also be that the interplay between the word and speech levels, rather than their respective impact alone, accounts for such a rhythmic enhancement.

To disentangle which mechanisms are playing a central role in enhancing individuals' rhythmic perception, be it word metric preference, speech organization, or both, one could extend the current approach to other language pairs that diverge in their rhythmic features. For instance, a language pair consisting of a syllable-timed language and a stress-timed language that share the word metric preference for the trochee, as is the case for Spanish and English, respectively (Pike, 1945; Jusczyk et al., 1993; Sebastian and Costa, 1997; Schmidt-Kassow et al., 2011), would be a good candidate for investigation. Additionally, two syllable-timed languages with different metric preference, such as Turkish (with the preference for word-final stress) and Spanish (a word-initial language) would also prove an interesting investigation. Future research addressing this matter will help us truly understand which relevant features for rhythmic perception in language could be also relevant and could be used in musical rhythmic perception. With this knowledge, one could gain a better understanding of what the music and language domains might share, and be one step closer to grasping what makes these two domains so unique and particular to humans.

### AUTHOR CONTRIBUTIONS

MPRD, DAR, AC and MS contributed to the design of the experiment, to the recruitment of participants and data collection. MPRD was responsible for data analysis. MPRD, DAR, AC and MS contributed to the writing of the manuscript. MPRD was the lead author. DAR, AC and MS contributed to different sections of the manuscript and reviewed drafts of it.

### ACKNOWLEDGMENTS

MPRD is supported by a Horizon grant of the Netherlands Organization for Scientific Research (NWO). We are grateful to Annet Hogenberger and the Cognitive Science program at the Informatics Institute at Middle East Technical University (METU) for allowing us to carry out the current experiment in her lab, to Gözde Nasuhbeyoğlu for helping us with the translations, and to Kamuran Özlem Üzer from the Basic English Department at the METU for helping us to recruit Turkish monolingual students. We are also grateful to Özge Yüzgeç and Ezgi Kayhan for their support in translating our stimulus material and to Eleanor Harding for her thoughtful comments on this manuscript.

### SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: http://journal.frontiersin.org/article/10.3389/fnhum.2016.00288/abstract




**Conflict of Interest Statement**: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Roncaglia-Denissen, Roor, Chen and Sadakata. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution and reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

# Preliminary Experiments on Human Sensitivity to Rhythmic Structure in a Grammar with Recursive Self-Similarity

#### Andreea Geambașu<sup>1,2</sup>\*<sup>†</sup>, Andrea Ravignani<sup>3,4</sup>\*<sup>†</sup> and Clara C. Levelt<sup>1,2</sup>

*<sup>1</sup> Leiden University Centre for Linguistics, Leiden University, Leiden, Netherlands, <sup>2</sup> Leiden Institute for Brain and Cognition, Leiden University, Leiden, Netherlands, <sup>3</sup> Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium, <sup>4</sup> Department of Cognitive Biology, Faculty of Life Sciences, University of Vienna, Vienna, Austria*

Keywords: statistical learning, rhythm, recursion, artificial grammar learning, rhythm perception, Lindenmayer system, L-system, Fibonacci grammar

Edited by: *Huan Luo, Peking University, China*

#### Reviewed by:

*Nai Ding, Zhejiang University, China; Liping Wang, Institute of Neuroscience, China*

#### \*Correspondence:

*Andreea Geambașu, a.geambasu@hum.leidenuniv.nl; Andrea Ravignani, andrea.ravignani@gmail.com*

*† Authors share first authorship.*

#### Specialty section:

*This article was submitted to Auditory Cognitive Neuroscience, a section of the journal Frontiers in Neuroscience*

Received: *07 April 2016* Accepted: *07 June 2016* Published: *28 June 2016*

#### Citation:

*Geambașu A, Ravignani A and Levelt CC (2016) Preliminary Experiments on Human Sensitivity to Rhythmic Structure in a Grammar with Recursive Self-Similarity. Front. Neurosci. 10:281. doi: 10.3389/fnins.2016.00281*

### OVERVIEW

We present the first rhythm detection experiment using a Lindenmayer grammar, a self-similar recursive grammar shown previously to be learnable by adults using speech stimuli. Results show that learners were unable to correctly accept or reject grammatical and ungrammatical strings at the group level, although five (of 40) participants were able to do so with detailed instructions before the exposure phase.

### INTRODUCTION

Processing of hierarchical structures has been proposed as a uniquely human ability, a hallmark of the linguistic system that distinguishes human language from animal communication systems (Hauser et al., 2002; Martins, 2012). Recursion is often considered the pinnacle of human-specific hierarchical structures (Hauser et al., 2002). Artificial Grammar Learning experiments have shown that adult participants are able to learn the context-free grammar A<sup>n</sup>B<sup>n</sup>, whose generation requires hierarchical rules, even without the need for semantic information (Lai and Poletiek, 2013). Parsing and generalizing grammars like A<sup>n</sup>B<sup>n</sup> requires detection that a structure, e.g., AB, is embedded between elements of another structure, e.g., A...B. Other species have not been shown unequivocally to be able to learn on the basis of the center-embedding principle required of A<sup>n</sup>B<sup>n</sup> (rather than using other strategies, Corballis, 2007; van Heijningen et al., 2009; Beckers et al., 2012; Poletiek et al., 2015; Ravignani et al., 2015), which is taken as evidence that processing of recursion is a human-specific capacity.
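To make the center-embedding idea concrete, the following sketch contrasts two ways of accepting strings of the form A<sup>n</sup>B<sup>n</sup>: one that recursively strips an outer A...B pair, and one that only counts and matches symbols. Both are illustrative assumptions rather than models proposed in the cited studies; the function names are hypothetical, and the letters "A" and "B" stand in for the two stimulus classes.

```python
def is_anbn_recursive(s: str) -> bool:
    """Accept A^n B^n (n >= 1) by peeling off an outer A...B pair and
    checking that what remains is itself a well-formed A^n B^n string."""
    if s == "AB":
        return True
    if len(s) >= 4 and s[0] == "A" and s[-1] == "B":
        return is_anbn_recursive(s[1:-1])
    return False


def is_anbn_counting(s: str) -> bool:
    """Accept the same strings without any embedding: check that the first
    half is all 'A' and the second half is an equally long run of 'B'."""
    n = len(s) // 2
    return n >= 1 and s == "A" * n + "B" * n


for candidate in ["AB", "AABB", "AAABBB", "ABAB", "AAB", ""]:
    print(repr(candidate), is_anbn_recursive(candidate), is_anbn_counting(candidate))
```

The two functions accept exactly the same strings, which is why success on such a grammar does not by itself show which strategy a learner used.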

Yet to what extent learning of an A<sup>n</sup>B<sup>n</sup> grammar can be taken as evidence for processing recursive information at all is debated. Some researchers argue that human participants could in fact use simpler strategies, such as counting and matching the number of As and Bs in a test sequence (Hochmann et al., 2008; Zimmerer et al., 2011), while others argue that despite different strategies, the same core operations are nonetheless necessary (Fitch and Friederici, 2012; Fitch, 2014). Saddy (2009) proposed that a more suitable grammar for the investigation of recursive processing may be Lindenmayer grammars, or L-systems. Uriagereka et al. (2013) have proposed that these grammars are suitable for between-species comparative work because they generate utterances that can be infinitely long and produce a "rhythm" when recognized. L-systems were first proposed by Lindenmayer to describe algae cell growth (Lindenmayer, 1968; Lindenmayer and Rozenberg, 1972) and have since been used to describe and recognize different plant structures (Samal et al., 1994). L-systems have rewrite rules that occur in parallel and have no terminal symbol, indicating that they can produce infinite sequences (**Figure 1A**). Because of their hierarchical structure and recursive properties, they are an interesting grammar to use in testing recursive processing. In her dissertation, Shirley (2014) began to explore the learnability of Fibonacci grammars, a subgroup of L-systems, that at each iteration produce sequences with lengths corresponding to Fibonacci numbers. She found that after a 3-min training with a Fibonacci grammar composed of syllables bi and ba, participants were able to correctly accept grammatical 10-s-long structures, and correctly reject ungrammatical ones. However, how participants processed the stimuli in Shirley's task is not clear yet. A possible rhythm-based strategy may have been used by participants to recognize a pattern in sounds generated by recursive branching, using rhythmic structure, i.e., how durational events are grouped and perceived hierarchically based on their relative accentuation. When presented with sequences of acoustic events occurring at constant time intervals (i.e., isochronous, as in Shirley, 2014), humans tend to group these events. Grouping often occurs when events are differentially accented, that is, marked by differing pitch or intensity (e.g., strong-weak-weak, Hay and Diehl, 2007).
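The parallel rewriting that defines an L-system, and the Fibonacci lengths it yields, can be illustrated with a short sketch. The rewrite rules used here (A → AB, B → A) are the textbook Fibonacci L-system and are an assumption: the rules, starting symbol and symbol-to-sound mapping used in the studies cited above may differ. The names `RULES`, `rewrite` and `generations` are likewise hypothetical.

```python
from typing import Dict, List

# Assumed rewrite rules (textbook Fibonacci L-system); the grammars used in
# the cited studies may use different rules or symbols.
RULES: Dict[str, str] = {"A": "AB", "B": "A"}


def rewrite(sequence: str, rules: Dict[str, str]) -> str:
    """Apply the rules in parallel: every symbol is replaced in one step."""
    return "".join(rules[symbol] for symbol in sequence)


def generations(axiom: str, rules: Dict[str, str], n: int) -> List[str]:
    """Return the axiom followed by the first n rewritten generations."""
    result = [axiom]
    for _ in range(n):
        result.append(rewrite(result[-1], rules))
    return result


gens = generations("A", RULES, 6)
for g in gens:
    print(len(g), g)
```

Because no symbol is terminal, the rewriting can continue indefinitely. Under these assumed rules, each generation is the concatenation of the two preceding ones, which is why the sequence lengths follow the Fibonacci numbers and why every generation reappears as the beginning of the next.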

The detection of a specific rhythmic pattern might be the mechanism participants draw upon to detect recursive structures such as those tested here. Syllables in Shirley (2014) differed by their vowel quality, with possibly some non-systematic variation in fundamental frequency and intensity. If detection strategies based on rhythmic features were used to learn Shirley's grammars, participants tested with percussion sounds (enhancing the recursive rhythmical structure of the stimuli) instead of speech syllables should show similarly high or even better performance, as the non-temporal rhythmic cues (intensity or pitch accentuation) would be enhanced, while violations in interstimulus intervals would disrupt the rhythmic detection strategy and hence grammar recognition (Shirley, 2014).

Can a complex pattern, recursively and hierarchically organized according to an L-system, be learned on the basis of a rhythmical strategy? We tested this hypothesis by enhancing the rhythmic quality of the sequences by using drum sounds differing in pitch and intensity, instead of syllables. This work thus constitutes the first study on rhythm perception using L-systems<sup>1</sup>. We conducted two experiments (**Figure 1D**), each with two conditions (two types of foil grammars) to evaluate the learnability of the L-system grammars. Between our two experiments, we also varied instructions, to further explore whether the method of presenting the exposure stimuli had an effect on learning ability. Based on previous work by Saddy (2009) and Shirley (2014) we expected that participants would pick up on the rhythmic nature of the structures, and be able to discriminate grammatical from ungrammatical strings. Our results indicate that for the majority of our participants, rhythm alone may not be enough to learn this type of grammar; musical background, age, instruction, and the specific types of foil grammars may all be contributing factors.

### METHODS AND MATERIALS

Two experiments were conducted, using Fibonacci grammars similar to those used in Saddy (2009) and Shirley (2014). The experiments consisted of an exposure phase and a test phase. During the exposure phase, participants passively listened to a sequence of kick and snare drum sounds following a Fibonacci grammar. During the subsequent test phase, participants were asked to indicate whether the test item (composed of the same kick and snare sounds) corresponded to the grammar from the listening phase, and to rate their certainty. The two experiments (Experiments 1 and 2) differed only in the detail of instruction given to participants. Instructions in Experiment 2 were more detailed than those in Experiment 1 (see Procedure). Each of the experiments consisted of two conditions (Mirror and Swap), in which each of the ungrammatical test items differed from the target Fibonacci grammar in different ways (see Stimuli).

### Participants

Forty students (nine males; age range 18–32, M = 22, SD = 3.05) from Leiden University participated, N = 20 in Experiment 1 and N = 20 in Experiment 2. Participants were recruited via the SONA participant recruitment website of Leiden University. None of the participants had hearing problems or were dyslexic. Participants had various linguistic backgrounds, with all participants speaking at least one foreign language. They also had varying degrees of musical experience. The study was approved by the Ethical Committee of the Faculty of Social Sciences at Leiden University. Participants signed an informed consent form before taking part and were fully debriefed on the intention of the study upon completion of the experiment. They received course credits or monetary compensation for participating.

### Stimuli

The Fibonacci sequences were made of simple drum sounds: a kick (average intensity 78 dB; sound X) and a snare (average intensity 66 dB; sound Y), each 200 ms in duration. See **Figure 1B** for the Fibonacci grammar's rewrite rules.

An exposure string was created using a series of custom-written Python scripts which created a large iteration of the Fibonacci grammar (n = 23, resulting in a 75025-element-long string). From this initial sequence, a 900-element (3-min-long) sequence was extracted and used for the habituation phase. Grammatical test items (50 elements, 10 s long) were extracted from the remaining sequence such that each grammatical string was unique.
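
The generation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' scripts: the rewrite rules X → XY and Y → X with axiom X are assumed here because they reproduce the iterations quoted in the text (e.g., XYXXY at the third iteration), and the function names, the choice to take the exposure string from the start of the sequence, the non-overlapping windowing, and the requested number of test items are all hypothetical.

```python
RULES = {"X": "XY", "Y": "X"}  # assumed rewrite rules (cf. Figure 1B)
ELEMENT_MS = 200               # duration of each drum sound, from the text


def fibonacci_string(n, axiom="X"):
    """Rewrite every symbol in parallel, n times, starting from the axiom."""
    s = axiom
    for _ in range(n):
        s = "".join(RULES[ch] for ch in s)
    return s


def unique_windows(s, size, count):
    """Collect up to `count` distinct non-overlapping windows of length `size`."""
    seen = []
    for start in range(0, len(s) - size + 1, size):
        window = s[start:start + size]
        if window not in seen:
            seen.append(window)
        if len(seen) == count:
            break
    return seen


full = fibonacci_string(23)   # 75025 elements
exposure = full[:900]         # assumed: exposure taken from the start
remainder = full[900:]
# Hypothetical count of grammatical items; the text does not state it.
test_items = unique_windows(remainder, 50, 18)

exposure_seconds = len(exposure) * ELEMENT_MS / 1000  # 900 x 0.2 s = 180 s
```

At 200 ms per element, the 900-element exposure string lasts 180 s (3 min) and each 50-element test item lasts 10 s, matching the durations reported in the text.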

<sup>1</sup> See Martins et al. (2014) for a musical recursion experiment in the melodic domain.

FIGURE 1 | A derivation of the target Fibonacci grammar at the first four iterations and at the final 23rd iteration used to generate the exposure and test stimuli (A), the rewrite rules of the grammar (B), the makeup of the two foil grammars (C), and an overview of the two experiments reported with their two respective foil test conditions (D). We use upward and downward note stems to differentiate between the two drum sounds.

Two modifications of the Fibonacci sequences were used as foil grammars. The first will be referred to as a Swap sequence. A Swap sequence consisted of a sequence taken from the remainder of the initial 75025-long sequence, in which a randomly-selected X and an adjacent Y from the string were switched, subject to the constraints that the swap would (i) produce a different string and (ii) not introduce an easily-detectable YY bigram (**Figure 1C**). For example, if the Fibonacci iteration n<sup>3</sup> is X**YX**XY (see **Figure 1A**), its corresponding Swap sequence may be X**XY**XY. The second foil sequence will be referred to as a Mirror sequence. A Mirror sequence consisted of the Fibonacci sequence that was cut in half; this first half of the sequence was mirrored and replaced the original second half (**Figure 1C**). For example, if the Fibonacci iteration n<sup>5</sup> is XYXXYX**Y**XXYXXY (**Figure 1A**), its corresponding Mirror sequence would be: XYXXYX**Y**XYXXYX, where the seventh element (Y, bold) is treated as the point of mirroring. In order to avoid introducing more than two repetitions of the X element, or more than one repetition of the Y element, the point of mirroring varied by sequence, and thus mirror sequences could be either 50 or 51 elements long.
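
The two foil manipulations can be illustrated with a short sketch. It is an assumption-laden reading of the description above rather than the authors' implementation: all function names are hypothetical, the run constraint is interpreted as "no XXX and no YY", and the Mirror function shown here always keeps a pivot element (so it yields odd-length strings only; how the 50-element variant was built is not specified in the text).

```python
import random


def swap_at(seq, i):
    """Exchange the elements at positions i and i + 1."""
    return seq[:i] + seq[i + 1] + seq[i] + seq[i + 2:]


def valid_swap_positions(seq):
    """Positions where swapping adjacent, different symbols creates no YY."""
    positions = []
    for i in range(len(seq) - 1):
        if seq[i] == seq[i + 1]:
            continue  # swapping identical symbols would not change the string
        if "YY" not in swap_at(seq, i):
            positions.append(i)
    return positions


def make_swap_foil(seq, rng=random):
    """Swap one randomly chosen admissible X/Y pair."""
    positions = valid_swap_positions(seq)
    if not positions:
        raise ValueError("no admissible swap position in this sequence")
    return swap_at(seq, rng.choice(positions))


def make_mirror_foil(seq, pivot):
    """Keep seq[:pivot] and the pivot element, then append the reversed first half."""
    first_half = seq[:pivot]
    return first_half + seq[pivot] + first_half[::-1]


def has_illegal_run(seq):
    """Assumed reading of the run constraint: no XXX and no YY."""
    return "XXX" in seq or "YY" in seq


swap_example = swap_at("XYXXY", 1)                       # "XXYXY", as in the text
mirror_example = make_mirror_foil("XYXXYXYXXYXXY", 6)    # "XYXXYXYXYXXYX"
```

In practice the pivot would be varied per sequence until `has_illegal_run` returns `False` for the resulting foil, which corresponds to the varying point of mirroring mentioned above.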

The composition of the foil grammars ensured that they never occurred in the habituation sequence, nor could they have ever occurred in any shorter iterations of the Fibonacci grammar. They also ensured that the grammatical and ungrammatical items were as similar as possible with respect to their local (element adjacency) and global (distribution of Xs and Ys) properties, thus preventing participants from solving the task by using simpler methods such as counting.
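
One way to check such local and global properties is sketched below. The helper names are hypothetical, the rewrite rules X → XY and Y → X are assumed (they reproduce the iterations quoted in the text), and a 15th-iteration string is used as an arbitrary reference; the authors' actual checks are not described in detail.

```python
from collections import Counter

RULES = {"X": "XY", "Y": "X"}  # assumed rewrite rules (cf. Figure 1B)


def fibonacci_string(n, axiom="X"):
    """Rewrite every symbol in parallel, n times, starting from the axiom."""
    s = axiom
    for _ in range(n):
        s = "".join(RULES[ch] for ch in s)
    return s


def bigrams(seq):
    """Set of adjacent element pairs (local property)."""
    return {seq[i:i + 2] for i in range(len(seq) - 1)}


def compare(item, foil, reference):
    """Compare a grammatical item and a foil against a reference string."""
    return {
        "same_symbol_counts": Counter(item) == Counter(foil),  # global property
        "new_bigrams": bigrams(foil) - bigrams(reference),     # local property
        "foil_in_reference": foil in reference,                # substring check
    }


reference = fibonacci_string(15)
report = compare("XYXXYXYXXYXXY", "XYXXYXYXYXXYX", reference)
```

For the 13-element Mirror example given in the text, this check reports identical X and Y counts, no bigram that is absent from the reference, and that the foil as a whole does not occur in the reference string.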

### Materials

The experiments were conducted on a computer running Windows 7, with a 17-inch monitor (refresh rate: 60 Hz; resolution: 1280×1024 pixels). Participants sat ∼50 cm from the screen in a quiet room and listened to the stimuli via headphones (Sennheiser HD 201). The experiment was programmed and run in Praat (Boersma and Weenink, 2014) and participant responses were registered via mouse clicks.

### Procedure

In Experiment 1, participants were first presented with the following instruction: "You will now hear a 3-min-long rhythmic sequence. Listen carefully. When the sounds stop, press the spacebar to proceed to the test phase." Participants in Experiment 2 were presented with more specific instructions: "You will now hear a 3-min long rhythmic pattern. Listen carefully. You will have to distinguish between this pattern and another pattern in the test phase. When the sounds stop, press the spacebar to proceed to the test phase."

Within each experiment (**Figure 1D**), an equal number of participants was randomly assigned to the Mirror condition or the Swap condition (n = 10 per condition per experiment). In both conditions, the participants listened to the same L-system exposure sequence for 3 min. During the exposure phase the display was gray and showed a black fixation cross. After the exposure phase, the testing phase began. Participants were then presented with the following instructions: "The test phase will now begin. You will hear 36 test sounds. For every sound, listen carefully and indicate whether it follows the same rhythm as during the listening phase. Rate your certainty on a scale of 1 to 5. 1 = definitely no; 2 = probably no; 3 = not sure; 4 = probably yes; 5 = definitely yes. Only answer when the sound has finished playing." During the test phase, participants in both the Mirror and the Swap condition were tested on their ability to discriminate between 10-s-long grammatical L-system sequences and ungrammatical sequences (Mirror or Swap sequences, depending on condition). In both conditions, they were instructed to indicate whether the sequences they heard followed the same rhythm as the sequences they had heard during the listening phase. The instructions appearing on the screen during playback of each test item were as follows: "Does this sound follow the same rhythm as in the listening phase? How sure are you?" Participants could then answer by clicking on one of two boxes with the words YES or NO. For their sureness response, they clicked on one of five boxes with numerals 1 (definitely no) through 5 (definitely yes).

Upon completion of the experiment, participants filled in a questionnaire, which inquired about their sex, age, hearing, dyslexia, languages spoken, handedness, musical training, and education level and background. They were subsequently debriefed on the purpose of the study and any questions they had were answered.

### DESCRIPTIVE STATISTICS AND RESULTS

There are two types of correct answers, namely a correct acceptance of a grammatical L-system sequence, and a correct rejection of an ungrammatical foil sequence. Thus, we analyzed correct responses both overall and comparing acceptances and rejections.

At the group level, when pooling across participants, the number of correct responses was at chance for each of the four groups (one-sample t-test, all t < 1.8, all p > 0.12). For each experiment and in each condition, performance did not differ between correct acceptances of grammatical and correct rejections of ungrammatical stimuli (paired-samples t-test, all |t| < 1.7, all p > 0.13); reaction times also did not differ (all t < 0.71, all p > 0.49).
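
The logic of the group-level comparison against chance can be sketched as follows. The scores are invented for illustration, the chance level of 18 out of 36 is an assumption for a two-alternative judgment, and the function name is hypothetical; only the t statistic is computed, since the p-value requires the t distribution.

```python
from math import sqrt
from statistics import mean, stdev

N_ITEMS = 36          # test sounds per participant, from the instructions
CHANCE = N_ITEMS / 2  # assumed chance level for a yes/no judgment


def one_sample_t(scores, mu):
    """t statistic for H0: population mean equals mu (df = len(scores) - 1)."""
    n = len(scores)
    return (mean(scores) - mu) / (stdev(scores) / sqrt(n))


# Hypothetical numbers of correct responses for one group of 10 participants.
group_scores = [17, 19, 18, 20, 16, 21, 18, 17, 19, 20]
t_value = one_sample_t(group_scores, CHANCE)
```

With 10 participants per group (df = 9), |t| would need to exceed roughly 2.26 to differ from chance in a two-sided test at α = 0.05.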

For each of the four groups (see **Figure 2**), we did a Spearman correlation (uncorrected) between % correct responses and:


Analyses of individual performances showed that five participants correctly classified stimuli above chance. A Fisher exact test revealed that each of these five participants classified Fibonacci-grammatical stimuli as similar to the sequences heard in the exposure phase, and foils as dissimilar to them, significantly more often than expected by chance (one-sided, all p < 0.05, all prior odds ratio using Maximum Likelihood Estimate > 4.0). These were participant numbers 30 and 31 (Experiment 2, Swap group) and numbers 22, 33, and 37 (Experiment 2, Mirror group). Interestingly, all these participants received detailed instructions. Moreover, four out of five reported having musical training (12 out of our 40 participants reported musical training).
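
The per-participant analysis can be illustrated with a one-sided Fisher exact test on a 2 × 2 table of stimulus type (grammatical vs. foil) by response (yes vs. no). The sketch below computes the hypergeometric tail directly; the function name and the example counts are hypothetical, and the even split of the 36 test items into 18 grammatical items and 18 foils is an assumption.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Returns P(top-left cell >= a) with all row and column totals fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    k_max = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    return tail / comb(total, col1)


# Hypothetical participant: 14 "yes" and 4 "no" to grammatical items,
# 5 "yes" and 13 "no" to foils (18 items per type is an assumption).
p_value = fisher_exact_greater(14, 4, 5, 13)
```

A participant responding at random would typically yield a table close to equal cells, for which this tail probability is above 0.5.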

### DISCUSSION AND FUTURE EXPERIMENTS

Our experiments did not show that, at the group level, participants were able to learn the Fibonacci grammars and discriminate them from either the Mirror or Swap foil grammars. At the individual level, however, there were five participants in Experiment 2 who correctly identified grammatical and ungrammatical strings above chance level, suggesting that with specific instructions participants may be able to discriminate the grammatical from ungrammatical strings. Of those who did perform above chance level, most had received musical training, adding weight to the argument that rhythm perception may be involved in learning this type of grammar. However, the question remains as to why most of our participants were not able to discriminate grammatical and ungrammatical strings, while the participants in Shirley (2014) were able to do this.

The very limited proficiency our participants achieved may be due to the fact that the foil grammars were too similar to the target grammar to be discriminated. While our exposure grammars were similar to those used in Saddy (2009) and Shirley (2014), our foil grammars differed in that ours did not include repetitions of both Xs and Ys, and thus could not be discriminated using repetition detection. By making the difference between target and foil grammar more subtle to avoid this method of discrimination, it might be that some of our foils were substrings of the Fibonacci-grammatical space, generated by one of the infinite iterations of the rewrite rules (Krivochen and Saddy, personal communication). This would have made discrimination between the target and foils more difficult in our experiment than in the experiments by Saddy (2009) and Shirley (2014), in which foils were part of the L-system space but not Fibonacci-grammatical. We can therefore not conclude whether or not participants are able to learn a Fibonacci grammar when presented with musical sounds. In future research, in order to be able to draw conclusions about whether musical rhythm differs from linguistic rhythm, and whether participants are able to use some sort of rhythmic structure to learn Fibonacci grammars (rather than surface properties of the stimuli), foil grammars should be calibrated to an optimal tradeoff between the structural properties of Shirley's foils and the surface properties of those used here. In addition, a different paradigm, such as Serial Reaction Time or EEG, may help illuminate what cues in the sequence participants attended to and at which point they detect an error.

In addition, several important points for consideration in future experiments are raised by our results. First, the individuals who performed above chance in correctly identifying grammatical and ungrammatical sequences all took part in Experiment 2, where instructions were more specific than in Experiment 1. Instructions in Experiment 2 were also more in line with Shirley's instructions, letting participants know before training that they would later have to judge the correspondence between the test items and the exposure sounds. Our instructions did, however, differ from Shirley's in that Shirley used the term "language rule" whereas in our experiments, the term "rhythmic pattern" was used in order to potentially push participants even further in focusing on the rhythm of the sequences. The different terms may prime participants to listen to and learn about the same exposure grammars in different ways. Future experiments should thus take instruction into account as a factor. Another factor that should be taken into account and balanced in the future is age of participants; although not significant in the statistical analysis, older participants may perform better on this type of rhythm detection task (**Figure 2B**).

Taking into account the important difference in foil grammars between our experiments and those reported in Shirley (2014), we hypothesize that when given a complex grammar as foil that is not part of the Fibonacci grammatical space, participants would be able to draw upon rhythmic detection abilities to accurately accept grammatical and reject ungrammatical sequences. Success of some individuals on our potentially more difficult task (as compared to Shirley's) already points in this direction. Success in learning Fibonacci grammars using percussion sounds would add support to the claim that rhythm detection is being used to solve this type of Artificial Grammar Learning task, as well as the type of task using speech sounds in Shirley (2014). Future work will address these outstanding issues.

### OVERVIEW OF THE DATA FILES AND THEIR FORMATS

The raw data files are available at the figshare repository: https://figshare.com/s/83987b4a52906c87e115. The raw data is contained in the file "alldata.csv," which can be read by any text editor or Microsoft Excel. This file was obtained by merging all output files from individual participants (collected between Dec 5th, 2014 and Feb 26th, 2016), and adding additional information from the questionnaire (e.g., musical training). Python scripts used for the analyses are available from the authors on request.

Variable names and coding (values in brackets)


### AUTHOR CONTRIBUTIONS

All authors conceived the experiments, designed the stimuli, and edited the manuscript. AG performed the experiments. AR analyzed the data. AG and AR wrote the manuscript.

### REFERENCES


### ACKNOWLEDGMENTS

We dedicate this work to the memory of Remko Scha. We thank Johanne Rauwenhoff and Laura Toron for help collecting data. We also thank Doug Saddy, Liz Shirley, and Diego Gabriel Krivochen for valuable discussion on the experiments. AG and CL were supported by NWO Vrije Competitie grant 360.70.452 (to CL). AR was supported by ERC grants 283435 ABACUS (to Bart de Boer), 230604 SOMACCA (to W. Tecumseh Fitch) and ESF grant 5544 INFTY (to AR).

Comparative Perspective. (Vienna). Abstract retrieved from: https://evolangx.univie.ac.at/fileadmin/user\_upload/p\_evolangx/AbstractsCorrWeb.pdf


**Conflict of Interest Statement:** The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Copyright © 2016 Geambașu, Ravignani and Levelt. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
