Song Is More Memorable Than Speech Prosody: Discrete Pitches Aid Auditory Working Memory

Haiduk, Felix; Quigley, Cliodhna; Fitch, W. Tecumseh

doi:10.3389/fpsyg.2020.586723

ORIGINAL RESEARCH article

Front. Psychol., 10 December 2020

Sec. Auditory Cognitive Neuroscience

Volume 11 - 2020 | https://doi.org/10.3389/fpsyg.2020.586723

Felix Haiduk¹*

Cliodhna Quigley¹,²,³

W. Tecumseh Fitch¹,²

¹Department of Behavioral and Cognitive Biology, University of Vienna, Vienna, Austria
²Vienna Cognitive Science Hub, University of Vienna, Vienna, Austria
³Konrad Lorenz Institute of Ethology, University of Veterinary Medicine Vienna, Vienna, Austria

Vocal music and spoken language both have important roles in human communication, but it is unclear why these two different modes of vocal communication exist. Although similar, speech and song differ in certain design features. One interesting difference is in the pitch intonation contour, which consists of discrete tones in song, vs. gliding intonation contours in speech. Here, we investigated whether vocal phrases consisting of discrete pitches (song-like) or gliding pitches (speech-like) are remembered better, conducting three studies implementing auditory same-different tasks at three levels of difficulty. We tested two hypotheses: that discrete pitch contours aid auditory memory, independent of musical experience (“song memory advantage hypothesis”), or that the greater everyday experience of perceiving and producing speech makes speech intonation easier to remember (“experience advantage hypothesis”). We used closely matched stimuli, controlling for rhythm and timbre, and we included a stimulus intermediate between song-like and speech-like pitch contours (with partially gliding and partially discrete pitches). We also assessed participants' musicality to evaluate experience-dependent effects. We found that song-like vocal phrases are remembered better than speech-like vocal phrases, and that intermediate vocal phrases evoked a similar advantage to song-like vocal phrases. Participants with more musical experience were better at remembering all three types of vocal phrases. The precise roles of absolute and relative pitch perception and the influence of top-down vs. bottom-up processing should be clarified in future studies. However, our results suggest that one potential reason for the emergence of discrete pitch, a feature that characterises music across cultures, might be that it enhances auditory memory.

Introduction

Human communication is a fundamental behaviour defining how we see ourselves, particularly regarding our apparent uniqueness compared to other species. Two communicative systems are found in all human cultures, namely language and music, and both have key components executed in the vocal domain. Vocal music (song) and spoken language (speech) constitute two distinct but related modes of learned volitional vocalisations. The question of why humans engage in these two modes of vocalisations goes back to Darwin who proposed that they have a common origin (Darwin, 1871). They share the same vocal output system, which might be fundamental to the origin of both systems (see Levinson, 2013).

To clarify why humans engage in two modes of vocalisations–speech and song–a useful starting point is to break down these complex systems into components accessible for empirical study, in order to investigate the corresponding perceptual differences and their cognitive and behavioural consequences. Song and speech have multiple components in common. Both make use of the vocal apparatus (Lindblom and Sundberg, 2014) and share widely overlapping vocal production networks (Zarate, 2013; Pisanski et al., 2016). Both are complex auditory signals that rely on learned rules and are volitionally produced [for which the capacity of vocal learning is necessary; see Fitch and Jarvis (2013)]. Both consist of elements (for example notes or phonemes) that are generatively combined into sequences and group at different levels hierarchically, which exerts high demands on auditory memory. Finally, both modes of vocalisation are culturally transmitted and culturally variable (see e.g., Trehub, 2013; Mehr et al., 2019). However, song and speech are also perceptually clearly distinguishable, and not simply discriminated in arbitrary ways by different cultures (see Fritz et al., 2013). Aspects of the structure of the vocal spectrotemporal signal presumably influence this distinction. Furthermore, since these structural aspects are probably indicative of the underlying brain systems that shape the production and perception of song and speech (see e.g., Poeppel, 2011), signal structure offers a useful starting point to investigate their underlying cognitive and behavioural mechanisms.

Several structural aspects distinguishing music from language, termed “design features of music,” have been proposed by Fitch (2006). One of these design features concerns pitch, the perceptual component mostly based on the fundamental frequency (f0) of a sound. Pitch is an acoustic feature fundamental to both music (in song melody) and language (in speech prosody). Crucially, the variation of pitch over time, known as pitch trajectory or pitch contour, differs between the two modes, offering a well-defined perceptual component that contrasts between song and speech. In song, pitch contours typically consist of tones that are discrete in both time and frequency. The frequencies of these tones map onto culturally transmitted scales that themselves add structure to a musical phrase by relating pitches in particular ratios (intervals, see e.g., Krohn et al., 2007). Pitches and intervals provide the building blocks of melodies that can convey affective and/or symbolic meaning to a culturally informed listener (see Seifert et al., 2013). In contrast, in speech, pitch contours are smoother and follow gliding up and down patterns. Speech intonation does not intentionally match specific pitches of a musical scale. This remains true in tonal languages, where either the direction of pitch shift (i.e., the contour) or the general relative level [high, low, medium pitch etc., see Bradley (2013)] is relevant, but not the specific absolute pitch. Nonetheless, speech intonation conveys considerable information in all languages, expressing emotional states, pragmatic and communicative intention, lexical meanings, etc. (see Paulmann, 2016). Humans perceive and produce speech intonation constantly, starting in early childhood, and even prenatal exposure shapes neonates' vocal intonation production (Mampe et al., 2009).
The interval relations between single tones in song have been suggested to utilise a fine-grained intonation perception system exclusive to music, while both song and speech would share a more coarse-grained perceptive system for the overall directional course of the intonation contour (Zatorre and Baum, 2012; see also Merrill, 2013). In summary, pitch supports a rich and overlapping body of communicative purposes in both song and speech, making it puzzling why these two distinct modes of utilising pitch would emerge in human vocalisations in the first place. A general auditory mechanism might become domain-specific by being utilised differently based on the nature of the input signal (phenotypic plasticity; Fritz et al., 2013). In turn a perceptual system that evolved for some other purpose might be exploited by a given signal structure, yielding a certain behavioural outcome (exaptation, Lloyd and Gould, 2017). It is also possible that a signal structure like discreteness of pitch in song is simply an epiphenomenon, without any behavioural consequences, but emerges due to sensory biases (see e.g., ten Cate and Rowe, 2007). Relating signal structure to cognitive consequences can therefore be informative about the underlying cognitive system, especially if it concerns a fundamental design feature that separates one signal type from others. Thus, it is worthwhile to investigate whether use of discrete vs. gliding pitches has functional consequences in auditory cognition.

Music and language both require auditory memory since both types of signals have variable content, and extend over time. Learned vocal phrases are important for song and spoken conversation. However, since music is not linguistically propositional (Fritz et al., 2013) it seems likely that auditory memory for song centres on spectrotemporal features like pitch. As for many studies comparing music and language, results concerning auditory working memory are mixed (see Schulze et al., 2018, for an overview). It has been suggested that remembering song and speech utilises two distinct systems of subvocal rehearsal: a phonological loop for verbal stimuli, and a tonal loop for musical sounds. Evidence for such a distinction comes from patient studies: Tillmann et al. (2015) reviewed pitch memory-deficits in congenital amusia (CA), concluding that there is a distinction between verbal and pitch-related memory systems, with only the latter being impaired in CA patients. However, CA patients do have difficulties in prosodic pitch perception as well (Tillmann et al., 2015), and it is unclear whether pitch-related memory is clearly divided into music-specific and speech-specific networks. Moreover, studies using intervening distractor sounds in a memory task found that pitch similarity of target (tones or words) and distractor sounds (words or tones) influences performance, suggesting an overlap in pitch-related working memory for song and speech prosody (Semal et al., 1996; Ueda, 2004). There is also evidence from fMRI studies that neural populations active in perception are recruited for subvocal rehearsal for working memory (Hickok et al., 2003), for both speech and melodies (see Buchsbaum, 2016). Hickok et al. (2003) also compared working memory maintenance of non-sense speech and piano music in these regions and found nearly identical time courses of activation. 
This might suggest substantial mechanistic overlap, and thus no substantial difference between memorability of sung and spoken stimuli.

However, a study by Verhoef et al. (2011) suggested that under demands for memory in a transmission task, random intonation contours morph into distinct elements that are then utilised in a combinatorial way. Using a slide whistle in an iterated learning task (where stimuli played by one participant were passed to another to imitate, and so on in a chain for many participants), pitch contours gradually appeared that were distinct and easy to memorise. However, no distinct scale tones emerged. Reasons might lie in the use of a non-vocal instrument, the structure of the initial contours in the iterative chains, or simply because there is no memory advantage for pitch contours consisting of discrete pitches over gliding intonation. Thus, it is not clear whether the pitch trajectory–discrete or gliding–by itself has an effect on memory for vocal phrases. This is the issue we aimed to address in the current studies.

We conducted three studies to test whether identical vocal phrases can be remembered better when consisting of discrete (song-like) or gliding (speech-like) pitches. We presented these stimuli in a same-different paradigm using a two-alternative forced-choice task, with an auditory distractor stimulus interposed between the target and test phases. We took care to minimise cultural biases, and we used various types of distractor sounds for each study, to vary task difficulty and to interfere with subvocal rehearsal. We also quantified musical experience in order to investigate its effects on auditory memory, since studies comparing musicians and non-musicians in terms of auditory memory often reveal advantages for musicians (see Schulze et al., 2018).

The existing literature reviewed above suggests several hypotheses and predictions. We first hypothesised that discrete pitches are a spectrotemporal property that enhances auditory memory (“song memory advantage”). This hypothesis predicts that vocal phrases consisting of discrete pitches should be remembered better than vocal phrases consisting of gliding pitches. Such a result would indicate that a fundamental and widespread acoustic feature of music has clear cognitive effects beyond surface-level perception. An alternative hypothesis is that gliding speech pitch might be remembered better, because humans perceive and produce speech prosody abundantly from early childhood (“experience advantage”). The high familiarity of speech prosody might therefore make the gliding pitches of speech the more salient, and thus better-remembered, type of signal. Finally, based on previous literature we hypothesised that musical training might enhance memory effects, but we had no strong prediction about whether this advantage would apply across stimulus classes or only to song-like stimuli with discrete pitch. Either way, the results should provide insight into the memory systems underlying these two modes of vocalisation.

Study 1

Materials and Methods

Stimuli

Our stimuli were based on natural speech samples, from which the pitch trajectory was extracted (see Figure 1). In order to minimize cultural influences due to familiarity these were Mandarin speech samples, spoken by an adult male¹. For the same reason, pitches were adjusted to lie on the unfamiliar Bohlen-Pierce scale. From each of the pitch trajectories new pitch contours were derived to keep the sentence-level prosodic movement. Note that Mandarin Chinese shows both sentence-level intonation and syllable-level lexical tone. Our manipulation removed syllable-level but maintained sentence-level intonation. Sentence-level intonation supports both semantic and phonological processing in native speakers of Mandarin Chinese, but not in non-speakers (Tong et al., 2005). However, as non-speakers of Mandarin Chinese still show behavioural and neuronal effects of prosodic processing (Tong et al., 2005), using Mandarin sentence-level prosody seems a reasonable choice to minimize effects of cultural familiarity. The pitch contours derived from the Mandarin speech samples were the basis for stimuli of two variants: one variant consisting of discrete tones (as in a song melodic intonation) and another variant consisting of gliding tones (as in speech prosodic intonation). Additionally, seven intermediate variants were created, of which one was used as a stimulus halfway between the discrete and gliding variants. The reason for this was to examine to what extent potential effects are based on bottom-up processing of pitch as opposed to top-down categorical processing (hearing the stimuli “as song” or “as speech”). We hypothesised that an intermediate stimulus would neither be perceived as clear song nor as clear speech prosody. If the predicted effect that discrete pitch aids memory was mediated by a categorical top-down perception of a stimulus “as song” we would expect the intermediate stimulus to show very minor effects only.
If gliding pitch was more memorable, we also would expect only minor effects for intermediate stimuli since they would not represent clear speech prosody. If however any effects were based on bottom-up pitch perception, we would expect intermediate stimuli to show medium effects with a magnitude between our Song and our Speech stimuli, since the pitch is partly gliding and partly discrete. Finally, all pitch contours were re-synthesised using a one-syllable constant-pitch recording of [la:l] of a male singer. Final stimuli consisted of seven syllables for study 1 and study 3 and ten syllables for study 2.

Figure 1. Schema of stimulus creation. From recordings of Mandarin Chinese phrases spoken by a male (A) pitch trajectories were extracted (B), interpolated and smoothed (C). After finding the local frequency extrema (minima/maxima) of the trajectories (D), new pitch contours were derived by shifting the temporal distance between these extrema to create intervals of at least 0.25 s (E) and by subsequently shifting the extrema in frequency to the nearest Bohlen-Pierce or diatonic scale tones (F). The type of interpolation between extrema was determined by stimulus category. For Speech stimuli, Praat interpolated linearly between extrema (G). For Song stimuli, pitch values at extrema were continued (H), including a smooth transition right before the next minimum/maximum to avoid unnaturally abrupt pitch changes (I). Pitch values for last syllables of each contour were adjusted to be 0.3 s in duration, with falling pitch for Speech stimuli (J) and stable pitch for Song stimuli (K). Seven intermediate interpolation variants on a continuum between Song and Speech stimuli were created (L). Chunking pitch contours of all variants (Song, Speech, intermediates) using extrema as borders resulted in single pitch contour chunks (M). One-syllable template sounds, recorded by a male on the syllable [la:l] at all Bohlen-Pierce scale tones (N), were selected to match the initial pitch of a pitch contour chunk. The respective chunk's duration and pitch contour were then carried over to the respective template sound (O) to derive stimulus syllables. Finally, the first six (studies 1 and 3) or nine (study 2) stimulus syllables and the respective last syllable of each contour and variant were concatenated to generate the final stimuli (P, Q).

The software package Praat (Version 6.0.36, Boersma and Weenink, 2017) was used to create the stimuli (see Figure 1). In a first step, 200 pitch trajectories were isolated from 200 Mandarin phrases spoken by a male [function “To Pitch (ac)”; for details on the parameters, see Supplementary Material].

Pitch trajectories were then smoothed using the function “pitchsmoothing” from Praat Vocal Toolkit (80%; Corretge, 2012) and unvoiced gaps were linearly interpolated using the Praat standard function “interpolate.” Frequency values at local extrema (minima/maxima) of the smoothed pitch trajectory, along with the onset and offset frequencies of the whole pitch trajectory, were used as markers for deriving discrete pitches. These extrema were shifted in time, if necessary, to be at least 0.25 s distant from each other to later avoid unnaturally short tones/syllables. The frequency values at these extrema were shifted to the nearest Bohlen-Pierce scale tone, again to avoid any culture-specific biases. For a second variant, frequency values at extrema were shifted to the nearest tone of the familiar Western diatonic scale including both major and minor thirds. This resulted in new pitch contours, each in both a Bohlen-Pierce and a diatonic variant, both variants being equal tempered. Both the Bohlen-Pierce scale and the diatonic scale were based on 80 Hz as lowest tonic frequency and comprised 14 possible tones (see Table 1).

Table 1. Possible scale tones used in stimulus creation (both equal tempered, in Hz).
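
As a minimal illustration of the snapping step, the following Python sketch builds an equal-tempered Bohlen-Pierce scale and moves a frequency to its nearest scale tone. It assumes that the 14 tones are the 13 equal divisions of the 3:1 tritave above 80 Hz plus the upper boundary tone, and that "nearest" is measured on a logarithmic frequency axis. Neither detail is stated in the text, the function names are hypothetical, and the stimuli themselves were created in Praat.

```python
import math

BASE_HZ = 80.0  # lowest tonic frequency stated in the text


def bohlen_pierce_scale(base_hz=BASE_HZ, n_tones=14):
    """Equal-tempered Bohlen-Pierce tones: 13 equal steps per 3:1 tritave.

    Assumption: the 14 tones are steps 0..13, i.e. one full tritave.
    """
    return [base_hz * 3 ** (k / 13) for k in range(n_tones)]


def nearest_scale_tone(f_hz, scale):
    """Return the scale tone closest to f_hz on a log-frequency axis (assumed)."""
    return min(scale, key=lambda tone: abs(math.log(tone / f_hz)))


bp_scale = bohlen_pierce_scale()
snapped = nearest_scale_tone(100.0, bp_scale)  # an extremum at 100 Hz
```

Under these assumptions the scale runs from 80 Hz to 240 Hz, and an extremum at 100 Hz would be moved to the fourth scale tone (about 103 Hz).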

To derive the discrete variant of each contour (henceforth Song stimuli), pitch samples between two adjacent extrema were set to the respective Bohlen-Pierce/diatonic tone of the first of the two extrema. This way stable pitches emerged between every two extrema. To ensure naturalistic sounding stimuli with no abrupt changes of pitch (that would have led to a yodelling sound), transitions between these stable pitches were smoothed by transitioning to the next pitch from 1/8 of the duration between two extrema before the next extremum. To derive the gliding contours, pitch samples between extrema were linearly interpolated. We henceforth refer to these stimuli as Speech stimuli for simplicity, noting that we investigated speech intonation, not full-fledged speech. Extrema were used as cut-offs to chunk whole contours into syllable-length PitchTiers. These chunks were later combined with the one-syllable male voice recordings [la:l]. The last chunk of a Speech contour was altered such that it always had a falling pitch from the final minimum/maximum and a duration of 0.3 s, while the last chunk of Song contours was altered such that it always had a sustained pitch and a duration of 0.3 s. This mimicked phrase-final lengthening that occurs in both natural speech and song (Arnold and Jusczyk, 2002). If the last minimum/maximum was near 80 Hz this led to some pitch samples below 80 Hz for the last syllables of some Speech stimuli (see Supplementary Table 1 for details).
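
The two interpolation rules can be sketched in Python as follows. The sketch assumes a linear glide during the final eighth of each interval (the text states only that the transition was smoothed), ignores the special treatment of the phrase-final syllable, and uses hypothetical function names; it is not the Praat procedure itself.

```python
def _segment(extrema, t):
    """Find the pair of neighbouring (time_s, freq_hz) extrema that brackets t."""
    for (t0, f0), (t1, f1) in zip(extrema, extrema[1:]):
        if t0 <= t <= t1:
            return t0, f0, t1, f1
    raise ValueError("time lies outside the contour")


def speech_pitch(extrema, t):
    """Gliding (Speech) variant: linear interpolation between extrema."""
    t0, f0, t1, f1 = _segment(extrema, t)
    return f0 + (f1 - f0) * (t - t0) / (t1 - t0)


def song_pitch(extrema, t, transition_fraction=1 / 8):
    """Discrete (Song) variant: hold the pitch of the earlier extremum, then
    move to the next pitch during the last 1/8 of the interval.
    Assumption: the transition itself is linear."""
    t0, f0, t1, f1 = _segment(extrema, t)
    glide_start = t1 - transition_fraction * (t1 - t0)
    if t <= glide_start:
        return f0
    return f0 + (f1 - f0) * (t - glide_start) / (t1 - glide_start)


# Illustrative extrema (times in s, frequencies in Hz), not taken from the study
extrema = [(0.0, 100.0), (0.4, 120.0), (0.8, 90.0)]
speech_mid = speech_pitch(extrema, 0.2)  # halfway through the first glide
song_mid = song_pitch(extrema, 0.2)      # still on the held pitch
```

At the midpoint of the first interval the Speech variant has already moved halfway to the next extremum, whereas the Song variant is still at the initial pitch.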

To create stimulus variants between Speech and Song, we synthesised a continuum of seven intermediate stimuli between Song and Speech variants of each contour. To this end, frequency values (in Hz) of a stimulus' Song and the Speech variant at each given timepoint were converted to semitone values (“semitones re 1 Hz” in Praat). This way the difference between both values was converted from logspace (Hz) to linear space (semitones). This difference was then logarithmically divided into seven steps, resulting in seven values for the stimulus variants between Song and Speech. The logspacing was such that steps closer to Song were smaller than steps closer to Speech. Stimuli located perceptually halfway between Speech and Song would be categorised by a listener as Song (and not as Speech) 50% of times (which is called the Point of Subjective Equality, see Kingdom and Prins, 2016). To obtain an empirical estimate of this category boundary, we presented stimuli of all variants and contours to one participant (author FH) in a two-alternative forced-choice Speech/Song categorisation task, using a staircase method. We chose the variant closest to the Point of Subjective Equality as intermediate stimulus for study 1 (henceforth Intermediate). This was step 6, with Speech being step 1 and Song being step 9.
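
One way to sketch such a continuum in Python is shown below. The semitone conversion follows the definition of “semitones re 1 Hz”, but the spacing rule is an assumption: the text does not give the formula, so the sketch uses a geometric spacing with a hypothetical ratio of 2, chosen only because it makes steps shrink towards Song. All function names are illustrative.

```python
import math


def hz_to_semitones(f_hz):
    """Semitones relative to 1 Hz."""
    return 12 * math.log2(f_hz)


def semitones_to_hz(st):
    return 2 ** (st / 12)


def distance_from_song(step, n_steps=9, ratio=2.0):
    """Fraction of the Song-Speech gap remaining at a step (1 = Speech, 9 = Song).

    Assumption: geometric spacing with a hypothetical ratio; the exact
    logarithmic spacing used for the stimuli is not specified in the text.
    """
    return (ratio ** (n_steps - step) - 1) / (ratio ** (n_steps - 1) - 1)


def intermediate_pitch(song_hz, speech_hz, step):
    """Pitch of one continuum step at a single time point, mixed in semitones."""
    song_st = hz_to_semitones(song_hz)
    speech_st = hz_to_semitones(speech_hz)
    mixed_st = song_st + distance_from_song(step) * (speech_st - song_st)
    return semitones_to_hz(mixed_st)


# Illustrative values: Song holds 100 Hz while Speech has glided to 120 Hz
continuum = [intermediate_pitch(100.0, 120.0, step) for step in range(1, 10)]
```

With this spacing, step 1 reproduces the Speech value, step 9 reproduces the Song value, and successive steps become smaller as they approach Song.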

To obtain a naturalistic timbre, a male singer recorded sustained tones on the syllable [la:l] at each scale tone of the Bohlen-Pierce scale as well as the diatonic scale in the octave from 80 to 160 Hz (henceforth template syllables). Recording was done using a Zoom H4n recording device (16 bit, 44.1 kHz). In order to avoid template syllables at higher frequencies sounding more distant than template syllables at lower frequencies in the final stimuli and therefore disrupting the perception of a single speaker/singer (Bregman, 1994; Zahorik et al., 2005; Bregman et al., 2016), template syllables were not equalised in intensity. For each discrete Bohlen-Pierce/diatonic PitchTier chunk of each contour, the template syllable nearest in frequency to the respective pitch was adjusted in duration to match the duration of the respective PitchTier chunk, whereby the first and last 0.02 s of each template syllable remained unstretched in time in order to keep the liquid consonant [l] unchanged. The pitch contour was then combined with the spectral information of the template syllable and resynthesised to a syllable [la:l] following the current pitch contour. This procedure resulted in the final syllable sounds (henceforth syllables) for each contour and variant (i.e., Song, Speech and the seven Intermediates). The first six and the last syllable of each contour and variant were then concatenated with an overlap of 0.03 s, and the resulting stimulus was ramped using a trapezoid with 0.01 s rise and fall. Thus, the same syllable sequence with the same durations for each syllable was used for all variants of each contour (i.e., Song, Speech, and the Intermediate stimuli). The mean duration of syllables across all stimuli used was 0.360 s (SD = 0.168). The mean pitch acceleration rate of syllables across the speech prosody stimuli was 0.106 Hz/ms (SD = 0.081), thus much lower than for lexical tones in Mandarin Chinese (see Krishnan et al., 2010).
Stimuli were scaled in intensity to 73 dB SPL (Praat: “Scale intensity”; i.e., the root-mean-square amplitude of the stimuli was changed to 73 dB above 0.00002 Pa). All stimuli were monophonic, 16 bit and with a sampling rate of 44.1 kHz.
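
The amplitude implied by this scaling can be worked out directly. The following Python sketch assumes that sample values are interpreted as sound pressure in Pa and uses hypothetical function names; it mirrors the stated definition (root-mean-square amplitude set to 73 dB above 0.00002 Pa) rather than Praat's own code.

```python
import math

P_REF = 2e-5  # reference pressure in Pa, as stated in the text


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def target_rms(db_spl):
    """RMS pressure (Pa) corresponding to a level in dB re 0.00002 Pa."""
    return P_REF * 10 ** (db_spl / 20)


def scale_to_db(samples, db_spl):
    """Multiply all samples by one gain so that their RMS matches the target."""
    gain = target_rms(db_spl) / rms(samples)
    return [s * gain for s in samples]


# Illustrative samples, not taken from the stimuli
scaled = scale_to_db([0.5, -0.5, 0.25, -0.25], 73)
```

For 73 dB the target root-mean-square amplitude is roughly 0.089 Pa.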

For study 1 and 3, only stimuli with a duration between 2 and 3 s and with 7 syllables were used. For study 2, only stimuli between 3 and 4 s of duration and with 10 syllables were used. Note that syllable durations were ultimately based on the natural speech signals we used (Mandarin Chinese) and that therefore our stimuli were not isochronous or metrical in any sense.

Stimuli that were too similar to each other were excluded (evaluated aurally by author FH), resulting in 94 different stimulus contours per scale (diatonic/Bohlen-Pierce), each as Song, Speech and Intermediate variants.

To create the deviant stimuli for the same/different memory task, the pitch of one randomly chosen syllable (target syllable) of each contour, excluding the first and the last syllable, was shifted to another of the scale tones of the Bohlen-Pierce/diatonic scales specified above. The direction of the shift reversed the direction of pitch change between the target syllable and the syllable preceding it, such that a rising pitch interval between target syllable and preceding syllable became falling after the shift and vice versa. That way the overall pitch trajectory, that is the global up and down of the pitch contour, always changed. Exceptions were cases when the shift would have resulted in pitches of target syllables outside the Bohlen-Pierce/diatonic scales specified above. In such cases the pitch interval between two consecutive syllables changed in magnitude but not in direction (falling or rising). This was the case for 280 out of the 3,720 trials that made up the sample for study 1. The amount of pitch shift was random within the constraints just mentioned (see below in the statistical methods for details on the pitch deviation). The mean absolute pitch deviation was 3.8 semitones (SD = 5.0).
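
The deviant rule can be sketched in Python as follows. The sketch represents each syllable by the index of its scale tone, which is a simplification (Speech syllables glide), and it assumes uniform random choice among the permitted tones; the function name and the handling of a level interval are hypothetical.

```python
import random


def make_deviant(tone_idx, n_scale_tones=14, rng=random):
    """Return (deviant, position) for a list of per-syllable scale-tone indices.

    The target syllable is never the first or last one. Its new tone reverses
    the direction of the interval from the preceding syllable where possible;
    otherwise the direction is kept and only the magnitude changes.
    """
    pos = rng.randrange(1, len(tone_idx) - 1)
    prev, cur = tone_idx[pos - 1], tone_idx[pos]
    if cur > prev:    # rising interval becomes falling
        reversed_side = list(range(0, prev))
        same_side = [i for i in range(prev + 1, n_scale_tones) if i != cur]
    elif cur < prev:  # falling interval becomes rising
        reversed_side = list(range(prev + 1, n_scale_tones))
        same_side = [i for i in range(0, prev) if i != cur]
    else:             # level interval (assumed case): any other tone
        reversed_side = [i for i in range(n_scale_tones) if i != cur]
        same_side = []
    candidates = reversed_side or same_side
    if not candidates:
        raise ValueError("no admissible tone for this syllable")
    deviant = list(tone_idx)
    deviant[pos] = rng.choice(candidates)
    return deviant, pos


# Illustrative seven-syllable contour expressed as scale-tone indices
contour = [3, 6, 2, 5, 4, 7, 3]
deviant, position = make_deviant(contour, rng=random.Random(1))
```

Because the preceding syllable in this example never sits at the edge of the scale, every deviant generated from it reverses the local direction of pitch change.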

Distractor sounds matching the temporal structure of the tonal stimuli were constructed from pink noise bursts of random duration between 0.2 and 0.3 s in Praat (decreasing by 6 dB SPL per octave, F0 = 100 Hz) and concatenated with 0.01 s overlap, such that the total duration of the distractor was 2 s. Two hundred and eighty two different distractors were created to allow unique distractors to be used for each variant of each contour.

Distractor sounds and stimuli were presented such that in the same-different task participants would hear an original stimulus contour first, then a distractor sound and then either the original stimulus contour again (same) or the respective deviant stimulus contour (different).

Procedure

Thirty-three participants (21 females, 12 males, age range 18 to 56 years, mean = 24.3 years) took part in the study. Throughout all three studies, participants were recruited via posters at the university venue and via Facebook, and most of them were university students from the University of Vienna. All participants in the three studies were rewarded with 5 € per half hour of participation (approx. study duration was 60 min for all three studies). All participants in the three studies gave written informed consent to participate. All three studies were approved by the ethics committee of the University of Vienna (# 00361).

Participants were welcomed by the experimenter (author FH) and led to one of two adjacent testing rooms for human participants where they were seated in front of a computer. To avoid expectancy effects participants were given minimal, written instructions in German or English². They were told simply that their task was to make decisions about sounds by clicking on circles on a screen. Words such as “song,” “speech,” “language,” “music” were not mentioned.

Python 2.7 was used to present stimuli (scripts based on the script-building program “Experimenter_GUI, version 0.1” by Pinker, 1997). Sounds were played via Sennheiser HD 201 headphones at about 73 dB SPL. The study was run on a MacBook Pro laptop (Retina display, 15-inch screen, Mid 2015). Except during the training phase, the experimenter left the testing room so that each participant executed the tasks alone.

The experiment consisted of four phases: a training phase for the same-different task (about 10 min duration), the same-different task itself (about 30 min), a test of spectral or holistic hearing (Schneider et al., 2005) and the Gold-MSI musicality questionnaire (Müllensiefen et al., 2014). We did not include the results from the test of spectral or holistic hearing as results were heavily skewed towards holistic hearing (holistic:mix:spectral hearing 16:15:2 in study 1) and therefore uninformative.

Instructions were provided in written form on the computer screen. Trials were self-paced so participants could take breaks between trials and after each part of the experiment.

In the initial training phase, participants were asked to judge whether a “sound example” (German: “Klangbeispiel”) changed or stayed the same and were instructed that their goal was to gain as many points as possible by answering correctly. They could start each trial by clicking on a small white filled circle in the middle of a grey screen and made decisions by clicking on one of two rectangular response boxes on the left and right side of the screen (side randomised between trials). One response box contained the word “anders” (or “different”), the other one “gleich” (or “same”). In order to minimise lapse errors due to random change of the sides where two boxes would appear, boxes were consistently outlined by rectangular lines of yellow (“same”) or blue (“different”). Participants were informed that their current points score would be shown on the screen after each trial. After reading these instructions, participants were allowed to ask clarification questions. No information concerning the task purpose, the nature of the stimuli or the type of change in difference stimuli was given at any point during the study.

In the training phase, both Song and Speech variants of ten different pitch contours randomly chosen from the pool of diatonic pitch contours (thus 20 stimuli) were presented in random order in the two-alternative forced choice design. Pitch contours used in the training phase were not used in the later test phase to avoid any memory transfer effects. Participants first heard one stimulus, followed by the distractor sound, and then either the stimulus again (same) or its deviant version (different). Thus, in cases when deviants occurred they were always played after the original (unaltered) stimulus.

After clicking on one of the two response boxes, the participants received auditory and visual feedback. Correct choices were rewarded by displaying a gain of five points and a short bell-like sound. Incorrect choices were punished by the screen turning red for 500 ms along with playing an aversive sound and by displaying a loss of five points. Failed trials were repeated until successful. The current number of points (which could only be positive) was also displayed after each trial.

During this initial training phase, the experimenter stayed in the room to allow the participants to ask questions about the task.

The second phase of the same-different task was the test phase and involved 120 forced-choice trials. Participants first received instructions written on the computer screen. The instructions given were the same as in the training phase, except that participants were told that the following task would consist of three blocks with an approximate duration of 10 min each and the possibility to take a break of at least 30 s between blocks. Additionally, participants were informed that points they gained would now only be displayed after each block (no feedback after trials). After the instruction, participants were allowed to ask clarification questions.

After the instructions the experimenter left the room until the participant was finished.

The same setup was used as in the training phase. This time, 40 different pitch contour pairs, randomly drawn from the pool of Bohlen-Pierce stimuli individually for each participant, were each presented in the Speech, the Song, and the Intermediate variant (120 trials in total). Half of the stimuli were difference stimuli (i.e., included a pitch change on one syllable during the second presentation of the pitch contour after the distractor noise) and half of the stimuli were without change.

Presentation order of the stimuli was random. In total 120 trials were presented. After each block of 40 trials, the run was interrupted for 30 s and written instructions appeared on the screen, asking the participant to take a short break and continue by clicking the white circle. To ensure a high level of motivation throughout the task an arbitrary current point score was displayed as well, showing 135 points after the first block and 260 points after the second block.
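The structure of the test phase can be illustrated with a minimal Python sketch. It is not the experiment software: the function name, the `pool_size` default (borrowed from the 94 contours mentioned for the statistical model), and the choice to balance "same" and "different" trials within each variant are assumptions made for illustration only.

```python
import random

VARIANTS = ("Speech", "Song", "Intermediate")


def build_test_trials(pool_size=94, n_contours=40, block_size=40, seed=None):
    """Return a list of blocks, each a list of trial dictionaries.

    pool_size is a hypothetical size of the stimulus pool; the contours are
    drawn individually for each call (i.e., for each participant).
    """
    rng = random.Random(seed)
    contours = rng.sample(range(pool_size), n_contours)

    trials = []
    for variant in VARIANTS:
        # Assumption: within each variant, half of the trials are "different".
        n_different = n_contours // 2
        states = ["different"] * n_different + ["same"] * (n_contours - n_different)
        rng.shuffle(states)
        for contour, state in zip(contours, states):
            trials.append({"contour": contour, "variant": variant, "state": state})

    # Random presentation order across all 120 trials, then split into blocks.
    rng.shuffle(trials)
    return [trials[i:i + block_size] for i in range(0, len(trials), block_size)]


blocks = build_test_trials(seed=1)
print(len(blocks), [len(block) for block in blocks])
```

With the defaults this yields three blocks of 40 trials, 60 of which are "different" trials overall.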

Participants were also instructed to self-assess their current concentration abilities after each block on a 7-point Likert scale. This was done to assess the reliability of the results for each block, whereby our a-priori criterion was to drop blocks for which concentration was below 3. During the test phase, no feedback was given after each trial and no trials were repeated. After the participants were finished, the experimenter re-entered the room.

Part three of the study consisted of the test on holistic vs. spectral hearing from Schneider et al. (2005), which we did not use in the analysis because of biased results.

Part four of the study consisted of the Gold-MSI test to assess musicality of the participants (Müllensiefen et al., 2014), and a post-experimental questionnaire asking about strategies the participants used, any speculations about the purpose of the study and the languages they spoke (including level of proficiency).

Two speakers of Mandarin Chinese were excluded to avoid any familiarity confounds due to the original contours being Chinese.

Afterwards, participants gave post-experimental consent for using their data, were thanked, paid, and accompanied to the exit. The total duration of part 1 of the experiment ranged from 50 to 90 min (two participants needed more than 60 min due to the self-paced procedure).

Statistical Methods

Our goal was to quantify whether vocal contours are remembered better when they consist of discrete rather than gliding intonation contours. Our memory task required a binary response (original and target contour judged to be different or same), given two possible states of the target signal (original and target contour deviate or do not deviate). Our task therefore can be analysed in the framework of signal detection theory (see Kingdom and Prins, 2016, for an overview).

To this end, we used a logistic GLMM with logit link function. The response variable was the response given by the participants (“Response Given”). Fixed effects entered were the response that would have been correct (“Stimulus State”) and the variant of the contour (“Variant” Song, Speech, Intermediate) along with their interaction. We z-transformed the musicality score of the Gold-MSI in order to obtain model estimates for participants of average musicality (“z.Musicality”) and included this as a fixed effect as well. As random slopes we entered “Stimulus State,” “Variant” and “z.Musicality” within “Contour” (the 94 different stimulus pitch trajectories created) and “Stimulus State” and “Variant” within “Participant.” Interactions of random intercepts and random slopes were not included. We also did not include semitone deviation as a predictor in the model since it did not differ significantly between Variants (Median_Speech: 6.67; Median_Intermediate: 8.6; Median_Song: 7.88; Kruskal Wallis Test: H = 0.51, df = 2, p = 0.78; note that the semitone differences were derived in Praat as real numbers, not integers). We did not include the results of the test of spectral vs. holistic hearing as its results were heavily skewed towards holistic hearing (only 1 out of 31 participants in the analysis sample scored as spectral hearer). Individual participants (“Participant”) and stimulus contours (“Contour”) were included as random intercepts. Dummy coding was done by a custom-built function. Optimizer “bobyqa” with maximum 100,000 function iterations was used to assess the maximum likelihood.
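The z-transformation of the musicality score can be sketched as follows. This is an illustrative Python sketch rather than the analysis code, which was written in R; it assumes the sample standard deviation (n − 1 denominator), and the example scores are made up.

```python
from statistics import mean, stdev


def z_transform(scores):
    """Centre scores on their mean and scale them by the sample SD."""
    centre = mean(scores)
    spread = stdev(scores)  # sample standard deviation (n - 1 denominator)
    return [(score - centre) / spread for score in scores]


# Hypothetical musicality scores, not data from the study.
print(z_transform([60, 80, 100]))  # [-1.0, 0.0, 1.0]
```

After this transformation a value of 0 corresponds to a participant of average musicality, so the remaining model coefficients are estimated at that reference point.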

All statistical analyses were done in R (R Core Team, 2019). We fitted the model (and all following models) using the function glmer (R package “lme4”; version 1.1-21; Bates et al., 2015).

Collinearity was assessed by fitting a linear model without random effects and interactions between fixed effects and applying the function “vif” (R package “car”; version 3.0-2; Fox et al., 2011). The function revealed no collinearity issues.

Overdispersion was tested using a custom-built function. The model was not overdispersed, and in fact appeared to be slightly underdispersed (χ² = 2831.574, df = 3699, p = 1, dispersion parameter = 0.77), so no further action was required.
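The custom function is not reproduced here, but the reported values are consistent with a dispersion parameter computed as the chi-square statistic divided by its degrees of freedom. The following Python sketch makes that assumption explicit; the function name is hypothetical.

```python
def dispersion_parameter(chi_square, residual_df):
    """Ratio of the chi-square statistic to the residual degrees of freedom.

    Values near 1 indicate no overdispersion; values below 1 indicate
    underdispersion.
    """
    return chi_square / residual_df


# Values reported for study 1.
print(round(dispersion_parameter(2831.574, 3699), 2))  # 0.77
```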

To assess how robust the model estimates would be given changes in the predictors we assessed model stability using a custom-built function. This function excludes participants one at a time from the data. Model estimates of the reduced and the full datasets are then compared (please see Nieuwenhuis et al., 2012 for a comparable approach). The model turned out to be stable (for ranges of parameter estimates of fixed effects please see Supplementary Table 2 and Supplementary Figure 1).

In order to test the effect of all predictors as a whole we compared the full model with a null model lacking the predictors of interest, i.e., including only the random factors. Models were compared using a likelihood ratio test, applying the function “anova.glm” (argument test set to “Chisq”; Dobson, 2002; Forstmeier and Schielzeth, 2011).

To test the effect of single predictors we used likelihood ratio tests, comparing the full model with a model reduced by one of the effects of interest (R function “drop1” with argument “test” set to “Chisq”). We first assessed the effect of the three-way interaction “Stimulus State:Variant:z.Musicality.” To test the effect of all two-way interactions, including the interaction of interest “Stimulus State:Variant,” we fitted the model without the three-way interaction and executed likelihood ratio tests as above.
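How a likelihood ratio statistic maps onto a p-value can be sketched in Python. The sketch uses the closed form of the chi-square survival function that holds for even degrees of freedom only (a simplification chosen to avoid external libraries; the function name is hypothetical), and it is applied to the three-way interaction statistic reported in the Results (LRT = 9.686, df = 2).

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable for even df.

    For df = 2k the survival function is
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


p_value = chi2_sf_even_df(9.686, 2)
print(round(p_value, 3))  # 0.008
```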

The sample size for this model was 31 individuals with 120 observations each, thus a total of 3,720 observations.

We derived 0.95 confidence intervals using the function “bootMer” (package “lme4”) with 1,000 parametric bootstraps.

After applying the analyses mentioned above, we additionally did a repeated-measures Anova (R function “aov_ez,” from package “afex,” version 0.26-0; Singmann et al., 2020) on classically calculated d-primes for informative purposes (correction for hit rates and false alarm rates of 0 and 1: ±1/(2N). A hit was defined as correctly detecting a difference (deviation), a false alarm accordingly was a response as “different” while there was no change in the stimulus. Corrections had to be done for 17 of 93 cells for this analysis). Note that this approach goes along with a loss of information in the data and is therefore not the ideal way of dealing with a data set with binary responses. We included “Variant” as repeated measures factor.
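The classical d-prime calculation with the 1/(2N) correction can be sketched in Python as follows. This is an illustration, not the analysis code: the function names are hypothetical, and N is taken to be the number of trials on which the respective rate is based.

```python
from statistics import NormalDist


def corrected_rate(count, n_trials):
    """Proportion with the 1/(2N) correction for rates of exactly 0 or 1."""
    rate = count / n_trials
    if rate == 0:
        return 1 / (2 * n_trials)
    if rate == 1:
        return 1 - 1 / (2 * n_trials)
    return rate


def d_prime(hits, n_signal, false_alarms, n_noise):
    """d' = z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(corrected_rate(hits, n_signal)) - z(corrected_rate(false_alarms, n_noise))


# Hypothetical cell: 20 "different" and 20 "same" trials, perfect performance.
print(round(d_prime(20, 20, 0, 20), 2))  # 3.92
```

Without the correction, a hit rate of 1 or a false alarm rate of 0 would yield an infinite z-score, which is why such cells have to be adjusted before d-prime is computed.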

Results

Post-experiment Questionnaire

Twelve participants reported that they had attempted to repeat the stimuli silently or by humming to remember them; some of these participants additionally tapped their fingers. One participant reported having visualised the contours, and one used only finger tapping as a strategy. Four participants guessed that the study was about memory. This suggested that the distractor sound we used was not distracting enough, which was one reason to replicate the study under more difficult conditions.

Statistical Results

After fitting the full model (for model coefficients, see Table 2) we first assessed whether it differed significantly from the null model (comprising only the random factors). We found that all predictors of interest as a whole explained the data significantly better than the null model (χ² = 117.436, df = 11, p < 0.001). Figure 2 and Table 3 show the probability estimates of the full model.
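Because the model uses a logit link, its coefficients are on the log-odds scale, whereas the estimates shown in the figure and the probability table are on the probability scale. The conversion is the inverse logit, sketched here in Python; the input values are illustrative and are not taken from the tables.

```python
import math


def inv_logit(log_odds):
    """Convert a value on the log-odds scale to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Illustrative values only: log odds of 0 correspond to a probability of 0.5,
# and odds of 3:1 correspond to a probability of 0.75.
print(inv_logit(0.0))
print(round(inv_logit(math.log(3)), 2))
```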

TABLE 2

Table 2. Estimates of model coefficients (log odds), standard errors (SE), lower (CI_lower) and upper (CI_upper) 0.95 confidence intervals, z statistics (z-value) and associated p-values of the model fitted for study 1.

FIGURE 2

Figure 2. Model estimates for the probability of detecting a deviant in study 1 (see also Table 3). Empty circles: Hits, filled circles: False Alarms. Bars indicate 0.95 confidence intervals. N = 31.

TABLE 3

Table 3. Probability estimates and 0.95 confidence intervals for Hits and False Alarms as depicted in Figure 2.

We next tested the three-way interaction “Stimulus State:Variant:z.Musicality,” which was significant (LRT = 9.686, AIC = 3,556.861, df = 2, p = 0.008, Nagelkerke's R² = 0.046). Musicality enhanced memorising much more for Intermediate stimuli than for Speech or Song stimuli. Note that therefore the interpretation of the two-way interactions was not straightforward anymore (see Figure 3 and Table 4 for detailed results on the two-way interactions). We found that the interaction “Stimulus State:Variant” was significant, that is, participants could memorise both Song and Intermediate stimuli better than Speech stimuli (see also Figure 2), with the magnitude depending on their Musicality score.

FIGURE 3

Figure 3. Model estimates for Musicality on the probability of detecting a deviant in study 1. Musicality had been assessed with the Gold-MSI and z-transformed for analysis. Empty circles: Hits, filled circles: False Alarms. Dotted lines represent 0.95 confidence intervals. Musicality showed the strongest impact for Intermediate stimuli. N = 31.

TABLE 4

Table 4. Degrees of freedom (Df), Akaike Information Criteria (AIC), Likelihood Ratio test statistic (LRT), associated p-value [Pr(Chi)] and Nagelkerke's R², based on likelihood ratio tests for two-way interactions of the model fitted in study 1.

For the Anova, we assessed the assumption of normality of residuals using a qqplot and histogram of the residuals, which showed no obvious deviation from normality. The assumption of sphericity had not been violated (Mauchly's Test of Sphericity: W = 0.984, p = 0.793).

We obtained a trend for a difference between Variants [F_{(2, 60)} = 2.5, p = 0.09, generalised η² = 0.03]. For an overview of the distribution of d-primes as a function of Stimulus Variant please see Supplementary Figure 2; for the raw hit and false alarm rates please see Supplementary Table 3.

Study 2

Study 1 showed the predicted effect of discrete pitch on memory. To test how robust and far-reaching the effect would be, and in particular whether it would be present in a much more difficult task, we replicated study 1 with more challenging settings. To increase difficulty, we decreased the magnitude of deviation in the target stimuli, increased the number of syllables in each stimulus and changed the distractor sound. Moreover, given that we found that Intermediate stimuli had similar effects to Song stimuli, we replicated study 1 with an Intermediate stimulus closer to Speech stimuli. We hypothesised that the effects of Intermediate stimuli would then be between those for Song and Speech stimuli.