Preliminary Evidence of Pre-Attentive Distinctions of Frequency-Modulated Tones that Convey Affect

Recognizing emotion is an evolutionary imperative. An early stage of auditory scene analysis involves the perceptual grouping of acoustic features, which can be based on both temporal coincidence and spectral features such as perceived pitch. Perceived pitch, or fundamental frequency (F0), is an especially salient cue for differentiating affective intent through speech intonation (prosody). We hypothesized that (1) simple frequency-modulated (FM) tone abstractions, based on the parameters of actual prosodic stimuli, would be reliably classified as representing differing emotional categories; and (2) such differences would yield significant mismatch negativities (MMNs), an index of pre-attentive deviance detection within the auditory environment. We constructed a set of FM tones that approximated the F0 mean and variation of reliably recognized happy and neutral prosodic stimuli. These stimuli were presented to 13 subjects using a passive listening oddball paradigm. We additionally included stimuli with no frequency modulation (no-FM) and FM tones with identical carrier frequencies but differing modulation depths as control conditions. Following electrophysiological recording, subjects were asked to identify the sounds they heard as happy, sad, angry, or neutral. We observed that FM tones abstracted from happy and no-expression speech stimuli elicited MMNs. Post hoc behavioral testing revealed that subjects reliably identified the FM tones in a consistent manner. Finally, we also observed that FM tones and no-FM tones elicited equivalent MMNs. That FM tones differentiating affect elicit MMNs suggests that these abstractions may be sufficient to characterize prosodic distinctions, and that these distinctions can be represented in pre-attentive auditory sensory memory.


INTRODUCTION
Frequency modulation (FM) is characteristic of speech and its parameters may aid us in communicating social intent (Goydke et al., 2004). Vocally, while we express emotion through words (semantics), we also modulate our tone through pitch change (prosody). Numerous studies have indicated that the perception of pitch and pitch change, as reflected in fundamental frequency (F0) and F0 variability (F0SD), respectively, is crucial in identifying the internal emotional states of one's interlocutor (Ladd et al., 1985). For example, we and others have found that high pitch mean and variability characterize excitement and happiness, while low pitch mean and low pitch variability are often perceived as fear or sadness (Juslin and Scherer, 2005), or no emotion. The prominence of pitch as a cue for prosodic perception is highlighted by studies that have shown that emotion (Lakshminarayanan et al., 2003) as well as interrogative intent (Majewski and Blasdell, 1969) can be detected based on F0 alone. Furthermore, in the domain of emotional prosody, the temporal structure of pitch change does not seem to be a prerequisite for affective decoding, a claim supported by Knowner (1941), who observed that individuals could reliably (nearly twice-chance) identify affective intent in prosodic sentences played backward. Together these findings suggest that overall pitch mean and variability may be sufficient to communicate emotional intent, regardless of temporal structure. We therefore hypothesized that simple FM tones, the carrier frequency and modulation depth of which approximate the F0 mean and SD of well-identified prosodic tokens, might be sufficient to differentiate affective intent.
Rapid emotion identification is also an evolutionary imperative. Electrophysiological studies indicate responses to emotion occurring as early as 80-150 ms (Kawasaki et al., 2001; Sauter and Eimer, 2010). Similar to our classification of color perception, which enables us to tell food from poison and predator from prey, our identification of emotions conveyed facially or vocally through prosody is performed categorically (Etcoff and Magee, 1992; Beale and Keil, 1995; Young et al., 1997; Laukka, 2005). In this way, affective cues are much like phonemes, whose distinctions do not vary along a sensory or acoustical continuum but instead are "Balkanized"; that is, they are perceived as having a common identity within a category, with a sharp change in perception at the position at which this category boundary ends. Such categorization may provide the building block for higher order cognition while minimizing processing demands, leading to increased perceptual speed (Harnad, 1987). Electrophysiologically, evoking mismatch negativities (MMNs) reveals that categorical distinctions of affective prosody signals and phonemes appear early in auditory processing (Näätänen, 2000). The MMN is an evoked response that is thought to index preattentive detection of deviance within the auditory environment (Näätänen, 2000; however, see Näätänen (1991) and Woldorff et al. (1991) for a discussion of the potential effects of attention on MMN amplitude under special conditions). Typically, MMNs are measured by comparing the responses to a deviant or "A" tone that has been presented within a stream of standard "B" tones. To date, few prosody studies have employed MMNs to examine emotional prosody using actual speech (Kujala et al., 2005; Schirmer et al., 2005) or single or multiple pseudo syllables (Korpilahti et al., 2007; Schirmer et al., 2008; Thonnessen et al., 2010).
Frontiers in Human Neuroscience www.frontiersin.org
The drawback to using actual speech is that the complexity of the acoustic signal can generate MMNs that are difficult to discern and interpret compared to those generated by more basic acoustic stimuli. With pitch, intensity, and spectral features changing across emotions in actual prosodic stimuli, MMN differences, even when measured, do not reveal which feature or combination of features is generating the MMN (Leitman et al., 2009). From this perspective, simple FM tones, should they reliably abstract affective prosodic distinctions, are clearly advantageous. More importantly, the successful identification of emotion within simple FM tones could indicate the minimal amount of information needed to convey emotion.
In our case, the base (carrier) frequency, modulation frequency, and modulation depth of our constructed FM tones approximate those present in highly recognizable affective prosodic stimuli from a standardized affective prosody battery. We hypothesized that these FM tone abstractions of vocal affect would generate MMN differences and reliable affective categorizations in a post-study identification task. Such a finding would suggest that automatic recognition of vocal emotional prosodic distinctions could result from processing of fundamental FM alone. While the importance of pitch as a prosodic cue has been demonstrated both behaviorally (Ladd et al., 1985; Goydke et al., 2004) and electrophysiologically in MMN studies (Leitman et al., 2009), this study using stationary FM tones would suggest that pitch cues alone, devoid of their temporal progression, are sufficient for automatic prosodic detection.
Our FM abstractions of "happy" and "neutral" prosodic stimuli varied along two parameters: carrier frequency and modulation depth (Figure 1). In order to characterize the specific impact of each of the manipulated FM parameters on MMN generation in isolation, we created two control stimuli: one with no FM, and a second stimulus in which the carrier frequency is held constant at that of our "happy" FM abstraction but the modulation depth is varied.
In summary, our primary aim was to examine whether FM abstractions of prosodic stimuli pitch parameters reliably discriminate emotion, and whether such abstractions generate MMNs. An auxiliary aim was to examine the effect that various changes in the parameters of FM within the auditory scene have on MMN generation. One prior study (Bishop et al., 2005) had suggested that FM tone deviants among no-FM standard tones elicited larger MMNs than the reverse condition of no-FM tone deviants within FM tone standards. We endeavored to replicate this potential asymmetry (FM vs. no-FM), and further examined whether variation within carrier frequency and modulation depths (CMD +
FIGURE 1 | Frequency-modulated (FM) tone stimuli profile. Spectrograms of reliably recognized prosodic sentences ("is it eleven o'clock?") spoken with a happy (A) or no emotional (B) intonation. Pitch differences between stimuli are indicated by the fundamental frequency (F0) contour (blue trace) as calculated by the TDPSOLA algorithm in PRAAT. Using the F0 mean and SD of these stimuli we created FM analogs of these stimuli [(C,D) blue traces reflect the carrier frequencies (Hz) and modulation depths, respectively] whose modulation frequency was held constant at 3 Hz. Control stimulus (E) illustrates the no-FM stimulus and (F) illustrates the second control stimulus: a hybrid of the FM tones (C,D).

SUBJECTS
Informed consent was obtained from 13 (6 females) healthy control subjects with a mean age of 26 ± 1, a mean education level of 15.9 years, and no reported history of psychopathology. All subjects reported that they were right handed, had normal hearing, and were medication free at the time of testing. Three subjects were excluded from analysis due to technical recording issues resulting in high levels of noise within their data. All procedures were conducted under the supervision of the local internal review board.

STIMULI AND PROCEDURE
The stimuli consisted of two types: FM tones and no-FM tones. We decided to use FM tones with carrier frequencies and modulation depths that approximate the F0 mean and SD of well-identified prosodic tokens from the Juslin and Laukka (2001) vocal affect battery. This translated into two standard FM tones. The first had a base frequency of 378 Hz and a modulation depth of 169 Hz (CMD+), approximating a happy stimulus whose identification rate in prior testing was greater than 80% (Leitman et al., 2010), and the second had a base frequency of 178 Hz and a modulation depth of 23 Hz (CMD−), approximating a neutral prosodic utterance. All stimuli were given a fixed length of 1000 ms that corresponded to the approximate length of the original speech stimuli, and a modulation rate of 3 Hz roughly corresponding to the average speech rate. In order to examine MMN differences to the presence or absence of FM as well as the effect of differing modulation depths on MMN amplitude and latency, additional control FM tone conditions were constructed. These conditions are outlined in Table 1. Subjects were presented with six conditions, each of which consisted of two types of 1000 ms tones, in which FM was either present or absent (see Table 1). Tones were presented using an inter-stimulus interval (ISI) of 700 ms, and the six conditions were presented in a randomized order. In each block, one in four tones was a deviant, such that each block contained 240 standard tones and 80 deviant tones. All tonal contours were presented binaurally at 75 dB SPL through Sennheiser HD 600 headphones. Subjects were instructed that the experiment was designed to test their passive auditory responses to tonal sequences to which they need not attend. Subjects watched a silent movie during the course of stimulus presentation and were instructed not to pay attention to the auditory stimuli.
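As a rough illustration of the stimulus parameters above, the following Python sketch synthesizes a sinusoidal FM tone from a carrier frequency, modulation depth, and modulation rate. It assumes that modulation depth is the peak deviation of the instantaneous frequency from the carrier and that the audio sampling rate is 44.1 kHz; neither is stated in the text. The function names are hypothetical, and this is not the authors' stimulus-generation code.

```python
import math

SAMPLE_RATE = 44100  # Hz; assumed, not reported in the text


def instantaneous_frequency(t, carrier_hz, depth_hz, rate_hz):
    """Frequency (Hz) at time t, assuming depth is the peak deviation."""
    return carrier_hz + depth_hz * math.sin(2 * math.pi * rate_hz * t)


def fm_tone(carrier_hz, depth_hz, rate_hz=3.0, duration_s=1.0,
            sample_rate=SAMPLE_RATE):
    """Return samples in [-1, 1] of a sinusoidally frequency-modulated tone.

    The phase is the analytic integral of the instantaneous frequency:
    phi(t) = 2*pi*fc*t + (depth/rate) * (1 - cos(2*pi*rate*t)).
    """
    n_samples = int(round(duration_s * sample_rate))
    samples = []
    for n in range(n_samples):
        t = n / sample_rate
        phase = 2 * math.pi * carrier_hz * t
        if rate_hz > 0:
            phase += (depth_hz / rate_hz) * (
                1 - math.cos(2 * math.pi * rate_hz * t))
        samples.append(math.sin(phase))
    return samples


# Parameter values taken from the text (carrier Hz, depth Hz, 3 Hz rate).
cmd_plus = fm_tone(378.0, 169.0)   # abstraction of the happy token
cmd_minus = fm_tone(178.0, 23.0)   # abstraction of the neutral token
no_fm = fm_tone(378.0, 0.0)        # unmodulated control tone
```

Under the peak-deviation assumption, the CMD+ tone sweeps between 209 and 547 Hz three times per second; if depth were instead defined peak-to-peak, the sweep would be half as wide.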
Importantly, subjects were never told that a focus of the experiment was testing affective perception. After electrophysiological testing, subjects were asked to affectively categorize all the tones presented during electrophysiological testing as happy, sad, angry, or no-expression.
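The block structure described in this section can be sketched as follows. The counts, tone duration, and ISI come from the text; the plain random shuffle and the reading of the ISI as silence between tone offset and the next onset are assumptions, since the text does not say whether ordering constraints (for example, no consecutive deviants) were imposed. All names are hypothetical.

```python
import random

N_STANDARDS = 240   # standards per block (from the text)
N_DEVIANTS = 80     # deviants per block (from the text)
TONE_MS = 1000      # tone duration (from the text)
ISI_MS = 700        # inter-stimulus interval (from the text)


def make_oddball_block(seed=0):
    """Return a shuffled list of 'standard'/'deviant' labels for one block.

    Unconstrained shuffling is an assumption made for illustration.
    """
    rng = random.Random(seed)
    block = ["standard"] * N_STANDARDS + ["deviant"] * N_DEVIANTS
    rng.shuffle(block)
    return block


def block_duration_s(n_tones, tone_ms=TONE_MS, isi_ms=ISI_MS):
    """Block duration in seconds, assuming one ISI follows every tone."""
    return n_tones * (tone_ms + isi_ms) / 1000.0


block = make_oddball_block()
deviant_proportion = block.count("deviant") / len(block)
duration = block_duration_s(len(block))
```

Under these assumptions deviants make up 25% of the 320 tones in a block, and each block lasts 544 s, or roughly 9 minutes.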

DATA COLLECTION
High-density event-related potentials (ERP) were recorded continuously from 64 scalp electrodes (following the standard 10-20 placement) with a bandwidth of 0.5-100 Hz and digitized at a sampling rate of 512 Hz. Epochs (−200 to 800 ms relative to stimulus onset) were constructed off-line. Trials with blinks and large eye movements were rejected off-line on the basis of horizontal electro-oculogram (HEOG) and vertical electro-oculogram (VEOG). No systematic differences in HEOG or VEOG were seen across conditions (artifact rejection window of ±100 μV). An artifact criterion of ±100 μV was used at all other electrode sites to reject trials with excessive EMG or other noise transients from −100 ms pre-stimulus to 450 ms post-stimulus. For average files, baselines were corrected to zero over the −100 to 0 ms latency range. Average waveform files were filtered off-line using a 0.5- to 45-Hz zero-phase-shift band-pass digital filter with a roll-off of 96 dB/octave. All data were collected using a BIOSEMI (Amsterdam, Netherlands) 64-channel electrode array (band-pass filter setting 0.01-100 Hz), and stimuli were presented using Presentation software (www.neurobs.com). The BIOSEMI system uses active electrodes, with a common mode sense (CMS) active electrode and a driven right leg (DRL) passive electrode. Post-collection processing was performed using SCAN (Neuroscan) and BESA software. Statistical analysis was conducted off-line using SPSS software.
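The epoch-level steps described above (baseline correction over −100 to 0 ms, rejection at ±100 μV between −100 and 450 ms, then averaging) can be sketched for a single channel as below. This is a minimal illustration and not the SCAN/BESA pipeline the authors used: it assumes each epoch is a list of microvolt samples spanning −200 to 800 ms at 512 Hz, and it assumes that baseline correction is applied before the rejection threshold, an ordering the text does not specify. All function names are hypothetical.

```python
SAMPLE_RATE_HZ = 512      # digitization rate (from the text)
EPOCH_START_MS = -200     # epoch start relative to stimulus onset
BASELINE_MS = (-100, 0)   # baseline window (from the text)
REJECT_MS = (-100, 450)   # artifact inspection window (from the text)
REJECT_UV = 100.0         # artifact criterion in microvolts


def ms_to_index(t_ms):
    """Sample index within the epoch for a latency given in ms."""
    return int(round((t_ms - EPOCH_START_MS) * SAMPLE_RATE_HZ / 1000.0))


def baseline_correct(epoch):
    """Subtract the mean of the baseline window from every sample."""
    start, stop = ms_to_index(BASELINE_MS[0]), ms_to_index(BASELINE_MS[1])
    baseline = sum(epoch[start:stop]) / (stop - start)
    return [v - baseline for v in epoch]


def is_artifact(epoch):
    """True if any sample in the inspection window exceeds the criterion."""
    start, stop = ms_to_index(REJECT_MS[0]), ms_to_index(REJECT_MS[1])
    return any(abs(v) > REJECT_UV for v in epoch[start:stop])


def average_clean_epochs(epochs):
    """Baseline-correct, drop artifact epochs, and average the rest.

    Returns (average_waveform, number_of_epochs_kept).
    """
    corrected = [baseline_correct(e) for e in epochs]
    kept = [e for e in corrected if not is_artifact(e)]
    if not kept:
        raise ValueError("all epochs were rejected")
    average = [sum(column) / len(kept) for column in zip(*kept)]
    return average, len(kept)
```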

STATISTICAL ANALYSIS
In order to maximize the net amplitude of all subtraction waveforms over the fronto-central site (FZ) for statistical analysis, all 64-channel data were re-referenced to an average of the left and right mastoids (Kujala et al., 2007). Given that the standard and deviant tones often differed in overall energy for many of the conditions, MMN subtraction waveforms were derived by comparing ERP responses to deviant stimuli in one run to ERP responses to the same stimulus type in an alternate run. As we illustrate in Table 2, these "like from like" subtractions were arranged so as to address our questions as to whether MMN differences would be observed when contrasting (1) FM tones and no-FM tones, (2) FM stimuli with similar carrier and modulation frequencies but differing modulation depths, and (3) tones with similar modulation frequencies but differing carrier frequencies and modulation depths.
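The two operations in this paragraph reduce to sample-wise arithmetic on averaged waveforms. The sketch below assumes each waveform is a list of equal length and that the comparison waveform is the response to the physically identical tone when it served as the standard in another run; the function and variable names are hypothetical.

```python
def rereference_to_mastoids(fz, m1, m2):
    """Re-reference an FZ waveform to the average of the two mastoids."""
    return [f - (a + b) / 2.0 for f, a, b in zip(fz, m1, m2)]


def like_from_like(as_deviant, as_standard):
    """Subtraction waveform for one tone: its ERP as a deviant in one run
    minus its ERP as a standard in an alternate run."""
    return [d - s for d, s in zip(as_deviant, as_standard)]


# Toy three-sample waveforms (arbitrary illustrative values, in microvolts).
fz_deviant = rereference_to_mastoids([1.0, -2.0, 0.5],
                                     [0.5, 0.5, 0.5],
                                     [1.5, 0.5, 0.5])
fz_standard = rereference_to_mastoids([1.0, 0.0, 0.5],
                                      [1.0, 0.0, 0.5],
                                      [1.0, 0.0, 0.5])
difference = like_from_like(fz_deviant, fz_standard)
```

Because both terms of the subtraction come from the same physical stimulus, any remaining difference cannot be attributed to acoustic energy differences between standard and deviant tones.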
For each subtraction waveform, the maximum negative peak within a latency range of 110-185 ms at electrode FZ was entered and tested off-line for significance via separate pairwise t-tests for each of the aforementioned contrasts (see Table 2). This latency window for peak detection was employed for all contrasts except for the contrast examining FM tone deviance within a no-FM context. In this condition, we hypothesized that the MMN might be significantly delayed given that deviance onset occurs somewhat later so that modulation can be detected. Therefore, we shifted the peak detection window by 40 ms, to 150-225 ms post-stimulus onset. Statistical assessment of post-experiment affective classification of FM tones was conducted using a chi-square test in SPSS. All statistical testing used an alpha criterion of p < 0.05.
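The peak measure and test statistic described above can be sketched as follows. The latency windows come from the text; the epoch timing (−200 ms start, 512 Hz) follows the data collection section. Testing the per-subject peak amplitudes against zero with a one-sample t statistic is an assumption about how MMN presence was assessed, and the function names are hypothetical. No p-value is computed here because the standard library has no t distribution.

```python
import math
import statistics

SAMPLE_RATE_HZ = 512
EPOCH_START_MS = -200


def ms_to_index(t_ms):
    """Sample index within the epoch for a latency given in ms."""
    return int(round((t_ms - EPOCH_START_MS) * SAMPLE_RATE_HZ / 1000.0))


def peak_negativity(diff_wave, window_ms=(110, 185)):
    """Return (amplitude, latency_ms) of the most negative sample
    inside the latency window (inclusive at both ends)."""
    start, stop = ms_to_index(window_ms[0]), ms_to_index(window_ms[1])
    idx = min(range(start, stop + 1), key=lambda i: diff_wave[i])
    latency_ms = idx * 1000.0 / SAMPLE_RATE_HZ + EPOCH_START_MS
    return diff_wave[idx], latency_ms


def one_sample_t(values, mu=0.0):
    """Return (t, df) for the null hypothesis that the mean equals mu."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - mu) / standard_error, n - 1
```

For the FM-deviant-in-no-FM contrast, the shifted window would be passed explicitly, as in `peak_negativity(wave, window_ms=(150, 225))`. With 13 subjects the statistic has 12 degrees of freedom, which matches the values reported in the results.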

CONTROL CONDITION: FREQUENCY MODULATION (FM_no-FM) VS. NO FREQUENCY MODULATION (no-FM_FM)
We observed significant mismatches for both FM deviants within no-FM standards [FM_no-FM; t(1,12) = 3.6, p = 0.003] as well as the reverse condition [no-FM_FM; t(1,12) = 6.0, p < 0.0001], with the FM_no-FM mismatch peak occurring significantly later than the no-FM_FM peak [t(1,12) = 5.6, p < 0.0001; Figure 3, Table 2]. A comparison of subtraction waveforms for FM tones in the deviant and standard position within a no-FM context (FM_no-FM) and no-FM tones in deviant and standard positions within an FM context (no-FM_FM) revealed no significant amplitude difference [t(1,12) = 1.2, p = 0.26].

FIGURE 2 | Stimuli extracted from Happy (left) and No-expression (right) prosodic stimuli.
Top panel represents grand average waveforms at electrodes FZ, right (M1) and left (M2) mastoid for both standard (blue) and deviant (red) waveforms for FM tones. Bottom row represents grand average waveforms that were re-referenced to average mastoids for statistical comparisons of net amplitude differences between conditions. Bottom panel represents voltage topographies of subtraction waveforms (Note: isotemporal lines extending beyond electrode placement areas are an artifact of the topography generation platform).
A comparison between these mismatch subtraction waveforms revealed that the subtraction waveform generated by a modulation depth increase (MD+_MD−) was larger than that generated by a modulation depth decrease (MD−_MD+). This effect reached only trend-level significance [t(1,12) = −2.1, p = 0.057].

POST RECORDING AFFECTIVE JUDGMENTS
After completing the EEG recordings, subjects were presented the four tones they heard in a randomized order and asked to affectively label the sounds as angry, sad, happy, or no-expression.
A chi-square test of the emotional attribution pattern of subject ratings for the four FM tones indicated a response pattern that deviated significantly from chance [χ²(9) = 71.3, p < 0.0001]. As Figure 5 indicates, subjects endorsed each tone differently. The tone with high carrier frequency and high modulation depth (378 Hz, 169 Hz) was identified as happy by 84.6% of subjects. The no-FM tone (378 Hz, no FM) was identified as neutral by 92.3% of subjects. Finally, the majority of subjects recognized tones with high (378 Hz) and low (178 Hz) carrier frequencies but equivalent and low modulation depths (23 Hz) as sad and angry, respectively.
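The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below assumes that the chi-square test was run on a 4 × 4 table of tones by emotion labels and that all 13 subjects contributed ratings; both are inferences from the reported degrees of freedom and percentages rather than statements in the text. The names are hypothetical.

```python
N_SUBJECTS = 13   # subjects reported in the text
N_TONES = 4       # tones rated after recording
N_LABELS = 4      # angry, sad, happy, no-expression


def chi_square_df(n_rows, n_cols):
    """Degrees of freedom for a test of independence on an r x c table."""
    return (n_rows - 1) * (n_cols - 1)


def endorsing_subjects(percent, n=N_SUBJECTS):
    """Whole number of subjects corresponding to a reported percentage."""
    return round(percent / 100.0 * n)


df = chi_square_df(N_TONES, N_LABELS)
happy_n = endorsing_subjects(84.6)
neutral_n = endorsing_subjects(92.3)
```

On these assumptions the table has 9 degrees of freedom, and the reported percentages correspond to 11 of 13 subjects labeling the 378/169 Hz tone as happy and 12 of 13 labeling the no-FM tone as neutral.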

DISCUSSION
Our primary goal was to examine whether FM tones extracted from the fundamental frequency parameters can reliably reflect the emotional intent presented in the actual prosodic stimuli and whether these stimuli elicit significant MMNs that can be used to evaluate underlying processes. In order not to compromise MMN generation, subjects were only asked to characterize the emotion of the FM tones after EEG recording. Behaviorally, post hoc affective judgments revealed a significant overall pattern for affective judgments for our stimuli. Eighty-four percent of our subjects correctly identified the FM tone based on prosodic happiness. Contrary to our expectations, however, the FM tone based on neutral (no-expression) prosody was judged by the majority of subjects to sound angry. This may be due to the presence of our additional control stimuli, notably the no-FM tone (378/0/0), in our post hoc affective judgment task. Nevertheless, retrospectively this is consistent with the concept of "cold" vs. "hot" anger, with cold anger being conveyed by stimuli with low mean pitch but moderate pitch variability (Leitman et al., 2010). Ninety-two percent of subjects rated the no-FM tone as sounding neutral. The second control stimulus (378/23/3) was a hybrid, composed of the carrier frequency abstracted from the happy speech stimuli and the small modulation depth of the neutral/angry speech stimulus. The majority of subjects judged this stimulus as sounding sad. These results suggest that it may be possible to reliably abstract basic emotions using simple FM tones and to use such stimuli for research into basic brain mechanisms underlying prosodic evaluation. We are currently attempting to map the emotionality of FM space in a more systematic manner, to permit more systematic evaluation of deficits in disorders associated with prosodic impairments (Kantrowitz et al., 2011; Leitman and Janata, unpublished data).
In ERP studies, MMN-like responses were observed even when "like from like" analyses were used to compare responses to the same stimulus elicited in different contexts. A comparison of the MMNs elicited by the FM tones based on prosodic happiness (378/169/3) with those elicited by no-emotion (178/23/3) stimuli indicated that these stimuli both elicited equivalent MMNs. By contrasting across stimuli blocks using "like from like" subtractions we were able to conclude that the auditory deviance detection indexed in the MMN is not attributable to acoustical differences in standard and deviant tones, but rather to the automatic comparison between standard and deviant tones in a sequence.
In contrasting our FM stimuli with our control no-FM stimulus, we found that both FM and no-FM deviants elicited MMNs. These MMNs did not differ in amplitude, but the FM deviant peak (FM_no-FM) occurred later. This latency shift likely reflects the fact that modulation of the FM tone develops progressively after stimulus onset, thereby delaying the point of deviance detection.
Our findings diverge from those of the one published paper to date on the topic by Bishop et al. (2005) (Rauschecker, 1998). Conceivably, then, a within-run comparison of no-FM deviants and FM standards reflects a deviant waveform representing neuronal populations that respond to no-FM tones and a standard waveform representing these neurons as well as additional neurons that are tuned to FM, thereby obscuring any deviance-related negativity. A final comparison of FM deviance in which carrier and modulation frequencies are held constant and only modulation depth changes indicated MMN asymmetry, with increasing modulation deviance (MD+), but not decreasing modulation deviance (MD−), eliciting a significant MMN. This tendency for larger MMNs to deviants that increase rather than decrease in modulation features was also observed in the CMD+ vs. CMD− contrast. There, however, the difference was not significant. The reason for this suggested asymmetry is unclear, but perhaps MMN deviance detection is more sensitive to relational changes to the auditory environment that increase or add rather than decrease.
This study was a preliminary attempt to examine whether emotional intent could be reflected in a simple, stationary signal that incorporates two pitch parameters. Significant MMNs indicate that these distinctions can be processed pre-attentively. This study had a number of limitations. Examination of voltage topographies to the FM happy and the MD+ topography suggests stronger right hemisphere (RH) FM MMN generation. This would be consistent with literature suggesting a RH preference for slow modulation signals (Zatorre et al., 2002) and a RH dominance in emotional prosody (Ross, 1981; Schirmer and Kotz, 2006). Future investigations using source localization and comparative study with MEG or fMRI will be necessary to confirm such hemispheric asymmetry. Such a study is indeed under way. In terms of our goal in abstracting emotion using simple FM tones, our results, while encouraging, are still preliminary. More systematic multidimensional mapping of emotion FM space in terms of carrier frequency, modulation depth, and modulation frequency is clearly needed before any firm conclusions can be drawn. Further investigations should also determine the number of categorically perceived emotions that can be represented. Unpublished data (Kantrowitz et al., 2011; Leitman and Janata, unpublished data) in our lab suggest that emotions distinctly cluster along differing portions of FM multidimensional space.
The major utility of such systematic mapping of affective judgments would be to provide a flexible and sensitive tool to examine the relationship between auditory pitch perception and emotional judgment, potentially benefiting clinical investigations of dysprosodia in illnesses such as parkinsonism, schizophrenia, and autism.

An additional utility of such tasks would be as behavioral electrophysiological probes for developmental learning disabilities that involve abnormal phonological processing, like some forms of dyslexia. It has recently been suggested that such disabilities arise from improper temporal sampling of speech signals (Goswami, 2011; Goswami et al., 2011), and the inability to perceive changes of amplitude (rise time) within the speech amplitude envelope. These deficits are particularly pronounced for slow modulations of <4 Hz, roughly the syllabic rate or period of conversational speech. Such slow temporal information [delta (1.5 Hz+) and theta (3-10 Hz)] is preferentially processed by the right superior temporal gyrus (rSTG), with the left superior temporal gyrus (lSTG) favoring higher beta and gamma frequencies (15+ and 30+ Hz, respectively; Zatorre et al., 2002; Poeppel et al., 2008; Goswami, 2011). Mapping FM tones three-dimensionally in terms of high and low modulation frequencies, modulation depths, and carrier frequencies could thus provide a useful tool to characterize abnormal rise time perception and temporal sampling, providing a quantitative neuropsychological index for phonological processing abnormalities in developmental learning disabilities.

CONCLUSION
Previously we (Leitman et al., 2009) and others (Kujala et al., 2005; Schirmer et al., 2005) have demonstrated that prosodic perception begins quite early in auditory processing and can be indexed by the MMN. These studies, however, have used either real speech or tonal contours of real speech. A central question to understanding auditory processing of prosody is what minimal amount of information is needed to convey affective distinctions and, pertinently, whether the temporal sequence or progression of the prosodic signal is necessary for rapid classification of emotions. Here we observe that stationary FM tones discriminate emotions as well as elicit significant MMNs, providing preliminary evidence that a simple representation of mean fundamental frequency and its variation over time is sufficient to characterize emotional distinctions and process them automatically.