Original Research ARTICLE
Front. Hum. Neurosci., 26 February 2010 | https://doi.org/10.3389/fnhum.2010.00019
“It’s not what you say, but how you say it”: a reciprocal temporo-frontal network for affective prosody
Department of Psychiatry-Neuropsychiatry Program, Brain Behavior Laboratory, University of Pennsylvania School of Medicine, Philadelphia, PA, USA
Department of Psychiatry and Behavioral Sciences, UC Davis Imaging Research Center, University of California at Davis, Sacramento, CA, USA
Department of Psychology, Stockholm University, Stockholm, Sweden
Program in Cognitive Neuroscience and Schizophrenia, The Nathan S. Kline Institute for Psychiatric Research, Orangeburg, NY, USA
Department of Psychiatry, New York University School of Medicine, New York, NY, USA
Department of Radiology, University of Pennsylvania School of Medicine, Philadelphia, PA, USA
Humans communicate emotion vocally by modulating acoustic cues such as pitch, intensity and voice quality. Research has documented how the relative presence or absence of such cues alters the likelihood of perceiving an emotion, but the neural underpinnings of acoustic cue-dependent emotion perception remain obscure. Using functional magnetic resonance imaging in 20 subjects, we examined a reciprocal circuit consisting of superior temporal cortex, amygdala and inferior frontal gyrus that may underlie affective prosodic comprehension. Results showed that increased saliency of emotion-specific acoustic cues was associated with increased activation in superior temporal cortex [planum temporale (PT), posterior superior temporal gyrus (pSTG), and posterior middle temporal gyrus (pMTG)] and amygdala, whereas decreased saliency of acoustic cues was associated with increased inferior frontal activity and temporo-frontal connectivity. These results suggest that sensory-integrative processing is facilitated when the acoustic signal is rich in affective information, yielding increased activation in temporal cortex and amygdala. Conversely, when the acoustic signal is ambiguous, greater evaluative processes are recruited, increasing activation in inferior frontal gyrus (IFG) and IFG-STG connectivity. Auditory regions may thus integrate acoustic information with amygdala input to form emotion-specific representations, which are evaluated within inferior frontal regions.
When we communicate vocally, it is often not just what we say – but how we say it – that matters. For example, in expressing joy our voices become increasingly melodic, while our voicing of sadness is more often flat and monotonic. Such prosodic aspects of speech precede formal language acquisition, reflecting the evolutionary importance of communicating emotion (Fernald, 1989 ).
Vocal communication of emotion results from gestural changes of the vocal apparatus that, in turn, cause collinear alterations in multiple features of the speech signal such as pitch, intensity, and voice quality. There are relatively distinct patterns of such acoustic cues that differentiate between specific emotions (Banse and Scherer, 1996 ; Cowie et al., 2001 ; Juslin and Laukka, 2003 ). For example, anger, happiness, and fear are typically characterized by high mean pitch and voice intensity, whereas sadness expressions are associated with low mean pitch and intensity. Also, anger and happiness expressions typically have large pitch variability, whereas fear and sadness expressions have small pitch variability. Regarding voice quality, anger expressions typically have a large proportion of high-frequency energy in the spectrum, whereas sadness has less high-frequency energy (as the proportion of high-frequency energy increases, the voice sounds sharper and less soft). We present the first study to experimentally examine neural correlates of these acoustic cue-dependent perceptual changes.
We employed a parametric design, using emotional vocal stimuli with varying degrees of acoustic cue saliency to create graded levels of stimulus-driven prosodic ambiguity. A vocal stimulus with high cue salience has high levels of acoustic cues that are typically associated with the vocal expression of a particular emotion and presents an acoustic signal rich in affective information, whereas a vocal stimulus with low cue salience has low levels of the relevant acoustic cues and is more ambiguous. We generated a four-choice vocal emotion identification task (anger, fear, happiness and no expression) to examine how acoustic-cue level impacts affective prosodic comprehension. As our independent variable, we used the acoustic cue which best correlated with performance on the emotion identification task – this cue served as a proxy for “cue saliency”. For happiness and fear, we utilized pitch variability – the standard deviation of the fundamental frequency (F0SD) as a cue salience proxy, and for anger we used proportion of high-frequency spectral energy [i.e. elevated ratios of energy above vs. below 500 Hz (HF500)]. These cues are important predictors of recognition of the respective emotions (Banse and Scherer, 1996 ; Juslin and Laukka, 2001 ; Leitman et al., 2008 ) and pitch variability and spectral energy ratios are important for emotion categorization (Ladd et al., 1985 ; Juslin and Laukka, 2001 ; Leitman et al., 2008 ).
For each emotion, our vocal stimuli set contained stimuli exhibiting a wide range of the emotion-relevant cue. We then examined behavioral performance and brain activation parametrically across each emotion as a function of this cue level change across items. We hypothesized that variation in cue salience level would be reflected in activation levels within a reciprocal temporo-frontal neural circuit as proposed by Schirmer and Kotz (2006) and others (Ethofer et al., 2006 ). F0SD as a proxy for cue salience in fear and happiness allowed further differentiation: Saliency-related performance increases are expected to positively correlate with pitch variability (F0SD) for happy stimuli, and negatively correlate with F0SD for fear stimuli. Therefore, a similar activation pattern for increasing cue saliency for both happiness and fear would suggest that the activation observed relates to emotional salience as predicted, rather than to pitch variation alone.
The proposed temporo-frontal network that we expect to be affected by changes in cue saliency is grounded in neuroscience research. Initial lesion studies (Ross et al., 1988 ; Van Lancker and Sidtis, 1993 ; Borod et al., 1998 ) linked affective prosodic processing broadly to right hemispheric function (Hornak et al., 1996 ; Ross and Monnot, 2008 ). More recent neuroimaging studies (Morris et al., 1999 ; Adolphs et al., 2001 ; Wildgruber et al., 2005 ; Ethofer et al., 2006 ; Wiethoff et al., 2008 , 2009 ) related prosodic processing to a distributed network including: posterior aspects of superior and middle temporal gyrus (pSTG, pMTG), inferior frontal (IFG) and orbitofrontal (OFC) gyri, and sub-cortical regions such as basal ganglia and amygdala. In current models (Ethofer et al., 2006 ; Schirmer and Kotz, 2006 ), affective prosodic comprehension has been parsed into multiple stages: (1) elementary sensory processing (2) temporo-spectral processing to extract salient acoustic features (3) integration of these features into the emotional acoustic object, and (4) evaluation of the object for meaning and goal relevance. Together these processing stages comprise a circuit with reciprocal connections between nodes.
Prior neuroimaging studies compared prosodic vs. nonprosodic tasks (e.g., Mitchell et al., 2003), or prosodic identification of emotional vs. neutral stimuli (e.g., Wiethoff et al., 2008), and thereby identified a set of brain regions likely involved in affective prosody. Based on knowledge of functional roles of temporal cortex and IFG (‘reverse inference’; Poldrack, 2006; Van Horn and Poldrack, 2009), it was assumed that temporal cortex mediates sensory-integrative functions while IFG plays an evaluative role (Ethofer et al., 2006; Schirmer and Kotz, 2006). However, these binary ‘cognitive subtraction’ designs did not permit a direct demonstration of the distinct roles of temporal cortex versus IFG.
Our parametric design, using stimuli varying in cue salience to create varying levels of stimulus-driven prosodic ambiguity, has two major advantages over prior study designs. First, analysis across varying levels of an experimental manipulation allows more robust and interpretable results linking activation to the manipulated variable than designs that utilize a binary comparison. Second, the parametric manipulation of cue saliency should produce a dissociation in the relationship of sensory vs. evaluative regions to the manipulated cue level. This allows direct evaluation of the hypothesis that IFG plays an evaluative role distinct from the sensory-integrative role of temporal cortex.
We hypothesized that during a simple emotion identification task, the presence of high levels of affectively salient cues within the acoustic signal should facilitate the extraction and integration of these cues into a percept that would be reflected in temporal cortex activation increases. We also hypothesized that increased cue saliency would correlate with amygdala activation. Amygdala activation is correlated with perceived intensity in non-verbal vocalizations (Fecteau et al., 2007 ; Bach et al., 2008b ). Such activity may reflect automatic affective tagging of the stimulus intensity level (Bach et al., 2008a ,b ). Conversely, we predicted that decreasing cue saliency would be associated with increasing IFG activation, reflecting increased evaluation of the stimuli for meaning (Adams and Janata, 2002 ) and difficulty in selecting the proper emotion (Thompson-Schill et al., 1997 ). We thus expected that increased activation in this evaluation and response selection region (IFG) would be directly associated with decreased activity in feature extraction and integration regions (pSTG and pMTG). Thus, our parametric design aimed to characterize a reciprocal temporo-frontal network underlying prosodic comprehension and examine how activity within this network changes as a function of cue salience.
Informed consent was obtained from 20 male right-handed subjects with a mean age of 28 ± 5 years, 14.9 ± 2 years of education, and no reported history of psychopathology or hearing loss. One subject did not complete the scanning session due to a strong sensitivity to scanner noise. All procedures were conducted under the supervision of the local institutional review board.
Stimuli and Design
Recognition of emotional prosody was assessed using a subset of stimuli from Juslin and Laukka’s (2001) prosody task. The stimuli consisted of audio recordings of two male and two female actors portraying three emotions – anger, fear, happiness, as well as utterances with no emotional expression. The sentences spoken were semantically neutral and consisted of both statements and questions (e.g., “It is eleven o’clock”, “Is it eleven o’clock?”). All speakers were native speakers of British English; these stimuli have been used successfully with American subjects (Leitman et al., 2008). All stimuli were less than 2 s in length. Each emotion was represented by 8–10 exemplars that had unique acoustic properties that would reflect a particular level of cue salience for each emotion. These stimuli were repeated on average 5–7 times to yield 56 stimuli for each emotion. These stimuli were pseudo-randomly presented over four fMRI time series acquisitions (runs a–d) of 56 stimuli each, in such a manner that all runs were balanced for the type of sentence (question or statement), emotion, and gender of speaker.
For this stimulus set, measurement of all acoustic cues was conducted in PRAAT (Boersma, 2001) speech analysis software as described previously (Juslin and Laukka, 2001). F0SD was transformed to a logarithmic scale for all analyses as done previously (Leitman et al., 2008). Our initial choice of these particular cues as our proxies for cue salience (F0SD for happiness and fear, HF500 for anger) was based on our prior findings with the full Juslin and Laukka stimulus set. There we found that the F0SD ranges of happy and fear and the HF500 range for anger were statistically distinct from the other emotions as a whole (see Leitman et al., 2008 – Table 2) and that they provided the single strongest correlate of subject performance. For this study, due to time constraints, we reduced the response categories presented from six to four: anger, fear, happiness, or neutral. As Table 1 illustrates, in the present study the ranges for F0SD and HF500 for happiness and anger respectively are no longer statistically different from the three remaining emotions; nevertheless, they did remain the strongest single predictor of performance among the acoustic features measured. Note that we had no a priori hypotheses regarding the neutral stimuli, which were included in the experiment in order to give subjects the option not to endorse an emotion. Our prior study (Leitman et al., 2008) indicated that when the cue salience of an emotional stimulus was low, subjects often endorsed it as neutral. With the inclusion of neutral stimuli, we were additionally able to replicate the more conventional binary contrasts of emotional versus neutral prosody used in prior studies.
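As a rough illustration of the two cue-salience proxies, the sketch below computes a log-transformed F0SD from a pitch contour and an HF500 ratio from a power spectrum. It is not the PRAAT procedure used in the study. The function names, the base-10 logarithm, the use of the sample standard deviation, the convention that 0 Hz marks unvoiced frames, and the toy input values are all assumptions made for the example.

```python
import math
import statistics


def log_f0sd(f0_hz):
    """Log10 of the standard deviation of F0 over voiced frames.

    Assumption: frames with F0 == 0 are unvoiced and are skipped.
    """
    voiced = [f for f in f0_hz if f > 0]
    return math.log10(statistics.stdev(voiced))


def hf500(freqs_hz, power):
    """Ratio of spectral energy above 500 Hz to energy at or below 500 Hz."""
    high = sum(p for f, p in zip(freqs_hz, power) if f > 500)
    low = sum(p for f, p in zip(freqs_hz, power) if f <= 500)
    return high / low


# Hypothetical pitch contour (Hz) and spectrum bins for one utterance.
contour = [100.0, 0.0, 120.0, 140.0]
print(log_f0sd(contour))
print(hf500([100, 400, 600, 1000], [1.0, 1.0, 3.0, 1.0]))
```

A higher `log_f0sd` value corresponds to a more variable pitch contour, and a higher `hf500` value to a sharper-sounding voice, in line with the descriptions above.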
Table 1. Selected acoustic features of prosodic stimuli.
The task consisted of a simple forced-choice identification task and was presented in a fast event-related design whose timing and features are described in Figure 1 . This design used compressed image acquisition to allow for a silent period in which audio stimuli could be presented.
Figure 1. fMRI Paradigm. Subjects were placed in a supine position into the scanner and instructed to focus on a central fixation crosshair displayed via a rear-mounted projector [PowerLite 7300 video projector (Epson America, Inc., Long Beach, CA, USA)] and viewed through a head coil-mounted mirror. After sound offset, this crosshair was replaced with a visual prompt containing emoticons representing the four emotion choices and the corresponding response button number. Auditory stimuli were presented through pneumatic headphones and sound presentation occurred between volume collections to minimize any potential impact of scanner noise on stimulus processing.
Images were acquired on a clinical 3T Siemens Trio Scanner (Iselin, NJ, USA). A 5 min magnetization-prepared, rapid acquisition gradient-echo image (MPRAGE) was acquired for anatomic overlays of functional data and spatial normalization (Talairach and Tournoux, 1988). Functional BOLD imaging (Bandettini et al., 1992) used a single-shot gradient-echo (GE) echo-planar (EPI) sequence (TR/TE = 4000/27 ms, FOV = 220 mm, matrix = 64 × 64, slice thickness/gap = 3.4/0 mm). This sequence delivered a nominal voxel resolution of 3.4 × 3.4 × 3.4 mm. Thirty-four axial slices were acquired from the superior cerebellum up through the frontal lobe, aligning the slab orientation so that the middle slice was parallel to the lateral sulcus, in order to minimize signal drop-out in the temporal poles and ventral and orbitofrontal aspects of cortex. The extent of this scanning region is illustrated in Figure 2 along with a contrast of all stimuli > rest.
Figure 2. All stimuli > rest. Activation presented at an uncorrected p < 0.05 threshold. Grey shadow represents scanned regions of the brain.
The fMRI data were preprocessed and analyzed using FEAT (FMRI Expert Analysis Tool) Version 5.1, part of FSL (FMRIB’s Software Library, www.fmrib.ox.ac.uk/fsl ). Images were slice time corrected, motion corrected to the median image using tri-linear interpolation with 6 degrees of freedom, high pass filtered (120 s), spatially smoothed (8-mm FWHM, isotropic) and scaled using mean-based intensity normalization. Resulting translational motion parameters were examined to ensure that there was not excessive motion (in our data, all subjects exhibited less than 1 mm displacement in any plane). BET was used to remove non-brain areas (Smith, 2002 ). The median functional image was coregistered to the T1-weighted structural volume and then normalized to the standard anatomical space (T1 MNI template) using tri-linear interpolation (Jenkinson and Smith, 2001 ) and transformation parameters were later applied to statistical images for group-level analysis.
Variations in subject performance were examined using a general linear mixed effects model conducted with Stata 9.0 (StataCorp; College Station, TX, USA). In this model, subjects’ prosodic identification served as the outcome variable, subjects (n = 19) were treated as random effects, and fixed effects included fMRI runs (a–d) and cue saliency level (10 for happy and anger, 8 for fear, each level reflecting a unique stimulus). Adjustment for the clustering (repeated measures within individuals) was accomplished within the mixed model using the sandwich estimator approach, which is the default adjustment method for this program. The significance levels of individual model parameters were assessed using the F-test statistic, which was appropriately adjusted for the non-independence of the repeated measures within individuals, with an alpha criterion of p < 0.05.
Subject-level time-series statistical analysis was carried out using FILM (FMRIB’s Improved Linear Model) with local autocorrelation correction (Woolrich et al., 2001 ). Event-related first stage analysis was conducted separately for the four timeseries, modeling each of the four conditions (angry, happy, fear, neutral) against a canonical hemodynamic response function (HRF) and its temporal derivative.
In order to compare our results to those of prior studies (Wiethoff et al., 2008 , 2009 ) we contrasted anger, fear and happiness with neutral stimuli. In order to quantify the relationship of activation to parametrically varied cue saliency levels, we also included a parametric regressor – ZCUE – consisting of z-normalized values of the relevant cue value for each emotion (F0SD for fear and happy, HF500 for anger) across all emotions. A separate analysis was conducted for each of the three emotion conditions in which the HRF was scaled as a function of the relevant cue level for each stimulus (F0SD for fear and happy, HF500 for anger). These parametric regressors were orthogonalized relative to the fixed amplitude HRF regressor for the corresponding emotion, yielding a contrast that reflected cue level related variations above or below the average stimulus response.
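The construction of the parametric regressor can be sketched as follows: cue values are z-normalized and then orthogonalized against the fixed-amplitude regressor, so that the parametric term captures only variation around the average response. This is a simplified illustration on per-event amplitudes, not the FSL implementation. It omits convolution with the HRF, and the function names, the use of the population standard deviation, and the toy cue values are assumptions.

```python
import statistics


def zscore(values):
    """Z-normalize values (population standard deviation assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def orthogonalize(x, ref):
    """Remove from x its projection onto ref (one Gram-Schmidt step)."""
    beta = sum(a * b for a, b in zip(x, ref)) / sum(b * b for b in ref)
    return [a - beta * b for a, b in zip(x, ref)]


# Hypothetical cue levels for four events of one emotion.
cue = [1.0, 2.0, 3.0, 4.0]
fixed = [1.0] * len(cue)          # fixed-amplitude regressor (one unit per event)
zcue = zscore(cue)                # standardized cue level per event
parametric = orthogonalize(zcue, fixed)

# After orthogonalization the two regressors share no variance.
print(sum(p * f for p, f in zip(parametric, fixed)))
```

Because the fixed-amplitude regressor is constant across events in this toy case, orthogonalizing against it amounts to mean-centering the cue values, which z-normalization has already done.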
A second-level within-subject fixed effects analysis across all four runs was then conducted for each subject. The resulting single-subject contrast estimates were submitted to a third-level between-subjects (group) analysis employing FMRIB’s Local Analysis of Mixed Effects (FLAME) (Beckmann et al., 2003 ), which models inter-session or inter-subject random-effects components of the mixed-effects variance using Markov chain Monte Carlo sampling to estimate the true random-effects variance and degrees of freedom at each voxel (Woolrich et al., 2004 ).
As mentioned, saliency-related activation for happy stimuli was positively related to pitch variability (F0SD), while saliency-related activation for fear stimuli was negatively related to F0SD. In order to illustrate that activation changes correlating with cue level within our ROIs reflect emotion-specific changes and not directional changes in acoustic features, we conducted a conjunction analysis of happy and fear stimuli. This analysis examines correlated activation changes of increasing cue saliency (increasing F0SD for happiness, decreasing F0SD for fear) or decreasing cue saliency (decreasing F0SD for happiness, increasing F0SD for fear) within these emotions jointly.
Statistical significance was based on both voxel height and spatial extent in the whole brain, using AFNI AlphaSim to correct for multiple comparisons by Monte Carlo simulation (10,000 iterations, voxel height threshold p < 0.01 uncorrected, cluster probability p < 0.01). This whole-brain correction required a minimum cluster size of 284 voxels (2 × 2 × 2 mm each). Given the small size of the amygdala (319 voxels for both amygdalae combined) and our a priori prediction of amygdala involvement, this cluster threshold was deemed inappropriate for detecting amygdala activity. We therefore repeated the above AlphaSim correction using a mask restricted to the amygdala as defined anatomically by a standardized atlas (Maldjian et al., 2003), yielding a cutoff of >31 voxels.
Anatomical regions within significant clusters were identified using the Talairach atlas (Talairach and Tournoux, 1988), with supplemental divisions for regions like planum temporale (PT) and IFG-pars triangularis delineated using the Harvard-Oxford atlas created by the Harvard Center for Morphometric Analysis, and the WFU PickAtlas (Maldjian et al., 2003), respectively. Using the cluster tool (FSL), we identified local maxima with connectivity of 26 voxels or more within these anatomical regions.
To assess the degree of lateralization within auditory regions for our cue × emotion interactions we adopted a method akin to one used previously by Obleser et al. (2008). We contrasted activity within right and left structural ROIs containing PT, pSTG, and pMTG by calculating a lateralization quotient index (LQ). We used “Energy” as an activation measure, which takes into account both amplitude and spatial extent (Gur et al., 2007). Energy is calculated as: Energy = mean BOLD % signal change × number of voxels, where % signal change was calculated using FSL’s Featquery tool from voxels greater than our chosen voxel height threshold (overall whole brain p < 0.01). Thus,
where k = number of voxels.
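As a rough illustration of the Energy measure and of a lateralization quotient built from it, the sketch below assumes the conventional normalized-difference form, (left − right) / (left + right). That form and its sign convention are assumptions, chosen so that negative values indicate right lateralization, as in the values reported in the Results. The function names and the toy voxel values are hypothetical.

```python
def energy(pct_signal_change):
    """Energy = mean % signal change x number of suprathreshold voxels (k)."""
    k = len(pct_signal_change)
    if k == 0:
        return 0.0
    mean_change = sum(pct_signal_change) / k
    return mean_change * k


def lateralization_quotient(left_voxels, right_voxels):
    """Assumed form: (E_left - E_right) / (E_left + E_right).

    Negative values indicate right lateralization under this convention.
    """
    e_left = energy(left_voxels)
    e_right = energy(right_voxels)
    return (e_left - e_right) / (e_left + e_right)


# Hypothetical suprathreshold voxels: twice as many on the right as on the left.
left = [1.0, 1.0]
right = [1.0, 1.0, 1.0, 1.0]
print(lateralization_quotient(left, right))
```

Under this form the quotient is bounded between −1 and 1 as long as both Energy values are non-negative, which makes values comparable across emotions with different overall activation levels.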
As in Obleser et al. (2008) , we used a jackknife procedure (Efron and Tibshirani, 1993 ) to determine the reliability of our emotion × cue effects, rerunning the model n times (n = 19, the number of our participants) each time omitting a different participant. This procedure resulted in n models with n-1 subjects, which, unlike lateralization analysis based on single subjects, preserved the advantages of second level modeling such as greatly increased signal to noise ratio.
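The leave-one-out logic of the jackknife can be sketched as below. In the study each of the n refits is a full group-level model; here a simple mean stands in for that model, and the function name, the standard-error formula shown (the usual jackknife estimate), and the toy per-subject values are assumptions made for illustration.

```python
import math
import statistics


def jackknife(values, estimator):
    """Return leave-one-out estimates and the jackknife standard error."""
    n = len(values)
    # Recompute the estimate n times, each time omitting one participant.
    loo = [estimator(values[:i] + values[i + 1:]) for i in range(n)]
    mean_loo = statistics.fmean(loo)
    se = math.sqrt((n - 1) / n * sum((x - mean_loo) ** 2 for x in loo))
    return loo, se


# Hypothetical per-subject values; the mean stands in for the group model.
subject_values = [1.0, 2.0, 3.0, 4.0]
loo_estimates, se = jackknife(subject_values, statistics.fmean)
print(loo_estimates, se)
```

For the mean, the jackknife standard error coincides with the ordinary standard error of the mean, which gives a convenient sanity check on the procedure.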
Psychophysiological interaction (PPI) analysis (Friston et al., 1997) was used to evaluate effects of cue salience on the functional connectivity of right IFG with other regions in our affective prosodic model. PPI examines changes in the covariation of BOLD signal between brain regions in relation to the experimental paradigm. IFG was chosen as a seed region because we wished to clarify its role in prosodic “evaluation”, which should increase with decreasing cue saliency. The mean time series was extracted from an 8-mm-radius sphere within the right IFG seed region, centered on the coordinates (MNI = 50, 22, 20) where the peak effect was observed in our initial parametric analysis of cue salience within each emotion. Using FSL FEAT and following the method of Friston et al. (1997), we created a regression model employing regressors reflecting the standardized estimate (Z score) of cue saliency for each cue by emotion (ZCUE), the mean time series of our rIFG sphere, and the ZCUE × time series interaction (the PPI regressor of interest). Additionally, we included mean global (whole brain) time series, slice time correction, and motion in our model to reduce non-specific sources of time series correlation.
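The core of the PPI model is the interaction term, which can be sketched as the element-wise product of the psychological variable (ZCUE) and the seed time series. The example below is a simplified illustration and not the FSL FEAT implementation. It mean-centers both inputs and ignores hemodynamic convolution, and the function names and toy values are assumptions.

```python
import statistics


def center(x):
    """Subtract the mean so the interaction is not dominated by offsets."""
    mean = statistics.fmean(x)
    return [v - mean for v in x]


def ppi_regressor(psych, seed):
    """Interaction term: centered psychological variable x centered seed signal."""
    return [p * s for p, s in zip(center(psych), center(seed))]


# Hypothetical values sampled at four time points.
zcue = [-1.0, 1.0, -1.0, 1.0]     # cue saliency (psychological variable)
seed = [2.0, 4.0, 6.0, 8.0]       # mean time series of the seed sphere
interaction = ppi_regressor(zcue, seed)
print(interaction)
```

In a full model the psychological regressor, the seed time series, and this interaction term enter together, so that the interaction coefficient reflects how coupling with the seed changes with cue saliency beyond the two main effects.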
Emotion identification accuracy was well above chance for all four emotional categories (Figure 3A). Examination of identification rates within each emotion as a function of cue level revealed that the identification of anger stimuli significantly increased as a function of HF500 [F(1, 1041) = 101.08, p < 0.0001] (Figure 3B). An inverse correlation indicated that decreasing F0SD was associated with increased identification of fearful stimuli [F(1, 1037) = 12.32, p < 0.0005] (Figure 3C), while identification of happy prosodic stimuli significantly increased as a function of F0SD [F(1, 1056) = 28.45, p < 0.0001] (Figure 3D). Although the experiment was divided into four runs (a–d), there was no effect of run number on performance for any of the emotions (all p’s > 0.19).
Figure 3. Identification performance as a function of acoustic cue saliency levels. (A) Mean performance across all emotion choices; error bars reflect standard error of the mean of the raw data. White dotted line indicates chance performance. (B) Anger: as HF500 increases accuracy increases. (C) Fear: as F0SD decreases accuracy increases. (D) Happiness: as F0SD increases accuracy increases.
All emotions > neutral
A contrast of emotional prosody versus neutral prosody revealed increased activation to emotional prosody in a cluster spanning Heschl’s gyrus and posterior and middle portions of superior and middle temporal gyrus (pSTG, mSTG, pMTG) as well as clusters in inferior frontal (IFG) and orbitofrontal gyri (OFC) (Figure 4 and Table 2). Additional activation clusters were observed in anterior and middle portions of cingulate gyrus as well as sub-cortically within insula, caudate and thalamus. No activation within amygdala was observed even at reduced significance thresholds (uncorrected p < 0.05).
Figure 4. All emotions > neutral. A subtraction of neutral activation from all emotions (anger, fear and happiness) indicates activation clusters bilaterally in posterior superior/middle temporal gyrus (pSTG/pMTG), inferior frontal gyrus (IFG) and orbitofrontal cortex (OFC). The markers in red illustrate differences between this contrast and the subsequent parametric analysis: Arrow = OFC activation; * = thalamic activation; circles = absence of amygdala activation bilaterally.
Table 2. Mean cluster location and local maxima of BOLD signal change for all emotions > neutral.
All emotions × cue saliency
A voxel-wise examination of ZCUE-correlated activation patterns for all emotions (anger, fear and happiness) revealed activation clusters spanning PT, pSTG, pMTG, and IFG that were modulated by cue saliency level (Figure 5A). Increasing cue saliency (increasing ZCUE) correlated with activation in PT, pSTG and pMTG. Conversely, decreasing cue saliency (decreasing ZCUE) was associated with IFG activation. Further, in contrast to the all emotions > neutral contrast, small volume analysis of amygdala revealed bilateral activation clusters that correlated with increasing cue saliency.
Figure 5. Cue saliency-correlated activation patterns, by emotion. (A) Correlation with a standardized estimate (ZCUE) of cue saliency across all emotions revealed increased PT, pSTG and pMTG activation as cue saliency increased (red), and conversely, increased bilateral IFG activation as cue saliency decreased (blue). (B) A similar pattern was observed for anger as HF500 increased/decreased. (C) A conjunction analysis of increasing cue saliency (increasing F0SD) for happy and increasing cue saliency (decreasing F0SD) for fear yielded a similar pattern. (D) Uncorrected p < 0.05 maps of F0SD-modulated activity for happy (left) and fear (right) indicate activation clusters spanning pSTG, amygdala and IFG. For happiness, increasing F0SD (red) is associated with activation increases in pSTG and amygdala, while decreasing F0SD (blue) is associated with increasing IFG activation. The reverse pattern is seen for fear: decreasing F0SD is associated with activation increases in pSTG and amygdala, while increasing F0SD is associated with increasing IFG activation.
Beyond these a priori ROIs, increasing cue saliency positively correlated with activation in posterior cingulate gyrus (pCG) bilaterally, right precuneus, and anterior-medial portions of paracingulate gyrus (Brodmann’s areas 23, 7 and 32 respectively) (Table 3 ).
Table 3. Mean cluster location and local maxima of BOLD signal change for all emotions × cue saliency correlations.
Anger × HF500
Activation to anger stimuli was significantly modulated by HF500 level (Figure 5 B). Increasing cue saliency (greater HF500) was associated with bilateral clusters of activation spanning PT, STG, and MTG. In contrast, decreasing cue saliency (lower HF500) was associated with increased bilateral IFG activation. Within amygdala, small volume correction indicated activation clusters that were associated with increasing cue saliency.
Beyond these a priori ROIs, increasing cue saliency (here HF500) in anger stimuli positively correlated with activation in pCG and precuneus (Table 4). Decreasing cue saliency correlated with activation in anterior cingulate (AC), left globus pallidus, and right caudate and insula.
Table 4. Mean cluster location and local maxima of BOLD signal change for Anger × cue saliency correlations.
Conjunction analysis of fear and happiness × F0SD
Similarly, for fear and happiness, F0SD-correlated activation patterns were observed in clusters spanning PT, pSTG, MTG, amygdala and IFG that were modulated by cue saliency level (Figure 5 C). Increasing cue saliency (increasing F0SD for happiness, decreasing F0SD for fear) correlated with activation in PT, pSTG, pMTG and amygdala. Conversely, decreasing cue saliency (decreasing F0SD for happiness, increasing F0SD for fear) was associated with right IFG activation.
Beyond these a priori regions of interest, increasing cue saliency for fear and happy stimuli positively correlated with activation in anterior and ventral aspects of left MTG (Brodmann’s areas 20, 34 and 24), bilateral pCG, and right supramarginal gyrus, right postcentral gyrus, right insula and right precuneus (Table 5 ).
Table 5. Mean cluster location and local maxima of BOLD signal change for Happy and Fear Conjunction × cue saliency correlations.
These overall activation patterns observed in the conjunction analysis of happiness and fear were also seen within each emotion individually, albeit at a reduced significance threshold (see Figure 5 D).
Analysis of hemispheric laterality for fear and happiness, incorporating both activation magnitude and spatial extent, indicated that PT, pSTG and MTG activation was robustly right-lateralized [LQ = −0.11 ± 0.02; t(1, 17) = −21.0, p < 0.0001]. A similar assessment for anger × cue indicated slight left lateralization [LQ = 0.02 ± 0.01; t(1, 17) = −12.3, p < 0.0001].
An examination of the psychophysiological interaction between ZCUE and right IFG activity indicated robust negative interactions centered in bilateral pSTG (Figure 6 ). This interaction suggests that the functional coupling of rIFG and STG/MTG significantly increases as ZCUE decreases.
Figure 6. Psychophysiological interaction (PPI). This functional connectivity analysis map illustrates the negative interaction between ZCUE and the mean time series of the IFG seed region (red sphere). This map indicates that functional connectivity between IFG and auditory processing regions is significantly modulated by cue saliency: Decreasing cue saliency increases IFG-STG functional coupling, while increasing cue saliency decreases this coupling.
We approached affective prosodic comprehension from an object-based perspective, which characterizes affective prosodic processing as a reciprocal circuit comprising sensory, integrative, and cognitive stages (Schirmer and Kotz, 2006 ). Our model locates sensory-integrative aspects of prosodic processing in posterior STG and MTG, while higher-order evaluation occurs in IFG. Sensory-integrative processing should be robust when the prosodic signal is rich in the acoustic cues that typify the affective intent (high cue saliency), yielding increased PT, pSTG, and pMTG activation. Such integration may be facilitated by amygdala. Conversely, when the prosodic signal is ambiguous (low cue saliency), greater evaluative processes are recruited, increasing activation in IFG.
We tested this model by capitalizing on prior observations that acoustic cues, namely pitch variability (F0SD) and high-frequency spectral energy (HF500), correlate with the identification of specific emotions. We conducted a prosody identification task in which the stimuli varied parametrically in their cue salience. Our results were highly consistent with model predictions.
Activation Related to Saliency of Emotion-Specific Acoustic Cues
Consistent with our hypothesis, increased cue saliency was associated with right lateralized BOLD signal increases in PT, pSTG, pMTG and amygdala, as well as additional regions not included in our a priori model. Similarly, Wiethoff et al. (2008) reported pSTG activation to emotional prosody relative to neutral prosody that was abolished after covarying for acoustic features such as F0SD and decibel level. This effect is consistent with our findings: A comparison between a contrast of all emotion > neutral and our maps of emotions × cue saliency revealed a high degree of overlap in pSTG, where increasing cue saliency produced correlated activation increases. We posit that these changes reflect increased facilitation in the extraction and integration of acoustic cues that characterize the emotion.
Again as predicted, decreased cue saliency was associated with increased activation in IFG (as well as anterior cingulate for anger, which was not part of our model). This activity, we propose, reflects increasing evaluation of the stimulus because ambiguity increases the difficulty of response selection.
These effects of salience were similar across the three emotions we examined, but depended on emotion-specific acoustic cues. Thus, saliency-related activation for happy stimuli was positively related to pitch variability (F0SD), negatively related to F0SD for fear stimuli, and positively associated with HF500 for anger stimuli. This emotion-specific effect is highlighted by the conjunction analyses combining fear and happy conditions, where the same acoustic cue (F0SD) produces opposite saliency effects. When the conjunction combined positive parametric effects of F0SD across happy stimuli and negative parametric effects of F0SD across fear stimuli, the predicted saliency patterns were robust. In contrast, in a control conjunction analysis (see Figure 7 ), examining effects of F0SD independent of emotion (positive parametric effect across both happy and fear conditions), an unrelated pattern emerged. This pattern suggests that effects within auditory sensory regions are not due to pitch variability change alone. Rather, these auditory regions code acoustic features in an emotion-specific manner when individuals are engaged in vocal affect perception.
Figure 7. Control conjunction analyses. Increasing or decreasing F0SD across fear and happiness does not reveal activation in STG, IFG or amygdala at uncorrected p < 0.05 threshold.
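The logic of the saliency and control conjunctions can be sketched as follows. This is an illustrative sketch only: it assumes a minimum-statistic conjunction (a voxel survives only if every contributing map exceeds threshold), which this section does not confirm as the method used, and the t-values, threshold, and function names are hypothetical.

```python
def conjunction(map_a, map_b, threshold):
    """Minimum-statistic conjunction (assumed method): a voxel survives
    only if the smaller of its two statistics exceeds the threshold."""
    return [min(a, b) > threshold for a, b in zip(map_a, map_b)]


def negate(stat_map):
    """Flip the sign of a map to test the negative parametric effect."""
    return [-v for v in stat_map]


# Hypothetical voxelwise t-values for the parametric effect of F0SD
t_happy = [4.2, 3.9, 0.5, 3.5]
t_fear = [-3.8, -4.1, 0.3, 3.6]

# Saliency conjunction: positive F0SD effect for happy AND negative for fear
saliency = conjunction(t_happy, negate(t_fear), threshold=3.1)

# Control conjunction: positive F0SD effect for both emotions
control = conjunction(t_happy, t_fear, threshold=3.1)
```

Here the first two voxels survive only the saliency conjunction, and the last voxel survives only the control conjunction, mirroring the distinction between emotion-specific cue coding and a general response to pitch variability.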
A comparison of our parametric model (Figure 5) with a standard binary contrast of all emotions > neutral (Figure 4) revealed a high degree of overlap in activation in temporal and inferior frontal regions. However, the all emotions > neutral contrast (Figure 4, red markers) also indicated activation clusters in ventral IFG/OFC and thalamus that were not present in our cue salience parametric model, even at reduced thresholds. This effect suggests that the modulation of evaluation resulting from stimulus-driven ambiguity may be restricted to portions of the frontal prosodic processing circuit. The absence of modulation of thalamic activity by cue salience suggests that such modulation may only begin at the corticolimbic level.
Notably, cue salience increases resulted in correlated activation increases in the amygdala that were not observed in a contrast of all emotion > neutral (Figure 4 ). An examination of all stimuli versus rest also failed to indicate significant amygdala activation (Figure 2 ) even at p < 0.05 uncorrected.
The literature regarding the role of amygdala in prosody and non-verbal vocalizations is mixed, with some studies (Phillips et al., 1998; Morris et al., 1999; Sander et al., 2005; Fecteau et al., 2007; Ethofer et al., 2009a; Wiethoff et al., 2009) indicating a role for the amygdala and others not (Grandjean et al., 2005; Mitchell and Crow, 2005). A number of studies (Morris et al., 1999; Adolphs, 2002) suggest that the amygdala may preferentially activate during implicit tasks and become deactivated during explicit tasks, whereas other studies have indicated the opposite (Gur et al., 2002; Habel et al., 2007) or that amygdala activation may decrease over the duration of the experiment due to habituation (Wiethoff et al., 2009). Our results suggest that during explicit identification the amygdala may be sensitive to the degree of cue salience in the prosody. This sensitivity may relate to increasing arousal engendered by cue salience as well as the fact that identification accuracy was considerably higher for high than for low cue saliency stimuli. Indeed, a study of facial affect has shown that identification accuracy is associated with increased amygdala activation (Gur et al., 2007). Thus, amygdala activation may reflect some form of concurrent visceral or automatic recognition of emotion that may facilitate explicit evaluation.
Functional Integration within the Affective Prosody Circuit
To examine how cue salience modulates the functional coupling between IFG and other regions in the prosody network, we also conducted a psychophysiological interaction (PPI) functional connectivity analysis (Friston et al., 1997 ). The IFG timeseries was positively correlated with the regions in the model including auditory cortex and amygdala (not shown), demonstrating the expected functional connectivity within the network. Also consistent with our hypothesis, we found that IFG-STG connectivity was significantly modulated by cue saliency. As cue saliency decreased, IFG-STG coupling increased; in contrast, as cue saliency increased, IFG-STG coupling diminished.
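The core idea of a PPI regression can be sketched with simulated data. This is a simplified illustration, not the analysis pipeline used here: it forms the interaction term directly from a mean-centered seed timeseries and a mean-centered psychological variable, omitting the hemodynamic deconvolution and convolution steps that standard fMRI packages apply. All names (`seed`, `zcue`, `ppi`) and all simulated coefficients are hypothetical.

```python
import math


def center(values):
    """Subtract the mean so the interaction is not confounded with main effects."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


n = 200
# Simulated seed (e.g., IFG) timeseries and cue saliency values
seed = center([math.sin(0.37 * t) + 0.5 * math.cos(1.3 * t) for t in range(n)])
zcue = center([((7 * t) % 11 - 5) / 5 for t in range(n)])
ppi = [s * z for s, z in zip(seed, zcue)]  # psychophysiological interaction term

# Simulated target (e.g., STG): coupling with the seed weakens as saliency rises
target = [1.0 + 0.8 * s + 0.2 * z - 0.5 * p for s, z, p in zip(seed, zcue, ppi)]

# Design matrix: intercept, physiological, psychological, interaction
X = [[1.0, s, z, p] for s, z, p in zip(seed, zcue, ppi)]
betas = ols(X, target)
ppi_beta = betas[3]
```

A negative `ppi_beta` corresponds to the pattern described above: seed-target coupling is stronger when cue saliency is low and weaker when it is high, over and above the main effects of the seed timeseries and of saliency itself.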
A Dynamic Causal Modeling study (Ethofer et al., 2006 ) suggested that bilateral IFG regions receive parallel input from right temporal cortex during prosodic processing. Our results build on this finding, demonstrating that temporal auditory processing regions and inferior frontal evaluative regions exhibit a reciprocal interaction, whose balance is determined by the degree of cue presence that typified the emotion. When this cue saliency is low, evaluation of the stimulus and selection of the appropriate response become more difficult.
These observations demonstrate the integrated action of regions within a functional circuit. They support the view that in affective prosodic identification tasks IFG is involved in evaluation (Adams and Janata, 2002 ; Wildgruber et al., 2004 ) and response selection (Thompson-Schill et al., 1997 ), increasing top down modulation on auditory sensory-integrative regions in temporal cortex when stimuli are more ambiguous.
While the reciprocal effects of salience in our a priori regions were similar across all three emotions, parametric modulation of HF500 for anger yielded a bilateral response that was slightly left lateralized in contrast to the expected strongly right-predominant response seen for happy and fear. Prosodic identification is considered to be a predominantly right-hemisphere process (Ross, 1981 ; Heilman et al., 1984 ; Borod et al., 1998 ). Several considerations may explain the bilateral effects seen for anger. First, voice quality as indexed by HF500 is highly correlated with decibel level (here r = 0.83). While spectral changes appear to predominantly engage right auditory cortex, intensity or energy changes are likely reflected in auditory cortex bilaterally (Zatorre and Belin, 2001 ; Obleser et al., 2008 ). However, Grandjean et al. (2005) observed bilateral activation to anger seemingly independent of isolated acoustic cues such as intensity. This finding suggests that the lateralization of affective prosody is emotion specific.
Limitations and Future Directions
Our study had several limitations. First, we parametrically varied cue levels using non-manipulated speech stimuli. This enhances ecological validity, and we chose cues (F0SD, HF500) that are tightly linked to the relevant emotions and which maximally differentiated emotions portrayed in our stimulus set [see (Leitman et al., 2008 ) for details]. However, in natural speech stimuli these cues are also correlated with other acoustic features that result from the vocal gestural changes eliciting the particular cue change. These additional features could contribute to the observed relationship between our selected cues and variation in performance and neural activity. Future studies could employ synthetic stimuli that can permit precise and independent modulation of one cue at a time.
Second, while fMRI provides high spatial resolution, its relatively low temporal resolution cannot capture many details of temporally complex and dynamic processes contributing to prosody identification. Electrophysiological studies indicate prosodic distinctions occurring at multiple timepoints, ranging from ∼200 ms in mismatch studies to ∼400 ms in N400 studies, supporting a multi-stage “objects” model of prosodic processing (Schirmer and Kotz, 2006). Combining EEG and fMRI may provide a more complete description of prosodic circuit function and allow us to discriminate processes we could not distinguish in the current study, such as feature extraction vs. feature integration.
Third, while our model incorporates the regions and processes most prominently implicated in prosodic identification, it is not comprehensive. The whole-brain analysis identified additional areas, such as posterior cingulate (pCG), whose activity varied with cue salience, as well as reinforcement-sensitive regions, such as caudate and insula. Prior studies have suggested that pCG and insula activation increases during prosodic processing are linked to increased sensory integration of acoustic cues such as F0 modulation (Hesling et al., 2005). Our finding that cue saliency increases correlate with pCG and insula activation increases strongly supports this assertion. The exact role of pCG in facilitating sensory integration is not known, but perhaps this region serves to coordinate STG integration of acoustic features between hemispheres. Future models of prosodic processing should incorporate insula and pCG more thoroughly.
Fourth, our population sample was limited to right-handed males, in order to avoid variation in prosodic processing and general language processing known to result from differences in handedness or gender (Schirmer et al., 2002 , 2004 ). Future studies will need to examine factors such as handedness, gender and IQ directly and determine their impact on different processing stages within the model.
The purpose of our study was to explore the neural representation of acoustic-cue dependent perceptual change in affective prosody across all emotions. To accomplish this we formally tested a proposed multi-stage model of affective prosody that parses such perception into sensory-integrative and cognitive-evaluative stages. Consistent with our hypothesis, parametric manipulation of cue saliency revealed a reciprocal network underlying affective prosodic perception. Temporal auditory regions, which process acoustic features more generally, here in conjunction with amygdala, process acoustic features in an emotion-specific manner. This processing and its subsequent evaluation for meaning is modulated by inferior frontal regions, such that when the signal is ambiguous (as in the case of low cue-saliency), information processing in auditory regions is augmented by increased recruitment of top-down resources. While the current study identified responses to emotional salience common to multiple emotions, our results are not meant to suggest that there are no emotion-specific differences in neural activation between emotions. Indeed, recent work by Ethofer et al. (2009b) indicates that individual emotions may have spatially distinct representations in STG. Our findings converge with theirs in suggesting that temporal cortex is the locus for complex acoustic analysis that differentiates emotions. Such analysis in all likelihood involves the extraction and integration of acoustic cue patterns that typify differing affective intent. Based on our data we propose that when such cues are ambiguous, increasing frontal-evaluative resources are employed in making affective distinctions. Furthermore, we did find emotion-specific, correlated activation changes with cue salience in brain regions beyond our a priori ROIs.
For happiness and fear, but not anger, increasing cue salience correlated with activation increases in insula, caudate, uncus, parahippocampal gyrus and putamen (Table 2 ). For anger, increasing cue salience correlated with increasing activation in differing portions of insula, caudate and globus pallidus (Table 3 ). These results suggest that specific emotional prosodic distinctions, like their facial affective counterparts, elicit distinct sub-cortical patterns of responses.
Finally, the model we describe may have clinical utility for psychiatric and neurological disorders associated with dysprosodia, such as Parkinsonism, autism, and schizophrenia. For example, prior studies have demonstrated strong links between dysprosodia in schizophrenia and pitch perception deficits (Leitman et al., 2005 , 2007 , 2008 ; Matsumoto et al., 2006 ). Our parametric fMRI approach should allow examination of the degree to which schizophrenia dysprosodia stems from failures in temporal-lobe mediated extraction and integration of prosodic cues versus prefrontal evaluative dysfunction.
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
This work was supported in part by NARSAD, NIH grants MH060722, Prodev-CAMRIS. The authors thank Dr. Mark Elliot for his assistance in implementing the MRI sequences, Dr. Warren Bilker and Ms. Colleen Brensinger for their help with the statistical modeling, and Ms. Kosha Ruparel for her suggestions regarding the fMRI analysis.