Multistage linguistic conditioning of convolutional layers for speech emotion recognition

Introduction: The effective fusion of text and audio information for categorical and dimensional speech emotion recognition (SER) remains an open issue, especially given the vast potential of deep neural networks (DNNs) to provide a tighter integration of the two.

Methods: In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional SER. We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a DNN, and contrast it with a single-stage one where the streams are merged at a single point. Both methods depend on extracting summary linguistic embeddings from a pre-trained BERT model, and conditioning one or more intermediate representations of a convolutional model operating on log-Mel spectrograms.

Results: Experiments on the MSP-Podcast and IEMOCAP datasets demonstrate that the two fusion methods clearly outperform a shallow (late) fusion baseline and their unimodal constituents, both in terms of quantitative performance and qualitative behaviour.

Discussion: Overall, our multistage fusion shows better quantitative performance, surpassing alternatives on most of our evaluations. This illustrates the potential of multistage fusion in better assimilating text and audio information.


Introduction
Automatic emotion recognition (AER) is an important component of human-computer interfaces, with applications in health and wellbeing, multimedia information retrieval, and dialogue systems. Human emotions are expressed in, and can accordingly be identified from, different modalities, such as speech, gestures, and facial expressions [1,2]. Over the years, different modalities have proven more conducive to the recognition of different emotional states. For example, video has shown better performance at valence recognition, whereas acoustic cues are better indicators of arousal [2]. A plethora of previous works have thus investigated different ways to improve AER by combining several information streams: e.g. audio, video, text, gestures, and physiological signals. Underlying these computational approaches are distinct emotion theories, with the two most commonly-used ones being discrete (or basic) emotion theories [3] and dimensional ones [4].
In the present contribution, we are primarily interested in speech emotion recognition (SER). Speech, as the primary mode of human communication, is also a major conduit of emotional expression [2]. Accordingly, SER has been a prominent research area in affective computing for several years [5]. Although it constitutes a single modality, it is often analysed as two separate information streams: a linguistic one (what has been said) and a paralinguistic one (how it has been said) [2,5,6].
However, the two streams are not independent. Previous studies have established that the interaction between acoustic descriptors and emotional states depends on the linguistic content of an utterance [7]. Moreover, text information is better suited for valence and audio for arousal recognition [2]. As a result, several works have attempted to more tightly integrate the two streams in an attempt to model their complex interrelationship and obtain more reliable recognition performance [1,2,6].
Most prior fusion approaches rely on recurrent architectures (e.g. long short-term memory networks (LSTMs)) operating on expert-based acoustic descriptors [9,10]. Whereas such methods have a long history in the field of SER, in recent years convolutional neural networks (CNNs) operating on raw audio or low-level features have been shown capable of learning good representations that lead to better performance [12,13,14]. Thus, combining multistage fusion with the representational power of CNNs for auditory processing is a natural next step in the attempt to closely integrate the acoustic and linguistic information streams.
Inspired by such works, we propose a novel, CNN-based multistage fusion architecture where summary linguistic embeddings extracted from a pre-trained language model are used to condition multiple intermediate layers of a CNN operating on log-Mel spectrograms. We contrast it with a single-stage deep neural network (DNN) architecture, where the two information streams are fused at a single point, and a late fusion method, where unimodal models are trained independently and their decisions are aggregated.
The rest of this contribution is organised as follows. We begin by presenting an overview of related works in Section 2, with an emphasis on single-stage and multistage fusion approaches. We then describe our architecture in Section 3 and our experimental settings in Section 4. Results are reported in Section 5 and Section 6 for categories and dimensions, respectively. We end with a conclusion in Section 7.

Related work
In this section, we present an overview of related works. We begin with a brief overview of the state-of-the-art in unimodal, either text- or acoustic-based, AER methods. We then present an overview of fusion methods. For a more in-depth, recent survey on bimodal SER, we refer the reader to Atmaja et al. [6].

Unimodal Speech Emotion Recognition
The goal of audio-based SER systems is to estimate the target speaker's emotion by analysing their voice [5]. For paralinguistics, this is traditionally handled by extracting a set of low-level descriptors (LLDs), such as Mel frequency cepstral coefficients (MFCCs) or pitch, which capture relevant cues [15]. In recent years, there has been a rise in deep learning architectures based on CNNs [12] or transformers [16] which outperform the more traditional approaches.
Early works on text-based emotion recognition utilised affective lexica, such as the WordNet-Affect dictionary [17], to generate word-level scores, which could then be combined using expert rules to derive a sentence-level prediction [18]. In recent years, the use of learnt textual representations, like word2vec [19], has largely superseded these methods. Here too, attention-based models (transformers) [20] have shown exceptional performance on several natural language processing (NLP) tasks. These models are usually pre-trained on large, unlabelled corpora using some proxy task, e.g. masked language prediction [21], which enables them to learn generic text representations.

Bimodal Speech Emotion Recognition
Early work in multimodal fusion has primarily followed the shallow fusion paradigm [22,23]. Several early systems depended on hand-crafted features, usually higher-level descriptors (HLDs), extracted independently for each modality, which were subsequently processed by a fusion architecture adhering to the early or late fusion paradigm. Early fusion corresponds to feeding the HLDs from both information streams as input to a single classifier. Late fusion, on the other hand, is achieved by training unimodal classifiers independently, and subsequently aggregating their predictions. The aggregation can consist of simple rules (e.g. averaging the predictions) or be delegated to a cascade classifier [24].
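As a concrete illustration, late fusion by simple averaging can be sketched in a few lines. The probability vectors below are hypothetical stand-ins for the outputs of a text classifier and an audio classifier over the four emotion classes used later in this work:

```python
import numpy as np

# Late (decision-level) fusion sketch: unimodal classifiers are trained
# independently and their class probabilities are averaged at test time.
CLASSES = ["angry", "happy", "neutral", "sad"]

def late_fusion(p_text, p_audio):
    """Aggregate unimodal predictions by simple averaging."""
    p = (np.asarray(p_text) + np.asarray(p_audio)) / 2.0
    return CLASSES[int(np.argmax(p))], p

label, p = late_fusion([0.1, 0.6, 0.2, 0.1],   # text model favours "happy"
                       [0.2, 0.3, 0.4, 0.1])   # audio model favours "neutral"
# averaged probabilities: [0.15, 0.45, 0.3, 0.1] -> "happy"
```

A trained cascade classifier, as mentioned above, would simply replace the averaging step with a learnt aggregation function over the concatenated unimodal predictions.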
With the advent of deep learning (DL), traditional, shallow fusion methods have been superseded by end-to-end multimodal systems [25] where the different modalities are processed by jointly-trained modules. We differentiate between single-stage and multistage fusion. Single-stage fusion is the natural extension of shallow fusion methods, where the different modalities are first processed separately by independent differentiable modules [10,26]. Although these methods have consistently outperformed both the unimodal baselines and shallow fusion alternatives, they nevertheless build on independently learnt unimodal representations constructed by modules agnostic to the presence of other modalities. In an attempt to utilise the power of DL to learn useful representations after several layers of processing, the community has also pursued multistage fusion paradigms, where the processing of different modalities is intertwined in multiple layers of a DNN [11,27,28].
Our proposed multistage fusion mechanism draws inspiration from recent approaches in denoising [29,30] and speaker adaptation [31]. In general, all these approaches utilise two DNNs: one devoted to the primary task, and a second providing additional information through some fusion mechanism. The two networks can be trained independently or jointly. The fusion mechanisms utilised in these works all operate on the same principle: the embeddings produced by the secondary network modulate the output of several (usually all) convolution layers of the primary network, either by shifting or by a combination of scaling and shifting.

Proposed Method
The focus of the current contribution is on tightly integrating acoustic and linguistic information. This is achieved by proposing a multistage fusion approach where linguistic embeddings condition several intermediate layers of a CNN processing audio information. The overall architecture comprises two constituent networks: a (pre-trained) text-based model that provides linguistic embeddings, and an auditory CNN whose intermediate representations are conditioned on those embeddings.
The two unimodal architectures can, of course, be trained independently for emotion recognition and their predictions aggregated in a straightforward way (e.g. by averaging) to produce a combined output. This simple setup is used in some of our experiments as a shallow (late) fusion baseline. Additionally, the two information streams can be combined in single-stage fashion, with the linguistic embeddings fused at a single point with the embeddings generated by the CNN; this setup forms a deep fusion baseline with which to compare our method.
As our unimodal text model, we use BERT [21], a pre-trained model with a strong track record on several NLP tasks. As our baseline acoustic model, we use the CNN14 architecture introduced by Kong et al. [32], which was found to give good performance for emotional dimensions [33]. CNN14 consists of 12 convolution layers and 2 linear ones, with mean and max pooling following the last convolution layer to aggregate its information over the time and frequency axes. These unimodal networks form the building blocks of our fusion methods, which are illustrated in Figure 1. Both architectures first pass the text input through a pre-trained BERT model, and then use it to condition one or more layers of the CNN14 base architecture.
Our proposed multistage fusion method, shown in Figure 1a, relies on fusing linguistic representations, in the form of embeddings extracted from a pre-trained BERT model, with the intermediate layer outputs of an acoustics-based CNN model. Linguistic embeddings are computed by averaging the token embeddings returned by BERT for each utterance. Similar to prior works [29,31], we use the averaged embeddings to shift the intermediate representations of each block. Given an input X ∈ R^(T_in × F_in) to each convolution block, with T and F being the number of time windows and frequency bins, respectively, and the averaged BERT embeddings E_L, each block computes its output as

Y = ReLU(BN(CONV(X)) + E_P),

where ReLU stands for the rectified linear unit activation function [34], BN for batch normalisation [35], and CONV for 2D convolutions. The projection (PROJ) is implemented as a trainable linear layer which projects the input embeddings E_L to a vector, E_P ∈ R^(F_out), with the same dimensionality as the output feature maps:

E_P = PROJ(E_L) = W E_L + b,

where W and b are the trainable weight and bias terms, respectively. Thus, this conditioning mechanism is tantamount to adding a unique bias term to each output feature map of each convolution block.
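The conditioning mechanism described above can be sketched as a PyTorch module. The kernel size, channel counts, and embedding dimension below are illustrative assumptions rather than the exact CNN14 hyperparameters:

```python
import torch
import torch.nn as nn

class ConditionedConvBlock(nn.Module):
    """One convolution block whose output feature maps are shifted by a
    linear projection of the linguistic embedding (a sketch of the shift
    conditioning described above; layer sizes are illustrative)."""

    def __init__(self, in_ch, out_ch, embed_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(out_ch)
        self.proj = nn.Linear(embed_dim, out_ch)  # PROJ: E_L -> E_P

    def forward(self, x, e_l):
        e_p = self.proj(e_l)                # (batch, out_ch)
        h = self.bn(self.conv(x))
        # E_P acts as a per-feature-map bias, broadcast over time/frequency
        h = h + e_p[:, :, None, None]
        return torch.relu(h)

block = ConditionedConvBlock(in_ch=1, out_ch=64, embed_dim=768)
x = torch.randn(2, 1, 100, 64)   # (batch, channels, time windows, Mel bins)
e_l = torch.randn(2, 768)        # averaged BERT token embeddings
y = block(x, e_l)                # same spatial size, 64 conditioned maps
```

In the multistage architecture, each convolution block of the acoustic network would receive the same averaged embedding e_l in this fashion.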
The single-stage fusion architecture integrates acoustics and linguistics at a single point: immediately after the output of the last CNN14 convolution layer. The linguistic embeddings are first projected to the appropriate dimension, and then added to the acoustic representations produced by the convolution network. The single-stage fusion architecture is shown in Figure 1b.
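A minimal sketch of this single fusion point, with illustrative (not the actual CNN14) dimensionalities:

```python
import torch
import torch.nn as nn

# Single-stage fusion sketch: the linguistic embeddings are projected once
# and added to the pooled acoustic representation after the last
# convolution layer. All sizes below are illustrative assumptions.
acoustic = torch.randn(2, 2048)   # stand-in for pooled CNN14 output
e_l = torch.randn(2, 768)         # averaged BERT token embeddings

proj = nn.Linear(768, 2048)       # project embeddings to acoustic dim
fused = acoustic + proj(e_l)      # fusion at a single point
head = nn.Linear(2048, 4)         # e.g. 4-class emotion output
logits = head(fused)
```

The contrast with the multistage variant is thus only in where the addition happens: once after pooling here, versus inside every convolution block there.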

IEMOCAP
We use the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [36], a multimodal emotion recognition dataset collected from 5 pairs of actors, each acting a set of scripted and improvised conversations, resulting in a total of 10,039 utterances. It has been annotated for the emotional dimensions of arousal, valence, and dominance on a 5-point Likert scale, with individual ratings averaged over all annotators to produce the gold standard. It has also been annotated for the emotion categories of neutral, excited, surprised, happy, frustrated, sad, angry, afraid, and disgusted.
The dataset additionally contains gold standard transcriptions, which we use in our experiments. As IEMOCAP does not contain official train/dev/test splits, we follow the established convention of evaluating using leave-one-speaker-out (LOSO) cross-validation (CV) [26,28,37,38], where we use all utterances of each speaker once as the test set, each time using the utterances of their session partner as the validation set, resulting in a total of 10 folds.

As we have no ground truth transcriptions for MSP-Podcast, we generated them automatically using an open-source implementation of DeepSpeech2 [40]. Whereas other works have used more advanced, proprietary models [41], we opted for a widely-used open-source alternative for reproducibility. However, this model achieves a worse automatic speech recognition (ASR) performance than that of proprietary models. As it has been shown by several previous works that the performance of text-based and fusion approaches improves with better ASR models [42,43], we expect our method to yield correspondingly better results with improved transcriptions, and do not consider this a critical limitation of our work.
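The LOSO fold construction can be sketched as follows. The speaker identifiers mimic IEMOCAP's session naming convention but are used here purely for illustration:

```python
# LOSO cross-validation folds (sketch): 5 sessions with two speakers each;
# every speaker serves once as the test set, with their session partner as
# the development set and all remaining speakers used for training.
speakers = [(f"Ses0{s}F", f"Ses0{s}M") for s in range(1, 6)]

def loso_folds(sessions):
    folds = []
    for a, b in sessions:
        for test, dev in ((a, b), (b, a)):
            train = [s for pair in sessions for s in pair
                     if s not in (test, dev)]
            folds.append({"test": test, "dev": dev, "train": train})
    return folds

folds = loso_folds(speakers)   # 10 folds, 8 training speakers per fold
```

Note that the development speaker is excluded from training as well, so no session is shared between the training set and either evaluation set.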

Experimental procedure
As discussed in Section 1, we consider both a categorical and a dimensional model of emotion. For discrete emotion recognition, most works on the two datasets considered here pursue a 4-class classification problem, utilising the emotion classes of {angry, happy, neutral, sad}, while further fusing the emotion class of excited with that of happy for IEMOCAP [6]. To make our results comparable, we follow this formulation as well. For these experiments, we report unweighted average recall (UAR, %), the standard evaluation metric for this task, which also accounts for class imbalance, and additionally show confusion matrices. To mitigate the effect of class imbalance, which is particularly pronounced for MSP-Podcast, we use a weighted variant of cross-entropy, where the loss for each sample is weighted by the inverse frequency of the class it belongs to. For dimensional SER, where we have continuous values for the dimensions of arousal, valence, and dominance, we formulate our problem as a standard regression task, evaluate based on the concordance correlation coefficient (CCC), the standard evaluation metric for dimensional emotion [44,45], and also train with the CCC loss [44,45].
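Both the inverse-frequency weighting and the CCC loss are standard constructions; the following NumPy sketch illustrates them (our own minimal formulation, not the exact training code):

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    """Per-class weights proportional to inverse class frequency."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    freqs = counts / counts.sum()
    return 1.0 / freqs

def ccc(y_true, y_pred):
    """Concordance correlation coefficient between two 1-D arrays."""
    mt, mp = y_true.mean(), y_pred.mean()
    cov = ((y_true - mt) * (y_pred - mp)).mean()
    return 2 * cov / (y_true.var() + y_pred.var() + (mt - mp) ** 2)

def ccc_loss(y_true, y_pred):
    """CCC loss: perfect agreement gives 0; disagreement approaches 1 (or more)."""
    return 1.0 - ccc(y_true, y_pred)

labels = np.array([0, 0, 0, 0, 0, 0, 1, 2, 2, 3])   # imbalanced toy labels
weights = inverse_frequency_weights(labels, n_classes=4)
# minority classes receive larger weights than the majority class
```

Unlike the MSE, the CCC loss penalises both a mean shift and a variance mismatch between predictions and targets, which is why it is favoured for dimensional emotion modelling.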
However, multi-tasking potentially entangles the three dimensions and therefore complicates our analysis. Moreover, whereas the CCC loss is widely used for emotional dimension modelling, it is not the standard loss for regression tasks. Thus, we begin by considering single-task models trained with the standard mean squared error (MSE) loss.
We perform our experiments by separately training on IEMOCAP and MSP-Podcast. As mentioned, we perform 10-fold LOSO CV for the former and use the official train/dev/test partitions for the latter. We also report cross-domain results.
Cross-domain results are obtained by evaluating models trained on one dataset on the other. For MSP-Podcast, where a single model is trained, we evaluate it on the entire IEMOCAP dataset. For IEMOCAP, where 10 models are trained for each experiment, we evaluate all 10 of them on the test set of MSP-Podcast, and compute the average performance metric for each task. As the emotional dimensions of the two datasets are annotated with different scales (5-point scale for IEMOCAP and 7-point scale for MSP-Podcast), we evaluate cross-corpus performance using the Pearson correlation coefficient (PCC) instead of CCC, as the former is unaffected by differences in scale.
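The motivation for using PCC cross-corpus is easy to verify numerically: under an affine change of scale, PCC is preserved while CCC is penalised. A small sketch with synthetic ratings:

```python
import numpy as np

# Synthetic ratings and the same ratings on a shifted, wider scale,
# mimicking the 5-point vs 7-point annotation mismatch.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
y_rescaled = 1.5 * y + 1.0

def pcc(a, b):
    """Pearson correlation coefficient."""
    return np.corrcoef(a, b)[0, 1]

def ccc(a, b):
    """Concordance correlation coefficient."""
    cov = ((a - a.mean()) * (b - b.mean())).mean()
    return 2 * cov / (a.var() + b.var() + (a.mean() - b.mean()) ** 2)

# PCC is unchanged by the affine rescaling (stays at 1.0);
# CCC drops because it penalises the mean shift and variance mismatch.
```

This is exactly the property needed when the two corpora annotate the same construct on different Likert scales.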
For each dataset and task, we thus always perform the following experiments:

• CNN14: the unimodal, acoustics-only baseline,
• SFCNN14: our single-stage fusion architecture,
• MFCNN14: our multistage fusion architecture.
All models are trained for 60 epochs with a learning rate of 0.01 and a batch size of 64 using stochastic gradient descent (SGD) with a Nesterov momentum of 0.9.We select the model that performed best on each respective validation set.
In order to avoid statistical fluctuations due to random seeds, we run each experiment 5 times and report mean and standard deviation.
Additionally, for some configurations, we fine-tune the pre-trained BERT model, a practice which has recently emerged as the standard linguistic baseline, showing strong performance on several NLP tasks. This particular configuration is not considered in all our experiments, as the number of different formulations examined made the computational cost of fine-tuning several BERT models prohibitive. Specifically, we train 5 models with different random seeds for categorical emotion recognition and for multi-tasking on the emotional dimensions of MSP-Podcast. We omit single-task experiments.
As the pre-trained model, we selected bert-base-uncased as distributed by Huggingface, with a final linear layer. As in the other experiments, we use the weighted cross-entropy loss for classification, the MSE loss for single-task regression, and the mean CCC loss averaged over all targets for multi-task regression. For all conditions, the maximum token length was set to 128, and the batch size to 32. For fine-tuning, we chose the Adam optimiser with fixed weight decay (learning rate: 2e-5, betas: 0.9 and 0.999, epsilon: 1e-6, weight decay: 0.0, no bias correction), and a linear schedule with 1000 total and 100 warmup steps. We trained for 4 epochs with early stopping based on a UAR or CCC decrease on the development set for classification and regression, respectively.
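The optimiser and schedule stated above can be reproduced with plain PyTorch; this sketch uses torch.optim.AdamW and a LambdaLR in place of the transformers scheduling helper, and a small linear layer as a stand-in for the BERT model:

```python
import torch

# Stand-in for BERT plus the final linear layer; only the optimiser and
# learning-rate schedule are the point of this sketch.
model = torch.nn.Linear(768, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5,
                              betas=(0.9, 0.999), eps=1e-6, weight_decay=0.0)

WARMUP, TOTAL = 100, 1000

def lr_lambda(step):
    if step < WARMUP:
        return step / WARMUP                            # linear warmup
    return max(0.0, (TOTAL - step) / (TOTAL - WARMUP))  # linear decay to 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

lrs = []
for step in range(TOTAL):
    optimizer.step()          # dummy step; gradients omitted in this sketch
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
# lrs rises linearly to 2e-5 at step 100, then decays linearly towards 0
```

The warmup phase avoids large, destabilising updates to the pre-trained weights during the first optimisation steps.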

Emotional categories
We begin by considering the 4-class emotion classification problem discussed in Section 4. Table 1 presents in- and cross-domain results on MSP-Podcast and IEMOCAP for both unimodal baselines and both fusion methods. Interestingly, CNN14 performs worse than BERT both on MSP-Podcast, with a UAR of 48.3 % vs 48.9 %, and on IEMOCAP (55.8 % vs 71.1 %), with BERT also achieving the best cross-corpus performance when training on IEMOCAP and evaluating on MSP-Podcast. This shows that linguistics carries more emotional information on both datasets, but more so for IEMOCAP. On the one hand, this could be due to the noisy transcriptions used for MSP-Podcast. On the other hand, this dataset is more naturalistic than IEMOCAP, where actors could have relied more on text for conveying their emotions, especially in the case of scripted conversations.

Bimodal fusion leads to consistently higher performance compared to CNN14. Both architectures perform significantly better than the unimodal audio baseline for both datasets, with SFCNN14 performing slightly better on MSP-Podcast, and MFCNN14 considerably outperforming it on IEMOCAP. Moreover, in the case of IEMOCAP, only MFCNN14 is better than BERT, whereas SFCNN14 is significantly worse. In terms of cross-corpus results, the two fusion models yield the same performance when trained on MSP-Podcast and tested on IEMOCAP, with MFCNN14 being better in the reverse setup. In both cases, performance is severely degraded, which illustrates once more the challenges associated with cross-corpus AER. Interestingly, the degradation is lower for models trained on MSP-Podcast, indicating that larger, naturalistic corpora can lead to overall better generalisation. Thus, while SFCNN14 outperforms MFCNN14 on MSP-Podcast, the latter shows better generalisation capabilities and better performance on IEMOCAP, indicating that this form of fusion has great potential.
Figure 2 additionally shows the confusion matrices for the best performing models (based on the validation set) on the MSP-Podcast test set. For both CNN14 and BERT, we observe poor performance, with large off-diagonal entries. Notably, BERT is better at recognising happy and neutral, while CNN14 is better for angry and sad. For CNN14, the most frequent misclassifications occur when happy is misclassified as angry and neutral as sad. The latter is particularly problematic, as more neutral samples are classified as sad (1596) than as neutral (1216).
These problems are largely mitigated through the use of multimodal architectures. The improvements introduced by both SFCNN14 and MFCNN14 are mostly concentrated on the happy and neutral classes, where the true positive rate (TPR) improves by +33/39 % and +47/27 %, respectively. However, the two architectures exhibit a different behaviour on their off-diagonal entries. MFCNN14 substantially worsens the false positives on the happy class, with a large increase in the amount of angry (+35 %), neutral (+84 %), and sad (+122 %) samples misclassified as happy. In contrast, SFCNN14 exhibits only a minor deterioration (+13 %) on neutral-to-happy misclassifications. This illustrates that, although the UAR of both models, as shown in Table 1, is comparable for MSP-Podcast, the more granular view provided by the confusion matrices clearly positions SFCNN14 as the winning architecture for this experiment.
Overall, our results clearly demonstrate that the proposed deep fusion methods can lead to substantial gains compared to their baseline, unimodal counterparts. MFCNN14 shows a more robust behaviour with respect to the different datasets, whereas SFCNN14 shows more desirable properties in terms of confusion matrices.

Emotional dimensions
After evaluating our methods on categorical emotion recognition, we proceed with modelling emotional dimensions. As discussed in Section 4, we begin with single-task models trained with an MSE loss in Section 6.1, which allows us to study the effects of fusion independently for each dimension. Then, in Section 6.2, we consider multi-task models trained with the CCC loss.

Single-task models
Our first experiments are performed on the emotional dimensions of MSP-Podcast and IEMOCAP. In Table 2, we report CCC and PCC results for in-domain and cross-domain performance, respectively. The performance of both fusion models is compared to that of the baseline CNN14 using two-sided independent sample t-tests.
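The significance test used throughout is straightforward to reproduce: a two-sided independent-samples t-test over the per-seed metrics of two models. The CCC values below are hypothetical stand-ins, not numbers from our tables:

```python
import numpy as np
from scipy import stats

# Hypothetical per-seed CCC values for two models (5 runs each).
ccc_cnn14   = np.array([0.660, 0.655, 0.662, 0.658, 0.664])
ccc_mfcnn14 = np.array([0.678, 0.674, 0.680, 0.676, 0.681])

# Two-sided independent-samples t-test, as used for the * / † markers.
t, p = stats.ttest_ind(ccc_cnn14, ccc_mfcnn14)
significant = p < 0.05
```

With only 5 runs per condition, the test has limited power, which is why consistent trends across datasets matter as much as individual significance markers.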
The best performance on valence, both in- and cross-domain, is achieved by MFCNN14, which reaches a mean CCC of .407 on MSP-Podcast and .664 on IEMOCAP. This is significantly higher than CNN14 and considerably outperforms SFCNN14, showing that multistage fusion can better utilise the textual information in this case. Cross-domain performance is severely degraded when training on IEMOCAP and testing on MSP-Podcast, but much less so when doing the opposite. This illustrates how training on large, naturalistic corpora leads to better generalisation for SER systems, both unimodal and bimodal ones.
In the case of arousal, CNN14 performs better on MSP-Podcast, with an average CCC of .664 (vs .620 and .627 for SFCNN14 and MFCNN14), but this difference is not statistically significant. On IEMOCAP, MFCNN14 shows a marginally better performance than CNN14, while SFCNN14 is significantly worse than its unimodal baseline. This curious case shows how additional information can also hamper the training process. We hypothesise that this is because textual information is not conducive to arousal modelling, and thus hinders training on this task. This is corroborated by the BERT models trained to jointly predict arousal/valence/dominance presented in Section 6.2. As mentioned in Section 4, we did not train BERT models for each dimension in isolation to reduce the computational load of our experiments; thus, we return to this point in Section 6.2.
Finally, we note that cross-domain performance for arousal, while also lower than in-domain performance, is not as degraded as for valence, especially for CNN14. Interestingly, PCC on IEMOCAP for CNN14 models trained on MSP-Podcast is now significantly higher than the PCC obtained by either SFCNN14 or MFCNN14 (.593 vs .543 and .558, respectively). On the contrary, MFCNN14 shows significantly better performance in the opposite setup (.432 vs .418 PCC) than CNN14, while SFCNN14 remains significantly worse. This illustrates once more that the paralinguistic information stream carries more information on arousal than the linguistic one, and models trained on it can generalise better across different datasets.
Results on dominance follow the trends exhibited by arousal. CNN14 is significantly better than SFCNN14 and MFCNN14 on in-domain MSP-Podcast results (.583 vs .523 and .539), but, in the case of MFCNN14, this large in-domain difference does not translate to better cross-domain generalisation, as both models are nearly equivalent in IEMOCAP PCC performance (.471 vs .470). This tendency is reversed on IEMOCAP; there, MFCNN14 achieves significantly better in-domain results (.503 vs .424), but shows evidence of overfitting by performing significantly worse cross-domain (.365 vs .438).
Overall, our results show that bimodal fusion significantly improves performance on the valence dimension both in- and cross-domain for both datasets, with MFCNN14 achieving consistently superior performance to SFCNN14. For MSP-Podcast, the other two dimensions fail to improve, while for IEMOCAP they improve only in-domain, and only for MFCNN14, while SFCNN14 performs consistently worse than CNN14. As we discuss in Section 6.2, this is because BERT is not good at modelling arousal and dominance, and this propagates to the fusion models. It thus appears that linguistic information, which by itself is not adequate to learn these tasks, hampers the training process and results in worse fusion models as well. This undesirable property seems to affect SFCNN14 more strongly than MFCNN14, which is able to circumvent it and, in some cases, benefit from linguistic information. Thus, in the case of dimensional emotion recognition, MFCNN14 so far shows a better behaviour than SFCNN14.

Multi-task models
We end our section on emotional dimensions by considering a multi-task problem with a CCC loss. This is motivated by several recent works which have obtained better performance by switching to this formulation [44,45]. To reduce the footprint of our experiments, we only evaluate this approach on MSP-Podcast. CCC results for 5 runs are shown in Table 3. As previously discussed, Table 3 additionally includes results with a fine-tuned BERT model. As expected, we observe that BERT performs much better than CNN14 on valence prediction (.503 vs .291), but lags far behind on arousal (.232 vs .660) and dominance (.214 vs .578). This clearly illustrates how the two streams, acoustics and linguistics, carry complementary information for emotion recognition.

Figure 3:
Error residual plots for arousal (left), valence (middle), and dominance (right). Linear curves fitted on error residuals (e_n = y_t^n − y_p^n) for each architecture and dimension (LF, BERT, CNN14, SFCNN14, and MFCNN14), plotted against the gold standard (y_t^n). With the exception of SFCNN14 on arousal, all models show higher errors towards the edges of their scale. Plots best viewed in colour.
Both fusion methods improve on all dimensions compared to CNN14. In particular, MFCNN14 is significantly better on all three dimensions (.678 vs .660, .521 vs .291, and .604 vs .578), whereas SFCNN14 is significantly better only for valence (.497 vs .291), but not for arousal (.665 vs .660) and dominance (.598 vs .578). Moreover, of the two fusion methods, only MFCNN14 significantly improves on valence performance compared to BERT (.521 vs .503), while SFCNN14 performs marginally (but significantly) worse (.497). This demonstrates once more that multistage fusion can better utilise the information coming from the two streams.
Finally, we are interested in whether the models show a heteroscedastic behaviour by examining their error residuals. To this end, we pick the best models on the validation set and examine their residuals (model results for the selected seeds are included in A). Figure 3 shows fitted linear curves on the error residuals for each model and task. We use fitted curves for illustration purposes, as superimposing the scatterplots for each model would make our plots uninterpretable. The curves are least-squares estimates over all error residuals. Our analysis reveals that most models show non-uniform errors, with their deviation from the ground truth increasing as we move away from the middle of the scale. This 'regression towards the mean' phenomenon is highly undesirable, especially for real-world applications, where users would observe a higher deviation from their own perception of a target emotion for more intense manifestations of it. SFCNN14 is the only model which escapes this undesirable fate, primarily for arousal and dominance. BERT, in contrast, shows the worst behaviour for those two dimensions, but is comparable to SFCNN14 for valence. CNN14 and MFCNN14 show a much improved behaviour compared to BERT, but some bias still appears, with the late fusion baseline naturally falling in the middle between BERT and CNN14. For valence, SFCNN14 and MFCNN14 both closely follow BERT in showing a low, but nevertheless existent, bias, while CNN14 performs worse.
Interestingly, the residuals also show an asymmetric behaviour. For valence, in particular, CNN14 shows higher errors than the other models for the upper end of the scale, but is comparable to them at the lower end. Conversely, BERT shows higher errors for the lower end of arousal and dominance. This indicates that different models struggle with different ends of the scales. Naturally, this is partly explained by the sparseness of data for the more extreme values, as naturalistic datasets tend to be highly imbalanced towards neutral. Nevertheless, this continues to pose a serious operationalisation problem for AER systems.
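The residual analysis itself can be sketched as follows, with synthetic predictions that deliberately regress towards the mean; the slope of the fitted line then directly exposes the bias:

```python
import numpy as np

# Residual-analysis sketch: fit a least-squares line to the error residuals
# e = y_true - y_pred as a function of the gold standard. A non-zero slope
# indicates 'regression towards the mean': errors grow towards the edges
# of the scale. All data here are synthetic.
rng = np.random.default_rng(1)
y_true = rng.uniform(1, 7, size=500)            # gold-standard ratings
y_pred = 0.5 * y_true + 0.5 * y_true.mean()     # predictions pulled to the mean

residuals = y_true - y_pred
slope, intercept = np.polyfit(y_true, residuals, deg=1)
# slope = 0.5 here: high values are under-predicted, low values over-predicted
```

An unbiased model would produce a flat residual line (slope near zero), which is the behaviour only SFCNN14 approaches for arousal and dominance.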

Conclusion
In the present contribution, we investigated the performance of deep fusion methods for emotion recognition. We introduced a novel method for combining linguistic and acoustic information for AER, relying on deep, multistage fusion of summary linguistic features with the intermediate layers of a CNN operating on log-Mel spectrograms, and contrasted it with a simpler, single-stage fusion method where information is only combined at a single point. We demonstrated both methods' superiority over unimodal and shallow, decision-level fusion baselines. In terms of quantitative evaluations, multistage fusion fares better than the single-stage baseline, thus illustrating how a tighter coupling of acoustics and linguistics inside CNNs can lead to a better integration of the two streams.

A Error residuals
Results for the best-performing multi-task emotional dimension prediction models are shown in Table 4. As we are using error residuals for our evaluation, we additionally show MSE results.

Figure 1 :
Figure 1: Diagrams of the architectures used in this work, illustrating the processing of a single utterance. (a) MFCNN14: Multistage fusion architecture. (b) SFCNN14: Single-stage fusion architecture. The same architecture was used for the baseline CNN14, but without the fusion.

Figure 2 :
Figure 2: Confusion matrices for the 4-class emotion classification task on MSP-Podcast. For each approach, we show results for the best performing seed. For the two fusion models, we additionally show % change with respect to CNN14 for easier comparison.

Table 1 :
UAR (%) in- and cross-domain results for 4-class emotion recognition (chance level: 25 %). MSP-Podcast in-domain results are computed on the official test set, whereas IEMOCAP in-domain results correspond to leave-one-speaker-out cross-validation, where data from each speaker is used once as the test set and the other speaker in their session is used as the development set. Cross-domain results are reported on the official test set for MSP-Podcast and the entire dataset for IEMOCAP. Average (and standard deviation) results computed over 5 runs. Fusion results (SFCNN14 and MFCNN14) that are significantly different from the unimodal baselines (CNN14 and BERT), as determined by two-sided independent t-tests (p < 0.05), are marked by * and †, respectively.

Table 2 :
CCC/PCC in-/cross-domain results for emotional dimension prediction using single-task models trained with an MSE loss. MSP-Podcast in- and cross-domain results are computed on the test set, whereas IEMOCAP in-domain results correspond to leave-one-speaker-out cross-validation and cross-domain results are reported on the entire dataset. Average (and standard deviation) results computed over 5 runs. Fusion results (SFCNN14 and MFCNN14) that are significantly different from the unimodal baseline (CNN14), as determined by two-sided independent sample t-tests (p < 0.05), are marked by *.

Table 3 :
CCC results for emotional dimension prediction using multi-task models on MSP-Podcast trained with a CCC loss. Models are trained to jointly optimise the CCC for all dimensions. Average (and standard deviation) results reported over 5 runs. Fusion results (SFCNN14 and MFCNN14) that are significantly different from the unimodal baselines for each dimension, as determined by two-sided independent sample t-tests (p < 0.05), are marked by * (for CNN14) and † (for BERT).

Table 4 :
CCC and MSE results for emotional dimension prediction using multi-task models on MSP-Podcast. Models are trained to jointly optimise the CCC for all dimensions. Results are reported for the model showing the best validation set performance for each architecture. LF corresponds to a late, decision-level fusion (averaging) of unimodal predictions.