<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="research-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2021.767767</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Multimodal Affect Models: An Investigation of Relative Salience of Audio and Visual Cues for Emotion Prediction</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Wu</surname> <given-names>Jingyao</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1440942/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Dang</surname> <given-names>Ting</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1540948/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Sethu</surname> <given-names>Vidhyasaharan</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1249304/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Ambikairajah</surname> <given-names>Eliathamby</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/877851/overview"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Electrical Engineering and Telecommunications, University of New South Wales</institution>, <addr-line>Sydney, NSW</addr-line>, <country>Australia</country></aff>
<aff id="aff2"><sup>2</sup><institution>Department of Computer Science and Technology, University of Cambridge</institution>, <addr-line>Cambridge</addr-line>, <country>United Kingdom</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Oya Celiktutan, King&#x00027;s College London, United Kingdom</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Heysem Kaya, Utrecht University, Netherlands; Alexey Karpov, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), Russia</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Jingyao Wu <email>jingyao.wu&#x00040;unsw.edu.au</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Human-Media Interaction, a section of the journal Frontiers in Computer Science</p></fn></author-notes>
<pub-date pub-type="epub">
<day>23</day>
<month>12</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>3</volume>
<elocation-id>767767</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>08</month>
<year>2021</year>
</date>
<date date-type="accepted">
<day>02</day>
<month>12</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2021 Wu, Dang, Sethu and Ambikairajah.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Wu, Dang, Sethu and Ambikairajah</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>People perceive emotions via multiple cues, predominantly speech and visual cues, and a number of emotion recognition systems utilize both audio and visual cues. Moreover, the static aspects of emotion (e.g., the speaker&#x00027;s arousal level is high/low) and the dynamic aspects of emotion (e.g., the speaker is becoming more aroused) may be perceived via different expressive cues, and these two aspects are integrated to provide a unified sense of emotional state. However, existing multimodal systems focus on only a single aspect of emotion perception, and the contributions of different modalities toward modeling static and dynamic emotion aspects are not well explored. In this paper, we investigate the relative salience of the audio and video modalities for emotion state prediction and emotion change prediction using a multimodal Markovian affect model. Experiments conducted on the RECOLA database showed that the audio modality is better at modeling the emotion state for arousal and the video modality for valence, whereas audio shows a clear advantage over video in modeling emotion changes for both arousal and valence.</p></abstract>
<kwd-group>
<kwd>emotion recognition</kwd>
<kwd>multimodal</kwd>
<kwd>emotion dynamics</kwd>
<kwd>ordinal data</kwd>
<kwd>machine learning</kwd>
</kwd-group>
<counts>
<fig-count count="5"/>
<table-count count="7"/>
<equation-count count="9"/>
<ref-count count="53"/>
<page-count count="12"/>
<word-count count="8447"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Emotion plays an important role in daily communication and social interaction (Picard, <xref ref-type="bibr" rid="B34">2000</xref>), and the ability to recognize a person&#x00027;s emotional state is a critical requirement for achieving more natural human-computer interaction (Cowie et al., <xref ref-type="bibr" rid="B8">2001</xref>). When interacting with each other, people use a range of cues, such as speech patterns and facial expressions, to communicate and recognize emotions. Analogously, Automatic Emotion Recognition (AER) systems based on a myriad of modalities such as speech, text, facial expression, and body language have been developed (Wu et al., <xref ref-type="bibr" rid="B48">2014</xref>; Avots et al., <xref ref-type="bibr" rid="B2">2019</xref>; Yalamanchili et al., <xref ref-type="bibr" rid="B50">2021</xref>). Among these modalities, audio and visual cues have been the most widely studied, which is unsurprising considering that facial and vocal expressions are the most direct and natural modalities by which people communicate emotions (Wu et al., <xref ref-type="bibr" rid="B48">2014</xref>).</p>
<p>Brunswik&#x00027;s functional lens model may be used to explicitly depict the various elements involved in the communication of emotional states (Zhang and Provost, <xref ref-type="bibr" rid="B52">2019</xref>). As shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, when someone expresses an emotion, this information is carried via multiple modalities (most commonly speech and visual expression). An observer receives all modalities and integrates the information carried in them. While different modalities may contribute differently to the perception of different aspects of emotional state (such as perceiving that the speaker is becoming more aroused vs. perceiving that the arousal level is high), they are all ultimately combined to provide a unified sense of the &#x02018;recognised&#x00027; emotional state (Brunswik, <xref ref-type="bibr" rid="B5">1955</xref>; Banse and Scherer, <xref ref-type="bibr" rid="B4">1996</xref>; Zhang and Provost, <xref ref-type="bibr" rid="B52">2019</xref>). The assumption inherent in the previous statement, that different aspects of emotion are perceived via differing mechanisms, is supported by the observation that when a group of people are asked to label the evolving emotional state of the same speaker engaged in a conversation, the level of disagreement amongst the raters about the change in emotional state is significantly lower than the level of disagreement about the actual emotion label (Yannakakis et al., <xref ref-type="bibr" rid="B51">2018</xref>).</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>Brunswik&#x00027;s functional lens model of emotion expression and perception. The perception of two different aspects of emotion is depicted, with the blue dots representing the absolute state at a point in time and the red arrow indicating the change in emotional state from one instance to the next (a red equal sign indicates no change in emotion).</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-767767-g0001.tif"/>
</fig>
<p>Additionally, there is growing interest in the representation and prediction of only the dynamic aspects of emotional state, such as emotion change (Huang and Epps, <xref ref-type="bibr" rid="B20">2016</xref>) and purely relative labels of emotional state on an ordinal scale (Martinez et al., <xref ref-type="bibr" rid="B28">2014</xref>; Parthasarathy et al., <xref ref-type="bibr" rid="B33">2017</xref>). Despite recent successes in predicting emotion states (Kim and Clements, <xref ref-type="bibr" rid="B23">2015</xref>; Han et al., <xref ref-type="bibr" rid="B18">2020</xref>) and emotion changes (Liang et al., <xref ref-type="bibr" rid="B25">2018</xref>) from multiple modalities, these systems focus on one or the other, but not both. Consequently, to the best of the authors&#x00027; knowledge, there has to date been no analysis of any potential differences in the contributions of different modalities (audio, video, etc.) to the perception of the different aspects of emotional state (state vs. change).</p>
<p>Generally, such emotional states can be described with categorical labels (e.g., happy, sad, angry, etc.) or using dimensional representations (e.g., arousal, valence, etc.) (Russell, <xref ref-type="bibr" rid="B38">1980</xref>; Grimm et al., <xref ref-type="bibr" rid="B16">2007</xref>). A large body of recent literature has focused on dimensional representations (Ak&#x000E7;ay and O&#x0011F;uz, <xref ref-type="bibr" rid="B1">2020</xref>), since they are better able to describe the complexity of emotions, such as blended emotions and emotion transitions (Gunes and Schuller, <xref ref-type="bibr" rid="B17">2013</xref>; Ak&#x000E7;ay and O&#x0011F;uz, <xref ref-type="bibr" rid="B1">2020</xref>). Dimensional representations in turn can employ either numerical or ordinal labels along each dimension, and there has been growing interest in the use of ordinal labels in recent years (Yannakakis et al., <xref ref-type="bibr" rid="B51">2018</xref>). Research in psychology suggests that ordinal labels are better aligned with human perception (Stewart et al., <xref ref-type="bibr" rid="B45">2005</xref>). Subsequent studies in affective computing have also demonstrated that the use of ordinal labels leads to greater agreement among a group of raters when they are asked to label their perception of the emotional state of a speaker (Martinez et al., <xref ref-type="bibr" rid="B28">2014</xref>; Makantasis, <xref ref-type="bibr" rid="B27">2021</xref>). In the work reported in this manuscript, we focus on ordinal labels within a dimensional emotion representation framework, whereby labels are given as points on ordinal scales corresponding to affect dimensions such as arousal (activated vs. deactivated) and valence (positive vs. negative) (Russell, <xref ref-type="bibr" rid="B38">1980</xref>; Grimm et al., <xref ref-type="bibr" rid="B16">2007</xref>).</p>
<p>Both the static and dynamic aspects of emotion can be associated with ordinal emotion labels on affective dimensions (Yannakakis et al., <xref ref-type="bibr" rid="B51">2018</xref>). In a recent study we introduced the terminology absolute ordinal label (AOL) and relative ordinal label (ROL) to distinguish between these two aspects of emotion on an ordinal scale (Wu et al., <xref ref-type="bibr" rid="B49">2021</xref>). AOLs are assigned from a given set of points on the ordinal scale (e.g., low arousal, medium arousal, high arousal), while ROLs represent relative comparisons between pairs of labels on the ordinal scale (e.g., the ranking from the lowest arousal to the highest arousal). Each type of ordinal label represents a different and complementary aspect of emotion, as depicted in <xref ref-type="fig" rid="F2">Figure 2</xref>. AOLs, denoted by the positions of the blue circles, represent the absolute arousal (or valence) level, i.e., the static emotion state at a given point in time, whereas ROLs, denoted by the numbers inside the circles, provide the relative level of the emotion state at each time with respect to other times; thus the change in ROLs (i.e., &#x00394; in <xref ref-type="fig" rid="F2">Figure 2</xref>) indicates emotion changes over time (e.g., arousal increases, orange arrows).</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>A graphical representation of the complementary characteristics of AOLs and ROLs over 6 time steps within a 100-frame utterance. The position of the blue circles represents AOLs at different time steps, with ROLs indicated by the numbers inside the circles and the arrows depicting ROL changes between consecutive time steps.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-767767-g0002.tif"/>
</fig>
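<p>To make the distinction between the two label types concrete, the following sketch (ours, not part of the original study) derives AOLs and ROLs from a short hypothetical numeric arousal trace. Tercile binning is used purely as an illustrative AOL scheme, and the rank differences correspond to the &#x00394; arrows in the figure.</p>

```python
import numpy as np

def absolute_ordinal_labels(trace, bins=("L", "M", "H")):
    """Bin a numeric arousal/valence trace into AOLs (tercile binning is an
    illustrative choice, not the annotation scheme of the original study)."""
    edges = np.quantile(trace, np.linspace(0, 1, len(bins) + 1)[1:-1])
    return [bins[np.searchsorted(edges, v, side="right")] for v in trace]

def relative_ordinal_labels(trace):
    """Rank each frame against all others within the utterance: 1 = lowest."""
    ranks = np.argsort(np.argsort(trace))
    return (ranks + 1).tolist()

# Hypothetical 6-step arousal trace
trace = np.array([0.1, 0.4, 0.9, 0.7, 0.2, 0.5])
aols = absolute_ordinal_labels(trace)      # static aspect (emotion state)
rols = relative_ordinal_labels(trace)      # relative aspect (within-utterance rank)
deltas = np.diff(rols)                     # positive -> arousal increasing
```

<p>Here the AOLs capture the static state at each step, while the sign of each rank difference captures the dynamic aspect (emotion change) between consecutive steps.</p>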
<p>There have been a number of recent advances in both AOL prediction (Metallinou et al., <xref ref-type="bibr" rid="B30">2012</xref>; Kim and Clements, <xref ref-type="bibr" rid="B23">2015</xref>; Han et al., <xref ref-type="bibr" rid="B18">2020</xref>) and ROL prediction (Parthasarathy et al., <xref ref-type="bibr" rid="B33">2017</xref>; Melhart et al., <xref ref-type="bibr" rid="B29">2020</xref>; Makantasis, <xref ref-type="bibr" rid="B27">2021</xref>). However, these have focused only on either absolute or relative ordinal labels, and not on the joint modeling of both the static and dynamic aspects. The work reported in this paper explores whether different modalities contribute to varying degrees to the recognition of the static and dynamic aspects of emotion. Specifically, we extend the Dynamic Ordinal Markov Model (DOMM), previously introduced to integrate absolute and relative ordinal information (Wu et al., <xref ref-type="bibr" rid="B49">2021</xref>), to incorporate multimodal inputs, and exploit the distinct AOL and ROL prediction subsystems to investigate differences in their contributions. All evaluations are carried out using the widely used RECOLA database (Ringeval et al., <xref ref-type="bibr" rid="B37">2013</xref>).</p>
</sec>
<sec id="s2">
<title>2. Related Work</title>
<p>Multimodal emotion recognition systems have benefited from a number of advances in techniques to fuse multiple expressive cues (Yalamanchili et al., <xref ref-type="bibr" rid="B50">2021</xref>). These methods can be broadly categorized as either feature level fusion or decision level fusion (Wu et al., <xref ref-type="bibr" rid="B48">2014</xref>). The former is generally carried out by concatenating feature vectors from different modalities (Metallinou et al., <xref ref-type="bibr" rid="B30">2012</xref>; Kim and Clements, <xref ref-type="bibr" rid="B23">2015</xref>), while the latter involves developing independent unimodal predictive models and then aggregating the predictions from each modality (Ringeval et al., <xref ref-type="bibr" rid="B36">2014</xref>; Sahoo and Routray, <xref ref-type="bibr" rid="B40">2016</xref>). A combination of both is also possible; for instance, Metallinou et al. (<xref ref-type="bibr" rid="B30">2012</xref>) first adopt feature level fusion with different weights assigned to the audio and video modalities, followed by model level fusion to learn a joint representation from multiple modalities. Similarly, Schoneveld et al. (<xref ref-type="bibr" rid="B43">2021</xref>) implement an LSTM-based fusion network that is trained jointly with pre-trained audio and video features. However, these multimodal fusion techniques do not allow for differences in the relative salience of the different modalities toward the static and dynamic aspects of emotion.</p>
<p>Additionally, literature in psychology has also reported on the role of audio and visual cues toward inferring different emotions (Banse and Scherer, <xref ref-type="bibr" rid="B4">1996</xref>). These studies have primarily focused on the relationship between modalities and specific emotions. For instance, facial expressions provide information about the occurrence of pleasant emotional states (Ekman and Oster, <xref ref-type="bibr" rid="B10">1979</xref>), and acoustic features of the speech signals are strongly associated with the speaker&#x00027;s arousal state (Bachorowski, <xref ref-type="bibr" rid="B3">1999</xref>; Russell et al., <xref ref-type="bibr" rid="B39">2003</xref>). Some of these observations of human emotion perception have also been found to have analogues in the automatic emotion recognition system developed by the affective computing community (Tzirakis et al., <xref ref-type="bibr" rid="B46">2019</xref>; Schoneveld et al., <xref ref-type="bibr" rid="B43">2021</xref>).</p>
<p>While there are no studies that directly investigate the relationship between expressive cues and the static and dynamic aspects of emotion, evidence from psychology suggests that people may seek to control their facial expressions when experiencing certain emotions (Crivelli and Fridlund, <xref ref-type="bibr" rid="B9">2018</xref>). Similarly, fine nuances in emotion that might otherwise be missed may be perceptible from vocal expression (Simon-Thomas et al., <xref ref-type="bibr" rid="B44">2009</xref>). These observations motivate the work reported in this paper on exploring the varying contributions of the speech and video modalities toward the prediction of emotion state and emotion change.</p>
</sec>
<sec id="s3">
<title>3. Dynamic Ordinal Markov Model</title>
<p>The Dynamic Ordinal Markov Model (DOMM) is a Markovian framework, originally proposed for speech-based emotion prediction, that integrates both the static and dynamic aspects of ordinal emotion labels (Wu et al., <xref ref-type="bibr" rid="B49">2021</xref>). A high level representation of the DOMM framework is depicted in <xref ref-type="fig" rid="F3">Figure 3</xref>. The emotion state (static aspect), represented by AOLs, and the emotion change (dynamic aspect), represented by ROLs, are separately modeled by the AOL and ROL prediction subsystems. These two subsystems are implemented using an Ordinal Multi-class Support Vector Machine (OMSVM) (Kim and Ahn, <xref ref-type="bibr" rid="B24">2012</xref>) and a RankSVM (Joachims, <xref ref-type="bibr" rid="B21">2002</xref>), respectively. The predictions from both subsystems dynamically update the parameters of a Markov model which is used to make the final predictions.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Overview of the DOMM system architecture. In the depicted state diagrams, the blue circles indicate the states (i.e., AOLs) low (L), medium (M) and high (H), and the red arrows denote state transitions. The state posterior probabilities corresponding to the blue circles are calculated by the AOL prediction subsystem for each state L, M, H at each time step. Similarly, the state transition probabilities are obtained from the ROL prediction subsystem for all possible state transitions. The predicted emotion state sequence is then obtained using Viterbi decoding based on the time-varying state and transition probabilities. The x axis represents the time step from <italic>t</italic><sub>1</sub> to <italic>t</italic><sub><italic>T</italic></sub>. The y axis represents the AOLs with low (L), medium (M) and high (H) states. The red line indicates the &#x0201C;best path&#x0201D; after Viterbi decoding, which gives the final predicted AOL sequence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-767767-g0003.tif"/>
</fig>
<p>Within the DOMM framework, the AOLs are represented as the states of a Markov model with ROLs reflecting state transitions. The predicted emotion labels are then given as:</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo class="qopname">arg</mml:mo><mml:mo class="qopname">max</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>T</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <bold>x</bold><sub>1 : <italic>T</italic></sub> &#x0003D; {<bold>x</bold><sub><bold>1</bold></sub>, <bold>x</bold><sub><bold>2</bold></sub>, ..., <bold>x</bold><sub><bold>T</bold></sub>} denotes the sequence of input features, with <bold>x</bold><sub><bold>t</bold></sub> denoting the feature at time <italic>t</italic>; and <italic><bold>&#x003B2;</bold></italic><sub>1 : <italic>T</italic></sub> &#x0003D; {&#x003B2;<sub>1</sub>, &#x003B2;<sub>2</sub>, ..., &#x003B2;<sub><italic>T</italic></sub>} denotes the sequence of DOMM states, with &#x003B2;<sub><italic>t</italic></sub> denoting the state at time <italic>t</italic>.</p>
<p>Finally, we note that <inline-formula><mml:math id="M2"><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi><mml:mtext>&#x000A0;</mml:mtext></mml:mrow><mml:mo>&#x02200;</mml:mo><mml:mi>t</mml:mi></mml:math></inline-formula> where <inline-formula><mml:math id="M3"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula> represents the set of possible AOLs; e.g., when the possible AOLs are low (<italic>L</italic>), medium (<italic>M</italic>), and high (<italic>H</italic>), <inline-formula><mml:math id="M4"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>L</mml:mi><mml:mo>,</mml:mo><mml:mi>M</mml:mi><mml:mo>,</mml:mo><mml:mi>H</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. This framework was developed on the assumption that AOLs are more readily interpretable, and consequently predictions should be AOLs, while also recognizing that ROLs are better aligned with the types of judgements humans are better at making and should therefore inform the predictions.</p>
<p>To determine <inline-formula><mml:math id="M5"><mml:msub><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mo>^</mml:mo></mml:mover></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mi>T</mml:mi></mml:mrow></mml:msub></mml:math></inline-formula>, we employ Viterbi decoding, making use of the Markovian property of the DOMM framework by tracking the most probable state sequence (Forney, <xref ref-type="bibr" rid="B15">1973</xref>). This in turn requires an estimate of the initial state probability at time <italic>t</italic><sub>0</sub>, the state probabilities, <italic>P</italic>(&#x003B2;<sub><italic>t</italic></sub>), and the state transition probabilities, <italic>P</italic>(&#x003B2;<sub><italic>t</italic></sub> | &#x003B2;<sub><italic>t</italic>&#x02212;1</sub>), at each time frame <italic>t</italic>. The term &#x00027;dynamic&#x00027; in DOMM refers to the fact that both the state and state transition probabilities are time-varying quantities inferred from the input signal. Within the DOMM framework, both these quantities are estimated by the AOL and ROL prediction subsystems. 
Given a set of AOLs, <inline-formula><mml:math id="M6"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula>, the AOL prediction subsystem is implemented as a machine learning model that maps input features to state posteriors, <inline-formula><mml:math id="M7"><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mo>|</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x02200;</mml:mo><mml:mo>&#x003BB;</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, where <inline-formula><mml:math id="M8"><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow></mml:math></inline-formula> denotes the input feature space. 
Similarly, the ROL prediction subsystem is implemented as a machine learning model that maps input features to the state transition probabilities, <inline-formula><mml:math id="M9"><mml:msub><mml:mrow><mml:mi>f</mml:mi></mml:mrow><mml:mrow><mml:mi>R</mml:mi></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>:</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mrow><mml:mi mathvariant="-tex-caligraphic">X</mml:mi></mml:mrow><mml:mo>&#x02192;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mtext>&#x000A0;</mml:mtext><mml:mo>|</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mtext>&#x000A0;</mml:mtext><mml:mo>|</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mo>&#x02200;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:math></inline-formula>. In the realization of the DOMM employed in the experiments reported in this paper, <italic>f</italic><sub><italic>A</italic></sub> is implemented as an Ordinal Multiclass SVM (OMSVM) model (Kim and Ahn, <xref ref-type="bibr" rid="B24">2012</xref>) and <italic>f</italic><sub><italic>R</italic></sub> is implemented using a RankSVM model (Joachims, <xref ref-type="bibr" rid="B21">2002</xref>).</p>
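<p>Conceptually, the decoding step of Equation (1) is a standard Viterbi pass over time-varying probabilities. The following function is a minimal sketch of that idea, not the authors&#x00027; implementation; it assumes the AOL subsystem supplies a T&#x000D7;K matrix of state posteriors and the ROL subsystem a (T&#x02212;1)&#x000D7;K&#x000D7;K array of transition probabilities over K AOLs.</p>

```python
import numpy as np

def viterbi(state_probs, trans_probs):
    """Most probable AOL sequence given time-varying state posteriors
    P(beta_t) (shape T x K) and transition probabilities P(beta_t | beta_{t-1})
    (shape (T-1) x K x K). Computed in the log domain for stability."""
    T, K = state_probs.shape
    log_s = np.log(state_probs + 1e-12)
    log_a = np.log(trans_probs + 1e-12)
    delta = log_s[0].copy()                       # scores for the initial frame
    back = np.zeros((T, K), dtype=int)            # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_a[t - 1]    # K x K: previous -> current
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_s[t]
    path = [int(delta.argmax())]                  # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

<p>With states indexed 0, 1, 2 for L, M, H, strongly peaked posteriors and uniform transitions simply recover the frame-wise argmax; the transition term matters when the two subsystems disagree.</p>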
<p>The OMSVM is a variation of the conventional Multi-class SVM that utilizes the ordinal pairwise partition algorithm to group the AOLs and models each group of AOLs with a series of SVMs, which enables it to capture the ordinal nature of the labels (Kim and Ahn, <xref ref-type="bibr" rid="B24">2012</xref>). The state posterior probabilities, <italic>P</italic>(&#x003B2;<sub><italic>t</italic></sub>), are computed by applying a sigmoid function to the OMSVM outputs as suggested in Platt (<xref ref-type="bibr" rid="B35">1999</xref>):</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M10"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0225C;</mml:mo><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>&#x0002B;</mml:mo><mml:mi>e</mml:mi><mml:mi>x</mml:mi><mml:mi>p</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>a</mml:mi><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>b</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mtext>&#x02003;&#x000A0;</mml:mtext><mml:mo>&#x02200;</mml:mo><mml:mo>&#x003BB;</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>H</italic><sub>&#x003BB;</sub>(<bold>x</bold><sub><italic>t</italic></sub>) denotes the OMSVM output corresponding to the AOL &#x003BB;, given an input feature vector <bold>x</bold><sub><italic>t</italic></sub>; <italic>a</italic> and <italic>b</italic> denote constant sigmoid function parameters which are determined during model training.</p>
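<p>The mapping of Equation (2) can be sketched as follows. The decision values, the sigmoid parameters <italic>a</italic> and <italic>b</italic>, and the final normalization over the AOL set are all illustrative here: in the actual system <italic>a</italic> and <italic>b</italic> are fit during model training, and the OMSVM outputs are hypothetical.</p>

```python
import math

def platt_posterior(h, a=-1.0, b=0.0):
    """Platt scaling: map a decision value H_lambda(x_t) to a probability
    via 1 / (1 + exp(a*H + b)). The defaults a, b are illustrative only;
    in practice they are fit on training data."""
    return 1.0 / (1.0 + math.exp(a * h + b))

# Hypothetical OMSVM outputs for the three AOLs at one frame; normalizing
# so the posteriors sum to one over the AOL set is our addition for illustration.
scores = {"L": -1.2, "M": 0.3, "H": 2.1}
posts = {lab: platt_posterior(h) for lab, h in scores.items()}
total = sum(posts.values())
posts = {lab: p / total for lab, p in posts.items()}
```

<p>Larger decision values map to larger posteriors, so the frame above would favor the H state.</p>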
<p>The ROL prediction subsystem employs a RankSVM model trained to predict the relative rank, &#x003B1;<sub><italic>t</italic></sub>, of the arousal/valence labels within an utterance. The state transition probabilities, <italic>P</italic>(&#x003B2;<sub><italic>t</italic>&#x02212;1</sub> &#x02192; &#x003B2;<sub><italic>t</italic></sub>), are then estimated from the change in relative rank between consecutive frames, &#x00394;&#x003B1;<sub><italic>t</italic></sub> &#x0003D; &#x003B1;<sub><italic>t</italic></sub> &#x02212; &#x003B1;<sub><italic>t</italic>&#x02212;1</sub>, as follows:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M11"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02192;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0225C;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x00394;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00394;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00394;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where &#x003B1;<sub><italic>t</italic></sub> &#x0003D; <italic>G</italic>(<bold>x</bold><sub><italic>t</italic></sub>), with <italic>G</italic>(<bold>x</bold><sub><italic>t</italic></sub>) denoting the RankSVM output given the input feature vector <bold>x</bold><sub><italic>t</italic></sub>. The conditional probabilities on the right-hand side of Equation (3) are all estimated from models obtained from the labeled training data.</p>
<p>Specifically, a model of <italic>P</italic>(&#x00394;&#x003B1;<sub><italic>t</italic></sub> | &#x003B2;<sub><italic>t</italic>&#x02212;1</sub> &#x0003D; &#x003BB;) is inferred using Kernel Density Estimation (KDE) (Platt, <xref ref-type="bibr" rid="B35">1999</xref>) based on the set of relative rank differences obtained from all training data points where the preceding point corresponded to the AOL, &#x003BB;. The set of these models obtained for all possible AOLs, <inline-formula><mml:math id="M12"><mml:mo>&#x003BB;</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula>, can then be used to determine any desired <italic>P</italic>(&#x00394;&#x003B1;<sub><italic>t</italic></sub> | &#x003B2;<sub><italic>t</italic>&#x02212;1</sub>). Similarly, <italic>P</italic>(&#x00394;&#x003B1;<sub><italic>t</italic></sub> | &#x003B2;<sub><italic>t</italic>&#x02212;1</sub> &#x0003D; &#x003BB;<sub>1</sub>, &#x003B2;<sub><italic>t</italic></sub> &#x0003D; &#x003BB;<sub>2</sub>) is estimated from the set of training points labeled as &#x003BB;<sub>2</sub>, where the previous point was labeled as &#x003BB;<sub>1</sub>. Again, from the set of models corresponding to all pairwise combinations of &#x003BB;<sub>1</sub> and &#x003BB;<sub>2</sub>, any desired <italic>P</italic>(&#x00394;&#x003B1;<sub><italic>t</italic></sub> | &#x003B2;<sub><italic>t</italic>&#x02212;1</sub>, &#x003B2;<sub><italic>t</italic></sub>) can be determined. Finally, the set of prior probabilities, <italic>P</italic>(&#x003B2;<sub><italic>t</italic></sub> | &#x003B2;<sub><italic>t</italic>&#x02212;1</sub>), can be estimated as:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M13"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02192;</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M14"><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow></mml:math></inline-formula>, <italic>N</italic><sub>&#x003BB;<sub>1</sub></sub> denotes the number of training points labeled as &#x003BB;<sub>1</sub>, and <italic>N</italic><sub>&#x003BB;<sub>1</sub> &#x02192; &#x003BB;<sub>2</sub></sub> denotes the number of occurrences of pairs of training data points with a point labeled &#x003BB;<sub>1</sub> followed by a point labeled &#x003BB;<sub>2</sub> in the training sets.</p>
<p>The initial probability <italic>P</italic>(&#x003B2;<sub>0</sub>) is directly estimated from training data as:</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M15"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x003BB;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>N</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mi>N</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>N</italic><sub>&#x003BB;</sub> denotes the number of training points labeled as &#x003BB; and <italic>N</italic> denotes the total number of data points in the training sets.</p>
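<p>As an illustration of Equations (3)&#x02013;(5), the following Python sketch estimates the count-based priors and the KDE-based likelihoods from a toy label sequence; the data, variable names, and the use of scipy's Gaussian KDE are illustrative assumptions, not the paper's implementation.</p>

```python
import numpy as np
from scipy.stats import gaussian_kde

# Toy AOL sequence (0=low, 1=medium, 2=high) and RankSVM outputs alpha_t;
# all names and values are illustrative, not from the paper.
labels = np.array([0, 0, 1, 1, 2, 2, 1, 0, 1, 2, 2, 1])
alphas = np.array([0.1, 0.2, 0.5, 0.6, 0.9, 1.0, 0.5, 0.2, 0.6, 0.9, 1.1, 0.4])
d_alpha = np.diff(alphas)            # delta_alpha_t = alpha_t - alpha_{t-1}
prev, curr = labels[:-1], labels[1:]
n_states = 3

# Equation (4): P(beta_t = l2 | beta_{t-1} = l1) = N_{l1->l2} / N_{l1}.
trans = np.array([[np.sum((prev == l1) & (curr == l2)) / np.sum(prev == l1)
                   for l2 in range(n_states)] for l1 in range(n_states)])

# Equation (5): initial probabilities from label frequencies.
init = np.bincount(labels, minlength=n_states) / len(labels)

# KDE models of P(delta_alpha | beta_{t-1} = l1), one per AOL.
kde_prev = {l1: gaussian_kde(d_alpha[prev == l1]) for l1 in range(n_states)}

def transition_prob(l1, l2, da):
    """Equation (3): P(beta_{t-1} -> beta_t | delta_alpha) via Bayes' rule."""
    mask = (prev == l1) & (curr == l2)
    if mask.sum() < 2:               # too little data for a pairwise KDE
        return trans[l1, l2]
    kde_pair = gaussian_kde(d_alpha[mask])
    return kde_pair(da)[0] * trans[l1, l2] / kde_prev[l1](da)[0]
```

<p>On real data the pairwise KDE models would be fitted once per label pair on the full training set rather than refitted per query as in this sketch.</p>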
</sec>
<sec id="s4">
<title>4. Proposed Methodology</title>
<p>The structure of the DOMM framework makes it an appropriate choice for investigating the varying degrees of salience of different modalities toward inferring the static and dynamic aspects of emotion. Specifically, since the AOL prediction and ROL prediction subsystems are independently trained and explicitly cater to the static and dynamic aspects of emotion labels, respectively, we can study whether cues from different modalities are particularly well suited to one or the other. In the investigations reported in this paper, we train and compare a range of DOMM systems where the AOL and ROL prediction subsystems are trained to use either audio (A), video (V), or a combination of audio and video (AV) features as inputs. This allows us to compare every possible combination of audio and video modalities for modeling and predicting the static and dynamic aspects of emotion, as outlined in <xref ref-type="fig" rid="F4">Figure 4</xref>.</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>An overview of the proposed analyses to determine the relative salience of the audio and video modalities toward the perception of emotion state and emotion change. The 9 possible DOMM systems corresponding to different combinations of modalities used for estimating state and transition probabilities are all evaluated on the same test data. A table of these measures of prediction accuracy, as illustrated, allows for the identification of broad trends around the relative salience of each modality toward the prediction of emotion state and emotion change.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-767767-g0004.tif"/>
</fig>
<p>Specifically, we train three versions of the AOL prediction system:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M16"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mo>&#x003BB;</mml:mo><mml:mo>|</mml:mo><mml:mo>&#x003A6;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0225C;</mml:mo><mml:mi>&#x003C3;</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>H</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003BB;</mml:mo></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mtext>&#x02003;&#x000A0;</mml:mtext><mml:mo>&#x02200;</mml:mo><mml:mo>&#x003BB;</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mi mathvariant="-tex-caligraphic">A</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mo>&#x003A6;</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where, &#x003C3;(&#x000B7;) denotes the sigmoid function, &#x003A6; denotes the modality, and the feature vector <inline-formula><mml:math id="M17"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> is obtained by concatenating <inline-formula><mml:math id="M18"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> and <inline-formula><mml:math id="M19"><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:msubsup></mml:math></inline-formula> and three versions of the ROL prediction system given as:</p>
<disp-formula id="E8"><label>(7)</label><mml:math id="M21"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>&#x02192;</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:mo>&#x003A6;</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0225C;</mml:mo><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>&#x00394;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00394;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow></mml:msub><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mi>P</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mo>&#x00394;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>&#x003B2;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac><mml:mo>,</mml:mo><mml:mtext>&#x02003;&#x000A0;</mml:mtext><mml:mo>&#x003A6;</mml:mo><mml:mo>&#x02208;</mml:mo><mml:mrow><mml:mo>{</mml:mo><mml:mrow><mml:mi>A</mml:mi><mml:mo>,</mml:mo><mml:mi>V</mml:mi><mml:mo>,</mml:mo><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mo>}</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where, similar to Equation (3), <inline-formula><mml:math id="M22"><mml:msubsup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:mi>G</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mstyle mathvariant="bold"><mml:mtext>x</mml:mtext></mml:mstyle></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula>, with <italic>G</italic>(&#x000B7;) representing a RankSVM, and <inline-formula><mml:math id="M23"><mml:mo>&#x00394;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003B1;</mml:mi></mml:mrow><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mo>&#x003A6;</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula>.</p>
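<p>A minimal sketch of the modality-specific front-ends in Equations (6) and (7): the AV feature vector is formed by concatenation, and each AOL posterior is a sigmoid over a per-label decision function. The random hyperplanes and dimensions below are purely hypothetical stand-ins for trained OMSVM decision functions.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def modality_features(x_audio, x_video, phi):
    """x_t^(Phi) for Phi in {A, V, AV}; the AV vector concatenates audio and video."""
    return {"A": x_audio, "V": x_video,
            "AV": np.concatenate([x_audio, x_video])}[phi]

def aol_posteriors(hyperplanes, x):
    """Equation (6): sigma(H_lambda(x)) per AOL, with each H_lambda a linear
    decision function w.x + b (random stand-ins here, not trained models)."""
    return np.array([sigmoid(w @ x + b) for w, b in hyperplanes])

rng = np.random.default_rng(0)
x_av = modality_features(rng.normal(size=5), rng.normal(size=4), "AV")
planes = [(rng.normal(size=9), 0.0) for _ in range(3)]   # one per AOL
post = aol_posteriors(planes, x_av)
```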
<p>These sets of AOL and ROL prediction subsystems lead to 9 possible combinations as depicted in <xref ref-type="fig" rid="F4">Figure 4</xref>. The emotion state prediction accuracies of all combinations are estimated and compared to ascertain the relative salience of the speech and video modalities toward the modeling of emotion state and emotion change. For instance, looking down the first column of the table of prediction accuracies in <xref ref-type="fig" rid="F4">Figure 4</xref>, the entries <inline-formula><mml:math id="M24"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow></mml:math></inline-formula> represent the scenario where the RankSVM models emotion change based on audio cues, denoted by the superscript <italic>A</italic>, while the OMSVM predicts emotion state from one of the three possible input feature vectors, as denoted by the subscript.</p>
<p>Similarly, the second and third columns represent the configurations where the RankSVM takes either video (V) or audiovisual (AV) features as input. If audio were significantly more salient than video for the prediction of emotion state, then we should expect the entries in the first row to be consistently higher than those in the second row. Additionally, if video features carried little useful information about emotion state, then the entries in the third row would be similar to those in the first row. Likewise, comparing columns allows us to make inferences about the relative salience of the audio and video features to emotion change modeling. Finally, we note that even though the predictions made by DOMM systems are AOLs and the performance measures are accuracies of AOL predictions, the predictions are obtained via Viterbi decoding of a Markov model, and incorrect estimates of the transition probabilities will lead to incorrect predictions of the state sequence.</p>
</sec>
<sec id="s5">
<title>5. Experiment Settings</title>
<sec>
<title>5.1. Database Description</title>
<p>No publicly available corpora that can be used to train emotion prediction systems come with absolute and relative ordinal labels. Consequently, in this work we use the well-established Remote Collaborative and Affective Interactions (RECOLA) database (Ringeval et al., <xref ref-type="bibr" rid="B37">2013</xref>) and convert the interval labels to AOLs and ROLs. The RECOLA dataset (Ringeval et al., <xref ref-type="bibr" rid="B37">2013</xref>) is a widely used multimodal corpus containing both audio and video modalities. It consists of 9.5 h of audiovisual recordings spanning 23 dyadic interactions among 46 participants. The experiments reported in this paper were conducted with the data provided for the AVEC 2016 challenge, which included 9 utterances of 5 min duration each in both the training and development sets (Valstar et al., <xref ref-type="bibr" rid="B47">2016</xref>). The challenge development set is employed as the test set in this experiment since the labels of the test set are not public. Each utterance is annotated by 6 raters with continuous arousal and valence ratings between &#x02013;1 and 1, sampled at intervals of 40 ms.</p>
<p>Delay compensation is applied to account for human annotators' reaction delays in the labels, as suggested by Huang et al. (<xref ref-type="bibr" rid="B19">2015</xref>), with a delay of 4 s for arousal and 2 s for valence. Finally, the ratings are aggregated over a 2 s window, as per Parthasarathy and Busso (<xref ref-type="bibr" rid="B31">2016</xref>), who suggest that a window size between 1 and 3 s is appropriate for retaining salient trends in the ratings while reducing noise.</p>
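<p>The label preprocessing described above can be sketched as follows; the function and parameter names are ours, and the paper does not specify the exact alignment or windowing implementation.</p>

```python
import numpy as np

def preprocess_ratings(ratings, delay_s, frame_s=0.04, window_s=2.0):
    """Shift ratings earlier by the annotation delay, then average them over
    non-overlapping windows (a sketch; names are ours, not from the paper)."""
    shift = int(round(delay_s / frame_s))       # 4 s at 40 ms -> 100 frames
    shifted = ratings[shift:]                   # realign labels with stimuli
    win = int(round(window_s / frame_s))        # 2 s -> 50 frames
    n = len(shifted) // win
    return shifted[: n * win].reshape(n, win).mean(axis=1)

# A 5 min utterance sampled every 40 ms gives 7,500 ratings.
y = np.linspace(-1.0, 1.0, 7500)
arousal_agg = preprocess_ratings(y, delay_s=4.0)
```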
<p>The AOLs are converted from interval labels individually for each annotator and the final consensus AOLs are inferred via majority vote among the 6 individual AOLs (one per annotator). The conversion scheme is carried out by setting thresholds to divide the interval labels into three levels of arousal (valence) state: [low, medium, high]. Specifically, if <italic>y</italic><sub><italic>t</italic></sub> denotes the average arousal/valence intensity at window <italic>t</italic> and &#x003B8;<sub>1</sub> and &#x003B8;<sub>2</sub> denote two thresholds, the AOLs are obtained as: &#x003B2;<sub><italic>t</italic></sub> &#x0003D; <italic>Low</italic> for <italic>y</italic><sub><italic>t</italic></sub> &#x02264; &#x003B8;<sub>1</sub>; &#x003B2;<sub><italic>t</italic></sub> &#x0003D; <italic>Medium</italic> for &#x003B8;<sub>1</sub> &#x0003C; <italic>y</italic><sub><italic>t</italic></sub> &#x02264; &#x003B8;<sub>2</sub>; &#x003B2;<sub><italic>t</italic></sub> &#x0003D; <italic>High</italic> for <italic>y</italic><sub><italic>t</italic></sub> &#x0003E; &#x003B8;<sub>2</sub>. For arousal labels, the thresholds were chosen as &#x003B8;<sub>1</sub> &#x0003D; &#x02212;0.14 and &#x003B8;<sub>2</sub> &#x0003D; 0.14 and for valence these were set as &#x003B8;<sub>1</sub> &#x0003D; 0 and &#x003B8;<sub>2</sub> &#x0003D; 0.17. In both cases the thresholds were chosen to provide an even distribution across the low, medium and high states as outlined in Wu et al. (<xref ref-type="bibr" rid="B49">2021</xref>). The resultant distribution of absolute states is given in <xref ref-type="table" rid="T1">Table 1</xref>. Additionally, we repeated our analyses on two other sets of thresholds for both arousal and valence. The results can be found in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Tables S7</xref>&#x02013;<xref ref-type="supplementary-material" rid="SM1">S10</xref>.</p>
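<p>The threshold-based conversion and per-window majority vote might be sketched as below; the tie-breaking rule across raters is our assumption, as the paper does not specify one.</p>

```python
import numpy as np

def to_aol(y, theta1, theta2):
    """Map averaged arousal/valence intensity y_t to an AOL:
    0 (Low) if y <= theta1, 1 (Medium) if theta1 < y <= theta2, 2 (High) otherwise."""
    return np.digitize(y, [theta1, theta2], right=True)

def consensus_aol(per_rater_aols):
    """Majority vote across raters at each window; ties break toward the
    lowest label via argmax (our assumption, not specified in the paper)."""
    stacked = np.stack(per_rater_aols)           # (n_raters, n_windows)
    counts = np.stack([(stacked == k).sum(axis=0) for k in range(3)])
    return counts.argmax(axis=0)

# Arousal thresholds from the paper: theta1 = -0.14, theta2 = 0.14.
raters = [np.array([-0.5, 0.0, 0.3]), np.array([-0.2, 0.1, 0.5]),
          np.array([-0.3, -0.1, 0.2])]
aols = consensus_aol([to_aol(r, -0.14, 0.14) for r in raters])
```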
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Absolute ordinal labels distribution on RECOLA dataset with thresholds: &#x003B8;<sub><italic>a</italic>1</sub> &#x0003D; &#x02212;&#x003B8;<sub><italic>a</italic>2</sub> &#x0003D; 0.14 for arousal and &#x003B8;<sub><italic>v</italic>1</sub> &#x0003D; 0 and &#x003B8;<sub><italic>v</italic>2</sub> &#x0003D; 0.17 for valence.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center"><bold>Low</bold></th>
<th valign="top" align="center"><bold>Medium</bold></th>
<th valign="top" align="center"><bold>High</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Training set</td>
<td valign="top" align="left">Arousal</td>
<td valign="top" align="center">363</td>
<td valign="top" align="center">443</td>
<td valign="top" align="center">526</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Valence</td>
<td valign="top" align="center">462</td>
<td valign="top" align="center">463</td>
<td valign="top" align="center">416</td>
</tr>
<tr>
<td valign="top" align="left">Test set</td>
<td valign="top" align="left">Arousal</td>
<td valign="top" align="center">578</td>
<td valign="top" align="center">348</td>
<td valign="top" align="center">406</td>
</tr>
<tr>
<td/>
<td valign="top" align="left">Valence</td>
<td valign="top" align="center">545</td>
<td valign="top" align="center">432</td>
<td valign="top" align="center">364</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Likewise, the ROLs for each annotator are first obtained by performing pairwise comparisons between 2 s windows based on the mean arousal (valence) intensity. The global ROLs are computed by adopting the Qualitative Agreement (QA) method (Parthasarathy and Busso, <xref ref-type="bibr" rid="B32">2018</xref>). Within each utterance, a matrix of pairwise comparisons amongst all windows is first collected for each individual annotator, as shown in <xref ref-type="fig" rid="F5">Figure 5</xref>. For instance, the valence rating within the third window is less than that within the second window, leading to a down-arrow in the cell located at the second row and third column. A consensus matrix is then obtained via majority vote among the matrices from all annotators, and the final rank sequence of ROLs is obtained from this consensus matrix.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Illustration of the QA method (Parthasarathy and Busso, <xref ref-type="bibr" rid="B32">2018</xref>). <bold>(A)</bold> Individual comparison matrix obtained from the interval labels of one rater. An up-arrow indicates an increase, a down-arrow indicates a decrease, and an equals sign denotes a tie. <bold>(B)</bold> Consensus matrix obtained by aggregating the individual matrices collected from multiple raters using majority vote.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fcomp-03-767767-g0005.tif"/>
</fig>
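<p>The QA procedure can be sketched as follows, with +1/&#x02212;1/0 standing in for the up-arrow, down-arrow, and tie entries; taking the sign of the summed matrices is a simple stand-in for the majority vote over raters.</p>

```python
import numpy as np

def comparison_matrix(means):
    """Pairwise comparisons of per-window mean ratings for one rater:
    entry [i, j] compares window j against window i (+1 increase, -1 decrease, 0 tie)."""
    return np.sign(means[None, :] - means[:, None]).astype(int)

def consensus(matrices):
    """Consensus over raters' comparison matrices (QA sketch; the sign of the
    sum approximates a majority vote, with cross-rater ties mapped to 0)."""
    return np.sign(np.sum(matrices, axis=0))

m1 = comparison_matrix(np.array([0.1, 0.4, 0.2]))
m2 = comparison_matrix(np.array([0.0, 0.5, 0.3]))
m3 = comparison_matrix(np.array([0.2, 0.3, 0.1]))
C = consensus([m1, m2, m3])   # e.g. C[1, 2] = -1: window 3 below window 2
```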
</sec>
<sec>
<title>5.2. Features</title>
<sec>
<title>5.2.1. Audio Features</title>
<p>Two sets of widely used audio features, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) (Eyben et al., <xref ref-type="bibr" rid="B12">2015</xref>) and the Bag-of-Audio-Words (BoAW) (Schmitt et al., <xref ref-type="bibr" rid="B41">2016</xref>), were employed in the experiments reported in this paper. The 88-dimensional eGeMAPS was chosen since it is a standard feature set in affective computing that simplifies benchmarking, and it was provided by the AVEC 2016 challenge (Valstar et al., <xref ref-type="bibr" rid="B47">2016</xref>). It comprises arithmetic mean and coefficient of variation functionals applied to 18 low-level descriptors (LLDs) extracted from the minimalistic acoustic parameter set, along with another 8 functionals applied to pitch and loudness. An additional 7 LLDs are extracted from the extension parameter set, with 4 statistics over the unvoiced segments, 6 temporal features, and 26 additional cepstral and dynamic parameters (Eyben et al., <xref ref-type="bibr" rid="B12">2015</xref>). The features were extracted using the OpenSMILE toolkit (Eyben et al., <xref ref-type="bibr" rid="B13">2010</xref>) and for additional details about eGeMAPS, readers are referred to Eyben et al. (<xref ref-type="bibr" rid="B12">2015</xref>).</p>
<p>The Bag-of-Audio-Words (BoAW) features were extracted by first computing 20-dimensional MFCCs and their deltas. The &#x00027;audio words&#x00027; were determined as clusters in this space (Schmitt et al., <xref ref-type="bibr" rid="B41">2016</xref>). The BoAW features employed in our experiments were generated using 100 clusters, leading to a 100-dimensional BoAW representation. The extraction was implemented using OpenXbow (Schmitt and Schuller, <xref ref-type="bibr" rid="B42">2017</xref>). Principal Component Analysis (PCA) was then employed for dimensionality reduction, resulting in 40-dimensional features. Their first-order derivatives were then computed and concatenated with the 40 principal components, leading to an 80-dimensional feature representation.</p>
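<p>A rough sketch of this BoAW pipeline, under stated assumptions: random vectors stand in for MFCC frames, randomly chosen codebook centres stand in for the learned clusters, and PCA is performed via an SVD of the centred histograms rather than with OpenXbow.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(3000, 40))     # stand-in for 20 MFCCs + 20 deltas

# "Audio words": a codebook of 100 centres; a single nearest-centre
# assignment stands in for the full clustering used in the paper.
codebook = mfcc[rng.choice(len(mfcc), size=100, replace=False)]

def boaw_histogram(frames):
    """Quantise frames against the codebook and return a normalised histogram."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=100) / len(words)

H = np.stack([boaw_histogram(mfcc[i:i + 50]) for i in range(0, 3000, 50)])

# PCA via SVD of the centred histograms, keeping 40 components,
# then append first-order deltas for an 80-dim representation.
Hc = H - H.mean(axis=0)
_, _, vt = np.linalg.svd(Hc, full_matrices=False)
pcs = Hc @ vt[:40].T
deltas = np.vstack([np.zeros((1, 40)), np.diff(pcs, axis=0)])
features = np.hstack([pcs, deltas])
```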
</sec>
<sec>
<title>5.2.2. Video Features</title>
<p>The video features utilized in the experiments reported in this paper comprise two standard feature sets provided in the AVEC 2016 challenge: the appearance features and the geometry features (Valstar et al., <xref ref-type="bibr" rid="B47">2016</xref>). The video appearance features are computed using Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) by first convolving the input video frames with a bank of Gabor filters to obtain Gabor magnitude response images. The LBP operator is applied to the resulting image slices along three orthogonal planes (<italic>x-y, x-t, y-t</italic>), resulting in three LBP histograms per Gabor response. Finally, all the histograms are concatenated into a single LGBP-TOP histogram across all video frames. PCA is then applied for feature reduction, yielding an 84-dimensional feature set (Valstar et al., <xref ref-type="bibr" rid="B47">2016</xref>).</p>
<p>The geometry features are extracted by first tracking 49 facial landmarks and aligning them with a mean shape derived from stable points located around the nose and eyes. The coordinate differences between the aligned and mean landmarks are then computed, together with their deltas, leading to 196 features. The landmarks are split into three groups: (i) the left eye and eyebrow; (ii) the right eye and eyebrow; and (iii) the mouth. The Euclidean distances and angles between the points are computed within each group. A final Euclidean distance is computed between the mean of the stable landmarks and the aligned landmarks in each video frame. The extraction process results in 316-dimensional features; for further details, readers are referred to Valstar et al. (<xref ref-type="bibr" rid="B47">2016</xref>). Finally, for the audio-visual front-end we employed feature-level fusion by directly concatenating the individual audio and video features extracted using the above methods.</p>
</sec>
</sec>
<sec>
<title>5.3. Backend Implementation</title>
<p>The OMSVM subsystem for AOL prediction is implemented using the MATLAB ClassificationECOC toolbox (an error-correcting output code multi-class classifier) (Escalera et al., <xref ref-type="bibr" rid="B11">2009</xref>). The state posterior probabilities were then computed using the FitPosterior function (Platt, <xref ref-type="bibr" rid="B35">1999</xref>). The RankSVM model used in the ROL prediction subsystem was implemented using the toolkit referred to in Chapelle and Keerthi (<xref ref-type="bibr" rid="B6">2010</xref>). Both the OMSVM and RankSVM models employ linear kernels and both used <italic>c</italic> &#x0003D; 1 &#x000D7; 10<sup>&#x02212;4</sup> as suggested in Fan et al. (<xref ref-type="bibr" rid="B14">2008</xref>) and Kim and Clements (<xref ref-type="bibr" rid="B23">2015</xref>).</p>
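<p>As an illustration of the posterior calibration step, the following sketch fits a Platt-style sigmoid to binary decision scores by gradient descent on the logistic loss; it is a minimal stand-in for MATLAB's FitPosterior, not the toolbox implementation, and the data are synthetic.</p>

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit P(y=1|s) = sigmoid(A*s + B) to decision scores -- a minimal
    stand-in for Platt (1999) posterior calibration."""
    A, B = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(A * scores + B)))
        g = p - labels                       # d(logistic loss)/d(logit)
        A -= lr * np.mean(g * scores)
        B -= lr * np.mean(g)
    return A, B

rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
labels = np.concatenate([np.zeros(200), np.ones(200)])
A, B = platt_scale(scores, labels)
p_pos = 1.0 / (1.0 + np.exp(-(A * 2.0 + B)))   # posterior at a positive score
```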
</sec>
<sec>
<title>5.4. Evaluation Metrics</title>
<p>Two measures of AOL prediction accuracy are adopted in the experiments reported in this paper. The first is Unweighted Average Recall (UAR), a commonly employed evaluation metric in nominal classification tasks in the AER literature (Metallinou et al., <xref ref-type="bibr" rid="B30">2012</xref>; Zhang et al., <xref ref-type="bibr" rid="B53">2016</xref>) that has also been utilized for evaluation in AOL prediction systems (Kim and Ahn, <xref ref-type="bibr" rid="B24">2012</xref>). UAR (%) ranges between 0 and 100%, with 33.3% indicating a chance-level prediction in this three-class prediction task (balanced classes). However, UAR does not take into account the ordinality in the labels. For instance, incorrectly predicting &#x0201C;Low arousal&#x0201D; as &#x0201C;Medium arousal&#x0201D; or &#x0201C;High arousal&#x0201D; carries the same penalty in UAR, although the latter is obviously a more significant error. Consequently, the weighted Cohen&#x02019;s Kappa coefficient (Cohen, <xref ref-type="bibr" rid="B7">1968</xref>), <italic>k</italic><sub><italic>w</italic></sub>, which measures the consistency between two AOL sequences, is also reported since it takes the ordinal nature of AOLs into consideration. The coefficient <italic>k</italic><sub><italic>w</italic></sub> indicates the level of agreement between two different AOL sequences (predictions vs. ground truth) as given by Equation 8, with <italic>k</italic><sub><italic>w</italic></sub> &#x0003D; 1 indicating perfect agreement and <italic>k</italic><sub><italic>w</italic></sub> &#x0003D; 0 indicating only chance agreement (Cohen, <xref ref-type="bibr" rid="B7">1968</xref>).</p>
<disp-formula id="E9"><label>(8)</label><mml:math id="M25"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mi>w</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" 
accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x000B7;</mml:mo></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:mstyle displaystyle="true"><mml:munderover accentunder="false" accent="false"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:mi>j</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mtext>&#x000A0;</mml:mtext><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:munderover></mml:mstyle><mml:msub><mml:mrow><mml:mi>w</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>&#x000B7;</mml:mo></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>q</mml:mi></mml:mrow><mml:mrow><mml:mo>&#x000B7;</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>i, j</italic> &#x02208; [1, 2, 3] denote the possible AOLs (1-<italic>low</italic>, 2-<italic>medium</italic>, 3-<italic>high</italic>); <italic>p</italic><sub><italic>ij</italic></sub> is the entry located at the <italic>i</italic><sup><italic>th</italic></sup> row and <italic>j</italic><sup><italic>th</italic></sup> column of the confusion matrix, denoting the proportion of test instances with ground-truth AOL <italic>i</italic> that are predicted as AOL <italic>j</italic>; <italic>p</italic><sub><italic>i</italic>&#x000B7;</sub> denotes the fraction of all ground-truth AOLs that correspond to the label <italic>i</italic>; <italic>q</italic><sub>&#x000B7;<italic>j</italic></sub> denotes the fraction of all predicted AOLs that correspond to the label <italic>j</italic>; and <italic>w</italic><sub><italic>ij</italic></sub> is the element at the <italic>i</italic><sup><italic>th</italic></sup> row and <italic>j</italic><sup><italic>th</italic></sup> column of the matrix <inline-formula><mml:math id="M26"><mml:mi>W</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="true">[</mml:mo><mml:mrow><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none none none none none none none none none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mn>0</mml:mn></mml:mtd><mml:mtd><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn></mml:mtd><mml:mtd><mml:mn>1</mml:mn></mml:mtd></mml:mtr></mml:mtable></mml:mrow><mml:mo stretchy="true">]</mml:mo></mml:mrow></mml:math></inline-formula>.</p>
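<p>To make Equation (8) concrete, the following minimal sketch computes <italic>k</italic><sub><italic>w</italic></sub> from a confusion matrix using the weight matrix <italic>W</italic> above. It is an illustrative implementation only; the function name and the example matrices are hypothetical and are not taken from the released code.</p>

```python
import numpy as np

def weighted_kappa(conf, W):
    """Weighted Cohen's kappa (Equation 8) from a confusion matrix.

    conf[i, j]: count or proportion of instances with ground-truth label i
    predicted as label j; W: agreement-weight matrix.
    """
    conf = conf / conf.sum()                   # normalise counts to proportions p_ij
    p_i = conf.sum(axis=1)                     # ground-truth marginals p_i.
    q_j = conf.sum(axis=0)                     # prediction marginals q_.j
    observed = (W * conf).sum()                # sum_ij w_ij p_ij
    expected = (W * np.outer(p_i, q_j)).sum()  # sum_ij w_ij p_i. q_.j
    return (observed - expected) / (1.0 - expected)

# Agreement weights from the paper: adjacent labels receive partial credit.
W = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.5],
              [0.0, 0.5, 1.0]])

# Perfect agreement on a balanced three-class task gives k_w close to 1,
# while a uniform (chance-level) confusion matrix gives k_w close to 0.
print(weighted_kappa(np.eye(3) / 3.0, W))
print(weighted_kappa(np.full((3, 3), 1.0 / 9.0), W))
```

<p>Note how the weights encode ordinality: predicting &#x0201C;Low&#x0201D; as &#x0201C;Medium&#x0201D; is penalised less (<italic>w</italic><sub>12</sub> &#x0003D; 0.5) than predicting it as &#x0201C;High&#x0201D; (<italic>w</italic><sub>13</sub> &#x0003D; 0).</p>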
<p>Finally, Kendall&#x00027;s Tau (&#x003C4;) (Kendall, <xref ref-type="bibr" rid="B22">1938</xref>), a standard metric of the consistency between two rankings that has been used in several AER works (Lotfian and Busso, <xref ref-type="bibr" rid="B26">2016</xref>; Parthasarathy et al., <xref ref-type="bibr" rid="B33">2017</xref>), is used to evaluate the performance of ROL predictions. It varies between &#x02013;1 and 1, ranging from complete reversal to a perfect match (Kendall, <xref ref-type="bibr" rid="B22">1938</xref>), and is computed as shown in Equation (9):</p>
<disp-formula id="E10"><label>(9)</label><mml:math id="M27"><mml:mtable columnalign="left"><mml:mtr><mml:mtd><mml:mi>&#x003C4;</mml:mi><mml:mtext>&#x000A0;</mml:mtext><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>C</mml:mi><mml:mo>-</mml:mo><mml:mi>D</mml:mi></mml:mrow><mml:mrow><mml:mi>T</mml:mi></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>T</italic> refers to the total number of pairwise comparisons, given by <inline-formula><mml:math id="M28"><mml:mrow><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>n</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:mfrac></mml:mrow></mml:math></inline-formula>, with <italic>n</italic> referring to the highest rank index; <italic>C</italic> denotes the number of concordant pairs and <italic>D</italic> the number of discordant pairs.</p>
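<p>A brute-force sketch of Equation (9) follows: <italic>C</italic> and <italic>D</italic> are counted over all <italic>T</italic> &#x0003D; <italic>n</italic>(<italic>n</italic> &#x02212; 1)/2 pairs of items. This is an illustrative O(<italic>n</italic>&#x000B2;) implementation with a hypothetical function name, not the authors' code.</p>

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau (Equation 9) between two rankings of the same n items."""
    n = len(rank_a)
    T = n * (n - 1) // 2                        # total number of comparisons
    C = D = 0
    for i, j in combinations(range(n), 2):
        s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if s > 0:
            C += 1                              # concordant pair: same order
        elif s < 0:
            D += 1                              # discordant pair: opposite order
    return (C - D) / T

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0  (perfect match)
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0 (complete reversal)
```
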
</sec>
</sec>
<sec id="s6">
<title>6. Results and Discussion</title>
<p>The two subsystems that model emotion state (AOL prediction subsystem) and emotion change (ROL prediction subsystem) are first evaluated with different modalities. Following this, we use the DOMM framework to analyse the relative contributions of speech and video modalities toward inferring static and dynamic aspects of emotion. The code used to implement these experiments and demo scripts can be accessed at: <ext-link ext-link-type="uri" xlink:href="https://github.com/JingyaoWU66/Multimodal_DOMM.git">https://github.com/JingyaoWU66/Multimodal_DOMM.git</ext-link>.</p>
<sec>
<title>6.1. Validating Subsystems</title>
<p>The performance achieved with each single modality is reported in <xref ref-type="table" rid="T2">Tables 2</xref>, <xref ref-type="table" rid="T3">3</xref>, for the OMSVM in terms of UAR (%) and <italic>k</italic><sub><italic>w</italic></sub>, and for the RankSVM in terms of &#x003C4;. The best results obtained with audio-visual features across all combinations of the feature sets of the two modalities are also reported. For arousal, the best combination was eGemaps (audio) with appearance (video) features; for valence, BoAW (audio) with geometric (video) features. For AOL prediction with the OMSVM, the audio modality outperforms the video modality for arousal prediction across all feature sets in terms of both UAR and <italic>k</italic><sub><italic>w</italic></sub>. The converse is true for valence prediction, with video outperforming audio. This is in line with previously reported observations that audio features are more salient for arousal recognition, whereas valence recognition is more accurate with visual features (Schoneveld et al., <xref ref-type="bibr" rid="B43">2021</xref>). Feature-level fusion of the audio and video modalities outperforms video alone for valence prediction, suggesting that the audio-visual features contain additional useful information; this is not observed for arousal prediction, which suggests that the more salient audio modality already carries the information necessary for arousal prediction. More importantly, it also indicates that simple feature-level fusion may not be the optimal approach for leveraging multimodal inputs.</p>
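<p>The feature-level fusion referred to above amounts to concatenating the per-instance audio and video descriptors into a single vector before the classifier or ranker is trained. A minimal sketch, with illustrative dimensionalities that are placeholders rather than the exact feature sizes used in these experiments:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_instances = 10
audio = rng.random((n_instances, 88))    # e.g. eGeMAPS-style audio functionals
video = rng.random((n_instances, 136))   # e.g. appearance/geometric descriptors

# Feature-level (early) fusion: concatenate along the feature axis,
# so each instance is described by one joint audio-visual vector.
fused = np.concatenate([audio, video], axis=1)
print(fused.shape)  # (10, 224)
```
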
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>OMSVM evaluation with different modalities in terms of unweighted average recall UAR (%) and weighted kappa <italic>k</italic><sub><italic>w</italic></sub> (reported in parentheses).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>Arousal</bold></th>
<th valign="top" align="center"><bold>Valence</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Audio (eGemaps)</td>
<td valign="top" align="center"><bold>58.8 (0.476)</bold></td>
<td valign="top" align="center">34.8 (0.033)</td>
</tr>
<tr>
<td valign="top" align="left">Audio (BoAW)</td>
<td valign="top" align="center">51.6 (0.374)</td>
<td valign="top" align="center">38.3 (0.106)</td>
</tr>
<tr>
<td valign="top" align="left">Video (Appearance)</td>
<td valign="top" align="center">35.4 (0.105)</td>
<td valign="top" align="center">40.0 (0.127)</td>
</tr>
<tr>
<td valign="top" align="left">Video (Geometric)</td>
<td valign="top" align="center">35.6 (0.056)</td>
<td valign="top" align="center">45.4 (0.223)</td>
</tr>
<tr>
<td valign="top" align="left">Audio-Visual (Best)</td>
<td valign="top" align="center">48.8 (0.308)</td>
<td valign="top" align="center"><bold>49.7 (0.288)</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The best performance among different modalities is indicated in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>RankSVM evaluation with different modalities in terms of Kendall&#x00027;s tau (&#x003C4;).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th valign="top" align="center"><bold>Arousal</bold></th>
<th valign="top" align="center"><bold>Valence</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Audio (eGemaps)</td>
<td valign="top" align="center"><bold>0.554</bold></td>
<td valign="top" align="center">0.136</td>
</tr>
<tr>
<td valign="top" align="left">Audio (BoAW)</td>
<td valign="top" align="center">0.482</td>
<td valign="top" align="center">0.181</td>
</tr>
<tr>
<td valign="top" align="left">Video (Appearance)</td>
<td valign="top" align="center">0.243</td>
<td valign="top" align="center">0.233</td>
</tr>
<tr>
<td valign="top" align="left">Video (Geometric)</td>
<td valign="top" align="center">0.072</td>
<td valign="top" align="center">0.193</td>
</tr>
<tr>
<td valign="top" align="left">Audio-Visual (Best)</td>
<td valign="top" align="center">0.524</td>
<td valign="top" align="center"><bold>0.238</bold></td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>The best performance is indicated in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>With respect to ROL prediction with the RankSVM, the audio modality achieves the highest &#x003C4; among all feature sets for arousal, suggesting that, as in the case of emotion state prediction, audio features are also well suited to predicting change in arousal state, especially when compared with video features. For valence prediction, video features outperform audio, but the best &#x003C4; is achieved when audio and video features are fused, suggesting that the two modalities contain complementary information about change in valence.</p>
</sec>
<sec>
<title>6.2. Salience of Audio and Video Modalities for Modeling Emotion State</title>
<p>As outlined in section 4, there are nine possible combinations of emotion state and emotion change prediction based on audio and/or video features (refer to the table depicted in <xref ref-type="fig" rid="F4">Figure 4</xref>) within the DOMM framework. To ascertain the relative salience of the modalities toward emotion state prediction, we compare the performance metrics within each column (with each row denoting a different modality for emotion state modeling). In all the results tables reported in this section, in addition to the nine combinations, we also report the mean across each row, which gives an indication of the &#x0201C;average&#x0201D; salience of each modality for emotion state modeling. For instance, to determine the salience of the audio modality for emotion state prediction, we first compute <inline-formula><mml:math id="M29"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, where the performances obtained with the three different RankSVM subsystems are averaged. 
The higher the value of <inline-formula><mml:math id="M30"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, the more salient the audio modality is for emotion state prediction. Four different combinations of audio and video feature sets are evaluated, with <xref ref-type="table" rid="T4">Tables 4</xref>, <xref ref-type="table" rid="T5">5</xref> showing the results obtained using eGemaps (audio) and appearance (video) features, and <xref ref-type="table" rid="T6">Tables 6</xref>, <xref ref-type="table" rid="T7">7</xref> showing the results obtained using BoAW (audio) and geometric (video) features. These two combinations led to more accurate predictions than the other two possible combinations of audio and video features, which are reported in <xref ref-type="supplementary-material" rid="SM1">Supplementary Tables S1</xref>&#x02013;<xref ref-type="supplementary-material" rid="SM1">S4</xref>.</p>
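<p>The row and column means reported in the tables are plain averages of the three corresponding entries. For instance, using the UAR values of the audio row of <xref ref-type="table" rid="T4">Table 4</xref> (a minimal sketch; variable names are hypothetical):</p>

```python
# S_A^A, S_A^V, S_A^AV: UAR (%) of the audio emotion-state subsystem paired
# with the audio, video, and audio-visual RankSVM subsystems (Table 4 values).
uar = {"A": 61.1, "V": 55.7, "AV": 59.0}

# "Average" salience of the audio modality for emotion state modeling.
salience_audio = sum(uar.values()) / len(uar)
print(round(salience_audio, 1))  # 58.6
```
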
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Performance of Arousal prediction in terms of Unweighted Average Recall (UAR %) and weighted kappa <italic>k</italic><sub><italic>w</italic></sub> (reported in parentheses).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>RankSVM</bold></th>
<th valign="top" align="center"><bold>Mean</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="center"><bold>Audio</bold></th>
<th valign="top" align="center"><bold>Video</bold></th>
<th valign="top" align="center"><bold>Audio-Visual</bold></th>
<th/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>OMSVM</bold></td>
<td valign="top" align="left"><bold>Audio</bold></td>
<td valign="top" align="center">61.1(0.532)</td>
<td valign="top" align="center">55.7(0.441)</td>
<td valign="top" align="center">59.0(0.501)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M40"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>58</mml:mn><mml:mo>.</mml:mo><mml:mn>6</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>491</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Video</bold></td>
<td valign="top" align="center">44.3(0.215)</td>
<td valign="top" align="center">40.8(0.169)</td>
<td valign="top" align="center">42.4(0.192)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M41"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>42</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>192</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Audio-Visual</bold></td>
<td valign="top" align="center">51.6(0.383)</td>
<td valign="top" align="center">47.7(0.319)</td>
<td valign="top" align="center">52.7(0.405)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M42"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>50</mml:mn><mml:mo>.</mml:mo><mml:mn>7</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>369</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Mean</bold></td>
<td valign="top" align="center"><inline-formula><mml:math id="M43"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>52</mml:mn><mml:mo>.</mml:mo><mml:mn>3</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>377</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M44"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>48</mml:mn><mml:mo>.</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>309</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M45"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>51</mml:mn><mml:mo>.</mml:mo><mml:mn>4</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>366</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center">-</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Audio feature: eGemaps; Video feature: Appearance. The best performance across the mean values is indicated in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T5">
<label>Table 5</label>
<caption><p>Performance of Valence prediction in terms of Unweighted Average Recall (UAR %) and weighted kappa <italic>k</italic><sub><italic>w</italic></sub> (reported in parentheses).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>RankSVM</bold></th>
<th valign="top" align="center"><bold>Mean</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="center"><bold>Audio</bold></th>
<th valign="top" align="center"><bold>Video</bold></th>
<th valign="top" align="center"><bold>Audio-Visual</bold></th>
<th/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>OMSVM</bold></td>
<td valign="top" align="left"><bold>Audio</bold></td>
<td valign="top" align="center">41.8(0.164)</td>
<td valign="top" align="center">38.8(0.113)</td>
<td valign="top" align="center">40.2(0.160)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M46"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>40</mml:mn><mml:mo>.</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>146</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Video</bold></td>
<td valign="top" align="center">41.2(0.146)</td>
<td valign="top" align="center">40.8(0.153)</td>
<td valign="top" align="center">43.7(0.196)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M47"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>41</mml:mn><mml:mo>.</mml:mo><mml:mn>9</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>165</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Audio-Visual</bold></td>
<td valign="top" align="center">40.3(0.140)</td>
<td valign="top" align="center">43.1(0.186)</td>
<td valign="top" align="center">44.7(0.227)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M48"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>42</mml:mn><mml:mo>.</mml:mo><mml:mn>7</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>184</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Mean</bold></td>
<td valign="top" align="center"><inline-formula><mml:math id="M49"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>41</mml:mn><mml:mo>.</mml:mo><mml:mn>1</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>150</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M50"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>40</mml:mn><mml:mo>.</mml:mo><mml:mn>7</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>150</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M51"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>42</mml:mn><mml:mo>.</mml:mo><mml:mn>9</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>191</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula></td>
<td valign="top" align="center">-</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Audio feature: eGemaps; Video feature: Appearance. The best performance across the mean values is indicated in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T6">
<label>Table 6</label>
<caption><p>Performance of Arousal prediction in terms of Unweighted Average Recall (UAR %) and weighted kappa <italic>k</italic><sub><italic>w</italic></sub> (reported in parentheses).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>RankSVM</bold></th>
<th valign="top" align="center"><bold>Mean</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="center"><bold>Audio</bold></th>
<th valign="top" align="center"><bold>Video</bold></th>
<th valign="top" align="center"><bold>Audio-Visual</bold></th>
<th/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>OMSVM</bold></td>
<td valign="top" align="left"><bold>Audio</bold></td>
<td valign="top" align="center">55.9(0.462)</td>
<td valign="top" align="center">52.8(0.381)</td>
<td valign="top" align="center">55.7(0.457)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M52"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>54</mml:mn><mml:mo>.</mml:mo><mml:mn>8</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>433</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Video</bold></td>
<td valign="top" align="center">37.8(0.106)</td>
<td valign="top" align="center">36.9(0.071)</td>
<td valign="top" align="center">38.6(0.106)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M53"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>37</mml:mn><mml:mo>.</mml:mo><mml:mn>8</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>094</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Audio-Visual</bold></td>
<td valign="top" align="center">49.1(0.329)</td>
<td valign="top" align="center">46.6(0.256)</td>
<td valign="top" align="center">49.3(0.325)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M54"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>48</mml:mn><mml:mo>.</mml:mo><mml:mn>3</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>303</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Mean</bold></td>
<td valign="top" align="center"><inline-formula><mml:math id="M55"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>47</mml:mn><mml:mo>.</mml:mo><mml:mn>6</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>299</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M56"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>45</mml:mn><mml:mo>.</mml:mo><mml:mn>4</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>236</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M57"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>47</mml:mn><mml:mo>.</mml:mo><mml:mn>9</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>296</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mstyle></mml:math></inline-formula></td>
<td valign="top" align="center">-</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Audio feature: BoAW; Video feature: Geometric. The best performance across the mean values is indicated in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<table-wrap position="float" id="T7">
<label>Table 7</label>
<caption><p>Performance of Valence prediction in terms of Unweighted Average Recall (UAR %) and weighted kappa <italic>k</italic><sub><italic>w</italic></sub> (reported in parentheses).</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th/>
<th/>
<th valign="top" align="center" colspan="3" style="border-bottom: thin solid #000000;"><bold>RankSVM</bold></th>
<th valign="top" align="center"><bold>Mean</bold></th>
</tr>
<tr>
<th/>
<th/>
<th valign="top" align="center"><bold>Audio</bold></th>
<th valign="top" align="center"><bold>Video</bold></th>
<th valign="top" align="center"><bold>Audio-Visual</bold></th>
<th/>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left"><bold>OMSVM</bold></td>
<td valign="top" align="left"><bold>Audio</bold></td>
<td valign="top" align="center">42.6(0.193)</td>
<td valign="top" align="center">43.2(0.193)</td>
<td valign="top" align="center">45.0(0.215)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M58"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>43</mml:mn><mml:mo>.</mml:mo><mml:mn>6</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>200</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Video</bold></td>
<td valign="top" align="center">47.0(0.253)</td>
<td valign="top" align="center">47.3(0.246)</td>
<td valign="top" align="center">46.7(0.243)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M59"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>47</mml:mn><mml:mo>.</mml:mo><mml:mn>0</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>247</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Audio-Visual</bold></td>
<td valign="top" align="center">51.8(0.328)</td>
<td valign="top" align="center">49.1(0.288)</td>
<td valign="top" align="center">49.7(0.295)</td>
<td valign="top" align="center"><inline-formula><mml:math id="M60"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle mathvariant="bold"><mml:mn>50.2</mml:mn><mml:mo>(</mml:mo><mml:mn>0.304</mml:mn><mml:mo>)</mml:mo></mml:mstyle></mml:mstyle></mml:math></inline-formula></td>
</tr>
<tr>
<td/>
<td valign="top" align="left"><bold>Mean</bold></td>
<td valign="top" align="center"><inline-formula><mml:math id="M61"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle mathvariant="bold"><mml:mn>47.1</mml:mn><mml:mo>(</mml:mo><mml:mn>0.258</mml:mn><mml:mo>)</mml:mo></mml:mstyle></mml:mstyle></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M62"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mn>46</mml:mn><mml:mo>.</mml:mo><mml:mn>5</mml:mn><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>.</mml:mo><mml:mn>242</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula></td>
<td valign="top" align="center"><inline-formula><mml:math id="M63"><mml:mstyle mathvariant="bold"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mstyle mathvariant="bold"><mml:mn>47.1</mml:mn><mml:mo>(</mml:mo><mml:mn>0.251</mml:mn><mml:mo>)</mml:mo></mml:mstyle></mml:mstyle></mml:math></inline-formula></td>
<td valign="top" align="center">-</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<p><italic>Audio feature: BoAW; Video feature: Geometric. The best performance across the mean values is indicated in bold</italic>.</p>
</table-wrap-foot>
</table-wrap>
<p>As can be seen from <xref ref-type="table" rid="T4">Table 4</xref>, the best average UAR for arousal state prediction is 58.6% and the best <italic>k</italic><sub><italic>w</italic></sub> is 0.491, both achieved with audio input <inline-formula><mml:math id="M31"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> (which also outperforms <inline-formula><mml:math id="M32"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>), suggesting that the audio modality contributes most to static arousal prediction. Furthermore, within each column, the audio modality is consistently superior to the video and audio-visual modalities, indicating that the OMSVM predicts the arousal state most accurately from audio regardless of the input modality to the RankSVM subsystem. Similar trends are also observed in <xref ref-type="table" rid="T6">Table 6</xref> and the tables included in the <xref ref-type="supplementary-material" rid="SM1">Supplementary Material</xref>, where different feature sets are utilized.</p>
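For readers reproducing these evaluations, the two reported metrics can be computed as follows. This is an illustrative sketch in plain Python, not the authors' code; it assumes integer class labels 0..n&#x02212;1 and uses linear disagreement weights for the weighted kappa, consistent with ordinal emotion labels.

```python
def uar(y_true, y_pred):
    """Unweighted Average Recall: the mean of per-class recalls,
    so every emotion class counts equally regardless of its size."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(recalls)

def weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's weighted kappa (Cohen, 1968) with linear weights
    |i - j|, so larger ordinal disagreements are penalized more."""
    n = len(y_true)
    # Observed joint distribution over (true, predicted) labels.
    obs = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1.0 / n
    # Marginals give the chance-agreement (expected) distribution.
    row = [sum(1 for t in y_true if t == i) / n for i in range(n_classes)]
    col = [sum(1 for p in y_pred if p == j) / n for j in range(n_classes)]
    num = sum(abs(i - j) * obs[i][j]
              for i in range(n_classes) for j in range(n_classes))
    den = sum(abs(i - j) * row[i] * col[j]
              for i in range(n_classes) for j in range(n_classes))
    return 1.0 - num / den
```

Both metrics discount trivial strategies: a constant (majority-class) predictor scores only 1/<italic>n</italic> on UAR and 0 on the weighted kappa, which is why they are preferred over plain accuracy for imbalanced ordinal labels.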
<p>From the valence prediction accuracies reported in <xref ref-type="table" rid="T5">Tables 5</xref>, <xref ref-type="table" rid="T7">7</xref>, it can be seen that <inline-formula><mml:math id="M33"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> is higher than both <inline-formula><mml:math id="M34"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> and <inline-formula><mml:math id="M35"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>, both in terms of UAR and <italic>k</italic><sub><italic>w</italic></sub>. 
Additionally, <inline-formula><mml:math id="M36"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula> outperforms <inline-formula><mml:math id="M37"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:mrow></mml:math></inline-formula>; both results indicate that video is more salient than audio when it comes to predicting the valence state. Similar trends are also observed within each column, with <inline-formula><mml:math id="M38"><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> performing best amongst the three systems when the RankSVM input is either video or audio-visual. In the exception where the RankSVM input is audio (first column of <xref ref-type="table" rid="T5">Table 5</xref>), <inline-formula><mml:math id="M39"><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup></mml:math></inline-formula> does not correspond to the best performance; however, this configuration does not lead to the best overall performance, and it may be that the gains of audio-visual input to the OMSVM are offset by the less accurate ROL prediction of the audio-based RankSVM (see <xref ref-type="table" rid="T3">Table 3</xref>).</p>
</sec>
<sec>
<title>6.3. Salience of Audio and Video Modalities for Modeling Emotion Change</title>
<p>To investigate the salience of different modalities for predicting change in emotions, the impact of varying the input modality to the RankSVM-based ROL prediction subsystem can be studied. Specifically, this can be done by comparing the prediction accuracy across rows in <xref ref-type="table" rid="T4">Tables 4</xref>&#x02013;<xref ref-type="table" rid="T7">7</xref>. Additionally, the average performance for each modality, obtained by computing the mean over all input modalities for the OMSVM (i.e., the mean of each column), is also reported, such as <inline-formula><mml:math id="M64"><mml:mrow><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover><mml:mo>=</mml:mo><mml:mi>m</mml:mi><mml:mi>e</mml:mi><mml:mi>a</mml:mi><mml:mi>n</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup><mml:mo>,</mml:mo><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math></inline-formula>, representing the average accuracy when using audio for predicting emotion change.</p>
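Concretely, each column mean is simply the average of the three cell values in that column. The following sketch (not the authors' code; values copied from Table 7) reproduces the reported audio-column mean:

```python
# UAR values from the audio column of Table 7 (RankSVM input: audio),
# for OMSVM inputs audio, video, and audio-visual respectively.
uar_audio_column = [42.6, 47.0, 51.8]
mean_uar = sum(uar_audio_column) / len(uar_audio_column)
print(round(mean_uar, 1))  # 47.1, matching the reported column mean
```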
<p>From <xref ref-type="table" rid="T4">Tables 4</xref>, <xref ref-type="table" rid="T6">6</xref>, it can be seen that <inline-formula><mml:math id="M65"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula> corresponds to the highest prediction accuracy, followed by <inline-formula><mml:math id="M66"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula> and then <inline-formula><mml:math id="M67"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula>, in terms of both UAR and <italic>k</italic><sub><italic>w</italic></sub>. This suggests that audio is, on average, the most salient modality for modeling change in arousal. Furthermore, within each row, the performance when audio is the RankSVM input always exceeds that of the video modality, further validating the observation that audio is more salient for modeling emotion change.</p>
<p>Looking across the rows of the valence prediction accuracies reported in <xref ref-type="table" rid="T5">Tables 5</xref>, <xref ref-type="table" rid="T7">7</xref>, it can be observed that <inline-formula><mml:math id="M68"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula> achieves the highest prediction accuracy (in terms of both UAR and <italic>k</italic><sub><italic>w</italic></sub>), suggesting that both audio and video modalities are salient when it comes to predicting changes in valence. Interestingly, <inline-formula><mml:math id="M69"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula> is higher than <inline-formula><mml:math id="M70"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula>. 
Notably, in <xref ref-type="table" rid="T7">Table 7</xref>, <inline-formula><mml:math id="M71"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula> achieves similar performance to <inline-formula><mml:math id="M72"><mml:mover accent="false" class="mml-overline"><mml:mrow><mml:msubsup><mml:mrow><mml:mi>S</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow><mml:mrow><mml:mi>A</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo accent="true">&#x000AF;</mml:mo></mml:mover></mml:math></inline-formula> in terms of UAR, and even performs better once ordinality is incorporated via the <italic>k</italic><sub><italic>w</italic></sub> evaluation. This appears to run counter to the conventional wisdom that video is more salient than audio for valence prediction (Metallinou et al., <xref ref-type="bibr" rid="B30">2012</xref>; Schoneveld et al., <xref ref-type="bibr" rid="B43">2021</xref>). However, it is worth noting that the valence state prediction results do conform to those expectations, and the valence change prediction results might suggest that fine nuances related to valence changes are better conveyed by the audio modality than by video.</p>
</sec>
</sec>
<sec sec-type="conclusions" id="s7">
<title>7. Conclusion</title>
<p>There is a large body of literature devoted to recognizing the static emotion state of a speaker (e.g., arousal level at a point in time), and a growing interest in the prediction of dynamic changes in emotion (e.g., the change of arousal level between consecutive time steps). In this manuscript, we consider a unified model that integrates both static and dynamic aspects of emotion perception. In particular, we investigate the differences in the relative salience of the audio and video modalities for modeling the static and dynamic aspects of emotion.</p>
<p>Using the Dynamic Ordinal Markov Model (DOMM) framework, extensive analyses were carried out by varying the input modalities to the OMSVM (modeling static aspects) and the RankSVM (modeling dynamic aspects) subsystems, covering all possible combinations of audio and video inputs across different feature sets. The DOMM framework is particularly well suited for this analysis because it separately models the static and dynamic aspects of emotion with different input modalities, prior to integrating them for ordinal emotion prediction. The experimental comparisons were carried out on the widely used RECOLA dataset, and prediction accuracy was quantified in terms of both UAR and weighted kappa. Results obtained from a range of different system configurations consistently show that the audio modality is more salient for modeling the arousal state, while the video modality is more salient for modeling the valence state.</p>
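The experimental grid summarized above can be made explicit: each system configuration pairs an OMSVM (static) input modality with a RankSVM (dynamic) input modality, yielding the 3 &#x000D7; 3 layout of Tables 4&#x02013;7. A minimal sketch (not the authors' code) enumerating that grid:

```python
from itertools import product

modalities = ["audio", "video", "audio-visual"]
# Every (OMSVM input, RankSVM input) pair evaluated in Tables 4-7:
# rows index the OMSVM (static) input, columns the RankSVM (dynamic) input.
configs = list(product(modalities, modalities))
print(len(configs))  # 9 configurations per feature set and emotion dimension
```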
<p>Additionally, our results show that emotion changes, for both arousal and valence, are better captured by the audio modality, either by itself or when fused with the video input. This is consistently observed across the rows in <xref ref-type="table" rid="T4">Tables 4</xref>&#x02013;<xref ref-type="table" rid="T7">7</xref>, where the highest prediction accuracy is achieved with audio input to the RankSVM regardless of the input modality to the OMSVM. This also aligns with findings in psychology that people may convey their intentions rather than their true emotions via facial expressions, while vocal expressions allow for better discrimination between emotional states even when the differences are only fine nuances.</p>
</sec>
<sec sec-type="data-availability" id="s8">
<title>Data Availability Statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: RECOLA database: <ext-link ext-link-type="uri" xlink:href="https://diuf.unifr.ch/main/diva/recola/download.html">https://diuf.unifr.ch/main/diva/recola/download.html</ext-link>.</p>
</sec>
<sec id="s9">
<title>Author Contributions</title>
<p>JW, TD, and VS contributed to the conception, design, and analysis of the study. JW conducted the experiments and drafted the article. VS and EA contributed to framing and contextualizing the research problem. All authors contributed to drafting the manuscript.</p>
</sec>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x00027;s Note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
</body>
<back>
<sec sec-type="supplementary-material" id="s11">
<title>Supplementary Material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/fcomp.2021.767767/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/fcomp.2021.767767/full#supplementary-material</ext-link></p>
<supplementary-material xlink:href="Data_Sheet_1.pdf" id="SM1" mimetype="application/pdf" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ak&#x000E7;ay</surname> <given-names>M. B.</given-names></name> <name><surname>O&#x0011F;uz</surname> <given-names>K.</given-names></name></person-group> (<year>2020</year>). <article-title>Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers</article-title>. <source>Speech Commun.</source> <volume>116</volume>, <fpage>56</fpage>&#x02013;<lpage>76</lpage>. <pub-id pub-id-type="doi">10.1016/j.specom.2019.12.001</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Avots</surname> <given-names>E.</given-names></name> <name><surname>Sapi&#x00144;ski</surname> <given-names>T.</given-names></name> <name><surname>Bachmann</surname> <given-names>M.</given-names></name> <name><surname>Kami&#x00144;ska</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). <article-title>Audiovisual emotion recognition in wild</article-title>. <source>Mach. Vis. Appl.</source> <volume>30</volume>, <fpage>975</fpage>&#x02013;<lpage>985</lpage>. <pub-id pub-id-type="doi">10.1007/s00138-018-0960-9</pub-id></citation>
</ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bachorowski</surname> <given-names>J.-A.</given-names></name></person-group> (<year>1999</year>). <article-title>Vocal expression and perception of emotion</article-title>. <source>Curr. Directions Psychol. Sci.</source> <volume>8</volume>, <fpage>53</fpage>&#x02013;<lpage>57</lpage>. <pub-id pub-id-type="doi">10.1111/1467-8721.00013</pub-id></citation>
</ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Banse</surname> <given-names>R.</given-names></name> <name><surname>Scherer</surname> <given-names>K. R.</given-names></name></person-group> (<year>1996</year>). <article-title>Acoustic profiles in vocal emotion expression</article-title>. <source>J. Personal. Soc. Psychol.</source> <volume>70</volume>, <fpage>614</fpage>. <pub-id pub-id-type="doi">10.1037//0022-3514.70.3.614</pub-id><pub-id pub-id-type="pmid">8851745</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brunswik</surname> <given-names>E.</given-names></name></person-group> (<year>1955</year>). <article-title>Representative design and probabilistic theory in a functional psychology</article-title>. <source>Psychol. Rev.</source> <volume>62</volume>, <fpage>193</fpage>. <pub-id pub-id-type="doi">10.1037/h0047470</pub-id><pub-id pub-id-type="pmid">14371898</pub-id></citation></ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chapelle</surname> <given-names>O.</given-names></name> <name><surname>Keerthi</surname> <given-names>S. S.</given-names></name></person-group> (<year>2010</year>). <article-title>Efficient algorithms for ranking with svms</article-title>. <source>Inf. Retrieval</source> <volume>13</volume>, <fpage>201</fpage>&#x02013;<lpage>215</lpage>. <pub-id pub-id-type="doi">10.1007/s10791-009-9109-9</pub-id><pub-id pub-id-type="pmid">26656580</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cohen</surname> <given-names>J.</given-names></name></person-group> (<year>1968</year>). <article-title>Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit</article-title>. <source>Psychol. Bull.</source> <volume>70</volume>, <fpage>213</fpage>. <pub-id pub-id-type="doi">10.1037/h0026256</pub-id><pub-id pub-id-type="pmid">19673146</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cowie</surname> <given-names>R.</given-names></name> <name><surname>Douglas-Cowie</surname> <given-names>E.</given-names></name> <name><surname>Tsapatsoulis</surname> <given-names>N.</given-names></name> <name><surname>Votsis</surname> <given-names>G.</given-names></name> <name><surname>Kollias</surname> <given-names>S.</given-names></name> <name><surname>Fellenz</surname> <given-names>W.</given-names></name> <etal/></person-group>. (<year>2001</year>). <article-title>Emotion recognition in human-computer interaction</article-title>. <source>IEEE Signal Process. Mag.</source> <volume>18</volume>, <fpage>32</fpage>&#x02013;<lpage>80</lpage>. <pub-id pub-id-type="doi">10.1109/79.911197</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Crivelli</surname> <given-names>C.</given-names></name> <name><surname>Fridlund</surname> <given-names>A. J.</given-names></name></person-group> (<year>2018</year>). <article-title>Facial displays are tools for social influence</article-title>. <source>Trends Cogn. Sci.</source> <volume>22</volume>, <fpage>388</fpage>&#x02013;<lpage>399</lpage>. <pub-id pub-id-type="doi">10.1016/j.tics.2018.02.006</pub-id><pub-id pub-id-type="pmid">29544997</pub-id></citation></ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ekman</surname> <given-names>P.</given-names></name> <name><surname>Oster</surname> <given-names>H.</given-names></name></person-group> (<year>1979</year>). <article-title>Facial expressions of emotion</article-title>. <source>Ann. Rev. Psychol.</source> <volume>30</volume>, <fpage>527</fpage>&#x02013;<lpage>554</lpage>. <pub-id pub-id-type="doi">10.1146/annurev.ps.30.020179.002523</pub-id></citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Escalera</surname> <given-names>S.</given-names></name> <name><surname>Pujol</surname> <given-names>O.</given-names></name> <name><surname>Radeva</surname> <given-names>P.</given-names></name></person-group> (<year>2009</year>). <article-title>Separability of ternary codes for sparse designs of error-correcting output codes</article-title>. <source>Pattern Recognit. Lett.</source> <volume>30</volume>, <fpage>285</fpage>&#x02013;<lpage>297</lpage>. <pub-id pub-id-type="doi">10.1016/j.patrec.2008.10.002</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Eyben</surname> <given-names>F.</given-names></name> <name><surname>Scherer</surname> <given-names>K. R.</given-names></name> <name><surname>Schuller</surname> <given-names>B. W.</given-names></name> <name><surname>Sundberg</surname> <given-names>J.</given-names></name> <name><surname>Andr&#x000E9;</surname> <given-names>E.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>The geneva minimalistic acoustic parameter set (gemaps) for voice research and affective computing</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>7</volume>, <fpage>190</fpage>&#x02013;<lpage>202</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2015.2457417</pub-id></citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Eyben</surname> <given-names>F.</given-names></name> <name><surname>W&#x000F6;llmer</surname> <given-names>M.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2010</year>). <article-title>Opensmile: the munich versatile and fast open-source audio feature extractor</article-title>, in <source>Proceedings of the 18th ACM International Conference on Multimedia</source> (<publisher-loc>Firenze</publisher-loc>), <fpage>1459</fpage>&#x02013;<lpage>1462</lpage>.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Fan</surname> <given-names>R.-E.</given-names></name> <name><surname>Chang</surname> <given-names>K.-W.</given-names></name> <name><surname>Hsieh</surname> <given-names>C.-J.</given-names></name> <name><surname>Wang</surname> <given-names>X.-R.</given-names></name> <name><surname>Lin</surname> <given-names>C.-J.</given-names></name></person-group> (<year>2008</year>). <article-title>Liblinear: a library for large linear classification</article-title>. <source>J. Mach. Learn. Res.</source> <volume>9</volume>, <fpage>1871</fpage>&#x02013;<lpage>1874</lpage>. <pub-id pub-id-type="doi">10.5555/1390681.1442794</pub-id><pub-id pub-id-type="pmid">21207929</pub-id></citation></ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Forney</surname> <given-names>G. D.</given-names></name></person-group> (<year>1973</year>). <article-title>The viterbi algorithm</article-title>. <source>Proc. IEEE</source> <volume>61</volume>, <fpage>268</fpage>&#x02013;<lpage>278</lpage>. <pub-id pub-id-type="doi">10.1109/PROC.1973.9030</pub-id></citation></ref>
<ref id="B16">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Grimm</surname> <given-names>M.</given-names></name> <name><surname>Kroschel</surname> <given-names>K.</given-names></name> <name><surname>Mower</surname> <given-names>E.</given-names></name> <name><surname>Narayanan</surname> <given-names>S.</given-names></name></person-group> (<year>2007</year>). <article-title>Primitives-based evaluation and estimation of emotions in speech</article-title>. <source>Speech Commun.</source> <volume>49</volume>, <fpage>787</fpage>&#x02013;<lpage>800</lpage>. <pub-id pub-id-type="doi">10.1016/j.specom.2007.01.010</pub-id></citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gunes</surname> <given-names>H.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2013</year>). <article-title>Categorical and dimensional affect analysis in continuous input: current trends and future directions</article-title>. <source>Image Vis. Comput.</source> <volume>31</volume>, <fpage>120</fpage>&#x02013;<lpage>136</lpage>. <pub-id pub-id-type="doi">10.1016/j.imavis.2012.06.016</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Han</surname> <given-names>W.</given-names></name> <name><surname>Jiang</surname> <given-names>T.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Ruan</surname> <given-names>H.</given-names></name></person-group> (<year>2020</year>). <article-title>Ordinal learning for emotion recognition in customer service calls</article-title>, in <source>ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>6494</fpage>&#x02013;<lpage>6498</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>Z.</given-names></name> <name><surname>Dang</surname> <given-names>T.</given-names></name> <name><surname>Cummins</surname> <given-names>N.</given-names></name> <name><surname>Stasak</surname> <given-names>B.</given-names></name> <name><surname>Le</surname> <given-names>P.</given-names></name> <name><surname>Sethu</surname> <given-names>V.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction</article-title>, in <source>Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge</source> (<publisher-loc>Brisbane</publisher-loc>), <fpage>41</fpage>&#x02013;<lpage>48</lpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>Z.</given-names></name> <name><surname>Epps</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). <article-title>Detecting the instant of emotion change from speech using a martingale framework</article-title>, in <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5195</fpage>&#x02013;<lpage>5199</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Joachims</surname> <given-names>T.</given-names></name></person-group> (<year>2002</year>). <article-title>Optimizing search engines using clickthrough data</article-title>, in <source>Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source> (<publisher-loc>Edmonton</publisher-loc>), <fpage>133</fpage>&#x02013;<lpage>142</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kendall</surname> <given-names>M. G.</given-names></name></person-group> (<year>1938</year>). <article-title>A new measure of rank correlation</article-title>. <source>Biometrika</source> <volume>30</volume>, <fpage>81</fpage>&#x02013;<lpage>93</lpage>. <pub-id pub-id-type="doi">10.2307/2332226</pub-id></citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>J. C.</given-names></name> <name><surname>Clements</surname> <given-names>M. A.</given-names></name></person-group> (<year>2015</year>). <article-title>Multimodal affect classification at various temporal lengths</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>6</volume>, <fpage>371</fpage>&#x02013;<lpage>384</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2015.2411273</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kim</surname> <given-names>K.-J.</given-names></name> <name><surname>Ahn</surname> <given-names>H.</given-names></name></person-group> (<year>2012</year>). <article-title>A corporate credit rating model using multi-class support vector machines with an ordinal pairwise partitioning approach</article-title>. <source>Comput. Oper. Res.</source> <volume>39</volume>, <fpage>1800</fpage>&#x02013;<lpage>1811</lpage>. <pub-id pub-id-type="doi">10.1016/j.cor.2011.06.023</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liang</surname> <given-names>P. P.</given-names></name> <name><surname>Zadeh</surname> <given-names>A.</given-names></name> <name><surname>Morency</surname> <given-names>L.-P.</given-names></name></person-group> (<year>2018</year>). <article-title>Multimodal local-global ranking fusion for emotion recognition</article-title>, in <source>Proceedings of the 20th ACM International Conference on Multimodal Interaction</source> (<publisher-loc>Boulder</publisher-loc>), <fpage>472</fpage>&#x02013;<lpage>476</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lotfian</surname> <given-names>R.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>Practical considerations on the use of preference learning for ranking emotional speech</article-title>, in <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5205</fpage>&#x02013;<lpage>5209</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Makantasis</surname> <given-names>K.</given-names></name></person-group> (<year>2021</year>). <article-title>Affranknet&#x0002B;: ranking affect using privileged information</article-title>. <source>arXiv preprint</source> arXiv:2108.05598.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Martinez</surname> <given-names>H. P.</given-names></name> <name><surname>Yannakakis</surname> <given-names>G. N.</given-names></name> <name><surname>Hallam</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Don&#x00027;t classify ratings of affect; rank them!</article-title> <source>IEEE Trans. Affect. Comput.</source> <volume>5</volume>, <fpage>314</fpage>&#x02013;<lpage>326</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2014.2352268</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Melhart</surname> <given-names>D.</given-names></name> <name><surname>Sfikas</surname> <given-names>K.</given-names></name> <name><surname>Giannakakis</surname> <given-names>G.</given-names></name> <name><surname>Liapis</surname> <given-names>A.</given-names></name> <name><surname>Yannakakis</surname> <given-names>G. N.</given-names></name></person-group> (<year>2020</year>). <article-title>A study on affect model validity: nominal vs ordinal labels</article-title>, in <source>Workshop on Artificial Intelligence in Affective Computing</source> (<publisher-loc>Stockholm</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>27</fpage>&#x02013;<lpage>34</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Metallinou</surname> <given-names>A.</given-names></name> <name><surname>Wollmer</surname> <given-names>M.</given-names></name> <name><surname>Katsamanis</surname> <given-names>A.</given-names></name> <name><surname>Eyben</surname> <given-names>F.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Narayanan</surname> <given-names>S.</given-names></name></person-group> (<year>2012</year>). <article-title>Context-sensitive learning for enhanced audiovisual emotion classification</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>3</volume>, <fpage>184</fpage>&#x02013;<lpage>198</lpage>. <pub-id pub-id-type="doi">10.1109/T-AFFC.2011.40</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Parthasarathy</surname> <given-names>S.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2016</year>). <article-title>Defining emotionally salient regions using qualitative agreement method</article-title>, in <source>Interspeech</source> (<publisher-loc>San Francisco</publisher-loc>), <fpage>3598</fpage>&#x02013;<lpage>3602</lpage>.</citation>
</ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Parthasarathy</surname> <given-names>S.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>Preference-learning with qualitative agreement for sentence level emotional annotations</article-title>, in <source>Interspeech 2018</source> (<publisher-loc>Hyderabad</publisher-loc>).</citation>
</ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Parthasarathy</surname> <given-names>S.</given-names></name> <name><surname>Lotfian</surname> <given-names>R.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Ranking emotional attributes with deep neural networks</article-title>, in <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4995</fpage>&#x02013;<lpage>4999</lpage>.</citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Picard</surname> <given-names>R. W.</given-names></name></person-group> (<year>2000</year>). <source>Affective Computing</source>. <publisher-name>MIT Press</publisher-name>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Platt</surname> <given-names>J.</given-names></name></person-group> (<year>1999</year>). <article-title>Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods</article-title>. <source>Adv. Large Margin Classifiers</source> <volume>10</volume>, <fpage>61</fpage>&#x02013;<lpage>74</lpage>.</citation>
</ref>
<ref id="B36">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ringeval</surname> <given-names>F.</given-names></name> <name><surname>Amiriparian</surname> <given-names>S.</given-names></name> <name><surname>Eyben</surname> <given-names>F.</given-names></name> <name><surname>Scherer</surname> <given-names>K.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2014</year>). <article-title>Emotion recognition in the wild: incorporating voice and lip activity in multimodal decision-level fusion</article-title>, in <source>Proceedings of the 16th International Conference on Multimodal Interaction</source> (<publisher-loc>Istanbul</publisher-loc>), <fpage>473</fpage>&#x02013;<lpage>480</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ringeval</surname> <given-names>F.</given-names></name> <name><surname>Sonderegger</surname> <given-names>A.</given-names></name> <name><surname>Sauer</surname> <given-names>J.</given-names></name> <name><surname>Lalanne</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). <article-title>Introducing the recola multimodal corpus of remote collaborative and affective interactions</article-title>, in <source>2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russell</surname> <given-names>J. A.</given-names></name></person-group> (<year>1980</year>). <article-title>A circumplex model of affect</article-title>. <source>J. Pers. Soc. Psychol.</source> <volume>39</volume>, <fpage>1161</fpage>. <pub-id pub-id-type="doi">10.1037/h0077714</pub-id></citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russell</surname> <given-names>J. A.</given-names></name> <name><surname>Bachorowski</surname> <given-names>J.-A.</given-names></name> <name><surname>Fern&#x000E1;ndez-Dols</surname> <given-names>J.-M.</given-names></name></person-group> (<year>2003</year>). <article-title>Facial and vocal expressions of emotion</article-title>. <source>Ann. Rev. Psychol.</source> <volume>54</volume>, <fpage>329</fpage>&#x02013;<lpage>349</lpage>. <pub-id pub-id-type="doi">10.1146/annurev.psych.54.101601.145102</pub-id><pub-id pub-id-type="pmid">12415074</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sahoo</surname> <given-names>S.</given-names></name> <name><surname>Routray</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>Emotion recognition from audio-visual data using rule based decision level fusion</article-title>, in <source>2016 IEEE Students Technology Symposium (TechSym)</source> (<publisher-loc>Kharagpur</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7</fpage>&#x02013;<lpage>12</lpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schmitt</surname> <given-names>M.</given-names></name> <name><surname>Ringeval</surname> <given-names>F.</given-names></name> <name><surname>Schuller</surname> <given-names>B. W.</given-names></name></person-group> (<year>2016</year>). <article-title>At the border of acoustics and linguistics: bag-of-audio-words for the recognition of emotions in speech</article-title>, in <source>Interspeech</source> (<publisher-loc>San Francisco</publisher-loc>), <fpage>495</fpage>&#x02013;<lpage>499</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schmitt</surname> <given-names>M.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2017</year>). <article-title>Openxbow&#x02014;introducing the passau open-source crossmodal bag-of-words toolkit</article-title>. <source>J. Mach. Learn. Res.</source> <volume>18</volume>, <fpage>1</fpage>&#x02013;<lpage>5</lpage>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schoneveld</surname> <given-names>L.</given-names></name> <name><surname>Othmani</surname> <given-names>A.</given-names></name> <name><surname>Abdelkawy</surname> <given-names>H.</given-names></name></person-group> (<year>2021</year>). <article-title>Leveraging recent advances in deep learning for audio-visual emotion recognition</article-title>. <source>Pattern Recognit. Lett.</source> <volume>146</volume>, <fpage>1</fpage>&#x02013;<lpage>7</lpage>. <pub-id pub-id-type="doi">10.1016/j.patrec.2021.03.007</pub-id></citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simon-Thomas</surname> <given-names>E. R.</given-names></name> <name><surname>Keltner</surname> <given-names>D. J.</given-names></name> <name><surname>Sauter</surname> <given-names>D.</given-names></name> <name><surname>Sinicropi-Yao</surname> <given-names>L.</given-names></name> <name><surname>Abramson</surname> <given-names>A.</given-names></name></person-group> (<year>2009</year>). <article-title>The voice conveys specific emotions: evidence from vocal burst displays</article-title>. <source>Emotion</source> <volume>9</volume>, <fpage>838</fpage>. <pub-id pub-id-type="doi">10.1037/a0017810</pub-id><pub-id pub-id-type="pmid">20001126</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Stewart</surname> <given-names>N.</given-names></name> <name><surname>Brown</surname> <given-names>G. D.</given-names></name> <name><surname>Chater</surname> <given-names>N.</given-names></name></person-group> (<year>2005</year>). <article-title>Absolute identification by relative judgment</article-title>. <source>Psychol. Rev.</source> <volume>112</volume>, <fpage>881</fpage>. <pub-id pub-id-type="doi">10.1037/0033-295X.112.4.881</pub-id><pub-id pub-id-type="pmid">16262472</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tzirakis</surname> <given-names>P.</given-names></name> <name><surname>Zafeiriou</surname> <given-names>S.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2019</year>). <article-title>Real-world automatic continuous affect recognition from audiovisual signals</article-title>, in <source>Multimodal Behavior Analysis in the Wild</source> (<publisher-name>Elsevier</publisher-name>), <fpage>387</fpage>&#x02013;<lpage>406</lpage>.</citation>
</ref>
<ref id="B47">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Valstar</surname> <given-names>M.</given-names></name> <name><surname>Gratch</surname> <given-names>J.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Ringeval</surname> <given-names>F.</given-names></name> <name><surname>Lalanne</surname> <given-names>D.</given-names></name> <name><surname>Torres Torres</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Avec 2016: depression, mood, and emotion recognition workshop and challenge</article-title>, in <source>Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge</source> (<publisher-loc>Amsterdam</publisher-loc>), <fpage>3</fpage>&#x02013;<lpage>10</lpage>.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>C.-H.</given-names></name> <name><surname>Lin</surname> <given-names>J.-C.</given-names></name> <name><surname>Wei</surname> <given-names>W.-L.</given-names></name></person-group> (<year>2014</year>). <article-title>Survey on audiovisual emotion recognition: databases, features, and data fusion strategies</article-title>. <source>APSIPA Trans. Signal Inf. Process.</source> <volume>3</volume>, <fpage>e12</fpage>. <pub-id pub-id-type="doi">10.1017/ATSIP.2014.11</pub-id><pub-id pub-id-type="pmid">30886898</pub-id></citation></ref>
<ref id="B49">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Dang</surname> <given-names>T.</given-names></name> <name><surname>Sethu</surname> <given-names>V.</given-names></name> <name><surname>Ambikairajah</surname> <given-names>E.</given-names></name></person-group> (<year>2021</year>). <article-title>A novel markovian framework for integrating absolute and relative ordinal emotion information</article-title>. <source>arXiv preprint</source> arXiv:2108.04605.</citation>
</ref>
<ref id="B50">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yalamanchili</surname> <given-names>B.</given-names></name> <name><surname>Dungala</surname> <given-names>K.</given-names></name> <name><surname>Mandapati</surname> <given-names>K.</given-names></name> <name><surname>Pillodi</surname> <given-names>M.</given-names></name> <name><surname>Vanga</surname> <given-names>S. R.</given-names></name></person-group> (<year>2021</year>). <article-title>Survey on multimodal emotion recognition (mer) systems</article-title>, in <source>Machine Learning Technologies and Applications: Proceedings of ICACECS 2020</source> (<publisher-loc>Singapore</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>319</fpage>&#x02013;<lpage>326</lpage>.</citation>
</ref>
<ref id="B51">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yannakakis</surname> <given-names>G. N.</given-names></name> <name><surname>Cowie</surname> <given-names>R.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>The ordinal nature of emotions: an emerging approach</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>12</volume>, <fpage>16</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2018.2879512</pub-id></citation></ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Provost</surname> <given-names>E. M.</given-names></name></person-group> (<year>2019</year>). <article-title>Automatic recognition of self-reported and perceived emotions</article-title>, in <source>Multimodal Behavior Analysis in the Wild</source> (<publisher-name>Elsevier</publisher-name>), <fpage>443</fpage>&#x02013;<lpage>470</lpage>.</citation>
</ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Ringeval</surname> <given-names>F.</given-names></name> <name><surname>Dong</surname> <given-names>B.</given-names></name> <name><surname>Coutinho</surname> <given-names>E.</given-names></name> <name><surname>Marchi</surname> <given-names>E.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). <article-title>Enhanced semi-supervised learning for multimodal emotion recognition</article-title>, in <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5185</fpage>&#x02013;<lpage>5189</lpage>.</citation></ref>
</ref-list> 
</back>
</article>
