<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="review-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Comput. Sci.</journal-id>
<journal-title>Frontiers in Computer Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Comput. Sci.</abbrev-journal-title>
<issn pub-type="epub">2624-9898</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fcomp.2020.00009</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Computer Science</subject>
<subj-group>
<subject>Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>A Review of Generalizable Transfer Learning in Automatic Emotion Recognition</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name><surname>Feng</surname> <given-names>Kexin</given-names></name>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/748194/overview"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Chaspari</surname> <given-names>Theodora</given-names></name>
<uri xlink:href="http://loop.frontiersin.org/people/874521/overview"/>
</contrib>
</contrib-group>
<aff><institution>HUman Bio-Behavioral Signals (HUBBS) Lab, Texas A&#x00026;M University</institution>, <addr-line>College Station, TX</addr-line>, <country>United States</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Nicholas Cummins, University of Augsburg, Germany</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Ronald B&#x000F6;ck, Otto von Guericke University Magdeburg, Germany; Ziping Zhao, Tianjin Normal University, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Kexin Feng <email>kexin0814&#x00040;tamu.edu</email></corresp>
<fn fn-type="other" id="fn001"><p>This article was submitted to Human-Media Interaction, a section of the journal Frontiers in Computer Science</p></fn></author-notes>
<pub-date pub-type="epub">
<day>28</day>
<month>02</month>
<year>2020</year>
</pub-date>
<pub-date pub-type="collection">
<year>2020</year>
</pub-date>
<volume>2</volume>
<elocation-id>9</elocation-id>
<history>
<date date-type="received">
<day>31</day>
<month>07</month>
<year>2019</year>
</date>
<date date-type="accepted">
<day>14</day>
<month>02</month>
<year>2020</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2020 Feng and Chaspari.</copyright-statement>
<copyright-year>2020</copyright-year>
<copyright-holder>Feng and Chaspari</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license>
</permissions>
<abstract><p>Automatic emotion recognition is the process of identifying human emotion from signals such as facial expression, speech, and text. Collecting and labeling such signals is often tedious and frequently requires expert knowledge. Transfer learning is an effective way to address challenges related to the scarcity of data and the lack of human labels. In this manuscript, we describe fundamental concepts in the field of transfer learning and review work that has successfully applied transfer learning to automatic emotion recognition. We finally discuss promising future research directions of transfer learning for improving the generalizability of automatic emotion recognition systems.</p></abstract>
<kwd-group>
<kwd>transfer learning</kwd>
<kwd>generalizability</kwd>
<kwd>automatic emotion recognition</kwd>
<kwd>speech</kwd>
<kwd>image</kwd>
<kwd>physiology</kwd>
</kwd-group>
<contract-sponsor id="cn001">Engineering Information Foundation<named-content content-type="fundref-id">10.13039/100002844</named-content></contract-sponsor>
<counts>
<fig-count count="0"/>
<table-count count="4"/>
<equation-count count="0"/>
<ref-count count="147"/>
<page-count count="14"/>
<word-count count="11338"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Emotion plays an important role in human-human or human-computer interaction. Emotionally-aware systems enable the better understanding of human behavior and facilitate uninterrupted and long-term interaction between humans and computers (Beale and Peter, <xref ref-type="bibr" rid="B10">2008</xref>). The recent development of laboratory and real-world sensing systems allows us to fully capture multimodal signal information related to human emotion. This has resulted in a large amount of publicly available datasets with high variability in terms of elicitation methods, speaker demographics, spoken language, and recording conditions. Despite the high availability of such datasets, the amount of data included in each dataset is limited and the emotion-related labels are scarce, therefore prohibiting the reliable training and generalizability of emotion recognition systems. In order to address this challenge, recent studies have proposed transfer learning methods to provide reliable emotion recognition performance, even in unseen contexts, individuals, and conditions (Abdelwahab and Busso, <xref ref-type="bibr" rid="B2">2018</xref>; Lan et al., <xref ref-type="bibr" rid="B67">2018</xref>; Latif et al., <xref ref-type="bibr" rid="B69">2018</xref>; Gideon et al., <xref ref-type="bibr" rid="B45">2019</xref>).</p>
<p>Emerging transfer learning methods can leverage the knowledge from one emotion-related domain to another. The main premise behind such techniques is that people may share similar characteristics when expressing a given emotion. For example, anger may result in increased speech loudness and more intense facial expressions (Siegman and Boyle, <xref ref-type="bibr" rid="B115">1993</xref>). Fear is usually expressed with reduced speech loudness and may produce increased heart rate (Hodges and Spielberger, <xref ref-type="bibr" rid="B57">1966</xref>). These emotion-specific characteristics might be commonly met among people, contributing to the similarity among the various emotional datasets. Therefore, transfer learning approaches can learn common emotion-specific patterns and can be applied across domains for recognizing emotions in datasets with scarce or non-labeled samples. Such techniques can further result in generalizable systems, which can detect emotion for unseen data.</p>
<p>The current manuscript discusses ways in which transfer learning techniques can overcome challenges related to a limited amount of data samples, scarce labels, and condition mismatch, and can result in robust and generalizable automated systems for emotion recognition. We first introduce basic concepts in transfer learning (section 2) and discuss the development of major transfer learning methods and their applications in conventional machine learning fields, such as computer vision (section 3). We then review state-of-the-art work in automatic emotion recognition using transfer learning for the speech, image, and physiological modalities (section 4). Finally, we discuss promising research directions of transfer learning for improving the generalizability of automatic emotion recognition systems (section 5).</p>
</sec>
<sec id="s2">
<title>2. Basic Concepts in Transfer Learning</title>
<p>In this section, we will provide the basic definition of transfer learning and discuss several ways to categorize transfer learning methods.</p>
<sec>
<title>2.1. Definitions</title>
<p>Domain in transfer learning generally refers to a feature space and its marginal probability distribution (Pan et al., <xref ref-type="bibr" rid="B98">2010</xref>). Given a specific domain, a task includes a label space and an objective function that needs to be optimized. The source domain usually refers to a set of data with sufficient samples, a large number of labels, and potentially high quality (e.g., collected in a lab environment). In contrast, data from the target domain may include a limited number of samples and few or no labels, and may be noisy. Given a source and a target, transfer learning approaches attempt to improve the learning of the target task using knowledge from the source domain.</p>
</sec>
<sec>
<title>2.2. Association Metrics Between Source and Target Domains</title>
<p>The selection of the source domain plays an important role in the transfer learning process. A source domain sharing many similarities with the target is more likely to yield efficient transfer (Pan and Yang, <xref ref-type="bibr" rid="B99">2009</xref>). Similarity can be quantified through the distance between the source and target with respect to their data structure (e.g., feature or label distribution), recording conditions (e.g., recording equipment, elicitation methods), and data sample characteristics (e.g., participants with similar demographics, speech samples of the same language).</p>
<p>Proposed transfer learning methods typically use a distance metric to maximize the similarity between the source and target domains. Commonly used distance metrics include: (a) the Kullback-Leibler divergence (KL divergence), which employs a cross-entropy measure to calculate the similarity between the probability distributions of the source and target domains; (b) the Jensen-Shannon divergence (JS divergence), a symmetric version of the KL divergence; (c) the Maximum Mean Discrepancy (MMD) and multi-kernel MMD, which embed the source and target domains in a Reproducing Kernel Hilbert Space (RKHS) and compare their mean difference; and (d) the Wasserstein Distance, also known as the Earth-Mover (EM) Distance, which quantifies the domain difference even when there is very little or no overlap between the two distributions by computing the transport map between the probability densities of the two domains.</p>
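For illustration, the first two of these metrics can be computed for discrete distributions in a few lines of Python (a toy sketch with made-up distributions, not code from the reviewed work):

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetrized, smoothed KL divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

source = [0.7, 0.2, 0.1]   # e.g., a label distribution in the source domain
target = [0.4, 0.4, 0.2]   # e.g., a label distribution in the target domain

print(kl_divergence(source, target))  # asymmetric: != kl_divergence(target, source)
print(js_divergence(source, target))  # symmetric in its two arguments
```

The asymmetry of the KL divergence is what motivates the symmetric JS variant when neither domain is privileged.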
</sec>
<sec>
<title>2.3. Categorization of Transfer Learning Techniques</title>
<p>The state-of-the-art application of transfer learning to automatic emotion recognition tasks can be categorized along two main dimensions. The first refers to the availability of labels in the target domain. Supervised transfer learning uses labeled data from both the source and target domains during the learning task, while unsupervised transfer learning uses only the labels of the source domain (Pan and Yang, <xref ref-type="bibr" rid="B99">2009</xref>). Unsupervised transfer learning enables the design of reliable machine learning systems even for domains for which labeled data are not available. The second refers to the availability of one or multiple datasets in the source. Single-source transfer learning uses only one dataset, while multi-source transfer learning leverages multiple sets of data in the source domain (Ding et al., <xref ref-type="bibr" rid="B38">2019</xref>).</p>
</sec>
</sec>
<sec id="s3">
<title>3. Emerging Work on Supervised and Unsupervised Transfer Learning</title>
<p>This section provides an overview of previously proposed methods in the field of transfer learning, summarized into three main categories: (a) statistical-based transfer learning; (b) region selection through domain relevance; and (c) deep transfer learning approaches. We will further discuss these three categories in the following subsections.</p>
<sec>
<title>3.1. Statistical-Based Transfer Learning</title>
<p>Three types of statistical approaches have been proposed for transfer learning: (a) distribution alignment, which aims to minimize the shift between the source and target domains in order to reduce the domain difference; (b) latent space extraction, which recovers common components between the two domains; and (c) classifier regularization, which regularizes a classifier trained on the source domain so that it can predict labels in the target domain.</p>
<sec>
<title>3.1.1. Alignment Between Source and Target Distributions</title>
<p>The marginal alignment methods aim to find a mapping between the source and target distributions. This can be done by setting pairwise constraints between the two domains (Saenko et al., <xref ref-type="bibr" rid="B107">2010</xref>; Kulis et al., <xref ref-type="bibr" rid="B66">2011</xref>). Gopalan et al. (<xref ref-type="bibr" rid="B51">2011</xref>) utilized the Grassmann manifold and incremental learning to search for a transformation path between the domains. This method was further improved by Gong et al. (<xref ref-type="bibr" rid="B48">2012</xref>), who proposed the Geodesic Flow Kernel (GFK). The Grassmann manifold and Maximum Mean Discrepancy (MMD) are also used in other approaches, such as the Domain Invariant Projection (DIP) proposed by Baktashmotlagh et al. (<xref ref-type="bibr" rid="B8">2013</xref>). Marginal alignment is one of the first methods that appeared in the field of transfer learning and attempts to find a common distribution between the source and target domains. Despite its promising results, marginal alignment might not fully align two distinct domains to the same distribution, especially when they exhibit a high degree of mismatch (e.g., different emotional expressions or recording settings).</p>
</sec>
<sec>
<title>3.1.2. Latent Space Extraction</title>
<p>The shared subspace extraction methods assume that the feature space of each domain consists of domain-specific and domain-invariant components, and they comprise one of the most commonly used approaches in transfer learning. These methods attempt to map both the source and target data to a subspace that keeps only the information common to the two domains, in order to minimize the difference between them. To find such a subspace, Pan et al. (<xref ref-type="bibr" rid="B98">2010</xref>) proposed the Transfer Component Analysis (TCA), which uses the Maximum Mean Discrepancy (MMD) and a Reproducing Kernel Hilbert Space (RKHS) and significantly reduced the distribution mismatch using a low-dimensional subspace. Additional methods for achieving this goal include boosting (Becker et al., <xref ref-type="bibr" rid="B11">2013</xref>) and the Domain-Invariant Component Analysis (DICA) (Muandet et al., <xref ref-type="bibr" rid="B92">2013</xref>). Due to inherent similarities across emotions, the idea of extracting a common latent space has been successfully applied to automatic emotion recognition tasks (Deng et al., <xref ref-type="bibr" rid="B33">2013</xref>; Zheng et al., <xref ref-type="bibr" rid="B145">2015</xref>).</p>
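The MMD criterion at the core of TCA can be sketched as follows for one-dimensional samples (a toy, biased empirical estimate with an RBF kernel; all numbers are illustrative and not taken from the reviewed papers):

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel k(x, y) = exp(-gamma * ||x - y||^2) for 1-D samples."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    """Biased empirical estimate of the squared Maximum Mean Discrepancy."""
    k_xx = sum(rbf(a, b, gamma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(rbf(a, b, gamma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2 * k_xy

source = [0.1, 0.2, 0.15, 0.3]     # toy source-domain feature values
near   = [0.12, 0.22, 0.18, 0.28]  # target with a similar distribution
far    = [2.1, 2.4, 2.2, 2.6]      # target with a mismatched distribution

print(mmd2(source, near))  # small: the distributions overlap
print(mmd2(source, far))   # larger: the distributions are far apart
```

A method such as TCA searches for a projection of both domains that drives this quantity toward zero while preserving task-relevant variance.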
</sec>
<sec>
<title>3.1.3. Classifier Regularization</title>
<p>Other approaches have attempted to utilize a regularized version of a classifier trained on the source domain in order to predict the labels in the target domain. Support Vector Machines (SVM) have been widely explored in this process. Yang et al. (<xref ref-type="bibr" rid="B134">2007</xref>) proposed the Adaptive SVM (A-SVM), which learns a difference function (also referred to as a &#x0201C;delta function&#x0201D;) between an original and an adapted classifier using an objective function similar to that of the SVM. Other methods include the Projective Model Transfer SVM (PMT-SVM), the Deformable Adaptive SVM (DA-SVM), and the Domain Weighting SVM (DWSVM) (Aytar and Zisserman, <xref ref-type="bibr" rid="B7">2011</xref>), which adapt the weights of a source model to the target domain using various pre-defined constraints (e.g., assigning different weights to the source and target domains in DWSVM). Bergamo and Torresani (<xref ref-type="bibr" rid="B12">2010</xref>) also explored the efficacy of transferring knowledge between different SVM structures using the feature Augmentation SVM (AUGSVM), the Transductive SVM (TSVM), and the Domain Transfer SVM (DT-SVM) (Duan et al., <xref ref-type="bibr" rid="B39">2009</xref>). Other types of statistical models, such as maximum entropy classifiers, have also been explored in this process (Daume and Marcu, <xref ref-type="bibr" rid="B28">2006</xref>). Due to its simplicity and efficacy with small data samples, classifier regularization has been applied to various emotion recognition tasks based on physiological signals (Zheng and Lu, <xref ref-type="bibr" rid="B144">2016</xref>).</p>
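The delta-function idea behind the A-SVM can be illustrated with a toy sketch: keep the source classifier's decision function fixed and learn only an additive linear correction on the target data. Here simple perceptron updates stand in for the original SVM objective, and all weights and data points are made up for illustration:

```python
def f_source(x):
    """Decision score of a classifier pre-trained on the source domain
    (weights are made up for this sketch)."""
    return x[0] - x[1]

def adapt(target_data, lr=0.5, epochs=50):
    """Learn a linear 'delta function' added to the frozen source score,
    in the spirit of the A-SVM (perceptron updates, not the SVM objective)."""
    w = [0.0, 0.0, 0.0]                        # delta weights plus a bias term
    for _ in range(epochs):
        for x, y in target_data:
            xa = [x[0], x[1], 1.0]             # augment the input with a bias
            score = f_source(x) + sum(wi * xi for wi, xi in zip(w, xa))
            if y * score <= 0:                 # misclassified: update delta only
                w = [wi + lr * y * xi for wi, xi in zip(w, xa)]
    return w

# Toy target domain whose decision boundary is shifted relative to the source:
target = [([2.0, 0.0], 1), ([0.5, 0.0], -1), ([0.0, 1.0], -1), ([3.0, 1.0], 1)]
w = adapt(target)
```

The source function is never modified; only the small correction term is fit, which is why this family of methods works well when target data are scarce.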
</sec>
</sec>
<sec>
<title>3.2. Region Selection Through Domain Relevance</title>
<p>The region selection approaches have been mostly introduced in computer vision and rely on the way humans understand a given image. For example, instead of giving equal attention to every part of an image, humans concentrate more on specific salient objects. Therefore, these approaches aim to identify salient regions of an image by generating a domainness map and separating the image into different levels of domainness (Tommasi et al., <xref ref-type="bibr" rid="B123">2016</xref>). This domainness feature is further utilized to promote knowledge transfer. Other studies have proposed methods using sparse coding (Long et al., <xref ref-type="bibr" rid="B78">2013</xref>) or abstract auxiliary information (e.g., the skeleton or color of an image) (Motiian et al., <xref ref-type="bibr" rid="B90">2016</xref>), which also simulate the way humans comprehend an image as a whole. Hjelm et al. (<xref ref-type="bibr" rid="B56">2018</xref>) also utilized the different domain relevance of each part of an image to extract and maximize mutual information. The region selection methods are very close to the way humans perceive information, and the domainness map makes them straightforward and explainable. While similar ideas can also be applied to non-image-related tasks, the determination of domainness levels and the validation of the extracted domainness map can be less straightforward.</p>
</sec>
<sec>
<title>3.3. Deep Transfer Learning Methods</title>
<p>Deep learning methods have been widely explored and applied in transfer learning. Two main types of deep learning approaches have demonstrated promising performance for knowledge transfer: (a) domain adaptation using deep learning, which aims to transfer knowledge or mitigate the domain difference between the source and target with respect to the neural network embedding; and (b) adversarial and generative learning, which aims to generate data embeddings that are least separable between the source and the target.</p>
<sec>
<title>3.3.1. Domain Adaptation Using Deep Learning</title>
<p>The large amount of publicly available datasets has yielded several pre-trained deep learning models [e.g., VGG (Simonyan and Zisserman, <xref ref-type="bibr" rid="B116">2014</xref>), VGG-Face (Parkhi et al., <xref ref-type="bibr" rid="B100">2015</xref>), VGG-M-2048 (Chatfield et al., <xref ref-type="bibr" rid="B26">2014</xref>), and AlexNet (Krizhevsky et al., <xref ref-type="bibr" rid="B65">2012</xref>)], which have achieved good performance in image and speech recognition tasks. To address the mismatch between different domains, it is possible to utilize the parameters/structure of pre-trained models to achieve knowledge transfer (e.g., using a model with the same number of hidden layers and the same weights learned from the source data). A promising method for achieving this is fine-tuning, which replaces and learns the last layers of the model, while re-adjusting the parameters of the previous ones. A challenge with fine-tuning lies in the fact that the parameters learned on the source task are not preserved after learning the target task. In order to address this &#x0201C;forgetting problem,&#x0201D; Rusu et al. (<xref ref-type="bibr" rid="B106">2016</xref>) proposed the progressive neural network, which keeps the network trained on the source data and builds an additional network for the target on top of it. Jung et al. (<xref ref-type="bibr" rid="B61">2018</xref>) also addressed this problem by keeping the decision boundary unchanged, while making the feature embeddings extracted for the target close to those of the source domain. Utilizing a model pre-trained on the source data has the following advantages: (a) it speeds up the training process; (b) it potentially increases the generalization ability, as well as the robustness, of the final model; and (c) it automatically extracts high-level features shared between domains.</p>
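As a minimal illustration of this recipe (a toy pure-Python model with made-up weights, not an actual pre-trained network), one can freeze the layer learned on the source task and retrain only a fresh output layer on the target data:

```python
# Toy "pre-trained" feature extractor: a frozen linear layer whose weights
# are assumed to have been learned on the source task (values are made up).
W_frozen = [[0.5, -0.2], [0.1, 0.9]]

def features(x):
    """Frozen first layer: h = ReLU(W x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_frozen]

def fine_tune(data, lr=0.1, epochs=200):
    """Learn only a fresh output layer v on the target data (MSE loss, SGD);
    the frozen layer W_frozen is never updated."""
    v = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            h = features(x)
            err = sum(vi * hi for vi, hi in zip(v, h)) - y
            v = [vi - lr * err * hi for vi, hi in zip(v, h)]  # update v only
    return v

target_data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.5)]  # toy target samples
v = fine_tune(target_data)
```

Full fine-tuning would additionally take small gradient steps on the frozen weights; the forgetting problem arises precisely because those steps overwrite the source-task solution.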
<p>Although neural network fine-tuning and progressive neural networks can benefit the training process in terms of computational time and the ability to generalize (Yosinski et al., <xref ref-type="bibr" rid="B135">2014</xref>), these methods sometimes fail to address the domain difference and may perform poorly when the source and target have small overlap. An alternative approach has been proposed by Ghifary et al. (<xref ref-type="bibr" rid="B43">2014</xref>), who added adaptation layers to conventional deep learning models and used the Maximum Mean Discrepancy (MMD) to minimize the distribution mismatch between the source and target domains. Instead of making use of a well-trained model, the source data are used in conjunction with the target data in the training process to determine the domain distance. Further research includes determining which layer to use as the adaptation layer and applying multiple adaptation layers in a model (Tzeng et al., <xref ref-type="bibr" rid="B125">2014</xref>; Long et al., <xref ref-type="bibr" rid="B77">2015</xref>). Moreover, the Joint convolutional neural net (CNN) architecture or Joint MMD (JMMD) (Tzeng et al., <xref ref-type="bibr" rid="B124">2015</xref>; Long et al., <xref ref-type="bibr" rid="B79">2017</xref>) aims to align similar classes between different domains by taking into account the structural information between them.</p>
<p>Various studies on deep transfer learning for automatic emotion recognition have yielded promising results on speech and image datasets (Gideon et al., <xref ref-type="bibr" rid="B44">2017</xref>; Kaya et al., <xref ref-type="bibr" rid="B62">2017</xref>; Abdelwahab and Busso, <xref ref-type="bibr" rid="B2">2018</xref>; Li and Chaspari, <xref ref-type="bibr" rid="B71">2019</xref>). These methods have been less explored for physiological signals, potentially due to the small amount of available data for this modality.</p>
</sec>
<sec>
<title>3.3.2. Adversarial and Generative Methods</title>
<p>The idea of adversarial learning for knowledge transfer was proposed by Ganin and Lempitsky (<xref ref-type="bibr" rid="B41">2014</xref>) in the domain adversarial neural network (DANN). The DANN contains three parts: a feature extractor, a domain classifier, and a task classifier. The feature extractor attempts to learn feature representations that minimize the loss of the task classifier and maximize the loss of the domain classifier. Instead of modifying the loss function based on the distance between the two domains, the DANN automatically extracts features that are common to both domains while maintaining the characteristics of each class (Ganin et al., <xref ref-type="bibr" rid="B42">2016</xref>). Variants of the DANN have been widely explored. The Domain Separation Network (DSN) proposed by Bousmalis et al. (<xref ref-type="bibr" rid="B14">2016</xref>) modified the feature extractor into three encoders (i.e., one for the source, one for the target, and one for both) in order to separate the domain-specific from the domain-invariant embeddings. The DSN also replaced the domain classifier with a shared decoder to further ensure that the domain-invariant embedding is useful and can promote the generalizability of the model. In the multi-adversarial domain adaptation network proposed by Pei et al. (<xref ref-type="bibr" rid="B101">2018</xref>), a separate task classifier is trained for every class, which makes different classes less likely to have overlapping distributions. As the number of available datasets increases, networks that handle multiple sources of data are also being explored (Xu et al., <xref ref-type="bibr" rid="B132">2018</xref>; Zhao et al., <xref ref-type="bibr" rid="B142">2018</xref>). 
In order to further avoid negative transfer, partial transfer learning, which can be performed via Bayesian optimization (Ruder and Plank, <xref ref-type="bibr" rid="B105">2017</xref>), is applied in adversarial neural networks in order to transfer knowledge from large domains to more specific, smaller domains by selecting only part of the source data in the training process (Cao et al., <xref ref-type="bibr" rid="B21">2018</xref>).</p>
<p>Inspired by the two-player game, generative adversarial nets (GAN) were further proposed by Goodfellow et al. (<xref ref-type="bibr" rid="B49">2014</xref>), containing a generator and a discriminator. The generator produces fake data from a random distribution and aims to confuse the discriminator, while the discriminator focuses on distinguishing between the real and the generated data. In this process, both models learn from each other and can fully explore the patterns of the data, since the informed generation of synthetic samples can potentially overcome the mismatch between the source and the target task. Modifications of GAN-based networks have also been proposed. For example, Radford et al. (<xref ref-type="bibr" rid="B103">2015</xref>) introduced the Deep Convolutional Generative Adversarial Networks (DCGAN), which combine a CNN with a GAN. The Wasserstein GAN (WGAN) integrated the Wasserstein distance in the loss function and further improved training stability (Arjovsky et al., <xref ref-type="bibr" rid="B5">2017</xref>).</p>
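The two-player objective can be sketched for a toy one-dimensional case (illustrative parameters only; a real GAN would alternate gradient updates of the generator and discriminator rather than merely evaluating the losses):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D set-up with made-up parameters: the generator maps noise z to
# a*z + b, and the discriminator scores a sample with sigmoid(w*x + c).
def generate(z, a=1.0, b=0.0):
    return a * z + b

def discriminate(x, w=1.0, c=0.0):
    return sigmoid(w * x + c)

def gan_losses(real, noise):
    """One evaluation of the two-player GAN objective."""
    fake = [generate(z) for z in noise]
    # Discriminator: maximize log D(x) + log(1 - D(G(z))) over real and fake data.
    d_loss = -(sum(math.log(discriminate(x)) for x in real) / len(real)
               + sum(math.log(1 - discriminate(x)) for x in fake) / len(fake))
    # Generator (non-saturating form): maximize log D(G(z)).
    g_loss = -sum(math.log(discriminate(x)) for x in fake) / len(fake)
    return d_loss, g_loss
```

Training alternates gradient steps that decrease `d_loss` with respect to the discriminator parameters and `g_loss` with respect to the generator parameters, until the discriminator can no longer separate real from generated samples.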
<p>The adversarial and generative adversarial neural networks have been successfully applied to speech- and image-based emotion recognition tasks (Wang and Zheng, <xref ref-type="bibr" rid="B128">2015</xref>; Motiian et al., <xref ref-type="bibr" rid="B91">2017</xref>; Sun et al., <xref ref-type="bibr" rid="B122">2018</xref>) with promising results.</p>
</sec>
</sec>
</sec>
<sec id="s4">
<title>4. Transfer Learning for Automatic Emotion Recognition</title>
<p>In this section, we discuss the application of transfer learning to three modalities commonly used in automatic emotion recognition tasks: (a) speech; (b) video (or image); and (c) physiology. Sentiment analysis is not included in this manuscript, since it relies on crowd-sourced data, which are beyond the scope of this review.</p>
<sec>
<title>4.1. Transfer Learning for Speech-Based Emotion Recognition</title>
<p>Because of the multi-faceted information included in the speech signal, transfer learning has been widely applied in speech-based emotion recognition (<xref ref-type="table" rid="T1">Table 1</xref>). Previously proposed approaches attempt to transfer the knowledge between datasets collected under similar conditions (e.g., audio signals collected by actors in the lab) (Abdelwahab and Busso, <xref ref-type="bibr" rid="B1">2015</xref>, <xref ref-type="bibr" rid="B2">2018</xref>; Sagha et al., <xref ref-type="bibr" rid="B108">2016</xref>; Zhang et al., <xref ref-type="bibr" rid="B136">2016</xref>; Deng et al., <xref ref-type="bibr" rid="B30">2017</xref>; Gideon et al., <xref ref-type="bibr" rid="B44">2017</xref>; Neumann and Vu, <xref ref-type="bibr" rid="B93">2019</xref>) or using the knowledge from acted in-lab audio signals to spontaneous speech collected in-the-wild (Deng et al., <xref ref-type="bibr" rid="B32">2014b</xref>; Mao et al., <xref ref-type="bibr" rid="B85">2016</xref>; Zong et al., <xref ref-type="bibr" rid="B147">2016</xref>; Song, <xref ref-type="bibr" rid="B117">2017</xref>; Gideon et al., <xref ref-type="bibr" rid="B45">2019</xref>; Li and Chaspari, <xref ref-type="bibr" rid="B71">2019</xref>).</p>
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>Overview of previously proposed transfer learning methods for speech-based emotion recognition.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="left"><bold>In-lab/real-world</bold><break/><bold>transfer learning</bold></th>
<th valign="top" align="left"><bold>Acted/spontaneous</bold><break/><bold>transfer learning</bold></th>
<th valign="top" align="left"><bold>Emotional labels</bold></th>
<th valign="top" align="left"><bold>Cross-linguistic</bold><break/><bold>transfer learning</bold></th>
<th valign="top" align="left"><bold>Type of transfer learning</bold></th>
<th valign="top" align="left"><bold>Input features</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Deng et al., <xref ref-type="bibr" rid="B33">2013</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> TUM AVIC (Schuller et al., <xref ref-type="bibr" rid="B111">2009a</xref>), EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>),<break/> eNTERFACE (Martin et al., <xref ref-type="bibr" rid="B86">2006</xref>),<break/> SUSAS (Hansen and Bou-Ghazale, <xref ref-type="bibr" rid="B53">1997</xref>),<break/> VAM (Grimm et al., <xref ref-type="bibr" rid="B52">2008</xref>)<break/><bold>target:</bold> FAU AEC (Steidl, <xref ref-type="bibr" rid="B120">2009</xref>)</td>
<td valign="top" align="left">In-lab; real-world</td>
<td valign="top" align="left">Acted &#x00026;<break/> spontaneous</td>
<td valign="top" align="left">Valence</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Autoencoder for aligning source to target<break/> SVM for classification</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Deng et al., <xref ref-type="bibr" rid="B32">2014b</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> SUSAS (Hansen and Bou-Ghazale, <xref ref-type="bibr" rid="B53">1997</xref>), ABC (Schuller et al., <xref ref-type="bibr" rid="B110">2007</xref>)<break/><bold>target:</bold> FAU AEC (Steidl, <xref ref-type="bibr" rid="B120">2009</xref>)</td>
<td valign="top" align="left"><bold>source:</bold><break/> In-lab; real-world<break/><bold>target:</bold> Real-world</td>
<td valign="top" align="left"><bold>source:</bold><break/> Acted; spontaneous<break/><bold>target:</bold> Spontaneous</td>
<td valign="top" align="left">Valence</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Adaptive denoising<break/> autoencoder (DAE)</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Deng et al., <xref ref-type="bibr" rid="B34">2014c</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> SUSAS (Hansen and Bou-Ghazale, <xref ref-type="bibr" rid="B53">1997</xref>), ABC (Schuller et al., <xref ref-type="bibr" rid="B110">2007</xref>)<break/><bold>target:</bold> FAU AEC (Steidl, <xref ref-type="bibr" rid="B120">2009</xref>)</td>
<td valign="top" align="left"><bold>source:</bold><break/> In-lab; real-world<break/><bold>target:</bold> Real-world</td>
<td valign="top" align="left"><bold>source:</bold><break/> Acted; spontaneous<break/><bold>target:</bold> Spontaneous</td>
<td valign="top" align="left">Valence</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Encoders trained separately per domain;<break/> a single-layer neural network maps<break/> the subspace to the target</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Deng et al., <xref ref-type="bibr" rid="B31">2014a</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> SUSAS (Hansen and Bou-Ghazale, <xref ref-type="bibr" rid="B53">1997</xref>), ABC (Schuller et al., <xref ref-type="bibr" rid="B110">2007</xref>)<break/><bold>target:</bold> FAU AEC (Steidl, <xref ref-type="bibr" rid="B120">2009</xref>)</td>
<td valign="top" align="left"><bold>source:</bold><break/> In-lab; real-world<break/><bold>target:</bold> Real-world</td>
<td valign="top" align="left"><bold>source:</bold><break/> Acted; spontaneous<break/><bold>target:</bold> Spontaneous</td>
<td valign="top" align="left">Valence</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Shared-hidden-layer autoencoder.<break/> A common encoder which also aims to<break/> minimize reconstruction error</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Song et al., <xref ref-type="bibr" rid="B118">2015</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>), eNTERFACE (Martin et al., <xref ref-type="bibr" rid="B86">2006</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted</td>
<td valign="top" align="left">Angry, disgusted,<break/> fear, happy, sad</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Transfer principal component analysis<break/> and sparse coding based method</td>
<td valign="top" align="left">INTERSPEECH<break/> 2010 paralinguistic challenge</td>
</tr>
<tr>
<td valign="top" align="left">Abdelwahab and Busso, <xref ref-type="bibr" rid="B1">2015</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>),<break/> SEMAINE (McKeown et al., <xref ref-type="bibr" rid="B88">2011</xref>)<break/><bold>target:</bold> RECOLA (Ringeval et al., <xref ref-type="bibr" rid="B104">2013</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted &#x00026;<break/> spontaneous</td>
<td valign="top" align="left">Arousal and valence</td>
<td valign="top" align="left">English<break/> French</td>
<td valign="top" align="left">Domain adaptation for SVM<break/> incremental adaptation for SVM</td>
<td valign="top" align="left">INTERSPEECH<break/> 2011 speaker state feature</td>
</tr>
<tr>
<td valign="top" align="left">Mao et al., <xref ref-type="bibr" rid="B85">2016</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>),<break/> ABC (Schuller et al., <xref ref-type="bibr" rid="B110">2007</xref>)<break/><bold>target:</bold> FAU AEC (Steidl, <xref ref-type="bibr" rid="B120">2009</xref>)</td>
<td valign="top" align="left"><bold>source:</bold> In-lab<break/><bold>target:</bold> Real-world</td>
<td valign="top" align="left"><bold>source:</bold> Acted<break/><bold>target:</bold> Spontaneous</td>
<td valign="top" align="left">Valence</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Sharing priors between related<break/> source and target classes</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Sagha et al., <xref ref-type="bibr" rid="B108">2016</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>),<break/> SAVEE (Haq et al., <xref ref-type="bibr" rid="B54">2008</xref>),<break/> EMOVO (Costantini et al., <xref ref-type="bibr" rid="B27">2014</xref>),<break/> Polish (Staroniewicz and Majewski, <xref ref-type="bibr" rid="B119">2009</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted</td>
<td valign="top" align="left">Valence</td>
<td valign="top" align="left">English, German,<break/> Italian, Polish</td>
<td valign="top" align="left">Kernel canonical correlation<break/> analysis (KCCA)</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Zhang et al., <xref ref-type="bibr" rid="B136">2016</xref></td>
<td valign="top" align="left"><bold>source:</bold> RAVDESS (Livingstone and Russo, <xref ref-type="bibr" rid="B76">2018</xref>)<break/><bold>target:</bold> UMSSED (Zhang et al., <xref ref-type="bibr" rid="B137">2015</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted</td>
<td valign="top" align="left">Angry, happy<break/> neutral, sad</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Multi-task learning</td>
<td valign="top" align="left">INTERSPEECH<break/> computational paralinguistics<break/> challenge 2013</td>
</tr>
<tr>
<td valign="top" align="left">Zong et al., <xref ref-type="bibr" rid="B147">2016</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>), eNTERFACE (Martin et al., <xref ref-type="bibr" rid="B86">2006</xref>),<break/> AFEW (Dhall et al., <xref ref-type="bibr" rid="B36">2012</xref>)</td>
<td valign="top" align="left">In-lab &#x00026;<break/> real-world</td>
<td valign="top" align="left">Acted &#x00026;<break/> spontaneous</td>
<td valign="top" align="left">Angry, disgusted, afraid<break/> happy, neutral, sad</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Domain-adaptive least-<break/> squares regression (DaLSR)</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Song, <xref ref-type="bibr" rid="B117">2017</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>), eNTERFACE (Martin et al., <xref ref-type="bibr" rid="B86">2006</xref>),<break/> FAU AEC (Steidl, <xref ref-type="bibr" rid="B120">2009</xref>)</td>
<td valign="top" align="left">In-lab &#x00026;<break/> real-world</td>
<td valign="top" align="left">Acted &#x00026;<break/> spontaneous</td>
<td valign="top" align="left">Angry, disgusted,<break/> afraid, happy, sad</td>
<td valign="top" align="left">English<break/> German</td>
<td valign="top" align="left">Linear subspace learning</td>
<td valign="top" align="left">INTERSPEECH<break/> 2010 paralinguistic challenge</td>
</tr>
<tr>
<td valign="top" align="left">Deng et al., <xref ref-type="bibr" rid="B30">2017</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>),<break/> GeWEC (B&#x000E4;nziger and Scherer, <xref ref-type="bibr" rid="B9">2010</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted</td>
<td valign="top" align="left">Categorical emotions</td>
<td valign="top" align="left">German<break/> French</td>
<td valign="top" align="left">Denoising autoencoder</td>
<td valign="top" align="left">INTERSPEECH<break/> 2009 emotion challenge</td>
</tr>
<tr>
<td/>
<td/>
<td/>
<td/>
<td valign="top" align="left">Valence and arousal</td>
<td valign="top" align="left">Shared-hidden-layer autoencoder<break/> extreme learning machine autoencoder</td>
<td/>
<td/>
</tr>
<tr>
<td valign="top" align="left">Gideon et al., <xref ref-type="bibr" rid="B44">2017</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>), MSP-IMPROV (Busso et al., <xref ref-type="bibr" rid="B19">2016</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted</td>
<td valign="top" align="left">Angry, neutral, sad, happy</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Progressive neural network (PNN)</td>
<td valign="top" align="left">Geneva minimalistic acoustic<break/> parameter set (GeMAPS)</td>
</tr>
<tr>
<td valign="top" align="left">Chang and Scherer, <xref ref-type="bibr" rid="B25">2017</xref></td>
<td valign="top" align="left"><bold>source:</bold> AMI (Carletta et al., <xref ref-type="bibr" rid="B22">2005</xref>)<break/><bold>target:</bold> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>)</td>
<td valign="top" align="left"><bold>source:</bold> Real-world<break/><bold>target:</bold> In-lab</td>
<td valign="top" align="left">Acted &#x00026;<break/> spontaneous</td>
<td valign="top" align="left">Valence and activation</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Deep convolutional generative<break/> adversarial networks (DCGAN)</td>
<td valign="top" align="left">Speech spectrogram</td>
</tr>
<tr>
<td valign="top" align="left">Abdelwahab and Busso, <xref ref-type="bibr" rid="B2">2018</xref></td>
<td valign="top" align="left"><bold>source:</bold> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>), MSP-IMPROV (Busso et al., <xref ref-type="bibr" rid="B19">2016</xref>)<break/><bold>target:</bold> MSP-Podcast (Lotfian and Busso, <xref ref-type="bibr" rid="B80">2017</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted &#x00026;<break/> spontaneous</td>
<td valign="top" align="left">Arousal, valence, dominance</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Domain adversarial<break/> neural network (DANN)</td>
<td valign="top" align="left">INTERSPEECH<break/> computational paralinguistics<break/> challenge 2013</td>
</tr>
<tr>
<td valign="top" align="left">Gideon et al., <xref ref-type="bibr" rid="B45">2019</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>), MSP-IMPROV (Busso et al., <xref ref-type="bibr" rid="B19">2016</xref>)<break/> PRIORI Emotion (Khorram et al., <xref ref-type="bibr" rid="B63">2018</xref>)</td>
<td valign="top" align="left">In-lab &#x00026; <break/> real-world</td>
<td valign="top" align="left">Acted &#x00026;<break/> spontaneous</td>
<td valign="top" align="left">Valence</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Adversarial discriminative domain<break/> generalization (ADDoG)</td>
<td valign="top" align="left">Mel Filter Bank<break/> (MFB)</td>
</tr>
<tr>
<td valign="top" align="left">Li and Chaspari, <xref ref-type="bibr" rid="B71">2019</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>), CREMA-D (Cao et al., <xref ref-type="bibr" rid="B20">2014</xref>),<break/> RAVDESS (Livingstone and Russo, <xref ref-type="bibr" rid="B76">2018</xref>), eNTERFACE (Martin et al., <xref ref-type="bibr" rid="B86">2006</xref>) <break/><bold>target:</bold> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left"><bold>source:</bold> Acted <break/><bold>target:</bold> Spontaneous</td>
<td valign="top" align="left">Angry, happy, sad, afraid</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Feedforward neural <break/> network fine-tuning <break/> progressive neural network (PNN)</td>
<td valign="top" align="left">INTERSPEECH <break/> 2009 emotion challenge</td>
</tr>
<tr>
<td valign="top" align="left">Neumann and Vu, <xref ref-type="bibr" rid="B93">2019</xref></td>
<td valign="top" align="left"><bold>source/target:</bold><break/> IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>), MSP-IMPROV (Busso et al., <xref ref-type="bibr" rid="B19">2016</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Acted</td>
<td valign="top" align="left">Angry, happy, sad, neutral</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">A latent feature space was learned on the <break/> source domain using an encoder-decoder; <break/> this representation was added as a feature <break/> vector in an attentive convolutional neural network</td>
<td valign="top" align="left">MFCC feature</td>
</tr>
<tr>
<td valign="top" align="left">Latif et al., <xref ref-type="bibr" rid="B68">2019</xref></td>
<td valign="top" align="left"><bold>source/target:</bold> <break/> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>), SAVEE (Jackson and Haq, <xref ref-type="bibr" rid="B59">2014</xref>), <break/> EMOVO (Costantini et al., <xref ref-type="bibr" rid="B27">2014</xref>), URDC (Latif et al., <xref ref-type="bibr" rid="B69">2018</xref>)</td>
<td valign="top" align="left">In-lab &#x00026; <break/> real-world</td>
<td valign="top" align="left">Acted &#x00026; <break/> spontaneous</td>
<td valign="top" align="left">Positive/negative valence</td>
<td valign="top" align="left">German, Urdu <break/> Italian, English</td>
<td valign="top" align="left">GAN-like structure, but source-domain <break/> data was used in place of <break/> generated (fake) data</td>
<td valign="top" align="left">Geneva minimalistic acoustic <break/> parameter set (GeMAPS)</td>
</tr>
<tr>
<td valign="top" align="left">Zhao et al., <xref ref-type="bibr" rid="B141">2019</xref></td>
<td valign="top" align="left"><bold>source:</bold> eGender (Burkhardt et al., <xref ref-type="bibr" rid="B15">2010</xref>) <break/><bold>target:</bold> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>), IEMOCAP (Busso et al., <xref ref-type="bibr" rid="B17">2008</xref>)</td>
<td valign="top" align="left"><bold>source:</bold> Real-world<break/><bold>target:</bold> In-lab</td>
<td valign="top" align="left"><bold>source:</bold> spontaneous <break/><bold>target:</bold> acted</td>
<td valign="top" align="left">Continuous prediction <break/> or classification: neutral, <break/> happiness, sadness, anger</td>
<td valign="top" align="left">English <break/> German</td>
<td valign="top" align="left">Age and gender attributes are learned separately; <break/> this knowledge is then transferred by feeding <break/> the information to the emotion model</td>
<td valign="top" align="left">INTERSPEECH <break/> 2010 configuration</td>
</tr>
<tr>
<td valign="top" align="left">Zhou and Chen, <xref ref-type="bibr" rid="B146">2019</xref></td>
<td valign="top" align="left"><bold>source:</bold> Aibo-Ohm and Aibo-Mont (Steidl, <xref ref-type="bibr" rid="B120">2009</xref>) <break/><bold>target:</bold> EMO-DB (Burkhardt et al., <xref ref-type="bibr" rid="B16">2005</xref>)</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left"><bold>source:</bold> Spontaneous <break/><bold>target:</bold> acted</td>
<td valign="top" align="left">Binary negative / positive</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Data relabeled to reveal domain information; <break/> class-wise adversarial domain adaptation with <break/> two-stage training: first train the encoder and predictor, <break/> then fix the predictor and train only the encoder</td>
<td valign="top" align="left">Geneva minimalistic acoustic <break/> parameter set (GeMAPS)</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Different types of transfer learning architectures have been explored in speech-based emotion recognition, including statistical methods (Deng et al., <xref ref-type="bibr" rid="B33">2013</xref>, <xref ref-type="bibr" rid="B31">2014a</xref>,<xref ref-type="bibr" rid="B34">c</xref>; Abdelwahab and Busso, <xref ref-type="bibr" rid="B1">2015</xref>; Song et al., <xref ref-type="bibr" rid="B118">2015</xref>; Sagha et al., <xref ref-type="bibr" rid="B108">2016</xref>; Zong et al., <xref ref-type="bibr" rid="B147">2016</xref>; Song, <xref ref-type="bibr" rid="B117">2017</xref>), adversarial or generative networks (Chang and Scherer, <xref ref-type="bibr" rid="B25">2017</xref>; Abdelwahab and Busso, <xref ref-type="bibr" rid="B2">2018</xref>; Gideon et al., <xref ref-type="bibr" rid="B45">2019</xref>; Latif et al., <xref ref-type="bibr" rid="B68">2019</xref>), and other neural network structures (Mao et al., <xref ref-type="bibr" rid="B85">2016</xref>; Deng et al., <xref ref-type="bibr" rid="B30">2017</xref>; Gideon et al., <xref ref-type="bibr" rid="B44">2017</xref>; Li and Chaspari, <xref ref-type="bibr" rid="B71">2019</xref>; Neumann and Vu, <xref ref-type="bibr" rid="B93">2019</xref>; Zhou and Chen, <xref ref-type="bibr" rid="B146">2019</xref>). 
A commonly used input to the aforementioned approaches is the feature set proposed by the INTERSPEECH emotion and paralinguistic challenges (Schuller et al., <xref ref-type="bibr" rid="B112">2009b</xref>, <xref ref-type="bibr" rid="B113">2010</xref>, <xref ref-type="bibr" rid="B114">2013</xref>), which typically contains the first 12 Mel-frequency cepstral coefficients (MFCCs), the root-mean-square energy, zero-crossing rate, voicing probability, and fundamental frequency (Deng et al., <xref ref-type="bibr" rid="B33">2013</xref>, <xref ref-type="bibr" rid="B32">2014b</xref>, <xref ref-type="bibr" rid="B30">2017</xref>; Mao et al., <xref ref-type="bibr" rid="B85">2016</xref>; Sagha et al., <xref ref-type="bibr" rid="B108">2016</xref>; Zhang et al., <xref ref-type="bibr" rid="B136">2016</xref>; Zong et al., <xref ref-type="bibr" rid="B147">2016</xref>; Song, <xref ref-type="bibr" rid="B117">2017</xref>; Abdelwahab and Busso, <xref ref-type="bibr" rid="B2">2018</xref>; Li and Chaspari, <xref ref-type="bibr" rid="B71">2019</xref>; Zhao et al., <xref ref-type="bibr" rid="B141">2019</xref>).</p>
<p>Statistical functionals of these descriptors, including the maximum, minimum, range, time position of the maximum and minimum, average, standard deviation, skewness, kurtosis, as well as the first- and second-order coefficients of a linear regression model, are extracted from the frame-based measures. Other approaches use the speech spectrogram as an input to convolutional neural networks (Gideon et al., <xref ref-type="bibr" rid="B45">2019</xref>). Previously proposed transfer learning methods for speech emotion recognition employ the same classes for the source and target data. Two commonly used baselines against which the proposed transfer learning approaches are compared are in-domain training and out-of-domain training: the first trains and tests solely on labeled data from the target domain, while the second trains the model on the source data and tests it on the target. Results indicate that the proposed transfer learning methods outperform the out-of-domain baselines and are equivalent to, or sometimes surpass, in-domain training, indicating the potential of leveraging multiple sources of emotion-specific speech data to improve emotion recognition performance.</p>
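<p>As an illustration of the functional-extraction step described above, the following sketch computes the listed statistics over a toy frame-level contour with NumPy/SciPy. It is a simplified stand-in for the challenge baseline feature sets, whose exact functionals and configurations differ in detail.</p>

```python
import numpy as np
from scipy import stats

def functionals(frames: np.ndarray) -> np.ndarray:
    """Map one low-level descriptor contour (e.g., F0 or RMS energy,
    one value per frame) to utterance-level statistical functionals."""
    t = np.arange(len(frames))
    lin = np.polyfit(t, frames, 1)       # linear fit: [slope, intercept]
    quad = np.polyfit(t, frames, 2)[0]   # second-order coefficient
    return np.array([
        frames.max(), frames.min(), np.ptp(frames),  # max, min, range
        frames.argmax() / len(frames),               # relative position of max
        frames.argmin() / len(frames),               # relative position of min
        frames.mean(), frames.std(),                 # average, std. deviation
        stats.skew(frames), stats.kurtosis(frames),  # skewness, kurtosis
        lin[0], quad,                                # regression coefficients
    ])

# Toy frame-level contour standing in for a real descriptor track
contour = np.sin(np.linspace(0, 3, 100))
feats = functionals(contour)
```

Applying one such functional vector per low-level descriptor (and per delta coefficient) is how frame-based measures are collapsed into a fixed-length utterance representation suitable for SVMs or feedforward networks.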
<p>Besides speech data, audio signals from music clips have also been used for emotion recognition (Zhang et al., <xref ref-type="bibr" rid="B136">2016</xref>). However, because of the limited number of emotion-annotated music datasets, as well as the significant domain mismatch between music and speech, this application remains relatively less explored.</p>
</sec>
<sec>
<title>4.2. Transfer Learning for Video/Image-Based Emotion Recognition</title>
<p>Facial expressions convey a rich amount of information related to human emotion. A variety of transfer learning techniques have been explored for video/image-based automatic emotion recognition (<xref ref-type="table" rid="T2">Table 2</xref>). State-of-the-art transfer learning approaches to video-based emotion recognition include obtaining high-level features, mainly from a convolutional neural network (CNN) trained on large sources of data (e.g., VGG; Simonyan and Zisserman, <xref ref-type="bibr" rid="B116">2014</xref>) (Kaya et al., <xref ref-type="bibr" rid="B62">2017</xref>; Aly and Abbott, <xref ref-type="bibr" rid="B4">2019</xref>; Ngo and Yoon, <xref ref-type="bibr" rid="B95">2019</xref>), or transferring knowledge from higher-quality auxiliary image datasets (e.g., the skeleton or color of an image, or images with descriptive text) (Xu et al., <xref ref-type="bibr" rid="B130">2016</xref>). Source datasets in this case might not necessarily contain the same labeled classes as the target dataset. Occluded facial images, which are common in daily life, are also utilized to improve the generalization and robustness of the overall system (Xu et al., <xref ref-type="bibr" rid="B131">2015</xref>). More advanced transfer learning approaches, such as adversarial methods, are less explored in this setting. A possible reason is that high-level image features are relatively easy to obtain, so knowledge from other domains might not significantly help. Another reason could be that the selection of the source domain is more important for facial emotion recognition (Sugianto and Tjondronegoro, <xref ref-type="bibr" rid="B121">2019</xref>). In order to recognize emotions from video clips, every frame of the clip is analyzed and the final decision is made with voting methods, such as majority voting over the separate frames (Zhang et al., <xref ref-type="bibr" rid="B136">2016</xref>). 
Face detection methods, such as the deformable parts model (DPM) (Mathias et al., <xref ref-type="bibr" rid="B87">2014</xref>), may also be used to avoid the influence of irrelevant regions of the video frame (Kaya et al., <xref ref-type="bibr" rid="B62">2017</xref>).</p>
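<p>The frame-wise voting step described above can be sketched as follows. This is a minimal illustration with toy per-frame class scores (e.g., CNN softmax outputs), not the full pipeline of any cited work.</p>

```python
import numpy as np

def video_label(frame_scores: np.ndarray) -> int:
    """Aggregate per-frame class scores (n_frames x n_classes) into a
    clip-level decision by majority voting over frame predictions."""
    frame_preds = frame_scores.argmax(axis=1)  # one predicted label per frame
    counts = np.bincount(frame_preds, minlength=frame_scores.shape[1])
    return int(counts.argmax())                # most frequent label wins

# Toy example: 5 frames, 3 emotion classes; frames 0, 2, 3 vote for class 1
scores = np.array([[0.1, 0.8, 0.1],
                   [0.7, 0.2, 0.1],
                   [0.2, 0.6, 0.2],
                   [0.3, 0.5, 0.2],
                   [0.4, 0.3, 0.3]])
print(video_label(scores))  # 1
```

Averaging the per-frame scores before the argmax is a common alternative to hard voting; either way, the aggregation step is independent of the CNN used to produce the frame-level predictions.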
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Overview of the previous work on transfer learning for video/image-based emotion recognition.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="left"><bold>In-lab/real-world</bold><break/><bold>transfer learning</bold></th>
<th valign="top" align="left"><bold>Acted/ spontaneous</bold><break/><bold>transfer learning</bold></th>
<th valign="top" align="left"><bold>Same labels between</bold><break/><bold>source &#x00026; target</bold></th>
<th valign="top" align="left"><bold>Emotional labels</bold></th>
<th valign="top" align="left"><bold>Type of transfer learning</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Ng et al., <xref ref-type="bibr" rid="B94">2015</xref></td>
<td valign="top" align="left"><bold>source:</bold> <break/> VGG (Simonyan and Zisserman, <xref ref-type="bibr" rid="B116">2014</xref>) <break/> AlexNet (Krizhevsky et al., <xref ref-type="bibr" rid="B65">2012</xref>), <break/> FER-2013 (Goodfellow et al., <xref ref-type="bibr" rid="B50">2013</xref>) <break/><bold>target:</bold> <break/> EmotiW 2015 (Dhall et al., <xref ref-type="bibr" rid="B37">2015</xref>)</td>
<td valign="top" align="left">Real-world</td>
<td valign="top" align="left">Spontaneous</td>
<td valign="top" align="left">No for VGG/AlexNet <break/> Yes for FER-2013</td>
<td valign="top" align="left">Neutral, angry, disgusted, sad, <break/> fear, happy, surprised</td>
<td valign="top" align="left">Two-stage fine-tuning<break/> based on VGG/AlexNet <break/> and the target data</td>
</tr>
<tr>
<td valign="top" align="left">Xu et al., <xref ref-type="bibr" rid="B131">2015</xref></td>
<td valign="top" align="left"><bold>source:</bold><break/> MSRA-CFW (Zhang et al., <xref ref-type="bibr" rid="B140">2012</xref>) <break/><bold>target:</bold> <break/> self-built database contains <break/> CK&#x0002B; (Lucey et al., <xref ref-type="bibr" rid="B81">2010</xref>), <break/> JAFFE (Lyons et al., <xref ref-type="bibr" rid="B83">1999</xref>), <break/> KDEF (Goeleven et al., <xref ref-type="bibr" rid="B46">2008</xref>), <break/> PICS (PIC, <xref ref-type="bibr" rid="B102">2013</xref>)</td>
<td valign="top" align="left">Real-world</td>
<td valign="top" align="left">Spontaneous</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Neutral, angry, disgusted, sad, <break/> fear, happy, surprised</td>
<td valign="top" align="left">Feature transfer by training two facial <break/> identification convolutional networks</td>
</tr>
<tr>
<td valign="top" align="left">Xu et al., <xref ref-type="bibr" rid="B130">2016</xref></td>
<td valign="top" align="left"><bold>source:</bold> <break/> Flickr (Borth et al., <xref ref-type="bibr" rid="B13">2013</xref>) <break/><bold>target:</bold> <break/> YouTube (Jiang et al., <xref ref-type="bibr" rid="B60">2014</xref>) <break/> Ekman-6 emotion dataset</td>
<td valign="top" align="left">Real-world</td>
<td valign="top" align="left">Spontaneous</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">8 Primary emotions <break/> 24 primary &#x00026; <break/> Secondary emotions</td>
<td valign="top" align="left">Auxiliary image transfer encoding <break/> using auxiliary data (e.g., images with <break/> description text)</td>
</tr>
<tr>
<td valign="top" align="left">Kaya et al., <xref ref-type="bibr" rid="B62">2017</xref></td>
<td valign="top" align="left"><bold>source:</bold> <break/> VGG-Face (Parkhi et al., <xref ref-type="bibr" rid="B100">2015</xref>),<break/> VGG-M-2048 (Chatfield et al., <xref ref-type="bibr" rid="B26">2014</xref>) <break/> FER-2013 (Goodfellow et al., <xref ref-type="bibr" rid="B50">2013</xref>) <break/><bold>target:</bold> <break/> EmotiW 2015 (Dhall et al., <xref ref-type="bibr" rid="B37">2015</xref>),<break/> EmotiW 2016 (Dhall et al., <xref ref-type="bibr" rid="B35">2016</xref>),<break/> CK&#x0002B; (Lucey et al., <xref ref-type="bibr" rid="B81">2010</xref>),<break/> MMI (Valstar and Pantic, <xref ref-type="bibr" rid="B126">2010</xref>),<break/> RECOLA (Ringeval et al., <xref ref-type="bibr" rid="B104">2013</xref>),<break/> First impressions challenge <break/> (Escalante et al., <xref ref-type="bibr" rid="B40">2016</xref>)</td>
<td valign="top" align="left">Real-world</td>
<td valign="top" align="left">Spontaneous</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Neutral, angry, disgusted, sad, <break/> fear, happy, surprised</td>
<td valign="top" align="left">Fine-tuning during convolutional <break/> neural network (CNN) training</td>
</tr>
<tr>
<td valign="top" align="left">Ngo and Yoon, <xref ref-type="bibr" rid="B95">2019</xref></td>
<td valign="top" align="left"><bold>source:</bold> <break/> ResNet-50 (He et al., <xref ref-type="bibr" rid="B55">2016</xref>) <break/><bold>target:</bold> <break/> AffectNet (Mollahosseini et al., <xref ref-type="bibr" rid="B89">2017</xref>)</td>
<td valign="top" align="left">Real-world</td>
<td valign="top" align="left">Spontaneous</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Neutral, happiness, sadness, surprise, <break/> fear, disgust, anger, contempt</td>
<td valign="top" align="left">Fine-tuning on the well-trained <break/> ResNet-50 net</td>
</tr>
<tr>
<td valign="top" align="left">Aly and Abbott, <xref ref-type="bibr" rid="B4">2019</xref></td>
<td valign="top" align="left"><bold>source:</bold> <break/> AlexNet (Krizhevsky et al., <xref ref-type="bibr" rid="B65">2012</xref>), <break/> JAFFE (Lyons et al., <xref ref-type="bibr" rid="B82">1998</xref>), <break/> CK&#x0002B; (Lucey et al., <xref ref-type="bibr" rid="B81">2010</xref>) <break/><bold>target:</bold> <break/> VT-KFER (Aly et al., <xref ref-type="bibr" rid="B3">2015</xref>),<break/> 300W (Sagonas et al., <xref ref-type="bibr" rid="B109">2016</xref>)</td>
<td valign="top" align="left">VT-KFER: in-lab <break/> 300W: real-world</td>
<td valign="top" align="left">VT-KFER: acted <break/> 300W: spontaneous</td>
<td valign="top" align="left">Yes</td>
<td valign="top" align="left">Happiness, sadness, surprise, <break/> disgust, fear, anger</td>
<td valign="top" align="left">Multi-stage progressive <break/> transfer learning (MSPTL): <break/> fine-tunes AlexNet in multiple <break/> stages using different data (simple to <break/> more challenging or non-frontal)</td>
</tr>
<tr>
<td valign="top" align="left">Sugianto and Tjondronegoro, <xref ref-type="bibr" rid="B121">2019</xref></td>
<td valign="top" align="left"><bold>source:</bold> <break/> ResNet-50 (He et al., <xref ref-type="bibr" rid="B55">2016</xref>) <break/> MS-CELEB-1M (He et al., <xref ref-type="bibr" rid="B55">2016</xref>) <break/> VGGFace2 (He et al., <xref ref-type="bibr" rid="B55">2016</xref>) <break/> CK&#x0002B; (Lucey et al., <xref ref-type="bibr" rid="B81">2010</xref>) <break/><bold>target:</bold> <break/> AffectNet (Mollahosseini et al., <xref ref-type="bibr" rid="B89">2017</xref>)</td>
<td valign="top" align="left">Real-world</td>
<td valign="top" align="left">Spontaneous</td>
<td valign="top" align="left">No</td>
<td valign="top" align="left">Neutral, happiness, sadness, surprise, <break/> fear, disgust, anger, contempt</td>
<td valign="top" align="left">Fine-tuning on CK&#x0002B; (relevant domain) <break/> lowers the performance due to the <break/>large knowledge gap. <break/>General-to-specific knowledge <break/>transfer performs best.</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>4.3. Transfer Learning for Emotion Recognition Based on Physiological Signals</title>
<p>A small amount of previous work has attempted to perform transfer learning on physiological signals for emotion recognition (<xref ref-type="table" rid="T3">Table 3</xref>). Among the various physiological signals, the electroencephalogram (EEG) is the most commonly used in transfer learning, probably due to the rich amount of information included in this signal. Because of the limited number of datasets including physiological signals, as well as the high variability across people, knowledge transfer between different datasets is less efficient and relatively less explored. Commonly used transfer learning applications attempt to train subject-specific (personalized) models by providing knowledge learned from subjects similar to the test subject (Lin and Jung, <xref ref-type="bibr" rid="B74">2017</xref>; Lin, <xref ref-type="bibr" rid="B72">2019</xref>), or simply consider all members of the group by assigning them different weights (Li et al., <xref ref-type="bibr" rid="B70">2019</xref>). Other methods include statistical approaches, such as principal component analysis (PCA) and adaptive subspace feature matching (ASFM) (Chai et al., <xref ref-type="bibr" rid="B23">2017</xref>). As the development of wearable devices progresses and more physiological data related to emotional experiences become available, transfer learning methods appear to hold great potential in this domain.</p>
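<p>A minimal sketch of the subject-selection idea behind such personalized transfer is given below. It uses a crude mean-feature distance as the similarity proxy between the target subject and candidate source subjects; the cited works employ more principled transferability criteria, and all names here are illustrative.</p>

```python
import numpy as np

def select_sources(target_X: np.ndarray, source_sets: list, k: int = 2):
    """Rank candidate source subjects by how close their feature
    distribution is to the target's (mean-vector distance as a crude
    proxy) and keep the k closest for training a personalized model."""
    t_mean = target_X.mean(axis=0)
    dists = [np.linalg.norm(X.mean(axis=0) - t_mean) for X in source_sets]
    return np.argsort(dists)[:k]  # indices of the k most similar subjects

rng = np.random.default_rng(0)
# Toy stand-ins for per-subject EEG feature matrices (samples x features)
target = rng.normal(0.0, 1.0, size=(50, 8))
sources = [rng.normal(mu, 1.0, size=(80, 8)) for mu in (0.1, 2.0, 0.2, 3.0)]
picked = select_sources(target, sources, k=2)  # subjects with means near 0
```

A model would then be trained only on the selected subjects' data (possibly with per-subject weights), mirroring the intuition that transfer from dissimilar individuals can hurt rather than help.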
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Overview of transfer learning methods for emotion recognition based on physiological signals.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>References</bold></th>
<th valign="top" align="left"><bold>Dataset</bold></th>
<th valign="top" align="left"><bold>Elicitation</bold><break/><bold>method</bold></th>
<th valign="top" align="left"><bold>In-lab</bold><break/><bold>Real-world</bold></th>
<th valign="top" align="left"><bold>Emotional labels</bold></th>
<th valign="top" align="left"><bold>Type of transfer learning</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Zheng et al., <xref ref-type="bibr" rid="B145">2015</xref></td>
<td valign="top" align="left">SEED (Zheng and Lu, <xref ref-type="bibr" rid="B143">2015</xref>)</td>
<td valign="top" align="left">Video</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Three emotions <break/> (positive, neutral, and negative)</td>
<td valign="top" align="left">Personalized transfer learning <break/> Transfer component analysis (TCA) <break/> Kernel principal component analysis (KPCA)</td>
</tr>
<tr>
<td valign="top" align="left">Chai et al., <xref ref-type="bibr" rid="B24">2016</xref></td>
<td valign="top" align="left">SEED (Zheng and Lu, <xref ref-type="bibr" rid="B143">2015</xref>)</td>
<td valign="top" align="left">Video</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Three emotions <break/> (positive, neutral, and negative)</td>
<td valign="top" align="left">Subspace alignment auto-encoder</td>
</tr>
<tr>
<td valign="top" align="left">Zheng and Lu, <xref ref-type="bibr" rid="B144">2016</xref></td>
<td valign="top" align="left">SEED (Zheng and Lu, <xref ref-type="bibr" rid="B143">2015</xref>)</td>
<td valign="top" align="left">Video</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Three emotions <break/> (positive, neutral, and negative)</td>
<td valign="top" align="left">Transductive parameter transfer (TPT)<break/> Transductive SVM (T-SVM) <break/> Transfer component analysis (TCA)<break/> Kernel PCA (KPCA)</td>
</tr>
<tr>
<td valign="top" align="left">Lin and Jung, <xref ref-type="bibr" rid="B74">2017</xref></td>
<td valign="top" align="left">Oscar soundtrack <break/> EEG dataset <break/> (Lin et al., <xref ref-type="bibr" rid="B75">2010</xref>)</td>
<td valign="top" align="left">Music</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Valence and arousal</td>
<td valign="top" align="left">Conditional transfer learning framework to determine <break/> how transferable a model is to a given individual</td>
</tr>
<tr>
<td valign="top" align="left">Chai et al., <xref ref-type="bibr" rid="B23">2017</xref></td>
<td valign="top" align="left">SEED (Zheng and Lu, <xref ref-type="bibr" rid="B143">2015</xref>)</td>
<td valign="top" align="left">Video</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Three emotions <break/> (positive, neutral, and negative)</td>
<td valign="top" align="left">Adaptive subspace feature matching</td>
</tr>
<tr>
<td valign="top" align="left">Lan et al., <xref ref-type="bibr" rid="B67">2018</xref></td>
<td valign="top" align="left">SEED (Zheng and Lu, <xref ref-type="bibr" rid="B143">2015</xref>) <break/>DEAP (Koelstra et al., <xref ref-type="bibr" rid="B64">2011</xref>)</td>
<td valign="top" align="left">Video</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Three emotions <break/> (positive, neutral, and negative)</td>
<td valign="top" align="left">Transfer component analysis (TCA) <break/> Geodesic flow kernel (GFK)<break/> Domain adaptation<break/> Kernel principal component analysis (KPCA)</td>
</tr>
<tr>
<td valign="top" align="left">Lin, <xref ref-type="bibr" rid="B72">2019</xref></td>
<td valign="top" align="left">MDME (Lin et al., <xref ref-type="bibr" rid="B73">2015</xref>) <break/>SDMN (Lin et al., <xref ref-type="bibr" rid="B75">2010</xref>)</td>
<td valign="top" align="left">Music</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Binary valence and arousal</td>
<td valign="top" align="left">Robust principal component analysis (RPCA)-embedded <break/>transfer learning for a personalized cross-day model. <break/> Uses Riemannian distance and RPCA to <break/> select similar samples within the dataset</td>
</tr>
<tr>
<td valign="top" align="left">Li et al., <xref ref-type="bibr" rid="B70">2019</xref></td>
<td valign="top" align="left">SEED (Zheng and Lu, <xref ref-type="bibr" rid="B143">2015</xref>)</td>
<td valign="top" align="left">Video</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Three emotions <break/> (positive, neutral, and negative)</td>
<td valign="top" align="left">Known subjects have separate classifiers; <break/> these classifiers are ensembled using the <break/>style transfer mapping (STM) method for a new subject.</td>
</tr>
<tr>
<td valign="top" align="left">Zhang et al., <xref ref-type="bibr" rid="B139">2019</xref></td>
<td valign="top" align="left">SEED (Zheng and Lu, <xref ref-type="bibr" rid="B143">2015</xref>)</td>
<td valign="top" align="left">Video</td>
<td valign="top" align="left">In-lab</td>
<td valign="top" align="left">Three emotions <break/> (positive, neutral, and negative)</td>
<td valign="top" align="left">A CNN is used as a feature <break/> extractor from electrode-frequency distribution maps (EFDMs). <break/> Deep domain confusion (DDC) narrows the feature difference <break/>between domains; EFDMs and the CNN are used for classification.</td>
</tr>
</tbody>
</table>
</table-wrap>
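The subspace-matching methods recurring in Table 3 (TCA, KPCA, ASFM) share a common idea: reduce each domain with PCA and bring the two subspaces into correspondence before classification. The following is a minimal, generic subspace alignment sketch of that idea; it is not any cited author's implementation, and the data shapes, subspace dimension, and function names are illustrative assumptions.

```python
import numpy as np

def pca_basis(X, d):
    """Top-d principal directions of X as columns of an (n_features, d) matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = directions
    return Vt[:d].T

def subspace_alignment(source, target, d=3):
    """Project both domains to d dims and map the source basis onto the target's.

    Returns source and target samples in a shared d-dimensional space, in which
    a classifier trained on source data can be applied to target data.
    """
    Ps, Pt = pca_basis(source, d), pca_basis(target, d)
    M = Ps.T @ Pt  # (d, d) linear alignment between the two subspaces
    src_aligned = (source - source.mean(axis=0)) @ (Ps @ M)
    tgt_proj = (target - target.mean(axis=0)) @ Pt
    return src_aligned, tgt_proj

# Synthetic stand-ins for EEG feature matrices from two subjects/datasets.
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 8))
tgt = rng.normal(size=(80, 8)) + 0.5  # shifted "new subject" distribution
Xs, Xt = subspace_alignment(src, tgt, d=3)
```

A source-domain classifier would then be fit on `Xs` and evaluated on `Xt`; the cited methods differ mainly in how the alignment transform is estimated and regularized.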
</sec>
</sec>
<sec sec-type="discussion" id="s5">
<title>5. Discussion</title>
<p>In this section, we provide a brief summary of current applications of transfer learning to automatic emotion recognition and outline potential directions for future research.</p>
<sec>
<title>5.1. Summary of Current Research</title>
<p>Previous work has explored transfer learning for automatic emotion recognition in the three commonly used signals (i.e., speech, video, and physiology) (<xref ref-type="table" rid="T4">Table 4</xref>). Transfer learning for speech aims to transfer knowledge between different datasets using state-of-the-art transfer learning methods, such as adversarial or generative networks. Image-based transfer learning is mainly utilized to extract high-level features from images or their auxiliary data (e.g., image description text) using convolutional neural networks (CNN). For physiological signals, transfer learning has promoted the design of personalized models through statistical methods, though data scarcity and high inter-individual variability yield highly variable results across subjects. Speech and video signals allow for the design of sophisticated systems that can detect multiple emotions (i.e., up to seven emotions), while physiological data usually yield more coarse-grained emotion recognition.</p>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Overview of current research.</p></caption>
<table frame="hsides" rules="groups">
<thead><tr>
<th valign="top" align="left"><bold>Signal</bold></th>
<th valign="top" align="left"><bold>Common transfer learning method</bold></th>
<th valign="top" align="left"><bold>Overview</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Speech</td>
<td valign="top" align="left">Statistical-based transfer learning<break/> Deep transfer learning methods</td>
<td valign="top" align="left">The speech signal has been widely used for emotion recognition;<break/> both statistical and deep learning methods are widely explored.</td>
</tr>
<tr>
<td valign="top" align="left">Video/image</td>
<td valign="top" align="left">Deep transfer learning methods</td>
<td valign="top" align="left">Motivated by the wide adoption of deep learning in image processing,<break/> such methods have been widely used for emotion recognition in recent years.</td>
</tr>
<tr>
<td valign="top" align="left">Physiological</td>
<td valign="top" align="left">Statistical-based transfer learning</td>
<td valign="top" align="left">Neural networks have not been widely used for physiological signals,<break/> but researchers have recently started to apply deep transfer learning methods.</td>
</tr>
</tbody>
</table>
</table-wrap>
</sec>
<sec>
<title>5.2. Potential Future Directions</title>
<sec>
<title>5.2.1. Multi-Modal Transfer Learning for Emotion Recognition</title>
<p>Multi-modal sources of data have been widely used in automatic emotion recognition tasks in order to supplement information from multiple modalities and reduce the potential bias of any single signal (Busso et al., <xref ref-type="bibr" rid="B18">2004</xref>; W&#x000F6;llmer et al., <xref ref-type="bibr" rid="B129">2010</xref>). However, transfer learning methods play a very limited role in this process. Common knowledge transfer in multi-modal methods includes fine-tuning a well-trained model on a specific type of signal (Vielzeuf et al., <xref ref-type="bibr" rid="B127">2017</xref>; Yan et al., <xref ref-type="bibr" rid="B133">2018</xref>; Huang et al., <xref ref-type="bibr" rid="B58">2019</xref>; Ortega et al., <xref ref-type="bibr" rid="B96">2019</xref>), or fine-tuning different well-trained models on both speech and video signals (Ouyang et al., <xref ref-type="bibr" rid="B97">2017</xref>; Zhang et al., <xref ref-type="bibr" rid="B138">2017</xref>; Ma et al., <xref ref-type="bibr" rid="B84">2019</xref>). Other uses of transfer learning in multi-modal methods include leveraging the knowledge from one signal to improve another (e.g., video to speech) in order to reduce potential bias (Athanasiadis et al., <xref ref-type="bibr" rid="B6">2019</xref>). While these methods achieve promising performance, applying more recent transfer learning methods, utilizing more types of signals, and leveraging knowledge between multiple signals are likely to improve transfer learning and boost emotion recognition accuracy.</p>
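The fine-tuning strategy described above (reusing a well-trained model and adapting only part of it to a new signal) can be sketched as follows. This is a toy illustration, not any cited system: a fixed random projection stands in for a pretrained CNN feature extractor, and the synthetic data, sizes, and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained feature extractor; its weights are frozen,
# i.e., never updated during fine-tuning.
W_frozen = rng.normal(size=(16, 8))
def extract_features(x):
    return np.tanh(x @ W_frozen)

# Small labeled target set (hypothetical single-modality data).
X = rng.normal(size=(60, 16))
y = (X[:, 0] > 0).astype(float)

# Fine-tune only a new classification head on top of the frozen features,
# via gradient descent on the logistic (cross-entropy) loss.
F = extract_features(X)
w, b = np.zeros(8), 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))  # sigmoid predictions
    w -= lr * F.T @ (p - y) / len(y)
    b -= lr * (p - y).mean()

train_acc = ((1.0 / (1.0 + np.exp(-(F @ w + b))) > 0.5) == y).mean()
```

In practice the same pattern applies per modality: one pretrained backbone is adapted to speech and another to video, and their heads (or fused features) feed a joint emotion classifier.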
</sec>
<sec>
<title>5.2.2. Transferring Emotionally-Relevant Knowledge Between In-lab and Real-Life Conditions</title>
<p>Most current studies focus on transferring knowledge between datasets collected in the lab, due to the relative scarcity of real-world datasets, especially for speech and physiological signals (Abdelwahab and Busso, <xref ref-type="bibr" rid="B1">2015</xref>, <xref ref-type="bibr" rid="B2">2018</xref>; Zheng et al., <xref ref-type="bibr" rid="B145">2015</xref>; Chai et al., <xref ref-type="bibr" rid="B24">2016</xref>, <xref ref-type="bibr" rid="B23">2017</xref>; Sagha et al., <xref ref-type="bibr" rid="B108">2016</xref>; Zhang et al., <xref ref-type="bibr" rid="B136">2016</xref>; Zheng and Lu, <xref ref-type="bibr" rid="B144">2016</xref>; Gideon et al., <xref ref-type="bibr" rid="B44">2017</xref>; Lin and Jung, <xref ref-type="bibr" rid="B74">2017</xref>; Lan et al., <xref ref-type="bibr" rid="B67">2018</xref>; Li and Chaspari, <xref ref-type="bibr" rid="B71">2019</xref>). While some of these datasets try to simulate naturalistic scenarios, they are still very different from actual real-world conditions, since they contain less noisy data collected under high-quality conditions (e.g., no occluded video or far-field speech), and might not successfully elicit all possible emotions met in real-life conditions (e.g., grief). Recently, there have been multiple efforts to collect emotion datasets in the wild, such as PRIORI (Khorram et al., <xref ref-type="bibr" rid="B63">2018</xref>) and the AVEC In-The-Wild Emotion Recognition Challenge (Dhall et al., <xref ref-type="bibr" rid="B37">2015</xref>). Exploring the ability to transfer knowledge from data collected in the lab to data obtained in real-life conditions can significantly extend the applicability of emotion recognition in real life (e.g., quantifying well-being, tracking mental health indices; Khorram et al., <xref ref-type="bibr" rid="B63">2018</xref>).</p>
</sec>
<sec>
<title>5.2.3. Multi-Source Transfer Learning</title>
<p>With the advent of a large number of emotion recognition corpora, multi-source transfer learning methods provide a promising research direction. By leveraging the variability of multiple data sources collected under different contextual and recording conditions, multi-source transfer learning might be able to provide highly robust and generalizable systems. It can also lay a foundation for modeling aspects of human behavior other than emotion (e.g., mood, anxiety), where only a limited number of datasets with a small number of data samples are available. Multi-source transfer learning has not been explored for the automatic emotion recognition task, which also makes it a promising direction for future research.</p>
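As a concrete illustration of the multi-source idea, the sketch below trains one simple classifier per source corpus and weights their votes on an unlabeled target set by distributional similarity (here, just the distance between feature means). The nearest-centroid models, Gaussian data, and softmax-style weighting are illustrative assumptions, not a method from the cited literature.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_centroid_classifier(X, y):
    """Nearest-centroid binary classifier: a stand-in for a per-corpus model."""
    c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    return lambda Z: (np.linalg.norm(Z - c1, axis=1)
                      < np.linalg.norm(Z - c0, axis=1)).astype(float)

# Three hypothetical source corpora recorded under different conditions,
# modeled here as different feature offsets.
models, source_means = [], []
for shift in (0.0, 0.3, 2.0):
    X = rng.normal(size=(80, 5)) + shift
    y = (X[:, 0] > shift).astype(float)  # per-domain binary emotion labels
    models.append(fit_centroid_classifier(X, y))
    source_means.append(X.mean(axis=0))

# Unlabeled target corpus; weight each source model by how close its
# feature distribution (first moment only, for simplicity) is to the target's.
target = rng.normal(size=(50, 5)) + 0.25
dists = np.array([np.linalg.norm(target.mean(axis=0) - m)
                  for m in source_means])
weights = np.exp(-dists) / np.exp(-dists).sum()  # closer source, larger weight

# Weighted soft vote over the per-source predictions.
votes = np.stack([m(target) for m in models])  # shape (3, 50)
preds = (weights @ votes > 0.5).astype(float)
```

Richer variants would replace the mean-distance weights with a proper divergence between domains and the centroid models with learned classifiers, but the aggregation structure stays the same.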
</sec>
<sec>
<title>5.2.4. Distinct Labels Between the Source and Target Domains</title>
<p>A potential challenge in automatic emotion recognition lies in the fact that various datasets might include different types of emotional labels, which can introduce a high degree of mismatch between the source and target domains. Especially when the source domain includes data collected in the lab, the corresponding labels mostly include only basic emotions (e.g., happiness, anger, fear, surprise, and neutral). However, the emotional classes in the target domain might be slightly different, since they might include subtle real-world emotions, such as frustration or disapproval. Understanding and modeling associations between primary and secondary emotions can potentially contribute toward more accurate emotion inferences in real life.</p>
</sec>
<sec>
<title>5.2.5. Transfer Learning for Cross-Cultural and Cross-Linguistic Emotion Recognition</title>
<p>Emotions can be expressed in different ways across cultures and languages. For example, emotions may be expressed in a direct and noticeable way in Western countries, while emotional expression tends to be more subtle in parts of Asia (Davis et al., <xref ref-type="bibr" rid="B29">2012</xref>; G&#x000F6;k&#x000E7;en et al., <xref ref-type="bibr" rid="B47">2014</xref>). Even though previous studies have explored knowledge transfer across European languages (e.g., German, French, Italian, and Polish) (Sagha et al., <xref ref-type="bibr" rid="B108">2016</xref>), indicating that language is not a key factor for automatic emotion recognition, extensive experiments with non-Western languages and cultures might provide additional insights for advancing the field of transfer learning in emotion recognition. At the same time, most emotional datasets include Caucasian subjects, while only a small portion of the collected data contains participants from other ethnicities and races (Schuller et al., <xref ref-type="bibr" rid="B111">2009a</xref>). It would be beneficial to examine potential discrepancies related to linguistic speaking style and facial expressions in order to build emotion recognition systems that generalize across cultures.</p>
</sec>
</sec>
</sec>
<sec sec-type="conclusions" id="s6">
<title>6. Conclusion</title>
<p>In this manuscript, we reviewed emerging methods in supervised and unsupervised transfer learning, as well as successful applications and promising future research directions of transfer learning for automatic emotion recognition. We first provided an overview of basic transfer learning methods mostly used in image and speech processing, including statistical approaches, deep transfer learning, and region selection through domain relevance. We then expanded upon transfer learning applications for emotion recognition, studying the three main modalities of speech, image, and physiological signals. Findings from previous work suggest the feasibility of transfer learning approaches in building reliable emotion recognition systems, yielding improved performance compared to in-domain learning (i.e., training and testing models on samples from the same dataset). Despite the encouraging findings, various opportunities for future work exist in leveraging multiple sources and modalities of emotional data, which have the potential to yield transferable emotional embeddings toward novel computational models of human emotion, and human behavior in general.</p>
</sec>
<sec id="s7">
<title>Author Contributions</title>
<p>KF made contributions to the conception, design, and analysis of existing work, and drafted the research article. TC contributed to the conception and design of the work, and revised the article.</p>
<sec>
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
</sec>
</body>
<back>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Abdelwahab</surname> <given-names>M.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Supervised domain adaptation for emotion recognition from speech</article-title>,&#x0201D; in <source>2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Brisbane, QLD</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5058</fpage>&#x02013;<lpage>5062</lpage>.</citation></ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Abdelwahab</surname> <given-names>M.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2018</year>). <article-title>Domain adversarial for acoustic emotion recognition</article-title>. <source>IEEE/ACM Trans. Audio Speech Lang. Process.</source> <volume>26</volume>, <fpage>2423</fpage>&#x02013;<lpage>2435</lpage>. <pub-id pub-id-type="doi">10.1109/TASLP.2018.2867099</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aly</surname> <given-names>S.</given-names></name> <name><surname>Trubanova</surname> <given-names>A.</given-names></name> <name><surname>Abbott</surname> <given-names>L.</given-names></name> <name><surname>White</surname> <given-names>S.</given-names></name> <name><surname>Youssef</surname> <given-names>A.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Vt-kfer: a kinect-based rgbd&#x0002B; time dataset for spontaneous and non-spontaneous facial expression recognition</article-title>,&#x0201D; in <source>2015 International Conference on Biometrics (ICB)</source> (<publisher-loc>Phuket</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>90</fpage>&#x02013;<lpage>97</lpage>.</citation></ref>
<ref id="B4">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aly</surname> <given-names>S. F.</given-names></name> <name><surname>Abbott</surname> <given-names>A. L.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Facial emotion recognition with varying poses and/or partial occlusion using multi-stage progressive transfer learning</article-title>,&#x0201D; in <source>Scandinavian Conference on Image Analysis</source> (<publisher-loc>Norrk&#x000F6;ping</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>101</fpage>&#x02013;<lpage>112</lpage>.</citation></ref>
<ref id="B5">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Arjovsky</surname> <given-names>M.</given-names></name> <name><surname>Chintala</surname> <given-names>S.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name></person-group> (<year>2017</year>). <article-title>Wasserstein GAN</article-title>. <source>arXiv: 1701.07875</source>.</citation></ref>
<ref id="B6">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Athanasiadis</surname> <given-names>C.</given-names></name> <name><surname>Hortal</surname> <given-names>E.</given-names></name> <name><surname>Asteriadis</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). <article-title>Audio-visual domain adaptation using conditional semi-supervised generative adversarial networks</article-title>. <source>Neurocomputing</source>. <pub-id pub-id-type="doi">10.1016/j.neucom.2019.09.106</pub-id></citation></ref>
<ref id="B7">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Aytar</surname> <given-names>Y.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2011</year>). &#x0201C;<article-title>Tabula rasa: model transfer for object category detection</article-title>,&#x0201D; in <source>2011 International Conference on Computer Vision</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2252</fpage>&#x02013;<lpage>2259</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2011.6126504</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Baktashmotlagh</surname> <given-names>M.</given-names></name> <name><surname>Harandi</surname> <given-names>M. T.</given-names></name> <name><surname>Lovell</surname> <given-names>B. C.</given-names></name> <name><surname>Salzmann</surname> <given-names>M.</given-names></name></person-group> (<year>2013</year>). &#x0201C;<article-title>Unsupervised domain adaptation by domain invariant projection</article-title>,&#x0201D; in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Sydney, NSW</publisher-loc>), <fpage>769</fpage>&#x02013;<lpage>776</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2013.100</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>B&#x000E4;nziger</surname> <given-names>T.</given-names></name> <name><surname>Scherer</surname> <given-names>K. R.</given-names></name></person-group> (<year>2010</year>). &#x0201C;<article-title>Introducing the geneva multimodal emotion portrayal (GEMEP) corpus</article-title>,&#x0201D; in <source>Blueprint for Affective Computing: A Sourcebook</source>, eds <person-group person-group-type="editor"><name><surname>Scherer</surname> <given-names>K. R.</given-names></name> <name><surname>B&#x000E4;nziger</surname> <given-names>T.</given-names></name> <name><surname>Roesch</surname> <given-names>E. B.</given-names></name></person-group> (<publisher-loc>Oxford</publisher-loc>: <publisher-name>Oxford University Press</publisher-name>), <fpage>271</fpage>&#x02013;<lpage>294</lpage>.</citation></ref>
<ref id="B10">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Beale</surname> <given-names>R.</given-names></name> <name><surname>Peter</surname> <given-names>C.</given-names></name></person-group> (Eds.). (<year>2008</year>). &#x0201C;<article-title>The role of affect and emotion in HCI</article-title>,&#x0201D; in <source>Affect and Emotion in Human-Computer Interaction</source> (<publisher-loc>Berlin; Heidelberg</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>11</lpage>.</citation></ref>
<ref id="B11">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Becker</surname> <given-names>C. J.</given-names></name> <name><surname>Christoudias</surname> <given-names>C. M.</given-names></name> <name><surname>Fua</surname> <given-names>P.</given-names></name></person-group> (<year>2013</year>). &#x0201C;<article-title>Non-linear domain adaptation with boosting</article-title>,&#x0201D; in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Lake Tahoe, BC</publisher-loc>), <fpage>485</fpage>&#x02013;<lpage>493</lpage>.</citation></ref>
<ref id="B12">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bergamo</surname> <given-names>A.</given-names></name> <name><surname>Torresani</surname> <given-names>L.</given-names></name></person-group> (<year>2010</year>). &#x0201C;<article-title>Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach</article-title>,&#x0201D; in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Vancouver, CA</publisher-loc>), <fpage>181</fpage>&#x02013;<lpage>189</lpage>.</citation></ref>
<ref id="B13">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Borth</surname> <given-names>D.</given-names></name> <name><surname>Ji</surname> <given-names>R.</given-names></name> <name><surname>Chen</surname> <given-names>T.</given-names></name> <name><surname>Breuel</surname> <given-names>T.</given-names></name> <name><surname>Chang</surname> <given-names>S.-F.</given-names></name></person-group> (<year>2013</year>). &#x0201C;<article-title>Large-scale visual sentiment ontology and detectors using adjective noun pairs</article-title>,&#x0201D; in <source>Proceedings of the 21st ACM International Conference on Multimedia</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>223</fpage>&#x02013;<lpage>232</lpage>. <pub-id pub-id-type="doi">10.1145/2502081.2502282</pub-id></citation></ref>
<ref id="B14">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Bousmalis</surname> <given-names>K.</given-names></name> <name><surname>Trigeorgis</surname> <given-names>G.</given-names></name> <name><surname>Silberman</surname> <given-names>N.</given-names></name> <name><surname>Krishnan</surname> <given-names>D.</given-names></name> <name><surname>Erhan</surname> <given-names>D.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Domain separation networks</article-title>,&#x0201D; in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Barcelona</publisher-loc>), <fpage>343</fpage>&#x02013;<lpage>351</lpage>.</citation></ref>
<ref id="B15">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Burkhardt</surname> <given-names>F.</given-names></name> <name><surname>Eckert</surname> <given-names>M.</given-names></name> <name><surname>Johannsen</surname> <given-names>W.</given-names></name> <name><surname>Stegmann</surname> <given-names>J.</given-names></name></person-group> (<year>2010</year>). &#x0201C;<article-title>A database of age and gender annotated telephone speech</article-title>,&#x0201D; in <source>LREC</source> (<publisher-loc>Malta</publisher-loc>).</citation></ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Burkhardt</surname> <given-names>F.</given-names></name> <name><surname>Paeschke</surname> <given-names>A.</given-names></name> <name><surname>Rolfes</surname> <given-names>M.</given-names></name> <name><surname>Sendlmeier</surname> <given-names>W. F.</given-names></name> <name><surname>Weiss</surname> <given-names>B.</given-names></name></person-group> (<year>2005</year>). &#x0201C;<article-title>A database of german emotional speech</article-title>,&#x0201D; in <source>Ninth European Conference on Speech Communication and Technology</source> (<publisher-loc>Lisbon</publisher-loc>).</citation></ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Busso</surname> <given-names>C.</given-names></name> <name><surname>Bulut</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>C.-C.</given-names></name> <name><surname>Kazemzadeh</surname> <given-names>A.</given-names></name> <name><surname>Mower</surname> <given-names>E.</given-names></name> <name><surname>Kim</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2008</year>). <article-title>IEMOCAP: interactive emotional dyadic motion capture database</article-title>. <source>Lang. Resour. Evaluat.</source> <volume>42</volume>:<fpage>335</fpage>. <pub-id pub-id-type="doi">10.1007/s10579-008-9076-6</pub-id></citation></ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Busso</surname> <given-names>C.</given-names></name> <name><surname>Deng</surname> <given-names>Z.</given-names></name> <name><surname>Yildirim</surname> <given-names>S.</given-names></name> <name><surname>Bulut</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>C. M.</given-names></name> <name><surname>Kazemzadeh</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2004</year>). &#x0201C;<article-title>Analysis of emotion recognition using facial expressions, speech and multimodal information</article-title>,&#x0201D; in <source>Proceedings of the 6th International Conference on Multimodal Interfaces</source> (<publisher-loc>State College, PA</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>205</fpage>&#x02013;<lpage>211</lpage>. <pub-id pub-id-type="doi">10.1145/1027933.1027968</pub-id></citation></ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Busso</surname> <given-names>C.</given-names></name> <name><surname>Parthasarathy</surname> <given-names>S.</given-names></name> <name><surname>Burmania</surname> <given-names>A.</given-names></name> <name><surname>AbdelWahab</surname> <given-names>M.</given-names></name> <name><surname>Sadoughi</surname> <given-names>N.</given-names></name> <name><surname>Provost</surname> <given-names>E. M.</given-names></name></person-group> (<year>2016</year>). <article-title>MSP-IMPROV: an acted corpus of dyadic interactions to study emotion perception</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>8</volume>, <fpage>67</fpage>&#x02013;<lpage>80</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2016.2515617</pub-id></citation></ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>H.</given-names></name> <name><surname>Cooper</surname> <given-names>D. G.</given-names></name> <name><surname>Keutmann</surname> <given-names>M. K.</given-names></name> <name><surname>Gur</surname> <given-names>R. C.</given-names></name> <name><surname>Nenkova</surname> <given-names>A.</given-names></name> <name><surname>Verma</surname> <given-names>R.</given-names></name></person-group> (<year>2014</year>). <article-title>CREMA-D: crowd-sourced emotional multimodal actors dataset</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>5</volume>, <fpage>377</fpage>&#x02013;<lpage>390</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2014.2336244</pub-id><pub-id pub-id-type="pmid">25653738</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Cao</surname> <given-names>Z.</given-names></name> <name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Jordan</surname> <given-names>M. I.</given-names></name></person-group> (<year>2018</year>). &#x0201C;<article-title>Partial transfer learning with selective adversarial networks</article-title>,&#x0201D; in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>2724</fpage>&#x02013;<lpage>2732</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2018.00288</pub-id></citation></ref>
<ref id="B22">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Carletta</surname> <given-names>J.</given-names></name> <name><surname>Ashby</surname> <given-names>S.</given-names></name> <name><surname>Bourban</surname> <given-names>S.</given-names></name> <name><surname>Flynn</surname> <given-names>M.</given-names></name> <name><surname>Guillemot</surname> <given-names>M.</given-names></name> <name><surname>Hain</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2005</year>). &#x0201C;<article-title>The AMI meeting corpus: a pre-announcement</article-title>,&#x0201D; in <source>International Workshop on Machine Learning for Multimodal Interaction</source> (<publisher-loc>Edinburgh</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>28</fpage>&#x02013;<lpage>39</lpage>.</citation></ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chai</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>Q.</given-names></name> <name><surname>Zhao</surname> <given-names>Y.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>D.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <etal/></person-group>. (<year>2017</year>). <article-title>A fast, efficient domain adaptation technique for cross-domain electroencephalography (EEG)-based emotion recognition</article-title>. <source>Sensors</source> <volume>17</volume>:<fpage>1014</fpage>. <pub-id pub-id-type="doi">10.3390/s17051014</pub-id><pub-id pub-id-type="pmid">28467371</pub-id></citation></ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Chai</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>Q.</given-names></name> <name><surname>Zhao</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>X.</given-names></name> <name><surname>Bai</surname> <given-names>O.</given-names></name> <name><surname>Li</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). <article-title>Unsupervised domain adaptation techniques based on auto-encoder for non-stationary eeg-based emotion recognition</article-title>. <source>Comput. Biol. Med.</source> <volume>79</volume>, <fpage>205</fpage>&#x02013;<lpage>214</lpage>. <pub-id pub-id-type="doi">10.1016/j.compbiomed.2016.10.019</pub-id><pub-id pub-id-type="pmid">27810626</pub-id></citation></ref>
<ref id="B25">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chang</surname> <given-names>J.</given-names></name> <name><surname>Scherer</surname> <given-names>S.</given-names></name></person-group> (<year>2017</year>). &#x0201C;<article-title>Learning representations of emotional speech with deep convolutional generative adversarial networks</article-title>,&#x0201D; in <source>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>New Orleans, LA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2746</fpage>&#x02013;<lpage>2750</lpage>. <pub-id pub-id-type="doi">10.1109/ICASSP.2017.7952656</pub-id></citation></ref>
<ref id="B26">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Chatfield</surname> <given-names>K.</given-names></name> <name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Vedaldi</surname> <given-names>A.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Return of the devil in the details: delving deep into convolutional nets</article-title>. <source>arXiv: 1405.3531</source>. <pub-id pub-id-type="doi">10.5244/C.28.6</pub-id></citation></ref>
<ref id="B27">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Costantini</surname> <given-names>G.</given-names></name> <name><surname>Iaderola</surname> <given-names>I.</given-names></name> <name><surname>Paoloni</surname> <given-names>A.</given-names></name> <name><surname>Todisco</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). &#x0201C;<article-title>Emovo corpus: an Italian emotional speech database</article-title>,&#x0201D; in <source>International Conference on Language Resources and Evaluation (LREC 2014)</source> (<publisher-loc>Reykjavik</publisher-loc>: <publisher-name>European Language Resources Association</publisher-name>), <fpage>3501</fpage>&#x02013;<lpage>3504</lpage>.</citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Daume</surname> <given-names>H.</given-names> <suffix>III.</suffix></name> <name><surname>Marcu</surname> <given-names>D.</given-names></name></person-group> (<year>2006</year>). <article-title>Domain adaptation for statistical classifiers</article-title>. <source>J. Artif. Intell. Res.</source> <volume>26</volume>, <fpage>101</fpage>&#x02013;<lpage>126</lpage>. <pub-id pub-id-type="doi">10.1613/jair.1872</pub-id></citation></ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Davis</surname> <given-names>E.</given-names></name> <name><surname>Greenberger</surname> <given-names>E.</given-names></name> <name><surname>Charles</surname> <given-names>S.</given-names></name> <name><surname>Chen</surname> <given-names>C.</given-names></name> <name><surname>Zhao</surname> <given-names>L.</given-names></name> <name><surname>Dong</surname> <given-names>Q.</given-names></name></person-group> (<year>2012</year>). <article-title>Emotion experience and regulation in china and the united states: how do culture and gender shape emotion responding?</article-title> <source>Int. J. Psychol.</source> <volume>47</volume>, <fpage>230</fpage>&#x02013;<lpage>239</lpage>. <pub-id pub-id-type="doi">10.1080/00207594.2011.626043</pub-id><pub-id pub-id-type="pmid">22250807</pub-id></citation></ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Fr&#x000FC;hholz</surname> <given-names>S.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2017</year>). <article-title>Recognizing emotions from whispered speech based on acoustic feature transfer learning</article-title>. <source>IEEE Access</source> <volume>5</volume>, <fpage>5235</fpage>&#x02013;<lpage>5246</lpage>. <pub-id pub-id-type="doi">10.1109/ACCESS.2017.2672722</pub-id></citation></ref>
<ref id="B31">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Xia</surname> <given-names>R.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Liu</surname> <given-names>Y.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2014a</year>). &#x0201C;<article-title>Introducing shared-hidden-layer autoencoders for transfer learning and their application in acoustic emotion recognition</article-title>,&#x0201D; in <source>2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Florence</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4818</fpage>&#x02013;<lpage>4822</lpage>.</citation></ref>
<ref id="B32">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Eyben</surname> <given-names>F.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2014b</year>). <article-title>Autoencoder-based unsupervised domain adaptation for speech emotion recognition</article-title>. <source>IEEE Signal Process. Lett.</source> <volume>21</volume>, <fpage>1068</fpage>&#x02013;<lpage>1072</lpage>. <pub-id pub-id-type="doi">10.1109/LSP.2014.2324759</pub-id></citation></ref>
<ref id="B33">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Marchi</surname> <given-names>E.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2013</year>). &#x0201C;<article-title>Sparse autoencoder-based feature transfer learning for speech emotion recognition</article-title>,&#x0201D; in <source>2013 Humaine Association Conference on Affective Computing and Intelligent Interaction</source> (<publisher-loc>Geneva</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>511</fpage>&#x02013;<lpage>516</lpage>.</citation></ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>Z.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2014c</year>). &#x0201C;<article-title>Linked source and target domain subspace feature transfer learning&#x02013;exemplified by speech emotion recognition</article-title>,&#x0201D; in <source>2014 22nd International Conference on Pattern Recognition</source> (<publisher-loc>Stockholm</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>761</fpage>&#x02013;<lpage>766</lpage>.</citation></ref>
<ref id="B35">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dhall</surname> <given-names>A.</given-names></name> <name><surname>Goecke</surname> <given-names>R.</given-names></name> <name><surname>Joshi</surname> <given-names>J.</given-names></name> <name><surname>Hoey</surname> <given-names>J.</given-names></name> <name><surname>Gedeon</surname> <given-names>T.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Emotiw 2016: video and group-level emotion recognition challenges</article-title>,&#x0201D; in <source>Proceedings of the 18th ACM International Conference on Multimodal Interaction</source> (<publisher-loc>Tokyo</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>427</fpage>&#x02013;<lpage>432</lpage>.</citation></ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Dhall</surname> <given-names>A.</given-names></name> <name><surname>Goecke</surname> <given-names>R.</given-names></name> <name><surname>Lucey</surname> <given-names>S.</given-names></name> <name><surname>Gedeon</surname> <given-names>T.</given-names></name></person-group> (<year>2012</year>). <article-title>Collecting large, richly annotated facial-expression databases from movies</article-title>. <source>IEEE Multimedia</source> <volume>19</volume>, <fpage>34</fpage>&#x02013;<lpage>41</lpage>. <pub-id pub-id-type="doi">10.1109/MMUL.2012.26</pub-id></citation></ref>
<ref id="B37">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Dhall</surname> <given-names>A.</given-names></name> <name><surname>Ramana Murthy</surname> <given-names>O.</given-names></name> <name><surname>Goecke</surname> <given-names>R.</given-names></name> <name><surname>Joshi</surname> <given-names>J.</given-names></name> <name><surname>Gedeon</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Video and image based emotion recognition challenges in the wild: emotiw 2015</article-title>,&#x0201D; in <source>Proceedings of the 2015 ACM on International Conference on Multimodal Interaction</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>423</fpage>&#x02013;<lpage>426</lpage>.</citation></ref>
<ref id="B38">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ding</surname> <given-names>Z.</given-names></name> <name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Fu</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). <source>Multi-source Transfer Learning</source>. <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>.</citation></ref>
<ref id="B39">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Duan</surname> <given-names>L.</given-names></name> <name><surname>Tsang</surname> <given-names>I. W.</given-names></name> <name><surname>Xu</surname> <given-names>D.</given-names></name> <name><surname>Maybank</surname> <given-names>S. J.</given-names></name></person-group> (<year>2009</year>). &#x0201C;<article-title>Domain transfer SVM for video concept detection</article-title>,&#x0201D; in <source>2009 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Miami, FL</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1375</fpage>&#x02013;<lpage>1381</lpage>.</citation></ref>
<ref id="B40">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Escalante</surname> <given-names>H. J.</given-names></name> <name><surname>Ponce-L&#x000F3;pez</surname> <given-names>V.</given-names></name> <name><surname>Wan</surname> <given-names>J.</given-names></name> <name><surname>Riegler</surname> <given-names>M. A.</given-names></name> <name><surname>Chen</surname> <given-names>B.</given-names></name> <name><surname>Clap&#x000E9;s</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2016</year>). &#x0201C;<article-title>Chalearn joint contest on multimedia challenges beyond visual analysis: an overview</article-title>,&#x0201D; in <source>2016 23rd International Conference on Pattern Recognition (ICPR)</source> (<publisher-loc>Canc&#x000FA;n</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>67</fpage>&#x02013;<lpage>73</lpage>.</citation></ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ganin</surname> <given-names>Y.</given-names></name> <name><surname>Lempitsky</surname> <given-names>V.</given-names></name></person-group> (<year>2014</year>). <article-title>Unsupervised domain adaptation by backpropagation</article-title>. <source>arXiv: 1409.7495</source>.</citation></ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ganin</surname> <given-names>Y.</given-names></name> <name><surname>Ustinova</surname> <given-names>E.</given-names></name> <name><surname>Ajakan</surname> <given-names>H.</given-names></name> <name><surname>Germain</surname> <given-names>P.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Laviolette</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Domain-adversarial training of neural networks</article-title>. <source>J. Mach. Learn. Res.</source> <volume>17</volume>, <fpage>2096</fpage>&#x02013;<lpage>2030</lpage>.</citation></ref>
<ref id="B43">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ghifary</surname> <given-names>M.</given-names></name> <name><surname>Kleijn</surname> <given-names>W. B.</given-names></name> <name><surname>Zhang</surname> <given-names>M.</given-names></name></person-group> (<year>2014</year>). &#x0201C;<article-title>Domain adaptive neural networks for object recognition</article-title>,&#x0201D; in <source>Pacific Rim International Conference on Artificial Intelligence</source> (<publisher-loc>Queensland</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>898</fpage>&#x02013;<lpage>904</lpage>.</citation></ref>
<ref id="B44">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gideon</surname> <given-names>J.</given-names></name> <name><surname>Khorram</surname> <given-names>S.</given-names></name> <name><surname>Aldeneh</surname> <given-names>Z.</given-names></name> <name><surname>Dimitriadis</surname> <given-names>D.</given-names></name> <name><surname>Provost</surname> <given-names>E. M.</given-names></name></person-group> (<year>2017</year>). <article-title>Progressive neural networks for transfer learning in emotion recognition</article-title>. <source>arXiv: 1706.03256</source>. <pub-id pub-id-type="doi">10.21437/Interspeech.2017-1637</pub-id></citation></ref>
<ref id="B45">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gideon</surname> <given-names>J.</given-names></name> <name><surname>McInnis</surname> <given-names>M. G.</given-names></name> <name><surname>Provost</surname> <given-names>E. M.</given-names></name></person-group> (<year>2019</year>). <article-title>Barking up the right tree: improving cross-corpus speech emotion recognition with adversarial discriminative domain generalization (ADDoG)</article-title>. <source>arXiv: 1903.12094</source>. <pub-id pub-id-type="doi">10.1109/TAFFC.2019.2916092</pub-id></citation></ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Goeleven</surname> <given-names>E.</given-names></name> <name><surname>De Raedt</surname> <given-names>R.</given-names></name> <name><surname>Leyman</surname> <given-names>L.</given-names></name> <name><surname>Verschuere</surname> <given-names>B.</given-names></name></person-group> (<year>2008</year>). <article-title>The Karolinska directed emotional faces: a validation study</article-title>. <source>Cogn. Emot.</source> <volume>22</volume>, <fpage>1094</fpage>&#x02013;<lpage>1118</lpage>. <pub-id pub-id-type="doi">10.1080/02699930701626582</pub-id></citation></ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>G&#x000F6;k&#x000E7;en</surname> <given-names>E.</given-names></name> <name><surname>Furnham</surname> <given-names>A.</given-names></name> <name><surname>Mavroveli</surname> <given-names>S.</given-names></name> <name><surname>Petrides</surname> <given-names>K.</given-names></name></person-group> (<year>2014</year>). <article-title>A cross-cultural investigation of trait emotional intelligence in Hong Kong and the UK</article-title>. <source>Pers. Individ. Diff.</source> <volume>65</volume>, <fpage>30</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1016/j.paid.2014.01.053</pub-id></citation></ref>
<ref id="B48">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gong</surname> <given-names>B.</given-names></name> <name><surname>Shi</surname> <given-names>Y.</given-names></name> <name><surname>Sha</surname> <given-names>F.</given-names></name> <name><surname>Grauman</surname> <given-names>K.</given-names></name></person-group> (<year>2012</year>). &#x0201C;<article-title>Geodesic flow kernel for unsupervised domain adaptation</article-title>,&#x0201D; in <source>2012 IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Providence, RI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2066</fpage>&#x02013;<lpage>2073</lpage>.</citation></ref>
<ref id="B49">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I.</given-names></name> <name><surname>Pouget-Abadie</surname> <given-names>J.</given-names></name> <name><surname>Mirza</surname> <given-names>M.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Warde-Farley</surname> <given-names>D.</given-names></name> <name><surname>Ozair</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2014</year>). &#x0201C;<article-title>Generative adversarial nets</article-title>,&#x0201D; in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>2672</fpage>&#x02013;<lpage>2680</lpage>.</citation></ref>
<ref id="B50">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Goodfellow</surname> <given-names>I. J.</given-names></name> <name><surname>Erhan</surname> <given-names>D.</given-names></name> <name><surname>Carrier</surname> <given-names>P.</given-names></name> <name><surname>Courville</surname> <given-names>A.</given-names></name> <name><surname>Mirza</surname> <given-names>M.</given-names></name> <name><surname>Hamner</surname> <given-names>B.</given-names></name> <etal/></person-group>. (<year>2013</year>). &#x0201C;<article-title>Challenges in representation learning: a report on three machine learning contests</article-title>,&#x0201D; in <source>International Conference on Neural Information Processing</source> (<publisher-loc>Daegu</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>117</fpage>&#x02013;<lpage>124</lpage>.</citation></ref>
<ref id="B51">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Gopalan</surname> <given-names>R.</given-names></name> <name><surname>Li</surname> <given-names>R.</given-names></name> <name><surname>Chellappa</surname> <given-names>R.</given-names></name></person-group> (<year>2011</year>). &#x0201C;<article-title>Domain adaptation for object recognition: an unsupervised approach</article-title>,&#x0201D; in <source>2011 International Conference on Computer Vision</source> (<publisher-loc>Barcelona</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>999</fpage>&#x02013;<lpage>1006</lpage>.</citation></ref>
<ref id="B52">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Grimm</surname> <given-names>M.</given-names></name> <name><surname>Kroschel</surname> <given-names>K.</given-names></name> <name><surname>Narayanan</surname> <given-names>S.</given-names></name></person-group> (<year>2008</year>). &#x0201C;<article-title>The Vera am Mittag German audio-visual emotional speech database</article-title>,&#x0201D; in <source>2008 IEEE International Conference on Multimedia and Expo</source> (<publisher-loc>Hanover</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>865</fpage>&#x02013;<lpage>868</lpage>.</citation></ref>
<ref id="B53">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hansen</surname> <given-names>J. H.</given-names></name> <name><surname>Bou-Ghazale</surname> <given-names>S. E.</given-names></name></person-group> (<year>1997</year>). &#x0201C;<article-title>Getting started with SUSAS: a speech under simulated and actual stress database</article-title>,&#x0201D; in <source>Fifth European Conference on Speech Communication and Technology</source> (<publisher-loc>Rhodes</publisher-loc>).</citation></ref>
<ref id="B54">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Haq</surname> <given-names>S.</given-names></name> <name><surname>Jackson</surname> <given-names>P. J.</given-names></name> <name><surname>Edge</surname> <given-names>J.</given-names></name></person-group> (<year>2008</year>). &#x0201C;<article-title>Audio-visual feature selection and reduction for emotion classification</article-title>,&#x0201D; in <source>Proceedings of the International Conference on Auditory-Visual Speech Processing (AVSP&#x00027;08)</source> (<publisher-loc>Tangalooma, QLD</publisher-loc>).</citation></ref>
<ref id="B55">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Deep residual learning for image recognition</article-title>,&#x0201D; in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>770</fpage>&#x02013;<lpage>778</lpage>.</citation></ref>
<ref id="B56">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Hjelm</surname> <given-names>R. D.</given-names></name> <name><surname>Fedorov</surname> <given-names>A.</given-names></name> <name><surname>Lavoie-Marchildon</surname> <given-names>S.</given-names></name> <name><surname>Grewal</surname> <given-names>K.</given-names></name> <name><surname>Trischler</surname> <given-names>A.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Learning deep representations by mutual information estimation and maximization</article-title>. <source>arXiv: 1808.06670</source>.</citation></ref>
<ref id="B57">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hodges</surname> <given-names>W.</given-names></name> <name><surname>Spielberger</surname> <given-names>C.</given-names></name></person-group> (<year>1966</year>). <article-title>The effects of threat of shock on heart rate for subjects who differ in manifest anxiety and fear of shock</article-title>. <source>Psychophysiology</source> <volume>2</volume>, <fpage>287</fpage>&#x02013;<lpage>294</lpage>. <pub-id pub-id-type="doi">10.1111/j.1469-8986.1966.tb02656.x</pub-id></citation></ref>
<ref id="B58">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Huang</surname> <given-names>Y.</given-names></name> <name><surname>Yang</surname> <given-names>J.</given-names></name> <name><surname>Liu</surname> <given-names>S.</given-names></name> <name><surname>Pan</surname> <given-names>J.</given-names></name></person-group> (<year>2019</year>). <article-title>Combining facial expressions and electroencephalography to enhance emotion recognition</article-title>. <source>Fut. Int.</source> <volume>11</volume>:<fpage>105</fpage>. <pub-id pub-id-type="doi">10.3390/fi11050105</pub-id></citation></ref>
<ref id="B59">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jackson</surname> <given-names>P.</given-names></name> <name><surname>Haq</surname> <given-names>S.</given-names></name></person-group> (<year>2014</year>). <source>Surrey Audio-Visual Expressed Emotion (SAVEE) Database</source>. <publisher-loc>Guildford</publisher-loc>: <publisher-name>University of Surrey</publisher-name>.</citation></ref>
<ref id="B60">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jiang</surname> <given-names>Y.-G.</given-names></name> <name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Xue</surname> <given-names>X.</given-names></name></person-group> (<year>2014</year>). &#x0201C;<article-title>Predicting emotions in user-generated videos</article-title>,&#x0201D; in <source>Twenty-Eighth AAAI Conference on Artificial Intelligence</source> (<publisher-loc>Qu&#x000E9;bec, QC</publisher-loc>).</citation></ref>
<ref id="B61">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Jung</surname> <given-names>H.</given-names></name> <name><surname>Ju</surname> <given-names>J.</given-names></name> <name><surname>Jung</surname> <given-names>M.</given-names></name> <name><surname>Kim</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). &#x0201C;<article-title>Less-forgetful learning for domain expansion in deep neural networks</article-title>,&#x0201D; in <source>Thirty-Second AAAI Conference on Artificial Intelligence</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation></ref>
<ref id="B62">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kaya</surname> <given-names>H.</given-names></name> <name><surname>G&#x000FC;rp&#x00131;nar</surname> <given-names>F.</given-names></name> <name><surname>Salah</surname> <given-names>A. A.</given-names></name></person-group> (<year>2017</year>). <article-title>Video-based emotion recognition in the wild using deep transfer learning and score fusion</article-title>. <source>Image Vis. Comput.</source> <volume>65</volume>, <fpage>66</fpage>&#x02013;<lpage>75</lpage>. <pub-id pub-id-type="doi">10.1016/j.imavis.2017.01.012</pub-id></citation></ref>
<ref id="B63">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Khorram</surname> <given-names>S.</given-names></name> <name><surname>Jaiswal</surname> <given-names>M.</given-names></name> <name><surname>Gideon</surname> <given-names>J.</given-names></name> <name><surname>McInnis</surname> <given-names>M. G.</given-names></name> <name><surname>Provost</surname> <given-names>E. M.</given-names></name></person-group> (<year>2018</year>). <article-title>The PRIORI emotion dataset: linking mood to emotion detected in-the-wild</article-title>. <source>CoRR</source> abs/1806.10658. <pub-id pub-id-type="doi">10.21437/Interspeech.2018-2355</pub-id></citation></ref>
<ref id="B64">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Koelstra</surname> <given-names>S.</given-names></name> <name><surname>Muhl</surname> <given-names>C.</given-names></name> <name><surname>Soleymani</surname> <given-names>M.</given-names></name> <name><surname>Lee</surname> <given-names>J.-S.</given-names></name> <name><surname>Yazdani</surname> <given-names>A.</given-names></name> <name><surname>Ebrahimi</surname> <given-names>T.</given-names></name> <etal/></person-group>. (<year>2011</year>). <article-title>DEAP: a database for emotion analysis; using physiological signals</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>3</volume>, <fpage>18</fpage>&#x02013;<lpage>31</lpage>. <pub-id pub-id-type="doi">10.1109/T-AFFC.2011.15</pub-id></citation></ref>
<ref id="B65">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Sutskever</surname> <given-names>I.</given-names></name> <name><surname>Hinton</surname> <given-names>G. E.</given-names></name></person-group> (<year>2012</year>). &#x0201C;<article-title>Imagenet classification with deep convolutional neural networks</article-title>,&#x0201D; in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Lake Tahoe, CA</publisher-loc>), <fpage>1097</fpage>&#x02013;<lpage>1105</lpage>.</citation></ref>
<ref id="B66">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Kulis</surname> <given-names>B.</given-names></name> <name><surname>Saenko</surname> <given-names>K.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2011</year>). &#x0201C;<article-title>What you saw is not what you get: domain adaptation using asymmetric kernel transforms</article-title>,&#x0201D; in <source>CVPR 2011</source> (<publisher-loc>Colorado Springs, CO</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1785</fpage>&#x02013;<lpage>1792</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2011.5995702</pub-id></citation></ref>
<ref id="B67">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lan</surname> <given-names>Z.</given-names></name> <name><surname>Sourina</surname> <given-names>O.</given-names></name> <name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Scherer</surname> <given-names>R.</given-names></name> <name><surname>M&#x000FC;ller-Putz</surname> <given-names>G. R.</given-names></name></person-group> (<year>2018</year>). <article-title>Domain adaptation techniques for EEG-based emotion recognition: a comparative study on two public datasets</article-title>. <source>IEEE Trans. Cogn. Dev. Syst.</source> <volume>11</volume>, <fpage>85</fpage>&#x02013;<lpage>94</lpage>. <pub-id pub-id-type="doi">10.1109/TCDS.2018.2826840</pub-id></citation></ref>
<ref id="B68">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Latif</surname> <given-names>S.</given-names></name> <name><surname>Qadir</surname> <given-names>J.</given-names></name> <name><surname>Bilal</surname> <given-names>M.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Unsupervised adversarial domain adaptation for cross-lingual speech emotion recognition</article-title>,&#x0201D; in <source>2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)</source> (<publisher-loc>Cambridge</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>732</fpage>&#x02013;<lpage>737</lpage>. <pub-id pub-id-type="doi">10.1109/ACII.2019.8925513</pub-id></citation></ref>
<ref id="B69">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Latif</surname> <given-names>S.</given-names></name> <name><surname>Rana</surname> <given-names>R.</given-names></name> <name><surname>Younis</surname> <given-names>S.</given-names></name> <name><surname>Qadir</surname> <given-names>J.</given-names></name> <name><surname>Epps</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). <article-title>Cross corpus speech emotion classification-an effective transfer learning technique</article-title>. <source>arXiv: 1801.06353</source>.</citation></ref>
<ref id="B70">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Qiu</surname> <given-names>S.</given-names></name> <name><surname>Shen</surname> <given-names>Y.-Y.</given-names></name> <name><surname>Liu</surname> <given-names>C.-L.</given-names></name> <name><surname>He</surname> <given-names>H.</given-names></name></person-group> (<year>2019</year>). <article-title>Multisource transfer learning for cross-subject EEG emotion recognition</article-title>. <source>IEEE Trans. Cybernet.</source> <pub-id pub-id-type="doi">10.1109/TCYB.2019.2904052</pub-id><pub-id pub-id-type="pmid">30932860</pub-id></citation></ref>
<ref id="B71">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Q.</given-names></name> <name><surname>Chaspari</surname> <given-names>T.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Exploring transfer learning between scripted and spontaneous speech for emotion recognition</article-title>,&#x0201D; in <source>Proceedings of the ACM International Conference on Multimodal Interaction (ICMI)</source> (<publisher-loc>Suzhou</publisher-loc>: <publisher-name>ACM</publisher-name>).</citation></ref>
<ref id="B72">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>Y.-P.</given-names></name></person-group> (<year>2019</year>). <article-title>Constructing a personalized cross-day EEG-based emotion-classification model using transfer learning</article-title>. <source>IEEE J. Biomed. Health Informat.</source> <pub-id pub-id-type="doi">10.1109/JBHI.2019.2934172</pub-id><pub-id pub-id-type="pmid">31403448</pub-id></citation></ref>
<ref id="B73">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>Y.-P.</given-names></name> <name><surname>Hsu</surname> <given-names>S.-H.</given-names></name> <name><surname>Jung</surname> <given-names>T.-P.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Exploring day-to-day variability in the relations between emotion and EEG signals</article-title>,&#x0201D; in <source>International Conference on Augmented Cognition</source> (<publisher-loc>Los Angeles, CA</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>461</fpage>&#x02013;<lpage>469</lpage>.</citation></ref>
<ref id="B74">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>Y.-P.</given-names></name> <name><surname>Jung</surname> <given-names>T.-P.</given-names></name></person-group> (<year>2017</year>). <article-title>Improving EEG-based emotion classification using conditional transfer learning</article-title>. <source>Front. Hum. Neurosci.</source> <volume>11</volume>:<fpage>334</fpage>. <pub-id pub-id-type="doi">10.3389/fnhum.2017.00334</pub-id><pub-id pub-id-type="pmid">28701938</pub-id></citation></ref>
<ref id="B75">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>Y.-P.</given-names></name> <name><surname>Wang</surname> <given-names>C.-H.</given-names></name> <name><surname>Jung</surname> <given-names>T.-P.</given-names></name> <name><surname>Wu</surname> <given-names>T.-L.</given-names></name> <name><surname>Jeng</surname> <given-names>S.-K.</given-names></name> <name><surname>Duann</surname> <given-names>J.-R.</given-names></name> <etal/></person-group>. (<year>2010</year>). <article-title>EEG-based emotion recognition in music listening</article-title>. <source>IEEE Trans. Biomed. Eng.</source> <volume>57</volume>, <fpage>1798</fpage>&#x02013;<lpage>1806</lpage>. <pub-id pub-id-type="doi">10.1109/TBME.2010.2048568</pub-id><pub-id pub-id-type="pmid">20442037</pub-id></citation></ref>
<ref id="B76">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Livingstone</surname> <given-names>S. R.</given-names></name> <name><surname>Russo</surname> <given-names>F. A.</given-names></name></person-group> (<year>2018</year>). <article-title>The Ryerson audio-visual database of emotional speech and song (RAVDESS): a dynamic, multimodal set of facial and vocal expressions in North American English</article-title>. <source>PLoS ONE</source> <volume>13</volume>:<fpage>e0196391</fpage>. <pub-id pub-id-type="doi">10.1371/journal.pone.0196391</pub-id><pub-id pub-id-type="pmid">29768426</pub-id></citation></ref>
<ref id="B77">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Cao</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Jordan</surname> <given-names>M. I.</given-names></name></person-group> (<year>2015</year>). <article-title>Learning transferable features with deep adaptation networks</article-title>. <source>arXiv: 1502.02791</source>.</citation></ref>
<ref id="B78">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Ding</surname> <given-names>G.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name> <name><surname>Guo</surname> <given-names>Y.</given-names></name> <name><surname>Yu</surname> <given-names>P. S.</given-names></name></person-group> (<year>2013</year>). &#x0201C;<article-title>Transfer sparse coding for robust image representation</article-title>,&#x0201D; in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Portland, OR</publisher-loc>), <fpage>407</fpage>&#x02013;<lpage>414</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2013.59</pub-id></citation></ref>
<ref id="B79">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Zhu</surname> <given-names>H.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name> <name><surname>Jordan</surname> <given-names>M. I.</given-names></name></person-group> (<year>2017</year>). &#x0201C;<article-title>Deep transfer learning with joint adaptation networks</article-title>,&#x0201D; in <source>Proceedings of the 34th International Conference on Machine Learning-Volume 70</source> (<publisher-loc>Sydney, NSW</publisher-loc>), <fpage>2208</fpage>&#x02013;<lpage>2217</lpage>.</citation></ref>
<ref id="B80">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lotfian</surname> <given-names>R.</given-names></name> <name><surname>Busso</surname> <given-names>C.</given-names></name></person-group> (<year>2017</year>). <article-title>Building naturalistic emotionally balanced speech corpus by retrieving emotional speech from existing podcast recordings</article-title>. <source>IEEE Trans. Affect. Comput.</source></citation></ref>
<ref id="B81">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lucey</surname> <given-names>P.</given-names></name> <name><surname>Cohn</surname> <given-names>J. F.</given-names></name> <name><surname>Kanade</surname> <given-names>T.</given-names></name> <name><surname>Saragih</surname> <given-names>J.</given-names></name> <name><surname>Ambadar</surname> <given-names>Z.</given-names></name> <name><surname>Matthews</surname> <given-names>I.</given-names></name></person-group> (<year>2010</year>). &#x0201C;<article-title>The extended Cohn-Kanade dataset (CK&#x0002B;): a complete dataset for action unit and emotion-specified expression</article-title>,&#x0201D; in <source>2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops</source> (<publisher-loc>San Francisco, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>94</fpage>&#x02013;<lpage>101</lpage>. <pub-id pub-id-type="doi">10.1109/CVPRW.2010.5543262</pub-id></citation></ref>
<ref id="B82">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lyons</surname> <given-names>M.</given-names></name> <name><surname>Akamatsu</surname> <given-names>S.</given-names></name> <name><surname>Kamachi</surname> <given-names>M.</given-names></name> <name><surname>Gyoba</surname> <given-names>J.</given-names></name></person-group> (<year>1998</year>). &#x0201C;<article-title>Coding facial expressions with Gabor wavelets</article-title>,&#x0201D; in <source>Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition</source> (<publisher-loc>Nara</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>200</fpage>&#x02013;<lpage>205</lpage>.</citation></ref>
<ref id="B83">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Lyons</surname> <given-names>M. J.</given-names></name> <name><surname>Budynek</surname> <given-names>J.</given-names></name> <name><surname>Akamatsu</surname> <given-names>S.</given-names></name></person-group> (<year>1999</year>). <article-title>Automatic classification of single facial images</article-title>. <source>IEEE Trans. Patt. Anal. Mach. Intell.</source> <volume>21</volume>, <fpage>1357</fpage>&#x02013;<lpage>1362</lpage>. <pub-id pub-id-type="doi">10.1109/34.817413</pub-id></citation></ref>
<ref id="B84">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ma</surname> <given-names>Y.</given-names></name> <name><surname>Hao</surname> <given-names>Y.</given-names></name> <name><surname>Chen</surname> <given-names>M.</given-names></name> <name><surname>Chen</surname> <given-names>J.</given-names></name> <name><surname>Lu</surname> <given-names>P.</given-names></name> <name><surname>Ko&#x00161;ir</surname> <given-names>A.</given-names></name></person-group> (<year>2019</year>). <article-title>Audio-visual emotion fusion (AVEF): a deep efficient weighted approach</article-title>. <source>Informat. Fusion</source> <volume>46</volume>, <fpage>184</fpage>&#x02013;<lpage>192</lpage>. <pub-id pub-id-type="doi">10.1016/j.inffus.2018.06.003</pub-id></citation></ref>
<ref id="B85">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mao</surname> <given-names>Q.</given-names></name> <name><surname>Xue</surname> <given-names>W.</given-names></name> <name><surname>Rao</surname> <given-names>Q.</given-names></name> <name><surname>Zhang</surname> <given-names>F.</given-names></name> <name><surname>Zhan</surname> <given-names>Y.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Domain adaptation for speech emotion recognition by sharing priors between related source and target classes</article-title>,&#x0201D; in <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>2608</fpage>&#x02013;<lpage>2612</lpage>.</citation></ref>
<ref id="B86">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Martin</surname> <given-names>O.</given-names></name> <name><surname>Kotsia</surname> <given-names>I.</given-names></name> <name><surname>Macq</surname> <given-names>B.</given-names></name> <name><surname>Pitas</surname> <given-names>I.</given-names></name></person-group> (<year>2006</year>). &#x0201C;<article-title>The eNTERFACE&#x00027;05 audio-visual emotion database</article-title>,&#x0201D; in <source>22nd International Conference on Data Engineering Workshops (ICDEW&#x00027;06)</source> (<publisher-loc>Atlanta, GA</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>8</fpage>.</citation></ref>
<ref id="B87">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Mathias</surname> <given-names>M.</given-names></name> <name><surname>Benenson</surname> <given-names>R.</given-names></name> <name><surname>Pedersoli</surname> <given-names>M.</given-names></name> <name><surname>Van Gool</surname> <given-names>L.</given-names></name></person-group> (<year>2014</year>). &#x0201C;<article-title>Face detection without bells and whistles</article-title>,&#x0201D; in <source>European Conference on Computer Vision</source> (<publisher-loc>Zurich</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>720</fpage>&#x02013;<lpage>735</lpage>.</citation></ref>
<ref id="B88">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>McKeown</surname> <given-names>G.</given-names></name> <name><surname>Valstar</surname> <given-names>M.</given-names></name> <name><surname>Cowie</surname> <given-names>R.</given-names></name> <name><surname>Pantic</surname> <given-names>M.</given-names></name> <name><surname>Schroder</surname> <given-names>M.</given-names></name></person-group> (<year>2011</year>). <article-title>The SEMAINE database: annotated multimodal records of emotionally colored conversations between a person and a limited agent</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>3</volume>, <fpage>5</fpage>&#x02013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.1109/T-AFFC.2011.20</pub-id></citation></ref>
<ref id="B89">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Mollahosseini</surname> <given-names>A.</given-names></name> <name><surname>Hasani</surname> <given-names>B.</given-names></name> <name><surname>Mahoor</surname> <given-names>M. H.</given-names></name></person-group> (<year>2017</year>). <article-title>AffectNet: a database for facial expression, valence, and arousal computing in the wild</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>10</volume>, <fpage>18</fpage>&#x02013;<lpage>31</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2017.2740923</pub-id></citation></ref>
<ref id="B90">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Motiian</surname> <given-names>S.</given-names></name> <name><surname>Piccirilli</surname> <given-names>M.</given-names></name> <name><surname>Adjeroh</surname> <given-names>D. A.</given-names></name> <name><surname>Doretto</surname> <given-names>G.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Information bottleneck learning using privileged information for visual recognition</article-title>,&#x0201D; in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Las Vegas, NV</publisher-loc>), <fpage>1496</fpage>&#x02013;<lpage>1505</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2016.166</pub-id></citation></ref>
<ref id="B91">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Motiian</surname> <given-names>S.</given-names></name> <name><surname>Piccirilli</surname> <given-names>M.</given-names></name> <name><surname>Adjeroh</surname> <given-names>D. A.</given-names></name> <name><surname>Doretto</surname> <given-names>G.</given-names></name></person-group> (<year>2017</year>). &#x0201C;<article-title>Unified deep supervised domain adaptation and generalization</article-title>,&#x0201D; in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Venice</publisher-loc>), <fpage>5715</fpage>&#x02013;<lpage>5725</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.609</pub-id></citation></ref>
<ref id="B92">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Muandet</surname> <given-names>K.</given-names></name> <name><surname>Balduzzi</surname> <given-names>D.</given-names></name> <name><surname>Sch&#x000F6;lkopf</surname> <given-names>B.</given-names></name></person-group> (<year>2013</year>). &#x0201C;<article-title>Domain generalization via invariant feature representation</article-title>,&#x0201D; in <source>International Conference on Machine Learning</source> (<publisher-loc>Atlanta, GA</publisher-loc>), <fpage>10</fpage>&#x02013;<lpage>18</lpage>.</citation></ref>
<ref id="B93">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Neumann</surname> <given-names>M.</given-names></name> <name><surname>Vu</surname> <given-names>N. T.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Improving speech emotion recognition with unsupervised representation learning on unlabeled speech</article-title>,&#x0201D; in <source>ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Brighton</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>7390</fpage>&#x02013;<lpage>7394</lpage>.</citation></ref>
<ref id="B94">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ng</surname> <given-names>H.-W.</given-names></name> <name><surname>Nguyen</surname> <given-names>V. D.</given-names></name> <name><surname>Vonikakis</surname> <given-names>V.</given-names></name> <name><surname>Winkler</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Deep learning for emotion recognition on small datasets using transfer learning</article-title>,&#x0201D; in <source>Proceedings of the 2015 ACM on International Conference on Multimodal Interaction</source> (<publisher-loc>Seattle, WA</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>443</fpage>&#x02013;<lpage>449</lpage>.</citation></ref>
<ref id="B95">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ngo</surname> <given-names>T. Q.</given-names></name> <name><surname>Yoon</surname> <given-names>S.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Facial expression recognition on static images</article-title>,&#x0201D; in <source>International Conference on Future Data and Security Engineering</source> (<publisher-loc>Nha Trang</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>640</fpage>&#x02013;<lpage>647</lpage>.</citation></ref>
<ref id="B96">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ortega</surname> <given-names>J. D.</given-names></name> <name><surname>Cardinal</surname> <given-names>P.</given-names></name> <name><surname>Koerich</surname> <given-names>A. L.</given-names></name></person-group> (<year>2019</year>). <article-title>Emotion recognition using fusion of audio and video features</article-title>. <source>arXiv: 1906.10623</source>. <pub-id pub-id-type="doi">10.1109/SMC.2019.8914655</pub-id></citation></ref>
<ref id="B97">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ouyang</surname> <given-names>X.</given-names></name> <name><surname>Kawaai</surname> <given-names>S.</given-names></name> <name><surname>Goh</surname> <given-names>E. G. H.</given-names></name> <name><surname>Shen</surname> <given-names>S.</given-names></name> <name><surname>Ding</surname> <given-names>W.</given-names></name> <name><surname>Ming</surname> <given-names>H.</given-names></name> <etal/></person-group>. (<year>2017</year>). &#x0201C;<article-title>Audio-visual emotion recognition using deep transfer learning and multiple temporal models</article-title>,&#x0201D; in <source>Proceedings of the 19th ACM International Conference on Multimodal Interaction</source> (<publisher-loc>Glasgow</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>577</fpage>&#x02013;<lpage>582</lpage>.</citation></ref>
<ref id="B98">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>S. J.</given-names></name> <name><surname>Tsang</surname> <given-names>I. W.</given-names></name> <name><surname>Kwok</surname> <given-names>J. T.</given-names></name> <name><surname>Yang</surname> <given-names>Q.</given-names></name></person-group> (<year>2010</year>). <article-title>Domain adaptation via transfer component analysis</article-title>. <source>IEEE Trans. Neural Netw.</source> <volume>22</volume>, <fpage>199</fpage>&#x02013;<lpage>210</lpage>. <pub-id pub-id-type="doi">10.1109/TNN.2010.2091281</pub-id><pub-id pub-id-type="pmid">21095864</pub-id></citation></ref>
<ref id="B99">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Pan</surname> <given-names>S. J.</given-names></name> <name><surname>Yang</surname> <given-names>Q.</given-names></name></person-group> (<year>2009</year>). <article-title>A survey on transfer learning</article-title>. <source>IEEE Trans. Knowl. Data Eng.</source> <volume>22</volume>, <fpage>1345</fpage>&#x02013;<lpage>1359</lpage>. <pub-id pub-id-type="doi">10.1109/TKDE.2009.191</pub-id></citation></ref>
<ref id="B100">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Parkhi</surname> <given-names>O. M.</given-names></name> <name><surname>Vedaldi</surname> <given-names>A.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2015</year>). &#x0201C;<article-title>Deep face recognition</article-title>,&#x0201D; in <source>BMVC</source>, <volume>Vol. 1</volume> (<publisher-loc>Swansea</publisher-loc>), <fpage>1</fpage>&#x02013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.5244/C.29.41</pub-id></citation></ref>
<ref id="B101">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Pei</surname> <given-names>Z.</given-names></name> <name><surname>Cao</surname> <given-names>Z.</given-names></name> <name><surname>Long</surname> <given-names>M.</given-names></name> <name><surname>Wang</surname> <given-names>J.</given-names></name></person-group> (<year>2018</year>). &#x0201C;<article-title>Multi-adversarial domain adaptation</article-title>,&#x0201D; in <source>Thirty-Second AAAI Conference on Artificial Intelligence</source> (<publisher-loc>New Orleans, LA</publisher-loc>).</citation></ref>
<ref id="B102">
<citation citation-type="web"><person-group person-group-type="author"><collab>PIC</collab></person-group> (<year>2013</year>). <source>Psychological image collection at stirling (PICS)</source>. Available online at: <ext-link ext-link-type="uri" xlink:href="http://pics.stir.ac.uk/">http://pics.stir.ac.uk/</ext-link></citation></ref>
<ref id="B103">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Radford</surname> <given-names>A.</given-names></name> <name><surname>Metz</surname> <given-names>L.</given-names></name> <name><surname>Chintala</surname> <given-names>S.</given-names></name></person-group> (<year>2015</year>). <article-title>Unsupervised representation learning with deep convolutional generative adversarial networks</article-title>. <source>arXiv: 1511.06434</source>.</citation></ref>
<ref id="B104">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ringeval</surname> <given-names>F.</given-names></name> <name><surname>Sonderegger</surname> <given-names>A.</given-names></name> <name><surname>Sauer</surname> <given-names>J.</given-names></name> <name><surname>Lalanne</surname> <given-names>D.</given-names></name></person-group> (<year>2013</year>). &#x0201C;<article-title>Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions</article-title>,&#x0201D; in <source>2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x02013;<lpage>8</lpage>. <pub-id pub-id-type="doi">10.1109/FG.2013.6553805</pub-id></citation></ref>
<ref id="B105">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ruder</surname> <given-names>S.</given-names></name> <name><surname>Plank</surname> <given-names>B.</given-names></name></person-group> (<year>2017</year>). <article-title>Learning to select data for transfer learning with bayesian optimization</article-title>. <source>arXiv: 1707.05246</source>. <pub-id pub-id-type="doi">10.18653/v1/D17-1038</pub-id></citation></ref>
<ref id="B106">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Rusu</surname> <given-names>A. A.</given-names></name> <name><surname>Rabinowitz</surname> <given-names>N. C.</given-names></name> <name><surname>Desjardins</surname> <given-names>G.</given-names></name> <name><surname>Soyer</surname> <given-names>H.</given-names></name> <name><surname>Kirkpatrick</surname> <given-names>J.</given-names></name> <name><surname>Kavukcuoglu</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>Progressive neural networks</article-title>. <source>arXiv: 1606.04671</source>.</citation></ref>
<ref id="B107">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Saenko</surname> <given-names>K.</given-names></name> <name><surname>Kulis</surname> <given-names>B.</given-names></name> <name><surname>Fritz</surname> <given-names>M.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2010</year>). &#x0201C;<article-title>Adapting visual category models to new domains</article-title>,&#x0201D; in <source>European Conference on Computer Vision</source> (<publisher-loc>Heraklion</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>213</fpage>&#x02013;<lpage>226</lpage>.</citation></ref>
<ref id="B108">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sagha</surname> <given-names>H.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Gavryukova</surname> <given-names>M.</given-names></name> <name><surname>Han</surname> <given-names>J.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Cross lingual speech emotion recognition using canonical correlation analysis on principal component subspace</article-title>,&#x0201D; in <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5800</fpage>&#x02013;<lpage>5804</lpage>.</citation></ref>
<ref id="B109">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sagonas</surname> <given-names>C.</given-names></name> <name><surname>Antonakos</surname> <given-names>E.</given-names></name> <name><surname>Tzimiropoulos</surname> <given-names>G.</given-names></name> <name><surname>Zafeiriou</surname> <given-names>S.</given-names></name> <name><surname>Pantic</surname> <given-names>M.</given-names></name></person-group> (<year>2016</year>). <article-title>300 faces in-the-wild challenge: database and results</article-title>. <source>Image Vis. Comput.</source> <volume>47</volume>, <fpage>3</fpage>&#x02013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1016/j.imavis.2016.01.002</pub-id></citation></ref>
<ref id="B110">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Arsic</surname> <given-names>D.</given-names></name> <name><surname>Rigoll</surname> <given-names>G.</given-names></name> <name><surname>Wimmer</surname> <given-names>M.</given-names></name> <name><surname>Radig</surname> <given-names>B.</given-names></name></person-group> (<year>2007</year>). &#x0201C;<article-title>Audiovisual behavior modeling by combined feature spaces</article-title>,&#x0201D; in <source>2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP&#x00027;07</source>, <volume>Vol. 2</volume> (<publisher-loc>Honolulu, HI</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>II</fpage>&#x02013;<lpage>733</lpage>.</citation></ref>
<ref id="B111">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>R.</given-names></name> <name><surname>Eyben</surname> <given-names>F.</given-names></name> <name><surname>Gast</surname> <given-names>J.</given-names></name> <name><surname>H&#x000F6;rnler</surname> <given-names>B.</given-names></name> <name><surname>W&#x000F6;llmer</surname> <given-names>M.</given-names></name> <etal/></person-group>. (<year>2009a</year>). <article-title>Being bored? Recognising natural interest by extensive audiovisual integration for real-life application</article-title>. <source>Image Vis. Comput.</source> <volume>27</volume>, <fpage>1760</fpage>&#x02013;<lpage>1774</lpage>. <pub-id pub-id-type="doi">10.1016/j.imavis.2009.02.013</pub-id></citation></ref>
<ref id="B112">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Steidl</surname> <given-names>S.</given-names></name> <name><surname>Batliner</surname> <given-names>A.</given-names></name></person-group> (<year>2009b</year>). &#x0201C;<article-title>The INTERSPEECH 2009 emotion challenge</article-title>,&#x0201D; in <source>Tenth Annual Conference of the International Speech Communication Association</source> (<publisher-loc>Brighton</publisher-loc>).</citation></ref>
<ref id="B113">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Steidl</surname> <given-names>S.</given-names></name> <name><surname>Batliner</surname> <given-names>A.</given-names></name> <name><surname>Burkhardt</surname> <given-names>F.</given-names></name> <name><surname>Devillers</surname> <given-names>L.</given-names></name> <name><surname>M&#x000FC;ller</surname> <given-names>C.</given-names></name> <etal/></person-group>. (<year>2010</year>). &#x0201C;<article-title>The INTERSPEECH 2010 paralinguistic challenge</article-title>,&#x0201D; in <source>Eleventh Annual Conference of the International Speech Communication Association</source> (<publisher-loc>Makuhari</publisher-loc>).</citation></ref>
<ref id="B114">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Steidl</surname> <given-names>S.</given-names></name> <name><surname>Batliner</surname> <given-names>A.</given-names></name> <name><surname>Vinciarelli</surname> <given-names>A.</given-names></name> <name><surname>Scherer</surname> <given-names>K.</given-names></name> <name><surname>Ringeval</surname> <given-names>F.</given-names></name> <etal/></person-group>. (<year>2013</year>). &#x0201C;<article-title>The INTERSPEECH 2013 computational paralinguistics challenge: social signals, conflict, emotion, autism</article-title>,&#x0201D; in <source>Proceedings INTERSPEECH 2013, 14th Annual Conference of the International Speech Communication Association</source> (<publisher-loc>Lyon</publisher-loc>).</citation></ref>
<ref id="B115">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Siegman</surname> <given-names>A. W.</given-names></name> <name><surname>Boyle</surname> <given-names>S.</given-names></name></person-group> (<year>1993</year>). <article-title>Voices of fear and anxiety and sadness and depression: the effects of speech rate and loudness on fear and anxiety and sadness and depression</article-title>. <source>J. Abnorm. Psychol.</source> <volume>102</volume>:<fpage>430</fpage>. <pub-id pub-id-type="doi">10.1037/0021-843X.102.3.430</pub-id><pub-id pub-id-type="pmid">8408955</pub-id></citation></ref>
<ref id="B116">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <source>arXiv: 1409.1556</source>.</citation></ref>
<ref id="B117">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>P.</given-names></name></person-group> (<year>2017</year>). <article-title>Transfer linear subspace learning for cross-corpus speech emotion recognition</article-title>. <source>IEEE Trans. Affect. Comput.</source> <pub-id pub-id-type="doi">10.1109/TAFFC.2018.2800046</pub-id></citation></ref>
<ref id="B118">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Song</surname> <given-names>P.</given-names></name> <name><surname>Zheng</surname> <given-names>W.</given-names></name> <name><surname>Liu</surname> <given-names>J.</given-names></name> <name><surname>Li</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>X.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>A novel speech emotion recognition method via transfer PCA and sparse coding</article-title>,&#x0201D; in <source>Chinese Conference on Biometric Recognition</source> (<publisher-loc>Urumchi</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>393</fpage>&#x02013;<lpage>400</lpage>.</citation></ref>
<ref id="B119">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Staroniewicz</surname> <given-names>P.</given-names></name> <name><surname>Majewski</surname> <given-names>W.</given-names></name></person-group> (<year>2009</year>). &#x0201C;<article-title>Polish emotional speech database&#x02013;recording and preliminary validation</article-title>,&#x0201D; in <source>Cross-Modal Analysis of Speech, Gestures, Gaze and Facial Expressions</source> (<publisher-loc>Prague</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>42</fpage>&#x02013;<lpage>49</lpage>.</citation></ref>
<ref id="B120">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Steidl</surname> <given-names>S.</given-names></name></person-group> (<year>2009</year>). <source>Automatic Classification of Emotion Related User States in Spontaneous Children&#x00027;s Speech</source>. <publisher-loc>Erlangen</publisher-loc>: <publisher-name>University of Erlangen-Nuremberg</publisher-name>.</citation></ref>
<ref id="B121">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sugianto</surname> <given-names>N.</given-names></name> <name><surname>Tjondronegoro</surname> <given-names>D.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Cross-domain knowledge transfer for incremental deep learning in facial expression recognition</article-title>,&#x0201D; in <source>2019 7th International Conference on Robot Intelligence Technology and Applications (RiTA)</source> (<publisher-loc>Daejeon</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>205</fpage>&#x02013;<lpage>209</lpage>.</citation></ref>
<ref id="B122">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Sun</surname> <given-names>S.</given-names></name> <name><surname>Yeh</surname> <given-names>C.-F.</given-names></name> <name><surname>Hwang</surname> <given-names>M.-Y.</given-names></name> <name><surname>Ostendorf</surname> <given-names>M.</given-names></name> <name><surname>Xie</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). &#x0201C;<article-title>Domain adversarial training for accented speech recognition</article-title>,&#x0201D; in <source>2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Calgary, AB</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>4854</fpage>&#x02013;<lpage>4858</lpage>.</citation></ref>
<ref id="B123">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tommasi</surname> <given-names>T.</given-names></name> <name><surname>Lanzi</surname> <given-names>M.</given-names></name> <name><surname>Russo</surname> <given-names>P.</given-names></name> <name><surname>Caputo</surname> <given-names>B.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Learning the roots of visual domain shift</article-title>,&#x0201D; in <source>European Conference on Computer Vision</source> (<publisher-loc>Amsterdam</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>475</fpage>&#x02013;<lpage>482</lpage>.</citation></ref>
<ref id="B124">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tzeng</surname> <given-names>E.</given-names></name> <name><surname>Hoffman</surname> <given-names>J.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name> <name><surname>Saenko</surname> <given-names>K.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Simultaneous deep transfer across domains and tasks</article-title>,&#x0201D; in <source>Proceedings of the IEEE International Conference on Computer Vision</source> (<publisher-loc>Santiago</publisher-loc>), <fpage>4068</fpage>&#x02013;<lpage>4076</lpage>.</citation></ref>
<ref id="B125">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Tzeng</surname> <given-names>E.</given-names></name> <name><surname>Hoffman</surname> <given-names>J.</given-names></name> <name><surname>Zhang</surname> <given-names>N.</given-names></name> <name><surname>Saenko</surname> <given-names>K.</given-names></name> <name><surname>Darrell</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>Deep domain confusion: maximizing for domain invariance</article-title>. <source>arXiv: 1412.3474</source>.</citation></ref>
<ref id="B126">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Valstar</surname> <given-names>M.</given-names></name> <name><surname>Pantic</surname> <given-names>M.</given-names></name></person-group> (<year>2010</year>). &#x0201C;<article-title>Induced disgust, happiness and surprise: an addition to the MMI facial expression database</article-title>,&#x0201D; in <source>Proceedings of the 3rd International Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect</source> (<publisher-loc>Paris</publisher-loc>), <fpage>65</fpage>.</citation></ref>
<ref id="B127">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Vielzeuf</surname> <given-names>V.</given-names></name> <name><surname>Pateux</surname> <given-names>S.</given-names></name> <name><surname>Jurie</surname> <given-names>F.</given-names></name></person-group> (<year>2017</year>). &#x0201C;<article-title>Temporal multimodal fusion for video emotion classification in the wild</article-title>,&#x0201D; in <source>Proceedings of the 19th ACM International Conference on Multimodal Interaction</source> (<publisher-loc>Glasgow</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>569</fpage>&#x02013;<lpage>576</lpage>.</citation></ref>
<ref id="B128">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>D.</given-names></name> <name><surname>Zheng</surname> <given-names>T. F.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Transfer learning for speech and language processing</article-title>,&#x0201D; in <source>2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)</source> (<publisher-loc>Hong Kong</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>1225</fpage>&#x02013;<lpage>1237</lpage>.</citation></ref>
<ref id="B129">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>W&#x000F6;llmer</surname> <given-names>M.</given-names></name> <name><surname>Metallinou</surname> <given-names>A.</given-names></name> <name><surname>Eyben</surname> <given-names>F.</given-names></name> <name><surname>Schuller</surname> <given-names>B.</given-names></name> <name><surname>Narayanan</surname> <given-names>S.</given-names></name></person-group> (<year>2010</year>). &#x0201C;<article-title>Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional lstm modeling</article-title>,&#x0201D; in <source>Proceedings of the INTERSPEECH 2010</source> (<publisher-loc>Makuhari</publisher-loc>), <fpage>2362</fpage>&#x02013;<lpage>2365</lpage>.</citation></ref>
<ref id="B130">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>B.</given-names></name> <name><surname>Fu</surname> <given-names>Y.</given-names></name> <name><surname>Jiang</surname> <given-names>Y.-G.</given-names></name> <name><surname>Li</surname> <given-names>B.</given-names></name> <name><surname>Sigal</surname> <given-names>L.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Video emotion recognition with transferred deep feature encodings</article-title>,&#x0201D; in <source>Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>ACM</publisher-name>), <fpage>15</fpage>&#x02013;<lpage>22</lpage>.</citation></ref>
<ref id="B131">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>M.</given-names></name> <name><surname>Cheng</surname> <given-names>W.</given-names></name> <name><surname>Zhao</surname> <given-names>Q.</given-names></name> <name><surname>Ma</surname> <given-names>L.</given-names></name> <name><surname>Xu</surname> <given-names>F.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Facial expression recognition based on transfer learning from deep convolutional networks</article-title>,&#x0201D; in <source>2015 11th International Conference on Natural Computation (ICNC)</source> (<publisher-loc>Zhangjiajie</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>702</fpage>&#x02013;<lpage>708</lpage>.</citation></ref>
<ref id="B132">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>R.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Zuo</surname> <given-names>W.</given-names></name> <name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Lin</surname> <given-names>L.</given-names></name></person-group> (<year>2018</year>). &#x0201C;<article-title>Deep cocktail network: multi-source unsupervised domain adaptation with category shift</article-title>,&#x0201D; in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source> (<publisher-loc>Salt Lake City, UT</publisher-loc>), <fpage>3964</fpage>&#x02013;<lpage>3973</lpage>.</citation></ref>
<ref id="B133">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yan</surname> <given-names>J.</given-names></name> <name><surname>Zheng</surname> <given-names>W.</given-names></name> <name><surname>Cui</surname> <given-names>Z.</given-names></name> <name><surname>Tang</surname> <given-names>C.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Zong</surname> <given-names>Y.</given-names></name></person-group> (<year>2018</year>). <article-title>Multi-cue fusion for emotion recognition in the wild</article-title>. <source>Neurocomputing</source> <volume>309</volume>, <fpage>27</fpage>&#x02013;<lpage>35</lpage>. <pub-id pub-id-type="doi">10.1016/j.neucom.2018.03.068</pub-id></citation></ref>
<ref id="B134">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yang</surname> <given-names>J.</given-names></name> <name><surname>Yan</surname> <given-names>R.</given-names></name> <name><surname>Hauptmann</surname> <given-names>A. G.</given-names></name></person-group> (<year>2007</year>). &#x0201C;<article-title>Adapting SVM classifiers to data with shifted distributions</article-title>,&#x0201D; in <source>Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007)</source> (<publisher-loc>Omaha, NE</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>69</fpage>&#x02013;<lpage>76</lpage>.</citation></ref>
<ref id="B135">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Yosinski</surname> <given-names>J.</given-names></name> <name><surname>Clune</surname> <given-names>J.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Lipson</surname> <given-names>H.</given-names></name></person-group> (<year>2014</year>). &#x0201C;<article-title>How transferable are features in deep neural networks?</article-title>,&#x0201D; in <source>Advances in Neural Information Processing Systems</source> (<publisher-loc>Montreal, QC</publisher-loc>), <fpage>3320</fpage>&#x02013;<lpage>3328</lpage>.</citation></ref>
<ref id="B136">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Provost</surname> <given-names>E. M.</given-names></name> <name><surname>Essl</surname> <given-names>G.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Cross-corpus acoustic emotion recognition from singing and speaking: a multi-task learning approach</article-title>,&#x0201D; in <source>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Shanghai</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>5805</fpage>&#x02013;<lpage>5809</lpage>.</citation></ref>
<ref id="B137">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>B.</given-names></name> <name><surname>Provost</surname> <given-names>E. M.</given-names></name> <name><surname>Swedberg</surname> <given-names>R.</given-names></name> <name><surname>Essl</surname> <given-names>G.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Predicting emotion perception across domains: a study of singing and speaking</article-title>,&#x0201D; in <source>Twenty-Ninth AAAI Conference on Artificial Intelligence</source> (<publisher-loc>Austin, TX</publisher-loc>).</citation></ref>
<ref id="B138">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Huang</surname> <given-names>T.</given-names></name> <name><surname>Gao</surname> <given-names>W.</given-names></name> <name><surname>Tian</surname> <given-names>Q.</given-names></name></person-group> (<year>2017</year>). <article-title>Learning affective features with a hybrid deep model for audio&#x02013;visual emotion recognition</article-title>. <source>IEEE Trans. Circ. Syst. Video Technol.</source> <volume>28</volume>, <fpage>3030</fpage>&#x02013;<lpage>3043</lpage>. <pub-id pub-id-type="doi">10.1109/TCSVT.2017.2719043</pub-id></citation></ref>
<ref id="B139">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>W.</given-names></name> <name><surname>Wang</surname> <given-names>F.</given-names></name> <name><surname>Jiang</surname> <given-names>Y.</given-names></name> <name><surname>Xu</surname> <given-names>Z.</given-names></name> <name><surname>Wu</surname> <given-names>S.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Cross-subject EEG-based emotion recognition with deep domain confusion</article-title>,&#x0201D; in <source>International Conference on Intelligent Robotics and Applications</source> (<publisher-loc>Shenyang</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>558</fpage>&#x02013;<lpage>570</lpage>.</citation></ref>
<ref id="B140">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>X.</given-names></name> <name><surname>Zhang</surname> <given-names>L.</given-names></name> <name><surname>Wang</surname> <given-names>X.-J.</given-names></name> <name><surname>Shum</surname> <given-names>H.-Y.</given-names></name></person-group> (<year>2012</year>). <article-title>Finding celebrities in billions of web images</article-title>. <source>IEEE Trans. Multimedia</source> <volume>14</volume>, <fpage>995</fpage>&#x02013;<lpage>1007</lpage>. <pub-id pub-id-type="doi">10.1109/TMM.2012.2186121</pub-id></citation></ref>
<ref id="B141">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Ye</surname> <given-names>N.</given-names></name> <name><surname>Wang</surname> <given-names>R.</given-names></name></person-group> (<year>2019</year>). <article-title>Speech emotion recognition based on hierarchical attributes using feature nets</article-title>. <source>Int. J. Parallel Emergent Distrib. Syst.</source> <fpage>1</fpage>&#x02013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1080/17445760.2019.1626854</pub-id></citation></ref>
<ref id="B142">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhao</surname> <given-names>H.</given-names></name> <name><surname>Zhang</surname> <given-names>S.</given-names></name> <name><surname>Wu</surname> <given-names>G.</given-names></name> <name><surname>Moura</surname> <given-names>J. M. F.</given-names></name> <name><surname>Costeira</surname> <given-names>J. P.</given-names></name> <name><surname>Gordon</surname> <given-names>G. J.</given-names></name></person-group> (<year>2018</year>). &#x0201C;<article-title>Adversarial multiple source domain adaptation</article-title>,&#x0201D; in <source>Advances in Neural Information Processing Systems 31</source>, eds <person-group person-group-type="editor"><name><surname>Bengio</surname> <given-names>S.</given-names></name> <name><surname>Wallach</surname> <given-names>H.</given-names></name> <name><surname>Larochelle</surname> <given-names>H.</given-names></name> <name><surname>Grauman</surname> <given-names>K.</given-names></name> <name><surname>Cesa-Bianchi</surname> <given-names>N.</given-names></name> <name><surname>Garnett</surname> <given-names>R.</given-names></name></person-group> (<publisher-loc>Montreal, QC</publisher-loc>: <publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>8559</fpage>&#x02013;<lpage>8570</lpage>.</citation></ref>
<ref id="B143">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>W.-L.</given-names></name> <name><surname>Lu</surname> <given-names>B.-L.</given-names></name></person-group> (<year>2015</year>). <article-title>Investigating critical frequency bands and channels for EEG-based emotion recognition with deep neural networks</article-title>. <source>IEEE Trans. Auton. Ment. Dev.</source> <volume>7</volume>, <fpage>162</fpage>&#x02013;<lpage>175</lpage>.</citation></ref>
<ref id="B144">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>W.-L.</given-names></name> <name><surname>Lu</surname> <given-names>B.-L.</given-names></name></person-group> (<year>2016</year>). &#x0201C;<article-title>Personalizing EEG-based affective models with transfer learning</article-title>,&#x0201D; in <source>Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence</source> (<publisher-loc>New York, NY</publisher-loc>: <publisher-name>AAAI Press</publisher-name>), <fpage>2732</fpage>&#x02013;<lpage>2738</lpage>.</citation></ref>
<ref id="B145">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zheng</surname> <given-names>W.-L.</given-names></name> <name><surname>Zhang</surname> <given-names>Y.-Q.</given-names></name> <name><surname>Zhu</surname> <given-names>J.-Y.</given-names></name> <name><surname>Lu</surname> <given-names>B.-L.</given-names></name></person-group> (<year>2015</year>). &#x0201C;<article-title>Transfer components between subjects for EEG-based emotion recognition</article-title>,&#x0201D; in <source>2015 International Conference on Affective Computing and Intelligent Interaction (ACII)</source> (<publisher-loc>Xi&#x00027;an</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>917</fpage>&#x02013;<lpage>922</lpage>.</citation></ref>
<ref id="B146">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>H.</given-names></name> <name><surname>Chen</surname> <given-names>K.</given-names></name></person-group> (<year>2019</year>). &#x0201C;<article-title>Transferable positive/negative speech emotion recognition via class-wise adversarial domain adaptation</article-title>,&#x0201D; in <source>ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source> (<publisher-loc>Brighton</publisher-loc>: <publisher-name>IEEE</publisher-name>), <fpage>3732</fpage>&#x02013;<lpage>3736</lpage>.</citation></ref>
<ref id="B147">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zong</surname> <given-names>Y.</given-names></name> <name><surname>Zheng</surname> <given-names>W.</given-names></name> <name><surname>Zhang</surname> <given-names>T.</given-names></name> <name><surname>Huang</surname> <given-names>X.</given-names></name></person-group> (<year>2016</year>). <article-title>Cross-corpus speech emotion recognition based on domain-adaptive least-squares regression</article-title>. <source>IEEE Signal Process. Lett.</source> <volume>23</volume>, <fpage>585</fpage>&#x02013;<lpage>589</lpage>. <pub-id pub-id-type="doi">10.1109/LSP.2016.2537926</pub-id></citation></ref>
</ref-list>
<fn-group>
<fn fn-type="financial-disclosure"><p><bold>Funding.</bold> This research was funded by the Engineering Information Foundation (EiF18.02) and the Texas A&#x00026;M Program to Enhance Scholarly and Creative Activities (PESCA).</p>
</fn>
</fn-group>
</back>
</article>