Zero-shot style transfer for gesture animation driven by text and speech using adversarial disentanglement of multimodal style encoding

Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers, including those unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers. We view style as pervasive: while speaking, it colors the expressivity of communicative behaviors, while speech content is carried by multimodal signals and text. This disentanglement scheme of content and style allows us to directly infer the style embedding even of a speaker whose data are not part of the training phase, without requiring any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two input modalities, Mel spectrogram and text semantics. The second goal is to condition the source speaker's predicted gestures on the multimodal behavior style embedding of a target speaker. The third goal is to allow zero-shot style transfer of speakers unseen during training, without re-training the model. Our system consists of two main components: (1) a speaker style encoder network that learns to generate a fixed-dimensional speaker style embedding from a target speaker's multimodal data (Mel spectrogram, pose, and text), and (2) a sequence-to-sequence synthesis network that synthesizes gestures based on the content of the input modalities (text and Mel spectrogram) of a source speaker, conditioned on the speaker style embedding.
We show that our model is able to synthesize gestures of a source speaker given the two input modalities, and to transfer the knowledge of target speaker style variability learned by the speaker style encoder to the gesture generation task in a zero-shot setup, indicating that the model has learned a high-quality speaker representation. We conduct objective and subjective evaluations to validate our approach and compare it with baselines.


INTRODUCTION
Human behavior style is a socially meaningful clustering of features found within and across multiple modalities, specifically in linguistic behavior [7], spoken behavior such as the speaking style conveyed by speech prosody [29,33], and nonverbal behavior such as hand gestures and body posture [32,42]. Style involves the ways in which people talk differently in different situations. The same person may have different speaking styles depending on the situation (e.g., at home, at the office, or with friends). These situations can carry different social meanings [5]. Different persons may also have different behavior styles while communicating in similar contexts. Style is syntagmatic. It unfolds over time in the course of an interaction and during one's life course [7]. It does not emerge unaltered from the speaker. It is continuously attuned as it is accomplished and co-produced with the audience [28]. It can be very self-conscious and at the same time can be extremely routinized, to the extent that it resists attempts at being altered [28]. For instance, style-shifting has been observed in the speech of Oprah Winfrey [19], a popular host of a U.S. talk show. Internal linguistic factors such as lexical frequency, and external sociolinguistic factors, influence the phonetics of various variables in her speech [19].
Authors' addresses: Mireille Fares, ISIR, STMS, Sorbonne University, Paris, France, fares@isir.upmc.fr; Michele Grimaldi, ISIR, Sorbonne University, Paris, France, grimaldi@isir.upmc.fr; Catherine Pelachaud, CNRS, ISIR, Sorbonne University, Paris, France, catherine.pelachaud@sorbonne-universite.fr; Nicolas Obin, STMS, Sorbonne University, Paris, France, nobin@ircam.fr. arXiv:2208.01917v1 [cs.SD] 3 Aug 2022
Another study [35] shows that Ellen DeGeneres, another popular host of a US talk show, employs different speech styles in her TV show, such as formal, consultative, casual, and intimate styles. Style is specifically related to the diversity of gestures and the expressivity of each specific speaker [6,34]. All of the aforementioned points constitute a technical challenge when trying to model behavior style in virtual agents. The behavior generation model should not simply learn an overall style from multiple speakers, but should remember each speaker's specific style, their idiosyncrasy, generated in a specific lexical context and with a specific behavior expressivity. The model should be able to capture the styles that are common across speakers, the ones that are unique to a speaker's prototypical gestures produced consciously and unconsciously, as well as the different style-shifting that may occur during speech.
Verbal and non-verbal behaviors play a crucial role in sending and perceiving new information in human-human interaction [31]. Generative models that aim to predict Embodied Conversational Agent (ECA) gestures must consider the importance of producing meaningful and naturalistic gestures that are aligned with speech [9]. Non-verbal behavior must be generated and synchronized in conjunction with verbal and prosodic behavior to define its shape and time of occurrence [38]. This constitutes another technical challenge: to enable a smooth and engaging interaction between humans and ECAs by making sure that ECAs produce semantically-aware, natural, expressive, and coherent gestures aligned with speech and its content.
In the present paper, we propose a novel approach to model behavior style in virtual agents and to tackle the different style modeling challenges. Our approach aims at (1) synthesizing natural and expressive upper-body gestures of a source speaker by encoding the content of two input modalities, text semantics and Mel spectrogram; (2) conditioning the source speaker's predicted gestures on the multimodal style representation of a target speaker, thereby rendering the model able to perform style transfer across speakers; and finally (3) allowing zero-shot style transfer of newly coming speakers that were not seen by the model during training. Our model consists of two main components: first, (1) a speaker style encoder network whose goal is to model a specific target speaker's style extracted from three input modalities: Mel spectrogram, upper-body gestures, and text semantics; and second, (2) a sequence-to-sequence synthesis network that generates a sequence of upper-body gestures based on the content of two input modalities, Mel spectrogram and text semantics, of a source speaker, conditioned on the target speaker's style embedding. We trained our model on the PATS database, which was proposed in [2] and designed to study gesture generation and style transfer. It includes the three main features that we consider in our approach: text semantics represented by BERT embeddings, Mel spectrograms, and 2D upper-body poses.

Contributions
Our contributions can be listed as follows: (1) We propose the first approach for zero-shot multimodal style transfer for 2D pose synthesis. At inference, a style embedding vector can be directly inferred from the multimodal data (text, speech, and pose) of any speaker, by simple projection into the style embedding space (similar to the one used in [20]). The style transfer performed by our model allows the transfer of style from any unseen speaker, without further training or fine-tuning of our trained model. Thus it is not limited to the styles of the speakers of a given database. It also allows "style preservation" by generating gestures for multiple speakers while remembering what is unique to each speaker.
(2) To design our approach, we make the following assumptions for the separation of style and content information: style is possibly encoded across all modalities (text, speech, pose) and varies little or not at all over time; content is encoded only by the text and speech modalities and varies over time.
(3) To implement these assumptions, we propose an architecture for encoding and disentangling content and style information from multiple modalities. On one side, a content encoder is used to encode a content matrix from the text and speech signals; on the other side, a style encoder is used to encode a style vector from the text, speech, and pose modalities. A fader loss is introduced to effectively disentangle the content and style encodings [25]. The encoding of style takes into account three modalities: body poses, text semantics, and speech Mel spectrograms.
These modalities are important for gesture generation [16,23] and are linked to style.
(4) Finally, we evaluate the generated 2D gestures by converting them to 3D poses and simulating 3D animations of the generated gestures. The 3D poses are generated from incomplete upper-body 2D pose joints using MocapNET, and are simulated on a 3D virtual agent. 3D pose estimation had never before been performed from 2D poses with such a large number of missing joints in the context of virtual agent animation.
The paper is organized as follows. The next section discusses the related work. We then describe the proposed architecture. Afterwards, we describe the training regime. Then we present the objective and subjective evaluations. We finally discuss our results.

RELATED WORK
In the last few years, a large number of gesture generative models have been proposed, principally based on sequential generative parametric models such as Hidden Markov Models (HMMs), gradually moving towards deep neural networks, which have enabled spectacular advances. Hidden Markov Models were previously used to predict head motion driven by prosody [39], and body motion [26,27]. Chiu & Marsella [10] proposed an approach for predicting gesture labels from speech using conditional random fields (CRFs) and generating gesture motion based on these labels using Gaussian process latent variable models (GPLVMs). These works focus on the gesture generation task driven either by one modality, namely speech, or by two modalities, speech and text. They focus on producing naturalistic and coherent gestures that are aligned with speech and text, enabling a smoother interaction with ECAs and leveraging vocal and visual prosody. The non-verbal behavior is therefore generated in conjunction with the verbal behavior. Most of these works use a TTS system for producing the voice, which then serves as input for computing the animation of the virtual agent. LSTM networks driven by speech were recently used to predict sequences of gestures [18] and body motions [3,40]. LSTMs were additionally employed for synthesizing sequences of facial gestures driven by text and speech, namely the fundamental frequency (F0) [12,13]. Generative adversarial networks (GANs) were proposed to generate realistic head motion [37] and body motions [15]. Furthermore, transformer networks and attention mechanisms were recently used for upper-facial gesture synthesis based on multimodal data, text and speech [14]. Jonell et al. [21] propose a probabilistic approach based on normalizing flows for synthesizing facial gestures in dyadic settings. Gestures driven by both acoustic and semantic information [12,14,24] are the closest approaches to our gesture generation task; however, they cannot be used for the style transfer task.
Beyond the realistic generation of human non-verbal behavior, style modelling and control in gesture generation is receiving more attention, in order to propose more expressive behaviors that could possibly be adapted to a specific audience [1,2,4,11,16,22,30]. Neff et al. [30] propose a system that produces full-body gesture animation driven by text, in the style of a specific performer. Alexanderson et al. [4] propose a generative model for synthesizing speech-driven gesticulation; they exert directorial control over the output style, such as gesture level and speed. Karras et al. [22] propose a model for driving 3D facial animation from audio. Their main objective is to model the style of a single actor by using a deep neural network that outputs the 3D vertex positions of meshes corresponding to a specific audio. Cudeiro et al. [11] also propose a model that synthesizes 3D facial animation driven by the speech signal. Ginosar et al. [16] propose an approach for generating gestures given audio speech; however, their approach uses models trained on single speakers. The aforementioned works have focused on generating nonverbal behaviors (facial expressions, head movements, gestures in particular) aligned with speech [2,11,22,30]. They have not considered multimodal data when modeling style, nor when synthesizing gestures.
To our knowledge, the only attempts to model and transfer style from a multi-speaker database have been proposed by [2] and [1]. [2] presented Mix-StAGE, a speech-driven approach that trains a model from multiple speakers while learning a unique style embedding for each speaker. They created PATS, a dataset designed to study various styles of gestures for a large number of speakers in diverse settings. In their proposed neural architecture, a content encoder and a style encoder are used to extract content and style information from speech and pose. To disentangle style from content information, they assume that style is only encoded through the pose modality, and that content is shared across the speech and pose modalities. A style embedding matrix is learned, in which each vector represents the style associated with a specific speaker from the training set. During training, they further propose a multimodal GAN strategy to generate poses either from the speech or the pose modality. During inference, the pose is inferred by only using the speech modality and the desired style token. However, their generative model is conditioned on gesture style and driven by audio; it does not include verbal information. It cannot perform zero-shot style transfer on speakers that were not seen by their model during training. In addition, the style is associated with each unique speaker, which blurs the distinction between each speaker's specific style, their idiosyncrasy, the style that is shared among a set of speakers in similar settings (i.e., TV show hosts, journalists, etc.), and the style that is unique to each speaker's prototypical gestures produced consciously and unconsciously, in addition to the different style-shifting that may occur. Moreover, the style transfer is limited to the styles of the speakers of PATS, which prevents the transfer of style from an unseen speaker. Furthermore, the proposed architecture is based on the disentangling of content and style information under the assumption that style is only encoded by gestures. However, both text and speech also convey style information, and the encoding of style must take into account all the modalities of human behavior. To tackle those issues, [1] presented a few-shot style transfer strategy based on neural domain adaptation, accounting for the cross-modal grounding shift between the source speaker and the target style. This adaptation still requires 2 minutes of data in the style to be transferred.
To the best of our knowledge, our approach is the first to synthesize gestures of a source speaker that are semantically-aware, speech-driven, and conditioned on a multimodal representation of the style of target speakers, in a zero-shot configuration, i.e., without requiring any further training or fine-tuning. An adversarial component in the form of a fader network [25] is used for disentangling style and content from the multimodal data.

ZERO-SHOT MULTIMODAL STYLE TRANSFER MODEL (ZS-MSTM) FOR GESTURE ANIMATION DRIVEN BY TEXT AND SPEECH
ZS-MSTM, illustrated in Fig. 1, aims at mapping multimodal speech and text feature sequences into continuous upper-body gestures, conditioned on a speaker style embedding. The network operates at the word level: the inputs and output of the network consist of one feature vector for each word W of the input text sequence. The length of the word-level input features (text and audio) corresponds to 64 timesteps (as provided by PATS). The model generates a sequence of gestures corresponding to the same word-level features given as inputs. Gestures are sequences of 2D poses represented by the X and Y positions of the joints of the skeleton. The network has an embedding dimension d_model equal to 768. At inference time, the adversarial component is discarded, and the model can generate different versions of poses when fed with different style embeddings. Gesturing styles for the same input speech can be altered by simply switching the style embeddings, or by switching the multimodal data fed as input to the Style Encoder.

Content Encoder
The content encoder E_content, illustrated in Fig. 1, takes as inputs the BERT embedding B_W and the audio Mel spectrogram S_W corresponding to each word W. B_W is represented by a vector of length 768, the BERT embedding size used in PATS.
S_W is encoded using the Audio Spectrogram Transformer (AST) pre-trained base384 model [17]. AST operates as follows: the input Mel spectrogram, which has 128 frequency bins, is split into a sequence of 16x16 patches with overlap, which are then linearly projected into a sequence of 1D patch vectors, to which a positional embedding is added. We append a [CLS] token to the resulting sequence, which is then input to a Transformer Encoder. AST was originally proposed for audio classification. Since we do not intend to use it for a classification task, we remove the linear layer with sigmoid activation at the output of the Transformer Encoder. We use the Transformer Encoder's output at the [CLS] token as the Mel spectrogram representation S. The Transformer Encoder has an embedding dimension equal to d_model, 12 encoding layers, and 12 attention heads. The word-level encoded Mel spectrogram is then concatenated with the word-level BERT embedding. A self-attention mechanism is then applied to the resulting vector. The multi-head attention layer has 4 attention heads and an embedding size d_att equal to d_model + 768. The output of the attention layer is the vector h_content, a content representation of the source speaker's word-level Mel spectrogram and text embedding:

h_content = sa([S ; B_W])

where sa(.) denotes self-attention and [ ; ] denotes concatenation.
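As an illustration, the word-level fusion in the content encoder (concatenate the audio representation with the BERT vector, then apply self-attention) can be sketched as follows. This is a minimal single-head sketch with identity projections and tiny dimensions, not the paper's actual 4-head, d_att-dimensional implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(seq):
    """Single-head scaled dot-product self-attention with identity
    query/key/value projections, for brevity. seq: T feature vectors."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        w = softmax(scores)
        out.append([sum(wj * kj[i] for wj, kj in zip(w, seq))
                    for i in range(d)])
    return out

# Word-level content features: AST audio vector S concatenated with the
# BERT vector B_W (tiny illustrative dims, T = 2 words).
S = [[0.1, 0.2], [0.3, 0.1]]   # audio part per word
B = [[0.5, 0.4], [0.2, 0.7]]   # text part per word
concat = [s + b for s, b in zip(S, B)]   # [S ; B_W] per word
h_content = self_attention(concat)
```

Each output vector is a convex combination of the concatenated input vectors, one per word, which is the content representation passed on to the decoder.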

Style Encoder
As discussed previously, style is a clustering of features found within and across modalities, encompassing verbal and non-verbal behavior. It is not limited to gestural information. We consider that style is encoded in a speaker's multimodal behavior: text, speech, and pose. As illustrated in Fig. 1, the style encoder E_style takes as input, at the word level, the Mel spectrogram S_W, the BERT embedding B_W, and a sequence of (X, Y) joint positions that corresponds to a target speaker's 2D poses P_W. AST is used to encode the input audio spectrogram. Three layers of LSTMs with a hidden size equal to d_model are used to encode the vector representing the 2D poses. The last hidden state is then concatenated with the audio representation. Next, a multi-head attention mechanism is applied to the resulting vector. This attention layer has 4 attention heads and an embedding size equal to d_att. Finally, the output vector is concatenated with the 2D pose vector representation. The resulting vector h_style is the output speaker style embedding that serves to condition the network on the speaker style. The final style embedding h_style can therefore be written as:

h_style = [sa([S ; P]) ; P]

where sa(.) denotes self-attention, [ ; ] denotes concatenation, and P is the pose representation produced by the LSTM.
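The successive concatenations in the style encoder determine the dimensionality of the style embedding. A small bookkeeping sketch of those steps (the dimensions below are illustrative placeholders, not the paper's exact values):

```python
def style_encoder_dims(d_audio, d_pose):
    """Trace feature dimensions through the style encoder:
    1. AST audio embedding of the word          -> d_audio
    2. last hidden state of the 3-layer pose LSTM -> d_pose
    3. concatenate audio and pose, then apply self-attention
       (attention preserves the dimension)
    4. concatenate the attended vector with the pose representation."""
    attended = d_audio + d_pose
    return attended + d_pose

# Illustrative call: a 768-d audio embedding and a 128-d pose encoding.
d_style = style_encoder_dims(768, 128)
```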

Sequence to sequence gesture synthesis
The stylized 2D poses are generated given the sequence of content representations h_content of the source speaker's Mel spectrogram and text embeddings obtained at the word level, conditioned by the style embedding vector h_style generated from a target speaker's multimodal data. For decoding the stylized 2D poses, the sequence of h_content and the vector h_style are concatenated (by repeating the h_style vector for each word of the sequence) and passed through a dense layer of size d_att. We then give the resulting vector as input to a transformer decoder. The transformer decoder is composed of N = 1 decoding layer, with 2 attention heads and an embedding size equal to d_att. Similar to the one proposed in [41], it is composed of residual connections applied around each of the sub-layers, followed by layer normalization. Moreover, the self-attention sub-layer in the decoder stack is altered to prevent positions from attending to subsequent positions. The output predictions are offset by one position. This masking ensures that the prediction for position j depends only on the known outputs at positions less than j. As a last step, we permute the first and second dimensions of the vector generated by the transformer decoder.
The resulting vector is a sequence of 2D poses:

Ŷ = G(h_content, h_style)

where G is the transformer generator conditioned on the latent content embedding h_content and the style embedding h_style.
The generator loss of the transformer gesture synthesis can be written as the reconstruction error between the ground-truth poses Y and the generated poses:

L_gen = || Y − G(h_content, h_style) ||
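The causal masking used in the decoder is standard transformer machinery; a minimal sketch of such a mask (1 = may attend, 0 = blocked), independent of the rest of the model:

```python
def causal_mask(T):
    """Lower-triangular attention mask for a sequence of length T:
    position j may attend only to positions k <= j, so the prediction
    at j depends only on already-known outputs."""
    return [[1 if k <= j else 0 for k in range(T)] for j in range(T)]

mask = causal_mask(4)
```

The first row allows attention only to position 0, while the last row allows attention to the whole prefix.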

Adversarial Component
Our approach to disentangling style from content relies on the fader network approach [25], in which a fader loss is introduced to effectively separate the content and style encodings. The fundamental feature of our disentangling scheme is to constrain the latent space of h_content to be independent of the style embedding h_style. Concretely, this means that the distribution of the latent representations h_content should not contain the style information. A fader network is composed of: an encoder, which encodes the input information X into the latent code h_content; a decoder, which decodes the original data from the latent code; and an additional variable h_style used to condition the decoder with the desired information (a face attribute in the original paper). The objective of the fader network is to learn a latent encoding h_content of the input data that is independent of the conditioning variable h_style, while both variables remain complementary for reconstructing the original input data. To do so, a discriminator D is optimized to predict the variable h_style from the latent code h_content; on the other side, the auto-encoder is optimized using an additional adversarial loss so that the discriminator D is unable to predict the variable h_style. Contrary to the original fader network, in which the conditional variable is discrete within a finite binary set (0 or 1 for the presence or absence of an attribute), in this paper the conditional variable h_style is continuous. We therefore formulate the discriminator as a regression on the conditional variable h_style: the discriminator learns to predict the style embedding h_style from the content embedding h_content, as:

ĥ_style = D(h_content)

While optimizing the discriminator, the discriminator loss L_dis must be as low as possible:

L_dis = || h_style − D(h_content) ||

In turn, when optimizing the generator loss including the fader loss L_fader, the discriminator must not be able to correctly predict the style embedding h_style from the content embedding h_content, leading to a high discriminator error and thus a low fader loss. The adversarial loss can be written as:

L_fader = 1 − || h_style − D(h_content) ||

To be consistent, the style prediction error is first normalized into the [0, 1] range.
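A toy sketch of this fader loss as a regression-based adversarial term. The normalization constant max_err is our assumption, since the text only states that the prediction error is normalized into [0, 1]:

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def fader_loss(h_style, predicted_style, max_err):
    """Adversarial (fader) loss seen by the generator: low when the
    discriminator's style prediction from h_content is bad.
    max_err (assumed) maps the raw error into [0, 1]."""
    err = min(mse(h_style, predicted_style) / max_err, 1.0)
    return 1.0 - err

# A perfect discriminator prediction gives the generator the worst
# (highest) fader loss; a useless prediction gives the best (lowest).
good = fader_loss([1.0, 0.0], [1.0, 0.0], max_err=1.0)
bad = fader_loss([1.0, 0.0], [0.0, 1.0], max_err=1.0)
```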
Finally, the total generator loss can therefore be written as:

L_total = L_gen + λ L_fader

where λ is the adversarial weight, which starts at 0 and is linearly incremented by 0.01 after each training step. The discriminator D and the generator G are then optimized alternately as described in [25].
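The λ schedule and the total generator loss can be sketched as follows (the absence of a cap on λ is taken literally from the text above):

```python
def adversarial_weight(step, increment=0.01):
    """λ starts at 0 and grows linearly by 0.01 per training step."""
    return step * increment

def total_generator_loss(l_gen, l_fader, step):
    """L_total = L_gen + λ * L_fader."""
    return l_gen + adversarial_weight(step) * l_fader

# At step 0 the loss is pure reconstruction; the adversarial term
# is phased in gradually as training proceeds.
loss0 = total_generator_loss(2.0, 0.5, step=0)
loss5 = total_generator_loss(2.0, 0.5, step=5)
```

Phasing λ in gradually lets the generator first learn to reconstruct poses before the disentanglement pressure is applied.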

TRAINING
This section describes the training regime we follow for our model.We trained our network using the PATS dataset [2].
PATS was created to study various styles of gestures. The dataset contains upper-body 2D pose sequences aligned with the corresponding Mel spectrograms and BERT embeddings. It offers 251 hours of data, with a mean interval length of 10.7 seconds and a standard deviation of 13.5 seconds. PATS gathers data from 25 speakers with different behavior styles in various settings (e.g., lecturers, TV show hosts). It also contains several annotations. The spoken text has been transcribed in PATS and aligned with the speech. The 2D body poses have been extracted with OpenPose. Each speaker is characterized by their lexical diversity and the spatial extent of their arms. While in PATS arms and fingers have been extracted, we do not consider the finger data in our work; that is, we do not model and predict 2D finger joints. This choice arises because the extracted finger data are very noisy and not very accurate.
We consider two test conditions: Seen Speaker and Unseen Speaker. The Seen Speaker condition aims to assess the style transfer correctness that our model can achieve when presented with speakers that were seen during training as the target style. On the other hand, the Unseen Speaker condition aims to assess the performance of our model when presented with unseen target speakers, to perform zero-shot style transfer. Seen and unseen speakers are specifically selected from PATS to cover a diversity of stylistic behavior with respect to lexical diversity and spatial extent, as reported by [2]. For each PATS speaker, there is a train, validation, and test set already defined in the database. For testing the Seen Speaker condition, our training set includes the train sets of 16 PATS speakers: "Shelly", "Jon", "Fallon", "Bee", "Ellen", "Oliver", "Lec_cosmic", "Lec_hist", "Seth", "Conan", "Angelica", "Rock", "Noah", "Ytch_prof", "Lec_law", and "Ytch_dating". Six other speakers are selected for the Unseen Speaker condition, and their test sets are also used in our experiments. These six speakers, "Lec_evol", "Almaram", "Huckabee", "Ytch_charisma", "Minhaj", and "Chemistry", differ in their behavior style and lexical diversity. Each training batch contains 24 pairs of word embeddings and Mel spectrograms, with their corresponding sequences of (X, Y) joints of the upper-body skeleton. We use the Adam optimizer with β_1 = 0.95 and β_2 = 0.999. For balanced learning, we use a scheduler with an initial learning rate of 0.00001 and warmup steps = 20000. We train the network for 200 epochs. All feature values are normalized so that the dataset mean and standard deviation are 0 and 0.5, respectively.
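The normalization to mean 0 and standard deviation 0.5 can be sketched as (a minimal per-feature version; the actual preprocessing presumably operates on whole dataset tensors):

```python
def normalize(xs, target_std=0.5):
    """Shift and scale a feature so it has mean 0 and standard
    deviation target_std (0.5 in the paper's preprocessing)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = var ** 0.5 or 1.0   # guard against constant features
    return [(x - mean) / std * target_std for x in xs]

feats = normalize([1.0, 2.0, 3.0, 4.0])
```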

3D POSE GENERATION AND SIMULATION
Previous evaluation studies of models learned from video data have used 2D stick figures for their subjective evaluation [2]. Even when the 2D stick figure is projected onto the video of a human speaker, the animation is not always readable, in particular because it is missing information on the body pose in the Z direction (the depth axis). We therefore choose to convert the 2D poses into 3D poses, and visualize the behavior animation resulting from our model on a 3D virtual agent. As in [2], we train our model on the PATS database, and therefore the generated 2D body poses correspond to incomplete skeleton joints; the missing joints include the lower-body joints as well as the torso joints. To visualize the resulting animations of our model, we convert the 2D poses into 3D poses and use a 3D human mesh.
We develop an approach that generates 3D poses from incomplete upper-body 2D pose joints using MocapNET [36], an ensemble of SNN encoders that estimates the 3D human body pose from 2D joint estimations extracted from monocular RGB images. It outputs skeletal information directly in the BVH format, which can be rendered in real time or imported without any additional processing into most popular 3D animation software. MocapNET operates on 2D joint input, received in the popular COCO or BODY25 format [8]. To be usable, the files containing the predictions are formatted following the BODY25 format, and the 2D joints are mapped to the BODY25 joints. The JSON files with 2D detections are subsequently converted to CSV files and then to 3D BVH files using MocapNET. Finally, we add zeros for the missing joints. MocapNET is trained using a 1920x1080 "virtual camera" to emulate a GoPro Hero 4 running in Full-HD mode. We adapted the output of our gesture generation model to this configuration. We also set the frame resolution to correspond to the original video stream size. Once the BVH file is created, we use the 3D animation software Blender to simulate the animation. Finally, we apply a 3D human mesh to the skeleton to simulate a 3D human animation. The mesh is taken from Mixamo, an online database of characters and mocap animations used in art projects, movies, and games. In order to fuse the mesh with the skeleton, we scale the mesh to fit the skeleton and we parent the skeleton and the mesh with automatic weights.
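A sketch of the zero-filling step that maps the predicted upper-body joints into a full 25-joint BODY25 skeleton. The joint names and the (x, y, confidence) row layout are illustrative assumptions, not MocapNET's exact input specification:

```python
# Illustrative upper-body subset of the BODY25 joint ordering
# (index -> joint name); the model predicts only such joints.
UPPER_BODY = {0: "Nose", 1: "Neck", 2: "RShoulder", 3: "RElbow",
              4: "RWrist", 5: "LShoulder", 6: "LElbow", 7: "LWrist"}

def to_body25(pred):
    """pred: {joint_index: (x, y)} for the predicted upper-body joints.
    Returns a 25-entry list of (x, y, confidence) rows in BODY25 order,
    zero-filled for the joints the model does not predict."""
    rows = []
    for j in range(25):
        if j in pred:
            x, y = pred[j]
            rows.append((x, y, 1.0))
        else:
            rows.append((0.0, 0.0, 0.0))  # missing (e.g., lower-body) joint
    return rows

# Illustrative call with two predicted joints.
skeleton = to_body25({1: (0.5, 0.2), 2: (0.4, 0.25)})
```

The resulting rows can then be serialized to the CSV layout that the MocapNET conversion step consumes.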

EVALUATION METRICS AND STUDIES
To measure the performance of our approach, we conducted several objective and subjective evaluation studies, which we present in this section. We start by introducing the metrics we use for the objective studies; we then explain the protocol we follow for the perceptive studies, as well as the creation of the stimuli we use.

Fig. 1. ZS-MSTM (Zero-Shot Multimodal Style Transfer Model) architecture. The content encoder E_content is used to encode the content embedding h_content from the BERT text embeddings B_W and the speech Mel spectrograms S_W, using a speech encoder. The style encoder E_style is used to encode the style embedding h_style from the multimodal text B_W, speech S_W, and pose P_W, using a speech encoder and a pose encoder. The generator G is a transformer network that generates the sequence of poses Ŷ from the sequence of content embeddings h_content and the style embedding vector h_style. The adversarial module, relying on the discriminator D, is used to disentangle the content and style embeddings h_content and h_style.


Fig. 3. A sequence of gestures corresponding to a sequence of 2D poses. (a) 2D poses. (b) The corresponding sequence of 3D poses computed by MocapNET and simulated with Blender. (c) The resulting animation with a 3D human mesh.