<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="review-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Sig. Proc.</journal-id>
<journal-title>Frontiers in Signal Processing</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Sig. Proc.</abbrev-journal-title>
<issn pub-type="epub">2673-8198</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1230755</article-id>
<article-id pub-id-type="doi">10.3389/frsip.2023.1230755</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Signal Processing</subject>
<subj-group>
<subject>Mini Review</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Multilingual video dubbing&#x2014;a technology review and current challenges</article-title>
<alt-title alt-title-type="left-running-head">Bigioi and Corcoran</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frsip.2023.1230755">10.3389/frsip.2023.1230755</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Bigioi</surname>
<given-names>Dan</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2385737/overview"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Corcoran</surname>
<given-names>Peter</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2246620/overview"/>
</contrib>
</contrib-group>
<aff>
<institution>School of Engineering</institution>, <institution>University of Galway</institution>, <addr-line>Galway</addr-line>, <country>Ireland</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1932062/overview">Feng Yang</ext-link>, Google, United States</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2297372/overview">Xinwei Yao</ext-link>, Google, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2339293/overview">Keren Ye</ext-link>, Google, United States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2405950/overview">Seung Hyun Lee</ext-link>, Korea University, Republic of Korea</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/2409949/overview">Xi Chen</ext-link>, Rutgers, The State University of New Jersey, United States</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Dan Bigioi, <email>d.bigioi1@universityofgalway.ie</email>; Peter Corcoran, <email>peter.corcoran@universityofgalway.ie</email>
</corresp>
</author-notes>
<pub-date pub-type="epub">
<day>25</day>
<month>09</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>3</volume>
<elocation-id>1230755</elocation-id>
<history>
<date date-type="received">
<day>29</day>
<month>05</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>11</day>
<month>09</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Bigioi and Corcoran.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Bigioi and Corcoran</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>The proliferation of multi-lingual content on today&#x2019;s streaming services has created a need for automated multi-lingual dubbing tools. In this article, current state-of-the-art approaches are discussed with reference to recent works in automatic dubbing and the closely related field of talking head generation. A taxonomy of papers within both fields is presented, and the main challenges of both speech-driven automatic dubbing, and talking head generation are discussed and outlined, together with proposals for future research to tackle these issues.</p>
</abstract>
<kwd-group>
<kwd>talking head generation</kwd>
<kwd>dubbing</kwd>
<kwd>deep fakes</kwd>
<kwd>deep learning</kwd>
<kwd>artificial intelligence</kwd>
<kwd>video synthesis</kwd>
<kwd>audio video synchronisation</kwd>
</kwd-group>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-at-acceptance</meta-name>
<meta-value>Image Processing</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction and background</title>
<p>The problem of video dubbing is not a recent challenge. Looking back through the literature, <xref ref-type="bibr" rid="B9">Cao et al. (2005)</xref> discuss the complexity of mimicking facial muscle movements and note that data-driven methods had yielded some of the most promising results at that time, almost two decades ago. More recently, <xref ref-type="bibr" rid="B48">Mariooryad and Busso (2012)</xref> synthesized facial animations based on the MPEG-4 facial animation standard, using the audiovisual IEMOCAP database (<xref ref-type="bibr" rid="B7">Busso et al., 2008</xref>). While the Facial Animation Parameters (FAPs) defined in MPEG-4 are useful, such model-based approaches are no longer considered state of the art (SotA) for photo-realistic speech dubbing or facial animation. Nevertheless, these earlier works attest to long-standing research on speech-driven facial re-enactment in the literature.</p>
<p>Today there have been many new advances in facial rendering and in acoustic and speech models. The requirements for video dubbing are mainly driven by the evolution of the video streaming industry (<xref ref-type="bibr" rid="B33">Hayes and Bolanos-Garcia-Escribano, 2022</xref>), which will be the focus of this review. The rapid growth of streaming services and the resulting competition have led to a proliferation of new content, with significant growth in non-English language material and a global expansion to new non-English speaking audiences and markets. Much of the success of the leading content streaming services lies in delivering improved quality of content to these new markets, creating a need for more sophisticated and semi-automated subtitle and dubbing services.</p>
<p>Subtitle services are well-developed and provide a useful bridge to the growing libraries of video content for non-English audiences. The leading services have also begun to release new content with dubbing in multiple languages, and to annotate and dub legacy content as well (<xref ref-type="bibr" rid="B60">Roxborough, 2019</xref>; <xref ref-type="bibr" rid="B53">Nilesh and Deck, 2023</xref>). Auto-translation algorithms can help here, but typically human input is also needed to refine the quality of the resulting translations.</p>
<p>When content is professionally dubbed, a voice actor works carefully to align the translated text with the original actor&#x2019;s facial movements and expressions. This is a challenging and skilled task, and multi-lingual voice actors are difficult to find, so often only the lead actors in a movie are professionally overdubbed. This creates an &#x201c;uncanny valley&#x201d; effect for most overdubbed content that detracts from the viewing experience, and it is often preferable to view content in the original language with subtitles. Thus the overdubbing of digital content remains a significant challenge for the video streaming industry (<xref ref-type="bibr" rid="B66">Spiteri Miggiani, 2021</xref>).</p>
<p>For the best quality of experience in viewing multi-lingual content, it is desirable not only to overdub the speech track for a character, but also to adjust their facial expressions, particularly the lip and jaw movements, to match the speech dubbing. This requires a subtle adjustment of the original video content for each available language track, ensuring that while the lip and jaw movements change in response to the new language track, the overall performance of the original language actor is not diminished in any way. But achieving this seamless audio-driven automatic dubbing is a non-trivial task, and many approaches have been proposed over the last half-decade to tackle it. Deep learning techniques in particular have proven popular in this domain (<xref ref-type="bibr" rid="B81">Yang et al., 2020</xref>; <xref ref-type="bibr" rid="B72">Vougioukas et al., 2020</xref>; <xref ref-type="bibr" rid="B70">Thies et al., 2020</xref>; <xref ref-type="bibr" rid="B65">Song et al., 2018</xref>; <xref ref-type="bibr" rid="B78">Wen et al., 2020</xref>), demonstrating compelling results on the tasks of automatic dubbing and the less constrained, better-known task of &#x201c;talking head generation.&#x201d;</p>
<p>In this article, current state-of-the-art approaches are discussed with reference to the most recent and relevant works in automatic dubbing and the closely related field of talking head generation. A taxonomy of papers within both fields is presented, and the current SotA for both audio-driven automatic dubbing and talking head generation is discussed and outlined. Recent approaches can be broadly classified as falling within two main schools of thought: end-to-end or structural-based generation (<xref ref-type="bibr" rid="B44">Liang et al., 2022</xref>). It is clear from this review that much of the foundation technology is now available to tackle photo-realistic multilingual dubbing, but there are still remaining challenges, which we seek to define and clarify in our concluding discussion.</p>
</sec>
<sec id="s2">
<title>2 The high-level dubbing pipeline</title>
<p>Traditionally, dubbing is a costly post-production affair that consists of three primary steps:<list list-type="simple">
<list-item>
<p>&#x2022; Translation: This is the process of taking the script of the original video and translating it into the desired language(s). Traditionally, this is done by hiring multiple language experts who are fluent in both the original and target languages. With the emergence of large language models in recent years, however, accurate automatic language-to-language translation is becoming a reality (<xref ref-type="bibr" rid="B23">Duquenne et al., 2023</xref>), and it was adopted into industry use as early as 2020 by the likes of Netflix (<xref ref-type="bibr" rid="B2">Alarcon, 2023</xref>). That being said, the models are not perfect and are susceptible to mistranslations; to ensure quality, an expert is therefore still required to review the translated script.</p>
</list-item>
<list-item>
<p>&#x2022; Voice Acting: Once the scripts have been translated, the next step is to identify and hire suitable voice actors for each of the desired languages. For a high-quality dub, care must be taken to ensure that the voice actors can accurately portray the range of emotions of the original recording, and that their voices suitably match the on-screen character. This is a costly and time-consuming endeavour, and it would benefit immensely from automation. Despite incredible advances in text-to-speech and voice-cloning technologies in recent years, much work remains before the skill of a professional voice actor can truly be replicated (<xref ref-type="bibr" rid="B77">Weitzman, 2023</xref>). However, for projects where quality is less critical, text-to-speech is an attractive option due to its reduced cost.</p>
</list-item>
<list-item>
<p>&#x2022; Audio-Visual Mixing: Once the new language voice recordings are obtained, the final step is to combine them with the original video recording in as seamless a manner as possible. Traditionally this involves extensive manual editing work to properly align and synchronise the new audio with the original video performance. Even the most skilled editors, however, cannot truly synchronise these two streams. High-quality dubbing work is enjoyable to watch, yet it is often still noticeable that the content is dubbed. Poor-quality dubbing work detracts from the user experience, oftentimes inducing the &#x201c;uncanny valley&#x201d; effect in viewers.</p>
</list-item>
</list>
</p>
<p>Due to recent advancements in deep learning, there is scope for automation in each of the traditional dubbing steps. Manual language translation can be carried out automatically by large language models such as that of <xref ref-type="bibr" rid="B23">Duquenne et al. (2023)</xref>. Traditional voice acting can be replaced by powerful text-to-speech models such as <xref ref-type="bibr" rid="B42">&#x141;a&#x144;cucki (2021)</xref>; <xref ref-type="bibr" rid="B45">Liu et al. (2023)</xref>; <xref ref-type="bibr" rid="B75">Wang et al. (2017)</xref>. Audio-visual mixing can then be carried out by talking head generation/video editing models such as <xref ref-type="bibr" rid="B90">Zhou et al. (2020)</xref>. Given the original video and language streams, the following is an example of what such an automatic dubbing pipeline might look like for dubbing an English language video into German:<list list-type="simple">
<list-item>
<p>&#x2022; Transcribing and Translating Source Audio: Using an off-the-shelf automatic speech recognition model, an accurate transcript can be produced from the speech audio. The English transcript can then be translated into German using a large language model such as BERT or GPT-3 fine-tuned on the language-to-language translation task.</p>
</list-item>
<list-item>
<p>&#x2022; Synthesizing Audio: Synthetic speech can be produced by leveraging a text-to-speech model, taking the translated transcript as input and outputting realistic speech. Ideally the model would be fine-tuned on the original actor&#x2019;s voice, producing high-quality speech that sounds just like the original actor, but in a different language.</p>
</list-item>
<list-item>
<p>&#x2022; 3D Character Face Extraction: From the video stream, detect and isolate the target character. Map the target character&#x2019;s face onto a 3D morphable model using monocular 3D reconstruction, and isolate the head pose/global head movement, obtaining a static 3D face. Remove the original lip/jaw movements, but retain the overall facial expressions and eye blinks on the character model.</p>
</list-item>
<list-item>
<p>&#x2022; Facial Animation Generation: Generate, via a recurrent neural network, the expression parameters corresponding to the lip and jaw movements of the 3D face model in response to the driving synthetic German speech signal. Reintroduce the global head movement information to the 3D model to obtain a 3D head whose facial expressions and head pose correspond to the original performance, but with the lip and jaw movements modified in response to the new audio.</p>
</list-item>
<list-item>
<p>&#x2022; Rendering: Mask out the facial region of the character in the original video, insert the newly generated 3D face model on top, and utilise an image-to-image translation network to generate the final photorealistic output frames.</p>
</list-item>
</list>
</p>
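<p>Assuming suitably trained models for each stage, the pipeline above reduces to a simple orchestration problem. The following Python sketch shows how the stages compose; every function name here is a hypothetical placeholder stub (not from any specific library), and a real system would wrap trained ASR, translation, TTS, 3D reconstruction, animation, and rendering models behind each one.</p>

```python
# Illustrative sketch of the automatic dubbing pipeline described above.
# Every function is a placeholder stub standing in for a trained model.

def transcribe(audio):                       # ASR: speech -> source text
    return "hello world"

def translate(text, target_lang="de"):       # LLM-based translation
    return "hallo welt"

def synthesize_speech(text, voice="actor"):  # TTS in the target language
    return [0.0] * 16000                     # 1 s of dummy waveform @ 16 kHz

def extract_3d_face(video):                  # monocular 3D reconstruction
    return {"pose": [0, 0, 0], "expression": [0.0] * 64}

def animate_face(face, audio):               # audio-driven lip/jaw parameters
    face = dict(face)
    face["expression"] = [0.1] * 64          # placeholder expression values
    return face

def render(video, face, audio):              # neural rendering / compositing
    # one video frame per 640 audio samples, i.e. 25 fps at 16 kHz
    return {"num_frames": len(audio) // 640, "face": face}

def dub(video, audio, target_lang="de"):
    text = transcribe(audio)
    translated = translate(text, target_lang)
    new_audio = synthesize_speech(translated)
    face = extract_3d_face(video)
    face = animate_face(face, new_audio)
    return render(video, face, new_audio), new_audio
```

<p>Calling <monospace>dub("video.mp4", source_audio)</monospace> then yields the re-rendered frames together with the synthetic German track; the value of the sketch is only the data flow between stages, not any of the stub internals.</p>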
<p>The hypothetical pipeline described above is known as a structural-based approach, and is depicted in <xref ref-type="fig" rid="F1">Figure 1</xref>. The next section goes into more detail on popular structural-based approaches, as well as end-to-end methods for talking head generation and audio-driven automatic dubbing/video editing.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>A high-level diagram depicting the automatic dubbing process described in this section. 3D model image is taken from the work of <xref ref-type="bibr" rid="B19">Cudeiro et al. (2019)</xref>, while the subject displayed is part of the <xref ref-type="bibr" rid="B18">Cooke et al. (2006)</xref> dataset.</p>
</caption>
<graphic xlink:href="frsip-03-1230755-g001.tif"/>
</fig>
<p>The scope of this article is limited to state-of-the-art works tackling facial animation generation; namely, we explore recent trends in talking head generation and audio-driven automatic dubbing/video editing. The rest of the paper is organised as follows: <xref ref-type="sec" rid="s3">Section 3</xref> provides a detailed discussion of methods that tackle talking head generation and automatic dubbing, classifying them as either end-to-end or structural-based methods and discussing their merits and pitfalls. <xref ref-type="sec" rid="s4">Section 4</xref> provides details of popular datasets used to train models for these tasks, as well as a list of common evaluation metrics used to quantify the performance of such models. <xref ref-type="sec" rid="s5">Section 5</xref> discusses open challenges within the field and how researchers have been tackling them, before the paper concludes in <xref ref-type="sec" rid="s6">Section 6</xref>.</p>
</sec>
<sec id="s3">
<title>3 Taxonomy of talking head generation and automatic dubbing</title>
<p>Talking head generation can be defined as the creation of a new video from a single source image or handful of frames, and a driving speech audio input. There are many challenges associated with this (<xref ref-type="bibr" rid="B10">Chen et al., 2020a</xref>). Not only must the generated lip and jaw movements be correctly synchronised to the speech input, but the overall head movement must also be realistic, eye blinking consistent with the speaker should be present, and the expressions on the face should match the tone and content of the speech. While many talking head approaches have been proposed in recent years, each addressing some or all of the aforementioned issues to varying degrees, there is plenty of scope for researchers to further the field, as this article will demonstrate.</p>
<p>As touched upon earlier, the task of audio driven automatic dubbing is a constrained version of the talking head generation problem. Instead of creating an entire video from scratch, the goal is to alter an existing video, resynchronizing the lip and jaw movements of the target actor in response to a new input audio signal. Unlike talking head generation, factors such as head motion, eye blinks, and facial expressions are already present in the original video. The challenge lies in seamlessly altering the lip and jaw content of the video, while keeping the performance of the actor as close to the original as possible, so as to not detract from it.</p>
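<p>This editing constraint, modify only the lip and jaw region while leaving the rest of the performance untouched, is commonly enforced via masked compositing of generated pixels into the original frame. The numpy sketch below illustrates the idea; the rectangular mask and hard blending are illustrative assumptions, not the exact procedure of any published method, which would typically use soft, landmark-derived masks.</p>

```python
import numpy as np

def composite_mouth(original, generated, mask):
    """Blend a generated lower-face region into the original frame.

    original, generated: H x W x 3 float arrays in [0, 1]
    mask: H x W float array in [0, 1], 1.0 inside the lip/jaw region
    """
    m = mask[..., None]                      # broadcast over colour channels
    return m * generated + (1.0 - m) * original

H, W = 64, 64
original = np.zeros((H, W, 3))               # dummy source frame
generated = np.ones((H, W, 3))               # dummy synthesised frame
mask = np.zeros((H, W))
mask[40:, 16:48] = 1.0                       # crude rectangular mouth region

out = composite_mouth(original, generated, mask)
```

<p>Outside the mask the original performance (head pose, eyes, expressions) passes through untouched; only the masked region takes on the newly synthesised content.</p>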
<sec id="s3-1">
<title>3.1 End-to-end vs. structural-based generation</title>
<p>At a high level, existing deep learning approaches to both tasks can be broken down into two main methods: end-to-end or structural-based generation. Each method has its own set of advantages and disadvantages, which we now review.</p>
<sec id="s3-1-1">
<title>3.1.1 Pipeline complexity and model latency</title>
<p>End-to-end approaches offer the advantage of a simpler pipeline, enabling faster processing and reduced latency in generating the final output. With fewer components and streamlined computations, real-time synthesis becomes achievable. However, the actual performance relies on crucial factors like the chosen architecture, model size, and output frame size. For example, GAN-based end-to-end methods can achieve real-time results, but they are often limited to lower output resolutions, such as 128 &#xd7; 128 or 256 &#xd7; 256. Diffusion-based approaches are slower still, often taking seconds or even minutes per frame; more efficient sampling methods can reduce this, albeit at the cost of image quality. Striking the right balance between speed and output resolution is essential in optimizing end-to-end talking head synthesis. It is important to highlight that these same limitations are also present for structural-based methods, particularly within their rendering process. However, structural-based methods tend to be even slower than end-to-end approaches due to the additional computational steps involved in their pipeline. Structural-based methods often require multiple stages, such as face detection, facial landmark/3D model extraction, expression synthesis, photorealistic rendering and so on. Each of these stages introduces computational overhead, making the overall process more time-consuming.</p>
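<p>The latency gap can be made concrete with a back-of-envelope calculation. The timings below are illustrative assumptions rather than measurements: real-time playback at 25 fps leaves a 40 ms budget per frame, which a single GAN forward pass can meet, while a diffusion model paying a similar cost on each of its sampling steps exceeds that budget by more than an order of magnitude.</p>

```python
# Illustrative per-frame latency budget (all timings are assumptions).
FPS = 25
budget_ms = 1000 / FPS                  # 40 ms per frame for real time

gan_forward_ms = 15                     # assumed single generator pass
diffusion_step_ms = 15                  # assumed cost per sampling step
diffusion_steps = 50                    # a typical reduced-step sampler

gan_frame_ms = gan_forward_ms           # one pass per frame
diffusion_frame_ms = diffusion_step_ms * diffusion_steps

realtime_gan = gan_frame_ms <= budget_ms              # fits the budget
overshoot = diffusion_frame_ms / budget_ms            # ~18.75x over budget
```

<p>Reducing the step count or distilling the sampler shrinks <monospace>overshoot</monospace>, but as noted above, usually at some cost in image quality.</p>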
</sec>
<sec id="s3-1-2">
<title>3.1.2 Cascading errors</title>
<p>In structural-based methods, errors made in earlier stages of the pipeline can propagate and amplify throughout the process. For example, inaccuracies in face or landmark detection can significantly impact the quality of the final generated video. End-to-end approaches, on the other hand, bypass the need for such intermediate representations, reducing the risk of cascading errors. At the same time, however, when errors do occur in end-to-end approaches, it can be harder to identify the source of the error, as such methods do not explicitly produce intermediate facial representations. This lack of transparency in the generation process can make it challenging for researchers to diagnose and troubleshoot issues when the output is not as expected. It becomes essential to develop techniques for error analysis and debugging to improve the reliability and robustness of end-to-end systems.</p>
</sec>
<sec id="s3-1-3">
<title>3.1.3 Robustness to different data</title>
<p>Structural-based methods rely on carefully curated and annotated datasets for each stage of the pipeline, which can be time-consuming and labor-intensive to create. End-to-end approaches are often more adaptable and generalize better to various speaking styles, accents, and emotional expressions, as they can leverage large and diverse datasets for training. This flexibility is crucial in capturing the nuances and variations present in natural human speech and facial expressions.</p>
</sec>
<sec id="s3-1-4">
<title>3.1.4 Output quality</title>
<p>The quality of output is a critical aspect in talking head synthesis, as it directly impacts the realism and plausibility of the generated videos. Structural-based methods excel in this regard due to their ability to exert more fine-grained control over the intermediate representations of the face during the synthesis process. With such methods, the face is typically represented using a set of keypoints (or 3D model parameters), capturing essential facial features and expressions. These landmarks serve as a structured guide for the generation of facial movements, ensuring that the resulting video adheres to the anatomical constraints of a human face. By explicitly controlling these keypoints, the model can produce more accurate and realistic facial expressions that are consistent with human facial anatomy.</p>
<p>End-to-end approaches sacrifice some level of fine-grained control in favor of simplicity and direct audio-to-video mapping. While they offer the advantage of faster processing and reduced latency, they may struggle to capture the intricate details and nuances present in facial expressions, especially in more challenging or uncommon scenarios.</p>
</sec>
<sec id="s3-1-5">
<title>3.1.5 Training data requirements</title>
<p>End-to-end approaches typically require a large amount of training data to generalize well across various situations. While structural-based methods can benefit from targeted, carefully annotated datasets for specific tasks, end-to-end methods may need a more diverse and extensive dataset to achieve comparable performance. This, in turn, means longer training times as the model needs to process and learn from a vast amount of data, which can be computationally intensive and time-consuming. This can be a significant drawback for researchers and practitioners, as it hinders the rapid experimentation and development of new models. It may also require access to powerful hardware, such as high-performance GPUs or TPUs, to accelerate the training process.</p>
</sec>
<sec id="s3-1-6">
<title>3.1.6 Explicit output guidance</title>
<p>Structural-based methods allow researchers to incorporate explicit rules and constraints into different stages of the pipeline. This explicit guidance can lead to more accurate and controllable results, which can be lacking in end-to-end approaches where such guidance is more difficult to implement.</p>
</sec>
</sec>
<sec id="s3-2">
<title>3.2 Structural-based generation</title>
<p>Structural-based deep learning approaches have been immensely popular in recent years, and are considered the dominant approach for both talking head generation and audio-driven automatic dubbing. As mentioned above, this is due to the relative ease with which one can exert control over the final output video, the high fidelity of the output image frames, and the relative speed with which animations can be driven for 3D character models.</p>
<p>Instead of training a single neural network to generate the desired video given an audio signal, the problem is typically broken up into two main steps: 1) training a neural network to drive, from audio, the facial motion of an underlying structural representation of the face, where the structural representation is typically either a 3D morphable model or a 2D/3D keypoint representation of the face; and 2) rendering photorealistic video frames from the structural model of the face using a second neural rendering model. Please see <xref ref-type="table" rid="T1">Table 1</xref> for a summary of relevant structural-based approaches in the literature.</p>
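<p>The two-step split can be expressed as a pair of interfaces: an audio-to-motion stage and a rendering stage. The numpy sketch below stands in a fixed random linear projection for the trained animation network and a trivial point rasteriser for the neural renderer, purely to show the data flow (T audio frames in, T landmark sets out, T images out). The shapes used (80-bin mel frames, 68 2D landmarks, 128 &#xd7; 128 frames) are common conventions assumed for illustration, not the design of any specific paper.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: audio-to-motion network (stand-in: a fixed linear projection).
# A real system would use e.g. a trained LSTM over paired audio/landmarks.
W_motion = rng.normal(size=(80, 68 * 2)) * 0.01

def animate(mel_frames):
    """Map (T, 80) mel-spectrogram frames to (T, 68, 2) landmarks in [0, 1]."""
    out = mel_frames @ W_motion
    return 1.0 / (1.0 + np.exp(-out.reshape(-1, 68, 2)))   # squash to [0, 1]

# Stage 2: neural renderer (stand-in: rasterise landmarks into blank frames).
def render_frames(landmarks, size=128):
    frames = np.zeros((len(landmarks), size, size))
    for t, pts in enumerate(landmarks):
        xy = np.clip((pts * size).astype(int), 0, size - 1)
        frames[t, xy[:, 1], xy[:, 0]] = 1.0                 # mark each point
    return frames

mel = np.abs(rng.normal(size=(25, 80)))    # 1 s of dummy audio at 25 fps
lmks = animate(mel)                        # (25, 68, 2)
video = render_frames(lmks)                # (25, 128, 128)
```

<p>The key property of the split is visible in the types: the intermediate landmark tensor is a compact, interpretable bottleneck that a researcher can inspect, constrain, or edit before it ever reaches the renderer.</p>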
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Table summarising some of the most relevant structural-based approaches in the literature.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Method</th>
<th align="center">Animation network architecture</th>
<th align="center">Audio input</th>
<th align="center">Intermediate representation</th>
<th align="center">Additional inputs</th>
<th align="center">Head motion</th>
<th align="center">Rendering network architecture</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<xref ref-type="bibr" rid="B68">Suwajanakorn et al. (2017)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">MFCC</td>
<td align="center">PCA mouth coefficients</td>
<td align="center">None</td>
<td align="center">No</td>
<td align="center">AAM-based rendering</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B69">Taylor et al. (2017)</xref>
</td>
<td align="center">Feed forward</td>
<td align="center">Phoneme transcript</td>
<td align="center">Face model animation parameters</td>
<td align="center">None</td>
<td align="center">No</td>
<td align="center">Video compositing approach</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B27">Eskimez et al. (2018)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">Mel spectrograms</td>
<td align="center">2D landmarks</td>
<td align="center">None</td>
<td align="center">No</td>
<td align="center">Not applicable</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B13">Chen et al. (2019)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">MFCC</td>
<td align="center">2D landmarks</td>
<td align="center">None</td>
<td align="center">No</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B20">Das et al. (2020)</xref>
</td>
<td align="center">GAN</td>
<td align="center">Deep speech features</td>
<td align="center">2D landmarks</td>
<td align="center">None</td>
<td align="center">No</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B90">Zhou et al. (2020)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">Learned speech embeddings</td>
<td align="center">2D landmarks</td>
<td align="center">None</td>
<td align="center">Yes</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B46">Lu et al. (2021)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">Learned speech embeddings</td>
<td align="center">2D Landmarks</td>
<td align="center">None</td>
<td align="center">Yes</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B74">Wang et al. (2021)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">MFCC &#x2b; FBANK features</td>
<td align="center">Keypoints&#x2014;dense motion field</td>
<td align="center">None</td>
<td align="center">Yes</td>
<td align="center">CNN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B37">Ji et al. (2021)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">Learned speech embeddings</td>
<td align="center">2D landmarks &#x2b; 3D face model</td>
<td align="center">Driving video</td>
<td align="center">From video</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B6">Bigioi et al. (2022)</xref>
</td>
<td align="center">Recurrent LSTM</td>
<td align="center">Mel spectrogram</td>
<td align="center">2D landmarks</td>
<td align="center">None</td>
<td align="center">Yes</td>
<td align="center">Not applicable</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B38">Karras et al. (2017)</xref>
</td>
<td align="center">CNN</td>
<td align="center">Autocorrelation features</td>
<td align="center">3D vertex positions of face mesh</td>
<td align="center">Emotional State</td>
<td align="center">No</td>
<td align="center">Not applicable</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B19">Cudeiro et al. (2019)</xref>
</td>
<td align="center">CNN Encoder-Decoder</td>
<td align="center">DeepSpeech features</td>
<td align="center">Flame face model</td>
<td align="center">None</td>
<td align="center">No</td>
<td align="center">Not applicable</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B70">Thies et al. (2020)</xref>
</td>
<td align="center">CNN</td>
<td align="center">DeepSpeech features</td>
<td align="center">3D expression parameters</td>
<td align="center">None</td>
<td align="center">No</td>
<td align="center">CNN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B11">Chen et al. (2020b)</xref>
</td>
<td align="center">CNN</td>
<td align="center">Raw Audio</td>
<td align="center">3D keypoints</td>
<td align="center">Reference frames</td>
<td align="center">Yes</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B83">Yi et al. (2020)</xref>
</td>
<td align="center">LSTM</td>
<td align="center">MFCC</td>
<td align="center">3D expression parameters</td>
<td align="center">Driving video</td>
<td align="center">Yes</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B79">Wu et al. (2021)</xref>
</td>
<td align="center">Encoder-Decoder &#x2b; Unet</td>
<td align="center">DeepSpeech features</td>
<td align="center">3D expression parameters</td>
<td align="center">Driving video</td>
<td align="center">Yes</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B87">Zhang et al. (2021b)</xref>
</td>
<td align="center">GAN</td>
<td align="center">Learned speech embeddings</td>
<td align="center">3D expression parameters</td>
<td align="center">Reference image</td>
<td align="center">Yes</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B84">Zhang et al. (2021a)</xref>
</td>
<td align="center">GAN</td>
<td align="center">DeepSpeech features</td>
<td align="center">3D expression parameters</td>
<td align="center">Driving video</td>
<td align="center">Yes</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B64">Song et al. (2022)</xref>
</td>
<td align="center">LSTM &#x2b; Unet</td>
<td align="center">MFCC</td>
<td align="center">3D expression parameters</td>
<td align="center">Driving video</td>
<td align="center">No</td>
<td align="center">UNet</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B78">Wen et al. (2020)</xref>
</td>
<td align="center">GAN</td>
<td align="center">MFCC</td>
<td align="center">3D expression parameters</td>
<td align="center">Driving video</td>
<td align="center">No</td>
<td align="center">GAN</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B41">Lahiri et al. (2021)</xref>
</td>
<td align="center">CNN</td>
<td align="center">Spectrograms</td>
<td align="center">3D vertex positions</td>
<td align="center">Driving video</td>
<td align="center">No</td>
<td align="center">CNN</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s3-2-1">
<title>3.2.1 2D/3D landmark based methods</title>
<p>In this section we discuss methods that rely on either 2D or 3D face landmarks as an intermediate structural representation for producing facial animations from audio. Some of the discussed methods use the generated landmarks to animate a 3D face model; these methods are also considered &#x201c;landmark-based.&#x201d; <xref ref-type="fig" rid="F2">Figure 2</xref> depicts a high-level overview of what a typical landmark-based approach could look like.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>A high-level landmark based pipeline, where headpose and face structure from an existing video is combined with predicted lip/jaw displacements from a target audio clip to generate a modified video.</p>
</caption>
<graphic xlink:href="frsip-03-1230755-g002.tif"/>
</fig>
<p>
<xref ref-type="bibr" rid="B68">Suwajanakorn et al. (2017)</xref>, <xref ref-type="bibr" rid="B69">Taylor et al. (2017)</xref> were among the first works to explore using deep learning techniques to generate speech animation. The former trained a recurrent network to generate sparse mouth keypoints from audio before compositing them onto an existing video, while the latter presented an approach for generalised speech animation by training a neural network model to predict animation parameters of a reference face model given phoneme labels as input. The field has come a long way since then, with <xref ref-type="bibr" rid="B27">Eskimez et al. (2018)</xref> presenting a method for generating static (no headpose) talking face landmarks from audio via an LSTM-based model, and <xref ref-type="bibr" rid="B13">Chen et al. (2019)</xref> expanding the work by conditioning a GAN on the landmarks to generate photorealistic frames. Similarly, <xref ref-type="bibr" rid="B20">Das et al. (2020)</xref> employed a GAN-based architecture to generate facial landmarks from DeepSpeech features extracted from audio, before using a second GAN conditioned on the landmarks to generate the photorealistic frames.</p>
<p>
<xref ref-type="bibr" rid="B90">Zhou et al. (2020)</xref>&#x2019;s approach was among the first to generate talking face landmarks with realistic head pose movement from audio. They did this by training two LSTM networks, one to handle the lip/jaw movements, and a second to generate the headpose, before combining the two outputs and passing them through an off-the-shelf image-to-image translation network for generating photorealistic frames.</p>
<p>
<xref ref-type="bibr" rid="B46">Lu et al. (2021)</xref>&#x2019;s approach also simulated headpose and upper body motion, using a separate autoregressive model trained on DeepSpeech audio features before generating realistic frames with an image-to-image translation model conditioned on feature maps derived from the generated landmarks. While also proposing an approach for the head pose problem, <xref ref-type="bibr" rid="B74">Wang et al. (2021)</xref> tackled the challenge of stabilising non-face (background) regions when generating talking head videos from a single image.</p>
<p>Unlike the previous methods, which all address the talking head generation task, the following papers fall into the audio-driven automatic dubbing category and seek to modify existing videos. <xref ref-type="bibr" rid="B37">Ji et al. (2021)</xref> were among the first to tackle the problem of generating emotionally aware video portraits, disentangling speech into two representations, a content-aware time-dependent stream and an emotion-aware time-independent stream, and training a model to generate 2D facial landmarks. Theirs may be considered a &#x201c;hybrid&#x201d; structural approach, as they perform monocular 3D reconstruction from both the predicted and ground truth landmarks to obtain two 3D face models. They then combine the pose parameters from the ground truth model with the expression and geometry parameters of the predicted model to create the final 3D face model, before extracting edge maps and generating the output frames via image-to-image translation. <xref ref-type="bibr" rid="B6">Bigioi et al. (2022)</xref> extracted ground truth 3D landmarks from video and trained a network to alter them directly given an input audio sequence, without the need to first retarget them to a static fixed face model, animate it, and then restore the original headpose.</p>
</sec>
<sec id="s3-2-2">
<title>3.2.2 3D model based methods</title>
<p>In this section we discuss methods that use 3D face models as intermediate representations when generating facial animations. In other words, we discuss methods that train models to produce blendshape face parameters from audio signals as input. <xref ref-type="fig" rid="F3">Figure 3</xref> depicts a high-level overview of one such model.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>A high-level 3D model based pipeline, where monocular facial reconstruction is performed on a source video to extract expression, pose, and geometry parameters. A separate audio-to-expression parameter prediction network is then trained. The predicted expression parameters replace the original ones to generate a new 3D facial mesh, which is then rendered into a photorealistic video via a neural rendering model.</p>
</caption>
<graphic xlink:href="frsip-03-1230755-g003.tif"/>
</fig>
<p>
<xref ref-type="bibr" rid="B38">Karras et al. (2017)</xref> were among the first to use deep learning to learn facial animation for a 3D face model from limited audio data. <xref ref-type="bibr" rid="B19">Cudeiro et al. (2019)</xref> introduced a 4D audiovisual face dataset (talking 3D models), as well as a network trained to generate 3D facial animations from DeepSpeech audio features. <xref ref-type="bibr" rid="B70">Thies et al. (2020)</xref> also utilised DeepSpeech audio features, training a network to output speaker-independent facial expression parameters that drive an intermediate 3D face model before generating the photorealistic frames using a neural rendering model. <xref ref-type="bibr" rid="B11">Chen et al. (2020b)</xref>&#x2019;s approach involved learning head motion from a collection of reference frames, then combining that information with learned PCA components denoting facial expression in a 3D-aware frame generation network. Their approach is interesting because their pipeline addresses several known problems within talking head generation, such as keeping the identity/appearance of the head consistent, maintaining a consistent background, and generating realistic speaker-aware head motion. <xref ref-type="bibr" rid="B83">Yi et al. (2020)</xref> presented an approach to generate talking head videos from a driving audio signal by training a neural network to predict pose and expression parameters for a 3D face model from audio, and combining them with shape, texture, and lighting parameters extracted from a set of reference frames. They then render the 3D face model to photorealism via a neural renderer, before fine-tuning the rendered frames with a memory-augmented GAN. <xref ref-type="bibr" rid="B79">Wu et al. (2021)</xref> presented an approach to generate talking head faces of a target portrait given a driving speech signal and a &#x201c;Style Reference Video.&#x201d; They train their model such that the output video mimics the speaking style of the reference video while its identity corresponds to the target portrait. <xref ref-type="bibr" rid="B87">Zhang et al. (2021b)</xref> presented a method for one-shot talking head animation. Given a reference frame and driving audio source, they generate eyebrow, head pose, and mouth motion parameters of a 3D morphable model using an encoder-decoder architecture. A flow-guided video generator is then used to create the final output frames. <xref ref-type="bibr" rid="B84">Zhang et al. (2021a)</xref> synthesize talking head videos given a driving speech input and reference video clip. They design a GAN-based module that can output expression, eyeblink, and headpose parameters of a 3DMM given DeepSpeech audio features.</p>
<p>While the previously referenced methods are all examples of pure talking head generation approaches, the following fall into the automatic dubbing category. Both <xref ref-type="bibr" rid="B64">Song et al. (2022)</xref> and <xref ref-type="bibr" rid="B78">Wen et al. (2020)</xref> presented approaches to modify an existing video using a driving audio signal by training a neural network to extract 3D face model expression parameters from audio, combining them with pose and geometry parameters extracted from the original video, and applying neural rendering to generate the modified photorealistic video. To generate the facial animations, <xref ref-type="bibr" rid="B63">Song et al. (2021)</xref> employ a similar pipeline to the methods referenced above; however, they go one step further and transfer the acoustic properties of the original video&#x2019;s speaker onto the driving speech via an encoder-decoder mechanism, essentially dubbing the video. <xref ref-type="bibr" rid="B59">Richard et al. (2021)</xref> provided a generalised framework for generating accurate 3D facial animations from speech by learning a categorical latent space that disentangles audio-correlated (lip/jaw motion) and audio-uncorrelated (eyeblinks, upper facial expression) information at inference time. In doing so, they built a framework that can be applied to both automatic dubbing and talking head generation tasks. <xref ref-type="bibr" rid="B41">Lahiri et al. (2021)</xref> introduced an encoder-decoder architecture trained to decode 3D vertex positions [similar to <xref ref-type="bibr" rid="B38">Karras et al. (2017)</xref>] and 2D texture maps of the lip region from audio and the previously generated frame. They combine these to form a textured 3D face mesh, which they then render and blend with the original video to generate the dubbed clip.</p>
<p>We would also like to draw attention to the works of <xref ref-type="bibr" rid="B28">Fried et al. (2019)</xref> and <xref ref-type="bibr" rid="B82">Yao et al. (2021)</xref>. These are video editing approaches which utilise text, in addition to audio, to modify existing talking head videos. The former works by aligning phoneme labels to the input audio and constructing a 3D face model for each input frame. Then, when the text transcript is modified (e.g., dog to god), it searches for segments of the input video where the visemes are similar, blending the 3D model parameters from the corresponding video frames to generate a new frame, which is then rendered via their neural renderer. The latter builds on this work by improving the efficiency of the phoneme matching algorithm and developing a self-supervised neural retargeting technique for transferring the mouth motions of the source actor to the target actor.</p>
</sec>
</sec>
<sec id="s3-3">
<title>3.3 End-to-end generation</title>
<p>Though less popular in recent times than their structural-based counterparts, the ability to generate or modify a video directly from an input audio signal is one of the key factors that make end-to-end approaches attractive to talking head researchers. These methods aim to learn the complex mapping between audio, facial expressions, and lip movements using a single unified model that combines the traditional stages of talking head generation into a single step. By doing so, they eliminate the need for explicit intermediate representations, such as facial landmarks or 3D models, which can be computationally expensive and prone to error. This ability to directly connect the audio input to the video output streamlines the synthesis process and can enable real-time or near-real-time generation. Please see <xref ref-type="table" rid="T2">Table 2</xref> for a summary of relevant end-to-end approaches in the literature.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>Table summarising some of the most relevant end-to-end approaches in the literature.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Method</th>
<th align="center">Architecture</th>
<th align="center">Audio input</th>
<th align="center">Additional inputs</th>
<th align="center">Head motion</th>
<th align="center">Photorealistic frame rendering</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="center">
<xref ref-type="bibr" rid="B14">Chung et al. (2017)</xref>
</td>
<td align="center">Encoder-Decoder</td>
<td align="center">MFCC</td>
<td align="center">Reference identity</td>
<td align="center">No</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B12">Chen et al. (2018)</xref>
</td>
<td align="center">GAN</td>
<td align="center">Mel spectrogram</td>
<td align="center">Reference lip image</td>
<td align="center">No</td>
<td align="center">Limited to lip region only</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B65">Song et al. (2018)</xref>
</td>
<td align="center">GAN</td>
<td align="center">MFCC</td>
<td align="center">Reference image</td>
<td align="center">No</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B49">Mittal and Wang (2020)</xref>
</td>
<td align="center">LSTM &#x2b; GAN</td>
<td align="center">Learned speech embeddings</td>
<td align="center">Reference image</td>
<td align="center">No</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B88">Zhou et al. (2019)</xref>
</td>
<td align="center">GAN</td>
<td align="center">MFCC</td>
<td align="center">Reference frames</td>
<td align="center">No</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B72">Vougioukas et al. (2020)</xref>
</td>
<td align="center">GAN</td>
<td align="center">Raw audio</td>
<td align="center">Reference image</td>
<td align="center">No</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B57">Prajwal et al. (2020)</xref>
</td>
<td align="center">GAN</td>
<td align="center">Mel spectrogram</td>
<td align="center">Driving video</td>
<td align="center">Yes</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B39">Kumar et al. (2020)</xref>
</td>
<td align="center">GAN</td>
<td align="center">DeepSpeech features</td>
<td align="center">None</td>
<td align="center">Yes</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B26">Eskimez et al. (2020)</xref>
</td>
<td align="center">LSTM &#x2b; GAN</td>
<td align="center">Raw audio</td>
<td align="center">Reference image</td>
<td align="center">No</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B89">Zhou et al. (2021)</xref>
</td>
<td align="center">GAN</td>
<td align="center">Spectrograms</td>
<td align="center">Driving video &#x2b; reference frame</td>
<td align="center">Yes</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B67">Stypu&#x142;kowski et al. (2023)</xref>
</td>
<td align="center">Diffusion Unet</td>
<td align="center">Learned speech embeddings</td>
<td align="center">Reference image</td>
<td align="center">Yes</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B61">Shen et al. (2023)</xref>
</td>
<td align="center">Diffusion Unet</td>
<td align="center">Learned speech embeddings</td>
<td align="center">Reference image &#x2b; face landmarks</td>
<td align="center">Yes</td>
<td align="center">Yes</td>
</tr>
<tr>
<td align="center">
<xref ref-type="bibr" rid="B5">Bigioi et al. (2023)</xref>
</td>
<td align="center">Diffusion Unet</td>
<td align="center">Mel spectrograms</td>
<td align="center">Reference image</td>
<td align="center">Yes</td>
<td align="center">Yes</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>
<xref ref-type="bibr" rid="B14">Chung et al. (2017)</xref> proposed one of the first end-to-end talking head generation techniques. Given a reference identity frame and a driving speech audio signal, they trained an encoder-decoder architecture to generate talking head videos, additionally demonstrating how their approach could be applied to the dubbing problem. Their approach was limited, however, as it only generated the cropped region around the face, discarding any background.</p>
<p>
<xref ref-type="bibr" rid="B12">Chen et al. (2018)</xref> presented a GAN-based method for generating lip movement from a driving speech source and a reference lip frame. Similar to the above method, theirs was limited to generating just the cropped region of the face surrounding the lips. <xref ref-type="bibr" rid="B65">Song et al. (2018)</xref> presented a more generalised GAN-based approach for talking head generation that also took the temporal consistency between frames into account by introducing a recurrent unit into their pipeline, generating smoother videos. <xref ref-type="bibr" rid="B88">Zhou et al. (2019)</xref> proposed a model that could generate videos based on learned disentangled representations of speech and video. The approach is interesting because it allowed the authors to generate a talking head video from a reference identity frame and either a driving speech signal or a driving video. <xref ref-type="bibr" rid="B49">Mittal and Wang (2020)</xref> disentangled the audio signal into factors such as phonetic content and emotional tone, and conditioned a talking head generative model on these representations instead of the raw audio, demonstrating compelling results. <xref ref-type="bibr" rid="B72">Vougioukas et al. (2020)</xref> proposed a GAN-based approach to generate temporally consistent talking head videos from a reference frame and audio. Their method generated realistic eyeblinks in addition to synchronised lip movements in an end-to-end manner. <xref ref-type="bibr" rid="B57">Prajwal et al. (2020)</xref> introduced a &#x201c;lip-sync discriminator&#x201d; for generating more accurate lip movements in talking head videos, as well as proposing new metrics to evaluate lip synchronisation in generated videos. <xref ref-type="bibr" rid="B26">Eskimez et al. (2020)</xref> proposed a robust GAN-based model that could generate talking head videos from noisy speech. <xref ref-type="bibr" rid="B39">Kumar et al. (2020)</xref> proposed a GAN-based approach for one-shot talking head generation. <xref ref-type="bibr" rid="B89">Zhou et al. (2021)</xref> proposed an interesting approach to exert control over the pose of an audio-driven talking head. Using a target &#x201c;pose&#x201d; video and a speech signal, they condition a model to generate talking head videos from a single reference identity image whose pose is dictated by the target video.</p>
<p>While GAN-based <xref ref-type="bibr" rid="B30">Goodfellow et al. (2014)</xref> methods such as the approaches referenced above have been immensely popular in recent years, practitioners in the field have shown them to have a number of limitations. Due to the presence of multiple losses and discriminators, their optimization process is complex and quite unstable. This can lead to difficulties in balancing the generator and discriminator, resulting in issues like mode collapse, where the generator fails to capture the full diversity of the target distribution. Vanishing gradients are another issue, occurring when gradients become too small during backpropagation, preventing the model from learning effectively, especially in deeper layers. This can significantly slow down the training process and limit the overall performance of the model. With that in mind, we would like to draw special attention to diffusion models (<xref ref-type="bibr" rid="B62">Sohl-Dickstein et al., 2015</xref>, <xref ref-type="bibr" rid="B34">Ho et al., 2020</xref>, <xref ref-type="bibr" rid="B21">Dhariwal and Nichol, 2021</xref>, <xref ref-type="bibr" rid="B52">Nichol and Dhariwal, 2021</xref>), a new class of generative model that has gained prominence in the last couple of years due to strong performance on a myriad of tasks such as text-based image generation, speech synthesis, colourisation, body animation prediction, and more.</p>
</sec>
<sec id="s3-4">
<title>3.4 Diffusion-based generation</title>
<p>We dedicate a short section of this paper to diffusion-based approaches, due to their recent rise in use and popularity. Note that within this section we describe methods from both the end-to-end and structural-based schools of thought, as at this time there are only a handful of diffusion-based talking head works.</p>
<p>For a deeper understanding of the diffusion architecture, we direct readers to the works of <xref ref-type="bibr" rid="B62">Sohl-Dickstein et al. (2015)</xref>; <xref ref-type="bibr" rid="B34">Ho et al. (2020)</xref>; <xref ref-type="bibr" rid="B21">Dhariwal and Nichol (2021)</xref>; <xref ref-type="bibr" rid="B52">Nichol and Dhariwal (2021)</xref>, as these are the pioneering works that contributed to their recent popularity and widespread adoption. In short, however, the diffusion process can be summarised as consisting of two stages: 1) the forward diffusion process, and 2) the reverse diffusion process.</p>
<p>In the forward diffusion process, the desired output data is gradually &#x201c;destroyed&#x201d; over a series of time steps by adding Gaussian noise at each step until the data becomes just another sample from a standard Gaussian distribution. Conversely, in the reverse diffusion process, a model is trained to gradually denoise the data by removing the noise at each time step, with the loss typically being computed as a distance function between the predicted noise and the actual noise that was added at that particular time step. The combination of these two stages enables diffusion models to model complex data distributions without suffering from mode collapse, unlike GANs, and to generate high-quality samples without the need for adversarial training or complex loss functions.</p>
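<p>The two stages described above can be sketched numerically. The following minimal illustration is not any specific paper&#x2019;s implementation; it assumes a linear noise schedule as popularised by Ho et al. (2020), and stands in for the image/frame data with a random array:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule (assumed linear here): beta_t grows from 1e-4 to 0.02.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative product; shrinks towards 0 as t grows

def forward_diffuse(x0, t):
    """Forward process: sample x_t ~ q(x_t | x_0) in closed form.

    As t approaches T, alpha_bar[t] -> 0, so x_t approaches pure Gaussian noise.
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return x_t, noise

def noise_prediction_loss(predicted_noise, actual_noise):
    """Training objective: distance between predicted and actual noise at step t."""
    return float(np.mean((predicted_noise - actual_noise) ** 2))

# A stand-in for a video frame; a real model would predict the noise
# from (x_t, t) plus any conditioning signal such as audio features.
x0 = rng.standard_normal((64, 64))
x_t, eps = forward_diffuse(x0, t=T - 1)
```

<p>In the reverse process, a network trained with this loss is applied iteratively from t = T down to t = 0 to recover a clean sample from noise.</p>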
<p>Within the context of talking head generation and video editing, a number of recent works have explored using diffusion models. Specifically, <xref ref-type="bibr" rid="B67">Stypu&#x142;kowski et al. (2023)</xref>, <xref ref-type="bibr" rid="B61">Shen et al. (2023)</xref>, and <xref ref-type="bibr" rid="B5">Bigioi et al. (2023)</xref> were among the first to explore their use for end-to-end talking head generation and audio-driven video editing. All three methods follow a similar autoregressive frame-based approach where the previously generated frame is fed back into the model along with the audio signal and a reference identity frame to generate the next frame in the sequence. Notably, <xref ref-type="bibr" rid="B61">Shen et al. (2023)</xref> condition their model with landmarks and perform their training within the latent space to save on computational resources, unlike <xref ref-type="bibr" rid="B67">Stypu&#x142;kowski et al. (2023)</xref> and <xref ref-type="bibr" rid="B5">Bigioi et al. (2023)</xref>. <xref ref-type="bibr" rid="B67">Stypu&#x142;kowski et al. (2023)</xref>&#x2019;s approach can be considered a true talking head generation method, as it does not rely on any frames from the original video to guide the model (except for the initial seed/identity frame), and the resultant video is completely synthetic. <xref ref-type="bibr" rid="B5">Bigioi et al. (2023)</xref> perform video editing by teaching their model to inpaint a masked-out facial region of an existing video sequence in response to an input speech signal. <xref ref-type="bibr" rid="B61">Shen et al. (2023)</xref>&#x2019;s approach is similar: they perform video editing rather than talking head generation, modifying an existing video with the use of a face mask designed to cover the facial region of the source video.</p>
<p>While the above approaches are currently the only end-to-end diffusion-based methods, a number of structural-based approaches that leverage diffusion models have also been proposed in recent months. <xref ref-type="bibr" rid="B86">Zhang et al. (2022)</xref> proposed an approach that used audio to predict landmarks before using a diffusion-based renderer to output the final frame. <xref ref-type="bibr" rid="B93">Zhua et al. (2023)</xref> utilised a diffusion model similarly, taking the source image and the predicted motion features as input to generate the high-resolution frames. <xref ref-type="bibr" rid="B22">Du et al. (2023)</xref> introduced an interesting two-stage approach for talking head generation. The first stage consisted of training a diffusion autoencoder on video frames to extract latent representations of the frames. The second stage involved training a speech-to-latent-representation model, the idea being that the latents predicted from speech could be decoded to image frames by the pretrained diffusion autoencoder. The method achieves impressive results, outperforming other relevant structural-based methods in the field. <xref ref-type="bibr" rid="B80">Xu et al. (2023)</xref> use a diffusion-based renderer conditioned on multi-modal inputs to drive the emotion and pose of the generated talking head videos. Notably, their approach is also applicable to the face swapping problem.</p>
<p>Within the realm of talking heads, diffusion models have shown incredibly promising results, often producing videos with demonstrably higher visual quality and similar lip-sync performance compared to more traditional GAN-based methods. One major limitation, however, lies in their inability to model long sequences of frames without the output degrading in quality over time, due to their autoregressive nature. It will be exciting to see what the future holds for further research in this area.</p>
</sec>
<sec id="s3-5">
<title>3.5 Other approaches</title>
<p>There are certain approaches that do not neatly fit into the aforementioned subcategories but are still relevant and worth discussing.</p>
<p>Viseme-based methods such as <xref ref-type="bibr" rid="B91">Zhou et al. (2018)</xref> are early approaches to driving 3D character models. The authors presented an LSTM-based network capable of producing viseme curves that could drive JALI-based character models as described by <xref ref-type="bibr" rid="B24">Edwards et al. (2016)</xref>.</p>
<p>
<xref ref-type="bibr" rid="B31">Guo et al. (2021)</xref> presented a unique method for talking head generation that, instead of relying on traditional intermediate structural representations such as landmarks or 3DMMs, generates a neural radiance field from audio, from which a realistic video is synthesised using volume rendering.</p>
</sec>
</sec>
<sec id="s4">
<title>4 Popular datasets and evaluation metrics</title>
<p>In this section we describe the most popular datasets and metrics for measuring the quality of videos generated by audio-driven talking head and automatic dubbing models.</p>
<sec id="s4-1">
<title>4.1 Evaluation metrics</title>
<p>Quantitatively evaluating both talking head and dubbed videos is not a straightforward task. Traditional perceptual metrics such as SSIM, and distance-based metrics such as the L2 norm or PSNR, which seek to quantify the similarity between two images, are inadequate on their own. Such metrics do not take into account the temporal nature of video: the quality of a video is affected not only by the quality of individual frames, but also by the smoothness and synchronisation of the frames as they are played back.</p>
<p>Although these metrics may not provide a perfect evaluation of video quality, they remain important for benchmarking purposes, as they provide a good indication of what to expect from a model. As such, when there is access to ground truth samples with which to compare a model&#x2019;s output, the following metrics are commonly used:</p>
<p>PSNR (Peak Signal-to-Noise Ratio): The peak signal-to-noise ratio between the ground truth and the generated image is computed. The higher the PSNR value, the better the quality of the reconstructed image.</p>
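<p>As a concrete illustration, PSNR follows directly from the mean squared error between two frames. The sketch below assumes 8-bit images (a peak value of 255):</p>

```python
import numpy as np

def psnr(ground_truth, generated, max_val=255.0):
    """Peak signal-to-noise ratio in decibels; higher means a closer match."""
    gt = ground_truth.astype(np.float64)
    gen = generated.astype(np.float64)
    mse = np.mean((gt - gen) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

<p>For video, the per-frame PSNR values are typically averaged over the whole clip.</p>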
<p>Facial Action Units (AU) <xref ref-type="bibr" rid="B25">Ekman and Friesen (1978)</xref> Recognition: <xref ref-type="bibr" rid="B65">Song et al. (2018)</xref> and <xref ref-type="bibr" rid="B11">Chen et al. (2020b)</xref> popularised a method for evaluating reconstructed images with respect to ground truth samples using five facial action units.</p>
<p>ACD (Average Content Distance) (<xref ref-type="bibr" rid="B71">Tulyakov et al., 2018</xref>): As used by <xref ref-type="bibr" rid="B72">Vougioukas et al. (2020)</xref>, the Cosine (ACD-C) and Euclidean (ACD-E) distances between the generated frame and the ground truth image can be calculated. The smaller the distance between two images, the more similar they are.</p>
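<p>A sketch of the two distance variants, assuming each frame has already been mapped to a feature embedding vector (in practice a face-recognition embedding is commonly used, and the result is averaged over frames):</p>

```python
import numpy as np

def acd_cosine(emb_a, emb_b):
    """ACD-C: cosine distance (1 - cosine similarity) between frame embeddings."""
    sim = np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return float(1.0 - sim)

def acd_euclidean(emb_a, emb_b):
    """ACD-E: Euclidean (L2) distance between frame embeddings."""
    return float(np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b)))
```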
<p>SSIM (Structural Similarity Index) (<xref ref-type="bibr" rid="B76">Wang et al., 2004</xref>): This is a metric designed to measure the similarity between two images by looking at the luminance, contrast, and structure of the pixels in the images.</p>
<p>Landmark Distance Metric (LMD): Proposed by <xref ref-type="bibr" rid="B12">Chen et al. (2018)</xref>, Landmark Distance (LMD) is a popular metric used to evaluate the lip synchronisation of a synthetic video. It works by extracting facial landmark lip coordinates for each frame of both the generated and ground truth videos using an off-the-shelf facial landmark extractor, calculating the Euclidean distance between them, and normalising by the length of the video and the number of frames.</p>
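<p>A minimal sketch of the computation, assuming the lip landmarks have already been extracted into arrays of shape (frames, points, 2):</p>

```python
import numpy as np

def landmark_distance(generated_lips, ground_truth_lips):
    """LMD: mean Euclidean distance between corresponding lip landmarks,
    normalised over the number of frames and landmark points."""
    gen = np.asarray(generated_lips, dtype=np.float64)
    gt = np.asarray(ground_truth_lips, dtype=np.float64)
    per_point = np.linalg.norm(gen - gt, axis=-1)  # shape: (frames, points)
    return float(per_point.mean())  # lower indicates better lip sync
```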
<p>Unfortunately, when generating talking head or dubbed videos, oftentimes it is impossible to use the metrics discussed above as there is no corresponding ground truth data with which to compare the generated samples. Therefore, a number of perceptual metrics (metrics which seek to emulate how humans perceive things) have been proposed to address this problem. These include:</p>
<p>CPBD (Cumulative Probability of Blur Detection) (<xref ref-type="bibr" rid="B51">Narvekar and Karam, 2011</xref>): This is a perceptual metric used to detect blur in images and measure image sharpness. It was used by <xref ref-type="bibr" rid="B39">Kumar et al. (2020)</xref>; <xref ref-type="bibr" rid="B72">Vougioukas et al. (2020)</xref>; <xref ref-type="bibr" rid="B14">Chung et al. (2017)</xref> to evaluate their talking head videos.</p>
<p>WER (Word Error Rate): A pretrained lip-reading model is used to predict the words spoken by the generated face, and the error rate against the reference transcript is reported. Works such as <xref ref-type="bibr" rid="B39">Kumar et al. (2020)</xref> and <xref ref-type="bibr" rid="B72">Vougioukas et al. (2020)</xref> use the LipNet <xref ref-type="bibr" rid="B3">Assael et al. (2016)</xref> model, which is pre-trained on the GRID dataset and achieves 95.2 percent lip-reading accuracy.</p>
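<p>WER itself is the word-level edit distance between the predicted and reference transcripts, divided by the number of reference words. A minimal sketch (the example sentence merely mimics the GRID command style and is not drawn from the dataset):</p>

```python
def wer(reference, hypothesis):
    # Word Error Rate: minimum number of word substitutions, insertions,
    # and deletions needed to align hypothesis with reference,
    # divided by the number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to align the first i ref words with the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("place red at c two now", "place red at b two now"))  # 1 substitution / 6 words
```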
<p>SyncNet Based Metrics: These are perceptual metrics based on the SyncNet model introduced by <xref ref-type="bibr" rid="B17">Chung and Zisserman (2017b)</xref> that evaluate lip synchronisation in unconstrained videos. <xref ref-type="bibr" rid="B57">Prajwal et al. (2020)</xref> introduced two such metrics: 1) LSE-D, the average error measured as the distance between the lip and audio representations, and 2) LSE-C, the average confidence score. These metrics have proven popular since their introduction, with the vast majority of recent papers in the field using them to evaluate their videos.</p>
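<p>A hedged sketch of the two scores, assuming audio and video embeddings have already been produced by a SyncNet-style model: LSE-D is the mean distance between time-aligned embedding pairs, while the confidence underlying LSE-C is commonly taken as the median minus the minimum of the distances over a range of candidate temporal offsets. The embedding values below are hypothetical:</p>

```python
import math

def l2(u, v):
    # plain Euclidean distance between two embedding vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def lse_d(audio_embs, video_embs):
    # LSE-D style score: mean distance between time-aligned audio and
    # video embeddings (lower = better lip sync).
    return sum(l2(a, v) for a, v in zip(audio_embs, video_embs)) / len(audio_embs)

def sync_confidence(offset_distances):
    # SyncNet-style confidence: median minus minimum of the distances
    # computed over a range of candidate audio-video offsets; a sharp
    # minimum (high confidence) indicates a clear point of best sync.
    s = sorted(offset_distances)
    return s[len(s) // 2] - s[0]

audio = [[0.1, 0.2], [0.3, 0.1]]   # hypothetical per-window embeddings
video = [[0.1, 0.25], [0.2, 0.1]]
print(lse_d(audio, video))
print(sync_confidence([0.9, 0.3, 0.8, 1.0, 0.7]))  # median 0.8 minus min 0.3
```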
</sec>
<sec id="s4-2">
<title>4.2 Benchmark Datasets</title>
<p>There are a number of benchmark datasets used to evaluate talking head and video dubbing models. They can be broadly categorised as being either &#x201c;in-the-wild,&#x201d; or &#x201c;lab conditions&#x201d; style datasets. In this section we list some of the most popular ones, and briefly describe them.<list list-type="simple">
<list-item>
<p>&#x2022; VoxCeleb 1 and 2 (<xref ref-type="bibr" rid="B50">Nagrani et al., 2017</xref>; <xref ref-type="bibr" rid="B15">Chung et al., 2018</xref>): These datasets contain audio and video recordings of celebrities speaking in the wild and are often used for training and evaluating talking head generation, lip reading, and dubbing models. VoxCeleb1 contains over 150,000 utterances from 1,251 celebrities, and VoxCeleb2 over 1,000,000 utterances from 6,112 celebrities.</p>
</list-item>
<list-item>
<p>&#x2022; GRID (<xref ref-type="bibr" rid="B18">Cooke et al., 2006</xref>): The GRID dataset consists of audio and video recordings of 34 speakers reading 1,000 sentences in lab conditions. It is commonly used for evaluating lip-reading algorithms but has also been used for talking head generation and video dubbing models.</p>
</list-item>
<list-item>
<p>&#x2022; LRS3-TED (<xref ref-type="bibr" rid="B1">Afouras et al., 2018</xref>): This dataset contains audio and video recordings of over 400&#xa0;h of TED talks, which are speeches given by experts in various fields.</p>
</list-item>
<list-item>
<p>&#x2022; LRW (<xref ref-type="bibr" rid="B16">Chung and Zisserman, 2017a</xref>): The LRW (Lip Reading in the Wild) dataset consists of up to 1,000 utterances of 500 different words, spoken by hundreds of different speakers in the wild.</p>
</list-item>
<list-item>
<p>&#x2022; CREMA-D (<xref ref-type="bibr" rid="B8">Cao et al., 2014</xref>): This dataset contains audio and video recordings of people speaking in various emotional states (happy, sad, anger, fear, disgust, and neutral). In total it contains 7,442 clips of 91 different actors recorded in lab conditions.</p>
</list-item>
<list-item>
<p>&#x2022; TCD-TIMIT (<xref ref-type="bibr" rid="B32">Harte and Gillen, 2015</xref>): The Trinity College Dublin Talking Heads dataset (TCD-TIMIT) contains video recordings of 62 actors speaking in a controlled environment.</p>
</list-item>
<list-item>
<p>&#x2022; MEAD Dataset (<xref ref-type="bibr" rid="B73">Wang et al., 2020</xref>): This dataset contains videos featuring 60 actors talking with eight different emotions at three different intensity levels (except for neutral). The videos are simultaneously recorded at seven different perspectives with roughly 40&#xa0;h of speech recorded for each person.</p>
</list-item>
<list-item>
<p>&#x2022; RAVDESS Dataset (<xref ref-type="bibr" rid="B73">Wang et al., 2020</xref>): The Ryerson Audio-Visual Database of Emotional Speech and Song is a corpus consisting of 24 actors speaking with calm, happy, sad, angry, fearful, surprise, and disgust expressions, and singing with calm, happy, sad, angry, and fearful emotions. Each expression is produced at two levels of emotional intensity, with an additional neutral expression. It contains 7,356 recordings in total.</p>
</list-item>
<list-item>
<p>&#x2022; CelebV-HQ (<xref ref-type="bibr" rid="B92">Zhu et al., 2022</xref>): CelebV-HQ is a dataset containing 35,666 video clips involving 15,653 identities and 83 manually labeled facial attributes covering aspects such as appearance, action, and emotion.</p>
</list-item>
</list>
</p>
</sec>
</sec>
<sec id="s5">
<title>5 Open challenges</title>
<p>Although significant progress has been made in the fields of talking head generation and automatic dubbing, these areas of research are constantly evolving, and several open challenges still need to be addressed, offering plenty of opportunities for future work.</p>
<sec id="s5-1">
<title>5.1 Bridging the uncanny valley</title>
<p>Despite existing research, generating truly realistic talking heads remains an unsolved problem. There are various factors that come into play when discussing the topic of realism and how we can bridge the &#x201c;Uncanny Valley&#x201d; effect in video dubbing. These include:<list list-type="simple">
<list-item>
<p>&#x2022; Visual quality: Realistic talking head videos should have high-quality visuals that accurately capture the colors, lighting, and textures of the scene, which requires attention to detail in the rendering process. Currently, most talking head and visual dubbing approaches are limited to generating videos at low output resolutions, and those that do work at higher resolutions are quite limited in terms of both model robustness and generalisation (more on that later). This is due to several reasons: 1) the computational complexity of deep learning models rises significantly when generating high-resolution videos, both in terms of training time and inference speed, which in turn has an adverse effect on real-time performance; 2) generating realistic talking head videos requires the model to capture intricate details of facial expressions, lip movements, and speech patterns; as the output resolution of the video increases, so too does the demand for fine-grained detail, making it more difficult for models to achieve high degrees of realism; 3) high-resolution videos require storage and bandwidth in abundance, limiting high-resolution generation to researchers with access to state-of-the-art hardware systems. Some approaches that have sought to tackle this issue are the works of <xref ref-type="bibr" rid="B29">Gao et al. (2023)</xref>, <xref ref-type="bibr" rid="B31">Guo et al. (2021)</xref>, and <xref ref-type="bibr" rid="B61">Shen et al. (2023)</xref>, whose approaches are capable of outputting high-resolution frames.</p>
</list-item>
<list-item>
<p>&#x2022; Motion: Realistic talking head/dubbed videos should have realistic motion, including smooth and natural movements of the face in response to speech, and realistic head motion when generating videos from scratch. This is a continuous topic of interest, with many works exploring it such as <xref ref-type="bibr" rid="B11">Chen et al. (2020b)</xref>, <xref ref-type="bibr" rid="B74">Wang et al. (2021)</xref>, and more recently <xref ref-type="bibr" rid="B85">Zhang et al. (2023)</xref>.</p>
</list-item>
<list-item>
<p>&#x2022; Disembodied Voice: The phenomenon of a Disembodied Voice is characterized by a jarring mismatch between a speaker&#x2019;s voice and their physical appearance, which is a commonly encountered issue in movie dubbing. Despite its significance, this issue remains relatively unexplored within the realm of talking head literature, thereby presenting a promising avenue for researchers to investigate further. The work conducted by <xref ref-type="bibr" rid="B54">Oh et al. (2019)</xref> demonstrated that there is an inherent link between a speaker&#x2019;s voice and their appearance that can be learned, thus lending credence to the idea that dubbing efforts should prioritize the synchronization of voice and appearance.</p>
</list-item>
<list-item>
<p>&#x2022; Emotion: Realistic videos should evoke realistic emotions, including facial expressions, body language, and dialogue. Achieving realistic emotions requires careful attention to acting and performance, as well as attention to detail in the animation and sound design. Recent works seeking to incorporate emotion into their generated talking heads include <xref ref-type="bibr" rid="B47">Ma et al. (2023)</xref>, <xref ref-type="bibr" rid="B44">Liang et al. (2022)</xref>, <xref ref-type="bibr" rid="B43">Li et al. (2021)</xref>.</p>
</list-item>
</list>
</p>
</sec>
<sec id="s5-2">
<title>5.2 The data problem: single vs. multispeaker approaches</title>
<p>As mentioned previously there are two primary approaches to video dubbing&#x2014;structural and end-to-end. In order to train a model to generate highly photorealistic talking head videos with current end-to-end methods, many dozens of hours of single-speaker audiovisual content are required. The content should be of a high quality with factors such as good lighting, consistent framing of the face, and clear audio data. The quantity of data on an individual speaker may be reduced when methods are trained on a multi-speaker dataset, but sufficiently large datasets are only starting to become available. At this point in time it is not possible to estimate how well end-to-end methods might generalize to multiple speakers, or how much data may eventually be required to fine-tune a dubbing model for an individual actor in a movie to achieve a realistic mimicry of their facial actions. The goal should be of the order of tens of minutes of data, or less, to allow for the dubbing of the majority of characters with speaking roles.</p>
</sec>
<sec id="s5-3">
<title>5.3 Generalisation and robustness</title>
<p>Developing a model that can generalize across all faces, and audios, under any conditions such as poor lighting, partial occlusion, or incorrect framing, remains a challenging task yet to be fully resolved.</p>
<p>While supervised learning has proven to be a powerful approach for training models, it typically requires large amounts of labeled data that are representative of the target distribution. However, collecting diverse and balanced datasets that cover all possible scenarios and variations in facial appearance and conditions is a challenging and time-consuming task. Furthermore, it is difficult to anticipate all possible variations that the model may encounter during inference, such as changes in lighting conditions or facial expressions.</p>
<p>To address these challenges, researchers have explored alternative approaches such as self-supervised learning, which aims to learn from unlabelled data by creating supervisory signals from the data itself; in other words, by self-labelling the data. Methods such as <xref ref-type="bibr" rid="B4">Baevski et al. (2020)</xref>; <xref ref-type="bibr" rid="B35">Hsu et al. (2021)</xref>, which fall under the self-supervised learning paradigm, have gained popularity in speech-related fields due to their promising results in improving the robustness and generalization of models. These methods may help overcome the limitations of traditional supervised learning methods that rely solely on labeled data for training. That being said, <xref ref-type="bibr" rid="B58">Radford et al. (2022)</xref> showed that while such methods can learn high-quality representations of the input they are being trained on, &#x201c;they lack an equivalently performant decoder mapping those representations to useable outputs, necessitating a finetuning stage in order to actually perform a task such as speech recognition&#x201d;. The authors demonstrate that by training their model on a &#x201c;weakly-supervised&#x201d; dataset of 680,000&#xa0;h of speech, their model performs well on unseen datasets without the need to finetune. What this means for talking head generation/dubbing is that a model trained on large amounts of &#x201c;weakly-supervised,&#x201d; or in other words, imperfect data, may potentially acquire a higher level of generalization. This can be particularly valuable for tasks like talking head generation or dubbing, where a system needs to understand and replicate various speech patterns, accents, and linguistic nuances that might not be explicitly present in labeled data.</p>
</sec>
<sec id="s5-4">
<title>5.4 The multilingual aspect</title>
<p>In the realm of talking head generation, it is fascinating to observe the adaptability of models trained exclusively on English-language datasets when faced with speech from languages they have not encountered during training. This phenomenon can be attributed to the models&#x2019; proficiency in learning universal acoustic and linguistic features. While language diversity entails a wide array of phonetic, prosodic, and syntactic intricacies, there exists an underpinning foundation of shared characteristics that traverse linguistic boundaries. These foundational aspects, intrinsic to human speech, include elements like phonetic structure and prosodic patterns, which exhibit commonalities across languages. Talking head generation models that excel in capturing these universal attributes inherently possess the ability to generate lip motions that align with a range of linguistic expressions, irrespective of language.</p>
<p>While the lip movements generated by models trained on English-language datasets may exhibit a remarkable degree of fidelity when applied to unseen languages, capturing cultural behaviors associated with those languages is a more intricate endeavor. Cultural gestures, expressions, and head movements often bear an intimate connection with language and its subtle intricacies. Unfortunately, these models, despite their linguistic adaptability, may lack the exposure needed to capture these culturally specific behaviors accurately. For instance, behaviors like the distinctive head movements indicative of agreement in certain cultures remain a challenge for these models. This underscores the connection between language and culture, highlighting the need for models to not only decipher linguistic components but also to appreciate and simulate the cultural nuances that accompany them. As such, we believe that further research is necessitated to ensure a unified representation of both linguistic and cultural dimensions in the realm of talking head generation and automatic dubbing, leaving this an open challenge to the field.</p>
</sec>
<sec id="s5-5">
<title>5.5 Ethical and legal challenges</title>
<p>Lastly we mention that the modification of original digital media content is subject to a wide range of ethical and data-protection considerations. While it is expected for most digital content that the work of paid actors is considered &#x201c;work for hire,&#x201d; there are broader considerations if auto-dubbing technology becomes broadly adopted. Even as we write, there is a large-scale strike of actors in Hollywood, fighting for rights with respect to the use of AI-generated acting sequences. A full discussion of the broad ethical and intellectual property implications arising as today&#x2019;s AI technologies mature into sophisticated end-products for digital content creation would require a separate article.</p>
<p>Ultimately there is a clear need for advanced IP rights management within the digital media creation industry. Past efforts focused on manipulating the media itself, through techniques such as fingerprinting or encryption (<xref ref-type="bibr" rid="B40">Kundur and Karthik, 2004</xref>), but were ultimately unsuccessful. More recently, researchers have proposed that techniques such as blockchain might be used in the context of subtitles (<xref ref-type="bibr" rid="B55">Orero and Torner, 2023</xref>), while legal researchers have provided a broader context for the challenge of digital copyright against the evolution of the Metaverse (<xref ref-type="bibr" rid="B36">Jain and Srivastava, 2022</xref>). Clearly, multi-lingual video dubbing represents just one specific sub-context of this broader ethical and regulatory challenge.</p>
<p>Looking at ethical considerations for the focused topic of multi-lingual video-dubbing one practical approach is to adopt a methodology that can track pipeline usage. One technique adopted in the literature is to build traceability into the pipeline itself, as discussed by <xref ref-type="bibr" rid="B56">Pataranutaporn et al. (2021)</xref>. These authors have included both human and machine traceability methods into their pipeline to ensure safe and ethical use thereof. Their human traceability technique was inspired by fabrication detection techniques drawn from other media paradigms (e.g., text, video) and incorporates perceivable traces like signatures of authorship, distinguishable appearance or small editing artefacts into the generated media. Machine traceability, on the other hand, involves incorporating traces imperceptible to humans, such as non-visible noise signals.</p>
</sec>
</sec>
<sec id="s6">
<title>6 Concluding thoughts</title>
<p>In this paper we have attempted to capture the current state of the art in automated, multi-lingual video dubbing. This is an emerging field of research, driven by the needs of the video streaming industry, and there are many interesting synergies with a range of neural technologies, including auto-translation services, text-to-speech synthesis, and talking-head generators. In addition to a review and discussion of the recent literature, we have also outlined some of the key challenges that remain in blending today&#x2019;s neural technologies into practical implementations of tomorrow&#x2019;s digital media services.</p>
<p>This work may serve as both an introduction and a reference guide for researchers new to the fields of automatic dubbing and talking head generation, while also drawing the attention of those already familiar with the field to the latest techniques, approaches, and methodologies. We hope it will encourage and inspire new research and innovation on this emerging research topic.</p>
</sec>
</body>
<back>
<sec id="s7">
<title>Author contributions</title>
<p>All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.</p>
</sec>
<sec id="s8">
<title>Funding</title>
<p>This work has the financial support of the Science Foundation Ireland Centre for Research Training in Digitally-Enhanced Reality (d-real) under Grant No. 18/CRT/6224, and the ADAPT Centre (Grant 13/RC/2106).</p>
</sec>
<sec sec-type="COI-statement" id="s9">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s10">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Afouras</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chung</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Lrs3-ted: A large-scale dataset for visual speech recognition</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1809.00496">https://arxiv.org/abs/1809.00496</ext-link>.</citation>
</ref>
<ref id="B2">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Alarcon</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2023</year>). <source>Netflix builds proof-of-concept AI model to simplify subtitles for translation</source>.</citation>
</ref>
<ref id="B3">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Assael</surname>
<given-names>Y. M.</given-names>
</name>
<name>
<surname>Shillingford</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Whiteson</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>De Freitas</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Lipnet: end-to-end sentence-level lipreading</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1611.01599">https://arxiv.org/abs/1611.01599</ext-link>.</citation>
</ref>
<ref id="B4">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Baevski</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Mohamed</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Auli</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>wav2vec 2.0: A framework for self-supervised learning of speech representations</article-title>. <source>Adv. neural Inf. Process. Syst.</source> <volume>33</volume>, <fpage>12449</fpage>&#x2013;<lpage>12460</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.2006.11477</pub-id>
</citation>
</ref>
<ref id="B5">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Bigioi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Basak</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>McDonnell</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Corcoran</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Speech driven video editing via an audio-conditioned diffusion model</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2301.04474">https://arxiv.org/abs/2301.04474</ext-link>. <pub-id pub-id-type="doi">10.1109/ACCESS.2022.3231137</pub-id>
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Bigioi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Jordan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Jain</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>McDonnell</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Corcoran</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Pose-aware speech driven facial landmark animation pipeline for automated dubbing</article-title>. <source>IEEE Access</source> <volume>10</volume>, <fpage>133357</fpage>&#x2013;<lpage>133369</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Busso</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Bulut</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Lee</surname>
<given-names>C.-C.</given-names>
</name>
<name>
<surname>Kazemzadeh</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Mower</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2008</year>). <article-title>Iemocap: interactive emotional dyadic motion capture database</article-title>. <source>Lang. Resour. Eval.</source> <volume>42</volume>, <fpage>335</fpage>&#x2013;<lpage>359</lpage>. <pub-id pub-id-type="doi">10.1007/s10579-008-9076-6</pub-id>
</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Cooper</surname>
<given-names>D. G.</given-names>
</name>
<name>
<surname>Keutmann</surname>
<given-names>M. K.</given-names>
</name>
<name>
<surname>Gur</surname>
<given-names>R. C.</given-names>
</name>
<name>
<surname>Nenkova</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Verma</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Crema-d: crowd-sourced emotional multimodal actors dataset</article-title>. <source>IEEE Trans. Affect. Comput.</source> <volume>5</volume>, <fpage>377</fpage>&#x2013;<lpage>390</lpage>. <pub-id pub-id-type="doi">10.1109/TAFFC.2014.2336244</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Tien</surname>
<given-names>W. C.</given-names>
</name>
<name>
<surname>Faloutsos</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Pighin</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2005</year>). <article-title>Expressive speech-driven facial animation</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>24</volume>, <fpage>1283</fpage>&#x2013;<lpage>1302</lpage>. <pub-id pub-id-type="doi">10.1145/1095878.1095881</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Kou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
</person-group> &#x201c;<article-title>What comprises a good talking-head video generation?</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops</conf-name>, <conf-loc>Glasgow, UK</conf-loc> <conf-date>2020a</conf-date>.</citation>
</ref>
<ref id="B11">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Cui</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Kou</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> &#x201c;<article-title>Talking-head generation with rhythmic head motion</article-title>,&#x201d; in <conf-name>Proceedings of the Computer Vision&#x2013;ECCV 2020: 16th European Conference</conf-name>, <conf-loc>Glasgow, UK</conf-loc>, <conf-date>August 2020b</conf-date>.</citation>
</ref>
<ref id="B12">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Maddox</surname>
<given-names>R. K.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Lip movements generation at a glance</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1803.10404">https://arxiv.org/abs/1803.10404</ext-link>.</citation>
</ref>
<ref id="B13">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Maddox</surname>
<given-names>R. K.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Hierarchical cross-modal talking face generation with dynamic pixel-wise loss</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1905.03820">https://arxiv.org/abs/1905.03820</ext-link>.</citation>
</ref>
<ref id="B14">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Chung</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Jamaludin</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>You said that?</article-title> <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1705.02966">https://arxiv.org/abs/1705.02966</ext-link>.</citation>
</ref>
<ref id="B15">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Chung</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Nagrani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Voxceleb2: deep speaker recognition</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1806.05622">https://arxiv.org/abs/1806.05622</ext-link>.</citation>
</ref>
<ref id="B16">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chung</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> &#x201c;<article-title>Lip reading in the wild</article-title>,&#x201d; in <conf-name>Proceedings of the Computer Vision&#x2013;ACCV 2016: 13th Asian Conference on Computer Vision</conf-name>, <conf-loc>Taipei, Taiwan</conf-loc>, <conf-date>November 2017a</conf-date>, <fpage>87</fpage>&#x2013;<lpage>103</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chung</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> &#x201c;<article-title>Out of time: automated lip sync in the wild</article-title>,&#x201d; in <conf-name>Proceedings of the Computer Vision&#x2013;ACCV 2016 Workshops: ACCV 2016 International Workshops</conf-name>, <conf-loc>Taipei, Taiwan</conf-loc>, <conf-date>November 2017b</conf-date>, <fpage>251</fpage>&#x2013;<lpage>263</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Cooke</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Barker</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cunningham</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shao</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2006</year>). <article-title>An audio-visual corpus for speech perception and automatic speech recognition</article-title>. <source>J. Acoust. Soc. Am.</source> <volume>120</volume>, <fpage>2421</fpage>&#x2013;<lpage>2424</lpage>. <pub-id pub-id-type="doi">10.1121/1.2229005</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Cudeiro</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Bolkart</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Laidlaw</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Ranjan</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Black</surname>
<given-names>M. J.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Capture, learning, and synthesis of 3d speaking styles</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1905.03079">https://arxiv.org/abs/1905.03079</ext-link>.</citation>
</ref>
<ref id="B20">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Das</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Biswas</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sinha</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Bhowmick</surname>
<given-names>B.</given-names>
</name>
</person-group> &#x201c;<article-title>Speech-driven facial animation using cascaded gans for learning of motion and texture</article-title>,&#x201d; in <conf-name>Proceedings of the Computer Vision&#x2013;ECCV 2020: 16th European Conference</conf-name>, <conf-loc>Glasgow, UK</conf-loc>, <conf-date>August 2020</conf-date>, <fpage>408</fpage>&#x2013;<lpage>424</lpage>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dhariwal</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Nichol</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Diffusion models beat gans on image synthesis</article-title>. <source>Adv. neural Inf. Process. Syst.</source> <volume>34</volume>, <fpage>8780</fpage>&#x2013;<lpage>8794</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.2105.05233</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Du</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>K.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Dae-talker: high fidelity speech-driven talking face generation with diffusion autoencoder</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2303.17550">https://arxiv.org/abs/2303.17550</ext-link>.</citation>
</ref>
<ref id="B23">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Duquenne</surname>
<given-names>P.-A.</given-names>
</name>
<name>
<surname>Elsahar</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Gong</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Heffernan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Hoffman</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Klaiber</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <source>SeamlessM4T&#x2014;massively multilingual and multimodal machine translation</source>. <publisher-loc>Menlo Park, California, United States</publisher-loc>: <publisher-name>Meta</publisher-name>.</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Edwards</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Landreth</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Fiume</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Jali: an animator-centric viseme model for expressive lip synchronization</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>35</volume>, <fpage>1</fpage>&#x2013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1145/2897824.2925984</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ekman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Friesen</surname>
<given-names>W. V.</given-names>
</name>
</person-group> (<year>1978</year>). <article-title>Facial action coding system</article-title>. <source>Environ. Psychol. Nonverbal Behav</source>. <pub-id pub-id-type="doi">10.1037/t27734-000</pub-id>
</citation>
</ref>
<ref id="B26">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Eskimez</surname>
<given-names>S. E.</given-names>
</name>
<name>
<surname>Maddox</surname>
<given-names>R. K.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>Z.</given-names>
</name>
</person-group> &#x201c;<article-title>End-to-end generation of talking faces from noisy speech</article-title>,&#x201d; in <conf-name>Proceedings of the ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP)</conf-name>, <conf-loc>Barcelona, Spain</conf-loc>, <conf-date>May 2020</conf-date>, <fpage>1948</fpage>&#x2013;<lpage>1952</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Eskimez</surname>
<given-names>S. E.</given-names>
</name>
<name>
<surname>Maddox</surname>
<given-names>R. K.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>Z.</given-names>
</name>
</person-group> &#x201c;<article-title>Generating talking face landmarks from speech</article-title>,&#x201d; in <conf-name>Proceedings of the Latent Variable Analysis and Signal Separation: 14th International Conference, LVA/ICA 2018</conf-name>, <conf-loc>Guildford, UK</conf-loc>, <conf-date>July 2018</conf-date>, <fpage>372</fpage>&#x2013;<lpage>381</lpage>.</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fried</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Tewari</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zollh&#xf6;fer</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Finkelstein</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Shechtman</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Goldman</surname>
<given-names>D. B.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Text-based editing of talking-head video</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>38</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1145/3306346.3323028</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Gao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ming</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Lu</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>High-fidelity and freely controllable talking head video generation</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2304.10168">https://arxiv.org/abs/2304.10168</ext-link>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goodfellow</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Pouget-Abadie</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Mirza</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Warde-Farley</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ozair</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2014</year>). <article-title>Generative adversarial nets</article-title>. <source>Adv. neural Inf. Process. Syst.</source> <volume>27</volume>.</citation>
</ref>
<ref id="B31">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Liang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.-J.</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Ad-nerf: audio driven neural radiance fields for talking head synthesis</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2103.11078">https://arxiv.org/abs/2103.11078</ext-link>.</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Harte</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Gillen</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>Tcd-timit: an audio-visual corpus of continuous speech</article-title>. <source>IEEE Trans. Multimedia</source> <volume>17</volume>, <fpage>603</fpage>&#x2013;<lpage>615</lpage>. <pub-id pub-id-type="doi">10.1109/TMM.2015.2407694</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Hayes</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Bolanos-Garcia-Escribano</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2022</year>). <source>Streaming English dubs: a snapshot of Netflix&#x2019;s playbook. Conference: Transtextual and transcultural circumnavigations, 10th international conference of AIETI (Iberian Association for Translation and Interpreting Studies)</source>. <publisher-loc>Braga, Portugal</publisher-loc>: <publisher-name>Universidade do Minho</publisher-name>.</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ho</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Jain</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Abbeel</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Denoising diffusion probabilistic models</article-title>. <source>Adv. neural Inf. Process. Syst.</source> <volume>33</volume>, <fpage>6840</fpage>&#x2013;<lpage>6851</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hsu</surname>
<given-names>W.-N.</given-names>
</name>
<name>
<surname>Bolte</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Tsai</surname>
<given-names>Y.-H. H.</given-names>
</name>
<name>
<surname>Lakhotia</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Mohamed</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Hubert: self-supervised speech representation learning by masked prediction of hidden units</article-title>. <source>IEEE/ACM Trans. Audio, Speech, Lang. Process.</source> <volume>29</volume>, <fpage>3451</fpage>&#x2013;<lpage>3460</lpage>. <pub-id pub-id-type="doi">10.1109/TASLP.2021.3122291</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jain</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Srivastava</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Copyright infringement in the era of digital world</article-title>. <source>Int&#x2019;l JL Mgmt. Hum.</source> <volume>5</volume>, <fpage>1333</fpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Ji</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Loy</surname>
<given-names>C. C.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Audio-driven emotional video portraits</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2104.07452">https://arxiv.org/abs/2104.07452</ext-link>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Karras</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Aila</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Laine</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Herva</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lehtinen</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Audio-driven facial animation by joint end-to-end learning of pose and emotion</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>36</volume>, <fpage>1</fpage>&#x2013;<lpage>12</lpage>. <pub-id pub-id-type="doi">10.1145/3072959.3073658</pub-id>
</citation>
</ref>
<ref id="B39">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Kumar</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Goel</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Narang</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Hasan</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Robust one shot audio to video generation</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2012.07842">https://arxiv.org/abs/2012.07842</ext-link>.</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Kundur</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Karthik</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Video fingerprinting and encryption principles for digital rights management</article-title>. <source>Proc. IEEE</source> <volume>92</volume>, <fpage>918</fpage>&#x2013;<lpage>932</lpage>. <pub-id pub-id-type="doi">10.1109/JPROC.2004.827356</pub-id>
</citation>
</ref>
<ref id="B41">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Lahiri</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kwatra</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Frueh</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Lewis</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bregler</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Lipsync3d: data-efficient learning of personalized 3d talking faces from video using pose and lighting normalization</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2106.04185">https://arxiv.org/abs/2106.04185</ext-link>.</citation>
</ref>
<ref id="B42">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>&#x141;a&#x144;cucki</surname>
<given-names>A.</given-names>
</name>
</person-group> &#x201c;<article-title>Fastpitch: parallel text-to-speech with pitch prediction</article-title>,&#x201d; in <conf-name>Proceedings of the ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</conf-name>, <conf-loc>Toronto, ON, Canada</conf-loc>, <conf-date>June 2021</conf-date>.</citation>
</ref>
<ref id="B43">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Write-a-speaker: text-based emotional and rhythmic talking-head generation</article-title>. <source>Proc. AAAI Conf. Artif. Intell.</source> <volume>35</volume>, <fpage>1911</fpage>&#x2013;<lpage>1920</lpage>. <pub-id pub-id-type="doi">10.1609/aaai.v35i3.16286</pub-id>
</citation>
</ref>
<ref id="B44">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Pan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Hong</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>X.</given-names>
</name>
<etal/>
</person-group> &#x201c;<article-title>Expressive talking head generation with granular audio-visual control</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>New Orleans, LA, USA</conf-loc>, <conf-date>June 2022</conf-date>, <fpage>3387</fpage>&#x2013;<lpage>3396</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Mei</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Mandic</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Audioldm: text-to-audio generation with latent diffusion models</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2301.12503">https://arxiv.org/abs/2301.12503</ext-link>.</citation>
</ref>
<ref id="B46">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Chai</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cao</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Live speech portraits: real-time photorealistic talking-head animation</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>40</volume>, <fpage>1</fpage>&#x2013;<lpage>17</lpage>. <pub-id pub-id-type="doi">10.1145/3478513.3480484</pub-id>
</citation>
</ref>
<ref id="B47">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Ma</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Lv</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Styletalk: one-shot talking head generation with controllable speaking styles</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2301.01081">https://arxiv.org/abs/2301.01081</ext-link>.</citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mariooryad</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Busso</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2012</year>). <article-title>Generating human-like behaviors using joint, speech-driven models for conversational agents</article-title>. <source>IEEE Trans. Audio, Speech, Lang. Process.</source> <volume>20</volume>, <fpage>2329</fpage>&#x2013;<lpage>2340</lpage>. <pub-id pub-id-type="doi">10.1109/TASL.2012.2201476</pub-id>
</citation>
</ref>
<ref id="B49">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Mittal</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>B.</given-names>
</name>
</person-group> &#x201c;<article-title>Animating face using disentangled audio representations</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</conf-name>, <conf-loc>Snowmass, CO, USA</conf-loc>, <conf-date>March 2020</conf-date>, <fpage>3290</fpage>&#x2013;<lpage>3298</lpage>.</citation>
</ref>
<ref id="B50">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Nagrani</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Chung</surname>
<given-names>J. S.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Voxceleb: a large-scale speaker identification dataset</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1706.08612">https://arxiv.org/abs/1706.08612</ext-link>.</citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Narvekar</surname>
<given-names>N. D.</given-names>
</name>
<name>
<surname>Karam</surname>
<given-names>L. J.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>A no-reference image blur metric based on the cumulative probability of blur detection (cpbd)</article-title>. <source>IEEE Trans. Image Process.</source> <volume>20</volume>, <fpage>2678</fpage>&#x2013;<lpage>2683</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2011.2131660</pub-id>
</citation>
</ref>
<ref id="B52">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Nichol</surname>
<given-names>A. Q.</given-names>
</name>
<name>
<surname>Dhariwal</surname>
<given-names>P.</given-names>
</name>
</person-group> &#x201c;<article-title>Improved denoising diffusion probabilistic models</article-title>,&#x201d; in <conf-name>Proceedings of the International Conference on Machine Learning (PMLR)</conf-name>, <conf-loc>Virtual Event</conf-loc>, <conf-date>July 2021</conf-date>, <fpage>8162</fpage>&#x2013;<lpage>8171</lpage>.</citation>
</ref>
<ref id="B53">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Nilesh</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Deck</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Forget subtitles: youtube now dubs videos with AI-generated voices</article-title>. <ext-link ext-link-type="uri" xlink:href="https://restofworld.org/2023/youtube-ai-dubbing-automated-translation/">https://restofworld.org/2023/youtube-ai-dubbing-automated-translation/</ext-link>.</citation>
</ref>
<ref id="B54">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Oh</surname>
<given-names>T.-H.</given-names>
</name>
<name>
<surname>Dekel</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Mosseri</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Freeman</surname>
<given-names>W. T.</given-names>
</name>
<name>
<surname>Rubinstein</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Speech2face: learning the face behind a voice</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1905.09773">https://arxiv.org/abs/1905.09773</ext-link>.</citation>
</ref>
<ref id="B55">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Orero</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Torner</surname>
<given-names>A. F.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>The visible subtitler: blockchain technology towards right management and minting</article-title>. <source>Open Res. Eur.</source> <volume>3</volume>, <fpage>26</fpage>. <pub-id pub-id-type="doi">10.12688/openreseurope.15166.1</pub-id>
</citation>
</ref>
<ref id="B56">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pataranutaporn</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Danry</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Leong</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Punpongsanon</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Novy</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Maes</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Ai-generated characters for supporting personalized learning and well-being</article-title>. <source>Nat. Mach. Intell.</source> <volume>3</volume>, <fpage>1013</fpage>&#x2013;<lpage>1022</lpage>. <pub-id pub-id-type="doi">10.1038/s42256-021-00417-9</pub-id>
</citation>
</ref>
<ref id="B57">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Prajwal</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Mukhopadhyay</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Namboodiri</surname>
<given-names>V. P.</given-names>
</name>
<name>
<surname>Jawahar</surname>
<given-names>C.</given-names>
</name>
</person-group> &#x201c;<article-title>A lip sync expert is all you need for speech to lip generation in the wild</article-title>,&#x201d; in <conf-name>Proceedings of the 28th ACM International Conference on Multimedia</conf-name>, <conf-loc>Seattle, WA, USA</conf-loc>, <conf-date>October 2020</conf-date>, <fpage>484</fpage>&#x2013;<lpage>492</lpage>.</citation>
</ref>
<ref id="B58">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Radford</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>J. W.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Brockman</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>McLeavey</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Sutskever</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Robust speech recognition via large-scale weak supervision</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2212.04356">https://arxiv.org/abs/2212.04356</ext-link>.</citation>
</ref>
<ref id="B59">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Richard</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Zollh&#xf6;fer</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>De la Torre</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Sheikh</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Meshtalk: 3d face animation from speech using cross-modality disentanglement</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2104.08223">https://arxiv.org/abs/2104.08223</ext-link>.</citation>
</ref>
<ref id="B60">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Roxborough</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Netflix&#x2019;s global reach sparks dubbing revolution: &#x201c;the public demands it&#x201d;</source>. <publisher-loc>Los Angeles, California, United States</publisher-loc>: <publisher-name>The Hollywood Reporter</publisher-name>.</citation>
</ref>
<ref id="B61">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Shen</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Meng</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Difftalk: crafting diffusion models for generalized audio-driven portraits animation</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2301.03786">https://arxiv.org/abs/2301.03786</ext-link>.</citation>
</ref>
<ref id="B62">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sohl-Dickstein</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Maheswaranathan</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Ganguli</surname>
<given-names>S.</given-names>
</name>
</person-group> &#x201c;<article-title>Deep unsupervised learning using nonequilibrium thermodynamics</article-title>,&#x201d; in <conf-name>Proceedings of the International conference on machine learning (PMLR)</conf-name>, <conf-loc>Lille, France</conf-loc>, <conf-date>July 2015</conf-date>, <fpage>2256</fpage>&#x2013;<lpage>2265</lpage>.</citation>
</ref>
<ref id="B63">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Bai</surname>
<given-names>J.-X.</given-names>
</name>
</person-group> &#x201c;<article-title>Tacr-net: editing on deep video and voice portraits</article-title>,&#x201d; in <conf-name>Proceedings of the 29th ACM International Conference on Multimedia</conf-name>, <conf-loc>Virtual Event China</conf-loc>, <conf-date>October 2021</conf-date>, <fpage>478</fpage>&#x2013;<lpage>486</lpage>.
</citation>
</ref>
<ref id="B64">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Loy</surname>
<given-names>C. C.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Everybody&#x2019;s talkin&#x2019;: let me talk as you want</article-title>. <source>IEEE Trans. Inf. Forensics Secur.</source> <volume>17</volume>, <fpage>585</fpage>&#x2013;<lpage>598</lpage>. <pub-id pub-id-type="doi">10.1109/TIFS.2022.3146783</pub-id></citation>
</ref>
<ref id="B65">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Song</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Qi</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Talking face generation by conditional recurrent adversarial network</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1804.04786">https://arxiv.org/abs/1804.04786</ext-link>.</citation>
</ref>
<ref id="B66">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Spiteri Miggiani</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>English-language dubbing: challenges and quality standards of an emerging localisation trend</article-title>. <source>J. Specialised Transl.</source></citation>
</ref>
<ref id="B67">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Stypu&#x142;kowski</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Vougioukas</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zieba</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Petridis</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Pantic</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2023</year>). <article-title>Diffused heads: diffusion models beat gans on talking-face generation</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2301.03396">https://arxiv.org/abs/2301.03396</ext-link>.</citation>
</ref>
<ref id="B68">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Suwajanakorn</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Seitz</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Kemelmacher-Shlizerman</surname>
<given-names>I.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Synthesizing obama: learning lip sync from audio</article-title>. <source>ACM Trans. Graph. (ToG)</source> <volume>36</volume>, <fpage>1</fpage>&#x2013;<lpage>13</lpage>. <pub-id pub-id-type="doi">10.1145/3072959.3073640</pub-id>
</citation>
</ref>
<ref id="B69">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Taylor</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Yue</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Mahler</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Krahe</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Rodriguez</surname>
<given-names>A. G.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>A deep learning approach for generalized speech animation</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>36</volume>, <fpage>1</fpage>&#x2013;<lpage>11</lpage>. <pub-id pub-id-type="doi">10.1145/3072959.3073699</pub-id>
</citation>
</ref>
<ref id="B70">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Thies</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Elgharib</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Tewari</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Theobalt</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Nie&#xdf;ner</surname>
<given-names>M.</given-names>
</name>
</person-group> &#x201c;<article-title>Neural voice puppetry: audio-driven facial reenactment</article-title>,&#x201d; in <conf-name>Proceedings of the Computer Vision&#x2013;ECCV 2020: 16th European Conference</conf-name>, <conf-loc>Glasgow, UK</conf-loc>, <conf-date>August 2020</conf-date>, <fpage>716</fpage>&#x2013;<lpage>731</lpage>.</citation>
</ref>
<ref id="B71">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Tulyakov</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>M.-Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Kautz</surname>
<given-names>J.</given-names>
</name>
</person-group> &#x201c;<article-title>Mocogan: decomposing motion and content for video generation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, <conf-loc>Salt Lake City, UT, USA</conf-loc>, <conf-date>June 2018</conf-date>, <fpage>1526</fpage>&#x2013;<lpage>1535</lpage>.</citation>
</ref>
<ref id="B72">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vougioukas</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Petridis</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Pantic</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Realistic speech-driven facial animation with gans</article-title>. <source>Int. J. Comput. Vis.</source> <volume>128</volume>, <fpage>1398</fpage>&#x2013;<lpage>1413</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-019-01251-8</pub-id>
</citation>
</ref>
<ref id="B73">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Song</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Qian</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> &#x201c;<article-title>Mead: A large-scale audio-visual dataset for emotional talking-face generation</article-title>,&#x201d; in <conf-name>Proceedings of the ECCV</conf-name>, <conf-loc>Glasgow, UK</conf-loc>, <conf-date>August 2020</conf-date>.</citation>
</ref>
<ref id="B74">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Audio2head: audio-driven one-shot talking-head generation with natural head motion</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2107.09293">https://arxiv.org/abs/2107.09293</ext-link>.</citation>
</ref>
<ref id="B75">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Skerry-Ryan</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Stanton</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Weiss</surname>
<given-names>R. J.</given-names>
</name>
<name>
<surname>Jaitly</surname>
<given-names>N.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). <article-title>Tacotron: towards end-to-end speech synthesis</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/1703.10135">https://arxiv.org/abs/1703.10135</ext-link>.</citation>
</ref>
<ref id="B76">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Bovik</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Sheikh</surname>
<given-names>H. R.</given-names>
</name>
<name>
<surname>Simoncelli</surname>
<given-names>E. P.</given-names>
</name>
</person-group> (<year>2004</year>). <article-title>Image quality assessment: from error visibility to structural similarity</article-title>. <source>IEEE Trans. image Process.</source> <volume>13</volume>, <fpage>600</fpage>&#x2013;<lpage>612</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2003.819861</pub-id>
</citation>
</ref>
<ref id="B77">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Weitzman</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2023</year>). <source>Voice actor vs. AI voice: pros and cons</source>. <publisher-loc>St Petersburg, Florida, USA</publisher-loc>: <publisher-name>Speechify</publisher-name>.</citation>
</ref>
<ref id="B78">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Richardt</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z.-Y.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>S.-M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Photorealistic audio-driven video portraits</article-title>. <source>IEEE Trans. Vis. Comput. Graph.</source> <volume>26</volume>, <fpage>3457</fpage>&#x2013;<lpage>3466</lpage>. <pub-id pub-id-type="doi">10.1109/TVCG.2020.3023573</pub-id>
</citation>
</ref>
<ref id="B79">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Jia</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Dou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Duan</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>Q.</given-names>
</name>
</person-group> &#x201c;<article-title>Imitating arbitrary talking style for realistic audio-driven talking face synthesis</article-title>,&#x201d; in <conf-name>Proceedings of the 29th ACM International Conference on Multimedia</conf-name>, <conf-loc>Virtual Event, China</conf-loc>, <conf-date>October 2021</conf-date>, <fpage>1478</fpage>&#x2013;<lpage>1486</lpage>.</citation>
</ref>
<ref id="B80">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Tai</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Multimodal-driven talking face generation, face swapping, diffusion model</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2305.02594">https://arxiv.org/abs/2305.02594</ext-link>.</citation>
</ref>
<ref id="B81">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Yang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Shillingford</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Assael</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Large-scale multilingual audio visual dubbing</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2011.03530">https://arxiv.org/abs/2011.03530</ext-link>.</citation>
</ref>
<ref id="B82">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yao</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Fried</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Fatahalian</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Agrawala</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Iterative text-based editing of talking-heads using neural retargeting</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>40</volume>, <fpage>1</fpage>&#x2013;<lpage>14</lpage>. <pub-id pub-id-type="doi">10.1145/3449063</pub-id>
</citation>
</ref>
<ref id="B83">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Yi</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ye</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Bao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.-J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Audio-driven talking face video generation with learning-based personalized head pose</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2002.10137">https://arxiv.org/abs/2002.10137</ext-link>.</citation>
</ref>
<ref id="B84">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zeng</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Ni</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Budagavi</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2021a</year>). <article-title>Facial: synthesizing dynamic talking face with implicit attribute learning</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2108.07938">https://arxiv.org/abs/2108.07938</ext-link>.</citation>
</ref>
<ref id="B85">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Cun</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Shen</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Y.</given-names>
</name>
<etal/>
</person-group> (<year>2023</year>). <article-title>Sadtalker: learning realistic 3d motion coefficients for stylized audio-driven single image talking face animation</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2211.12194">https://arxiv.org/abs/2211.12194</ext-link>.</citation>
</ref>
<ref id="B86">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Cheng</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Xiao</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Shallow diffusion motion model for talking face generation from speech</article-title>,&#x201d; in <source>Asia-pacific web (APWeb) and web-age information management (WAIM) joint international conference on web and big data</source> (<publisher-loc>Berlin, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>), <fpage>144</fpage>&#x2013;<lpage>157</lpage>.</citation>
</ref>
<ref id="B87">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Ding</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Fan</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2021b</year>). &#x201c;<article-title>Flow-guided one-shot talking face generation with a high-resolution audio-visual dataset</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <conf-loc>Virtual Event</conf-loc>, <conf-date>June 2021</conf-date>, <fpage>3661</fpage>&#x2013;<lpage>3670</lpage>.</citation>
</ref>
<ref id="B88">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Luo</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Talking face generation by adversarially disentangled audio-visual representation</article-title>. <source>Proc. AAAI Conf. Artif. Intell.</source> <volume>33</volume>, <fpage>9299</fpage>&#x2013;<lpage>9306</lpage>. <pub-id pub-id-type="doi">10.48550/arXiv.1807.07860</pub-id>
</citation>
</ref>
<ref id="B89">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Loy</surname>
<given-names>C. C.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Pose-controllable talking face generation by implicitly modularized audio-visual representation</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2104.11116">https://arxiv.org/abs/2104.11116</ext-link>.</citation>
</ref>
<ref id="B90">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Shechtman</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Echevarria</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kalogerakis</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Makeittalk: speaker-aware talking-head animation</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>39</volume>, <fpage>1</fpage>&#x2013;<lpage>15</lpage>. <pub-id pub-id-type="doi">10.1145/3414685.3417774</pub-id>
</citation>
</ref>
<ref id="B91">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhou</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Landreth</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Kalogerakis</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Maji</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Visemenet: audio-driven animator-centric speech animation</article-title>. <source>ACM Trans. Graph. (TOG)</source> <volume>37</volume>, <fpage>1</fpage>&#x2013;<lpage>10</lpage>. <pub-id pub-id-type="doi">10.1145/3197517.3201292</pub-id>
</citation>
</ref>
<ref id="B92">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Jiang</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <article-title>CelebV-HQ: A large-scale video facial attributes dataset</article-title>. <ext-link ext-link-type="uri" xlink:href="https://arxiv.org/abs/2207.12393">https://arxiv.org/abs/2207.12393</ext-link>.</citation>
</ref>
<ref id="B93">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>X.</given-names>
</name>
</person-group> &#x201c;<article-title>Audio-driven talking head video generation with diffusion model</article-title>,&#x201d; in <conf-name>Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</conf-name>, <conf-loc>Rhodes Island, Greece</conf-loc>, <conf-date>June 2023</conf-date>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>.</citation>
</ref>
</ref-list>
</back>
</article>