Study on Automatic 3D Facial Caricaturization: From Rules to Deep Learning

Facial caricature is the art of drawing faces in an exaggerated way to convey emotions such as humor or sarcasm. Automatic caricaturization has been explored in both the 2D and 3D domains. In this paper, we propose two novel approaches to automatically caricaturize input facial scans, filling gaps in the literature in terms of user control and caricature style transfer, and exploring the use of deep learning for 3D mesh caricaturization. The first approach is a gradient-based differential deformation approach with data-driven stylization. It combines two deformation processes: facial curvature exaggeration and proportion exaggeration. The second approach is a GAN for unpaired face-scan-to-3D-caricature translation. We leverage existing facial and caricature datasets, along with recent domain-to-domain translation methods and 3D convolutional operators, to learn to caricaturize 3D facial scans in an unsupervised way. To evaluate and compare these two novel approaches with the state of the art, we conducted the first user study of facial mesh caricaturization techniques, with 49 participants. It highlights the subjectivity of caricature perception and the complementarity of the methods. Finally, we provide insights for automatically generating caricaturized 3D facial meshes.


INTRODUCTION
Caricatures have been used for centuries to convey humor or sarcasm. References can be found in Antiquity, with Aristotle referring to these artists as "grotesque," or in the works of Leonardo da Vinci, who eagerly looked for people with deformities to use as models. Caricature can be defined as the art of drawing persons (usually faces) in a simplified or exaggerated way through sketching, pencil strokes, or other artistic drawings. Caricatures have commonly been used to entertain people, to laugh at politics, or as a gift or souvenir sketched by street artists. These artists have the ability to capture distinct facial features and then exaggerate them (Redman, 1984). With the development of social VR networks and games, users may wish to use stylized avatars, including avatars that preserve their identity (Olivier et al., 2020) but with such exaggerated features. Hence, automatically generating such caricatured avatars becomes a key issue, as having artists manually create caricatured avatars would not be feasible for applications involving large numbers of users. Let us consider a 3D mesh representing the user's face (obtained either by 3D scanning or by computer vision methods that build a 3D shape from a minimal set of images). An automatic caricature system should maintain the relative geometric location of facial components while emphasizing the facial features that distinguish the subject from others. While different caricature experts would generate different styles of faces (a more or less cartoonish style, for example), they would all exaggerate the facial traits of the individual (Brennan, 1985; Liang et al., 2002; Mo et al., 2004). The ability to create a variety of plausible caricatures for each single face is therefore a key challenge when automatically generating caricatures, as different artists would create visually different caricatures; this variety should also be taken into account when evaluating the subjective quality of the results.
Previous works for the generation of 3D caricatures can be separated into two main families: interactive and automatic methods. Interactive methods offer tools to caricature experts to design the resulting caricature (Akleman, 1997; Akleman et al., 2000; Chen et al., 2002; Gooch et al., 2004), while fully automatic methods use hand-crafted rules (Brennan, 1985; Liang et al., 2002; Mo et al., 2004), often derived from the drawing procedure of artists. However, these approaches are typically restricted to a particular artistic style, e.g., sketch or a certain cartoon, and to predefined templates of exaggeration. From works in other domains, two different solutions can be envisioned to automatically generate caricatures. First, in the context of exaggerating distinct features, Sela et al. (2015) proposed a generic method to exaggerate the differences between the 3D scan of an object and an average template model of that type of object. However, this method has never been formally evaluated for human faces. Second, deep learning methods could be considered. As mentioned above, automatic methods mainly use hand-crafted rules that may fail to capture some complex choices made by caricature experts. In contrast, generative adversarial networks (GANs) are a promising means of learning these choices from a set of examples made by experts, without being limited to hand-crafted rules, but they have never been applied to the generation of 3D caricatures. The main goal of this paper is to propose and evaluate novel methods for the automatic generation of 3D caricatures from real 3D facial scans: first a rule-based method, in order to keep tunable and interpretable parameters, and second a deep learning method, to leverage real caricature data and hence generate caricatures closer to real ones. The main hypotheses we wish to address in this paper are:
H1: the specialization of generic exaggeration methods for human faces should produce convincing caricatures. To this end, we adapted the generic method proposed by Sela et al. (2015) in order to generate caricatures by exaggerating facial features from a 3D face scan (see Figure 1). This method has two main stages, one based on a curvature EDFM (Exaggerating the Difference From the Mean), and another based on a nearest-neighbors search in a 3D caricature dataset, to apply the proportion exaggeration.
H2: deep learning should overcome some of the limitations of rule-based methods through its ability to generalize from a set of examples. Thus, we designed a method leveraging advances in the field of GAN-based style transfer, which has shown great success in the 2D domain, for instance on drawn caricatures (Cao et al., 2019).
H3: both methods should reach or surpass state-of-the-art results when automatically generating caricatures from a 3D scan of a human face. To assess the advantages and disadvantages of the proposed methods, we conducted a perceptual study considering the base method proposed by Sela et al. (2015) and an additional EDFM method (Akleman and Reisch, 2004).
The results of the study support hypotheses H1 and H2, as the perceptual study demonstrated no significant preference of the subjects for any of the tested methods, for the proposed human faces. Although this result shows that the two proposed methods reached state-of-the-art performance (H3), the perceptual study did not show a clear winner, highlighting the difficulty of simulating and evaluating such artistic caricatures, for which a large variety of styles and solutions exists. The remainder of the paper is structured as follows. First, Section 2 reviews the state of the art and identifies the gaps between existing techniques. Section 3 and Section 4 present the proposed rule-based and deep learning-based caricaturization methods, respectively. Then, Section 5 presents the perceptual evaluation of the proposed methods against state-of-the-art methods. Finally, we discuss the results and provide insights on automatic caricature generation in Section 6.

RELATED WORK
Computer-assisted caricature generation has been a topic of interest for researchers since the beginning of computer graphics (Brennan, 1985). Typically, techniques from drawing guides, such as Redman's practical guide (Redman, 1984) on how to draw caricatures, are exploited. This guide sets the fundamental rules of caricature and proposes some concepts that are widely used. Among them, the "mean face assumption" implies the existence of an average face, and the process of "Exaggerating the Difference From the Mean" (EDFM) consists in emphasizing the features that make a person unique, i.e., different from the average face. Existing methods for automatic caricature generation split into two main categories: rule-based and learning-based methods.

Rule-Based Methods
Rule-based methods use a priori known procedures to caricaturize a shape. They can be further divided into two branches, depending on whether their domain of application is human faces or other shapes.
Face rule-based methods follow caricature drawing guidelines (e.g., EDFM) to generate deformed faces with emphasized features. Brennan (1985) first proposed an implementation of EDFM in two dimensions. They built an interactive system where a user can select facial feature points, which are matched against the average feature points; the distance between them is then exaggerated. This algorithm was later extended by Akleman et al. in the 2D and 3D domains (Akleman, 1997; Akleman and Reisch, 2004). Their software relies on a low-level procedure which requires the user to decide whether the exaggeration of a feature increases likeness or not. In the same spirit, Fujiwara et al. (2002) developed a piece of software named PICASSO for automatic 3D caricature generation. They used a set of feature points to generate simplified 3D faces before performing EDFM. EDFM was also used by Blanz and Vetter (1999) in an application example of their morphable model. They learn a principal component analysis (PCA) space from 200 3D textured faces. Their system allows caricature generation by increasing the distance to the statistical mean in terms of geometry and texture. Statistical dispersion has been taken into account by Mo et al. (2004), who showed that features should be emphasized proportionally to their standard deviation to preserve likeness. Chen et al. (2006) created 3D caricatures by fusing 2D caricatures generated using EDFM from different views. Redman's guide (Redman, 1984) not only introduces EDFM but also high-level concepts such as the five head types (oval, triangular, squared, round, and long) and the dissociation between local and global exaggeration. These concepts have been exploited by Liu et al. (2012) to perform photo-to-3D-caricature translation. They applied EDFM with respect to the shape of the head (global scale) and to the distance ratios of a set of feature points (local scale). Face rule-based methods can generate a caricature from an input photograph or a 3D model but fail at reproducing artistic styles: different caricaturists would make different caricatures of the same person. To mitigate this issue, they usually provide user control at a relatively low level of comprehension, which often requires artistic knowledge.
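The EDFM rule discussed above can be illustrated with a minimal Python sketch that moves a set of feature points away from a mean face. The function name `edfm`, the exaggeration factor, and the toy feature points are assumptions made for illustration; they are not taken from any of the cited systems.

```python
import numpy as np

def edfm(face, mean_face, factor):
    """Exaggerate the Difference From the Mean.

    face, mean_face: (n_points, dim) arrays of corresponding feature points.
    factor: 1.0 leaves the face unchanged, values above 1.0 exaggerate it.
    """
    face = np.asarray(face, dtype=float)
    mean_face = np.asarray(mean_face, dtype=float)
    return mean_face + factor * (face - mean_face)

# Toy example with three hypothetical 2D feature points.
mean_face = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
face = np.array([[0.0, 0.0], [1.2, 0.0], [0.5, 1.3]])
caricature = edfm(face, mean_face, factor=2.0)
```

Points that coincide with the mean stay in place, while points that deviate from it are pushed further in the same direction.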
Non-face-specific rule-based methods rely on intrinsic or extracted features of geometrical shapes. They generalize the concept of caricature beyond the domain of human faces. Eigensatz et al. (2008) developed a 3D shape editing technique based on the manipulation of principal curvatures. With no reference model, their method can enhance or reduce the sharpness of a 3D shape. The link between saliency and caricature has been explored by Cimen et al. (2012). They introduced a perceptual method for caricaturing 3D shapes based on their saliency using a free-form deformation technique. A computational approach for surface caricaturization has been presented by Sela et al. (2015). They locally scale the gradient field of a mesh by its absolute Gaussian curvature. A reference mesh can be provided to follow the EDFM rule, and the authors show that their method is invariant to isometries, i.e., invariant to poses. General shape rule-based methods can also caricature a 2D or 3D shape without any reference model. As they take into account neither statistical information nor the concept of artistic style, they try to link low-level geometry information to high-level caricature concepts, e.g., the fact that the most salient area should be more exaggerated (Cimen et al., 2012). As a result, they take into account neither the semantics of faces nor the art of human face caricature.
Since this work only tackles human face caricaturization, we refer to "face rule-based methods" as simply "rule-based methods".

Learning-Based Methods
Existing learning-based methods for caricature generation can use both paired and unpaired data as training material.
Supervised data-driven methods automatically find rules by relying on pairs of exemplars to learn a mapping between the domain of normal faces and the domain of caricatures. Xie et al. (2009) proposed a framework that learns a PCA model over 3D caricatures and a Locally Linear Embedding (LLE) model over 2D caricatures, both made by artists. The user can manually create a deformation that is projected into the PCA subspace and refined using the LLE model. Li et al. (2008) and Liu et al. (2009) both focused on learning a mapping between the LLE representation of photographs and the corresponding LLE representation of 3D caricatures modeled by artists. In the same vein, but only in the 3D domain, Zhou et al. (2016) regressed a set of locally linear mappings from sparse exemplars of 3D faces and their corresponding 3D caricatures. As far as we know, Clarke et al. (2011) are the only authors to have proposed a physics-oriented caricature method. They capture the artistic style of 2D caricatures by learning a pseudo stress-strain model which describes physical properties of virtual materials. All these data-driven approaches are based on paired datasets, which require the work of 2D or 3D artists. Such datasets are costly to produce; therefore, techniques of this kind are hardly applicable.
Unsupervised learning-based methods learn how to caricature from unpaired face and caricature exemplars. Chen et al. (2001) and Liang et al. (2002) generated 2D caricatures by learning a nonlinear mapping between photos and corresponding caricatures made by artists. Derived from the image synthesis literature, where they have been used for unpaired one-to-one translation (Liu et al., 2017; Taigman et al., 2017; Yi et al., 2017; Zhu et al., 2017) or unpaired many-to-many translation (Huang et al., 2018b; Liu et al., 2019; Choi et al., 2020), Generative Adversarial Networks (GANs) have also shown impressive results on mesh synthesis and mesh-to-mesh translation (Goodfellow et al., 2014). Other approaches achieve 2D stylization using 3D priors and a differentiable renderer (Wang et al., 2021). Cao et al. (2019) proposed a photo-to-2D-caricature translation framework, CariGANs, based on a large dataset of over 6,000 labeled 2D caricatures (Huo et al., 2018) and two GANs, namely CariGeoGAN for geometry exaggeration using landmark warping, and CariStyGAN for stylization. CariStyGAN can use a reference graphic style; otherwise, it generates a random style. This framework was first extended by Shi et al. (2019) with a feature point-based warping for geometric exaggeration, then by Gu et al. (2021), who provide a random set of deformation styles in addition to the random set of graphic styles, offering substantial user control. In the 3D domain, Wu et al. (2018) and then Cai et al. (2021) proposed robust methods for 3D caricature reconstruction from meshes, enlarging the set of available in-the-wild 3D caricatures when used in combination with WebCaricature (Huo et al., 2018). Guo et al. (2019) showed an approach for producing expressive 3D caricatures from photos using a VAE-CycleGAN. Ye et al. (2020) proposed an end-to-end method for 3D caricature generation from photos, using a GAN-based architecture with two symmetrical generators and discriminators. A texture stylization step is performed with CariStyGAN. The recent works on caricature generation in the 3D domain can reproduce the style of artists, but they do not feature much user control. Ye et al. (2020) introduced Facial Shape Vectors so that the user can choose the facial proportions of the caricature, but this is a fairly low-level interaction and thus should be done by an artist. These works also show a weakness stemming from the use of CariStyGAN for texture stylization. CariStyGAN tends to emphasize the shadows and light spots of the photos in order to make the reliefs sharper. In the case of textured 3D models, the shadows and light spots should be induced by the geometry and the lighting conditions, not by the texture albedo. If lighting information is entangled with texture information, changing the lighting conditions can make the 3D model appear to be lit by non-existent lights.
Adopting a 3D mesh representation requires the application of mesh convolutions defined on non-Euclidean domains (i.e., geometric deep learning methodologies). Over the past few years, the field of geometric deep learning has received significant attention (Litany et al., 2017; Maron et al., 2017). Methods relevant to this paper are auto-encoder structures such as those used by Ranjan et al. (2017) and Gong et al. (2019), which showcase the efficiency of recent 3D convolutional operators at capturing the distribution of 3D facial meshes. Several approaches resort to mapping 3D faces to a 2D domain and using 2D convolution operators (Moschoglou et al., 2020). Projecting a 3D surface to a 2D plane for 2D convolutions requires locally deforming distances, which translates to higher computing and memory costs compared to recent 3D convolution approaches, and to some loss of high-frequency information (Gong et al., 2019).
Deep learning-based approaches, leveraging recent advancements in the field, could produce caricatures more similar to those produced by professionals, and allow global style control using handmade caricatures as style examples. Conversely, a user-controlled rule-based approach would allow fine-tuned local control of the facial mesh deformation. We develop both approaches in Section 3 and Section 4. Finally, there is no overall perceptual user study in this specific field, which limits any qualitative comparison between approaches. We present the first study of this kind in Section 5, in order to evaluate the strengths and drawbacks of these two novel methods in comparison to two state-of-the-art approaches.

RULE-BASED USER-CONTROLLED CARICATURIZATION
We present a novel method featuring short computation time and providing meaningful user control over the generated caricatures. It is based on two main modules depicted in Figure 2 (in green and in yellow). First, a curvature exaggeration module (in green) enhances the facial lines by applying the EDFM technique to the main PCA scores of the mesh gradients of the input face. This emphasizes only the 3D surface details such as ridges, peaks, and folds, and does not affect the global shape of the face (such as the relative positions of the eyes, nose, and mouth). Second, a proportion exaggeration module (in yellow) leverages compositions of real artists (see Section 3.1) to caricature the general shape of the face. It projects the input face into a 3D caricature shape space using a kNN regressor. This process applies a smooth and large-scale deformation to the input face while preserving its local features. The curvature exaggeration and proportion exaggeration modules are thus complementary. They are combined to provide the user with a bilateral control (small scale versus large scale) over the resulting caricature. Lastly, an optional texture blurring and contrast enhancement module (in pink) makes the resulting caricature less realistic and more graphic. The reason behind this step is to make the result more acceptable for human observers. We use texture blurring because, as shown by Zell et al. (2015), it increases the appeal and lowers the eeriness of a virtual character. The increase in contrast is meant to make the caricatures less realistic, but another technique could have been used to this end. In addition to these modules, our user-controlled method features semantic mesh segmentation in four regions (see Section 3.2). In total, the method exposes ten knobs to the user.

Datasets
Realistic 3D faces were sampled from the LSFM dataset (Booth et al., 2016), which contains nearly 10k distinct 3D faces. In order to have textured meshes, we completed this set with 300 in-house 3D face scans. Their topologies are unified through automatic facial landmarking and geometry fitting (Danieau et al., 2019). To build our 3D caricatured mesh dataset, we run the 2D-to-3D caricature inference method of Cai et al. (2021) on the WebCaricature dataset (Huo et al., 2018), which extracts a 3D caricatured face mesh from each 2D image. The WebCaricature dataset contains over 6k 2D caricatures. When Cai's algorithm did not successfully estimate a face, due to extreme drawing composition (quick sketches, incomplete drawings, drafts, cubism, etc.), the generated output remained the same default caricature mesh. All faces were then registered in order to have a fixed topology (Sumner and Popović, 2004).

Facial Segmentation
In face modeling, cartoonization, and caricaturing, semantic segmentation is a popular technique for increasing expressivity and user interaction (Blanz and Vetter, 1999; Liu et al., 2009; Zhou et al., 2016). In the proposed system, the 3D faces are segmented using the scheme proposed by Blanz and Vetter (1999), i.e., into four regions: the eyes, the nose, the mouth, and the rest of the face. This semantic segmentation allows the user to choose whether or not to emphasize a facial part. In total, the method exposes ten knobs to the user: one scalar is used for the strength of the gradient EDFM and another one for the amount of deformation from the kNN regressor to be added. Those two weights are tunable for each of the five regions (four masks and full face). Segmenting the domain also breaks the inherent linearity of PCA by learning different subspaces.
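The ten user parameters described above can be pictured as two weights per region. The following Python sketch shows one possible layout; the class names, region labels, and default values are assumptions for illustration and do not come from the described system.

```python
from dataclasses import dataclass, field

# Four segmentation masks plus the full face (labels are hypothetical).
REGIONS = ("eyes", "nose", "mouth", "rest", "full_face")

@dataclass
class RegionKnobs:
    f_grad: float = 0.0  # strength of the gradient EDFM (assumed neutral default)
    f_prop: float = 0.0  # amount of kNN deformation to add (assumed neutral default)

@dataclass
class CaricatureKnobs:
    regions: dict = field(
        default_factory=lambda: {name: RegionKnobs() for name in REGIONS}
    )

    def count(self):
        """Total number of scalar knobs exposed to the user."""
        return 2 * len(self.regions)

knobs = CaricatureKnobs()
knobs.regions["nose"].f_prop = 1.5  # exaggerate only the proportions of the nose
```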

Curvature Exaggeration
To emphasize the small-scale features of the input 3D face, the curvature exaggeration module performs EDFM on the mesh gradient. In the process, we use PCA as a means to reduce high frequencies (Figure 3).
• Offline preprocessing. The edge-based gradient operator E (see Supplementary Material) is used to compute the gradients g of each face mesh s of our custom 3D face dataset (Section 3.1). Following the results of Mo et al. (2004), which show that low-variance features should be given more weight, the gradients G are standardized: $G_{std} = (G - \bar{g}) / \sigma_G$. Then, a PCA is performed on the standardized gradients, leading to the principal components W and the PCA scores t such that $t = g_{std} \cdot W$.
• Runtime curvature exaggeration. The gradient of the input face mesh s is standardized and then projected into the PCA space learnt offline. The EDFM technique is applied with a factor $f_{grad}$ given by the user. To prevent noise, we weight the result by the normalized standard deviation $\sigma$. The exaggerated PCA scores are obtained as $\hat{t} = t \cdot \max(f_{grad} \cdot \sigma, 1)$. The exaggerated gradient is then recovered as $\hat{g} = \bar{g} + \sigma_G \cdot (\hat{t} \cdot W^T)$. The mesh $\hat{s}$ with exaggerated gradients is eventually reconstructed in the least-squares sense, keeping the border vertices fixed (the border of the eyes, the nostrils, the inner lips, and the contour of the head), as described in Supplementary Material.
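The two steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: gradients are taken as already computed and flattened into vectors, the function names are hypothetical, the "normalized standard deviation" is assumed to be the per-component standard deviation of the PCA scores divided by its maximum, and the final least-squares mesh reconstruction is omitted.

```python
import numpy as np

def fit_gradient_pca(G, n_components):
    """Offline step. G: (n_faces, d), one flattened gradient vector per face."""
    g_mean = G.mean(axis=0)
    g_std = G.std(axis=0) + 1e-12            # guard against division by zero
    G_std = (G - g_mean) / g_std             # standardized gradients (zero mean)
    # PCA through the SVD of the standardized data.
    _, S, Vt = np.linalg.svd(G_std, full_matrices=False)
    W = Vt[:n_components].T                  # principal components, (d, n_components)
    score_std = S[:n_components] / np.sqrt(len(G))  # std of each PCA score
    sigma_norm = score_std / score_std.max()        # assumed normalization
    return g_mean, g_std, W, sigma_norm

def exaggerate_gradient(g, g_mean, g_std, W, sigma_norm, f_grad):
    """Runtime step: EDFM applied to the PCA scores of one gradient vector."""
    t = ((g - g_mean) / g_std) @ W
    t_hat = t * np.maximum(f_grad * sigma_norm, 1.0)
    return g_mean + g_std * (t_hat @ W.T)

# Toy data: 50 "faces" with 6 gradient values each (random, for illustration).
rng = np.random.default_rng(0)
G = rng.normal(size=(50, 6))
g_mean, g_std, W, sigma_norm = fit_gradient_pca(G, n_components=6)
g_hat = exaggerate_gradient(G[0], g_mean, g_std, W, sigma_norm, f_grad=3.0)
```

Keeping fewer components than the data dimension is what would provide the denoising effect: the discarded components never reach the reconstructed gradient.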

Proportion Exaggeration
The proportion exaggeration module leverages the 3D caricatures (see Section 3.1) to sample a deformation that matches the input face's difference from the mean using a kNN regressor. Thus, it can be seen as an example-based version of EDFM. We argue that the sampled deformation contains mainly low frequencies, so adding it to an input face modifies its surface curvatures very little. We observed that the 3D caricatures have more diverse global shapes than our 3D faces while being much smoother. In addition, the kNN regression also helps smooth the deformation by averaging the k nearest neighbors. The process works as follows:
• Offline preprocessing. The 3D caricatures are first standardized using the standard deviation of our 3D faces to make the low-variance areas more important (Mo et al., 2004). Then, we fit a kNN regressor using a cosine distance metric, as we mainly seek to find directions of deformation rather than amplitudes of deformation. The amplitude tuning is reserved for the user.
• Runtime proportion exaggeration. The input face is standardized and then projected into the 3D caricature space with the kNN regressor using barycentric weights. The obtained deformation $\delta_{std}$ is weighted by the 3D face standard deviation $\sigma_S$ and by a user-defined scalar $f_{prop}$ for amplitude tuning. Eventually, we add this deformation to the curvature-exaggerated face $\hat{s}$ to get the vertex positions of the resulting caricature c: $c = \hat{s} + f_{prop} \cdot \sigma_S \cdot \delta_{std}$.
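As a rough illustration of the process above, the sketch below implements a cosine-distance kNN regression with NumPy. The function names are hypothetical, meshes are assumed to be flattened into vectors, and the "barycentric weights" are assumed here to be inverse-distance weights normalized to sum to one; the actual weighting scheme may differ.

```python
import numpy as np

def knn_deformation(x_std, C_std, k):
    """Weighted average of the k caricatures closest to x_std in cosine distance.

    x_std: (d,) standardized input face; C_std: (n, d) standardized caricatures.
    """
    eps = 1e-12
    x_unit = x_std / (np.linalg.norm(x_std) + eps)
    C_unit = C_std / (np.linalg.norm(C_std, axis=1, keepdims=True) + eps)
    dist = 1.0 - C_unit @ x_unit          # cosine distance to every caricature
    idx = np.argsort(dist)[:k]            # indices of the k nearest neighbors
    w = 1.0 / (dist[idx] + eps)           # assumed inverse-distance weights
    w /= w.sum()                          # weights sum to one
    return w @ C_std[idx]

def proportion_exaggeration(s_hat, s, s_mean, sigma_S, C_std, k, f_prop):
    """c = s_hat + f_prop * sigma_S * delta_std, with delta_std from the kNN."""
    x_std = (s - s_mean) / sigma_S
    delta_std = knn_deformation(x_std, C_std, k)
    return s_hat + f_prop * sigma_S * delta_std

# Toy example in three dimensions (values are illustrative only).
C_std = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
s_mean, sigma_S = np.zeros(3), np.ones(3)
s = np.array([0.9, 0.1, 0.0])
c = proportion_exaggeration(s, s, s_mean, sigma_S, C_std, k=1, f_prop=0.5)
```

Because the cosine distance ignores vector norms, the neighbors are selected by direction of deformation only, and the amplitude is left to `f_prop`.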

Results
In this section, the results of both the curvature exaggeration module and the proportion exaggeration module are presented and compared to those of their most similar existing approaches. We compare the curvature exaggeration module to Sela et al. (2015) because they fix the positions of border vertices and therefore tend to preserve the proportions of the caricatured faces. Our proportion exaggeration module is compared to the baseline 3D position EDFM introduced in the seminal work of Blanz and Vetter (1999).
• Curvature exaggeration module. The benefit of the PCA-based denoising mechanism is visible in Figure 3 when comparing column b) with columns c) and d). Without PCA, the EDFM technique magnifies the existing high frequencies of the face's difference from the mean. With PCA, the noise is removed but the exaggeration of facial lines remains. The use of a segmented model not only provides more user control, but also emphasizes the curvatures more locally. This effect can be noticed when comparing the results c) and d) in Figure 3. Sela et al. (2015)'s method successfully preserves the position of the eyes, the nostrils, the inner lips, and the contour of the face. However, other parts such as the nose, the lips, and the chin seem greatly inflated and displaced, which should not be part of facial line enhancement. Conversely, our curvature exaggeration module modifies the vertex positions such that it only enhances the fine curvature details.
• Proportion exaggeration module. Figure 4 shows the effect of modifying k on the results of our proportion exaggeration module. Visually, the parameter k of the kNN regressor has less impact than we expected. However, it appears that a small value of k (≤ 5) tends to introduce high frequencies and vertex entanglement, while large values of k (≥ 1000) seem to produce less vivid results. We fixed k = 40 in our experiments.
The semantic segmentation also has an impact on our proportion exaggeration module. In Figure 5, the results with segmentation (column c) seem more caricatural but also more expressive than without segmentation (column b). Expressiveness is not intended by the proposed method, since the focus is on neutral-expression caricature generation. Nevertheless, we decided to keep the segmentation scheme for the proportion exaggeration module. We also compare the proportion exaggeration algorithm to the baseline PCA-based EDFM on 3D coordinates proposed by Blanz and Vetter (1999) (column d). Our method clearly generates more diverse and inhomogeneous shapes than Blanz and Vetter (1999)'s approach. It is also noticeable that fewer high-frequency details are added than with the baseline method, which is what we aim for.

DEEP LEARNING BASED AUTOMATIC CARICATURIZATION
Rule-based methods allow the use of controllable and interpretable parameters, but are limited in their ability to capture information about caricature styles. Supervised learning-based methods require large paired mesh-to-caricature datasets, which are costly to build in terms of both time and resources. Instead, we consider the case of an unpaired learning-based approach, taking advantage of our 3D datasets of both neutral and caricatured faces (Cai et al., 2021) (cf. Section 3.1). Our network architecture is based on the shared content space assumption of Liu et al. (2019), which we adapt to the context of 3D data through the use of the 3D convolutions of Gong et al. (2019), which define 3D convolution neighborhoods.

Framework Overview
Let us consider meshes of different styles (e.g., scans and caricatures), all sharing the same mesh topology. We represent our faces with raw 3D coordinates, and encode them using a recent 3D convolutional operator (Gong et al., 2019). Given a mesh x ∈ X and an arbitrary style y ∈ Y, our goal is to train a single generator G that can generate diverse meshes of each style y that correspond to the mesh x. We generate style-specific vectors in the learned space of each style and train G to reflect these vectors. Figure 6 illustrates an overview of our framework, which consists of three modules described below.
Generator. Our generator G translates an input mesh x into an output mesh G(x, s) reflecting a style-specific style code s, which is provided by the style encoder E. We use adaptive instance normalization (AdaIN) (Huang and Belongie, 2018a) to inject s into G. We observe that s can represent any style, which removes the necessity of providing y to G and allows G to synthesize meshes of all domains.
Style encoder. Given a mesh x, our encoder E extracts the style code s = E(x). Similar to Liu et al. (2019), our style encoder benefits from the multi-task learning setup. E can produce diverse style codes using different reference meshes. This allows G to synthesize an output mesh reflecting the style code s of a reference mesh x.
Discriminator. Our discriminator D is a multitask discriminator (Mescheder et al., 2018; Liu et al., 2019; Choi et al., 2020), which consists of multiple output branches. Each branch D_y learns a binary classification determining whether a mesh x is a mesh from the dataset of style y or a fake mesh G(x, s) produced by G.
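To make the style injection step more concrete, the following NumPy sketch shows the standard AdaIN operation applied to per-vertex features. It is only an illustration: the affine map from the style code to the scale and shift parameters (`style_to_affine`, with random `A` and `b`) stands in for a learned layer, and the feature and style dimensions are arbitrary assumptions.

```python
import numpy as np

def adain(features, gamma, beta, eps=1e-5):
    """Adaptive instance normalization.

    features: (n_vertices, n_channels) activations of one mesh.
    gamma, beta: (n_channels,) scale and shift derived from the style code.
    """
    mu = features.mean(axis=0, keepdims=True)     # per-channel mean over vertices
    sigma = features.std(axis=0, keepdims=True)   # per-channel std over vertices
    return gamma * (features - mu) / (sigma + eps) + beta

def style_to_affine(s, A, b):
    """Hypothetical learned affine map from a style code to (gamma, beta)."""
    h = A @ s + b
    gamma, beta = np.split(h, 2)
    return gamma, beta

rng = np.random.default_rng(0)
features = rng.normal(size=(100, 4))   # 100 vertices, 4 channels (toy sizes)
s = rng.normal(size=8)                 # style code, e.g. s = E(x_ref)
A = rng.normal(size=(8, 8))            # maps the style code to 2 * n_channels values
b = np.zeros(8)
gamma, beta = style_to_affine(s, A, b)
out = adain(features, gamma, beta)
```

After this operation, the per-channel statistics of the features are dictated by the style code rather than by the input mesh, which is how a reference caricature can steer the generator.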

Training Objectives
Given a mesh x ∈ X and its original style y ∈ Y, we train our framework using the following objectives:
• Adversarial objective. During training, we sample a mesh a and generate its style code s = E(a). The generator G takes a mesh x and s as inputs and learns to generate an output mesh G(x, s) that is indistinguishable from real meshes of the style y, via a classical adversarial loss $\mathcal{L}_{adv}$ (Arjovsky et al., 2017), where $D_y(\cdot)$ denotes the output of D corresponding to the style y.
• Reconstruction and cycle losses. To guarantee that the generated mesh G(x, s) properly preserves the style-invariant characteristics (e.g., identity) of its input mesh x, we employ the cycle consistency loss $\mathcal{L}_{cyc}$ (Kim et al., 2017; Zhu et al., 2017; Choi et al., 2018), where $\hat{s} = E_y(x)$ is the estimated style code of the input mesh x, and $\tilde{y}$ and $\tilde{s}$ are the style and estimated style code of a mesh other than x. By encouraging the generator G to reconstruct the input mesh x with its estimated style code $\hat{s}$, G learns to preserve the original characteristics of x while changing its style faithfully. With a similar goal of preserving style-invariant characteristics, we use a reconstruction loss $\mathcal{L}_{r}$, where $\hat{s} = E_y(x)$ is the estimated style code of the input mesh x.
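• Full objective. Our objective function can be summarized as follows: $\mathcal{L} = \mathcal{L}_{adv} + \lambda_r \mathcal{L}_{r} + \lambda_{cyc} \mathcal{L}_{cyc}$, where $\mathcal{L}_{adv}$, $\mathcal{L}_{r}$, and $\mathcal{L}_{cyc}$ denote the adversarial, reconstruction, and cycle losses, and $\lambda_r$ and $\lambda_{cyc}$ are hyperparameters weighting the latter two terms. We use the Adam optimizer (Kingma and Ba, 2015).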
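For readers who want concrete forms of the cycle and reconstruction terms, unpaired translation frameworks commonly write them with an L1 norm. The following is a sketch under that assumption, using the notation of this section; the choice of norm and the exact expectations are not confirmed by the text and may differ from the expressions actually used.

```latex
% Assumed L1 formulation, common in unpaired translation frameworks.
\mathcal{L}_{cyc} = \mathbb{E}_{x,\tilde{s}}
    \left[ \left\| x - G\big(G(x,\tilde{s}),\, \hat{s}\big) \right\|_1 \right],
\qquad \hat{s} = E_y(x)

\mathcal{L}_{r} = \mathbb{E}_{x}
    \left[ \left\| x - G(x, \hat{s}) \right\|_1 \right]
```

In the first term, the mesh is translated with the style code of another mesh and then mapped back with its own estimated style code; in the second, it is passed through G directly with its own style code, so both terms penalize any loss of style-invariant content.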

Results
We trained the network for 50k iterations on a Titan X Pascal (4 h, 8 GB). Results of the approach are visible in Figure 7. The original faces (top row) are encoded using the network illustrated in Figure 6 along with a random caricature from the dataset, producing the caricatured face (bottom row). Facial proportions are hence exaggerated according to the distribution of the neutral and caricatured faces learned during the training stage.

USER STUDY
In order to assess the subjective quality of the caricatures generated by the previously described methods, we conducted a perceptual study. The goal of the perceptual study was to subjectively rank the generated caricatures based on their perceived quality. In addition to the two methods described in Section 3 and Section 4, we also considered two baseline methods: the method from Sela et al. (2015) and an EDFM method (Blanz and Vetter, 1999).

Participants
Forty-nine participants took part in the experiment (9 females). They were between 18 and 63 years old (mean ± SD: 31.0 ± 11.3) and were recruited from our laboratory among students and staff. They were all naive to the purpose of the experiment, had normal or corrected-to-normal vision, and gave written and informed consent. The study conformed to the Declaration of Helsinki. Participants were not compensated for their participation, and none of the participants knew the human faces used in the study.

Stimuli
The top part of Figure 7 presents the 12 human face scans (Identity factor) used in the study (four females, eight males). They were caricatured using five different approaches (Method factor): the learning-based approach (Deep) presented in Section 4, two variations of the rule-based approach presented in Section 3 (Geo.1 and Geo.2, see Table 1), and two state-of-the-art caricaturization methods, EDFM (Blanz and Vetter, 1999) and Sela (Sela et al., 2015). For each face (original and caricatured), we used the cartoonization module presented in Section 3. The texture blurring is expected to reduce the mismatch of realism between the shape and the texture and therefore make the caricature more acceptable to human observers (Zell et al., 2015). The stimuli were rendered with a rotation of 30° around the vertical axis, with a fixed view. The angle was chosen as an intermediate viewpoint between a frontal and a profile view. We considered only the facial mask, hence other facial attributes such as eyes and hair were not displayed.

Protocol
The perceptual study consisted of two parts. The first part of the study assessed the results produced by each method for each face, according to participants' preferences. For each human facial scan, participants were presented with the original face and the caricatures generated with the five methods. They were asked to rank all five caricatures from best to worst. The order of the scans and the presentation order of the caricatures were randomized independently for each participant, and each facial scan was presented only once, for a total of 12 trials. The second part of the study aimed at globally evaluating each of the five methods. For each method (in random order), the caricaturization results (12 facial scans) were displayed at once. Participants were asked to indicate how much they agreed with three statements using 5-point Likert scales. The statements were "They preserve the identity of the person," "They correspond to what would be expected of a caricature," and "I like the results." There was no time limit for either part, and the evaluation was conducted online using the PsyToolkit software (Stoet, 2010, 2017). We include a sample view of the ranking task in Supplementary Figure 22. Renders of all 12 caricatures for each method can be seen in Supplementary Figures 17-19, 21.
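The randomization of the ranking task can be illustrated with a small sketch. The study itself was run in PsyToolkit, so the Python code below is not the actual implementation: the function name `build_trials`, the per-participant seeding scheme, and the dictionary layout are hypothetical and only show how scan order and on-screen caricature order can be shuffled independently for each participant.

```python
import random

# Method labels as used in the study; scan identifiers 1..12 are illustrative.
METHODS = ["Deep", "Geo.1", "Geo.2", "EDFM", "Sela"]
N_SCANS = 12

def build_trials(participant_seed, methods=METHODS, n_scans=N_SCANS):
    """Return one participant's ranking trials.

    Each scan appears exactly once, in random order, and the on-screen
    order of the caricatures is shuffled independently for every trial.
    """
    rng = random.Random(participant_seed)  # assumed: one generator per participant
    scan_order = list(range(1, n_scans + 1))
    rng.shuffle(scan_order)
    trials = []
    for scan_id in scan_order:
        layout = list(methods)  # copy so the shared list is never mutated
        rng.shuffle(layout)
        trials.append({"scan": scan_id, "layout": layout})
    return trials

trials = build_trials(participant_seed=7)
```

Seeding the generator per participant keeps each session reproducible, which makes it possible to check afterwards which layout a given participant actually saw.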

Results
We present in this section the statistical results of the user study.

Average Rankings
To analyze ranking distributions (Figures 9, 10), we first performed a Friedman test with the within-subject factor Method (using the average rank across all 12 scans). We found an effect of Method on average ranking (χ² = 12.21; p < 0.05). The effect was then explored further using Wilcoxon post-hoc tests for pairwise comparisons. We found significant differences only between EDFM and each of Deep, Geo.1, and Sela (all p < 0.05). Per method, average rankings vary between 2.81 (EDFM) and 3.12 (Deep) (Figure 10). In order to determine whether ranking distributions per method differed across identities, we used a Friedman test with within-subject factors Method and Identity. Out of 12 distinct identities, 6 (identities 2, 5, 6, 7, 11, 12) showed significantly different rankings between methods. In most cases (5 out of 6), this is due to worse-than-average performance from a subset of methods, usually Deep or Sela.
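For readers unfamiliar with the Friedman test, the statistic reported above can be computed as in the following minimal sketch. It is written in plain Python, assumes the input is a table `data[participant][method]` of average ranks, and uses made-up toy numbers rather than the study data. The helper names `row_ranks` and `friedman_chi2` are illustrative.

```python
def row_ranks(values):
    """Rank one participant's scores (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def friedman_chi2(data):
    """Friedman chi-square statistic for data[participant][method], with tie correction."""
    n, k = len(data), len(data[0])
    ranked = [row_ranks(row) for row in data]
    rank_sums = [sum(r[j] for r in ranked) for j in range(k)]
    stat = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    # Tie correction: sum of t^3 - t over every group of t tied values within a row.
    ties = 0
    for row in data:
        counts = {}
        for v in row:
            counts[v] = counts.get(v, 0) + 1
        ties += sum(t ** 3 - t for t in counts.values())
    return stat / (1.0 - ties / (n * k * (k * k - 1)))

# Toy example (not study data): 4 participants, 3 methods, average rank per method.
toy = [[2.8, 3.1, 3.0], [2.5, 3.2, 3.3], [2.9, 3.0, 3.1], [2.7, 3.4, 2.9]]
chi2 = friedman_chi2(toy)  # compared against a chi-square distribution with k - 1 degrees of freedom
```

In practice, `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon` provide the omnibus and pairwise tests together with their p-values; the sketch only makes the rank-sum computation explicit.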

Top Rankings
We measured Top-1, Top-2, and Top-3 rank differences per method using Friedman tests, the Top-X ranking of a technique being the number of times it was ranked at position X or better (rank 1 being the best; Figure 11). We found no significant differences for Top-1 rankings (χ² = 4.14; p = 0.38), but an effect was found for both Top-2 (χ² = 9.74; p < 0.05) and Top-3 rankings (χ² = 34.60; p < 0.001). The effect for Top-2 and Top-3 rankings was then explored using Wilcoxon post-hoc tests. For Top-2 rankings, we found that EDFM was chosen significantly more often as first or second choice than Deep (p < 0.05) and Sela (p < 0.01). For Top-3 rankings, we found a similar preference for EDFM over Deep, Geo.1, and Sela (p < 0.05), as well as a significantly lower preference for Sela compared with all others (p < 0.05).
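The Top-X measure can be made concrete with a short sketch. It assumes each trial is stored as a mapping from method name to assigned rank (1 = best). The function name `top_x_counts` and the three toy trials are illustrative and are not taken from the study data.

```python
def top_x_counts(rankings, x):
    """Count, per method, how many trials placed it at rank x or better (rank 1 is best)."""
    counts = {}
    for trial in rankings:
        for method, rank in trial.items():
            counts[method] = counts.get(method, 0) + (1 if rank <= x else 0)
    return counts

# Toy trials (made up): each dict is one participant's ranking of one facial scan.
toy = [
    {"Deep": 1, "Geo.1": 2, "Geo.2": 3, "EDFM": 4, "Sela": 5},
    {"Deep": 5, "Geo.1": 3, "Geo.2": 1, "EDFM": 2, "Sela": 4},
    {"Deep": 4, "Geo.1": 1, "Geo.2": 5, "EDFM": 2, "Sela": 3},
]
top2 = top_x_counts(toy, 2)
# top2 == {"Deep": 1, "Geo.1": 2, "Geo.2": 1, "EDFM": 2, "Sela": 0}
```

Computing these counts per participant gives one value per method and participant, which is the kind of within-subject table a Friedman test operates on.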

Variations Between Participants
We looked into participant-wise preferences for caricature methods using a Friedman test on the ranking choices of each participant individually. Out of 49 participants, separate Friedman tests on their Top-1 rankings showed that only 12 had a significant preference towards a set of methods, and of these only 4 towards a specific one. These numbers are too low to draw any conclusion in that regard.

Subjective Scores
Subjective ratings were analyzed separately for each question using a one-way ANOVA with within-subject factor Method. Differences between methods were found to be significant for all three questions (p values of 5.7 × 10⁻⁶, 7.35 × 10⁻⁶, and 2.28 × 10⁻⁵). We conducted separate post-hoc analyses using Wilcoxon tests. For the statement "They preserve the identity of the person" (Figure 12), the significantly different groups of methods were Deep and Sela (mean 3) on the one hand, and Geo.1 and EDFM (mean 2.3) on the other. The method Geo.2 (mean 2.6) was not significantly different from the others. For the statements "They correspond to what would be expected of a caricature" (Figure 13) and "I like the results" (Figure 14), the only significant differences were between the group formed by Geo.1, Geo.2, and EDFM and the method Sela, with Deep in between.

DISCUSSION
In this paper, we have proposed two novel caricaturization methods: one leveraging the capabilities of deep style transfer, the other a user-controlled rule-based approach. The proposed methods and two additional methods from the literature were evaluated through a user study considering 12 different facial scans and the corresponding caricatures generated by these different methods. Overall, the results showed that all methods achieved similar performances, with average rankings ranging from 2.81 to 3.12 (lower is better). A general observation is that no method was significantly superior to all the others. The results considering only the method (see Figure 11) are fairly evenly distributed, although the Deep and Sela approaches seem to generate a higher number of "badly ranked caricatures" (fourth and fifth ranks). This observation matches the global appreciation from participants, as EDFM, Geo.1, and Geo.2 got slightly higher scores. While this result could suggest that some of the methods worked better for some facial scans than for others, the results split by Identity do not totally support this hypothesis (see Figure 9). Looking at the five worst-ranked caricatures (Figure 15), we can identify several cases in which the method considered could have generated undesired results. The facial features of face 6 interpenetrate each other when using Sela, and the borders of face 7 are spread too widely using the same method. On face 5, the difference in eye size is too greatly exaggerated by the Deep method. These generated faces, rated significantly worse than others on average, can be easily identified, opening possibilities for a manual or automatic filtering protocol. Nevertheless, these results suggest that some methods performed particularly badly on some of the facial scans. Yet, this did not happen consistently. Each caricaturization method had a pre-defined set of meta-parameters. The chosen configuration could have suited some faces better than others, generating caricatures of different qualities. The five best-ranked caricatures can be seen in Figure 8.

FIGURE 10 | Boxplot of the average rankings over participants, per method. Rankings range from 1 to 5. Overall, all methods achieve similar performances, averages being between 2.81 and 3.12 (lower is better).

Another potential explanation for the results is that the task was too hard and subjective, so that choices ended up being random. Using faces with no hair or eyes might even have increased the complexity of the task. Indeed, some participants explicitly stated that the task was difficult, especially as they were judging textured facial masks instead of full faces. Nevertheless, this potential user preference does not seem to be linked with any particular caricaturization method. Looking at participant preferences, only 12 participants out of 49 showed a significant variation between methods in their first-ranked choices. Looking at the results of the subjective questions, the two worst-ranked methods (Deep and Sela), which also produced the worst-rated individual caricatures, were rated higher on both "They correspond to what would be expected of a caricature" and "I like the results," where the caricatures of each method were presented together. This suggests that without their bad results on specific faces, which might be less visible when presented amongst all the others, they could actually have ranked higher than the other methods. A perceptual metric that reliably judges the quality of a caricature could help guide its creation, but the high variation in participant preferences in our study suggests that defining one would require a considerably larger study.
Considering these findings, we issue the following guidelines for choosing a method to generate caricatures automatically.
• If the main goal is to generate caricatures with a given set of parameters, no specific style, and as little variance as possible in quality, an EDFM-based method is the most suitable.
• If there is still no specific style required, but more tolerance to variance in quality (for instance, if it is possible to tune the generated faces when they are unsatisfactory), we recommend the approach of Sela, which was rated very similarly to EDFM on average in the ranking task, and significantly higher on the subjective questionnaire.
• If a specific caricature style is required, the Deep approach offers results comparable with Sela both in the ranking task and the questionnaire.
• Finally, if there is a need to target a specific user, the best solution is to use the panel of available methods and leave the choice to the user.
Caricatures provide a style whose notion can be understood as an "accentuation of facial features," allowing manually defined rules to achieve performance comparable to learning-based approaches. Other stylistic facial domains, such as aliens or anthropomorphic animals, could have more to gain from learning. However, such non-realistic 3D facial data is currently very scarce.

CONCLUSION
In this paper, we have introduced two novel approaches to automatically generate caricatures from 3D facial scans. The first method mixes EDFM-based curvature deformation and data-driven proportion deformation, while the second method is based on a domain-to-domain translation deep neural network. We then presented and discussed a perceptual study aiming to assess the quality of the generated caricatures. Overall, the results showed that the different evaluated methods performed in a similar way, although their performance could vary with respect to the facial scan used. This result illustrates both the subjectivity of evaluating caricaturization performance and the complementarity of using different approaches, which produce different styles of caricatures. Future work could involve automatically detecting the worst cases of automatic caricaturization in order to apply a correction or a filter, or exploring learning-based automatic caricaturization by training on different caricature styles and setting up a network able to generate faces of a given style. We believe this study of the extended state of the art has helped expand and refine the landscape of automatic caricaturization approaches, and of 3D facial stylization in general, and that our work provides interesting insights and guidelines for the automatic generation of caricatures that will help practitioners and inspire future research.

FIGURE 1 | Results of our novel user-controlled rule-based approach. Each pair (A, B, C, and D) presents the input facial scan (wired on the left) and its automatically generated caricature on the right.

FIGURE 2 | Overview of our user-controlled method presented in Section 3. Arrows and diamond shapes represent algorithms while boxes represent data. Offline and online processing are represented by the blue and orange colors, respectively. Green, yellow, and pink highlights show the different modules which compose the core of the user-controlled caricature system. For simplification purposes, the face segmentation is not shown.

FIGURE 4 | A comparison of results with different values of k for the kNN algorithm of the proportion exaggeration module. The first column shows the original facial mesh. Here, the caricatures are generated with f_proportions = 2.

FIGURE 6 | Overview of UNGT. A facial scan's identity is encoded along with the style of a caricature mesh, in order to produce the caricatured face. Textures are not processed and are presented for illustration purposes only. E represents the Style Encoder, G the Generator, and D the Discriminator.

FIGURE 7 | Deep learning based caricatures for a number of facial scan examples.

FIGURE 9 | Average rankings, per Method and Identity. R1 to R5 denote ranks 1 to 5. Note the high variance per face and method.

FIGURE 11 | Caricature ranking distribution across all participants, per method. Top-1 to Top-5 rankings are shown in light blue, green, yellow, orange, and blue, respectively.

FIGURE 12 | Average Likert ratings for the statement "They preserve the identity of the person." Deep and Sela are significantly different from Geo.1 and EDFM.

FIGURE 13 | Average Likert ratings for the statement "I like the results." Geo.1, Geo.2, and EDFM are significantly different from Sela.

FIGURE 14 | Average Likert ratings for the statement "They correspond to what would be expected of a caricature." Geo.1, Geo.2, and EDFM are significantly different from Sela.