Systematic Evaluation of Design Choices for Deep Facial Action Coding Across Pose

The performance of automated facial expression coding is improving steadily. Advances in deep learning techniques have been key to this success. While the advantage of modern deep learning techniques is clear, the contribution of critical design choices remains largely unknown, especially for facial action unit occurrence and intensity across pose. Using the Facial Expression Recognition and Analysis 2017 (FERA 2017) database, which provides a common protocol to evaluate robustness to pose variation, we systematically evaluated design choices in pre-training, feature alignment, model size selection, and optimizer details. Informed by the findings, we developed an architecture that exceeds the state of the art on FERA 2017. The architecture achieved a 3.5% increase in F1 score for occurrence detection and a 5.8% increase in Intraclass Correlation (ICC) for intensity estimation. To evaluate the generalizability of the architecture to unseen poses and new dataset domains, we performed experiments across pose in FERA 2017 and across domains in Denver Intensity of Spontaneous Facial Action (DISFA) and the UNBC Pain Archive.


INTRODUCTION
Emotion recognition technologies play an important role in human-computer interaction systems. Face-to-face interactions between social robots and people are but one example (McColl et al., 2016; Cavallo et al., 2018). To recognize human emotion, facial action units (AUs) (Ekman et al., 2002), which correspond to discrete muscle contractions, have been widely used. Individually, or in combination, they can represent nearly all possible facial expressions.
In the last half decade, automated facial affect recognition (AFAR) systems have made major advances in detection of the occurrence and intensity of facial actions. Previous studies focused on relatively controlled laboratory settings. More recent studies emphasize less-constrained and in-the-wild scenarios (Cohn and De la Torre, 2015; Li and Deng, 2018; Zhi et al., 2019). Because non-frontal face views occur commonly in less-constrained settings, robustness to pose variation is essential. The Facial Expression Recognition and Analysis 2017 (FERA 2017) challenge provided the first common protocol to evaluate robustness to pose variation (Valstar et al., 2017). In FERA 2017, deep learning (DL)-based approaches achieved the best performance in both sub-challenges: occurrence detection (Tang et al., 2017) and intensity estimation (Zhou et al., 2017).
While the advantages of DL approaches are clear, little is known about critical design choices in crafting them. Most studies used ad-hoc or default parameters provided by the DL frameworks; however, they neglected to investigate the effect of different parameter settings on facial AU detection. Also, little is known about the relative contribution of different design choices in pre-training, feature alignment, model size, and optimizer details.
We are especially interested in design choices based on two scenarios. One is robustness to pose variation. Until recently, most systems were concerned with relatively frontal face views. With increased attention to less-constrained and in-the-wild contexts, it is critical for systems to be robust to pose variation in real-world settings where it is common. The other scenario is transfer to new dataset domains other than those in which systems have been trained and tested. To meet the need for robustness to new contexts, systems must perform well both in the domains from which they come and in the domains to which they may be applied. The evaluation of domain transfer in AU systems is relatively new (Ertugrul et al., 2020).
To address questions in design choices, we systematically explored combinations of different components and their parameters in a DL pipeline. We investigated pre-training practices, image alignment for pre-processing, training set size, optimizer, and learning rate (LR). By utilizing the insights, we achieved state-of-the-art performance in both the occurrence detection and the intensity estimation sub-challenges of FERA 2017 (Valstar et al., 2017) and state-of-the-art cross-domain generalizability to the Denver Intensity of Spontaneous Facial Action (DISFA) dataset (Mavadati et al., 2013). We are also the first to report cross-domain generalizability to the UNBC Pain Archive (Lucey et al., 2011). To reveal which facial regions our architecture responds to in detecting specific AUs at specific poses, we visualized occlusion sensitivity maps.
The study of Niinuma et al. (2019) was an earlier version of the current study. In the present study, we evaluated an additional DL architecture (ResNet50), performed cross-domain evaluation with an additional dataset (UNBC Pain), evaluated cross-pose generalizability, and visualized occlusion sensitivity maps.
The FERA 2017 challenge (Valstar et al., 2017) was the first to provide a common protocol to compare approaches for detection of AU occurrence and AU intensity robust to pose variation. FERA 2017 provided synthesized face images from BP4D (Zhang et al., 2014) with nine head poses, as shown in Figure 1. To generate the synthesized images, 3D models were rotated by −40°, −20°, and 0° pitch and −40°, 0°, and 40° yaw from frontal pose. The training set was based on the BP4D database (Zhang et al., 2014), which included digital videos of 41 participants. The development and test sets were derived from BP4D+ (Zhang et al., 2016) and included digital videos of 20 and 30 participants, respectively. FERA 2017 presented two sub-challenges: occurrence detection and intensity estimation, with 10 AUs labeled for the former and 7 AUs labeled for the latter.
For FERA 2017, the participants proposed a wide range of methods (Amirian et al., 2017; Batista et al., 2017; He et al., 2017; Li et al., 2017; Tang et al., 2017; Valstar et al., 2017; Zhou et al., 2017). Table 1 compares them with each other and with two more recent studies from Ertugrul et al. (2018) and . F1 score and Intraclass Correlation (ICC) were used to evaluate performance for occurrence detection and intensity estimation, respectively.
Several comparisons are noteworthy. While detailed face alignment using facial landmarks was used for shallow approaches, simple face alignment using face position or resized images more often sufficed for DL approaches. As for architecture, DL performed better than shallow approaches, and DL approaches with pre-trained models performed better than ones without pre-trained models. For both sub-challenges, the methods showing the best performance, Tang et al. (2017) for occurrence detection and Zhou et al. (2017) for intensity estimation, used DL with a pre-trained model. As for training set size, each method used a different number of training images. Adaptive Moment Estimation (Adam) and Stochastic Gradient Descent (SGD) were popular choices for optimizer, and their LR varied between 10^-3 and 10^-4.
The comparison of existing methods indicates that DL approaches, especially those using pre-trained models, are effective for this task. However, every approach used a different fixed configuration, and the key parameters remain unknown. The aim of this study is to investigate the key parameters for both AU occurrence detection and intensity estimation for this task and to discover the optimal configuration.

METHODS
The main goal of this study is to investigate the effect of the different components and parameters and to provide best practices that researchers can use for training DL methods for automatic facial expression analysis. Figure 1 shows an outline of our experimental design. We systematically varied parameters and design choices in this pipeline (key elements are denoted in blue color in Figure 1).

Architecture
Since the objective of this study is to investigate components that were commonly used by existing methods, we examined Visual Geometry Group (VGG) architectures. As Table 1 shows, VGG pre-trained models were widely used as architectures. To examine the impact of alternative DL architectures, we also conducted experiments using the ResNet50 pre-trained model in section 4.11. For VGG architectures, we selected two pre-trained models: VGG-ImageNet and VGG-Face. While VGG-ImageNet is a model that was trained on ImageNet for image classification (Simonyan and Zisserman, 2015), VGG-Face is a model that was trained on a face dataset for face recognition (Parkhi et al., 2015).

Baseline Configuration
In each experiment, we explored the effect of optimizer choice and parametric variation of key parameters. The experimental setup has five parameters (normalization, architecture, train set size, optimizer, and LR) and two tasks (occurrence detection and intensity estimation) in total. To vary all parameters would have resulted in 320 possible permutations. In consideration of computational cost and limits on how much could be visualized, we varied two parameters at a time and chose the top 50 permutations that we believed would be of most interest to developers of AFAR systems.
The baseline configuration used Procrustes analysis for face alignment and the VGG16 network trained on ImageNet. For optimizers, we compared Adam and SGD, with default learning rates of 5 × 10^-5 and 5 × 10^-3, respectively. We fine-tuned the network from the third convolutional layer using 5,000 images for each pose and AU. The dropout rate was 0.5 throughout the experiments.

Normalization
We evaluated two methods for image normalization. In the first method, we applied Procrustes analysis (Gower, 1975) to the face shapes defined by the landmarks to estimate similarity normalized shapes. In the second method, we resized the images to the receptive field of the deep network.
Similarity normalization between source and template shapes using eye locations is a popular choice in the literature. One shortcoming of this approach is that the alignment error increases for landmarks farther away from the eye region. This artifact is more prominent under moderate-to-large head pose variations. To alleviate this problem, we used all 68 landmarks provided by the dlib face tracker (King, 2009) to calculate a Procrustes transformation between the predicted shape and a frontal looking template. We chose the size of the template to cover a bounding box of 224 × 224 pixels, which corresponds to the receptive field of the VGG network.
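The landmark-based alignment described above can be illustrated with a minimal sketch of a least-squares similarity (Procrustes) fit in 2D. This is not the authors' implementation: the function names, the four-point toy template, and the use of complex arithmetic to represent 2D points are assumptions made for illustration, and a real pipeline would use all 68 landmarks and then warp the image with the fitted transform.

```python
import cmath
import math


def fit_similarity(src, dst):
    """Least-squares similarity transform z -> a*z + b mapping src onto dst.

    Points are (x, y) tuples treated as complex numbers, so the complex
    factor `a` encodes scale (abs) and rotation (phase) with no reflection.
    """
    s = [complex(x, y) for x, y in src]
    d = [complex(x, y) for x, y in dst]
    s_mean = sum(s) / len(s)
    d_mean = sum(d) / len(d)
    sc = [z - s_mean for z in s]  # centered source shape
    dc = [z - d_mean for z in d]  # centered target shape
    a = sum(z.conjugate() * w for z, w in zip(sc, dc)) / sum(abs(z) ** 2 for z in sc)
    b = d_mean - a * s_mean
    return a, b


def apply_similarity(a, b, points):
    """Apply z -> a*z + b to a list of (x, y) points."""
    out = []
    for x, y in points:
        z = a * complex(x, y) + b
        out.append((z.real, z.imag))
    return out


# Toy example (hypothetical values): the "detected" shape is the template
# scaled by 0.5, rotated by 20 degrees, and shifted.
template = [(60.0, 90.0), (164.0, 90.0), (112.0, 140.0), (112.0, 190.0)]
true_a = 0.5 * cmath.exp(1j * math.radians(20))
detected = apply_similarity(true_a, complex(30, -12), template)

a, b = fit_similarity(detected, template)
aligned = apply_similarity(a, b, detected)
scale, angle_deg = abs(a), math.degrees(cmath.phase(a))
```

Because every landmark contributes to the fit, the residual is spread over the whole face rather than being minimized only at the eyes, which is the motivation given above for using all landmarks.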
As for the second option, we resized each input image from the dataset to 224 × 224 pixels to match the receptive field of the VGG network. Figure 2A shows the F1 scores and ICC averages for all nine poses for each AU. The left figures show results for the Adam optimizer, and the right figures show results for the SGD optimizer.
The results indicate that performance with Procrustes analysis is slightly better than with resizing, but the difference is small, only 1%. One possible explanation is that the network has enough capacity to learn all nine poses present in the training set. Another study indicates that a form of normalization is often helpful when classifiers are evaluated on poses different from the ones they were trained on (Ertugrul et al., 2018).

Pre-trained Architecture
Training deep models from scratch is time-consuming, and the amount of training data at hand may impede good performance. One popular solution is to select a model that was trained on large-scale benchmark datasets (source domain) and fine-tune it on the data of interest (target domain). Although this practice is effective, how the type of data in the source domain influences the performance of fine-tuning in the target domain has received relatively little attention.
To explore this question, we selected two models that were trained on very different domains: VGG-16 trained on ImageNet (Simonyan and Zisserman, 2015) and VGG-Face (Parkhi et al., 2015). We replaced the final layers of each network with a 2-length one-hot representation for AU occurrence detection and with a 6-length one-hot representation for the intensity estimation task. In both cases, we trained separate models for each AU, resulting in 10 and 7 models for AU occurrence detection and AU intensity estimation, respectively. We fine-tuned the models for 10 epochs, validated their performance on the validation partition, and then reported their results on the subject-independent test partition. We used a PyTorch implementation for all of the models. Figure 2B shows that models pre-trained on ImageNet perform better than the VGG-Face ones. VGG-Face was trained on face images for identification, while ImageNet includes many non-face images for image classification. One possible explanation is that VGG-Face learned to actively ignore facial expression in order to recognize the face. In this case, a generic image representation is more suitable for the task.

Training Set Size
Recently, multi-label stratified sampling was found advantageous over naive sampling strategies for AU detection. In this experiment, we employed this strategy and investigated the effect of different training set sizes on the performance. We down-sampled the majority class and up-sampled the minority class to build a stratified training set. We used this procedure for each pose and each AU. For example, in the case of AU occurrence detection, a training set size of 5,000 indicates that 5,000 frames where the AU is present and 5,000 frames where the AU is not present were randomly selected for each pose and for each AU, resulting in 90,000 images in total (= 5,000 images × 2 classes × 9 poses).
We repeated the same stratifying procedure with the six ordinal classes of the intensity sub-challenge. In this case, a training set size of 5,000 means that 5,000 images were randomly selected from each of the six classes (not present, and A to E levels) for each pose and for each AU, resulting in 270,000 images in total (= 5,000 images × 6 classes × 9 poses). Figure 2C shows results as a function of training set size. The training set size has a minor influence on performance: scores peaked at 5,000 images, after which performance plateaued.
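The balancing procedure described in this section can be sketched as follows. This is a simplified, single-label illustration rather than the authors' code: the function name and the toy frame list are hypothetical, and drawing with replacement is only one possible way to up-sample a minority class, since the text does not specify how up-sampling was done.

```python
import random
from collections import defaultdict


def stratified_sample(frames, n_per_class, rng):
    """Return exactly n_per_class frames for every label.

    frames: list of (frame_id, label) pairs for one pose and one AU.
    """
    by_label = defaultdict(list)
    for frame_id, label in frames:
        by_label[label].append(frame_id)
    sample = []
    for label, ids in sorted(by_label.items()):
        if len(ids) >= n_per_class:
            # Majority class: down-sample without replacement.
            chosen = rng.sample(ids, n_per_class)
        else:
            # Minority class: up-sample with replacement (an assumption).
            chosen = [rng.choice(ids) for _ in range(n_per_class)]
        sample.extend((fid, label) for fid in chosen)
    return sample


# Toy data (hypothetical): 1,000 frames with a 10% base rate for the AU.
rng = random.Random(0)
frames = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
balanced = stratified_sample(frames, 200, rng)
counts = {label: sum(1 for _, l in balanced if l == label) for label in (0, 1)}

# Totals quoted in the text for a training set size of 5,000.
occurrence_total = 5000 * 2 * 9  # 2 classes x 9 poses
intensity_total = 5000 * 6 * 9   # 6 classes x 9 poses
```

Applying the same routine per pose and per AU gives each model an equal number of examples of every class, regardless of the AU's base rate.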

Optimizer and LR
In this experiment, we investigated the impact of different optimizers and LRs on the performance. We varied the LRs, but other optimizer parameters were set to the default values used in PyTorch: betas = (0.9, 0.999) without weight decay for Adam and no momentum, no dampening, no weight decay, and no Nesterov acceleration for SGD. Figure 2D shows that the optimal LR depends on the choice of optimizer. For Adam, LR = 5 × 10^-5 gave the best results, and for SGD, LR = 0.01 reached the best performance for both occurrence detection and intensity estimation. In addition, we can see that the performance differences between Adam and SGD are negligible if one uses the optimal learning rate for each optimizer.
It is worth noting that Zhou et al. (2017) used SGD with LR = 10^-4 for the AU intensity estimation task. The results indicate that using the Adam optimizer, or the SGD optimizer with a larger LR, could have improved their performance. Tang et al. (2017) used SGD with LR = 10^-3, but they also applied momentum. Additional experiments revealed that, when momentum is used for SGD, a smaller learning rate is preferable for optimal performance. More specifically, when we used the same parameters as Tang et al. (2017) reported for SGD (momentum = 0.9, weight decay = 0.02), F1 score peaked at 0.596 using LR = 10^-4. Their LR was close to optimal, though SGD without momentum further improves F1 score to 0.609 with LR = 0.01.
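One common heuristic for why momentum calls for a smaller LR is that momentum amplifies the effective step size. The sketch below is an illustration under stated assumptions, not an analysis from this study: it assumes the update rule v = momentum × v + grad, p = p − LR × v (the form used by SGD with zero dampening) and a constant gradient, under which the per-step change approaches LR / (1 − momentum). The function name is hypothetical.

```python
def steady_state_step(lr, momentum, grad=1.0, steps=200):
    """Size of the parameter update after `steps` iterations of SGD with
    momentum under a constant gradient (v = momentum * v + grad; p -= lr * v)."""
    v = 0.0
    for _ in range(steps):
        v = momentum * v + grad
    return lr * v


# Plain SGD at LR = 0.01 versus SGD with momentum 0.9 at LR = 0.001.
plain = steady_state_step(lr=1e-2, momentum=0.0)
with_momentum = steady_state_step(lr=1e-3, momentum=0.9)
amplification = with_momentum / 1e-3  # approaches 1 / (1 - 0.9)
```

Under this idealization, momentum of 0.9 makes a given LR behave like one roughly ten times larger. The heuristic agrees in direction with the observation above but does not account for it fully, since the configurations compared also differ in weight decay.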
We note that, when the LR was set to a large value, some models did not converge and predicted the majority class for all samples. Under this rare condition, ICC converges to zero, but this should not be interpreted as chance performance. As variation in predicted intensity values decreases, the ICC metric loses predictive power.
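The degenerate case can be checked numerically. The sketch below assumes the ICC(3,1) variant (two-way mixed, consistency, single measurement), treating ground-truth labels and predictions as two raters; the text does not state which ICC variant was computed, so this choice, the function name, and the toy labels are assumptions.

```python
def icc_3_1(labels, preds):
    """ICC(3,1) between ground-truth labels and predictions (k = 2 raters)."""
    n, k = len(labels), 2
    grand = (sum(labels) + sum(preds)) / (n * k)
    row_means = [(a + b) / k for a, b in zip(labels, preds)]
    col_means = [sum(labels) / n, sum(preds) / n]
    ss_total = sum((v - grand) ** 2 for v in list(labels) + list(preds))
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # between frames
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # between raters
    ss_err = ss_total - ss_rows - ss_cols
    bms = ss_rows / (n - 1)               # between-target mean square
    ems = ss_err / ((n - 1) * (k - 1))    # residual mean square
    denom = bms + (k - 1) * ems
    return (bms - ems) / denom if denom > 0 else 0.0


# Toy intensity labels on the 0-5 scale (hypothetical values).
labels = [0, 1, 2, 3, 4, 5, 0, 2]
icc_perfect = icc_3_1(labels, labels)
icc_offset = icc_3_1(labels, [v + 1 for v in labels])   # consistent but biased
icc_constant = icc_3_1(labels, [0] * len(labels))       # majority-class output
```

With constant predictions, the between-target and residual mean squares are equal, so the numerator vanishes and the score is zero no matter which constant is predicted.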

Comparison With Existing Methods
We compare our method with the state of the art on both the AU occurrence detection (Table 2) and the AU intensity estimation (Table 3) sub-challenges from FERA 2017. The final parameters of the models are nearly identical for the two tasks: we used face alignment with Procrustes analysis as a pre-processing step, and we fine-tuned the ImageNet pre-trained VGG16 model on stratified sets consisting of 5,000 samples per class, pose, and AU. For AU occurrence detection, SGD with LR = 0.01 gave the best result (F1 = 0.609), while for AU intensity estimation, Adam with LR = 5 × 10^-5 reached the best performance (ICC = 0.504). These scores outperform other state-of-the-art methods.
We noted a few key differences that contributed to this achievement. The main difference from Tang et al. (2017) is that they used a VGG-Face pre-trained model while we used an ImageNet pre-trained model. Zhou et al. (2017) used SGD with a small LR, while our combination of optimizer and learning rate is optimal. While Li et al. (2018) evaluated their method for AU occurrence detection using the FERA 2017 dataset, they reported performance only on the validation partition. Their best F1 score (0.522) is about 9 percentage points lower than ours (0.611) on the validation partition.

Effect of Head Pose on Performance
To understand the effect of head pose on classifier performance, we compiled the performance scores into tabular form, as shown in Tables 4, 5. For each pose and AU, the tables show F1 score and Accuracy for occurrence detection and ICC for intensity estimation. In the experiments, we used the same CNN models reported in section 4.5. We can see the effect of rotations in Tables 4, 5. As for pitch rotations, the 0° pitch poses (Poses 4, 5, and 6) show better results than the others. As for yaw rotations, the performance scores are comparable for all poses.

Cross-Domain Evaluation
Differences in illumination, cameras, orientation of the face, quality, and diversity of the training data influence predictive performance between domains. To evaluate the generalizability of the method to unseen conditions, we reported performance on the DISFA (Mavadati et al., 2013) and UNBC McMaster Pain (Lucey et al., 2011) datasets.
These datasets were annotated with AU intensity labels. To create binary AU occurrences, we thresholded the 6-point intensity values at A-level (A-level or higher means the AU is present). We evaluated both occurrence detection and intensity estimation performance of our system. In these experiments, no fine-tuning was performed on the target domain. Figure 3A shows the F1 scores with two normalization methods, Procrustes analysis and resizing. Figure 3B shows the F1 scores with two pre-trained architectures, VGG-ImageNet and VGG-Face. In these experiments, we used the same configurations with the Adam optimizer as in sections 4.1 and 4.2, respectively. We used the built-in face detector in dlib (King, 2009) to detect the face before applying Procrustes analysis. As for resizing, we extended the boxes of detected face positions by 30% to include whole faces and then cropped and resized the boxes to 224 × 224 images. For DISFA, we found that Procrustes analysis with VGG-ImageNet performs better. For the UNBC Pain Archive, the differences are in the same direction but small.
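Two of the pre-processing steps above, thresholding intensity at A-level and enlarging the detected face box, can be sketched as follows. The function names are hypothetical, intensities are assumed to be coded 0 (not present) to 5 (E-level), and the 30% extension is assumed to apply to the box width and height symmetrically about its center with clipping at the image border; the text does not specify these details.

```python
def occurrence_from_intensity(intensity, threshold=1):
    """Binary occurrence label: A-level (coded 1) or higher counts as present."""
    return int(intensity >= threshold)


def expand_box(box, image_size, factor=0.30):
    """Enlarge a (left, top, right, bottom) box by `factor` of its size,
    split evenly between both sides, and clip it to the image."""
    left, top, right, bottom = box
    width, height = image_size
    dx = (right - left) * factor / 2
    dy = (bottom - top) * factor / 2
    return (
        max(0, round(left - dx)),
        max(0, round(top - dy)),
        min(width, round(right + dx)),
        min(height, round(bottom + dy)),
    )


# Toy values (hypothetical): intensities for four frames and one face box
# in a 1,024 x 768 image.
occurrences = [occurrence_from_intensity(v) for v in [0, 1, 3, 5]]
expanded = expand_box((100, 100, 300, 300), (1024, 768))
clipped = expand_box((0, 0, 100, 100), (120, 120))
```

The expanded box would then be cropped and resized to 224 × 224 before being passed to the network.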
Tables 6, 7 show the results from our model on both tasks. In these experiments, we used two types of models: (1) All poses: the previously trained CNN models reported in section 4.5, and (2) Pose 6 only: models trained only on images with Pose 6, which is equivalent to BP4D. Table 6 includes the comparison with cross-domain methods for occurrence detection on DISFA. Both Baltrušaitis et al. (2015) and Ghosh et al. (2015) used BP4D to train their models and thresholded AU intensity values at A-level to create binary events. Our models were also trained using BP4D because the train set for FERA 2017 is synthesized from BP4D. Pose 6 in FERA 2017 is the same as the pose shown in BP4D. To train the Pose 6 only models, we used the same number of images as for All poses. More specifically, 45,000 frames were randomly selected per class per AU, resulting in 90,000 images in total for each AU. As discussed in section 4.3, we down-sampled the majority class and up-sampled the minority class. We also report Accuracy and 2AFC scores, which Ghosh et al. (2015) used. The All poses and Pose 6 only models showed almost the same performance for Accuracy, F1 score, and AUC, although All poses showed slightly better results than Pose 6 only for 2AFC. This result is reasonable because most images in DISFA are frontal or near frontal. In comparison with Ghosh et al. (2015), our models outperform their method in both metrics. Baltrušaitis et al. (2015) report cross-domain scores only for two AUs (AU12 and AU17). Our models show better performance except for AU12 on Pose 6 only. These results show the robustness of our model in cross-domain situations. To the best of our knowledge, there are no other methods that perform cross-domain evaluation on these datasets. Table 7 depicts the results of our methods.
It is worth mentioning that some differences in UNBC Pain may cause the low F1 scores. The base rate in UNBC Pain is low (DISFA: 13.3%, UNBC Pain: 7.2%), and the image size of UNBC Pain (320 × 240 or 352 × 240) is smaller than that of the other two datasets (FERA2017: 1,024 × 1,024, DISFA: 1,024 × 768). In addition, in UNBC Pain, facial expressions are mostly associated with pain, and the correlation among AUs differs from that of FERA2017 and DISFA. Tables 6, 7 also show AUC.
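The dependence of F1 on base rate can be made concrete with a small worked example. The sketch below holds a classifier's true-positive and false-positive rates fixed and varies only the base rate; the rates of 0.8 and 0.1 are hypothetical values chosen for illustration and are not measurements from this study, while the two base rates are those quoted above.

```python
def f1_at_base_rate(tpr, fpr, base_rate):
    """F1 score of a classifier with fixed true/false-positive rates when
    positives make up `base_rate` of the frames."""
    tp = tpr * base_rate
    fp = fpr * (1 - base_rate)
    fn = (1 - tpr) * base_rate
    return 2 * tp / (2 * tp + fp + fn)


# Same hypothetical classifier (TPR = 0.8, FPR = 0.1) at the two base rates.
f1_disfa = f1_at_base_rate(0.8, 0.1, 0.133)
f1_unbc = f1_at_base_rate(0.8, 0.1, 0.072)
```

Because false positives are drawn from a larger pool of negative frames when the base rate is lower, precision and therefore F1 drop even though the classifier itself is unchanged. This is consistent with reporting AUC alongside F1.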

Cross-Pose Evaluation
We also performed cross-pose experiments to evaluate the generalization of our method to unseen poses. We report the results of two types of experiments: (1) we trained the architecture using eight of the nine poses of the training set and tested it with the remaining pose of the test set (Figure 4), and (2) we trained the architecture using one pose of the training set and tested it with all nine poses of the test set (Figure 5). The baseline configuration with the Adam optimizer was used for the cross-pose experiments. Figure 4 shows the differences between the models trained with eight poses and those trained with nine poses. The horizontal axis represents the pose that was excluded from the train set and used as the test set. The value is zero if the performance of the two models is the same, and the value is >0 if the performance with eight poses is better than the one obtained with nine poses. Models trained with all nine poses are expected to perform best because they learn information about all poses. With the eight-pose experiments, we can see that, even if the test pose is excluded from the training set, our model performs similarly to the one in which the test pose is included in the training set. The results indicate that our model performs reasonably well on unseen poses.
As for Figure 4, we provide a more detailed analysis. Accuracy for AU4 is higher for poses 1 and 2. No difference, however, is found for AU4 intensity. Given that the occlusion sensitivity maps for AU4 appear similar across poses, the difference for poses 1 and 2 in occurrence may be due to noise. AU15, on the other hand, showed decreased accuracy for poses 7, 8, and 9. This effect would be expected. AU15 results in a small, localized movement and appearance change below the lip corners. When the face is viewed from above (poses 7, 8, and 9), the target region is occluded. As for AU23, there was decreased accuracy for pose 9. Lip-corner tightening may be more difficult to perceive when viewed from above, but that was not found for two of the three extreme poses. Thus, variation for this AU occurrence is difficult to interpret. Unfortunately, AU intensity is not available for comparison. Figure 5 shows the results of the second experiment. Each cell of a 3 × 3 matrix shows the performance for one pose. Performance at a cell of the grid corresponds to the pose at the same cell in Figure 1. The blue rectangle marks the pose that was used to train the model. For example, for a model trained with Pose 1, F1 score is 0.604 when we test it with Pose 1 of the test set and 0.446 when we test it with Pose 9 of the test set. The figure shows that the best results are obtained within pose. Smaller decreases in performance are observed when the models are tested with the poses in neighboring cells. Performance decreases substantially when we test a model with very different poses.
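The two cross-pose protocols described in this section can be written down as train/test splits over pose indices. This is a minimal sketch: the function names are hypothetical, and it enumerates only which poses go into training and testing, not the training itself.

```python
POSES = list(range(1, 10))  # the nine FERA 2017 poses, numbered 1 to 9


def leave_one_pose_out(poses):
    """Experiment 1: train on eight poses, test on the held-out pose."""
    return [([p for p in poses if p != held_out], [held_out]) for held_out in poses]


def train_one_test_all(poses):
    """Experiment 2: train on a single pose, test on all nine poses."""
    return [([p], list(poses)) for p in poses]


eight_pose_splits = leave_one_pose_out(POSES)
single_pose_splits = train_one_test_all(POSES)
```

Each protocol yields nine splits per AU. In the first, the held-out pose is never seen during training; in the second, only one of the nine test poses matches the training pose.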

Occlusion Sensitivity Maps
To discover key features for the classifier, we generated occlusion sensitivity maps (Zeiler and Fergus, 2014) for each pose and each AU. We used an occlusion patch of 45 × 45 pixels filled with Gaussian random noise. We slid the patch over the original 224 × 224 image with a stride of 15. For each AU and each pose, we selected 100 images that contained the specific AU and 100 images that did not contain it. We tested the 200 images for each AU and each pose and obtained accuracy values. Figure 6 shows the maps, where darker red colors represent lower accuracy values. Significant regions are the ones colored red because their occlusion causes the largest decrease in accuracy. As can be seen in Figure 6, for most of the AUs, the significant regions are localized where each AU is observed (e.g., around the eyes, eyebrows, and forehead for AU1 and AU4, and around the mouth and chin for AU15 and AU17). The results indicate that the models learn where to look in the input to detect a specific AU correctly. Note that significant regions in Figure 6 are off to the left side even for frontal faces. This seems reasonable because the pitch and yaw rotations of images in the FERA2017 dataset are in one direction, as shown in Figure 1. If no occlusion causes a large decrease in accuracy, the map does not include dark red colors. For example, we see weak activation on the heatmap for the AU12 frontal face. The map indicates that, even if a large part of the mouth is occluded, our model can detect AU12 by using other parts of the face.
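The sliding-patch procedure can be sketched as follows. This is a toy illustration rather than the authors' code: the function names, the noise parameters (mean 0.5, standard deviation 0.1), the 12 × 12 toy images, and the stand-in classifier are all hypothetical, and the patch grid assumes no padding at the image border.

```python
import random


def patch_origins(image_size, patch_size, stride):
    """Top-left coordinates of the occlusion patch along one axis (no padding)."""
    return list(range(0, image_size - patch_size + 1, stride))


def occlusion_map(images, labels, predict, patch_size, stride, rng):
    """Accuracy of `predict` over all images for each patch position, with
    the patch region replaced by Gaussian noise."""
    origins = patch_origins(len(images[0]), patch_size, stride)
    acc_map = []
    for top in origins:
        row = []
        for left in origins:
            correct = 0
            for image, label in zip(images, labels):
                occluded = [r[:] for r in image]  # copy so the input is unchanged
                for y in range(top, top + patch_size):
                    for x in range(left, left + patch_size):
                        occluded[y][x] = rng.gauss(0.5, 0.1)
                correct += int(predict(occluded) == label)
            row.append(correct / len(images))
        acc_map.append(row)
    return acc_map


def make_image(positive, size=12):
    """Toy image: a bright 2 x 4 block marks the 'AU present' class."""
    image = [[0.0] * size for _ in range(size)]
    if positive:
        for y in (8, 9):
            for x in range(4, 8):
                image[y][x] = 1.0
    return image


def toy_predict(image):
    """Stand-in classifier that looks only at the 2 x 4 block."""
    region = [image[y][x] for y in (8, 9) for x in range(4, 8)]
    return int(sum(region) / len(region) > 0.8)


positions_per_axis = len(patch_origins(224, 45, 15))
acc_map = occlusion_map([make_image(True), make_image(False)], [1, 0],
                        toy_predict, patch_size=4, stride=2, rng=random.Random(0))
```

In the toy example, accuracy stays at 1.0 when the patch misses the informative block and falls to 0.5 when the patch covers it, which is the pattern the maps in Figure 6 visualize per AU and pose.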

Saliency Maps
We generated saliency maps using basic backpropagation (Simonyan et al., 2014) to compare the learned features. For each AU and each pose, we selected 100 images that contained the specific AU and 100 images that did not contain it. We then obtained a mean image of the saliency maps from the 200 images. Figure 7 shows the results of this experiment. Brighter areas are more important for the classifier to detect the related AUs.
The important regions are expected to be localized at the regions where each action unit is observed. Like the occlusion sensitivity map, the saliency map aims to find the regions that are important for detection, but the two differ in their methodology and in the way they define what is important. The occlusion sensitivity map follows a perturbation-based (forward propagation) approach. Perturbed (occluded) inputs are forwarded through the network, and their effect on the output prediction is investigated. In contrast, the saliency map is a gradient-based (backpropagation) approach. The idea behind the saliency map is to compute the gradient of the output category with respect to the input image pixels. This shows the amount of change in the output when a pixel value is slightly changed. Figure 7 shows that the important regions are well-localized for both VGG-ImageNet and VGG-Face. However, in comparison with VGG-ImageNet, the regions  (0.609) shows better performance than ResNet50 (0.591) for occurrence detection.

CONCLUSIONS
By evaluating combinations of different components and their parameters, we addressed how design choices in DL systems influence performance in facial AU coding. Several findings stand out. The source domain in which pre-training was performed influenced the performance of fine-tuning in the target domain. Generic pre-training proved better than face-specific pre-training. One possible explanation is that face-specific pre-training teaches the network to encode identity while ignoring facial expression. Another important factor contributing to performance is the choice of different learning rates for different optimizers. For the Adam optimizer, a small LR was optimal. For the SGD optimizer, a large LR was optimal for expression coding. The best optimizer parameters were similar for both AU occurrence detection and AU intensity estimation, while varying the training set size and the type of image normalization had little effect on performance.
We also evaluated cross-pose and cross-domain generalizability of the proposed method and presented occlusion sensitivity maps and saliency maps to reveal key features for each facial AU. Our models outperformed other state-of-the-art approaches in the cross-domain experiments. Cross-pose evaluation showed that our models performed well for unseen poses.

ETHICS STATEMENT
The studies involving human participants were reviewed and approved by the IRB committee of Carnegie Mellon University. The patients/participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

AUTHOR CONTRIBUTIONS
KN implemented the architecture, ran the experiments, and wrote the manuscript with support from the other authors. IO implemented the visualization modules. JC contributed to the design and writing. LJ contributed to the conceptualization, design, and writing and supervised the project. All authors discussed the results and contributed to the final manuscript.