Unsupervised Facial Action Representation Learning by Temporal Prediction

Due to the cumbersome and expensive data collection process, facial action unit (AU) datasets are generally much smaller in scale than those in other computer vision fields, so AU detection models trained on insufficient AU images tend to overfit. Despite the recent progress in AU detection, deployment of these models has been impeded by their limited generalization to unseen subjects and facial poses. In this paper, we propose to learn a discriminative facial AU representation in a self-supervised manner. Considering that facial AUs show temporal consistency and evolution in consecutive facial frames, we develop a self-supervised pseudo signal based on temporally predictive coding (TPC) to capture the temporal characteristics. To further learn the per-frame discriminativeness between sibling facial frames, we incorporate frame-wise temporal contrastive learning into the self-supervised paradigm. The proposed TPC can be trained without AU annotations, which allows us to use a large number of unlabeled facial videos to learn AU representations that are robust to undesired nuisances such as facial identities and poses. Contrary to previous AU detection works, our method does not require manually selecting key facial regions or explicitly modeling the AU relations. Experimental results show that TPC improves the AU detection precision on several popular AU benchmark datasets compared with other self-supervised AU detection methods.


INTRODUCTION
Facial expression recognition technology offers the opportunity to seamlessly capture the expressed emotional experience of humans and facilitates unique human-computer interaction experiences. Over the past decades, facial expression recognition and analysis have been a hot research topic in the field of computer vision and human-computer interaction. To precisely characterize facial expressions, Ekman et al. developed the facial action coding system (FACS) (Ekman and Friesen, 1978). FACS has been widely used for describing and measuring facial behavior and has been the most comprehensive, anatomical system for describing facial expressions. FACS defines a detailed set of about 30 atomic non-overlapping facial muscle actions, i.e., action units (AUs). Almost any anatomical facial muscle activity can be characterized via a combination of facial AUs. Automatic AU detection has been a vital task for facial expression analysis, with a variety of applications in psychological and behavioral research, mental health assessment, and human-computer interaction (Bartlett et al., 2003;Zafar and Khan, 2014). Therefore, a reliable AU detection system is of vital importance for precise human emotion analysis.
Benefiting from advances in deep learning research, the accuracy of AU detection has been improved by convolutional neural network (CNN) based approaches in recent years (Li et al., 2017a,b, 2018a,b, 2020a; Corneanu et al., 2018; Jacob and Stenger, 2021). However, CNN-based AU detection approaches are quite data-hungry. What is worse, AU annotation is time-consuming, labor-intensive, cumbersome, and error-prone. Thus, many existing works propose to exploit auxiliary information for precise AU detection. For example, Yang et al. (2021) proposed to use semantic embedding and visual features (SEV-Net) for AU detection. SEV-Net obtains AU semantic embeddings through both intra-AU and inter-AU attention components to capture the relationships among words within each sentence that describes an individual AU. Li and Shan (2021) use categorical facial expression images as auxiliary training data to boost the AU detection performance in a meta-learning manner. These pioneering works have inspired us to use a large amount of unlabeled facial videos to learn the AU representation in an unsupervised manner, as unlabeled facial videos are easy to obtain and cover a large number of subjects with diverse facial expressions.
Recently, self-supervised learning (SSL) has shown promising potential in learning discriminative features from unlabeled data via various manually defined pretext tasks (Wang et al., 2020; Cai et al., 2021; Hu et al., 2021; Kotar et al., 2021; Luo et al., 2021; Sun et al., 2021). For the task of AU detection, Li et al. (2019b) proposed to predict the optical flow caused by AUs and poses between two randomly sampled facial frames in a video sequence. The optical flows of the AUs and poses are then linearly combined to obtain the overall displacements between the two sampled faces. Lu et al. (2020) leveraged the temporal consistency to learn the AU feature via a self-supervised temporal ranking constraint. To capture the AU correlations in an input facial image, Yan et al. (2021) disentangled the global feature into multiple AU-specific features via a contrastive loss and then computed the feature for each AU by aggregating the features from the other AU-specific features with a transformer component. To bridge the performance gap between the fully supervised and self-supervised AU detection methods, we propose a self-supervised pseudo signal based on temporally predictive coding (TPC) to capture the temporal characteristics of the AUs. Specifically, we construct a model that combines an AU feature extraction network with a convolutional gated recurrent unit (GRU) (Zonoozi et al., 2018) and a prediction head on top of the GRU that can make temporal predictions. We train the constructed model via the TPC loss, which will be detailed in Section 3.1.
To further learn the per-frame discriminativeness between the sibling facial frames within a video clip, we propose a frame-wise temporal contrastive learning mechanism. The AU detection model is tasked to perceive the temporal consistency and frame-wise discriminativeness in a self-supervised manner. The AU detection backbone is trained end-to-end with the linear combination of the two contrastive losses on the unlabeled facial videos. Afterward, we train a linear classifier on top of the pre-trained AU detection backbone with the scarce AU annotations.
In summary, the core contributions of this work are as follows:
1. We introduce self-supervised TPC for facial AU representation learning. TPC does not rely on AU annotations to learn the discriminative AU representations.
2. To further enhance the discriminability of the AU representation, TPC includes a frame-wise temporal contrastive learning constraint. TPC is capable of perceiving the temporal consistency and frame-wise discriminativeness in a self-supervised manner.
3. Experimental results demonstrate the advantages of the proposed TPC over other state-of-the-art self-supervised AU detection methods on two popular AU datasets. Image retrieval results show that the learned AU representation in TPC is superior in spotting and capturing the AU similarities between different faces.

RELATED WORK
A number of AU detection approaches have been proposed recently (Zhao et al., 2016; Li et al., 2017a,b; Li and Shan, 2021). Most of them are deep learning-based.
Since an AU corresponds to the movement of facial muscles, many approaches detect the active/inactive states of AUs locally (Zhao et al., 2016; Li et al., 2017a,b). Among them, Zhao et al. (2016) used a locally connected convolutional layer to learn the AU-specific convolutional filters. SEV-Net (Yang et al., 2021) exploited the AU semantic word embedding as the auxiliary labels. FAUT (Jacob and Stenger, 2021) was proposed to capture the relationships between AUs via a transformer. These supervised AU detection methods need manually labeled training facial data. As training images are scarce, these methods often overfit on a specific dataset and cannot generalize well. Recently, self-supervised (Wiles et al., 2018; Li et al., 2019b, 2020b; Lu et al., 2020) and weakly-supervised (Peng and Wang, 2018; Zhao et al., 2018) methods have been proposed to learn deep learning-based models from unlabeled or partially labeled images. The former usually adopt manually defined pseudo supervisory signals to learn the facial AU representation (Li et al., 2019b, 2020b; Lu et al., 2020). Among them, Fab-Net (Wiles et al., 2018) was trained to map a source facial frame to a target facial frame by estimating an optical flow field between the source and the target faces. Twin-cycle autoencoders (TCAE and TAE) (Li et al., 2019b, 2020b) were proposed to learn pose-invariant facial action features by estimating the respective optical flows for the poses and AUs via the cycle-consistency in the images and representations. Lu et al. (2020) proposed a temporally sensitive triplet-based metric learning to learn the facial AU representations by capturing the temporal AU consistency. It learns to rank the neighboring faces from the sequential frames in the correct order. Our proposed TPC differs from previous methods in three aspects. First, TPC is self-supervised in the pre-training stage.
Second, TPC does not crop the regional AU features to learn the region-specific AU feature. Instead, it uses an abundant number of unlabeled videos to enhance the AU detection performance. Finally, TPC is proposed to encode the temporal dynamics and consistency that characterize the facial AUs.
FIGURE 1 | Main idea of the proposed self-supervised temporally predictive coding (TPC) for facial AU representation learning. Given a facial sequence with $T$ faces, we use the preceding $T_1$ faces as input and exploit the remaining faces for temporal prediction. Besides, we randomly sample some triplets in each facial sequence to perceive the temporal consistency and frame-wise discriminativeness in a self-supervised manner. $\psi$ takes the context representation $c_t$ as input and estimates the features for the future frames recursively. Best viewed in color and zoomed in.
Figure 1 illustrates the main framework of the proposed TPC for AU representation learning. Given an input facial sequence sampled from an unlabeled facial video, TPC first extracts the convolutional feature maps of each face via a commonly used backbone network such as ResNet-50. Second, TPC learns the discriminativeness between different facial frames via temporal contrastive learning. We introduce the proposed TPC and then present its temporal contrastive learning paradigm below.

Temporal Predictive Coding
Videos are very appealing as a data source for self-supervision as they offer many forms of pseudo signal. In detail, the self-supervision in a video sequence generally originates from three types: spatial, spatio-temporal, and sequential. Among the three kinds of self-supervised signal, spatial supervision can be derived from the structures in the static frame, spatio-temporal supervision naturally reflects the correlation across the different frames, and sequential supervision signifies the temporal coherence. Therefore, we exploit the sequential self-supervision to learn a robust model for facial AU detection that is capable of capturing the temporal dynamics as well as the temporal consistency of the facial AUs.
Let $X = \{x_t\}_{t=1}^{T}$ denote a consecutive sequence of $T$ facial frames within an unlabeled video, where $x_t \in \mathbb{R}^{H \times W \times C}$ is the $t$-th input facial image of size $H \times W \times C$. Our goal here is to learn a model that predicts a slowly varying semantic representation based on the recent past. As illustrated in Figure 1, we partition a facial video clip into two parts, an input part $I = \{x_1, \ldots, x_{T_1}\}$ and an output part $O = \{x_{T_1+1}, \ldots, x_T\}$, where $T_1$ is the length of the input facial sequence. First, a backbone network $f(\cdot)$ maps each facial frame $x_t$ to its latent convolutional map representation $e_t \in \mathbb{R}^{H' \times W' \times C'}$, organized as height × width × channels. Then, we use a convolutional GRU to aggregate the sequential latent representations into a context representation $c_t$. The GRU uses the same gating principle as the LSTM but with a simpler architecture. In the GRU update, $h_t$ is the hidden state, and $r_t$ and $z_t$ are the reset gate value and update gate value at frame $t$. The functions $\sigma(\cdot)$ and $\tanh(\cdot)$ denote the sigmoid and hyperbolic tangent activation functions, respectively. The reset gate $r_t$ decides whether or not to forget the previous activation, and $\odot$ denotes element-wise multiplication. Figure 2 shows the main idea of the convolutional GRU.
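For reference, a commonly used form of the convolutional GRU update that matches the symbols defined above is sketched here. This is the standard formulation and not necessarily the exact variant used in this work; the kernel names $W_\ast$ and $U_\ast$, the candidate state $\tilde{h}_t$, the omission of bias terms, and the choice $c_t = h_t$ are all assumptions made for illustration, and $\ast$ denotes convolution.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z \ast e_t + U_z \ast h_{t-1}\right), \\
r_t &= \sigma\left(W_r \ast e_t + U_r \ast h_{t-1}\right), \\
\tilde{h}_t &= \tanh\left(W_h \ast e_t + U_h \ast (r_t \odot h_{t-1})\right), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \qquad c_t = h_t .
\end{aligned}
```

Some references swap the roles of $z_t$ and $1 - z_t$ in the last line; both conventions describe the same family of models.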
With the encoded context representation $c_t$, we exploit a prediction head $\psi$ to predict the latent convolutional representation of the future frames. In detail, $\psi$ takes the context representation $c_t$ as input and estimates the features for the future frame recursively, $\hat{e}_{t+1} = \psi(c_t)$, where $c_t$ is the context feature from time step 1 to $t$, and $\hat{e}_{t+1}$ is the estimated latent convolutional feature of time step $t + 1$. Similarly, we can predict the latent convolutional feature maps for the $(t + 2)$-th facial frame in a recursive manner. Such a recursive scheme enforces the prediction to be conditioned on all previous predictions and observations. The intuition behind TPC is that the model is tasked to infer future AU semantics from the context representation $c_t$, and thus $c_t$ has to encode the temporal consistency and dynamics of the facial AUs. The learning of TPC is accomplished via noise contrastive estimation, where the goal is to classify the real samples from the noisy ones. We denote the feature vectors at each spatial location of the encoded and the predicted convolutional feature maps as $e_{i,k}$ and $\hat{e}_{i,k}$, where $i$ denotes the temporal index and $k$ the spatial index in the convolutional features, $k \in \{(1, 1), (1, 2), \cdots, (H', W')\}$. The goal of the learning objective $L_{pred}$ is to classify the positive pair $(\hat{e}_{i,k}, e_{i,k})$ among a set of constructed pairs. A positive pair consists of two elements that are located at the same spatial location and at the same time step. All the other pairs $(\hat{e}_{i,k}, e_{j,m})$ that satisfy $(i, k) \neq (j, m)$ are negative pairs. $L_{pred}$ is optimized such that the similarities of the positive pairs are higher than the similarities of the negative pairs. While the proposed TPC can capture the temporal consistency and dynamics of the input facial sequences, the discriminativeness of nearby facial frames can be further enhanced so that the encoded AU representation becomes more discriminative. We explain how we use the temporal contrastive learning paradigm to achieve this goal in the next section.
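The contrastive classification described above can be sketched with an InfoNCE-style loss. The snippet below is only an illustration under stated assumptions: the dot product is used as the similarity, no temperature is applied, features are plain Python lists keyed by `(i, k)`, and the function name `tpc_prediction_loss` is hypothetical rather than taken from the original work.

```python
import math


def dot(u, v):
    """Dot-product similarity between two feature vectors (assumed choice)."""
    return sum(a * b for a, b in zip(u, v))


def tpc_prediction_loss(predicted, encoded):
    """InfoNCE-style loss over all (time step, spatial location) keys.

    predicted, encoded: dicts mapping a key (i, k) to a feature vector.
    For each predicted vector, the encoded vector with the same key is the
    positive; every encoded vector with a different key is a negative.
    """
    keys = sorted(predicted)
    total = 0.0
    for key in keys:
        logits = [dot(predicted[key], encoded[other]) for other in keys]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        positive_logit = dot(predicted[key], encoded[key])
        total += log_denominator - positive_logit
    return total / len(keys)


# Two locations in one time step, with well-separated features.
encoded = {(0, (1, 1)): [5.0, 0.0], (0, (1, 2)): [0.0, 5.0]}
swapped = {(0, (1, 1)): [0.0, 5.0], (0, (1, 2)): [5.0, 0.0]}
print(tpc_prediction_loss(encoded, encoded))  # close to zero
print(tpc_prediction_loss(swapped, encoded))  # large, predictions mismatch
```

Minimizing this quantity pushes each positive similarity above the similarities of the negatives, which is the behavior the text attributes to $L_{pred}$.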

Temporal Contrastive Learning
To learn the frame-wise discriminativeness of the input facial images, we introduce a temporal contrastive learning goal by adding multiple triplet losses (Schroff et al., 2015), each measuring the pairwise distance between the adjacent frames and the anchor frame. Learning to rank through the triplet loss trains an AU detection backbone to make the distance between the anchor and the positive face smaller than the distance between the anchor and the negative face. Let us denote a triplet that consists of three facial frames as $(x_a, x_p, x_n)$, where $x_a$, $x_p$, and $x_n$ are the anchor face, positive sample, and negative sample, respectively. Note that $x_a$, $x_p$, and $x_n$ are consecutive facial frames randomly sampled from the input facial sequence $X = \{x_t\}_{t=1}^{T}$. Intuitively, $(x_a, x_p)$ should have more similar facial expressions than $(x_a, x_n)$ because the time interval between $x_a$ and $x_p$ is smaller. Inspired by this intuition, we randomly sample $P$ triplets from the input facial sequence $X$ and expect that the sum of the $P$ triplet losses enables the AU detection backbone to perceive the facial expression differences between nearby facial frames. In the learning target $L_{tcl}$ of the proposed temporal contrastive learning paradigm, $D$ is the cosine similarity of the input frame pair, $i$ is the sequence index, and $j$ is the frame index within the $i$-th input facial sequence. $m$ is the margin that ensures $L_{tcl}$ will not be zero until the difference between the distances of the negative and positive frames from the anchor is greater than $m$.
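The triplet sampling and margin-based ranking described above can be illustrated as follows. This is a sketch under assumptions, not the authors' implementation: the hinge is written on cosine similarities as `max(0, margin - (sim_pos - sim_neg))`, the positive is defined as whichever sampled frame is temporally closer to the anchor, features are assumed to be non-zero vectors, and the helper names are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero feature vectors."""
    numerator = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return numerator / (norm_u * norm_v)


def sample_triplet(num_frames, rng):
    """Return (anchor, positive, negative) frame indices.

    The positive is strictly closer in time to the anchor than the negative.
    Requires num_frames >= 3.
    """
    while True:
        a, p, n = rng.sample(range(num_frames), 3)
        if abs(a - p) == abs(a - n):
            continue  # equal time gaps give no ordering, so resample
        if abs(a - p) > abs(a - n):
            p, n = n, p
        return a, p, n


def temporal_triplet_loss(features, num_triplets, margin, rng):
    """Sum of hinge losses over randomly sampled temporal triplets."""
    total = 0.0
    for _ in range(num_triplets):
        a, p, n = sample_triplet(len(features), rng)
        sim_pos = cosine_similarity(features[a], features[p])
        sim_neg = cosine_similarity(features[a], features[n])
        total += max(0.0, margin - (sim_pos - sim_neg))
    return total


# Toy sequence of T = 10 features that rotate smoothly over time.
features = [[math.cos(0.1 * t), math.sin(0.1 * t)] for t in range(10)]
print(temporal_triplet_loss(features, num_triplets=4, margin=0.0,
                            rng=random.Random(0)))
```

On the smoothly varying toy sequence, temporally closer frames are always more similar, so the loss vanishes when the margin is zero and becomes positive once the margin exceeds the similarity gap.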

Overall Training Objective of TPC
For pre-training, we use the linear combination of $L_{pred}$ and $L_{tcl}$, $L = L_{pred} + \lambda L_{tcl}$, where $\lambda$ controls the importance of the temporal triplet loss and will be discussed in the experimental section. For AU detection, we fine-tune the pre-trained model with the annotated AU labels. We exploit the multi-label sigmoid cross-entropy loss to optimize the AU classification head and the pre-trained backbone model. In this loss, $M$ denotes the number of facial AUs, $z_m$ denotes the $m$-th ground truth AU annotation of the input AU sample, and $\hat{z}_m$ is the predicted AU score. $z_m \in \{0, 1\}$ is the label of the $m$-th AU: 0 means the AU is inactive, and 1 means the AU is active.
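As an illustration of the two objectives above, the following sketch computes a weighted pre-training loss and a multi-label sigmoid cross-entropy. It is written under assumptions: the classifier is assumed to output raw logits, the loss is averaged over the $M$ AUs (the normalization used in the original work is not stated here), and the function names are hypothetical.

```python
import math


def total_pretrain_loss(l_pred, l_tcl, lam=0.1):
    """Weighted sum of the prediction loss and the temporal triplet loss."""
    return l_pred + lam * l_tcl


def multilabel_sigmoid_cross_entropy(logits, labels):
    """Mean binary cross-entropy over M AUs, computed from raw logits.

    Uses the numerically stable identity
    BCE(x, z) = max(x, 0) - x * z + log(1 + exp(-|x|)).
    """
    total = 0.0
    for x, z in zip(logits, labels):
        total += max(x, 0.0) - x * z + math.log1p(math.exp(-abs(x)))
    return total / len(labels)


# Three AUs: confident and correct on the first two, undecided on the third.
print(multilabel_sigmoid_cross_entropy([4.0, -4.0, 0.0], [1, 0, 1]))
print(total_pretrain_loss(1.0, 2.0, lam=0.1))
```

Working from logits avoids evaluating `log(0)` when a sigmoid output saturates, which is why the stable identity is used instead of applying the sigmoid first.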

Implementation Details
We adopted ResNet-18 (He et al., 2016) as the backbone network for pre-training. We optimized the proposed backbone model via mini-batch stochastic gradient descent. During training, we set the batch size to 64 on 4 GPUs and the initial learning rate to 0.001. For each video, we randomly sampled T = 10 consecutive faces for training; we used the first 8 faces as the input and the remaining 2 faces for prediction. Additionally, we randomly sampled P = 4 triplets from each facial sequence for temporal contrastive learning. During fine-tuning, we dropped the convolutional GRU and added a linear classifier layer for AU prediction. We set the momentum to 0.9 and the weight decay to 0.0005. We use the popular VoxCeleb dataset (Nagrani et al., 2020) for pre-training. The dataset consists of about 6,000 subjects and hundreds of thousands of videos. Each video contains a single subject with varying expressions and has no AU or facial expression annotations.
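The clip sampling described above (T = 10 consecutive faces, the first 8 as input and the last 2 as prediction targets) can be sketched as below. The function name `sample_clip` and the use of frame indices instead of decoded images are assumptions made for illustration.

```python
import random


def sample_clip(num_video_frames, clip_len=10, input_len=8, rng=None):
    """Sample clip_len consecutive frame indices and split them.

    Returns (input_indices, target_indices): the first input_len indices are
    fed to the model and the remaining ones are the prediction targets.
    """
    if num_video_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    rng = rng or random.Random()
    start = rng.randrange(num_video_frames - clip_len + 1)
    indices = list(range(start, start + clip_len))
    return indices[:input_len], indices[input_len:]


inputs, targets = sample_clip(300, rng=random.Random(0))
print(inputs, targets)
```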

Datasets and Evaluation Metric
For AU detection, we adopted the Denver Intensity of Spontaneous Facial Action (DISFA) (Mavadati et al., 2013) and Binghamton-Pittsburgh 3D Dynamic Spontaneous Facial Expression Database (BP4D) (Zhang et al., 2013) datasets. BP4D consists of a total of 328 videos recorded for 41 subjects (18 men and 23 women). Each of the 41 subjects took part in 8 different experimental tasks, and their spontaneous facial AU variations were recorded in the videos. There are nearly 140,000 frames with 12 facial AUs labeled. DISFA contains 27 participants. Each participant is asked to watch a video that elicits his/her facial expressions. The facial AUs are annotated with intensities ranging from 0 to 5. There are about 130,000 AU-annotated images in the DISFA dataset, where images with intensities greater than 1 are set as active. For the two datasets, the facial images are split into 3 folds in a subject-independent manner. We used 3-fold cross-validation and adopted 12 AUs in BP4D and 8 AUs in DISFA for evaluation. We adopted the F1-score to evaluate the performance of the proposed AU detection method. The F1-score is calculated as $F1 = \frac{2RP}{R + P}$, where $R$ and $P$ denote the recall and precision, respectively. We also use the average F1-score over all the involved AUs (Ave) to evaluate the overall facial AU detection precision.
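The evaluation metric above can be computed as in the following sketch, which derives precision and recall from binary predictions for one AU and then averages the per-AU F1-scores. Returning 0.0 when there are no true positives is a convention chosen here, and the function names are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 = 2RP / (R + P) for binary labels of a single AU."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # convention: avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * recall * precision / (recall + precision)


def average_f1(true_per_au, pred_per_au):
    """Mean F1-score over all evaluated AUs (the "Ave" column)."""
    scores = [f1_score(t, p) for t, p in zip(true_per_au, pred_per_au)]
    return sum(scores) / len(scores)


# One AU with one true positive, one false positive, one false negative.
print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```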

Experimental Results
For the supervised methods, we compare the proposed TPC with deep region and multi-label learning (DRML) (Zhao et al., 2016), enhancing and cropping net (EAC-Net) (Li et al., 2017b), deep structure inference network (DSIN) (Corneanu et al., 2018), local relationship learning with person-specific shape regularization (LP-Net) (Niu et al., 2019), semantic relationship embedded representation learning (SRERL) (Li et al., 2019a), uncertain graph neural networks (UGN) (Song et al., 2021), semantic embedding and visual feature net (SEV-Net) (Yang et al., 2021), facial action unit detection with transformers (FAUT) (Jacob and Stenger, 2021), and meta auxiliary learning (MAL) (Li and Shan, 2021). It is worth noting that some of the AU detection approaches (Li et al., 2017b, 2019a; Corneanu et al., 2018; Jacob and Stenger, 2021) learn the AU-specific representations with exclusive CNN branches by cropping the local facial regions. SEV-Net (Yang et al., 2021) learns robust visual features for AU detection by introducing auxiliary AU descriptions. UGN (Song et al., 2021) learns to model the uncertainty of the AU annotations.
For the self-supervised methods, we compare the proposed TPC with TCAE (Li et al., 2019b), TAE (Li et al., 2020b), and triplet ranking loss (TRL) (Lu et al., 2020). Among the compared methods, TRL (Lu et al., 2020) uses an aggregate ranking loss that sums multiple triplet losses to allow pairwise comparisons between adjacent facial frames. Learning to rank the faces through the triplet loss trains an encoder to make the distance between the anchor face and the positive face smaller than the distance between the anchor face and the negative face. Table 1 shows the AU detection accuracy comparison of our TPC and previous methods on the BP4D dataset. TPC obtains a comparable average AU detection accuracy. In detail, TPC shows its superiority over DRML, EAC-Net, DSIN, and LP-Net, with +12.8%, +5.2%, +2.2%, and +0.1% improvements, respectively. Notably, TPC does not rely on facial landmarks to extract specified local facial regions, which would incur a heavy computational burden in the training and inference phases. Besides, TPC does not need auxiliary AU description word embeddings or a large amount of annotated facial expression data for auxiliary learning. As different AUs are associated with specific facial muscles and correspond to fine-grained local facial regions, learning region-specific AU representations is beneficial. The success of the region-based AU detection approaches (Li et al., 2017b, 2019a, 2020b; Corneanu et al., 2018; Jacob and Stenger, 2021) has verified these benefits. We will explore this in future work. Table 2 shows the AU detection accuracy comparison of our TPC and previous methods on the DISFA dataset.
TPC achieves a slightly better average F1-score than the best state-of-the-art self-supervised AU detection methods, with a 0.8% improvement over TAE, a 7.3% improvement over TCAE, and a 12.9% improvement over TRL. Notably, TPC shows its superiority in AU1 (Inner Brow Raiser), AU2 (Outer Brow Raiser), AU6 (Cheek Raiser), and AU12 (Lip Corner Puller), and obtains comparable AU detection performance in AU9 (Nose Wrinkler) and AU25 (Lips Part). In summary, the benefits of the proposed TPC over other self-supervised AU detection methods are twofold. First, TPC explicitly learns to encode the temporal evolution and consistency of the facial AUs in the temporal sequences. The self-attention mechanism in the transformer modules is capable of perceiving the local to global interactions between different facial AUs. Second, TPC incorporates frame-wise temporal contrastive learning into the self-supervised paradigm to further learn the per-frame discriminativeness between the nearby facial frames. Thus, TPC is capable of perceiving the temporal consistency and the frame-wise discriminativeness of the facial AUs in a self-supervised manner. The consistent improvements over other self-supervised AU detection methods have verified the feasibility of TPC. We will carry out an ablation study to investigate the contribution of the two components in TPC in the next section.
TABLE 1 | Action unit (AU) detection accuracy of the proposed temporally predictive coding (TPC) and state-of-the-art approaches on the BP4D dataset. The best results among the supervised and self-supervised methods are shown in bold.
Table 3 shows the ablation experimental results. In Table 3, we show the accuracy variations with different self-supervised components and the influence of different values of λ.
As shown in Table 3, TPC shows the best AU detection performance with the linear combination of $L_{pred}$ and $L_{tcl}$ with λ = 0.1. This means that both components in TPC contribute to its success in learning discriminative AU representations. Without either of the two self-supervised targets, TPC shows degraded AU detection accuracy. Besides, TPC also suffers from low accuracy with λ = 1.0 and λ = 10.0, which suggests that the two self-supervised learning targets should be appropriately balanced to obtain discriminative AU representations.

CONCLUSION
In this paper, we propose a self-supervised pseudo signal based on TPC to capture the temporal characteristics of the facial AUs in sequential facial frames. To further learn the per-frame discriminativeness between the nearby faces, TPC incorporates frame-wise temporal contrastive learning into the self-supervised paradigm. The proposed TPC can be pre-trained without AU annotations, which makes it possible to use a large amount of unlabeled facial videos to learn AU features that are robust to undesired nuisances. Compared with supervised facial AU detection methods, TPC obtains comparable AU detection performance. Besides, TPC is superior to other self-supervised AU detection approaches. For future work, we will explore learning to perceive the regional and structural AU features in the temporal contrastive learning paradigm.

AUTHOR CONTRIBUTIONS
CW completed the algorithm design and wrote all parts of the manuscript. CW and ZW cooperatively conducted the experimental evaluation and cooperatively gave a detailed experimental analysis. ZW carefully checked the manuscript and polished the paper. Both authors have carefully read, polished, and approved the final manuscript.