Samer Attrah*†
- Hogeschool van Arnhem en Nijmegen, Automotive and Engineering Academy, Arnhem, Netherlands
Emotion estimation is a field that has been studied for a long time, and several approaches using machine learning models exist. This article presents BlendFER-Lite, an LSTM model that uses Blendshapes from the MediaPipe library to analyze facial expressions detected from a live-streamed camera feed. This model is trained on the FER2013 dataset and achieves 71% accuracy and an F1-score of 62%, meeting the accuracy benchmark for the FER2013 dataset while significantly reducing computational costs compared to current methods. For the sake of reproducibility, the code repository, datasets, and models proposed in this paper, in addition to the preprint, can be found on Hugging Face at: https://huggingface.co/papers/2501.13432.
1 Introduction
In social robotics, many parts of a robot system are integrated into a single system. Typically, the more complex the system is, the better it performs tasks, which are more commonly focused on elderly care or patient care. In addition to skeletal and motion structures, a robot's subsystems include interaction modules and sensory functions, such as speech, vision, and hearing. Since human emotion is a large part of any interaction between two or more humans, it is important for robots to interpret human emotion using sensory inputs such as speech, tone of voice, facial expressions, body poses, hand gestures, gaze direction, and head orientation. In this work, we focus solely on emotion estimation from facial expressions, given their importance (Marechal et al., 2019).
Emotion estimation from facial expressions, more commonly referred to as facial emotion recognition (FER), generally involves three steps: (1) face detection: identify a face in a camera stream or image and localize its boundaries; (2) feature extraction: identify and extract the most relevant information from the detected face. Many approaches exist; a popular one is to place key points on the most relevant parts of the face, such as the tip of the nose and the corners of the mouth, and represent them as a vector of two- or three-dimensional coordinates; and (3) expression classification: use the extracted features as inputs for a classification model that predicts the emotion conveyed by the facial expression.
The proposed emotion estimation system functions as a feedback mechanism that helps drive a conversation or interaction between a human and a robot. The system enables a robot to detect real-time human reactions through facial expressions and potentially change the topic, approach, or action of the conversation.
The scope of this research was limited by the availability of data and computational resources. Consequently, emotion classification was restricted to three classes instead of the seven proposed previously (Happy, Sad, Angry, Afraid, Surprise, Disgust, and Neutral) (Ekman and Friesen, 2003). Two of these classes retained their original labels (happy and sad), while all other emotions were grouped into a general category labeled unknown.
The system development process begins with a full data processing pipeline, which includes loading, cleaning, augmenting, extracting features from, and visualizing the data. The data are then fed into the BlendFER-Lite model for training. During inference, the trained model is integrated into the Gaze Project1, which handles face detection/localization and feature extraction; the trained model then classifies facial expressions based on those features.
Some considerations should be noted. First, the FER2013 (Goodfellow et al., 2013) dataset is considered a challenge dataset, i.e., it was not built or optimized for research purposes. It contains a group of non-relevant images, such as animations and images with covered faces, that do not provide useful information to the model. Additionally, model training and testing are performed on a laptop equipped with an NVIDIA GeForce RTX 3050 Laptop GPU, running on the Linux Ubuntu operating system, and the programming language used is Python, under the Apache-2.0 license.
This study serves as a proof of concept (POC) for the research described in Section 8 (Future research). In the context of emotion estimation for embedded systems in social robotics, the main contributions of this study are as follows:
1. Building BlendFER-Lite into a compact and cost-effective model suitable for inference on a microcomputer, while considering the spatial and temporal aspects of facial expression.
2. Using the MediaPipe library's Face Landmarker task for face localization and feature extraction, and incorporating Blendshapes (Lugaresi et al., 2019a) as features, a technique rarely employed in FER applications.
The remainder of this article is organized as follows: Section 2 covers related work, Section 3 details the methods, Section 4 discusses the ablation studies, Section 5 presents the results, Section 6 discusses the results and findings of the study, Section 7 concludes the study, and Section 8 presents the future work to be based on this study.
2 Related work
Facial emotion recognition has been an active research area for a long time (Ekman and Friesen, 2003). The main challenges in building an emotion estimation system are as follows (Zafeiriou et al., 2017):
• The complexity of the emotion patterns.
• The emotions are time-varying.
• The process is user- and context-dependent.
Many methods exist for face detection and tracking, such as the Haar cascade algorithm. For feature extraction, classical algorithms such as the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT) have been used (Gautam and Seeja, 2023; Kumar et al., 2016); more recently, Convolutional Neural Networks (CNNs) (Li and Deng, 2020) and transformer encoders (Min et al., 2024) have been employed to detect faces and extract features. Classification models also employ CNNs, Recurrent Neural Networks (RNNs), or hybrid architectures (Li and Deng, 2020; Leong et al., 2023).
2.1 Feature extraction
Feature extraction can be considered a special type of data dimensionality reduction aimed at identifying a subset of informative variables from image data (Egmont-Petersen et al., 2002). The most common approach for extracting high-quality features from facial images involves estimating facial landmarks (Kumar et al., 2025; Belmonte et al., 2021; Shaila et al., 2023). One approach involves identifying the nose tip and segmenting it by cropping a sphere centered on it. Then, the eyebrows, mouth corners, eye corners and possibly many other facial landmarks are identified on the segmented sphere. Finally, by measuring the distances between landmarks (An et al., 2023) and changes in their positions, facial expressions can be interpreted (Farkhod et al., 2022).
2.1.1 History
Looking back at the evolution of FER, one method used to extract features is HOG (Gautam and Seeja, 2023), which determines edge presence, direction, and magnitude. Another is SIFT (Kumar et al., 2016), which follows a four-step process: (1) constructing the scale space, (2) localizing key points, (3) assigning orientation, and (4) assigning unique fingerprints. A classification method that often accompanies these handcrafted features and is still widely used today is the Support Vector Machine (SVM) (Cortes and Vapnik, 1995; Leong et al., 2023).
More recently, feature extraction has been performed using deep neural networks (Li and Deng, 2020; Leong et al., 2023), such as fully connected networks (FCNs) and CNNs, which construct models consisting of multiple layers and hyperparameters. The complexity of the data, the required performance, and the available computational resources influence which model is implemented.
2.1.2 MediaPipe
This study used the MediaPipe library (Lugaresi et al., 2019a,b) for face detection, tracking, feature extraction, and landmark detection. The Face Landmarker task can yield three outputs per image or frame: 468 3D landmark vectors, 52 Blendshapes, and facial transformation matrices. In previous work (Bisogni et al., 2023), two approaches were proposed for feature extraction, one of which used MediaPipe to extract features from images. The results indicated that this library can be more effective than deep learning networks, such as CNNs, for subtle facial expressions; the more intense the expressions, the smaller the performance gap between the two approaches, and in the most extreme cases, CNNs perform better across most emotions.
In another study (Savin et al., 2021), MediaPipe was compared to OpenFace for landmark detection. OpenFace (Baltrušaitis et al., 2016) uses OpenCV to detect 68 facial landmarks, whereas MediaPipe uses TensorFlow (Abadi et al., 2016) to identify 468 landmarks that are arranged in fixed quads and represented by their coordinates (x, y, z). Figure 1 shows an image annotated with the landmarks used for facial expression recognition.
Another study, Kleisner et al. (2024), proposes adding a CNN that processes MediaPipe 3D landmark vectors to improve landmark placement precision on 2D facial photographs.
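For concreteness, the following is a minimal sketch of obtaining these outputs with the MediaPipe Tasks Python API; the model bundle path and image file name are illustrative assumptions, not part of this study's code.

# Minimal sketch (not the paper's code): extracting landmarks and Blendshapes
# with the MediaPipe Tasks Face Landmarker. "face_landmarker.task" and
# "face.jpg" are assumed local files.
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.FaceLandmarkerOptions(
    base_options=python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,                 # 52 Blendshape scores per face
    output_facial_transformation_matrixes=True,   # facial transformation matrices
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("face.jpg")
result = landmarker.detect(image)

if result.face_blendshapes:                       # empty list if no face was detected
    landmarks = result.face_landmarks[0]          # per-landmark normalized (x, y, z) coordinates
    blendshapes = {b.category_name: b.score for b in result.face_blendshapes[0]}
    print(len(landmarks), "landmarks,", len(blendshapes), "Blendshapes")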
2.1.3 Blendshapes
The second output of the MediaPipe face landmark detection task model consists of Blendshapes, which provide an approximate semantic parameterization and a simple linear model for facial expressions (Lewis et al., 2014). This technique originated in the industry before gaining traction in academia and has been widely used in computer graphics. Although the Blendshapes technique is conceptually simple, developing a full Blendshapes face model is computationally intensive. To express a complete range of realistic expressions, one face might require more than 600 Blendshapes.
The construction of a single Blendshapes model was originally guided by the Facial Action Coding System (FACS) (Ekman and Rosenberg, 1997). This system enables manual coding of all facial displays, known as action units, and more than 7,000 combinations have been observed. The FACS action units are the smallest visibly discriminable changes in a facial display, and combinations of FACS action units can be used to describe emotional expressions (Ekman, 1993) and global distinctions between positive and negative expressions.2
MediaPipe incorporates a set similar to the ARKit face Blendshapes,3 which consists of 52 Blendshapes describing facial parts and expressions. These are quantified with probability scores from 0 to 1, indicating the presence of specific Blendshapes, as shown in Figure 1, while Figure 2 presents a Blendshapes histogram.
Figure 2. Histogram for the Blendshapes of the image in Figure 1 via MediaPipe.
2.2 Emotion classification
For emotion classification, many approaches have been used in research, such as support vector machines (SVMs) (Kumar et al., 2016), implemented with linear as well as radial basis function kernels, and stochastic gradient descent (SGD) classifiers, which achieved approximately 95% accuracy on the Radboud Faces Database (RaFD) (Langner et al., 2010). Bisogni et al. (2023) compared two approaches, a MediaPipe-SVM and a CNN-LSTM: MediaPipe and the CNN were employed for feature extraction, and the SVM and LSTM (Hochreiter and Schmidhuber, 1997) for classification. Experiments showed that features extracted using MediaPipe are superior to those extracted by the CNN; consequently, the classification performance of the first combination is better than that of the second.
One approach in Huang et al. (2023) used ResNet (He et al., 2015) blocks in addition to the Squeeze-and-excitation network (Hu et al., 2018) for expression classification and demonstrated the improvements that can be achieved using the correct transfer learning approach. It also used the suggested model to examine the important features and the location of the major facial information using feature maps, which will be further discussed in Section 6.
Another type of image classification method adopts CNNs. Gautam and Seeja (2023) used a sequential model of three convolutional layers and a dense layer to classify input features, achieving high accuracy on the dataset used.
Since the model being developed is designed for video processing, where data are sequential, as stated in Goodfellow (2016), choosing an appropriate RNN is crucial. Just as a CNN is a neural network specialized for processing a grid of values, such as an image, an RNN is a neural network specialized for processing a sequence of values or sequential data.
When working with sequential data, such as videos, and considering temporal dependencies between images, an RNN is the best choice (Li and Deng, 2020). The LSTM, a special type of RNN, addresses the vanishing and exploding gradient problems common in RNN training. Jain et al. (2018) compared architectures and found that a network combining a CNN part and an RNN part delivers better accuracy than a network with only CNN layers, by approximately 20%.
Another study discussing RNNs is Leong et al. (2023), where a CNN-LSTM architecture was proposed as a facial recognition system that can understand spatiotemporal properties in video.
3 Methods
This section discusses the approach to building the emotion estimation system, starting with the dataset used and the data processing techniques, then building and optimizing the model, and finally evaluating and testing the model's performance.
3.1 Dataset
Since the model being developed is intended for video-based inference, selecting a video dataset with varied facial expressions was the first priority. Li and Deng (2020) surveyed a group of datasets, some of which include videos, such as MMI (Pantic et al., 2005; Valstar and Pantic, 2010) and AFEW 7.0 (Dhall et al., 2017). However, due to computational constraints and the intensive preprocessing required for video-based datasets, we opted for an image-based dataset instead.
Among the datasets considered, MultiPIE (Gross et al., 2010) was included due to its image quality and size. During the search, we also considered the Radboud Faces Database RaFD (Langner et al., 2010) for the structure of the database and the quality of the distribution. Another choice was AffectNet (Mollahosseini et al., 2017), owing to its suitable size and popularity as a benchmark to evaluate the performance of models in many studies and experiments.
In this study, the FER2013 dataset (Goodfellow et al., 2013) was used, chosen for its simplicity. This open-access dataset contains around 35K 48 × 48 grayscale images, a sufficient number that is easy to obtain.
The dataset was downloaded from Kaggle4 in .csv format. The data include the class, image pixel array grayscale values, and the data split as columns and training examples as rows.
The FER2013 dataset (Goodfellow et al., 2013) is categorized into seven classes: happy, sad, angry, afraid, surprise, disgust, and neutral. The full dataset is split into three groups: training, public testing, and private testing, which serve as the training, validation, and test sets, respectively. The distribution of images across these subsets is shown in Table 1.
The table shows that (1) the happy, sad, and neutral classes have the highest number of samples, whereas the disgust class has the lowest representation, and (2) the public test and private test splits are well suited for use as the cross-validation and test sets, respectively. The next subsections discuss the data processing techniques applied to restructure the dataset into a form that best serves the requirements of the model being built.
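As an illustration, a minimal sketch of loading the CSV and separating the three splits is given below; it assumes the standard Kaggle column names (emotion, pixels, Usage) and file name, which should be verified against your copy of the data.

# Minimal sketch (assumed column names from the Kaggle CSV: emotion, pixels, Usage).
import numpy as np
import pandas as pd

df = pd.read_csv("fer2013.csv")

def row_to_image(pixel_string: str) -> np.ndarray:
    """Convert the space-separated pixel column into a 48x48 grayscale array."""
    return np.array(pixel_string.split(), dtype=np.uint8).reshape(48, 48)

train = df[df["Usage"] == "Training"]
val = df[df["Usage"] == "PublicTest"]     # used as the validation set
test = df[df["Usage"] == "PrivateTest"]   # used as the test set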
3.2 Data processing and cleaning
Since the FER2013 dataset (Goodfellow et al., 2013) was originally designed as a challenge dataset, additional data processing steps are required to adapt it for research and real-world robotic applications. To ensure its suitability for emotion estimation in this context, a series of data cleaning and preprocessing steps must be applied.
3.2.1 Creating training data classes
The first step in data processing involves splitting the dataset into three subsets: training, validation, and test. Owing to class imbalance, we decided not to use the full set of available training class images; instead, we included a portion of the available samples, as shown in Table 2.
These values were chosen because the model classifies only the happy, sad, and unknown classes, rather than the full seven emotions in the dataset. Assigning 4,000 samples to each of the two main emotion classes yields a more balanced dataset. The remaining classes are allocated 1,500 samples each (6,000 in total), plus approximately 400 images for the disgust class. This distribution provides a sufficient number of training examples while keeping the unknown class roughly aligned in size with the happy and sad classes.
Table 1 shows that including more training examples from each class is possible for some classes, but including all examples could lead to imbalanced data representation and a lower-quality distribution.
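A minimal sketch of this re-sampling step is shown below; it assumes the commonly documented FER2013 integer emotion codes (0 angry, 1 disgust, 2 fear, 3 happy, 4 sad, 5 surprise, 6 neutral) and the train DataFrame from the loading sketch above, so the codes should be verified against your copy of the dataset.

# Minimal sketch of the three-class subset in Table 2 (emotion codes are an assumption to verify).
import pandas as pd

SAMPLES = {3: 4000, 4: 4000,                              # happy, sad
           0: 1500, 2: 1500, 5: 1500, 6: 1500, 1: 400}    # merged into "unknown"
NAMES = {3: "happy", 4: "sad"}

def build_training_subset(train: pd.DataFrame) -> pd.DataFrame:
    parts = []
    for code, n in SAMPLES.items():
        rows = train[train["emotion"] == code]
        rows = rows.sample(n=min(n, len(rows)), random_state=0)
        parts.append(rows.assign(label=NAMES.get(code, "unknown")))
    return pd.concat(parts, ignore_index=True)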
3.2.2 Readability by MediaPipe
To clean the dataset and prevent runtime errors caused by images that MediaPipe cannot detect, a detection program was developed. All images in the training set were processed with MediaPipe, and undetectable images were excluded from the dataset.
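A minimal sketch of this cleaning step is given below; it reuses the landmarker and row_to_image helpers from the earlier sketches, which are illustrative names rather than the study's actual program.

# Minimal sketch: drop training rows whose image yields no MediaPipe detection.
import cv2
import mediapipe as mp

def is_detectable(pixel_string: str) -> bool:
    gray = row_to_image(pixel_string)                        # 48x48 uint8 image
    rgb = cv2.cvtColor(gray, cv2.COLOR_GRAY2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
    return len(landmarker.detect(mp_image).face_blendshapes) > 0

train = train[train["pixels"].map(is_detectable)].reset_index(drop=True)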
This process resulted in the following counts for each class, as shown in Table 3.
As described earlier, MediaPipe detects the main parts of the face and estimates landmark vectors and blendshapes for each image. If a part of the face is obstructed, this process becomes infeasible. Images with occlusions, such as those in Figure 3, are considered undetectable.
Figure 3. Example of unidentifiable samples from the FER2013 dataset. This figure displays representative 48 × 48 pixel images from the FER2013 dataset where faces were not successfully detected by the MediaPipe framework, primarily due to insufficient facial exposure or occlusion.
3.2.3 Indexing the test set
This step followed a few experiments that revealed an unexplainably high error rate in model performance. As a tracking method, a column was added to the test set in the CSV file, which assigns a number to each image in the FER2013 dataset. This number remains associated with the image throughout data processing and is transferred to the image's Blendshapes when the Blendshapes dataset (Attrah, 2025b) is created.
This step helps track images with high error rates when the model is evaluated on the test set, facilitating visualization of the images or the creation of different types of plots to identify patterns in the images causing the errors.
3.2.4 Augmenting the training set
To increase the number of training images and the variety of poses, data augmentation was applied; generating new examples (More, 2016) with random transformations improves the model's generalizability. The augmentation techniques used are as follows:
1. Random horizontal flip.
2. Random rotation of up to 0.2 × 180 degrees in the clockwise or counterclockwise direction.
These techniques are based on the model's use case. Since it is designed for a social robot, horizontal flipping will enable the robot to interpret reflections and provide a valid facial image from a different camera angle. Additionally, random rotations improve model performance by accounting for natural head tilts and rotations, which are commonly observed in human interactions. For a robot, the ability to detect and classify facial expressions from various angles is crucial for real-world applications.
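A minimal Keras sketch of these two augmentations is shown below; note that Keras interprets the RandomRotation factor as a fraction of a full 360° turn, so the factor may need to be halved if the intended range is ±0.2 × 180°.

# Minimal sketch of the augmentation step (factor semantics noted above).
import tensorflow as tf

augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),   # random horizontal flip
    tf.keras.layers.RandomRotation(0.2),        # random rotation in either direction
])

# Usage: augmented = augment(images, training=True)  # images shaped (N, 48, 48, 1)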
Processing the image dataset through augmentation yields a new, extended dataset with more than 20,000 images across the three classes described earlier.
3.2.5 Blendshapes dataset
Using MediaPipe to detect faces in a video stream requires extensive data processing before the data are fed into the classification model. To integrate the model with a face detection program, the model needs to be trained to interpret the program's output, i.e., the Gaze project's Blendshapes. Consequently, the full dataset is converted into a Blendshapes dataset (Attrah, 2025b), which is used to train the model.
Using Blendshapes was preferred over the 3D landmark vectors and the base 48 × 48 images to save a significant amount of computation during inference: the 3D landmark vectors amount to 1,404 features per image (468 landmarks × 3 coordinates), the raw 48 × 48 images to 2,304 features, while the Blendshapes amount to only 52 features. Further processing steps that were experimented with are described in the Supplementary material.
3.3 BlendFER-Lite architecture and training
The classification model BlendFER-Lite is built primarily from long short-term memory (LSTM) layers (Hochreiter and Schmidhuber, 1997), with the exception of the last dense layer, which employs softmax activation to produce classification scores from the output of the final LSTM layer.
The decision to incorporate LSTM layers in building the BlendFER-Lite model was made for the following reasons:
1. The BlendFER-Lite model is designed to be integrated into a video stream and to classify faces, meaning it operates on a time-series basis, where each input depends on the sequence of preceding inputs. Given this dependency, using a recurrent neural network (RNN) (Goodfellow, 2016) is intuitive, as it retains information from past states in addition to the current input.
2. Another reason is that the LSTM is a gated recurrent architecture, like the gated recurrent unit (GRU) (Chung et al., 2014). One advantage of its internal gating mechanism is that it helps filter out irrelevant features, allowing the model to focus on the most critical ones.
3. LSTM stands for long short-term memory. A key advantage of this network over other recurrent architectures is its ability to maintain long-term dependencies effectively, not just short-term ones, as is the case with simple RNNs (Medsker and Jain, 2001) and GRU networks.
Once the LSTM layer was selected, the next step was designing the model architecture. Several key factors were considered, such as accuracy, recall, precision, latency, and model size. After a few experiments to determine the best parameters, the Keras Tuner (O'Malley et al., 2019) was employed as an architecture search framework to identify the best values for the hyperparameters. The search space was defined based on prior experiments and empirical intuition5.
The final model was trained for 5,000 epochs, which took approximately 2 days. The batch size was 128, and the architecture was plotted using a design similar to the Keras utils model plotting function, as shown in Figure 4. The optimizer used for training was AdamW (Kingma, 2014; Loshchilov, 2017), with a learning rate of 1.09e-06, a global clipping norm of 1, and AMSGrad (Reddi et al., 2019). The adopted callbacks were the checkpoint callback and early stopping callback.
For the loss function, categorical cross-entropy was used to encode the labels of the images as one-hot vectors, and the metrics for evaluation were loss, categorical cross-entropy, categorical accuracy, and the F1-score. Other details and experiments are available in the Supplementary material.
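To make the setup concrete, the following is a minimal sketch of a BlendFER-Lite-style model compiled with the optimizer, loss, and metrics described above; the layer widths are illustrative assumptions and do not reproduce the exact architecture shown in Figure 4.

# Minimal sketch (layer widths are assumptions, not the published architecture).
import tensorflow as tf

N_FEATURES, N_CLASSES = 52, 3   # Blendshape inputs (27 after the selection in Section 4.1), output classes

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1, N_FEATURES)),       # one Blendshape vector per time step
    tf.keras.layers.LSTM(48, return_sequences=True),
    tf.keras.layers.LSTM(48, return_sequences=True),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.LSTM(24),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.AdamW(
        learning_rate=1.09e-6, global_clipnorm=1.0, amsgrad=True),
    loss=tf.keras.losses.CategoricalCrossentropy(),      # labels encoded as one-hot vectors
    metrics=[tf.keras.metrics.CategoricalCrossentropy(),
             tf.keras.metrics.CategoricalAccuracy(),
             tf.keras.metrics.F1Score(average="macro")], # F1Score requires TF >= 2.13
)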
3.4 Model evaluation
The model evaluation process is conducted for every set of weights produced by the checkpoint callback that shows improved training and validation performance metrics; a model's weights are saved approximately every 100 epochs.
Each model is evaluated by predicting the labels of the test-set Blendshapes and comparing them to the ground-truth labels. The loss, categorical cross-entropy, categorical accuracy, and F1-score are calculated for each model, and the F1-score is used as the main metric for optimization.
4 Ablation studies
To better understand the BlendFER-Lite model and its potential applications, several ablation studies were conducted as part of the experiments. These studies aimed to determine the optimal number of Blendshapes, the most effective model type, and the best loss function.
4.1 Selection of relevant Blendshapes
To improve the efficiency of the model, reduce computational requirements, and deliver quality classification results with less processing time and memory usage, a subset of the Blendshapes returned by MediaPipe is disregarded. For each detected face in a video frame, MediaPipe returns a total of 52 Blendshapes, similar to the ARKit face Blendshapes.
This approach was developed after observing, as is apparent in Figure 2, that some Blendshapes have no score. Testing the whole dataset (Attrah, 2025b) revealed that some Blendshapes have no scores across all images, as shown in Figure 5, where each subplot rectangle corresponds to a Blendshape and the blue dots represent the images. In some subplots, the dots are spread almost uniformly across the entire rectangle, indicating coverage of the full 0–1 range of Blendshape values, while in other subplots, the blue dots are concentrated in the lower half of the rectangle.
The research considered two strategies: (1) count the number of times a certain Blendshape has a zero score and discard it if it has a high count and (2) set a threshold value and count how many times each of the Blendshapes has a score that exceeds this threshold.
After testing various values for the high-count threshold in the first approach and different threshold values in the second approach, 0.4 was selected as the cutoff threshold, and 100 was set as a high-count number. Thus, for any Blendshape, if its score does not exceed 0.4 in more than 100 images in the dataset (Attrah, 2025b), it is discarded.
This choice was made for the following reasons:
• Counting zero values and making decisions on the basis of these criteria would have resulted in a small and non-relevant number of Blendshapes being discarded, which would not have improved the efficiency of the model.
• Using the second approach and setting a threshold higher than 0.4 would have resulted in discarding Blendshapes with a wide range of probabilities. These Blendshapes contribute valuable information to the model during training and removing them would result in a more compact model but reduced accuracy and performance.
• Setting a count higher than 100 would have resulted in a significant decrease in the number of Blendshapes being used, which would make it more difficult for the model to learn.
These reasons were validated through experimentation: training models on different data collections, testing their performance and size, and then making decisions for each number mentioned.
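A minimal sketch of the selected rule, assuming the Blendshapes dataset is a DataFrame with one column of scores per Blendshape, is:

# Minimal sketch of the selection rule (threshold 0.4, count 100) described above.
import pandas as pd

THRESHOLD = 0.4   # minimum score that counts as informative
MIN_COUNT = 100   # a Blendshape must exceed THRESHOLD in more than 100 images

def select_blendshapes(scores: pd.DataFrame) -> list:
    counts = (scores > THRESHOLD).sum(axis=0)   # per-column count of images above threshold
    return counts[counts > MIN_COUNT].index.tolist()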
This process retained 27 of the 52 Blendshapes, listed as important in the Supplementary material, reducing the model size while keeping performance metrics, such as accuracy and F1-score, unchanged.
For further clarification, Figure 5 shows the scatter plot for the 52 Blendshapes. The Blendshape importance table in the Supplementary material highlights how many of its entries exhibit low values. This finding suggests that certain Blendshapes contribute minimal information and can be omitted.
4.2 Dense neural network model
To construct a faster model with shorter prediction time (latency) than the LSTM network, another, simpler network was built, with a fully connected (dense) layer as the only layer in the model. After a few optimization experiments, the model achieved high performance. However, when integrated into a real-time system using a camera stream, it lacked acceptable stability and oscillated between the classes without any change in the camera view, resulting in predictions with insufficient accuracy. This is because the layer does not consider the time dimension in the input data, which is well handled by the LSTM layer, as discussed further in the Supplementary material.
4.3 CCE and MSE loss functions
In earlier stages of the research, the mean squared error (MSE) loss function was used to train the LSTM model. However, because MSE measures the distance between the prediction and the ground truth rather than a difference between probability distributions, it is better suited to regression applications. Therefore, the loss function had to be changed after a certain point.
Categorical cross-entropy, in contrast, measures the difference between the predicted probability assigned to the correct category and the one-hot ground-truth label, making it more suitable for classification tasks. Given that the application involves three classes, categorical cross-entropy was chosen.
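For reference, with a one-hot label y and predicted class probabilities ŷ over C = 3 classes, the two losses compare as

\mathrm{MSE}(y,\hat{y}) = \frac{1}{C}\sum_{c=1}^{C} (y_c - \hat{y}_c)^2, \qquad \mathrm{CCE}(y,\hat{y}) = -\sum_{c=1}^{C} y_c \log \hat{y}_c,

which makes explicit that categorical cross-entropy directly penalizes the probability assigned to the correct class.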
To address class imbalance in the dataset, a potential improvement was using categorical focal cross-entropy (Lin, 2017). Although it was originally designed for object detection, previous research has shown that categorical focal cross-entropy improves classification model performance.
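A minimal sketch of swapping this loss into the compile step of the model sketched in Section 3.3 is given below; the alpha and gamma values are the common defaults, not tuned settings from this study.

# Minimal sketch: categorical focal cross-entropy as a drop-in replacement loss.
import tensorflow as tf

focal = tf.keras.losses.CategoricalFocalCrossentropy(alpha=0.25, gamma=2.0)
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1.09e-6),
              loss=focal,
              metrics=[tf.keras.metrics.CategoricalAccuracy()])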
5 Results
The results obtained by the BlendFER-Lite model rank within the top five of the classification benchmark for this dataset on the Papers With Code website,6 taking into consideration the requirement that the model be small. For a fairer comparison, BlendFER-Lite is compared only to non-transformer neural networks trained without extra data, such as the models listed in Table 4.
Table 4 shows that the BlendFER-Lite model improves performance by around 2% compared to the LocalLearning BOW and DeepEmotion models; given its minimal architecture and model size, this improvement is considerable. While Ad-Corre, VGGNet, and ResNet18 with tricks deliver higher accuracy, they do not surpass a 2% improvement, and their depth makes them resource-consuming. The best metrics, aside from accuracy, obtained by the model on the test set, which consists of the Blendshapes of 291 happy, 245 sad, and 1,110 unknown images, are as follows:
• Loss = 0.6238.
• Categorical cross-entropy = 0.6235.
• Categorical accuracy = 0.7199.
• F1-score = 0.6298.
For a more detailed analysis, the model's confusion matrix is shown in Table 5.
The confusion matrix indicates that the happy emotion achieved the highest classification accuracy, with 251 correctly classified instances out of 291 (40 misclassified). The unknown class exhibited a correct classification rate of 850 out of 1,110 instances; although this represents the largest absolute number of misclassifications (260) across the three classes, it is likely attributable to the inherent feature diversity and variability encompassed within this category. The sad emotion class demonstrated the lowest performance, with only 84 out of 246 instances correctly classified, resulting in 162 misclassified samples. While the unknown class showed the highest absolute number of misses, the sad class registered the highest proportional misclassification rate. It is also important to note that the test set for the sad class was the smallest of the three. Breaking down each part of the confusion matrix (Crall, 2023) gives the per-class counts for the happy, unknown, and sad classes shown in Table 6.
From these numbers, we can see that the unknown class has the highest counts in all parts, indicating consistent performance across all classes, with inaccuracy proportional to the total sample count of the class. Additionally, precision and recall can be computed directly from these counts using Equations 4 and 5 (restated below for reference); calculating these metrics for each class yields the per-class precision and recall shown in Table 7.
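Per class c, with true positives TP_c, false positives FP_c, and false negatives FN_c taken from the confusion matrix, the standard definitions (presumably what Equations 4 and 5 refer to) are

\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad F1_c = \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}.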
The findings in Table 7 show that the unknown class is the easiest to detect, followed by the happy class, with the sad emotion last, having the lowest precision, recall, and F1-score and therefore being the most difficult to recognize. This is an expected outcome of the large imbalance in the dataset.
Experimenting with different architectures, hyperparameters, and longer training times did not yield consistently better results: one experiment would affect the classification accuracy, another the temporal stability. The confusion matrix also continued to fluctuate, showing inconsistent improvements in classification for the happy and sad classes.
At one point, the classification of the sad emotion improved, resulting in higher precision and recall, while at another, the happy class was favored; the experiment reported in Table 5 reveals a bias toward the happy class.
An attempt was made to assign weights to each class so that the model would focus on the happy and sad classes, giving them greater importance while allocating less weight to the unknown class, which was easily classified in all experiments. However, some limitations prevented the use of class weights in the experiments.
When the model was integrated into the test camera stream, as demonstrated in the demo video,7 the model outputs classifications for happy and sad. However, when facial features change, the unknown class is frequently assigned. For all other facial expressions, the unknown class remains the predicted output.
Regarding latency, the model was measured separately from the MediaPipe Blendshapes extraction pipeline and shows a mean latency of 33.47 ms across the full test set of the FER2013 Blendshapes dataset.
The model size as a zipped .keras file is 775.385 KB (Attrah, 2025a), which is significantly smaller than the other models in the benchmark, and the model has 44,631 parameters. This makes it the fastest-running model with the smallest memory usage compared to the best model in Table 4, which is based on the ResNet-18 network and has more than 11M parameters.
6 Discussion
Considering the use case for which the model is developed, it is necessary to optimize for latency and model size, in addition to the metrics specified at model compilation, such as loss, cross-entropy, accuracy, and F1-score. These metrics are critical during the implementation phase when deploying the model on an edge device for real-time inference. One method experimented with involved quantizing the Blendshapes and using only 4 floating-point numbers out of 16. This significantly increased the model's speed but negatively impacted the performance metrics and reduced accuracy.
Regarding model size, the models in Goodfellow et al. (2013), Min et al. (2024), An et al. (2023), Huang et al. (2023), Zhou et al. (2023), Fard and Mahoor (2022), Vulpe-Grigorasi and Grigore (2021), Yuan (2021), Khaireddin and Chen (2021), and Minaee et al. (2021) are large, which makes them unfit for implementation on an edge device, even though some of them show better performance metrics. Febrian et al. (2023) present a CNN-BiLSTM model whose LSTM part is similar to ours in processing the temporal dimension of the data; we do not use bidirectional layers because the experiments in Section 3.3 showed that they do not improve the results and only increase the training time. The first part of that model, a CNN, is quite different from MediaPipe, although both address the spatial dimension of the data, and, as noted in Section 2.1.2 based on Bisogni et al. (2023), MediaPipe shows superior results to a CNN model for feature extraction.
Section 4.1 proposed and validated a statistical approach to reduce the number of Blendshapes used to represent the face. The Blendshapes retained come mostly from the brows, eyes, and mouth, as shown in the Supplementary material. In contrast, Huang et al. (2023) generate feature maps showing that the mouth and the nose carry the most information and importance for classification. This difference may result from the different datasets and approaches used, since Blendshapes are a different concept from feature maps. Note also that only two of the 52 Blendshapes in the MediaPipe set represent the nose, which might hide some information.
7 Conclusion
This article presented the building of BlendFER-Lite, a model to be implemented on a microcomputer in a social robot to detect faces in a camera video stream, extract Blendshapes as features from the face using MediaPipe, and classify facial expressions into the emotions they represent: happiness, sadness, or unknown.
We demonstrated that a 4-layer LSTM model is a suitable architecture for classifying the Blendshapes of faces in camera stream frames. The model showed good temporal stability, and good results were obtained on the video stream even though the model was trained on an image dataset. In addition, the model achieves results competitive with the dataset's benchmark, with no loss in accuracy throughout the feature extraction and classification pipeline, while saving considerable memory and computation compared to other available methods.
Furthermore, we conducted ablation studies and suggested methods to further improve the model's resource efficiency and performance.
8 Future research
Based on this research, a new model will be built and trained mainly on the Aff-Wild2 dataset (Kollias et al., 2019b; Kollias and Zafeiriou, 2019; Kollias et al., 2019a, 2020; Kollias and Zafeiriou, 2021a,b; Kollias et al., 2021; Kollias, 2022, 2023; Kollias et al., 2023; Zafeiriou et al., 2017) and tested on other datasets in addition to Aff-Wild2, to deliver results on facial expression classification, action units, and valence/arousal. It will use techniques discussed and implemented in this research, as well as more advanced architectures and models, such as transformer-based models (LLMs, VLMs, and MLLMs). Afterward, the work will expand into another research project that includes other modalities, such as voice tone, body pose, and natural language, to estimate emotions, and will use compound facial expressions (Du et al., 2014) to obtain a more accurate and detailed estimation of human emotion.
Further studies about reducing the feature count will be conducted on other types of data and features.
Data availability statement
The datasets used for this study can be found at: Kaggle/FER2013 blendshapes dataset example (Partial) [link] and Kaggle/Challenges in Representation Learning: Facial Expression Recognition Challenge [link].
Ethics statement
Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article because, besides the reason above, the dataset is licensed and verified by many institutions, so consent does exist, although we do not hold a copy of it, and the submitted manuscript presents a dataset created only by processing the aforementioned dataset, without adding any original images.
Author contributions
SA: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Writing – original draft, Writing – review & editing.
Funding
The author(s) declare that no financial support was received for the research and/or publication of this article.
Acknowledgments
We thank Professor Marijn Jongerden, Jeroen Veen, and Dixon Devasia for their help and guidance throughout the process, and Victor Hogeweij for his contribution to the Gaze project. Additionally, we thank Bhupinder Kaur, An Le, and Muhammad Reza for their insights and support during the early stages of the study.
Conflict of interest
The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Gen AI was used in the creation of this manuscript.
Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.
Publisher's note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbot.2025.1678984/full#supplementary-material
Footnotes
1. ^https://gitlab.com/Hoog-V/gaze/-/tree/main/gaze_estimator_python_rpi?ref_type=heads
2. ^Blendshapes: a facial expression representation (blog).
3. ^https://arkit-face-blendshapes.com/
4. ^Link to FER2013 Kaggle challenge data.
5. ^https://medium.com/p/1833f774051f
6. ^https://paperswithcode.com/sota/facial-expression-recognition-on-fer2013
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). “Tensorflow: a system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.
An, Y., Lee, J., Bak, E., and Pan, S. (2023). Deep facial emotion recognition using local features based on facial landmarks for security system. Comput. Mater. Continua 76:39460. doi: 10.32604/cmc.2023.039460
Attrah, S. (2025a). BlendFER-Lite. Kaggle. Available online at: https://www.kaggle.com/m/484311 (Accessed December 29, 2025).
Attrah, S. (2025b). FER2013 Blendshapes Dataset Example (Partial). Kaggle. Available online at: https://www.kaggle.com/dsv/10716347 (Accessed December 29, 2025).
Baltrušaitis, T., Robinson, P., and Morency, L.-P. (2016). “Openface: an open source facial behavior analysis toolkit,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 1–10. doi: 10.1109/WACV.2016.7477553
Belmonte, R., Allaert, B., Tirilly, P., Bilasco, I. M., Djeraba, C., and Sebe, N. (2021). Impact of facial landmark localization on facial expression recognition. IEEE Trans. Affect. Comput. 14, 1267–1279. doi: 10.1109/TAFFC.2021.3124142
Bisogni, C., Cimmino, L., De Marsico, M., Hao, F., and Narducci, F. (2023). Emotion recognition at a distance: the robustness of machine learning based on hand-crafted facial features vs. deep learning models. Image Vis. Comput. 136:104724. doi: 10.1016/j.imavis.2023.104724
Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735
Cortes, C., and Vapnik, V. (1995). Support-vector networks. Mach. Learn. 20, 273–297. doi: 10.1023/A:1022627411411
Crall, J. (2023). The mcc approaches the geometric mean of precision and recall as true negatives approach infinity. arXiv preprint arXiv:2305.00594.
Dhall, A., Goecke, R., Ghosh, S., Joshi, J., Hoey, J., and Gedeon, T. (2017). “From individual to group-level emotion recognition: Emotiw 5.0,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, 524–528. doi: 10.1145/3136755.3143004
Du, S., Tao, Y., and Martinez, A. M. (2014). Compound facial expressions of emotion. Proc. Nat. Acad. Sci. 111, E1454–E1462. doi: 10.1073/pnas.1322355111
Egmont-Petersen, M., de Ridder, D., and Handels, H. (2002). Image processing with neural networks—a review. Pattern Recognit. 35, 2279–2301. doi: 10.1016/S0031-3203(01)00178-9
Ekman, P. (1993). Facial expression and emotion. Am. Psychol. 48:384. doi: 10.1037//0003-066X.48.4.384
Ekman, P., and Friesen, W. V. (2003). Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues, volume 10. Ishk.
Ekman, P., and Rosenberg, E. L. (1997). What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford: Oxford University Press, USA. doi: 10.1093/oso/9780195104462.001.0001
Fard, A. P., and Mahoor, M. H. (2022). AD-corre: adaptive correlation-based loss for facial expression recognition in the wild. IEEE Access 10, 26756–26768. doi: 10.1109/ACCESS.2022.3156598
Farkhod, A., Abdusalomov, A. B., Mukhiddinov, M., and Cho, Y.-I. (2022). Development of real-time landmark-based emotion recognition cnn for masked faces. Sensors 22:8704. doi: 10.3390/s22228704
Febrian, R., Halim, B. M., Christina, M., Ramdhan, D., and Chowanda, A. (2023). Facial expression recognition using bidirectional lstm-cnn. Procedia Comput. Sci. 216, 39–47. doi: 10.1016/j.procs.2022.12.109
Gautam, C., and Seeja, K. (2023). Facial emotion recognition using handcrafted features and cnn. Procedia Comput. Sci. 218, 1295–1303. doi: 10.1016/j.procs.2023.01.108
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., et al. (2013). “Challenges in representation learning: a report on three machine learning contests,” in Neural information processing: 20th international conference, ICONIP 2013, Daegu, Korea, november 3–7, 2013. Proceedings, Part III 20 (Springer), 117–124. doi: 10.1007/978-3-642-42051-1_16
Gross, R., Matthews, I., Cohn, J., Kanade, T., and Baker, S. (2010). Multi-pie. Image Vis. Comput. 28, 807–813. doi: 10.1016/j.imavis.2009.08.002
He, K., Zhang, X., Ren, S., and Sun, J. (2015). “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. doi: 10.1109/CVPR.2016.90
Hu, J., Shen, L., and Sun, G. (2018). “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141. doi: 10.1109/CVPR.2018.00745
Huang, Z.-Y., Chiang, C.-C., Chen, J.-H., Chen, Y.-C., Chung, H.-L., Cai, Y.-P., et al. (2023). A study on computer vision for facial emotion recognition. Sci. Rep. 13:8425. doi: 10.1038/s41598-023-35446-4
Jain, N., Kumar, S., Kumar, A., Shamsolmoali, P., and Zareapoor, M. (2018). Hybrid deep neural networks for face emotion recognition. Pattern Recognit. Lett. 115, 101–106. doi: 10.1016/j.patrec.2018.04.010
Khaireddin, Y., and Chen, Z. (2021). Facial emotion recognition: state of the art performance on fer2013. arXiv preprint arXiv:2105.03588.
Kleisner, K., Trnka, J., and Turecek, P. (2024). Facedig: automated tool for placing landmarks on facial portraits for geometric morphometrics users. arXiv preprint arXiv:2411.01508.
Kollias, D. (2022). “Abaw: valence-arousal estimation, expression recognition, action unit detection &multi-task learning challenges,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2328–2336. doi: 10.1109/CVPRW56347.2022.00259
Kollias, D. (2023). “Abaw: learning from synthetic data &multi-task learning challenges,” in European Conference on Computer Vision (Springer), 157–172. doi: 10.1007/978-3-031-25075-0_12
Kollias, D., Schulc, A., Hajiyev, E., and Zafeiriou, S. (2020). “Analysing affective behavior in the first abaw 2020 competition,” in 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)(FG), 794–800. doi: 10.1109/FG47880.2020.00126
Kollias, D., Sharmanska, V., and Zafeiriou, S. (2019a). Face behavior a la carte: Expressions, affect and action units in a single network. arXiv preprint arXiv:1910.11111.
Kollias, D., Sharmanska, V., and Zafeiriou, S. (2021). Distribution matching for heterogeneous multi-task learning: a large-scale face study. arXiv preprint arXiv:2105.03790.
Kollias, D., Tzirakis, P., Baird, A., Cowen, A., and Zafeiriou, S. (2023). “Abaw: valence-arousal estimation, expression recognition, action unit detection &emotional reaction intensity estimation challenges,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5888–5897. doi: 10.1109/CVPRW59228.2023.00626
Kollias, D., Tzirakis, P., Nicolaou, M. A., Papaioannou, A., Zhao, G., Schuller, B., et al. (2019b). Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vision 127, 907–929. doi: 10.1007/s11263-019-01158-4
Kollias, D., and Zafeiriou, S. (2019). Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855.
Kollias, D., and Zafeiriou, S. (2021a). Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792.
Kollias, D., and Zafeiriou, S. (2021b). “Analysing affective behavior in the second abaw2 competition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 3652–3660. doi: 10.1109/ICCVW54120.2021.00408
Kumar, A., Kumar, A., and Gupta, S. (2025). Machine learning-driven emotion recognition through facial landmark analysis. SN Comput. Sci. 6, 1–10. doi: 10.1007/s42979-025-03688-w
Kumar, P., Happy, S., and Routray, A. (2016). “A real-time robust facial expression recognition system using hog features,” in 2016 International Conference on Computing, Analytics and Security Trends (CAST) (IEEE), 289–293. doi: 10.1109/CAST.2016.7914982
Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D. H., Hawk, S. T., and Van Knippenberg, A. (2010). Presentation and validation of the radboud faces database. Cogn. Emot. 24, 1377–1388. doi: 10.1080/02699930903485076
Leong, S. C., Tang, Y. M., Lai, C. H., and Lee, C. (2023). Facial expression and body gesture emotion recognition: a systematic review on the use of visual data in affective computing. Comput. Sci. Rev. 48:100545. doi: 10.1016/j.cosrev.2023.100545
Lewis, J. P., Anjyo, K., Rhee, T., Zhang, M., Pighin, F. H., and Deng, Z. (2014). “Practice and theory of blendshape facial models,” in Eurographics 2014 - State of the Art Reports, eds. S. Lefebvre and M. Spagnuolo (The Eurographics Association). doi: 10.2312/egst.20141042
Li, S., and Deng, W. (2020). Deep facial expression recognition: a survey. IEEE Trans. Affect. Comput. 13, 1195–1215. doi: 10.1109/TAFFC.2020.2981446
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., et al. (2019a). “Mediapipe: a framework for perceiving and processing reality,” in Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR).
Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., et al. (2019b). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.
Marechal, C., Mikolajewski, D., Tyburek, K., Prokopowicz, P., Bougueroua, L., Ancourt, C., et al. (2019). Survey on ai-based multimodal methods for emotion detection. High-Perform. Model. Simul. Big Data Applic. 11400, 307–324. doi: 10.1007/978-3-030-16272-6_11
Medsker, L. R., and Jain, L. (2001). Recurrent Neural Networks: Design and Applications, 1st Edn. CRC Press. doi: 10.1201/9781003040620
Min, S., Yang, J., Lim, S., Lee, J., Lee, S., and Lim, S. (2024). Emotion recognition using transformers with masked learning. arXiv preprint arXiv:2403.13731.
Minaee, S., Minaei, M., and Abdolrashidi, A. (2021). Deep-emotion: facial expression recognition using attentional convolutional network. Sensors 21:3046. doi: 10.3390/s21093046
Mollahosseini, A., Hasani, B., and Mahoor, M. H. (2017). Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10, 18–31. doi: 10.1109/TAFFC.2017.2740923
More, A. (2016). Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048.
O'Malley, T., Bursztein, E., Long, J., Chollet, F., Jin, H., Invernizzi, L., et al. (2019). KerasTuner. Available online at: https://github.com/keras-team/keras-tuner (Accessed December 13, 2025).
Pantic, M., Valstar, M., Rademaker, R., and Maat, L. (2005). “Web-based database for facial expression analysis,” in 2005 IEEE International Conference on Multimedia and Expo (IEEE), 5.
Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.
Savin, A. V., Sablina, V. A., and Nikiforov, M. B. (2021). “Comparison of facial landmark detection methods for micro-expressions analysis,” in 2021 10th Mediterranean Conference on Embedded Computing (MECO) (IEEE), 1–4. doi: 10.1109/MECO52532.2021.9460191
Shaila, S., Vadivel, A., and Avani, S. (2023). Emotion estimation from nose feature using pyramid structure. Multimed. Tools Appl. 82, 42569–42591. doi: 10.1007/s11042-023-14682-w
Valstar, M., and Pantic, M. (2010). “Induced disgust, happiness and surprise: an addition to the mmi facial expression database,” in Proceedings of the 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect (Paris, France), 65.
Vulpe-Grigorasi, A., and Grigore, O. (2021). “Convolutional neural network hyperparameters optimization for facial emotion recognition,” in 2021 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE), 1–5. doi: 10.1109/ATEE52255.2021.9425073
Yuan, X. (2021). Fer2013-Facial-Emotion-Recognition-Pytorch. GitHub repository. Available online at: https://github.com/LetheSec/Fer2013-Facial-Emotion-Recognition-Pytorch (Accessed December 13, 2025).
Zafeiriou, S., Kollias, D., Nicolaou, M. A., Papaioannou, A., Zhao, G., and Kotsia, I. (2017). “Aff-wild: valence and arousal ‘in-the-wild' challenge,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on (IEEE), 1980–1987. doi: 10.1109/CVPRW.2017.248
Keywords: emotion estimation, computer vision, social robotics, BlendFER-Lite, neural networks, Blendshapes
Citation: Attrah S (2026) Emotion estimation from video footage with LSTM. Front. Neurorobot. 19:1678984. doi: 10.3389/fnbot.2025.1678984
Received: 03 August 2025; Revised: 01 November 2025;
Accepted: 14 November 2025; Published: 06 February 2026.
Edited by:
Michalis Vrigkas, University of Western Macedonia, Greece
Reviewed by:
Imran Ashraf, University of Lahore, Pakistan
M. Sreekrishna, SRM Institute of Science and Technology, India
Copyright © 2026 Attrah. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Samer Attrah, samiratra95@gmail.com
†ORCID: Samer Attrah orcid.org/0009-0006-8090-1256