
TECHNOLOGY AND CODE article

Front. Neurorobot., 06 February 2026

Volume 19 - 2025 | https://doi.org/10.3389/fnbot.2025.1678984

This article is part of the Research Topic: Multimodal human action recognition in real or virtual environments.

Emotion estimation from video footage with LSTM


Samer Attrah*
  • Hogeschool van Arnhem en Nijmegen, Automotive and Engineering Academy, Arnhem, Netherlands

Emotion estimation is a field that has been studied for a long time, and several approaches using machine learning models exist. This article presents BlendFER-Lite, an LSTM model that uses Blendshapes from the MediaPipe library to analyze facial expressions detected from a live-streamed camera feed. This model is trained on the FER2013 dataset and achieves 71% accuracy and an F1-score of 62%, meeting the accuracy benchmark for the FER2013 dataset while significantly reducing computational costs compared to current methods. For the sake of reproducibility, the code repository, datasets, and models proposed in this paper, in addition to the preprint, can be found on Hugging Face at: https://huggingface.co/papers/2501.13432.


1 Introduction

In social robotics, many subsystems are integrated into a single robot system. Typically, the more complex the system, the better it performs its tasks, which most commonly focus on elderly or patient care. In addition to skeletal and motion structures, a robot's subsystems include interaction modules and sensory functions, such as speech, vision, and hearing. Since human emotion is a large part of any interaction between two or more humans, it is important for robots to interpret human emotion using sensory inputs such as speech, tone of voice, facial expressions, body poses, hand gestures, gaze direction, and head orientation. In this work, we focus solely on emotion estimation from facial expressions, given their importance (Marechal et al., 2019).

Emotion estimation from facial expressions, more commonly referred to as facial emotion recognition (FER), generally involves three steps: (1) face detection: identify a face in a camera stream or image and localize its boundaries; (2) feature extraction: identify and extract the most relevant information from the detected face. Many approaches exist; a popular one is to place key points on the most relevant parts of the face, such as the tip of the nose and the corners of the mouth, represented as a vector of two-dimensional or three-dimensional coordinates; and (3) expression classification: use the extracted features as inputs to a classification model that predicts the emotion conveyed by the facial expression.

The proposed emotion estimation system functions as a feedback mechanism that helps drive a conversation or interaction between a human and a robot. The system enables a robot to detect real-time human reactions through facial expressions and potentially change the topic, approach, or action of the conversation.

The scope of this research was limited by the availability of data and computational resources. Consequently, emotion classification was restricted to three classes instead of the seven proposed previously (Happy, Sad, Angry, Afraid, Surprise, Disgust, and Neutral) (Ekman and Friesen, 2003). Two classes kept their original labels (happy and sad), while all other emotions were grouped into a general category labeled unknown.

The system development process begins with a full data processing pipeline, which includes loading, cleaning, augmenting, extracting features, and visualizing data. The data are then input into the BlendFER-Lite model for training. During inference, the trained model is integrated into the Gaze Project1, where it facilitates facial detection/localization and feature extraction. Then, the trained model classifies facial expressions based on the features.

Some considerations should be noted. First, the FER2013 (Goodfellow et al., 2013) dataset is considered a challenge dataset, i.e., it was built for a machine learning competition rather than curated for research purposes. It contains a group of non-relevant images, such as animations and images with covered faces, that do not provide useful information to the model. Additionally, model training and testing were performed on a laptop equipped with an NVIDIA GeForce RTX 3050 Laptop GPU, running the Linux Ubuntu operating system; the programming language used is Python, and the code is released under the Apache-2.0 license.

This study serves as a proof of concept (POC) for follow-up research, which is discussed in Section 8. In the context of emotion estimation for embedded systems in social robotics, the main contributions of this study are as follows:

1. Building BlendFER-Lite into a compact and cost-effective model suitable for inference on a microcomputer, while considering the spatial and temporal aspects of facial expression.

2. Using the MediaPipe Face Landmarker task for face localization and feature extraction, and incorporating Blendshapes (Lugaresi et al., 2019a) as features, a technique rarely employed in FER applications.

The remainder of this article is organized as follows: Section 2 covers related work, Section 3 details the methods, Section 4 discusses the ablation studies, Section 5 presents the results, Section 6 discusses the results and findings of the study, Section 7 concludes the study, and Section 8 presents the future work to be based on this study.

2 Related work

Facial emotion recognition has been an active research area for a long time (Ekman and Friesen, 2003). The main challenges in building an emotion estimation system are as follows (Zafeiriou et al., 2017):

• The complexity of the emotion patterns.

• The emotions are time-varying.

• The process is user- and context-dependent.

Many methods exist for facial detection and tracking, such as the Haar cascade algorithm. For feature extraction, classical algorithms such as the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT) (Gautam and Seeja, 2023; Kumar et al., 2016) have been used, and, more recently, Convolutional Neural Networks (CNNs) (Li and Deng, 2020) and transformer encoders (Min et al., 2024) have been employed to detect faces and extract features. Classification models also employ CNNs, Recurrent Neural Networks (RNNs), or hybrid architectures (Li and Deng, 2020; Leong et al., 2023).

2.1 Feature extraction

Feature extraction can be considered a special type of data dimensionality reduction aimed at identifying a subset of informative variables from image data (Egmont-Petersen et al., 2002). The most common approach for extracting high-quality features from facial images involves estimating facial landmarks (Kumar et al., 2025; Belmonte et al., 2021; Shaila et al., 2023). One approach involves identifying the nose tip and segmenting it by cropping a sphere centered on it. Then, the eyebrows, mouth corners, eye corners and possibly many other facial landmarks are identified on the segmented sphere. Finally, by measuring the distances between landmarks (An et al., 2023) and changes in their positions, facial expressions can be interpreted (Farkhod et al., 2022).

2.1.1 History

Looking back at the evolution of FER, one of the methods used to extract features is HOG (Gautam and Seeja, 2023), which determines edge presence, direction, and magnitude. Another method is SIFT (Kumar et al., 2016), which follows a four-step process: (1) constructing the scale space, (2) localizing keypoints, (3) assigning orientations, and (4) assigning unique descriptors (fingerprints). A method still widely used today, typically as a classifier on top of such features, is the Support Vector Machine (SVM) (Cortes and Vapnik, 1995; Leong et al., 2023).

More recently, feature extraction has been performed using deep neural networks (Li and Deng, 2020; Leong et al., 2023), such as fully connected networks (FCNs) and CNNs, which construct models consisting of multiple layers and hyperparameters. The complexity of the data, the required performance, and the available computational resources influence which model is implemented.

2.1.2 MediaPipe

This study used the MediaPipe (Lugaresi et al., 2019a,b) library for face detection, tracking, feature extraction, and landmark detection. The Face Landmarker task yields three outputs per image or frame: 468 3D landmark vectors, 52 Blendshapes, and facial transformation matrices. In previous work (Bisogni et al., 2023), two approaches were proposed for feature extraction, one of which involved using MediaPipe to extract features from images. The results indicated that this library can be more effective than deep learning networks, such as CNNs, for subtle facial expressions; the more intense the facial expressions, the smaller the performance gap between the two approaches, and in the most extreme cases, CNNs perform better across most emotions.
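
As an illustration of how these outputs are obtained, the sketch below calls the Face Landmarker task from Python; the model asset path and image file name are placeholders, and the option names follow the MediaPipe Tasks API, so they should be verified against the installed library version.

```python
import mediapipe as mp
from mediapipe.tasks import python as mp_python
from mediapipe.tasks.python import vision

# Configure the Face Landmarker task (the .task model path is a placeholder).
options = vision.FaceLandmarkerOptions(
    base_options=mp_python.BaseOptions(model_asset_path="face_landmarker.task"),
    output_face_blendshapes=True,                 # request the 52 Blendshape scores
    output_facial_transformation_matrixes=True,   # request the transformation matrices
    num_faces=1,
)
detector = vision.FaceLandmarker.create_from_options(options)

# Run detection on a single image (file name is a placeholder).
image = mp.Image.create_from_file("face.jpg")
result = detector.detect(image)

landmarks = result.face_landmarks                  # 3D landmarks per detected face
blendshapes = result.face_blendshapes              # scored Blendshape categories per face
matrices = result.facial_transformation_matrixes   # facial transformation matrices
```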

In another study (Savin et al., 2021), MediaPipe was compared to OpenFace for landmark detection. OpenFace (Baltrušaitis et al., 2016) uses OpenCV to detect 68 facial landmarks, whereas MediaPipe uses TensorFlow (Abadi et al., 2016) to identify 468 landmarks that are arranged in fixed quads and represented by their coordinates (x, y, z). Figure 1 shows an image annotated with the landmarks used for facial expression recognition.

Figure 1. Image of a person annotated with the MediaPipe facial landmark mesh.

Another study, Kleisner et al. (2024), proposes adding a CNN that processes the MediaPipe 3D landmark vectors to improve landmark placement precision on 2D facial photographs.

2.1.3 Blendshapes

The second output of the MediaPipe face landmark detection task model consists of Blendshapes, which provide an approximate semantic parameterization and a simple linear model for facial expressions (Lewis et al., 2014). This technique originated in the industry before gaining traction in academia and has been widely used in computer graphics. Although the Blendshapes technique is conceptually simple, developing a full Blendshapes face model is computationally intensive. To express a complete range of realistic expressions, one face might require more than 600 Blendshapes.

The construction of a single Blendshapes model was originally guided by the Facial Action Coding System (FACS) (Ekman and Rosenberg, 1997). This system enables manual coding of all facial displays, known as action units, and more than 7,000 combinations have been observed. The FACS action units are the smallest visibly discriminable changes in a facial display, and combinations of FACS action units can be used to describe emotional expressions (Ekman, 1993) and global distinctions between positive and negative expressions.2

MediaPipe incorporates a set similar to the ARKit face Blendshapes,3 which consists of 52 Blendshapes describing facial parts and expressions. These are quantified with probability scores from 0 to 1, indicating the presence of specific Blendshapes, as shown in Figure 1, while Figure 2 presents a Blendshapes histogram.

Figure 2. Histogram of the Blendshape scores for the image in Figure 1, obtained via MediaPipe; the highest scores are mouthSmileLeft (0.6488), mouthSmileRight (0.5494), and eyeSquintLeft (0.4965).

2.2 Emotion classification

For emotion classification, many approaches have been used in research, such as SVMs (Kumar et al., 2016), which have been implemented with linear as well as radial basis function kernels, and stochastic gradient descent (SGD) classifiers, which achieved approximately 95% accuracy on the Radboud Faces Database (RaFD) (Langner et al., 2010). Bisogni et al. (2023) presented a comparison between two approaches, namely, MediaPipe-SVM and CNN-LSTM. MediaPipe and the CNN were employed for feature extraction, and the SVM and LSTM (Hochreiter and Schmidhuber, 1997) were used for classification. The experiments showed that the features extracted using MediaPipe are superior to those extracted by the CNN; consequently, the classification performance of the first combination is better than that of the second approach.

One approach in Huang et al. (2023) used ResNet (He et al., 2015) blocks in addition to the Squeeze-and-excitation network (Hu et al., 2018) for expression classification and demonstrated the improvements that can be achieved using the correct transfer learning approach. It also used the suggested model to examine the important features and the location of the major facial information using feature maps, which will be further discussed in Section 6.

Another type of image classification method adopts CNNs. Gautam and Seeja (2023) used a sequential model of three convolutional layers and a dense layer to classify input features, achieving high accuracy on the dataset used.

Since the model being developed is designed for video processing, where data are sequential, as stated in Goodfellow (2016), choosing an appropriate RNN is crucial. Just as a CNN is a neural network specialized for processing a grid of values, such as an image, an RNN is a neural network specialized for processing a sequence of values or sequential data.

When working with sequential data, such as videos, and considering temporal dependencies between images, an RNN is the best choice (Li and Deng, 2020). An LSTM, which is a special type of RNN, addresses the vanishing and exploding gradient problems that are common in RNN training. Jain et al. (2018) compared a network combining a CNN part and an RNN part with a network using only CNN layers and found that the hybrid network delivers approximately 20% higher accuracy.

Another study discussing RNNs is Leong et al. (2023), where a CNN-LSTM architecture was proposed as a facial recognition system that can understand spatiotemporal properties in video.

3 Methods

This section discusses the approach to building the emotion estimation system, starting with the dataset used and the data processing techniques, then building and optimizing the model, and finally evaluating and testing the model's performance.

3.1 Dataset

Since the model being developed is intended for video-based inference, selecting a video dataset with different facial expressions was the first priority. Li and Deng (2020) surveyed a group of datasets, some of which include videos, such as MMI (Pantic et al., 2005; Valstar and Pantic, 2010) and AFEW 7.0 (Dhall et al., 2017). However, due to computational constraints and the intensive data preprocessing required for video-based datasets, we opted to use an image-based dataset instead.

Among the datasets considered, MultiPIE (Gross et al., 2010) was included for its image quality and size. We also considered the Radboud Faces Database (RaFD) (Langner et al., 2010) for its structure and the quality of its class distribution. Another choice was AffectNet (Mollahosseini et al., 2017), owing to its suitable size and its popularity as a benchmark for evaluating model performance in many studies and experiments.

In this study, the FER2013 dataset (Goodfellow et al., 2013) was used, chosen for its simplicity. This open-access dataset contains around 35K 48 × 48 grayscale images, a sufficient number that is easy to obtain.

The dataset was downloaded from Kaggle4 in .csv format. The data include the class, image pixel array grayscale values, and the data split as columns and training examples as rows.
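
As a rough illustration of this format, the sketch below loads the CSV and decodes the space-separated pixel strings into 48 × 48 arrays; the column names (emotion, pixels, Usage) are those of the common Kaggle release and should be checked against the downloaded file.

```python
import numpy as np
import pandas as pd

# Column names follow the common Kaggle release of FER2013 (verify against your copy).
df = pd.read_csv("fer2013.csv")

def decode_pixels(pixel_string: str) -> np.ndarray:
    """Convert a space-separated pixel string into a 48x48 grayscale image."""
    return np.array(pixel_string.split(), dtype=np.uint8).reshape(48, 48)

images = np.stack([decode_pixels(p) for p in df["pixels"]])
labels = df["emotion"].to_numpy()
splits = df["Usage"].to_numpy()     # "Training", "PublicTest", "PrivateTest"

train_images = images[splits == "Training"]
```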

The FER2013 dataset (Goodfellow et al., 2013) is categorized into seven classes: happy, sad, angry, afraid, surprise, disgust, and neutral. The full dataset is split into three groups: training, public testing, and private testing, which serve as the training, validation, and test sets, respectively. The distribution of images across these subsets is shown in Table 1.


Table 1. FER2013 dataset: classes and splits.

The table shows that (1) the happy, sad, and neutral classes have the highest number of samples, whereas the disgust class has the lowest representation, and (2) the public test and private test splits are well-suited for use as the cross-validation set and test set, respectively. In the next subsections, the data processing techniques applied to restructure the database in a form that best serves the requirements of the model being built are discussed.

3.2 Data processing and cleaning

Since the FER2013 dataset (Goodfellow et al., 2013) was originally designed as a challenge dataset, additional data processing steps are required to adapt it for research and real-world robotic applications. To ensure its suitability for emotion estimation in this context, a series of data cleaning and preprocessing steps must be applied.

3.2.1 Creating training data classes

The first step in data processing involves splitting the dataset into three subsets: training, validation, and test. Owing to class imbalance, we decided not to use the full set of available training class images; instead, we included a portion of the available samples, as shown in Table 2.


Table 2. Training set classes counts.

These values were chosen because the model classifies only the happy, sad, and unknown classes, rather than the full seven emotions in the dataset. Assigning 4,000 samples to each of the two main emotion classes yields a more balanced dataset. Most of the other classes are allocated 1,500 samples each, keeping their combined total at 6,000, in addition to approximately 400 images for the disgust class. This distribution provides a sufficient number of training examples while keeping the unknown class comparable in size to the happy and sad classes.
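
A sketch of this relabeling and subsampling step is shown below; the integer encoding of the FER2013 labels and the per-class caps are assumptions that approximate Table 2 and should be verified against the dataset.

```python
import pandas as pd

# Assumed FER2013 integer labels (standard Kaggle encoding; verify against your copy):
# 0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral
TO_THREE_CLASSES = {3: "happy", 4: "sad"}        # every other emotion -> "unknown"

# Illustrative per-class caps approximating Table 2 (assumption).
SAMPLES_PER_CLASS = {3: 4000, 4: 4000, 0: 1500, 2: 1500, 5: 1500, 6: 1500, 1: 400}

def build_training_subset(train_df: pd.DataFrame) -> pd.DataFrame:
    """Cap each original class and map the seven labels onto happy/sad/unknown."""
    parts = [train_df[train_df["emotion"] == emotion].head(cap)
             for emotion, cap in SAMPLES_PER_CLASS.items()]
    subset = pd.concat(parts, ignore_index=True)
    subset["label"] = subset["emotion"].map(TO_THREE_CLASSES).fillna("unknown")
    return subset
```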

Table 1 shows that including more training examples from each class is possible for some classes, but including all examples could lead to imbalanced data representation and a lower-quality distribution.

3.2.2 Readability by MediaPipe

To clean the dataset and prevent runtime errors caused by images that MediaPipe cannot detect, a detection program was developed. All images in the training set were processed with MediaPipe, and undetectable images were excluded from the dataset.
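
A minimal sketch of such a filter is shown below, reusing the decode_pixels helper and the detector from the earlier sketches (both assumptions); the grayscale pixels are replicated to three channels because the MediaPipe image wrapper expects an RGB frame.

```python
import cv2
import mediapipe as mp
import numpy as np

def is_detectable(gray_48x48: np.ndarray, detector) -> bool:
    """Return True if MediaPipe returns Blendshapes for the given image."""
    rgb = cv2.cvtColor(gray_48x48, cv2.COLOR_GRAY2RGB)
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
    result = detector.detect(mp_image)
    return len(result.face_blendshapes) > 0

# Keep only the rows whose images MediaPipe can process
# (train_df and decode_pixels come from the earlier sketches).
mask = [is_detectable(decode_pixels(p), detector) for p in train_df["pixels"]]
train_df = train_df[mask].reset_index(drop=True)
```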

This process resulted in the following counts for each class, as shown in Table 3.


Table 3. Number of images MediaPipe could not detect in the FER2013 dataset.

As described earlier, MediaPipe detects the main parts of the face and estimates landmark vectors and blendshapes for each image. If a part of the face is obstructed, this process becomes infeasible. Images with occlusions, such as those in Figure 3, are considered undetectable.

Figure 3. Example of unidentifiable samples from the FER2013 dataset. This figure displays representative 48 × 48 pixel images from the FER2013 dataset where faces were not successfully detected by the MediaPipe framework, primarily due to insufficient facial exposure or occlusion.

3.2.3 Indexing the test set

This step followed a few experiments that revealed an unexpectedly high error rate in model performance. As a tracking method, a column was added to the test set in the CSV file, assigning a number to each image in the FER2013 dataset. This number remains associated with the image throughout data processing and is transferred to the image's Blendshapes when the Blendshapes dataset (Attrah, 2025b) is created.

This step helps track images with high error rates when the model is evaluated on the test set, facilitating visualization of the images or the creation of different types of plots to identify patterns in the images causing the errors.

3.2.4 Augmenting the training set

To increase the number of training images and the variety of poses, and thereby improve the model's generalizability, data augmentation was applied by generating new examples with random transformations (More, 2016). The augmentation techniques used are as follows:

1. Random horizontal flip.

2. Random rotation of up to 0.2 × 180 degrees in both the clockwise and counterclockwise directions.

These techniques are based on the model's use case. Since it is designed for a social robot, horizontal flipping will enable the robot to interpret reflections and provide a valid facial image from a different camera angle. Additionally, random rotations improve model performance by accounting for natural head tilts and rotations, which are commonly observed in human interactions. For a robot, the ability to detect and classify facial expressions from various angles is crucial for real-world applications.

Processing the image dataset through augmentation yields a new, extended dataset with more than 20,000 images across the three classes described earlier.
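
A minimal sketch of this augmentation step, assuming a Keras preprocessing pipeline; the rotation factor below is chosen to correspond to roughly ±0.2 × 180 degrees, and both the factor and the layer choices are assumptions rather than the project's exact code.

```python
import tensorflow as tf
from tensorflow import keras

# Random horizontal flip plus random rotation in both directions.
# RandomRotation's factor is a fraction of a full turn, so 0.1 corresponds to
# roughly +/- 36 degrees (0.2 x 180 degrees), matching the description above.
augment = keras.Sequential([
    keras.layers.RandomFlip("horizontal"),
    keras.layers.RandomRotation(factor=0.1),
])

def augment_batch(images: tf.Tensor) -> tf.Tensor:
    """Apply the random transformations to a batch of (48, 48, 1) images."""
    return augment(images, training=True)
```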

3.2.5 Blendshapes dataset

Using MediaPipe to detect faces in a video stream requires extensive data processing before the data are fed into the classification model. To integrate the model with a face detection program, the model needs to be trained to interpret the program's output, i.e., the Gaze project's Blendshapes. Consequently, the full dataset is converted into a Blendshapes dataset (Attrah, 2025b), which is used to train the model.

Using Blendshapes was preferred over the 3D landmark vectors and the base 48 × 48 images to save a significant amount of computation during inference: the 3D landmark vectors amount to 1,404 features per image (468 landmarks × 3 coordinates) and the raw 48 × 48 images amount to 2,304 features, whereas the Blendshapes require only 52 features. Further processing steps that were experimented with are described in the Supplementary material.
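
A sketch of this conversion step is shown below, reusing the decode_pixels helper, the filtered train_df, and the detector from the earlier sketches (all assumptions); the Blendshape names returned by MediaPipe become the column headers of the resulting 52-feature table.

```python
import cv2
import mediapipe as mp
import pandas as pd

def blendshape_vector(result) -> dict:
    """Map a FaceLandmarkerResult to {blendshape_name: score} for the first face."""
    return {c.category_name: c.score for c in result.face_blendshapes[0]}

rows = []
for _, row in train_df.iterrows():     # images were pre-filtered, so a face is present
    rgb = cv2.cvtColor(decode_pixels(row["pixels"]), cv2.COLOR_GRAY2RGB)
    result = detector.detect(mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb))
    features = blendshape_vector(result)
    features["label"] = row["label"]
    rows.append(features)

blendshapes_df = pd.DataFrame(rows)    # 52 feature columns plus a label column
```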

3.3 BlendFER-Lite architecture and training

The classification model BlendFER-Lite is primarily built from long short-term memory (LSTM) layers (Hochreiter and Schmidhuber, 1997), with the exception of the last dense layer, which employs softmax activation to produce classification scores from the output of the final LSTM layer.

The decision to incorporate LSTM layers in building the BlendFER-Lite model was made for the following reasons:

1. The BlendFER-Lite model is designed to be integrated into a video stream and to classify faces, meaning it operates on a time-series basis, where each input depends on the sequence of preceding inputs. Given this dependency, using a recurrent neural network (RNN) (Goodfellow, 2016) is intuitive, as it retains information from past states in addition to the current input.

2. Another reason for this choice is that the LSTM, like the gated recurrent unit (GRU) (Chung et al., 2014), uses an internal gating mechanism that helps filter out irrelevant features, allowing the model to focus on the most critical ones.

3. LSTM stands for long short-term memory. A key advantage of this network over other types of recurrent neural networks is its ability to maintain long-term dependencies effectively, not just short-term ones, as is the case with RNN (Medsker et al., 2001) and GRU networks.

Once the LSTM layer was selected, the next step was designing the model architecture. Several key factors were considered, such as accuracy, recall, precision, latency, and model size. After a few experiments to determine the best parameters, the Keras Tuner (O'Malley et al., 2019) was employed as an architecture search framework to identify the best values for the hyperparameters. The search space was defined based on prior experiments and empirical intuition5.
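
A sketch of how such a search could look with Keras Tuner is given below; the search-space bounds, layer counts, and objective are illustrative assumptions rather than the exact values used in the project.

```python
import keras
import keras_tuner as kt

N_FEATURES = 52    # Blendshapes per frame
N_CLASSES = 3      # happy, sad, unknown

def build_model(hp: kt.HyperParameters) -> keras.Model:
    """Stack a tunable number of LSTM layers on a Blendshape sequence input."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(None, N_FEATURES)))
    n_layers = hp.Int("lstm_layers", 2, 4)
    for i in range(n_layers):
        model.add(keras.layers.LSTM(
            units=hp.Int(f"units_{i}", 8, 64, step=4),
            activation=hp.Choice(f"activation_{i}", ["tanh", "selu"]),
            return_sequences=(i < n_layers - 1),
        ))
    model.add(keras.layers.Dense(N_CLASSES, activation="softmax"))
    model.compile(
        optimizer=keras.optimizers.AdamW(
            learning_rate=hp.Float("lr", 1e-6, 1e-3, sampling="log")),
        loss="categorical_crossentropy",
        metrics=["categorical_accuracy"],
    )
    return model

tuner = kt.Hyperband(build_model, objective="val_categorical_accuracy",
                     max_epochs=100, directory="tuner_logs", project_name="blendfer")
# tuner.search(x_train, y_train, validation_data=(x_val, y_val))
```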

The final model was trained for 5,000 epochs, which took approximately 2 days. The batch size was 128, and the architecture was plotted using a design similar to the Keras utils model plotting function, as shown in Figure 4. The optimizer used for training was AdamW (Kingma, 2014; Loshchilov, 2017), with a learning rate of 1.09e-06, a global clipping norm of 1, and AMSGrad (Reddi et al., 2019). The adopted callbacks were the checkpoint callback and early stopping callback.

Figure 4. BlendFER-Lite model structure: an input layer followed by four LSTM layers with SELU activation and output dimensions of 52, 16, 48, and 44, and a final Dense layer with softmax activation producing three outputs, all in float32 precision.

For the loss function, categorical cross-entropy was used, with the image labels encoded as one-hot vectors; the evaluation metrics were the loss, categorical cross-entropy, categorical accuracy, and the F1-score. Other details and experiments are available in the Supplementary material.
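
Putting the reported hyperparameters together, a minimal sketch of the final model and its training configuration might look as follows; the layer sizes follow Figure 4 and the optimizer settings follow the text, while the input sequence handling, callback arguments, and F1 metric configuration are assumptions that depend on a recent Keras version.

```python
import keras

model = keras.Sequential([
    keras.Input(shape=(None, 52)),       # sequence of Blendshape frames (assumption)
    keras.layers.LSTM(52, activation="selu", return_sequences=True),
    keras.layers.LSTM(16, activation="selu", return_sequences=True),
    keras.layers.LSTM(48, activation="selu", return_sequences=True),
    keras.layers.LSTM(44, activation="selu"),
    keras.layers.Dense(3, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.AdamW(
        learning_rate=1.09e-6, global_clipnorm=1.0, amsgrad=True),
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=[keras.metrics.CategoricalAccuracy(),
             keras.metrics.F1Score(average="macro")],
)

callbacks = [
    keras.callbacks.ModelCheckpoint("blendfer_lite.keras", save_best_only=True),
    keras.callbacks.EarlyStopping(patience=100, restore_best_weights=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=5000, batch_size=128, callbacks=callbacks)
```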

3.4 Model evaluation

The model evaluation process is conducted for every set of weights saved by the checkpoint callback upon improved training and validation performance metrics; in practice, a set of weights is saved approximately every 100 epochs.

Each model is evaluated by predicting the labels of the test-set Blendshapes and comparing them to the ground-truth labels. The loss, categorical cross-entropy, categorical accuracy, and F1-score are calculated for each model, and the F1-score is used as the main metric for optimization.

4 Ablation studies

To better understand the BlendFER-Lite model and its potential applications, several ablation studies were conducted as part of the experiments. These studies aimed to determine the optimal number of Blendshapes, the most effective model type, and the best loss function.

4.1 Selection of relevant Blendshapes

To improve the efficiency of the model, reduce its computational requirements, and produce quality classification results with less processing time and memory usage, a subset of the Blendshapes returned by MediaPipe is disregarded. Since the model processes a detected face in each video frame, a total of 52 Blendshapes, similar to the ARKit face Blendshapes, would otherwise be processed per frame.

This approach was motivated by the observation, visible in Figure 2, that some Blendshapes have near-zero scores. Testing the whole dataset (Attrah, 2025b) revealed that some Blendshapes have low scores across all images, as shown in Figure 5, where each subplot corresponds to a Blendshape and each blue dot to an image. In some subplots, the dots are spread almost uniformly across the full 0-1 range of Blendshape values, whereas in other subplots the dots are concentrated in the lower half of the range.

Figure 5. Blendshape scatter plots for all images of the training dataset: one subplot per Blendshape, with the image index on the horizontal axis and the score on the vertical axis.

The research considered two strategies: (1) count the number of times a certain Blendshape has a zero score and discard it if it has a high count and (2) set a threshold value and count how many times each of the Blendshapes has a score that exceeds this threshold.

After testing various values for the high-count threshold in the first approach and different threshold values in the second approach, 0.4 was selected as the cutoff threshold, and 100 was set as a high-count number. Thus, for any Blendshape, if its score does not exceed 0.4 in more than 100 images in the dataset (Attrah, 2025b), it is discarded.
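
A sketch of this second selection strategy, assuming the Blendshapes dataset is held in a DataFrame with one column per Blendshape plus a label column; the threshold and count values are those stated above.

```python
import pandas as pd

SCORE_THRESHOLD = 0.4
MIN_IMAGES_ABOVE = 100

def select_blendshapes(blendshapes_df: pd.DataFrame) -> list:
    """Keep a Blendshape only if its score exceeds 0.4 in more than 100 images."""
    feature_cols = [c for c in blendshapes_df.columns if c != "label"]
    counts = (blendshapes_df[feature_cols] > SCORE_THRESHOLD).sum(axis=0)
    return counts[counts > MIN_IMAGES_ABOVE].index.tolist()

# selected = select_blendshapes(blendshapes_df)   # yields 27 of the 52 in this study
```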

This choice was made for the following reasons:

• Counting zero values and making decisions on the basis of these criteria would have resulted in a small and non-relevant number of Blendshapes being discarded, which would not have improved the efficiency of the model.

• Using the second approach with a threshold higher than 0.4 would have resulted in discarding Blendshapes with a wide range of probabilities. These Blendshapes contribute valuable information to the model during training, and removing them would result in a more compact model but reduced accuracy and performance.

• Setting a count higher than 100 would have resulted in a significant decrease in the number of Blendshapes being used, which would make it more difficult for the model to learn.

These reasons were validated through experimentation: training models on different data collections, testing their performance and size, and then making decisions for each number mentioned.

This process retained 27 of the 52 Blendshapes, listed as important in the Supplementary material, reducing the model size while keeping performance metrics, such as accuracy and F1-score, unchanged.

For further clarification, Figure 5 shows the scatter plots for the 52 Blendshapes, and the Blendshape importance table in the Supplementary material shows how many of its entries exhibit low values. This finding suggests that certain Blendshapes contribute minimal information and can be omitted.

4.2 Dense neural network model

To construct a faster model with shorter prediction time (latency) than the LSTM network, another, simpler network was built, with a fully connected (dense) layer as the only layer in the model. After a few optimization experiments, the model achieved high performance. However, when integrated into a real-time system using a camera stream, it lacked acceptable stability and oscillated between the classes without any change in the camera view, resulting in predictions with insufficient accuracy. This is because the layer does not consider the time dimension in the input data, which is well handled by the LSTM layer, as discussed further in the Supplementary material.

4.3 CCE and MSE loss functions

In earlier stages of the research, the mean squared error (Error, 2008) loss function was used to train the LSTM model. However, because the MSE function measures the distance between the prediction and the ground truth rather than the probability difference, it is better suited for regression applications. Therefore, the loss function had to be changed after a certain point.

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$    (1)

Unlike MSE, categorical cross-entropy measures the difference between the predicted probability distribution and the ground-truth label, making it more suitable for classification tasks. Given that the application involves three classes, categorical cross-entropy was chosen.

$\mathrm{CCE} = -\sum_{i=1}^{N} y_i \log\left(\hat{y}_i\right)$    (2)

To address class imbalance in the dataset, a potential improvement was using categorical focal cross-entropy (Lin, 2017). Although it was originally designed for object detection, previous research has shown that categorical focal cross-entropy improves classification model performance.

$\mathrm{FL} = \alpha\left(1-\hat{y}\right)^{\gamma}\,\mathrm{CCE}\left(y, \hat{y}\right)$    (3)
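
As a brief illustration, and assuming a recent Keras version that ships a categorical focal cross-entropy loss, swapping the loss in the compile step of the earlier model sketch could look like this; the alpha and gamma values are the common defaults, not values taken from the article.

```python
import keras

# Categorical focal cross-entropy down-weights easy examples, which can help with
# the class imbalance discussed above (availability depends on the Keras version).
focal_loss = keras.losses.CategoricalFocalCrossentropy(alpha=0.25, gamma=2.0)

model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1.09e-6),
    loss=focal_loss,
    metrics=[keras.metrics.CategoricalAccuracy()],
)
```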

5 Results

The results obtained with the BlendFER-Lite model place it in the top five of the FER2013 classification benchmark on the Papers With Code website6, taking into consideration the requirement that the model be small. For a more accurate comparison, BlendFER-Lite is compared only to non-transformer neural networks trained without extra training data, such as the models listed in Table 4.


Table 4. Benchmark models.

Table 4 shows that the BlendFER-Lite model improves performance by around 2% compared to the LocalLearning BOW and DeepEmotion models; given its minimal architecture and model size, this improvement is considerable. While Ad-Corre, VGGNet, and ResNet18 with tricks deliver higher accuracy, their improvement does not exceed roughly 2%, and their depth makes them resource-consuming. The best metrics, aside from accuracy, obtained by the model on the test set, which consists of the Blendshapes of 291 happy, 245 sad, and 1,110 unknown images, are as follows:

• Loss = 0.6238.

• Categorical cross-entropy = 0.6235.

• Categorical accuracy = 0.7199.

• F1-score = 0.6298.

For a more detailed analysis, the model's confusion matrix is shown in Table 5.


Table 5. Confusion matrix.

The confusion matrix indicates that the happy class achieved the highest classification accuracy, with 251 correctly classified instances out of 291 (40 misses). The unknown class exhibited a correct classification rate of 850 out of 1,110 instances. Although this represents the largest absolute number of misclassifications (260) across the three classes, it is likely attributable to the inherent feature diversity and variability encompassed within this category. The sad class demonstrated the lowest performance, with only 84 out of 246 instances correctly classified, resulting in 162 misclassified samples. While the unknown class showed the highest absolute number of misses, the sad class registered the highest proportional misclassification rate; it is also important to note that the sad test set was the smallest of the three. Breaking the confusion matrix (Crall, 2023) down into its parts yields the counts for the happy, unknown, and sad classes shown in Table 6.


Table 6. The counts of the types of classifications extracted from the confusion matrix.

These numbers show that the unknown class has the highest counts in all categories, indicating consistent performance across the classes, with error counts roughly proportional to the total sample count of each class. The precision and recall can be computed using Equations 4, 5:

$\mathrm{Precision} = \frac{TP}{TP + FP}$    (4)
$\mathrm{Recall} = \frac{TP}{TP + FN}$    (5)
$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$    (6)

Calculating these metrics for each class yields the per-class precision, recall, and F1-score (Equation 6) shown in Table 7.


Table 7. Precision, recall and F1-score.

The findings in Table 7 show that the unknown class is the easiest to detect, followed by the happy class, with the sad class last, having the lowest precision, recall, and F1-score and therefore being the most difficult to recognize. This is an expected outcome given the large imbalance in the dataset.
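
For reference, the per-class values in Table 7 can be recomputed from the confusion matrix with a few lines of NumPy; the sketch assumes rows are true classes and columns are predicted classes, which should be checked against the orientation of Table 5.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Precision, recall, and F1 per class from a confusion matrix whose
    rows are true classes and columns are predicted classes."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp        # predicted as class c but belonging to another class
    fn = cm.sum(axis=1) - tp        # belonging to class c but predicted as another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```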

Experimenting with different architectures, hyperparameters, and longer training times did not yield consistently better results; it sometimes affected the classification accuracy and at other times the temporal stability. Nevertheless, the confusion matrix continued to fluctuate, showing inconsistent improvements in classification for the happy and sad classes.

At one point, the classification of the sad emotion improved, resulting in higher precision and recall, while at another, the happy class was favored, and the experiment reported in Table 5 revealed a bias toward the happy class.

Class weighting was also considered, assigning greater importance to the happy and sad classes and less weight to the unknown class, which was easily classified in all the experiments. However, some limitations prevented the use of class weights in the experiments.

When the model was integrated into the test camera stream, as demonstrated in the demo video,7 it output the happy and sad classes correctly; however, when facial features change, the unknown class is frequently assigned, and for all other facial expressions the unknown class remains the predicted output.

Regarding latency, the model was measured separately from the MediaPipe Blendshapes extraction pipeline and shows a mean latency of 33.47 ms per prediction, calculated across the full test set of the FER2013 Blendshapes dataset.
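
A sketch of how such a per-prediction latency measurement can be made, assuming the test Blendshape sequences are already loaded as a NumPy array; the warm-up and single-sample batching choices are assumptions.

```python
import time
import numpy as np

def mean_latency_ms(model, x_test: np.ndarray, warmup: int = 10) -> float:
    """Average single-sample prediction time in milliseconds."""
    for i in range(warmup):                       # warm-up runs are excluded from timing
        model.predict(x_test[i:i + 1], verbose=0)
    start = time.perf_counter()
    for i in range(len(x_test)):
        model.predict(x_test[i:i + 1], verbose=0)
    return 1000.0 * (time.perf_counter() - start) / len(x_test)
```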

The model size as a zipped .keras file is 775.385 KB (Attrah, 2025a), which is significantly smaller than the other models in the benchmark, and the model has 44,631 parameters. This makes it the fastest-running model with the smallest memory footprint compared to the best entry in Table 4, a ResNet-18-based model with more than 11M parameters.

6 Discussion

Considering the use case for which the model is developed, it is necessary to optimize for latency and model size in addition to training metrics such as loss, cross-entropy, accuracy, and F1-score. These metrics are critical during the implementation phase, when deploying the model on an edge device for real-time inference. One method experimented with involved quantizing the Blendshapes, keeping only 4 of the 16 floating-point bits; this significantly increased the model's speed but negatively impacted the performance metrics and reduced accuracy.

Regarding model size, the models in Goodfellow et al. (2013), Min et al. (2024), An et al. (2023), Huang et al. (2023), Zhou et al. (2023), Fard and Mahoor (2022), Vulpe-Grigorasi and Grigore (2021), Yuan (2021), Khaireddin and Chen (2021), and Minaee et al. (2021) are large, which makes them unfit for implementation on an edge device, even though some of them show better performance metrics. Febrian et al. (2023) present a CNN-BiLSTM model whose LSTM part is similar to ours in processing the temporal dimension of the data; however, our model does not use bidirectional layers, because the experiments in Section 3.3 showed that they do not improve the results and only increase the training time. Their CNN part is quite different from MediaPipe, although both address the spatial dimension of the data, and, as noted in Section 2.1.2, Bisogni et al. (2023) showed that MediaPipe yields superior results to a CNN model for feature extraction.

Section 4.1 suggested and validated a statistical approach to reduce the number of Blendshapes used to represent the face. The selected Blendshapes mostly describe the brows, eyes, and mouth, as shown in the Supplementary material. In contrast, Huang et al. (2023) generate feature maps showing that the mouth and nose carry the most information for classification. This difference might result from the differences in datasets and approaches, since Blendshapes are a different concept from the feature maps suggested there. Note also that only two of the 52 Blendshapes in the MediaPipe set represent the nose, which might hide some information.

7 Conclusion

This article presented BlendFER-Lite, a model intended to run on a microcomputer in a social robot. The system detects faces in a camera video stream, extracts Blendshapes as features from the face using MediaPipe, and classifies facial expressions according to the emotions they represent: happiness, sadness, or unknown.

We demonstrated that a four-layer LSTM model is a suitable architecture for classifying the Blendshapes of faces in camera stream frames, and the model showed good temporal stability. Notably, good results were obtained from the video stream even though the model was trained on an image dataset. In addition, the model achieves results competitive with the dataset's benchmark, with no loss in accuracy throughout the feature extraction and classification pipeline, while saving considerable memory and computation compared with other available methods.

In addition, we conducted ablation studies and suggested methods to further improve the model's resource efficiency and performance.

8 Future research

Based on this research, a new model will be built, trained mainly on the Aff-Wild2 dataset (Kollias et al., 2019b; Kollias and Zafeiriou, 2019; Kollias et al., 2019a, 2020; Kollias and Zafeiriou, 2021a,b; Kollias et al., 2021; Kollias, 2022, 2023; Kollias et al., 2023; Zafeiriou et al., 2017), and tested on other datasets in addition to Aff-Wild2, to deliver results on facial expression classification, action units, and valence/arousal. It will use techniques discussed and implemented in this research, as well as more advanced architectures and models, such as transformer-based models (LLMs, VLMs, and MLLMs). Afterward, the work will expand to another research project that includes other modalities, such as voice tone, body pose, and natural language, to estimate emotions, and will use compound facial expressions (Du et al., 2014) to obtain a more accurate and detailed estimate of human emotion.

Further studies about reducing the feature count will be conducted on other types of data and features.

Data availability statement

The datasets used for this study can be found at: Kaggle/FER2013 blendshapes dataset example (Partial) [link] and Kaggle/Challenges in Representation Learning: Facial Expression Recognition Challenge [link].

Ethics statement

Written informed consent was not obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article because the dataset is licensed and verified by many institutions, so consent exists, but we do not hold a copy of it; furthermore, the submitted manuscript presents a dataset created only by processing the aforementioned dataset, without adding any original images.

Author contributions

SA: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Resources, Software, Writing – original draft, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Acknowledgments

We thank Professor Marijn Jongerden, Jeroen Veen, and Dixon Devasia for their help and guidance throughout the process, and Victor Hogeweij for his contribution to the Gaze project. Additionally, we thank Bhupinder Kaur, An Le, and Muhammad Reza for their insights and support during the early stages of the study.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declare that no Gen AI was used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fnbot.2025.1678984/full#supplementary-material

Footnotes

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., et al. (2016). “Tensorflow: a system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 265–283.

Google Scholar

An, Y., Lee, J., Bak, E., and Pan, S. (2023). Deep facial emotion recognition using local features based on facial landmarks for security system. Comput. Mater. Continua 76:39460. doi: 10.32604/cmc.2023.039460

Crossref Full Text | Google Scholar

Attrah, S. (2025a). BlendFER-Lite. Kaggle. Available online at: https://www.kaggle.com/m/484311 (Accessed December 29, 2025).

Google Scholar

Attrah, S. (2025b). FER2013 Blendshapes Dataset Example (Partial). Kaggle. Available online at: https://www.kaggle.com/dsv/10716347 (Accessed December 29, 2025).

Google Scholar

Baltrušaitis, T., Robinson, P., and Morency, L.-P. (2016). “Openface: an open source facial behavior analysis toolkit,” in 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), 1–10. doi: 10.1109/WACV.2016.7477553

Crossref Full Text | Google Scholar

Belmonte, R., Allaert, B., Tirilly, P., Bilasco, I. M., Djeraba, C., and Sebe, N. (2021). Impact of facial landmark localization on facial expression recognition. IEEE Trans. Affect. Comput. 14, 1267–1279. doi: 10.1109/TAFFC.2021.3124142

Crossref Full Text | Google Scholar

Bisogni, C., Cimmino, L., De Marsico, M., Hao, F., and Narducci, F. (2023). Emotion recognition at a distance: the robustness of machine learning based on hand-crafted facial features vs. deep learning models. Image Vis. Comput. 136:104724. doi: 10.1016/j.imavis.2023.104724

Crossref Full Text | Google Scholar

Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Google Scholar

Hochreiter, S., and Schmidhuber, J. (1997). Long short-term memory. Neural Comput. 9, 1735–1780. doi: 10.1162/neco.1997.9.8.1735

Crossref Full Text | Google Scholar

Cortes, C., and Vapnik, V. (1995). Support-vector networks. Mach. Learn. 20, 273–297. doi: 10.1023/A:1022627411411

Crossref Full Text | Google Scholar

Crall, J. (2023). The mcc approaches the geometric mean of precision and recall as true negatives approach infinity. arXiv preprint arXiv:2305.00594.

Google Scholar

Dhall, A., Goecke, R., Ghosh, S., Joshi, J., Hoey, J., and Gedeon, T. (2017). “From individual to group-level emotion recognition: Emotiw 5.0,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, 524–528. doi: 10.1145/3136755.3143004

Crossref Full Text | Google Scholar

Du, S., Tao, Y., and Martinez, A. M. (2014). Compound facial expressions of emotion. Proc. Nat. Acad. Sci. 111, E1454–E1462. doi: 10.1073/pnas.1322355111

PubMed Abstract | Crossref Full Text | Google Scholar

Egmont-Petersen, M., de Ridder, D., and Handels, H. (2002). Image processing with neural networks—a review. Pattern Recognit. 35, 2279–2301. doi: 10.1016/S0031-3203(01)00178-9

Crossref Full Text | Google Scholar

Ekman, P. (1993). Facial expression and emotion. Am. Psychol. 48:384. doi: 10.1037//0003-066X.48.4.384

Crossref Full Text | Google Scholar

Ekman, P., and Friesen, W. V. (2003). Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues, volume 10. Ishk.

Google Scholar

Ekman, P., and Rosenberg, E. L. (1997). What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS). Oxford: Oxford University Press, USA. doi: 10.1093/oso/9780195104462.001.0001

Crossref Full Text | Google Scholar

Error, M. S. (2008). Mean Squared Error. New York, NY: Springer New York 337–339.

Google Scholar

Fard, A. P., and Mahoor, M. H. (2022). AD-corre: adaptive correlation-based loss for facial expression recognition in the wild. IEEE Access 10, 26756–26768. doi: 10.1109/ACCESS.2022.3156598

Crossref Full Text | Google Scholar

Farkhod, A., Abdusalomov, A. B., Mukhiddinov, M., and Cho, Y.-I. (2022). Development of real-time landmark-based emotion recognition cnn for masked faces. Sensors 22:8704. doi: 10.3390/s22228704

PubMed Abstract | Crossref Full Text | Google Scholar

Febrian, R., Halim, B. M., Christina, M., Ramdhan, D., and Chowanda, A. (2023). Facial expression recognition using bidirectional lstm-cnn. Procedia Comput. Sci. 216, 39–47. doi: 10.1016/j.procs.2022.12.109

Crossref Full Text | Google Scholar

Gautam, C., and Seeja, K. (2023). Facial emotion recognition using handcrafted features and cnn. Procedia Comput. Sci. 218, 1295–1303. doi: 10.1016/j.procs.2023.01.108

Crossref Full Text | Google Scholar

Goodfellow, I. (2016). Deep Learning. Cambridge: MIT press.

Google Scholar

Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., et al. (2013). “Challenges in representation learning: a report on three machine learning contests,” in Neural information processing: 20th international conference, ICONIP 2013, Daegu, Korea, november 3–7, 2013. Proceedings, Part III 20 (Springer), 117–124. doi: 10.1007/978-3-642-42051-1_16

PubMed Abstract | Crossref Full Text | Google Scholar

Gross, R., Matthews, I., Cohn, J., Kanade, T., and Baker, S. (2010). Multi-pie. Image Vis. Comput. 28, 807–813. doi: 10.1016/j.imavis.2009.08.002

PubMed Abstract | Crossref Full Text | Google Scholar

He, K., Zhang, X., Ren, S., and Sun, J. (2015). “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778. doi: 10.1109/CVPR.2016.90

Crossref Full Text | Google Scholar

Hu, J., Shen, L., and Sun, G. (2018). “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141. doi: 10.1109/CVPR.2018.00745

PubMed Abstract | Crossref Full Text | Google Scholar

Huang, Z.-Y., Chiang, C.-C., Chen, J.-H., Chen, Y.-C., Chung, H.-L., Cai, Y.-P., et al. (2023). A study on computer vision for facial emotion recognition. Sci. Rep. 13:8425. doi: 10.1038/s41598-023-35446-4

PubMed Abstract | Crossref Full Text | Google Scholar

Jain, N., Kumar, S., Kumar, A., Shamsolmoali, P., and Zareapoor, M. (2018). Hybrid deep neural networks for face emotion recognition. Pattern Recognit. Lett. 115, 101–106. doi: 10.1016/j.patrec.2018.04.010

Crossref Full Text | Google Scholar

Khaireddin, Y., and Chen, Z. (2021). Facial emotion recognition: state of the art performance on fer2013. arXiv preprint arXiv:2105.03588.

Google Scholar

Kingma, D. P. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Google Scholar

Kleisner, K., Trnka, J., and Turecek, P. (2024). Facedig: automated tool for placing landmarks on facial portraits for geometric morphometrics users. arXiv preprint arXiv:2411.01508.

Google Scholar

Kollias, D. (2022). “Abaw: valence-arousal estimation, expression recognition, action unit detection &multi-task learning challenges,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2328–2336. doi: 10.1109/CVPRW56347.2022.00259

Crossref Full Text | Google Scholar

Kollias, D. (2023). “Abaw: learning from synthetic data &multi-task learning challenges,” in European Conference on Computer Vision (Springer), 157–172. doi: 10.1007/978-3-031-25075-0_12

Crossref Full Text | Google Scholar

Kollias, D., Schulc, A., Hajiyev, E., and Zafeiriou, S. (2020). “Analysing affective behavior in the first abaw 2020 competition,” in 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020)(FG), 794–800. doi: 10.1109/FG47880.2020.00126

Crossref Full Text | Google Scholar

Kollias, D., Sharmanska, V., and Zafeiriou, S. (2019a). Face behavior a la carte: Expressions, affect and action units in a single network. arXiv preprint arXiv:1910.11111.

Google Scholar

Kollias, D., Sharmanska, V., and Zafeiriou, S. (2021). Distribution matching for heterogeneous multi-task learning: a large-scale face study. arXiv preprint arXiv:2105.03790.

Google Scholar

Kollias, D., Tzirakis, P., Baird, A., Cowen, A., and Zafeiriou, S. (2023). “Abaw: valence-arousal estimation, expression recognition, action unit detection &emotional reaction intensity estimation challenges,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5888–5897. doi: 10.1109/CVPRW59228.2023.00626

Crossref Full Text | Google Scholar

Kollias, D., Tzirakis, P., Nicolaou, M. A., Papaioannou, A., Zhao, G., Schuller, B., et al. (2019b). Deep affect prediction in-the-wild: Aff-wild database and challenge, deep architectures, and beyond. Int. J. Comput. Vision 127, 907–929. doi: 10.1007/s11263-019-01158-4

Crossref Full Text | Google Scholar

Kollias, D., and Zafeiriou, S. (2019). Expression, affect, action unit recognition: Aff-wild2, multi-task learning and arcface. arXiv preprint arXiv:1910.04855.

Google Scholar

Kollias, D., and Zafeiriou, S. (2021a). Affect analysis in-the-wild: Valence-arousal, expressions, action units and a unified framework. arXiv preprint arXiv:2103.15792.

Google Scholar

Kollias, D., and Zafeiriou, S. (2021b). “Analysing affective behavior in the second abaw2 competition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 3652–3660. doi: 10.1109/ICCVW54120.2021.00408

Crossref Full Text | Google Scholar

Kumar, A., Kumar, A., and Gupta, S. (2025). Machine learning-driven emotion recognition through facial landmark analysis. SN Comput. Sci. 6, 1–10. doi: 10.1007/s42979-025-03688-w

Crossref Full Text | Google Scholar

Kumar, P., Happy, S., and Routray, A. (2016). “A real-time robust facial expression recognition system using hog features,” in 2016 International Conference on Computing, Analytics and Security Trends (CAST) (IEEE), 289–293. doi: 10.1109/CAST.2016.7914982

Crossref Full Text | Google Scholar

Langner, O., Dotsch, R., Bijlstra, G., Wigboldus, D. H., Hawk, S. T., and Van Knippenberg, A. (2010). Presentation and validation of the radboud faces database. Cogn. Emot. 24, 1377–1388. doi: 10.1080/02699930903485076

Crossref Full Text | Google Scholar

Leong, S. C., Tang, Y. M., Lai, C. H., and Lee, C. (2023). Facial expression and body gesture emotion recognition: a systematic review on the use of visual data in affective computing. Comput. Sci. Rev. 48:100545. doi: 10.1016/j.cosrev.2023.100545

Crossref Full Text | Google Scholar

Lewis, J. P., Anjyo, K., Rhee, T., Zhang, M., Pighin, F. H., and Deng, Z. (2014). “Practice and theory of blendshape facial models,” in Eurographics 2014 - State of the Art Reports, eds. S. Lefebvre and M. Spagnuolo (The Eurographics Association). doi: 10.2312/egst.20141042

Crossref Full Text | Google Scholar

Li, S., and Deng, W. (2020). Deep facial expression recognition: a survey. IEEE Trans. Affect. Comput. 13, 1195–1215. doi: 10.1109/TAFFC.2020.2981446

Crossref Full Text | Google Scholar

Lin, T. (2017). Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

PubMed Abstract | Google Scholar

Loshchilov, I. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

Google Scholar

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., et al. (2019a). “Mediapipe: a framework for perceiving and processing reality,” in Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR).

Google Scholar

Lugaresi, C., Tang, J., Nash, H., McClanahan, C., Uboweja, E., Hays, M., et al. (2019b). Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172.

Google Scholar

Marechal, C., Mikolajewski, D., Tyburek, K., Prokopowicz, P., Bougueroua, L., Ancourt, C., et al. (2019). Survey on ai-based multimodal methods for emotion detection. High-Perform. Model. Simul. Big Data Applic. 11400, 307–324. doi: 10.1007/978-3-030-16272-6_11

Crossref Full Text | Google Scholar

Medsker, L. R., and Jain, L. C. (2001). Recurrent Neural Networks: Design and Applications, 1st Edn. CRC Press. doi: 10.1201/9781003040620

Crossref Full Text | Google Scholar

Min, S., Yang, J., Lim, S., Lee, J., Lee, S., and Lim, S. (2024). Emotion recognition using transformers with masked learning. arXiv preprint arXiv:2403.13731.

Google Scholar

Minaee, S., Minaei, M., and Abdolrashidi, A. (2021). Deep-emotion: facial expression recognition using attentional convolutional network. Sensors 21:3046. doi: 10.3390/s21093046

PubMed Abstract | Crossref Full Text | Google Scholar

Mollahosseini, A., Hasani, B., and Mahoor, M. H. (2017). Affectnet: a database for facial expression, valence, and arousal computing in the wild. IEEE Trans. Affect. Comput. 10, 18–31. doi: 10.1109/TAFFC.2017.2740923

Crossref Full Text | Google Scholar

More, A. (2016). Survey of resampling techniques for improving classification performance in unbalanced datasets. arXiv preprint arXiv:1608.06048.

Google Scholar

O'Malley, T., Bursztein, E., Long, J., Chollet, F., Jin, H., Invernizzi, L., et al. (2019). KerasTuner. Available online at: https://github.com/keras-team/keras-tuner (Accessed December 13, 2025).

Google Scholar

Pantic, M., Valstar, M., Rademaker, R., and Maat, L. (2005). “Web-based database for facial expression analysis,” in 2005 IEEE International Conference on Multimedia and Expo (IEEE), 5.

Google Scholar

Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.

Google Scholar

Savin, A. V., Sablina, V. A., and Nikiforov, M. B. (2021). “Comparison of facial landmark detection methods for micro-expressions analysis,” in 2021 10th Mediterranean Conference on Embedded Computing (MECO) (IEEE), 1–4. doi: 10.1109/MECO52532.2021.9460191

Crossref Full Text | Google Scholar

Shaila, S., Vadivel, A., and Avani, S. (2023). Emotion estimation from nose feature using pyramid structure. Multimed. Tools Appl. 82, 42569–42591. doi: 10.1007/s11042-023-14682-w

Crossref Full Text | Google Scholar

Valstar, M., and Pantic, M. (2010). “Induced disgust, happiness and surprise: an addition to the mmi facial expression database,” in Proceedings of the 3rd Intern. Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect (Paris, France), 65.

Google Scholar

Vulpe-Grigorasi, A., and Grigore, O. (2021). “Convolutional neural network hyperparameters optimization for facial emotion recognition,” in 2021 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE), 1–5. doi: 10.1109/ATEE52255.2021.9425073

Crossref Full Text | Google Scholar

Yuan, X. (2021). Fer2013-Facial-Emotion-Recognition-Pytorch. GitHub repository. Available online at: https://github.com/LetheSec/Fer2013-Facial-Emotion-Recognition-Pytorch (Accessed December 13, 2025).

Google Scholar

Zafeiriou, S., Kollias, D., Nicolaou, M. A., Papaioannou, A., Zhao, G., and Kotsia, I. (2017). “Aff-wild: valence and arousal ‘in-the-wild' challenge,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on (IEEE), 1980–1987. doi: 10.1109/CVPRW.2017.248

Crossref Full Text | Google Scholar

Zhou, W., Lu, J., Xiong, Z., and Wang, W. (2023). “Leveraging TCN and transformer for effective visual-audio fusion in continuous emotion recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5756–5763. doi: 10.1109/CVPRW59228.2023.00610

Crossref Full Text | Google Scholar

Keywords: emotion estimation, computer vision, social robotics, BlendFER-Lite, neural networks, Blendshapes

Citation: Attrah S (2026) Emotion estimation from video footage with LSTM. Front. Neurorobot. 19:1678984. doi: 10.3389/fnbot.2025.1678984

Received: 03 August 2025; Revised: 01 November 2025;
Accepted: 14 November 2025; Published: 06 February 2026.

Edited by:

Michalis Vrigkas, University of Western Macedonia, Greece

Reviewed by:

Imran Ashraf, University of Lahore, Pakistan
M. Sreekrishna, SRM Institute of Science and Technology, India

Copyright © 2026 Attrah. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Samer Attrah, c2FtaXJhdHJhOTVAZ21haWwuY29t

ORCID: Samer Attrah orcid.org/0009-0006-8090-1256

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.