ORIGINAL RESEARCH article

Front. Comput. Sci., 16 April 2026

Sec. Human-Media Interaction

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1774796

Graph-based multimodal affect recognition in children using prototypical networks

  • Department of Computer Engineering, Institute of Engineering and Technology, DAVV, Indore, India

Abstract

Introduction:

Although physiological signals such as heart rate, perspiration, and facial muscle activity are recognized as markers of emotional events, precisely classifying affective states from these data remains a significant challenge. Addressing this issue is fundamental for developing advanced human-computer interaction and assistive technologies. While emotion recognition in adults has been extensively studied, it is less understood in children, necessitating focused research.

Methods:

This study introduces a multimodal framework tailored for the emotion recognition of children. We used prototypical networks to learn discriminative embeddings from each physiological modality. These embeddings were then used to construct an adaptive k-nearest-neighbors (KNN) graph that models the interrelationships among affective conditions across the modalities. A graph neural network (GNN) leverages this structural representation for the final classification, improving performance by capturing the intrinsic relational context.

Results:

Our proposed framework improved classification performance by 8%–10% compared to single-modality baselines and existing fusion approaches, achieving an overall accuracy of 83%.

Discussion:

These results show that multimodal fusion and graph-based learning can accurately capture the complex interplay of biological signals in children, providing a more accurate approach to pediatric affective computing.

1 Introduction

The field of emotional computing is both challenging and compelling because it aims to develop intelligent systems capable of recognizing, understanding, and responding to human emotions (D'Amelio et al., 2025). Emotion-aware intelligent systems hold significant promise in pediatric healthcare, where they can support early diagnosis, enable personalized interventions, and improve children's overall wellbeing (Landowska et al., 2020). However, advancing automated affect recognition in children requires a clear conceptualization of affect, as emotional expression and regulation differ substantially from those in adults (Sosa-Hernandez et al., 2024).

Emotion recognition using physiological signals [e.g., electrodermal activity (EDA), photoplethysmography (PPG), electroencephalography (EEG), and heart rate variability (HRV)] offers an objective, non-invasive alternative for children who may have limited ability to verbally express their emotions (Maithri et al., 2022; Zhang et al., 2020). These signals reflect autonomic and central nervous system activity related to emotional arousal and valence. Applications span clinical assessment (e.g., for autism or anxiety) and educational technologies for adaptive learning (Deng et al., 2025; Bhatti et al., 2021; Pérez-Jorge et al., 2025). Despite this promise, reliable affect recognition in children remains challenging. First, physiological responses and affect patterns in children are not simply scaled versions of adult data; they evolve with developmental maturation, necessitating child-specific analytical approaches (Zhang et al., 2020). Second, the creation of robust models is hindered by high inter-subject variability, the presence of motion artifacts, and the practical difficulty of collecting high-quality labeled physiological data from children in naturalistic settings (Wang et al., 2024). Consequently, large child-centric datasets are lacking, making it difficult to apply standard data-intensive machine learning models. This leads to the core problem of developing accurate, generalizable, and data-efficient affect recognition models for pediatric populations that can handle limited labeled data and inherent physiological variability.

In this study, we propose a novel multisignal emotion recognition framework to address these problems, combining vascular (multiwavelength PPG), electrodermal, cardiac, and motor responses recorded during emotional processing. Our approach is based on a prototypical network integrated with a graph neural network (GNN) designed to learn robust embeddings from limited datasets. We used an array of physiological signals, including EDA, multiwavelength PPG, and 9-axis Inertial Measurement Unit (IMU) data, collected from children during the presentation of emotion-eliciting visual stimuli. The prototypical network's meta-learning strategy directly tackles the issue of limited labeled examples, whereas the GNN component is designed to model the complex nonlinear relationships between multimodal physiological signals and improve generalization across subjects. This study contributes a child-centric methodological framework aimed at improving data efficiency and handling variability in pediatric affective computing.

The remainder of this paper is organized as follows: Section 2 reviews the related literature and prior work on physiological emotion recognition; Section 3 describes the methodology; Section 4 discusses the results; and Section 5 concludes the paper.

2 Literature review

Previous studies on emotion recognition have comprehensively studied the use of physiological signals to infer emotional states. For example, EDA has been widely used because of its strong association with autonomic nervous system activity and efficiency in capturing emotional arousal, particularly in detecting stress and anxiety in adults (Deng et al., 2025; Bhatti et al., 2021). Similarly, HRV reflects cardiovascular regulation, which varies with emotional conditions and developmental maturation (Pérez-Jorge et al., 2025). While these peripheral signals are reliable indicators of arousal, they frequently lack specificity to distinguish between fine-grained valence states (e.g., joy vs. surprise), demonstrating the need for additional data sources to capture the full spectrum of affective experiences.

EEG provides complementary knowledge of the neural systems underlying emotional processing by monitoring the electrical activity across cortical regions. Studies have shown that changes in EEG frequency bands, such as alpha, beta, and theta, are correlated with affective regulation and states (Pérez-Jorge et al., 2025; Rahman et al., 2023). Machine learning techniques have been widely applied to EEG-based emotion recognition, enabling effective feature extraction and classification of affective conditions (Wang et al., 2024).

Recent research has shown that combining multiple physiological modalities can improve emotion-recognition performance. Multimodal fusion approaches that integrate EDA, electrocardiography (ECG)/HRV, EEG, and other biological signals have demonstrated improved robustness and accuracy by leveraging complementary emotional information across modalities (Deng et al., 2025; Bhatti et al., 2021; Alghamdi et al., 2025). Deep learning architectures, such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and transformer-based models, have been used to capture both spatial and temporal patterns in multimodal physiological data (Alghamdi et al., 2025; Nakisa et al., 2020).

In addition to deep learning approaches, traditional machine learning classifiers, including support vector machines (SVM), random forests, and logistic regression, have been used effectively for emotion recognition, particularly when combined with feature-level or decision-level fusion strategies (Alghamdi et al., 2025; Nakisa et al., 2020). Generative models, such as hidden Markov models (HMMs), have also been used to model the time-based dynamics of emotional states in dimensional emotion spaces such as arousal and valence (Torres-Valencia et al., 2014).

Recently, self-supervised learning (SSL) methods have received attention for their ability to learn reliable representations directly from unlabeled physiological data, thereby reducing the reliance on large labeled datasets (Montero Quispe et al., 2022; Zhang and Cui, 2024). Contrastive learning approaches have also enhanced multimodal emotion recognition by aligning heterogeneous features across different modalities (Lee et al., 2024). Additionally, brain-inspired techniques, such as hyperdimensional computing (HDC), have emerged as computing-efficient alternatives, demonstrating strong performance with limited training data while preserving robustness (Chang et al., 2019).

Overall, existing research indicates that effective physiological emotion recognition systems adopt deep learning for feature extraction along with temporal modeling, incorporate multimodal fusion strategies, and increasingly adopt self-supervised and comparative learning paradigms. However, most existing studies have focused on adult populations, and child-centric datasets and models are limited. This gap motivated the present study, which focuses on multimodal physiological emotion recognition specifically for children.

3 Methodology

3.1 Study protocol

Physiological signals were gathered from 15 children aged 8–13 years to examine their emotional responses. Basic demographic data were documented, and each participant was allocated a unique identifier. Audio-visual stimuli suitable for children were used to evoke three distinct emotional states: positive, negative, and neutral. Short video clips obtained from social media were pre-screened and classified by children according to their emotional impact, including humorous clips for positive emotions and mildly disturbing scenes for negative emotions. Neutral baseline signals were recorded before the stimuli were presented.

Data acquisition was conducted using the EmotiBit wearable device (Montgomery et al., 2023), which was chosen for its non-intrusive design to promote natural behavior. The experiments were performed in a controlled setting with regulated lighting and temperature, and all signals were synchronized using a single computer. After attaching the sensors, a two-minute baseline recording was conducted with the participants seated in a relaxed position. Each participant completed two experimental sessions. Within each session, four to five randomized affective video clips (two to three minutes each) were presented, with each clip followed by a 50-s neutral resting period. These interstimulus intervals allowed physiological signals to return to baseline before subsequent emotional elicitation and provided resting-state data for the neutral class. Additionally, a two-minute resting baseline was recorded before the start of each session. Physiological signals were recorded continuously throughout the experiment. Under this protocol, each participant viewed an average of nine affective video clips across both sessions (4–5 per session), yielding a total of 135 emotional clips. Following sliding-window segmentation (10-s windows with a 4-s stride), the final dataset comprised approximately 6,800 samples, with a balanced representation of positive and negative emotional states and approximately 1,900 neutral samples derived from both the initial baselines and the inter-stimulus intervals. The experimental and study protocol is summarized in Table 1.

Table 1

Category | Details
Participants | 15 children (8–13 years)
Ethics | Informed consent/assent obtained
Setting | Familiar school environment
Stimuli: Neutral | 2-min baseline rest + neutral content
Stimuli: Positive | Amusing clips featuring individuals and animals
Stimuli: Negative | Frightening and unsettling scenes from films and YouTube
Protocol | Baseline → neutral → positive/negative (counterbalanced)
Inter-stimulus interval | 50 s of neutral content
Recording device | EmotiBit (wrist-worn)
Recorded signals: Physiological | EDA, 3-wavelength PPG (IR, red, green), skin temperature
Recorded signals: Gesture and posture | 9-axis IMU (3-axis accelerometer, gyroscope, magnetometer)
Recorded signals: Other | Heart rate, inter-beat intervals (IBI)
Sampling rates | EDA: 15 Hz, PPG: 25 Hz, IMU: 25 Hz, Temp: 7 Hz

Summary of study protocol.

3.1.1 Ethical considerations

This study was conducted with careful attention to the ethical principles of research involving children. The research protocol was reviewed by the research supervisor and the Head of School of Computers at IPS Academy, who confirmed that it adhered to the institutional guidelines for minimal-risk educational research. This study adhered to standard ethical practices for minimal-risk research with children, including written parental consent, child assent, the right to withdraw at any time, and continuous monitoring for participant distress. All procedures were explained to the parents and children prior to participation, and written informed consent was obtained from all parents. Child assent was obtained using age-appropriate language, and the participants were explicitly informed that they could withdraw at any time without consequence. The video stimuli consisted of age-appropriate content, including clips from commercially available children's movies and age-suitable media. The “negative” stimuli were limited to mildly unsettling scenes (e.g., suspenseful moments, mild conflict) that children might typically encounter in everyday media consumption. No graphic, violent, or intensely disturbing content was included. All clips were pre-screened by the research team and a panel of parents to ensure their appropriateness for children aged 8–13 years. During the experiment, parents were permitted to remain with their children, and a researcher continuously monitored the participants for any signs of distress. The session would have been stopped immediately if any child appeared uncomfortable (no such instances occurred). Following each session, the children engaged in a brief positive mood induction activity (e.g., watching a humorous clip) to ensure they departed in a neutral or positive emotional state. All data were anonymized, assigned unique identifiers, and securely stored on password-protected servers.

3.2 Preprocessing

3.2.1 EDA signal

EDA reflects changes in skin conductance that are linked to emotional arousal, particularly during stress or excitement (D'Amelio et al., 2025). The EDA signals underwent filtering and smoothing processes, utilizing a Butterworth high-pass filter (0.05 Hz) to remove slow trends and baseline drift, and a Butterworth low-pass filter (1.5 Hz) to retain phasic responses while eliminating high-frequency noise (Park et al., 2020). Additionally, we used a median filter with a window size of 3 to minimize spikes and outliers (Montgomery et al., 2023).
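As an illustration of this step, the following sketch applies the stated cutoffs with SciPy; the filter order (4) and the use of zero-phase filtering (filtfilt) are assumptions not specified above.

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

FS_EDA = 15  # EDA sampling rate in Hz (Table 1)

def preprocess_eda(eda, fs=FS_EDA, order=4):
    """Detrend and denoise a raw EDA trace with the cutoffs stated above.
    The filter order (4) and zero-phase filtering are assumptions."""
    nyq = fs / 2.0
    # High-pass at 0.05 Hz removes slow trends and baseline drift.
    b_hp, a_hp = butter(order, 0.05 / nyq, btype="highpass")
    x = filtfilt(b_hp, a_hp, eda)
    # Low-pass at 1.5 Hz keeps phasic responses and suppresses high-frequency noise.
    b_lp, a_lp = butter(order, 1.5 / nyq, btype="lowpass")
    x = filtfilt(b_lp, a_lp, x)
    # Median filter (window size 3) removes spikes and outliers.
    return medfilt(x, kernel_size=3)
```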

3.2.2 PPG signal

EmotiBit (Montgomery et al., 2023) captured PPG signals in red (PPG-PR), infrared (PPG-PI), and green (PPG-PG) wavelengths. The infrared PPG modality (PPG-PI) in our multimodal framework captures deep tissue perfusion and total hemoglobin dynamics originating from subcutaneous arteries and venous beds (Liu et al., 2016). Unlike green PPG (PPG-PG), which reflects superficial, sympathetically mediated microvascular changes (blush, pallor), and red PPG (PPG-PR), which tracks central cardiac pulsations, infrared PPG (PPG-PI) provides a motion-resilient composite signal encoding both phasic and tonic hemodynamic shifts associated with sustained emotional arousal (Jaafer et al., 2020). All PPG signals were processed by applying a fifth-order Butterworth band-pass filter (0.5–1.5 Hz) to isolate the relevant frequencies (Kim et al., 2018). Artifact removal was performed in two steps: first, a moving average filter with a window size of five samples was used for initial smoothing, followed by a median filter with a window size of three to eliminate spikes and outliers (Lee et al., 2024; Liang et al., 2018).
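A corresponding sketch for one PPG channel is shown below; the zero-phase filtering and the exact smoothing implementation are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, medfilt

FS_PPG = 25  # PPG sampling rate in Hz (Table 1)

def preprocess_ppg(ppg, fs=FS_PPG):
    """Filter and artifact-suppress one PPG channel (PI, PR, or PG)."""
    nyq = fs / 2.0
    # Fifth-order Butterworth band-pass (0.5-1.5 Hz) isolates the pulsatile band.
    b, a = butter(5, [0.5 / nyq, 1.5 / nyq], btype="bandpass")
    x = filtfilt(b, a, ppg)
    # Moving-average smoothing over 5 samples.
    x = np.convolve(x, np.ones(5) / 5.0, mode="same")
    # Median filter (window size 3) removes residual spikes and outliers.
    return medfilt(x, kernel_size=3)
```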

3.2.3 IMU signals and motion artifact quantification

IMU sensors, consisting of an accelerometer (ACC), gyroscope (GY), and magnetometer (MG), measure movement and orientation and thereby provide behavioral cues to children's emotional states. Accelerometer magnitude increases with emotional arousal regardless of valence (Peng et al., 2023), reflecting somatic nervous system activity. The gyroscope captures how children orient toward or away from emotional stimuli, indicating attention or avoidance, and the magnetometer indicates a child's heading even when motionless, reflecting attentiveness. Because these sensors convey emotional reactions through gestures and posture, careful preprocessing of the IMU data is essential for accurate affect recognition. To remove the gravitational component, orientation was estimated from the accelerometer and gyroscope data using the Madgwick filter (Najafi et al., 2003). The IMU data consist of readings from three sensors, each providing 3-axis measurements. To handle the complexity of these multidimensional data, we condensed each sensor's 3-axis readings into a single scalar magnitude (Equation 1):

mag(t) = √(x(t)² + y(t)² + z(t)²)

We then used a median filter (window size = 5) to reduce transient noise and fluctuations (Najafi et al., 2003; Jaafer et al., 2020).
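A minimal sketch of the magnitude computation (Equation 1) and median smoothing, using NumPy and SciPy:

```python
import numpy as np
from scipy.signal import medfilt

def imu_magnitude(xyz):
    """Collapse a 3-axis IMU stream of shape (N, 3) into a scalar magnitude per
    sample (Equation 1), then smooth transient noise with a median filter (window 5)."""
    mag = np.sqrt(np.sum(xyz ** 2, axis=1))
    return medfilt(mag, kernel_size=5)
```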

3.2.3.1 Motion artifact quantification

To account for motion-induced variability in signal fidelity, we quantified motion intensity using the accelerometer and gyroscope magnitude signals (Najafi et al., 2003). Following Madgwick filtering and magnitude calculation (Equation 1), for each time window corresponding to an emotion trial we computed a motion intensity index MI (Equation 2):

MI = (1/T) Σₜ₌₁ᵀ [ACCmag(t) + GYmag(t)]

where ACCmag and GYmag are the magnitude values from the accelerometer and gyroscope, respectively, and T is the window length corresponding to each emotion trial. This index serves two purposes: (1) identifying windows with excessive motion (more than the mean plus two standard deviations) and (2) assigning each modality a confidence weight w(m), governed by a scaling parameter β (Equation 3). These weights were then incorporated into the multimodal fusion process so that potentially distorted signals contribute less, making the method more robust to motion artifacts, which is particularly important when working with children and consumer-grade devices. We applied participant-wise normalization to all modalities, normalizing each child's data within each modality across all trials to mitigate inter-subject variability.
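The motion index and weighting can be sketched as follows; the exponential form of the confidence weight and the default β are assumptions, since the text specifies only that weights decrease with motion intensity via a scaling parameter β.

```python
import numpy as np

def motion_intensity(acc_mag, gy_mag):
    """Motion intensity index MI for one emotion-trial window (Equation 2):
    the mean of the summed accelerometer and gyroscope magnitudes."""
    return float(np.mean(acc_mag + gy_mag))

def confidence_weight(mi, beta=1.0):
    """Confidence weight for a modality (Equation 3). The exponential decay form
    and the default beta are assumptions; the text states only that weights
    decrease with motion intensity via a scaling parameter beta."""
    return float(np.exp(-beta * mi))

def flag_high_motion(mi_values):
    """Flag windows whose MI exceeds the mean plus two standard deviations."""
    mi = np.asarray(mi_values)
    return mi > mi.mean() + 2.0 * mi.std()
```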

3.3 Data augmentation

To enhance generalization and classification performance, we implemented a data augmentation pipeline that exclusively applies modality-specific transformations during the training process. Augmentation serves two purposes: (1) to increase the scale of the training dataset to prevent overfitting and (2) to balance the dataset to enhance the model's resilience to natural changes in signal properties.

3.3.1 EDA signal augmentation

Amplitude scaling was applied by multiplying the EDA signal by a random factor α ∈ [0.9, 1.1] (Equation 4), simulating natural variations in signal intensity due to physiological differences or sensor placement (Iwana and Uchida, 2021):

x_aug(t) = α · x(t)

where x(t) represents the original signal; the ±10% scaling range was chosen based on the inter-individual variability observed in our pilot data. Wavelet-based augmentation introduces controlled noise while preserving the physiological structure. The signal was decomposed using the Daubechies 4 (db4) wavelet (Yu et al., 2024), and Gaussian noise proportional to the standard deviation of each coefficient was added before reconstruction (Iwana and Uchida, 2021) (Equation 5):

c̃ᵢ = cᵢ + εᵢ,   εᵢ ~ N(0, [λ σ(cᵢ)]²)

where c̃ᵢ is the perturbed coefficient, λ is the proportionality factor, and σ(cᵢ) is the standard deviation of the coefficient cᵢ.
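A sketch of both EDA augmentations using PyWavelets; the noise-scaling factor (noise_scale) is an assumed proportionality constant.

```python
import numpy as np
import pywt

def amplitude_scale(x, low=0.9, high=1.1, rng=None):
    """Amplitude scaling (Equation 4): multiply the signal by alpha in [0.9, 1.1]."""
    rng = rng or np.random.default_rng()
    return rng.uniform(low, high) * x

def wavelet_augment(x, wavelet="db4", noise_scale=0.1, rng=None):
    """Wavelet-based augmentation (Equation 5): decompose with db4, add Gaussian
    noise proportional to each coefficient array's standard deviation, reconstruct.
    noise_scale is an assumed proportionality factor."""
    rng = rng or np.random.default_rng()
    coeffs = pywt.wavedec(x, wavelet)
    noisy = [c + rng.normal(0.0, noise_scale * (c.std() + 1e-12), size=c.shape)
             for c in coeffs]
    return pywt.waverec(noisy, wavelet)[: len(x)]
```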

3.3.2 PPG signal augmentation

Wavelet-based augmentation (as in Equation 5) simulates natural fluctuations and minor motion artifacts in the data (Najafi et al., 2003). Time warping (Rahman et al., 2023; Burrello et al., 2022) creates temporal variations that mimic heart rate variability (Equation 6):

x_warp(t) = x(γ · t)

with a warping factor γ ∈ [0.9, 1.1]. Linear interpolation was used to resample the warped signal back to the original time scale.
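A sketch of the time-warping operation; clamping out-of-range positions to the last sample is an implementation choice not specified above.

```python
import numpy as np

def time_warp(x, low=0.9, high=1.1, rng=None):
    """Time warping (Equation 6): x_warp(t) = x(gamma * t) with gamma in [0.9, 1.1],
    evaluated by linear interpolation so the output keeps the original length.
    Out-of-range positions (gamma > 1) are clamped to the last sample."""
    rng = rng or np.random.default_rng()
    gamma = rng.uniform(low, high)
    n = len(x)
    positions = np.clip(gamma * np.arange(n), 0, n - 1)  # warped sampling positions
    return np.interp(positions, np.arange(n), x)
```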

3.3.3 IMU signal augmentation

For the ACC, GY, and MG data, we applied magnitude scaling (Equation 7) to adjust movement intensity with a factor β ∈ [0.8, 1.2]:

x_aug(t) = β · x(t)

Rotation (Equation 8) simulates sensor orientation changes by multiplying each 3-axis sample by a 3D rotation matrix R(θ) with angle θ ∈ [−10°, 10°] (Jaafer et al., 2020):

v_aug(t) = R(θ) · v(t)
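A sketch of the IMU scaling and rotation augmentations; restricting the rotation to a single axis is a simplification for illustration.

```python
import numpy as np

def scale_imu(xyz, low=0.8, high=1.2, rng=None):
    """Magnitude scaling of 3-axis IMU data (Equation 7) with beta in [0.8, 1.2]."""
    rng = rng or np.random.default_rng()
    return rng.uniform(low, high) * xyz

def rotate_imu(xyz, max_deg=10.0, rng=None):
    """Random rotation (Equation 8) with theta in [-10, 10] degrees, simulating
    sensor orientation changes. Rotating about the z-axis only is a simplification."""
    rng = rng or np.random.default_rng()
    theta = np.deg2rad(rng.uniform(-max_deg, max_deg))
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    return xyz @ rot.T  # xyz has shape (N, 3)
```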

All augmentations were applied online during training with random parameters, ensuring that the model never saw identical samples twice.

3.4 Feature extraction

The recognition of affect requires the proper extraction of features from signals that correlate with emotional states. The EDA, PPG, and IMU signals are classified as non-stationary, making multiresolution analysis techniques particularly suitable for their analysis. Our approach utilized the Continuous Wavelet Transform (CWT) (Mallat, 2009) to provide a multiscale time-frequency representation through scalograms (Mashrur et al., 2021). We also used a Deep Wavelet Scattering Network (DWSN) (Sepúlveda et al., 2021) to effectively capture both the local and global characteristics of the signal through feature extraction at multiple scales and frequencies. Figure 1 illustrates the architecture of the proposed feature extraction method used in this study.

Figure 1

By combining the CWT and DWSN, we integrated traditional wavelet analysis with advanced deep learning techniques, improving the feature extraction process. Furthermore, by quantifying motion intensity from the IMU data, we established a mechanism to assess the signal quality and weight modality contributions, addressing a key challenge in child affective computing, where motion artifacts are prevalent. Temporal and dynamic IMU features were also extracted to analyze rapid movements, duration, frequency, behavioral sequences, and periodicity to refine affect recognition. The subsequent section describes the feature extraction methods used in this study.

3.4.1 CWT

We utilized the CWT for feature extraction (Yu et al., 2024), defined as (Equation 9):

W(a, b) = (1/√a) ∫ s(τ) ψ*((τ − b)/a) dτ

where s(τ) represents the analyzed signal, a is the scale factor, b is the translation factor, and ψ*((τ − b)/a) is the complex conjugate of the scaled and translated wavelet function ψ_{a,b}(τ), defined through Equation 10:

ψ_{a,b}(τ) = (1/√a) ψ((τ − b)/a)

In our analysis, we decomposed EDA, multiwavelength PPG (PPG-PR, PPG-PG, and PPG-PI), and IMU signals using the CWT across multiple scales (1–128), producing scalograms that captured detailed time-frequency dynamics. Scalograms illustrate the distribution of signal energy over time and frequency, effectively highlighting transient events and dynamic patterns (Wang et al., 2021).

Compared with the Short-Time Fourier Transform (STFT), the CWT provides superior time-frequency localization. We specifically used the Complex Morlet (C-Morlet) wavelet (Snell et al., 2017), defined through Equation 11 as:

ψ(t) = (1/√(π f_b)) exp(j 2π f_c t) exp(−t²/f_b)

where f_c is the central frequency and f_b is the bandwidth parameter.
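A sketch of the scalogram computation with PyWavelets; the bandwidth and centre-frequency values passed to the complex Morlet wavelet are example settings, as the exact parameters are not reported here.

```python
import numpy as np
import pywt

def cwt_scalogram(x, fs, fb=1.5, fc=1.0):
    """CWT scalogram (Equations 9-11) over scales 1-128 with the complex Morlet
    wavelet. fb (bandwidth) and fc (centre frequency) are example values."""
    scales = np.arange(1, 129)
    coeffs, freqs = pywt.cwt(x, scales, f"cmor{fb}-{fc}", sampling_period=1.0 / fs)
    return np.abs(coeffs), freqs  # |W(a, b)| energy map and matching frequencies
```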

As an example of the feature extraction process, Figure 2 presents representative CWT-based scalograms of multi-wavelength PPG signals, illustrating how the time–frequency representations capture complex physiological signal dynamics.

Figure 2

3.4.2 DWSN

In this study, we used a DWSN (Sepúlveda et al., 2021; Andén and Mallat, 2014) to extract stable and informative features from EDA, multiwavelength PPG (PPG-PR, PPG-PG, and PPG-PI), and IMU signals. The DWSN integrates wavelet transforms with hierarchical feature extraction, providing stability against noise and translation invariance, as shown in Figure 3. Its capacity to capture features at various scales enables meticulous analyses of both short- and long-term patterns. The scattering process comprises cascading wavelet transforms, modulus nonlinearities, and low-pass filtering. Initially, a Gaussian low-pass filter h(τ) captures the coarse features of the signal (Equation 12):

S₀s(τ) = (s ∗ h)(τ)

Figure 3

The first-layer coefficients are obtained by convolving the signal s(τ) with dilated wavelets φ_k(τ) and taking the modulus (Equation 13):

U₁s(τ) = |(s ∗ φ_{k₁})(τ)|

where φ_k(τ) denotes the mother wavelet dilated to scale k. We used the Morlet wavelet, which is defined as (Equation 14):

φ(τ) = e^{jω₀τ} e^{−τ²/(2σ²)}

Low-pass filtering is then performed as follows (Equation 15):

S₁s(τ) = (|s ∗ φ_{k₁}| ∗ h)(τ)

Second-layer coefficients, which capture more complex structures, are computed as (Equations 16, 17):

U₂s(τ) = ||s ∗ φ_{k₁}| ∗ φ_{k₂}|(τ),   S₂s(τ) = (U₂s ∗ h)(τ)

Higher-order coefficients (m ≥ 2) are generalized through Equation 18:

S_m s(τ) = (| ⋯ ||s ∗ φ_{k₁}| ∗ φ_{k₂}| ⋯ ∗ φ_{k_m}| ∗ h)(τ)

The final scattering coefficients are aggregated across the layers (Equation 19):

S s = {S₀s, S₁s, S₂s, …, S_m s}

Unlike traditional CNNs, DWSNs utilize predefined wavelet kernels to produce comprehensive feature representations at each layer (Andén and Mallat, 2014). In this study, a two-layer scattering network was used to extract features from the EDA, multiwavelength PPG, and IMU data. After experimenting with different values of Q (the number of wavelets per octave), we found that Q = 8 was optimal for the first layer across all signals, while for the second layer Q = 2 was optimal for EDA and Q = 4 for the PPG and IMU signals.
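A sketch of the two-layer scattering features using kymatio (one possible implementation, not necessarily the one used here); the maximum scale J is an assumption.

```python
import numpy as np
from kymatio.numpy import Scattering1D

def scattering_features(x, J=6, Q=(8, 2)):
    """Two-layer wavelet scattering features (Equations 12-19). Q=(8, 2) matches
    the EDA setting above; use Q=(8, 4) for PPG and IMU. The maximum scale J=6
    is an assumption."""
    scattering = Scattering1D(J=J, shape=len(x), Q=Q)
    coeffs = scattering(x.astype(np.float64))  # (n_paths, time) scattering coefficients
    return coeffs.mean(axis=-1)                # time-averaged feature vector
```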

3.4.3 Temporal and dynamic features from IMU data

We also extracted the key temporal and dynamic features of the IMU data, including entropy, median, mean, standard deviation, variance, and percentile values (5th, 25th, 75th, and 95th) (Yu et al., 2024). Additional computed features were mid-hinge, trimean, root mean square (RMS), zero-crossing rate, and mean crossing rate. Table 2 summarizes some of the key features extracted, along with their mathematical definitions.

Table 2

Feature | Mathematical description
Entropy H(Y) | H(Y) = −Σᵢ p(yᵢ) log p(yᵢ)
Root mean square (RMS) | RMS = √((1/N) Σᵢ xᵢ²)
Zero crossing rate (ZCR) | ZCR = (1/(N−1)) Σᵢ 1{xᵢ xᵢ₊₁ < 0}
Rate of change of acceleration (jerk) | j(t) = da(t)/dt
Angular velocity (ω) | ω(t) = dθ(t)/dt
Intensity of movement (I) | Aggregate movement magnitude over the analysis window

Summary of key extracted features and their mathematical definitions.
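A sketch computing a subset of these statistics for one magnitude window; the histogram binning used for the entropy estimate is an assumption.

```python
import numpy as np
from scipy.stats import entropy

def temporal_features(x):
    """A subset of the Table 2 statistics computed on one magnitude window."""
    hist, _ = np.histogram(x, bins=32, density=True)      # empirical distribution
    q05, q25, q75, q95 = np.percentile(x, [5, 25, 75, 95])
    med = np.median(x)
    return {
        "entropy": float(entropy(hist + 1e-12)),
        "mean": float(np.mean(x)),
        "median": float(med),
        "std": float(np.std(x)),
        "rms": float(np.sqrt(np.mean(x ** 2))),
        "zero_crossing_rate": float(np.mean(np.diff(np.sign(x)) != 0)),
        "mean_crossing_rate": float(np.mean(np.diff(np.sign(x - x.mean())) != 0)),
        "mid_hinge": float((q25 + q75) / 2.0),
        "trimean": float((q25 + 2.0 * med + q75) / 4.0),
        "p05": float(q05), "p25": float(q25), "p75": float(q75), "p95": float(q95),
    }
```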

3.5 Feature selection

Feature selection is essential for alleviating overfitting induced by the high-dimensional feature space generated by the CWT- and DWSN-based feature extraction. A two-stage feature selection technique was employed. First, Principal Component Analysis (PCA) (Abdi and Williams, 2010) reduced the dimensionality by projecting the features onto a set of uncorrelated components. The most discriminative features were then identified using Random Forest feature-importance ranking, which captures nonlinear relationships among features (Vens and Costa, 2011). This procedure was performed independently for each modality, including the EDA, multiwavelength PPG, and IMU signals.
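A sketch of the two-stage selection for one modality with scikit-learn; the PCA variance-retention criterion and the forest size are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, var_retained=0.95, top_k=10, seed=0):
    """Two-stage selection for one modality: PCA decorrelates and reduces the
    feature space, then Random Forest importances rank the retained components.
    The 95% variance criterion and forest size are assumptions; top_k=10 matches
    the per-modality feature count used later."""
    pca = PCA(n_components=var_retained, random_state=seed)
    X_pca = pca.fit_transform(X)
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X_pca, y)
    top = np.argsort(rf.feature_importances_)[::-1][:top_k]
    return X_pca[:, top], pca, top
```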

3.6 Multi-modal prototypical graph network

We propose a multimodal prototypical graph network architecture, illustrated in Figure 4, which integrates prototypical networks and graph neural networks (GNNs) for affect recognition. The framework exploits modality-specific representations and performs graph-based multimodal fusion to capture the complex interdependencies among physiological signals. Prototypical learning (Zhang et al., 2022; Snell et al., 2017) is particularly suitable for emotion recognition tasks because of the limited number of labeled samples per participant. These networks perform metric-based classification by assigning samples to the nearest class prototypes in a learned embedding space.

Figure 4

3.6.1 Feature processing through backbones

For each physiological modality [EDA, PPG (PI, PR, PG), and IMU sensors (ACC, GY, and MG)], the top 10 most informative features were selected. Each modality m was represented by a feature matrix X⁽ᵐ⁾ ∈ ℝ^{N×10}, where N denotes the number of samples.

The selected features were processed using a Multi-Layer Perceptron (MLP) (Alghamdi et al., 2025; Wang et al., 2021) consisting of two fully connected layers with batch normalization and dropout to enhance training stability and generalization. The transformation is given as follows (Equation 20):

z⁽ᵐ⁾ = MLP⁽ᵐ⁾(X⁽ᵐ⁾)

The resulting embedding z⁽ᵐ⁾ ∈ ℝ^{N×32} represents a compact modality-specific latent space.
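A sketch of the modality backbone in PyTorch; the hidden width and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class ModalityBackbone(nn.Module):
    """Two fully connected layers with batch normalization and dropout, mapping
    the 10 selected features of one modality to a 32-dimensional embedding
    (Equation 20). Hidden width and dropout rate are assumptions."""
    def __init__(self, in_dim=10, hidden=64, out_dim=32, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, out_dim), nn.BatchNorm1d(out_dim), nn.ReLU(), nn.Dropout(p_drop),
        )

    def forward(self, x):      # x: (N, 10) selected features for one modality
        return self.net(x)     # (N, 32) modality-specific embedding
```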

3.6.2 Prototypical network for modality-specific embeddings

In our architecture, we trained a separate prototypical network for each of the seven signal modalities, including EDA, PPG (PPG-PI, PPG-PG, and PPG-PR), and IMU sensors (ACC, GY, and MG). Each network processed the support and query samples from its respective modality and generated a prototype embedding for every class c. For a given modality m, the prototypical network computes a class prototype by averaging the combined embeddings of the support set for class c (Equation 21):

p_c⁽ᵐ⁾ = (1/|S_c⁽ᵐ⁾|) Σ_{xᵢ ∈ S_c⁽ᵐ⁾} zᵢ⁽ᵐ⁾

where p_c⁽ᵐ⁾ is the prototype for class c in modality m, S_c⁽ᵐ⁾ is the set of support samples for class c in modality m, and zᵢ⁽ᵐ⁾ is the combined embedding generated for sample xᵢ by the backbone network (Equation 20). The distance between a query sample embedding z_q⁽ᵐ⁾ and a class prototype is computed using the Euclidean distance (Equation 22):

d(z_q⁽ᵐ⁾, p_c⁽ᵐ⁾) = ‖z_q⁽ᵐ⁾ − p_c⁽ᵐ⁾‖₂

The query sample was assigned to the class with the closest prototype for each modality.
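A sketch of prototype computation and nearest-prototype classification in PyTorch:

```python
import torch

def class_prototypes(support_emb, support_labels, n_classes=3):
    """Class prototypes (Equation 21): mean support embedding per class."""
    return torch.stack([support_emb[support_labels == c].mean(dim=0)
                        for c in range(n_classes)])          # (C, 32)

def nearest_prototype(query_emb, prototypes):
    """Assign each query sample to the class of its closest prototype
    under the Euclidean distance (Equation 22)."""
    distances = torch.cdist(query_emb, prototypes)            # (N_query, C)
    return distances.argmin(dim=1)
```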

3.6.3 Training and validation with LOSO-CV

We used the LOSO-CV to evaluate the model's performance across different subjects. In each fold, the data were split into the following sets:

  • Support set (D_support): used to compute the class prototypes in the prototypical network, defined as (Equation 23):

    D_support = {(z⁽ⁱ⁾, yᵢ)}

  • Query set (D_query): used to evaluate the model's classification performance, defined as (Equation 24):

    D_query = {(z⁽ʲ⁾, yⱼ)}

where z⁽ⁱ⁾ and z⁽ʲ⁾ represent the combined embeddings for the support and query samples, respectively (as defined in Equation 20), and yᵢ and yⱼ are the corresponding class labels. Owing to the limited number of participants, no separate validation set was used within each fold, and overfitting was controlled through early stopping based on the query-set performance.
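A sketch of the LOSO split with scikit-learn, assuming the held-out child's windows form the query set and the remaining children's windows form the support set:

```python
from sklearn.model_selection import LeaveOneGroupOut

def loso_folds(X, y, subject_ids):
    """LOSO folds: in each fold one child's windows are held out as the query set
    and the remaining children's windows serve as the support set (Equations 23-24)."""
    logo = LeaveOneGroupOut()
    for support_idx, query_idx in logo.split(X, y, groups=subject_ids):
        yield support_idx, query_idx
```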

3.6.4 Graph construction for multi-modal fusion

The modality-specific class prototypes are modeled as nodes in a graph, where the edges represent inter-modality relationships. A GNN is applied to refine the prototypes before the final classification. After generating query embeddings using the prototypical networks for each modality, these embeddings were combined into a graph structure to capture the relationships between modalities. To ensure that the graph captures meaningful connections, we applied an adaptive k-nearest-neighbor (k-NN) approach (Sun and Huang, 2010). A GNN was then applied to reduce the dominance of any single modality by incorporating neighborhood information, ensuring a balanced contribution from all modalities. This resulted in enhanced predictive performance in multimodal affect recognition tasks.

3.6.4.1 Node representation

Each node in the graph represents the prototypical-network output embedding for a specific modality (Equation 25):

V = {e⁽¹⁾, e⁽²⁾, …, e⁽ᴹ⁾}

where V is the set of nodes in the graph, e⁽ᵐ⁾ ∈ ℝ³² is the query embedding produced by the prototypical network for modality m, and M = 7 is the number of modalities (EDA, PPG-PI, PPG-PG, PPG-PR, ACC, GY, MG).

3.6.4.2 Confidence-weighted node features

To compensate for changes in signal quality caused by motion, each node embedding was scaled by a confidence weight derived from the IMU data (Equation 26):

ẽ⁽ᵐ⁾ = w⁽ᵐ⁾ · e⁽ᵐ⁾

where w⁽ᵐ⁾ is the confidence weight for modality m (Equation 3), derived from the motion intensity index MI defined in Equation 2. These weighted embeddings ẽ⁽ᵐ⁾ serve as input features for the graph neural network.

3.6.4.3 Edge construction using adaptive k-NN

Instead of using a fixed k-nearest-neighbor approach, we used an adaptive k-NN strategy that dynamically adjusts the number of neighbors based on the local density of embeddings. This ensures that each node is connected to a varying number of its most similar neighbors while preserving key structural relationships in the feature space. In multimodal approaches, relationships among modalities can vary significantly across the feature space; an adaptive k-NN approach (Sun and Huang, 2010) efficiently captures these dissimilarity variations, resulting in a graph structure that better represents the underlying data and leads to more meaningful representations. The local density of embeddings is estimated from the average inter-node distance among the nearest neighbors in the embedding space. For each node i, we define the density function as follows (Equation 27):

ρ(i) = (1/|N_kmin(i)|) Σ_{j ∈ N_kmin(i)} d(e⁽ⁱ⁾, e⁽ʲ⁾)

where ρ(i) represents the local density measure of node i, N_kmin(i) is the set of k_min nearest neighbors of i, and d(e⁽ⁱ⁾, e⁽ʲ⁾) is the Euclidean distance between embeddings e⁽ⁱ⁾ and e⁽ʲ⁾, each obtained from the prototypical network output. The average inter-node distance indicates how closely packed the embeddings are in the feature space, making it a suitable metric for estimating local density. To ensure numerical stability, the density values were normalized between 0 and 1 (Equation 28):

ρ̂(i) = (ρ(i) − ρ_min) / (ρ_max − ρ_min)

where ρ_min and ρ_max represent the global minimum and maximum density values. We then dynamically adjusted k based on the relative density (Equation 29): the initial neighbor count k (from a fixed k-NN approach) is scaled by a factor that grows with ρ̂(i), and the result is clipped to the range [k_min, k_max], so the number of neighbors varies inversely with the local density.

If a node lies in a sparse region [ρ(i) is high], it receives more connections [k′(i) increases] to maintain connectivity; if a node lies in a dense region [ρ(i) is low], it receives fewer connections [k′(i) decreases] to avoid over-clustering. This adaptive adjustment keeps the graph structure well balanced across different embedding distributions, preventing isolated nodes in sparse areas and excessive edges in dense areas. The minimum number of neighbors k_min = 3 was selected to prevent node isolation and ensure that every node maintains sufficient connectivity, and the maximum k_max = 5 was chosen to avoid excessive computational overhead and mitigate over-clustering. These values were selected based on a graph sparsity analysis and a grid search with cross-validation to achieve the best classification performance.
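A sketch of the adaptive k-NN edge construction; the exact rule mapping the normalized density to k′(i) is an assumption consistent with the description above.

```python
import numpy as np

def adaptive_knn_edges(E, k_min=3, k_max=5):
    """Adaptive k-NN over the M modality embeddings E (M x d). The local measure
    rho is the mean distance to the k_min nearest neighbours (Equation 27),
    min-max normalised (Equation 28); sparser nodes (larger rho) receive more
    neighbours, clipped to [k_min, k_max] (assumed form of Equation 29)."""
    M = E.shape[0]
    D = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(D, np.inf)
    rho = np.sort(D, axis=1)[:, :k_min].mean(axis=1)             # Equation 27
    rho_n = (rho - rho.min()) / (rho.max() - rho.min() + 1e-12)  # Equation 28
    k_prime = np.clip(np.rint(k_min + rho_n * (k_max - k_min)).astype(int),
                      k_min, k_max)                              # Equation 29 (assumed)
    return [(i, int(j)) for i in range(M)
            for j in np.argsort(D[i])[:k_prime[i]]]              # directed edge list
```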

3.6.4.4 Graph representation

The multi-modal graph is formally defined as G = (V, E, X) where V denotes the set of nodes, each of which represents an embedding derived from a specific modality. E defines the edges, constructed adaptively using adaptive connectivity based on k-NN to dynamically link the nodes according to their local density. X∈ ℝ N×d is the node feature matrix, where each row represents a node embedding of dimension d. Each edge (i, j) is stored in an adjacency matrix to ensure efficient connectivity. This graph structure enables efficient message passing and interaction across modalities through the graph convolutional layers in the GNN.

3.6.5 Graph neural network (GNN) for classification

Once the multimodal graph G is constructed, we train a GNN-based classifier (Wang et al., 2018) to make the final decision. The architecture consists of multiple graph convolutional network (GCN) layers (Li et al., 2023), allowing the model to aggregate information from neighboring nodes and thereby enhance the multimodal feature representations. To effectively capture inter-node dependencies, we integrated an attention mechanism based on graph attention networks (GATConv) (Gu et al., 2023) to compute attention weights and update node embeddings, prioritizing the most relevant neighbors during information propagation rather than treating all neighbors equally. The node update rule is as follows (Equation 30):

h′ᵢ = σ( Σ_{j ∈ N(i)} αᵢⱼ W hⱼ )

where h′ᵢ represents the refined embedding of node i, W is a trainable transformation matrix, and αᵢⱼ is the learned attention coefficient between nodes i and j (Equation 31).

This enables the model to emphasize informative neighbors while reducing the effect of less relevant nodes. After the attention module, batch normalization is applied to stabilize training and standardize activations, reducing internal covariate shift before the nonlinearity. We then applied graph convolutional layers (GCNConv) to further refine the node representations (Equation 32):

hᵢ⁽ˡ⁺¹⁾ = σ( Σ_{j ∈ N(i) ∪ {i}} (1/√(dᵢ dⱼ)) W⁽ˡ⁾ hⱼ⁽ˡ⁾ )

where hᵢ⁽ˡ⁾ is the embedding of node i at layer l, N(i) denotes the neighborhood of node i, W⁽ˡ⁾ is the trainable weight matrix at layer l, and the normalization term 1/√(dᵢ dⱼ), with node degrees dᵢ and dⱼ, prevents bias from high-degree nodes. After the GCN layer, batch normalization was applied again to ensure stable updates and prevent distribution shifts across layers, followed by Leaky ReLU activation to introduce controlled sparsity in the activations. To improve training stability, a residual connection was introduced after the first GCN layer (Equation 33):

H⁽ˡ⁺¹⁾ ← H⁽ˡ⁺¹⁾ + H⁽ˡ⁾

This mitigates the risk of over-smoothing and improves gradient flow. The final GNN output is passed through a softmax activation to obtain the class probabilities (Equation 34):

ŷ = softmax( H⁽ᴸ⁾ W⁽ᴸ⁾ )

where ŷ represents the probability distribution over the class labels, H⁽ᴸ⁾ denotes the final hidden node embeddings, and W⁽ᴸ⁾ is the learnable classification weight matrix. The GNN was trained using cross-entropy loss and evaluated with LOSO-CV, with training and validation losses computed across epochs in each fold to monitor convergence. Dropout (with probabilities of 0.3–0.5) was applied to prevent the co-adaptation of neurons and reduce overfitting, and the Adam optimizer was used with weight decay to penalize large weights and encourage simpler models.
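A sketch of the fusion classifier with PyTorch Geometric; the mean-pooling read-out over the modality nodes and the hidden width are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GATConv, GCNConv

class MultimodalGNN(nn.Module):
    """GAT attention layer, two GCN layers with batch normalization and Leaky ReLU,
    a residual connection after the first GCN layer, and a softmax read-out
    (Equations 30-34). Mean pooling over the 7 modality nodes is an assumption."""
    def __init__(self, in_dim=32, hidden=32, n_classes=3, p_drop=0.3):
        super().__init__()
        self.gat = GATConv(in_dim, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.gcn1 = GCNConv(hidden, hidden)
        self.bn2 = nn.BatchNorm1d(hidden)
        self.gcn2 = GCNConv(hidden, hidden)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(hidden, n_classes)

    def forward(self, x, edge_index):         # x: (M, 32) weighted node features
        h = F.leaky_relu(self.bn1(self.gat(x, edge_index)))     # Equation 30
        h1 = F.leaky_relu(self.bn2(self.gcn1(h, edge_index)))   # Equation 32
        h1 = h1 + h                                             # residual (Equation 33)
        h2 = F.leaky_relu(self.gcn2(self.drop(h1), edge_index))
        logits = self.out(h2.mean(dim=0, keepdim=True))         # pool modality nodes
        return F.softmax(logits, dim=-1)                        # Equation 34
```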

4 Results and discussion

The performance evaluation of the proposed multimodal affect recognition framework is presented in this section. We initially present the baseline performances of the individual modalities, traditional fusion methods, and our fusion model. Subsequently, we examined the learned representations and the impact of each modality. The findings presented in Table 3 show the performance metrics for emotion recognition on the proposed database for the three distinct classes.

Table 3

Performance metric | Neutral (mean ± SD) | Negative (mean ± SD) | Positive (mean ± SD)
Precision | 0.83 ± 0.0032 | 0.81 ± 0.0025 | 0.77 ± 0.010
Recall | 0.80 ± 0.0491 | 0.80 ± 0.0176 | 0.77 ± 0.0170
F1-score | 0.84 ± 0.0194 | 0.80 ± 0.064 | 0.76 ± 0.058
Accuracy | 0.84 ± 0.0189 | 0.81 ± 0.045 | 0.78 ± 0.025

Evaluation of classification task.

4.1 Performance of single modalities

To establish a baseline and understand the contribution of each signal, we first evaluated each physiological modality independently using a prototypical network and assessed generalizability via LOSO-CV. The results are presented in Table 4. The performance of the individual modalities (all above 0.63 accuracy) is particularly noteworthy given our pediatric cohort (Table 4). Unlike adult populations, where physiological signals are relatively stable, children exhibit greater physiological lability and motion-related variability (D'Amelio et al., 2025; Landowska et al., 2020; Wang et al., 2024). The strong performance of EDA (0.735) in this context confirms that electrodermal activity remains a reliable arousal marker even in developing populations, consistent with findings from pediatric studies (Feng et al., 2018). The IMU data (ACC, GY, and MG) also demonstrated strong performance, likely reflecting motion-related correlates of affective states. In contrast, features derived from PPG were less effective as standalone inputs.

Table 4

Modality | Accuracy | Precision | Recall | F1-score
EDA | 0.735 | 0.733 | 0.731 | 0.734
PPG-PI | 0.637 | 0.642 | 0.636 | 0.645
PPG-PR | 0.678 | 0.684 | 0.680 | 0.690
PPG-PG | 0.631 | 0.632 | 0.625 | 0.630
ACC | 0.690 | 0.691 | 0.698 | 0.699
GY | 0.715 | 0.719 | 0.714 | 0.708
MG | 0.662 | 0.664 | 0.660 | 0.664

Performance metrics of three-class classification for individual modalities.

4.2 Baseline fusion methods

To further evaluate the performance of our proposed GNN-based fusion, we compared it with two established multimodal fusion benchmarks. The first was feature-level concatenation (early fusion), in which the feature vectors extracted from each modality (PPG, EDA, and IMU) were concatenated into a single high-dimensional vector and passed to a prototypical network; this baseline achieved an accuracy of 73.8%. The second was decision-level fusion (late fusion), in which independent prototypical networks were trained for each modality and their class probabilities were averaged to obtain the final prediction; this approach attained 75.4% accuracy.

4.3 Multimodal fusion

Our proposed GNN-based multimodal fusion framework significantly outperformed all single-modality models and traditional methods, achieving a peak accuracy of 0.83 for the three-class task (Table 3). This represents an improvement of 5%–10% across emotional classes. This performance gain is consistent with the established principle in affective computing that multimodal systems mitigate the limitations of individual channels (Kalateh et al., 2024). However, our use of a prototypical network with a GNN for fusion is novel in the pediatric setting. The ability of GNNs to model nonlinear, structured relationships between heterogeneous signals (e.g., linking EDA arousal spikes with specific HRV patterns) allows them to create a more reliable joint representation than simple feature concatenation or early fusion, effectively compensating for weak or noisy signals in any modality.

4.4 Analysis of the learned embedding space

To quantitatively assess class separability, we analyzed the within-class prototype distances for each emotion category across all LOSO folds.

Figure 5 displays the distribution of these distances, and Table 5 summarizes the descriptive statistics (mean, standard deviation, median, and interquartile range). The Kruskal–Wallis test (Okoye and Hosseini, 2024) revealed a significant overall difference among the three classes [H(2) = 62.12, p < 0.001], and Table 6 presents the post-hoc pairwise comparisons using Dunn's test with Bonferroni correction (Dunn, 1964). The pairwise comparisons (Table 6) showed that all class pairs differed significantly, although the magnitude of the difference varied. The neutral class showed the most compact clustering (median distance = 0.418), whereas the positive class had the largest distances (median = 0.569) but with remarkably low variance (SD = 0.072), indicating a consistent offset from the prototype rather than random scatter. The negative class displayed intermediate distances with higher variability (median = 0.510, SD = 0.136). The most pronounced difference was between neutral and positive (padj < 0.001), followed by neutral and negative (padj < 0.001), while positive and negative showed a smaller but still significant difference (padj = 0.039).

Figure 5

Table 5

Class | Mean | SD | Median | IQR
Neutral | 0.375 | 0.144 | 0.418 | 0.234
Positive | 0.553 | 0.072 | 0.569 | 0.086
Negative | 0.489 | 0.136 | 0.510 | 0.129

Summary statistics of within-class prototype distances.

Table 6

Kruskal–Wallis test: H(2) = 62.12, p < 0.001
Comparison | Adjusted p-value | Significance
Neutral vs. positive | 3.67 × 10⁻¹⁴ | ***
Neutral vs. negative | 8.61 × 10⁻⁸ | ***
Positive vs. negative | 0.039 | *

Overall and pairwise comparisons of within-class distances.

***p < 0.001,

*p < 0.05.

These quantitative findings offer a critical explanation for the classification results reported in Section 4.3. The model found it more challenging to classify positive affect based solely on physiological signals, as evidenced by its lower precision (0.77) and F1-score (0.76). This aligns with the "positivity offset" theory in affective neuroscience, whereby positive states may have more varied physiological signatures or may reflect the specific stimuli used (Kalateh et al., 2024; Siegel et al., 2018). The consistency between the nonlinear (t-SNE) and linear (PCA) projections (Figures 6, 7) further strengthens this conclusion, indicating that the distinct distribution of the positive class is an inherent property of the data representation and not an artifact of the visualization method. Moreover, the unexpectedly low variance of the positive class distances suggests that, while positive samples are consistently offset from the class centroid, they form a cohesive cluster, a nuance that could inform targeted feature engineering for pediatric affect recognition.

Figure 6

Figure 7

4.5 Modality contribution and ablation study

An ablation study (Table 7) provides granular insights into the importance of each modality, addressing a key challenge in multimodal research: understanding the contribution of each signal.

  • EDA was the most critical modality. Its removal caused the most significant performance drop (to 0.73, Δ = −0.106, padj < 0.001), corroborating its well-documented primacy in arousal-based affect recognition (Feng et al., 2018; Bhatti et al., 2021). This confirms that, for our child cohort, skin conductance remains the gold standard signal.

  • PPG signals (PPG-PI, PPG-PR, and PPG-PG) were collectively vital. The removal of all PPG features led to the largest overall decline (0.708, Δ = −0.131, padj < 0.001), underscoring their synergistic importance. Among the individual PPG features, PPG-PR was the most impactful (removal: 0.754, Δ = −0.085, padj < 0.01). This highlights the role of cardiovascular dynamics (heart rate and heart rate variability) in valence discrimination, a finding supported by prior work on both adult and pediatric emotion recognition (Maithri et al., 2022; Zhang et al., 2020).

  • IMU data (ACC, GY, MG) served a complementary role. Although removing GY caused a noticeable dip (to 0.749, Δ = −0.090, padj < 0.01), the IMU features were not decisive individually. Their value lies in providing contextual information (e.g., gross motor activity and fidgeting) that supports the interpretation of autonomic signals, especially in mitigating motion artifacts, which is a common challenge in pediatric data collection (Najafi et al., 2003). The relatively smaller drop when removing all IMU features (to 0.798, Δ = −0.041, padj < 0.05) confirms this supportive rather than primary role.

Table 7

Condition | Accuracy | Precision | Recall | F1 | Δ from all
All modalities | 0.839 | 0.830 | 0.823 | 0.826 | –
Remove EDA | 0.733*** | 0.730 | 0.733 | 0.734 | –0.106
Remove PPG (all) | 0.708*** | 0.702 | 0.703 | 0.700 | –0.131
Remove IMU (all) | 0.798* | 0.792 | 0.793 | 0.790 | –0.041
Single PPG feature removals
Remove PPG-PI | 0.771** | 0.779 | 0.775 | 0.771 | –0.068
Remove PPG-PR | 0.754** | 0.755 | 0.752 | 0.753 | –0.085
Remove PPG-PG | 0.754** | 0.738 | 0.720 | 0.693 | –0.085
Pairwise PPG feature removals
Remove PPG-PR, PG | 0.738*** | 0.734 | 0.731 | 0.730 | –0.101
Remove PPG-PI, PG | 0.731*** | 0.735 | 0.734 | 0.734 | –0.108
Remove PPG-PI, PR | 0.745*** | 0.746 | 0.746 | 0.744 | –0.094
IMU feature removals
Remove ACC | 0.740** | 0.748 | 0.742 | 0.740 | –0.099
Remove GY | 0.749** | 0.743 | 0.740 | 0.742 | –0.090
Remove MG | 0.740** | 0.736 | 0.731 | 0.734 | –0.099

Modality ablation study results with statistical comparison.

*padj < 0.05,

**padj < 0.01,

***padj < 0.0001,

Repeated-measures ANOVA (Greenhouse–Geisser corrected): F(7.44, 104.2) = 11.20, p < 0.001, η2 = 0.44.

To statistically validate these observations, we conducted a repeated-measures ANOVA across the 13 experimental conditions (the full model and 12 ablation conditions) using the 15 LOSO folds as subjects. Mauchly's test (Mauchly, 1940) indicated a violation of sphericity (p < 0.001); therefore, the Greenhouse–Geisser correction (Blanca et al., 2023) was applied (ε = 0.62). The analysis revealed a significant main effect of modality removal on accuracy [F(7.44, 104.2) = 11.20, p < 0.001, η² = 0.44], indicating that the composition of the modalities substantially influenced model performance. The large effect size (η² = 0.44) shows that nearly half of the variance in accuracy is explained by the modalities included. Post-hoc comparisons using paired t-tests with Bonferroni correction for 12 comparisons (adjusted α = 0.00417) confirmed that all removal conditions significantly reduced accuracy compared with the full model (Table 7). This analysis moves beyond typical ablation reporting and quantitatively demonstrates that the model's performance depends on the synergistic use of multiple modalities.

4.6 Significance and comparison

Our study advances the field of pediatric affective computing in several ways.

  • Addressing data scarcity with meta-learning: many state-of-the-art models in adult affective computing require large datasets (Rahman et al., 2023; Kalateh et al., 2024; Miranda-Correa et al., 2021). Our use of a prototypical network directly addressed the critical constraint of limited labeled pediatric data. Achieving an accuracy of 0.83 with this data-efficient approach is a significant result, demonstrating that robust models can be built without the massive datasets typically unavailable for child study.

  • Enhancing multimodal fusion in pediatrics: although multimodal fusion is well established, our GNN-based integration offers a novel architectural option for this field. Because the GNN explicitly models inter-modal interactions, in contrast to more straightforward fusion techniques (such as the feature-level concatenation employed in Banos et al., 2015; Schmidt et al., 2018; Zhang et al., 2023), it was better able to manage inter-subject variability and generalize under LOSO-CV, a scenario in which many models fail.

  • Delivering interpretable insights: the ablation study and embedding-space analysis go beyond reporting a single accuracy figure. They provide further insight into the model's behavior by identifying EDA and PPG as the major determinants, highlighting the difficulty of characterizing positive affect, and validating the supporting role of the IMU. This level of analysis is often absent but is essential for fostering trust and directing future sensor designs for child-centric applications.

Taken together, these results are significant: the study offers a data-efficient methodology that attains competitive performance while directly addressing the core challenges of pediatric research, namely limited datasets, considerable variability, and the need for generalizable and interpretable models. The findings validate the efficacy of our GNN-prototypical network fusion and underscore the importance of physiological cues in building effective emotion-aware systems for children.

4.7 Limitations and future directions

Although this study offers insights into affect recognition in children, the small sample size of 15 participants limits the generalizability of our results. Future studies should expand the dataset to include more diverse samples, including children with special needs. The complexity of emotions in everyday life cannot be fully captured by visual or auditory stimuli alone; future research should incorporate varied activities, such as interactive tasks, problem-solving challenges, teamwork, and interpersonal interactions, to better elicit and understand emotional experiences. In this study, we focused on positive, negative, and neutral affective states; extending the analysis to finer-grained emotional categories, such as anger, grief, enthusiasm, and surprise, may provide deeper insights into the intricate emotional landscape of children, including children with special needs. Additionally, integrating further modalities, including additional physiological signals, behavioral observations, and visual and contextual information, would enhance the comprehensiveness of affect recognition systems. Finally, incorporating self-report approaches adapted to children's communication abilities would enable real-time feedback and improve the emotional labeling process for this model.

5 Conclusion

This study introduced a multimodal affect recognition framework that integrates prototypical networks and graph neural networks (GNNs) to analyze the emotional states of children. The dataset includes PPG, EDA, heart rate, temperature, and 9-axis IMU data from 15 children, collected with the portable, wearable EmotiBit sensor, which provided precise and reliable measurements. Combining features from multiple modalities yielded significantly better classification results, and the model achieved 83% accuracy on the three-class task. The adaptive k-nearest-neighbor (k-NN) graph construction dynamically adjusts node connections, enhancing the adaptability of the GNN-based classification. These results highlight the promise of combining multiple modalities for emotion detection, which can play a crucial role in advancing affect recognition technology and guiding subsequent studies. By leveraging advanced technology and multimodal data, we can better understand children's emotional experiences, and this knowledge can be applied to develop customized interventions tailored to each child's needs, including personalized support for children with autism.

Statements

Data availability statement

The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Ethics statement

Ethical approval was not required for this study involving human participants, in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants' legal guardians/next of kin.

Author contributions

KC: Conceptualization, Data curation, Investigation, Methodology, Software, Writing – original draft, Writing – review & editing. GP: Project administration, Supervision, Writing – review & editing.

Funding

The author(s) declared that financial support was not received for this work and/or its publication.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

  • 1

Abdi, H., and Williams, L. J. (2010). Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2, 433–459. doi: 10.1002/wics.101

  • 2

Alghamdi, A. M., Ashraf, M. U., Bahaddad, A. A., Almarh Al Shehri, K. A., and Daraz, A. A. (2025). Cross-subject EEG signals-based emotion recognition using contrastive learning. Sci. Rep. 15:28295. doi: 10.1038/s41598-025-13289-5

  • 3

Andén, J., and Mallat, S. (2014). Deep scattering spectrum. IEEE Trans. Signal Process. 62, 4114–4128. doi: 10.1109/TSP.2014.2326991

  • 4

Banos, O., Villalonga, C., Garcia, R., Saez, A., Damas, M., Holgado-Terriza, J. A., et al. (2015). Design, implementation and validation of a novel open framework for agile development of mobile health applications. Biomed. Eng. Online 14 (Suppl 2):S6. doi: 10.1186/1475-925X-14-S2-S6

  • 5

Bhatti, A., Behinaein, B., Rodenburg, D., Hungler, P., and Etemad, A. (2021). Attentive cross-modal connections for deep multimodal wearable-based emotion recognition. arXiv [preprint]. doi: 10.48550/arXiv.2108.02241

  • 6

Blanca, M. J., Arnau, J., García-Castro, F. J., Alarcón, R., and Bono, R. (2023). Repeated measures ANOVA and adjusted F-tests when sphericity is violated: which procedure is best? Front. Psychol. 14:1192453. doi: 10.3389/fpsyg.2023.1192453

  • 7

Burrello, A., Pagliari, D. J., Bianco, M., Macii, E., Benini, L., Poncino, M., et al. (2022). "Improving PPG-based heart-rate monitoring with synthetically generated data," in Proceedings of the IEEE Biomedical Circuits and Systems Conference (BioCAS) (Taipei), 153–157. doi: 10.1109/BioCAS54905.2022.9948584

  • 8

Chang, E. J., Benini, L., Rahimi, A., and Wu, A. Y. A. (2019). "Hyperdimensional computing-based multimodality emotion recognition with physiological signals," in 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS) (Hsinchu: IEEE). doi: 10.1109/AICAS.2019.8771622

  • 9

D'Amelio, A., Galán, L. A., Maldonado, E. A., Barquinero, A. A. D., Cuello, J. R., Bruno, N. M., et al. (2025). Emotion recognition systems with electrodermal activity: from affective science to affective computing. Neurocomputing 651:130831. doi: 10.1016/j.neucom.2025.130831

  • 10

Deng, X., Fan, Z., and Dong, W. (2025). MEFD dataset and GCSFormer model: cross-subject emotion recognition based on multimodal physiological signals. Biomed. Phys. Eng. Express 11. doi: 10.1088/2057-1976/ae0e28

  • 11

    DunnO. J. (1964). Multiple comparisons using rank sums. Technometrics6, 241252. doi: 10.1080/00401706.1964.10490181

  • 12

    FengH.GolshanH. M.MahoorM. H. (2018). A wavelet-based approach to emotion classification using EDA signals. Expert Syst. Appl. 112, 7786. doi: 10.1016/j.eswa.2018.06.014

  • 13

    GuX.XieL.XiaY.ChengY.LiuL.TangL.et al. (2023). Autism spectrum disorder diagnosis using the relational graph attention network. Biomed. Signal Process. Control85:105090. doi: 10.1016/j.bspc.2023.105090

  • 14

    IwanaB. K.UchidaS. (2021). An empirical survey of data augmentation for time series classification with neural networks. PLoS ONE16:e0254841. doi: 10.1371/journal.pone.0254841

  • 15

    JaaferA.NilssonG.ComoG. (2020). “Data augmentation of IMU signals and evaluation via a semi-supervised classification of driving behavior,” in Proceedings of IEEE 23rd international conference on intelligent transportation systems (ITSC) (Rhodes), 16. doi: 10.1109/ITSC45102.2020.9294496

  • 16

    KalatehS.Estrada-JimenezL. A.Nikghadam-HojjatiS.BarataJ. (2024). A systematic review on multimodal emotion recognition: building blocks, current state, applications, and challenges. IEEE Access12, 103976104019. doi: 10.1109/ACCESS.2024.3430850

  • 17

    KimH. G.CheonE. J.BaiD. S.LeeY. H.KooB. H. (2018). Stress and heart rate variability: a meta-analysis and review of the literature. Psychiatry Investing15, 235245. doi: 10.30773/pi.2017.08.17

  • 18

    LandowskaA.KarpusA.ZawadzkaT.RobinsB. (2020). Automatic emotion recognition in children with autism: a systematic literature review. Sensors22:1649. doi: 10.3390/s22041649

  • 19

    LeeJ. H.KimJ. Y.KimH. G. (2024). Emotion recognition using EEG signals and audiovisual features with contrastive learning. Bioengineering11:997. doi: 10.3390/bioengineering11100997

  • 20

    LiM.QiuM.KongW.ZhuL.DingY. (2023). Fusion graph representation of EEG for emotion recognition. Sensors23:1404. doi: 10.3390/s23031404

  • 21

    LiangY.ElgendiM.ChenZ.WardR. (2018). An optimal filter for short photoplethysmogram signals. Sci. Data5:180076. doi: 10.1038/sdata.2018.76

  • 22

    LiuJ.YanB. P.DingW. X.DaiX. R.ZhangY. T.ZhaoN. (2016). Multi-wavelength photoplethysmography method for skin arterial pulse extraction. Biomed. Opt. Express7, 43134326. doi: 10.1364/BOE.7.004313

  • 23

    MaithriM.RaghavendraU.GudigarA.et al. (2022). Automated emotion recognition: current trends and future perspectives. Comput. Methods Programs Biomed. 215:106646. doi: 10.1016/j.cmpb.2022.106646

  • 24

    MallatS. (2009). A Wavelet Tour of Signal Processing, 3rd Edn. San Diego, CA: Academic Press.

  • 25

    MashrurF. R.IslamM. S.SahaD. K.IslamS. M. R.MoniM. A. (2021). SCNN: scalogram-based convolutional neural network to detect obstructive sleep apnea using single-lead electrocardiogram signals. Comput. Biol. Med. 134:104532. doi: 10.1016/j.compbiomed.2021.104532

  • 26

    MauchlyJ. W. (1940). Significance test for sphericity of a normal n-variate distribution. Ann. Math. Stat. 11, 204209. doi: 10.1214/aoms/1177731915

  • 27

    Miranda-CorreaJ. A.AbadiM. K.SebeN.PatrasI. (2021). AMIGOS: a dataset for affect, personality and mood research on individuals and groups. IEEE Trans. Affective Comput. 12, 479493. doi: 10.1109/TAFFC.2018.2884461

  • 28

    Montero QuispeK. G.UtyiamaD. M. S.Dos SantosE. M.OliveiraH. A. B. F.SoutoE. J. P. (2022). Applying self-supervised representation learning for emotion recognition using physiological signals. Sensors22:9102. doi: 10.3390/s22239102

  • 29

    MontgomeryS. M.NairN.ChenP.DikkerS. (2023). Introducing EmotiBit, an open-source multi-modal sensor for measuring research-grade physiological signals. Sci. Talks6:100181. doi: 10.1016/j.sctalk.2023.100181

  • 30

    NajafiB.AminianK.Paraschiv-IonescuA.LoewF.B'́ulaC. J.RobertP. (2003). Ambulatory system for human motion analysis using a kinematic sensor: monitoring of daily physical activity in the elderly. IEEE Trans. Biomed. Eng. 50, 711723. doi: 10.1109/TBME.2003.812189

  • 31

    NakisaB.RastgooM. N.RakotonirainyA.MaireF.ChandranV. (2020). Automatic emotion recognition using temporal multimodal deep learning. IEEE Access8, 225463225474. doi: 10.1109/ACCESS.2020.3027026

  • 32

    OkoyeK.HosseiniS. (2024). “Mann–Whitney U test and Kruskal–Wallis H test statistics in R,” in R programming (Singapore: Springer Nature Singapore), 25246. doi: 10.1007/978-981-97-3385-9_11

  • 33

    ParkC. Y.ChaN.KangS.KimA.KhandokerA. H.HadjileontiadisL.et al. (2020). K-EmoCon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Sci. Data7:293. doi: 10.1038/s41597-020-00630-y

  • 34

    PengZ.KommersD.LiangR.LongX.CottaarW.NiemarktH.et al. (2023). Continuous sensing and quantification of body motion in infants: a systematic review. Heliyon9:e18234. doi: 10.1016/j.heliyon.2023.e18234

  • 35

    Pérez-JorgeD.Olmos-RayaE.Alonso-RodríguezI.Pérez-PérezI. (2025). Electrodermal response to olfactory stimuli in children with autism spectrum disorder: a systematic review of emotional and cognitive regulation. Front. Educ. 10:1485252. doi: 10.3389/feduc.2025.1485252

  • 36

    RahmanM. M.RivoltaM. W.BadiliniF.SassiR. (2023). A systematic survey of data augmentation of ECG signals for AI applications. Sensors23:5237. doi: 10.3390/s23115237

  • 37

    SchmidtP.ReissA.DürichenR.Van LaerhovenK. (2018). Wearable-based affect recognition—a review. Sensors18:2574. doi: 10.3390/s19194079

  • 38

    SepúlvedaA.CastilloF.PalmaC.Rodriguez-FernandezM. (2021). Emotion recognition from ECG signals using wavelet scattering and machine learning. Appl. Sci. 11:4945. doi: 10.3390/app11114945

  • 39

    SiegelE. H.SandsM. K.Van den NoortgateW.CondonP.ChangY.DyJ.et al. (2018). Emotion fingerprints or emotion populations? A meta-analytic investigation of autonomic features of emotion categories. Psychol Bull. 144, 343393. doi: 10.1037/bul0000128

  • 40

    SnellJ.SwerskyK.ZemelR. (2017). “Prototypical networks for few-shot learning,” in Advances in Neural Information Processing Systems (NeurIPS), Vol. 30.

  • 41

    Sosa-HernandezL.VogelN.FrankiewiczK.ReaumeC.DrewA.McVey NeufeldS.et al. (2024). Exploring emotions beyond the laboratory: a review of emotional and physiological ecological momentary assessment methods in children and youth. Psychophysiology61:14699. doi: 10.1111/psyp.14699

  • 42

    SunS.HuangR. (2010). “An adaptive k-nearest neighbor algorithm,” in 2010 Seventh International Conference on Fuzzy Systems and Knowledge Discovery (Yantai), 9194. doi: 10.1109/FSKD.2010.5569740

  • 43

    Torres-ValenciaC. A.Orozco-GutierrezA. A.Alvarez LopezM. A.Garcia-AriasH. F. (2014). “Comparative analysis of physiological signals and electroencephalogram (EEG) for multimodal emotion recognition using generative models,” in 2014 XIX symposium on image, signal processing and artificial vision (Armenia: IEEE), 8, 15. doi: 10.1109/STSIVA.2014.7010181

  • 44

    VensC.CostaG. (2011). “Random forest-based feature induction,” in 2011 IEEE 11th International Conference on Data Mining (Vancouver, BC), 744753. doi: 10.1109/ICDM.2011.121

  • 45

    WangT.LuC.SunY.YangM.LiuC.OuC.et al. (2021). Automatic ECG classification using continuous wavelet transform and convolutional neural network. Entropy23:119. doi: 10.3390/e23010119

  • 46

    WangX.ZhangT.XuX.-m.ChenL.XingX.-F.ChenC. L. P. (2018). “EEG emotion recognition using dynamical graph convolutional neural networks and broad learning system,” in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (Madrid), 12401244. doi: 10.1109/BIBM.2018.8621147

  • 47

    WangY.ZhangB.DiL. (2024). Research progress in EEG-based emotion recognition: a survey. ACM Comput. Surv. 56, 149. doi: 10.1145/3666002

  • 48

    YuF.YiC.TianZ.LiuX.CaoJ.LiuL.et al. (2024). Intelligent wearable system with motion and emotion recognition based on digital twin technology. IEEE Internet Things J. 11, 2631426328. doi: 10.1109/JIOT.2024.3394244

  • 49

    ZhangJ.YinZ.ChenP.NicheleS. (2020). Emotion recognition using multi-modal data and machine learning techniques: a tutorial and review. Inf. Fusion59, 103126. doi: 10.1016/j.inffus.2020.01.011

  • 50

    ZhangM.CuiY. (2024). Self supervised learning based emotion recognition using physiological signals. Front. Hum. Neurosci. 18:1334721. doi: 10.3389/fnhum.2024.1334721

  • 51

    ZhangT.El AliA.HanjalicA.CesarP. (2022). Few-shot learning for fine-grained emotion recognition using physiological signals. IEEE Trans. Multimedia25, 37733787. doi: 10.1109/TMM.2022.3165715

  • 52

    ZhangY.GhanemB.WangZ.ShenJ. (2023). Multimodal graph learning for artificial intelligence of things. IEEE Internet Things J. 10, 1579715812.

Keywords

children, electrodermal activity, emotion recognition, multimodal, photoplethysmography

Citation

Choudhary K and Prajapati GL (2026) Graph-based multimodal affect recognition in children using prototypical networks. Front. Comput. Sci. 8:1774796. doi: 10.3389/fcomp.2026.1774796

Received

24 December 2025

Revised

21 February 2026

Accepted

27 March 2026

Published

16 April 2026

Volume

8 - 2026

Edited by

Andrej Košir, University of Ljubljana, Slovenia

Reviewed by

Ding Yang, Chengdu University of Technology, China

Fateh Bougamouza, University of Skikda, Algeria

Copyright

© 2026 Choudhary and Prajapati.

*Correspondence: Kavita Choudhary,
