ORIGINAL RESEARCH article

Front. Comput. Sci., 08 April 2026

Sec. Digital Education

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1780150

Multimodal AI in education: an avatar-based intelligent learning system for the Kazakh language

  • 1. Department of Online Learning, L.N. Gumilyov Eurasian National University, Astana, Kazakhstan

  • 2. Faculty of Information Technologies, L.N. Gumilyov Eurasian National University, Astana, Kazakhstan

  • 3. Department of Digital Development, L.N. Gumilyov Eurasian National University, Astana, Kazakhstan

  • 4. Center for Industrial Software, University of Southern Denmark, Sønderborg, Denmark

Abstract

This article describes the development of a multimodal learning system for the Kazakh language intended for digital educational environments. The study focuses on the lack of avatar-based learning systems adapted to the linguistic properties of the Kazakh language and the limited integration of verbal and non-verbal components in existing solutions. The proposed system combines syntactic and morphological text analysis with sentiment processing and intonation control. Speech synthesis, gesture generation, facial expression control, and lip synchronization are implemented within a single system architecture. Prosodic parameters are formed based on sentence structure and sentence-level emotional indicators, while visual articulation is synchronized with audio output. The system was tested in speech synthesis scenarios relevant to interactive educational use. The results show that the system can be used for automated lecture narration, voice-over of instructional materials, and basic learner interaction in avatar-based educational settings.

1 Introduction

Multimodal artificial intelligence is used in education to integrate verbal, visual, and affective channels of interaction. Such systems combine speech, gestures, facial expressions, and emotional cues to support comprehension and feedback within digital learning environments.

The application of artificial intelligence and interactive technologies has influenced the development of digital education. However, many existing learning systems provide limited mechanisms for personalization, emotional feedback, and interactive support, which may reduce learner involvement in the educational process (Achanta et al., 2021; Albouy et al., 2022; Amangeldy et al., 2023; Aipova et al., 2021; Aspelin et al., 2025; Berg et al., 2023; Bekmanova et al., 2022; Che et al., 2020).

These limitations are particularly evident in the instruction of morphologically complex and low-resource languages, including Kazakh. Although avatar-based systems, virtual agents, and immersive learning environments are widely discussed in the literature

(Achanta et al., 2021; Albouy et al., 2022; Amangeldy et al., 2023; Aipova et al., 2021; Aspelin et al., 2025; Berg et al., 2023; Bekmanova et al., 2022; Che et al., 2020; Chiradeja et al., 2025; Dai et al., 2024; Deng and Bao, 2022; Ersin et al., 2021; Feng et al., 2024; Fink et al., 2024; Guo et al., 2021), most existing solutions are not adapted to the linguistic structure and communicative characteristics of the Kazakh language.

The Kazakh language exhibits several linguistic characteristics that influence the design of computational language processing systems. As an agglutinative language, Kazakh expresses grammatical relations primarily through morphological suffixes rather than fixed word order. This results in flexible sentence structures and complex morphological forms, which can pose challenges for standard natural language processing approaches developed for Indo-European languages. In addition, communicative patterns in Kazakh discourse often combine syntactic structure, prosody, and non-verbal cues, such as gestures and facial expressions. These properties motivate the development of language-specific models that integrate morphological analysis, syntactic role detection, and culturally relevant gesture mapping within the proposed system.

Previous studies report that avatars can support communication and learner engagement in educational contexts (Achanta et al., 2021; Albouy et al., 2022; Amangeldy et al., 2023; Aipova et al., 2021; Aspelin et al., 2025; Berg et al., 2023; Bekmanova et al., 2022; Che et al., 2020; Chiradeja et al., 2025; Dai et al., 2024; Deng and Bao, 2022; Ersin et al., 2021; Feng et al., 2024). Avatar-based approaches have been applied in professional training, medical education, and interactive learning environments (Amangeldy et al., 2023; Hara et al., 2021; Hu et al., 2023; Issakova et al., 2020). Other studies indicate that learning effectiveness is related to the coordination of speech, gestures, facial expressions, and prosodic features (Aipova et al., 2021; Hu et al., 2023; Issakova et al., 2020).

Research on multimodal interaction highlights the role of combined visual and auditory cues in perception and attention (Hara et al., 2021; Hu et al., 2023; Issakova et al., 2020; Johansen et al., 2021). Emotional aspects of speech and facial expression have also been examined in the field of human–computer interaction, where their influence on communication processes has been reported (Berg et al., 2023; Bekmanova et al., 2022; Feng et al., 2024).

Studies related to the Kazakh language mainly address individual components, such as emotional speech processing (Feng et al., 2024), facial expression analysis (Hu et al., 2023), or general multimodal perception mechanisms (Hara et al., 2021; Issakova et al., 2020). As a result, educational systems that combine linguistic processing, affective modeling, and non-verbal behavior for the Kazakh language remain limited. This study builds upon our earlier work on an intelligent avatar-based learning system (IALS) for the Kazakh language, which focused on aligning linguistic structure with sentiment-driven expressions (Albouy et al., 2022). While the previous study addressed gesture design and emotional alignment at the sentence-structure level, the present work extends this approach by introducing a unified system architecture that integrates syntactic analysis, prosody control, sentiment processing, and real-time avatar-based instructional interaction.

This study aims to develop a multimodal IALS for the Kazakh language that integrates linguistic, paralinguistic, and affective components within a unified educational environment. The system supports interaction through syntactic analysis, speech synthesis with prosodic control, gesture and facial expression generation, and emotion-aware behavior regulation.

2 Methods

The development methodology of the IALS combines linguistic, technical, and audiovisual components to support instructional interaction in the Kazakh language (Kamiya and Guo, 2024; Kannan et al., 2021). The methodology is structured around three coordinated components that enable interaction with an educational avatar.

The first component implements sentence-level gesture modeling adapted to Kazakh grammar. It links syntactic patterns to predefined gesture categories and accounts for non-verbal communication conventions relevant to Kazakh discourse. The second component addresses speech synthesis and intonation control using a neural text-to-speech model for Kazakh. Prosodic parameters are derived from sentence structure and emotion-related lexical markers based on a lexical–semantic emotion resource (Kelmaganbetova et al., 2023). The third component provides an integration layer that coordinates multimodal synchronization, animation control, and data exchange between system modules (Figure 1). The integration of linguistic, technological, and audiovisual components enables a learning process that responds to learner input during interaction. The visual diagram presents the system development and integration workflow, illustrating how linguistic processing, speech synthesis, gesture generation, sentiment analysis, and avatar animation modules are combined within the proposed architecture.

Figure 1

2.1 Architecture of the IALS

The IALS is designed as a modular system that integrates natural language processing, speech synthesis, and avatar-based multimodal animation to support instruction in the Kazakh language. Figure 2 presents the system architecture and illustrates the processing flow between modules, including the transformation of instructional content into multimodal output and the role of the avatar in content presentation. The system comprises several interconnected modules, each responsible for a specific function within the data processing and output pipeline.

Figure 2

The primary processing module receives text-based instructional content and prepares it for linguistic and affective analysis. The language processing module performs morphological, syntactic, and semantic analysis of Kazakh text to identify sentence structure and emotion-related markers required for speech and gesture generation.

Based on this analysis, the gesture mapping module converts grammatical and semantic features into predefined non-verbal actions aligned with Kazakh communication conventions. Speech output is generated by a text-to-speech module with prosody control, which adjusts pitch, rhythm, and intonation according to sentence structure and affective parameters. Lip synchronization aligns phoneme timing with facial movements to maintain audiovisual correspondence. The avatar animation module produces gestures, facial states, and synchronized lip movements using a three-dimensional interface, while the integration layer manages data exchange, synchronization, and system operation. The system also includes a lecture presentation module that enables automated delivery of instructional content through a virtual avatar. Instructors can upload lecture texts and slides in PDF format, after which the system segments the content and presents synchronized speech and visual output. Assessment tasks are generated after each lecture using natural language processing methods.

To support learner interaction, the system provides real-time question handling through text or voice input. Learner queries are processed, relevant content is retrieved, and responses are delivered by the avatar.

2.2 System implementation

The IALS is implemented using a modular architecture that integrates natural language processing, speech synthesis, gesture animation, and facial expression alignment. This architecture supports the delivery of educational content in the Kazakh language through coordinated multimodal output. The system is implemented in Python 3.10, which enables consistent interaction and coordination among system modules.

The system uses Stanza for linguistic analysis (Kim et al., 2021; Köse and Saraçlar, 2021), the MMS text-to-speech model for voice synthesis (Kraxenberger et al., 2018), and Wav2Lip for lip synchronization (Lazaro et al., 2022). The source code and implementation details are publicly available at https://github.com/MMR000/Avatar_ENU, ensuring transparency and reproducibility.
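As an illustration of how these components can be wired together, the sketch below loads the Stanza Kazakh pipeline and the MMS (VITS-based) Kazakh checkpoint through the Hugging Face transformers interface. The checkpoint identifier facebook/mms-tts-kaz, the helper function, and its return values are illustrative assumptions; the actual orchestration code is in the repository linked above.

```python
# Minimal sketch: Kazakh dependency analysis (Stanza) + MMS/VITS speech synthesis.
import torch
import stanza
from transformers import VitsModel, AutoTokenizer

stanza.download("kk")  # Universal Dependencies Kazakh model
nlp = stanza.Pipeline("kk", processors="tokenize,pos,lemma,depparse")

# MMS TTS checkpoint for Kazakh (ISO 639-3 code "kaz"); some MMS checkpoints
# expect uroman-romanized input, so the tokenizer configuration should be checked.
tts_tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-kaz")
tts_model = VitsModel.from_pretrained("facebook/mms-tts-kaz")

def analyze_and_synthesize(text: str):
    """Parse Kazakh text and synthesize a waveform for it (illustrative helper)."""
    doc = nlp(text)  # tokenization, POS tags, dependency relations
    inputs = tts_tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = tts_model(**inputs).waveform  # shape: (1, num_samples)
    return doc, waveform, tts_model.config.sampling_rate
```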

2.3 Kazakh language processing model

Kazakh belongs to the Turkic language family and is characterized by an agglutinative morphological structure in which grammatical functions are expressed through suffix chains attached to word stems. As a result, grammatical relations such as subject, object, and tense are encoded morphologically rather than through rigid word order. For example, in the sentence “Ұстаз кітапты оқыды” (“The teacher read the book”), the object marker -ты indicates the grammatical role of the noun кітап (“book”). Because word order may vary while grammatical relations remain encoded morphologically, dependency-based parsing is particularly suitable for analyzing Kazakh sentence structure. The proposed system therefore relies on dependency analysis to identify syntactic roles that guide gesture selection, prosodic patterns, and avatar behavior.

At the initial stage, the IALS processes learner text input using a dependency-based model adapted to the grammatical and syntactic features of the Kazakh language (Li et al., 2024). The model performs tokenization, syntactic analysis, dependency parsing, and classification of sentence components.

Tokenization segments the input text into words and punctuation marks, defining sentence boundaries for further analysis. Syntactic analysis then identifies grammatical relations between sentence elements, including subjects, predicates, objects, and modifiers (Köse and Saraçlar, 2021). Dependency parsing establishes relational links between these elements, which is appropriate for Kazakh, where grammatical relations are expressed mainly through morphological markers rather than fixed word order.

Sentence-level analysis determines the functional roles of sentence components and supports contextual interpretation. Based on this analysis, sentence structures are classified into predefined patterns, such as “subject–predicate” and “subject–object–predicate” (Köse and Saraçlar, 2021). These patterns are used by subsequent modules to select appropriate gestures, intonation contours, and emotional expressions for the avatar.
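A minimal sketch of this classification step is shown below, assuming Universal Dependencies relation labels produced by the Stanza pipeline loaded earlier; the rules are illustrative rather than the system's exact rule set.

```python
# Derive a coarse sentence pattern from dependency relations (illustrative rules).
def classify_pattern(sentence) -> str:
    deprels = {word.deprel for word in sentence.words}
    has_subject = "nsubj" in deprels
    has_object = bool({"obj", "iobj"} & deprels)
    if has_subject and has_object:
        return "subject-object-predicate"
    if has_subject:
        return "subject-predicate"
    return "other"

doc = nlp("Ұстаз кітапты оқыды")  # "The teacher read the book"
for sent in doc.sentences:
    print(classify_pattern(sent))  # typically: subject-object-predicate
```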

2.4 Kazakh sentence gesture mapping approach

Sentence-based gesture mapping is considered an independent approach in interactive systems, including avatar-based learning environments. This approach is based on establishing correspondences between sentence types—such as declarative, interrogative, and exclamatory forms—and predefined movements and facial expressions. Such mapping enables the integration of non-verbal communication elements with linguistic content and sentence-level emotional characteristics (Lindberg and Jönsson, 2023; Liu and Prud'hommeaux, 2021).

Gesture design was carried out with consideration of Kazakh sentence structure and semantic distinctions. To ensure cultural relevance, linguistic specialists and native speakers of Kazakh were involved in the development process (Logeswari et al., 2024; Luo et al., 2021; Machneva et al., 2022; Mageira et al., 2022). Their input was used to select non-verbal behaviors associated with Kazakh communication practices, including specific hand movements and posture adjustments, which were incorporated into the system.

Non-verbal communication conventions were also considered during the design of the gesture mapping scheme. In Kazakh conversational practices, gestures often accompany emphasis, questioning, or explanation during spoken interaction. For example, open-palm gestures may indicate reference to a subject, while head tilts and eyebrow movements commonly accompany interrogative statements. These culturally recognizable communication patterns were incorporated into the gesture mapping model with input from native speakers and linguistic specialists to ensure that avatar behavior remains consistent with typical Kazakh communicative norms.

Emotional intensity is understood as the degree to which an emotion is experienced as strong or weak, and can be reliably measured using well-established psychological methods (Oliveira et al., 2023).

Facial expressions were aligned with the emotional context of utterances. A set of facial configurations, including eyebrow movement, changes in mouth shape, and gaze direction, was used to represent different emotional states. These configurations were associated with emotion-related lexical markers in Kazakh text, allowing the avatar's visual responses to remain consistent with the detected emotional context. A detailed gesture design for Kazakh sentence structures was presented in our earlier work (Albouy et al., 2022). In contrast, this paper abstracts these findings into a unified representation linking syntactic roles with multimodal cues. Rather than reproducing previously defined gesture designs, Table 1 presents a generalized syntactic–multimodal mapping model derived from that analysis.

Table 1

Sentence element | Gesture | Facial expression
Subject | Open palm or pointing gesture to indicate the referent | Neutral smile, raised eyebrows
Predicate | Hands brought to the center of the body to indicate action or relation | Slight nod or soft smile
Object | Hands positioned together at abdominal level to denote the object | Mild smile, head tilt
Attribute | Small bilateral hand movement emphasizing description | Raised eyebrows, neutral smile
Adverbial modifier | Directional or contextual hand gesture | Directed gaze indicating context

A generalized syntactic–multimodal mapping model derived from empirical gesture design.
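For illustration, the mapping in Table 1 can be represented as a lookup from syntactic role to gesture and facial-expression labels. The label names below are hypothetical placeholders for the pre-recorded animation clips described in Section 2.7, not identifiers from the actual system.

```python
# Table 1 as a lookup: syntactic role -> (gesture label, facial-expression label).
ROLE_TO_MULTIMODAL = {
    "subject":            ("open_palm_point",      "neutral_smile_raised_brows"),
    "predicate":          ("hands_to_center",      "slight_nod_soft_smile"),
    "object":             ("hands_at_abdomen",     "mild_smile_head_tilt"),
    "attribute":          ("small_bilateral_move", "raised_brows_neutral_smile"),
    "adverbial_modifier": ("directional_gesture",  "directed_gaze"),
}
```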

In addition, an analysis of Kazakh syntactic structures was conducted to determine how gestures and facial expressions are coordinated with verbal communication. The analysis incorporated common speech patterns, emotion-related expressions, and interactional features characteristic of face-to-face communication in Kazakh. Cultural conventions, frequently used gestures, and typical emotional responses were systematically integrated to formulate a gesture–facial expression mapping scheme, providing a structured basis for multimodal alignment.

In Kazakh sentences, the predicate plays a central role, and its form varies according to sentence type. In interrogative constructions, the predicate may include interrogative particles or pronouns. Verb forms and accompanying elements are used to signal interrogative intonation, which is reflected in both prosodic patterns and non-verbal behavior selection. Table 2 summarizes the gesture and facial expression models used for interrogative sentence predicates.

Table 2

Predicate | Gesture | Facial expressions
Model 1 | Raising one eyebrow to express skepticism or doubt | A tilted head gaze that conveys curiosity or uncertainty
Model 2 | Raising the right hand to the chin in a gesture of reflection; the left hand supports the right elbow | Raised eyebrows and wide eyes expressing surprise

Gesture and facial expression models for interrogative sentences.

Exclamatory sentences convey strong emotions such as surprise, delight, and anger. In exclamatory sentences, the predicate follows patterns found in declarative or interrogative forms and is accompanied by an exclamatory particle or intonation. Table 3 presents alternative predicate-level multimodal realization models for exclamatory sentences, enabling flexible gesture selection within a single sentence type.

Table 3

Predicate | Gesture | Facial expressions
Model 1 | Wide, smooth movements of the left hand, visually emphasizing the dynamic structure of the exclamatory sentence | A soft, barely noticeable smile reflecting the semantic connection between the subject and the predicate
Model 2 | Raising both hands with open palms and slowly lowering them, conveying a state of surprise and emphasizing the importance or unexpectedness of the action or situation being described | Raised eyebrows and wide-open eyes expressing surprise; tilting the head to the right as an element of the avatar's behavioral response

Gesture and facial expression models for exclamatory sentences.

A three-dimensional avatar was developed as part of the IALS and used as an instructional interface in gesture-supported learning scenarios. The avatar acts as a presentation agent that delivers educational content using gestures associated with the structural patterns of Kazakh sentences. Gesture-based representations of sentence types are used to support content interpretation and learner participation in an avatar-mediated setting.

The avatar model and animations were created using Blender, an open-source three-dimensional modeling environment. Gesture sequences were designed in accordance with the linguistic structure of Kazakh utterances and integrated into the IALS for synchronized multimodal presentation.

2.5 Sentiment processor

The IALS incorporates a lexical analysis module that detects emotion-related lexemes in lecture content using a fallback dictionary (Kamiya and Guo, 2024) and computes sentiment indicators at different levels, including sentences, paragraphs, and entire lectures. These indicators are applied to regulate the avatar's non-verbal behavior, such as gesture selection, facial expression control, and adjustment of prosodic features. The inclusion of emotional parameters allows the avatar to respond to both structural and affective aspects of the text during interaction. The sentiment analysis module produces affective parameters used in multimodal communication for instructional interaction.

The sentiment analysis pipeline consists of several sequential stages that transform the original lecture text into a set of affective parameters used by the avatar. The first step involves preprocessing: tokenization, normalization, and correct punctuation processing using the Kazakh dependency model, ensuring a consistent linguistic breakdown of the text.

After preprocessing, each token or lemma undergoes a lexical matching procedure, during which potentially emotionally charged lexemes are identified and checked against a fallback dictionary database (Kamiya and Guo, 2024). Each detected unit is assigned a sentiment score

s_i ∈ {−2, −1, 0, +1, +2},

while words that are not in the dictionary automatically receive a neutral value of 0, which ensures stable operation of the model even when emotional markers in the text are weakly expressed.

The extracted values are combined step by step, beginning at the sentence level. The mathematical formulation of the proposed model is presented in Equations 1–8. At this stage, the sentence sentiment is computed as the average of the emotion-related elements it contains. This approach supports representing emotional information across different levels of the text.
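Assuming a simple mean over the emotionally significant lexemes of a sentence, this sentence-level score takes the form

$$S_{\text{sent}} = \frac{1}{N}\sum_{i=1}^{N} s_i,$$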

where N—the number of emotionally significant lexemes in a given sentence.

To quantitatively assess the emotional and evaluative content of lectures, an integrated metric, the Integrated Sentiment Score (ISS), was developed, which is supplemented by additional indicators characterizing the emotional richness of the text and the degree of sentiment imbalance.

The integral indicator ISS is calculated using the following formula:
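Consistent with its definition as a mean over the N emotionally significant lexemes, the score can be written as

$$\mathrm{ISS} = \frac{1}{N}\sum_{i=1}^{N} s_i,$$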

where ISS — aggregated sentiment score, N — the number of emotionally significant lexemes in the text, s_i — sentiment value assigned to the i-th lexeme.

In addition to the mean values, two diagnostic meta-metrics are calculated: sentiment imbalance coefficient D, which characterizes the dominance of positive or negative polarity in the text, and the emotional saturation index E, reflecting the overall intensity of the affective content.

Sentiment imbalance coefficient (D):
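A form consistent with the definitions and the stated behavior (values near 0 for balanced text, values near 1 for dominance of one polarity) is

$$D = \frac{\lvert N_{+} - N_{-} \rvert}{N},$$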

where N+ — number of lexemes with positive sentiment,

N− — number of lexemes with negative sentiment,

N — total number of lexemes with non-zero emotional meanings.

D values close to 0 indicate a balanced text, while values approaching 1 reflect the dominance of one of the polarities.

The emotional intensity index E is determined by the following formula:
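A form consistent with this description, averaging absolute sentiment values over the emotionally significant lexemes, is

$$E = \frac{1}{N}\sum_{i=1}^{N} \lvert s_i \rvert.$$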

This metric reflects the degree of emotional intensity of the text, regardless of its polarity: the higher the value, the more emotionally charged the text is.

Since the calculated sentiment values represent a continuous range (from −2 to +2), a quantization procedure was introduced to map these values to discrete behavioral levels used by the avatar (Oliveira et al., 2023). Experimental evaluation indicated that a threshold value of 0.5 yields stable classification outcomes. This threshold is used to regulate the system's emotional response and to maintain a gradual transition between neutral and positive sentiment categories.

According to the applied quantization scheme:

when the aggregated sentiment value s is below 0.5, the sentence is assigned to the neutral class (sentiment score = 0);

when s is greater than or equal to 0.5, the sentence is assigned to the positive class (sentiment score = +1).

This quantization scheme defines distinct emotional categories and reduces variability in the avatar's intonation patterns. The design follows principles of cognitive perception, whereby emotional cues are typically perceived only after their intensity exceeds a perceptual threshold rather than in response to minor variations. Using such a threshold helps preserve the naturalness of prosody and prevents artificial “emotional leaps” in the avatar's expressive behavior.
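A minimal sketch of this thresholding rule is shown below; treating negative values symmetrically is an assumption, since only the neutral/positive boundary is specified above.

```python
def quantize_sentiment(s: float, threshold: float = 0.5) -> int:
    """Map an aggregated sentence sentiment value to a discrete class."""
    if s >= threshold:
        return +1   # positive class, as described above
    if s <= -threshold:
        return -1   # assumed symmetric handling of negative sentiment
    return 0        # neutral class
```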

The generated aggregated sentiment parameters are directly fed to the avatar's expressive behavior module. They determine the choice of gestures and facial expressions and influence the prosodic characteristics of the synthesized speech—pitch, amplitude, pause structure, and speech rate—which are controlled in the TTS subsystem. Thus, the integrated pipeline forms the affective foundation necessary for coherent, context-sensitive, and multimodal interaction.

2.6 Speech synthesis and intonation

The IALS includes a speech generation module based on the Facebook MMS text-to-speech model, which is used to produce Kazakh-language audio output. The module supports intonation, stress patterns, and prosodic features required for spoken interaction.

After dependency-based analysis of the input text, structured linguistic information, including tokens and syntactic relations, is transferred to the synthesis module. The text-to-speech component converts textual input into audio while preserving sentence structure and incorporating affective parameters generated by preceding modules.

Intonation control is performed at the sentence level and depends on communicative type. Interrogative constructions are characterized by rising pitch toward the end of the utterance, whereas declarative sentences follow a more uniform pitch pattern. Prosodic parameters, including pitch variation, rhythm, and stress placement, are adjusted to maintain coherence and alignment with sentence structure.

The speech synthesis module does not perform sentiment analysis independently; instead, it applies emotional parameters provided by the linguistic processing stage. This allows modulation of vocal output to reflect the emotional characteristics of the utterance. The generated speech is produced in WAV format with adjustable intonation, preserving the structure of the original Kazakh sentence.

Sentence-level intonation contours are formed using predefined frequency values and combined to generate variable pitch patterns. Figure 3 illustrates an example of pitch variation across sentence types, demonstrating the relationship between syntactic structure and intonation.

Figure 3

In addition, pause adjustments were developed based on practical experience to ensure a natural and well-perceived flow of speech. Pause durations were defined according to sentence type: 300 ms for declarative utterances, 450 ms for interrogative forms to allow additional processing time, and 400 ms for exclamatory statements.
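These sentence-type-specific pauses can be represented as a simple lookup used when assembling the output audio; a minimal sketch, assuming the sentence type is provided by the dependency-analysis stage:

```python
# Pause durations by sentence type, in milliseconds (values as described above).
# Sentence-type detection is assumed to come from the linguistic analysis stage;
# "declarative" is used as the fallback.
PAUSE_MS = {"declarative": 300, "interrogative": 450, "exclamatory": 400}

def pause_after(sentence_type: str) -> int:
    return PAUSE_MS.get(sentence_type, PAUSE_MS["declarative"])
```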

The combination of frequency-based intonation modeling and experimentally derived pause rules enables speech output that reflects sentence structure, emphasis, and rhythmic patterns. This configuration supports clarity in spoken interaction within the avatar-based system.

A speech synthesis and intonation modeling module was incorporated into the system using the VITS variational inference framework (Kraxenberger et al., 2018), which provides multilingual text-to-speech support, including Kazakh.

This model was chosen for its ability to produce natural-sounding speech while preserving pronunciation accuracy and prosodic consistency. The speech synthesis pipeline comprises multiple stages. Initially, the input text is processed using a Kazakh dependency-based linguistic analyzer, which performs tokenization, normalization, and punctuation adjustment (Oh Kruzic et al., 2020). This preprocessing establishes a structured linguistic representation required for synchronizing speech generation with gesture synthesis. Subsequent sentence-level analysis involves sentence segmentation, POS tagging, and dependency parsing to identify grammatical roles (e.g., subject, predicate, and object), along with semantic annotation of elements that influence intonational patterns and gestural dynamics. For example, in the sentence “Ұстаз әдемілеп оқыды” (The teacher read it beautifully), Ұстаз functions as the subject, әдемілеп as an adverbial modifier, and оқыды as the predicate. This structured linguistic information is then used to construct an intonation contour, enabling natural and contextually appropriate speech delivery.


Intonation plays a crucial role in creating a lively and expressive sound. In Amangeldy et al. (2023), intonation was modeled by assigning frequency values (Hz) to each part of the sentence, which ensures precise pitch modulation and prosody alignment with the structure of the Kazakh utterance.

The modeling process involves parsing the sentence into its components, assigning a specific frequency value to each, and generating the final intonation profile. Once the analysis is complete, the TTS model generates a preliminary WAV file that preserves the sentence structure but requires subsequent pitch correction to enhance naturalness.

To precisely control intonation, the fundamental frequency (F0) is extracted from the generated TTS audio. Among the many available algorithms, the YIN algorithm (Prajwal et al., 2020) was chosen due to its high accuracy. Unlike methods based on simple autocorrelation, YIN reduces pitch detection errors by using a modified function with parabolic interpolation. Furthermore, the algorithm is robust to noise and demonstrates high reliability when working with Kazakh speech. Compared to deep neural models for pitch extraction (Pratap et al., 2023), YIN provides an optimal balance between accuracy and computational efficiency. Some algorithms (Reisenzein and Junge, 2024) are better suited for music, while YIN is optimized for human speech, making it most suitable for intonation correction in the IALS.

The YIN function calculates the periodicity of a speech signal using the following expression:
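In its standard form, consistent with the variables defined below, the YIN difference function is

$$d(\tau) = \sum_{t=1}^{N} \big(s(t) - s(t+\tau)\big)^{2},$$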

where d(τ) measures the degree of change of the signal at different time shifts τ,

s(t) — the speech signal,

N — the number of samples in the analysis window.

The minimum value of the function d(τ) corresponds to the most stable periodicity of the signal and, therefore, determines the fundamental frequency F_TTS of the synthesized signal.

After F_TTS extraction, it is corrected using the Pitch Synchronous Overlap-Add (PSOLA) method (Ren et al., 2019), which allows the pitch of each word to be adapted to pre-defined frequency ranges.

The adjustment formula is as follows:
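A form consistent with the stated purpose of constraining each word's pitch to its admissible range is

$$F_{\text{adjusted}} = \min\!\big(\max(F_{\mathrm{TTS}},\, F_{\min}),\, F_{\max}\big),$$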

where F_TTS — original frequency generated by the TTS model,

F_max and F_min — acceptable pitch range for each word.

The purpose of this formula is to keep the pitch within the range of naturally perceived values. Constrained in this way, the pitch of each word is adjusted so that it sounds more “human,” smooth, and natural.

Sentiment-based local modifiers adjust the base prosodic pattern by assigning intonation characteristics according to a sentiment score ranging from −2 to +2. Emotional assessment is conducted at the sentence level, following common practices in intonation analysis. Emotional meaning is conveyed through overall intonation patterns rather than through separate words, which supports consistent speech synthesis.

Sentiment analysis is performed by a lexical-semantic module that assigns sentiment values using a specialized dictionary. Sentence-level sentiment aggregation is calculated as a simple average of all detected nonzero values (Shakuf et al., 2019):
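Assuming the simple average described, the aggregation takes the form

$$S_{\text{sent}} = \frac{1}{L}\sum_{i=1}^{L} s_i,$$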

where s_i — sentiment value of the i-th term, L — number of terms with non-zero sentiment in a sentence.

The use of a simple average at the sentence level is motivated by three key factors:

  • Mathematical neutrality;

  • Cognitive validity; and

  • Technical robustness.

The arithmetic mean does not introduce additional distortions due to rare but strongly expressed sentiment values, which is especially important for preserving the natural shape of the intonation contour and preventing sharp jumps in speech parameters (Shumanov and Johnson, 2021).

From the listener's perspective, emotional coloring is perceived holistically, as an overall intonation pattern, rather than as a response to the “strongest word.” The method is also stable in cases where the sentence contains no emotionally charged units (in which case Ssent = 0). The lack of need for additional weighting coefficients or thresholds simplifies the method's integration into the prosody synthesis module of the avatar's expressive behavior system (Sudhan et al., 2024).

The emotional component directly influences the following speech parameters:

  • Volume;

  • Intonation (F0);

  • Duration of pauses;

  • Speech rate.

According to research on emotional prosody, these acoustic parameters (fundamental frequency, intensity, tempo, and pauses) are the primary channels for conveying emotion in spoken language. Experimental data show that emotions objectively alter acoustic characteristics: joy and fear are accompanied by increased frequency and intensity, while sadness is associated with decreased loudness and slower tempo.

Analysis of intonation patterns confirms these observations: increasing the average frequency and variability of F0 enhances the perceived emotionality, expressiveness, and clarity of speech (Šturm and Volín, 2023).

Taking into account these results and empirical methods of emotional speech synthesis (for example, regression-based modification of the pitch contour and of phrase intensity and duration) (Lazaro et al., 2022), the following parameter correction map is proposed:

1. Volume

Changes within ±6 dB are clearly distinguishable but do not lead to artificial emotional exaggeration.

  • A positive sentiment score increases volume;

  • a negative sentiment score decreases volume, which is consistent with the prosodic correlates of emotional speech (Triantafyllopoulos et al., 2023).

2. Fundamental frequency (F0)

Variations of ±50 Hz on keywords convey emotional nuances while maintaining naturalness.

  • A positive sentiment score raises F0;

  • a negative sentiment score lowers F0.

3. Pauses

The basic interphrase pause is ≈350 ms.

  • A positive sentiment score reduces pauses by 20–30%, creating a more dynamic sound;

  • a negative sentiment score increases them by 20–40%, enhancing the effect of caution or sadness (Ukenova and Bekmanova, 2023).

4. Speech rate

The base speaking rate during lectures is set to 3.5 words per second. A positive sentiment value increases the speech rate by 5–10%, whereas a negative sentiment value reduces it by 5–15%. These adjustments correspond to psycholinguistic observations on the influence of emotion on speech tempo (Hara et al., 2021).

The specific parameter values assigned to each sentiment category are presented in Table 4.

Table 4

Sentiment score | Volume level (from baseline) | F0 shift (Hz) | Pause duration change (%) | Speech rate (words/s)
−2 | −6 dB | −50 Hz | +40% | 2.975 (−15%)
−1 | −3 dB | −25 Hz | +20% | 3.325 (−5%)
0 | 0 dB | 0 Hz | 0% | 3.5 (0%)
+1 | +3 dB | +25 Hz | −20% | 3.675 (+5%)
+2 | +6 dB | +50 Hz | −30% | 3.85 (+10%)

Local modifiers according to sentiment scores.

This parameter setup allows the IALS to modify intonation in line with emotional context while preserving timing consistency, reducing sudden changes that could influence understanding of the content.
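For illustration, Table 4 can be encoded as a lookup consumed by the prosody controller; the field names below are assumptions rather than the system's actual configuration keys.

```python
# Table 4 as a lookup: sentiment class -> prosody modifiers.
# Speech rate is given in words per second; pause change in percent.
PROSODY_MODIFIERS = {
    -2: {"volume_db": -6, "f0_shift_hz": -50, "pause_change_pct": 40,  "rate_wps": 2.975},
    -1: {"volume_db": -3, "f0_shift_hz": -25, "pause_change_pct": 20,  "rate_wps": 3.325},
     0: {"volume_db":  0, "f0_shift_hz":   0, "pause_change_pct": 0,   "rate_wps": 3.5},
     1: {"volume_db":  3, "f0_shift_hz":  25, "pause_change_pct": -20, "rate_wps": 3.675},
     2: {"volume_db":  6, "f0_shift_hz":  50, "pause_change_pct": -30, "rate_wps": 3.85},
}
```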

2.7 Avatar design and animation

The avatar functions as a visual component of the interactive learning environment and accompanies spoken interaction. It provides visual support through coordinated gestures, facial expressions, and alignment with speech output.

Two three-dimensional avatar models were developed (Figure 4), representing male and female characters. This design supports compatibility with different voice profiles and accounts for user-related preferences within linguistic and cultural contexts. Both models were created using Blender 3.6 and include facial rigs and articulated body structures, enabling facial movements and predefined gesture sets.

Figure 4

The animation process follows a structured workflow in which text processed by the natural language module generates gesture and emotion labels. These labels are mapped to predefined animation sequences, while a Python-based controller synchronizes animation playback with synthesized speech.

The avatars are capable of representing different emotional states through facial configurations and gestures, allowing visual responses to remain consistent with the grammatical structure and emotional content of the utterances.

To support real-time operation and limit computational load, the IALS uses pre-recorded MP4 animation sequences instead of dynamically generated gestures. These animations correspond to common syntactic structures and emotion-related patterns and are activated based on the processed text and prosodic features. Figure 3 illustrates an example of sentence processing that triggers an animation reflecting grammatical and emotional properties.

Gesture and speech coordination is achieved through time markers that align facial movements and body gestures with phoneme timing and prosodic parameters. Animation sequences are selected according to syntactic and emotional characteristics of the utterance and adjusted based on speech rate, pause duration, and intonation patterns.

The use of pre-recorded animation sequences enables consistent alignment between visual output, linguistic structure, and affective information while maintaining controlled processing latency. Future work may extend the system with dynamic facial expression control methods, such as blendshape-based or parametric animation, to support a wider range of interaction scenarios.

2.7.1 Lip synchronization

Lip synchronization is used to align the avatar's mouth movements with synthesized speech in the IALS. For this purpose, the Wav2Lip model is applied to generate lip movements directly from the audio signal (Lazaro et al., 2022). The model takes synthesized speech in WAV format and produces frame-level articulation that follows the timing of the spoken signal.

In the IALS pipeline, speech generated by the text-to-speech module is passed to Wav2Lip, which analyzes the audio signal and produces frame-level mouth articulation aligned with the phoneme timing of the speech. The generated articulation parameters are then mapped to the facial rig of the three-dimensional avatar in Blender using predefined blendshape configurations. While mouth articulation is derived dynamically from the synthesized speech, body gestures and facial animation sequences are implemented using pre-recorded animation clips that correspond to common syntactic and emotional patterns. These animation sequences are triggered by the linguistic analysis module and synchronized with the generated speech output.
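The sketch below shows a typical command-line invocation of the public Wav2Lip inference script on the synthesized speech; the paths are placeholders, and this reflects the model's standard usage rather than the project's exact Blender-side integration.

```python
# Run Wav2Lip's inference script on a rendered avatar clip and synthesized audio.
import subprocess

subprocess.run(
    [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained Wav2Lip weights
        "--face", "avatar_face.mp4",                         # rendered avatar clip (placeholder)
        "--audio", "lecture_segment.wav",                    # output of the TTS module (placeholder)
        "--outfile", "avatar_lipsynced.mp4",
    ],
    check=True,
)
```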

Synchronizing speech with visible articulation supports audiovisual coherence during interaction. This is relevant for educational scenarios, where learners rely on both auditory and visual cues for speech perception, including in the Kazakh language. The integration of lip synchronization allows the avatar to present instructional content with coordinated speech and facial movement.

2.8 Evaluation of system performance

Speech synthesis plays a central role in IALSs, as it determines response timing and speech clarity during interaction. To examine whether the VITS model is suitable for use in the IALS, its performance was compared with the Tacotron model (Ukenova et al., 2025a). Both models were trained using the ESPnet framework on the computing facilities of Nazarbayev University (Astana, Kazakhstan).

The evaluation was carried out using four standard indicators: latency, real-time factor (RTF), throughput, and processor load on CPU and GPU (Ukenova et al., 2025b; Ukenova, 2025; Voinov et al., 2021; Yan et al., 2020). Latency was defined as the time interval between text input and audio output, which is relevant for interactive scenarios where delayed responses affect dialogue continuity (Yessenbayev et al., 2020; Yuan and Gao, 2024). RTF was used to estimate real-time capability by comparing synthesis time with audio duration, with values below 1.0 indicating faster-than-playback generation. Throughput characterized the amount of text processed per second, while CPU and GPU usage were measured to assess deployment feasibility on limited hardware.

All measurements were implemented in Python. Model execution was handled using PyTorch, system resource usage was monitored with psutil and pynvml, and statistical processing was performed using numpy. Audio input and output were processed with scipy.io.wavfile. Timing measurements relied on standard Python timing functions, and batch evaluation was supported through automated file handling and progress monitoring tools. This setup provided consistent and repeatable performance measurements under conditions representative of educational use.
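A condensed sketch of such a measurement loop is given below; the synthesize callable is assumed to behave like the helper sketched in Section 2.2 (returning a waveform and its sampling rate), and test_sentences is a placeholder corpus.

```python
# Measure latency, real-time factor (RTF), and throughput for a TTS callable.
import time
import numpy as np

def benchmark(synthesize, test_sentences):
    latencies_ms, rtfs, chars_per_sec = [], [], []
    for text in test_sentences:
        t0 = time.perf_counter()
        _, waveform, sr = synthesize(text)
        elapsed = time.perf_counter() - t0        # synthesis time, seconds
        audio_seconds = waveform.shape[-1] / sr   # duration of generated audio
        latencies_ms.append(elapsed * 1000.0)
        rtfs.append(elapsed / audio_seconds)      # < 1.0 means faster than playback
        chars_per_sec.append(len(text) / elapsed)
    return {
        "latency_ms": (np.mean(latencies_ms), np.std(latencies_ms)),
        "rtf": (np.mean(rtfs), np.std(rtfs)),
        "throughput_chars_per_sec": (np.mean(chars_per_sec), np.std(chars_per_sec)),
    }
```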

2.9 Comparison with existing systems and component analysis

Avatar-based learning systems are often developed as general-purpose solutions and are not adjusted to the grammatical structure of specific languages. In such systems, speech generation and visual behavior are usually handled as separate processes. As a result, gestures and facial expressions are not consistently linked to sentence composition, and emotional behavior is controlled without direct reference to linguistic features of the input text.

Table 5 shows the differences between generic avatar-based systems and the IALS. Generic solutions typically rely on language-independent processing and predefined animation patterns. In contrast, the IALS analyzes Kazakh text at the syntactic and morphological levels and uses this information to guide prosody, gesture selection, and facial expression control. Non-verbal behavior is determined by sentence roles rather than by fixed animation templates. This approach supports coordinated verbal and visual output during instructional interaction.

Table 5

Feature | Generic avatar-based systems | IALS
Language adaptation | Language-independent processing | Kazakh-specific linguistic modeling
Morphological analysis | Not explicitly supported | Supported through dependency-based analysis
Prosody control | Fixed or template-based | Adjusted based on sentence structure
Sentiment integration | Limited or absent | Sentence-level sentiment processing
Gesture generation | Predefined, content-independent | Mapped to syntactic sentence roles
Facial expression control | Emotion presets | Controlled by linguistic and affective parameters
Lip synchronization | Optional or external | Integrated using audio-driven model
Educational interaction | Scripted delivery | Lecture narration and learner interaction
Support for low-resource languages | Limited | Designed for Kazakh language

Comparison of the IALS with existing avatar-based learning solutions.

A component-level comparison was also conducted to assess the contribution of individual modules (Table 6). This comparison represents a conceptual ablation analysis of system components, illustrating how the inclusion of prosody control, gesture mapping, and sentiment processing contributes to the multimodal capabilities of the proposed system. The qualitative categories used in the table, such as “limited,” “moderate,” and “supported,” indicate the degree of multimodal instructional functionality provided by each configuration, rather than the results obtained from user-based experimental evaluation. Configurations based only on speech synthesis provide limited support for instructional use. Systems that include either prosody control or gesture mapping allow partial multimodal alignment. The complete configuration, which combines linguistic analysis, prosody modeling, gesture mapping, and sentiment processing, enables consistent avatar behavior suitable for educational scenarios. The results indicate that system performance depends on the interaction of multiple components rather than on isolated functions.

Table 6

Configuration | Prosody control | Gesture support | Sentiment processing | Educational suitability
Text-only TTS | − | − | − | Limited
TTS + gestures | − | + | − | Moderate
TTS + prosody | + | − | − | Moderate
Full system (proposed) | + | + | + | Supported

Ablation comparison of system components.

3 Results

The evaluation includes several quantitative metrics to compare the efficiency and computational costs of the two models. The main performance metrics are presented in Table 7.

Table 7

Metric | VITS (49,250 samples) | Tacotron (48 samples)
Model size | 138.42 MB | 101.94 MB
RTF | 0.004 ± 0.003 | 0.051 ± 0.005
Latency | 25.51 ± 9.73 ms | 242.31 ± 96.30 ms
Throughput (chars/sec) | 4274.0 ± 3003.7 | 310.6 ± 40.0
Peak GPU memory usage | 150.68 MB | 904.22 MB
Peak CPU memory usage | 1751.03 MB | 1501.55 MB
CPU utilization | 1.1% | 1.3%

Evaluation comparison.

In terms of efficiency and processing speed, the VITS model shows lower latency than Tacotron. The average latency of VITS is 25.51 ms, compared to 242.31 ms for Tacotron, which supports its applicability in real-time systems. In addition, VITS processes text at a higher rate, achieving 4,274 characters per second on average, whereas Tacotron processes 310.6 characters per second.

A comparable pattern is observed in computational resource usage. While VITS requires a higher allocation of CPU memory, it operates with substantially lower GPU memory consumption (150.68 MB) compared to Tacotron (904.22 MB), which supports its deployment on hardware-constrained systems.

The performance of VITS is further reflected in the real-time factor (RTF): a value of 0.004 corresponds to rapid speech generation, whereas Tacotron reaches 0.051. These metrics indicate that VITS is suitable for integration into the IALS, which requires controlled latency and efficient resource utilization.

Platforms such as Synthesia and HeyGen provide tools for automated video generation and interactive media, but their design focus is not centered on educational use. In contrast, the proposed IALS was developed specifically for educational purposes and to support language acquisition. A systematic comparison highlights key differences between them (see Figure 5).

Figure 5

Commercial platforms such as Synthesia and HeyGen are mainly oriented toward content creation for business communication, marketing, and customer engagement. Their system design focuses on scalable audiovisual generation; however, these platforms do not include mechanisms for pedagogical adaptation. In contrast, IALS is designed as an educational system that integrates multimodal synchronization and dialogic interaction to support learner-oriented instruction.

Commercial avatar tools typically prioritize visual presentation, while principles related to multimedia learning and cognitive processing are not explicitly addressed (Oliveira et al., 2023). IALS follows an alternative design approach that integrates speech, gestures, and visual cues in accordance with dual-channel information processing and embodied cognition principles, supporting structured presentation of instructional content.

Unlike platforms that rely on scripted or partially automated content with limited interaction, IALS enables real-time dialogue that accounts for both linguistic structure and emotional characteristics of learner input. This allows responses to be adjusted based on the content and affective features of the interaction.

Although commercial platforms report support for multiple languages, their processing of less-represented languages, including Kazakh, remains limited. IALS addresses this issue through language-specific phonetic processing and gesture selection aligned with cultural communication patterns.

From a technical perspective, many existing solutions depend on cloud-based infrastructures and require substantial computational resources, which may restrict their use in educational environments with limited connectivity or budgets. IALS supports hybrid deployment, including local installation, to improve accessibility. In addition, while commercial systems are designed for general-purpose content generation, IALS is restricted to educational use, applying controlled data management and focusing on instructional objectives.

Overall, the comparison indicates that while platforms such as Synthesia and HeyGen provide general-purpose audiovisual generation, they do not address pedagogical adaptation, language-specific processing, or real-time educational interaction. IALS is positioned as a learning-oriented system designed to support instructional use in linguistically underrepresented contexts, including the Kazakh language.

4 Discussion

A general comparative assessment shows that although platforms such as Synthesia and HeyGen provide universal audiovisual generation, they are not focused on pedagogical adaptation, language-specific processing, or interactive educational interaction in real time. In contrast, IALS is positioned as a learning system specifically designed to support the learning process in linguistically underrepresented contexts, including the Kazakh language. The IALS demonstrates the feasibility of adapting intelligent avatar-based interaction to the linguistic characteristics of the Kazakh language. Kazakh exhibits complex morphology and flexible syntactic structures, which pose challenges for automated text processing. The use of dependency-based syntactic analysis enables the system to identify sentence structure and grammatical relations more consistently, supporting accurate interpretation of user input and stable dialogue generation. This approach contributes to reliable handling of syntactically complex constructions that are common in Kazakh-language communication.

The integration of non-verbal behavior based on textual characteristics further supports coherent interaction. Facial expressions and gestures are selected in accordance with sentence-level features, allowing the avatar's responses to remain aligned with the structure and intent of the utterance. Such alignment is particularly relevant in educational contexts, where consistency between verbal explanations and non-verbal cues supports learner comprehension. Adapting gesture and facial expression selection to syntactic patterns is important for Kazakh, as sentence structure plays a key role in conveying meaning.

The sentiment processing component operates at the sentence level and generates structured affective parameters that guide non-verbal behavior. Rather than relying on binary sentiment labels, the IALS represents emotional information in a graded form, which is used to regulate gesture selection, facial expression changes, and prosodic modulation. A lexicon-based fallback mechanism is applied in cases involving specialized terminology or limited lexical coverage. Additionally, emotion quantization is used to suppress low-intensity affective signals, reducing abrupt changes in intonation and visual behavior and supporting perceptual consistency.

Gesture–speech synchronization is incorporated as part of the interaction design. Coordinated alignment of articulation, facial movement, and gestures with synthesized speech contributes to a unified multimodal presentation. From the perspective of information processing, such integration supports balanced visual and auditory input, which is relevant for maintaining attention and supporting retention during learning activities.

From a technical standpoint, the IALS employs neural text-to-speech synthesis for Kazakh with controllable pitch and pause placement derived from sentence structure. Avatar models are created and animated in a 3D environment that allows detailed adjustment of facial and body movements. Automated lip synchronization ensures alignment between synthesized speech and mouth movements, supporting audiovisual consistency during interaction.

The IALS also supports automated instructional delivery. Uploaded lecture texts and presentation materials are segmented into structured units, which are presented sequentially by the avatar. Assessment tasks are generated after each instructional unit, and learners can submit questions via text or voice. The IALS identifies relevant content fragments and generates responses that are delivered through the avatar, enabling continuous interaction within the learning process.

In comparison with existing commercial avatar platforms that prioritize video generation and scalability, the proposed system is oriented toward instructional use. It emphasizes alignment with learning content, linguistic specificity, and interactive feedback. This design is particularly applicable to languages such as Kazakh, which remain underrepresented in general-purpose AI platforms, and supports the development of adaptive educational environments tailored to specific linguistic and pedagogical contexts.

Although the system demonstrates effective multimodal integration and real-time speech synthesis performance, this study focuses primarily on the system architecture and computational evaluation. Future research will involve user-centered studies with Kazakh speakers and language learners to assess perceived naturalness, engagement, and educational effectiveness through questionnaires and controlled experiments.

5 Conclusion

This study presents an intelligent interactive learning system based on an avatar framework and adapted to the linguistic characteristics of the Kazakh language. The integration of syntactic analysis, speech synthesis, lip synchronization, and gesture-based non-verbal communication supports the formation of a multimodal instructional environment focused on comprehension and learner participation. The approach emphasizes the importance of combining linguistic and paralinguistic elements—intonation, gestures, facial expressions—with sentence structure to ensure natural digital communication.

One of the key contributions of the work is the implementation of a sentiment processor and a model of emotionally conditioned prosody, which allows the avatar to adapt the pitch, volume, pause structure and speech rate in accordance with the emotional coloring of the text. The results confirm the feasibility of using structure-based intonation and predefined gestures to model human-like behavior in avatar learning systems. Furthermore, a scalable methodology applicable to the development of similar systems for other morphologically rich languages is demonstrated.

The IALS combines language modules with an intelligent learning component that supports automatic lecture narration, slide alignment, and test item generation using natural language processing techniques. These functions support interactive instruction and provide feedback during the learning process. The ability to respond to learner questions in real time, using voice or text, enables individualized interaction.

The study also outlines two limitations. First, the IALS uses pre-recorded gesture animations rather than dynamically generated movements, which reduces flexibility in unscripted interactions.

Second, emotional tone detection and prosodic control rely on predefined rules rather than data-based models or real-time analysis. This limitation influences the avatar's behavior in complex interaction cases.

Future work will address these limitations by developing dynamic emotion recognition methods, training prosody models using data, and enabling real-time response generation. These areas are expected to improve the interactivity, realism, and pedagogical effectiveness of avatar learning systems, particularly for languages underrepresented in commercial AI solutions.

Statements

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding authors.

Author contributions

AU: Conceptualization, Formal analysis, Writing – original draft, Writing – review & editing. GB: Data curation, Writing – original draft, Writing – review & editing, Validation. BY: Supervision, Writing – original draft, Writing – review & editing, Funding acquisition. SB: Methodology, Visualization, Writing – original draft, Writing – review & editing. MA: Investigation, Resources, Writing – original draft, Writing – review & editing. AN: Project administration, Resources, Writing – original draft, Writing – review & editing. ZL: Formal analysis, Resources, Writing – original draft, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. AP23489504).

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.


Keywords

avatar-based learning, gesture modeling, gesture recognition, Kazakh language, multimodal AI, natural language processing, prosody control, sentiment analysis

Citation

Ukenova A, Bekmanova G, Yergesh B, Ben Yahia S, Altaibek M, Nazyrova A and Lamasheva Z (2026) Multimodal AI in education: an avatar-based intelligent learning system for the Kazakh language. Front. Comput. Sci. 8:1780150. doi: 10.3389/fcomp.2026.1780150

Received

03 January 2026

Revised

11 March 2026

Accepted

18 March 2026

Published

08 April 2026

Volume

8 - 2026

Edited by

Dewi Khairani, Syarif Hidayatullah State Islamic University Jakarta, Indonesia

Reviewed by

Hadipurnawan Satria, Sriwijaya University, Indonesia

Hirokazu Yokokawa, Kobe University, Japan

Copyright

*Correspondence: Banu Yergesh, ; Aizhan Nazyrova, ; Zhanar Lamasheva,
