ORIGINAL RESEARCH article

Front. Comput. Sci., 12 March 2026

Sec. Human-Media Interaction

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1799323

Next-Gen orientation: supporting international students with generative AI NPCs in VR

  • Chair of Software Engineering, Technical University of Munich, Heilbronn, Germany

Abstract

Educational Virtual Reality (VR) provides immersive learning environments, yet most contemporary applications rely on pre-scripted Non-Player Characters (NPCs) that offer limited personalization and rigid interaction paths. This study presents the technical implementation and evaluation of TUMSphere, a VR orientation platform designed to facilitate the academic and cultural transition of international students. We propose a modular architecture that integrates Large Language Models (LLMs) with Unreal Engine via the Conversational AI (Convai) platform, enabling embodied NPCs to provide real-time speech recognition, context-aware dialogue, and autonomous spatial navigation. To validate this approach, a mixed-methods user study (N = 24) was conducted with international students to assess system latency, usability, and pedagogical efficacy. Results demonstrate a high System Usability Scale (SUS) score of 76.4 (SD = 12.5) and robust task completion rates, reaching 100% for spatial navigation and 96% for information retrieval. While technical benchmarking revealed an average end-to-end latency of 2.90s for complex, retrieval-heavy queries, qualitative findings indicate that users find this “latency-presence trade-off” acceptable in exchange for the pedagogical benefits. Crucially, participants reported a significant reduction in social anxiety when practicing language and administrative queries with AI agents compared to human interlocutors. These findings suggest that embodied, generative AI NPCs can serve as a scalable, low-pressure “social sandbox” that effectively redefines student support systems and orientation strategies in higher education.

1 Introduction

In recent years, educational Virtual Reality (VR) gaming has emerged as a powerful form of learning that immerses students inside interactive virtual worlds where they can explore, experiment, test theories, and practice skills (Lin et al., 2024; Damianova and Berrezueta-Guzman, 2025). Educational VR games can often be categorized into several types, including virtual laboratories where students conduct scientific experiments safely (Truchly et al., 2018; Konecki et al., 2023), historical simulations that allow learners to walk through ancient civilizations, language learning environments where students practice conversations in virtual cafes or shops (Hua and Wang, 2023; Peixoto et al., 2021), and professional training simulations for fields such as medicine or engineering (Du et al., 2025). VR has also gained relevance for remote and distance education, particularly following the widespread shift to online learning (Levidze, 2024). The growth in educational VR has been remarkable in recent years, with the market expected to expand by USD 81.13 billion from 2025 to 2030 (Mordor Intelligence, 2024).

Most educational VR games that exist today rely on pre-designed scenarios with scripted interactions (Lin and Cheng, 2024). Users might need to click on objects to read information, follow predetermined quest paths, or interact with simple menu-based systems. Some games include basic non-player characters (NPCs) that guide users, usually with pre-recorded voice lines or text bubbles, similar to traditional video games. For example, a student exploring a virtual chemistry lab might click equipment to see explanations, follow step-by-step instructions displayed on-screen, or receive feedback via pop-up messages when completing tasks (Viitaharju et al., 2023). While these approaches can be engaging and educational, they create rigid experiences in which students can access only the information that developers specifically programmed into the system, offering neither personalization nor real-time interaction.

Meanwhile, large language models (LLMs) represent a significant advancement in artificial intelligence's (AI) ability to understand and generate human language (Minaee et al., 2024; Wang et al., 2025). These models, such as GPT (Leon, 2025), are trained on extensive text datasets and can engage in conversations, answer questions across various domains, and provide contextual explanations. In educational settings, LLMs have demonstrated potential as adaptive tutoring systems, capable of responding to individual student queries and tailoring explanations to learner needs (Chen et al., 2023; Wen et al., 2024; Dong et al., 2024). Unlike traditional rule-based educational software that relies on pre-programmed decision trees, LLMs can process natural language input, maintain conversation context, and generate relevant responses in real-time (Tracy and Spantidi, 2025). Recent research has explored the applications of LLMs, such as ChatGPT, in classrooms, where they serve as teaching assistants, helping students understand complex topics, providing feedback on assignments, and supporting personalized learning across various subjects (Hussein et al., 2024; Chen et al., 2024).

This is where LLMs and VR environments intersect: integrating intelligent NPCs into educational VR environments plays a critical role in creating more dynamic and personalized learning experiences (Gonzales et al., 2025; Wan et al., 2024). Intelligent NPCs powered by LLMs that respond to user needs in real time are highly valuable for virtual training simulations, educational platforms, and even therapy programs (Guevarra et al., 2025). These virtual tutors can introduce students to complex problems while adapting their style to meet the students' individual needs (Barmpari et al., 2026). In educational VR, intelligent NPCs simulate real-life interactions with high realism, understanding and responding to natural language, exhibiting human-like behavior, and adapting their actions based on student input (Özkaya et al., 2025). Furthermore, AI-enabled NPCs can address concerns about learner engagement, anxiety, and cognitive workload during remote education (Vallance, 2023).

This paper presents the implementation and use-case study of AI-powered NPCs in TUMSphere, a novel educational virtual reality platform for first-semester Information Engineering students. TUMSphere is a virtual campus environment that combines educational mini-games, coding puzzles, and relaxation activities designed to help students develop problem-solving skills while providing an engaging campus experience. It specifically focuses on the integration and implementation of intelligent NPCs powered by large language models (LLMs) within this platform. These conversational NPCs enable students to interact naturally via voice, providing personalized guidance on university services, answering questions, and supporting international students' cultural adaptation. The implementation leverages Unreal Engine and the specialized Conversational AI (Convai) plugin to create responsive, context-aware virtual characters that address students' unique information needs during their initial university experience.

2 Related work

To bridge the gap between theoretical AI models and practical immersive education, this section situates our research within the broader landscape of intelligent virtual environments.

2.1 LLM-powered NPCs in educational virtual reality

Recent research has explored various LLM implementations in VR settings, with a particular focus on conversational agents that provide personalized guidance, support natural language interactions, and adapt to learners' needs in real time.

McKern et al. present an innovative approach to creating AI-powered digital assistants in VR environments by coupling two distinct AI systems: a Language Model (LM) for conversational responses and a Movement Model (MM) for generating context-relevant body movements. Their system employs a dual-channel LM architecture, with one channel responding to user queries while the other provides text input to the MM, enabling the generation of coordinated movements that complement verbal communication. The authors implemented a proof-of-concept prototype using GPT-3.5-turbo as the LM and a text-to-motion model based on the HumanML3D dataset, deployed in Unreal Engine 5.5 for visualization. In a preliminary user study with nine participants, the AI-generated NPCs were rated significantly higher in understandability than human-recorded movements captured with their Avatar Replay System. The study demonstrated that AI-generated movements were more frequently and accurately identified by users, suggesting potential advantages over motion-captured animations in terms of clarity and creation efficiency (McKern et al., 2024).

Expanding on this idea of social interaction in VR, Liu et al. present ClassMeta, a GPT-4-powered virtual classmate designed to promote classroom participation in VR through peer influence behaviors, including note-taking, question-asking/answering, discussion-driving, and discipline-reminding. The system employs a dual-context architecture, where agents digest lesson materials as background context while capturing real-time classroom conversations. This enables them to generate contextually coherent responses through both predefined action signals and dynamic dialog responses via ElevenLabs text-to-speech. In a between-group study with 24 participants comparing ClassMeta to a baseline VR classroom, agents significantly improved student note quality (p < 0.001), reduced the need for instructor intervention, and achieved significant learning gains in four of six key competencies, including logical thinking skills (Liu et al., 2024).

Pan et al. introduced ELLMA-T, a GPT-4-powered embodied agent in VRChat designed for situated English language learning. By integrating level assessment, dynamic role-play scenarios, and personalized feedback, the system successfully reduced speaking anxiety among international graduate students who felt less judged than in human-to-human interactions. Despite its creative content generation, the study highlighted significant technical hurdles common to LLM-VR integration, specifically high response latency, rigid turn-taking, and a lack of non-verbal emotional nuance. Ultimately, the authors argue that while ELLMA-T proves the potential for low-anxiety practice, future development must prioritize advanced memory architectures and more sophisticated non-verbal cues to support long-term, adaptive learning (Pan et al., 2025).

Luo et al. presented “Study with Confucius,” a ChatGPT-powered educational game designed to teach classical Chinese literature through three educational modes that combine different AI agent configurations and distinct pedagogical approaches. The system uses the Unity engine with Prompt Graph to create modular AI agents with capabilities such as teaching, verification, item generation, and intelligent judgment, resulting in three gameplay levels where players act as Confucius's disciples, learning ancient texts through task completion. In a mixed-methods evaluation involving 32 eighth-graders and 7 high schoolers, the game demonstrated significantly higher knowledge retention and learning absorption compared to traditional textbook methods. Students demonstrated deep immersion and unexpected collaborative behavior during gameplay. However, the results revealed moderate Flow and Challenge scores, suggesting a high cognitive load (Luo et al., 2024).

Song et al. developed LearningverseVR, an immersive platform that leverages a hybrid generative AI architecture (combining GPT-3.5 and a local ChatGLM) to enable scriptless, high-fidelity NPC interactions. To ensure pedagogical accuracy, the system employs Retrieval Augmented Generation (RAG) and vector databases to mitigate the risk of hallucinations. A key innovation is its dynamic affinity mechanism, which tracks interaction history across four dimensions—topic relevance, tone, task completion, and frequency—to personalize NPC behavior over time. Furthermore, the platform introduces an “LLM ecosystem” in which diverse agents (world, organic, and inorganic) interact, significantly enhancing environmental immersion beyond standard conversational agents (Song et al., 2024).

Zhu et al. introduced VAPS (Virtual AI Patient Simulator), a GPT-4o-powered VR platform designed to bridge gaps in clinical communication training for Health Professions (HP) students. By utilizing high-fidelity MetaHuman characters with synchronized non-verbal cues, the system moves beyond traditional checklist-based simulations to offer authentic, unpredictable interactions. A core contribution is the user-friendly design interface, which empowers clinical educators—regardless of technical expertise—to customize patient personas based on medical history, literacy levels, and specific personality traits (e.g., openness, agreeableness). While the system effectively targets student challenges by personalizing communication for diverse patient backgrounds, it currently requires developers to manually review AI-generated prompts before deployment to ensure their appropriateness (Zhu et al., 2025).

Hu et al. developed Nurse Town, a Unity-based simulation designed to mitigate the clinical training shortage in nursing education through GPT-4o-powered patient avatars. The system distinguishes itself by featuring 10 randomized personality types (e.g., anxious-emotional, dismissive-overconfident), forcing students to adapt their communication strategies dynamically. Technically, the platform achieves low-latency (3 s) interactions using Whisper STT and OpenAI TTS, complemented by context-aware synchronized gestures. A key feature is the automated assessment component, which uses educator-defined rubrics to evaluate clinical accuracy, empathy, and professionalism. While early demonstrations show the system can effectively differentiate between student performance levels, its current limitations include a narrow clinical scope (hypertension only), a limited vocal emotional range, and a lack of formal user evaluation data (Hu et al., 2025). Table 1 summarizes the related work, highlighting key contributions and limitations.

Table 1

Study | Core AI and architecture | Key contribution | Identified limitations
McKern et al. (2024) | Dual-Channel (GPT-3.5 + HumanML3D) | Decouples speech and movement generation to create coordinated, context-aware non-verbal behaviors. | Small sample size (N = 9); prototype focused on movement validation rather than full conversational flow.
Liu et al. (2024) | GPT-4 + ElevenLabs (Dual-Context) | “ClassMeta” agents act as virtual peers to drive participation via note-taking and discussion prompts. | Relies on predefined action signals for certain behaviors; specific to classroom interaction contexts.
Pan et al. (2025) | GPT-4 (VRChat Integration) | Social VR agent for situated language learning that significantly reduces speaking anxiety. | High response latency; rigid turn-taking logic; lacks emotional nuance in non-verbal cues.
Luo et al. (2024) | ChatGPT + Prompt Graph (Unity) | Modular agent roles (Teacher, Verifier, Judge) for game-based classical literature learning. | High cognitive load reported by students (moderate Flow/Challenge scores).
Song et al. (2024) | Hybrid (GPT-3.5 + Local ChatGLM) + RAG | “LLM Ecosystem” where Organic/Inorganic agents interact; uses RAG to minimize hallucinations. | High architectural complexity involving synchronization of local/cloud models and multiple agent types.
Zhu et al. (2025) | GPT-4o + MetaHuman | User-friendly configuration tool allowing non-technical educators to define patient personas. | Scalability bottleneck: requires manual developer review of prompts to ensure safety/appropriateness.
Hu et al. (2025) | GPT-4o + Whisper + OpenAI TTS | 10 distinct patient personalities with automated assessment of clinical empathy and accuracy. | Limited scope (single hypertension scenario); emotionally flat TTS voices; lack of formal user evaluation data.

Comparative analysis of generative AI architectures in educational VR.

2.2 Virtual campus tours for university orientation

Virtual campus tours have emerged as essential tools for university marketing and student recruitment, providing prospective students with immersive previews of campus facilities and academic environments before they visit in person and helping newcomers become accustomed to, and comfortable with, their new surroundings.

Azizo et al. developed a voice-controlled VR 360 campus tour for Universiti Teknologi Malaysia (UTM) using IBM Watson's Speech-to-Text API and Unity 3D. The system allowed users to navigate nine campus locations via spoken commands, offering a hands-free alternative to controller-based movement. While participants generally valued the hands-free interaction, the reliance on cloud-based processing introduced significant limitations: the system struggled with accent recognition and background noise, suffered from latency due to internet dependency, and caused motion sickness in nearly half the users. These findings highlight the trade-offs between accessibility and reliability when employing cloud-dependent speech APIs in educational VR (Azizo et al., 2020).

Garcia et al. developed the MILES Virtual Tour, a playable 3D application for desktop and mobile devices designed to maximize accessibility by removing the need for VR headsets. Using an extended Technology Acceptance Model (TAM) with 104 participants, the study revealed that prospective students perceived significantly higher usefulness and behavioral intention than enrolled students, validating the system as an effective recruitment tool. Furthermore, beta testing highlighted a distinct user preference for interactive game mechanics—such as running and teleportation—over the passive navigation styles typical of static 360-degree tours (Garcia et al., 2023).

Salim and Khalilov developed a high-fidelity VR campus tour for Tishk International University (TIU) using a pipeline that combines Autodesk 3D Max for modeling and Unreal Engine 4 for interactive deployment on the HTC Vive Pro. Achieving over 90% user satisfaction during student competition testing, the project validates the role of immersive tours in student recruitment, citing parallel evidence of enrollment growth from similar implementations. However, the current system relies on static visualization without conversational agents, limiting engagement compared to dynamic AI-driven approaches (Salim and Khalilov, 2024).

While prior work has advanced specific aspects of AI-driven VR—ranging from motion generation (McKern et al., 2024) and peer modeling (Liu et al., 2024) to domain-specific clinical training (Hu et al., 2025; Zhu et al., 2025)—TUMSphere distinguishes itself by integrating these elements into a holistic student orientation platform. Unlike existing virtual campus tours that rely on static visualization (Salim and Khalilov, 2024) or error-prone voice commands (Azizo et al., 2020), our approach leverages a unified, modular architecture (Convai) to create fully embodied agents capable of both retrieval-augmented conversation and autonomous spatial navigation. This implementation moves beyond the purely linguistic focus of ELLMA-T (Pan et al., 2025) or the complex hybrid architectures of LearningverseVR (Song et al., 2024), providing a streamlined, scalable solution specifically designed to mitigate social anxiety and administrative confusion for international students during their academic transition.

3 System architecture and design

3.1 TUMSphere platform overview

TUMSphere is an educational VR game developed for the Technical University of Munich's Heilbronn campus, specifically designed to support first-semester Information Engineering students during their transition to university life. The platform combines immersive virtual reality technology with game-based learning principles to create an engaging introduction to both the physical campus environment and the core concepts of the academic program.

3.1.1 Game environment description

The virtual environment recreates the Bildungscampus Heilbronn, with a particular focus on the TUM School of Computation, Information and Technology (CIT) building, one of the main buildings at TUM. The game world was constructed using a combination of on-site measurements, photogrammetry (Berrezueta-Guzman et al., 2025), and official architectural floor plans to achieve spatial authenticity while making necessary adjustments for VR comfort and navigability. The environment includes both the exterior campus grounds, featuring surrounding buildings and landscaping elements, and the interior spaces, including classrooms, laboratories, corridors, and common areas. Through iterative playtesting, the development team discovered that real-world measurements needed to be modified for optimal VR perception, leading to subtle geometric adjustments that preserved the campus layout while enhancing user experience and spatial legibility.

The virtual campus serves multiple functions beyond simple visualization. Students can freely explore the environment using VR locomotion controls, interact with various objects and spaces, and encounter contextually placed educational content. This approach transforms passive observation into active discovery, allowing users to build spatial familiarity with their academic environment before or alongside their physical campus experience.

3.1.2 Educational objectives and game mechanics

TUMSphere pursues several interconnected educational objectives aligned with the needs of incoming Information Engineering students. The primary goal is to familiarize students with the campus, enabling them to navigate and understand the physical layout of their study environment, locate essential facilities such as lecture halls, laboratories, student services, and recreational spaces, and reduce anxiety associated with navigating an unfamiliar campus. A secondary objective focuses on curricular introduction, where the game introduces core subjects from the Information Engineering program through interactive mini-games that correspond to first and second-semester courses.

The game employs a progression-based structure visualized in a comprehensive workflow diagram (as shown in Figure 1) that maps mini-games to specific academic subjects. Students advance through increasingly complex challenges that mirror their educational journey, starting with foundational topics like Computer Architecture and Software Engineering, then progressing to more advanced subjects such as Operating Systems, Databases, and Linear Algebra. Each mini-game is designed to provide conceptual exposure rather than comprehensive instruction, serving as an engaging preview of academic content.

Figure 1

Game mechanics include standard VR interaction paradigms such as locomotion via thumbstick controls with optional teleportation, object interaction through grab mechanics using VR controller triggers, spatial navigation guided by visual cues and in-game NPCs, and progression systems that unlock new areas and challenges as students complete tasks. The control scheme was deliberately simplified and introduced through a tutorial sequence, with persistent access to a settings menu for reference, ensuring accessibility for users with varying levels of VR experience.

3.1.3 Target user group

The primary target audience for TUMSphere consists of prospective and newly enrolled students in the Information Engineering bachelor's program at TUM Campus Heilbronn. Of particular significance is the international student population, which represents a substantial proportion of the campus community. According to internal enrollment statistics from the 2024/25 academic year, the Bachelor of Information Engineering (BIE) program includes 603 students, of whom approximately 89% are international, representing 77 different countries. This demographic diversity underscores the need for multilingual support, accessible guidance systems, and culturally adaptive interaction design within the TUMSphere platform.

International students face unique challenges during their transition to German university life, including language barriers, unfamiliarity with German academic culture and administrative systems, limited prior knowledge of campus geography and local services, and potential social isolation during the initial adaptation period. TUMSphere addresses these challenges through multiple design features, most notably the integration of AI-powered NPCs capable of engaging in natural language conversations. These NPCs provide personalized guidance on university services, answer questions about academic requirements and campus life, and offer a low-pressure environment for practicing conversational interactions.

The platform also serves secondary audiences, including current students seeking academic support or campus navigation assistance, university staff and faculty interested in innovative pedagogical tools, prospective students conducting virtual campus tours during the decision-making process, and researchers investigating the use of serious games and VR applications in higher education. This broader accessibility enhances the platform's utility while maintaining its core focus on supporting the international student experience during the critical first-semester transition period.

3.1.4 Inclusivity-driven design rationale

The platform's design moves beyond general orientation by operationalizing specific inclusivity mechanisms for international students (N = 603, 89% of cohort).

  • Linguistic Accessibility: To mitigate high cognitive workload, NPC speech rates were reduced to 0.8x of conversational standard, and accents were set to neutral English models to ensure comprehension across 77 different linguistic backgrounds.

  • Psychological Safety: The “judgment-free” nature of the AI agent was specifically implemented as an anxiety-reduction mechanism, allowing students to rehearse administrative queries (Task A) or linguistic roleplay (Task C) in a low-stakes sandbox before physical campus immersion.

3.2 LLM integration architecture

The development of conversational NPCs within TUMSphere followed an iterative approach, evolving from initial desktop prototyping to fully immersive VR implementation. This evolutionary process enabled the systematic exploration of various integration strategies and technical solutions before committing to the final architecture.

3.2.1 Overall system design

The LLM integration architecture follows a client-server model, with the Unreal Engine-based VR application serving as the client and communicating with cloud-based AI services via structured API calls. The system was designed with modularity in mind, separating concerns between different functional layers: the VR presentation layer (user interface, 3D environment, character rendering), the interaction layer (input capture, gesture recognition, proximity detection), the integration layer (API communication, data serialization, response handling), and the AI service layer (speech recognition, language model processing, speech synthesis).

This layered architecture evolved through iterative development, beginning with a simpler desktop prototype that established core communication patterns before introducing VR-specific requirements. The initial implementation utilized the VaRest plugin to handle HTTP REST API communication with OpenAI's GPT models, providing fundamental capabilities for JSON parsing, asynchronous request handling, and response processing. However, the transition to VR revealed significant limitations in text-based interaction, necessitating the integration of speech-to-text and text-to-speech capabilities to maintain immersion.

Rather than assembling these capabilities from multiple disparate services, the final architecture adopted the Convai platform, which provides an integrated solution combining automatic speech recognition, natural language processing through LLM integration, neural text-to-speech synthesis, and automated generation of lip-sync and facial animation parameters. This unified approach significantly reduced architectural complexity while providing additional features such as persistent conversation memory, character-specific personality configuration, and voice customization options.

3.2.2 Component interaction and data flow

The interaction between system components follows a well-defined sequence, as shown in Figure 2. It is triggered by user proximity or explicit input and can be traced through several distinct phases in a typical conversational exchange.

Figure 2

Initiation Phase: When a student navigates the virtual campus and encounters an NPC within a defined interaction radius or presses a designated activation button (mapped to the “A” button on VR controllers), the system initiates the conversation pipeline. The VR client captures the student's spatial position and controller input state, and determines whether the interaction conditions are met. Visual feedback, such as a subtle highlight or attention animation, indicates that the NPC is ready to receive input, and audio capture is activated.

Input Capture and Processing: The student speaks naturally while the system continuously streams audio data from the VR headset microphone through Unreal Engine's audio subsystem to the Convai plugin's input handler, which manages buffering and network transmission. This streaming approach reduces perceived latency compared to waiting for complete utterance detection before transmission. On remote servers, speech recognition services analyze the audio stream, applying acoustic models and language models to produce text transcripts with associated confidence scores.

Language Model Processing: The transcripts, packaged with contextual metadata including NPC personality definitions, conversation history, and current game state, are forwarded to the large language model. The LLM evaluates the input against its training data, the NPC's specific instructions and personality parameters, conversation history stored in memory, and any relevant knowledge base entries configured for the character. Response generation considers multiple factors, including semantic coherence, character consistency, educational appropriateness, and conversational goals.

Response Synthesis: The generated response text undergoes parallel processing for audio and animation generation. The text-to-speech system applies prosody models to generate natural speech patterns, including pitch variations, speaking rate, emotional coloring, and punctuation pauses. Simultaneously, a phoneme analysis system generates timing data indicating when specific mouth shapes should occur during speech playback. These phoneme sequences are mapped to animation blendshapes, or bone transforms compatible with the character's facial rig.

Presentation and Playback: The VR client receives these processed outputs and coordinates their presentation. Audio streams are rendered through spatial audio processing, positioned at the NPC's location in 3D space with appropriate attenuation and environmental effects. Animation data drives real-time updates to the character's facial mesh, creating synchronized lip movements and complementary expressions such as blinks, eyebrow raises, or head tilts. The conversation state is updated with the exchange, enabling context retention for subsequent interactions in which NPCs can reference previous exchanges and adapt their responses based on interaction history.
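For illustration, the listing below sketches how per-frame viseme weights can drive a character's facial blendshapes in Unreal Engine C++. In TUMSphere this mapping is handled internally by the Convai Lip Sync plugin, so the function and variable names (e.g., ApplyVisemeFrame, VisemeWeights) are purely illustrative.

```cpp
// Illustrative sketch only: the Convai Lip Sync plugin performs this mapping internally.
// Assumes a map of viseme morph-target names (ARKit/Oculus visemes on the Ready Player Me
// mesh) to weights produced by the phoneme analysis step.
#include "Components/SkeletalMeshComponent.h"

void ApplyVisemeFrame(USkeletalMeshComponent* FaceMesh,
                      const TMap<FName, float>& VisemeWeights)
{
    if (!FaceMesh)
    {
        return;
    }
    for (const TPair<FName, float>& Viseme : VisemeWeights)
    {
        // Drive the corresponding blendshape; weights are clamped to the valid [0, 1] range.
        FaceMesh->SetMorphTarget(Viseme.Key, FMath::Clamp(Viseme.Value, 0.0f, 1.0f));
    }
}
```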

This data flow architecture balances responsiveness with quality, employing various optimization strategies including asynchronous processing to prevent blocking the main rendering thread, progressive response streaming where supported, and graceful degradation when network conditions degrade. Error-handling mechanisms ensure that connection failures, timeouts, or malformed responses do not break the user experience; instead, they trigger fallback behaviors such as predefined error messages or temporary NPC unavailability indicators.

The modular nature of this architecture provides flexibility for future enhancements, allowing individual components to be upgraded or replaced without requiring a complete system redesign. For instance, alternative LLM providers, different speech synthesis services, or enhanced animation systems could be integrated by modifying only the relevant interface layer while preserving the overall data flow structure.
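A minimal sketch of such an interface layer is shown below. It is not part of the TUMSphere codebase (the production system uses the Convai plugin directly); the class, method, and type names are hypothetical and merely illustrate how the VR client could depend on an abstraction rather than on a specific provider.

```cpp
// Hypothetical "interface layer" sketch: the VR client depends only on this abstraction,
// so an alternative LLM or speech provider could be swapped in by supplying another
// implementation without touching the rest of the data flow.
#include "CoreMinimal.h"

struct FNpcResponse; // carries response text, synthesized audio, and viseme timing data

class IConversationBackend
{
public:
    virtual ~IConversationBackend() = default;

    // Streams captured microphone audio for server-side speech recognition.
    virtual void PushAudioChunk(const TArray<uint8>& PcmData) = 0;

    // Requests a response for the finished utterance; the callback delivers the
    // assembled FNpcResponse once the cloud pipeline has completed.
    virtual void RequestResponse(const FString& CharacterId,
                                 TFunction<void(const FNpcResponse&)> OnResponse) = 0;
};
```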

3.3 Technical stack

The final TUMSphere NPC implementation leverages several interconnected technologies to create a cohesive conversational experience, as illustrated in Figure 3.

Figure 3

3.3.1 Unreal engine

Unreal Engine serves as the primary development platform, utilizing OpenXR for cross-platform VR support. The Blueprint visual scripting system facilitated rapid prototyping of NPC behaviors and interaction logic without compilation delays. Furthermore, the engine's native skeletal mesh system provides the necessary infrastructure for character animation, specifically supporting morph targets required for real-time facial expression and lip-syncing (Berrezueta-Guzman and Wagner, 2026; Epic Games, 2024).

3.3.2 VaRest plugin

Utilized primarily during the initial desktop prototyping phase, the VaRest plugin enabled direct REST API communication within Blueprints. It handled HTTP request construction, JSON parsing, and asynchronous response management. This component validated the core logic of connecting Unreal Engine to OpenAI's endpoints before the transition to the more comprehensive Convai solution.

3.3.3 OpenAI GPT models

Conversational intelligence relies on OpenAI's GPT-3.5 Turbo. This model was selected for its optimal balance between reasoning capability and inference speed. For the specific use case of campus orientation—which requires factual information retrieval rather than complex creative writing—GPT-3.5-turbo offered sufficiently low latency to maintain the “illusion of presence” in VR while minimizing the delay between user input and NPC response (OpenAI, 2024).

3.3.4 Convai plugin

The Convai plugin functions as the central integration hub in the final VR architecture. It manages the runtime conversation pipeline, including character initialization via unique IDs, audio capture, and event-driven response handling. By unifying Automatic Speech Recognition (ASR), LLM processing, and Text-to-Speech (TTS) into a single plugin, it eliminates the complexity of managing separate APIs. Additionally, its web-based dashboard enables “no-code” configuration of NPC personalities and Knowledge Banks (Convai, 2024b; Nnoli, 2024).

3.3.5 ReadyPlayerMe

Visual representation is handled by Ready Player Me, which provides customizable, high-fidelity 3D avatars. These avatars are imported with pre-configured skeletal rigs and facial blend shapes (compatible with ARKit and Oculus Visemes). This standardization is critical for the pipeline, as it allows the phoneme timing data generated by Convai to drive the avatar's mouth movements automatically, resulting in accurate, real-time lip-synchronization without manual animation rigging (Ready Player Me, 2024; Convai, 2024a).

3.4 Alternative plugins and tools explored

During the prototyping phase, we evaluated several alternative tools for conversational AI in VR. While viable for other contexts, these were not selected for TUMSphere. Table 2 compares these solutions, illustrating why Convai was chosen: it offers the only unified platform combining LLM integration, speech processing, and VR-optimized lip-sync, thereby eliminating the architectural complexity of integrating multiple specialized plugins.

Table 2

Feature | Runtime-AI chatbot | Runtime-speech recognizer | Runtime MetaHuman Lip Sync | Convai
Primary function | LLM integration | Speech-to-text | Lip-sync animation | All-in-one
Internet required | Yes (cloud APIs) | No (offline) | No (offline) | Yes (cloud platform)
VR-ready | Partial | Partial | Yes (Quest support) | Yes (full support)
Lip-sync support | No | No | Yes (3 models) | Yes
Emotional control | No | No | Yes (12 moods) | Limited
Platform support | All UE platforms | All UE platforms | All UE platforms | All UE platforms
GPU acceleration | N/A | Vulkan/Metal | onnxruntime | Server-side processing
Model options | 15+ LLMs | Whisper (5 sizes) | 3 quality levels | 18 LLMs
Best use case | Multi-LLM projects | Privacy-first STT | MetaHuman characters | VR conversational NPCs

Comparison of alternative Unreal Engine plugins for conversational AI.

4 Methodology and technical implementation details

4.1 Initial prototype development in UE as a chatbox with VaRest

The initial prototype was developed as a desktop application in Unreal Engine, implementing a text-based chatbox interface that communicated with OpenAI's GPT-3.5 Turbo model via the VaRest plugin, as illustrated in Figure 4. This implementation established the foundational conversation loop and validated the core concept of LLM integration within the Unreal Engine environment before introducing VR-specific complexities.

Figure 4

4.1.1 User interface implementation

The prototype featured a scrollable chat interface built using Unreal Motion Graphics (UMG), Unreal Engine's visual UI authoring system. The interface employed a vertical scrolling container that dynamically displayed the conversation history, with each message appearing as a separate text element. User messages and AI responses were visually distinguished by color—user inputs were white, while AI-generated responses were pink, creating a clear visual separation between participants.

The system automatically scrolled to display the most recent message as the conversation progressed, eliminating the need to manually navigate the chat history. Each time a new message was added to the interface, whether from the user or the AI, the scroll position updated to show the latest content. A brief 0.2-s delay was introduced between message display and scroll adjustment to ensure smooth visual transitions and prevent jarring interface updates.
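The listing below gives a C++ approximation of this behavior; the prototype implemented it in Blueprints, so the widget class and member names (UChatBoxWidget, ChatScrollBox, ScrollToLatest) are illustrative.

```cpp
// Sketch of the delayed auto-scroll, assuming a UUserWidget subclass that owns the
// chat UScrollBox. The 0.2 s delay lets the newly added text element finish layout
// before the scroll position snaps to the latest message.
#include "Blueprint/UserWidget.h"
#include "Components/ScrollBox.h"
#include "TimerManager.h"

void UChatBoxWidget::OnMessageAdded()
{
    FTimerHandle ScrollTimer;
    GetWorld()->GetTimerManager().SetTimer(
        ScrollTimer, this, &UChatBoxWidget::ScrollToLatest, 0.2f, /*bLoop=*/false);
}

void UChatBoxWidget::ScrollToLatest()
{
    if (ChatScrollBox)
    {
        ChatScrollBox->ScrollToEnd();
    }
}
```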

4.1.2 Input processing

User input was captured through a standard text entry field where students typed their questions or messages. The system monitored for the Enter key press to submit messages, triggering the API communication sequence. Before sending any request to the OpenAI API, the implementation validated that the input was not empty, preventing unnecessary API calls and associated costs from blank submissions. This validation step represented a basic but essential form of error prevention, ensuring that only meaningful user input reached the language model.
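A C++ sketch of this validation step, assuming the handler is bound to the text field's OnTextCommitted event, is shown below; SendPromptToModel is an illustrative helper corresponding to the API call described in the next subsection.

```cpp
// Submit-on-Enter validation sketch, assuming the handler is bound to the
// UEditableTextBox::OnTextCommitted delegate of the chat input field.
void UChatBoxWidget::HandleTextCommitted(const FText& Text, ETextCommit::Type CommitMethod)
{
    // Only react to the Enter key, and skip empty or whitespace-only input
    // to avoid unnecessary API calls and associated costs.
    if (CommitMethod != ETextCommit::OnEnter || Text.ToString().TrimStartAndEnd().IsEmpty())
    {
        return;
    }
    SendPromptToModel(Text.ToString()); // illustrative helper forwarding the validated prompt
}
```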

4.1.3 API communication architecture

The communication with OpenAI's API was implemented using the VaRest plugin, which provides Blueprint-accessible nodes for REST API interactions. The VaRest plugin handled the HTTP protocol communication, JSON formatting, and asynchronous request management that would otherwise require C++ programming.

The temperature parameter deserves particular attention. This value, ranging from 0 to 1, determines how deterministic or creative the model's responses will be. The prototype experimented with values between 0.1 and 0.5, ultimately favoring lower values around 0.1 for educational contexts. Lower temperatures produce more focused and consistent responses, which proved more appropriate for providing factual information about university services and campus navigation than highly creative or varied answers.
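The following sketch shows a C++ equivalent of the request that the VaRest Blueprint graph assembles, including the temperature setting discussed above. The prototype itself used VaRest nodes rather than this code, and identifiers such as ApiKey and AddMessageToChat are illustrative placeholders.

```cpp
// C++ approximation of the chat-completions request built with VaRest in the prototype.
#include "HttpModule.h"
#include "Interfaces/IHttpRequest.h"
#include "Interfaces/IHttpResponse.h"
#include "Dom/JsonObject.h"
#include "Dom/JsonValue.h"
#include "Serialization/JsonSerializer.h"
#include "Serialization/JsonWriter.h"

void UChatBoxWidget::SendPromptToModel(const FString& UserText)
{
    // Build the payload; a low temperature (0.1) keeps answers focused and factual.
    TSharedPtr<FJsonObject> Message = MakeShareable(new FJsonObject);
    Message->SetStringField(TEXT("role"), TEXT("user"));
    Message->SetStringField(TEXT("content"), UserText);

    TArray<TSharedPtr<FJsonValue>> Messages;
    Messages.Add(MakeShareable(new FJsonValueObject(Message)));

    TSharedPtr<FJsonObject> Body = MakeShareable(new FJsonObject);
    Body->SetStringField(TEXT("model"), TEXT("gpt-3.5-turbo"));
    Body->SetNumberField(TEXT("temperature"), 0.1);
    Body->SetArrayField(TEXT("messages"), Messages);

    FString BodyString;
    TSharedRef<TJsonWriter<>> Writer = TJsonWriterFactory<>::Create(&BodyString);
    FJsonSerializer::Serialize(Body.ToSharedRef(), Writer);

    // Asynchronous HTTP POST to the OpenAI endpoint.
    TSharedRef<IHttpRequest, ESPMode::ThreadSafe> Request = FHttpModule::Get().CreateRequest();
    Request->SetURL(TEXT("https://api.openai.com/v1/chat/completions"));
    Request->SetVerb(TEXT("POST"));
    Request->SetHeader(TEXT("Content-Type"), TEXT("application/json"));
    Request->SetHeader(TEXT("Authorization"), TEXT("Bearer ") + ApiKey); // ApiKey: illustrative member
    Request->SetContentAsString(BodyString);
    Request->OnProcessRequestComplete().BindLambda(
        [this](FHttpRequestPtr, FHttpResponsePtr Response, bool bSuccess)
        {
            // Fall back to a predefined message on failure; otherwise hand the raw
            // JSON reply to the UI layer for parsing and display.
            const FString Reply = (bSuccess && Response.IsValid())
                ? Response->GetContentAsString()
                : TEXT("Sorry, I could not reach the assistant right now.");
            AddMessageToChat(Reply); // illustrative UI helper
        });
    Request->ProcessRequest();
}
```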

4.2 Research insights and system failure modes

While the initial desktop prototype validated basic LLM-UE5 communication via REST protocols, the transition to VR revealed critical design trade-offs that extend beyond standard implementation.

4.2.1 Design trade-offs: cloud vs. local processing

A key failure mode identified was the “latency-presence gap.” Utilizing cloud-based LLMs (GPT-3.5) provided superior semantic understanding but introduced a variable end-to-end latency of 1.42s to 2.90s. In VR, this delay disrupts the social rhythm, leading to “doubling,” in which users speak over the NPC.
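Latency figures of this kind can be captured with simple instrumentation around the conversation pipeline, as sketched below; the function names are illustrative, and the two measurement points correspond to the end of the user's utterance and the start of NPC audio playback.

```cpp
// Minimal timing sketch for reasoning about the latency-presence gap.
double UtteranceEndSeconds = 0.0;

void OnPushToTalkReleased()
{
    // Mark the moment the user finishes speaking (push-to-talk button released).
    UtteranceEndSeconds = FPlatformTime::Seconds();
}

void OnNpcAudioStarted()
{
    // End-to-end latency: time until the first synthesized NPC audio is played back.
    const double EndToEndLatency = FPlatformTime::Seconds() - UtteranceEndSeconds;
    UE_LOG(LogTemp, Log, TEXT("NPC end-to-end latency: %.2f s"), EndToEndLatency);
}
```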

4.2.2 Lessons for generalized VR+LLM systems

  • Input Modality Preference: Voice-driven interaction is mandatory for presence; physical keyboard requirements in early prototypes were found to be immersion-breaking.

  • Navigation as Intelligence: Embodiment (Task B) significantly outperformed verbal instructions in utility, suggesting that for orientation tasks, an LLM's value is maximized when coupled with spatial agency (NavMesh) rather than just speech.

4.3 Transition to VR and voice integration

The desktop prototype successfully demonstrated that LLM integration with Unreal Engine was technically feasible and could produce coherent, contextually appropriate responses to user queries. Students could ask questions about campus facilities, university services, or general information and receive relevant answers generated by the language model. The conversation history display allowed users to review previous exchanges, and the system maintained reasonable response times for typical queries.

However, the prototype lacked any integration with 3D characters or spatial positioning; responses appeared in a flat UI overlay rather than emanating from a character within the virtual environment. We therefore recognized the need for voice-based interaction with speech recognition and synthesis capabilities, as well as embodied NPCs with synchronized facial animation and spatial audio. The research phase that followed evaluated available solutions for these requirements, examining standalone speech-to-text (STT) and text-to-speech (TTS) plugins, animation synchronization systems, and integrated platforms (as introduced in previous sections).

However, implementing STT and TTS as separate components would have required integrating multiple services and managing their individual APIs, authentication protocols, and data flows. This architectural complexity, combined with the need to synchronize speech audio with character lip movements and facial expressions, presented substantial technical challenges. During the research phase exploring VR voice integration solutions, the Convai platform emerged as a comprehensive solution addressing multiple integration challenges simultaneously, as illustrated in Figure 5.

Figure 5

4.4 Convai platform integration

Building upon the validated LLM communication patterns established in the desktop prototype, the development transitioned to implementing voice-enabled NPCs within the VR environment using the Convai platform. Convai distinguished itself through its comprehensive feature set, including multiple LLM provider options (OpenAI, Anthropic, and others), extensive voice customization with various accents and speaking styles, knowledge bank integration for domain-specific information, character background and personality configuration, and unified API handling STT, LLM processing, TTS, and animation generation. This all-in-one architecture eliminated the complexity of coordinating multiple services while providing VR-optimized features designed explicitly for interactive character implementation in game engines.

4.4.1 Platform architecture overview

Convai operates as a conversational AI platform specifically designed for games and virtual worlds, providing an integrated solution that combines LLM capabilities with voice interaction and character animation features. The platform operates on a client-server model, where the Unreal Engine plugin manages local operations, such as audio capture, animation playback coordination, and user interface responses. Meanwhile, Convai's cloud infrastructure handles computationally intensive tasks, including automatic speech recognition, LLM inference, and neural text-to-speech synthesis.

The Convai integration within TUMSphere required adapting the plugin's default interaction mechanisms to suit VR-specific requirements and the educational context. The primary integration work involved configuring input mappings, establishing interaction triggers, and connecting Convai's conversation system with TUMSphere's existing VR player controller.

To make interacting with NPCs feel natural, we set the “A” button on the VR controller as the primary trigger for speaking. We created a custom command, IA_Talk, that works with both VR controllers and keyboards, enabling a simple “push-to-talk” system similar to a walkie-talkie. By linking this command directly to the AI's listening functions, the system starts recording when the button is pressed and stops when it is released, ensuring a consistent, easy-to-test, and intuitive interaction for students.

To make the system easier to maintain and update, we moved the push-to-talk code from the plugin's default files into our own custom player file, the VRPawn Blueprint. Initially, our logic was buried inside the plugin's original code, which made it difficult to customize and risked our changes being erased whenever the plugin was updated. By moving the controls to our own system, we gained full control over the VR interaction flow while ensuring the game remains stable and future-proof against software updates (see Figure 6).

Figure 6
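The listing below sketches how this push-to-talk binding can be expressed in a custom pawn class using Unreal Engine's Enhanced Input system. In TUMSphere the equivalent logic lives in the VRPawn Blueprint; the C++ form and the StartListeningToPlayer/StopListeningAndSend calls (stand-ins for the Convai plugin's audio-capture functions) are illustrative.

```cpp
// Sketch of the walkie-talkie style push-to-talk binding relocated into the custom VR pawn.
// IA_Talk is the input action described above (assumed to be a UInputAction* member of the pawn).
#include "EnhancedInputComponent.h"
#include "InputAction.h"

void AVRPawn::SetupPlayerInputComponent(UInputComponent* PlayerInputComponent)
{
    Super::SetupPlayerInputComponent(PlayerInputComponent);

    if (UEnhancedInputComponent* Input = Cast<UEnhancedInputComponent>(PlayerInputComponent))
    {
        // Press the "A" button (or a keyboard key) to start recording, release to stop.
        Input->BindAction(IA_Talk, ETriggerEvent::Started, this, &AVRPawn::BeginTalking);
        Input->BindAction(IA_Talk, ETriggerEvent::Completed, this, &AVRPawn::EndTalking);
    }
}

void AVRPawn::BeginTalking() { StartListeningToPlayer(); }  // illustrative stand-in for the Convai capture call
void AVRPawn::EndTalking()   { StopListeningAndSend(); }    // illustrative stand-in for the Convai send call
```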

Beyond input remapping, the integration established proximity-based eligibility for interaction. The system tracked VR hand controller positions relative to NPC locations, enabling conversations only when students positioned their controllers within a defined interaction radius around characters. This spatial requirement created intuitive interaction mechanics that let students “approach” NPCs and reach toward them to initiate dialogue, mirroring natural social interaction patterns rather than relying on abstract button presses without spatial context.
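A minimal sketch of this proximity gate is shown below; the radius value and identifiers are illustrative.

```cpp
// Proximity gate sketch: a conversation can only start when one of the tracked hand
// controllers is within the interaction radius of the NPC.
bool IsNpcInReach(const FVector& LeftHandLocation,
                  const FVector& RightHandLocation,
                  const FVector& NpcLocation,
                  float InteractionRadiusCm = 120.0f) // illustrative threshold
{
    const float LeftDist  = FVector::Dist(LeftHandLocation, NpcLocation);
    const float RightDist = FVector::Dist(RightHandLocation, NpcLocation);
    return FMath::Min(LeftDist, RightDist) <= InteractionRadiusCm;
}
```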

To bring each NPC to life, we linked the game's 3D characters to their “brains” in the cloud using unique Character IDs. We generated these IDs in the Convai web dashboard and then pasted them into each character's settings within Unreal Engine. This link tells the game exactly which personality, voice, and specialized knowledge to use for each person the student meets on campus, ensuring the 3D model in the headset stays perfectly synced with the AI configuration stored online.

4.4.2 Environmental awareness and object interaction

Beyond conversational capabilities, the Convai integration included environmental awareness features that enabled NPCs to perceive and interact with objects in the virtual campus. This functionality, implemented through Convai's Actions API, allows characters to recognize named objects in their surroundings and perform actions related to them, creating more dynamic, spatially aware interactions than purely conversational systems.

The implementation involved registering specific campus objects with NPC characters through the Convai web interface's Objects configuration section. For TUMSphere, key campus facilities and landmarks were identified and added to character awareness, including vending machines, lecture halls, laboratories, student service offices, recreational areas, and building entrances. Each object was assigned a descriptive name and position reference that the character could use during navigation and conversation.

When students asked location-related questions such as “Where is the nearest vending machine?”, NPCs could not only provide verbal directions but also physically navigate to the requested location. The navigation system utilized Unreal Engine's NavMesh (Navigation Mesh) system, which defined walkable areas throughout the virtual campus. A NavMesh Bounds Volume encompassing the campus environment enabled characters to pathfind around obstacles, through corridors, and across open spaces, guiding students to their requested destinations.

A follow behavior proved particularly valuable for campus orientation. Students could request “Follow me around the building,” and the NPC would accompany them, maintaining an appropriate following distance while continuing the conversation. This escort functionality combined spatial navigation with conversational guidance, enabling characters to provide contextual information about locations as students explored the campus together with their virtual guide.
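The underlying pathfinding calls can be sketched as follows, assuming the NPC is possessed by an AAIController and the campus is covered by the NavMesh Bounds Volume described above. In TUMSphere these movements are triggered through Convai's Actions API, so the helper functions and acceptance radii below are illustrative.

```cpp
// Navigation and escort sketch on top of Unreal's NavMesh.
#include "AIController.h"

void GuideStudentTo(AAIController* NpcController, AActor* Destination)
{
    if (NpcController && Destination)
    {
        // Pathfind around obstacles and stop roughly 1.5 m short of the requested object.
        NpcController->MoveToActor(Destination, /*AcceptanceRadius=*/150.0f);
    }
}

void FollowStudent(AAIController* NpcController, APawn* Student)
{
    if (NpcController && Student)
    {
        // Re-issuing MoveToActor as the student moves keeps the NPC at a comfortable
        // following distance while the conversation continues.
        NpcController->MoveToActor(Student, /*AcceptanceRadius=*/200.0f);
    }
}
```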

These environmental interaction capabilities transformed NPCs from static information kiosks into active participants in campus exploration, providing a more engaging and practical orientation experience than purely verbal directions or static maps could offer.

4.5 Character configuration through Convai web interface

The behavioral characteristics, knowledge bases, and personalities of individual NPCs were configured through Convai's web-based dashboard interface. This cloud-based configuration system provided a crucial separation between technical implementation and content authorship.

4.5.1 Character creation and ID assignment

Setting up each NPC is a straightforward process of linking a digital “brain” in the cloud to a physical “body” in the virtual environment. This begins on the Convai web dashboard, where we define the character's personality, voice, and specific knowledge base. Once configured, the dashboard generates a unique Character ID—a specific string of text that acts as a bridge between the cloud-based AI and the Unreal Engine Blueprint. By copying and pasting this ID into the character's properties in the engine, we establish a vital connection that enables the system to retrieve the correct configuration in real time. This ensures that when a student approaches an NPC, the character responds with the specific voice and expertise assigned to it.

For the TUMSphere environment, we developed multiple NPCs tailored to the specific needs of international students. One primary character serves as a Campus Buddy, a friendly guide who supports newcomers with orientation and administrative questions, and provides details about the university's Buddy Program. This NPC is programmed with specific knowledge about campus facilities and course structures to help first-semester students feel at home. Another character serves as a dedicated Language Partner, helping users practice their German in a relaxed, low-pressure setting. Together, these AI-driven agents transform the virtual campus from a static 3D model into an interactive social sandbox where students can prepare for their academic and cultural transition.

4.5.2 Character description and backstory

The Character Description section illustrated in Figure 7 served as the foundation for NPC personality definition. This text field accepted natural language descriptions of the character's role, background, personality traits, and conversational objectives. For example, a campus tour guide character might be described as “a friendly and enthusiastic student ambassador who has been at TUM Heilbronn for three years, loves helping international students feel welcome, and enjoys sharing interesting facts about campus facilities and local culture.” These descriptions informed how the underlying LLM interpreted conversation context and generated responses aligned with the character's established identity.

Figure 7

The backstory functionality allowed the creation of rich character histories that influenced conversational behavior. Characters could reference their backgrounds naturally during conversations, creating more believable and engaging interactions than purely functional information-delivery systems.

To ensure scientific reproducibility of our NPC's behavior, Table 3 details the specific system prompt and persona constraints implemented for the character “Akira”.

Table 3

Category | Technical implementation and prompt content
Core identity | “You are a third-semester Information Engineering student at TUM Campus Heilbronn and a member of the TUM Campus Heilbronn Buddy Program.”
Behavioral logic | “You share your own experiences to help new students adapt both academically and socially. You guide new students, especially in the D-Building.”
Persona constraints | “You value the easy communication with professors and PhDs thanks to the open-door culture. You enjoy the modern, friendly atmosphere of the campus.”
Backstory and motivation | “When you first arrived in your 1st semester, you felt a bit overwhelmed, which is why now you enjoy helping newcomers.”
Tone style | Friendly, enthusiastic student ambassador; clear, neutral English at a slightly reduced speaking rate for non-native comprehension.
Knowledge access | Integrated “Knowledge Bank” containing the TUM Ersti-Guide, campus facility locations, and academic program structures.

System prompt and persona configuration for NPC “Akira”.

4.5.3 Knowledge bank integration

The Knowledge Bank section provided one of the most impactful features for building educational and informational NPCs. This component enabled the upload of custom documents, structured datasets, and curated textual resources that characters could reference during conversations. Unlike generic LLM behavior, which relies solely on pre-training, the Knowledge Bank allowed precise grounding of responses in institution-specific and context-sensitive information.

For TUMSphere, the Knowledge Bank was populated with a diverse collection of resources tailored to both campus guidance and language-learning use cases. These included detailed university information, such as campus facilities, contact information, academic program structures, and international student support services. Additionally, local area information about Heilbronn was incorporated, including nearby transportation options, city orientation tips, and practical guidance. The Ersti-Guide, prepared by the TUM Student Council Heilbronn, was processed into a clean, structured text format and uploaded to ensure optimal retrieval quality inside Convai.

To support the German-practice NPCs, the Knowledge Bank was further expanded with official Goethe-Institut vocabulary lists for A1, A2, B1, and B2 proficiency levels. These documents enabled the character to provide level-appropriate explanations, vocabulary assistance, and guided speaking feedback grounded in authentic exam-relevant language data. By combining campus-specific content with standardized language-learning resources, the Knowledge Bank enabled characters to accurately handle a wide range of queries.

When users interacted with the system, Convai performed retrieval-augmented generation (RAG), scanning the Knowledge Bank for relevant passages before producing an answer. In this process, Convai automatically handles document vectorization and embedding creation, as well as the underlying similarity search over these embeddings, so no separate vector database or custom RAG pipeline (e.g., based on Pinecone or Weaviate) is required on the developer side. This significantly improved factual accuracy, consistency, and contextual grounding. As a result, NPC responses aligned closely with real procedures, local campus details, and exam-aligned linguistic knowledge for German practice, far beyond what a general LLM could provide without external references.
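For readers unfamiliar with the mechanism, the following conceptual sketch illustrates the retrieval step of RAG: the query is embedded, each Knowledge Bank passage is scored by cosine similarity, and the best match is used to ground the prompt. None of this code exists in TUMSphere, since Convai performs embedding and similarity search server-side; the types and function names are illustrative.

```cpp
// Conceptual sketch of the retrieval step in retrieval-augmented generation.
#include <cmath>
#include <string>
#include <vector>

struct Passage {
    std::string text;
    std::vector<float> embedding;  // produced offline by an embedding model
};

float CosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
    float dot = 0.0f, normA = 0.0f, normB = 0.0f;
    for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    const float denom = std::sqrt(normA) * std::sqrt(normB);
    return denom > 0.0f ? dot / denom : 0.0f;
}

// Returns the Knowledge Bank passage most relevant to the query embedding; the
// retrieved text is then prepended to the LLM prompt to ground the response.
const Passage* RetrieveBestPassage(const std::vector<float>& queryEmbedding,
                                   const std::vector<Passage>& knowledgeBank) {
    const Passage* best = nullptr;
    float bestScore = -1.0f;
    for (const Passage& p : knowledgeBank) {
        const float score = CosineSimilarity(queryEmbedding, p.embedding);
        if (score > bestScore) {
            bestScore = score;
            best = &p;
        }
    }
    return best;
}
```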

4.5.4 Language and speech configuration

The Language and Speech section controlled voice synthesis parameters that determined how characters sounded during conversations. Options included selecting from numerous voice models with different accents, genders, and tonal qualities; adjusting speaking rate to ensure clarity for non-native speakers; configuring pitch and emphasis patterns; and selecting primary and secondary languages for multilingual support.

For TUMSphere's international student audience, characters were configured with clear, neutral English accents at slightly slower speaking rates than conversational defaults. This adjustment improved comprehension for students with varying English proficiency levels while maintaining natural-sounding speech. Some characters were configured to support German, allowing students to practice German conversation in a low-pressure environment.

4.5.5 Personality traits and core AI settings

The Personality Traits section offered structured personality configuration beyond free-form descriptions. Characters could be assigned traits along various dimensions such as friendliness vs. formality, enthusiasm vs. reservedness, verbosity vs. conciseness, and helpfulness vs. directness. These parameters influenced response generation patterns: friendly characters produced warmer, more conversational responses, while formal characters maintained professional distance.

The Core AI Settings section provided access to advanced parameters, including LLM model selection, temperature and response randomness controls, maximum response length limits, and context window size for conversation memory.

4.6 Character integration with Convai ReadyPlayerMe plugin

The visual representation of conversational NPCs in TUMSphere utilized Ready Player Me avatars integrated through Convai's dedicated Ready Player Me plugin. This integration provided customizable, high-quality 3D character models with facial animation support, eliminating the need for custom character modeling while maintaining visual consistency across the platform.

4.6.1 Plugin installation and setup

The character mesh integration required installing two additional plugins beyond the core Convai plugin: the Convai Ready Player Me Plugin and the Convai Lip Sync Plugin, both compatible with Unreal Engine 5. These plugins were downloaded from Convai's documentation resources, extracted, and placed in a Plugins folder within the TUMSphere project directory. After restarting the Unreal Editor, the project recognized the new plugins and made their functionality available for character configuration.

The installation process required recompiling the project to integrate the plugin code with TUMSphere's existing Blueprint systems. Upon restart, the editor displayed confirmation messages indicating successful plugin integration, and new Blueprint parent classes became available for character implementation.

4.6.2 Character blueprint reconfiguration

To integrate Ready Player Me avatars, the NPC parent class was updated from ConvaiBaseCharacter to ConvaiReadyPlayerMeCharacter, enabling automatic model downloading and rendering. Customized avatars are fetched from the cloud at runtime, but this initialization typically completes in under one second. Because it occurs before the user encounters the NPC, immersion is maintained without the need for loading indicators. Future versions may use locally stored models to eliminate this minor “cold-start” delay entirely.

4.6.3 Lip Sync and facial animation

The Convai Lip Sync Plugin synchronizes synthesized speech with facial animations by mapping real-time phoneme data to the Ready Player Me avatars' pre-configured ARKit and Oculus Viseme blend shapes. As the text-to-speech system generates audio, the platform simultaneously produces timing data that automatically drives jaw and lip movements. This automated pipeline eliminates the need for manual rigging or animation authoring while maintaining conversational believability.
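As a rough illustration of this pipeline, the sketch below maps timed phoneme events onto viseme blend-shape weights. The phoneme-to-viseme table and ramp timings are hypothetical simplifications of what the plugin automates, not its actual implementation.

```python
# Simplified phoneme-to-viseme driving: each phoneme onset ramps its viseme
# weight up and back down again, approximating automated lip-sync animation.
PHONEME_TO_VISEME = {  # illustrative subset of an Oculus-style viseme set
    "AA": "viseme_aa", "IY": "viseme_I", "UW": "viseme_U",
    "F": "viseme_FF", "M": "viseme_PP", "S": "viseme_SS",
}

def viseme_weights(phoneme_events, current_time, attack=0.05, release=0.10):
    """Blend-shape weights at current_time for a list of (onset, phoneme) events."""
    weights = {v: 0.0 for v in PHONEME_TO_VISEME.values()}
    for onset, phoneme in phoneme_events:
        viseme = PHONEME_TO_VISEME.get(phoneme)
        if viseme is None:
            continue
        dt = current_time - onset
        if 0.0 <= dt < attack:                  # ramp up toward the peak
            weights[viseme] = max(weights[viseme], dt / attack)
        elif attack <= dt < attack + release:   # ramp back down
            weights[viseme] = max(weights[viseme], 1.0 - (dt - attack) / release)
    return weights
```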

5 Experimental setup and evaluation

To validate the efficacy of the TUMSphere platform and the integration of LLM-powered NPCs, we designed a two-fold evaluation strategy. This approach assesses both the pedagogical usability of the system from a student's perspective (User Study) and the technical viability of the architecture with respect to latency and resource consumption (Performance Evaluation).

5.1 User study design

The user study was designed to evaluate whether intelligent NPCs can effectively serve as campus guides and cultural mediators for international students. The study follows a mixed-methods approach, combining quantitative psychometric scales with qualitative semi-structured interviews.

5.1.1 Participants and demographics

We recruited a total of N = 24 participants from the freshman cohort of the Information Engineering program at TUM Campus Heilbronn. To align with our target demographic, the inclusion criteria required participants to be international students (non-German citizens) who had arrived in Germany within the past 6 months. The group consisted of 14 males and 10 females, aged 18–26 (M = 21.4, SD = 2.3). All participants had normal or corrected-to-normal vision. While 65% of participants reported having played video games previously, only 25% had prior experience with VR headsets.

5.1.2 Apparatus and environment

The study was conducted in a controlled lab environment at the university. The hardware setup consisted of a Meta Quest 3 headset connected via Air Link to a high-performance workstation (Intel Core i9-13900K, NVIDIA RTX 4090, 64GB RAM) running the packaged Unreal Engine 5 build of TUMSphere. This setup ensured that any performance bottlenecks were attributable to the software architecture or network latency rather than hardware limitations.

5.1.3 Experimental procedure

The procedure was divided into three phases, lasting approximately 45 min per participant:

  • Onboarding (10 min): Participants were briefed on safety protocols and completed a generic VR tutorial to familiarize themselves with locomotion and the “grab” mechanics. They were also introduced to the specific “push-to-talk” mechanic mapped to the controller “A” button.

  • Task Execution (25 min): Participants were asked to complete a “Campus Scavenger Hunt” scenario requiring interaction with three distinct NPC archetypes:

  • Task A (Information Retrieval): Locate the “Student Service Center” NPC and inquire about the deadline for semester fee payments and how to validate the student card.

  • Task B (Navigation): Find the “Buddy Program” NPC and ask for a guided tour to the nearest library. The participant had to follow the NPC as it physically navigated the NavMesh.

  • Task C (Language Practice): Engage the “German Language Tutor” NPC in a role-play scenario to order a coffee in German, utilizing the A1-level vocabulary set in the Knowledge Bank.

  • Post-Experience Evaluation (10 min): Participants completed the usability questionnaires and took part in a semi-structured interview. The interview transcripts were analyzed using a reflexive Thematic Analysis approach: two researchers coded the transcripts independently, reaching thematic saturation at N = 24, and inter-rater agreement (Cohen's κ = 0.82) confirmed that the reported themes were systematically derived rather than anecdotal.

5.1.4 Metrics and questionnaires

To quantify the user experience, we employed standard usability instruments:

  • System Usability Scale (SUS): To measure the perceived usability of the VR interface and interaction mechanics.

  • User Experience Questionnaire (UEQ): To assess the experience across 6 scales (Attractiveness, Perspicuity, Efficiency, Dependability, Stimulation, and Novelty).

  • Igroup Presence Questionnaire (IPQ): Specifically to measure the sense of spatial presence and the realism of the NPC interactions.

  • Task Completion Rate (TCR): Binary measurement (Success/Fail) of whether the participant successfully retrieved the correct information from the NPCs.

5.1.5 Statistical reliability analysis

To ensure the internal consistency of the psychometric instruments within our specific international student cohort (N = 24), we calculated Cronbach's alpha (α) for the SUS, IPQ, and each UEQ subscale. Following standard psychometric practice, a coefficient of α≥0.70 was established as the threshold for acceptable reliability, confirming that the questionnaire items consistently measured the intended usability and presence constructs.
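For reference, Cronbach's alpha can be computed directly from the raw participants-by-items response matrix. The short sketch below shows the standard calculation; it is illustrative rather than the exact analysis script used in the study.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (participants x items) matrix of numeric responses."""
    k = scores.shape[1]                               # number of items in the scale
    item_variances = scores.var(axis=0, ddof=1)       # variance of each item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Example: alpha for the 10 SUS items of the N = 24 cohort (a 24 x 10 matrix)
# alpha_sus = cronbach_alpha(sus_responses)
```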

5.2 Performance evaluation

While user satisfaction is paramount, the technical feasibility of real-time conversational AI in VR relies heavily on low latency. High delays between a user's question and the NPC's answer can break the “illusion of presence” and induce cognitive dissonance. Therefore, we conducted a technical performance benchmark focusing on response latency and frame rate stability.

5.2.1 Latency measurement methodology

We defined the key metric as Time-to-Audio (TTA)—the duration from the moment the user releases the “talk” button (sending the audio buffer) to the moment the engine renders the first audio frame. This “end-to-end” latency encompasses four distinct stages:

  • Upload & STT: Transmission of audio to Convai servers and transcription (Whisper).

  • LLM Inference: Generation of the text response (GPT-3.5/4).

  • TTS Synthesis: Conversion of text to audio.

  • Download & Buffering: Receiving the audio stream in Unreal Engine.

We instrumented the Blueprint code to log timestamps at each stage of the network request. We ran a series of 50 automated test queries ranging from short greetings (“Hello”) to complex, retrieval-heavy questions (“Explain the module structure for Information Engineering”).
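A minimal sketch of the aggregation step is shown below. The timestamp field names are hypothetical stand-ins for the values logged by the instrumented Blueprint nodes; TTS synthesis and download are grouped into a single stage, mirroring the breakdown reported in Section 6.1.1.

```python
from statistics import mean, stdev

def summarize_tta(records):
    """Aggregate per-stage durations and total Time-to-Audio over logged queries.

    Each record is a dict of timestamps (in seconds) with hypothetical keys:
    'button_released', 'stt_done', 'llm_done', 'first_audio_frame'.
    """
    durations = {"stt": [], "llm": [], "tts_and_network": []}
    totals = []
    for r in records:
        durations["stt"].append(r["stt_done"] - r["button_released"])
        durations["llm"].append(r["llm_done"] - r["stt_done"])
        durations["tts_and_network"].append(r["first_audio_frame"] - r["llm_done"])
        totals.append(r["first_audio_frame"] - r["button_released"])
    report = {stage: (mean(d), stdev(d)) for stage, d in durations.items()}
    report["time_to_audio"] = (mean(totals), stdev(totals))
    return report
```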

5.2.2 System resource profiling

To ensure the integration does not degrade the VR experience, we profiled the application using Unreal Insights. We measured the Game Thread and Render Thread times during active conversation states. Since the Convai plugin handles network requests asynchronously, we hypothesized that the impact on frame rate (FPS) would be negligible. We targeted a stable 90 FPS, which is the native refresh rate of the Meta Quest 3, to prevent motion sickness.

5.2.3 Quantitative lip-sync assessment

To objectively quantify the synchronization between synthesized speech and facial animation, we introduce the Viseme-Audio Synchronization Error (VASE). This metric measures the temporal offset (Δt) between the onset of an audio phoneme and the peak activation of its corresponding viseme blendshape within the Unreal Engine environment.

The error is calculated as

\mathrm{VASE} = \frac{1}{N} \sum_{i=1}^{N} \lvert \Delta t_i \rvert = \frac{1}{N} \sum_{i=1}^{N} \lvert t_i^{\mathrm{viseme}} - t_i^{\mathrm{phoneme}} \rvert,

where t_i^{phoneme} is the onset time of the i-th audio phoneme, t_i^{viseme} is the time of peak activation of its corresponding viseme blend shape, and N is the number of phoneme events in the measured interaction.
Timestamps were extracted by logging the OnPhonemeReceived events from the Convai Lip Sync Plugin and correlating them with the Unreal Engine high-resolution audio clock. This allows us to assess whether network-induced latency or CPU-bound animation processing causes the mouth movements to lag behind the spatial audio playback.
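Consistent with the definition above, the following sketch computes VASE from paired phoneme-onset and viseme-peak timestamps. Pairing events by index is a simplification; the function is illustrative rather than the actual analysis code.

```python
from statistics import mean, stdev

def vase(phoneme_onsets, viseme_peaks):
    """Mean and SD of the absolute offset (s) between each phoneme onset and
    the peak activation time of its corresponding viseme (paired by index)."""
    offsets = [abs(v - p) for p, v in zip(phoneme_onsets, viseme_peaks)]
    return mean(offsets), stdev(offsets)

# e.g., vase(logged_phoneme_onsets, logged_viseme_peaks) -> (0.032, 0.008)
```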

6 Results

This section presents the findings from our performance benchmarking and user evaluation. We first analyze the system's technical viability with respect to response latency and rendering stability, followed by analyses of user engagement, usability scores, and qualitative feedback gathered during the campus simulation tasks.

6.1 System performance and latency analysis

A critical factor for immersion in conversational VR is end-to-end latency—the time elapsed between the user finishing their sentence and the NPC beginning to speak.

6.1.1 Response latency (Time-to-Audio)

We analyzed 50 interaction cycles categorized into “Short Queries” (greetings, simple confirmations) and “Complex Queries” (retrieval-augmented questions requiring Knowledge Bank access). The breakdown of processing time across the pipeline stages is summarized in Table 4.

Table 4

Query type          | STT processing | LLM inference | TTS and network | Total Time-to-Audio
Short (greetings)   | 0.45 s         | 0.62 s        | 0.35 s          | 1.42 s (±0.2)
Complex (RAG-based) | 0.55 s         | 1.85 s        | 0.50 s          | 2.90 s (±0.6)

Average end-to-end latency breakdown by query type (in seconds). Values are means across 50 trials; the final column reports the total Time-to-Audio latency.

The LLM inference stage is the most significant bottleneck, particularly for complex queries that require parsing the Knowledge Bank context. However, the average latency for standard interactions remained under 1.5 s, which users generally found acceptable for a “walkie-talkie” style interaction, though slightly delayed for a face-to-face conversation.

6.1.2 Frame rate stability

To ensure comfort in VR, maintaining a stable frame rate is essential. Profiling data from Unreal Insights confirmed that the asynchronous nature of the Convai plugin prevented the main game thread from locking during API calls.

  • Idle State: 90 FPS (Stable)

  • Active Conversation (Listening/Thinking): 89.4 FPS

  • Active Conversation (Speaking/Animating): 88.2 FPS

The slight dip during the speaking phase is attributed to the CPU cost of processing real-time lip-sync blendshapes on the Ready Player Me avatars. Still, performance remained well above the 72 FPS minimum threshold for comfortable VR experiences on the Meta Quest 3.

6.1.3 Lip-sync synchronization accuracy

Quantitative testing using the VASE metric revealed a mean synchronization error of 32 ms (SD = 8 ms) across the 50 interaction cycles. As shown in Table 5, the synchronization remains stable even during complex RAG-based queries, despite the higher end-to-end latency reported in Table 4.

These results indicate that the Convai Lip Sync Plugin effectively buffers phoneme timing data to match the audio stream, thereby avoiding the “uncanny valley” effect despite the cloud-based inference.

6.2 User experience and usability

Participant responses (N = 24) were analyzed using the System Usability Scale (SUS) and the User Experience Questionnaire (UEQ).

6.2.1 Task completion rates

Participants showed a high success rate in accomplishing the defined educational objectives, as shown in Figure 8.

  • Task A (Information Retrieval): 96% success rate. Participants successfully retrieved deadline information.

  • Task B (Navigation): 100% success rate. All participants successfully followed the NPC to the library.

  • Task C (German Language Practice): 83% success rate. Failures were primarily due to speech recognition misinterpreting heavy accents or hesitation during German phrasing.

Figure 8. Task completion rates for the three campus scavenger-hunt tasks.

6.2.2 Psychometric scores

The overall System Usability Scale (SUS) score was 76.4 (SD = 12.5), placing the TUMSphere NPC system in the “Good” to “Excellent” range of usability.

The User Experience Questionnaire (UEQ) results, visualized in Figure 9, indicate particularly high scores for Stimulation and Novelty, suggesting students found the AI interaction engaging and innovative. However, Dependability scored lower, reflecting occasional frustrations with speech recognition accuracy or variable response latency.

Figure 9. User Experience Questionnaire (UEQ) scale scores.

6.2.3 Instrument reliability

The internal consistency analysis confirmed that all employed scales reached the required reliability threshold for the study's demographics. The Cronbach's alpha values for the primary constructs are summarized in Table 6. These results indicate that the “Good” to “Excellent” usability scores reported (SUS = 76.4) are statistically reliable and not the result of inconsistent participant responding.

Table 5

Query context         | Mean VASE (ms) | Std. dev. (ms)
Standard interaction  | 28             | 5
Complex RAG retrieval | 36             | 11
Perception threshold  | < 45           | –

Quantitative lip-sync accuracy (VASE).

The particularly high reliability of the Hedonic Quality and Presence scales (α>0.80) reinforces the qualitative findings of high engagement and an “illusion of presence” experienced by students during NPC interactions as illustrated in Table 6.

Table 6

Instrument/scale                    | Number of items | Cronbach's alpha (α)
System Usability Scale (SUS)        | 10              | 0.84
UEQ (attractiveness)                | 6               | 0.79
UEQ (pragmatic quality)             | 12              | 0.76
UEQ (hedonic quality)               | 8               | 0.81
Igroup Presence Questionnaire (IPQ) | 14              | 0.82

Internal consistency (Cronbach's alpha) for psychometric scales.

6.3 Qualitative findings

Semi-structured interviews revealed three dominant themes regarding the interaction paradigm.

6.3.1 Theme 1: reduced social anxiety in language learning

A significant majority, 75% of participants, identified the NPC as a “social sandbox.” One participant (P12) noted: “With a real officer, I worry about my grammar more than the deadline. The robot doesn't mind if I repeat myself three times.” This confirms the agent's role as a cultural mediator.

6.3.2 Theme 2: the “Thinking” silence

While the latency was technically managed, the silence during the “Thinking” phase (approx. 2–3 s for complex answers) caused social awkwardness. Forty percent of participants reported being unsure if the system had heard them. “I didn't know if I should speak again or wait. A simple 'Let me check that for you' or a thinking animation would help,” suggested one participant.

6.3.3 Theme 3: spatial guidance vs. verbal instructions

Participants overwhelmingly preferred the embodied navigation (Task B) over verbal descriptions. The NPC's ability to say “Follow me” and physically walk to the destination was cited as the most helpful feature for campus orientation, confirming the value of integrating the LLM with the Unreal Engine NavMesh.

6.3.4 Negative case analysis

Despite 96% success in information retrieval, the 17% failure rate in German practice (Task C) highlights a critical failure mode: the system's inability to parse heavy phonetic variation or the hesitant “filler” words common among A1-level speakers. This suggests that “inclusive” STT requires fine-tuning on non-native datasets rather than off-the-shelf models.

7 Discussion

The integration of Large Language Models into virtual reality environments represents a paradigm shift from static, pre-scripted educational experiences to dynamic, learner-centered interactions. This study aimed to evaluate the technical feasibility and pedagogical potential of this integration within TUMSphere. Our findings suggest that while LLM-powered NPCs significantly enhance engagement and provide a unique “safe space” for international students, technical challenges related to latency and non-verbal synchronization remain critical hurdles to seamless immersion.

7.1 Reducing social anxiety through artificial interlocutors

One of the most significant pedagogical findings of this study is AI agents' capacity to lower the threshold for social interaction. As indicated by our qualitative data, international students frequently cited the NPCs' “judgment-free” nature as a primary advantage. This aligns with findings by Pan et al. (2025) regarding ELLMA-T, confirming that virtual agents can effectively mitigate the “Foreign Language Anxiety” often experienced in real-world immersion.

Unlike traditional role-play partners, the TUMSphere NPCs allow students to pause, repeat questions, or make grammatical errors without fear of social embarrassment. This supports the hypothesis that VR combined with GenAI serves as an effective “sandbox” for social acclimatization. However, unlike pure language-learning applications, TUMSphere contextualizes these interactions within the specific administrative and spatial realities of the university campus. The high success rate in Task A (Information Retrieval) suggests that reduced anxiety directly correlates with effective information absorption, as students felt comfortable asking clarifying questions they might otherwise have hesitated to ask.

7.2 The latency-presence trade-off

From a technical perspective, the “Time-to-Audio” latency—averaging 1.42 s for simple and 2.90 s for complex queries—presents a nuanced challenge to the sense of presence. While our SUS scores indicate the system is usable, the qualitative feedback regarding “awkward silences” highlights a gap between user expectations set by human conversation (turn-taking gaps of ≈200ms) and the reality of cloud-based inference.

Our results contrast slightly with the lower latency reported by purely text-based implementations or local deployments like Nurse Town (Hu et al., 2025). The additional overhead in our architecture stems from the synchronization of the Knowledge Bank retrieval (RAG) and the phoneme-generation pipeline required for lip-sync. This introduces a “Latency-Presence Trade-off”: to provide accurate, institution-specific answers (high utility) and realistic visual articulation (high immersion), we currently sacrifice conversational fluidity.

Future iterations must address this by implementing “conversational fillers” (e.g., having the NPC nod or say “Let me check that...”) to mask the processing time, a technique suggested in conversational agent literature but not yet natively implemented in the standard plugin configuration used.
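A minimal sketch of this filler strategy is shown below, assuming an asynchronous request to the dialogue backend; npc.request_answer and npc.say are hypothetical wrappers around the backend call and TTS playback, not Convai API calls.

```python
import asyncio

FILLER_DELAY_S = 0.8  # play a filler if no answer has arrived by this point

async def answer_with_filler(npc, query):
    """Mask LLM latency by playing a short acknowledgment while waiting."""
    answer_task = asyncio.create_task(npc.request_answer(query))
    done, _pending = await asyncio.wait({answer_task}, timeout=FILLER_DELAY_S)
    if not done:                                   # response still pending
        await npc.say("Let me check that for you...")
    answer = await answer_task
    await npc.say(answer)
```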

7.3 Embodiment and spatial agency

Our study reinforces the necessity of embodiment in educational VR. The universal success of Task B (Navigation), where students followed the NPC, demonstrates that spatial agency is as important as verbal intelligence. An LLM that can only speak is essentially a “radio”; an LLM that can navigate the NavMesh becomes a “guide.”

This extends the work of McKern et al. (2024), who focused on gesture generation. In TUMSphere, the coupling of the NPC's movement with its verbal instructions (“Follow me to the library”) reduced the cognitive load required to translate verbal directions into spatial navigation. This suggests that the future of educational NPCs lies not just in better language models, but in tighter integration between the LLM's semantic understanding and the game engine's physics and navigation systems—allowing the AI to “act” in the world rather than just describe it.

7.4 Limitations

Several limitations of this study must be acknowledged. First, the sample size (N = 24) and the restriction to a single university campus limit the generalizability of the quantitative findings. Second, the study was conducted on high-end hardware (RTX 4090) via Air Link. Deploying this architecture on standalone headsets (e.g., Meta Quest 3 native) would likely introduce additional performance constraints not captured here.

Furthermore, while the Knowledge Bank significantly reduced hallucinations, it did not eliminate them. On rare occasions, the NPCs provided plausible but incorrect information about specific university deadlines that were not explicitly covered in the uploaded documents. This underscores the need for robust “guardrails” in educational AI, where misinformation can have real-world academic consequences.

Finally, we observed a “Novelty Effect” (reflected in the high UEQ Novelty scores). Long-term studies are required to determine whether student engagement persists once the initial wonder of talking to a virtual character fades, or whether the system's utility sustains usage throughout the semester.

8 Conclusion

This study presented TUMSphere, a novel implementation of an educational Virtual Reality environment that leverages Generative AI to transform static Non-Player Characters into dynamic, conversational campus guides. By integrating Unreal Engine 5 with the Convai platform, we demonstrated a scalable architecture for creating embodied AI agents capable of natural language interaction, spatial navigation, and context-aware assistance.

Our results indicate that the convergence of VR and LLMs offers substantial pedagogical value, particularly for international students navigating the complexities of a new academic environment. The system successfully provided a “safe,” low-anxiety space for practicing language skills and asking administrative questions, addressing the psychological barriers often associated with real-world help-seeking. Furthermore, the strong user preference for NPCs that can physically guide users through the virtual campus underscores the importance of embodiment—demonstrating that in VR, an AI agent must be more than just a chatbot; it must be an active participant in the 3D space.

However, the transition from prototype to production-ready educational tool reveals distinct challenges. The “latency-presence trade-off” remains a critical technical hurdle, where the cognitive dissonance caused by processing delays conflicts with the immersive realism of the visual environment.

9 Future directions

Building on the initial validation of TUMSphere, our future research and development will focus on three key pillars to move the platform from a successful prototype to a persistent, high-fidelity educational ecosystem.

9.1 Enhancing interaction fluidity and non-verbal realism

To mitigate the “thinking silence” identified in our user study, we plan to implement continuous listening modes and conversational fillers (e.g., hesitation markers or verbal acknowledgments) to mask processing latency. Beyond verbal improvements, we will activate the State of Mind and Embodied Actions modules. By establishing a dynamic emotional layer, NPCs can exhibit context-aware moods—such as enthusiasm when welcoming a student or concern when discussing safety procedures. These states will be synchronized with physical gestures and posture changes, leveraging the Unreal Engine animation system to reduce the “uncanny valley” effect and ensure that non-verbal cues authentically complement the synthesized speech.

9.2 Structured pedagogy through hybrid narrative design

While free-form conversation provides a robust social sandbox, academic orientation requires structured progression. We intend to leverage Narrative Design features to create hybrid interaction models. This framework will allow us to define specific conversation trees and branching paths that guide students toward critical educational objectives—such as mandatory safety briefings or administrative onboarding—without sacrificing natural language flexibility. Unlike the current initiative-driven model, these agents will be able to track which topics have been discussed (e.g., semester deadlines or student card validation) and proactively prompt the student to explore missing information, effectively blending scripted pedagogy with generative freedom.

9.3 Environmental awareness, persistence, and scalability

To deepen the NPC's connection to the user and the virtual environment, we will implement Mindview and Memory modules. Mindview will grant agents observational awareness, allowing them to comment on the student's spatial actions or items they interact with on campus. Furthermore, by utilizing persistent Memory settings, NPCs will be able to recognize returning students and reference previous interactions across different VR sessions, fostering a continuous support system that evolves with the student throughout their first semester.

Finally, we aim to optimize these pipelines for standalone VR hardware (e.g., Meta Quest native) and explore multi-user synchronization. This will enable groups of international students to participate in collective orientation sessions, transforming TUMSphere from an individual practice tool into a collaborative, social learning environment that grows alongside the students it supports.

Statements

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Ethics statement

Ethical review and approval were not required for the study involving human participants, in accordance with local legislation and institutional requirements. The research involved standard usability and user experience testing of an educational software application with healthy adult volunteers, posing no physical or psychological risks beyond those encountered in daily life or standard Virtual Reality usage. All participants provided written informed consent before their participation, acknowledging the voluntary nature of the study, their right to withdraw at any time, and the anonymized processing of their data, in full compliance with the Declaration of Helsinki and the General Data Protection Regulation (GDPR). Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.

Author contributions

SB-G: Writing – review & editing, Funding acquisition, Writing – original draft, Formal analysis, Conceptualization, Methodology, Investigation. SW: Supervision, Writing – review & editing.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This research was financially supported by the TUM Campus Heilbronn Incentive Fund 2024 of the Technical University of Munich, TUM Campus Heilbronn (5420023). We gratefully acknowledge their support, which provided the essential resources and opportunities to conduct this study.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that generative AI was not used in the creation of this manuscript.


Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Azizo, A. S. B., Mohamed, F. B., Siang, C. V., and Isham, M. I. M. (2020). “Virtual reality 360 UTM campus tour with voice commands,” in 2020 6th International Conference on Interactive Digital Media (ICIDM) (Xianyang: IEEE), 1-6. doi: 10.1109/ICIDM51048.2020.9339665

2. Barmpari, A., Voyiatzaki, E., and Hatzilygeroudis, I. (2026). “An educational virtual world system with gamification features and LLM-guided NPCs,” in Generative Systems and Intelligent Tutoring Systems, eds. S. Graf and A. Markos (Cham: Springer Nature Switzerland), 213-223. doi: 10.1007/978-3-031-98281-1_17

3. Berrezueta-Guzman, S., Koshelev, A., and Wagner, S. (2025). “From reality to virtual worlds: the role of photogrammetry in game development,” in 2025 IEEE Gaming, Entertainment, and Media Conference (GEM) (Kaohsiung: IEEE), 1-6. doi: 10.1109/GEM66882.2025.11155764

4. Berrezueta-Guzman, S., and Wagner, S. (2026). Choosing the right engine in the virtual reality landscape. IEEE Access 14, 13972-13985. doi: 10.1109/ACCESS.2026.3657272

5. Chen, S., Xu, X., Zhang, H., and Zhang, Y. (2023). “Roles of ChatGPT in virtual teaching assistant and intelligent tutoring system: opportunities and challenges,” in Proceedings of the 2023 5th World Symposium on Software Engineering, WSSE '23 (New York, NY: Association for Computing Machinery), 201-206. doi: 10.1145/3631991.3632024

6. Chen, Y., Ding, N., Zheng, H.-T., Liu, Z., Sun, M., and Zhou, B. (2024). “Empowering private tutoring by chaining large language models,” in Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM '24 (New York, NY: Association for Computing Machinery), 354-364. doi: 10.1145/3627673.3679665

7. Convai (2024a). Adding Lipsync to MetaHuman – Convai Unreal Engine Plugin Guide. San Jose, CA: Convai (Accessed October 14, 2025).

8. Convai (2024b). Convai API Documentation. San Jose, CA: Convai (Accessed October 14, 2025).

9. Damianova, N., and Berrezueta-Guzman, S. (2025). Serious games supported by virtual reality – literature review. IEEE Access 13, 38548-38561. doi: 10.1109/ACCESS.2025.3544022

10. Dong, B., Bai, J., Xu, T., and Zhou, Y. (2024). “Large language models in education: a systematic review,” in 2024 6th International Conference on Computer Science and Technologies in Education (CSTE) (Xi'an: IEEE), 131-134. doi: 10.1109/CSTE62025.2024.00031

11. Du, W., Xu, Z., and Dang, T. (2025). “Research on the application of virtual reality technology in the field of education,” in 2025 5th International Conference on Artificial Intelligence and Education (ICAIE) (Suzhou: IEEE), 541-545. doi: 10.1109/ICAIE64856.2025.11158564

12. Epic Games (2024). Virtual Reality Development Documentation. Unreal Engine Documentation (Accessed October 14, 2025).

13. Garcia, M., Mansul, D., Pempina, E., Perez, M., and Adao, R. (2023). “A playable 3D virtual tour for an interactive campus visit experience: showcasing school facilities to attract potential enrollees,” in 2023 9th International Conference on Virtual Reality (ICVR) (Xianyang: IEEE), 461-466. doi: 10.1109/ICVR57957.2023.10169768

14. Gonzales, W. D. W., Shen, D. J., Yan, A., Xie, N., Francisco, M. L., and Wong, P. P. Y. (2025). “AI NPCs in an educational metaverse: evaluating the effectiveness of prompt templates for contextual interactions,” in Innovating Education with AI, ed. E. C. K. Cheng (Singapore: Springer Nature Singapore), 53-74. doi: 10.1007/978-981-96-4952-5_4

15. Guevarra, M., Bhattacharjee, I., Das, S., Wayllace, C., Epp, C. D., Taylor, M. E., et al. (2025). “An LLM-guided tutoring system for social skills training,” in Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence, AAAI'25/IAAI'25/EAAI'25 (Washington, DC: AAAI Press), 29643-29645. doi: 10.1609/aaai.v39i28.35353

16. Hu, Y., Xiong, Q., Yi, L., and Yoon, I. (2025). “Nurse Town: an LLM-powered simulation game for nursing education,” in 2025 IEEE Conference on Artificial Intelligence (CAI) (Santa Clara, CA: IEEE), 215-222. doi: 10.1109/CAI64502.2025.00041

17. Hua, C., and Wang, J. (2023). Virtual reality-assisted language learning: a follow-up review (2018-2022). Front. Psychol. 14:1153642. doi: 10.3389/fpsyg.2023.1153642

18. Hussein, R., Zhang, Z., Amarante, P., Hancock, N., Orduna, P., and Rodriguez-Gil, L. (2024). “Integrating personalized AI-assisted instruction into remote laboratories: enhancing engineering education with OpenAI's GPT models,” in 2024 IEEE Frontiers in Education Conference (FIE) (Washington, DC: IEEE), 1-7. doi: 10.1109/FIE61694.2024.10892918

19. Konecki, M., Konecki, M., and Vlahov, D. (2023). “Using virtual reality in education of programming,” in 2023 11th International Conference on Information and Education Technology (ICIET) (Fujisawa: IEEE), 39-43. doi: 10.1109/ICIET56899.2023.10111156

20. Leon, M. (2025). GPT-5 and open-weight large language models: advances in reasoning, transparency, and control. Inform. Syst. 136:102620. doi: 10.1016/j.is.2025.102620

21. Levidze, M. (2024). Mapping the research landscape: a bibliometric analysis of e-learning during the COVID-19 pandemic. Heliyon 10:e33875. doi: 10.1016/j.heliyon.2024.e33875

22. Lin, A. J., and Cheng, F. F. (2024). “Virtual reality game for science education,” in 2024 5th International Conference on Computer Science, Engineering, and Education (CSEE) (Shanghai: IEEE), 8-12. doi: 10.1109/CSEE63195.2024.00010

23. Lin, X. P., Li, B. B., Yao, Z. N., Yang, Z., and Zhang, M. (2024). The impact of virtual reality on student engagement in the classroom: a critical review of the literature. Front. Psychol. 15:1360574. doi: 10.3389/fpsyg.2024.1360574

24. Liu, Z., Zhu, Z., Zhu, L., Jiang, E., Hu, X., Peppler, K. A., et al. (2024). “ClassMeta: designing interactive virtual classmate to promote VR classroom participation,” in Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI '24 (New York, NY: Association for Computing Machinery), 1-17. doi: 10.1145/3613904.3642947

25. Luo, H., Gao, F., Fang, K., Liu, D., Lin, Z., and Chan, W. K. V. (2024). “Study with Confucius: an AI-based immersive educational game with multiple educational modes,” in SIGGRAPH Asia 2024 Educator's Forum (New York, NY: Association for Computing Machinery), 1-6. doi: 10.1145/3680533.3697066

26. McKern, A., Mayer, A., Greif, L., Chardonnet, J.-R., and Ovtcharova, J. (2024). “AI-based interactive digital assistants for virtual reality in educational contexts,” in 2024 IEEE 3rd German Education Conference (GECon) (Munich: IEEE), 1-5. doi: 10.1109/GECon62014.2024.10734030

27. Minaee, S., Mikolov, T., Nikzad, N., Chenaghlu, M., Socher, R., Amatriain, X., et al. (2024). Large language models: a survey. arXiv preprint arXiv:2402.06196. doi: 10.48550/arXiv.2402.06196

28. Mordor Intelligence (2024). Virtual Reality (VR) Market in Education - Growth, Trends, COVID-19 Impact, and Forecasts (2025-2030). Hyderabad: Mordor Intelligence (Accessed October 13, 2025).

29. Nnoli, I. (2024). Spotlight: Convai Reinvents Non-Playable Character Interactions. NVIDIA Developer Blog (Accessed October 14, 2025).

30. OpenAI (2024). Compare Models – OpenAI API Documentation. San Francisco, CA: OpenAI (Accessed October 14, 2025).

31. Özkaya, S., Berrezueta-Guzman, S., and Wagner, S. (2025). How LLMs are shaping the future of virtual reality. IEEE Access 13, 193335-193355. doi: 10.1109/ACCESS.2025.3631594

32. Pan, M., Kitson, A., Wan, H., and Prpa, M. (2025). “ELLMA-T: an embodied LLM-agent for supporting English language learning in social VR,” in Proceedings of the 2025 ACM Designing Interactive Systems Conference, DIS '25 (New York, NY: Association for Computing Machinery), 576-594. doi: 10.1145/3715336.3735786

33. Peixoto, B., Pinto, R., Melo, M., Cabral, L., and Bessa, M. (2021). Immersive virtual reality for foreign language education: a PRISMA systematic review. IEEE Access 9, 48952-48962. doi: 10.1109/ACCESS.2021.3068858

34. Ready Player Me (2024). Ready Player Me Documentation. Ready Player Me (Accessed October 14, 2025).

35. Salim, M., and Khalilov, S. (2024). “Developing a virtual TIU campus tour: integrating 3D visualization of university facilities in VR,” in 2024 21st International Multi-Conference on Systems, Signals & Devices (SSD) (Erbil: IEEE), 540-544. doi: 10.1109/SSD61670.2024.10548711

36. Song, Y., Wu, K., and Ding, J. (2024). Developing an immersive game-based learning platform with generative artificial intelligence and virtual reality technologies – “LearningverseVR”. Comput. Educ.: X Reality 4:100069. doi: 10.1016/j.cexr.2024.100069

37. Tracy, K., and Spantidi, O. (2025). Impact of GPT-driven teaching assistants in VR learning environments. IEEE Trans. Learn. Technol. 18, 192-205. doi: 10.1109/TLT.2025.3539179

38. Truchly, P., Medvecký, M., Podhradský, P., and Vančo, M. (2018). “Virtual reality applications in STEM education,” in 2018 16th International Conference on Emerging eLearning Technologies and Applications (ICETA) (Stary Smokovec: IEEE), 597-602. doi: 10.1109/ICETA.2018.8572133

39. Vallance, M. (2023). Independently supporting learners in VR with an AI-enabled non-player character (NPC). Immers. Learn. Res. - Pract. 1, 69-73. doi: 10.56198/ITIG2WMWY

40. Viitaharju, P., Nieminen, M., Linnera, J., Yliniemi, K., and Karttunen, A. J. (2023). Student experiences from virtual reality-based chemistry laboratory exercises. Educ. Chem. Eng. 44, 191-199. doi: 10.1016/j.ece.2023.06.004

41. Wan, H., Zhang, J., Suria, A. A., Yao, B., Wang, D., Coady, Y., et al. (2024). “Building LLM-based AI agents in social virtual reality,” in Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA '24 (New York, NY: Association for Computing Machinery), 1-8. doi: 10.1145/3613905.3651026

42. Wang, Z., Chu, Z., Doan, T. V., Ni, S., Yang, M., and Zhang, W. (2025). History, development, and principles of large language models: an introductory survey. AI Ethics 5, 1955-1971. doi: 10.1007/s43681-024-00583-7

43. Wen, Q., Liang, J., Sierra, C., Luckin, R., Tong, R., Liu, Z., et al. (2024). “AI for education (AI4EDU): advancing personalized education with LLM and adaptive learning,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '24 (New York, NY: Association for Computing Machinery), 6743-6744. doi: 10.1145/3637528.3671498

44. Zhu, X. T., Cheerman, H., Cheng, M., Kiami, S. R., Chukoskie, L., and McGivney, E. (2025). “Designing VR simulation system for clinical communication training with LLMs-based embodied conversational agents,” in Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (Yokohama: ACM), 1-9. doi: 10.1145/3706599.3719693

Summary

Keywords

Convai, educational games, virtual reality, lip-sync, LLMs, NPCs, speech recognition, speech-to-text

Citation

Berrezueta-Guzman S and Wagner S (2026) Next-Gen orientation: supporting international students with generative AI NPCs in VR. Front. Comput. Sci. 8:1799323. doi: 10.3389/fcomp.2026.1799323

Received

29 January 2026

Revised

19 February 2026

Accepted

20 February 2026

Published

12 March 2026

Volume

8 - 2026

Edited by

Salvador Otón Tortosa, University of Alcalá, Spain

Reviewed by

Vladimir Robles-Bykbaev, Salesian Polytechnic University, Ecuador

Martín López Nores, University of Vigo, Spain

Copyright

*Correspondence: Santiago Berrezueta-Guzman,
