TECHNOLOGY AND CODE article

Front. Comput. Sci., 25 February 2026

Sec. Human-Media Interaction

Volume 8 - 2026 | https://doi.org/10.3389/fcomp.2026.1777749

Research on an intelligent tutoring system based on automatic construction of multimodal knowledge graphs and retrieval-augmented generation

  • College of Computer, Guangdong University of Science and Technology, Dongguan, China


Abstract

As a key application of technology-enhanced learning, Intelligent Tutoring Systems have long been constrained by bottlenecks such as expert-dependent, costly manual knowledge base construction and difficulties in adapting to unstructured teaching resources. Concurrently, generative large language models face challenges in educational question-answering, including factual inaccuracies and insufficient logical reasoning capabilities. To address these issues, this study proposes a framework for an Intelligent Tutoring System based on the automatic construction of multimodal knowledge graphs and Retrieval-Augmented Generation (RAG). The system integrates technologies such as FFmpeg, Whisper, OCR, and layout analysis to establish a pipeline for the fully automatic extraction and construction of knowledge graphs not only from course videos but also from textbook PDFs, integrating auditory information from videos with visual and textual knowledge from textbooks. Building on this foundation, the framework combines graph retrieval and vector retrieval strategies, leveraging the RAG mechanism to drive large language models in generating accurate and explainable question-answering content. Experimental results demonstrate that the proposed system achieves positive results in knowledge graph construction quality, the average accuracy and relevance of intelligent Q&A responses, overall user satisfaction, and system performance. Beyond automation, its core innovation is a cross-modal fusion mechanism that aligns and integrates knowledge from auditory explanations and visual-textual textbook content, thereby creating a unified, instructionally structured knowledge graph. Thus, this study provides a feasible and innovative path from multimodal resources to intelligent services for Intelligent Tutoring Systems, holding significant practical implications for advancing personalized learning.

1 Introduction

1.1 Research background and significance

The rapid development of artificial intelligence technology is profoundly transforming the field of education, bringing unprecedented opportunities for technology-enhanced learning. The Intelligent Tutoring System (ITS) is a learning assistance system grounded in constructivist learning theory, integrating multidisciplinary perspectives from pedagogy, cognitive science, and computer science. It simulates the role of a teacher to engage in real-time, personalized interaction with learners, thereby facilitating effective learning processes (Du and Wang, 2025).

In terms of personalized learning, ITS can dynamically estimate and predict a learner’s state of knowledge mastery through knowledge tracing techniques, thereby establishing an accurate learner model. This ability to present learning resources and paths based on the learner’s individual characteristics helps to stimulate and sustain their learning confidence and motivation, increasing their engagement with the system (Lu et al., 2021). With the integration of technologies like affective computing, modern ITS is evolving toward affective tutoring systems, which provide more personalized instructional services by considering learners’ emotions, interests, and learning styles (Gong et al., 2019).

Despite demonstrating significant potential for personalized instruction, the development of ITS still faces numerous challenges. The models within ITS require regular updates and validation by domain experts, leading to long development cycles and high maintenance costs. Consequently, traditional methods struggle, especially when confronted with massive, unstructured modern educational resources like online course videos. Furthermore, although large language models possess powerful text generation capabilities, their inherent issues of “hallucination” and the lack of domain-specific knowledge anchoring become prominent when dealing with complex student questions requiring multi-step reasoning or the association of different knowledge points, leading to a trust crisis in educational scenarios demanding high precision (Kasneci et al., 2023).

In response to the aforementioned challenges, this study aims to address technical difficulties across three levels: first, the automatic extraction and fusion of multimodal knowledge—how to identify and extract knowledge points from multimodal data such as video, audio, and text; second, the structured representation and organization of knowledge—how to transform extracted knowledge into a computable, inferable form within a knowledge graph; third, intelligent tutoring based on the knowledge graph—how to leverage structured knowledge to provide learners with precise, personalized learning support and adaptive guidance.

This study explores a new paradigm for knowledge base construction in ITS through the technologies of automatic multimodal knowledge graph construction and Retrieval-Augmented Generation (RAG). It aims to provide innovative ideas and technical solutions to address the bottleneck issues associated with traditional ITS knowledge base development.

1.2 Literature review

  • 1. Development and current status of intelligent tutoring systems.

Intelligent technologies are driving a profound transformation in educational paradigms toward data-driven, human-computer collaborative, and online-offline integrated directions. Zhu et al. (Zhu and Peng, 2020; Zhu and Hu, 2021) pointed out that intelligent technologies, through data circulation, information connectivity, and service integration, are driving the formation of a new teaching paradigm characterized by bidirectional interweaving and boundless sharing. Realizing this vision requires careful instructional design, guided by advanced concepts to leverage the effectiveness of technology. This macro background has charted the direction for the development of Intelligent Tutoring Systems (ITS): namely, to construct a comprehensive learning support environment capable of sensing context, integrating data, and providing precise intervention.

Multiple cutting-edge technologies have demonstrated great potential in creating learning experiences and facilitating knowledge interaction, providing ITS with a rich toolkit. Huang et al. (2021) systematically proposed that artificial intelligence can empower students in three main scenarios: knowledge acquisition, self-directed learning, and learning companionship. In enhancing learning immersion and motivation, Cui and Zhao (2020) confirmed through meta-analysis that virtual reality technology can significantly improve students’ learning performance; Wang’s (2020) research also indicated that augmented reality technology positively promotes learning in various subject areas, including mathematics. Lu et al. (2023) explored the application value of generative AI, represented by ChatGPT, in areas such as exercise generation and automatic problem-solving. The team led by Professor Niels Pinkwart at Humboldt University of Berlin (Liu et al., 2018) utilized perceptual data to construct learner-centered adaptive environments. These technologies collectively enrich the interactive dimensions and support mechanisms of ITS.

The core development of ITS is reflected in the deepening and integration of three key pillars: knowledge tracing, personalized tutoring, and intelligent assessment. Lu et al. (2021) noted that with the maturation of deep learning technologies, the value of knowledge tracing models in core educational scenarios such as adaptive learning and resource recommendation has become increasingly prominent. In the area of personalized tutoring, Wei et al. (2025) integrated generative AI with the “Cone of Experience” theory to design a generative multi-agent tutoring system. They experimentally verified its effectiveness in supporting students through the “teaching-learning-guiding” triad of intelligent agents, offering new ideas to overcome the adaptability shortcomings of traditional systems. Regarding intelligent assessment, Liu et al. (2021) demonstrated that intelligent technologies can empower educational assessment, providing comprehensive and effective decision-making basis for teaching improvement, thus modernizing and professionalizing educational evaluation. Wu et al. (2021) constructed an AI evaluation pathway focusing on classroom language, behavior, and emotion analysis; Luo et al. (2021) noted progress in intelligent assessment for capability evaluation and mental health; Hu et al. (2021), based on frontier AI technologies and the principles of higher education teaching evaluation, designed an artificial intelligence-based education evaluation and intervention system. Experiments proved that this system outperforms traditional methods across multiple dimensions. Zhang et al. (2021a), referencing the architectural designs of ITS and collaborative learning management systems, ultimately developed a formal modeling and intelligent computing general architecture for classroom teaching evaluation.

Furthermore, institutions such as Carnegie Mellon University and Tsinghua University have also employed intelligent technologies to enhance classroom teaching evaluation (Zhang et al., 2021b).

However, the further development of ITS still faces several challenges. For instance, existing generative models have limitations in cognitive depth and trustworthiness. Lu et al. (2023) explicitly pointed out that they struggle to fully understand the internal logic of information, are prone to generating unreasonable or factually incorrect content, and their decision-making processes resemble a “black box,” lacking explainability. Additionally, comprehension and expression biases in the Chinese context, as well as security risks from malicious use of the technology, are obstacles that must be overcome to achieve large-scale, trustworthy application.

  • 2. Construction methods for educational knowledge graphs.

Educational knowledge graphs process vast amounts of disordered, structurally complex multimodal educational data, transforming them into structured knowledge systems that can be directly utilized. This serves as an effective means for modeling educational knowledge in the era of artificial intelligence. Their construction methods are primarily divided into two paradigms: manual construction and (semi-)automatic construction:

  • (1) Manual construction.

Manual construction typically adopts a “top-down” approach, where domain experts and knowledge engineers collaboratively define an ontology schema and manually input structured knowledge into the system. This method ensures high precision, logical rigor, and pedagogical soundness within the knowledge system. However, its knowledge engineering bottleneck is particularly prominent: the process is labor-intensive, time-consuming, expensive, and exhibits poor scalability. It struggles to adapt to rapid knowledge updates and is ill-equipped to handle the processing demands of massive unstructured educational resources. Consequently, it is primarily suited for small-scale scenarios where the knowledge system is stable and domain boundaries are clearly defined.

  • (2) (Semi-)automatic construction.

This approach utilizes natural language processing, information extraction, and machine learning techniques to automatically extract entities and relationships from unstructured and semi-structured texts (such as textbooks, research papers, and video transcripts) to construct the graph. In recent years, some deep learning models like GRU (Yan et al., 2020), LSTM (Guo et al., 2019), and BERT (Chang et al., 2021; Jia et al., 2020) have achieved satisfactory results in named entity recognition. Employing these technologies for knowledge graph construction can significantly save human resources and time costs, thereby improving work efficiency.

However, this construction method still faces certain issues. For instance, there is a lack of research on multimodal resource fusion (Zhao et al., 2023). Linking multimodal data to the educational knowledge graph, serving as an extension of textual entities, can greatly enrich the representational forms of knowledge within the graph and meet the diversified needs of intelligent educational applications (Gao and Zhang, 2022).

  • 3. Knowledge base-based automatic question answering systems.

Knowledge base-based automatic question answering refers to systems whose retrieval content relies on a specific knowledge base, which is generally composed of multiple structured triples (e.g., Entity1, Entity2, Relation_Type). This method primarily involves preprocessing the user’s input question, designing algorithms to parse the core semantics of the query, retrieving and ranking candidate answers, and finally presenting the precise answer to the user. Chen and Li (2020) proposed a Transformer-based deep attention semantic matching method for relation detection in questions for knowledge base question answering. Hua et al. (2020) proposed a question-answering system composed of a neural generator and a symbolic executor, capable of solving complex problems in knowledge base question answering with few samples. Zhang et al. (2021b) proposed an end-to-end model based on Bayesian neural networks, which enhances the reliability and interpretability of knowledge base question answering by estimating uncertainties arising from the model and the data.

Currently, there is relatively little research on automatic question answering based on knowledge graphs conducted specifically within the educational domain. There is a lack of end-to-end, complete solutions spanning from video content to intelligent question answering.

  • 4. Graph retrieval-augmented generation and its educational potential.

While the aforementioned systems demonstrate the value of structured knowledge, the paradigm of Retrieval-Augmented Generation (RAG) has evolved to directly leverage graph structures to ground large language models (LLMs), an approach known as Graph Retrieval-Augmented Generation (Graph RAG). This paradigm aims to mitigate LLM hallucinations by retrieving interconnected subgraphs that provide contextual logic and evidence chains, rather than isolated text snippets. Recent studies in high-stakes domains underscore its effectiveness.

For instance, in combating health misinformation, a GraphRAG framework was developed to construct dynamic semantic knowledge graphs from the latest medical news, retrieving relevant subgraphs to provide LLMs with structured, timely evidence for fact-checking (Hang et al., 2024). This highlights the key strength of graph retrieval in scenarios demanding high factual accuracy: the ability to reason over a network of interlinked facts. Similarly, in the medical domain, research has formalized the retrieval of evidential paths from large-scale knowledge graphs to generate well-grounded, evidence-attributed answers (Wu et al., 2025). These works demonstrate how Graph RAG directly addresses the explainability and trustworthiness issues of general-purpose LLMs by leveraging predefined, verifiable relationships.

The hybrid retrieval-augmented paradigm proposed in this study is situated as an adaptation of Graph RAG for educational tutoring. The aforementioned systems primarily rely on structured graph retrieval, which excels at ensuring logical consistency but may be less flexible with diverse, narrative-style educational content where key pedagogical explanations are not fully captured by predefined schemas. Conversely, pure vector retrieval offers semantic flexibility but lacks inherent logical structure.

Our framework innovates by fusing precise relational queries from the graph database (Neo4j) with semantic similarity matching from the vector database. This hybrid strategy inherits the logical grounding benefits from established Graph RAG paradigms, ensuring answers are anchored in the curriculum’s knowledge structure. Simultaneously, it incorporates the strength of vector retrieval to capture broader contextual nuances and illustrative examples from video transcripts. Therefore, this work extends the Graph RAG paradigm by proposing a dual-retrieval mechanism tailored to the multimodal and didactic needs of intelligent tutoring systems (ITS), aiming to achieve a superior balance between factual precision and explanatory richness.

1.3 Contributions and innovations of this work

This paper aims to address the challenges associated with the automated construction of domain knowledge for Intelligent Tutoring Systems (ITS). It innovatively proposes an end-to-end ITS framework that enables the fully automated process from constructing a knowledge graph from course videos to providing intelligent question-answering. The core contributions of this research can be summarized into three points:

  • A Cross-modal Knowledge Fusion Pipeline: The system introduces a dedicated process for aligning and merging entities and relations extracted from video transcripts and textbook PDFs. This fusion ensures that key concepts are represented by consolidated knowledge from both narrative explanations and formal textual/visual definitions, enhancing the pedagogical adequacy of the constructed graph beyond unimodal extraction.

  • Pioneering a Hybrid Retrieval-Augmented Paradigm for the Question-Answering Core: Through the fusion of precise relational queries from a graph database and semantic similarity matching from a vector database—which subsequently drives a Retrieval-Augmented Generation (RAG) engine—the framework ensures that the generated answers are not only grounded in the logical underpinnings of the knowledge graph but are also enriched with flexible contextual semantic information.

  • Focus on Directly Empowering Rich Media Video as the Application Source: The framework directly utilizes massive, unstructured course videos as the starting point for knowledge processing. This approach broadens the applicability boundaries of ITS and provides critical technical support for its deep integration and widespread application within modern online education.

2 System architecture and design

2.1 Overall architecture design

The term “multimodal” in this work specifically refers to the integration of two primary sources: (1) auditory content from instructional videos, and (2) visual-textual content from textbook PDFs.

To address the core objectives of “automatic construction of multimodal knowledge graphs” and “intelligent question-answering tutoring,” this study designs an overall technical architecture featuring a dual-layer interaction between “Knowledge Construction” and “Q&A Interaction” (as shown in Figure 1). The layers are interconnected through mechanisms of “Knowledge Base Invocation” and a “Data Feedback Loop.” This design achieves end-to-end process coverage from “Multimodal Educational Resource Input” to “Personalized Tutoring Q&A Output,” providing the Intelligent Tutoring System with comprehensive technical support across the entire pipeline of “Knowledge Structuring” and “Intelligent Interaction.”

Figure 1

2.2 PDF textbook parsing and visual knowledge extraction module

To fully leverage the rich visual and structured knowledge contained in teaching materials, the system incorporates a dedicated pipeline for processing textbook PDFs. This module complements the video-derived knowledge by extracting entities and relationships from textual descriptions, diagrams, formulas, and layout structures. The processing pipeline consists of the following steps:

  • Document Structure Parsing: Utilizing PDF processing libraries (e.g., PyMuPDF), the system extracts raw text streams, image blocks, and layout metadata (e.g., titles, paragraphs, captions) from each page.

  • OCR Enhancement: For scanned PDFs, an Optical Character Recognition (OCR) engine (e.g., PaddleOCR) is employed to convert image-based content into machine-readable text, which is then aligned with native digital text where applicable.

  • Diagram and Formula Detection: Based on visual features and spatial layout analysis, regions containing figures, charts, and mathematical formulas are identified. Key visual elements are either stored as multimedia attachments or, for formulas, converted into structured representations (e.g., LaTeX).

  • Semantic Chunking and Annotation: The extracted content is segmented into coherent semantic chunks (e.g., “definition,” “theorem,” “example,” “exercise”) based on hierarchical headings and typographical cues. Each chunk is tagged with its semantic type and source location (page number).

  • Knowledge Triple Extraction: Similar to the video processing pipeline, a BERT-based model combined with rule-based patterns is applied to the textual chunks to extract entities and relationships. Crucially, diagrams and formulas are linked as multimedia attributes to their relevant entity nodes (e.g., a “binary tree” entity may link to its illustrative diagram).

The extracted triples and semantic chunks are then fed into the subsequent knowledge fusion stage, where entities from PDFs are disambiguated and aligned with those extracted from video transcripts, forming a unified multimodal knowledge graph.
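The semantic chunking and annotation step described above can be sketched as a simple cue-based tagger. The cue phrases, chunk labels, and function name below are illustrative assumptions for this sketch, not the system's actual heading and typography rules:

```python
import re

# Illustrative cue patterns for tagging textbook chunks; the real system
# would also use hierarchical headings and typographical features.
CUE_PATTERNS = {
    "definition": re.compile(r"^(definition|we define)", re.I),
    "theorem":    re.compile(r"^(theorem|lemma|proposition)", re.I),
    "example":    re.compile(r"^(example|for instance)", re.I),
    "exercise":   re.compile(r"^(exercise|problem)\b", re.I),
}

def tag_chunk(text: str, page: int) -> dict:
    """Assign a semantic type and source location (page number) to one chunk."""
    for label, pattern in CUE_PATTERNS.items():
        if pattern.search(text.strip()):
            return {"type": label, "page": page, "text": text}
    return {"type": "paragraph", "page": page, "text": text}
```

Each tagged chunk then carries the semantic type and provenance needed by the downstream triple extraction and fusion stages.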

2.3 Automatic knowledge graph construction module

The Automatic Knowledge Graph Construction Module takes multimodal educational resources such as course videos (audio and subtitles) and textbook PDFs as input. Through the technical pipeline of “Multimodal Preprocessing → Semantic Parsing → Graph Modeling,” it achieves the structured and interconnected representation of educational knowledge, providing a semantically precise and relationally clear knowledge base for intelligent question answering.

  • 1. Multimodal preprocessing: FFmpeg is employed to extract the audio track from the video. Subsequently, the Whisper model is used to accurately transcribe the audio into text, providing “textual material” for the subsequent semantic parsing. As an end-to-end automatic speech recognition (ASR) model, the core of Whisper involves modeling the conditional probability of mapping an audio sequence X to a text sequence Y:

P(Y | X) = ∏_{t=1}^{T} P(y_t | y_{<t}, X)

where X = (x₁, x₂, …, xₙ) is the audio feature sequence, y_t is the token generated at the current time step, and y_{<t} represents the previously generated text history.
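The audio-extraction step can be sketched as the FFmpeg invocation below, which produces the 16 kHz mono WAV input that Whisper expects. The function name is an assumption of this sketch; the commented transcription call shows the openai-whisper API under the assumption that the package is installed:

```python
def build_audio_extract_cmd(video_path: str, wav_path: str) -> list[str]:
    """FFmpeg arguments that extract a Whisper-friendly audio track:
    16 kHz sample rate, mono, 16-bit PCM WAV, video stream dropped."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",               # drop the video stream
        "-ac", "1",          # downmix to mono
        "-ar", "16000",      # 16 kHz, the rate Whisper expects
        "-c:a", "pcm_s16le", # 16-bit PCM
        wav_path,
    ]

# Transcription would then be run via openai-whisper, e.g.:
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe("lecture.wav")
#   transcript = result["text"]
```

In practice the command list would be passed to `subprocess.run`, keeping the pipeline fully scriptable.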

  • 2. Semantic parsing: Leveraging LangChain to orchestrate the workflow of “Text Chunking—Prompt Engineering—Knowledge Extraction,” combined with NLP techniques, to extract entities such as “knowledge points, instructors, courses” and relationships like “prerequisite, belongs to” from the transcribed text.

  • (1) Entity recognition: A BERT-based sequence labeling model is employed. Given a sentence of length n, the model encodes it into contextual vector representations H = (h₁, h₂, …, hₙ), and then outputs the optimal sequence of entity labels y* through a CRF layer:

y* = argmax_y ∑_{i=1}^{n} (A_{y_{i−1}, y_i} + P_{i, y_i})

where A is the transition score matrix of the CRF, used to model dependencies between labels, and P_{i, y_i} is the emission score of label y_i at position i.

  • (2) Relation extraction: Modeled as a classification task based on entity pairs. For a pair of entities (e₁, e₂), we obtain their representations h_{e₁} and h_{e₂} from BERT, construct a classification feature vector [h_{e₁}; h_{e₂}], and calculate the probability of belonging to relation r via the softmax function:

P(r | e₁, e₂) = softmax(W [h_{e₁}; h_{e₂}] + b)
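The CRF decoding step of the entity recognizer, maximizing the summed emission and transition scores over label sequences, can be illustrated with a minimal Viterbi decoder. This is a sketch over toy score matrices, not the system's actual implementation:

```python
def viterbi_decode(emissions, transitions):
    """Find the label sequence maximizing the sum of emission scores
    P[i][y] and transition scores A[y_prev][y], as in CRF decoding.
    `emissions`: one score list per token; `transitions`: A[y_prev][y]."""
    n_labels = len(emissions[0])
    scores = list(emissions[0])   # best score of a path ending in each label
    backptr = []                  # per-step best-predecessor tables
    for emit in emissions[1:]:
        step, new_scores = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda yp: scores[yp] + transitions[yp][y])
            step.append(best_prev)
            new_scores.append(scores[best_prev] + transitions[best_prev][y] + emit[y])
        scores, backptr = new_scores, backptr + [step]
    # trace the best path backwards
    best = max(range(n_labels), key=lambda y: scores[y])
    path = [best]
    for step in reversed(backptr):
        path.append(step[path[-1]])
    return list(reversed(path))
```

With zero transition scores the decoder simply picks the best emission at each position; non-zero transitions let it penalize invalid label bigrams such as an I-tag without a preceding B-tag.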

  • 3. Entity disambiguation and fusion.

After extracting entities from multimodal teaching resources, the diversity and ambiguity inherent in natural language expression may lead to the same entity appearing in different surface forms (for instance, the data structure “Stack” might be expressed as “Stacks”), while different entities may share the same surface form (e.g., “binary tree” could refer to either a data structure or a botanical term). To address this, the system introduces entity disambiguation and entity fusion mechanisms to enhance the semantic consistency and structural integrity of the knowledge graph.

This fusion is cross-modal in nature. Entities and their relationships are disambiguated and merged across the auditory stream (video transcript) and the visual-textual stream (textbook PDF). For example, the concept of a ‘Stack’ is unified from its verbal explanation in the videos and its formal definition and schematic diagram in the textbook, resulting in a multi-evidenced, instructionally richer node in the final knowledge graph.

  • (1) Entity disambiguation.

Entity disambiguation aims to determine the unique canonical entity identifier for different mentions referring to the same entity. The system employs a context-based semantic similarity calculation method to match each candidate entity mention m with a set of possible corresponding entities E = {e₁, e₂, …, e_k}. Given the mention m and its contextual text c_m, its contextual vector representation v_m is obtained via a pre-trained language model (e.g., BERT):

v_m = BERT(m, c_m)

For each candidate entity e_j, the system obtains its vector representation v_{e_j} from its definition text d_{e_j}:

v_{e_j} = BERT(d_{e_j})

The semantic similarity between the mention and the entity is calculated as the cosine similarity:

sim(m, e_j) = (v_m · v_{e_j}) / (‖v_m‖ ‖v_{e_j}‖)

The entity with the highest similarity is selected as the disambiguation result for mention m:

e* = argmax_{e_j ∈ E} sim(m, e_j)
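Given precomputed embeddings, the argmax-over-cosine-similarity selection reduces to a few lines. The toy two-dimensional vectors and entity identifiers below are illustrative; in the real pipeline the vectors would come from BERT:

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

def disambiguate(mention_vec, candidates):
    """Pick the candidate entity whose definition embedding is most
    similar to the mention's contextual embedding.
    `candidates`: dict mapping entity id -> definition vector."""
    return max(candidates, key=lambda eid: cosine(mention_vec, candidates[eid]))
```

For instance, a mention of “stack” in a data-structures lecture would score far higher against a data-structure definition vector than against an unrelated sense.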

  • (2) Entity fusion.

Entity fusion aims to merge multiple entities that are semantically identical or highly similar into a single node, integrating their attributes and relationships. The system adopts a hierarchical clustering method based on entity vector representations for unsupervised clustering. Given an entity set E = {e₁, e₂, …, eₙ}, a cosine similarity matrix S is computed, where S_{ij} = cos(v_i, v_j).

Agglomerative hierarchical clustering is applied with a similarity threshold τ. If S_{ij} ≥ τ, entities e_i and e_j are grouped into the same cluster C_k. For each cluster C_k, a central entity (the one with the highest average similarity to other entities within the cluster) is selected as the representative entity. Attributes and relationships from other entities within the cluster are merged into this representative node.
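The threshold-based merge step can be sketched as single-linkage clustering over the similarity matrix: any pair with cosine similarity at or above τ is (transitively) placed in one cluster. This union-find sketch is a simplified stand-in for a full agglomerative implementation:

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

def cluster_entities(vectors, tau):
    """Single-linkage clustering sketch: entities i, j land in the same
    cluster whenever cos(v_i, v_j) >= tau, possibly transitively."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= tau:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Each resulting cluster would then elect its representative entity and absorb the attributes and relations of the other members.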

  • (3) Knowledge representation after fusion.

After disambiguation and fusion, each entity in the knowledge graph possesses a unique identifier, and the semantic relationships between entities become clearer. The fusion process can be formalized as:

G′ = Fuse(Disambiguate(G))

where G is the originally extracted graph, and G′ is the cleaned graph after disambiguation and fusion.

This mechanism significantly improves the logical consistency and query efficiency of the knowledge graph, providing a high-quality structured knowledge foundation for subsequent intelligent question answering.

  • 4. Instructional context enrichment.

The constructed knowledge graph is further processed to capture instructional cues inherent in the educational materials. This enrichment leverages explicit structural signals—such as chapter headings, section ordering in the textbook PDF, and discourse markers in the transcript (e.g., “firstly,” “prerequisite to”)—to annotate entities and relationships with pedagogical metadata.

  • (1) Core concept identification: Entities that are central to section titles, emphasized repeatedly, or form hubs in the extracted relationship network are algorithmically tagged as Core Concept.

  • (2) Prerequisite relation inference: The sequential order of textbook sections and logical discourse cues are analyzed to hypothesize and tag potential is prerequisite of relationships between concept entities, complementing relations extracted directly from the text.

This process aims to produce a knowledge graph that approximates a didactic structure, moving beyond associative networks to encode preliminary knowledge dependencies. The output of this automatic enrichment serves as the input for expert validation and refinement through the Knowledge Graph Review Interface, ensuring the final structure aligns with pedagogical expertise.
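The prerequisite-inference heuristic above, combining discourse markers with curriculum order, can be sketched as follows. The marker list, the strong/weak labeling, and the function name are illustrative assumptions; the real system's rules and the subsequent expert review determine the final edges:

```python
import re

# Discourse markers hinting at knowledge dependencies (illustrative list).
PREREQ_MARKERS = re.compile(
    r"(prerequisite to|before (?:learning|studying)|builds on|requires knowledge of)",
    re.I,
)

def infer_prerequisites(sections):
    """`sections`: ordered list of (concept, section_text) pairs.
    A concept is hypothesized to be a prerequisite of the next one:
    'strong' when its text contains an explicit dependency marker,
    'weak' when only the curriculum ordering supports it."""
    edges = []
    for (concept, text), (next_concept, _) in zip(sections, sections[1:]):
        strength = "strong" if PREREQ_MARKERS.search(text) else "weak"
        edges.append((concept, "is_prerequisite_of", next_concept, strength))
    return edges
```

Weakly supported edges would be surfaced to instructors in the review interface rather than committed directly to the graph.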

  • 5. Dual-storage: On one hand, the structured knowledge in the form of “entity-relation-entity” is stored as triples (h, r, t) in the Neo4j graph database, achieving an “interconnected representation” of knowledge. On the other hand, the chunked text {chunk₁, chunk₂, …, chunkₘ} is encoded into vectors {v₁, v₂, …, vₘ} via an Embedding model f: text → ℝᵈ, and these vectors are stored in the vector database to support “semantic similarity retrieval.”
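The graph side of the dual storage can be sketched as rendering each (h, r, t) triple into a Cypher MERGE statement. The `Concept` label and string interpolation are assumptions of this sketch; a production system would use parameterized queries through the official neo4j driver:

```python
def triple_to_cypher(h: str, r: str, t: str) -> str:
    """Render one (h, r, t) triple as an idempotent Cypher MERGE.
    MERGE creates nodes/relationships only if they do not already exist,
    so re-running ingestion does not duplicate the graph."""
    rel = r.upper().replace(" ", "_")  # e.g. "belongs to" -> BELONGS_TO
    return (
        f"MERGE (a:Concept {{name: '{h}'}}) "
        f"MERGE (b:Concept {{name: '{t}'}}) "
        f"MERGE (a)-[:{rel}]->(b)"
    )
```

Using MERGE rather than CREATE keeps repeated pipeline runs idempotent, which matters when course materials are re-processed after updates.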

2.4 Vector indexing and retrieval

To support efficient semantic similarity retrieval, the system employs a vector database to index the text chunk vectors. Its core involves building a fast Approximate Nearest Neighbor (ANN) search system within a high-dimensional vector space.

  • Index construction: The Hierarchical Navigable Small World (HNSW) graph algorithm is adopted for index construction. This algorithm offers excellent query speed while ensuring a high recall rate. HNSW achieves this by building a hierarchical graph structure, where the bottom layer contains all data points, and the upper layers are sparse subsets of the lower ones, enabling rapid, long-distance, “highway”-like navigation. Its construction process can be formally described as continuously optimizing the graph structure G = (V, E) to minimize the average hop count during greedy searches at any layer (Figure 2).

  • Retrieval process: Given a query vector q, the retrieval process aims to find the k vectors most similar to q within the HNSW graph. Similarity is measured using cosine similarity:

sim(q, v) = (q · v) / (‖q‖ ‖v‖)

Figure 2

The system returns the top k texts {chunk₁, chunk₂, …, chunkₖ} with the highest similarity scores as the retrieval results, providing semantic context for the subsequent hybrid retrieval.
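The exact ranking that HNSW approximates can be written as a brute-force top-k search, useful as a correctness reference for small collections (this is a sketch; the deployed system would query the vector database's HNSW index instead):

```python
def top_k_chunks(query_vec, chunk_vecs, k):
    """Exact top-k retrieval by cosine similarity. HNSW approximates
    this ranking with far fewer comparisons on large collections."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / ((sum(a * a for a in u) ** 0.5) *
                      (sum(b * b for b in v) ** 0.5))
    # sort chunk indices by similarity to the query, descending
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

On n chunks this costs O(n·d) per query, which is exactly what the hierarchical “highway” navigation of HNSW avoids at scale.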

2.5 Intelligent tutoring and question answering module

The Intelligent Tutoring and Question Answering Module is supported by a hybrid knowledge base combining “knowledge graph + vector retrieval.” It integrates a Large Language Model (LLM) and streaming transmission technology to achieve tutorial interactions characterized by “accuracy, naturalness, and real-time responsiveness,” thereby meeting learners’ personalized Q&A needs.

  • Question parsing: User questions, including text-based and voice-based queries, are received via an API. Voice queries are first transcribed into text using Whisper. Subsequently, a combination of “pre-trained models + LLM” is employed to identify the question’s intent.

  • Hybrid retrieval: Based on the identified question intent, graph retrieval and vector retrieval are executed in parallel. The details are as follows:

  •    (1) Graph retrieval: The user question Q is converted into a Cypher query via the LLM to retrieve a structured knowledge subgraph Gₛ with clear logical relationships from Neo4j.

  •    (2) Vector retrieval: As described in Section 2.4, the HNSW index is utilized to retrieve a set of text fragments Cᵥ that are semantically related to the question.

  •    (3) Fusion strategy: A weighted fusion mechanism is designed to form the final retrieval context C. Its composite scoring function S is defined as:

S(c) = λ · sim(q, c) + (1 − λ) · 𝟙_G(c)

Here, λ ∈ [0, 1] is an adjustable hyperparameter, and 𝟙_G(c) is the indicator function (equal to 1 if the fragment c originates from the graph retrieval results, otherwise 0). The system re-ranks the candidate results based on S(c) and selects the Top-N to constitute the final context C.
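The weighted re-ranking step can be sketched directly from the composite score, λ·sim plus (1−λ) for graph-sourced fragments. The candidate record layout (`text`, `sim`, `from_graph`) is an assumption of this sketch:

```python
def fuse_and_rank(candidates, lam, top_n):
    """Re-rank retrieval candidates by S(c) = lam * sim + (1 - lam) * I_graph
    and keep the Top-N as the final RAG context.
    `candidates`: dicts with keys 'text', 'sim', 'from_graph'."""
    def score(c):
        indicator = 1.0 if c["from_graph"] else 0.0
        return lam * c["sim"] + (1 - lam) * indicator
    return sorted(candidates, key=score, reverse=True)[:top_n]
```

Lowering λ biases the context toward logically grounded graph results; raising it favors semantically close transcript chunks, which is the tunable trade-off the hybrid strategy is built around.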

  • Answer generation and streaming output: The fused retrieval results C serve as the grounding context, guiding the LLM via a RAG engine to generate the answer. This process can be formalized as a conditional probability generation task:

P(A | Q, C) = ∏_{t=1}^{T} P(a_t | a_{<t}, Q, C)

where A = {a₁, a₂, …, a_T} is the generated answer sequence. Concurrently, Server-Sent Events (SSE) technology is employed to push each newly generated token a_t to the front-end in real-time, achieving “streaming delivery” of the answer. This effectively reduces the user’s perceived waiting time and enhances the interactive experience.
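The streaming delivery can be sketched as a generator that wraps each token in the SSE wire format (`data: <payload>` followed by a blank line). The `[DONE]` sentinel is an assumption of this sketch, a common convention rather than part of the SSE standard:

```python
def sse_stream(token_iter):
    """Wrap each generated token in the Server-Sent Events wire format
    so the front-end (e.g. an EventSource listener) can render the
    answer incrementally as tokens arrive from the LLM."""
    for token in token_iter:
        yield f"data: {token}\n\n"   # one SSE event per token
    yield "data: [DONE]\n\n"         # sentinel marking end of answer
```

A web framework would stream these events over a response with `Content-Type: text/event-stream`, letting the browser display partial answers token by token.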

The Automatic Knowledge Graph Construction Module provides the Intelligent Question Answering module with structured knowledge from the graph database and semantic knowledge from the vector database, together forming the cornerstone for intelligent Q&A. Furthermore, high-frequency questions and user feedback accumulated in the Q&A interaction layer provide optimization cues for the knowledge construction layer. Through continuous knowledge graph completion and fine-tuning of the Embedding model, the system achieves ongoing evolution in its level of intelligence and adaptive capability.

2.6 Scalability and maintainability design of the system

The modular architecture of the proposed framework, designed to process multimodal educational inputs such as videos and textbook PDFs, not only enables its core functionalities but also establishes a foundation for its scalable application and long-term evolution in real-world educational settings.

Cross-disciplinary Scalability is inherent in the domain-agnostic design of the core pipeline. The system’s modules for multimodal processing, semantic parsing, knowledge graph construction, and hybrid retrieval are implemented through parameterized configurations and abstract interfaces. Adapting the system to a new discipline therefore primarily involves defining new entity types and relations, that is, updating the domain ontology configuration and performing lightweight domain-adaptive fine-tuning of the foundational NLP models, without refactoring the system core. This design ensures the generalizability of the technical framework and significantly reduces the cost of replication across disciplines.
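As a sketch of what such a parameterized configuration might look like (every key and value here is hypothetical, not the system's actual schema), adding a discipline reduces to adding a configuration entry rather than new code:

```python
# Hypothetical domain-ontology configuration. All names below are
# illustrative assumptions; the real system's schema may differ.

DOMAIN_ONTOLOGIES = {
    "data_structures": {
        "entity_types": ["Concept", "Operation", "Application"],
        "relation_types": ["belongs_to", "prerequisite", "compare", "application"],
    },
    # Adapting to a new discipline = adding a new entry, not new code.
    "organic_chemistry": {
        "entity_types": ["Compound", "Reaction", "FunctionalGroup"],
        "relation_types": ["reacts_with", "derived_from", "catalyzed_by"],
    },
}

def load_ontology(discipline: str) -> dict:
    """Extraction modules read their entity/relation schema from config."""
    return DOMAIN_ONTOLOGIES[discipline]

schema = load_ontology("organic_chemistry")
```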

The Sustainable Maintenance Mechanism operates on a “human-in-the-loop” principle, positioning the system as an intelligent teaching assistant that relies on instructor oversight. A dedicated Knowledge Graph Review Interface allows instructors to audit and correct extracted knowledge, ensuring its pedagogical accuracy. The system also supports incremental updates and maintains full provenance metadata for traceability.

A continuous feedback loop enables dynamic knowledge refinement. User corrections to answers are logged and aggregated. When corrections for a specific answer reach a set threshold, the case is escalated to the administrator panel for manual annotation. The resulting complete message with annotations is then used to update both the vector and graph databases. This closed-loop process transforms maintenance into a streamlined, collaborative cycle, reducing long-term operational burden while steadily enhancing system reliability.
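The threshold-based escalation described above can be sketched as follows; the threshold value and data layout are illustrative assumptions, not the system's actual design.

```python
# Sketch of correction aggregation with threshold-based escalation.
from collections import Counter

class FeedbackLoop:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.corrections = Counter()  # answer_id -> correction count
        self.escalated = []           # queue for the administrator panel

    def log_correction(self, answer_id: str):
        """Aggregate user corrections; escalate once the threshold is hit."""
        self.corrections[answer_id] += 1
        if (self.corrections[answer_id] == self.threshold
                and answer_id not in self.escalated):
            self.escalated.append(answer_id)

loop = FeedbackLoop(threshold=2)
loop.log_correction("q42")
loop.log_correction("q42")  # second correction triggers escalation
loop.log_correction("q7")   # below threshold, stays logged only
```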

The system flowchart is as follows in Figure 3.

Figure 3

2.7 System implementation and demonstration

To intuitively showcase the system’s capabilities in automatic multimodal knowledge graph construction and intelligent question-answering support, this section uses a teaching video from the “Stacks and Queues” chapter of a “Data Structures and Algorithms” course as an example. It demonstrates the complete system workflow from video upload to knowledge graph construction, chapter parsing, and storage.

In addition to video processing, knowledge is also extracted from the corresponding chapters of PDF textbooks through the process described in Section 2.2, and entities and relationships are integrated into a single unified knowledge graph.

2.7.1 Video upload and preprocessing

After a user uploads a course video through the system interface, the system initiates the multimodal preprocessing pipeline:

  • Use FFmpeg to extract the audio track from the video.

  • Call the Whisper automatic speech recognition model to transcribe the audio into structured text subtitles.

  • Perform sentence and paragraph segmentation on the transcribed text to form text chunks suitable for semantic parsing.

As shown in Figure 4.

Figure 4
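The third preprocessing step, chunking the transcript for semantic parsing, can be sketched as follows. The regex sentence splitter and character budget are simplifying assumptions; the production pipeline may segment differently.

```python
# Sketch of segmenting a Whisper transcript into semantic-parsing chunks.
import re

def chunk_transcript(text: str, max_chars: int = 120):
    """Split on sentence boundaries, then pack sentences into chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

transcript = ("A stack is a last-in first-out structure. "
              "Push adds an element to the top. "
              "Pop removes the element most recently pushed.")
chunks = chunk_transcript(transcript, max_chars=60)
```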

2.7.2 Textbook PDF processing and knowledge extraction

To fully demonstrate the system’s multimodal capabilities, this section also illustrates the processing pipeline for textbook PDFs. The system follows a dedicated procedure to extract and structure visual-textual knowledge from textbook PDFs, which is subsequently fused with knowledge from videos to form a unified knowledge graph. The process involves the following steps:

  • Page Segmentation and Layout Analysis: Each page of the uploaded textbook PDF is first converted into an image. The PP-DocLayout_plus-L model is employed for layout cleaning and analysis, accurately identifying and segmenting regions corresponding to text paragraphs, headings, diagrams, and formulas.

  • OCR and Text Enhancement: The segmented image regions undergo enhancement and binarization before being fed into an OCR engine (e.g., PaddleOCR) for precise text recognition. Post-processing techniques are applied to clean and correct the OCR results.

  • Content Structuring: The raw OCR text is structured using a combination of rule-based engines and regular expressions. This step parses the textbook content into semantic chunks (e.g., definitions, theorems, examples) based on typographical and layout cues, mirroring the pedagogical structure of the material.

  • Knowledge Extraction and Graph Construction: The structured content is then sent to a Large Language Model (LLM). Through prompt engineering, the LLM performs key knowledge point summarization and extracts entities and relationships, generating triples that are fed into the subsequent knowledge fusion stage.

This extracted knowledge from the textbook PDFs is seamlessly integrated with the entities and relations derived from the video transcripts during the cross-modal entity disambiguation and fusion process described in Section 2.3.3, ultimately contributing to a richer, more comprehensive multimodal knowledge graph.
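As an illustration of the final extraction step, the LLM's textual output can be parsed into triples for the fusion stage. The "(head, relation, tail)" line format below is an assumed prompt convention for the example, not the paper's exact output schema.

```python
# Sketch of turning LLM extraction output into (entity, relation, entity)
# triples. The line format is an assumed prompt convention.
import re

TRIPLE_PATTERN = re.compile(r"\(\s*([^,]+?)\s*,\s*([^,]+?)\s*,\s*([^)]+?)\s*\)")

def parse_triples(llm_output: str):
    """Extract (head, relation, tail) tuples from the LLM response text."""
    return [match.groups() for match in TRIPLE_PATTERN.finditer(llm_output)]

llm_output = """
(Stack, belongs_to, Linear Structure)
(Push, application, Stack)
"""
triples = parse_triples(llm_output)
```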

2.7.3 Course overview generation and chapter parsing

Based on the transcribed text from the video and the structured content from the textbook PDF, the system utilizes the prompt engineering and semantic understanding module built with LangChain to automatically identify the structure and key points of the video content:

  • Course overview generation: Extract the video title, instructor, course name, and main content summary (Figure 5).

  • Chapter parsing: Identify chapter divisions within the video (e.g., “Definition of Stack,” “Queue Operations,” “Comparison of Stack and Queue”) and generate a hierarchical chapter tree (Figure 6).

Figure 5

Figure 6
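The hierarchical chapter tree can be assembled from parsed (level, title) pairs along the following lines; the input pairs are illustrative stand-ins for the LangChain parsing output.

```python
# Sketch of nesting (level, title) pairs into a chapter tree.

def build_chapter_tree(entries):
    """Nest (level, title) pairs into a list of {title, children} dicts."""
    root, stack = [], []  # stack holds (level, node) of open chapters
    for level, title in entries:
        node = {"title": title, "children": []}
        while stack and stack[-1][0] >= level:
            stack.pop()  # close chapters at the same or deeper level
        (stack[-1][1]["children"] if stack else root).append(node)
        stack.append((level, node))
    return root

entries = [(1, "Stacks and Queues"),
           (2, "Definition of Stack"),
           (2, "Queue Operations"),
           (2, "Comparison of Stack and Queue")]
tree = build_chapter_tree(entries)
```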

2.7.4 Automatic knowledge graph construction and storage

The system performs semantic parsing on the multimodal course content (video transcripts and textbook text) to identify core knowledge points, and subsequently extracts entities and relationships to construct a structured knowledge graph:

  • Entity recognition: Identify entities such as “Stack,” “Queue,” “Push,” “Pop.”

  • Relation extraction: Identify relationships such as “belongs to,” “prerequisite,” “compare,” “application.”

  • Graph storage: Store the “Entity-Relation-Entity” triples in the Neo4j graph database, while simultaneously encoding the text chunks into vectors and storing them in the vector database.

As shown in Figure 7.

Figure 7
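The graph storage step can be sketched as the generation of idempotent Cypher MERGE statements, one per extracted triple. The Entity label and property names are illustrative, and the actual Neo4j driver call is omitted to keep the snippet self-contained.

```python
# Sketch of emitting Cypher MERGE statements for extracted triples.
# Label and property names are illustrative assumptions.

def triple_to_cypher(head: str, relation: str, tail: str) -> str:
    """Build an idempotent MERGE statement for one (E, R, E) triple."""
    return (
        f'MERGE (h:Entity {{name: "{head}"}}) '
        f'MERGE (t:Entity {{name: "{tail}"}}) '
        f'MERGE (h)-[:{relation.upper()}]->(t)'
    )

stmt = triple_to_cypher("Push", "belongs_to", "Stack")
```

MERGE (rather than CREATE) keeps re-runs of the pipeline from duplicating nodes or edges when a chapter is reprocessed.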

2.7.5 Knowledge Q&A preparation

Upon completion of the knowledge graph construction, the system automatically generates a “Knowledge Overview Panel” for teachers or students to preview the built knowledge structure. This panel provides ready-to-use knowledge support for the subsequent intelligent Q&A module, enabling instant question-and-answer interactions.

3 Case study implementation and evaluation

To validate the effectiveness and practicality of the Intelligent Tutoring System constructed in this research, a case study was designed and conducted. This case study is set against the backdrop of an online “Data Structures and Algorithms” course. It fully implements the entire process from the automatic construction of a knowledge graph from multimodal teaching resources to providing intelligent question-answering tutoring. The system’s performance is comprehensively evaluated through quantitative and qualitative analysis of its output (Figure 8).

Figure 8

3.1 Experimental setup

  • 1. Dataset and case selection.

This study selected three core instructional videos from the “Stacks and Queues” chapter of the “Data Structures and Algorithms” course as multimodal input sources, along with the corresponding chapter from the official course textbook in PDF format. The total video duration is approximately 45 min, covering the basic concepts, operational characteristics, typical applications, and comparative analysis of stacks and queues; the textbook PDF contains rich textual explanations, code examples, schematic diagrams, and exercises. This topic features well-defined concepts, a clear logical hierarchy, and rich entity relationships, making it a suitable scenario for verifying the system’s capabilities in automated knowledge construction and intelligent interaction.

This case study employs a pedagogically complete dataset encompassing all core concepts and applications for the “Stacks and Queues” topic. The density and structural coherence of this focused corpus enable a rigorous validation of the proposed pipeline’s core capabilities and reveal definitive performance patterns.

  • 2. Evaluation metrics and methods.

To comprehensively examine system performance, evaluation metrics were established from three dimensions: knowledge construction quality, question-answering performance, and system performance. The specifics are as follows:

  • (1) Knowledge graph quality evaluation.

  • (a) Metrics: Common metrics from the information extraction field were adopted, namely Precision, Recall, and F1-score.

  • (b) Method: 300 sentences were randomly sampled from the video transcription text. Two instructors of the course independently annotated entities and relationships to form a gold standard. The system’s automated extraction results were compared against this standard for calculation.

  • (c) Pedagogical logic consistency: To evaluate how well the automatically constructed graph aligns with intended instructional logic, the course instructor was invited to perform a qualitative assessment. The instructor reviewed the final graph (152 entities, 243 relationships) against the official course syllabus and learning objectives for the “Stacks and Queues” unit. The assessment focused on: (i) coverage of all key concepts defined in the syllabus, (ii) correctness of essential prerequisite relationships (e.g., Stack → Function Call Mechanism), and (iii) overall cohesiveness of the knowledge hierarchy. The instructor confirmed that over 90% of the core syllabus concepts and their fundamental prerequisite links were correctly identified and structured in the graph. This indicates that the automated construction process, augmented by the instructional context enrichment step, can produce a knowledge structure that largely respects the designed learning progression.

  • (2) Question-answering performance evaluation.

  • (a) Metrics: Answer Accuracy, Answer Relevance, and User Satisfaction.

  • (b) Method:

  • Accuracy & relevance: A test set containing 50 questions (covering various types such as definition, characteristics, comparison, and application) was constructed. A domain expert not involved in the system development was invited to perform a blind evaluation of the system’s answers using a 3-point scale (2 = Completely Correct / Highly Relevant, 1 = Partially Correct / Relevant, 0 = Incorrect / Irrelevant).

  • User satisfaction: 30 learners enrolled in the course were invited to use the system and complete a satisfaction questionnaire regarding their Q&A experience.

  • (3) System performance evaluation.

  • (a) Metric: End-to-end Q&A response time.

  • (b) Method: Under standard load, the average time taken by the system from receiving a user’s question to completing the answer transmission via SSE streaming was recorded.

3.2 Results and analysis

  • 1. Knowledge graph construction results and analysis.

The system successfully constructed a structured domain knowledge graph automatically from the 45 min of video material and the 20-page textbook PDF. The final graph contains 152 entities and 243 relationships, clearly representing the knowledge system of “stack,” “queue,” and their related concepts. Notably, approximately 35% of the entities and their associated relationships were extracted and integrated from the PDF textbook, demonstrating the complementary value of the visual-textual modality. This demonstrates the complementary role of textbook knowledge in enriching the graph with formal definitions, diagrams, and structured pedagogical content.

As shown in Table 1, the system achieved F1-scores exceeding 83% for both entity and relation extraction tasks. This indicates that the proposed method for automated knowledge graph construction in this paper can extract structured knowledge from video resources with relatively high accuracy and completeness. To a certain extent, it overcomes the issues of high cost and low efficiency associated with traditional expert-dependent manual construction, providing a sustainably evolving knowledge base for the Intelligent Tutoring System.

Table 1

Evaluation metrics    Entity extraction performance (%)    Relation extraction performance (%)
Precision             89.5                                 85.2
Recall                86.8                                 82.1
F1-score              88.1                                 83.6

Statistical results of knowledge extraction quality evaluation.
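As a quick consistency check on these figures (assuming the 89.5/85.2 row is Precision and the 86.8/82.1 row is Recall), F1 follows as the harmonic mean of the two:

```python
# F1 as the harmonic mean of Precision and Recall.
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

entity_f1 = round(f1_score(89.5, 86.8), 1)    # entity extraction
relation_f1 = round(f1_score(85.2, 82.1), 1)  # relation extraction
```

Both reproduce the reported values (88.1 and 83.6), confirming the table's internal consistency.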

  • 2. Intelligent question-answering performance results and analysis.

To quantitatively evaluate the Q&A performance, this study constructed a test set comprising 50 questions covering six types: definition comprehension, feature discrimination, comparative analysis, operational details, in-depth understanding, and scenario application. Evaluation was conducted via expert blind review, where a domain expert not involved in system development scored the answers on a 3-point scale. The results show that the system’s answers achieved an average accuracy score of 1.82 and an average relevance score of 1.88 (see the expert scoring sheet in Appendix 1). This demonstrates that in most cases, the system can generate highly accurate answers closely aligned with the questions, validating the effectiveness of the question-answering mechanism based on automatic multimodal knowledge graph construction and retrieval-augmented generation.

To obtain feedback from real users, 30 learners enrolled in the “Data Structures and Algorithms” course were invited to use the system for a one-week trial and complete an anonymous satisfaction questionnaire. The questionnaire employed a Likert five-point scale and investigated multiple dimensions (see the questionnaire in Appendix 2). The results show that 86.7% of participants reported being “satisfied” or “very satisfied” with the system’s Q&A experience. Most users provided feedback that the system was “responsive” and that the answers were “clear and easy to understand.”

This indicates that the system has received high recognition at both technical and user experience levels. Its response speed, answer quality, and usability effectively meet learners’ needs.

  • 3. System efficiency results and analysis.

Performance testing showed that the average end-to-end response time of the intelligent Q&A module was 1.3 s. This result verifies that the system, while ensuring answer quality, can meet the stringent requirements for response speed in real-time interactive applications.

3.3 Comparative experiment analysis

To thoroughly validate the superiority of the proposed system based on automatic multimodal knowledge graph construction and Retrieval-Augmented Generation (MMKG-RAG), a comparative experiment was designed in this section. Two representative types of knowledge-based question-answering systems were selected as benchmarks. A comprehensive comparison was conducted across multiple dimensions, including answer accuracy, relevance, explainability, and response efficiency.

3.3.1 Comparative systems and experimental design

  • 1. Selection of comparative systems.

To comprehensively evaluate the performance of the proposed system, the following two systems were selected as baseline benchmarks:

  • (1) Baseline system A (Pure Text RAG System): This system represents the current mainstream question-answering paradigm based on Retrieval-Augmented Generation. Its knowledge source is the same as in this experiment, namely the transcribed text from the course videos. The system employs a sliding window approach for text chunking, performs semantic retrieval via a vector database, and drives the same large language model to generate answers. This system does not incorporate a structured knowledge graph.

  • (2) Baseline system B (Manually Constructed Knowledge Graph-based Q&A System): This system represents the traditional question-answering method based on a structured knowledge base. Its knowledge graph was manually constructed by domain experts based on the content of the “Stacks and Queues” section in the “Data Structures and Algorithms” course, containing 120 entities and 180 relationships. The system uses template-based semantic parsing to convert natural language questions into graph query statements (Cypher) and directly extracts or combines answers from the graph.

Baseline A (Text-only RAG) represents the mainstream paradigm for grounding LLMs, isolating the value of our structured graph. Baseline B (Manual-Graph QA) represents the traditional, high-cost approach, allowing evaluation of the trade-off between automation and expert precision.

  • 2. Experimental setup.

  • (1) Test set: The test set constructed in Section 3.1, containing 50 questions covering six question types, was used.

  • (2) Evaluation methods and metrics: In addition to the original Accuracy and Relevance evaluations, a new dimension, Explainability, was added to assess whether the answer provides a clear reasoning process or indicates the knowledge source. All evaluations were conducted via blind review by the same domain expert not involved in the development of any system, using a 3-point scale (2 = Excellent, 1 = Acceptable, 0 = Unacceptable).

  • (3) Performance metrics: The answer quality scores (Accuracy, Relevance, Explainability) and the end-to-end response time for each system were recorded.

  • (4) Experimental environment: To ensure fairness, all systems were deployed on identical hardware (Intel i7-13700K, 32GB RAM, NVIDIA RTX 4090) and used the same large language model base (Qwen2-7B).

3.3.2 Experimental results

The comprehensive performance comparison of the three systems on the 50-question test set is shown in Table 2. The system proposed in this study outperformed both baseline systems in terms of answer Accuracy, Relevance, and Explainability. It also achieved the highest Fully Correct Answer Rate (82%) and the lowest Error Rate (2%). Although the average response time was slightly longer, it remained within an acceptable range and demonstrated stable performance on complex queries. These results validate the comprehensive advantages of the proposed system in multimodal knowledge fusion and hybrid retrieval strategy.

Table 2

Evaluation metrics                              Our proposed system    Baseline system A (Text-only RAG)    Baseline system B (Manual-graph QA)
Average accuracy score                          1.82                   1.58                                 1.67
Average relevance score                         1.88                   1.62                                 1.70
Average interpretability score                  1.76                   1.32                                 1.58
Average response time (seconds)                 1.3                    0.9                                  0.8
Number of fully correct answers (Percentage)    41 (82%)               32 (64%)                             35 (70%)
Number of partially correct answers (Percentage) 8 (16%)               12 (24%)                             10 (20%)
Number of incorrect answers (Percentage)        1 (2%)                 6 (12%)                              5 (10%)

Comprehensive performance comparison of different Q&A systems.

Table 3 shows the accuracy comparison of each system across the six question types. The results indicate that for “Definition Comprehension” and “Operational Detail” type questions, the proposed system performed comparably to the rule-based graph system (Baseline B), both significantly outperforming the pure text RAG system (Baseline A). The advantage of the proposed system was most evident for complex questions requiring “Comparative Analysis” and “In-depth Understanding,” where its hybrid retrieval strategy effectively integrated multi-source knowledge to support deeper reasoning.

Table 3

Question types              Our proposed system    Baseline system A (Text-only RAG)    Baseline system B (Manual-graph QA)
Definition comprehension    1.95                   1.80                                 1.90
Feature discrimination      1.85                   1.65                                 1.75
Comparative analysis        1.80                   1.50                                 1.70
Procedural details          1.85                   1.70                                 1.80
Deep comprehension          1.75                   1.45                                 1.55
Scenario application        1.70                   1.40                                 1.50

Accuracy comparison of different Q&A systems across question types.

*The data in the table represents the average accuracy score for each question type. Scoring uses a 3-point scale (2 = Fully Correct, 1 = Partially Correct, 0 = Incorrect).

Furthermore, for 5 complex questions requiring multi-step reasoning (as shown in Table 4), the proposed system, leveraging the structured logical chains provided by graph retrieval and the detailed semantics supplemented by vector retrieval, surpassed both baseline systems in terms of answer quality and explanatory depth.

Table 4

Question descriptions                                                                      Ours   Baseline A   Baseline B   Strengths analysis
How to implement the FIFO property of a queue using stacks?                                2.0    1.2          1.5          Combines algorithmic steps (Graph Retrieval) with code examples (Semantic Retrieval)
What are the different application scenarios of stacks in recursion versus iteration?      1.8    1.0          1.3          Extracts comparative relationships from the knowledge graph to provide structured explanations
What is the mathematical principle behind circular queues resolving false overflow?        2.0    1.5          1.8          Integrates formula derivation with algorithm implementation explanations
What are the strategies for implementing a thread-safe queue in a concurrent environment?  1.6    0.8          1.2          Combines conceptual explanations with practical programming advice
What is the complete lifecycle of a stack frame during a function call?                    1.9    1.3          1.6          Provides a detailed, step-by-step chronological description of the process

(Ours = our proposed system; Baseline A = Text-only RAG; Baseline B = Manual-graph QA.)

Answer quality comparison for complex questions (Multi-step reasoning).

3.3.3 Discussion and analysis

  • Effectiveness of the hybrid retrieval strategy: Experimental results demonstrate that the hybrid strategy of “Precise Retrieval via Graph Database + Semantic Retrieval via Vector Database” adopted by the proposed system significantly improved answer explainability while ensuring answer accuracy (reducing LLM “hallucination”) and relevance. Graph retrieval ensured the rigor of the core answer logic, while semantic retrieval enriched contextual details, resulting in generated answers that are both structurally clear and substantively rich.

  • Practical value of automated construction: Compared to Baseline System B, which relies on expert manual construction, the knowledge graph built through the automated process in the proposed system is larger in scale (26.7% more entities, 35% more relationships) and holds overwhelming advantages in construction efficiency. This validates the significant potential of the proposed automatic construction method in reducing knowledge base construction costs and enhancing scalability.

  • Trade-off between performance and efficiency: Although the average response time of the proposed system is slightly higher than that of the two baseline systems (approximately 1.3 s), its processing time for complex questions is more stable. Baseline System A often experiences significantly prolonged response times (up to 3 s or more) when handling complex queries due to the need for multiple retrieval rounds. Combined with the user satisfaction survey results from Section 3.2 (86.7% of users were satisfied with the response speed), it can be concluded that the proposed system achieves a significant leap in answer quality within an acceptable response time.

  • Error analysis: Analysis of the few incorrect answers generated by the proposed system revealed that they mainly originated from minor deviations in the speech transcription stage (e.g., misrecognition of professional terms), which were propagated and amplified in subsequent knowledge extraction. This further corroborates the discussion on “Error Propagation” in Section 4.3 and points to a key direction for future improvement.

The relative advantages of our system over both baselines are clear within this focused study. While large-scale generalization across domains is for future work, these results robustly demonstrate the benefits of our hybrid design.

3.3.4 Conclusion of comparative experiments

Based on the systematic comparative experiments, the following conclusions can be drawn: The proposed MMKG-RAG system significantly outperforms the traditional Pure Text RAG system and the manually constructed Knowledge Graph-based Q&A system in comprehensive question-answering performance. Its core advantage lies in achieving an effective balance among accuracy, explainability, and scalability through the automatically constructed multimodal knowledge graphs and the hybrid retrieval-augmented generation framework. This not only provides more reliable knowledge service capabilities for Intelligent Tutoring Systems but also offers an effective technical pathway for constructing domain knowledge bases from unstructured resources and utilizing them efficiently.

3.4 Discussion

This case study successfully validates the feasibility of the proposed system framework in teaching practice. It transforms static, unstructured video content into dynamic, interactive intelligent tutoring services, enabling on-demand Q&A and personalized guidance. This not only frees teachers, to a certain extent, from repetitive workload so they can focus on more creative teaching activities, but more importantly provides learners with an intelligent assistant that is accessible anytime for a deep understanding of course content. It holds positive practical significance for promoting personalized adaptive learning.

Despite the overall satisfactory performance, this study has also identified the following limitations at the current stage:

  • Error propagation issue: Minor errors from the speech transcription or semantic parsing stages may propagate through the pipeline into the knowledge graph, ultimately affecting the accuracy of question answering. Introducing a manual verification stage or integrating more contextual information for cross-verification is necessary.

  • Boundary of complex reasoning capability: For complex questions requiring in-depth synthesis and logical reasoning across multiple knowledge points, the system’s ability to organize responses still has room for improvement. Future work will explore integrating more powerful chain-of-thought mechanisms or domain-specifically fine-tuned large language models to enhance this capability.

4 Discussion

4.1 Educational adequacy of knowledge representation: capturing instructional structure beyond co-occurrence

For an Intelligent Tutoring System, a high-quality knowledge base requires educational adequacy—it must reflect pedagogical organization (e.g., learning objectives, prerequisites) beyond statistical co-occurrences in text. Our approach achieves this through cross-modal fusion, integrating auditory explanations from videos with formal textual and visual definitions from textbooks. This convergence creates instructionally richer representations than unimodal extraction alone. To further enhance this, our automated pipeline includes an Instructional Context Enrichment step, which analyzes structural signals (e.g., textbook headings, discourse markers) to tag Core Concepts and infer prerequisite relationships. This transforms the graph from a mere entity network into a model imbued with instructional logic.

The instructor’s qualitative assessment confirmed the graph “covered over 90% of the core syllabus concepts and their fundamental prerequisite links,” enabling tutoring that leverages prerequisite chains for targeted explanation. However, automating pedagogical intent is imperfect; representational gaps may arise from speech recognition errors or misunderstood implicit logic. Therefore, our system adopts a human-in-the-loop paradigm via a Knowledge Graph Review Interface, allowing instructors to validate and refine the graph, ensuring alignment with curricular goals.

In summary, our framework advances beyond automation by leveraging multimodal fusion for richer knowledge representation and integrating instructional enhancement with expert validation. This collaborative paradigm actively safeguards educational adequacy, moving toward ITS that not only “possess knowledge” but also “understand how to teach it.”

4.2 Limitations in pedagogical adaptability

While the hybrid retrieval strategy effectively grounds answers in factual knowledge, it inherits a fundamental limitation of the RAG paradigm when applied to education: the gap between factual correctness and pedagogical optimality. Retrieved content, though accurate, may not constitute the most instructive explanation for a learner’s specific context.

This manifests in two key scenarios. First, logical brevity versus explanatory scaffolding: graph-retrieved facts or relationships, while logically sound, may present a condensed “expert” view lacking the step-by-step elaboration a novice needs. For example, an answer derived from an (implements, Queue, two-Stacks) relation might state the algorithm correctly but omit the crucial simulation of operations that builds intuition. Second, semantic relevance versus curricular alignment: vector-retrieved text chunks, while topically related, might introduce concepts beyond the current learning objective (e.g., advanced kernel-level stack examples during a basic operations lesson), potentially causing distraction or cognitive overload.

Our evaluation metrics (accuracy, relevance) do not fully capture this pedagogical adequacy. Future work must develop assessments for explanatory completeness and developmental appropriateness. The system’s human-in-the-loop design offers a direct mitigation: instructors can use the review interface not only to correct facts but also to tag or reshape content for pedagogical clarity, turning the knowledge base into a more teachable asset. Ultimately, advancing educational RAG requires shifting focus from merely retrieving correct information to orchestrating the most instructive learning experience.

4.3 LLM selection considerations and framework generalizability

Our experimental implementation utilized the Qwen2-7B model for engineering consistency. However, the proposed framework is designed to be model-agnostic. The effectiveness of the graph-grounding mechanism—where retrieved triples and contexts steer the LLM—can be influenced by the base model’s characteristics.

Model size and capability primarily affect the fluency, reasoning depth, and instruction-following fidelity of the final generated answer, especially when integrating and explaining complex relational subgraphs. Domain adaptation or instruction tuning of the LLM on educational corpora could significantly enhance its ability to interpret pedagogical relationships and terminology from the knowledge graph, leading to more coherent and didactically structured explanations.

Crucially, the hybrid retrieval pipeline itself serves as the primary grounding mechanism. By providing the LLM with logically structured subgraphs (from Neo4j) and semantically relevant text chunks (from vector DB), the framework constrains generation and reduces dependency on the model’s internal parametric knowledge. This architecture makes the system inherently more robust to variations in the base LLM, as the quality of the answer is largely anchored by the retrieved evidence. Future work will systematically evaluate this generalization by testing different open-source and instruction-tuned LLMs within the same pipeline, focusing on the interaction between retrieval quality and model capability in producing pedagogically optimal responses.

4.4 Educational significance and practical value

Through technological innovation, this system provides a viable technical pathway and a practical exemplar for realizing genuinely “personalized learning” and “adaptive mobile learning.” Traditional video learning resources are static and facilitate one-way information transfer, whereas this system transforms them into dynamic, interactive intelligent knowledge entities. Learners are no longer passive viewers of videos but can engage in “dialogue” about the video content at any time, achieving a personalized learning experience of on-demand knowledge acquisition. This capability for instant Q&A, combined with the convenience of mobile devices, means that high-quality learning support is no longer confined to classroom hours or a teacher’s office hours. It enables 24/7 academic assistance, serving as a vivid embodiment of the “adaptive mobile learning” concept.

In terms of alleviating the burden on teachers, the system acts as an “intelligent teaching assistant.” Evaluation results indicate that the system can accurately answer a large volume of foundational, repetitive conceptual questions. This liberates teachers from the heavy workload of repetitive Q&A tasks, allowing them to devote more energy to more creative endeavors such as curriculum design, heuristic teaching, and providing emotional care for students. Simultaneously, the automatically constructed knowledge graph can also provide teachers with a holistic view of the course knowledge, assisting them in optimizing teaching content and its logical structure.

4.5 Comparison with existing research

This study primarily achieves the following advancements through its end-to-end automated knowledge graph construction pipeline and hybrid retrieval-augmented generation framework:

  • In knowledge construction: This research overcomes the bottleneck of traditional methods relying on manual expert input, as well as the limitation of existing studies that are often confined to processing textual resources. It achieves fully automated conversion of unstructured resources into structured knowledge graphs. This “from zero to one” automation alleviates the core challenges of high knowledge base construction costs and poor scalability.

  • In intelligent question answering: This study innovatively adopts a hybrid retrieval strategy combining graph databases and vector databases. This strategy not only ensures that generated answers are grounded in solid logical foundations and possess good explainability but also enriches the context through semantic retrieval. Consequently, it surpasses methods relying on a single knowledge source or purely generative models in terms of answer accuracy and reliability.

Thus, the methodological contribution of this work lies in the orchestrated pipeline for constructing pedagogically-aligned knowledge graphs, which explicitly addresses the cross-modal alignment challenge between video content and textbook knowledge. This provides a scalable alternative to purely manual or semantically shallow automated constructions.

4.6 Ethical considerations and fairness

While promoting the application of this technology, we must keep the accompanying ethical and fairness issues in clear view.

  • Data privacy and security: The core data processed by this system consists of course videos, which may involve instructors’ intellectual property rights and students’ portrait rights. Therefore, strict data management protocols must be established. This includes obtaining explicit authorization prior to data collection, encrypting data during transmission and storage, and anonymizing videos whenever identifying information is not strictly necessary. These measures ensure all data processing activities comply with relevant laws and regulations.

  • Algorithmic fairness: The system’s performance may be influenced by data bias. Particularly in the speech transcription stage, the Whisper model’s recognition accuracy may vary across accents, dialects, and noisy environments. This could lead to comprehension biases for input from certain user groups, thereby affecting the completeness of knowledge construction and the fairness of Q&A. Future work will require domain-adaptive fine-tuning of the model on data encompassing multiple accents and dialects to enhance its inclusivity and fairness.

  • Reliability of generated content and attribution of responsibility: Although the RAG mechanism significantly reduces “hallucination,” the risk of error propagation from knowledge extraction and understanding persists. Therefore, the system design must prioritize transparency and establish convenient user feedback and error correction channels. When the content is used for high-stakes decision-making, its auxiliary role must be clearly defined, with ultimate educational responsibility residing with human instructors.

A key ethical consideration is defining the system’s instructional role. Our framework is designed as a supportive assistant, not an authoritative final arbiter, especially in high-stakes scenarios such as formal assessment or critical skill acquisition. The system should act as an “explainer” when helping learners explore concepts or review materials, leveraging its knowledge graph to provide structured explanations. However, in assessment-related contexts, its role shifts to a “supplement”—offering hints or guiding learners to resources—while ultimate evaluation and accreditation remain the responsibility of human instructors. This distinction is crucial to prevent over-reliance and encourage critical thinking. The system’s interface explicitly presents answers as knowledge-grounded suggestions, includes source provenance where possible, and integrates seamless pathways for teacher consultation, thereby reinforcing its auxiliary, non-authoritative position within the learning ecosystem.
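The interface behavior described here, presenting answers as knowledge-grounded suggestions with source provenance, could be supported by an answer payload along the following lines. This is a minimal sketch under stated assumptions: the class, field names, and role labels are hypothetical, chosen only to illustrate keeping evidence attached to generated text.

```python
# Sketch of a provenance-carrying answer object (names are illustrative,
# not the system's actual data model).
from dataclasses import dataclass, field

@dataclass
class GroundedAnswer:
    """Answer payload that keeps retrieved evidence attached, so the UI can
    present the response as a sourced suggestion rather than a verdict."""
    text: str
    sources: list = field(default_factory=list)  # e.g. video timestamps, textbook pages
    role: str = "explainer"  # switched to "supplement" in assessment contexts

    def render(self) -> str:
        cites = "; ".join(self.sources) or "no sources"
        return f"{self.text}\n[Sources: {cites}] (role: {self.role})"

answer = GroundedAnswer(
    text="A stack stores elements in last-in, first-out order.",
    sources=["lecture video 03 @ 12:45", "textbook ch. 2, p. 31"],
)
```

Carrying the `role` field through to the interface makes the explainer/supplement distinction an explicit, auditable property of every response rather than an implicit convention.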

5 Conclusion and future work

5.1 Conclusion

This paper addresses the core issues of traditional Intelligent Tutoring Systems, namely the high cost and low automation in knowledge base construction, as well as the insufficient reliability of existing generative question-answering. It proposes and implements an Intelligent Tutoring System based on the automatic construction of multimodal knowledge graphs and Retrieval-Augmented Generation (RAG).

The core contribution of this research lies in designing an end-to-end technical framework spanning from “multimodal resource input” to “personalized question-answering output.” First, by integrating technologies such as FFmpeg, Whisper, LangChain, and NLP, a pipeline was constructed for the fully automatic extraction and building of a structured knowledge graph from both course videos and textbook PDFs. This design qualifies the system as genuinely multimodal in its current implementation, as it processes and fuses knowledge from two primary modalities, the auditory modality and the visual-textual modality, effectively lowering the barrier to knowledge base creation. Subsequently, a hybrid retrieval strategy combining graph and vector databases was employed to drive a RAG engine, thereby enhancing the generation process of the large language model. This ensures the accuracy, explainability, and contextual relevance of the intelligent Q&A answers.
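One link in such a pipeline, turning transcribed sentences into graph triples, can be caricatured with a single pattern rule. This is a deliberately simplified sketch: a production system would use NER and relation-extraction models (as the paper’s pipeline does) rather than this toy regex, and all names and data here are illustrative.

```python
# Toy triple extraction from transcript sentences via one definitional
# pattern ("X is a/an Y"); real pipelines use NER + relation models.
import re

def extract_triples(sentences):
    """Extract (head, relation, tail) triples from simple definitional
    sentences; non-matching sentences are skipped."""
    triples = []
    for s in sentences:
        m = re.match(r"^(.+?) is (?:a|an) (.+?)\.?$", s.strip(), re.IGNORECASE)
        if m:
            triples.append((m.group(1).lower(), "is_a", m.group(2).lower()))
    return triples

transcript = [
    "A stack is a linear data structure.",
    "Push adds an element to the top.",  # no definitional pattern -> skipped
]
triples = extract_triples(transcript)
```

Even this crude rule shows the shape of the conversion: free-form speech in, `(head, relation, tail)` rows out, ready for bulk insertion into a graph database.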

The case study and evaluation results demonstrate that the proposed method can construct a knowledge graph from videos and textbook PDFs with high accuracy and generate high-quality answers based on it. Simultaneously, the system exhibits fast response capabilities and has received high satisfaction ratings from learners.

5.2 Future work

Although this study has achieved significant results, several directions warrant further in-depth exploration and improvement in the future:

  • Introducing deeper multimodal information fusion: The current system primarily utilizes audio information from videos. Future work will introduce computer vision techniques to extract visual information such as keyframes, on-screen text, diagrams, and instructor gestures from videos. This visual content will be aligned and fused with textual information to construct a more comprehensive and enriched multimodal knowledge graph, thereby enhancing the representation of complex knowledge.

  • Exploring more interactive tutoring strategies: Future systems will evolve beyond passive Q&A toward proactive tutoring. We will explore personalized learning path recommendations based on the student’s knowledge state and learning history. Furthermore, interactive mechanisms capable of proactive questioning, guided inquiry, and heuristic dialog will be designed to simulate more advanced teaching strategies of human tutors and deepen the guidance effectiveness.

  • Optimizing the trustworthiness of retrieval and generation algorithms: To further reduce LLM “hallucination” and improve answer quality, research into more advanced retrieval result re-ranking mechanisms and answer consistency verification methods will be conducted. Concurrently, exploring the integration of chain-of-thought techniques will make the model’s reasoning process more transparent and controllable, fundamentally enhancing the system’s reliability and trustworthiness.

  • Conducting large-scale, long-term pedagogical empirical studies: While the case study in this research validates the system’s feasibility, its long-term effects need to be tested in more authentic and complex educational settings. Future plans include conducting large-scale, long-term teaching experiments across different subjects and educational stages. These studies will comprehensively evaluate the system’s educational value and social impact from multiple dimensions, including learning outcomes, user engagement, and its actual influence on teaching processes.
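As a starting point for the keyframe extraction mentioned in the first bullet above, a simple frame-difference heuristic already captures the core idea. The sketch below is a hypothetical illustration operating on flat grayscale lists; an actual implementation would decode video frames (e.g. via FFmpeg or OpenCV) and likely use more robust shot-boundary detection.

```python
# Frame-difference keyframe heuristic on toy grayscale "frames"
# (flat lists of pixel intensities in [0, 1]); illustrative only.
def select_keyframes(frames, threshold=0.3):
    """Keep frame 0, then keep any frame whose mean absolute pixel change
    versus the last kept frame exceeds the threshold (a crude shot-change
    detector)."""
    if not frames:
        return []
    keep = [0]
    for i in range(1, len(frames)):
        prev = frames[keep[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], prev)) / len(prev)
        if diff > threshold:
            keep.append(i)
    return keep
```

Comparing each candidate against the last *kept* frame (rather than its immediate predecessor) makes the heuristic robust to slow pans that change the scene gradually.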

Statements

Data availability statement

The original contributions presented in the study are included in the article/Supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

CD: Software, Writing – original draft, Resources, Writing – review & editing, Funding acquisition. BY: Validation, Writing – review & editing, Writing – original draft.

Funding

The author(s) declared that financial support was received for this work and/or its publication. This study was supported by the following projects: 1. “Research on Key Technologies for a Complex-Scenario Emotion Monitoring System Based on Facial Recognition” (GKY-2022BSQD-43), funded by the Guangdong University of Science and Technology. 2. “Key Technologies for Vertical-Domain Large Language Models for Precise Talent Cultivation in the Intelligent Manufacturing Industry”, funded by the Dongguan Science and Technology Bureau.

Conflict of interest

The author(s) declared that this work was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The author(s) declared that Generative AI was not used in the creation of this manuscript.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fcomp.2026.1777749/full#supplementary-material



Keywords

automatic knowledge graph construction, intelligent tutoring system, personalized learning, retrieval-augmented generation (RAG), technology-enhanced learning

Citation

Deng C and Yuan B (2026) Research on an intelligent tutoring system based on automatic construction of multimodal knowledge graphs and retrieval-augmented generation. Front. Comput. Sci. 8:1777749. doi: 10.3389/fcomp.2026.1777749

Received

30 December 2025

Revised

05 February 2026

Accepted

06 February 2026

Published

25 February 2026

Volume

8 - 2026

Edited by

Antonio Sarasa-Cabezuelo, Complutense University of Madrid, Spain

Reviewed by

Ching Nam Hang, Caritas Institute of Higher Education, Hong Kong SAR, China

Zou Xianxia, Jinan University, China


Copyright

*Correspondence: Bo Yuan,

