Evaluating metaverse-based L2 vocabulary learning effectiveness using a proposed metric of vocabulary forgetting percentage

Zhang, Mike Minwen; Hashim, Harwati; Yunus, Melor Md

doi:10.3389/feduc.2025.1641638

ORIGINAL RESEARCH article

Front. Educ., 18 September 2025

Sec. Assessment, Testing and Applied Measurement

Volume 10 - 2025 | https://doi.org/10.3389/feduc.2025.1641638

Mike Minwen Zhang*†

Harwati Hashim†

Melor Md Yunus†

Faculty of Education, Universiti Kebangsaan Malaysia, Bangi, Malaysia

Introduction: Vocabulary gain and retention are widely recognized as essential metrics in second language (L2) vocabulary learning. However, these traditional measures often fail to reflect the proportional loss of learned vocabulary knowledge over time, thus limiting their practical, diagnostic, and comparative value across different instructional contexts.

Methods: To address this gap, the present study proposes a percentage-based metric: Vocabulary Forgetting Percentage (VFP). To evaluate metaverse-based vocabulary learning (VL) effectiveness and also to empirically validate the VFP, a quasi-experiment was conducted, involving 50 Chinese middle school EFL learners who were assigned to either a metaverse-based group (MG) or a slides-assisted control group (SG). Over three learning sessions, participants learned equivalent vocabulary content and completed pretests, immediate post-tests, and delayed post-tests. Quantitative data were analyzed using independent-samples t-tests to compare vocabulary gains, retentions, and VFPs across groups.

Results: The MG significantly outperformed the SG in vocabulary gain and retention in each session and in mean scores. However, VFP results showed a different pattern: the MG’s third-session and mean VFPs were significantly lower than those of the SG, while differences in the first and second sessions were not significant.

Discussion: The late emergence of the VFP difference between the MG and the SG suggests that extended exposure to immersive environments may be required before full benefits appear. Findings also confirm the pedagogical potential of the metaverse for vocabulary learning and empirically validate VFP as a complementary metric. By proportionally quantifying vocabulary loss, VFP offers researchers and educators a more nuanced tool for evaluating learning efficiency and retention sustainability in varied L2 contexts.

Introduction

Vocabulary is one of the most critical factors in second language acquisition (SLA) (Crossley et al., 2009; Schmitt, 2000; Webb, 2005), as it significantly influences how learners comprehend and produce a foreign or second language. Learners’ vocabulary knowledge greatly affects their success or failure in language learning (Afzal, 2019; Ng and Rosli, 2023). Scholars (Ghalebi et al., 2020; Arochman et al., 2023) have also argued that memorizing a large number of vocabulary items is a major challenge for foreign language learners, let alone retaining them in long-term memory. To develop autonomous vocabulary knowledge, learners must be highly motivated to engage in a dynamic process of skill development that involves practicing various strategies and employing effective techniques (Nation, 2001). Flashcards, notebooks, dictionaries, and the use of synonyms and antonyms are among the most commonly used tools by educators and students for vocabulary instruction and acquisition (Altiner, 2019; Elgort and Nation, 2010; Oxford and Crookall, 1990; Lessard-Clouston, 2021). Despite these efforts, vocabulary acquisition and retention remain among the most daunting aspects of SLA (Milton, 2009; Schmitt, 2008).

Recent years have seen tremendous progress in the integration of technology, especially in vocabulary acquisition, which opens up new opportunities for dynamic and engaging learning experiences (Lin and Wei, 2024; Teymouri, 2024; Zhang et al., 2025). The field of vocabulary learning has been profoundly impacted by emerging technological developments, including mobile learning apps (Govindasamy et al., 2019), VR (Chen and Yuan, 2023), AR (Hung and Yeh, 2023), and, most recently, the metaverse (Wang et al., 2025).

A metaverse is a shared virtual environment where users can interact with each other and the environment in a simulated three-dimensional space using a combination of digital technologies and media such as audio, video, images, animations, 3D objects, and interactive elements like slideshows (Mystakidis, 2022). Because they are capable of incorporating many digital technologies that heighten the sensation of realism, presence, and co-presence, metaverse platforms are exceptionally immersive. The immersive and collaborative features of the metaverse have the potential to revolutionize traditional vocabulary learning practices (Çelik and Baturay, 2024; Wang et al., 2025).

To evaluate VL effectiveness in technology-enhanced learning contexts, various measures and metrics have been formulated and employed to assess learning outcomes from various perspectives. The most explicit and frequently used measures are the raw scores of vocabulary assessments, such as pretests, immediate posttests, and delayed posttests (Alfadil, 2020; Lee, 2023). Many scholars have also introduced computed measures based on simple calculations of learners’ test results, including vocabulary gain and retention (Elekaeı et al., 2020; Tai et al., 2022). Additionally, more complex formula-based metrics—such as forgetting count (Fukushima et al., 2024) and forgetting rate (Tabibian et al., 2019; Rahman et al., 2021; Rivera-Lares et al., 2023)—have been proposed. While these existing measures and metrics each have their own advantages in analyzing and capturing VL effectiveness, their limitations are also evident. Therefore, more nuanced, accurate, and comprehensive metrics are still warranted, particularly in the realm of technology-enhanced L2 vocabulary learning.

Literature review

Vocabulary knowledge is widely acknowledged as an inherently complex construct, and for pedagogical as well as assessment purposes, broad terms such as “vocabulary knowledge” are considered too imprecise for effective operationalization (Stewart et al., 2024). Within the context of L2 vocabulary knowledge, the “form-meaning link” is recognized as its core component. In addition to form and meaning, a distinction can be made between recognition, in which a learner is presented with an L2 word form and is expected to activate its meaning(s), and recall, in which the learner is given some kind of stimulus that prompts the activation of the L2 word form from memory (Read, 2000). Schmitt (2010) classified overall vocabulary knowledge into four sub-constructs: form recognition, form recall, meaning recognition, and meaning recall. The recognition–recall distinction may carry significant implications for achieving full word mastery. Therefore, in this study, the design of the vocabulary learning sessions and vocabulary tests incorporated the four sub-constructs to enhance the accuracy and comprehensiveness of the measurement.

Vocabulary assessments

In L2 vocabulary learning (L2VL) research, measuring the learning outcomes of vocabulary knowledge effectively remains an open problem. In previous studies, vocabulary learning effectiveness has been widely assessed with various vocabulary tests. Researchers often rely on established assessments, including the Vocabulary Levels Test (VLT) (Nation, 2001; Schmitt et al., 2001); Wesche and Paribakht’s Vocabulary Knowledge Scale (VKS) (Wesche and Paribakht, 1996); the British Picture Vocabulary Scales (BPVS) (Dunn et al., 1997); the Nelson-Denny Reading Test – Vocabulary Subtest (Brown et al., 1993); the Cambridge Assessment English – B1 Level Vocabulary Test (Cambridge Assessment English, 2020); the Peabody Picture Vocabulary Test (PPVT) (Dunn and Dunn, 2007); Laufer and Nation’s Vocabulary-Size Test of Controlled Productive Ability (Laufer and Nation, 1999); and the Oxford Young Learners Placement Test (YLPT) (Oxford University Press, 2014). Additionally, many vocabulary researchers tailor their own tests to suit the target vocabulary and proficiency level of participants (Webb, 2005).

In empirical L2VL studies, the aforementioned standardized or researcher-developed vocabulary tests are often administered multiple times at different points of a VL intervention to measure learning gain and retention over time and also to capture the variability of a learner’s learning performances (Schmitt, 2010). Data on pretest, immediate posttest, and delayed posttest scores are commonly gathered at three distinct time points (Nation, 2001):

1. Pretest Score: The baseline measure of vocabulary knowledge before the learning intervention.

2. Immediate Posttest Score: The measure of vocabulary knowledge immediately after the learning activity.

3. Delayed Posttest Score: The measure of vocabulary knowledge after one or more designated time delays (Webb, 2005).

The pretest is administered before the learning intervention to establish learners’ baseline vocabulary knowledge. It ensures that participants have not previously mastered the target vocabulary items, and it provides a reference point for measuring subsequent gains. The immediate posttest is conducted immediately after the VL intervention (e.g., reading activity, multimedia instruction, app-based learning, or game-based practice). It is designed to capture the vocabulary gain, or the amount of knowledge acquired during the learning phase. The item types and test formats in the immediate posttest usually mirror those of the pretest to ensure measurement consistency (Read, 2000). Consistency in format is crucial for isolating the learning effect (Nation, 2001). The delayed posttest(s) are administered after a defined period—commonly one week, two weeks, or even several months after the immediate posttest—to assess vocabulary retention and measure the durability of learning (Barcroft, 2009). Like the pretest and immediate posttest, the delayed posttest(s) use consistent formats to allow direct score comparison. The time intervals are selected based on the research objective: shorter intervals assess short-term retention, while longer delays provide insights into long-term memory consolidation. In many cases, delayed posttest(s) are administered without prior warning to avoid rehearsal effects (Zhou, 2010).

Vocabulary gain and retention

Two metrics, “vocabulary gain” and “vocabulary retention,” are commonly used by researchers as key indicators of VL effectiveness (Faramarzi et al., 2014; Okyar and Çakır, 2019), particularly in technology-assisted contexts (Elekaeı et al., 2020; Lee, 2023; Tai et al., 2022). At the same time, the three terms “vocabulary gain,” “vocabulary acquisition” (Chen and Yuan, 2023; Ersanli, 2023), and “vocabulary learning” (Alfadil, 2020; Sahinler, 2023) are often used interchangeably in technology-enhanced vocabulary learning research. Because “vocabulary acquisition” and “vocabulary learning” also refer to the overall VL process, they can cause conceptual confusion. To ensure terminological clarity, this study uses “vocabulary gain” to refer to immediate VL effectiveness.

Vocabulary gain (VG) was defined as the short-term memory retrieval of vocabulary knowledge, measured by subtracting the vocabulary pretest score from the immediate post-test score (Nation, 2001; Webb, 2007; Lai and Chen, 2023; Reynolds et al., 2022), and refers to the increase in the number of words and expressions that an individual learns and integrates into their active lexical memory immediately after the VL interventions. VG is commonly calculated as:

Vocabulary Gain (VG) = Immediate Posttest Score - Pretest Score

Vocabulary retention (VRe), on the other hand, is often viewed as a more intricate cognitive process of memory incorporating memorization or acquisition, recall, and recognition (Suleiman, 2009). VRe was defined by Richards and Schmidt (2002) as “the ability to recall or remember things after an interval of time” or long-term memory retrieval of vocabulary knowledge. Mohammed (2009) defines VRe as “the ability to retain the acquired vocabulary and retrieve it after a period of time following a certain learning intervention.” Therefore, VRe is commonly measured by the difference between a delayed post-test score and a vocabulary pretest score (Barcroft, 2004; Nation, 2001; Webb, 2007; Zhong, 2018).

Vocabulary Retention (VRe) = Delayed Posttest Score - Pretest Score

Unlike the immediate posttest and the delayed posttest, vocabulary gain and retention emphasize the changes in vocabulary knowledge resulting from specific learning interventions. Thus, learners’ initial vocabulary proficiency should be controlled to account for individual variation when calculating vocabulary gain and retention.
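As a minimal sketch of the two calculations above, the following Python snippet computes VG and VRe for a single learner. The function names and the scores are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def vocabulary_gain(pretest: float, immediate: float) -> float:
    """VG = immediate posttest score - pretest score."""
    return immediate - pretest


def vocabulary_retention(pretest: float, delayed: float) -> float:
    """VRe = delayed posttest score - pretest score."""
    return delayed - pretest


# Hypothetical learner on a 20-item test (scores invented for illustration).
vg = vocabulary_gain(pretest=4, immediate=16)
vre = vocabulary_retention(pretest=4, delayed=13)
```

Because both measures subtract the same pretest score, they describe change relative to the learner’s own baseline rather than the raw test level.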

While both VG and VRe are valuable indicators of VL performances, studies have long questioned their adequacy in fully capturing the learning effectiveness (Milton, 2009; Schmitt, 2010), especially with the increasing complexity of digital and immersive learning environments (Feng and Ng, 2024; Weng et al., 2024). VG and VRe, as raw scores, are heavily influenced by learners’ baseline proficiency and the absolute difficulty of the vocabulary items, potentially skewing interpretations of effectiveness. Moreover, VG and VRe are limited in their interpretability for comparative or longitudinal research. Vocabulary gain may overestimate effectiveness by capturing short-term memorization rather than durable learning. Similarly, retention scores alone may mask inefficient learning processes, particularly if learners retained very little relative to what they initially gained.

Several recent empirical studies echo the need for more nuanced effectiveness measures. Lai and Chen (2023) emphasize that retention should not be viewed in isolation but rather in tandem with acquisition metrics. Likewise, Elekaeı et al. (2020) and Zhang et al. (2025) highlight that a higher initial gain followed by a steep forgetting curve may suggest superficial or ineffective learning strategies. Consequently, there is a growing consensus that evaluating the forgotten vocabulary knowledge—rather than solely what is gained or retained—can provide a more complete picture of L2 students’ learning effectiveness in varying contexts (Bahrick et al., 1993; Kornmeier et al., 2022; Sense et al., 2018).

Review of previous metrics for vocabulary forgetting

There are several existing metrics to measure forgetting or memory loss in the prior literature. Three formulas previously proposed specifically for vocabulary knowledge forgetting emerged in the L2 learning domain:

In Fukushima et al.’s (2024) study, the vocabulary forgetting measure, inconsistently referred to as “forgetting rate” and “forgetting count,” was calculated as “subtraction of the immediate posttest score from the one-week delayed posttest score.”

Forgetting Count = Immediate Posttest Score - Delayed Posttest Score

Notably, the term “forgetting rate,” which was used most often in the article, is inaccurate: a rate is a quantity per unit of time and should indicate how fast vocabulary knowledge is forgotten. The simple, raw-score-based formula captures only the absolute number of vocabulary items forgotten during a given time interval; it lacks sensitivity to initial learning performance and cannot adequately support comparative or inferential analyses across different learners, groups, or instructional settings.
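The insensitivity to initial learning can be illustrated with a short sketch. The two learners below are hypothetical; their scores are invented so that both lose the same number of items after very different gains.

```python
def forgetting_count(immediate: float, delayed: float) -> float:
    """Forgetting count = immediate posttest score - delayed posttest score."""
    return immediate - delayed


# Two hypothetical learners: (pretest, immediate posttest, delayed posttest).
learner_x = (2, 17, 14)   # gained 15 items, then lost 3
learner_y = (10, 14, 11)  # gained 4 items, then lost 3

count_x = forgetting_count(learner_x[1], learner_x[2])
count_y = forgetting_count(learner_y[1], learner_y[2])
same_count = count_x == count_y  # the raw count cannot tell the two apart
```

Both learners receive the same forgetting count even though one lost a fifth of what was gained and the other lost three quarters of it.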

In addition, Rahman et al. (2021), in a study on reducing the forgetting rate of EFL students with a spaced repetition-powered digital game-based learning application, adapted the empirical forgetting rate formula of Tabibian et al. (2019) into a revised version for EFL education. This empirical forgetting rate measures how fast the memory of an item decays after a single exposure at a certain point in time.

{\hat{n}}_{0, (u, i)} = \frac{- \log \left( \hat{m} (t_{(u, i), 2}) \right)}{t_{(u, i), 2} - t_{(u, i), 1}}

In this formula, ${\hat{n}}_{0, (u, i)}$ = initial forgetting rate; $\hat{m} (t_{(u, i), 2})$ = the binary (0 or 1) learning performance on a single word item, tested by a single question at the second attempt; $t_{(u, i), 2} - t_{(u, i), 1}$ = time interval between the second attempt and the preceding one.

First, this formula is designed for single-item forgetting; although it provides nuanced variations across items, it is not optimal for analyzing the learning outcomes of a vocabulary list or for inter-group performance analysis. Second, using a binary (0 or 1) score for the second attempt oversimplifies the actual learning performance. It does not capture partial correctness, degrees of confidence, or nuanced performance, making the forgetting rate overly sensitive to a single correct/incorrect response, especially in contexts with more complex or probabilistic learning data. Third, the formula focuses only on performance at the second attempt, $t_{(u, i), 2}$, but ignores how well the item was learned during the first exposure. Without accounting for initial performance or learning strength, the forgetting rate estimate may misrepresent the true rate of forgetting. Fourth, this formula, albeit innovative, is based on a decay curve (forgetting curve) primarily modeled by a logarithmic function (Ebbinghaus, 1913; Murre and Dros, 2015). Human memory retention, however, might not fully adhere to these simple decay laws, particularly for complex materials or diverse learning settings (Cepeda et al., 2006; Soderstrom and Bjork, 2015; Wixted, 2004). Finally, this logarithmic and time-normalized formula can be too sensitive to measurement errors. When analyzing a large sample or comparing across different groups, the errors can be magnified, and the results can be distorted.
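The sensitivity to a binary response can be seen in a small sketch of the empirical forgetting rate. The function name, the use of the natural logarithm, and the decision to return infinity for a recall value of zero are assumptions made for illustration; the cited studies may treat these cases differently.

```python
import math


def empirical_forgetting_rate(recall: float, t1: float, t2: float) -> float:
    """Return -log(recall at t2) / (t2 - t1); natural logarithm assumed."""
    if t2 <= t1:
        raise ValueError("t2 must be later than t1")
    if recall <= 0:
        # log(0) is undefined, so the rate diverges (assumed handling).
        return math.inf
    return -math.log(recall) / (t2 - t1)


# With a binary score only two outcomes are possible after a 7-day interval.
rate_correct = empirical_forgetting_rate(1.0, t1=0, t2=7)    # no forgetting
rate_incorrect = empirical_forgetting_rate(0.0, t1=0, t2=7)  # unbounded

# A graded (hypothetical) recall value yields an intermediate rate.
rate_partial = empirical_forgetting_rate(0.5, t1=0, t2=7)
```

Under these assumptions a single correct or incorrect answer moves the estimate between zero and an unbounded value, which is the oversimplification noted in the second point above.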

Additionally, Rivera-Lares et al. (2023) implicitly indicated the term “forgetting rate” in a line graph by illustrating the means of the number of correct responses at three different points in time and the linear slopes by connecting every two adjacent time points. Hence, the forgetting rate in this study can be calculated by dividing the difference between two successive test results by the time interval between them. The formula is illustrated as follows:

Forgetting Rate = \frac{S_{t_{1}} - S_{t_{2}}}{\Delta t}

There are also two drawbacks to this formula. First, this study did not involve VL but sentence learning in its experiment design, though forgetting rate was investigated; second, no baseline test or pretest was conducted to homogenize the initial knowledge levels of learners. Therefore, this formula cannot be directly applied to L2 vocabulary acquisition.
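For comparison, the slope-based forgetting rate can be sketched as follows. The function name and the example scores are hypothetical, and the time unit (days) is an assumption.

```python
def slope_forgetting_rate(score_t1: float, score_t2: float, delta_t: float) -> float:
    """Score lost per unit of time between two successive tests."""
    if delta_t <= 0:
        raise ValueError("delta_t must be positive")
    return (score_t1 - score_t2) / delta_t


# Hypothetical learner: 16 correct immediately, 13 correct seven days later.
rate_per_day = slope_forgetting_rate(16, 13, delta_t=7)
```

The result is expressed in items per day, so it depends on the test length and carries no information about the learner’s baseline.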

Overall, a more interpretable, diagnostic, and comparative metric for vocabulary forgetting can be regarded as a necessary complement to VG and VRe, especially in technology-enhanced vocabulary learning. However, how to rigorously and scientifically calculate forgotten vocabulary knowledge remains unexplored in L2VL. Therefore, the current study aims to propose a new formula for vocabulary forgetting as a measure to evaluate L2 vocabulary effectiveness; subsequently, the researcher attempts to test the validity of the newly proposed formula in two different learning environments.

Research questions

1. How can the VFP be calculated to measure L2VL effectiveness?

2. To what extent does the metaverse-based learning approach affect the VL effectiveness among middle school EFL learners compared to the traditional slide-assisted counterpart?

3. What are the differences in capturing L2VL effectiveness between the proposed formula of VFP and the existing measures of VG and VRe in the metaverse-based learning context?

Vocabulary forgetting percentage formula formulation

Motivated by the limitations of widely used metrics such as VG and VRe, and inspired by the aforementioned formula-based metrics for vocabulary forgetting, the current research proposes a metric for vocabulary retention loss that evaluates VL effectiveness more rigorously across various L2 instructional settings.

A new metric, the Vocabulary Forgetting Percentage (VFP), is proposed in the current study. This metric aims to quantify the proportion of initially gained vocabulary that is subsequently forgotten over time, thereby offering a standardized and comparative indicator of long-term VL effectiveness. VFP is defined as the normalized percentage of vocabulary forgotten within a given time interval relative to the initial vocabulary gain immediately after the learning intervention. It offers a measure of retention loss across different groups in different learning environments. The formula for calculating the VFP is expressed as:

Vocabulary Forgetting Percentage (VFP) = \frac{\text{Retention Loss}}{\text{Vocabulary Gain}} \times 100\%

Explanation of the proposed formula

The numerator of the proposed formula, Retention Loss (RL), reflects the absolute number of vocabulary items forgotten between the immediate and delayed posttests, which can be represented as:

Retention Loss (RL) = Immediate Posttest Score - Delayed Posttest Score

For each learner, the pretest score is fixed: vocabulary gain is the immediate posttest score minus the pretest score, and vocabulary retention is the delayed posttest score minus the pretest score. The pretest score therefore cancels, and RL can also be expressed as the difference between VG and VRe.

\begin{array}{l} RL = (\text{Immediate Posttest Score} - \text{Pretest Score}) - \\ (\text{Delayed Posttest Score} - \text{Pretest Score}) = \\ \text{Vocabulary Gain} - \text{Vocabulary Retention} \end{array}

However, this raw loss can be misleading if considered in isolation, as it does not account for how much was learned initially from the learning intervention. To address this, the denominator, or VG, represents the total number of vocabulary items learned as a result of the intervention (from pretest to immediate posttest). Notably, by introducing VG instead of the Immediate Posttest Score in the denominator, the variation in initial vocabulary proficiency can be controlled for when comparing VFP values across different learners.

By dividing RL by VG, the formula computes the proportion of learned vocabulary that was lost and then multiplies it by 100 to express the result as a percentage. This yields a forgetting percentage that is both normalized and interpretable, providing meaningful comparisons across learners, instructional approaches, and research studies.

Meanwhile, based on the breakdown of the concepts above, it can also be written as:

VFP = \frac{\text{Immediate Posttest Score} - \text{Delayed Posttest Score}}{\text{Immediate Posttest Score} - \text{Pretest Score}} \times 100\%

VFP = \frac{\text{Vocabulary Gain} - \text{Vocabulary Retention}}{\text{Vocabulary Gain}} \times 100\%

This formula calculates the proportion of vocabulary forgotten after a specified retention interval, relative to the vocabulary gain in a certain learning intervention. The immediate posttest score represents learners’ short-term retention following instructional intervention, while the delayed posttest score reflects longer-term retention after a period of time has passed. By expressing the loss as a percentage of the vocabulary gain rather than a difference between the immediate posttest score and the delayed posttest score (Fukushima et al., 2024), the formula normalizes retention loss across different performance levels, thus offering a standardized and interpretable metric for both intra-group and inter-group comparisons.
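A minimal Python sketch of the VFP calculation is given below. The function name is hypothetical, and returning `None` when the vocabulary gain is zero or negative is an assumption: the formula is undefined for a zero denominator, and the text above does not state how such cases should be handled.

```python
from typing import Optional


def vocabulary_forgetting_percentage(
    pretest: float, immediate: float, delayed: float
) -> Optional[float]:
    """VFP = (immediate - delayed) / (immediate - pretest) * 100."""
    gain = immediate - pretest
    if gain <= 0:
        # Assumed handling: no measurable gain, so VFP is left undefined.
        return None
    retention_loss = immediate - delayed
    return retention_loss / gain * 100


# Hypothetical learner: pretest 4, immediate posttest 16, delayed posttest 13.
vfp = vocabulary_forgetting_percentage(4, 16, 13)
no_gain = vocabulary_forgetting_percentage(10, 10, 9)
```

In this illustration the learner gained 12 items and lost 3 of them, so a quarter of the gain was forgotten.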

Illustrative example

To calculate and compare the VFPs of two vocabulary learners using two different learning methods, their pretest, immediate posttest, and one-week delayed posttest raw scores must first be collected. Their scores are listed as follows:

Based on the formula:

Vocabulary Forgetting Percentage (VFP) = \frac{\text{Retention Loss}}{\text{Vocabulary Gain}} \times 100\%

Hence, the respective VG, RL, and VFP for students A and B are listed below.

This means that, despite the same scores on the one-week delayed posttest, student A still showed a lower VFP than student B.
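The same pattern can be reproduced with a short sketch. The scores below are hypothetical and are not the values from the article’s tables; they were chosen only so that both students share the same delayed posttest score.

```python
# Hypothetical raw scores: (pretest, immediate posttest, one-week delayed posttest).
# These are NOT the article's values; they are invented for illustration.
scores = {"A": (2, 18, 14), "B": (6, 18, 14)}

results = {}
for student, (pretest, immediate, delayed) in scores.items():
    vg = immediate - pretest   # vocabulary gain
    rl = immediate - delayed   # retention loss
    results[student] = {"VG": vg, "RL": rl, "VFP": round(rl / vg * 100, 1)}
```

With these invented scores both students retain 14 items after one week, yet student A forgets a smaller share of what was gained (25.0%) than student B (33.3%).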

Evaluation of metaverse-based VL effectiveness and validation of the proposed formula

Methodology

The second phase of this study adopts a quasi-experimental study design. This phase aims to examine the L2 learners’ VL effectiveness in the metaverse-based learning environment and to compare the VFP measure with traditional metrics of VG and VRe. By doing so, it also seeks to determine whether the VFP formula provides a more comprehensive and sensitive assessment of VL effectiveness across different instructional modalities.

Participants and sampling

The subjects of the study were Grade 8 students in a public middle school in Mainland China. Fifty students (26 males and 24 females, aged 13 to 15, M = 14.64) consented to participate in the study. Of these, 25 students utilized a metaverse platform, Spatial, to engage in three metaverse-based learning sessions (metaverse-based group, or MG), learning 20 words or expressions per lesson. The remaining 25 students in the control group attended PowerPoint slide-assisted learning sessions in a traditional classroom to acquire the same vocabulary knowledge (slide-assisted group, or SG). A t-test (two-tailed p = 0.425, p > 0.05) indicated no statistically significant difference in initial vocabulary proficiency between the MG and the SG. All participants were native Chinese speakers with no prior experience of using any metaverse-based platforms or tools for English language learning before this study. Both groups were instructed by one English teacher appointed by the study’s researchers. Approval from the middle school was obtained prior to the commencement of the formal data-gathering process. All participants and their legal guardians were informed of the study’s purpose, procedures, and their right to withdraw at any time without penalty. Written informed consent was obtained from all participants and their guardians prior to participation.

Metaverse platform: Spatial

This study utilized a metaverse platform called Spatial.io (henceforth referred to as Spatial). Spatial¹ is a free-access, open-source metaverse platform. Spatial allows multiple people to engage in existing virtual environments or construct new ones for gaming, meetings, chatting, collaborative learning, and content sharing, accessible via regular web browsers or in real-time with VR headsets. Spatial provides high-fidelity, rich-content learning experiences, allowing users to engage more immersively with the material and fellow participants in both the paid and free versions. The present study utilized the free version, which offers the same level of customizability, enabling users to construct individualized metaverses from the ground up or to configure tailored VR rooms and avatars by modifying templates and incorporating an extensive array of intricate 3D content, embellishments, and toolkits provided on the platform.

Learning instruments and materials

Metaverse-based and slide-assisted VL sessions

The researcher prepared three consecutive and coherent VL sessions for both the MG and the SG in accordance with the English Curriculum Standards and the themes outlined in the Grade 8 middle school English textbook published by People’s Education Press. In alignment with the Grade 8 English curriculum, the same three sessions (Animal Kingdom, Treasure Island and A Trip to Thailand) were developed for both the MG and SG, with both groups acquiring the same sets of vocabulary. Each learning session involved the acquisition of 20 words (see Appendix 1). The majority of the vocabulary was sourced from the Grade 8 textbook’s word list, with additional terms and idioms incorporated based on their pertinence to current virtual environments and their frequency of use. Three experts were asked to assess the validity of the curriculum design for the metaverse-based VL sessions and of the vocabulary list (Figures 1–4).

Figure 1. MG learners interacting with 3D animated wild animals and learning the corresponding English words under the teacher’s instruction and supervision in the first learning session: Animal Kingdom. This classroom activity helps students reinforce form recognition and form recall knowledge.

Figure 2. Learners in MG playing a multi-player treasure hunt game in the second learning session: Treasure Island, which facilitates incidental vocabulary learning, collaborative learning and situated learning. The game aims to improve students’ word recognition, word recall, meaning recognition and meaning recall knowledge.

Figure 3. In the third session, MG participants immersing themselves in a traditional Thai house and playing an item-seeking game, which is designed to enhance learners’ meaning recognition and form recall knowledge by reinforcing the psychological links between word spellings and corresponding images.

Figure 4. Students in the MG exploring and learning at a virtual gallery where the spellings of different items related to Thailand and their corresponding images are displayed on the walls, with 3D models positioned in front, which helps learners consolidate their form recognition and meaning recognition knowledge. Subsequently, students are invited to play a word-matching game in the far section, which is designed to improve their meaning recall and form recall knowledge via intentional vocabulary learning.

English vocabulary proficiency test (EVPT)

The English Vocabulary Proficiency Test (EVPT) was adapted from the Vocabulary Size Test (bilingual Mandarin version) developed by Nation and Beglar (2007). The original assessment covers 14,000 word families and includes 140 multiple-choice questions, with 10 questions derived from each 1,000-word family level. The EVPT aims to test L2 learners’ receptive vocabulary knowledge (Nation and Beglar, 2007). A student’s total score must be multiplied by 100 to estimate their overall receptive vocabulary size. According to the English Curriculum Standards for Nine-Year Full-Time Compulsory Education (Ministry of Education of the People’s Republic of China, 2022), however, the average required vocabulary size for middle school graduates is approximately 1,600 words, including idioms and collocations, and the maximum requirement is approximately 2,000 words and expressions. To allow for Grade 8 students whose vocabulary sizes exceed the average level, only the first 30 multiple-choice items of the Vocabulary Size Test were kept. Items beyond the 3,000-word level were therefore not included in the EVPT (see Appendix 2).
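The scoring rule described above can be sketched as follows. The function and constant names are hypothetical, and the sketch assumes that the shortened 30-item EVPT keeps the original multiplier of 100 per correct item.

```python
WORD_FAMILIES_PER_ITEM = 100  # each correct item stands for 100 word families
EVPT_ITEMS = 30               # items retained from the first three 1,000 levels


def estimated_vocabulary_size(correct_items: int) -> int:
    """Estimate receptive vocabulary size from the number of correct items."""
    if not 0 <= correct_items <= EVPT_ITEMS:
        raise ValueError("correct_items must be between 0 and 30")
    return correct_items * WORD_FAMILIES_PER_ITEM


# Hypothetical student with 17 correct answers.
size = estimated_vocabulary_size(17)
ceiling = estimated_vocabulary_size(EVPT_ITEMS)
```

Under this assumption the shortened test can report sizes up to 3,000 word families, which is above the maximum curriculum requirement of approximately 2,000.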

Vocabulary tests

The effectiveness of students’ VL was assessed using test sets from three metaverse-based sessions and three slide-assisted sessions, which included pretests, post-tests, and delayed post-tests. These were adapted from Wesche and Paribakht’s (1996) Vocabulary Knowledge Scale (VKS) and evaluated by three experts and the instructor of the learning sessions. Each vocabulary assessment package has one pretest, one posttest, and one delayed posttest. Each test comprises 20 items, each with five options, where every item assesses one word or expression acquired during the corresponding session (see Appendix 3). Furthermore, the vocabulary and phrases assessed were derived only from the three instructional sessions. To mitigate order effects and diminish test familiarity bias, the testing sequences of the vocabulary in pretests, posttests, and delayed posttests were randomized. The test sets were designed to assess the receptive and productive vocabulary knowledge, including form recognition, meaning recognition, form recall and meaning recall, of both the MG and the SG (see Table 1). The fundamental concept of the scale is to assess incremental levels of vocabulary comprehension. Students must evaluate their familiarity with a term or expression using the provided scale and complete the corresponding blank.
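One possible way to randomize the item order across the three tests is sketched below. The item labels and seeds are hypothetical placeholders, and the procedure is an assumption; the text does not describe how the randomization was actually carried out.

```python
import random


def randomized_order(items, seed):
    """Return a shuffled copy of the items; the seed keeps the order reproducible."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical placeholders for the 20 target items of one session.
target_items = [f"item_{n:02d}" for n in range(1, 21)]

# One independent ordering per test, each drawn with its own seed.
orders = {
    test: randomized_order(target_items, seed)
    for seed, test in enumerate(["pretest", "posttest", "delayed posttest"])
}
```

Each test still contains the same 20 items, so scores remain directly comparable while the position of any given item varies.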

Table 1. Task description and type of knowledge measured.

Data collection procedure

The complete process required 8 weeks. A pilot study was undertaken during the first week, resulting in adjustments to the hardware, software, and difficulty levels of the three lessons. The EVPT was administered in the second week, and participants were assigned to two groups (MG and SG) according to the EVPT outcomes. The next 2 weeks consisted of training sessions for the MG and the SG focused on metaverse-based and slide-assisted learning, respectively, as well as for the instructor in both instructional environments. The learning sessions took place from the fifth to the seventh week and consisted of three separate vocabulary learning sessions. The virtual reality-based learning sessions took place in a computer laboratory at the designated school, while the three slide-assisted instructional sessions were conducted in the students’ usual classroom. Each session lasted 90 min, including a 10-min intermission, which is comparable to two consecutive regular English lessons at middle schools in China. Before each learning session, all 50 participants completed a pretest on the forthcoming topic. At the end of each session, all students immediately completed a post-test on the vocabulary learned during the session. One week after each learning session, a delayed post-test was administered to assess their vocabulary knowledge once more. The interval between the immediate and delayed post-tests was set at one week because a one-week delayed post-test is commonly used by researchers in VR-assisted VL (Lai and Chen, 2021; Fuhrman et al., 2021; Fukushima et al., 2024; Kaplan-Rakowski and Thrasher, 2024; Tai et al., 2022; Luan et al., 2024).

Data analysis

In the present study, the Statistical Package for the Social Sciences (SPSS) version 27.0.1 was used to analyze the data from the quasi-experiment. Specifically, independent-samples t-tests were conducted, and both descriptive and inferential analyses were employed to compare VL effectiveness between the two groups.
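For readers who wish to reproduce this type of comparison outside SPSS, the following is a minimal sketch of the pooled-variance (equal variances assumed) independent-samples t-test. It is not the authors' analysis script; the function name and the sample scores are hypothetical, and SPSS additionally reports a variant that does not assume equal variances, which is not shown here.

```python
import math
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Student's independent-samples t-test with pooled variance.

    Returns the t statistic and its degrees of freedom (n_a + n_b - 2).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical scores, not data from the study.
t_stat, df = independent_t([1, 2, 3], [2, 3, 4])
print(round(t_stat, 4), df)  # -1.2247 4
```

With 50 participants in two groups, the same rule gives 50 - 2 = 48 degrees of freedom.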

Findings

The quantitative findings of this study reveal significant differences in VL outcomes between students who engaged with the metaverse-based learning environment and those who participated in slide-assisted instruction. These results are discussed in terms of vocabulary gain, vocabulary retention, and vocabulary forgetting percentage, each offering a distinct perspective on the effectiveness of the respective instructional approaches.

The descriptive analysis (see Table 2) shows that the mean scores for both vocabulary gain (M = 52.56) and retention (M = 49.21) in the MG are higher than those in the SG (M_VG = 38.91; M_VRe = 33.33), indicating a potential advantage of the metaverse as a VL approach. Additionally, the lower mean VFP in the MG (M = 6.63%) than in the SG (M = 14.72%) suggests that the metaverse may help learners retain the learned vocabulary more effectively. However, the differences among the three effectiveness measures and the superiority of metaverse-based VL cannot be established until inferential analysis is conducted.
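As a rough plausibility check on these descriptive values, the sketch below computes a forgetting percentage directly from the reported group means. It assumes that gain and retention are measured on the same scale, so that their difference is the amount forgotten, and that VFP expresses this difference as a share of the gain; the function name is hypothetical. Because the reported mean VFP is presumably an average of individual learners' percentages, it need not equal the percentage computed from group means, which the two hypothetical learners at the end illustrate.

```python
def forgetting_percentage(gain: float, retention: float) -> float:
    """Share of the gain that is no longer retained, as a percentage."""
    return 100.0 * (gain - retention) / gain


# Group means reported in the text (Table 2).
print(round(forgetting_percentage(52.56, 49.21), 2))  # MG: 6.37
print(round(forgetting_percentage(38.91, 33.33), 2))  # SG: 14.34

# Hypothetical learners as (gain, retention) pairs, not study data.
learners = [(20, 10), (80, 76)]
mean_of_ratios = sum(forgetting_percentage(g, r) for g, r in learners) / len(learners)
mean_gain = sum(g for g, _ in learners) / len(learners)
mean_retention = sum(r for _, r in learners) / len(learners)
print(mean_of_ratios)                                    # 27.5
print(forgetting_percentage(mean_gain, mean_retention))  # 14.0
```

The values obtained from the group means (6.37% and 14.34%) are close to, but not identical with, the reported mean VFPs (6.63% and 14.72%), as expected under this reading.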

Moreover, the boxplot (Figure 5) reveals that the MG exhibited a lower median VFP, suggesting reduced vocabulary memory loss compared to the SG. Additionally, the interquartile range (IQR) for the MG was narrower, indicating more consistent performance among participants. Several outliers (cases 6, 19, 22, and 23) were observed, but overall, the variability in VFP was lower than that of the SG. In contrast, the SG demonstrated a higher median VFP and a larger IQR, indicating greater variability and a tendency toward higher vocabulary forgetting. Notably, an extreme outlier (case 49; MeanVFP₄₉ = 46.84%) in the SG suggests that some participants experienced substantial vocabulary loss.
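The distinction between outliers and extreme outliers in such a boxplot follows the usual IQR fences: points more than 1.5 IQR beyond the box are flagged as outliers, and points more than 3 IQR beyond it as extreme. The sketch below illustrates this rule on hypothetical values. It uses the quartile method of Python's `statistics` module, which is an assumption and may differ slightly from the hinges computed by SPSS.

```python
from statistics import quantiles


def classify_boxplot_points(values):
    """Label each value as 'normal', 'outlier', or 'extreme' by IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for value in values:
        distance = max(q1 - value, value - q3, 0)  # distance outside the box
        if distance > 3 * iqr:
            labels.append("extreme")
        elif distance > 1.5 * iqr:
            labels.append("outlier")
        else:
            labels.append("normal")
    return labels


# Hypothetical mean VFP values (%), not data from the study.
sample = [10, 12, 13, 14, 15, 16, 18, 30, 47]
print(classify_boxplot_points(sample))
```

In this hypothetical sample the box spans 13 to 18, so 30 is flagged as an outlier and 47 as an extreme value.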

Boxplot showing the means of vocabulary forgetting percentage by group, labeled MG and SG. MG has a lower interquartile range, with outliers at 6, 19, 22, and 23. SG has a wider range with an outlier at 49. The y-axis represents VFP Means from negative 20 to 40.

Figure 5. Means of vocabulary forgetting percentage by group.

To compare the two groups in vocabulary gain, retention, and VFP, independent-samples t-tests were conducted (see Table 3). In terms of vocabulary gain, students in the MG demonstrated significantly greater improvement from pretests to immediate posttests in all three vocabulary lessons than their peers in the SG. Specifically, Lesson 1 revealed a statistically significant gain difference (p = 0.002), with MG students showing notably higher short-term acquisition of vocabulary. This difference became more pronounced in Lesson 2 (p < 0.001) and remained significant in Lesson 3 (p < 0.001). The mean vocabulary gain across all three lessons was also significantly higher for the MG than for the SG (p < 0.001), confirming the overall effectiveness of the metaverse-based approach in enhancing vocabulary gain. Similarly, the results for vocabulary retention favored the MG. All three lessons showed significant differences (Lesson 1: p = 0.002; Lesson 2: p < 0.001; Lesson 3: p < 0.001), with higher retention scores among students in the MG. The overall mean vocabulary retention was also significantly greater for the MG, t(48) = 5.168, p < 0.001.

Table 2. Descriptive statistics analysis of comparing MG and SG in vocabulary gain, retention, and VFP.

Of particular note is the analysis of VFP, which provides a relative indicator of the sustainability of learning (see Table 3). The MG exhibited a significantly lower mean VFP than the SG, indicating that a smaller proportion of the vocabulary learned was lost between the immediate and delayed posttests. The difference in forgetting percentages was statistically significant, t(48) = −2.22, p = 0.031. Although the mean VFP corroborated the patterns found for gain and retention, the VFPs for specific lessons showed some noteworthy differences. The between-group differences in forgetting percentage were not significant in the first two lessons (p₁ = 0.424; p₂ = 0.476), whereas a significant difference emerged in the third session (p₃ = 0.001). This pattern differs from the findings for vocabulary gain and retention, in which all comparisons were statistically significant (p < 0.05), and it suggests that the benefits of the metaverse may accumulate over time or with repeated exposure.
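The reported test statistics can be related to their p-values with a short, self-contained sketch. It integrates the density of Student's t distribution numerically instead of calling a statistics package; the function names and the number of integration steps are illustrative assumptions, and the sketch is not the procedure used by SPSS.

```python
import math


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(t_stat: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p-value via Simpson's rule on the interval [0, |t|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1.0 - 2.0 * central_area)


print(round(two_tailed_p(-2.22, 48), 3))  # mean VFP comparison
print(two_tailed_p(5.168, 48) < 0.001)    # mean retention comparison
```

Under these assumptions, the value obtained for t(48) = −2.22 should be close to the reported p = 0.031, and the value for t(48) = 5.168 falls well below 0.001.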

Table 3. Independent t-test of comparing MG and SG in vocabulary gain, retention, and VFP.

Discussion and conclusion

This study aimed to formulate and validate the VFP as a novel and informative metric for evaluating the effectiveness of L2VL, particularly in comparing the impact of the metaverse-based learning approach with traditional slide-assisted learning among middle school English learners. In the metric formulation phase, the VFP was defined as the normalized percentage of vocabulary forgotten within a given time interval relative to the initial vocabulary gain immediately after the learning intervention; it thus offers a normalized and interpretable measure of long-term VL effectiveness. In the validation phase, by analyzing learners’ vocabulary gain (VG), vocabulary retention (VRe), and VFP, the study offers a more comprehensive and comparative lens for assessing both the extent and durability of VL.
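One way to read the verbal definition above is sketched below. The exact formula is the one given in the metric formulation section; the form used here, 100 × (immediate posttest − delayed posttest) / (immediate posttest − pretest), is an assumption derived from that wording, and the function name and scores are hypothetical.

```python
def vocabulary_forgetting_percentage(pretest, posttest, delayed):
    """Percentage of the immediate gain that is lost by the delayed posttest.

    Assumed form: 100 * (posttest - delayed) / (posttest - pretest).
    """
    gain = posttest - pretest
    if gain <= 0:
        raise ValueError("VFP is undefined without a positive immediate gain")
    return 100.0 * (posttest - delayed) / gain


# Hypothetical scores, not data from the study.
print(vocabulary_forgetting_percentage(10, 60, 55))  # 10.0
print(vocabulary_forgetting_percentage(10, 60, 65))  # -10.0
```

Under this reading, a negative VFP indicates that the delayed posttest score exceeded the immediate posttest score, which is consistent with the axis of Figure 5 extending below zero.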

Consistent with prior studies (Chen and Yuan, 2023; Lai and Chen, 2021; Sahinler et al., 2023; Tai et al., 2022), the results demonstrate that the MG significantly outperformed the traditional slide-assisted group in both vocabulary gain and vocabulary retention. Specifically, the MG achieved a higher mean vocabulary gain (M = 52.56) than the slide group (M = 38.91) and also demonstrated superior long-term retention (M = 49.21 vs. 33.33). These findings suggest that immersive, interactive, and collaborative environments like the metaverse facilitate both immediate vocabulary gain and durable retention, possibly due to enhanced learner engagement (Çelik and Baturay, 2024), multisensory input (Jiao et al., 2024), and contextualized usage (Taguchi and Zhao, 2025). The significant enhancement in both vocabulary gain and retention across all sessions (p < 0.05) reinforces the hypothesis that metaverse-based learning environments provide not just novelty but sustained engagement and cognitive support (Makransky and Petersen, 2021).

Notably, the late-emerging difference in VFP highlights the importance of extended exposure to immersive learning environments, suggesting that metaverse-based learning may require a threshold of interaction before its full benefits become apparent. This supports the notion that short-term exposure to XR-related technologies may not be sufficient to manifest retention advantages, but continued engagement enables the encoding and consolidation of vocabulary into long-term memory (Ersanli, 2023; Kaplan-Rakowski and Thrasher, 2024; Lai and Chen, 2023; Xie et al., 2019). Moreover, the embodied nature of metaverse-based learning, in which learners interact with virtual environments, may facilitate deeper cognitive processing and memory retention. Embodied cognition theories suggest that learning is grounded in sensory and motor experiences, and the metaverse provides a platform for such embodied interactions (Johnson-Glenberg, 2018; Macedonia, 2019). This aligns with previous findings that highlight the importance of active engagement and multimodal input in second language vocabulary acquisition (Çelik and Baturay, 2024; Glenberg and Kaschak, 2002; Mayer, 2014).

In conclusion, these findings not only validate the VFP formula as a meaningful and nuanced complement to existing evaluation metrics but also confirm the pedagogical potential of the metaverse for vocabulary instruction. Vocabulary gain and retention, which are expressed as raw scores, are independent of each other, which makes them difficult to compare across learners because individual variation in immediate vocabulary gain is not controlled. The VFP, on the other hand, reflects L2 learners’ VL effectiveness from a more comparative perspective. By quantifying vocabulary loss proportionally, the proposed VFP metric enables researchers and educators to better assess learning efficiency and retention sustainability in diverse L2 instructional contexts. Additionally, the validation approach enables a robust examination of the VFP formula’s reliability, generalizability, and practical applicability, ensuring that the proposed metric can serve as a useful tool for evaluating the effectiveness of various L2VL approaches from a long-term retention perspective. The study also indicates that the metaverse can play a pivotal role in vocabulary acquisition, particularly when learning involves repeated exposure over time. Future research should further explore the threshold and mechanisms by which the metaverse contributes to durable vocabulary knowledge, perhaps considering variables such as interactivity, individual learner differences, and the nature of the target vocabulary. Overall, these findings contribute to the growing body of evidence supporting the integration of the metaverse in language education, offering practical insights for curriculum designers and educators aiming to foster deeper, more durable vocabulary learning.

Despite its potential, the proposed VFP metric also presents certain limitations. First, the formula is sensitive to learners’ initial posttest performance; if the immediate posttest score is low, even a small amount of forgetting can result in a high VFP, potentially exaggerating retention loss. Therefore, it is advisable to apply the formula only when the initial learning outcome exceeds a predetermined threshold and to include enough test items that the total assessment score is not too low. Second, the VFP can only report a learner’s forgetting trend at a limited number of time points. When conditions permit, multiple delayed post-tests should be administered to capture more accurately the forgetting patterns of vocabulary learners across different time intervals. Third, the length of the retention interval should be standardized or carefully reported when comparing different studies, as varying intervals can significantly influence the degree of forgetting observed.
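The first limitation can be made concrete with a small sketch. It assumes that VFP is the amount forgotten divided by the initial gain; the threshold value and the function name are hypothetical, since no specific cut-off is prescribed above.

```python
from typing import Optional

MIN_GAIN = 5  # hypothetical threshold; no specific value is prescribed above


def guarded_vfp(gain: float, forgotten: float,
                min_gain: float = MIN_GAIN) -> Optional[float]:
    """Return VFP in percent, or None when the gain is below the threshold."""
    if gain <= 0 or gain < min_gain:
        return None  # too little was learned for a stable percentage
    return 100.0 * forgotten / gain


# Forgetting one point looks very different depending on the initial gain.
print(guarded_vfp(2, 1, min_gain=0))   # 50.0
print(guarded_vfp(20, 1, min_gain=0))  # 5.0
print(guarded_vfp(2, 1))               # None
```

Returning no value below the threshold, instead of an inflated percentage, is one possible design choice; an analysis could equally flag such cases and report them separately.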

Data availability statement

Data and supplementary files, including appendices and tables, are available online at: 10.6084/m9.figshare.29880485.

Ethics statement

The studies involving humans were approved by the Research Ethics Committee, Universiti Kebangsaan Malaysia. The studies were conducted in accordance with the local legislation and institutional requirements. Written informed consent for participation in this study was provided by the participants’ legal guardians/next of kin.

Author contributions

MZ: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing. HH: Funding acquisition, Supervision, Writing – review & editing. MM: Supervision, Writing – review & editing.

Funding

The author(s) declare that no financial support was received for the research and/or publication of this article.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Generative AI statement

The authors declare that Gen AI was used in the creation of this manuscript. We used ChatGPT to correct grammatical and syntactical errors, ensuring clarity and precision of the language. However, the originality and innovation of the ideas presented in this study are entirely the product of our team’s creativity and expertise. AI was used solely as a supportive tool to enhance the presentation of our concepts, not to generate the ideas themselves.

Any alternative text (alt text) provided alongside figures in this article has been generated by Frontiers with the support of artificial intelligence and reasonable efforts have been made to ensure accuracy, including review by the authors wherever possible. If you identify any issues, please contact us.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/feduc.2025.1641638/full#supplementary-material

Footnotes

1. https://www.spatial.io

References

Afzal, N. (2019). A study on vocabulary-learning problems encountered by BA English major students at the university level of education. Arab World English J. 10, 81–98. doi: 10.24093/awej/vol10no3.6