ORIGINAL RESEARCH article
Front. Comput. Sci.
Sec. Human-Media Interaction
Volume 7 - 2025 | doi: 10.3389/fcomp.2025.1675616
This article is part of the Research Topic: Artificial Intelligence and Emerging Technologies for Inclusive and Innovative Education.
A Hybrid Voice Cloning Approach for Inclusive Education in Low-Resource Environments
Provisionally accepted
1 Pak-Austria Fachhochschule Institute of Applied Sciences and Technology, Haripur, Pakistan
2 Prince Sultan University, Riyadh, Saudi Arabia
Voice cloning is the process of replicating a target speaker's vocal characteristics to generate synthesized speech that closely mimics the original. Modern voice cloning systems can now produce natural, expressive speech from limited data. This paper proposes a voice cloning system based on a Generative Adversarial Network (GAN) and a Variational Autoencoder (VAE) that combines a Generalized End-to-End (GE2E) speaker encoder, a Tacotron-based text-to-spectrogram synthesizer, and a modified WaveRNN vocoder extended with additional gating layers. The proposed approach efficiently transforms brief utterances into personalized, high-quality synthetic speech at near real-time speed. The pipeline is evaluated on three datasets, LibriSpeech, VCTK, and local corpora, on which it achieves improved spectral fidelity, measured by Mel Cepstral Distortion (MCD), and higher subjective quality, assessed by Mean Opinion Score (MOS). This work also emphasizes the relevance of hybrid voice cloning for inclusive education in low-resource environments. In addition, it highlights responsible deployment by discussing ethical implications and proposing watermarking, consent-based data collection, and automatic detection measures. The system remains robust across different languages and speaker conditions, making it suitable for assistive technology, educational applications, media production, and interactive conversational agents.
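The abstract cites Mel Cepstral Distortion (MCD) as the objective measure of spectral fidelity. As a minimal sketch of how that metric is conventionally computed (this is a generic NumPy illustration, not the authors' evaluation code): MCD averages, over frames, a scaled Euclidean distance between reference and synthesized mel-cepstral coefficient vectors, excluding the zeroth (energy) coefficient. The function name and the assumption that the two sequences are already time-aligned (e.g. via dynamic time warping) are assumptions of this sketch.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep, syn_mcep):
    """Frame-averaged MCD in dB between two mel-cepstral sequences.

    ref_mcep, syn_mcep: arrays of shape (frames, coeffs) holding
    mel-cepstral coefficients. Coefficient 0 (overall energy) is
    excluded, as is conventional. Assumes the sequences are already
    time-aligned; alignment (e.g. DTW) is outside this sketch.
    """
    # Per-frame difference over coefficients 1..D
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    # Standard MCD scaling: (10 / ln 10) * sqrt(2 * sum of squared diffs)
    dist_per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return (10.0 / np.log(10.0)) * np.mean(dist_per_frame)
```

Lower values indicate synthesized spectra closer to the reference; identical sequences give an MCD of zero.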
Keywords: Voice cloning, Text-to-speech, Speaker encoder, Tacotron, WaveRNN, deep learning, Variational autoencoders, GE2E Loss
Received: 29 Jul 2025; Accepted: 19 Sep 2025.
Copyright: © 2025 Younus, Iqbal, Durrani, Ahmad and Ladan. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Arshad Iqbal, arshad.iqbal@spcai.paf-iast.edu.pk
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.