ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Natural Language Processing
Volume 8 - 2025 | doi: 10.3389/frai.2025.1639147
This article is part of the Research Topic: Emerging Techniques in Arabic Natural Language Processing
Arabic Speech Recognition Model with BAIDU Deep and Cluster Learning
Provisionally accepted
Kuwait University, Kuwait City, Kuwait
This work extracts the spectrum from raw, unlabeled Arabic audio signals and computes Mel-Frequency Cepstral Coefficients (MFCCs). A clustering algorithm then groups MFCCs with similar features. K-means clustering was central to our research: it performs an unsupervised categorization of the unlabeled Arabic audio data. Applying K-means to the extracted MFCC features allowed us to group acoustically similar segments into distinct clusters without prior knowledge of their characteristics. This initial phase was crucial for understanding the inherent variation within our diverse sampled dataset. Dynamic Time Warping (DTW) and Euclidean distance are used for illustration. Classification algorithms such as Decision Tree, XGBoost, KNN, and Random Forest are used to classify the classes obtained from clustering. This study also demonstrates the efficacy of Mozilla's DeepSpeech framework for Arabic speech recognition. The core component of Deep Speech is its neural network architecture, which consists of multiple layers of Recurrent Neural Networks (RNNs). It learns the intricate patterns and interactions between spoken sounds and their corresponding textual representations. The clustered, labeled Arabic audio dataset, along with transcripts and the Arabic alphabet, is used as input to the Baidu Deep Speech model for training and testing. A Dockerfile is built using PyCharm with Python 3.6; PyCharm's integrated environment simplifies creating, editing, and managing Dockerfiles. Deep Speech provides high-quality Arabic speech recognition with reduced loss, word error rate, and character error rate. Baidu's Deep Speech aims to achieve high performance in both end-to-end and isolated speech recognition, with good precision and low word and character error rates, in a reasonable amount of time.
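To make the distinction between the two distance measures named above concrete, the following is a minimal sketch, not the authors' implementation. It assumes each utterance is represented as a sequence of per-frame feature vectors (for example, MFCC frames); the function names `euclidean` and `dtw_distance` and the toy one-dimensional frames are hypothetical and chosen only for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Classic DTW alignment cost between two sequences of feature vectors.

    Unlike a frame-by-frame Euclidean comparison, DTW lets one sequence
    stretch or compress in time, so the sequences may differ in length.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame repeats
                cost[i][j - 1],      # seq_b advances, seq_a frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy data (assumed): the same pattern at two speaking rates.
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
print(dtw_distance(slow, fast))  # 0.0: DTW absorbs the tempo difference
```

In this sketch the two toy sequences have different lengths, so a direct frame-by-frame Euclidean comparison is not defined for them, while DTW aligns them at zero cost. This is one common reason DTW is preferred for comparing variable-length speech segments.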
The proposed approach yielded a loss of 276.147, a word error rate of 0.3720, and a character error rate of 0.0568. This technique increases the accuracy of Arabic automatic speech recognition (ASR).
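For readers unfamiliar with the two reported metrics, the sketch below shows how word error rate (WER) and character error rate (CER) are conventionally computed from the Levenshtein edit distance between a reference transcript and a recognizer output. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names are hypothetical, the example strings are toy data, and spaces are counted as characters in the CER (conventions vary).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; spaces count as characters (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example (assumed data): one wrong word out of three.
reference = "the cat sat"
hypothesis = "the cat sit"
print(round(wer(reference, hypothesis), 4))  # 0.3333
print(round(cer(reference, hypothesis), 4))  # 0.0909
```

Under this definition, a WER of 0.3720 corresponds to roughly 37 word-level errors per 100 reference words, and a CER of 0.0568 to roughly 6 character-level errors per 100 reference characters.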
Keywords: clustering, language model, acoustic model, Baidu's Deep Speech, RNN, deep learning
Received: 01 Jun 2025; Accepted: 11 Aug 2025.
Copyright: © 2025 Al-Anzi and Shalini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Fawaz Al-Anzi, Kuwait University, Kuwait City, Kuwait
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.