
ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Natural Language Processing

Volume 8 - 2025 | doi: 10.3389/frai.2025.1639147

This article is part of the Research Topic: Emerging Techniques in Arabic Natural Language Processing

Arabic Speech Recognition Model with BAIDU Deep and Cluster Learning

Provisionally accepted
  • Kuwait University, Kuwait City, Kuwait

The final, formatted version of the article will be published soon.

This work extracts the spectrum from raw, unlabeled Arabic audio signals and produces Mel-Frequency Cepstral Coefficients (MFCCs). A clustering algorithm then groups the extracted MFCCs with analogous features. K-means clustering was central to our research, performing an unsupervised categorization of the unlabeled Arabic audio data. Applying K-means to the extracted MFCC features allowed us to group acoustically similar segments into distinct clusters without prior knowledge of their characteristics. This initial phase was crucial for understanding the inherent diversity of our sampled dataset. Dynamic Time Warping (DTW) and Euclidean distance are used for illustration. Classification algorithms such as Decision Tree, XGBoost, KNN, and Random Forest are used to classify the classes obtained from clustering. This study also demonstrates the efficacy of Mozilla's Deep Speech framework for Arabic speech recognition. The core component of Deep Speech is its neural network architecture, which consists of multiple layers of Recurrent Neural Networks (RNNs). It learns the intricate patterns and interactions between spoken sounds and their corresponding textual representations. The clustered, labeled Arabic audio dataset, along with transcripts and the Arabic alphabet, is used as input to the Baidu Deep Speech model for training and testing. A Dockerfile is built in PyCharm with Python 3.6; PyCharm's integrated environment simplifies creating, editing, and managing Dockerfiles. Deep Speech delivers high-quality Arabic speech recognition with reduced loss, word error rate, and character error rate. Baidu's Deep Speech aims for high performance in both end-to-end and isolated speech recognition, with good precision and low word and character error rates, in a reasonable amount of time.
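To make the two distance measures named above concrete, the following is a minimal pure-Python sketch of Euclidean distance between feature vectors and DTW between two sequences of frames. It is an illustration only, not the authors' implementation: the function names `euclidean` and `dtw_distance` and the toy one-dimensional "frames" are assumptions, standing in for real MFCC vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Classic DTW cost between two sequences of frames.

    Each frame is a feature vector (hypothetically, one MFCC vector).
    The sequences may differ in length; frames are compared with
    Euclidean distance and the cheapest monotonic alignment is returned.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seq_a advances, seq_b frame repeats
                cost[i][j - 1],      # seq_b advances, seq_a frame repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy data: the same contour spoken slowly and quickly.
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
print(dtw_distance(slow, fast))
```

Because DTW can stretch one sequence against the other, the slow and fast versions of the same contour align at zero cost, whereas a frame-by-frame Euclidean comparison would require equal lengths to begin with.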
The proposed approach yielded a loss of 276.147, a word error rate of 0.3720, and a character error rate of 0.0568. This technique increases the accuracy of Arabic automatic speech recognition (ASR).
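For readers unfamiliar with the two reported metrics, word error rate and character error rate are conventionally defined as the edit distance between a reference transcript and the recognizer's hypothesis, divided by the reference length in words or characters. The sketch below assumes that standard definition; the abstract does not state how the scores were computed, and the helper names `edit_distance`, `wer`, and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions,
    insertions, and deletions each cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance /
    reference character count (spaces included)."""
    if not reference:
        raise ValueError("reference transcript is empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example transcripts, not taken from the study's dataset.
reference = "السلام عليكم ورحمة الله"
hypothesis = "السلام عليكم رحمة الله"
print(wer(reference, hypothesis), cer(reference, hypothesis))
```

A single dropped character changes one word out of four, so the word-level score is much larger than the character-level one; this is the usual reason a system's CER sits well below its WER.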

Keywords: clustering, language model, acoustic model, Baidu's Deep Speech, RNN, deep learning

Received: 01 Jun 2025; Accepted: 11 Aug 2025.

Copyright: © 2025 Al-Anzi and Shalini. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Fawaz Al-Anzi, Kuwait University, Kuwait City, Kuwait

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.