ORIGINAL RESEARCH article

Front. Commun. Netw.

Sec. Signal Processing for Communications

Volume 6 - 2025 | doi: 10.3389/frcmn.2025.1662788

Sentence-level Consistency of Conformer Based Pre-training Distillation for Chinese Speech Recognition

Provisionally accepted
Haifang Li, Chao Tang, Xin Yue, Xu Li*
  • Xinjiang Normal University, Urumqi, China

The final, formatted version of the article will be published soon.

This paper presents a Conformer-based framework that integrates sentence-level consistency with pre-training distillation for Chinese speech recognition. Unlike conventional objectives that operate at the word or phrase level, our method explicitly enforces semantic alignment at the sentence level while compressing the model through knowledge distillation. To support evaluation, we construct CH Broadcast ASR, a domain-specific Chinese corpus for the broadcast and television field. Experiments on Aishell-1, Aishell-3, and CH Broadcast ASR show that the proposed model consistently outperforms state-of-the-art baselines. It achieves a character error rate (CER) of 3.3% on Aishell-1, 3.7% on Aishell-3, and 3.9% on CH Broadcast ASR, surpassing TDNN, DFSMN-T, and TCN-Transformer, while reducing model size by over 10%. These results demonstrate that combining sentence-level consistency with distillation not only improves robustness for long-form broadcast speech but also enhances efficiency for real-time deployment.
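For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a corpus-level character error rate is commonly computed: the total character-level edit distance between the reference and hypothesis transcripts, divided by the total number of reference characters. The function names `edit_distance` and `cer`, the whitespace stripping, and the sample sentences are illustrative assumptions; the article does not describe its scoring script, which may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumption: spaces carry no meaning in Chinese transcripts, so drop them.
        ref = ref.replace(" ", "")
        hyp = hyp.replace(" ", "")
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # Made-up sentences for illustration only; not taken from any corpus.
    refs = ["今天天气很好", "我喜欢听广播"]
    hyps = ["今天天汽很好", "我喜欢听广播"]
    print(f"CER = {cer(refs, hyps):.2%}")
```

In this toy example, one substituted character out of twelve reference characters gives a CER of about 8.3%. Under this definition, the reported 3.3% corresponds to roughly 33 character errors per 1,000 reference characters.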

Keywords: Speech Recognition, Model Distillation, Conformer, Pre-trained Model, Sentence-Level Consistency

Received: 09 Jul 2025; Accepted: 23 Sep 2025.

Copyright: © 2025 Li, Tang, Yue and Li. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Xu Li, lixu@xjnu.edu.cn

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.