ORIGINAL RESEARCH article
Front. Signal Process.
Sec. Audio and Acoustic Signal Processing
Volume 5 - 2025 | doi: 10.3389/frsip.2025.1587969
Generic Speech Enhancement with Self-Supervised Representation Space Loss
Provisionally accepted
Affiliation: Nippon Telegraph and Telephone (Japan), Tokyo, Japan
Single-channel speech enhancement is used in various tasks to mitigate the effect of interfering signals. Conventionally, achieving optimal performance has required tuning the speech enhancement model for each task, so generalizing such models to unknown downstream tasks has been challenging. This study aims to construct a generic speech enhancement front-end that improves the performance of back-ends across multiple downstream tasks. To this end, we propose a novel training criterion that minimizes the distance between the enhanced signal and the ground-truth clean signal in the feature representation domain of self-supervised learning models. Because self-supervised learning feature representations effectively express high-level speech information useful for solving various downstream tasks, the proposed criterion is expected to make speech enhancement models preserve such information. Experimental validation demonstrates that the proposed criterion improves performance on multiple speech tasks while maintaining the perceptual quality of the enhanced signal.
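As a rough illustration of the idea of a representation-space loss, the following minimal sketch compares an enhanced signal and a clean signal after both pass through the same frozen feature extractor. It is not the authors' implementation. The encoder is a hypothetical stand-in (a fixed random projection with a tanh nonlinearity) for a pretrained self-supervised model. The mean squared distance, the frame length, the hop size, and all function names are likewise assumptions made for illustration only.

```python
import math
import random


def frame_signal(signal, frame_len=4, hop=2):
    """Split a 1-D signal (list of floats) into overlapping frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def make_frozen_encoder(frame_len=4, feat_dim=3, seed=0):
    """Hypothetical stand-in for a pretrained self-supervised model.

    Uses a fixed random linear projection followed by tanh. A real
    system would use a frozen pretrained speech model here (assumption).
    """
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(frame_len)]
               for _ in range(feat_dim)]

    def encode(signal):
        # One feature vector per frame; weights are never updated.
        return [[math.tanh(sum(w * x for w, x in zip(row, frame)))
                 for row in weights]
                for frame in frame_signal(signal, frame_len)]

    return encode


def representation_loss(enhanced, clean, encode):
    """Mean squared distance between frame-level features (assumed metric)."""
    f_enh, f_cln = encode(enhanced), encode(clean)
    total = sum((a - b) ** 2
                for fe, fc in zip(f_enh, f_cln)
                for a, b in zip(fe, fc))
    count = sum(len(fe) for fe in f_enh)
    return total / count


# Toy usage: a sinusoid as "clean speech" and a noisy copy as the
# output of an imperfect enhancement model.
encode = make_frozen_encoder()
clean = [math.sin(0.3 * n) for n in range(32)]
noise_rng = random.Random(1)
noisy = [x + 0.5 * noise_rng.gauss(0.0, 1.0) for x in clean]

loss_same = representation_loss(clean, clean, encode)
loss_noisy = representation_loss(noisy, clean, encode)
```

In this toy setup the loss is zero when the enhanced signal equals the clean one and positive otherwise. In training, a loss of this kind would be minimized with respect to the enhancement model's parameters while the feature extractor stays fixed.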
Keywords: self-supervised learning, loss function, SUPERB benchmark, signal denoising, speech enhancement, deep learning, speech recognition
Received: 05 Mar 2025; Accepted: 11 Jun 2025.
Copyright: © 2025 Sato, Ochiai, Delcroix, Moriya, Ashihara and Masumura. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Hiroshi Sato, Nippon Telegraph and Telephone (Japan), Tokyo, Japan
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.