Abstract
Separation of speech mixtures in noisy and reverberant environments remains a challenging task for state-of-the-art speech separation systems. Time-domain audio speech separation networks (TasNets) are among the most commonly used network architectures for this task. TasNet models have demonstrated strong performance on typical speech separation baselines where speech is not contaminated with noise. When additive or convolutive noise is present, performance of speech separation degrades significantly. TasNets are typically composed of an encoder network, a mask estimation network and a decoder network. The design of these networks puts the majority of the onus for enhancing the signal on the mask estimation network when used without any pre-processing of the input data or post-processing of the separation network output data. Use of multihead attention (MHA) is proposed in this work as an additional layer in the encoder and decoder to help the separation network attend to encoded features that are relevant to the target speakers and conversely suppress noisy disturbances in the encoded features. As shown in this work, incorporating MHA mechanisms into the encoder network in particular leads to a consistent performance improvement across numerous quality and intelligibility metrics on a variety of acoustic conditions using the WHAMR corpus, a dataset of noisy reverberant speech mixtures. The use of MHA is also investigated in the decoder network, where it is demonstrated that smaller performance improvements are consistently gained within specific model configurations. The best performing MHA models yield a mean 0.6 dB scale-invariant signal-to-distortion ratio (SISDR) improvement on noisy reverberant mixtures over a baseline 1D convolution encoder. A mean 1 dB SISDR improvement is observed on clean speech mixtures.
1 Introduction
Signal enhancement of speech signals recorded in far-field scenarios has been an active research topic for some decades. Isolating individual speakers from signal mixtures is often necessary when applying speech processing systems in real-life applications (Wang and Chen, 2018). Speech separation is a common approach to solving this problem. While there has been significant progress in recent years using deep neural network based architectures to separate clean speech mixtures (Shi and Hain, 2021), the performance still drops significantly in noisy environments, especially for low signal-to-noise ratios (SNRs) (Wichern et al., 2019). Early approaches for separating speech signals were based on harmonic relationships in the signal or non-negative matrix factorization (NMF) (Schmidt and Olsson, 2006), followed later by deep neural network (DNN) variations on NMF approaches.
Models that use learned filterbank transforms from the time domain, such as TasNets, are able to consistently outperform models based on short-time Fourier transform (STFT) features (Subakan et al., 2021). The encoder of TasNets can be interpreted as a filterbank, and this paper aims at visualising the encoded signals in TasNets in that respect. A recurrent TasNet (BLSTM-TasNet) model was first proposed, composed of a 1-dimensional convolutional encoder, a bidirectional long short-term memory (BLSTM) masking network and a transposed 1-dimensional convolutional decoder. This was later revised into a fully convolutional network (Conv-TasNet) by replacing the BLSTM network with a temporal convolutional network (TCN). Shi et al. (2019) proposed the introduction of gating mechanisms into the TCN as a means of controlling the flow of information through the network. A dual path recurrent neural network model (DPRNN) was subsequently introduced, which reorganises the input data into multiple data chunks and processes the inter-chunk and intra-chunk data sequentially using a long short-term memory (LSTM) based network for modelling temporal context in sequences. The dual path Transformer network (DPTNet) and Sepformer (Subakan et al., 2021) are dual path models that replace the recurrent neural networks in the DPRNN model with Transformer networks (Vaswani et al., 2017) for modelling temporal context in the mask estimation part of the network. Other work focused more on the encoder and decoder part of the generalized TasNet model structure, where deeper convolutional encoder and decoder networks were proposed for the Conv-TasNet model. It was shown by Yang et al. (2019) that combining the learned features of Conv-TasNet's encoder with STFT features leads to a small performance improvement for clean speech separation tasks.
Similarly, it has been demonstrated that using complex-valued learnable analytic filterbanks in the encoder and decoder can lead to further performance improvement over the real-valued encoder of Conv-TasNet. Hand-crafted multi-phase gammatone (MPGT) filterbank features have also been proposed in place of the learned filterbank in Conv-TasNet. This approach was effective when applied only to the encoder, but the learned decoder of Conv-TasNet proved more effective than the MPGT-based decoder.
This work investigates the use of attention mechanisms in the encoder and decoder of TasNets to improve performance, particularly in noisy and reverberant situations. Vaswani et al. (2017) proposed MHA as a way to parallelize a single attention mechanism into multiple attention heads while maintaining a similar parameter count to single-headed attention. This work proposes incorporating multihead attention mechanisms into the encoders and decoders of Conv-TasNet to improve the performance on noisy and reverberant speech mixtures where it is assumed that the noisy content of the data is orthogonal to the speech. Some discussion about the relevance of the orthogonality assumption and its relationship to cross-correlation is given to motivate why attention mechanisms are a suitable choice for improving the encoders and decoders. The network structures are evaluated on noisy and reverberant data from the WHAMR corpus. Although the main goal of this work is to minimize the negative effects of additive noise under the assumption of orthogonality, separation of reverberant speech mixtures, i.e. mixtures with convolutive noise (reverberation), is also considered. The remainder of this work proceeds as follows. In Section 2 the Conv-TasNet model is briefly reviewed and analyzed. In Section 3 the proposed multihead attention and the novel encoder and decoder structures are introduced. The training configuration and experiments conducted on the WHAMR corpus are explained in Section 4. Further discussion and some conclusions are given in Section 5.
2 Conv-TasNet
In this section the Conv-TasNet speech separation network is reviewed. The network is composed of three components: an encoder, a mask estimation network and a decoder. A schematic of the network structure is shown in Figure 1 for the example of C = 2 output signals. The mask estimation network formulated in this section follows the implementation that can be found in the open-source SpeechBrain and ESPnet software toolkits. This implementation differs slightly from the originally proposed one, which is discussed in greater detail in Section 2.3.
FIGURE 1
2.1 Signal Model and Problem Formulation
The problem of monaural noisy reverberant speech separation is a one-dimensional additive and convolutive problem in which the microphone signal x(t) is composed of C signals sc(t), c ∈ {1, …, C}, each convolved with its corresponding room impulse response (RIR) hc(t), and an additive noise source ν(t):

$$x(t) = \sum_{c=1}^{C} h_c(t) * s_c(t) + \nu(t). \tag{1}$$

The symbol * in (1) denotes convolution. The aim implicit in the noisy reverberant speech separation task is to find C estimates, one for each of the sc(t), denoted as $\hat{s}_c(t)$. The speech mixture signal x(t) in (1) can be discretized such that x(ti) := x[i], with i being the discrete sample index and Lx the length of the signal.
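The discrete mixture x[i] is processed in overlapping segments of length LBL such that:

$$\mathbf{x}_\ell = \left[\, x[0.5(\ell - 1)L_{BL}], \ldots, x[0.5(\ell + 1)L_{BL} - 1] \,\right], \tag{2}$$

where ℓ is the frame number for each of the $L_{\mathbf{x}}$ frames and $\mathbf{x}_\ell \in \mathbb{R}^{1 \times L_{BL}}$. Note that $L_{\mathbf{x}}$ (the number of frames) and $L_x$ (the signal length in samples) are different quantities and that the frame overlap in (2) is fixed to 50% in this work.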
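As an illustration of the 50% overlapping segmentation, a minimal Python sketch is given here. The function name `frame_signal` and the choice to drop trailing samples that do not fill a complete frame are assumptions made for this example and are not taken from the implementation used in this work.

```python
def frame_signal(x, block_len):
    """Split a 1D signal into frames of length block_len with 50% overlap.

    Assumption for this sketch: trailing samples that do not fill a whole
    frame are dropped rather than zero padded.
    """
    hop = block_len // 2  # 50% overlap means the hop is half the block size
    n_frames = (len(x) - block_len) // hop + 1
    return [x[i * hop : i * hop + block_len] for i in range(n_frames)]


# Toy signal of 12 samples, block size of 4 samples
x = list(range(12))
frames = frame_signal(x, block_len=4)
for idx, frame in enumerate(frames, start=1):
    print(idx, frame)
```

Each frame shares its second half with the first half of the next frame, which is the overlap pattern assumed throughout this section.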
The encoder encodes short overlapping blocks of the time domain signal xℓ as defined in (2). The encoder is a convolutional neural network where the layer weights are learned in an end-to-end (E2E) fashion. The mask estimation network takes the output of the encoder network wℓ and uses it to estimate a set of mask-like vectors mℓ,c, one for each of the C speakers. These mask-like vectors are then multiplied with the encoded signal vector wℓ, producing a masked weight vector for each speaker. The decoder in the original Conv-TasNet approach is a transposed 1D convolutional layer that decodes these representations back into the time domain, resulting in C separated source estimates. The goal of the decoder is theoretically to perform the inverse function of the encoder.
2.2 Encoder
The first stage of the network is to encode the input audio. The encoder is constructed using a 1D convolutional filter of kernel size LBL with 1 input channel and N filters and an optional nonlinear encoder activation layer denoted by $\mathcal{H}_{\text{enc}}$. For a piece of audio of length Lx this results in Lx frames and N output channels such that the network produces Lx encoded mixture vectors given by

$$\mathbf{w}_\ell = \mathcal{H}_{\text{enc}}\left(\mathbf{x}_\ell \mathbf{B}\right), \tag{3}$$

where $\mathbf{B} \in \mathbb{R}^{L_{BL} \times N}$ represents a matrix of the trainable convolutional weights. In the implementation used in this section the nonlinear activation is chosen as a rectified linear unit (ReLU) function. The encoded signal mixture for all frames ℓ can be defined by the matrix W whose rows are the vectors wℓ.
2.2.1 Channel Sorting for Visualisation of Encoded Signals
While time-frequency approaches for speech separation based on masking spectrogram representations are often easy to interpret, for visualization of the encoded signal W, sorting over the output convolutional channels n is beneficial. When visualising the encoded representations in this work, the encoded signals' channels are thus reordered according to the sorting algorithm defined in Algorithm 1, which is based on depthwise Euclidean distance. The original Conv-TasNet paper proposes using the unweighted pair group method with arithmetic mean (UPGMA) to sort the channels by Euclidean filter similarity. The proposed Algorithm 1 was found to be preferable in many cases to that approach as it leads to a less granular representation with most of the speech energy being located in the lower region of the representation, making it easier to observe lower energy, noisier regions within the encoded signal. Consequently, the proposed channel sorting algorithm results in visualisations more similar to well-known spectrogram-like time-spectral representations. The key difference is that the UPGMA-based method uses filter similarity to sort channels whereas the proposed method sorts channels according to encoded feature similarity. The use of UPGMA, which is based on a clustering approach to sort the channels, is also not clearly motivated in the original paper; hence in our approach we simply suggest sorting the channels by decreasing similarity from the most similar channels, measured in Euclidean feature similarity. This is premised on the assumption that the most similar channels will contain the most speech energy.
Algorithm 1
FIGURE 2
2.3 Mask Estimation Network
The separation network is visualised in Figure 3. It uses a TCN which consists of X layers of convolutional blocks (horizontal and coloured in Figure 3A) which are repeated R times (vertical in Figure 3A). The initial channel-wise normalisation for each block of the encoded signal wℓ is defined as

$$\mathrm{LN}(\mathbf{w}_\ell) = \frac{\mathbf{w}_\ell - \mu_\ell}{\sqrt{\sigma_\ell^2 + \epsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}, \tag{7}$$

where μℓ and σℓ² are the mean and variance of wℓ over its N channels, and γ and β are trainable parameter vectors. A small value ϵ in the denominator of (7) ensures numerical stability. A pointwise convolution acts as a bottleneck layer and produces B channels as input for the successive convolutional blocks. At the output of the mask estimation network a set of masks is produced in a single vector, one for each speaker at each frame (cf. also Section 2.3.3). This is done using a single pointwise convolution that changes the feature dimension from B to CN.
FIGURE 3

(A) Temporal Convolutional Mask Estimation Network. (B) Network layers inside ConvBlock in Figure 3A. denotes the layer normalisation as defined in (7) and is the depthwise separable convolution as defined in (14).
2.3.1 Convolutional Blocks
Each of the convolutional blocks consists of a pointwise 1D convolutional layer followed by a depthwise separable convolutional operation, as visualized in Figure 3B, resulting in H channels within the convolutional block. Each subsequent convolutional block has an increasing dilation factor f = 2^0, 2^1, …, 2^(X−1), which widens the temporal context of the network for every additional block. This implementation of the Conv-TasNet TCN follows that which is used in popular research frameworks such as SpeechBrain (
Conv-TasNet was originally proposed in both causal and non causal implementations. In the causal implementation cumulative layer normalization is proposed by
A parametric ReLU (PReLU) activation function is used after the initial pointwise convolution as well as in the depthwise separable convolution, cf. Figure 3B and (14). The TCN takes an Lx × N dimensional input and produces an Lx × CN dimensional output. The input sequences to the depthwise separable convolutional layers are zero padded such that the output sequences are always of the same length as the input sequences.
The depthwise separable convolution is an efficient algorithm for computing convolutions where the convolution is computed in two stages:
1) In the first stage a depthwise convolution, i.e. a convolution per channel, is applied to each of the G input channels:

$$\mathbf{D}(\mathbf{Y}, \mathbf{K}) = \left[\, \mathbf{y}_1 * \mathbf{k}_1, \ldots, \mathbf{y}_G * \mathbf{k}_G \,\right]^\top \tag{13}$$

for the input matrix Y of the convolution operation and the convolution kernel matrix K of size P. Note that the number of convolution input channels G also equals H in the dilated convolutional blocks. The vectors y_g and k_g are the rows of Y and K in (13), respectively. The operator (⋅)⊤ denotes the transpose.
2) In the second stage a pointwise convolution is then performed across each of the H channels. This operation, defined in (14), combines the output of the depthwise convolution with the global layer normalization (gLN) function and a parametric rectified linear unit (PReLU) activation function.
The depthwise separable convolution operation has G × P + G × H parameters whereas a standard convolution operation has G × P × H, which means that the model size is reduced by a factor of (P × H)/(P + H) ≈ P when H ≫ P.
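The parameter saving can be checked numerically with a short Python sketch. The function names are hypothetical, and the values G = H = 512 and P = 3 are chosen to match the channel and kernel sizes listed later in Table 2; biases and normalisation parameters are ignored here.

```python
def depthwise_separable_params(G, H, P):
    """Weights in a depthwise (G x P) plus pointwise (G x H) convolution."""
    return G * P + G * H


def standard_conv_params(G, H, P):
    """Weights in a standard 1D convolution with G inputs and H outputs."""
    return G * P * H


G = H = 512  # input and output channels of the dilated convolutional block
P = 3        # kernel size

separable = depthwise_separable_params(G, H, P)
standard = standard_conv_params(G, H, P)
reduction = standard / separable

print(separable, standard, round(reduction, 3))
```

For these values the reduction factor is just under 3, i.e. close to P, as expected when H ≫ P.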
2.3.2 Temporal Context
The TCN has a fixed window of depthwise inputs that the output layer is able to observe for a given output block. This window of data points is of particular interest as the input speech data to the network can be modelled as a causal system with long-term dependencies, especially for reverberant speech signals, for which the room impulse response hc(t) in (1) significantly increases long-term dependencies. The receptive field of a convolutional network refers to the number of data points that can be simultaneously observed by the network at the final convolutional layer in a deep convolutional network. The receptive field for the temporal convolutional network (TCN) used in Conv-TasNet depends on the number of convolutional blocks X and the number of repeats R as well as the kernel size P and can be defined as

$$\mathcal{R}(X, R, P) = 1 + R \sum_{j=0}^{X-1} (P - 1)\, 2^{j}. \tag{15}$$

The receptive field in (15) is measured in the number of frames observed in a given sequence. When the entire Conv-TasNet model is considered, it is possible to use the receptive field to measure the total temporal context observed by the whole network at any given output, measured in seconds. Given the sample rate fs and the block size LBL with 50% frame overlap, the receptive field in seconds is

$$T(X, R, P, L_{BL}, f_s) = \frac{L_{BL}\left(\mathcal{R}(X, R, P) + 1\right)}{2 f_s}. \tag{16}$$
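A short Python sketch can make the temporal context concrete. It assumes the usual receptive field of a stack of dilated convolutions with dilation factors 2^0, …, 2^(X−1) repeated R times, a hop of half the block size, and a sample rate of 8 kHz; the sample rate and the function names are assumptions of this example. Under these assumptions the two configurations listed later in Table 2 give approximately 1.53 s and 0.51 s.

```python
def receptive_field_frames(X, R, P):
    """Frames seen by R repeats of X dilated blocks with kernel size P."""
    return 1 + R * sum((P - 1) * 2 ** j for j in range(X))


def receptive_field_seconds(X, R, P, block_len, fs):
    """Convert the receptive field from frames to seconds (50% overlap)."""
    hop = block_len // 2
    frames = receptive_field_frames(X, R, P)
    samples = (frames - 1) * hop + block_len
    return samples / fs


# fs = 8000 Hz is an assumption made for this example
print(receptive_field_frames(8, 3, 3))
print(round(receptive_field_seconds(8, 3, 3, block_len=16, fs=8000), 2))
print(round(receptive_field_seconds(6, 4, 3, block_len=16, fs=8000), 2))
```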
2.3.3 Output Masks
The output features of the TCN for each frame ℓ are a concatenated vector of estimated masks, which is defined as

$$\mathbf{m}_\ell = \left[\, \mathbf{m}_{\ell,1}, \ldots, \mathbf{m}_{\ell,C} \,\right], \tag{18}$$

where $\mathbf{m}_{\ell,c} \in \mathbb{R}^{1 \times N}$ and c ∈ {1, … , C} such that there is a set of mask vectors for each source signal c. Note for later, in Section 3.2.3.2 where novel decoders are derived, that the authors consider the mask estimation stage complete when the mask-like features in (18) of shape Lx × CN are de-concatenated into C feature matrices of shape Lx × N, and thus all computation following this stage is considered part of the decoder.
2.4 Decoder
The input signal of the decoder U is an element-wise multiplication of the masks mℓ,c and the encoded mixture wℓ from (3). Estimates for the source signals are then obtained from performing a transposed 1D convolution operation defined aswhere represents a set of learned basis vectors to be convolved with the masked mixture. is the estimated segment ℓ of for each audio source c. The matrix U in the original Conv-TasNet model proposed by
2.5 Objective Function
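The objective function used for training is the scale-invariant signal-to-distortion ratio (SISDR),

$$\mathrm{SISDR}(\hat{\mathbf{s}}, \mathbf{s}) = 10 \log_{10} \frac{\left\| \mathbf{s}_{\text{target}} \right\|^2}{\left\| \hat{\mathbf{s}} - \mathbf{s}_{\text{target}} \right\|^2}, \quad \mathbf{s}_{\text{target}} = \frac{\langle \hat{\mathbf{s}}, \mathbf{s} \rangle}{\left\| \mathbf{s} \right\|^2}\, \mathbf{s}, \tag{21}$$

which is a commonly used objective function for training DNN speech separation systems (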
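For reference, a minimal pure-Python sketch of the standard SISDR definition is given here: the estimate is projected onto the reference to obtain the target component, and the residual is treated as distortion. The function name `si_sdr` and the stabilising constant `eps` are assumptions of this sketch; in training, the negative of this value would typically be minimised.

```python
import math


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sequences."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / (ref_energy + eps)          # optimal scaling of the reference
    target = [alpha * r for r in reference]   # component aligned with reference
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10 * math.log10((target_energy + eps) / (residual_energy + eps))


reference = [1.0, 0.0, -1.0, 0.5]
good = [0.9, 0.1, -1.1, 0.4]
poor = [0.5, 0.5, -1.5, 0.0]
scaled = [2.0 * v for v in good]

print(si_sdr(good, reference), si_sdr(poor, reference), si_sdr(scaled, reference))
```

Rescaling the estimate leaves the value essentially unchanged, which is the scale invariance the metric is named for.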
2.6 Deep PReLU Encoders and Decoders
Some work has already been done to investigate improved encoders and decoders for the Conv-TasNet model. Deeper convolutional encoder and decoder networks were proposed by
3 Multihead Attention Encoder and Decoders
In the following, the proposed MHA encoder and decoder designs are introduced. The scaled dot product attention function (Vaswani et al., 2017) and MHA are briefly introduced and the proposed application of MHA in the TasNet architecture is described. Attention was first proposed by
3.1 Attention Mechanism
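In this work, scaled dot product attention

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left( \frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}} \right) \mathbf{V} \tag{24}$$

is used, where $\mathbf{Q} \in \mathbb{R}^{L_q \times d_k}$, $\mathbf{K} \in \mathbb{R}^{L_k \times d_k}$ and $\mathbf{V} \in \mathbb{R}^{L_k \times d_v}$ denote the query, key and value matrices, respectively. The terms query, key and value are commonly used with MHA (Vaswani et al., 2017) and so they are used here also. Each matrix has a sequence dimension, Lq or Lk, as well as a feature dimension, dk or dv. Note that the query and key matrices share the same feature dimension dk and the key and value matrices share the same sequence dimension Lk. The output of the attention function is of shape Lq × dv.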
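The shapes involved can be illustrated with a minimal pure-Python sketch of scaled dot product attention following Vaswani et al. (2017). The helper names (`matmul`, `softmax`, `attention`) and the toy matrices are assumptions of this example; a practical implementation would use a tensor library.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return the attention output (Lq x dv) and the weights (Lq x Lk)."""
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # Lq = 3, dk = 2
K = [[1.0, 0.0], [0.0, 1.0]]              # Lk = 2, dk = 2
V = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]    # Lk = 2, dv = 3

out, weights = attention(Q, K, V)
print(len(out), len(out[0]))
```

The third query is equally similar to both keys, so its output row is the mean of the two value rows.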
In the encoders and decoders proposed here, the output of the attention function is used to re-weight a sequence of features according to which features in a sequence have the most pointwise correlation (i.e. correlation across channels as opposed to across discrete time) to one another. There is a twofold assumption in our proposed application of the attention function. The first is that encoded blocks containing speech will have a higher correlation to one another than blocks containing noise. Note that this is a similar assumption to the orthogonality assumption made by Roux et al. (2019) in the SISDR objective function in (21) used for training models in this work. The second assumption is that, in the encoded speech mixture, each individual speaker's speech signal will have a larger pointwise correlation to itself than to that of any other speaker across all frames.
Figure 4 demonstrates the proposed approach to calculating the self-attention (
FIGURE 4

Top left: encoded NRSM signal. Top right: Computed attention matrix weights. Bottom right: Scaled dot product attention. Bottom left: encoded NRSM signal re-weighted with attention. The figures on the left have had values above 0.05× their maximum values clipped and are normalized between 0 and 1. The figures on the right are normalized between 0 and 1.
Figure 5 shows the attention-weighted encoded input (middle panel) compared to the encoded NRSM features (top panel) as well as the corresponding encoded CSM features (bottom panel). The attention weighting adds greater emphasis to many of the features containing speech and conversely down-weights some of the noisier parts of the encoded features.
FIGURE 5

Top: Encoded NRSM signal blocks. Middle: Encoded NRSM signal blocks re-weighted with attention as defined in (24). Bottom: Encoded CSM signal blocks (ν(t)=0, hc(t)= δ(t), ∀t ≥0). The top and bottom figures clip values above 0.05× the maximum value of the encoded NRSM signal and then normalized between 0 and 1. The middle figure clips values above 0.05× its maximum value and is then normalized between 0 and 1.
3.2 Multihead Attention Layer
The following section introduces multihead attention (Vaswani et al., 2017) as an extension to scaled dot product attention within the context of the encoder and decoder model proposed in this work where all the inputs to the attention layer are of equal dimensions.
3.2.1 Linear Projections and Attention Heads
To simplify notation in the following model descriptions, Q, K and V are used as notation for arbitrary inputs to each of the MHA layers. The first stage in the MHA layer is to linearly project the inputs into a lower dimensional space. This is achieved by multiplying the input sequences by three trainable weight matrices,

$$\mathbf{W}_a^{Q},\ \mathbf{W}_a^{K},\ \mathbf{W}_a^{V} \in \mathbb{R}^{N \times d},$$

for each attention head a ∈ {1, … , A}, where A is the number of attention heads and d = N/A is the reduced dimensionality. The motivation for reducing the dimensionality is that this retains roughly the same computational cost as using a single attention head with full dimensionality while allowing multiple attention mechanisms to be used. Each of these weight matrices is used to compute (Qa, Ka, Va) for each attention head a ∈ {1, … , A} such that

$$\mathbf{Q}_a = \mathbf{Q}\mathbf{W}_a^{Q}, \quad \mathbf{K}_a = \mathbf{K}\mathbf{W}_a^{K}, \quad \mathbf{V}_a = \mathbf{V}\mathbf{W}_a^{V}.$$

For each attention head the attention function is computed such that

$$\chi_a = \mathrm{Attention}(\mathbf{Q}_a, \mathbf{K}_a, \mathbf{V}_a),$$

where χa is the ath attention head.
3.2.2 Multihead Attention
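The final stage is connecting the attention heads by concatenating along the d-length dimension and projecting the features using a linear layer defined by a weight matrix $\mathbf{W}^{O} \in \mathbb{R}^{N \times N}$. The combined concatenation and linear projection is defined by the multihead attention function

$$\mathrm{MHA}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \left[\, \chi_1, \ldots, \chi_A \,\right] \mathbf{W}^{O}.$$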
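The projection, per-head attention, concatenation and output projection steps can be sketched in pure Python as follows. This is a toy illustration under stated assumptions: the weights are random rather than trained, biases are omitted, and all names (`multihead_attention`, `rand_matrix`, `W_q`, `W_k`, `W_v`, `W_o`) are hypothetical. It is used in a self-attentive manner, with the same sequence as query, key and value.

```python
import math
import random


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d_k = len(Q[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multihead_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """Project per head, attend, concatenate heads, then project with W_o."""
    heads = [
        attention(matmul(Q, wq), matmul(K, wk), matmul(V, wv))
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    # Concatenate the A head outputs along the feature dimension (A * d = N)
    concat = [sum((head[l] for head in heads), []) for l in range(len(Q))]
    return matmul(concat, W_o)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


N, A, L = 8, 4, 5   # feature size, attention heads, sequence length
d = N // A          # reduced dimensionality per head
W_q = [rand_matrix(N, d) for _ in range(A)]
W_k = [rand_matrix(N, d) for _ in range(A)]
W_v = [rand_matrix(N, d) for _ in range(A)]
W_o = rand_matrix(N, N)

W = rand_matrix(L, N)  # stand-in for an encoded mixture of L frames
out = multihead_attention(W, W, W, W_q, W_k, W_v, W_o)
print(len(out), len(out[0]))
```

The output has the same shape as the input sequence, which is what allows the layer to act as a mask-like re-weighting of the encoded features.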
3.2.3 MHA Encoder and Decoder Architectures
In this section the MHA encoder and decoder architectures are described. Both the encoder and decoder models use a similar paradigm by applying a multihead attention layer followed by a non-linearity to produce a set of mask-like features which are then used to weight an encoded mixture.
3.2.3.1 Encoder
For the encoder self-attention (
FIGURE 6

Convolutional MHA encoder diagram.
3.2.3.2 Mask Refinement and Post-Masking Decoders
A number of approaches are proposed. Two encoder-decoder attention (Vaswani et al., 2017) based decoder models are proposed in this subsection. The first is referred to as mask refinement (MR) and the other is referred to as post-masking (PM). Both decoders are composed of an MHA layer followed by a ReLU activation function and a transposed 1D convolutional layer. For both architectures the inputs to the MHA layers are defined per target signal c ∈ {1, … , C}, where C is the number of target signals. These inputs are defined to combine the principles of encoder-decoder attention, described in Section 3.2.3 of Vaswani et al. (2017), with those of self-attention as both the key and query contain information from the estimated masks. The same MHA layer is used for each speaker.
The MR decoder produces a mask from the MHA layer followed by a ReLU function. This mask is multiplied by the encoded mixture, and the re-masked encoded mixture is then decoded back into the time domain with the transposed 1D convolutional layer. The MR decoder model is depicted in Figure 7A. The motivation in this design is to use the MHA mechanism to produce a mask that refines the already masked encoded representation such that it attends better to the features most relevant to the dominant speaker in the original masked encoded features.
FIGURE 7

(A) MHA mask refinement (MR) decoder architecture. (B) MHA post-masking (PM) decoder architecture. (C) MHA self-attention (SA) decoder architecture.
The post-masking decoder (PMD) also uses an MHA layer to produce a new mask, but in this model the new mask is used to refine the already masked encoded mixture. The PMD model is shown in Figure 7B. The motivation in this design is to use the MHA mechanism to produce a new mask by observing speaker information in the masks and masked encoded mixtures, yielding an improved hypothesis of what the masks should be by attending to the most prevalent and correlated speaker information in both types of representation.
3.2.3.3 Self-Attention Decoder
An additional decoder based on self-attention is proposed, shown in Figure 7C. This decoder applies MHA to the masks estimated by the network defined in Section 2.3 in a self-attentive manner such that the query, key and value inputs are all given by the estimated masks for each speaker. The output of the MHA layer is followed by a ReLU function to produce a new set of masks. The Hadamard product of the new masks with the encoded mixture is then computed. This masked encoded mixture is then decoded back into the time domain using a transposed 1D convolutional layer.
3.3 Relationship Between Dot Product and Cross-Correlation
Some brief discussion is given to how the scaled dot product function in multihead attention can be formulated as computing a cross correlation matrix of finite discrete processes across the features of each frame ℓ. Using this formulation it is suggested that the attention mechanism naturally applies more weight across frames that are highly cross correlated and applies less weight across frames that have lower cross correlation.
The discrete cross-correlation function of two finite processes q [n] and k [n] can be estimated by
The numerator of (24) is the following matrix of size Lq × Lk for which in the following Lq = Lk = Lx.
For each cell in the resultant matrix there is the dot product of the feature vectors qℓ and kℓ′ for a pair of frames ℓ and ℓ′, which can be written more explicitly as

$$\mathbf{q}_\ell \mathbf{k}_{\ell'}^\top = \sum_{n=1}^{d_k} q_\ell[n]\, k_{\ell'}[n]. \tag{35}$$

The sum in Eq. 35 can be formulated as the cross-correlation function in Eq. 33 where κ = 0, x is substituted with qℓ and y is substituted with kℓ′. The intuition in using this formulation is that additive noise features will have much lower correlation to the target speech signal across time than the speech features will to themselves. Similarly, it is assumed that convolutional noise features, i.e. reverberant features, will have much higher correlation to the target speech features across the temporal axis, and thus the attention mechanism will be less effective at dereverberating the reverberant features.
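The equivalence between a dot product and a cross-correlation at zero lag can be checked with a small Python sketch. An unnormalised cross-correlation estimate is assumed here (any normalisation constant would only rescale the values), and the function names are hypothetical.

```python
def cross_correlation(q, k, lag=0):
    """Unnormalised cross-correlation of two finite sequences at a given lag."""
    return sum(q[n] * k[n + lag] for n in range(len(q)) if 0 <= n + lag < len(k))


def dot(q, k):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(q, k))


q = [1.0, 2.0, 3.0]
k = [4.0, 5.0, 6.0]

print(cross_correlation(q, k, lag=0), dot(q, k))
print(cross_correlation(q, k, lag=1))
```

At lag zero the two quantities coincide, which is why each cell of the attention score matrix can be read as a zero-lag cross-correlation between the features of two frames.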
3.4 Encoder and Decoder Complexities
Some brief discussion is given to the model complexities predominantly for reference. The complexities for each of the proposed encoders and decoders as well as the baselines used later in Section 4 are given in Table 1.
TABLE 1
| Model | Complexity |
|---|---|
| Conv-TasNet encoder | O ((LBL + 1) ⋅ Lx ⋅ N) |
| Conv-TasNet decoder | O (LBL ⋅ Lx ⋅ N) |
| Deep-PReLU encoder | O ((LBL + 7) ⋅ Lx ⋅ N + 3 ⋅ Lx ⋅ N2) |
| Deep PReLU-decoder | O ((LBL + 6) ⋅ Lx ⋅ N + 3 ⋅ Lx ⋅ N2) |
| SA encoder (proposed) | |
| self-attention (SA) decoder, PM decoder, MR decoder (proposed) | |
Complexity of all encoder and decoder models evaluated, including all non-linearities, weights and biases.
The proposed encoder described in 3.2.3.1 is more computationally complex than the encoders proposed by
4 Experiments
This section presents details of the experimental setup as well as the results of the experiments performed to evaluate the encoders and decoders proposed in the previous section.
4.1 Data
A number of datasets have been proposed for benchmarking speech separation systems (
Noise clips were sampled from a number of urban environments and these are mixed with the speech mixtures at a randomly selected SNR value from a uniform distribution between −6 and +3 dB. RIRs are also randomly generated. An RIR is generated for each speaker from the same simulated room environment. The RIRs have a reverberation time RT60 ranging from 0.1 to 1 s and are generated using the pyroomacoustics software package (Scheibler et al., 2018).
4.2 Training Configuration
The Conv-TasNet model is implemented using the SpeechBrain framework introduced by
TABLE 2
| Variable | Description | Baseline | |
|---|---|---|---|
| N | Input channels | 512 | 512 |
| LBL | Input block size | 16 | 16 |
| B | Bottleneck output channels | 128 | 128 |
| Sc | Skip connection channels | 128 | N/A |
| H | Output channels | 512 | 512 |
| P | Kernel size of conv. block | 3 | 3 |
| X | Blocks of increasing dilation | 8 | 6 |
| R | Repeats of dilated layers | 3 | 4 |
| | Temporal context (s) | 1.53 | 0.51 |
| SISDR | SISDR (dB) on CSM | 14.3 | 14.6 |
Details of the Conv-TasNet configuration compared to
An utterance-level permutation invariant training (PIT) scheme (
4.3 Assessment Metrics
Performance is measured using SISDR, signal-to-distortion ratio (SDR), perceptual evaluation of speech quality (PESQ) and short-time objective intelligibility (STOI).
SDR is a generalized SNR metric that measures the amount of energy in the signal compared with the energy in the combined residual noise, artifacts and interference. SDR has been widely used in assessing source separation models in general (Stoller et al., 2018;
PESQ was proposed by Rix et al. (2001) as an objective measure for speech quality assessment. The design of PESQ is supposed to offer similar results to Mean Opinion Score (MOS) by using psychoacoustically motivated filter models. The measure ranges from −0.5 to 4.5, with −0.5 being considered lowest quality. PESQ is often used for assessing general denoising and dereverberation tasks. It has also been used for assessing speech separation performance (Wang et al., 2014;
STOI is an intelligibility metric proposed by Taal et al. (2010) which uses correlation ratios between clean and degraded signals to assess the intelligibility of the degraded signal with a score between 0 and 1. STOI has been commonly used for assessing general speech enhancement tasks but has also been used in assessing speech separation models (
Δ measures are shown in addition to the absolute metric values to indicate the improvement in quality or intelligibility between the noisy reverberant signal mixture x and the network estimates against the reference sc.
4.4 Results
The following subsections address the speech separation results of the proposed method in comparison to baseline methods on the WHAMR corpus. The MHA encoder is evaluated first and then two subsequent sections analyse the MHA decoder architectures and look at how the number of attention heads affects performance. All metrics use the permutation invariant training schema to find the optimal value of each metric under the assumption this is the correctly matched permutation of speakers. Every set of results is compared against the original encoder and decoder proposed by
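The permutation search used when scoring can be sketched as follows in Python: every assignment of estimates to references is tried and the one with the best mean metric value is kept. The function names and the toy metric (negative mean squared error) are assumptions of this example; in the evaluation described here the metric would be SISDR, SDR, PESQ or STOI.

```python
from itertools import permutations


def best_permutation_score(estimates, references, metric):
    """Return (best mean score, permutation) over all speaker assignments.

    perm[c] is the index of the estimate assigned to reference c.
    """
    best = None
    for perm in permutations(range(len(references))):
        score = sum(
            metric(estimates[p], references[c]) for c, p in enumerate(perm)
        ) / len(references)
        if best is None or score > best[0]:
            best = (score, perm)
    return best


def neg_mse(estimate, reference):
    """Toy metric for illustration: higher is better."""
    return -sum((e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


references = [[1.0, 1.0, 1.0], [-1.0, -1.0, -1.0]]
# The estimates are deliberately listed in the opposite speaker order
estimates = [[-0.9, -1.0, -1.1], [1.0, 0.9, 1.1]]

score, perm = best_permutation_score(estimates, references, neg_mse)
print(score, perm)
```

The exhaustive search scales factorially with the number of speakers, which is unproblematic for the two-speaker mixtures considered here.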
4.4.1 MHA Encoder Results
The MHA encoder model seen in Figure 6 is compared to the original Conv-TasNet baseline encoder proposed by
TABLE 3
| AC | Encoder | SISDR | ΔSISDR | SDR | ΔSDR | PESQ | ΔPESQ | STOI | ΔSTOI |
|---|---|---|---|---|---|---|---|---|---|
| CSM | Conv-TasNet | 14.7 | 14.7 | 15.1 | 15 | 2.99 | 1.69 | 0.94 | 0.342 |
| | Deep PReLU | 14.8 | 14.8 | 15.2 | 15.1 | 2.96 | 1.66 | 0.941 | 0.344 |
| | SAE | 15.7 | 15.7 | 16.1 | 16.0 | 3.15 | 1.84 | 0.952 | 0.355 |
| NSM | Conv-TasNet | 7.63 | 12.1 | 8.28 | 12.5 | 1.97 | 0.838 | 0.824 | 0.373 |
| | Deep PReLU | 7.83 | 12.3 | 8.51 | 12.7 | 2.04 | 0.900 | 0.840 | 0.432 |
| | SAE | 8.37 | 12.9 | 9.01 | 13.2 | 2.09 | 0.93 | 0.854 | 0.446 |
| RSM | Conv-TasNet | 5.52 | 8.81 | 7.75 | 7.87 | 2.20 | 0.969 | 0.847 | 0.312 |
| | Deep PReLU | 5.91 | 9.20 | 8.09 | 8.21 | 2.26 | 1.04 | 0.860 | 0.325 |
| | SAE | 6.39 | 9.67 | 8.57 | 8.68 | 2.34 | 1.10 | 0.874 | 0.339 |
| NRSM | Conv-TasNet | 3.54 | 9.66 | 5.48 | 8.96 | 1.79 | 0.656 | 0.75 | 0.366 |
| | Deep PReLU | 3.63 | 9.76 | 5.56 | 9.05 | 1.82 | 0.68 | 0.76 | 0.372 |
| | SAE | 4.11 | 10.4 | 6.00 | 9.48 | 1.92 | 0.754 | 0.787 | 0.399 |
Comparison of MHA encoder with 4 attention heads to Original Conv-TasNet encoder across various acoustic conditions. Bold indicates the best performing model for each acoustic condition and metric.
These results demonstrate a consistent improvement of the MHA encoder over the original, purely convolutional baseline encoder. The highest improvement in performance can be observed for the clean speech mixtures (CSM) since this is the easiest task for the network. The MHA encoder achieved slightly more performance improvement on the RSM condition than on the NSM and NRSM conditions. The MHA encoder outperformed the Deep PReLU encoder on every acoustic condition. Figure 8 shows the intermediate features in the MHA encoder encoding an NRSM signal. Comparing the encoded signal after the first convolutional layer in the network to the similar representation in Figure 6, it is notable that the convolutional layer has learned to focus on a narrow set of channels. This implies that a large number of the channels are in fact redundant, a similar finding to the MPGT encoder and convolutional decoder model proposed by
Another interesting finding of the output of the MHA layer is that the mask-like features do not seem to attenuate the signal where there is only noise present as one might expect due noise not being present in the target signal at training. This effect can be seen more clearly when compared to the intermediaries of the CSM signal encoded by the MHA encoder in Figure 9.
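Returning to the Table 3 results, the per-condition SISDR gains of the MHA encoder (SAE) over the Conv-TasNet baseline can be tallied with a few lines of arithmetic. The sketch below only re-uses the SISDR values printed in Table 3; the dictionary layout and the helper name `sisdr_gain` are illustrative choices, not part of the original experiments.

```python
# SISDR values (dB) copied from Table 3: baseline Conv-TasNet encoder
# versus the MHA encoder (SAE) for each acoustic condition.
sisdr_db = {
    "CSM": {"Conv-TasNet": 14.7, "SAE": 15.7},
    "NSM": {"Conv-TasNet": 7.63, "SAE": 8.37},
    "RSM": {"Conv-TasNet": 5.52, "SAE": 6.39},
    "NRSM": {"Conv-TasNet": 3.54, "SAE": 4.11},
}


def sisdr_gain(condition):
    """Return the SAE gain over the baseline in dB, rounded to two decimals."""
    row = sisdr_db[condition]
    return round(row["SAE"] - row["Conv-TasNet"], 2)


gains = {ac: sisdr_gain(ac) for ac in sisdr_db}

# Conditions ordered from largest to smallest gain.
ranking = sorted(gains, key=gains.get, reverse=True)

for ac in ranking:
    print(f"{ac}: +{gains[ac]:.2f} dB")
```

Under these table values the ordering comes out as CSM, RSM, NSM, NRSM, which matches the ranking described in the text.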
FIGURE 8

Top left: encoded NRSM features after 1D convolution and non-linearity in the MHA encoder, sorted using Algorithm 1. Top right: mask-like output of the self-attentive MHA layer in the MHA encoder. Bottom left: output of the MHA encoder. Bottom right: averaged attention weight matrix across all attention heads, A = 4.
FIGURE 9

Top left: encoded CSM features after 1D convolution and non-linearity in the MHA encoder. Top right: mask-like output of the self-attentive MHA layer in the MHA encoder. Bottom left: output of the MHA encoder. Bottom right: averaged attention weight matrix across all attention heads, A = 4.
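The bottom-right panels of Figures 8 and 9 show an attention weight matrix averaged across the A = 4 heads. As a rough illustration of what such an average involves, the sketch below builds toy scaled dot-product attention weights for each head and averages them element-wise. Everything here is an assumption made for illustration: the random queries and keys stand in for learned projections of the encoded features, and the sizes `T` and `d_head` are arbitrary toy values rather than the paper's configuration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries, keys):
    """Scaled dot-product attention weights; each row sums to one."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(qv, kv)) / scale for kv in keys])
        for qv in queries
    ]


def average_heads(per_head):
    """Average a list of T x T weight matrices element-wise across heads."""
    n_heads = len(per_head)
    n_frames = len(per_head[0])
    return [
        [sum(head[i][j] for head in per_head) / n_heads for j in range(n_frames)]
        for i in range(n_frames)
    ]


random.seed(0)
A, T, d_head = 4, 6, 8  # heads, frames, per-head feature size (toy values)

# Hypothetical per-head queries and keys; in a trained model these would come
# from learned linear projections of the same encoded feature sequence.
heads = []
for _ in range(A):
    q = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(T)]
    k = [[random.gauss(0, 1) for _ in range(d_head)] for _ in range(T)]
    heads.append(attention_weights(q, k))

avg = average_heads(heads)
print(len(avg), "x", len(avg[0]), "averaged attention matrix")
```

Because every per-head matrix is row-normalised by the softmax, the averaged matrix is row-normalised as well, so it can be read as a single attention distribution per frame.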
4.4.2 MHA Decoder Architecture Comparisons
A comparison of the mask refinement decoder (MRD) in Figure 7A, the PMD in Figure 7B and the self-attention decoder (SAD) in Figure 7C is carried out in the following to analyse which approach, if any, leads to superior decoding performance over the Conv-TasNet baseline (
TABLE 4
| AC | Decoder | SISDR | ΔSISDR | SDR | ΔSDR | PESQ | ΔPESQ | STOI | ΔSTOI |
|---|---|---|---|---|---|---|---|---|---|
| CSM | Conv-TasNet | 14.7 | 14.7 | 15.1 | 15 | 2.99 | 1.69 | 0.94 | 0.342 |
| CSM | Deep PReLU | 15.0 | 15.0 | 15.5 | 15.3 | 3.01 | 1.72 | 0.943 | 0.345 |
| CSM | SAD | 15.0 | 15.0 | 15.5 | 15.3 | 3.09 | 1.78 | 0.944 | 0.347 |
| CSM | MRD | 15.1 | 15.1 | 15.6 | 15.4 | 3.06 | 1.76 | 0.946 | 0.348 |
| CSM | PMD | 12.5 | 12.5 | 13.1 | 13.0 | 2.85 | 1.55 | 0.932 | 0.335 |
| NSM | Conv-TasNet | 7.63 | 12.1 | 8.28 | 12.5 | 1.97 | 0.838 | 0.824 | 0.373 |
| NSM | Deep PReLU | 7.87 | 12.4 | 8.55 | 12.8 | 2.05 | 0.913 | 0.834 | 0.426 |
| NSM | SAD | 7.88 | 12.4 | 8.53 | 12.8 | 2.05 | 0.9 | 0.842 | 0.434 |
| NSM | MRD | 7.52 | 12.0 | 8.19 | 12.4 | 1.98 | 0.837 | 0.837 | 0.429 |
| NSM | PMD | 7.39 | 11.9 | 8.08 | 12.3 | 1.96 | 0.82 | 0.835 | 0.427 |
| RSM | Conv-TasNet | 5.52 | 8.81 | 7.75 | 7.87 | 2.20 | 0.969 | 0.847 | 0.312 |
| RSM | Deep PReLU | 5.85 | 9.14 | 7.88 | 7.99 | 2.27 | 1.04 | 0.856 | 0.32 |
| RSM | SAD | 5.92 | 9.2 | 8.07 | 8.19 | 2.27 | 1.03 | 0.859 | 0.323 |
| RSM | MRD | 5.77 | 9.06 | 7.96 | 8.07 | 2.20 | 0.976 | 0.855 | 0.319 |
| RSM | PMD | 5.37 | 8.66 | 7.32 | 7.44 | 2.22 | 0.986 | 0.850 | 0.315 |
| NRSM | Conv-TasNet | 3.54 | 9.66 | 5.48 | 8.96 | 1.79 | 0.656 | 0.75 | 0.366 |
| NRSM | Deep PReLU | 3.68 | 9.81 | 5.54 | 9.03 | 1.82 | 0.681 | 0.761 | 0.373 |
| NRSM | SAD | 3.87 | 9.99 | 5.74 | 9.22 | 1.88 | 0.718 | 0.774 | 0.385 |
| NRSM | MRD | 3.19 | 9.32 | 5.12 | 8.61 | 1.76 | 0.62 | 0.769 | 0.381 |
| NRSM | PMD | 3.08 | 9.20 | 4.84 | 8.32 | 1.76 | 0.62 | 0.756 | 0.368 |
There was a clear performance improvement on clean speech mixtures across all metrics with the MRD in Figure 7A. A noticeable performance increase can also be observed for the reverberant speech mixtures, but this improvement does not carry over to the noisy reverberant speech mixtures, where there was a small drop across all measures except STOI. The PMD design showed decreased performance across all conditions and metrics. The best performing of the proposed decoders across all conditions was the self-attention decoder. This decoder also outperformed the baseline Deep PReLU decoder, with a larger margin the more challenging the audio became; cf. the SISDR results for the CSM and NSM conditions with those for the RSM and NRSM conditions.
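For readers who want to reproduce the SISDR and ΔSISDR columns on their own signals, the following is a minimal sketch of the standard scale-invariant SDR computation. It assumes that ΔSISDR denotes the SISDR of the separated estimate minus the SISDR of the unprocessed mixture, both measured against the same reference; the helper names and the toy sinusoidal signals are hypothetical and are not taken from the experiments reported here.

```python
import math


def dot(x, y):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length signals."""
    # Optimal scaling of the reference towards the estimate.
    alpha = dot(estimate, reference) / dot(reference, reference)
    target = [alpha * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    # No epsilon guard: a perfect estimate would divide by zero here.
    return 10.0 * math.log10(dot(target, target) / dot(residual, residual))


def si_sdr_improvement(estimate, mixture, reference):
    """Gain of the separated estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals: two orthogonal sinusoids stand in for target and interferer.
N = 200
reference = [math.sin(2 * math.pi * 5 * n / N) for n in range(N)]
interferer = [math.sin(2 * math.pi * 11 * n / N) for n in range(N)]
mixture = [s + v for s, v in zip(reference, interferer)]
# Assumed separator output: the interferer is attenuated by a factor of ten.
estimate = [s + 0.1 * v for s, v in zip(reference, interferer)]

improvement = si_sdr_improvement(estimate, mixture, reference)
print(f"SISDR: {si_sdr(estimate, reference):.1f} dB")
print(f"Improvement over mixture: {improvement:.1f} dB")
```

In this toy case the mixture sits at roughly 0 dB, so the absolute SISDR and the improvement nearly coincide. The CSM rows of the tables show the same pattern, which would be consistent with this definition if the clean mixtures have an input SISDR close to 0 dB.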
4.4.3 MHA Decoder Number of Heads Comparisons
Results shown in Section 4.4.2 demonstrated that the proposed self-attention decoder in Figure 7C was more effective than the MRD and PMD. The MRD also showed some potential performance improvement for the CSM condition, but this was not replicated across all conditions. In the following subsection, further analysis is carried out using the SAD and MRD to observe the effect that varying the number of heads has on the model. Experiments were performed using A ∈ {2, 4, 8} attention heads for both decoders and are again compared against the Conv-TasNet (
The results in Table 5 show that using A = 4 attention heads leads to a small but consistent performance increase across all metrics for the MRD over the original Conv-TasNet decoder. The smallest improvement is often close to 0.1 dB SISDR, which is not considered a strong enough improvement beyond the effects of randomized model initialization to confirm that this technique, as implemented here, is any more effective than the original Conv-TasNet decoder. The SAD with A = 4 and A = 8 again shows consistent improvement over the previously demonstrated model with only two attention heads. For both the MRD and the SAD, A = 4 typically leads to the best average improvement across all metrics.
TABLE 5
| AC | Decoder | A | SISDR | ΔSISDR | SDR | ΔSDR | PESQ | ΔPESQ | STOI | ΔSTOI |
|---|---|---|---|---|---|---|---|---|---|---|
| CSM | Conv-TasNet | - | 14.7 | 14.7 | 15.1 | 15 | 2.99 | 1.69 | 0.94 | 0.342 |
| CSM | Deep PReLU | - | 15.0 | 15.0 | 15.5 | 15.3 | 3.01 | 1.72 | 0.943 | 0.345 |
| CSM | MRD | 2 | 15.1 | 15.1 | 15.6 | 15.4 | 3.06 | 1.76 | 0.946 | 0.348 |
| CSM | MRD | 4 | 15.0 | 15.0 | 15.4 | 15.3 | 3.07 | 1.75 | 0.944 | 0.347 |
| CSM | MRD | 8 | 14.6 | 14.6 | 15.1 | 14.9 | 3.02 | 1.71 | 0.936 | 0.338 |
| CSM | SAD | 2 | 15.0 | 15.0 | 15.5 | 15.3 | 3.09 | 1.78 | 0.944 | 0.347 |
| CSM | SAD | 4 | 15.3 | 15.3 | 15.7 | 15.5 | 3.1 | 1.79 | 0.946 | 0.349 |
| CSM | SAD | 8 | 15.3 | 15.3 | 15.8 | 15.6 | 3.14 | 1.82 | 0.948 | 0.351 |
| NSM | Conv-TasNet | - | 7.63 | 12.1 | 8.28 | 12.5 | 1.97 | 0.838 | 0.824 | 0.373 |
| NSM | Deep PReLU | - | 7.87 | 12.4 | 8.55 | 12.8 | 2.05 | 0.913 | 0.834 | 0.426 |
| NSM | MRD | 2 | 7.52 | 12.0 | 8.19 | 12.4 | 1.98 | 0.837 | 0.837 | 0.429 |
| NSM | MRD | 4 | 7.74 | 12.2 | 8.40 | 12.6 | 2.04 | 0.87 | 0.834 | 0.426 |
| NSM | MRD | 8 | 7.51 | 12.0 | 8.17 | 12.4 | 2.04 | 0.873 | 0.831 | 0.423 |
| NSM | SAD | 2 | 7.88 | 12.4 | 8.53 | 12.8 | 2.06 | 0.9 | 0.842 | 0.434 |
| NSM | SAD | 4 | 7.97 | 12.5 | 8.62 | 12.9 | 2.08 | 0.919 | 0.844 | 0.436 |
| NSM | SAD | 8 | 7.96 | 12.5 | 8.61 | 12.8 | 2.09 | 0.931 | 0.841 | 0.433 |
| RSM | Conv-TasNet | - | 5.52 | 8.81 | 7.75 | 7.87 | 2.20 | 0.969 | 0.847 | 0.312 |
| RSM | Deep PReLU | - | 5.85 | 9.14 | 7.88 | 7.99 | 2.27 | 1.04 | 0.856 | 0.320 |
| RSM | MRD | 2 | 5.77 | 9.06 | 7.96 | 8.07 | 2.20 | 0.976 | 0.855 | 0.319 |
| RSM | MRD | 4 | 5.58 | 8.87 | 7.84 | 7.96 | 2.25 | 1.00 | 0.846 | 0.311 |
| RSM | MRD | 8 | 5.46 | 8.75 | 7.71 | 7.83 | 2.21 | 0.968 | 0.846 | 0.306 |
| RSM | SAD | 2 | 5.92 | 9.2 | 8.07 | 8.19 | 2.28 | 1.03 | 0.859 | 0.323 |
| RSM | SAD | 4 | 6.01 | 9.3 | 8.13 | 8.25 | 2.29 | 1.05 | 0.863 | 0.328 |
| RSM | SAD | 8 | 5.99 | 9.28 | 8.12 | 8.24 | 2.28 | 1.04 | 0.862 | 0.326 |
| NRSM | Conv-TasNet | - | 3.54 | 9.66 | 5.48 | 8.96 | 1.79 | 0.656 | 0.75 | 0.366 |
| NRSM | Deep PReLU | - | 3.68 | 9.81 | 5.54 | 9.03 | 1.82 | 0.681 | 0.761 | 0.373 |
| NRSM | MRD | 2 | 3.19 | 9.32 | 5.12 | 8.61 | 1.76 | 0.622 | 0.769 | 0.381 |
| NRSM | MRD | 4 | 3.61 | 9.73 | 5.54 | 9.03 | 1.87 | 0.710 | 0.764 | 0.376 |
| NRSM | MRD | 8 | 3.61 | 9.74 | 5.53 | 9.01 | 1.88 | 0.714 | 0.765 | 0.376 |
| NRSM | SAD | 2 | 3.87 | 9.99 | 5.74 | 9.22 | 1.88 | 0.718 | 0.774 | 0.385 |
| NRSM | SAD | 4 | 3.81 | 9.93 | 5.74 | 9.23 | 1.89 | 0.728 | 0.766 | 0.377 |
| NRSM | SAD | 8 | 3.81 | 9.93 | 5.67 | 9.15 | 1.88 | 0.719 | 0.769 | 0.38 |
Comparison of using 2, 4 and 8 attention heads in MRD (Figure 7A) against the original Conv-TasNet decoder proposed by
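One simple way to summarise the head-count comparison in Table 5 is to average a metric over the four acoustic conditions for each decoder and value of A. The sketch below does this for SISDR only, using the values printed in the table; restricting the summary to one metric and using an unweighted mean are simplifying assumptions made here, not the procedure used in the paper.

```python
# SISDR (dB) per acoustic condition, copied from Table 5
# in the order CSM, NSM, RSM, NRSM. Keys are (decoder, number of heads A).
sisdr_db = {
    ("MRD", 2): [15.1, 7.52, 5.77, 3.19],
    ("MRD", 4): [15.0, 7.74, 5.58, 3.61],
    ("MRD", 8): [14.6, 7.51, 5.46, 3.61],
    ("SAD", 2): [15.0, 7.88, 5.92, 3.87],
    ("SAD", 4): [15.3, 7.97, 6.01, 3.81],
    ("SAD", 8): [15.3, 7.96, 5.99, 3.81],
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


mean_sisdr = {key: mean(vals) for key, vals in sisdr_db.items()}

# For each decoder, pick the head count with the highest mean SISDR.
best_heads = {}
for decoder in ("MRD", "SAD"):
    candidates = {a: m for (d, a), m in mean_sisdr.items() if d == decoder}
    best_heads[decoder] = max(candidates, key=candidates.get)

for (decoder, heads), value in sorted(mean_sisdr.items()):
    print(f"{decoder} A={heads}: mean SISDR {value:.2f} dB")
print(best_heads)
```

On the SISDR column alone, A = 4 gives the highest mean for both decoders, although the margin over A = 8 for the SAD is below 0.01 dB.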
4.4.4 Comparison of Combined MHA Encoder/Decoder Models to Deep Convolutional Encoder/Decoder
The final set of results given in this section compares the MHA encoder and decoder approach to a deep convolutional encoder and decoder proposed by
The results in Table 6 show that the proposed combinations of the SAE with the SAD or MRD lead to better results across all metrics for the CSM, RSM and NRSM acoustic conditions compared to the Deep PReLU baseline. However, the combination of the SAE with either of the proposed decoders performed worse in all metrics than the SAE with the original Conv-TasNet decoder. This again implies that the minimal performance gain reported in Table 5 for the MRD might be purely due to initialization properties of the MHA decoder model. Furthermore, the MHA encoder model uses significantly fewer parameters than both the Deep PReLU model and the proposed combined SAE and SAD model.
TABLE 6
| AC | Model | Size (M) | SISDR | ΔSISDR | SDR | ΔSDR | PESQ | ΔPESQ | STOI | ΔSTOI |
|---|---|---|---|---|---|---|---|---|---|---|
| CSM | Conv-TasNet | 3.5 | 14.7 | 14.7 | 15.1 | 15 | 2.99 | 1.69 | 0.94 | 0.342 |
| CSM | Deep PReLU | 8.2 | 14.8 | 14.8 | 15.2 | 15.1 | 2.96 | 0.66 | 0.943 | 0.345 |
| CSM | SAE & MRD | 5.5 | 15.2 | 15.2 | 15.7 | 15.5 | 3.12 | 1.81 | 0.946 | 0.349 |
| CSM | SAE & SAD | 5.5 | 15.6 | 15.6 | 16.0 | 15.9 | 3.16 | 1.85 | 0.952 | 0.355 |
| CSM | SAE & CD | 4.5 | 15.7 | 15.7 | 16.1 | 16.0 | 3.15 | 1.84 | 0.952 | 0.355 |
| NSM | Conv-TasNet | 3.5 | 7.63 | 12.1 | 8.28 | 12.5 | 1.97 | 0.838 | 0.824 | 0.373 |
| NSM | Deep PReLU | 8.2 | 8.20 | 12.7 | 8.88 | 13.1 | 2.07 | 0.938 | 0.849 | 0.441 |
| NSM | SAE & MRD | 5.5 | 7.97 | 12.5 | 8.62 | 12.9 | 2.06 | 0.896 | 0.839 | 0.431 |
| NSM | SAE & SAD | 5.5 | 8.3 | 12.8 | 8.94 | 13.2 | 2.11 | 0.943 | 0.852 | 0.444 |
| NSM | SAE & CD | 4.5 | 8.37 | 12.9 | 9.01 | 13.2 | 2.09 | 0.93 | 0.854 | 0.446 |
| RSM | Conv-TasNet | 3.5 | 5.52 | 8.81 | 7.75 | 7.87 | 2.20 | 0.969 | 0.847 | 0.312 |
| RSM | Deep PReLU | 8.2 | 6.23 | 9.51 | 8.24 | 8.36 | 2.32 | 1.10 | 0.870 | 0.334 |
| RSM | SAE & MRD | 5.5 | 6.13 | 9.42 | 8.32 | 8.44 | 2.29 | 1.05 | 0.869 | 0.334 |
| RSM | SAE & SAD | 5.5 | 6.13 | 9.41 | 8.33 | 8.44 | 2.29 | 1.05 | 0.869 | 0.334 |
| RSM | SAE & CD | 4.5 | 6.39 | 9.67 | 8.57 | 8.68 | 2.34 | 1.10 | 0.874 | 0.339 |
| NRSM | Conv-TasNet | 3.5 | 3.54 | 9.66 | 5.48 | 8.96 | 1.79 | 0.656 | 0.750 | 0.366 |
| NRSM | Deep PReLU | 8.2 | 3.81 | 9.93 | 5.64 | 9.12 | 1.80 | 0.667 | 0.760 | 0.376 |
| NRSM | SAE & MRD | 5.5 | 3.80 | 9.93 | 5.69 | 9.19 | 1.88 | 0.717 | 0.778 | 0.389 |
| NRSM | SAE & SAD | 5.5 | 3.91 | 10.0 | 5.78 | 9.27 | 1.9 | 0.735 | 0.778 | 0.39 |
| NRSM | SAE & CD | 4.5 | 4.11 | 10.42 | 6.00 | 9.48 | 1.92 | 0.754 | 0.787 | 0.399 |
Comparison of MHA encoder and decoder against the deep convolutional encoder/decoder Conv-TasNet model proposed by
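The trade-off between model size and performance in Table 6 can be made concrete with a small calculation. The sketch below uses the NRSM SISDR values and model sizes printed in the table. The "gain per added million parameters" figure is an illustrative quantity introduced here and is not a metric reported in the paper.

```python
# (size in millions of parameters, NRSM SISDR in dB), copied from Table 6.
models = {
    "Conv-TasNet": (3.5, 3.54),
    "Deep PReLU": (8.2, 3.81),
    "SAE & MRD": (5.5, 3.80),
    "SAE & SAD": (5.5, 3.91),
    "SAE & CD": (4.5, 4.11),
}

baseline_size, baseline_sisdr = models["Conv-TasNet"]


def gain_per_added_million(name):
    """SISDR gain over Conv-TasNet divided by the extra parameters (dB per M)."""
    size, sisdr = models[name]
    return (sisdr - baseline_sisdr) / (size - baseline_size)


efficiency = {
    name: round(gain_per_added_million(name), 3)
    for name in models
    if name != "Conv-TasNet"
}
best = max(efficiency, key=efficiency.get)

# Relative size of the MHA encoder model compared to the Deep PReLU model.
size_ratio = models["SAE & CD"][0] / models["Deep PReLU"][0]

for name, value in efficiency.items():
    print(f"{name}: {value:.3f} dB per added million parameters")
print(f"Most parameter-efficient on NRSM: {best}")
print(f"SAE & CD size relative to Deep PReLU: {size_ratio:.0%}")
```

On these numbers the SAE with the original convolutional decoder yields the largest NRSM gain for the fewest added parameters, at roughly 55% of the size of the Deep PReLU model.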
5 Conclusion and Future Work
In this paper, novel MHA encoder and decoder networks were proposed for improving TasNet models. The proposed self-attention based MHA encoder demonstrated significant improvement over other encoder baselines across SISDR, SDR, PESQ, and STOI metrics. Three MHA decoders, two using encoder-decoder attention approaches and one using a self-attention approach, were proposed. Performance compared to the original Conv-TasNet model (
There are a number of avenues for further research with the proposed MHA encoder and decoders. The MHA encoder demonstrated reliable performance improvements without the significant increase in model size seen in other encoder and decoder networks proposed for Conv-TasNet (
Statements
Data availability statement
Publicly available datasets were analyzed in this study. This data can be found here: https://wham.whisper.ai/.
Author contributions
WR was the main author, proposed using MHA layers in the encoders and decoders, and was involved in devising and implementing the channel sorting algorithm. WR also implemented all the experiments in Section 4. SG contributed to paper writing, assisted with the model analysis sections and provided supervisory support. TH proposed the channel sorting algorithm, had editorial input on this work and provided supervisory support.
Funding
This work was supported by the Centre for Doctoral Training in Speech and Language Technologies (SLT) and their Applications funded by United Kingdom Research and Innovation (grant number EP/S023062/1). This study received funding from 3M Health Information Systems, Inc. The funder was not involved in the study design, collection, analysis, interpretation of data, the writing of this article or the decision to submit it for publication.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Footnotes
1. Conv-TasNet implementation in SpeechBrain: https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/lobes/models/conv_tasnet.py
References
1. Bahdanau, D., Cho, K., and Bengio, Y. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate,” in Proc. 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA. Editors Y. Bengio and Y. LeCun. doi:10.48550/ARXIV.1409.0473
2. Benesty, J. (2000). An Introduction to Blind Source Separation of Speech Signals. USA: Kluwer Academic Publishers, 321–329.
3. Cauchi, B., Gerkmann, T., Doclo, S., Naylor, P., and Goetze, S. (2016). “Spectrally and Spatially Informed Noise Suppression Using Beamforming and Convolutive NMF,” in Proc. AES 60th Conference on Dereverberation and Reverberation of Audio, Music, and Speech (Leuven, Belgium).
4. Cauchi, B., Kodrasi, I., Rehr, R., Gerlach, S., Jukić, A., Gerkmann, T., et al. (2015). Combination of MVDR Beamforming and Single-Channel Spectral Processing for Enhancing Noisy and Reverberant Speech. EURASIP J. Adv. Signal Process. 2015, 61. doi:10.1186/s13634-015-0242-x
5. Chen, J., Mao, Q., and Liu, D. (2020). Dual-Path Transformer Network: Direct Context-Aware Modeling for End-To-End Monaural Speech Separation. Proc. Interspeech, 2642–2646. doi:10.21437/Interspeech.2020-2205
6. [Dataset] Cosentino, J., Pariente, M., Cornell, S., Deleforge, A., and Vincent, E. (2020). Librimix: An Open-Source Dataset for Generalizable Speech Separation.
7. Deng, C., Zhang, Y., Ma, S., Sha, Y., Song, H., and Li, X. (2020). Conv-TasSAN: Separative Adversarial Network Based on Conv-TasNet. Proc. Interspeech, 2647–2651. doi:10.21437/Interspeech.2020-2371
8. Ditter, D., and Gerkmann, T. (2020). “A Multi-phase Gammatone Filterbank for Speech Separation via TasNet,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 36–40. doi:10.1109/icassp40776.2020.9053602
9. Haeb-Umbach, R., Heymann, J., Drude, L., Watanabe, S., Delcroix, M., and Nakatani, T. (2021). Far-field Automatic Speech Recognition. Proc. IEEE 109, 124–148. doi:10.1109/JPROC.2020.3018668
10. Hershey, J. R., Chen, Z., Le Roux, J., and Watanabe, S. (2016). “Deep Clustering: Discriminative Embeddings for Segmentation and Separation,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 31–35. doi:10.1109/ICASSP.2016.7471631
11. Isik, Y., Roux, J. L., Chen, Z., Watanabe, S., and Hershey, J. R. (2016). “Single-channel Multi-Speaker Separation Using Deep Clustering,” in Proc. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 545–549. doi:10.21437/Interspeech.2016-1176
12. Kadıoğlu, B., Horgan, M., Liu, X., Pons, J., Darcy, D., and Kumar, V. (2020). “An Empirical Study of Conv-TasNet,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7264–7268. doi:10.1109/ICASSP40776.2020.9054721
13. Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). “Transformers Are RNNs: Fast Autoregressive Transformers with Linear Attention,” in Proceedings of the 37th International Conference on Machine Learning. Editors H. Daumé III and A. Singh, 5156–5165. doi:10.48550/ARXIV.2006.16236
14. Kolbaek, M., Yu, D., Tan, Z.-H., and Jensen, J. (2017). Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 25, 1901–1913. doi:10.1109/TASLP.2017.2726762
15. Le Roux, J., Hershey, J. R., and Weninger, F. (2015). “Deep NMF for Speech Separation,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 66–70. doi:10.1109/ICASSP.2015.7177933
16. Lea, C., Vidal, R., Reiter, A., and Hager, G. D. (2016). “Temporal Convolutional Networks: A Unified Approach to Action Segmentation,” in Computer Vision – ECCV 2016 Workshops. Editors G. Hua and H. Jégou (Cham: Springer International Publishing), 47–54. doi:10.1007/978-3-319-49409-8_7
17. Li, C., Shi, J., Zhang, W., Subramanian, A. S., Chang, X., Kamo, N., et al. (2021). “ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration,” in 2021 IEEE Spoken Language Technology Workshop (SLT), 785–792. doi:10.1109/SLT48900.2021.9383615
18. Lin, Z., Feng, M., Dos Santos, C., Yu, M., Xiang, B., Zhou, B., et al. (2017). “A Structured Self-Attentive Sentence Embedding,” in 2017 Proceedings of the International Conference on Learning Representations (ICLR 2017). doi:10.48550/ARXIV.1703.03130
19. Luo, Y., Chen, Z., Hershey, J. R., Le Roux, J., and Mesgarani, N. (2017). “Deep Clustering and Conventional Networks for Music Separation: Stronger Together,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 61–65. doi:10.1109/ICASSP.2017.7952118
20. Luo, Y., Chen, Z., and Yoshioka, T. (2020). “Dual-path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation,” in Proc. 2020 ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, 46–50. doi:10.1109/ICASSP40776.2020.9054266
21. Luo, Y., and Mesgarani, N. (2019). Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 27, 1256–1266. doi:10.1109/TASLP.2019.2915167
22. Luo, Y., and Mesgarani, N. (2018). “Tasnet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation,” in Proc. 2018 ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, 696–700. doi:10.1109/ICASSP.2018.8462116
23. Maciejewski, M., Wichern, G., McQuinn, E., and Roux, J. L. (2020). “Whamr!: Noisy and Reverberant Single-Channel Speech Separation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 696–700. doi:10.1109/ICASSP40776.2020.9053327
24. Moritz, N., Adiloğlu, K., Anemüller, J., Goetze, S., and Kollmeier, B. (2017). Multi-channel Speech Enhancement and Amplitude Modulation Analysis for Noise Robust Automatic Speech Recognition. Comput. Speech & Lang. 46, 558–573. doi:10.1016/j.csl.2016.11.004
25. Ochiai, T., Delcroix, M., Ikeshita, R., Kinoshita, K., Nakatani, T., and Araki, S. (2020). “Beam-TasNet: Time-Domain Audio Separation Network Meets Frequency-Domain Beamformer,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6384–6388. doi:10.1109/ICASSP40776.2020.9053575
26. Pariente, M., Cornell, S., Deleforge, A., and Vincent, E. (2020). “Filterbank Design for End-To-End Speech Separation,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6364–6368. doi:10.1109/ICASSP40776.2020.9053038
27. Parsons, T. W. (1976). Separation of Speech from Interfering Speech by Means of Harmonic Selection. J. Acoust. Soc. Am. 60, 911–918. doi:10.1121/1.381172
28. [Dataset] Ravanelli, M., Parcollet, T., Plantinga, P., Rouhe, A., Cornell, S., Lugosch, L., et al. (2021). SpeechBrain: A General-Purpose Speech Toolkit. doi:10.48550/ARXIV.2106.04624. arXiv:2106.04624
29. Reddy, C. K. A., Dubey, H., Koishida, K., Nair, A., Gopal, V., Cutler, R., et al. (2021). “INTERSPEECH 2021 Deep Noise Suppression Challenge,” in Proc. Interspeech 2021 (Brno, Czech Republic), 2796–2800. doi:10.21437/Interspeech.2021-1609
30. Rix, A. W., Beerends, J. G., Hollier, M. P., and Hekstra, A. P. (2001). “Perceptual Evaluation of Speech Quality (PESQ) - A New Method for Speech Quality Assessment of Telephone Networks and Codecs,” in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 749–752. doi:10.1109/ICASSP.2001.941023
31. Roux, J. L., Wisdom, S., Erdogan, H., and Hershey, J. R. (2019). “SDR - Half-Baked or Well Done?” in Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 626–630. doi:10.1109/ICASSP.2019.8683855
32. Scheibler, R., Bezzam, E., and Dokmanic, I. (2018). “Pyroomacoustics: A Python Package for Audio Room Simulation and Array Processing Algorithms,” in Proc. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 351–355. doi:10.1109/ICASSP.2018.8461310
33. Schmidt, M. N., and Olsson, R. K. (2006). “Single-channel Speech Separation Using Sparse Non-negative Matrix Factorization,” in Proc. Interspeech 2006. doi:10.21437/Interspeech.2006-655
34. Shi, Y., and Hain, T. (2021). “Supervised Speaker Embedding De-mixing in Two-Speaker Environment,” in 2021 IEEE Spoken Language Technology Workshop (SLT 2021). doi:10.1109/SLT48900.2021.9383580
35. Shi, Z., Lin, H., Liu, L., Liu, R., Han, J., and Shi, A. (2019). Deep Attention Gated Dilated Temporal Convolutional Networks with Intra-parallel Convolutional Modules for End-To-End Monaural Speech Separation. Proc. Interspeech, 3183–3187. doi:10.21437/Interspeech.2019-1373
36. Stoller, D., Ewert, S., and Dixon, S. (2018). “Wave-U-Net: A Multi-Scale Neural Network for End-To-End Audio Source Separation,” in Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, 334–340. doi:10.48550/ARXIV.1806.03185
37. Subakan, C., Ravanelli, M., Cornell, S., Bronzi, M., and Zhong, J. (2021). “Attention Is All You Need in Speech Separation,” in Proc. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 21–25. doi:10.1109/ICASSP39728.2021.9413901
38. Taal, C. H., Hendriks, R. C., Heusdens, R., and Jensen, J. (2010). “A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech,” in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 4214–4217. doi:10.1109/ICASSP.2010.5495701
39. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention Is All You Need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems (Red Hook, NY, USA: Curran Associates Inc), 6000–6010. doi:10.5555/3295222.3295349
40. Wang, D., and Chen, J. (2018). Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM Trans. Audio Speech Lang. Process. 26, 1702–1726. doi:10.1109/TASLP.2018.2842159
41. Watanabe, S., Hori, T., Karita, S., Hayashi, T., Nishitoba, J., Unno, Y., et al. (2018). ESPnet: End-To-End Speech Processing Toolkit. Proc. Interspeech, 2207–2211. doi:10.21437/Interspeech.2018-1456
42. Wichern, G., Antognini, J., Flynn, M., Zhu, L. R., McQuinn, E., Crow, D., et al. (2019). WHAM!: Extending Speech Separation to Noisy Environments. Proc. Interspeech, 1368–1372. doi:10.21437/Interspeech.2019-2821
43. Yang, G.-P., Tuan, C.-I., Lee, H.-Y., and Lee, L.-s. (2019). Improved Speech Separation with Time-And-Frequency Cross-Domain Joint Embedding and Clustering. Proc. Interspeech, 1363–1367. doi:10.21437/Interspeech.2019-2181
44. Wang, Y., Narayanan, A., and Wang, D. (2014). On Training Targets for Supervised Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 22, 1849–1858. doi:10.1109/TASLP.2014.2352935
Summary
Keywords: tasnet, speech separation, speech enhancement, encoder, decoder, attention
Citation: Ravenscroft W, Goetze S and Hain T (2022) Att-TasNet: Attending to Encodings in Time-Domain Audio Speech Separation of Noisy, Reverberant Speech Mixtures. Front. Sig. Proc. 2:856968. doi: 10.3389/frsip.2022.856968
Received: 17 January 2022
Accepted: 13 April 2022
Published: 11 May 2022
Volume: 2 - 2022
Edited by: Nobutaka Ito, University of Tokyo, Japan
Reviewed by: Yoshiki Masuyama, Tokyo Metropolitan University, Japan; Timo Gerkmann, University of Hamburg, Germany
Copyright
© 2022 Ravenscroft, Goetze and Hain.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: William Ravenscroft, jwravenscroft1@sheffield.ac.uk
This article was submitted to Signal Processing Theory, a section of the journal Frontiers in Signal Processing