Music Demixing Challenge 2021

Music source separation has been studied intensively over the last decade, and tremendous progress has been made with the advent of deep learning. Evaluation campaigns such as MIREX or SiSEC compared state-of-the-art models and, together with the corresponding papers, helped researchers integrate best practices into their own models. In recent years, the widely used MUSDB18 dataset has played an important role in measuring the performance of music source separation. While the dataset made a considerable contribution to the advancement of the field, it is also subject to several biases resulting from a focus on Western pop music and the limited number of mixing engineers involved. To address these issues, we designed the Music Demixing (MDX) Challenge on a crowd-based machine learning competition platform, where the task is to separate stereo songs into four instrument stems (Vocals, Drums, Bass, Other). The main differences compared with past challenges are that 1) the competition is designed so that machine learning practitioners from other disciplines can participate more easily, 2) evaluation is done on a hidden test set created by music professionals exclusively for the challenge to ensure its transparency, i.e., the test set is not accessible to anyone except the challenge organizers, and 3) the dataset covers a wider range of music genres and involved a greater number of mixing engineers. In this paper, we provide details of the datasets, baselines, evaluation metrics, evaluation results, and technical challenges for future competitions.


INTRODUCTION
Audio source separation has been studied extensively for decades, driven by many practical applications that benefit our daily life, e.g., hearing aids, denoising in video conferences, etc. Additionally, music source separation (MSS) attracts professional creators because it enables the remixing or reviving of songs to a degree never achieved with conventional approaches such as equalizers. Further, suppressing the vocals in a song can improve the experience of karaoke applications, where people can sing together on top of the original song (with vocals suppressed) instead of relying on content developed specifically for karaoke. Despite these potential benefits, the research community long struggled to achieve the separation quality required by commercial applications. The demanding requirements are further aggravated by the under-determined setting in which the problem is formulated, since the number of channels in the audio recording is smaller than the number of sound objects to be separated.
In the last decade, the separation quality of MSS has improved mainly owing to the advent of deep learning. A significant improvement in an MSS task was observed at the Signal Separation Evaluation Campaign (SiSEC) 2015 [1], where a simple feed-forward network [2] performing four-instrument separation achieved the best signal-to-distortion ratio (SDR) scores, surpassing all other methods that did not use deep learning. The use of deep learning in MSS has accelerated ever since and led to improved SDR results year after year in the successive SiSEC editions, held in 2016 [3] and 2018 [4]. An important component of this success story was the release of publicly available datasets such as [5] which, compared to previous datasets such as [6], was created specifically for MSS tasks. MUSDB18 consists of 150 music tracks in four stems and is, up until now, widely used due to a lack of alternatives. The dataset also has a number of limitations, such as its limited number of genres (mostly pop/rock) and its biases concerning mixing characteristics (most stems were produced by the same engineers). Since the last evaluation campaign took place, many new papers have claimed state-of-the-art performance based on the MUSDB18 test data; however, it is unclear whether generalization performance improved at the same pace or whether some models overfit on MUSDB18. To keep scientific MSS research relevant and sustainable, we want to address some of the limitations of current evaluation frameworks by using: • a fully automatic evaluation system enabling straightforward participation for machine learning practitioners from other disciplines;
• a new, professionally produced dataset containing unseen data dedicated exclusively to the challenge to ensure transparency in the competition (i.e., the test set is not accessible to anyone except the challenge organizers).
With these contributions, we designed a new competition called the Music Demixing (MDX) Challenge, for which a call for participants was conducted on a crowd-based machine learning competition platform. A hidden dataset crafted exclusively for this challenge was employed in a system that automatically evaluated all MSS systems submitted to the competition. The MDX Challenge is regarded as a follow-up event to the professionally-produced music (MUS) task of the past SiSEC editions; to continue that tradition, participants were asked to separate stereo songs into stems of four instruments (Vocals, Drums, Bass, Other). Two leaderboards are used to rank the submissions: A) methods trained on MUSDB18(-HQ) and B) methods trained with extra data. Leaderboard A gives any participant, independently of the data they possess, the possibility to train an MSS system (since MUSDB18 is open) and includes systems that can, at a later stage, be compared with the existing literature, as they share the training data commonly used in research; leaderboard B permits models to be used to their full potential and therefore shows the highest achievable scores as of today.
In the following, the paper provides the details about the test dataset in Sec. 2, the leaderboards in Sec. 3, the evaluation metrics in Sec. 4, the baselines in Sec. 5, the evaluation results in Sec. 6, and the technical challenges for future competitions in Sec. 7.

MDXDB21
For the specific purpose of this challenge, we introduced a new test set, called MDXDB21. The test set consists of 30 songs created by Sony Music Entertainment (Japan) Inc. (SMEJ) with the specific intent of using it for the evaluation of the MDX Challenge. The dataset was hidden from the participants; only the organizers of the challenge could access it. This allowed a fair comparison of the submissions. Here we provide details on the creation of the dataset:
• More than 20 songwriters were involved in the making of the 30 songs, so that there is no overlap with existing songs in terms of composition and lyrics;
• The copyright of all 30 songs is managed by SMEJ, so that MDXDB21 can easily be integrated with other datasets in the future without any issue arising from copyright management;
• More than 10 mixing engineers were involved in the dataset creation with the aim of diversifying the mixing styles of the included songs;
• The loudness and tone across different songs were not normalized to any reference level, since these songs are not meant to be distributed on commercial platforms;
• To follow the tradition of past competitions like SiSEC, the mixture signal (i.e., the input to the models at evaluation time) is obtained as the simple summation of the individual target stems.
Table 1 shows a list of the songs included in MDXDB21. To provide more diversity in genre and language, the dataset also features non-Western and non-English songs. The table also lists the instruments present in each song; this can help researchers understand under which conditions their models fail to perform the separation. More information about the loudness as well as stereo information for each song and its stems is given in the Appendix.

LEADERBOARDS AND CHALLENGE ROUNDS
For a fair comparison between systems trained with different data, we designed two different leaderboards for the MDX Challenge: • Leaderboard A accepted MSS systems trained exclusively on MUSDB18-HQ [7]. Our main purpose was to give everyone the opportunity to train an MSS model and take part in the competition, independently of the data they have. On top of that, since MUSDB18-HQ is the standard training dataset for MSS in the literature, models trained on it can also be compared with the current state-of-the-art in publications by evaluating their performance on the test set of MUSDB18-HQ and using the metrics included in the BSS Eval v4 package, as done for example by [8] and [9].
• Leaderboard B did not pose any constraints on the used training dataset. This allowed participants to train bigger models, exploiting the power of all the data at their disposal.
To prevent participants from overfitting to the MDXDB21 dataset, we split the dataset into three equal-sized parts and designed two challenge rounds: in the first round, participants could access the scores of their models computed only on the first portion of MDXDB21. In the second round, the second portion of MDXDB21 was added to the evaluation, and participants could see how well their models generalized to new data. After the challenge ended, the overall score was computed on all songs. These overall scores were also used for the final ranking of the submissions.

EVALUATION METRIC
In the following section, we will introduce the metric that was used for the MDX Challenge as the ranking criterion and compare it to other metrics that have been used in past competitions.

Signal-to-Distortion Ratio (SDR)
As an evaluation metric, we chose the multichannel signal-to-distortion ratio (SDR) [10], also called signal-to-noise ratio (SNR), which is defined as

SDR = 10 log10( (Σ_n ‖s(n)‖² + ε) / (Σ_n ‖s(n) − ŝ(n)‖² + ε) ),   (1)

where s(n) ∈ R² denotes the waveform of the ground truth and ŝ(n) the waveform of the estimate for one of the sources, with n being the (discrete) time index. We use a small constant ε = 10⁻⁷ in (1) to avoid divisions by zero.
The higher the SDR score, the better the output of the system.
For each song, averaging over the four sources yields the song-level score

SDR_Song = (1/4) (SDR_Bass + SDR_Drums + SDR_Other + SDR_Vocals).   (2)

Finally, the systems are ranked by averaging SDR_Song over all songs in the hidden test set. As (1) considers the full audio waveform at once, we will denote it as a "global" metric. In contrast, we will denote a metric as "framewise" if the waveform is split into shorter frames before analyzing it. Using the global metric (1) has two advantages. First, it is inexpensive to compute, as opposed to more complex measures like BSS Eval v4, which also outputs the image-to-spatial distortion ratio (ISR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR). Second, there is no problem with frames where at least one source or estimate is all-zero. Such frames are discarded in the computation of BSS Eval v4, as otherwise, e.g., SIR cannot be computed. This, however, yields the unwanted side effect that the SDR values of the different sources in BSS Eval v4 are not independent of each other (see, for example, https://github.com/sigsep/bsseval/issues/4). The global SDR (1) does not suffer from this cross-dependency between source estimates.
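For illustration, the global SDR of (1) and the per-song average of (2) can be sketched in a few lines of NumPy (a minimal sketch; the function names are ours, and the constant ε follows the text):

```python
import numpy as np

def global_sdr(ref, est, eps=1e-7):
    """Global multichannel SDR of Eq. (1): full waveform, no framing.

    ref, est: arrays of shape (num_samples, 2) holding the stereo
    ground-truth and estimated waveforms for one source.
    """
    num = np.sum(ref ** 2) + eps
    den = np.sum((ref - est) ** 2) + eps
    return 10.0 * np.log10(num / den)

def song_sdr(refs, ests, eps=1e-7):
    """Eq. (2): average the per-source SDRs of the four stems of a song."""
    return np.mean([global_sdr(r, e, eps) for r, e in zip(refs, ests)])
```

Note that an all-zero estimate yields exactly 0 dB under this definition, whereas BSS Eval has to discard frames in which a source or estimate is silent.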

Comparison with Other Metrics
Before deciding on the global SDR (1), we compared it with other metrics for audio source separation using the best system from SiSEC 2018 ("TAU1"). Common metrics are either "global" or "framewise": "global" refers to computing the metric on the full song, whereas "framewise" denotes a computation of the metric on shorter frames whose results are then averaged to obtain a value for the song. For the framewise metrics, we used in our experiment a frame size as well as a hop size of one second, which is the default for museval [4], except for the framewise SDR of SiSEC 2016, where we used a frame size of 30 seconds and a hop size of 15 seconds as in [3]. Fig. 1 shows the correlations of the global SDR (1) (used as reference) with the different metrics on the MUSDB18 test set for "TAU1", the best system from SiSEC 2018 [4]. For each metric, we compute the correlation coefficient with the reference metric for each source and show the minimum, average, and maximum correlation over all four sources. Please note that some metrics are similarity metrics ("higher is better") whereas others measure a distance ("smaller is better"). As we use the global SDR as reference, the correlation coefficient becomes negative when the correlation with a distance metric is computed. We can see that there is a strong correlation between the global SDR (1) and the median-averaged SDR from BSS Eval v3 and v4, as the Pearson and Spearman correlations are on average larger than 0.9. This analysis confirms that the global SDR (1) is a good choice, as it has a high correlation with the evaluation metrics of SiSEC 2016, i.e., metric (f), and SiSEC 2018, i.e., metric (g), while at the same time being simple to compute and yielding per-source values that are independent of the estimates for the other sources. In the following, we will refer to the global SDR (1) simply as "SDR".
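As a sketch of how such a metric comparison can be carried out, the Pearson correlation can be computed directly, and the Spearman correlation is the Pearson correlation of the ranks (a minimal sketch assuming no ties; the inputs would be per-song metric values):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two score vectors."""
    return np.corrcoef(x, y)[0, 1]

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no-ties case)."""
    rank = lambda a: np.argsort(np.argsort(a)).astype(float)
    return pearson(rank(x), rank(y))
```

A monotone but nonlinear relation between two metrics yields a Spearman correlation of exactly 1 while the Pearson correlation stays below 1; a distance metric compared against a similarity metric comes out negative, as noted above.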

BASELINE SYSTEMS
The MDX Challenge featured two baselines: Open-Unmix (UMX) and CrossNet-UMX (X-UMX). A description of UMX can be found in [13], where the network is based on a BiLSTM architecture that was studied in [14]. X-UMX [15] is an enhanced version of UMX. Fig. 2 shows the architectures of the two models. The main difference between them is that UMX can be trained independently for any instrument while X-UMX requires all the networks together during training to allow the exchange of gradients between them, at the cost of more memory. During inference, there is almost no difference between UMX and X-UMX regarding the model size or the computation time as the averaging operations in X-UMX do not introduce additional learnable parameters.

MDX CHALLENGE 2021 RESULTS AND KEY TAKEAWAYS
In this section, we will first give results for various systems known from the literature on MDXDB21 before summarizing the outcome of the MDX Challenge 2021.

Preliminary Experiments on MDXDB21
The two baselines described in Sec. 5, as well as state-of-the-art MSS methods, were evaluated on MDXDB21. Table 3 shows the SDR results on leaderboard B. Since the extra dataset used by each model is different, we cannot directly compare their scores, but we nonetheless list the results to see how well the SOTA models perform on new data created by music professionals. It can already be seen that the difference in SDR between these models is smaller than what is reported on the well-established MUSDB18 test set. This indicates that generalization is indeed an issue for many of these systems, and it will be interesting to see whether other contributions can outperform the SOTA systems. Fig. 3(a) and 3(b) show SDR_Song for all 30 songs for the baselines as well as the currently best methods known from the literature. There is one exception in the computation of SDR_Song for SS_015: for this song, it was computed by excluding bass and averaging over three instruments only, as this song has an all-silent bass track and SDR_Bass would otherwise distort the average SDR over four instruments. For this reason, this song was made part of the set of three songs (SS_008, SS_015 and SS_018) that were left out of the evaluation so that they could be provided to the participants as demos.
The results for SS_025-026 are considerably worse than for the other songs; we assume that this is because these songs contain more electronic sounds than the others.

MDX Challenge 2021 Results
The challenge was well received by the community: in total, we received 1541 submissions from 61 teams around the world. Table 4 shows the final leaderboards of the MDX Challenge. It gives the mean SDR from (1) over all 27 songs as well as the standard deviation of the mean SDR over the three splits (each with 9 songs) that were used in the different challenge rounds as described in Sec. 3. Comparing these numbers with the results in Tables 2 and 3, we can observe a considerable SDR improvement of approximately 1.5 dB throughout the contest. The evolution of the best SDR over time is shown in Fig. 4. As several baselines for leaderboard A were provided at the start of the challenge, progress could first be observed on leaderboard A, and the submissions for leaderboard B were not significantly better. With the start of round 2, the performance gap between leaderboards A and B increased as participants managed to train bigger models with additional data.
This improvement was not only achieved by blending several existing models but also by architectural changes. The following sections give more details on the winning models; each was written by the respective team.
Hybrid Demucs
The original Demucs model consists of an encoder/decoder operating in the time domain, with U-Net skip connections [20]. In the hybrid version, a spectrogram branch is added, which is fed with the input spectrogram, represented either by its magnitude or by its real and imaginary parts, a.k.a. Complex-As-Channels (CAC) [21]. Unlike the temporal branch, the spectrogram branch applies convolutions along the frequency axis, reducing the number of frequency bins by a factor of 4 with every encoder layer. Starting from 2048 frequency bins, obtained with a 4096-sample STFT with a hop length of 1024 and excluding the last bin for simplicity, the input to the 5th layer of the spectral branch has only 8 frequency bins remaining, which are collapsed to a single one with a single convolution. On the other hand, the input to the 5th layer of the temporal branch has an overall stride of 4^4 = 256, which is aligned with the stride of 1024 of the spectral branch by a single convolution. At this point, the two representations have the same shape and are summed before going through a common layer that further reduces the number of time steps by 2. Symmetrically, the first layer in the decoder is shared before being fed both to the temporal and the spectral decoder, each with its own set of U-Net skip connections. The overall structure is represented in Fig. 5.
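The shape bookkeeping described above can be verified with a few lines of arithmetic (values taken from the text; this is only a sanity check, not the actual implementation):

```python
n_fft, hop = 4096, 1024       # STFT size and hop length from the text
freq_bins = n_fft // 2        # 2048 bins (the last bin is excluded)
encoder_layers = 4

# Spectral branch: each encoder layer divides the bin count by 4.
bins = freq_bins
for _ in range(encoder_layers):
    bins //= 4
assert bins == 8              # 8 bins enter the 5th spectral layer

# Temporal branch: each layer has stride 4, so the overall stride is 4^4.
temporal_stride = 4 ** encoder_layers
assert temporal_stride == 256
# One more stride-4 convolution aligns it with the spectral stride (the hop).
assert temporal_stride * 4 == hop
```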
The output of the spectrogram branch is inverted to a waveform, either directly with the ISTFT when CAC is used, or thanks to Wiener filtering [22] for the amplitude representation, using Open-Unmix differentiable implementation [13]. The final loss is applied directly in the time domain. This allows for end-to-end hybrid domain training, with the model being free to combine both domains. To account for the fact that musical signals are not equivariant with respect to the frequency axis, we either inject a frequency embedding after the first spectral layer, following [23], or we allow for different weights depending on the frequency band, as done by [24].
Further improvements come from inserting residual branches in each of the encoder layers. The branches operate with a reduced number of dimensions (scaled down by a factor of 4), using dilated convolutions and group normalization [25] and, for the 2 innermost layers, BiLSTMs and local attention. Local attention is based on regular attention [26], but replaces positional embeddings with a controllable penalty that limits its scope to nearby time steps. All ReLUs in the network were replaced by GELUs [27]. Finally, we achieve better generalization and stability by penalizing the largest singular values of each layer [28]. We achieved further gains (between 0.1 and 0.2 dB) by fine-tuning the models on a specifically crafted dataset and with longer training samples (30 seconds instead of 10). This dataset was built by combining stems from separate tracks while respecting a number of conditions, in particular beat matching and pitch compatibility, allowing only for small pitch or tempo corrections.
The final models submitted to the competition are bags of 4 models. For leaderboard A, it is a combination of temporal-only and hybrid Demucs models, given that with only MUSDB18-HQ as a training set, we observed a regression on the bass source. For leaderboard B, all models are hybrid, as the extra training data made the hybrid version better than its time-only counterpart for all sources. We refer the reader to our Github repository facebookresearch/demucs for the exact hyper-parameters used. More details and experimental results, including subjective evaluations, are provided in the hybrid Demucs paper [8].

KUIELab-MDX-Net (kuielab)
Similar to Hybrid Demucs, KUIELab-MDX-Net [9] uses a two-branched approach. As shown in Fig. 6, it has a time-frequency branch (left side) and a time-domain branch (right side). Each branch estimates four stems (i.e., vocals, drums, bass, and other). The blend module [14] outputs, for each source, the average of the stems estimated by the two branches. While the branches of Hybrid Demucs were jointly trained end-to-end, each branch of KUIELab-MDX-Net was trained independently.
For the time-domain branch, it uses the original Demucs [18], which was pre-trained on MUSDB18. It was not fine-tuned on MUSDB18-HQ, preserving the original parameters.
For the time-frequency-domain branch, it uses five sub-networks. Four sub-networks were trained independently to separate the four stems, respectively. For each stem, an enhanced version of TFC-TDF-U-Net [21] was used; we call the enhanced version TFC-TDF-U-Net v2 for the rest of the paper. Another sub-network, called Mixer, was trained to output enhanced sources by taking and refining the estimated stems.
TFC-TDF-U-Net is a variant of the U-Net [20] architecture. It improves source separation by employing TFC-TDFs [21] as building blocks instead of fully convolutional layers. The architectural/training changes made in TFC-TDF-U-Net v2 relative to the original are as follows: • For skip connections between encoder and decoder, multiplication is used instead of concatenation.
• The other skip connections (e.g., dense skip connections in a dense block [24]) were removed.
• While the number of channels does not change after down-/upsampling in the original, channels are increased/decreased when downsampling/upsampling in v2.
• While the original was trained to minimize a time-frequency-domain loss, v2 was trained to minimize a time-domain loss (i.e., the L1 loss between the estimated waveform and the ground truth). Since dense skip connections based on concatenation usually require a large amount of GPU memory, as discussed in [29], TFC-TDF-U-Net v2 was designed to use simpler modules. [9] found that replacing concatenation with multiplication for each skip connection does not severely degrade the performance of the TFC-TDF-U-Net structure. They also observed that removing the dense skip connections in each block does not significantly degrade the performance when TFC-TDF blocks are used. A single Time-Distributed Fully connected (TDF) block, contained in a TFC-TDF, has a full receptive field in the frequency dimension; thus, U-Nets with TFC-TDFs can show promising results even with shallow or simple structures, as discussed in [21]. To compensate for the reduction in parameters caused by this design shift, v2 enlarges the number of channels with every downsampling, which is more common in conventional U-Nets [20]. It was trained to minimize a time-domain loss for direct end-to-end optimization.
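To make the skip-connection change concrete, here is a minimal NumPy sketch (shapes and function names are illustrative, not from the official implementation): concatenation doubles the channel count, and hence the memory of subsequent layers, while element-wise multiplication keeps it unchanged:

```python
import numpy as np

def skip_concat(decoder_feat, encoder_feat):
    """Concatenation skip, as in the original TFC-TDF-U-Net: channels double."""
    return np.concatenate([decoder_feat, encoder_feat], axis=0)

def skip_multiply(decoder_feat, encoder_feat):
    """Multiplicative skip, as in v2: element-wise product, channels unchanged."""
    return decoder_feat * encoder_feat

# Illustrative feature maps of shape (channels, freq, time).
dec = np.ones((16, 64, 32))
enc = np.ones((16, 64, 32))
assert skip_concat(dec, enc).shape == (32, 64, 32)
assert skip_multiply(dec, enc).shape == (16, 64, 32)
```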
Also, KUIELab-MDX-Net applies a frequency cut-off trick, introduced in [13], to increase the window size of the STFT (or FFT size, in short) as source-specific preprocessing. It cuts off high frequencies above the target source's expected frequency range from the mixture spectrogram. In this way, the FFT size can be increased while keeping the same input spectrogram size (which we needed to constrain because of the separation time limit). Since a larger FFT size usually leads to better SDR, this approach can effectively improve quality with a proper cut-off frequency. It is also why we did not use a multi-target model (i.e., a single model separating all the sources at once), where we could not apply source-specific frequency cutting. Training one separation model per source has the benefit of source-specific preprocessing and model configurations. However, these sub-networks lack the knowledge that they are separating the same mixture, because they cannot communicate with each other to share information. An additional sub-network, called Mixer, can further enhance the "independently" estimated sources. For example, the estimated 'vocals' often contain leftover drum snare noise. The Mixer can learn to remove sounds from 'vocals' that are also present in the estimated 'drums', or vice versa. Only very shallow models (such as a single convolution layer) have been tried for the Mixer due to the time limit.
Fig. 5: Hybrid Demucs architecture. The input waveform is processed both through a temporal encoder and through the STFT followed by a spectral encoder. After layer 5, the two representations have the same shape and are summed before going through shared layers. The decoder is built symmetrically. The output spectrogram goes through the ISTFT and is summed with the waveform output, giving the final model output. The Z prefix is used for spectral layers and the T prefix for temporal ones.
More complex models can be tried in the future, since even a single 1×1 convolution layer was enough to yield some improvement in total SDR. The Mixer used in KUIELab-MDX-Net is a point-wise convolution applied in the waveform domain. It takes a multichannel waveform input containing the four estimated stereo stems and the original stereo mixture, and outputs four refined stereo stems. It can be viewed as a blending module [14] with learnable parameters. An ablation study is provided in [9] for interested readers. Finally, KUIELab-MDX-Net takes the weighted average of the stems estimated by the two branches for each source; in other words, it blends [14] the results from the two branches. For leaderboard A, we trained KUIELab-MDX-Net with TFC-TDF-U-Net v2, the Mixer, and blending with the original Demucs, after training on the MUSDB18-HQ [7] training dataset with pitch/tempo-shift [18] and random track mixing [14] augmentations. For leaderboard B, we used KUIELab-MDX-Net without the Mixer but with the validation and test sets of MUSDB18-HQ added to the training data. The source code for training KUIELab-MDX-Net is available at the Github repository kuielab/mdx-net.
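A point-wise (1×1) convolution over waveforms is simply a learned per-sample linear mixing of channels. A minimal NumPy sketch of such a Mixer follows (channel counts are ours, chosen to match the description: four estimated stereo stems plus the stereo mixture in, four stereo stems out):

```python
import numpy as np

rng = np.random.default_rng(0)

num_in = 4 * 2 + 2            # four estimated stereo stems + stereo mixture
num_out = 4 * 2               # four refined stereo stems
num_samples = 44100

x = rng.standard_normal((num_in, num_samples))   # stacked input waveforms
w = rng.standard_normal((num_out, num_in))       # 1x1-conv weights

# A 1x1 convolution has no temporal extent: every output sample is a
# linear combination of the input channels at the same time index.
y = np.einsum('oi,it->ot', w, x)

assert y.shape == (num_out, num_samples)
```

Because the kernel covers a single time step, the operation is cheap enough to run within the challenge's inference time limit while still letting the stems exchange information.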

Danna-Sep (Kazane_Ryo_no_Danna)
Danna-Sep is, not surprisingly, also a hybrid model, blending the outputs of three source separation models across different feature domains. Two of them receive a magnitude spectrogram as input, while the other one uses waveforms, as shown in Fig. 7. The design principle is to combine the complementary strengths of waveform-based and spectrogram-based approaches.
The first of the spectrogram-based models is an X-UMX trained with a complex-valued frequency-domain loss to better address the distance between the ground-truth and estimated spectrograms. In addition, we incorporated differentiable Wiener filtering into training with our own PyTorch implementation, similar to Hybrid Demucs. We initialized this model with the official pre-trained weights before training. The second is a U-Net with six layers consisting of D3 blocks from D3Net [17] and two layers of 2D local attention [30] at the bottleneck. We also experimented with biaxial BiLSTMs along the time and frequency axes as the bottleneck layers, but this took slightly longer to train while offering a negligible improvement. We used the same loss function as for our X-UMX during training, but with Wiener filtering disabled.
The waveform-based model is a variant of the 48-channel Demucs, where the decoder is replaced by four independent decoders, one per source. We believe that this modification helps each decoder focus on its target source without interference from the others. Each decoder has the same architecture as the original decoder, except for the size of the hidden channels, which was reduced to 24. This keeps the total number of parameters comparable to that of the original Demucs. The training loss aggregates the L1-norm between the estimated and ground-truth waveforms of the four sources. We did not apply the shift trick [18] to this variant because of the computation limits set by the competition, but in our experiments we found that it still slightly outperformed the 48-channel Demucs.
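The aggregated training loss can be sketched as the sum of per-source L1 distances between the estimated and ground-truth waveforms (a minimal NumPy sketch under the description above; function names are ours):

```python
import numpy as np

def l1_loss(est, ref):
    """Mean absolute error between two waveforms."""
    return np.mean(np.abs(est - ref))

def multi_decoder_loss(ests, refs):
    """Aggregate the per-source L1 losses over the four decoder outputs."""
    return sum(l1_loss(e, r) for e, r in zip(ests, refs))
```

Summing (rather than averaging) the per-source terms weights all four sources equally while keeping each decoder's gradient independent of the others.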
The aforementioned models were all trained separately on MUSDB18-HQ with pitch/tempo-shift and random track mixing augmentations.
Finally, we calculated the weighted average of the individual outputs of all models; experiments were conducted to search for the optimal weighting. One iteration of Wiener filtering was applied to our X-UMX and U-Net outputs before averaging. For the exact configurations we used, readers can refer to our repository yoyololicon/music-demixing-challenge-ismir-2021-entry on Github. Experimental results for each individual model compared to the unmodified baselines are provided in our workshop paper [31].

General Considerations
Organizing a research challenge is a multi-faceted task: the intention is to bring people from different research areas together, enabling them to tackle a task new to them in the easiest way possible; at the same time, it is an opportunity for more experienced researchers to measure themselves once more against the state-of-the-art, providing them with new challenges and space to further improve their previous work. All this, while making sure that the competition remains fair for all participants, both novice and experienced ones.
To satisfy the first point above, we decided to host the challenge on an open crowd-based machine learning competition website. This made the competition visible to researchers outside the source separation community. The participants were encouraged to communicate, exchange ideas, and even form teams, all through AIcrowd's forum. To make the task accessible to everyone, we prepared introductory material and collected useful resources available on the Internet so that beginners could easily start developing their systems and actively take part in the challenge.
We also increased the interest of experienced researchers because the new evaluation set represented an opportunity for them to evaluate their existing models on new data. This music was created for the specific purpose of being used as the test set in this challenge: this stimulated the interest of the existing MSS community because old models could be tested again and, possibly, improved.
The hidden test set also allowed us to organize a competition where fairness had a very high priority. Nobody could adapt their model to the test set, as nobody except the organizers could access it. The challenge was divided into two rounds in which different subsets of the test set were used for evaluation: this prevented participants from adapting their best model to the test data by running multiple submissions. Only at the very end of the competition were the final scores on the complete test set computed and the winners selected; at that moment, no new models could be submitted anymore. Cash prizes were offered to the winners in return for open-sourcing their submissions.

Bleeding Sounds Among Microphones
The hidden test set was created with the task of MSS in mind, which meant that we had to ensure as much diversity in the audio data as possible while maintaining a professional level of production and also deal with issues that are usually not present when producing music in a studio.
Depending on the genre of the music being produced, bleeding among different microphones (i.e., among the recorded tracks of different instruments) is more or less tolerated. For example, when recording an orchestra, the different sections play in the same room to maintain a natural performance: even if each section is captured with a different set of microphones, each microphone will still pick up the sound of all the other sections. This phenomenon is not desirable for the task of MSS: if a model learns to separate each instrument very well but the test data contains bleeding, the model will be wrongly penalized during the evaluation. For this reason, when producing the songs contained in the dataset, we had to ensure that specific recording conditions were respected. All our efforts were successful except for one song (SS_008): even after taking appropriate measures during the recording process (e.g., placing absorbing walls between the instruments), the tracks contained some bleeding between drums and bass. For this reason, we removed this track from the evaluation and used it as a demo song instead.
Bleeding can be an issue for training data as well: we cannot expect models to perform better than the data they are trained on unless specific measures are taken. We did not explicitly address this aspect in the MDX Challenge; nevertheless, designing systems that are robust to bleeding in the training data is a desirable feature, and not just for MSS. We envision that future editions of the MDX Challenge could have a track reserved for systems trained exclusively on data that suffer from bleeding issues, to focus the research community on this aspect as well.

Silent Sources
Recording conditions are not the only factor that can make a source separation evaluation pipeline fail. Not all instruments are necessarily present in a track: depending on the choices of the composer, arranger, or producer, some instruments may be missing. This is an important aspect, as some evaluation metrics, like the one we chose, are not robust to a silent target. We decided to exclude one song (SS_015) from MDXDB21 as its bass track is silent. Note that this issue was not present in previous competitions like SiSEC, as the test set of MUSDB18 does not feature songs with silent sources. Nevertheless, we think it would be an important improvement if the challenge evaluation could handle songs where one of the instruments is missing (e.g., instrumental songs without vocals, a cappella interpretations, etc.).
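The failure mode can be seen directly in the metric. Assuming again an SDR of the form 10 log10(||s||² / ||s − ŝ||²) (a sketch of the ratio-based metric, not the exact evaluation code), a silent reference makes the numerator zero, so the score degenerates to minus infinity regardless of the estimate (and to NaN if the estimate is also silent), dragging down any per-song average:

```python
import numpy as np

def sdr(reference, estimate):
    # SDR = 10 * log10(||s||^2 / ||s - s_hat||^2)
    return 10 * np.log10(np.sum(reference ** 2) / np.sum((reference - estimate) ** 2))

silent = np.zeros(44100)             # e.g., a song with no bass track
estimate = 1e-4 * np.ones(44100)     # a quiet but non-silent estimate

with np.errstate(divide="ignore"):
    score = sdr(silent, estimate)    # -inf: one such song ruins the mean
```

A common workaround is to add a small epsilon to numerator and denominator, or to skip silent targets entirely; either choice changes what the aggregate score means, which is why we excluded the affected song instead.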
This issue arises from the source separation task defining clear identities for the targets: the evaluation pipeline suffers from this strict definition, which causes some songs to be unnecessarily detrimental to the overall scores. This motivates a move towards source separation tasks that do not assign fixed identities to the targets, such as universal sound source separation, where models must separate sounds independently of their identity. An appropriately designed pipeline for such a task would not suffer from the issue above. For this reason, future editions of the challenge may include tasks similar to universal sound source separation.

Future Editions
We believe that hosting the MDX Challenge strengthened and expanded the source separation community by providing a new playground for research and by attracting researchers from other communities and areas, allowing them to share knowledge and skills, all focused on solving one of the most interesting research problems in the area of sound processing. Source separation can still bring benefits to many application and research areas, which motivates future editions of this competition.
In our view, this year's edition of the MDX Challenge starts a hopefully long tradition of source separation competitions. This year's focus was MSS on four instruments: given the role this task played in past competitions, it was a convenient starting point that provided us with enough feedback and experience on how to make the competition grow and improve.
We encountered difficulties in compromising between how the source separation task expects data to be created and the professional techniques of music production: for instance, to keep the competition fair, we had to make sure that no crosstalk between target recordings was present. The same argument about crosstalk also highlighted the need for source separation systems that can be trained on data suffering from this issue: this potentially opens access to training material not available before and can be another source of improvement for existing models. We also realized how brittle a simple evaluation system is when facing the vast range of choices artists can make when producing music: even the simple absence of an instrument in a track can have dramatic consequences on the competition results.
This knowledge will eventually influence the decisions we will take when designing the next editions of the MDX Challenge. In particular, we will:
• design an evaluation system around a metric that is robust to bleeding sounds between targets in the test ground truth data;
• direct the attention of researchers to the robustness of models with respect to bleeding among targets in the training data, possibly reserving a separate track for systems trained exclusively on such data;
• partially move towards source separation tasks where there is no predefined identity for the targets, such as universal sound source separation.
Furthermore, motivated by the pervasiveness of audio source separation, we will consider reserving special tracks for other types of audio signals, such as speech and ambient noise. The majority of techniques developed nowadays provide useful insights independently of whether they are applied to music, speech, or other kinds of sound. In the interest of scientific advancement, we will try to make the challenge as diverse as possible, so that the highest number of researchers can cooperate, interact, and ultimately compete for the winning system.

CONCLUSIONS
With the MDX Challenge 2021, we continued the successful series of SiSEC MUS challenges. By using a crowdsourced platform to host the competition, we tried to make it easy for ML practitioners from other disciplines to enter this field. Furthermore, we introduced a newly created dataset, called MDXDB21, which served as the hidden evaluation set for this challenge. Using it allows a fair comparison of all recently published models, as it shows their generalization capabilities on unseen songs.
We hope that this MDX Challenge will be the first one in a long series of competitions.