- Centre for Digital Music, School of Electronic Engineering and Computer Science, Queen Mary University of London, London, United Kingdom
Audio effects are extensively used at every stage of audio and music content creation. The majority of differentiable audio effects modeling approaches fall into the black-box or gray-box paradigms, and most models have been proposed for and applied to nonlinear effects like guitar amplifiers, overdrive, distortion, fuzz, and compressors. Although a plethora of architectures has been introduced for the task at hand, there is still a lack of understanding of the state of the art, since most publications experiment with one type of nonlinear audio effect and a very small number of devices. In this work we aim to shed light on the audio effects modeling landscape by comparing black-box and gray-box architectures on a large number of nonlinear audio effects, identifying the most suitable for a wide range of devices. In the process, we also: introduce time-varying gray-box models and propose models for compressor, distortion, and fuzz; publish a large dataset for audio effects research—ToneTwist AFx—that is also the first open to community contributions; evaluate models on a variety of metrics; and conduct extensive subjective evaluation. Code and supplementary material are also available.
1 Introduction
Audio effects are central for engineers and musicians to shape the timbre, dynamics, and spatialisation of sound (Wilmering et al., 2020). For some instruments (e.g., electric guitar) the processing chain is part of the artist’s creative expression (Case, 2010), and music genres and styles are also defined by the type of effects adopted (Blair, 2015; Williams, 2012), with renowned musicians adopting specific combinations to achieve a unique sound (Prown and Sharken, 2003). Therefore, research related to audio effects, especially with the success of deep learning (Bengio et al., 2017) and differentiable digital signal processing (DDSP) (Engel et al., 2020), is a very active field (Comunita and Reiss, 2024), with many avenues of investigation open, like: audio effects classification and identification (Comunità et al., 2021; Hinrichs et al., 2022), settings or coefficients estimation and extraction (Kuznetsov et al., 2020; Nercessian, 2020; Colonel et al., 2022b), modeling (Martínez Ramírez, 2021; Wright et al., 2023), processing graph estimation (Lee et al., 2023, 2024) and style transfer (Steinmetz et al., 2022; Vanka et al., 2024; Steinmetz et al., 2024), automatic mixing (Steinmetz et al., 2021; Martínez-Ramírez et al., 2022) or reverse engineering of a mix (Colonel et al., 2022a; Colonel and Reiss, 2023).
One of the most active areas, driven by interest in emulating vintage or costly analog devices, is audio effects modeling (virtual analog), often categorized into white-, gray-, and black-box approaches (see Section 2) and, when using techniques that support gradient backpropagation, referred to as differentiable white-, gray-, and black-box modeling. Differentiable methods, leveraging data for gradient descent, are widely used to learn audio effect models. Efforts focus on developing architectures that are both accurate and computationally efficient while being expressive enough to model diverse effect categories and devices.
Of the three macro categories of audio effects - linear (e.g., EQ, reverb), nonlinear (e.g., distortion, compression), and modulation (e.g., phaser, flanger) - nonlinear effects have received the most attention, with architectures mainly proposed for black- and gray-box methods (see Secs. 2.7, 2.8). Most nonlinear audio effects models focus on a single category (e.g., guitar amplifiers, distortion, compressors) and are evaluated on a single device or a limited set of devices. Despite this effort, there is no clear state of the art yet that identifies the best-performing approach, particularly for developing architectures that can reliably model all types of nonlinear effects across a wide range of devices. Also, all previous works train and evaluate on data recorded from the same sources (e.g., a specific electric guitar or bass) and on very similar content, playing techniques, and styles, which prevents reliable conclusions about the generalization capabilities of a trained model. In most cases, training and testing data are normalized without verifying the input dynamic range, so trained models are not guaranteed to be accurate across signal levels.
Therefore, to overcome these limitations, assess the state of the art, and identify research directions, we conduct extensive experiments. In doing so, our work contributes mainly in the following ways:
The remainder of the paper is organized as follows. In Section 2, we outline non-differentiable and differentiable white-, gray-, and black-box approaches to audio effects modeling. Sections 3.1, 3.2 describe the black- and gray-box architectures included in our experiments, while the ToneTwist AFx dataset is described in 3.3. Experiments and results are illustrated and discussed in Sections 4, 5, and objective and subjective evaluations are presented in Section 6. We conclude the manuscript in Section 7.
2 Background
2.1 Nonlinear audio effects modeling
The first thorough investigation of nonlinear distortion for musical applications is Le Brun’s 1979 work on waveshaping (Le Brun, 1979) and Roads’ subsequent tutorial (Roads, 1979). In the following decades, numerous techniques were developed to digitally simulate or emulate nonlinear analog circuits and devices. This is why we now have approaches based on: solving or approximating the equations describing a system’s behavior, wave digital filters, state-space modeling, and Volterra series or Wiener-Hammerstein models, just to name a few. And, with the advancement of machine learning, neural-network-based modeling and differentiable DSP have been added to the state of the art.
Audio effects modeling approaches can be categorized into white-, gray-, and black-box, based on the degree of prior knowledge applied, and further classified as differentiable if they support gradient backpropagation. The following sections review previous work on nonlinear audio effects modeling and describe emulation paradigms and techniques.
2.2 White-box modeling
White-box modeling relies on full system knowledge, using differential equations to describe its behavior and numerical methods to solve them in the continuous or discrete domain. These models provide accurate results, making them ideal for replicating specific analog devices. However, they can be time-consuming to develop, require exact knowledge of the equations describing nonlinear elements and access to the circuit schematic, do not generalize to different devices, and can result in substantial computational load.
Simple systems can be modeled manually by solving the equations (Yeh et al., 2007a; D’Angelo and Välimäki, 2014; Esqueda et al., 2017a); but, for more complex cases, there exist general-purpose frameworks like: state-space models (Mačák, 2012; Holters and Zölzer, 2015; Yeh et al., 2009), wave digital filters (Werner et al., 2015c; De Sanctis and Sarti, 2009; D’Angelo et al., 2012), and port-Hamiltonian systems (Falaize and Hélie, 2016).
In practice, equation solving has mostly been applied to a small subset of single nonlinear processing blocks like: vacuum tube (Sapp et al., 1999; Karjalainen et al., 2006; Dempwolf and Zölzer, 2011) or transistor stages (D’Angelo et al., 2019), diode clippers (Yeh et al., 2007a; Germain and Werner, 2015; D’Angelo et al., 2019), transformers (Macak, 2011), the Moog ladder filter (D’Angelo and Välimäki, 2013; D’Angelo and Välimäki, 2014) or the Buchla 259 wavefolder (Esqueda et al., 2017b). But there are also examples in the literature of complete devices modeled with this method (Yeh et al., 2007b; Macak, 2013).
The most studied white-box modeling framework is wave digital filters (WDF), which emulate analog circuits using traveling wave (scattering) theory and wave variables (instead of voltages and currents), preserving properties like passivity and energy conservation for stable, robust digital implementation (Fettweis, 1986).
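As a brief illustration, using the common voltage-wave convention, a port with reference resistance R maps its Kirchhoff variables (voltage v, current i) to incident and reflected waves as

a = v + R·i,  b = v − R·i,

from which the Kirchhoff variables can be recovered as v = (a + b)/2 and i = (a − b)/(2R).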
Beside single nonlinear blocks like: vacuum tube (Karjalainen and Pakarinen, 2006; Yeh and Smith, 2008; D’Angelo and Välimäki, 2012) or transistor stages (Yeh and Smith, 2008; Yeh, 2009; Bogason, 2018), diode clippers (Yeh et al., 2008; Yeh and Smith, 2008; Werner et al., 2015a; Bernardini and Sarti, 2017), opamps (Yeh, 2009; Bogason and Werner, 2017), nonlinear filters (Werner et al., 2015b; Bogason, 2018) and transformers (Pakarinen et al., 2009); we also find examples of WDFs applied to complete devices like: overdrive and distortion pedals (Yeh, 2009; Bernardini et al., 2020), amplifiers (Pakarinen et al., 2009; De Sanctis and Sarti, 2009), limiters (Raffensperger, 2012).
Similar observations are valid for state-space methods, which represent an electronic circuit as a system of first-order differential equations, describing the circuit in terms of state variables, inputs, and outputs. Using matrix algebra to model components relationships, this method enables efficient digital simulation of complex analog systems. Methods developed for the simulation of state-space systems include the K-method (Yeh and Smith, 2006; Yeh, 2009) the Nodal K-method (or NK-method) and the Discrete K-method (or DK-method or Nodal DK-method) (Yeh, 2009; Yeh et al., 2009).
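In discrete time, a linear state-space model takes the familiar form

x[n + 1] = A·x[n] + B·u[n],  y[n] = C·x[n] + D·u[n],

where x is the vector of state variables (e.g., capacitor voltages and inductor currents), u the inputs, and y the outputs. Broadly speaking, for nonlinear circuits the DK-method augments these equations with an implicit nonlinear equation that must be solved (e.g., iteratively or via lookup tables) at each sample.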
Examples of state-space methods applied to single nonlinear blocks include: vacuum tubes, transistors, diode clippers and opamps (Yeh and Smith, 2008; Yeh, 2009; Yeh et al., 2009), or transformers (Holters and Zölzer, 2016). We also find examples of emulation of complete devices like: overdrive and distortion (Yeh, 2009; Holmes and van Walstijn, 2015), fuzz (Holmes et al., 2017; Bennett, 2022), compression (Kröning et al., 2011), or amplifiers (Cohen and Hélie, 2010; Macak et al., 2012).
Port-Hamiltonian approaches emulate analog circuits using Hamiltonian mechanics, focusing on energy conservation (Van Der Schaft, 2006). The only application of this approach to nonlinear audio effects is limited to single transistor stages and diode clippers (Falaize and Hélie, 2016).
2.3 Black-box modeling
Black-box approaches require minimal system knowledge, and mostly rely on input-output data; simplifying the process to collecting adequate data. However, such models often lack interpretability and might entail time-consuming optimizations. The main non-differentiable black-box methods are based on Volterra series and dynamic convolution.
Volterra series represent a nonlinear system as an infinite series of integral operators, analogous to a Taylor series expansion but in the context of systems and signals. The output is given by a sum of convolutions of the input signal with a series of kernels (functions), and the modeling procedure mainly involves extracting such kernels from input-output measurements. Although it has been applied to a range of nonlinear blocks (Hélie, 2006; Yeh et al., 2008) and effects (Carini et al., 2015; Tronchin, 2013; Schmitz and Embrechts, 2017; Orcioni et al., 2017; Carini et al., 2019), this method is sufficiently accurate only for weakly nonlinear systems, making it not applicable to most nonlinear audio effects.
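For example, a discrete-time truncation of the series up to second order can be written as

y[n] = h0 + Σ_τ1 h1[τ1] x[n − τ1] + Σ_τ1 Σ_τ2 h2[τ1, τ2] x[n − τ1] x[n − τ2],

where h1 and h2 are the first- and second-order kernels and higher-order terms follow the same pattern; the number of kernel coefficients to identify grows rapidly with order, which is one reason the approach is practical only for weakly nonlinear systems.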
The same applies to dynamic convolution (Kemp, 1999; Primavera et al., 2012), which adjusts impulse responses or processing kernels based on the present/past input amplitude to model nonlinear or hysteretic behavior.
2.4 Gray-box modeling
Gray-box approaches combine a partial theoretical structure with data—typically input/output measurements—to complete the model. They reduce prior knowledge requirements while maintaining interpretability through a block-oriented approach; however, the specific structure, measurements, and optimization procedure are critical to achieving a good approximation.
For nonlinear effects like amplifiers and distortion, models are typically defined as an interconnection of linear filters and static nonlinearities, such as: Hammerstein models (static nonlinearity followed by linear filter), Wiener models (linear filter followed by static nonlinearity) or Wiener-Hammerstein models (static nonlinearity in between two linear filters). However, for greater accuracy, Wiener-Hammerstein models have often been extended to include: non-static nonlinearities (i.e., hysteresis and memory) (Eichas et al., 2015; Eichas and Zölzer, 2016a; Eichas and Zölzer, 2018), or a cascade of pre- and power-amp models (Kemper, 2014; Eichas et al., 2017). There exist more complex arrangements made of cascaded and parallel blocks (see (Schoukens and Ljung, 2019) for more examples) and, for nonlinear effects, parallel polynomial (Novak et al., 2009; Cauduro Dias de Paiva et al., 2012) and Chebyshev (Novak et al., 2010; Bank, 2011) Hammerstein models have also been adopted. To model the nonlinear part of a system, all these methods rely on some form of waveshaping, which has been studied both as a synthesis (Le Brun, 1979; Fernández-Cid et al., 1999; Fernández-Cid and Quirós, 2001) and as an emulation (Timoney et al., 2010) technique.
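As an illustration, a minimal (non-differentiable) Wiener-Hammerstein model can be sketched as two FIR filters around a static tanh waveshaper; all coefficients below are arbitrary placeholders, not fitted to any device:

```python
import math

def fir(x, h):
    """Causal FIR filter: y[n] = sum_k h[k] * x[n - k]."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def wiener_hammerstein(x, h_pre, h_post, gain=5.0):
    """Wiener-Hammerstein: linear filter -> static nonlinearity -> linear filter."""
    w = fir(x, h_pre)                     # input linear block
    z = [math.tanh(gain * s) for s in w]  # memoryless waveshaper
    return fir(z, h_post)                 # output linear block

# Arbitrary low-order FIR coefficients (placeholders).
h_pre, h_post = [0.9, 0.1], [1.0, -0.2]
y = wiener_hammerstein([0.0, 0.5, 1.0, -1.0], h_pre, h_post)
```

In a gray-box setting the kernels and the waveshaper parameters would be identified from input/output measurements rather than fixed by hand.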
To model dynamic range compression, all proposed methods (Ramos, 2011; Eichas and Zölzer, 2016b) are based on the block-oriented architecture described in (Zölzer et al., 2002) and analyzed in detail in (Giannoulis et al., 2012).
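In that block-oriented structure, the core components are a static gain computer and attack/release smoothing; a simplified hard-knee sketch in the spirit of (Giannoulis et al., 2012), with level detection and other details omitted, is:

```python
import math

def compressor_gain_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Static hard-knee gain computer: dB of gain reduction (<= 0)."""
    if level_db <= threshold_db:
        return 0.0
    return (threshold_db + (level_db - threshold_db) / ratio) - level_db

def smooth_gain(gains_db, fs=44100, attack_ms=5.0, release_ms=100.0):
    """One-pole smoothing with separate attack/release time constants."""
    a_att = math.exp(-1.0 / (fs * attack_ms / 1000.0))
    a_rel = math.exp(-1.0 / (fs * release_ms / 1000.0))
    out, prev = [], 0.0
    for g in gains_db:
        a = a_att if g < prev else a_rel  # attack when gain reduction grows
        prev = a * prev + (1.0 - a) * g
        out.append(prev)
    return out

# 12 dB above threshold with ratio 4:1 -> 9 dB of gain reduction.
gr = compressor_gain_db(-8.0, threshold_db=-20.0, ratio=4.0)
```

The gray-box methods cited above keep this structure but learn the parameters (threshold, ratio, time constants) from data.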
2.5 Differentiable modeling
With the recent rise of machine learning, the research community started applying neural networks and DDSP to white-, gray-, and black-box modeling.
Due to their potential of being universal approximators (Hornik et al., 1989), artificial neural networks are often adopted for audio effects modeling to learn a mapping from dry-input to wet-output signals. State-of-the-art methods operate in the time domain, framing the task as sequence-to-sequence prediction. Time-domain modeling is preferred for real-time applications, as frequency-domain methods add complexity, latency, and limit resolution based on frame size. Previous literature includes approaches based on: multi-layer perceptrons (MLPs), recurrent neural networks (RNNs), convolutional neural networks (CNNs) and state-space models (SSMs).
MLPs (Bengio et al., 2017) are feedforward neural networks made up of an input layer, one or more hidden layers with nonlinear activation functions, and an output layer. Each layer is fully connected, enabling the network to learn complex, non-linear mappings from input to output. MLPs are popular due to their simplicity and ease of implementation. However, since they process each frame independently, they lack a mechanism to capture temporal dependencies, making them less suitable for modeling audio effects that rely on long-range context or time-varying dynamics.
RNNs process audio sequentially—sample-by-sample or frame-by-frame—using their internal memory to model temporal dependencies, which makes them well-suited for time-dependent audio effects. However, at typical audio sampling rates (e.g., 44.1 kHz), their effective receptive field is often too short to capture long-term dependencies.
CNNs are well-suited for audio effects modeling, as they learn complex time-domain transformations through local receptive fields. They act as banks of nonlinear filters—each combining a convolutional kernel with an activation function—to capture patterns in the signal. Stacking layers enables hierarchical feature learning, and using dilated convolutions greatly extends the receptive field without significantly increasing model size. This makes CNNs, especially dilated ones, effective for modeling effects with long-range temporal context, while remaining efficient for real-time processing.
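The receptive-field argument can be made concrete with a closed-form computation: a stack of 1-D convolutions with kernel sizes k_i and dilations d_i has a receptive field of 1 + Σ_i (k_i − 1)·d_i samples. A quick check for a hypothetical 10-layer network with kernel size 3 and dilations doubling at each layer:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in samples) of stacked dilated 1-D convolutions."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

layers = 10
rf = receptive_field([3] * layers, [2 ** i for i in range(layers)])
# With dilations 1, 2, ..., 512 this gives 1 + 2 * 1023 = 2047 samples,
# roughly 46 ms at 44.1 kHz, versus only 21 samples without dilation.
```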
SSMs (Gu et al., 2021a; Gu et al., 2021b; Gupta et al., 2022) are a powerful and flexible approach to audio effects modeling, combining the benefits of both convolutional and recurrent architectures. During training, SSMs are typically expressed in a convolutional form, applying learned kernels across the input sequence. This enables fast, parallelizable training. However, like CNNs, this form has fixed context size and is inefficient for online or autoregressive use, as it requires reprocessing the full input with each new data point. For inference, SSMs can be converted into a recurrent form, maintaining an internal state that updates with each new input. By bridging these two modes, SSMs offer a compelling trade-off—combining the efficient, parallel training of CNNs with the expressive, memory-rich inference of RNNs—making them particularly suitable for real-time, temporally complex audio effects modeling.
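This dual view can be illustrated with a scalar toy example (a single-state linear SSM, not the structured parameterization used by S4): the recurrence x[n] = a*x[n-1] + b*u[n], y[n] = c*x[n] produces exactly the same output as convolving the input with the kernel K[m] = c * a**m * b:

```python
def ssm_recurrent(u, a=0.9, b=0.5, c=1.0):
    """Recurrent form: x[n] = a*x[n-1] + b*u[n], y[n] = c*x[n]."""
    x, y = 0.0, []
    for sample in u:
        x = a * x + b * sample
        y.append(c * x)
    return y

def ssm_convolutional(u, a=0.9, b=0.5, c=1.0):
    """Convolutional form: y = u * K, with kernel K[m] = c * a**m * b."""
    K = [c * (a ** m) * b for m in range(len(u))]
    return [sum(K[m] * u[n - m] for m in range(n + 1)) for n in range(len(u))]

# Both forms compute the same output; SSMs exploit this duality by
# training in the parallel convolutional form and running inference recurrently.
u = [1.0, 0.0, 0.5, -1.0]
y_rec, y_conv = ssm_recurrent(u), ssm_convolutional(u)
```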
DDSP is a paradigm that integrates traditional signal processing operations into deep learning workflows by implementing them in a differentiable form. As defined by Hayes et al. (2024), DDSP refers to “the technique of implementing digital signal processing operations using automatic differentiation software.” This idea was first introduced by Engel et al. (2019), where the authors implemented traditional DSP modules—such as equalizers, FIR filters, nonlinearities, and oscillators—within machine learning frameworks, which support gradient-based optimization. By doing so, the internal parameters of these modules can be learned directly from data, effectively bypassing the need for manual design and tuning while still leveraging the structure and interpretability offered by decades of signal processing research.
In audio applications, this approach enables the creation of models that are both data-driven and physically grounded, making DDSP particularly powerful for tasks like synthesis, modeling, and processing. To implement a DSP block in a differentiable framework, the underlying algorithm must be expressed using operations compatible with automatic differentiation, typically within libraries such as PyTorch or TensorFlow. Fixed parameters are replaced with learnable variables, and non-differentiable or unstable components (e.g., discontinuities, hard clipping) are substituted with smooth approximations to maintain gradient flow. This fusion of classic DSP and modern deep learning enables efficient, interpretable, and high-quality end-to-end systems for real-time audio synthesis and processing.
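As a minimal illustration of such a substitution (plain Python for readability; an actual implementation would use PyTorch or TensorFlow ops), a hard clipper has zero gradient wherever |x| > 1, which stalls gradient-based learning, while a smooth surrogate such as tanh keeps gradients nonzero everywhere:

```python
import math

def hard_clip(x):
    """Hard clipper: gradient is exactly zero for |x| > 1."""
    return max(-1.0, min(1.0, x))

def soft_clip(x):
    """Smooth surrogate: differentiable everywhere."""
    return math.tanh(x)

def soft_clip_grad(x):
    """Analytic derivative: d/dx tanh(x) = 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2

# Deep into saturation the hard clipper provides no gradient signal,
# while the smooth surrogate still does (small but nonzero).
```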
2.6 Differentiable white-box modeling
In (Parker et al., 2019), the authors use a multi-layer perceptron (MLP) as a state-space model, called a state trajectory network, which predicts the output based on the input signal and an internal state. Since state-space models are based on circuit variables, when applying the method to a first-order and a second-order diode clipper the authors adopt input and internal voltages as training signals to predict output voltage. This approach is then extended in (Peussa et al., 2021) using a gated recurrent unit (GRU).
In (Esqueda et al., 2021), the authors introduce differentiable white-box virtual analog modeling, using backpropagation to optimize component values in analog circuits, applied to find optimal resistor and capacitor values that approximate the frequency response of an RC filter and tone-stack.
The idea of learning ordinary differential equations using differentiable methods is applied in Wilczek et al. (2022) to learn the governing equations of diode clippers.
In (Parker et al., 2022), the authors use recurrent neural networks with fast convolutional layers to model partial-differential equation governed systems, applying it to lossy dispersive string, 2D wave equation, and tension-modulated string.
Analogously to (Parker et al., 2019) for state-space models (Chowdhury and Clarke, 2022), introduces differentiable wave digital filters, using an MLP to learn the input-output relationship of wave variables in a diode clipper.
2.7 Differentiable black-box modeling
Black-box modeling is the most common approach for differentiable audio effects, and it has been applied to most linear, nonlinear, and modulation effects, although with varying success.
The first example of neural network adoption for a modeling task is (Mendoza, 2005), where the author uses an MLP to model overdrive. Following this, recurrent networks were proposed for guitar amplifier modeling, with echo state machines (Holzmann, 2009) and NARX networks (Covert and Livingston, 2013) being adopted due to their easier and more stable training with respect to early LSTMs. However, once the training issues were overcome, recurrent networks like LSTMs (Zhang et al., 2018; Schmitz and Embrechts, 2018b) became a common approach, with different nonlinear effects, architectures, and conditioning methods being further studied in (Wright et al., 2019; Wright et al., 2020; Wright and Välimäki, 2020; Chowdhury, 2020; Simionato and Fasciani, 2022; Simionato and Fasciani, 2024a; Yeh et al., 2024b).
In (Schmitz and Embrechts, 2019), a hybrid CNN and LSTM architecture is explored for guitar amp emulation, while (Ramírez and Reiss, 2019) is the first example of an Encoder/Decoder (ED) architecture applied to nonlinear effects, although limited by a low sampling rate (16 kHz) and a non-causal implementation. Improvements on this ED architecture were further explored in (Ramírez et al., 2019; Martínez Ramírez et al., 2020) and applied to dynamic range compression among other effects.
CNN architectures took longer to be explored as a modeling approach. In the first examples (Damskägg et al., 2019b; Damskägg et al., 2019a) the authors used a non-autoregressive WaveNet (Rethage et al., 2018), also known as a gated convolutional network (GCN), which is a special case of temporal convolutional network (TCN) (Bai et al., 2018) that uses gated activations. These works focused on guitar amplifiers, overdrive, and distortion effects, and achieved results equal or better than recurrent networks in a subjective test (Wright et al., 2020). TCNs have been proposed for compressor modeling in (Steinmetz and Reiss, 2022), while GCNs have been extended with temporal feature-wise linear modulation (TFiLM) (Birnbaum et al., 2019) in (Comunità et al., 2023) to model time-varying nonlinear effects like fuzz and compressor, and shown in both cases to outperform recurrent networks and standard GCNs.
ED-based architectures have also been further explored in the time (Simionato and Fasciani, 2023) and spectrogram (Hawley et al., 2019; Mitchell and Hawley, 2020) domains for compressor modeling. More recently, with the success of state-space models (Gu et al., 2021b; Gu et al., 2021a; Gu and Dao, 2023) in general time-series modeling tasks, their application to effects like compressors and overdrive has also been explored (Yin et al., 2024; Simionato and Fasciani, 2024a; Simionato and Fasciani, 2024b). It is also worth noting that LSTMs and GCNs have been adopted in patents on differentiable nonlinear audio effects modeling for commercial products (Borquez et al., 2022) (Neural DSP Technologies).
2.8 Differentiable gray-box modeling
With the development of DDSP (Hayes et al., 2024), it has become possible and also desirable to explore differentiable gray-box approaches to audio effects modeling. The main advantages being the reduced number of parameters and higher computational efficiency w.r.t. neural networks as well as the higher degree of interpretability.
Few studies explore block-based differentiable modeling, with most either extending W-H models for nonlinear effects (guitar amps, overdrive, distortion, fuzz) (Kuznetsov et al., 2020; Nercessian et al., 2021; Colonel et al., 2022a; Miklanek et al., 2023; Yeh et al., 2024a) or implementing differentiable versions (Lee et al., 2024; Yu et al., 2024; Colonel and Reiss, 2022; Wright and Valimaki, 2022) of previously suggested feed-forward digital compressor designs (Zölzer et al., 2002; Giannoulis et al., 2012).
The first W-H model-inspired approach (Matsunaga et al., 2018) tests overdrive emulation, with input and output convolutional layers used to learn FIR filters and LSTM layers used to learn a nonlinearity. A similar approach (Taylor, 2020) implements guitar amp emulation as a series of six learnable Wiener models (i.e., linear filters followed by nonlinearities). Here the filters are implemented as fixed band-passes, while the scaling factor for each band component is learned, and the nonlinear block is set to a soft-sign. Tested on several amplifiers, this approach allows hybrid models to be built by interpolating coefficients from different amps.
DDSP for gray-box modeling is first explored in (Kuznetsov et al., 2020), where differentiable linear time-invariant filters and a nonlinearity form a W-H model that emulates the Boss DS-1 distortion pedal. LTI filters are implemented either as FIRs with 128 taps or second-order IIRs (biquads), while an MLP is used as a memoryless nonlinearity.
For their model of the Boss MT-2 distortion pedal, Nercessian et al. (2021) adopt a cascade of 40 hyperconditioned filtering stages, each made of a biquad filter, a gain, and a memoryless nonlinearity.
Colonel et al. (2022a) also explore differentiable W-H models, with a focus on proposing several parametric nonlinearities (waveshapers), including a harmonic-sum formulation.
In (Miklanek et al., 2023), Miklanek et al. proposed a hybrid differentiable guitar amplifier model, combining black-box and gray-box approaches. The pre- and power amplifiers are modeled using recurrent and fully connected layers, while a differentiable parametric tone-stack model, with a known processing function, is placed between them.
A recent guitar amplifier model in (Yeh et al., 2024a) features a detailed, differentiable parametric processor for each stage of a typical amplifier’s analog circuit. This includes several W-H stages for the pre-amplifier, a tone-stack (Miklanek et al., 2023), a custom AB push-pull power amplifier model, and a custom output transformer model. Despite its advanced design, the best-performing gray-box arrangement significantly underperforms a single-layer recurrent network while having more trainable parameters (10.1k vs. 8k), making it comparable in size to a black-box model.
2.9 State of the art
As we have seen in the previous sections, there is a vast literature on nonlinear audio effects modeling, detailing a large number of approaches and paradigms, including both differentiable and non-differentiable implementations. Generally speaking, the vast majority of the effort on non-differentiable methods is focused on white-box models, where the goal is to achieve a very accurate emulation of a specific device; thanks to access to and understanding of the circuit schematic, deterministic methods are still favorable and the best performing, assuming model flexibility is not a requirement and development time is not a concern.
Non-differentiable gray-box approaches have also been widely studied, especially for nonlinear effects like amps, overdrive and distortion, thanks to the theoretical basis guaranteed by W-H models. These are also used in commercial products, although they have never been thoroughly evaluated in subjective tests; it is therefore difficult to know whether they can compete with white-box methods in terms of accuracy.
Non-differentiable black-box methods have also been studied, but the literature does not show that they can achieve sufficient emulation accuracy, and they remain difficult to optimize.
The opposite can be said for differentiable methods: thanks to the wide development of neural networks, black-box approaches have become the most studied and achieve state-of-the-art performance, while differentiable white-box methods are not yet mature and need to be explored further to understand their performance potential.
Similarly, differentiable gray-box methods have proved useful and worth exploring further, even if, at the time of writing, they cannot achieve state-of-the-art emulation accuracy and have not been thoroughly evaluated in subjective tests.
3 Methodology
In the following sections we outline the key aspects of our work to establish a state of the art for differentiable black- and gray-box modeling. For consistency, ease of comparison, and to encourage future developments, we rely on the NablAFx (Comunità et al., 2025) audio effects modeling framework. We describe the black- (Section 3.1) and gray-box (Section 3.2) architectures included in our experiments and introduce ToneTwist AFx (Section 3.3) - a novel dataset for audio effects research.
3.1 Differentiable black-box models
All architectures considered in our experiments are described in detail in (Comunità et al., 2025); specifically, we include: LSTM-based recurrent networks, temporal convolution networks (TCN) (Lea et al., 2016), gated convolution networks (GCN) Rethage et al. (2018) and structured state-space sequence models (S4) (Gu et al., 2021a) with diagonal matrices (Gupta et al., 2022). Single LSTM layer architectures have been widely adopted for nonlinear (overdrive, distortion, guitar amps) (Wright et al., 2019, 2020) and nonlinear time-varying (fuzz, compressor) (Steinmetz and Reiss, 2022; Comunità et al., 2023) effects. Shown to outperform recurrent networks on various tasks, TCNs have also been adopted to model nonlinear time-varying effects, such as compressors (Steinmetz and Reiss, 2022), while GCNs were proposed in (Damskägg et al., 2019a,b; Wright et al., 2020) for nonlinear (guitar amp, overdrive, distortion) and in (Comunità et al., 2023) for nonlinear time-varying effects (compressor, fuzz). More recently, S4 architectures were adopted for compressor modeling in (Yin et al., 2024; Simionato and Fasciani, 2024b). Since these architectures have each been applied to some but not all types of nonlinear effects, preventing clear conclusions on the state of the art, we include all of them in our experiments.
3.2 Differentiable gray-box models
As described in (Comunità et al., 2025), we define a gray-box model as a sequence of differentiable processors, each with an associated controller which generates the control parameters that dictate the processor’s exact behavior. We propose gray-box architectures for dynamic range compression (Section 3.2.1), distortion and fuzz (Section 3.2.2).
3.2.1 Model for compressor/limiter
While many approaches have been proposed to model dynamic range compression with differentiable black-box models (Ramírez et al., 2019; Hawley et al., 2019; Steinmetz and Reiss, 2022; Simionato and Fasciani, 2022; Comunità et al., 2023; Yeh et al., 2024b), there are fewer studies that adopt a gray-box paradigm (Yu et al., 2024; Colonel and Reiss, 2022; Wright and Valimaki, 2022), all of which are based on two previously suggested implementations of feed-forward digital compressors (Zölzer et al., 2002; Giannoulis et al., 2012).
Here we propose a simple architecture leveraging dynamic controllers (Comunità et al., 2025) to implicitly learn the compression curve as a function of the input signal and attack/release time constants. The model is shown in Figure 2.

Figure 2. Proposed gray-box architectures. (a) Gray-box model for compressor/limiter. (b) Gray-box model for amp/overdrive/distortion/fuzz.
In Figure 3 we show examples of the frequency/time response learned by the proposed model when trained on a compressor (Flamma AnalogComp) and a limiter (UA 6176 Vintage Channel Strip - 1176LN Limiter). In the first example, input and output EQs differ, while in the second, a nearly symmetrical response is learned. We notice how the Dynamic Gain successfully captures different behaviors, with the compressor having slower attack and faster release than the limiter; also the exponential time constants - typical of analog compressors - appear to be correctly captured.

Figure 3. Examples of response for gray-box models of (1) Flamma Compressor, (2) UA 6176–1176LN Limiter, (3) Custom Dynamic Fuzz, (4) Harley Benton Fuzzy Logic. (1a) Parametric EQ. (1b) Dynamic Gain. (1c) Parametric EQ. (2a) Parametric EQ. (2b) Dynamic Gain. (2c) Parametric EQ. (3a) Dynamic Offset. (4a) Dynamic Offset.
3.2.2 Model for amplifier, overdrive, distortion and fuzz
Similar to compression, there is also a wealth of literature on differentiable black-box models for: amplifiers (Covert and Livingston, 2013; Zhang et al., 2018; Schmitz and Embrechts, 2018c; Schmitz and Embrechts, 2018b; Schmitz and Embrechts, 2019; Damskägg et al., 2019a; Wright et al., 2019; Wright et al., 2020; Wright et al., 2023), overdrive (Mendoza, 2005; Ramírez and Reiss, 2019; Damskägg et al., 2019b; Wright et al., 2020; Chowdhury, 2020; Fasciani et al., 2024; Yeh et al., 2024b; Simionato and Fasciani, 2024a), distortion (Ramírez and Reiss, 2019; Damskägg et al., 2019b; Wright et al., 2019; Wright et al., 2020; Yoshimoto et al., 2020; Yoshimoto et al., 2021) and fuzz (Comunità et al., 2023). Recently, there has been exploration into applying a gray-box paradigm to this task (Kuznetsov et al., 2020; Nercessian et al., 2021; Colonel et al., 2022a; Miklanek et al., 2023; Yeh et al., 2024a), as it offers desirable characteristics such as fewer trainable parameters, greater interpretability, and reduced training data requirements.
In this work, we extend previous gray-box approaches based on the W-H model. Figure 2 illustrates our comprehensive architecture for emulating amplifiers, overdrive, distortion, and - when using dynamic controllers - fuzz effects. We use two Parametric EQs (Comunità et al., 2025) for pre- and post-emphasis filtering around a Nonlinearity, which can be memoryless or not. Gain stages control distortion and output amplitude, while an Offset stage before the nonlinearity models asymmetrical clipping. A fixed offset suffices for amplifiers, overdrive, and distortion, while a dynamic controller is needed for fuzz effects to capture hysteresis (i.e., dynamic bias shift) and the characteristic timbre during note attack and release. Conditional controllers can turn the model into a parametric one. Figure 3 shows examples of learned dynamic offsets from fuzz effects models, with the first having a longer attack and much longer release time constants than the second.
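A minimal sketch of this Wiener-Hammerstein-style chain follows, with the two Parametric EQs reduced to identities for brevity: an envelope follower acts as the Dynamic Offset controller, so the bias fed into a tanh clipper (standing in for the learned nonlinearity) shifts with the input level, producing the asymmetric clipping described above. All names and constants are illustrative, not values from the trained models.

```python
import numpy as np

def envelope(x, attack=0.995, release=0.9995):
    """Attack/release follower acting as a toy Dynamic Offset controller."""
    env, e = np.zeros_like(x), 0.0
    for n, v in enumerate(np.abs(x)):
        c = attack if v > e else release
        e = c * e + (1.0 - c) * v
        env[n] = e
    return env

def wh_fuzz(x, pre_gain=20.0, offset_depth=0.3, post_gain=0.5):
    """Gain -> dynamic offset -> tanh clipper -> DC removal (cf. Figure 2b);
    pre-/post-emphasis EQs omitted for brevity."""
    bias = offset_depth * envelope(x)        # signal-dependent bias shift
    y = np.tanh(pre_gain * x + bias)         # asymmetric clipping
    return post_gain * (y - np.mean(y))      # crude DC removal

t = np.linspace(0.0, 0.1, 4800)
x = 0.5 * np.sin(2 * np.pi * 110.0 * t)
y = wh_fuzz(x)
```

Replacing `envelope` with a constant reproduces the fixed offset that suffices for amplifiers, overdrive, and distortion.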
3.3 ToneTwist AFx dataset
In order to establish the state of the art in audio effects modeling across various effect types, we required access to dry-input/wet-output data from multiple devices; and, since modeling mainly focuses on physical units, data needed to be from analog devices. For a realistic scenario, data needed to include a variety of sources, content and playing techniques, to test generalization capabilities beyond the ones used for training. Furthermore, to run extensive experiments efficiently, we needed consistently organized data to avoid custom preprocessing and dataset code for each device.
Although there are publicly available datasets for audio effects modeling (Stein et al., 2010; Comunità et al., 2021; Pedroza et al., 2022), these were deemed unsuitable for our work for several reasons. The IDMT-SMT-Audio-Effects6 (Stein et al., 2010) dataset focuses on detection and classification tasks; it is limited to single notes and chords from two electric guitars and two electric basses, and it only includes digital effects covering 10 linear, nonlinear, and modulation effects. Similarly, GUITAR-FX-DIST (Comunità et al., 2021), which focuses on classification and parameter estimation, includes single notes and chords from two guitars and only features digital emulations of 14 distortion effects. While EGFxSet7 (Pedroza et al., 2022) includes recordings from 12 physical units, equally split between analog and digital, it only features single notes from one electric guitar; and, despite being intended for audio effects modeling, it is unsuitable due to high background noise and other artifacts.
For the reasons above, we introduce the ToneTwist AFx dataset8 (Table 1), the largest collection (40 devices at the time of writing) of dry/wet signal pairs from nonlinear audio effects. Besides being the first thoroughly documented audio effects dataset, it is also open to contributions from the wider community. While data is hosted on Zenodo9 for long-term availability, our repository organizes download links, provides retrieval scripts, basic training code, and contribution guidelines, facilitating new submissions and fostering discussion.

Table 1. ToneTwist AFx Dataset (at the time of writing). For accessibility, we also include data from previous works (see Source) which we verify, pre-process (when necessary), rename, and reorganize for consistency.
ToneTwist AFx is organized into four categories: analog, analog parametric, digital, and digital parametric, with parametric referring to entries sampled with control settings granularity sufficient for parametric modeling. It features dry-input signals from seven sources, including a variety of guitars, basses, and test signals like chirps and white noise. The test set features different content and instruments from the training/validation set, ensuring a realistic evaluation of the generalization capabilities of trained models.
Recording was carried out with a Focusrite Scarlett Solo audio interface, featuring a low-noise pre-amplifier (
4 Experiments
In this section we give an overview of all the architectures, model configurations, and data used in our experiments. To help establish the state of the art in audio effects modeling and identify future research directions, we included a wide range of experiments and configurations.
4.1 Overview of model configurations
Table 2 lists all models, hyperparameter values, and computational complexity10 used in our experiments on non-parametric modeling. We balance tractability against coverage by limiting the number of models per architecture while still spanning a broad range of sizes and complexities.
As described in Section 3.1, we include the four most common neural network types for black-box modeling: LSTM, TCN, GCN and S4; as well as the gray-box models proposed in Section 3.2: GB-COMP for compressor/limiter, GB-DIST for amplifier/overdrive/distortion and GB-FUZZ for fuzz. For LSTM architectures, the only parameter to choose is the number of channels, with 32 and 96 channels being sensible choices for small and large models (LSTM-32/LSTM-96), as adopted in previous works (Wright et al., 2020; Steinmetz and Reiss, 2022; Comunità et al., 2023).
For TCN and GCN models, we consider variants with short, medium, and long receptive fields (45, 250 and 2,500 ms at 48 kHz), with small (5/6 blocks) or large (10/11 blocks) configurations (S/L in model names). We keep the number of channels constant and equal to 16. We also include variants without and with TFiLM (TF in the models' names) to evaluate whether time-varying conditioning - shown to be able to capture long-range dependencies (Comunità et al., 2023) typical of compressors and fuzz effects - is beneficial across architectures. For TFiLM layers we always adopt a block size - i.e., downsampling/maxpooling factor - of 128, which was shown to be optimal in (Comunità et al., 2023). As shown in the table, for each receptive field, small and large models have comparable parameters, FLOP/s, and MAC/s, the idea being that the main difference lies in the number of processing stages and not in the networks' expressivity itself. With TFiLM adding a conditioning layer per block, matching parameter counts across block variations isn't feasible, but due to the small channel count (16) and low TFiLM sample rate, model pairs remain computationally comparable.
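A toy version of the TFiLM mechanism described above, assuming a (channels x time) feature tensor: each channel is max-pooled over blocks of 128 samples, a fixed leaky integrator stands in for the learned LSTM run over the pooled sequence, and the resulting per-block scale is broadcast back to the full rate. The recurrence coefficients and the omission of a bias term are simplifications for illustration.

```python
import numpy as np

def tfilm(features, block_size=128):
    """TFiLM sketch: max-pool per block, run a recurrence over the pooled
    sequence (a fixed leaky integrator standing in for the learned LSTM),
    then broadcast the per-block scale back to the full sample rate."""
    n_channels, n_steps = features.shape
    n_blocks = n_steps // block_size
    trimmed = features[:, :n_blocks * block_size]
    pooled = trimmed.reshape(n_channels, n_blocks, block_size).max(axis=2)
    scale = np.ones((n_channels, n_blocks))
    state = np.zeros(n_channels)
    for b in range(n_blocks):
        state = 0.9 * state + 0.1 * pooled[:, b]  # stand-in for the LSTM
        scale[:, b] = 1.0 + state
    # feature-wise modulation, constant within each block (bias omitted)
    return trimmed * np.repeat(scale, block_size, axis=1)

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 1024))   # 16 channels -> 8 blocks of 128
out = tfilm(feats)
```

The down-sampled recurrence is why the added computational cost stays low: the LSTM runs at 1/128 of the audio rate.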
For S4 architectures we define small and large models (S/L in the name) based on the number of blocks and state dimension, with values chosen based on the experiments reported in the original publication (Gu et al., 2021a). For S4 models as well we introduce TFiLM conditioned variants and keep the number of channels constant and equal to 16.
While for the GB-COMP architecture we only investigate one configuration, for GB-DIST and GB-FUZZ models we test two, based on the type of nonlinearity adopted: Static MLP Nonlinearity or Static Rational Nonlinearity (MLP/RNL in the name). The MLP itself accounts for most of the computational complexity of GB-DIST and GB-FUZZ models.
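The two nonlinearity types can be sketched as follows: the rational variant is a sample-wise ratio of polynomials, the MLP variant a small sample-wise tanh network. The coefficients and weights below are illustrative placeholders, not learned values from the paper.

```python
import numpy as np

def rational_nonlinearity(x, num=(0.0, 1.0, 0.0, -0.1), den=(1.0, 0.0, 0.3)):
    """Static Rational Nonlinearity: ratio of polynomials P(x)/Q(x).
    Coefficients (lowest order first) are illustrative, not learned."""
    return np.polyval(num[::-1], x) / np.polyval(den[::-1], x)

def mlp_nonlinearity(x, w1, b1, w2, b2):
    """Static MLP Nonlinearity: small sample-wise two-layer tanh network."""
    h = np.tanh(np.outer(x, w1) + b1)
    return h @ w2 + b2

x = np.linspace(-1.0, 1.0, 101)
y_rnl = rational_nonlinearity(x)
rng = np.random.default_rng(0)
y_mlp = mlp_nonlinearity(x, rng.standard_normal(8), rng.standard_normal(8),
                         rng.standard_normal(8), 0.0)
```

The rational form needs only a handful of coefficients, which is consistent with the MLP dominating the computational cost of GB-DIST and GB-FUZZ.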
For experiments on parametric modeling (Table 3) we take a subset of the configurations in Table 2 and we pair them with the various conditioning mechanisms described in (Comunità et al., 2025). For LSTM models, we use concatenation (C) as the baseline and also test time-varying concatenation (TVC), which adds minimal computational cost. For convolutional backbones, we select only TCNs and models with a short (45 ms) receptive field, as these rely more on conditioning methods to capture long-range dependencies, highlighting performance differences. We consider FiLM conditioning (F in the name) to be the baseline method and experiment with TFiLM, TTFiLM (TTF in the name) and TVFiLM (TVF in the name). We do the same for S4 architectures. Table 3 shows that both TTFiLM and TVFiLM introduce time-dependent conditioning with parameter and computational efficiency similar to the baseline FiLM method, having minimal impact on parameters, FLOP/s, and MAC/s. For gray-box models, we add parametric control using differentiable controllers from (Comunità et al., 2025) and replace static/dynamic controllers with conditional ones (C in the model name). Although they increase the number of parameters, the impact of conditional controllers on the computational complexity is minimal due to the small size of the MLPs and LSTMs in their implementation.
Besides establishing a good understanding of the state of the art, our experiments aim to identify architectures that can model a wide range of effects and devices consistently, without requiring different configurations, with the broader goal of defining a universal audio effects modeling architecture. While we recognize that model pruning (Südholt et al., 2022) or hyperparameter optimization could identify the smallest model size for each effect, this would require extensive time and resources. Alternatively, model distillation (Hinton et al., 2015) could reduce size and increase efficiency after identifying a suitable architecture, but this also requires additional experimentation and training.
4.2 Overview of audio effects configurations
Table 4 lists all the audio effects and settings in our study. For non-parametric modeling, we select four devices per effect category (compressor/limiter, overdrive, distortion, fuzz), totaling sixteen devices, twelve of which are from our ToneTwist AFX dataset (see Section 3.3).

Table 4. Audio effects configurations used in the experiments. For parametric models, not all combinations are part of the data.
Settings for each device are chosen to ensure a variety of behaviors within each category, while maximizing the challenge for the modeling task. For compression, we select fast attack, slow release, and high compression ratio settings to capture fast transients, long-range dependencies, and nonlinear behaviors. For overdrive, we choose high gain settings with varying tone/equalization for a complex spectrum. For distortion, we select both medium and high gain settings, while for fuzz effects, which are inherently high gain, we opt for medium gain settings to ensure the dynamic behavior is noticeable and not concealed by excessive distortion. It is important to note that the setting values across effect types and devices are only loosely correlated with the behaviors we are modeling, as they do not correspond to specific units or internal parameters and serve only as a general description of the effect's behavior and timbral traits.
The table also includes effects and settings for parametric modeling experiments: a vacuum tube guitar amplifier and a digital germanium transistor fuzz emulation. In addition to parametric modeling and conditioning methods, data from these two devices help evaluate performance on unseen control configurations and sources. For the Marshall JVM410H, training, validation, and testing use the same sources but different control settings. For the Multidrive Pedal Pro F-Fuzz, control settings are the same across sets, but sources and content differ, featuring unseen guitar and bass performances.
4.3 Training
Many previous audio effects modeling studies have typically assumed that training different architectures and model configurations under a consistent training scheme—including loss functions, loss weights, learning rate, and scheduler—provides a reasonable basis for performance comparison and drawing conclusions about state-of-the-art approaches. This has also been the case for works that compared vastly different architectures, such as black-box and gray-box ones (Yeh et al., 2024a; Miklanek et al., 2023; Wright and Valimaki, 2022). However, in preliminary experiments we found that training an architecture under different schemes can yield vastly different loss values, even more so when different architectures are involved. This highlights the need to choose appropriate training-related hyperparameters for each model to enable meaningful comparisons between architectures and configurations.
This section discusses preliminary experiments to identify optimal hyperparameters and training configurations for LSTM, TCN, GCN, S4, and GB architectures, and the resulting training schemes are shown in Table 5. Based on prior work on effects modeling (Steinmetz and Reiss, 2022; Comunità et al., 2023), sound synthesis (Engel et al., 2019), and audio processing (Yamamoto et al., 2020), we assumed an
We chose the learning rate (LR) for each architecture by training on one audio effect (Custom Dynamic Fuzz) and comparing results for
A major challenge during this phase was the instability in training LSTM models, resulting in frequent gradient explosion failures despite adjustments to: LR, gradient clipping, loss weights, training sample length, and the use of truncated backpropagation through time (TBPTT) (Comunità et al., 2025). A suboptimal solution was to select the training scheme reported in Table 5. While this does not prevent NaN loss failures, it helps to reach a suitable local optimum by training each model with a broader range of hyperparameters. For TBPTT, gradients are updated every 4,800 samples (100 ms at 48 kHz), and the maximum training steps match the epochs used for models without TBPTT. All other architectures proved to be much more stable - with no NaN loss cases - while still benefiting from training on a variety of LRs to achieve optimal results. For all architectures we halve the LR when the total validation loss does not improve for 20 epochs. To prevent overfitting, we also use early stopping with a patience of 50 epochs.
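The TBPTT scheme above (parameter updates every 4,800 samples, with the recurrent state carried across boundaries) can be illustrated on a toy one-parameter recurrent model. Here a numerical gradient replaces backpropagation for self-containment; all names and the leaky-integrator model are hypothetical, not the paper's LSTM.

```python
import numpy as np

SEGMENT = 4800  # 100 ms at 48 kHz, matching the update interval above

def segment_loss(xs, ys, param, state):
    """MSE of a one-parameter leaky integrator over one segment; returns
    the loss and the final state so it can be carried into the next segment."""
    err = 0.0
    for n in range(len(xs)):
        state = param * state + (1.0 - param) * xs[n]
        err += (state - ys[n]) ** 2
    return err / len(xs), state

def tbptt_train_step(x, y, param, lr=1e-2):
    """TBPTT sketch: gradients (numerical here, backprop in practice) never
    span a segment boundary; only the recurrent state is carried across."""
    state, eps = 0.0, 1e-5
    for start in range(0, len(x), SEGMENT):
        xs, ys = x[start:start + SEGMENT], y[start:start + SEGMENT]
        l0, _ = segment_loss(xs, ys, param, state)
        l1, _ = segment_loss(xs, ys, param + eps, state)
        param -= lr * (l1 - l0) / eps                  # update at the boundary
        _, state = segment_loss(xs, ys, param, state)  # carry state forward
    return param

# target data generated by the same recurrence with param = 0.5
rng = np.random.default_rng(0)
x = rng.standard_normal(9600)
state, y = 0.0, np.empty_like(x)
for n in range(len(x)):
    state = 0.5 * state + 0.5 * x[n]
    y[n] = state

learned = tbptt_train_step(x, y, param=0.3)
```

Truncation bounds memory use and limits gradient-explosion paths, at the cost of never propagating error signals across segment boundaries.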
Examples that highlight differences in performance as a function of LR are shown in Figure 4, with training sessions that might not reach a good local optimum (Figure 4a) or exhibit scaled-loss differences of up to 30%–40% (Figures 4b,c).

Figure 4. Validation scaled-loss for different learning rates reveals training discrepancies and unreliability of previous literature comparing architectures and model configurations using a single training scheme. (a) LSTM trained on Harley Benton Plexicon. (b) S4 trained on Harley Benton Rodent (c) GB-FUZZ trained on Harley Benton Fuzzy Logic.
5 Results
5.1 Results for non-parametric models
Figure 5 shows the total loss for architectures trained on sixteen effects (see Section 4), with box plots for overall results (Figure 5a) and each effect category (Figures 5b-e).

Figure 5. Total loss for different architectures: overall and for each effect type. (a) Overall. (b) Compressor/Limiter (c) Overdrive (d) Distortion (e) Fuzz.
Overall, S4 models with TFiLM conditioning perform the best. Also, TFiLM improves median performance across architectures and reduces the variance, demonstrating how time-varying conditioning enhances expressivity, regardless of the backbone. Although GCN achieves lower loss than TCN, TFiLM conditioning results in similar performance for both, making the added complexity of GCN-TF hard to justify and emphasizing how efficient and effective time-varying conditioning could be a focus of further research. Gray-box models show higher median loss than black-box ones but can outperform them in some cases, and also exhibit smaller variance than LSTMs. Having the greatest variance across architectures, LSTMs are shown to be the least reliable, which makes them unsuitable for modeling a wide range of effect types and devices.
Breaking down performance by effect type, S4 models perform best in each category, with TFiLM conditioning generally improving or maintaining performance without hindering it. It therefore seems that S4 models with a form of time-dependent conditioning are the best candidate to develop high performing architectures across a wide range of effect types and implementations.
In convolutional architectures, TFiLM is shown to be helpful for every effect category, although to a varying extent, particularly benefiting distortion and fuzz effects with their complex timbre and time-dependent behaviors. This is reflected in the average loss and standard deviation in Table 6, where distortion and fuzz are the most challenging effects to model. Aside from compressors, LSTM is confirmed to be the least reliable architecture in each category, with far greater variance than any other architecture and performing on par with or worse than gray-box models.

Table 6. Mean
Looking in further detail at the results in Table 6, where we gather mean
5.2 Results for parametric models
Table 7 shows the losses for parametric models trained on the Marshall JVM410H guitar amplifier and the Multidrive Pedal Pro F-Fuzz digital emulation of the Fuzz Face guitar pedal. We report both validation and test loss to assess model performance when tested on unseen control configurations (i.e., guitar amp) or unseen sources and content (i.e., fuzz pedal). S4 models are the best performing on the guitar amp, with LSTM models and TCN with time-varying conditioning performing only marginally worse. For the fuzz effect pedal emulation, LSTMs are the best performing, with S4 models with time-varying conditioning (S4-TF-L-16 and S4-TVF-L-16) being only slightly worse.

Table 7. Mean
While TVCond (TVC) conditioning does not improve LSTM performance on the Marshall amp, it helps a smaller LSTM model achieve among the best performance. For TCN models, TFiLM (TF) conditioning is the most effective for both guitar amp and fuzz pedal cases, regardless of the number of layers in the model. We also notice that the baseline FiLM (F) conditioning is the worst mechanism in all but one case when applied to either TCN or S4 models. Looking at S4 models, the proposed TTFiLM (TTF) and TVFiLM (TVF) conditioning allow performance on par with or better than TFiLM (TF) regardless of model size. Also, TVF allows small models to achieve performance comparable to bigger ones. For TCN models, it is less clear whether there is an advantage in using TTF or TVF over TF besides computational complexity.
To recap, TFiLM seems to be the best across architectures, with TVFiLM better than TTFiLM when reducing computational complexity. A study on a wider variety of effects would make it possible to better assess whether TVF and TTF can compete with TF conditioning.
6 Evaluation
6.1 Objective evaluation
To extend the objective evaluation beyond loss values we use most of the metrics available from the NablAFx framework (Comunità et al., 2025). These include both signal-based metrics (i.e., MSE, ESR, MAPE) as well as latent representation-based ones (i.e., FAD). FAD is computed using three widely adopted representations (i.e., VGGish, PANN and CLAP) as well as a recently proposed audio production style representation (AFx-Rep) (Steinmetz et al., 2024). We compute these metrics for several reasons besides broadening the analysis: to investigate which ones might be most suited for modeling tasks, to investigate correlation between objective metrics and subjective ratings (see Section 6.4), and to encourage further research on audio effects related representations and evaluation methods.
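Two of the signal-based metrics mentioned above admit compact definitions. The sketch below shows one common formulation of ESR and MAPE; the exact definitions in NablAFx may differ (e.g., in pre-emphasis filtering or epsilon handling), so treat these as illustrative.

```python
import numpy as np

def esr(target, pred):
    """Error-to-signal ratio: error energy normalised by target energy."""
    return np.sum((target - pred) ** 2) / (np.sum(target ** 2) + 1e-12)

def mape(target, pred, eps=1e-8):
    """Mean absolute percentage error, with a small floor on the denominator
    to avoid division by zero near zero crossings."""
    return np.mean(np.abs(target - pred) / (np.abs(target) + eps))

t = np.sin(np.linspace(0.1, 6.0, 1000))  # toy target signal
```

FAD, by contrast, compares distributions of embeddings (VGGish, PANN, CLAP, AFx-Rep) rather than waveforms, so it cannot be expressed sample-wise like the metrics above.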
When using objective metrics - which we plot in Figures 6, 7 on a log scale to highlight differences - the picture is similar to the results discussed in previous sections. S4 and S4-TF architectures show the best performance across effect types, regardless of the metric used. Once again, temporal conditioning (TFiLM) helps improve performance - noticeable from the reduction in median as well as standard deviation values - regardless of the architecture's backbone (TCN, GCN or S4). LSTM architectures are shown again to have a high performance variance, making them unreliable when modeling a wide range of devices. Gray-box models, again, do not achieve results on par with black-box ones, and therefore require further exploration.

Figure 7. FAD for different architectures for all effect types. (a) VGGish. (b) PANN (c) CLAP (d) AFx-Rep.
6.2 Efficiency
While a full performance evaluation is beyond the scope of this work, we investigated the run-time of the proposed models and conditioning methods in a frame-based implementation to mimic a standard digital audio effect implementation.
The real-time factor (RTF) is defined as the ratio between the duration of an audio frame and the wall-clock time required to process it:

RTF = (N / f_s) / t_proc

where N is the frame size in samples, f_s the sample rate, and t_proc the processing time per frame, such that RTF >= 1 indicates faster-than-real-time operation.
Results for non-parametric models are shown in Figures 8A,B and for parametric ones in Figure 8C. We show only a limited number of configurations for clarity. In a block-based formulation, TCN models require a buffer of past samples such that we pass an input of
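A minimal way to estimate the RTF of a frame-based processor is sketched below, assuming RTF is frame duration over mean wall-clock processing time (so RTF >= 1 means real-time capable). The `measure_rtf` helper and the toy tanh "effect" are placeholders; real measurements would call the trained models and average over many more frames.

```python
import time
import numpy as np

SR = 48_000  # sample rate used throughout the experiments

def measure_rtf(process, frame_size, n_frames=50):
    """RTF = frame duration / mean wall-clock processing time per frame."""
    frame = np.zeros(frame_size, dtype=np.float32)
    process(frame)                      # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(n_frames):
        process(frame)
    elapsed = (time.perf_counter() - start) / n_frames
    return (frame_size / SR) / elapsed

# toy stand-in effect; a real benchmark would call the model's forward pass
rtf = measure_rtf(lambda f: np.tanh(f), frame_size=512)
```

Sweeping `frame_size` reproduces curves like those in Figure 8, where architectures differ in how RTF scales with frame size.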

Figure 8. Real-time factor vs Frame size for some of the models included in the experiments. (a) Architectures. (b) Receptive field. (c) Conditioning method.
LSTMs achieve real-time operation at every frame size, but the RTF remains constant above a certain size due to the inability to parallelize computations along the temporal dimension. The RTF for TCN, GCN (not shown) and GB models is instead proportional to the frame size. For convolutional backbones, whether the models achieve real-time operation at a sufficiently small frame size is strongly related to the receptive field and the number of blocks. Interestingly, S4 models show a behavior in between recurrent and convolutional networks, where the RTF is proportional to frame size for small sizes and flattens out for large sizes. Although best performing, S4 models require an appropriate choice of the number of blocks and state dimension to allow real-time operation; in fact, our larger model does not achieve
Figure 8B compares TCN models with and without TFiLM conditioning, with temporal feature-wise modulation consistently reducing the RTF at every model and frame size. Also, in Figure 8C we compare different conditioning methods (FiLM, TFiLM, TTFiLM, TVFiLM) for parametric TCN models, and include a non-parametric model (TCN-45-S-16) for reference. We notice how, even if computationally as efficient as models with FiLM and TVFiLM, TTFiLM conditioning is the worst in terms of RTF, while TVFiLM achieves real-time performance between FiLM and TFiLM, introducing time-varying conditioning at a low computational cost.
We emphasize, however, that RTF alone does not fully capture the suitability of a model for real-time audio deployment. Factors such as latency, system-level I/O delays, and the implementation environment (e.g., Python vs C++) significantly affect overall performance. Our results should therefore be interpreted as a lower-bound indication of computational cost under a reference implementation, rather than a definitive real-time capability assessment.
6.3 Summary
In Figure 9, we present a summary of our experimental results by plotting the mean total loss (L1 + MR-STFT) across all devices for each model configuration. Different architectures are distinguished by color, and marker sizes are scaled according to model complexity, calculated as the sum of parameters, FLOPs, and MACs. Additionally, the real-time factor (RTF), computed for a frame size of 512 samples, is displayed above each marker, with values equal to or greater than one shown in bold.

Figure 9. Mean total loss (L1 + MR-STFT) across all devices for each model included in our experiments. Color indicates different architectures. Marker size is proportional to model complexity (# Params + FLOP/s + MAC/s). Numbers above each marker show the real-time factor for a frame size of 512 samples. Bold indicates models for which the real-time factor is equal to or greater than 1.
This figure highlights several key findings from our experiments. Large S4 models achieve the highest accuracy but exhibit lower efficiency compared to other architectures, despite their design intent to leverage CNNs’ training efficiency and RNNs’ inference efficiency (see Section 2.5). Notably, S4 models show significantly lower real-time factors (RTF) than LSTM models, raising questions about their practical inference efficiency. TFiLM conditioning consistently improves accuracy across all backbones (TCN, GCN, S4), with only a marginal increase in computational complexity, although it introduces a measurable reduction in efficiency. Models with long receptive fields (TCN and GCN) suffer from high computational costs, while applying TFiLM to architectures with short receptive fields appears to be a more efficient strategy for modeling long-range dependencies characteristic of dynamic range compression and fuzz effects. Additionally, the increased complexity of GCNs over TCNs does not yield proportional accuracy gains, with TFiLM conditioning providing a more effective improvement. Finally, gray-box models—except GB-COMP, which performs comparably to black-box baselines—do not yet offer clear advantages in accuracy or efficiency, suggesting that more advanced designs may be needed to fully realize the potential of theoretically grounded architectures like those based on Wiener-Hammerstein models.
6.4 Subjective evaluation
To evaluate the perceptual differences among architectures, we conducted a MUSHRA listening test, and the participants provided written informed consent to participate in this study.
Each trial presented participants with processed audio from all architectures under evaluation (LSTM, TCN, TCN-TF, GCN, GCN-TF, S4, and S4-TF) alongside a reference signal (wet-output) and an anchor (dry-input). We selected 12 effects (3 per type) out of the 16 we trained on. For each effect, we curated three distinct sound examples designed to highlight model performance. These included a bass riff with a high input level and clear attacks to emphasize timbral and temporal characteristics, a guitar sample at similarly high input levels with pronounced attacks to further assess these aspects, and a second guitar sample with a very low input level and steady picking. This last example was chosen to evaluate the models’ ability to faithfully capture devices across extreme input dynamics.
For each architecture we selected the best-performing model in terms of total test loss. The listening test consisted of a total of 68 possible questions, with each participant randomly assigned seven to maintain a manageable test duration. A total of 27 participants were recruited, and a rigorous screening process was implemented. Participants were assessed on prior experience with listening tests, musical training or production experience, and familiarity with the specific effects being evaluated (compressors, overdrive/distortion, and fuzz). Further filtering was applied to ensure test reliability, participants who did not rate the reference sample at 80 or higher in more than one out of their seven trials were excluded from the final analysis. After applying these screening criteria, we obtained 350 valid ratings in total, averaging 35 per architecture and approximately 8.75 ratings per effect type. Our strict participant selection strategy was intentional, prioritizing fewer but highly skilled listeners. This approach significantly reduced rating variance, leading to a more reliable subjective evaluation of model performance.
We show the overall results in Figure 10 together with ratings broken down by effect category. In general, subjective ratings confirm the results obtained during the training process, with S4 architectures performing best, followed by convolutional backbones, gray-box models and LSTMs. TFiLM conditioning increases median accuracy and lowers variance, improving modeling reliability across effect types and devices. In contrast with the objective metrics, gray-box models are on average rated higher than LSTMs, albeit with high variance, suggesting they have the potential to match black-box approaches while requiring careful architecture design for different effect types. Once again, LSTMs are confirmed to be unreliable as modeling architectures across a wide range of devices.

Figure 10. Subjective ratings for different architectures: overall and for each effect type. (a) Overall. (b) Compressor/Limiter. (c) Overdrive. (d) Distortion. (e) Fuzz.
If we break down ratings by effect type we notice that in the majority of cases TFiLM helps to improve performance, in particular for convolutional backbones trained on overdrive, distortion and fuzz. Proposed gray-box models perform, on average, well on compression and fairly on overdrive, while less so on distortion and fuzz. Also, GCNs are not consistently better than TCNs, which confirms that the added complexity of GCNs is in general not necessary for good performance. S4 models remain the most consistent, with median ratings at or above 80 in all but one case.
6.5 Correlation
We further extend our analysis by calculating the Spearman correlation between objective metrics and subjective ratings (Table 8). Out of the signal-based metrics, MR-STFT shows the highest correlation with human ratings, supporting the assumption that it is a suitable term to include in the loss function. MAPE and L1 are the second and third most correlated metrics, again supporting the idea that including a time-domain loss term is meaningful for modeling tasks. Considering effect types, MR-STFT is the most relevant for high-distortion effects, while MAPE and L1 are the most correlated with subjective ratings for overdrive and compression respectively, although in the latter case, absolute correlation values are very low or not significant.

Table 8. Spearman correlation
Observing latent representation-based metrics instead, VGGish and AFx-Rep vectors are overall the most correlated with subjective ratings. Looking at specific effect types, none of the representations seem to be suited for the evaluation of compression and overdrive models, while CLAP and VGGish are, respectively, highly correlated with distortion and fuzz effects.
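Spearman correlation, used throughout this section, is simply the Pearson correlation of rank-transformed data. A dependency-free sketch (with no tie handling, unlike `scipy.stats.spearmanr`, and with hypothetical example values) is:

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    No tie handling - for real analyses use scipy.stats.spearmanr."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

ratings = np.array([62.0, 85.0, 40.0, 91.0])   # hypothetical MUSHRA scores
metric = np.array([0.20, 0.05, 0.35, 0.02])    # hypothetical loss values
rho = spearman(ratings, -metric)               # lower loss -> higher rating
```

Being rank-based, this statistic captures any monotonic relation between a metric and the ratings, which is the appropriate notion of agreement here.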
7 Conclusions and future work
In this work, we conducted extensive experiments to establish the state of the art in differentiable black-box and gray-box approaches for nonlinear audio effects modeling. Our study explored task-specific neural network architectures, proposed differentiable DSP-based gray-box models, and presented ToneTwist AFx, a large dataset of dry-input/wet-output pairs for audio effects research.
We described the challenges of training and comparing different approaches, and demonstrated how careful consideration of the training scheme is essential to ensure fair comparisons and draw reliable conclusions.
Thanks to our extensive objective and subjective evaluations, we concluded that state-space models are the most suited to model a wide range of devices, with TCN and GCN architectures also achieving very good results. We found LSTM-based recurrent networks to be unreliable across devices and DDSP-based models not yet on par with neural networks. Time-varying conditioning like TFiLM was proven beneficial across convolutional and state-space backbones, improving the median and variance of both objective metrics and subjective ratings. Our analysis of the correlation between subjective ratings and signal-based metrics showed that a loss function combining time-domain (L1/MAPE) and spectro-temporal (MR-STFT) terms is generally effective. However, there is room for improvement in signal-based and latent representation-based metrics for model evaluation.
Given these results, a number of directions remain open for future work.
Data availability statement
The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found below: https://github.com/mcomunita/tonetwist-afx-dataset.
Ethics statement
The participants provided written informed consent to participate in this study.
Author contributions
MC: Conceptualization, Data curation, Formal Analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review and editing. CS: Software, Supervision, Writing – review and editing. JR: Supervision, Writing – review and editing.
Funding
The author(s) declare that financial support was received for the research and/or publication of this article. Funded by UKRI and EPSRC as part of the “UKRI CDT in Artificial Intelligence and Music”, under grant EP/S022694/1.
Conflict of interest
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Generative AI statement
The author(s) declare that no Generative AI was used in the creation of this manuscript.
Publisher’s note
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
Supplementary Material
The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frsip.2025.1580395/full#supplementary-material
Footnotes
1https://github.com/mcomunita/tonetwist-afx-dataset
2https://github.com/mcomunita/nablafx
3https://github.com/mcomunita/nnlinafx-supp-material
4https://ccrma.stanford.edu/∼dtyeh/papers/wdftutorial.pdf
5https://github.com/mcomunita/nablafx
6https://www.idmt.fraunhofer.de/en/publications/datasets/audio_effects.html
8https://github.com/mcomunita/tonetwist-afx-dataset
10https://github.com/MrYxJ/calculate-flops.pytorch
References
Bai, S., Kolter, J. Z., and Koltun, V. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271. doi:10.48550/arXiv.1803.01271
Bank, B. (2011). “Computationally efficient nonlinear Chebyshev models using common-pole parallel filters with the application to loudspeaker modeling,” in Audio Engineering Society Convention 130, London, UK, May 13–16 2011.
Bengio, Y., Goodfellow, I., and Courville, A. (2017). Deep learning, vol. 1. Cambridge, MA, USA: MIT Press.
Bernardini, A., and Sarti, A. (2017). Biparametric wave digital filters. IEEE Trans. Circuits Syst. I Regul. Pap. 64, 1826–1838. doi:10.1109/tcsi.2017.2679007
Bernardini, A., Vergani, A. E., and Sarti, A. (2020). Wave digital modeling of nonlinear 3-terminal devices for virtual analog applications. Circuits, Syst. Signal Process. 39, 3289–3319. doi:10.1007/s00034-019-01331-7
Birnbaum, S., Kuleshov, V., Enam, Z., Koh, P. W. W., and Ermon, S. (2019). Temporal film: capturing long-range sequence dependencies with feature-wise modulations. Adv. Neural Inf. Process. Syst. 32. doi:10.48550/arXiv.1909.06628
Blair, J. (2015). Southern California surf music, 1960-1966. Mount Pleasant, SC: Arcadia Publishing.
Bloemer, K. (2024). GuitarML. Available online at: https://github.com/GuitarML (Accessed November 15, 2024).
Bogason, Ó. (2018). Modeling audio circuits containing typical nonlinear components with wave digital filters. Canada: McGill University.
Bogason, Ó., and Werner, K. J. (2017). “Modeling circuits with operational transconductance amplifiers using wave digital filters,” in Proc. of the Int. Conf. on Digital Audio Effects DAFx-17, 5th - 9th Sept 2017.
Borquez, D. A. C., Damskägg, E.-P., Gotsopoulos, A., Juvela, L., and Sherson, T. W. (2022). Neural modeler of audio systems.
Carini, A., Cecchi, S., Romoli, L., and Sicuranza, G. L. (2015). Legendre nonlinear filters. Signal Process. 109, 84–94. doi:10.1016/j.sigpro.2014.10.037
Carini, A., Orcioni, S., Terenzi, A., and Cecchi, S. (2019). Nonlinear system identification using wiener basis functions and multiple-variance perfect sequences. Signal Process. 160, 137–149. doi:10.1016/j.sigpro.2019.02.017
Cauduro Dias de Paiva, R., Pakarinen, J., and Välimäki, V. (2012). “Reduced-complexity modeling of high-order nonlinear audio systems using swept-sine and principal component analysis,” in 45th International Audio Engineering Society Conference, Helsinki, Finland, March 1-4, 2012.
Chowdhury, J. (2020). A comparison of virtual analog modelling techniques for desktop and embedded implementations. arXiv preprint arXiv:2009.02833. doi:10.48550/arXiv.2009.02833
Chowdhury, J., and Clarke, C. J. (2022). “Emulating diode circuits with differentiable wave digital filters,” in Proceedings of the 19th sound and music computing conference (Saint-Etienne, France: Zenodo).
Cohen, I., and Hélie, T. (2010). “Real-time simulation of a guitar power amplifier,” in Proc. of the Int. Conf. on Digital Audio Effects (DAFx-10), Graz, Austria, September 6-10, 2010.
Colonel, J., and Reiss, J. D. (2022). “Approximating ballistics in a differentiable dynamic range compressor,” in Audio Engineering Society Convention 153, New York, USA, 19-20 October 2022.
Colonel, J. T., Comunità, M., and Reiss, J. (2022a). “Reverse engineering memoryless distortion effects with differentiable waveshapers,” in Audio Engineering Society Convention 153, New York, USA, 19-20 October 2022.
Colonel, J. T., and Reiss, J. (2023). Reverse engineering a nonlinear mix of a multitrack recording. J. Audio Eng. Soc. 71, 586–595. doi:10.17743/jaes.2022.0105
Colonel, J. T., Steinmetz, C. J., Michelen, M., and Reiss, J. D. (2022b). “Direct design of biquad filter cascades with deep learning by sampling random polynomials,” in 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23-27 May 2022, 3104–3108. doi:10.1109/icassp43922.2022.9747660
Comunita, M., and Reiss, J. D. (2024). “Afxresearch: a repository and website of audio effects research,” in DMRN+ 19: digital music research network one-day workshop 2024.
Comunità, M., Steinmetz, C. J., Phan, H., and Reiss, J. D. (2023). “Modelling black-box audio effects with time-varying feature modulation,” in 2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP49357.2023.10097173
Comunità, M., Steinmetz, C. J., and Reiss, J. D. (2025). Nablafx: a framework for differentiable black-box and gray-box modeling of audio effects. arXiv preprint arXiv:2502.11668. doi:10.48550/arXiv.2502.11668
Comunità, M., Stowell, D., and Reiss, J. D. (2021). Guitar effects recognition and parameter estimation with convolutional neural networks. J. Audio Eng. Soc. 69, 594–604. doi:10.17743/jaes.2021.0019
Covert, J., and Livingston, D. L. (2013). “A vacuum-tube guitar amplifier model using a recurrent neural network,” in 2013 proceedings of IEEE southeastcon. doi:10.1109/SECON.2013.6567472
Damskägg, E.-P., Juvela, L., Thuillier, E., and Välimäki, V. (2019a). “Deep learning for tube amplifier emulation,” in 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP.2019.8682805
Damskägg, E.-P., Juvela, L., and Välimäki, V. (2019b). “Real-time modeling of audio distortion circuits with deep learning,” in Sound and music computing conference.
D’Angelo, S., Gabrielli, L., and Turchet, L. (2019). “Fast approximation of the lambert w function for virtual analog modelling,” in Proc. Of the int. Conf. On digital audio effects (DAFx-19).
D’Angelo, S., Pakarinen, J., and Valimaki, V. (2012). New family of wave-digital triode models. IEEE Trans. audio, speech, Lang. Process. 21, 313–321. doi:10.1109/tasl.2012.2224340
D’Angelo, S., and Välimäki, V. (2012). “Wave-digital polarity and current inverters and their application to virtual analog audio processing,” in 2012 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP.2012.6287918
D’Angelo, S., and Välimäki, V. (2013). “An improved virtual analog model of the moog ladder filter,” in 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP.2013.6637744
D’Angelo, S., and Välimäki, V. (2014). Generalized moog ladder filter: Part i–linear analysis and parameterization. IEEE/ACM Trans. Audio, Speech, Lang. Process. 22, 1825–1832. doi:10.1109/taslp.2014.2352495
Delfosse, Q., Schramowski, P., Molina, A., Beck, N., Hsu, T.-Y., Kashef, Y., et al. (2020). Rational activation functions. Available online at: https://github.com/ml-research/rational_activations.
Dempwolf, K., and Zölzer, U. (2011). A physically-motivated triode model for circuit simulations. Proc. Int. Conf. Digital Audio Eff. (DAFx-11).
De Sanctis, G., and Sarti, A. (2009). Virtual analog modeling in the wave-digital domain. IEEE Trans. audio, speech, Lang. Process. 18, 715–727. doi:10.1109/tasl.2009.2033637
Eichas, F., Möller, S., and Zölzer, U. (2015). Block-oriented modeling of distortion audio effects using iterative minimization. Proc. Int. Conf. Digital Audio Eff. (DAFx-15).
Eichas, F., Möller, S., and Zölzer, U. (2017). “Block-oriented gray box modeling of guitar amplifiers,” in Proc. Of the int. Conf. On digital audio effects (DAFx-17).
Eichas, F., and Zölzer, U. (2016a). “Black-box modeling of distortion circuits with block-oriented models,” in Proc. Of the int. Conf. On digital audio effects (DAFx-16).
Eichas, F., and Zölzer, U. (2016b). Modeling of an optocoupler-based audio dynamic range control circuit. Nov. Opt. Syst. Des. Optim. XIX (SPIE) 9948, 99480W. doi:10.1117/12.2235686
Eichas, F., and Zölzer, U. (2018). Gray-box modeling of guitar amplifiers. J. Audio Eng. Soc. 66, 1006–1015. doi:10.17743/jaes.2018.0052
Engel, J., Hantrakul, L., Gu, C., and Roberts, A. (2019). “Ddsp: differentiable digital signal processing,” in International conference on learning representations.
Engel, J., Hantrakul, L., Gu, C., and Roberts, A. (2020). Ddsp: differentiable digital signal processing. arXiv preprint arXiv:2001.04643. doi:10.48550/arXiv.2001.04643
Esqueda, F., Kuznetsov, B., and Parker, J. D. (2021). “Differentiable white-box virtual analog modeling,” in Proc. Of the int. Conf. On digital audio effects (DAFx-21).
Esqueda, F., Pöntynen, H., Parker, J. D., and Bilbao, S. (2017a). Virtual analog models of the lockhart and serge wavefolders. Appl. Sci. 7, 1328. doi:10.3390/app7121328
Esqueda, F., Pöntynen, H., Välimäki, V., and Parker, J. D. (2017b). “Virtual analog buchla 259 wavefolder,” in Proc. Of the int. Conf. On digital audio effects (DAFx-17).
Falaize, A., and Hélie, T. (2016). Passive guaranteed simulation of analog audio circuits: a port-Hamiltonian approach. Appl. Sci. 6, 273. doi:10.3390/app6100273
Fasciani, S., Simionato, R., and Tidemann, A. (2024). “Conditioning methods for neural audio effects,” in Proceedings of the international conference on sound and music computing.
Fernández-Cid, P., and Quirós, J. C. (2001). Distortion of musical signals by means of multiband waveshaping. J. New Music Res. 30, 279–287. doi:10.1076/jnmr.30.3.279.7476
Fernández-Cid, P., Quirós, J. C., and Aguilar, P. (1999). “Mwd: multiband waveshaping distortion,” in Proceedings of the 2nd COST-G6 workshop on digital audio effects (DAFx99) (NoTAM).
Fettweis, A. (1986). Wave digital filters: theory and practice. Proc. IEEE 74, 270–327. doi:10.1109/proc.1986.13458
Germain, F. G., and Werner, K. J. (2015). “Design principles for lumped model discretisation using möbius transforms,” in Proc. Of the int. Conf. On digital audio effects (DAFx-15).
Giannoulis, D., Massberg, M., and Reiss, J. D. (2012). Digital dynamic range compressor design—a tutorial and analysis. J. Audio Eng. Soc. 60.
Gu, A., and Dao, T. (2023). Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752. doi:10.48550/arXiv.2312.00752
Gu, A., Goel, K., and Ré, C. (2021a). Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396. doi:10.48550/arXiv.2111.00396
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra, A., et al. (2021b). Combining recurrent, convolutional, and continuous-time models with linear state space layers. Adv. neural Inf. Process. Syst. 34. doi:10.48550/arXiv.2110.13985
Gupta, A., Gu, A., and Berant, J. (2022). Diagonal state spaces are as effective as structured state spaces. Adv. Neural Inf. Process. Syst. 35. doi:10.48550/arXiv.2203.14343
Ha, D., Dai, A. M., and Le, Q. V. (2022). “Hypernetworks,” in International conference on learning representations. doi:10.48550/arXiv.1609.09106
Hawley, S., Colburn, B., and Mimilakis, S. I. (2019). “Profiling audio compressors with deep neural networks,” in Audio Engineering Society Convention 147, New York, USA, 16-19 October 2019.
Hayes, B., Shier, J., Fazekas, G., McPherson, A., and Saitis, C. (2024). A review of differentiable digital signal processing for music and speech synthesis. Front. Signal Process. 3. doi:10.3389/frsip.2023.1284100
Hélie, T. (2006). “On the use of volterra series for real-time simulations of weakly nonlinear analog audio devices: application to the moog ladder filter,” in Proc. Of the int. Conf. On digital audio effects (DAFx-06).
Hinrichs, R., Gerkens, K., and Ostermann, J. (2022). “Classification of guitar effects and extraction of their parameter settings from instrument mixes using convolutional neural networks,” in International conference on computational intelligence in music, sound, art and design (part of EvoStar) (Springer). doi:10.1007/978-3-031-03789-4_7
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. doi:10.48550/arXiv.1503.02531
Holmes, B., Holters, M., and Van Walstijn, M. (2017). “Comparison of germanium bipolar junction transistor models for real-time circuit simulation,” in Proc. Of the int. Conf. On digital audio effects (DAFx-17).
Holmes, B., and van Walstijn, M. (2015). “Improving the robustness of the iterative solver in state-space modelling of guitar distortion circuitry,” in Proc. Of the int. Conf. On digital audio effects (DAFx-15).
Holters, M., and Zölzer, U. (2015). “A generalized method for the derivation of non-linear state-space models from circuit schematics,” in 23rd European signal processing conference (EUSIPCO). doi:10.1109/EUSIPCO.2015.7362548
Holters, M., and Zölzer, U. (2016). “Circuit simulation with inductors and transformers based on the jiles-atherton model of magnetization,” in Proc. Of the int. Conf. On digital audio effects (DAFx-16).
Holzmann, G. (2009). “Reservoir computing: a powerful black-box framework for nonlinear audio processing,” in Proc. Of the int. Conf. On digital audio effects (DAFx-09).
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Netw. 2, 359–366. doi:10.1016/0893-6080(89)90020-8
Karjalainen, M., Mäki-Patola, T., Kanerva, A., and Huovilainen, A. (2006). Virtual air guitar. J. Audio Eng. Soc. 54.
Karjalainen, M., and Pakarinen, J. (2006). “Wave digital simulation of a vacuum-tube amplifier,” in 2006 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP.2006.1661235
Kemp, M. J. (1999). “Analysis and simulation of non-linear audio processes using finite impulse responses derived at multiple impulse amplitudes,” in Audio Engineering Society Convention 106, Munich, Germany, May 8-11, 1999.
Kröning, O., Dempwolf, K., and Zölzer, U. (2011). Analysis and simulation of an analog guitar compressor. Proc. Int. Conf. Digital Audio Eff. (DAFx-11).
Kuznetsov, B., Parker, J. D., and Esqueda, F. (2020). “Differentiable iir filters for machine learning applications,” in Proc. Int. Conf. Digital audio effects (DAFx-20).
Lea, C., Vidal, R., Reiter, A., and Hager, G. D. (2016). “Temporal convolutional networks: a unified approach to action segmentation,” in Computer vision–ECCV 2016 workshops: Amsterdam, The Netherlands, october 8-10 and 15-16, 2016, proceedings, Part III 14 (Springer). doi:10.1007/978-3-319-49409-8_7
Lee, S., Martínez-Ramírez, M. A., Liao, W.-H., Uhlich, S., Fabbro, G., Lee, K., et al. (2024). Searching for music mixing graphs: a pruning approach. arXiv preprint arXiv:2406.01049. doi:10.48550/arXiv.2406.01049
Lee, S., Park, J., Paik, S., and Lee, K. (2023). “Blind estimation of audio processing graph,” in 2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP49357.2023.10096581
Macak, J. (2011). “Nonlinear transformer simulation for real-time digital audio signal processing,” in 2011 IEEE international conference on telecommunications and signal processing (TSP). doi:10.1109/TSP.2011.6043681
Mačák, J. (2012). Real-time digital simulation of guitar amplifiers as audio effects. Brno: Brno University of Technology. Ph.D. thesis.
Macak, J. (2013). “Guitar preamp simulation using connection currents,” in Proc. Of the int. Conf. On digital audio effects (DAFx-16).
Macak, J., Schimmel, J., and Holters, M. (2012). “Simulation of fender type guitar preamp using approximation and state-space model,” in Proc. Of the int. Conf. On digital audio effects (DAFx-12).
Martínez Ramírez, M. A. (2021). Deep learning for audio effects modeling. London, United Kingdom: Queen Mary University of London. Ph.D. thesis.
Martínez Ramírez, M. A., Benetos, E., and Reiss, J. D. (2020). Deep learning for black-box modeling of audio effects. Appl. Sci. 10, 638. doi:10.3390/app10020638
Martínez-Ramírez, M. A., Liao, W.-H., Fabbro, G., Uhlich, S., Nagashima, C., and Mitsufuji, Y. (2022). Automatic music mixing with deep learning and out-of-domain data. arXiv preprint arXiv:2208.11428. doi:10.48550/arXiv.2208.11428
Matsunaga, Y., Aoki, N., Dobashi, Y., and Yamamoto, T. (2018). “A digital modeling technique for distortion effect based on a machine learning approach,” in 2018 IEEE asia-pacific signal and information processing association annual summit and conference (APSIPA ASC). doi:10.23919/APSIPA.2018.8659547
Mendoza, D. S. (2005). Emulating electric guitar effects with neural networks. Graduation Project, Computer Engineering. Universitat Pompeu Fabra.
Miklanek, S., Wright, A., Välimäki, V., and Schimmel, J. (2023). “Neural grey-box guitar amplifier modelling with limited data,” in Proc. Of the int. Conf. On digital audio effects (DAFx-23).
Mitchell, W., and Hawley, S. H. (2020). “Exploring quality and generalizability in parameterized neural audio effects,” in Audio Engineering Society Convention 149, October 21-24, 2020.
Nercessian, S., Sarroff, A., and Werner, K. J. (2021). “Lightweight and interpretable neural modeling of an audio distortion effect using hyperconditioned differentiable biquads,” in 2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP39728.2021.9413996
Novak, A., Simon, L., Kadlec, F., and Lotton, P. (2009). Nonlinear system identification using exponential swept-sine signal. IEEE Trans. Instrum. Meas. 59, 2220–2229. doi:10.1109/tim.2009.2031836
Novak, A., Simon, L., Lotton, P., and Gilbert, J. (2010). “Chebyshev model and synchronized swept sine method in nonlinear audio effect modeling,” in Proc. Of the int. Conf. On digital audio effects (DAFx-10).
Orcioni, S., Cecchi, S., and Carini, A. (2017). “Multivariance nonlinear system identification using wiener basis functions and perfect sequences,” in 2017 IEEE European signal processing conference (EUSIPCO). doi:10.23919/EUSIPCO.2017.8081697
Pakarinen, J., Tikander, M., and Karjalainen, M. (2009). “Wave digital modeling of the output chain of a vacuum-tube amplifier,” in Proc. Of the int. Conf. On digital audio effects (DAFx-09).
Parker, J. D., Esqueda, F., and Bergner, A. (2019). “Modelling of nonlinear state-space systems using a deep neural network,” in Proc. Of the int. Conf. On digital audio effects (DAFx-19).
Parker, J. D., Schlecht, S. J., Rabenstein, R., and Schäfer, M. (2022). Physical modeling using recurrent neural networks with fast convolutional layers. doi:10.48550/arXiv.2204.10125
Pedroza, H., Meza, G., and Roman, I. R. (2022). Egfxset: electric guitar tones processed through real effects of distortion, modulation, delay and reverb. ISMIR Late Break. Demo.
Peussa, A., Damskägg, E.-P., Sherson, T., Mimilakis, S. I., Juvela, L., Gotsopoulos, A., et al. (2021). “Exposure bias and state matching in recurrent neural network virtual analog models,” in Proc. Of the int. Conf. On digital audio effects (DAFx-21).
Primavera, A., Cecchi, S., Romoli, L., Gasparini, M., and Piazza, F. (2012). “Approximation of dynamic convolution exploiting principal component analysis: objective and subjective quality evaluation,” in Audio Engineering Society Convention 133, San Francisco, CA, USA, October 26-29, 2012.
Prown, P., and Sharken, L. (2003). Gear secrets of the guitar legends: how to sound like your favorite players. Wisconsin, United States: Hal Leonard Corporation.
Raffensperger, P. (2012). “Toward a wave digital filter model of the fairchild 670 limiter,” in Proc. of the Int. Conf. on Digital Audio Effects (DAFx-12). vol. 15, York, UK, September 17-21, 2012.
Ramírez, M. A. M., and Reiss, J. D. (2019). “Modeling nonlinear audio effects with end-to-end deep neural networks,” in 2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP.2019.8683529
Ramírez, M. A. M., Benetos, E., and Reiss, J. D. (2019). A general-purpose deep learning approach to model time-varying audio effects. arXiv preprint arXiv:1905.06148. doi:10.48550/arXiv.1905.06148
Ramos, G. (2011). “Block processing strategies for computationally efficient dynamic range controllers,” in Proc. Of the int. Conf. On digital audio effects (DAFx-11).
Rethage, D., Pons, J., and Serra, X. (2018). “A wavenet for speech denoising,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP.2018.8462417
Roads, C. (1979). A tutorial on non-linear distortion or waveshaping synthesis. Comput. Music J. 3, 29. doi:10.2307/3680281
Sapp, M., Becker, J., and Brouër, C. (1999). Simulation of vacuum-tube amplifiers. J. Acoust. Soc. Am. 105, 1331. doi:10.1121/1.426231
Schmitz, T., and Embrechts, J.-J. (2017). Hammerstein kernels identification by means of a sine sweep technique applied to nonlinear audio devices emulation. J. Audio Eng. Soc. 65, 696–710. doi:10.17743/jaes.2017.0025
Schmitz, T., and Embrechts, J.-J. (2018a). “Introducing a dataset of guitar amplifier sounds for nonlinear emulation benchmarking,” in Audio Engineering Society Convention 145, New York, NY, USA, October 17-20, 2018.
Schmitz, T., and Embrechts, J.-J. (2018b). “Nonlinear real-time emulation of a tube amplifier with a long short time memory neural-network,” in Audio Engineering Society Convention 144, Milan, Italy, May 23-26, 2018.
Schmitz, T., and Embrechts, J.-J. (2018c). Real time emulation of parametric guitar tube amplifier with long short term memory neural network. Comput. Sci. Inf. Technol., 149–157. doi:10.5121/csit.2018.80511
Schmitz, T., and Embrechts, J.-J. (2019). “Objective and subjective comparison of several machine learning techniques applied for the real-time emulation of the guitar amplifier nonlinear behavior,” in Audio Engineering Society Convention 146, Dublin, Ireland, March 20-23, 2019.
Schoukens, J., and Ljung, L. (2019). Nonlinear system identification: a user-oriented road map. IEEE Control Syst. Mag. 39, 28–99. doi:10.1109/mcs.2019.2938121
Serrà, J., Pascual, S., and Segura Perales, C. (2019). Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion. Adv. Neural Inf. Process. Syst. 32. doi:10.48550/arXiv.1906.00794
Simionato, R., and Fasciani, S. (2022). “Deep learning conditioned modeling of optical compression,” in Proc. Of the int. Conf. On digital audio effects (DAFx-22).
Simionato, R., and Fasciani, S. (2023). “Fully conditioned and low-latency black-box modeling of analog compression,” in Proc. Of the int. Conf. On digital audio effects (DAFx-23).
Simionato, R., and Fasciani, S. (2024a). Comparative study of recurrent neural networks for virtual analog audio effects modeling. arXiv preprint arXiv:2405.04124. doi:10.48550/arXiv.2405.04124
Simionato, R., and Fasciani, S. (2024b). Modeling time-variant responses of optical compressors with selective state space models. doi:10.48550/arXiv.2408.12549
Stein, M., Abeßer, J., Dittmar, C., and Schuller, G. (2010). “Automatic detection of audio effects in guitar and bass recordings,” in Audio Engineering Society Convention 128, London, UK, May 22-25, 2010.
Steinmetz, C. J., and Reiss, J. D. (2022). “Efficient neural networks for real-time modeling of analog dynamic range compression,” in Audio Engineering Society Convention 152, The Hague, Netherlands, 7-8 May 2022.
Steinmetz, C. J., Bryan, N. J., and Reiss, J. D. (2022). Style transfer of audio effects with differentiable signal processing. J. Audio Eng. Soc. 70, 708–721. doi:10.17743/jaes.2022.0025
Steinmetz, C. J., Pons, J., Pascual, S., and Serra, J. (2021). “Automatic multitrack mixing with a differentiable mixing console of neural audio effects,” in 2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP39728.2021.9414364
Steinmetz, C. J., Singh, S., Comunità, M., Ibnyahya, I., Yuan, S., Benetos, E., et al. (2024). St-ito: controlling audio effects for style transfer with inference-time optimization. arXiv preprint arXiv:2410.21233. doi:10.48550/arXiv.2410.21233
Südholt, D., Wright, A., Erkut, C., and Välimäki, V. (2022). Pruning deep neural network models of guitar distortion effects. IEEE/ACM Trans. Audio, Speech, Lang. Process. 31. doi:10.1109/TASLP.2022.3223257
Taylor, J. (2020). Latent space oddity: exploring latent spaces to design guitar timbres. arXiv preprint arXiv:2010.15989. doi:10.48550/arXiv.2010.15989
Timoney, J., Lazzarini, V., Gibney, A., and Pekonen, J. (2010). Digital emulation of distortion effects by wave and phase shaping methods.
Tronchin, L. (2013). The emulation of nonlinear time-invariant audio systems with memory by means of volterra series. J. Audio Eng. Soc. 60.
Van Der Schaft, A. (2006). “Port-Hamiltonian systems: an introductory survey,” in International congress of mathematicians (Berlin, Germany: EMS Press).
Vanka, S. S., Steinmetz, C., Rolland, J.-B., Reiss, J., and Fazekas, G. (2024). Diff-mst: differentiable mixing style transfer. arXiv preprint arXiv:2407.08889. doi:10.48550/arXiv.2407.08889
Werner, K. J., Nangia, V., Bernardini, A., Smith III, J. O., and Sarti, A. (2015a). “An improved and generalized diode clipper model for wave digital filters,” in Audio Engineering Society Convention, 139, New York, NY, USA, October 29-November 1, 2015.
Werner, K. J., Nangia, V., Smith III, J. O., and Abel, J. S. (2015b). “Resolving wave digital filters with multiple/multiport nonlinearities,” in Proc. Of the int. Conf. On digital audio effects (DAFx-15).
Werner, K. J., Smith, J. O., and Abel, J. S. (2015c). “Wave digital filter adaptors for arbitrary topologies and multiport linear elements,” in Proc. Of the int. Conf. On digital audio effects (DAFx-15).
Wilczek, J., Wright, A., Valimaki, V., and Habets, E. A. (2022). “Virtual analog modeling of distortion circuits using neural ordinary differential equations,” in Proc. Of the int. Conf. On digital audio effects (DAFx-20in22).
Wilmering, T., Moffat, D., Milo, A., and Sandler, M. B. (2020). A history of audio effects. Appl. Sci. 10, 791. doi:10.3390/app10030791
Wright, A., Damskägg, E.-P., Juvela, L., and Välimäki, V. (2020). Real-time guitar amplifier emulation with deep learning. Appl. Sci. 10, 766. doi:10.3390/app10030766
Wright, A., Damskägg, E.-P., and Välimäki, V. (2019). “Real-time black-box modelling with recurrent neural networks,” in Proc. Of the int. Conf. On digital audio effects (DAFx-19).
Wright, A., and Välimäki, V. (2020). “Perceptual loss function for neural modeling of audio systems,” in 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP40776.2020.9052944
Wright, A., and Valimaki, V. (2022). “Grey-box modelling of dynamic range compression,” in Proc. Of the int. Conf. On digital audio effects (DAFx-20in22).
Wright, A., Välimäki, V., and Juvela, L. (2023). “Adversarial guitar amplifier modelling with unpaired data,” in 2023 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP49357.2023.10094600
Yamamoto, R., Song, E., and Kim, J.-M. (2020). “Parallel wavegan: a fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in 2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). doi:10.1109/ICASSP40776.2020.9053795
Yeh, D. T., Abel, J., and Smith, J. O. (2007a). Simulation of the diode limiter in guitar distortion circuits by numerical solution of ordinary differential equations. Proc. Int. Conf. Digital Audio Eff. (DAFx-07).
Yeh, D. T., Abel, J. S., and Smith, J. O. (2007b). “Simplified, physically-informed models of distortion and overdrive guitar effects pedals,” in Proc. Of the int. Conf. On digital audio effects (DAFx-07).
Yeh, D. T., Abel, J. S., and Smith, J. O. (2009). Automated physical modeling of nonlinear audio circuits for real-time audio effects—part i: theoretical development. IEEE Trans. audio, speech, Lang. Process. 18, 728–737. doi:10.1109/tasl.2009.2033978
Yeh, D. T., Abel, J. S., Vladimirescu, A., and Smith, J. O. (2008). Numerical methods for simulation of guitar distortion circuits. Comput. Music J. 32, 23–42. doi:10.1162/comj.2008.32.2.23
Yeh, D. T., and Smith, J. O. (2006). “Discretization of the ’59 Fender Bassman tone stack,” in Proc. Of the int. Conf. On digital audio effects (DAFx-06).
Yeh, D. T., and Smith, J. O. (2008). Simulating guitar distortion circuits using wave digital and nonlinear state-space formulations. Proc. Int. Conf. Digital Audio Eff. (DAFx-08).
Yeh, D. T.-M. (2009). Digital implementation of musical distortion circuits by analysis and simulation. Stanford University.
Yeh, Y.-T., Chen, Y.-H., Cheng, Y.-C., Wu, J.-T., Fu, J.-J., Yeh, Y.-F., et al. (2024a). Ddsp guitar amp: interpretable guitar amplifier modeling. arXiv preprint arXiv:2408.11405. doi:10.1109/ICASSP49660.2025.10889306
Yeh, Y.-T., Hsiao, W.-Y., and Yang, Y.-H. (2024b). Hyper recurrent neural network: condition mechanisms for black-box audio effect modeling. arXiv preprint arXiv:2408.04829. doi:10.48550/arXiv.2408.04829
Yin, H., Cheng, G., Steinmetz, C. J., Yuan, R., Stern, R. M., and Dannenberg, R. B. (2024). Modeling analog dynamic range compressors using deep learning and state-space models. arXiv preprint arXiv:2403.16331. doi:10.48550/arXiv.2403.16331
Yoshimoto, K., Kuroda, H., Kitahara, D., and Hirabayashi, A. (2020). “Deep neural network modeling of distortion stomp box using spectral features,” in 2020 IEEE asia-pacific signal and information processing association annual summit and conference (APSIPA ASC).
Yoshimoto, K., Kuroda, H., Kitahara, D., and Hirabayashi, A. (2021). Wavenet modeling of distortion pedal using spectral features. Acoust. Sci. Technol. 42, 305–313. doi:10.1250/ast.42.305
Yu, C.-y., Mitcheltree, C., Carson, A., Bilbao, S., Reiss, J., and Fazekas, G. (2024). “Differentiable all-pole filters for time-varying audio systems,” in 27th international conference on digital audio effects.
Zhang, Z., Olbrych, E., Bruchalski, J., McCormick, T. J., and Livingston, D. L. (2018). “A vacuum-tube guitar amplifier model using long/short-term memory networks,” in 2018 proceedings of IEEE southeastcon. doi:10.1109/SECON.2018.8479039
Keywords: audio effects, black-box modeling, gray-box modeling, neural networks, differentiable DSP
Citation: Comunità M, Steinmetz CJ and Reiss JD (2025) Differentiable black-box and gray-box modeling of nonlinear audio effects. Front. Signal Process. 5:1580395. doi: 10.3389/frsip.2025.1580395
Received: 20 February 2025; Accepted: 23 June 2025;
Published: 31 July 2025.
Edited by:
Craig Jin, The University of Sydney, Australia
Reviewed by:
Victor Lazzarini, Maynooth University, Ireland
Yunyi Liu, The University of Sydney, Australia
Claudia Rinaldi, Consorzio Nazionale Interuniversitario Per Le Telecomunicazioni, Italy
Copyright © 2025 Comunità, Steinmetz and Reiss. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
*Correspondence: Marco Comunità, m.comunita@qmul.ac.uk