Exploiting Multiple Timescales in Hierarchical Echo State Networks

Echo state networks (ESNs) are a powerful form of reservoir computing that only require training of linear output weights whilst the internal reservoir is formed of fixed randomly connected neurons. With a correctly scaled connectivity matrix, the neurons' activity exhibits the echo-state property and responds to the input dynamics with certain timescales. Tuning the timescales of the network can be necessary for treating certain tasks, and some environments require multiple timescales for an efficient representation. Here we explore the timescales in hierarchical ESNs, where the reservoir is partitioned into two smaller linked reservoirs with distinct properties. Over three different tasks (NARMA10, a reconstruction task in a volatile environment, and psMNIST), we show that by selecting the hyper-parameters of each partition such that they focus on different timescales, we achieve a significant performance improvement over a single ESN. Through a linear analysis, and under the assumption that the timescales of the first partition are much shorter than the second's (typically corresponding to optimal operating conditions), we interpret the feedforward coupling of the partitions in terms of an effective representation of the input signal, provided by the first partition to the second, whereby the instantaneous input signal is expanded into a weighted combination of its time derivatives. Furthermore, we propose a data-driven approach to optimise the hyper-parameters through a gradient descent optimisation method that is an online approximation of backpropagation through time. We demonstrate the application of the online learning rule across all the tasks considered.


Introduction
The high inter-connectivity and asynchronous loop structure of Recurrent Neural Networks (RNNs) make them powerful techniques for processing temporal signals [1]. However, the complex inter-connectivity of RNNs means that they cannot be trained using the conventional back-propagation (BP) algorithm [2] used in feed-forward networks, since each neuron's state depends on other neuronal activities at previous times. A method known as Back-Propagation-Through-Time (BPTT) [3], which relies on an unrolling of neurons' connectivity through time to propagate the error signal to earlier time states, can be prohibitively complex for large networks or time series. Moreover, BPTT is not considered Previous works have studied the dependence of the reservoir properties on the structure of the random connectivity adopted, studying the dependence of the reservoir performance on the parameters defining the random connectivity distribution, and formulating alternatives to the typical Erdos-Renyi graph structure of the network [16,17,18]. In this sense, in [17] a model with a regular graph structure has been proposed, where the nodes are connected forming a circular path with constant shortest path lengths equal to the size of the network, introducing long temporal memory capacity by construction. The memory capacity has been studied previously for network parameters such as the spectral radius (ρ) and sparsity; in general memory capacity is higher for ρ close to 1 and low sparsity, but high memory capacity does not guarantee high prediction [19,20]. ESNs are known to perform optimally when at the "edge of criticality" [21], where low prediction error and high memory can be achieved through network tuning.
More recently, models composed of multiple reservoirs have gathered the attention of the community. From the two ESNs with lateral inhibition proposed in [22], to the hierarchical structure of reservoirs first analysed by Jaeger in [23], these complex architectures of multiple, multilayered reservoirs have shown improved generalisation abilities over a variety of tasks [24,23,25]. In particular, the works [26] [27] have studied different dynamical properties of such hierarchical structures of ESNs, while [28] have proposed hierarchical (or deep) ESNs with projection encoders between layers to enhance the connectivity of the ESN layers. The partitioning (or modularity) of ESNs was studied by [29], where the ratio of external to internal connections was varied. By tuning this partitioning performance can be increased on memory or recall tasks. Here we demonstrate that one of the main reasons to adopt a network composed by multiple, pipelined sub-networks, is the ability to introduce multiple timescales in the network's dynamics, which can be important in finding optimal solutions for complex tasks. Examples of tasks that require such properties are in the fields of speech, natural language processing, and reward driven learning in partially observable Markov decision processes [30]. A hierarchical structure of temporal kernels [31], as multiple connected ESNs, can discover higher level features of the input temporal dynamics. Furthermore, while a single ESN can be tuned to incorporate a distribution of timescales with a prefixed mode, optimising the system hyper-parameters to cover a wide range of timescales can be problematic.
Here, we show that optimisation of hyper-parameters can be guided by analysing how these hyper-parameters are related to the timescales of the network, and by optimising them according to the temporal dynamics of the input signal and the memory required to solve the considered task. This analysis improves performance and reduces the search space required in hyper-parameter optimisation. In particular, we consider the case where an ESN is split into two sections with different hyper-parameters resulting in separate temporal properties. In the following, we will first provide a survey of timescales in ESNs before presenting the comparative success of these hierarchical ESNs on three different tasks. The first is the non-linear auto-regressive moving average 10 (NARMA10) task which requires both memory and fast non-linear transformation of the input. Second, we explore the performance of the network in a reconstruction and state "perception" task with different levels of external white noise applied on the input signal. Finally, we apply the hierarchical ESN to a permuted sequential MNIST classification task, where the usual MNIST hand written digit database is serialised and permuted as a 1d time-series.

Survey of timescales in Echo State networks
We begin by describing the operations of an ESN and present a didactic survey of the inherent timescales in ESNs, which will be drawn upon in later sections to analyse the results.
As introduced in the previous section, an ESN is a recurrent neural network and the activity, x(t), of the neurons due to a temporal input signal s(t) is given by where W is a possibly sparse random matrix defining the connectivity of the network, W in defines the input adjacency matrix, and γ is a rescaling factor of the input weights. α = δt/τ is the leakage term of the node, and ρ is a scaling factor for the spectral radius of the connectivity matrix and will be discussed in more detail in the following. f () is a non-linear function, which in this work we define as the hyperbolic tangent. To ensure that the network exhibits the Echo-State property, and so that the activity does not saturate, the initial random connectivity matrix, W, is rescaled by its maximum eigenvalue magnitude (spectral radius), |λ max W | = max |eig(W)|, thus ensuring a unitary spectral radius which can be tuned using ρ as a hyper-parameter. In practice, W is constructed from a matrix of Normally distributed random numbers and the sparseness is enforced by randomly setting to zero a fixed proportion of these elements. Typically 10 non-zero connections per node are retained in W.
The timescales of this dynamical system are closely linked to the specific structure of W and to the two hyperparameters; α and ρ. Since α is the leakage rate, it directly controls the retention of information from previous time steps, while ρ specifies the maximum absolute magnitude of the eigenvalues and as such tunes the decay time of internal activity of the network. Thus, the basic hyper-parameters that need to be set are γ, α and ρ. Considering the nonlinear dependence of the network performance on these values and the task-dependent nature of an efficient parameterisation, this process can be challenging. Such hyper-parameters are commonly optimised through a grid search or through explicit gradient descent methods in online learning paradigms [32]. However, the fine tuning procedure can be guided, and the searchable space reduced, using a simple analysis of the hyper-parameters' relation to the timescales of the network, the external signal's temporal dynamics, and the memory required to solve the considered task.
Considering that the eigenvalues λ W of the connectivity matrix are inside the imaginary unit circle due to the normalisation procedure described previously, and that α is a constant common to all neurons, the eigenvalues of the linearised system given by Eq. 1 are This corresponds to a rescaling of value αρ and to a translation of value 1 − α across the real axis of the original λ W . This operation on the eigenvalues of W is depicted in Fig. 1A. Thus, considering that each eigenvalue λ i can be decomposed in its corresponding exponential decaying part exp(−δt/τ i ) and its oscillatory imaginary component, the timescales of the linearised system are  Figure 1: The analysis of the timescales of the system in the linear regime can guide the search for the optimal values of the hyper-parameters α and ρ. A: Translation and scaling of the eigenvalues of the system due to the presence of the leakage factor. B: Example of distribution of timescales, computed analytically (red line) and computationally (black points) estimated from the eigenvalues of W. C: Pirate plot of the distributions of timescales as α increases. Both axes are logarithmic. Higher α values correspond to longer timescales and to a more compressed range of timescales (logarithmic y-axis). D: Pirate plot of the distributions of timescales: as ρ increases, the range of timescales expands. Again, both axes are logarithmic. E: Example distributions of timescales for reservoirs with different connectivity structure. From left to right, a delay line, single ESN, 2 ESNs (connected and unconnected, see text for the reason why the timescales for these two structures are the same in the linear regime). The higher complexity of the models reported is reflected in a richer distribution of timescales.
When the connectivity matrix, W, is given by a sparse matrix with non-zero elements drawn randomly from a uniform distribution with the range [−1, 1], then the corresponding eigenvalues will be uniformly distributed within a circle with a radius of max(|λ W |) in the complex plane [33]. These eigenvalues are then re-scaled by max(|λ W |) to ensure they are within the unit circle. The distribution of the eigenvalues then reveals the distribution of timescales of the linearised system. Indeed, given p (Re(λ), Im(λ)), the distribution of timescales can be found through computation of the marginal p Re(λ)) = p Re(λ), Im(λ) dIm(λ) and the change of variable defined in equation 5, giving Importantly we note that whilst the eigenvalues are uniformly distributed over the unit circle, the timescales are not due to the inverse relationship between them. The resulting distribution of the linearised system, shown in Fig. 1B (red line), is in excellent agreement with the numerically computed distribution for a single ESN (black points + shaded area).
The analytical form of the distribution, together with Eq. 5, allows us to explicitly derive how changes in α and ρ affect the network timescales. Notably we can obtain analytical expression for the minimum, maximum and most probable (peak of the distribution) timescale: where Eq. 8 and 7 can be derived directly from Eq. 5, while Eq. 9 follows from maximisation of Eq. 6. As expected, α strongly affects all these three quantities; interestingly, though, α does not influence the relative range of the distribution, τ max /τ min = (1 + ρ)/(1 − ρ). Indeed α plays the role of a unit of measure for the τ s, and can then be used to scale the distribution in order to match the relevant timescales for the specific task. On the other hand, ρ does not strongly affect the shape of the distribution, but determines how dispersed the τ s are. Given the finite number of τ s expressed by a finite ESN, the hyper-parameter ρ can be used to balance the raw representation power of the network (how wide the range of timescales is) with the capacity to approximate any given timescale in that range. Fig. 1C and D give a more detailed view of how the distribution of timescales changes as α and ρ, respectively, vary; note the logarithmic scale on the y-axis, that makes the dependence on α linear. The link between the eigenvalues and the reservoir dynamics can be shown through the analysis of the network response to an impulsive signal, shown in Section 5.2, where the experimental activities are compared with the theoretical ones expected from the linearised system.

Hierarchical Echo-State Networks
Different studies have proposed alternatives to the random structure of the connectivity matrix of ESNs, formulating models of reservoirs with regular graph structures. Examples include a delay line [17], where each node receives and provides information only from the previous node and the following one respectively, and the concentric reservoir proposed in [18], where multiple delay lines are connected to form a concentric structure. Furthermore, the idea of a hierarchical architecture of ESNs, where each ESN is connected to the preceding and following one, has attracted the reservoir computing community for its capability of discovering higher level features of the external signal [34]. Fig. 2 schematically shows the architecture for (A) a single ESN, (B) 2 sub-reservoir hierarchical ESN for which the input is fed into only the first sub-reservoir which in turn feeds into the second and (C) a parallel ESN, where two unconnected sub-reservoirs receive the same input. These heirarchical ESNs are identical to the 2 layer DeepESN given by [27]. A general ensemble of interacting ESNs can be described by where the parameters have the similar definitions as in the case of a single ESN in Eq. 1. The index k indicates the network number and N ESN is the total number of networks under consideration. In a hierarchical structure of ESNs W (kl) = 0 for k = l or k = l + 1 only, and W (kl) can be drawn from any desirable distribution thanks to the absence of feedback connections to higher-order reservoirs. Indeed, in this case, the necessary condition for the Echo-State network property is that all the inner connectivity matrices W (kk) have eigenvalues with an absolute value less than one. Furthermore, in the typical hierarchical structure proposed in previous works [23,24,27,25,35], the input is fed to the first network only, and W (k) in = 0 if k = 1 only. We emphasise that the values of α (k) and ρ (k) , which are closely related to the timescales and repertoire of dynamics of network number k (and, in the case of hierarchical reservoirs, also to all subsequent networks), do not have to be equal for each ESN, but can be chosen differently to fit the necessity of the task. In particular, some tasks could require memory over a wide range of timescales that could not effectively be covered by a single ESN.
In Fig. 1E we show examples of the timescale distributions of the corresponding linearised dynamical systems for different ESN structures, from the simple delay line model to the higher complexity exhibited from two hierarchical ESNs. In order from left to right, the histograms of timescales are for a delay line, a single ESN, and two ESNs (whether hierarchically connected or unconnected; see below for clarification). All the models share an ESN with ρ = 0.9 and α = 0.9; where present, the second reservoir has α = 0.2. By construction, the richness and range of timescales distributions reported increases with the complexity of the models. However, we note how a simple delay line could exhibit longer temporal scales than the other structures analysed thanks to its constant and high value of minimum path length between any pairs of nodes. Nevertheless, its limited dynamics restricts its application to simple tasks. The cases with two ESNs show a bimodal distribution corresponding to the two values of α.
Yet, the spectrum of the eigenvalues of the linearised system is only partially informative of the functioning and capabilities of an ESN. This is clearly demonstrated by the fact that a hierarchical and a parallel ESN share the same spectrum in the linear regime. Indeed, for a hierarchical ESN, whose connectivity matrix of the linearised dynamics is given by: Input Output R 1 : α (1) , ρ (11) W (11) R 2 : α (2) , ρ (22) W (22) ρ (12) W (12) W out (1) , ρ (11) W (11) R 2 : α (2) , ρ (22) W (22) γ (2) W (2) in γ (1) W (1) in W out the input is fed into reservoir 1 only and the connection is unidirectional from R1 to R2, which is identical to the 2 layer DeepESN of [27]. C: A parallel (or unconnected hierarchical) ESN where the network is partitioned into 2 reservoirs, R1 and R2, which each receive the input and provide output but have distinct hyper-parameters.
it is easy to demonstrate that every eigenvalue of W (11) and W (22) is also an eigenvalue ofW, irrespective of W (12) , not unlike what happens for a parallel ESN (where W (12) = 0, and hence the demonstration follows immediately). Nonetheless, as we will see in the next sections, the hierarchical ESN has better performance on different tasks compared to the other structures considered, including the parallel ESN.
It is interesting to note, in this respect, that the success of the hierarchical ESN is generally achieved when the leakage term of the first reservoir is higher than the leakage term of the second (or, in other words, when the first network has much shorter timescales). Such observation opens the way to an alternative route to understand the functioning of the hierarchical structure, as the first reservoir expanding the dimensionality of the input and then feeding the enriched signal into the second network. Indeed, in the following, we will show how, in a crude approximation and under the above condition of a wide separation of timescales, the first ESN extracts information on the short term behaviour of the input signal, notably its derivatives, and the second ESN integrates such information over longer times.
We begin with the (continuous time) linearized dynamics of a Hierarchical ESN is given bẏ where, for simplicity, we have reabsorbed the ρ (kl) and γ (k) factors into the definitions of W (kl) and W (k) in respectively, and the new constants can be derived with reference to Eq. 1 and 2; for example: The neuron activity can be projected on to the left eigenvector of each of the M (i) matrices. As such we define the eigenvector matrices, V (i) , where each row is a left eigenvector and so satisfies the equation Λ (1) and Λ (22) are the diagonal matrices of the eigenvalues of the two M matrices. Using these we can define , and so the dynamical equations can be expressed aṡ whereW (1) in andW (12) = V (2) W (12) V (1) −1 are the input and connection matrices expanded in this basis. Taking the Fourier transform on both sides of Eq. 16, such that F T y (1) (t) =ỹ (1) (ω) and F T ẏ (1) (t) = −iωỹ (1) (ω), where i is the imaginary unit. The transformỹ (2) (ω) of y (2) (t) can now be expressed as a function of the transform of the signals(ω) giving where I is the identity matrix of the same size as Λ (1) . If the second ESN's timescale are much longer than that of the first one (i.e., Λ (1) Λ (2) ), then we can expand the inverse of theỹ (1) coefficient on the LHS of Eq. 18 when By applying this approximation to Eq. 18, and by defining the diagonal matrix of characteristic times T (1) ≡ −(Λ (1) ) −1 , the relation between the activity of reservoir 1 and the input in Fourier space is given bỹ The coefficients of this series are equivalent to taking successive time derivatives in Fourier space, such that (−iω) ns = d (n)s /dt (n) . So by taking the inverse Fourier transform we find the following differential equation for y (1) which can be inserted into Eq. 17 to givė Thus the second ESN integrates the signal with a linear combination of its derivatives. In other words, the first reservoir expands the dimensionality of the signal to include information regarding the signal's derivatives (or, equivalently in discretised time, the previous values assumed by the signal). In this respect, Eq. 23 is key to understanding how the hierarchical connectivity between the two reservoirs enhances the representational capabilities of the system. The finite-difference approximation of the time derivatives appearing in Eq. 23 implies that a combination of past values of the signal appears, going back in time as much as the retained derivative order dictates.

Online learning of hyper-parameter
Selecting the hyper-parameters of such systems can be challenging. Such selection process can be informed by the knowledge of the natural timescales of the task/signal at hand. Alternatively one can resort to a learning method to optimise the parameters directly. The inherent limitation of these methods is the same as learning the network weights with BPTT: the whole history of network activations is required at once. One way to by-pass this issue is to approximate the error signal by considering only past and same-time contributions, as suggested by Bellec et al. [4] in their framework known as e-prop (see also [36]), and derive from this approximation an online learning rule for the ESN hyper-parameters. Following their approach, we end up with a novel learning rule for the leakage terms of connected ESNs that is similar to the rule proposed by Jaeger et al. [32] but extended to two hierarchical reservoirs. The main learning rule is given by: where e (ki) (t) = dx (k) (t)/dα (i) is known as the eligibility trace which tracks the gradient of neuron activities in the reservoir number k with respect to the i-th leakage rate. Given the closed form for the hierarchical ESNs in Eqs. 10 and 11 these terms can be readily calculated. For our N ESN sub-reservoirs in the hierarchical structure there will be N 2 ESN eligibility traces to track how each sub-reservoir depends on the other leakage rates. In the hierarchical case of a fixed feed-forward structure some of these traces will be zero, and the number of non-zero eligibility traces would be N (N + 1)/2. Since the update of the neuron's activity depends on its previous values, so do the eligibility traces; therefore, they can be calculated recursively through where δ ki = 1 if k = i and 0 otherwise, i.e the Kronecker delta. The update of equations 25 for each k-i pair needs to follow the order of dependencies given by the structure of connected reservoirs considered. The eligibility trace is an approximation that only includes same-time contributions to the gradient but has the advantage that is can be easily computed online. A complete description of our method is given in the Supplementary Material. For an example where the mean squared error function E(t) = 1 2 ỹ(t) − y(t) 2 is used in a regression task and a structure composed by two reservoirs, the updating equations on the leakage terms are where η α is the learning rate on the leakage terms and e (k1) (t), e (k2) (t) (k = 1, 2 in this case with two reservoirs) is a vector composed by the juxtaposition of the eligibility traces, which can be computed through Eq. 25. Of course, the gradient can be combined with existing gradient learning techniques, among which we adopt the Adam optimiser, described in the Supplementary Material. In all online learning simulations, training is accomplished through minibatches with updates at each time step. Training is stopped after convergence. When learning αs and the output weights simultaneously, the learning rates corresponding to these hyper-parameters need to be carefully set, since the weights need to adapt quickly to the changing dynamic of the network, but a fast convergence of W out can trap the optimisation process around sub-optimal values of the leakage terms. For a reservoir with trained and converged output weights, a further variation of α's, even in the right direction, could correspond to an undesirable increase in the error function. We found that this problem of local minimum can be avoided by applying a high momentum in the optimisation process of α and randomly re-initialising the output weights when the α's are close to convergence. The random re-initialisation functions to keep the output weights from being too close to convergence. Thus, we defined the convergence of the algorithm for α's as when the α's do not change considerably after re-initialisation. When this happens, it is possible to turn off the learning on the leakage terms and to optimise the read-out only. More details about online training can be found in the discussions related to each task.

Results
The following sections are dedicated to the study of the role of timescales and the particular choices of α and ρ in various tasks, with attention on networks composed by a single ESN, 2 unconnected ESNs and 2 hierarchical ESNs. The number of trainable parameters in each task for the different models will be preserved by using the same total number of neurons in each model. The results analysed will be consequently interpreted through the analysis of timescales of the linearised systems.

NARMA10
A common test signal for reservoir computing systems is the non-linear auto-regressive moving average sequence computed with a 10 step time delay (NARMA10) [37,38]. Here we adopt a discrete time formalism where n = t/δt and the internal state of the reservoir is denoted as x n = x(nδt). The input, s n , is a uniformly distributed random number in the range [0, 0.5] and the output time-series is computed using where D = 10 is the memory length, a = 0.3, b = 0.05, c = 1.5, and d = 0.1. The task for the network is to predict the NARMA10 output y n given the input s n . We have adapted this to also generate a NARMA5 task where D = 5 but the other parameters are unchanged. This provides an almost identical task but with different timescales for comparison.
The task of reconstructing the output of the NARMA10 sequence can be challenging for a reservoir as it requires both a memory (and average) over the previous 10 steps and fast variation with the current input values to produce the desired output. A typical input and output signal is shown in Fig. 3A and the corresponding auto-correlation function of the input and output in B. Since the input is a random sequence it does not exhibit any interesting features but for the output the auto-correlation shows a clear peak at a delay of 9 δt in accordance with the governing equation. For a reservoir to handle this task well it is necessary to include not only highly non-linear dynamics on a short timescale but also slower dynamics to handle the memory aspect of the task.
This regression task is solved by training a set of linear output weights to minimise the mean squared error (MSE) of the network output and true output. The predicted output is computed using linear output weights on the concatenated where W is the weight vector of length N+1 when an additional bias unit is included. The MSE is minimised by using the ridge regression method [39] such that the weights are computed using where x is a matrix formed from the activation of the internal states with a shape of number of samples by number of neurons, y is the desired output vector, λ is the regularisation parameter that is selected using a validation data set and I the identity matrix. To analyse the performance of the ESNs on the NARMA10 task we use the normalised root mean squared error as whereỹ n is the predicted output of the network and y n is the true output as defined by Eq. 27.
To test the effectiveness of including multiple time-scales in ESNs, we simulate first a single ESN with N = 100 neurons and vary both α and ρ to alter the time-scale distribution. Secondly, we simulate a hierarchical ESN split into 2 reservoirs each with N = 50 neurons, where we vary α (1) and α (2) with ρ (1) = ρ (2) = 0.95. The input factor was set as γ (1) = 0.2 and γ (2) = 0 for the connected hierarchical ESN but when they are unconnected the input is fed into both, such that γ (1) = γ (2) = 0.2. In all cases the NRMSE is computed on an unseen test set and averaged over 20 initialisations of the ESN with a running median convolution is applied to the error surfaces to reduce outliers. In parallel to this we have also applied the online training method for the α hyper-parameters. The hyper-parameters used for the gradient descent learning are summarised in Table 1.  (1) and α (2) for 3 variations of the hierarchical ESN connection strength on the NARMA10 task. In the unconnected case (ρ (21) = 0, panels E and I), we find that the NRMSE drops by increasing both leakage rates but the minimum is when one of the leakage rates is ≈ 0.5. This is in agreement with the online learning method for the αs in shown in I but the error minimum is shallow and prone to noise in the signal or ESN structure. For the weakly connected hierarchical ESN (ρ (21) = 0.1, panels F and L) we find again that when the sub-reservoirs have different timescales the NRMSE is reduced. In comparison to the unconnected case the error surface is asymmetric with a minimum at approximately α (1) = 1.0 and α (2) ≈ 0.5. As the strength of the connection is increased (ρ (21) = 1.0, Panel G and M), the minimum error moves to a lower leakage rate in the second reservoir (α (2) ≈ 0.2) which reflects a better separation of the timescale distributions. This is a gradual effect with respect to the connection strength since stronger connection allows for a relative increase of the expanded input from the first reservoir compared to the base input signal. Since the input feeds into reservoir 1, a high α provides a transformation on the input over short time-scales, expanding the dimensionality of the signal, offering a representation that preserves much of the dynamic of the driving input and that is fed to the second reservoir. Then, since the latter does not have a direct connection to the input it performs a longer timescale transformation of the internal states of reservoir 1. In this way the reservoirs naturally act on different parts of the task, i.e. reservoir 1 provides a fast non-linear transformation of the input while reservoir 2 follows the slower varying 10-step average of the signal, and thus returning a lower NRMSE. As a side note, we can demonstrate the validity of the theoretical analysis in Section 2.1 by replacing the first reservoir by Eq. 23 on the NARMA task (see Section 3 Supplementary Material), resulting in a similar landscape as in Fig. 3G and a similar optimal value for α (2) . previous section the choice of these hyper-parameters becomes more evident. With α = 1 the timescale distribution of the network is sharply peaked close to the minimum timescale of 1 discrete step while when α = 0.1 this peak is broader and the peak of the distribution is closer to the second peak present in the auto-correlation function shown in Panel B. We note that whilst the most likely timescale is τ peak ≈ 6 for α = 0.1, ρ = 0.95 which is lower than the natural timescale of the problem, the increased width of the distribution increases the number of timescales at τ = 10 dramatically which maybe why a lower α is not necessary.
To further investigate the effect of the inherent timescale of the task on the timescales we performed a similar analysis on the NARMA5 task. Figure 3H and N show the NRMSE surface for the strongly connected case. The minimum error occurs at α (1) ≈ 1.0 (similar to the NARMA10 results in G and M) but α (2) ≈ 0.5 (as opposed to ≈ 0.2 for NARMA10). This is due to the shorter timescales required by the NARMA5 task and the peak timescale for these values is much closer to the peak in the auto-correlation shown in B. Panel D shows the performance of the single ESN where again the optimal leakage rate is α = 1 and similar to the unconnected cases but the NRMSE is higher than the connected cases.
In this theoretical task where the desired output is designed a priori, the memory required and the consequent range of timescales necessary to solve the task are known. Consequently, considering the mathematical analysis in section 2.1, and that for hierarchical ESNs the timescales of the first ESN should be faster than those of the second Fig. 3), the best-performing values of the leakage terms can be set a priori without the computationally expensive grid search reported in Fig. 3E-I. However, it can be difficult to guess the leakage terms in the more complex cases where the autocorrelation structure of the signal is only partially informative of the timescales required.
This problem can be solved using the online learning approach defined through Eq. 24. In this case, learning is accomplished through minibatches and the error function can be written explicitly as where N batch is the minibatch size and m is its corresponding index. A minibatch is introduced artificially by dividing the input sequence into N batch signals or by generating different NARMA signals. Of course, the two methods lead to equivalent results if we assure that the N batch sequences are temporally long enough. A learning rate η α /η W ≈ 10 −2 − 10 −3 was adopted. The optimiser used for this purpose is Adam, with the suggested value of β 1 = 0.9 adopted for the output weights and a higher first momentum β 1 = 0.99 adopted for the leakage terms. Instead, we set β 2 = 0.999 of the second momentum for both types of parameters (See section 5.1 for a description of the updating rules). Panels I-N show a zoomed in region of the error surface with the lines showing the online training trajectory of the α hyper-parameters. In each case the trajectory is moving towards the minimum NRMSE of the α phase space.

A volatile environment
We now turn to study the reservoir performance on a task of a telegraph process in a simulated noisy environment. The telegraph process s (1) (t) has two states that we will call up (1) and down (0), where the probability of going from a down state to an up state p(s = 1|s = 0) (or the opposite p(s = 0|s = 1)) is fixed for any time step. The environment is also characterised by a telegraph process s (2) (t), but the transition probability is much lower and controls the transition probability of the first signal. To simplify the notation in the following we denote the probability of the signal i transitioning from state a to state b as P (s (i) (t) = a|s (i) (t − δt) = b) = p The transition probabilities of the second signal are fixed and symmetric such that  and are parameters for the Adam optimiser (further details are given in the Supplementary Material). The † symbol indicates that the learning rate 5 × 10 −2 is for the case with 4 hidden states, while the learning rate 5 × 10 −3 is for the case with 28 hidden states. This decrease of η is due to the increase in the dimensionality of the representation for the latter case in comparison to the situation where the read-out is composed by four concatenated values of activity. Furthermore, such learning rates are 10 times higher than the case in which only the read-out is trained (only in the psMNIST task). Thus, the high learning rate adopted has the purpose to introduce noise in the learning process and to avoid local minima in the complex case where α and Wout are optimised simultaneously.
The probabilities p 1 , p 2 and p 3 are fixed parameters of the signal that define the process. Given that the second signal controls the probabilities of the first telegraph process, we say that it defines the regime of the input, while we refer to the up and down values of the first process simply as states. Thus, the reconstruction of s (1) (t) from the input will be called state reconstruction, while reconstruction of s (2) (t) will be called regime reconstruction. These reconstructions can be considered separately or as a joint task requiring the system to be modeled on long and short timescales simultaneously. Due to the probability transition caused by s (2) (t), both states and regime will be equally present over a infinitely long signal. The values adopted for the simulation are p 1 = 0.05, p 2 = 0.1 and p 3 = 0.0005.
The input signal corresponds to s (1) (t) + σN (0, 1), that is the faster telegraph process with additional white noise. The input signal constructed is a metaphor of a highly stochastic environment with two states and two possible regimes that define the probability of switching between the two states. The reservoir will be asked to understand in which state (s (1) (t) = 1 or 0) and/or regime (s (2) (t) = 1 or 0) it is for each time t, measuring the understanding of the model to estimate the state of the input signal. The input signal and telegraph processes is shown in Fig. 4A, while the B shows the corresponding auto-correlation structure of the processes. The auto-correlation shows that the input has a temporal structure of around 10 δt while the slow 'environment' process has a structure close to 1000 δt. This corresponds directly to the timescales defined by the probabilities of the signals.
Panels C and D of Fig. 4 show the performance of a single ESN when it is tasked to reconstruct the processes s (1) (t) (state recognition) and s (2) (t) (regime recognition) respectively. In this simulation, learning is always accomplished online and the error function is the same as Eq. 31. First, panel C demonstrates how the leakage term, α, must be tuned to the level of noise of the environment, and how lower values of α are desirable for noisier signals, in order to solve the state recognition problem. Indeed, the need to smooth the fluctuations of the input signal increases with σ, while for low values of noise the network should simply mimic the driving input. Second, panel D shows how the desirable values of α must be lower in the case where the network is asked to reproduce the slower dynamic of s (2) (t) independently of having to output the fast signal, in order to solve the regime recognition problem. This result exemplifies how the timescales of the network must be tuned depending on the desired output. It demonstrates that, even in this relatively simple environment, it is crucial to adopt multiple timescales in the network to obtain results that are robust with respect to a variation of the additional white noise σ. Finally, panels E and F of Fig. 4 show the accuracy of two unconnected (E) and connected (F) reservoirs when the network has to classify the state and the regime of the input signal at the same time. In this case, the desired output corresponds to a four dimensional signal that encodes all the possible combinations of states and regimes; for instance, when the signal is in the state one and in the regime one, we would require the first dimension of the output to be equal to one and all other dimensions to be equal to zero, and so on. The best performance occurs when one leakage term is high and the other one is low and in the range of significant delays of the auto-correlation function. This corresponds to one network solving the regime recognition and the other network solving the state recognition. For the unconnected reservoirs, it does not matter which reservoir has high vs. low leakage terms, reflected by the symmetry of Fig. 4E, while for the connected reservoirs, the best performance occurs when the first reservoir has the high leakage term and the second the low leakage terms, see Fig. 4F, similar to Fig. 3. Both two-reservoir networks can achieve accuracy 0.75, but the single ESN can not solve the task efficiently, since it cannot simultaneously satisfy the need for high and low αs, reporting a maximum performance of about 0.64.
The path reported in panel C of Fig. 4 and all panels in Fig. 5 show the application of the online training algorithm in this environment. The values of the hyper-parameters adopted in the optimisation process through the Adam optimiser are the same as in section 3.1, where we used a slower learning rate and a higher first momentum on the leakage terms in comparison to the values adopted for the output weights. The line of panel C (Fig. 4) shows the online adaptation of α for a simulation where the external noise increases from one to four with six constant steps of 0.5 equally spaced across the computational time of the simulation. The result shows how the timescales of the network decrease for each increase in σ, depicted with a circle along the black line. The path of online adaptation reports a decrease of the α value for noisier external signals. This result occurs because as the signal becomes noisier (σ rises), it becomes more important to dampen signal fluctuations. This result also shows that the online algorithm can adapt in environments with varying signal to noise ratio. Panels A, B, C of Fig. 5 show the online training of α (1) and α (2) for an environment composed by a faster and a slower composition of telegraph processes. This specific simulation is characterised by the alternation of two signals defined by Eq. 32, 33 and 34, each with different values of p 1 and p 2 . In particular, while p 1 = 0.5 and p 2 = 0.1 for the 'fast' phase of the external signal, p 1 = 0.1 and p 2 = 0.05 for the 'slow' phase. In contrast, the slower timescale of the task defined by p 3 = 0.0005 remains invariant across the experiment. Panel C shows the adaptation of the leakage terms for this task in the case of a hierarchical structure of ESNs. While α (2) adapts to the change of p 1 and p 2 following the transition between the two phases of the external signals, the relatively constant value of α (1) indicates how the first network sets its timescales to follow the slower dynamic of the signal, characterised by the constant value of p 3 . Thus, the composed network exploits the two reservoirs separately, and the first (second) reservoir is used to represent the information necessary to recognise the regime (state) of the external signal.

Permuted Sequential MNIST
The Permuted Sequential MNIST (psMNIST) task is considered a standard benchmark for studying the ability of recurrent neural networks to understand long temporal dependencies. The task is based on the MNIST dataset, which is composed of 60, 000 handwritten digits digitised to 28x28 pixel images. In the standard MNIST protocol every pixel is presented at the same temporal step so a machine has all the information of the image available at once and needs to classify the input into one out of ten classes. In contrast, in the psMNIST task, the model receives each pixel sequentially once at a time, so that the length of the one dimensional input sequence is 784. Thus, the machine has to rely on its intrinsic temporal dynamic and consequent memory ability to classify the image correctly. Furthermore, each image in the dataset is transformed through a random permutation of its pixels in order to include temporal dependencies over a wide range of input timescales and to destroy the original images' structure. Of course, the same permutation is applied on the entire dataset. The performance of ESNs on the MNIST dataset, where each columns of pixels in a image is fed to the network sequentially (each image corresponds to a 28 dimensional signal of length 28 time steps), has been analysed in [40] and in [41]. In [40] the original dataset was preprocessed through reshaping and rotating the original image to enhance the network's ability to understand high level features of the data. In this case, the original dataset is used. In [41], the addition of thresholds and the introduction of sparse representation in the read-out of the reservoir was used to improve the performance of the network in the online learning of the standard MNIST task through reservoir computing. This section is focused on the analysis of the performance of ESNs on the psMNIST task and on their dependence on the range of timescales available in the network, i.e. the values of α and ρ chosen. In contrast to the previous sections where ESNs are trained through ridge regression, we have applied an online gradient descent optimisation method. The cost function chosen to be minimised is the cross entropy loss where m is the minibatch index, N batch corresponds to the minibatch size and N class is the number of classes. For this task the desired output, y j , is a one-hot encoded vector of the correct classification while the desired output is a sigmoid function of the readout of the reservoir nodes. Furthermore, instead of reading out the activity of the reservoir from the final temporal step of each sequence only, we have expanded the reservoir representation by using previous temporal activities of the network. In practice, given the sequence of activities x(0), x(δt), ..., x(δtT ) (T = 784) that defines the whole temporal dynamic of the network subjected to an example input sequence, we trained the network by reading out from the expanded vector X = x(M δt), x(2M δt), ..., x(T δt) , where M defines the 'time frame' used to sample the activities of the evolution of the system across time.
, where sigm stands for sigmoid activation function. We then repeat the simulation for two different time frames of sampling for each different model, that is a single ESN and a pair of unconnected or connected ESNs, as in the previous sections.
The two values of M used are 28 and 196, corresponding to a sampling of 28 and 4 previous representations of the network respectively. Of course, a higher value of M corresponds to a more challenging task, since the network has to exploit more its dynamic to infer temporal dependencies. We note, however, that none of the representation expansions used can guarantee a good understanding of the temporal dependencies of the task, or in other words, can guarantee that the system would be able to discover higher order features of the image, considering that these features depend on events that could be distant in time.
In Fig. 6 we again analyse the performance of two connected or unconnected ESNs varying α (1) and α (2) for both M = 28 and 196. In contrast to the previous sections, we now use gradient descent learning on the output weights instead of ridge regression and increase the total number of neurons in each model to N = 1200 due to the complexity of the task. The Adam optimiser is used; its parameters, for both the output weights and α learning, are in Table 1. As previously, we have trained the output weights over a range of fixed αs and report the performance on an unseen test data set. In parallel to this we have trained both the output weights and α values which, as shown by the lines on the contour plots, converge towards the minimum computed using the fixed α's.
As in the other simulations, we found that the values of ρ corresponding to the best performance was approximately one, which maximises the range of timescales and the memory available in the network. Fig. 6E-F shows the case with M = 28, while Fig. 6G-H reports the accuracy for the simulation with M = 196 where E and G are unconnected and F and H connected reservoirs. The accuracy surface demonstrates how, in the case of the unconnected ESNs with a fast sampling rate in panel G, the best performance is achieved when at least one of the two values of α is close to one. The result is due to the fast changing dynamic of the temporal sequence that is introduced through the random permutation of the pixels. On the contrary, in the case of the unconnected ESNs with a slow sampling rate in panel E the best accuracy is in a range of intermediate timescales since both partitions must respond to both fast and slow timescales.
This relatively simple behaviour of the dependence of the accuracy on the setting of the hyper-parameters changes in the cases of two connected ESNs, whose additional complexity corresponds to a considerable increase in the performance. Fig. 6H reports how the network prefers a regime with a fast timescale in the first reservoir and a intermediate timescale in the second, which acts as an additional non-linear temporal filter of the input provided by the first network. The need of memory of events distant in time is emphasised in 6F, where the best performing network is composed by reservoirs with fast and slow dynamics respectively. The performance boost from the panels E-G to the ones F-H has only two possible explanations: first, the timescales of the second network are increased naturally thanks to the input from the first reservoir; second, the connections between the two reservoirs provide an additional non-linear filter of the input that can be exploited to discover higher level features of the signal. Thus, we can conclude once again that achieving high performance in applying reservoir models requires (1) additional non-linearity introduced through the interconnections among the reservoirs and (2) an appropriate choice of timescales, reflecting the task requirements in terms of external signal and memory.
Panels I, L, M and N show the application of the online training of αs for the various cases analysed. In the psMNIST task we found that the major difficulties in the application of an iterative learning rule on the leakage terms are: the possibility to get trapped in local minima, whose abundance can be caused by the intrinsic complexity of the task, the intrinsic noise of the dataset, the randomness of the reservoir and of the applied permutation; the high computational time of a simulation that exploits an iterative optimisation process on αs arising from a practical constraint in the implementation. Indeed, while the activities of the reservoir can be computed once across the whole dataset and then saved in the case of untrained values of αs, the activities of the nodes need to be computed every time the leakage terms change in the online learning paradigm. However, we found that using a higher learning rate η W on the output weights, compared to the value adopted in the paradigm where the leakage terms are not optimised (as in Panels E, F, G and H), can introduce beneficial noise in the learning process and help to avoid local minima. Furthermore, a higher value of the learning rate on the output weights corresponds to an increased learning rate on the thresholds, as shown from Eq. 43 and from the dependence of the updating equations on W out . As in the previous simulations of Sections 3.1 and 3.2, the output weights are randomly reinitialised after the convergence of αs, helping the algorithm to avoid an undesirable  (1) and α (2) . The paths are smoothed through a running average. quick convergence of weights. The online process is then ended when the leakage terms remain approximately constant even after the re-initialisation. Following this computational recipe, it possible to avoid the difficulties found and train the leakage terms efficiently.
Finally, we note how the best accuracy of 0.96 reached throughout all the experiments on the psMNIST is comparable to the results obtained by recurrent neural networks trained with BPTT, whose performance on this task are analysed in [42] and can vary from 0.88 to 0.95. In comparison to recurrent structures trained through BPTT, a network with two interacting ESNs provide a cheap and easily trainable model. However, this comparison is limited by the necessity of recurrent neural networks to carry the information from the beginning to the end of the sequence, and to use the last temporal state only or to adopt attention mechanisms.

Conclusion
In summary, ESNs are a powerful tool for processing temporal data, since they contain internal memory and time-scales that can be adjusted via network hyper-parameters. Here we have highlighted that multiple internal time-scales can be accessed by adopting a split network architecture with differing hyper-parameters. We have explored the performance of this architecture on three different tasks: NARMA10, a benchmark composed by a fast-slow telegraph process and PSMNIST. In each task, since multiple timescales are present the hierarchical ESN performs better than a single ESN when the two reservoirs have separate slow and fast timescales. We have demonstrated how choosing the optimal leakage terms of a reservoir can be aided by the theoretical analysis in the linear regime of the network, and by studying the auto-correlation structure of the input and/or desired output and the memory required to solve the task. The theoretical analysis developed needs to be considered as a guide for the tuning of the reservoir hyper-parameters, and in some specific applications it could be insufficient because of the lack of information about the nature of the task.
In this regard, we showed how to apply a data-driven online learning method to optimise the timescales of reservoirs with different structures, demonstrating its ability to find the operating regimes of the network that correspond to high performance and to the best, task-dependent, choice of timescales. The necessity of adopting different leakage factors is emphasised in the case of interactive reservoirs, whose additional complexity leads to better performance in all cases analysed. Indeed, the second reservoir, which acts as an additional non linear filter with respect to the input, is the perfect candidate to discover higher temporal features of the signal, and it consequently prefers to adopt longer timescales in comparison to the first reservoir, which has instead the role of efficiently representing the input. We believe such hierarchical architectures will be useful for addressing complex temporal problems and there is also potential to further optimise the connectivity between the component reservoirs by appropriate adaptation of the online learning framework presented here.

Online Learning
The online learning method formulated is similar to the approach followed in e-prop by [4] (see also [36]), a local learning rule for recurrent neural networks that exploits the concept of an eligibility trace, and in [32]. As in these previous works, we approximated the error function to neglect the impact that the instantaneous and online changes of the network's parameters have on future errors. In particular, considering a recurrent neural network as the one depicted in the computational graph in Fig. 7A and the dependencies of the error function E on the activities x(t) Eq. 37 and 38 define the algorithm back-propagation through time, where the dependencies of dE dx(t) on activities at future time t do not permit the definition of an online learning rule. As in the works of [32] and [4] we decided to adopt the following approximation We will now derive the equations defining the iterative learning approach for the example cost function whereỹ is the desired output and y = W out x(t) is the output of the ESN. Then, we desire to compute ∂E/∂α (k) , which describes the leakage term k for a network compose by multiple reservoirs. In particular, the case of two connected ESNs in considered and analysed here, while the more general case with N interacting ESNs can be easily derived following the same approach. In this case, the vector of activities x(t) = x 1 (t), x 2 (t) is composed by the juxtaposition of the vectors of activities of the two reservoirs.
That can be computed online tracking the eligibility traces (22) (t) and updating them in an iterative way. Of course, for the more general case of N connected reservoirs, the number of eligibility traces to be computed would be N 2 . We note how the differences between the connected and unconnected reservoirs are: e (21) (t) = 0 in the latter case, since the activity of the second reservoir does not depend on the activities of the first; e (22) (t) would have an analogous expression to e (11) (t) in the case of unconnected reservoirs.
In order to understand the meaning of the approximation in Eq. 39, we can consider the psMNIST task defined in section 3.3, in which two different numbers of previous hidden states are used for classification. In this example, the future terms t from which dE dx(t) depends correspond to the concatenated temporal steps t l l=1,...,Nconc used for the readout. Following the computational graph in panel B of Fig. 7 , the approximation of BPTT is where the contribution of the terms corresponding to q>l ∂E(t q ) ∂x(t q ) J tqt l are neglected. The number of these terms increases as the number of hidden states used to define the read-out rises, and the contribution of the matrices J tqt l becomes more important when the hidden states utilised are in closer proximity. Thus, the approximation used to define the online training algorithm is less precise for an increasing number of hidden states used. This consideration can be observed in Panels C and D of Fig. 7, in which the values of the gradients are compared to those given by BPTT for the two different numbers of concatenated values adopted in Section 3.3.
Given the gradients with respect of the parameters of the network dE dα (k) and dE dWij (W are the output weights here) in our simulations, we used the Adam optimisation algorithm, described below for completeness for a general parameter α (that could be one of the leakage terms or W ij ). The blue arrows represent the factors considered, while the red arrows correspond to the factors that are neglected in the approximation. Each mathematical term adjacent to an arrow is a multiplicative factor in the contribution of a path of dependencies in the computation of dE dx(t) . C-D: Comparison of a running average of the derivative estimated through the online training algorithm used (red and blue lines) and BPTT (dots and triangles). The approximation is less precise when the number of hidden states used for the read-out increases, as it is evident from the greater distance between the blue trend and the dots in panel D.
where t is the index corresponding to the number of changes made and m 0 = 0, v 0 = 0.

Timescales, oscillations and eigenvalues
We stimulated the reservoir with a square wave of duration 200δt (the time frame of the considered simulation) and analysed the system activity after the impulse to study its relaxation dynamics. Thus, we exploited the fact that, given a system described by dx dt = Mx(t) and where V are the left eigenvectors of M, i.e Thus the dynamics of the eigenvectors will be given by where Λ is the diagonal matrix composed by the eigenvalues of the matrix M. Of course, in the case considered M = (1 − α)I + αW and Re(λ) = 1 − α + αλ W , Im(λ) = αλ W . Thus, considering a column v of V and the corresponding eigenvalue λ v T x(t) = e Re(λ)t Re v T x(0) cos Im(λ)t − Im v T x(0) sin Im(λ)t , can be used to compare the true dynamic V T x(t) with the linearised one. Fig. 8 shows the result of this procedure for each dimension of V T x(t). Panel A reports example activities and their corresponding theoretical trend for the case of small input values (γ = 0.05, see 2), case in which the system can be well approximated through a linear behaviour. Panel B shows the RMSE and NRMSE 1 between the experimental activities and the theoretical one as γ increases. In this case, with y i (t) = v T x(t) experimentally observed, whileỹ i (t) estimated through the right side of Eq. 56

Delayed Signal to ESN
We computationally validate the equation 58 (below), derived in Section 2.1, on the NARMA10 task. The NARMA10 task is described in full in Section 3.1 (Main Text).
In order to approximate the scaling of the coefficients of the derivatives in Eq. 58, we incorporate a delay into the input signal such that the activity of the first reservoir is replaced by  Figure 9: Performance of an ESN when the signal is expanded by addition of its previous temporal steps in the NARMA10 task.

NRMSE
The lowest error corresponds to a leakage term α (2) that is in agreement with the optimal value of α (2) of the connected ESN structure reported Section 3.1 (Main Text).
where ξ ij are independent Gaussian variables of variance σ 2 ξ chosen such that Var[x (1) i ] = 1 for every i and every value of Delay. In practice, we adopted the following approximation; The stochastic elements ξ ij emulate the random mixing matrix that, in Eq. 58, projects the expanded input onto the second reservoir network.