Using photonic reservoirs as preprocessors for deep neural networks

Artificial neural networks are very time consuming and energy intensive to train, especially when increasing the size of the neural network in an attempt to improve the performance. In this paper, we propose to preprocess the input data of a deep neural network using a reservoir, which has originally been introduced in the framework of reservoir computing. The key idea of this paper is to use such a reservoir to transform the input data into a state in a higher dimensional state-space, which allows the deep neural network to process the data with improved performance. We focus on photonic reservoirs because of their fast computation times and low-energy consumption. Based on numerical simulations of delay-based reservoirs using a semiconductor laser, we show that using such preprocessed data results in an improved performance of deep neural networks. Furthermore, we show that we do not need to carefully fine-tune the parameters of the preprocessing reservoir.


Introduction
Deep neural networks (DNNs) are a key part of the domain of machine learning and artificial neural networks. They have shown to be widely applicable for various problems, e.g. natural language processing [1], drug discovery [2], detection of medical images [3][4][5], fault diagnosis in solar panels [6] and many other applications. They are able to achieve very high accuracy on these tasks because of their ability to generalise underlying patterns and features. This is achieved by training the network, which is performed by means of fitting a large dataset of input samples to a model. In order to achieve high accuracies, one often uses large and complex networks, with a large number of parameter weights [7]. However, increasing the number of layers and parameters in DNNs is computationally difficult and requires much memory, time and energy. For example, the GPT-3 autoregressive language model [8] contains 175 billion parameters which need to be optimized. Several hardware implementations have been investigated to accelerate the training of such DNNs. Examples include application specific integrated circuits (ASICs), graphics processing units (GPUs) and field programmable gate arrays (FPGAs) [9,10].
Instead of the aforementioned electronic hardware accelerators, several different photonic hardware accelerators have also been investigated [11,12]. Examples of this include coherent nanophotonic circuits [13], diffractive deep neural networks [14], all-optical neural networks [15] and photonic convolutional accelerators for optical neural networks [16]. The reason why photonics-based accelerators are particularly interesting is because of their high bandwidth and high power efficiency. Different from hardware accelerators, one can also look into preprocessing the input data before injecting it into the DNNs to improve their performance. For example, using a diffuser that generates speckle patterns can improve the scalability and generalization when using convolutional neural networks [17]. Additionally, hybrid data processing architectures have also been developed, such as LightOn's optical processing unit which functions as a fast external preprocessor [18,19]. Motivated by these photonic implementations, we look into the opportunities offered by photonic reservoir computing (RC) as preprocessor for DNNs.
Reservoir computing systems are recurrent neural networks which consist of three layers: an input layer, a reservoir and an output layer. The input layer is where the data is injected into the system and the output layer is where predictions are made using the input data, typically using a linear read-out. The reservoir contains a large amount of randomly connected nonlinear nodes. An illustration of such an RC system is shown in Figure 1A. Different from deep neural networks is that the weights of the reservoir are fixed and are not altered. Only the weights in the output layer are trained, shortening the training time of the network drastically. This also reduces the overall complexity of RCs and makes them ideally suited for hardware implementations using various physical systems [20]. They have been successfully applied for various prediction tasks, such as speech recognition [21,22], time-series predictions [23][24][25] and nonlinear channel equalization tasks [26].
The reason why RC systems perform well with a simple linear read-out layer (shown in Figure 1A) as output layer, is because the reservoirs transform the input in a high-dimensional statespace. Such a transformation, however, might also be very useful to be done at the input of a DNN such that the combined performance of a reservoir and DNN surpasses that of the DNN and/or that of the RC. In this paper, we investigate this conjecture, and we will do that based on a specific photonic reservoir by way of example.
We focus on a photonic reservoir, because of its fast information processing rate, high bandwidth and small power consumption [27,28]. More specifically, we use a photonic delaybased reservoir based on a single-mode semiconductor laser to function as a fast, low-energy preprocessor for digital deep neural networks. This optoelectronic delay-based reservoir has already shown good performance for various tasks in previous experiments, e.g. for spoken-digit recognition, time-series prediction of the Santa Fe dataset [29], speech recognition tasks [30], and nonlinear channel equalization tasks [26].
We investigate and compare the performance of various networks. We use the photonic delay-based reservoir, combined with either a linear read-out layer such as in conventional reservoir computing, or combined with a deep neural network. In case of the combined reservoir and DNN system, we feed the output of the reservoir as input to the DNN, making the photonic reservoir in this coupled system a preprocessor. This basically means that the linear read-out layer in Figure 1A is replaced with a DNN, as shown in Figure 1B. We compare the performance of these networks with other neural networks, such as the DNN alone and a long short-term memory (LSTM) network. This latter network contains a form of memory, making it ideal to be used to make comparisons with the other networks for the specific benchmark task used in this paper. We investigate the performance of the different networks on a time-series prediction task as benchmark and give a heuristic explanation for these results making use of their memory capacity. The final goal of this paper is to investigate whether photonic reservoirs are viable to preprocess input data when coupled to DNNs, resulting in a better performance compared to DNNs without such a preprocessing network. This allows one to use a fast photonic reservoir, combined with a relatively small DNN (which needs limited training) to achieve good results on benchmark tasks. This paper is organised as follows. Section 2 gives a short introduction on the various networks used in this paper: delay-based reservoir computing, together with the numerical model used to simulate this RC system (in Section 2.1), the deep neural network (in Section 2.2) and the long short-term memory network (in Section 2.3). The combined system, consisting of the preprocessing reservoir and DNN, which is the novel architecture, is described in Section 2.4. In these sections, we also discuss how we perform training. Additionally, we explain the benchmark task which is used (in Section 2.5) and describe a heuristic explanation of the results of the networks by their memory capacity (in Section 2.6). In Section 3, we give an overview of the performance results of the different networks. Section 4 concludes with a discussion of the results in this paper.

Photonic delay-based reservoir computing
In this work, we simulate a reservoir computing system which is based on a semiconductor laser (SL) with delayed optical feedback [31]. The layout of our RC system is shown in Figure 2, where we show in detail how the different layers of Figure 1A are implemented in our system. The input layer is where the input data is optically injected into the reservoir layer using an unbalanced Mach-Zehnder modulator (MZM). Before injecting the input data samples into the reservoir, these data samples are first convoluted with a mask m(t). The reservoir is made using a single-mode semiconductor laser with delayed optical feedback. The delay time is indicated by τ. In the output layer, the intensity of the SL is measured by a photodetector and is sampled at θ time intervals to represent the response at the virtual nodes. These intensity values are then Simplified architecture of a delay-based reservoir computing system, with the input layer, reservoir and linear read-out layer. SL represents the semiconductor laser, driven by a constant pump current. In the input layer, a mask m(t) encodes the input sample u k , which is in turn injected into the reservoir using a Mach-Zehnder modulator (MZM). In the reservoir, we have a node separation θ and a delay time τ, the virtual nodes are depicted by the cyan circles. In the linear read-out layer, we show the predicted target dataŷ k . (A). Simplified representation of the network architecture of the reservoir, functioning as preprocessor, coupled to the DNN, with its nodes depicted by blue circles (B).

Frontiers in Physics
frontiersin.org 03 multiplied with the trained weights in order to make a prediction [27,32].
The simulations for the delay-based RC system used in this paper are based on numerical integration of the following rateequations [33].
Where E is the dimensionless complex-valued slowly-varying envelope of the electric field of the laser and N the excess amount of carriers. α is the linewidth enhancement factor, ξ the differential gain and g the threshold gain of the semiconductor laser. η and μ are the feedback and injection rates. ϕ FB represents the phase difference between the electric field of SL in Figure 2A and the feedback term, which is delayed with the delay time τ.F β (t) represents the Gaussian distributed white noise, modelling spontaneous emissions. This noise strength is controlled by β and has following properties, 〈F β (t)〉 0 and 〈F β (t)F β (t′)*〉 β/τ c δ(t − t′), with τ c the carrier lifetime. ΔI/e is the excess pump current divided by the elementary charge, where ΔI = I − I thr , with I the pump current and I thr the threshold pump current of the SL. The injection of input data is handled by an MZM, which optically injects data into the reservoir. This injection is modeled in the simulations using E inj , where E inj (t) ϵ(1 + e iB(t) ), and is performed with no frequency detuning as E inj is injected on the same frequency as the optical frequency of the free running semiconductor laser [34]. The amplitude of the injected electric field is represented by ϵ, and the injected data signal by B(t), where B(t) = A S(t) + Φ. A and Φ are here the modulation amplitude and bias of the MZM, and S(t) the masked input signal that is defined as the convolution between the k-th input sample -from a total of n input samplesu k , and the mask m(t), The mask m(t) is a piecewise constant function, with N values randomly selected from [0, 0.25, 0.5, 0.75, 1], where each sublevel has a time length equal to the node separation θ. This ensures that the total length of the mask equals Nθ. In this work, we have chosen for the period of the mask Nθ being equal to the delay time τ, where τ = 4 ns. This choice is taken out of simplicity, although an improvement in performance has been observed in literature when a small mismatch in the mask length is introduced [35]. The parameters which are used in this work are summarized in Table 1, and are taken from [34] where it is shown that these parameter values lead to good RC performance.
To construct the linear read-out layer, the light intensity is calculated as |E(t)| 2 , as in a practical implementation one would sample the reservoir's output using a photodetector that measures intensities. We sample the intensity after every time interval of duration θ, and thus we calculate N samples of the intensity during every input sample u k .
A state matrix A is then constructed with the N sampled intensities stored in the columns of this matrix, for every input data sample, stored in the rows. This state matrix has the dimensions (n × N) and is normalized between 0 and 1, by dividing it with the maximum value found in this matrix. This state matrix is used during the training phase to obtain the output weights w corresponding to the N nodes of the reservoir, via a least squares minimization procedure with the expected data y, w A † y.
( 4 ) where we have used the real Moore-Penrose pseudoinverse (denoted by the symbol † ). The predicted valuesŷ for the data samples are thus given byŷ Aw.
Note that in Eq. 4, we actually add an additional bias node in the state matrix which can account for possible offsets in the data, so that the dimensions of A are here (n × (N + 1)). The addition of this extra bias node is only done for the linear read-out layer of the reservoir computing system. Adding an extra bias node is not performed for the other networks discussed in this paper, because they already have biases in their networks. The weights of Eq. 4 are then used in the test phase, where we apply these weights to unseen data and measure its performance.

Deep neural network
The DNN model we employ in this paper uses as input the data, with 200 features (unless stated otherwise), and returns a single value as a prediction. The reason for the specific amount of features is explained in Section 2.5. The network consists of three fully-connected layers, where the amount of nodes is decreased throughout the network. The first hidden layer in the DNN is a fully-connected layer with 100 nodes, with a ReLU activation function in each of the nodes [36]. The next hidden layer is another fully-connected layer, now with 50 nodes, and again with the ReLU activation function. This information is then injected into the final layer, which only has one node. The output of this node corresponds to the target data and results in a predictionŷ k for the data sample. Note that this network architecture is not necessarily the most optimal structure for the problems considered in this paper. Its architecture was chosen by a small preliminary study, where the amount of nodes is decreased with a factor of two in the next hidden layer compared to each previous layer. The optimization of this architecture, however, is not within the scope of this paper and thus we do not alter the DNN architecture itself here.
This DNN model is trained using the Adam optimizer [37], with a learning rate of 10 −3 for 10 4 epochs (unless stated otherwise). In order to avoid overfitting on the data, we use early stopping on a validation set, which is different from the training and test set. The loss, defined here as the mean squared error between the expected target data samples y k and the predicted target data samplesŷ k , is calculated for every epoch for the training and validation set. If the loss for the validation set increases for a certain amount of epochs, the training is stopped. Once the model is trained after the defined epoch amount, the model weights are used on the test data and the performance of the trained model is evaluated. We implement the DNN on a standard consumer mobile CPU (Intel Core i7-10710U) using the PyTorch 1.4.0 and NumPy 1.19.2 libraries, with Python 3.8.5 as programming language.

Long short-term memory
The DNN from the previous section does not contain recurrent connections and will thus have no memory in the internal network. In the time-series forecasting task that we use as benchmark (see details in Section 2.5), such memory might be essential to achieve good performance. Therefore, and in order to further validate any performance gains by the reservoir as preprocessor, we also consider another type of neural network with recurrent connections. More specifically, we also study the performance of the long short-term memory (LSTM) networks [38,39].
The LSTM network was first introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber as a solution to the vanishing and exploding gradient problem of recurrent neural networks, which can occur when training an artificial neural network [40]. LSTMs are gradient-based recurrent neural networks with feedback connections and are able to introduce a lasting type of memory of recent inputs in the network during training. This is accomplished by using memory cells with gate units, where a connection is made to the neuron itself at the next time step during training. They have been proven to be very useful in many tasks, e.g. for speech recognition, natural language modeling and text classification [41,42].
We use an LSTM network that takes input data with 200 features. The LSTM network used in this paper has a hidden dimension of 500 nodes, with the ReLU activation function and batch normalisation. Additionally, a recurrent dropout is used with a rate of 50%. The final layer consists of a single fully-connected node, and results in a predictionŷ k for the data sample. Note that this network architecture is not necessarily the most optimal structure of the LSTM for the problems considered in this paper. For example, the amount of nodes is found by a small preliminary study, and is most likely not the optimal amount. The optimization of this architecture, however, is not within the scope of this paper and thus we do not alter the LSTM architecture itself here.
We again train using the Adam optimizer, with a learning rate of 10 −4 for 10 4 epochs. Early stopping on a validation set is also used here to avoid overfitting. The loss, defined here as the mean squared error between the expected target data sample y k and the predicted target data sampleŷ k , is calculated for every epoch for the training and validation set. If the loss for the validation set increases for a certain amount of epochs, the training is stopped. Once the model is trained after the defined epoch amount, the model weights are used on the test Photonic reservoir as preprocessor for deep neural networks With the aim of achieving an improved performance when combining systems, we use the photonic delay-based reservoir from Section 2.1 as preprocessor, together with the DNN described in Section 2.2, with the same (hyper)parameters. This means that the conventional linear read-out layer from Figure 2A is now replaced with the DNN, which is shown in Figure 2B. This is done as follows: we inject the input data samples as described in Section 2.1 into the reservoir. After the reservoir, we calculate the light intensity from the simulated time traces as |E(t)| 2 and perform the sampling in order to construct the state matrix A. Instead of performing training using the conventional linear read-out in the output layer, we now use a DNN and train its weights.
The advantages of such a photonic reservoir is that in principle it can be very fast to run, promises a low power requirement [27,28] and has inherently a form of memory due to the delayed feedback, making it an ideal candidate to preprocess the input data for the DNN. This means that all timedependent information processing is done by the fast analog photonic reservoir and the DNN is used for regression.
The physical implementation of the fast photonic reservoir can be based on existing (delay-based) RC systems. An example of such an RC system is described in [43], where a semiconductor laser is coupled with a fiber-optic feedback loop. This implementation achieved good performance on several tasks such as speaker recognition, spoken digit recognition and chaotic time-series prediction, and this at data rates above 1 Gbyte/s. Such a system, however, can become quite large to physically implement with many components. Instead, integrated implementations have been proposed and tested. For example, in [31], a compact delay-based reservoir computing system is demonstrated by using a laser and delay line on an InP photonic integrated circuit. They showed in their experiment that the chip was able to operate at 0.87 GSa/s, for a reservoir which contains 23 nodes. Another example of such an integrated implementation is given in [44]. The authors demonstrate a spatially parallel delay-based photonic reservoir, where multiple integrated optical cavities are used, resulting in the possibility for a much higher amount of virtual nodes. The experimental maximum bandwidth was here only limited by the used arbitrary waveform generator oscilloscope to 20 GHz. Such compact on-chip implementations of photonic reservoirs would allow for small-scale implementations of the combined system, where the information of the light intensity of the reservoir is measured by photodetectors and then digitally sent to the DNN which is implemented on e.g. FPGAs, ASICs, etc. Where the training is performed.
The DNN again takes as input data with 200 features, due to the reservoir containing 200 nodes in our case, so that every node from the reservoir is now sent as input to the DNN for every input sample, as shown in Figure 2B.
It is important to note that we do not further optimize the architecture of the DNN itself, but rather the reservoir with which we want to couple it with. This is done because the optimization of the DNN is out of the scope of this paper and because we mainly focus on the reservoir itself as the preprocessor.

One-step ahead prediction task: Santa Fe dataset
In order to compare the performance of the different networks, we evaluate their normalized mean squared error on a one-step ahead time-series prediction task as benchmark. The used input dataset is the Santa Fe dataset which consists of 9093 data samples sampled from a far-IR laser in a chaotic regime [45]. The Santa Fe data is first normalized over the whole dataset before being used in the networks, resulting in u k ∈ [0,1].
For the networks containing a reservoir, we inject the first 3010 data samples, which function as the training dataset. As test dataset, we inject the following 1010 data samples, with an interval of 10 samples between the training and the test dataset. For both training and test set we discard the first 210 samples to avoid transients occurring from switching datasets (10 samples) and to be able to compare with the other networks that do not contain a reservoir (200 samples). This results in a training dataset and test dataset with sizes of 2800 and 800 input samples for the networks containing a reservoir.
For the networks without a reservoir, we have to redefine the input as we do not have a state matrix here, which normally originates from the reservoir. As input for these networks, we create a matrix analogous to such a state matrix, but which features the preceding data samples, instead of the intensity response of the nodes of the reservoir. For this, we now use a sliding window over the data samples to create the input for the neural networks. The 200 data samples preceding a data sample are now used to predict the next data sample. This means that the first row of the matrix contains the first 200 Santa Fe data samples to predict the 201st data sample, the second row contains data samples 2 to 201 to predict the 202nd data sample and so on. This results in a training dataset and test dataset with dimensions (2800 × 200) and (800 × 200), which are the same dimensions as the state matrices used in the networks with reservoirs. The difference now is that the resulting matrices use the 200 previous data samples as features to predict the next data samples. Note Frontiers in Physics frontiersin.org 06 that we will split the training dataset into a true training dataset, containing 2300 samples, and a validation set, containing 500 input samples.
The final performance of the different networks is evaluated by their normalized mean squared error (NMSE) on the test set between the expected output y and predicted outputŷ, The network with the lowest NMSE thus results in the best performance for this particular task. The typical state-of-the-art values for the normalized mean squared error on the Santa Fe one-step ahead predictions via RC system simulations are around 0.01 [46,47].

Memory capacity
In order to give a heuristic explanation for the increase in performance between the delay-based reservoir computer and the DNN which uses the reservoir as preprocessor, we will calculate their task-independent memory capacities. This is done using techniques from [48] where we inject discrete input samples u k from a uniform distribution between [−1, +1] into the reservoir and train the weights to reconstruct products of normalized Legendre polynomials via l i -delayed past input samples u(k − l i ) [48,49]. In this work, we will limit this delayed input to l i ≤ 10, and we calculate the memory capacity up until degree 5 to limit the computation time.
The target data y(k) is produced by a product of Legendre polynomials P δ (·), of a certain degree δ, from previous inputs. Note that there exist multiple combinations of Legendre polynomials if we consider combined degrees greater than 1. For example, if the combined degree d = 3, we have to consider product combinations of Legendre polynomials with degrees: δ ∈ {(3), (2, 1) and (1, 1, 1)}. For example, the combination δ = (2, 1) for the third degree can then be written as a product of Legendre polynomials with degree 2 and 1: where (l 1 , l 2 , l 3 ) are the set of indices of past input samples, which we have limited to l i ≤ 10 here. This means that we also have many different possible combinations for these values, adding to the complexity when calculating the memory capacity. The mean squared error between the expected signal y(k), such as for the example given in Eq. (7), and the predicted signalŷ(k) can be calculated for all input samples. The mean squared error, for a specific product combination δ and a set of specific indices of past inputs (l 1 , . . . ), is used for calculating the memory capacity, If we then sum C y (δ) over all possible Legendre combinations which give rise to the combined degree d, we have the memory capacity per degree d, The total computational capacity of the system, CC, can be calculated by summing over all memory capacities C d , for all degrees. According to [48], its theoretical upper limit for the delay-based RC system considered in this paper is given by the amount of virtual nodes N. Since we are only using a finite amount of input data, we could potentially overestimate the value for the memory capacity C d . To avoid this, we implement a threshold capacity C thr , so that values below C thr are not considered for calculating the memory capacity, as discussed in [48]. The reason for this is because memory capacities below C thr are not statistically relevant and are thus removed for its calculation.
In order to limit the computational load when calculating the memory capacities, due to the large amount of possible combinations of Legendre polynomials and delays, we limit the amount of virtual nodes in the reservoir for this specific task to N = 25, instead of the previous N = 200 for the one-step ahead prediction task. The node separation θ remains fixed to θ = 20 ps, and the input data consists of 5 × 10 5 uniformly distributed samples, with values between -1 and 1. For the parameters used here, we have a threshold memory capacity of approximately C thr ≈ 2.7 × 10 −4 . Since we have changed the amount of virtual nodes to N = 25, we have also changed the layout of the DNN itself. It still consists of 3 fully-connected layers, but the DNN now takes input data with 25 features. The first hidden layer is a fullyconnected layer with 12 nodes, followed by the ReLU activation function [36]. The next layer is another fully-connected layer, now with 6 nodes, and again with the ReLU activation function. The final layer again only has a single node for predictingŷ k .
In case of the combined reservoir and DNN, we use the first 4 × 10 4 data samples for training the model, while the last 10 4 samples are used as validation dataset. This validation dataset is used for the early stopping procedure against overfitting, where the standard amount of training epochs is set at 1.5 × 10 3 epochs. The DNN is trained using the Adam optimizer, with a learning rate of 10 −3 .

Results
Performance on santa Fe one-step ahead prediction task We compare the different networks based on their performance as expressed by the NMSE on the one-step ahead prediction of the Santa Fe dataset. The final NMSE on the test dataset, together with their total training time and time per epoch are summarized in Table 2. The different times are measured on a standard consumer mobile CPU (Intel Core i7-10710U). For the parameters of the reservoir, we use the values given in Table 1, as they are the parameter values which were optimal in previous scans for the RC system [34].
We have investigated the learning curves, i.e. the NMSE as a function of the epochs (not shown), of all networks which require backpropagation, i.e. containing a DNN or LSTM, for choosing the amount of epochs to train the networks. We have observed that all three networks showed a decreasing training loss with increasing epochs. For both the combined system of the reservoir and DNN and only the DNN, we have observed that the validation loss saturates to a constant loss after around 2000 epochs. As their training time per epoch is not very large for both these networks, as shown in Table 2, we have limited their amount of epochs to 10 4 , allowing for a good NMSE. For the LSTM network, we observed that the validation error still decreases with increasing epochs. Due to this network being very time-intensive to train, with a training time per epoch being 30 times that of the other two networks as shown in Table 2, we have limited its training to also 10 4 epochs. It is important to note again that for this amount of epochs, the saturation of the validation loss is not yet reached for the LSTM.
The time required for calculating the weights of the RC network is negligible, because no backpropagation is performed here, in contrast to the other models. We can clearly see from Table 2 that the photonic reservoir coupled with the DNN as output layer outperforms all other networks, in both low training time and low NMSE. For the parameters specified in Table 1, we are able to achieve an NMSE for the reservoir combined with the DNN which is almost 4 times lower than the NMSE achieved when using a photonic RC system. The DNN has a total training time comparable to that of the photonic reservoir coupled with the DNN, and the same amount of free parameters that are being trained, but the NMSE is much larger when using only the DNN. The LSTM network, which is generally a good choice for time-series forecasting due to its recurrent nature, has an NMSE which is around 1.5 times larger than that of the photonic RC. This can be explained by the fact that the LSTM is very time-intensive to train compared to the other networks, so that we have opted to limit its amount of training epochs, and which still results in almost 1 hour of training time. We expect that if the amount of epochs is increased further (i.e. for several additional 10 4 epochs), we are able to achieve a similar NMSE as the photonic RC. Both the DNN and LSTM networks have a higher NMSE compared to the photonic RC system, with the LSTM having the largest training time. The reason why the photonic RC system has a lower NMSE than the DNN is due to the fact that the RC system is a recurrent neural network, which is able to capture the time dependencies of sequential input data. For example, the (software) introduction of reservoir computing was partially introduced by the Echo State Network (ESN) [23], which is a recurrent neural network and which showed good performance on the Santa Fe time-series prediction task [25].
Note that the simulation of photonic reservoir itself requires a lot of time. This simulation is, however, not concluded in the training time in Table 2, because physical implementations of photonic RC can be very fast. Its computing time is approximately equal to the amount of input samples multiplied with the delay time, nτ. This is of the order of several μs and is thus negligible compared to the training times in Table 2. Note that the time required for the optimization of the (hyper)parameters of the reservoir (for the photonic RC and photonic reservoir with DNN) and the neural networks (for the DNN combined with or without the reservoir, and for the LSTM) is not shown in Table 2, as it is not actual training time. However, if one wants to achieve the best possible performance, a scan for the reservoir parameters and/or of the network hyperparameters needs to be performed, which also requires time. The advantage of the photonic reservoir is that its parameter optimization does not need to be very precise, as a good performance can already be achieved for a rather broad range of parameter values, as we will show in the following results. The results in Table 2 show that the photonic reservoir, coupled with or without the DNN, results in a good performance. Therefore, we now investigate these two networks when varying the parameters of the reservoir instead of using only fixed reservoir parameters. We do this by varying the pump current, injection rate and feedback rate. For this, we fix the feedback rate and perform a grid scan over the pump current and injection rate in Figure 3. This is shown for both the RC system, in Figure 3A and the reservoir coupled with the DNN, in Figure 3B. The improvement factor, defined as the NMSE resulting from the RC system divided by the NMSE of the reservoir coupled with the DNN, for the same parameters of  Frontiers in Physics frontiersin.org both reservoirs, is shown in Figure 3C. The same figures are shown in Figure 4, where the feedback rate and pump current are now varied, for a constant injection rate. Figure 3A shows that the RC system performs well for a broad range of pump currents and injection rates. This region can mainly be found around high pump currents and high injection rates. This is indicated by the light and dark blue regions in the figures. The darker the blue color, the better performance, indicated by a lower NMSE on the test set. Additionally, we observe that the region of good performance is broader, and with better performance, when the reservoir is combined with the DNN, as shown in Figure 3B. This improvement of the reservoir coupled with the DNN is also clearly visible in the improvement factor in Figure 3C. Values which are larger than 1 indicate an improvement in performance when using the reservoir with the DNN compared to the RC system. This figure shows that a large portion of parameter space results in an improvement, often with a factor around 2 to 5, compared to only the RC system. Notably, the region below the threshold pump current is improved when using the combined system (shown in Figure 3C where the horizontal axis is less than 0). This indicates that these reservoirs can function relatively well even when the threshold pump current is not yet reached. If one considers the region at high current and low injection rate for which the RC did not result in a good performance (indicated by the red regions in Figure 3A), we can see that the performance at these parameter combinations is not improved upon when the reservoir is combined with the DNN (also shown by the red regions in Figure 3B). As with Figure 3, Figure 4A shows that there is a broad parameter range of pump current and feedback rate for which the RC system performs well. This region is further broadened and improved when we use a reservoir coupled with the DNN, shown in Figure 4B. This is again reflected by the light and dark blue regions in both figures, indicating a low NMSE, and thus good performance. We observe that when the pump current is increased above threshold, the feedback rate needs to be decreased in order to achieve a good performance on both networks. Analogous to Figure 3, all of the parameter combinations which were already performing well with the RC system are improved when the reservoir is coupled with the DNN at those same parameter combinations. This is shown in Figure 4C, where typical improvement factors can be found around 4 to 8 for a broad area of parameter combinations. Most notably, there are even parameter combinations which have more than 10 times improvement in performance, which can be found just below threshold pump current 1 . This shows that the feedback rate and pump current play a large role when using reservoirs combined with DNNs, and can allow for large performance increases when properly adjusted.
In order to find the optimal values for the reservoir parameters, we perform an additional Bayesian search in the parameter space of pump current, injection rate and feedback rate. This Bayesian search is performed using the Gaussian Processes function gp_minimize of the scikit-optimize 0.8.1 library. Figure 5 shows the NMSE for 1653 points projected on the pump-current and injection rate axes ( Figures 5A, B) and on the pump-current and feedback rate axes (Figures 5C, D). This figure shows the projections of a true three dimensional scan of the full three dimensional parameter ranges for the RC system and the reservoir coupled with the DNN. This is opposed to Figures 3, 4, where only two dimensional scans were performed, and where one value always remained fixed (either feedback rate or injection rate). In Figure 5, the parameter combinations which result in a good performance, and thus low NMSE, are represented by large, dark blue dots. If the NMSE is high, and thus a poor performance, the dots are small and red.
The results of Figure 5 show that also in this scan of parameter space the NMSE is improved when the reservoir is coupled with the DNN, as indicated by the larger and bluer dots in Figures 5B, D as compared to Figures 5A, C. This is even the case for parameter values which resulted in a poor performance for the RC systems, e.g. near the edges of the searched parameter space. This is in agreement with the results found in Figures 3, 4, where the same parameter regions with good performance can be found. The most optimal values (i.e. with the lowest NMSE) found for both systems are shown in Table 3. In this table, we see that the reservoir coupled with the DNN, at its most optimal parameter values, is able to have an NMSE which is 3.5 times better than the best NMSE found with RC, also at its most optimal parameter values. Note that these two NMSE values are found at different parameter combinations, as we are comparing the best possible performance we are able to achieve with both systems. This is in contrast to the previous two dimensional scans, where we reported the improvement factor for the same parameter combinations, where we observed higher improvement factors.

Memory capacity
In the previous section, we have established that combining a photonic reservoir with a DNN works well. We now try to gain more insight as to why such systems work well by calculating their memory capacity. Figure 6 shows the memory capacity for various degrees as colored bars. In this figure, we observe that the delay-based reservoir coupled with the DNN has for both the linear and nonlinear degrees the highest memory capacities. This results in this system having the largest total memory capacity, when compared with the conventional delay-based reservoir computing system.
Knowing that the total memory capacity for the RC system is bounded by N, in our case N = 25, we observe that the RC system has a low memory capacity. This is presumed to be the case due to noise inherently present in the delay-based system and due to correlations present between virtual nodes. As the output layer in such system is only a linear layer, we conjecture that this system thus has more trouble dealing with such negative effects compared to the combined reservoir with the DNN, which can minimize these effects and has much more inherent nonlinearity due to its structure. This also explains why mostly the nonlinear degrees for the memory capacity seem to benefit from adding a DNN to the reservoir, resulting in the total memory capacity being larger.
Note that it is not clear if the total memory capacity of the combined system is still bounded by the theoretical upper limit for the RC system, which is the size of the photonic reservoir (i.e. the maximum total memory capacity is equal to the number of nodes). It might be possible that the theoretical maximum of the memory capacity for the combined system is larger than the upper limit for the RC system, as additional variables are introduced in the DNN. However, an in-depth study of the theoretical maximum memory capacity of the combined system is outside the scope of the present paper. Figure 6 shows that coupling a delay-based RC system with a DNN significantly enhances both the linear and nonlinear memory capacities. This gives a heuristic explanation as to why it is able to outperform a conventional RC on the Santa Fe prediction task. This is because the Santa Fe one-step ahead

FIGURE 6
Memory capacity per degree of linearly independent basis functions for the delay-based reservoir computing system, with N =25, and the reservoir, also with N =25, coupled with the DNN.
Frontiers in Physics frontiersin.org prediction task typically requires high non-linear memory capacities [50], which is the case when a preprocessor is used. This also indicates that the combination of a reservoir and a DNN can potentially outperform the RC system in other tasks, different from the time-series prediction.

Discussion
In this work, we have numerically investigated the use of photonic delay-based reservoirs as preprocessors for deep neural networks. This results in a combined system which is different from conventional reservoir computing, as the output layer now contains more parameters which can be trained. The choice for photonic reservoirs was made because of their potential to achieve fast information processing, high bandwidth and small power consumption, making such reservoirs ideal as physically implemented preprocessors. These reservoirs are able to transform the input data into a higher dimensional state-space, which improves performance of the deep neural networks. We show that using a preprocessing reservoir with a deep neural network results in an test error which is almost four times lower when compared to conventional reservoir computing, at the most optimal reservoir parameters, and around ten times lower when compared with the DNN alone for predicting the Santa Fe time-series. Additionally, the improvement in performance is valid for a broad region of reservoir parameters, making the preprocessing reservoir advantageous even when it is not properly fine-tuned. We have also investigated the memory capacity in order to provide an explanation for the improvement when using a reservoir with a deep neural network compared to the reservoir computing system. We found that the memory capacity with the combined system is much higher when compared with the reservoir computing system. Therefore, we have shown that photonic reservoirs are ideally suited as preprocessors for achieving improved performance when combined with deep neural networks.
In our simulations, we have focused on predicting the Santa Fe time-series as main learning task. This task typically requires large nonlinear memory capacities in order to result in a good performance. A reservoir computing system is able to solve this task, as it has nonlinear memory capacity. However, as shown by our study of the memory capacity, we are able to increase the linear and nonlinear memory capacities of a DNN when we combine it with a preprocessor. Note that it is also possible to use other networks, different from the DNN, which also have intrinsic memory, such as e.g. the LSTM network. We have also investigated its performance on the Santa Fe time-series prediction task, but observed that its performance is slightly worse compared to the photonic reservoir computing system. Additionally, if we use an LSTM network with many parameters, it becomes very intesive to train. Using only a DNN for this prediction task results in the worst performance of all investigated networks. This is most likely because such networks have a lower memory capacity compared to the other networks. Note that some extra form of memory is added to both the LSTM and DNN due to the construction of the input data (i.e. using a sliding window of the input data) but it is not sufficient for this training task as reflected by their performance. This shows that first using a recurrent neural network as preprocessor for DNNs results in better performance for time-dependent prediction tasks, such as the Santa Fe task, as opposed to using recurrent DNNs. This is also supported by the fact that the memory capacity, which is a taskindependent performance indicator, is higher when using a combined system consisting of a reservoir as preprocessor and a DNN when compared to using only a reservoir computing system. We conjecture that due to the increased (non)linear memory capacities, the addition of a preprocessor is also able to yield improvements for other tasks which require such memory capacities, e.g. for tasks defined in [27]. Therefore, we are convinced that additional learning tasks, such as dynamical system modelling tasks (e.g. nonlinear autoregressive moving average model, NARMA) and nonlinear channelequalization tasks could be of great interest for future studies, due to their time-dependent nature, for the combined system of a reservoir and DNN.
Note that one does not necessarily need to use a delay-based reservoir to function as a preprocessor. The use of other reservoirs is also expected to give improvements, such as e.g. for photonic spatial reservoirs, or other types of photonic reservoirs, when coupled with DNNs. This is further supported for photonic spatial reservoirs, such as the swirl architecture, by the fact that they show good performance on the same task considered here [51], i.e. the Santa Fe one-step ahead prediction task. Therefore, we are convinced that adding such a reservoir to a DNN will also result in an improved performance. Our study thus indicates that physical reservoirs, as developed in the framework of reservoir computing, are very well suited as preprocessors for DNNs, resulting in improved performance.

Data availability statement
Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.