Learning to Approximate Functions Using Nb-Doped SrTiO3 Memristors

Memristors have attracted interest as neuromorphic computation elements because they show promise in enabling efficient hardware implementations of artificial neurons and synapses. We performed measurements on interface-type memristors to validate their use in neuromorphic hardware. Specifically, we utilized Nb-doped SrTiO3 memristors as synapses in a simulated neural network by arranging them into differential synaptic pairs, with the weight of the connection given by the difference in normalized conductance values between the two paired memristors. This network learned to represent functions through a training process based on a novel supervised learning algorithm, during which discrete voltage pulses were applied to one of the two memristors in each pair. To simulate the fact that both the initial state of the physical memristive devices and the impact of each voltage pulse are unknown, we injected noise into the simulation. Nevertheless, discrete updates based on local knowledge were shown to result in robust learning performance. Using this class of memristive devices as the synaptic weight element in a spiking neural network yields, to our knowledge, one of the first models of this kind capable of learning to be a universal function approximator, and strongly suggests the suitability of these memristors for use in future computing platforms.


Introduction
The field of Machine Learning is, at its core, concerned with building function approximators from incomplete data samples. The state-of-the-art approach to this problem uses artificial neural networks (ANNs), in which a large number of real-valued artificial neurons are connected to each other by means of weights. The neurons in such networks are typically arranged into multiple layers, which is why this approach is referred to as deep learning. The optimisation process is performed by updating the weight matrices defining the connection weights between pairs of neurons and is guided by learning rules: heuristic optimisation algorithms capable of iteratively tuning the network weights to minimise some error function. This process is based on either global knowledge (as in the classic back-propagation algorithm) or local knowledge (which is more biologically plausible); the typical outcome is an interpolation of the hidden mapping from input samples to observed outputs.
Traditional neural networks usually require considerable power and are not necessarily constrained by physical considerations. One of the main reasons for this shortcoming is that artificial neural networks are run on traditional, general-purpose Von Neumann architectures, in which memory and computation are not co-located. A simulation always entails trade-offs in efficiency and, in this sense, the high energy cost of deep learning can be ascribed to the fact that an artificial neural network is essentially a non-Von Neumann computational model (memory and computation are co-located in the connection weight matrices) being run on hardware that implements a different computational paradigm. Even though many important advances in deep learning have been "biologically inspired" (e.g., convolutional neural networks), it is unclear how far the current deterministic approach can progress, because of energy requirements, architectural complexity, and limited capacity to generalise.
To reduce energy usage, alternative approaches to traditional neural networks aim to implement learning in neuromorphic hardware. Memristors are novel fundamental circuit elements that have attracted a great deal of interest for this purpose because they exhibit multilevel conductance states, giving them (for example, when arranged in a crossbar array architecture) the potential to carry out parallel vector-matrix multiplication (c = aB), which is the most important mathematical operation underpinning artificial neural network implementations [1]. The ability to carry out this operation, where the input is voltage V, the output is current I, and the weight corresponds to conductance G (the inverse of resistance R), follows from Ohm's law I = V G and Kirchhoff's current law I_j = Σ_i I_i. Due to their non-volatile nature and ability to co-locate memory and computation, memristors can enable parallel physical computation inside a low energy envelope.
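As a sketch of this principle, the following snippet (with purely illustrative conductance and voltage values, not measured device data) shows how a crossbar of conductances computes a vector-matrix product in a single step via Ohm's and Kirchhoff's laws:

```python
# Sketch: a memristor crossbar computes a vector-matrix product physically.
# Each crosspoint holds a conductance G[i, j]; applying voltages V[i] to the
# rows yields column currents I[j] = sum_i V[i] * G[i, j] (Ohm + Kirchhoff).
import numpy as np

G = np.array([[1e-6, 5e-6],
              [2e-6, 1e-6],
              [4e-6, 3e-6]])    # siemens, one entry per crosspoint
V = np.array([0.2, -0.1, 0.3])  # volts, applied to the three input rows

I = V @ G  # column currents: the vector-matrix product, computed "for free"

# The same result, written out explicitly via Kirchhoff's current law:
I_explicit = np.array([sum(V[i] * G[i, j] for i in range(3)) for j in range(2)])
assert np.allclose(I, I_explicit)
```

In hardware, the summation happens as physical current flow in the column wires, which is where the energy advantage over a simulated matrix multiply comes from.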
Memristors have also been used as direct hardware implementations of weights in order to replicate standard deep learning software architectures, such as long short-term memory (LSTM) [2]. Such systems can then be trained using standard gradient-based machine learning algorithms, but in doing so they limit themselves to the mainstream computational paradigm and do not take full advantage of the material properties of the memristive devices. Another alternative approach to computation in which the use of memristors is currently being explored is reservoir computing [3]. A reservoir is a collection of recurrently interconnected units exhibiting short-term memory and a non-linear response to inputs; the connections of the reservoir are fixed and the training of the network is focused on a simple feed-forward readout layer, thus reducing training time and complexity. Previous literature has shown that memristor-based reservoir computers are capable of excellent performance on a variety of tasks by taking advantage of the intrinsic characteristics of the physical devices [4].
The biological parallel with the brain can be taken further by replacing the continuous idealised neurons found in traditional artificial neural networks with their spiking equivalents, and by using a learning algorithm which updates weights on the basis of local knowledge. For example, considerable interest has been devoted to using memristors to implement Hebbian learning rules, which are thought to underpin the capacity of the brain to learn in an unsupervised manner and adapt to novel stimuli. Multiple flavours of spike timing-dependent plasticity (STDP) and hardware architectures have been proposed as ways to make good use of the physical characteristics of memristors, appropriately shaping voltage pulses to drive changes in their memristance and emulate Hebbian learning [5,6].
The most widely studied memristors for neuromorphic applications rely on electric field control of nanoscale filaments between two electrodes [7-10], phase transitions [6,11,12], or voltage-induced control of the ferroelectric configuration [13-15]. A less studied class of materials for synaptic device applications are interface-type memristors, where electric field controlled resistive switching results from changes occurring at interfaces. While the resistive switching in, for example, metal/Nb-doped SrTiO3 (Nb:STO) Schottky junctions is well documented in the literature, there are few reports in which they are considered as individual neuromorphic components [16-18] and, to the best of our knowledge, no reports of their use as components in neural networks. Memristive devices often require forming processes, which can be unfavourable for device performance and network integration [19,20]. An attractive feature of Nb:STO memristors is that the switching behaviour is present in the as-fabricated device. In addition, they show reproducible multi-level resistive switching and low reading currents at room temperature, making them potentially ideal as components in neuromorphic computers. Such computers need to be able to operate within the physical constraints of the particular class of memristor they utilise as hardware substrate. The heterogeneous characteristics of memristive devices also entail that a neuromorphic computer be aware of their physical characteristics and responses, in order to utilise them to their full potential.
Here, we utilised Nb:STO memristors as synapses in a simulated spiking neural network to explore the suitability of this class of memristors as a computing substrate. Measurements were conducted on physical devices, in which the change in resistance in response to a series of voltage pulses was monitored. We found the resistance values in response to forward voltage pulses to follow a power law, whereby each pulse gives rise to an amplitude-dependent decrease in the resistance. Next, we built a simulated model using the Nengo Brain Maker framework, consisting of pre- and post-synaptic neuronal ensembles arranged in a fully-connected topology. Each artificial synapse was composed of a differential synaptic pair of memristors, one designated as "positive" and one as "negative"; the weight of the connection was given by the difference in normalised conductance values between the two paired memristors. Using a training process during which discrete voltage pulses were applied to one of the two memristors in each pair, we showed that our model was capable of learning to approximate transformations of randomly and periodically time-varying signals, suggesting that it could operate as a general function approximator. The initial state of the memristive devices was unknown, as were their exact operating parameters due to the addition of noise; this simulated the same kind of constraints faced in hardware, making it likely that the learning model could be ported to a physical implementation. Most importantly, robust learning performance was demonstrated using only discrete updates based on local information.

Device Fabrication
Memristive devices are material systems which undergo reversible physical changes under the influence of a sufficiently high voltage, giving rise to a change in resistance [21,22]. Here, metal/Nb-doped SrTiO3 devices were used, where resistive switching results from changes occurring at the Ni/Nb:STO interface [23-28]. We fabricated devices on (001) SrTiO3 single crystal substrates doped with 0.01 wt% Nb in place of Ti (obtained from Crystec, Germany). To obtain a TiO2-terminated surface, a wet etching protocol using buffered hydrofluoric acid was carried out, followed by in-situ cleaning in oxygen plasma. To fabricate the Schottky junctions, a layer of Ni (20 nm), capped with Au (20 nm) to prevent oxidation, was grown by electron beam evaporation at room temperature at a base pressure of ∼10^-6 Torr. Junctions with areas between 50 × 100 and 100 × 300 µm² were defined using UV photolithography and ion beam etching. Further UV photolithography, deposition, and lift-off steps were carried out to create contact pads, and finally wire bonding was used to connect the contacts to a chip carrier. Current-voltage measurements were performed using a Keithley 2410 SourceMeter in a cryostat under vacuum conditions (10^-6 Torr). All electrical measurements were done in a three-terminal geometry, with the voltage applied to the top Ni electrode. The measurements in the present work were performed on junctions of 20000 µm². More details on the measurement techniques and device characterisation can be found in [28]. In light of practical applications, and due to the attractive lowering of the reverse bias current with increasing temperature [28,29], all measurements were done at room temperature.

Device Experimental Evaluation
Biological synapses strengthen and weaken over time in a process known as synaptic plasticity, which is thought to be one of the most important mechanisms underlying learning and memory in the brain. For a device to be useful as a neural network component, its resistance should thus be controllable through an external parameter (such as voltage pulses) so that it may take on a range of values within a certain window. The as-fabricated Ni/Nb-doped SrTiO3 memristive devices showed stable and reproducible resistive switching without any electroforming process.
Hence, to investigate the effect of applying pulses across the interface, we conducted a series of measurements in which a device was subjected to a SET voltage of +1 V for 120 s to bring it to a low resistance state and reduce the influence of previous measurements. Then, 25 RESET pulses of negative polarity (−4 V) were applied and the current was read after each pulse. A negative read voltage was chosen because significantly larger hysteresis is observed in reverse bias than in forward bias. Because of differences in the charge transport mechanisms under forward and reverse bias, a much smaller current flows in this regime: to ensure the measured current is sufficiently large to be read without significant noise, a reading voltage of −1 V was chosen. The RESET pulse amplitude was varied from −2 to −4 V, and this procedure was repeated several times for each amplitude sequentially. The pulse widths were ∼1 s.
Next, we performed a set of measurements in which devices were SET to a low resistance state by applying +1 V for 120 s, before applying 10 RESET pulses of −4 V to bring the devices to a depressed state. This was followed by a series of potentiation pulses ranging from +0.1 V to +1 V.
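To make the qualitative pulse response concrete, a simple power-law sketch follows; the functional form R(n) = R0 · n^c and all constants here are assumptions for illustration, standing in for the fitted device model referred to later as Eq. 7:

```python
# Illustrative power-law model of the SET-pulse response, R(n) = R0 * n**c
# with c < 0, so each successive potentiation pulse lowers the resistance by
# a diminishing amount. The functional form and parameter values are
# assumptions for illustration, not the fitted device model (Eq. 7).
import numpy as np

R0 = 2e8   # starting (high) resistance in ohms, assumed
c = -0.1   # pulse-response exponent, assumed (c < 0 gives a decrease)

n = np.arange(1, 26)           # pulse index 1..25
R = R0 * n.astype(float) ** c  # resistance after each SET pulse

assert np.all(np.diff(R) < 0)      # resistance decreases monotonically...
steps = -np.diff(R)
assert np.all(np.diff(steps) < 0)  # ...and by ever-smaller amounts
```

The two assertions capture the qualitative behaviour reported for the devices: the first pulse produces the largest change, and subsequent pulses have progressively smaller effects.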

Nengo Brain Maker Framework
The aforementioned devices were integrated into a simulated spiking neural network (SNN), implemented using the Nengo Brain Maker Python package [30], which represents information according to the principles of the Neural Engineering Framework (NEF) [31]. Nengo was chosen because its two-tier architecture enables it to run on a wide variety of hardware backends (including GPUs, SpiNNaker, and Loihi) with minimal changes to user-facing code, opening up the possibility of seamlessly running models on neuromorphic hardware [32,33]. However, the results of our current work are framework-independent and would apply to any computational model congruent to the one we simulated in Nengo.
In Nengo, information takes the form of real-valued vectors which are encoded into neural activity and decoded to reconstruct the original signal [34]. In the Nengo API, representation is realised by neuronal ensembles: collections of neurons representing a vector. The higher the number of neurons in the ensemble, the better the vector can be encoded and then decoded/reconstructed. Each neuron i in an ensemble (in our case, leaky integrate-and-fire (LIF) neurons) has its own tuning curve that describes how strongly it spikes in response to particular inputs (encoding). The collective spiking activity of the neurons in an ensemble represents the input vector, which can be reconstructed in output by applying a temporal filter to each spike train and linearly combining these individual contributions (decoding). Thus, each neuronal ensemble represents a vector through its unique neuronal activation pattern through time.
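The encode/decode cycle described above can be sketched outside Nengo with plain numpy; the rectified-linear tuning curves, gains, and biases below are illustrative stand-ins for Nengo's LIF response curves, not its actual defaults:

```python
# Minimal NEF-style encode/decode sketch: rate neurons with random tuning
# encode a scalar, and a least-squares linear decoder reconstructs it.
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_samples = 50, 200

encoders = rng.choice([-1.0, 1.0], size=n_neurons)  # preferred directions
gains = rng.uniform(0.5, 2.0, size=n_neurons)
biases = rng.uniform(-1.0, 1.0, size=n_neurons)

def rates(x):
    """Rectified-linear tuning curves: activity of each neuron for input x."""
    return np.maximum(0.0, gains * (encoders * x[:, None]) + biases)

x = np.linspace(-1, 1, n_samples)
A = rates(x)                                     # activities, samples x neurons
decoders = np.linalg.lstsq(A, x, rcond=None)[0]  # optimal linear decoders
x_hat = A @ decoders                             # decoded reconstruction

assert np.mean((x - x_hat) ** 2) < 1e-2          # good reconstruction
```

More neurons give a better reconstruction, mirroring the statement above that larger ensembles encode vectors more faithfully.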
Ensembles are linked by connections which transform and transmit the encoded vector represented in the pre-synaptic ensemble by way of a weight matrix. The post-synaptic ensemble will thus try to render a transformation of the vector represented in the pre-synaptic neuronal ensemble; in the default case, Nengo performs offline optimisation to find a set of optimal weights to approximate the function along the connection.
Learning rules can be applied to the connection between two ensembles in order to iteratively modify its weighting and thus find the correct transformation in an online manner.

Simulated Model
The Nengo model used in this work to evaluate the usage of our memristive devices in a neuromorphic learning setting consisted of three neuronal ensembles: pre- and post-synaptic ensembles, and one dedicated to calculating an error signal E. The pre- and post-synaptic ensembles were linked by a connection which resulted in a fully-connected topology between their constituent neurons; this is equivalent to the physical arrangement of memristors into crossbar arrays [1] used in many previous works to realise efficient brain-inspired computation [2,35]. The specific topology, which is crucial in enabling the model to learn, is shown in Fig. 1a.
The pre-synaptic ensemble had as its input signal a d-dimensional vector x, and the connection between the ensembles was tuned so that the post-synaptic ensemble would represent a transformation/function f : x → y of the vector encoded into the pre-synaptic neuronal population. This time-varying input signal consisted either of d uniformly phase-shifted sine waves or of d white noise signals low-pass-filtered to a 5 Hz cutoff. The latter signal was generated using the nengo.processes.WhiteSignal class and is naturally periodic but, in order to be able to test the network on unseen data, the period was chosen to be double the total simulation time of 30 s.
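As an illustration, the two signal types can be generated along these lines; the band-limited noise below is a simple sum-of-sinusoids stand-in for nengo.processes.WhiteSignal, and the component frequencies are assumptions:

```python
# Sketch of the two input-signal types: d uniformly phase-shifted sine waves,
# and band-limited noise built from random sinusoids below a 5 Hz cutoff
# (an illustrative stand-in for nengo.processes.WhiteSignal).
import numpy as np

d, T, dt = 3, 30.0, 0.001
t = np.arange(0.0, T, dt)

# d sine waves with phases spaced 2*pi/d apart
sines = np.stack([np.sin(2 * np.pi * t / T + 2 * np.pi * i / d)
                  for i in range(d)])

# band-limited noise: random sinusoids with frequencies below 5 Hz
rng = np.random.default_rng(1)
freqs = rng.uniform(0.1, 5.0, size=(d, 8))
phases = rng.uniform(0, 2 * np.pi, size=(d, 8))
noise = np.sum(np.sin(2 * np.pi * freqs[..., None] * t + phases[..., None]),
               axis=1) / 8.0

assert sines.shape == (d, len(t)) and noise.shape == (d, len(t))
```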
The third neuronal ensemble, dedicated to calculating the error signal, was connected to the pre- and post-synaptic ensembles using standard Nengo connections whose weights were pre-calculated by the framework. The connection from the post-synaptic ensemble implemented the identity function (simply transferring the vector) while the one from the pre-synaptic ensemble calculated −f(x), with f being the transformation of the input signal we sought to teach the model to approximate. The connection from the pre to the error ensemble did indeed already represent the transformation f(x) that the model needed to learn, but the optimal network weights for this connection were calculated by Nengo before the start of the simulation, based on the response characteristics of the neurons in the connected ensembles. Therefore, the weights realising the transformation on this connection were not learned from data and did not change during the simulation. Given these connections from the pre and post ensembles, the error ensemble represented a d-dimensional vector E = y − f(x), which was used to drive the learning in order to bring the post-synaptic representation y as close as possible to the transformed pre-synaptic signal f(x), the ground truth in this supervised learning context.
The learning phase was stopped after 22 s of simulation time had passed; this was done by activating a fourth neuronal ensemble which had strong inhibitory connections to the error ensemble.Suppressing the firing of neurons in the error ensemble stopped the learning rule on the main synaptic connection from modifying the weights further and thus let us test the quality of the learned representation in the remaining 8 s of simulation time.
In separate experiments, the transformation f we sought to teach the model was either the identity f(x) = x or the square f(x) = x². We theorised that learning the square function would be a harder problem, as f(x) = x² is not an injective function; that is, different inputs can be mapped to the same output ((−1)² = 1² = 1, for example). Even learning the identity function f(x) = x is not a trivial problem in this setting, as the pre- and post-synaptic ensembles are not congruent in how they represent information, even when they have the same size/number of neurons. Each of the ensembles' neurons may respond differently to identical inputs, so the same vector decoded from the two ensembles would be specified by different neuronal activation patterns. Therefore, the learning rule needs to be able to find an optimal mapping between two potentially very different vector spaces, even when the implicit transformation between them is the identity function.
The quality of the learned representation was evaluated on the last 8 seconds of neuronal activity by calculating the mean squared error (MSE) and Spearman correlation coefficient ρ between the ground truth given by the transformed pre-synaptic vector f (x), and the y vector represented in the post-synaptic ensemble.
At a lower level, each artificial synapse in our model (one connecting each pair of neurons in the pre- and post-synaptic ensembles) was composed of a "positive" M+ and a "negative" M− simulated memristor (see Fig. 1b). The mapping of the current resistance states R± of a pair of memristors to the corresponding synaptic network weight ω was defined as

ω = γ (Ĝ+ − Ĝ−), with Ĝ± = (G± − G0) / (G1 − G0).    (Eq. 2)

The relationship between conductance and resistance was specified by G = 1/R; thus, when normalising, the maximum conductance G1 was given by the inverse of the minimum resistance, 1/R0, and vice-versa. So, the network weight of a synapse at each timestep was the result of the difference between its current memristor conductances G±, normalised to the range [0, 1] and then multiplied by a gain factor γ.
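A minimal numpy sketch of this differential-pair mapping might look as follows; the resistance bounds and gain value are assumed for illustration:

```python
# Sketch of the differential-pair weight mapping: each synapse holds two
# memristors, and its weight is the gain-scaled difference of their
# normalised conductances. Resistance bounds and gain are assumed values.
import numpy as np

R_min, R_max = 1e2, 2.5e8    # assumed resistance window of a device, ohms
G_min, G_max = 1.0 / R_max, 1.0 / R_min
gamma = 1e4                  # gain factor, illustrative

def weight(R_pos, R_neg):
    """Map a (positive, negative) memristor pair to a synaptic weight."""
    g_pos = (1.0 / R_pos - G_min) / (G_max - G_min)  # normalised to [0, 1]
    g_neg = (1.0 / R_neg - G_min) / (G_max - G_min)
    return gamma * (g_pos - g_neg)

# Equal resistances give a zero weight; a lower R+ gives a positive weight.
assert weight(1e8, 1e8) == 0.0
assert weight(1e6, 1e8) > 0.0
```

The first assertion illustrates why identically initialised pairs must be "symmetry-broken": equal resistances always map to a zero weight.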
To simulate device-to-device variation and hardware noise, we elected to introduce randomness into the model in two ways. Firstly, the memristor resistances were initialised to a random high value in the range [10^8 Ω ± 15%] in order to break their "symmetry" and ensure that the weights were not all 0 at the start of the simulation. As each entry in the weight matrix W was given by the difference in normalised conductances of a pair of memristors, initialising them all to the same resistance 10^8 Ω would have led them all to necessarily be equal to 0, as per Eq. 2. Secondly, we perturbed, via the addition of 15% Gaussian noise, the parameters in Eq. 7 used to calculate the updated resistance states of the simulated devices. This was done independently at every timestep, and for every simulated memristive device.
Figure 2: High-level example of how the mPES learning rule minimises the error E between the pre- and post-synaptic representations. The neurons contributing to making the error E negative are found and the weights connecting them are adjusted accordingly; this example focuses on one such synapse. The error is negative, so the synapse of the identified neurons is facilitated by pulsing its positive memristor M+. The facilitation increases the future response of the post-synaptic neuron, which is conducive to reducing the error in the following timesteps. NB: the process is analogous for the PES learning rule, with the exception of step 3.

Learning Rule
The process through which learning is effected in an artificial neural network is the iterative, guided modification of the network weights connecting the neurons.To this end, we designed a neuromorphic learning rule that enacted this optimisation process by applying SET pulses to our memristive devices.
In order to learn the desired transformation of the pre-synaptic signal, we iteratively tuned the model parameter matrix W, which represented the transformation across the synaptic connection between the pre- and post-synaptic ensembles. This was done by adjusting the resistances R± of the memristors in each differential synaptic pair through the simulated application of +0.1 V SET learning pulses. At each timestep of the simulation, the resistances were adjusted using a modified version of the supervised, pseudo-Hebbian prescribed error sensitivity (PES) learning rule [36]; this extension of PES to memristive synapses will be referred to as mPES from here on.
The PES learning rule accomplishes online error minimisation by applying, at every timestep, the following update to each synaptic weight ω_ij:

Δω_ij = −κ α_j e_j · E a_i,    (Eq. 3)

where ω_ij is the weight of the connection between pre-synaptic neuron i and post-synaptic neuron j, κ is the learning rate, α_j is a gain factor for neuron j, e_j is the encoding vector for neuron j, E is the global error to minimise, and a_i is the activity of pre-synaptic neuron i. The factor ε = α_j e_j · E is analogous to the local error in the backpropagation algorithm widely used to train artificial neural networks [37]. The overview shown in Fig. 2 applies to both the PES and mPES learning rules, with the exception of step 3, which is characteristic of mPES. In this high-level depiction we can see how the neuronal pairs (i, j) whose Δω_ij > 0 are the ones which contribute to the post-synaptic representation being lower than the pre-synaptic signal, and should thus be potentiated in order to bring the post-synaptic representation closer to the pre-synaptic one. The converse is also true, so the synapses whose corresponding Δω_ij < 0 should be depressed in order to reduce E. Thus, the neurons in the post-synaptic ensemble that are erroneously activating will be less likely to be driven to fire in the future, changing the post-synaptic representation towards a more favourable one.
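For concreteness, one continuous PES update step can be sketched in numpy as follows; the ensemble sizes, activities, and learning rate are illustrative:

```python
# Numpy sketch of a single continuous PES update step, computing the weight
# change for every (pre, post) neuron pair at once. All values illustrative.
import numpy as np

rng = np.random.default_rng(2)
n_pre, n_post, dims = 10, 10, 3
kappa = 1e-4                               # learning rate

a = rng.uniform(0, 100, size=n_pre)        # pre-synaptic activities a_i
alpha = rng.uniform(0.5, 2.0, size=n_post) # post-synaptic gains alpha_j
e = rng.standard_normal((n_post, dims))    # encoders e_j
E = np.array([0.3, -0.1, 0.2])             # global error vector

local_error = alpha * (e @ E)                # per-neuron local error
delta_w = -kappa * np.outer(local_error, a)  # weight updates, (post, pre)

assert delta_w.shape == (n_post, n_pre)
```

The outer product makes the locality of the rule explicit: each update depends only on one pre-synaptic activity and one post-synaptic local-error term.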
mPES is a novel learning rule developed to operate under the constraints imposed by the memristive devices and, essentially, behaves as a discretised version of PES. This discretisation is a natural consequence of the models it operates on not having ideal continuous network weights; instead, these are given by the states of memristors, which were updated in a discrete manner through voltage pulses. The gradual change in the weights W between the pre- and post-synaptic neuronal ensembles in the model can be classified as a self-supervised learning optimisation process, as the learning was informed by the error E calculated by the third neuronal ensemble, which was also a part of the model. Our learning rule had to be able to operate in a very restricted and stochastic environment, given that we only supplied +0.1 V SET potentiation pulses to our memristive devices during the online training process, and that we introduced uncertainty on the initial resistance and update magnitude of the memristors. Furthermore, at most one memristor in each synapse could have its resistance updated at each timestep (either M+ or M−), and the exact result of this operation was also uncertain due to the fact that we injected noise into the parameters of Eq. 7 governing the update.
The learning rule can be classified as being of the gated, pseudo-Hebbian kind and, notably, uses only local information and discrete updates, making it biologically plausible. The characterisation as a gated learning rule derives from the fact that when E = 0 learning does not occur, while the pseudo-Hebbian element is given by the post-synaptic activity being represented indirectly by α_j e_j instead of explicitly by a_j.
mPES required the definition and tuning of a hyperparameter γ, which is analogous to the learning rate found in most machine learning algorithms. γ represents the gain factor used to scale the memristors' conductances G± to network weights W whose magnitude was compatible with the signals the model needed to represent. A larger γ made each pulse to the memristors have an increased effect on the network weights but, unlike a learning rate κ, γ did not change the magnitude of each update on the underlying devices, affecting only the translation to network weights.
Our learning rule also included an error threshold θ_ε whose role was to guarantee that a very small error E would not elicit updates on the memristors, given that this was seen to impact learning very negatively. The value of the error threshold was experimentally determined but, at this time, we did not see the need to vary it; for all intents and purposes it was considered a static part of the model rather than a hyperparameter.
The mPES learning rule went through the following stages at each timestep of the simulation, in order to tune the resistances of each memristor in every differential synaptic pair with the overall goal of minimising the global error E:

1. The global error signal E was taken as input to the learning rule.

2. A local error vector ε was calculated as ε = −e_j E. ε is a projection of E onto each neuron's response vector and, consequently, encodes the contribution of each post-synaptic neuron to the global error E (note that the sign of E is inverted).

3. If no value in the ε vector was greater than an experimentally determined error threshold θ_ε = 10^−5, the updates in this timestep were skipped. The reason for this check was twofold. Firstly, if we did not filter out local errors very close to 0, the learning performance would suffer. This is because a very small error may have no consequence in a continuous-update setting, such as that where the original PES is applied, but in our setup it could trigger a voltage pulse on the corresponding memristor and bring it to a new resistance state. As memristors move between discrete resistance states, such an update would end up having an outsized effect. Secondly, inhibiting the error ensemble to terminate the learning phase did not completely shut off its activity, so E would never be exactly 0 and the adjustments to the resistances would not stop, leading to the learning phase going on indefinitely.

4. A delta matrix ∆ = −ε ⊗ a_i, whose entries were equivalent to the PES ∆ω_ij in Eq. 3, was calculated. Each entry in ∆ has a one-to-one correspondence to a synapse in the model and encodes the participation of each pair of pre- and post-synaptic neurons (i, j) in the error E. Therefore, the sign of each entry ∆_ij determines whether the corresponding synapse should be potentiated (∆_ij > 0) or depressed (∆_ij < 0).

5. The a_i vector of pre-synaptic activations was discretised into binary values 0 and 1, as the only information of interest was whether a pre-synaptic neuron had spiked, not the intensity of its activity.

6. The update direction V for each synapse was calculated as V = sgn(∆), in order to determine which memristor in each pair needed adjusting: (a) the positive memristor M+_ij in a differential synaptic pair was pulsed when the corresponding term V_ij > 0; (b) the negative memristor M−_ij was pulsed when V_ij < 0. This led to a facilitation of synapses whose neurons (i, j) had a positive participation in the error E, and a depression of those for which ∆_ij < 0. The synapses connecting neurons which "pushed" the post-synaptic signal y to be higher than the reference f(x) were depressed in order to reduce the future activation of these same post-synaptic neurons. The magnitude of the change induced by each pulse was determined by the memristive devices' characteristic response, which in this work was given by Eq. 7.

7. Finally, Eq. 2 was applied to each synapse in order to convert from the resistances of M± to network weights W.
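The stages above can be sketched as a single numpy function; the fixed 1% resistance decrease per pulse is an assumed stand-in for the device model of Eq. 7, and all other constants are illustrative:

```python
# Numpy sketch of one mPES timestep, following the numbered stages above.
# The 1% multiplicative resistance decrease per SET pulse is an assumed
# stand-in for the device response model (Eq. 7); constants are illustrative.
import numpy as np

rng = np.random.default_rng(3)
n_pre, n_post, dims = 10, 10, 3
gamma, theta = 1e4, 1e-5
R_min, R_max = 1e2, 2.5e8
G_min, G_max = 1.0 / R_max, 1.0 / R_min

# one differential pair of resistances per synapse, randomly initialised
R = {s: rng.uniform(0.85e8, 1.15e8, (n_post, n_pre)) for s in ("+", "-")}

def mpes_step(E, a, encoders):
    """One mPES timestep; returns the updated weight matrix W (or None)."""
    eps = -(encoders @ E)                 # stages 1-2: local error
    if np.all(np.abs(eps) < theta):       # stage 3: skip near-zero errors
        return None
    spiked = (a > 0).astype(float)        # stage 5: binarised activities
    delta = -np.outer(eps, spiked)        # stage 4: delta matrix
    R["+"][delta > 0] *= 0.99             # stage 6a: pulse M+ where sgn > 0
    R["-"][delta < 0] *= 0.99             # stage 6b: pulse M- where sgn < 0
    g = {s: (1.0 / R[s] - G_min) / (G_max - G_min) for s in ("+", "-")}
    return gamma * (g["+"] - g["-"])      # stage 7: resistances -> weights

encoders = rng.standard_normal((n_post, dims))
a = rng.uniform(0.0, 100.0, n_pre)
W = mpes_step(np.array([0.3, -0.1, 0.2]), a, encoders)
assert W is not None and W.shape == (n_post, n_pre)
```

Note how, per synapse, only one of the two memristors can be pulsed in a given timestep, matching the constraint described above.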

Optimisation Experiments
In order to assess the effect of the gain hyperparameter γ on the learning capacity of the model, we set up a series of eight experiments, one for each combination of three binary parameters: input function (sine wave or white noise), learned transformation (f(x) = x or f(x) = x²), and ensemble size (10 or 100 pre- and post-synaptic neurons). In each experiment we ran 100 randomly initialised models for each value of γ logarithmically distributed in the range [10, 10^6], and recorded the average mean squared error and Spearman correlation coefficient measured in the testing phase of each simulation (i.e., over the final eight seconds).
We selected the "best" gain value for each combination of factors as the one which gave the highest Spearman-correlation-to-mean-squared-error ratio (ρ/MSE).

Learning Experiments
To assess the learning capacity of our model and learning rule (mPES), we elected to compare them against a network using prescribed error sensitivity (PES) as its learning algorithm and, importantly, continuous synapses together with noiseless weight updates. The default model with mPES used to generate our results was run with 15% injected Gaussian noise on both the memristor parameters and the initial resistances, as previously described. To show that our model was not just learning the transformed input signal f(x) but the actual function f, we tested both on the same signal that had been learned on, and on the other. So, we sought to learn the f(x) transformations on a sine wave using models with 10 and 100 neurons, and then tested the quality of the learning both on a sine wave and on a white noise signal, and vice-versa. Both the learning and testing input signals were 3-dimensional in all cases.
To have a quantitative basis on which to compare the two models, we measured the mean squared error (MSE) together with the Spearman correlation ρ between the pre- and post-synaptic representations in the testing phase of the simulation (corresponding to the last 8 of the total 30 s). We took PES as the "gold standard" to compare against, but also looked at what the error would have been when using the memristors without a learning rule modifying their resistances.
A lower mean squared error indicates a better correspondence between the pre- and post-synaptic signals, while a correlation coefficient close to +1 describes a strong monotonic relationship between them. A higher ρ/MSE ratio thus suggests that a model learns to represent the transformed pre-synaptic signal more faithfully.
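These statistics can be computed directly from the decoded signals; a sketch (with a naive rank transform that ignores ties) might look like:

```python
import numpy as np

def _spearman(a, b):
    """Spearman rho: Pearson correlation of the rank-transformed signals."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def learning_metrics(pre, post):
    """Return (MSE, mean Spearman rho over dimensions, rho/MSE) for
    decoded pre-/post-synaptic signals of shape (timesteps, dims)."""
    mse = float(np.mean((pre - post) ** 2))
    rho = float(np.mean([_spearman(pre[:, d], post[:, d])
                         for d in range(pre.shape[1])]))
    return mse, rho, rho / mse
```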

Other experiments
We also tested the learning performance of our network for varying amounts of noise, as defining models capable of operating in stochastic, uncertain settings is one of the fundamental goals of the field of Neuromorphic Computing. To this end, we defined the simplest model possible within our learning framework [10 neurons, learning the identity from a sine input] and used it to evaluate the learning performance for varying amounts of Gaussian noise and for different values of the c parameter in the resistance update Eq. 7. We ran 4 randomly initialised models for each of 100 noise levels in the range [0%, 100%] and reported the averaged mean squared error for each noise percentage.
Using the same methodology, we also ran a parameter search for the exponent c in Eq. 7 in order to find the SET voltage magnitude best able to promote learning. To this end, we evaluated the MSE of randomly initialised models with 15% noise for 100 values of c uniformly distributed in the range [−1, −0.0001].
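How the noise injection might be implemented is sketched below; treating the noise level as the fractional standard deviation of multiplicative Gaussian jitter on each parameter of Eq. 7 is our assumption, not a detail stated above:

```python
import numpy as np

def noisy_params(noise, rng, R0=200.0, R1=2.3e8, c=-0.146):
    """Perturb each memristor parameter of Eq. 7 with zero-mean Gaussian
    noise whose standard deviation is `noise` (e.g. 0.15 for 15%) times
    the parameter's nominal magnitude. The multiplicative form is an
    assumption made for this sketch."""
    jitter = lambda p: p * (1.0 + noise * rng.standard_normal())
    return jitter(R0), jitter(R1), jitter(c)
```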

Device Experimental Evaluation
We investigated the effect of applying pulses across the interface by administering a long SET pulse followed by 25 RESET pulses. Due to the resistance's dependence on the memristor's previous history, the starting state was not the same every time; hence, in Fig. 3a the results are plotted as the percentage increase in resistance between the states before and after the application of a pulse. The first pulse gave rise to the largest increase in resistance, the magnitude of which decreased with increasing pulse amplitude. The change in resistance quickly leveled off and the influence of subsequent pulses was significantly smaller. The application of a RESET pulse resulted in a switch to a high resistance state that depended strongly on the amplitude of the applied pulse, but not particularly on the number of pulses of that amplitude that were applied.
We also explored the converse, that is, the devices' response to SET voltages. It should be pointed out that in Fig. 3b the initial state (and hence also the large difference induced by the first RESET pulse) after applying a long SET voltage, which would correspond to pulse number 0, is not shown. Note that different pulse amplitudes were chosen for SET and RESET pulses because the asymmetric nature of the charge transport results in much larger currents in forward bias [28]. With the application of positive pulses, the device saw its resistance gradually decrease, i.e., it was potentiated. It is clear that in this case both the pulse amplitude and the number of applied pulses had a great impact on the resistance state of the device. The larger the amplitude of the SET pulse, the greater the induced difference in resistance with each applied pulse. Interestingly, compared to applying RESET pulses, the difference between the change induced by the first and later pulses was not severe. While 10 applied RESET pulses gave rise to close-to-saturated high resistance states, this was not observed when positive pulses were applied.

Device Memristive Response
In order to simulate our memristive devices in Nengo, we modelled their behaviour in response to SET pulses on the basis of the experimental measurements we carried out.
The device behaviour when applying a series of SET pulses, i.e. its forward bias response, was seen to be well described by a power-law equation of the form:

R(n, V) = R_0 + R_1 · n^(a + bV)    (4)

in which V represents the amplitude of the SET pulse, n the pulse number, R_0 the lowest value that the resistance R(n, V) can reach, and R_0 + R_1 the highest value. One of the reasons we chose a fit of this form was the parallel we saw with the classic power law of practice, a psychological and biological phenomenon by virtue of which improvements are quick at the beginning but become slower with practice. In particular, skill acquisition has classically been thought to follow a power law [38, 39].
By solving Eq. 4 for n, we can calculate the pulse number from the current resistance R(n, V) as:

n(R, V) = ((R − R_0)/R_1)^(1/(a + bV))    (5)

Parameters a and b were found based on the data measured by applying SET voltages between +0.1 V and +1 V to the memristive devices, as shown in Fig. 4a. The exponents of the curves best describing the memristors' behaviour were then fitted with linear regression on the log-transformed pulse numbers and resistances, shown in Fig. 4b, in order to obtain the linear expression a + bV. This process yielded an estimated best fit for the memristor behaviour of:

R(n, V) = 200 + 2.3 × 10^8 · n^(−0.093 − 0.53V)    (6)

In our current work we always supplied SET pulses of +0.1 V, so we could subsume the exponent a + bV into a single parameter c:

R(n) = R_0 + R_1 · n^c = 200 + 2.3 × 10^8 · n^(−0.093 − 0.53 × 0.1) = 200 + 2.3 × 10^8 · n^(−0.146)    (7)
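Eqs. 4-7 can be collected into a small simulation sketch, where applying a SET pulse amounts to inverting the power law to recover the effective pulse number, incrementing it, and re-evaluating; the inversion-based update is our reading of the procedure, not a verbatim implementation:

```python
import numpy as np

R0, R1 = 200.0, 2.3e8          # fitted resistance bounds (ohms)
A, B = -0.093, -0.53           # exponent regression: c = a + b*V

def resistance(n, V=0.1):
    """Eq. 4: R(n, V) = R0 + R1 * n**(a + b*V)."""
    return R0 + R1 * n ** (A + B * V)

def pulse_number(R, V=0.1):
    """Eq. 5 (Eq. 4 solved for n): recover the effective pulse count
    from the current resistance."""
    return ((R - R0) / R1) ** (1.0 / (A + B * V))

def apply_set_pulse(R, V=0.1):
    """Simulate one +V SET pulse: step the effective pulse number by
    one and re-evaluate the power law."""
    return resistance(pulse_number(R, V) + 1.0, V)
```

Starting from R(1) = R_0 + R_1, each simulated SET pulse then lowers the resistance along the fitted power-law curve, mirroring the potentiation seen in Fig. 3b.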

Optimisation Experiments
The gain factor γ in Eq. 2 was analogous to the learning rate κ in Eq. 3 - present in most machine learning algorithms - in that it defined how big an effect each memristor conductance update had on the corresponding network weight.
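As a sketch, the weight of one differential pair could be computed as follows; the min-max normalisation of the conductances to [0, 1] is our assumption, as the exact normalisation used in Eq. 2 is not restated here:

```python
def synapse_weight(R_plus, R_minus, gain=1e4,
                   R_min=200.0, R_max=2.3e8 + 200.0):
    """Eq. 2 sketch: weight of a differential synaptic pair as the
    scaled difference of normalised conductances, W = gain * (G+ - G-).
    The [0, 1] normalisation by the conductance extremes (from the
    fitted resistance bounds of Eq. 7) is an assumption of this sketch."""
    g_min, g_max = 1.0 / R_max, 1.0 / R_min
    norm = lambda R: (1.0 / R - g_min) / (g_max - g_min)
    return gain * (norm(R_plus) - norm(R_minus))
```

A pair with a more conductive (lower-resistance) M+ memristor thus yields a positive weight, and vice versa for M−.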
The rationale behind this experiment was to execute an equivalent of hyperparameter tuning - routinely done for artificial neural networks - as we had realised that γ was analogous to a learning rate. Therefore, searching for a "best" value of γ was integral to helping mPES obtain good learning performance from the models.
The results for this experiment are shown in Table 1, with the optimal value of the hyperparameter γ found for each combination of factors highlighted in bold. This "best" gain value was selected as the one giving the highest ρ/MSE ratio across the various models. In our case this value was γ = 10^4, which was the hyperparameter used in our subsequent experiments to evaluate the learning of our models in greater depth.

Learning Experiments
The results for the experiments comparing learning a function using PES or mPES are shown in Table 2, where it can be immediately seen that our mPES learning rule is competitive with PES: the statistics encoding the quality of the learning, the mean squared error (MSE) and Spearman correlation coefficient (ρ), are consistently more favourable to mPES - or, when not, very close to equal - across the spectrum of tested models. Fig. 5a, 5b, 5c, and 5d show a selection of results from our simulations for various combinations of neuronal ensemble sizes, desired transformations f, and input signals. The plots show the decoded output of the post-synaptic neuronal ensemble (filled colours) overlaid on the signal represented in the pre-synaptic ensemble (faded colours) during the testing phase of the simulation. The x-axis is expressed in seconds and shows the final 8 seconds of the simulation, while the y-axis is fixed to [−1, +1] to show the value of the signals. In all cases the training and testing input signals were the same, and 3-dimensional, with the first dimension plotted in blue, the second in orange, and the third in green.
Table 1: Influence of the γ hyperparameter on the learning performance, expressed as mean squared error (MSE), Spearman correlation coefficient (ρ), and ρ/MSE ratio - measured in the testing phase - of the model for different combinations of ensemble size, input signal, and learned transformation. The bolded entries highlight the optimal value found for each statistic, and the best γ chosen for each model. NB: the γ chosen for the (100 neurons, white, f(x) = x) model was γ = 10^4 even though the ρ/MSE favoured γ = 10.

These results were chosen to give an intuition of the outcome of learning from a qualitative point of view, given that mere statistics might not encode the notion of what an observer would deem a "good" fit.
The selection of results starts with Fig. 5a, where we see how pre, post, and error ensembles of 10 neurons each were able, after the learning period, to learn the identity function from a 3-dimensional sine wave. Moving the model's ensembles to 100 neurons (see Fig. 5b) gave a much cleaner output that better approximated the pre-synaptic signal.
Also noteworthy is the fact that the signal decoded from the smaller neuronal ensembles is much noisier, apart from having a worse average fit (i.e., both a higher MSE and a lower ρ). Fig. 5c shows that 100-neuron ensembles could learn the identity function after being trained on a white noise signal. Finally, Fig. 5d indicates that neuronal ensembles of 100 neurons each could also learn to approximate the square from an input sine wave, with the caveat that the quality of the learned transformation was noticeably lower - as could already be appreciated from the statistics in Table 2.
Table 2: Mean squared error (MSE), Spearman correlation coefficient (ρ), and ρ/MSE - averaged over 100 runs - for models trying to learn either y = x or y = x^2 using PES or mPES as learning rule, from a 3-dimensional learning input signal. The models are given by a combination of number of neurons in their ensembles, training input signal, and function learned. Each model was tested on both kinds of testing input. The results where mPES is run with no learning time allotted are also included in the "No learning" column.

An argument in support of reporting the notion of qualitative fit is given by analysing the results in Table 2 for the models in Fig. 5b and Fig. 5d. Here we see how these particular models perform similarly with regards to the measured mean squared error when trying to learn the identity and square functions from a sine wave, but how the Spearman correlation coefficient is two times worse when learning the latter. The answer to this dilemma can be found by looking at the qualitative fits on the testing part of the simulations (Fig. 5b and Fig. 5d) and noticing that, even though the pre- and post-synaptic signals are - on average - close in both cases, when trying to learn the square the post-synaptic signal is much noisier. A noisy signal negatively impacts the correlation coefficient, as the pre- and post-synaptic signals are nearly never moving in tandem, even when the oscillations are around the correct value. This could be ascribed to the square being a harder function to learn than the identity, as already noted throughout, which could lead to the model being "overpowered", in the sense that its representational power is allotted to trying to learn the harder function, decreasing the resources available to represent the signal.
The sizes of the neuronal ensembles were not fine-tuned for compactness so it may well be possible to learn representations of similar quality using smaller neuronal ensembles.

Increasing Noise Experiments
Predictably, as can be seen in Fig. 6a, learning performance decreased for high levels of noise, but this degradation was not graceful, as the mean squared error quickly stabilised around 0.45 for noise > 20%. For the particular input that the model was tested on, namely a sine wave, a mean squared error of 0.45 effectively corresponds to no learning as, on average, the post-synaptic representation could be seen as hovering around 0 while the pre-synaptic signal varied in the [−1, +1] domain.
Somewhat surprisingly, the best learning performance was found not with the lowest noise, but with levels in the [10% − 20%] range.

Exponent Search Experiments
The results of the exponent parameter search are shown in Fig. 6b, where we can see how strongly the performance of the model depends on the magnitude of the exponent in the power law. The quality of the fit is in direct correspondence with the value of c, with only models with c ≥ −0.2 showing the capacity to learn satisfactorily. The lowest error is found for c in the range [−0.13, −0.04]. However, there is a slight uptick in the error for the smallest magnitudes of the exponent; this effect is most likely the result of the memristors having too small a response for the training time supplied (which was 24 s in all our experiments).

Dependence of Error on Memristor Parameters
Based on the previous two experiments, we attempted to identify which of the parameters c, R_0, or R_1 in Eq. 7 was most important in explaining the learning performance of our model.
As can be seen in Fig. 6a, the highest noise level of 100% led to an average mean squared error of 0.45. The default exponent c = −0.146 with 100% noise varied in the range [−0.292, 0], which would result in an average MSE of 0.3, based on the results of the exponent search shown in Fig. 6b. The uncertainty in the R_0 and R_1 parameters then only accounted for 1/3 of the variation in results, even for a fairly small value of c = −0.146; given the exponential dependency of the resistance update on c (see Eq. 7), we can understand how the magnitude of the update, and the consequent learning performance, rested chiefly on the choice of voltage of the SET pulse.

Discussion
Even with the restrictions imposed by the devices' physical characteristics, our experiments showed that we could enable the post-synaptic neuronal ensemble to represent transformations of the pre-synaptic signal well by applying our novel mPES learning rule to synapses based on our Nb:STO memristive device. This process leads to a set of memristor resistances that can be mapped to network weights, which transform the supplied input into the desired output by encoding the learned function. Crucially, our results are not limited to our Nb:STO memristors but generalise to any memristive device whose memristance follows a power law. We showed that our model could learn to approximate both the identity y = x and square y = x^2 functions - the latter being a qualitatively harder problem to solve - with no theoretical reason limiting it to only representing these transformations. Learning was effective both when the input signal was a periodic sine wave and when it was random, as in the case where the pre-synaptic ensemble was fed white noise or the signal was switched between learning and testing phases; this second result is important in that it shows that the model was able to learn the transformation independently of the regularity of the supplied training data and of the input signal itself. It had already been suggested that the original PES learning rule - on which our mPES algorithm is based - is able to learn arbitrary vector functions [36, 40]. The fact that our learning rule is derived from one already capable of approximating arbitrary functions lends further credence to our conclusion that our model is capable of learning any transformation of n-dimensional vectors. Critically, as our model aimed to present strong neuromorphic credentials, these results are robust to the presence of noise on the updates and on the initial state of the memristors' resistances.
The results of the learning experiments - during which we compared the quality of the representations in the pre- and post-synaptic ensembles - were competitive with the original PES learning rule, even though the latter had the advantage of operating on ideal, continuous network weights with no uncertainty on their updates. Our mPES learning rule showed a post-training ρ/MSE ratio consistently close to that of PES; the results were especially favourable when trying to learn the identity function f(x) = x. In general, the ρ/MSE ratio was consistently lower for both algorithms when trying to learn the square transformation f(x) = x^2 compared to the f(x) = x case, but the drop in performance for mPES was more marked. These observations suggest that the square function could indeed be a harder problem to learn, and that the gap in performance between PES and mPES on this transformation could be bridged by using a higher number of neurons in conjunction with our learning rule. Both learning rules exhibited higher performance when trained on a white noise input signal compared to a sine wave; we would tend to attribute this difference to the dissimilarity in how the input spaces are sampled, but have no grounded theoretical explanation for this influence. Most importantly, while we did see a decline in learning performance when switching to a different input signal between learning and testing phases, this effect was limited; this lends credence to the belief that our model is actually able to generalise and thus is learning the function f, not just the transformed input signal f(x).
The evaluations of the learning performance for increasing amounts of noise were done with models composed of a small number of neurons, which could potentially predispose the network to be more susceptible to the injected randomness; this seems to further confirm that using our memristive devices and learning rule can lead to robust performance. As previously reported, the model performed its best with a moderate amount of Gaussian noise on the initial memristor resistances and on the parameters of the update equation Eq. 7. The act of initialising each network weight to a different value is, in Machine Learning terminology, known as "symmetry breaking"; failing to do so can make a model difficult or impossible to train. Therefore, the noise on the initial resistance state could effectively be "breaking the symmetry" of the weights. An explanation for the increase in performance at moderate levels of noise is harder to come by, but perturbing the weights by adding noise to the resistance update equation could be seen as a regularisation technique, which should help the network learn a smoother function from the training data, thus improving its generalisation on the test set [41, p. 242]. Another - non-orthogonal - explanation could be that stochasticity is in itself a computational resource that the model can exploit to increase its capabilities.
It is notable that the learning performance was maintained even though our mPES learning rule had absolutely no knowledge of the magnitudes of the updates, and the only hyperparameter tuned was the gain γ, whose effect was analogous to a learning rate. We experimentally found, as reported in Table 1, that a value of γ = 10^4 led to the best learning performance - defined as the balance between testing error and correlation, the ρ/MSE ratio - for all models given by combinations of neuronal ensemble size, training input signal, and desired transformation. The best values for mean squared error (MSE) and Spearman correlation coefficient (ρ) were both found for the same value of γ = 10^4 in all the instances where the models were instructed to learn the identity function f(x) = x. Instead, when the models were attempting to learn the square function f(x) = x^2 there was a discrepancy between the values of γ leading to the lowest MSE and to the highest ρ. The two models learning the square from the regular sine wave input still had the best ρ for γ = 10^4, but showed the lowest error for a smaller gain factor of γ = 10^3; the combined ρ/MSE statistic still indicated γ = 10^4 as the best choice for this class of model, as the positive gains in correlation counterbalanced the higher error. The models learning the square transformation from the random white noise input followed the same pattern, as they also showed the highest correlation for γ = 10^4 and required a smaller gain factor to exhibit the lowest error: γ = 10^2 in their case. The measured ρ/MSE for this last class of models was close for both γ = 10^3 and γ = 10^4, as the correlation ρ did not suffer as much as when trying to learn the square from the sine wave input. In the [100 neurons, sine, f(x) = x^2] instance the best ρ/MSE ratio was found for γ = 10^3 but, given that the result was very close to the one obtained with γ = 10^4, the latter was chosen for consistency with all other cases. The magnitude of γ
was crucial to enable learning as it brought the normalised conductances calculated in Eq. 2 to a scale compatible with the model and the input signal. The need to precisely tune the learning rate when using memristors presenting non-linear and asymmetric response has already been recognised [42], and this effect is somewhat confirmed in our setting, as a too small or too large γ completely disabled the learning capacity of the model, while a sub-optimal value hindered the learning performance. Having to tune hyperparameters in order to achieve learning performance is, unfortunately, the norm in Machine Learning, so it is not surprising that we ran into the same requirement in our setting. The search we performed was not fine-grained, but gave a good indication of the magnitude of γ we could expect to lead to reasonable performance. It could be argued that - in a neuromorphic setting - it is more advantageous to have a "rough" estimate for the hyperparameter - γ = 10^4 in our case - leading to "robust" learning in a variety of situations, rather than focusing on finding the optimal value for a specific setup. Future work should focus on establishing a theoretical link between the model setup, including input signal characteristics and magnitude, and the expected range in which to find a γ leading to sufficiently robust learning performance.
The memristance response of our device had notable implications for its use as a substrate for learning [42, 43]. In our current work we only used SET pulses of +0.1 V, while negative RESET pulses were largely ignored, as the effect of sequential negative pulses was seen to quickly diminish with pulse number. Our initial choice to apply positive learning pulses of +0.1 V implied having an exponent c = −0.146 for the power law governing the memristor updates, which put the model close to the region exhibiting the best learning performance. From a physical point of view, the magnitude of the SET pulses could be set very low, as the current-voltage characteristics of our memristors are continuous, without any step-like features reminiscent of a switching threshold voltage. Hence, in principle we expect even small pulse magnitudes to influence the resistive state of the device. In addition, by utilising a read voltage of a different polarity than that of the SET pulses, we created an asymmetry between the writing and reading processes that made it possible to use very small voltage amplitudes for the SET pulses. Thus it should be possible - in principle - to use physical voltage pulses as small as needed to guarantee the best learning performance. Different mechanisms govern charge transport in forward and reverse bias; this leads to asymmetries in the current-voltage characteristics and may also result in differences between the SET and RESET processes. Thus, we elected to focus on the forward bias regime for learning, by only supplying positive SET pulses to either the M+ or M− memristor in each differential synaptic pair. There do not seem to be theoretical reasons for this to impinge on the learning performance, but long training regimes could potentially lead to the memristors' resistances saturating. That is, using only positive pulses progressively brings the devices to a low resistance/high conductance state with no way to backtrack. This issue could be alleviated by a careful pulsing scheme in which memristors close to saturation were "re-initialised" to a high resistance state by supplying RESET voltage pulses, at the cost of losing some precision in the represented network weights [42]. We could imagine doing this by re-initialising both memristors in a pair to their initial resistances with RESET pulses and then supplying a number of SET pulses proportional to the magnitude of the weight before the reset, to either the positive or negative memristor depending on the polarity of the weight. After the resetting, a highly positive weight would thus see its memristor M+ receiving more pulses than a slightly positive one, just as a moderately negative weight might have its memristor M− receive fewer SET pulses than a highly negative one. This approach could have the disadvantage of requiring quite a complex meta-architecture, but it could also be that our current learning approach is already robust enough to deal gracefully with the resetting of a small percentage of its synapses.
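The proposed re-initialisation scheme could be sketched as follows, with hypothetical `reset()`/`set_pulse()` device methods and the pre-reset weight assumed normalised to [−1, +1]:

```python
def reinitialise_pair(pair, weight, n_max=10):
    """Sketch of the proposed anti-saturation scheme: RESET both
    memristors of a differential pair to their initial (high) resistance,
    then re-apply SET pulses to the memristor matching the weight's
    sign, in proportion to the weight's pre-reset magnitude.

    `pair` is a hypothetical object exposing reset()/set_pulse() on its
    `plus` and `minus` memristors; `n_max` (an assumed scaling constant)
    is the pulse count corresponding to a full-magnitude weight.
    """
    pair.plus.reset()
    pair.minus.reset()
    target = pair.plus if weight >= 0 else pair.minus
    n_pulses = round(abs(weight) * n_max)
    for _ in range(n_pulses):
        target.set_pulse()
    return n_pulses
```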
Another path that could improve the learning performance would be to grade the number, the duration, or the magnitude of the pulses applied to each memristor, based on the participation of its pre- and post-synaptic neurons in the global error. This could prove a less demanding undertaking than shaping the pulses based on the memristors' current state - as previously described - while still resulting in positive gains in learning performance.

Conclusions
In this work we fabricated memristive devices based on Ni/Nb-doped SrTiO3 and found that their memristance followed a power law. These memristive devices were used as the synaptic weight element in a spiking neural network to simulate, to our knowledge, one of the first models of this kind capable of learning to be a universal function approximator. The performance was tested in the Nengo Brain Builder framework by defining a simple network topology capable of learning transformations of multi-dimensional, time-varying signals. The network demonstrated good learning performance with robustness to moderate noise levels of up to ∼20%, showing that this class of memristive devices is apt for use as a component of a neuromorphic architecture. As a side note, some of our results also suggest that stochasticity is indeed an important computational resource for neuromorphic computing, as our model performed at its best with a moderate amount of noise. It is worth restating that the network weights were found using only discrete updates to the memristors, based on knowledge local to each pre- and post-synaptic neuron pair; this means that our learning model presents many more neuromorphic characteristics than, for example, previous works where memristors are used as mere hardware implementations of artificial neural network weights (for example, see [2]).
Using memristive devices as weights can enable efficient computing - which is also one of the cornerstones of the neuromorphic approach - but not pairing them with a neuromorphic learning algorithm and a spiking neural network imposes quite stringent requirements on the physical characteristics of the device. For example, a memristor simply used as a physical implementation of a network weight needs to exhibit a reliable, linear, and symmetric conductance response in order to approximate the idealised weight it stands in for [44]. Present-day machine learning models suffer from "brittleness" (i.e., small changes in input can give rise to large errors) and do not seem well positioned to match human adaptability across a wide variety of tasks. It is becoming accepted that stochasticity is itself a computational resource [1] - even though this is still not formally proven - and the fact that our model seems to perform best with a moderate amount of noise could be a result of this effect. It had already been proposed that networks using memristors as synapses could match the learning performance of traditional artificial neural networks [42], and our results are in line with this conjecture.

Additional Requirements Contribution to the Field Statement
Approaches based on artificial neural networks have found great success in performing a variety of tasks, but these are typically run on conventional (von Neumann) computational architectures. Moving to a computational framework in which the co-localisation of memory and computation is realised by brain-inspired (neuromorphic) hardware is a promising way to address the main limitations of current methods. In our manuscript we simulated this approach by utilising novel interfacial memristive devices - whose behaviour resembles that of biological synapses - as weights in a neural network. This network learned to represent non-linear functions through a training process based on an original supervised learning algorithm. Using this class of memristive devices as the synaptic weight element in a spiking neural network yields, to our knowledge, one of the first models of this kind capable of learning to be a universal function approximator; their interesting dynamics and characteristics also strongly suggest the suitability of our memristors for usage in future neuromorphic computing platforms. Our results are not limited to this class of memristors, but generalise to any whose behaviour follows a power law. We believe our findings and analyses also offer insight into building and benchmarking the algorithms necessary to benefit from these new materials.

Figure 1 :
Figure 1: (a) The Error neuronal ensemble calculates the difference E = y − f(x) between the post-synaptic y and transformed pre-synaptic f(x) representations in order to inform the mPES learning rule. These connections are not implemented using simulated memristors and are calculated by Nengo. mPES tunes the weights W on the connection between the pre- and post-synaptic ensembles with the overall goal of reducing the error E. W transforms x into y ∼ f(x) across the synaptic connection and, with training, the post-synaptic ensemble's representation improves, becoming closer to the transformed pre-synaptic one: y → f(x). NB: the input x, error signal E, and output y can have different dimensionality, which is independent of the number of neurons used to represent them. In our current work, all three signals have the same dimensionality. (b) In this example, the pre- and post-synaptic neuronal ensembles both have size 3 and are fully connected, so the weight matrix W has 3 × 3 entries. Each pair of pre- and post-synaptic neurons (i, j) is linked by a simulated synapse. Each synapse's weight W_ij is given by the scaled difference in normalised conductances γ(G+_ij − G−_ij) of its differential synaptic pair of memristors M+_ij and M−_ij. As an example, the figure focuses on how the neurons (1, 1) in the pre- and post-synaptic ensembles are connected by a synaptic weight W_11; the neural activity a_1 of the pre-synaptic neuron is combined with the synaptic weight W_11 to give the input to the first post-synaptic neuron. The activations of the post-synaptic neurons, in turn, define the signal represented in the post-synaptic ensemble.

Figure 3 :
Figure 3: (a) Device response to the application of multiple RESET pulses of varying amplitudes of −2 V (black), −3 V (red) and −4 V (blue), plotted as the percentage difference in the resistance before and after the applied pulse. (b) Device response to the application of multiple SET pulses of varying amplitudes of +0.1 V (black), +0.5 V (blue) and +1 V (red). 10 RESET pulses of −4 V are applied to increase the resistance before the application of the potentiation pulses.

Figure 4 :
Figure 4: (a) Circles show the experimental data of device resistance after the application of multiple SET pulses with amplitudes varying from +0.1 V (top branch) to +1 V (bottom branch). Lines represent fits to Eq. 4. (b) Black circles show the exponents extracted from the fits in (a) as a function of pulse voltage. The red line is a linear regression fit from which the a and b parameters are obtained.

Figure 5 :
Figure 5: (a) Decoded output from pre- and post-synaptic ensembles of 10 neurons on a sine wave input, after being trained on a sine wave to learn to represent the identity function f(x) = x. The quality of learning is ρ/MSE = 16.3398. (b) Decoded output from pre- and post-synaptic ensembles of 100 neurons on a sine wave input, after being trained on a sine wave to learn to represent the identity function f(x) = x. The quality of learning is ρ/MSE = 132.9364. (c) Decoded output from pre- and post-synaptic ensembles of 100 neurons on a white noise input, after being trained on a white noise signal to learn to represent the identity function f(x) = x. The quality of learning is ρ/MSE = 10.5064. (d) Decoded output from pre- and post-synaptic ensembles of 100 neurons on a sine wave input, after being trained on a sine wave to learn to represent the square function f(x) = x^2. The quality of learning is ρ/MSE = 12.2479.

Figure 6 :
Figure 6: (a) The black line shows the mean squared error (MSE) measured on the testing phase of the simulation as a function of the amount of noise injected into the model.(b) The black line shows the mean squared error as a function of the magnitude of the exponent c in Eq. 7. The red line shows the trend for the MSE.