Development of a Bayesian Estimator for Audio-Visual Integration: A Neurocomputational Study

The brain integrates information from different sensory modalities to generate a coherent and accurate percept of external events. Several experimental studies suggest that this integration follows the principles of Bayesian estimation. However, the neural mechanisms responsible for this behavior, and its development in a multisensory environment, are still insufficiently understood. We recently presented a neural network model of audio-visual integration (Neural Computation, 2017) to investigate how a Bayesian estimator can spontaneously develop from the statistics of external stimuli. The model assumes the presence of two topologically organized unimodal areas (auditory and visual). Neurons in each area receive an input from the external environment, computed as the inner product of the sensory-specific stimulus and the receptive-field synapses, and a cross-modal input from neurons of the other modality. Based on sensory experience, synapses were trained via Hebbian potentiation and a decay term. The aim of the present work is to improve the previous model by including a more realistic distribution of visual stimuli: visual stimuli have a higher spatial accuracy at the central azimuthal coordinate and a lower accuracy at the periphery. Moreover, their prior probability is higher at the center and decreases toward the periphery. Simulations show that, after training, the receptive fields of visual and auditory neurons shrink to reproduce the accuracy of the input (both at the center and at the periphery in the visual case), thus realizing the likelihood estimate of unimodal spatial position. Moreover, the preferred positions of visual neurons contract toward the center, thus encoding the prior probability of the visual input. Finally, a prior probability of the co-occurrence of audio-visual stimuli is encoded in the cross-modal synapses. The model is able to simulate the main properties of a Bayesian estimator and to reproduce behavioral data in all conditions examined.
In particular, in unisensory conditions the visual estimates exhibit a bias toward the fovea, which increases with the level of noise. In cross-modal conditions, the standard deviation of the estimates decreases when congruent audio-visual stimuli are used, and a ventriloquism effect becomes evident in case of spatially disparate stimuli. Moreover, the ventriloquism effect decreases with eccentricity.

Each neuron is described by a static sigmoidal relationship, which accounts for the presence of a lower threshold and upper saturation in neuron activity, and a first-order low-pass filter with time constant \tau, which accounts for the neuron integrative capacity.
Hence, for the generic k-th neuron in the modality S (S = A or V for the auditory and visual modalities, respectively) we can write

\tau \frac{dx_k^S(t)}{dt} = -x_k^S(t) + \left[ i_k^S(t) + l_k^S(t) + c_k^S(t) \right], \qquad y_k^S(t) = F\left(x_k^S(t)\right) \qquad (A1)

where y_k^S represents the neuron output, i_k^S, l_k^S and c_k^S are the sensory, lateral and cross-modal inputs defined below, and the sigmoidal relationship F is described by the following expression

F(x) = \frac{1}{1 + e^{-s(x - x_0)}} \qquad (A2)

s and x_0 are parameters which set the slope and the central position of the sigmoidal relationship. According to Eq. (A2), the neuron output activity is normalized between 0 and 1 (zero means a silent neuron, one a maximally activated neuron).
It is worth noting that, for the sake of simplicity, we used the same parameters (\tau, s and x_0) for all neurons, independently of their modality. This choice was adopted to minimize the number of model assumptions.
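As an illustrative sketch, the first-order dynamics and the sigmoidal relationship of Eqs. (A1)-(A2) can be simulated as follows; the numerical values of TAU, S_SLOPE, X0 and DT are placeholders, not the values reported in Table 1.

```python
import numpy as np

# Placeholder parameter values (Table 1 gives the actual ones).
TAU = 3.0      # time constant of the first-order low-pass filter
S_SLOPE = 0.3  # slope s of the sigmoid in Eq. (A2)
X0 = 15.0      # central position x0 of the sigmoid in Eq. (A2)
DT = 0.1       # Euler integration step

def sigmoid(x):
    """Static sigmoidal relationship F(x) of Eq. (A2), normalized in [0, 1]."""
    return 1.0 / (1.0 + np.exp(-S_SLOPE * (x - X0)))

def step(x, u):
    """One Euler step of the first-order dynamics tau*dx/dt = -x + u."""
    return x + DT / TAU * (-x + u)

def steady_state_output(u, n_steps=2000):
    """Run the filter to steady state and return the neuron output y = F(x)."""
    x = np.zeros_like(u)
    for _ in range(n_steps):
        x = step(x, u)
    return sigmoid(x)
```

At steady state the filter state equals the overall input, so the output reduces to F(u): a weak input produces a nearly silent neuron, a strong input a nearly saturated one.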
The sensory input is computed as the scalar product of the sensory representation of the stimulus (Eqs. (1)-(6)) and the neuron receptive field:

i_k^S(t) = \sum_j r_{kj}^S \, I_j^S(t) \qquad (A3)

where I_j^S(t) is the sensory representation of the stimulus at position j, and r_{kj}^S are the receptive-field synapses of neuron k. We assumed that the neuron receptive field, r_k^S, initially has a large extension, described with a Gaussian function, and then progressively shrinks during training, to fit the width of the external input.
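The scalar-product sensory input described above can be sketched as a matrix-vector product, one row of Gaussian receptive-field weights per neuron; the number of positions and the Gaussian widths below are illustrative assumptions.

```python
import numpy as np

N = 180  # illustrative number of spatial positions / neurons

def gaussian(center, sigma, n=N):
    """A Gaussian profile over the n spatial positions."""
    pos = np.arange(n)
    return np.exp(-(pos - center) ** 2 / (2.0 * sigma ** 2))

def sensory_input(stimulus, rf_matrix):
    """Scalar product of the stimulus representation and each neuron's
    receptive field: i_k = sum_j r_kj * I_j (a matrix-vector product)."""
    return rf_matrix @ stimulus

# Broad Gaussian receptive fields (one row per neuron) and a narrower stimulus.
rf = np.stack([gaussian(k, sigma=10.0) for k in range(N)])
stim = gaussian(center=90, sigma=4.0)
inputs = sensory_input(stim, rf)  # peaks at the neuron tuned to the stimulus
```

The resulting input profile is maximal for the neuron whose preferred position matches the stimulus position, and broadens with the receptive-field width.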
The lateral input is computed as follows

l_k^S(t) = \sum_j L_{kj}^S \, y_j^S(t) \qquad (A4)

where L_{kj}^S represents a lateral intra-area synapse connecting the presynaptic neuron j to the postsynaptic neuron k in the same area. Here we used the classical Mexican-hat arrangement: a neuron is excited by proximal neurons in the same area, and inhibited by more distal ones. In particular, we have

L_{kj}^S = L_{ex} \, e^{-d_{kj}^2 / (2\sigma_{ex}^2)} - L_{in} \, e^{-d_{kj}^2 / (2\sigma_{in}^2)} \qquad (A5)

where L_{ex}, L_{in}, \sigma_{ex} and \sigma_{in} are parameters which set the strength and width of the excitatory and inhibitory portions of the Mexican hat, and d_{kj} represents the distance between the neurons' preferred positions, computed with the circular rule described below.
It is worth noting that we used the same expression of lateral synapses (Eq. (A5)) in both the auditory and visual areas, to limit the number of model assumptions.
Finally, the cross-modal term in Eq. (A1) is computed as the convolution of the vector of cross-modal synapses and the activity in the other unisensory area, i.e.

c_k^S(t) = \sum_j w_{kj}^{SQ} \, y_j^Q(t) \qquad (A6)

where w_{kj}^{SQ} represents a cross-modal synapse from the presynaptic neuron j in the area Q to the postsynaptic neuron k in the area S (S, Q = A or V, with S \neq Q). We assumed that the cross-modal synapses are initially ineffective and are progressively reinforced during the training phase.
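A minimal sketch of the cross-modal term: each neuron receives the activity of the other unisensory area weighted by the cross-modal synapse matrix, which is zero (ineffective) before training. Names and sizes below are illustrative.

```python
import numpy as np

N = 180  # illustrative number of neurons per unisensory area

def cross_modal_input(w_SQ, y_Q):
    """Cross-modal contribution to neurons in area S: the activity of the
    other unisensory area Q weighted by the cross-modal synapses w_kj."""
    return w_SQ @ y_Q

w_av = np.zeros((N, N))        # cross-modal synapses: initially ineffective
y_other = np.random.rand(N)    # activity of the other unisensory area
c = cross_modal_input(w_av, y_other)  # all zeros before training
```

Before training the cross-modal input vanishes regardless of the other area's activity; Hebbian reinforcement during training makes it progressively effective.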

Training the network
Starting from the initial basal value of the synapses, the network has been trained during a period in which the sensory input representations (i.e., I^A and I^V) have been given with a random distribution. The shape of the inputs, their strength, the superimposed noise and the random positions (in unisensory and cross-modal conditions) have been described in the text (see Eqs. (1)-(13)). During training we assumed that the standard deviation of the noise (say \sigma^S in Eqs. (1) and (9)) is a fraction \nu of the maximum input. Hence

\sigma^S = \nu \, I_{max}^S

In the present simulations we assumed \nu = 0.5 during training and \nu = 0.33, 0.5 or 0.66 in evaluating network performance after training.
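A minimal sketch of how noise with a standard deviation proportional to the maximum input can be superimposed on a clean stimulus (the fraction 0.5 matches the training value above; function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded generator for reproducibility

def noisy_input(clean, fraction, rng=rng):
    """Superimpose zero-mean Gaussian noise whose standard deviation is a
    fixed fraction of the maximum of the clean input (0.5 during training;
    0.33, 0.5 or 0.66 when testing the trained network)."""
    sigma = fraction * clean.max()
    return clean + rng.normal(0.0, sigma, size=clean.shape)
```

Scaling the noise to the input maximum keeps the signal-to-noise ratio constant across stimuli of different strengths.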
The synapses describing the receptive field, r_{kj}^S, and those describing the cross-modal link between the two areas, w_{kj}^{SQ}, have been trained using a learning rule with a classical Hebbian potentiation factor and a decay term. In scalar form we can write

\Delta r_{kj}^S = \gamma_R \, y_k^S \, I_j^S - \delta_R \, r_{kj}^S \qquad (A9)

\Delta w_{kj}^{SQ} = \gamma_W \, y_k^S \, y_j^Q - \delta_W \, w_{kj}^{SQ} \qquad (A10)

where \gamma_R and \gamma_W set the learning rates, and \delta_R and \delta_W the decay rates. Eqs. (A9) and (A10) have been applied, at each step, using the final steady-state values of the neuron outputs (i.e., when transient phenomena are exhausted).
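The Hebbian-plus-decay rule of Eqs. (A9)-(A10) can be sketched as follows; the rate constants GAMMA and DECAY are illustrative assumptions, not the paper's parameter values.

```python
import numpy as np

GAMMA = 0.02  # Hebbian learning rate (illustrative value)
DECAY = 0.01  # decay rate (illustrative value)

def hebbian_update(w, y_post, y_pre):
    """One training step for a synaptic matrix w: a Hebbian potentiation
    factor (product of post- and presynaptic steady-state activities)
    minus a decay term, applied to every synapse w_kj at once."""
    return w + GAMMA * np.outer(y_post, y_pre) - DECAY * w
```

With stationary activities the update converges to w = (GAMMA/DECAY) * y_post * y_pre, so the decay term bounds synaptic growth and makes the weights track the correlation between pre- and postsynaptic activity.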
At the beginning of training all cross-modal synapses are assumed equal to zero. Conversely, the receptive-field synapses have a broad spatial extension and moderate amplitude, identical for the two modalities, i.e.

r_{kj}^S(0) = r_0 \, e^{-d_{kj}^2 / (2\sigma_R^2)} \qquad (A11)

where r_0 sets the initial strength of the receptive field, and \sigma_R establishes its initial spatial extension (we assume a large \sigma_R, i.e., a wide initial receptive field). Of course, Eq. (A11) holds only at the first step of training.
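The broad initial Gaussian receptive field of Eq. (A11) can be generated as follows; the values of R0 and SIGMA_R0, and the use of a circular distance between positions, are illustrative assumptions (Table 1 gives the actual parameters).

```python
import numpy as np

N = 180          # illustrative number of positions / neurons
R0 = 0.15        # initial strength r0 of the receptive field (placeholder)
SIGMA_R0 = 20.0  # wide initial extension sigma_R (placeholder), much larger
                 # than the width of the sensory inputs

def initial_receptive_field(n=N):
    """Initial receptive-field synapses: a broad Gaussian of moderate
    amplitude, identical for the auditory and visual modalities (this
    holds only at the first step of training)."""
    k = np.arange(n)[:, None]  # postsynaptic neurons
    j = np.arange(n)[None, :]  # presynaptic (spatial) positions
    d = np.abs(k - j)
    d = np.minimum(d, n - d)   # circular distance (an assumption here)
    return R0 * np.exp(-d ** 2 / (2.0 * SIGMA_R0 ** 2))
```

During training the Hebbian rule then sharpens each row of this matrix around the neuron's preferred position, shrinking the receptive field to the width of the external input.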
A list of all model parameters is given in Table 1.
Circular rule used to compute the preferred positions in Eq. (15)
Let us assume that a stimulus of modality S was given at position \theta^S (and so, the activity in the network of the same modality is approximately centered at that position