Recurrent network for multisensory integration: identification of common sources of audiovisual stimuli

We perceive our surrounding environment by using different sense organs. However, it is not clear how the brain estimates information from our surroundings from the multisensory stimuli it receives. While Bayesian inference provides a normative account of the computational principle at work in the brain, it does not provide information on how the nervous system actually implements the computation. To provide an insight into how the neural dynamics are related to multisensory integration, we constructed a recurrent network model that can implement computations related to multisensory integration. Our model not only extracts information from noisy neural activity patterns, it also estimates a causal structure; i.e., it can infer whether the different stimuli came from the same source or different sources. We show that our model can reproduce the results of psychophysical experiments on spatial unity and localization bias which indicate that a shift occurs in the perceived position of a stimulus through the effect of another simultaneous stimulus. The experimental data have been reproduced in previous studies using Bayesian models. By comparing the Bayesian model and our neural network model, we investigated how the Bayesian prior is represented in neural circuits.


INTRODUCTION
We are surrounded by many sources of sensory stimulation, i.e., many sights and sounds. Nevertheless, we can recognize who is speaking in a conversation even when there are many people and sounds around us. To perform such recognition, we have to integrate the correct pairs of stimuli, e.g., the movements of a person's mouth and the sound of his/her voice. Thus, it is important to determine how we judge which pairs of audiovisual stimuli are related and how we integrate related cues. That is, we must study multisensory integration in order to elucidate how our brains link multiple sources of information. A good example of audiovisual integration is the ventriloquism effect, in which the perceived location of a ventriloquist's voice is altered by the movement of a dummy's mouth (Howard and Templeton, 1966). The ventriloquism effect can also be elicited with artificial stimuli such as a spot of light and a beep (Bertelson and Aschersleben, 1998; Pavani et al., 2000; Lewald et al., 2001; Hairston et al., 2003; Alais and Burr, 2004; Wallace et al., 2004). Several theoretical models based on Bayesian inference have been proposed to explain the data from psychophysical experiments on the ventriloquism effect (Körding et al., 2007; Sato et al., 2007). Although Bayesian inference gives a normative account of the computational principle, it does not indicate how the nervous system actually implements the computation.
To provide insights into the neural dynamics related to sensory integration, several studies have constructed neural network models that implement Bayesian inference (Pouget et al., 1998; Ma et al., 2006). When stimuli have a common cause, these models are able to extract encoded information from the activities of large populations of neurons as reliably as a maximum likelihood estimator (Deneve et al., 1999; Latham et al., 2003). However, when stimuli have distinct sources, the models do not work correctly because they bind cues even when the stimuli do not have the same source. When the stimuli have distinct causes, the brain has to estimate the causal structure of the stimuli and extract information separately from each stimulus. We constructed a recurrent network model that can implement computations related to multisensory integration by changing the method of divisive normalization in the model of Deneve et al. (1999). We found that our model could estimate not only the locations of the sources of the stimuli but also the number of sources. By using computer simulations, we showed that the model accounts for the data of psychophysical experiments that have been explained by the Bayesian model. To elucidate how our brains implement a Bayesian prior distribution, we tried to determine which neural connectivities represent the prior distribution.

MODEL
We constructed a single-layer recurrent network consisting of $N = 1000$ analog neurons with identical spatial receptive fields. Here, we label a neuron $i$ by its preferred angle $\theta_i$ and express the firing rate as a function of $\theta$; therefore, a neural state $u_i$ describes the firing rate of the neuron population (including both excitatory and inhibitory neurons) with the preferred angle $\theta_i$. In order to reduce the number of parameters and facilitate analysis of the system's behavior, we study a simpler model in which the excitatory and inhibitory populations are collapsed into a single equivalent population. To model a cortical hypercolumn consisting of a single layer of neurons, we assumed that the preferred orientations are evenly distributed from $-50$ to $50$ deg, dividing the 100 deg range into $N = 1000$ sections. The internal state $a_i$ of neuron $i$ is the sum of two terms: an external input $h_i$ and a recurrent input from the other neurons in the network.
Using $a_i$, we defined the firing rate $u_i$. To keep $u_i$ positive, we passed $a_i$ through a threshold linear function, and to control the gain of the firing rate $u_i$, we used divisive normalization (Carandini et al., 1997). The interaction in the network turns noisy input into a smooth hill of activity, and the peak position of the hill gives an estimate of the orientation. In a previous study, Deneve et al. (1999) defined the firing rate $u_i(t)$ in terms of the square of the input $a_i$, which gives a stronger normalization than ours. In order to collapse the excitatory and the inhibitory populations into a single equivalent population, we assumed that the synaptic weight $J_{ij}$ has a Mexican-hat-type connectivity: excitation is given to nearby neurons and inhibition to distant neurons (Figure 1; Amari, 1977; Shadlen et al., 1996). The weight is specified by four parameters (Equation 4): $M_1$ and $M_2$ set the strengths of the excitatory and inhibitory connections, and $\sigma_1$ and $\sigma_2$ respectively define the range of the excitatory connection and of the lateral inhibition. Here, we set $M_1 = 28$, $M_2 = 10$, $\sigma_1 = 1.5$ deg, and $\sigma_2 = 3$ deg. Two features, weak normalization and lateral inhibition, distinguish our model from that of Deneve et al. (1999), and they enable our model to reproduce the results of psychophysical experiments (as discussed in the Results).
Let us consider an external input, $h$, from either a preceding layer or the external world. The external input to neuron $i$, $h_i$, depends on the orientation encoded in the previous layer and is Gaussian distributed with mean $\bar{h}_i$ and variance $\sigma_i^2$; that is, $h_i$ consists of the mean $\bar{h}_i$ plus a noise term (Equation 5), where $z_i$ denotes noise. We set $\sigma_i^2$ to the mean activity, i.e., $\sigma_i^2 = \bar{h}_i$, which better approximates the noise measured in the cortex (Shadlen and Newsome, 1994). The standard deviations $\sigma_V$ and $\sigma_A$ respectively represent the uncertainties of the visual and audio inputs. Note that in our model the strength of the input activity, $M/\sqrt{2\pi\sigma^2}$, is determined not only by $M$ but also by the uncertainty of the input, $\sigma$. We assumed that the visual input is more reliable than the audio input, i.e., $\sigma_V < \sigma_A$. To investigate the effect of the difference in uncertainty between the visual and audio inputs, we fixed the amplitudes to a common value, $M_V = M_A$. Thus, the input strength of the visual input is larger than that of the audio input, i.e., $M_V/\sqrt{2\pi\sigma_V^2} > M_A/\sqrt{2\pi\sigma_A^2}$. An example of external input to the network is given in Figure 2. Now let us explain $x_V$ and $x_A$ in Equation 5; they represent the input locations of the audiovisual stimuli. We assume that they are Gaussian distributed around the true source positions $S_V$ and $S_A$, with standard deviations $\sigma_{Vx}$ and $\sigma_{Ax}$. Under the network dynamics (Equation 1 and Equation 2), the noisy neural states become a smooth hill whose peak indicates the estimated position of the audiovisual stimuli (Figure 3).
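The ingredients described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the equations themselves are not reproduced in this text, so the sketch assumes a difference-of-Gaussians weight $J_{ij} = M_1 e^{-(\theta_i-\theta_j)^2/2\sigma_1^2} - M_2 e^{-(\theta_i-\theta_j)^2/2\sigma_2^2}$, an internal state $a_i = h_i + \sum_j J_{ij} u_j$, and a normalization of the form $u_i = [a_i]_+^p / (S + \mu \sum_j [a_j]_+^p)$ with $p = 1$ for the weak case. The constants `S` and `MU`, the reduced network size `N = 200`, and all function names are hypothetical.

```python
import math

# Parameter values quoted in the text.
M1, M2 = 28.0, 10.0          # strengths of excitation / inhibition
SIGMA1, SIGMA2 = 1.5, 3.0    # ranges of excitation / inhibition [deg]

# Assumptions for illustration only (not given in the text).
N = 200                      # the paper uses N = 1000; smaller here for speed
S, MU = 1.0, 0.01            # hypothetical normalization constants


def preferred_angles(n=N, lo=-50.0, hi=50.0):
    """Evenly spaced preferred angles: [lo, hi) divided into n sections."""
    step = (hi - lo) / n
    return [lo + step * i for i in range(n)]


def mexican_hat(d):
    """Assumed difference-of-Gaussians weight for an angular distance d [deg]."""
    return (M1 * math.exp(-d * d / (2 * SIGMA1 ** 2))
            - M2 * math.exp(-d * d / (2 * SIGMA2 ** 2)))


def weight_matrix(theta):
    """Full weight matrix J[i][j] built from pairwise angular distances."""
    return [[mexican_hat(ti - tj) for tj in theta] for ti in theta]


def firing_rates(a, exponent=1):
    """Threshold-linear output followed by divisive normalization.

    exponent=1 is the weak normalization; exponent=2 is the squared form.
    """
    powered = [max(x, 0.0) ** exponent for x in a]
    denom = S + MU * sum(powered)
    return [x / denom for x in powered]


def step(h, u, J):
    """One update: internal state = external input + recurrent input."""
    a = [h_i + sum(w * u_j for w, u_j in zip(row, u))
         for h_i, row in zip(h, J)]
    return firing_rates(a)
```

With these assumed forms the weight is excitatory at zero distance and inhibitory a few degrees away, which is the Mexican-hat shape described in the text. Whether repeated calls to `step` settle into one or two bumps depends on the placeholder constants, so the sketch makes no claim about reproducing the figures.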
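The input construction described above can be sketched as follows. The sketch assumes that the mean input is the sum of one Gaussian bump per modality with amplitude $M/\sqrt{2\pi\sigma^2}$, and that the noise enters as $h_i = \bar{h}_i + \sqrt{\bar{h}_i}\, z_i$ with $z_i$ standard normal, so that the variance equals the mean. Both forms are assumptions consistent with the description, and the function names are hypothetical.

```python
import math
import random


def gaussian_bump(theta, center, strength, sigma):
    """Mean input of one modality at angle theta.

    Amplitude is strength / sqrt(2*pi*sigma^2), so a more uncertain
    (larger sigma) input is also weaker for the same strength.
    """
    amp = strength / math.sqrt(2 * math.pi * sigma ** 2)
    return amp * math.exp(-(theta - center) ** 2 / (2 * sigma ** 2))


def mean_input(theta, x_v, x_a, m_v, m_a, sigma_v, sigma_a):
    """Assumed mean input: visual bump at x_v plus audio bump at x_a."""
    return [gaussian_bump(t, x_v, m_v, sigma_v)
            + gaussian_bump(t, x_a, m_a, sigma_a) for t in theta]


def noisy_input(mean, rng):
    """Gaussian noise whose variance equals the mean activity."""
    return [m + math.sqrt(m) * rng.gauss(0.0, 1.0) for m in mean]


def perceived_positions(s_v, s_a, sigma_vx, sigma_ax, rng):
    """Input locations x_V, x_A: true positions corrupted by sensory noise."""
    return rng.gauss(s_v, sigma_vx), rng.gauss(s_a, sigma_ax)
```

Because the noisy input can fall below zero where the mean is small, a rectifying output function such as the threshold linear function is needed to keep the firing rates positive.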

RESULTS
By using computer simulations, we showed that our network model can estimate the position(s) of the sources of audiovisual stimuli presented with a spatial disparity. While previous models could not reproduce the results of psychophysical experiments on audiovisual integration, our results are consistent with both experimental observations and Bayesian inference. If the disparity of the input stimuli was small ($x_A - x_V = 5$ deg), the stimuli were integrated at a high rate (about 70%) (Figure 3A). If the disparity was large, they were estimated as distinct stimuli (Figure 3B), which could not be reproduced in previous models in which the normalization term in Equation 2 is determined by the sum of squares (Deneve et al., 1999). In the previous network model, the stimuli were estimated not as distinct stimuli but as a united stimulus for any spatial disparity. This failure is partly the result of the strong divisive normalization used by Deneve et al. (1999) (Equation 2), which prunes the weaker input peaks and extracts only the maximum peak. Another reason is the lack of lateral inhibition between neurons. Figure 4 compares the models in the case of independent causes. As in the model of Deneve et al. (1999), in a weak normalization model without lateral inhibition the stimuli were estimated not as distinct stimuli but as a united stimulus for any spatial disparity (Figure 4B; Marti et al., 2013). Thus, both weak normalization and lateral inhibition are important for reproducing the results of the psychophysical experiments on audiovisual integration.

EFFECT OF SENSORY NOISE
We assumed that the information about the orientations of the audiovisual stimuli received from the sense organs, $x_V$ and $x_A$, is corrupted by sensory noise. This noise makes the output probabilistic (Figure 5A). Without noise, the number of sources would be completely determined by the spatial disparity $D$ (Figure 5B). Experiments have shown that people estimate the number of sources stochastically (Wallace et al., 2004).

BIAS
Psychophysical research has reported that when audiovisual stimuli were judged to be distinct, the estimated position of the auditory stimulus was displaced from the actual position of the auditory input (Wallace et al., 2004). To examine how the perception of common versus distinct causes affects the estimated position of the auditory stimulus, $\hat{S}_A$, we calculated the localization bias, i.e., the shift of the estimated auditory position toward the visual stimulus expressed as a percentage of the audiovisual disparity. We performed 500 simulations and averaged the localization bias for each disparity between the audiovisual stimuli and for each case, i.e., common and distinct. We compared our model with the previous model of Deneve et al. (1999). In the previous model, the stimuli were unified for any spatial disparity (Figure 6A). The value of the localization bias was nearly 100% for all spatial disparities. This means that their model treated the audio stimulus as noise. Our model judged stochastically whether the stimuli had a common cause or distinct causes (Figure 6B).
When two stimuli were unified, the localization bias was nearly 80%. This indicates that when there was a common cause, the estimated auditory position was on average very close to that of the visual stimulus. On the other hand, in the case of distinct causes, the localization bias took a negative value and was increasingly negative for smaller disparities. These results indicate that the estimated auditory position is pushed away from the location of the visual stimulus, as was experimentally observed (Wallace et al., 2004).
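For concreteness, the localization bias can be computed as below. The formula is not reproduced in this text, so the sketch assumes the form commonly used for this measure, $100 \times (\hat{S}_A - S_A)/(S_V - S_A)$; the function name is hypothetical.

```python
def localization_bias(s_a_hat, s_a, s_v):
    """Shift of the auditory estimate toward the visual stimulus, in percent.

    100 means the sound is perceived at the visual position, 0 means no
    shift, and a negative value means the estimate moved away from the
    visual stimulus.
    """
    disparity = s_v - s_a
    if disparity == 0:
        raise ValueError("bias is undefined for zero disparity")
    return 100.0 * (s_a_hat - s_a) / disparity
```

Under this definition, a bias of about 80% corresponds to an auditory estimate that has moved four fifths of the way to the visual stimulus, and a negative bias corresponds to the repulsion described for the distinct-cause case.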

BAYESIAN PRIOR IN A NEURAL NETWORK
Bayesian inference is a method of reasoning that combines prior knowledge about the world with current input data. To be more precise, from experience we may learn how likely two co-occurring signals (visual and auditory) are to have a common cause versus two independent causes. Using this Bayesian prior, a Bayesian inference model integrates those pieces of information to estimate whether there is a common cause and to estimate the positions of the cues. Previous studies have reported that Bayesian inference can explain the pattern of localization bias (Figure 6C) that is also reproduced by our model (Körding et al., 2007; Sato et al., 2007). Considering that our neural network model and the Bayesian model can explain the same psychophysical experiment, there should be a neural connection in our model that represents the prior information. We searched for the parameter of our network model that corresponds to the prior information on the likelihood of sensory integration.

MULTISENSORY INTEGRATION IN THE NEURAL NETWORK
To simplify the comparison between the network model and the Bayesian model, let us consider a case in which we receive sensory inputs without noise ($x_V = S_V$, $x_A = S_A$). In this case, the distance $D$ between the audiovisual stimuli determines the causal structure (Figure 5B), and we can determine the integration threshold $D_0^{\mathrm{Net}}$ (the distance within which the auditory and visual signals are integrated). Once $D_0^{\mathrm{Net}}$ is determined, we can calculate the proportion of integration with noise as follows. When the distance between the audiovisual inputs, $x_V - x_A = D_{\mathrm{input}}$, is smaller than $D_0^{\mathrm{Net}}$, the stimuli are integrated. $D_{\mathrm{input}}$ is drawn from a normal distribution with mean $S_V - S_A = D$, which is the distance between the original positions of the audiovisual stimuli, and variance $\sigma_{Vx}^2 + \sigma_{Ax}^2$, which is the sum of the auditory and visual noise variances. Using $D_0^{\mathrm{Net}}$, we obtain the proportion of integration as a function of $D$ by integrating this distribution over the range of $D_{\mathrm{input}}$ in which the stimuli are integrated.
$D_0^{\mathrm{Net}}$ determines the likelihood of sensory integration. We investigated the relationship between the parameters of the neural connection $J_{ij}$ and the Bayesian prior distribution with regard to the integration threshold.
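The calculation described above can be written in closed form with the normal cumulative distribution function. The sketch below assumes that integration occurs when the absolute value of $D_{\mathrm{input}}$ is below the threshold, i.e., that "distance" is unsigned; the function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def integration_proportion(d, d0, sigma_vx, sigma_ax):
    """Probability that |D_input| < d0 when D_input ~ N(d, sigma_vx^2 + sigma_ax^2)."""
    s = math.sqrt(sigma_vx ** 2 + sigma_ax ** 2)  # std of the difference x_V - x_A
    return normal_cdf((d0 - d) / s) - normal_cdf((-d0 - d) / s)
```

Without noise this proportion is a step function of $D$; the noise smooths the step into a sigmoid-like curve that passes close to 50% where $D$ equals the threshold, provided the threshold is large compared with the noise.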

INTEGRATION THRESHOLD IN BAYESIAN MODEL
Using the Bayesian approach, we can also calculate the integration threshold $D_0^{\mathrm{Bay}}$ (the distance within which the auditory and visual signals are integrated in the Bayesian view) as follows (Körding et al., 2007). We determine whether the stimuli originate from the same source ($C = 1$) or from two sources ($C = 2$). The perceived locations of the audiovisual stimuli, $x_V$ and $x_A$, are shifted from their original positions by Gaussian noise with standard deviations $\sigma_V$ and $\sigma_A$. Accordingly, we calculate the probability of $C = 1$ given $x_V$ and $x_A$ using Bayes' theorem (Körding et al., 2007), where $P_{co}$ denotes the prior probability of a common cause. We assume that the source locations of the audiovisual signals are uniformly distributed in the spatial range $[-a/2, a/2]$, and that the Bayesian model reports the same source when a common cause is the more probable of the two causal structures. As shown in Equation 8, the Bayesian prior $P_{co}$ affects the judgment of unity. We investigated how $P_{co}$ affects $D_0^{\mathrm{Bay}}$ (Figure 7). When the causal structure is defined, we can calculate the optimal estimate of the stimulus position separately for the case of a common cause ($C = 1$) and for the case of independent causes ($C = 2$). We calculated the localization bias using the Bayesian model (Figure 6C). Here, we fixed $P_{co} = 0.2$, $\sigma_V = 3$ deg, and $\sigma_A = 6.5$ deg. Both the Bayesian prior $P_{co}$ and the recurrent connectivity $J_{ij}$ affect the integration threshold $D_0$. Thus, the integration threshold $D_0$ provides a common measure for examining the idea that the Bayesian prior $P_{co}$ corresponds to the recurrent connectivity $J_{ij}$ in the cortical neural network.
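A minimal sketch of this Bayesian calculation is given below. It assumes that the range $a$ is wide compared with the sensory noise, so that edge effects of the uniform prior can be ignored; the common-cause likelihood is then a Gaussian in $x_V - x_A$ with variance $\sigma_V^2 + \sigma_A^2$ divided by $a$, and the independent-cause likelihood is $1/a^2$. Setting the posterior to one half gives $(D_0^{\mathrm{Bay}})^2 = 2(\sigma_V^2 + \sigma_A^2)\ln\!\big[a P_{co} / \big((1 - P_{co})\sqrt{2\pi(\sigma_V^2 + \sigma_A^2)}\big)\big]$. These expressions are derived for this sketch and may differ in detail from the original equations; the function names and the example value $a = 100$ deg are hypothetical.

```python
import math


def posterior_common(x_v, x_a, p_co, sigma_v, sigma_a, a):
    """P(C = 1 | x_V, x_A) under a uniform source prior of width a (0 < p_co < 1)."""
    s2 = sigma_v ** 2 + sigma_a ** 2
    like_common = (math.exp(-(x_v - x_a) ** 2 / (2 * s2))
                   / (math.sqrt(2 * math.pi * s2) * a))
    like_indep = 1.0 / a ** 2
    num = p_co * like_common
    return num / (num + (1.0 - p_co) * like_indep)


def bayes_threshold(p_co, sigma_v, sigma_a, a):
    """Distance |x_V - x_A| at which the posterior of a common cause is 0.5."""
    s2 = sigma_v ** 2 + sigma_a ** 2
    ratio = a * p_co / ((1.0 - p_co) * math.sqrt(2 * math.pi * s2))
    if ratio <= 1.0:
        return 0.0  # a common cause is never the more probable structure
    return math.sqrt(2 * s2 * math.log(ratio))


def fused_estimate(x_v, x_a, sigma_v, sigma_a):
    """Reliability-weighted average used when a common cause is assumed."""
    w_v, w_a = 1.0 / sigma_v ** 2, 1.0 / sigma_a ** 2
    return (w_v * x_v + w_a * x_a) / (w_v + w_a)


# Example with the values quoted in the text and an assumed range a = 100 deg.
d0_bay = bayes_threshold(0.2, 3.0, 6.5, 100.0)
```

Under a flat prior the optimal estimates for independent causes are simply the perceived locations $x_V$ and $x_A$, whereas for a common cause the fused estimate lies closer to the more reliable (visual) cue. The threshold grows with $P_{co}$, which is the dependence examined in Figure 7.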

NETWORK CONNECTIVITY REPRESENTS BAYESIAN PRIOR $P_{co}$
Synaptic plasticity is thought to be the basic phenomenon underlying learning. It could be said that a neural network learns a Bayesian prior by changing its connectivity. We investigated how the parameters of the network connectivity $J_{ij}$ affect the integration threshold, as shown in Figure 7. $D_0^{\mathrm{Net}}$ increases with the ratio between the strength of the excitatory connection, $M_1$, and that of the inhibitory connection, $M_2$. Similarly, it increases approximately monotonically with the range of the excitatory connection, $\sigma_1$, whereas it varies non-monotonically with the range of lateral inhibition, $\sigma_2$, as illustrated in Figure 7D.

[Figure caption fragment: "Common cause" refers to the situation in which the model judged that the two sensory signals have a common cause (i.e., the network converged to a single-bump state), and "independent cause" refers to the situation in which the model judged that the two sensory signals have independent causes (i.e., the network converged to a state with two bumps, or the MAP estimate of the Bayesian model corresponds to two sources). A negative bias indicates that the perceived auditory position is on the opposite side of the true position with respect to the position of the visual stimulus.]

Let us focus on the excitatory connection, which could be changed through Hebb's learning rule. As shown in Figures 7A, B, $D_0^{\mathrm{Net}}$ increases with $M_1$ in the same way as $D_0^{\mathrm{Bay}}$ increases with $P_{co}$. This suggests that the Bayesian prior $P_{co}$ is represented by $M_1$ in the network model, and that the neural network can achieve Bayesian inference by learning appropriate prior information through adjustment of the excitatory connection $M_1$.

DISCUSSION
We constructed a recurrent network model that distinguishes whether audiovisual stimuli have a common cause or distinct causes. We showed that our model not only estimates the number of sources but also reproduces the localization bias observed in psychophysical experiments (Wallace et al., 2004). Previous studies have revealed that the Bayesian ideal observer model can explain psychophysical data on sensory integration (Körding et al., 2007; Sato et al., 2007). Although a Bayesian model gives a normative account of the computational principle, it does not provide a neural implementation of optimal causal inference. Our model is a biologically plausible model of cortical circuitry, and it provides information about how the nervous system can implement the computation (Carandini et al., 1997). To reveal how the nervous system implements Bayesian inference, we investigated the relationship between the synaptic connections of the proposed model and the prior distribution in the Bayesian model. We found that the strength of the excitatory connection represents the prior probability of integration.
Previous research has used divisive normalization of the firing rate as a gain control. The network model of Deneve et al. (1999) extracted variables encoded by a population of noisy neurons. The neural activities converged to a smooth stable peak, and the position of the peak depended on the encoded variables. Therefore, the position could be used to estimate these quantities in their model. Moreover, through proper tuning of the parameters, the model closely approximated the maximum likelihood estimator, which would be used by an ideal observer in most cases of interest. However, two or more localized activities could not coexist in the previous network model. The model thus could not simultaneously estimate information about multiple sources, which is needed for living in a natural environment. We found that strong divisive normalization makes it hard for localized activities to coexist. Iteration of Equation 3 makes the ratio between local excitations grow, and eventually only the largest one survives. This effect occurs if the exponent is greater than 1. We constructed a model in which an arbitrary number of local excitations can coexist by making the exponent equal to one. This simple normalization can be biologically implemented by a linear computation and shunting inhibition (Carandini et al., 1997). Although our network may not achieve optimal inference for each source position, it is biologically plausible and can reproduce the properties of auditory-visual integration observed in psychophysical experiments. These results imply that normalization with a threshold linear function is important in multisensory integration with causal inference.
We reproduced the results of psychophysical experiments showing localization bias in audiovisual integration. Whenever the stimuli were unified, the estimated auditory position shifted toward the location of the visual stimulus. This phenomenon is caused by the difference in the reliability of the stimuli. That is, because visual information for source localization is much more reliable than auditory information, vision dominates sound. Conversely, it is known that when an auditory signal is more reliable than a visual signal, sound dominates vision (Alais and Burr, 2004). Localization bias has also been reported for other cross-modal cues (Pavani et al., 2000). Our model represents the reliability of a stimulus by the strength of the input activity. It can therefore be generalized to other types of cue integration by changing the strength of the input activity.
Bayesian inference is a method of reasoning that combines prior knowledge with current input data. In our brains, information about the external world is estimated on the basis of prior knowledge (Doya et al., 2007). However, until now, it was unknown how prior knowledge can be represented in a neural circuit. We investigated how a neural network can implement prior knowledge. Our results suggest that neural networks learn an appropriate prior through synaptic plasticity.
In the Bayesian model, negative bias is assumed to be caused by sensory noise (Körding et al., 2007; Sato et al., 2007). Stimuli are unified when the distance between the perceived locations of the audiovisual stimuli, which are shifted from their original positions, is smaller than $D_0^{\mathrm{Bay}}$; when it is larger than $D_0^{\mathrm{Bay}}$, the stimuli are not unified. The averaged bias in the non-unified case therefore takes a negative value. In our neural network model, not only sensory noise but also the interaction of localized activities contributes to the negative bias. Localized activities repel each other through the effect of the Mexican-hat-type connectivity (Figure 3). This corresponds to implementing a prior distribution over the likely positions of different input sources, which has not been implemented in the previous Bayesian models (Körding et al., 2007; Sato et al., 2007). It is unclear where causal inference is performed in the brain. If the repulsive effect were to be observed in a brain region that performs multisensory integration, it would support the notion that our model is actually implemented in the brain.
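The role of the exponent can be seen in a toy calculation that isolates the normalization step. As a simplifying assumption, the sketch below drops the recurrent weights and the external input and simply iterates $u_i \leftarrow u_i^p / (S + \mu \sum_j u_j^p)$ on two activity values; the constants and function names are hypothetical.

```python
def normalize(a, exponent, s=1.0, mu=1.0):
    """Rectify, raise to the given exponent, and divide by the pooled activity."""
    powered = [max(x, 0.0) ** exponent for x in a]
    denom = s + mu * sum(powered)
    return [x / denom for x in powered]


def peak_ratio_after(a, exponent, n_iter):
    """Ratio of the first to the second activity after n_iter normalization steps."""
    u = list(a)
    for _ in range(n_iter):
        u = normalize(u, exponent)
    return u[0] / u[1]


# Two peaks that start with a 2:1 ratio.
ratio_weak = peak_ratio_after([2.0, 1.0], exponent=1, n_iter=3)
ratio_strong = peak_ratio_after([2.0, 1.0], exponent=2, n_iter=3)
```

Because both activities share the same denominator, each step raises their ratio to the power of the exponent. With an exponent of one the ratio is preserved, so both peaks can persist; with an exponent of two the ratio is squared at every step, so the weaker peak quickly becomes negligible relative to the stronger one.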