Auditory Time-Interval Perception as Causal Inference on Sound Sources

Perception of a temporal pattern in a sub-second time scale is fundamental to conversation, music perception, and other kinds of sound communication. However, its mechanism is not fully understood. A simple example is hearing three successive sounds with short time intervals. The following misperception of the latter interval is known: underestimation of the latter interval when the former is a little shorter or much longer than the latter, and overestimation of the latter when the former is a little longer or much shorter than the latter. Although this misperception of auditory time intervals for simple stimuli might be a cue to understanding the mechanism of time-interval perception, there exists no model that comprehensively explains it. Considering a previous experiment demonstrating that illusory perception does not occur for stimulus sounds with different frequencies, it might be plausible to think that the underlying mechanism of time-interval perception involves a causal inference on sound sources: herein, different frequencies provide cues for different causes. We construct a Bayesian observer model of this time-interval perception. We introduce a probabilistic variable representing the causality of sounds in the model. As prior knowledge, the observer assumes that a single sound source produces periodic and short time intervals, which is consistent with several previous works. We conducted numerical simulations and confirmed that our model can reproduce the misperception of auditory time intervals. A similar phenomenon has also been reported in visual and tactile modalities, though the time ranges for these are wider. This suggests the existence of a common mechanism for temporal pattern perception over modalities. This is because these different properties can be interpreted as a difference in time resolutions, given that the time resolutions for vision and touch are lower than those for audition.


INTRODUCTION
Temporal pattern processing is necessary for all sensory modalities, and temporal patterns carry much of the information the brain needs to learn what happens in the external world. Therefore, revealing the mechanism of temporal perception is fundamental to understanding sensory processing, but this mechanism is not yet fully understood.
Hearing three rapid successive sounds is a good situation for investigating the time-perception system. One reason is that the temporal accuracy of the auditory system is higher than that of other modalities (Burr et al., 2009; Vroomen and Keetels, 2010; Occelli et al., 2011); that is, auditory experimental results reflect the actual time-perception mechanism better. In addition, a combination of two time intervals is the simplest case of temporal pattern perception. When three rapid sounds are heard on a hundred-millisecond scale, the brain sometimes misestimates the second interval depending on the relative lengths of the two intervals. Specifically, the second interval, T_2, is perceived as shorter than its actual length when T_2 is equal to or a little longer than the first interval, T_1. This perceptual underestimation was named "time-shrinking" (Nakajima et al., 1991). The illusion vanishes as the total length T_1 + T_2 increases. In addition, although the degrees of misestimation are smaller than in the time-shrinking illusion, the following phenomena in the perception of T_2 have also been observed (Miyauchi and Nakajima, 2005; Figure 1A): overestimation of T_2 when T_2 is a little shorter than T_1; underestimation of T_2 when T_2 is much shorter than T_1; and overestimation of T_2 when T_2 is much longer than T_1. The time-shrinking illusion has been examined in other articles as well (Nakajima et al., 1992; ten Hoopen et al., 1993, 2006; Suetomi and Nakajima, 1998; Miyauchi and Nakajima, 2007; Mitsudo et al., 2009). Furthermore, it was reported that this phenomenon occurs in other sensory modalities such as the visual (Arao et al., 2000) and tactile (van Erp and Spapé, 2008) senses. This fact suggests that there is a common time-perception system among sensory modalities.
A time-perception model has been proposed to explain the time-shrinking illusion (Nakajima et al., 2004). In this model, the subjective duration of a time interval is assumed to be proportional to the sum of the actual length and a constant length. It is also assumed that if the neural system judges two neighboring intervals to be similar, the estimating process for the latter interval is shortened and the latter interval is thus underestimated. With these assumptions, the model can quantitatively mimic the time-shrinking illusion, namely, the underestimation of T_2 caused by a shorter preceding interval T_1. However, the other misestimation phenomena that occur when hearing three successive sounds are outside the scope of this model and cannot be reproduced by it.
In the present study, we consider the perceptual phenomena mentioned above to be results of effective information processing in the neural system. Sensory information, which the brain uses to infer what happens in the world, inevitably has uncertainty caused by both internal noise in the nervous system (Faisal et al., 2008) and ubiquitous fluctuation in the external world. Therefore, the brain must cope with these kinds of uncertainty; otherwise, we may misunderstand the situation or regard the same experiences as different. One reasonable way for the brain to cope with the uncertainty is to exploit prior knowledge, that is, experience and statistics pertaining to the situation. This strategy can be formulated by using Bayesian inference. Bayesian modeling is a powerful method for describing the mechanism of human perception and has been applied to visual temporal perception (Miyazaki et al., 2005; Jazayeri and Shadlen, 2010), and more widely to human perception (see Vilares and Körding, 2011, for a recent review).

MATERIALS AND METHODS
To consider the perceptual phenomena of hearing three rapid sounds, we assume a Bayesian observer who tries to solve a common-source identification problem for each pair of neighboring sounds. Further, prior to hearing, the observer assumes that sounds from the same source have short and equal intervals. The assumption of prior knowledge of short time intervals for stimuli from the same source is based on previous works, which showed that the closer the two sources, the shorter the perceived time interval (Akerboom et al., 1983, for audition; Goldreich, 2007; Kuroki et al., 2010, for tactile sensation). With respect to the assumption of equal intervals, there are many examples of signals aligned at almost equal intervals, such as heartbeats and swinging pendulums. A possible reason is that simple dynamical systems tend to generate periodic orbits, which are often observed as periodic signals generated by a limit cycle.
Here, we propose that the perception of sound intervals involves inference of the causal relationship among sounds. Although there is little direct evidence for this notion, some auditory perceptual phenomena could be associated with some form of causal judgment. For example, the time-shrinking illusion vanishes when the temporal pattern is marked by sounds with quite different frequencies (Remijn et al., 1999). In this case, we consider that the sounds with different frequencies have been judged as coming from independent sources. Therefore, the perceptual estimation of the latter time interval differs from that for a sound sequence composed of a single frequency. This view that sound frequency indicates source identity is also supported by an auditory psychological phenomenon (Deutsch, 1975). The perception of a common source is a kind of causal inference and should be important for making an effective inference (Körding et al., 2007; Sato et al., 2007; Shams and Beierholm, 2010). We will discuss this point further in the Discussion.
Our Bayesian model assumes that the neural system cannot observe the true time instants t_1, t_2, and t_3 of the sounds, but only the noisy observed times s_1, s_2, and s_3, respectively (Figure 2A). The index of each variable indicates the order of occurrence in the sound sequence. The brain then infers the true interval durations T_1 = t_2 − t_1 and T_2 = t_3 − t_2 from the observations. To estimate them, our Bayesian observer composes a conditional probability, called the posterior probability, P(T_1, T_2 | s_1, s_2, s_3). Bayes' theorem enables us to represent the posterior probability as

$$P(T_1, T_2 \mid s_1, s_2, s_3) = \frac{P(s_1, s_2, s_3 \mid T_1, T_2)\, P(T_1, T_2)}{P(s_1, s_2, s_3)}. \tag{1}$$

Since the denominator on the right side can be obtained by integrating the numerator over T_1 and T_2, we need to consider only the terms P(s_1, s_2, s_3 | T_1, T_2) and P(T_1, T_2) in the numerator.

Frontiers in Psychology | Perception Science

FIGURE 2 | (B) Likelihood function of (T_1, T_2) given observed values of S_1 = 100 ms and S_2 = 140 ms (indicated as "+"). Intensity indicates the degree of likelihood. (C) Prior distribution of (T_1, T_2) with intensity on a logarithmic scale of the probability, illustrating that the prior takes a high value near both axes as well as along the 45° line from the T_1-axis. (D) Posterior distribution of (T_1, T_2) given observed values of S_1 = 100 ms and S_2 = 140 ms (indicated as "+"). Intensity indicates probability. The cross sign (×) corresponds to the peak of the distribution. (E-G) Prior distributions of (T_1, T_2) given that (E) the first and second sounds come from the same source, (F) the second and third sounds come from the same source, and (G) all three sounds come from the same source. Intensity indicates probability.

The first term of the numerator represents how the observed values are obtained, and is formulated as

$$\begin{aligned} P(s_1, s_2, s_3 \mid T_1, T_2) &= \int P(s_1, s_2, s_3 \mid T_1, T_2, t_2)\, P(t_2)\, dt_2 \\ &\propto \int P(s_1, s_2, s_3 \mid t_1, t_2, t_3)\, dt_2 \\ &= \int P(s_1 \mid t_1)\, P(s_2 \mid t_2)\, P(s_3 \mid t_3)\, dt_2, \end{aligned} \tag{2}$$

where we assume that the distribution P(t_2) is constant; knowing T_1, T_2, and t_2 is equivalent to knowing t_1, t_2, and t_3 in the second line, and the noise distributions for the timings of the three sounds are assumed to be independent of each other in the third line. We set the distribution of the observation noise as a Gaussian distribution with width σ_o centered at the true value:

$$P(s_i \mid t_i) = \frac{1}{\sqrt{2\pi}\,\sigma_o} \exp\!\left[-\frac{(s_i - t_i)^2}{2\sigma_o^2}\right] \quad (i = 1, 2, 3). \tag{3}$$

Here, we consider the standard deviation σ_o to be constant with time. By substituting equation (3) into equation (2) and integrating over t_2, we obtain the following formula (see Appendix for the details of this derivation):

$$P(s_1, s_2, s_3 \mid T_1, T_2) \propto \exp\!\left[-\frac{(T_1 - S_1)^2 + (T_1 - S_1)(T_2 - S_2) + (T_2 - S_2)^2}{3\sigma_o^2}\right], \tag{4}$$

where we introduce the variables S_1 = s_2 − s_1 and S_2 = s_3 − s_2, which represent the observed interval durations. Note that, given T_1 and T_2, t_1 and t_3 are not independent of t_2 but change with t_2. Therefore, the integration range of t_2 in equation (2) is (−∞, ∞). This term stands for the likelihood of the true intervals. Due to the cross term (T_1 − S_1)(T_2 − S_2), this function has a negative correlation between T_1 and T_2, as shown in Figure 2B.
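The integration over t_2 described above can be checked numerically. The following is a minimal sketch, not the authors' code: it compares a direct Riemann-sum integral of the three Gaussian noise terms with a closed form whose exponent is (T_1 − S_1)² + (T_1 − S_1)(T_2 − S_2) + (T_2 − S_2)² divided by 3σ_o², which is what completing the square in t_2 gives under the stated Gaussian-noise assumption. The function names and all numerical values (onset times, σ_o = 20 ms) are hypothetical choices for illustration.

```python
import math

def likelihood_closed_form(T1, T2, S1, S2, sigma_o):
    """Unnormalized likelihood of (T1, T2) after t2 has been integrated out."""
    d1, d2 = T1 - S1, T2 - S2
    return math.exp(-(d1 * d1 + d1 * d2 + d2 * d2) / (3.0 * sigma_o ** 2))

def likelihood_numeric(T1, T2, s1, s2, s3, sigma_o, half_width=10.0, n=4001):
    """Riemann sum over t2 of the product of three Gaussian noise terms.

    The Gaussian prefactors are omitted because they do not depend on T1 or T2.
    """
    lo = s2 - half_width * sigma_o
    step = 2.0 * half_width * sigma_o / (n - 1)
    total = 0.0
    for k in range(n):
        t2 = lo + k * step
        t1, t3 = t2 - T1, t2 + T2          # true onsets implied by (T1, T2, t2)
        sq = (s1 - t1) ** 2 + (s2 - t2) ** 2 + (s3 - t3) ** 2
        total += math.exp(-sq / (2.0 * sigma_o ** 2))
    return total * step

# Hypothetical values in ms: observed onsets that give S1 = 100 and S2 = 140.
s1, s2, s3, sigma_o = 0.0, 100.0, 240.0, 20.0
for T1, T2 in [(110.0, 120.0), (90.0, 150.0), (100.0, 140.0)]:
    numeric = likelihood_numeric(T1, T2, s1, s2, s3, sigma_o)
    closed = likelihood_closed_form(T1, T2, s2 - s1, s3 - s2, sigma_o)
    print(T1, T2, numeric / closed)        # same constant for every (T1, T2)
```

If the closed form is right, the printed ratio does not depend on (T_1, T_2); it is the constant left over from the Gaussian integral in t_2. The positive sign of the cross term in the exponent is what produces the negative correlation between T_1 and T_2.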
Then, we formulate the term P(T_1, T_2) in equation (1). This term does not relate to s_i (i = 1, 2, 3), that is, to what the neural system has observed. Thus, this probability function represents knowledge acquired prior to the event. We model the prior knowledge of two neighboring time intervals as follows, assuming that the observer solves a source identification problem. First, the brain infers from the three successive sounds whether each pair of neighboring sounds comes from the same source. To describe this source identification inference, we introduce a variable C that represents which of the three sounds are from the same source. Here, we assume that the brain does not judge the first and third sounds to come from the same source while the second sound comes from another source. Thus, C represents the following four cases:
1. each sound is from an independent source,
2. the first and second sounds come from the same source and the third from another source,
3. the second and third sounds come from the same source and the first from another source,
4. all three sounds are from the same source.
Then, we assign 1, 2, 3, and 4 as the value of C to the above cases, respectively. Using the variable C, we formulate the prior distribution as

$$P(T_1, T_2) = \sum_{C=1}^{4} P(T_1, T_2 \mid C)\, P(C). \tag{5}$$

We treat the probabilities of C appearing in equation (5) as model parameters, and denote P(C = j) (j = 1, 2, 3, 4) by p_j. Next, we formulate the prior distributions P(T_1, T_2 | C) for C = 1, 2, 3, and 4 by using the assumption of equal and short intervals for sounds from the same source. The assumption is formulated as follows:

• For C = 1, there is no bias for the sound intervals. Thus, the prior distribution is a two-dimensional uniform distribution:

$$P(T_1, T_2 \mid C = 1) = \frac{1}{L^2} \quad (0 \le T_1, T_2 \le L), \tag{6}$$

where L is a parameter defining the integration range.

• For C = 2 and C = 3, the two sounds that come from the same source are expected to have a short interval (Figures 2E,F). Each prior distribution is as follows:

$$P(T_1, T_2 \mid C = 2) = P(T_1 \mid C = 2)\, P(T_2 \mid C = 2) \propto \exp\!\left(-\frac{T_1^2}{2\sigma_p^2}\right) \cdot \frac{1}{L}, \tag{7}$$

$$P(T_1, T_2 \mid C = 3) = P(T_1 \mid C = 3)\, P(T_2 \mid C = 3) \propto \frac{1}{L} \cdot \exp\!\left(-\frac{T_2^2}{2\sigma_p^2}\right), \tag{8}$$

where the standard deviation σ_p is a parameter that controls the bias toward short intervals. P(T_1 | C = 2) gives the distribution of an interval whose two marker sounds are from the same source, and P(T_2 | C = 2) gives the distribution of an interval whose two marker sounds come from different sources.

• For C = 4, the three markers are expected to have short and equal intervals. This distribution is expressed as a two-dimensional Gaussian distribution, with the center at the origin and a positive correlation between the two variables T_1 and T_2 (Figure 2G); it is given by equation (9), which contains a normalization term Z and constant shape parameters. The prior distribution must satisfy the condition stated in equation (10). Given this condition, the constants in equation (9) are expressed through equation (11) in terms of the new parameters σ_q and σ_r, which control the shape of the distribution. Since we intend the distribution to have a positive correlation between T_1 and T_2, σ_q should be greater than σ_r.

By substituting equations (6)-(9) into equation (5), we have the prior distribution P(T_1, T_2). The obtained prior distribution has a large peak at the origin of the T_1-T_2 plane, and also has high values along the T_1 and T_2 axes and along the 45° line from the T_1-axis (Figure 2C).
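The mixture prior described above can be sketched on a grid as follows. This is an illustrative sketch under explicit assumptions, not the authors' implementation. The same-source intervals for C = 2 and C = 3 are given a Gaussian bias toward zero with standard deviation σ_p and the other interval is left uniform. For C = 4, the code ASSUMES a Gaussian with standard deviation σ_q along the diagonal T_1 = T_2 and σ_r across it, which gives a positive correlation when σ_q > σ_r; the exact parametrization of the original equation is not reproduced here. Each component is normalized numerically over the square [0, L]². The function names and all parameter values are hypothetical and are not those of Table 1.

```python
import math

def normalize(grid, cell_area):
    """Scale a non-negative grid so that it integrates to 1 over the square."""
    total = sum(sum(row) for row in grid) * cell_area
    return [[v / total for v in row] for row in grid]

def build_prior(axis, p, sigma_p, sigma_q, sigma_r):
    """Return P(T1, T2) = sum_C p_C * P(T1, T2 | C) sampled on axis x axis.

    grid[i][j] corresponds to T1 = axis[i], T2 = axis[j]; p = (p1, p2, p3, p4).
    """
    area = (axis[1] - axis[0]) ** 2

    def short(T):
        # Bias toward short intervals for a same-source pair (assumed Gaussian)
        return math.exp(-T * T / (2.0 * sigma_p ** 2))

    def equal_and_short(T1, T2):
        # ASSUMED form for C = 4: wide along the diagonal, narrow across it
        return math.exp(-(T1 + T2) ** 2 / (4.0 * sigma_q ** 2)
                        - (T1 - T2) ** 2 / (4.0 * sigma_r ** 2))

    components = [
        [[1.0 for _ in axis] for _ in axis],                        # C = 1
        [[short(T1) for _ in axis] for T1 in axis],                 # C = 2
        [[short(T2) for T2 in axis] for _ in axis],                 # C = 3
        [[equal_and_short(T1, T2) for T2 in axis] for T1 in axis],  # C = 4
    ]
    components = [normalize(c, area) for c in components]
    n = len(axis)
    return [[sum(p[c] * components[c][i][j] for c in range(4))
             for j in range(n)] for i in range(n)]

# Hypothetical parameter values in ms; these are NOT the values of Table 1.
axis = [5.0 * k for k in range(121)]            # 0 ... 600 ms, i.e. L = 600
prior = build_prior(axis, p=(0.1, 0.1, 0.1, 0.7),
                    sigma_p=100.0, sigma_q=300.0, sigma_r=30.0)
i200, i300 = axis.index(200.0), axis.index(300.0)
print(prior[i200][i200] > prior[i200][i300])    # diagonal favored over off-diagonal
```

Normalizing each component before mixing keeps the p_j interpretable as the prior probabilities of the four source configurations, independent of how sharply each conditional distribution is peaked.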
Then, we obtain the posterior distribution P(T_1, T_2 | s_1, s_2, s_3) by multiplying the likelihood function of equation (4) by the prior distribution (Figure 2D).

RESULT
We conducted a numerical simulation to show the validity of our model. The parameter values used in the simulation are shown in Table 1. Our model has too many parameters for their correct values to be learned from appropriate experiments. Thus, the parameter values were chosen and adjusted so that the time scales are plausible in terms of their physical implications. For example, because the measured time resolution of the auditory system changes with the measurement method, a specific value of the time resolution parameter σ_o cannot be determined. Therefore, we set it so that the time scale is similar to existing psychological results (Grondin and Plourde, 2007, for example). The value of L was chosen so as to cover the time range in which the stimuli are presented.
In this simulation, we calculated the expectation value of the marginal distribution of T 2 and regarded the value as a result of the Bayesian observer's inference. Although there are some other decision-making strategies, such as maximizing the posterior probability, we chose calculating the expectation value because of its low computational cost. However, the simulation result of the maximum a posteriori strategy was not qualitatively different from that of the expectation value. In addition, it is yet to be ascertained which rule should be applied to a Bayesian inference (see Jazayeri and Shadlen, 2010, for this issue).
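The two decision rules mentioned above can be compared with a small grid computation. This is a minimal sketch under stated assumptions, not the simulation reported here: the likelihood uses the quadratic form with the cross term (T_1 − S_1)(T_2 − S_2), the conditional priors follow the verbal description (uniform, short first interval, short second interval, short and equal intervals), and the C = 4 component uses an ASSUMED Gaussian with width σ_q along the diagonal and σ_r across it. The function name and every parameter value are hypothetical and are not those of Table 1.

```python
import math

def posterior_estimates(S1, S2, p, sigma_o, sigma_p, sigma_q, sigma_r,
                        L=600.0, step=5.0):
    """Return (posterior mean of T2, T2 coordinate of the joint posterior peak)."""
    n = int(round(L / step)) + 1
    axis = [k * step for k in range(n)]

    def components(T1, T2):
        # Unnormalized P(T1, T2 | C) for C = 1..4 (C = 4 form is an assumption)
        return (
            1.0,
            math.exp(-T1 ** 2 / (2.0 * sigma_p ** 2)),
            math.exp(-T2 ** 2 / (2.0 * sigma_p ** 2)),
            math.exp(-(T1 + T2) ** 2 / (4.0 * sigma_q ** 2)
                     - (T1 - T2) ** 2 / (4.0 * sigma_r ** 2)),
        )

    # Normalize each conditional prior over the grid so p acts as mixture weights
    norms = [0.0] * 4
    for T1 in axis:
        for T2 in axis:
            for c, v in enumerate(components(T1, T2)):
                norms[c] += v

    total, weighted = 0.0, 0.0
    best, best_T2 = -1.0, None
    for T1 in axis:
        for T2 in axis:
            d1, d2 = T1 - S1, T2 - S2
            lik = math.exp(-(d1 * d1 + d1 * d2 + d2 * d2) / (3.0 * sigma_o ** 2))
            prior = sum(p[c] * v / norms[c]
                        for c, v in enumerate(components(T1, T2)))
            post = lik * prior                  # unnormalized posterior
            total += post
            weighted += post * T2
            if post > best:
                best, best_T2 = post, T2
    return weighted / total, best_T2

# Hypothetical values in ms; observed intervals S1 = 100 and S2 = 140.
mean_T2, map_T2 = posterior_estimates(100.0, 140.0, (0.1, 0.1, 0.1, 0.7),
                                      sigma_o=20.0, sigma_p=100.0,
                                      sigma_q=300.0, sigma_r=30.0)
print(mean_T2, map_T2)   # both fall below S2 = 140 for these assumed values
```

The posterior mean needs only two running sums, whereas the peak search needs a comparison at every grid point; on a fixed grid the difference in cost is small, so the sketch returns both for comparison.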
In this simulation, our model reproduced the time-shrinking illusion, that is, the large underestimation of T_2 when T_2 is a little longer than T_1, which results from the assumption of equal intervals. However, the amount of overestimation when T_2 is a little shorter than T_1 was smaller than the above underestimation. We also observed overestimation of T_2 when T_2 is much longer than T_1 and underestimation of T_2 when T_2 is much shorter than T_1. Moreover, the simulation showed that the underestimation and overestimation decrease as the total length increases and that T_2 is underestimated when T_2 = T_1 (Figure 1B). These properties of our model were also observed in psychological experiments (Figure 1A).

EXPLANATION OF THE PERCEPTION OF THREE RAPID SOUNDS
Here, we explain how our model reproduces the behavior of the human auditory system. First, when the two time intervals are similar, the observed time-interval pair lies near the diagonal line on the T_1-T_2 plane. Thus, the perception of the three sounds shifts from the noisy observation toward the prior for the case in which all three sounds originate from the same source. As a result, the two intervals are perceived as more similar to each other than they were observed to be. That is, T_2 is underestimated if T_2 is a little longer than T_1 (Point A_1 in Figure 3), and T_2 is overestimated if T_2 is a little shorter than T_1 (Point A_2 in Figure 3). In addition, the degree of underestimation is larger than that of overestimation because the peak of the prior distribution is at the origin due to the expectation of short intervals. The expectation of short intervals also causes the underestimation of T_2 when T_2 = T_1.
Next, when the intervals are dissimilar, the time-interval pair is located near either the T_1-axis or the T_2-axis on the T_1-T_2 plane. Therefore, perception is biased toward the T_2-axis or the T_1-axis by the prior for the case in which the first two or the latter two sounds come from the same source, respectively, because a short T_1 places the pair near the T_2-axis and a short T_2 places it near the T_1-axis. In addition, since the likelihood function has a negative correlation between T_1 and T_2, perception shifts along the negative correlation. Thus, T_2 is perceived as longer than the actual duration if T_2 is much longer than T_1, and as shorter if T_2 is much shorter than T_1 (Points B_1 and B_2 in Figure 3, respectively).
In addition, the prior distribution becomes flatter as the distance from the origin and the axes on the T_1-T_2 plane increases. Therefore, the prior effect is weak in such areas (Point C in Figure 3).

FIGURE 3 | Schematic figure of the model mechanism. Dashed lines indicate the shape of the prior distribution. Each solid-lined ellipse represents the likelihood function given an observed interval pair, marked as a plus sign at the center of the ellipse. Each arrow describes the direction of the perceptual shift.

DISCUSSION
Our model succeeds in replicating the human perception of a simple temporal pattern. This result suggests that the brain judges the causality of sounds and expects short and equal intervals for temporal patterns in an unconscious process. In our model, we assumed that the observer infers the causal relationship among sounds. Although there is little evidence for this assumption, we can propose experiments that could verify it. For example, subjects could hear three rapid sounds and report which of the three sounds come from the same source. The rate of each judgment on source identification can be predicted by calculating P(estimated C | T_1, T_2) from the present model. In addition, this experiment would also provide feedback on the parameter values of (p_1, p_2, p_3, p_4), which are rather arbitrary in this study. By extending our model, we can also predict that the temporal pattern of sounds alters the perception of their spatial locations. Although we modeled the perception of time intervals marked by sounds in this article, the spatial perception of the sounds can be modeled in almost the same form of causal inference and easily combined with the current model. From this combined model, we predict that the same spatial patterns of sounds are perceived as spatially different if the patterns are temporally different. This is because the inference on the causal relationship among sounds is made from both their temporal and spatial patterns in this model, and thus varies with temporal differences even if the actual spatial patterns are the same.
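The source-identification prediction discussed above can be sketched by computing the posterior probability of each source configuration C for one observed interval pair. This is a simplified illustration, not the authors' procedure: it conditions on a single noisy observation (S_1, S_2) rather than averaging over observations for given true intervals, it takes the likelihood to be the quadratic form with the cross term (T_1 − S_1)(T_2 − S_2), and it ASSUMES a Gaussian for C = 4 with width σ_q along the diagonal and σ_r across it. The function name and all parameter values are hypothetical.

```python
import math

def source_posterior(S1, S2, p, sigma_o, sigma_p, sigma_q, sigma_r,
                     L=600.0, step=5.0):
    """Return [P(C=1|S1,S2), ..., P(C=4|S1,S2)] computed on a square grid."""
    n = int(round(L / step)) + 1
    axis = [k * step for k in range(n)]

    def components(T1, T2):
        # Unnormalized P(T1, T2 | C) for C = 1..4 (C = 4 form is an assumption)
        return (
            1.0,
            math.exp(-T1 ** 2 / (2.0 * sigma_p ** 2)),
            math.exp(-T2 ** 2 / (2.0 * sigma_p ** 2)),
            math.exp(-(T1 + T2) ** 2 / (4.0 * sigma_q ** 2)
                     - (T1 - T2) ** 2 / (4.0 * sigma_r ** 2)),
        )

    norms = [0.0] * 4       # grid sums used to normalize each conditional prior
    evidence = [0.0] * 4    # grid sums of likelihood times conditional prior
    for T1 in axis:
        for T2 in axis:
            d1, d2 = T1 - S1, T2 - S2
            lik = math.exp(-(d1 * d1 + d1 * d2 + d2 * d2) / (3.0 * sigma_o ** 2))
            for c, v in enumerate(components(T1, T2)):
                norms[c] += v
                evidence[c] += lik * v

    weights = [p[c] * evidence[c] / norms[c] for c in range(4)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values in ms; these are NOT the values of Table 1.
params = dict(p=(0.1, 0.1, 0.1, 0.7), sigma_o=20.0,
              sigma_p=100.0, sigma_q=300.0, sigma_r=30.0)
similar = source_posterior(100.0, 140.0, **params)     # nearly equal intervals
dissimilar = source_posterior(50.0, 400.0, **params)   # very unequal intervals
print(similar.index(max(similar)) + 1, dissimilar.index(max(dissimilar)) + 1)
```

With these assumed values, nearly equal intervals favor the configuration in which all three sounds share a source, whereas a short first interval followed by a long one favors the configuration in which only the first two sounds share a source.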
Our model has several parameters, and there is some arbitrariness in their setting. For instance, even if we change the value of L from that in Table 1 to another value, we can reproduce a result similar to Figure 1B by adjusting the parameters (p_1, ..., p_4). In this article, we chose quite a high value for p_4 relative to the other three parameters. Although we assumed that inference is made based on the observed times of the sounds, in reality we observe other features of sounds such as direction, pitch, color, and volume, and all of these provide cues for the causal relationship among sounds. In the experiment we reproduced, all of these other features were kept the same for the series of three sounds, which strongly suggests that the sounds came from the same source. We interpret (p_1, ..., p_4) as including the cues from those other features. Thus, it is natural to assume that p_4, the probability that all of the sounds come from the same source, is considerably higher than the other probabilities. This suggests that time-interval perception depends on other sound features and, for visual stimuli, on visual features such as color, size, or location. In fact, it was confirmed that the result of time-interval perception differs according to the combination of stimulus pitches (Remijn et al., 1999).
Our model could be improved by trying to replicate the experimental facts about the perception of T_1. It was reported that the direction of the perceptual shift of T_1 follows the same pattern as that of T_2; that is, T_1 is underestimated when T_1 is a little longer or much shorter than T_2, and T_1 is overestimated when T_1 is a little shorter or much longer than T_2 (Miyauchi and Nakajima, 2005). This qualitative property of T_1 perception can be predicted by our model. However, in that experiment, the magnitude of each perceptual shift of T_1 was found to be smaller than that of T_2. Since the present model is symmetric between T_1 and T_2, it cannot mimic the difference between the perception of T_1 and that of T_2. In future work, we will consider how to refine the present model to reproduce the experimental results on the perception of T_1.
In auditory science, an important issue is how a single sound stream is discerned in a complex of multiple sounds. This ability of the auditory system is called "auditory scene analysis" or "auditory scene segregation" (Bregman, 1990), and is regarded as an important key to understanding the auditory system. Because this sound-separating mechanism should involve perceptual source identification, our model may contribute to understanding the sound segregation mechanism from the temporal aspect.
Finally, let us consider the time-perception mechanisms for other sensory modalities. From the psychological experiments on the visual (Arao et al., 2000) and tactile (van Erp and Spapé, 2008) time-shrinking illusions, it is known that time ranges for these modalities are broader than those for audition. The underlying reason can be understood by using the present model as follows, given that the visual and tactile time resolutions are lower than the auditory one. The perceptual bias of our model becomes weaker in a longer time scale. However, for a low-temporal-resolution modality, the perceptual bias is still relatively strong, because the observation has much uncertainty. Thus, the illusion occurs in a wider range. Though we can give a possible explanation for the difference among the modalities, the time-perception mechanisms in the sub-second scale for the other sensory modalities have not been well studied. Therefore, more research is needed before concluding that a time-perception system is shared by all sensory modalities.