Intensity Estimation for Poisson Process with Compositional Noise

Intensity estimation for Poisson processes is a classical problem and has been extensively studied over the past few decades. Practical observations, however, often contain compositional noise, i.e. a nonlinear shift along the time axis, which makes standard methods not directly applicable. The key challenge is that these observations are not"aligned", and registration procedures are required for successful estimation. In this paper, we propose an alignment-based framework for positive intensity estimation. We first show that the intensity function is area-preserved with respect to compositional noise. Such a property implies that the time warping is only encoded in the normalized intensity, or density, function. Then, we decompose the estimation of the intensity by the product of the estimated total intensity and estimated density. The estimation of the density relies on a metric which measures the phase difference between two density functions. An asymptotic study shows that the proposed estimation algorithm provides a consistent estimator for the normalized intensity. We then extend the framework to estimating non-negative intensity functions. The success of the proposed estimation algorithms is illustrated using two simulations. Finally, we apply the new framework in a real data set of neural spike trains, and find that the newly estimated intensities provide better classification accuracy than previous methods.


Introduction
The study of point processes is one of the central topics in stochastic processes and has been widely used to model discrete events in continuous time. In particular, the Poisson process, a common point process, has the most applications [8,18,20]. Classical examples include the arrivals of park patrons at an amusement park over a period of time, the goals scored in an association football match, and the clicks on a particular web link in a given time period. Recently, Poisson processes have been used to characterize spiking activity in various neural systems [7,6]. In order to use a Poisson process in applications, one key step is to estimate its intensity function from a given sequence of observed events.
The estimation of the intensity function of a Poisson process has been studied extensively and various estimation methods have been proposed. If the intensity can be assumed to have a known parametric form, then likelihood-based methods can be used to estimate the model parameters. However, in many cases, the shape of the intensity is unknown and estimation requires the implementation of non-parametric methods. Non-parametric estimation methods provide more flexibility than parametric methods and can better characterize the underlying intensity function. A number of approaches have been proposed over the past three decades, including waveletbased methods [11,18,26] and kernel-based methods [3,8,10]. In the case where prior knowledge about the process or shape of the intensity is known, Bayesian methods can be adopted and they often lead to a more accurate estimation [2,14,33].
Treating a neural spike train as a realization of Poisson process, one can consider the example depicted in Fig. 1. In this case, the neural spiking activity, which is associated with certain movement behavior [38], was recorded (see detail of the data in Sec. 5.2). The process was repeated for 30 trials and the resulting spike trains are shown in Fig. 1A. Notice that in each repetition of the same movement, there is a gap in the spikes that occurs at slightly different times with variable lengths. This time shift in the gap in spikes is indeed an example of the notion of phase variability or compositional noise, a central topic in functional data analysis. The observed gap in spikes should be reflected in the underlying Poisson intensity estimate. However, using kernel-based estimation methods without accounting for phase variability results in an intensity estimate (shown in red in Fig. 1B) that does not capture this gap in spiking activity. The method introduced later in this paper does consider the presence of phase variability and yields an estimate of the underlying intensity of the spike train that clearly depicts the observed gap in the spiking activity (shown in blue in Fig. 1B). Therefore, it is important to develop estimation procedures that consider the presence of phase variability in repeated observations of the same process, and that is the goal of this paper. One key concept in functional data analysis where phase variability plays the central role is the notion of function registration, or alignment. Indeed, function registration is an important topic in functional data analysis and a significant amount of research progress has been made over the past two decades [23,13,32,16,34]. In order to properly register functions, one must consider two types of variability present in data: phase and amplitude variability. Phase variability describes the degree of "unalignment" in the data, and amplitude variability is the remaining variability in the vertical axis after alignment. The goal of function registration is to align the functions by removing phase variability. If analysis (such as principal component analysis or regression) is conducted on data which are not well aligned, one may obtain poor or undesired results.
While function registration has been extensively studied, the notion of aligning point processes with compositional noise has not been well studiedall aforementioned intensity estimation methods are based on the assumption there is no phase variability in the observed processes. However, as indicated in the above spike train example, that is not always a reasonable assumption. To understand the phase variability in point process observations, recent studies on intensity estimation in Poisson process have begun to identify and remove compositional noise during the estimation procedure. For example, Bigot and colleagues examined the estimation of the underlying intensity function for a set of linearly shifted Poisson processes [5]. They assumed that the intensity function is periodic and each realization of the process is warped according to a linear shift in time that follows a known distribution. Under the stated assumptions, the authors derived a wavelet-based estimator. They argued that the assumption of a linear shift in the observed processes is a reasonable assumption, particularly in an example of DNA Chip-Seq data. However, there are many other cases where it is not reasonable to assume that the phase variability is a simple linear shift such as examples in the literature of functional registration [23,34]. As a result, restricting the warping function to be strictly linear shifts may limit the general applicability of their method. In another recent study, Panaretos and Zemel proposed to separate amplitude and phase variation in order to align point processes [21]. Basically, they extended the notion of the separation of phase and amplitude variation in functions to that of point processes. While the work of Panaretos and Zemel applies generally to point processes, the goal of their work is fundamentally different from the goal of the work in this project. Their goal is estimation of the probability measure and they comment that their work is not to be used for density estimation (see Section 3.4 of [21]). The goal of the work in this project is intensity estimation, which will be shown reduces to density estimation.
In this paper we propose a new framework for intensity estimation of a Poisson process with compositional noise. We show that the noise is only encoded in the normalized intensity, or density, function. The estimation is based on our proposed metric which measures the phase difference between two density functions so the notion of the Karcher mean can be applied in the given framework. Since the only parameter in the method is the bandwidth for the kernel density estimate, the proposed method is a mostly non-parametric method that yields a consistent estimator of the underlying intensity function.
The rest of this paper is organized as follows. In Section 2, we present the new framework for positive intensity estimation and discuss its mathematical and computational properties. Consistency theory on the estimation algorithm is given in Section 3. In Section 4, we extend the estimation to nonnegative intensity functions. The estimations on positive and nonnegative intensity are illustrated with two simulations, respectively, in Section 5. We then show the application of intensity estimation in a real dataset of neural spike trains. Section 6 summarizes the work. Finally, all mathematical details are given in the Appendix.

Method
In this section, we present the new framework for positive intensity estimation of a Poisson process with non-linear time warping. Compositional noise is represented with time warping functions and, since the intensity of a Poisson process is a function, the representation of time warping is studied in the function space. The notation and representation in the function space that is given here is consistent to that in [19,31]. First, we will review the basics of Poisson processes [27] and the representation of time warping in function space [19].

Review of Poisson Process and Time Warping Representation
A Poisson process on the time domain [0, 1] is a special type of counting process N(t), t ∈ [0, 1]. For simplification of notation, we only examine the domain [0, 1] in this paper, and the framework can be easily adapted to any finite time interval. In the classical theory of point processes, a Poisson process is defined based on an intensity function λ(t) ≥ 0 and satisfies the following two conditions [27]: 1. Disjoint intervals have counts that are independent. In other words, the number of events occurring in the interval (a, b) is independent of the number of events occurring in the interval (c, d) if these two intervals are not overlapping.
2. The number of events in an interval (a, b) ⊂ [0, 1] follows a Poisson distribution with mean b a λ(t)dt. In other words, We denote a Poisson process with intensity λ(t) as P P (λ(t)). For distinction, a Poisson distribution with mean µ is denoted as P oisson(µ), and a Poisson probability mass function with mean µ at k is denoted as P oisson(k; µ) = e −µ µ k /k!. We represent compositional noise with time warping functions. Since the intensity of a Poisson process is a function, we study the representation of time warping in the function space. Let Γ be the set of all warping functions, where time warping is defined as an orientation-preserving diffeomorphism of the domain [0, 1]. That is, Elements of Γ form a group with function composition as the group action, and the identity in this group is the self-mapping γ id (t) = t. For any function h, we will use h to denote its L 2 norm ( 1 0 h(t) 2 dt) 1/2 . There are three different types of (right) group actions about time warping that can occur in the function space: where • denotes the conventional function composition. The properties on associativity and isometry of these three group actions are summarized in Table  1. In particular, the amplitude-preserved group action is the conventional registration for functions with phase variability and has been extensively studied over the past two decades [22,1,17]. The enery-preserved group action plays an essential role in the Fisher-Rao registration framework [31], where this action is applied in the Square-Root Velocity Function (SRVF) space (note: it is critical that in the Fisher-Rao framework there is a oneto-one correspondence between the energy-preserved SRVF space and the amplitude-preserved observational function space). In the following sections of this manuscript, we will show that the compositional noise in the Poisson process intensity function is properly characterized by the area-preserved group action.

Poisson Process with Compositional Noise
Before formally stating the main problem, we review the classical estimation problem in Poisson processes: Given a set of independent realizations from a Poisson process on [0, 1], how can we estimate the underlying intensity function? By notation, the set of realizations are given in the following form, where k i ∼ P oisson 1 0 λ(t)dt , i = 1, 2, . . . , n. Various computational approaches have been developed to address this problem, which include penalized projection estimators [25], wavelet methods [37,11], and estimators based upon thresholding rules [26].
In this paper, we assume the observed data are not {R i }, but a warped version in the form where γ i is a random time warping in Γ, i = 1, · · · , n. That is, Given observations {S i }, our goal is still to estimate the underlying intensity λ(t). To make the model identifiable, we add the constraint that the mean of {γ i } needs to be a scaled version of γ id (The detail on assumptions is clearly provided in Sec. 3). Note that since the time warping can be in any nonlinear form, this estimation problem is a significant challenge. A recent study only examines the case when the warping is a simple linear shift along the time axis [5]. As the warping function γ i is random, the warped process γ −1 i (R i ) is no longer a Poisson process, but a Cox process. Here we study, "Conditional on γ i , is γ −1 i (R i ) still a Poisson process? If this is true, what is the intensity function of that Poisson process?" Our answer is yes to the first question and the intensity function of the new Poisson process is given as follows.
Proof. If R is a Poisson process, then the number of events of R in the time interval (a, b) is independent of the number of events of R in the time interval Hence, the number of events in (γ(a), γ(b)) is also independent of the number of events in (γ(c), γ(d)).

For any
The last equality holds simply by the change of variable v = γ(u). Therefore, A direct result from Lemma 1 is that given γ i , S i is also a Poisson process and Based on the theory of Poisson processes, the intensity function λ(t) can be decomposed into the product of the total intensity Λ and the density function Therefore, the intensity estimation problem can be reduced to density estimation and scalar total intensity estimation. Note that for i = 1, · · · , n, That is, Λ is constant with respect to time warping. Hence, the density of the events in S i , given γ i , can be written as This expression indicates that the time warping is encoded in the density function, and independent of total intensity. By the theory of Poisson processes, the number of events in each process follows a Poisson distribution with mean Λ. For a set of given observations {S i }, Λ can be easily estimated using a conventional maximum likelihood estimate. Therefore, the intensity estimation problem reduces to estimating the underlying density f . Given {S i }, we propose a modified kernel method to estimate density functions {f i }, and then use these densities to estimate f . This whole procedure is described in detail in Section 2.5.

Phase Distance Between Positive Probability Density Functions
In this paper, we focus on a metric-based method to estimate the underlying density f . Metric distances between density functions is a classical topic and a number of measures have been proposed, for example, the Bhattacharyya Distance [4], the Hellinger Distance [15], the Wasserstein Distance [36] and the elastic distance beween densities based upon the Fisher-Rao metric [30].
Suppose f 1 and f 2 are two density functions on [0,1] with cumulative distribution functions F 1 and F 2 , respectively. Then, these metrics are defined as: Note that the Fisher-Rao metric between two density functions is similar to the Hellinger Distance (arc length vs. chord length) [30].
Based on the generative model in Eqn. 1, the difference between the true underlying density function and the noise-contaminated density is the time warping along the time axis. Such a difference is characterized as the phase difference and we expect that a metric measuring phase difference will be purely based on the warping function between two densities. That is, the distance between f 1 and f 2 will only depend on γ if f 1 = (f 2 ; γ). However, none of the above metrics purely measure this phase difference between two density functions. We aim to find a metric that can properly characterize such phase difference. In this paper, we will define a new distance between positive densities which properly measures their phase difference. The set of all positive density functions on [0, 1] is denoted as P.
We note that for any densities f 1 , f 2 ∈ P, their cumulative distribution functions F 1 , F 2 are warping functions in Γ. By the group structure of Γ, it is straightforward to find that the optimal warping function between f 1 and f 2 (i.e. γ * ∈ Γ such that f 1 = (f 2 • γ * )γ * or F 1 = F 2 • γ * ), is unique and has a closed-form solution given by Based on this result, it is natural to define a distance that measures the phase difference by measuring how far the warping function is from the identity warping function, γ id . In other words, the smaller the distance between the warping function and γ id , the less warping that is required between the two densities. One definition of the distance metric is given as follows.

Definition 1.
For any two functions f 1 , f 2 ∈ P, we define an intrinsic distance, d int , between them as: where γ is the optimal time warping between f 1 and f 2 (i.e. This definition of phase distance has been used in the Fisher-Rao framework [34]. This distance is intrinsic which measures the arc-length between √γ and 1 in the unit sphere S ∞ (SRVF space of Γ). Note that the definition of phase distance in P is not unique. We can also define an extrinsic distance as follows: For any two functions f 1 , f 2 ∈ P, we define an extrinsic distance, d ext , between them as: where γ is the optimal time warping between f 1 and f 2 (i.e.
To simplify the notation, we use γ f denote the cumulative distribution function of f ∈ P. Then Also notice that because the optimal warping function γ This shows that the consistency results that hold for d γ will also hold for d W , but the reverse is not true in general. Similar to the Wasserstein and Hellinger distances, this d γ metric is also a proper distance. The detailed proof in given in Appendix A. While d γ is not isometric like Bhattcharya and Hellinger, it is the only metric (within these four) that characterizes the phase difference between f 1 and f 2 . Either d int or d ext can be used to estimate the underlying density f . In this paper, we choose to focus on the extrinsic distance d ext for two reasons: 1. Computational algorithms based upon the extrinsic distance are usually more efficient than those based on the intrinsic distance. 2. The extrinsic distance provides a closed-form Karcher mean representation (see definition next), which plays an essential role in developing the asymptotic theory for our estimator in Sec. 3.

Karcher Mean
The notion of a Karcher mean was used on the set of warping functions where an extrinsic distance between warping functions is adopted [38]. That is, assuming γ 1 , · · · , γ n ∈ Γ is a set of warping functions, their Karcher mean γ can be defined asγ It was shown in [38] that this Karcher mean has a closed-form solution: where γ is the SRVF ofγ.
Similar to the Karcher mean of a set of warping functions in Γ, we can define the Karcher mean of a set of density functions in P. This definition is based on the newly-defined phase distance d ext in Eqn. 4. Definition 3. We define the Karcher mean µ n of a set of functions f 1 , · · · , f n ∈ P as the minimum of the sum of squares of distances in the following form: Based on the closed-form solution for the Karcher mean of a set of warping functions, we can efficiently compute the Karcher mean in Eqn. 5 using the following algorithm.

Algorithm 1: Karcher Mean Computation
Given a set of density functions f 1 , . . . , f n ∈ P, and their cumulative distribution functions F 1 , . . . , F n , respectively.

Compute the Karcher meanγ of {γ
The algorithm for computing the Karcher mean of functions in P is illustrated with a simple example in Fig. 2. The 10 gold lines in the figure denote the density functions of Beta distribution on the domain [0, 1] in the form Here the parameters α takes value 1, 1, 1.5, 2, 2, 2.5, 3, 3, 4, 5, and β takes value 4, 3, 3, 2.5, 2, 2, 1.5, 1, 1, 2 for the 10 functions, respectively. The Karcher mean of these functions was computed using Algorithm 1 and the result is shown as the thick red line in Fig. 2.

Intensity Estimation Method
Since the Karcher mean of a set of density functions (computed under d γ ) is itself a density function, we use the Karcher mean as an estimate of the underlying density of the process. In our proposed estimation method, the Karcher mean, as computed using Algorithm 1, is used in conjunction with the MLE of the total intensity of the process to produce an estimate of the intensity function. Note that the computation for Karcher mean using Algorithm 1 is based on the assumption that each warped density f i (t) is already known, but practical data are only Poisson process realizations. In this section, we propose a kernel estimation procedure to estimate f i (t).

Modified Kernel Density Estimation
Kernel density estimation has been well studied in statistics literature and it is well known that the standard kernel density estimator has good asymptotic properties when the domain is the real line. However, when the domain is a compact set such as [0, 1] in this paper, the standard kernel density estimator cannot be directly used. We adopt here a reflection-based method to address this issue [9,28,29]. Suppose x 1 , . . . , x m are observations in [0, 1] whose density is given by f ∈ P. The standard kernel density estimator is given byf where K(·) is a kernel function and h denotes the kernel width. Note that this estimated density is defined on the real line (−∞, ∞), and the section within [0, 1] in general is not a density function itself. To simplify the estimation procedure, we can choose kernel functions with compact support within [−1, 1]. That is, K(t) = 0 for |t| > 1.
Here we propose a two-step modification of the estimatef (t). At first, we wrap aroundf within the domain [0, 1], and denote the new function asf , a density function on [0, 1]. Secondly, we add a small positive constant tof , and then normalize the sum to be a density function. This step is to assure that the normalized function is positive on [0, 1], a necessary condition for the existence of the warping functions used in the distance d ext . The modified kernel density estimation can be summarized in the following algorithm.
t ∈ R, using an appropriate bandwidth, h, and a kernel function K with compact support (e.g. a Beta density function).

Estimation Algorithm
Estimation of the intensity of the process occurs in two independent components. First, the total intensity Λ can be easily computed with a standard MLE procedure. Second, the Karcher mean of the estimated densities is used to estimate f (t). This estimation algorithm is given as follows.

Algorithm 3: Intensity Estimation Algorithm
Given a set of observed processes S i with number of events being k i , i = 1, . . . , n,

Estimate Λ by its MLE
2. Use Algorithm 2 to estimate the density of each observed process, f i (t), i = 1, · · · , n.
4. Use Algorithm 1 to estimate the overall underlying density,f (t), as the Karcher mean of {f i }.
5. Estimate the underlying intensity λ(t) in the original process using:

Asymptotic Theory on Consistency
Asymptotical properties of estimators are often of interest since these properties can give reasonable certainty that the ground-truth parameters are appropriately estimated by the given algorithms. In this section, we provide asymptotic theory on the density estimatorf in Algorithm 3. Our estimation is based on the model where λ is the underlying intensity function and γ i ∈ Γ, i = 1, . . . , n, are a set of warping functions. By Lemma 1, each observation S i is a Poisson process realization with intensity λ i . Given {S i }, Algorithm 3 provides an estimation procedure for λ. As the total intensity Λ is independent of time warpings, our asymptotical theory will focus on the normalized intensity, i.e. intensity function f = λ/Λ. We mathematically prove that the proposed algorithm provides a consistent estimator for f . The asymptotic theory is based on sample size n as well as the total intensity Λ. Here we only provide result on the main theorem. All lemmas that lead to the theorem can be found in Appendix B.
Before we state the main theorem, we list all assumptions as follows: 1. The observations are a sequence of Possion process realizations {S i }, and S i follows intensity function 3. γ i (t), t ∈ [0, 1], i = 1, · · · , n are a set of independent warping functions. The SRVFs of their inverses γ −1 i (t) distribute around √γ id = 1 on the Hilbert unit sphere H ∞ . In particular, E( γ i (t)) ≡ β > 0 and there exist m γ , M γ > 0 such that γ −1 i (t) ∈ [m γ , M γ ], for any t ∈ [0, 1]. It is important to note that γ −1 i (t) is a point on the Hilbert unit sphere. As a result, it is easy to show that assuming E γ −1 i (t) = β > 0 is equivalent to assuming that the extrinsic mean of { γ −1 i (t)} is 1.
4. The total intensity Λ can vary in the form of a sequence {Λ m } ∞ m=1 . We assume the sequence goes to ∞ with Λ m ≥ α log(m), α > 1 for sufficiently large m.

The bandwidth of the kernel density estimator in Algorithm 2 is chosen
optimally. That is, for a sequence of r events, the bandwidth h r satisfies h r → 0 and rh r → ∞ when r → ∞.
− − → 0 when n → ∞. Note that 1,γ − (1,γ) depends on the total intensity Λ m and sample size n. We will show that this term also converges to 0 when m is large (for any fixed n).
To simplify the notation, we denote a i = γ −1 i ,â i = γ −1 i , i = 1, · · · , n. Let the number of events in S i be n i . Then n i is a random variable following Poisson distribution with mean Λ m . By Lemma 5, n i a.s. − − → ∞ when m → ∞. Using Lemma 3, Note that the convergence of 1,γ − (1,γ) is for any sample size n. Finally, we have proved that

Extension to Nonnegative Intensity Functions
The method developed thus far applies only to strictly positive density functions. In practice, this may be a quite restrictive condition and it is desired to extend the method to non-negative density functions. Our estimation is still based on the model where λ ≥ 0 is the underlying intensity function and γ i ∈ Γ, i = 1, . . . , n, are a set of warping functions. In this section, we propose to extend Algorithm 3 to estimate this nonnegative λ with Poisson process observations.

Representation of Nonnegative Intensities
For estimation, our focus is still on the density function f = λ/Λ as the total intensity Λ is independent of the time warping. Let F denote the CDF of f . Then F (0) = 0, F (1) = 1. However, as f is nonnegative, F may not be strictly increasing on the domain [0, 1]. To simplify the representation, we assume that F is strictly increasing except being constant on a finite number, K, of non-overlapping intervals (This finiteness assumption would be sufficient for nonnegative intensities in practical use). Let F = {F • γ|γ ∈ Γ} denote the set of CDFs which are warped versions of F , and F i be the CDF of f i = λ i /Λ. Then F i = F • γ i ∈ F will also be constant on corresponding intervals. In general, let h, g be two density functions whose CDFs H, G are in F . Then H and G are strictly increasing except being constant on K nonoverlapping intervals. We define Γ h,g = {γ ∈ Γ|h = (g • γ)γ} = {γ ∈ Γ|H = G • γ}. By construction, Γ h,g = ∅. We denote the K constant intervals for H and G are [a 1 , b 1 ], · · · , [a K , b K ] and [c 1 , d 1 ], · · · , [c K , d K ], respectively. For any γ ∈ Γ h, g , we must have γ(a k ) = c k and γ(b k ) = d k for k = 1, · · · , K. To include the boundary points, we denote b 0 = d 0 = 0 and a K+1 = c K+1 = 1. It is our goal to characterize all warping functions in Γ h, g .
Note that the function G is strictly increasing on each interval [d k , c k+1 ], k = 0, 1, · · · , K. Now we define a mapping G k : [d k , c k+1 ] → R as follows, It is apparent that G k is strictly increasing on its domain [d k , c k+1 ], k = 0, 1, · · · , K. For any γ ∈ Γ h,g and t . We then focus on the regions [c k , d k ], k = 1, · · · , K where G is constant (note: G −1 does not exist). Note that H(a k ) = G(γ(a k )) = G(c k ) = G(d k ) = G(γ(b k )) = H(b k ). Hence, any γ ∈ Γ with γ(a k ) = c k , γ(b k ) = d k satisfies that H(t) = G(γ(t)) for any t ∈ [a k , b k ]. Finally, we have shown that the set Γ h,g can be characterized as follows,

Estimation of Nonnegative Intensities
In Sec. 2, we defined a phase distance d ext between two positive density functions. Here we generalize the distance to nonnegative densities. Definition 4. Let h, g be two density functions whose CDFs H, G are in F . We define the distance between h and g as We present three properties of this distance below.
1. D is a generalization of the distance d ext -for strictly positive densities h, g, the set Γ h,g has single element G −1 • H, and therefore D(g, h) = d ext (g, h).
2. D is a proper distance. The proof of this property is similar to that for the distance d ext (see Appendix A) and is, therefore, omitted here.  Estimation Method: The estimation of nonnegative intensities follows the same procedure as in the Intensity Estimation Algorithm (Algorithm 3), where Algorithm 1 calls for the Karcher mean computation. However, in this case we need to update the second step of Algorithm 1 (computation of optimal warping between F 0 and F j ), the new optimal form in Eqn. 7 is adopted. Analogous to the proof in Sec. 3, one can demonstrate that the estimated nonnegative intensity is also an consistent estimator (under the metric D in Eqn. 6). We omit the details in this manuscript to avoid repetition.

Experimental Results
In this section we will demonstrate the proposed intensity estimation using two simulations -one is for a strictly positive intensity, and the other is for an intensity with zero-valued sub-regions. We will also apply the new method in a real spike train dataset and evaluate the classification performance using the estimated intensities.

Poisson Process with a Positive Intensity Function
Twenty independent realizations of a non-homogeneous Poisson process were simulated with the intensity function λ(t) = 100(3 + 2 sin((8t − 1/2)π)) on [0, 1]. This intensity function and these 20 original processes are shown in Fig.  3A. Because of the non-constant intensity, there is a higher concentration of events during intervals with high intensity and fewer events during intervals with low intensity. This pattern is easily seen in the simulated processes. We then generate 20 warping functions {γ i } 20 i=1 in the following form: Here a i are equally spaced between −2 and 2, i = 1, · · · , 20. These warping functions are shown in Fig. 3B. We then warp the 20 independent Poisson process using these 20 warping functions, respectively, by the formula in Eqn. 1. The resulting warped processes are shown in Fig. 3C. Comparing these processes with those in Fig. 3A, we can see that the clear link between number of events in each sub-region and the intensity value no longer exists. Given these noisy Poisson process observations, we aim to reconstruct the underlying intensity function λ(t).
The individual estimated density functions for the warped processes are shown in the top panel of Figure 4A. The true warped density functions are shown in the bottom panel of Figure 4A. The underlying intensity function was estimated for two different cases. In the first case, time warping is present and ignored during estimation. In the second case, time warping is present and accounted for in the estimation using the proposed method. Both of these estimates are displayed with the true intensity function for comparison in Figure 4B. When time warping is present and ignored, the estimated intensity function underestimates the true intensity in the middle two-thirds of the curve and the true pattern is not revealed. However, when the warping is accounted for during the estimation process, the estimate is a much better estimate of the true intensity function. When the warping functions are more severe (shown in Figure 4C; a i ∈ [−4, 4]), the performance decreases in all methods ( Figure 4D). The L 1 -, L 2 -, and L ∞ -norms were all used to measure the error in estimating the true intensity for each method (Table 2). However, the proposed method consistently has the lowest error regardless of which norm is used to measure the error.

Poisson Process with a Nonnegative Intensity Function
In this second example, we illustrate the estimation method for non-negative intensity functions in Sec. 4. The underlying intensity function is defined on [0, 1] and given in the following form:  This intensity, shown in Fig. 5A, has a trianglar shape with two flat subregions, [0, 0.25] and [0.75, 1], which occur on either side of the triangle whose peak is located at t = 0.5.
We then generate 11 warping functions {γ i } 11 i=1 in the following two steps: At first, we defineγ i ∈ Γ on [0, 1] as: where Then, each γ i (t) is defined by linearizingγ i (t) at the value points t = [0, 0.25, 0.5, 0.75, 1]. These warping functions are shown in Fig. 5B. The warped intensity functions, λ i (t) = λ(γ i (t))γ(t), are shown in the top panel of Fig. 5C. We then simulate 11 independent Poisson processes using these 11 intensity functions, respectively, and the results are shown in Fig. 5D. We can see that these realizations clearly display the warped intensity functions along the time axis. Given these noisy Poisson process observations, we aim to reconstruct the underlying intensity function λ(t).
To estimate λ(t), we first estimate the warped intensity functions using modified kernel method on the 11 observed realizations. We fitted a truncated Gaussian kernel with bandwidth h = 0.01 to estimate the intensities. The result is shown in the lower panel of Fig. 5C. Comparing with the true intensities in the corresponding upper panel, we can see the kernel method provides a reasonable estimation. In spite of the phase shift along the time axis, the kernel method estimates the flat subregions in the underlying intensity appropriately.
Once the individual intensities are estimated, we then compute their Karcher mean to get the the warping functions with formula in Eqn. 7. These warping functions were then used to estimate of the underlying intensity function for the process and the result is shown in Fig. 5F. Comparing the result with the true intensity function, we find that the proposed method provides a very accurate reconstruction.

Application in Spike Train Data
In this section the proposed intensity estimation method will be applied to a benchmark spike train dataset. This dataset was first used in a metric-based analysis of spike trains [38], and was also used as a common data set in a workshop on function registration, CTW: Statistics of Time Warpings and Phase Variations in Mathematical Bioscience Institute in 2012. It is publicly available from http://mbi.osu.edu/2012/stwdescription.html and is the same dataset used in Chapter 1. For completeness, a brief summary is given again. The spiking activity of one neuron in primary motor cortex was recorded in a juvenile female macaque monkey. In the experimental setting, a subject monkey was trained to perform a closed Squared-Path (SP) task by moving a cursor to targets via contralateral arm movements in the horizontal plane. Basically, the targets in the SP task are all fixed at the four corners of a square and the movement is stereotyped. In each trial the subject reached a sequence of 5 targets which were the four corners of the square with the first and last targets overlapping. Each sequence of 5 targets defined a path, and there were four different paths in the SP task (depending on the starting point). In this experiment, 60 trials for each path were recorded, and the total number of trials was 240.
To fix a standardized time interval for all data, the spiking activity in each different movement paths. In general, there are two types of decoding methods: i) classification based on pairwise distance between training and test data, and ii) classification using distance from test data to the Karcher mean in the training data. Note that the pairwise method has a quadratic efficiency (Cost is O(N 2 ), where N is the number of spike trains in training and testing set), but distance-to-the-mean is in the linear order [38]. In this chapter, the decoding result is reported using the efficient mean-based method.
Once an estimate for the density of each of the spike trains was obtained, the Karcher mean for each path was calculated using Algorithm 1 with one minor change to overcome numerical issues. In step 2, instead of directly using the CDF and inverse CDF of the two densities, the individual warping functions are found using Dynamic Programming [30]. The penalty coefficient used in the Dynamic Programming was 0.01, although the results are robust to the choice of this penalty coefficient. The computed Karcher means in each path are shown in Fig. 7.
Comparing with the original spike trains, all of the mean spike trains appropriately represent the firing patterns in the corresponding movement. For example, the spiking frequency is relatively higher when the hand moves upward, which is apparent in all four means. For the 120 test trains, each train is labeled by the shortest distance over the distances to the four means in the training set. This computation is apparently more efficient (only 120 × 4 = 480 distances need to be computed). It is found that the classification accuracy using the proposed estimation method is 82.5%(99/120) whereas the classification accuracies using the naive cross-sectional method and the Fisher-Rao registration method are 77.5%(93/120) and 55.0%(66/120), respectively. This result shows the proposed method can better differentiate neural signals with respect to different movement behaviors. The lower accuracy in the naive method indicates that the proposed method improves classification results.

Discussion
Intensity estimation has been a classical problem in Poisson process methods. The problem is significantly challenging if the observed data are corrupted with compositional noise, i.e. there is time warping noise in each realization. In the paper, we have proposed a novel alignment-based algorithm for positive intensity estimation. The method is based on a key fact that the intensity function is area-preserved with respect to compositional noise. Such a property implies that the time warping is only encoded in the normalized intensity, or density, function. Based on this finding, we decompose the estimation of intensity by the product of estimated total intensity and estimated density. Our investigation on asymptotics shows that the proposed estimation algorithm provides a consistent estimator for the underlying density. We further extend the method to all nonnegative intensity functions, and provide simulation examples to illustrate the success of the estimation algorithms.
While results from this method show promising improvements over previous methods, it is important to note that the method is dependent upon the kernel density estimates of the observed processes. In general, kernel density estimates are highly dependent upon the chosen bandwidth h [24,12]. In this paper, we have used a simple plug-in method to determine an appropriate bandwidth. In future work, we will consider the development of an algorithm that can automatically choose the optimal bandwidth for the modified kernel density estimator. Additionally, future work will examine the asymptotic variability of this estimator and an extension to general Cox processes for conditional intensity estimation.
estimate with optimal bandwidth given in Algorithm 2, then Here we will show that the first term goes to 0 (a.s.). Indeed, hn 0ĝ − − → 0. The convergence to 0 for the second to fourth terms on the RHS of Eqn. 9 can be similarly proven, and therefore      − − → 0 (by Lemma 3). That is,Ĝ n ⇒ G (uniform convergence) almost surely. By the theory on convergence of inverse functions [35], we also got that F n ⇒ F (a.s.).