
ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 08 March 2023
Sec. Statistical and Computational Physics
This article is part of the Research Topic Advances in Information Geometry: Beyond the Conventional Approach.

Geometric properties of noninformative priors based on the chi-square divergence

Fuyuhiko Tanaka1,2
  • 1Center for Education in Liberal Arts and Sciences, Osaka University, Suita, Japan
  • 2Center for Quantum Information and Quantum Biology, Osaka University, Suita, Japan

Recently, a noninformative prior distribution that is different from the Jeffreys prior was derived as an extension of Bernardo's reference prior based on the chi-square divergence. We summarize this result in terms of information geometry and clarify some geometric properties. Specifically, we show that it corresponds to a parallel volume element and can be written as a power of the Jeffreys prior in flat model manifolds.

1. Introduction

The problem of noninformative priors in Bayesian statistics is to determine what kind of probability distribution (often called a noninformative prior or an objective prior) is desirable on a statistical model in the absence of information about the parameters. In theory, though not in practice, it is essentially a problem of small-sample statistics, which has been under consideration for a long time [1–4].

Theoretical research on noninformative priors dates back to Jeffreys [3], and currently, a noninformative prior proposed by him, called the Jeffreys prior, is the standard noninformative prior. Theoretical justification of the Jeffreys prior comes from the theory of reference priors, which were originally proposed by Bernardo [5] decades ago when considering the maximization of the mutual information between the parameter and the outcome. Many related studies in this direction have since been reported [for review, see, e.g., Berger et al. [6]].

On the other hand, there are several other criteria for constructing noninformative priors. For example, Komaki [7, 8] has proposed objective priors that improve the performance of Bayesian predictive densities. Some significant results in this direction were presented by his co-workers, including the author [e.g., noninformative priors on time series models [9, 10]]. From the viewpoint of information geometry, Takeuchi and Amari [11] proposed the α-parallel prior. For a recent review of other noninformative priors, see, e.g., Ghosh [12].

Recently, considering a certain extension of Bernardo's reference prior, Liu et al. [13] showed that a prior distribution different from the Jeffreys prior can be derived. Since it is based on the chi-square divergence, we call it the χ²-prior for convenience. Unlike those of the Jeffreys prior, the geometric properties of the χ²-prior have not yet been discussed.

In the present study, we investigate the derivation of the χ²-prior by Liu et al. [13] from the viewpoint of information geometry. We put emphasis on the invariance of the theory under reparametrization (coordinate transformation in differential geometry). While we follow their derivation, we rewrite the asymptotic expansion in geometric terms, which makes the problem easier to understand. We also derive the tensor equations that the χ²-prior and an α-parallel prior satisfy. As a consequence, we find that the χ²-prior agrees with the α-parallel prior for α = 1/2, i.e., the 1/2-parallel prior.

Basic definitions and notation are given in Section 2, where we also review some noninformative priors in terms of information geometry. In Section 3, we rewrite the asymptotic expansion by Liu et al. [13] in geometric terms to simplify their argument. In Section 4, we briefly review α-parallel priors, clarify the relation between the χ²-prior and α-parallel priors, and derive a formula for the α-parallel prior in γ-flat models. Finally, concluding remarks are given in Section 5.

2. Preliminaries

We briefly review some definitions and notation of information geometry [for details, refer to textbooks on information geometry [14, 15]]. We also review some noninformative priors in terms of information geometry.

For a given statistical model, we would like to consider noninformative prior distributions defined in a manner independent of parametrization. For this reason, it is convenient to introduce differential geometrical quantities into our discussion, i.e., to consider them from the viewpoint of information geometry.

2.1. Basic definitions of information geometry

Suppose that a statistical model M = {p(x; θ) : θ ∈ Θ ⊂ R^p} is given, which is regarded as a p-dimensional differential manifold and called a statistical model manifold (though it will be called simply a model where no confusion is possible). As usual, all necessary regularity conditions are assumed.

We also define the Riemannian metric and affine connections on the manifold M. Let l = log p(x; θ) denote the log-likelihood function.

Definition 1. The Riemannian metric gij = g(∂i, ∂j) is defined as

g_{ij} = E[\partial_i l \, \partial_j l],

where ∂_i l = ∂l/∂θ^i = ∂ log p(x; θ)/∂θ^i and E[·] denotes expectation with respect to the observation x. The above quantities are also called the Fisher information matrix in statistics. Thus, we often call the above metric the Fisher metric.
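As a concrete illustration (our own example, not taken from the article), the following sympy sketch computes the Fisher metric for the Bernoulli model p(x; η) = η^x (1 − η)^{1−x} directly from the definition above:

import sympy as sp

eta, x = sp.symbols('eta x', positive=True)
log_p = x * sp.log(eta) + (1 - x) * sp.log(1 - eta)   # log-likelihood l
dl = sp.diff(log_p, eta)                              # score dl/d(eta)

def expect(f):
    # expectation over x in {0, 1} with P(x = 1) = eta
    return sp.simplify(f.subs(x, 1) * eta + f.subs(x, 0) * (1 - eta))

g = expect(dl**2)       # Fisher metric (a single function, since p = 1)
print(g)                # expected: 1/(eta*(1 - eta))

For this one-parameter model the metric is a single function, and the familiar binomial Fisher information 1/(η(1 − η)) is recovered.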

The statistical cubic tensor and the coefficients of the e-connection are defined as

T_{ijk} = E[\partial_i l \, \partial_j l \, \partial_k l], \qquad \Gamma^{(e)}_{ij,k} = E[\partial_i \partial_j l \, \partial_k l].

Definition 2. For every real α, the p^3 quantities

\Gamma^{(\alpha)}_{ij,k} = \Gamma^{(e)}_{ij,k} + \frac{1-\alpha}{2}\, T_{ijk}

define an affine connection, which is called the α-connection.

We identify an affine connection with its coefficients below. Connection coefficients with upper indices are obtained by

\Gamma^{(e)\,k}_{ij} = \Gamma^{(e)}_{ij,l}\, g^{lk}, \qquad \Gamma^{(\alpha)\,k}_{ij} = \Gamma^{(e)\,k}_{ij} + \frac{1-\alpha}{2}\, T_{ijl}\, g^{lk},

where g^{ij} is the inverse matrix of the Fisher metric g_{ij}, and we have used Einstein's summation convention [see, e.g., Amari and Nagaoka [14] for details].

Conventionally, when α = 1, we call it the e-connection, and when α = −1, we call it the m-connection and denote it as Γ^{(m)}_{ij,k}, i.e.,

\Gamma^{(m)}_{ij,k} = \Gamma^{(e)}_{ij,k} + T_{ijk}.

It is well-known that the α-connection and the −α-connection are mutually dual with respect to the Fisher metric. (In a Riemannian manifold with an affine connection Γ, another affine connection Γ* is said to be dual to Γ if it satisfies ∂_k g_{ij} = Γ_{ki,j} + Γ*_{kj,i}. For equivalent definitions, see, e.g., Amari and Nagaoka [14], Chap. 3.) When α = 0, the self-dual connection is called the Levi-Civita connection, which defines a parallel transport that keeps the Riemannian metric invariant. The Levi-Civita connection is determined by partial derivatives of the metric, and its explicit form is given by

\Gamma^{(0)}_{ij,k} = \Gamma^{(e)}_{ij,k} + \frac{1}{2}\, T_{ijk} = \frac{1}{2}\left(\partial_j g_{ki} + \partial_i g_{kj} - \partial_k g_{ij}\right).    (1)
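Continuing the illustrative Bernoulli example (our own assumption, not from the article), the connection coefficients of Definition 2 can be computed in the same way; for p = 1 all indices equal 1, so a single function represents each of T, Γ^{(e)}, and Γ^{(α)}:

import sympy as sp

eta, x = sp.symbols('eta x', positive=True)
alpha = sp.Symbol('alpha', real=True)
log_p = x * sp.log(eta) + (1 - x) * sp.log(1 - eta)
dl = sp.diff(log_p, eta)
ddl = sp.diff(log_p, eta, 2)

def expect(f):
    return sp.simplify(f.subs(x, 1) * eta + f.subs(x, 0) * (1 - eta))

T_111 = expect(dl**3)          # cubic tensor: (1 - 2*eta)/(eta*(1 - eta))**2
Gamma_e = expect(ddl * dl)     # e-connection coefficient
Gamma_alpha = sp.simplify(Gamma_e + (1 - alpha) / 2 * T_111)   # Definition 2
print(T_111, Gamma_e, Gamma_alpha, sep='\n')

Setting alpha to 1, −1, or 0 reproduces the e-, m-, and Levi-Civita connections discussed in this subsection.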

2.2. Useful identities for alpha-connections

In the present study, the following identities are useful. They are obtained in a straightforward manner; thus, their proofs are omitted.

Lemma 1. Let mijk = E[∂ijkl]. Then,

m_{ijk} = -\partial_i g_{jk} - \Gamma^{(e)}_{jk,i},

and

\partial_k g_{ij} = \Gamma^{(e)}_{ki,j} + \Gamma^{(e)}_{kj,i} + T_{ijk} = \Gamma^{(e)}_{ki,j} + \Gamma^{(m)}_{kj,i}

hold.

The first equation yields relation (Equation 1). The last equation shows the duality of e- and m-connections directly and is generalized to ±α-connections.

Lemma 2. For mutually dual connections, the following identities hold.

\partial_k g_{ij} = \Gamma^{(\alpha)}_{ki,j} + \Gamma^{(-\alpha)}_{kj,i},    (2)
\Gamma^{(\alpha)}_{ki,j} - \Gamma^{(-\alpha)}_{ki,j} = -\alpha\, T_{kij}.    (3)

Using Lemma 1 and Equation (1), we obtain the Bartlett identity, which is well-known in mathematical statistics.

Lemma 3. For m_{ijk}, T_{ijk}, and the first derivatives of the Fisher metric g_{ij}, the following holds:

m_{ijk} = \frac{1}{2}\left(T_{ijk} - \partial_k g_{ij} - \partial_j g_{ik} - \partial_i g_{jk}\right).    (4)
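In one dimension, Equation (4) reads m_{111} = (T_{111} − 3 ∂g)/2, which the following sketch confirms on the same illustrative Bernoulli model (again, our own example):

import sympy as sp

eta, x = sp.symbols('eta x', positive=True)
log_p = x * sp.log(eta) + (1 - x) * sp.log(1 - eta)
dl, dddl = sp.diff(log_p, eta), sp.diff(log_p, eta, 3)

def expect(f):
    return sp.simplify(f.subs(x, 1) * eta + f.subs(x, 0) * (1 - eta))

m = expect(dddl)               # m_111 = E[third derivative of l]
T = expect(dl**3)              # T_111
g = expect(dl**2)              # g_11
print(sp.simplify(m - (T - 3 * sp.diff(g, eta)) / 2))   # expected: 0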

2.3. Prior distributions and volume elements

In Bayesian statistics, for a given statistical model M, we need a probability distribution over the model parameter space, which is called a prior distribution, or simply a prior. We often denote a prior density as π, where π(θ) ≥ 0 and \int_\Theta \pi(\theta)\, d\theta = 1.

A volume element on a p-dimensional model manifold corresponds to a prior density function over the parameter space (θ ∈ Θ ⊂ Rp) in a one-to-one manner. For a prior π(θ), its corresponding volume element ω is a p-form (differential form of degree p) and is written as

\omega = \pi(\theta)\, d\theta^1 \wedge \cdots \wedge d\theta^p

in the local coordinate system.

For example, in two-dimensional Euclidean space (p = 2), the volume element is given by ω = dx ∧ dy in Cartesian coordinates (x, y). In polar coordinates (r, θ), it is written as ω = r dr ∧ dθ.

Then, under a coordinate transformation θ → ξ, how do the probability density on the parameter space and the ratio of two such densities change? From the formula for the p-dimensional volume element, the density is written as

\tilde{\pi}(\xi) = \pi(\theta)\left|\frac{\partial\theta}{\partial\xi}\right|,    (5)

where |∂θ/∂ξ| denotes the Jacobian. In differential geometry, such quantities are called tensor densities.

From the above Equation (5), we see that the ratio of two probability densities, say π_1(θ)/π_2(θ), is invariant under reparametrization.
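A small numeric sketch makes Equation (5) and this invariance concrete; the logit map and the two densities below are our own illustrative choices:

import numpy as np

def to_logit_density(pi, xi):
    # density of xi = log(eta/(1 - eta)) when eta has density pi: Equation (5)
    eta = 1.0 / (1.0 + np.exp(-xi))    # inverse transformation
    jac = eta * (1.0 - eta)            # Jacobian |d eta / d xi|
    return pi(eta) * jac

pi1 = lambda e: 6.0 * e * (1.0 - e)    # Beta(2, 2) density on (0, 1)
pi2 = lambda e: np.ones_like(e)        # uniform density on (0, 1)
xi = np.linspace(-3.0, 3.0, 5)
eta = 1.0 / (1.0 + np.exp(-xi))
ratio_xi = to_logit_density(pi1, xi) / to_logit_density(pi2, xi)
print(np.allclose(ratio_xi, pi1(eta) / pi2(eta)))   # True: the ratio is a scalar

Each density picks up the same Jacobian factor, so the factor cancels in the ratio.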

2.4. Noninformative priors defined by equations

We briefly summarize some of the prior studies on noninformative priors in Bayesian statistics. Basically, a noninformative prior is often defined as the solution of a partial differential equation (PDE) derived from fundamental principles. If it is independent of parametrization, then it usually has a geometrical meaning. The defining equation itself is expected to be invariant under every coordinate transformation.

2.4.1. Tensor equations

Before proceeding, we briefly review the definition of tensor on a manifold [for strict modern definitions, see, e.g., Kobayashi and Nomizu [16], Chap. 1].

For simplicity, we assume that the manifold admits global coordinates Θ, and each point is specified by θ. We fix some nonnegative integers r and s. Suppose that a set of p^{r+s} functions of the parameter θ,

A_{a_1 \cdots a_r}^{b_1 \cdots b_s}(\theta), \qquad a_1, \ldots, a_r;\; b_1, \ldots, b_s = 1, \ldots, p,

is given, and that these functions also have a representation in a different coordinate system, say ξ. Suppose they satisfy the following equation:

\tilde{A}_{\alpha_1 \cdots \alpha_r}^{\beta_1 \cdots \beta_s}(\xi) = \tilde{\Lambda}_{\alpha_1}^{a_1} \cdots \tilde{\Lambda}_{\alpha_r}^{a_r}\, \Lambda_{b_1}^{\beta_1} \cdots \Lambda_{b_s}^{\beta_s}\, A_{a_1 \cdots a_r}^{b_1 \cdots b_s}(\theta),

where \Lambda_{b}^{\beta} = \partial\xi^{\beta}/\partial\theta^{b} denotes the Jacobi matrix and \tilde{\Lambda}_{\alpha}^{a} = \partial\theta^{a}/\partial\xi^{\alpha} denotes its inverse. Then these functions are called a type (s, r) tensor field, or simply a tensor.

Some specific types have established names. For example, a type (0, 0) tensor is called a scalar (field) and a type (1, 0) tensor is called a vector (field). In particular, the ratio of two prior densities is a scalar. For a differential one-form, which is written as A = A_j dθ^j, the set of components A_j is regarded as a covariant vector [type (0, 1) tensor].

For a type (s, r) tensor A, which often includes a derivative, we refer to an equation like A = 0 as a tensor equation. Usually, such a tensor A is derived using some differential operators, and the component-wise form yields a PDE. The component-wise form is given as

A_{a_1 \cdots a_r}^{b_1 \cdots b_s}(\theta) = 0, \qquad a_1, \ldots, a_r;\; b_1, \ldots, b_s = 1, \ldots, p.

By definition, tensor equations are invariant under coordinate transformation (reparametrization). When we show that A_{a_1 \cdots a_r}^{b_1 \cdots b_s}(\theta) = 0 in one coordinate system, say θ, then, for another coordinate system, say ξ, due to multilinearity,

\tilde{A}_{\alpha_1 \cdots \alpha_r}^{\beta_1 \cdots \beta_s}(\xi) = \tilde{\Lambda}_{\alpha_1}^{a_1} \cdots \tilde{\Lambda}_{\alpha_r}^{a_r}\, \Lambda_{b_1}^{\beta_1} \cdots \Lambda_{b_s}^{\beta_s}\, A_{a_1 \cdots a_r}^{b_1 \cdots b_s}(\theta) = 0

holds. Tensor equations are often written in the form A = B.

2.4.2. Noninformative priors

Now let us explain about noninformative priors [see, e.g., Robert [4] for more details]. As mentioned before, we need to set a prior distribution over the parameter space for a given statistical model in Bayesian statistics. If we have certain information on the parameter in advance, then the prior should reflect this, and such a prior is often called a subjective prior. If not, we adopt a certain criterion and use a prior obtained through the criterion. Such priors are called noninformative priors.

The definition of a noninformative prior, which is often written as a PDE, should not depend on a specific parametrization (a coordinate system of the model manifold): if we claim to have no information on the parameter, then we cannot determine which parametrization is natural. From this viewpoint, we examine several examples of noninformative priors defined through a PDE under a certain criterion. Some equations defining a noninformative prior are not tensor equations, and their solutions, that is, the noninformative priors, do not satisfy the same equation in another coordinate system.

2.4.3. Uniform prior

The uniform prior πU(θ) over the parameter space Θ would be the most naive noninformative prior. This idea dates back to Laplace and has been criticized [3]. The uniform prior is given by a solution of the following PDE:

\frac{\partial}{\partial\theta^i} \log \pi(\theta) = 0.    (6)

Clearly, the above PDE (Equation 6) is not a tensor equation. In other words, it is not invariant under reparametrization. While the solution for the original parameter θ is constant, πU(θ) ∝ 1, the solution for another parameter ξ is obtained by

\tilde{\pi}_U(\xi) = \pi_U(\theta)\big|_{\theta = f(\xi)} \times \left|\frac{\partial\theta}{\partial\xi}\right| = \left|\frac{\partial\theta}{\partial\xi}\right|.

Thus, the final form does not satisfy the PDE (Equation 6) for ξ any more. That is,

\frac{\partial}{\partial\xi^\alpha} \log \tilde{\pi}_U(\xi) \neq 0.

2.4.4. Jeffreys prior

Let us modify the above PDE (Equation 6) slightly so that it is invariant under coordinate transformation. Thus, we obtain the following PDE:

\frac{\partial}{\partial\theta^i} \log\left\{\pi(\theta)/\sqrt{g}\right\} = 0,    (7)

where g denotes the determinant of the Fisher metric. The solution, which is given as a constant times √g, is called the Jeffreys prior [3]. It is the most famous noninformative prior in Bayesian statistics. Let π_J(θ) (∝ √g) denote the Jeffreys prior from here on. It is a straightforward extension of the uniform prior.

As Jeffreys himself pointed out, it is not necessarily reasonable to adopt the Jeffreys prior as an objective prior in a higher dimensional parametric model. This is one of the reasons to propose noninformative priors under a fundamental criterion [see, e.g., Robert [4] and references therein].

Note that the following identity for the Riemannian metric tensor will be useful:

\partial_i \log \pi_J = \frac{1}{2}\, \partial_i \log g = \frac{1}{2}\, g^{jk}\, \partial_i g_{jk}.    (8)

2.4.5. First moment matching prior

The moment matching prior was proposed by Ghosh and Liu [17]. From the original article, we obtain a PDE in terms of information geometry.

Theorem 1. Ghosh and Liu's moment matching prior is given by the solution of the following PDE:

\partial_i \log\left(\frac{\pi(\theta)}{\pi_J(\theta)}\right) - \frac{1}{2}\, g^{jk}(\theta)\, \Gamma^{(e)}_{jk,i}(\theta) = 0.

From the aforementioned form, it is clearly not a tensor equation, and thus, the PDE is not invariant under reparametrization. Indeed, while the first term of the LHS is a (0, 1) tensor, the second term is not.

Proof. First, from the formula in Ghosh and Liu [17] (Section 3, p. 193), we obtain

n\left(\hat{\theta}_{\pi}^{m} - \hat{\theta}_{ML}^{m}\right) = \left(U^m + \frac{1}{2} V^m\right) + o_P(1),

where \hat{\theta}_{ML}^{m} and \hat{\theta}_{\pi}^{m} are the MLE and the posterior mean of θ, respectively, U^m = g^{ml}\, \partial_l \log\pi, and V^m = g^{ml} g^{jk} m_{ljk}. Therefore, the condition of first moment matching is given by

U^m + \frac{1}{2} V^m = 0.

Multiplying both sides by the Fisher metric g_{im}, we obtain the equivalent equation

\partial_i \log\pi + \frac{1}{2}\, g^{jk}\, m_{ijk} = 0.

Therefore, using Lemma 1 and Equation (8), we obtain

\partial_i \log\pi = -\frac{1}{2}\, g^{jk}\, m_{ijk} = -\frac{1}{2}\, g^{jk}\left(-\partial_i g_{jk} - \Gamma^{(e)}_{jk,i}\right) = \partial_i \log\sqrt{g} + \frac{1}{2}\, g^{jk}\, \Gamma^{(e)}_{jk,i}.

Since we may replace √g with π_J in the last expression, we obtain

\partial_i \log(\pi/\pi_J) - \frac{1}{2}\, g^{jk}\, \Gamma^{(e)}_{jk,i} = 0.

Remark 1. For the exponential family with the natural parameter θ, it is known that Γ^{(e)}_{jk,i} ≡ 0. When all connection coefficients vanish, the coordinate system is called affine; in this sense, the natural parameter is called the e-affine coordinate. From the above equation, in this parametrization, the moment matching prior agrees with the Jeffreys prior. However, if we begin with a different parametrization, then we obtain a prior that is different from the Jeffreys prior. As a specific example, let us consider the binomial model with success probability η (0 < η < 1) in Ghosh [12] (Section 5.2, p. 199). The moment matching prior for η is given by π_M(η) ∝ η^{−1}(1 − η)^{−1}. However, taking the natural parameter θ = log(η/(1 − η)), the moment matching prior for θ is given by the Jeffreys prior, π_J(θ) ∝ e^{θ/2}(1 + e^{θ})^{−1}. It is rewritten as π_J(η) ∝ η^{−1/2}(1 − η)^{−1/2}, which is different from π_M(η).
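The change of variables in Remark 1 can be checked mechanically; the following sympy sketch (using the binomial model above) pushes π_J(θ) back to the η-coordinate via Equation (5):

import sympy as sp

eta = sp.Symbol('eta', positive=True)
theta = sp.log(eta / (1 - eta))                       # natural parameter
pi_J_theta = sp.exp(theta / 2) / (1 + sp.exp(theta))  # Jeffreys prior in theta
jac = sp.diff(theta, eta)                             # |d theta / d eta|
pi_J_eta = pi_J_theta * jac                           # Equation (5)
ratio = pi_J_eta * sp.sqrt(eta * (1 - eta))           # compare with eta^(-1/2)(1-eta)^(-1/2)
print([sp.simplify(ratio.subs(eta, v)) for v in (sp.Rational(1, 4), sp.Rational(2, 3))])
# expected: [1, 1], so pi_J(eta) is proportional to eta^(-1/2) (1 - eta)^(-1/2)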

2.4.6. Chi-square prior

Liu et al. [13] developed an extension of the reference prior by replacing the KL-divergence in the original definition with the general α-divergence. In an exceptional case, we obtain a prior that is different from the Jeffreys prior. The PDE is given by

\partial_i \log\left(\frac{\pi(\theta)}{\pi_J(\theta)}\right) = -\frac{1}{4}\, T_i,    (9)

where T_i = T_{ijk} g^{jk} is a type (0, 1) tensor. Thus, the above PDE is a tensor equation. Its derivation and details are explained in the next section.

Definition 3 [Liu et al. [13]]. If the PDE (Equation 9) has a solution, then we call the solution the χ²-prior. We denote the χ²-prior as π_{χ²}.

As we will see later, π_{χ²} does not necessarily exist. However, the usual statistical models satisfy a necessary and sufficient condition for the existence of π_{χ²}, and this condition is invariant under coordinate transformation.
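For a concrete feel for Equation (9), consider again the Bernoulli model (our running illustrative example, not from the article). There T_1 = (1 − 2η)/(η(1 − η)), and the χ²-prior works out to π_{χ²}(η) ∝ (η(1 − η))^{−3/4}; the sketch below verifies that this density solves the PDE:

import sympy as sp

eta, x = sp.symbols('eta x', positive=True)
log_p = x * sp.log(eta) + (1 - x) * sp.log(1 - eta)
dl = sp.diff(log_p, eta)

def expect(f):
    return sp.simplify(f.subs(x, 1) * eta + f.subs(x, 0) * (1 - eta))

g = expect(dl**2)                       # Fisher metric
T1 = sp.simplify(expect(dl**3) / g)     # T_i = T_ijk g^{jk} (one dimension)
pi_J = sp.sqrt(g)                       # Jeffreys prior
cand = (eta * (1 - eta)) ** (-sp.Rational(3, 4))    # candidate chi-square prior
lhs = sp.diff(sp.log(cand / pi_J), eta)             # d log(pi/pi_J) / d eta
print(sp.simplify(lhs + T1 / 4))        # expected: 0, so Equation (9) holds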

3. Derivation of chi-square prior in terms of information geometry

Liu et al. [13] derived the PDE (Equation 9) that πχ2 should satisfy by considering the maximization of a functional of a prior π based on χ2-divergence. In the present section, we review their result and rewrite the functional in terms of information geometry. As a result, we obtain a more explicit form and a better interpretation of the maximization.

3.1. Extension of the reference prior

As the underlying principle, Bernardo [5] adopted the construction of a minimax code in information theory to derive noninformative priors: the noninformative prior is defined as the input source distribution that maximizes the mutual information between the parameter and the outcome. This prior is called (Bernardo's) reference prior. Under some conditions, his idea has been rigorously formulated by several authors [18, 19] (for a review, see, e.g., Berger et al. [6]).

Among the many studies and variants of reference priors, Liu et al. [13] recently adopted the α-divergence instead of the KL-divergence in Bernardo's argument and obtained a generalized result.

Definition 4. Let p(x) and q(x) be probability densities. For a fixed real parameter α, the α-divergence from p to q is defined as

D_\alpha(p; q) = \frac{1}{\alpha(1-\alpha)}\left\{1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx\right\}\ (\geq 0).    (10)

Remark 2. In the textbook on information geometry by Amari [20], the following parametrization is used because of the emphasis on the duality:

\tilde{D}_\beta(p; q) = \frac{4}{1-\beta^2}\left\{1 - \int q(x)^{\frac{1+\beta}{2}}\, p(x)^{\frac{1-\beta}{2}}\, dx\right\},    (11)

where we write β instead of α. We adopt the parametrization by α in Equation (10). For example, the χ²-divergence corresponds to α = −1 in Equation (10) and to β = 3 in Equation (11). More explicitly, the relation α = (1 − β)/2 (and thus 1 − α = (1 + β)/2) holds.

In the limit α → 0 or α → 1, the α-divergence reduces to the KL-divergence.
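A short numeric sketch of Definition 4 on a finite sample space (a simplification we adopt here) also exhibits the KL limit:

import numpy as np

def alpha_divergence(p, q, alpha):
    # D_alpha(p; q) of Equation (10) for discrete distributions p, q
    return (1.0 - np.sum(p**alpha * q**(1.0 - alpha))) / (alpha * (1.0 - alpha))

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))                    # KL-divergence
print(alpha_divergence(p, q, 1.0 - 1e-6), kl)     # nearly equal (alpha -> 1)
print(alpha_divergence(p, q, -1.0))               # chi-square case, alpha = -1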

Now, let us see the definition of the noninformative prior proposed by Liu et al. [13]. Under regularity conditions (e.g., the compactness of the parameter space Θ), they considered the maximization of the following functional of a prior density π as follows:

J[\pi] = \int E\left[D_\alpha(\pi(\cdot);\, \pi(\cdot \,|\, X)) \,\big|\, \theta\right] \pi(\theta)\, d\theta,    (12)

where E[·|θ] denotes expectation with respect to p(X|θ), and the expression emphasizes that the parameter θ is fixed in the integral. Under their criterion, the maximizer of J[π] is adopted as a noninformative prior.

Following Liu et al. [13], we rewrite the above functional (Equation 12) in a simpler form as follows:

\alpha(1-\alpha)\, J[\pi] = 1 - \int E\left[\pi(\theta \,|\, X)^{-\alpha} \,\big|\, \theta\right] \pi(\theta)^{\alpha+1}\, d\theta.

Depending on the sign of α(1 − α), our problem reduces to maximization or minimization of the expectation E[π(θ|X)^{−α} | θ]. Clearly, it cannot be solved explicitly in general. Thus, as usual, we approximate the expectation term under the assumption that X = (X_1, …, X_n) are i.i.d. observations from p(x|θ) with n → ∞.

3.2. Asymptotic expansion of the expectation term

Except for α = −1 (χ2-divergence), the maximization of J[π] reduces to that of the first-order term in the following expansion (Theorem 2), which yields the Jeffreys prior for −1 < α < 1. However, for χ2-divergence, we need to evaluate the second-order term since the first-order term is constant.

First, we present a key result in Liu et al. [13]. Some notation in their result follows ours. For example, the Fisher information matrix and its determinant are denoted as gij and g, respectively. The dimension of the parameter θ is denoted as p. Please refer to the original article for technical details.

Theorem 2. [Liu et al. [13], Theorem 3.1] The expectation term E[π(θ|X)−α| θ] in the functional J[π] can be rewritten as

E\left[\pi(\theta \,|\, X)^{-\alpha} \,\big|\, \theta\right] = (2\pi)^{p\alpha/2}\, n^{-\frac{p\alpha}{2}}\, g^{-\frac{\alpha}{2}}\, (1-\alpha)^{-\frac{p}{2}} \left[1 + \frac{1}{n}\left\{\cdots + s(\theta)\right\} + o(n^{-1})\right],

where the 1/n part in braces {⋯} is given by

\frac{\alpha}{1-\alpha}\, \partial_j g^{ij} \cdot \partial_i \log\pi - \frac{\alpha^2}{2(1-\alpha)}\, g^{ij}\, \frac{\partial_j g}{g}\, \partial_i \log\pi - \frac{\alpha^2}{2(1-\alpha)}\, m_{ijk}\, (\partial_l \log\pi)\, g^{ij} g^{kl} - \frac{\alpha}{2}\, g^{ij}\, \partial_i \log\pi \cdot \partial_j \log\pi + \frac{2\alpha - \alpha^2}{2(1-\alpha)}\, g^{ij}\, \frac{\partial_i \partial_j \pi}{\pi} + s(\theta).

The last term s(θ) does not include the prior density π.

From Theorem 2, for a positive constant C_n and sufficiently large n, the functional (Equation 12) is approximated by

\alpha(1-\alpha)\, J[\pi] \approx 1 - C_n \int g^{-\frac{\alpha}{2}}\, \pi^{\alpha+1}\, d\theta.

When −1 < α < 1, the maximization yields π ∝ g^{1/2}, that is, the Jeffreys prior. When α < −1, the Jeffreys prior instead minimizes the functional J[π].

However, at the boundary point α = −1 (χ2-divergence), the above first-order term becomes a constant independent of π. In this case, we need to evaluate the second-order term more carefully.

3.3. Rewriting Liu et al.'s Theorem 3.1 in geometrical terms

Now let us rewrite the second-order term of the asymptotic expansion in Theorem 2 in terms of information geometry. We fix α = −1, and from here on, consider only the case for χ2-divergence.

Although our approach differs from that in the original article, the final PDE agrees with their result. The difference and our contribution are discussed in the next subsection.

We summarize how we rewrite each term to obtain the final result (Theorem 3) later. First, we rewrite ijππ by using the following relation:

\frac{\partial_i \partial_j \pi}{\pi} = \partial_i \partial_j \log\pi + \partial_i \log\pi \cdot \partial_j \log\pi.

After that, we replace the prior density π with the density ratio h = π/π_J, where π_J = √g. The terms including the prior density π and its derivatives are expected to be expressible through the scalar function log h. Indeed, this expectation is correct, and we obtain the final form after a tedious, lengthy, but straightforward calculation. Because we use integration by parts in transforming the original form of the asymptotic expansion, the integral symbol remains in the expression below.

Theorem 3 [Corollary of Liu et al. [13], Theorem 3.1].

\int E\left[\pi(\theta \,|\, X) \,\big|\, \theta\right] d\theta = (2\pi)^{-p/2}\, n^{\frac{p}{2}}\, 2^{-\frac{p}{2}} \left[\int \sqrt{g}\, d\theta + \frac{1}{n} \int \left\{\cdots + s(\theta)\right\} \sqrt{g}\, d\theta + o(n^{-1})\right],

where the 1/n part is given by

\frac{1}{n} \int \left\{-\frac{1}{4}\left\| d\log h + \frac{T}{4} \right\|^2 + \frac{1}{4}\left\| d\log \pi_J - \frac{T}{4} \right\|^2 + s(\theta)\right\} \sqrt{g}\, d\theta,    (13)

in which we set T := T_i dθ^i, and the norm of a one-form A is defined as \|A\|^2 := A_i A_j g^{ij}.

The above one-form T is called the Tchebychev form in affine geometry [see, e.g., p. 58 in Simon et al. [21]].

From Theorem 3, maximizing J[π] over the set of all prior densities is equivalent to maximizing the above integral with respect to the scalar function h as n → ∞. Since the second and third terms inside the braces {⋯} in Equation (13) are independent of h, the expression achieves its maximum if the first term vanishes, that is,

d\log h + \frac{T}{4} = 0    (14)

holds. Thus, we obtain an equation of a differential one-form that determines the χ²-prior. In a given coordinate system, the component-wise form of Equation (14) is given by

\frac{\partial}{\partial\theta^i} \log h = -\frac{1}{4}\, T_i,

which agrees with the original PDE (Equation 9) derived in the previous study.

Finally, we discuss the existence of the χ²-prior. In general, the χ²-prior does not necessarily exist on a statistical model; its existence on a given model is equivalent to the existence of a solution of the PDE (Equation 9).

A solution of the PDE (Equation 9) exists if and only if T_i satisfies the following condition:

\partial_j T_i - \partial_i T_j = 0,    (15)

which is a well-known integrability condition. More compactly, we may write dT = 0.
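The condition (Equation 15) can be checked symbolically. The sketch below does so for the normal model N(μ, σ²) in the coordinates (μ, σ), a standard example of our own choosing:

import sympy as sp
from sympy.stats import Normal, E, density

mu = sp.Symbol('mu', real=True)
sigma = sp.Symbol('sigma', positive=True)
x = sp.Symbol('x', real=True)
X = Normal('X', mu, sigma)

log_p = sp.log(density(X)(x))
dl = [sp.diff(log_p, t) for t in (mu, sigma)]

def ex(f):
    return sp.simplify(E(f.subs(x, X)))   # expectation under N(mu, sigma^2)

g = sp.Matrix(2, 2, lambda i, j: ex(dl[i] * dl[j]))    # Fisher metric
ginv = g.inv()
T = [sp.simplify(sum(ex(dl[i] * dl[j] * dl[k]) * ginv[j, k]
                     for j in range(2) for k in range(2)))
     for i in range(2)]                                # T_i = T_ijk g^{jk}
print(T)                                               # expected: [0, 6/sigma]
print(sp.simplify(sp.diff(T[0], sigma) - sp.diff(T[1], mu)))   # expected: 0

Since T_μ = 0 and T_σ depends only on σ, the two-form dT vanishes, so a χ²-prior (and, as discussed in Section 4, every α-parallel prior) exists for this model.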

Somewhat surprisingly, the above condition (Equation 15) agrees with the condition that an α(≠0)-parallel prior exists [11]. This suggests a certain relationship between the χ²-prior and α-parallel priors. Indeed, this expectation is correct, and the χ²-prior is shown to be the 1/2-parallel prior, which is the theme of the next section.

3.4. Discussion

We here discuss the difference between the original result obtained by Liu et al. [13] and the present study.

First, the PDE they obtained for the χ²-prior is not in the form of a tensor equation. They gave a PDE for log π as follows:

\partial_i \log\pi = -\frac{1}{4}\, T_{ijk}\, g^{jk} + \frac{1}{2}\, g^{-1}\, \partial_i g,    (16)

instead of our PDE (Equation 9) [Liu et al. [13], p. 357, Equation (48)]. Neither side of Equation (16) is a tensor, i.e., invariant under coordinate transformation.

Second, although Liu et al. obtained the asymptotic expansion in Theorem 2 (Theorem 3.1 in the original article), their approach to deriving the PDE (Equation 16) is not sufficient. Strictly speaking, they only showed that π_{χ²} satisfying the PDE (Equation 16) achieves an extreme value asymptotically: they did not organize the messy terms and applied the variational method in an ad hoc manner to derive the PDE (Equation 16). Moreover, their approach does not exclude the possibility that this extremum is a minimum.

Our approach shows more directly that π_{χ²} satisfying the PDE (Equation 16) achieves the maximum of the functional asymptotically: completing the square for the one-form d log h, we show that π_{χ²} maximizes the functional J[π] as n → ∞.

In addition, our underlying philosophy is the invariance principle under coordinate transformation. Clearly, the expected χ²-divergence from a prior to its posterior is independent of parametrization. Thus, we naturally expect that the O(n^{−1}) term is also independent of parametrization, i.e., represented by geometrical quantities. As a result, we obtain the simpler expression (Equation 13) in Theorem 3. This is a good example of how organizing a problem from the viewpoint of information geometry can simplify various terms and make its structure easier to understand.

As for the derivation of fundamental PDEs, we point out a formal analogy between general relativity and our problem. Historically speaking, Hilbert showed that the Einstein equation is derived from the Einstein–Hilbert action integral S[g_{ab}], where g_{ab} is the pseudo-Riemannian metric on the spacetime manifold [see, e.g., Wald [22], Appendix E.1]. In our problem, we take the expected χ²-divergence from a prior to its posterior instead of S[g_{ab}]. The maximization of J[π] and the minimization of S[g_{ab}] yield the tensor equation (Equation 14) and the Einstein equation, respectively.

4. Relation between chi-square priors and alpha-parallel priors

In this section, we show that the χ²-prior is the 1/2-parallel prior, a special case of an α-parallel prior. As we shall see later, an α-parallel prior is defined through an α-parallel volume element and was proposed by Takeuchi and Amari [11]. Among several existence conditions for an α-parallel prior, we focus on the PDE for log π and rewrite it in terms of the log ratio log h.

In the exponential family, the χ²-prior and α-parallel priors were derived by two author groups, Liu et al. [13] and Takeuchi and Amari [11], respectively. We also generalize this result to γ-flat models.

4.1. Alpha-parallel priors

Takeuchi and Amari [11] introduced a family of geometric priors called α-parallel priors, which include the well-known Jeffreys prior and maximum likelihood (ML) prior [23]. We briefly review basic definitions and related results on α-parallel priors below.

4.1.1. Equiaffine connection

First, we recall the definition of equiaffine connection in affine geometry. Let us consider a p-dimensional orientable smooth manifold M with an affine connection ∇. We shall say that a torsion-free affine connection ∇ is equiaffine when there exists a parallel volume element, that is, a nonvanishing p-form ω such that ∇ω = 0.

One necessary and sufficient condition for ∇ to be equiaffine is

R_{ijk}{}^{k} = 0,    (17)

where R_{ijk}{}^{l} is the Riemann–Christoffel curvature tensor with respect to the connection ∇. The condition (Equation 17) is slightly weaker than the condition that an affine manifold is flat, R_{ijk}{}^{l} = 0.

4.1.2. Definition of alpha-parallel prior

Here, we develop the aforementioned argument for statistical models. Since statistical models have a family of affine connections in a natural manner, we expect that the condition of being equiaffine is obtained as a property of the model manifold rather than of an individual affine connection.

Let a p-dimensional statistical model manifold M be given. We assume that it is covered by a single coordinate system, say θ ∈ Θ ⊆ R^p, and that it is orientable and simply connected.

Definition 5. Suppose that there exists a parallel volume element ω for a fixed α, i.e., ∇^{(α)}ω = 0. Then, in a coordinate system, say θ = (θ^1, …, θ^p), the α-parallel volume element ω is represented as

\omega \propto \pi(\theta)\, d\theta^1 \wedge \cdots \wedge d\theta^p,

where π is a nonnegative function over the parameter space Θ. We call π an α-parallel prior.

Some examples of α-parallel priors are as follows: when α = 1, the 1-parallel prior (also called the e-parallel prior) is the so-called ML prior proposed by Hartigan [23]; when α = 0, the 0-parallel prior is the Jeffreys prior. As we shall see later, the 0-parallel prior is exceptional and always exists on a statistical model. Indeed, the 0-parallel volume element, √g dθ^1 ∧ ⋯ ∧ dθ^p, is known as the invariant volume element on a Riemannian manifold (M, g_{ij}) with the Levi-Civita connection (the 0-connection in information geometry). Note that an α-parallel prior could be an improper prior. For other properties of α-parallel priors, see Takeuchi and Amari [11].

4.1.3. Existence conditions of alpha-parallel prior

In statistical models, we obtain a deeper result for the existence of an α-parallel prior. First, we note that the relation

R^{(\alpha)}_{ijk}{}^{k} = \frac{\alpha}{2}\left(\partial_i T_j - \partial_j T_i\right)    (18)

holds for every α. From the necessary and sufficient condition for the existence of an α-parallel prior (Equation 17), we find that the 0-parallel prior (α = 0) always exists. For α ≠ 0, we introduce the concept of a statistically equiaffine model.

Definition 6. A statistical model manifold M is said to be statistically equiaffine [11], when the cubic tensor Tijk satisfies the following condition:

\partial_j T_i - \partial_i T_j = 0.    (19)

Observing the existence condition for α-parallel prior (Equation 17) and the relation (Equation 18), we easily obtain the following theorem.

Theorem 4. For a statistical model manifold M, the following conditions are equivalent:

(a) an α-parallel volume element exists for all α,

(b) an α-parallel volume element exists for some α (≠ 0),

(c) M is statistically equiaffine.

Note that the apparently weaker condition (b) implies the stronger conditions (a) and (c). The usual statistical models have been shown to be statistically equiaffine [11]. An important statistical model that is not statistically equiaffine is the ARMA model in time series analysis [24].

4.2. Chi-square prior is the half-parallel prior

Now let us consider a relation between α-parallel prior and χ2-prior. To compare them, we focus on the following PDE for an α-parallel prior:

\partial_i \log\pi = g^{jk}\, \Gamma^{(\alpha)}_{ij,k}    (20)

[Takeuchi and Amari [11], Proposition 1, p. 1016, Equation (7)].

Since neither side of the PDE (Equation 20) is a tensor, its invariance under coordinate transformation is not clear. Thus, we introduce a one-form (a geometrical quantity) derived from the scalar function h = π/π_J and modify the equation.

Theorem 5. The above PDE (Equation 20) is equivalent to the following tensor equation:

\partial_i \log h = -\frac{\alpha}{2}\, T_i.    (21)

When we set T = T_i dθ^i, the above Equation (21) can be rewritten as

d\log h + \frac{\alpha}{2}\, T = 0,

which is the equation of a differential one-form.

Proof. Using Equation (8), we rewrite the PDE (Equation 20) as follows:

\partial_i \log h = \left(-\frac{1}{2}\, \partial_i g_{jk} + \Gamma^{(\alpha)}_{ij,k}\right) g^{jk}.    (22)

Then, using Lemma 2, we transform the RHS of Equation (22):

\left(-\frac{1}{2}\, \partial_i g_{jk} + \Gamma^{(\alpha)}_{ij,k}\right) g^{jk} = \left(-\frac{1}{2}\, \Gamma^{(\alpha)}_{ij,k} - \frac{1}{2}\, \Gamma^{(-\alpha)}_{ik,j} + \Gamma^{(\alpha)}_{ij,k}\right) g^{jk} = \left(-\frac{1}{2}\, \Gamma^{(-\alpha)}_{ik,j} + \frac{1}{2}\, \Gamma^{(\alpha)}_{ki,j}\right) g^{jk} = -\frac{\alpha}{2}\, T_{ikj}\, g^{jk} = -\frac{\alpha}{2}\, T_i.

Surprisingly, the PDE defining π_{χ²} (Equation 9) agrees with Equation (21) for α = 1/2. Thus, the χ²-prior derived by Liu et al. [13] is the 1/2-parallel prior. This finding is interesting in two ways.

First, for Bayesian statistics, it is a new example where the formulation in terms of information geometry is useful in research on noninformative priors [for several examples, see Komaki [7] and Tanaka and Komaki [9]]. Liu et al. [13] derived the PDE (Equation 9) by considering one extension of the reference prior with the χ²-divergence. Their starting point is completely independent of the geometry of statistical models. In spite of this, the χ²-prior has a good geometrical interpretation: it is the volume element invariant under parallel transport with respect to the 1/2-connection.

Second, for information geometry, it would be the first specific example where the 1/2-connection alone makes sense in statistical applications. In information geometry, the meaning of each α-connection has not been clarified sufficiently except for specific values (α = 0, ±1). In Takeuchi and Amari [11], α-parallel priors were not proposed as noninformative priors; rather, they regarded the Jeffreys prior as the 0-parallel prior and extended it to every α. Except for α = 0, only the 1/2-parallel prior is interpreted as a noninformative prior.

4.3. General form of alpha-parallel priors in statistically equiaffine models

Let us derive a general form of α-parallel priors in statistically equiaffine models. In the following, we denote an α-parallel prior as π_α. For example, π_{1/2} = π_{χ²} and π_0 = π_J.

First, we briefly review some formulas for α-parallel priors derived by several authors [11, 25]. According to Matsuzoe et al. [25], on a statistically equiaffine model manifold there exists a scalar function φ that satisfies T_a = ∂_a φ. Therefore, using this function φ, a general solution of the PDE (Equation 21) is given by h(θ) ∝ exp{−(α/2)φ(θ)}. Thus, we obtain the α-parallel prior π_α in the following form:

\pi_\alpha(\theta) \propto \exp\left\{-\frac{\alpha}{2}\, \varphi(\theta)\right\} \pi_J(\theta).

For the exponential family (an e-flat model), Takeuchi and Amari showed that α-parallel priors can be represented as a power of the Jeffreys prior π_J for every α [Takeuchi and Amari [11], Example 2, p. 1017].

Here, we extend their result for the exponential family to γ-flat models (γ ≠ 1). We use the parameter γ instead of α because the two parameters may differ.

Theorem 6. Let γ ≠ 0. Suppose that a statistical model manifold (M, g, ∇^{(γ)}, ∇^{(−γ)}) is γ-flat. Then, there exists an α-parallel prior π_α for every α. In a γ-affine coordinate system {θ^i}, it is written as a power of the Jeffreys prior π_J, that is,

\pi_\alpha(\theta) \propto \pi_J(\theta)^{1 - \frac{\alpha}{\gamma}}

holds.

Proof. Since the model is γ-flat, we can take a γ-affine coordinate system, say {θ^i}. Then Γ^{(γ)}_{ij,k} = 0, and from Equation (3) in Lemma 2,

T_{ijk} = \frac{1}{\gamma}\, \Gamma^{(-\gamma)}_{ij,k}    (23)

holds.

On the other hand, from a result in Amari and Nagaoka [14], Section 3.3, there exists a scalar function ψ(θ) such that

g_{ij} = \frac{\partial^2 \psi}{\partial\theta^i \partial\theta^j}, \qquad \Gamma^{(-\gamma)}_{ij,k} = \frac{\partial^3 \psi}{\partial\theta^i \partial\theta^j \partial\theta^k}

for the γ-affine coordinates {θi}. This implies that

\Gamma^{(-\gamma)}_{ij,k} = \partial_k g_{ij} = \frac{\partial^3 \psi}{\partial\theta^i \partial\theta^j \partial\theta^k}.    (24)

Therefore, using Equations (23) and (24), we can rewrite T_i as follows:

T_i = g^{jk}\, T_{ijk} = \frac{1}{\gamma}\, g^{jk}\, \Gamma^{(-\gamma)}_{ij,k} = \frac{1}{\gamma}\, g^{jk}\, \partial_i g_{jk} = \frac{2}{\gamma}\, \frac{\partial}{\partial\theta^i} \log\sqrt{g}.

Clearly, T satisfies the condition (Equation 19), and thus the model is statistically equiaffine. Therefore, Theorem 4 implies that there exists an α-parallel prior π_α for every α.

Now, let us obtain an explicit form by using π_J. Substituting the above T_i into the RHS of Equation (21), we have

\frac{\partial}{\partial\theta^i} \log h = -\frac{\alpha}{\gamma}\, \frac{\partial}{\partial\theta^i} \log \pi_J,

and thus, we obtain

h(\theta) = \frac{\pi_\alpha(\theta)}{\pi_J(\theta)} \propto \pi_J(\theta)^{-\frac{\alpha}{\gamma}}.

In particular, we get

\pi_\alpha(\theta) \propto \pi_J(\theta)^{1 - \frac{\alpha}{\gamma}}.

Note that π_α equals a power of the Jeffreys prior only in the γ-affine coordinate system {θ^i}. Since the above expression is not invariant under coordinate transformation, the Jacobian must be taken into account in any other coordinate system.

Theorem 6 includes previous results. Liu et al. [13], Example 1, corresponds to the case when α = 1/2 and γ = 1. Takeuchi and Amari [11], Example 2 (p. 1017), corresponds to the case when γ = 1.
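As a quick consistency check connecting Theorem 6 with Section 3, the sketch below takes the binomial model (our running illustrative example, with γ = 1 and α = 1/2), forms π_J(θ)^{1/2} in the natural parameter, and transforms it back to η; the result matches the χ²-prior (η(1 − η))^{−3/4} found earlier:

import sympy as sp

eta = sp.Symbol('eta', positive=True)
theta = sp.log(eta / (1 - eta))                       # e-affine (natural) coordinate
pi_J_theta = sp.exp(theta / 2) / (1 + sp.exp(theta))  # Jeffreys prior in theta
pi_half_theta = pi_J_theta ** sp.Rational(1, 2)       # Theorem 6: pi_J^(1 - 1/2)
pi_half_eta = pi_half_theta * sp.diff(theta, eta)     # back to eta via Equation (5)
target = (eta * (1 - eta)) ** (-sp.Rational(3, 4))    # chi-square prior in eta
ratio = pi_half_eta / target
print([sp.simplify(ratio.subs(eta, v)) for v in (sp.Rational(1, 4), sp.Rational(2, 3))])
# expected: [1, 1] -- the two priors agree up to normalization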

5. Conclusion

In the present study, we investigated the derivation of the χ²-prior by Liu et al. [13] from the viewpoint of information geometry. We showed that the χ²-prior agrees with the 1/2-parallel prior (the α-parallel prior with α = 1/2), which gives it a geometrical interpretation. In addition, in our formulation, using the log ratio log π/π_J, which is invariant under reparametrization, simplifies the PDE defining a noninformative prior π in Bayesian analysis.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author.

Author contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Funding

This study was supported by JSPS KAKENHI Grant Number 19K11860.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

1. Bernardo JM. Reference analysis. In: Dey KK, Rao CR, editors. Handbook of Statistics, Vol. 25. Amsterdam: Elsevier (2005). p. 17–90. doi: 10.1016/S0169-7161(05)25002-2

2. Berger J. Statistical Decision Theory and Bayesian Analysis. New York, NY: Springer (1985). doi: 10.1007/978-1-4757-4286-2

3. Jeffreys H. Theory of Probability. Oxford: Oxford University Press (1961).

4. Robert CP. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York, NY: Springer (2001).

5. Bernardo JM. Reference posterior distributions for Bayesian inference. J R Stat Soc B. (1979) 41:113. doi: 10.1111/j.2517-6161.1979.tb01066.x

6. Berger JO, Bernardo JM, Sun D. The formal definition of reference priors. Ann Stat. (2009) 37:905–38. doi: 10.1214/07-AOS587

7. Komaki F. Shrinkage priors for Bayesian prediction. Ann Stat. (2006) 34:808–19. doi: 10.1214/009053606000000010

8. Komaki F. Bayesian predictive densities based on latent information priors. J Stat Plan Inference. (2011) 141:3705–15. doi: 10.1016/j.jspi.2011.06.009

9. Tanaka F, Komaki F. Asymptotic expansion of the risk difference of the Bayesian spectral density in the autoregressive moving average model. Sankhya Ser A. (2011) 73:162–84. doi: 10.1007/s13171-011-0005-1

10. Tanaka F. Superharmonic priors for autoregressive models. Inf Geom. (2018) 1:215–35. doi: 10.1007/s41884-017-0001-1

11. Takeuchi J, Amari S. Alpha-parallel prior and its properties. IEEE Trans Inf Theory. (2005) 51:1011–23. doi: 10.1109/TIT.2004.842703

12. Ghosh M. Objective priors: an introduction for frequentists. Stat Sci. (2011) 26:187–202. doi: 10.1214/10-STS338

13. Liu R, Chakrabarti A, Samanta T, Ghosh JK, Ghosh M. On divergence measures leading to Jeffreys and other reference priors. Bayesian Anal. (2014) 9:331–70. doi: 10.1214/14-BA862

14. Amari S, Nagaoka H. Methods of Information Geometry. Oxford: AMS (2000).

15. Amari S. Information Geometry and Its Applications. Tokyo: Springer-Verlag (2016). doi: 10.1007/978-4-431-55978-8

16. Kobayashi S, Nomizu K. Foundations of Differential Geometry I. New York, NY: Wiley (1969).

17. Ghosh M, Liu R. Moment matching priors. Sankhya Ser A. (2011) 73:185–201. doi: 10.1007/s13171-011-0012-2

18. Clarke BS, Barron AR. Information-theoretic asymptotics of Bayes methods. IEEE Trans Inf Theory. (1990) 36:453–71. doi: 10.1109/18.54897

19. Clarke BS, Barron AR. Jeffreys' prior is asymptotically least favorable under entropy risk. J Stat Plan Inference. (1994) 41:37–60. doi: 10.1016/0378-3758(94)90153-8

20. Amari S. Differential Geometrical Methods in Statistics. New York, NY: Springer-Verlag (1985). doi: 10.1007/978-1-4612-5056-2

21. Simon U, Schwenk-Schellschmidt A, Viesel H. Introduction to the Affine Differential Geometry of Hypersurfaces. Tokyo: Lecture Notes of the Science University of Tokyo (1991).

22. Wald RM. General Relativity. Chicago, IL: The University of Chicago Press (1984). doi: 10.7208/chicago/9780226870373.001.0001

23. Hartigan JA. The maximum likelihood prior. Ann Stat. (1998) 26:2083–103. doi: 10.1214/aos/1024691462

24. Tanaka F. Curvature form on statistical model manifolds and its application to Bayesian analysis. J Stat Appl Probab. (2012) 1:35–43. doi: 10.12785/jsap/010105

25. Matsuzoe H, Takeuchi J, Amari S. Equiaffine structures on statistical manifolds and Bayesian statistics. Differ Geom Appl. (2006) 24:567–78. doi: 10.1016/j.difgeo.2006.02.003

Keywords: noninformative prior, Jeffreys prior, reference prior, alpha-parallel prior, objective prior, chi-square divergence

Citation: Tanaka F (2023) Geometric properties of noninformative priors based on the chi-square divergence. Front. Appl. Math. Stat. 9:1141976. doi: 10.3389/fams.2023.1141976

Received: 11 January 2023; Accepted: 13 February 2023;
Published: 08 March 2023.

Edited by:

Jun Suzuki, The University of Electro-Communications, Japan

Reviewed by:

Hideitsu Hino, Institute of Statistical Mathematics (ISM), Japan
Takafumi Kanamori, Tokyo Institute of Technology, Japan

Copyright © 2023 Tanaka. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Fuyuhiko Tanaka, ftanaka.celas@osaka-u.ac.jp
