This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Information geometry and Markov chains are two powerful tools used in modern fields such as finance, physics, computer science, and epidemiology. In this survey, we explore their intersection, focusing on the theoretical framework. We attempt to provide a self-contained treatment of the foundations without requiring a solid background in differential geometry. We present the core concepts of the information geometry of Markov chains, including information projections and the pivotal information-geometric construction of Nagaoka. We then delve into recent advances in the field, such as the geometric structures arising from time reversibility, lumpability of Markov chains, or tree models. Finally, we highlight practical applications of this framework, such as parameter estimation, hypothesis testing, large deviation theory, and the maximum entropy principle.
Markov chains are stochastic models that describe the probabilistic evolution of a system over time and have been successfully used in a wide variety of fields, including physics, engineering, and computer science. Conversely, information geometry is a mathematical framework that provides a geometric interpretation of probability distributions and their properties, with applications in diverse areas such as statistics, machine learning, and neuroscience. By combining the insights and methods from both fields, researchers have, in recent years, developed novel approaches for analyzing and modeling systems with time dependencies.
As the fields of information geometry and Markov chains are broad, it is not possible to review every topic exhaustively, and we have confined the scope of this survey to certain foundational topics. Our focus is on time-discrete, time-homogeneous Markov chains taking values in a finite alphabet. In particular, we will not cover time-continuous Markov chains [
This survey is structured into five sections.
In
In
In
Let
A time-discrete, time-homogeneous Markov chain is a random process
It will also be convenient to define
In particular,
We refer the reader to Levin et al. [
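As a minimal illustration (not part of the survey's formal development), a time-discrete, time-homogeneous chain over a finite alphabet can be simulated directly from its transition matrix; the matrix `P`, initial distribution `mu`, and trajectory length below are toy values chosen for the sketch:

```python
import random

def simulate_chain(P, mu, n, seed=0):
    """Draw a length-n trajectory: X_0 ~ mu, then X_{t+1} ~ P[X_t, .]."""
    rng = random.Random(seed)
    d = len(P)
    x = rng.choices(range(d), weights=mu)[0]        # sample X_0
    traj = [x]
    for _ in range(n - 1):
        x = rng.choices(range(d), weights=P[x])[0]  # one transition step
        traj.append(x)
    return traj

P = [[0.9, 0.1],
     [0.2, 0.8]]   # row-stochastic transition matrix (toy values)
mu = [0.5, 0.5]    # initial distribution
print(simulate_chain(P, mu, 10))
```

Time-homogeneity is reflected in the fact that the same matrix `P` is used at every step.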
Let us first recall the definition of the Shannon entropy of a random variable. We let
For two random variables
Extending the aforementioned definition to Markov processes, the information divergence rate [
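For irreducible finite chains, the divergence rate admits the standard closed form D(P‖Q) = Σ_x π(x) Σ_y P(x,y) log(P(x,y)/Q(x,y)), where π is the stationary distribution of P. A numerical sketch under this assumption (toy matrices; π computed by pure-Python power iteration; Q is assumed positive wherever P is):

```python
import math

def stationary(P, iters=5000):
    """Stationary distribution of P by power iteration (assumes ergodicity)."""
    d = len(P)
    pi = [1.0 / d] * d
    for _ in range(iters):
        pi = [sum(pi[x] * P[x][y] for x in range(d)) for y in range(d)]
    return pi

def divergence_rate(P, Q):
    """D(P || Q) = sum_x pi(x) sum_y P(x,y) log(P(x,y) / Q(x,y))."""
    pi = stationary(P)
    d = len(P)
    return sum(pi[x] * P[x][y] * math.log(P[x][y] / Q[x][y])
               for x in range(d) for y in range(d) if P[x][y] > 0)

P = [[0.9, 0.1], [0.2, 0.8]]
Q = [[0.5, 0.5], [0.5, 0.5]]
print(divergence_rate(P, Q))   # strictly positive since P != Q
print(divergence_rate(P, P))   # 0.0: the rate vanishes when P = Q
```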
We briefly introduce basic concepts related to information geometry in the context of distributions. The central idea is to regard
In addition to
In the parametrization
As a consequence, the curvature tensors associated with ∇^{(e)}, ∇^{(m)} vanish simultaneously. In particular, they vanish for
Similar to the distributional setting, we regard
Our first order of business is to establish a dually flat structure on the set of stochastic matrices, following Nagaoka [
Recall the definition of the information divergence from one stochastic matrix
(i)
(ii)
(iii)
(iv)
We call
From any divergence function
As the metric and connections are derived from the KL divergence, they all depend solely on the transition matrices and are, in particular, agnostic of initial distributions. A direct calculation yields the Fisher metric [
On the one hand, the metric encodes notions of distance and angles on the manifold. In particular, the information divergence
Consider two curves
Recall from
Similar to the distribution setting, we proceed to define exponential families (e-families) and mixture families (m-families) of stochastic matrices.
(e-family of stochastic matrices [
Note that
The basis is given by
We can alternatively define e-families as e-autoparallel submanifolds of
We define the set of functions [
([
As a corollary [
In the stochastic matrix setting, the notion of a mixture family is naturally defined in terms of edge measures.
(m-family of stochastic matrices [
It is easy to verify that
For an exponential family
(i)
(ii)
(iii)
(iv)
(v)
Defining the Shannon negentropy
Natural and expectation parametrizations of an e-family
A straightforward computation shows that all the e-connection coefficients
An affine connection ∇ defines a notion of the straightness of curves. Namely, a curve
e-hull
(Exponential hull [
(Mixture hull [
When a family
The projection of a point onto a surface is among the most natural geometric concepts. In Euclidean geometry, projecting on a connected convex body leads to a unique closest solution point. However, the dually flat geometry on
For a continuously differentiable and strictly convex function
Geometrical interpretation of a Bregman divergence.
When we let
As
One may naturally wonder whether it is possible to recover the divergence
Geodesic convexity is a natural generalization of convexity in Euclidean geometry for subsets of Riemannian manifolds and functions defined on them. As straight lines are defined with respect to an affine connection ∇, a subset
However, for
Unlike in the distribution setting, where the KL divergence is jointly m-convex, this property does not hold for stochastic matrices [
In the more familiar Euclidean geometry, projecting a point
For a point
(Pythagorean inequalities for geodesically e-convex [
(i)
(ii)
(iii)
(iv)
Inequalities become equalities when projecting onto e-families and m-families.
(Pythagorean theorem for e-families, m-families [
(i)
(ii)
(iii)
(iv)
The construction of the conjugate connection manifold from a general contrast function in
The idea of tilting or exponential change of measure, which gives rise to e-families in the context of distributions, can be traced back to Miller [
Some alternative definitions of exponential families of Markov chains include [
One area of recent progress has been the analysis of the geometric properties of significant submanifolds of
In this section, we briefly survey known geometric properties of notable submanifolds of
Geometry of submanifolds of irreducible Markov kernels for
Manifold  m-family  e-family  Dimension
  Yes  Yes
  Yes  Yes
  Yes  Yes
  Yes  No
  Yes  No
  No  Yes

We say that a stochastic matrix
([
(i)
(ii)
Recall the parametrization of Ito and Amari [
Bistochastic matrices, also called doubly stochastic matrices, are both row- and column-stochastic. In other words,
(i)
(ii)
A symmetric stochastic matrix
([
(i)
(ii)
In
Consider a Markov chain
We write
Time-reversibility is a central concept across a myriad of scientific fields, from computer science (queuing networks [
Reversible Markov chains enjoy a particularly rich mathematical structure. Perhaps first and foremost, reversibility implies self-adjointness of
The time reversal operation is known to preserve some geometric properties of families of transition matrices. Consider
([
Moreover, the time reversal operation leaves the divergence between stochastic matrices unchanged [80, Proof of Proposition 2]:
When
([
The class of functions
It is possible to verify that
([
([
Let
There are known closed-form expressions for
Information projections onto
([
Furthermore, the following bisection property holds
Finally, we mention that the entropy production
It is known that the set of bistochastic matrices—also known as the Birkhoff polytope—is the convex hull of the set of permutation matrices (theorem of Birkhoff and von Neumann [
([
(i)
(ii)
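The Birkhoff–von Neumann theorem mentioned above admits a simple constructive sketch: greedily extract permutation matrices supported on the positive entries and subtract them off. The helper names and the example matrix below are illustrative, not from the survey; the small `eps` guards against floating-point dust:

```python
def find_permutation(B, eps=1e-12):
    """Backtracking search for a permutation sigma with B[i][sigma(i)] > eps."""
    d = len(B)
    def rec(i, used, sigma):
        if i == d:
            return sigma
        for j in range(d):
            if j not in used and B[i][j] > eps:
                out = rec(i + 1, used | {j}, sigma + [j])
                if out is not None:
                    return out
        return None
    return rec(0, frozenset(), [])

def birkhoff(B, eps=1e-12):
    """Return [(weight, sigma), ...] with sum_k weight_k * Perm(sigma_k) = B."""
    B = [row[:] for row in B]      # work on a copy
    decomposition = []
    while True:
        sigma = find_permutation(B, eps)
        if sigma is None:
            break
        theta = min(B[i][sigma[i]] for i in range(len(B)))
        decomposition.append((theta, sigma))
        for i in range(len(B)):
            B[i][sigma[i]] -= theta
    return decomposition

B = [[0.5,  0.5,  0.0],
     [0.25, 0.25, 0.5],
     [0.25, 0.25, 0.5]]           # a toy bistochastic matrix
for w, s in birkhoff(B):
    print(w, s)
```

By Hall's theorem, the support of a bistochastic matrix always contains a permutation, so the greedy loop terminates with weights summing to one.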
In the context of distributions, Čencov [
We briefly recall lumpability in the context of distributions and data processing. Consider a distribution
When
Crucially, in the independent and identically distributed setting, the lumping operation can be understood both as a form of processing of the stream of observations and as an algebraic manipulation of the distribution that generated the random process.
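A small numerical sketch of these two viewpoints (the alphabet, distribution, and lumping map below are toy assumptions): pushing the distribution through the lumping map algebraically agrees with lumping each observation in an i.i.d. data stream.

```python
import random
from collections import Counter

def lump_distribution(p, f, k):
    """Algebraic view: q(j) = sum of p(x) over the preimage f^{-1}(j)."""
    q = [0.0] * k
    for x, px in enumerate(p):
        q[f[x]] += px
    return q

p = [0.5, 0.3, 0.2]   # distribution on {0, 1, 2}
f = [0, 1, 1]         # lumping map merging states 1 and 2
print(lump_distribution(p, f, 2))   # [0.5, 0.5]

# Processing view: lump an i.i.d. sample from p observation by observation;
# the empirical frequencies approach the lumped distribution.
rng = random.Random(0)
sample = rng.choices(range(3), weights=p, k=100_000)
counts = Counter(f[x] for x in sample)
print(counts[0] / len(sample))      # close to 0.5
```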
For Markov chains, the concept of lumpability is vastly richer. The first fact one must come to terms with is that a Markov chain may lose its Markov property after a processing operation on the data stream [
([
The subset of
Embeddings of stochastic matrices that correspond to conditional models were proposed and analyzed in [
A.1 Morphisms should preserve the Markov property.
A.2 Morphisms should be expressible as algebraic operations on stochastic matrices.
A.3 Morphisms should have operational meaning on trajectories of observations.
The following definition of a Markov morphism was proposed in [
(Markov morphism for stochastic matrices [
The constraints on the function Λ in
As a consequence, the Fisher metric and affine connections are preserved [
Markov morphisms (
However, they are not m-geodesic affine, which means that generally
A more restricted class of embeddings, termed memoryless embeddings, preserves m-geodesics [
(
Lumpable functions
(Linear congruent embedding).
(Characterization of Markov morphisms as congruent linear embeddings).
(i)
(ii)
As Markov morphisms and linear congruent embeddings can be identified, it will be convenient to refer to them simply as Markov embeddings. We proceed to give two examples of embeddings.
Let
Suppose a given stochastic matrix
There is generally no left inverse for a lumping map
For fixed
Stated more explicitly,
It is not hard to show that the submanifold
Mutually dual foliated structure on
([
The following Pythagorean identity [
For a finite alphabet
For a string
(Tree model).
The tree model is a wellstudied model of Markov sources in the context of data compression [
(Finite State Machine X (FSMX) model).
Example of an FSMX tree (left) and a non-FSMX tree (right).
([
In this section, we give details of some application domains of the geometric perspective.
Recall that the maximum entropy probability distribution over a fixed alphabet
It is known [
In other words, the e-projection onto
The topic of large deviation theory is the study of the probabilities of rare events or fluctuations in stochastic systems, where the likelihood of these events occurring is exponentially small in the system parameters. In this context, we provide a concise overview of the classical asymptotic results and offer references to recent developments of finite sample upper bounds for the probability of large deviations. For
In the same spirit as the approach taken in the i.i.d. setting, we proceed with an exponential change of measure (also known as tilting or twisting) of
We denote by
The large deviation rate is given by the convex conjugate (Fenchel–Legendre dual) of the log-Perron–Frobenius eigenvalue of the matrix
([
([
Moulos and Anantharam [
([
Lastly, the subsequent uniform multiplicative ergodic theorem is known to hold.
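The tilting machinery sketched above can be illustrated numerically: form the tilted matrix P_t(x,y) = P(x,y) e^{t f(x,y)}, compute the log of its Perron–Frobenius eigenvalue Λ(t) by power iteration, and approximate the rate function as the Fenchel–Legendre dual of Λ over a grid of tilting parameters. All concrete values below are toy choices for the sketch:

```python
import math

def log_pf_eigenvalue(P, f, t, iters=2000):
    """log Perron-Frobenius eigenvalue of the tilted matrix, by power iteration."""
    d = len(P)
    M = [[P[x][y] * math.exp(t * f[x][y]) for y in range(d)] for x in range(d)]
    v = [1.0] * d
    lam = 1.0
    for _ in range(iters):
        w = [sum(M[x][y] * v[y] for y in range(d)) for x in range(d)]
        lam = max(w)                 # sup-norm estimate of the eigenvalue
        v = [wx / lam for wx in w]
    return math.log(lam)

def rate(P, f, a, t_grid):
    """I(a) = sup_t (t * a - Lambda(t)), approximated over a grid of t values."""
    return max(t * a - log_pf_eigenvalue(P, f, t) for t in t_grid)

P = [[0.9, 0.1], [0.2, 0.8]]
f = [[1.0, 0.0], [0.0, 0.0]]        # f(x, y) = 1 on the self-loop at state 0
t_grid = [i / 10 for i in range(-50, 51)]
print(rate(P, f, 0.9, t_grid))      # positive: 0.9 exceeds the stationary mean
```

Here the stationary mean of f is (2/3) × 0.9 = 0.6, so the rate vanishes at a = 0.6 and is strictly positive at a = 0.9, in accordance with the large deviation principle.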
([
For a more detailed exposition of the aforementioned results in a broader context, please refer to [
Let
The statistical behavior of
Although asymptotic analysis may be of mathematical interest, for modern tasks, it is crucial to have a finite sample theory that explains the behavior of the sample mean. With regard to the original bivariate function problem, the sample mean for a sliding window of pairs of observations can be defined as follows:
One can construct by exponential tilting the following one-dimensional parametric family of transition matrices:
Defining the asymptotic variance for the bivariate
Note that it coincides with the reciprocal of the Fisher information with respect to the expectation parameter; see Eq.
We let
We interpret
We write
Then, 1 −
(i)
(ii)
The Neyman–Pearson lemma asserts the existence of a test, which can be achieved through the likelihood ratio test.
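A hedged sketch of the likelihood ratio test for two Markov chain hypotheses, neglecting the initial distribution (whose effect is asymptotically negligible); the matrices, threshold, and trajectory length are illustrative assumptions:

```python
import math
import random

def log_likelihood_ratio(traj, P0, P1):
    """sum_t log( P1(x_t, x_{t+1}) / P0(x_t, x_{t+1}) ) over the trajectory."""
    return sum(math.log(P1[x][y] / P0[x][y]) for x, y in zip(traj, traj[1:]))

def neyman_pearson_test(traj, P0, P1, threshold=0.0):
    """Return True to accept H0 : P = P0, False to reject in favour of P1."""
    return log_likelihood_ratio(traj, P0, P1) < threshold

P0 = [[0.9, 0.1], [0.2, 0.8]]
P1 = [[0.5, 0.5], [0.5, 0.5]]

# Simulate a trajectory under the null hypothesis and apply the test.
rng = random.Random(1)
x, traj = 0, [0]
for _ in range(2000):
    x = rng.choices([0, 1], weights=P0[x])[0]
    traj.append(x)
print(neyman_pearson_test(traj, P0, P1))  # True with overwhelming probability
```

Under the null, the log-likelihood ratio drifts downward at the divergence rate −D(P0‖P1) per step, which is why the test accepts H0 on long trajectories.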
[
(i)
(a)
(b)
(ii)
If we ignore the effect of the initial distribution, which is negligible asymptotically, the Neyman–Pearson test accepts the null hypothesis if
Geometric interpretation of the Neyman–Pearson test as the orthogonal bisector to the e-geodesic passing through both the null and alternative hypotheses.
Note that the e-family
In fact, it can be proved that
Binary hypothesis testing is one of the wellstudied problems in information theory. The use of the Perron–Frobenius theory in this context can be traced back to the 1970s and 1980s [
GW drafted the initial version, which was subsequently reviewed and edited by both authors. All authors contributed to the article and approved the submitted version.
GW was supported by the Special Postdoctoral Researcher Program (SPDR) of RIKEN and the Japan Society for the Promotion of Science KAKENHI under Grant 23K13024. SW was supported in part by JSPS KAKENHI under Grant 20H02144.
The authors are thankful to the referees for their numerous comments, which helped improve the quality of this manuscript, and for bringing reference [
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
As is customary in the literature,
In our definition of m-family, we do not allow a redundant choice of
The reason for this name will become clear in
When discussing geodesic convexity in this section, we only consider the section of the geodesic joining the two points, achieved for parameter
For
Note that log
The fact that the second derivative of
Note that Nakagawa and Kanaya [