Coming Together of Bayesian Inference and Skew Spherical Data

This paper presents Bayesian directional data modeling via the skew-rotationally-symmetric Fisher-von Mises-Langevin (FvML) distribution. The prior distributions for the parameters are a pivotal building block in Bayesian analysis, therefore, the impact of the proposed priors will be quantified using the Wasserstein Impact Measure (WIM) to guide the practitioner in the implementation process. For the computation of the posterior, modifications of Gibbs and slice samplings are applied for generating samples. We demonstrate the applicability of our contribution via synthetic and real data analyses. Our investigation paves the way for Bayesian analysis of skew circular and spherical data.


INTRODUCTION
Big and complex data sets are collected from various scientific fields such as atmospheric environment, social science, psychological and biomedical studies, bioinformatics, epidemiology, digital imaging information, and machine learning, to name just a few. Big Data can refer to data with big volume or velocity, high-dimensional data (Ahmed, 2017), unstructured or unusual data (Härdle et al., 2018), complex data, etc. Therefore, there is a need for developing statistical techniques other than the traditional analytical frameworks to model, interpret, and use such data in different fields of science. Data with directions are categorized as unusual data that cannot be analyzed and modeled under the Cartesian coordinate system. With the aid of directional statistics, data science meets another level of analytical methods. In this research work, we focus on the analysis of complex directional data. Bayesian methods have received extensive attention in data science because prior information can be added to enhance modeling. Therefore, here, we consider Bayesian analysis of complex directional data.
A robust roadmap with the symmetric Fisher-von Mises-Langevin (FvML) distribution as the key element from a Bayesian perspective is briefly reviewed. According to Kikuchi's collection of directional data (Kikuchi, 1982), the first attention paid to Bayesian methods for directional data was in a paper by Mardia and El-Atoum (1976). They used the Bayesian approach to estimate the location parameter of the FvML distribution when the concentration parameter was known. The author Bagchi made several contributions in this area: (i) Bagchi (1988) formulated a conjugate prior for the mean direction and a non-informative prior for the concentration parameter of the von Mises distribution; (ii) Bagchi and Guttman (1988) focused on Bayesian inference for the multi-variate FvML distribution; (iii) Bagchi and Kadane (1991) derived the Bayes estimate for the cosine of the direction parameter of the von Mises distribution when the concentration parameter is known; and (iv) Bagchi (1994) developed empirical Bayesian techniques to estimate the mean direction of the FvML distribution (see also Guttorp and Lockhart, 1988;Dowe et al., 1996). Damien and Walker (1999) presented a full Bayesian inference for the von Mises distribution implementing a Gibbs sampler while (Rodrigues et al., 2000) provided an empirical or approximate Bayesian inference for the von Mises distribution. Nuñez-Antonio and Gutiérrez-Peña (2005) presented Bayesian analysis of the FvML distribution when all of the parameters were unknown, as well as an algorithm to generate samples from the posterior distribution based on a sampling-importance-resampling method. Muralidharan and Parikh (2007) provided Bayes estimates for both location and concentration parameters of the von Mises distribution. Bhattacharya and SenGupta (2009) presented Bayesian analysis of a generalized von Mises distribution introducing a new algorithm based on importance sampling and Markov chain Monte Carlo (MCMC) to draw samples from the posterior distribution. Mardia (2010) moved the attention to the bivariate von Mises distribution (on the torus) from a Bayesian viewpoint. Infinite mixtures of FvML distributions using standard conjugate priors for the parameters and Dirichlet priors for the mixing probabilities received attention from Bangert et al. (2010). Hornik and Grün (2013) defined conjugate and Jeffreys priors for the FvML distribution while Taghia et al. (2014) worked on Bayesian inference for the FvML mixture model. From 2017 and onwards the following contributions can be highlighted: Straub (2017) presented Bayesian analysis for the FvML distribution in 3D; Røge et al. (2017) presented Bayesian inference in the case of infinite FvML mixture model assumption; Mulder et al. (2020) provided Bayesian inference for mixtures of von Mises distributions using a reversible jump MCMC sampler and focused on non-informative priors. Lastly, the interested reader is referred to Pewsey and García-Portugués (2021) for Bayesian inference of other directional distributions.
Numerous directional data sets tend to show non-trivial features such as skewness. Therefore, the underlying distribution is not always symmetric, which emphasizes the focus on skewed directional distributions. This inspired us to investigate Bayesian analysis for the general class of skew-rotationally-symmetric distributions (Ley and Verdebout, 2017b), an asymmetric extension of all rotationally symmetric distributions, when the location, concentration, and skewness parameters are unknown.
In section 2, the skew-rotationally-symmetric distribution and special cases are described. The novel contribution is given in section 3 where the posterior distributions are obtained for the skew-FvML as the likelihood model, for four different scenarios of the prior distributions for the parameters of the model. Moreover, an algorithm is provided for generating samples from these posterior distributions. The impact of the priors is explored in section 4, by implementing the Wasserstein Impact Measure (WIM). In section 5, a synthetic data analysis is conducted to show the accuracy of the Bayes estimates based on the assumptions of the skew-FvML model. We demonstrate the applicability of the Bayesian framework for well known real datasets in section 6 for dimensions p = 2 and 3.

SKEW-ROTATIONALLY-SYMMETRIC DISTRIBUTIONS
Most of the distributions on the unit hypersphere S p−1 = {v ∈ R p : v ⊤ v = 1}, p 2, share the common feature of being rotationally-symmetric about their location µ ∈ S p−1 . The distribution of a random variable X ∈ S p−1 is said to be rotationally-symmetric about µ if for any orthogonal matrix O p×p satisfying Oµ = µ it is concluded that OX is equal in distribution to X. The FvML distribution, the most common distribution in spherical studies, is obtained by conditioning on the p-variate normal distribution (see Mardia and Jupp, 2000;Ley and Verdebout, 2017a). Suppose X takes values on the non-linear manifold S p−1 and has the FvML distribution, then its probability density function (pdf) is given by where τ 0 is the concentration parameter, and I α is the modified Bessel function of order α. For τ = 0, (1) simplifies to the uniform distribution. If p = 2 in (1), it results in the von Mises distribution and when p = 3, the Fisher distribution (Fisher, 1953) is obtained.
However, in practice, all real-life phenomena cannot be represented by symmetric models. The interested reader is referred to Downs (2003) for medical research of heart disease diagnosis, Leong and Carlile (1998) for application in neurosciences, Shearman et al. (2000) for biological research on mammalian circadian rhythms, Mardia (2013) and Ameijeiras-Alonso and Ley (2020) for application in bioinformatics especially protein structure prediction, Fisher and Lee (1994) for some studies on wind direction, Ameijeiras-Alonso et al. (2021) for biomechanics studies, and Pewsey (2002) and Ley and Verdebout (2014) for animal movement studies. Therefore, in this paper, the focus will be on the skewrotationally-symmetric (SRS) distributions, introduced by Ley and Verdebout (2017b) as where f (x ⊤ µ) is a rotationally-symmetric pdf about µ ∈ S p−1 , : R → [0, 1] is a monotone increasing function satisfying (−t) + (t) = 1 for all t ∈ R, and ϒ ⊤ µ represents the semi-orthogonal matrix such that ϒ µ ϒ ⊤ µ = I p − µµ ⊤ and ϒ ⊤ µ ϒ µ = I p−1 , where I p is the p × p identity matrix. The parameter γ ∈ R p−1 is a skewness parameter vector such that γ = 0 provides the symmetric pdf f (x ⊤ µ) and non-zero values of γ provide skewed pdfs. This construction allows using the full potential of existing rotationally-symmetric distributions by turning them into skewed versions.
Then the vector X with pdf (4) is obtained as In the next section, Bayesian inference with the SFvML distribution as the key element is presented with all location, concentration, and skewness parameters µ, τ , and γ unknown.

METHODOLOGY
Let X = (X 1 , X 2 , ..., X n ) be a random sample of size n with pdf (4) where the standard normal cumulative density function (cdf) replaces . The likelihood function is Subsequently, four scenarios are presented to define the prior distributions for the location parameter µ, the concentration parameter τ , and the skewness parameter γ .

Prior Distributions
As above, let X denote a set of observations, and a generative model of the data be defined through a set of unknown parameters = (µ, γ , τ ) (see (7)). In this section the prior distributions for = (µ, γ , τ ) are outlined.
For the skewness vector γ the following prior distributions are proposed: (i) the multi-variate normal distribution with location parameter ξ and covariance matrix diag(σ ), (ii) the multivariate skew-normal distribution (Azzalini, 1985) with location parameter ξ , covariance matrix diag(σ ) and skewness parameter λ, i.e., and φ is the standard normal pdf, φ n and n are the pdf and cdf of the n-variate standard normal distribution, respectively. Next, the following priors for µ and τ are considered.

Posterior Distributions
Subsequently, the posterior distribution π( | X) ∝ π( )L( | X) is determined for the different prior assumptions on = (µ, γ , τ ). Firstly, assume the prior distribution set up as described under case 1 and different prior distributions for the skewness parameter.

Scenario 1
Assume the prior distribution of the skewness parameter, γ is given by (8), then for given X the posterior distribution of (µ, γ , τ ) can be obtained by using (7), (8), and (10) as The full conditionals for µ, γ , and τ are, respectively:

Sampling From the Posterior Distributions
A general algorithm is presented to obtain the Bayes estimates of the parameters = (µ, γ , τ ) based on the modified samplingresampling method (Smith and Gelfand, 1992) and modified Gibbs sampling.
After a sufficient burn-in period, the generated sample ((µ 1 , γ 1 , τ 1 ), (µ 2 , γ 2 , τ 2 ), ..., (µ N , γ N , τ N )) is approximately distributed according to the posterior distribution of = (µ, γ , τ ). As can be seen in Algorithm 1, it is sufficient to generate samples of size k from prior distributions of (µ, γ , τ ) which is one of the advantages of this algorithm. By increasing N and k in Algorithm 1 the approximation increases. When the joint prior distributions are not independent, Algorithm 1 still has a good performance (Muralidharan and Parikh, 2007). For the joint prior of (µ, τ ) in (10), the slice sampler can be used (see McElreath, 2020).

THE WASSERSTEIN IMPACT MEASURE
The prior distributions are a crucial part in Bayesian analysis. If the sample size is small, or available data provide only indirect information about the parameters of interest, the prior distribution becomes more important (Carlin and Louis, 2008). Different criteria can be used for prior selection, we refer the reader to Vehtari et al. (2017). Ghaderinezhad et al. (2022) implemented the Wasserstein Impact Measure (WIM) as a measure of the impact of the choice of the prior in a Bayesian approach. In fact it is a convenient way for quantifying prior impact which will help us to choose between two or more priors in a given situation. Suppose is the vector of parameters and F 1 (.) and F 2 (.) are two cumulative distribution functions (cdfs) of Algorithm 1 : Steps to generate samples from the posteriors by using priors and full conditionals.
for i = 1, 2, ..., k, where π(γ | X, µ, τ ) is the full conditional of γ . 10: Select γ (1) corresponding to max ρ 3i , i = 1, 2, ..., k. 11: Repeat Steps 1-10 with (1) = (µ (1) , γ (1) , τ (1) ) to obtain (2) = (µ (2) , γ (2) , τ (2) ) and continue until (B+N) is obtained. two posterior distributions π 1 ( |.) and π 2 ( |.). The Wasserstein distance between these two posteriors related to two different prior sets is obtained as follows: with D the domain of all possible values of . The Wasserstein distance between two posteriors indicates, at any finite sample size n, how close the posterior distributions are and how similar the related inference will be. This is particularly interesting when considering a simple vs. a complicated, computationally intense prior; if the WIM between them is small, then one can safely use the simpler version. When n → ∞ the distance tends to 0. In this section, a simulation study is conducted to compare the different sets of proposed priors in section 3 for p = 2 using this measure. Since the cdfs of the posteriors in (12)-(15) are not computable, Algorithm 1 and Monte Carlo integration are used to obtain the Wasserstein distance. Also, the transport package (Schuhmacher et al., 2020) in the R software offers functions for computing the Wasserstein distance between two sets of samples from different distributions. Most of the functions in this package have been designed for data with two or higher dimensions. For various combinations of the parameters we draw 200 random observations from the SvM in (6).
To compare the impact of the normal distribution and skewnormal distribution in (8) and (9) (for p = 2) as priors for the skewness parameter γ , the following steps were performed: (1) µ and τ were considered as known parameters.
We can thus conclude that from moderate sample sizes on, both priors for all three parameters are rather similar (hence one could use the less computationally intense of both priors), but differ clearly from a non-informative one. In order to judge how large the obtained WIM values actually are, bootstrap resampling could be done with the original data; we leave this for future research. Our analysis here is also limited to the chosen parameter values; more simulations need to be done to get a complete picture. A similar analysis can be performed for p = 3.

SYNTHETIC DATA ANALYSIS
In this section, to assess the performance of the Bayesian approach for obtaining the estimates of = (µ, τ , γ ), a synthetic data analysis was conducted to obtain the Bayes estimates of the parameters of the SvM distribution (6). We generated samples of size N = 20, 50, 100, 500 from the posterior distributions (12)-(15) (scenarios 1-4) with a burn-in period of 5,000 and k = 500, using Algorithm 1 (the values of these parameters are written down in the respective tables). It is noteworthy that steps 2-7 in Algorithm 1 are combined for scenarios 1 and 2. Bayes estimates of the parameters µ, τ and γ were obtained under the squared error, absolute error and zero-one loss functions by calculating mean, median, and mode of the generated samples, respectively. The results for p = 2 and p = 3 including the sample mean, standard deviation, quartiles, and mode of the posterior distribution are summarized in Table 1 for each of the scenarios. As can be seen in Table 1 the obtained Bayes estimates are FIGURE 2 | Traceplots, mean running and estimated posterior pdf plots of generated samples for (µ, τ , γ ) in Table 1 for p = 2, n = 500 and scenario 1 (first row), scenario 2 (second row), scenario 3 (third row), and scenario 4 (fourth row).
Frontiers in Big Data | www.frontiersin.org 9 February 2022 | Volume 4 | Article 769726 close to the actual values of the parameters. In addition, for small sample sizes our proposed Bayesian approach still provides accurate estimates. The traceplots of the generated samples from the posteriors, the compare-partial plots and the running mean plots are shown in Figure 2 (p = 2) and Figure 3 (p = 3) for each of the scenarios and p = 2 and 3 using the ggmcmc package in R (Fernández-i-Marın, 2016). A traceplot is an essential plot for evaluating convergence and diagnosing chain problems. It shows the time series of the sampling process and the expected outcome is to get a traceplot that looks completely random. A compare-partial plot provides overlapped kernel density plots FIGURE 3 | Traceplots, mean running and estimated posterior pdf plots of generated samples for (µ 1 , µ 2 , τ , γ 1 , γ 2 ) in Table 1 for p = 3 and n = 500.
Frontiers in Big Data | www.frontiersin.org   Frontiers in Big Data | www.frontiersin.org that compare the last part of the chain (the last 10% of the values, in green) with the whole chain (in black). Ideally, the initial and final parts of the chain have to be sampling in the same target distribution, so the overlapped densities TABLE 2 | Bayes estimates of parameters based on scenario 1 with prior parameters µ 0 = 2, ζ = 2, η = 1, ξ = 5, and σ = 2 for the movement of blue periwinkles dataset. should be similar. In addition to the traceplots, the running mean plot of the chains is very useful to find within-chain convergence issues. A time series of the running mean of the chain allows to check whether the chain is slowly or quickly TABLE 3 | Bayes estimates of parameters based on scenario 3 with prior parameters µ 0 = 0, τ 0 = 8, α = 5, β = 2, ξ = 3, σ = 1 for the long-axis orientations of feldspar laths dataset. FIGURE 8 | (Top) traceplots, mean running and estimated posterior pdf plots of generated samples for (µ, τ , γ ) in Table 2 for the movement of blue periwinkles dataset. (Bottom) the histogram and kernel density plot of the data related to the movement of blue periwinkles and the fitted curves under different loss functions.

Parameter
approaching its target distribution. The expected output is a line that quickly approaches the overall mean. Figures 2, 3 confirm the convergence of the chains and show that the modified Gibbs sampler recovers the values that actually come from the target posterior distributions. Running multiple independent chains in parallel is necessary to access the representativeness of the chains. If the multiple chains are not well mixed, the convergence of the chains is suspected (Kruschke, 2014;Vehtari et al., 2021). Therefore, four independent chains were run in parallel for scenario 3 (p = 2) in Table 1 to make the inference more robust and reliable. The results are shown in Figure 4 which confirm the convergency.
To compare the efficiency of Bayes estimates with respect to the maximum likelihood estimations (MLE), the mean squared errors (MSE) of MLEs and Bayes estimates of parameters under the squared error and absolute error loss functions were obtained for scenario 2 and 3 and n = 10, 20, 30, 50, 100 using a TABLE 4 | Bayes estimates of parameters based on scenario 4 with prior parameters µ 0 = 3, τ 0 = 2, α = 5, β = 6, ξ = 0.5, σ = 0.5, and λ = −2 for the thunder at Kew dataset. FIGURE 9 | (Top) traceplots, mean running and estimated posterior pdf plots of generated samples for (µ, τ , γ ) in Table 3 for the long-axis orientations of feldspar laths dataset. (Bottom) the histogram and kernel density plot of the data related to the long-axis orientations of feldspar laths and the fitted curves under different loss functions.

Parameter
Monte Carlo simulation with 500 replications. Then, the relative efficiency (RE) was computed as where is the MLE of = (µ, τ , γ ) and and are the Bayes estimates of under the squared error and absolute error loss functions, respectively. The results are shown in Figure 5 for scenario 2 (top) and scenario 3 (middle) and µ = 3, τ = 0.6, γ = 1.
From Figure 5 (top and middle) the following general conclusions can be observed: • Our proposed Bayesian approach provides more accurate estimates for parameters in comparison with the maximum likelihood method for small values of n.
• The obtained Bayes estimates under the squared error loss function have less MSE than the estimates based on absolute error loss function.

Parameter
• By increasing n, our proposed Bayesian approach has a similar performance as the maximum likelihood method.
Finally, to investigate the rule of k in Algorithm 1, the biases of the Bayes estimates of (µ, τ , γ ) under the squared error loss function were obtained for scenario 3 (p = 2) in Table 1, n = 100 and different values of k = 10, 50, 100, 200, 300, 500 using a Monte Carlo simulation with 500 replications. The results are shown in Figure 5 (bottom) which demonstrate that, by increasing k, the bias tends to zero and thus, the accuracy of estimates increases.

DATA APPLICATION
In what follows, the proposed Bayesian approach's performance for p = 2 is demonstrated through three datasets with different sizes n = 31, 60, 725 with the skew-von Mises distribution in (6) as assumed model. The circular boxplots (Buttarazzi et al., 2018) of the datasets are shown in Figure 6 (top) and confirm the skew pattern of the datasets. The obtained results in section 5 show all the assumed scenarios provide accurate estimates for the parameters. However, based on the obtained results in section 5 with the WIM, we propose scenario 3 or 4 for obtaining the Bayes estimates to avoid time intensive computations when the sample size is not small (see Figure 7). The justification is that scenarios 1 and 2 need the slice sampler in Algorithm 1 to generate samples from the joint prior (10). Therefore, the Bayes estimates of the parameters µ, τ , and γ were obtained based on scenario 1 for the movement of blue periwinkles dataset (n = 31); scenarios 3 and 4 for the long-axis orientations of feldspar laths (n = 60) and the thunder at Kew (n = 725) datasets, respectively. See below for the description of said datasets.
For p = 3, a dataset of size n = 40 including the expenditures of households is considered. Figure 6 (bottom) shows the scatter plot of the data. For all the datasets to obtain the Bayes estimates FIGURE 11 | (Top) traceplots, mean running and estimated posterior pdf plots of generated samples for (µ 1 , µ 2 , τ , γ 1 , γ 2 ) in Table 5 for the household expenditures dataset. (Bottom) the scatter plot of the household expenditures dataset and the contour plot of the fitted distribution under different loss functions.
we generated samples of size N = 1, 000 from the posterior distributions using Algorithm 1 with a burn-in period of 10,000 and k = 500.
In what follows, we shall describe the individual datasets in detail. Since the conclusion is more or less the same for all p = 2 settings, we already write it down here. It is observed that the proposed Bayesian approach with the skew von Mises distribution as the underlying model provides a good fit to the datasets. Generally, the obtained estimates based on the squared error and absolute error loss functions are more accurate.

Movement of Blue Periwinkles
A real dataset including the movement directions of 31 blue periwinkles, Nodilittorina unifasciata, in degrees is considered (Fisher, 1995). The data was collected from a series of experiments which were done on the distances and directions that small blue periwinkles moved after the transplantation to downshore at a specific height where they live normally. The test of Pewsey (2002) confirms that the underlying distribution for this dataset is asymmetric (p-value= 0.0000). This test is defined based on the sample second sine momentb 2 = 1 n n i=1 sin 2(θ i − θ ) whereθ is the sample mean direction. The large values of |b 2 √v ar(b 2 ) | compared with the quantiles of the standard normal distribution lead to the rejection of symmetry. For more details see Pewsey (2002). The Bayes estimates of parameters are obtained by using scenario 1 based on squared error, absolute error and zeroone loss functions. The results are summarized in Table 2. The traceplots of generated samples from the posterior, the comparepartial and mean running plots are shown in Figure 8 (top). The kernel density plot and histogram of the data along with the fitted curves under different loss functions are shown in Figure 8 (bottom).

Long-Axis Orientations of Feldspar Laths
Another dataset including the measurements of long-axis orientation of 60 feldspar laths in basalt (Fisher, 1995) is considered. The symmetry test of Pewsey (2002) confirms the skew pattern of the data in Figure 6 (p-value = 0.0000). The Bayes estimates of parameters are obtained by using scenario 3 based on squared error, absolute error and zeroone loss functions. The results are summarized in Table 3. The traceplots of generated samples from the posterior, the comparepartial, and the mean running plots are shown in Figure 9 (top). The histogram and kernel density plot of the data and the fitted curves under different loss functions are shown in Figure 9 (bottom).

Thunder at Kew
A grouped frequency data set consisting of 725 observations about the number of times that thunder was heard at Kew (England) during each two hourly interval of the day in the summers of 1910-1935 is considered (Mardia, 1975). In this case, each 15 • on the circle represents 1 h. According to the test of Pewsey (2002), the underlying distribution for this data set is not symmetric (p-value = 0.0000). The Bayes estimates of parameters are obtained by using scenario 4 based on squared error, absolute error and zeroone loss functions. The results are summarized in Table 4. The traceplots of generated samples from the posterior, the compare-partial and mean running plots are shown in Figure 10 (top). The histogram and kernel density plot of the data and the fitted curves under different loss functions are shown in Figure 10 (bottom).

Household Expenditures
For p = 3, a sub data from the dataset available in the HSAUR2 package (Everitt and Hothorn, 2017) in R is considered. The entire data was collected from a survey on household expenditures in four commodity groups of housing, food, goods, and service. It includes the expenses of 20 single men and 20 single women. We considered variables housing, food, and service and normalized. After applying cosine transformation (5), the SFvML was fitted on the data and the Bayes estimates of the parameters were obtained. The results are summarized in Table 5. The traceplots of generated samples from the posterior and the compare-partial and mean running plots are shown in Figure 11 (top). The scatter plot of the data and the contour plot of the fitted distribution under different loss functions are shown in Figure 11 (bottom).

CONCLUSION
Since the assumption that data is rotationally-symmetric is often rejected (Pewsey, 2002;Ley and Verdebout, 2014;Ameijeiras-Alonso and Ley, 2020;Ameijeiras-Alonso et al., 2021), in this paper, we have presented a Bayesian analysis for the skewrotationally-symmetric FvML distribution. For the first time in Bayesian analysis of directional data the impact of the proposed priors in the set up has been compared using the Wasserstein Impact Measure. Using this measure can give guidance to the practitioner to avoid computationally intensive priors if a simpler prior has similar impact. An algorithm has been used based on modified Gibbs sampling and weighted bootstrap resampling for generating samples from posterior distributions. This coming together of Bayesian methods and skew distributions in the directional domain promises new research interest.

DATA AVAILABILITY STATEMENT
All relevant references for data are contained within the article.

AUTHOR CONTRIBUTIONS
All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.