Investigating a Weakly Informative Prior for Item Scale Hyperparameters in Hierarchical 3PNO IRT Models

The half-t family has been suggested for the scale hyperparameter in Bayesian hierarchical modeling. Two parameters define a half-t distribution: the scale s and the degree-of-freedom ν. When s is set at a finite value that is slightly larger than the actual standard deviation of the parameters, the half-t prior density can be vaguely informative. This paper focused on such densities, and applied them to the hierarchical three-parameter item response theory (IRT) model. Monte Carlo simulations were carried out to investigate the performance of such specifications in parameter recovery and model comparisons under situations where the actual variability of item parameters varied, and results suggest that the half-t family does offer advantages over the commonly adopted uniform or inverse-gamma prior density by allowing the variability for item parameters to be either very small or large. A real data example is also provided to further illustrate this.


INTRODUCTION
With current enhanced computational technology and the emergence of Markov chain Monte Carlo (MCMC) simulation techniques (e.g., Chib and Greenberg, 1995), the methodology for parameter estimation with item response theory (IRT) models has rapidly moved to a fully Bayesian approach. One of the many advantages that this approach offers in the simultaneous estimation of both item and person parameters is the flexibility of setting prior distributions for model parameters or hyperparameters. The existing literature in Bayesian statistics (Gelman et al., 2003) offers two general options in specifying prior distributions by choosing between fully informative priors using application-specific information and non-informative priors. Each of these is adopted depending on the availability of prior information. However, when prior information is desired but not readily available, neither would provide a common solution for various actual situations. The problem with the former, in particular, is that misspecification tends to result in biased estimates and hence incorrect inferences (e.g., Mislevy, 1986). This paper focuses on something in the middle, namely, a somewhat informative prior distribution that can be used in a wide range of applications.
The fully Bayesian estimation procedure has been developed for the three-parameter normal ogive (3PNO; Lord, 1980) model by Sahu (2002; see also Johnson and Albert, 1999) generalizing the approach for the two-parameter model by Albert (1992). The procedure has been further implemented in some applications, e.g., Béguin and Glas (2001) and Glas and Meijer (2003). However, this specification where the hyperparameters take specific values causes problems when prior distributions for the item slope and intercept parameters are not strongly informative. Specifically, studies have shown that improper non-informative prior densities for the item slope and intercept parameters result in an undefined posterior distribution, which gives rise to unstable parameter estimates (Sheng, 2008(Sheng, , 2010. Even with proper non-informative prior densities, the procedure either fails to converge or requires an enormous number of iterations for the Markov chain to reach convergence (Sheng, 2010). Sheng (2013) shows that if one specifies prior distributions for the hyperparameters of the item parameters, instead of setting values for them, the problem can be resolved. This type of hierarchical modeling allows a more objective approach to inference by estimating the parameters of prior distributions from data rather than specifying them using subjective information.
Research on the effect of prior distributions for hyperparameters in Bayesian hierarchical models advises caution in choosing a prior distribution for the scale hyperparameter, as certain specifications of the hyperpriors may cause problems in inference (Brown and Draper, 2006;Gelman, 2006). In the Bayesian literature and software, various non-informative prior distributions have been suggested for the variance parameter in hierarchical linear models, including an improper uniform prior on σ (Gelman et al., 2003), an improper uniform prior on log(σ ), and a conjugate inverse-gamma (0.001, 0.001) prior (Spiegelhalter et al., 1994(Spiegelhalter et al., , 2003. Gelman (2006), however, in an attempt to illustrate the performance of these prior distributions for σ near zero, which is where classical and Bayesian inferences differ the most (see Brown and Draper, 2006), pointed out problems with the latter two, especially with the inverse-gamma family of non-informative prior distributions. As Gelman (2006) stated, the problem with the uniform prior distribution on log(σ ) is that it results in an improper posterior distribution (p. 521), and the problem with the inverse-gamma family is that inferences are sensitive to the choice of the hyperparameters when low values of σ are possible (p. 522). With respect to the proper non-informative prior, he recommended the use of a half-t family, which is inherently conjugate (see Gelman, 2006, for a detailed illustration) and is preferred over other parametric family for the hyperprior distributions because flat-tailed distributions allow for robust inference (Berger and Berliner, 1986). When the scale of the half-t distribution is set to a finite value that is slightly larger than the actual variability of the parameters, the resulting prior density can be vaguely informative.
In view of the above, the purpose of this study is to develop Bayesian hierarchical 3PNO IRT models with such half-t densities being the item scale hyperpriors and further investigate their performance in estimating model parameters as well as in providing model-data fit under different test situations where the actual variability of item parameter varies.
The remainder of the paper is organized as follows. Section 2 describes the hierarchical 3PNO IRT model and the Gibbs sampling procedure where half-t hyperpriors are assumed for the scale parameters for item slopes and intercepts. Then, two simulation studies were carried out to evaluate the performance of this model specification in parameter recovery as well as to compare it with other model specifications with uniform or inverse-gamma prior densities. The methodology and results of these simulation studies are presented in Sections 3 and 4, respectively. Section 5 gives an example where the model specification under investigation is implemented on a subset of College Basic Academic Subjects Examination (CBASE; Osterlind, 1997) English data. Finally, a few summary remarks are provided in Section 6.

MODEL AND THE GIBBS SAMPLING PROCEDURE
Before the study is further described, the hierarchical 3PNO model is briefly illustrated. Suppose a test consists of k binary response items (e.g., multiple-choice items), each measuring a single unified latent trait, θ . Let y = y ij n×k denote a matrix of n responses to the k items where y ij = 1 (y ij = 0) if the i-th person answers the j-th item correctly (incorrectly) for i = 1, . . . , n and j = 1, . . . , k.
The probability of person i obtaining a correct response to item j can be defined as (1) for the 3PNO IRT model, where denotes the normal CDF, θ i is a scalar latent trait parameter, α j is a scalar slope parameter describing the item discrimination, β j is the intercept parameter associated with item difficulty, and γ j is a pseudo-chance-level parameter indicating that the probability of correct response is < zero even for those with very low trait levels. This model is applicable for objective items, such as multiple-choice or true-orfalse items where an item is too difficult for some examinees.
To implement Gibbs sampling to the 3PNO model defined in (1), two latent variables, Z and W, are introduced such that Z ij ∼ N(η ij , 1) (Albert, 1992;Tanner and Wong, 1987), where η ij = α j θ i − β j , and W ij = 1 (W ij = 0) if person i knows (does not know) the correct answer to item j with a probability density function Prior densities p(θ ), p(ξ ) and p(γ ) can be assumed for θ i , ξ j and γ j , respectively, where ξ j = (α j , β j ) ′ . Here we focus on the normal conjugate priors for ξ j so that α j ∼ N (0,∞) (µ α , σ 2 α ), β j ∼ N(µ β , σ 2 β ). Further, with hyperpriors assumed for the hyperparameters µ α , µ β , σ 2 α , σ 2 β , the joint posterior distribution of (θ , ξ , γ , W, Z, µ ξ , ξ ) is hence where µ ξ = (µ α , µ β ) ′ , ξ = diag(σ 2 α , σ 2 β ), and f (y|W, γ ) = p y ij ij (1 − p ij ) 1−y ij is the likelihood function, with p ij being the model probability function as defined in (1). Assume a normal prior for θ i , a conjugate Beta prior for γ j so that θ i ∼ N(µ, σ 2 ), γ j ∼ Beta(d, t), and conditionally conjugate half-t prior distributions for σ α and σ β with mean 0, degrees-of-freedom ν and scale s, where ν and s are chosen to provide minimal prior information to constrain the scale hyperparameters to lie in a reasonable range. Hence, with the prior distributions specified this way, the full conditional distributions of all the parameters can be derived in closed forms through a multiplicative reparameterization following Knape et al. (2008) and updated iteratively using the Gibbs sampler (see the Appendix in Supplementary Material).
The half-t distribution considered for σ α and σ β takes the form (4) (Gelman, 2006). Setting ν = 1 results in a special case of half-Cauchy, which has a broad peak at zero and can be weakly informative if s takes large but finite values. On the other hand, setting the scale s to infinity corresponds to a flat prior distribution, setting the degrees of freedom ν to 100 corresponds to a half-normal distribution, which can be non-informative but proper if the scale s is set to a high value, such as 100. Here, in the study, we considered a half-normal, a half-Cauchy, and a half-t distribution with 4 degrees of freedom (ν = 4).

METHODOLOGY OF MONTE CARLO SIMULATIONS
In order to evaluate the performance of the hierarchical 3PNO model as described in Section 2, two simulation studies were conducted where it was compared with other model specifications in parameter recovery and/or model comparisons.

Simulation Study 1
In the IRT literature, it is well accepted that when prior information is not readily available, large data sizes are needed to estimate IRT parameters (e.g., Swaminathan and Gifford, 1983), as Bayesian estimation with a flat prior is equivalent to a maximum likelihood estimation, and that the accuracy of item (person) parameter estimates is related to the number of subjects (items) (e.g., Natesan et al., 2016;Sheng, 2010). Hence, in this simulation study, we evaluated the effect of sample size, test length, actual variability of item slope and intercept parameters on the accuracy with which model parameters are estimated considering a halfnormal, a non-informative uniform or an inverse gamma prior. Item responses for k items (k = 10, 20, and 40) and n individuals (n = 100, 300, 500, and 1000) were generated according to the 3PNO model, as defined in (1). Ability parameters were generated as samples from a standard normal distribution, pseudo-chance-level parameters were generated from a uniform distribution, γ j ∼ U(0.05, 0.4), and item slope and intercept parameters were generated as samples from uniform distributions so that • Sim1: α j ∼ U(0, 2), β j ∼ U(−1, 1); • Sim2: α j ∼ U(0, 2), β j ∼ U(−0.5, 0.5); and • Sim3: α j ∼ U(0.5, 1.5), β j ∼ U(−1, 1).
It is noted that although the inverse-gamma (0.001, 0.001) prior density was not suggested by Gelman (2006), it was considered in this study because of its popularity (see e.g., Xu et al., 2009;O'Brien et al., 2015). With each of the prior specifications considered, the Gibbs sampling procedure was implemented where 10,000-50,000 iterations were obtained with the first half set as burn-in.

Simulation Study 2
In the first simulation study, only two levels of variability (i.e., σ 2 α,β = 0.083 and σ 2 α,β = 0.333), and three different hyperpriors (one uniform, one inverse-gamma and one half-normal) were considered. Given that the range for item intercept parameters (β j ) is generally wider than [−1, 1] in practice, it would be interesting to see how the weakly informative half-t family performs when the variance for β goes beyond 1. Hence, a second simulation study was conducted where two factors were manipulated, namely, actual variability of the item intercept parameters and specifications of prior distributions for the item scale hyperparameters.
Item responses for 20 items and 1000 individuals were generated according to the 3PNO model, as defined in (1). Ability parameters were generated as samples from a standard normal distribution, item slope and pseudo-chance-level parameters were generated from uniform distributions, α j ∼ U(0, 2) and γ j ∼ U(0.05, 0.4), and item intercept parameters were generated as samples from uniform distributions so that It is noted that in the four simulations, the uniform distributions from which the intercept parameters were sampled from have an increasingly large σ β ranging from 0.29−2.31 (with the corresponding variance ranging from 0.083−5.333).
The Gibbs sampling procedure was implemented where 20,000 iterations were obtained with the first 10,000 as burn-in.

Evaluation Criteria
For both simulation studies, convergence was evaluated using the Gelman-Rubin R (Gelman and Rubin, 1992) statistic. The usual practice is using multiple Markov chains from different starting points. Alternatively, a single chain can be divided into subchains so that convergence is assessed by comparing the between and within sub-chain variance (Fox, 2007). Since a single chain is less wasteful in the number of iterations needed, the latter approach was adopted. For each Markov chain, the initial values were set to be α j = 1, β j = 0, and γ j = 0.2 for all items j and θ i = 0 for all persons i. After discarding the burn-in samples, the chain was then separated into five sub-chains of equal length and the R statistic was calculated following the procedure by Gelman and Rubin (1992). Convergence can also be monitored visually using time series graphs of the simulated sequence, such as the trace plot, the running mean plot, and the autocorrelation plot shown in Figure 1 for one item. It is observed that for this item, the autocorrelations between successive parameter draws became negligible at lags > 600, suggesting burn-in for a single chain should not take longer than that (Geyer, 1992). Indeed, the trace plot and the running mean plot both suggest possible convergence within 10,000 iterations. Inspection of such plots has, however, been criticized for being unreliable and unwieldy in the presence of a large number of model parameters (Gelman et al., 2003;Nylander et al., 2008). The R statistic obtained from using a single chain was hence the major approach for assessing convergence in this study.
For each simulated scenario, 25 replications, as recommended by Harwell et al. (1996), were conducted to avoid erroneous results in estimation due to sampling error. The accuracy of item/person parameter estimates was evaluated using the root mean square error (RMSE) and bias. Let τ denote the true value of a parameter (e.g., α j , β j , γ j , or θ i ) and t r its estimate in the rth replication (r = 1, . . . , R). The RMSE is defined as and the bias is defined as These quantities were averaged over items/persons to provide summary indices.
In simulation study 2 where six model specifications were compared, the adequacy of the fit of the hierarchical 3PNO model with a given prior density on the simulated data was evaluated using Bayesian deviance. It should be noted that this measure provides a model comparison criterion. Hence, it evaluates the fit of a model in a relative, not absolute, sense. The Bayesian deviance information criterion (DIC) was introduced by Spiegelhalter et al. (2002) who generalized the classical information criteria (e.g., AIC, BIC) to one that is based on the posterior distribution of the deviance. This criterion is defined as DIC =D + p D , whereD = E[−2 log L ϑ|y (y|ϑ)] is the posterior expectation of the deviance (with L being the likelihood function), and is the effective number of parameters (Carlin and Louis, 2000). In addition, let D(θ) = −2 log[L ϑ|y (y|θ)], whereθ is the posterior mean. To compute Bayesian DIC, MCMC samples of the parameters, ϑ (1) , . . . , ϑ (G) , can be drawn from the Gibbs sampler, thenD is approximated asD = . Generally more complicated models tend to provide better fit. Hence, penalizing for the number of parameters makes DIC a more reasonable measure to use. In this study, the Bayesian deviance estimate was obtained after each implementation and averaged across the 25 replications for each model specification.

SIMULATION RESULTS
The results of the two simulation studies are presented in this section and described as follows.

Simulation Study 1
Each implementation of the Gibbs sampler gave rise to Gelman-Rubin R statistics close to 1, indicating possible convergence of the Markov chains within the simulated number of iterations. Hence, the posterior estimates were obtained as the posterior expectations of the Gibbs samples and the average RMSE and bias values for α j , β j , γ j , and θ i are summarized in Tables 1-4. A close examination of these values leads to the following observations: 1. In Sim1 where the actual variances for α and β were both 0.333, the half-normal distribution performed relatively less well in recovering item and person parameters than the uniform or inverse-gamma distributions when data sizes (n × k) were < 10, 000, although it has to be noted that this distribution consistently resulted in smaller bias in estimating α regardless of sample size or test length. The uniform prior density performed similarly as the inverse-gamma prior density, with a slight advantage for data with small sample sizes (i.e., n = 100) especially where k = 10. 2. Sim2 differed from Sim1 in that the actual variance for β decreased to 0.083, a value closer to zero. Hence, we focus on the results in recovering β here. From Table 2, it is obvious that the half-normal distribution consistently performed better in recovering β than the uniform or inversegamma distribution when k < 40 or with large sample sizes (n > 300) when k = 40. Between the two non-informative priors, the uniform prior density performed less well than the inverse-gamma (0.001, 0.001) prior density. 3. Similar to Sim2, Sim3 differed from Sim1 in the actual variance for α being reduced to 0.083, and consequently, the results in recovering α are discussed here. It is noted from Table 1 that the half-normal distribution consistently resulted Frontiers in Psychology | www.frontiersin.org FIGURE 1 | Trace plots (top), running mean plots (middle) and autocorrelation plots (bottom) of α, β and γ for one item assuming a half-normal prior distribution for σ α and σ β [n = 1000, k = 10, chain-length = 20, 000, item slopes and intercepts were generated from α j ∼ U(0, 2) and β j ∼ U(−1, 1)].
in smaller RMSE and bias and hence performed better than the non-informative uniform or inverse-gamma family of prior distributions (except for the situations where n < 500 and k = 40). The uniform prior density tended to result in larger RMSE or bias than the inverse-gamma (0.001, 0.001) prior when k < 40. 4. It is interesting to note that the half-normal prior density consistently resulted in smaller bias in estimating α in all the simulated conditions. 5. Further, it is noted that the benefit of using a half-normal prior in Sim2 or Sim3 where it resulted in smaller RMSE and bias in estimating β or α was not reflected in estimating θ unless the data size was over 12, 000 in Sim2 or unless k ≥ 20 in Sim3.
From these observations, it is hence noted that the half-normal prior outperforms the uniform or inverse-gamma family of prior distributions in estimating the corresponding parameters in the hierarchical 3PNO model when the variability for item slope or intercept parameters is close to zero. Further, the inverse-gamma (0.001, 0.001) prior tends to perform better in parameter recovery than the uniform prior when the respective item parameters have a small variance. This can be explained by the fact that the non-informative uniform prior density is flat and hence does not restrict σ away from large values, whereas inversegamma (0.001, 0.001) and half-normal distributions have a peak around zero and do perform such shrinkage for variances near zero. It is further noted that the estimation error and bias in estimating item parameters (α, β, or γ ) reduce with the increase of sample sizes, and that the error and bias in estimating person parameters (θ ) reduce with the increase of test lengths. This is consistent with findings from other studies (e.g., Swaminathan and Gifford, 1983;Sheng, 2010;Natesan et al., 2016).

Simulation Study 2
Each implementation of the Gibbs sampler gave rise to Gelman-Rubin R statistics close to 1, indicating possible convergence within 20,000 iterations. Hence, the posterior estimates were obtained as the posterior expectations of the Gibbs samples and the average RMSE and bias values for α j , β j , and γ j are summarized in Table 5. A close examination of these values leads to the following observations: 1. In all four simulations where the actual variability for the intercept parameters (σ β ) ranged between 0.29−2.31, the three weakly informative half-t prior densities, namely, the half-Cauchy, the half-normal and the half-t with 4 degrees of freedom had consistently smaller RMSE if not bias in recovering item and person parameters, and in particular in recovering β j , compared with the other three prior distributions. 2. It is noted that in Sim2 where σ β was about 0.5, the three weakly informative half-t densities had more bias in estimating β and γ than the two non-informative prior densities. When σ β moved further away from 0.5 (i.e., toward 0 in Sim1 or toward 2.5 in Sim4), their advantages over other model specifications became more obvious in that they resulted in much smaller average RMSE and bias values. 3. Among the three weakly informative half-t densities, the half-normal resulted in relatively smaller RMSE and bias in recovering item and person parameters in Sim1, but larger RMSE and bias in Sim2, Sim3, and Sim4. 4. Between the two non-informative priors, the inverse-gamma (0.001, 0.001) prior distribution performed slightly better in estimating β and γ in Sim1, but worse in estimating other parameters or in other situations. This is consistent with findings from the first simulation study. 5. In Sim1 where σ β was close to 0, the informative inversegamma (3, 2) prior distribution was correctly specified with little bias and hence had small RMSE in recovering β j . However, when σ β moved further away from 0, it became less appropriate, and consequently resulted in an obviously larger bias and RMSE in recovering especially item parameters. From these observations, it is noted that the weakly informative half-t family works well in a wider range of situations than the non-informative or the informative prior densities in recovering the 3PNO model item parameters. Specifically, when the actual scale hyperparameter is close to 0, the non-informative prior density does not work well compared with informative or weakly informative prior densities, as prior information helps to restrict σ β away from large values. Even with prior information,  the informative inverse-gamma (3,2) prior density does not outperform the weakly informative half-t distribution because the latter has a better behavior near 0. Figure 2 graphically illustrates this point. The non-informative inverse-gamma prior distribution, in general, is not recommended for large σ values  because it is not non-informative for these values (see Figure 2). On the other hand, prior information has to be correctly specified, as misspecification leads to large bias and hence large estimation error, e.g., the actual σ β in Sim4 was clearly not in the range of the inverse-gamma (3, 2) distribution (see Figure 2). When comparing among the three half-t distributions, the half-Cauchy is more likely to allow for occasionally large values than the half-t density with 4 degrees of freedom, or the halfnormal, which has the smallest tail (see Figure 2). Hence, if it is known a priori that the scale hyperparameter might be far away from 0, a half-Cauchy or a half-t distribution with a small degrees of freedom is suggested. Otherwise, a half-normal may be considered.
In addition to parameter recovery, the model comparison results in each simulation were averaged over the 25 replications and are summarized in Table 6, which shows the averaged estimates for the posterior expectation of the deviance (D), the deviance of the posterior expectation [D(θ)] values, the effective number of parameters (p D ), and the Bayesian DIC, respectively. The model with the half-Cauchy or half-t prior density shows consistently smallerD, D(θ), and DIC than those with other prior specifications. Since small deviance values indicate better model fit, models with such prior distributions for the scale hyperparameters are shown to provide a better description of the simulated data when the actual σ β was especially larger than 0.29, even after penalizing for model complexities, i.e., the effective number of parameters.
After a close examination and comparison of the values shown in the table, a few remarks can be drawn from these results: • In all four simulations where σ β ranged between 0.29−2.31, Bayesian deviances consistently preferred the models with half-t prior densities, whose deviance values were increasingly smaller compared with other model specifications when σ β became larger. • The two non-informative priors performed similarly in describing the simulated data according to Bayesian DIC, which was slightly larger for the model with the inversegamma (0.001, 0.001) prior. • In all the simulated scenarios, the model with the informative inverse-gamma (3,2) prior distribution had consistently the largest average deviance values and hence is not preferred. One may further note that as σ β moved further away from 0, the difference in deviances between this model specification and others was increasingly larger. • Among the three half-t distributions considered, the Bayesian deviance measures did not favor the half-normal prior density. • It is interesting to note that when σ β was small with values of e.g., 0−0.6, models with half-t prior densities tended to have smaller effective number of parameters (p D ) than those with non-informative prior densities. The opposite is true for situations where σ β was over 1.
In summary, the results in parameter recovery and model comparisons using Bayesian deviances suggest that the hierarchical 3PNO model with weakly informative halft densities for the item scale hyperparameters does show advantages compared with those with non-informative or informative prior distributions in various test situations where the actual variability ranges from small to large values. When the actual variability of item parameters is not close to 0, the half-Cauchy or the half-t with small degrees of freedom is preferred over the half-normal distribution because of its flexibility in allowing for occasional large variability.

AN EXAMPLE WITH CBASE DATA
As an illustration, the hierarchical 3PNO model with the weakly informative half-t hyperpior was implemented to a subset of CBASE English subject data, and further compared with the other model specifications in describing the data.

Method
The overall CBASE exam contains 25 multiple-choice items on English reading/literature. The data used in this study were from college students who took the LP form of CBASE in years 2001 and 2002. After removing those who attempted the exam multiple times and removing missing responses, a sample of 1,200 examinees was randomly selected. To assess the model FIGURE 2 | Prior density functions for the slope and intercept standard deviation parameters: (a) uniform prior distribution on σ α (or σ β ), (b) inverse-gamma (0.001, 0.001) prior distribution on σ 2 α (or σ 2 β ), (c) inverse-gamma (3, 2) prior distribution on σ 2 α (or σ 2 α ), (d) half-Cauchy prior distribution for σ α (or σ β ), (e) half-t prior distribution for σ α (or σ β ) with 4 degrees of freedom, and (f) half-normal prior distribution for σ α (or σ β ). It is noted that the half-t distributions are on a different scale compared to the previous three densities.
goodness-of-fit, hierarchical 3PNO models with the three half-t prior distributions were compared with each other and further compared with those with uniform and inverse-gamma prior densities described in Section 4.

Results
Each of the six model specifications was implemented to the CBASE English data using the Gibbs sampling procedure, where 20,000 iterations were obtained with the first 10,000 set as burn-in. The Gelman-Rubin R statistics were used to assess convergence and they were found to be around or close to 1, suggesting that stationarity had been reached within the simulated Markov chains for the model. The Bayesian deviance estimates were subsequently obtained for each model specification and the results are summarized in Table 7. Among the six model specifications considered, the ones with half-t and half-Cauchy prior densities had the smallest DIC and/or expected posterior deviance (D) values. Therefore, the 3PNO model with a half-t or half-Cauchy hyperprior provided the best description of the data compared with other model specifications, even after penalizing for a large effective number of parameters (e.g., p D = 987.24 for the half-t, and p D = 993.80 for the half-Cauchy). On the other hand, with slightly larger deviances, the model with an informative inverse-gamma (3, 2) hyperprior described the data less adequately than those with other prior specifications. The relatively small differences in deviances suggested the actual variability might be close to 0. Furthermore, the model with a half-normal or half-t hyperprior had a larger p D than that with a non-informative prior density. Given the findings from the second simulation study in Section 4, this indicated that the actual standard deviation for the item slope and/or intercept parameters with the data was not larger than 1.

CONCLUDING REMARKS
The half-t family offers a good alternative parametric family for the prior distribution of scale hyperparameters in Bayesian hierarchical modeling. This paper adopts it as the hyperprior for item scale parameters in the hierarchical 3PNO IRT model, with the focus being on the weakly informative half-t distributions. Their utility in parameter estimation and model comparison has been explored and the results show that they do offer advantages over the commonly adopted uniform or inverse-gamma prior density by allowing the variability for item slope and/or intercept parameters to be either very small or large. The weakly informative half-t, and especially the weakly informative half-Cauchy density provides certain level of prior information while it still allows occasional large values. Hence, it overcomes problems resulting from using either a Sim1, β ∼ U(−0.5, 0.5); Sim2, β ∼ U(−1, 1); Sim3, β ∼ U(−2, 2); Sim4, β ∼ U (−4, 4) non-informative or an informative prior density when prior information is desired but not readily available. Consequently, the flat-tailed half-t distributions are applicable in a wide range of applications and are recommended. In particular, when prior information is not readily available but non-informative priors are not desired, the use of a half-Cauchy distribution is recommended with the scale set to a finite value that is higher than the actual standard deviation. It has to be noted that this paper only focuses on weakly informative halft distributions. One may set the scale of the distribution to be large, e.g., s = 100, to make it non-informative. For a non-informative but proper prior distribution, a half-normal with the scale s set to a high value, such as 100 should be considered. The use of the half-t prior density for the slope/intercept scale hyperparameter has also an effect on estimating person parameters in 3PNO models. Based on results from the first simulation study, we can see that it tends to result in smaller bias and error in estimating them for data sizes > 10,000. For small data sizes, the use of half-normal prior for item scale hyperparameters may not be suggested over the uniform or inverse-gamma (0.001, 0.001) prior if the focus is primarily on estimating person ability parameters.
One may need to further explore the use of other halft prior densities, such as the half-Cauchy, under these conditions. It would also be interesting to find out the reason why manipulating the hyperpriors for item parameters has such an effect on estimating person parameters in the hierarchical 3PNO model, and/or investigate the effect of different (hyper)priors for person parameters in estimating the model. Further, results and hence conclusion of the second simulation study are based on tests with k = 20 and n = 1000. Although it is believed that similar results can be obtained with other test lengths (e.g., k = 10 or k > 20) and/or larger sample sizes (n > 1000), additional studies are needed to confirm this and to further investigate the use of half-Cauchy or other half-t prior densities for item scale hyperparameters in smaller sample size conditions (e.g., n < 500).
In this study, due to the computational expense of MCMC procedures, only 25 replications were adopted and hence the simulation results are based on a limited number of replications. Future studies shall follow the procedure illustrated by Koehler et al. (2009) to ensure the adequacy of the number of replications. Alternatively, one may consider using variational Bayes as suggested by Natesan et al. (2016) given its improved computational efficiency and equivalent estimation accuracy when compared with MCMC. In addition, only certain prior or hyperprior densities for item slope and intercept scale hyperparameters were investigated in this paper. Future research may include more prior specifications or adopt non-conjugate priors for them. Finally, in this study, only Bayesian deviance was used to evaluate individual models. Given that DIC may be limited in that it is not invariant to parameterization and sometimes can produce unrealistic results, further studies can adopt other methods for model comparisons, such as Bayes factors (Kass and Raftery, 1995) or posterior predictive model checking (Rubin, 1984).

AUTHOR CONTRIBUTIONS
YS designed the study, carried out all the analyses and wrote the manuscript.