An Overview of Interrater Agreement on Likert Scales for Researchers and Practitioners

Applications of interrater agreement (IRA) statistics for Likert scales are plentiful in research and practice. IRA may be implicated in job analysis, performance appraisal, panel interviews, and any other approach to gathering systematic observations. Any rating system involving subject-matter experts can also benefit from IRA as a measure of consensus. Further, IRA is fundamental to aggregation in multilevel research, which is becoming increasingly common as a way to address nesting. Although several technical descriptions of a few specific IRA statistics exist, this paper aims to provide a tractable orientation to common IRA indices to support application. The introductory overview is written with the intent of facilitating contrasts among IRA statistics by critically reviewing equations, interpretations, strengths, and weaknesses. Statistics considered include $r_{wg}$, $r^*_{wg}$, $r'_{wg}$, $r_{wg(p)}$, average deviation (AD), $a_{wg}$, standard deviation ($S_{wg}$), and the coefficient of variation ($CV_{wg}$). Equations support quick calculation and contrasting of different agreement indices. The article also includes a "quick reference" table and three figures to help readers identify how IRA statistics differ and how interpretations of IRA depend strongly on the statistic employed. A brief consideration of recommended practices involving statistical and practical cutoff standards is presented, and conclusions are offered in light of the current literature.


INTRODUCTION
The assessment of interrater agreement (IRA) for Likert-type response scales has fundamental implications for a wide range of research and practice. One application of IRA is to quantify consensus in ratings of a target, which is often crucial in job analysis, performance assessment, employment interviews, assessment centers, and so forth (e.g., Brutus et al., 1998; Walker and Smither, 1999; Morgeson and Campion, 2000; Harvey and Hollander, 2004). Another application of IRA is to determine the appropriateness of averaging individual survey responses to the group level (van Mierlo et al., 2009). In that spirit, IRA has been used to support the aggregation of individual ratings to the team level, follower ratings of leadership to the leader level, and organizational culture ratings to the organizational level (see discussions by Rousseau, 1985; Chan, 1998; Kozlowski and Klein, 2000). If consensus in the ratings of a target is low, then the mean rating may be a misleading or inappropriate summary of the underlying ratings (George, 1990; George and James, 1993). Underscoring the importance of IRA statistics is that, unlike interrater reliability and consistency statistics, IRA provides a single value of agreement for each rating target, thereby facilitating identification of units of raters who are very high or very low in agreement. This advantageous feature also permits subsequent investigation of other substantive and theoretically interesting variables that may be related to variance in agreement (Klein et al., 2001; Meade and Eby, 2007), or of agreement as a moderator of predictor-criterion relations (e.g., climate strength; Schneider et al., 2002).
IRA statistics are particularly common when collecting systematic observations of behavior or phenomena. For example, Bernardin and Walter (1977) found that training and diary keeping reduced the errors in performance ratings. O'Neill and Allen (2014) investigated subject-matter experts' ratings of product innovation. Weingart et al. (2004) observed and coded negotiation behavior between teams and reported on methods for doing so. Many more examples exist. The key is that IRA becomes highly relevant when judges observe and provide ratings of behavior or phenomena, and the absolute agreement of those ratings is of interest.
Despite the widespread application of IRA statistics and the extensive research focusing on IRA, it appears that considerable challenges persist. For example, a recent review by Biemann et al. (2012) identified situations in which IRA has been misused in the aggregation of leadership ratings, as ratings were aggregated (or not) based on flawed interpretations of IRA. A possible contributing factor to IRA misuse is that considerations of the logic underlying equations and interpretations of alternative IRA statistics have been relatively scattered across organizational, methodological (e.g., Cohen et al., 2001), and measurement (e.g., Lindell, 2001) journals, thereby making it difficult for researchers and practitioners to contrast the variety of statistics available and to apply them appropriately. LeBreton and Senter (2008) provided a seminal review of IRA and consistency statistics, but the focus was largely on implications of these types of statistics for multilevel research methods and not on the many other applications of IRA (e.g., agreement in importance ratings collected in job analysis; Harvey, 1991). Elsewhere, IRA statistics have been investigated as dispersion measures of substantive constructs in multilevel research in terms of criterion validity (Meade and Eby, 2007), power (e.g., Roberson et al., 2007), significance testing (e.g., Cohen et al., 2009; Pasisz and Hurtz, 2009), and performance under missing data conditions (Allen et al., 2007; Newman and Sin, 2009). Importantly, some existing articles may be seen as highly technical by scholars who are new to the IRA literature (e.g., Cohen et al., 2001), and other reviews tend to focus on only one or two IRA statistics (e.g., Castro, 2002).
Given the above, what is needed is a relatively non-technical and tractable orientation to IRA that facilitates comparison and interpretation of various statistics for scholars. Accordingly, the purpose of this article is to contribute by providing an accessible and digestible IRA resource for researchers and practitioners with a diverse range of training and educational backgrounds who need to interpret or report on IRA. The current article fills a gap by reporting on an introductory comparative analysis involving eight IRA statistics: $r_{wg}$, $r^*_{wg}$, $r'_{wg}$, $r_{wg(p)}$, average deviation (AD), $a_{wg}$, standard deviation ($S_{wg}$), and the coefficient of variation ($CV_{wg}$). A unique contribution is a "quick reference" table containing citations, formulas, interpretations, strengths, and limitations (see Table 1). The aim of Table 1 is to support expedient consideration of the appropriateness of various IRA statistics given a researcher or practitioner's unique situation, and to serve as a foundation for more focused, complex issues addressed in technical guides (e.g., Burke and Dunlap, 2002). Further, three figures attempt to clarify the behavior of IRA statistics and to supplement understanding and interpretation of various IRA statistics. The article introduces James et al.'s (1984) $r_{wg}$, some potential issues with interpretations of that statistic, and numerous contemporary alternatives. Before beginning, a comment on IRA and interrater consistency is offered.
JAMES ET AL.'S IRA: $r_{wg}$ FOR SINGLE AND MULTIPLE ITEMS

General Logic
For use on single-item scales, James et al. (1984; see also Finn, 1970) introduced what is perhaps the most widely used IRA statistic, $r_{wg}$. This statistic is a function of two values: the observed variance in judges' ratings (denoted $S_x^2$), and the variance expected if judges' ratings were random (denoted $\sigma_e^2$ in its general form, and referred to as the null distribution). What constitutes a reasonable standard for random ratings is highly debated. One option, apparently the default in most research, is the rectangular or uniform distribution, calculated with the following (Mood et al., 1974):

$$\sigma_{eu}^2 = \frac{A^2 - 1}{12} \tag{1}$$

where $A$ is the number of discrete Likert response alternatives. This distribution yields the variance obtained if each Likert category had an equal probability of being selected. Observed variance in judges' ratings on a single item can be compared to this index of completely random responding to determine the proportion of error variance present in the ratings:

$$\text{proportion of random variance in judges' ratings} = \frac{S_x^2}{\sigma_{eu}^2} \tag{2}$$

If this value (the proportion of error variance in judges' ratings) is subtracted from 1, the remaining variance can be interpreted as the proportion of variance due to agreement. Hence, the IRA for single-item scales can be expressed as:

$$r_{wg} = 1 - \frac{S_x^2}{\sigma_{eu}^2} \tag{3}$$

Whereas Equation (3) is for single-item scales, James et al. (1984) derived an index for multi-item response scales denoted as $r_{wg(j)}$. It applies the Spearman-Brown prophecy formula (see Nunnally, 1978) to estimate IRA given a certain number of scale items (although James et al., 1984 did not use the Spearman-Brown formula in its derivation; see also LeBreton et al., 2005). Further, the term $S_x^2$ from Equation (3) is replaced with the mean $S_x^2$ derived from judges' ratings on each scale item, denoted $\bar{S}_x^2$, to yield the following:

$$r_{wg(j)} = \frac{J\left[1 - \left(\bar{S}_x^2 / \sigma_{eu}^2\right)\right]}{J\left[1 - \left(\bar{S}_x^2 / \sigma_{eu}^2\right)\right] + \left(\bar{S}_x^2 / \sigma_{eu}^2\right)} \tag{4}$$

where $\sigma_{eu}^2$ is the same as in Equation (1), and $J$ is the number of items.
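The single-item and multi-item calculations described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' own code: the function names and the example ratings are hypothetical, and it assumes the uniform null distribution and the sample (n − 1) variance for the observed ratings.

```python
from statistics import mean, variance


def uniform_null_variance(n_options):
    """Variance of a uniform (rectangular) null over n_options Likert categories."""
    return (n_options ** 2 - 1) / 12


def rwg(ratings, n_options):
    """Single-item r_wg: 1 minus observed variance over the uniform null variance."""
    # statistics.variance uses the n - 1 (sample) denominator.
    return 1 - variance(ratings) / uniform_null_variance(n_options)


def rwg_j(item_ratings, n_options):
    """Multi-item r_wg(j); item_ratings holds one list of judges' ratings per item."""
    n_items = len(item_ratings)
    mean_var = mean([variance(item) for item in item_ratings])
    ratio = mean_var / uniform_null_variance(n_options)
    agreement = n_items * (1 - ratio)
    return agreement / (agreement + ratio)


# Hypothetical data: four judges, a five-point scale, two items.
items = [[4, 4, 5, 5], [4, 4, 5, 5]]
single = rwg(items[0], n_options=5)
multi = rwg_j(items, n_options=5)
```

With identical item variances, the multi-item value is larger than the single-item value, which reflects the Spearman-Brown form of the multi-item index.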
Table 1. Quick reference for IRA statistics. Entries for each statistic follow the table's column order: formula terms, interpretation, strengths, limitations.

$r_{wg}$ (James et al., 1984; see also Finn, 1970)
• A value of 0 indicates agreement equal to the null distribution (i.e., one index of completely random responding).
• Values below 0 or above 1.0 are assumed to be the result of sampling error and should be reset to 0 (see James et al., 1984).
• Commonly used in the literature and generally known to researchers and reviewers.
• Likely the most researched agreement statistic.
• Linear function facilitates interpretation.
• Uniform distribution may inappropriately model random responding, and selecting an alternative null distribution can be difficult (for guidance, see LeBreton and Senter, 2008).
• May not be directly comparable (i.e., equivalent) across different means of group ratings, number of raters, or sample sizes.
• It is not uncommon for values to exceed +1.0 or fall below 0. These inadmissible values might not be the result of sampling error. Resetting the values to 0 may therefore be inappropriate and result in loss of information (Brown and Hauenstein, 2005).

$r_{wg(j)}$ (James et al., 1984)
Terms: $\bar{S}_x^2$ = mean of the observed variance in judges' ratings on each scale item; $\sigma_{eu}^2$ = see above.
• A value of 1.0 indicates complete agreement.
• A value of 0 indicates agreement equal to the null distribution.
• Values below 0 or above 1.0 are assumed to be the result of sampling error and should be reset to 0 (see James et al., 1984).
• Commonly used in the literature and generally known to researchers and reviewers.
• Likely the most researched agreement statistic.
• Same as $r_{wg}$, above.
• May not be directly comparable (i.e., equivalent) across different means of group ratings or the number of raters.
• It is upwardly influenced by the number of discrete Likert scale response options.
• Values between 1.0 and 0 are difficult to interpret because the function is non-linear.

$r^*_{wg}$ (Lindell and Brandt, 1997)
Terms: $1 - (S$ …; $\sigma_{mv}^2$ = variance of the maximum dissensus distribution, $0.5(X_U^2 + X_L^2) - [0.5(X_U + X_L)]^2$.
• If using $\sigma_{eu}^2$, the interpretation is the same as $r_{wg}$, described above.
• If using $\sigma$ …
• $r^*_{wg}$ using $\sigma_{mv}^2$ will tend to be greater than $r^*_{wg}$ using $\sigma_{eu}^2$, and $r_{wg}$ will always be less than $r^*_{wg}$.
• Values below 0 (using $\sigma$ …
• May not be directly comparable (i.e., equivalent) across different means of group ratings.
• Maximum dissensus may inappropriately model random responding, and selecting an alternative null distribution can be difficult (for guidance, see LeBreton and Senter, 2008).
• May be positively correlated with group mean extremity.

$r^*_{wg(j)}$
• Same as $r^*_{wg}$, above.
• With increasing items the function remains linear, unlike $r_{wg(j)}$.
• Same as $r^*_{wg}$, above.

$r'_{wg(j)}$
• Interpretation is otherwise similar to $r^*_{wg(j)}$.
• Less attenuated than $r^*_{wg}$; otherwise the strengths are the same as those of $r^*_{wg(j)}$.
• Shares many of the same limitations as $r^*_{wg(j)}$, except $r'_{wg(j)}$ will often be less attenuated.
• Application has been rare in the literature and, accordingly, researchers and reviewers may be unaware of the underlying logic.

$AD_{M(j)}$ (Burke et al., 1999; Burke and Dunlap, 2002)
Terms: $\sum |x_i - \bar{x}| / k$; $x_i$ = a judge's rating on the item; $\bar{x}$ = the group mean rating on the item; $k$ = the number of judges.
• Indexes the average distance of judges' ratings from the group's scale mean.
• Considerable justification for practical cutoff criteria has been proposed, but the criteria are not without assumptions (see Section Standards for Agreement).
• Interpretation is not complicated by changes (e.g., non-linearity) in the number of Likert categories (bearing in mind greater deviations are expected given category increases).
• Circumvents problems associated with choosing an appropriate null distribution.
• May be negatively correlated with group mean extremity.
• Does not permit explicit modeling of random responding (i.e., has no null distribution term).
• AD values are highly dependent on the number of scale categories employed. This makes it very difficult to compare AD values of scales differing in length.

$AD_{M(J)}$
• Same advantages as $AD_{M(j)}$.
• Takes the average of each $AD_{M(j)}$ and, therefore, does not unnecessarily complicate the multi-item interpretation.
• Same limitations as $AD_{M(j)}$.

Frontiers in Psychology | www.frontiersin.org

$a_{wg(1)}$
• Will equal single and multi-item $r^*_{wg}$ using $\sigma_{eu}^2$ when the group mean is at the midpoint and the variances are not mismatched.
• Controls for the extremeness of the group mean by not relying on a single specification of the null distribution.
• Uses the unbiased, sample variance to calculate observed and theoretical random variance terms, whereas the $r_{wg}$ family of statistics confound these.
• Circumvents problems of inadmissible values.
• Will not be affected by sample size because it employs matched variances.

$a_{wg(J)}$
• Shares interpretations of $a_{wg(1)}$, except generalizes to multi-item scales.
• Takes the average of each $a_{wg(1)}$ and, therefore, does not unnecessarily complicate the multi-item interpretation.

$S_{wg}$ (Schmidt and Hunter, 1989)
Terms: … the group mean rating on the item; $n$ = the number of group members.
• The root of the average squared judge deviation from the mean.
• Provides a straightforward and direct index of agreement.
• Will be scale dependent such that a greater number of response options will tend to produce greater $S_{wg}$.
• Does not permit explicit modeling of random responding (i.e., has no null distribution term).

$S_{wg(J)}$
• Shares interpretations of $S_{wg}$, except generalizes to multi-item scales.
• Same advantages as $S_{wg}$.
• Takes the average of each $S_{wg(1)}$ and, therefore, does not unnecessarily complicate the multi-item interpretation.
• Same limitations as $S_{wg}$.

$CV_{wg}$
• … therefore, application and interpretation of $CV_{wg}$ may be difficult.
• The assumption of a non-negative ratio scale may not always be tenable.
• The $CV_{wg}$ is intended for situations in which means vary widely. If groups tend not to differ much on sample means there is little reason to adopt $CV_{wg}$.
• Does not permit explicit modeling of random responding (i.e., has no null distribution term).

$CV_{wg(J)}$
• Shares interpretations of $CV_{wg}$, except generalizes to multi-item scales.
• Same advantages as $CV_{wg}$.
• Same disadvantages as $CV_{wg}$.

Interpretation
Figure 1 shows the range of $r_{wg}$ values across all possible levels of mean $S_x^2$ based on four raters and a five-point Likert scale (see also Lindell and Brandt, 1997). One observation from Figure 1 is that the single-item $r_{wg}$ is a linear function, such that complete agreement equals 1.0 and uniform disagreement equals 0 (i.e., raters select response options completely at random). But notice that for $S_x^2 > 2$ (that is, where $S_x^2$ exceeds $\sigma_{eu}^2$), $r_{wg}$ takes on negative values. Figure 1 also contains the $r_{wg(j)}$ function ranging from −1.0 to +1.0 across levels of $S_x^2$ based on four raters, a five-point Likert scale, and two, five, and ten items. Consistent with expectations, when $r_{wg(j)}$ is 1.0 agreement is perfect and when $r_{wg(j)}$ is 0 there exists uniform disagreement. However, at all other levels of $r_{wg(j)}$, interpretation is complicated because the shape of the function changes depending on the number of items. Consider that, as the mean $S_x^2$ moves from 0 to 1.5, $r_{wg(j)}$ ranges from 1.0 to 0.40 [$r_{wg(2)}$], 1.0 to 0.63 [$r_{wg(5)}$], and 1.0 to 0.77 [$r_{wg(10)}$], suggesting that $r_{wg(j)}$ is insensitive to substantial changes at reasonable levels of mean $S_x^2$, and that it might imply surprisingly high agreement even when there is considerable variance in judges' ratings. This also illustrates the extent to which the problem increases in severity as the number of items increases. The pattern creates the potential for misleading or inaccurate interpretations when the shape of the function is unknown to the researcher. Another issue is that $S_x^2 > 2$ produces inadmissible values that are outside the boundaries of $r_{wg(j)}$ (i.e., < 0 or > 1.0). Regarding inadmissible values, James et al. (1984) suggested that these may be a result of sampling error. Other possible contributing factors include inappropriate choices of null distributions and the existence of subgroups. One recommended procedure is to set inadmissible values to 0. This could be an undesirable heuristic, however, because it results in lost information (Lindell and Brandt, 1999, 2000; Brown and Hauenstein, 2005).
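The multi-item values quoted above can be checked by hand. The following is a worked sketch, assuming the Spearman-Brown form of $r_{wg(j)}$ from James et al. (1984) and the five-point uniform null variance $\sigma_{eu}^2 = 2$, with mean $S_x^2 = 1.5$ so that the ratio of observed to null variance is 0.75:

```latex
\begin{align*}
r_{wg(J)}  &= \frac{J\,(1 - 0.75)}{J\,(1 - 0.75) + 0.75} \\
r_{wg(2)}  &= \frac{0.50}{0.50 + 0.75} = 0.40 \\
r_{wg(5)}  &= \frac{1.25}{1.25 + 0.75} \approx 0.63 \\
r_{wg(10)} &= \frac{2.50}{2.50 + 0.75} \approx 0.77
\end{align*}
```

By comparison, the single-item $r_{wg}$ at the same variance is $1 - 0.75 = 0.25$, which shows how quickly the multi-item index rises as items are added.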

Potential Cause for Concern
Whereas $r_{wg}$ is arguably the most widely used IRA statistic, there are five issues concerning its interpretation. First, there is the issue of non-linearity described above. This non-linearity, occurring with increased magnitude as the number of scale items increases, renders interpretations of agreement levels ambiguous compared to interpretations of linear functions. The appropriateness of interpretations may be particularly weak if the researcher or practitioner is unaware that the function is non-linear. Indeed, scales with a large number of items will almost always have very high agreement (Brown and Hauenstein, 2005; cf. Lindell and Brandt, 1997; Lindell, 2001), which limits the interpretational and informative value of $r_{wg(j)}$ with scales containing more than a few items. Figure 1 clarifies this. Second, there are difficulties involving inadmissible values, also described above. Resetting these values to 0 or 1.0 seems suboptimal because potentially useful information is arbitrarily discarded. It would be advantageous if that additional information could be used to shed further light on agreement. Third, $r_{wg}$ and $r_{wg(j)}$ appear to be related to the extremity of the mean rating. Brown and Hauenstein (2005) found a correlation of 0.63 between mean judge ratings and $r_{wg(j)}$ values. This is not surprising because mean ratings falling closer to a scale endpoint must have restricted variance (i.e., agreement). Thus, $r_{wg}$ will be affected by the mean rating. Fourth, the typical selection of $\sigma_e^2$, the theoretical distribution of random variance, seems to be the rectangular distribution, described above as $\sigma_{eu}^2$ (Cohen et al., 2009). But $\sigma_{eu}^2$ uses scaling that leads to inadmissible values (i.e., $r_{wg} < 0$), and other distributions may be an improvement (LeBreton and Senter, 2008). Whereas James et al. (1984) offered alternatives to $\sigma_{eu}^2$ that attempt to model response tendencies or biases, in many cases it is difficult to make a choice other than $\sigma_{eu}^2$ that can be defended (for laudable attempts, see Kozlowski and Hults, 1987; LeBreton et al., 2003). One alternative to $\sigma_{eu}^2$, suggested by Lindell and Brandt (1997), seems promising, however (described further below). Fifth, the observed variance in the numerator, $S_x^2$, tends to decrease with sample size, which creates the potential to spuriously increase $r_{wg}$ (Brown and Hauenstein, 2005).
Given the above issues involving James et al.'s (1984) $r_{wg}$, the remainder of this article describes some alternatives and how each alternative was proposed to address at least one of the issues raised. Knowledge of this is intended to help the researcher or practitioner make informed decisions regarding the most applicable statistic (even $r_{wg}$) given his or her unique situation.
$r^*_{wg}$ WITH THE RECTANGULAR NULL AND MAXIMUM DISSENSUS NULL DISTRIBUTIONS

General Logic
In order to overcome the shortcomings of non-linearity and inadmissible values of $r_{wg}$ and $r_{wg(j)}$, an alternative, $r^*_{wg}$, has been proposed (see Table 1). $r^*_{wg}$ using $\sigma_{eu}^2$ is equal to $r_{wg}$, except that $r^*_{wg}$ allows for meaningful negative values down to −1.0. Negative values will occur when $S_x^2$ exceeds the variance of the rectangular distribution, $\sigma_{eu}^2$, and these negative values indicate bimodal distributions. In other words, clusters of raters are at or near the scale endpoints. Unlike $r_{wg}$, which does not consider negative values to be admissible, $r^*_{wg}$ recognizes that this information can provide theoretical insight into the nature of the disagreement. $r^*_{wg(j)}$ with $\sigma_{eu}^2$ uses the same equation as $r_{wg}$ but with the mean variance in the numerator:

$$r^*_{wg(j)} = 1 - \frac{\bar{S}_x^2}{\sigma_{eu}^2}$$

where $\bar{S}_x^2$ is the mean of the item variances of judges' ratings. Figure 2 illustrates that $r^*_{wg(j)}$ has the favorable property of linearity, meaning that it will not be affected by increasing scale items. It has been suggested that interpretation may be aided by keeping the range of admissible values to those of James et al.'s (1984) $r_{wg}$ and $r_{wg(j)}$ (i.e., 0-1.0). This can be done by setting the expected random variance, $\sigma_e^2$, to the maximum possible disagreement, known as maximum dissensus. Maximum dissensus ($\sigma_{mv}^2$) is:

$$\sigma_{mv}^2 = 0.5\left(X_U^2 + X_L^2\right) - \left[0.5\left(X_U + X_L\right)\right]^2$$

where $X_U$ and $X_L$ are the upper and lower discrete Likert categories, respectively (e.g., "5" and "1" on a five-point scale; Lindell, 2001). Maximum dissensus occurs when judges are split evenly between the two scale endpoints, and it can be used in the denominator of the $r^*_{wg}$ or $r^*_{wg(j)}$ equations. For example, for multi-item scales:

$$r^*_{wg(j)} = 1 - \frac{\bar{S}_x^2}{\sigma_{mv}^2}$$

It is instructive to point out that on a five-point scale, $\sigma_{eu}^2$ is 2 and $\sigma_{mv}^2$ is 4. Thus, the use of maximum dissensus essentially rescales James et al.'s (1984) $r_{wg}$ such that all values of $S_x^2$ will result in $r^*_{wg}$ values within the range of 0 and 1.0. This index avoids the problem of non-linearity and the corresponding inflation potential of $r_{wg(j)}$, and it addresses the problem of inadmissible values.

Interpretation
Figure 2 contains functions for $r^*_{wg}$ and $r^*_{wg(j)}$ with $\sigma_{eu}^2$ and $\sigma_{mv}^2$. Values for $r^*_{wg}$ and $r^*_{wg(j)}$ will range from −1.0 to 1.0 if the denominator is $\sigma_{eu}^2$, wherein a value of 0 is uniform disagreement (i.e., $S_x^2 = \sigma_{eu}^2$) and a value of −1.0 is maximum dissensus (i.e., $S_x^2 = \sigma_{mv}^2$). Note the advantage of $r^*_{wg}$ and $r^*_{wg(j)}$ in that information is preserved by assigning a meaningful interpretation to negative values. Values for $r^*_{wg}$ and $r^*_{wg(j)}$ will range from 0 to 1.0 when the denominator is $\sigma_{mv}^2$, wherein a value of 0.5 is uniform disagreement (i.e., $S_x^2 = \sigma_{eu}^2$), and a value of 0 is maximum dissensus (i.e., $S_x^2 = \sigma_{mv}^2$). Taken together, $r^*_{wg}$ and $r^*_{wg(j)}$ potentially address three drawbacks of James et al.'s (1984) statistics. First, negative values are interpretable by incorporating the concept of maximum dissensus. Second, by using the mean $S_x^2$ in the numerator, the multi-item agreement index is not extensively affected by the addition of scale items, which is a major interpretational difficulty of $r_{wg(j)}$. Third, $r^*_{wg}$ and $r^*_{wg(j)}$ have the further advantage of avoiding inadmissible values that exceed +1.0.
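The effect of the two null distributions on $r^*_{wg}$ can be illustrated with a short Python sketch. The function names and ratings below are hypothetical, and the sketch assumes the maximum dissensus variance takes the form $0.5(X_U^2 + X_L^2) - [0.5(X_U + X_L)]^2$ and that observed variances use the sample (n − 1) denominator.

```python
from statistics import mean, variance


def max_dissensus_variance(low, high):
    """Variance when judges are split evenly between the two scale endpoints."""
    return 0.5 * (high ** 2 + low ** 2) - (0.5 * (high + low)) ** 2


def rwg_star(item_ratings, null_variance):
    """r*_wg(j): 1 minus the mean item variance over the chosen null variance.

    With a single item this reduces to the single-item r*_wg.
    """
    mean_var = mean([variance(item) for item in item_ratings])
    return 1 - mean_var / null_variance


# Hypothetical data: one item rated by four judges on a five-point scale.
sigma_eu = (5 ** 2 - 1) / 12               # rectangular (uniform) null
sigma_mv = max_dissensus_variance(1, 5)    # maximum dissensus null
ratings = [[2, 3, 4, 4]]
star_eu = rwg_star(ratings, sigma_eu)
star_mv = rwg_star(ratings, sigma_mv)
```

For the same observed variance, the value computed with the maximum dissensus null is larger than the value computed with the rectangular null, consistent with the rescaling described in the text.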
FURTHER ADVANCES ON $r^*_{wg}$: DISATTENUATED MULTI-ITEM $r^*_{wg}$ ($r'_{wg(j)}$)

General Logic
One of the difficulties with Lindell et al.'s (1999) observed $r^*_{wg}$ statistics, described above, is the use of $\sigma_{mv}^2$ when comparisons with James et al.'s (1984) $r_{wg}$ are of interest (Lindell, 2001). The problem lies in the differences in ranges; James et al.'s (1984) $r_{wg}$ statistics have admissible values within 0 and +1.0, whereas for $r^*_{wg}$ statistics that use $\sigma_{mv}^2$ the admissible range is from −1.0 to +1.0. Two steps could be taken to remedy this problem. First, as mentioned above, $r^*_{wg}$ statistics can be computed with $\sigma_{eu}^2$. This facilitates comparisons, and also allows the researcher to use a multi-item $r^*_{wg}$ that behaves similarly to the single-item $r_{wg}$. But a further problem noted by Lindell (2001) is that $r^*_{wg}$ and $r^*_{wg(j)}$ with $\sigma_{eu}^2$ will be attenuated in comparison to admissible $r_{wg}$ and $r_{wg(j)}$ values. Thus, a second avenue offered by Lindell (2001) to address the relative attenuation of $r^*_{wg(j)}$ using $\sigma_{eu}^2$ is an alternative called $r'_{wg(j)}$. $r'_{wg(j)}$ uses the variance of raters' scale scores on multi-item scales (referred to as $S_y^2$; see Table 1 for derivation details).

Interpretation
Lindell (2001) demonstrated that $r'_{wg(j)}$ tends to produce larger values than does $r^*_{wg(j)}$ using $\sigma_{eu}^2$, thereby addressing the issue of attenuation. Otherwise, $r'_{wg(j)}$ has the same general interpretation as $r^*_{wg(j)}$, although it might be expected to share the limitation of being correlated with group mean extremity. A further difficulty might involve the need to extensively explain $r'_{wg(j)}$ and its logic, as reviewers may not be as familiar with this agreement statistic as they are with more frequently employed agreement indices (see Table 1).
POOLED AGREEMENT FOR SUBGROUPS: r wg(p)

General Logic
As a possible remedy for the problem of inadmissible $r_{wg}$ values that fall below 0 or above 1.0, LeBreton et al. (2005) offered $r_{wg(p)}$. The rationale is that inadmissible values suggest bimodal response distributions, and the different clusters comprise subgroups. Therefore, separate IRA could be computed for each subgroup, which could then be pooled. Accordingly, $r_{wg(p)}$ computes the sample-size weighted average of raters' variance for the two groups, and this value is used in James et al.'s (1984) $r_{wg}$ or $r_{wg(j)}$ (see also Table 1). This will effectively remove the possibility of inadmissible values.
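The pooling idea can be sketched as follows. This is only an illustration under stated assumptions: the function name and data are hypothetical, each subgroup's variance is weighted by its number of raters, and the uniform null is used. The exact weighting and the accompanying homogeneity check should be taken from LeBreton et al. (2005).

```python
from statistics import variance


def rwg_pooled(subgroup_a, subgroup_b, n_options):
    """Single-item r_wg computed from a size-weighted average of subgroup variances."""
    n_a, n_b = len(subgroup_a), len(subgroup_b)
    # Assumption: weights are the subgroup sizes (n_a and n_b).
    pooled_var = (n_a * variance(subgroup_a) + n_b * variance(subgroup_b)) / (n_a + n_b)
    null_var = (n_options ** 2 - 1) / 12  # uniform null variance
    return 1 - pooled_var / null_var


# Hypothetical bimodal group on a five-point scale, split into two clusters.
low_cluster = [1, 1, 2]
high_cluster = [4, 5, 5, 5]
pooled = rwg_pooled(low_cluster, high_cluster, n_options=5)
unpooled = 1 - variance(low_cluster + high_cluster) / 2.0
```

In this example the unpooled value is negative (inadmissible under James et al.'s rules), whereas the pooled value falls between 0 and 1.0.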

Interpretation
There are a few noteworthy drawbacks involving the use of $r_{wg(p)}$. Calculating the pooled $r_{wg(p)}$ requires homogeneity of observed variances (e.g., using Fisher's F-test; see Table 1); otherwise, pooling the variances to calculate $r_{wg(p)}$ may not be justifiable. Another limitation is that these subgroups may be difficult to identify theoretically or a priori; thus, capitalization on chance is possible (LeBreton et al., 2005). This can be contrary to purpose, as most researchers are interested in a pre-specified set of judges (e.g., team membership). Finally, given that $r_{wg(p)}$ is based on $r_{wg}$ and $r_{wg(j)}$, it shares many of the limitations of James et al.'s (1984) statistics. Notwithstanding these limitations, $r_{wg(p)}$ does provide a potentially advantageous extension of $r_{wg}$ and $r_{wg(j)}$ for use when subgroups are suspected.

General Logic
One major difficulty inherent in $r_{wg}$ is the choice of a suitable null distribution. As reviewed above, there is the choice of the rectangular distribution or the maximum dissensus distribution. Moreover, there are other potential distributions, such as skewed bell-shaped distributions, that may more realistically represent null distributions by taking into account factors such as socially desirable responding or acquiescence tendencies (James et al., 1984; see also discussions by Schmidt and DeShon, 2003; LeBreton and Senter, 2008). Importantly, the selected distribution affects the magnitude of IRA statistics, their interpretation, and comparisons to other IRA statistics. To circumvent difficulties in choosing a null distribution, Burke et al. (1999) offered the average deviation index. The average deviation is calculated as the sum of the absolute differences between each judge's rating and the mean rating, divided by the number of judges:

$$AD_{M(j)} = \frac{\sum_{i=1}^{k} \left| x_i - \bar{x} \right|}{k}$$

where $AD_{M(j)}$ is the average deviation of judges' ratings on a given item, $x_i$ is a judge's rating on the item, $\bar{x}$ is the judges' mean rating on the item, and $k$ is the number of judges. When there are multiple items:

$$AD_{M(J)} = \frac{\sum_{j=1}^{J} AD_{M(j)}}{J}$$

where $AD_{M(J)}$ is the average deviation of judges' ratings from the mean judge rating across items, $AD_{M(j)}$ is the average deviation on a given item, and $J$ is the number of scale items. Note that AD can be generalized for use with the median, instead of the mean, in order to minimize the effects of outlier or extreme raters.
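The average deviation described above is simple to compute directly. The sketch below is illustrative only; the function names and ratings are hypothetical, and the `center` argument is an assumed convenience for switching between the mean-based and median-based versions mentioned in the text.

```python
from statistics import mean, median


def average_deviation(ratings, center=mean):
    """AD for one item: mean absolute distance of ratings from their center."""
    c = center(ratings)
    return sum(abs(x - c) for x in ratings) / len(ratings)


def average_deviation_multi(item_ratings, center=mean):
    """Multi-item AD: the average of the single-item AD values."""
    return mean([average_deviation(item, center) for item in item_ratings])


# Hypothetical data: four judges, two items.
ad_item = average_deviation([4, 4, 5, 5])
ad_median = average_deviation([1, 4, 4, 5], center=median)
ad_scale = average_deviation_multi([[4, 4, 5, 5], [3, 3, 3, 3]])
```

Because no null distribution enters the calculation, the result is expressed in the units of the rating scale itself.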

Interpretation
The average deviation approach is advantageous as it provides a direct assessment of IRA without invoking assumptions about the null distribution. Moreover, Burke and Dunlap (2002) made useful inroads for determining cutoffs for supporting aggregation, as they attempt to control for the number of Likert response options by suggesting a cutoff criterion of A/6 (where A is the number of Likert categories; cutoff criteria are discussed further below). On the downside, like $r_{wg}$ statistics, the average deviation will be correlated with the group mean, such that group means closer to the scale extremities tend to yield smaller average deviation values (see Table 1). In addition, whereas some forms of $r_{wg}$ can suffer from inadmissible values, AD has the problem of having no standard range whatsoever. Thus, AD values will be difficult to compare across scales with different numbers of categories.

BROWN AND HAUENSTEIN'S "ALTERNATIVE" ESTIMATE OF IRA: $a_{wg(1)}$

General Logic
Brown and Hauenstein (2005) developed $a_{wg(1)}$ to overcome the limitation of other agreement indices that are correlated with the extremeness of mean ratings. The closer the mean rating is to the scale endpoint (i.e., the extremity of the group mean), the lower the variance in those ratings, and the greater the agreement. This confounds all of the above IRA statistics with the group mean and consequently renders them incomparable across groups with different means. Accordingly, Brown and Hauenstein presented $a_{wg(1)}$, which uses, as a null distribution, the maximum possible variance (i.e., maximum dissensus) given a group's mean:

$$S_{mpv/m}^2 = \left[(H + L)M - M^2 - (H \times L)\right] \times \left[\frac{k}{k - 1}\right]$$

where $S_{mpv/m}^2$ is the maximum possible variance given $k$ raters, $M$ is the observed mean rating, and $H$ and $L$ are the maximum and minimum discrete scale values, respectively. Once the maximum possible variance is known, the single-item $a_{wg}$ is:

$$a_{wg(1)} = 1 - \frac{2 S_x^2}{S_{mpv/m}^2}$$

Note that multiplying $S_x^2$ by 2 is arbitrary, and is done to give the index the same empirical range as James et al.'s (1984) $r_{wg}$. For multi-item scales, the single-item $a_{wg}$ values are averaged:

$$a_{wg(J)} = \frac{\sum_{j=1}^{J} a_{wg(1)j}}{J}$$

Interpretation
Figure 2 contains $a_{wg}$ values for means of 3, 2.5, and 3.5, on a five-point scale. Values of −1.0 indicate maximum dissensus (i.e., judges' ratings are on the scale endpoints as much as possible so as to maximize observed variance, $S_x^2$), 0 indicates that the observed variance is 50% of the maximum variance (i.e., uniform disagreement), and +1.0 indicates perfect agreement, given the group mean. Note that this is the same interpretation as that of the single-item $r_{wg}$, except that $a_{wg}$ is adjusted for the group mean. Moreover, single and multi-item $a_{wg}$ are linear functions, thereby enhancing ease of interpretation (see Figure 2). Notice that $a_{wg}$ values for means departing from the midpoint of the scale are slightly lower, thereby taking into account decreases in maximum dissensus as a result of restricted variance. Finally, $a_{wg}$ will not be influenced by sample sizes or number of scale anchors, which are notable additional advantages.
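A minimal Python sketch of the $a_{wg}$ calculation follows. The function names and ratings are hypothetical, and the maximum possible variance is assumed to take the form $[(H + L)M - M^2 - HL] \times [k/(k - 1)]$ attributed to Brown and Hauenstein (2005), paired with the sample (n − 1) observed variance.

```python
from statistics import mean, variance


def max_variance_given_mean(m, low, high, k):
    """Maximum possible sample variance for k raters whose mean rating is m."""
    return ((high + low) * m - m ** 2 - high * low) * (k / (k - 1))


def awg(ratings, low, high):
    """Single-item a_wg; undefined (division by zero) if the mean sits on an endpoint."""
    k = len(ratings)
    s_mpv = max_variance_given_mean(mean(ratings), low, high, k)
    return 1 - 2 * variance(ratings) / s_mpv


def awg_multi(item_ratings, low, high):
    """Multi-item a_wg: the average of the single-item values."""
    return mean([awg(item, low, high) for item in item_ratings])


# Hypothetical data: four judges on a five-point scale.
perfect = awg([3, 3, 3, 3], low=1, high=5)      # no variance at all
dissensus = awg([1, 1, 5, 5], low=1, high=5)    # judges split across endpoints
scale = awg_multi([[3, 3, 3, 3], [1, 1, 5, 5]], low=1, high=5)
```

The two extreme cases recover the endpoints of the index's range, +1.0 for perfect agreement and −1.0 for maximum dissensus.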
One limitation to Brown and Hauenstein's (2005) $a_{wg}$ is that $S_{mpv/m}^2$ cannot be applied when the mean is extreme (e.g., 4.9 on a 5-point scale). This is because $S_{mpv/m}^2$ assumes that at least one rater falls on each scale endpoint, although this is impossible given some extreme means. Thus, there are boundaries in means, outside of which appropriate maximum variance estimates should not be applied (Brown and Hauenstein, 2005):

$$\text{lower bound} = \frac{L(k - 1) + H}{k}, \qquad \text{upper bound} = \frac{H(k - 1) + L}{k}$$

where $L$ and $H$ are the lowest and highest scale values, and $k$ is the number of judges. This is, however, a relatively modest limitation because mean ratings falling beyond these boundaries are likely to indicate strong agreement, as values close to the endpoints will only occur when agreement is high. Nevertheless, $a_{wg}$ scores exceeding interpretational boundaries cannot be compared at face value to other groups' $a_{wg}$ values. An additional limitation of $a_{wg}$ is that, unlike most other IRA statistics, $a_{wg}$ is based on more than a single parameter (e.g., the observed variance, $S_x^2$). It also includes the mean. As both $S_x^2$ and $\bar{x}$ are affected by sampling error, sampling error may have a greater influence on $a_{wg}$ than on some other IRA statistics (Brown and Hauenstein, 2005). Limitations aside, $a_{wg}$ is advantageous because it controls for the mean rating using a mean-adjusted maximum dissensus null distribution and it has a linear function.
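As a worked illustration of these boundaries, consider a hypothetical group of $k = 5$ judges on a five-point scale ($L = 1$, $H = 5$). Assuming the bounds follow from placing a single rater on one endpoint and all remaining raters on the other, the lowest and highest interpretable means are:

```latex
\begin{align*}
\text{lower bound} &= \frac{L(k - 1) + H}{k} = \frac{1(4) + 5}{5} = 1.8 \\
\text{upper bound} &= \frac{H(k - 1) + L}{k} = \frac{5(4) + 1}{5} = 4.2
\end{align*}
```

Under these assumptions, a group mean of 4.9 falls outside the interval from 1.8 to 4.2, so its $a_{wg}$ value should not be compared at face value with those of other groups.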
Unlike other agreement statistics, $a_{wg}$ matches the variances ($S_x^2$, $S^2_{mpv/m}$) on whether they employ the unbiased (denominator is n − 1) or population-based (denominator is n) variance equations. The $r_{wg}$ family mixes unbiased and population-based variances (i.e., $S_x^2$ and $\sigma^2_{eu}$, respectively), thereby potentially leading to inflation of $S_x^2$ as sample size decreases (see Brown and Hauenstein, 2005). This results in larger values for the $r_{wg}$ family as sample size increases, and, therefore, IRA will almost always be high in large samples (Kozlowski and Hattrup, 1992). Conversely, $a_{wg}$ matches the variances by employing sample-based equations for both $S_x^2$ and $S^2_{mpv/m}$, making $a_{wg}$ independent of sample size. If population-level data are obtained, controls for sample size can be employed by substituting k for k − 1 in both $S_x^2$ and $S^2_{mpv/m}$ (Brown and Hauenstein, 2005; see Table 1).

STANDARD DEVIATION

General Logic
The square root of the variance term used throughout the current article, $S_x^2$, is the standard deviation, $S_{wg}$. As $S_{wg}$ is the square root of the average squared deviations from the mean, Schmidt and Hunter (1989) advocated for $S_{wg}$ as a straightforward index of IRA around which confidence intervals can be computed. Using the standard deviation addresses problems associated with choosing a null distribution and with non-linearity. The average $S_{wg}$ across items can be used in the case of multi-item scales.
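A short Python sketch of $S_{wg}$ follows, using only the standard library. The `population` flag is an illustrative addition that switches between the n − 1 and n denominators; the function names are not taken from any published package.

```python
from statistics import mean, pstdev, stdev


def s_wg(ratings, population=False):
    """Within-group standard deviation of one item's ratings.

    population=False uses the n - 1 denominator (sample-based);
    population=True uses the n denominator.
    """
    return pstdev(ratings) if population else stdev(ratings)


def s_wg_multi(item_ratings, population=False):
    """Average S_wg across the items of a multi-item scale."""
    return mean(s_wg(r, population) for r in item_ratings)


if __name__ == "__main__":
    item_1 = [4, 4, 5, 4, 3]
    item_2 = [2, 5, 1, 4, 3]
    print(s_wg(item_1), s_wg(item_2))
    print(s_wg_multi([item_1, item_2]))
```

Lower values indicate tighter agreement, and the result is expressed in the units of the response scale.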

Interpretation
Advantages of using $S_{wg}$ as an index of IRA are that it is a common measure of variation, and that its interpretation is not complicated by the use of multi-item scales or non-linear functions [see $r_{wg(j)}$]. However, $S_{wg}$ has not always enjoyed widespread application. It cannot be explicitly compared to random response distributions, which could be of interest. It also tends to increase with the number of scale response options, meaning that comparisons across scales are not feasible. Finally, it will tend to decrease with increases in sample size; thus, it is not sample-size independent.

COEFFICIENT OF VARIATION
A problem with most IRA statistics reviewed is that they are scale dependent, making comparisons across scales with widely discrepant numbers of Likert response options problematic. With greater numbers of response options, the variance will tend to increase. Thus, the amount of variance (e.g., $S_x^2$) could partly depend on scaling, thereby presenting a possible source of contamination for many IRA statistics. One way to address this difficulty is to control for the group mean, because means will typically be larger with greater numbers of response options. One statistic that attempts to address this issue is the coefficient of variation ($CV_{wg}$). The $CV_{wg}$ indexes IRA by transforming the standard deviation into a variance estimate that is less scale dependent, using the following:

$$CV_{wg} = \frac{S_{wg}}{M}$$

where $S_{wg}$ is the standard deviation of the ratings and $M$ is the group mean. Multi-item $CV_{wg}$ could be computed by averaging $CV_{wg}$ over the J items.

Interpretation
By dividing the standard deviation (the numerator) by the group mean, $CV_{wg}$ aims to provide an index of IRA that is not severely influenced by choice of scale, thereby facilitating comparisons of IRA across different scales. For example, the $CV_{wg}$ for a sample with a standard deviation of 6 and a mean of 100 would be identical to the $CV_{wg}$ for a sample with a standard deviation of 12 and a mean of 200. Figure 3 contains $CV_{wg}$ for means equal to 50, 100, and 200 with standard deviations ranging from 0 to 15. An inspection of Figure 3 indicates that the $CV_{wg}$ increases faster with increases in standard deviations for low means than for high means, thereby taking into account the difference in variation that may be related to scaling. Thus, the $CV_{wg}$ could be helpful in comparing IRA across scales with different numbers of response options. On the other hand, it is only helpful for relative (to the mean) comparisons, and not absolute comparisons (Allison, 1978; Klein et al., 2001). This can be clarified by observing that the addition of a constant to a set of scores will affect the mean and not the standard deviation, making it difficult to offer meaningful interpretations of absolute $CV_{wg}$s (but ratio scaling helps; Bedeian and Mossholder, 2000). Another issue is that a negative $CV_{wg}$ will occur in the presence of a negative mean, but a negative $CV_{wg}$ is not theoretically interpretable. Thus, a further requirement is non-negative scaling (Roberson et al., 2007).
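The worked example and the added-constant caveat can be checked with a brief Python sketch. The helper names are illustrative, and the guard against a non-positive mean is an assumption of this sketch that reflects the non-negative scaling requirement.

```python
from statistics import mean, stdev


def cv_from_summary(sd, m):
    """CV_wg from a standard deviation and a group mean."""
    if m <= 0:
        raise ValueError("CV_wg requires a positive group mean")
    return sd / m


def cv_wg(ratings):
    """CV_wg computed directly from one item's ratings."""
    return cv_from_summary(stdev(ratings), mean(ratings))


if __name__ == "__main__":
    # Same relative dispersion, different scale metrics.
    print(cv_from_summary(6, 100), cv_from_summary(12, 200))

    # Adding a constant leaves the SD unchanged but shrinks CV_wg.
    base = [1, 2, 3, 4, 5]
    shifted = [x + 10 for x in base]
    print(cv_wg(base), cv_wg(shifted))
```

The second pair of calls shows why absolute $CV_{wg}$ values are hard to interpret without ratio scaling: the two sets of ratings have identical spread but different coefficients.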

STANDARDS FOR AGREEMENT
What constitutes strong agreement among raters? This is an important question for researchers wishing to employ IRA statistics to support and justify decisions. For example, aggregation of individuals' responses to the group (mean) level may assume a certain level of consensus (Chan, 1998). Or, consensus thresholds may be used in the critical incident technique in job analysis, where performance levels of the employees involved in the incident should be agreed upon by experts (Flanagan, 1954). Identifying a unified set of standards for agreement, however, has proven elusive. Two general approaches to identifying standards for agreement have been suggested: statistical and practical. These are considered briefly below.

Practical Standards
Historically, the emphasis on IRA has been on practical standards or "rules of thumb." For example, the $r_{wg}$ family of statistics has relied on the 0.70 rule of thumb. It is important to acknowledge that the decision to choose 0.70 as the cutoff was based on what amounted to no more than a phone call (see personal communication, February 4th, 1987, in George, 1990, p. 110; for a discussion, see LeBreton et al., 2003; Lance et al., 2006), and James et al. (1984) likely never intended for this cutoff to be so strongly, and perhaps blindly, adopted. Nevertheless, a perusal of Figures 1 and 2 clearly shows why common, rule-of-thumb standards for any of these statistics are difficult to support. A value of 0.70 has a different meaning for most statistics in the $r_{wg}$ family. Further, even within a statistic, different situations may render values incomparable. For example, agreement of 0.70 for an $r_{wg(j)}$ based on 10 items vs. an $r_{wg(j)}$ based on two items is a different agreement benchmark because of the non-linearity. Identification of the null distribution is another influencing factor, as Figure 2 clearly shows that use of $\sigma^2_{eu}$ vs. $\sigma^2_{mv}$ changes the interpretation of any absolute $r_{wg}$ value (e.g., 0.70), not to mention other potential null distributions (see LeBreton and Senter, 2008).
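To make the non-linearity point concrete, the sketch below asks how much within-group variance, relative to the null variance, still yields a value of 0.70 as the number of items changes. It assumes the standard multi-item formula of James et al. (1984), $r_{wg(j)} = J(1 - q) / [J(1 - q) + q]$, where $q$ is the mean observed item variance divided by the null variance. The formula is not restated in this section, so treat it as an assumption of the example; the function names are illustrative.

```python
def r_wg_j(variance_ratio, n_items):
    """Multi-item r_wg(j) for a given ratio q of observed to null variance."""
    agreement = n_items * (1 - variance_ratio)
    return agreement / (agreement + variance_ratio)


def variance_ratio_for(target, n_items):
    """Invert r_wg(j): the ratio q that produces the target value."""
    return n_items * (1 - target) / (n_items * (1 - target) + target)


if __name__ == "__main__":
    for j in (1, 2, 10):
        q = variance_ratio_for(0.70, j)
        print(f"J = {j:2d}: observed variance is {q:.0%} of the null variance")
```

Under this formula, a two-item scale reaches 0.70 only when the observed variance is a little under half of the null variance, whereas a 10-item scale reaches 0.70 with observed variance at roughly four fifths of the null variance.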
As noted by Harvey and Hollander (2004), justification for a cutoff of 0.70 is based on an assumption that agreement is similar to reliability, and reliabilities exceeding 0.70 are preferred. However, reliability is about consistency of test scores, not absolute agreement of test scores. Test scores can be perfectly reliable (consistent) but very distinct in absolute quantities. Reliability can, and should be, approached using Generalizability Theory. G-theory involves the systematic investigation of all sources of consistency and error (Cronbach et al., 1972). For example, O'Neill et al. (2015) identified raters as the largest source of variance in performance ratings, rather than ratees or dimensions. Thus, drawing on the 0.70 cutoff from reliability theory is not tenable, as this rule of thumb understates the complexity of reliability. An additional assumption is that a single value (e.g., 0.70) can be meaningfully compared across situations and possibly across statistics in the $r_{wg}$ family. These assumptions seem untenable, and adopting any standard rule of thumb for agreement involving the entire $r_{wg}$ family would appear to be misguided (see Harvey and Hollander, 2004). For many of the same reasons (i.e., incomparability across different situations), there is no clear avenue for setting practical cutoff criteria for $S_{wg}$ and $CV_{wg}$. LeBreton and Senter (2008) proposed that standards for interpreting IRA could follow the general logic advanced by Nunnally (1978; see also Nunnally and Bernstein, 1994). Specifically, cutoff criteria should be more stringent when decisions will be highly impactful on the individuals involved (e.g., performance appraisal for administrative decision making).
Where applicable, LeBreton and Senter (2008) added that cutoff criteria should consider the nature of the theory underlying aggregation for multilevel research, and the quality of the measure (e.g., newly established measures may be expected to show lower IRA than do well-established measures). For application to the $r_{wg}$ family, the following standards were recommended: 0-0.30 (lack of agreement), 0.31-0.50 (weak agreement), 0.51-0.70 (moderate agreement), 0.71-0.90 (strong agreement), and 0.91-1.0 (very strong agreement). Although these standards will have different implications and meaning for different types of $r_{wg}$ and null distributions (consider Figures 1 and 2), LeBreton and Senter (2008) proposed the standards for all forms of $r_{wg}$. Thus, there is a strong "disincentive" to report versions of $r_{wg}$ that will result in the appearance of lower IRA (e.g., using a normal distribution for the null; LeBreton and Senter, p. 836). Nevertheless, they challenged researchers to select the most appropriate $r_{wg}$ by using theory (especially in the identification of a suitable null distribution), with the hope that professional judgment will prevail. Future research will be telling with regard to whether or not researchers adopt LeBreton and Senter's (2008) recommended practices. Turning to other IRA statistics, Burke and Dunlap (2002) suggested that practical significance standards for AD could apply the decision rule A/6 (where A is the number of Likert categories). Thus, for a five-point Likert scale 5/6 = 0.83, and AD values exceeding 0.83 would be seen as not exhibiting strong agreement. But this decision rule makes two assumptions in its derivation (see Burke and Dunlap for details): (a) the basis is in classical test theory, such that interrater reliability should exceed 0.70; and (b) the appropriate null distribution is the rectangular distribution.
If these assumptions can be accepted, then the AD has a sound approach for determining cutoffs for practical significance. But if that "null distribution fails to model disagreement properly, then the interpretability of the resultant agreement coefficient is suspect" (Brown and Hauenstein, 2005, p. 166). Elsewhere, Burke et al. (1999) proposed different criteria. They suggested that AD should not exceed 1.0 for five- and seven-point scales, and AD should not exceed 2.0 for 11-point scales. Finally, it should be noted that Brown and Hauenstein (2005) proposed rules of thumb for $a_{wg}$. Specifically, 0-0.59 was considered unacceptable, 0.60-0.69 was weak, 0.70-0.79 was moderate, and 0.80 and above was strong agreement.
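The practical standards reviewed above can be collected into a small Python lookup, offered as a sketch rather than a recommended decision procedure. The band edges come from the text; how to treat values that fall between the published bands (e.g., 0.305) or below zero is not specified there, so the inclusive upper bounds and the handling of negative values are assumptions of this example.

```python
def ad_cutoff(n_categories):
    """Burke and Dunlap's (2002) A/6 upper limit for AD."""
    return n_categories / 6


def label_r_wg(value):
    """LeBreton and Senter's (2008) bands for the r_wg family.

    Assumption: upper bounds are inclusive and values below 0
    fall into the lowest band.
    """
    if value <= 0.30:
        return "lack of agreement"
    if value <= 0.50:
        return "weak agreement"
    if value <= 0.70:
        return "moderate agreement"
    if value <= 0.90:
        return "strong agreement"
    return "very strong agreement"


def label_a_wg(value):
    """Brown and Hauenstein's (2005) rules of thumb for a_wg."""
    if value < 0.60:
        return "unacceptable"
    if value < 0.70:
        return "weak"
    if value < 0.80:
        return "moderate"
    return "strong"


if __name__ == "__main__":
    print(round(ad_cutoff(5), 2))
    print(label_r_wg(0.70), "|", label_r_wg(0.71))
    print(label_a_wg(0.65))
```

Because the same numeric value can land in different bands depending on the statistic, a lookup like this is best used for reporting alongside the raw values rather than as a pass or fail gate.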

Statistical Standards
Identifying standards for IRA using statistical significance testing involves conducting Monte Carlo simulations or random group resampling. For Monte Carlo simulations, the inputs are the correlation matrix of scale items, the null distribution, and the significance level (see Cohen et al., 2001, 2009; Burke and Dunlap, 2002; Dunlap et al., 2003). Tabled significance values were provided by several researchers (e.g., Dunlap et al., 2003; Cohen et al., 2009). The program R contains commands for running Monte Carlo simulations involving $r_{wg}$ and AD (see Bliese, 2009). The objective is to create a sampling distribution for the IRA statistic with an expected mean and standard deviation, which can be used to generate confidence intervals and significance tests. Random group resampling involves constructing a sampling distribution by repeatedly sampling and forming random groups from observations in the observed data set, and then testing the mean difference in within-group variances between the observed distribution and the randomly generated distribution using a Z-test (Bliese et al., 2000; Bliese and Halverson, 2002; Ludtke and Robitzsch, 2009), for which commands are available in R. Thus, significance testing of the $S_{wg}$ is possible through the random-group resampling approach. Similar logic could be applied to test the significance of $r^*_{wg}$, $a_{wg}$, and $CV_{wg}$, although existing scripts for running these tests may be more difficult to find.
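The Monte Carlo logic can be illustrated with a deliberately simplified Python sketch. It handles only the single-item case, so no item correlation matrix is needed, and it assumes a uniform (rectangular) null distribution with the usual single-item definition $r_{wg} = 1 - S_x^2 / \sigma^2_{eu}$, where $\sigma^2_{eu} = (A^2 - 1)/12$ for A response options. It is not a port of the R routines cited above, and all function names are hypothetical.

```python
import random
from statistics import variance


def r_wg_single(ratings, n_categories):
    """Single-item r_wg against the uniform null variance (A^2 - 1) / 12."""
    sigma2_eu = (n_categories ** 2 - 1) / 12
    return 1 - variance(ratings) / sigma2_eu


def simulate_null(n_raters, n_categories, n_sims=5000, seed=0):
    """Sorted r_wg values for groups that respond purely at random."""
    rng = random.Random(seed)
    return sorted(
        r_wg_single(
            [rng.randint(1, n_categories) for _ in range(n_raters)],
            n_categories,
        )
        for _ in range(n_sims)
    )


def critical_value(null_values, alpha=0.05):
    """Empirical (1 - alpha) quantile of the simulated null distribution."""
    index = min(len(null_values) - 1, int((1 - alpha) * len(null_values)))
    return null_values[index]


def p_value(observed, null_values):
    """Share of random groups agreeing at least as much as observed."""
    return sum(v >= observed for v in null_values) / len(null_values)


if __name__ == "__main__":
    null = simulate_null(n_raters=5, n_categories=5)
    observed = r_wg_single([4, 4, 5, 4, 4], 5)
    print("critical value:", critical_value(null))
    print("observed:", observed, "p:", p_value(observed, null))
```

Rerunning the simulation with more raters shows how the critical value falls as group size grows, which is the power issue raised in the discussion of small groups.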
Statistical significance testing of IRA statistics has its advantages. Cutoff criteria are relatively objective, thereby potentially reducing the misuse that comes from relying on inappropriate or arbitrary rules of thumb (see the Practical Standards section). However, statistical significance testing does not appear to have been widely implemented. One reason might be the novelty of the methods for doing so, and the need to understand and implement commands in R, for example. Another reason might be that statistical agreement is difficult to reach in many commonly encountered practical situations. Specifically, many applications will involve three to five raters, yet $r_{wg(j)}$ needs to be in the range of at least 0.75 and AD would have to fall below 0.40 (Burke and Dunlap, 2002; Cohen et al., 2009) in order to reject the null hypothesis of no agreement. Indeed, Cohen et al. reported that groups with low sample sizes rarely reached levels of statistical significance that would allow the hypothesis of no statistical agreement to be rejected. If statistical significance testing is treated as a hurdle that agreement must clear before the implicated variables can be considered further, there is potential to interfere with advancement of research involving low (but typical) sample sizes. This may not always be the most desirable application of IRA, and, not surprisingly, practical standards have tended to be most common.

Current Best Practice in Judging Agreement Levels
It is important to acknowledge the two divergent purposes of practical and statistical approaches to judging agreement. Practical cutoffs provide decision rules about whether or not agreement seems to have exceeded a minimum threshold in order to justify a decision. Examples of such decisions include aggregation of lower-level data to higher-level units, retention of critical incidents in job analyses, and assessment of whether frame-of-reference training has successfully "calibrated" raters. The use of practical cutoff criteria in these decisions implies that a certain level of agreement is needed in order to make some practical decision in light of the agreement qualities of the data (Burke and Dunlap, 2002).
Statistical standards are not focused on the absolute level of agreement so much as they are concerned with drawing inferences about a population given a sample. Statistical agreement tests the likelihood that the observed agreement in the sample is greater than what would be expected by chance at a certain probability value (e.g., p < 0.05). It involves making inferences about whether the sample was most likely drawn from a population with chance levels of agreement vs. systematic agreement. For example, a set of judges could be asked to rate the job relevance of a personality variable in personality-oriented job analysis (Goffin et al., 2011). If agreement is not significant for a particular variable, it would suggest that there is no systematic agreement in the population of judges (Cohen et al., 2009). Notice that this differs from practical significance, which would posit a cutoff above which agreement levels would be considered adequate for supporting the use of the mean rating as an assessment of the job relevance of the trait (e.g., O'Neill et al., 2011).
Statistical agreement raises issues of power and sample size. Specifically, in small samples statistical agreement will be more difficult to reach than in large samples. Accordingly, outcomes of whether agreement is strong or not may depend on whether one focuses on statistical or practical decision standards, and in large samples, statistical agreement alone should not sufficiently justify aggregation (Cohen et al., 2009). The key point, however, is that statistical significance testing is for determining whether the agreement level for a particular set of judges exceeds chance levels. Practical agreement is about absolute levels of agreement in a sample, which could be seen as strong even for nonsignificant agreement when sample sizes are low.
In light of the above discussion, it is clear that more research is needed in order to identify defensible and practical approaches for judging IRA levels. Best practice recommendations for the interim would involve reporting several IRA statistics, ideally from different families, in order to provide a balanced perspective on IRA. Practical significance levels could be advanced a priori using suggestions described above (e.g., Burke and Dunlap, 2002;Brown and Hauenstein, 2005;LeBreton and Senter, 2008) in order to identify cutoffs for making decisions. Statistical significance would be employed only when inferences about the population are important and when a power analysis suggests sufficient power to detect agreement, although practical standards should also be considered especially when power is very high. Thus, a researcher or practitioner might place little emphasis on statistical significance when he or she is not concerned about generalizing to the population, and when there are very few judges the researcher might be advised to consider a less stringent significance level (e.g., α = 0.10). Importantly, when evaluating agreement in a set of judges, the focus is typically not on whether the sample was drawn from a population with chance or systematic agreement, but whether there is a certain practically meaningful level of agreement. Thus, in many cases practical significance might be most critical.
Interpretations of practical agreement should probably not be threshold-based, all-or-none decision rules applied to a single statistic [e.g., satisfactory vs. unsatisfactory $r_{wg(j)}$]. This is how statistics can be misused to support a decision (see Biemann et al., 2012). Rather, reporting the values from several IRA statistics along with the proposed practical standards of agreement reviewed here will provide some evidence of the quality of the ratings, which can be considered in the context of other important indices that also reflect data quality (e.g., reliability, validity). An overall judgment can then be advanced and the reader (including the reviewer) will also have the necessary information upon which to form his or her own judgment. This procedure fits well within the spirit of the unitary perspective on validity (Messick, 1991; Guion, 1998), which suggests that validity involves an expert judgment on the basis of all the available reliability and validity evidence regarding a construct. It would seem that IRA levels should be considered in the development of this judgment, but it may not be productive to always require an arbitrary level of agreement to support or disconfirm the validity of a measure in a single study. In any case, consequential validity (Messick, 1998, 2000) should be kept in mind, and more research examining the consequences, implications, and meaning of various standards for IRA is needed.

CONCLUSION
IRA statistics are critical to justification of aggregation in multilevel research, but they are also frequently applied in job analysis, performance appraisal, assessment centers, employment interviews, and so forth. Importantly, IRA offers a perspective distinct from reliability because reliability deals with consistency of ratings and agreement deals with the similarity of absolute levels of ratings. IRA has the added advantage of providing one estimate per set of raters, not one estimate for the sample as is the case with reliability. This feature of IRA can be helpful for diagnostic purposes, such as identifying particular groups with high or low IRA.
Despite the prevalence of IRA, articles considering IRA statistics tend to be technically dense (e.g., Cohen et al., 2001), and this might be a reason why $r_{wg}$, with its widely known limitations (e.g., see Brown and Hauenstein, 2005), appears to persevere as the leading statistical choice for IRA. Indeed, a recent review suggested that a lack of a sound understanding of IRA statistics may have led to some misuses (see Biemann et al., 2012). Thus, despite the many alternatives offered (e.g., $r^*_{wg}$, AD, $a_{wg}$, $CV_{wg}$), they may not receive full consideration because accessible, tractable, and non-technical resources describing each within a framework that allows for simple contrasting are not available. LeBreton and Senter (2008) provided solid coverage, but it was mainly with respect to multilevel aggregation issues and not directly applicable to other purposes (e.g., job analysis).
The current article aims to fill a gap in earlier research by offering an introductory source, intended to be useful for scholars with a wide range of backgrounds, in order to facilitate application and interpretation of IRA statistics. Through a comparative analysis of eight IRA statistics, it appears that these statistics are not interchangeable and that they are differentially affected by various contextual details (e.g., number of Likert response options, number of judges, number of scale items). The goal of the article is to facilitate critical and appropriate applications of IRA in the future, offer a foundation for tackling the more technical sources currently available, and make suggestions regarding best practices in light of the insights gleaned through the review. It is proposed that researchers interpret IRA levels with respect to the situation and best-practice recommendations for practical and statistical standards in the literature, as reviewed here. Because of the unique limitations of each statistic, it is probably safe to conclude that more than one statistic should always be reported. In submissions where this has been ignored, reviewers should ask the author to report additional agreement statistics, ideally from other IRA families. Consistent with the unitary perspective on validity, it is suggested that judgments regarding the adequacy of the ratings rely on evidence of IRA in conjunction with additional statistics that shed light on the quality of the data (e.g., reliability coefficients, criterion validity coefficients). Regarding agreement standards, it would seem advisable to evaluate a given IRA statistic using appropriate a priori practical cutoffs and statistical criteria, depending on the purpose of assessing agreement levels. What we need to avoid is the misuse of agreement statistics and the adoption of inappropriate or misleading decision rules. This critical review aims to provide tools to help researchers and practitioners avoid these problems.

AUTHOR CONTRIBUTIONS
The author confirms being the sole contributor of this work and approved it for publication.