An Overview of Interrater Agreement on Likert Scales for Researchers and Practitioners

O'Neill, Thomas A.

doi:10.3389/fpsyg.2017.00777

REVIEW article

Front. Psychol., 12 May 2017

Sec. Organizational Psychology

Volume 8 - 2017 | https://doi.org/10.3389/fpsyg.2017.00777

Thomas A. O'Neill^*

Individual and Team Performance Lab, Department of Psychology, University of Calgary, Calgary, AB, Canada

Applications of interrater agreement (IRA) statistics for Likert scales are plentiful in research and practice. IRA may be implicated in job analysis, performance appraisal, panel interviews, and any other approach to gathering systematic observations. Any rating system involving subject-matter experts can also benefit from IRA as a measure of consensus. Further, IRA is fundamental to aggregation in multilevel research, which is becoming increasingly common in order to address nesting. Although several technical descriptions of a few specific IRA statistics exist, this paper aims to provide a tractable orientation to common IRA indices to support application. The introductory overview is written with the intent of facilitating contrasts among IRA statistics by critically reviewing equations, interpretations, strengths, and weaknesses. Statistics considered include r_wg, $r_{wg}^{*}$, r′_wg, r_wg(p), average deviation (AD), a_wg, standard deviation (S_wg), and the coefficient of variation (CV_wg). Equations support quick calculation and contrasting of different agreement indices. The article also includes a “quick reference” table and three figures in order to help readers identify how IRA statistics differ and how interpretations of IRA will depend strongly on the statistic employed. A brief consideration of recommended practices involving statistical and practical cutoff standards is presented, and conclusions are offered in light of the current literature.

Introduction

The assessment of interrater agreement (IRA) for Likert-type response scales has fundamental implications for a wide range of research and practice. One application of IRA is to quantify consensus in ratings of a target, which is often crucial in job analysis, performance assessment, employment interviews, assessment centers, and so forth (e.g., Brutus et al., 1998; Lindell and Brandt, 1999; Walker and Smither, 1999; Morgeson and Campion, 2000; Harvey and Hollander, 2004). Another application of IRA is to determine the appropriateness of averaging individual survey responses to the group level (van Mierlo et al., 2009). In that spirit, IRA has been used to support the aggregation of individual ratings to the team level, follower ratings of leadership to the leader level, and organizational culture ratings to the organizational level (see discussions by Rousseau, 1985; Chan, 1998; Kozlowski and Klein, 2000). If consensus in the ratings of a target is low, then the mean rating may be a misleading or inappropriate summary of the underlying ratings (George, 1990; George and James, 1993). Underscoring the importance of IRA statistics is that, unlike interrater reliability and consistency statistics, IRA provides a single value of agreement for each rating target, thereby facilitating identification of units of raters who are very high or very low in agreement. This advantageous feature also permits subsequent investigation of other substantive and theoretically interesting variables that may be related to variance in agreement (Klein et al., 2001; Meade and Eby, 2007), or of agreement itself as a moderator of predictor-criterion relations (e.g., climate strength; Schneider et al., 2002).

IRA statistics are particularly common when collecting systematic observations of behavior or phenomena. For example, Bernardin and Walter (1977) found that training and diary keeping reduced the errors in performance ratings. O'Neill and Allen (2014) investigated subject-matter experts' ratings of product innovation. Weingart et al. (2004) observed and coded negotiation behavior between teams and reported on methods for doing so. Many more examples exist. The key is that IRA becomes highly relevant when judges observe and provide ratings of behavior or phenomena, and the absolute agreement of those ratings is of interest.

Despite the widespread application of IRA statistics and the extensive research focusing on IRA, it appears that considerable challenges persist. For example, a recent review by Biemann et al. (2012) identified situations in which IRA was misapplied in the aggregation of leadership ratings, as ratings were aggregated (or not) based on flawed interpretations of IRA. A possible contributing factor to IRA misuse is that considerations of the logic underlying equations and interpretations of alternative IRA statistics have been relatively scattered across organizational (e.g., Lindell and Brandt, 1999), methodological (e.g., Cohen et al., 2001), and measurement (e.g., Lindell, 2001) journals, thereby making it difficult for researchers and practitioners to contrast the variety of statistics available and to readily apply them appropriately. LeBreton and Senter (2008) provided a seminal review of IRA and consistency statistics, but the focus was largely on implications of these types of statistics for multilevel research methods and not on the many other applications of IRA (e.g., agreement in importance ratings collected in job analysis; Harvey, 1991). Elsewhere, IRA statistics have been investigated as dispersion measures of substantive constructs in multilevel research in terms of criterion validity (Meade and Eby, 2007), power (e.g., Roberson et al., 2007), significance testing (e.g., Cohen et al., 2009; Pasisz and Hurtz, 2009), and performance under missing data conditions (Allen et al., 2007; Newman and Sin, 2009). Importantly, some existing articles may be seen as highly technical by scholars who are new to the IRA literature (e.g., Lindell and Brandt, 1999; Cohen et al., 2001), and other reviews tend to focus on only one or two IRA statistics (e.g., Castro, 2002).

Given the above, what is needed is a relatively non-technical and tractable orientation to IRA that facilitates comparison and interpretation of various statistics for scholars. Accordingly, the purpose of this article is to contribute by providing an accessible and digestible IRA resource for researchers and practitioners with a diverse range of training and educational backgrounds who need to interpret or report on IRA. The current article fills a gap by reporting on an introductory comparative analysis involving eight IRA statistics: r_wg, $r_{wg}^{*}$ , r′_wg, r_wg(p), average deviation (AD), a_wg, standard deviation (S_wg), and the coefficient of variation (CV_wg). A unique contribution is a “quick reference” table containing citations, formulas, interpretations, strengths, and limitations (see Table 1). The aim of Table 1 is to support expedient consideration of the appropriateness of various IRA statistics given a researcher or practitioner's unique situation, and to serve as a foundation for more focused, complex issues addressed in technical guides (e.g., Burke and Dunlap, 2002). Further, three figures attempt to clarify the behavior of IRA statistics and to supplement understanding and interpretation of various IRA statistics. The article introduces James et al.'s (1984) r_wg, some potential issues with interpretations of that statistic, and numerous contemporary alternatives. Before beginning, a comment on IRA and interrater consistency is offered.

Table 1. Summary of interrater agreement statistics for Likert-type response scales.

James et al.'s IRA: r_wg for Single and Multiple Items

General Logic

For use on single-item scales, James et al. (1984; see also Finn, 1970) introduced the commonly-used, and perhaps most ubiquitous, IRA statistic known as r_wg. This statistic is a function of two values: the observed variance in judges' ratings (denoted as $S_{x}^{2}$ ), and the variance in judges' ratings if their ratings were random (denoted as $σ_{e}^{2}$ in its general form, referred to as the null distribution). What constitutes a reasonable standard for random ratings is highly debated. One option, apparently the default in most research, is the rectangular or uniform distribution calculated with the following (Mood et al., 1974):

σ_{eu}^{2} = (A^{2} - 1) / 12 (1)

where A is the number of discrete Likert response alternatives. This distribution yields the variance obtained if each Likert category had an equal probability of being selected. Observed variance in judges' ratings on a single item can be compared to this index of completely random responding to determine the proportion of error variance present in the ratings:

proportion of random variance in judges' ratings = S_{x}^{2} / σ_{eu}^{2} (2)

If this value—the proportion of error variance in judges' ratings—is subtracted from 1, the remaining variance can be interpreted as the proportion of variance due to agreement. Hence, the IRA for single item scales can be:

r_{wg} = 1 - (S_{x}^{2} / σ_{eu}^{2}) (3)

Whereas Equation (3) is for single-item scales, James et al. (1984) derived an index for multi-item response scales denoted as r_wg(j). It applies the Spearman-Brown prophecy formula (see Nunnally, 1978) to estimate IRA given a certain number of scale items (although James et al., 1984, did not use the Spearman-Brown formula in its derivation; see also LeBreton et al., 2005). Further, the term $S_{x}^{2}$ from Equation (3) is replaced with the mean $S_{x}^{2}$ derived from judges' ratings on each scale item to yield the following:

r_{wg (j)} = J (1 - \bar{S_{x}^{2}} / σ_{eu}^{2}) / [J (1 - \bar{S_{x}^{2}} / σ_{eu}^{2}) + (\bar{S_{x}^{2}} / σ_{eu}^{2})] (4)

where $σ_{eu}^{2}$ is the same as in Equation (1), and J is the number of items.
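Equations (1), (3), and (4) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and it assumes that $S_{x}^{2}$ is the sample variance (denominator n − 1) of the judges' ratings on an item.

```python
from statistics import mean, variance  # variance() uses the n - 1 denominator


def uniform_null_variance(n_options):
    """Equation (1): variance of the rectangular null for A response options."""
    return (n_options ** 2 - 1) / 12


def rwg(ratings, n_options):
    """Equation (3): single-item r_wg for one list of judges' ratings."""
    return 1 - variance(ratings) / uniform_null_variance(n_options)


def rwg_j(item_ratings, n_options):
    """Equation (4): multi-item r_wg(j); one list of judges' ratings per item."""
    j = len(item_ratings)
    ratio = mean([variance(item) for item in item_ratings]) / uniform_null_variance(n_options)
    return j * (1 - ratio) / (j * (1 - ratio) + ratio)


# Four judges rating one item, then the same pattern on two items (five-point scale).
print(rwg([4, 4, 5, 3], 5))
print(rwg_j([[4, 4, 5, 3], [4, 4, 5, 3]], 5))
```

With identical item variances, the two-item value is already higher than the single-item value, which anticipates the interpretational issue discussed next.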

Interpretation

Figure 1 shows the range of r_wg values across all possible levels of mean $S_{x}^{2}$ based on four raters and a five-point Likert scale (see also Lindell and Brandt, 1997). One observation from Figure 1 is that the single-item r_wg is a linear function, such that complete agreement equals 1.0 and uniform disagreement (i.e., raters select response options completely at random) equals 0. But notice that for S_x² > 2 (that is, where S_x² exceeds $σ_{eu}^{2}$), r_wg takes on negative values. Figure 1 also contains the r_wg(j) function ranging from −1.0 to +1.0 across levels of S_x² based on four raters, a five-point Likert scale, and two, five, and ten items. Consistent with expectations, when r_wg(j) is 1.0 agreement is perfect and when r_wg(j) is 0 there exists uniform disagreement. However, at all other levels of r_wg(j), interpretation is complicated because the shape of the function changes depending on the number of items. Consider that, as the mean S_x² moves from 0 to 1.5, r_wg(j) ranges from 1.0 to 0.40 [r_wg(2)], 1.0 to 0.63 [r_wg(5)], and 1.0 to 0.77 [r_wg(10)], suggesting that r_wg(j) is insensitive to substantial changes at reasonable levels of mean $S_{x}^{2}$, and it might imply surprisingly high agreement even when there is considerable variance in judges' ratings. This also illustrates the extent to which the problem increases in severity as the number of items increases. The pattern creates the potential for misleading or inaccurate interpretations when the shape of the function is unknown to the researcher. Another issue is that S_x² > 2 produces inadmissible values that are outside the boundaries of r_wg(j) (i.e., < 0 or > 1.0). Regarding inadmissible values, James et al. (1984) suggested that these may be a result of sampling error. Other possible contributing factors include inappropriate choices of null distributions and the existence of subgroups. One recommended procedure is to set inadmissible values to 0 (James et al., 1993). This could be an undesirable heuristic, however, because it results in lost information (Lindell and Brandt, 1999, 2000; Brown and Hauenstein, 2005).
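The non-linearity and the inadmissible values can be checked numerically by evaluating Equation (4) directly from a mean item variance. The sketch below assumes a five-point scale (so the rectangular null variance is 2); the function name is hypothetical.

```python
def rwg_j_from_mean_variance(mean_variance, n_items, null_variance=2.0):
    """Equation (4) evaluated from the mean item variance (five-point scale by default)."""
    ratio = mean_variance / null_variance
    return n_items * (1 - ratio) / (n_items * (1 - ratio) + ratio)


# Tabulate r_wg(j) for 1, 2, 5, and 10 items across several mean variances.
# A mean variance of 3.0 exceeds the null variance and yields inadmissible values.
for n_items in (1, 2, 5, 10):
    row = [round(rwg_j_from_mean_variance(v, n_items), 3) for v in (0.0, 0.5, 1.0, 1.5, 3.0)]
    print(n_items, row)
```

At a mean variance of 1.5, the function returns 0.40, 0.625, and roughly 0.77 for two, five, and ten items, and at a mean variance of 3.0 it falls below 0 for two items and above 1.0 for five items.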

Figure 1. Single and multiple-item r_wg across levels of S_x² (five-point scale).

Potential Cause for Concern

Whereas r_wg is arguably the most widely used IRA statistic, there are five issues concerning its interpretation. First, there is the issue of non-linearity described above. This non-linearity, occurring with increased magnitude as the number of scale items increases, renders interpretations of agreement levels ambiguous compared to interpretations of linear functions. The appropriateness of interpretations may be particularly weak if the researcher or practitioner is unaware that the function is non-linear. Indeed, scales with a large number of items will almost always have very high agreement (Brown and Hauenstein, 2005; cf. Lindell and Brandt, 1997; Lindell et al., 1999; Lindell, 2001), which limits the interpretational and informative value of r_wg(j) with scales containing more than a few items. Figure 1 clarifies this. Second, there are difficulties involving inadmissible values, also described above. Resetting these values to 0 or 1.0 seems suboptimal because potentially useful information is arbitrarily discarded. It would be advantageous if that additional information could be used to shed further light on agreement. Third, r_wg and r_wg(j) appear to be related to the extremity of the mean rating. Brown and Hauenstein (2005) found a correlation between mean judge ratings and r_wg(j) values of 0.63. This is not surprising because mean ratings falling closer to the scale endpoint must have restricted variance (i.e., agreement). Thus, r_wg will be affected by the mean rating. Fourth, the typical selection of σ_e², the theoretical distribution of random variance, seems to be the rectangular distribution, described above as σ_eu² (Cohen et al., 2009). But σ_eu² uses scaling that leads to inadmissible values (i.e., r_wg < 0), and other distributions may be an improvement (LeBreton and Senter, 2008). Whereas James et al. (1984) offered alternatives to σ_eu² that attempt to model response tendencies or biases, in many cases it is difficult to make a choice other than $σ_{eu}^{2}$ that can be defended (for laudable attempts, see Kozlowski and Hults, 1987; LeBreton et al., 2003). One alternative to $σ_{eu}^{2}$ suggested by Lindell and Brandt (1997), however, seems promising (described further below). Fifth, the observed variance in the numerator, S_x², tends to decrease with sample size, which creates the potential to spuriously increase r_wg (Brown and Hauenstein, 2005).

Given the above issues involving James et al.'s (1984) r_wg, the remainder of this article describes some alternatives and how each alternative was proposed to address at least one of the issues raised. Knowledge of this is intended to help the researcher or practitioner make informed decisions regarding the most applicable statistic (even r_wg) given his or her unique situation.

$r_{wg}^{*}$ with the Rectangular Null and Maximum Dissensus Null Distributions

General Logic

In order to overcome shortcomings of non-linearity and inadmissible values of r_wg and r_wg(j), Lindell et al. (1999) proposed $r_{wg}^{*}$ . $r_{wg}^{*}$ using σ_eu² is equal to r_wg except $r_{wg}^{*}$ allows for meaningful negative values to −1.0. Negative values will occur when $S_{x}^{2}$ exceeds the variance of the rectangular distribution, $σ_{eu}^{2}$ , and these negative values indicate bimodal distributions. In other words, clusters of raters are at or near the scale end points. Unlike r_wg, which does not consider negative values to be admissible, $r_{wg}^{*}$ recognizes that this information can provide theoretical insight into the nature of the disagreement. $r_{wg (j)}^{*}$ with $σ_{eu}^{2}$ also uses the same equation as does r_wg but instead uses the mean variance in the numerator:

r_{wg (j)}^{*} = 1 - (\bar{S_{x}^{2}} / σ_{eu}^{2}) (5)

where $\bar{S_{x}^{2}}$ is the mean of the item variances of judges' ratings. Figure 2 illustrates that $r_{wg (j)}^{*}$ has the favorable property of linearity, meaning that it will not be affected by increasing the number of scale items. Lindell et al. (1999) suggested that interpretation may be aided by keeping the range of admissible values to those of James et al.'s (1984) r_wg and r_wg(j) (i.e., 0–1.0). Lindell et al. (1999) pointed out that this could be done by setting the expected random variance, σ_e², to the maximum possible disagreement, known as maximum dissensus. Maximum dissensus (σ_mv²) is:

σ_{mv}^{2} = 0.5 (X_{U}^{2} + X_{L}^{2}) - [0.5 (X_{U} + X_{L})]^{2} (6)

where X_U and X_L are the upper and lower discrete Likert categories, respectively (e.g., “5” and “1” on a five-point scale; Lindell, 2001). Maximum dissensus occurs when all judges are distributed evenly at the scale endpoints, and it can be used in the denominator of the $r_{wg}^{*}$ or $r_{wg (j)}^{*}$ equations. For example, for multi-item scales:

r_{wg (j)}^{*} = 1 - (\bar{S_{x}^{2}} / σ_{mv}^{2}) (7)

It is instructive to point out that on a five-point scale, σ_eu² is 2 and σ_mv² is 4. Thus, the use of maximum dissensus essentially rescales James et al.'s (1984) r_wg such that all values of S_x² will result in $r_{wg}^{*}$ values within the range of 0 and 1.0. This index avoids the problem of non-linearity and corresponding inflation potential of r_wg(j) and addresses the problem of inadmissible values.
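The effect of the two null distributions can be sketched as follows. The function names are hypothetical, and the index is computed directly from a mean item variance so that the comparison does not depend on how the observed variance was estimated.

```python
def uniform_null_variance(n_options):
    """Equation (1): rectangular null variance for A response options."""
    return (n_options ** 2 - 1) / 12


def max_dissensus_variance(low, high):
    """Equation (6): variance when judges are split evenly between the scale endpoints."""
    return 0.5 * (high ** 2 + low ** 2) - (0.5 * (high + low)) ** 2


def rwg_star(mean_item_variance, null_variance):
    """Equations (5) and (7): linear index, with the null variance chosen by the analyst."""
    return 1 - mean_item_variance / null_variance


sigma_eu = uniform_null_variance(5)      # 2.0 on a five-point scale
sigma_mv = max_dissensus_variance(1, 5)  # 4.0 on a five-point scale

# Compare the two scalings at perfect agreement, uniform disagreement, and maximum dissensus.
for s2 in (0.0, 2.0, 4.0):
    print(s2, rwg_star(s2, sigma_eu), rwg_star(s2, sigma_mv))
```

The same observed variance maps onto different index values depending on the denominator, so a reported $r_{wg}^{*}$ is only interpretable alongside the null distribution that was used.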

Figure 2. Sample r_wg-family and a_wg statistics across levels of S_x² (five-point scale).

Interpretation

Figure 2 contains functions for $r_{wg}^{*}$ and $r_{wg (j)}^{*}$ with σ_eu² and $σ_{mv}^{2}$. Values for $r_{wg}^{*}$ and $r_{wg (j)}^{*}$ will range from −1.0 to 1.0 if the denominator is σ_eu², wherein a value of 0 is uniform disagreement (i.e., S_x² = σ_eu²) and a value of −1.0 is maximum dissensus (i.e., S_x² = σ_mv²). Note the advantage of $r_{wg}^{*}$ and $r_{wg (j)}^{*}$ in that information is preserved by assigning a meaningful interpretation to negative values. Values for $r_{wg}^{*}$ and $r_{wg (j)}^{*}$ will range from 0 to 1.0 when the denominator is σ_mv², wherein a value of 0.5 is uniform disagreement (i.e., S_x² = σ_eu²), and a value of 0 is maximum dissensus (i.e., S_x² = σ_mv²). Taken together, $r_{wg}^{*}$ and $r_{wg (j)}^{*}$ potentially address three drawbacks of James et al.'s (1984) statistics. First, negative values are interpretable by incorporating the concept of maximum dissensus. Second, by using the mean $S_{x}^{2}$ in the numerator, the multi-item agreement index is not extensively affected by the addition of scale items, which is a major interpretational difficulty of r_wg(j). Third, $r_{wg}^{*}$ and $r_{wg (j)}^{*}$ have the further advantage of avoiding inadmissible values that exceed +1.0.

Further Advances on $r_{wg}^{*}$: Disattenuated Multi-Item $r_{wg}^{*}$ (r′_wg(j))

General Logic

One of the difficulties with Lindell et al.'s (1999) $r_{wg}^{*}$ statistics, described above, is the use of $σ_{mv}^{2}$ when comparisons with James et al.'s (1984) r_wg are of interest (Lindell, 2001). The problem lies in the differences in ranges; James et al.'s (1984) r_wg statistics have admissible values within 0 and +1.0, whereas for $r_{wg}^{*}$ statistics that use $σ_{mv}^{2}$ the admissible range is from −1.0 to +1.0. Two steps could be taken to remedy this problem. First, as mentioned above, Lindell et al. (1999) observed that $r_{wg}^{*}$ statistics could be computed with $σ_{eu}^{2}$. This facilitates comparisons, and also allows the researcher to use a multi-item $r_{wg}^{*}$ that behaves similarly to the single-item r_wg. But a further problem noted by Lindell (2001) is that $r_{wg}^{*}$ and $r_{wg (j)}^{*}$ with σ_eu² will be attenuated in comparison to admissible r_wg and r_wg(j) values. Thus, a second avenue offered by Lindell (2001) to address the relative attenuation of $r_{wg (j)}^{*}$ using $σ_{eu}^{2}$ is an alternative called r′_wg(j). r′_wg(j) uses the variance of raters' scale scores on multi-item scales (referred to as S_y²; see Table 1 for derivation details):

r'_{wg (j)} = 1 - (S_{y}^{2} / σ_{eu}^{2}) (8)
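A minimal sketch of Equation (8) follows. It assumes, as an illustration only, that each rater's scale score is the mean of that rater's item ratings and that S_y² is the sample variance of those scale scores; the derivation referenced in Table 1 should be consulted for the exact definition. The function name is hypothetical.

```python
from statistics import mean, variance


def rwg_prime_j(rater_item_ratings, n_options):
    """Equation (8): r'_wg(j) from raters' scale scores.

    rater_item_ratings holds one list of item ratings per rater. Treating the
    scale score as the rater's item mean is an assumption of this sketch.
    """
    scale_scores = [mean(items) for items in rater_item_ratings]
    null_variance = (n_options ** 2 - 1) / 12  # rectangular null, Equation (1)
    return 1 - variance(scale_scores) / null_variance


# Four raters, two items each, on a five-point scale.
print(rwg_prime_j([[4, 5], [4, 3], [5, 5], [3, 3]], 5))
```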

Interpretation

Lindell (2001) demonstrated that r′_wg(j) tends to produce larger values than does $r_{wg (j)}^{*}$ using $σ_{eu}^{2}$, thereby addressing the issue of attenuation. Otherwise, r′_wg(j) has the same general interpretation as does $r_{wg (j)}^{*}$, although it might be expected to share the limitation of being correlated with group mean extremity. A further difficulty might involve the need to extensively explain r′_wg(j) and its logic, as reviewers may not be as familiar with this agreement statistic as they are with more frequently employed agreement indices (see Table 1).

Pooled Agreement for Subgroups: r_wg(p)

General Logic

As a possible remedy for the problem of inadmissible r_wg values that fall below 0 or above 1.0, LeBreton et al. (2005) offered r_wg(p). The rationale is that inadmissible values suggest bimodal response distributions, and the different clusters comprise subgroups. Therefore, separate IRA could be computed for each subgroup, which could then be pooled. Accordingly, r_wg(p) computes the sample-size weighted average of raters' variance for the two groups, and this value is used in James et al.'s r_wg or r_wg(j) (see also Table 1). This will effectively remove the possibility of inadmissible values.

Interpretation

There are a few noteworthy drawbacks involving the use of r_wg(p). Calculating the pooled r_wg(p) requires homogeneity of observed variances (e.g., using Fisher's F-test; see Table 1); otherwise, pooling the variances to calculate r_wg(p) may not be justifiable. Another limitation is that these subgroups may be difficult to identify theoretically or a priori; thus, capitalization on chance is possible (LeBreton et al., 2005). This can be contrary to purpose, as most researchers are interested in a pre-specified set of judges (e.g., team membership). Finally, given that r_wg(p) is based on r_wg and r_wg(j), r_wg(p) would share many of the limitations of James et al.'s (1984) statistics. Notwithstanding these limitations, r_wg(p) does provide a potentially advantageous extension of r_wg and r_wg(j) for use when subgroups are suspected.
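The pooling logic can be illustrated with a rough sketch for a single item. Two points are assumptions rather than the published formula: the subgroup variances are weighted by subgroup size (the exact weights should be taken from LeBreton et al., 2005, or Table 1), and the subgroups are supplied by the analyst. The homogeneity-of-variance check mentioned above is not shown, and the function names are hypothetical.

```python
from statistics import variance


def uniform_null_variance(n_options):
    """Equation (1): rectangular null variance for A response options."""
    return (n_options ** 2 - 1) / 12


def rwg_pooled(subgroups, n_options):
    """Sketch of r_wg(p): size-weighted average of subgroup variances in Equation (3).

    Weighting by subgroup size is an assumption of this illustration.
    """
    total = sum(len(group) for group in subgroups)
    pooled_variance = sum(len(group) * variance(group) for group in subgroups) / total
    return 1 - pooled_variance / uniform_null_variance(n_options)


low_cluster = [1, 1, 2]   # hypothetical raters near the bottom of a five-point scale
high_cluster = [5, 5, 4]  # hypothetical raters near the top
all_raters = low_cluster + high_cluster

print(1 - variance(all_raters) / uniform_null_variance(5))  # ordinary r_wg, negative here
print(rwg_pooled([low_cluster, high_cluster], 5))           # pooled value within 0 to 1.0
```

In this constructed example the ordinary r_wg is inadmissible because the response distribution is bimodal, whereas the pooled value reflects the high agreement within each cluster.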

Average Deviation Index

General Logic

One major difficulty inherent in r_wg is the choice of a suitable null distribution. As reviewed above, there is the choice of the rectangular distribution or the maximum dissensus distribution. Moreover, there are other potential distributions, such as skewed bell-shaped distributions, that may more realistically represent null distributions by taking into account factors such as socially-desirable responding or acquiescence tendencies (James et al., 1984; see also discussions by Schmidt and DeShon, 2003; LeBreton and Senter, 2008). Importantly, the selected distribution affects the magnitude of IRA statistics, their interpretation, and comparisons to other IRA statistics. To circumvent difficulties in choosing a null distribution, Burke et al. (1999) offered the average deviation index. The average deviation is calculated by summing the absolute differences between each rater's rating and the mean rating and dividing by the number of raters:

AD_{M (j)} = \sum | x_{i} - \bar{x} | / k (9)

where AD_M(j) is the average deviation of judges' ratings on a given item, x_i is a judge's rating on the item, $\bar{x}$ is the judges' mean rating on the item, and k is the number of judges. When there are multiple items:

AD_{M (J)} = \sum AD_{M (j)} / J (10)

where AD_M(J) is the average deviation of judges' ratings from the mean judge rating across items, AD_M(j) is the average deviation on a given item, and J is the number of scale items. Note that AD can be generalized for use with the median, instead of the mean, in order to minimize the effects of outlier or extreme raters.
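Equations (9) and (10) translate almost directly into code. In the sketch below, the `center` argument is a hypothetical convenience that switches between the mean-based and median-based versions mentioned above.

```python
from statistics import mean, median


def average_deviation(ratings, center=mean):
    """Equation (9): mean absolute deviation of judges' ratings on one item."""
    c = center(ratings)
    return sum(abs(x - c) for x in ratings) / len(ratings)


def average_deviation_scale(item_ratings, center=mean):
    """Equation (10): average of the item-level average deviations across J items."""
    return mean([average_deviation(item, center) for item in item_ratings])


ratings = [4, 4, 5, 1]  # the last judge is an outlier
print(average_deviation(ratings))                  # deviations around the mean
print(average_deviation(ratings, center=median))   # deviations around the median
print(average_deviation_scale([[4, 4, 5, 3], [4, 4, 5, 1]]))
```

No null distribution appears anywhere in the calculation, which is the main appeal of the index.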

Interpretation

The average deviation approach is advantageous as it provides a direct assessment of IRA without invoking assumptions about the null distribution. Moreover, Burke and Dunlap (2002) made useful inroads for determining cutoffs for supporting aggregation, as they attempted to control for the number of Likert response options by suggesting a cutoff criterion of A/6 (where A is the number of Likert categories; cutoff criteria are discussed further below). On the downside, like r_wg statistics, the average deviation will be correlated with the group mean such that means closer to the extremities will be negatively related to average deviation values (see Table 1). In addition, whereas some forms of r_wg can suffer from inadmissible values, AD has the problem of having no standard range whatsoever. Thus, AD values will be difficult to compare across scales with different numbers of categories.

Brown and Hauenstein's “Alternative” Estimate of IRA: a_wg(1)

General Logic

Brown and Hauenstein (2005) developed the a_wg(1) to overcome the limitation of other agreement indices that are correlated with the extremeness of mean ratings. The closer the mean rating is to the scale endpoint (i.e., the extremity of the group mean), the lower the variance in those ratings, and the greater the agreement. This confounds all of the above IRA statistics with the group mean and consequently renders them incomparable across groups with different means. Accordingly, Brown and Hauenstein presented a_wg(1), which uses, as a null distribution, the maximum possible variance (i.e., maximum dissensus) given a group's mean:

S_{mpv / m}^{2} = [(H + L) M - M^{2} - (H × L)] × [k / (k - 1)] (11)

where S_mpv/m² is the maximum possible variance given k raters, M is the observed mean rating, and H and L are the maximum and minimum discrete scale values, respectively. Once the maximum possible variance is known, the single-item a_wg is:

a_{wg} = 1 - [(2 × S_{x}^{2}) / S_{mpv / m}^{2}] (12)

Note that multiplying S_x² by 2 is arbitrary, and is done to give it the same empirical range as James et al.'s (1984) r_wg. For multi-item scales, the single-item a_wgs are averaged:

a_{wg (j)} = \sum a_{wg (1)} / J (13)
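Equations (11) through (13) can be sketched as follows, assuming the sample variance (denominator k − 1) for S_x² so that it matches the k / (k − 1) term in Equation (11). Function names are hypothetical.

```python
from statistics import mean, variance


def max_variance_given_mean(m, low, high, k):
    """Equation (11): maximum possible sample variance for k raters with mean m."""
    return ((high + low) * m - m ** 2 - high * low) * (k / (k - 1))


def awg(ratings, low, high):
    """Equation (12): single-item a_wg.

    Undefined (division by zero) when every rating sits on a scale endpoint,
    because the maximum possible variance is then 0.
    """
    s_mpv = max_variance_given_mean(mean(ratings), low, high, len(ratings))
    return 1 - (2 * variance(ratings)) / s_mpv


def awg_j(item_ratings, low, high):
    """Equation (13): mean of the single-item a_wg values across J items."""
    return mean([awg(item, low, high) for item in item_ratings])


print(awg([4, 4, 5, 3], 1, 5))   # moderate agreement around a mean of 4
print(awg([1, 1, 5, 5], 1, 5))   # judges split between the endpoints
print(awg_j([[4, 4, 5, 3], [3, 3, 3, 3]], 1, 5))
```

The second call reproduces maximum dissensus: the observed variance equals the maximum possible variance for that mean, so the index reaches its lower bound.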

Interpretation

Figure 2 contains a_wg values for means of 3, 2.5, and 3.5, on a five-point scale. A value of −1.0 indicates maximum dissensus (i.e., judges' ratings are on the scale endpoints as much as possible so as to maximize observed variance, S_x²), 0 indicates that the observed variance is 50% of the maximum variance (i.e., uniform disagreement), and +1.0 indicates perfect agreement, given the group mean. Note that this is the same interpretation as that of the single-item r_wg, except that a_wg is adjusted for the group mean. Moreover, single and multi-item a_wg are linear functions, thereby enhancing ease of interpretation (see Figure 2). Notice that a_wg values for means departing from the midpoint of the scale are slightly lower, thereby taking into account decreases in maximum dissensus as a result of restricted variance. Finally, a_wg will not be influenced by sample size or the number of scale anchors, which are notable additional advantages.

One limitation of Brown and Hauenstein's (2005) a_wg is that S_mpv/m² cannot be applied when the mean is extreme (e.g., 4.9 on a 5-point scale). This is because S_mpv/m² assumes that at least one rater falls on each scale endpoint, although this is impossible given some extreme means. Thus, there are boundaries in means, outside of which appropriate maximum variance estimates should not be applied (Brown and Hauenstein, 2005):

Minimum mean with interpretable a_{wg} = [L (k - 1) + H] / k (14)

Maximum mean with interpretable a_{wg} = [H (k - 1) + L] / k (15)

where L and H are the lowest and highest scale values, and k is the number of judges. This is, however, a relatively modest limitation because mean ratings falling beyond these boundaries are likely to indicate strong agreement, as values close to the endpoints will only occur when agreement is high. Nevertheless, a_wg scores exceeding interpretational boundaries cannot be compared at face value to other groups' a_wg values. An additional limitation of a_wg is that, unlike most other IRA statistics, a_wg is based on more than a single parameter (e.g., the observed variance, $S_{x}^{2}$). It also includes the mean. As both S_x² and $\bar{x}$ are affected by sampling error, sampling error may have a greater influence on a_wg than on some other IRA statistics (Brown and Hauenstein, 2005). Limitations aside, a_wg is advantageous because it controls for the mean rating using a mean-adjusted maximum dissensus null distribution and it has a linear function.
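The boundaries in Equations (14) and (15) depend only on the scale endpoints and the number of judges, so they are easy to check before interpreting a_wg. A small sketch with a hypothetical function name:

```python
def awg_mean_bounds(low, high, k):
    """Equations (14) and (15): range of group means for which a_wg is interpretable."""
    minimum = (low * (k - 1) + high) / k
    maximum = (high * (k - 1) + low) / k
    return minimum, maximum


# Five-point scale with 4 judges and with 10 judges.
for k in (4, 10):
    lower, upper = awg_mean_bounds(1, 5, k)
    print(k, lower, upper, lower <= 4.9 <= upper)
```

For the 4.9 example on a five-point scale, the mean falls outside the interpretable range for both group sizes shown, and the range widens as the number of judges grows.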

Unlike other agreement statistics, a_wg matches the variances (S_x², S_mpv/m²) on whether they employ the unbiased (denominator is n − 1) or population-based (denominator is n) variance equations. The r_wg family mixes unbiased and population-based variances (i.e., S_x² and σ_eu², respectively), thereby potentially leading to inflation of S_x² as sample size decreases (see Brown and Hauenstein, 2005). This results in larger values for the r_wg family as sample size increases, and, therefore, IRA will almost always be high in large samples (Kozlowski and Hattrup, 1992). Conversely, a_wg matches the variances by employing sample-based equations for both S_x² and S_mpv/m², making a_wg independent of sample size. If population-level data are obtained, controls for sample size can be employed by substituting k for k − 1 in both S_x² and S_mpv/m² (Brown and Hauenstein, 2005; see Table 1).

Standard Deviation

General Logic

The square root of the variance term used throughout the current article, S_x², is the standard deviation, S_wg. As S_wg is the square root of the average squared deviations from the mean, Schmidt and Hunter (1989) advocated for S_wg as a straightforward index of IRA around which confidence intervals can be computed. Using the standard deviation addresses problems associated with choosing a null distribution and of non-linearity. The average S_wg across items can be used in the case of multi-item scales.

Interpretation

Advantages of using S_wg as an index of IRA are that it is a common measure of variation, and its interpretation is not complicated by the use of multi-item scales or non-linear functions [see r_wg(j)]. However, S_wg has not always enjoyed widespread application. It cannot be explicitly compared to random response distributions, and this could be of interest. It also tends to increase with the number of scale response options, meaning that comparisons across scales are not feasible. Finally, it will also tend to decrease with increases in sample size; thus, it will not be sample-size independent.

Coefficient of Variation

A problem with most IRA statistics reviewed is that they are scale dependent, making comparisons across scales with widely discrepant numbers of Likert response options problematic. With greater numbers of response options, the variance will tend to increase. Thus, the amount of variance (e.g., S_x²) could partly depend on scaling, thereby presenting a possible source of contamination for many IRA statistics. One way to address this difficulty is to control for the group mean, because means will typically be larger with greater numbers of response options. One statistic that attempts to address this issue is the coefficient of variation (CV_wg). The CV_wg indexes IRA by transforming the standard deviation into a variance estimate that is less scale dependent, using the following:

CV_{wg} = \sqrt{\sum (x_{i} - \bar{x})^{2} / (n - 1)} / \bar{x} (16)

Multi-item CV_wg could be computed by averaging CV_wg over the J items.

Interpretation

By dividing the standard deviation (the numerator) by the group mean, CV_wg aims to provide an index of IRA that is not severely influenced by choice of scale, thereby facilitating comparisons of IRA across different scales. For example, the CV_wg for a sample with a standard deviation of 6 and a mean of 100 would be identical to the CV_wg for a sample with a standard deviation of 12 and a mean of 200. Figure 3 contains CV_wg for means equal to 50, 100, and 200 with standard deviations ranging from 0 to 15. An inspection of Figure 3 indicates that the CV_wg increases faster with increases in standard deviations for low means than for high means, thereby taking into account the difference in variation that may be related to scaling. Thus, the CV_wg could be helpful in comparing IRA across scales with different numbers of response options. On the other hand, it is only helpful for relative (to the mean) comparisons, and not absolute comparisons (Allison, 1978; Klein et al., 2001). This can be clarified by observing that the addition of a constant to a set of scores will affect the mean and not the standard deviation, making it difficult to offer meaningful interpretations of absolute CV_wgs (but ratio scaling helps; Bedeian and Mossholder, 2000). Another issue is that negative CV_wg will occur in the presence of a negative mean, but a negative CV_wg is not theoretically interpretable. Thus, a further requirement is non-negative scaling (Roberson et al., 2007).
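The scale-adjustment property and the additive-constant problem can both be seen in a short sketch of Equation (16). The function name and the example scores are hypothetical, and the check for a positive mean reflects the non-negative scaling requirement noted above.

```python
from statistics import mean, stdev  # stdev() uses the n - 1 denominator


def cv_wg(ratings):
    """Equation (16): within-group standard deviation divided by the group mean."""
    m = mean(ratings)
    if m <= 0:
        raise ValueError("CV_wg is not interpretable for a zero or negative mean")
    return stdev(ratings) / m


base = [94, 100, 106]               # standard deviation 6, mean 100
doubled = [2 * x for x in base]     # standard deviation 12, mean 200
shifted = [x + 100 for x in base]   # standard deviation 6, mean 200

print(cv_wg(base), cv_wg(doubled), cv_wg(shifted))
```

Doubling every score leaves CV_wg unchanged, whereas adding a constant halves it even though the spread of the ratings is identical, which is why absolute CV_wg values are hard to interpret without ratio scaling.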

Figure 3. CV_wg from 0 to 15 within-group standard deviations (S_wg) and means of 50, 100, and 200.
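
The behavior of CV_wg described in this section can be checked with a short Python sketch of Equation 16. The function name `cv_wg` and the example ratings are hypothetical and serve only as illustration; the sketch assumes a positive group mean, in line with the non-negative scaling requirement noted above.

```python
from statistics import mean, stdev


def cv_wg(ratings):
    """Coefficient of variation: sample SD (n - 1 denominator) divided by the mean."""
    m = mean(ratings)
    if m <= 0:
        # A zero or negative mean gives an undefined or uninterpretable CV_wg.
        raise ValueError("CV_wg requires a positive group mean")
    return stdev(ratings) / m


# Hypothetical ratings from five judges on a 5-point scale.
five_point = [2, 3, 3, 4, 4]

# Doubling every score doubles both the SD and the mean, so CV_wg is unchanged.
doubled = [2 * x for x in five_point]

# Adding a constant changes the mean but not the SD, so CV_wg shrinks.
shifted = [x + 10 for x in five_point]

for label, data in [("original", five_point), ("doubled", doubled), ("shifted", shifted)]:
    print(label, round(cv_wg(data), 3))
```

The doubled scores mirror the example of a standard deviation of 6 with a mean of 100 versus a standard deviation of 12 with a mean of 200, whereas the shifted scores show why absolute CV_wg values are hard to interpret without ratio scaling.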

Standards for Agreement

What constitutes strong agreement among raters? This is an important question because researchers wish to employ IRA statistics to support and justify decisions. For example, aggregation of individuals' responses to the group (mean) level may assume a certain level of consensus (Chan, 1998). Or, consensus thresholds may be used in the critical incident technique in job analysis, where performance levels of the employees involved in the incident should be agreed upon by experts (Flanagan, 1954). Identifying a unified set of standards for agreement, however, has proven elusive. Two general approaches to identifying standards for agreement have been suggested: statistical and practical. These are considered briefly below.

Practical Standards

Historically, the emphasis on IRA has been on practical standards or “rules of thumb.” For example, the r_wg family of statistics has relied on the 0.70 rule of thumb. It is important to acknowledge that the decision to choose 0.70 as the cutoff was based on what amounted to no more than a phone call (see personal communication, February 4th, 1987, in George, 1990, p. 110; for a discussion, see LeBreton et al., 2003; Lance et al., 2006), and James et al. (1984) likely never intended for this cutoff to be so strongly, and perhaps blindly, adopted. Nevertheless, a perusal of Figures 1 and 2 clearly shows why common rule-of-thumb standards for any of these statistics are difficult to support. A value of 0.70 has a different meaning for most statistics in the r_wg family. Further, even within a single statistic, different situations may render values incomparable. For example, agreement of 0.70 for an r_wg(j) based on 10 items vs. an r_wg(j) based on two items is a different agreement benchmark because of the non-linearity. Identification of the null distribution is another influencing factor, as Figure 2 clearly shows that use of $\sigma_{mv}^{2}$ vs. $\sigma_{eu}^{2}$ changes the interpretation of any absolute r_wg value (e.g., 0.70), not to mention other potential null distributions (see LeBreton and Senter, 2008).
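
To make the dependence on the number of items and on the null distribution concrete, the following Python sketch evaluates r_wg and r_wg(j) for one hypothetical level of observed variance. It assumes the usual James et al. (1984) definitions and the rectangular and maximum dissensus null variances. The function names and the observed variance of 1.0 are illustrative choices, not values taken from any study.

```python
def sigma_eu(n_options):
    """Variance of the rectangular (uniform) null for A response options."""
    return (n_options ** 2 - 1) / 12


def sigma_mv(low, high):
    """Variance of the maximum dissensus null (half the raters at each endpoint)."""
    return 0.5 * (low ** 2 + high ** 2) - (0.5 * (low + high)) ** 2


def r_wg(s2, null_var):
    """Single-item r_wg: one minus the ratio of observed to null variance."""
    return 1 - s2 / null_var


def r_wg_j(mean_s2, null_var, n_items):
    """Multi-item r_wg(j), given the mean observed item variance over J items."""
    ratio = mean_s2 / null_var
    return n_items * (1 - ratio) / (n_items * (1 - ratio) + ratio)


observed_var = 1.0  # hypothetical mean item variance on a 5-point scale

for name, null_var in [("rectangular", sigma_eu(5)), ("max dissensus", sigma_mv(1, 5))]:
    print(name, "single item:", round(r_wg(observed_var, null_var), 3))
    for j in (2, 10):
        print(name, "J =", j, ":", round(r_wg_j(observed_var, null_var, j), 3))
```

With the observed variance held constant, the index rises as items are added and as the null variance grows, which is why a fixed cutoff such as 0.70 does not represent the same amount of agreement across these situations.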

As noted by Harvey and Hollander (2004), justification for a cutoff of 0.70 is based on an assumption that agreement is similar to reliability, and reliabilities exceeding 0.70 are preferred. However, reliability is about consistency of test scores, not absolute agreement of test scores. Test scores can be perfectly reliable (consistent) but very distinct in absolute quantities. Reliability can, and should, be approached using Generalizability Theory. G-theory involves the systematic investigation of all sources of consistency and error (Cronbach et al., 1972). For example, O'Neill et al. (2015) identified raters as the largest source of variance in performance ratings, rather than ratees or dimensions. Thus, drawing on the 0.70 cutoff from reliability theory is not tenable, as this rule of thumb understates the complexity of reliability. An additional assumption is that a single value (e.g., 0.70) would be meaningfully compared across situations and possibly statistics in the r_wg family. These assumptions seem untenable, and adopting any standard rule of thumb for agreement involving the entire r_wg family would appear to be misguided (see Harvey and Hollander, 2004). For many of the same reasons (i.e., incomparability across different situations), there is no clear avenue for setting practical cutoff criteria for S_wg and CV_wg.
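
The distinction between consistency and absolute agreement can be illustrated with a small Python sketch. The two raters and their scores are hypothetical, and single-item r_wg with a rectangular null on a 5-point scale is used here only as one convenient agreement index.

```python
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Hypothetical ratings of three targets on a 5-point scale.
# Rater B ranks the targets exactly as rater A does, but is two points more lenient.
rater_a = [1, 2, 3]
rater_b = [3, 4, 5]

consistency = pearson_r(rater_a, rater_b)

# Single-item r_wg per target, using the rectangular null variance (A^2 - 1) / 12.
null_var = (5 ** 2 - 1) / 12
agreement = [1 - variance([a, b]) / null_var for a, b in zip(rater_a, rater_b)]

print("correlation between raters:", consistency)
print("r_wg per target:", agreement)
```

In this constructed case the raters are perfectly consistent, yet r_wg indicates no agreement for any target, so a reliability benchmark says little about absolute agreement.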

LeBreton and Senter (2008) proposed that standards for interpreting IRA could follow the general logic advanced by Nunnally (1978; see also Nunnally and Bernstein, 1994). Specifically, cutoff criteria should be more stringent when decisions will be highly impactful on the individuals involved (e.g., performance appraisal for administrative decision making). Where applicable, LeBreton and Senter (2008) added that cutoff criteria should consider the nature of the theory underlying aggregation for multilevel research, and the quality of the measure (e.g., newly-established measures may be expected to show lower IRA than do well-established measures). For application to the r_wg family, the following standards were recommended: 0–0.30 (lack of agreement), 0.31–0.50 (weak agreement), 0.51–0.70 (moderate agreement), 0.71–0.90 (strong agreement), and 0.91–1.0 (very strong agreement). Although these standards will have different implications and meaning for different types of r_wg and null distributions (consider Figures 1 and 2), LeBreton and Senter (2008) proposed the standards for all forms of r_wg. Thus, there is a strong “disincentive” to report versions of r_wg that will result in the appearance of lower IRA (e.g., using a normal distribution for the null; LeBreton and Senter, 2008, p. 836). Nevertheless, they challenged researchers to select the most appropriate r_wg by using theory (especially in the identification of a suitable null distribution), with the hope that professional judgment will prevail. Future research will be telling with regard to whether or not researchers adopt LeBreton and Senter's (2008) recommended practices.
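
For readers who wish to apply these bands programmatically, a minimal Python sketch follows. The names `R_WG_BANDS` and `interpret_r_wg` are hypothetical, and treating each upper bound as inclusive is an assumption made here to close the small gaps between the published bands (e.g., between 0.30 and 0.31).

```python
# Bands recommended by LeBreton and Senter (2008) for the r_wg family.
# Each tuple is (upper bound, label); inclusive upper bounds are an assumption.
R_WG_BANDS = [
    (0.30, "lack of agreement"),
    (0.50, "weak agreement"),
    (0.70, "moderate agreement"),
    (0.90, "strong agreement"),
    (1.00, "very strong agreement"),
]


def interpret_r_wg(value):
    """Return the verbal label for an r_wg value between 0 and 1."""
    if not 0 <= value <= 1:
        # Out-of-range estimates are not covered by the recommended bands.
        raise ValueError("the bands cover only values from 0 to 1")
    for upper, label in R_WG_BANDS:
        if value <= upper:
            return label


for estimate in (0.25, 0.70, 0.71, 0.95):
    print(estimate, interpret_r_wg(estimate))
```

As the surrounding discussion makes clear, the label attached to a value still depends on which r_wg variant and null distribution produced it, so the lookup is a reporting aid rather than a decision rule.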

Turning to other IRA statistics, Burke and Dunlap (2002) suggested that practical significance standards for AD could apply the decision rule A/6 (where A is the number of Likert categories). Thus, for a five-point Likert scale, A/6 = 5/6 = 0.83, and AD values exceeding 0.83 would be seen as not exhibiting strong agreement. But this decision rule makes two assumptions in its derivation (see Burke and Dunlap, 2002, for details): (a) the basis is classical test theory, under which interrater reliability should exceed 0.70; and (b) the appropriate null distribution is the rectangular distribution. If these assumptions can be accepted, then the AD has a sound approach for determining cutoffs for practical significance. But if that “null distribution fails to model disagreement properly, then the interpretability of the resultant agreement coefficient is suspect” (Brown and Hauenstein, 2005, p. 166). Elsewhere, Burke et al. (1999) proposed different criteria. They suggested that AD should not exceed 1.0 for five- and seven-point scales, and AD should not exceed 2.0 for 11-point scales. Finally, it should be noted that Brown and Hauenstein (2005) proposed rules of thumb for a_wg. Specifically, 0–0.59 was considered unacceptable, 0.60–0.69 weak, 0.70–0.79 moderate, and 0.80 and above strong agreement.
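
The AD and a_wg rules of thumb can be sketched in Python as follows. The function names and the example ratings are hypothetical. The sketch assumes AD is computed about the group mean, and it assumes that a_wg values below 0 fall in the unacceptable band, which the published bands do not state explicitly.

```python
from statistics import mean


def average_deviation(ratings):
    """AD about the mean: average absolute deviation from the group mean."""
    m = mean(ratings)
    return sum(abs(x - m) for x in ratings) / len(ratings)


def ad_upper_limit(n_options):
    """Burke and Dunlap (2002) decision rule: A / 6 for A Likert categories."""
    return n_options / 6


def interpret_a_wg(value):
    """Brown and Hauenstein (2005) bands for a_wg (negative values: assumption)."""
    if value >= 0.80:
        return "strong"
    if value >= 0.70:
        return "moderate"
    if value >= 0.60:
        return "weak"
    return "unacceptable"


# Hypothetical ratings from five judges on a 5-point scale.
ratings = [4, 4, 5, 3, 4]
ad = average_deviation(ratings)
limit = ad_upper_limit(5)

print("AD:", round(ad, 2), "limit:", round(limit, 2), "within limit:", ad <= limit)
print("a_wg of 0.65 is", interpret_a_wg(0.65))
```

The same `ad_upper_limit` call gives roughly 1.17 for a seven-point scale and 1.83 for an 11-point scale, which can be compared with the fixed limits of 1.0 and 2.0 proposed by Burke et al. (1999).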

Statistical Standards

Identifying standards for IRA using statistical significance testing involves conducting Monte Carlo simulations or random group resampling. For Monte Carlo simulations, the input is the correlation matrix of scale items, the null distribution, and the significance level (see Cohen et al., 2001, 2009; Burke and Dunlap, 2002; Dunlap et al., 2003). Tabled significance values were provided by several researchers (e.g., Dunlap et al., 2003; Cohen et al., 2009). The program R contains commands for running Monte Carlo simulations involving r_wg and AD (see Bliese, 2009). The objective is to create a sampling distribution for the IRA statistic with an expected mean and standard deviation, which can be used to generate confidence intervals and significance tests. Random group resampling involves constructing a sampling distribution by repeatedly sampling and forming random groups from observations in the observed data set, and comparing the significance of the mean difference in within-group variances of the observed distribution and the randomly generated distribution using a Z-test (Bliese et al., 2000; Bliese and Halverson, 2002; Ludtke and Robitzsch, 2009), for which commands are available in R. Thus, significance testing of the S_wg is possible through the random-group resampling approach. Similar logic could be applied to test the significance of $r_{wg}^{*}$ , a_wg, and CV_wg, although existing scripts for running these tests may be more difficult to find.
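
The Monte Carlo logic can be sketched in Python for the simplest case. This is a simplified, single-item illustration under a rectangular null, not a reimplementation of the published procedures or of the R commands cited above: it ignores the item correlation matrix needed for multi-item scales, and the function names, seed, and number of replications are arbitrary assumptions.

```python
import random


def sample_variance(values):
    """Unbiased sample variance (n - 1 denominator)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def simulate_null_r_wg(n_raters, n_options, n_sims=5000, seed=1):
    """Sorted single-item r_wg values for groups responding uniformly at random."""
    rng = random.Random(seed)
    null_var = (n_options ** 2 - 1) / 12
    sims = []
    for _ in range(n_sims):
        ratings = [rng.randint(1, n_options) for _ in range(n_raters)]
        sims.append(1 - sample_variance(ratings) / null_var)
    return sorted(sims)


def critical_value(sorted_sims, alpha=0.05):
    """Empirical (1 - alpha) quantile of the simulated null distribution."""
    index = min(int((1 - alpha) * len(sorted_sims)), len(sorted_sims) - 1)
    return sorted_sims[index]


def p_value(observed, sorted_sims):
    """Share of simulated null values at least as large as the observed r_wg."""
    return sum(v >= observed for v in sorted_sims) / len(sorted_sims)


for n_raters in (3, 10):
    null_sims = simulate_null_r_wg(n_raters, n_options=5)
    print(n_raters, "raters: critical r_wg =", round(critical_value(null_sims), 3))
```

Comparing the critical values for three and ten raters gives a sense of how much more agreement a small group must show before chance responding can be rejected.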

Statistical significance testing of IRA statistics has its advantages. Cutoff criteria are relatively objective, thereby potentially reducing the misuse that arises from relying on inappropriate or arbitrary rules of thumb (see below). However, statistical significance testing does not appear to have been widely implemented. One reason might be the novelty of the methods for doing so, and the need to understand and implement commands in R, for example. Another reason might be that statistical agreement is difficult to reach in many commonly encountered practical situations. Specifically, many applications will involve three to five raters, yet r_wg(j) needs to be in the range of at least 0.75 and AD would have to fall below 0.40 (Burke and Dunlap, 2002; Cohen et al., 2009) in order to reject the null hypothesis of no agreement. Indeed, Cohen et al. reported that groups with low sample sizes rarely reached levels of statistical significance that would allow the hypothesis of no statistical agreement to be rejected. If statistical significance testing is treated as a hurdle that agreement must clear before the implicated variables receive further consideration, there is potential to interfere with advancement of research involving low (but typical) sample sizes. This may not always be the most desirable application of IRA, and, not surprisingly, practical standards have tended to be most common.

Current Best Practice in Judging Agreement Levels

It is important to acknowledge the two divergent purposes of practical and statistical approaches to judging agreement. Practical cutoffs provide decision rules about whether or not agreement seems to have exceeded a minimum threshold in order to justify a decision. Examples of such decisions include aggregation of lower-level data to higher-level units, retention of critical incidents in job analyses, and assessment of whether frame-of-reference training has successfully “calibrated” raters. The use of practical cutoff criteria in these decisions implies that a certain level of agreement is needed in order to make some practical decision in light of the agreement qualities of the data (Burke and Dunlap, 2002).

Statistical standards are not focused on the absolute level of agreement so much as they are concerned with drawing inferences about a population given a sample. Statistical agreement tests the likelihood that the observed agreement in the sample is greater than what would be expected by chance at a certain probability value (e.g., p < 0.05). It involves making inferences about whether the sample was most likely drawn from a population with chance levels of agreement vs. systematic agreement. For example, a set of judges could be asked to rate the job relevance of a personality variable in personality-oriented job analysis (Goffin et al., 2011). If agreement is not significant for a particular variable, it would suggest that there is no systematic agreement in the population of judges (Cohen et al., 2009). Notice that this differs from practical significance, which would posit a cutoff above which agreement levels would be considered adequate for supporting the use of the mean rating as an assessment of the job relevance of the trait (e.g., O'Neill et al., 2011).

Statistical agreement raises issues of power and sample size. Specifically, in small samples statistical agreement will be more difficult to reach than in large samples. Accordingly, outcomes of whether agreement is strong or not may depend on whether one focuses on statistical or practical decision standards, and in large samples, statistical agreement alone should not sufficiently justify aggregation (Cohen et al., 2009). The key point, however, is that statistical significance testing is for determining whether the agreement level for a particular set of judges exceeds chance levels. Practical agreement is about absolute levels of agreement in a sample, which could be seen as strong even for non-significant agreement when sample sizes are low.

In light of the above discussion, it is clear that more research is needed in order to identify defensible and practical approaches for judging IRA levels. Best practice recommendations for the interim would involve reporting several IRA statistics, ideally from different families, in order to provide a balanced perspective on IRA. Practical significance levels could be advanced a priori using suggestions described above (e.g., Burke and Dunlap, 2002; Brown and Hauenstein, 2005; LeBreton and Senter, 2008) in order to identify cutoffs for making decisions. Statistical significance would be employed only when inferences about the population are important and when a power analysis suggests sufficient power to detect agreement, although practical standards should also be considered especially when power is very high. Thus, a researcher or practitioner might place little emphasis on statistical significance when he or she is not concerned about generalizing to the population, and when there are very few judges the researcher might be advised to consider a less stringent significance level (e.g., α = 0.10). Importantly, when evaluating agreement in a set of judges, the focus is typically not on whether the sample was drawn from a population with chance or systematic agreement, but whether there is a certain practically meaningful level of agreement. Thus, in many cases practical significance might be most critical.

Interpretations of practical agreement should probably not be threshold-based, all-or-none decision rules applied to a single statistic [e.g., satisfactory vs. unsatisfactory r_wg(j)]. This is how statistics can be misused to support a decision (see Biemann et al., 2012). Rather, reporting the values from several IRA statistics along with proposed practical standards of agreement reviewed here will provide some evidence of the quality of the ratings, which can be considered in the context of other important indices that also reflect data quality (e.g., reliability, validity). An overall judgment can then be advanced and the reader (including the reviewer) will also have the necessary information upon which to form his or her own judgment. This procedure fits well within the spirit of the unitary perspective on validity (Messick, 1991; Guion, 1998), which suggests that validity involves an expert judgment on the basis of all the available reliability and validity evidence regarding a construct. It would seem that IRA levels should be considered in the development of this judgment, but it may not be productive to always require an arbitrary level of agreement to support or disconfirm the validity of a measure in a single study. In any case, consequential validity (Messick, 1998, 2000) should be kept in mind, and more research examining the consequences, implications, and meaning of various standards for IRA is needed.
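
One way to follow the recommendation to report several statistics side by side is sketched below in Python. The function `ira_summary` is a hypothetical helper and the ratings are invented. It assumes a single item, a rectangular null for r_wg, and AD computed about the mean. Statistics such as $r_{wg}^{*}$ and a_wg are omitted only to keep the sketch short.

```python
from statistics import mean, stdev


def ira_summary(ratings, n_options):
    """Collect several single-item IRA statistics for one group of raters."""
    m = mean(ratings)
    s = stdev(ratings)  # within-group standard deviation, S_wg
    null_var = (n_options ** 2 - 1) / 12  # rectangular null variance
    return {
        "r_wg": 1 - s ** 2 / null_var,
        "AD": sum(abs(x - m) for x in ratings) / len(ratings),
        "S_wg": s,
        "CV_wg": s / m,
    }


# Hypothetical ratings from five judges on a 5-point scale.
summary = ira_summary([4, 4, 5, 3, 4], n_options=5)
for name, value in summary.items():
    print(name, round(value, 3))
```

Reporting the full set lets readers weigh each index against its own practical standard instead of relying on a single threshold.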

Conclusion

IRA statistics are critical to justification of aggregation in multilevel research, but they are also frequently applied in job analysis, performance appraisal, assessment centers, employment interviews, and so forth. Importantly, IRA offers a perspective distinct from reliability because reliability deals with consistency of ratings and agreement deals with the similarity of absolute levels of ratings. IRA has the added advantage of providing one estimate per set of raters, not one estimate for the sample as is the case with reliability. This feature of IRA can be helpful for diagnostic purposes, such as identifying particular groups with high or low IRA.

Despite the prevalence of IRA, articles considering IRA statistics tend to be technically dense (e.g., Lindell and Brandt, 1999; Cohen et al., 2001), and this might be a reason why r_wg, with its widely known limitations (e.g., see Brown and Hauenstein, 2005), appears to persevere as the leading statistical choice for IRA. Indeed, a recent review suggested that a lack of a sound understanding of IRA statistics may have led to some misuses (see Biemann et al., 2012). Thus, despite the many alternatives offered (e.g., $r_{wg}^{*}$, AD, a_wg, CV_wg), they may not receive full consideration because accessible, tractable, and non-technical resources describing each within a framework that allows for simple contrasting are not available. LeBreton and Senter (2008) provided solid coverage, but it was mainly with respect to multilevel aggregation issues and not directly applicable to other purposes (e.g., job analysis).

The current article aims to fill a gap in earlier research by offering an introductory source, intended to be useful for scholars with a wide range of backgrounds, in order to facilitate application and interpretation of IRA statistics. Through a comparative analysis of eight IRA statistics, it appears that these statistics are not interchangeable and that they are differentially affected by various contextual details (e.g., number of Likert response options, number of judges, number of scale items). The goal of the article is to facilitate critical and appropriate applications of IRA in the future, offer a foundation for tackling the more technical sources currently available, and make suggestions regarding best practices in light of the insights gleaned through the review. It is proposed that researchers interpret IRA levels with respect to the situation and best-practice recommendations for practical and statistical standards in the literature, as reviewed here. Because of the unique limitations of each statistic, it is probably safe to conclude that more than one statistic should always be reported. In submissions where this has been ignored, reviewers should ask the author to report additional agreement statistics, ideally from other IRA families. Consistent with the unitary perspective on validity, it is suggested that judgments regarding the adequacy of the ratings rely on evidence of IRA in conjunction with additional statistics that shed light on the quality of the data (e.g., reliability coefficients, criterion validity coefficients). Regarding agreement standards, it would seem advisable to evaluate a given IRA statistic using appropriate a priori practical cutoffs and statistical criteria, depending on the purpose of assessing agreement levels. What must be avoided is the misuse of agreement statistics and the adoption of inappropriate or misleading decision rules. This critical review aims to provide tools to help researchers and practitioners avoid these problems.

Author Contributions

The author confirms being the sole contributor of this work and approved it for publication.

Conflict of Interest Statement

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This research was supported by an operating grant provided to TO by the Social Sciences and Humanities Research Council of Canada.

References

Allen, N. J., Stanley, D. J., Williams, H. M., and Ross, S. J. (2007). Assessing the impact of nonresponse on work group diversity effects. Organ. Res. Methods 10, 262–286. doi: 10.1177/1094428106/294731
