OBJECTIVES: This reliability generalization study aimed to estimate the mean and variance of the interrater reliability coefficients (ryy) of supervisory ratings of overall, task, contextual, and positive job performance. The moderating effects of appraisal purpose and scale type were examined. It was hypothesized that ratings collected for research purposes and ratings obtained with multi-item scales have higher ryy. It was also examined whether ryy was similar across the four performance dimensions. METHOD: A database of 224 independent samples was created, and hierarchical sub-grouping meta-analyses were conducted. RESULTS: Appraisal purpose moderated ryy for the four performance dimensions. Scale type moderated ryy for overall and task performance ratings collected for research purposes. The findings also suggest that supervisors have less difficulty evaluating overall job performance than task, contextual, and positive performance. The best estimates of the observed ryy for overall job performance are .61 for ratings collected for research purposes and .45 for ratings collected for administrative purposes. MAIN CONCLUSIONS: (1) Appraisal purpose moderates ryy, and researchers and practitioners should be aware of its effects before collecting ratings or using empirically derived interrater reliability distributions; (2) scale type seems to moderate ryy only for ratings collected for research purposes; (3) overall job performance is rated more reliably than task, contextual, and positive performance. Implications for research and practice are discussed.

Keywords: Interrater reliability, supervisory performance ratings, meta-analysis, appraisal purpose, range restriction

