ORIGINAL RESEARCH article
Front. Psychol.
Sec. Quantitative Psychology and Measurement
This article is part of the Research Topic: Statistical Guidelines: New Developments in Statistical Methods and Psychometric Tools – Volume II
Operating Characteristics of Agreement Metrics in AI-Based Scoring: A Monte Carlo Simulation
Provisionally accepted
Bolu Abant Izzet Baysal University, Bolu, Türkiye
This study analyzed the threshold-exceedance performance of human-AI agreement measures for the scoring of open-ended items. A Monte Carlo simulation was conducted to represent the types of error commonly reported for automated scoring, identified from studies in the literature. These conditions included additive bias, variance inflation, midpoint compression, class imbalance, and subgroup-related offsets. Human scores served as the reference against which agreement was assessed. Agreement was evaluated with ICC(A,1), Krippendorff's α (ranked), Quadratic Weighted Kappa (QWK), and Bland-Altman limits, along with tolerance-based agreement metrics. Threshold-exceedance performance was defined as the proportion of replications in which each metric surpassed conventional adequacy standards. ICC(A,1) showed higher threshold-exceedance performance under low and moderate variance inflation. QWK reached a moderate level of robustness. Krippendorff's α performed consistently, especially under unbalanced distributions or inflated variance. Tolerance-based agreement reflected numerical closeness between human and AI scores. In the second part of the study, analyses were repeated on real data to validate the simulation results: written texts produced by six students were scored by three teachers and two large language models. The findings showed patterns consistent with the simulated distortions. Taken together, the results indicate that agreement indices vary systematically across structural error mechanisms and sampling conditions, and that these conditions can affect the interpretability of automated scores. Accordingly, the findings highlight the need for multi-metric frameworks when evaluating human-AI score agreement.
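As an illustration of the simulation logic described above, the following minimal sketch (not the authors' code; the sample size, 0-4 rubric range, bias and inflation values, and the .75 adequacy cutoff are illustrative assumptions) generates AI scores from human scores under additive bias and variance inflation, then summarizes agreement with QWK and a tolerance-based rate and flags threshold exceedance.

    # Minimal sketch of one simulation cell (illustrative, not the authors' code).
    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(42)

    n = 500                        # hypothetical sample size per replication
    human = rng.integers(0, 5, n)  # human scores on an assumed 0-4 rubric

    bias = 0.5      # additive bias condition (illustrative value)
    sd_infl = 1.3   # variance-inflation factor (illustrative value)
    ai_raw = human + bias + rng.normal(0.0, sd_infl, n)
    ai = np.clip(np.rint(ai_raw), 0, 4).astype(int)

    # Quadratic Weighted Kappa between human and AI scores
    qwk = cohen_kappa_score(human, ai, weights="quadratic")

    # Tolerance-based agreement: proportion of AI scores within +/-1 point of human scores
    tolerance_rate = np.mean(np.abs(ai - human) <= 1)

    # Threshold-exceedance flag against an assumed conventional adequacy standard (.75)
    print(f"QWK = {qwk:.3f}, exceeds .75: {qwk >= 0.75}")
    print(f"Tolerance (+/-1) agreement = {tolerance_rate:.3f}")

Repeating such a cell across many replications and conditions, and recording how often each metric clears its adequacy standard, yields the threshold-exceedance proportions examined in the study.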
Keywords: artificial intelligence, scoring reliability, Monte Carlo simulation, agreement metrics, measurement and evaluation in education and psychology
Received: 15 Sep 2025; Accepted: 05 Jan 2026.
Copyright: © 2026 YANDI. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Alperen YANDI
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.