AUTHOR=Krumsvik, Rune Johan
TITLE=GPT-4’s capabilities in handling essay-based exams in Norwegian: an intrinsic case study from the early phase of intervention
JOURNAL=Frontiers in Education
VOLUME=10
YEAR=2025
URL=https://www.frontiersin.org/journals/education/articles/10.3389/feduc.2025.1444544
DOI=10.3389/feduc.2025.1444544
ISSN=2504-284X
ABSTRACT=The growing integration of artificial intelligence (AI) in education has paved the way for innovative grading practices and assessment methods. This study explores GPT-4’s capabilities in handling essay-based exams in Norwegian across bachelor’s, master’s, and PhD levels, offering new insights into AI’s potential in educational assessment. Driven by the need to understand how AI can enhance assessment practices beyond traditional approaches, this case study examines GPT-4’s performance on essay-based exams covering qualitative methods, case study research, qualitative meta-synthesis, and mixed methods research, using chain-of-thought prompting. Unlike existing studies that primarily assess AI’s grading abilities, this research delves into GPT-4’s capability both to evaluate student responses and to provide feedback, bridging a critical gap in integrating feedback theories with AI-assisted assessment. The study specifically investigates GPT-4’s ability to answer exam questions, grade student responses, and suggest improvements to those responses. A case study design was employed, with primary data derived from GPT-4’s performance on six exams, analyzed against course learning goals and the grading scale (feed up), GPT-4’s handling of the exams’ main content (feedback), and GPT-4’s ability to critically assess its own performance and limitations (feed forward). The findings from this intrinsic case study reveal that GPT-4 performs well on these essay-based exams, effectively navigating different academic levels and the Norwegian language context. The fieldwork highlights GPT-4’s potential to significantly enhance formative assessment by providing timely, detailed, and personalized feedback that supports student learning. For summative assessment, GPT-4 demonstrated reliable evaluation of complex student essay exams, aligning closely with human assessments. The study advances the field by showing how AI can bridge gaps between traditional and AI-enhanced assessment methods, particularly in scaffolding formative and summative assessment practices. However, since this case study examines only the early phase of the intervention, it has several limitations. Within these limitations, the findings underscore the need for continuous innovation in educational assessment to prepare for future advances in AI technology, while also addressing ethical considerations such as bias. Vigilant and responsible implementation, along with ongoing refinement of AI tools, remains crucial.