AUTHOR=Kadyrov Shirali, Abdrasilov Bolatbek, Sabyrov Aslan, Baizhanov Nurseit, Makhmutova Alfira, Kyllonen Patrick C.
TITLE=Evaluating LLMs on Kazakhstan's mathematics exam for university admission
JOURNAL=Frontiers in Artificial Intelligence
VOLUME=Volume 8 - 2025
YEAR=2025
URL=https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1642570
DOI=10.3389/frai.2025.1642570
ISSN=2624-8212
ABSTRACT=Introduction: The rapid advancement of large language models (LLMs) has prompted their exploration in educational contexts, particularly in high-stakes standardized tests such as Kazakhstan's Unified National Testing (UNT) mathematics component, which is critical for university admission. While most existing benchmarks for mathematical reasoning focus on English, concerns remain that LLMs may underperform in under-resourced or non-English languages. This study addresses that gap by evaluating LLM performance on 139 UNT multiple-choice mathematics questions administered entirely in Russian. Methods: We assessed six LLMs (Claude, DeepSeek, Gemini, Llama, Qwen, and o1) on questions covering algebra, functions, geometry, inequalities, and trigonometry. Three evaluation conditions were employed: (1) zero-shot performance, (2) hybrid integration with SymPy for symbolic computation, and (3) a role-specific simulated multi-agent refinement framework that builds on existing self-correction techniques with targeted feedback. Results: In zero-shot settings, DeepSeek, Gemini, Qwen, and o1 achieved near-perfect or perfect accuracy (91.2–100%) across all difficulty levels and topics, while Claude and Llama lagged (43.5–76.5%). The hybrid approach significantly improved Claude's and Llama's accuracy by 27.4% and 39.9%, respectively.
Under the multi-agent refinement condition, Claude showed substantial gains, reaching 97.8% accuracy, a 58.1% improvement over its zero-shot performance. Discussion: These findings provide empirical evidence that LLMs can perform competitively on mathematics tasks in non-English languages. The results challenge prior assumptions about limited performance in under-resourced linguistic settings and highlight the potential of LLMs to support bilingual education and promote equitable access to higher education.
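The hybrid condition in the abstract pairs an LLM with SymPy for symbolic computation. A minimal sketch of that general idea, assuming the LLM has already translated a multiple-choice item into a symbolic expression and SymPy is used only to check which option it matches (the item, option labels, and `check_options` helper below are illustrative, not from the paper's dataset or pipeline):

```python
import sympy as sp

x = sp.symbols("x")

def check_options(expr, options):
    """Return the label of the first option symbolically equal to expr."""
    for label, candidate in options.items():
        # simplify(expr - candidate) reduces to 0 iff the two are equal
        if sp.simplify(expr - candidate) == 0:
            return label
    return None

# Illustrative item: simplify (x**2 - 1)/(x - 1)
llm_expr = (x**2 - 1) / (x - 1)   # expression produced by the LLM
options = {"A": x + 1, "B": x - 1, "C": x**2 + 1}
print(check_options(llm_expr, options))  # -> A
```

Offloading the final equality check to a computer algebra system in this way sidesteps arithmetic slips by the LLM, which is consistent with the large accuracy gains the abstract reports for the weaker models under the hybrid condition.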