ORIGINAL RESEARCH article
Front. Artif. Intell.
Sec. Natural Language Processing
Volume 8 - 2025 | doi: 10.3389/frai.2025.1642570
This article is part of the Research Topic: Artificial Intelligence for Technology Enhanced Learning
Evaluating LLMs on Kazakhstan's Mathematics Exam for University Admission
Provisionally accepted
1 New Uzbekistan University, Tashkent, Uzbekistan
2 Narxoz University, Almaty, Kazakhstan
3 National Test Center, Astana, Kazakhstan
4 ETS, Princeton, United States
The rapid advancement of large language models (LLMs) has prompted their exploration in educational contexts, particularly for high-stakes standardized tests such as the mathematics component of Kazakhstan's Unified National Testing (UNT), which is critical for university admission. While most existing benchmarks for mathematical reasoning focus on English, there is growing concern that LLMs may underperform in under-resourced or non-English languages. This study addresses that gap by evaluating LLM performance on a math test administered entirely in Russian. We assess six LLMs (Claude, Deepseek, Gemini, Llama, Qwen, and o1) on 139 UNT multiple-choice questions covering algebra, functions, geometry, inequalities, and trigonometry. The methodology includes three conditions: zero-shot prompting, hybrid integration with SymPy for symbolic computation, and a role-specific simulated multi-agent refinement framework that extends existing self-correction techniques with targeted feedback. Results show that Deepseek, Gemini, Qwen, and o1 achieved near-perfect or perfect accuracy (91.2–100%) in the zero-shot setting across all difficulty levels and topics, while Claude and Llama lagged behind (43.5–76.5%). The hybrid approach significantly improved Claude's and Llama's accuracy, by 27.4% and 39.9%, respectively. Under the multi-agent approach, Claude improved most markedly, reaching 97.8% accuracy, a 58.1% gain over its zero-shot result. These findings provide important empirical evidence that LLMs can perform competitively on math tasks in non-English languages, challenging prior assumptions and underscoring their potential to support bilingual education and equitable access to higher education in under-resourced linguistic settings.
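The hybrid condition pairs the model's language understanding with exact symbolic computation: the LLM formalizes the problem, and SymPy performs the algebra. The sketch below is a minimal illustration of such an integration in Python; the query_llm helper and the prompt wording are assumptions for illustration, not the paper's actual implementation.

    import sympy as sp

    def query_llm(prompt: str) -> str:
        """Hypothetical LLM call (stands in for any chat-completion API);
        assumed to return an equation string such as 'x**2 - 5*x + 6 = 0'."""
        raise NotImplementedError

    def hybrid_solve(problem: str):
        # Ask the model only to formalize the problem; leave the algebra to
        # SymPy so arithmetic slips in the model's reasoning cannot leak in.
        eq_text = query_llm(
            "Rewrite this exam problem as one equation in x, in Python syntax "
            f"(for example 'x**2 - 5*x + 6 = 0'):\n{problem}"
        )
        lhs, rhs = (sp.sympify(side) for side in eq_text.split("="))
        # Exact symbolic solution of the formalized equation.
        return sp.solve(sp.Eq(lhs, rhs), sp.symbols("x"))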
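The multi-agent refinement condition prompts a single model in distinct roles, with a critic supplying targeted feedback to a solver. A minimal sketch follows, reusing the same hypothetical query_llm helper; the role prompts and round limit are invented for illustration and do not reproduce the paper's agent design.

    def refine(problem: str, rounds: int = 3) -> str:
        # Solver role: produce an initial step-by-step solution.
        answer = query_llm(f"Solve this UNT math problem step by step:\n{problem}")
        for _ in range(rounds):
            # Critic role: targeted feedback rather than a generic retry.
            critique = query_llm(
                "You are a strict mathematics examiner. Identify the specific "
                "error in this solution, or reply exactly 'OK':\n"
                f"Problem: {problem}\nSolution: {answer}"
            )
            if critique.strip() == "OK":
                break  # the critic accepts the solution; stop refining
            # Solver role again: revise using the critic's feedback.
            answer = query_llm(
                f"Revise the solution using this feedback.\nProblem: {problem}\n"
                f"Draft: {answer}\nFeedback: {critique}"
            )
        return answer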
Keywords: large language models, mathematical reasoning, Unified National Testing, Kazakhstan education, zero-shot learning, symbolic computation, SymPy, simulated multi-agent refinement
Received: 10 Jun 2025; Accepted: 25 Aug 2025.
Copyright: © 2025 Kadyrov, Abdrasilov, Sabyrov, Baizhanov, Makhmuotova and Kyllonen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
* Correspondence: Nurseit Baizhanov, National Test Center, Astana, Kazakhstan
Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.