ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Natural Language Processing

Volume 8 - 2025 | doi: 10.3389/frai.2025.1642570

This article is part of the Research Topic: Artificial Intelligence for Technology Enhanced Learning.

Evaluating LLMs on Kazakhstan's Mathematics Exam for University Admission

Provisionally accepted
Shirali Kadyrov1,2, Bolatbek Abdrasilov3, Aslan Sabyrov3, Nurseit Baizhanov3*, Alfira Makhmuotova1, Patrick Charles Kyllonen4
  • 1New Uzbekistan University, Tashkent, Uzbekistan
  • 2Narxoz University, Almaty, Kazakhstan
  • 3National Test Center, Astana, Kazakhstan
  • 4ETS, Princeton, United States

The final, formatted version of the article will be published soon.

The rapid advancement of large language models (LLMs) has prompted their exploration in educational contexts, particularly for high-stakes standardized tests like Kazakhstan's Unified National Testing (UNT) mathematics component, which is critical for university admission. While most existing benchmarks for mathematical reasoning focus on English, there is growing concern that LLMs may underperform in under-resourced or non-English languages. This study addresses that gap by evaluating LLM performance on a math test administered entirely in Russian. We assess six LLMs—Claude, Deepseek, Gemini, Llama, Qwen, and o1—on 139 UNT multiple-choice questions covering algebra, functions, geometry, inequalities, and trigonometry. The methodology includes three conditions: zero-shot performance, hybrid integration with SymPy for symbolic computation, and a role-specific simulated multi-agent refinement framework that builds on existing self-correction techniques with targeted feedback. Results show that Deepseek, Gemini, Qwen, and o1 achieved near-perfect or perfect accuracy (91.2–100%) in zero-shot settings across all difficulty levels and topics, while Claude and Llama lagged (43.5–76.5%). The hybrid approach significantly improved Claude's and Llama's accuracy, by 27.4% and 39.9%, respectively. Under the multi-agent approach, Claude showed a significant improvement, reaching 97.8% accuracy, a 58.1% gain over the zero-shot setting. These findings provide important empirical evidence that LLMs can perform competitively on math tasks in non-English languages, challenging prior assumptions and underscoring their potential to support bilingual education and equitable access to higher education in under-resourced linguistic settings.
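To make the "hybrid integration with SymPy" condition concrete, the following is a minimal illustrative sketch, not the authors' actual pipeline: the LLM is asked only to translate an item into a symbolic expression, while SymPy performs the solving and the result is matched against the answer choices. The llm_propose stub, the prompt handling, and the example item are assumptions for illustration; a real run would replace the stub with a call to one of the evaluated models.

```python
# Illustrative sketch of an LLM+SymPy hybrid for multiple-choice math items.
# The LLM proposes a symbolic formulation; SymPy does the symbolic computation.
import sympy as sp


def llm_propose(question: str) -> str:
    # Placeholder for a real LLM call (e.g., Claude or Llama). Here we hard-code
    # a plausible model response: the left-hand side of the equation to solve.
    return "x**2 - 5*x + 6"


def solve_with_sympy(expr_str: str):
    # Parse the model's expression and solve it exactly, delegating algebraic
    # manipulation to SymPy rather than relying on the LLM's arithmetic.
    x = sp.symbols("x")
    expr = sp.sympify(expr_str)
    return sp.solve(sp.Eq(expr, 0), x)


def answer_item(question: str, choices: dict) -> str:
    # Compare SymPy's solution set against the answer options and return the
    # matching choice label, or None if no option matches.
    roots = set(solve_with_sympy(llm_propose(question)))
    for label, option in choices.items():
        if roots == option:
            return label
    return None


if __name__ == "__main__":
    question = "Solve x^2 - 5x + 6 = 0."  # illustrative UNT-style item
    choices = {"A": {1, 6}, "B": {2, 3}, "C": {-2, -3}, "D": {5, 6}}
    print(answer_item(question, choices))  # expected output: B
```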

Keywords: Large language models, mathematical reasoning, Unified National Testing, Kazakhstan education, Zero-shot learning, Symbolic Computation, SymPy, simulated multi-agent refinement

Received: 10 Jun 2025; Accepted: 25 Aug 2025.

Copyright: © 2025 Kadyrov, Abdrasilov, Sabyrov, Baizhanov, Makhmuotova and Kyllonen. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Nurseit Baizhanov, National Test Center, Astana, Kazakhstan

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.