ORIGINAL RESEARCH article

Front. Artif. Intell.

Sec. Natural Language Processing

Volume 8 - 2025 | doi: 10.3389/frai.2025.1538165

Tokenization Efficiency of Current Foundational Large Language Models for the Ukrainian Language

Provisionally accepted
  • Kharkiv National University of Radio Electronics, Kharkiv, Ukraine

The final, formatted version of the article will be published soon.

Foundational large language models (LLMs) are deployed in multilingual environments across both general and narrow task domains. Such models generate text token by token, which makes them slower and more computationally expensive for low-resource languages that are underrepresented in the tokenizer vocabulary. It also makes their usage more costly in such cases, since pricing usually depends on the number of input and output tokens. This paper compares the tokenizers of multiple pretrained LLMs on the Ukrainian language. It reports tokenization fertility measurements for current state-of-the-art (SOTA) models, both on general-purpose language and on specific domains, along with experiments on a transliteration approach that makes tokenization more efficient without information loss. The results provide insights into current models' shortcomings and potential problems in Ukrainian language modeling.
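The two measurements the abstract refers to can be sketched briefly. Tokenization fertility is commonly defined as the average number of tokens a tokenizer produces per word; lossless transliteration maps Cyrillic to Latin (and back) so that a Latin-biased subword vocabulary splits the text less aggressively. The sketch below is illustrative only: `toy_tokenize` is a hypothetical stand-in for a real model tokenizer, and `UA_TO_LATIN` covers only the few characters in the sample, not a full transliteration standard.

```python
def toy_tokenize(text: str) -> list[str]:
    # Hypothetical stand-in for a subword tokenizer with poor
    # Ukrainian coverage: split each word into 2-character chunks.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return tokens

def fertility(text: str, tokenize) -> float:
    # Tokenization fertility: average number of tokens per word.
    words = text.split()
    return len(tokenize(text)) / len(words)

# Partial, invertible Cyrillic-to-Latin map (illustrative subset only);
# a real experiment would use a complete transliteration table.
UA_TO_LATIN = {"п": "p", "р": "r", "и": "y", "в": "v", "і": "i", "т": "t", "с": "s"}

def transliterate(text: str) -> str:
    return "".join(UA_TO_LATIN.get(ch, ch) for ch in text)

sample = "привіт світ"  # Ukrainian for "hello world"
print(fertility(sample, toy_tokenize))  # 5 tokens / 2 words → 2.5
print(transliterate(sample))            # "pryvit svit"
```

In a real comparison, each model's actual tokenizer would replace `toy_tokenize`, and fertility would be averaged over a corpus; a high fertility on Ukrainian relative to English signals the inefficiency (and cost) the paper measures.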

Keywords: tokenization, large language model, corpus, domain, low-resource language

Received: 05 Dec 2024; Accepted: 29 Jul 2025.

Copyright: © 2025 Turuta and Maksymenko. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Oleksii Turuta, Kharkiv National University of Radio Electronics, Kharkiv, Ukraine

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.