Effect of tokenization granularity for Turkish large language models


Kaya Y. B., Tantuğ A. C.

Intelligent Systems with Applications, vol.21, 2024 (Scopus)

Abstract

Transformer-based language models such as BERT (and its optimized versions) have outperformed previous models, achieving state-of-the-art results on many English benchmark tasks. These multi-layered self-attention-based architectures are capable of producing contextual word vector representations. However, the tokens created in the tokenization preprocessing step are not necessarily words, particularly for languages with complex morphology, such as Turkish. While previous research has often focused on tokenization algorithms and has explored optimal vocabulary sizes for machine translation in English, our study extends the scope by investigating the impact of varying vocabulary sizes and exploring the feasibility of incorporating morphological tagging for Turkish. The granularity of the generated tokens is determined by various factors related to tokenization, especially the vocabulary size. This study presents a new collection of BERT models (ITUTurkBERT) trained using various tokenization methods on the BERTurk and 1 BW corpora. We fine-tuned these models for named entity recognition, sentiment analysis, and question-answering downstream tasks in Turkish and achieved state-of-the-art performance on all of these tasks. Our empirical experiments show that increasing the vocabulary size improves performance on these tasks, except for sentiment analysis, which requires further investigation.
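
The link between vocabulary size and token granularity can be illustrated with a minimal sketch. The code below is not the tokenizer or training setup used for ITUTurkBERT; it is a toy byte-pair-encoding (BPE) learner written for illustration, and the word list `WORDS` and all function names are assumptions introduced here. Each learned merge adds one entry to the vocabulary, so the number of merges stands in for vocabulary size, and the average number of tokens per word stands in for granularity.

```python
from collections import Counter

# Hypothetical toy corpus of Turkish inflected forms (illustration only).
WORDS = [
    "ev", "evler", "evlerde", "evlerimiz", "evlerimizde",
    "kitap", "kitaplar", "kitaplarda", "kitaplarımız", "kitaplarımızda",
]


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules; each rule adds one vocabulary entry."""
    # Every word starts as a sequence of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges


def tokenize(word, merges):
    """Segment `word` by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


def tokens_per_word(words, merges):
    """Average number of tokens per word: a simple proxy for granularity."""
    return sum(len(tokenize(w, merges)) for w in words) / len(words)


if __name__ == "__main__":
    all_merges = learn_bpe(WORDS, 20)
    for k in (0, 5, 10, 20):
        used = all_merges[:k]
        print(k, round(tokens_per_word(WORDS, used), 2), tokenize("evlerimizde", used))
```

With no merges, every word is split into characters; as more merges are allowed, the segments grow longer and the tokens-per-word average cannot increase. Whether the coarser segments align with Turkish morphemes depends on the corpus and the algorithm, which this toy example does not attempt to show.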