Normalizing Non-canonical Turkish Texts Using Machine Translation Approaches

Colakoglu T., Sulubacak U., Tantuğ A. C.

57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July - 2 August 2019, pp. 267-272

Publication Type: Conference Paper / Full-Text Paper
Volume Number:
City of Publication: Florence
Country of Publication: Italy
Pages: pp. 267-272
Affiliated with Istanbul Technical University: Yes

Abstract

With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is composed of a token-level pipeline of modules, heavily dependent on external linguistic resources and manually-defined rules. Instead, we propose a fully-automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.