Preliminary Investigation on Using Semi-Supervised Contextual Word Sense Disambiguation for Data Augmentation


Torunoglu-Selamet D., İnceoğlu A., Eryiğit G.

5th International Conference on Computer Science and Engineering (UBMK), Diyarbakır, Türkiye, 9 - 11 Eylül 2020, ss.337-342 identifier identifier

  • Yayın Türü: Bildiri / Tam Metin Bildiri
  • Cilt numarası:
  • Doi Numarası: 10.1109/ubmk50275.2020.9219389
  • Basıldığı Şehir: Diyarbakır
  • Basıldığı Ülke: Türkiye
  • Sayfa Sayıları: ss.337-342
  • İstanbul Teknik Üniversitesi Adresli: Evet

Özet

Recently, neural architectures play a significant role in the task of Word Sense Disambiguation (WSD). Supervised methods seem to be ahead of its rivals and their performance mostly depends on the size of training data. A numerous number of human-annotated data available for WSD task have been constructed for English. However, low-resource languages (LRLs) still face difficulty in finding suitable data resources. Gathering and annotating a sufficient amount of training data is a time-consuming and labor-expensive work. To address and overcome this problem, in this paper we investigate the possibility of using a semi-supervised context based WSD approach for data augmentation (in order to be later used for supervised learning). Since, it is even difficult to find WSD evaluation datasets for LRLs, in this study, we use English datasets to build a proof-of-concept and to evaluate their applicability onto LRLs. Our semi-supervised approach uses a seed set and context embeddings. We test with 9 different context based language models (including ELMo, BERT, RoBERTa etc.) and investigate their impacts on WSD. We increased our baseline results up to 28 percentage point improvements (baseline with ELMo 5039% and ELMo Sense Seed Based Average Similarity Model 78.06%) in terms of accuracy. Our initial findings reveal that the proposed approach is very promising for the augmentation of WSD datasets of LRLs.