ANEC: An Amharic Named Entity Corpus and Transformer Based Recognizer


Creative Commons License

Jibril E. C., Cuneyd Tantug A. C.

IEEE Access, vol.11, pp.15799-15815, 2023 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 11
  • Publication Date: 2023
  • DOI: 10.1109/access.2023.3243468
  • Journal Name: IEEE Access
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Page Numbers: pp.15799-15815
  • Keywords: Amharic, named entity recognition, synthetic minority over-sampling technique, deep learning BiLSTM-CRF, transfer learning
  • Istanbul Technical University Affiliated: Yes

Abstract

Named entity recognition (NER) is an information extraction task that serves as a pre-processing step for other natural language processing tasks, such as machine translation, information retrieval, and question answering. It identifies proper names as well as temporal and numeric expressions in open-domain text. For Semitic languages such as Arabic, Amharic, and Hebrew, the task is more challenging because these languages are heavily inflected. In this study, we annotate a new, comparatively large Amharic named entity recognition dataset and make it publicly available. Using this dataset, we build multiple Amharic named entity recognition systems based on recent deep learning approaches, including transfer learning (RoBERTa) and bidirectional long short-term memory (BiLSTM) coupled with a conditional random field (CRF) layer. By applying the Synthetic Minority Over-sampling Technique (SMOTE) to mitigate the imbalanced classification problem, our best-performing RoBERTa-based system achieves an F1-score of 93%, a new state-of-the-art result for Amharic named entity recognition.
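
The abstract names the Synthetic Minority Over-sampling Technique without describing how it was applied to the token-level data. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below creates synthetic minority-class points by interpolating between a minority sample and one of its nearest minority neighbours. The function name `smote_sample`, its parameters, and the toy feature vectors are all hypothetical.

```python
import random


def smote_sample(minority, k=2, n_new=4, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and neighbour.
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy 2-D feature vectors standing in for minority-class examples (assumed data).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sample(minority)
```

Because every synthetic point is a convex combination of two existing minority samples, it stays inside the region those samples already occupy. In practice a maintained library implementation would normally be preferred over a hand-written one.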