Neural Machine Translation Approaches for Post-OCR Text Processing (Optik Karakter Tanıma Sonrası Metin İşleme Adımı için Sinirsel Makine Çevirisi Yaklaşımları)


Topcu A. I., Töreyin B. U.

30th Signal Processing and Communications Applications Conference, SIU 2022, Safranbolu, Turkey, 15-18 May 2022

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/siu55565.2022.9864878
  • City: Safranbolu
  • Country: Turkey
  • Keywords: error correction, error detection, natural language processing, neural machine translation, Post-OCR processing
  • Istanbul Technical University Affiliated: Yes

Abstract

© 2022 IEEE. Optical Character Recognition (OCR) is the process of extracting text from images with specialized software and converting it into machine-readable form. OCR quality directly affects the quality of most downstream natural language processing tasks. Many everyday applications, such as text classification, information extraction, and text summarization, operate on text extracted from images. Detecting and correcting text that OCR has recognized incorrectly is therefore an active research topic that is being approached with many methods. This study applies recent neural machine translation methods to detect and correct post-OCR text errors, and reports the results observed on the dataset presented in the International Conference on Document Analysis and Recognition (ICDAR) 2019 post-OCR error detection and correction competition.