Classifying Turkish Trade Registry Gazette Announcements

7th International Conference on Computer Science and Engineering, UBMK 2022, Diyarbakır, Türkiye, 14-16 September 2022, pp. 204-209

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/ubmk55850.2022.9919536
City of Publication: Diyarbakır
Country of Publication: Türkiye
Pages: pp. 204-209
Keywords: document classification, document processing, natural language processing, OCR
Affiliated with İstanbul Teknik Üniversitesi: Yes

Abstract

© 2022 IEEE. The Turkish Trade Registry Gazette is an important source of information in many sectors, such as banking and telecommunications. Although the gazette is publicly available, its data is hard to acquire because announcements are offered only as images. A specific company's announcement can be searched for, but the returned image also contains many unrelated announcements. This poses several challenges for information extraction. Because of the structure of the documents in these images, OCR is hard to perform directly. Moreover, even when the text is extracted, announcement boundaries must be detected to split the page into individual announcements. Once the announcements are separated, the one belonging to the searched company must be identified. Since the query returns no information about the surrounding announcements, these should also be categorized to detect events of interest concerning other companies. In this work, we address all of these problems and present a pipeline that includes image processing, OCR, announcement splitting, and document classification steps.
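
The post-OCR stages described in the abstract (announcement splitting, matching the searched company, and categorization) can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not specify how boundaries are detected or which classifier is used. The sketch assumes that OCR text is already available, that announcements are separated by blank lines, and that a simple keyword lookup stands in for the classifier. All names (`CATEGORY_KEYWORDS`, `split_announcements`, `classify`, `process_page`), the category labels, and the sample page are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical category keywords (assumption): a real system would likely
# use a trained document classifier instead of keyword lookup.
CATEGORY_KEYWORDS = {
    "capital_increase": ("capital", "increase"),
    "liquidation": ("liquidation",),
    "address_change": ("address",),
}


@dataclass
class Announcement:
    text: str
    category: str
    matches_query: bool


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching tolerates OCR spacing noise."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_announcements(page_text: str) -> list[str]:
    """Split OCR output on blank lines (assumed boundary marker, not from the paper)."""
    parts = re.split(r"\n\s*\n", page_text)
    return [p.strip() for p in parts if p.strip()]


def classify(text: str) -> str:
    """Return the first category whose keywords all occur in the text, else 'other'."""
    norm = normalize(text)
    for category, keywords in CATEGORY_KEYWORDS.items():
        if all(k in norm for k in keywords):
            return category
    return "other"


def process_page(page_text: str, company: str) -> list[Announcement]:
    """Split a page, flag the searched company's announcement, and categorize all of them."""
    query = normalize(company)
    return [
        Announcement(text=a, category=classify(a), matches_query=query in normalize(a))
        for a in split_announcements(page_text)
    ]


# Made-up sample page standing in for OCR output of one gazette image.
PAGE = """ACME Trading Ltd.
The company address has been moved.

Beta Telecom Inc.
Capital increase approved by the board.

Gamma Foods Co.
The company has entered liquidation."""

results = process_page(PAGE, "Beta Telecom")
for r in results:
    print(r.category, r.matches_query)
```

Categorizing every announcement on the page, not only the matched one, mirrors the abstract's point that surrounding announcements may contain events of interest about other companies.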