7th International Conference on Computer Science and Engineering, UBMK 2022, Diyarbakır, Turkey, 14 - 16 September 2022, pp.204-209
© 2022 IEEE.
The Turkish Trade Registry Gazette is an important source of information in many sectors, such as banking and telecommunications. Although the gazette is publicly available, the data is hard to acquire, and announcements are offered in image format. It is possible to search for a specific company's announcement, but the returned image also contains many unrelated announcements. This poses several challenges for information extraction. Because of the structure of the documents in these images, it is hard to perform OCR directly. Moreover, even when the text is extracted, the announcement boundaries must be detected to split the page into individual announcements. Once the announcements are split, the one belonging to the searched company must be identified. Since the query result gives no information about the surrounding announcements, these should also be categorized to detect events of interest involving other companies. In this work, we address all of these problems and present a pipeline that includes image processing, OCR, announcement splitting, and document classification steps.
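The last three stages named in the abstract (announcement splitting, matching the searched company, and categorizing the remaining announcements) can be illustrated with a minimal sketch. It is not the authors' method: it assumes the OCR text of a page is already available, that blank lines separate announcements, that the first line of each announcement holds the company title, and that a small keyword table stands in for the document classifier. The function names and the `CATEGORY_KEYWORDS` table are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical keyword rules standing in for the paper's classifier.
CATEGORY_KEYWORDS = {
    "bankruptcy": ("iflas", "bankruptcy"),
    "capital_change": ("sermaye", "capital"),
}


def split_announcements(page_text):
    """Split OCR text into announcements, assuming blank lines separate them."""
    blocks = [block.strip() for block in page_text.split("\n\n")]
    return [block for block in blocks if block]


def match_company(announcements, company_name):
    """Return the index of the announcement whose first line best matches the name."""
    def score(text):
        # Assumption: the first line of an announcement is the company title.
        first_line = text.splitlines()[0].lower()
        return SequenceMatcher(None, first_line, company_name.lower()).ratio()

    return max(range(len(announcements)), key=lambda i: score(announcements[i]))


def classify(text):
    """Assign the first category whose keywords appear in the text, else 'other'."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"


page = "ACME LTD\nCapital increased.\n\nBETA AS\nBankruptcy declared."
announcements = split_announcements(page)
target = match_company(announcements, "Acme Ltd")
others = [classify(a) for i, a in enumerate(announcements) if i != target]
```

Fuzzy matching on the title line is used here because OCR output is likely to contain character errors, so an exact string comparison with the searched company name would be brittle.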