Fusion of visual representations for multimodal information extraction from unstructured transactional documents

Oral B., Eryiğit G.

INTERNATIONAL JOURNAL ON DOCUMENT ANALYSIS AND RECOGNITION, vol.25, no.3, pp.187-205, 2022 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 25 Issue: 3
  • Publication Date: 2022
  • Doi Number: 10.1007/s10032-022-00399-3
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC
  • Page Numbers: pp.187-205
  • Keywords: Information extraction, Information fusion, Visual representations, Unstructured documents, Complex relation extraction, Document understanding
  • Istanbul Technical University Affiliated: Yes


The importance of automated document understanding for today's businesses, in terms of speed, efficiency, and cost reduction, is indisputable. Although structured and semi-structured business documents have been studied intensively in the literature, information extraction from unstructured ones still remains an open and challenging research topic due to their difficulty and the scarcity of available datasets. Transactional documents occupy a special place among the various types of business documents, as they serve to track the financial flow, and are accordingly the most studied type. The processing of unstructured transactional documents requires the extraction of complex relations (i.e., n-ary, document-level, overlapping, and nested relations). Studies focusing on unstructured transactional documents rely mostly on textual information. However, the impact of their visual compositions remains an unexplored area and may be valuable for their automatic understanding. For the first time in the literature, this article investigates the impact of using different visual representations and their fusion on information extraction from unstructured transactional documents (i.e., for complex relation extraction from money transfer order documents). It introduces and experiments with five different visual representation approaches (i.e., word bounding box, grid embedding, grid convolutional neural network, layout embedding, and layout graph convolutional neural network) and their possible fusion with five different strategies (i.e., three basic vector operations, weighted fusion, and attention-based fusion). The results show that fusion strategies provide a valuable enhancement in combining diverse visual information, from which unstructured transactional document understanding obtains different benefits depending on the context. While different visual representations have little effect when added individually to a pure textual baseline, their fusion provides a relative error reduction of up to 33%.
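To make the fusion idea concrete, the following is a minimal sketch of attention-based fusion of several equally sized representation vectors for a single token. The function names, the fixed scoring vector, and the toy inputs are illustrative assumptions, not the paper's actual implementation (where the scoring parameters would be learned).

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fuse(reps, scorer):
    """Fuse k same-size representations into one vector via attention.

    reps:   (k, d) array, one visual representation per row
    scorer: (d,) scoring vector (stand-in for learned attention parameters)
    """
    scores = reps @ scorer     # (k,) relevance score per representation
    weights = softmax(scores)  # attention weights, summing to 1
    return weights @ reps      # (d,) weighted combination

# toy example: three 4-dimensional representations for one token
reps = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])
scorer = np.ones(4)            # equal scores -> uniform attention weights
fused = attention_fuse(reps, scorer)
```

The basic vector operations mentioned in the abstract (e.g., sum or concatenation) would replace the attention step with a fixed combination, while weighted fusion would use trainable scalar weights instead of input-dependent attention scores.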