A Novel Minimal Arabic Script for Preparing Databases and Benchmarks for Arabic Text Recognition Research

Al-Muhtaseb H. A. , Mahmoud S. A. , Qahwaji R. S.

8th WSEAS Int Conference on Signal Processing/3rd WSEAS Int Symposium on Wavelets Theory and Applicat in Appl Math, Signal Proc and Modern Sci, İstanbul, Turkey, 30 May - 01 June 2009, pp.37-39

  • Publication Type: Conference Paper / Full Text
  • City: İstanbul
  • Country: Turkey
  • Page Numbers: pp.37-39


This work presents a minimal Arabic script that may be used for training and testing Arabic text recognition systems. Another application is collecting handwritten samples from different writers to build handwritten text databases for benchmarking such systems. The suggested script covers the different shapes of the Arabic alphabet in all positions (viz. standalone, initial, medial, and terminal), and the frequency of each shape in the minimal text is designed to be as low as possible. The suggested script is novel from several perspectives. A writer needs to contribute only three lines of meaningful Arabic text to cover all possible alphabet shapes, a total of 125 shapes. Collecting scripts from different writers provides evenly distributed letter frequencies, which ensures enough samples for every character shape. This enables proper training and, in turn, higher accuracy in the recognition phase. The same holds for printed Arabic text. This is especially useful when a large number of features is used with classifiers that require a large number of samples per category; hidden Markov models and neural networks are two examples of such classifiers. In addition, the paper presents a statistical analysis of Arabic corpora to estimate the number of occurrences of the different shapes of Arabic letters in large corpora. These frequencies of Arabic letter usage were used to improve the search for the minimal Arabic text.
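The idea of counting positional shapes and checking whether a candidate text covers them can be illustrated with a minimal Python sketch. It is not the authors' procedure: the joining tables, the function names (`shape_counts`, `all_shapes`, `missing_shapes`), and the shape inventory are assumptions made for illustration. In particular, the simplified tables ignore ligatures such as lam-alef and do not attempt to reproduce the 125-shape inventory reported in the abstract.

```python
from collections import Counter

# Simplified joining classes (an assumption for this sketch, not the paper's inventory).
# Letters that connect only to the preceding letter:
RIGHT_JOINING = set("اأإآدذرزوؤةى")
# Hamza connects on neither side:
NON_JOINING = set("ء")
# Letters that connect on both sides:
DUAL_JOINING = set("بتثجحخسشصضطظعغفقكلمنهيئ")
LETTERS = RIGHT_JOINING | NON_JOINING | DUAL_JOINING


def joins_next(ch):
    """True if the letter can connect to the letter that follows it."""
    return ch in DUAL_JOINING


def joins_prev(ch):
    """True if the letter can connect to the letter that precedes it."""
    return ch in DUAL_JOINING or ch in RIGHT_JOINING


def shape_counts(text):
    """Count (letter, position) pairs in whitespace-separated words."""
    counts = Counter()
    for word in text.split():
        # Keep only known letters; diacritics and punctuation are dropped.
        letters = [c for c in word if c in LETTERS]
        for i, ch in enumerate(letters):
            prev_ok = i > 0 and joins_next(letters[i - 1]) and joins_prev(ch)
            next_ok = (i + 1 < len(letters) and joins_next(ch)
                       and joins_prev(letters[i + 1]))
            if prev_ok and next_ok:
                pos = "medial"
            elif prev_ok:
                pos = "terminal"
            elif next_ok:
                pos = "initial"
            else:
                pos = "standalone"
            counts[(ch, pos)] += 1
    return counts


def all_shapes():
    """Every (letter, position) pair possible under the simplified tables."""
    shapes = set()
    for ch in LETTERS:
        shapes.add((ch, "standalone"))
        if ch in NON_JOINING:
            continue
        shapes.add((ch, "terminal"))
        if ch in DUAL_JOINING:
            shapes.add((ch, "initial"))
            shapes.add((ch, "medial"))
    return shapes


def missing_shapes(text):
    """Shapes from the simplified inventory that the text does not contain."""
    return all_shapes() - set(shape_counts(text))
```

Under these assumptions, a candidate text covers the inventory when `missing_shapes(text)` is empty, and `shape_counts` applied to a large corpus gives the kind of per-shape frequency estimate the abstract describes. How the authors actually searched for the minimal text is not stated here, so no search strategy is sketched.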