Fast Pattern-Matching via k-bit Filtering Based Text Decomposition

Kulekci, Muhammed; Vitter, Jeffrey; Xu, Bojian

doi:10.1093/comjnl/bxq090

Fast Pattern-Matching via k-bit Filtering Based Text Decomposition

Atıf İçin Kopyala

Kulekci M. O., Vitter J. S., Xu B.

COMPUTER JOURNAL, cilt.55, sa.1, ss.62-68, 2012 (SCI-Expanded)

Yayın Türü: Makale / Tam Makale
Cilt numarası: 55 Sayı: 1
Basım Tarihi: 2012
Doi Numarası: 10.1093/comjnl/bxq090
Dergi Adı: COMPUTER JOURNAL
Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Sayfa Sayıları: ss.62-68
İstanbul Teknik Üniversitesi Adresli: Hayır

Özet

This study explores an alternative way of storing text files to answer exact match queries faster. We decompose the original file into two parts as filter and payload. The filter part contains the most informative k bits of each byte, and the remaining bits of the bytes are concatenated in the order of appearance to generate the payload. We refer to this structure as k-bit filtered format. When an input pattern is to be searched on the k-bit filtered structure, the same decomposition is performed on the pattern. The k bits from each byte of the pattern form the pattern filter bit sequence, and the rest is the payload. The pattern filter is first scanned on the filter part of the file. At each match position detected in the filter part, the pattern payload is verified against the corresponding location in the payload part of the text. Thus, instead of searching an m-byte pattern on an n-byte text, first k center dot m bits are scanned on k center dot n bits, followed by a verification of (8 - k)center dot m bits on the respective locations of the matching positions. Experiments conducted on natural language texts, plain ASCII DNA sequences and random byte sequences showed that the search performance with the proposed scheme is on average two times faster than the tested best exact pattern-matching algorithms. The highest gain is obtained on plain ASCII DNA sequences. We also developed an effective bitwise pattern-matching algorithm of possible independent interest within this study.