I/O-efficient data structures for non-overlapping indexing


Hooshmand S., Abedin P., Külekci M. O., Thankachan S. V.

THEORETICAL COMPUTER SCIENCE, cilt.857, ss.1-7, 2021 (SCI-Expanded) identifier identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 857
  • Basım Tarihi: 2021
  • Doi Numarası: 10.1016/j.tcs.2020.12.006
  • Dergi Adı: THEORETICAL COMPUTER SCIENCE
  • Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Applied Science & Technology Source, Communication Abstracts, Computer & Applied Sciences, INSPEC, MathSciNet, Metadex, zbMATH, Civil Engineering Abstracts
  • Sayfa Sayıları: ss.1-7
  • İstanbul Teknik Üniversitesi Adresli: Evet

Özet

The non-overlapping indexing problem is defined as follows: pre-process a given text T[1, n] of length n into a data structure such that whenever a pattern P [1, m] comes as an input, we can efficiently report the largest set of non-overlapping occurrences of P in T. The best-known solution is by Cohen and Porat [ISAAC 2009]. The size of their structure is O (n) words and the query time is optimal O (m + nocc), where nocc is the output size. Later, Ganguly et al. [CPM 2015 and Algorithmica 2020] proposed a compressed space solution. We study this problem in the cache-oblivious model and present a new data structure of size O (n log n) words. It can answer queries in optimal O (m/B + log(B) n + nocc/B) I/O operations, where B is the block size. The space can be improved to O (n log(M/B) n) in the cache-aware model, where M is the size of main memory. Additionally, we study a generalization of this problem with an additional range [s, e] constraint. Here the task is to report the largest set of non-overlapping occurrences of P in T, that are within the range [s, e]. We present an O (n log(2) n) space data structure in the cache-aware model that can answer queries in optimal O (m/B + log(B) n + nocc([s,e]) B ) I/O operations, where nocc([s,e]) is the output size.