Succinct Data Structures
NSF CCF 1527435
May 2015-May 2019
PI: Sukhamay Kundu
In the era of
big-data, one needs to organize massive amounts of data so that it can be
searched quickly. This requires the building of an index over the raw data. In
many cases of big-data, like world-wide web or DNA sequencing, the actual
information content is low. This data is highly compressible. On the other
hand, the indexes require space which is several times the raw data. Thus,
indexing and compression are often conflicting goals. The field of compact (or
succinct) data structures attempts to achieve both these goals -compression and
indexability- simultaneously. This project will address some of the most
fundamental open problems in this field. This will have impact on next
generation biological sequence mining databases which could simply work within
the memory of a PC instead of requiring high-performance clusters. The
foundations built by this project will also impact image matching and music
retrieval. Since data structures is one of the most fundamental areas in
computer science education, research from this project will also impact data
Suffix trees are central to string indexing and have myriads of applications. However, suffix trees are known to take 15 to 50 times the size of the text they index. This actually stems from a complexity gap in the size of data which is n log s bits compared to the size of the index which is O( n log n) bits, for the text of n characters drawn from alphabet size of s. The techniques of Burrows-Wheeler Transform(BWT) and Phi-function were introduced in the last decade to address this gap. Most subsequent research in this field has treated BWT as a black box, compressing augmenting structures around it to address various applications. However, many problems (like parameterized pattern matching and 2D pattern matching) have remained open in this field. This project will attempt to go deeper and beyond the philosophy of BWT to solve such issues. It will also try to build foundations for deriving lower bounds for problems where compact index would be impossible. To create better understanding of data structure space and query complexity, the project will explore the recent theoretical model called "encoding model". The project will also explore the applied case of compressed indexing for highly repetitive sequences.
Rahul Shah (External
Sudip Biswas (Graduate Student:2015-2016)
Arnab Ganguly (Graduate Student:2015-2017)
Fatemeh Mohebbi (Graduate Student: 2017-2018)
Sanaz Rabinia (Graduate Student:2017 - )
Sharma Thankachan (External Collaborator)
Wing-Kai Hon (External Collaborator)
Sandeep Kumar Majji (External Collaborator Undergraduate Student IIT Mumbai - Summer 2017)
BVS Naidu (External Collaborator Undergraduate Student IIT Mumbai - Summer 2017)
Our goal in this project is to develop succinct indexes for some of the unsolved problems in the area of succinct indexes. Suffix trees have many applications and many of them can be solved by augmenting suffix tree with additional information. The emphasis in the area till now has been to replace suffix tree by it succinct counter-part and then re-permute and compress the augmenting information. However, there are other problems which have their own different variant of suffix tree. We look beyond the existing Burrows-Wheeler Transform (or Phi-function) based approach to address some of these challenges.
One of our initial breakthrough has come the following result (published in SODA 2017) which solves this case for parameterized pattern matching. In parameterized pattern matching the symbols of pattern and the text can be matched under any one-to-one correspondence. At the heart of this solution is a transform called pBWT (parameterized Burrrows-Wheeler Transform) and we show how its corresponding LF-mapping can be computed.
Next, we extend the above result to the problem of structural pattern matching for RNA matching. RNA bases come in pairs and this pairing structure needs to be preserved while matching apart from one-to-one correspondence as in parameterized matching. This is published in ISAAC 2017.
Recently, we obtained the first succinct index of the problem of order-preserving pattern matching.
Here is a link to Collaborator Shah’s talk - Parameterized Text Indexing.pptx
Collaborator Shah’s talk on Order-Preserving Matching -Hard Text Indexing.pptx
Arnab Ganguly, Rahul Shah, Sharma V. Thankachan, pBWT: Achieving Succinct Data Structures for Parameterized Pattern Matching and Related Problems, In ACM-SIAM Symposium on Discrete Algorithms (SODA), 397-407, 2017.
Arnab Ganguly, Wing-Kai Hon, Rahul Shah, Stabbing Colors in One Dimension, Data Compression Conference (DCC), 2017.
Arnab Ganguly, Wing-Kai Hon, Kunihiko Sadakane, Rahul Shah, Sharma V. Thankachan, Yilin Yang, Space-Efficient Dictionaries for Parameterized and Order-Preserving Pattern Matching, In Combinatorial Pattern Matching (CPM), 2:1-2:12, 2016.
Arnab Ganguly, Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, Space-Time Trade-Offs for the Shortest Unique Substring Problem, International Symposium on Algorithms and Computation (ISAAC), 34:1-34:13, 2016.
Arnab Ganguly, Wing-Kai Hon, Rahul Shah, A Framework for Dynamic Parameterized Dictionary Matching, Symposium on Algorithms Theory (SWAT), 10:1-10:1, 2016.
Paniz Abedin, Arnab Ganguly, Wing-Kai Hon, Yakov Nekrich, Kunihiko Sadakane, Rahul Shah, Sharma V. Thankachan, A Linear-Space Data Structure for Range-LCP Queries in Poly-Logarithmic Time. COCOON 2018: 615-625
Arnab Ganguly, Rahul Shah, Sharma V. Thankachan, Structural Pattern Matching - Succinctly. ISAAC 2017: 35:1-35:13
This material is based upon work supported by the National Science Foundation under Grant CCF 1527435.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and collaborators and do not necessarily reflect the views of the National Science Foundation.
Last updated: July 8th, 2018
Contact: visit PI’s webpage link at the top of the page