New Tools for Text Indexing and Beyond: Substring Complexity and String Synchronizing Sets
Tomasz Kociumaka
MPI-INF - D1
07 Feb 2024, 12:15 pm - 1:15 pm
Saarbrücken building E1 5, room 002
Joint Lecture Series
In recent decades, the volume of sequential data processed in genomics and
other disciplines has exploded. This trend escalates the challenge of designing
text indexes – concise data structures that allow answering various queries
efficiently. For example, pattern-matching queries ask if the text (a long
string modeling the dataset) contains an occurrence of a given pattern (a short
string provided at query time). The suffix array is a classic index for pattern
matching and many other queries, ...
In recent decades, the volume of sequential data processed in genomics and
other disciplines has exploded. This trend escalates the challenge of designing
text indexes – concise data structures that allow answering various queries
efficiently. For example, pattern-matching queries ask if the text (a long
string modeling the dataset) contains an occurrence of a given pattern (a short
string provided at query time). The suffix array is a classic index for pattern
matching and many other queries, but modern applications demand space-efficient
alternatives like the FM index. Still, even these data structures struggle with
massive highly repetitive datasets, such as human genome collections. This
scenario requires indexes whose size is comparable to the best text compression
methods.
The talk will introduce two tools originating from my text indexing research. Substring complexity is a well-behaved measure of text repetitiveness that aids in comparing the performance of compression algorithms. String synchronizing sets enable consistent selection of small subsets of text substrings. Recently, these tools became the core of the asymptotically smallest compressed text index with suffix-array functionality. Furthermore, they find applications beyond text indexing: in combinatorics on words, quantum string algorithms, and more.
Read more
The talk will introduce two tools originating from my text indexing research. Substring complexity is a well-behaved measure of text repetitiveness that aids in comparing the performance of compression algorithms. String synchronizing sets enable consistent selection of small subsets of text substrings. Recently, these tools became the core of the asymptotically smallest compressed text index with suffix-array functionality. Furthermore, they find applications beyond text indexing: in combinatorics on words, quantum string algorithms, and more.