Suffix tree and suffix array for string matching
Suffix tree
- Definition: a suffix tree is a compressed trie containing all the suffixes of the given text as their keys and positions in the text as their values.
See an example of suffix tree:
There are efficient algorithms to construct suffix trees given by Weiner (1973) and McCreight (1976) (in linear time)
- Suffix tree allows one to find, extremely efficiently, all distinct subsequences in a given sequence.
- Suffix tree for string matching (to check if a pattern P is found in a text T)
- Preprocess text T (of size n), not pattern P (of size m)
O(n) preprocess time (n: the length of the text)
O(m+k) search time (m: the length of the pattern)
k is number of occurrences of P in T
- Match pattern P against tree starting at root until
Case 1, P is completely matched; Every leaf below this match point is the starting location of P in T
Case 2: No match is possible; P does not occur in T - See a simple example:
- Preprocess text T (of size n), not pattern P (of size m)
- For the task of comparing two DNA sequences, suffix trees allow one to quickly find all subsequences shared by the two inputs. The alignment is then built upon this information.
- How to build suffix tree?
- Brute-force method: $O(n^2)$ ($n$ is the the text size)
while suffixes remain: add next largest suffix to the tree
- Linear time algorithms
- Weiner showed original O(n) time algorithm
- More space efficient algorithm by McCreight in 1976 (quadratic space)
- Simpler algorithm (online algorithm) by Ukkonen in 1995, which relies on a few implementation speed-ups (including suffix links) to achieve the $O(n)$ time and $O(n)$ space bounds (see the paper; some slides; Ukkonen's suffix tree algorithm in plain English).
- Brute-force method: $O(n^2)$ ($n$ is the the text size)
Suffix Array
- Suffix array is a space-efficient data structure, which is more compact than a suffix tree
- Given a text, the suffix array is basically a position array of all its suffixes, sorted in lexicographical (alphabetical) order.
- A suffix array for a text of length $n$ can be built in $O (n logn )$ time,
- Searching the text for a pattern of length $m$ can be done in $O (m log n )$ time by a binary search; reduced to $O(m + logn)$ if using LCP (longest common prefix)