Suffix trees.

An entire post covering both the theoretical part and the implementation is going to take too long, for this reason I’m splitting the material in two parts: A post with some theory introduction, and one post which will be focused on implementation details.

The algorithm I would like to show you is the Ukkonen method for building suffix trees, but first let’s start with some theory.

Introduction

Given a text and a pattern, we may want to know if the text contain the pattern, that is, the membership problem.One may want to be able to answer this question in linear time to the size of the pattern by querying a data structure which should be constructed in linear time and which should be linear in terms of memory usage.

Algorithms for sub string matching like Boyer-Moore or Knuth-Morris-Pratt are linear in the size of the input text plus the cost of building the shift tables for the given pattern, that’s a good bound for most of the applications, unless you’re dealing with non trivial patterns and very long text.

When the text is very long and many research need to be performed then BM and KMP are just too inefficient, the bound on the size of the text is not acceptable anymore.

The exact matching problem is just one flavor of a wide range of similar problems which we may want to solve in linear time, some of them are not even easily solvable or if they are then the solution is expensive from a computational complexity perspective, consider just as example the inexact matching problem.

What we want to do is develop and algorithm which given a text Tn and a pattern Sm, where we may assume n>>m, solve the membership problem in O(m) for any S. From now on, we will assume a fixed and finite alphabet.

Suffix trie.

Spelled “Suffix TRY”, from reTRIEval. Those Are nice and easy to understand data structure which are a good starting point for the topic I want to cover with this post. Those trees are commonly used to build dictionaries and are able to solve the membership problem in linear time, they also are good to retrieve or query for all the suffixes of the dictionary entries.

The only problem is the O(n^2) complexity bound for both construction time and memory consumption.

Say we have T={“Tree”,”Blog”,”Trump”,”Big”}, to build the Suffix trie we need to iterate for all the elements of T and for each s∈T, and we want to add to the tree all the s[i…l(s)] for 0<=i<l(s) where l(s) is the length of s, or in other words, all the suffixes of the words in the set T.

The addition starts always from the root, each edge has a label which is a single character from the fixed and finite alphabet, a path from the root to any vertex spells a suffix of s or a prefix of a suffix, eventually the whole s (which is a legit suffix) is spelled.

If when adding one of the words of T we end up in a leaf but we have still symbols to add, then we just proceed by including new edges to the tree and extending the path where we are. If our word match till a certain point a path in the tree, let say till s[j], but then for i>j the path labels are different, then we need to split the path and add a new edge out of the last matching point. The new path will have length i.

And as last case, if while traversing the tree we end up in the middle of a path and have no more symbols to add, then we do nothing, since the information about the string we are trying to add is already in the tree.

IMG_5353[1]
Suffix trie for “tree”,”blog”,”trump” and “big”
Given the suffix trie, the membership test is solvable in linear time to the size of the pattern, starting from the root we may check whether RUM exist in the text by following the path with the labels R,U and M, since the path exist then the string RUM must exist in the text (Actually RUM is the suffix of the prefix TRUM of TRUMP).

One interesting property of those data structure is that if when traversing the tree one picks always first the child with the lexicographical smaller label then once a leaf if reached the path spells the suffixes in order.

For the tree in the figure starting from the root the edge with the smallest label is B, then I and then G, which is just BIG, that just happen to be also the smallest (in terms of lexicographical comparison) of the words in the dictionary T.

That’s cools, but the construction time bound is quadratic and they are very expensive from the memory point of view.

From Suffix trie to Suffix tree.

We need to make some modification to our data structure, in order to do that let’s draw first the suffix trie for the string “abaab” and let’s see what we can do:

abaab
Suffix trie for abaab

First of all, let’s call internal nodes all the nodes that do have more than one child, not including the root. We have only one internal node for the suffix trie of abaab, now let’s compress (or compact) all the paths between internal nodes and internal nodes or the root, and all the paths from internal nodes to leafs, we will do this by grouping all the edge labels under a single edge, like this:

abaab_compress
The result of the path compression of the trie for abaab

This operations is permitting a certain degree of memory savings, we have much less edges and nodes, but nevertheless this modification alone is not going to be very effecting for big trees.

Let’s now notice one thing: The way we are treating the suffixes in our data structure has one flaw, we have no way to list all the suffixes of abaab by means of just traversing the tree, as sample let’s consider, ‘ba‘,which is not a legit suffix of abaab, but its a prefix of one of its suffix, or eventually a suffix of a factor or abaab.

Even tough ba is not a suffix of abaab, if we traverse the tree we find that te path with B->A exist, which means that the sequence ba is in the original text, but we want a way to distinguish legit suffixes of the text from just substring its substrings.

For this reason let’s distinguish internal nodes in two types, explicit and implicit nodes. Explicit nodes are the one created by explicitly splitting a path edge in two and consequently adding a new leaf to the tree, implicit node are those at the end of each possible suffix of our string.

Implicit node do not exist in our data structure as real nodes but are just stored as piece of information within the algorithm, by this I mean that we are not really adding any new node, but just saving the information that a path has eventually implicit nodes. How does this is performed will be clear later.

After applying the distinction between implicit and explicit nodes, abaab will look like this:

implicit_nodes
The small dots are for the implicit nodes, this is a proper suffix tree.

What we have now is a proper representation of a suffix tree for the string abaab, all the suffix are included and the data structure can be queried in order to solve the membership problem and all the suffixes of the string can be listed by a simple tree traversal, where each path from the root to a leaf or to an implicit node represent a suffix of the string.

Eventually implicit nodes may be replaced by real nodes with one additional out edge having a label made of symbols not present in the alphabet, which make possible to mark unequivocally the end of a suffix, in our sample the alphabet is made of A and B, so $ is a good choice:

implicit_nodes_marker
Using a real edge out of a not anymore implicit node.

All the legit suffix of abaab are now properly represented and can be identified easily. To further improve our data structure we need to make some observation on the process of inserting new elements to the tree, let’s assume we want to extend out string abaab with a c, having thus abaabc.

Since abaab is a prefix of abaabc, any legit suffix of abaab is a legit prefix of abaabc, which clearly means that we should be able to build the proper tree for abaabc by extending the suffix tree of abaab. In order to have some order here, let’s call I(n) the suffix tree for the string abaab, and I(n+1) the suffix tree built upon I(n) by adding one symbol, like c.

We need to insert all the suffixes of abaabc in I(n) in order to obtain I(n+1). We have s[i…l(s)] for 0<=i<l(s) which is: abaabc, baabc, aabc, abc, bc and c. Let’s insert them in order and see what we have:

extension1
Insertion of abaabc, baabc and aabc, the first three suffixes of abaabc

For the first three suffixes of abaabc we are just extending the leafs by adding the new symbol c, there’s no modification of the tree structure and since we are compacting the paths the only required operation after the proper leaf is identified is just appending c to the edge label.

extension2
Insertion of the remaining three suffixes of abaabc

Inserting abc required a modification of the tree structure, an implicit node disappear and the edge with label baabc need to be split, a new edge is thus created with label c. This edge points to a leaf node. Inserting bc required a split operation as well, the path baabc (this time starting from the root) is split in b and aabc, a new internal node is created with an edge to a leaf.

Adding C is trivial, just a new edge from the root since no path start with C.

Some implementation details.

When building I(n+1) upon I(n) for each path of a prefix of the string we have the following possible actions:

  1. If the path ends in a leaf and we have still characters to add, then we extend the label for that leaf and do nothing more. This operation require just constant time once the proper leaf is identified.
  2. If is possible to follow a path from the root to a certain edge and then if there’s a mismatch, we know that say from j the string do not match anymore the path. We have a match from s[0…j] and a mismatch from at least j+1. In this situation we need to split the edge where the last match happen and proceed with a new leaf, this is the case for abc and bc in the pictures before.
  3. If the path ends in the middle of an edge and we have no mismatch, no operation is required but adding eventually a new implicit node to mark the suffix.

Only the second action cause a modification to the data structure, the other made no real changes but appending symbols to the labels (1) or doing nothing (3).

One modification we may want to introduce in our data structure is substituting the string labels with just two indexes, i and j where s[i…j] is a substring of s. Storing the labels in this way is going to save a lot of space, since each edge will have just two integers instead of a possibly very long string. I will continue to use the labels tough, the reason is just clarity, is much easier to understand what is happening when you see the label instead of two numbers.

If we proceed by substituting string labels with indexes we notice one important thing, each leaf node refers always to the end of s, this is a property I already wrote about earlier, the path labels from the root to a leaf spells out a legit suffix of s, whereas paths from the root to internal nodes spells out suffix of prefixes of s.

Is possible to speed up the process of building the tree by ignoring the leafs extension by addition (as we made with the addition of ‘c‘ in the sample before) and by marking each leaf with a special termination index, for example ∞, if we find this symbol then we know that the edge will always refer to the last character of the string, extending the string will extend automatically ∞ to the new character.

Let’s see how does looks like our suffix tree after removing the string labels:

reference_pairs
Using indexes instead of strings, and marking the open leafs with ∞

Now we are saving a lot of space, and we are also ready to see how is possible to identify the position of all the occurrences of a given pattern, which is something we really want to do in order to make this structure handy.

Finding the pattern occurrences.

Once we have each edge storing the index information of the labels instead of the real string, we are able to easily identify the position of all the occurrence of a pattern. Let’s say we want to know how many time and where ab appears in abaabc , by just traversing the tree from the root we will find an occurrence of a with of the pair of indexes {0,1} now we have an internal node with two childs, the next match is in {1,2}=b, which spells ab.

So we know that ab is present in the text and by one simple property of the tree we just have built, this substring will appear exactly twice! From the edge where we are once ab is identified we may reach exactly two leafs, namely the ones with indexes {2,∞} and {5,∞}, as I said before, the tree will have one leaf for each possible suffix of s, which may or may not have ab as prefix.

If ab exist in the text then it must be a prefix of a suffix of s, so the amount of time ab (or any other substring) appears in s is exactly equal to the amount of time ab is a prefix of a suffix of s!

To find the starting position of any occurrence we just need to keep track of the length of the paths from the match of the substring till each leaf, and subtract from the starting index of the edge to the leaf those lengths.

To match ab we traversed the edge {0,1} and {1,2}, of length 1 and 1. The starting indexes of the two leafs from {1,2} are 2 and 5, so we have 2-2=0 and 5-3=0, which are the indexes of the starting position of the occurrences of ab!

Now let’s repeat the same procedure for the substring a, we have a complete match for a at the first edge from the root with indexes {0,1}. From this point we can reach three leafs, so the substring a must appear three times in the text.

Those three leafs have indexes {3,∞}, {2,∞} and {5,∞}, to reach them in the order they are listed one needs to traverse edges of length 1 (just {0,1}), then 1 + 1 and again 1 + 1, so the starting positions of the occurrences of a are 3-1=2, 2-2=0 and 5-2=3, Zero, two and three.

The tree construction process.

In order to understand properly the concept in the next sections some clarification is needed on the way the tree is build, I’m not going to add anything new but just some order in the ideas already introduced.

As I said, to build the tree of a string with length m we start with the suffix s[0…m], then s[1…m] and so on till s[m-1…m], each step will extend the tree by adding one more suffix of s. Let’s call those steps extension phases, there are m extension phases for a string of length m.

One of the things we have noticed before is that is possible to build the suffix tree I(n+1) from I(n) by extending I(n) itself (do you remember when we added c to abaab?), let’s exploit this fact in order to develop a process for building suffix trees by successively processing the prefixes of s, from s[0…1] to s[0…m] and for each end every prefix let’s proceed with the algorithm we have seen so far.

So for abaabc we will process in the following order: ‘a’,’ab’,’aba’…’abaabc’, for each prefix we built the suffix tree, having thus I(1), I(2)….I(n), where I(n) is the final suffix tree. Every I(m) have m extension phases, one for each prefix s[0…m].

The process is depicted in the following image (note that I will not use indexes but strings to label the edges, for human reading strings are much more understandable):

construction1
Steps for the incremental construction of the suffix tree for abaabc
construction2.jpg
Final steps, and the resulting tree.

For each prefix I drawn all the construction phases, underlined in read you may find the current prefix being processed followed by all the its suffixes.

Clearly this process as it was explained here is far from being efficient, if we consider the time needed to traverse the tree being linear to the size of the current suffix, then this algorithm is O(m^3), but those steps are really important to understand the algorithm I will introduce later, and upon this inefficient procedure a linear time construction algorithm will be developed.

Suffix links.

A key feature of suffix trees are suffix links, those links are special edges connecting internal nodes, every internal node will have it’s suffix link once the tree is constructed.

Let’s say we have a string φs in the tree, where s is eventually empty and φ is the first character of the string we are considering, for abaabc φ=a ans s=baabc, the string φs follow  a path in the tree till an internal node which lead to a leaf terminating φs, if in the sae tree exist a path for s (without φ), then will exist withing the structure an edge from the internal node identified by φs till the one for s. This edge is the suffix link.

To understand the benefits of those links let’s consider how the procedure we saw so far works. In each extension phase we proceed by adding one more suffix to the tree, by considering prefixes of decreasing length s[i…j] with 0<=i<=j and 1<=j<m. For each suffix we start from the root and follow a path identified by the symbols of the string.

For abaabc (the last step in the image before) we have added in order abaabc, baabc, aabc &c. Each time starting from the root. But, since φs match exactly but the first symbol, the we may avoid following the path from the root and instead using the suffix link from φs to identify immediately the correct internal node for s.

How do we create those links is actually simple in theory, but a little problematic in practice (but we’ll see this later on), when we are inserting the prefix s[1…j] then it must be the case that we have already inserted s[0…j], so we can save the internal node for s[0…m] we have found in the previous extension phase, and when we’re done with s[1…j] we can use that pointer to create the correct suffix link from s[0…i] to s[1…i].

We will make the following assumption about suffix links:

  • The suffix link from the root point to the root itself. The root is not considered an internal node and thus shouldn’t have a suffix link, but forcing the root to have a link to itself will simplify the implementation we will see later.
  • If a suffix link cannot be created, then we will forcibly point the internal node to the root. This is equivalent to setting the suffix link to nil and the starting the traversal from the root.

For the sample string abaabc we will create three suffix links:

1
New internal node created in the extension phase aa for abaa
2.jpg
Last extension phase for abaa, it terminates in the root so the suffix link should be nil, but we will let it point to the root.
3.jpg
Creation of an internal node during the extension phase abc of abaabc
4.jpg
In the next extension phase following abc we create a new internal node. We set the suffix link from the node saved before to this new one.
5.jpg
The final suffix tree.

Suffix links are going to save us a lot of time when during the construction process of the tree we’re going to skip long parts the string just jumping to the correct internal node which point the the leafs terminating the next suffix.

To appreciate a little more this last construct let’s have a look at the suffix tree for the string mississip, and let’s say we want to add a new symbol i to the tree, in order to have mississipi (reading again this post, I noticed I made a bad blunder! mississipi should be mississippi, with two p!! I’m sorry for this, but I really don’t want to draw again this tree with the correct spelling 🙂 ):

6
The suffix tree for mississip with the proper suffix links

Normally for each suffix of mississipi we will start the search for a path from the root this a leaf in order to extend the three with the additional symbol i, using suffix link will save us time by avoiding this procedure and skipping part of the string.

The following image is the path you have to follow to add i to mississip, I hope is readable enough, the numbers on the pencil drawn lines are for the sequence followed by the algorithm, first, second &c.

7
The path followed by the algorithm in order to extend mississip with the symbol i.

As you can see we are mostly moving from the last internal node of a prefix to the next one, skipping the whole internal part (when possible). Each time an extension is completed we look for the parent of the current leaf node, if it have a suffix link (this is always the case, the link will point to another internal node or to the root) then we follow the suffix link and proceed with the next extension.

For big trees suffix link are really going to make a difference, is very hard to show you how much by some small examples on paper. One thing you may have noticed is that once we follow a suffix link we do not have the information on where in the string we are, or in other words, which leaf should we follow next!.

On this paper example with human readable labels I can just check by a quick glance at the tree, let’s say I’m performing the step third in the figure, after following the link I have two leafs, ssip an p, the sequence of parents from the root spells ssi and I’m willing to add ssissipi to the tree, so after the jump I’m left with ssissipi without the prefix ssi, or ssip. Which gives me enough information to follow the proper leaf.

This last operation is doing in constant time in the final algorithm by exploiting the label indexes, remember that I’m using the string label for clarity? With the indexes instead of the strings the procedure turns to be a little different, but much faster!

Conclusion

This is a really long post! Probably there are mistakes somewhere in what I wrote, please forgive me, I made my best to avoid any mistake or unclear statement. Nevertheless if you find something unclear or something wrong let me know, was not an easy task to draw all that trees, and I’m not that good and drawing 🙂

In the next post I will develop further what was explained here, and I will try to provide the easier to understand possible to implementation of the Ukkonen algorithm, suffix trees are at the very hearth of many string processing algorithms or procedures, so understanding how they works and how to build them is I believe important.

Thanks for reading!

References

[1] “On–line construction of suffix trees“, Esko Ukkonen
[2] “Linear-Time Construction of Suffix Trees“, Dan Gusfield
[3] “From Ukkonen to McCreight and Weiner: A unifying View of Linear-Time Suffix Tree Construction“, Robert Giegerich and Stefan Kurtz
[4] ” On-line construction of suffix trees“, Chapter 4, Jewels of stringology
[5] “Introduction to Suffix Trees“, Chapter 5, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology

1 Comment

  1. […] hope to find the time to complete the second part of my post of suffix trees soon, had a lot of things going on those days and not very much free, hope you enjoyed this short […]

Leave a Reply