# Thread: Suffix Tree's internal representation

1. ## Suffix Tree's internal representation

Recently I've came up with an idea to make a compressor that records traversal of suffix tree. Nodes from the same parent would form a list, so unary coding would be perfectly feasible - with K elements alphabet the maximum number of flags per byte would be K - 1 (or K if we use additional flag to mark the end of data), but I expect the average number of flags per byte to be only a few. Time complexity would be O(N * K) where N is the input size.

The question is however about internal representation of suffix tree. There is a very old article of Mark Nelson: http://marknelson.us/1996/08/01/suffix-trees/ He is using the Ukkonen's on-line algorithm for building suffix trees that I also want to employ. However he is using unlinked hashmap instead of siblings list (so traversal is very inefficient) and his data structures seems somewhat reduntant.

Mark Nelson use following structures:

Node - at most 2 * N nodes:
4 bytes - suffix index

Edge - at most 2 * N edges, but hash table is 10 % larger than 2 * N:
4 bytes - index of starting character
4 bytes - index of ending character
4 bytes - index of origin node
4 bytes - index of ending node

Input - size N (of course):
1 byte - input symbol

The total size of suffix tree is: 2 * N * 4 + 2 * N * (4 + 4 + 4 + 4) + N = 8 * N + 32 * N + N = 41 * N bytes, that's a lot and doesn't even include things like parent pointer (which could be needed as I explain later).

My proposal is to integrate edges into nodes. Their number is almost equal (every different node has different incoming edge except root node).

I would want to use following structures:

Node - at most 2 * N nodes:
4 bytes - index of next (less recently used) sibling
4 bytes - index of first child
4 bytes - index of first character of incoming edge
4 bytes - suffix index

Input - size N (of course):
1 byte - input symbol

The total size of suffix tree is now: 2 * N * (4 + 4 + 4 + 4) + N = 32 * N + N = 33 * N bytes, so that's better. If we add parent pointer then we need 41 * N bytes.

(As a side note: another idea came up to my mind now: we could save similiar amount of memory with slight modification of Nelson's structures. We can use two types of edges: edges that goes to internal nodes and edges that goes to leafs. Edges that goes to leafs do not need index of ending character nor index of ending node - as that node's suffix is not used and set to -1 by default. Still, efficient traversal is impossible without augmenting structures with additional indexes)

Now: how do we know the index of last character of incoming edge in my scheme? There are two cases:
1. We are at leaf, so the last character of incoming edge is the last character of input (that's the property of leaf in suffix tree). We know that current node is a leaf if it has no children. (As a side note: we can use similar modification like in Nelson's approach: we can use two types of nodes: one that is an internal full node and one that is a leaf node, so suffix index is not necessary. Both leaves and internal nodes numbers are bounded by N, so we can shave 4 bytes from leaves).
2. We are at internal node and we need to 'magically' retrieve the last character index. In order to do that, at traversal of every edge we can update the index of first character of that edge to index of last occurence of such character. Additionally, when we have a match at some node, then we make that node a head of children list of its parent. After doing that we know that the index of first character of incoming edge of first children of a particular node is immediately following the index of last character of incoming edge of that node (lots of "of"'s here, I know my English is not very good).

1. What do you think about it? Could that work? Maybe I've missed (or messed up) something.
2. Is anybody familiar with sliding suffix tree implementation of Jesper Larsson (placed at: http://larsson.dogma.net/research.html) and could tell quickly if my space saving technique would be feasible to implement in it? I've skimmed over his description (in his doctoral thesis) of sliding suffix trees and they are using parent pointers - adding parent pointers should not impose any serious problems, I think.

PS:
I've found one obstacle. If we follow a suffix link we will end up in a internal node which can be first child of its parent. If that is the case then we must update the incoming edge of that parent parent. If it turns out that this parent is also first child of its parent, then we need to update the edges recursively. I don't know (yet?) how it will impact the computational complexity.

2. PPMd uses suffix trees. And Nelson's coder its based on likely does too (though i don't remember already).
There're also some papers on ppmd.
http://www.ctxmodel.net/files/PPMd/

3. Are you sure?
1. PPMd is of limited order but depth of suffix tree is O(input size) not constant.
2. I do not see any terms like "edge" or "path compression" in documentation. It is essential to have compressed paths, ie to merge node with only one child with its child, to obtain linear memory complexity.

BTW:
Could you convert articles on your page to UTF-8? Weird symbols are appearing when I read documentation written by Russians

4. > 1. PPMd is of limited order but depth of suffix tree is O(input size) not constant.

Its potentially unlimited, at least I tested up to order-1024, and ppmonstr in fact supports up to 128 as it is.

The limit is added to control speed and memory.
Same as BWT, unlimited contexts are troublesome.

> 2. I do not see any terms like "edge" or "path compression" in documentation.
> It is essential to have compressed paths, ie to merge node with only one child with its child, to obtain linear memory complexity.

There's a text buffer and pointers to that from leaf nodes.

As to "path compression", I'm more used to applying that to paths within the tree, not leaf ones.
Though I only know one compressor that (supposedly) does that - its called hipp.

> Could you convert articles on your page to UTF-8? Weird symbols are appearing when I read documentation written by Russians

If you mean engpaper.txt, its more about old age than russians - it should be viewed using DOS' cp866 charset for pseudographics.

5. It looks that my idea would work. I've finally implemented a program that does building suffix tree my way. I call the updating (mentioned in post scriptum in first post) "recovering" in my demo program. I am not sure but it is possible that the amortized complexity of recovering per input byte is constant.

My demo program takes an input from either console or first command line argument and then builds the suffix tree printing the intermediate suffix tree for successive prefixes of input string. Max 80 characters allowed. On every edge update during recovering process a "recovered" word is written. If you find (assuming you would be interested enough to play with suffix trees) a sequence that gives very high (proportionally to that sequence length) number of recovering then please share that finding with me

I based my implementation on: http://www.allisons.org/ll/AlgDS/Tree/Suffix/ . Demo program is compiled for 64-bit Ubuntu.lemontree.tgz

I'll try now to grasp Jesper Larsson's paper about sliding window suffix trees and incorporate relevant ideas. PPMd's model cut-offs aren't very granular, ie. after reducing model size compression ratio becomes much worse, because model size reduction is severe. From what I understand from Larsson's paper, his method is fully smooth thanks to the fact that every node has the same size. I would want to conserve memory by having many types of nodes, but I need to think how to achieve that with enough smoothness (probably I will opt for some tuned garbage collection).

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•