Recently I've came up with an idea to make a compressor that records traversal of suffix tree. Nodes from the same parent would form a list, so unary coding would be perfectly feasible - with K elements alphabet the maximum number of flags per byte would be K - 1 (or K if we use additional flag to mark the end of data), but I expect the average number of flags per byte to be only a few. Time complexity would be O(N * K) where N is the input size.
The question is however about internal representation of suffix tree. There is a very old article of Mark Nelson: http://marknelson.us/1996/08/01/suffix-trees/ He is using the Ukkonen's on-line algorithm for building suffix trees that I also want to employ. However he is using unlinked hashmap instead of siblings list (so traversal is very inefficient) and his data structures seems somewhat reduntant.
Mark Nelson use following structures:
Node - at most 2 * N nodes:
4 bytes - suffix index
Edge - at most 2 * N edges, but hash table is 10 % larger than 2 * N:
4 bytes - index of starting character
4 bytes - index of ending character
4 bytes - index of origin node
4 bytes - index of ending node
Input - size N (of course):
1 byte - input symbol
The total size of suffix tree is: 2 * N * 4 + 2 * N * (4 + 4 + 4 + 4) + N = 8 * N + 32 * N + N = 41 * N bytes, that's a lot and doesn't even include things like parent pointer (which could be needed as I explain later).
My proposal is to integrate edges into nodes. Their number is almost equal (every different node has different incoming edge except root node).
I would want to use following structures:
Node - at most 2 * N nodes:
4 bytes - index of next (less recently used) sibling
4 bytes - index of first child
4 bytes - index of first character of incoming edge
4 bytes - suffix index
Input - size N (of course):
1 byte - input symbol
The total size of suffix tree is now: 2 * N * (4 + 4 + 4 + 4) + N = 32 * N + N = 33 * N bytes, so that's better. If we add parent pointer then we need 41 * N bytes.
(As a side note: another idea came up to my mind now: we could save similiar amount of memory with slight modification of Nelson's structures. We can use two types of edges: edges that goes to internal nodes and edges that goes to leafs. Edges that goes to leafs do not need index of ending character nor index of ending node - as that node's suffix is not used and set to -1 by default. Still, efficient traversal is impossible without augmenting structures with additional indexes)
Now: how do we know the index of last character of incoming edge in my scheme? There are two cases:
1. We are at leaf, so the last character of incoming edge is the last character of input (that's the property of leaf in suffix tree). We know that current node is a leaf if it has no children. (As a side note: we can use similar modification like in Nelson's approach: we can use two types of nodes: one that is an internal full node and one that is a leaf node, so suffix index is not necessary. Both leaves and internal nodes numbers are bounded by N, so we can shave 4 bytes from leaves).
2. We are at internal node and we need to 'magically' retrieve the last character index. In order to do that, at traversal of every edge we can update the index of first character of that edge to index of last occurence of such character. Additionally, when we have a match at some node, then we make that node a head of children list of its parent. After doing that we know that the index of first character of incoming edge of first children of a particular node is immediately following the index of last character of incoming edge of that node (lots of "of"'s here, I know my English is not very good).
1. What do you think about it? Could that work? Maybe I've missed (or messed up) something.
2. Is anybody familiar with sliding suffix tree implementation of Jesper Larsson (placed at: http://larsson.dogma.net/research.html) and could tell quickly if my space saving technique would be feasible to implement in it? I've skimmed over his description (in his doctoral thesis) of sliding suffix trees and they are using parent pointers - adding parent pointers should not impose any serious problems, I think.
PS:
I've found one obstacle. If we follow a suffix link we will end up in a internal node which can be first child of its parent. If that is the case then we must update the incoming edge of that parent parent. If it turns out that this parent is also first child of its parent, then we need to update the edges recursively. I don't know (yet?) how it will impact the computational complexity.