My structure is the same as that link's suffix trie, except I don't add nodes beyond the first one that makes a suffix unique. That node is added only so that if a later suffix matches, the program can go back and add the child (or children, as needed) of the suffix that created the node, turning it into an internal node.
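The idea above can be sketched as a lazy trie insertion: each leaf remembers the uninserted tail of its suffix, and a later matching suffix expands the leaf one level, turning it into an internal node. The node layout and names here are my assumptions for illustration, not TreeCompress's actual code.

```c
/* Hypothetical sketch: stop adding nodes at the first one that makes a
   suffix unique; expand lazily when a later suffix matches. */
#include <stdlib.h>

#define ALPHABET 256

typedef struct Node {
    struct Node *child[ALPHABET];
    const char *suffix_rest;   /* for a lazy leaf: uninserted tail of its suffix */
} Node;

Node *new_node(void) {
    return calloc(1, sizeof(Node));
}

void insert_suffix(Node *root, const char *s) {
    Node *n = root;
    while (*s) {
        unsigned char c = (unsigned char)*s;
        if (n->child[c] == NULL) {
            /* First symbol that makes this suffix unique: add one node
               and remember the rest lazily, without adding more nodes. */
            Node *leaf = new_node();
            leaf->suffix_rest = s + 1;
            n->child[c] = leaf;
            return;
        }
        n = n->child[c];
        s++;
        if (n->suffix_rest) {
            /* A later suffix reached a lazy leaf: add the child of the
               suffix that created it, turning it into an internal node. */
            unsigned char d = (unsigned char)n->suffix_rest[0];
            if (d) {
                Node *grand = new_node();
                grand->suffix_rest = n->suffix_rest + 1;
                n->child[d] = grand;
            }
            n->suffix_rest = NULL;
        }
    }
}
```

This uses a flat 256-way child array for simplicity; it does not show the log4(alphabet size) sibling-depth limiting mentioned later.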

While the nodes/leaves ratio was used to determine a reasonably good cost function, the number of nodes and leaves is not used in TreeCompress's cost function. The cost function essentially takes three inputs: the reduction in order-0 entropy from the candidate suffix, the number of occurrences of the suffix, and the length of the suffix.
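To make the three inputs concrete, here is a toy scoring function built from occurrence count, suffix length, and a per-symbol bit cost. This is a hedged stand-in I made up for illustration; it is not TreeCompress's actual formula.

```c
/* Toy benefit score for tokenizing a candidate suffix (assumed formula,
   not TreeCompress's): bits saved by replacing `count` occurrences of a
   `length`-symbol suffix with one token each, minus the cost of storing
   one dictionary copy of the suffix. */
double token_score(double bits_per_symbol, int count, int length) {
    double bits_before = (double)count * length * bits_per_symbol;
    double bits_after  = (double)count * bits_per_symbol       /* one token per occurrence */
                       + (double)length * bits_per_symbol;     /* one dictionary copy */
    return bits_before - bits_after;  /* positive means worth tokenizing */
}
```

A real version would use the actual entropy reduction rather than a flat bits-per-symbol estimate, but the shape of the trade-off (count and length versus overhead) is the same.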

I think I could modify TreeCompress by adding one more word to each node indicating the number of symbols on the branch leading to that node, plus a branch-splitting function. At this point, that would probably be easier than finding someone else's code and adapting it. I am rather fond of my implementation's ability to limit the sibling depth to log4(alphabet size) without hash tables and am not sure I would get that with other code.
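The modification described above might look something like this: one extra word per node for the incoming branch length, plus a function that splits a branch by inserting a new internal node. Field names and layout are assumptions for the sketch.

```c
/* Sketch of a node with a branch-length word and a branch split function.
   Names and layout are assumptions, not TreeCompress's real structures. */
#include <stdlib.h>

typedef struct TNode {
    unsigned branch_len;        /* number of symbols on the edge leading here */
    unsigned branch_start;      /* offset of those symbols in the text */
    struct TNode *first_child;
    struct TNode *next_sibling;
} TNode;

/* Split the branch leading to n after `depth` symbols, inserting a new
   internal node that takes over n's place. The caller relinks the
   parent's child pointer to the returned node. */
TNode *split_branch(TNode *n, unsigned depth) {
    TNode *mid = calloc(1, sizeof(TNode));
    mid->branch_start = n->branch_start;
    mid->branch_len = depth;
    mid->first_child = n;
    mid->next_sibling = n->next_sibling;
    n->next_sibling = NULL;
    n->branch_start += depth;
    n->branch_len -= depth;
    return mid;
}
```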

Suffix tree construction averages roughly 75% of execution time on v0.3 of TreeCompress.

Suffix links will not work as well as they normally would for TreeCompress, since the tree is built in pieces. Nonetheless, it is an interesting idea that might save a lot of tree traversals. I will think about it some more.

If you compare the first compression cycle time for enwik9 to that of enwik8, I am pretty sure the enwik9 time is much closer to 10x that of enwik8 than to 100x. Technically the algorithm is O(n^3) in the worst case, but in practice the average suffix depth grows much more slowly than the file, and having to build the tree in pieces does not have a big impact, so it runs in close to linear time when comparing enwik8 to enwik9. The overall compression process takes about 25x longer, but that is mostly because many more tree builds are done, since there are many more strings that can be effectively tokenized.

I think the simplest suffix array algorithms run in O(n log n) time because of the binary search. For TreeCompress, what I call the dictionary size is the same thing as the alphabet size, i.e. the number of unique symbols in the file. I am concerned that doing log(number of suffixes) string comparisons per suffix could take longer than the (average suffix length) * log4(alphabet size) symbol comparisons done when building the tree. I realize there are more complex algorithms that build a suffix array in linear time, but I don't understand them yet.
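To make the comparison count concrete, here is a minimal suffix array build that simply sorts suffix offsets; it uses a comparison sort rather than incremental binary-search insertion, but either way each of the O(n log n) comparisons is a full string comparison that can itself cost up to the suffix length, which is the concern raised above.

```c
/* Minimal suffix-array construction by plain sorting: O(n log n) string
   comparisons, each potentially as long as the suffixes being compared. */
#include <stdlib.h>
#include <string.h>

static const char *g_text;  /* text shared with the qsort comparator */

static int cmp_suffix(const void *a, const void *b) {
    return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
}

/* Fill sa[0..n-1] with suffix start offsets in lexicographic order. */
void build_suffix_array(const char *text, int n, int *sa) {
    g_text = text;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    qsort(sa, n, sizeof(int), cmp_suffix);
}
```

For "banana" this produces the order a, ana, anana, banana, na, nana, i.e. offsets 5, 3, 1, 0, 4, 2.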