1. Originally Posted by nburns
Okay, then you must actually be adding every suffix -- or at least part of every suffix. I'm not entirely sure exactly what you have there. It may be a suffix trie. Here's a very nice set of slides that I just found, that should be helpful:

http://www.cs.jhu.edu/~langmea/resou...ffix_tries.pdf
My structure is the same as that link's suffix trie except I don't add nodes beyond the first one that makes a suffix unique. The only reason the node that makes a suffix unique is added is so that if a later suffix matches, the program can go back and add the child (or children, as needed) of the suffix that created the node, turning it into an internal node.
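In rough code form, the insertion rule looks something like this (a simplified Python sketch of my interpretation, not the actual TreeCompress code; the names are made up):

```python
# Simplified sketch of the "truncated" suffix trie described above. Each node
# stores the start position of the suffix that created it, so a leaf can be
# turned into an internal node later when another suffix walks through it.

class Node:
    def __init__(self, origin):
        self.children = {}    # symbol -> Node
        self.origin = origin  # start of the suffix that created this node

def insert_suffix(root, text, pos):
    node, depth = root, 0
    while pos + depth < len(text):
        sym = text[pos + depth]
        if sym not in node.children:
            # First symbol that makes this suffix unique: add one node, stop.
            node.children[sym] = Node(pos)
            return
        node = node.children[sym]
        depth += 1
        if not node.children:
            # We matched a leaf: go back to the suffix that created it and
            # add its next symbol as a child, making the node internal.
            nxt = node.origin + depth
            if nxt < len(text):
                node.children[text[nxt]] = Node(node.origin)

root = Node(None)
for p in range(len("abab")):
    insert_suffix(root, "abab", p)

def node_count(node):
    return sum(1 + node_count(c) for c in node.children.values())
```

For "abab" this builds 5 nodes where a full suffix trie would need 7; the gap grows quickly on longer inputs with long unique tails.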

Originally Posted by nburns
It's true that the type of tree you use internally makes no difference to the compressed output, but where it makes a difference is in trying to figure out and communicate the nature of the cost function. A suffix tree always has strictly fewer internal nodes than leaves, so nodes/leaves is a number between one and two. A suffix trie can have a nodes/leaves ratio much higher than two, and generally will. So, depending on what kind of tree you're talking about, nodes/leaves can give very different numbers, and, obviously, those numbers measure different things.
While the nodes/leaves ratio was used to help arrive at a reasonably good cost function, the number of nodes and leaves is not used in the cost function of TreeCompress. It essentially uses the reduction in order 0 entropy of the candidate suffix, the number of occurrences of the suffix, and the length of the suffix as inputs to the cost function.

Originally Posted by nburns
A suffix tree looks like a pretty difficult data structure to implement. I've studied Ukkonen's algorithm enough to feel confident that I could probably implement it correctly from memory (or, at least, a closely-related variety that yields a correct suffix tree), but I've never actually written a complete ST implementation. This has a lot to do with the fact that, at the time I got interested in suffix trees a couple years ago, all indications were that suffix arrays had made them virtually obsolete, due to SA's better performance, greater simplicity, and near-equal capabilities. Even if you ended up going with a ST for Tree, I'd recommend trying to start from one of the ST implementations that's available on the web, rather than write one from scratch.
I think I could modify TreeCompress by adding one more word to each node indicating the number of symbols on the branch leading to that node and adding a branch splitting function. At this point, that would probably be easier than finding someone else's code and adapting it. I am rather fond of my implementation's ability to limit the sibling depth to log4(alphabet size) without hash tables and am not sure I would get that with other code.

Originally Posted by nburns
There could be an enormous performance benefit from switching out suffix data structures, but it depends on how much time is going into suffix tree construction to begin with. In general, I've found that it's hard to guess where the real bottlenecks are just from looking at code; and it's a fact of software that optimizing anything other than the biggest bottleneck will tend not to have a big impact (see Amdahl's Law).
Suffix tree construction averages roughly 75% of execution time on v0.3 of TreeCompress.

Originally Posted by nburns
Aside from not being a compressed trie, the other major feature of a suffix tree that your structure seems to be lacking is suffix links. Suffix links (or pointers) are additional links inside the tree that connect nodes with common suffixes. They are key to being able to build the tree in O(n) time, because you avoid spending time looking up the same substring over and over when inserting successive suffixes. Your tree construction algorithm seems to be O(n^2).
Suffix links will not work as well as normal for TreeCompress since the tree is built in pieces. Nonetheless, it is an interesting idea that might save a lot of tree traversals. I will think about this some more.

If you compare the first compression cycle time for enwik9 to that of enwik8, I am pretty sure the time for enwik9 is much closer to 10x that of enwik8 than 100x that of enwik8. I think technically the algorithm is O(n^3), but in practice the average suffix depth does not grow nearly as fast as the file and having to run the tree build in pieces does not have a big impact, so in practice it runs in close to linear time when comparing enwik8 to enwik9. The overall compression process takes about 25x longer, but that is mostly because a lot more tree builds are done because there are a lot more strings that can be effectively tokenized.

Originally Posted by nburns
Searching for a match in a suffix array is just binary search. There's no such thing as dictionary size in a suffix array; your M would correspond to the length of the text or of the generalized sequence. It doesn't seem to me that you would need to do much searching, though. Your interest would be mainly in gathering statistics. A suffix array isn't much help by itself; you'd want at least some of the other arrays that compose the enhanced suffix array.
I think the simplest suffix array algorithms run in O(n * log(n)) time because of the binary search. For TreeCompress, what I call dictionary size is the same thing as the size of the alphabet or the number of unique symbols in the file. I am concerned that doing log(number of suffixes) string comparisons could take longer to run than (average suffix length) * log4(alphabet size) symbol comparisons when building the tree. I realize there are more complex algorithms that build a suffix array in linear time, but don't understand them yet.
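For concreteness, the kind of search being discussed looks something like this (a toy sketch; the naive construction here is O(n^2 log n) and only for illustration):

```python
def build_suffix_array(text):
    # Naive construction for illustration only; linear-time algorithms
    # (e.g. SA-IS) exist but are much more involved.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text, sa, pat):
    # Binary search on the suffix array: O(log n) probes, each costing up to
    # len(pat) symbol comparisons, hence the O(m log n) search cost.
    def bound(upper):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(pat)]
            if prefix < pat or (upper and prefix == pat):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(True) - bound(False)

text = "banana"
sa = build_suffix_array(text)   # [5, 3, 1, 0, 4, 2]
```

Each probe does a string comparison against the pattern, which is exactly the per-probe cost being weighed against the tree's per-symbol sibling traversals.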

2. Originally Posted by nburns
Can we move this side discussion to a different thread in off-topic?
Maybe a different thread (not in the downloads area) but not in "off-topic." A discussion of the basic nature of the "compression problem" is about as on-topic as things could get around here. (And I think Matt's quite right about it being basically an AI problem; I tend to see it very much in terms of "case-based planning" requiring case-based plan understanding.)

3. Originally Posted by Kennon Conrad
My structure is the same as that link's suffix trie except I don't add nodes beyond the first one that makes a suffix unique. The only reason the node that makes a suffix unique is added is so that if a later suffix matches, the program can go back and add the child (or children, as needed) of the suffix that created the node, turning it into an internal node.
I think that would place it somewhere in the middle between a compressed trie and an ordinary trie, which is a kind of trie that I don't recall ever having come across before.

While the nodes/leaves ratio was used to help arrive at a reasonably good cost function, the number of nodes and leaves is not used in the cost function of TreeCompress. It essentially uses the reduction in order 0 entropy of the candidate suffix, the number of occurrences of the suffix, and the length of the suffix as inputs to the cost function.
Somehow I got the impression that the nodes/leaves ratio was part of the algorithm. From what I'm hearing now, that isn't the case. It sounds like you are basically calculating or approximating the drop in order-0 entropy (global) that would result from each candidate replacement and using that as the score. That is so plainly logical in light of what you're doing that it really needs no justification, particularly in light of the fact that it's already known to work empirically.

I think I could modify TreeCompress by adding one more word to each node indicating the number of symbols on the branch leading to that node and adding a branch splitting function. At this point, that would probably be easier than finding someone else's code and adapting it. I am rather fond of my implementation's ability to limit the sibling depth to log4(alphabet size) without hash tables and am not sure I would get that with other code.
It looks like you're building a bitwise trie out of the siblings but reading the bits of each symbol in reverse, i.e., from low to high.

Suffix tree construction averages roughly 75% of execution time on v0.3 of TreeCompress.
That doesn't seem surprising.

Suffix links will not work as well as normal for TreeCompress since the tree is built in pieces. Nonetheless, it is an interesting idea that might save a lot of tree traversals. I will think about this some more.

If you compare the first compression cycle time for enwik9 to that of enwik8, I am pretty sure the time for enwik9 is much closer to 10x that of enwik8 than 100x that of enwik8. I think technically the algorithm is O(n^3), but in practice the average suffix depth does not grow nearly as fast as the file and having to run the tree build in pieces does not have a big impact, so in practice it runs in close to linear time when comparing enwik8 to enwik9. The overall compression process takes about 25x longer, but that is mostly because a lot more tree builds are done because there are a lot more strings that can be effectively tokenized.
O(n^2) describes worst-case performance. You could trigger the worst case by giving it a text made up of a single repeating symbol.

On actual text, you should expect match lengths to grow approximately proportionally to lg(n) for a text of length n. Your algorithm very rapidly slows down as match lengths increase, however, and so it will tend to slow down more than linearly, but somewhat less than O(n^2), when tested with real data of increasing length. There was a graph of this at the end of the slideshow I posted last time, wasn't there?
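A toy illustration of why (hypothetical helper, not taken from TreeCompress): counting the symbol comparisons needed to find each suffix's longest earlier match shows the quadratic blowup on a single repeated symbol versus ordinary text:

```python
def match_len(text, i):
    # Longest prefix of text[i:] that also starts at some earlier position;
    # the comparisons counted here mirror naive suffix-insertion work.
    best = 0
    for j in range(i):
        l = 0
        while i + l < len(text) and text[j + l] == text[i + l]:
            l += 1
        best = max(best, l)
    return best

work_worst = sum(match_len("a" * 20, i) for i in range(20))          # ~n^2/2
work_text  = sum(match_len("the cat sat on the mat", i) for i in range(22))
```

On the 20-symbol run of "a" the total is 190 (the triangular number), while an ordinary 22-symbol sentence stays far lower.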

I think the simplest suffix array algorithms run in O(n * log(n)) time because of the binary search. For TreeCompress, what I call dictionary size is the same thing as the size of the alphabet or the number of unique symbols in the file. I am concerned that doing log(number of suffixes) string comparisons could take longer to run than (average suffix length) * log4(alphabet size) symbol comparisons when building the tree. I realize there are more complex algorithms that build a suffix array in linear time, but don't understand them yet.
I'm not sure what to recommend as far as data structures and algorithms go at this point. I think I'm getting ever closer to comprehending exactly what your algorithm does; I'll continue thinking it over.

4. Originally Posted by nburns
I think that would place it somewhere in the middle between a compressed trie and an ordinary trie, which is a kind of trie that I don't recall ever having come across before.

Somehow I got the impression that the nodes/leaves ratio was part of the algorithm. From what I'm hearing now, that isn't the case. It sounds like you are basically calculating or approximating the drop in order-0 entropy (global) that would result from each candidate replacement and using that as the score. That is so plainly logical in light of what you're doing that it really needs no justification, particularly in light of the fact that it's already known to work empirically.
Thus the tongue-in-cheek reference to the Conrad Tree. Perhaps "truncated suffix tree" would be more appropriate. I think you are right in that my tree is different, and perhaps that explains why there is little or no research on the nodes/leaves ratio of the tree I am using.

I did mention the limited usage of the nodes/leaves ratio in my April 19th post and vaguely described the inputs to the cost function in my April 2nd post, but that was some time ago. The scoring function in TreeCompress is more complex than just the change in entropy for each replacement. It first calculates the change in file entropy for the replacement, but then multiplies that value by ((change in entropy per replacement)^5)/(candidate length^3) to bias the results toward longer and "better" matches. That bias is based on what I learned by watching the change in order 0 entropy, and the change in order 0 entropy * nodes/leaves, of enwik8 as it compresses, as well as the final results.

Originally Posted by nburns
It looks like you're building a bitwise trie out of the siblings but reading the bits of each symbol in reverse, i.e., from low to high.
Yes, that is right. I forgot the released code only has even and odd siblings in the node structure. v0.4 will use 4 siblings per node, which cuts the maximum number of sibling traversals in half. It's a nice speed improvement even though it uses more memory, causing the tree to split into more pieces when built. The code goes low to high because the bits tend to be more diverse at the low end.
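As an illustration of the low-to-high order (a made-up helper, not the real node layout): with 4 siblings per node, 2 bits of the symbol are consumed per level, starting from the least significant end:

```python
def sibling_path(symbol, bits_per_level=2, levels=4):
    # Index chosen at each sibling-tree level, reading the symbol's bits from
    # low to high. With 2 bits per level, an 8-bit symbol needs at most 4
    # levels instead of 8 with 1 bit per level.
    path = []
    for _ in range(levels):
        path.append(symbol & ((1 << bits_per_level) - 1))
        symbol >>= bits_per_level
    return path
```

For example, 0b10110001 descends through sibling slots [1, 0, 3, 2], its 2-bit groups taken low to high.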

Originally Posted by nburns
O(n^2) describes worst-case performance. You could trigger the worst case by giving it a text made up of a single repeating symbol.
I have this problem with needing to build the tree in sections to fit it in RAM. I think this makes it worse than O(n^2), although the time to scan the input is small compared to the time to build the tree so it does not have a huge impact on enwik9. If the file was big enough, this would play a bigger role.

Originally Posted by nburns
On actual text, you should expect match lengths to grow approximately proportionally to lg(n) for a text of length n. Your algorithm very rapidly slows down as match lengths increase, however, and so it will tend to slow down more than linearly, but somewhat less than O(n^2), when tested with real data of increasing length. There was a graph of this at the end of the slideshow I posted last time, wasn't there?
I think O(n log n) does represent expected performance on normal text. The thing is, the percentage change in log(n) gets smaller as n increases. When comparing a 100M input to a 1G input, the difference is only an extra 20% over linear. Based on that, it seems like enwik9 should take about 12x longer on the first pass than enwik8, plus a relatively small penalty for breaking the tree into more pieces. So, yes, it's not 10x, but it's still way less than 100x.
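The back-of-the-envelope arithmetic (just a sanity check of the n log n estimate, nothing more):

```python
import math

n1, n2 = 10**8, 10**9   # enwik8 vs enwik9 sizes
ratio = (n2 * math.log2(n2)) / (n1 * math.log2(n1))
# 10 * log2(1e9)/log2(1e8) = 10 * 9/8 = 11.25, i.e. a bit over 11x,
# rather than the 10x a purely linear algorithm would give.
```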

5. Originally Posted by Kennon Conrad
Thus the tongue-in-cheek reference to the Conrad Tree. Perhaps "truncated suffix tree" would be more appropriate. I think you are right in that my tree is different, and perhaps that explains why there is little or no research on the nodes/leaves ratio of the tree I am using.
It's generally not a good thing to invent a new kind of tree, because simple kinds of trees have already been picked through and the useful ones have been given names. If you discover a new tree, the odds are overwhelming that you either made a mistake or it's not actually new. I think you could do a tiny bit more work on yours and get a real compressed trie. I think you already have enough data fields per node, so it would be free.

I did mention the limited usage of the nodes/leaves ratio in my April 19th post and vaguely described the inputs to the cost function in my April 2nd post, but that was some time ago. The scoring function in TreeCompress is more complex than just the change in entropy for each replacement. It first calculates the change in file entropy for the replacement, but then multiplies that value by ((change in entropy per replacement)^5)/(candidate length^3) to bias the results toward longer and "better" matches. That bias is based on what I learned by watching the change in order 0 entropy, and the change in order 0 entropy * nodes/leaves, of enwik8 as it compresses, as well as the final results.
How did you get to that formula? Did you try simpler ones first and empirically tune it?

Yes, that is right. I forgot the released code only has even and odd siblings in the node structure. v0.4 will use 4 siblings per node, which cuts the maximum number of sibling traversals in half. It's a nice speed improvement even though it uses more memory, causing the tree to split into more pieces when built. The code goes low to high because the bits tend to be more diverse at the low end.

I have this problem with needing to build the tree in sections to fit it in RAM. I think this makes it worse than O(n^2), although the time to scan the input is small compared to the time to build the tree so it does not have a huge impact on enwik9. If the file was big enough, this would play a bigger role.

I think O(n log n) does represent expected performance on normal text. The thing is, the percentage change in log(n) gets smaller as n increases. When comparing a 100M input to a 1G input, the difference is only an extra 20% over linear. Based on that, it seems like enwik9 should take about 12x longer on the first pass than enwik8, plus a relatively small penalty for breaking the tree into more pieces. So, yes, it's not 10x, but it's still way less than 100x.

Text has a complex distribution, so figuring out the expected performance would not be that simple. O(n lg n) is a lower bound. The entropy of the data would be one of the variables in the actual performance.

I don't want to propose a change with what I know at this point, but I can just about guarantee that the data structures it's using now are suboptimal.

6. Originally Posted by nburns
It's generally not a good thing to invent a new kind of tree, because simple kinds of trees have already been picked through and the useful ones have been given names. If you discover a new tree, the odds are overwhelming that you either made a mistake or it's not actually new. I think you could do a tiny bit more work on yours and get a real compressed trie. I think you already have enough data fields per node, so it would be free.
My tree works fine so I don't think it's a "mistake". It may not be optimal for the program's performance, but I am not trying for perfection here. I have a hard time imagining my tree is new, but on the other hand I haven't found it mentioned on the web. For instance, Wikipedia shows the costs for 5 ways to implement suffix trees:

Attachment 2932

Why don't they show an option for a bitwise sibling tree with a lookup of order O(log sigma)? It seems like such a simple and obvious improvement to a sibling list implementation if you are willing to take the hit on memory size that it really surprises me I don't see it on Wikipedia or anywhere else. I tend to think it's just because I don't know where to look. I don't understand the difference between the two order symbols, but in general doesn't this make a sibling list have similar lookup order characteristics to a sorted array without the insertion penalty?

I think suffix arrays need another field for the branch length when comparing them to suffix trees. I do not know how you would get around that.

Originally Posted by nburns
How did you get to that formula? Did you try simpler ones first and empirically tune it?
I knew that just using the change in file entropy would result in mostly bottom-up compression, because that's where the biggest order 0 entropy reductions come from. I wanted top-down compression, so I decided the change in file entropy should be biased with some combination of the change in file entropy per replaced string and the change in file entropy per replaced symbol. So I started with the change in entropy per replaced string multiplied by the change in entropy per replaced symbol. It was determined empirically that using the square of the change in entropy per replaced string and the cube of the change in entropy per replaced symbol is reasonably close to the best multiplicative combination of these two parameters for compression size, and on the low end of compression time among the combinations that provided the best size.
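In rough form (simplified, with illustrative variable names; the real function no doubt differs in details), the scoring works out to:

```python
def score(delta_entropy, occurrences, length):
    # delta_entropy: total order-0 entropy reduction for the replacement
    # occurrences:   number of replaced strings
    # length:        symbols in the candidate string
    per_string = delta_entropy / occurrences   # drop per replaced string
    per_symbol = per_string / length           # drop per replaced symbol
    # Total drop biased by per_string^2 * per_symbol^3; algebraically this
    # equals delta^6 / (occurrences^5 * length^3), which matches the
    # "(change per replacement)^5 / length^3" formulation given earlier.
    return delta_entropy * per_string**2 * per_symbol**3
```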

Originally Posted by nburns
Text has a complex distribution, so figuring out the expected performance would not be that simple. O(n lg n) is a lower bound. The entropy of the data would be one of the variables in the actual performance.
I think the required complexity of estimation methods depends on the required quality of the estimate. Using O(n log n) is probably the simplest estimate that will typically get you in the ballpark for most large text files. O(n log n) is the lower bound for a single text file, but the nodes could grow slower than O(n log n) when comparing two different files or even at certain points within a file.

I ran a test this morning and found the first tree build and score cycle on enwik9 takes 18x longer than enwik8. The number of input scans grew by 22x so the O(n log n) portion of the growth was something less than 18x, but most likely above the 12x change in n log n. So in terms of actual performance on enwik9 compared to enwik8, it is much closer to O(n log n) than O(n^2).

Originally Posted by nburns
I don't want to propose a change with what I know at this point, but I can just about guarantee that the data structures it's using now are suboptimal.
I have one more statistic from another test this morning. On the unreleased version I am testing out, the first TreeCompress compression cycle on enwik9 creates a total of 1.410 billion suffix tree nodes, of which 0.698 billion could be eliminated by using a compressed tree (49.5%). After 64 compression cycles (slightly over 10% of the cycles), the tree has 333 million suffix tree nodes, of which 35.9 million could be eliminated by using a compressed tree (10.8%). After 88 compression cycles (about 15% of the cycles), 15.4 million out of 252 million nodes could be eliminated (6.1%).

Since compressed suffix trees seem to need a branch length field, that would increase the memory for each node in TreeCompress by 14%. So by about the time that 10% of the compression cycles are done, using a compressed tree would increase the amount of memory required for the tree and the complexity of the tree building algorithm even though the vast majority of the nodes could not be compressed, and it only gets worse from there. I suspect that for much of the run, the program would mostly be creating compressed nodes that need to be split apart later when a partial match is found. I don't see the benefit in that. I suppose some sort of hybrid could be created that only created compressed nodes when they were likely to persist, but I'm not sure that's a good idea either. I wouldn't be surprised if adding hash tables would be a more optimal algorithm, but am concerned about the amount of memory hash tables would require.
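Rough numbers for that trade-off (assuming, as above, that the extra branch-length word adds one word out of seven, about 14%, per node):

```python
def compressed_memory_ratio(total_nodes, eliminable, extra=1/7):
    # Memory of a compressed tree relative to the uncompressed one:
    # fewer nodes, but every remaining node carries an extra length field.
    return (total_nodes - eliminable) * (1 + extra) / total_nodes

early = compressed_memory_ratio(1.410e9, 0.698e9)  # first cycle: ~0.58
later = compressed_memory_ratio(333e6, 35.9e6)     # after 64 cycles: ~1.02
```

So under these assumptions a compressed tree would nearly halve memory on the first cycle, but would already cost slightly more memory than the plain tree a tenth of the way through the run.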

The main goal of TreeCompress is to compress the most compressible nodes first, so that better decisions can be made when compressing less compressible nodes because the decision making process is not biased by having the same longer sequences appear repeatedly. It's not just a goal of reducing entropy, but reducing it in a way that leads to better entropy reduction in upcoming compression cycles.

I made a couple of suffix tree based compressors some time ago. One is Demixer, another one is LemonTree. I've posted both to this forum, the first one with sources, the second without. Demixer is based on a bitwise suffix tree while LemonTree is based on a bytewise suffix tree with linked lists of subnodes (left-child right-sibling representation). Neither project is very far along, as I have some serious mental problems with finishing projects, or at least releasing useful versions :/ You can however download and study Demixer, as it's simpler and less tricky, so I would be more willing to describe Demixer's internals to you.

Memory consumption for the tree is 16x the input size, plus 1x for the input buffer, which is needed all the time.
The tree is depth-limited to 2^9 bits, but you can easily increase that to 2^32 bits if you remove the fields that are used for compression (gathering statistics), all while keeping the tree size constant.

One warning: Demixer keeps a queue of ActiveContexts. That's because it's a CM-based compressor which computes predictions for many orders at once for each bit in the input. That's unnecessary for simply building the tree - you don't need a queue of ActiveContexts; just one would suffice.

8. Originally Posted by Piotr Tarsa
I made a couple of suffix tree based compressors some time ago. One is Demixer, another one is LemonTree. I've posted both to this forum, the first one with sources, the second without. Demixer is based on a bitwise suffix tree while LemonTree is based on a bytewise suffix tree with linked lists of subnodes (left-child right-sibling representation). Neither project is very far along, as I have some serious mental problems with finishing projects, or at least releasing useful versions :/ You can however download and study Demixer, as it's simpler and less tricky, so I would be more willing to describe Demixer's internals to you.

Memory consumption for the tree is 16x the input size, plus 1x for the input buffer, which is needed all the time.
The tree is depth-limited to 2^9 bits, but you can easily increase that to 2^32 bits if you remove the fields that are used for compression (gathering statistics), all while keeping the tree size constant.

One warning: Demixer keeps a queue of ActiveContexts. That's because it's a CM-based compressor which computes predictions for many orders at once for each bit in the input. That's unnecessary for simply building the tree - you don't need a queue of ActiveContexts; just one would suffice.
I like your coding style. Pretty straightforward and clean, so it's relatively easy to see what the code is doing. The code is particularly interesting to me because I would like to add a context mixer later and this has similar code to what I am realizing I need.

The use of bit-wise child pointers is similar to my depth 1 double bit-wise child array (and upcoming v0.4's depth 1 nibble-wise child array) and somewhat similar to my bit-wise sibling tree, but not the same. Did you try a bit-wise sibling tree instead of a sibling list in LemonTree? It would reduce the maximum number of sibling traversals for a list from 255 to 8 at the cost of one extra offset field. I seem to be the only person who uses a sibling tree and I don't understand why, other than thinking that maybe people are generally more concerned with memory usage than execution speed.

Do I understand correctly that Demixer does not have tree nodes for leaves? It looks like you just repurpose the child pointer. If so, that seems like a really good way to save a lot of memory and reduce execution time. I had not thought of that at all.

It appears to me that Demixer creates a 16 byte node structure per input bit, making the tree 128x the input size. Assuming I missed something, what did I miss?

9. Yes. I'm repurposing the child pointer, because that's obvious space optimization and it isn't very tricky.

Yes, the structure is 16 bytes, but there's one node per input byte, not per input bit. The reason for that is that I'm inserting only bit-strings starting at positions divisible by eight (i.e. bit-strings starting on byte boundaries). It's easy to see that there will be exactly as many suffixes as there are bytes in the input, so there will be exactly as many branches, making the tree size 16x the input.
Well, those 'exactly's are not exactly 'exactly', as the numbers can be a little smaller, i.e. the tree usually takes a little less space than strictly linear. If you perform the initial part of the Knuth-Morris-Pratt algorithm (or rather the Morris-Pratt one, I guess) on the input, you'll get a number for every input position - that's the number of nodes the tree will "lack" while processing that input symbol. For usual inputs those numbers are so small that you can skip them. Also, if you want easy operations on the resulting suffix tree, it would be wise to append a special unique character to the end of the input, so that no suffix is a prefix of another. You would get a tree that doesn't "lack" any nodes (i.e. the number from the previous KMP preprocessing would be zero for that unique symbol position).
You can avoid the need for a unique symbol in the input if you do some sort of on-the-fly mapping, virtual sentinel, etc. One idea would be to remap input symbols to 16 bits (leaving the upper 8 bits as zeroes) and insert a virtual sentinel (which has some of the upper 8 bits set) at the end. That would only change the interpretation of the 'depth' field - to get byte depth you would have to shift right by 4 bits instead of 3.
Remember that because bit-strings are inserted on byte boundaries, suffix links skip 8 bits (whole byte), not 1 bit.
For efficient removal of nodes it would be sufficient, IMO, to make a helper array which maps each position in the input to the parent node of the leaf node of the suffix starting at that position. Having such an array, it should be relatively easy to look for overlapping suffixes and adjust them. Also, parent pointers are needed for node removal. Unless you want to do recursive compression, it's not needed to reinsert the resulting bit code into the tree - well, recursive compression would be very tricky I think, so it's probably not very wise to do it now, when there are still some low-hanging fruits.

LemonTree was supposed to be a unary-coding PPM coder, i.e. it would be encoding between 1 and 255 flags per input byte, depending on how early it finds the symbol when following suffix links and enumerating siblings. The data structure for LemonTree was trickier, as the node structure was bigger and I wanted to repurpose as many fields as possible to reduce memory usage. Also, I wanted to do a sliding-window suffix tree, but didn't implement that functionality - it's really tricky to do. LemonTree has parent pointers, however.

Edit:
As to those KMP-thingies I've mentioned - I was wrong. At any point of suffix tree construction I'm at some depth in the tree. That depth is equal to the length of the longest repetition that ends at the end of the input, i.e. the length of the longest suffix that is repeated somewhere in the input. Depth is increased by 1 at every symbol from the input and decreased until we're in a repeated suffix. Only during that decrement are new branches created, so it's easy to see that there's a linear number of branches, as depth is increased exactly N times, where N is the number of symbols in the input.

10. Got it. Sometimes things that are obvious to one person are not so obvious to another.

I think I will need to give the sliding window suffix tree a try. It will be a good challenge. When developing TreeCompress, it wasn't all that hard to make it compress recursively. On the other hand, overlaps were tricky for me and I only really got them working when I built and used a suffix tree to detect them. The sliding window will need some serious planning before I try to implement it. In its most basic form, I think the algorithm needs a dictionary array and a sliding suffix tree. It would fill up the suffix tree, substituting existing dictionary entries for symbols as they are entered, recursively look for new strings to add to the dictionary, and then output the next symbol and remove the tree portion caused by that symbol. Depending on the implementation details, it would hopefully be possible to do a second pass to find all the good matches that didn't fit in the window on the first pass.

11. Originally Posted by Kennon Conrad
My tree works fine so I don't think it's a "mistake". It may not be optimal for the program's performance, but I am not trying for perfection here. I have a hard time imagining my tree is new, but on the other hand I haven't found it mentioned on the web. For instance, Wikipedia shows the costs for 5 ways to implement suffix trees:
Your algorithm is cool, because it seems to compress well and decompress fast. That said, I think you may be in denial about the performance of the encoder. It might be impossible to make it any faster, but that's very unlikely.

Why don't they show an option for a bitwise sibling tree with a lookup of order O(log sigma)? It seems like such a simple and obvious improvement to a sibling list implementation if you are willing to take the hit on memory size that it really surprises me I don't see it on Wikipedia or anywhere else. I tend to think it's just because I don't know where to look. I don't understand the difference between the two order symbols, but in general doesn't this make a sibling list have similar lookup order characteristics to a sorted array without the insertion penalty?
The reason they didn't show it is that the list would be endless if they included every possibility. You can treat it as a balanced tree as long as it comes out balanced. Depending on the data, it could degrade into a linked list. The big-O performance is somewhat irrelevant when the number of elements is small, because the difference tends to get swamped by hardware effects.

I think suffix arrays need another field for the branch length when comparing them to suffix trees. I do not know how you would get around that.
That's true. Do a search for enhanced suffix arrays.

I knew that just using the change in file entropy would result in mostly bottom-up compression because that's where the biggest order 0 entropy reductions come from. I wanted top-down compression, so I decided the change in file entropy should be biased with some combination of the change in file entropy per replaced string and the change in file entropy per replaced symbol. So I started with the change in entropy per replaced string multiplied by the change in entropy per replaced symbol. It was determined empirically that using the square of the change in entropy per replaced string and the cube of the change in entropy per replaced symbol is reasonably close to the best multiplicative combination of these two parameters for compressed size, and on the low end of compression time among the combinations that provided the best size.
I'm not sure I follow you completely. I'm trying to figure out how much support there is for that fudge factor. Because what I'm really trying to figure out is what are the minimal data structures and simplest algorithm you could use to get the same output. At a broader level, I guess I'm trying to just figure out what this algorithm is doing and why.

When you are tuning empirically, you should really test on a wide variety of inputs.

I think the required complexity of estimation methods depends on the required quality of the estimate. Using O(n log n) is probably the simplest estimate that will typically get you in the ballpark for most large text files. O(n log n) is the lower bound for a single text file, but the nodes could grow slower than O(n log n) when comparing two different files or even at certain points within a file.

I ran a test this morning and found the first tree build and score cycle on enwik9 takes 18x longer than enwik8. The number of input scans grew by 22x so the O(n log n) portion of the growth was something less than 18x, but most likely above the 12x change in n log n. So in terms of actual performance on enwik9 compared to enwik8, it is much closer to O(n log n) than O(n^2).
I'm not sure I'm following all of that. Suffix tries have pathological cases and they're not that hard to trigger. If you think you hit on some situation where the trie is unexpectedly better, you should test both on the same data and find out for sure.

I have one more statistic from another test this morning. On the unreleased version I am testing out, the first TreeCompress compression cycle on enwik9 creates a total of 1.410 billion suffix tree nodes, of which 0.698 billion could be eliminated by using a compressed tree (49.5%). After 64 compression cycles (slightly over 10% of the cycles), the tree has 333 million suffix tree nodes, of which 35.9 million could be eliminated by using a compressed tree (10.8%). After 88 compression cycles (about 15% of the cycles), 15.4 million out of 252 million nodes could be eliminated (6.1%).

Since compressed suffix trees seem to need a branch length field, that would increase the memory for each node in TreeCompress by 14%. So by about the time that 10% of the compression cycles are done, using a compressed tree would increase the amount of memory required for the tree and the complexity of the tree building algorithm even though the vast majority of the nodes could not be compressed, and it only gets worse from there. I suspect that for much of the run, the program would mostly be creating compressed nodes that need to be split apart later when a partial match is found. I don't see the benefit in that. I suppose some sort of hybrid could be created that only created compressed nodes when they were likely to persist, but I'm not sure that's a good idea either. I wouldn't be surprised if adding hash tables would be a more optimal algorithm, but am concerned about the amount of memory hash tables would require.

The main goal of TreeCompress is to compress the most compressible nodes first, so that better decisions can be made when compressing less compressible nodes because the decision making process is not biased by having the same longer sequences appear repeatedly. It's not just a goal of reducing entropy, but reducing it in a way that leads to better entropy reduction in upcoming compression cycles.

12. Originally Posted by nburns
Your algorithm is cool, because it seems to compress well and decompress fast. That said, I think you may be in denial about the performance of the encoder. It might be impossible to make it any faster, but that's very unlikely.
I have said there are many, many ways the compression speed can be improved and this is still true. The soon-to-be-released Tree v0.4 compresses enwik9 in 33% less time than v0.3 and almost 80% less than v0.1.

Originally Posted by nburns
The reason they didn't show it is that the list would be endless if they included every possibility. You can treat it as a balanced tree as long as it comes out balanced. Depending on the data, it could degrade into a linked list.
Are you sure that's the reason? I would think a simple method that gives overall better time order than every option listed that does not include a hash table would be worthy of listing. Balanced trees are O(log n) for insertion. Suffix trees with sibling lists are O(n). For a 1 million symbol alphabet, TreeCompress v0.4 will never have a sibling tree depth of more than 10 siblings. A sibling list could have 999,999 siblings.

Originally Posted by nburns
I'm not sure I follow you completely. I'm trying to figure out how much support there is for that fudge factor. Because what I'm really trying to figure out is what are the minimal data structures and simplest algorithm you could use to get the same output. At a broader level, I guess I'm trying to just figure out what this algorithm is doing and why.

When you are tuning empirically, you should really test on a wide variety of inputs.
If you choose strings to add to the dictionary by picking the ones that reduce file entropy the most, you will generally pick pairs first because the frequency of the pairs is high. If you choose strings that individually compress the most, you will generally pick the longest string matches. If you choose strings that individually compress the most per symbol, you will generally pick long strings with the rarest symbols. None of these methods is optimal. Optimal compression of enwik9 with TreeCompress is achieved by making the algorithm pick generally longer strings, but not so long that you start getting a lot of partial word matches. Biasing it to pick strings with rarer characters helps, but I can't really explain why. I was just following my gut instincts on that one. My big picture goal has been to create the best possible dictionary preprocessor for compressing enwik8 and enwik9. While I want my code to be not unreasonably slow, I have things to work on that are more interesting to me than perfecting the tree structure, such as space modeling, context matching, multi-threading, and a sliding suffix tree. I think many of your suggestions will be very helpful for the sliding tree implementation, but that's a ways down the road for me.

13. Originally Posted by Kennon Conrad
Are you sure that's the reason?
Yes.

If you choose strings to add to the dictionary by picking the ones that reduce file entropy the most, you will generally pick pairs first because the frequency of the pairs is high. If you choose strings that individually compress the most, you will generally pick the longest string matches. If you choose strings that individually compress the most per symbol, you will generally pick long strings with the rarest symbols. None of these methods is optimal. Optimal compression of enwik9 with TreeCompress is achieved by making the algorithm pick generally longer strings, but not so long that you start getting a lot of partial word matches. Biasing it to pick strings with rarer characters helps, but I can't really explain why. I was just following my gut instincts on that one. My big picture goal has been to create the best possible dictionary preprocessor for compressing enwik8 and enwik9. While I want my code to be not unreasonably slow, I have things to work on that are more interesting to me than perfecting the tree structure, such as space modeling, context matching, multi-threading, and a sliding suffix tree. I think many of your suggestions will be very helpful for the sliding tree implementation, but that's a ways down the road for me.
That part was interesting. The value in what you did is in demonstrating high compression with fast decoding. How you wrote the encoder doesn't matter per se; what matters is the cost function and the high level optimization algorithm used to choose phrases for the dictionary. In other words, the math. We got into talking about trees, because I thought you were counting nodes in your cost function, and if so, it's important to know what kind of tree.

All you have to do is demonstrate that a compression technique works in principle and other people will refine the method.

14. Originally Posted by nburns
Yes.

That part was interesting. The value in what you did is in demonstrating high compression with fast decoding. How you wrote the encoder doesn't matter per se; what matters is the cost function and the high level optimization algorithm used to choose phrases for the dictionary. In other words, the math. We got into talking about trees, because I thought you were counting nodes in your cost function, and if so, it's important to know what kind of tree.

All you have to do is demonstrate that a compression technique works in principle and other people will refine the method.
It seems to me that if the only reason for not including a particular option is because the list would be endless if they included every option, then they would not include any options. To partially test your theory, I edited Wikipedia for the first time in my life. It will be interesting to see if and when the edit gets retracted.

The main reason I do not use the internal node reduction and leaf reduction counts is because they cannot be efficiently determined without having the entire tree available (and maybe not even then). It is definitely something I would try to implement in a sliding window suffix tree implementation.

Identifying the proper symbol transition points seems to be similarly important to finding long repeated strings. While I have optimized my cost function kludge for enwik files and it works reasonably well, I think a smarter cost function could do better. I have pre-release versions that use a sliding window tokenizer that builds the whole tree for the window and analyzes the ends of the strings for the best transition points. The problem with that version was that it couldn't find good matches more than a window width apart and made decisions that were good locally but not globally. If I had 3 GB of memory to run the code in instead of 2 GB, there would be a lot more good options for structures when compressing enwik9, but I think that's a Windows limit.

I would be happy to see other people refine the code or concept. I don't have enough time to do everything I would like to do. Other people know things I don't know and other people could create much nicer looking code.

Here is Tree v0.4. The compression process is now:

TreeCapEncode
TreeParagraphs
TreeWords
TreeCompress
TreeBitEncode

TreeParagraphs deduplicates paragraphs. TreeWords deduplicates words. TreeCompress now has double bit-wise suffix tree sibling trees. TreeBitEncode and TreeDecode now have implied two token definitions and one bit shorter codes for two instance token matches from the buffer.

Results on A8-5500 CPU:
Prior v0.3 release:
enwik9: 184,838,711 bytes, 105,728 seconds to compress, 23 seconds to decompress
enwik8: 23,233,932 bytes, 4,541 seconds to compress, 2.7 seconds to decompress

v0.4 release
enwik9: 184,312,072 bytes, 68,866 seconds to compress, 22.2 seconds to decompress
enwik8: 23,178,500 bytes, 3,546 seconds to compress, 2.7 seconds to decompress

compressed TreeDecode.c file size: 7,162 bytes

16. Is your tree a suffix tree already, ie with compressed internal paths?

Making a sliding window suffix tree with compressed internal paths will force you to deal with lots of corner cases.

17. Nice! Very fast!
But Tree still selects bad long repeat strings.
For example,
aaaaaaaa...
bbbbbbbb...
aaaaaaaa...
bbbbbbbb...

select:
aaaaaaaa...
bbbbbbbb...

18. Originally Posted by Piotr Tarsa
Is your tree a suffix tree already, ie with compressed internal paths?

Making a sliding window suffix tree with compressed internal paths will force you to deal with lots of corner cases.
No, the internal paths are not compressed. I have not yet decided whether to use a compressed tree in a sliding window version. The best strings do not always end on nodes that would not be compressed in a compressed tree. I can see a compressed tree is more complicated to implement and use, but am not clear how corner cases cause problems.

19. Originally Posted by xezz
Nice! Very fast!
But Tree still selects bad long repeat strings.
For example,
aaaaaaaa...
bbbbbbbb...
aaaaaaaa...
bbbbbbbb...

select:
aaaaaaaa...
bbbbbbbb...
You will have to provide more details on why you think this is bad if I am missing something. Here are my thoughts.

Assuming your input has just a line feed character between the lines, the input contains 16 a's, 16 b's, 12 .'s, 4 line feeds and one EOF symbol, giving an order 0 entropy of 96.1 bits.

The output of TreeCompress has a 6 symbol dictionary containing symbols representing strings "aaaaaaaa...LFbbbbbbbb...LF", "a", "b", ".", the line feed, and the EOF symbol with instance counts of 2, 8, 8, 6, 2 and 1 respectively, plus one special symbol for one string definition, giving an order 0 entropy of 67.1 bits.

I know of no other tokenization combination that provides a lower order 0 entropy when using this method. Including a dictionary symbol for "...LF" adds 2.5 bits to the order 0 entropy. Including dictionary symbols for "aa" and "bb" adds 10.9 bits. Including dictionary symbols for "aaaa" and "bbbb" also adds 10.9 bits. Doing all of the above adds 25.0 bits. Tree is not perfect, but I do not see anything different than I would expect in this case.

Looking into adding run length encoding is getting closer to the top of my wish list.

That being said, I think my "someday in the future" sliding window suffix tree version will look at spatial relationships and give a higher value to close matches, quite possibly making the a's and b's worthy of being compressed.

20. Originally Posted by Kennon Conrad
It seems to me that if the only reason for not including a particular option is because the list would be endless if they included every option, then they would not include any options. To partially test your theory, I edited Wikipedia for the first time in my life. It will be interesting to see if and when the edit gets retracted.
I'm assuming Bitwise Sibling Tree is your entry. It doesn't seem likely that your lookup time is O(log sigma) and insertion is theta(1). If you're building a binary trie, then the paths will not be shorter than a balanced binary tree, asymptotically.

I guess I committed an error earlier when I said it could degrade into a linked list. The depth of a trie is limited by the number of bits in the longest element you store there. So the worst-case performance is the same as a balanced tree.

The main reason I do not use the internal node reduction and leaf reduction counts is because they cannot be efficiently determined without having the entire tree available (and maybe not even then). It is definitely something I would try to implement in a sliding window suffix tree implementation.
My hope is that you don't need to count actual tree nodes, because that would put a damper on performance and increase code complexity.

Identifying the proper symbol transition points seems to be similarly important to finding long repeated strings. While I have optimized my cost function kludge for enwik files and it works reasonably well, I think a smarter cost function could do better. I have pre-release versions that use a sliding window tokenizer that builds the whole tree for the window and analyzes the ends of the strings for the best transition points. The problem with that version was that it couldn't find good matches more than a window width apart and made decisions that were good locally but not globally. If I had 3 GB of memory to run the code in instead of 2 GB, there would be a lot more good options for structures when compressing enwik9, but I think that's a Windows limit.
If you have 64-bit Windows and enough memory, yes you can use more. How to build a 64-bit binary depends on what compiler you're using.

21. Originally Posted by nburns
I'm assuming Bitwise Sibling Tree is your entry. It doesn't seem likely that your lookup time is O(log sigma) and insertion is theta(1). If you're building a binary trie, then the paths will not be shorter than a balanced binary tree, asymptotically.

I guess I committed an error earlier when I said it could degrade into a linked list. The depth of a trie is limited by the number of bits in the longest element you store there. So the worst-case performance is the same as a balanced tree.
Yes, that is the change I made. Perhaps I just got tired of compressing enwik's and wanted to do the opposite.

We agree the paths will not be shorter with a bitwise binary tree than a balanced binary tree. Here are my thoughts on the lookup time and insertion time.

The maximum cost of finding a child on a given character (lookup) for a bitwise sibling tree is proportional to the number of bits in the character, which is log2 of the alphabet size. I don't see why it wouldn't be the same order as a balanced tree implementation (O(log sigma)), assuming the alphabet doesn't skip codes (which could be addressed in linear time).

The order of the cost of inserting a child is no different than for a sibling list. It only takes the time to update the parent's child pointer and initialize the child, which is constant. Unlike a balanced tree approach, the tree does not need to be rebalanced on each insertion.

As an aside, since I am building a bitwise quaternary tree rather than a bitwise binary tree, the maximum path length for my tree is 1/2 the maximum path length of a balanced binary tree. Additionally, depth 1 is implemented as a child array with nibble-wise (16-ary) sibling arrays. So getting to the tree node representing the first symbol plus the low order nibble of the second symbol only requires the algorithm to look up a sibling pointer in the level 1 child array and move to where it points.

Originally Posted by nburns
My hope is that you don't need to count actual tree nodes, because that would put a damper on performance and increase code complexity.
I don't NEED to count tree nodes, but do think that is the path to optimum compression for this scheme. Since I already have instance counters in my tree, not all nodes would need to be recounted. I think the algorithm would only need to look at the instances in the overlapping prefix strings and overlapping suffix strings of the candidate or something like that. I haven't figured out all the exact details yet. You are right in that this could slow down and complicate the algorithm. I don't know how much yet but still think it would be interesting to explore. There may be a smarter way to build the tree to do this along the lines of some extra pointers like the more advanced suffix array algorithms you have mentioned seem to have. TreeCompress has several double precision floating point operations that it runs at each candidate tree node and sometimes has to also do a log operation. Those operations are not helpful to the run time.

Originally Posted by nburns
If you have 64-bit Windows and enough memory, yes you can use more. How to build a 64-bit binary depends on what compiler you're using.
Yes, it does look like the 2 GB limit is for 32-bit Windows. I am running 64-bit Windows 8 with 6 GB RAM. I would prefer not to go to 64-bit code since there are still a lot of 32-bit computers and the pointers won't double in size, but if I can't get past the limit with 32-bit code, then going to 64-bit may be the way to go. It sounds like there is a way to get around the 2 GB limit on 64-bit machines by setting a linker flag, but I haven't been able to quickly figure out how to do it with MinGW's gcc. I will try harder later. Anyway, thanks for the tip since it does look like I should be able to get over the 2 GB hurdle on my computer.

22. Duplicate post.

23. Originally Posted by Kennon Conrad
Yes, that is the change I made. Perhaps I just got tired of compressing enwik's and wanted to do the opposite.

We agree the paths will not be shorter with a bitwise binary tree than a balanced binary tree. Here are my thoughts on the lookup time and insertion time.

The maximum cost of finding a child on a given character (lookup) for a bitwise sibling tree is proportional to the number of bits in the character, which is log2 of the alphabet size. I don't see why it wouldn't be the same order as a balanced tree implementation (O(log sigma)), assuming the alphabet doesn't skip codes (which could be addressed in linear time).

The order of the cost of inserting a child is no different than for a sibling list. It only takes the time to update the parent's child pointer and initialize the child, which is constant. Unlike a balanced tree approach, the tree does not need to be rebalanced on each insertion.

As an aside, since I am building a bitwise quaternary tree rather than a bitwise binary tree, the maximum path length for my tree is 1/2 the maximum path length of a balanced binary tree. Additionally, depth 1 is implemented as a child array with nibble-wise (16-ary) sibling arrays. So getting to the tree node representing the first symbol plus the low order nibble of the second symbol only requires the algorithm to look up a sibling pointer in the level 1 child array and move to where it points.
How do you find where to insert the new node without first doing a lookup?

I don't NEED to count tree nodes, but do think that is the path to optimum compression for this scheme. Since I already have instance counters in my tree, not all nodes would need to be recounted. I think the algorithm would only need to look at the instances in the overlapping prefix strings and overlapping suffix strings of the candidate or something like that. I haven't figured out all the exact details yet. You are right in that this could slow down and complicate the algorithm. I don't know how much yet but still think it would be interesting to explore. There may be a smarter way to build the tree to do this along the lines of some extra pointers like the more advanced suffix array algorithms you have mentioned seem to have. TreeCompress has several double precision floating point operations that it runs at each candidate tree node and sometimes has to also do a log operation. Those operations are not helpful to the run time.
I'd like to avoid trees altogether. Building and traversing them is slow and complicated, whether or not you count the nodes.

Yes, it does look like the 2 GB limit is for 32-bit Windows. I am running 64-bit Windows 8 with 6 GB RAM. I would prefer not to go to 64-bit code since there are still a lot of 32-bit computers and the pointers won't double in size, but if I can't get past the limit with 32-bit code, then going to 64-bit may be the way to go. It sounds like there is a way to get around the 2 GB limit on 64-bit machines by setting a linker flag, but I haven't been able to quickly figure out how to do it with MinGW's gcc. I will try harder later. Anyway, thanks for the tip since it does look like I should be able to get over the 2 GB hurdle on my computer.

24. sorry, "..." means long repeat.

input:{a * 256 + b * 256} * 2
dictionary: {a * 256 + b * 256}
after string: 00
size: 512 + 2(compressed size:340)

but if select a * 16 and b * 16,
dictionary: {a * 16}{b * 16}
after string: {0*16 + 1*16} * 2
size: 32 + 64(compressed size:76)

25. ## Thanks:

Kennon Conrad (23rd May 2014)

26. Originally Posted by nburns
How do you find where to insert the new node without first doing a lookup?
As with any of the other methods, inserting a new suffix is done by looking up the appropriate place to insert the new suffix and then doing the insertion. Insertion time order for insertion on a suffix tree with bitwise sibling trees is the same as for a suffix tree with sibling lists. Only the lookup has better time order since the bitwise sibling tree limits the number of siblings that must be traversed to log2(alphabet size) instead of alphabet size-1.

Originally Posted by nburns
I'd like to avoid trees altogether. Building and traversing them is slow and complicated, whether or not you count the nodes.
Then hash maps are probably the best option, but the traversal time could be problematic unless another structure is included. I like to avoid hash maps, but that's really just a personal bias without a sound foundation. I have implemented them, but don't like dealing with collisions and think they generally consume a lot of memory if you want to avoid collisions.

27. Originally Posted by xezz
sorry, "..." means long repeat.

input:{a * 256 + b * 256} * 2
dictionary: {a * 256 + b * 256}
after string: 00
size: 512 + 2(compressed size:340)

but if select a * 16 and b * 16,
dictionary: {a * 16}{b * 16}
after string: {0*16 + 1*16} * 2
size: 32 + 64(compressed size:76)
I see what you are describing now. This was caused by a bug in the high level minimum score handling of TreeCompress that can cause the tokenization process to stop prematurely on certain small files. A patch (not the most elegant) is attached. Now the dictionary has 11 entries and the compressed file size is 52 bytes.

28. Good result! But if input is a*256 only, dictionary has no entry.

29. Originally Posted by xezz
Good result! But if input is a*256 only, dictionary has no entry.
That is not a bug. The order 0 entropy of 256 a's followed by an EOF symbol is only 9.44 bits. Adding dictionary entries would make the order 0 entropy increase. The encoder is not very efficient on small files for several reasons. Adding support for symbol codes less than 4 bits would help, or adding an arithmetic coder would be even better.

30. Originally Posted by Kennon Conrad
As with any of the other methods, inserting a new suffix is done by looking up the appropriate place to insert the new suffix and then doing the insertion. Insertion time order for insertion on a suffix tree with bitwise sibling trees is the same as for a suffix tree with sibling lists. Only the lookup has better time order since the bitwise sibling tree limits the number of siblings that must be traversed to log2(alphabet size) instead of alphabet size-1.
Calling sibling list insert O(1) in this situation requires funny accounting. If duplicates are not allowed, an insert has to incorporate a lookup. Therefore, insert can never be faster than lookup here. Wikipedia is good, but keep in mind where it comes from.

31. It does not say sibling list insertion is O(1). It says child insertion is O(1). I think they assume readers know a suffix insertion requires the lookup of the parent followed by a child insertion, but could be more explicit about this to avoid confusion. I find the table to be useful the way it is because it shows the performance of each method for the two parts of the suffix insertion. If they didn't show each step, sibling lists and sorted arrays would show the same time order for suffix insertion. By breaking it into the two steps, it's easy to see that sorted arrays are better than sibling lists for lookup but worse for child insertion.

What I don't like about the table is that it doesn't show the cost of suffix lookups, which I think is different from the cost of finding the child for a character plus inserting the child. At least for sibling lists and sibling trees, there are as many traversals of sibling lists or trees as there are nodes in the match string. For suffix arrays, it seems different, since a program would just do one binary search with one or more character comparisons per probe, the number of extra character comparisons being something like twice the length of the match. I have never implemented a suffix array, so I'm a little fuzzy on that, but it does seem like a suffix array lookup should be faster in many/most cases.

