
Thread: Tree alpha v0.1 download

  1. #31
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Compression is all about prediction (assigning a probability distribution to the next symbol) and coding (in log 1/p bits). The decompresser can't peek at the next symbol, so it has to make an identical prediction to decode.
    You could say with equal truth that compression is all about bits, and making sure you use the smallest number of them. Given a number of bits b, you can calculate probability as 2^-b. The probabilistic model is just a mathematical construct. It doesn't need to be reflected explicitly in the code.

    You seem to be prejudicing this effort in favor of an on-line, symbol-wise kind of methodology. But the direction this has gone in is interesting. What would be the point of turning it into something else -- something that's already been done?
    Last edited by nburns; 18th April 2014 at 03:55.

  2. #32
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Kennon -
    I'm glad you read the re-pair paper. Another thing that would be worthwhile is trying to make the code a bit leaner and meaner. I took a look at it. Your style is somewhat untamed. A lot of people have ideas about what code should look like, and they tend toward the lecturing and pedantic. To me it's relatively simple: it's like compression; I move code into a function primarily to save space and eliminate redundancy. Smaller code is better. I think it could benefit readability and maintainability if you did some refactoring.
    Last edited by nburns; 18th April 2014 at 05:00.

  3. #33
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Yes, the re-pair paper is great. I searched the web high and low for papers about suffix tree based compression, but wasn't able to find anything of interest. Better search terms may have helped....

    Your comments about the code are 100% accurate, not much different than the disclaimer in the Introduction document. I have spent much of my "coding life" writing assembly code and embedded applications that didn't have enough CPU horsepower and have seen a lot of stupid compilers. This may have tainted me. That being said, much of this code is bloated because I couldn't figure out how to get inline functions to work in gcc and function calls were hurting the execution time. Your comment spurred me to look into inline functions again and I found it works just like it should, and I'm baffled why I couldn't get it working earlier. This makes me feel dumb, but happy that I can improve this aspect. A few good inline functions will help a lot on readability and maintainability. It will even help execution speed in the few places I do have non-recursive functions.

    I have also spent time thinking about your earlier comment about data structures and have more organized thoughts about improvements. While the suffix tree structure is good, the program would benefit a lot by using a better sorting algorithm (maybe a 2-3 tree instead of brute force) and changing the prefix tree structure to contain an array of sibling pointers instead of a single sibling pointer, and the recursive functions could be changed to loops that use LIFOs. It's not just maintainability. I am using TreeCompress and a cleaner version of TreeCapEncode as the baseline for smarter additional processes such as an HTML preprocessor, a word encoder, a white space encoder, and a space "transformer".

  4. #34
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, the re-pair paper is great. I searched the web high and low for papers about suffix tree based compression, but wasn't able to find anything of interest. Better search terms may have helped...
    A suffix tree is just for doing efficient searches, so the suffix tree itself is ancillary to compression and not really part of the algorithm.

    Suffix trees are notorious for consuming excessive memory and just being generally wasteful; if your algorithm doesn't absolutely require a ST (this is unlikely), you should switch to a suffix array instead. libdivsufsort is very fast and easy to use.


    Your comments about the code are 100% accurate, not much different than the disclaimer in the Introduction document. I have spent much of my "coding life" writing assembly code and embedded applications that didn't have enough CPU horsepower and have seen a lot of stupid compilers. This may have tainted me. That being said, much of this code is bloated because I couldn't figure out how to get inline functions to work in gcc and function calls were hurting the execution time. Your comment spurred me to look into inline functions again and I found it works just like it should, and I'm baffled why I couldn't get it working earlier. This makes me feel dumb, but happy that I can improve this aspect. A few good inline functions will help a lot on readability and maintainability. It will even help execution speed in the few places I do have non-recursive functions.
    gcc is supposed to inline by itself, or at least that's what they say. I think you should also be okay if you declare the function with "static inline."
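    For what it's worth, a minimal example of the "static inline" form (a made-up helper, not anything from Tree):

    /* With optimization enabled (e.g. -O2), gcc will normally expand a
       "static inline" function at each call site instead of emitting a call. */
    static inline unsigned int low_nibble(unsigned int token)
    {
        return token & 0xF;
    }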

    I have also spent time thinking about your earlier comment about data structures and have more organized thoughts about improvements. While the suffix tree structure is good, the program would benefit a lot by using a better sorting algorithm (maybe a 2-3 tree instead of brute force) and changing the prefix tree structure to contain an array of sibling pointers instead of a single sibling pointer, and the recursive functions could be changed to loops that use LIFOs. It's not just maintainability. I am using TreeCompress and a cleaner version of TreeCapEncode as the baseline for smarter additional processes such as an HTML preprocessor, a word encoder, a white space encoder, and a space "transformer".
    If I had a better grasp of how it works I could probably make some suggestions. Changing recursion into iteration is not going to put a dent in the kind of runtimes you're dealing with. Unless it's running catastrophically low on memory and just thrashing the HDD for days, you probably need to rework the algorithm. The question is still open whether you are actually computing something valuable but very expensive or if the computation is simply doing lots of wasteful things that can be cut out.

  5. #35
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    A suffix tree is just for doing efficient searches, so the suffix tree itself is ancillary to compression and not really part of the algorithm.

    Suffix trees are notorious for consuming excessive memory and just being generally wasteful; if your algorithm doesn't absolutely require a ST (this is unlikely), you should switch to a suffix array instead. libdivsufsort is very fast and easy to use.
    The goal for tree is to optimally compress the suffix tree of the input data. Maybe it is the first algorithm to actually target the prefix tree for compression, I'm not sure. I will have to look into suffix arrays. Since the program's measurement of the suffix tree is not optimal, there may be better algorithms that can provide similar results on a reduced data set.

    There are clearly significantly faster ways to do the compression by selecting only portions of the data, be it blocks of unaltered data, words, whitespace, etc., but none of these are providing as high of a compression ratio. Since my goal is optimum compression ratio rather than speed, I have tended away from those. From what I have seen, the most promising speed improvements can be achieved by breaking the input data into rotating pairs of smaller blocks. There are lots of other opportunities for speed improvement, it just takes time and motivation to explore them.

    Quote Originally Posted by nburns View Post
    gcc is supposed to inline by itself, or at least that's what they say. I think you should also be okay if you declare the function with "static inline."

    If I had a better grasp of how it works I could probably make some suggestions. Changing recursion into iteration is not going to put a dent in the kind of runtimes you're dealing with. Unless it's running catastrophically low on memory and just thrashing the HDD for days, you probably need to rework the algorithm. The question is still open whether you are actually computing something valuable but very expensive or if the computation is simply doing lots of wasteful things that can be cut out.
    Maybe there is a compile flag or something for inline. It clearly hasn't been doing it automatically for me. The runtime is so ridiculously long that minor changes in certain areas can save an hour or more on compression. Percentage wise, it's not much, but if I can save a few hours on a test cycle, I'll take it.

    Highly recursive functions can blow up the stack and my experience with function calls is that they carry a big time penalty. Changing to a loop will not be a miracle cure, but will make the code better.

  6. #36
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    The goal for tree is to optimally compress the suffix tree of the input data. Maybe it is the first algorithm to actually target the prefix tree for compression, I'm not sure. I will have to look into suffix arrays. Since the program's measurement of the suffix tree is not optimal, there may be better algorithms that can provide similar results on a reduced data set.
    Compressing a suffix tree as such wouldn't really be useful, because a suffix tree is normally just an index; it's not a complete representation of the text. The ST is entirely a derivative of the text, so, as far as compression is concerned, any representation of the text contains the suffix tree.

    Your algorithm could be separated from its tree. All you'd have to do is figure out what it measures about the text itself and how to get that measurement without building a ST. It can be useful to picture a ST when thinking about compression, but that doesn't mean you have to literally build a ST in code.

    There are clearly significantly faster ways to do the compression by selecting only portions of the data, be it blocks of unaltered data, words, whitespace, etc., but none of these are providing as high of a compression ratio. Since my goal is optimum compression ratio rather than speed, I have tended away from those. From what I have seen, the most promising speed improvements can be achieved by breaking the input data into rotating pairs of smaller blocks. There are lots of other opportunities for speed improvement, it just takes time and motivation to explore them.
    By "rework the algorithm," I meant looking for more efficient ways to compute the exact same result.

    Maybe there is a compile flag or something for inline. It clearly hasn't been doing it automatically for me. The runtime is so ridiculously long that minor changes in certain areas can save an hour or more on compression. Percentage wise, it's not much, but if I can save a few hours on a test cycle, I'll take it.
    http://gcc.gnu.org/onlinedocs/gcc/Inline.html

    Highly recursive functions can blow up the stack and my experience with function calls is that they carry a big time penalty. Changing to a loop will not be a miracle cure, but will make the code better.
    Yes, but you're talking about a few percent. There are optimizations at various different orders of magnitude.

  7. #37
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Not following your point about a suffix tree not being a complete representation of the text. For each cycle of compression, the input for that cycle could be extracted from the suffix tree by using the last match pointers. The goal of the algorithm is to look for elements deep in the suffix tree that have the most diverse set of children. I wouldn't say I am there, but that's the intent.

    Improving the ordering of the list of best strings by using a 2-3 tree or similar instead of a brute force approach will produce a faster algorithm with the exact same results. Same thing for the prefix tree addition I mentioned. The recursive thing is relatively minor, but even 1% faster is almost an hour on my computer with enwik9. The ordering of the strings is the biggest one because it is very inefficient to move up to 40K values when a new good match is found:

    while (score_position > new_score_position)
    {
        element_scores_score[score_position] = element_scores_score[score_position-1];
        element_scores_num_chars[score_position] = element_scores_num_chars[score_position-1];
        element_scores_last_match_ptr1[score_position] = element_scores_last_match_ptr1[score_position-1];
        element_scores_last_match_ptr2[score_position] = element_scores_last_match_ptr2[score_position-1];
        score_position--;
    }
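
    As a rough sketch of the kind of structure change being discussed (my own illustration with invented names, not Tree's code): keeping the N best candidates in a binary min-heap keyed on score makes adding a candidate O(log N) instead of shifting up to 40K table entries. It only covers insertion; it does not keep the table in fully sorted order and it ignores the overlap checking that the real scan does.

    #include <stddef.h>

    #define MAX_SCORES 20000   /* capacity chosen arbitrarily for the sketch */

    struct score_entry {
        float score;
        unsigned int num_chars;
        void *last_match_ptr1;
        void *last_match_ptr2;
    };

    static struct score_entry score_heap[MAX_SCORES];
    static size_t heap_size = 0;

    static void swap_entries(struct score_entry *a, struct score_entry *b)
    {
        struct score_entry tmp = *a;
        *a = *b;
        *b = tmp;
    }

    /* Move the entry at index i up until its parent's score is not larger. */
    static void sift_up(size_t i)
    {
        while (i > 0 && score_heap[(i - 1) / 2].score > score_heap[i].score) {
            swap_entries(&score_heap[(i - 1) / 2], &score_heap[i]);
            i = (i - 1) / 2;
        }
    }

    /* Move the entry at index i down until both children have larger scores. */
    static void sift_down(size_t i)
    {
        for (;;) {
            size_t left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < heap_size && score_heap[left].score < score_heap[smallest].score)
                smallest = left;
            if (right < heap_size && score_heap[right].score < score_heap[smallest].score)
                smallest = right;
            if (smallest == i)
                break;
            swap_entries(&score_heap[smallest], &score_heap[i]);
            i = smallest;
        }
    }

    /* Add a candidate: grow the heap while there is room, otherwise replace
       the current worst (smallest) kept score only if the new one beats it. */
    static void add_score(struct score_entry e)
    {
        if (heap_size < MAX_SCORES) {
            score_heap[heap_size] = e;
            sift_up(heap_size);
            heap_size++;
        } else if (e.score > score_heap[0].score) {
            score_heap[0] = e;
            sift_down(0);
        }
    }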

    Thanks for the gcc link. I need to spend more time there. It sounds like making the inlines static may be a little better than my first cut not using static. I already chopped many lines out of TreeCompress, so it will be more readable when I am done. Still not great, but better.

    I did find that I can match all the words with spaces in front that occur at least 100 times without hurting compression of enwik8. It only takes 39 seconds on my computer and knocks the compression time down to about half. I imagine the time improvement will be more dramatic on enwik9.
    Last edited by Kennon Conrad; 18th April 2014 at 20:56.

  8. #38
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Not following your point about a suffix tree not being a complete representation of the text. For each cycle of compression, the input for that cycle could be extracted from the suffix tree by using the last match pointers. The goal of the algorithm is to look for elements deep in the suffix tree that have the most diverse set of children. I wouldn't say I am there, but that's the intent.
    Ok. Earlier you said that your goal was to compress a suffix tree. What I'm hearing you say now is that you're using a ST for compression. It makes sense to use a ST to look up information. If you tried to compress a ST as a way to compress text, though, you'd most likely find that this approach leads nowhere.

    Improving the ordering of the list of best strings by using a 2-3 tree or similar instead of a brute force approach will produce a faster algorithm with the exact same results. Same thing for the prefix tree addition I mentioned. The recursive thing is relatively minor, but even 1% faster is almost an hour on my computer with enwik9. The ordering of the strings is the biggest one because it is very inefficient to move up to 40K values when a new good match is found:

    while (score_position > new_score_position)
    {
        element_scores_score[score_position] = element_scores_score[score_position-1];
        element_scores_num_chars[score_position] = element_scores_num_chars[score_position-1];
        element_scores_last_match_ptr1[score_position] = element_scores_last_match_ptr1[score_position-1];
        element_scores_last_match_ptr2[score_position] = element_scores_last_match_ptr2[score_position-1];
        score_position--;
    }
    That looks a lot like insertion sort, which could account for a lot of wasted time. I checked your code, however, and it looks like you're also scanning for overlaps as you go. The sorting problem probably has a simple solution, but you won't see much benefit unless you can also do something about the overlap scan.

    Thanks for the gcc link. I need to spend more time there. It sounds like making the inlines static may be a little better than my first cut not using static. I already chopped many lines out of TreeCompress, so it will be more readable when I am done. Still not great, but better.
    I'm not that big of a fan of gcc inline functions. If you want absolutely guaranteed inlining, you can use macros, which have simple and predictable effects. I've never been in a situation where I felt like I absolutely needed inline functions.
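    For example (a made-up macro, not from Tree): there is no closing marker for a macro; a multi-line body is just continued with backslashes, and it is conventionally wrapped in do { ... } while (0) so it behaves like a single statement.

    /* Function-like macro: guaranteed textual expansion at every use site. */
    #define SWAP_UINT(a, b)          \
        do {                         \
            unsigned int tmp_ = (a); \
            (a) = (b);               \
            (b) = tmp_;              \
        } while (0)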

    I did find that I can match all the words with spaces in front that occur at least 100 times without hurting compression of enwik8. It only takes 39 seconds on my computer and knocks the compression time down to about half. I imagine the time improvement will be more dramatic on enwik9.
    There are going to be some optimizations that will be no-brainers because they make the code significantly shorter, simpler and faster all at the same time. There will be others, though, that give you a small benefit while simultaneously expanding and complicating the code. It would be a good idea to save that kind for last.

    I noticed that I have no clue what you mean by prefix tree, because you're probably using it in some nonstandard way. Reading about algorithms will help you to start speaking the same language as everybody else, which makes communication a lot easier.
    Last edited by nburns; 19th April 2014 at 04:28.

  9. #39
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Ok. Earlier you said that your goal was to compress a suffix tree. What I'm hearing you say now is that you're using a ST for compression. It makes sense to use a ST to look up information. If you tried to compress a ST as a way to compress text, though, you'd most likely find that this approach leads nowhere.
    I should have been clearer. I am trying to compress the text in a way such that each iteration maximizes the ratio of the decrease in order 0 entropy of the input to the decrease in the average depth of the suffix tree of the input. I think of it as compressing the suffix tree because strings are being merged into a single node but technically it is not because branches are being changed. My apologies for confusing you with a loose and incorrect description.

    Quote Originally Posted by nburns View Post
    That looks a lot like insertion sort, which could account for a lot of wasted time. I checked your code, however, and it looks like you're also scanning for overlaps as you go. The sorting problem probably has a simple solution, but you won't see much benefit unless you can also do something about the overlap scan.
    The overlap and substitution scans should both be much less susceptible to getting bogged down with the change to use a sibling array. If I go with an array of 16 pointers, scan characters will always be found within 6 siblings. It can be well over 1000 now on the character after spaces. With this change, the majority of siblings should be found within two siblings. I may abandon the initial overlap checks and just let the overlap scan find them.
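
    For illustration, one way the "array of sibling pointers" idea could look (field names are mine and Tree's actual layout may differ): the children of a node are kept in 16 short lists, bucketed by the low nibble of the token, so a scan only walks the one list that can contain a match.

    #include <stddef.h>
    #include <stdint.h>

    struct tnode {
        uint32_t token;              /* symbol stored at this node */
        struct tnode *next_sibling;  /* next node within the same bucket */
        struct tnode *child[16];     /* bucket heads for this node's children */
    };

    /* Find the child of 'parent' carrying 'token', walking only its bucket. */
    static struct tnode *find_child(const struct tnode *parent, uint32_t token)
    {
        struct tnode *p = parent->child[token & 0xF];
        while (p != NULL && p->token != token)
            p = p->next_sibling;
        return p;
    }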

    Quote Originally Posted by nburns View Post
    There are going to be some optimizations that will be no-brainers because they make the code significantly shorter, simpler and faster all at the same time. There will be others, though, that give you a small benefit while simultaneously expanding and complicating the code. It would be a good idea to save that kind for last.
    This is more experiment than anything and there is a reason I call the release a toolset. I just copied TreeCompress to another file and changed about a dozen lines of code to convert the program into a word finder. I don't mind having to string a few processes together in a batch file for now. I like to do a lot of experimenting, so cutting the compression time down by half sounds good to me. It also makes logical sense because words that occur over and over are going to get tokenized eventually anyway and these are patterns that consume a lot of the memory needed to build the suffix tree. On enwik9, the first iteration of TreeCompress went from about 55 minutes and 500 suffix tree build cycles to less than 15 minutes and 75 cycles. It took about 15 minutes to get the program to a point that takes about a day on TreeCompress now. TreeCompress has been running on enwik9 for 9 hours now and I see compression similar to the 60 hour point on the released code. The compression ratios look fine, but I can't be sure until the program is done. It probably won't be much faster from here, but am pretty optimistic that this can be a good addition.

    Quote Originally Posted by nburns View Post
    I noticed that I have no clue what you mean by prefix tree, because you're probably using it in some nonstandard way. Reading about algorithms will help you to start speaking the same language as everybody else, which makes communication a lot easier.
    See http://en.wikipedia.org/wiki/Prefix_tree. I use the prefix tree of the up to 10,000 strings being tokenized to check for overlaps in a single scan and then to substitute the new symbols in a single pass. It is augmented with additional variables so the parser knows where to go when the input token does not match any of the siblings.

    I looked at the gnu manual for macros around New Years, but couldn't figure it out in a short amount of time. Don't you need an end define or something?
    Last edited by Kennon Conrad; 19th April 2014 at 06:56.

  10. #40
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I should have been clearer. I am trying to compress the text in a way such that each iteration maximizes the ratio of the decrease in order 0 entropy of the input to the decrease in the average depth of the suffix tree of the input. I think of it as compressing the suffix tree because strings are being merged into a single node but technically it is not because branches are being changed. My apologies for confusing you with a loose and incorrect description.
    How are you calculating average depth? Are you averaging across all nodes? Are you counting the depth in nodes or symbols? I don't think there's any one clear way to measure that. Each position in the text occurs at multiple depths.

  11. #41
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    How are you calculating average depth? Are you averaging across all nodes? Are you counting the depth in nodes or symbols? I don't think there's any one clear way to measure that. Each position in the text occurs at multiple depths.
    Once again, I have not said what I wanted to say. I should have said the average depth of the leaf nodes. It may be obvious now, but the average depth of the leaf nodes in a suffix tree is (I+L)/L, where I is the number of internal nodes and L is the number of leaves. L is equal to the number of input symbols because a suffix tree has one leaf per input symbol. The released program only does the (I+L)/L calculation on a small subset of the data at the start of each loop to determine how much of the suffix tree to build in each sub-loop. The main usage was with an unreleased program that helped me figure out a node rating algorithm that works reasonably well. It is also used in the program used to generate the plot of the estimated unbounded entropy for blocks of enwik9 in the .pdf document in the v0.1 package.

    I have added 20 new functions to TreeCompress so far and the code is 17% smaller and somewhat easier on the eyes, but it still could use a lot more cleanup.

  12. #42
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Once again, I have not said what I wanted to say. I should have said the average depth of the leaf nodes. It may be obvious now, but the average depth of the leaf nodes in a suffix tree is (I+L)/L, where I is the number of internal nodes and L is the number of leaves. L is equal to the number of input symbols because a suffix tree has one leaf per input symbol. The released program only does the (I+L)/L calculation on a small subset of the data at the start of each loop to determine how much of the suffix tree to build in each sub-loop. The main usage was with an unreleased program that helped me figure out a node rating algorithm that works reasonably well. It is also used in the program used to generate the plot of the estimated unbounded entropy for blocks of enwik9 in the .pdf document in the v0.1 package.

    I have added 20 new functions to TreeCompress so far and the code is 17% smaller and somewhat easier on the eyes, but it still could use a lot more cleanup.
    So your formula is basically nodes/leaves. That doesn't seem to make sense to me. Wouldn't the average depth of a perfectly-balanced tree be log(n)? I'm pretty sure there should be a log in there if the tree is balanced.
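
    As a concrete check (my numbers, just to illustrate the point): a perfectly balanced binary tree with L = 2^d leaves has I = L - 1 internal nodes, so (I + L)/L = 2 - 1/L, which stays near 2 no matter how large the tree gets, while the actual average leaf depth is d = log2(L).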

    That formula must be meaningful somehow, though, if it works. That number is the inverse of the probability that a randomly-chosen node is a leaf, which is something like the expected number of trials to get a leaf when selecting random nodes. It would be related to the 0-order entropy of the tree structure.

    It sounds like the refactor is going well. There's no points awarded for number of functions, of course. A good way to tell a good function is whether you can give it a short, pithy and accurate name. As a hypothetical example, calculate_symbol_entropy sounds like a good function; done_with_sort_and_about_to_do_x sounds like a bad one. I'm not sure if anyone else has observed this, but factoring code into functions is a lot like compression, particularly the type of compression you're doing.

    It occurred to me that what your formula might represent is the entropy of the dictionary. So you're trying to maximize the decrease in entropy of the text while minimizing the increase in entropy of the dictionary, which would make plenty of sense.

  13. #43
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Yes, nodes/leaves. You are right (again) and I'm still in this way deeper than my background supports. A perfectly balanced tree would have an average depth of log(n) if the base of the log is the alphabet size but have a nodes/leaf ratio close to 1, indicating the unbounded entropy is approaching the order 0 entropy. The formula isn't quite right because it doesn't handle the expected internal nodes caused by random data, but the entropy and entropy * nodes / leaves are close enough to converging that it is still useful for me (indicating order 0 compression is done).

    The dictionary and text are treated as a single input string while compressing. The dictionary grows because new symbols are added but that is somewhat offset by existing dictionary entries getting compressed along with the main text. I'd say I'm trying to maximize the decrease in order 0 entropy of the text plus existing dictionary while minimizing the loss of higher order entropy in the text plus dictionary. My simplistic view is that the tokenization process changes some of the higher order entropy to order 0 entropy but also destroys some of the higher order entropy. The idea is to preserve as much of the higher order entropy as possible while converting the higher order entropy to order 0 entropy. I imagine I explained that poorly, but hope you get the idea. Long, common strings give the best reduction in the number of leaves (delta L = length * (instances - 1) - 1) relative to the loss of internal nodes (I), keeping the (I+L)/L ratio up compared to choosing pairs. I think I always decreases faster than L, it's a matter of keeping the extra decrease to a minimum.
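
    As a quick worked instance of that delta L formula (numbers picked arbitrarily): a string of length 5 with 10 instances gives delta L = 5 * (10 - 1) - 1 = 44 fewer leaves, while a pair with the same 10 instances only gives 2 * (10 - 1) - 1 = 17.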

    Just being able to look at the code and more easily recognize what is going on is reward enough for me. My function names are things like get_token, finish_UTF8_2byte_read, create_child, add_suffix, move_to_match_sibling, etc. It makes the code look more organized and it's easier to read, but there are even more global variables that add a layer of obscurity, so it's still not clean. I took a little break from the cleanup of TreeCompress and changed the prefix tree structure to have the array of 16 sibling pointers and made a lame initial implementation that only uses the lower nibble of the token independent of the sibling depth (basically 16 lists instead of 1) and saw the overlap scan plus token substitution scan drop from a combined 49 seconds to 21 seconds for one iteration about half way through the compression process of enwik9. The full implementation will be a nice improvement, especially helpful if the code is ever multi-threaded.
    Last edited by Kennon Conrad; 20th April 2014 at 11:44.

  14. #44
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Of course you can transmit hints to the decompresser separately (if you don't like the backward approach) but the hints eat up any coding space you would have saved. You can't get around Kolmogorov complexity.
    In my head, I often make an analogy between text compression and building a network of roads. You can have the paver tied to the back of your car or you could do some planning and send the paver ahead. I like the concept, but perhaps you are right and it doesn't work.

    The latest compression run on enwik9 has completed. Even though I was using my computer a lot, sometimes running another compression simultaneously, TreeCompress run time was reduced from about 337K seconds to 171K seconds. The scan speed improvements should take that under 160K. The pre-compressor adds about 580 seconds. Compressed file size went from 185,311,980 bytes to 185,194,072 bytes. It's still ridiculously slow, but there are also still good opportunities for compression speed improvements.

  15. #45
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, nodes/leaves. You are right (again) and I'm still in this way deeper than my background supports. A perfectly balanced tree would have an average depth of log(n) if the base of the log is the alphabet size
    Yeah. In computer science, we ignore those kinds of details and just say O(log n).

    but have a nodes/leaf ratio close to 1, indicating the unbounded entropy is approaching the order 0 entropy. The formula isn't quite right because it doesn't handle the expected internal nodes caused by random data, but the entropy and entropy * nodes / leaves are close enough to converging that it is still useful for me (indicating order 0 compression is done).
    By unbounded entropy, you mean all high-order entropy?

    I think a De Bruijn sequence would give you a perfect suffix tree. Random data would give you a tree that is approximately balanced.

    Any tree that is reasonably well-balanced has an average height of O(log n).

    The dictionary and text are treated as a single input string while compressing. The dictionary grows because new symbols are added but that is somewhat offset by existing dictionary entries getting compressed along with the main text. I'd say I'm trying to maximize the decrease in order 0 entropy of the text plus existing dictionary while minimizing the loss of higher order entropy in the text plus dictionary. My simplistic view is that the tokenization process changes some of the higher order entropy to order 0 entropy but also destroys some of the higher order entropy. The idea is to preserve as much of the higher order entropy as possible while converting the higher order entropy to order 0 entropy. I imagine I explained that poorly, but hope you get the idea. Long, common strings give the best reduction in the number of leaves (delta L = length * (instances - 1) - 1) relative to the loss of internal nodes (I), keeping the (I+L)/L ratio up compared to choosing pairs. I think I always decreases faster than L, it's a matter of keeping the extra decrease to a minimum.
    Re-compressing the dictionary with the stream must create an interesting dynamic. I can't think of anything else that works like that. In re-pair, the dictionary is a forest of binary trees and the text is just text; it wouldn't seem to be possible to mix the different types of data. I think you are using valid unicode symbols for dictionary references; you must be encoding the dictionary that way, too. In re-pair, it seems clear by the way the references are chosen that the dictionary would not be compressible, at least, not in the same way as the original text. But I guess you are basically starting with long matches and working your way down, whereas re-pair does the opposite: it starts with small matches and works its way up. So, you would probably have to subject the dictionary to further compression, wouldn't you? That is certainly something that makes your algorithm interesting.

    I get what you're saying. Nodes are repeated phrases in the text, and, ideally, you could surgically extract them one at a time, but, in reality, repeats partially overlap, and so when you choose a node/set of matches to condense into a dictionary phrase, you are typically destroying other nodes/matches in the process. You want to make the best dictionary phrase you can while minimizing collateral damage to the tree. The node-match relationship is made complicated, of course, by the fact that every repeat is made up of multiple nodes, one for each suffix. So the algorithm will subtract multiple nodes even in the ideal case. It gets complicated to reason about, but you seem to have found a good formula. Your formula may help throw light on how compression works from a new and different direction, which is always fun.

    You do realize that you can build any phrase out of pairs, don't you?

    Just being able to look at the code and more easily recognize what is going on is reward enough for me. My function names are things like get_token, finish_UTF8_2byte_read, create_child, add_suffix, move_to_match_sibling, etc. It makes the code look more organized and it's easier to read, but there are even more global variables that add a layer of obscurity, so it's still not clean. I took a little break from the cleanup of TreeCompress and changed the prefix tree structure to have the array of 16 sibling pointers and made a lame initial implementation that only uses the lower nibble of the token independent of the sibling depth (basically 16 lists instead of 1) and saw the overlap scan plus token substitution scan drop from a combined 49 seconds to 21 seconds for one iteration about half way through the compression process of enwik9. The full implementation will be a nice improvement, especially helpful if the code is ever multi-threaded.
    If it's looking better to you, then I'm sure it's getting better. The code must have been well-organized even if it was verbose, because you generally can't write something complicated that works if your code is pathologically flawed. You can recognize good code without having to see it, because good code works.

  16. #46
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Yeah. In computer science, we ignore those kinds of details and just say O(log n).



    By unbounded entropy, you mean all high-order entropy?
    I have to laugh at myself. I actually graduated Cum Laude in CS but have forgotten so much. It's kind of sad that my ability to speak the language has gotten so bad. At least it is slowly getting better....

    Unbounded entropy is a term Dr. Mahoney suggested in a private email. It gets one thinking along the right track, but is perhaps not a perfect description. The formula captures much of the high-order entropy but only the portion with matches in the lower orders. It can't measure entropy that exists when intervening characters don't match. An example would be "King George IV" and "King Edward IV". It also does not account for changes in symbol frequency or symbol relationships since it doesn't look at position. It does not work well on the Calgary Corpus, probably because of the drastic changes in the data. It is also possible (or likely) that the L/N * Order 0 entropy formula is only the first order approximation of the actual formula. All I am sure of is that it provides a realistic estimate for enwik8 and enwik9, helps me test variations of the ranking algorithm, and makes a plot of enwik9 that looks a lot like the plot from an actual compressor on the LTCB site.

    Quote Originally Posted by nburns View Post
    Re-compressing the dictionary with the stream must create an interesting dynamic. I can't think of anything else that works like that. In re-pair, the dictionary is a forest of binary trees and the text is just text; it wouldn't seem to be possible to mix the different types of data. I think you are using valid unicode symbols for dictionary references; you must be encoding the dictionary that way, too. In re-pair, it seems clear by the way the references are chosen that the dictionary would not be compressible, at least, not in the same way as the original text. But I guess you are basically starting with long matches and working your way down, whereas re-pair does the opposite: it starts with small matches and works its way up. So, you would probably have to subject the dictionary to further compression, wouldn't you? That is certainly something that makes your algorithm interesting.
    I find this puzzle to be fascinating. That's the only reason I am still working on it. I agree with your thoughts on re-pair not being able to usefully mix the dictionary and text since pairs can't be broken down further (unless you want to un-pair??). I think all that is left at that point is predictive methods.

    I am using invalid unicode symbols and UTF-8 symbols. The I/O file format uses 4 byte sequences for dictionary reference that all use 0xFE for the first byte. UTF-8 symbols are not changed. No valid UTF-8 character starts with any of 0xFC - 0xFF. Dictionary entries are placed at the end of the file and use 4 byte sequences starting with 0xFF to mark the start of the entry. When processing the data, UTF-8 symbols are mapped to their unicode code points and dictionary references are mapped to start at 0x29000, which is just big enough to handle the largest unicode code point in enwik9. I just realized the largest unicode code point is only 0x2FFFF, so I should probably use 0x30000 instead or make it one more than the largest code point detected in the file and add a header.
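
    As a rough sketch of what a reader for that format might look like (my own illustration; the helper name, the exact payload layout of a 0xFE reference, and the lack of validation are all assumptions, not Tree's actual code):

    #include <stdint.h>

    #define DICT_BASE 0x29000u   /* offset for dictionary reference symbols */

    enum token_kind { TOKEN_UTF8, TOKEN_DICT_REF, TOKEN_DICT_ENTRY_START };

    struct token {
        enum token_kind kind;
        uint32_t symbol;         /* code point, or DICT_BASE + reference index */
    };

    /* Classify one token by its lead byte and return the bytes consumed. */
    static int read_token(const uint8_t *p, struct token *out)
    {
        if (p[0] == 0xFF) {                      /* start of a dictionary entry */
            out->kind = TOKEN_DICT_ENTRY_START;
            out->symbol = 0;
            return 4;
        }
        if (p[0] == 0xFE) {                      /* 4-byte dictionary reference */
            out->kind = TOKEN_DICT_REF;
            out->symbol = DICT_BASE + (((uint32_t)p[1] << 16) |
                                       ((uint32_t)p[2] << 8) | p[3]);
            return 4;
        }
        out->kind = TOKEN_UTF8;                  /* unchanged UTF-8 character */
        if (p[0] < 0x80) {                       /* 1-byte ASCII */
            out->symbol = p[0];
            return 1;
        }
        if ((p[0] & 0xE0) == 0xC0) {             /* 2-byte sequence */
            out->symbol = ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
            return 2;
        }
        if ((p[0] & 0xF0) == 0xE0) {             /* 3-byte sequence */
            out->symbol = ((uint32_t)(p[0] & 0x0F) << 12) |
                          ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
            return 3;
        }
        /* 4-byte sequence (lead byte 0xF0-0xF4) */
        out->symbol = ((uint32_t)(p[0] & 0x07) << 18) |
                      ((uint32_t)(p[1] & 0x3F) << 12) |
                      ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
        return 4;
    }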

    Since there is only one instance of each dictionary entry marking symbol, these symbols cause the tree builder to place them in leaf nodes because the position in the tree won't ever get a match. The scoring algorithm only looks at internal nodes so the compressor only compresses the text between the dictionary entry symbols. The bit encoder replaces the first instance of each symbol in the text with a (non-unique) "start definition" symbol, a symbol "frequency/count" code, and a symbol (compressed) length code. It stops when the first dictionary entry is detected in the input. This effectively converts the first symbol plus the unique dictionary entry symbol (which would consume an average of about 5 bytes together) into three typically very short codes, with a total length that averages about 1 byte.

    It sounds like you understand how the compressor works now. At least two dictionary entries are over 5000 symbols long when they are put in the dictionary, and lots are over 1000 characters long. The longest dictionary entry when the program is done is about 700 symbols and there are only five entries with more than 500 symbols.

    Yes, I do realize any phrase can be built from pairs. But I think it is possible that it is better to have a "3.14" in the dictionary without any of the subpairs. Testing this idea out was one of the main reasons I wrote the code.

  17. #47
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree v0.3

    Here is version 0.3 of the Tree Compression Toolset.

    Changes are as follows:
    TreeCompressLS is a suffix tree reducer that only tokenizes certain combinations of a space character followed by text. It is very exploratory and represents my first implementation of modeling spaces separately that does not hurt the compressed file size. It will probably change. This process should be run between TreeCapEncode and TreeCompress.

    TreeComp.bat is included to provide a sample of how to compress a UTF-8 compliant text file.

    TreeCapEncode now has buffered writes. Enwik8 run time is 12 seconds instead of 4 minutes for prior releases.

    TreeCompress has improved data structures. It also has functions for code that was redundant, which makes the code more readable. The following are brief descriptions of the changes. The level 1 suffix tree nodes (for match finding) now each have an array of 16 sibling pointers instead of 2. The prefix tree nodes (for overlap checking and one-pass token substitution) now have arrays of 16 sibling pointers instead of a single sibling pointer. Maximum match quantity management has been added and the maximum match quantity has been increased from 10,000 to 20,000 new tokens. Token candidate selection is slightly improved. Crash file saves have been reduced. The token creation cutoff levels have been re-optimized for not having an end of define character consuming token space (v0.2 change). Maximum RAM usage has been reduced to hopefully provide more consistent results.

    TreeBitEncode and TreeDecode have been modified to allow bit codes as short as 4 bits instead of 8 bits.

    Results on A8-5500 CPU:
    Prior v0.2 release:
    enwik9: 185,311,980 bytes, 337,287 seconds to compress, 22 seconds to decompress
    enwik8: 23,250,856 bytes, 11,681 seconds to compress, 2.8 seconds to decompress

    v0.3 release
    enwik9: 184,838,711 bytes, 105,728 seconds to compress, 23 seconds to decompress
    enwik8: 23,233,932 bytes, 4,541 seconds to compress, 2.7 seconds to decompress

    compressed TreeDecode.c file size: 6,591 bytes
    Attached Files
    Last edited by Kennon Conrad; 28th April 2014 at 04:28.

  18. #48
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    For TreeCapEncode, you could probably use something like re2c. flex and ragel are other options. The code would be shorter and it would likely run faster.

    Edit: see if you can use flex. It's the most standard of the three, and it has an easy syntax. It's a useful thing to know about, anyway.
    Last edited by nburns; 28th April 2014 at 08:06.

  19. #49
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I have to laugh at myself. I actually graduated Cum Laude in CS but have forgotten so much. It's kind of sad that my ability to speak the language has gotten so bad. At least it is slowly getting better....
    I was about to say, you're by far the best untrained computer scientist I know. I guess you're not entirely untrained, huh?

    Unbounded entropy is a term Dr. Mahoney suggested in a private email. It gets one thinking along the right track, but is perhaps not a perfect description. The formula captures much of the high-order entropy but only the portion with matches in the lower orders. It can't measure entropy that exists when intervening characters don't match. An example would be "King George IV" and "King Edward IV". It also does not account for changes in symbol frequency or symbol relationships since it doesn't look at position. It does not work well on the Calgary Corpus, probably because of the drastic changes in the data. It is also possible (or likely) that the L/N * Order 0 entropy formula is only the first order approximation of the actual formula. All I am sure of is that it provides a realistic estimate for enwik8 and enwik9, helps me test variations of the ranking algorithm, and makes a plot of enwik9 that looks a lot like the plot from an actual compressor on the LTCB site.
    I think Matt means this: the conventional formulas for high-order entropy use a fixed order k. So, there is a different one for each k. I think that by unbounded entropy Matt is simply referring to a combination of all orders.

    I find this puzzle to be fascinating. That's the only reason I am still working on it. I agree with your thoughts on re-pair not being to usefully mix the dictionary and text since pairs can't be broken down further (unless you want to un-pair??). I think all that is left at that point is predictive methods.

    I am using invalid unicode symbols and UTF-8 symbols. The I/O file format uses 4 byte sequences for dictionary reference that all use 0xFE for the first byte. UTF-8 symbols are not changed. No valid UTF-8 character starts with any of 0xFC - 0xFF. Dictionary entries are placed at the end of the file and use 4 byte sequences starting with 0xFF to mark the start of the entry. When processing the data, UTF-8 symbols are mapped to their unicode code points and dictionary references are mapped to start at 0x29000, which is just big enough to handle the largest unicode code point in enwik9. I just realized the largest unicode code point is only 0x2FFFF, so I should probably use 0x30000 instead or make it one more than the largest code point detected in the file and add a header.

    Since there is only one instance of each dictionary entry marking symbol, these symbols cause the tree builder to place them in leaf nodes because the position in the tree won't ever get a match. The scoring algorithm only looks at internal nodes so the compressor only compresses the text between the dictionary entry symbols. The bit encoder replaces the first instance of each symbol in the text with a (non-unique) "start definition" symbol, a symbol "frequency/count" code, and a symbol (compressed) length code. It stops when the first dictionary entry is detected in the input. This effectively converts the first symbol plus the unique dictionary entry symbol (which would consume an average of about 5 bytes together) into three typically very short codes, with a total length that averages about 1 byte.

    It sounds like you understand how the compressor works now. At least two dictionary entries are over 5000 symbols long when they are put in the dictionary, and lots are over 1000 characters long. The longest dictionary entry when the program is done is about 700 symbols and there are only five entries with more than 500 symbols.

    Yes, I do realize any phrase can be built from pairs. But I think it is possible that it is better to have a "3.14" in the dictionary without any of the subpairs. Testing this idea out was one of the main reasons I wrote the code.

  20. #50
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I was about to say, you're by far the best untrained computer scientist I know. I guess you're not entirely untrained, huh?
    Well trained, but very rusty. At one time, I knew Basic, Fortran, Ratfor, Pascal, Lisp, Cobol, APL and a few other ancient languages. I first used a computer in the 8th grade, going to school early so I could use the teletype machine to connect to the school district's computer to write and test simple basic programs before the staff arrived. How things have changed.... My initial motivation was just to brush up on my programming and algorithm development "skills" for a week or so. That was last summer.


    Quote Originally Posted by nburns View Post
    I think Matt means this: the conventional formulas for high-order entropy use a fixed order k. So, there is a different one for each k. I think that by unbounded entropy Matt is simply referring to a combination of all orders.
    Agreed. It's just that in general, higher order entropy does not require the presence of lower order entropy. Perhaps calling it something like contiguous unbounded entropy would be more appropriate. Thinking about your thought about the shape of the tree with uncorrelated data has given me new insight. I have felt like I am on the verge of having a better formula a couple of times recently, but am not quite there yet.

    Thanks for the tip on flex. I don't plan to do much with TreeCapEncode, but if I had known earlier it seems like an easy way to go. Also useful for the future.

    For now, I continue to work on the data structures and algorithms to try to make the compression time less ridiculous. TreeCompress is now running about 25% faster than v0.3. v0.3 was probably a little premature, but I kind of felt sorry for anyone looking at the code for TreeCompress. It's still far from perfect, but a big step in the right direction.
    Last edited by Kennon Conrad; 30th April 2014 at 17:51.

  21. #51
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Well trained, but very rusty. At one time, I knew Basic, Fortran, Ratfor, Pascal, Lisp, Cobol, APL and a few other ancient languages. I first used a computer in the 8th grade, going to school early so I could use the teletype machine to connect to the school district's computer to write and test simple basic programs before the staff arrived. How things have changed.... My initial motivation was just to brush up on my programming and algorithm development "skills" for a week or so. That was last summer.
    I'd say it was worth it.

    This is a nice contribution to the forum.

    Agreed. It's just that in general, higher order entropy does not require the presence of lower order entropy.
    I think that it does, because if you don't have any low-order entropy, then you don't have any data. I think you may be mixing up the meaning of entropy and using it to refer to its exact opposite. Entropy measures the part of data that is incompressible, so the opposite would be something like the redundancy or waste. It's also formally defined as -log(p) and you can find examples of how various entropy measures are calculated on the web.
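
    For reference, the standard per-symbol definitions those terms come from (order 0 and order k empirical entropy over an alphabet Sigma, with p taken from observed frequencies):

    H_0 = -\sum_{c \in \Sigma} p(c) \log_2 p(c)

    H_k = \sum_{w \in \Sigma^k} p(w) \left( -\sum_{c \in \Sigma} p(c \mid w) \log_2 p(c \mid w) \right)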

    *"Unbounded entropy" might not have a formula.

    Perhaps calling it something like contiguous unbounded entropy would be more appropriate. Thinking about your thought about the shape of the tree with uncorrelated data has given me new insight. I have felt like I am on the verge of having a better formula a couple of times recently, but am not quite there yet.
    The thing about entropy is that the number you get depends on what type of entropy you choose to measure. Some of the kinds of patterns you were mentioning are very hard to find and to profit from in compression.

    Thanks for the tip on flex. I don't plan to do much with TreeCapEncode, but if I had known earlier it seems like an easy way to go. Also useful for the future.
    It sounded like you were trying to optimize it. I thought I might save you time.

    For now, I continue to work on the data structures and algorithms to try to make the compression time less ridiculous. TreeCompress is now running about 25% faster than v0.3. v0.3 was probably a little premature, but I kind of felt sorry for anyone looking at the code for TreeCompress. It's still far from perfect, but a big step in the right direction.
    The fairly obvious thing to attack is the repeated suffix tree builds. Doesn't the new ST look a lot like the old one?
    Last edited by nburns; 30th April 2014 at 13:06.

  22. #52
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I think that it does, because if you don't have any low-order entropy, then you don't have any data. I think you may be mixing up the meaning of entropy and using it to refer to its exact opposite. Entropy measures the part of data that is incompressible, so the opposite would be something like the redundancy or waste. It's also formally defined as -log(p) and you can find examples of how various entropy measures are calculated on the web.

    *"Unbounded entropy" might not have a formula.
    Yes, I meant the loss in entropy or whatever it is called. The order 0 entropy * L/N formula only sees higher orders of compressibility when all of the lower order symbols match. I get the impression the shape of the suffix tree has not been studied much and that even my simple formula hasn't been looked at before.

    Quote Originally Posted by nburns View Post
    The fairly obvious thing to attack is the repeated suffix tree builds. Doesn't the new ST look a lot like the old one?
    Yes and no. It would look a lot like the old one if the program had enough memory to build the entire suffix tree at once. With a 2 GB process limit, it's not possible to fit the data plus suffix tree in RAM for enwik9 at any stage of compression. With unlimited RAM, the algorithm could build the tree once and then just modify it as it goes. I have thought about moving away from Windows, but hesitate because most computers run Windows. Long term, I am thinking the best speed improvement would be to do most of the compression in a single pass with a limited length look-ahead suffix tree that gets modified on the fly.

  23. #53
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, I meant the loss in entropy or whatever it is called. The order 0 entropy * L/N formula only sees higher orders of compressibility when all of the lower order symbols match. I get the impression the shape of the suffix tree has not been studied much and that even my simple formula hasn't been looked at before.
    STs have received lots of study. I tried a quick search to see if the nodes/leaves ratio had been mentioned and given a name, but I didn't find anything. You ought to pin down exactly what it measures about the text itself, and also how exactly it acts as a proxy for compressibility in the score.

    Yes and no. It would look a lot like the old one if the program had enough memory to build the entire suffix tree at once. With a 2 GB process limit, it's not possible to fit the data plus suffix tree in RAM for enwik9 at any stage of compression. With unlimited RAM, the algorithm could build the tree once and then just modify it as it goes. I have thought about moving away from Windows, but hesitate because most computers run Windows. Long term, I am thinking the best speed improvement would be to do most of the compression in a single pass with a limited length look-ahead suffix tree that gets modified on the fly.
    I've been meaning to recommend this paper: http://citeseerx.ist.psu.edu/viewdoc...=10.1.1.3.3557

    They take a different approach, but they are after the same goal as you are. I wonder how close their approach is to yours, once superficial details are stripped away.

    They approach high order entropy in a more literal fashion by looking for nodes that stand out for their value as a context, while you are looking for nodes to collapse into new symbols. They might end up finding almost the same nodes, though, given equivalent cost functions.
    Last edited by nburns; 1st May 2014 at 03:32.

  24. #54
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    STs have received lots of study. I tried a quick search to see if the nodes/leaves ratio had been mentioned and given a name, but I didn't find anything. You ought to pin down exactly what it measures about the text itself, and also how exactly it acts as a proxy for compressibility in the score.
    Yes, but it seems like most of the research has been focused on how to build suffix trees and on string searches, and just a little on compression. I am interested in the shape of the tree (# of nodes, # of leaves, leaf depth, etc.) and its implications for entropy. I do need to spend some time pinning this down. I have been too distracted with trying to improve the code to make significant progress.

    Quote Originally Posted by nburns View Post
    I've been meaning to recommend this paper: http://citeseerx.ist.psu.edu/viewdoc...=10.1.1.3.3557

    They take a different approach, but they are after the same goal as you are. I wonder how close their approach is to yours, once superficial details are stripped away.

    They approach high order entropy in a more literal fashion by looking for nodes that stand out for their value as a context, while you are looking for nodes to collapse into new symbols. They might end up finding almost the same nodes, though, given equivalent cost functions.
    The link you posted is by far the closest thing I have seen to what I am thinking about and doing. Obviously their theory is much more developed. It is very interesting to me. I don't understand it all yet, so I will read it a few more times. I am curious what results they were able to achieve with their booster.

    The document you referenced contained a reference to another publication that led to this presentation: http://www.cs.ucf.edu/courses/cap593...ght_slides.pdf. I had all of this in my head as things I want to do, but had never seen it in one place. It will be a handy guide if and when I am ready to add another layer of complexity to the code. It will also be needed for faster versions.

  25. #55
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    150
    Thanks
    30
    Thanked 59 Times in 35 Posts
    Tree can't work for random text, small text, etc...

  26. #56
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, but it seems like most of the research has been focused on how to build suffix trees and on string searches, and just a little on compression. I am interested in the shape of the tree (# of nodes, # of leaves, leaf depth, etc.) and its implications for entropy. I do need to spend some time pinning this down. I have been too distracted with trying to improve the code to make significant progress.



    The link you posted is by far the closest thing I have seen to what I am thinking about and doing. Obviously their theory is much more developed. It is very interesting to me. I don't understand it all yet, so I will read it a few more times. I am curious what results they were able to achieve with their booster.

    The document you referenced contained a reference to another publication that led to this presentation: http://www.cs.ucf.edu/courses/cap593...ght_slides.pdf. I had all of this in my head as things I want to do, but had never seen it in one place. It will be a handy guide if and when I am ready to add another layer of complexity to the code. It will also be needed for faster versions.
    Are you looking at compressed suffix trees? I did a little experimental project https://code.google.com/p/minsort-transform/ where I used a small enhanced suffix array library; you might take a look. I like that little library. The only disadvantage is that it creates a dependency on C++. IIRC I needed only one fairly small piece to replicate it in C; I started coding and I think I got interrupted by more important things before I finished.
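
    In case it helps anyone following along: an enhanced suffix array is essentially a suffix array plus an LCP array, which together can answer most of the questions a suffix tree can. Below is only a toy C sketch (the qsort-based construction is far from linear time, and this is not the API of the library mentioned above):

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static const char *g_text;   /* shared with the qsort comparator */

    static int cmp_suffix(const void *a, const void *b)
    {
        return strcmp(g_text + *(const int *)a, g_text + *(const int *)b);
    }

    int main(void)
    {
        const char *text = "banana";
        int n = (int)strlen(text);
        int *sa = malloc(n * sizeof(int));
        int *rank = malloc(n * sizeof(int));
        int *lcp = malloc(n * sizeof(int));

        /* Naive suffix array: sort suffix start positions lexicographically. */
        for (int i = 0; i < n; i++) sa[i] = i;
        g_text = text;
        qsort(sa, n, sizeof(int), cmp_suffix);

        /* Kasai's algorithm: LCP of each suffix with its predecessor in SA order. */
        for (int i = 0; i < n; i++) rank[sa[i]] = i;
        int h = 0;
        for (int i = 0; i < n; i++) {
            if (rank[i] > 0) {
                int j = sa[rank[i] - 1];
                while (text[i + h] && text[j + h] && text[i + h] == text[j + h])
                    h++;
                lcp[rank[i]] = h;
                if (h > 0) h--;
            } else {
                lcp[rank[i]] = 0;
                h = 0;
            }
        }

        for (int i = 0; i < n; i++)
            printf("sa[%d]=%d lcp=%d suffix=%s\n", i, sa[i], lcp[i], text + sa[i]);
        free(sa); free(rank); free(lcp);
        return 0;
    }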

    There is a significant literature on suffix trees. The most up-to-date work tends to relate to applications in biology.

    Compression as such is not a hot area of research; it went cold many years ago (before my time, so I'm making inferences from the literature record). I'm going to guess that there was a golden age of compression research starting with Lempel-Ziv in the 1970s and ultimately fizzling some time in the 1980s.

    There are still a few groups around the world working on pure compression research, at least part time, and that paper is a fine example. I guess Matt might fall into that category to some extent, but I don't think it's his day job. But it's easy to see why gene research might take priority over file compression research. Gene research is more likely to save your life, right?
    Last edited by nburns; 1st May 2014 at 15:04.

  27. #57
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Actually it is my day job
    But you're right, there's not a lot of people working on pure data compression research. Probably there are more people working on specific applications like video compression and in related fields like machine learning and data mining.

    Also, about entropy. Normally it just means information content, which assumes you know the probability distribution. A lot of the time you don't, so we just use compressed size instead to get an upper bound. Entropy also has a meaning in thermodynamics which turns out to be the same thing. In classical thermodynamics we could only talk about the relative entropy difference between two states. But thanks to quantum mechanics it has an absolute value too.
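
    A tiny worked example of the "information content when you know the distribution" point, in C:

    Code:
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* A two-symbol source with a known probability distribution. */
        double p0 = 0.9, p1 = 0.1;
        double h = -(p0 * log2(p0) + p1 * log2(p1));   /* bits per symbol */
        printf("entropy = %.3f bits/symbol\n", h);      /* about 0.469 */
        printf("1000 symbols carry about %.0f bits of information\n", 1000.0 * h);
        /* When the distribution is not known, the size of the compressed file
           is used instead, as an upper bound on this quantity. */
        return 0;
    }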

  28. #58
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Matt Mahoney View Post
    Actually it is my day job
    But you're right, there's not a lot of people working on pure data compression research. Probably there are more people working on specific applications like video compression and in related fields like machine learning and data mining.
    I guess I have little idea what your job actually is. I'm assuming that it's not like a tenure-track position in academia, in which you have grad students around to multiply your efforts, and the resources of a university at your disposal (I know there are downsides to academia, also).

    Also, about entropy. Normally it just means information content, which assumes you know the probability distribution. A lot of the time you don't, so we just use compressed size instead to get an upper bound. Entropy also has a meaning in thermodynamics which turns out to be the same thing. In classical thermodynamics we could only talk about the relative entropy difference between two states. But thanks to quantum mechanics it has an absolute value too.
    Somehow, I'm not sure that will make it any clearer. Entropy has both a formal mathematical definition and some informal, colloquial uses that it's picked up. In my mind, the most useful thing to focus on is the formal definition. That is essential for reading papers and reasoning clearly. The informal uses you can pick up at leisure.

  29. #59
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by xezz View Post
    Tree can't work for random text, small text, etc...
    I found a problem with TreeBitEncode not working when the input file has not been changed by either TreeCompressLS or TreeCompress. If you replace the v0.3 TreeBitEncode with the attached version, it should work if this was the problem you encountered. Otherwise, I will need more details.
    Attached Files

  30. #60
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    Are you looking at compressed suffix trees? I did a little experimental project https://code.google.com/p/minsort-transform/ where I used a small enhanced suffix array library; you might take a look. I like that little library. The only disadvantage is that it creates a dependency on C++. IIRC I needed only one fairly small piece to replicate it in C; I started coding and I think I got interrupted by more important things before I finished.

    There is a significant literature on suffix trees. The most up-to-date work tends to relate to applications in biology.

    Compression as such is not a hot area of research; it went cold many years ago (before my time, so I'm making inferences from the literature record). I'm going to guess that there was a golden age of compression research starting with Lempel-Ziv in the 1970s and ultimately fizzling some time in the 1980s.

    There are still a few groups around the world working on pure compression research, at least part time, and that paper is a fine example. I guess Matt might fall into that category to some extent, but I don't think it's his day job. But it's easy to see why gene research might take priority over file compression research. Gene research is more likely to save your life, right?
    Yes. Do the gene research guys do anything with suffix trees besides looking for DNA string matches?

    I will fully admit I am 30 years late in getting around to implementing my idea. Not that I could have done it on a home PC back then....

    I am thinking through what to tackle next. I think using a compressed suffix tree will allow for faster execution with less memory. But there are a lot of other things I would also like to work on such as fixing my HTML pre-processor so it can be released (works on enwik8, but not enwik9), improving the existing code, exploring a context matching backend, and exploring a sliding window suffix tree compressor followed by a deduplicator (TreeCompress would work). I downloaded your code, took a quick look, and will take a closer look when I get serious about improving the tree structure.
