
Thread: Tree alpha v0.1 download

  1. #1
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree alpha v0.1 download

    Hello,

    First time poster. I am a hardware engineer with a minimal background in compression and limited programming skills, but I think I have created a unique compressor and would like to share it.

    Please find the attached "Tree" alpha v0.1 compression toolset. It has four programs: one that encodes capital letters, one that tokenizes a (capital encoded) file, one that bit encodes a tokenized file, and one that decodes a bit encoded file.

    The tokenizer iteratively creates tokens based on the best strings found in the suffix tree of the input. It is extremely slow (almost 4 days to compress on an A8-5500 processor) but creates a compressed enwik9 file of 187,978,600 bytes (187,985,256 bytes for file+decoder) that can be decompressed in 22 seconds on an A8-5500 processor. This establishes a new point on the Pareto Frontier for compressed file size vs. decompression speed. The code is optimized for enwik9 and contains some hacks but works. More details are available in the .pdf in the attached file.

    The test machine has a 3.2 GHz A8-5500 processor with 6 GB of RAM running Windows 8. The compressor uses about 2 GB of RAM. The decompressor uses about 400 MB of RAM.

    This was a home project done for fun. I am not sure if what I have done is useful, but I do think it is unique and interesting. Please let me know what you think.

    Best Regards,

    Kennon Conrad
    Attached Files
    Last edited by Kennon Conrad; 7th January 2016 at 10:36.

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts

  3. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Kennon Conrad (2nd April 2014)

  4. #3
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I'm impressed, very good job. I have been waiting for something like this with fast decompression.

    enwik8

    compress:
    treecapencode 10.864 sec.
    treecompress 5,219.757 sec.
    treebitencode 1.341 sec.
    -----------------------------------
    total: 5,231.962 sec.

    23,668,360 bytes

    decompress:
    treedecode 1.351 sec.

    compare ok.

    Compare:

    23,171,272 bytes, 5,231.958 sec. - 1.068 sec., treecapencode treecompress treebitencode8, treedecode8
    23,668,360 bytes, 5,231.962 sec. - 1.351 sec., treecapencode treecompress treebitencode, treedecode
    24,416,777 bytes, 100.768 sec. - 1.118 sec., lzturbo -49 -p1 -b102, lzturbo -d -p1
    24,778,037 bytes, 60.603 sec. - 5.760 sec., plzma64 c2, plzma64 d
    24,782,355 bytes, 235.847 sec. - 0.537 sec., lzhamtest_x64 -m4 -d29 -c -u -t0 -p -x -e c, lzhamtest_x64 -c -u -t0 d
    24,791,315 bytes, 71.798 sec. - 1.806 sec., lzip -9 -s512MiB, lzip -d
    24,831,648 bytes, 66.933 sec. - 1.285 sec., xz -9 -e , xz -d
    25,536,017 bytes, 38.775 sec. - 1.854 sec., packet a -h102 -b102 -mx, packet x
    25,768,105 bytes, 52.720 sec. - 0.545 sec., tor64 -16, tor64 -d
    25,768,724 bytes, 13.445 sec. - 2.959 sec., yzx a -m9 -b7 -h5, yzx x

    Compare ok.
    Last edited by Sportman; 6th April 2014 at 21:13.

  5. The Following User Says Thank You to Sportman For This Useful Post:

    dnd (6th April 2014)

  6. #4
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    The concept seems somewhere between BPE and RePair, but probably more like RePair: http://citeseerx.ist.psu.edu/viewdoc...=rep1&type=pdf

    There's a working RePair implementation on the web somewhere. It uses a greedy heuristic to choose which dictionary phrases to add, based solely on the frequency (after substituting all previously-defined phrases). The results are plainly less than optimal, since you wind up with, e.g. lots of phrases with spaces attached alternately at the beginning or end, which tends to make lots of mutually-incompatible phrases. The OP might have found a better heuristic.
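
    For reference, the greedy core of RePair is small. Here is a rough Python sketch (my own simplification for illustration, not Navarro's code; a real implementation keeps incremental pair counts instead of recounting every round, which is what makes it fast):

    Code:
    from collections import Counter

    def repair(seq, min_freq=2):
        """Greedy RePair-style pairing: repeatedly replace the most frequent
        adjacent pair with a fresh symbol. Pair counts here include overlaps,
        which a careful implementation has to handle."""
        seq = list(seq)
        rules = {}                                # new symbol -> (left, right)
        rule_id = 0
        while True:
            pairs = Counter(zip(seq, seq[1:]))
            if not pairs:
                break
            (a, b), freq = pairs.most_common(1)[0]
            if freq < min_freq:
                break
            new = ("R", rule_id)
            rule_id += 1
            rules[new] = (a, b)
            out, i = [], 0
            while i < len(seq):                   # non-overlapping, left-to-right replacement
                if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                    out.append(new)
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            seq = out
        return seq, rules

    seq, rules = repair("abracadabra abracadabra")

    The frequency-only scoring is exactly the heuristic described above; a smarter cost function is where an approach like the OP's could differ.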

    You could probably use the same data structures for finding phrases as RePair does. 3 days to encode is a little bit nuts, and says that there's probably a lot of improvement possible using the right data structures and algorithms.

    I don't know what size RePair would get for enwik9, and it's been long enough since I played with it that I couldn't even make a decent guess. I played with Navarro's implementation on the web, which also happens to not be a finished compressor: it builds the dictionary and replaces the text, but it outputs both as 32-bit ints without further encoding. The Larsson and Moffat paper advocates encoding the dictionary separately, but I took what seemed like the logical next step and re-interpolated the dictionary back into the text at the first appearance of each phrase, reducing the dictionary information down to just the locations of the first appearances, which becomes a bit sequence of n+m bits (n is the length of the replaced text and m is the number of phrases in the dictionary). From there, you still have to encode the ints to fewer than 32 bits to get a competitive compressor. The bit sequence is slightly compressible, but its contribution is already small. That's where I left that project.

    There's clearly potential in offline dictionary methods, and there seems to be plenty of room for improvement in compression efficiency. Like sliding-window LZ, they use restricted dictionaries, but, since the dictionary is specified explicitly, the encoder has the opportunity to search for the optimal dictionary, taking into account the entire file, while the decoder can stay very simple and fast.
    Last edited by nburns; 2nd April 2014 at 06:40.

  7. #5
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Thanks, Sportman. I have been really curious how better computers compare. For enwik8, I get the same file size but TreeCompress takes 11,681 seconds and TreeDecode takes 2.8 seconds. You are 2x+ faster. The memory warning is insignificant for enwik8 since the suffix tree sections are much smaller.

    I have TreeBitEncode8 and TreeDecode8 available if wanted. These are optimized for enwik8 instead of enwik9. On my A8-5500, the compressed file size is 23,161,488 bytes (499 KB less) and the decompression time is 2.4 seconds (0.4 seconds less). That's more speed improvement than I expected, but I imagine it is because the prefix code array is 1/4 as big. An optimized version of the tokenizer has not been created.

    My early efforts included some "research" into pairing, but I wasn't particularly impressed by the results. Over time, I came to adopt the general goal for token creation of maximizing the decrease in the "Shannon file size" (Shannon entropy in bytes multiplied by the number of symbols) while minimizing the increase in "Estimated entropy" ("Shannon file size" multiplied by the number of symbols divided by [the number of symbols plus the number of internal nodes in the suffix tree]). The algorithm for choosing tokens is not optimum, but it attempts to do this by choosing strings that are generally longer than "average" and more common than expected. All the long matches are quickly chosen, and then the matches generally get shorter and shorter as the process proceeds.

    The largest cause of the long run time, besides the fact that it just takes a long time to build the full suffix tree over and over, is that the algorithm struggles with the space character, especially when followed by the special capital symbol. This causes a lot of overlapping token candidates and causes the algorithm to throw away up to about 95% of the tokens it might create. I think my next step is to see how much creating a word dictionary first helps compression time and hurts compression size. After that, generalization and exploration of context matching is being considered.
    Last edited by Kennon Conrad; 2nd April 2014 at 11:56.

  8. #6
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    My early efforts included some "research" into pairing, but I wasn't particularly impressed by the results. Over time, I came to adopt the general goal for token creation of maximizing the decrease in the "Shannon file size" (Shannon entropy in bytes multiplied by the number of symbols) while minimizing the increase in "Estimated entropy" ("Shannon file size" multiplied by the number of symbols divided by [the number of symbols plus the number of internal nodes in the suffix tree]). The algorithm for choosing tokens is not optimum, but it attempts to do this by choosing strings that are generally longer than "average" and more common than expected. All the long matches are quickly chosen, and then the matches generally get shorter and shorter as the process proceeds.
    You can construct any dictionary out of pairs, so I don't think you lose any generality by speaking in terms of pairs. It would be interesting to see if you found a cost function that does better than RePair's simple one. I'm not sure I follow your description.

  9. #7
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    You are 2x+ faster
    I have TreeBitEncode8 and TreeDecode8 available if wanted.
    Hardware used:
    Intel Core i7 4960X 3.6GHz OC at 4.5GHz - 6 core (12 threads) 22nm Ivy Bridge-E
    Kingston 8 x 4GB (32GB) DDR3 2400MHz 11-14-14 under clocked at 2000MHz 10-11-11.

    Software used:
    Windows 8.1 Pro 64-bit
    SoftPerfect RAM Disk 3.4.5 64-bit

    I would like to test the Tree enwik8 optimized version.

  10. #8
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    I would like to test the Tree enwik8 optimized version.
    Here you go. Nice computer, BTW.
    Attached Files

  11. #9
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Yes, dictionaries can be constructed from pairs but I ran into two things I didn't like with my implementation, although these could have been problems with the implementation. The first was that there were good matchable strings that did not get matched because tokenizing any of the pairs within the string caused an increase in the "Shannon file size". The second was that tokenizing pairs sometimes makes individual words less compressible. As a hypothetical example, a pair tokenizer might decide "n" and "t" make a good pair. But the word "maintain" might be better spelled with "main" and "tain" than with "mai", "nt" and "ain". A tokenizer that creates a token for "main" or "tain" prior to creation of the "nt" token will avoid this problem. My example may not be the best, but hopefully it clarifies the idea. While the "nt" token reduces the file size quite a bit, it "damages" the ratio of leaves to (internal nodes + leaves) (i.e. the "compressibility") of the suffix tree quite a bit more than adding a token for "main".
    Last edited by Kennon Conrad; 2nd April 2014 at 19:49.

  12. #10
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Yes, dictionaries can be constructed from pairs but I ran into two things I didn't like with my implementation, although these could have been problems with the implementation. The first was that there were good matchable strings that did not get matched because tokenizing any of the pairs within the string caused an increase in the "Shannon file size". The second was that tokenizing pairs sometimes makes individual words less compressible. As a hypothetical example, a pair tokenizer might decide "n" and "t" make a good pair. But the word "maintain" might be better spelled with "main" and "tain" than with "mai", "nt" and "ain". A tokenizer that creates a token for "main" or "tain" prior to creation of the "nt" token will avoid this problem. My example may not be the best, but hopefully it clarifies the idea. While the "nt" token reduces the file size quite a bit, it "damages" the ratio of leaves to (internal nodes + leaves) (i.e. the "compressibility") of the suffix tree quite a bit more than adding a token for "main".
    I'm still not sure I understand your formulas for the various entropy measures, but I can tell that you were thinking about this the right way. As I said, RePair's heuristic is very simple, too simple to deal with anything but pairs. Pairs are atomic, and so they're the natural place to start. But, for the kinds of reasons you've described, it might be necessary to, in effect, plan more than one pairing step ahead in order to avoid making locally-attractive but globally-poor decisions.

    The downside of looking multiple steps ahead is that you can get very slow algorithms. If the encoder is slow because it's searching a large space of possible solutions, then that's a bit harder and not just a data structure problem.

  13. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Here you go.
    treebitencode8 1.341 sec.

    23,171,272 bytes

    treedecode8 1.068 sec.

    Compare ok.

  14. The Following User Says Thank You to Sportman For This Useful Post:

    Kennon Conrad (3rd April 2014)

  15. #12
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    treebitencode8 1.341 sec.

    23,171,272 bytes
    I repeated my run and got 23,158,692 bytes this time. Apparently the compressed file size varies slightly with the memory usage of TreeCompress. I didn't think that would happen with enwik8, but have thought of a possible reason. Something to put on the list for the next version.....

  16. #13
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I'm still not sure I understand your formulas for the various entropy measures, but I can tell that you were thinking about this the right way. As I said, RePair's heuristic is very simple, too simple to deal with anything but pairs. Pairs are atomic, and so they're the natural place to start. But, for the kinds of reasons you've described, it might be necessary to, in effect, plan more than one pairing step ahead in order to avoid making locally-attractive but globally-poor decisions.

    The downside of looking multiple steps ahead is that you can get very slow algorithms. If the encoder is slow because it's searching a large space of possible solutions, then that's a bit harder and not just a data structure problem.
    Much of the time I am not sure I understand my formulas either. The actual formula in the code is a hack in many ways, only a gross approximation of what I am aiming for, but it is close enough that it works reasonably well.

    The formula I am aiming at is made up by me, so that makes it hard to understand. What I call the "Shannon file size" is really just the file size an encoder would get by directly coding the symbols in the file with non-integer length Huffman codes. I imagine there is a simple way to say this and I just don't know how.

    I have found through experimentation that if a file has X symbols, a Shannon entropy of Y bytes per symbol (Shannon file size of X*Y) and an uncompressed prefix tree with Z internal nodes + leaves, then the value X*Y*X/Z provides a reasonable estimate of the order 0 entropy - 16.869 MB for enwik8 for example. The goal for the token prioritization is to maximize the reduction in X*Y while minimizing the increase in X*Y*X/Z. Through experimentation, I have found this is best accomplished by matching long strings at good boundaries, with the meaning of long changing (reducing) as the Shannon entropy increases. The full numerical analysis would be worthwhile, I just haven't had the time. Basically I think it reduces to making the change in X divided by the change in Z as small as possible, which is done by finding portions of the suffix tree that represent relatively long strings with relatively few siblings. The two values almost converge as X/Z goes from almost 4 to about 1.01. The smaller the value they converge at, the better (duh!).
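
    In code form, the two quantities look roughly like this (just a sketch of the formulas, not the actual Tree code; Z is passed in here, while the real program gets it from the suffix tree):

    Code:
    import math
    from collections import Counter

    def shannon_file_size(tokens):
        """X*Y: the order 0 cost in bytes of coding the tokens with ideal
        (non-integer length) codes based on their frequencies."""
        x = len(tokens)
        counts = Counter(tokens)
        bits = sum(-c * math.log2(c / x) for c in counts.values())
        return bits / 8

    def estimated_entropy(tokens, z):
        """X*Y*X/Z, where z is the number of internal nodes plus leaves
        in the suffix tree of the token sequence (supplied by the caller)."""
        x = len(tokens)
        return shannon_file_size(tokens) * x / z

    Token selection then looks for strings whose substitution gives a large drop in shannon_file_size() for a small rise in estimated_entropy().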

    The algorithm does not have optimal prioritization, does not create optimal boundaries, etc., etc., etc. I agree with everything you have said about the compression speed. The goal for the initial release was to create a file that was as small as possible with a decompression speed on the Pareto Frontier on a relatively slow computer. That is done and I have a baseline to test speed-enhanced versions against. When I started, the algorithm would have taken about 1000 years to run. While the time is still laughable in many ways, it is at least on the boundary of credibility.

    I think I can improve the speed a lot without hurting compression too much and others are welcome to also try. I don't have time to try all of the worthy ideas, at least any time in the near future. I think the biggest problem speed-wise right now is that the algorithm doesn't use enough of the information it gathers - it needs a smarter brain. I suspect a couple of weeks of effort in this area could improve the compression time to about 1/3 of what it is now. Alternatively, based on some thoughts Dr. Mahoney provided, I am thinking about creating a word dictionary first that could either be a) further tokenized or b) context matched. It seems like it might yield a massive compression speed improvement for the tokenization method even with the current "brain".
    Last edited by Kennon Conrad; 3rd April 2014 at 10:54.

  17. #14
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Much of the time I am not sure I understand my formulas either. The actual formula in the code is a hack in many ways, only a gross approximation of what I am aiming for, but it is close enough that it works reasonably well.

    The formula I am aiming at is made up by me, so that makes it hard to understand. What I call the "Shannon file size" is really just the file size an encoder would get by directly coding the symbols in the file with non-integer length Huffman codes. I imagine there is a simple way to say this and I just don't know how.
    I suggest reading some Wikipedia articles on information theory to brush up on terminology.

    I have found through experimentation that if a file has X symbols, a Shannon entropy of Y bytes per symbol (Shannon file size of X*Y) and an uncompressed prefix tree with Z internal nodes + leaves, then the value X*Y*X/Z provides a reasonable estimate of the order 0 entropy - 16.869 MB for enwik8 for example. The goal for the token prioritization is to maximize the reduction in X*Y while minimizing the increase in X*Y*X/Z. Through experimentation, I have found this is best accomplished by matching long strings at good boundaries, with the meaning of long changing (reducing) as the Shannon entropy increases. The full numerical analysis would be worthwhile, I just haven't had the time. Basically I think it reduces to making the change in X divided by the change in Z as small as possible, which is done by finding portions of the suffix tree that represent relatively long strings with relatively few siblings. The two values almost converge as X/Z goes from almost 4 to about 1.01. The smaller the value they converge at, the better (duh!).

    The algorithm does not have optimal prioritization, does not create optimal boundaries, etc., etc., etc. I agree with everything you have said about the compression speed. The goal for the initial release was to create a file that was as small as possible with a decompression speed on the Pareto Frontier on a relatively slow computer. That is done and I have a baseline to test speed-enhanced versions against. When I started, the algorithm would have taken about 1000 years to run. While the time is still laughable in many ways, it is at least on the boundary of credibility.

    I think I can improve the speed a lot without hurting compression too much and others are welcome to also try. I don't have time to try all of the worthy ideas, at least any time in the near future. I think the biggest problem speed-wise right now is that the algorithm doesn't use enough of the information it gathers - it needs a smarter brain. I suspect a couple of weeks of effort in this area could improve the compression time to about 1/3 of what it is now. Alternatively, based on some thoughts Dr. Mahoney provided, I am thinking about creating a word dictionary first that could either be a) further tokenized or b) context matched. It seems like it might yield a massive compression speed improvement for the tokenization method even with the current "brain".
    The important question is whether it's just a code optimization problem or if it's an inherently slow algorithm. If it's inherently slow, all isn't necessarily lost, but it ought to be doing something important that can't be done any other way.

  18. #15
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    enwik9

    compress:
    treecapencode 108.796 sec.
    treecompress 174,589.974 sec.
    treebitencode 14.089 sec.
    ------------------------------
    total 174,712.859 sec.

    187,933,388 bytes

    decompress:
    treedecode 11.70 sec.

    compare ok.

    Compare:

    187,933,388 bytes, 174,712.859 sec. - 11.700 sec., treecapencode treecompress treebitencode, treedecode
    194,681,713 bytes, 1,488.774 sec. - 9.537 sec., lzturbo -49 -p1 -b1044, lzturbo -d -p1
    200,607,756 bytes, 3,470.196 sec. - 4.861 sec., lzhamtest_x64 -m4 -d29 -c -u -t0 -p -x -e c, lzhamtest_x64 -c -u -t0 d
    207,164,371 bytes, 849.854 sec. - 15.508 sec., lzip -9 -s128MiB, lzip -d
    211,776,220 bytes, 709.306 sec. - 11.217 sec., xz -9 -e, xz -d
    213,154,428 bytes, 616.610 sec. - 50.278 sec., plzma64 c2, plzma64 d
    215,584,693 bytes, 155.573 sec. - 26.168 sec., yzx a -m9 -b7 -h5, yzx x
    217,749,028 bytes, 652.461 sec. - 4.880 sec., tor64 -16, tor64 -d
    218,460,141 bytes, 377.879 sec. - 16.418 sec., packet a -h1044 -b1044 -mx, packet x

    Compare ok.
    Last edited by Sportman; 5th April 2014 at 23:37.

  19. The Following 2 Users Say Thank You to Sportman For This Useful Post:

    dnd (6th April 2014),Kennon Conrad (5th April 2014)

  20. #16
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by nburns View Post
    I suggest reading some Wikipedia articles on information theory to brush up on terminology.



    The important question is whether it's just a code optimization problem or if it's an inherently slow algorithm. If it's inherently slow, all isn't necessarily lost, but it ought to be doing something important that can't be done any other way.
    I am working on it, but there is a lot for an old dog to learn. I think what I have been calling the "Shannon file size" is just the order 0 entropy.

    There are many ways the speed of the current algorithm can be significantly improved, but I don't expect single block prefix tree tokenizers will ever be fast. TreeCompress would have taken 1000 years to run when I started, took 2 weeks on the previous (unreleased) version and 4 days now, so while I am disappointed by the current compression time, I am pleased with the progress. If I bought a faster PC (like Sportman's) and multi-threaded the main loop, I am almost certain compression time could be reduced to about 20% of the published time. If I had enough processors to run 100 threads, I might be able to get that down to just a few percent of the current time, all without hurting the compressed file size.

    While this is nice, it's not good enough. The algorithm should be improved. I view Tree alpha v0.1 as being similar to a newborn child. There is just enough there for it to function, but it is about as dumb as possible. It is a brute-force approach that establishes a baseline for development and comparison. But first, I need to step back, take a breath, and think big picture.

    Dr. Mahoney has been consistently guiding me toward (at least) using a separate model for spaces. It took me a while to understand the concept. At first I thought he was telling me this mainly as a technique for improving size. But when my time was published, he was sure to remind me again, which finally made me realize there could be huge speed benefits also. He has also been advocating the addition of a context model, which makes perfect sense and is something I have been thinking about from the start, but am not ready to implement. Right now, I am thinking the best next step is to add a prefix tree word tokenizer in front of TreeCompress. I think this will help a lot on the speed and not hurt much on the size. When that is done, I could tweak TreeCompress to not tokenize nearly as thoroughly (essentially make it a long match finder) and then put the output into a low order context model with implied spaces. I have no experience with context modeling, so if there is something out there that is fast and someone could point me in the right direction to get started, that would be helpful.
    Last edited by Kennon Conrad; 5th April 2014 at 05:49.

  21. #17
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    enwik9

    compress:
    treecapencode 108.796 sec.
    treecompress 174,589.974 sec.
    treebitencode 14.089 sec.
    ------------------------------
    total 174,712.859 sec.

    187,933,388 bytes

    decompress:
    treedecode 11.70 sec.

    compare ok.
    If you ever come to Seattle, the beer is on me.

    Could you tell me how much RAM usage TreeCompress prints at the start with either enwik8 or enwik9? This inconsistent compressed file size thing is not good, so I'd like to tune back the maximum memory usage in the next version so this is less likely. On the bright side, you did better than me not just on speed, but also on size.

    It looks like I really need to clean up TreeCapEncode. It shouldn't take almost 10x longer than TreeDecode.
    Last edited by Kennon Conrad; 5th April 2014 at 06:04.

  22. #18
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    If you ever come to Seattle, all the beer you want is on me.

    Could you tell me how much RAM usage TreeCompress prints at the start with either enwik8 or enwik9?
    Thanks, but I don't drink alcohol.

    Sadly, those values have scrolled out of my command line buffer; I only have these files left and some data:

    1,036,817,655 bytes, size after treecapencode
    367,356,817 bytes, size after treecompress

    Last Iteration:
    Iteration 842: 17.7148 bits per token, Shannon file size 208614318 bytes
    Run time 174859 seconds. 6d9845.6d9845

    treebitencode:
    Read 94210190 tokens including 7014470 token definition symbols
    Tokenize done
    Embed tokens done
    Compressed file of size 187933388 bytes uses a 7020759 'letter' alphabet

    Does that mean an average of 26.768 bits (3.346 bytes) is used for each 'letter'?

    After restart it show:

    treecompress enwik8:
    Allocated 1946157056 bytes for data processing
    Read 103340635 byte input file
    Found 102962467 tokens, 0 defines, maximum unicode value 0x28b46

    treecompress enwik9
    Allocated 1946157056 bytes for data processing
    Read 1036817655 byte input file
    Found 1034338546 tokens, 0 defines, maximum unicode value 0x28b4e
    Last edited by Sportman; 5th April 2014 at 06:10.

  23. #19
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Quote Originally Posted by Sportman View Post
    Thanks, but I don't drink alcohol.

    Sadly, those values have scrolled out of my command line buffer; I only have these files left and some data:

    1,036,817,655 bytes, size after treecapencode
    367,356,817 bytes, size after treecompress

    Last Iteration:
    Iteration 842: 17.7148 bits per token, Shannon file size 208614318 bytes
    Run time 174859 seconds. 6d9845.6d9845

    treebitencode:
    Read 94210190 tokens including 7014470 token definition symbols
    Tokenize done
    Embed tokens done
    Compressed file of size 187933388 bytes uses a 7020759 'letter' alphabet

    Does that mean an average of 26.768 bits (3.346 bytes) is used for each 'letter'?

    After restart it show:

    treecompress enwik8:
    Allocated 1946157056 bytes for data processing
    Read 103340635 byte input file
    Found 102962467 tokens, 0 defines, maximum unicode value 0x28b46

    treecompress enwik9
    Allocated 1946157056 bytes for data processing
    Read 1036817655 byte input file
    Found 1034338546 tokens, 0 defines, maximum unicode value 0x28b4e
    Actually, not much of a drinker myself either. Thanks for the info on the allocation sizes. The restarts would have run with the full amount requested, but the results could have been different on your earlier runs, especially if you had other applications open. No biggie, I will just make a modest reduction.

    The "letters" value is the size of the alphabet, slightly over 7 million symbols. 26.768 bytes is the average amount of space used to write all instances of each alphabet symbol, including the definition of the symbol. TreeBitEncode removes a few alphabet symbols that occur only one time (an artifact of the method) unless they represent members of the uncompressed (original) alphabet, and it also makes the alphabet dynamic, so the size of the alphabet at any point in the final compressed file is much smaller, I think less than half of that.

    The average number of bytes per symbol is 187933388/94210190 = 1.995 bytes, including the overhead of defining the dictionary. The "token definitions" is the number of "letters" added to the original alphabet, which includes all UTF-8 symbols.
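
    Spelled out with the numbers from your log:

    Code:
    compressed_bytes = 187_933_388
    tokens_written   = 94_210_190       # includes the 7,014,470 definition symbols
    alphabet_size    = 7_020_759        # distinct 'letters'

    print(compressed_bytes / tokens_written)   # ~1.995 bytes per symbol written
    print(compressed_bytes / alphabet_size)    # ~26.77 bytes per alphabet 'letter',
                                               # counting all its occurrences plus its definition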


    It took less than an hour to create TreeDictionary from TreeCompress and add the character blocks needed to create the dictionary first. The first iteration went from about 500 cycles to 4 cycles, about 55 minutes to 4 minutes, and compresses about 3/4 as much. It will be interesting to see how much faster TreeCompress runs on the output when it is ready.
    Last edited by Kennon Conrad; 5th April 2014 at 07:32.

  24. #20
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    I tested enwik8 again, this time without memory reduction; speed was faster and the size was 4 bytes smaller. I updated the stats above.

    103,340,635 bytes, size after treecapencode
    50,669,667 bytes, size after treecompress

    Last Iteration:
    Iteration 211: 15.5584 bits per token, Shannon file size 25755381 bytes
    Run time 5219 seconds.2e95.122e95

    treebitencode:
    Read 13243211 tokens including 1023638 token definition symbols
    Tokenize done
    Embed tokens done
    Compressed file of size 23668360 bytes uses a 1028860 'letter' alphabet

  25. #21
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    I spent part of the weekend investigating the potential impacts of adding a second layer to the compression. I found that doing the compression in two stages caused an increase in the compressed file size by about 5%. The rate of new symbol creation was much better for the first stage, but much worse for the second stage. The program struggles with high frequency characters and this method makes spaces have a higher frequency in the second stage, so compression slows down. However, I think I learned enough to have some thoughts about the direction of Tree alpha v0.2. Here's what I am thinking.

    v0.2 will (tentatively) include two additional programs. The first will be "TreeDictionary", a prefix based string builder similar to TreeCompress, except it will have a block for the space character (0x20) to prevent the creation of any tokens that represent strings that include spaces. The second will be "TreeTransformSpaces". This program will parse the data, remove spaces that follow any symbol that is not a space, and add a unique "back space" symbol between all pairs of consecutive symbols that are not spaces and not in the dictionary.
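
    To illustrate the idea, here is a toy sketch (word-level tokens only, and the dictionary test is left out, so this is not the planned TreeTransformSpaces logic, just the implied-space concept):

    Code:
    NOSPACE = "\x08"   # stand-in for the unique "back space" symbol

    def transform(tokens):
        """Drop the space that follows a non-space token (it becomes implied)
        and mark the exceptions: NOSPACE goes after a non-space token that is
        not followed by a space, including at the end of the stream."""
        out = []
        for i, tok in enumerate(tokens):
            if tok == " ":
                if i == 0 or tokens[i - 1] == " ":
                    out.append(" ")            # not absorbable, keep it literally
            else:
                out.append(tok)
                nxt = tokens[i + 1] if i + 1 < len(tokens) else None
                if nxt != " ":
                    out.append(NOSPACE)        # cancel the implied trailing space
        return out

    def untransform(tokens):
        """Inverse: every non-space token gets an implied trailing space
        unless it is followed by NOSPACE."""
        out, i = [], 0
        while i < len(tokens):
            tok = tokens[i]
            out.append(tok)
            i += 1
            if tok != " ":
                if i < len(tokens) and tokens[i] == NOSPACE:
                    i += 1                     # implied space cancelled
                else:
                    out.append(" ")            # restore the implied space
        return out

    src = ["In", " ", "1941", ",", " ", "the", " ", "city"]
    assert untransform(transform(src)) == src

    In the transformed stream most spaces disappear and a smaller number of back space symbols appear, which matches the shift in the preliminary numbers below.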

    The process to compress for v0.2 would include all of the current processes with the addition of TreeDictionary and TreeTransformSpaces just prior to TreeCompress.

    Modifications to TreeBitEncode and TreeDecode would be needed to support the identification of which dictionary each symbol comes from so that a reverse transformation of the spaces can be run.

    Preliminary numbers for the space transformation of the output of TreeDictionary for enwik9, which has a dictionary of about 3 million strings, are as follows:

    input:
    303,330,347 symbols
    293,622,767 symbols excluding the dictionary (spaces only occur here)
    9,707,580 symbols in the dictionary (including 2,983,114 definition symbols)
    139,132,610 space symbols
    order 0 entropy of 302,403,788 bytes

    output:
    212,732,999 symbols
    203,025,419 symbols excluding the dictionary (spaces and backspaces only occur here)
    16,588,858 space symbols
    31,946,404 back space symbols
    order 0 entropy 290,898,090 bytes

    (Note: the order 0 entropy calculation treats all definition symbols as the same symbol because they are in numerical order.)

    It seems like this should get the compressed file size back to around where it is for v0.1 and help compression speed because it creates a more diverse set of symbols (not dominated by the space character).

    Another possible change to improve compression speed is to replace the sibling pointer in the prefix tree data structure with even and odd sibling pointers, like the suffix tree structure has.

    When all else is done, I think modification of TreeBitEncode and TreeDecode to use different prefix code sets for each of the two dictionaries would improve the compressed file size with minimal impact on decompression time.

    I think in the long run, this whole direction would be better for context matching the output of TreeCompress, but this is just a gut feeling at this point.

    This is all new to me. Are there obvious things I am missing? Does this sound like a reasonable direction to go? Any other ideas?

    Should I move this conversation to the main forum?
    Last edited by Kennon Conrad; 7th April 2014 at 19:58.

  26. #22
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    I think it should go in the main forum.

  27. #23
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Will do, after I flounder around a little more on my own with this space modeling thing.

    In the meantime, it dawned on me that the encoder and decoder can easily be modified to improve compression of enwik9 by 2.816 MB. Since this will move Tree from #50 to #45 on the LTCB site and advance the Pareto Frontier for decompression speed vs. compressed file size, I am making the changes and will release them as v0.2 later this month.

    I am thinking about trading off a little bit of compressed file size to improve decompression speed and RAM usage. For a compressed file of enwik9 that is 186 MB, decompresses in 12 seconds and uses 400 MB of RAM while decompressing, does anyone have an opinion they would like to share on whether it would be better to have a compressed file that is 0.5 MB smaller or a compressed file that can be decompressed in 10% less time with 20% less RAM usage?

    Also, should the new version go in a new thread or be added to this one?
    Last edited by Kennon Conrad; 14th April 2014 at 19:27.

  28. #24
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    This should probably be in the main data compression thread.

    There are a few LZ77 compressors on the Pareto frontier for decompression speed that take a ridiculously long time to compress. Some of them got there by running on very fast machines. This might be useful for files compressed once and downloaded a million times. But such compressors are unlikely to become popular. Often we have the opposite scenario where we want to create backups: compress often, decompress rarely.

    On enwik9, the easiest way to improve the compression ratio is to add memory. Of course I encourage this because the goal of LTCB is AI research. I didn't know when I started the benchmark that memory size would play such an important role.

    If your goal is practical compression, I suggest the 10GB benchmark. Here the big winners are deduplication, multi-threading, and detection of already compressed data.

  29. #25
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    Thanks. My goal was fluid, changing from brushing up on my C, to dreaming of winning the Hutter Prize, and then settling on creating a compressed enwik9 file and decompressor that was on the Pareto Frontier. Additional goals were having fun and exercising my brain. Practicality and popularity were never a concern.

    If I wanted a practical suffix tree compressor, I think I would go with as few as two passes over the data, with the first pass building and substituting dictionary symbols using a look-ahead suffix tree and the second pass finding and tokenizing distant matches. Compression would not be as good, but I see no reason something like this couldn't be at least 100x faster than what has been released. It's just a bigger effort than I can take on right now. Maybe someday, after retirement. I have the (perhaps ignorant) impression that most compressors make their decisions based on prior data rather than upcoming data, and that seems backwards to me.

  30. #26
    Member
    Join Date
    Feb 2013
    Location
    San Diego
    Posts
    1,057
    Thanks
    54
    Thanked 71 Times in 55 Posts
    Quote Originally Posted by Kennon Conrad View Post
    I have the (perhaps ignorant) impression that most compressors make their decisions based on prior data rather than upcoming data, and that seems backwards to me.
    The issue is online versus offline. Online algorithms can't see into the future; offline algorithms can. Since you seem to have written an offline dictionary compressor, I recommend studying the paper I linked to. You don't have to solve the whole problem yourself. Research is about moving forward in increments.

  31. #27
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 778 Times in 485 Posts
    Compression is all about prediction (assigning a probability distribution to the next symbol) and coding (in log 1/p bits). The decompresser can't peek at the next symbol, so it has to make an identical prediction to decode.
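
    A quick illustration of the log 1/p cost:

    Code:
    import math

    def code_bits(p):
        """Ideal code length, in bits, for a symbol the model predicts with probability p."""
        return math.log2(1 / p)

    print(code_bits(0.5))    # 1.0 bit for a 50/50 symbol
    print(code_bits(0.125))  # 3.0 bits for a 1-in-8 symbol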

    Of course you can transmit hints to the decompresser separately (if you don't like the backward approach) but the hints eat up any coding space you would have saved. You can't get around Kolmogorov complexity.

  32. #28
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts
    It's amazing how over time things seem to go from impossible to impractical to mainstream. Most of what is being researched and used today was impossible when I went to school.

    I read the paper at the link earlier and rechecked it now. While the scheme they implemented is better than what I tried, matching pairs just wasn't cutting it for me. BPE, RePair and Tree all essentially compress the suffix tree of the input string. The difference is in the order in which the tree is compressed. I am convinced that a static dictionary scheme cannot reduce the order 0 entropy by more than a factor equal to the average depth of the suffix tree. Assuming this is true, I figured it was better to start with strings that are longer than average and break them down into substrings rather than relying on multiple matches of pairs to get the order 0 entropy of the result as close as possible to the initial order 0 entropy divided by the initial average depth of the suffix tree. Tree ends up creating mostly pairs, but also ends up with triplets, quads, etc. For enwik9, my reported run yielded 5.46 million pairs, 1.09 million triplets, 247K quads, 93K "fives", 42K "sixes", etc., and the longest symbol represents 701 symbols. It's easy to miss the good matches that are longer than pairs with a pairing scheme.

    Has it been proven that biasing compressors to more easily predict the next symbol eats up at least as much code space as it saves? It seems to me (again, possibly ignorantly) that this could save space if done efficiently. If anything, I thought Tree indicated the opposite was true. Either way, I like the decompression speed advantages. My instincts tell me that the optimum algorithm would use both past and future data. I could be wrong, it wouldn't be the first time....
    Last edited by Kennon Conrad; 17th April 2014 at 10:01.

  33. #29
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    685
    Thanks
    153
    Thanked 177 Times in 105 Posts

    Tree v0.2

    The encoder and decoder now use codes to indicate the number of symbols in each symbol definition instead of using a symbol to mark the end of each symbol definition. Only TreeBitEncode and TreeDecode are different from the versions in Tree v0.1.
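
    Schematically, the framing change looks like this (not the actual bit-level codes, just the idea):

    Code:
    END = ("END",)                       # stand-in for the old end-of-definition symbol

    def frame_with_terminator(definition):
        """v0.1 style: the definition's symbols followed by a terminator symbol."""
        return list(definition) + [END]

    def frame_with_length(definition):
        """v0.2 style: a length code first, then the symbols; no terminator symbol
        is needed, so it no longer takes up code space."""
        return [len(definition)] + list(definition)

    print(frame_with_terminator(["ma", "in"]))   # ['ma', 'in', ('END',)]
    print(frame_with_length(["ma", "in"]))       # [2, 'ma', 'in']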

    Tree v0.1 on A8-5500:
    enwik9: 187,978,600 bytes
    enwik8: 23,668,360 bytes
    TreeDecode.zip: 6,656 bytes

    Tree v0.2 on A8-5500:
    enwik9: 185,311,980 bytes
    enwik8: 23,250,856 bytes
    TreeDecode.zip: 6,563 bytes

    Compression time, decompression time and RAM usage are not significantly changed.
    Attached Files

  34. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Jyrki Alakuijala (23rd August 2017),Nania Francesco (13th August 2016)

  35. #30
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    776
    Thanks
    63
    Thanked 270 Times in 190 Posts
    Quote Originally Posted by Kennon Conrad View Post
    Tree v0.2
    file v0.1 - v0.2, treebitencode - treedecode:

    enwik5 43,236 - 41,952, 0.012 sec. - 0.065 sec.
    enwik6 347,836 - 337,484, 0.028 sec. - 0.078 sec.
    enwik7 2,780,800 - 2,720,380, 0.142 sec. - 0.203 sec.
    enwik8 23,668,360 - 23,250,856, 1.431 sec. - 1.444 sec.
    enwik9 187,933,388 - 185,267,772, 14.526 sec. - 12.470 sec.

    Nice improvement!
    Last edited by Sportman; 17th April 2014 at 11:05.

  36. The Following 2 Users Say Thank You to Sportman For This Useful Post:

    Jyrki Alakuijala (23rd August 2017),Kennon Conrad (18th April 2014)


