
Thread: Tree alpha v0.1 download

1. #481 Paul W. (Member, Oct 2013, Filling a much-needed gap in the literature, 350 posts)
    Quote Originally Posted by Kennon Conrad View Post

    Some statistics from Tree v0.15's (TreeCompress) compressed enwik9 file:
    mtf (2 - 20 instances): 4,197,656 hits, 73,112,243 bits saved
    mtf1 (code lengths 11 - 22): 3,265,078 hits, 26,784,319 bits saved

    86,387,490 symbols coded from code space
    78,924,756 symbols coded from code space except mtf/mtf1
    mtf consumes 216/4096 of code space
    mtf1 consumes 96/4096 of code space
    approximate penalty for non mtf/mtf1 symbols: 9,021,386 bits

    So altogether, it appears that mtf is saving 90,875,176 bits or 11,359,397 bytes.
    Those are some interesting numbers, if I'm understanding them correctly.

    I take it that they mean that:

    1) It's making your compressed enwik9 more than 11 megabytes smaller than it would be without MTF coding, and if you're getting about 177 megabytes compressed size, you'd be getting 188 without, so it's a difference of more than 1 percent compression (relative to the full size of a gigabyte) and 6 percent or so smaller compressed size. That's cool.

    2) The use of part of the code space for MTF codes is only costing you a tenth of what it's benefitting you, by making other codes longer. That's very cool, because it suggests you're not near the breakeven point, and using considerably more of your code space would likely get you comparable further gains.
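(Checking the arithmetic against the quoted stats: mtf and mtf1 together occupy 216 + 96 = 312 of the 4096 code bins, so each of the 78,924,756 non-mtf/mtf1 symbols pays roughly log2(4096/3784) ≈ 0.114 extra bits, about 9.0 million bits in total, which matches the quoted penalty. Net saving: 73,112,243 + 26,784,319 - 9,021,386 = 90,875,176 bits, i.e. the 11,359,397 bytes quoted above.)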

    If you do that, and it works well, that's an important, basic result right there. I've thought for a while that Bentley et al.'s MTF coding compression would work better if you split things into high-dispersion and low-dispersion categories, and only used MTF coding for low-dispersion stuff. You've gone and done it, and it works fine.

    [ EDIT: more precisely, you're exploiting the fact that higher-frequency strings tend to also be higher-dispersion. If you actually used a measure of dispersion distinct from frequency, it'd likely work better. ]

    I don't know if you're into publishing papers, but that's something that ought to be in the literature. (I went looking and couldn't find it.) [EDIT: As should the overall Tree thing.]
    Last edited by Paul W.; 28th November 2014 at 06:57. Reason: clarifications

2. #482 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Paul W. View Post
    Those are some interesting numbers, if I'm understanding them correctly.

    I take it that they mean that:

    1) It's making your compressed enwik9 more than 11 megabytes smaller than it would be without MTF coding, and if you're getting about 177 megabytes compressed size, you'd be getting 188 without, so it's a difference of more than 1 percent compression (relative to the full size of a gigabyte) and 6 percent or so smaller compressed size. That's cool.
    That's correct.

    Quote Originally Posted by Paul W. View Post
    2) The use of part of the code space for MTF codes is only costing you a tenth of what it's benefitting you, by making other codes longer. That's very cool, because it suggests you're not near the breakeven point, and using considerably more of your code space would likely get you comparable further gains.
For the 2-20 instance MTF queues, I don't think making them longer would help compression much. For 2 instance symbols, the first queue position is hit 357,344 times; the 64th (last) position is hit 726 times. For 20 instance symbols, the first queue position is hit 48,817 times and the 64th position is hit 78 times. MTF codes get longer as the position increases, so there's not much benefit left after the 64th position when the symbols are categorized by the number of instances. More bits could be saved by combining the queues into a single queue, especially for very recent symbols, or by adjusting the probabilities of the queues.

The >20 instance 1-deep mtf "queues" based on symbol length are significantly underperforming what is possible. First, the code space does not match the hit probability as well as it could. Second, bits can be saved by using a single 12-deep queue instead of twelve 1-deep queues. Third, bits can be saved by using a longer queue. Fourth, the dictionary codes for the symbols in these queues are not removed from the dictionary code space, so they keep consuming code space while in the queues. I have not found a way to quickly manage the dictionary code space appropriately to fix #4. A single 188-deep queue for >20 instance symbols receives almost 3x the number of hits and saves about 15.5 million bits compared to Tree v0.15, without adjusting the mtf1 code space or recalculating the penalty. Final savings should be well over 20 million bits. More bits could be saved with a longer queue, but I am afraid of making it much longer because of my expectations for the impact on decoding speed.

    Quote Originally Posted by Paul W. View Post
    If you do that, and it works well, that's an important, basic result right there. I've thought for a while that Bentley et al.'s MTF coding compression would work better if you split things into high-dispersion and low-dispersion categories, and only used MTF coding for low-dispersion stuff. You've gone and done it, and it works fine.

    [ EDIT: more precisely, you're exploiting the fact that higher-frequency strings tend to also be higher-dispersion. If you actually used a measure of dispersion distinct from frequency, it'd likely work better. ]

    I don't know if you're into publishing papers, but that's something that ought to be in the literature. (I went looking and couldn't find it.) [EDIT: As should the overall Tree thing.]
I write a lot of documents, but they are only published internally by companies I work for. I do think Tree needs a decent set of documents before I say the "project" is done. Without documentation, it is easy to lose ideas and results. Tree seems to be more novel than I initially realized, so I think documentation is really important. The only reason I haven't updated my initial document is that I don't like documenting moving targets.

I have thought about formally submitting a paper, but am not sure what to do because it would be hard for me to produce a good quality publication on my own, and I am not sure where to go for help. I have considered stopping by our local university and discussing Tree and publications with a CS professor. But first, I would like to continue investigating some ideas and making improvements where possible. That way, if and when I go, I will be able to put Tree in the best light that is reasonably possible. At a minimum, once the encoding format stabilizes, I will write a format specification and post it.

    I have thought about dispersion and have wondered if there would be benefits to adding some type of dispersion indicator. It could be as simple as one extra bit per definition, but I haven't tried anything.

After I have Tree's new mtf scheme working, I do think one simple way to slightly improve compression would be to have the encoder send code lengths for new symbols based on how often they come from the dictionary rather than how often they are used (i.e. excluding mtf usage).

One last point for clarification: the approximately 86.4 million symbols include about 7.82 million new-symbol escape symbols, since these are coded with some of the 4096 code bins. They are coded with whatever number of bins is left over after the fixed size mtf code space and variable size dictionary code space have been assigned.

3. #483 Paul W. (Member, Oct 2013, Filling a much-needed gap in the literature, 350 posts)
    [duplicate post deleted]
    Last edited by Paul W.; 28th November 2014 at 17:51. Reason: duplication

4. #484 Paul W. (Member, Oct 2013, Filling a much-needed gap in the literature, 350 posts)
    Well, I've been a tenured CS professor and have published a number of papers, including a couple that got Best Paper awards, and I'd be happy to comment on drafts and suggest changes.

    I've never published in the compression conferences/journals, though. (Our WK stuff was published in the OS field, in the context of compressed caching for virtual memory systems.) It'd be good to have somebody on board who's been there and done that, and knows what stuff you're expected to cite, how they like the papers written, etc.

    (It would be better to find specialists in the right field than just some professor at the local U---though maybe there's an appropriate compression person wherever it is you are...[google maps "Bothell WA"] and it looks like one of your "local" universities is the University of Washington, which is good. There are a lot of good "systems" people there.)

    At the rate you're going, it seems to me you should be able to find such people. This is good stuff, and interesting, and I think there's a good chance that you could get the right people to take an interest and help you a bit.

5. #485 Paul W. (Member, Oct 2013, Filling a much-needed gap in the literature, 350 posts)
    Kennon:

    "I have thought about dispersion and have wondered if there would be benefits to adding some type of dispersion indicator. It could be as simple as one extra bit per definition, but I haven't tried anything."

    If you have a reasonably efficient MTF implementation, with priority queues implemented as heaps embedded in arrays in the usual way, or with splay trees or something, I'd think you could do roughly this:

    1) determine the global frequencies of your strings in the usual way, giving you the plain entropy of your strings
    2) make another pass over the input where you measure the MTF entropy of the same strings---i.e., how many bits you'd need to encode them by a global MTF list.

    Then you'd set a bit for each string saying whether its plain entropy was lower (not much locality) or its MTF entropy was lower (should be MTF-encoded).

    I suspect that you can't really measure these things independently, or independently of how you choose strings in the first place, but that it would be a fine rough cut that would work reasonably well.
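A minimal sketch of that two-pass measurement (hypothetical names and a toy position-cost model, not Tree code; a naive linear MTF list is tolerable here because it only runs on the encoder side):

Code:
/* Hypothetical sketch (not Tree source): pass 1 fills count[]; pass 2
   replays the symbol stream, charging each string both its global-model
   cost and its cost under a single global MTF list, then sets one
   dispersion bit per string saying which coding wins. */
#include <math.h>
#include <string.h>
#include <stdint.h>

#define NSYM 1000000U              /* assumed number of distinct strings */

static uint32_t count[NSYM];       /* pass 1: global frequency per string */
static double plain_bits[NSYM];    /* accumulated -log2(p) cost */
static double mtf_bits[NSYM];      /* accumulated MTF position-code cost */
static uint8_t use_mtf[NSYM];      /* the dispersion bit per definition */

static uint32_t mtf_list[NSYM];    /* naive linear MTF list (encoder only) */
static uint32_t mtf_len = 0;

static uint32_t move_to_front(uint32_t s)  /* returns position before move */
{
    uint32_t i;
    for (i = 0; i < mtf_len; i++)
        if (mtf_list[i] == s)
            break;
    if (i == mtf_len)
        mtf_len++;                 /* first sighting: grow the list */
    memmove(&mtf_list[1], &mtf_list[0], i * sizeof(uint32_t));
    mtf_list[0] = s;
    return i;
}

/* assumes count[] was filled by a first pass over the same stream */
void set_dispersion_bits(const uint32_t *stream, long n, double total)
{
    for (long i = 0; i < n; i++) {
        uint32_t s = stream[i];
        plain_bits[s] -= log2((double)count[s] / total);
        /* toy model: an MTF hit at position p costs ~log2(p + 2) bits */
        mtf_bits[s] += log2((double)move_to_front(s) + 2.0);
    }
    for (uint32_t s = 0; s < NSYM; s++)
        use_mtf[s] = (mtf_bits[s] < plain_bits[s]); /* low dispersion wins */
}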



6. #486 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Paul W. View Post
    Kennon:

    "I have thought about dispersion and have wondered if there would be benefits to adding some type of dispersion indicator. It could be as simple as one extra bit per definition, but I haven't tried anything."

    If you have a reasonably efficient MTF implementation, with priority queues implemented as heaps embedded in arrays in the usual way, or with splay trees or something, I'd think you could do roughly this:

    1) determine the global frequencies of your strings in the usual way, giving you the plain entropy of your strings
    2) make another pass over the input where you measure the MTF entropy of the same strings---i.e., how many bits you'd need to encode them by a global MTF list.

    Then you'd set a bit for each string saying whether its plain entropy was lower (not much locality) or its MTF entropy was lower (should be MTF-encoded).

    I suspect that you can't really measure these things independently, or independently of how you choose strings in the first place, but that it would be a fine rough cut that would work reasonably well.
That's an interesting idea. I have been trying to avoid a fully global MTF list because long MTF lists can take a lot of time to maintain. But if the percentage of symbols that should be MTF encoded is relatively small, then a queue restricted to those symbols would be a lot smaller and would need far fewer updates than a fully global MTF queue.

The second pass doesn't bother me. It's on the encoder side, where speed is essentially irrelevant because it is insignificant compared to compression time. It is similar to what is done currently to determine mtf queue code space sizes. This would solve my problem #4 above, since codes for symbols that are in the mtf queue would not need to be dynamically taken out of the dictionary to avoid wasting dictionary code space. The biggest advantage may be in queue hits tending to happen closer to the top of the queue, which should provide shorter mtf codes on average. I suspect that a scheme like this would allow more effective dynamic adjustment of the global mtf probability: if the recent history has been non-mtf symbols, then mtf should be de-emphasized, and if the recent history has a lot of mtf symbols, then it would probably be a good idea to boost the mtf probability.

Once I get my first cut at better MTF coding done, I can get a better idea of how much impact the new queue has on decoding speed and try to determine the impact of a much bigger queue. This is intriguing, but I am concerned about my ability to code a fast global MTF scheme with a queue size that might still be a million symbols. If that's just too much, I could possibly do this on only >20 instance symbols, of which there are only about 430K, compared to about 4M for the peak dictionary size or about 7.8M for the full dictionary size. These tend to be more dispersed, so the percentage of symbols that go in the mtf queue should be lower. It seems possible that much of the mtf benefit would come from a relatively small number of those symbols; perhaps only 10K of them have really low dispersion. I think I could handle a 10K queue that is only affected by 5% of the transmitted symbols.

I do not know what heaps and splay trees are. Tree's mtf code may not be optimal because of this. Currently, Tree uses a circular buffer for each 64 element queue. My prototype code uses a series of five 2^N sized circular buffers for its "global" array (8, 16, 32, 64, 64), except for a few queue locations at the top that are handled individually. So generally, the code has to move symbols down within one buffer, then update the circular buffer pointer and one symbol per circular buffer for each buffer that contains symbols more recent than the current symbol. Symbols of incrementally lower probability are removed from the global queue as they transition from one circular buffer to the next, to ensure mtf codes are shorter than the dictionary code and to allow lower probability symbols to stay in the queue longer. So sometimes all the queue entries beyond where the current symbol was located are moved up. Maybe not the best, but that's what I have come up with.

7. #487 Paul W. (Member, Oct 2013, Filling a much-needed gap in the literature, 350 posts)
    A heap is a partially-ordered tree. (Usually a binary tree, but it doesn't have to be.) Unlike a normal balanced tree, it doesn't maintain a total (left-to-right) ordering of the items. It only guarantees that any given node has a higher key than any of its children.

    That weak ordering constraint is enough to let you do certain operations in logarithmic time, like with a normally balanced tree (red-black or AVL tree or whatever), but with lower constant factors.

Heaps are commonly associated with the "priority queue" abstract data type, and sometimes confused with it, but they're different things. "Priority queue" is an abstraction that can be implemented lots of ways, but the most common is with a heap, because it supports the necessary operations very efficiently. (You can use a priority queue as an MTF list by timestamping items and using the timestamp as a key. To maintain an LRU eviction queue, for example, you'd update the timestamp on the current item to the current time, add it to the front of the priority queue, and evict the item with the lowest timestamp using delete-min.)

    Heaps are usually implemented in terms of a lower-level data structure: an array. Given the simple ordering constraint, you can embed a heap in an array in a simple deterministic way, and don't need space for pointers to child nodes. If all you need to implement is the operations of a priority queue, not all of the operations normal balanced binary trees can support, that's usually the way to go; it's compact and fast.

    There are binary heap implementations suitable for implementing efficient MTF lists in pretty much any language you can name. http://en.wikipedia.org/wiki/Heap_(data_structure)
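For concreteness, here is a minimal 1-based array-embedded binary min-heap keyed by timestamp, sketching the delete-min eviction idea above (illustrative names, no bounds checking, not Tree code):

Code:
#include <stdint.h>

#define HEAP_MAX 1024

typedef struct { uint64_t stamp; uint32_t symbol; } HeapNode;

static HeapNode heap[HEAP_MAX + 1];   /* children of slot i are 2i and 2i+1 */
static int heap_len = 0;

static void heap_swap(int a, int b)
{
    HeapNode t = heap[a]; heap[a] = heap[b]; heap[b] = t;
}

static void sift_up(int i)
{
    while (i > 1 && heap[i].stamp < heap[i / 2].stamp) {
        heap_swap(i, i / 2);          /* smaller than parent: bubble up */
        i /= 2;
    }
}

static void sift_down(int i)
{
    for (;;) {
        int c = 2 * i;                /* pick the smaller child */
        if (c > heap_len) break;
        if (c + 1 <= heap_len && heap[c + 1].stamp < heap[c].stamp) c++;
        if (heap[i].stamp <= heap[c].stamp) break;
        heap_swap(i, c);
        i = c;
    }
}

void heap_push(uint64_t stamp, uint32_t symbol)
{
    heap[++heap_len] = (HeapNode){ stamp, symbol };
    sift_up(heap_len);
}

HeapNode heap_pop_min(void)           /* evict the least recent item */
{
    HeapNode min = heap[1];
    heap[1] = heap[heap_len--];
    sift_down(1);
    return min;
}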

    A splay tree is an oddly balanced-or-unbalanced binary tree. It maintains the normal left-to-right order, but doesn't maintain any obvious local balance property. It does turn out to have very interesting balance properties if you're interested in locality and compression. http://en.wikipedia.org/wiki/Splay_tree

    A splay tree restructures itself every time you access it, sort of like a move-to-front linear list, generalized to ordered trees.

    Whenever you touch an item, you move it to the root of the tree, and you "rotate" all the links that you followed to get to it, so that its parent becomes its child, and in general its nearest neighbors move half-way up to the root of the tree, its next-nearest neighbors move a quarter of the way up, and so on.

    That makes splay trees radically adaptive:

    1) If you have random accesses, the splay tree will be bushy. Random things will get moved to the root, nearby things will get moved toward the root, and on average it'll be a fairly balanced tree.

    2) If you have hot items that are frequently accessed (like dictionary entries for "and a" and "of the") they will tend to be very near the root---they'll go up to the root every time you touch them, and they won't get pushed down far before they get touched again. Accesses to frequently-touched items are fast.

    3) If you have "hot" regions in the key ordering, i.e., ranges of keys you use more often than others, the corresponding subtrees will tend to be nearer the root (higher in the tree) than the subtrees corresponding to "cold" regions in the key ordering. Subtrees that haven't been touched for a while drift down as more-recently-touched stuff (and that stuff's near neighbors) gets rotated up toward the root.
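To make the rotations concrete, here is the classic top-down splay routine (after Sleator and Tarjan), as a generic sketch rather than anything from Tree; one call moves the node with the given key, or the last node on the search path, to the root:

Code:
#include <stdlib.h>

typedef struct SplayNode {
    int key;
    struct SplayNode *left, *right;
} SplayNode;

/* Top-down splay: search for key, restructuring as we go so the found
   node (or the last one on the path) ends up as the new root. */
SplayNode *splay(int key, SplayNode *t)
{
    SplayNode N, *l, *r, *y;
    if (t == NULL) return NULL;
    N.left = N.right = NULL;
    l = r = &N;
    for (;;) {
        if (key < t->key) {
            if (t->left == NULL) break;
            if (key < t->left->key) {      /* zig-zig: rotate right */
                y = t->left;
                t->left = y->right;
                y->right = t;
                t = y;
                if (t->left == NULL) break;
            }
            r->left = t;                   /* link root into the right tree */
            r = t;
            t = t->left;
        } else if (key > t->key) {
            if (t->right == NULL) break;
            if (key > t->right->key) {     /* mirrored zig-zig: rotate left */
                y = t->right;
                t->right = y->left;
                y->left = t;
                t = y;
                if (t->right == NULL) break;
            }
            l->right = t;                  /* link root into the left tree */
            l = t;
            t = t->right;
        } else
            break;
    }
    l->right = t->left;                    /* reassemble the three pieces */
    r->left = t->right;
    t->left = N.right;
    t->right = N.left;
    return t;
}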

  8. The Following User Says Thank You to Paul W. For This Useful Post:

    Kennon Conrad (29th November 2014)

9. #488 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    I think I get it. Heap sort may be the way to go for really big mtf queues. I think Piotr mentioned them earlier. If I understand correctly, splay trees are mostly for fast searching, which isn't really an issue for decoding since the code indicates the queue position.

10. #489 Paul W. (Member, Oct 2013, Filling a much-needed gap in the literature, 350 posts)
    I was suggesting heaps, but not heap sort. (Heaps are used in heap sort, because heap sort doesn't require the total left-to-right ordering that normally ordered trees require.)

    Could you tell me more concretely how you're implementing your MTF lists now? The advantage of using a heap to implement MTF lists is that you can grab something from the middle of a huge list and move it to the front in fast logarithmic time, so it works well for much longer lists than using linear search in a linear linked list or by shoving things around in linear arrays.

11. #490 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Paul W. View Post
    I was suggesting heaps, but not heap sort. (Heaps are used in heap sort, because heap sort doesn't require the total left-to-right ordering that normally ordered trees require.)

    Could you tell me more concretely how you're implementing your MTF lists now? The advantage of using a heap to implement MTF lists is that you can grab something from the middle of a huge list and move it to the front in fast logarithmic time, so it works well for much longer lists than using linear search in a linear linked list or by shoving things around in linear arrays.
    Sorry, I meant heaps. I tend to not always get my wording right when a subject is new to me.

    The released code uses an array to implement nineteen 64 element mtf queues. It is not elegant, but is simple. Adding a new symbol to a full queue, which is by far the most common operation on the queue, essentially consists of the following:

    Code:
/* overwrite the oldest entry of this queue's 64-element circular buffer... */
mtf_queue_ptr = &mtf_queue[64 * queue_number + mtf_queue_offset[queue_number]];
*mtf_queue_ptr = new_symbol;
/* ...then advance the queue's start offset, wrapping mod 64 */
mtf_queue_offset[queue_number] = (mtf_queue_offset[queue_number] + 1) & 0x3F;
    The biggest problem is removing symbols from the interior of the list because up to 63 elements need to be moved.

    Code:
/* pointer to the newest (bottom) slot of this symbol count's queue */
mtf_queue_last_ptr = &mtf_queue[MTF_QUEUE_SIZE * symbol_count
    + ((mtf_queue_offset[symbol_count] + mtf_queue_size[symbol_count] - 1) & 0x3F)];
/* pointer to the slot of the symbol being removed, counted back from the bottom */
mtf_queue_ptr = &mtf_queue[MTF_QUEUE_SIZE * symbol_count
    + ((mtf_queue_offset[symbol_count] + mtf_queue_size[symbol_count]
      - mtf_position_table[temp_input_word] - 1) & 0x3F)];
/* close the gap: copy each following entry back one slot, wrapping at the
   64-entry buffer boundary, until the newest slot has been moved */
do
{
  if (mtf_queue_ptr < &mtf_queue[64 * (queue_number + 1) - 1])
  {
    *mtf_queue_ptr = *(mtf_queue_ptr+1);
    mtf_queue_ptr++;
  }
  else
  {
    *mtf_queue_ptr = *(mtf_queue_ptr - 63);   /* wrap to the buffer start */
    mtf_queue_ptr -= 63;
  }
} while (mtf_queue_ptr != mtf_queue_last_ptr);
The code is a bit convoluted. The most recent symbol is on the bottom of the queue, and mtf_queue_ptr initially points one position past the position of the most recent symbol. This code was changed for v0.15. I got it working but didn't clean it up as much as I could have, so I will probably go back at some point and change at least a few things. v0.14 moved all of the elements in the queue when inserting a symbol that was not in the queue, so at least the code doesn't do that anymore.

I realize the code is not optimal, and that the algorithm is not optimal. Most of the time it's not too costly because the position hits are strongly biased toward mtf position 0 - roughly 50% of queue hits for enwik9 are for the most recently added symbol. I wish I could write better code on my first try, but I still have a lot to learn about efficient algorithms for compression and decompression. It wouldn't surprise me if someone like Bulat or Piotr could take Tree's decoder and cut decompression time in half.

12. #491 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Paul W. View Post
    Well, I've been a tenured CS professor and have published a number of papers, including a couple that got Best Paper awards, and I'd be happy to comment on drafts and suggest changes.

    I've never published in the compression conferences/journals, though. (Our WK stuff was published in the OS field, in the context of compressed caching for virtual memory systems.) It'd be good to have somebody on board who's been there and done that, and knows what stuff you're expected to cite, how they like the papers written, etc.

    (It would be better to find specialists in the right field than just some professor at the local U---though maybe there's an appropriate compression person wherever it is you are...[google maps "Bothell WA"] and it looks like one of your "local" universities is the University of Washington, which is good. There are a lot of good "systems" people there.)

    At the rate you're going, it seems to me you should be able to find such people. This is good stuff, and interesting, and I think there's a good chance that you could get the right people to take an interest and help you a bit.
    That's a very generous offer. I may well take you up on it when I am ready.

Yes, I was referring to the University of Washington. My degrees are from there (main campus), but most of the professors I knew retired long ago.

    I got Tree's new mtf scheme working last night. Using the compressed (not encoded) enwik9 file produced by Tree v0.15a gives 174,067,748 bytes instead of 176,896,672 bytes. MTF savings is now over 14MB and it could still be better.

13. #492 hexagone (Member, Nov 2014, California, 124 posts)
    Kennon,

    I ran tree against enwik8:

    TreeComp.bat r:\enwik8
    [...]
    205: 13267479 syms, dict. size 1180615, 15.7846 bits/sym, o0e 26177703 bytes
    Common prefix scan 1/1 0.0 - 1503c6.1503c6, score[3] = 4.51
    206: 13267465 syms, dict. size 1180617, 15.7846 bits/sym, o0e 26177708 bytes
    Common prefix scan 1/1 0.0 - 1503c8.1503c8
    Run time 5242 seconds.


    R:\Tree>TreeBitEncode r:\enwik8.tcom r:\enwik8.tree
    cap encoded: 1, UTF8_compliant 1
    Read 13267465 symbols including 1180617 definition symbols
    Encoded 9385250 level 1 symbols
    Compressed file size: 21922356 bytes, dictionary size: 1185671 symbols
    elapsed time = 2.634000 seconds.


    R:\Tree>TreeDecode r:\enwik8.tree r:\enwik8.rest
    Decompressed 100000000 bytes in 962 msec

    Then I tried something a little "harder" ... tree against the BWTS of enwik8:
    [...]
    396: 22489289 syms, dict. size 732952, 10.8896 bits/sym, o0e 30612423 bytes
    Common prefix scan 1/1 0.0 - b3017.b3017
    Run time 14052 seconds.


    R:\Tree>TreeBitEncode r:\enwik8.bwt.tcom r:\enwik8.bwt.tree
    cap encoded: 0, UTF8_compliant 0
    Read 22489289 symbols including 732952 definition symbols
    Encoded 19644405 level 1 symbols
    Compressed file size: 29015552 bytes, dictionary size: 733138 symbols
    elapsed time = 1.796000 seconds.


    R:\Tree>TreeDecode r:\enwik8.bwt.tree r:\enwik8.bwt.rest
    2000197

    I killed the decoding process after several hours. The CPU was still being used by the process so I suspect it was stuck in a busy loop (or maybe very very slow ...).

  14. The Following User Says Thank You to hexagone For This Useful Post:

    Kennon Conrad (2nd December 2014)

15. #493 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by hexagone View Post
    R:\Tree>TreeDecode r:\enwik8.bwt.tree r:\enwik8.bwt.rest
    2000197

    I killed the decoding process after several hours. The CPU was still being used by the process so I suspect it was stuck in a busy loop (or maybe very very slow ...).
    Hexagone,

    I appreciate the problem report. It helps make Tree more robust.

    Tree v0.15b is attached. TreeDecode's array for new symbols has been increased in size from 100K to 10M. This is still vulnerable to overflow if a symbol representing a string of more than 10MB is created. More robust code is possible, but this is something I chose to defer during development and will continue to defer other than the increase in buffer size. The compressed enwik8 bwts file had a symbol for a match of 126,014 characters.

    I also fixed a bug that affected encoding of the bwts of enwik8 without running TreeCompress. A batch file for TreeCompress64 has been added but should only be used if your computer has free RAM that is at least 5x the input file size.
Attached Files
    Last edited by Kennon Conrad; 3rd December 2014 at 09:56.

16. #494 hexagone (Member, Nov 2014, California, 124 posts)
    Kennon,

    Your fix addressed the issue.
    [...]


    R:\Tree>TreeDecode r:\enwik8.bwt.tree r:\enwik8.bwt.rest
    Decompressed 100000016 bytes in 566 msec



    Just to be clear, the input file is the BWTS of enwik8 + a header (hence the 16 extra bytes in the file size). The decompressed file is correct.
    Last edited by hexagone; 4th December 2014 at 01:02. Reason: Clarification

17. #495 Skymmer (Member, Mar 2009, Russia, 681 posts)
    I suppose there is an error in TreeComp64.bat
Correct me if I'm wrong, but shouldn't the second line look like
    TreeCompress64 %1.tpre %1.tcom
    instead of
    TreeCompress64 %1.twor %1.tcom

Also, I have tried to create a speed-optimized compile of TreeCompress64.exe and, generally speaking, I failed.
It seems the C sources are hard to optimize at the compiler level, or maybe the source itself leaves no room for optimization, or it's just me and my lack of knowledge.
A 2.17% speedup is the maximum I was able to squeeze out. Anyway, if somebody wants to try it, you're welcome.
Attached Files

  18. The Following User Says Thank You to Skymmer For This Useful Post:

    Kennon Conrad (5th December 2014)

19. #496 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Skymmer View Post
    I suppose there is an error in TreeComp64.bat
Correct me if I'm wrong, but shouldn't the second line look like
    TreeCompress64 %1.tpre %1.tcom
    instead of
    TreeCompress64 %1.twor %1.tcom

Also, I have tried to create a speed-optimized compile of TreeCompress64.exe and, generally speaking, I failed.
It seems the C sources are hard to optimize at the compiler level, or maybe the source itself leaves no room for optimization, or it's just me and my lack of knowledge.
A 2.17% speedup is the maximum I was able to squeeze out. Anyway, if somebody wants to try it, you're welcome.
    You are correct. I will fix the batch file on the next release.

    In my somewhat limited experience, execution speed is more a function of coding choices than programming language, but I haven't written code in any language that has been invented in the last 20 years.

The C code has functions that are generally written to optimize speed rather than code size. I have spent a fair amount of time looking at the assembly code and rewriting code to give faster assembly code. Function calls are only used for recursive functions. I used to think I could make the code faster by writing it in assembly, but I don't think the gains would be that great. At this point, I think most of the possible speed optimizations lie in algorithm improvements rather than code optimization. So, in other words, I think you did pretty well to get 2.17%. If I knew what you did, I would definitely incorporate it. BTW, I am also seeing a little more than 2% speed improvement.

20. #497 Bulat Ziganshin (Programmer, Mar 2007, Uzbekistan, 4,497 posts)
Look at the microarchitecture manual by Agner Fog; assembler-level optimizations are usually wrong since they don't take the microarchitecture into account.

21. #498 Piotr Tarsa (Member, Jun 2009, Kraków, Poland, 1,474 posts)
    Kennon:
It could be the case that Tree is bound by random memory accesses. If that's true, then local optimizations won't bring any serious improvements. You should then work on prefetching data from RAM, decreasing data structure sizes to keep more data in caches, reorganizing data if cache lines are underutilized (i.e. you are reading small chunks, e.g. 5 bytes, at random memory locations instead of chunks of e.g. 30 bytes, which is much more comparable to cache line size), or reorganizing data if the layout causes cache-line set underutilization (CPU caches are usually partitioned into cache-line sets, where cache lines in each set share some bits of an address; because of that, if lots of data locations share some bits, then they probably won't be spread all over the available CPU cache). Look here: http://en.wikipedia.org/wiki/Locality_of_reference

    I expect compilers to have only very basic memory reorganization and prefetching abilities.

Also, splitting computations into many passes can (counterintuitively) increase speed, as doing many things at once means that those many interleaved computations have to compete for CPU cache. E.g. if you're decoding symbols using FSE simultaneously with copying matches, then frequent matches compete for data cache with the FSE lookup table. I think that if the intermediate data stream is not big, then it's better to split processing into small blocks, so that the intermediate data can still fit in caches, though maybe not the fastest ones. But even the slowest caches have so much bandwidth that they shouldn't be a bottleneck, provided you don't cause too many walks into main memory.

When dividing data into small blocks (you don't have to store block boundaries in the compressed stream, so it won't change any storage format) and splitting computation into passes, you can multithread at that level, and then the effect would be that different cores use their fast non-shared caches for different tasks. IIRC you're doing something like that, as Tree decoding can use two threads, but I don't know the intermediate buffer sizes. Maybe they're even bigger than the cache could hold at once, but even then it wouldn't be surprising if such sizes provide the biggest speed benefit. Main memory transfer is still fast and thread/process synchronization is not free, so the balance won't depend on one factor.
    Last edited by Piotr Tarsa; 6th December 2014 at 18:00.

22. #499 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
Yes, I think RAM access time is a significant factor, especially for TreeCompress(64) and TreeDecode when working with large files. To some degree it is inherent when making many passes over large data sets (TreeCompress) or handling large dictionaries (TreeDecode). When compressing enwik9, TreeCompress64 initially uses more than 4 GB just to hold the input (a word per character plus extra capital encode words), so the data stream is quite large. When decoding, only one pass is required, but the dictionary size is about 200 MB for enwik9. When I tested compression with -m values that limit the dictionary to a much smaller size, compression speed was up to more than 3x faster. I think much of that is from decreased RAM accesses.

    I have variable association on my list of things to try, but my list consistently gets longer rather than shorter. Some of the key variables are still changing as the algorithm is developed. It seems like something that is best done toward the end of development.

    The circular buffer to pass data between TreeDecode's two threads is 1 MB. It is a circular list of 32 bit symbol numbers. I tried various sizes and that size seems to work as well as any. The decoding thread accesses the dictionary to create the new dictionary symbol. The output thread accesses the dictionary to write the output symbols. I have considered modifying the dictionary so that some or all symbols are entered in their compressed format, but haven't tried any code changes yet. This would reduce the amount of RAM used by the dictionary a lot, but add extra dictionary lookups. It could well be helpful, at least to some degree, and is also on my list of things to investigate.
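For reference, a minimal single-producer/single-consumer ring of 32-bit symbols along the lines described above (a sketch with assumed names using C11 atomics; Tree's actual code may differ):

Code:
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE (1 << 18)              /* 256K entries * 4 bytes = 1 MB */

static uint32_t ring[RING_SIZE];
static _Atomic uint32_t head = 0;        /* advanced by the decode thread */
static _Atomic uint32_t tail = 0;        /* advanced by the output thread */

int ring_push(uint32_t symbol)           /* decode thread */
{
    uint32_t h = atomic_load_explicit(&head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&tail, memory_order_acquire);
    if (h - t == RING_SIZE)
        return 0;                        /* full: caller spins or yields */
    ring[h & (RING_SIZE - 1)] = symbol;
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return 1;
}

int ring_pop(uint32_t *symbol)           /* output thread */
{
    uint32_t t = atomic_load_explicit(&tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&head, memory_order_acquire);
    if (h == t)
        return 0;                        /* empty */
    *symbol = ring[t & (RING_SIZE - 1)];
    atomic_store_explicit(&tail, t + 1, memory_order_release);
    return 1;
}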

I haven't really studied it yet, but I imagine the characteristics of what is accessed in the dictionary are somewhat different for each thread. I think accesses for writing new dictionary symbols are more heavily weighted toward frequent symbols than the dictionary accesses for writing the output. One of these days I will investigate whether using a separate model is worthwhile.

    For TreeCompress, probably the best way to improve RAM access is to reduce the number of them. If I was starting over today and going for the fastest implementation of the same algorithm, I would build a suffix array of the input and maintain it throughout the compression process rather than rebuilding it after each scoring cycle. With my old computer, it wasn't viable to keep the whole structure in RAM whether it was an array or a tree. With my new computer, I think I could do it with a suffix array. The key would be finding a fast way to maintain the array, and I'm not sure how hard that is.

    I do realize local optimizations don't have much impact when they aren't the key time consumers. But I do sometimes find them helpful for making other areas that need optimization more obvious.

23. #500 Skymmer (Member, Mar 2009, Russia, 681 posts)
    Quote Originally Posted by Kennon Conrad View Post
    So, in other words, I think you did pretty well to get 2.17%. If I knew what you did, I would definitely incorporate it.
Thanks for the good words. Actually I didn't do anything special, and I'm afraid there won't be much help from me here, but anyway - that 2% is just compiler options:

    -m64 -march=x86-64 -mtune=core2 -static -O3 -s -funroll-all-loops -fomit-frame-pointer -mprfchw -fprefetch-loop-arrays -fno-exceptions -mconsole --param inline-unit-growth=99999

24. #501 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Skymmer View Post
Thanks for the good words. Actually I didn't do anything special, and I'm afraid there won't be much help from me here, but anyway - that 2% is just compiler options:

    -m64 -march=x86-64 -mtune=core2 -static -O3 -s -funroll-all-loops -fomit-frame-pointer -mprfchw -fprefetch-loop-arrays -fno-exceptions -mconsole --param inline-unit-growth=99999
Thanks. I will try some compile options. I use -O2 because it has always been at least as fast as and usually faster than -O3. For TreeBitEncode, -O2 is about 20% faster than -O3.

25. #502 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Tree v0.16 uses a 258 symbol mtf queue for symbols that occur more than 20 times instead of a single mtf symbol per symbol length. This is not ideal and there are new opportunities for improvement, but it will move Tree from #36 on LTCB to #28.

    The main change is in TreeBitEncode and TreeDecode. TreeCompress64 has an adjusted scoring formula that results in slightly faster compression and slightly better compression of enwik9.

    The results:
    enwik9:
    TreeCompress: 174,086,232 bytes, 114,739 sec./8.1 sec. (was 176,896,672 bytes, 114,733 sec./6.9 sec.)
    TreeCompress64: 175,032,196 bytes, 9362 sec./8.2 sec. (was 177,974,208 bytes, 9542 sec./7.0 sec.)

    enwik8:
    TreeCompress: 21,625,828 bytes (was 21,922,356 bytes)

    A .zip file of TreeDecode.c is 13,395 bytes.

The dispersion statistics, as measured by the number of mtf hits for each string, are quite diverse. Here are a few samples from the file produced by TreeCompress. The first number is the symbol rank. The second number is the code length. The next two numbers are the number of mtf hits and the number of occurrences of the symbol. The last part is the string the symbol represents, in double quotes.

    64 (11): 20386/36635 in q: "
    *[[C"
    77 (11): 21919/31602 in q: "]] - [[C"
    86 (11): 15852/29456 in q: "
    * [[C"
    282 (13): 10488/12786 in q: "]]
    **[[C"
    322 (13): 9/11149 in q: "]]
    [[fi:C"
    444 (13): 7221/7875 in q: "]]
    #[[C"
    497 (14): 24/7204 in q: "]

    [[Ccategory:C"
    624 (14): 5505/6089 in q: "&lt;br&gt;
    C"
    659 (14): 8/5807 in q: "]]
    [[da:C"
    819 (14): 4674/4838 in q: "]] -- [[C"
    1121 (14): 6/3729 in q: "]]
    [[sk:C"
    1125 (14): 7/3720 in q: ".

    ==Csee also==
    *[[C"
    1431 (15): 3070/3080 in q: "'']]
    *[[ChCmCs C"

    Some strings have very low dispersion such as the last one that looks like it is probably used repeatedly in one or more lists of ships. The language ones have unusually high dispersion, with far fewer mtf hits than if they had been randomly spaced. These tend to occur only once per page and often are soon followed by other high dispersion symbols that represent other languages. It looks like there are lots of bits that can be saved by adjusting the code lengths for these symbols so they are based on the number of non-mtf instances instead of the total number of instances. For instance, the last one should have a code length of about 23 bits instead of 14 bits.
Attached Files
    Last edited by Kennon Conrad; 9th December 2014 at 10:20.

  26. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Paul W. (9th December 2014),surfersat (9th December 2014)

27. #503 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)

    Tree v0.16a

Since Matt has not updated LTCB yet, I wanted to put out a version of TreeBitEncode that takes advantage of the dispersion characteristics of individual symbols by sending symbol code lengths determined by the number of dictionary hits instead of the number of non-initial occurrences. The initial code is limited to sending codes corresponding to 20 or more dictionary hits.

    New results:
    enwik9:
    TreeCompress: 173,846,372 bytes, 114,748 sec./8.1 sec. (was 174,086,232 bytes, 114,739 sec./8.1 sec.)
    TreeCompress64: 174,796,156 bytes, 9371 sec./8.2 sec. (was 175,032,196 bytes, 9362 sec./8.2 sec.)

    enwik8:
    TreeCompress: 21,603,356 bytes (was 21,625,828 bytes)

Tree now produces a compressed enwik9 file that is over 10% smaller than that of the leading LZ compressor.

One interesting note, which caught me by surprise (but probably shouldn't have): after adjusting the symbol code lengths, but before modifying the total symbol code space targeted by the code length assignment algorithm, the reduction in symbol code space more than made up for the increase in mtf queue size. The high-frequency and low-dispersion symbols in the dictionary actually get slightly shorter code lengths than they did before the change. That is in spite of the fact that symbols in the >20 instance mtf queue are not removed from the dictionary while in the mtf queue, because of decoding speed concerns.
Attached Files
    Last edited by Kennon Conrad; 9th December 2014 at 10:35.

  28. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Paul W. (9th December 2014),surfersat (9th December 2014)

29. #504 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)

    Tree v0.16b

I didn't check Tree v0.16a on enough files: it crashed on many files from the Silesia Corpus. There was a bug in the mtf hit counting code.

    New results:
    enwik9:
    TreeCompress: 173,848,464 bytes, 114,748 sec./8.1 sec. (was 174,086,232 bytes, 114,739 sec./8.1 sec.)
    TreeCompress64: 174,825,152 bytes, 9371 sec./8.2 sec. (was 175,032,196 bytes, 9362 sec./8.2 sec.)

    enwik8:
    TreeCompress: 21,602,648 bytes (was 21,625,828 bytes)

    I suspect the enwik9 results are slightly worse because symbol code lengths are calculated using an approximation to Huffman coding. It's probably time to implement actual Huffman codes.
Attached Files
    Last edited by Kennon Conrad; 9th December 2014 at 19:39.

30. #505 Matt Mahoney (Expert, May 2008, Melbourne, Florida, USA, 3,255 posts)
    Do you know what the memory usage is?

31. #506 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Matt Mahoney View Post
    Do you know what the memory usage is?
    I haven't changed compression memory usage, which is what you list. But I can rerun compression later with timer32 to get better data.

32. #507 Piotr Tarsa (Member, Jun 2009, Kraków, Poland, 1,474 posts)
Do you use a plain and simple MTF algorithm or something fancy? The speed hit from using MTF tricks in Tree is noticeable.

MTF could be sped up by using SIMD and/or trees. E.g. with a two-level tree (i.e. root, internal nodes, leaves, so actually three levels), each internal node and the root would have O(sqrt(total)) children, so it should be eligible for speedup using SSE2 or newer. Two children in a binary tree are too few to allow for efficient vectorization.
    Last edited by Piotr Tarsa; 10th December 2014 at 01:44.

33. #508 Matt Mahoney (Expert, May 2008, Melbourne, Florida, USA, 3,255 posts)
    No problem. Looks like you moved up several spots http://mattmahoney.net/dc/text.html

  34. The Following User Says Thank You to Matt Mahoney For This Useful Post:

    Kennon Conrad (10th December 2014)

35. #509 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    Quote Originally Posted by Piotr Tarsa View Post
Do you use a plain and simple MTF algorithm or something fancy? The speed hit from using MTF tricks in Tree is noticeable.

MTF could be sped up by using SIMD and/or trees. E.g. with a two-level tree (i.e. root, internal nodes, leaves, so actually three levels), each internal node and the root would have O(sqrt(total)) children, so it should be eligible for speedup using SSE2 or newer. Two children in a binary tree are too few to allow for efficient vectorization.
The speed impact on enwik9 from adding more extensive MTF handling was about 6 seconds for encoding and 1.2 seconds for decoding. I am not too concerned about the encoding time since it is a small fraction of compression time. However, the time impact on decoding was more than I would like, and I am open to suggestions. I'm not sure I understand how a tree helps, but I will try to figure it out.

    The initial implementation is not plain and simple MTF. Symbols that occur 20 or fewer times are still handled with nineteen 64 element queues specific to the number of times the symbol occurs. Of the remaining symbols, those that have code lengths of at least 11 bits are put in the new 258 element queue. When decoding enwik9, about 52.2 million symbols are put into this queue. The queue is fully updated for every symbol entered (or moved). So simple MTF would require approximately 16 trillion mtf moves and would take a lot more than 1.2 seconds (at least for code I know how to write today). What is implemented is as follows:

    queue #1: 10 element queue (code lengths 9, 8, 8, 8, 9, 9, 9, 9, 9, 9)
    queue #2: 8 element circular queue (code length 10)
    queue #3: 16 element circular queue (code length 11)
    queue #4: 32 element circular queue (code length 12)
    queue #5: 64 element circular queue (code length 13)
    queue #6: 64 element circular queue (code length 14)
    queue #7: 64 element circular queue (code length 15)

    Symbols with code lengths of 11 are removed from the queue when they exit queue #2. Likewise for code length/queue pairs as follows: 12/#3, 13/#4, 14/#5, 15/#6. When a new symbol is added to the queue, symbols are moved down in the circular queues by decrementing the circular queue start pointer, moving the new top symbol into the next queue (unless the code length limit applies, in which case this is skipped) and putting the symbol that fell out of the previous queue on top of the queue being updated. Handling queue hits is similar, except the queues after the hit location don't need to be updated but the queue that had the hit gets the individual symbols above the hit location moved down.
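A rough sketch of that cascade (hypothetical names, not Tree source): every queue is treated here as a plain circular buffer, the individually handled slots of queue #1 and the code-length-based removals are omitted, and only the common add-new-symbol path is shown.

Code:
#include <stdint.h>

#define NQ 7

static const int qsize[NQ] = { 10, 8, 16, 32, 64, 64, 64 };

typedef struct {
    uint32_t slot[64];     /* backing storage; qsize[i] entries used */
    int start;             /* index of the most recent entry */
} CircQueue;

static CircQueue q[NQ];

static uint32_t push_front(CircQueue *cq, int size, uint32_t sym)
{
    /* stepping the start offset back makes every entry one position
       older without copying; the slot we land on holds the evictee */
    cq->start = (cq->start + size - 1) % size;
    uint32_t evicted = cq->slot[cq->start];
    cq->slot[cq->start] = sym;
    return evicted;
}

void mtf_add(uint32_t sym)   /* the common add-new-symbol path */
{
    /* the entry falling off the back of each queue becomes the new
       top of the next, deeper (longer-code) queue */
    for (int i = 0; i < NQ; i++)
        sym = push_front(&q[i], qsize[i], sym);
    /* whatever falls out of the last queue leaves the mtf structure */
}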

I consider this a first cut and not that great. First, the queue code lengths should be derived from the file and either adjusted dynamically, using the same method for encoding and decoding, or described in the header. Second, there are probably faster implementations. Third, the number of symbols that go through the queue can probably be reduced dramatically by adding a dispersion indicator bit to each symbol definition. I think letting symbols into the queue whose dispersion is at or above what one would expect from random symbol placement wastes time and does not help the compressed file size.

    The main purpose was to get something reasonable running so I could better understand where the benefits of mtf are for Tree. From what I have seen, it appears that the distribution of mtf queue hit locations depends on the global symbol probability. If so, it makes sense to handle symbols with different global probabilities differently in mtf (as Tree does now). If not, Tree should be using a single queue. Either way, I am pretty sure some additional optimizations can be made. I have a feeling that in the long run I will want some dispersion indicators, a combined queue at the top, split queues beyond a certain depth, and variable queue probabilities derived from the recent history of mtf hits. But ultimately I will analyze the data and try to let the data tell me what scheme would be best for Tree.

One side note of interest: because the new mtf queue reduces the average code length, the penalty for using integer code lengths seems to have increased, since the impact on other symbols can be more dramatic when more codes should be 7.5 bits long rather than 18.5 bits (for example).
    Last edited by Kennon Conrad; 10th December 2014 at 08:11.

  36. The Following User Says Thank You to Kennon Conrad For This Useful Post:

    Paul W. (10th December 2014)

37. #510 Kennon Conrad (Member, Jan 2014, Bothell, Washington, USA, 685 posts)
    It's way past my bedtime, but I added explicit dispersion indicators (about 430K bits) and ended up with a compressed enwik9 file that is 391 KB smaller and recovers over half the lost decoding speed.

  38. The Following 2 Users Say Thank You to Kennon Conrad For This Useful Post:

    Paul W. (10th December 2014),surfersat (11th December 2014)
