
Thread: On fast discarding of uncompressible data

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts

    On fast discarding of uncompressible data

    While still working on my next public compressor,
    I'm trying to optimize its behavior when the source stream has some uncompressible parts (garbage, already-compressed data, random noise, whatever).
    This is particularly relevant when dealing with "container" type files, such as virtual HDD images for example.

    The main idea is to detect these sequences in order to skip them. Easier said than done.

    Currently, my algorithm goes through these sequences, frantically updating its pointer tables with many useless "noise pointers", measuring the lack of compression at the end, and therefore discarding all the fruitless effort made so far. From a compression ratio perspective this is more or less fine, but from a speed perspective it is simply a waste.

    I'm able to detect such situations at the Huffman/range encoding stage. This however does not help with pointer updates.
    I'm wondering if well-known fast compression algorithms have already implemented such mechanisms.

  2. #2
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    afair, slug

  3. #3
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Mmmh, unfortunately, Slug's author has left the scene, and I guess no source code has been released either.
    Anyway, I'm more interested in principles than in ready-to-compile source code.
    Any idea how Slug was able to achieve that fast-skipping strategy?

    Regards
    Last edited by Cyan; 18th March 2010 at 20:51.

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,511
    Thanks
    746
    Thanked 668 Times in 361 Posts
    unfortunately, most of us don't share our sources

  5. #5
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    I used to count order-0 entropy and discard chunks that come close to 8 bpc. This can also help find highly compressible chunks and select an appropriate LZ profile.
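
    For illustration, here is a minimal sketch of such an order-0 entropy check in C (not Fu Siyuan's actual code; the function names and the 7.9 bpc cut-off are assumptions):
    Code:
    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Order-0 entropy of a chunk, in bits per character (bpc).
       A value close to 8 bpc means the chunk is essentially
       incompressible for an order-0 model. */
    static double order0_bpc(const uint8_t *buf, size_t len)
    {
        size_t freq[256] = {0};
        double bpc = 0.0;

        for (size_t i = 0; i < len; i++)
            freq[buf[i]]++;

        for (int c = 0; c < 256; c++) {
            if (freq[c] > 0) {
                double p = (double)freq[c] / (double)len;
                bpc -= p * log2(p);            /* -sum p*log2(p) */
            }
        }
        return bpc;
    }

    /* Example policy: treat a chunk as incompressible above ~7.9 bpc. */
    static int chunk_is_incompressible(const uint8_t *buf, size_t len)
    {
        return len > 0 && order0_bpc(buf, len) > 7.9;
    }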

  6. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks Fu.
    Indeed, I do the same when reaching the entropy stage, but that stage happens only after match-finding, so a lot of time has already been lost.

    What you seem to suggest is that this analysis could be performed *before* the match finder?
    I was worried that it would cost too much, and that it would for example prevent finding two identical zip files within a stream. But maybe these are unfounded worries...

    Regards

  7. #7
    Member Fu Siyuan's Avatar
    Join Date
    Apr 2009
    Location
    Mountain View, CA, US
    Posts
    176
    Thanks
    10
    Thanked 17 Times in 2 Posts
    Quote Originally Posted by Cyan View Post
    I was worried that it would cost too much,
    What do you mean by 'too much'? It should be very fast.

    Quote Originally Posted by Cyan View Post
    and that it would for example prevent finding two identical zip files within a stream.
    That's exactly what I'm worrying about.

    Quote Originally Posted by Cyan View Post
    But maybe these are unfounded worries...
    Yes. In particular, I doubt it matters much for your very fast compressor.

    BTW, see the behaviour of -cF and -cf in NanoZip: -cf won't find duplicate
    uncompressible data but -cF can.
    Since only very long matches make sense, I'm considering this: even if we
    find 'bad data', the match finder should still scan the data and update the
    pointers every 2 or more bytes. When processing the next chunks, check
    the first several bytes to see if they are in the match finder. That way it will still detect duplicate chunks.
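
    For illustration, a sketch of this sparse-update idea for a simple hash-based match finder (not NanoZip's implementation; the 4-byte hash, the table size and the stride are assumptions):
    Code:
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define HASH_BITS 16
    #define HASH_SIZE (1u << HASH_BITS)

    /* head[h] = most recent position whose first 4 bytes hash to h. */
    static uint32_t head[HASH_SIZE];

    static uint32_t hash4(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, 4);                               /* read 4 bytes */
        return (v * 2654435761u) >> (32 - HASH_BITS);   /* multiplicative hash */
    }

    /* Normal chunks: index every position. */
    static void index_dense(const uint8_t *base, size_t pos, size_t len)
    {
        for (size_t i = pos; i + 4 <= pos + len; i++)
            head[hash4(base + i)] = (uint32_t)i;
    }

    /* "Bad" (seemingly incompressible) chunks: still index them, but only
       every few bytes, so a later byte-identical chunk can be recognized
       by probing its first few positions, at a fraction of the update cost. */
    static void index_sparse(const uint8_t *base, size_t pos, size_t len)
    {
        const size_t stride = 4;                        /* assumed stride */
        for (size_t i = pos; i + 4 <= pos + len; i += stride)
            head[hash4(base + i)] = (uint32_t)i;
    }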

  8. #8
    Member
    Join Date
    Feb 2010
    Location
    Nordic
    Posts
    200
    Thanks
    41
    Thanked 36 Times in 12 Posts
    Quote Originally Posted by Cyan View Post
    and that it would for example prevent finding two identical zip files within a stream
    Perhaps specifically searching for "PK" and other magic numbers might help in specific circumstances?

  9. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    Consider this: http://www.ctxmodel.net/files/PPMd/segfile_sh2.rar
    It actually does "a little" more than detection of incompressible data,
    but it can be used for that too.

    Also, it seems obvious that you'd need different processing loops
    for LZ and detection. It's possible to check only small amounts of
    data though - for example, process the data in 64k blocks and
    test the first 1k of each block, and pass the whole block uncompressed
    if that first 1k is not compressible.

    But that likely doesn't make much sense, as there are usually multiple
    applicable filters.
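
    For illustration, a sketch of that sampling loop (the 64k/1k figures are the ones from the post above; the callback interface is an assumption, and the compressibility test could be e.g. the order-0 entropy check sketched earlier in the thread):
    Code:
    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE  (64 * 1024)   /* process the input in 64k blocks       */
    #define SAMPLE_SIZE (1 * 1024)    /* test only the first 1k of each block  */

    /* Split the input into 64k blocks; run the compressibility test only on
       the first 1k of each block, and emit the whole block either stored
       (uncompressed) or through the real compressor.  All three callbacks
       are placeholders for the host codec. */
    void process_stream(const uint8_t *in, size_t size,
                        int  (*looks_incompressible)(const uint8_t *, size_t),
                        void (*emit_stored)(const uint8_t *, size_t),
                        void (*compress_block)(const uint8_t *, size_t))
    {
        for (size_t pos = 0; pos < size; pos += BLOCK_SIZE) {
            size_t len    = (size - pos < BLOCK_SIZE) ? size - pos : BLOCK_SIZE;
            size_t sample = (len < SAMPLE_SIZE)       ? len        : SAMPLE_SIZE;

            if (looks_incompressible(in + pos, sample))
                emit_stored(in + pos, len);        /* pass the block as-is */
            else
                compress_block(in + pos, len);     /* normal LZ path       */
        }
    }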

  10. #10
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks. It seems there are several detection possibilities to try from this post; I will have a look into them.

  11. #11
    Programmer osmanturan's Avatar
    Join Date
    May 2008
    Location
    Mersin, Turkiye
    Posts
    651
    Thanks
    0
    Thanked 0 Times in 0 Posts
    For LZ, you should try a longer context. I think order-1 or order-2 should be better than plain order-0 (an even longer context can slow down your detection; you could also try a hashed context to emulate a longer one). Just count frequencies and calculate the coding cost. If the data looks uniform, just skip it. As a detail, if you use a non-sliding window, it's easy:

    1- Read N bytes of data into the window (N = window size)
    2- Perform frequency counting under order-1 or order-2 and calculate the code length.
    3- If it's uniform, just copy this window and mark it as uncompressed. Otherwise go on to the next step.
    4- Compress the window with your favourite LZ method.

    I think Slug does exactly the same thing. Note that Slug has a non-sliding 8 MiB window and it supports the above process. Also, if your window is smaller than the L2 cache size, detection will theoretically be a "light-weight" process, because after detection all of your data should already be in the L2 cache, which makes your compression fast enough.
    BIT Archiver homepage: www.osmanturan.com
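
    For illustration, a minimal sketch of steps 2 and 3 above under an order-1 model (not Slug's or BIT's actual code; the 256x256 counter array and the 7.8 bpc threshold are assumptions):
    Code:
    #include <math.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Estimate the order-1 code length of a window, in bits: for each
       previous-byte context, sum -log2 of the empirical symbol
       probabilities within that context. */
    static double order1_code_length(const uint8_t *win, size_t n)
    {
        static uint32_t freq[256][256];      /* freq[prev][cur]           */
        static uint32_t ctx_total[256];      /* bytes seen in each context */
        memset(freq, 0, sizeof(freq));
        memset(ctx_total, 0, sizeof(ctx_total));

        for (size_t i = 1; i < n; i++) {
            freq[win[i - 1]][win[i]]++;
            ctx_total[win[i - 1]]++;
        }

        double bits = 8.0;                   /* first byte has no context: charge 8 bits */
        for (int p = 0; p < 256; p++) {
            if (!ctx_total[p]) continue;
            for (int c = 0; c < 256; c++) {
                if (!freq[p][c]) continue;
                double pr = (double)freq[p][c] / ctx_total[p];
                bits -= freq[p][c] * log2(pr);
            }
        }
        return bits;
    }

    /* Step 3: if the estimated cost is close to 8 bits per byte,
       copy the window as-is instead of running the LZ pass. */
    static int window_is_uniform(const uint8_t *win, size_t n)
    {
        return n > 0 && order1_code_length(win, n) / n > 7.8;
    }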

  12. #12
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    In this thread, Christian gives a few details about Slug's handling of uncompressible data.

  13. #13
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Thanks for the feedback.

    Indeed, Osman, checking order-1 frequency is a good idea, and quite probably enough to assess data compressibility.
    A remaining downside is that two identical uncompressible parts inside a container file (tar/iso/vmdk, etc.) could be missed. I'm also slightly worried about detection time, as we are talking here about a very fast compressor (in the range of 80-90 MB/s), so any detection must be either very fast or passive to avoid becoming noticeable. Especially if it is meant to increase speed...

    m^2: thanks for this link, it is indeed very informative on Christian's ideas.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > Indeed, Osman, checking order-1 frequency is a good idea,

    Not frequency, but entropy (aka code length, as osman said).
    It's what seg_file does too, with some kind of hashed order-2.

    > A remaining downside is that two identical uncompressible
    > parts inside a container file (tar/iso/vmdk, etc.) could be missed.

    Even a fast LZ imho is still different from a match filter (like rep/srep),
    so it seems reasonable to run a rep pass first, then detection/filters,
    then the actual compression (e.g. LZ with limited match length).

    > I'm also slightly worried about detection time, as we are
    > talking here about a very fast compressor (in the range of
    > 80-90 MB/s), so any detection must be either very fast or
    > passive to avoid becoming noticeable.

    It can be implemented with some kind of state machine,
    like
    Code:
    c = getc();                       // read the next input byte
    entropy += codelength[s][c];      // add the estimated code length of c in state s
    s = next_state[s][c];             // advance the state machine (context)
    Although, an interesting alternative might be to do a direct
    measurement... like keep a log of all memory modifications
    {address; value}, and restore the model after processing
    a block if it expanded.
    With CM there's kinda too much stuff to log, but maybe it
    can work out for LZ without a significant speed impact.
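
    For illustration, a sketch of such a modification log (hypothetical names, not seg_file's or any existing codec's code), assuming every model update goes through a single helper:
    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* One logged modification: where we wrote and what was there before. */
    typedef struct {
        uint32_t *addr;
        uint32_t  old_value;
    } LogEntry;

    #define LOG_CAPACITY 65536
    static LogEntry write_log[LOG_CAPACITY];
    static size_t   log_len;

    /* All model updates go through this helper so they can be rolled back.
       NOTE: if the log fills up, a real codec would need a fallback
       (e.g. give up on rollback for this block); omitted here. */
    static void model_store(uint32_t *addr, uint32_t value)
    {
        if (log_len < LOG_CAPACITY) {
            write_log[log_len].addr      = addr;
            write_log[log_len].old_value = *addr;
            log_len++;
        }
        *addr = value;
    }

    static void log_reset(void) { log_len = 0; }

    /* If the block expanded, replay the log backwards to restore the model
       to its state before the block was processed. */
    static void log_rollback(void)
    {
        while (log_len > 0) {
            log_len--;
            *write_log[log_len].addr = write_log[log_len].old_value;
        }
    }
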
    Last edited by Shelwien; 23rd March 2010 at 22:18.

  15. #15
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    6
    Thanks
    0
    Thanked 0 Times in 0 Posts
    You'll have to forgive me if this is a bad idea, as I'm pretty new to the realm of data compression (though I've been lurking and reading for quite some time now). But would a possible, and presumably very speedy, way to detect highly incompressible data be to check how well the previous few hundred, or thousand, bytes have compressed? If the past few hundred bytes compressed very poorly, you could switch to a much faster algorithm, and when that faster algorithm manages to compress a chunk of data well enough, switch back to the slower algorithm?

  16. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    That's kinda what Christian actually does.
    But this approach is a bit redundant (commonly there's no smooth
    transition from redundant to random data - some chunks are just
    incompressible because they're already compressed), and not exactly
    universal (one has to think which features can be disabled in the
    main algorithm to adapt it to incompressible data), and also symmetric
    (so it slows down the decompression).

    Thus it seems better to actually detect incompressible blocks with
    some specialized algorithm (independent of the main codec)
    at compression stage and then just copy them on decompression.
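
    For illustration, a sketch of the decoder side of that approach, assuming a hypothetical framing of one flag byte plus a 3-byte little-endian length per block (not any particular codec's format):
    Code:
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    enum { BLOCK_STORED = 0, BLOCK_COMPRESSED = 1 };

    /* Decode one block: [flag:1 byte][length:3 bytes LE][payload].
       Stored blocks are copied verbatim; only compressed blocks go
       through the (hypothetical) LZ decoder.  Returns bytes consumed. */
    size_t decode_block(const uint8_t *src, uint8_t *dst, size_t *out_len,
                        size_t (*lz_decode)(const uint8_t *, size_t, uint8_t *))
    {
        uint8_t flag = src[0];
        size_t  len  = src[1] | (src[2] << 8) | ((size_t)src[3] << 16);

        if (flag == BLOCK_STORED) {
            memcpy(dst, src + 4, len);       /* no model, no adaptation: just copy */
            *out_len = len;
        } else {
            *out_len = lz_decode(src + 4, len, dst);
        }
        return 4 + len;
    }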

  17. #17
    Member
    Join Date
    Mar 2010
    Location
    Canada
    Posts
    6
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    That's kinda what Christian actually does.
    But this approach is a bit redundant (commonly there's no smooth
    transition from redundant to random data - some chunks are just
    incompressible because they're already compressed)
    I'm not quite sure how the method I mentioned would be redundant; I may need to be educated :P Of course it would miss areas where it could compress data, and it would attempt to compress areas where it could not compress data, but the goal is to minimize both of these with the very minimum of speed costs, is it not?

    Quote Originally Posted by Shelwien View Post
    and not exactly
    universal (one has to think which features can be disabled in the
    main algorithm to adapt it to incompressible data), and also symmetric
    (so it slows down the decompression).

    Thus it seems better to actually detect incompressible blocks with
    some specialized algorithm (independent of the main codec)
    at compression stage and then just copy them on decompression.
    Well if I read what Christian did correctly, he is switching between algorithms; is this not possible in Cyan's case? Or is the goal to create a method better than the one Christian is using, rather than optimize his?

  18. #18
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,424
    Thanks
    223
    Thanked 1,053 Times in 565 Posts
    > I'm not quite sure how the method I mentioned would be
    > redundant,

    Well, not really the method, but its output :)

    > Of course it would miss areas where it could compress
    > data, and it would attempt to compress areas where it
    > could not compress data, but the goal is to minimize both
    > of these with the very minimum of speed costs, is it not?

    1. Having different code branches in the same loop is potentially
    slow; keeping different algorithms separate is usually better for
    performance.
    2. Decoding can be noticeably faster if it can just read an
    uncompressed-block flag and copy the block, instead of adapting
    in sync with the encoder.
    3. A static detection method can also minimize redundancy.

    > Well if I read what Christian did correctly, he is
    > switching between algorithms;

    He does, and certainly it's usually easier to implement it that way -
    symmetry means that you don't have to write two handlers
    (for encoding and decoding), and using side effects of the main
    method means no duplicate code.

    > is this not possible in Cyan's case?

    It's more like I can't really give advice about that,
    because I don't know the specifics of Cyan's algorithm
    (and maybe he doesn't know either :).
    And either way, if there's a superior method, even one requiring
    a little more programming, why not use it :)

    > Or is the goal to create a method better than the one
    > Christian is using, rather than optimize his?

    It's not a matter of "optimizing his"; it has to be solved
    from scratch for each specific codec, though Christian likely
    enjoys doing exactly that (I mean modifying the given codec
    to efficiently process random data).
    So even though there are examples of such implementations
    (in fact, I actually prefer such adaptive methods myself, e.g. for
    bitrate control), I wouldn't recommend it for a fast LZ codec.

    Btw, here is an old example I found (order-2 bitwise CM) with
    incompressible block skipping:
    http://shelwien.googlepages.com/order_test4.rar
    Code:
    ver      c.size   c.time d.time
    o2_io9co 20281316 61.751 63.718 io9c + /Qprof_use 
    o2_ioA   20281316 62.938 66.813 io9c + probability caching on encoding (by 1M=128k bytes) 
    o2_ioB   20166375 64.639 67.250 ioA + copy on redundant blocks, PSize=1320, optimized for wcc386 
    o2_ioBa  20160989 64.156 67.204 ioA + copy on redundant blocks, PSize=256
    It's different from what I said though, because the decoder simulates (de)compression
    of stored blocks to stay in sync with the encoder statistics.
    But again, for a fast LZ it's less troublesome.
    And maybe these stats (compression gain and speed loss) would still be useful.
    Last edited by Shelwien; 24th March 2010 at 18:44.

  19. #19
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Indeed, Shelwien.
    I'm considering releasing a version sooner, with fewer optimizations than intended, in order to have "something" out to test and improve upon. This seems a more reassuring approach, avoiding too long a tunnel before release.

    Regarding "skipping uncompressible parts", I'm likely to start with a very simple method that provides clear (measurable) speed benefits, and improve from there. Talking about the "best method" or even "better than XXX" seems too ambitious an objective at this stage.
