
Thread: another (too) fast compressor

  1. #121
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    > My advice is remove any features you don't absolutely need.

    That's sound advice.
    Thanks for the suggestions, Matt. I'll ponder them.

  2. #122
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts

    LZ4Mod

    Despite its popularity, I only examined the LZ4 algorithm recently.
    A few notes:
    • Why do we waste the "0" offset, using only offsets 1..65535 instead of the full range of 1..65536?
    • I see no reason to limit the decoder in any way; e.g., the last chars of a block need not be literals.
    • Why limit the max block size to 4 MB? 8/16/32 MB are quite adequate numbers these days.
    • In some cases there is no reason to emit the EOF mark (block size of zero).
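For context, the block format these notes discuss is simple enough to decode in a few lines. Below is a minimal, unoptimized sketch of an LZ4 block decoder (standard token layout: high nibble = literal length, low nibble = match length minus 4, followed by a 2-byte little-endian offset; a length nibble of 15 spills into extension bytes):

```python
def lz4_decompress_block(src: bytes) -> bytes:
    """Minimal LZ4 block decoder: no bounds checking, for illustration only."""
    out = bytearray()
    i = 0
    while i < len(src):
        token = src[i]; i += 1
        # literal length: high nibble; 15 means extension bytes follow
        lit_len = token >> 4
        if lit_len == 15:
            while True:
                b = src[i]; i += 1
                lit_len += b
                if b != 255:
                    break
        out += src[i:i + lit_len]; i += lit_len
        if i >= len(src):
            break                       # last sequence carries literals only
        offset = src[i] | (src[i + 1] << 8); i += 2
        match_len = (token & 0x0F) + 4  # minimum match length is 4
        if (token & 0x0F) == 15:
            while True:
                b = src[i]; i += 1
                match_len += b
                if b != 255:
                    break
        pos = len(out) - offset         # byte-wise copy handles overlapping matches
        for _ in range(match_len):
            out.append(out[pos]); pos += 1
    return bytes(out)
```

For example, `bytes([0x35]) + b"abc" + bytes([0x03, 0x00])` encodes three literals followed by a 9-byte overlapping match at offset 3, and decodes to `abcabcabcabc`.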

    Taking all of that into account, I have written my own LZ4 implementation called LZ4Mod. It's a slightly modified LZ4: same 64 KB LZ window, 16 MB blocks, and no format restrictions.

    Some testing results with different parsing methods were posted here.


  3. #123
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Hi Ilya


    An unexpected and welcome investigation


    > Why do we waste the "0" offset, using only offsets 1..65535 instead of the full range of 1..65536?

    Keep in mind LZ4 was created a long time ago, in a far far away gal...
    Wait, not that far, but at least at a time I was still learning compression and programming in general.
    The format was created to learn data compression, and features first and foremost simplicity.
    Speed and efficiency were an accidental side effect.

    Therefore, the format has oversights I probably wouldn't accept today.
    I'm not sure that proved bad in the end, since LZ4's most important asset remains its simplicity, from an interface and specification standpoint.

    For the offset, indeed, we currently waste one value (0).
    Fortunately, the impact on compression is expected to be extremely minor (especially in fast mode, which remains its primary use case).

    And in the end, it could be a blessing in disguise.
    I believe I've got a usage for it. But it will have to wait for LZ4v2...


    > I see no reason to limit the decoder in any way; e.g., the last chars of a block need not be literals.

    Historical limitations.
    Initially, lz4 was just experimental and all about speed. It did not care about out-of-bounds checks, for example. That requirement came right after open-sourcing the algorithm.

    The interfaces back then were still "unsafe": they knew the size to regenerate, but not the size of the compressed data.
    How do you guarantee zero out-of-bounds reads/writes without knowing the buffer size?

    The first trick was to impose end-of-block conditions.
    The last 5 bytes are necessarily literals, because 5 (literals) + 1 (token) + 2 (offset) == 8, which is the size of an unconditional literal copy.
    The last match must start at least 12 bytes before the end of the block, because an unconditional match copy moves a minimum of 12 bytes.
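The 8-byte figure comes from the unconditional copy loop: it always moves whole 8-byte chunks, so it can write past the requested length. A toy Python illustration of that overrun:

```python
def wild_copy(dst: bytearray, src: bytes, length: int) -> int:
    # copy `length` bytes in unconditional 8-byte chunks, as LZ4's fast
    # path does; returns how many bytes were actually written
    copied = 0
    while copied < length:
        dst[copied:copied + 8] = src[copied:copied + 8]
        copied += 8
    return copied
```

A request for 5 bytes writes 8. With the end-of-block guarantees above, the decoder can always use these fast copies without per-byte bounds checks, because enough slack is known to exist before the end of the buffer.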

    Of course, this requires the input to be well formed.
    That sounded reasonable at first, but eventually lz4 had to adapt by introducing safe interfaces.
    Today, bounds checks are enforced no matter what, because there is no way to guarantee that input is well formed.
    Protecting against attack scenarios is, obviously, a necessity, but only because of success. You can guess it did not matter yet when the algorithm was just a learning toy...

    Bottom line :
    it would be possible today to remove these limitations from decoder, by ditching the last unsafe interfaces.
    But that, also, will have to wait for LZ4v2.


    > Why limit the max block size to 4 MB? 8/16/32 MB are quite adequate numbers these days.

    The Frame format limits block sizes to 4 MB.
    But LZ4 itself, as a memory-block compressor, doesn't impose such a limit: it can compress as much as 1.99 GB as a single block.
    I wouldn't advise going to such limits due to memory requirements, but you could indeed compress enwik9 as a single memory block.
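For anyone sizing the destination buffer for such a single-block compression, the worst-case output size follows LZ4's LZ4_COMPRESSBOUND formula:

```python
def lz4_compress_bound(src_size: int) -> int:
    # worst-case compressed size (mirrors the LZ4_COMPRESSBOUND macro):
    # incompressible data expands by at most 1 byte per 255, plus 16 bytes
    return src_size + src_size // 255 + 16
```

So a 1 GB input needs a destination buffer of about 1.0039 GB in the worst case, on top of the input buffer itself, which is where the memory-requirements caveat comes from.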


    > In some cases there is no reason to emit the EOF mark (block size of zero)

    Again, this is only a feature of the LZ4 Frame format, which was not created for the ultimate compression ratio.
    A lot of applications don't want to pay the cost of Frame metadata, including the EOF mark, and as a consequence use the Block interface directly, which leaves the user program in charge of headers.
    The price to pay is not being compatible with other lz4 implementations, which in most circumstances is not a problem, since the compressed data remains application-internal.

    Btw, LZ4 was initially created as just a block interface; as a consequence, multiple adaptations appeared, each using its own incompatible header convention.
    The LZ4 Frame format was created afterwards, to provide a common ground for applications willing to share compatible .lz4 files.


    > Some testing results with different parsing methods are here:

    These are very interesting results.
    Would you mind explaining the difference that makes DynProg a better performer than StorerSzymanski?


    Regards

  4. #124
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Quote Originally Posted by Cyan View Post
    Would you mind explaining the difference that makes DynProg a better performer than StorerSzymanski?
    Long story short: StorerSzymanski builds the optimal path considering only the best match at each position; by contrast, my DynProg builds the optimal path by evaluating ALL matches at each position. And it's actually hell for an LZ with unlimited match lengths. Anyway, it delivers!
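A minimal sketch of that dynamic-programming idea, with illustrative costs (not LZ4's real bit costs): relax the cheapest total cost over a literal and over every available match at each position.

```python
def optimal_parse(n: int, matches: dict, lit_cost: int = 1, match_cost: int = 3):
    # matches[i] = list of match lengths usable at position i;
    # cost[i] = cheapest encoding of the first i bytes
    INF = float("inf")
    cost = [INF] * (n + 1)
    cost[0] = 0
    for i in range(n):
        if cost[i] == INF:
            continue
        cost[i + 1] = min(cost[i + 1], cost[i] + lit_cost)       # emit a literal
        for mlen in matches.get(i, []):                          # try ALL matches
            if i + mlen <= n:
                cost[i + mlen] = min(cost[i + mlen], cost[i] + match_cost)
    return cost[n]
```

A best-match-only parser would keep just one candidate per position; evaluating every match length at every position is what makes this expensive for an LZ with unlimited match lengths.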

    BTW, should I wait for LZ4v2?
    I'd like to release an LZ4 compatible compressor with DynProg (Extreme/Maximum) mode.


  5. #125
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    > BTW, should I wait for LZ4v2?

    I don't think so. There is no date set for LZ4v2, and my availability is currently tied to zstd.
    Best bet is the end of the year, or later.


    > I'd like to release an LZ4 compatible compressor with DynProg (Extreme/Maximum) mode.

    Sounds like a great plan !
    Hopefully, you can nonetheless release it for the current LZ4 format.
    Even after LZ4v2 gets introduced, the current format will likely remain the most widely supported variant for quite some time, due to ecosystem latency.

  6. #126
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    I think LZ4Demo (aka the Legacy frame) looks like a good candidate then. Where can I find the original LZ4Demo.exe to confirm compatibility with my implementation?
    EDIT:
    Uh, looks like the Legacy frame stores the compressed size, not the original size as with the latest LZ4 or my implementation. Is there any way to decode a raw data stream (a sequence of blocks with no header) using lz4.exe?

  7. #127
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    > Is there any way to decode raw data stream (sequence of blocks with no header) using lz4.exe?

    Not with lz4.exe, which only accepts validated frames.
    The legacy format is simpler, but much more rigid.
    So it's not recommended.
    Last edited by Cyan; 2nd February 2016 at 03:40.

  8. #128
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    My proposal for the simplified LZ4 file format:
    Code:
    int magic; // == "LZ4m" or "LZ4%", or suggest your own
    
    int size; // Uncompressed block size
    // Compressed data
    
    int size; // 0 means EOF
    // if size>0, Compressed data
    // etc.
    The default max block size is 8 MB. There is actually no need to specify the max block size explicitly in a file header: in the worst case, for a given decoder, size > MAX_BLOCK_SIZE simply means the decoder can't handle it, so a decoder with a large enough max block size can handle any smaller blocks. Alternatively, allocate memory according to the first block.
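A sketch of a reader for this hypothetical format (the `read_frames` name and the "LZ4m" magic are placeholders from the proposal). Since the size field stores the uncompressed size, block boundaries are only known to the block decoder, which must therefore consume the stream itself:

```python
import struct
from io import BytesIO

MAGIC = b"LZ4m"  # placeholder magic from the proposal

def read_frames(stream, decode_block):
    # decode_block(stream, out_size) must consume exactly one compressed
    # block from the stream and return out_size decompressed bytes
    if stream.read(4) != MAGIC:
        raise ValueError("not an LZ4m stream")
    out = bytearray()
    while True:
        (size,) = struct.unpack("<I", stream.read(4))
        if size == 0:                 # EOF mark: a zero-sized block
            return bytes(out)
        out += decode_block(stream, size)
```

With a stand-in codec that stores blocks uncompressed, a stream of two blocks ("hello", "abc") followed by the zero-size EOF mark reassembles to "helloabc".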

    Blocks are always compressed (independently), as with legacy LZ4.

    No limitations on LZ stream - literals and matches can appear in any order.

    Match distance of "0" is unused (can be used for error detection or as a special code later).

    All in all, it's an LZ4 compatible sequence of compressed blocks with a magic number.

    Also, you may add additional block sizes:
    0 - 8 MB (Default)
    1 - Reserved or 16 MB
    2 - Reserved or 32 MB
    3 - Reserved
    4 - 64 KB
    5 - 256 KB
    6 - 1 MB
    7 - 4 MB

    LZ4.exe is a testbed for your brilliant library. I see no reason to limit a modern file compressor in any way. Likewise, I see no reason to store a checksum for a frame header in such a simple format. Use CRC32 (a slicing-by-8 implementation) right after the EOF mark. No extra flags are required: if a decoder cannot read past the EOF mark, no checksum is stored. As simple as that.

    BTW, my proposal for LZ4v2 is to use a larger 128 KB LZ window. This requires modifying the "token" byte only:
    Keep 3 bits for the literal run (instead of 4)
    Add 1 extra bit for the LZ distance (for example, the lowest bit of the distance)
    Keep 4 bits for the match length

    Or keep 4 bits for the literal run and just 3 bits for the match. Or keep 3 bits each for literals and matches and add 2 extra bits for the LZ distance, extending the LZ window to 256 KB!
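One possible bit assignment for the first variant; the exact bit positions are my assumption, since the post doesn't pin them down:

```python
def pack_token(lit_run: int, dist_bit: int, match_code: int) -> int:
    # 3 bits literal run | 1 low bit of distance | 4 bits match length
    assert 0 <= lit_run < 8 and dist_bit in (0, 1) and 0 <= match_code < 16
    return (lit_run << 5) | (dist_bit << 4) | match_code

def unpack_token(token: int):
    return token >> 5, (token >> 4) & 1, token & 0x0F

def match_distance(offset16: int, dist_bit: int) -> int:
    # 16-bit stored offset plus the extra token bit: up to a 128 KB window
    return (offset16 << 1) | dist_bit
```

The extra bit doubles the addressable distance (2^17 - 1 = 131071 bytes) while keeping the offset field at 2 bytes.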

    Don't take my observations about the LZ4 too seriously, it's just my thoughts to consider or not!
    Hope it helps! Cheers!

  9. #129
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    These are pretty good suggestions Ilya. Will keep that link in mind for future format evolution.


    Note that if you release a compressor using its own frame format, it will likely limit its audience because of interoperability issues.

    Btw, your proposed format is pretty close to the Legacy Frame format (see http://www.lz4.org/lz4_Frame_format.html).
    I suspect it's very close to being compatible.


    > Use CRC32 (slicing-by-8 implementation) right after EOF mark. No extra flags required, if a decoder cannot read after EOF - there is no checksum stored. As simple as that.
    Some usages require frame concatenation capability.


    Regards
    Last edited by Cyan; 2nd February 2016 at 17:09.

  10. #130
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Anyway, I can't find a link to older versions of LZ4. In particular, I need the latest LZ4Demo.exe (legacy frame).

  11. #131
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    To rebuild the older LZ4Demo.exe, one would have to download the lz4 source history (from SVN on Google Code, or git on GitHub), go back to an earlier commit, restore the code, and compile.


    That being said, if the point is to have code or a program that can read/write Legacy frames, you don't need to do that.

    The current version of lz4 is able to generate and read legacy frames.
    For reading, you don't have to do anything special: the format is automatically detected and handled. Using `-v` will also trigger a warning message.

    For generation, it's necessary to request the legacy format with the `-l` command (see `./lz4 -h`, or `man lz4` on Linux systems).

    Btw, the code that handles Legacy frames is in lz4io.c

  12. #132
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    -l is a great switch that should be documented!
    Confirmed: my implementation is compatible with your LZ4. My result for ENWIK9 in legacy format is 372,083,730 bytes (decompression verified).

  13. Thanks:

    Cyan (3rd February 2016)

  14. #133
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    That's quite a few megabytes better than lz4's highest compression mode.
    Great work, Ilya !

  15. #134
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Can't wait for it

  16. #135
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    At this moment, LZ4Mod can unpack legacy .lz4 files produced by LZ4.
    However, LZ4 cannot unpack files produced by LZ4Mod (Error 52 - Corrupted input).
    Like I said, with LZ4Mod the last chars of a block may not be literals, just as with regular LZ77.
    An unused match distance or small blocks are somewhat okay (distance 0 means EOF in LZO, for example).
    But I disagree that the last chars of a block must be literals! At the very least, the decoder should handle it! I wrote my decoder in 5 minutes and see no reason for such a limitation!

  17. #136
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Result for enwik9 if the last 5 bytes of a block are always literals: 372,084,180 bytes.
    And LZ4 CAN decode it!

  18. #137
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    865
    Thanks
    463
    Thanked 260 Times in 107 Posts
    Still pretty good.

    Well, with block sizes of 8 MB, these end-of-block inefficiencies have very little impact.

    > I disagree that the last chars of a block must be literals! At the very least, the decoder should handle it!

    This is now on the todo list, and will be part of LZ4 v2.

  19. #138
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    Work in progress...

    Anyway, here are some optimized LZ4 files:
    Attached Files

  20. #139
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    3,984
    Thanks
    377
    Thanked 352 Times in 140 Posts
    I might call my program LZ4Opt

    The program will have 3 or 4 compression modes:
    cf - Fast (Not the Fastest)
    c - Normal
    (cx - Max)
    cb - Brute

    I have completely rewritten my LZ77 framework - this program is my best LZ77-based compressor so far (in terms of efficiency and decompression speed).



    EDIT:
    Code:
    Optimized LZ4 compressor by Encode, v1.00
    
    Usage: lz4opt command infile outfile
    
    Commands:
        c Compress (cf = fast, cb = brute)
        d Decompress

  21. Thanks (3):

    comp1 (6th February 2016),Cyan (6th February 2016),Mike (6th February 2016)

  22. #140
    Member
    Join Date
    May 2012
    Location
    United States
    Posts
    324
    Thanks
    182
    Thanked 53 Times in 38 Posts
    Quote Originally Posted by encode View Post
    I might call my program LZ4Opt

    The program will have 3 or 4 compression modes:
    cf - Fast (Not the Fastest)
    c - Normal
    (cx - Max)
    cb - Brute

    I have completely rewritten my LZ77 framework - this program is my best LZ77-based compressor so far (in terms of efficiency and decompression speed).



    EDIT:
    Code:
    Optimized LZ4 compressor by Encode, v1.00
    
    Usage: lz4opt command infile outfile
    
    Commands:
        c Compress (cf = fast, cb = brute)
        d Decompress
    Can't wait for the release!


