
Thread: Wondering whether to release my deflate recompressor

  1. #1
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts

    Wondering whether to release my deflate recompressor

    The thing goes like this:
    I wrote an experimental deflate recompressor that, much like precomp, does lossless recompression of zlib streams. It's much faster than precomp and can be a tiny bit stronger (because it supports multiple zlib versions and, I think, more zlib parameters).
    It's experimental and was done in a big rush: its input is just a subset of the gzip standard, there is no user interaction, barely any error handling, little documentation, and likely bugs. I intended to bring it to a usable state, but I haven't been able to get myself to start working on it for a month now and I'm afraid I never will.
    Nevertheless, it works, or at least mostly works, and might serve as a reference or maybe even a base for something.
    So on one hand I'd rather not release it because I'm quite ashamed of the sorry state of the code, and on the other hand I don't like seeing it go to waste.

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    The advantage of releasing it is that others can test it for you, and eventually you'll end up with a solid program. Publishing free software is also good for your reputation, which helps with getting a job doing what you like to do.

  3. #3
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    89
    Thanks
    8
    Thanked 4 Times in 4 Posts
    A new toy for free? For me it is a rhetorical question

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
    i badly want to add such a thing to freearc

  5. #5
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Matt Mahoney View Post
    The advantage of releasing it is that others can test it for you, and eventually you'll end up with a solid program. Publishing free software is also good for your reputation, which helps with getting a job doing what you like to do.
    Actually I'm afraid of the adverse effect this might have on my career ^^

    But here it is.
    Most significant limitation:
    It supports only gzip files with a single deflate block, so many bigger ones don't work.
    Usage:
    Create a file named cabal.gz in the working directory and run rzlib.exe.
    Attached Files

  6. #6
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    i badly want to add such a thing to freearc
    You'll have to invest in it quite a bit, sorry. I actually wrote it in C++, but my first thought about the technology was Python + C++; I decided against that because I wanted to put it in FreeArc. Then I started with Haskell + C, but dropped Haskell because the cost of interfacing it with C++ was too high, especially with my low experience with it and extremely limited time. So I turned to C++ because I feel better with it.

    I can offer help with understanding the code if it's unclear at some point. Maybe some minor coding too, but no promises - the sympathetic responses have made me more optimistic about the program.

  7. #7
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
    it's hard to understand but anyway - it's the first open-sourced deflate recompressor

    and compressors aren't the sort of thing where haskell is needed. i've seen a pure haskell deflate implementation that was thousands of times slower than the original

  8. #8
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    How about explaining what it's supposed to do?..
    Afaik it bruteforces the zlib version/level/winsize/memlevel/strategy,
    tries to compress the block using each parameter combination,
    then computes the diff of original and restored block (commented out),
    then compresses the block data with LZ4.
    Recompressed data and parameters are stored to different files.

    Frankly, I don't think that this approach has any potential (using 3rd-party format/codec libs for recompression) -
    there're too many modified zlib versions around. And forcing these libs to do what you want
    easily can take more work than writing your own format parsers and such.
    Also converting modules based on random access to files to streamable ones usually means a complete rewrite.
    Still, it can be helpful as a zlib.dll usage example, I guess :)

  9. #9
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Bulat Ziganshin View Post
    it's hard to understand but anyway - it's the first open-sourced deflate recompressor

    and compressors aren't the sort of thing where haskell is needed. i've seen a pure haskell deflate implementation that was thousands of times slower than the original
    Practically all runtime is spent in zlib, so it doesn't matter if my part is slow, so it's a great case to use a productive language.

    Quote Originally Posted by Shelwien View Post
    How about explaining what it's supposed to do?..
    Afaik it bruteforces the zlib version/level/winsize/memlevel/strategy,
    tries to compress the block using each parameter combination,
    then computes the diff of original and restored block (commented out),
    then compresses the block data with LZ4.
    Recompressed data and parameters are stored to different files.
    Roughly correct, but I'll say it in other words; I think it will be clearer.
    Like you said, the program bruteforces the zlib version/level/winsize/memlevel/strategy - all zlib parameters except the few settable by the deflateTune() function, which hopefully isn't used too often. It takes a 64 KB chunk of the input file, decompresses it, tries to recompress it with different parameters and computes how similar the recompressed output is to the original (LZ4 is used as the distance function: I compress the original and recompressed data together).
    After trying all combinations it takes the best one and recompresses the full stream. The output is stored in 2 files (one stores the decompressed data and the other the data needed for recompression), though they could be combined with no significant work by placing metadata in front of data; the program can find the end of metadata by itself.
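    A minimal sketch of that brute-force idea, assuming plain C++ against the stock zlib API (the function names, the byte-wise distance and the parameter ranges are mine; the real tool uses LZ4 as the distance function and also varies the zlib version, which this sketch doesn't):
    Code:
    #include <zlib.h>
    #include <cstring>
    #include <vector>

    struct Params { int level, windowBits, memLevel, strategy; };

    // Compress `raw` as a raw deflate stream with the given parameters.
    static std::vector<unsigned char> deflate_with(const std::vector<unsigned char>& raw,
                                                   const Params& p) {
        z_stream s;
        std::memset(&s, 0, sizeof(s));
        // Negative windowBits -> raw deflate, no zlib/gzip wrapper.
        if (deflateInit2(&s, p.level, Z_DEFLATED, -p.windowBits, p.memLevel, p.strategy) != Z_OK)
            return {};
        std::vector<unsigned char> out(deflateBound(&s, (uLong)raw.size()));
        s.next_in   = const_cast<unsigned char*>(raw.data());
        s.avail_in  = (uInt)raw.size();
        s.next_out  = out.data();
        s.avail_out = (uInt)out.size();
        int rc = deflate(&s, Z_FINISH);
        out.resize(rc == Z_STREAM_END ? s.total_out : 0);
        deflateEnd(&s);
        return out;
    }

    // Toy distance: everything outside the common prefix counts as a mismatch.
    static size_t distance(const std::vector<unsigned char>& a,
                           const std::vector<unsigned char>& b) {
        size_t n = a.size() < b.size() ? a.size() : b.size(), same = 0;
        while (same < n && a[same] == b[same]) ++same;
        return a.size() + b.size() - 2 * same;
    }

    // `raw` is the decompressed 64 KB sample, `orig` the original deflate bytes.
    Params find_best_params(const std::vector<unsigned char>& raw,
                            const std::vector<unsigned char>& orig) {
        static const int strategies[] = { Z_DEFAULT_STRATEGY, Z_FILTERED, Z_HUFFMAN_ONLY };
        Params best = { 6, 15, 8, Z_DEFAULT_STRATEGY };
        size_t bestDist = (size_t)-1;
        for (int level = 1; level <= 9; ++level)
            for (int wb = 9; wb <= 15; ++wb)
                for (int ml = 1; ml <= 9; ++ml)
                    for (int st : strategies) {
                        Params p = { level, wb, ml, st };
                        size_t d = distance(orig, deflate_with(raw, p));
                        if (d < bestDist) { bestDist = d; best = p; }
                    }
        return best;
    }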

    Quote Originally Posted by Shelwien View Post
    Frankly, I don't think that this approach has any potential (using 3rd-party format/codec libs for recompression) -
    there're too many modified zlib versions around. And forcing these libs to do what you want
    easily can take more work than writing your own format parsers and such.
    Still, it can be helpful as a zlib.dll usage example, I guess
    The method is clearly less robust than yours, but I expected it to be significantly stronger too. Actually, only now did I look at the hif size of reflate, and I see that I was wrong. Now I too think that this might be a dead end. I have mixed feelings about it: on one hand it's sad to write a tool that would be obsolete less than a month later, but OTOH writing generic deflate models is a much more elegant solution and it's cool that it works. I guess it's quite complex though...

    Quote Originally Posted by Shelwien View Post
    Also converting modules based on random access to files to streamable ones usually means a complete rewrite.
    You mean my tool? No, it should be very simple actually. IIRC the only seek is 64 KB back, so it can be easily removed.

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    > Practically all runtime is spent in zlib, so it doesn't matter if my
    > part is slow, so it's a great case to use a productive language.

    My opinion is that all high-level languages (ones with built-in strings etc)
    are very bad for parsing binary formats.
    For example, look at this:
    http://pastebin.com/PP9eMk3c (C++)
    http://pastebin.com/mkwXrDKu (python)
    and tell me which is a better description of the format.

    Yes, they can do it and I'd not be surprised by a smaller source even,
    but afaik it would be write-only - things that you have to decrypt
    with a language manual are clearly not descriptive.

    In zlib case that could mean that you would do detection by appending
    single symbols to a string ( s=substr(s,1)+newchar -> 2-3 instances of
    64k string with dynamic allocation ) and running ungzip and gzip on it
    (called via weird thunks with likely more string instances allocated,
    while being already bloated from the start), while normally you would
    know which part of a stream is a valid deflate code just by decoding.
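    For illustration, a minimal sketch of that detect-by-decoding idea against the zlib inflate API (the function name and the strict "complete stream only" policy are mine; a real detector would also accept truncated-but-valid prefixes):
    Code:
    #include <zlib.h>
    #include <cstring>
    #include <vector>

    // Returns how many bytes starting at `data` decode as one complete raw
    // deflate stream, or 0 if they don't.
    size_t valid_deflate_prefix(const unsigned char* data, size_t len) {
        z_stream s;
        std::memset(&s, 0, sizeof(s));
        if (inflateInit2(&s, -15) != Z_OK)        // -15 = raw deflate, 32K window
            return 0;
        std::vector<unsigned char> sink(1 << 16); // decoded output is thrown away
        s.next_in  = const_cast<unsigned char*>(data);
        s.avail_in = (uInt)len;
        int rc = Z_OK;
        while (rc == Z_OK) {                      // resume whenever the sink fills up
            s.next_out  = sink.data();
            s.avail_out = (uInt)sink.size();
            rc = inflate(&s, Z_NO_FLUSH);         // stops at Z_STREAM_END or an error
        }
        size_t consumed = (rc == Z_STREAM_END) ? len - s.avail_in : 0;
        inflateEnd(&s);
        return consumed;
    }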

    Sometimes complex tools add more problems than they solve, and
    decoding/encoding of binary formats is certainly one of these cases, imho.

    > they could be combined with no significant work by placing metadata in front of data;
    > The program can find the end of metadata by itself.

    If you mean attaching it after the actual recompression, then it's not streamable.

    Actually the problem of interleaving unp/hif into a single stream is the most
    troublesome for me now.

    > Actually, only now did I look at the hif size of reflate, and I see that I was wrong

    Yes, for pure zlib streams the overhead is about 15 bytes per MB of deflate data,
    and it could be 1 bit per file if I forgot about streaming.

    However it's not all perfect yet, as I only tested gzip levels as parameters
    (and I'm still working on a detector).
    But I think that parameter support has to be added only after some real tests
    which would demonstrate where each parameter is necessary.
    For example, I already know that I don't need any special support for
    "huffman only" strategy (match model can handle such blocks as stored blocks).

    Also I know that it's not necessary to consider different zlib versions.
    Quoting my comment from SO:
    I've tested zlib 114 and 121-125 and generated code seems to be
    identical with all options (there're little differences in zip
    archive headers though). But there're other libraries (for example,
    zlib version in Intel IPP) which can be used instead of zlib, and do
    generate different code even now. And there's no guarantee for
    future zlib versions either, as they might include parallel
    processing, or some other features, which would affect compression.
    Then, reducing the window size doesn't normally make sense, as well
    as changing that memlevel parameter.
    So I'd expect the practical set of parameter combinations for bruteforce
    to be much smaller than it seems.
    (Also in my case a parameter mismatch would only affect the diff size,
    and parameters like memlevel won't affect it too much.)

    > it's sad to write a tool that would be obsolete less than a month later

    Actually I don't think that you've written anything like that there.
    I mean, the interfaces with zlib/LZ4/bsdiff certainly might be useful someday
    and I don't see much recompression-specific code there.

    > I guess it's quite complex though...

    Yes, in fact there're 3 different recompressors:
    1. LZ token recompressor based on zlib match models (fast/slow models are different) with unpacked data as context
    2. uncompressed huffman length table recompressor with LZ tokens as context
    3. RLE-coded length table recompressor (from uncompressed table)
    4. uncompressed huffman length table again (the lencode table), but it's the same as #2.
    And also CM models for various structures.

    > IIRC the only seek is 64 KB back, so it can be easily removed.

    Good if so, but to me it looks like all your read*/write* functions need rewriting.
    One particular nuance of stream processing is that you never know where and how
    you receive/send the data, so the best way is to implement it for a buffer in memory,
    with appropriate bound checks and resuming.
    The more common approach is to request i/o handlers from the app,
    but it's annoying as hell - my lzma decoder based on the lzma SDK (not counting the SDK itself)
    is about the same size as my standalone lzma decoder.
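    A rough illustration of that buffer-in-memory, resumable style (essentially the z_stream calling convention); the names and the trivial pass-through "codec" below are only mine, just to show the shape of the interface:
    Code:
    #include <cstddef>

    enum class Status { Done, NeedInput, NeedOutput };

    // Trivial pass-through "codec": moves total_len bytes from input to output,
    // in as many calls as it takes.  A real codec keeps its decode state here.
    struct CopyCodec {
        size_t remaining;
        explicit CopyCodec(size_t total_len) : remaining(total_len) {}

        Status process(const unsigned char*& in, size_t& in_len,
                       unsigned char*& out, size_t& out_len) {
            while (remaining > 0) {
                if (in_len == 0)  return Status::NeedInput;   // caller refills the input window
                if (out_len == 0) return Status::NeedOutput;  // caller drains the output window
                *out++ = *in++;                               // a real codec would decode here
                --in_len; --out_len; --remaining;
            }
            return Status::Done;                              // whole stream processed
        }
    };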

  11. #11
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    > Practically all runtime is spent in zlib, so it doesn't matter if my
    > part is slow, so it's a great case to use a productive language.

    My opinion is that all high-level languages (ones with built-in strings etc)
    are very bad for parsing binary formats.
    For example, look at this:
    http://pastebin.com/PP9eMk3c (C++)
    http://pastebin.com/mkwXrDKu (python)
    and tell me which is a better description of the format.

    Yes, they can do it and I'd not be surprised by a smaller source even,
    but afaik it would be write-only - things that you have to decrypt
    with a language manual are clearly not descriptive.

    In zlib case that could mean that you would do detection by appending
    single symbols to a string ( s=substr(s,1)+newchar -> 2-3 instances of
    64k string with dynamic allocation ) and running ungzip and gzip on it
    (called via weird thunks with likely more string instances allocated,
    while being already bloated from the start), while normally you would
    know which part of a stream is a valid deflate code just by decoding.

    Sometimes complex tools add more problems than they solve, and
    decoding/encoding of binary formats is certainly one of these cases, imho.
    I wrote a Prolog parser in Haskell and feel that Haskell does it much better than C. It's not exactly the same as an archiver format, so here the benefits are lower, but I still like it a lot for such uses.

    Quote Originally Posted by Shelwien View Post
    > they could be combined with no significant work by placing metadata in front of data;
    > The program can find the end of metadata by itself.

    If you mean attaching it after the actual recompression, then it's not streamable.
    Mhm... correct. So it's not so simple, but not bad either. The only thing that isn't known early on is the gzip footer, and it has a fixed size, so it could be placed at the end.
    There's a bigger problem with streaming though - the zlib mode prediction might be incorrect and you may learn about it quite late. I see no fix for that.
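    For reference, the gzip footer mentioned above is the fixed 8-byte trailer from RFC 1952: the CRC-32 of the uncompressed data followed by ISIZE (the uncompressed length mod 2^32), both little-endian. A tiny sketch of emitting it (the function name is mine):
    Code:
    #include <cstdint>
    #include <cstdio>

    void write_gzip_trailer(std::FILE* f, uint32_t crc, uint64_t uncompressed_len) {
        uint32_t isize = (uint32_t)uncompressed_len;                        // length mod 2^32
        unsigned char t[8];
        for (int i = 0; i < 4; ++i) t[i]     = (crc   >> (8 * i)) & 0xFF;  // CRC-32, little-endian
        for (int i = 0; i < 4; ++i) t[4 + i] = (isize >> (8 * i)) & 0xFF;  // ISIZE, little-endian
        std::fwrite(t, 1, sizeof(t), f);
    }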

    Quote Originally Posted by Shelwien View Post
    Also I know that it's not necessary to consider different zlib versions.
    Quoting my comment from SO:
    No. I thought so, but found data that shows otherwise.
    A pic, compressed with gzip -6: I calculated (bs-)patches from it to the tested zlib modes and got a table of patch sizes (table not preserved here). I tested zlib 1.2.1-1.2.5, so if the version didn't matter, the results would be divisible by 5. I did not investigate further and I didn't see any other case where there would be a difference (though I didn't try to find one either). I certainly think that for regular uses, trying only the latest version is sufficient.

    Quote Originally Posted by Shelwien View Post
    Then, reducing the window size doesn't normally make sense,
    Agree.
    Quote Originally Posted by Shelwien View Post
    as well as changing that memlevel parameter.
    IIRC on some machines the default is 8 and on some 9, reducing it further doesn't make sense.

    Quote Originally Posted by Shelwien View Post
    So I'd expect the practical set of parameter combinations for bruteforce
    to be much smaller than it seems.
    Definitely. I didn't do any work on reducing the parameter space, but there's no doubt a lot to be gained. I didn't analyse it closely, but it seems that Precomp checks only the basic 9 modes and works quite OK.

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    > I wrote a Prolog parser in Haskell and feel that Haskell does it
    > much better than C.

    Parser or a real interpreter? The latter would be cool.
    But for plain parsing the dedicated parser generators (eg. ANTLR) are probably better.

    > It's not exactly the same as an archiver format,
    > so here the benefits are lower, but I still like it a lot for such uses.

    I think there'd be no benefits for any practical tasks.
    For example, manipulating a list of 1000 files certainly would be easier
    in languages with strings and lists. But already when it comes to millions
    of files, the RTL overhead would probably be deadly.

    Same thing applies to working for long periods of time with high load.
    Things like garbage collector freezing the app for a few minutes
    after working for a week are very hard to debug and fix.

    > I thought so, but found data that shows otherwise.

    Well, I only tested using the minizip utility from the zlib package,
    but it certainly worked the same for all tested versions.
    Still, thanks for the info, I'll take it into account.

  13. #13
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    > I wrote a Prolog parser in Haskell and feel that Haskell does it
    > much better than C.

    Parser or a real interpreter? The latter would be cool.
    Interpreter. I will long remember SLD-resolution trees...they were hard.

    Quote Originally Posted by Shelwien View Post
    But for plain parsing the dedicated parser generators (eg. ANTLR) are probably better.
    Very possible, though I don't have any experience with it.

    Quote Originally Posted by Shelwien View Post
    > It's not exactly the same as an archiver format,
    > so here the benefits are lower, but I still like it a lot for such uses.

    I think there'd be no benefits for any practical tasks.
    For example, manipulating a list of 1000 files certainly would be easier
    in languages with strings and lists. But already when it comes to millions
    of files, the RTL overhead would probably be deadly.
    Depends on what you do with those files. If you want to compress them, then even a whopping RTL overhead of 1000 cycles/file won't matter.

    Quote Originally Posted by Shelwien View Post
    Same thing applies to working for long periods of time with high load.
    Things like garbage collector freezing the app for a few minutes
    after working for a week are very hard to debug and fix.
    That's another matter. It is indeed a problem, so for now one shouldn't write code that manages a large number of frequently disposed objects in high-level languages. The only exception I'm aware of is the Azul JVM, which is practically pauseless up to terabytes of RAM.

  14. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,334
    Thanks
    209
    Thanked 1,007 Times in 532 Posts
    > Interpreter. I will long remember SLD-resolution trees...they were hard.

    Nice!
    I actually have an old constraint programming project, and now it seems
    like I'll have an application for it in the foreseeable future; how about
    giving me some advice?

    > Depends on what you do with those files.

    Sorting millions of files.
    With a non-standard comparison function like "by extension".

  15. #15
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien View Post
    > Interpreter. I will long remember SLD-resolution trees...they were hard.

    Nice!
    I actually have an old constraint programming project, and now it seems
    like I'll have an application for it in the foreseeable future; how about
    giving me some advice?
    Well, even though I wrote the thing, I was led step by step and I don't feel confident giving advice. Anyway, I suggest you do a high-level study of the theory behind such interpreters and make sure you understand it well, SLD-resolution being the most important part and, implementation-wise, the only hard thing to do. And Haskell actually makes it easier than C, because its laziness takes the complexity of managing infinite trees away from an already very complex algorithm.
    ADDED: In fact, I think it would be worthwhile to write it in Haskell first and then port it to C(++) if needed. No doubt it's easier this way, and because of this it may actually be faster.


    Quote Originally Posted by Shelwien View Post
    > Depends on what you do with those files.
    Sorting millions of files.
    With a non-standard comparison function like "by extension".
    If you determine that it's indeed slow, you can rewrite that part in C; the cost is at most the overhead of converting data back and forth, which is not a problem with a couple of million files.
    With Python there's also Cython, which makes this kind of optimization easier.
    Last edited by m^2; 4th November 2011 at 18:07.

  16. #16
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,501
    Thanks
    741
    Thanked 664 Times in 358 Posts
    Sorting millions of files.
    With a non-standard comparison function like "by extension".
    1. use the Schwartzian transform (see the sketch below)
    2. it will be slower than in C but much faster than scanning the directories
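    A minimal sketch of that decorate-sort-undecorate idea (the extension key and names are illustrative; in Python or Haskell the same thing is a one-liner with a sort key):
    Code:
    #include <algorithm>
    #include <string>
    #include <utility>
    #include <vector>

    // Decorate-sort-undecorate: compute the sort key ("extension") once per name,
    // sort by the precomputed key, then strip the keys again.
    std::vector<std::string> sort_by_extension(std::vector<std::string> files) {
        std::vector<std::pair<std::string, std::string>> decorated;
        decorated.reserve(files.size());
        for (auto& f : files) {
            std::string::size_type dot = f.rfind('.');
            std::string ext = (dot == std::string::npos) ? std::string() : f.substr(dot + 1);
            decorated.emplace_back(std::move(ext), std::move(f));           // decorate
        }
        std::sort(decorated.begin(), decorated.end());                      // sort by key
        files.clear();
        for (auto& d : decorated) files.push_back(std::move(d.second));     // undecorate
        return files;
    }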

    Haskell is great for utilities. it doesn't make sense when you get down to the data itself - even the ability to process millions of chars per second isn't enough there. for example, i'm using plain haskell to collect the http://freearc.org/Statistics.aspx page - it processes ~100mb of data in 30 seconds

Similar Threads

  1. reflate - a new universal deflate recompressor
    By Shelwien in forum Data Compression
    Replies: 121
    Last Post: 7th August 2019, 22:44
  2. lzma recompressor
    By Shelwien in forum Data Compression
    Replies: 33
    Last Post: 25th February 2016, 23:40
  3. ZPAQ pre-release
    By Matt Mahoney in forum Data Compression
    Replies: 54
    Last Post: 23rd March 2009, 03:17
  4. FreeArc 0.40 pre-release version
    By Bulat Ziganshin in forum Forum Archive
    Replies: 110
    Last Post: 3rd January 2008, 00:29
  5. quad 1.07BETA2 - pre-release of 1.08 available
    By encode in forum Forum Archive
    Replies: 4
    Last Post: 11th March 2007, 23:59
