
Thread: reflate - a new universal deflate recompressor

  1. #1
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,267
    Thanks
    200
    Thanked 985 Times in 511 Posts

    reflate - a new universal deflate recompressor

    Finally the diff and CM parts are more or less finished,
    so here's a working recompressor for arbitrary raw deflate streams.

    Specific utils are called raw2hif and hif2raw atm, for
    historical reasons (hif = header diff).

    http://nishi.dreamhosters.com/u/defslow_v6.rar

    .unp is unpacked file extracted from deflate
    .raw is the actual deflate stream = zip archive without zip headers
    .hif is the recompressor's diff - information required for lossless
    restoration of original deflate stream
    .7z is .unp+.hif archived with 7z -mx=9

    Test files are created with various archivers
    (7z,kzip,infozip/gzip,jar,winzip,pkzip/DOS,securezip)
    and are all recompressed losslessly now

    Code:
    -----------------------------------------------------
    |file       |    .unp|   .raw|  .hif|    .7z|   gain|
    -----------------------------------------------------
    |book1__7z  |  768771| 299731|119814| 382697|-27.68%|
    |book1__izip|  768771| 312257|    14| 261221| 16.34%|
    |book1__jar |  768771| 313576| 10264| 271641| 13.37%|
    |book1__kzip|  768771| 299437|110049| 372822|-24.51%|
    |book1__pacl|  768771| 312802|   103| 261285| 16.47%|
    |book1__pk  |  768771| 312490| 10561| 271891| 12.99%|
    |book1__pks |  768771| 311211| 25992| 287558|  7.60%|
    |book1__wz  |  768771| 312047|  1868| 263086| 15.69%|
    |wcc386_7z  |  536624| 303040| 57775| 333413|-10.02%|
    |wcc386_izip|  536624| 314952|    14| 274644| 12.80%|
    |wcc386_jar |  536624| 314180|  6405| 281305| 10.46%|
    |wcc386_kzip|  536624| 302580| 54157| 329711| -8.97%|
    |wcc386_pacl|  536624| 314034|    71| 274763| 12.51%|
    |wcc386_pk  |  536624| 313295| 18708| 293663|  6.27%|
    |wcc386_pks |  536624| 312781| 27171| 302223|  3.38%|
    |wcc386_wz  |  536624| 314914|   467| 275194| 12.61%|
    -----------------------------------------------------
    |           |10443160|4963327|443433|4737117|  4.56%|
    -----------------------------------------------------
    Recompression of SFC files compressed with gzip -9
    http://nishi.dreamhosters.com/u/log5_99_tbl.txt

    Intentional mode mismatch with recompression model -
    files are compressed with gzip -1/4/6, while model expects -9
    http://nishi.dreamhosters.com/u/log5_146_tbl.txt

    Comparison of v6 with previous version.
    v6 has a token dif model, while v5 just stored the difs uncompressed.
    http://nishi.dreamhosters.com/u/zipfiles_v6.txt

  2. The Following User Says Thank You to Shelwien For This Useful Post:

    elit (7th August 2019)

  3. #2
    Member caveman's Avatar
    Join Date
    Jul 2009
    Location
    Strasbourg, France
    Posts
    190
    Thanks
    8
    Thanked 62 Times in 33 Posts
    If I get it right, .hif is small when the original deflate streams are pretty similar to those produced by zlib?
    7-zip and Kzip have their own deflate algorithms, thus .hif is way bigger.

  4. #3
    Administrator Shelwien
    > If I get it right, .hif is small when the original deflate streams are pretty similar to those produced by zlib?

    Yes, though in theory it depends on precision of parsing model approximation.
    In other words, for winzip I know specifically how it's different from zlib, so I can make a perfect model.
    And for pkzip/7z/kzip I know that overhead can be reduced using a matchfinder without zlib quirks
    (distance not limited to 0x7EF7 etc).

    > 7-zip and Kzip have their own deflate algorithms, thus .hif is way bigger.

    Yes, in a way it's a measure of parser complexity.

  5. #4
    Member
    Join Date
    May 2007
    Location
    Poland
    Posts
    89
    Thanks
    8
    Thanked 4 Times in 4 Posts
    I've noticed that precomp and this tool are focused on lossless zip file restoration. Since ZIP files are ZIP files are ZIP files, it doesn't matter which decoder implementation is used for decompression. Which means that there is no need to store extra diff information (reflate) or waste cpu time trying to find exact zip compressor and options (precomp) - as zip decompressor has to be bit exact, no? Are there bad/limited deflate decompressors?
    Last edited by jethro; 3rd November 2011 at 18:58.

  6. #5
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    555
    Thanks
    208
    Thanked 193 Times in 90 Posts
    The decompression is not a problem, it always gives the same file, but restoring the compressed file bit exact is. For example, look at this simplified example:

    Code:
    ABCABCABC
    This can be compressed in several ways, e.g. you can write it as [3xABC] or [2xABC,ABC] or [ABC,2xABC] (and more). All of these decompress to the same original data, but without additional information it's not possible to tell which of the compressed possibilities was used.

    So for a given ZIP, you can tell exactly how the decompressed data looks, but for given decompressed data, there are various possible ZIP files depending on the exact zLib implementation.
    http://schnaader.info
    Damn kids. They're all alike.

  7. #6
    Member m^2's Avatar
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I guess that jethro understands the noninjectivity of inflate, but asks why we need recompression to be lossless, right?

    The answer is that it's safer and more useful.
    Safer, because the encoder may not have a full view of some data and doesn't know whether there are references to it, e.g. checksum files. Lossy recompression can break such data. Also, I've heard about a real-world case where recompressing some jars with a stronger encoder broke a program that used them (OpenOffice?). I didn't check what was up, I'm just repeating what I heard.
    And more useful, because it enables more tricks like recursion, e.g. a png in a zip.

  8. #7
    Member
    Quote Originally Posted by m^2 View Post
    I guess that jethro understands the noninjectivity of inflate, but asks why we need recompression to be lossless, right?

    The answer is that it's safer and more useful.
    Safer, because the encoder may not have a full view of some data and doesn't know whether there are references to it, e.g. checksum files. Lossy recompression can break such data. Also, I've heard about a real-world case where recompressing some jars with a stronger encoder broke a program that used them (OpenOffice?). I didn't check what was up, I'm just repeating what I heard.
    And more useful, because it enables more tricks like recursion, e.g. a png in a zip.
    Right, why lossless, aka why bother? Broken or limited deflate decoders should be rare; checksum verification is an issue, I agree. However, it should matter only in installers and such, not in standalone zip files.

    The decompression is not a problem, it always gives the same file, but restoring the compressed file bit exact is. For example, look at this simplified example:

    Code:
    ABCABCABC
    This can be compressed, f.e. you can write it as [3xABC] or [2xABC,ABC] or [ABC,2xABC] (and more). All of these decompress to the same original data, but without addtional information its not possible to tell which of the compressed possibilities was used.

    So for a given ZIP, you can tell exactly how the decompressed data looks, but for given decompressed data, there are various possible ZIP files depending on the exact zLib implementation.
    I perfectly understand that, Schnaader. I understand that is the way precomp works - it finds the exact zip method implementation and its parameters so as to produce an identical zip file on restoration.
    My point is this: you have a normal single zip file called "File.zip". You can extract its contents with any zip implementation (e.g. winrar, 7z, pkzip) because they produce bit-exact output (or else are broken). Next, proceed to pack it with, say, LZMA. On restoring, we decompress the LZMA and pack to zip with, say, the 7z deflate algo. It is still a normal zip file which can be decompressed with any zip implementation.
    Precomp, by not looking for the matching compression method, can be sped up probably like 10x. In simple cases (e.g. a bunch of ZIP/PNG/GIF files) I would use such a super fast precomp mode.
    Code:
    Super Fast Precomp mode:
    Look for headers, decompress, check size after decompression. Maybe some threshold of 5%, else keep original bits
    Also, intense mode finds many possible zip headers, yet it seldom finds the "right" compressor (something like 20/1000 in my last test). This mode could improve compression hugely. Since such a mode would be very quick, one could verify right away whether everything works after restoration.
    In conclusion, we risk slight incompatibility for much, much faster operation and, in conjunction with intense mode, much better compression. In the case of reflate, ditching losslessness has obvious size gains.

    What do you think about implementing such mode Schnaader?
    Last edited by jethro; 3rd November 2011 at 21:08.

  9. #8
    Programmer schnaader
    I had several requests for such a "lossy" recompression mode, but I'm not sure exactly how to realize it. I agree that it would be helpful for some known filetypes like ZIP and PNG where losslessness isn't needed. It definitely won't become a part of Precomp as it's very "unsafe", as m^2 said, and people must be aware to use it entirely differently, e.g. you just can't use this mode for a big ISO file as it would destroy its structure. So releasing it as a separate program would be the way to go, but I just haven't found the time to do so. As the compressed streams would often have a different size, file headers need to be handled/updated and it'd take work to more safely identify/handle different file types.

    Note that intense mode in Precomp only looks for two header bytes instead of ZIP or other headers - and even those 2 bytes can vary, so there's much misdetection involved; there will be "streams" detected (and sometimes recompressed, if they are longer than 32 bytes) even in pure text files. This works for Precomp as it's lossless, but will cause problems in a lossy implementation, destroying content of the original file.

    So this project would be quite different from Precomp, I'm not sure if I'll ever find time to start it, but if I do, I'll make it open source right from the start for sure so that other people can play with it and improve it.

  10. #9
    Member
    Sorry for hijacking the thread Shelwien, hope you don't mind.

    I had several requests for such a "lossy" recompression mode, but I'm not sure exactly how to realize it. I agree that it would be helpful for some known filetypes like ZIP and PNG where losslessness isn't needed. It definitely won't become a part of Precomp as it's very "unsafe", as m^2 said, and people must be aware to use it entirely differently, e.g. you just can't use this mode for a big ISO file as it would destroy its structure. So releasing it as a separate program would be the way to go, but I just haven't found the time to do so. As the compressed streams would often have a different size, file headers need to be handled/updated and it'd take work to more safely identify/handle different file types.
    Myself, I would prefer it to be a part of precomp, though with a BIG FAT warning that this is not lossless. Having too many tools around is an inconvenience.

    Note that intense mode in Precomp only looks for two header bytes instead of ZIP or other headers - and even those 2 bytes can vary, so there's much misdetection involved; there will be "streams" detected (and sometimes recompressed, if they are longer than 32 bytes) even in pure text files. This works for Precomp as it's lossless, but will cause problems in a lossy implementation, destroying content of the original file.
    This sounds problematic, yes. A kludgy way to solve it is with the threshold described in the previous post. For intense mode it may then need to be bumped to 15% or some other empirically found safe value, with sufficiently long streams to rule out random luck (i.e. streams longer than the default 32 bytes). Maybe this could work perfectly; I would like to test that.

  11. #10
    Administrator Shelwien
    > I've noticed that precomp and this tool are focused on lossless zip
    > file restoration.

    Yes, because it's the only type of recompression which can be
    applied _automatically_, while with a lossy transformation somebody
    has to be responsible for the losses.

    > Since ZIP files are ZIP files are ZIP files,

    They are not. Different archivers create different zip files
    (there're a few optional records and such).
    For example, some games use zips with zeroed crcs as resource
    containers.
    So lossless recompression of zip files is actually a different
    task, not directly related to deflate recompression.

    > it doesn't matter which decoder implementation is used for
    > decompression.

    Kinda, but lossless recompression also has a bonus - it also
    works with partial decoding. In other words, there can be
    a few extended records which decoder doesn't support, but
    if we can recompress everything else and there's still a
    compression gain, then it still makes sense.

    > Which means that there is no need to store extra diff
    > information (reflate) or waste cpu time trying to find exact zip
    > compressor and options (precomp) - as zip decompressor has to be bit
    > exact, no? Are there bad/limited deflate decompressors?

    That extra information (reflate and precomp are the same in that sense)
    is mostly necessary for lossless restoration of original deflate stream.
    Yes, it's not necessary if you're ok with lossy restoration.
    But even plain .zip or .gz would be invalid if you'd replace a deflate
    stream in there with a stream of different size -
    i.e. lossy recompression is only possible at the level of containers,
    while lossless is universal.

    Anyway, lossy recompression is just a different (and usually much simpler) task.
    And in fact you don't need to write anything for it - just find
    a decoder and encoder (and maybe a compressor for intermediate format).

    There's actually also a different kind of lossy recompression which is
    usually called optimization - stuff like that: http://www.neuxpower.com/support/technology/
    So there're basically 3 related but different tasks:

    1. lossless data recompression (just somehow further compress the already compressed data)
    - encoder option bruteforce + diff (precomp)
    - entropy coding replacement (lzmarec,most jpeg/mp3 recompressors)
    - encoder model + entropy coding replacement (reflate)
    2. lossy data recompression (decode, (de)compress with another codec, encode back)
    3. format optimization (pngout,mp3repacker etc)

  12. #11
    Member m^2
    Shelwien, do you have tools to extract raw deflate streams from files?
    The tool is barely testable as it is.

  13. #12
    Programmer schnaader
    Quote Originally Posted by m^2 View Post
    Shelwien, do you have tools to extract raw deflate streams from files?
    The tool is barely testable as it is.
    Verbose mode in Precomp can be abused for very rough offset detection (depending on the various headers, actual raw stream positions are some bytes behind) and it also gives the stream length (although I think they don't matter for reflate, as the stream end is clear) when recompression was successful. Also, the temporary files should be raw streams, but it's difficult/unsafe to "catch" them because they have a short lifespan.

    Adding a debug switch to Precomp to keep the decompressible raw streams in a separate directory would be another possible solution.

    Of course, writing a seperate tool would be good, too, but note that raw deflate streams aren't easily detectable/extractable from unknown filetypes.
    Last edited by schnaader; 6th November 2011 at 01:29.
    http://schnaader.info
    Damn kids. They're all alike.

  14. #13
    Administrator Shelwien
    > tools to extract raw deflate streams from files?

    Sorry, not yet, but I'm finally working on it - the actual recompression was more important.
    Atm I can only suggest extracting them manually - the stream starts immediately after the file name in .zip and .gz
    (an asciiz filename in .gz) and after IDAT plus two more bytes in .png.

    Btw, there're also defslow v0-v5 demos, which contain various other utils.
    raw2unp or raw2dec from there would likely be more useful for .raw testing than raw2hif.

    Note also that there's no parameter detection either, it just uses a -9 parser for all inputs atm.

    However, these things (stream and level detection) are relatively simple, so I'd likely
    post an updated demo in a few days.

  15. #14
    Administrator Shelwien
    http://nishi.dreamhosters.com/u/reflate_v0a.rar
    v0 has been available for a month already, but there was a bug and it wasn't exactly lossless.
    Please tell me if you find a file which is not restored correctly.
    Btw, rawdet/rawrest utils can be useful as they are, because rawdet extracts
    detected streams to separate files and they can be decoded, modified etc.

    Code:
                   original lv=6.ref lv=4.ref    .refx   .ref0    .ref1
    2009_Gen~.pdf   2551611  1871024  1892553  1838314  2011876  1844149
    advcomp1.pdf     108823    76077    76779    75547    79578    75842
    gs.pdf          1459619  1302619  1304207  1301067  1319064  1302637
    PartyLite.pdf  51341506 45335887 44478337 38768173 46280052 38884167
    SonyAR11-E.pdf  8186364  5412167  5397943  5253807  5597390  5265656
    
    .ref is the recompression archive created with reflate+shar+plzma
    .refx is an archive of reflate's .unp files (like .ref, but without .hifs)
    .ref0 is plain plzma
    .ref1 is precomp042+plzma
    
    "lv=N" means that gzip level N is used as a diff base by raw2hif
    
    .refx is the estimation of potentially possible reflate result
    with deflate parameter detection.
    Code:
    Run/see test1.bat for an example of script usage
    
    test1.bat: create .ref and .ref1 archives from files in /pdf subdirectory
    test2.bat: create .ref2 archives from files in /pdf (using precomp)
    
    c.bat filename.ext
    
      compresses filename.ext to filename.ref and filename.ref0,
      (.ref0 is a version compressed with plain plzma, no recompression.)
    
    d.bat filename.ref filename.unp
    
      restores the file from reflate archive filename.ref to filename.unp
    
    Notes:
    
    1. Compression level is not detected yet; instead it's configured
    in "set level=n" line of c.bat/d.bat scripts.
    Note that the optimal level can be significantly different for different
    file formats and/or encoders. 
    For example, for .docx level=1 seemed to be the best somehow.

  16. #15
    Programmer schnaader
    Tested 4 files (2.4 GHz dual core, 4 GB RAM).

    Code:
    file                size        ref (level 6)                 comp/decomp time     precomp -cn -intense -t-j
    
    advcomp1.pdf        108,823     126,612 -> 76,077             1.8 s / 1.2 s        125,656
    FlashMX.pdf         4,526,946   27,509,105 -> 3,106,490       1 min 48 s / 24 s    26,953,437
    setup_paserver.zip  15,235,901  37,706,637 -> 11,751,787      1 min 33 s / 39 s    17,189,024
    corellinux.iso      473,825,280 1,017,498,081 -> 429,283,695  1 h 55 min / 49 min  505,421,167
    Note that the Precomp values include recursion and additional streams like bZip2, so they are only rough guides - and I intentionally picked the ZIP+ISO to show files where Precomp only recompresses partial streams.

    Neither FlashMX.pdf nor corellinux.iso is restored correctly. For FlashMX.pdf, it seems to be a minor bug: there's one removed byte at position 0x29242 and the remaining content seems to be identical (though shifted by one byte). For corellinux.iso, there are 5 streams and especially the biggest one takes a lot of time:

    Code:
          00000004.hif   91.069.876 bytes
          00000004.raw  447.381.529 bytes
          00000004.unp  894.828.544 bytes
    It seems that this one is somehow (partially) lost after restoration, file size is only 26,460,730 bytes, there's one byte mismatch at position 0x154E341 (0x46 instead of 0x00) and everything else is identical, though the file is much too small. 473,825,280 - 447,381,529 = 26,443,751, so it seems restoration of the stream was started, but something went wrong after ~17 KB.

    Another interesting detail on corellinux.iso: RAR of the original file is only 344 MB and although Precomp can only recompress a part of it successfully, Precomp+SREP+7-Zip gives 318 MB, so the reflate result (429 MB) doesn't really match the successful decompression of almost 1 GB data with relatively small overhead of 90 MB. Perhaps the compression needs some tuning for big files, long matches and such (EDIT: No it doesn't, see below). I have skipped ref0 to save time, will try this on corellinux.iso and edit (EDIT: Just fine, 325,613,370 bytes).

    EDIT: Seems the big stream of 447 MB is a misdetection - there is no big deflate stream inside the ISO, but several hundred GZip streams - it seems detecting the stream end somehow fails - this also explains the bad compression ratio as the 1 GB decompressed data also contains artificially "decompressed" data outside of the deflate stream.

    Precomp -c- -d0 -t-b output for corellinux.iso:

    Code:
    New size: 492927220 instead of 473825280
    
    Recompressed streams: 1508/1883
    ZIP streams: 83/86
    GZip streams: 1396/1768
    GIF streams: 29/29
    
    You can speed up Precomp for THIS FILE with these parameters:
    -zl66,68,86,95,96,97,98,99
    Ah, the rawdet output for corellinux.iso also shows some strange error, the last stream has no end (but beg=0193C23A=26,460,730, which was the size of the incorrectly restored file):

    Code:
    beg=001F469E last=0 type=2 size=4096 unplen=27590
    end=002EDFD2 bufbeg=00000003 bufend=00000000
    beg=002EE00A last=0 type=2 size=4096 unplen=144788
    end=003348A3 bufbeg=00000000 bufend=00000000
    beg=0154C560 last=1 type=1 size=7299 unplen=28307
    end=0154E342 bufbeg=00000007 bufend=00000000
    beg=015FF016 last=1 type=2 size=1792 unplen=13139
    end=015FFA3F bufbeg=00000002 bufend=00000000
    beg=0193C23A last=0 type=2 size=65536 unplen=65536
    Last edited by schnaader; 8th December 2011 at 19:05.

  17. #16
    Administrator Shelwien
    Uh, thanks, at least I confirmed that FlashMX.pdf bug, but I can't find any working mirrors for corellinux...

  18. #17
    Programmer schnaader
    The ISO is from an old version of corellinux and was sent to me by e-mail because Precomp had problems with it. It seems the link from the e-mail (~2 weeks ago) still works: http://ompldr.org/vYmI1eA

    Full name is corellinux-oc_1.2.iso, md5sum is 0f3a266d124ac82c0af840ca34bf1a98
    Last edited by schnaader; 8th December 2011 at 18:55.

  19. #18
    Administrator Shelwien
    Cool, downloading...

  20. #19
    Member caveman
    Quote Originally Posted by Shelwien View Post
    > tools to extract raw deflate streams from files?

    Sorry, not yet, but I'm finally working on it - the actual recompression was more important.
    Atm I can only suggest extracting them manually - the stream starts immediately after the file name in .zip and .gz
    (an asciiz filename in .gz) and after IDAT plus two more bytes in .png.
    A single deflate stream can be split between several IDAT chunks in PNG files.
    I have a tool that can dump the deflate stream of .gz and .png files in human readable form (it's a pngdb update and is now complete including dynamic huffman tables dump). If needed I could tweak it to output the raw stream into a file.

  21. #20
    Member m^2
    Quote Originally Posted by caveman View Post
    A single deflate stream can be split between several IDAT chunks in PNG files.
    I have a tool that can dump the deflate stream of .gz and .png files in human readable form (it's a pngdb update and is now complete including dynamic huffman tables dump). If needed I could tweak it to output the raw stream into a file.
    Would be nice...

  22. #21
    Programmer schnaader
    Quote Originally Posted by caveman View Post
    A single deflate stream can be split between several IDAT chunks in PNG files.
    See RFC 2083, section "4.1.3. IDAT Image data".

    There can be multiple IDAT chunks; if so, they must appear
    consecutively with no other intervening chunks. The compressed
    datastream is then the concatenation of the contents of all the
    IDAT chunks. The encoder can divide the compressed datastream
    into IDAT chunks however it wishes. (Multiple IDAT chunks are
    allowed so that encoders can work in a fixed amount of memory;
    typically the chunk size will correspond to the encoder's
    buffer size.)
    The layout in such cases is [4 byte length, "IDAT", data, 4 byte CRC] repeated, so 12 bytes will be between the deflate stream parts. Precomp handles this by writing to a temporary file and removing the intervening bytes, the statistics will show "PNG streams (multi):".

    I appended a test file, and it seems that reflate has trouble with it, too - which isn't surprising, as in most cases it won't be able to detect/distinguish the deflate stream and the "injected" bytes without any further information about the PNG structure.

    Code:
    723.542 pika_pika.png
    641.169 pika_pika.ref
    723.556 pika_pika.png_
    Files differ starting at 0x12AFE.
    Attached: pika_pika.png (706.6 KB)
    Last edited by schnaader; 8th December 2011 at 21:40.

  23. #22
    Administrator Shelwien
    This one is more interesting

  24. #23
    Programmer schnaader
    Analyzed the big ISO stream further. My self-made zLib decompression routines return with an error instantly (End of code unassigned). Looking at the Huffman tree, this is true: there's no Huffman code assigned for symbol 256!

    Precomp 0.4.2 (Windows version) in brute mode tries to decompress the complete stream, too, but as end of file is reached before the stream is finished, it is not seen as a valid stream. Precomp 0.4.2 (Linux version) doesn't decompress the complete stream because zLib returns Z_DATA_ERROR with the message "invalid code -- missing end-of-block". The difference here is that the Windows version uses ZLIB1.DLL (version 1.2.3) and the Linux version compiles zLib from source (version 1.2.5). And in fact, looking at the zLib changelog ( http://zlib.net/ChangeLog.txt ):

    Changes in 1.2.3.4 (21 Dec 2009)
    [...]
    - Catch missing-end-of-block-code error in all inflates and in puff
    Assures that random input to inflate eventually results in an error
    Code from inflate.c:

    Code:
    int ZEXPORT inflate(strm, flush)
       [...]
            switch (state->mode) {
                [...]
            case CODELENS:
                [...]
                /* check for end-of-block code (better have one) */
                if (state->lens[256] == 0) {
                    strm->msg = (char *)"invalid code -- missing end-of-block";
                    state->mode = BAD;
                    break;
                }
    Last edited by schnaader; 9th December 2011 at 17:53.

  25. #24
    Administrator Shelwien
    Yes, I noticed that too... and then invented 3-4 more constraints to check.
    Anyway, I'll post v0b after fixing that interleaving bug which causes the other problems you've found.

  26. #25
    Administrator Shelwien
    Yesterday I rewrote the main loop once again and kinda made it work in the end.
    But I also found a troublesome issue.
    Not sure whether it was a reason for any bugs before... likely not.
    But there's an interleaving issue in both the old and new version.
    You see, dec2dif's output is delayed compared to the block headers.
    I.e. I get header + tokens by decoding a deflate block,
    then encode the header and put the tokens into dec2dif.
    But dec2dif can't immediately produce all the diffs for the current block,
    it needs some lookahead to work.
    So I ended up building a header queue there.
    I put the header into the queue and process the tokens,
    then check whether there are enough difs to cover the first block in the queue.
    If there are, I finally encode that block and the corresponding difs,
    and remove them from the queues.
    Etc.
    Anyway, normally it's okay, and the queues are not so long,
    especially in the new version.
    But there's an exploit.
    It's quite possible that any number of empty or very short blocks
    would appear in the stream.
    (pacl zip is a good example, though it can "only" produce up to 6 or so
    short blocks in sequence)
    Also, 3-4 size=0 blocks in a row are not unusual for pdf streams and such
    (at the end though, where it doesn't matter).
    Well, anyway, the format allows that, but my queues are limited -
    currently the size of the header queue is 8 headers,
    so there's a need to do something if the queue is full,
    and what I did was flushing the diff engine,
    telling it that the stream ends after the current block,
    and to dump all the info.
    But in fact the decoder doesn't know about the flush,
    and it appears that it can desync because of that
    (diffs are applied correctly, but the base matchfinder is out of sync).
    But forcing the decoder to reproduce the flush is also hard,
    because the encoder input is LZ tokens,
    and the decoder input is uncompressed data extracted from these tokens,
    so without encoding the uncompressed block length it's basically impossible to
    reproduce the flush in the decoder.
    Anyway, it's complicated.
    I kinda have one not very efficient (nor simple) workaround for that -
    just dumping the dec2dif queue "manually" without actually flushing it,
    then continuing as usual, but skipping the difs corresponding to the already flushed part.
    But if that won't work either for some reason...
    I'd only be able to increase the queue size and put an error message there.
    Which would mean possibly broken files, which is not any good at all.
    And it's normally very hard to test too...

    And then there's another layer of interleaving that has to be added %)
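    The queue mechanics described above can be sketched roughly like this: headers wait in a bounded queue until the diff engine (which needs lookahead) has produced enough diffs to cover the oldest block; if the stream is all empty/short blocks, no diffs arrive and the queue fills, forcing a flush. All names and the fixed capacity are illustrative, not reflate's actual code:

    ```c
    /* Sketch of the header/diff interleaving problem: decoded block
       headers queue up while the diff engine lags behind due to its
       lookahead. Illustrative structure, not reflate's real code. */
    #include <assert.h>

    #define QMAX 8              /* header queue capacity, as in the post */

    typedef struct {
        long tokens[QMAX];      /* token count of each queued block */
        int  head, count;
    } HdrQueue;

    static long diffs_ready;    /* diffs produced so far by dec2dif */

    /* returns 0 when the queue is full - the caller must then flush */
    static int queue_push(HdrQueue *q, long ntokens)
    {
        if (q->count == QMAX) return 0;
        q->tokens[(q->head + q->count) % QMAX] = ntokens;
        q->count++;
        return 1;
    }

    /* encode (here: just drop) every queued block already covered by diffs;
       returns the number of blocks emitted */
    static int queue_drain(HdrQueue *q)
    {
        int emitted = 0;
        while (q->count && diffs_ready >= q->tokens[q->head]) {
            diffs_ready -= q->tokens[q->head];
            q->head = (q->head + 1) % QMAX;
            q->count--;
            emitted++;
        }
        return emitted;
    }

    int main(void)
    {
        HdrQueue q = {{0}, 0, 0};
        /* short blocks arrive but the lagging diff engine covers none... */
        for (int i = 0; i < QMAX; i++) assert(queue_push(&q, 1));
        assert(!queue_push(&q, 1)); /* ...so the queue fills: flush needed */
        /* once the diff engine catches up, the backlog drains */
        diffs_ready = QMAX;
        assert(queue_drain(&q) == QMAX);
        return 0;
    }
    ```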

  27. #26
    Shelwien:
    http://nishi.dreamhosters.com/u/reflate_v0b.rar

    Code:
                   original lv=6.ref lv=4.ref    .refx   .ref0    .ref1
    2009_Gen~.pdf   2551611  1871039  1892305  1838314  2011876  1844149
    advcomp1.pdf     108823    76033    76771    75547    79578    75842
    gs.pdf          1459619  1302432  1304242  1301067  1319064  1302637
    partylite.pdf  51341506 45338511 44484689 38768173 46280052 39433825
    sonyar11-e.pdf  8186364  5413421  5398625  5253807  5597390  5265656
    pika_pika.png    723542   640719   741070   557270   717569   609961
    Code:
    17-12-2011 01:19 v0b
     + BUG: (rawdet) blocks w/o EOF code were allowed
     + BUG: (raw2hif) match model desync due to input padding after EOF
     + BUG: (raw2hif) match model desync due to matchfinder flushes
     + BUG: (raw2hif) blk/dif/hdr sequence desync
     + BUG: (raw2hif) thread_out didn't expect multiple blocks in queue at EOF
     + BUG: (raw2hif) broken input handling (for now error treated as EOF)
     + same rawdec library in rawdet/raw2hif
     + rawdet uses getcputc.inc to improve the speed of putc writes
     + improved compression of "standard" streams
    
    Known bugs:
     - (raw2hif) corellinux-oc_1.2.iso/000000CA.raw: last byte's padding mismatch
     - (raw2hif) corellinux-oc_1.2.iso/000002FC.raw: extra garbage added on decoding
     - (rawdet) detecting streams like 000000CA.raw and 000002FC.raw as deflate
     - missing byte in restored FlashMX.pdf
    
    To do:
     - integrate the forgotten compression optimization (match indexing)
     - add commandline options to control matchfinder's winsize/memsize
     - matchfinder mode detection

  28. #27
    Shelwien:
    Btw, I'd like an advice about "streams" like:
    http://nishi.dreamhosters.com/u/000000CA.raw
    (from that linux iso)

    It's obviously not deflate, but it parses correctly as a type-1 block.
    I'd keep it as-is for now, since it's a good test case for raw2hif,
    but normally rawdet should be able to skip it,
    and I'm not sure how it could do that.
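    One possible approach (along the lines of the entropy prefilter that v0c later introduces) is to check the candidate's bits per byte: real deflate output is close to 8 bpc, so a prefix that is clearly compressible is unlikely to be a genuine compressed stream. A minimal order-0 sketch, with a guessed threshold and window size:

    ```c
    /* Sketch: reject "fake" deflate candidates whose data is too
       compressible. Threshold (7.0 bpc) and window are illustrative
       guesses, not reflate's actual parameters. Link with -lm. */
    #include <assert.h>
    #include <math.h>
    #include <stddef.h>
    #include <string.h>

    /* order-0 Shannon entropy in bits per byte over buf[0..len) */
    static double bpc(const unsigned char *buf, size_t len)
    {
        size_t freq[256] = {0};
        double h = 0.0;
        for (size_t i = 0; i < len; i++) freq[buf[i]]++;
        for (int c = 0; c < 256; c++) {
            if (!freq[c]) continue;
            double p = (double)freq[c] / (double)len;
            h -= p * log2(p);
        }
        return h;
    }

    /* accept a candidate only if its prefix looks incompressible */
    static int looks_like_deflate(const unsigned char *buf, size_t len)
    {
        return bpc(buf, len) > 7.0;
    }

    int main(void)
    {
        unsigned char text[4096], noise[4096];
        memset(text, 'A', sizeof text);           /* highly compressible */
        unsigned x = 12345;
        for (size_t i = 0; i < sizeof noise; i++) /* pseudo-random bytes */
            noise[i] = (unsigned char)(x = x * 1103515245u + 12345u);
        assert(!looks_like_deflate(text, sizeof text));
        assert(looks_like_deflate(noise, sizeof noise));
        return 0;
    }
    ```

    A check like this is cheap enough to run before any decoding attempt, which is presumably also why it speeds up processing of redundant data so much.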

  29. #28
    Shelwien:
    http://nishi.dreamhosters.com/u/reflate_v0b1.rar
    Code:
    23-12-2011 05:20 v0b1
     + BUG: (thread_pipe21) EOF on one of inputs doesn't mean there's nothing to do
       (missing byte in FlashMX.pdf)
     + BUG: (raw2hif) last byte's padding mismatch
       (mismatch at corellinux.iso/000000CA.raw)
     + BUG: (raw2hif) handling of EOF within an incomplete block
       (extra garbage added on decoding in corellinux.iso/000002FC.raw)
     + BUG: (rawdet) handling of incomplete streams
    
    
    Known bugs:
     - (rawdet) detecting streams like 000000CA.raw and 000002FC.raw as deflate
    
    Planned:
     - (v0c) redundancy estimation of first block data (too compressible = fail)
     - (v0c) integrate the forgotten compression optimization (match indexing)
     - (v1) add commandline options to control matchfinder's winsize/memsize
     - (v1) matchfinder mode detection
     - (v1a) rawdet/raw2hif/rawrest integration
     - (v1a) input->.out+.hif,.str+.unp interleaving (2 out streams for recursion support)
     - (v1a) support for >4G input/streams
     - (v1b) integrated processing of nested streams
     - (v1b) full output interleaving (simple input->output)
     - (v1b) integrated plzma compression
     - explicit zip recompression
     - explicit png recompression
    With this, all the previously problematic files (FlashMX.pdf, pika_pika.png, corellinux-oc_1.2.iso)
    should now be restored correctly. Please find me some new ones while I'm working on that
    compression improvement.

  30. #29
    Shelwien:
    http://nishi.dreamhosters.com/u/reflate_v0c.rar

    Code:
    31-12-2011 01:15 v0c
     + entropy prefilter (data bpc check before decoding attempts)
       (10x faster processing of redundant data; also exclusion of "fake" deflate blocks)
     + integration of the forgotten compression optimization (match indexing)
       (2x better compression of large diffs)
     + merged the deflate decoder library branches in rawdet and raw2hif
    Code:
                   original  v0b .ref  v0c .ref   .ref0    .ref1
    2009_Gen~.pdf    2551611  1871039  1853328  2011876  1844149
    advcomp1.pdf      108823    76033    75838    79578    75842
    gs.pdf           1459619  1302432  1301889  1319064  1302637
    PartyLite.pdf   51341506 45338511 41273674 46280052 39433825
    SonyAR11-E.pdf   8186364  5413421  5320432  5597390  5265656
    pika_pika.png     723542   640719   626492   717569   609961
    FlashMX.pdf      4526946           2831690 
    
    .ref is the recompression archive created with reflate+shar+plzma
    .ref0 is plain plzma
    .ref1 is precomp042+plzma
    Code:
    .raw is the raw deflate stream, w/o zlib header or anything
     (like its stored in zip archives)
    .hif is the reflate's compressed metainformation required to losslessly
     restore .unp back to original .raw
    .7z are .unp+.hif pairs extracted from .raw and compressed again with 7-zip
    
                                  /=== reflate_v0b ===\  /=== reflate_v0c ===\
    ---------------------------------------------------------------------------
    |file       |    .unp|   .raw|  .hif|    .7z|   gain|  .hif|    .7z|  gain|
    ---------------------------------------------------------------------------
    |book1__7z  |  768771| 299731|119814| 382697|-27.68%| 55802| 317792|-6.03%|
    |book1__izip|  768771| 312257|    14| 261221| 16.34%|    17| 261261|16.33%|
    |book1__jar |  768771| 313576| 10264| 271641| 13.37%|  5684| 266953|14.87%|
    |book1__kzip|  768771| 299437|110049| 372822|-24.51%| 53912| 315982|-5.53%|
    |book1__pacl|  768771| 312802|   103| 261285| 16.47%|   144| 261386|16.44%|
    |book1__pk  |  768771| 312490| 10561| 271891| 12.99%|  8485| 269825|13.65%|
    |book1__pks |  768771| 311211| 25992| 287558|  7.60%| 14582| 275953|11.33%|
    |book1__wz  |  768771| 312047|  1868| 263086| 15.69%|  1191| 262379|15.92%|
    |wcc386_7z  |  536624| 303040| 57775| 333413|-10.02%| 30598| 305936|-0.96%|
    |wcc386_izip|  536624| 314952|    14| 274644| 12.80%|    19| 274747|12.77%|
    |wcc386_jar |  536624| 314180|  6405| 281305| 10.46%|  3620| 278457|11.37%|
    |wcc386_kzip|  536624| 302580| 54157| 329711| -8.97%| 26628| 301764| 0.27%|
    |wcc386_pacl|  536624| 314034|    71| 274763| 12.51%|   124| 274788|12.50%|
    |wcc386_pk  |  536624| 313295| 18708| 293663|  6.27%|  8500| 283264| 9.59%|
    |wcc386_pks |  536624| 312781| 27171| 302223|  3.38%| 11689| 286612| 8.37%|
    |wcc386_wz  |  536624| 314914|   467| 275194| 12.61%|   295| 274953|12.69%|
    ---------------------------------------------------------------------------
    |           |10443160|4963327|443433|4737117|  4.56%|221290|4512052| 9.09%|
    ---------------------------------------------------------------------------

  31. #30
    Shelwien:
    Ah, also this:
    Code:
    corellinux-oc_1.2.iso  - 473,825,280
    corellinux-oc_1.2.ref0 - 325,613,370 // plain plzma
    corellinux-oc_1.2.ref  - 247,208,316 // v0b2, 24.079% gain
    corellinux-oc_1.2.ref  - 247,139,907 // v0c, 24.100% gain
    
    corellinux-oc_1.2.ref1 - 325,518,135 // precomp042.exe -c- -intense -d0 -t+Z -mjpeg-
    corellinux-oc_1.2.ref1 - 322,272,596 // precomp042.exe -c- -intense -d0 -t+zgn (also -t-pfjsmb)
    corellinux-oc_1.2.ref1 - 320,445,904 // precomp042.exe -c- -intense -d0 -t-j (29 gifs, 1 bzip)
    corellinux-oc_1.2.ref1 - 412,435,140 // precomp042.exe -c- -intense -d1 -t-j (1073 gifs, 1 bzip)
    schnaader, is there a way to configure precomp to detect all raw deflate streams but skip
    the formats unsupported by reflate?
    Also, am I doing something wrong with that .iso? Why is the precomp result so much worse?

