
Thread: precomp - further compress already compressed files

  1. #1
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    616
    Thanks
    264
    Thanked 242 Times in 121 Posts

    precomp - further compress already compressed files

    Note: Following the approach recommended in another post, I decided to keep a single thread about Precomp instead of one per version, so both development news and future releases will go here.

    Here's a first development version of Precomp 0.4.8 that integrates brunsli JPG compression. There might be some minor changes in the final version, but the main work is done. By default, brunsli is enabled and brotli compression of metadata is disabled - to enable the subsequent compressors (like lzma2) to compress the metadata. This gives much faster JPG compression and decompression with almost the same ratio as packJPG.

    There are new command line switches to control the behaviour (listed in the -longhelp): "brunsli[+-]" (default on) enables/disables brunsli, "brotli[+-]" (default off) enables/disables brotli, and "packjpg[+-]" (default on) can be used to disable the packJPG fallback.
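    To illustrate, here is how those switches would be combined on the command line - a hedged sketch: the switch names are the ones listed in -longhelp, but the file name and the combinations are just made-up examples.
    Code:
      precomp -cn -brunsli- photo.jpg     // force the packJPG path (brunsli off)
      precomp -cn -brotli+ photo.jpg      // additionally compress JPG metadata with brotli
      precomp -cn -packjpg- photo.jpg     // brunsli only, no packJPG fallback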

    Known limitations

    • brunsli uses a pessimistic memory estimate, so it's quite memory hungry; that's why Stephan Busch's test images from https://github.com/google/brunsli/issues/16 will still fall back to packJPG. I tried to raise the kBrunsliMaxNumBlocks variable to test it - for the nasa jpg, precomp tried to allocate 10 GB RAM (I have 4 GB here) and crashed. However, it's not that bad for everyday JPGs below 100 MB, e.g. the 60 MB test image below uses 800 MB of memory with packJPG and 1300 MB with brunsli.
    • no multithreading support (yet). There are some first experiments in brunsli with "groups" compression that enables multithreading, but I couldn't get it to compile. I'll give it another shot and also try to enable multithreading in reconstruction (-r) for JPG and MP3.
    • no automatic recursion. Metadata compression with brotli is deactivated by default, so thumbnails in the metadata could be compressed using recursion. But doing this automatically would introduce a performance penalty and the results aren't always better (thumbnails are usually very small). So at the moment, you'll have to call precomp twice for thumbnail compression (see the sketch below).
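    Here is a hedged sketch of the manual two-pass workflow from the last point (file names are placeholders, and the exact output name of the first pass depends on your settings):
    Code:
      precomp -cn photo.jpg                // first pass: JPG -> .pcf, metadata stays uncompressed
      precomp -cn <output_of_first_pass>   // second pass picks up JPG thumbnails inside that metadata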


    The attached version is a Windows 64-bit MSVC compile; if you want to compile your own version, you can use the "brunsli_integration" branch on GitHub. Please test it on your files and report any bugs.

    Loch Lubnaig test image from Wikimedia Commons
    Code:
      Original: 62.069.950 Bytes
        Precomp 0.4.7 -cn:             48.626.609, 2 min 14 s, -r: 2 min 18 s
        Precomp 0.4.7:                 48.629.092, 2 min 55 s, -r: 2 min 18 s
        Precomp 0.4.8dev -cn:          49.822.256,       25 s, -r:       22 s 
        Precomp 0.4.8dev -cn -brotli+: 49.787.761,       26 s, -r:       22 s
        Precomp 0.4.8dev:              49.790.048, 1 min  9 s, -r:       23 s
    Attached Files
    Last edited by schnaader; 8th November 2019 at 13:38. Reason: added brotli result
    http://schnaader.info
    Damn kids. They're all alike.

  2. Thanks (8):

    comp1 (9th November 2019), Gonzalo (10th November 2019), hmdou (23rd March 2020), Mike (8th November 2019), Shelwien (8th November 2019), Simorq (28th December 2019), Stephan Busch (9th November 2019), vladv (1st March 2020)

  3. #2
    Member CompressMaster's Avatar
    Join Date
    Jun 2018
    Location
    Lovinobana, Slovakia
    Posts
    199
    Thanks
    58
    Thanked 15 Times in 15 Posts
    Quote Originally Posted by schnaader View Post
    Note: Following the approach recommended in another post, I decided to keep a single thread about Precomp instead of one per version, so both development news and future releases will go here.
    @schnaader,
    do you mean this?
    Quote Originally Posted by CompressMaster View Post
    Simply speaking, it's all about NNCP unless you develop a new method/algorithm.

    Well, in order to keep the forum cleaner, you should continue as it is. We don't (or at least I don't) want separate threads for every new version of software (even when those versions contain significant alterations/improvements) that uses the same algorithm - as an example, your CMV. You posted every new version in the corresponding post. That's great and much more readable... Gotty does that for PAQ. Schnaader, for example, creates separate threads for every new precomp version. It would be better to have only one, and he should periodically update his first post with links to every new version, i.e.:

    1st post
    Here's my new tool, PRECOMP, etc.
    Features: 1. full etc.
    Versions:
    PRECOMP 1.0 Build 20070214 - https://encode.su/threads/111222/PRECOMP/post#5
    PRECOMP 2.0 Build 20081115 - https://encode.su/threads/111222/PRECOMP/post#29

    2nd post
    Precomp version1
    description+features
    download in attachment

    another post
    Precomp version2
    description+features
    download in attachment

    See the advantage? It's much clearer and more readable - users can have problems finding a particular version, but if the first post contains links to all versions in the corresponding posts, it will be very useful. That's my point of view...
    Good decision, schnaader! Small note - it's kinda too late. For example, when I have developed BestComp (I don't know exactly when, because I need to heavily optimize its code - the base kinda works), all my versions will go into one BestComp thread, as with Mauro Vezzosi's CMV. But it's better to have only one thread (like this will be) instead of multiple, of course. Good work, though!

  4. #3
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 93 Times in 73 Posts
    Good work @schnaader! Do you need us to run any tests in particular? I've been doing the basics, i.e. compiling, roundtrip, edge cases, etc., and so far it's all smooth.

    I can confirm that this version compiles on both GCC 9.2 and clang 9.0 on Manjaro Deepin x86-64. It seems like clang is catching up with the optimizations bc there is no noticeable difference in speed, although it is faster and cleaner to compile (the warnings aren't so messy).

    Personal opinion: I'm happy with the default settings, i.e. brunsli then packjpg, no brotli, no thumbnail recursion.

    ------------------------------------

    Another thing entirely: Do you have any plans to include bitmap and exe pre-processing? It seems that 7z's delta can be very helpful but AFAIK it needs to be manually tuned. I believe you already included exe pre-processing in precomp but only as a part of lzma compression. In my scripts, I frequently use good old "arc -m0=rep+dispack+delta" to pre-process executables and it hasn't failed in almost a decade. It makes for an important difference in ratio, not only on Windows exes, but on executable code in general. It would be a very, very good thing to add to precomp if you will

    Thanks in advance for your reply!

  5. #4
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    616
    Thanks
    264
    Thanked 242 Times in 121 Posts
    Quote Originally Posted by Gonzalo View Post
    Good work @schnaader! Do you need us to run any tests in particular? I've been doing the basics, i.e. compiling, roundtrip, edge cases, etc., and so far it's all smooth.
    I can confirm that this version compiles on both GCC 9.2 and clang 9.0 on Manjaro Deepin x86-64. It seems like clang is catching up with the optimizations bc there is no noticeable difference in speed, although it is faster and cleaner to compile (the warnings aren't so messy).
    Compiling was a good test, thanks for that! I'm not looking for anything specific, it's more of a "more eyes find more bugs" case, throwing the test version at different people and their systems with different settings.

    Quote Originally Posted by Gonzalo View Post
    Another thing entirely: Do you have any plans to include bitmap and exe pre-processing? It seems that 7z's delta can be very helpful but AFAIK it needs to be manually tuned. I believe you already included exe pre-processing in precomp but only as a part of lzma compression. In my scripts, I frequently use good old "arc -m0=rep+dispack+delta" to pre-process executables and it hasn't failed in almost a decade. It makes for an important difference in ratio, not only on Windows exes, but on executable code in general. It would be a very, very good thing to add to precomp if you will
    Bitmap preprocessing will most likely be done by solving the "Using FLIF for image compression" issue and adding as many image formats as possible (like BMP, PBM/PGM/P.., TGA, PCX, TIFF, ...) to feed FLIF. Also, I recently did some experiments to autodetect images in data without any header (this would help for game data like Unity resources), but this is more of an experimental thing that wouldn't be used as a default setting.

    For exe preprocessing, there are two possible ways: 1) detecting exe streams and processing them in a separate lzma stream, 2) detecting and preprocessing exe streams "manually", using some library like dispack. The first one would be easier to implement, but would still rely on lzma and is limited to lzma's filters. So in the long run, the second solution seems inevitable indeed, but it will need some time to develop.
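    To make the "manual" exe preprocessing idea more concrete, here is a minimal C++ sketch of the classic E8/E9 call-target transform that such filters build on. This is an illustration only (not dispack or precomp code); real filters also validate targets, handle x64, restrict the transform to code sections, etc.
    Code:
      // E8/E9 (CALL/JMP rel32) transform: relative branch targets are rewritten
      // as absolute (file-relative) values, so repeated calls to the same target
      // produce identical byte patterns that LZ-style compressors can match.
      #include <cstdint>
      #include <cstddef>
      
      void e8e9_forward(uint8_t* buf, size_t size) {
          for (size_t i = 0; i + 5 <= size; ) {
              if (buf[i] == 0xE8 || buf[i] == 0xE9) {   // CALL rel32 / JMP rel32
                  uint32_t rel = (uint32_t)buf[i+1] | ((uint32_t)buf[i+2] << 8) |
                                 ((uint32_t)buf[i+3] << 16) | ((uint32_t)buf[i+4] << 24);
                  uint32_t abs = rel + (uint32_t)(i + 5); // absolute target
                  buf[i+1] = (uint8_t)abs;
                  buf[i+2] = (uint8_t)(abs >> 8);
                  buf[i+3] = (uint8_t)(abs >> 16);
                  buf[i+4] = (uint8_t)(abs >> 24);
                  i += 5;
              } else {
                  i++;
              }
          }
      }
      // The inverse filter subtracts (i + 5) again, so the transform is lossless.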
    http://schnaader.info
    Damn kids. They're all alike.

  6. #5
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    783
    Thanked 687 Times in 372 Posts
    lzma by itself doesn't include exe filters; they are in the separate bcj and bcj2 algos

    also, it's not a good idea to make a lot of small independently compressed (lzma) streams, although front-end deduplication may significantly reduce losses

  7. #6
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    For exes I can suggest this: http://nishi.dreamhosters.com/u/x64flt3_v0.rar http://freearc.dreamhosters.com/delta151.zip http://freearc.dreamhosters.com/mm11.zip (all with source)
    dispack is actually not that good - it's better than dumb E8, but it can only parse 32-bit code, and what it does with parsed code isn't much smarter than E8.
    So if you want something more advanced than x64flt3, consider courgette (it has 32/64 ELF/COFF x86/arm support) or paq8px exe parser.

    For bmps I'd suggest http://nishi.dreamhosters.com/u/pik_20190529.7z
    FLIF is imho too inconvenient for precomp use (slow encoding etc).

    If you want a dedup filter, I can post rep1 from .pa; it's a configurable CDC dedup preprocessor.
    Same with cdm, I suppose.

  8. #7
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 93 Times in 73 Posts
    @Shelwien: What is 'rep1'? Does it have anything to do with Bulat's rep? Is it an improvement over it, or another thing entirely?

    About exe preprocessors, look at these numbers:

    [Attached image: image.png, 33.2 KB - results table referenced below]

    Maybe I'm doing something wrong, but after applying Bulat's 'delta' and flzma2, dispack has an advantage over x64flt3 on both PE and ELF formats. Both rclone and precomp are for linux x86_64.
    Delta has a positive effect on both filters.
    ​This is some crude test I did today, but I remember doing some more thorough comparisons a while back, and having the same results.
    Last edited by Gonzalo; 11th November 2019 at 16:11. Reason: Update table

  9. #8
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    616
    Thanks
    264
    Thanked 242 Times in 121 Posts
    Quote Originally Posted by Shelwien View Post
    For bmps I'd suggest http://nishi.dreamhosters.com/u/pik_20190529.7z
    FLIF is imho too inconvenient for precomp use (slow encoding etc).
    Code:
    > pik_c --lossless suns.png suns.pik
    
    CPU does not support all enabled targets => exiting.
    Guess my CPU is too old? Intel Core i5 520M here, instruction set support is MMX, SSE, SSE2..4.2, EM64T, VT-x, AES - so AVX/AVX2 is missing.
    I'd be happy to include pik in my tests, though. At the moment, the plan would be to include both FLIF and webp, as there are images for each where the other performs better. More extreme settings would even include comparing results with/without PNG filtering and with pure lzma. Somehow, image compression is still kind of randomish.
    From your post, it sounds like pik might be a good default setting, offering fast and decent compression over many types of images.
    http://schnaader.info
    Damn kids. They're all alike.

  10. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > What is 'rep1'? Does it have anything to do with Bulat's rep?
    > Is it an improvement over it, or another thing entirely?

    It's my dedup preprocessor - same function but different implementation.
    Srep is better, but rep1 is much easier to integrate since it's stream-based,
    doesn't need tempfiles, etc.

    > About exe preprocessors, look at this numbers:

    You're partially right, but I actually meant original dispack from fgiesen, aka disfilter.
    Just that,
    1) Freearc having some codec doesn't mean that schnaader can use it.
    For example, nosso installer has the best disasm preprocessor, but so what?
    2) Testing a few files with freearc doesn't really prove anything,
    in particular with external exe preprocessing there's a frequent problem
    that archiver would still apply its own exe preprocessing (eg. nz,rz do that).

    So here I ported dispack from freearc: http://nishi.dreamhosters.com/u/dispack_arc_v0.rar
    And yes, it is frequently better than x64flt3.
    But sometimes it is not:
    Code:
     1,007,616 oodle282x64_dll
     1,000,369 oodle282x64_dll.dia
       336,415 oodle282x64_dll.b2l // BCJ2 + delta + lzma
       332,920 oodle282x64_dll.xfl // x64flt3 + delta + lzma
       334,971 oodle282x64_dll_dia.lzma // dispack + delta + lzma
    
    34,145,968 powerarc_exe
    33,603,128 powerarc_exe.dia
     5,255,372 powerarc_exe.b2l
     5,105,065 powerarc_exe.xfl
     5,531,003 powerarc_exe_dia.lzma
    http://nishi.dreamhosters.com/u/exetest_20191112.7z

  11. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    @schnaader:
    I recompiled pik*.exe with SIMD_ENABLE=0 and reuploaded the archive: http://nishi.dreamhosters.com/u/pik_20190529.7z

  12. Thanks:

    schnaader (12th November 2019)

  13. #11
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    616
    Thanks
    264
    Thanked 242 Times in 121 Posts
    Thanks, works now and looks like a good candidate between FLIF and webp for both ratio and decompression speed:

    Code:
    3,299,928 suns.pcf_cn, decomp 1.0 s
    1,805,356 suns.png
    1,620,308 suns.pcf, decomp 1.3 s
    1,258,722 suns.webp, decomp 0.1 s
    1,200,934 suns.pik, decomp 0.7 s
    1,112,302 suns.flif (-e), decomp 1.8 s
    1,115,026 suns.flif (-e -N), decomp 1.8 s
    1,096,019 suns.flif (-e -Y), decomp 1.8 s
    1,092,796 suns.flif (-e -Y -R4), decomp 1.8 s
    By the way, the FLIF results show what I like about this format: It has high ratios for photographic images and some switches to tune it (-N, -Y, -R) that give reliable improvements without trying too much. Decompression speed will stay mostly the same and though it's the slowest candidate, it's still asymmetric (encoding takes longer).
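    For reference, the flif invocations behind those switch combinations look roughly like this (a hedged example - the file names are placeholders, the tuning switches are the ones from the table above):
    Code:
      flif -e -Y -R4 suns.png suns.flif    // encode with the tuning switches from the table
      flif -d suns.flif suns_out.png       // decode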

    Also note that PNG reconstruction will be bound by Precomp -cn time which is 1.0 s, so decompression doesn't necessarily have to be as fast as webp.
    http://schnaader.info
    Damn kids. They're all alike.

  14. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    Yes, but flif encoding is pretty slow, and when it's faster (e.g. -E1), the compression gets worse than pik.
    Also in pik you'd basically need a couple of files (lossless8.cc, lossless_entropy.cc and headers),
    while flif is more complicated.

    Btw, here's an idea: make a mode with external files in addition to .pcf (basically -cn). Write bmps to %08X.bmp, jpegs to %08X.jpg etc.
    Now there're archive formats with recompression (7z with Aniskin's plugins, .pa, freearc), why not let them do this?

  15. Thanks:

    Mike (12th November 2019)

  16. #13
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 93 Times in 73 Posts
    Quote Originally Posted by Shelwien View Post
    > What is 'rep1'? Does it have anything to do with Bulat's rep?
    > Is it an improvement over it, or another thing entirely?

    It's my dedup preprocessor - same function but different implementation.
    Srep is better, but rep1 is much easier to integrate since it's stream-based,
    doesn't need tempfiles, etc.
    Great! I would love to give it a try!

    OTOH, it looks a lot like Bulat's rep

    Bear in mind that rep and Srep are two very different programs; both serve the same purpose and work roughly alike, but Srep is way more complex and is designed to deduplicate across very large distances, whereas rep is simpler, faster, and is better suited for chunk sizes < 1gb. Srep binary is about 5x the size of rep's if memory serves. Rep is included in freearc and fazip at source level; Srep isn't.

    Quote Originally Posted by Shelwien View Post
    1) Freearc having some codec doesn't mean that schnaader can use it.
    Oh. Why is that? License, implementation, something else?

    Quote Originally Posted by Shelwien View Post
    2) Testing a few files with freearc doesn't really prove anything,
    in particular with external exe preprocessing there's a frequent problem
    that archiver would still apply its own exe preprocessing (eg. nz,rz do that).
    That's true. In my case, I always try to avoid such inexactitudes. That's why I use 7z, bc I can control exactly what algorithms it's applying.

    Quote Originally Posted by Shelwien View Post
    So here I ported dispack from freearc: http://nishi.dreamhosters.com/u/dispack_arc_v0.rar
    And yes, it is frequently better than x64flt3.
    But sometimes it is not:
    Code:
     1,007,616 oodle282x64_dll
     1,000,369 oodle282x64_dll.dia
       336,415 oodle282x64_dll.b2l // BCJ2 + delta + lzma
       332,920 oodle282x64_dll.xfl // x64flt3 + delta + lzma
       334,971 oodle282x64_dll_dia.lzma // dispack + delta + lzma
    
    34,145,968 powerarc_exe
    33,603,128 powerarc_exe.dia
     5,255,372 powerarc_exe.b2l
     5,105,065 powerarc_exe.xfl
     5,531,003 powerarc_exe_dia.lzma
    http://nishi.dreamhosters.com/u/exetest_20191112.7z
    Cool! Thanks for going through the trouble.
    You do know that fazip.exe can apply dispack from/to file and stdin-stdout, right? That's Bulat's version, mind you. It's as simple as doing

    Code:
    fazip dispack in out
    //or
    whatever_archiver | fazip dispack - out
    //my personal favorite:
    fazip rep:##m+dispack+delta in out
    It can go through all kinds of unknown data w/o introducing noise.

  17. #14
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > Great! I would love to give it a try!

    There're some standalone builds in /u
    (eg. http://nishi.dreamhosters.com/u/fma-rep-1900m3.exe c/d input output)

    Or you can use it with 7zdll, as -m0=rep1:fb1000:c256M:mem2000M or something.
    (Here "fb" is average fragment length, "c" is hashtable size, "mem" is window size;
    encoder only allocates hashtable, decoder also allocates window;
    mem=2000m is basically max, because its old and 32-bit).

    > OTOH, it looks a lot like Bulat's rep

    Same function, but different algorithms.
    Mine is based on CDC aka "anchor hashing" while Bulat's is more like usual LZ.
    Like this: http://nishi.dreamhosters.com/u/fma-diff_v0.rar

    > 1) Freearc having some codec doesn't mean that schnaader can use it.
    > Oh. Why is that? License, implementation, something else?

    1) Bulat apparently added some detection code, there're some patches,
    plus it uses some freearc headers - we have to ask if we can use it for precomp.
    2) Do you know where to find the most recent freearc source?
    3) Part of freearc source is in haskell
    4) Sometimes even plain C/C++ is not easy to use. For example, try making a standalone BCJ2 from 7zip source.
    5) Portability (I know that freearc had a linux build, but is it full-featured?)

    > You do know that fazip.exe can apply dispack from/to file and stdin-stdout, right?
    > That's Bulat's version, mind you. It's as simple as doing
    > fazip dispack in out

    Here I wanted something that we can reproduce, not just "best exe preprocessor",
    because then we'd need to bring in nosso, durilca, winrk/mcomp, rz and who knows what else.

  18. Thanks:

    Gonzalo (13th November 2019)

  19. #15
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 93 Times in 73 Posts
    OK. I give up. Your quoting system is way more efficient. Ugly, but efficient

    > There're some standalone builds in /u
    > (eg. http://nishi.dreamhosters.com/u/fma-rep-1900m3.exe c/d input output)

    > Or you can use it with 7zdll, as -m0=rep1:fb1000:c256M:mem2000M or something.
    > (Here "fb" is average fragment length, "c" is hashtable size, "mem" is window size;
    > encoder only allocates hashtable, decoder also allocates window;
    > mem=2000m is basically max, because its old and 32-bit).

    Thanks! Will take a peek


    > Same function, but different algorithms.
    > Mine is based on CDC aka "anchor hashing" while Bulat's is more like usual LZ.
    > Like this: http://nishi.dreamhosters.com/u/fma-diff_v0.rar

    That's what I wanted to know, very useful info.


    > 1) Bulat apparently added some detection code, there're some patches,
    > plus it uses some freearc headers - we have to ask if we can use it for precomp.
    > 2) Do you know where to find the most recent freearc source?
    > 3) Part of freearc source is in haskell
    > 4) Sometimes even plain C/C++ is not easy to use. For example, try making a standalone BCJ2 from 7zip source.
    > 5) Portability (I know that freearc had a linux build, but is it full-featured?)

    OK, I can only answer meaningfully to some of these issues.
    1) Definitely. Although I believe Christian likes to write his own parsers for precomp and call the libraries as needed.
    2) https://web.archive.org/web/20161128...oad-Alpha.aspx
    3) Not the codecs. All C & CPP. Could be wrong tho... @Bulat??
    4) Never tried, but I know it's difficult to say the least
    5) Nope. I mean yes, but only up to version 0.666 - There were some heavy changes after that version. Anyway, I'm not talking about porting the whole little monster, only some libraries.

    I tried to hire a guy last year to resurrect FreeArc and make some improvements over it. Nobody even knows what Haskell is. Those who kinda know are complete illiterates regarding data compression. I spent hours trying to explain the purpose of precomp to one guy. I'm not really sure he understood.
    And the old sources won't compile under the current GHC - not on Windows, nor on Linux. But I do believe the core algorithms are pure gold, as is the idea itself of switching between codecs to provide optimal results.


    > Here I wanted something that we can reproduce, not just "best exe preprocessor",
    > because then we'd need to bring in nosso, durilca, winrk/mcomp, rz and who knows what else.

    I'm not sure I follow... What's not reproducible about using the binary I used, instead of another one? I post it here just in case somebody doesn't have it:
    Attached Files

  20. #16
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    >> Mine is based on CDC aka "anchor hashing" while Bulat's is more like usual LZ.

    > That's what I wanted to know, very useful info.

    CDC matchfinding is in theory faster (single hashtable lookup/insert per fragment,
    while usual LZ basically needs a lookup per byte), though of course a lot depends
    on actual implementation.
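    A minimal C++ sketch of the CDC / "anchor hashing" idea (a toy rolling hash for illustration, not rep1's actual code): fragment boundaries are cut wherever the hash meets an anchor condition, so the same content produces the same fragments regardless of its position, and dedup needs only one hashtable lookup per fragment.
    Code:
      #include <cstdint>
      #include <cstddef>
      #include <vector>
      
      struct Fragment { size_t pos, len; };
      
      // Cut a buffer into content-defined fragments (~4 KB average for mask 0x0FFF).
      std::vector<Fragment> cdc_split(const uint8_t* data, size_t size,
                                      uint32_t mask = 0x0FFF) {
          std::vector<Fragment> frags;
          uint32_t h = 0;
          size_t start = 0;
          for (size_t i = 0; i < size; i++) {
              h = (h << 1) + data[i];        // placeholder hash; real CDC uses gear/Rabin
              if ((h & mask) == 0) {         // anchor condition -> fragment boundary
                  frags.push_back({start, i + 1 - start});
                  start = i + 1;
                  h = 0;
              }
          }
          if (start < size) frags.push_back({start, size - start});
          return frags;
      }
      // Dedup then hashes each whole fragment once and does a single
      // hashtable lookup/insert per fragment, instead of one lookup per byte.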

    > 2) https://web.archive.org/web/20161128...oad-Alpha.aspx

    Did you try downloading the sources there?
    There's one from a different date: https://web.archive.org/web/20150319...ources.tar.bz2
    but it's not the most recent.

    > 3) Not the codecs. All C & CPP. Could be wrong tho... @Bulat??

    Well, I already made a standalone version, so we know that it's all C++.
    Though I still had to add some main() and callbacks.
    The point was, we don't know whether we can add some feature to precomp
    just because an "open-source" program has it.

    > I tried to hire a guy last year to resurrect FreeArc and make some improvements over it.
    > Nobody even knows what Haskell is.

    Actually I don't understand the repacker's obsession with freearc.
    Freearc is basically used in repacks only as wrapper for external codecs -
    nobody cares about specific .arc format, tornado, GUI, or whatever other
    features it has.

    So imho it has to be replaced with something simpler, but more flexible.
    Afaik .arc format doesn't have support for stream interleaving
    (and thus multi-output codecs), so it should be better to just make a new tool.

    For example, 7-zip can easily replace freearc basically as is
    (since I already have some codecs as external stdio exes in 7zdll);
    to make it perfect we just need CLS integration and updates for some Inno plugins.

    However imho it should be better to compile a list of all the features
    you may need for repacks (eg. a virtual FS could be useful to process chunks
    of archive as "files" with an external codec), design a convenient syntax
    for repack scripts, then implement it from scratch, since no existing
    archive format would have all the necessary features.
    (For example, rar has file dedup, codec switching, rarvm, recovery,
    but its compression ratio is not very good and there's no support
    for external codecs).

    For now, I have this: http://nishi.dreamhosters.com/7zcmd.html

    > I'm not sure I follow... What's not reproducible about using the binary I used,
    > instead of another one? I post it here just in case somebody doesn't have it:

    We want to add a function of freearc/fazip to precomp (i.e. reproduce it).
    But fazip.exe being able to do something doesn't mean that it's easy to
    do the same in another program.

  21. #17
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 93 Times in 73 Posts
    Yes, of course. I have a copy of FA sources on my HD. It's the very latest. I'll leave it here just in case.

    > Actually I don't understand the repacker's obsession with freearc.

    Well, IDK about that, because I'm not a 'repacker'. I don't have a single game on my computer and I haven't played since I was a teen. I'm 29 now. I mostly 'pack' documents, photos, and test files. I also don't share them on any forum (except the occasional test result here), but I do upload them as personal backups.

    I just care about efficiency, and IMHO, FA format is a technically superior solution, for a plethora of reasons. I also don't use FA as a wrapper for anything bc I can do that better from scripts if I wanted to. I do use it occasionally (the console version) as a preprocessor, mainly bc of the 'rep' filter. It's proven time and again to enhance the overall efficiency of lzma-like compressors.



    > So imho it has to be replaced with something simpler, but more flexible.

    I completely agree. Fairytale was supposed to be that. When I set out to improve FA, I did it because it seemed the best way to have something useful working in a reasonable time. Boy was I wrong.


    > However imho it should be better to compile a list of all the features
    > you may need for repacks

    Again, I don't do 'repacks' but I do agree that most of these features would be useful in an everyday archiver. Fairytale again



    > But fazip.exe being able to do something doesn't mean that it's easy to
    > do the same in another program.

    Yup. Now I understand what you meant.
    Attached Files

  22. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    783
    Thanked 687 Times in 372 Posts
    Quote Originally Posted by Shelwien View Post
    Actually I don't understand the repacker's obsession with freearc.
    for hardcore ones, 1) rep and 2) CLS. Although 7z has its own external codec API, CLS was so much simpler to support that most 3rd-party codecs were implemented as CLS dlls.

    for the rest of us, better GUI, fashion and availability of CLS codecs




    Quote Originally Posted by Shelwien View Post
    1) Bulat apparently added some detection code, there're some patches,
    plus it uses some freearc headers - we have to ask if we can use it for precomp.
    2) Do you know where to find the most recent freearc source?
    3) Part of freearc source is in haskell
    4) Sometimes even plain C/C++ is not easy to use. For example, try making a standalone BCJ2 from 7zip source.
    5) Portability (I know that freearc had a linux build, but is it full-featured?)
    fazip.exe is compiled from C++ sources (no Haskell). All my codecs are pure C++ and provide a uniform C++ API.

    1) I don't remember any detection code, but otherwise my codec just combines all 28 (?) dispack streams together - it's simple technical work and I can contribute it to precomp if the author is interested.

    It will be more interesting to improve this simplified scheme to provide better compression, f.e. rip some features from BCJ2 and xflt3

  23. Thanks:

    Shelwien (13th November 2019)

  24. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > Well, IDK about that, because I'm not a 'repacker'.
    > I don't have a single game on my computer and I haven't played since I was a teen.

    It's just that these days game repacking is the most practical user-side application
    for advanced compression methods.
    Games are large and usually already compressed/encrypted, so all kinds of creative
    workarounds are required to compress them.
    In addition, game developers also keep experimenting with compression
    (eg. http://www.radgametools.com/oodle.htm, lzham, quark) so there's still progress
    despite overall loss of interest.

    > I'm 29 now. I mostly 'pack' documents, photos, and test files.

    All the popular formats of documents and images are already compressed,
    so, besides global dedup, actual compression algorithms became not really relevant.
    In fact, a winzip archive of jpegs would be smaller than nz/rz archive,
    because winzip does include a jpeg recompressor while nz/rz don't, even if their
    universal compression is more advanced.

    I suppose in this case freearc with srep and precomp 0.4.7+ would still be relevant,
    but I'd be more interested in your evaluation of my 7zdll and Aniskin's plugins (they can be combined).

    > I just care about efficiency, and IMHO, FA format is a technically superior solution, for a plethora of reasons.

    Comparing to what? vs zip,rar - yes, vs 7z - not really.
    But even .7z lacks too much. There're many format features not directly related to compression,
    which still require support in the format (proper volumes, recovery, file dedup).
    On the other hand, there're all kinds of new changes required by new recompression algorithms.
    Aside from the overall idea of having a single compressed archive index at the end
    of the archive file (rather than individual file headers like in zip/rar), I don't
    see what we could take from .7z or .arc for development of state-of-art archive format.

    > I also don't use FA as a wrapper for anything bc I can do that better from scripts if I wanted to.

    If you do use it with srep and precomp - that's a wrapper.
    If you use only integrated codecs - freearc compression is probably neither the best nor the fastest.

    > It's proven time and again to enhance the overall efficiency of lzma-like compressors.

    Well, there's a new generation of codecs now - oodle, zstd, brotli, rz.
    It's not really a fault of previous codecs like lzma, but the cpu architecture changed,
    so only new codecs made after the change are efficient now - their compression
    is similar to lzma, but decoding is 10x faster.

    >> So imho it has to be replaced with something simpler, but more flexible.
    > I completely agree. Fairytale was supposed to be that.

    Well, they moved the discussion off the forum, so I didn't really follow it.
    But anyway, I don't see any solutions (or even discussions) there of the issues
    that I have with existing formats (e.g. 7z), so I don't see the point.

    My own choice for now is to collect/write relevant new codecs and add them to .7z.
    Then, once I have my own framework that can replace 7-zip's
    (MT task manager, stream interleaver, codec API), I can start migrating to a new format.

    >> However imho it should be better to compile a list of all the features you may need for repacks
    > Again, I don't do 'repacks' but I do agree that most of these features would be useful
    > in an everyday archiver. Fairytale again

    1) Even if you don't call it that, once you have to manually design a compression
    script for some data set - that's basically a repack.

    2) Taking inspiration from game repacks is a good idea, because it's an active scene
    with lots of experience, tools and even whole programming languages (http://aluigi.altervista.org/quickbms.htm)
    developed specifically to compress large sets of data.

    For example, one major difference from well-known archivers is the heavy use of diffs.
    It's used in scenarios like this:
    - extract data chunks from game archives (eg. with a quickbms script)
    - dedup chunks
    - compress chunks with something fitting (there isn't always a 100% match with the original compressed data)
    - diff original game archive and new recompressed chunks archive
    How would you fit this into a known archive format? Well, you won't.

  25. #20
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 93 Times in 73 Posts
    OK, sorry Christian for going off-topic

    > All the popular formats of documents and images are already compressed,
    > so, besides global dedup, actual compression algorithms became not really relevant.


    They are if precomp is involved. PDFs are really compressible and susceptible to all kinds of preprocessing and codec tuning. PNGs too, and so on.




    > but I'd be more interested in your evaluation of my 7zdll and Aniskin's plugins


    You mean this?
    Quote Originally Posted by Gonzalo View Post
    You might have just created the best practical archiver since FreeArc with this few tweaks...
    Aniskin's plugins are great, but he himself has said that 7z internal structure makes the symbiosis too messy. For example, the latest Mfilter, which I think is great, has to write a temp file for every single thing it wants to process. Way subpar IMHO. It's not anybody's fault, it's just 7z is not the best format, to say the least.


    I also tried your 7zdll but I didn't pay too much attention to it bc it looked more like an experiment at the moment and I was also afraid of its legal status. Isn't it part of Power Archiver? Do you work for them? Is it your company? Bottom line, can we use the program w/o infringing copyrights?


    > Comparing to what? vs zip,rar - yes, vs 7z - not really.
    > I don't see what we could take from .7z or .arc for development of state-of-art archive format.


    Maybe now, after years of FA being dead and lots of people trying to address 7z's shortcomings. I'll give you just a few examples of things that didn't exist at the time in a user-friendly archiver:


    * File sorting and grouping
    * Content aware compression
    * Different kinds of deduplication
    * Dedicated lossless compressor for audio
    All of it automatic. It was just the perfect solution, something both faster and stronger than the competitors.


    And this wasn't even the greatest thing about it. It was the mindset... zip, 7z and company are too simplistic. Take a file, pass it through THE codec. Write it to disk. Up until today, official 7z doesn't capitalize on the fact that it's got a lot of codecs included. It only lets you choose one and that's it. Just one per run. I mean, really? That's soo not efficient.




    > If you do use it with srep and precomp - that's a wrapper.


    No I don't. I lost interest in Srep a long time ago bc I don't really need it (I don't have 10 GB archives) and rep is more efficient on smaller inputs.
    Regarding precomp, I always use the latest commit via shell scripts, mostly to run it in parallel.




    > If you use only integrated codecs - freearc compression is probably neither the best nor the fastest.


    Agree! Let me show you what I use it for and how:


    Code:
    //pseudo code
    
    $ some_script_applying_precomp_multithreaded 
    
    $ wine arc a -m0=rep:500m+probably_some_other_filter
    
    $ wine 7-Zip-Zstandard_7z.exe a OUT -mfb=273 -myx=9 -m0=flzma2 -mf=on -mqs=on -md=128m -mx9 -ms=on IN.arc

    That's the most efficient configuration for me. Strong enough, fast enough, pareto frontier (backed by experiments).


    ​> so only new codecs made after the change are efficient now - their compression is similar to lzma, but decoding is 10x faster.

    That's a good thing, but cspeed remains slow for lzma-like ratios. That's where things like rep are truly helpful, to speed up compression.


    > My own choice for now is to collect/write relevant new codecs and add them to .7z.
    > Then, once I have my own framework that can replace 7-zip's (MT task manager, stream interleaver, codec API), I can start migrating to a new format.

    You might want to look at Márcio's draft for the Fairytale format. I'm no expert, I just remember a lot of knowledgeable people saying it was the closest thing to an ideal format.


    About the 'repacker' thing: Fair enough, I guess I do things like that. I just don't like the label.

    That last scheme with the diffs seems a lot like a poor man's preflate for other formats. I did something like that for lzx streams a while ago just to see if it yielded any gains. It did, a lot...

  26. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > OK, sorry Christian for going off-topic

    I can always move it to a new thread if it continues.
    For now it's not even really off-topic.

    >> so, besides global dedup, actual compression algorithms became not really relevant.

    > They are if precomp is involved. PDFs are really compressible and susceptible to all kinds
    > of preprocessing and codec tuning. PNGs too, and so on.

    Yes, but imho precomp is more of a proof-of-concept than a tool for practical use.
    For repeated automatic use you need to always verify correct decompression
    and provide a fallback compression script without it if something happens (like a crash).

    Also, all the recompression libraries used by precomp are 3rd-party unmaintained code,
    so we might have a problem if somebody makes a remote execution exploit for packmp3,
    or when x86 becomes obsolete.

    As to PNG/PDF... precomp is certainly the best tool atm,
    but there's a lot of potential for improvement, like that FLIF idea.

    > Aniskin's plugins are great, but he himself has said that 7z internal structure makes the symbiosis too messy.

    Uh, no, "7z internal structure" is unrelated here.
    It's not like any archiver can provide a codec API with random seek.

    > For example, the latest Mfilter, which I think is great, has to write a temp file
    > for every single thing it wants to process. Way subpar IMHO.
    > It's not anybody's fault, it's just 7z is not the best format, to say the least.

    Nope. In this case it's the fault of the jpeg format designers.
    Jpeg is a very ugly format - to start with, there's no certain way to even detect it,
    except "start decoding from here and see if it eventually works out".
    There's no specific signature, ~5 different ways to embed other jpegs,
    and a "progressive mode" which is incompatible with sequential recompression.

    Because of that, brunsli/lepton/packjpg prefer to load/parse the whole file first,
    then recompress it in a second pass.
    Still, there're pjpg and jojpeg that can recompress non-progressive jpegs
    sequentially.

    But tempfiles in mfilter and precomp are just a lazy workaround.
    Jpeg files are not large enough that they won't fit in RAM.

    > I also tried your 7zdll but I didn't pay too much attention to it bc it
    > looked more like an experiment at the moment and I was also afraid of its legal status.

    Well, that's unfortunate.

    > Isn't it part of Power Archiver? Do you work for them? Is it your company?

    Yes, yes, no.

    > Bottom line, can we use the program w/o infringing copyrights?

    You can even use a trial of the full GUI PA; I don't see any reason for caution.
    Can you name a program which can't be legally used?
    The only real case is when you sign an NDA and then leak the software.

    Though sure, it's an experimental tool for testing rather than a supported product.
    But there's no difference with precomp in that sense.

    I don't advertise it mainly because there's no automated codec switching -
    we actually have some, but it's on the GUI side.
    But 7zdll is atm the best solution for mp3 recompression, may be pretty good for pdfs with reflate/jojpeg,
    there's cdm, x64flt3 is better than BCJ2, etc.

    > Maybe now, after years of FA being dead and lots of people trying to address 7z's shortcomings.

    "7z's shortcomings"?
    I can only name recovery records, a too-simple volume format and the lack of a repair function.
    But I've never heard of anybody trying to fix those.

    And who's "lots of people"? There's flzma author, Stephan's EcoZip, Aniskin and 7zdll.
    Am I missing something?

    Otherwise .7z was and is the most efficient and flexible archive format.
    Sure, .arc is close and may be better on some points, but the lack of stream interleaving
    and a multi-output API is a deal-breaker.
    Meanwhile, it's quite possible to add all of freearc's codecs to .7z.

    > * File sorting and grouping

    It was introduced by rar; 7z kinda has it too.
    As for grouping, it was always possible to write a script
    to add different filetypes with different codecs via multiple 7z calls,
    and if you want a text config, there's now smart7z or something.
    But 7z always had a file analysis stage, it just wasn't used for anything
    except for BCJ2 and maybe wavs.

    > * Content aware compression

    See above - even baseline 7z actually can detect exes by signatures;
    the same could always be done for other filetypes.

    Rar is better in that sense because it has an option to switch codecs
    within a file - it's a real problem.

    But afaik it's a problem both for .7z and .arc too.

    > * Different kinds of deduplication

    Yes, but it was always possible to attach srep or something to 7z too, just nobody bothered.
    And now (since 2016 or so) there's also 7zdll with rep1.

    > * Dedicated lossless compressor for audio

    What's there, tta? Afaik it's only applicable to whole files and not like Bulat made it.
    Then, zip has wavpack, 7z has delta, rar has its own detection and audio compression.
    Also, nz and rz are currently much better for that anyway.

    > All of it automatic. It was just the perfect solution,
    > something both faster and stronger than the competitors.

    I've only seen it used as a batch wrapper for external codecs.
    The introduction of dedup was certainly a breakthrough, but afaik there's no special
    support for it in the .arc format, so it can be used just as well anywhere.

    > And this wasn't even the greatest thing about it. It was the mindset...
    > zip, 7z and company are too simplistic.

    winzip actually was the first to introduce jpeg recompression.
    Currently it has wzjpeg, wavpack, lzma and xz codecs,
    so it might beat freearc on some data sets.

    > It only lets you choose one and that's it. Just one per run.
    > I mean, really? That's soo not efficient.

    Unfortunately it's also true for freearc (or at least as true as for 7z).
    Did you ever see the auto-assigned 7z profile for exes?
    It's actually something like this: "m0=BCJ2 -m1=LZMA:d25 -m2=LZMA:d19 -m3=LZMA:d19 -mb0:1 -mb0s1:2 -mb0s2:3".

    Of course, there's potential for improvement.
    For example, there's currently no option to run global dedup,
    then compress some files with lzma and others with ppmd.
    But it's the same for 7z and freearc.

    > If you do use it with srep and precomp - that's a wrapper.

    > and rep is more efficient on smaller inputs.

    Uh, I think you should test it.
    Bulat added some new modes there, like "future-LZ", which actually improve compression.

    > $ wine arc a -m0=rep:500m+probably_some_other_filter
    > $ wine 7-Zip-Zstandard_7z.exe a OUT -mfb=273 -myx=9 -m0=flzma2 -mf=on -mqs=on -md=128m -mx9 -ms=on IN.arc
    > That's the most efficient configuration for me. Strong enough, fast enough, pareto frontier (backed by experiments).

    Yeah, it really doesn't seem like you need freearc here.

    > That's a good thing, but cspeed remains slow for lzma-like ratios.

    Not really, you can try the faster modes in oodle/brotli/zstd.
    And it's impossible to get fast encoding speed with a larger-than-cpu-cache window.

    > You might want to look at Márcio's draft for the Fairytale format. I'm no expert,
    > I just remember a lot of knowledgeable people saying it was the closest thing to an ideal format.

    I actually did look yesterday. It's full of unimportant details like specific data structures,
    but doesn't provide a solution for new features like multi-layer dedup.
    (Currently we can either dedup data, then preprocess, or preprocess then dedup;
    there's no option to fall back to original data when archiver detects that
    preprocessing hurts compression on specific chunks of data.)

    > That last scheme with the diffs seems a lot like a poor man's preflate for other formats.

    Yes, but surprisingly it's also frequently used for speed.
    Say we extract a million <=64 KB chunks, each individually compressed with LZ4.
    In such a case it might be faster to run solid compression of the chunk data as a single file
    and fix encoder output differences with a patch.

  27. #22
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    783
    Thanked 687 Times in 372 Posts
    Eugene, I think you miss the point - 7z had some great features, rar had some great features, and even winzip had a few. FA just combined them - not all, but it had more features simultaneously than any competitor.

  28. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    My points are these:
    1. People keep talking about freearc's greatness, but actually use it as a wrapper for external codecs, for example srep, xtool and lolz.
    Gonzalo's case seems to be different, but even weirder, since he runs precomp externally and just uses freearc for rep/delta/mm.
    2. It'd be easier for me if people learned to use 7-zip properly, because I can add stuff to 7z, but not to freearc.

  29. #24
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 93 Times in 73 Posts
    We should have a "Holy Wars" thread for these sorts of things


    On a more serious note:

    @Shelwien: Maybe you're right. Maybe 7z is objectively better. With these things, there isn't always an empirical way of determining if a solution is the best, bc there are so many vectors. It's not just about which file size is smaller anymore; we're talking about whether a particular program or chain of programs is best suited for a particular case, and our cases are different enough for it to matter. I don't feel like 7z provides me with everything I need right now (nor FA for that matter), so I use other tools too.

    -----------------
    Regarding codecs, instead of full-blown archivers, I really like Bulat's suggestion:

    Quote Originally Posted by Bulat Ziganshin View Post
    It will be more interesting to improve this simplified scheme to provide better compression, f.e. rip some features from BCJ2 and xflt3
    I know we all care about precomp bc here we are, so why not make the best of it? I can definitely help testing.

    BTW, what is cdm?? I don't think I ever heard of it until yesterday


    -----------------
    > doesn't provide a solution for new features like multi-layer dedup. (Currently we can either dedup data, then preprocess, or preprocess then dedup;

    Actually, it does. In Fairytale, preprocessing IS deduplication. Everything is fingerprinted so if a jpg is in the root folder, also inside a zip, also inside a pdf inside a tgz, it only gets processed once. After that, other kinds of deduplication can be applied to blocks, chunks or the whole thing.


    > there's no option to fall back to original data when archiver detects that preprocessing hurts compression on specific chunks of data.)

    I believe this depends on the implementation. The format itself doesn't care how the data got in the archive, only that it makes sense so it can be decompressed. Anyway, it's a draft, so improvements can and certainly should be made. Márcio himself said that multiple times.


    -----------------
    PS: I like where this is going...

  30. #25
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > I don't feel like 7z provides me with everything I need right now (nor FA for that matter),
    > so I use other tools too.

    Maybe you need to learn more about the tools you have.
    Like, your example script in previous post could be implemented with freearc alone.
    For MT precomp you can use this:
    http://nishi.dreamhosters.com/rzwrap_precomp_v0a.rar
    http://nishi.dreamhosters.com/u/precomp.cpp.diff (doesn't work with unmodified precomp!)

    >> f.e. rip some features from BCJ2 and xflt3
    > I know we all care about precomp bc here we are, so why not make the best of it?

    It doesn't really work like that in this case.
    A good disasm filter would always be better than BCJ2 or x64flt3.
    The problem with dispack is that it's outdated... there's no x64 parser
    (it still works because the CALL/JMP encoding is the same) and no support for new vector instructions.

    > BTW, what is cdm?? I don't think I ever heard of it until yesterday

    It's this: https://encode.su/threads/2742-Compr...ll=1#post52493
    Kind of a universal recompressor for huffman-based formats.

    >> doesn't provide a solution for new features like multi-layer dedup.
    > Actually, it does. In Fairytale, preprocessing IS deduplication.
    > Everything is fingerprinted so if a jpg is in the root folder,
    > also inside a zip, also inside a pdf inside a tgz, it only gets processed once.

    Not what I meant.
    Here's a more specific example:
    - We have a set of exe files
    - Exe preprocessor turns relative addresses in exe files to absolute
    (it usually improves compression by creating more matches within the exe)
    - There're chunks of code which are the same in original files,
    but become different after preprocessing.

    Normally there're only two options - either we preprocess a file, or we don't.
    But here the optimal solution would be to preprocess most of the data,
    except for chunks which are duplicated in original files.

    In this case, maybe a specific solution can be created,
    like integrating the exe filter into dedup and only
    running it on unmatched chunks.

    But ideally it has to be the job of the archiver and apply
    not only to cases with specifically written manual workarounds,
    but to anything.

    >> there's no option to fall back to original data when archiver detects
    >> that preprocessing hurts compression on specific chunks of data.)

    > I believe this depends on the implementation.

    Yes, currently it is handled by preprocessors.
    For example, Bulat's delta filter does some rough entropy estimation
    to see if delta-coding of a table improves compression.
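    As an illustration of that kind of check, here's a hedged C++ sketch (not Bulat's actual delta code): compare an order-0 entropy estimate of the original bytes against the delta-coded bytes and keep the transform only if the estimate drops.
    Code:
      #include <cstdint>
      #include <cstddef>
      #include <cmath>
      #include <vector>
      
      // Rough order-0 entropy of a buffer, in bits (an estimate of compressed size).
      static double order0_bits(const uint8_t* p, size_t n) {
          size_t cnt[256] = {0};
          for (size_t i = 0; i < n; i++) cnt[p[i]]++;
          double bits = 0;
          for (int s = 0; s < 256; s++)
              if (cnt[s]) bits -= cnt[s] * std::log2((double)cnt[s] / (double)n);
          return bits;
      }
      
      // Keep delta coding with the given stride only if the estimate improves.
      static bool delta_helps(const uint8_t* p, size_t n, size_t stride) {
          std::vector<uint8_t> d(n);
          for (size_t i = 0; i < n; i++)
              d[i] = (uint8_t)(p[i] - (i >= stride ? p[i - stride] : 0));
          return order0_bits(d.data(), n) < order0_bits(p, n);
      }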

    But this same problem applies basically to any codec and preprocessor,
    so I think that adding an approximate detector (and other things like stream interleaving)
    to every codec is not a good solution.

    > Anyway, it's a draft, so improvements can and certainly should be made.
    > Márcio himself said that multiple times.

    I currently think that it's too early to design an archive format.
    We have to start with codec API, MT task manager and stream interleaving.
    There're obscure but tricky parts like progress estimation
    (the dedup filter quickly caches all input data, then the archiver gets stuck at 99%
    for a long time, because it has to actually compress the data;
    but during encoding we can't calculate progress from compressed size either,
    because we don't know the total compressed size),
    or memory allocation
    (it's a bad idea to let each codec allocate whatever it wants;
    ideally the task manager has to collect alloc requests from all nodes in the filter tree,
    adjust parameters where necessary, then do a batch allocation).

    Based on that we can build a few standalone composite coders
    (like rep1/mp3det/packmp3c/jojpeg/lzma in 7zdll),
    then a few based on new methods (like stegdict+xwrt+ppmd or minhash+bsdiff+lzma),
    and only then we'd know actual requirements for the archive format to support that.

  31. #26
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    616
    Thanks
    264
    Thanked 242 Times in 121 Posts
    Got another development version here; this time I multithreaded reconstruction (-r) of JPG. This involved all the nasty multi-threading stuff (threads, mutexes, std::future), so there might be some bugs, especially when combined with other streams and recursion, so the preferred test case would be to throw all kinds of stuff together and try this version on it.

    Unfortunately, both packJPG and packMP3 use all kinds of global variables and so are not thread-safe, so this only speeds up brunsli-encoded JPGs, but not packJPG-encoded ones or MP3s.
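    For those curious, here's a hedged C++ sketch of the std::future-style fan-out described above (not the actual precomp code; decode_one is a hypothetical stub standing in for the per-stream brunsli decoder):
    Code:
      #include <cstdint>
      #include <functional>
      #include <future>
      #include <vector>
      
      using Bytes = std::vector<uint8_t>;
      
      // hypothetical stub - a real implementation would call the brunsli decoder
      static Bytes decode_one(const Bytes& compressed) { return compressed; }
      
      // Decode each recompressed JPG stream in its own task, then collect the
      // results in the original order so the output stays deterministic.
      static std::vector<Bytes> reconstruct_all(const std::vector<Bytes>& streams) {
          std::vector<std::future<Bytes>> tasks;
          tasks.reserve(streams.size());
          for (const Bytes& s : streams)             // fan out one task per stream
              tasks.push_back(std::async(std::launch::async, decode_one, std::cref(s)));
          std::vector<Bytes> out;
          out.reserve(tasks.size());
          for (auto& t : tasks)                      // join in submission order
              out.push_back(t.get());
          return out;
      }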

    For JPG-heavy files like this MJPEG video, JPG recompression scales with the number of cores:

    Code:
    28.261.596 WhatBox_720x480_411_q25.avi
    20.676.282 WhatBox_720x480_411_q25.avi.pcf, 13 s, 1174 JPG streams (precomp -cn)
    28.261.596 WhatBox_720x480_411_q25.avi_   , 10 s (precomp048dev -r, old version)
    28.261.596 WhatBox_720x480_411_q25.avi_   ,  3 s (precomp048dev -r, new version)
    Attached Files
    http://schnaader.info
    Damn kids. They're all alike.

  32. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    Can you also do something with this https://github.com/schnaader/precomp...ecomp.cpp#L129 ?
    Like in the patch I made here:
    http://nishi.dreamhosters.com/rzwrap_precomp_v0a.rar (it's a blockwise MT wrapper for precomp)
    http://nishi.dreamhosters.com/u/precomp.cpp.diff

    Aside from overlapping names of temp files, there was another problem - how to stop precomp from asking questions.

  33. #28
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,572
    Thanks
    783
    Thanked 687 Times in 372 Posts
    on the fairytale format - I studied v1.2 and then recalled that I had already done this a year ago

    1. The overall format is very weak and doesn't include many features from 7z/arc. I recalled that I had already written about it and that my comments were rejected. Since the authors have no prior experience of archive development, we can't really expect anything else.

    2. The only FTL thing lacking in 7z/arc is the FTL block structure. The format they developed serves a very specific need - an archive-wide catalogue of chunks at any compression stage. It's the sort of feature that may sometimes improve compression ratios. It could be supported in 7z/arc by adding another type of meta-info block, so there's no need to introduce a new archive type.

  34. #29
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    547
    Thanks
    203
    Thanked 798 Times in 323 Posts
    I don't know why I even bother, but here goes..

    @Shelwien:

    >I actually did look yesterday.

    You probably didn't look very hard.

    >Its full of unimportant details like specific data structures,

    It's.. a specification for a format, i.e., it should describe in detail how to read an archive and interpret its data structures..

    >but doesn't provide a solution for new features like multi-layer dedup.
    >(Currently we can either dedup data, then preprocess, or preprocess then dedup;
    >there's no option to fall back to original data when archiver detects that
    >preprocessing hurts compression on specific chunks of data.)

    >Here's a more specific example:
    >- We have a set of exe files
    >- Exe preprocessor turns relative addresses in exe files to absolute
    >(it usually improves compression by creating more matches within the exe)
    >- There're chunks of code which are the same in original files,
    >but become different after preprocessing.

    >Normally there're only two options - either we preprocess a file, or we don't.
    >But here the optimal solution would be to preprocess most of the data,
    >except for chunks which are duplicated in original files.

    It doesn't even mention preprocessors, since those would just be codecs, usually the first ones in a codec sequence for a block type.
    And it was specifically designed to do just what you mentioned. First, deduplication occurs at the file level, then later on it's done at the content level (if a parser splits a block, we perform deduplication on those new sub-blocks). After all parsing is done, the plan was to run a rep-like (or such) deduplication stage on any non-specific blocks, so they could still be further divided.

    So in your example, if the common code chunks were large enough to merit deduplication, they'd be deduped before any exe-preprocessor would even be called.

    > there's no option to fall back to original data when archiver detects
    > that preprocessing hurts compression on specific chunks of data.)

    Nothing in the format prevents that. Each block uses a codec-sequence. The user could specify that, for instance, on 24bpp images, we should try 2 sequences for every block: one that includes a color-space transform before the actual compression, and one that doesn't; and we just keep the best. Only compression would be slower, decompression would be unaffected.
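
    A toy version of that "try both, keep the best" step (CodecSequence here is just an invented placeholder for whatever chain the user configured); with two candidates - say, plain lzma and colour-transform-then-lzma - only the smaller payload and its sequence_id end up in the archive:

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <utility>
    #include <vector>

    // Placeholder: a codec sequence is just a function from block to compressed bytes.
    using CodecSequence = std::function<std::vector<uint8_t>(const std::vector<uint8_t>&)>;

    struct EncodedBlock {
        size_t sequence_id;            // recorded so the decoder knows which sequence to undo
        std::vector<uint8_t> payload;
    };

    // Try every candidate sequence on the block and keep the smallest result.
    // Only compression gets slower; decompression just follows sequence_id.
    static EncodedBlock encode_best(const std::vector<uint8_t>& block,
                                    const std::vector<CodecSequence>& candidates) {
        EncodedBlock best{0, candidates[0](block)};
        for (size_t i = 1; i < candidates.size(); ++i) {
            std::vector<uint8_t> out = candidates[i](block);
            if (out.size() < best.payload.size())
                best = {i, std::move(out)};
        }
        return best;
    }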

    @Bulat:

    >1. overall format is very weak and doesn't include many features from 7z/arc.

    It was never meant to include everything and the kitchen sink; it was supposed to be efficient at allowing practical (re)compression of many useful formats and to allow the operations that 99% of users may want to do. The idea was that it would provide a skeleton that users here could use to test out their own codecs without having to worry about writing parsers, dedupers, archiving routines, etc.

    >I recalled that I already wrote about it and my comments were rejected.

    I'm gonna have to call you out on this one, sorry.

    In the Community Archiver thread, we discussed how you'd prefer a DFS versus my BFS parsing strategy, and how the archive itself doesn't care and both can be available to the user.
    It was also important to discuss how to handle the potential for exponential storage needs when parsing, and I proposed a fixed intermediary storage pool, as that seems like the only realistic solution since we can't expect the end-user to simply have infinite storage. The prototype code was a quick hack to see if it could be done, and it could. I have a (non-public) completely rewritten version that solves the problem of having lots of memory allocations and temporary files by using a pre-allocated hybrid pool of a single memory block and a single temporary file, so that much is done.
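
    Such a hybrid pool can be as simple as handing out offsets from one pre-allocated buffer and spilling to one temporary file once the buffer is full. A sketch under those assumptions, not the rewritten code itself:

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One memory block, one temp file: reservations never trigger new mallocs or new files.
    class HybridPool {
    public:
        explicit HybridPool(size_t mem_capacity)
            : mem_(mem_capacity), mem_used_(0), spill_(std::tmpfile()), file_used_(0) {}
        ~HybridPool() { if (spill_) std::fclose(spill_); }

        struct Slot { bool in_memory; size_t offset; size_t size; };

        // Reserve space; prefer the memory block, fall back to the temp file.
        Slot reserve(size_t size) {
            if (mem_used_ + size <= mem_.size()) {
                Slot s{true, mem_used_, size};
                mem_used_ += size;
                return s;
            }
            Slot s{false, file_used_, size};
            file_used_ += size;
            return s;
        }

        void write(const Slot& s, const uint8_t* data) {
            if (s.in_memory) {
                std::copy(data, data + s.size, mem_.begin() + s.offset);
            } else {
                std::fseek(spill_, static_cast<long>(s.offset), SEEK_SET);
                std::fwrite(data, 1, s.size, spill_);
            }
        }

    private:
        std::vector<uint8_t> mem_;
        size_t mem_used_;
        std::FILE* spill_;
        size_t file_used_;
    };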

    In the Fairytale thread itself, you said:

    >basic stuff:
    >- everything including info you saved in the archive header should be protected by checksums

    As proposed.

    >- it will be great to have archive format that can be written strictly sequentially. this means that index to metadata block(s) should be placed at the archive end

    As proposed.

    >- in 99.9% cases, it's enough to put all metadata into single block, checksummed, compressed and encrypted as the single entity

    Every global data structure is a checksummed single block. They would be stored uncompressed to make for easier reading by external tools, which seems like good practice if the format were to get any traction outside this forum.
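
    To make the sequential-writing part concrete, a sketch of such a layout (invented field names, not the actual spec): data blocks first, then the single metadata block, then a small fixed-size footer holding the metadata offset and checksum:

    Code:
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Invented fixed-size footer, written last, so the archive can be produced
    // strictly sequentially and still be located by seeking to EOF - sizeof(Footer).
    struct Footer {
        uint64_t metadata_offset;   // where the single metadata block starts
        uint64_t metadata_size;
        uint32_t metadata_crc32;    // checksum of the (uncompressed) metadata block
        uint32_t magic;
    };

    static void finish_archive(std::FILE* f, const std::vector<uint8_t>& metadata,
                               uint32_t metadata_crc32) {
        long pos = std::ftell(f);                       // data blocks were already written
        std::fwrite(metadata.data(), 1, metadata.size(), f);
        Footer footer{static_cast<uint64_t>(pos), metadata.size(),
                      metadata_crc32, 0x46414952u};     // arbitrary magic value for the sketch
        // A real format would serialize field by field with fixed endianness.
        std::fwrite(&footer, sizeof(footer), 1, f);
    }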

    >- all metainfo should be encoded as SoA rather than AoS

    Since we'd be using variable-length integers, most of the benefits of SoA over AoS aren't really available. If you know you have 10 structures to read, and the first array contains 10x the size field, you can't just read 10*sizeof(..) bytes.
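
    For example, with LEB128-style varints each field's width depends on its value, so even a pure SoA "sizes" array has to be decoded one value at a time (illustrative only):

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LEB128-style decode: 7 payload bits per byte, high bit = "more bytes follow".
    static uint64_t read_varint(const uint8_t*& p) {
        uint64_t value = 0;
        int shift = 0;
        uint8_t byte;
        do {
            byte = *p++;
            value |= static_cast<uint64_t>(byte & 0x7F) << shift;
            shift += 7;
        } while (byte & 0x80);
        return value;
    }

    // Reading N sizes: there's no jumping ahead by N*sizeof(size), because each
    // varint's width is only known after it has been decoded.
    static std::vector<uint64_t> read_sizes(const uint8_t* p, size_t count) {
        std::vector<uint64_t> sizes(count);
        for (size_t i = 0; i < count; ++i)
            sizes[i] = read_varint(p);
        return sizes;
    }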

    >- allow to disable ANY metainfo field (crc, filesize...) - this requires that any field should be preceded by tag. but in order to conserve space, you can place flags for standard set of fields
    >into the block start, so you can still disable any, but at very small cost

    As proposed.

    >- similarly, if you have default compression/deduplication/checksumming/... algorithms, allow to encode them with just a few bits, but provide the option to replace them with arbitrary
    >custom algos

    As proposed.

    So how was your input ignored?

    >Since authors have no experience of prior archive development, we can't really expect anything else

    Nice ad hominem. I get it, I'm not part of the "old guard" of this community, so what could I possibly know?

    Also on that Fairytale post:
    >I have started to develop nextgen archive format based on all those ideas, but aborted it pretty early. Nevertheless, I recommend to use it as the basis and just fill in unfinished parts of my spec.

    You do realize that could just as well be a quote from me? At least I tried, and didn't make myself out to be some uber-expert who had all the answers. I said I couldn't do it alone, which was as true then as it is now. Trying to write multi-platform, multi-processor archive handling routines in C++ is completely daunting, and I have no motivation to spend months (if I'm being optimistic) trying to learn enough to get it right.

    @all:
    This is exactly what made me decide to quit posting any stuff here. All I see here is talk. Lots of talk, very little (if any) action. It's so, so easy to sit on the sidelines and criticize others.

  35. Thanks (2):

    Gonzalo (16th November 2019),schnaader (16th November 2019)

  36. #30
    Member
    Join Date
    Jun 2018
    Location
    Yugoslavia
    Posts
    65
    Thanks
    8
    Thanked 5 Times in 5 Posts
    "Normally there're only two options - either we preprocess a file, or we don't.
    But here the optimal solution would be to preprocess most of the data,
    except for chunks which are duplicated in original files."

    I think you should preprocess those chunks too.


