
Thread: Precomp 0.4.6

  1. #1
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts

    Precomp 0.4.6

    Precomp 0.4.6 is out.

    List of changes (also see closed issue list at GitHub)

    • Using liblzma for on-the-fly compression (-cl) - thanks to sftt
    • Reduced temporary files usage
    • Much faster intense and brute mode
    • Intense and brute mode can be combined now for best results
    • Smoother progress indicator, second progress indicator in lzma mode
    • Flag -e to preserve file extension (file.ext => file.ext.pcf instead of file.pcf) - thanks to guptaprince
    • JPG detection faster and more reliable - thanks to Márcio Pais
    • Updated zlib to 1.2.11
    • Show Precomp version together with OS type (Linux/Windows) and 32/64 bit
    • Fixed crashes on certain files (Issue #52, Issue #59)
    • Fixed incorrect restoration of a PNG multi file (Issue #50)
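    As a quick example of the new options in action (the file name is just a placeholder): on-the-fly LZMA, extension preservation and the combined detection modes can be used together:
    Code:
    precomp -cl -e -intense -brute archive.tar
    This writes archive.tar.pcf (instead of archive.pcf, thanks to -e), with the result compressed by the built-in liblzma.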


    Have a look at https://github.com/schnaader/precomp-cpp
    http://schnaader.info
    Damn kids. They're all alike.

  2. Thanks (16):

    78372 (8th September 2017),Bulat Ziganshin (11th September 2017),Dimitri (11th September 2017),Gonzalo (12th September 2017),hunman (8th September 2017),Intrinsic (15th October 2017),kassane (3rd October 2017),Mike (8th September 2017),RamiroCruzo (8th September 2017),Razor12911 (9th September 2017),redrabbit (11th September 2017),Samantha (11th September 2017),Simorq (8th September 2017),Skymmer (8th September 2017),Stephan Busch (8th September 2017),unarc 125 (16th September 2017)

  3. #2
    Member
    Join Date
    Feb 2017
    Location
    none
    Posts
    25
    Thanks
    6
    Thanked 13 Times in 6 Posts
    Thanks for the update, it works faster now. A 32-bit Linux build is missing, so I'm attaching one compiled from the Git sources.
    Attached Files

  4. Thanks:

    schnaader (11th September 2017)

  5. #3
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    Just added precomp to 7zdll, for testing.
    Had to do this:
    Code:
    #include <cstdio>      // sprintf
    #include <windows.h>   // GetCurrentProcessId

    // Generate per-process, per-call temp file names so several instances
    // don't clash on the fixed "~temp..." names.
    char* get_temp_file() {
      unsigned pid = GetCurrentProcessId();
      char* s = new char[256];
      static unsigned idx = 0;
      sprintf( s, "precomp_%08X_%i", pid, idx++ );
      //printf( "!%s!\n", s );
      return s;
    }

    // names of temporary files (were fixed strings before)
    char* metatempfile = get_temp_file(); //[18] = "~temp00000000.dat";
    char* tempfile0    = get_temp_file(); //[19] = "~temp000000000.dat";
    char* tempfile1    = get_temp_file(); //[19] = "~temp000000001.dat";
    char* tempfile2    = get_temp_file(); //[19] = "~temp000000002.dat";
    char* tempfile2a   = get_temp_file(); //[20] = "~temp000000002_.dat";
    char* tempfile3    = get_temp_file(); //[19] = "~temp000000003.dat";
    char* tempfile4    = get_temp_file(); //[19] = "~temp000000004.dat";


    Also had to disable all the console prompts that ask before overwriting files (there's no option to suppress them).

    But now there's a program with MT precomp and reflate integrated :)

  6. Thanks (8):

    78372 (12th September 2017),Gonzalo (12th September 2017),kassane (3rd October 2017),Mike (12th September 2017),PrinceGupta (12th September 2017),RamiroCruzo (12th September 2017),Razor12911 (18th September 2017),Simorq (12th September 2017)

  7. #4
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    I'm sorry if this is a dumb question, but is there a binary we can try?

  8. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    Well, here's the modified precomp compiled with IntelC 17.
    As for 7zdll, it's in http://nishi.dreamhosters.com/u/
    Attached Files

  9. Thanks (3):

    78372 (27th September 2017),RamiroCruzo (13th September 2017),Simorq (14th September 2017)

  10. #6
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    Thank you Shelwien.
    I'm sad to report that precomp64 is not multithreaded, at least on my machine. I have an Intel Atom with 2 cores and hyperthreading, which makes 4 threads in total, but the process manager reports a usage of 25% at most. In fact, precomp rarely even reaches that ceiling. I'm using it in intense mode, and from a RAM disk to avoid I/O bottlenecks.

  11. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    Of course precomp itself doesn't. As I said, it's integrated into 7z.
    Get 7zdll and try something like this:
    Code:
    7z a -mx=9 -myx=9 -mf=off -bb3 -m0=reflate:x6 -m1=precomp:mt4:c64M -m2=plzma archive.pa *.pdf
    Note the mt4. And c64M is the chunk size for processing - chunks have to be processed independently.
    Also, I'm running precomp like this: "precomp64.exe -cn -d1 -t+PZGNFJSMB -zl -o%s. %s"
    So deflate in it is basically disabled; use reflate for that.

  12. Thanks:

    Simorq (14th September 2017)

  13. #8
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts
    Quote Originally Posted by Shelwien View Post
    Also, I'm running precomp like this: "precomp64.exe -cn -d1 -t+PZGNFJSMB -zl -o%s. %s"
    So deflate in it is basically disabled, use reflate for that.
    That's very redundant. "-zl" should be enough to disable deflate streams. But yeah, it disables parsing this way, so it might be a bit faster. Though with "-t+FJMB" parameters, you also disable GIF, JPG, Base64 and bZip2, which aren't handled by reflate/7z.

    You could also modify the PNG parser to only do the "stitching" of PNG files with multiple chunks ("png_multi") so reflate can process them - or is Precomp called after reflate?
    http://schnaader.info
    Damn kids. They're all alike.

  14. Thanks:

    Simorq (14th September 2017)

  15. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,982
    Thanks
    298
    Thanked 1,309 Times in 745 Posts
    > That's very redundant. "-zl" should be enough to disable deflate streams.

    I also need to disable MP3. Also, it's "t+", so it's supposed to keep processing of GIF etc.?

    > You could also modify the PNG parser to only do the "stitching" of PNG files with multiple chunks

    reflate still doesn't have level/winsize autodetection, so PNGs are not a good idea anyway.
    Ideally I'd prefer precomp to process streams in known formats, while not scanning for zlib etc. elsewhere.

    I'm also planning to write a separate reflate-based recompressor for PNG, to completely convert them to BMPs.

    > is Precomp called after reflate?

    That depends on the 7z command line, actually.

    I can also add some options to the 7z handler that it would pass on to precomp, but I have no idea yet what they should be.

  16. #10
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts
    Quote Originally Posted by Shelwien View Post
    > That's very redundant. "-zl" should be enough to disable deflate streams.

    I also need to disable MP3. Also, it's "t+", so it's supposed to keep processing of GIF etc.?
    Ah, sorry, didn't catch that. So basically, it's the same as "-zl -t-3".
    http://schnaader.info
    Damn kids. They're all alike.

  17. #11
    Member
    Join Date
    Oct 2014
    Location
    South Africa
    Posts
    38
    Thanks
    23
    Thanked 7 Times in 5 Posts
    Quote Originally Posted by schnaader View Post
    That's very redundant. "-zl" should be enough to disable deflate streams. But yeah, it disables parsing this way, so it might be a bit faster. Though with "-t+FJMB" parameters, you also disable GIF, JPG, Base64 and bZip2, which aren't handled by reflate/7z.

    You could also modify the PNG parser to only do the "stitching" of PNG files with multiple chunks ("png_multi") so reflate can process them - or is Precomp called after reflate?
    How do you guys use something like "-t+FJMB" in FreeArc?

    I think alternatives such as -ti (include) for -t+ and -te (exclude) for -t- should be added to the command-line parser for easier usage in FreeArc.

    I have patched the binary file (+ changed to i, - changed to e) and it worked in FreeArc.
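    For illustration, the parser change amounts to accepting two more mode characters (just a sketch, not the actual precomp.cpp code):
    Code:
    #include <string>

    // Accept "-t+..."/"-ti..." as include lists and "-t-..."/"-te..." as exclude lists.
    // 'arg' points just past the "-t"; 'types' receives the type letters, e.g. "PJ3".
    bool parse_type_switch(const char* arg, bool& include, std::string& types) {
      switch (arg[0]) {
        case '+': case 'i': include = true;  break;
        case '-': case 'e': include = false; break;
        default: return false;               // unknown mode character
      }
      types.assign(arg + 1);
      return true;
    }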


    ------------------------
    EDITED:

    I have compiled both x86 and x64 binaries with VS2015. See the attachment, which includes precomp.cpp.

    Now both -t+ and -ti work, as well as -t- and -te. So you can easily use -tiPJ3 or -teP3 in FreeArc to save time.

    Using -tePNFJSM3 saved 7 seconds while compressing 13 XLSB files with a total size of 141 MB.
    Attached Files
    Last edited by msat59; 16th September 2017 at 23:35.

  18. Thanks (2):

    Bulat Ziganshin (16th September 2017),Simorq (17th September 2017)

  19. #12
    Member
    Join Date
    May 2008
    Location
    France
    Posts
    83
    Thanks
    555
    Thanked 27 Times in 19 Posts

  20. #13
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts
    Quote Originally Posted by Mike View Post
    Thanks for reporting, fixed.
    http://schnaader.info
    Damn kids. They're all alike.

  21. #14
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    I would like to report a non-critical bug.
    It happens on Linux when a folder is passed instead of a file: Precomp enters an infinite loop at 0.00% completion, using no CPU resources at all. In my opinion it should exit with code 1 and display a message explaining the error.
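    A minimal guard along these lines (just a sketch, not Precomp's actual code) would catch the case before the processing loop starts:
    Code:
    #include <sys/stat.h>
    #include <cstdio>

    // Return false (and print an error) if 'path' is missing or is a directory.
    bool input_is_regular_file(const char* path) {
      struct stat st;
      if (stat(path, &st) != 0) {
        fprintf(stderr, "ERROR: Can't open input file \"%s\"\n", path);
        return false;
      }
      if ((st.st_mode & S_IFMT) == S_IFDIR) {
        fprintf(stderr, "ERROR: \"%s\" is a directory, not a file\n", path);
        return false;
      }
      return true;
    }

    // e.g. in main(): if (!input_is_regular_file(input_name)) return 1;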
    Thanks in advance. Keep up the good work!

  22. Thanks:

    schnaader (11th October 2017)

  23. #15
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts


    There is an issue with a particular file that might reveal a bug in brute mode, or at least be helpful for speeding up processing in general. See this:

    Code:
    New size: 135085 instead of 135062     
    
    Done.
    Time: 1 hour(s), 38 minute(s), 57 second(s)
    
    
    Recompressed streams: 0/2221
    Brute mode streams: 0/2221
    The file is barely 132 KB, but it takes more than an hour and a half to process. The attachment contains the full verbose log along with the file itself.
    I know precomp isn't supposed to compress WAV files, but in a bundle like a tar file you can encounter anything.

    Note: I'm using a binary I compiled myself from the latest GitHub commit a week ago.
    Attached Files

  24. #16
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts
    Quote Originally Posted by Gonzalo View Post
    The file is barely 132 KB, but it takes more than an hour and a half to process. The attachment contains the full verbose log along with the file itself.
    I know precomp isn't supposed to compress WAV files, but in a bundle like a tar file you can encounter anything.
    Thanks for the file. I've seen some files that run slowly in the "new" 0.4.6 brute mode, but this is the most extreme case so far.

    There are two problems here:
    • Precomp detects non-deflate data as deflate data. There is some heuristic that tries to improve this, but as you can see, it's not perfect. Even worse, zlib happily decompresses the data (although only small parts at the beginning).
    • In brute mode streams (as well as in gZip and ZIP streams) there are more combinations to brute force because the window size is unknown - up to 8*9*9 = 648 instead of 9*9 = 81.


    The second one is only a multiplier - it makes things slower, but the impact of the first one is much worse.

    I'll look into improving the heuristics and analyze the specific streams; zlib might just ignore some errors - or a preprocessing stage could detect the data as non-deflate or too short.

    The good news is: there definitely is a solution, as the reflate-modified 7-Zip shows - it processes the file in under a second.
    http://schnaader.info
    Damn kids. They're all alike.

  25. #17
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    Yes, I've been comparing the two of them for the past few weeks and I have to say, precomp fails to recompress a lot of files that are clearly packed with deflate, though maybe not with zlib. This is with the brute and intense modes activated.
    Maybe it's time to rethink the whole engine?

    Another thing that keeps me thinking is the inherently parallel nature of what precomp does, yet it still waits for one thing to finish before starting the next. You know that maybe the only thing keeping precomp from being widely used is its speed, right? I'm not a professional programmer, but it seems parallelisation is pretty straightforward nowadays. In this particular case there is no shared-memory problem, for example, because one thread can be processing a GIF stream and another one a JPG stream and so on... The only 'complication' would be to keep a buffer for the final data to be written, in case an earlier thread takes longer than the next one, which will happen for sure.
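    A rough sketch of that idea (hypothetical names, not Precomp's actual API): give every detected stream to a worker and collect the results in input order, so the output stays deterministic even when a slow stream is followed by fast ones.
    Code:
    #include <functional>
    #include <future>
    #include <string>
    #include <vector>

    struct Stream { std::string data; };                     // stand-in for a detected stream

    // Stand-in for the per-stream work (GIF/JPG/deflate recompression, ...).
    std::string process_stream(const Stream& s) { return s.data; }

    // Process independent streams in parallel, but emit results in the original order.
    std::vector<std::string> process_all(const std::vector<Stream>& streams) {
      std::vector<std::future<std::string>> jobs;
      for (const Stream& s : streams)
        jobs.push_back(std::async(std::launch::async, process_stream, std::cref(s)));

      std::vector<std::string> out;                           // acts as the reorder buffer
      for (auto& j : jobs)
        out.push_back(j.get());                               // waits for that stream only
      return out;
    }
    // A real implementation would cap the number of workers instead of one task per stream.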

    So please consider making multi-threading a priority, before moving forward and adding things that might complicate the file format. We will appreciate it so much. Thanks in advance Christian!

  26. #18
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts
    These two things are my top priority tasks for 0.4.7 indeed. I'm currently working on the new engine, called difflate, and moved it to a separate private GitHub repository that I will change to a public repository when I'm done with the last few modifications to make it usable/testable for others.

    Integrating difflate into Precomp might push it on par with reflate (or at least quite close), so this alone would be a reason to release a new version. But since multi-threading is a "low hanging fruit", it will be part of 0.4.7, too.
    http://schnaader.info
    Damn kids. They're all alike.

  27. Thanks (3):

    Gonzalo (26th October 2017),Simorq (27th October 2017),Stephan Busch (26th October 2017)

  28. #19
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    If you can use some monkey testing and/or fresh ideas, don't hesitate to send me a PM even before the first alpha release. I will be happy to help.

  29. #20
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    560
    Thanks
    357
    Thanked 365 Times in 198 Posts
    This Precomp-related issue, first discussed in the paq8px thread, is continued here.
    See the paq8px thread (starting here) for the full quotes.

    Quote Originally Posted by Gonzalo View Post
    [...]removed not related text[...]
    Code:
    precomp -cn -intense -brute 
    ...
    100.00% - New size: 57453657 instead of 22789601     
    
    Done.
    Time: 2 minute(s), 28 second(s)
    [...]removed not related text[...]
    TIA - Gonzalo M.
    Quote Originally Posted by mpais View Post
    [...]removed not related text[...]

    Code:
    precomp -cn -intense -brute zpaqgui-win32.win32.x86.zip
    ...
    100.00% - New size: 57451397 instead of 22788604
    
    Done.
    Time: 2 minute(s), 58 second(s)
    Quote Originally Posted by Gotty View Post
    [...]removed not related text[...]

    @Márcio, did you run your benchmark with precomp 0.4.6? Your 178-second runtime is waaay too slow. It should be around 36 seconds. Your CPU thread speed is ~7x faster than Gonzalo's CPU thread speed, and executing paq took ~1/10th of the time. It is very unlikely that executing precomp on this CPU takes approximately the same amount of time as on an Atom.
    Quote Originally Posted by schnaader View Post
    I used the zpaqgui file from the GitHub releases, too, so it has to be something else. But we're very likely comparing apples to oranges here when CPU and HD/SSD performance differs that much. Also, I made the tests under Windows, I guess you two used Linux?
    Quote Originally Posted by mpais View Post
    @schnaader
    I'm using a 480 GB SATA-3 SSD, so I don't think that would be the problem; I don't have any other slowdowns.
    Quote Originally Posted by mpais View Post
    @Gonzalo, Gotty, schnaader
    I'm using precomp.exe from the latest 0.4.6 release on GitHub. I re-ran the test and actually got a worse result (3m 06s).

    Also, note that I'm not running it on the i5 2400, but on my i7 5820k at 4.4Ghz, which better accounts for the speedup from paq8px.

    So it seems my file being slightly different from yours is probably the reason for it, since I didn't know you were running on an Atom processor.
    Have you tried using the file I used?
    As strange (even weird) as it seems... running precomp on my SSD is significantly slower than running it on my HDD:

    On the same hardware (except for the disk of course):
    On Windows running on my (old 160 GB) HDD: 60-61 sec; running on my SSD: 75-76 sec
    On Linux VM running on my (old 160 GB) HDD: 91-93 sec (2 threads); SSD: 91-93 sec (same)

    Double(-triple) checked each result above. No mistake.

    But still... your (=Márcio) 178-186 sec runtime is weirder than the above - you have a much faster CPU (Intel Core i7-5820K @ 4.40GHz) and a (probably) faster SSD.
    What sorcery is that?

    Note: the Linux build prints "Compressing with LZMA, 2 threads" while the Windows build is clearly running on a single thread (verified using the respective OS tools).
    ---
    Version: precomp 0.4.6; the Windows 64-bit binary was downloaded from the Precomp GitHub download site; the Linux binary was built from source.
    Command line: precomp -cn -intense -brute zpaqgui-win32.win32.x86.zip
    The processed file is zpaqgui-win32.win32.x86.zip, size: 22788604 bytes, downloaded from https://github.com/thometal/zpaqgui/releases (initial release)

  30. Thanks:

    schnaader (29th October 2017)

  31. #21
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    560
    Thanks
    357
    Thanked 365 Times in 198 Posts
    I think I have found the reason. The creation of temporary files makes the difference in the execution time.

    I noticed that quite a lot of temporary files are created on the drive where the precomp executable is located, and many of them are created, written to and deleted during the execution.
    I also noticed that reading the input from or writing the output to an SSD, an HDD or a RAM drive makes no difference in the execution time at all.
    But if precomp.exe itself is located on an HDD, SSD or RAM drive, the execution time differs significantly: 60 sec, 76 sec and 51 sec respectively. And that is the drive where the temporary files are created.

    @Márcio, could you please turn off your antivirus software and run the test again? I have a strong suspicion that your antivirus software is going crazy over those temporary files.

  32. Thanks (2):

    Gonzalo (30th October 2017),schnaader (30th October 2017)

  33. #22
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    Correct, disabling Windows Defender on Windows 10 brings it down to 35s for the x64 version.

    Best regards

  34. Thanks:

    Gotty (30th October 2017)

  35. #23
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts
    Quote Originally Posted by Gotty View Post
    temporary files are created on the drive where the precomp executable is located
    Where they are created depends on how you run Precomp, at least on Windows. If you drag and drop an input file onto precomp.exe, temporary files are created where the input file is. On the command line, however, the current directory is used. So for example, if your precomp.exe is located at C:\precomp\precomp.exe, your file is at C:\test\testfile.dat and you run Precomp from D:\tempfileshere\, temporary files will be created in D:\tempfileshere\.

    Are you sure the temporary files are created on the SSD? The 76 sec result is still quite puzzling. At work, we had problems with slow SSD performance similar to the case reported here (related to the drives' caching), but in most other cases an SSD should just be much faster.

    Quote Originally Posted by mpais View Post
    Correct, disabling Windows Defender on Windows 10 brings it down to 35s for the x64 version.
    It's good to hear that we found the reason, thanks for the research.

    Some questions remain: what does paq8px do better? I don't really like its usage of tmpfile(), because with a non-admin command prompt you'll get "Permission denied" on Windows. Anyway, the temporary files created this way seem to be handled better by Windows Defender, and they're deleted automatically.

    I'm not sure what to do for Precomp now. Recursion is a special case where temporary files are still used, and changing that is a major change that needs a lot of work. I could rewrite the temporary file routines so that only a few temporary files are used without deleting them in between, but I'm not sure if this would help with the issue. I might as well give tmpfile() a try and see if the error in the non-admin case can be handled.
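    One possible direction, sketched under the assumption that a Win32 fallback is acceptable (the function name is made up): try tmpfile() first and only fall back to a named file in the user's temp directory when it fails, e.g. in the non-admin case.
    Code:
    #include <cstdio>
    #include <windows.h>

    // Sketch only: anonymous temp file if possible, named fallback otherwise.
    FILE* open_temp_file(char name_out[MAX_PATH]) {
      name_out[0] = '\0';
      if (FILE* f = tmpfile())                // auto-deleted, but may fail without admin rights
        return f;
      char dir[MAX_PATH];
      if (!GetTempPathA(MAX_PATH, dir)) return NULL;               // usually %TEMP%
      if (!GetTempFileNameA(dir, "pcf", 0, name_out)) return NULL; // creates an empty file
      return fopen(name_out, "w+b");          // caller removes name_out when done
    }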
    http://schnaader.info
    Damn kids. They're all alike.

  36. #24
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    From the paq8px thread, regarding deflate parameter guessing:


    Quote Originally Posted by gonzalo View Post
    Quote Originally Posted by mpais View Post
    The technique you describe from precomp is interesting, but why keep the combinations sorted by occurrence count?
    As you said, if the file is a container for many files, you might have slowdowns if the most used combinations change for
    each embedded stream. You could achieve the same speed-up without that drawback by using a MTF variant to process
    the list of combinations. It wouldn't necessarily have to be vanilla MTF, maybe a delayed-promotion variant, or a weighted
    variant with dynamic decay.
    I don't know for sure, but I think this is very similar to a scheme I imagined a few years ago.
    It involves two steps:
    First, a table must be built sorting every possible parameter by its reported compressed size. The idea is to compress a chunk of data using all possible combinations of parameters and then sort the results - offline, before any real run of the recompressor. Then repeat this again and again until enough data has been accounted for that the relative position of each parameter can be used as a reliable predictor of the origin of any given compressed chunk.
    For example, if we are recompressing a deflate stream of 32 bytes that expands to 64, we have to pick the parameter closest to a 50% ratio - let's say 65-15w. If the value found there isn't the right one, we use a rule of three to infer the next most likely value.
    That would be the second step, which can also be used to update the values of the table ('train it').
    I think that in due time and with enough data, this could easily be the fastest way to figure out exactly how to recompress any given stream. It shouldn't take more than a few lookups in the table to find it.
    What do you think, people? Could it work?

  37. #25
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    @schnaader

    I'm not as experienced in pre-processing as you, but from a compression perspective, neither paq8 nor precomp offers what I'd like. I've been mulling over the idea of writing my own pre-processing plus compression engine, which I debated in the paq8px thread with Shelwien, but as you of all people know, doing it alone in what little free time I have is a daunting task.

    I had planned on using a two-stage recursive parsing and transformation pass. Starting from a default block that encompasses the whole file, that block is processed with deterministic parsers (and transforms, if needed). These are parsers that rely on reading file headers, so they give us reasonable certainty that the detected data is what it seems. Should that data need a transform, it is applied and tested. Examples of these parsers include JPEG/GIF/PNG/TIFF/BMP/TGA/PNM/PSD/DICOM images, raw photo formats, WAV/AIFF/MOD/S3M/XM audio, etc.
    When a parser detects something, it subdivides the block being processed into as many sub-blocks as needed. This allows, for example, a TIFF parser to signal 4 or 5 different images at once: a single parent block becomes n image blocks and as many default blocks as are left between the images. These blocks are then marked as "final" if their specific data types don't allow for any more embedded data types.
    We then repeat the parsing with these deterministic parsers on all non-final blocks, until no more detections are signaled (or a recursion limit is reached). Then we proceed to parse the non-final blocks with non-deterministic parsers, like what precomp does when searching for deflate streams. If a block is processed and nothing is found, it's marked as final. If something was detected and we have non-final sub-blocks, we repeat processing on those, starting from stage 1.
    Each block would be given a hash value, and after pre-processing those hashes would be used for a quick dedupe stage, where blocks with equal hashes would be compared and, if found to be equal, deduped.
    After doing this for all files to compress, we'd be left with a structure describing the blocks from all files, their data types (default, deduped, audio, x86/x64 code, etc.) and the respective streams that represent them. We could then choose what compression engines to use, like you do in precomp. Should the user want speed over ratio, we could use fast compressors like lepton, packMP3, PackRAW, etc. If not, we'd use really strong compressors.
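    For illustration, the block tree could look roughly like this (all names made up, just a sketch of the data structure, not code from any existing project):
    Code:
    #include <cstdint>
    #include <vector>

    enum class BlockType { Default, Deduped, Image, Audio, Code, DeflateStream /* ... */ };

    struct Block {
      BlockType type = BlockType::Default;
      uint64_t  offset = 0, size = 0;   // position inside the parent stream
      uint64_t  hash = 0;               // used by the post-parse dedupe stage
      bool      final = false;          // true: no embedded data types possible
      std::vector<Block> children;      // sub-blocks created by a parser detection
    };

    // Stage 1: deterministic (header-based) parsers, repeated until nothing new is found.
    // Stage 2: non-deterministic parsers (deflate scan etc.) on the remaining non-final
    //          blocks, recursing back into stage 1 for any non-final sub-blocks they create.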

    @Gonzalo

    Sorting the parameters already appears to give precomp a nice speedup, and compression ratio can vary significantly between parameters depending on the type of data they represent. The problem with sorting is that doing it on absolute frequencies can be bad for very heterogeneous streams, and, to a lesser degree, it's an O(n²) operation (no need for fancy O(n log n) sorting on such small arrays, but still, it's not free).

    With a Move-To-Front list, the most used combinations are also tried first, and updating the list is an O(1) operation. On the zpaqgui file, only 7 combinations occur, and of those, 2 occur only once. So most of the time the most used combination would be first in the list anyway, and the 2nd most used would be second.
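    A minimal MTF list over the parameter combinations could look like this (just a sketch; the level/window-bits pair is only an example of what a combination might hold):
    Code:
    #include <algorithm>
    #include <cstddef>
    #include <utility>
    #include <vector>

    using Combo = std::pair<int, int>;   // e.g. (compression level, window bits)

    struct MtfCombos {
      std::vector<Combo> order;          // trial order, most recently successful first

      // try_combo: returns true if the combination reproduces the stream.
      template <class TryFn>
      bool find(TryFn try_combo) {
        for (std::size_t i = 0; i < order.size(); ++i) {
          if (try_combo(order[i])) {
            // move-to-front: with a linked list this promotion is O(1);
            // with a small vector the shift is cheap anyway
            std::rotate(order.begin(), order.begin() + i, order.begin() + i + 1);
            return true;
          }
        }
        return false;
      }
    };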

  38. Thanks (2):

    Gonzalo (31st October 2017),Stephan Busch (31st October 2017)

  39. #26
    Programmer schnaader's Avatar
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    615
    Thanks
    261
    Thanked 242 Times in 121 Posts
    I really like this concept and the idea of an "encode.su community archiver". We have many talented people here, so why not join forces? The reason I didn't participate in the previous discussion was that I learned a lesson when working on Precomp: "There's not enough time for all the good ideas". But another lesson learned there is: "Just do it anyway, step by step" - so I'd like to just start this project with you, see where it leads and who joins the team.

    What I like about your idea is its modularity, flexibility and extensibility. I think it would be important to make contributing as easy as possible; e.g. some very basic scripting language/framework could help collect basic parsing for all kinds of formats. Another thing I'd consider important is an early prototype with basic archiving functionality (libarchive?) and both a CLI and a GUI, so even if it only has mediocre compression and usability at the beginning, people can start using it anyway and give feedback.

    Collecting and implementing formats in Precomp was a lot of work and it still feels very incomplete, especially when thinking about container formats (ISO/tar/video containers) that don't give better compression directly, but enable fast deduplication as you described.

    On the community-driven aspect, I'd choose GitHub, Gitter, a wiki and a permissive license like Apache, BSD or even some do-WTF-you-want license. The only problem I came across when developing Precomp was the GPL (especially that using GPL libraries is not possible) because of its "embrace me or die" nature, but there might be ways to deal with this. C++ and cross-platform would be nice, but I'm also open to ideas on all that.

    To make the project useful even for lurkers and testers (building something that makes a difference -> reach as many people as possible) and to enable discussions, good documentation (either in the source or a wiki) would be important, not only on how it works but also on design choices and alternatives (we used this hash because ..., a good alternative to try would be ...).
    Last edited by schnaader; 31st October 2017 at 19:21.
    http://schnaader.info
    Damn kids. They're all alike.

  40. Thanks (3):

    Gonzalo (31st October 2017),PrinceGupta (31st October 2017),Stephan Busch (31st October 2017)

  41. #27
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    Posted last year:

    @powzix @Nania Francesco @anyone else interested
    What about collaborating on a new open-source fast LZ+Huffman compression algorithm better than Kraken (and zstd)? I would like to contribute.
    Finally, interesting words to hear. For several years I have tried to convince a group of programmers to build our own program - you would define its shape (open or closed source).
    We can achieve results exceeding expectations.
    Quote Originally Posted by Gonzalo View Post
    I couldn't agree more. In fact, the thought crossed my mind long ago.

    (...)

    Lying around this forum we have open-source code covering the whole range from the heaviest to the fastest algorithms, some of it targeting specific data types (like GLZA), filters to improve both speed and ratio, even recompressors for many very common data types. Some of them have broken records in ratio or speed according to the statistics. All we have to do is merge them together.

    I know it's easier said than done, but the rest of us will be testing and supporting in many ways too, of course. And the final product would have no match, not even from a big company. What do you think?
    Right now I'm trying to build up some programming skills in order to produce a multimedia archiver of my own. But this is going painfully slowly, mainly because of the lack of time. Anyway, better late than never, I guess.
    Quote Originally Posted by Gonzalo View Post
    I guess the key here would be to turn every algorithm into an independent library, so the final program stays modular and a change in one of the routines doesn't affect the whole thing. Low-level stuff is almost always done in C++, but the archiver itself and/or the GUI can be written in a high-level language like Python/Qt, allowing quick development, helping portability and making it easier for us beginners to collaborate.

  42. #28
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    Quote Originally Posted by mpais View Post
    @Gonzalo

    Sorting the parameters already appears to give precomp a nice speedup, and compression ratio can vary significantly between parameters depending on the type of data they represent. The problem with sorting is that doing it on absolute frequencies can be bad for very heterogeneous streams, and, to a lesser degree, it's an O(n²) operation (no need for fancy O(n log n) sorting on such small arrays, but still, it's not free).

    With a Move-To-Front list, the most used combinations are also tried first, and updating the list is an O(1) operation. On the zpaqgui file, only 7 combinations occur, and of those, 2 occur only once. So most of the time the most used combination would be first in the list anyway, and the 2nd most used would be second.
    Uh... yes and no. I understand what you're telling me, but I'm not talking about sorting previously seen parameters during operation; I'm talking about a theoretical built-in database that knows which ratio a given parameter is likely to produce. This way, just by looking at the decompressed length, the program can try the most probable parameter first, before doing anything else, even if it is the first stream it processes. So the problem with heterogeneous files disappears, because we don't care which values we used just a second ago.

    The only thing is that it would require a training stage before release. A separate program must be built (a very simple one, though) that compresses a relatively large number of different streams using absolutely all parameters on each one and writes down every compression ratio. Those are the numbers that will be sorted into a table, and that table is what will tell precomp, or whatever program uses it, how to treat the data.

    For example, after processing a million little chunks of data, the training program knows that level 85, windowbit 15 tends to deliver a ratio of, let's say, 35.436% (±5%). Those two kinds of values - parameter and most probable result - will be put into the database.
    When precomp decompresses some stream and sees that the ratio is 35.437%, it will try level 85, windowbit 15 right away instead of starting from 11 and going all the way up. If it is wrong, it can use the actual ratio to triangulate and find the next likely parameter. The idea is that the more data we use to build the table, the better it will reflect and predict any future operation.
    The idea can be taken a little further by building a table for different chunk sizes or different data types.

    I hope this is a little clearer.

    _____________________________

    Edit: Márcio told me something that made me realize this isn't going to work. Instead, I think a different method would be better:

    Quote Originally Posted by Gonzalo View Post
    >First, compress the stream with the weakest method. That ratio would be my 100
    >Then, compress the stream with the strongest method. That ratio would be my 0
    >After that, assign a value between those two numbers to the original chunk ratio and go for that position on the table.

    For example:
    Original chunk saved 3%
    Weakest method saved 2%
    Strongest method saved 4%

    Which parameter is the most probable? Not the one that usually saves 3% but the one that is right in the middle of the table.
    And since I probably tried two different window sizes in the first two steps, that wouldn't be a problem either.

    Of course, the first attempt isn't very likely to give me the exact combination, but it will point me in the right direction: if the compressed size is larger, it becomes my new 100, and if it is smaller, it becomes my new 0, and I repeat step 3.
    Last edited by Gonzalo; 31st October 2017 at 22:21.

  43. #29
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    For me the biggest hurdle is the basic archiving functionality. I've usually only cared about the algorithms and techniques; I never paid much attention to actually designing and implementing an archiver. I simply don't have the experience required, and would probably lose a lot of time just researching it before I could even get off the ground.

    We already have parsers for all the formats I listed and more, so it's quite a good start. The transforms are stable too, so we'd probably just have to improve some heuristics for the non-deterministic parsers (Deflate, Base64, x86/x64 instructions, ARMv7/8 instructions, Unicode text, etc).

    In the paq8px thread, I proposed an optional stage, post pre-processing, to use when creating solid archives. It would attempt to order the blocks in terms of similarity, be it visual if dealing with images, textual (ordering by language), etc. This could be left as an enhancement after we got everything working as planned.

    As for the compressors themselves, they'd just receive a block of size n and return a compressed stream. In paq, one of the biggest limitations is that compression must be done sequentially.
    For example, when dealing with images, we might want to process each color plane separately, or use an image-specific codec that relies on a wavelet decomposition/block transform.

    I'm not well versed in the different OS licenses, so I'd leave that for those who are. C++ isn't really my strong point, but I'd try to manage. I just can't promise that I'll have much time to work on this.

    @Gonzalo

    I understood what you meant, but a compression ratio of n:1 won't necessarily tell you much about the parameters used, since that is quite dependent on the data itself. On files that are very compressible, you may find that almost any combination of parameters gets about the same very high ratio. If all possible combinations give ratios from 80:1 to 85:1, you won't learn much just from the ratio. And then there's the opposite: a file with long repetitions just outside the window size, where the same parameters give mediocre compression with this window size but very high ratios with the next window size. On files that don't have many repetitions, you might find that the ratio for the possible combinations varies between 2:1 and 2.2:1, and again you don't gain many insights from it.

  44. Thanks:

    Stephan Busch (31st October 2017)

  45. #30
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    542
    Thanks
    239
    Thanked 92 Times in 72 Posts
    Quote Originally Posted by mpais View Post
    @Gonzalo

    I understood what you meant, but a compression ratio of n:1 won't necessarily tell you much about the parameters used, since that is quite dependent on the data itself. On files that are very compressible, you may find that almost any combination of parameters gets about the same very high ratio. If all possible combinations give ratios from 80:1 to 85:1, you won't learn much just from the ratio. And then there's the opposite: a file with long repetitions just outside the window size, where the same parameters give mediocre compression with this window size but very high ratios with the next window size. On files that don't have many repetitions, you might find that the ratio for the possible combinations varies between 2:1 and 2.2:1, and again you don't gain many insights from it.
    I would deal with it in the following way:

    >First, compress the stream with the weakest method. That ratio would be my 100
    >Then, compress the stream with the strongest method. That ratio would be my 0
    >After that, assign a value between those two numbers to the original chunk ratio and go for that position on the table.

    For example:
    Original chunk saved 3%
    Weakest method saved 2%
    Strongest method saved 4%

    Which parameter is the most probable? Not the one that usually saves 3% but the one that is right in the middle of the table.
    And since I probably tried two different window sizes in the first two steps, that wouldn't be a problem either.

    Of course, the first attempt isn't very likely to give me the exact combination, but it will point me in the right direction: if the compressed size is larger, it becomes my new 100, and if it is smaller, it becomes my new 0, and I repeat step 3.
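    A sketch of that bracketing idea (all names hypothetical, and the rule-of-three interpolation simplified to plain halving): keep the combinations in a table sorted from weakest to strongest, then shrink the bracket after every failed attempt depending on whether the attempt compressed too little or too much.
    Code:
    #include <cstddef>
    #include <vector>

    struct Params { int level, window_bits; };

    // 'combos' is sorted from weakest (index 0, "my 100") to strongest (last, "my 0").
    // try_params(p, &cmp): returns true on an exact match; otherwise sets cmp < 0 if the
    // attempt compressed too little (go stronger) and cmp > 0 if it compressed too much.
    template <class TryFn>
    bool bracketed_search(const std::vector<Params>& combos, TryFn try_params) {
      std::size_t lo = 0, hi = combos.size();
      while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        int cmp = 0;
        if (try_params(combos[mid], &cmp)) return true;
        if (cmp > 0) hi = mid;               // too much compression: look at weaker settings
        else         lo = mid + 1;           // too little: look at stronger settings
      }
      return false;                          // fall back to the full scan
    }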


