
Thread: Precomp source code on GitHub

  1. #61
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts

    Feature request

    I was thinking about the way precomp tries to speed up the next run over the same file by suggesting the -zl and -d parameters. That's very useful, and you even added an issue about improving it. Well, what do you think about this other approach?

    When precomp pre-compresses a file, it stores in the .pcf output the actual data plus all the info needed to restore it (stream type, compression level used if zLib, etc.). So basically you have a "seed" to restore the original file without any change. Now, what if we stored all the info needed to pre-compress a file in a similar manner? Namely, the offset where each stream is found, what type it is, its length, the zLib level used to pack it, whether it is possible to go to a lower recursion depth, and so on. Just like the .pcf, but without the data itself...

    All the info needed, and even more, is already generated when a file is processed by precomp with the -v switch. So why discard it? I think it is better to save it as a "map" or "seed" to make something like this possible:

    Code:
    precomp -intense -pdfbmp+ --simulate=MY_FILE.map MY_FILE.zip
    
    precomp -cn -mMY_FILE.map MY_FILE.zip

    Using this approach, every run after the first for each file would be as fast as just decompressing the zLib streams and copying a flag with the level needed to restore them (which is saved in the seed). All the searching for streams becomes redundant, because precomp would know exactly where they are located and how to treat them. So goodbye false positives, goodbye brute-force guessing in intense mode (which takes most of the time), goodbye unnecessary recursion attempts. Just decompress, copy, and that's it. This way, kubuntu.iso, for example, which took me 1 day to precompress the first time, would take maybe half an hour the next time (mainly because of slow JPG compression). The boost would be on the order of a magnitude...
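    Just to make the idea concrete, one entry of such a map could be as small as something like this (a purely hypothetical sketch; all names and fields are made up, nothing from the actual Precomp code):

    Code:
    // Hypothetical layout of one entry in a .map "seed" file.
    // All names and fields are illustrative, not actual Precomp structures.
    #include <cstdint>

    enum class StreamType : uint8_t { ZLIB, GZIP, PDF, JPG, GIF, PNG, MP3, SWF };

    struct MapEntry {
        uint64_t   offset;          // where the stream starts in the original file
        uint64_t   length;          // length of the stream in the original file
        StreamType type;            // which handler should process it
        uint8_t    zlib_level;      // compression level to use when restoring (zLib streams)
        uint8_t    recursion_depth; // how deep recursion was useful, 0 = don't recurse
    };
    // A second run would just walk these entries instead of scanning the whole file.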

    Would you please consider adding this feature to Precomp? Thanks in advance!
    Last edited by Gonzalo; 25th April 2016 at 07:35. Reason: Misspelling correction

  2. #62
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    612
    Thanks
    250
    Thanked 240 Times in 119 Posts
    The latest commit does most of the packMP3 frame validation (mainly checksum + region size checks) on the Precomp side to speed things up and get rid of the false positives. This works very well on both the TeamViewer file (no more MP3 false positives) and the test file from Gonzalo (89/89 MP3 streams instead of 89/27161). There is also a slight speedup because less data is transferred to packMP3.
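    To give an idea of the kind of check involved, here is a simplified sketch of generic MP3 frame header validation (illustrative only; the actual Precomp/packMP3 checks go further, with the checksum and region size checks mentioned above):

    Code:
    #include <cstddef>
    #include <cstdint>

    // Simplified MP3 frame header sanity check: sync word plus a few checks
    // on the version/layer/bitrate/samplerate fields. Illustrative only.
    static bool looks_like_mp3_frame(const uint8_t* p, size_t avail) {
        if (avail < 4) return false;
        if (p[0] != 0xFF || (p[1] & 0xE0) != 0xE0) return false; // 11-bit sync word
        int version    = (p[1] >> 3) & 0x03; // 01 = reserved
        int layer      = (p[1] >> 1) & 0x03; // 00 = reserved
        int bitrate    = (p[2] >> 4) & 0x0F; // 1111 = invalid
        int samplerate = (p[2] >> 2) & 0x03; // 11 = reserved
        return version != 1 && layer != 0 && bitrate != 0x0F && samplerate != 3;
    }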

    I'm not sure if issue #34 is fixed, too - Dimitri, could you test the attached version on your test files and perhaps post a log from a "precomp test_file -v > log.txt" run? Note that the checks still take time, so it might still be a little slow, but most false positives for MP3 should be gone.

    In the next few days, I'll either mark the remaining issue as fixed or move it to the next milestone. The current version will be released as 0.4.5, and I'll begin working on 0.4.6, which will address memory management and avoid temporary files for deflate streams.
    Attached Files
    http://schnaader.info
    Damn kids. They're all alike.

  3. #63
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Great improvement on problematic files and a nice improvement in overall speed for the last commit.

    Now: Microsoft Visual Studio 2015 compile:

    GCC 3842 sec
    MSVS15 3182 sec

    That's about 17% faster. The test file is a combination of PNGs, SWFs, valid MP3 files, problematic and unsupported MP3s, GIFs, and PDFs (including JPGs and PDF bitmaps). There are also non-precompressible files and some zLib chunks found in intense mode. For the other stream types I ran comparisons too, and the difference is always in MSVS's favour, between 3% and 30%.
    See attached.

    ---------------------------------

    Here is a crash for you to study. I've been following this since MP3 repacking came out, and no compilation tweak or commit whatsoever has helped, on both gcc and VS compiles. The Linux version behaves the same way, so I uploaded the files.

    ---------------------------------

    This one is, I think, the last major slowdown in my files.

    Code:
    (0.00%) Possible JPG found at position 215, length 147770
    Best match: 147770 bytes, recompressed to 121670 bytes
    (3.20%) Possible MP3 found at position 149367, length 4518614
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.20%) Possible MP3 found at position 149575, length 4518406
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.21%) Possible MP3 found at position 149679, length 4518302
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.21%) Possible MP3 found at position 149783, length 4518198
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.21%) Possible MP3 found at position 149887, length 4518094
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.21%) Possible MP3 found at position 149991, length 4517990
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.22%) Possible MP3 found at position 150095, length 4517886
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.22%) Possible MP3 found at position 150199, length 4517782
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.22%) Possible MP3 found at position 150303, length 4517678
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.22%) Possible MP3 found at position 150407, length 4517574
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.22%) Possible MP3 found at position 150511, length 4517470
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    (3.23%) Possible MP3 found at position 150615, length 4517366
    packMP3 error: big value pairs out of bounds (€h> h)
    No matches
    And so on until the end. 4.5 MB in more than 1.5 hours.

    New size: 4641654 instead of 4667697

    Done.
    Time: 1 hour(s), 35 minute(s), 54 second(s)

    Recompressed streams: 1/11099
    JPG streams: 1/1
    MP3 streams: 0/11098
    See included file.
    Attached Files

  4. #64
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    612
    Thanks
    250
    Thanked 240 Times in 119 Posts
    OK, first of all, thanks for all the input, here is a response to the "map file" feature request:

    There's the usual answer: I will do it, but not now. Of course, the downside is that a map only works for a specific file (I would check this by adding a checksum to the map file, so you can't combine it with any file other than the original) and for a specific Precomp version (at the least, it's very likely you'll get more/other streams with a newer version). But it does speed things up enormously, that's right. The size of a map file could also be quite small, so sharing them would be easy and useful.
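    For example (a hypothetical sketch only, nothing from the actual code), the map header could simply carry a checksum of the original file that gets verified before the map is trusted:

    Code:
    #include <cstdint>
    #include <cstdio>

    // Hypothetical check tying a map file to its original file: store a simple
    // checksum of the original in the map header and refuse to use the map if
    // it doesn't match. Real code would use a stronger hash.
    static uint32_t file_checksum(FILE* f) {
        uint32_t sum = 0;
        int c;
        rewind(f);
        while ((c = fgetc(f)) != EOF) sum = sum * 31 + static_cast<uint32_t>(c);
        return sum;
    }

    static bool map_matches_file(uint32_t checksum_from_map, FILE* original) {
        return file_checksum(original) == checksum_from_map;
    }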

    I'll post more later, e.g. about the slowdown (I opened issue #37 for it).
    http://schnaader.info
    Damn kids. They're all alike.

  5. #65
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Quote Originally Posted by schnaader View Post
    The size of a map file could also be quite small, so sharing them would be easy and useful.
    Yes, I even thought about an online database checked and filled automatically by precomp with every run around the world. Anyway, that's far in the future, if it's possible at all.

  6. #66
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    612
    Thanks
    250
    Thanked 240 Times in 119 Posts
    So, more about the remaining topics:

    Slowdown "big value pairs out of bounds" - should be fixed soon, as described in the issue, I'll try to skip MP3 streams where position and length sum up to the same value as the first one where this error occured. This should give a good balance between avoiding slowdowns and not skipping any valid streams. The difference to the other slowdown issues is that it's very deep inside the packMP3 code (actual encoding) instead of parsing code, so I can't use the old strategy of moving it to Precomp here and even if I could, it wouldn't speed things up.

    The MP3 crash - that's a very interesting file. A little summary of what happens here: it is a valid MP3 stream that's compressed in a ZIP file. The ZIP streams aren't compressed using zLib's default deflate strategy, so Precomp can't find any matches and doesn't decompress them (note that if it could, there would be no crash at all). But this way, the actual compressed data contains many pseudo-MP3 fragments encoded as literals, interwoven with compressed parts, so it basically throws thousands of corrupt/invalid/strange ~10 KB MP3 fragments at Precomp/packMP3. This is similar to fuzz testing, and it's very "successful" at revealing a crash. I'll open an issue about it later and analyze it, perhaps fixing it or forwarding it to Matthias Stirner.
    http://schnaader.info
    Damn kids. They're all alike.

  7. Thanks:

    Gonzalo (3rd May 2016)

  8. #67
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Very useful. It turns out I've been doing some kind of 'fuzz testing' by hand for a very long time without knowing it. For example, deliberately corrupting MP3 files.
    The file I uploaded is from the net, but I successfully recreated the crash some time ago with another file created for that purpose. I will compare both to see whether it is the same type of flaw.
    Last edited by Gonzalo; 4th May 2016 at 00:24.

  9. #68
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    JPG related:

    As with MP3, JPG parsing should probably be done on the precomp side to avoid unwanted behaviour. When a broken image is treated by precomp, it can be read as if it were much larger than it actually is. Consequently, precomp writes it to disk, starts thrashing and nearly never finishes. See this, for example:

    Code:
    (10.49%) Possible JPG found at position 142116889, length 288354575
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (10.54%) Possible JPG found at position 142837725, length 287437967
    packJPG error: size mismatch in marker segment FF  0
    No matches
    (10.64%) Possible JPG found at position 144166125, length 285883356
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (10.72%) Possible JPG found at position 145306048, length 284613797
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (10.78%) Possible JPG found at position 146039494, length 282523418
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (10.81%) Possible JPG found at position 146429406, length 282026204
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (10.89%) Possible JPG found at position 147578348, length 280451946
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (10.97%) Possible JPG found at position 148635021, length 279287875
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (11.03%) Possible JPG found at position 149414647, length 277446395
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (11.10%) Possible JPG found at position 150449385, length 276365269
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (11.17%) Possible JPG found at position 151293487, length 274338293
    packJPG error: size mismatch in marker segment FF E1
    No matches
    (11.22%) Possible JPG found at position 152016457, length 273608201
    packJPG error: size mismatch in marker segment FF E1
    No matches
    Imagine how much time it would take to run this to the end. Surely you noticed the length counts down, like with those eternal MP3s... I know this may not be the right time; I just found the issue, so I'm letting you know before I forget (see the sketch below for the kind of check I mean). Happy coding!
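    Just to illustrate (a generic sketch, not Precomp's or packJPG's actual parser): walking the marker segments and rejecting any declared length that runs past the available data would catch these "size mismatch in marker segment" cases before anything is written to disk.

    Code:
    #include <cstddef>
    #include <cstdint>

    // Generic sketch: walk JPEG marker segments and bail out as soon as a
    // declared segment length would run past the data we actually have.
    // Returns a plausible JPG length, or 0 if the candidate is broken.
    static size_t plausible_jpg_length(const uint8_t* p, size_t avail) {
        if (avail < 4 || p[0] != 0xFF || p[1] != 0xD8) return 0;    // no SOI
        size_t pos = 2;
        while (pos + 4 <= avail) {
            if (p[pos] != 0xFF) return 0;                           // not a marker
            uint8_t marker = p[pos + 1];
            if (marker == 0xD9) return pos + 2;                     // EOI, done
            if (marker == 0xDA) return avail;                       // SOS: scan data follows (simplified)
            size_t seg_len = (size_t(p[pos + 2]) << 8) | p[pos + 3];
            if (seg_len < 2 || pos + 2 + seg_len > avail) return 0; // size mismatch
            pos += 2 + seg_len;
        }
        return 0;
    }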

  10. #69
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    612
    Thanks
    250
    Thanked 240 Times in 119 Posts
    Quote Originally Posted by Gonzalo View Post
    As with MP3, probably JPG parsing should be done on precomp side to avoid unwanted behaviour. When a broken image is treated by precomp, it can be read as it were much larger than it is. Consequently, precomp writes it down to the disk, starts thrashing and ends nearly never. See this, for example: [...]
    Very interesting. Note that in this case the sum of length and position is not the same as in the MP3 slowdown case, so we can't generalize the fix from there. JPG parsing is one of the things that has to be changed sometime, because I didn't use the packJPG code, but wrote my own parser roughly following the specification and some trial and error. Perhaps this will get better when JPG parsing is rewritten.

    Now on to the last remaining topic: MSVC compiles (and different compiles in general).

    As you posted, I was in the middle of doing benchmarks of different compiles of Precomp. The reason was that I plan to release both a 32-bit and a 64-bit version this time because Skymmer's compile is quite easy to reproduce and gives a solid 10-20% speedup on 64-bit systems.

    But when benchmarking, things got interesting... I compiled a 64-bit version and reproduced the speedup. Then I compiled using Visual Studio 2012 (resulting in a 32-bit executable) and most of the time it gave a little speedup, although not getting close to the 64-bit speedup. While benchmarking, I experienced some variations and analyzed them:

    Code:
    Scenario                                       Time for 32-bit/64-bit/VS2012 32-bit (in seconds, each time is an average of 3 measurements)
    
    Worst case: Windows folder view in background
      where the temporary files flicker            144/125/135
    Minimized Windows folder view                  138/123/133
    No Windows folder view                         137/120/131
    Disabled AV software                           133/117/128
    Moved files to a non-desktop folder            130/114/128
    Disabled (!) Windows Defender realtime scan    144/129/139
    Enabled Windows Defender realtime scan again   144/129/137
    The first 5 rows look straightforward - yeah, we're getting faster and faster and revealing all the culprits. But the last two lines are just a "WTF is happening here?". To make matters worse, some other random slowdowns kicked in now and then, like the nice "CompatTelRunner" tool that seems to check my Windows 7 programs for compatibility with the recommended Windows 10 update. It scans the disk, distorting measurements this way, and seems to be rerun at least every time you reboot your PC. So while doing these benchmarks, I monitored disk usage between measurements and had to throw away some of the results.

    But the real surprise came when going a step further and comparing times for Windows (Win7, 64-bit) and Linux (Ubuntu 15.10, 32-bit). Note that they run on the same laptop, so the hardware is identical:

    Code:
    Tested file and parameters		Time for 32-bit/64-bit/VS2012 32-bit/Ubuntu 32-bit (in seconds, each time is an average of 3 measurements)
    
    FlashMX.pdf, -cn                        2.091/1.965/1.975/1.329
    FlashMX.pcf, -r                         0.889/0.827/1.014/0.762
    teamviewer_11.0.53191_i386.deb,
      -cn -intense1                         450/379/421/139
    For the last tested file (and, as it seems, for intense mode in general), the Ubuntu 32-bit compile is almost 3 times faster than the Windows 64-bit version! I haven't analyzed this further yet, but I suspect the filesystem (ext2) handles this use case (many short-lived, small temporary files) better than NTFS. But there has to be something more to it, as that doesn't explain the difference for "FlashMX.pcf, -r", where no temporary files are used.

    As a side note, this shows another interesting thing about the Visual Studio compiles: -r is slower.

    All this shows (again) how important it is to switch to better memory management (= Precomp 0.4.6) before trying any compiler-related tricks. I'm quite sure I'll release both a 32-bit and a 64-bit compile for both Windows and Linux, but they will all be compiled using GCC (5.3.0 for Windows, 5.2.1 for Linux).
    http://schnaader.info
    Damn kids. They're all alike.

  11. #70
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Yes! I noticed exactly the same on my PC. The Linux 64-bit version is up to 7x faster than on Windows, but usually 1.5-3x faster. I couldn't find a reasonable explanation for this, so I decided not to post yet. But I can confirm that disk thrashing is not the main factor in the speed difference. Nor is it the operating system itself (read: the kernel), because "wine precomp.exe" has slightly worse timing than the pure Windows version in a Windows environment. I guess the cause of such a difference really lies in the code structure and the way it is compiled for each platform.
    I never use AV software, but I suspect it badly affects any R/W operation on disk. Anyway, one of the reasons I stopped using antivirus is that the so-called "real time protection" is not always that "real time". So I converge on this: once the temp files are reduced to the minimum, that is the time to find the perfect compiler. In the meantime, on Windows, the VS compiles are always faster (at least for me, on my PC) and Linux x64 is the fastest (note two things: I always compile with -O3 on Linux, and my Visual Studio version is 2015 Community).
    My setup is the inverse of yours: I have Win7 32-bit and Ubuntu 16.04 64-bit, so I'm not able to test the Win 64 version. I'd love to.
    Another thing: I noticed an additional speed-up on Linux after setting -march=atom (my processor). Maybe you should change the makefiles to say -march=native instead of -march=pentiumpro.

    About JPG, EMMA's parser seems to be the best among all I have looked into so far. It can find and compress images that even extrjpg from Matthias' tools or PAQ8px can't. Maybe you can get some help from its author.

    And related to MP3 crashes: I was able to build another "crasher file". This one is very small (1 MB); I can upload it if needed. I'm also trying to crash the packJPG part of precomp, but I haven't managed to yet. It seems pretty stable so far.

  12. Thanks:

    schnaader (4th May 2016)

  13. #71
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    612
    Thanks
    250
    Thanked 240 Times in 119 Posts
    Quote Originally Posted by Gonzalo View Post
    Maybe you should change makefiles to say -march=native instead of -march=pentiumpro
    It could perhaps be done in the makefiles, because people will most probably use them only on their own hardware and recompile elsewhere. But note that executables compiled with "-march=native" often won't run on other machines with older hardware. Because of my increasing confusion with compiler optimization and the GCC switches, I stumbled upon a good resource: GCC optimization in the Gentoo Wiki. It is very clear about avoiding -O3, too.

    Quote Originally Posted by Gonzalo View Post
    And so on until the end. 4.5mb in more than 1.5 hours.

    New size: 4641654 instead of 4667697

    Done.
    Time: 1 hour(s), 35 minute(s), 54 second(s)

    Recompressed streams: 1/11099
    JPG streams: 1/1
    MP3 streams: 0/11098
    Code:
    New size: 4641654 instead of 4667697
    
    Done.
    Time: 811 millisecond(s)
    
    Recompressed streams: 1/2
    JPG streams: 1/1
    MP3 streams: 0/1
    The last 2 commits fix issue #37. I attached a compiled version (note that it's a gcc/g++ 5.3.0 compile this time; I finally updated my MinGW - thanks to load).

    The first commit brought the time down to 90 seconds, which was still slow. MP3 parsing was being done again and again over the same stream (but starting one frame later each time), so I introduced an MP3 parsing cache. As it's not tied to that specific issue, it will result in a speedup for other problematic files, too.
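    In pseudo-C++, the cache idea is roughly this (a sketch only; names and layout are made up, not the actual Precomp code):

    Code:
    #include <cstdint>
    #include <map>

    // Sketch of an MP3 parsing cache: candidate streams that start one frame
    // later than an already-parsed stream end at the same byte, so keying the
    // parse result by the end offset avoids re-parsing the same frames.
    struct Mp3ParseResult {
        bool     usable;      // did the stream validate / repack?
        uint64_t frame_count; // how many frames were parsed
    };

    static std::map<uint64_t, Mp3ParseResult> mp3_parse_cache; // key: end offset

    static bool lookup_mp3_parse(uint64_t position, uint64_t length, Mp3ParseResult& out) {
        auto it = mp3_parse_cache.find(position + length);
        if (it == mp3_parse_cache.end()) return false;
        out = it->second;
        return true;
    }

    static void store_mp3_parse(uint64_t position, uint64_t length, const Mp3ParseResult& r) {
        mp3_parse_cache[position + length] = r;
    }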
    Attached Files
    http://schnaader.info
    Damn kids. They're all alike.

  14. #72
    Programmer schnaader
    Join Date
    May 2008
    Location
    Hessen, Germany
    Posts
    612
    Thanks
    250
    Thanked 240 Times in 119 Posts
    Gitter chat for Precomp

    I saw that FLIF and other projects have a nice chat room on Gitter, which works together with GitHub, so let's see if this turns out to be useful for Precomp, too. Feel free to join; I'll try to drop in regularly and answer any questions.
    http://schnaader.info
    Damn kids. They're all alike.

  15. Thanks:

    Bulat Ziganshin (9th May 2016)

