Thread: paq8px

  1. #2041
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    Theoretically, the Ryzen 9 3950X should be about 30-33% faster than the i7-4770HQ... It looks like there is a 2x slowdown for some hidden reason. Of course, different builds run differently on different architectures, but it's really strange.

    I've checked other benchmarks for these CPUs (CPU-Z, Cinebench R11.5, Cinebench R15, Cinebench R20, Geekbench 4.0, Geekbench 5.0, PassMark, SiSoftware Sandra Arithmetic, UserBenchmark), and all of these tests show, on average, about a 33% single-thread advantage for the 3950X over the 4770HQ... So this must be either:

    a) a worst (really worst) case scenario for Ryzen, or
    b) some compile/build implications

  2. #2042
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    I would look at the memory specs. Typically memory is the bottleneck with this kind of software.

  3. #2043
    Member
    Join Date
    Jun 2009
    Location
    Puerto Rico
    Posts
    251
    Thanks
    138
    Thanked 52 Times in 39 Posts
    Quote Originally Posted by hexagone View Post
    I would look at the memory specs. Typically memory is the bottleneck with this kind of software.
    The laptop with the Intel i7-4700MQ runs 16GB DDR3 at 1666MHz.
    The other laptop with the Intel i7-7700HQ runs 64GB DDR4 at 2400MHz.
    The AMD machine is running 128GB RAM at 2133MHz.

    I personally don't think RAM is the issue, as the numbers on both Intel machines are very similar given their architecture and speed differences.

  4. #2044
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    I think the same - memory isn't the issue. It's probably something to do with AMD not properly using AVX2 or other instructions that favour Intel.
    Of course, if the Ryzen had 3200MHz or 3600MHz memory it would help, but not a lot - maybe 5-10%. The major issue is somewhere else.

  5. #2045
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    158
    Thanks
    51
    Thanked 44 Times in 33 Posts
    Looking at memory frequency and amount is not sufficient to draw a conclusion. The sizes of the intermediate L1, L2 & L3 caches can make a big difference. Has anyone ever measured how much time is spent waiting for memory during compression?

  6. #2046
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    553
    Thanks
    356
    Thanked 355 Times in 192 Posts
    @moisesmcardona
    Could you please try to run a short test with a small file (like bib or book1) on your systems with memory setting -1 and simd setting (-simd sse2) where the systems are idle? Could you turn off any antivirus software in the meantime?

  7. #2047
    Member
    Join Date
    Jun 2009
    Location
    Puerto Rico
    Posts
    251
    Thanks
    138
    Thanked 52 Times in 39 Posts
    Quote Originally Posted by Gotty View Post
    @moisesmcardona
    Could you please try to run a short test with a small file (like bib or book1) on your systems with memory setting -1 and simd setting (-simd sse2) where the systems are idle? Could you turn off any antivirus software in the meantime?
    Hmm, so now the AMD machine was faster! Of course, I tested now v190 while the previous results were from v189. Maybe the slowdown had to do with the amount of simultaneous tasks being run? Keep in mind I only had 50% of processes running (16 threads out of 32).

    However, now I noticed that the LSTM is producing different results on AMD and Intel (for the AVX2 SIMD). I'm not sure if this is intended or if it may be a bug.

    ------------------------------

    AMD machine: SSE2 no LSTM:

    Code:
    paq8px_v190.exe -1 -simd sse2 "H:\test rav1e 0.1.0 993950d q175-s0.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Highest SIMD vectorization support on this system: AVX2.
    Using SSE2 neural network and hashtable functions.
    
    Creating archive test rav1e 0.1.0 993950d q175-s0.log.paq8px190 in single file mode...
    
    Filename: H:\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18415
    
    Time 45.42 sec, used 552 MB (579223768 bytes) of memory
    AMD SSE2 LSTM:

    Code:
    paq8px_v190.exe -1l -simd sse2 "H:\test rav1e 0.1.0 993950d q175-s0.log" "H:\test rav1e 0.1.0 993950d q175-s0 lstm sse2.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Highest SIMD vectorization support on this system: AVX2.
    Using SSE2 neural network and hashtable functions.
    
    Creating archive H:\test rav1e 0.1.0 993950d q175-s0 lstm sse2.log in single file mode...
    
    Filename: H:\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18504
    
    Time 302.81 sec, used 570 MB (598303097 bytes) of memory
    AMD AVX2 LSTM (non-native build):

    Code:
    paq8px_v190.exe -1l -simd avx2 "H:\test rav1e 0.1.0 993950d q175-s0.log" "H:\test rav1e 0.1.0 993950d q175-s0 lstm avx2.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Creating archive H:\test rav1e 0.1.0 993950d q175-s0 lstm avx2.log in single file mode...
    
    Filename: H:\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18378
    
    Time 151.14 sec, used 572 MB (600230729 bytes) of memory
    AMD AVX2 LSTM (native build)

    Code:
    paq8px_v190_nativecpu.exe -1l -simd avx2 "H:\test rav1e 0.1.0 993950d q175-s0.log" "H:\test rav1e 0.1.0 993950d q175-s0 lstm avx2 native.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Creating archive H:\test rav1e 0.1.0 993950d q175-s0 lstm avx2 native.log in single file mode...
    
    Filename: H:\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18389
    
    Time 128.81 sec, used 572 MB (600230750 bytes) of memory
    Intel SSE2 no LSTM:

    Code:
    paq8px_v190.exe -1 -simd sse2 "C:\temp\test rav1e 0.1.0 993950d q175-s0.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Highest SIMD vectorization support on this system: AVX2.
    Using SSE2 neural network and hashtable functions.
    
    Creating archive test rav1e 0.1.0 993950d q175-s0.log.paq8px190 in single file mode...
    
    Filename: C:\temp\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18415
    
    Time 72.22 sec, used 552 MB (579223788 bytes) of memory
    Intel SSE2 LSTM:
    Code:
    paq8px_v190.exe -1l -simd sse2 "C:\temp\test rav1e 0.1.0 993950d q175-s0.log" "C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm sse2 intel.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Highest SIMD vectorization support on this system: AVX2.
    Using SSE2 neural network and hashtable functions.
    
    Creating archive C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm sse2 intel.log in single file mode...
    
    Filename: C:\temp\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18504
    
    Time 556.58 sec, used 570 MB (598303150 bytes) of memory
    Intel AVX2 LSTM (non-native build):
    Code:
    paq8px_v190.exe -1l -simd avx2 "C:\temp\test rav1e 0.1.0 993950d q175-s0.log" "C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm avx2 intel.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Creating archive C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm avx2 intel.log in single file mode...
    
    Filename: C:\temp\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18391
    
    Time 293.94 sec, used 572 MB (600230782 bytes) of memory
    Intel AVX2 LSTM (native build):
    Code:
    paq8px_v190_nativecpu.exe -1l -simd avx2 "C:\temp\test rav1e 0.1.0 993950d q175-s0.log" "C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm avx2 intel native.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Creating archive C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm avx2 intel native.log in single file mode...
    
    Filename: C:\temp\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18385
    
    Time 253.36 sec, used 572 MB (600230803 bytes) of memory
    Note that the SSE2 LSTM archive sizes match on both CPUs, but the AVX2 ones do not, for both the native and the non-native builds.
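To make the comparison above easier to read, here is a small Python sketch that puts the posted archive sizes and times side by side. The numbers are copied from the runs in this post; the dictionary layout and the `compare` helper are only illustrative, and "Intel" simply means whichever Intel laptop the runs were made on.

```python
# Figures copied from the paq8px v190 runs above (input: 320157 bytes).
# Keys are (configuration, CPU); values are (archive size in bytes, seconds).
runs = {
    ("sse2", "AMD"): (18415, 45.42),
    ("sse2", "Intel"): (18415, 72.22),
    ("sse2+lstm", "AMD"): (18504, 302.81),
    ("sse2+lstm", "Intel"): (18504, 556.58),
    ("avx2+lstm", "AMD"): (18378, 151.14),
    ("avx2+lstm", "Intel"): (18391, 293.94),
    ("avx2+lstm native", "AMD"): (18389, 128.81),
    ("avx2+lstm native", "Intel"): (18385, 253.36),
}

def compare(config):
    """Return (Intel size - AMD size, Intel time / AMD time) for one configuration."""
    amd_size, amd_time = runs[(config, "AMD")]
    intel_size, intel_time = runs[(config, "Intel")]
    return intel_size - amd_size, intel_time / amd_time

for config in ("sse2", "sse2+lstm", "avx2+lstm", "avx2+lstm native"):
    size_diff, ratio = compare(config)
    status = "match" if size_diff == 0 else f"differ by {abs(size_diff)} bytes"
    print(f"{config:18s} sizes {status:20s} Intel/AMD time {ratio:.2f}x")
```

Both SSE2 configurations give identical sizes on the two CPUs, while both AVX2 configurations do not, which is the pattern the note above points out.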

  8. Thanks (2):

    Darek (13th August 2020),Gotty (13th August 2020)

  9. #2048
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts

    paq8px_v191

    Code:
    Changes:
    - Pre-training now available for the LSTM, with a heavily-quantized model trained on English texts
    - LSTM prediction is now promoted to the second layer of the paq mixer
    - Activation functions are now available in AVX2

  10. Thanks (4):

    Darek (13th August 2020),Gotty (13th August 2020),Mike (12th August 2020),moisesmcardona (12th August 2020)

  11. #2049
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    Quote Originally Posted by LucaBiondi View Post
    Hi Mpais, what about a preliminary model for mp3 files?
    Do you think it should be an easy task?
    Luca
    No, it would be a lot of work for not much gain, so, for me personally, it's low priority.

    Detection itself, if memory serves from when working on precomp, is finicky, since we'd need to rely on just a few bits to detect possible frames, and then need a way to validate them, which means decompressing them. Otherwise we'd be in the same situation as now, where we don't check deflate or gif detections for proper validity, and then get transform failures. Honestly, the whole pre-processing stage in paq8 is a mess, one that I'm not looking forward to meddling in.

  12. #2050
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    Quote Originally Posted by Darek View Post
    paq8px_v190 got 2nd place in the Silesia benchmark now! It loses to cmix v18 by "only" 183KB.
    As usual your options are better than mine and you actually managed to take #2 spot. Are you going to submit it?

    Also, don't forget that, if cmix gets updated with these improvements, it will again significantly increase the distance to #1 spot (well, with a huge help from precomp, now with Preflate integrated).

  13. #2051
    Member
    Join Date
    Sep 2014
    Location
    Italy
    Posts
    94
    Thanks
    102
    Thanked 39 Times in 25 Posts
    Ok thank you Mpais.
    It was just to know your opinion
    Luca

  14. #2052
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    553
    Thanks
    356
    Thanked 355 Times in 192 Posts
    @moisesmcardona
    Based on your results I have the feeling that the cause is indeed the thread count + memory. 16 threads are too many.

    Paq8px does something like this during compression:
    reading from memory... waiting for result ... performing some calculations, ... reading from memory... waiting for result ... performing some calculations, ...

    A significant amount of time is spent just... waiting... for RAM.
    Level -1 is the fastest. Does it do fewer operations? No. It "only" uses less memory. So why is it faster then? When using less memory, a larger percentage of it can fit in the caches, so there is a higher chance that the desired location is cached.
    Imagine yourself being the RAM. You work hard to fetch some data, and immediately the next request comes - you don't have much time to rest. You are the slowest component (of the CPU+cache+RAM trio). Everybody waits for you.

    So let's do multi-threading.
    How many operations can your RAM modules do at any moment? (It depends on how many RAM modules you have and if they run in dual channel for example.) So how many threads can access your RAM simultaneously/concurrently? If you are running 16 threads but you have like 4 RAM modules in dual channel and let's suppose the rare (optimal) case when all threads want to access a very different memory location, still maximum 8 will be lucky, and 8 will wait for their chance to get data.

    Your maximum memory bandwidth with DDR4-2133 RAM is somewhere around 17 GiB/s. Latency is actually the main factor, but for the sake of simplicity let's ignore it. Let's suppose a paq8px thread will utilize 10% of that bandwidth. That means (roughly speaking) that 10 such threads will utilize your memory fully, and if you start an 11th one then these 11 threads will start to block each other. So the 11 would run at around the same combined speed as the 10 did. (Simplifying again.) All of them will wait for the same "slow" memory subsystem.
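The simplified picture in the paragraph above can be written down as a toy model. This is only a sketch of that simplification: the 10% share per thread is the assumed figure from the text, the function names are made up here, and latency and caches are ignored entirely.

```python
def per_thread_speed(threads, share_per_thread=0.10):
    """Relative speed of each thread in a toy shared-bandwidth model.

    Each thread wants `share_per_thread` of the total memory bandwidth.
    While the combined demand fits, every thread runs at full speed (1.0);
    once demand exceeds 100%, the bandwidth is split evenly.
    """
    demand = threads * share_per_thread
    return 1.0 if demand <= 1.0 else 1.0 / demand

def total_throughput(threads, share_per_thread=0.10):
    """Combined work done per unit of time, in 'single idle thread' units."""
    return threads * per_thread_speed(threads, share_per_thread)

for n in (1, 8, 10, 11, 16, 32):
    print(n, round(per_thread_speed(n), 3), round(total_throughput(n), 3))
```

In this model the combined throughput climbs until 10 threads and then stays flat: every extra thread only makes the others slower. Real hardware is less tidy, so the model should not be expected to predict actual timings.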

    It could be the case that your peak thread count is not 16 but 12 or 8 for paq8px. But let's not guess. Let's measure. What is the combined throughput for all the threads when we do compression simultaneously with 1..32 threads?

    A little experiment (Xeon E2286M, 4x16GB DDR4-2666, Dual channel, 19-19-19-43):

    paq8px -8 obj1
    1 thread: 3.8s
    2 threads: 4.1s x2
    3 threads: 4.3s x3
    4 threads: 4.5s x4
    5 threads: 4.9s x5
    6 threads: 5.4s x6
    7 threads: 6.0s x7
    8 threads: 6.6s x8
    9 threads: 7.1s x9
    10 threads: 8.1s x10
    ...
    16 threads: 14.3s x16
    ...
    32 threads: 26.5s x32

    As you can see the more threads are running the slower each thread will be.
    A metric for speed would be: how many obj1 files can we compress per second in each case? (i.e. divide the number of threads by the elapsed seconds).

    Since obj1 is a very small file, memory will not yet be exhausted at all when compression suddenly finishes. The best thread count is 9 in this case (with 9 threads it can crunch 9/7.1 = 1.27 obj1 files per second - that is the maximum).

    paq8px -8 obj2
    1 thread: 40.3s
    2 threads: 44.3s x2
    3 threads: 48.4s x3
    4 threads: 52.7s x4
    5 threads: 58.6s x5
    6 threads: 65.5s x6
    7 threads: 73.2s x7
    8 threads: 83.0s x8
    9 threads: 93.3s x9
    10 threads: 105.4s x10
    ...
    16 threads: 192.9s x16
    ...
    32 threads: 355.9s x32

    Obj2 is somewhat larger, compression takes more time, and memory is filled up more (but still not exhausted). The best thread count is similar: 9-10 (with 9 threads it can crunch 9/93.3 = 0.09646 and with 10 threads 10/105.4 = 0.09487 obj2 files per second. When I use 16 threads, for example, its "speed" is just 16/192.9 = 0.08294 obj2 files per second).

    So on this system 9-10 is the optimal thread count for paq8px. Using 16 threads produces a 13% slower throughput than running with 10 threads.
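The "files per second" metric from the experiment is easy to recompute. The Python sketch below uses only the timings listed above (the rows elided with "..." are left out); the helper names are illustrative.

```python
# Seconds per process for N simultaneous paq8px -8 runs, as listed above.
obj1 = {1: 3.8, 2: 4.1, 3: 4.3, 4: 4.5, 5: 4.9, 6: 5.4, 7: 6.0,
        8: 6.6, 9: 7.1, 10: 8.1, 16: 14.3, 32: 26.5}
obj2 = {1: 40.3, 2: 44.3, 3: 48.4, 4: 52.7, 5: 58.6, 6: 65.5, 7: 73.2,
        8: 83.0, 9: 93.3, 10: 105.4, 16: 192.9, 32: 355.9}

def throughput(times):
    """Files compressed per second for each thread count (threads / seconds)."""
    return {n: n / t for n, t in times.items()}

def best_thread_count(times):
    """Thread count with the highest combined throughput."""
    rates = throughput(times)
    return max(rates, key=rates.get)

def slowdown(times, threads, baseline):
    """Fractional throughput loss of `threads` relative to `baseline` threads."""
    rates = throughput(times)
    return 1.0 - rates[threads] / rates[baseline]

for name, times in (("obj1", obj1), ("obj2", obj2)):
    rates = throughput(times)
    best = best_thread_count(times)
    print(f"{name}: best at {best} threads ({rates[best]:.5f} files/s)")

print(f"obj2, 16 vs 10 threads: {slowdown(obj2, 16, 10):.1%} lower throughput")
```

On the figures as posted, the peak lands at 9 threads for both files, with 8 to 10 threads within about 2% of each other for obj2, and 16 threads come out roughly 13% below 10 threads.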

  15. Thanks (2):

    schnaader (13th August 2020),Sportman (16th August 2020)

  16. #2053
    Member
    Join Date
    Jun 2009
    Location
    Puerto Rico
    Posts
    251
    Thanks
    138
    Thanked 52 Times in 39 Posts
    @Gotty

    Thanks for the explanation. This makes sense, given that the system sometimes gets laggy even when the CPU is not at 100%. Usually, I'd just pause the process if I'm doing something important (BOINC offers a suspend function that suspends every task and resumes them only when told).

    As for the LSTM question: is it normal that AVX2 produces different results on different CPUs? I was planning on using it since it saved around 2-5MB, even if it took more days, but given the different results between the machines, I'm not sure it would be a good approach. The issue didn't happen with SSE2 (see the output sizes in the above post).

  17. #2054
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    342
    Thanks
    50
    Thanked 62 Times in 50 Posts
    Quote Originally Posted by moisesmcardona View Post
    @Gotty

    Thanks for the explanation. This makes sense, given that the system sometimes gets laggy even when the CPU is not at 100%. Usually, I'd just pause the process if I'm doing something important (BOINC offers a suspend function that suspends every task and resumes them only when told).

    As for the LSTM question: is it normal that AVX2 produces different results on different CPUs? I was planning on using it since it saved around 2-5MB, even if it took more days, but given the different results between the machines, I'm not sure it would be a good approach. The issue didn't happen with SSE2 (see the output sizes in the above post).
    It is not normal.

  18. #2055
    Member
    Join Date
    Jun 2009
    Location
    Puerto Rico
    Posts
    251
    Thanks
    138
    Thanked 52 Times in 39 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    It is not normal.
    It's normal that SSE2 and AVX2 produce different results. But for AVX2, the result differs between CPUs, whereas for SSE2, it's the same regardless of the CPU.

  19. #2056
    Member Gotty's Avatar
    Join Date
    Oct 2017
    Location
    Switzerland
    Posts
    553
    Thanks
    356
    Thanked 355 Times in 192 Posts
    As far as I know AVX2 should produce the same results regardless of the CPU. As I see on your results, just a recompilation produces different results, not just targeting different CPUs. That's no good. I'll look into it.

  20. #2057
    Member
    Join Date
    Jun 2009
    Location
    Puerto Rico
    Posts
    251
    Thanks
    138
    Thanked 52 Times in 39 Posts
    Quote Originally Posted by Gotty View Post
    As far as I know AVX2 should produce the same results regardless of the CPU. As I see on your results, just a recompilation produces different results, not just targeting different CPUs. That's no good. I'll look into it.
    Yes, but also note that the AMD and Intel CPUs produced different results using the same executable.

    Intel AVX2 LSTM (non-native build):
    Code:
    paq8px_v190.exe -1l -simd avx2 "C:\temp\test rav1e 0.1.0 993950d q175-s0.log" "C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm avx2 intel.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Creating archive C:\temp\test rav1e 0.1.0 993950d q175-s0 lstm avx2 intel.log in single file mode...
    
    Filename: C:\temp\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18391
    
    Time 293.94 sec, used 572 MB (600230782 bytes) of memory

    AMD AVX2 LSTM (non-native build):

    Code:
    paq8px_v190.exe -1l -simd avx2 "H:\test rav1e 0.1.0 993950d q175-s0.log" "H:\test rav1e 0.1.0 993950d q175-s0 lstm avx2.log"
    paq8px archiver v190 (c) 2020, Matt Mahoney et al.
    
    Creating archive H:\test rav1e 0.1.0 993950d q175-s0 lstm avx2.log in single file mode...
    
    Filename: H:\test rav1e 0.1.0 993950d q175-s0.log (320157 bytes)
    Block segmentation:
     0           | text             |    320157 bytes [0 - 320156]
    -----------------------
    Total input size     : 320157
    Total archive size   : 18378
    
    Time 151.14 sec, used 572 MB (600230729 bytes) of memory
    13 bytes of difference.

    Same flags, same SIMD, same executable, different CPU, different results.

  21. #2058
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    I don't have an AMD machine available to test, but I suspect it may be the AVX2 RSQRT instruction used in the Adam optimizer.

    @moisesmcardona:
    Could you try commenting out the line
    #define USE_RSQRT
    in Adam.hpp, recompiling and trying it out?
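Some background on why an approximate reciprocal square root is a plausible suspect: the x86 RSQRTPS instruction (and its VEX form) is documented only to a relative error bound of 1.5 * 2^-12, not to an exact result, so two CPUs can return slightly different bits for the same input. In a compressor whose model has to evolve identically everywhere, even a tiny difference in a weight update changes the output. The Python sketch below is not the code from Adam.hpp; it is a minimal scalar Adam update, and `make_approx_rsqrt`, the two error values and the gradient sequence are made up for illustration.

```python
import math

def exact_rsqrt(x):
    """Reference reciprocal square root."""
    return 1.0 / math.sqrt(x)

def make_approx_rsqrt(rel_error):
    """Hypothetical stand-in for a hardware estimate with a fixed relative error."""
    def approx(x):
        return (1.0 / math.sqrt(x)) * (1.0 + rel_error)
    return approx

def adam_train(grads, rsqrt, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run scalar Adam over a gradient sequence and return the final weight."""
    w = m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat * rsqrt(v_hat + eps)   # update via reciprocal sqrt
    return w

# Made-up gradient sequence; the same input is fed to every variant.
grads = [math.sin(0.1 * i) + 0.5 for i in range(1000)]

w_exact = adam_train(grads, exact_rsqrt)
# Two made-up error values, both inside the documented 1.5 * 2**-12 bound.
w_cpu_a = adam_train(grads, make_approx_rsqrt(+1e-4))
w_cpu_b = adam_train(grads, make_approx_rsqrt(-2e-4))

print(w_exact, w_cpu_a, w_cpu_b)
```

The three weights end up close but not equal. For a predictor that must be bit-identical everywhere, "close" is not enough, which is why disabling the approximate path is a reasonable thing to test.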

  22. Thanks:

    moisesmcardona (14th August 2020)

  23. #2059
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    Scores for my testset on paq8px v191 - there are some different effects:

    a) for textual files and 24bpp files the scores are awesome! The scores are better than the cmix v18 scores! All textual files have their best results ever!

    b) for some bigger files (G.EXE, H.EXE, I.EXE, K.WAD, L.PAK) the scores are much worse - compared to non-LSTM scores...
    Attached thumbnail: paq8px_v191.jpg

  24. #2060
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    @Darek:

    I see, you used text pre-training even on non-textual files to squeeze a bit more compression. But now doing so means the LSTM is, right from the start, highly trained to predict English text, and it will take a very long time to re-adapt. You'll get much lower losses by skipping pre-training on those files. In the next version I'll separate "regular" text pre-training from LSTM model loading, so you can use them separately, for files where regular pre-training helped a bit, but LSTM model loading hurts a lot.

  25. Thanks:

    Darek (15th August 2020)

  26. #2061
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts

    paq8px_v191a

    Code:
    Changes:
    - Repurposed option switch "r" to handle loading of pre-trained LSTM models, decoupling it from other switches
    The next model is still going to take a few days to finish training, and since I was already planning on doing this, I've removed retraining (too little benefit) and repurposed the "r" switch to handle the loading of LSTM models. For now that just means using the English model.

    @Darek
    So for this interim version, instead of "-9lta" for your text files, you'd use "-9rta". "t" will just use the old text pre-training, "r" will complement it by loading the LSTM model (and no need for "l" since "r" implies it).

  27. Thanks (3):

    Darek (15th August 2020),Mike (14th August 2020),moisesmcardona (14th August 2020)

  28. #2062
    Member
    Join Date
    Jun 2009
    Location
    Puerto Rico
    Posts
    251
    Thanks
    138
    Thanked 52 Times in 39 Posts
    Quote Originally Posted by mpais View Post
    I don't have an AMD machine available to test, but I suspect it may be the AVX2 RSQRT instruction used in the Adam optimizer.

    @moisesmcardona:
    Could you try commenting out the line
    #define USE_RSQRT
    in Adam.hpp, recompiling and trying it out?
    Hi @mpais,

    Your suggestion worked. The files now have the same size and the checksums matched between the Intel and AMD CPU.

    Same goes for the native build. The size and checksums matched between the CPUs, but the files are incompatible between the native and non-native build. The native build was about 20 seconds faster on the AMD CPU and about 42 seconds faster on the Intel CPU.

    tested with -1l -simd avx2.
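For anyone who wants to repeat this kind of cross-machine check, here is a minimal Python sketch that compares two archives by size and checksum. The thread does not say which checksum tool was used; SHA-256 is an assumption, and the file names in the demo are stand-ins written to a temporary directory so the snippet runs on its own.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def archives_match(path_a, path_b):
    """True only if both files have the same size and the same SHA-256."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    return file_sha256(path_a) == file_sha256(path_b)

def write_temp(directory, name, payload):
    """Write a stand-in 'archive' for the demo and return its path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(payload)
    return path

with tempfile.TemporaryDirectory() as tmp:
    amd = write_temp(tmp, "amd.paq8px", b"same bytes")
    intel = write_temp(tmp, "intel.paq8px", b"same bytes")
    other = write_temp(tmp, "other.paq8px", b"same bytez")
    same_pair = archives_match(amd, intel)
    changed_pair = archives_match(amd, other)

print(same_pair, changed_pair)
```

Equal sizes alone are not proof of identical archives (the third demo file has the same length but different content), so the checksum is the part that matters.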

  29. #2063
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    OK, here are the scores for pure paq8px_v191 - super scores for textual files (paq beats the best cmix versions!!!!) but very bad scores for the biggest files - similar to non-LSTM scores.
    Attached thumbnail: paq8px_v191.jpg

  30. #2064
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    @moisesmcardona
    Thanks, that's good to know. If native builds are that much faster, there's still a lot to be gained. And then there's AVX512 and bfloat16..

    @Darek
    Those are the same scores you posted, are they not? Anyways, you can just skip to v191a, it's just that your old "best" settings don't apply on v191. For instance, I got 191.761 bytes with "-9l" on I.EXE with v191, I'm guessing you used text pre-training on that file before v191 since it gave better results. And as always, thank you for testing, we should petition the mods to get you a special "Official Tester" badge on your profile.

  31. Thanks (2):

    Darek (15th August 2020),moisesmcardona (15th August 2020)

  32. #2065
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    @mpais - yes, these are the same scores, maybe verified. Yesterday I couldn't find my previous post with the scores - now I see it was already in place. "Official Tester" - hmmm, sounds proud. Maybe just "Tester".
    Regarding the text option - yes, I've started testing paq8px v191a and v191 with the -t option reversed.

  33. #2066
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    @mpais - I have a question. I've found that in version v191 (without the "a"), some textual files showed additional small gains when I used the (old) -r option. Switching it off gave slightly worse scores. In version v191a, using the new -r option gives, for these particular files, the same scores as v191 without -r, which means slightly worse. Is it possible that the old -r switch added something extra to the LSTM pre-trained option, and that this is now off? The gains were small, but it was always a gain.

    examples:
    S.DOC file:
    22'137 bytes - paq8px_v191 -10lrta
    22'150 bytes - paq8px_v191 -10lta
    and
    22'150 bytes - paq8px_v191a -10lrta

    T.DOC file:
    15'268 bytes - paq8px_v191 -11lreta
    15'282 bytes - paq8px_v191 -11leta
    and
    15'283 bytes - paq8px_v191a -11lreta

    V.DOC file:
    15'031 bytes - paq8px_v191 -11lreta
    15'040 bytes - paq8px_v191 -11leta
    and
    15'040 bytes - paq8px_v191a -11lreta

  34. #2067
    Member
    Join Date
    Feb 2016
    Location
    Luxembourg
    Posts
    546
    Thanks
    203
    Thanked 796 Times in 322 Posts
    Yes, I removed the retraining code, since it slowed down compression for very little gain. It can be reinstated in the future, sure, but I don't see the point, using quantized pre-trained models allows for much, much bigger gains. Your testset doesn't benefit much because you only have a few small text files, but if you test on larger files you'll notice the improvement.

    I tested v191a on the Calgary Corpus, since it's small and didn't take long. My options are probably not ideal, but the gain is already quite significant:

    Code:
    bib           17.454 bytes        -9tra
    book1        163.421 bytes        -9tra
    book2        107.174 bytes        -9tra
    geo           42.583 bytes        -9la
    news          77.579 bytes        -9tra
    obj1           7.020 bytes        -9lte
    obj2          40.929 bytes        -9lt
    paper1        10.796 bytes        -9tra
    paper2        16.625 bytes        -9tra
    pic           24.686 bytes        -9lta
    progc          8.254 bytes        -9tr
    progl          8.963 bytes        -9tr
    progp          6.179 bytes        -9tr
    trans         10.156 bytes        -9tr
    
    Total        541.819 bytes
    v190         560.318 bytes
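As a quick check of the totals, the per-file sizes above can be summed and compared with the v190 figure. The sizes are copied from the table (the dots there are thousands separators); the variable names are illustrative.

```python
# paq8px v191a archive sizes in bytes for the Calgary Corpus, from the table above.
sizes = {
    "bib": 17454, "book1": 163421, "book2": 107174, "geo": 42583,
    "news": 77579, "obj1": 7020, "obj2": 40929, "paper1": 10796,
    "paper2": 16625, "pic": 24686, "progc": 8254, "progl": 8963,
    "progp": 6179, "trans": 10156,
}
V190_TOTAL = 560318  # total reported for v190 in the same table

total = sum(sizes.values())
saved = V190_TOTAL - total
gain_percent = 100.0 * saved / V190_TOTAL

print(f"v191a total: {total} bytes")
print(f"saved vs v190: {saved} bytes ({gain_percent:.2f}%)")
```

The sum agrees with the posted total of 541.819 bytes, which is 18,499 bytes (about 3.3%) smaller than the v190 total.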

  35. Thanks:

    Darek (16th August 2020)

  36. #2068
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    The score of 541'819 is the all-time record for the Calgary Corpus (not counting the score for the tarball file compressed by cmix v17)!

  37. #2069
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,026
    Thanks
    103
    Thanked 410 Times in 285 Posts
    Performance impact when running multiple paq8px processes simultaneously 1-10 cores:

    Input:
    enwik6 - 1,000,000 bytes

    Output:
    1 core 5.3GHz:
    195,300 bytes, 356.438 sec, paq8px.exe -9l -simd avx2

    2 cores 5.3GHz +12.93%:
    195,300 bytes, 402.544 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 400.875 sec, paq8px.exe -9l -simd avx2

    3 cores 5.0GHz +29.47%:
    195,300 bytes, 461.487 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 460.439 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 459.720 sec, paq8px.exe -9l -simd avx2

    4 cores 5.0GHz +58.73%:
    195,300 bytes, 564.430 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 565.773 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 559.525 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 563.977 sec, paq8px.exe -9l -simd avx2

    5 cores 4.9GHz +101.77%:
    195,300 bytes, 717.332 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 715.223 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 717.785 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 718.941 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 719.175 sec, paq8px.exe -9l -simd avx2

    6 cores 4.9GHz +153.30%:
    195,300 bytes, 902.869 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 898.039 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 899.054 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 897.352 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 899.023 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 895.055 sec, paq8px.exe -9l -simd avx2

    7 cores 4.9GHz +218.68%:
    195,300 bytes, 1,134.999 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,131.922 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,130.578 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,131.250 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,134.827 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,131.656 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,135.896 sec, paq8px.exe -9l -simd avx2

    8 cores 4.9GHz +286.93%:
    195,300 bytes, 1,378.973 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,379.161 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,368.960 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,372.694 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,377.677 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,371.366 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,375.568 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,374.053 sec, paq8px.exe -9l -simd avx2

    9 cores 4.9GHz +366.25%:
    195,300 bytes, 1,656.135 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,642.685 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,650.871 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,655.354 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,650.746 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,657.244 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,659.041 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,652.027 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,661.884 sec, paq8px.exe -9l -simd avx2

    10 cores 4.9GHz +464.77%:
    195,300 bytes, 2,013.061 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 1,991.236 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 2,010.607 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 2,002.437 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 2,010.794 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 2,011.669 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 2,008.295 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 2,011.982 sec, paq8px.exe -9l -simd avx2
    195,300 bytes, 2,011.920 sec, paq8px.exe -9l -simd avx2

    Test system:
    CPU: Intel Core i9 10900K OC at 4.9GHz (5.3GHz turbo), 10 cores (hyper threading disabled)
    CPU speed: Cinebench single-core 227 (R15.0) 538 (R20.060), Geekbench single-core 6,705 (4.4.3) 1,473 (5.2.3)
    Memory: 2 x 32GB = 64GB DDR4 at 4000MHz, timings 18-22-22-42
    Storage: Samsung M.2 NVMe 970 Evo Plus 2TB
    Storage speed: 3,500MB/s read 3,300MB/s write (Samsung) 2,551MB/s read, 2,114MB/s write (AS SSD Benchmark)
    OS: Windows 10 Pro v2004 (unused services disabled)
    Archiver: paq8px v191

    Test batch file + enwik6: attached.

  38. Thanks (2):

    Darek (16th August 2020),Mike (16th August 2020)

  39. #2070
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,217
    Thanks
    743
    Thanked 495 Times in 383 Posts
    @sportman => interesting, it looks like for this setup the most effective way is to use 4 cores:

    cores      time      speedup   savings
    1 core     100.00%     0%        0.00%
    2 cores    112.93%    77%      -43.54%
    3 cores    129.47%   132%      -56.84%
    4 cores    158.73%   152%      -60.32%
    5 cores    201.77%   148%      -59.65%
    6 cores    253.30%   137%      -57.78%
    7 cores    318.68%   120%      -54.47%
    8 cores    386.93%   107%      -51.63%
    9 cores    466.25%    93%      -48.19%
    10 cores   564.77%    77%      -43.52%
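The three columns can be rebuilt from the raw timings. The Python sketch below assumes the definitions that reproduce the posted numbers: "time" is the slowest process of each run relative to the single-core run, "speedup" is the gain in combined throughput (n / time - 1), and "savings" is the change in wall time per file (time / n - 1). The slowest times are taken from the enwik6 results in the earlier post; the function name is illustrative.

```python
# Slowest process time in seconds for each number of simultaneous runs.
slowest = {1: 356.438, 2: 402.544, 3: 461.487, 4: 565.773, 5: 719.175,
           6: 902.869, 7: 1135.896, 8: 1379.161, 9: 1661.884, 10: 2013.061}

def scaling_row(n):
    """Return (relative time, throughput speedup, time saved per file) for n runs."""
    rel_time = slowest[n] / slowest[1]
    speedup = n / rel_time - 1.0
    savings = rel_time / n - 1.0
    return rel_time, speedup, savings

print("cores     time  speedup  savings")
for n in sorted(slowest):
    rel_time, speedup, savings = scaling_row(n)
    print(f"{n:5d} {rel_time:8.2%} {speedup:8.0%} {savings:8.2%}")

best = max(slowest, key=lambda n: scaling_row(n)[1])
print("best throughput at", best, "cores")
```

With these definitions the peak is at 4 simultaneous processes, about 152% more throughput than a single run.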

  40. Thanks:

    Sportman (16th August 2020)
