
Thread: enwik10 benchmark results

  1. #31
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    336
    Thanks
    50
    Thanked 61 Times in 49 Posts
    Quote Originally Posted by Sportman View Post
    Yes no error, length ok, but decompressed output content not ok.

    For example original:
    |last=Stubblefield|first=Phillip G. |

    Crush 1.9 output:
    |last=Stubblt;ref |first=Phillip G. |
    That's strange, because when I decompress the output there is no error like a corrupted file or anything like that. Have you tested v1.3, please?

  2. #32
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by michael maniscalco View Post
    In the case of M99 you get basically the same compression ratio and same memory requirements no matter how many cores you use. But the throughput increases dramatically.
    1,722,407,658 bytes, 778.796 sec. - 401.317 sec., m99 -t1 -b1000000000 (beta)
    1,722,420,386 bytes, 447.087 sec. - 361.869 sec., m99 -t2 -b1000000000 (beta)
    1,720,127,471 bytes, 337.509 sec. - 270.985 sec., m99 -t3 -b1000000000 (beta)
    1,722,495,767 bytes, 283.544 sec. - 229.541 sec., m99 -t4 -b1000000000 (beta)
    1,720,169,087 bytes, 254.631 sec. - 200.170 sec., m99 -t5 -b1000000000 (beta)
    1,720,200,316 bytes, 236.581 sec. - 179.022 sec., m99 -t6 -b1000000000 (beta)
    1,719,984,055 bytes, 226.561 sec. - 167.299 sec., m99 -t7 -b1000000000 (beta)
    1,722,762,887 bytes, 218.418 sec. - 158.455 sec., m99 -t8 -b1000000000 (beta)

  3. Thanks:

    michael maniscalco (24th January 2020)

  4. #33
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by Sportman View Post
    1,722,407,658 bytes, 778.796 sec. - 401.317 sec., m99 -t1 -b1000000000 (beta)
    1,722,420,386 bytes, 447.087 sec. - 361.869 sec., m99 -t2 -b1000000000 (beta)
    1,720,127,471 bytes, 337.509 sec. - 270.985 sec., m99 -t3 -b1000000000 (beta)
    1,722,495,767 bytes, 283.544 sec. - 229.541 sec., m99 -t4 -b1000000000 (beta)
    1,720,169,087 bytes, 254.631 sec. - 200.170 sec., m99 -t5 -b1000000000 (beta)
    1,720,200,316 bytes, 236.581 sec. - 179.022 sec., m99 -t6 -b1000000000 (beta)
    1,719,984,055 bytes, 226.561 sec. - 167.299 sec., m99 -t7 -b1000000000 (beta)
    1,722,762,887 bytes, 218.418 sec. - 158.455 sec., m99 -t8 -b1000000000 (beta)
    I'll bet that the non-linear scaling of performance is due to some threads winding up on virtual cores and thus competing for the CPU with another thread. But thanks for running this test. It does demonstrate my point that in some cases the implementation of the core underlying algorithms does make a difference in real-world results.

    - Michael

  5. #34
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    294
    Thanked 1,293 Times in 734 Posts
    It's not just hyperthreading. L3 (and sometimes L2) caches are shared between cores, so MT scaling is only linear for pure number-crunching.

  6. #35
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by Shelwien View Post
    It's not just hyperthreading. L3 (and sometimes L2) caches are shared between cores, so MT scaling is only linear for pure number-crunching.
    That's true; however, my own tests on the suffix array engine (MS4) show that it generally scales well even when there is a fair amount of cache eviction going on. And the encoder is extremely cache-friendly, so I doubt there's any cache contention in the MT encoding/decoding parts.

  7. #36
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    It can be due to the CPU underclocking itself because of heat or heavy use of AVX2 instructions. According to reviews on the net, the i9-9900K does not scale as well as expected.
    A test with bsc multithreading will probably show whether it's a hardware issue or not.


    @Sportman
    Is it possible to include the Turbo-Range-Coder BWT?
    You can download the Windows exe here.

  8. #37
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    294
    Thanked 1,293 Times in 734 Posts
    My m99 exes were all built for the SSE2 target.

  9. Thanks:

    dnd (24th January 2020)

  10. #38
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by dnd View Post
    It can be due to the CPU underclocking itself because of heat or heavy use of AVX2 instructions. According to reviews on the net, the i9-9900K does not scale as well as expected.
    A test with bsc multithreading will probably show whether it's a hardware issue or not.
    Well, I'm not too concerned about it. I could easily set up the code to detect physical core IDs and set CPU affinity if it mattered that much, but it doesn't. Besides, I wouldn't want to trouble Sportman with such things.
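
    For the record, roughly what that would look like on Linux; a minimal sketch only (not m99 code), assuming the sysfs topology files are present, and with a hypothetical worker body:
    Code:
    // Illustrative sketch only: pin one worker thread per physical core on Linux
    // by reading the sysfs topology, so that two workers never share a core via
    // hyperthreading. Compile with: g++ -O2 -pthread pin.cpp
    #include <fstream>
    #include <pthread.h>
    #include <sched.h>
    #include <set>
    #include <string>
    #include <thread>
    #include <utility>
    #include <vector>
    
    // Return one logical CPU id per physical core (the first sibling wins).
    static std::vector<int> one_cpu_per_physical_core()
    {
        std::vector<int> cpus;
        std::set<std::pair<int, int>> seen;    // (package_id, core_id) already taken
        for (int cpu = 0;; ++cpu)
        {
            std::string base = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) + "/topology/";
            std::ifstream core(base + "core_id"), pkg(base + "physical_package_id");
            int coreId = 0, pkgId = 0;
            if (!(core >> coreId) || !(pkg >> pkgId))
                break;                         // no more logical CPUs
            if (seen.insert({pkgId, coreId}).second)
                cpus.push_back(cpu);           // first logical CPU of this physical core
        }
        return cpus;
    }
    
    // Restrict the calling thread to a single logical CPU.
    static void pin_current_thread(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }
    
    int main()
    {
        std::vector<std::thread> workers;
        for (int cpu : one_cpu_per_physical_core())
            workers.emplace_back([cpu]
            {
                pin_current_thread(cpu);
                // encode_block(...) would go here; the worker body is hypothetical.
            });
        for (auto& t : workers)
            t.join();
    }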

  11. #39
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Well, this is not only about M99. If it's an underclocking issue, then there is no point in making a separate (all-cores) multithreading benchmark, because the results will not be stable enough.
    Last edited by dnd; 24th January 2020 at 21:30.

  12. #40
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by dnd View Post
    Well, this is not only about M99. If it's an underclocking issue, then there is no point in making a separate (all-cores) multithreading benchmark, because the results will not be stable enough.
    True. However, I was only referring to the possibility of it simply being due to some threads running on virtual cores. Disabling hyperthreading would also be a possibility in this particular case, as it would eliminate such effects entirely.

  13. #41
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,962
    Thanks
    294
    Thanked 1,293 Times in 734 Posts
    Code:
    i7-7820X @ 4.46Ghz, HT disabled
    
    195828888 105.813s 39.140s // m99 e enwik9 1 -t1 -b100000000
    195837092  45.485s 31.469s // m99 e enwik9 1 -t2 -b100000000
    195840689  33.219s 22.844s // m99 e enwik9 1 -t3 -b100000000
    195843233  27.094s 18.891s // m99 e enwik9 1 -t4 -b100000000
    195951462  23.375s 16.547s // m99 e enwik9 1 -t5 -b100000000
    195852676  20.578s 14.547s // m99 e enwik9 1 -t6 -b100000000
    196240337  19.156s 13.437s // m99 e enwik9 1 -t7 -b100000000
    195864494  18.047s 12.203s // m99 e enwik9 1 -t8 -b100000000

  14. Thanks (2):

    michael maniscalco (24th January 2020),Sportman (24th January 2020)

  15. #42
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by michael maniscalco View Post
    some threads winding up on virtual cores
    Hyperthreading is disabled; there are only 8 real cores.

  16. #43
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by dnd View Post
    It can be due to the CPU underclocking itself because of heat or heavy use of AVX2 instructions.
    I checked -t8 again and live-monitored all cores separately (CPU usage, effective clock, temperature, thermal throttling). As soon as usage goes to 100%, the effective clock of each core is 5200 MHz and stays there until usage falls. Temperature is almost always between 60-75 degrees; sometimes some cores reach a maximum of 84 degrees for a very short time, when the liquid cooling kicks in late after a usage change. There is no thermal throttling on any core.

  17. Thanks:

    michael maniscalco (24th January 2020)

  18. #44
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by Shelwien View Post
    Code:
    i7-7820X @ 4.46Ghz, HT disabled
    
    195828888 105.813s 39.140s // m99 e enwik9 1 -t1 -b100000000
    195837092  45.485s 31.469s // m99 e enwik9 1 -t2 -b100000000
    195840689  33.219s 22.844s // m99 e enwik9 1 -t3 -b100000000
    195843233  27.094s 18.891s // m99 e enwik9 1 -t4 -b100000000
    195951462  23.375s 16.547s // m99 e enwik9 1 -t5 -b100000000
    195852676  20.578s 14.547s // m99 e enwik9 1 -t6 -b100000000
    196240337  19.156s 13.437s // m99 e enwik9 1 -t7 -b100000000
    195864494  18.047s 12.203s // m99 e enwik9 1 -t8 -b100000000

    Interesting. So it looks like for t1-t4 the encoder is pretty much scaling linearly, but then it starts to get diminishing returns. Cache contention, I'm guessing. No big surprise on the decode side, where cache performance is expected to be bad anyhow.

  19. #45
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by dnd View Post
    A test with bsc multithreading will probably show whether it's a hardware issue or not.
    1,638,441,156 bytes, 1,030.489 sec. - 640.502 sec., bsc -b1024 -m0 -e2 -T (v3.1.0)
    1,638,441,156 bytes, 358.558 sec. - 223.752 sec., bsc -b1024 -m0 -e2 (v3.1.0)

  20. #46
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by dnd View Post
    Is it possible to include the Turbo-Range-Coder BWT?
    Added; without parameters it crashes.

  21. #47
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Thanks Sportman for this marathon benchmark on the latest Intel CPU.

    I think the poor scalability in the bsc case can be explained by:

    - Non-local memory access patterns in BWT compression + decompression, even when using cache-aware algorithms (because of the large blocks).
    - A maximum of 8 blocks can be processed in parallel with an 8-core CPU. The BWT in bsc needs 5-6 GB minimum for compression and decompression, so with 32 GB at most 5-6 blocks can be processed in parallel without using swap memory. Depending on how the blocks are scheduled, bsc is probably hitting the swap device (pagefile) for the 10 blocks of enwik10.

  22. #48
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by dnd View Post
    Thanks Sportman for this marathon benchmark on the latest Intel CPU.

    I think the poor scalability in the bsc case can be explained by:

    - Non-local memory access patterns in BWT compression + decompression, even when using cache-aware algorithms (because of the large blocks).
    - A maximum of 8 blocks can be processed in parallel with an 8-core CPU. The BWT in bsc needs 5-6 GB minimum for compression and decompression, so with 32 GB at most 5-6 blocks can be processed in parallel without using swap memory. Depending on how the blocks are scheduled, bsc is probably hitting the swap device (pagefile) for the 10 blocks of enwik10.
    BSC uses an LZP preprocessor which, depending on the minimum match threshold, might be reducing the blocks enough to fit into RAM. I'm not sure about that, but those timings don't look bad enough to me to have gone to swap.

    Testing the scalability of the BWT on massive blocks like these, with regard to cache-aware algorithms, should be fairly easy to measure. I'll run tests using MSufSort when I get a chance and report back with the results. In the BSC case the underlying algorithm is DivSufSort, which doesn't facilitate full MT, so measuring its scalability isn't really measuring the same thing.

  23. #49
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    Have you tested v1.3, please?
    Yes, same problem:

    2,620,142,147 bytes, 4,216.061 sec. - 93.476 sec., crush -9 (v1.3) (compare fail)

    Try this command line to compare two files and display differences:
    fc filename1 filename2
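
    fc works, but for multi-GB outputs it can be handier to get just the offset of the first mismatch. Here is a small generic sketch (not one of the tools in this thread) that streams both files and reports where they first diverge:
    Code:
    // Minimal sketch: report the offset of the first byte where two files differ.
    // Streams both files in fixed-size chunks so enwik10-sized outputs are fine.
    #include <cstdio>
    #include <cstdint>
    #include <vector>
    
    int main(int argc, char** argv)
    {
        if (argc != 3) { std::fprintf(stderr, "usage: %s file1 file2\n", argv[0]); return 2; }
        std::FILE* a = std::fopen(argv[1], "rb");
        std::FILE* b = std::fopen(argv[2], "rb");
        if (!a || !b) { std::fprintf(stderr, "cannot open input files\n"); return 2; }
    
        std::vector<unsigned char> bufA(1 << 20), bufB(1 << 20);
        std::uint64_t offset = 0;
        for (;;)
        {
            std::size_t nA = std::fread(bufA.data(), 1, bufA.size(), a);
            std::size_t nB = std::fread(bufB.data(), 1, bufB.size(), b);
            std::size_t n = nA < nB ? nA : nB;
            for (std::size_t i = 0; i < n; ++i)
                if (bufA[i] != bufB[i])
                {
                    std::printf("first difference at offset %llu\n",
                                (unsigned long long)(offset + i));
                    return 1;
                }
            offset += n;
            if (nA != nB)
            {
                std::printf("files differ in length at offset %llu\n",
                            (unsigned long long)offset);
                return 1;
            }
            if (nA == 0) break;   // both files reached EOF
        }
        std::printf("files are identical (%llu bytes)\n", (unsigned long long)offset);
        return 0;
    }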

  24. #50
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    Yes, you're right, bsc is reducing the memory requirement by using LZP preprocessing.
    BSC uses ~3.7 GB per 1024 MB block (determined with turbobench and enwik9).
    Assuming all the blocks get LZP-reduced like enwik9, we have 29.6 GB (3.7 GB x 8 cores) as peak memory. Considering the memory needed for the OS, I think we are at the limit of swapping for a 32 GB system.

    Quote Originally Posted by michael maniscalco View Post
    In the BSC case the underlying algorithm is DivSufSort, which doesn't facilitate full MT, so measuring its scalability isn't really measuring the same thing.
    Yes, there is not much MT benefit for the (in-block) BWT at compression, but here the ~9.3 blocks of 1024 MB each are processed in parallel, saturating the 8 cores at ~100%.

  25. #51
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    336
    Thanks
    50
    Thanked 61 Times in 49 Posts
    Please test BBB using the m1000 option. I guess the result is better than bsc.

  26. #52
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    I'm wondering whether the large/huge pages option "-P" can accelerate bsc in multithreading mode.
    For enwik9, I've not found any significant benefit with Linux on my hardware.
    BSC has no logic to limit the number of blocks read and processed in parallel,
    so it was not possible to test enwik10 with 16 GB in multithreaded mode.

  27. #53
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    Please test BBB using the m1000 option. I guess the result is better than bsc.
    Added m329, higher values with cf give out of memory.

  28. #54
    Programmer michael maniscalco's Avatar
    Join Date
    Apr 2007
    Location
    Boston, Massachusetts, USA
    Posts
    141
    Thanks
    26
    Thanked 94 Times in 31 Posts
    Quote Originally Posted by dnd View Post
    I'm wondering whether the large/huge pages option "-P" can accelerate bsc in multithreading mode.
    For enwik9, I've not found any significant benefit with Linux on my hardware.
    BSC has no logic to limit the number of blocks read and processed in parallel,
    so it was not possible to test enwik10 with 16 GB in multithreaded mode.
    It shouldn't be too hard to modify the code. But you are correct that BSC, in multithreaded mode, simply processes one block per core regardless of how many cores there are.
    My machine has 6 cores (12 with hyperthreading) and it goes quickly to swap even with 32 GB of RAM. I hacked the code up to max out at 8 threads and that was on the cusp of swapping.

    Making BSC more scalable requires either replacing DivSufSort entirely or using the OpenMP feature of DivSufSort to get a pseudo-multithreaded BWT one block at a time, processing 1/Nth of the block on each of the N cores. If the blocks are sufficiently large, it really shouldn't have much of an effect on overall compression results.

    But as it is, BSC is somewhat crippled by the way it approaches multithreading, at least with large blocks.

    [idea 1]
    Add a new flag for the maximum amount of overall memory to use. Assume that each block will take 5N, even if it actually takes less due to LZP preprocessing. Then, if the number of cores * 5N exceeds the maximum overall memory, reduce the number of cores as needed.

    So a single optional max-memory setting would solve the issue entirely by limiting the encode/decode speed (by reducing cores) while trying to satisfy the requested block size throughout; a rough sketch of this is at the end of this post.

    [idea 2]
    switch to MSufSort
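
    A rough sketch of what idea 1 boils down to; the function name, the max-memory flag and the 5N-per-block figure are just the assumptions from the description above, not anything taken from bsc:
    Code:
    // Rough sketch of idea 1: cap the worker count so that
    // workers * (5 * block_size) stays within a user-supplied memory budget.
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    
    static std::uint32_t pick_worker_count(std::uint64_t blockSize,        // -b, in bytes
                                           std::uint64_t maxMemory,        // hypothetical new flag, in bytes
                                           std::uint32_t requestedThreads) // -t
    {
        const std::uint64_t perBlock = 5 * blockSize;  // assume 5N per block, even with LZP
        std::uint64_t byMemory = perBlock ? maxMemory / perBlock : requestedThreads;
        std::uint64_t count = std::min<std::uint64_t>(requestedThreads, byMemory);
        return static_cast<std::uint32_t>(std::max<std::uint64_t>(1, count)); // keep at least one worker
    }
    
    int main()
    {
        // 1 GiB blocks, 28 GiB budget, 8 threads requested -> 5 workers run at once.
        std::printf("%u\n", pick_worker_count(1ull << 30, 28ull << 30, 8));
    }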

  29. #55
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    336
    Thanks
    50
    Thanked 61 Times in 49 Posts
    Quote Originally Posted by Sportman View Post
    Added m329, higher values with cf give out of memory.
    I use:
    BBB cm1000 enwik10 enw10.bbb

  30. #56
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 199 Times in 147 Posts
    BSC splits enwik10 into 10 blocks and preprocesses each block with LZP:

    block size -> size after lzp
    1073741824->1060156758
    1073741824->1057244115
    1073741824->1041287599
    1073741824->1044416511
    1073741824->1045617768
    1073741824->1040648094
    1073741824->989922738
    1073741824->970998143
    1073741824->1007507166
    336323584->298179402

    The saving in this case is only a few percent.
    In theory, bsc with 5N+ (BWT + EC) will begin to swap when processing the 6th block on a system with 32 GB of memory.
    I think it's safe to reduce the number of OpenMP parallel threads by setting the environment variable OMP_NUM_THREADS to 5.

  31. #57
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by suryakandau@yahoo.co.id View Post
    BBB cm1000
    Added.

  32. #58
    Member
    Join Date
    Feb 2020
    Location
    Earth
    Posts
    3
    Thanks
    3
    Thanked 0 Times in 0 Posts
    Sportman, could you please test
    PPMd and PPMonstr?
    glza?
    bsc with -e1? Is it faster than m99 or not?

    Looking at the results of m99 and bsc, I think it would be better
    to test all BWT compressors with either a 1 GB or 1 GiB block size,
    or, when a precise value is impossible, pick the nearest value allowed by the compressor's interface.

  33. #59
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    1,025
    Thanks
    102
    Thanked 410 Times in 285 Posts
    Quote Originally Posted by Boris K View Post
    could you please test PPMd and PPMonstr?
    Both failed. Basically, everything not listed either failed, has no Windows binary, or needs >48 hours to run (comp/decomp).
    Quote Originally Posted by Boris K View Post
    glza?
    Max. 4GB input.
    Quote Originally Posted by Boris K View Post
    bsc with -e1?
    Added.
    Quote Originally Posted by Boris K View Post
    is it faster than m99 or not?
    No, but smaller output.

  34. #60
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    801
    Thanks
    244
    Thanked 255 Times in 159 Posts
    Wow, zhuff 0.99 (predecessor of zstd) has even better decoding speed, comparable with pure LZ:
    3,689,371,924 bytes, 214.152 sec. - 7.134 sec., lz4 -9 (v1.9.2)
    3,325,794,738 bytes, 33.481 sec. - 12.952 sec., zstd -2 (v1.4.4)
    3,292,420,658 bytes, 37.645 sec. - 9.609 sec., zhuff -c0 -t1 (v0.99beta)
    3,158,641,894 bytes, 150.717 sec. - 9.848 sec., zhuff -c1 -t1 (v0.99beta)
    3,137,189,321 bytes, 43.010 sec. - 13.343 sec., zstd -3 (v1.4.4)
    3,078,914,611 bytes, 240.124 sec. - 9.381 sec., zhuff -c2 -t1 (v0.99beta)
    3,072,049,859 bytes, 46.019 sec. - 14.061 sec., zstd -4 (v1.4.4)
    It used only an (ancient) version of FSE/tANS; could anybody elaborate on this difference?

    OK, while zstd separately encodes 4 streams (offset, match_length, literals, literal_length), I think zhuff was just LZ4 + FSE (?)
    If so, maybe it is also worth considering a more recent version of LZ4 plus some faster EC, like https://github.com/jkbonfield/rans_static, also order-1...
    Last edited by Jarek; 2nd February 2020 at 14:40.


