Page 13 of 14 FirstFirst ... 311121314 LastLast
Results 361 to 390 of 395

Thread: Zstandard

  1. #361
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    I think zstd is I/O bound not only by hardware but also by the OS.
    One should also consider binaries and other types of data,
    as enwik8 contains a lot of short matches at large distances.
    This can significantly reduce decompression speed.

    Ubuntu 19.10 - gcc 9.2 - skylake 3.4GHz

    Code:
    ./turbobench -ebrotli,8w23,8/zstd,13 ~/c/c/enwik8 -I15 -J15
    ​
         C Size  ratio%     C MB/s     D MB/s   Name          Dic/Window        (bold = pareto) MB=1.000.0000
        27702787    27.7      2.88     833.37   zstd 17            23
        28781430    28.8      2.74     301.10   brotli 9           
        29092151    29.1      3.27     430.77   brotli 9w23        23
        29291954    29.3     30.97     779.93   lzturbo 32     
        29464303    29.5      5.65     320.02   brotli 8           
        29468806    29.5      4.93     433.54   brotli 9w22        22 
        29582960    29.6      6.50     434.81   brotli 8w23        23
        30153208    30.2     10.16     327.31   brotli 7           
        30205556    30.2     10.40     416.06   brotli 7w23        23
        30306199    30.3     11.74     424.61   brotli 7w22        22
        30331609    30.3      7.62     894.94   zstd 13            22
        33576959    33.6     84.39    1162.03   lzturbo 31  
       100000000   100.0  14317.96   14280.02   memcpy
    Added LzTurbo 31 +32

    Added zstd to Static/Dynamic web content compression benchmark
    Last edited by dnd; 7th January 2020 at 13:08.

  2. Thanks:

    Jarek (11th January 2020)

  3. #362
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    749
    Thanks
    215
    Thanked 282 Times in 164 Posts
    Quote Originally Posted by dnd View Post
    I think zstd is I/O bound not only by hardware but also by the OS.
    One should also consider binaries and other types of data,
    as enwik8 contains a lot of short matches at large distances.
    This can significantly reduce decompression speed.

    Ubuntu 19.10 - gcc 9.2 - skylake 3.4GHz

    Code:
    ./turbobench -ebrotli,8w23,8/zstd,13 ~/c/c/enwik8 -I15 -J15
    ​
         C Size  ratio%     C MB/s     D MB/s   Name          Dic/Window        (bold = pareto) MB=1.000.0000
        27702787    27.7      2.88     833.37   zstd 17            23
        28781430    28.8      2.74     301.10   brotli 9           
        29092151    29.1      3.27     430.77   brotli 9w23        23
        29291954    29.3     30.97     779.93   lzturbo 32     
        29464303    29.5      5.65     320.02   brotli 8           
        29468806    29.5      4.93     433.54   brotli 9w22        22 
        29582960    29.6      6.50     434.81   brotli 8w23        23
        30153208    30.2     10.16     327.31   brotli 7           
        30205556    30.2     10.40     416.06   brotli 7w23        23
        30306199    30.3     11.74     424.61   brotli 7w22        22
        30331609    30.3      7.62     894.94   zstd 13            22
        33576959    33.6     84.39    1162.03   lzturbo 31  
       100000000   100.0  14317.96   14280.02   memcpy
    Added LzTurbo 31 +32

    Added zstd to Static/Dynamic web content compression benchmark
    And what if you add a very simple checksum calculation for the decompressed data into the benchmark after (or during) the decompression? -- Just to make it less likely that some of the memcpys are optimized away by the compiler as unnecessary no-ops.

  4. #363
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,712
    Thanks
    270
    Thanked 1,184 Times in 655 Posts
    Btw, they keep optimizing zstd. Based on my own tests, zstd 1.4.4 decodes about 10% faster than 1.3.8, and the current "1.4.5" dev build adds 5% or so more.
    My benchmark (ppmd_bench in the turbobench thread) _does_ have a CRC after decoding (not included in the timing),
    and zstd there seems a bit slower than in turbobench (782MB/s instead of 811MB/s), but zstd decoding speed really is like that.

  5. #364
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    486
    Thanks
    168
    Thanked 166 Times in 114 Posts
    It's also worth noting that memory layout for such fast decoders can make a difference, so different test harnesses may lead to different speeds.

    At one point I ended up using a non-power-of-two blocking size (nearly 1MB, but not quite), as it gave a substantial speed benefit on some systems. (Sadly I cannot recall which system, so I can't demonstrate the effect again.)

  6. #365
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    And what if you add a very simple checksum calculation for the decompressed data into the benchmark after (or during) the decompression? -- Just to make it less likely that some of the memcpys are optimized away by the compiler as unnecessary no-ops.
    The compiler that can optimize away this memcpy has not yet been invented, but at the speed AI is going, nothing is impossible.

  7. #366
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    749
    Thanks
    215
    Thanked 282 Times in 164 Posts
    Quote Originally Posted by dnd View Post
    The compiler that can optimize away this memcpy has not yet been invented, but at the speed AI is going, nothing is impossible.
    Looks promising. Still, it would be interesting to understand why the difference appears only when the decoded result is not used in a significant way.

    For practical use like Linux package management, I'd anticipate that a benchmark run through the normal CLI tools is closer to actual results.

  8. #367
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,712
    Thanks
    270
    Thanked 1,184 Times in 655 Posts
    You can look at this post for Sportman's benchmark: https://encode.su/threads/2119-Zstandard?p=62944&pp=1
    He uses codec executables with actual file i/o.

  9. #368
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    Looks promising. Still, it would be interesting to understand why the difference appears only when the decoded result is not used in a significant way.
    Decompression is often I/O bound; Sportman should also try a faster compressor like lz4 to see the effect.

    For practical use like Linux package management, I'd anticipate that a benchmark run through the normal CLI tools is closer to actual results.
    For this use case, more compression is in general better than decompression speed, but brotli is extremely slow on ARM,
    see: CI Benchmarks
    In my experiments with turbobench, zstd is bad at binary data compared to lzma or brotli.
    Developers, like sales managers, have a tendency to always present something new without extensive testing in real-scenario benchmarks.

  10. #369
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Input:
    1,000,000,000 bytes - enwik9

    Output:
    509,454,838 bytes - lz4 -1

    1 core:
    1.863 0.705 RAMDISK
    1.895 0.731 NVMe
    2.135 0.831 SSD
    6.470 0.728 HD

    2+ cores:
    1.848 0.692 RAMDISK
    1.849 0.722 NVMe
    2.015 0.720 SSD
    6.910 0.965 HD

    ---------------------------------

    Output:
    374,839,215 bytes - lz4 -9

    1 core:
    20.862 0.722 RAMDISK
    20.974 0.742 NVMe
    20.910 1.023 SSD
    21.161 0.735 HD

    2+ cores:
    20.529 0.682 RAMDISK
    20.474 0.706 NVMe
    20.486 0.712 SSD
    20.561 0.704 HD

    ---------------------------------

    Output:
    357,515,962 bytes - zstd -1

    1 core:
    4.493 1.154 RAMDISK
    4.699 1.196 NVMe
    4.363 1.284 SSD
    6.130 1.203 HD

    2+ cores:
    2.332 1.142 RAMDISK
    2.335 1.163 NVMe
    2.335 1.162 SSD
    2.631 1.170 HD

    ---------------------------------

    Output:
    373,837,267 bytes - brotli -0

    1 core:
    3.565 2.804 RAMDISK
    3.601 2.798 NVMe
    3.672 2.806 SSD
    4.214 2.798 HD

    2+ cores:
    3.519 2.066 RAMDISK
    3.523 2.685 NVMe
    3.521 2.687 SSD
    3.675 2.699 HD

  11. #370
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Thanks Sportman.
    Well, here we can also see that zstd decompression is (2.804/1.154, RAMDISK) about 2.5 times faster than brotli's.
    The in-memory turbobench results are thus confirmed even when I/O is involved.
    The lz4 timings show that zstd is not I/O bound on this hardware, unlike the Google hardware used.

    Apparently zstd -1 uses multithreading at compression, whereas brotli does not.
    Last edited by dnd; 11th January 2020 at 16:45.

  12. #371
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Input:
    10,000,000,000 bytes - enwik10 (1GB EN, 1GB DE, 1GB FR, 1GB RU, 1GB ES, 1GB JA, 1GB IT, 1GB PL, 1GB ZH, 1GB PT)

    Output:
    3,638,532,812 bytes - zstd -1

    23.296 11.871 NVMe
    37.934 71.008 SSD
    83.902 32.562 HD

    Output:
    3,325,794,738 bytes - zstd -2

    33.776 13.143 NVMe
    45.577 70.115 SSD
    81.706 34.347 HD

    -----------------------------------

    Output:
    3,770,151,519 bytes - brotli -0

    34.245 26.295 NVMe
    46.397 80.399 SSD
    88.315 35.189 HD

    Output:
    3,545,817,857 bytes - brotli -1

    42.110 24.869 NVMe
    49.335 75.098 SSD
    85.677 34.612 HD

  13. #372
    Member
    Join Date
    Nov 2014
    Location
    California
    Posts
    142
    Thanks
    46
    Thanked 40 Times in 29 Posts
    Next test ... cmix on enwik10

  14. #373
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    155
    Thanks
    15
    Thanked 17 Times in 15 Posts
    Quote Originally Posted by hexagone View Post
    Next test ... cmix on enwik10
    Try out bigm with max setting or crush v1.2

  15. #374
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    749
    Thanks
    215
    Thanked 282 Times in 164 Posts
    Quote Originally Posted by Sportman View Post
    zstd -1, zstd -2, brotli -0, brotli -1
    Could you compare with LZ4 or snappy? Perhaps this domain is already too fast for formats with real entropy coding, and especially so for formats with context modeling (brotli)... For sure one loses everything good that brotli can offer below quality 4. I don't know about zstd -- it is typically focusing more on the lowest compression settings -- but I wouldn't be surprised if LZ4 were competitive against both in this part of the performance space.

    Also, zstd encoder seems to have a smarter logic to deal with gigabyte range match finding: I suspect it uses longer hashes for long data (which makes perfect sense, but might be less important in real life, where compression can be applied chunked for looooong data because you don't want to lose all your data for one lost bit).

    It might be interesting to run the same LZ match finder for both formats. With that, the only two major differences should be the existence of context modeling and the more flexible block-splitting dance in brotli (which usually make compression density about 5% better, but slow decompression into the ~500 MB/s class).

  16. #375
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    Could you compare with LZ4 or snappy?
    Can't find Snappy Windows binary.

    Input:
    10,000,000,000 bytes - enwik10

    Output:
    5,034,758,325 bytes - lz4 -1

    18.480 7.376 NVMe
    35.476 116.978 SSD
    106.030 39.022 HD

    Output:
    3,689,371,924 bytes - lz4 -9

    214.639 7.515 NVMe
    214.791 80.685 SSD
    216.309 38.788 HD

  17. #376
    Member
    Join Date
    Aug 2015
    Location
    indonesia
    Posts
    155
    Thanks
    15
    Thanked 17 Times in 15 Posts
    Quote Originally Posted by Sportman View Post
    Can't find Snappy Windows binary.

    Input:
    10,000,000,000 bytes - enwik10

    Output:
    5,034,758,325 bytes - lz4 -1

    18.480 7.376 NVMe
    35.476 116.978 SSD
    106.030 39.022 HD

    Output:
    3,689,371,924 bytes - lz4 -9

    214.639 7.515 NVMe
    214.791 80.685 SSD
    216.309 38.788 HD
    Could you compare with crush v1.3 ?

  18. #377
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    749
    Thanks
    215
    Thanked 282 Times in 164 Posts
    Quote Originally Posted by Sportman View Post
    216.309 38.788 HD
    What are these numbers? Seconds? Wall-clock time or CPU time used?

  19. #378
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    What are these numbers about? Seconds? Time from a real clock or cpu used?
    Encoding and decoding seconds, global (wall-clock) time, as measured by Timer 9.01 (Igor Pavlov, Public Domain, 2009-05-31).

    NVMe:
    2 x M.2 NVMe in RAID 0
    3272MB/s read, 2704MB/s write

    SSD:
    1 x SATA SSD
    512MB/s read, 306MB/s write

    HD:
    1 x SATA harddisk
    159MB/s read, 188MB/s write

  20. #379
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Input:
    10,000,000,000 bytes - enwik10

    Code:
    Output:
    3,638,532,812 bytes,    23.309 sec. - 11.745 sec., zstd -1
    3,325,794,738 bytes,    33.481 sec. - 12.952 sec., zstd -2
    3,137,189,321 bytes,    43.010 sec. - 13.343 sec., zstd -3
    3,072,049,859 bytes,    46.019 sec. - 14.061 sec., zstd -4
    2,993,531,459 bytes,    72.476 sec. - 14.123 sec., zstd -5
    2,921,997,106 bytes,    96.765 sec. - 14.029 sec., zstd -6
    2,819,369,488 bytes,   131.372 sec. - 13.304 sec., zstd -7
    2,780,718,316 bytes,   168.903 sec. - 12.921 sec., zstd -8
    2,750,214,835 bytes,   238.516 sec. - 12.896 sec., zstd -9
    2,694,582,971 bytes,   288.670 sec. - 12.842 sec., zstd -10
    2,669,751,039 bytes,   374.093 sec. - 13.039 sec., zstd -11
    2,645,099,063 bytes,   560.829 sec. - 13.136 sec., zstd -12
    2,614,435,940 bytes,   745.449 sec. - 12.699 sec., zstd -13
    2,569,453,043 bytes,   967.368 sec. - 12.997 sec., zstd -14
    2,539,608,782 bytes, 1,356.777 sec. - 13.194 sec., zstd -15
    2,450,374,607 bytes, 1,495.895 sec. - 12.653 sec., zstd -16
    2,380,073,078 bytes, 2,210.100 sec. - 13.032 sec., zstd -17
    2,347,338,736 bytes, 2,601.788 sec. - 13.256 sec., zstd -18
    2,303,077,470 bytes, 3,402.543 sec. - 13.394 sec., zstd -19
    2,269,340,648 bytes, 4,296.761 sec. - 14.471 sec., zstd -20
    2,229,014,084 bytes, 5,052.537 sec. - 15.063 sec., zstd -21
    2,364,340,233 bytes, 5,658.734 sec. - 15.333 sec., zstd -22
    Last edited by Sportman; 15th January 2020 at 15:50.

  21. Thanks:

    Cyan (15th January 2020)

  22. #380
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
    Do I read this correctly, that zstd -22 actually produces worse compression than zstd -21 on this sample?

    We don't see that on enwik9 or silesia, but nonetheless, if that's the case,
    it would be a good reason to launch an investigation into this source file (and any other file featuring the same issue).
    Last edited by Cyan; 15th January 2020 at 22:37.

  23. #381
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Quote Originally Posted by Cyan View Post
    Do I read this correctly, that zstd -22 actually produces worse compression than zstd -21 on this sample?
    Yes, it's in the log file; I'll rerun -22 to double-check.

  24. Thanks:

    Cyan (15th January 2020)

  25. #382
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Quote Originally Posted by Sportman View Post
    I'll rerun -22 to double-check.
    zstd -22 --ultra enwik10 -o enwik10.zst
    enwik10 : 23.64% (10000000000 => 2364340233 bytes, enwik10.zst)

    Quote Originally Posted by Cyan View Post
    worse compression than zstd -21
    Also worse than zstd -18, -19, -20.

  26. Thanks:

    Cyan (16th January 2020)

  27. #383
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
    Thanks, Sportman!
    We'll have to look into that.

    Since it doesn't show up on enwik9, I would expect some components of enwik10 to show this impact more than others.
    Let's zoom in on a practical case to analyze.
    Last edited by Cyan; 16th January 2020 at 03:23.

  28. #384
    Member
    Join Date
    Apr 2017
    Location
    United Kingdom
    Posts
    65
    Thanks
    52
    Thanked 27 Times in 17 Posts
    Quote Originally Posted by Cyan View Post
    Do I read this correctly, that zstd -22 actually produces worse compression than zstd -21 on this sample?

    We don't see that on enwik9 or silesia, but nonetheless, if that's the case,
    it would be a good reason to launch an investigation into this source file (and any other file featuring the same issue).
    Hi Cyan, I am actually seeing this issue on my small-files corpus too. Attached to this post is an example of a file for which:
    Code:
    prof4d-summertime_2_(2012).mg2 (uncompressed)                                  18,688
    prof4d-summertime_2_(2012).mg2.zstd19 (compressed using zstd -19)               2,744
    prof4d-summertime_2_(2012).mg2.zstd22 (compressed using zstd --ultra -22)       2,770
    (This file contains pixel art for the ZX Spectrum.) I hope it is small enough for you to be able to analyse what the issue is.
    Attached Files

  29. Thanks:

    Cyan (16th January 2020)

  30. #385
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Quote Originally Posted by Cyan View Post
    I would expect some components of enwik10 to show this impact more than others.
    Let's zoom on a practical case to analyze.
    I tried each 1GB enwik10 language separately, but all are as expected:

    zstd -21, zstd -22, enwik10 language
    239,381,519 bytes, 233,554,032 bytes, en
    236,339,891 bytes, 230,156,981 bytes, de
    218,140,700 bytes, 212,437,303 bytes, fr
    166,287,286 bytes, 161,822,491 bytes, ru
    221,889,335 bytes, 216,165,123 bytes, es
    212,661,141 bytes, 207,155,966 bytes, ja
    202,773,602 bytes, 198,084,760 bytes, it
    186,791,399 bytes, 180,868,511 bytes, pl
    253,935,594 bytes, 247,350,494 bytes, zh
    198,331,259 bytes, 193,563,157 bytes, pt
    ---------------------, ---------------------
    2,136,531,726 bytes, 2,081,158,818 bytes

  31. Thanks:

    Cyan (16th January 2020)

  32. #386
    Member
    Join Date
    May 2017
    Location
    United States
    Posts
    9
    Thanks
    3
    Thanked 4 Times in 3 Posts
    > Hi Cyan, I am actually seeing this issue on my small files corpus too.

    I've analyzed your file. Sadly the issue isn't the same as the enwik10 regression.

    For your file size, level 19 has target_length=256 and level 22 has target_length=999. When zstd finds a match of length > target_length, it stops its analysis and just takes the match. In your file, near the beginning (position 511), there is a match of length 261. This causes zstd to emit its sequences and update its stats at level 19, but level 22 keeps going for 4KB before it updates its stats. That means that level 19 gets more accurate stats sooner, so it is able to make better decisions in its parsing.

    This is a weakness in the zstd parser which could be improved. To help combat it, we make 2 passes on the first block; without the 2 passes, the difference between the two levels would be 59 bytes instead of 26 bytes.

    This weakness should only cause small swings in the compressed size of a file, not the 300MB difference we're seeing on enwik10.


  33. Thanks:

    introspec (17th January 2020)

  34. #387
    Member
    Join Date
    May 2017
    Location
    United States
    Posts
    9
    Thanks
    3
    Thanked 4 Times in 3 Posts
    Running zstd in single threaded mode gets the expected compression ratio. Time to figure out what’s going wrong!

    zstd --single-thread --ultra -22 enwik10 -o /dev/null
    enwik10 : 20.80% (10000000000 => 2080479075 bytes, /dev/null)

  35. Thanks:

    Sportman (18th January 2020)

  36. #388
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,532
    Thanks
    755
    Thanked 674 Times in 365 Posts
    Note that multithreading also affects level 21.

  37. Thanks:

    Sportman (18th January 2020)

  38. #389
    Member
    Join Date
    Jan 2015
    Location
    Hungary
    Posts
    10
    Thanks
    11
    Thanked 4 Times in 2 Posts
    Quote Originally Posted by terrelln View Post
    > Hi Cyan, I am actually seeing this issue on my small files corpus too.

    I've analyzed your file. Sadly the issue isn't the same as the enwik10 regression.

    For your file size, level 19 has target_length=256 and level 22 has target_length=999. When zstd finds a match of length > target_length, it stops its analysis and just takes the match. In your file, near the beginning (position 511), there is a match of length 261. This causes zstd to emit its sequences and update its stats at level 19, but level 22 keeps going for 4KB before it updates its stats. That means that level 19 gets more accurate stats sooner, so it is able to make better decisions in its parsing.

    This is a weakness in the zstd parser which could be improved. To help combat it, we make 2 passes on the first block; without the 2 passes, the difference between the two levels would be 59 bytes instead of 26 bytes.

    This weakness should only cause small swings in the compressed size of a file, not the 300MB difference we're seeing on enwik10.

    It would be nice to have a switch/setting that runs without compromise. Of course the runtime would be long, but it would provide the smallest possible size. It would be interesting to know where zstd's limit is.

  39. #390
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    878
    Thanks
    80
    Thanked 315 Times in 219 Posts
    Quote Originally Posted by terrelln View Post
    Running zstd in single threaded mode gets the expected compression ratio.
    Confirmed:

    2,364,340,233 bytes, 5,667.053 sec. - 15.339 sec., zstd -22 --ultra -T1 (v1.4.4)
    2,080,479,075 bytes, 5,257.974 sec. - 15.484 sec., zstd -22 --ultra --single-thread (v1.4.4)

    The zstd v1.4.4 Windows binary help (-h) does not mention the --single-thread argument, only -T:
    -T# : spawns # compression threads (default: 1, 0==# cores)

