Zstandard just received an important update (v0.2),
mainly focused on faster decompression.
https://github.com/Cyan4973/zstd/releases
Last edited by Cyan; 23rd October 2015 at 18:32.
Zstandard just reached v0.4.
The major focus of this release is to provide High Compression modes from command line (they were previously only accessible through API).
All compression levels will notice significant ratio improvements on larger files, with the strongest ones benefiting most.
Source code is available at:
https://github.com/Cyan4973/zstd/releases
There is also a pre-compiled Windows 64 binary available.
It seems that you have lost 20% of decompression speed in v0.4:
Code:
| Compressor              | C.speed  | D.speed  | Compr.size | Ratio |
| zstd_HC v0.3.6 level 1  | 250 MB/s | 529 MB/s |  51230550  | 48.86 |
| zstd_HC v0.3.6 level 2  | 186 MB/s | 498 MB/s |  49678572  | 47.38 |
| zstd_HC v0.3.6 level 3  |  90 MB/s | 484 MB/s |  48838293  | 46.58 |
| zstd_HC v0.3.6 level 4  |  75 MB/s | 474 MB/s |  48423913  | 46.18 |
| zstd_HC v0.3.6 level 5  |  61 MB/s | 467 MB/s |  46480999  | 44.33 |
| zstd_HC v0.3.6 level 6  |  40 MB/s | 477 MB/s |  45723093  | 43.60 |
| zstd_HC v0.3.6 level 7  |  28 MB/s | 480 MB/s |  44803941  | 42.73 |
| zstd_HC v0.3.6 level 8  |  21 MB/s | 475 MB/s |  44511976  | 42.45 |
| zstd_HC v0.3.6 level 9  |  15 MB/s | 497 MB/s |  43899996  | 41.87 |
| zstd_HC v0.3.6 level 10 |  16 MB/s | 493 MB/s |  43845344  | 41.81 |
| zstd_HC v0.3.6 level 11 |  15 MB/s | 491 MB/s |  42506862  | 40.54 |
| zstd_HC v0.3.6 level 12 |  11 MB/s | 493 MB/s |  42402232  | 40.44 |
| zstd v0.4 level 1       | 244 MB/s | 492 MB/s |  51160301  | 48.79 |
| zstd v0.4 level 2       | 176 MB/s | 443 MB/s |  49719335  | 47.42 |
| zstd v0.4 level 3       |  88 MB/s | 422 MB/s |  48749022  | 46.49 |
| zstd v0.4 level 4       |  74 MB/s | 402 MB/s |  48352259  | 46.11 |
| zstd v0.4 level 5       |  69 MB/s | 387 MB/s |  46389082  | 44.24 |
| zstd v0.4 level 6       |  36 MB/s | 387 MB/s |  45525313  | 43.42 |
| zstd v0.4 level 7       |  29 MB/s | 390 MB/s |  44805120  | 42.73 |
| zstd v0.4 level 8       |  23 MB/s | 389 MB/s |  44509894  | 42.45 |
| zstd v0.4 level 9       |  16 MB/s | 402 MB/s |  43892280  | 41.86 |
| zstd v0.4 level 10      |  18 MB/s | 407 MB/s |  43807530  | 41.78 |
| zstd v0.4 level 11      |  15 MB/s | 417 MB/s |  42498160  | 40.53 |
| zstd v0.4 level 12      |  11 MB/s | 406 MB/s |  42394424  | 40.43 |
The differences in decompression speed are related to differences in the instruction alignment decisions made by the compiler.
The issue was mostly observed with gcc on x64 targets, although some other compilers could be affected too. Recent Intel x64 CPUs are the most affected (Sandy Bridge and beyond), since the effect is tied to their instruction fetch hardware implementation.
The main problem is that, from one compiler to another, and depending on version, parameters and system library, instruction alignment will differ for the same source code.
And a modification in an unrelated place of the code will change alignment decisions later on, in the decompression routines, with positive or negative consequences. A real nightmare when trying to optimize and measure.
I haven't found a way to reliably force the compiler to make correct alignment decisions.
`-falign-loops=32` only works on small dense libraries (such as huff0), not for a full program such as `zstd`.
So my second-best option is to generate PGO-assisted builds.
In that case, the compiler correctly detects which loops are important, and must be correctly aligned for speed.
Results tend to be better, but they can unfortunately vary a bit from one build to the next, because the PGO runtime measurements can differ slightly.
Moreover, while PGO-assisted builds can be automated within the Makefile, it's a Makefile-only solution, meaning that programs which integrate the source files directly (like lzbench) will not benefit from it. So it's only a partial solution.
Anyway, zstd 0.4.1 is out, and tries to help with this issue, both by modifying the decompression routine in a way which seems more positive than negative on a bunch of tested platforms, and by proposing a PGO-assisted zstd build.
Regards
I'm not completely sure what you're referring to, but if you're talking about data you declare, IIRC you can use __attribute__((__aligned__(32))) to make sure the data is 32-byte aligned (obviously other values will work too). There might be something in OpenMP if you want something more portable, but I don't know it off the top of my head.
If your concern is really loops and you have aligned data but the compiler doesn't know it, you could use the SIMD support in OpenMP 4.0+ (it doesn't even require linking to openmp; gcc has -fopenmp-simd, I think clang does too). IIRC `#pragma omp simd aligned(variable_name:32)` should do the trick.
If you don't know whether the input data is aligned or not, I think the best you're going to be able to do is have both versions and a runtime check + branch.
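To illustrate the first two suggestions, here is a minimal, untested C sketch (the buffer and function names are just placeholders, not anything from zstd); it assumes gcc or clang with -fopenmp-simd:
Code:
#include <stddef.h>

/* data object forced onto a 32-byte boundary via the aligned attribute
   (unused here, only to show the attribute itself) */
static unsigned char scratch[4096] __attribute__((__aligned__(32)));

/* the OpenMP 4 simd pragma declares both pointers 32-byte aligned,
   so the compiler may emit aligned SIMD loads/stores for this loop */
void add_arrays(float *a, const float *b, size_t n)
{
    size_t i;
    #pragma omp simd aligned(a, b : 32)
    for (i = 0; i < n; i++)
        a[i] += b[i];
}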
FWIW OpenMP 4's SIMD stuff is really quite cool. I learned it a few years ago, but never really had a use for it and ended up forgetting most of it!
Regarding "if you're talking about data you declare, IIRC you can use __attribute__((__aligned__(32)))": unfortunately, this is not the issue.
The problem is an instruction alignment property, a much rarer category. It only shows up when the algorithm is so densely optimized that the hardware instruction prefetcher becomes the bottleneck.
There is a fairly good description of the issue here:
http://pzemtsov.github.io/2014/05/12...rformance.html
Ah, interesting problem to have. Thanks for the link.
Have you tried putting __attribute__((__aligned__(32))), and maybe __attribute__((__hot__)), on individual functions? AFAIK there is no way to specify attributes on loops, but it seems like putting them on functions would be better than nothing. You could also try putting the loops in their own (static, maybe static inline) functions and adding the attributes there... If you do that, there are also the pure, and maybe const, attributes.
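A quick sketch of that last idea, in case it helps (the names are purely illustrative, not actual zstd code); note that __aligned__ on a function only aligns its entry point, not necessarily the loop inside:
Code:
#include <stddef.h>

/* hot decode loop isolated in its own small function: __hot__ marks it as
   frequently executed, __aligned__(32) requests 32-byte alignment of the
   function's code */
__attribute__((__hot__, __aligned__(32)))
static size_t decode_loop(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)   /* placeholder body */
        dst[i] = (unsigned char)(src[i] ^ 0x5A);
    return n;
}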
Last edited by nemequ; 2nd December 2015 at 05:26.
Good point!
Thanks nemequ, I didn't know about the __hot__ attribute.
Interesting to test...
v0.4.1 works great for me:
Code:
Compressor               C.speed     D.speed    Compr.size  Ratio
zstd_HC v0.3.6 level 1   263 MB/s    544 MB/s   51230550    48.86
zstd_HC v0.3.6 level 2   195 MB/s    525 MB/s   49678572    47.38
zstd_HC v0.3.6 level 3    97 MB/s    513 MB/s   48838293    46.58
zstd_HC v0.3.6 level 4    81 MB/s    508 MB/s   48423913    46.18
zstd_HC v0.3.6 level 5    72 MB/s    489 MB/s   46480999    44.33
zstd v0.4.1 level 1      256 MB/s    568 MB/s   51160301    48.79
zstd v0.4.1 level 2      186 MB/s    531 MB/s   49719335    47.42
zstd v0.4.1 level 3       95 MB/s    521 MB/s   48749022    46.49
zstd v0.4.1 level 4       78 MB/s    514 MB/s   48352259    46.11
zstd v0.4.1 level 5       77 MB/s    484 MB/s   46389082    44.24
zstd v0.4.1 level 6       39 MB/s    493 MB/s   45525313    43.42
zstd v0.4.1 level 7       34 MB/s    501 MB/s   44805120    42.73
zstd v0.4.1 level 8       25 MB/s    509 MB/s   44509894    42.45
zstd v0.4.1 level 9       17 MB/s    525 MB/s   43892280    41.86
zstd v0.4.1 level 10      19 MB/s    524 MB/s   43807530    41.78
zstd v0.4.1 level 11      16 MB/s    521 MB/s   42498160    40.53
zstd v0.4.1 level 12      13 MB/s    525 MB/s   42394424    40.43
zstd v0.4.1 level 13      10 MB/s    527 MB/s   42321163    40.36
zstd v0.4.1 level 14      10 MB/s    529 MB/s   42286879    40.33
zstd v0.4.1 level 15     8.79 MB/s   514 MB/s   42258368    40.30
Why do you have to apply it to the full program? Have you tried compiling just the file that needs it with that option?
Recent versions of gcc also have the attribute "optimize", which lets you set optimization options for single functions:
optimize
The optimize attribute is used to specify that a function is to be compiled with different optimization options than specified on the command line. Arguments can either be numbers or strings. Numbers are assumed to be an optimization level. Strings that begin with O are assumed to be an optimization option, while other options are assumed to be used with a -f prefix. You can also use the "#pragma GCC optimize" pragma to set the optimization options that affect more than one function. See Function Specific Option Pragmas, for details about the "#pragma GCC optimize" pragma.
This can be used for instance to have frequently-executed functions compiled with more aggressive optimization options that produce faster and larger code, while other functions can be compiled with less aggressive options.
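In concrete terms, applying that to a single function would look roughly like the sketch below (the function is a made-up placeholder, not zstd code). Per the docs quoted above, a string such as "align-loops=32" is forwarded as -falign-loops=32; whether gcc actually honors it per-function is a separate question:
Code:
#include <stddef.h>

/* per-function optimization options (gcc >= 4.4): "O3" sets the level,
   "align-loops=32" is passed through as -falign-loops=32 for this function only */
__attribute__((optimize("O3", "align-loops=32")))
void hot_path(unsigned char *p, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        p[i]++;
}

/* alternative: affect every function defined after this point in the file
   #pragma GCC optimize ("O3", "align-loops=32")
*/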
Why do you have to apply it to the full file? Why not just the function?
This has the major advantage that people embedding zstd in their projects benefit from it for free. It also keeps the build system simpler which, again, benefits people embedding zstd. IMHO it's usually best to try to keep the build system out of it if you can... pre-defined macros to detect the compiler, OS, or libc, coupled with a bit of compiler-specific magic (pragmas, attributes, etc.) usually let you keep things portable without embedding too much knowledge in your build system.
Last time I checked, the optimization options you could provide using the optimize attribute were pretty limited. More than once I've had to resort to creating separate files to be able to choose the right optimized version of a function at runtime. OTOH, when it works it is quite nice.
The only possible ways I see to apply -falign-loops=32 at lower than file level are the optimize function attribute and #pragma GCC optimize. They don't seem to have existed prior to gcc 4.4.7. If they work, and you don't mind adding the compiler dependency, then go for it.
OTOH, the command-line option seems more fool-proof.
Regarding "Last time I checked, the optimization options you could provide using the optimize attribute were pretty limited": I haven't tried it, so I can't promise it will work.
I've tried to use align-loops at the pragma optimize level,
but it doesn't work (no effect).
Indeed, function level optimization control would be ideal.
I have two small Windows .bat scripts for batch compression/decompression of files to/from the zst format.
Should I post them here with details?
Last edited by Vanfear; 3rd December 2015 at 14:56.
Sure.
Do you mean you are looking for a batch compression/decompression feature from zstd?
Last edited by Cyan; 3rd December 2015 at 17:08.
Yep.
zstdPack.bat
Code:
@for /f "tokens=1" %%k in ('dir "R:\MyFiles\" /a-d-h-s /b') do (zstd.exe -20 "R:\MyFiles\%%k" "R:\Compressed\%%k.zst")
@pause > nul

zstdUnpack.bat
Code:
@for /f "tokens=1" %%d in ('dir "R:\Compressed\*.zst" /a-d /b') do (zstd.exe -d "R:\Compressed\%%d" "R:\Unpacked\%%~nd")
@pause > nul

Notes:
1. zstd.exe does not process folders(?), so folder processing is excluded from the batch scripts.
2. If the destination folder does not exist, zstd.exe will return an error (code 13). Create it manually or via the batch script.
3. The -d-h-s switches mean that directories, hidden files and system files are excluded from processing.
Files in subfolders are excluded from processing by default. They can be included with a small correction to the packing code:
zstdPackInclSubfiles.bat
Code:
@for /f "tokens=1" %%k in ('dir "R:\MyFiles\" /a-d-h-s /b /s') do (zstd.exe -20 "%%k" "R:\Compressed\%%~nxk.zst")
@pause > nul

But note: in this case there may be filename conflicts in the destination folder, between files from the root folder and files from subfolders.
Last edited by Vanfear; 4th December 2015 at 17:48.
> zstd.exe does not process folders(?), so folder processing is excluded from the batch scripts.
Correct.
What about transforming `directory` into `directory/*` (potentially with recursion)?
Your script looks pretty good to me
What I could do to help, from zstd command line utility, is to allow the processing of multiple filenames.
This should speed up processing when there is a large amount of small files to compress/decompress,
because it would avoid creating / freeing a process for each file.
That being said, directories would still remain out of scope, so your scripts would still be needed to handle this scenario.
I found a solution that mirrors the source subfolders to the destination, so zstd.exe now works like an archiver with basic functions.
I also found a mistake in all the scripts: files with spaces in their names were processed incorrectly, which resulted in filename conflicts when processed by zstd.exe.
I am correcting the code now, and will update this post later today.
Update:
zstdPack4.bat
xcopy "R:\MyFiles" "R:\Compressed\MyFiles\" /t /e /h *
for /f "delims=" %%k in ('dir "R:\MyFiles\" /a-d-s /b /s') do (zstd.exe -20 "%%k" "R:\Compressed%%~pnxk.zst") **
* - Creates the structure of the source folder and its subfolders in the destination folder. Hidden folders are included (switch /h). The destination folder must have the same name as the source folder.
** - Sends all files from the source folder and its subfolders to zstd.exe for processing. The -d switch keeps folders from being sent to zstd.exe, -s excludes system files (hidden files, h, are included and need not be specified), and /s includes files from subfolders.
The %%~pnxk.zst expression appends the files and subfiles from the source folder to the destination path while preserving their relative paths:
(R:\Compressed%%~pnxk.zst = R:\Compressed\MyFiles\And possibly subfolders\name of source file(s)*.zst).
Code:
@xcopy "R:\MyFiles" "R:\Compressed\MyFiles\" /t /e /h
@for /f "delims=" %%k in ('dir "R:\MyFiles\" /a-d-s /b /s') do (zstd.exe -20 "%%k" "R:\Compressed%%~pnxk.zst")
@pause > nul

zstdUnpack4.bat
xcopy "R:\Compressed\MyFiles" "R:\Unpacked\Compressed\MyFiles\" /t /e /h
for /f "delims=" %%d in ('dir "R:\Compressed\MyFiles\*.zst" /a-d-s /b /s') do (zstd.exe -d "%%d" "R:\Unpacked%%~pnd") *
* - The subfolders cannot be excluded from the destination path, so the destination folder gets the specified path plus part of the source folder path. This is not a problem and does not affect processing, but it is not ideal on principle.
Code:
@xcopy "R:\Compressed\MyFiles" "R:\Unpacked\Compressed\MyFiles\" /t /e /h
@for /f "delims=" %%d in ('dir "R:\Compressed\MyFiles\*.zst" /a-d-s /b /s') do (zstd.exe -d "%%d" "R:\Unpacked%%~pnd")
@pause > nul

Notes:
v2
* Scripts now process folders, subfolders and subfiles.
* Destination folders are created automatically by the scripts.
* Fixed processing of files with spaces in their names ("tokens=1" replaced with "delims=").
v3
* Corrected the source path mask example in the unpacking script (line 1) to prevent picking up nearby folders from the source path when the unpacking script starts.
v4
* Corrected the source path mask example in the unpacking script (line 2) to prevent processing files that may be located in the same directory as the source folder.
If you want to use variables for the paths, you can define them yourself.
Example with path variables:
Code:
set sour1=R:\MyFiles
set sour2=R:\MyFiles\
set dest1=R:\Compressed\MyFiles\
set dest2=R:\Compressed
xcopy "%sour1%" "%dest1%" /t /e /h
for /f "delims=" %%k in ('dir "%sour2%" /a-d-s /b /s') do (zstd.exe -20 "%%k" "%dest2%%%~pnxk.zst")
pause
Updated version of the scripts, bundled with zstd.exe.
Last edited by Vanfear; 8th December 2015 at 16:58.
Actually there is one easier way to do this :P
Simply download the lz4 installer for Windows from Cyan's blog and install it, rename zstd to lz4, and use it to compress.
Place the renamed zstd in the program folder.
Hope it helps.
Enjoy!
zstd -20 decompression fails for silesia/mozilla and silesia/webster with Error 36 : Decoding error : ZSTD_error_corruption_detected
-1 and -9 are OK.
I built from 0.4.2 source for 32 bit Windows Vista, gcc 4.8.1 using "make CC=gcc"
The 32-bit version cannot decode files compressed with -20;
it's limited to -19.
(I probably should do something to make this setting less accessible, or at least trigger some kind of warning).
That being said, I would have expected a clearer error message than ZSTD_error_corruption_detected.
It could be that I only introduced the clearer error message within 0.4.3 ...
zstd -18 and -19 give the same error as -20. Also, with -16 and -17 the decompressed output is not identical for nci and webster. -10, -14, and -15 work correctly.
Edit: updated the Silesia benchmark.
Code:
Silesia   dicke mozil   mr   nci ooff osdb reym samba  sao webst x-ray  xml  Compressor -options
--------- ----- ----- ---- ----- ---- ---- ---- ----- ---- ----- ----- ---- -------------------
58301860   3197 16668 3283  2155 2851 3284 1613  4286 5196  9942  5309  512 zstd 0.4.2 -15
60459254   3351 17156 3360  2360 2891 3380 1724  4503 5228 10595  5368  537 zstd 0.4.2 -9
73726198   4279 20155 3833  2875 3584 3776 2168  5569 6254 13748  6772  707 zstd 0.4.2 -1
74570081   4134 20775 3792  3444 3627 3727 2140  5723 6146 13515  6761  780 zstd (original release)
Last edited by Matt Mahoney; 8th December 2015 at 06:16.
Thanks for the report, Matt.
Finally reproduced it.
There must be an issue specific to 32-bit builds on Windows.
The 32-bit binary is regularly tested by the continuous integration suite
(see for example the 32-bit tests for v0.4.2: https://travis-ci.org/Cyan4973/zstd/jobs/94409047)
and everything seems to work fine, but these tests run on Linux.
I couldn't reproduce the issue there.
Since MinGW doesn't work on my current Windows workstation, I used Visual Studio instead.
"Fortunately", the binary produced by 32-bit Visual does indeed produce wrong compressed files at -16 and higher.
So it's reproducible.
The 64-bit binary doesn't have this problem.
Let's investigate ...
For those interested in asm optimizations: http://users.atw.hu/instlatx64/HSWvsBDWvsSKL.txt lists all the instructions whose latency/throughput changed over the last 3 CPU generations. It's just the information from the Intel optimization manual, but with all the unchanged lines filtered out. I think Skylake has a single AVX-512 unit combined from pipes 0 and 1, which is why most AVX-256 computation instructions can now be issued from both of these ports, but not from port 5 (unlike previous designs).
If attribute((hot)) does the trick, I don't see the reason to create anything else.
Regarding "Maybe someone should file a bug report/feature request in gcc to get this working": there's a request opened here:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67435
No formal investigation has started so far:
"These values are normally strait out of the Vendors manuals."
Last edited by Cyan; 12th December 2015 at 16:10.
A Windows 7z alternative: compression using Cyan's zstd v0.4.3 and srep 3.93.
Before I get misunderstood: all credit for this goes to Cyan and Bulat, for zstd and srep.
The code and reg files all come from the lz4 installation for Windows on Cyan's blog;
the only thing I have done is pipeline srep with zstd and make some modifications to the code and reg files.
Why replace 7z with this? Well, find out!