Well, zlib is supposed to be installed in each Linux dev environment.
Build with:
Code:
make
To use the system zlib, install zlib-dev with:
Code:
apt-get install zlib1g-dev
and build TurboBench with:
Code:
make HAVE_ZLIB=1
TurboBench binaries for Windows and Linux updated with the latest compressor versions.
You can find the source code on GitHub: TurboBench
Tested turbobench and lzturbo and concluded that:
1) Lzturbo 1* uses the lz4 format (probably different source code and possibly not binary compatible, but the same basic format: |4-bit lit|4-bit ml|16-bit offset|) - see the token-parsing sketch after this list.
On my machine the performance is similar, with only a very small improvement in decompression speed.
2) TurboANX is slower on my machine than advertised (only about 400 MB/s decompression speed, a bit slower than rans_static64c).
3) TurboHF is very interesting. It is indeed very fast, faster than huff0 from zstd.
4) lzturbo 3* is indeed fast (as advertised) and uses TurboHF, not TurboANX.
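For illustration, a minimal sketch of decoding one token of an lz4-style format as described in 1). The nibble assignment, the +4 minimum match length and the omission of the extended-length (0xf) cases are simplifying assumptions, not lzturbo's or lz4's actual code.
Code:
#include <stdint.h>
#include <string.h>

/* Decode one |4-bit lit|4-bit ml|16-bit offset| token (sketch only; the
 * extended-length case when a nibble equals 0xf is not handled here).   */
static const uint8_t *decode_token(const uint8_t *ip, uint8_t **opp)
{
    uint8_t *op  = *opp;
    uint8_t  tok = ip[0];
    uint32_t lit = tok >> 4;          /* 4-bit literal run length (assumed nibble order) */
    uint32_t ml  = (tok & 0xf) + 4;   /* 4-bit match length, +4 assumed minimum match    */
    uint16_t off;
    memcpy(&off, ip + 1, 2);          /* 16-bit offset, little-endian assumed            */
    ip += 3;

    memcpy(op, ip, lit);              /* copy literals                                   */
    ip += lit;
    op += lit;

    const uint8_t *match = op - off;  /* copy the match from already-decoded output      */
    while (ml--) *op++ = *match++;    /* byte-by-byte copy also handles overlaps         */

    *opp = op;
    return ip;                        /* position of the next token                      */
}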
Details about TurboHF:
It is a 12-bit depth-limited Huffman coder and, contrary to what I expected, it does not decode multiple symbols at once when their code lengths are small, as huff0 from zstd does.
As I understand it, it uses 4 interleaved streams to remove RAW (read-after-write) dependencies, and each stream is 4x unrolled (so 16x unrolled in total) to make better use of the bit I/O.
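A rough sketch of that interleaving idea (illustrative only, not TurboHF's actual code; the table layout, refill strategy and bit order are assumptions): four independent bit readers are advanced round-robin, so the table lookup of one stream does not have to wait for the shift/refill of the previous one.
Code:
#include <stddef.h>
#include <stdint.h>

typedef struct { const uint8_t *p; uint64_t acc; int bits; } bitreader;

/* Refill so at least 12 bits are available (12-bit depth-limited codes). */
static inline void br_refill(bitreader *br) {
    while (br->bits <= 56) { br->acc |= (uint64_t)*br->p++ << br->bits; br->bits += 8; }
}

/* One decode step: peek 12 bits, look up symbol (low 8 bits of the entry)
 * and code length (high bits), then consume that many bits.              */
static inline uint8_t br_decode(bitreader *br, const uint16_t tab[1 << 12]) {
    uint16_t e = tab[br->acc & 0xfff];
    br->acc >>= (e >> 8);
    br->bits -= (e >> 8);
    return (uint8_t)e;
}

/* Decode 4 symbols per iteration, one from each of 4 interleaved streams. */
static void decode_block(bitreader br[4], const uint16_t tab[1 << 12],
                         uint8_t *out, size_t n /* multiple of 4 */) {
    for (size_t i = 0; i < n; i += 4) {
        br_refill(&br[0]); br_refill(&br[1]); br_refill(&br[2]); br_refill(&br[3]);
        out[i + 0] = br_decode(&br[0], tab);
        out[i + 1] = br_decode(&br[1], tab);
        out[i + 2] = br_decode(&br[2], tab);
        out[i + 3] = br_decode(&br[3], tab);
    }
}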
Has anyone managed to compile turbobench under xcode 7.3 on a Mac lately? It looks like it once worked on a mac but right now it seems to hit a number of compile issues. I'm following the instructions in README.md. Before I go sending pull requests, I wanted to see if this was known or anyone had encountered these issues already.
I have not been able to compile turbobench on my Mac even once...
TurboBench was tested with clang and gcc on Linux (Ubuntu), and with MinGW and Visual Studio on Windows. I have made some clang 3.8 compatibility changes today.
It would be nice if someone could test TurboBench on a Mac and report the results for LZFSE, zlib, ...
LZFSE is included in TurboBench but not tested.
GitHub TurboBench: new modules and compressors updated to the latest versions:
- new: xpack
- Almost all current ANS implementations, including FSE, are in general not better than Huffman coding.
The only exception is TurboANX.
See: Entropy Coder Benchmark, where TurboHF is not only faster than the other ANS coders, but also compresses better.
There is no sense (except advertising) in using "Asymmetric Numeral Systems" when the implementations are worse than Huffman coding for the most popular distributions (like those found in lz77).
For some time ANS was faster than Huffman coding, but as demonstrated in TurboHF, Huffman coding can now be a lot faster than ANS.
Of course, you can also find some distributions where lz77+ANS compresses better.
All other ANS implementations, including FSE and rans_static64c, are not comparable to TurboANX in terms of compression ratio.
Unlike "rans_static64c" (Intel 64-bit CPUs only), TurboANX is binary compatible across 32- and 64-bit CPUs, with or without SIMD.
The numbers presented in my benchmark are real numbers, not simple advertising, and were obtained using a desktop CPU (overclocked i7-2600k Sandy Bridge).
Currently, TurboANX is slower than expected in AVX2 CPU mode, because the AVX2 implementation was done without having the hardware for testing.
There is an option "-Vn" in TurboBench to override the automatic CPU detection in TurboANX (n=7 for Sandy Bridge mode on newer CPUs).
dnd, I don't understand you. I said what I saw. Anyway, new results:
C Size ratio% C MB/s D MB/s Name File
95093583 44.9 464.35 1575.71 lzturbo 20 silesia.tar
68245420 32.2 172.86 640.05 zstd 3 silesia.tar
66412086 31.3 118.29 623.29 zstd 4 silesia.tar
66033349 31.2 145.72 890.30 lzturbo 31 silesia.tar
129391464 61.0 595.54 889.71 TurboHF silesia.tar
101195229 47.7 503.51 2409.05 lzturbo 10 silesia.tar
100881157 47.6 511.12 2494.07 lz4 1 silesia.tar
128782035 60.8 293.44 390.58 TurboANX(def) silesia.tar
128782035 60.8 354.82 508.11 TurboANX -V7 silesia.tar
128782035 60.8 225.02 402.38 TurboANX -V1 silesia.tar
129979386 61.3 332.84 501.59 fse silesia.tar
130198150 61.4 279.99 458.31 rans_static64c silesia.tar
So lzturbo is slower than lz4 (nowhere near the 3650 MB/s you claim on your website, and not "compress better and faster, decompress up to 2x! faster than lz4").
TurboANX max decompression speed is 508 MB/s here, much slower than your result of 759 MB/s (if we scale by fse: 529 MB/s on your website vs 501 MB/s here), and fse is much more portable than your TurboANX.
TurboANX has a slightly higher compression ratio (if we assume the results are not cheated), possibly because it uses a smaller block size.
lzturbo 1* decompression loop - clearly the lz4 format, as seen from the first 6 lines:
Code:
 60:  movzx  ecx,BYTE PTR [rax]
      lea    r12,[rax+0x3]
      movzx  ebx,WORD PTR [rax+0x1]
      mov    edx,ecx
      shr    ecx,0x4
      and    edx,0xf
      mov    rsi,rbx
      mov    ebp,ecx
      movzx  r11d,dl
      cmp    dl,0xf
      je     258
 85:  cmp    ebp,0xf
      je     221
      vmovdq xmm0,XMMWORD PTR [r12]
      mov    edx,ebp
      lea    rax,[r12+rdx*1]
      add    rdx,r8
      vmovup XMMWORD PTR [r8],xmm0
 a2:  mov    ecx,r9d
      bsr    ecx,ebp
      add    ecx,0x1
      movsxd rcx,ecx
      add    DWORD PTR [rcx*4+0x8375b00],0x1
      mov    ecx,r9d
      bsr    ecx,r11d
      add    ecx,0x1
      movsxd rcx,ecx
      add    DWORD PTR [rcx*4+0x837d380],0x1
      mov    ecx,r9d
      bsr    ecx,ebx
      add    ecx,0x1
      movsxd rcx,ecx
      add    DWORD PTR [rcx*4+0x8375a60],0x1
      cmp    ebx,0x7
      jbe    1a1
      mov    rcx,rdx
      lea    r8d,[r11+0x4]
      sub    rcx,rsi
      add    r8,rdx
      nop
      nop
100:  mov    rsi,QWORD
      add    rdx,0x10
      add    rcx,0x10
      mov    QWORD PTR [rdx-0x10],rsi
      mov    rsi,QWORD
      mov    QWORD PTR [rdx-0x8],rsi
      cmp    r8,rdx
      ja     100
11c:  movzx  edx,BYTE PTR [rax]
      lea    rbx,[rax+0x3]
      movzx  ebp,WORD PTR [rax+0x1]
      mov    ecx,edx
      shr    edx,0x4
      and    ecx,0xf
      mov    rsi,rbp
      movzx  r11d,cl
      cmp    cl,0xf
      je     354
13f:  cmp    edx,0xf
      je     321
      vmovdq xmm0,XMMWORD PTR [rbx]
      lea    rax,[rbx+rdx*1]
      add    rdx,r8
      vmovup XMMWORD PTR [r8],xmm0
158:  cmp    ebp,0x7
      jbe    290
      lea    r8d,[r11+0x4]
      mov    rcx,rdx
      sub    rcx,rsi
      add    r8,rdx
      nop
      nop
180:  mov    rsi,QWORD
      add    rdx,0x10
      add    rcx,0x10
      mov    QWORD PTR [rdx-0x10],rsi
      mov    rsi,QWORD
      mov    QWORD PTR [rdx-0x8],rsi
      cmp    r8,rdx
      ja     180
      jmp    60
lzturbo 2* decoding loop:
Code:
 70:  lea    rbx,[rax+0x2]
      mov    ecx,edx
      mov    r11d,edx
      shr    ecx,0x4
      and    r11d,0x7
      and    ecx,0x3
      cmp    r11d,0x7
      je     1a0
      cmp    ecx,0x3
      je     218
      mov    eax,DWORD
      mov    DWORD PTR [rsi],eax
      add    rsi,rcx
 9e:  lea    rax,[rbx+rcx*1]
      shr    edx,0x6
      mov    rbx,rsi
      and    edx,0x3ff
      mov    ecx,edx
      sub    rbx,rcx
      mov    rcx,rbx
      cmp    edx,0x7
      jbe    250
 bf:  mov    rdx,QWORD
      mov    QWORD PTR [rsi],rdx
      movzx  edx,WORD PTR [rcx+0x8]
      mov    WORD PTR [rsi+0x8],dx
      lea    edx,[r8+r11*1]
      add    rsi,rdx
 d4:  prefet BYTE PTR [rax+0x200]
 db:  mov    edx,DWORD
      test   dl,0x8
      je     70
      mov    r10d,edx
      mov    r11d,edx
      shr    r10d,0x5
      and    r11d,0x7
      and    r10d,0x7
      test   dl,0x10
      jne    2ab
      lea    rcx,[rax+0x3]
      shr    edx,0x8
      movzx  edx,dx
      cmp    r11d,0x7
      je     2bc
111:  cmp    r10d,0x7
      je     411
      mov    rax,QWORD
      mov    QWORD PTR [rsi],rax
      add    rsi,r10
124:  lea    rax,[rcx+r10*1]
      mov    rbx,rsi
      mov    ecx,edx
      sub    rbx,rcx
      mov    rcx,rbx
      cmp    edx,0x7
      ja     bf
You may wish to test the latest rANS_static4k.c from https://github.com/jkbonfield/rans_static. (Actually there are a couple of versions in there using different methods, which will vary in performance based on the number of symbols used and the size of your L1 cache.) This is 32-bit and 64-bit compatible, with no SIMD other than what is auto-generated by the compiler.
This is derived from the 32-bit rANS_static4c.c and is binary compatible in output, but MUCH faster for order-0 and somewhat faster at order-1 decoding. I also have tweaked the 16-bit renormalisation version (rANS_static4_16), but this isn't a primary focus as we need the 23+8 bit renorm code for the CRAM file format. Both of these are substantially faster than the (unmodified) rans_static64c code. I also took the opportunity to fix the API so it's possible to supply a buffer rather than using mallocs. I've been meaning to make a PR for you, but haven't had the time.
My benchmarks are on a Haswell 3.2GHz i5 and IIRC were coming in around 600MB/s on enwik9. I've no idea what it'd be on a 4.5GHz i7, but my home system isn't far off so I ought to see if it builds under windows (I've never tried).
Using a VM of Ubuntu Wily (15.10) rans_static4_16i was getting ~790MB/s decode on enwik8 order 0 and ~450MB/s order 1. I used enwik8 simply because this is a VM with only 2Gb of memory and my test harness keeps before and after copies for validation. I ought to try it without the VM overhead on my main system. Host is a 4.0GHz i7.
TurboBench running on the same VM (cut and paste is borked, grr) shows turboANX at 412MB/s encode and 632MB/s decode. FSE is very similar. Of course the benchmarking frameworks may not be identical (one block repeatedly used or all loaded at once? warm up of memory or not? etc) and likely block size differs too. As expected though, TurboHF and FSEhuf are significantly quicker.
On my work 3.2GHz i5, using the downloaded prebuilt turbobench tool:
Code:
@ deskpro107386[tmp/TurboBench]; turbobench_linux64/turbobenchs -b256k -eECODER /tmp/enwik9
Benchmark: 1 from 3
  637756262  63.8   634.83   946.03   TurboHF        enwik9
  636087252  63.6   309.03   408.53   TurboANX       enwik9
  620516416  62.1    53.06    47.03   TurboRC        enwik9 ERROR at 19136511:65, 64
  469202416  46.9    48.07    37.46   TurboRC_o1     enwik9 ERROR at 6815743:6c, 2f
  624329244  62.4    41.77    26.42   TurboAC_byte   enwik9 ERROR at 25165823:67, 66
  647597480  64.8   176.05   134.80   arith_static   enwik9
  ...
On a build using the default gcc install (old, 4.8.4) with the newer ans_jb updates (https://github.com/jkbonfield/TurboBench). I guess it should be a submodule, but it isn't yet as it would still need function renaming etc.
Code:
@ deskpro107386[tmp/TurboBench]; ./turbobench -b256k -earith_static,rans_static4c,rans_static4k,rans_static4_16i,rans_static4_16io1,fse /tmp/enwik9
Benchmark: 1 from 3
  647597480  64.8   178.44   119.66   arith_static         enwik9
  639846192  64.0   227.91   308.04   rans_static4c        enwik9
  639846192  64.0   240.97   567.20   rans_static4k        enwik9
  639792007  64.0   305.46   622.59   rans_static4_16i     enwik9
  542207386  54.2   128.62   265.02   rans_static4_16io1   enwik9
  639048689  63.9   357.83   530.42   fse                  enwik9
The difference in the speed of arith_static between your binary and my binary is likely due to you using either a better compiler/version or more aggressive optimisation options, so perhaps your build would have the other figures higher too.
What this shows is that Huffman encoding is indeed rocking the speed chart, but there are faster ANS implementations out there than the TurboAN* code right now for decoding, while being comparable at encoding. For what it's worth, the relative complexity of tANS vs rANS implies tANS really should be quicker. I suspect there are similar tweaks (manually crafting assembly to add cmovs and more careful instruction interleaving) that would give further speed improvements to tANS (e.g. FSE).
I'm not sure whether -b is having an effect here either; it is probably being overridden. For the 4_16i O1 variant I was deliberately being poor on O1 compression in favour of speed, using a very low resolution counter to keep cache hits up.
Edit: Testing on other hosts shows my code is rather prone to large tables causing cache misses, while FSE and ANX are less so. Particularly odd is how ANX is poorer on my i5 than on an old Xeon. Maybe the JIT is picking the wrong code?
The Opteron has 16Kb L1 and 2Mb shared L2.
Code:
AMD Opteron(tm) Processor 6378 @ 2.40GHz
  16817380  23.0  262.36  358.64  TurboANX        q8
  16831548  23.0  232.68  287.33  fse             q8
  16855906  23.1  199.83  230.65  rans_static4k   q8
Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
  16817380  23.0  301.29  531.22  TurboANX        q8
  16831548  23.0  236.45  347.57  fse             q8
  16855906  23.1  229.14  379.87  rans_static4k   q8
Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
  16817380  23.0  331.94  428.45  TurboANX        q8
  16831548  23.0  363.91  511.13  fse             q8
  16855906  23.1  297.14  620.65  rans_static4k   q8
The Xeon has 32K L1 and 256K L2.
The i5 also has 32K L1 and 256K L2, but the L1 is duplicated for instruction and data cache (32k each).
This benchmark shows again that there is a big discrepancy between
the "kraken" benchmark at cbloom.com and turbobench.
All packages w/ latest version
Code:
enwik7
C Size    ratio%   C MB/s    D MB/s   Name            (bold = pareto) MB=1.000.0000
  2707078   27.1      0.49    371.76  brotli 11
  2773044   27.7      1.60    251.95  lzham 4
  2802712   28.0      2.17    621.69  zstd 22
  2809777   28.1      1.71    909.22  lzturbo 39
  3144589   31.4      1.14    726.55  xpack 9
  3420149   34.2      1.76   1020.53  lzturbo 29
  3550544   35.5      7.94    557.44  libdeflate 12
  3688450   36.9     16.30    240.97  zlib 9
  3908554   39.1      8.94   3124.52  lzsse8 17
  4227421   42.3      2.06   3039.96  lzturbo 19
  4256269   42.6     24.96   2293.85  lz4 9
 10000004  100.0  12107.88  12484.27  memcpy

mozilla file from silesia corpus
 14225515   27.8      0.31    281.68  brotli 11
 14317035   28.0      1.94    207.16  lzham 4
 14894572   29.1      1.97    773.80  lzturbo 39
 15198580   29.7      2.63    599.40  zstd 22         (2.24 x faster than zlib)
 15846370   30.9      0.54    550.90  xpack 9
 17718126   34.6      2.17   1270.07  lzturbo 29
 18308584   35.7      8.24    571.42  libdeflate 12
 19044400   37.2      6.68    268.06  zlib 9
 22009860   43.0      2.50   3614.19  lzturbo 19
 22092113   43.1     39.80   2614.49  lz4 9
 22135471   43.2      1.74   3141.94  lzsse8 17
 51220484  100.0   9367.36   9473.23  memcpy
Code:
included the test from the kraken benchmark
enwik7
lzma       : 3.64:1 ,  1.8 enc mb/s ,   79.5 dec mb/s
lzham      : 3.60:1 ,  1.4 enc mb/s ,  196.5 dec mb/s
zstdmax    : 3.56:1 ,  2.2 enc mb/s ,  394.6 dec mb/s
Kraken 220 : 3.51:1 ,  1.4 enc mb/s ,  702.8 dec mb/s
Kraken 215 : 3.49:1 ,  1.5 enc mb/s ,  789.7 dec mb/s
zlib9      : 2.38:1 , 22.2 enc mb/s ,  234.3 dec mb/s
lz4hc      : 2.35:1 , 27.5 enc mb/s , 2059.6 dec mb/s

silesia_mozilla
lzma       : 3.88:1 ,  2.0 enc mb/s ,   63.7 dec mb/s
Kraken 220 : 3.60:1 ,  1.1 enc mb/s ,  896.5 dec mb/s
lzham      : 3.56:1 ,  1.5 enc mb/s ,  186.4 dec mb/s
Kraken 215 : 3.51:1 ,  1.2 enc mb/s ,  928.0 dec mb/s
zstdmax    : 3.24:1 ,  2.8 enc mb/s ,  401.0 dec mb/s   (only 1.38 x faster than zlib!)
zlib9      : 2.51:1 , 12.4 enc mb/s ,  291.5 dec mb/s
lz4hc      : 2.32:1 , 36.4 enc mb/s , 2351.6 dec mb/s
BTW I did some investigation upon Yann's prodding and think I can explain this discrepancy.
See http://www.cbloom.com/rants.html for details
This is what I get with TurboBench :
TurboBench :
C Size ratio% C MB/s D MB/s Name File
10403232 42.1 3.73 559.44 zstd 22 lzt99
10469389 42.4 3.16 863.11 lzturbo 39 lzt99
13063248 52.9 5.81 270.44 zlib 9 lzt99
On the same hardware : (Core i7-3770 3.4 GHz)
zlib dll : 309.12 mb/s
zstdmax MSVC 2005: 422.20 mb/s
lz4hc MSVC 2005 : 2600.32 mb/s
miniz MSVC 2012 : 318.18 mb/s
zstd MSVC 2012 : 426.37 mb/s
lz4hc MSVC 2012 : 2680.92 mb/s
miniz gcc-4.7.2 -O2 : 289.48 mb/s
zstd gcc-4.7.2 -O2 : 368.74 mb/s
brotli9 gcc-4.7.2 -O2 : 264.30 mb/s
lz4hc gcc-4.7.2 -O2 : 1755.57 mb/s
brotli9 0.4.0 GCC 5.3.0 : 262 MB/s
zstd 0.7.1 -99 GCC 5.3.0 : 560 MB/s
zlib 1.2.8 -9 GCC 5.3.0 : 271 MB/s
lz4hc r131 -99 GCC 5.3.0 : 2801 MB/s
TurboBench results are about the same as lzbench
TurboBench seems to be built with GCC 5.2
I think there are two factors in the different (ZStd/zlib) speed ratio:
1. ZStd seems to be very sensitive to the compiler. In particular, it is much slower with MSVC and with GCC -O2 than with GCC -O3. The numbers I usually report are from MSVC.
2. The zlib speeds you guys are reporting are 14% slower than mine (270 MB/s vs 309 MB/s). The zlib build I use has the optional x86 assembly files, which I'm guessing you do not use in your GCC builds since they're old MASM-style asm files.
If I used your zlib speeds, all the ratios vs. zlib would look better.
IMO zlib is such a mess that I'm switching to "miniz" as my standard reference point in the future.
You might also want to consider libdeflate. It skips the streaming interface of zlib, but provides better speeds and ratios (see https://quixdb.github.io/squash-benchmark/unstable/).
Hardware: ODROID C2 - ARM 64 bits - A53 ( ARMv8 ) 2Ghz CPU
OS: Ubuntu 16.04, gcc 5.3
- zlib is decoding faster than all other compressors in this ARM 64 bits benchmark.
Code:
C Size    ratio%  C MB/s   D MB/s   Name        File   (bold = pareto) MB=1.000.0000
  2704303   27.0     0.09    46.42  brotli 11   enwik7
  2802473   28.0     0.40    57.81  zstd 22     enwik7
  3519798   35.2     7.51    69.20  zstd 5      enwik7
  3597902   36.0     5.11    51.13  brotli 4    enwik7
  3647390   36.5     7.57    84.20  lzfse       enwik7
  3688450   36.9     3.78    86.80  zlib 9      enwik7
  3697385   37.0     4.82    87.22  zlib 6      enwik7
  3730944   37.3     6.79    86.83  zlib 5      enwik7
 10000004  100.0  1664.88  1677.31  memcpy      enwik7
- zlib is decoding nearly 2 times faster than brotli.
This is surprising, as brotli is intended to be used in Chrome on mobile phones.
dnd,
could you update James' rANS - he claims that it is now even faster than FSE:
https://github.com/jkbonfield/rans_static
Another question is how entropy coders compare on ARM ...
I ought to tidy up my interface a bit so it can be included as a git submodule, but haven't found the time.
I'm sure FSE could be faster too (tANS really ought to have the edge over rANS IMO for pure static (de)coding). I think the correct approach is simply better interleaving, with SIMD instructions to decode starting at multiple points in the buffer. This removes the block on SIMD by having dependency between each action. Eg 8 starting points => 8 operations at a time with the previous state dependency being per set of 8 rather than on each symbol.
Now done, but I haven't tested it in the context of TurboBench. There are now really only two relevant implementations (rANS_static4x8.c and rANS_static4x16.c) referring to 8-bit and 16-bit renormalisation. 16 bit is superior, but I already have code using the 8-bit variant so I'm keeping it working for compatibility reasons. Both of these export symbols (with 4x8 and 4x16 suffixes) which can be called directly, or you can use rANS_static.c/h which has an API with an additional renorm_size parameter (8 or 16) exposed via the -8 and -16 command line parameters. (See rANS_test.c for the example code usage.)
API wise, there are also two calling methods; one where the output size is known and supplied upfront (fastest and matching things like FSE) and one where the output buffer is unknown and it mallocs its own buffer for you. Both end up using the same code with the only difference being whether or not the output is specified as NULL, hence why the output buffer is also returned. NOTE: I don't have a test harness, nor any documentation. My actual code in real usage is just a variant of this held within the CRAM codecs rather than using a library or this project. So take it all "as is"!
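To make the two calling styles concrete, a hypothetical usage sketch; the function name and signature below are purely illustrative assumptions, not the actual rans_static API (see rANS_test.c in the repository for the real calling convention).
Code:
#include <stdint.h>
#include <stdlib.h>

/* Illustrative declaration only (hypothetical, NOT the real rans_static API):
 * encodes 'in_size' bytes; if 'out' is NULL the coder mallocs a buffer itself.
 * The buffer actually used is returned, so both styles share one code path.   */
uint8_t *rans_encode(uint8_t *out, const uint8_t *in,
                     uint32_t in_size, uint32_t *out_size);

void demo(const uint8_t *in, uint32_t in_size)
{
    uint32_t csize;

    /* Style 1: caller supplies the output buffer up front (fastest path). */
    uint8_t *buf  = malloc(in_size + in_size / 2 + 64);  /* illustrative bound */
    uint8_t *out1 = rans_encode(buf, in, in_size, &csize);

    /* Style 2: pass NULL and let the coder allocate the buffer for you. */
    uint8_t *out2 = rans_encode(NULL, in, in_size, &csize);

    free(out2);
    free(buf);
    (void)out1;   /* out1 aliases buf in this style */
}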
might be nice to include those on the webpage https://sites.google.com/site/powturbo/home/benchmark
Also might be nice to add LZ5 just for fun. I'd also be happy to offer some bounty if lzturbo were open sourced.
Cheers!
I made a PR with the changes to do this. Note the arith_static code still needs changing, but frankly the will to do this is zero right now as it has little purpose other than to demonstrate how fast pure arithmetic coding can go if you want to restrict yourself to a static table. No doubt there is still room for improvement on arith too, but pfft! Can't be arsed
Running on an Ubuntu VM on my Windows 10 desktop, on a Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz.
Code:
C Size    ratio%  C MB/s   D MB/s   Name                  File
 55333425   55.3  132.01   270.91   rans_static_4x16_o1   enwik8
 56916016   56.9  117.87   231.08   rans_static_4x8_o1    enwik8
 63202025   63.2  379.95   549.31   fse                   enwik8
 63292098   63.3  315.85   693.61   rans_static_4x16      enwik8
 63297874   63.3  276.09   697.21   rans_static_4x8       enwik8
 64138215   64.1  199.81   156.92   arith_static          enwik8
Edit: I have no idea what block size this is using, or even if it is the same for all tools. No doubt it's not an appropriate block size for the O1 tests. In theory it can be changed, but I can't figure out how to make this work (-b doesn't seem to do anything).
https://github.com/powturbo/TurboBench/pull/4
Great, so it now has decoding speed similar to Yann's Huff0 (for a large alphabet), but unfortunately much slower encoding.
Also, order 1, at least for some data, is a natural step for improving the compression ratio of LZ+EC (Brotli); another way is adaptation (LZNA, BitKnit) - maybe rANS is worth considering for future higher-compression modes of e.g. Zstd ...
On my machine Huff0 ran much faster than the published figures on TurboBench site, but maybe it's simply data dependent. Or maybe the published version is old and it's been improved a lot since then.
With regards to order-1, it could be applied equally well to tANS. Indeed, it may be faster if the tANS table sizes are smaller; I only started with rANS because the existing implementations seemed easier to follow. I'm not doing any adaptive work in the order-1 code, as it uses pure fixed frequency tables. Hence it also requires those (potentially large) tables to be written to the output stream, which is why it needs large block sizes to be worth it. The static nature of my particular needs, coupled with some order-1 (and moderate order-2, but that's pushing it) statistics, meant that this was a good fit in the speed vs size tradeoff. I'd like to see how FSE would work with order-1 static tables, but don't have the time to work on this myself right now.
Note that interleaving order-1 encoding made me think about the problem in other ways. You can't easily just unroll it 4 or 8 times due to the dependency of the {N-1}th symbol on the Nth. However starting at, say, 4 equally spaced locations in the buffer and unrolling that way lends itself well to interleaving. Even though rANS can interleave multiple models into the same rANS state, it also seems better to have multiple states to encourage easier parallelism. Of course this does have an impact on the file format so it's not a transparent parallelisation of the code. However possibly such techniques are also valuable in speeding up order-0 coding and may well lend themselves to aggressive SIMD implementations.
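A minimal sketch of that approach (illustrative only, not the rans_static code; the state type, the symbol-coding stub and the lane count of 4 are assumptions): split the buffer into 4 equal segments, give each segment its own coder state, and advance the 4 lanes in lock-step so the per-symbol order-1 dependency only exists within each lane.
Code:
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t x; } lane_state;   /* per-lane coder state (illustrative) */

/* Stub standing in for "encode one symbol in context 'prev' with this lane's
 * state"; a real rANS/tANS coder would update the state and emit renorm bytes. */
static void encode_sym(lane_state *s, uint8_t prev, uint8_t cur)
{
    s->x = s->x * 31 + (uint32_t)(prev ^ cur);   /* placeholder only */
}

static void encode_interleaved4(const uint8_t *buf, size_t n, lane_state st[4])
{
    size_t seg = n / 4;                          /* 4 equally spaced start points */

    for (size_t i = 0; i < seg; i++) {
        /* The 4 calls below are independent of each other, so the CPU (or a
         * SIMD implementation) can overlap them; only symbols inside one lane
         * keep the usual {N-1}->{N} dependency.                              */
        for (int l = 0; l < 4; l++) {
            size_t  pos  = (size_t)l * seg + i;      /* position in buf    */
            uint8_t prev = pos ? buf[pos - 1] : 0;   /* order-1 context    */
            encode_sym(&st[l], prev, buf[pos]);
        }
    }
    /* The n % 4 tail symbols and the per-lane state flush are omitted. */
}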
That said, I don't think the compilers are using SIMD much in my rANS loops. The speed gains mainly came through manual instruction reordering (despite all the guides claiming it's pointless as chips do this on the fly) and a rewrite of the renormalisation code to use cmov assembly instructions. I did look at SIMD and had some working code, but never as fast as the non-SIMD version due to the random access in the lookup tables. Perhaps with a smaller lookup table and tANS structures this would be more amenable? I'm sure there is still room for improvement here.
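A minimal sketch of the branchless renormalisation idea, assuming a 32-bit decoder state with 16-bit renormalisation so at most one refill is needed per symbol; the constant and names are illustrative, not the actual rans_static code.
Code:
#include <stdint.h>

#define RANS_L (1u << 16)   /* illustrative lower bound of the decoder state */

/* Refill the state from the byte stream when it drops below RANS_L, written
 * so the compiler can turn the selects into cmov instead of an unpredictable
 * branch. The two input bytes are read speculatively (sketch only).          */
static inline uint32_t rans_renorm(uint32_t x, const uint8_t **pptr)
{
    const uint8_t *p = *pptr;
    uint32_t need   = (x < RANS_L);                             /* 0 or 1      */
    uint32_t refill = (x << 16) | ((uint32_t)p[0] << 8) | p[1];
    x     = need ? refill : x;                                  /* cmov select */
    *pptr = p + 2 * need;                                       /* advance 0/2 */
    return x;
}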
Indeed, while adaptiveness seems impractical for tANS (?), the advantage of rANS for static context dependence is only its lower memory requirement.
Byte-wise o1 tANS would require unpredictable switching between 256 tables of ~2-8kB each, but it's indeed not much better for rANS (alias rANS might help).
Another problem is storing 256 probability distributions in the header, which would require much larger block sizes.
Fortunately, much cheaper context dependence can already give essential improvements.
From Charles "Understanding Brotli" ( http://www.cbloom.com/rants.html ), it uses o1 only for literals and offset and context is 2-6 bits:
"The context for offsets is a 2-bit context from the match length of the current match. {ML=2,3,4,5+} (models the ML-offset correlation)"
"Literals are context-coded using the previous 2 literals." - there are four modes, first two use 6 bits as context.
P.S. Adding filters is also essential for some data - the gain from a simple byte/bit shuffle (and delta filter) can be huge:
http://www.blosc.org/blog/zstd-has-j...-in-blosc.html
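For illustration, a minimal sketch of the two filters mentioned: a byte-wise delta filter and a blosc-style byte shuffle for an array of fixed-size elements (element size handling and buffer management are simplified).
Code:
#include <stddef.h>
#include <stdint.h>

/* Delta filter: replace each byte by its difference from the previous one,
 * which turns slowly varying data into many small, highly compressible values. */
static void delta_encode(uint8_t *buf, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t cur = buf[i];
        buf[i] = (uint8_t)(cur - prev);
        prev = cur;
    }
}

/* Byte shuffle: for an array of 'count' elements of 'esize' bytes, group all
 * 0th bytes together, then all 1st bytes, etc., so similar bytes sit next to
 * each other and the LZ / entropy stage sees longer matching runs.           */
static void byte_shuffle(const uint8_t *in, uint8_t *out,
                         size_t count, size_t esize)
{
    for (size_t b = 0; b < esize; b++)
        for (size_t i = 0; i < count; i++)
            out[b * count + i] = in[i * esize + b];
}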
GitHub TurboBench: new modules and compressors updated to the latest versions:
- new: GLZA (see benchmark)
- updated: brotli, lz5, lzfse, zpaq, zstd, rans_static
TurboBench (Text+Binary) : i7-2600k w/ 4.2GHz / gcc 5.3 / Ubuntu 16.04
Only GLZA is using multithreading (8 hardware threads).
Code:
C Size    ratio%  C MB/s   D MB/s   Name          File   (bold = pareto) MB=1.000.0000
  2274926   22.7     0.22    58.80  glza          enwik7
  2319408   23.2     9.70    12.79  bsc 2         enwik7
  2679330   26.8     1.40   100.60  lzturbo 49    enwik7
  2704303   27.0     0.55   407.62  brotli 11     enwik7
  2719770   27.2     2.39    89.94  lzma 9        enwik7
  2802360   28.0     2.73   684.32  zstd 22       enwik7
  2809777   28.1     1.77  1002.46  lzturbo 39    enwik7
  3538465   35.4     0.51   304.42  zopfli        enwik7
  3688450   36.9    20.94   295.08  zlib 9        enwik7
 10000004  100.0  6547.48  6748.12  memcpy        enwik7
lzt24 (binary game file, size: 3,471,552 bytes)
Code:
C Size    ratio%  C MB/s   D MB/s   Name          File   (bold = pareto) MB=1.000.0000
  1260538   36.3     4.18    55.38  lzma 9        lzt24
  1278947   36.8     2.34    54.72  lzturbo 49    lzt24
  1323645   38.1     0.58   248.89  brotli 11     lzt24
  1370022   39.5     3.40   912.39  lzturbo 39    lzt24
  1498199   43.2     5.51   732.40  zstd 22       lzt24
  1862178   53.6     6.71     7.35  bsc 2         lzt24
  2255154   65.0     0.23    13.69  glza          lzt24
  2277535   65.6     0.44   307.87  zopfli        lzt24
  2289878   66.0    14.78   292.41  zlib 9        lzt24
  3471556  100.0  6258.08  6821.12  memcpy        lzt24
I'll make other benchmarks, also on ARM, next time.
Hardware: ODROID C2 - ARM 64 bits - 2Ghz CPU, OS: Ubuntu 16.04, gcc 5.3
All entropy coders with latest versions
File: enwik8 ( uniform distribution )
- TurboHF with Huffman coding outclasses all other "asymmetric numeral systems" coders in compression ratio and speed.
Code:
C Size     ratio%   C MB/s    D MB/s   Name   (bold = pareto) MB=1.000.0000
  44854452   44.9      9.74      8.97   TurboRC_o1
  55333425   55.3     18.74     33.93   rans_static16o1
  61245756   61.2      9.04      8.67   TurboRC
  61571700   61.6     10.99      7.83   TurboAC_byte
  62677385   62.7     10.20      6.63   subotin
  62957995   63.0    116.58    164.92   TurboHF
  63188022   63.2     31.23     18.03   FastAC
  63193639   63.2     50.88     85.14   zlibh
  63202025   63.2     93.46    121.08   fse
  63292098   63.3     55.85     97.30   rans_static16
  63420890   63.4    124.50    130.52   fsehuf
  63648861   63.6     30.47     21.87   FastHF
  64138215   64.1     42.63     45.75   arith_static
 100000004  100.0   1680.95   1699.70   memcpy
Good to see some ARM results. I've never tried my code on ARM as I don't have a development platform for it. How did it fare on x86_64?
For what it's worth, I wouldn't consider the rans_static* code to be particularly bullet proof. The only one used in production is various incarnations of the rans_static4x8 (rans_static4k from before), so the 16-bit renorm one may have issues due to lack of stress testing. (It's really just an academic test to see what can be achieved if we changed formats, but really we ought to use tANS instead anyway.)
James
The lack of multiplication in tANS is much more important for hardware implementations.
Combined with Huffman compatibility, if/when processor manufacturers finally decide to add hardware support for general-purpose compressors as well, the first step will be a Huffman+tANS decoder decoding ~1 symbol/cycle, nearly equalizing the decoding speed of e.g. LZ4 and ZSTD ...