Page 2 of 6 FirstFirst 1234 ... LastLast
Results 31 to 60 of 178

Thread: TurboBench - Back to the future (incl. LzTurbo)

  1. #31
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Well, zlib is supposed to be installed in each linux dev. environment.

    Build with:
    Code:
    make

    To use the system zlib, install zlib-dev with:

    Code:
    apt-get install zlib1g-dev
    and build TurbBench with
    Code:
    make HAVE_ZLIB=1
    Last edited by dnd; 17th April 2016 at 21:05. Reason: new makefile

  2. #32
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Github TurboBench new modules and updates :
    - bcm 1.02
    - blosc
    - brotli
    - new: Libdeflate
    - new: lzoma
    - new: LZSSE
    - ZSTD 0.6.0

  3. #33
    Member
    Join Date
    Aug 2014
    Location
    Argentina
    Posts
    536
    Thanks
    236
    Thanked 90 Times in 70 Posts
    Quote Originally Posted by dnd View Post
    Well, zlib is supposed to be installed in each linux dev. environment.

    Build with:
    Code:
    make

    To use the system zlib, install zlib-dev with:

    Code:
    apt-get install zlib1g-dev
    and build TurbBench with
    Code:
    make HAVE_ZLIB=1
    Thanks. Last version compiled flawlessly with default options

  4. #34
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts

    TurboBench binaries for windows and linux updated with the latest compressor versions.
    You can find the source code on Github: TurboBench

  5. Thanks:

    Sportman (4th May 2016)

  6. #35
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    84
    Thanks
    34
    Thanked 26 Times in 17 Posts
    Tested turbobench , lzturbo and concluded that:

    1)Lzturbo 1* uses lz4 format(probably different source code and possibly not binary compatible but the same basic format |4 bit lit|4 bit ml|16 bit offset|)
    On my machine similar performance,only very small decompression speed improvement.
    2)TurboANX on my machine slower than advertised(only about 400MB/s decompression speed, a bit slower that rans_static64c)
    3)TurboHF is very interesting.It is indeed very fast,faster that huff0 from zstd
    4)lzturbo 3* is indeed fast(as advertised) and uses TurboHF , not TurboANX.

    details about TurboHF:
    It is a 12 bit depth limited huffman coder and contrary to what i expected it does not decode multiple symbols when their code length is small, as huff0 from zstd.
    As i understand it uses 4 interleaved streams to remove RAW dependencies and each stream 4x unrolled(So in total 16x unrolled) to make better use of bit IO.

  7. Thanks (2):

    encode (8th May 2016),Lucas (4th May 2016)

  8. #36
    Member
    Join Date
    May 2016
    Location
    California, US
    Posts
    1
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Has anyone managed to compile turbobench under xcode 7.3 on a Mac lately? It looks like it once worked on a mac but right now it seems to hit a number of compile issues. I'm following the instructions in README.md. Before I go sending pull requests, I wanted to see if this was known or anyone had encountered these issues already.

  9. #37
    Member
    Join Date
    Oct 2015
    Location
    Portugal
    Posts
    12
    Thanks
    1
    Thanked 1 Time in 1 Post
    I have not been able to compile turbobench on my Mac even once...

  10. #38
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Quote Originally Posted by ctur View Post
    Has anyone managed to compile turbobench under xcode 7.3 on a Mac lately? It looks like it once worked on a mac...
    TurboBench was tested with clang, gcc on linux (ubuntu), mingw and visual studio on windows. I have made some clang 3.8 compatibility changes today.
    It will be nice, if someone can test TurboBench on a mac and report the results for LZFSE,zlib,....
    LZFSE is included in TurboBench but not tested.

  11. #39
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Github TurboBench new modules and compressors update to latest versions:
    - new: xpack

    Quote Originally Posted by algorithm View Post
    TurboANX on my machine slower than advertised (only about 400MB/s decompression speed, a bit slower that rans_static64c)
    - Almost all current ANS implementations including FSE are in general not better than huffman coding.
    The only exception is TurboANX.
    See: Entropy Coder Benchmark, where TurboHF is not only faster than other ANSs, but compress also better.

    There is no sense (except advertising) using "Asymmetric Numeral Systems" when the implementations are worse than "huffman coding" for most popular distributions (like found in lz77).
    For some time ANS was faster than "Huffman Coding", but as demonstrated in TurboHF, huffman Coding can now be a lot faster than ANS.
    Of course, you can also find some distributions where lz77+ANS compress better.

    All other ANS implementations including FSE and rans_static64c are not comparable to TurboANX in term of compression ratio.
    Unlike "
    rans_static64c" (only Intel 64 bits cpu) , TurboANX is binary compatible for 32 and 64 bits cpu's with or without SIMD.

    The numbers presented in my benchmark are real numbers, not simple advertising and obtained using a desktop CPU (overclocked i7-2600k sandy bridge).
    Although currently, TurboANX is slower than expected in AVX2 CPU mode, because the AVX2 implementation was done, without having the hardware for testing.
    There is an option "-Vn" in TurboBench to overwrite the automatic cpu detection in TurboANX (n=7 for sandy bridge mode on newer cpu's).

    Last edited by dnd; 16th May 2016 at 09:26.

  12. #40
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    84
    Thanks
    34
    Thanked 26 Times in 17 Posts
    dnd i don't understand you.I said what i saw.Anyway new results

    95093583 44.9 464.35 1575.71 lzturbo 20 silesia.tar
    68245420 32.2 172.86 640.05 zstd 3 silesia.tar
    66412086 31.3 118.29 623.29 zstd 4 silesia.tar
    66033349 31.2 145.72 890.30 lzturbo 31 silesia.tar
    129391464 61.0 595.54 889.71 TurboHF silesia.tar
    101195229 47.7 503.51 2409.05 lzturbo 10 silesia.tar
    100881157 47.6 511.12 2494.07 lz4 1 silesia.tar
    128782035 60.8 293.44 390.58 TurboANX(def) silesia.tar
    128782035 60.8 354.82 508.11 TurboANX -V7 silesia.tar
    128782035 60.8 225.02 402.38 TurboANX -V1 silesia.tar
    129979386 61.3 332.84 501.59 fse silesia.tar
    130198150 61.4 279.99 458.31 rans_static64c silesia.tar

    So lzturbo slower than lz4(nowhere near 3650MB/s as you claim in your website and not "compress better and faster, decompress up to 2x! faster than lz4").
    TurboANX max dec speed 508MB/s much slower than your results 759MB/s(if we compare it with fse(529MB/s in your website,501MB/s here) which is much more portable than your turboANX).
    TurboANX has a bit higher compression ration (if we assume results are not cheated) possibly because it has smaller block size.

    lzturbo 1* decompression loop clearly lz4 format from the first 6 lines

    Code:
    60:   movzx  ecx,BYTE PTR [rax] 
         lea    r12,[rax+0x3]                                                                                                                                                                                                          
          movzx  ebx,WORD PTR [rax+0x1]
          mov    edx,ecx 
          shr    ecx,0x4
          and    edx,0xf   
          mov    rsi,rbx 
          mov    ebp,ecx 
          movzx  r11d,dl
          cmp    dl,0xf                                                                                                                                                                                                                 
          je     258                                                                                                                                                                                                                    
    85:   cmp    ebp,0xf                                                                                                                                                                                                                
          je     221                                                                                                                                                                                                                    
          vmovdq xmm0,XMMWORD PTR [r12]                                                                                                                                                                                                 
          mov    edx,ebp                                                                                                                                                                                                                
          lea    rax,[r12+rdx*1]                                                                                                                                                                                                        
          add    rdx,r8                                                                                                                                                                                                                 
          vmovup XMMWORD PTR [r8],xmm0                                                                                                                                                                                                  
    a2:   mov    ecx,r9d                                                                                                                                                                                                                
          bsr    ecx,ebp                                                                                                                                                                                                                
          add    ecx,0x1                                                                                                                                                                                                                
          movsxd rcx,ecx                                                                                                                                                                                                                
          add    DWORD PTR [rcx*4+0x8375b00],0x1                                                                                                                                                                                        
          mov    ecx,r9d                                                                                                                                                                                                                
          bsr    ecx,r11d                                                                                                                                                                                                               
          add    ecx,0x1                                                                                                                                                                                                                
          movsxd rcx,ecx                                                                                                                                                                                                                
          add    DWORD PTR [rcx*4+0x837d380],0x1                                                                                                                                                                                        
          mov    ecx,r9d                                                                                                                                                                                                                
          bsr    ecx,ebx                                                                                                                                                                                                                
          add    ecx,0x1                                                                                                                                                                                                                
          movsxd rcx,ecx                                                                                                                                                                                                                
          add    DWORD PTR [rcx*4+0x8375a60],0x1                                                                                                                                                               
          cmp    ebx,0x7                                                                                                                                                                                                      
          jbe    1a1                                                                                                                                                                                                                    
          mov    rcx,rdx                                                                                                                                                                                                                
          lea    r8d,[r11+0x4]                                                                                                                                                                                                          
          sub    rcx,rsi                                                                                                                                                                                                                
          add    r8,rdx                                                                                                                                                                                                                 
          nop                                                                                                                                                                                                                          
          nop                                                                                                                                                                                                                           
    100:  mov    rsi,QWORD                                                                                                                                                                                                             
          add    rdx,0x10                                                                                                                                                                                                               
          add    rcx,0x10                                                                                                                                                                                                               
          mov    QWORD PTR [rdx-0x10],rsi                                                                                                                                                                                               
          mov    rsi,QWORD                                                                                                                                                                                                              
          mov    QWORD PTR [rdx-0x8],rsi                                                                                                                                                                                                
          cmp    r8,rdx                                                                                                                                                                                                                 
          ja     100
    11c:   movzx  edx,BYTE PTR [rax]                                                                                                                                                                                                     
          lea    rbx,[rax+0x3]                                                                                                                                                                                                          
          movzx  ebp,WORD PTR [rax+0x1]                                                                                                                                                                                                 
          mov    ecx,edx                                                                                                                                                                                                                
          shr    edx,0x4                                                                                                                                                                                                                
          and    ecx,0xf                                                                                                                                                                                                                
          mov    rsi,rbp                                                                                                                                                                                                               
          movzx  r11d,cl                                                                                                                                                                                                                
          cmp    cl,0xf                                                                                                                                                                                                                 
          je     354                                                                                                                                                                                                                    
    13f:  cmp    edx,0xf                                                                                                                                                                                                                
          je     321                                                                                                                                                                                                                    
          vmovdq xmm0,XMMWORD PTR [rbx]                                                                                                                                                                                                 
          lea    rax,[rbx+rdx*1]                                                                                                                                                                                                        
          add    rdx,r8                                                                                                                                                                                                                 
          vmovup XMMWORD PTR [r8],xmm0                                                                                                                                                                                                  
    158:  cmp    ebp,0x7                                                                                                                                                                                                                
          jbe    290                                                                                                                                                                                                                    
          lea    r8d,[r11+0x4]                                                                                                                                                                                                          
          mov    rcx,rdx                                                                                                                                                                                                                
          sub    rcx,rsi                                                                                                                                                                                                                
          add    r8,rdx                                                                                                                                                                                                                 
          nop                                                                                                                                                                                                                           
          nop                                                                                                                                                                                                                           
    180:  mov    rsi,QWORD                                                                                                                                                                                                              
          add    rdx,0x10                                                                                                                                                                                                               
          add    rcx,0x10                                                                                                                                                                                                              
          mov    QWORD PTR [rdx-0x10],rsi                                                                                                                                                                                               
          mov    rsi,QWORD                                                                                                                                                                                                              
          mov    QWORD PTR [rdx-0x8],rsi                                                                                                                                                                                                
          cmp    r8,rdx                                                                                                                                                                                                                 
          ja     180                                                                                                                                                                                                                    
          jmp    60
    lzturbo 2* decoding loop
    Code:
    70:   lea    rbx,[rax+0x2]                                                                                                                                                                                                          
          mov    ecx,edx                                                                                                                                                                                                                
          mov    r11d,edx                                                                                                                                                                                                               
          shr    ecx,0x4                                                                                                                                                                                                                
          and    r11d,0x7                                                                                                                                                                                                              
          and    ecx,0x3                                                                                                                                                                                                                
          cmp    r11d,0x7                                                                                                                                                                                                               
          je     1a0                                                                                                                                                                                                                    
          cmp    ecx,0x3                                                                                                                                                                                                                
          je     218                                                                                                                                                                                                                    
          mov    eax,DWORD                                                                                                                                                                                                              
          mov    DWORD PTR [rsi],eax                                                                                                                                                                                                    
          add    rsi,rcx                                                                                                                                                                                                               
    9e:   lea    rax,[rbx+rcx*1]                                                                                                                                                                                                        
          shr    edx,0x6                                                                                                                                                                                                                
          mov    rbx,rsi                                                                                                                                                                                                                
          and    edx,0x3ff                                                                                                                                                                                                              
          mov    ecx,edx                                                                                                                                                                                                                
          sub    rbx,rcx                                                                                                                                                                                                                
          mov    rcx,rbx                                                                                                                                                                                                                
          cmp    edx,0x7                                                                                                                                                                                                                
          jbe    250                                                                                                                                                                                                                    
    bf:   mov    rdx,QWORD                                                                                                                                                                                                              
          mov    QWORD PTR [rsi],rdx                                                                                                                                                                                                    
          movzx  edx,WORD PTR [rcx+0x8]                                                                                                                                                                                                 
          mov    WORD PTR [rsi+0x8],dx                                                                                                                                                                                                  
          lea    edx,[r8+r11*1]                                                                                                                                                                                                         
          add    rsi,rdx                                                                                                                                                                                                                
    d4:   prefet BYTE PTR [rax+0x200]                                                                                                                                                                                                   
    db:   mov    edx,DWORD                                                                                                                                                                                                              
          test   dl,0x8                                                                                                                                                                                                                 
          je     70                                                                                                                                                                                                                     
          mov    r10d,edx                                                                                                                                                                                                               
          mov    r11d,edx                                                                                                                                                                                                               
          shr    r10d,0x5                                                                                                                                                                                                              
          and    r11d,0x7                                                                                                                                                                                                               
          and    r10d,0x7                                                                                                                                                                                                               
          test   dl,0x10                                                                                                                                                                                                               
          jne    2ab                                                                                                                                                                                                                    
          lea    rcx,[rax+0x3]                                                                                                                                                                                                          
          shr    edx,0x8                                                                                                                                                                                                                
          movzx  edx,dx                                                                                                                                                                                                                 
          cmp    r11d,0x7                                                                                                                                                                                                               
          je     2bc                                                                                                                                                                                                                    
    111:  cmp    r10d,0x7                                                                                                                                                                                                               
          je     411                                                                                                                                                                                                                    
          mov    rax,QWORD                                                                                                                                                                                                              
          mov    QWORD PTR [rsi],rax                                                                                                                                                                                                    
          add    rsi,r10                                                                                                                                                                                                                
    124:  lea    rax,[rcx+r10*1]                                                                                                                                                                                                        
          mov    rbx,rsi                                                                                                                                                                                                                
          mov    ecx,edx                                                                                                                                                                                                                
          sub    rbx,rcx                                                                                                                                                                                                                
          mov    rcx,rbx                                                                                                                                                                                                                
          cmp    edx,0x7                                                                                                                                                                                                                
          ja     bf
    Last edited by algorithm; 16th May 2016 at 15:36.

  13. #41
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    I'm not contesting your finding, but the speed is depending on the hardware (cpu, overclocking, ram speed), OS, the test files, compiler, compiler version and the compressor version and compress options used.

  14. #42
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    Quote Originally Posted by dnd View Post
    All other ANS implementations including FSE and rans_static64c are not comparable to TurboANX in term of compression ratio.
    Unlike "
    rans_static64c" (only Intel 64 bits cpu) , TurboANX is binary compatible for 32 and 64 bits cpu's with or without SIMD.

    The numbers presented in my benchmark are real numbers, not simple advertising and obtained using a desktop CPU (overclocked i7-2600k sandy bridge).
    Although currently, TurboANX is slower than expected in AVX2 CPU mode, because the AVX2 implementation was done, without having the hardware for testing.
    There is an option "-Vn" in TurboBench to overwrite the automatic cpu detection in TurboANX (n=7 for sandy bridge mode on newer cpu's).

    You may wish to test the latest rANS_static4k.c from https://github.com/jkbonfield/rans_static. (Actually there are a couple of versions in there using different methods, which will vary in performance based on number of sybols used and the size of your L1 cache.) This is 32-bit and 64-bit compatible with no SIMD other than auto-generated by the compiler.

    This is derived from the 32-bit rANS_static4c.c and is binary compatible in output, but MUCH faster for order-0 and somewhat faster at order-1 decoding. I also have tweaked the 16-bit renormalisation version (rANS_static4_16), but this isn't a primary focus as we need the 23+8 bit renorm code for the CRAM file format. Both of these are substantially faster than the (unmodified) rans_static64c code. I also took the opportunity to fix the API so it's possible to supply a buffer rather than using mallocs. I've been meaning to make a PR for you, but haven't had the time.

    My benchmarks are on a Haswell 3.2GHz i5 and IIRC were coming in around 600MB/s on enwik9. I've no idea what it'd be on a 4.5GHz i7, but my home system isn't far off so I ought to see if it builds under windows (I've never tried).

  15. #43
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    Using a VM of Ubuntu Wily (15.10) rans_static4_16i was getting ~790MB/s decode on enwik8 order 0 and ~450MB/s order 1. I used enwik8 simply because this is a VM with only 2Gb of memory and my test harness keeps before and after copies for validation. I ought to try it without the VM overhead on my main system. Host is a 4.0GHz i7.

    TurboBench running on the same VM (cut and paste is borked, grr) shows turboANX at 412MB/s encode and 632MB/s decode. FSE is very similar. Of course the benchmarking frameworks may not be identical (one block repeatedly used or all loaded at once? warm up of memory or not? etc) and likely block size differs too. As expected though, TurboHF and FSEhuf are significantly quicker.

  16. #44
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    On my work 3.2GHz i5 using the downloaded prebuilt turbobench tool:

    Code:
    @ deskpro107386[tmp/TurboBench]; turbobench_linux64/turbobenchs -b256k -eECODER /tmp/enwik
    9
    Benchmark: 1 from 3
       637756262    63.8     634.83     946.03   TurboHF         enwik9
       636087252    63.6     309.03     408.53   TurboANX        enwik9
       620516416    62.1      53.06      47.03   TurboRC         enwik9
    ERROR at 19136511:65, 64
       469202416    46.9      48.07      37.46   TurboRC_o1      enwik9
    ERROR at 6815743:6c, 2f
       624329244    62.4      41.77      26.42   TurboAC_byte    enwik9
    ERROR at 25165823:67, 66
       647597480    64.8     176.05     134.80   arith_static    enwik9
    ...
    On a build using default gcc install (old, 4.8.4) with the newer ans_jb updates: (https://github.com/jkbonfield/TurboBench). I guess it should be a submodule, but it's not yet as it'd need function renaming etc still.


    Code:
    @ deskpro107386[tmp/TurboBench]; ./turbobench -b256k -earith_static,rans_static4c,rans_static4k,rans_static4_16i,rans_static4_16io1,fse /tmp/enwik9
    Benchmark: 1 from 3
       647597480    64.8     178.44     119.66   arith_static    enwik9
       639846192    64.0     227.91     308.04   rans_static4c   enwik9
       639846192    64.0     240.97     567.20   rans_static4k   enwik9
       639792007    64.0     305.46     622.59   rans_static4_16ienwik9
       542207386    54.2     128.62     265.02   rans_static4_16io1enwik9
       639048689    63.9     357.83     530.42   fse             enwik9

    The difference in speed of arith_static there from your binary vs my binary is likely due to you using either a better compiler/version or more aggressive optimisation options, so perhaps your build would have the other figures higher too.

    What this shows is that huffman encoding is indeed rocking the speed chart, but there are faster ANS implementations out there than the TurboAN* code right now for decoding, while being comparable at encoding. For what it's worth, the complexity of tANS vs rANS implies tANS really should be quicker. I suspect there are similar tweaks (manually crafting assembly to add cmovs and more careful instruction interleaving) that would give more speed improvements still to tANS (eg FSE).

    I'm not sure whether -b is having an affect here either; probably being overridden. The the 4_16i O1 I was deliberately being poor on O1 compression in favour of speed, using a very low resolution counter to keep cache hits up.

    Edit: Testing on other hosts show my code is rather prone to too large tables causing cache missing, while FSE and ANX are less so. Particularly odd is how ANX is poorer on my i5 than an old Xeon. Maybe the JIT is picking the wrong code?

    Code:
    AMD Opteron(tm) Processor 6378 @ 2.40GHz
        16817380    23.0     262.36     358.64   TurboANX        q8
        16831548    23.0     232.68     287.33   fse             q8
        16855906    23.1     199.83     230.65   rans_static4k   q8
    
    Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
        16817380    23.0     301.29     531.22   TurboANX        q8
        16831548    23.0     236.45     347.57   fse             q8
        16855906    23.1     229.14     379.87   rans_static4k   q8
    
    Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
        16817380    23.0     331.94     428.45   TurboANX        q8
        16831548    23.0     363.91     511.13   fse             q8
        16855906    23.1     297.14     620.65   rans_static4k   q8
    The Opteron has 16Kb L1 and 2Mb shared L2.
    The Xeon has 32K L1 and 256K L2.
    The i5 also has 32K L1 and 256K L2, but the L1 is duplicated for instruction and data cache (32k each).
    Last edited by JamesB; 17th May 2016 at 15:43.

  17. Thanks:

    inikep (20th May 2016)

  18. #45
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts

    Lightbulb Turbobench + Skylake i7-6700 w/ 3.7GHz / gcc 5.3 (Text+Binary):

    This benchmark shows again that there is a big discrepancy between
    the "kraken" benchmark at cbloom.com and turbobench.

    All packages w/ latest version
    Code:
    enwik7
          C Size  ratio%     C MB/s     D MB/s   Name            File              (bold = pareto) MB=1.000.0000
         2707078    27.1       0.49     371.76   brotli 11       
         2773044    27.7       1.60     251.95   lzham 4         
         2802712    28.0       2.17     621.69   zstd 22         
         2809777    28.1       1.71     909.22   lzturbo 39      
         3144589    31.4       1.14     726.55   xpack 9         
         3420149    34.2       1.76    1020.53   lzturbo 29      
         3550544    35.5       7.94     557.44   libdeflate 12      
         3688450    36.9      16.30     240.97   zlib 9          
         3908554    39.1       8.94    3124.52   lzsse8 17       
         4227421    42.3       2.06    3039.96   lzturbo 19      
         4256269    42.6      24.96    2293.85   lz4 9           
        10000004   100.0   12107.88   12484.27   memcpy          
        
    mozilla file from silesia corpus
        14225515    27.8       0.31     281.68   brotli 11       
        14317035    28.0       1.94     207.16   lzham 4         
        14894572    29.1       1.97     773.80   lzturbo 39      
        15198580    29.7       2.63     599.40   zstd 22           (2.24 x faster than zlib)
        15846370    30.9       0.54     550.90   xpack 9         
        17718126    34.6       2.17    1270.07   lzturbo 29      
        18308584    35.7       8.24     571.42   libdeflate 12      
        19044400    37.2       6.68     268.06   zlib 9          
        22009860    43.0       2.50    3614.19   lzturbo 19      
        22092113    43.1      39.80    2614.49   lz4 9           
        22135471    43.2       1.74    3141.94   lzsse8 17       
        51220484   100.0    9367.36    9473.23   memcpy          
    Code:
    included the test from the kraken benchmark
    enwik7
    lzma        :  3.64:1 ,    1.8 enc mb/s ,   79.5 dec mb/s
    lzham       :  3.60:1 ,    1.4 enc mb/s ,  196.5 dec mb/s
    zstdmax     :  3.56:1 ,    2.2 enc mb/s ,  394.6 dec mb/s 
    Kraken 220  :  3.51:1 ,    1.4 enc mb/s ,  702.8 dec mb/s
    Kraken 215  :  3.49:1 ,    1.5 enc mb/s ,  789.7 dec mb/s
    zlib9       :  2.38:1 ,   22.2 enc mb/s ,  234.3 dec mb/s
    lz4hc       :  2.35:1 ,   27.5 enc mb/s , 2059.6 dec mb/s
    
    silesia_mozilla
    lzma        :  3.88:1 ,    2.0 enc mb/s ,   63.7 dec mb/s
    Kraken 220  :  3.60:1 ,    1.1 enc mb/s ,  896.5 dec mb/s
    lzham       :  3.56:1 ,    1.5 enc mb/s ,  186.4 dec mb/s
    Kraken 215  :  3.51:1 ,    1.2 enc mb/s ,  928.0 dec mb/s
    zstdmax     :  3.24:1 ,    2.8 enc mb/s ,  401.0 dec mb/s (only 1.38 x faster than zlib!)
    zlib9       :  2.51:1 ,   12.4 enc mb/s ,  291.5 dec mb/s
    lz4hc       :  2.32:1 ,   36.4 enc mb/s , 2351.6 dec mb/s
    Last edited by dnd; 28th May 2016 at 09:30.

  19. #46
    Member
    Join Date
    Sep 2010
    Location
    US
    Posts
    126
    Thanks
    4
    Thanked 69 Times in 29 Posts
    BTW I did some investigation upon Yann's prodding and think I can explain this discrepancy.

    See http://www.cbloom.com/rants.html for details

    This is what I get with TurboBench :


    TurboBench :
    C Size ratio% C MB/s D MB/s Name File
    10403232 42.1 3.73 559.44 zstd 22 lzt99
    10469389 42.4 3.16 863.11 lzturbo 39 lzt99
    13063248 52.9 5.81 270.44 zlib 9 lzt99

    On the same hardware : (Core i7-3770 3.4 GHz)

    zlib dll : 309.12 mb/s
    zstdmax MSVC 2005: 422.20 mb/s
    lz4hc MSVC 2005 : 2600.32 mb/s

    miniz MSVC 2012 : 318.18 mb/s
    zstd MSVC 2012 : 426.37 mb/s
    lz4hc MSVC 2012 : 2680.92 mb/s

    miniz gcc-4.7.2 -O2 : 289.48 mb/s
    zstd gcc-4.7.2 -O2 : 368.74 mb/s
    brotli9 gcc-4.7.2 -O2 : 264.30 mb/s
    lz4hc gcc-4.7.2 -O2 : 1755.57 mb/s

    brotli9 0.4.0 GCC 5.3.0 : 262 MB/s
    zstd 0.7.1 -99 GCC 5.3.0 : 560 MB/s
    zlib 1.2.8 -9 GCC 5.3.0 : 271 MB/s
    lz4hc r131 -99 GCC 5.3.0 : 2801 MB/s



    TurboBench results are about the same as lzbench
    TurboBench seems to be built with GCC 5.2

    I think there are two factors in the different (ZStd/zlib) speed ratio :

    1. ZStd seems to be very sensitive to compiler. In particular, much slower in MSVC and in GCC -O2 than it is in GCC -O3. The numbers I usually report are MSVC.

    2. The zlib speeds you guys are reporting is 14% slower than mine (270 MB/s vs 309 MB/s). The zlib build I use has the optional x86 assembly files, which I'm guessing you do not use in your GCC builds since they're old MASM style asm files.

    If I used your zlib speeds all the ratios vs. zlib would look better.

    IMO zlib is such a mess I'm switching to "miniz" for my standard reference point in the future.
    Last edited by cbloom; 3rd August 2016 at 19:05.

  20. Thanks (2):

    jibz (28th July 2016),Turtle (28th July 2016)

  21. #47
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    123
    Thanks
    105
    Thanked 71 Times in 51 Posts
    Quote Originally Posted by cbloom View Post
    IMO zlib is such a mess I'm switching to "miniz" for my standard reference point in the future.
    You might also want to consider libdefalte. It skips the streaming interface of zlib, but provides better speeds and ratios (see https://quixdb.github.io/squash-benchmark/unstable/).

  22. #48
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts

    Lightbulb TurboBench - ARM 64 bits - Cortex-A53(ARMv8) 2Ghz

    Hardware: ODROID C2 - ARM 64 bits - A53 ( ARMv8 ) 2Ghz CPU
    OS: Ubuntu 16.04, gcc 5.3

    Code:
          C Size  ratio%     C MB/s     D MB/s   Name            File              (bold = pareto) MB=1.000.0000
         2704303    27.0       0.09      46.42   brotli 11       enwik7
         2802473    28.0       0.40      57.81   zstd 22         enwik7
         3519798    35.2       7.51      69.20   zstd 5          enwik7
         3597902    36.0       5.11      51.13   brotli 4        enwik7
         3647390    36.5       7.57      84.20   lzfse           enwik7
         3688450    36.9       3.78      86.80   zlib 9          enwik7
         3697385    37.0       4.82      87.22   zlib 6          enwik7
         3730944    37.3       6.79      86.83   zlib 5          enwik7
        10000004   100.0    1664.88    1677.31   memcpy          enwik7
    - zlib is decoding faster than all other compressors in this ARM 64 bits benchmark.
    - zlib is decoding nearly 2 times faster than brotli.
    This surprising as brotli is intended to be used in chrome on mobile phones.







  23. #49
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    748
    Thanks
    234
    Thanked 238 Times in 146 Posts
    dnd,
    could you update James' rANS - he claims that it is now even faster than FSE:
    https://github.com/jkbonfield/rans_static
    Another question is how entropy coders compare on ARM ...

  24. Thanks:

    JamesB (30th July 2016)

  25. #50
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    I ought to tidy up my interface a bit so it can be included as a git submodule, but haven't found the time.

    I'm sure FSE could be faster too (tANS really ought to have the edge over rANS IMO for pure static (de)coding). I think the correct approach is simply better interleaving, with SIMD instructions to decode starting at multiple points in the buffer. This removes the block on SIMD by having dependency between each action. Eg 8 starting points => 8 operations at a time with the previous state dependency being per set of 8 rather than on each symbol.

  26. Thanks (2):

    Cyan (31st July 2016),Jarek (30th July 2016)

  27. #51
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    Quote Originally Posted by JamesB View Post
    I ought to tidy up my interface a bit so it can be included as a git submodule, but haven't found the time.

    I'm sure FSE could be faster too (tANS really ought to have the edge over rANS IMO for pure static (de)coding). I think the correct approach is simply better interleaving, with SIMD instructions to decode starting at multiple points in the buffer. This removes the block on SIMD by having dependency between each action. Eg 8 starting points => 8 operations at a time with the previous state dependency being per set of 8 rather than on each symbol.
    Now done, but I haven't tested it in the context of TurboBench. There are now really only two relevant implementations (rANS_static4x8.c and rANS_static4x16.c) referring to 8-bit and 16-bit renormalisation. 16 bit is superior, but I already have code using the 8-bit variant so I'm keeping it working for compatibility reasons. Both of these export symbols (with 4x8 and 4x16 suffixes) which can be called directly, or you can use rANS_static.c/h which has an API with an additional renorm_size parameter (8 or 16) exposed via the -8 and -16 command line parameters. (See rANS_test.c for the example code usage.)

    API wise, there are also two calling methods; one where the output size is known and supplied upfront (fastest and matching things like FSE) and one where the output buffer is unknown and it mallocs its own buffer for you. Both end up using the same code with the only difference being whether or not the output is specified as NULL, hence why the output buffer is also returned. NOTE: I don't have a test harness, nor any documentation. My actual code in real usage is just a variant of this held within the CRAM codecs rather than using a library or this project. So take it all "as is"!

  28. #52
    Member
    Join Date
    Aug 2016
    Location
    ut
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by dnd View Post
    Github TurboBench new modules and updates :
    - bcm 1.02
    - blosc
    - brotli
    - new: Libdeflate
    - new: lzoma
    - new: LZSSE
    - ZSTD 0.6.0
    Quote Originally Posted by dnd View Post
    Github TurboBench new modules and updates :
    - bcm 1.02
    - blosc
    - brotli
    - new: Libdeflate
    - new: lzoma
    - new: LZSSE
    - ZSTD 0.6.0
    might be nice to include those on the webpage https://sites.google.com/site/powturbo/home/benchmark
    Also might be nice to add LZ5 just for fun. I'd also be happy to offer some bounty if lzturbo were open sourced.

    Cheers!

  29. #53
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    Quote Originally Posted by Jarek View Post
    dnd,
    could you update James' rANS - he claims that it is now even faster than FSE:
    https://github.com/jkbonfield/rans_static
    Another question is how entropy coders compare on ARM ...
    I made a PR with the changes to do this. Note the arith_static code still needs changing, but frankly the will to do this is zero right now as it has little purpose other than to demonstrate how fast pure arithmetic coding can go if you want to restrict yourself to a static table. No doubt there is still room for improvement on arith too, but pfft! Can't be arsed
    Code:
          C Size  ratio%     C MB/s     D MB/s   Name            File
        55333425    55.3     132.01     270.91   rans_static_4x16_o1 enwik8
        56916016    56.9     117.87     231.08   rans_static_4x8_o1  enwik8
        63202025    63.2     379.95     549.31   fse                 enwik8
        63292098    63.3     315.85     693.61   rans_static_4x16    enwik8
        63297874    63.3     276.09     697.21   rans_static_4x8     enwik8
        64138215    64.1     199.81     156.92   arith_static        enwik8
    Running on an Ubuntu VM on my Windows 10 desktop, on a Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz.

    Edit: I have no idea what block size this is using, or even if it is the same for all tools. No doubt it's not an appropriate block size for the O1 tests. In theory it can be changed, but I can't figure out how to make this work (-b doesn't seem to do anything).

    https://github.com/powturbo/TurboBench/pull/4


  30. Thanks:

    Jarek (3rd August 2016)

  31. #54
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    748
    Thanks
    234
    Thanked 238 Times in 146 Posts
    Great, so it has now similar decoding speed as Yann's Huff0 (for large alphabet), but unfortunately much slower encoding.
    Also, order 1 at least for some data is a natural step for improving compression ratio of LZ+EC (Brotli), another way is adaptation (LZNA, BitKnit) - maybe rANS is worth considering for future higher compression modes of e.g. Zstd ...

  32. #55
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    On my machine Huff0 ran much faster than the published figures on TurboBench site, but maybe it's simply data dependent. Or maybe the published version is old and it's been improved a lot since then.

    With regards to Order-1, it could be applied equally well to tANS. Indeed it may be faster if tANS tables sizes are smaller and I only started with rANS simply because the existing implementations seemed easier to follow. I'm not doing any adaptive work in the order-1 code as it's pure fixed frequency tables. Hence it also requires those (potentially large) tables written to the output stream, which is why it needs large block sizes to be worth it. The static nature of my particular needs coupled with some order-1 (and moderate order-2, but that's pushing it) statistics meant that this was a good fit in speed vs size tradeoff. I'd like to see how FSE would work with order-1 static tables, but don't have the time to work on this myself right now.

    Note that interleaving order-1 encoding made me think about the problem in other ways. You can't easily just unroll it 4 or 8 times due to the dependency of the {N-1}th symbol on the Nth. However starting at, say, 4 equally spaced locations in the buffer and unrolling that way lends itself well to interleaving. Even though rANS can interleave multiple models into the same rANS state, it also seems better to have multiple states to encourage easier parallelism. Of course this does have an impact on the file format so it's not a transparent parallelisation of the code. However possibly such techniques are also valuable in speeding up order-0 coding and may well lend themselves to aggressive SIMD implementations.

    That said, I don't think the compilers are using SIMD much in my rANS loops. The speed gains mainly came through manual instruction reordering (despite all the guides claiming it's pointless as chips do this on the fly) and a rewrite of the renormalisation code to use cmov assembly instructions. I did look at SIMD and had some working code, but never as fast as the non-SIMD version due to the random access in the lookup tables. Perhaps with a smaller lookup table and tANS structures this would be more amenable? I'm sure there is still room for improvement here.

  33. #56
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    748
    Thanks
    234
    Thanked 238 Times in 146 Posts
    Indeed, while adaptiveness seems impractical for tANS (?), the advantage of rANS for static context-dependence is only lower memory need.
    Byte-wise o1 tANS would require unpredictable switching between 256 x ~2-8kB tables, but it's indeed not much better for rANS (alias rANS might help).
    Another problem is storing 256 probability distributions in header, what would require much longer block sizes.

    Fortunately much cheaper context dependence can already give essential improvements.
    From Charles "Understanding Brotli" ( http://www.cbloom.com/rants.html ), it uses o1 only for literals and offset and context is 2-6 bits:
    "The context for offsets is a 2-bit context from the match length of the current match. {ML=2,3,4,5+} (models the ML-offset correlation)"
    "Literals are context-coded using the previous 2 literals." - there are four modes, first two use 6 bits as context.

    ps. adding filters is also essential for some data - the income of simple byte/bit shuffle (and delta filter) can be huge:
    http://www.blosc.org/blog/zstd-has-j...-in-blosc.html

  34. #57
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts
    Github TurboBench new modules and compressors update to the latest versions:
    - new: GLZA (see benchmark)
    - updated:
    brotli
    lz5
    lzfse
    zpaq
    zstd
    rans_static

    TurboBench (Text+Binary) : i7-2600k w/ 4.2GHz / gcc 5.3 / Ubuntu 16.04
    Only GLZA is using multithreading (8 hardware threads).

    Code:
          C Size  ratio%     C MB/s     D MB/s   Name            File              (bold = pareto) MB=1.000.0000
         2274926    22.7       0.22      58.80   glza            enwik7
         2319408    23.2       9.70      12.79   bsc 2           enwik7
         2679330    26.8       1.40     100.60   lzturbo 49      enwik7
         2704303    27.0       0.55     407.62   brotli 11       enwik7
         2719770    27.2       2.39      89.94   lzma 9          enwik7
         2802360    28.0       2.73     684.32   zstd 22         enwik7
         2809777    28.1       1.77    1002.46   lzturbo 39      enwik7
         3538465    35.4       0.51     304.42   zopfli          enwik7
         3688450    36.9      20.94     295.08   zlib 9          enwik7
        10000004   100.0    6547.48    6748.12   memcpy          enwik7
    lzt24 (binary game file, size: 3,471,552 bytes)
    Code:
          C Size  ratio%     C MB/s     D MB/s   Name            File              (bold = pareto) MB=1.000.0000
         1260538    36.3       4.18      55.38   lzma 9          lzt24
         1278947    36.8       2.34      54.72   lzturbo 49      lzt24
         1323645    38.1       0.58     248.89   brotli 11       lzt24
         1370022    39.5       3.40     912.39   lzturbo 39      lzt24
         1498199    43.2       5.51     732.40   zstd 22         lzt24
         1862178    53.6       6.71       7.35   bsc 2           lzt24
         2255154    65.0       0.23      13.69   glza            lzt24
         2277535    65.6       0.44     307.87   zopfli          lzt24
         2289878    66.0      14.78     292.41   zlib 9          lzt24
         3471556   100.0    6258.08    6821.12   memcpy          lzt24
    I'll make other benchmarks, also on ARM next time

  35. Thanks (2):

    Jarek (15th August 2016),Kennon Conrad (15th August 2016)

  36. #58
    Member
    Join Date
    Mar 2013
    Location
    Worldwide
    Posts
    565
    Thanks
    67
    Thanked 198 Times in 147 Posts

    TurboBench - Entropy Coder - ARM 64 bits - Cortex-A53(ARMv8) 2Ghz

    Hardware: ODROID C2 - ARM 64 bits - 2Ghz CPU, OS: Ubuntu 16.04, gcc 5.3

    All entropy coders with latest versions

    File: enwik8 ( uniform distribution )
    Code:
          C Size  ratio%     C MB/s     D MB/s   Name                          (bold = pareto) MB=1.000.0000
        44854452    44.9       9.74       8.97   TurboRC_o1      
        55333425    55.3      18.74      33.93   rans_static16o1 
        61245756    61.2       9.04       8.67   TurboRC         
        61571700    61.6      10.99       7.83   TurboAC_byte    
        62677385    62.7      10.20       6.63   subotin         
        62957995    63.0     116.58     164.92   TurboHF         
        63188022    63.2      31.23      18.03   FastAC          
        63193639    63.2      50.88      85.14   zlibh           
        63202025    63.2      93.46     121.08   fse             
        63292098    63.3      55.85      97.30   rans_static16   
        63420890    63.4     124.50     130.52   fsehuf          
        63648861    63.6      30.47      21.87   FastHF          
        64138215    64.1      42.63      45.75   arith_static    
       100000004   100.0    1680.95    1699.70   memcpy          
    - TurboHF with huffman coding declasses all other "asymmetric numeral systems" in compression ratio and speed.

  37. Thanks:

    JamesB (15th August 2016)

  38. #59
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    488
    Thanks
    173
    Thanked 171 Times in 116 Posts
    Good to see some ARM results. I've never tried my code on ARM as I don't have a development platform for it. How did it fair on x86_64?

    For what it's worth, I wouldn't consider the rans_static* code to be particularly bullet proof. The only one used in production is various incarnations of the rans_static4x8 (rans_static4k from before), so the 16-bit renorm one may have issues due to lack of stress testing. (It's really just an academic test to see what can be achieved if we changed formats, but really we ought to use tANS instead anyway.)

    James

  39. #60
    Member
    Join Date
    Nov 2013
    Location
    Kraków, Poland
    Posts
    748
    Thanks
    234
    Thanked 238 Times in 146 Posts
    The lack of multiplication in tANS is much more important for hardware implementations.
    Combining with Huffman compatibility, it/when producers of processors will finally decide to add hardware support also for general purpose compressors, the first step will be Huffman+tANS decoder, decoding ~1 symbol/cycle, nearly equalizing the speed of e.g. LZ4 and ZSTD decoding ...

Page 2 of 6 FirstFirst 1234 ... LastLast

Similar Threads

  1. EBOLA AND THE FUTURE OF CIVILZATION
    By biject.bwts in forum The Off-Topic Lounge
    Replies: 20
    Last Post: 3rd November 2014, 10:34
  2. Future-LZ as last-step compression scheme
    By Piotr Tarsa in forum Data Compression
    Replies: 18
    Last Post: 3rd December 2011, 00:55
  3. BCM's future
    By encode in forum Data Compression
    Replies: 17
    Last Post: 9th August 2009, 01:00
  4. Future Bandwidth.
    By Tribune in forum The Off-Topic Lounge
    Replies: 9
    Last Post: 10th October 2008, 22:56
  5. LZPM's future
    By encode in forum Forum Archive
    Replies: 129
    Last Post: 3rd March 2008, 19:23

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •