
Thread: Turbo-Range-Coder : Range coder / Arithmetic Coding

  1. #1 - dnd (Member, Worldwide)

    Turbo-Range-Coder : Range coder / Arithmetic Coding

    - Fastest Range Coder / Arithmetic Coder
    • 100% C (C++ headers). OS: Linux amd64, arm64, PowerPC, macOS. Windows: MinGW, Visual C++
    • No other Range Coder / Arithmetic Coder encodes or decodes faster with better compression
    • Can work as a bitwise and/or multisymbol range coder
    • 32- or 64-bit range coder
    • Renormalization output of 8, 16 or 32 bits
    • Easy connection to bit, nibble or byte predictors
    • stdin/stdout file compressor included
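    For readers new to the topic, a minimal bitwise range coder can be sketched in plain C as below: a carry-propagating variant in the style of lzma with 15-bit adaptive probabilities. This is illustrative only, not TurboRC's actual implementation, and all names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

#define PBITS 15                   /* probability precision */
#define PINIT (1u << (PBITS - 1))  /* p = 0.5 */
#define RATE  5                    /* adaptation shift */

typedef struct { uint64_t low; uint32_t range; uint8_t cache; uint64_t cnum;
                 uint8_t *out; size_t pos; } rc_enc;

/* output one byte, propagating a pending carry through cached 0xFF bytes */
static void rc_eshift(rc_enc *e) {
  if ((uint32_t)e->low < 0xFF000000u || (e->low >> 32) != 0) {
    uint8_t carry = (uint8_t)(e->low >> 32);
    do { e->out[e->pos++] = (uint8_t)(e->cache + carry); e->cache = 0xFF; }
    while (--e->cnum);
    e->cache = (uint8_t)(e->low >> 24);
  }
  e->cnum++;
  e->low = (e->low & 0x00FFFFFFu) << 8;
}

static void rc_einit(rc_enc *e, uint8_t *out) {
  e->low = 0; e->range = 0xFFFFFFFFu; e->cache = 0; e->cnum = 1;
  e->out = out; e->pos = 0;
}

static void rc_ebit(rc_enc *e, uint32_t *p, int bit) {
  uint32_t bound = (e->range >> PBITS) * *p;            /* split point for bit==0 */
  if (!bit) { e->range = bound;                   *p += ((1u << PBITS) - *p) >> RATE; }
  else      { e->low += bound; e->range -= bound; *p -= *p >> RATE; }
  while (e->range < (1u << 24)) { e->range <<= 8; rc_eshift(e); }
}

static size_t rc_eflush(rc_enc *e) { for (int i = 0; i < 5; i++) rc_eshift(e); return e->pos; }

typedef struct { uint32_t range, code; const uint8_t *in; size_t pos; } rc_dec;

static void rc_dinit(rc_dec *d, const uint8_t *in) {
  d->range = 0xFFFFFFFFu; d->code = 0; d->in = in; d->pos = 0;
  for (int i = 0; i < 5; i++) d->code = (d->code << 8) | d->in[d->pos++];
}

static int rc_dbit(rc_dec *d, uint32_t *p) {
  uint32_t bound = (d->range >> PBITS) * *p;
  int bit;
  if (d->code < bound) { d->range = bound; *p += ((1u << PBITS) - *p) >> RATE; bit = 0; }
  else { d->code -= bound; d->range -= bound; *p -= *p >> RATE; bit = 1; }
  while (d->range < (1u << 24)) { d->range <<= 8; d->code = (d->code << 8) | d->in[d->pos++]; }
  return bit;
}
```

    A bit costs one multiply plus a shift in each direction; the 64-bit/32-bit-renorm variants described in this thread follow the same pattern with wider registers.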


    Benchmark at: Entropy Coding Benchmark


    TurboBWT for Windows & Linux: BWT based on Turbo Range Coder
    Attached Files
    Last edited by dnd; 7th February 2020 at 14:24.

  2. #2 - Shelwien (Administrator, Kharkov, Ukraine)
    Doesn't compile with VS/IC or as C++; the included VS project is from trle.
    Test results:
    Code:
    c.size     c.time d.time
    61,250,820 1.875s 2.844s:  trc -0 enwik8 0 // gcc82 -march=skylake
    44,878,056 2.094s 3.312s:  trc -1 enwik8 1
    61,190,376 2.219s 3.094s:  trc -2 enwik8 2
    44,174,424 2.390s 3.515s:  trc -3 enwik8 3
    
    62,652,174 1.640s 2.500s:  coder c enwik8 1 // IC19/SSE2
    62,652,174 1.516s 2.406s:  coder c enwik8 1 // gcc82/skylake
    "coder" is bytewise decremental order0 based on sh_v1x.

  3. Thanks (2): dnd (25th December 2019), JamesB (25th December 2019)

  4. #3 - dnd (Member, Worldwide)
    Thanks for reporting and testing.

    I've uploaded new project files for VS.

  5. #4 - dnd (Member, Worldwide)
    I've integrated Turbo Range Coder into TurboBench Compression Benchmark

    If "coder" above is available, I can integrate it into TurboBench and the Entropy Coding Benchmark

    Benchmark : Skylake 3.4GHz

    Code:
     C Size   ratio%     C MB/s     D MB/s   Name            File    (bold = pareto) MB=1.000.000
       433131376    43.3      40.05      32.48   TurboRC 3       enwik9
       440344732    44.0      45.72      39.75   TurboRC 1       enwik9
       618395024    61.8      44.14      39.23   TurboRC 2       enwik9
       620093192    62.0      52.41      47.39   TurboRC 0       enwik9

    Code:
         C Size   ratio%     C MB/s     D MB/s   Name            File           
       173418060    17.3      51.75      46.87   TurboRC 3       enwik9bwt
       175618792    17.6      63.09      58.71   TurboRC 1       enwik9bwt
       176983264    17.7      60.86      56.91   TurboRC 2       enwik9bwt
       183542772    18.4      79.97      72.67   TurboRC 0       enwik9bwt
    Note: there is no MTF, RLE or other transform involved



  6. #5 - dnd (Member, Worldwide)
    Turbo Range Coder update:

    - variable-length integer (8, 16, 32 bits) coding with gamma coding + turborc.
    Efficient only if most of the values to encode are very small.

    Some use cases:
    - lz77 match lengths
    - Integer delta values
    - Run Length in RLE encoding
    - BWT post encoding (after transform)

    Note that this coder uses only ~2 KB of memory to encode the full 32-bit integer range
    with an adaptive range coder.
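    The gamma split behind this can be sketched as follows: a value v >= 1 is sent as its bit length in unary followed by its remaining bits, so small values cost very few bits. In turborc each of those bits would go through the adaptive range coder; a plain bit buffer stands in for it here (illustrative code, not the library's API).

```c
#include <stdint.h>
#include <stddef.h>

/* tiny LSB-first bit writer/reader over a zeroed byte buffer */
typedef struct { uint8_t *buf; size_t bitpos; } bitio_t;

static void putbit(bitio_t *b, int bit) {
  if (bit) b->buf[b->bitpos >> 3] |= 1u << (b->bitpos & 7);
  b->bitpos++;
}
static int getbit(bitio_t *b) {
  int bit = (b->buf[b->bitpos >> 3] >> (b->bitpos & 7)) & 1;
  b->bitpos++;
  return bit;
}

/* Elias-gamma-style split, v >= 1: (bitlen-1) zeros, a 1 (the top bit of v),
   then the remaining low bits of v */
static void gamma_put(bitio_t *b, uint32_t v) {
  int n = 31 - __builtin_clz(v);            /* bit length - 1 (v must be >= 1) */
  for (int i = 0; i < n; i++) putbit(b, 0);
  putbit(b, 1);
  for (int i = n - 1; i >= 0; i--) putbit(b, (v >> i) & 1);
}
static uint32_t gamma_get(bitio_t *b) {
  int n = 0;
  while (!getbit(b)) n++;                   /* count the unary prefix */
  uint32_t v = 1;
  for (int i = 0; i < n; i++) v = (v << 1) | getbit(b);
  return v;
}
```

    A value v costs 2*floor(log2 v)+1 bits: v=1 costs a single bit, which is why the scheme only pays off when most values are very small.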
    Last edited by dnd; 1st January 2020 at 23:27.

  7. Thanks: Mike (1st January 2020)

  8. #6 - dnd (Member, Worldwide)
    Turbo Range Coder update:

    - New: In-memory benchmark

    Code:
      ./turborc -e0 inputfile

  9. #7 - dnd (Member, Worldwide)
    Turbo Range Coder update:

    - New: Simple Run Length Encoding with gamma coding + turborc
    (no context and no MTF or other transform)

    Code:
      ./turborc -e5 inputfile
    Code:
        E MB/s      size   ratio%   D MB/s      function    file
           139  182958556  18.30%      113      trcsrle     enwik9bwt
           248   44347340  10.58%      208      trcsrle     1034.db
          1454    1087896   1.08%     2032      trcsrle     qc
    qc file

    For enwik9bwt + 1034.db see also: Turbo Run Length Encoding

    You can see in "turborcs.c" how simple the use of Turbo Range Coder is.
    All coding loops are under 10 lines.
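    The run-splitting loop at the heart of such an RLE really is tiny; a sketch with hypothetical names (the gamma+RC stage for the lengths is left out):

```c
#include <stdint.h>
#include <stddef.h>

/* Scan runs and emit (symbol, run length) pairs; in turborc the lengths
   would then be gamma-coded through the range coder. */
static size_t rle_split(const uint8_t *in, size_t n, uint8_t *syms, uint32_t *lens) {
  size_t k = 0, i = 0;
  while (i < n) {
    size_t j = i + 1;
    while (j < n && in[j] == in[i]) j++;   /* extend the current run */
    syms[k] = in[i];
    lens[k] = (uint32_t)(j - i);
    k++;
    i = j;
  }
  return k;                                 /* number of (symbol, length) pairs */
}
```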



  10. #8 - dnd (Member, Worldwide)
    Turbo Range Coder update:

    - Added RLE+rc strong
    - Added RLE with order1 context for run length + symbol
    - New benchmark (skylake 3.4GHz)

    Turbo Range Coder now compresses enwik9bwt to 171,300,484 bytes with only RLE+RC (no MTF, SRC or QLFC).
    For comparison, RAZOR compresses enwik9 to 173,041,176 bytes.

    The 100 MB QC file is compressed to 839,384 bytes with decompression speed in the range of 2 GB/s.
    BSC compresses this file slightly better, but its decompression speed is under 20 MB/s.

    Last edited by dnd; 5th January 2020 at 12:01.

  11. #9 - JamesB (Member, Cambridge, UK)
    Comparing against Razor is interesting, but it's not exactly like for like.

    That's really a comparison of a full 1 GB BWT vs LZ.

  12. #10 - Shelwien (Administrator, Kharkov, Ukraine)
    > If "coder" above is available, I can integrate it into TurboBench and the Entropy Coding Benchmark

    Previous version was posted here: https://encode.su/threads/3175-freqtable-compression

    Here I prepared it for integration.
    You probably need the "model1.cpp" version, with c_mem.cpp as an example of its usage.
    The "model.cpp" version is designed for file i/o, so it has more bounds checks.

    The rangecoder is my usual low-precision one (like in ppmd_sh etc.).
    The main coder is a simple bytewise combinatoric one (see "model_dec.inc" etc.) -
    it collects symbol counts for a block, encodes the freqtable,
    then encodes the symbols while decrementing the counts.
    The main feature is freqtable compression - it has a tuned CM-like model for freqtable entropy coding.
    There's a BLKSIZE constant at the start of model.inc (currently 16k) -
    changing it can somewhat adjust the compression/speed tradeoff.
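    The decremental part of such a coder can be sketched as follows (illustrative only, not the attached code):

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Decremental order-0: the exact block counts are sent first (the freqtable);
   during coding each used count is decremented, so the frequencies stay exact
   and the final symbol of a block is coded with probability 1. */
typedef struct { uint32_t cnt[256]; uint32_t total; } decr0;

static void decr0_init(decr0 *m, const uint8_t *blk, size_t n) {
  memset(m->cnt, 0, sizeof m->cnt);
  m->total = (uint32_t)n;
  for (size_t i = 0; i < n; i++) m->cnt[blk[i]]++;
}

/* Returns the cumulative freq and freq for symbol s (what a range coder
   consumes), then decrements - the decoder mirrors this exactly. */
static void decr0_step(decr0 *m, uint8_t s, uint32_t *cum, uint32_t *freq) {
  uint32_t c = 0;
  for (int i = 0; i < s; i++) c += m->cnt[i];
  *cum = c;
  *freq = m->cnt[s];
  m->cnt[s]--;
  m->total--;
}
```

    The coding interval for each symbol is [cum/total, (cum+freq)/total) taken *before* the decrement, which is why the model never mispredicts a symbol that cannot occur anymore.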
    Attached Files

  13. Thanks: dnd (9th January 2020)

  14. #11 - JamesB (Member, Cambridge, UK)
    Quote Originally Posted by Shelwien View Post
    There's a BLKSIZE constant at start of model.inc (currently at 16k) -
    changing it can somewhat adjust the compression/speed tradeoff.
    Block size for static coders with fixed sizes can make a huge difference; for adaptive coders, less so.
    Ideally we'd use a dynamic block size based on the entropy of the data, particularly for BWT data, which has long periods of predictable data interspersed with unpredictable data.

    Some examples from my AVX2 unrolled rANS on enwik9bwt (on "Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz")

    Code:
    BLK_SIZE 7111
    -o0  451.7 MB/s enc, 1030.9 MB/s dec     1000000008 bytes -> 251686728 bytes (best)
    -o1   82.8 MB/s enc,  197.6 MB/s dec     1000000008 bytes -> 241760238 bytes
    
    BLK_SIZE 71117
    -o0  526.6 MB/s enc, 1241.8 MB/s dec     1000000008 bytes -> 287757066 bytes
    -o1  165.4 MB/s enc,  476.3 MB/s dec     1000000008 bytes -> 203819238 bytes (best)
    
    BLK_SIZE 101117
    -o1* 181.9 MB/s enc,  509.2 MB/s dec     1000000008 bytes -> 200186386 bytes (11 bit freqs)
    
    BLK_SIZE 711177
    -o0  537.1 MB/s enc, 1232.0 MB/s dec     1000000008 bytes -> 367516505 bytes
    -o1  203.0 MB/s enc,  772.5 MB/s dec     1000000008 bytes -> 219731753 bytes
    And on enwik9 as-is (no BWT)
    Code:
    BLK_SIZE 7111
    -o0  427.9 MB/s enc,  843.5 MB/s dec     1000000000 bytes -> 658628464 bytes
    -o1   34.7 MB/s enc,   68.0 MB/s dec     1000000000 bytes -> 588300258 bytes
    
    BLK_SIZE 71117
    -o0  521.2 MB/s enc, 1195.6 MB/s dec     1000000000 bytes -> 641591112 bytes (best)
    -o1  110.1 MB/s enc,  277.3 MB/s dec     1000000000 bytes -> 490907991 bytes
    
    BLK_SIZE 711177
    -o0  519.4 MB/s enc, 1210.2 MB/s dec     1000000000 bytes -> 643865408 bytes
    -o1  158.3 MB/s enc,  638.2 MB/s dec     1000000000 bytes -> 484477472 bytes (best)
    -o1* 171.8 MB/s enc,  473.1 MB/s dec     1000000000 bytes -> 479543688 bytes (11-bit freqs)
    The best block size depends on the input data type (entropy) and also on overheads, with O1 frequency tables clearly being larger.

    (Note some of these are on the pareto frontier for dnd's benchmarks, despite running on a much slower CPU.)
    This one isn't public though as it's a total hack, but I could upload the horrendous mess of code if anyone really cares. It's the same as my older AVX2 code, but I now encode the O1 frequencies in a different, more compact, manner.

    Edit: added some more benchmarks with increased frequency table precision. Trades speed for size.
    Last edited by JamesB; 8th January 2020 at 14:18.

  15. Thanks: dnd (9th January 2020)

  16. #12 - dnd (Member, Worldwide)
    I've added the freqtable_v2 entropy coder to the TurboBench Compression Benchmark.
    Uncomment the FREQTAB=1 entry in the makefile to enable it.

    I've tried the optimization options included, but there is no speed improvement compared to the "-O3" option.
    The option "-march=skylake" also doesn't change anything.

    gcc 9.2 - Skylake : 3.4 GHz as in the Entropy Coder / Entropy Coding Benchmark

    Code:
          C Size  ratio%     C MB/s     D MB/s   Name                 File    
       620093192    62.0      53.44      47.40   TurboRC 11  order 0  enwik9
       634365787    63.4      51.64      33.72   freqtab              enwik9

    Code:
          C Size  ratio%     C MB/s     D MB/s   Name                 File          
       183542772    18.4      80.41      72.73   TurboRC 11 order 0   enwik9bwt
       247889950    24.8      81.58      62.25   freqtab              enwik9bwt
    Your benchmarks show the timings with I/O, and with the full file length as the block length for TurboRC.

  17. Thanks: Shelwien (9th January 2020)

  18. #13 - dnd (Member, Worldwide)
    Turbo Range Coder update with division-free decoding:

    - Simple and fast multisymbol CDF range coding
    - Super-fast division-free decoding with linear or binary symbol search
    - Example usage of range coding with CDFs
    - Can be used with static CDFs or adaptive predictors
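    The division-free linear-search decode can be sketched like this (illustrative, not the library's actual API):

```c
#include <stdint.h>

#define CDFBITS 15   /* cdf[s] holds P(sym < s) scaled to 1<<CDFBITS */

/* Decode step without division: scale each cdf bound by (range >> CDFBITS)
   and scan until the coder's value falls inside a symbol's interval.
   cdf has nsym+1 entries, cdf[0] = 0 and cdf[nsym] = 1<<CDFBITS. */
static int rc_cdf_decode_linear(uint32_t code, uint32_t range,
                                const uint16_t *cdf, int nsym,
                                uint32_t *low, uint32_t *size) {
  uint32_t r = range >> CDFBITS;   /* one shift replaces the division */
  int s = 0;
  while (s + 1 < nsym && (uint64_t)r * cdf[s + 1] <= code) s++;
  *low  = r * cdf[s];                          /* subinterval the coder narrows to */
  *size = r * (uint32_t)(cdf[s + 1] - cdf[s]);
  return s;
}
```

    A production coder would normally give the top symbol the remainder range - low instead of r * freq, to absorb the rounding of the shift.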


  19. #14 - Member (france)
    Can't see any CDF option in the help message... What is the option to benchmark these new features?

    ./turborc -h


    TurboRC 20.01 Copyright (c) 2018-2020 Powturbo Jan 14 2020


    Usage: ./turborc <options> <infile1> <outfile>
    <options>
    -# #: compression codec
    Range Coder: 11/13=order0, 12/14=Order1 (simple/strong)
    RLE+rc : 21/23=order0, 22/24=Order1 (simple/strong)
    Gamma+rc : 30/33=gamma8, 31/34=gamma16, 32/34=gamma32 (simple/strong)
    -b# #: block size in log2 (10..31 ,default 31)
    -d decompress
    -v verbose
    -o write on standard output


    Ex.: turborc -11 -f file.jpg file.jpg.rc
    turborc -d file.jpg.rc file1.jpg
    ---------- Benchmark ---------------------
    -k benchmark
    -e# # = function ids separated by ',' or ranges '#-#'
    # = 0 Benchmark all functions
    -i#/-j# # = Minimum de/compression iterations per run (default=auto)
    -I#/-J# # = Number of de/compression runs (default=3)
    Ex.: turborc -e0 file
    Ex.: turborc -e11,12,20-22 file

  20. #15 - dnd (Member, Worldwide)
    The use cases for these CDFs are small alphabets (e.g. 16 symbols) used with adaptive coding.
    And even at this size the predictor must be coded in SIMD to be competitive in speed with the bitwise range coder of Turbo-Range-Coder.

    Adaptive encoding can be very fast with large alphabets (>3 times faster),
    but unfortunately decoding can't be made much faster than a bitwise range coder.
    See: Entropy Coder / Entropy Coding Benchmark

    The benchmark "turborc" processes bytes.
    With adaptive CDF coding, we must first encode the high nibble (order 0) and
    then the low nibble (order 1) using the high nibble as context.
    This will not show the benefits of CDFs.
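    The bitwise nibble coding this is compared against walks a small binary tree of counters, so each bit is contextualized on the previous bits of the same nibble. A sketch of that tree walk (illustrative names; only the index arithmetic, no coding calls):

```c
#include <stdint.h>

/* A nibble coded bitwise uses a 15-node counter tree: node 1 codes the top
   bit, and each decision walks to a child, so every bit sees the already
   coded bits of the same nibble as its context. For bytes, the low-nibble
   tree would additionally be selected by the high nibble (16 trees). */
static int nib_tree_walk(int nib, int nodes[4]) {
  int node = 1;
  for (int i = 3; i >= 0; i--) {
    int bit = (nib >> i) & 1;
    nodes[3 - i] = node;       /* counter index this bit would use */
    node = node * 2 + bit;
  }
  return node - 16;            /* leaves 16..31 map back to the nibble */
}
```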

    I've now included the option "-40" (not shown in the usage) for static full-byte CDF coding (without predictor), only for testing the functionality.
    Last edited by dnd; 14th January 2020 at 21:58.

  21. #16 - Member (france)
    oh, i see. Thx!

  22. #17 - dnd (Member, Worldwide)
    Turbo-Range-Coder update :

    Benchmark with small alphabet (16 symbols) - option "-t"


    ./turborc -e40,41,42,43,44 enwik9 -t

    Code:
        E MB/s    size     ratio%   D MB/s   function
        76.10  475135200  47.51%    69.45 40-rc4sa    adaptive bitwise o0 simple
        92.05  500000004  50.00%    71.47 41-rc4s     static bitwise o0 simple 
       365.82  482541082  48.25%    80.32 42-rccdfsb  static search 
       366.24  482541082  48.25%    57.45 43-rccdfsv  static division 
       366.33  482541082  48.25%    68.98 44-rccdfst  static division free (reciprocal multiplication)
    Decoding:
    - reciprocal multiplication uses a 640k lookup table for 32-bit division
    - the division-free variant w/ LUT is faster than hardware division on Skylake
    - static symbol search (42) is faster than division (43) or reciprocal multiplication (44)
    - the adaptive Turbo-Range-Coder (40) is nearly as fast as the static coder (41)
    - adaptive bitwise o0 is faster than an adaptive coder with CDFs (entry not in the result table)
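    Reciprocal multiplication replaces the decoder's division x/f by a multiply-and-shift with a precomputed per-divisor entry; a sketch (TurboRC's actual table layout and bounds may differ):

```c
#include <stdint.h>

/* Precomputed reciprocal for divisor f >= 2: floor(2^32 / f).
   In a real decoder these would fill a lookup table indexed by f. */
static uint32_t rcp32(uint32_t f) {
  return (uint32_t)((1ull << 32) / f);
}

/* q = (x * rcp) >> 32 underestimates x / f by at most 1, so a cheap
   fix-up loop replaces the hardware divide entirely. */
static uint32_t div_rcp(uint32_t x, uint32_t f, uint32_t rcp) {
  uint32_t q = (uint32_t)(((uint64_t)x * rcp) >> 32);
  while ((uint64_t)(q + 1) * f <= x) q++;   /* at most one correction step */
  return q;
}
```

    The fix-up makes the result exact without the strict reciprocal rounding analysis a fully branch-free version would need.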
    Last edited by dnd; 16th January 2020 at 20:06.

  23. Thanks (2): JamesB (16th January 2020), Shelwien (16th January 2020)

  24. #18 - JamesB (Member, Cambridge, UK)
    Those sizes all look like basic order-1 entropy, not order-0. Is that simply from being nibble based and thus implicitly having partial order-1 by using the first 4 bits as context for the next 4? If so, I'm still surprised that it can match o1 so closely.

    Code:
    $ entropy16 < ~/scratch/data/enwik9
    len = 1000000000 bytes, entropy8   = 644561309.084184, 5.156490 bits per byte
    len = 1000000000 bytes, entropy8o1 = 486874433.871637, 3.894995 bits per byte
    What block size is it using too? Also is it genuinely block based with the ability to decode from any block, or are the frequency tables encoded relative to the previous frequency table? (That can help improve compression especially with small blocks, but obviously removes random access by block.)

  25. #19 - dnd (Member, Worldwide)
    Sorry for not explaining the sizes in this test.
    The goal of this benchmark is to compare the speed of different CDF decoding methods with
    small alphabets and clean static coding only, without the overhead of adaptive predictors and contexts.
    Small alphabets are the preferred use cases (image, video, ...) for CDFs.
    Without writing a dedicated benchmark, we simply ignore the high nibble and encode only the low nibble of each byte.
    Thus the sizes shown are meaningless here.

    But the relative speed is correct, and the benchmark shows for example that bitwise decoding is
    faster than CDF decoding even for static distributions.
    Predictors for adaptive CDF coding are more complex and slow.

    In other words, decoding a nibble as 4 separate bits is faster than decoding it at once with CDFs.
    This is a little surprising, given that CDF coding has nowadays become fashionable in entropy coding.

    Quote Originally Posted by JamesB: What block size is it using too? Also is it genuinely block based with the ability to decode from any block, or are the frequency tables encoded relative to the previous frequency table? (That can help improve compression especially with small blocks, but obviously removes random access by block.)

    This test is only for benchmarking speed with small alphabets. The CDFs are calculated over the whole file before coding/decoding.

    Last edited by dnd; 16th January 2020 at 23:14.

  26. Thanks: JamesB (17th January 2020)

  27. #20 - Shelwien (Administrator, Kharkov, Ukraine)
    1. 32-bit i/o in an x64 rc is a good speed optimization.
    In fact, I have an old 32-bit coder like that which used FP registers for the rc state,
    so this method should be relatively portable.
    But I wonder if more than 32 bits would make sense too?
    I tried testing it with my rc, and it somehow made decoding speed worse:
    Code:
    45411520  6.203s  6.890s // 32-bit renorm/LE
    45729030  6.156s  6.969s // 48-bit renorm/BE
    45729030  6.187s  7.078s // 48-bit renorm/LE
    In theory, even 56-bit renorm is possible, e.g. with counters like in lzma.

    2. It's possible to make a binary rc with a 128-bit range.
    It would require more multiplications, but would also allow 64-bit renorm
    and/or only checking for renorm once per 8 bits.
    I wonder if it would work as a speed optimization.

  28. #21 - dnd (Member, Worldwide)
    A larger renorm (e.g. 48 bits) would require substantial changes in the coding.
    A priori it will not make the RC faster.
    The speed issue of an RC is the sequential processing in decoding.


    You can use interleaving to speed up rANS.
    Actually, I've not tried interleaving (2 streams) with the Turbo-Range-Coder.
    I wonder if it's possible, and how to rearrange the output of the RC
    so that interleaving can be used with one stream at decoding.
    The 64-bit RC reads and writes in 32-bit units only.

    I've added some logic to use the renorm only when necessary in the o0 simple coder (option 11).

    enwik9 - skylake i6700 3.4GHz
    Code:
     E MB/s    size     ratio%   D MB/s   function             RC_IO   RC_BITS 
      56.34  620093192  62.01%    47.56   11-rcs o0  simple     32       15       4 Renorm/Byte
      54.58  620093192  62.01%    47.10   11-rcs o0  simple     32       15       8 Renorm/Byte (previous version)
      50.70  620717608  62.07%    42.58   11-rcs o0  simple     16       12       2 Renorm/Byte
    Quote Originally Posted by Shelwien: 2. Its possible to make a binary rc with 128-bit range.
    I've added the type __uint128_t to rcrange_t to support a 128-bit RC, without changing the coding.
    It's too slow, as expected. The output (RC_IO) can be 8, 16, 32 or 64 bits.

    The fastest RC is the default 64-bit RC (RC_SIZE = 64) with 32-bit i/o (RC_IO = 32).

  29. Thanks: Shelwien (17th January 2020)

  30. #22 - Shelwien (Administrator, Kharkov, Ukraine)
    Here's my coder with lzma counters: http://nishi.dreamhosters.com/u/c0lit_v0.7z
    Code:
    [i7-8650U @ 4.10Ghz]
    61251024  2.063s  2.156s // c0lit_cl90.exe     
    61251024  2.265s  2.297s // c0lit_gc82.exe     
    61251024  2.109s  2.156s // c0lit_ic19.exe     
    61251024  2.172s  2.344s // c0lit_cl90_fil.exe 
    61251024  2.094s  2.265s // c0lit_gc82_fil.exe 
    61251024  2.218s  2.312s // c0lit_ic19_fil.exe 
    
    [i7-7820X @ 4.36Ghz]
    61251024  1.750s  1.844s // c0lit_cl90.exe
    61251024  1.844s  2.000s // c0lit_cl90_fil.exe
    61251024  1.891s  1.969s // c0lit_gc82.exe    
    61251024  1.782s  1.937s // c0lit_gc82_fil.exe
    61251024  1.781s  1.829s // c0lit_ic19.exe    
    61251024  1.875s  1.969s // c0lit_ic19_fil.exe
    
       E MB/s    size     ratio%   D MB/s   function
        62.39   61250808  61.25%    56.88 11-rcs      o0  simple    enwik8
    
    61250808  1.602s  1.758s // trc.exe recalculated from speed
    61250820  1.953s  1.891s // trc.exe -11 "enwik8" 1  (gcc82, modified test.bat)
    It's timed with file i/o, so I wonder how it actually compares...

  31. #23 - dnd (Member, Worldwide)
    Turbo-Range-Coder update:

    - complete BWT post coder/decoder with QLFC + TurboRC in 70 lines of code
    - using simple gamma coding + TurboRC
    - decoding 2 times faster than bsc's post coder (qlfc,1) at nearly the same compression ratio
    - beats bwtmix (context mixing) on enwik9


    TurboBench Compression Benchmark - Skylake 3.4 GHz
    File: enwik9bwt
    Code:
          C Size    ratio%     C MB/s     D MB/s   Name          
       163,924,735    16.4      28          27     bscqlfc 2
       164,998,959    16.5      55          61     bscqlfc 1
       167,395,956    16.7      82         120     TurboRC 25
       167,861,671    16.8                         bwtmix

  32. #24 - dnd (Member, Worldwide)
    Turbo-Range-Coder update:
    - added CDF interleaved range coder + improved symbol search

    Note that all the static CDF coders here can be used with adaptive coding by connecting a CDF predictor.

    For large alphabets (e.g. 256 symbols), it is possible to add fast table-based decoding with "division" or "division lut" (instead of the symbol search), but this will not work with adaptive coding.
    For large alphabets with static or semi-static (periodic rescale) coding, the compression ratio
    will be so much worse that it is better to use Huffman coding or, better still, block-based ANS.
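    The table-based decoding mentioned here can be sketched like this (illustrative names; a smaller CDF scale is used so the table stays tiny):

```c
#include <stdint.h>

#define CDFBITS 12   /* probability scale; the lut has 1<<CDFBITS entries */

/* Table-based decode for a *static* CDF: precompute the symbol for every
   slot of the probability scale, turning the decode-side symbol search
   into a single lookup. Useless for adaptive coding, since the table
   would have to be rebuilt after every model update.
   cdf has nsym+1 entries, cdf[0] = 0 and cdf[nsym] = 1<<CDFBITS. */
static void cdf_build_lut(const uint16_t *cdf, int nsym, uint8_t *lut) {
  for (int s = 0; s < nsym; s++)
    for (uint32_t f = cdf[s]; f < cdf[s + 1]; f++)
      lut[f] = (uint8_t)s;
}
```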


    TurboBench Compression Benchmark - Skylake 3.4 GHz

    ./turborc -e40,41,42,43,44,45 -t enwik9
    Code:
       E MB/s    size     ratio%   D MB/s   function
        82.18  475135200  47.51%    69.17   40-rc4s bitwise adaptive o0 simple
        92.18  500000004  50.00%    72.74   41-rc4s bitwise static   o0 simple 
       425.47  478499608  47.85%    85.62   42-rccdfsb CDF static search 
       425.47  478499608  47.85%    50.93   43-rccdfsv CDF static division 
       358.50  482541082  48.25%    70.01   44-rccdfst CDF static division lut 
       491.94  478499612  47.85%   100.68   45-rccdfsb CDF static search interleaved

    ./turborc -e40,41,42,43,44,45 -t enwik9bwt
    Code:
       E MB/s    size     ratio%   D MB/s   function
       138.29  156109312  15.61%   122.36   40-rc4s bitwise adaptive o0 simple     
       139.34  500000008  50.00%   127.62   41-rc4s bitwise static   o0 simple     
       457.02  478499608  47.85%   132.32   42-rccdfsb CDF static search     
       456.71  478499608  47.85%    69.24   43-rccdfsv CDF static division     
       420.97  483079172  48.31%   119.03   44-rccdfst CDF static division lut     
       548.13  478499616  47.85%   171.40   45-rccdfsb CDF static search interleaved

    In this speed benchmark only the low nibble of each byte is encoded/decoded.
    Last edited by dnd; 22nd January 2020 at 21:50.

  33. #25 - dnd (Member, Worldwide)
    Turbo-Range-Coder update:
    - added simple single threaded bwt compression/decompression w/ QLFC + TurboRC

    enwik8: 21,274,340
    enwik9: 167,395,944

    bwt compress:   ./turborc -26 inputfile outputfile
    bwt decompress: ./turborc -d inputfile.rc outputfile

    Attachment: TurboRC Windows executable
    Attached Files

  34. Thanks (2): JamesB (26th January 2020), Sportman (25th January 2020)

  35. #26 - dnd (Member, Worldwide)
    Turbo-Range-Coder update:

    - Added logic for predictors with parameters
    - Added Shelwien predictor (only o0 parameters used; add SH=1 as arg. to make)
    - Added nanozip predictor (add NZ=1 as arg. to make)


    book1
    Code:
       E MB/s    size     ratio%   D MB/s   function
        37.41     334292  43.48%    33.27   16-rcxnz    o8b nanozip sliding context      
        46.81     342484  44.55%    35.45   12-rcxs     o8b simple sliding context       
        45.88     343360  44.66%    37.83   18-rcxsh    o8b shelwien sliding context     
        36.52     360100  46.84%    30.83   14-rcxss    o8b strong sliding context       
        44.89     434560  56.53%    40.57   15-rcnz     o0  nanozip                      
        58.22     438848  57.08%    46.79   17-rcsh     o0  shelwien                     
        57.18     441132  57.38%    48.86   11-rcs      o0  simple                       
        46.59     446756  58.11%    40.94   13-rcss     o0  strong

    enwik9
    Code:
       E MB/s    size     ratio%   D MB/s   function
        46.56  440344732  44.03%    35.22   12-rcxs     o8b simple sliding context       
        45.60  441464048  44.15%    37.62   18-rcxsh    o8b shelwien sliding context     
        36.55  455524488  45.55%    30.78   14-rcxss    o8b strong sliding context       
        37.12  456710868  45.67%    32.93   16-rcxnz    o8b nanozip sliding context      
        45.58  618395008  61.84%    40.08   13-rcss     o0  strong                       
        56.33  619541604  61.95%    45.83   17-rcsh     o0  shelwien                     
        55.71  620093192  62.01%    47.92   11-rcs      o0  simple                       
        43.80  629091976  62.91%    39.59   15-rcnz     o0  nanozip

    enwik9bwt
    Code:
       E MB/s    size     ratio%   D MB/s   function
        75.46  167302256  16.73%   105.43   27-rcqlfcss QLFC strong                      
        82.46  167395956  16.74%   122.23   25-rcqlfcs  QLFC simple                      
        75.40  167611432  16.76%   108.44   28-rcqlfcnz QLFC nanozip                     
    
    
        63.05  175618792  17.56%    50.64   12-rcxs     o8b simple sliding context       
        60.65  176725456  17.67%    54.24   18-rcxsh    o8b shelwien sliding context     
        63.66  176983132  17.70%    58.82   13-rcss     o0  strong                       
        81.42  183542676  18.35%    74.17   11-rcs      o0  simple                       
        46.58  185680832  18.57%    44.24   14-rcxss    o8b strong sliding context       
        46.50  187728188  18.77%    45.46   16-rcxnz    o8b nanozip sliding context  
        81.01  186745268  18.67%    71.35   17-rcsh     o0  shelwien                     
        59.64  218700024  21.87%    56.34   15-rcnz     o0  nanozip
    Last edited by dnd; 29th January 2020 at 00:29.

  36. #27 - dnd (Member, Worldwide)
    Turbo-Range-Coder update:

    - New: fast LZP byte preprocessor.
    Can be used to reduce the input length (for speed) before further processing like BWT or RC.

    - New: improved BWT speed with LZP preprocessing.
    LZP is applied only when the reduction of the input length is large enough (> ~6%).
    The minimum LZP match length (32-256) can be set with the option "-l#" (default = 96); -l0 = no LZP.
    (Use text files other than enwik8/9 for testing.)

    - New: faster inverse BWT for large inputs

    It is possible to test and benchmark LZP alone with "./turborc -e60 file" or "./turborc -e60 -l128 file" (max. 1 GB file)
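    The core idea of an LZP preprocessor can be sketched as follows (illustrative, not turborc's implementation): a hash of the last few bytes indexes a table of earlier positions, and bytes correctly predicted by the remembered position can be replaced by match lengths. This sketch only counts how many bytes the predictor would have matched.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define LZP_HBITS 16

/* hash of a 4-byte context (assumes unaligned loads via memcpy are fine) */
static uint32_t lzp_hash(const uint8_t *p) {
  uint32_t h;
  memcpy(&h, p, 4);
  return (h * 2654435761u) >> (32 - LZP_HBITS);
}

/* Count bytes the LZP predictor gets right: for each position, look up the
   last position seen with the same 4-byte context and compare next bytes. */
static size_t lzp_predicted(const uint8_t *in, size_t n) {
  static size_t tab[1 << LZP_HBITS];   /* 0 = empty slot */
  memset(tab, 0, sizeof tab);
  size_t hits = 0;
  for (size_t i = 4; i < n; i++) {
    uint32_t h = lzp_hash(in + i - 4);
    size_t p = tab[h];
    tab[h] = i;
    if (p && in[p] == in[i]) hits++;   /* prediction correct */
  }
  return hits;
}
```

    A real LZP pass would extend each correct prediction into a run and emit only its length, which is why it helps exactly when the input is highly repetitive.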

    Windows exe compiled w/ mingw-w64 included
    Attached Files
    Last edited by dnd; 1st February 2020 at 21:43.

  37. #28 - dnd (Member, Worldwide)
    Turbo-Range-Coder update:

    - Added order-0 encode/decode in mb_o0.h.
    Possible RC for sizes of 2, 3, 4 (nibble), 5, 6, 7, 8 (byte) bits
    - Added chunked integer coding (mbl3) in mb_vint.h (similar to lzma match-length encoding)
    - Slightly improved speed for lzp and bwt
    - New adaptive CDF predictor with SSE2/AVX2/Neon/Altivec on x86/ARM/PowerPC
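    An adaptive CDF predictor of this kind can be sketched with the following update rule (illustrative, not TurboRC's exact mixer):

```c
#include <stdint.h>

#define CDFBITS 15
#define CDFMAX  (1 << CDFBITS)
#define RATE    5

/* Adaptive CDF update: after coding symbol s, every interior cdf entry
   moves one step toward the distribution that makes s more probable.
   cdf[i] = P(sym < i) scaled; cdf[0] = 0 and cdf[nsym] = CDFMAX stay fixed.
   Each entry updates independently with branch-light arithmetic, which is
   why the loop maps well onto SSE2/AVX2/Neon vectors. */
static void cdf_update(uint16_t *cdf, int nsym, int s) {
  for (int i = 1; i < nsym; i++) {
    if (i <= s) cdf[i] -= cdf[i] >> RATE;            /* shrink toward 0 */
    else        cdf[i] += (CDFMAX - cdf[i]) >> RATE; /* grow toward CDFMAX */
  }
}
```

    Both branches are monotone maps, so the CDF stays sorted after every update; a SIMD version replaces the `i <= s` test with a precomputed mask per lane.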


    CDF Nibble (only low nibble of each byte encoded)
    Code:
    ./turborc -e40,48,49 enwik9 -t
       E MB/s    size     ratio%   D MB/s    function
        80.78  475135200  47.51%    68.26    40-rc4s  bitwise adaptive o0 simple          
       308.09  474280380  47.43%    67.40    48-rccdf CDF adaptive                        
       306.41  474280388  47.43%    83.10    49-rccdf CDF nibble adaptive interleaved     
    
    
    ./turborc -e40,48,49 enwik9bwt -t 
       E MB/s    size     ratio%   D MB/s    function
       137.38  156109312  15.61%   123.31    40-rc4s  bitwise adaptive o0 simple          
       377.11  163493876  16.35%   112.27    48-rccdf CDF nibble adaptive                 
       397.73  163493884  16.35%   143.06    49-rccdf CDF nibble adaptive interleaved
    CDF byte using high + low nibble
    Code:
    ./turborc -e11,48,49 enwik9
       E MB/s    size     ratio%   D MB/s    function
        55.32  620093192  62.01%    47.69    11-rcs bitwise  o0  simple                       
       165.69  622450296  62.25%    51.27    48-rccdf CDF adaptive                       
       157.96  622450300  62.25%    50.96    49-rccdf CDF byte   adaptive interleaved    
    
    
    ./turborc -e11,48,49 enwik9bwt
       E MB/s    size     ratio%   D MB/s    function
        80.92  183542676  18.35%    74.10    11-rcs bitwise o0  simple                       
       203.92  195826332  19.58%    88.09    48-rccdf CDF byte   adaptive                
       200.22  195826340  19.58%    87.38    49-rccdf CDF byte   adaptive interleaved
    Attached Files
    Last edited by dnd; 15th February 2020 at 01:14.

  38. Thanks: Shelwien (13th February 2020)

