
Thread: How fast should a range coder be?

  1. #1
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts

    How fast should a range coder be?

    Hi

    Well, I think I can't be more specific than the title already is.

    I'm talking about *just* the range coder / arithmetic coder part. I know that usually it is not used alone, which is why I'm trying to isolate its performance level.

    The reason I'm asking is that I'm in the process of writing my first entropy coder, and I'm looking for targets.

    Although the principles of arithmetic/range coders are fairly clear and well explained on the web, it took me months to frame the problem in a way that could be coded.

    But now that I've got a first compilable version, I'm a bit shocked by its speed. I haven't completed the decoder yet, so I could simply be coding something that is entirely wrong...

  2. #2
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    Take a look at fpaq0 variants at http://mattmahoney.net/dc/text.html#5586
    These are all bitwise order 0 compressors, meaning 8 coding operations per byte. The fastest is fpaq0pv4b1 at 56 ns/byte compression and 60 ns/byte decompression (7-8 ns per bit).
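
    To make the "8 coding operations per byte" concrete, here is a minimal sketch of an order-0 bitwise model (an illustration in the spirit of the fpaq0 family, not its actual source; encode_bit() is left as a stub for whichever binary arithmetic coder is used):
    Code:
    #include <cstdint>

    static uint16_t p1[256];                  // P(bit==1) scaled to 12 bits, one entry per partial-byte context

    static void encode_bit(int ctx, int bit)  // placeholder: a real binary arithmetic coder would
    {                                         // narrow its interval here using p1[ctx]
        (void)ctx; (void)bit;
    }

    void init_model()
    {
        for (int i = 0; i < 256; i++) p1[i] = 2048;        // start every context at probability 1/2
    }

    void encode_byte(uint8_t c)
    {
        int ctx = 1;                          // context = 1 followed by the bits already coded
        for (int i = 7; i >= 0; i--) {        // 8 coding operations per byte, one per bit
            int bit = (c >> i) & 1;
            encode_bit(ctx, bit);
            if (bit) p1[ctx] += (4096 - p1[ctx]) >> 5;     // adapt the estimate towards the observed bit
            else     p1[ctx] -=         p1[ctx] >> 5;
            ctx = (ctx << 1) | bit;
        }
    }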

  3. #3
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,503
    Thanks
    741
    Thanked 665 Times in 359 Posts
    writing an efficient entropy coder is an art of its own

    tornado's results:
    Code:
    D:\testing>C:\!\FreeArchiver\Compression\Tornado\tor-full.exe -o compressed.o -1
    Compressing 70.505 mb with greedy parser, 16kb:1 hash4, buffer 1mb, bytecoder w/o tables
    -1: compressed 70.505 -> 88.131 mb (125.0%), time 0.597 secs, speed 118.122 mb/sec
    
    D:\testing>C:\!\FreeArchiver\Compression\Tornado\tor-full.exe -o compressed.o -1 -c3
    Compressing 70.505 mb with greedy parser, 16kb:1 hash4, buffer 1mb, hufcoder w/o tables
    -1: compressed 70.505 -> 70.867 mb (100.5%), time 1.095 secs, speed 64.393 mb/sec
    
    
    D:\testing>C:\!\FreeArchiver\Compression\Tornado\tor-full.exe -o compressed.o -1 -c4
    Compressing 70.505 mb with greedy parser, 16kb:1 hash4, buffer 1mb, aricoder w/o tables
    -1: compressed 70.505 -> 71.040 mb (100.8%), time 1.315 secs, speed 53.618 mb/sec
    so 100 mb/sec for range coder (it's optimized to frequencies represented as x/2^n) and 140 mb/s for huffman (core2@3.24GHz)

    EDIT: my calculation is a bit too optimistic, since bytewise coding also needs some time. So 100-120 mb/sec for huffman and 80-90 mb/sec for rc.
    Last edited by Bulat Ziganshin; 22nd October 2009 at 13:33.

  4. #4
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Thanks Matt, yes, this is the information I'm looking for.

    As a minor comment, I wonder if bit-wise and byte-wise range coders are directly comparable, as I would expect bit-wise to need more operations; but then, each one is much simpler, so ...

    Well, anyway, I noted other coders in LTCB with the "oO" tag, apparently context 0, so would that translate into a pure ari/range coder? They are less powerful than fpaq0 obviously, but sometimes faster (especially shindlet). Are they byte-wise or bit-wise?

    Edit: Thanks very much Bulat for these figures. They help to get an idea of what kind of performance to look for.

    Regards

    Yann

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    > fpaq0pv4b1

    To be specific, it's http://ctxmodel.net/files/fpaq0pv4b3.rar
    Also, there're more versions of the rangecoder class with
    the same interface, but better precision, e.g. in there:
    http://ctxmodel.net/files/o6mix3d1.rar
    For example, rc_sh2f is a bitwise arithmetic coder actually.

    1. There's no single perfect rangecoder.
    At the very least, its speed optimization always depends on the
    actual coding algorithm (model), and popular
    implementations are only usable in some very specific cases.
    For example, the best feature of the paq coder is its source
    size, which allows for manual inlining. But that doesn't
    mean it's the easiest to understand and use, and there're
    many implied restrictions (a rough sketch of this kind of
    coder follows after this list):
    - probability precision (with >13 probability bits there's
    considerable redundancy due to the carry workaround).
    (This affects speed - different algorithms can be
    implemented with 16-bit or even 32-bit i/o.)
    - there's a renormalization loop with a complex condition
    and up to 5 iterations per bit, which has to be performed
    after (or before) each bit.
    (Otherwise it's possible to run renorm e.g. once per two
    bits, and fully unroll it.)
    - not compatible with non-binary alphabets
    (mainly because of the probability limit, though)
    (bytewise coding might be faster than bitwise even when
    there's a division).
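
    To illustrate the renormalization loop and carry workaround mentioned above, here is a rough sketch in the spirit of the paq/fpaq bitwise coders (not their actual code; put_byte() stands in for the real output routine). Bytes are emitted only once the top bytes of both interval bounds agree, so a carry can never reach output that has already been written:
    Code:
    #include <cstdint>
    #include <cstdio>

    static uint32_t x1 = 0, x2 = 0xFFFFFFFFu;     // current interval [x1, x2]

    static void put_byte(int b) { putchar(b); }   // stand-in for the real output routine

    void encode(int bit, uint32_t p)              // p = P(bit==1) with 12-bit precision
    {
        // split the interval in proportion to p
        uint32_t xmid = x1 + (uint32_t)(((uint64_t)(x2 - x1) * p) >> 12);
        if (bit) x2 = xmid;
        else     x1 = xmid + 1;

        // the renormalization loop discussed above: runs after every bit and
        // emits the identical leading bytes, so no carry can propagate into them
        while (((x1 ^ x2) & 0xFF000000u) == 0) {
            put_byte(x2 >> 24);
            x1 <<= 8;
            x2 = (x2 << 8) | 255;
        }
    }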

    2. Arithmetic coding is basically string enumeration.
    We just have to assign a (long) number to each data
    string, and there're really many ways to do that.
    So for now there might be no efficient implementation
    of adaptive huffman coding or some other alternatives,
    but that doesn't mean that everyone has to use a
    specific rangecoder implementation.

    3. Professional codecs mostly have some
    multiplication-free table-driven implementations, which
    are supposed to be faster
    (btw, it's these that are patented, not the plain arithmetic
    coding from a book).
    But the redundancy of these is really high, and replacing a
    single multiplication with 20 branches is not a good idea
    if it's not a hardware implementation.
    Here's an example with some benchmarks:
    http://compression.ru/sh/elscoder.rar

    4. So for first experiments with arithmetic coding it
    seems better to suggest something like:
    http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
    Rangecoders there have a relatively abstract interface,
    and there're examples of both bitwise and bytewise coding
    using the same rangecoder, and there're even two
    rangecoders used in the same program for precision/speed
    tradeoff.

    5. There're quite a few techniques applicable in various
    circumstances, so here're some more sources:
    - old bytewise - note the alternative carryless (CRLF) and
    a coder with 32-bit i/o (CL-D),
    http://compression.ru/sh/aridemo6.rar
    http://compression.ru/sh/coders6a.rar
    http://compression.ru/sh/coders6c2.rar
    (Btw, the rc in 7zip seems to be a derivative of sh1.cdr there)
    - some coding tricks:
    http://compression.ru/sh/marcdemo.rar (variable output alphabet)
    http://compression.ru/sh/parcoder.rar (multiple rcs at once)
    - newer bitwise:
    http://ctxmodel.net/files/mix_test/BWTmix_v1.rar
    http://ctxmodel.net/files/mix_test/B..._gcc_patch.rar
    http://ctxmodel.net/files/mix_test/mix_test_vC.rar

  6. #6
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Thanks for the very detailed input. It is going to be very useful.

    To explain a bit: I tried to read some "entropy coder" source code some time ago, but couldn't understand how they work. And copy/pasting wouldn't meet the objective of learning.
    In the end, I decided to write my own code, even a bad one, in order to fully understand the process. I guess I will be better able to understand others' source code after completing this exercise.

    For the time being, I've got two coders, which still need to be proven by writing a compatible decoder. I'm not too far from a result, so I believe there are already some lessons to share.

    The first one is "block based", with statistics calculated and transmitted in a header. I guess it's called "semi-static".
    The second one is "guessing", with statistics evolving along the transmission by using history. I guess it's called "dynamic".

    I'd like to share two counter-intuitive results from these experiments.


    1) Ratio: semi-static wins
    Now, I was not at all expecting this result.
    The difference is not huge (about ~0.5%), but still in favor of semi-static.
    Semi-static has to pay the cost of sending the table; it also cannot adapt too fast, to avoid a large cost for the tables.
    I settled on 256KB blocks, with tables representing 0.3% of the compressed data. In this case, semi-static always wins compared to dynamic.

    Tentative explanation:

    I guess that's because "guessing" is not as efficient as providing exactly the right ratio. For example, if a character is never used, it gets a probability of "0" in semi-static, while it gets at least "1/Range" in dynamic.
    In the end, this difference seems to pay off more than the table cost.

    2) Speed: dynamic wins
    Here also, I was expecting the "guessing" to cost more to compute.
    But not at all: about 10% better speed with dynamic (not a huge difference, though).

    Tentative explanation:
    In fact, "semi-static" requires a first analysis pass over the data block before starting the encoding process.
    Therefore, it is a 2-step process, while dynamic only goes through the data stream once.


    I was not expecting such results.
    Is this in line with other range coder experiments?

    Regards

    Yann

  7. #7
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    I haven't experimented with static coding, only dynamic. However, compression will depend on the data. If the statistics are uniform then static will win. If you have a mix of different data types then dynamic might win.

  8. #8
    Programmer toffer's Avatar
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    You're comparing apples to oranges. To clarify, keep in mind that "compression = prediction + modeling". AFAIU you used two different models (prediction). So you actually compared the models' performance. Your result sounds odd to me, and it could simply mean that there's something wrong with your models (or with the terms). Of course, results depend on the testing data, too. I guess your "adaptive" model (per context) is a counter like:

    - estimate the probability for symbol s: p_s = n_s/N
    - update for actual symbol s: n_s = n_s+1
    - total count: N = sum (for all s) n_s

    which is actually static. You can add periodic rescaling, e.g. if N>=L, where L is some limit. Or an even better way would be to update it like this:

    n_s = c*n_s + 1
    n_t = c*n_t (for all symbols t!=s)

    with c in [0, 1). After selecting a particular structure for your model(s) there's still the requirement for proper parameters. If you don't want to "waste" much time on that, I'd suggest using the periodic rescaling approach and brute-forcing the best value for L.
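
    As a small sketch of rule A above (an illustration for a 256-symbol alphabet; the names and the default limit are arbitrary), counting with periodic rescaling looks roughly like this; rule B would replace the update with the exponential decay n_s = c*n_s + 1:
    Code:
    #include <cstdint>

    struct Counter {
        uint32_t n[256];                     // per-symbol counts n_s
        uint32_t N;                          // total count, N = sum over s of n_s

        Counter() : N(256) { for (int s = 0; s < 256; s++) n[s] = 1; }

        // estimate p_s = n_s / N, returned as (freq, total) for a range coder
        uint32_t freq(int s) const { return n[s]; }
        uint32_t total()     const { return N; }

        // rule A: count, then rescale all counts once N reaches the limit L
        void update(int s, uint32_t L = 1u << 14) {
            n[s]++; N++;
            if (N >= L) {
                N = 0;
                for (int t = 0; t < 256; t++) { n[t] = (n[t] + 1) >> 1; N += n[t]; }
            }
        }
    };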

    Hope that helps, good luck!

  9. #9
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    The dynamic model is constantly updating.
    The history length can be programmed.
    For this comparison exercise, I selected a 32K sliding window for the probability history.

    Compared to the 256K block for static, it reacts more quickly to pattern changes.

    But apparently, it does not pay off.
    Maybe because the probability is based on past data, instead of the data ahead, the "guessed" probability for dynamic is less accurate than the calculated one for static...

  10. #10
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    > To explain a bit: I tried to read some "entropy coder"
    > source code some time ago, but couldn't understand how
    > they work.

    Here's my favourite explanation:
    Introduction to Rangecoding
    Please tell me if it's still not clear enough.

    And as to static/dynamic - that's a quirk of your implementation.
    Normally static has faster decoding (it's not certain about encoding, though)
    because it doesn't need to update the model, but dynamic has better
    compression (as it can adapt to local data correlations) and slower
    decoding.
    Last edited by Shelwien; 28th October 2009 at 03:14.

  11. #11
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Thanks for the information. This helps.

    I had difficulty finding free time these last few weeks, but managed to find some to work on the decoder yesterday.

    This was a nightmare to debug, but in the end most problems were little flaws in the encoder.
    My last bug was an overflow propagation limit (the encoder cannot select a value because the range gets stuck between, for example, 0x3FFFFFFE and 0x40000001),
    with a specific torture test (a 6.5GB binary file) producing some long overflow chains, the record so far being 726 digits. I was certainly not expecting such a condition...

    Anyway, since then, it seems safe to use, with exact binary comparison on all files I could test. So here it is:
    http://img48.xooimage.com/files/5/9/...ck-1518a71.zip

    The compression ratio is on par with other 0-context arithmetic coders.
    The point here is about encoding speed.
    (Range coders are not meant to be used "alone", but as an encoding step for a "real" compression algorithm.
    So having an efficient range encoder can help make faster and better compressors.)


    On my system, this range coder encodes at ~110MB/s.
    This "seems" reasonable given Bulat's previous figures.

    Quote Originally Posted by Bulat Ziganshin View Post
    so 100 mb/sec for range coder (it's optimized to frequencies represented as x/2^n) and 140 mb/s for huffman (core2@3.24GHz)

    EDIT: my calculation is a bit too optimistic, since bytewise coding also needs some time. So 100-120 mb/sec for huffman and 80-90 mb/sec for rc.
    However, it decodes at only 60 MB/s.

    The main difference here is that this version uses precise frequency values.
    They are not rounded to x/2^n (which, by the way, is an excellent idea, allowing shifts rather than multiplication/division).
    I guess this *could* explain why decoding performance is so poor...

    Anyway, comments are very welcome.

    Now on to the technical aspects:
    This is a "semi-static" version: data is broken down into blocks of 512KB. Each block gets a header with a precise count of the bytes that follow, from which a frequency table is created.
    The encoding is not "byte-specific", in fact any number of elements could be used. But here, we are crunching characters.

    Regards
    Last edited by Cyan; 8th November 2009 at 18:06.

  12. #12
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    > range0_block-1518a71.zip

    3% redundancy compared to fpaq0pv4B (641M vs 620M on enwik9),
    but 4x faster (well, that one is bitwise).
    Although, for me the actual encoding speed with a ramdrive
    seems to be ~120Mb/s for enwik9 (the coder reports ~135Mb/s),
    and, somehow, ~140Mb/s for random data (the coder reports 178Mb/s).

    > The point here is about encoding speed.

    I don't think that there's any sense in speed optimizing
    a rangecoder without an actual model.
    For example, we might be able to afford a relatively complex
    rangecoder while using words as "symbols".

    > However, it decodes at only 60 MB/s.

    Bytewise decoding requires an extra division, and also a symbol
    search, so it's relatively slow.
    It's possible to build a lookup table for faster search though,
    when it's a static coder.

    > Anyway, comments are very welcome.

    Seems like 25 CPU clocks per byte, which is pretty cool, taking
    into account the multiplications and stuff.
    So it's hard to say how far from perfection it actually is.
    But I'd suggest trying a better compiler - IntelC or gcc/mingw 4.3+.

    > The encoding is not "byte-specific", in fact any number of
    > elements could be used. But here, we are crunching
    > characters.

    Then you can take some simple LZ implementation, and use your
    rc for match encoding, with offsets etc as symbols.
    I kinda doubt that there's enough precision for that, though.

  13. #13
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Thanks for the input, Shelwien.

    Bytewise decoding requires an extra division, and also a symbol
    search, so it's relatively slow.
    It's possible to build a lookup table for faster search though,
    when it's a static coder.
    As I'm already using a table lookup,
    I ended up guessing that the division had something to do with it.

    I'm however a bit surprised by its cost.
    I hope the division is using the standard "integer" CPU instructions, and not the FP coprocessor. I'm however totally clueless about how to check that...

    Then you can take some simple LZ implementation, and use your
    rc for match encoding, with offsets etc as symbols.
    Yes, that's for the next step

  14. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,503
    Thanks
    741
    Thanked 665 Times in 359 Posts
    AFAIR, * was 4 cycles and / was 40 cycles on my Duron. It's easy to check - just write a loop that, for example, adds the results of / or *, and unroll it using gcc -O3.

    FP / is much faster, believe it or not
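
    Along those lines, a rough timing loop could look like this (a sketch only; the operands are kept data-dependent so the compiler cannot turn the division into a multiplication, and the absolute numbers will vary with compiler and CPU):
    Code:
    #include <chrono>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        const int ITER = 100000000;
        volatile uint32_t seed = 123456789;   // opaque start values, so nothing gets constant-folded
        uint32_t a = seed, b = seed | 1, acc = 0;

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < ITER; i++) { acc += a * b; a += 7; b += 14; }   // multiplications
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < ITER; i++) { acc += a / b; a += 7; b += 14; }   // divisions (b stays odd, never 0)
        auto t2 = std::chrono::steady_clock::now();

        auto ms = [](auto d) { return std::chrono::duration_cast<std::chrono::milliseconds>(d).count(); };
        printf("mul: %lld ms, div: %lld ms (acc=%u)\n",
               (long long)ms(t1 - t0), (long long)ms(t2 - t1), acc);
        return 0;
    }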

  15. #15
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Thanks for the info, Bulat.

    I was probably misled by old memories of processors with no multiplication/division capabilities.
    Back in the day, it was necessary to "hand build" the operation in assembler. And division, when properly done, was not much more costly than multiplication.
    That being said, it means that both were relatively slow.

    But it's not a good enough comparison. I can understand that modern CPU circuitry prefers to optimize multiplication rather than division...

  16. #16
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    so 100 mb/sec for range coder (it's optimized to frequencies represented as x/2^n) and 140 mb/s for huffman (core2@3.24GHz)
    Quick question:

    I intend to test the idea of a frequency distribution of the form x/2^n,
    because it should help improve decoding speed.

    However, I fail to see the difference from what could be considered "just another way" to do Huffman coding.

    Bulat, your statement seems to imply that there is, indeed, a difference, as Huffman is described as a faster alternative to a range coder with x/2^n frequencies.
    However, I fail to see it...

  17. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    Huffman coding can be compared to arithmetic coding with 2^k/2^n freqs

  18. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,503
    Thanks
    741
    Thanked 665 Times in 359 Posts
    With Huffman, each code occupies 1/2^n of the codespace, i.e. 1/2, 1/4...

    With my RC, a code can occupy 3/2^n of the space, 157/2^n of the space, and so on. I achieve this by using block-adaptive weights:

    At the start of every block, each code has a k/2^n weight, and their total weight is 1, of course. These weights are used for encoding this block.

    For counting the new weights, I reduce each weight, say, by 10%, i.e. replace each k with round(0.9*k). Then I increase a weight by 1 each time its symbol occurs. I stop the block when the sum of all weights is again 1, and use the new weights for encoding the next block.
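
    A minimal sketch of one possible reading of that scheme (not tornado's code; the names, TOTAL = 2^16 and the rounding are assumptions):
    Code:
    #include <cstdint>

    const uint32_t TOTAL = 1u << 16;         // the 2^n codespace; weights summing to TOTAL play the role of "1"

    struct BlockWeights {
        uint32_t w[256];                     // weights used for coding the current block
        uint32_t nw[256];                    // weights being collected for the next block
        uint32_t nsum;

        BlockWeights() { for (int s = 0; s < 256; s++) w[s] = TOTAL / 256; start_block(); }

        void start_block() {                 // reduce each weight by ~10%: k -> round(0.9*k)
            nsum = 0;
            for (int s = 0; s < 256; s++) { nw[s] = (w[s] * 9 + 5) / 10; nsum += nw[s]; }
        }

        // count one symbol; returns true when the block ends (weights again sum to TOTAL)
        bool count(uint8_t s) {
            nw[s]++; nsum++;
            if (nsum >= TOTAL) {
                for (int t = 0; t < 256; t++) w[t] = nw[t];    // the next block is coded with these
                start_block();
                return true;
            }
            return false;
        }
    };
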
    Last edited by Bulat Ziganshin; 9th November 2009 at 21:19.

  19. #19
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Thanks for the clear answer, Bulat.
    Your updating process is very different from the one I selected during my tests for the dynamic RC;
    I should take good note of it, because my results were pretty poor so far.

    Anyway, it seems that I transformed my Range Coder into a Huffman coder by going the 2^n route...
    And that's an interesting one too, so I will spend some time investigating.

    It seems the time required to build the Huffman tree is really significant; it drastically changes the time proportion between collecting/updating data and encoding. Quite an interesting challenge.

    Minor question:
    Is there a recommended/usual format for sending a Huffman tree within a block header?
    I've read that one can describe a tree by using one bit per node/leaf.
    256 elements ==> 511 nodes+leaves ==> ~64 bytes.
    It looks interesting, but then, once the tree is built, how does one know which value is in which leaf?
    Puzzled...

  20. #20
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,503
    Thanks
    741
    Thanked 665 Times in 359 Posts
    You may find efficient huftree_build procedures in tornado and 7-zip. Read the appnotes and other docs in http://www.haskell.org/bz/lzh.7z for info about efficient encoding of the huftree and other LZH-related ideas.

    UPDATE: and I don't encode the huftree in tornado; it uses pretty much the same semi-static updating procedure. Of course, it doesn't need to wait until the total weight is exactly 2^n/2^n. With a fast huftree_build routine, you can build huftrees even at decoding time.
    Last edited by Bulat Ziganshin; 10th November 2009 at 14:06.

  21. #21
    Expert
    Matt Mahoney's Avatar
    Join Date
    May 2008
    Location
    Melbourne, Florida, USA
    Posts
    3,255
    Thanks
    306
    Thanked 779 Times in 486 Posts
    To send a Huffman tree, you only need to send probabilities, which is the same as sending the number of bits for each symbol. For example, JPEG sends 16 lists of symbols (bytes), which are all the symbols of length 1, then all of length 2, up to 16. The decoder assigns codes in numerical order.
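
    As a small sketch of the "codes in numerical order" idea (the usual canonical-Huffman construction, as in DEFLATE or the JPEG example above; this is an illustration, with MAX_BITS = 16): given only each symbol's code length, encoder and decoder rebuild identical codes.
    Code:
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    const int MAX_BITS = 16;

    // len[s] = code length of symbol s in bits (0 = symbol unused); returns code[s]
    std::vector<uint32_t> assign_canonical(const std::vector<uint8_t>& len)
    {
        uint32_t bl_count[MAX_BITS + 1] = {0};        // how many symbols use each length
        for (uint8_t l : len) if (l) bl_count[l]++;

        // first code of each length: shorter codes come first, numerically
        uint32_t next_code[MAX_BITS + 1] = {0};
        uint32_t code = 0;
        for (int bits = 1; bits <= MAX_BITS; bits++) {
            code = (code + bl_count[bits - 1]) << 1;
            next_code[bits] = code;
        }

        // symbols of equal length get consecutive codes, in symbol order
        std::vector<uint32_t> out(len.size(), 0);
        for (std::size_t s = 0; s < len.size(); s++)
            if (len[s]) out[s] = next_code[len[s]]++;
        return out;
    }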

  22. #22
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    I finally found some time to look at the coder.
    And, to my disappointment, it's really "far from perfect".
    At least, the whole carry check can be removed, and
    a lot of other stuff simplified (like "low", which is completely unnecessary
    in the decoder). Also there's no sense in keeping both cumulative and
    plain frequencies.
    On the other hand, I guess Core2 CPUs are really cool - to
    process all of that in just 50 clocks.

    Code:
      while ( 1 ) {
        *out = c = LUT[value];                       // symbol found via lookup table
        dlow  = rshift * freqtbl[c].ofs;
        range = freqtbl[c].val * rshift;
        low += dlow;
        out1 = out++ + 1;
        // renormalization, first on the carryless condition (top bytes of low and low+range equal)
        if( !((low ^ (range + low)) & 0xFF000000) ) {
          low <<= 8;
          range <<= 8;
          code = *data++ + (code << 8);
          if( !((low ^ (range + low)) & 0xFF000000) ) {
            range <<= 8;
            low <<= 8;
            code = *data++ + (code << 8);
          }
        }
        // ...then on the usual small-range condition
        if( range<0x100000 ) {
          range <<= 8;
          low <<= 8;
          code = *data++ + (code << 8);
          if( range < 0x100000 ) {
            range <<= 8;
            low <<= 8;
            code = *data++ + (code << 8);
          }
        }
        if( out1 >= bufend ) break;
        rshift = range >> 16;
        value = (code - low) / rshift;               // the per-symbol division
      }

  23. #23
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    It is very interesting.
    I've never programmed the code this way, so I guess you are using a decompiler tool to translate the result, which means that it takes the compiler's optimizations into account.
    And actually, I find this very interesting and instructive.

    There are some strange sequences that I would never have imagined, and I guess they were selected by the compiler because they are supposed to be better.

    I believe this can be very useful for learning how to produce better code. What kind of tool are you using to get such a result?

    And, well, as we are also talking about range coders, what do you mean by "carry check can be removed"?

    Best regards

  24. #24
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    > I believe this can be very useful for learning how to produce better code.

    Sure. You can also get asm output from a C++ compiler, though.

    > What kind of tool are you using to get for such a result ?

    http://hex-rays.com/
    http://isohunt.com/download/90347605/hexrays.torrent

    > And, well, as we are also talking about range coders,
    > what do you mean by "carry check can be removed"?

    Your coder is a carryless rangecoder (likely buggy).
    By implementing full carries in the encoder, that
    side might become a little more complex and slow,
    but you won't need the low^(low+range) checks.

    Compare that to this:
    Code:
     uint GetFreq (uint totFreq) {
       return code/(range/=totFreq);   // one division per symbol; range is rescaled here
     }
     void Decode( uint cumFreq, uint freq, uint totFreq ) {
       code  -= cumFreq * range;
       range *= freq;
       while ( range<TOP ) (code<<=8)+=InpSrcByte(), range<<=8;   // renormalize; no carry handling needed on the decoder side
     }

  25. #25
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts


    Thanks for the link Shelwien,
    I will take a look into it.

    Regarding a "carry" for a rangecoder,
    I was not aware of this concept.

    Sure, the proposed "carry" implementation looks much simpler to code,
    but could the complexity be hidden inside InpSrcByte()?

    I guess the "carry" operation may also have a small cost in compression ratio?

    Note that I've tested the coder on a number of difficult files, with no problem on the released version, so if there is still a bug remaining, I'm willing to correct it.

    Edit: it took me a while to understand how the disassembler works, but now that I've got it running a bit, it certainly seems a valuable tool. Thanks again for the hint, Shelwien!

    Regards
    Last edited by Cyan; 12th November 2009 at 03:59.

  26. #26
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    > Regarding a "carry" for rangecoder,
    > i was not aware of this concept.

    Which means you didn't check any of my previous links.
    Here's a tldr-syndrome-aware attach with a single version from
    coders6c2.

    > Sure the proposed "carry" implementation looks much simpler to code
    > but could the complexity be hidden into InpSrcByte() ?

    No, that can be replaced with *p++. Or getc(file).
    In short, "low" register might contain a value of any size
    (only in the form 0FF...FFxxxxxxxx, though), but "code"
    is always <range, so no need for overflow handling in decoder.

    > I guess "carry" operation may also have a small cost on compression ratio ?

    In a way. Proper overflow handling _improves_ compression.

    > Note that I've tested the coder on a number of difficult
    > files, with no problem on the released version, so if there
    > is still a bug remaining, I'm willing to correct it.

    You and the coder probably have different concepts of "difficult"

    Here's a file which it "compresses" to 12 bytes and can't decompress
    (see attach).

  27. #27
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Thanks for the bug report.

    Indeed, it is an obvious corner case (large data blocks with a single character repetition) which was simply not present in the test suite. Thanks for this file, which fills an important gap in the test suite.
    Note: range0 is not "compressing" the file to 12 bytes; it is silently exiting due to an error condition.
    That means that I'm not properly trapping runtime errors. Proper behavior would be to provide an explicit error message. I still have to learn how to do that properly...

    "low" register might contain a value of any size
    (only in the form 0FF...FFxxxxxxxx, though), but "code"
    is always <range, so no need for overflow handling in decoder.
    I have to study that one more thoroughly.
    This looks pretty close to my first attempt at a Range Coder implementation, which failed.
    I thought that I had been too optimistic, and that overflow checking was the only way to go.

    But in fact, there was a solution.
    Now I need some time to learn from the proposed source. I'm slow to learn, but copy/pasting is not my goal here...

  28. #28
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    > Indeed, it is an obvious corner case (large data blocks
    > with a single character repetition) which was simply not
    > present in the test suite. Thanks for this file, which
    > fills an important gap in the test suite.

    Actually the aim was to get long FF sequences in the output code;
    it was just simpler to generate it like that, but
    "single character repetition" is not necessary at all.

    > Note : range0 is not "compressing" the file to 12 bytes,
    > it is silently exiting due to an error condition.

    I know, as I specifically generated that file to exploit
    the fault in the coding algorithm.

    > That means that I'm not properly trapping runtime errors.
    > Proper behavior would be to provide an explicit error
    > message. I still have to learn how to do that properly...

    C++ exceptions?
    Though it's not a good idea when you aim for speed optimization.

    >> "low" register might contain a value of any size
    >> (only in the form 0FF...FFxxxxxxxx, though), but "code"
    >> is always <range, so no need for overflow handling in decoder.

    > I have to study that one more thoroughly.

    You just have to count the FF values shifted out of
    the "low" register, because only these can overflow.

    > I thought that I had been too optimistic, and that overflow
    > checking was the only way to go.

    There're quite a few different ways to do that,
    and some might be better for encoding speed
    (like having a wider "low" and a small unused interval
    at the top end), but the precise implementation
    is not that much slower anyway, and decoding speed
    is usually more important.

    > Now I need some time to learn from the proposed source.
    > I'm slow to learn, but copy/pasting is not my goal here...

    That's unfortunate, as I'd prefer to discuss some more interesting
    speed optimization techniques, like multiplication-free, runlength,
    or parallel coding, but all of these kinda require more than 3 lines
    of C++ source to implement.

  29. #29
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    863
    Thanks
    460
    Thanked 257 Times in 105 Posts
    Hello,

    As I could not let buggy software remain, I finally found the time to adapt Shelwien's proposed modifications to Range0, and as expected they do improve performance.

    Here is the new version:
    http://img28.xooimage.com/files/c/7/...ck-154ea29.zip
    It successfully passes the "long trail" test.

    Encoding speed is improved by approximately +10%, which makes it a bit above 120MB/s.
    Decoding speed is also improved, by approximately +5%, which translates into 63MB/s.

    The difference is not as large as the code simplification would suggest.
    I guess this is mostly because the bottlenecks are elsewhere, which is especially true for the decoder: the division is the main issue, so with that part unmodified, any improvement is necessarily dwarfed by the bottleneck.


    Anyway, this was nice learning and there were good suggestions,
    so I guess that if I now want good decoding speed,
    I will have to go the Huffman way...

    Regards

  30. #30
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,340
    Thanks
    211
    Thanked 1,009 Times in 534 Posts
    Actually I already mentioned one way to improve the rangecoding speed,
    which is especially applicable when it's based on a static model.
    You can just use multiple rangecoders at once.
    Encode symbol c[0] with rc[0], then c[1] with rc[1], ... then c[N] with rc[0] again.
    It can even be vectorized to some extent.
    And it would certainly allow overlapping these divisions.
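
    A sketch of that interleaving (an illustration only; "RangeEncoder" here is just a placeholder with an empty Encode(), and a real implementation would give each coder its own output buffer, with the buffer lengths stored in the block header):
    Code:
    #include <cstddef>
    #include <cstdint>

    struct RangeEncoder {                    // placeholder for any coder with this interface
        void Encode(uint32_t cumFreq, uint32_t freq, uint32_t totFreq) {
            (void)cumFreq; (void)freq; (void)totFreq;    // a real coder would narrow its interval here
        }
    };

    const int NCODERS = 4;

    void encode_block(RangeEncoder rc[NCODERS],
                      const uint8_t* data, std::size_t len,
                      const uint32_t* cum, const uint32_t* freq, uint32_t tot)
    {
        for (std::size_t i = 0; i < len; i++) {
            uint8_t c = data[i];
            rc[i % NCODERS].Encode(cum[c], freq[c], tot);    // symbol i goes to coder i mod NCODERS
        }
        // the decoder interleaves in the same order; consecutive divisions now belong
        // to independent coders, so they can overlap in the CPU pipeline
    }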


