Thread: How fast should a range coder be?

1. How fast should a range coder be?

Hi

Well, I think I can't be more specific than the title already is.

I'm talking about *just* the range coder / arithmetic coder part. I know that usually it is not used alone; that's why I'm trying to isolate its performance level.

The reason I'm asking is that I'm in the process of writing my first entropy coder, and I'm looking for targets.

Although the principles of arithmetic/range coders are pretty clear and well explained on the web, it took me months to think about the problem in a way that could be coded.

But now that I've got a first compilable version, I'm a bit shocked by its speed. I've not completed the decoder yet, so I could simply be coding something which is entirely wrong...

2. Take a look at fpaq0 variants at http://mattmahoney.net/dc/text.html#5586
These are all bitwise order 0 compressors, meaning 8 coding operations per byte. The fastest is fpaq0pv4b1 at 56 ns/byte compression and 60 ns/byte decompression (7-8 ns per bit).

3. Writing an efficient entropy coder is an art of its own.

Code:
```D:\testing>C:\!\FreeArchiver\Compression\Tornado\tor-full.exe -o compressed.o -1
Compressing 70.505 mb with greedy parser, 16kb:1 hash4, buffer 1mb, bytecoder w/o tables
-1: compressed 70.505 -> 88.131 mb (125.0%), time 0.597 secs, speed 118.122 mb/sec

Compressing 70.505 mb with greedy parser, 16kb:1 hash4, buffer 1mb, hufcoder w/o tables
-1: compressed 70.505 -> 70.867 mb (100.5%), time 1.095 secs, speed 64.393 mb/sec

Compressing 70.505 mb with greedy parser, 16kb:1 hash4, buffer 1mb, aricoder w/o tables
-1: compressed 70.505 -> 71.040 mb (100.8%), time 1.315 secs, speed 53.618 mb/sec```
so 100 mb/sec for the range coder (it's optimized for frequencies represented as x/2^n) and 140 mb/sec for huffman (core2@3.24GHz)

EDIT: my calculation is a bit too optimistic since bytewise coding also needs some time, so 100-120 mb/sec for huffman and 80-90 mb/sec for rc
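One way to read these coder-only figures (an interpretation of the runs above, not Bulat's code): subtract the table-less bytecoder baseline in the time domain, since time per byte adds up while MB/s does not.

```cpp
#include <cassert>
#include <cmath>

// Throughputs combine in time-per-byte: subtract the baseline run's time per
// byte (everything except the entropy coder) to isolate the coder stage.
double coder_only_mbs(double whole_run_mbs, double baseline_mbs) {
    return 1.0 / (1.0 / whole_run_mbs - 1.0 / baseline_mbs);
}
```

With the runs above: coder_only_mbs(64.393, 118.122) ≈ 141 MB/s for the Huffman stage and coder_only_mbs(53.618, 118.122) ≈ 98 MB/s for the range-coder stage, which matches the 140/100 figures quoted.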

4. Thanks Matt, yes this is the information i'm looking for.

As a minor comment, I wonder if bit-wise and byte-wise rangecoders are directly comparable, as I would expect bit-wise to need more operations; but then, each one is much simpler, so...

Well, anyway, I noted other coders in LTCB with the "o0" tag, apparently context 0, so would that translate into a pure ari/range coder? They are less powerful than fpaq0 obviously, but sometimes faster (especially shindlet). Are they byte-wise or bit-wise?

Edit: Thanks very much, Bulat, for these figures. They help to get an idea of what kind of performance to aim for.

Regards

Yann

5. > fpaq0pv4b1

To be specific, it's http://ctxmodel.net/files/fpaq0pv4b3.rar
Also, there're more versions of the rangecoder class with
the same interface, but better precision, e.g. in there:
http://ctxmodel.net/files/o6mix3d1.rar
For example, rc_sh2f is actually a bitwise arithmetic coder.

1. There's no single perfect rangecoder.
At least, its speed optimization always depends on the
actual coding algorithm (model), and popular
implementations are only usable in some very specific cases.
For example, the best feature of the paq coder is its source
size, which allows for manual inlining. But it doesn't
mean that it's the easiest to understand and use, and there're
many implied restrictions:
- probability precision (with >13 probability bits there's
a considerable redundancy due to carry workaround).
(This affects speed - different algorithms can be
implemented with 16bit or even 32bit i/o)
- there's a renormalization loop with complex condition
and up to 5 iterations per bit, which has to be performed
after (or before) each bit.
(Otherwise it's possible to run renorm e.g. once per two
bits, and fully unroll it).
- not compatible with non-binary alphabets
(mainly because of the probability limit though)
(bytewise coding might be faster than bitwise even when
there's a division).

2. Arithmetic coding is basically string enumeration.
We just have to assign a (long) number to each data
string, and there're really many ways to do that.
So for now there might be no efficient implementation
of adaptive huffman coding or some other alternatives,
but it doesn't mean that everyone has to use a
specific rangecoder implementation.

3. Professional codecs mostly have some
multiplication-free table-driven implementations, which
are supposed to be faster
(btw it's these that are patented, not the plain arithmetic
coding from a book).
But redundancy of these is really high, and replacing a
single multiplication with 20 branches is not a good idea,
if it's not a hardware implementation.
Here's an example with some benchmarks:
http://compression.ru/sh/elscoder.rar

4. So for first experiments with arithmetic coding it
seems better to suggest something like:
http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
Rangecoders there have a relatively abstract interface,
and there're examples of both bitwise and bytewise coding
using the same rangecoder, and there're even two
rangecoders used in the same program for precision/speed

5. There're quite a few techniques applicable in various
circumstances, so here're some more sources:
- old bytewise - note the alternative carryless (CRLF) and
a coder with 32-bit i/o (CL-D),
http://compression.ru/sh/aridemo6.rar
http://compression.ru/sh/coders6a.rar
http://compression.ru/sh/coders6c2.rar
(Btw, the rc in 7zip seems to be a derivative of sh1.cdr there)
- some coding tricks:
http://compression.ru/sh/marcdemo.rar (variable output alphabet)
http://compression.ru/sh/parcoder.rar (multiple rcs at once)
http://ctxmodel.net/files/mix_test/BWTmix_v1.rar
http://ctxmodel.net/files/mix_test/B..._gcc_patch.rar
http://ctxmodel.net/files/mix_test/mix_test_vC.rar

6. Thanks for the very detailed inputs. They are going to be very useful.

To explain a bit: I tried to read some "entropy coder" source code some time ago, but couldn't understand how it worked. And copy/pasting wouldn't meet the objective of learning.
In the end, I decided to write my own code, even a bad one, in order to fully understand the process. I guess I will be better able to understand others' source code after completing this exercise.

For the time being, I've got 2 coders that still need to be proven by realizing a compatible decoder. I'm not too far from a result, so I believe there are already some learnings to share.

The first one is "block based", with statistics calculated and transmitted in a header. I guess it's called "semi-static".
The second one is "guessing", with statistics evolving along the transmission by using history. I guess it's called "dynamic".

I'm willing to share two counter-intuitive results from these experiments.

1) Ratio: semi-static wins
Now, I was not at all expecting this result.
The difference is not huge (about ~0.5%), but still in favor of semi-static.
Semi-static has to pay the cost of sending the table; it also cannot adapt too fast, to avoid a large cost for the tables.
I settled on 256KB blocks, with tables representing 0.3% of the compressed data. In this case, semi-static always wins compared to dynamic.

Tentative explanation:

I guess that's because "guessing" is not as efficient as providing exactly the right ratio. For example, if a character is never used, it gets a probability of "0" in semi-static, while it gets at least "1/Range" in dynamic.
In the end, that difference seems to pay better than the table cost.
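A toy illustration of this zero-frequency effect (hypothetical numbers, not from the experiments above): code a block in which only one of 256 symbols ever occurs. A semi-static coder spends ~0 bits per symbol (plus the table), while a Laplace-style adaptive counter that starts every symbol at count 1 keeps reserving probability mass for the 255 symbols that never show up.

```cpp
#include <cassert>
#include <cmath>

// Bits spent by an adaptive counter (all counts initialized to 1) on a block
// of n copies of the same symbol, alphabet size A: before occurrence i the
// symbol's estimated probability is (1+i)/(A+i), so we pay -log2 of that.
double adaptive_bits_single_symbol(int n, int A) {
    double bits = 0.0;
    for (int i = 0; i < n; ++i)
        bits += -std::log2(double(1 + i) / (A + i));
    return bits;
}
```

For a 64KB block this comes out around 2.4 kbits, i.e. a few hundred bytes of pure adaptation cost, the same order of magnitude as a 0.3% table, which may be why the two approaches end up within ~0.5% of each other.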

2) Speed: dynamic wins
Here also, I was expecting the "guessing" to cost more to compute.
But not at all: 10% better speed with dynamic (not a huge difference though).

Tentative explanation:
In fact, "semi-static" requires a first analysis pass over the data block before starting the encoding process.
Therefore, it is a 2-step process, while dynamic goes through the data stream only once.

I was not expecting such results.
Is this in line with other range coder experiments?

Regards

Yann

7. I haven't experimented with static coding, only dynamic. However, compression will depend on the data. If the statistics are uniform then static will win. If you have a mix of different data types then dynamic might win.

8. You're comparing apples to oranges. To clarify, keep in mind that "compression = prediction + modeling". Afaiu you used two different models (predictions), so you actually compared the models' performance. Your result sounds odd to me, and it could simply mean that there's something wrong with your models (or the terms). Of course, results depend on the testing data, too. I guess your "adaptive" model (per context) is a counter like:

- estimate the probability for symbol s: p_s = n_s/N
- update for actual symbol s: n_s = n_s+1
- total count: N = sum (for all s) n_s

which is actually static. You can add periodic rescaling, e.g. if N >= L, where L is some limit. Or an even better way would be to update it like this:

n_s = c*n_s + 1
n_t = c*n_t (for all symbols t!=s)

with c in [0, 1). After selecting a particular structure for your model(s), there's still the requirement of proper parameters. If you don't want to "waste" much time on that, I'd suggest using the periodic rescaling approach and brute-forcing the best value of L.
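The decayed-counter update above can be sketched like this (floating-point for clarity; a real coder would keep scaled integer counts, and the names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Exponential-decay counter model: on each observed symbol s, every count is
// decayed by c and the observed symbol gets +1, so recent history dominates.
struct DecayModel {
    std::vector<double> n;   // per-symbol counts
    double N;                // running total, follows N = c*N + 1
    double c;                // decay factor in [0, 1)
    DecayModel(int alphabet, double decay)
        : n(alphabet, 1.0 / alphabet), N(1.0), c(decay) {}
    double p(int s) const { return n[s] / N; }     // p_s = n_s / N
    void update(int s) {
        for (double& x : n) x *= c;   // n_t = c*n_t for all t
        n[s] += 1.0;                  // n_s = c*n_s + 1
        N = N * c + 1.0;              // the total obeys the same recurrence
    }
};
```

With c = 0.9 the counts approach the fixed point n_s = 1/(1-c) = 10 on a long run of one symbol, so its probability converges towards 1, and the probabilities always sum to 1 since N tracks the sum of the counts exactly.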

Hope that helps, good luck!

9. The dynamic model is constantly updating.
The history length can be programmed.
For this comparison exercise, I selected a 32K sliding window for the probability history.

Compared to the 256K blocks for static, it reacts more quickly to pattern changes.

But apparently, it does not pay off.
Maybe because the probability is based on past data instead of the data ahead, the "guessed" probability for dynamic is less accurate than the calculated one for static...

10. > To explain a bit : i tried to read some "entropy coder"
> source code some time ago, but couldn't understand how
> they work.

Here's my favourite explanation:
Introduction to Rangecoding
Please tell me if it's still not clear enough.

And as to static/dynamic: that's a quirk of your implementation.
Normally static has faster decoding (it's not certain about encoding though)
because it doesn't need to update the model, but dynamic has better
compression (as it can adapt to local data correlations) and slower
decoding.

11. Thanks for the information. This helps.

I had difficulty finding free time these last few weeks, but I managed to find some to work on the decoder yesterday.

This was a nightmare to debug, but in the end, most problems were little flaws in the encoder.
My last bug was an overflow propagation limit (the encoder cannot select a value because the range gets stuck between, for example, 0x3FFFFFFE and 0x40000001),
with a specific torture test (a 6.5GB binary file) producing some long overflow chains, the record so far being 726 digits. I was certainly not expecting such a condition...

Anyway, since then, it seems safe to use, with exact binary comparison on all files I could test. So here it is:
http://img48.xooimage.com/files/5/9/...ck-1518a71.zip

The compression ratio is on par with other 0-context arithmetic coders.
The point here is about encoding speed.
(Range coders are not meant to be used "alone", but as an encoding step for a "real" compression algorithm,
so having an efficient range encoder can help make some faster and better compressors.)

On my system, this range coder encodes at ~110MB/s.
This "seems" reasonable given Bulat's previous inputs.

Originally Posted by Bulat Ziganshin
so 100 mb/sec for the range coder (it's optimized for frequencies represented as x/2^n) and 140 mb/sec for huffman (core2@3.24GHz)

EDIT: my calculation is a bit too optimistic since bytewise coding also needs some time, so 100-120 mb/sec for huffman and 80-90 mb/sec for rc
However, it decodes at only 60 MB/s.

The main difference here is that this version uses precise frequency values.
They are not rounded to x/2^n (which, by the way, is an excellent idea, as it allows shifting rather than multiplication/division).
I guess this *could* explain why decoding performance is so poor...

Now on the technical aspects:
This is a "semi-static" version: data is broken down into blocks of 512KB. Each block gets a header with a precise count of the bytes to follow, from which a frequency table is created.
The encoding is not "byte-specific"; in fact, any number of elements could be used. But here, we are crunching characters.

Regards

12. > range0_block-1518a71.zip

3% redundancy compared to fpaq0pv4B (641M vs 620M on enwik9),
but 4x faster (well, that one is bitwise).
Although, for me the actual encoding speed with a ramdrive
seems to be ~120Mb/s for enwik9 (the coder reports ~135Mb/s)
and, somehow, ~140Mb/s for random data (the coder reports 178Mb/s).

> The point here is about encoding speed.

I don't think that there's any sense in speed-optimizing
a rangecoder without an actual model.
For example, we might be able to afford a relatively complex
rangecoder while using words as "symbols".

> However, it decodes at 60 MB/s, only.

Bytewise decoding requires an extra division, and also a symbol
search, so it's relatively slow.
It's possible to build a lookup table for faster search though,
when it's a static coder.

> Anyway comments are very welcomed.

Seems like 25 CPU clocks per byte, which is pretty cool, taking
into account the multiplications and stuff.
So it's hard to say how far from perfection it actually is.
But I'd suggest trying a better compiler: IntelC or gcc/mingw 4.3+.

> The encoding is not "byte-specific", in fact any number of
> elements could be used. But here, we are crunching
> characters.

Then you can take some simple LZ implementation, and use your
rc for match encoding, with offsets etc as symbols.
I kinda doubt that there's enough precision for that, though.

13. Thanks for the input, Shelwien.

> Bytewise decoding requires an extra division, and also a symbol
> search, so it's relatively slow.
> It's possible to build a lookup table for faster search though,
> when it's a static coder.

As I'm already using a table lookup,
I ended up guessing that the division had something to do with it.

I'm however a bit surprised by its cost.
I hope that the division is using the standard integer CPU unit, and not the FP coprocessor. I'm however totally clueless about how to check that...

> Then you can take some simple LZ implementation, and use your
> rc for match encoding, with offsets etc as symbols.

Yes, that's the next step.

14. Afair, * was 4 cycles and / was 40 cycles on my Duron. It's easy to check: just write some loop that, for example, adds the results of / or *, and unroll it using gcc -O3.

FP / is much faster, believe it or not

15. Thanks for the info, Bulat.

I probably was misled by old memories of processors with no multiplication/division capabilities.
Back in the day, it was necessary to "hand build" the operation in assembler. And division, when properly done, was not much more costly than multiplication.
That being said, it means that both were relatively slow.

But it's not a good enough comparison. I can understand that modern CPU circuitry prefers optimizing multiplication over division...

16. > so 100 mb/sec for the range coder (it's optimized for frequencies represented as x/2^n) and 140 mb/sec for huffman (core2@3.24GHz)

Quick question:

I intend to test the idea of a frequency distribution of the form x/2^n,
because it should help improve decoding speed.

However, I fail to see the difference between this and what could be considered "just another way" to do Huffman coding.

Bulat, your statement seems to imply that there is indeed a difference, as Huffman is described as a faster alternative to a range coder with x/2^n frequencies.
However, I fail to see it...

17. Huffman coding can be compared to arithmetic coding with 2^k/2^n freqs.

18. With huffman, each code occupies 1/2^n of the codespace, i.e. 1/2, 1/4...

With my RC, a code can occupy 3/2^n of the space, 157/2^n of the space and so on. I achieve this by using block-adaptive weights:

At the start of every block, each code has a k/2^n weight, and their total weight is 1, of course. These weights are used for encoding this block.

For counting the new weights, I reduce each weight, say, by 10%, i.e. replace each k with round(0.9*k). Then I increase each weight by 1 when its symbol occurs. I stop the block when the sum of all weights is again 1, and use the new weights for encoding the next block.
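If this reading is right, the block length falls out of the scheme automatically: the ~10% decay drops the total below 2^n, and the per-symbol increments bring it back. A sketch of that interpretation (the names and the flat initialization are assumptions, not Bulat's code):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Block-adaptive weights: at a block start the k[s]/2^n sum to exactly 1;
// to prepare the next block, decay every k[s] by ~10%, then add 1 per symbol
// occurrence until the total reaches 2^n again. That point also ends the block.
struct BlockWeights {
    std::vector<uint32_t> k;
    uint32_t total;              // 2^n
    BlockWeights(int alphabet, int n) : k(alphabet), total(1u << n) {
        for (auto& w : k) w = total / alphabet;  // flat start (assumes A divides 2^n)
    }
    // consume symbols until the decayed weights sum to 2^n again;
    // returns the number of symbols in the block
    size_t next_block(const uint8_t* data, size_t len) {
        uint32_t sum = 0;
        for (auto& w : k) { w = (uint32_t)std::lround(0.9 * w); sum += w; }
        size_t i = 0;
        while (sum < total && i < len) { ++k[data[i++]]; ++sum; }
        return i;
    }
};
```

E.g. with n = 16 and a flat start (each weight 256), the decay leaves 230*256 = 58880, so the first block collects 65536 - 58880 = 6656 symbols.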

19. Thanks for the clear answer, Bulat.
Your updating process is very different from the one I selected during my tests for the dynamic RC;
I should take good note of it, because my results were pretty poor so far.

Anyway, it seems that I transformed my range coder into a Huffman coder by going the 2^n route...
And that's an interesting one too, so I will spend some time investigating.

It seems the time required to build the Huffman tree is really significant; it drastically changes the time proportion between collecting/updating statistics and encoding. Quite an interesting challenge.

A minor question:
Is there a recommended/usual format to send a Huffman tree within a block header?
I've read that one can describe the tree shape by using one bit per node/leaf:
256 elements ==> 511 nodes+leaves ==> ~64 bytes.
It looks interesting, but then, once the tree is built, how do you know which value is in which leaf?
Puzzled...

20. You may find efficient huftree_build procedures in Tornado and 7-zip. Read the appnotes and other docs in http://www.haskell.org/bz/lzh.7z for info about efficient encoding of the huftree and other LZH-related ideas.

UPDATE: and I don't encode the huftree in Tornado; it uses pretty much the same semi-static updating procedure. Of course, it doesn't need to wait until the total weight is exactly 2^n/2^n. With a fast huftree_build routine, you can build huftrees even at decoding time.

21. To send a Huffman tree, you only need to send the probabilities, which is the same as sending the number of bits for each symbol. For example, JPEG sends 16 lists of symbols (bytes): all the symbols with codes of length 1, then all of length 2, up to 16. The decoder assigns codes in numerical order.

22. I finally found some time to look at the coder.
And, to my disappointment, it's really "far from perfect".
At least, the whole carry check can be removed, and
a lot of other stuff simplified (like low, which is completely
unnecessary in the decoder). Also there's no sense in keeping
both cumulative and plain frequencies.
On the other hand, I guess core2 CPUs are really cool, to
process all of that in just 50 clocks.

Code:
```  while( 1 ) {
    *out = c = LUT[value];
    dlow  = rshift * freqtbl[c].ofs;
    range = freqtbl[c].val * rshift;
    low += dlow;
    out1 = out++ + 1;
    if( !((low ^ (range + low)) & 0xFF000000) ) {
      low <<= 8;
      range <<= 8;
      code = *data++ + (code << 8);
      if( !((low ^ (range + low)) & 0xFF000000) ) {
        range <<= 8;
        low <<= 8;
        code = *data++ + (code << 8);
      }
    }
    if( range < 0x100000 ) {
      range <<= 8;
      low <<= 8;
      code = *data++ + (code << 8);
      if( range < 0x100000 ) {
        range <<= 8;
        low <<= 8;
        code = *data++ + (code << 8);
      }
    }
    if( out1 >= bufend ) break;
    rshift = range >> 16;
    value = (code - low) / rshift;
  }```

23. It is very interesting.
I've never programmed this way, so I guess you are using a decompiler tool which translates the compiled result, which means that it takes the compiler's optimizations into account.
And actually, I find this very interesting and instructive.

There are some strange sequences that I would never have imagined, and I guess they were selected by the compiler because they are supposed to be better.

I believe this can be very useful for learning how to produce better code. What kind of tool are you using to get such a result?

And, well, as we are also talking about range coders, what do you mean by "carry check can be removed"?

Best regards

24. > I believe this can be very useful for learning how to produce better code.

Sure. You can also get asm output from a C++ compiler though.

> What kind of tool are you using to get such a result?

http://hex-rays.com/

> And, well, as we are also talking about range coder,
> what do you mean by "carry check can be removed" ?

Your coder is a carryless rangecoder (and likely buggy).
By implementing full carries in the encoder, that
side might become a little more complex and slow,
but you won't need the low ^ (low+range) checks.

Compare that to this:
Code:
``` uint GetFreq (uint totFreq) {
return code/(range/=totFreq);
}
void Decode( uint cumFreq, uint freq, uint totFreq ) {
code  -= cumFreq * range;
range *= freq;
while ( range<TOP ) (code<<=8)+=InpSrcByte(), range<<=8;
}```

25. Thanks for the link, Shelwien;
I will have a look into it.

Regarding a "carry" for the rangecoder,
I was not aware of this concept.

Sure, the proposed "carry" implementation looks much simpler to code,
but could the complexity be hidden in InpSrcByte()?

I guess the "carry" operation may also have a small cost in compression ratio?

Note that I've tested the coder on a number of difficult files, with no problems in the released version, so if there is still a bug remaining, I'm willing to correct it.

Edit: it took me a while to understand how the disassembler works, but now that I've got it running a bit, it certainly seems a valuable tool. Thanks again for the hint, Shelwien!

Regards

26. > Regarding a "carry" for rangecoder,
> i was not aware of this concept.

Which means you didn't check any of my previous links.
Here's a tldr-syndrome-aware attachment with a single version from
coders6c2.

> Sure the proposed "carry" implementation looks much simpler to code
> but could the complexity be hidden into InpSrcByte() ?

No, that can be replaced with *p++. Or getc(file).
In short, the "low" register might contain a value of any size
(only in the form 0FF...FFxxxxxxxx, though), but "code"
is always < range, so there's no need for overflow handling in the decoder.

> I guess "carry" operation may also have a small cost on compression ratio ?

In a way. Proper overflow handling _improves_ compression.

> Note that i've tested the coder on a number of difficult
> files, with no problem on the released version, so if there
> is still a bug remaining, i'm willing to correct it.

You and the coder probably have different concepts of "difficult".

Here's a file which it "compresses" to 12 bytes and can't decompress
(see attach).

27. Thanks for the bug report.

Indeed, it is an obvious corner case (large data blocks with a single repeated character) which was simply not present in the test suite. Thanks for this file, which fills an important gap in the test suite.
Note: range0 is not "compressing" the file to 12 bytes; it is silently exiting due to an error condition.
That means that I'm not properly trapping runtime errors. The proper behavior would be to provide an explicit error message. I still have to learn how to do that properly...

> "low" register might contain a value of any size
> (only in the form 0FF...FFxxxxxxxx, though), but "code"
> is always <range, so no need for overflow handling in decoder.

I have to study that one more thoroughly.
This looks pretty close to my first attempt at a range coder implementation, which failed.
I thought that I had been too optimistic, and that overflow checking was the only way to go.

But in fact, there was a solution.
Now I need some time to learn from the proposed source. I'm slow to learn, but copy/pasting is not my goal here...

28. > Indeed, it is an obvious corner case (large data blocks
> with a single character repetition) which was simply not
> present in the test suite. Thanks for this file, which
> fills an important gap in the test suite.

Actually the aim was to get long FF sequences in the output code;
it was just simpler to generate it like that, but
"single character repetition" is not necessary at all.

> Note : range0 is not "compressing" the file to 12 bytes,
> it is silently exiting due to an error condition.

I know, as I specifically generated that file to exploit
the fault in the coding algorithm.

> That means that i'm not properly trapping runtime errors.
> Proper behavior would be to provide an explicit error
> message. I still have to learn how to do that properly...

C++ exceptions?
Though it's not a good idea when you aim for speed optimization.

>> "low" register might contain a value of any size
>> (only in the form 0FF...FFxxxxxxxx, though), but "code"
>> is always <range, so no need for overflow handling in decoder.

> I have to study that one more thoroughly.

You just have to count the FF values shifted out of
the "low" register, because only these can overflow.

> I though that i had been too optimistic, and that overflow
> checking was the only way to go.

There're quite a few different ways to do that,
and some might be better for encoding speed
(like having a wider "low" and a small unused interval
at the top end), but the precise implementation
is not that much slower anyway, and decoding speed
is usually more important.

> Now i need some time to learn from the proposed source.
> I'm slow to learn but copy/pasting is not my goal here...

That's unfortunate, as I'd prefer to discuss some more interesting
speed optimization techniques, like multiplication-free, runlength,
or parallel coding, but all of these kinda require more than 3 lines
of C++ source to implement.

29. Hello,

As I could not let a buggy piece of software remain, I finally found the time to adapt Shelwien's proposed modifications to Range0, and as expected they do improve performance.

Here is the new version :
http://img28.xooimage.com/files/c/7/...ck-154ea29.zip
It successfully passes the "long trail" test.

Encoding speed is improved by approximately +10%, which puts it a bit above 120MB/s.
Decoding speed is also improved, by approximately +5%, which translates into 63MB/s.

The difference is not as large as the code simplification would suggest.
I guess this is mostly because the bottlenecks are elsewhere, which is especially true for the decoder: the division is the main issue, so with that part left unmodified, any improvement is necessarily dwarfed by the bottleneck.

Anyway, this was nice learning with good suggestions,
so I guess that if I now want good decoding speed,
I will have to go the Huffman way...

Regards

30. Actually I already mentioned one way to improve rangecoding speed,
which is especially applicable when it's based on a static model.
You can just use multiple rangecoders at once:
encode symbol c[0] with rc[0], then c[1] with rc[1], ..., then c[N] with rc[0] again.
It can even be vectorized to some extent,
and it would certainly allow these divisions to overlap.
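The round-robin dispatch can be sketched like this (the structure and names are illustrative; RangeEncoder stands for any coder with this encode() signature, and the stand-in stub exists only to show the pattern):

```cpp
#include <cassert>
#include <cstdint>

// Round-robin dispatch over N independent coders. Consecutive symbols go to
// different coders, so their divisions form independent dependency chains
// that the CPU can overlap.
template <class RangeEncoder, int N>
struct InterleavedRC {
    RangeEncoder rc[N];
    int next = 0;
    void encode(uint32_t cumFreq, uint32_t freq, uint32_t totFreq) {
        rc[next].encode(cumFreq, freq, totFreq);
        next = (next + 1) % N;
    }
};

// stand-in coder that only counts calls; a real range coder (each with its
// own output stream) would slot in here
struct CountingCoder {
    int calls = 0;
    void encode(uint32_t, uint32_t, uint32_t) { ++calls; }
};
```

The decoder mirrors the same round-robin order; each coder keeps its own output stream, stored with lengths or interleaved in fixed-size chunks.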
