Thanks a ton! Can I ask whether, in your opinion, this coder is suitable for use with large alphabets (say, thousands or tens of thousands of symbols, most with low frequency)? What if I switch it to work with 64-bit?
Also, how do you normally prefer these things to be attributed?
> Also, how do you normally prefer these things to be attributed?
Both are public domain.
"subbotin.cpp" - rangecoder by Dmitry Subbotin, frontend/model by Eugene Shelwien.
"sh_v1m.cpp" - rc/frontend/model by Eugene Shelwien.
> Can I ask in your opinion if this coder is suitable to use with large alphabets
> (say thousands or tens of thousands of symbols, most with low frequency)?
Subbotin's rc supports "total" up to 2^16 (see the BOT macro).
sh_v1m goes up to 2^24.
That's why the counter rescale loop is commented out in sh_v1m.
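From memory, the relevant part of subbotin.cpp looks roughly like this - the renorm loop only guarantees range >= BOT, so a "total" above BOT could zero the range:

```cpp
#include <cstdio>
#include <cstdint>

// Sketch from memory of Subbotin's encoder (not a verbatim copy).
// Renormalization only guarantees range >= BOT, and Encode() starts
// with range/totFreq, so any totFreq above BOT could zero the range.
enum : uint32_t { TOP = 1u << 24, BOT = 1u << 16 };

struct SubbotinEncoder {
    uint32_t low = 0, range = 0xFFFFFFFFu;
    FILE* f = stdout;

    void Encode(uint32_t cumFreq, uint32_t freq, uint32_t totFreq) {
        low += cumFreq * (range /= totFreq);   // needs totFreq <= BOT
        range *= freq;
        while ((low ^ (low + range)) < TOP ||  // top byte settled, or
               (range < BOT && ((range = -low & (BOT - 1)), 1))) {  // forced renorm
            putc(low >> 24, f);                // carryless byte output
            low <<= 8;
            range <<= 8;
        }
    }
};
```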
If it's really necessary, I have a 128-bit rc in stegdict - https://github.com/Shelwien/stegdict.../sh_v1m128.inc
It supports "total" up to 2^56.
Stegdict specifically uses it to encode a word choice, with support for billion-entry dictionaries.
It's possible to support up to 2^31 total on a 32-bit system using an old-style arithmetic coder with bitwise output (sketched below),
but I don't think I have an optimized version of that.
As an example, I have "beeari" here: http://compression.ru/sh/coders6a.rar
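Roughly, the renormalization in such a coder looks like this (a CACM87-style sketch from memory, not beeari itself):

```cpp
#include <cstdio>
#include <cstdint>

// CACM87-style renormalization sketch (not beeari itself): a 32-bit
// low/high interval, one output bit at a time, plus a counter of
// pending bits that resolves carries. After renorm the interval is
// at least a quarter of 2^32, which is what permits totals near 2^31
// with purely 32-bit arithmetic.
struct BitEncoder {
    uint32_t low = 0, high = 0xFFFFFFFFu;
    int pending = 0;                           // bits awaiting carry resolution

    void PutBit(int b) { putchar('0' + b); }   // placeholder bit output

    void Output(int bit) {
        PutBit(bit);
        for (; pending; pending--) PutBit(!bit);
    }

    void Renorm() {
        for (;;) {
            if (high < 0x80000000u) Output(0);           // top bit is 0
            else if (low >= 0x80000000u) {               // top bit is 1
                Output(1); low -= 0x80000000u; high -= 0x80000000u;
            } else if (low >= 0x40000000u && high < 0xC0000000u) {
                pending++;                               // underflow case
                low -= 0x40000000u; high -= 0x40000000u;
            } else break;
            low <<= 1; high = (high << 1) | 1;
        }
    }
};
```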
Thanks - I was just trying to figure out if large alphabets with very skewed distributions would suffer from too much re-normalization; I have no experience or intuition with range coders, only a vague theoretical worry. It's funny that compilers have added __umul128 intrinsic but nobody's exposing the 128-by-64 divide that's kind of related... Stegdict may be more suitable for me: I am basically experimenting with a specific offline dictionary builder, so my symbols will basically be "words".
> I was just trying to figure out if large alphabets with very
> skewed distributions would suffer from too much re-normalization -
Yes. As I said, a normal rangecoder can't encode a probability smaller than 1/(1<<24) (or even 1/(1<<16) for faster ones).
But if we renormalized the probabilities so that 1/(1<<24) turned into 1/(1<<16),
we'd also get a 16-bit codelength for this word instead of 24 bits -
and the extra probability mass has to be taken away from the frequent symbols,
so their codes get longer and the output as a whole is redundant.
> It's funny that compilers have added __umul128 intrinsic but
> nobody's exposing the 128-by-64 divide that's kind of related...
They added the __int128 type, but it only exists in x64 mode.
And I didn't see a 128/64 divide for x86 at all, so I had to write one.
On x64 it's not a problem, because the DIV cpu instruction just does it natively.
> Stegdict may be more suitable for me: I am basically experimenting with a
> specific offline dictionary builder, so my symbols will basically be "words".
There might still be an error in the flush handling in the stegdict rc.
It can be easily avoided by simply calling ShiftLow 8 times in rc_Quit (see the sketch below),
but atm there's this complex code that attempts to cut off tail bytes.
So if you ever get a repeatable test case where decoding fails at the end,
just give it to me and I'll fix that.
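I.e. the workaround is just (sketch; rc_Quit and ShiftLow are the stegdict rc's own routines):

```cpp
// The simple flush mentioned above: push every byte of the 64-bit
// "low" out instead of trying to cut off tail bytes.
void rc_Quit() {
    for (int i = 0; i < 8; i++)
        ShiftLow();   // emit one finished byte of "low" per call
}
```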
> They added the __int128 type, but it only exists in x64 mode.
> And I didn't see a 128/64 divide for x86 at all, so I had to write one.
> On x64 it's not a problem, because the DIV cpu instruction just does it natively.
Yes, I meant that there's no intrinsic to access the RDX:RAX div from C/C++ - you have to use inline assembly... It's just an annoyance that in one case you don't need inline assembly and in the other you do. I stopped worrying about 32-bit platforms, in my experiments and professionally, a long time ago, though I get that not everybody has that luxury...
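For reference, the gcc/clang inline-asm wrapper is short (my sketch; newer MSVC versions do expose a _udiv128 intrinsic, for what it's worth):

```cpp
#include <cstdint>

// Sketch of the usual inline-asm wrapper for the x64 DIV instruction
// (gcc/clang syntax): divides the 128-bit value hi:lo by d.
// Requires hi < d, otherwise DIV raises #DE (quotient overflow).
static inline uint64_t udiv128(uint64_t hi, uint64_t lo, uint64_t d,
                               uint64_t* rem) {
    uint64_t q, r;
    __asm__("divq %4" : "=a"(q), "=d"(r) : "a"(lo), "d"(hi), "r"(d));
    *rem = r;
    return q;
}
```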
I've been exploring some of your coders (coders6c.rar maybe? I forget the source) for the CRAM file format. One thing I explored is how well they work in other languages, specifically something simplistic like JavaScript.
It turns out the 32-bit Schindler variant works brilliantly in JavaScript, while the others are around 100x slower because JavaScript has no native 64-bit arithmetic. In C, the Schindler code is maybe 5% slower or so, so I concluded it's an acceptable tradeoff.
Another noteworthy thing there is the "CLD-a" coder.
It uses FPU registers to implement 64-bit integer arithmetic.
Thing is, the long double/extended type has 64 bits of mantissa, so it's possible.
The trick is to always add 1<<63 to the values to keep fractional bits away
(otherwise there could be differences between platforms/compilers - different rounding etc).
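A minimal illustration of the trick (my example, not CLD-a's actual code):

```cpp
#include <cstdint>
#include <cstdio>

// Illustration of the 1<<63 bias (my example, not CLD-a's code).
// x87 long double has a 64-bit mantissa; keeping values in
// [2^63, 2^64) pins the exponent so the ulp is exactly 1 and no
// fractional bits (and no platform-dependent rounding) can appear.
// Only works where long double is the 80-bit extended type.
static const long double BIAS = 9223372036854775808.0L;  // 2^63

int main() {
    uint64_t a = 0x0123456789ABCDEFull, b = 0x0FEDCBA987654321ull;
    long double x = (long double)a + BIAS;   // biased accumulator
    x += (long double)b;                     // exact while a+b < 2^63
    uint64_t sum = (uint64_t)(x - BIAS);
    printf("%d\n", sum == a + b);            // prints 1
    return 0;
}
```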
> It turns out the 32-bit Schindler variant works brilliantly in JavaScript, while the others are around 100x slower because JavaScript has no native 64-bit arithmetic.
Actually, it would be possible to modify any of the coders with a 64-bit "low" the same way:
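For example, the byte output could look like this (a sketch of the idea, not the actual fp_rc code):

```cpp
#include <cstdio>

// Sketch of the idea (not the actual fp_rc code): keep a 48-bit "low"
// in a double. Everything stays below 2^53, so all mantissa math is
// exact integer arithmetic - which is also all JavaScript numbers have.
static const double k2_40 = 1099511627776.0;  // 2^40

double low = 0;        // 48-bit integer value, held exactly in a double
FILE* out = stdout;

void ShiftLow() {
    unsigned c = (unsigned)(low / k2_40);  // top byte (bits 40..47)
    putc(c, out);                          // emit the finished byte
    low = (low - c * k2_40) * 256.0;       // shift remaining 40 bits up
}
```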
As for muldiv functions: unless there's a 64/32 division available, it's easier to do range/=total first.
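That is, something like this (my sketch):

```cpp
#include <cstdint>

// What "range/=total first" means: divide once and reuse the
// quotient, losing only the range%total remainder, instead of a
// wider range*cumFreq/total muldiv.
void Encode(uint64_t& low, uint64_t& range,
            uint32_t cumFreq, uint32_t freq, uint32_t total) {
    uint64_t r = range / total;  // one division per symbol
    low  += r * cumFreq;         // instead of low += range*cumFreq/total
    range = r * freq;
}
```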
Although for binary coding there are options without 64-bit arithmetic or precision loss:
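For example, the usual LZMA-style binary split (my sketch, not necessarily the exact option meant here):

```cpp
#include <cstdint>

// LZMA-style binary split: a 12-bit probability and one 32-bit
// multiply - no division and no 64-bit intermediates.
void EncodeBit(uint32_t& low, uint32_t& range, unsigned p12, int bit) {
    uint32_t bound = (range >> 12) * p12;  // P(bit==0) in 1/4096 units
    if (bit == 0) {
        range = bound;
    } else {
        low  += bound;        // carry handling omitted in this sketch
        range -= bound;
    }
    // renormalization (byte-wise or bit-wise) would follow here
}
```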
> Actually, MSVC added __umul128 intrinsic, while gcc added __int128 type
In any case, the only real problem is 128/64 division using 32-bit arithmetic.
Of course, a version with a shift loop and subtractions would be simple, but also slow.
But using 64/32 divisions to implement 128/64 is tricky.
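For reference, the slow shift/subtract version (my sketch; on x86 the compiler synthesizes the 64-bit ops from 32-bit ones):

```cpp
#include <cstdint>

// Restoring shift/subtract division of a 128-bit dividend hi:lo by a
// 64-bit divisor d - the "simple but slow" variant.
// Assumes hi < d, so the quotient fits in 64 bits.
uint64_t div128_64(uint64_t hi, uint64_t lo, uint64_t d, uint64_t* rem) {
    uint64_t r = hi, q = 0;
    for (int i = 63; i >= 0; i--) {
        int carry = (int)(r >> 63);      // bit 64 of the shifted remainder
        r = (r << 1) | ((lo >> i) & 1);  // bring in the next dividend bit
        q <<= 1;
        if (carry || r >= d) {           // i.e. 2^64*carry + r >= d
            r -= d;                      // mod-2^64 wraparound absorbs carry
            q |= 1;
        }
    }
    if (rem) *rem = r;
    return q;
}
```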
That value defines the number of compressed data bytes kept in the "low" register - 6 bytes, so 6*8=48 bits.
The minimum possible range value is "sTOP", i.e. 1<<40 - it also defines the max total.
The double type has 52 mantissa bits, so we can't use more than that here,
and a non-byte-aligned "low" size would require additional data caching and realignment,
so it's better to use the next byte-aligned value.
When SSE math is used, this coder seems to be compatible across different compilers.
I also tried converting "low" to double, but ended up reverting it,
since it always has to be converted to int anyway (to extract the compressed bytes),
so there's no benefit in using a floating-point type for it.
Well, the CLD coder kept everything in FP because that let it maintain the rc state
without ever flushing to memory, while the main algorithm is all scalar integer.
But SSE2+ supports both integer and FP types, so there's kinda no point
if it adds extra work.
Added rc_v2f - an arithmetic coder with bitwise i/o and freqs up to 1<<31.
Accidentally found it in one of the CM coders on ctxmodel.net.
Output has +4 bytes compared to v1m/fp_rc, because there's no tail cutting - precision is actually similar.
Also added a float-based rc (fp32_rc; the double-based one is renamed to fp64_rc).
It needs freq renorm like "subbotin", since it also supports total only up to 1<<16,
but 32-bit registers might be good for vectorization. I think I'll try that later.
fp64 can also be vectorized, but obviously there'd be 2x fewer instances.
   full   resc64k   name      freqbits  tailcut
-------   435,901   subbotin     16        +
-------   435,595   fp32_rc      16        +
435,398   435,581   fp64_rc      40        +
435,398   435,581   sh_v1m       24        +
435,398   435,580   sh_v2f       31        +    // 435,580 is random luck - the last byte was accidentally FF
435,398   435,581   sh_128       56        +
input size:   0  1  2  3  4
fp32_rc       0  2  3  4  5   // len+1 is because of EOF: the symbol probability is <1/256, so an extra byte is needed for byte alignment
fp64_rc       0  2  3  4  5   // only with 8+ repeats of the same symbol does it actually start compressing
sh_128        0  2  3  4  5
sh_v1m        0  2  3  4  5
sh_v2f        0  2  3  4  5
subbotin      2  2  3  4  5   // 0->2 is because of the 2^32-1 initial range; the other coders use 2^n
- Added sh_128 from stegdict
- Added tail cutting in subbotin and sh_v2f
- Added test with freq rescaling for all coders
- Added output size test for small files
- Split main.inc from the coders, since it's always the same
- Found a bug in sh_128 - decoding error on wcc386; not fixed yet
- Added sh_v1n = sh_v1m without explicit EOF coding.
Instead, it generates the code value at flush in a way that causes decoding uncertainty,
which is used to detect EOF.
Unfortunately, access to the freqtable at the EOF point is necessary for the rc flush.
For coders like ppmd this can be tricky (though not impossible).
http://nishi.dreamhosters.com/u/freqtable_v0.rar
A different frontend for sh_v2f, with a combinatoric model and an extra freqtable file.
main() can be extracted and combined with most coders here (except sh_v1n).
For normal files this approach is actually better, since there's no handling of new symbols or EOF,
and it's easy enough to compress the freqtable with a custom model.
But for short strings it's not really a good idea, since it's hard to compress a 1024-byte freqtable to under 1 bit.