# Thread: Comparison of the recent CM demo coders

1. > How many orders do you mix at once for a unary guess?
> Something around three should be ok?
> I'll try anyway, but I need a point to start.

Well, I mixed all. I think it's also a good starting point.
There's also the question of "update exclusion" - a model for
symbols escaped from orders >N is clearly different from
a normal order-N.

> By accident i found a way to estimate R0 symbols with a hit
> rate of 97-99% (!!!) on all files except acrord32.exe (here
> it was 92%).

Wonder what you tried to say here... because if you're
really able to guess the next symbol with even 92% probability,
then you'd have 0.4bpc on SFC. Don't forget about the Hutter prize.

Did you mean most probable symbol in context?
Is that including only context's statistics, or global statistics?

> *per order 2 context* (after ST) using order 10 mixing? This
> clearly is something > order 2.

Well, I didn't say anything about it being order2.
Of course it's order3, the same way as Matt's fpaq0f is order2
(and fpaq0f2x is order3).

But now you can see the gain from doing so. There are results
for plain order0, MtF+order0, order1, and MtF+order1 in my table,
and you can see how it affects the prediction quality...
for different kinds of order2 counters.

2. Btw, with all the mixing talk, I forgot.
You don't have to mix anything at all actually.
Predictions of any submodels are just another context.

3. If the R0 symbols under order N, N-1,..., N-M match, there's a very high chance for the R0 symbol to match the next one. This special condition is fulfilled for orders 6-2 in around 20% of the input data. The chance for a successful match is very high (97-99%).
So there's an easy way to identify highly redundant data. I'm using this as a SSE context in CMM4 in combination with other similar things.
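The check described above can be sketched roughly like this - a toy stand-in, not CMM4's actual code; the class name, the per-order hash maps, and the order range 2-6 are all illustrative:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// For each order k, remember the last symbol seen after the current
// order-k context (a crude rank-0 stand-in). When all orders agree,
// we have a high-confidence guess for the next symbol.
struct R0Agreement {
    static const int kMinOrder = 2, kMaxOrder = 6;
    std::unordered_map<std::string, char> r0[kMaxOrder + 1]; // per-order R0 table
    std::string hist;

    // Returns true and sets 'guess' when orders kMinOrder..kMaxOrder
    // all predict the same symbol for the current context.
    bool Guess(char& guess) const {
        bool have = false;
        for (int k = kMinOrder; k <= kMaxOrder; ++k) {
            if ((int)hist.size() < k) return false;
            std::string ctx = hist.substr(hist.size() - k);
            auto it = r0[k].find(ctx);
            if (it == r0[k].end()) return false;
            if (!have) { guess = it->second; have = true; }
            else if (it->second != guess) return false;
        }
        return have;
    }

    void Update(char c) {
        for (int k = kMinOrder; k <= kMaxOrder; ++k)
            if ((int)hist.size() >= k)
                r0[k][hist.substr(hist.size() - k)] = c;
        hist += c;
    }
};
```

On periodic or otherwise highly redundant data the agreement condition fires often and the guesses are almost always right, which is exactly what makes it a useful SSE context.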

> Btw, with all the mixing talk, I forgot.
> You don't have to mix anything at all actually.
> Predictions of any submodels are just another context.

Basically you can call prediction merging whatever you want... Do you mean contexts for SSE here?

Somehow I'm unable to download anything from compression.ru?! Could you include these as attachments?

I see that you've listed M012; I thought it was M01.

4. > in around 20% of the input data. The chance for a successful match is very high (97-99%).

E.g. here are counts by rank ( freq(mtf(bwt(book1))) ):
(total size 768771) 0: 386450, 1: 116648, 2: 62043, 3: 41128, 4: 28736, 5: 20872 [...]
As you can see, more than 50% are rank0 there.
Also, here are counts for acrord32:
(total size 3870784) 0: 1956567, 1: 283597, 2: 160721, 3: 111042, 4: 84835, 5: 67897 [...]
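Rank histograms like these come from a plain move-to-front pass. A minimal sketch (the function name is mine; the linear-scan recency list is written for clarity, not speed):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Move-to-front transform: output each symbol's current position in a
// recency list, then move that symbol to the front. Counting the output
// ranks gives histograms like the ones quoted above.
std::vector<int> MtfRanks(const std::string& in) {
    std::vector<unsigned char> list(256);
    for (int i = 0; i < 256; ++i) list[i] = (unsigned char)i;
    std::vector<int> ranks;
    for (unsigned char c : in) {
        int r = 0;
        while (list[r] != c) ++r;        // find current rank of c
        ranks.push_back(r);
        list.erase(list.begin() + r);    // move-to-front
        list.insert(list.begin(), c);
    }
    return ranks;
}
```

Applied after BWT, repeated symbols show up as runs of rank 0, which is where the skewed distributions above come from.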

> > You don't have to mix anything at all actually.
> > Predictions of any submodels are just another context.

> Basically you can call prediction merging what you want... Do you mean contexts for SSE, here?

I mean exactly that you can treat the predictions the same way as any other context.
Of course, you'd (partly) know the origin of this context so it'd be easier to quantize,
but then there's not much difference from any PCM (audio data) model or the like.
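A rough sketch of treating a prediction as context: stretch the probability and quantize it into a small bucket index that can be concatenated with other context bits. The 33-bucket scheme below is just one common choice for SSE/APM tables, not taken from ash or ppmonstr:

```cpp
#include <cassert>
#include <cmath>

// Quantize a submodel's probability into a bucket index 0..32,
// usable as (part of) an SSE context.
int PredictionBucket(double p) {
    double x = std::log(p / (1 - p));        // stretch to logit domain
    if (x < -8) x = -8;                      // clamp extreme predictions
    if (x > 8) x = 8;
    return (int)std::lround((x + 8) / 16 * 32);  // 0..32
}
```

Quantizing in the stretched domain spends more buckets near 0 and 1, where small probability differences matter most for coding cost.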

> I see that you've listed M012, i thought it was M01

Btw, my current results are 1850xxxx (st2rc v3 was 18703791) - optimization in progress.
I mean, the previous results were optimized too, but I'd mixed in another order0 model,
so now it's mix(o0b,mix[ctx](o0a,o1)).
But paq8 -6 result for the same data (st2+e8 output) is 17256448, so there's surely
space for improvement.
Still wonder how it does that, though... Can order2+ _in ST2 output_ be of any help?..
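For readers following along, the nested mix(o0b,mix[ctx](o0a,o1)) can be pictured with a basic two-input logistic mixer like the sketch below - a generic illustration (a single weight pair and a fixed learning rate), not st2rc's implementation; real CM coders keep a weight set per context:

```cpp
#include <cassert>
#include <cmath>

// Two-input logistic mixer: combine two bit probabilities in the
// logit ("stretched") domain and train the weights on the coded bit.
struct Mix2 {
    double w0 = 0.5, w1 = 0.5;   // mixing weights
    double p0_ = 0, p1_ = 0, p_ = 0;  // saved for the update step

    static double st(double p) { return std::log(p / (1 - p)); }  // stretch
    static double sq(double x) { return 1 / (1 + std::exp(-x)); } // squash

    double Predict(double p0, double p1) {
        p0_ = p0; p1_ = p1;
        p_ = sq(w0 * st(p0) + w1 * st(p1));
        return p_;
    }
    // Gradient step reducing coding cost for the observed bit.
    void Update(int bit, double rate = 0.02) {
        double err = bit - p_;
        w0 += rate * err * st(p0_);
        w1 += rate * err * st(p1_);
    }
};
```

Nesting two of these (one selected by a context, one on top) gives the mix(o0b,mix[ctx](o0a,o1)) structure described above.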

5. It's not the same situation with BWT, since in a "lookup-table manner" (looking up the contexts) the ranks change "over time" (with file/buffer position). You would have to modify this:

```
c_i-1 | r0_i-1
c_i   | r0_i    <- this is the currently viewed order 6 context c_i
c_i+1 | r0_i+1
```

When knowing where c_i is located in the input data, probe backwards and see if all truncated order 5-2 contexts which occur first have the same R0 character.

I modified CMM4 to output the counts again (note: I use hashing, so the true number must be higher due to unresolved collisions):

Code:
```
G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\A10.jpg f:\testset\A10.jpg.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 828527/842468 bytes (7.87 bpc)
Speed: 278 kB/s (3505.2 ns/byte)
Time: 2.95 s

85/92 (92.39%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\AcroRd32.exe f:\testset\AcroRd32.exe.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Applying x86 transform.
Encoding: done.
Ratio: 1184968/3870784 bytes (2.45 bpc)
Speed: 369 kB/s (2640.0 ns/byte)
Time: 10.22 s

814190/840443 (96.88%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\english.dic f:\testset\english.dic.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 451105/4067439 bytes (0.89 bpc)
Speed: 479 kB/s (2035.9 ns/byte)
Time: 8.28 s

1661250/1753468 (94.74%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\FlashMX.pdf f:\testset\FlashMX.pdf.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 3650013/4526946 bytes (6.45 bpc)
Speed: 303 kB/s (3220.3 ns/byte)
Time: 14.58 s

281184/287437 (97.82%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\FP.LOG f:\testset\FP.LOG.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 426899/20617071 bytes (0.17 bpc)
Speed: 488 kB/s (1999.3 ns/byte)
Time: 41.22 s

13953959/14063791 (99.22%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\MSO97.DLL f:\testset\MSO97.DLL.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Applying x86 transform.
Encoding: done.
Ratio: 1596934/3782416 bytes (3.38 bpc)
Speed: 288 kB/s (3383.3 ns/byte)
Time: 12.80 s

335236/350267 (95.71%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\ohs.doc f:\testset\ohs.doc.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 748289/4168192 bytes (1.44 bpc)
Speed: 396 kB/s (2462.7 ns/byte)
Time: 10.27 s

2221984/2248349 (98.83%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\rafale.bmp f:\testset\rafale.bmp.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 742157/4149414 bytes (1.43 bpc)
Speed: 349 kB/s (2794.1 ns/byte)
Time: 11.59 s

925509/1051685 (88.00%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\vcfiu.hlp f:\testset\vcfiu.hlp.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 511397/4121418 bytes (0.99 bpc)
Speed: 365 kB/s (2669.0 ns/byte)
Time: 11.00 s

1589466/1631281 (97.44%)

G:\projects\cmm4-080527\bin\Release>cmm4 43 f:\testset\world95.txt f:\testset\world95.txt.tst17
CMM4 v0.1f by C. Mattern  Jun  8 2008
Experimental file compressor.
Init: Order6,4-0 context mixing coder.
Allocated 116494 kB.
Encoding: done.
Ratio: 455580/2988578 bytes (1.22 bpc)
Speed: 347 kB/s (2807.7 ns/byte)
Time: 8.39 s

621905/642724 (96.76%)
```
Looks like the numbers I'd written down from memory were wrong - these are correct. And what I modified is:

Code:
```
...
// One context bit: set when the R0 symbols of the lower-order
// queues all match the top queue's R0 symbol.
rzCtx = 4*rzCtx + 2*
  (rzQueues[0]->Match(rzQueues[1]->Char()) &
   rzQueues[0]->Match(rzQueues[2]->Char()) &
   rzQueues[0]->Match(rzQueues[3]->Char()));

// Guess R0 only if the orders agree and there's been more than one hit.
r0Guess = (rzCtx&2)!=0 && rzQueues[3]->Hits()>1;
...
```
There must be more than a single hit - I forgot this, too. And after knowing the coded character c:

Code:
```
...
r0Hits  += r0Guess && rzQueues[3]->Match(c); // correct R0 guesses
r0Count += r0Guess;                          // total R0 guesses
...
```
In my experience order 0 mainly helps on binary data (mostly x86 data). You are still talking about the results with an e8e9 filter?
Your implementation does the (inverse) transform _always_ forward within a buffer?

As I said, I was surprised that higher orders help at all. After some thought, not that much - at least for only an order 2 context sort, since the separation isn't as strong as BWT (order is limited only by the memory).

6. > It's not the same situation with BWT, since in a
> "lookup-table-manner" (lookup the contexts) the ranks change
> "over time" (file/buffer position).

Maybe I misunderstand something again, but once I made a straight
symbol ranking transform (without the entropy coding part), and learned
that it commonly has a very similar rank distribution to that of MTF(BWT).

> (rzQueues[0]->Match(rzQueues[1]->Char())&
> rzQueues[0]->Match(rzQueues[2]->Char())&

This reminded me of ppmonstr... and my own ash too -
there're things like that in SSE contexts.

> In my experience order 0 mainly helps on binary data (mostly x86 data).

I already said that order0 isn't very important, except for compression of small files.
(Not using it might cause some constant redundancy, like 300 bytes per file).

And executables are not really binary... they're more like containers of different
filetypes - tables, images, text... so you need to have models for
all of these and precisely mix or switch them. And x86 code has known syntax and patterns
which may help to highly improve its compression compared to a "universal" model.

> You are still talking about the results with an e8e9 filter?

Well, yes, since I already implemented it there.

> Your implementation does the (inverse) transform _always_ forward within a buffer?

Yes, check out the ST2.inc

> As I said I was surprised that higher orders help at all.

Well, I was more surprised that MTF(MTF) turned out bad
(I tried it after seeing long runs of 01 values in fp.log's mtf output).
And also both my CM and paq8 seem to compress the direct data
significantly better... my results on ST2 output are actually
better than paq8's results for MTF(ST2).

> After some thought not that much - at least for only an order 2 context sort,

Well, I knew that there are correlations like that, but kinda
hoped that paq8 is inadaptive enough so I could stop there.

> since the separation isn't as strong as BWT (order is limited only by the memory).

Try compressing the BWT output with different orders (ppmonstr or my compressors
allow specifying that) - I guess you'd be even more surprised.

7. BTW, what do Unary Coding/Model/Guessing, etc. mean? Are we talking about a simple char guess, i.e. a single flag - predicted/mispredicted character?
What hides behind this idea?

8. I think you mean unary decomposition?

You encode the current symbol's rank in a queue of probable symbols (in most cases the mtf rank under higher orders) as a unary number. Every bit's probability can be derived from your model.
In most cases you only need to encode a few bits. For example, have a look at the sr2 output (or bwt output): roughly 70% R0 hits, 20% R1 hits and 5% R2 hits. So you encode 95% (just a rough estimate) of the symbols in _at most_ 3 coding steps.
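A sketch of that decomposition - the function name and the flag convention (0 = "not this candidate", 1 = "hit") are illustrative; in a real coder each flag gets its own modeled probability:

```cpp
#include <cassert>
#include <vector>

// Unary decomposition of a rank: emit one "miss" flag per ranked
// candidate until the actual symbol's queue position is reached,
// then a "hit" flag.
std::vector<int> UnaryCode(int rank, int maxRank) {
    std::vector<int> bits;
    for (int r = 0; r < rank; ++r) bits.push_back(0); // "not candidate r"
    if (rank < maxRank) bits.push_back(1);            // "it is candidate 'rank'"
    return bits;
}
```

With the rank distribution quoted above, an R0 hit costs a single coding step and an R2 hit costs three, so the vast majority of symbols need only a few modeled bits.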
