# Thread: Subset model selection problem

1. ## Subset model selection problem

Hello to all members!

I'm working on a small packer for exe intros for the demoscene using neural networks. The code is quite small (asm code <320 bytes) and its efficiency is comparable to the crinkler packer (though often slightly worse, around 1%, on 4k-8k intros). This code will be released open-source once I manage to have something fully working. However, I'm facing some problems that I want to share with you:

1) The first problem relates to the technique currently used in crinkler, an exe-linker-packer for small intros under 4k (which can still work well under 16k). Suppose that we have N models and I want to use a subset K of these models (the most efficient ones) and assign them static weights for a small amount of data to compress (compared to a full neural network, static weights are quite efficient under 16k of data, at least when the type of data doesn't change too much).

n0i and n1i are the respective counters for bits 0 and 1 of the current model (the counters are stored in a hash for each model).

The final prediction is simply: p(1) = sum_{i=0..K}( n1_i * 2^w_i ) / sum_{i=0..K}( (n0_i + n1_i) * 2^w_i )
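As an illustration, the mixing formula can be sketched like this (a minimal sketch, not the actual packer code; `n0`/`n1`/`w` are hypothetical per-model counter and weight arrays):

```cpp
#include <cstdint>

// p(1) = sum_i(n1_i * 2^w_i) / sum_i((n0_i + n1_i) * 2^w_i)
// over the K selected models, with w_i in 0..7.
double predict_one(const uint32_t* n0, const uint32_t* n1, const int* w, int K) {
    uint64_t num = 0, den = 0;
    for (int i = 0; i < K; ++i) {
        num += (uint64_t)n1[i] << w[i];            // n1_i * 2^w_i
        den += (uint64_t)(n0[i] + n1[i]) << w[i];  // (n0_i + n1_i) * 2^w_i
    }
    return den ? (double)num / (double)den : 0.5;  // no statistics yet -> 0.5
}
```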

My questions are :

- How can we determine the best subset model K from N models?

- How can we determine the best weights for the subset model K? (the range for wi is 0-7)

I tried several empirical methods, using errors, prefiltering with a neural network, etc., with no success. What do you think? Is there any good pragmatic paper on this subject? (I'm not a statistician but a programmer, so anything that is not somehow applied in a simplified algorithm is quite hard for me to transcribe.)

(Not to mention that the parameters for the models - in this case, only a byte per model - are taken into account in the final compressed size.)

2) Same question regarding neural networks. I have N models and I want to select the best subset. The only difference is that I don't have to manage the weights. I'm currently using a dichotomy algorithm with sorting on weights (only summing weights > 0) and it's working quite well... but I'm not sure it's the best approach.

2. > n0i n1i are respective counter for bit 0 and 1 for the
> current model (counters are stored in a hash for each
> model)

Well, normally I'd recommend using linear
probability counters instead of such frequencies,
but for such small amounts of data it's really uncertain.
Also not sure that hashing is a good idea... well,
it's almost certain that you won't get any collisions
with 4k of data, but initializing higher-order contexts
would be a pain.

> - How can we determine the best subset model K from N models?
> - How can we determine the best weight for the subset model K ? (range for wi is 0-7)

There's not much choice other than bruteforce actually.
Of course you can use some heuristics for the search, but
I'm sure that there're no good analytic solutions.
Also, imho there's no such thing as a fixed model set to
choose from. Instead, I'd represent it as parametric models,
and then proceed with the optimization e.g. like this:
1. Optimize a single counter (context mask/hash parameters/rate parameters)
2. Optimize another counter with another sample file
3. Add a mixer and optimize it, then reoptimize the parameters of all 3 elements
4. Clone this submodel, add another mixer to mix the versions
[...]
So the point is that instead of having a fixed set of submodels,
I'd just add new tuned submodels until compression stops improving.
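The loop above amounts to greedy forward selection; as a sketch (with `cost` standing in for "compress the sample with this subset and return its size" - here just a toy function, not a real compressor):

```cpp
#include <algorithm>
#include <vector>

// Greedy forward selection: repeatedly add the candidate submodel that
// most reduces the coding cost, stop when no candidate helps.
template <class CostFn>
std::vector<int> greedy_select(int N, CostFn cost) {
    std::vector<int> chosen;
    double best = cost(chosen);
    for (;;) {
        int pick = -1;
        double pick_cost = best;
        for (int m = 0; m < N; ++m) {
            if (std::find(chosen.begin(), chosen.end(), m) != chosen.end())
                continue;                     // already selected
            std::vector<int> trial = chosen;
            trial.push_back(m);
            double c = cost(trial);
            if (c < pick_cost) { pick_cost = c; pick = m; }
        }
        if (pick < 0) break;                  // nothing improves any more
        chosen.push_back(pick);
        best = pick_cost;
    }
    return chosen;
}
```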

> What do you think? Is there any good pragmatic paper on
> this subject? (I'm not a statistician but a programmer, so
> everything that is not somehow applied in a simplified
> algorithm is quite hard for me to transcribe

It's actually easy enough to formulate this as a mathematical
problem - just write down the formula for the coding cost
of some data (with all parameters) and then find the roots
of its derivatives.
An example:
Code:
```We know that there're n0 zeroes and n1 ones in the data,
what would be the best (static) probability to encode this
data?
Coding cost is: f(p)=-n0*log2(p)-n1*log2(1-p)
D[f(p),p] = n1/(1-p)/ln(2) - n0/p/ln(2) == 0
((n0+n1)*p-n0)/(1-p)/p/ln(2) == 0
p = n0/(n0+n1)```
But this approach quickly becomes too complex in practical
cases, so these papers mainly discuss the choice of possible
numeric optimization methods, it seems.
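The closed form is easy to sanity-check numerically, since the coding cost is convex in p and minimal exactly at p = n0/(n0+n1):

```cpp
#include <cmath>

// f(p) = -n0*log2(p) - n1*log2(1-p): bits needed to code n0 zeroes and
// n1 ones when p is the static probability of a zero.
double coding_cost(double p, int n0, int n1) {
    return -n0 * std::log2(p) - n1 * std::log2(1 - p);
}
```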

Well, I didn't really try to build a model for such amounts
of data, so it might be better to get some experience first.
Can you provide a sample file (the data you're actually
trying to compress, not a raw obj file or something like that),
with compressed size estimations for it (yours/crinkler's)?

3. Originally Posted by Alexandre Mutel
Hello to all members!

I'm working on a small packer for exe intros for the demoscene using neural networks. The code is quite small (asm code <320 bytes) and its efficiency is comparable to the crinkler packer
Hello! Sounds very interesting. Will it be the natural packer (I mean working with already existing .exe file) or will it behave like Crinkler (I mean linker combined with packer so you need the sources) ?

4. Originally Posted by Shelwien
Well, normally I'd recommend to use the linear
probability counters instead of such frequencies,
but for such small amounts of data its really uncertain.
Also not sure that hashing is a good idea... well
its almost certain that you won't get any collisions
with 4k of data, but initializing higher-order contexts
would be a pain.
I tried several update methods (paq8, balz... etc.), but the most efficient method with a 1-byte n-counter is still the non-stationary update I found from Matt Mahoney (although I made a small change to slightly improve it).
I should also have been a bit more precise: in fact, like crinkler, I use a single model, a generic match-sparse order-8 model. It means that I have 256 contexts to choose from (and that's why a subset is very important, as most of the contexts are generating "noise"). Concerning the hash, its size is often huge (>200M), because with the structure of exe data you can get lots of hash collisions (and like crinkler, for efficiency and simplicity of code, collisions are not handled).

Originally Posted by Shelwien
There's not much choice other than bruteforce actually.
oh, ok. I tried some bruteforce, but I haven't found the best method yet... and it's quite cpu intensive (256 contexts, weights ranging from 0-7). I need to figure out how they did it in crinkler, it's really impressive...

Originally Posted by Shelwien
Well, I didn't really try to build a model for such amounts
of data, so it might be better to get some experience first.
Can you provide a sample file (the data you're actually
trying to compress, not a raw obj file or something like that),
with compressed size estimations for it (yours/crinkler's)
You'll find attached a sample file extracted from an intro we did (it's called raytro). This sample file is an exact copy of what crinkler compresses and puts in the final exe.

I have to mention that binary x86 code is stored in the first 13532 bytes. The rest of the data is binary structures, text, GLSL shader source... etc. Crinkler uses different context selections for code and data. With the neural network I'm using, this is automatic (it shares the same selection for code and data), but I still reset the hash at 13532 to improve the compression (1 to 2%).

Concerning the compression results on this sample file:
• The size of this file compressed with crinkler is around 9700 bytes (+230 bytes for the decompressor).
• With my compressor, it's around 9560 bytes (+310 bytes for the decompressor).

But because my decompressor takes 80 bytes more than crinkler's, it's only 40 bytes more efficient than crinkler! I found crinkler to be really impressive using static models for code and data (the asm code is really unbelievable).

Originally Posted by Skymmer
Hello! Sounds very interesting. Will it be the natural packer (I mean working with already existing .exe file) or will it behave like Crinkler (I mean linker combined with packer so you need the sources) ?
Thanks! I expect to make it work directly on the exe, but it will need the exe to be compiled with relocation information (and this is not always the case). This is required in order to be able to reorder/strip code & data in the exe, thus improving compression. Crinkler reorders code & data sections at link time.

5. trying to optimize the coders for this intro binary
(mix_v3, mix_test, m1_03b)
and somehow it doesn't go that well.
Sure, it's to be expected that x86 code and an image would require specific models,
but I kinda expected to reach the same level at least
...not that it's that far off, though.
I'm just being lazy and trying to avoid adding hashed contexts to mix_test.

Current stats: ((*)=intermediate results)
paq8p: 8609
ppmonstr vJ -o16: 9635
ccm130c: 10790
E8 + ppmonstr vJ -o16: 9498
E8 + ppmonstr vJ -o2: 10032
E8 + mix_test 09h2 tuned: 10275 (*)
E8 + mix_test 09h2 + BMF2: 8979(*)+1164=10143
E8 + mix 3e22: 10483 (*)
E8 + m1_03b-1 + BMF2: 10048+1164=11212

1. I kinda expected to get an already filtered input,
but seems that E8 filtering still helped
2. mix_test coder is actually a BWT postcoder and
is limited at o2 max.

6. > I should have been also a bit more precised: in fact, i
> use, like crinkler, a single model, a generic match-sparse
> 8-order model. It means that i have 256 contexts to choose
> from (and that's why a subset is very important, most of a
> contexts are generating "noise").

Using a bit mask would certainly help, so I'd not stop
at just 256 contexts

>> There's not much choice other than bruteforce actually.
> oh, ok. I tried some bruteforce, but i haven't found yet
> the best method... and it's quite cpu intensive (256
> contexts, weight ranging from 0-7). I need to figure out
> how they dit it in crinkler, it's really impressive...

Possibly there're just a few precalculated profiles,
and full search is not performed.

Also 256 is not much (or 256*256), and that hash can
be replaced with a direct scan (I mean, instead of
storing the counters, they can be calculated from
scratch each time - without using any memory).
Like that it would be easily possible to calculate
the code length for each parameter set at once and find
the best one... or actually it should be possible
to even optimize a few sets for data segments with
different properties.

> I have to mention that binary x86 code is stored in the
> first 13532 bytes. The rest of data is binary structure,

Also there's a 12k 64x64x3 image there (or texture?),
which allows to gain a few 100s of bytes with an image model.

> * The size of this file compressed with crinkler is around 9700 bytes
> (+230 bytes for the decompressor).
> * With my compressor, it's around 9560 bytes
> (+310 bytes for the decompressor).

Well, my recent coder (mix_test) is pretty simple as well,
although dunno how much of that unpacker code is required for

> But because my decompressor is taking 80 bytes more than
> crinkler, it's only 40 bytes more efficient than crinkler!

Anyway to me it seems more interesting to use different
models for different types of data in this case, even
if the decoder size would be a few times larger.
It won't work for small binaries of course, but 32k
input is already ok I think.

> I found crinkler to be really impressive using statics
> models for code and data (the asm code is really
> unbelievable ).

Well, x86 code compression certainly can be improved
a lot with proper parsing.
http://www.compression.ru/ds/disasm32.rar
Also this might be useful too:
http://ctxmodel.net/files/PPMd/segfile_sh2.rar

And then, maybe I'll find time to add higher order
contexts to mix_test (it uses bitwise context masks
and adaptive mixing of 8 contexts btw), and we'd see then.

Unfortunately, BWT doesn't help in this case

Btw, a random idea: if something like paq rangecoder
is used (the one with ((x1^x2)&0xFF000000)==0 renormalization)
then there's a way to cut off some bytes from the decoder,
as non-carryless decoders are simpler than that.

7. Originally Posted by Shelwien
1. I kinda expected to get an already filtered input,
but seems that E8 filtering still helped
Good to know, I'll check whether it's worth adding (compressor code size vs compressed size gain).

Originally Posted by Shelwien
Using a bit mask would certainly help, so I'd not stop
at just 256 contexts
Yes, it's a byte mask which describes which of the last bytes to take into account for the hash counting (1 bit on => 1 byte of the history is processed).
Do you think that with an order-16 context, I could get some improvement?
Not sure about that, but well, it needs a test...

Originally Posted by Shelwien
Possibly there're just a few precalculated profiles,
and full search is not performed.
The first method I tried was assigning a weight based on the order of the context (like in the first versions of paq), and then changing them slightly. It works quite well, but I was not able to find the same models/weights as crinkler (and mine were less efficient).
Originally Posted by Shelwien
Also 256 is not much (or 256*256), and that hash can
be replaced with a direct scan (I mean, instead of
storing the counters, they can be calculated from
scratch each time - without using any memory).
Like that it would be easily possible to calculate
the code length for each parameter set at once and find
the best one... or actually it should be possible
to even optimize a few sets for data segments with
different properties.
Huh, I don't fully understand your suggestion.
If the hash had to be recalculated from scratch for each bit, it would take ages to compress/decompress... and decompression needs to be reasonably fast (but not compression)

Originally Posted by Shelwien
Also there's a 12k 64x64x3 image there (or texture?),which allows to gain a few 100s of bytes with an image model.
...
Anyway to me it seems more interesting to use different
models for different types of data in this case, even
if the decoder size would be a few times larger.
It won't work for small binaries of course, but 32k
input is already ok I think.
Yep, you're right... but for such a size, I'm not sure that the cost of an image model (the code size) would give enough gain compared to the basic match-sparse model... moreover, images are not really common in intros and it's preferred to generate textures procedurally (although in the raytro, we didn't have time to do it).
It's probably worth it for x86 code (this is actually what is done in kkrunchy). Demos also use GLSL/HLSL shader code (like C code); I'm wondering if a text/code model would be relevant here...

Originally Posted by Shelwien
Btw, a random idea: if something like paq rangecoder
is used (the one with ((x1^x2)&0xFF000000)==0 renormalization)
then there's a way to cut off some bytes from the decoder,
as non-carryless decoders are simpler than that.
I guess I'm using what you are suggesting. The C version I'm using is based on an improved version of code generated with flavor (http://flavor.sourceforge.net, I found it quite helpful by the way, nice program) and 33-bit arithmetic precision to simplify the renormalisation test against 0x80000000 (in asm, it's just a sign test, really simple)
Code:
```int Decoder::Decode(Bitstream &_F_bs, double pr1) {
    int x = 0; // the decoded symbol

    // renormalization
    while (_F_R < 0x80000000) {
        _F_R <<= 1;
        _F_V <<= 1;
        _F_V += _F_bs.getbits(1);
    }

    unsigned int r_1 = (_F_R * pr1); // part of the range allotted to symbol 1
    _F_R = _F_R - r_1;
    if (_F_V >= _F_R) { // we have the symbol 1
        _F_V = _F_V - _F_R;
        _F_R = r_1;
        x++;
    }

    return x;
}```

8. >> but seems that E8 filtering still helped
> Good to know, i'll see check if it's worth to add it
> (compressor code size vs compressed size gain)

In fact, I meant this: http://ctxmodel.net/files/E8flt-v2.rar
And a similar filter can be implemented with a 20-byte asm loop,
while allowing to gain ~130 bytes, so I think it would be
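For reference, the core of such an E8-style transform is small in C as well. A bare sketch (real filters like the one linked also handle E9/JMP and guard against false positives; this assumes a little-endian host, as on x86):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// E8 filter: after each 0xE8 (CALL rel32) opcode, replace the relative
// 32-bit displacement with the absolute target address, so repeated
// calls to one function become identical byte patterns.
void e8_forward(uint8_t* buf, size_t n) {
    for (size_t i = 0; i + 5 <= n; ++i) {
        if (buf[i] == 0xE8) {
            int32_t rel;
            std::memcpy(&rel, buf + i + 1, 4);     // rel32, little-endian host assumed
            int32_t tgt = rel + (int32_t)(i + 5);  // target = rel + address of next insn
            std::memcpy(buf + i + 1, &tgt, 4);
            i += 4;                                // skip the rewritten operand
        }
    }
}
```

The decoder applies the inverse (subtract the position instead of adding it), which is why the loop fits in a handful of asm instructions.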

> > Using a bit mask would certainly help, so I'd not stop
> > at just 256 contexts
>
> Yes, it's a byte mask which describre the last byte to
> take into account for the hash-counting (1 bit on => 1
> byte to process in the history).

I meant that you have some prefix bytes, for example:
= F9 D8 24 95 E0 26 43 00
& FF 00 FF 00 FF 00 FF 00
= F9 00 24 00 E0 00 43 00
while I suggested something like this:
& 00 F0 80 80 80 F0 FF FF
= 00 D0 00 80 80 20 43 00

It's commonly very useful for text compression
(eg. 0DF masks) and even more so for binary data,
as x86 instructions consist of bitfields.
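In code the difference is just a per-byte AND with an arbitrary mask, instead of the all-or-nothing FF/00 selection:

```cpp
#include <cstdint>

// Keep only the masked bits of each history byte, so individual
// bitfields of x86 opcodes can be selected for the context.
void masked_context(const uint8_t* hist, const uint8_t* mask,
                    uint8_t* ctx, int order) {
    for (int i = 0; i < order; ++i)
        ctx[i] = hist[i] & mask[i];
}
```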

Btw, it also helps to optimize the hash function parameters.

> Do you think that with a 16 order context, i could have some improvement?
> Not sure about that, but well, it needs a test...

It seems that ppmonstr only gains ~13 bytes from o16 vs o8,
so probably not.

>> Also 256 is not much (or 256*256), and that hash can
>> be replaced with a direct scan (I mean, instead of
>> storing the counters, they can be calculated from
>> scratch each time - without using any memory).
>> Like that it would be easily possible to calculate
>> the code length for each parameter set at once and find
>> the best one... or actually it should be possible
>> to even optimize a few sets for data segments with
>> different properties.

> Huh, i don't fully understand your suggestion.
> If hash had to be recalculated from scratch for each bit,
> it would take ages to compress/decompress... and the
> decompression needs to be reasonable (but not the
> compression)

1. Of course the actual compression and decompression
can be implemented with a hash, I talked about the
parameter optimization.
2. To recalculate the counter from scratch you need
to find all the context occurrences from the start of
data and update the counter with each one.
I agree that this might be slow with megabytes of data,
but how many occurrences of an order8 context can there
be in 30k of data?
3. The point of doing this is the reuse of intermediate values.
For example, you would need 64k full sequential passes
to try all mask/"weight" combinations, but only 256 passes
with 256 weight loops in each of them with such a parallel approach.
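Point 2 can be sketched as follows - the statistics of the current masked context are re-counted from all earlier occurrences on every query, so no counter memory is needed during the parameter search:

```cpp
#include <cstddef>
#include <cstdint>

// Count earlier occurrences of the masked k-byte context ending at
// `pos`. Nothing is stored anywhere - everything is re-derived from
// the data itself, which is fine for ~30k during offline search.
size_t count_matches(const uint8_t* data, size_t pos, int k, uint8_t mask) {
    size_t hits = 0;
    for (size_t j = (size_t)k; j < pos; ++j) {  // j = candidate position
        bool same = true;
        for (int i = 1; i <= k && same; ++i)    // compare the k masked bytes before it
            same = ((data[j - i] ^ data[pos - i]) & mask) == 0;
        hits += same;
    }
    return hits;
}
```

This is O(n^2) per parameter set, but the inner comparison is shared across all weight candidates, which is where the 256-passes-instead-of-64k saving comes from.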

>> Also there's a 12k 64x64x3 image there (or texture?), which
>> allows to gain a few 100s of bytes with an image model.

> Yep, you're right... but for such a size, i'm not sure
> that the cost of an image model (the code size) would give
> enough results compare to the basic match-sparse model...

In my tests, the image model allowed to cut off ~300 bytes.
And its decoder would be basically the same as for other data,
it just has to include some byte from the previous row into
context.
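A minimal sketch of that context extension, assuming a raw 64x64 RGB layout with a row stride of 64*3 = 192 bytes (`image_context` is a hypothetical helper, not BMF's actual model):

```cpp
#include <cstddef>
#include <cstdint>

struct PixCtx { uint8_t left, above; };

// For raw interleaved RGB data, the same-channel neighbours are 3 bytes
// back (previous pixel) and one row stride back (previous row).
PixCtx image_context(const uint8_t* img, size_t pos, size_t row_bytes) {
    PixCtx c = {0, 0};
    if (pos >= 3)         c.left  = img[pos - 3];
    if (pos >= row_bytes) c.above = img[pos - row_bytes];
    return c;
}
```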

> moreover, image are not really common in intros and it's
> preferred to generate procedurally the texture (although
> in the raytro, we didn't have time to do it)

Well, somehow I think that there're plenty of similar cases.

> It's probably worth for x86 code (this is actualy what is
> done in kkrunchy).

Sure... its also done in durilca:
http://www.compression.ru/ds/durilca.rar

> Demos are also using GLSL/HLSL shader code (like c code),
> i'm wondering if a text/code model would relevant here...

Of course, any data with syntax restrictions would benefit
from a custom model for compression.
However, in the case of your intro, the script compresses
to ~800 bytes, so it won't be quite easy to write a full
expression parser which would be smaller than a possible gain.
But such a script can be shrunk with a lossy preprocessor
(all identifiers to 1 symbol, numeric constants optimization:
0.004 -> 4E-3, 20000.0 -> 2E4, 0.0 -> .0 etc) and then
charset restrictions and bracket matching can be added to the
compressor's model.
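The numeric-constant part of such a lossy preprocessor can be sketched like this (a toy under stated assumptions: it keeps only one significant digit via `%.0E`, which is lossy by design, and simply picks the shorter spelling):

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Pick the shorter of the plain and the exponent spelling of a literal,
// e.g. 20000.0 -> "2E4", 0.004 -> "4E-3". Lossy: %.0E keeps a single
// significant digit.
std::string shortest_literal(double v) {
    char plain[32], sci[32];
    std::snprintf(plain, sizeof plain, "%g", v);   // default compact form
    std::snprintf(sci, sizeof sci, "%.0E", v);     // e.g. "2E+04"
    // strip the zero-padded exponent printf emits: "2E+04" -> "2E4"
    std::string s = sci;
    size_t e = s.find('E');
    int exp = std::atoi(s.c_str() + e + 1);
    s = s.substr(0, e) + "E" + std::to_string(exp);
    std::string p = plain;
    return s.size() < p.size() ? s : p;
}
```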

> I guess i'm using what you are suggesting.

No, I was talking about something like sh_v1m in
http://ctxmodel.net/files/mix_test/mix_test_v9.rar

> The C version i'm using is based on an improved version

Well, this doesn't have the carry workaround, which is good,
but it's based on bitwise input, which is not quite good
(though I guess it can be ok if it uses the BT instruction).
Also, it reminded me of an old implementation of mine
which kept the rangecoder state on the FPU stack
(see cld-a2 in compression.ru/sh/coders6a.rar),
but actually I don't think that using any floating point
is good for speed optimization - FP instructions are
longer (both FPU and SSE) and some operations require
several instructions where a single one is enough for integers.

9. Originally Posted by Shelwien
In fact, I meant this: http://ctxmodel.net/files/E8flt-v2.rar
And a similar filter can be implemented with a 20-byte asm loop,
while allowing to gain ~130 bytes, so I think it would be
Thanks! I'll have a look at it... i agree that it's quite small to code in asm and worth it.
I meant that you have some prefix bytes, for example:
= F9 D8 24 95 E0 26 43 00
& FF 00 FF 00 FF 00 FF 00
= F9 00 24 00 E0 00 43 00
while I suggested something like this:
& 00 F0 80 80 80 F0 FF FF
= 00 D0 00 80 80 20 43 00
Well, it would be great... but I'm afraid it would take too much space...
I have only one byte to describe the last 8 bytes... With your solution, each context parameter would take 8 bytes. Could be worth it for a 64k... but not sure in the 4k-16k range...
I guess the best opportunity for my compressor is to add a prefilter for x86 code... that's where I can get a good gain vs code size.

It seems that ppmonstr only gains ~13 bytes from o16 vs o8,
so probably not.
Ok, good to know.

Well, this doesn't have the carry workaround, which is good,
but it's based on bitwise input, which is not quite good
(though I guess it can be ok if it uses the BT instruction).
Also, it reminded me of an old implementation of mine
which kept the rangecoder state on the FPU stack
(see cld-a2 in compression.ru/sh/coders6a.rar),
but actually I don't think that using any floating point
is good for speed optimization - FP instructions are
longer (both FPU and SSE) and some operations require
several instructions where a single one is enough for integers.
Yes, it uses the BT instruction, and the code is really small (<45 bytes). It wouldn't be small with sh_v1m. I agree that working on bits is not efficient in speed... but it's efficient in size, and that's what I'm looking for!
I'm using the FPU mainly because it's much easier to calculate the neural network train/stretch functions, and the code is smaller too. I don't have any precalc, shifting... etc., and it saves a lot of space!
For FPU vs fixed point, well, I don't fully agree with you. In demos, using lots of calculation for 3D or a soft synthesizer, integer fixed point has always been slower than the FPU since... the Pentium, if I remember... With large calculation code, the FPU is much better armed with its stack (I'm talking about plain +*/-; other instructions like log/sin/cos are slow... but it's not comparable to x86 integer).
So with plain integer fixed point, you have to play with push/pop registers, the memory stack, shifting... etc. In the end, this is usually slower (well, I'm sure you could find some examples that are faster... but for what I have tested in asm code over the past few years, it was usually much slower with integer - and the code was bigger too...).

10. Well, I tried to sketch it, and here's a 41 byte version.
Also there're some places to tweak, so its probably not the real minimum.

Update: a 37-byte version with one more trick.

Update2: 36 bytes

Update3: 35 bytes %)

11. > Yes it use BT instruction, and the code is really small
> (<45 bytes). It would'nt be small with the sh_v1m.

Well, it is

> I agree that working on bits is not efficient in speed... but
> efficient in size, and that's what i'm looking for!

Not really

> I'm using the FPU mainly because it's much easier to
> calculate neural networks train/stretch functions and the
> code is smaller too. I don't have any precalc, shifting...
> etc. and it saves a lots of space!

Actually there're linear mixers that don't need any precalc.
And the formal stretch/squash are not the best compression-wise.
And as to precalc, I think it will be well compensated by other things.

> For FPU vs fixed point, well, i don't fully agree with
> you. In demos, using lots of calculation for 3D or soft
> synthezizer, integer fixed point has been always slower
> than fpu since... pentium if i remember... With large
> calculation code, the FPU is much better armed with its
> stack (i'm talking about plain +*/-. Other instructions
> log/sin/cos are slow... but it's not comparable to x86
> integer).

Dunno, it all depends on programming style I guess.
I've tried to use float-point for simplicity in many
projects, and then I had to convert them to integer because
of portability issues and it always becomes much faster after that.

> some example that are faster... but for what i have tested
> on asm code for the past few years, it was usually much
> slower with integer - and the code was bigger also...).

Well, my condolences

12. Originally Posted by Shelwien
Well, I tried to sketch it, and here's a 41 byte version.
Also there're some places to tweak, so its probably not the real minimum.
Update: a 37-byte version with one more trick.
Update2: 36 bytes
Update3: 35 bytes %)
Oops, my apologies, I didn't see the if (DECODE) in sh_v1m! Your code is really nice!
Eheh, I was also miscounting my version, which is around 39 bytes... it's the switch from float multiplication to integer and vice versa that adds those bytes! But the version you used is much more elegant, because the compressor code is also very simple (mine is horrible, generated from flavor, and using 33 bits it's huge compared to yours!). Your version looks like the one from crinkler... but the renormalization seems to be done at sTOP=0x80000000. For optimization, it would be more relevant to compare the whole program. With limited registers, things get a little more complicated over the whole code (push/pop, initialization...). But still, the version you pointed me to is the one I was looking for with no success... that's why I went to flavor...

Originally Posted by Shelwien
Actually there're linear mixers that don't need any precalc.
And the formal stretch/squash are not the best compression-wise.
And as to precalc, I think it will be well compensated by other things.
Oh, it would be great if I could have adaptive weights without the ln2/2^ stuff! Can you point me to an example?

Originally Posted by Shelwien
Dunno, it all depends on programming style I guess.
I've tried to use float-point for simplicity in many
projects, and then I had to convert them to integer because
of portability issues and it always becomes much faster after that.
Context is probably relevant too... but well, I should do some tests again... perhaps I missed something when trying to convert from floating point to fixed point...

13. > Oops, my apologize, i didn't see the if (DECODE) in sh_v1m!
> Your code is really nice!

Btw, it became 34 bytes using shifts by CL instead of 8.
And probably can be improved even further... eg. there're
ways to remove one... or all... jumps.

> compressor code is also very simple (mine is horrible,
> generated from flavor, and using 33 bits, it's huge
> compare to yours!).

As to encoder simplicity, this is pretty bad compared
to carryless rangecoders... for example, see
rc_mm.inc in http://ctxmodel.net/files/MIX/mix_v3.rar
...or crlf_m from http://compression.ru/sh/coders6c2.rar
But such an implementation has lower precision, and
its decoder is a little more complex than a non-carryless one.

> Your version looks like the version from crinkler... but
> the renormalization seems to be done at sTOP=0x80000000.

rc_sh2f.inc in mix_v3.rar is like that.

> For optimization, it would be more relevant to compare the
> whole program. With limited registers, things are getting
> a little bit more complicated on the whole code (push/pop,
> initialization...).

Dunno, I guess after the 6502 with its 3 8-bit registers this
still seems a lot to me.
Also push/pop are really cool for code size optimization -
they are single-byte and can be used to manipulate user data,
as there's no risk of an interrupt messing up the stack in modern OSes.
For example, if I'd need to save the rangecoder state in
register sets.

> Oh, it would be great if i could have adaptive weights
> without the ln2/2^ stuff!. Can you point me to an example?

Well, that mix coder I linked does exactly that (sh_mixer.inc).
Also I think its explained in http://ctxmodel.net/rem.pl?-12
It seems that the logistic mixer has somewhat better adaptation,
but it probably doesn't matter in your case.
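For reference, a plain linear mixer with a gradient-style weight update needs no tables at all; this is the generic textbook scheme, not sh_mixer.inc itself:

```cpp
#include <cstddef>

// Linear mixing: p = sum(w_i * p_i); after coding each bit, weights are
// nudged toward the inputs that predicted it (squared-error gradient).
double mix_and_update(double* w, const double* p, int n, int bit, double rate) {
    double pm = 0;
    for (int i = 0; i < n; ++i) pm += w[i] * p[i];
    if (pm < 0.001) pm = 0.001;  // clip away from 0/1 so the
    if (pm > 0.999) pm = 0.999;  // coder never sees p=0 or p=1
    double err = bit - pm;
    for (int i = 0; i < n; ++i) w[i] += rate * err * p[i];
    return pm;
}
```

A `rate` around 0.01-0.05 is a typical starting point; for such tiny inputs it would need tuning on the actual data.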

> Context is probably relevant too...

I meant fairly relevant stuff I think...
Like http://ctxmodel.net/files/ASH/ash07.rar (float)
vs http://www.ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
Or there were some codecs with FFT etc.
Anyway, I talked about computation-heavy stuff of course.

> but well, i should do some test again... i missed perhaps
> something when trying to convert from floating point to
> fixed point...

Another possibility is bytewise modelling.
For example, ppmonstr compresses your intro binary
to 9635 bytes and durilca (=disasm32+ppmonstr) to 8753 bytes,
which is pretty similar to paq-based results, but
doesn't use any bitwise processing.

Well, such bytewise coders have much more complex algorithms
than binary ones, but which approach is actually more compact
is unknown.

14. http://ctxmodel.net/files/mix_test/mix_test_vA.rar

Experiment with o8 CM tuning for intro compression

38424 intro.exe.unpack
8609 paq8p -7
9635 ppmonstr vJ -o16
9498 E8 + ppmonstr vJ -o16:
~9700 crinkler
~9560 ???

09h2x2 09h2x3 09h2x3a
9805 9976 9970 // whole original file
9671 9847 9840 // whole file filter with E8
9360 9452 9443 // E8 + bmf2 image encoding

09h2x2 - slower version with dynamic allocation
09h2x3 - 09h2x2 retuned to reduce memory usage
09h2x3a - 09h2x3 with static counter arrays

--------------

Actually it was an experiment on extending the mix_test contexts
up to order8 without involving hashes - the tricky part being
to use the same code both to access the dynamically allocated trees and
the static tables.

Also, I'd probably post another version later, with somewhat better
results - this one isn't fully optimized yet.

15. Code:
```intro.exe.unpack   38 424

PAQ8px -4          8 558
PAQ8px -5          8 550
PAQ8px -6          8 547
PAQ8px -7          8 549

PAQ8k -4          8 492
PAQ8k -5          8 495
PAQ8k -6          8 496
PAQ8k -7          8 496

PAQ8k2 -6          9 209```
It's interesting that for PAQ8px, the -6 level is better than -7, and for PAQ8k, the -4 level gives the best results. By the way, all results can be 15 bytes smaller if the intro.exe.unpack name is changed to some 1-symbol name.
So 8 477 is achievable.

Thanks, I was too lazy to do that myself.
However, it's kinda unlikely that a full paq8 decoder would fit into
the extra ~1k it gained.

Btw, I found that test.bat in mix_test_vA.rar would cause
an exception when started on a cpu without SSE3.
Appears that Shkarin for some reason compiled bmf for SSE3.

17. Originally Posted by Skymmer
So 8 477 is achievable.
Well, the size of a PAQ8k decompressor would be >>1.5k even in assembler! So it's not really achievable... At this size, we need to compare the size of the compressed data plus the size of the decompressor.

Thanks Shelwien for the test also. I did a simple test with your E8 filter and with adding the upper pixel in the 64x64x3 image; here are the results with crunky (my compressor, we are going to call it like this as of now):
- 9560 normal
- 9424 with E8 filter
- 9287 with E8 filter + pixel above 64x64x3

Adding the E8 filter to the decompressor is around 20 bytes, adding the pixel above is less than 10 bytes. So you've convinced me that it's worth doing!

Although, it seems that BMF is doing more than just the pixel above... what does it do exactly?

18. > 9287 with E8 filter + pixel above 64x64x3

Well, I guess I'd like to know how many contexts you mix there,
as my result with 8 contexts seems significantly worse,
even though it's supposedly better optimized than yours
(because of bitwise masks, if nothing else).
Also whether you have APM/SSE stages, and/or something like
paq's "exemodel".

> Although, seems that BMF is doing more than just the pixel above...
> what it does exactly?

Its not open source. But I suppose its something similar to this:
http://www.compression.ru/ds/bcdr.rar

Btw, paq8p etc compress that image to nearly the same size as BMF
(1146 vs 116.

19. Originally Posted by Shelwien
> 9287 with E8 filter + pixel above 64x64x3
Well, guess I'd like to know how many contexts do you mix there,
as my result with 8 contexts seems significantly worse,
even though its supposedly better optimized than yours
(because of bitwise masks, if not anything else).
Also whether you have APM/SSE stages, and/or something like
paq's "exemodel".
Actually, my program is selecting 28 contexts out of the 256 (meaning 28 more bytes, which are already counted in the 9287 bytes; without them, the real size is 9250). Context selection depends on the data, so I'm not so surprised that it is quite efficient.
Also, when I look at your code, you have lots of precalculated tables: just the M_xxx_masky tables are a whopping 256*22 = 5632 bytes. You also have the stretch/squash precalc table that adds 32*2*2 = 128 bytes. That data alone is more than half the size of the compressed data. To make a really fair comparison, that extra data should be counted in your final compressed size.

The initialization of the valmap is quite important, and the mixer is a long suite of specialized code; sure, it would be smaller in full asm, but even with your asm skills, a plain translation to asm wouldn't go under 500 bytes (if not a lot more). I'm also quite sure that the method I'm using with the FPU is quite efficient in size, and apparently in results, for this kind of data. I'm not using any APM/SSE stages, as I did some tests and they didn't give me better results (counting the code size for this stage).

Originally Posted by Shelwien
Its not open source. But I suppose its something similar to this: http://www.compression.ru/ds/bcdr.rar
Woo, I didn't expect code that large. Well, reading code like this is quite obscure to me (like yours, Shelwien, btw); I think your code deserves a big readme.txt and more comments to clarify the obscure parts and make it shareable with others... Also, it's something I have appreciated a lot in Matt Mahoney's code: the large comments at the beginning of his compressors were really helpful to understand the code!
For example, the mixing algo you are using is quite interesting, but even the link you provide on your "blog" and the code in the mixer are not enough to understand it. It would also be good to give a comparison between your mixer and, for example, Mahoney's neural network mixer, everything else being constant of course. It would give more pertinence (or not) to your method.

Also, your help has been very valuable: even on small amounts of data, I've learned that it's sometimes worth adding a new model that gives better results!

20. > Actually, my program is selecting 28 context other the 256

So 28 contexts with static logistic mixing and that's all?

Well, its still slightly more than 8, so I guess its reasonable

Although why don't you just use all 256 and adaptive mixing?
With that there'd be no more "subset selection problem"
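The adaptive logistic mixing suggested here works roughly as follows (a simplified sketch in the style of paq's mixer, not Shelwien's or crunky's actual code): each model's probability is stretched, combined with per-input weights, squashed back, and after each bit the weights move along the gradient of coding cost.

```cpp
#include <cmath>
#include <vector>

// Simplified paq-style logistic mixer: the mix is squash(sum(w_i * st_i))
// over stretched input probabilities, and after coding each bit the weights
// are updated as w_i += lr * (y - p_mix) * st_i.
struct LogisticMixer {
    std::vector<double> w;      // one weight per input model
    std::vector<double> st;     // stretched inputs from the last Mix() call
    double lr;                  // learning rate

    explicit LogisticMixer(size_t n, double rate = 0.02)
        : w(n, 0.0), st(n, 0.0), lr(rate) {}

    static double stretch(double p) { return std::log(p / (1.0 - p)); }
    static double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }

    double Mix(const std::vector<double>& p) {
        double dot = 0.0;
        for (size_t i = 0; i < w.size(); ++i) {
            st[i] = stretch(p[i]);
            dot += w[i] * st[i];
        }
        return squash(dot);
    }

    void Update(double p_mix, int bit) {
        double err = bit - p_mix;   // gradient of log loss wrt the dot product
        for (size_t i = 0; i < w.size(); ++i)
            w[i] += lr * err * st[i];
    }
};
```

The point of the suggestion: models whose predictions don't help simply drift toward weight 0 on their own, which is why all 256 contexts can be kept without any explicit subset search.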

> Also, when i'm looking at your code, you have lots of
> precalculated tables : just for the M_xxx_masky tables,
> it's a whopping 256*22 = 5632 bytes.

That's a memory/speed optimization, its not really necessary.
There're still 9*15=135 bytes of contexts masks (7 mixers, 8 counters),
4*15=60 bytes of rate parameters and 2*7=14 bytes for initial weights
(though these can be safely removed in this case).
So 135+60+14=209 total bytes of parameters for model elements.
And then there're 2*2*32=128 bytes of mixer mappings.
So 209+128=337.

> Just those datas are more than half the size of the
> compressed data. To really make a fair comparison, those
> extras datas should be put in your final compressed size.

I guess, but your current result is better than mine even
not considering that
Still, its not completely valid to just take the parameter
profile size as is - it can be restructured (like W0 and
mappings removed) and even compressed (eg. most of masks are 0).

Anyway, this coder wasn't made for intro compression, nor do
I have enough experience in this area, where constraints are
considerably different from usual.

I just took a random coder which I was recently tweaking,
and used it for redundancy estimation of your data - which
was fairly successful imho.
Well, normally I just use paq8 for such estimation, but in
this case it isn't really applicable.

> The initialization of the valmap is quite important, the
> mixer also is a long suite of specialized code, sure that
> it would be smaller in full asm, but even with your asm
> skill, a plain translation in asm would'nt go under 500
> bytes (if not a lot more).

You mean, just that mixer? Not really, it would take
just ~40 bytes for Mixup() and Update() and a similar
amount for valmap init - these are actually much easier
to write in asm than C .

> I'm also quite sure that the method i'm using with the FPU
> is well efficient in size and in results apparently for
> this kind of data.

I don't really have anything against use of float-point in
asm... I even posted a link to my own implementation of
an FPU-based rangecoder.
But in this specific task I don't really see any place
where FPU instructions would be more applicable than
integer.

> I'm not using any APM/SSE stages, as i did some test and
> it didn't give me better results (including the code size
> for this stage).

Yeah... newer instructions tend to have longer codes

>> http://www.compression.ru/ds/bcdr.rar
> Woo, i didn't expect a code so large.

Its not really large, its just a weird writing style.
Most of the C statements there actually correspond to
single asm instructions, so its not that large.

> i think your code would deserve a big readme.txt and more
> comments to enlight obscure parts and to be able to share
> with others...

Well, in theory I agree, though most of it was explained
on this forum or ctxmodel.net.
Its just that chances to discuss such implementation details
are quite rare and I don't have any motivation to explain
something when nobody specific needs that.
Also I prefer to spend the time at improving the coder
instead of documenting an imperfect coder

> Also, it's something i have appreciated a lot in Matt
> Mahoney's code, the large comments at the beginning of its
> compressors were really helpful to understand the code!

Yeah, I liked that a lot too, but my projects are of a much
smaller scale for now, even if they don't work quite well yet.

> For example, the mixing algo you are using is quite
> interesting, but even the link you provide on your "blog"
> and the code in the mixer are not enough to understand it.

Err... if you mean paq_mixer.inc from mix_test, its completely
different from the linear mixer for which I gave you the link.
The file with linear mixer is called "sh_mixer.inc" and is available

> It would be good also to give a comparison between your
> mixer and for example Mahoney's neural network mixer,
> everything else being constant of course. It would give
> more pertinence - or not - to your method.

Fortunately, we do have exactly that here:

So basically the mixer in mix_test is my own implementation
of a logistic mixer similar to one used in paq.

Which I started to use only a month ago, after an objective
comparison with fully optimized coders based on linear and
logistic mixers showed that logistic approach actually has
a valid foundation and faster adapts to data than linear.

Still its pretty interesting that a few years before that
I had no problems with beating paq's results (for specific
data types) using only linear mixing (which was the reason
why I didn't bother to properly investigate the paq mixer before).
It kinda tells a lot about actual efficiency of paq coders.
Well, in fact Matt even posted the results of zpaq version
of a model similar to mix_test_v3 (http://encode.su/forum/showthread.php?p=7769#post7769)
where we can see how helpful the proper parameter tuning is
(219581 vs 214695 on book1 - don't look at "indirect model" results
as contexts there are different so its unfit for mixer comparison.)

> i've learned that it's worth sometimes to add a new
> model that gives better results!

Well, 38k is not really that small - decomp8.exe from a recent
hutter prize entry is only 16252 bytes in size, and it includes
a complete (well, universal part) paq8 model and dictionary management
and a WRT-like text preprocessor.

21. Originally Posted by Shelwien
Although why don't you just use all 256 and adaptive mixing? With that there'd be no more "subset selection problem"
Using all 256 models is quite efficient under a certain size, because I don't need to store the selected model list... but it's slightly inefficient when you have more data (and it compresses very slowly: 800s for book1), because the mixer wastes time excluding the bad contexts (driving their weights << 0).

Originally Posted by Shelwien
There're still 9*15=135 bytes of contexts masks (7 mixers, 8 counters)
I'm really interested in the masks you chose; what is the meaning of xxx in M_xxx_masky? On what kind of expectation / kind of data (exe, ...etc) are they based? Are they heuristics, or did you choose them with a brute-force method?

Originally Posted by Shelwien
You mean, just that mixer? Not really, it would take
just ~40 bytes for Mixup() and Update() and a similar
amount for valmap init - these are actually much easier
to write in asm than C .
No, i was just meaning for the whole program of course!
But you proved me that asm translating for you is almost your natural language

Originally Posted by Shelwien
I don't really have anything against use of float-point in
asm... I even posted a link to my own implementation of
an FPU-based rangecoder.
But in this specific task I don't really see any place
where FPU instructions would be more applicable than
integer.
Oh, for the rangecoder, the FPU seems to be a nightmare (your implementation is... coming right from planet Mars )
The FPU seems to be relevant in size (except under 4k of compressed data, where crinkler beats anything) if you want to get rid of the precalculated tables for the stretch/squash functions. 128 bytes of data is more than the cost of using the FPU directly (in my code, the stretch function takes only 20 bytes of FPU code, the overhead coming from switching from integer counters to probabilities; the squash is only 26 bytes), so in my case this is far more efficient in size. In fact, at this size, it's quite common to try to avoid any precalculation, as the precalculation code takes significant space compared to direct use inside the code.
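To make the tradeoff concrete, here is a sketch of the two options being compared (illustrative only, written in C++ rather than asm, and not the layout of crinkler's or mix_test's actual tables): a small interpolated squash table built at init versus computing the function directly.

```cpp
#include <cmath>

// Direct computation of squash (the logistic function); in asm this is a
// handful of FPU instructions, at the cost of some code bytes per use.
static double squash_direct(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Table alternative: 33 precomputed anchors covering x in [-16, 16], one per
// 1.0 step, with linear interpolation between them. The table costs storage
// (or init code), but each lookup is cheap.
static double squash_tab[33];

void init_squash_tab() {
    for (int i = 0; i < 33; ++i)
        squash_tab[i] = squash_direct(i - 16.0);
}

double squash_interp(double x) {
    if (x <= -16.0) return squash_tab[0];
    if (x >=  16.0) return squash_tab[32];
    int i = int(std::floor(x)) + 16;     // lower anchor index
    double f = (x + 16.0) - i;           // fractional position between anchors
    return squash_tab[i] * (1.0 - f) + squash_tab[i + 1] * f;
}
```

The argument in the post is exactly this choice: if the decompressor only calls stretch/squash in one or two places, a ~20-26 byte direct FPU computation beats storing (or generating) a 128-byte table.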

Originally Posted by Shelwien
> I'm not using any APM/SSE stages, as i did some test and
> it didn't give me better results (including the code size
> for this stage).
Yeah... newer instructions tend to have longer codes
Right... I would have loved to use SSE2 with all the registers available... I guess it's coming from the time I was using the 68000... bad habit

Originally Posted by Shelwien
Well, in fact Matt even posted the results of zpaq version
of a model similar to mix_test_v3 (http://encode.su/forum/showthread.php?p=7769#post7769)
where we can see how helpful the proper parameter tuning is
(219581 vs 214695 on book1 - don't look at "indirect model" results
as contexts there are different so its unfit for mixer comparison.)
I tried my compressor on book1 and book1bwt, and the results are actually quite good:
Code:
```                crunky
book1           216962  (13 contexts used)
book1bwt        211907  (20 contexts used)```
Though it uses 13-20 contexts, and I'm not sure how many contexts were used in the zpaq comparison... The mixer I'm using is a very slightly different version of paq's anyway...

Originally Posted by Shelwien
Well, 38k is not really that small - decomp8.exe from a recent
hutter prize entry is only 16252 bytes in size, and it includes
a complete (well, universal part) paq8 model and dictionary management
and a WRT-like text preprocessor.
Oops... I don't understand the comparison between the 38k of uncompressed data and the 16252 bytes of compressor code? What do you mean?

22. Originally Posted by Alexandre Mutel
Well, the size of a PAQ8k decompressor would be >> 1.5k even in assembler! So it's not really achievable... At this size, we need to compare the size of the compressed data + the size of the decompressor.
For this particular file it's probably not achievable, but we are talking about a PE packer which is supposed to work with quite different files, varying in size and content, so I think the PAQ engine can prove its usefulness for some files. Furthermore, you're planning to beat Crinkler, and I suppose kkrunchy too, so it will be quite a hard task, and having a couple of selectable compression engines will get you closer to victory I think. Something like having nrv2b\nrv2d\nrv2e\lzma in UPX or aplib\briefLZ\ffce\jcalg1\lzma in PECompact2.

23. Originally Posted by Skymmer
For this particular file it's probably not achievable, but we are talking about a PE packer which is supposed to work with quite different files, varying in size and content, so I think the PAQ engine can prove its usefulness for some files. Furthermore, you're planning to beat Crinkler, and I suppose kkrunchy too, so it will be quite a hard task, and having a couple of selectable compression engines will get you closer to victory I think. Something like having nrv2b\nrv2d\nrv2e\lzma in UPX or aplib\briefLZ\ffce\jcalg1\lzma in PECompact2.
Well, I'm not targeting packers for large executables anyway... I'm more in the 4k-64k range of compressed executables. From the tests I did, it's almost impossible to do better than crinkler under 4k. Against kkrunchy it's possible: currently, with only an E8 filter, my compressor seems to be 0.5% to 1% worse (well, I only tested this on putty.exe, so it's not really representative of the 64k target), but still, kkrunchy performs much more pre/filter/processing on the code than I was planning to.
But between crinkler and kkrunchy (between 4k and 64k, more precisely the 16k type), it's possible to achieve better results. Well, I agree that it is a very, very small niche market

24. > due to the fact that the mixer waste time to exclude bad
> contexts (affecting weights << 0)

Well, using a single mixer with multiple inputs is not
the only way possible, as you could see from my coders.
In fact, a binary mixer tree instead of a single multi-input mixer
seems to improve compression, which can be partly explained by the absence
of the Sum w_i = 1 constraint in most paq mixer implementations.

And with such a design its also possible to improve speed by
discarding branches that have ~0 weights (both estimation and update).
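The tree design can be sketched like this (an illustration of the idea above, not Shelwien's implementation): leaf predictions are combined pairwise by 2-input logistic mixers, so a whole subtree whose weight at the level above has decayed to ~0 could be skipped in both prediction and update.

```cpp
#include <cmath>
#include <vector>

// A 2-input logistic mixer node.
struct Mix2 {
    double w0 = 0.0, w1 = 0.0, lr = 0.02;
    double s0 = 0.0, s1 = 0.0;               // saved stretched inputs
    static double stretch(double p) { return std::log(p / (1.0 - p)); }
    static double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }
    double Mix(double p0, double p1) {
        s0 = stretch(p0); s1 = stretch(p1);
        return squash(w0 * s0 + w1 * s1);
    }
    void Update(double p, int bit) {
        double err = bit - p;
        w0 += lr * err * s0;
        w1 += lr * err * s1;
    }
};

// Combine 4 model outputs with 3 two-input mixers arranged as a binary tree.
// Note no subtree is constrained to sum its weights to 1.
struct MixTree4 {
    Mix2 left, right, root;
    double pl = 0, pr = 0, p = 0;
    double Mix(const std::vector<double>& in) {   // in = 4 probabilities
        pl = left.Mix(in[0], in[1]);
        pr = right.Mix(in[2], in[3]);
        p  = root.Mix(pl, pr);
        return p;
    }
    void Update(int bit) {
        root.Update(p, bit);
        left.Update(pl, bit);    // could be skipped when |root.w0| ~ 0
        right.Update(pr, bit);   // could be skipped when |root.w1| ~ 0
    }
};
```

Any tree shape works, as the next post notes; binary and balanced is just the simplest to write down.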

>> There're still 9*15=135 bytes of contexts masks (7 mixers, 8 counters)
>
> I'm really interested in the mask you choose, meaning of xxx

First pair of counters just got randomly named f and g,
then I added a second pair (f1 and g1) and then 4 more
And these xxx are just names of context indexes for corresponding
counters.

Also you could notice some sh_model.idx files - they contain
the original source from which sh_model_h.inc and sh_model_p.inc
were generated.

> On what kind of expectation/kind of data (exe,...etc) is it based?

Well, the contexts in mix_test_vA was tuned to the executable part
(after cutting off that image) of your intro sample, without E8 applied though.

> Are they heuristics or did you choose them on a bruteforce method?

Both I guess. Anyway, it took multiple hours of optimizer iterations,
although both my optimizer and that specific coder implementation are
far from really efficient, so the speed of optimization process actually
can be significantly improved - to a few minutes I guess
(though well, my few minutes, with an overclocked Q9450).

> No, i was just meaning for the whole program of course!

Btw, do you include the size of PE headers into your decoder size estimations?
I think it should be included, as its possible to keep some data/code in some of the header fields.

Also, intros don't necessarily have to be distributed as PE files, right?
Did you consider writing a .COM dropper which would generate a real exe file
and run it, or something like NE exe (its header is kinda more compact)?

> But you proved me that asm translating for you is almost your natural language

In fact it is... I'd used it for like 12 years before switching to C++
(which I did only because of MMX/SSE and IntelC with automatic vectorization -
its very hard to compete with that in plain asm).

Btw, here's a better example of asm code size optimizations
I tried to help somebody to fit it into 128 bytes, but didn't succeed either.
Still, it (w.asm) shows how stack and string operations can be used, and
there's also self-modification.

> oh, for the rangecoder, FPU seems to be a nightmare (your
> implementation is... coming right from planet mars )

Actually it was supposed to be a speed optimization.
First, it keeps the rangecoder state on the FPU stack,
while compressor is free to use all integer registers.
And second, it reads/writes the compressed data by 32 bits at once -
and FPU had 64 bits of integer precision even at
the time when there was no MMX

> FPU seems to be relevant in size (except under 4k compressed
> data, where crinkler here beats anything) if you want to get
> rid off precalculated tables for stretch/squash function.

Err... only that? Why not make it all float-point - counters,
mixer weights, rc?.. It might not only reduce the number of
float-to-integer transitions, but also shave off some bytes
due to improved precision...

In fact, my first CMs (ppmy and ash) were written using
completely float-point models, and these are
almost impossible to translate into integers - it'd
probably require some kind of software float-point anyway.

But if its only stretch/squash, then floats there really
don't seem that necessary.

> 128 bytes of data is more than the cost of using directly the FPU

Yeah, but it doesn't have to be 128 bytes
For example, 10*64/8 is already 80 bytes (deltas barely fit into 10 bits),
and then its surely possible to cut off a few bits of precision.
Also, these tables actually can be merged... or 3 more tables
can be added for better tuning .

And if necessary, its even possible to just use FPU-based
squash and stretch there as well, as my tables are only some
approximations of these functions, although some data-specific
static probability transformations are tuned into them.

> In fact, at this size, it's quite common to try to avoid any
> precalculation, as the precalculation code is taking a
> significant space over a direct use inside the code.

Its only true for very simple algorithms, but something
on the scale of 300-500 bytes is already different imho.
The precalc code, written inline, is fairly little (<10 bytes),
and if the function is used multiple times, it might be well worth it.

Although of course I don't know it for certain in the
case of a logistic CM decoder, as I didn't write one yet

>>> I'm not using any APM/SSE stages, as i did some test and
>>> it didn't give me better results (including the code size
>>> for this stage).
>>
>> Yeah... newer instructions tend to have longer codes
>
> right... I would have loved to use sse2 with all the
> registers available...

It was a pun actually, with different kinds of SSE

But actually even though SSE might be more complex than
mixing, it gives enough compression improvement to
consider it imho - not the one from paq maybe, though.
For example, a model with 2 counters and 2d SSE is similar
(or better) in compression to a model with 4 mixed counters.
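A 2d SSE/APM stage of the kind described here can be sketched as follows (a simplified illustration in the spirit of paq's APM, not the code of any coder in this thread; table size and update rate are arbitrary): the input probability is stretched and quantized into buckets per context, each bucket holds a corrected probability that is interpolated on lookup and nudged toward the actual bit on update.

```cpp
#include <cmath>
#include <vector>

struct APM {
    int nctx, nbuckets;
    std::vector<double> t;                     // corrected p per (ctx, bucket)
    int cx = 0, lo = 0; double frac = 0;       // state saved for Update()
    double rate;

    APM(int contexts, int buckets = 33, double r = 0.02)
        : nctx(contexts), nbuckets(buckets),
          t(size_t(contexts) * buckets), rate(r) {
        // initialize each row to the identity mapping: bucket b holds
        // squash((b - center)/2), i.e. "no correction yet"
        for (int c = 0; c < contexts; ++c)
            for (int b = 0; b < buckets; ++b) {
                double x = (b - (buckets - 1) / 2.0) / 2.0;
                t[size_t(c) * buckets + b] = 1.0 / (1.0 + std::exp(-x));
            }
    }
    double Refine(double p, int ctx) {
        cx = ctx;
        double x = std::log(p / (1.0 - p)) * 2.0 + (nbuckets - 1) / 2.0;
        if (x < 0) x = 0;
        if (x > nbuckets - 1) x = nbuckets - 1;
        lo = int(x); if (lo == nbuckets - 1) lo = nbuckets - 2;
        frac = x - lo;
        size_t i = size_t(cx) * nbuckets + lo;  // interpolate two buckets
        return t[i] * (1 - frac) + t[i + 1] * frac;
    }
    void Update(int bit) {
        size_t i = size_t(cx) * nbuckets + lo;  // nudge both buckets toward bit
        t[i]     += rate * (1 - frac) * (bit - t[i]);
        t[i + 1] += rate * frac       * (bit - t[i + 1]);
    }
};
```

"2d" here just means the table is indexed by (context, quantized probability); the extra code is the point under debate, since a mixer over more counters can buy similar compression.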

> i tried my compressor on book1 and book1bwt
> and the results are quite good actually
> crunky
>book1 216962 (13 contexts used)
>book1bwt 211907 (20 contexts used)

My current results are 209836 (book1) and 207961 (book1bwt)
with mix_test and 8 counters (my mixers have contexts too,
so there're 15 contexts).
Also book1 result would be better in a few days, I guess -
its pretty slow to optimize from scratch.

> though, it uses 13-20 contexts, and i'm not sure how many
> contexts were used in the zpaq comparison...

There were plain order0 and order1 counter tables and
a single instance of mixer in my case. Probably the same
with zpaq, though I didn't really try to decipher the config.

> Oops.. i don't understand the comparison between the 38k of
> uncompressed data and the 16252 bytes of compressor code?
> What do you mean?

Not sure... probably wanted to sleep and missed a whole train
of thoughts
Likely its two merged statements actually.
1. That 38k compressible to ~9k is a lot, as it allows to
add a whole disassembler to decoder to gain another 100 bytes
(eg. 500 gain - 400 bytes disassembler) - or a few models
for alternate formats. Basically, at such scale algorithms
already can be more compact representations of data.
2. That full paq8 won't be that large if properly written
(around 1-2k probably), as C++ compiler's code is far from
compact, but still fits into 16k.

Btw, considering [1] I've got an idea in the style of
Kolmogorov complexity - it should be possible to generate
some code, that would produce data, that would be useful
to train the model before processing the actual file data.

Also did you try to train your model with your decoder's code?
Though there're even better samples - some common files
(eg. MSVCRT.dll) would be probably the same in all windows versions.

I checked and it seems to work with paq8 at least:

8614 intro_exe_unpack.paq8p
111935 msvcrt_dll.paq8p
120196 msvcrt_dll1.paq8p
120196-111935=8261 which is kinda better than 8614

25. Originally Posted by Shelwien
Well, using a single mixer with multiple inputs is not
the only way possible, as you could see from my coders.
In fact, a binary mixer tree instead of single multi-input mixer
seems to improve compression, which can be partly explained by absense
of Sum w_i = 1 constraint in most paq mixer implementations.
And with such a design its also possible to improve speed by
discarding branches that have ~0 weights (both estimation and update).
Like in the mix_test_* in paq_mixer.inc? What's the meaning of M_mXwr and M_mXwq?

Originally Posted by Shelwien
Btw, do you include the size of PE headers into your decoder size estimations?
I think it should be included, as its possible to keep some data/code in some of the header fields.
Also, intros don't necessarily have to be distributed as PE files, right?
Did you consider writing a .COM dropper which would generate a real exe file
and run it, or something like NE exe (its header is kinda more compact)?
You are right, but since the PE header doesn't really depend on the compression method, it's a static size. And yes, I will use unused parts of the header to store code or hashes for imported dll functions (as is done in crinkler). The COM dropper is not targeted, as it is not always accepted, and I want to use a plain exe anyway...

Originally Posted by Shelwien
Err... only that? Why not make it all float-point - counters,
mixer weights, rc?.. It might not only reduce the number of
float-to-integer transitions, but also shave off some bytes
due to improved precision...
I have to check the opportunity of doing the counter updates in float... I would like to use the probability update; I found your discussion with toffer about this (in the parameter optimization thread), but well, I need some explanations!
In your update formula : p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
what are wr, mw, pm?
Also, I'm not sure that the rc is worth doing in the FPU... it seems size-consuming compared to a simple integer rc...

Originally Posted by Shelwien
Also did you try to train your model with your decoder's code?
Though there're even better samples - some common files
(eg. MSVCRT.dll) would be probably the same in all windows versions.
I tried, but it doesn't give any improvement... I don't know why... probably because, even if the hash is huge, I don't handle collisions... but well, I haven't investigated further...

Originally Posted by Shelwien
Using a bit mask would certainly help, so I'd not stop
at just 256 contexts
You were right! I was able to add a bit mask model in a simple manner, without adding too much code (I hope ~10-15 bytes), and it improves compression quite well: 9403 bytes on the normal file (compared to 9560 in the initial version).

27. >> In fact, a binary mixer tree instead of single multi-input
>> mixer seems to improve compression, which can be partly
>> explained by absense of Sum w_i = 1 constraint in most paq
>> mixer implementations. And with such a design its also
>> possible to improve speed by discarding branches that have
>> ~0 weights (both estimation and update).
>
> Like in the mix_test_* in paq_mixer.inc?

Yes, but any trees are in fact applicable, not necessarily
binary or balanced ones.

> What's the meaning of M_mXwr and M_mXwq?

All the values in that coder have 15-bit precision.
(Although mixer weight is stored as 0x4000+w instead of
value clipping, as negative weight values might appear
due to imprecision).

And m-wr is kinda a weight for static mixing with 0.5,
while m-wq is mixer update rate... normally wr was supposed
to be that rate, but because of a mistake it ended like this.

> You are right, but because the PE header doesn't really
> depend on the compression method, it's a static size.

Only a few fields there are really static.
And even these rare fields that have to be specified,
often have a rather large set of equally usable values.
So my idea was that tweaking these and using all of
the file as a context for the model might allow to cut
off some more bytes without actually increasing the decoder size.

> And yes, i will use unused parts of the header to store
> code or hash for imported dll functions (as it is done in
> crinkler).

I'm not so certain that custom importing is any better than
native one. Imho they usually do that in decoders just to
avoid dealing with variable PE structure.
There're even imports by ordinal, though I guess nobody compared
the ordinals of common functions in different windows versions.

> In your update formula : p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
> what are wr, mw, pm?

I'd suggest looking at this:
http://ctxmodel.net/rem.pl?-2
http://ctxmodel.net/rem.pl?-6
http://ctxmodel.net/rem.pl?-12

Anyway, its a static binary linear mixing application
for counter update, basically
p'=mix2(mix2(p,y),pm) or
p'=mix2(p,mix2(y,pm)) which is what's used in my counters
(y=current bit, p=probability of 1 (in this case),
pm=some static order-(-1) probability)
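Written out in C++, the update above is just two nested linear mixes (a direct transcription of the formula; the parameter values in the test below are illustrative, not tuned ones):

```cpp
// Linear counter update from the thread:
//   p' = (p*(1-wr) + y*wr)*(1-mw) + pm*mw  =  mix2(mix2(p, y, wr), pm, mw)
// wr = update rate toward the current bit y,
// pm = some static order-(-1) probability,
// mw = the (small) weight with which pm is mixed back in.
double mix2(double a, double b, double w) { return a * (1.0 - w) + b * w; }

double counter_update(double p, int y, double wr, double mw, double pm) {
    return mix2(mix2(p, y, wr), pm, mw);
}
```

One property worth noting: on a long run of 1-bits, p converges not to 1 but to the fixed point (wr*(1-mw) + pm*mw) / (1 - (1-wr)*(1-mw)), so the pm term keeps the counter away from saturation.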

> Also, not sure that the rc is worth to do it in the FPU...
> it seems size consuming compare to simple integer rc...

Probably not, as some extra operations are required to
perform precise integer arithmetics with FPU.
Like keeping all values in ((1<<63)+x) form to stop
normalization.

>> Also did you try to train your model with your decoder's code?
>> Though there're even better samples - some common files
>> (eg. MSVCRT.dll) would be probably the same in all windows versions.
>
> I tried but it doesn't give any improvement... don't know
> why... probably because even if the hash is huge...
> i don't handle collisions... but well, i haven't investigate further...

Well, in fact it didn't work with mix_test either,
that's why I posted that paq8 example to simplify things
But I found that its possible to gain something by using
a selected fragment of msvcrt.dll as a context.
Probably our models are just not adaptive enough
(well, no wonder without SSE etc).

> You were right! I was able to add a bit mask model in a
> simple manner, without adding too much code (i hope
> ~10-15bytes) and it improves compression quite well: 9403
> bytes on the normal file (compare to 9560 in the initial
> version).

That's cool, though I really wonder how are you going
to optimize these parameters

Then, I think this might be usable too (from paq):

Code:
```//////////////////////////// exeModel /////////////////////////
// Model x86 code.  The contexts are sparse containing only those
// bits relevant to parsing (2 prefixes, opcode, and mod and r/m fields
// of modR/M byte).

// Get context at buf(i) relevant to parsing 32-bit x86 code
U32 execxt(int i, int x=0) {
  int prefix=(buf(i+2)==0x0f)+2*(buf(i+2)==0x66)+3*(buf(i+2)==0x67)
            +4*(buf(i+3)==0x0f)+8*(buf(i+3)==0x66)+12*(buf(i+3)==0x67);
  int opcode=buf(i+1);
  int modrm=i ? buf(i)&0xc7 : 0;
  return prefix|opcode<<4|modrm<<12|x<<20;
}```

28. Originally Posted by Shelwien
> In your update formula : p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
> what are wr, mw, pm?
I'm testing your update formula, but I'm getting a compressed size 6% worse than with Mahoney's simple non-stationary update ( if (c[y] < 255) c[y]++; if (c[1-y] > 2) c[1-y] = c[1-y]/2 + 1; ). This is a huge difference... I tried to tweak wr, mw, pm, but I can't do better than 6%.

What's wrong? Have you already performed a comparison with the simple non-stationary update rule? Could you try it on your model? I'm curious...

29. > I'm testing your update formula, but i'm getting a 6%
> worse on compression size than the simple mahoney's
> non-stationary update

That's kinda expected in your case (with small amount of data),
as that linear counter with constant wr is only an approximation
of combinatoric counter and is especially imprecise for first
bits in context.
Although maybe you only replaced the counter and didn't try to
reoptimize whole model with it - that initial imprecision usually
can be well compensated by lower-order contexts in the mix.
Anyway, my linear counter is chosen for convenience reasons,
not because of its prediction precision, so there's not much
reason to "defend" it for me.

> c[y] = c[y] + (c[y]<255);
> c[1-y] = c[1-y] / 2 + 1 + (c[1-y]<2);

I guess, you should try tweaking this Matt's counter instead,
for example. rewrite it like this:

> c[y] = c[y] + u*(c[y]<255);
> c[1-y] = __max( t, c[1-y]*w ) + v;

And try finding better (u,v,w,t) values than (1,1,0.5,2).

Also, another thing can be added at probability calculation:

you can use something like
(c[0]+f0) / ( c[0]+c[1] + f0+f1 )
so f0 and f1 would be additional parameters, similar to pm
in the linear counter.
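Put together, the generalized counter suggested above looks like this (a direct transcription of the (u,v,w,t) update and the (f0,f1) smoothing; the values below are placeholders for an optimizer to tune, not recommended settings):

```cpp
#include <algorithm>

// Generalized nonstationary bit counter:
//   c[y]   += u  (capped at 255)
//   c[1-y]  = max(t, c[1-y]*w) + v
// (u,v,w,t) = (1,1,0.5,2) roughly reproduces the simple halving rule.
// f0/f1 act like pm in the linear counter: they keep fresh counters
// away from 0/1 probabilities.
struct Counter {
    double c[2] = {0, 0};
    double u = 1, v = 1, w = 0.5, t = 2;   // update parameters (tunable)
    double f0 = 0.4, f1 = 0.4;             // probability smoothing (tunable)

    void Update(int y) {
        if (c[y] < 255) c[y] += u;
        c[1 - y] = std::max(t, c[1 - y] * w) + v;
    }
    double P1() const {                    // estimated probability of bit 1
        return (c[1] + f1) / (c[0] + c[1] + f0 + f1);
    }
};
```

Since the decoder only stores one byte of parameters per model, each of (u,v,w,t,f0,f1) would in practice be quantized to a few bits before being fed to the brute-force optimizer.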

> This is a huge difference... i tried to tweak wr, mw, pm,
> but i can't do better than 6%.

There're versions of linear counter which would better fit
your case (ones with dynamic update rate), but they won't

> What's wrong? Have you performed already a comparison with
> the simple non stationary update rules? Could you try it
> on your model, i'm curious...

Why, I did try, and it was insignificantly better, but noticeably
slower... also not the /2 version but a generalized one, with
all the tuned parameters.
I tried that for book1 though, not with your intro sample.
