
Thread: Subset model selection problem

  1. #1
    Member Alexandre Mutel (Paris)

    Subset model selection problem

    Hello to all members!

    I'm working on a small packer for exe intros for the demoscene, using neural networks. The code is quite small (asm code <320 bytes) and its efficiency is comparable to the crinkler packer (though often slightly worse, around 1% behind, on 4k-8k data). This code will be released as open source once I manage to get something fully working. However, I'm facing some problems that I want to share with you:

    1) The first problem relates to the technique currently used in crinkler, an exe linker-packer for small intros under 4k (which can still work well under 16k). Suppose that we have N models, and I want to use a more efficient subset K of these models and assign them static weights for a small amount of data to compress (compared to a full neural network, static weights are quite efficient under 16k of data, at least when the type of data is not changing too much...).

    n0_i and n1_i are the respective counters for bits 0 and 1 for the current model (counters are stored in a hash for each model).

    The final prediction is simply: p(1) = sum_{i=0..K} (n1_i * 2^w_i) / sum_{i=0..K} ((n0_i + n1_i) * 2^w_i)
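
    For illustration, here is a minimal sketch of that prediction in C++ (names and types are illustrative, not the actual packer's data structures):
    Code:
    #include <cstdint>
    #include <vector>

    struct Model { uint32_t n0, n1; int w; };  // bit counters and static weight (w in 0..7)

    // p(1) = sum(n1_i * 2^w_i) / sum((n0_i + n1_i) * 2^w_i) over the selected models
    double predict_one(const std::vector<Model>& models) {
        uint64_t num = 0, den = 0;
        for (const Model& m : models) {
            num += uint64_t(m.n1) << m.w;
            den += uint64_t(m.n0 + m.n1) << m.w;
        }
        return den ? double(num) / double(den) : 0.5;  // 0.5 when no statistics yet
    }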

    My questions are:

    - How can we determine the best subset K of the N models?

    - How can we determine the best weights for the subset K? (the range for w_i is 0-7)

    I tried several empirical methods (using errors, prefiltering with a neural network, etc.) with no success. What do you think? Is there any good pragmatic paper on this subject? (I'm not a statistician but a programmer, so anything that is not somehow expressed as a simplified algorithm is quite hard for me to transcribe...)

    (Not to mention that the parameters for the models, which in this case amount to only a byte per model, are counted in the final compressed size.)

    2) Same question regarding neural networks. I have N models and I want to select the best subset. The only difference is that I don't have to manage the weights... I'm currently using a dichotomy algorithm with sorting on weights (only summing weights > 0), and it's working quite well... but I'm not sure it's the best approach.
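
    One pragmatic baseline for the subset question (an illustration, not what crinkler actually does) is greedy forward selection: repeatedly add the (model, weight) pair that most reduces the measured compressed size, and stop when nothing helps. Here coded_size() is a stand-in for a full coding pass over the sample data:
    Code:
    #include <functional>
    #include <limits>
    #include <vector>

    struct Pick { int model; int w; };  // a selected model and its static weight

    std::vector<Pick> greedy_select(int N,
            const std::function<double(const std::vector<Pick>&)>& coded_size) {
        std::vector<Pick> subset;
        std::vector<bool> used(N, false);
        double best = std::numeric_limits<double>::infinity();
        for (;;) {
            Pick pick{-1, 0};
            for (int m = 0; m < N; ++m) {
                if (used[m]) continue;
                for (int w = 0; w <= 7; ++w) {       // weight range 0..7
                    subset.push_back({m, w});
                    double c = coded_size(subset);   // full coding pass (slow!)
                    subset.pop_back();
                    if (c < best) { best = c; pick = {m, w}; }
                }
            }
            if (pick.model < 0) break;               // no candidate improved: stop
            subset.push_back(pick);
            used[pick.model] = true;
        }
        return subset;
    }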

  2. #2
    Administrator Shelwien (Kharkov, Ukraine)
    > n0_i and n1_i are the respective counters for bits 0 and 1
    > for the current model (counters are stored in a hash for
    > each model)

    Well, normally I'd recommend using linear
    probability counters instead of such frequencies,
    but for such small amounts of data it's really uncertain.
    Also I'm not sure that hashing is a good idea... well,
    it's almost certain that you won't get any collisions
    with 4k of data, but initializing higher-order contexts
    would be a pain.

    > - How can we determine the best subset K of the N models?
    > - How can we determine the best weights for the subset K? (the range for w_i is 0-7)

    There's not much choice other than brute force, actually.
    Of course you can use some heuristics for the search, but
    I'm sure that there're no good analytic solutions.
    Also, imho there's no such thing as a fixed model set to
    choose from. Instead, I'd represent it as parametric models,
    and then proceed with the optimization e.g. like this:
    1. Optimize a single counter (context mask/hash parameters/rate parameters)
    2. Optimize another counter with another sample file
    3. Add a mixer and optimize it, then reoptimize the parameters of all 3 elements
    4. Clone this submodel, add another mixer to mix the versions
    [...]
    So the point is that instead of having a fixed set of submodels,
    I'd just add new tuned submodels until compression stops improving.

    > What do you think? Is there any good pragmatic paper on
    > this subject? (I'm not a statistician but a programmer, so
    > anything that is not somehow expressed as a simplified
    > algorithm is quite hard for me to transcribe...)

    It's actually easy enough to formulate this as a mathematical
    problem - just write down the formula for the coding cost
    of some data (with all parameters) and then find the roots
    of its derivatives.
    An example:
    Code:
    We know that there're n0 zeroes and n1 ones in the data,
    what would be the best (static) probability to encode this
    data?
    Coding cost is: f(p)=-n0*log2(p)-n1*log2(1-p)
    D[f(p),p] = n1/(1-p)/ln(2) - n0/p/ln(2) == 0
    ((n0+n1)*p-n0)/(1-p)/p/ln(2) == 0
    p = n0/(n0+n1)
    But this approach quickly becomes too complex in practical
    cases, so these papers mainly discuss the choice of possible
    numeric optimization methods, it seems.
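
    As a quick numeric sanity check of that result (illustrative code, not from the original post):
    Code:
    #include <cmath>
    #include <cstdio>

    // Coding cost f(p) = -n0*log2(p) - n1*log2(1-p), where p is assigned to the zero bit
    double cost(double p, int n0, int n1) {
        return -n0 * std::log2(p) - n1 * std::log2(1 - p);
    }

    int main() {
        // With n0=3, n1=1 the optimum is p = 3/4, and indeed it codes cheaper:
        std::printf("f(0.75) = %.3f bits\n", cost(0.75, 3, 1));  // ~3.245
        std::printf("f(0.50) = %.3f bits\n", cost(0.50, 3, 1));  // 4.000
    }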

    Well, I didn't really try to build a model for such amounts
    of data, so it might be better to get some experience first.
    Can you provide a sample file (the data you're actually
    trying to compress, not a raw obj file or something like that),
    with compressed size estimations for it (yours/crinkler's)?

  3. #3
    Member Skymmer (Russia)
    Quote Originally Posted by Alexandre Mutel View Post
    Hello to all members!

    I'm working on a small packer for exe intros for the demoscene using neural networks. The code is quite small (asm code <320 bytes) and its efficiency is comparable to the crinkler packer
    Hello! Sounds very interesting. Will it be a natural packer (I mean, working with an already existing .exe file), or will it behave like Crinkler (a linker combined with a packer, so you need the sources)?

  4. #4
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Shelwien View Post
    Well, normally I'd recommend to use the linear
    probability counters instead of such frequencies,
    but for such small amounts of data its really uncertain.
    Also not sure that hashing is a good idea... well
    its almost certain that you won't get any collisions
    with 4k of data, but initializing higher-order contexts
    would be a pain.
    I tried several update methods (paq8, balz, etc.), but the most efficient method with a 1-byte n-counter is still the non-stationary update I found from Matt Mahoney (although I made a small change to slightly improve it).
    I should also have been a bit more precise: in fact, like crinkler, I use a single model, a generic match-sparse order-8 model. It means that I have 256 contexts to choose from (and that's why a subset is very important; most of the contexts are generating "noise"). Concerning the hash, its size is often huge (>200MB), because with the structure of exe data you can get lots of hash collisions (and, like crinkler, for efficiency and simplicity of code, collisions are not handled).

    Quote Originally Posted by Shelwien View Post
    There's not much choice other than brute force, actually.
    Oh, OK. I tried some brute force, but I haven't found the best method yet... and it's quite CPU intensive (256 contexts, weights ranging from 0-7). I need to figure out how they did it in crinkler; it's really impressive...

    Quote Originally Posted by Shelwien View Post
    Well, I didn't really try to build a model for such amounts
    of data, so it might be better to get some experience first.
    Can you provide a sample file (the data you're actually
    trying to compress, not a raw obj file or something like that),
    with compressed size estimations for it (yours/crinkler's)
    You'll find attached a sample file extracted from an intro we did (it's called raytro). This sample file is an exact copy of what crinkler compresses and puts in the final exe.

    I have to mention that the binary x86 code is stored in the first 13532 bytes. The rest of the data is binary structures, text, GLSL shader text, etc. Crinkler uses a different context selection for code and data. With the neural network I'm using, this is automatic (it shares the same selection for code and data), but I still reset the hash at 13532 to improve the compression (1 to 2%).

    Concerning the compression results on this sample file:
    • The size of this file compressed with crinkler is around 9700 bytes (+230 bytes for the decompressor).
    • With my compressor, it's around 9560 bytes (+310 bytes for the decompressor).


    But because my decompressor takes 80 bytes more than crinkler's, it's only about 40 bytes more efficient than crinkler! I found crinkler really impressive at using static models for code and data (the asm code is really unbelievable).

    Quote Originally Posted by Skymmer View Post
    Hello! Sounds very interesting. Will it be the natural packer (I mean working with already existing .exe file) or will it behave like Crinkler (I mean linker combined with packer so you need the sources) ?
    Thanks! I expect to make it work directly on the exe, but it will need the exe to be compiled with relocation information (and this is not always the case). This is required in order to be able to reorder/strip code & data in the exe, thus improving compression. Crinkler reorders the code & data sections at link time.
    Attached Files

  5. #5
    Administrator Shelwien (Kharkov, Ukraine)
    Trying to optimize the coders for this intro binary
    (mix_v3, mix_test, m1_03b), and somehow it doesn't go that well.
    Sure, it's to be expected that x86 code and an image would require specific models,
    but I kinda expected to reach the same level at least...
    not that it's that far off, though.
    I'm just being lazy and trying to avoid adding hashed contexts to mix_test.

    Current stats: ((*)=intermediate results)
    paq8p: 8609
    ppmonstr vJ -o16: 9635
    ccm130c: 10790
    E8 + ppmonstr vJ -o16: 9498
    E8 + ppmonstr vJ -o2: 10032
    E8 + mix_test 09h2 tuned: 10275 (*)
    E8 + mix_test 09h2 + BMF2: 8979(*)+1164=10143
    E8 + mix 3e22: 10483 (*)
    E8 + m1_03b-1 + BMF2: 10048+1164=11212

    1. I kinda expected to get an already filtered input,
    but it seems that E8 filtering still helped.
    2. The mix_test coder is actually a BWT postcoder and
    is limited to o2 max.

  6. #6
    Administrator Shelwien (Kharkov, Ukraine)
    > I should also have been a bit more precise: in fact, like
    > crinkler, I use a single model, a generic match-sparse
    > order-8 model. It means that I have 256 contexts to choose
    > from (and that's why a subset is very important; most of
    > the contexts are generating "noise").

    Just a masked context with a byte mask?
    Using a bit mask would certainly help, so I'd not stop
    at just 256 contexts.

    >> There's not much choice other than bruteforce actually.
    > Oh, OK. I tried some brute force, but I haven't found the
    > best method yet... and it's quite CPU intensive (256
    > contexts, weights ranging from 0-7). I need to figure out
    > how they did it in crinkler; it's really impressive...

    Possibly there're just a few precalculated profiles,
    and full search is not performed.

    Also 256 is not much (or even 256*256), and that hash can
    be replaced with a direct scan (I mean, instead of
    storing the counters, they can be calculated from
    scratch each time, without using any memory).
    That way it would easily be possible to calculate
    the code length for each parameter set at once and find
    the best one... or actually it should be possible
    to even optimize a few sets for data segments with
    different properties.

    > I have to mention that binary x86 code is stored in the
    > first 13532 bytes. The rest of data is binary structure,
    > text, shader glsl text...etc.

    Also there's a 12k 64x64x3 image in there (or a texture?),
    which allows gaining a few hundred bytes with an image model.

    > * The size of this file compressed with crinkler is around 9700 bytes
    > (+230 bytes for the decompressor).
    > * With my compressor, it's around 9560 bytes
    > (+310 bytes for the decompressor).

    Well, my recent coder (mix_test) is pretty simple as well,
    although I dunno how much of that unpacker code is required for
    maintenance tasks (saving/restoring registers etc).

    > But because my decompressor is taking 80 bytes more than
    > crinkler, it's only 40 bytes more efficient than crinkler!

    Anyway, to me it seems more interesting to use different
    models for different types of data in this case, even
    if the decoder size would be a few times larger.
    It won't work for small binaries of course, but 32k
    input is already OK, I think.

    > I found crinkler really impressive at using static
    > models for code and data (the asm code is really
    > unbelievable).

    Well, x86 code compression can certainly be improved
    a lot with proper parsing.
    http://www.compression.ru/ds/disasm32.rar
    Also this might be useful too:
    http://ctxmodel.net/files/PPMd/segfile_sh2.rar

    And then, maybe I'll find time to add higher-order
    contexts to mix_test (it uses bitwise context masks
    and adaptive mixing of 8 contexts, btw), and we'll see then.

    Unfortunately, BWT doesn't help in this case.

    Btw, a random idea: if something like the paq rangecoder
    is used (the one with ((x1^x2)&0xFF000000)==0 renormalization),
    then there's a way to cut off some bytes from the decoder,
    as non-carryless decoders are simpler than that.

  7. #7
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Shelwien
    1. I kinda expected to get an already filtered input,
    but seems that E8 filtering still helped
    Good to know; I'll check whether it's worth adding (compressor code size vs compressed size gain).

    Quote Originally Posted by Shelwien View Post
    Just a masked context with a byte mask?
    Using a bit mask would certainly help, so I'd not stop
    at just 256 contexts
    Yes, it's a byte mask which describes the last bytes to take into account for the hash counting (1 bit on => 1 byte to process in the history).
    Do you think that with an order-16 context I could get some improvement?
    Not sure about that, but well, it needs a test...

    Quote Originally Posted by Shelwien View Post
    Possibly there're just a few precalculated profiles,
    and full search is not performed.
    The first method I tried was assigning a weight based on the order of the context (like in the first versions of paq), and then changing them slightly. It works quite well, but I was not able to find the same models/weights as crinkler (and mine were less efficient).
    Quote Originally Posted by Shelwien View Post
    Also 256 is not much (or 256*256), and that hash can
    be replaced with a direct scan (I mean, instead of
    storing the counters, they can be calculated from
    scratch each time - without using any memory).
    Like that it would be easily possible to calculate
    the code length for each parameter set at once and find
    the best one... or actually it should be possible
    to even optimize a few sets for data segments with
    different properties.
    Huh, I don't fully understand your suggestion.
    If the hash had to be recalculated from scratch for each bit, it would take ages to compress/decompress... and the decompression needs to be reasonably fast (but not the compression).

    Quote Originally Posted by Shelwien View Post
    Also there's a 12k 64x64x3 image in there (or a texture?), which allows gaining a few hundred bytes with an image model.
    ...
    Anyway to me it seems more interesting to use different
    models for different types of data in this case, even
    if the decoder size would be a few times larger.
    It won't work for small binaries of course, but 32k
    input is already ok I think.
    Yep, you're right... but for such a size, I'm not sure the cost of an image model (the code size) would give enough gain compared to the basic match-sparse model... Moreover, images are not really common in intros, and it's preferred to generate textures procedurally (although in the raytro we didn't have time to do it).
    It's probably worth it for x86 code (this is actually what is done in kkrunchy). Demos also use GLSL/HLSL shader code (like C code); I'm wondering if a text/code model would be relevant here...

    Quote Originally Posted by Shelwien View Post
    Btw, a random idea: if something like paq rangecoder
    is used (the one with ((x1^x2)&0xFF000000)==0 renormalization)
    then there's a way to cut off some bytes from the decoder,
    as non-carryless decoders are simpler than that.
    I guess I'm using what you are suggesting. The C version I'm using is based on an improved version of code generated with flavor (http://flavor.sourceforge.net; I found it quite helpful, by the way, nice program), with 33-bit arithmetic precision to facilitate the renormalization test against 0x80000000 (in asm, it's just a sign test, really simple):
    Code:
    int Decoder::Decode(Bitstream &_F_bs, double pr1) {
        int x = 0; // the decoded symbol

        // renormalization: keep the range above 2^31
        // (with 33-bit precision this is just a sign test in asm)
        while (_F_R < 0x80000000) {
            _F_R <<= 1;
            _F_V <<= 1;
            _F_V += _F_bs.getbits(1);
        }

        // split the range according to p(1); note the implicit
        // double -> unsigned truncation
        unsigned int r_1 = (_F_R * pr1);
        _F_R = _F_R - r_1;
        if (_F_V >= _F_R) { // we have the symbol 1
            _F_V = _F_V - _F_R;
            _F_R = r_1;
            x++;
        }

        return x;
    }

  8. #8
    Administrator Shelwien (Kharkov, Ukraine)
    >> but it seems that E8 filtering still helped
    > Good to know; I'll check whether it's worth adding
    > (compressor code size vs compressed size gain)

    In fact, I meant this: http://ctxmodel.net/files/E8flt-v2.rar
    A similar filter can be implemented with a 20-byte asm loop
    while allowing a gain of ~130 bytes, so I think it would be
    a worthy addition.

    > > Using a bit mask would certainly help, so I'd not stop
    > > at just 256 contexts
    >
    > Yes, it's a byte mask which describes the last bytes to
    > take into account for the hash counting (1 bit on => 1
    > byte to process in the history).

    I meant that you have some prefix bytes, for example:
    = F9 D8 24 95 E0 26 43 00
    and mask some of these with a byte mask, for example:
    & FF 00 FF 00 FF 00 FF 00
    = F9 00 24 00 E0 00 43 00
    while I suggested something like this:
    & 00 F0 80 80 80 F0 FF FF
    = 00 D0 00 80 80 20 43 00

    It's commonly very useful for text compression
    (e.g. 0DF masks) and even more so for binary data,
    as x86 instructions consist of bitfields.

    Btw, it also helps to optimize the hash function parameters
    along with context mask.
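
    For illustration, such a bitwise context mask could be applied like this (a minimal sketch; the mask values are just the example above, with mask[0] applying to the most recent byte):
    Code:
    #include <cstdint>

    // Build a context from the last 8 history bytes with a per-byte bit
    // mask; hist points at the current position, so hist[-1] is the
    // previous byte. The masked value would then go into the hash function.
    uint64_t masked_context(const unsigned char* hist) {
        static const unsigned char mask[8] =
            { 0xFF, 0xFF, 0xF0, 0x80, 0x80, 0x80, 0xF0, 0x00 };
        uint64_t ctx = 0;
        for (int i = 0; i < 8; ++i)
            ctx = (ctx << 8) | (hist[-1 - i] & mask[i]);
        return ctx;
    }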

    > Do you think that with an order-16 context I could get some improvement?
    > Not sure about that, but well, it needs a test...

    It seems that ppmonstr only gains ~13 bytes from o16 vs o8,
    so probably not.

    >> Also 256 is not much (or 256*256), and that hash can
    >> be replaced with a direct scan (I mean, instead of
    >> storing the counters, they can be calculated from
    >> scratch each time - without using any memory).
    >> Like that it would be easily possible to calculate
    >> the code length for each parameter set at once and find
    >> the best one... or actually it should be possible
    >> to even optimize a few sets for data segments with
    >> different properties.

    > Huh, I don't fully understand your suggestion.
    > If the hash had to be recalculated from scratch for each bit,
    > it would take ages to compress/decompress... and the
    > decompression needs to be reasonably fast (but not the
    > compression)

    1. Of course the actual compression and decompression
    can be implemented with a hash; I was talking about the
    parameter optimization.
    2. To recalculate a counter from scratch you need
    to find all the context occurrences from the start of
    the data and update the counter with each one.
    I agree that this might be slow with megabytes of data,
    but how many occurrences of an order-8 context can there be
    in 30k of data?
    3. The point of doing this is the reuse of intermediate values.
    For example, you would need 64k full sequential passes
    to try all mask/"weight" combinations, but only 256 passes
    with 256 weight loops in each of them with such a parallel approach.
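
    A sketch of point 2, bytewise and simplified for brevity (a real coder works bitwise; all names here are illustrative):
    Code:
    #include <cstdint>

    struct Counters { uint32_t n0 = 0, n1 = 0; };

    // Recompute the counters for the current masked context by direct
    // scan instead of a hash: walk the already-seen data, find earlier
    // positions whose masked 8-byte history matches the current one,
    // and count how the following byte's top bit was set there.
    Counters scan_counters(const unsigned char* data, int pos,
                           const unsigned char mask[8]) {
        Counters c;
        for (int j = 8; j < pos; ++j) {
            bool same = true;
            for (int i = 0; i < 8 && same; ++i)
                same = ((data[j - 1 - i] ^ data[pos - 1 - i]) & mask[i]) == 0;
            if (same) (data[j] & 0x80 ? c.n1 : c.n0)++;
        }
        return c;
    }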

    >> Also there's a 12k 64x64x3 image in there (or a texture?), which
    >> allows gaining a few hundred bytes with an image model.

    > Yep, you're right... but for such a size, I'm not sure
    > the cost of an image model (the code size) would give
    > enough gain compared to the basic match-sparse model...

    In my tests, the image model allowed cutting off ~300 bytes.
    And its decoder would be basically the same as for other data;
    it just has to include some byte from the previous row in the
    context.
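
    For a 64x64x3 row-major image that could look like this (a minimal sketch; the channel layout is assumed):
    Code:
    // The byte one row up is 64*3 = 192 positions back, and the same
    // channel of the previous pixel is 3 back.
    const int ROW = 64 * 3;

    unsigned image_context(const unsigned char* data, int i) {
        unsigned above = i >= ROW ? data[i - ROW] : 0;  // pixel above
        unsigned left  = i >= 3   ? data[i - 3]   : 0;  // same channel, left
        return (above << 8) | left;
    }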

    > moreover, images are not really common in intros, and it's
    > preferred to generate textures procedurally (although
    > in the raytro we didn't have time to do it)

    Well, somehow I think that there're plenty of similar cases.

    > It's probably worth it for x86 code (this is actually what is
    > done in kkrunchy).

    Sure... its also done in durilca:
    http://www.compression.ru/ds/durilca.rar

    > Demos also use GLSL/HLSL shader code (like C code);
    > I'm wondering if a text/code model would be relevant here...

    Of course, any data with syntax restrictions would benefit
    from a custom model for compression.
    However, in the case of your intro, the script compresses
    to ~800 bytes, so it won't be easy to write a full
    expression parser that is smaller than the possible gain.
    But such a script can be shrunk with a lossy preprocessor
    (all identifiers to 1 symbol, numeric constant optimization:
    0.004 -> 4E-3, 20000.0 -> 2E4, 0.0 -> .0 etc.), and then
    charset restrictions and bracket matching can be added to the
    compressor's model.
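
    The numeric-constant part of such a preprocessor might be sketched like this (illustrative only; note that C's printf always emits two exponent digits, so a final pass would strip the zero to get 4E-3):
    Code:
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <string>

    // Try a few formats and keep the shortest string that still parses
    // back to the same value (a lossy preprocessor can relax this check).
    std::string shortest_literal(double v) {
        const char* fmts[] = { "%.17g", "%g", "%.0E", "%.1E", "%.2E" };
        std::string best;
        for (const char* f : fmts) {
            char buf[64];
            std::snprintf(buf, sizeof buf, f, v);
            if (std::strtod(buf, nullptr) == v &&
                (best.empty() || std::strlen(buf) < best.size()))
                best = buf;
        }
        return best;  // e.g. 20000.0 -> "2E+04" -> "2E4" after zero-stripping
    }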

    > I guess I'm using what you are suggesting.

    No, I was talking about something like sh_v1m in
    http://ctxmodel.net/files/mix_test/mix_test_v9.rar

    > The C version I'm using is based on an improved version

    Well, this doesn't have the carry workaround, which is good,
    but it's based on bitwise input, which is not so good
    (though I guess it can be OK if it uses the BT instruction
    to read bits).
    Also, it reminded me of an old implementation of mine
    which kept the rangecoder state on the FPU stack
    (see cld-a2 in compression.ru/sh/coders6a.rar),
    but actually I don't think that using any floating point
    is good for speed optimization - FP instructions are
    longer (both FPU and SSE), and some operations require
    several instructions where a single one is enough for integers.

  9. #9
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Shelwien View Post
    In fact, I meant this: http://ctxmodel.net/files/E8flt-v2.rar
    A similar filter can be implemented with a 20-byte asm loop
    while allowing a gain of ~130 bytes, so I think it would be
    a worthy addition.
    Thanks! I'll have a look at it... I agree that it's quite small to code in asm, and worth it.
    Quote Originally Posted by Shelwien View Post
    I meant that you have some prefix bytes, for example:
    = F9 D8 24 95 E0 26 43 00
    and mask some of these with a byte mask, for example:
    & FF 00 FF 00 FF 00 FF 00
    = F9 00 24 00 E0 00 43 00
    while I suggested something like this:
    & 00 F0 80 80 80 F0 FF FF
    = 00 D0 00 80 80 20 43 00
    Well, it would be great... but I'm afraid it would take too much space...
    I have only one byte to match the last 8 bytes... With your solution, each context parameter would take 8 bytes. It could be worth it for a 64k... but I'm not sure in the 4k-16k range...
    I guess the best opportunity for my compressor is to add a prefilter for x86 code... that's where I can get a good gain vs code size.

    Quote Originally Posted by Shelwien View Post
    It seems that ppmonstr only gains ~13 bytes from o16 vs o8,
    so probably not.
    Ok, good to know.

    Quote Originally Posted by Shelwien View Post
    Well, this doesn't have the carry workaround, which is good,
    but its based on bitwise input which is not quite good.
    (though I guess it can be ok if it uses BT instruction
    to read bits).
    Also, it reminded me of some old implementation of mine
    which kept the rangecoder state on FPU stack
    (see cld-a2 in compression.ru/sh/coders6a.rar)
    but actually I don't think that using any float-point
    is good for speed optimization - FP instructions are
    longer (both FPU and SSE) and some operations require
    several instructions while single is enough for integers.
    Yes, it uses the BT instruction, and the code is really small (<45 bytes). It wouldn't be small with sh_v1m. I agree that working on bits is not efficient in speed... but it is efficient in size, and that's what I'm looking for!
    I'm using the FPU mainly because it's much easier to calculate the neural network train/stretch functions, and the code is smaller too. I don't have any precalc, shifting, etc., and it saves a lot of space!
    For FPU vs fixed point, well, I don't fully agree with you. In demos, with lots of calculation for 3D or a soft synthesizer, integer fixed point has always been slower than the FPU since... the Pentium, if I remember right. With large calculation code, the FPU is much better armed with its stack (I'm talking about plain +*/-; other instructions like log/sin/cos are slow... but not comparable to x86 integer).
    With plain integer fixed point, you have to play with push/pop registers, the memory stack, shifting, etc. In the end, this is usually slower (well, I'm sure you could find some examples that are faster... but in what I have tested on asm code over the past few years, it was usually much slower with integer - and the code was bigger too...).

  10. #10
    Administrator Shelwien (Kharkov, Ukraine)
    Well, I tried to sketch it, and here's a 41-byte version.
    There're still some places to tweak, so it's probably not the real minimum.

    Update: a 37-byte version with one more trick.

    Update2: 36 bytes

    Update3: 35 bytes %)
    Attached Files
    • File Type: cpp 1.cpp (1,015 Bytes, 263 views)
    • File Type: cpp 1a.cpp (1,015 Bytes, 259 views)
    • File Type: cpp 1b.cpp (1.1 KB, 254 views)
    • File Type: cpp 1d.cpp (1,007 Bytes, 272 views)

  11. #11
    Administrator Shelwien (Kharkov, Ukraine)
    > Yes, it uses the BT instruction, and the code is really small
    > (<45 bytes). It wouldn't be small with sh_v1m.

    Well, it is

    > I agree that working on bits is not efficient in speed... but
    > it is efficient in size, and that's what I'm looking for!

    Not really

    > I'm using the FPU mainly because it's much easier to
    > calculate the neural network train/stretch functions, and the
    > code is smaller too. I don't have any precalc, shifting,
    > etc., and it saves a lot of space!

    Actually, there're linear mixers that don't need any precalc.
    And the formal stretch/squash are not the best compression-wise.
    And as for the precalc, I think it will be well compensated by other things.

    > For FPU vs fixed point, well, I don't fully agree with
    > you. In demos, with lots of calculation for 3D or a soft
    > synthesizer, integer fixed point has always been slower
    > than the FPU since... the Pentium, if I remember right. With large
    > calculation code, the FPU is much better armed with its
    > stack (I'm talking about plain +*/-; other instructions
    > like log/sin/cos are slow... but not comparable to x86
    > integer).

    Dunno, it all depends on programming style, I guess.
    I've tried to use floating point for simplicity in many
    projects, and then had to convert them to integer because
    of portability issues, and it always became much faster after that.

    > some examples that are faster... but in what I have tested
    > on asm code over the past few years, it was usually much
    > slower with integer - and the code was bigger too...).

    Well, my condolences

  12. #12
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Shelwien View Post
    Well, I tried to sketch it, and here's a 41 byte version.
    Also there're some places to tweak, so its probably not the real minimum.
    Update: a 37-byte version with one more trick.
    Update2: 36 bytes
    Update3: 35 bytes %)
    Oops, my apologies, I didn't see the if (DECODE) in sh_v1m! Your code is really nice!
    Heh, I was also miscounting my version, which is around 39 bytes... it's the switch from float multiplication to integer and vice versa that adds those bytes! But the version you used is much more elegant, because the compressor code is also very simple (mine is horrible, generated from flavor, and using 33 bits it's huge compared to yours!). Your version looks like the one from crinkler... but the renormalization seems to be done at sTOP=0x80000000. For optimization, it would be more relevant to compare the whole program. With limited registers, things get a little more complicated across the whole code (push/pop, initialization...). But still, the version you pointed me to is the one I was looking for with no success... that's why I went with flavor...

    Quote Originally Posted by Shelwien
    Actually there're linear mixers that don't need any precalc.
    And the formal stretch/squash are not the best compression-wise.
    And as to precalc, I think it will be well compensated by other things.
    Oh, it would be great if I could have adaptive weights without the ln2/2^ stuff! Can you point me to an example?

    Quote Originally Posted by Shelwien
    Dunno, it all depends on programming style I guess.
    I've tried to use float-point for simplicity in many
    projects, and then I had to convert them to integer because
    of portability issues and it always becomes much faster after that.
    Context is probably relevant too... but well, I should do some tests again... perhaps I missed something when trying to convert from floating point to fixed point...

  13. #13
    Administrator Shelwien (Kharkov, Ukraine)
    > Oops, my apologies, I didn't see the if (DECODE) in sh_v1m!
    > Your code is really nice!

    Btw, it became 34 bytes using shifts by CL instead of 8.
    And it can probably be improved even further... e.g. there're
    ways to remove one... or all... jumps.

    > compressor code is also very simple (mine is horrible,
    > generated from flavor, and using 33 bits it's huge
    > compared to yours!).

    As to encoder simplicity, this is pretty bad compared
    to carryless rangecoders... for example, see
    rc_mm.inc in http://ctxmodel.net/files/MIX/mix_v3.rar
    ...or crlf_m from http://compression.ru/sh/coders6c2.rar
    But such an implementation has lower precision, and
    its decoder is a little more complex than a non-carryless one.

    > Your version looks like the version from crinkler... but
    > the renormalization seems to be done at sTOP=0x80000000.

    rc_sh2f.inc in mix_v3.rar is like that.

    > For optimization, it would be more relevant to compare the
    > whole program. With limited registers, things are getting
    > a little bit more complicated on the whole code (push/pop,
    > initialization...).

    Dunno, I guess after the 6502 with its 3 8-bit registers this
    still seems a lot to me.
    Also, push/pop are really cool for code size optimization -
    they are single-byte and can be used to manipulate user data,
    as there's no risk of an interrupt messing up the stack in modern OSes.
    For example, if I needed to save the rangecoder state in
    that v1m example, I'd just use PUSHAD/MOV ESP/POPAD to switch
    register sets.

    > Oh, it would be great if I could have adaptive weights
    > without the ln2/2^ stuff! Can you point me to an example?

    Well, the mix coder I linked does exactly that (sh_mixer.inc).
    Also, I think it's explained in http://ctxmodel.net/rem.pl?-12
    It seems that the logistic mixer has somewhat better adaptation,
    but it probably doesn't matter in your case.
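
    A minimal sketch of such a linear mixer (an illustration of the idea, not a transcription of sh_mixer.inc): the mix is an interpolation between two predictions, and the weight update needs only multiplies and adds, with no logs or exponentials:
    Code:
    struct Mixer2 {
        double w = 0.5;  // interpolation weight for p2, clamped to [0,1]

        double mix(double p1, double p2) const { return p1 + w * (p2 - p1); }

        // LMS-style update: move w along the gradient of the squared
        // prediction error; bit is the observed symbol (0 or 1).
        void update(double p1, double p2, int bit, double rate = 0.02) {
            double pm = mix(p1, p2);
            w += rate * (bit - pm) * (p2 - p1);
            if (w < 0) w = 0;
            if (w > 1) w = 1;
        }
    };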

    > Context is probably relevant too...

    I meant fairly relevant stuff I think...
    Like http://ctxmodel.net/files/ASH/ash07.rar (float)
    vs http://www.ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
    Or there were some codecs with FFT etc.
    Anyway, I talked about computation-heavy stuff of course.

    > but well, I should do some tests again... perhaps I missed
    > something when trying to convert from floating point to
    > fixed point...

    Another possibility is bytewise modelling.
    For example, ppmonstr compresses your intro binary
    to 9635 bytes and durilca (=disasm32+ppmonstr) to 8753 bytes,
    which is pretty similar to paq-based results, but
    doesn't use any bitwise processing.

    Well, such bytewise coders have much more complex algorithms
    than binary ones, but which approach is actually more compact
    is unknown.

  14. #14
    Administrator Shelwien (Kharkov, Ukraine)
    http://ctxmodel.net/files/mix_test/mix_test_vA.rar

    Experiment with o8 CM tuning for intro compression

    38424 intro.exe.unpack
    8609 paq8p -7
    9635 ppmonstr vJ -o16
    9498 E8 + ppmonstr vJ -o16:
    ~9700 crinkler
    ~9560 ???

    09h2x2 09h2x3 09h2x3a
    9805 9976 9970 // whole original file
    9671 9847 9840 // whole file filtered with E8
    9360 9452 9443 // E8 + bmf2 image encoding

    09h2x2 - slower version with dynamic allocation
    09h2x3 - 09h2x2 retuned to reduce memory usage
    09h2x3a - 09h2x3 with static counter arrays

    --------------

    Actually, it was an experiment in extending the mix_test contexts
    up to order 8 without involving hashes.
    Unexpectedly, C++ operator overloading really helped, and I was able
    to use the same code both to access the dynamically allocated trees and
    the static tables.

    Also, I'll probably post another version later, with somewhat better
    results - this one isn't fully optimized yet.

  15. #15
    Member Skymmer (Russia)
    Code:
    intro.exe.unpack   38 424
    
    PAQ8px -4          8 558
    PAQ8px -5          8 550
    PAQ8px -6          8 547
    PAQ8px -7          8 549
    
     PAQ8k -4          8 492
     PAQ8k -5          8 495
     PAQ8k -6          8 496
     PAQ8k -7          8 496
    
    PAQ8k2 -6          9 209
    It's interesting that for PAQ8px the -6 level is better than -7, and for PAQ8k the -4 level gives the best results. By the way, all results can be 15 bytes smaller if the intro.exe.unpack name is changed to some 1-symbol name.
    So 8 477 is achievable.

  16. #16
    Administrator Shelwien (Kharkov, Ukraine)
    Thanks, I was too lazy to do that myself.
    However, it's kinda unlikely that a full paq8 would fit into
    the extra 1k which it gained.

    Btw, I found that test.bat in mix_test_vA.rar causes
    an exception when started on a CPU without SSE3.
    It appears that Shkarin for some reason compiled bmf for SSE3.

  17. #17
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Skymmer View Post
    So 8 477 is achievable.
    Well, the size of a PAQ8k decompressor would be >> 1.5k even in assembler! So it's not really achievable... At this size, we need to compare the size of the compressed data + the size of the decompressor.

    Thanks, Shelwien, for the test as well. I did a simple test with your E8 filter and with adding the upper pixel in the 64x64x3 image; here are the results with crunky (my compressor, we are going to call it that from now on):
    - 9560 normal
    - 9424 with the E8 filter
    - 9287 with the E8 filter + pixel above in the 64x64x3

    Adding the E8 filter to the decompressor costs around 20 bytes; adding the pixel above is less than 10 bytes. So you've convinced me that it's worth doing!

    It seems, though, that BMF is doing more than just the pixel above... what does it do exactly?

  18. #18
    Administrator Shelwien (Kharkov, Ukraine)
    > 9287 with E8 filter + pixel above 64x64x3

    Well, I guess I'd like to know how many contexts you mix there,
    as my result with 8 contexts seems significantly worse,
    even though it's supposedly better optimized than yours
    (because of the bitwise masks, if nothing else).
    Also, whether you have APM/SSE stages, and/or something like
    paq's "exemodel".

    > It seems, though, that BMF is doing more than just the pixel above...
    > what does it do exactly?

    It's not open source. But I suppose it's something similar to this:
    http://www.compression.ru/ds/bcdr.rar

    Btw, paq8p etc. compress that image to nearly the same size as BMF
    (1146 vs 1164).

  19. #19
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Shelwien View Post
    > 9287 with E8 filter + pixel above 64x64x3
    Well, guess I'd like to know how many contexts do you mix there,
    as my result with 8 contexts seems significantly worse,
    even though its supposedly better optimized than yours
    (because of bitwise masks, if not anything else).
    Also whether you have APM/SSE stages, and/or something like
    paq's "exemodel".
    Actually, my program selects 28 contexts out of the 256 (which means 28 more bytes, already included in the 9287 bytes; without them, the real size is 9250). The context selection depends on the data, so I'm not so surprised that it is quite efficient.
    Also, when I look at your code, you have lots of precalculated tables: just the M_xxx_masky tables are a whopping 256*22 = 5632 bytes. You also have the stretch/squash precalc table, which adds 32*2*2 = 128 bytes. Those data alone are more than half the size of the compressed data. To really make a fair comparison, those extra data should be counted in your final compressed size.

    The initialization of the valmap is quite important, and the mixer is a long suite of specialized code; sure, it would be smaller in full asm, but even with your asm skill, a plain translation to asm wouldn't go under 500 bytes (if not a lot more). I'm also quite sure that the method I'm using with the FPU is efficient in size, and apparently in results, for this kind of data. I'm not using any APM/SSE stages, as I did some tests and they didn't give me better results (counting the code size for this stage).

    Quote Originally Posted by Shelwien View Post
    It's not open source. But I suppose it's something similar to this: http://www.compression.ru/ds/bcdr.rar
    Woo, I didn't expect code that large. Well, reading code like this is quite obscure to me (like yours, Shelwien, btw); I think your code deserves a big readme.txt and more comments to clarify the obscure parts, so it can be shared with others... Also, it's something I have appreciated a lot in Matt Mahoney's code: the large comments at the beginning of his compressors were really helpful for understanding the code!
    For example, the mixing algo you are using is quite interesting, but even the link you provide on your "blog" and the code in the mixer are not enough to understand it. It would also be good to give a comparison between your mixer and, for example, Mahoney's neural network mixer, everything else being constant of course. It would give more pertinence - or not - to your method.

    Also, your help has been very valuable: I've learned that even at small data sizes it's sometimes worth adding a new model that gives better results!

  20. #20
    Administrator Shelwien (Kharkov, Ukraine)
    > Actually, my program selects 28 contexts out of the 256

    So 28 contexts with static logistic mixing, and that's all?

    Well, it's still slightly more than 8, so I guess it's reasonable.

    Although why don't you just use all 256 and adaptive mixing?
    With that, there'd be no more "subset selection problem".

    > Also, when I look at your code, you have lots of
    > precalculated tables: just the M_xxx_masky tables
    > are a whopping 256*22 = 5632 bytes.

    That's a memory/speed optimization, it's not really necessary.
    There're still 9*15=135 bytes of context masks (7 mixers, 8 counters),
    4*15=60 bytes of rate parameters and 2*7=14 bytes of initial weights
    (though these can be safely removed in this case).
    So 135+60+14=209 total bytes of parameters for model elements.
    And then there're 2*2*32=128 bytes of mixer mappings.
    So 209+128=337.

    > Those data alone are more than half the size of the
    > compressed data. To really make a fair comparison, those
    > extra data should be counted in your final compressed size.

    I guess, but your current result is better than mine even
    without considering that.
    Still, it's not completely valid to just take the parameter
    profile size as is - it can be restructured (like W0 and the
    mappings removed) and even compressed (e.g. most of the masks are 0).

    Anyway, this coder wasn't made for intro compression, nor do
    I have enough experience in this area, where the constraints are
    considerably different from usual.

    I just took a random coder which I was recently tweaking
    and used it for redundancy estimation of your data - which
    was fairly successful imho.
    Well, normally I just use paq8 for such estimation, but in
    this case it isn't really applicable.

    > The initialization of the valmap is quite important, and the
    > mixer is a long suite of specialized code; sure, it would
    > be smaller in full asm, but even with your asm skill, a
    > plain translation to asm wouldn't go under 500 bytes (if
    > not a lot more).

    You mean, just that mixer? Not really; it would take
    just ~40 bytes for Mixup() and Update() and a similar
    amount for the valmap init - these are actually much easier
    to write in asm than in C.

    > I'm also quite sure that the method I'm using with the FPU
    > is efficient in size, and apparently in results, for
    > this kind of data.

    I don't really have anything against use of float-point in
    asm... I even posted a link to my own implementation of
    an FPU-based rangecoder.
    But in this specific task I don't really see any place
    where FPU instructions would be more applicable than
    integer.

    > I'm not using any APM/SSE stages, as I did some tests and
    > they didn't give me better results (counting the code size
    > for this stage).

    Yeah... newer instructions tend to have longer codes

    >> http://www.compression.ru/ds/bcdr.rar
    > Woo, i didn't expect a code so large.

    It's not, really; it's just a weird writing style.
    Most of the C statements there actually correspond to
    single asm instructions, so it's not that large, just
    unreadable.

    > I think your code deserves a big readme.txt and more
    > comments to clarify the obscure parts, so it can be shared
    > with others...

    Well, in theory I agree, though most of it was explained
    on this forum or on ctxmodel.net.
    It's just that chances to discuss such implementation details
    are quite rare, and I don't have any motivation to explain
    something when nobody specific needs it.
    Also, I prefer to spend the time improving the coder
    instead of documenting an imperfect one.

    > Also, it's something i have appreciated a lot in Matt
    > Mahoney's code, the large comments at the beginning of its
    > compressors were really helpful to understand the code!

    Yeah, I liked that a lot too, but my projects are of a much
    smaller scale for now.
    Also, I basically made the ctxmodel site to collect such comments,
    even if it doesn't work quite well yet.

    > For example, the mixing algo you are using is quite
    > interesting, but even the link you provide on your "blog"
    > and the code in the mixer are not enough to understand it.

    Err... if you mean paq_mixer.inc from mix_test, it's completely
    different from the linear mixer for which I gave you the link.
    The file with the linear mixer is called "sh_mixer.inc" and is available
    in some other coders (linked in this thread too).

    > It would be good also to give a comparison between your
    > mixer and for example Mahoney's neural network mixer,
    > everything else being constant of course. It would give
    > more pertinence - or not - to your method.

    Fortunately, we do have exactly that here:
    http://encode.su/forum/showthread.php?p=7758#post7758
    http://encode.su/forum/showthread.php?t=387
    http://encode.su/forum/showthread.php?t=398

    So basically the mixer in mix_test is my own implementation
    of a logistic mixer similar to the one used in paq.

    I started using it only a month ago, after an objective
    comparison with fully optimized coders based on linear and
    logistic mixers showed that the logistic approach actually has
    a valid foundation and adapts to data faster than the linear one.

    Still, it's pretty interesting that a few years before that
    I had no problem beating paq's results (for specific
    data types) using only linear mixing (which was the reason
    why I didn't bother to properly investigate the paq mixer before).
    It kinda tells a lot about the actual efficiency of paq coders.
    Well, in fact Matt even posted the results of a zpaq version
    of a model similar to mix_test_v3 (http://encode.su/forum/showthread.php?p=7769#post7769),
    where we can see how helpful proper parameter tuning is
    (219581 vs 214695 on book1 - don't look at the "indirect model" results,
    as the contexts there are different, so it's unfit for mixer comparison).

    > I've learned that it's sometimes worth adding a new
    > model that gives better results!

    Well, 38k is not really that small - decomp8.exe from a recent
    Hutter Prize entry is only 16252 bytes in size, and it includes
    a complete (well, the universal part of a) paq8 model, dictionary management
    and a WRT-like text preprocessor.

  21. #21
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Shelwien View Post
    Although why don't you just use all 256 and adaptive mixing? With that there'd be no more "subset selection problem"
    Using all 256 models is quite efficient under a certain size, because I don't need to store the model selection... but it is slightly inefficient when you have more data (and it compresses very slowly: 800s for book1), due to the fact that the mixer wastes time excluding bad contexts (pushing weights << 0).

    Quote Originally Posted by Shelwien View Post
    There're still 9*15=135 bytes of context masks (7 mixers, 8 counters)
    I'm really interested in the masks you chose; what is the meaning of the xxx in M_xxx_masky? What kind of expectation/kind of data (exe, etc.) is it based on? Are they heuristics, or did you choose them by brute force?

    Quote Originally Posted by Shelwien View Post
    You mean, just that mixer? Not really, it would take
    just ~40 bytes for Mixup() and Update() and a similar
    amount for valmap init - these are actually much easier
    to write in asm than C .
    No, I just meant the whole program, of course!
    But you've proved to me that asm is almost your natural language.

    Quote Originally Posted by Shelwien View Post
    I don't really have anything against the use of floating point in asm... I even posted a link to my own implementation of
    an FPU-based rangecoder.
    But in this specific task I don't really see any place
    where FPU instructions would be more applicable than
    integer.
    Oh, for the rangecoder, the FPU seems to be a nightmare (your implementation is... coming right from planet Mars).
    The FPU seems to be relevant for size (except under 4k of compressed data, where crinkler beats anything) if you want to get rid of precalculated tables for the stretch/squash functions. 128 bytes of data is more than the cost of using the FPU directly (in my code, the stretch function takes only 20 bytes of FPU code - the overhead comes from switching from integer counters to probabilities - and the squash is only 26 bytes), so in my case this is far more efficient in size. In fact, at this size, it's quite common to try to avoid any precalculation, as the precalculation code takes significant space compared to direct use inside the code.
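
    For reference, the stretch/squash pair being discussed is the standard logistic one (written out here for clarity; paq uses scaled fixed-point variants of the same functions, and on x87 each maps to a short instruction sequence, which is why the FPU version stays small):
    Code:
    #include <cmath>

    // stretch and squash are inverses of each other
    double stretch(double p) { return std::log(p / (1.0 - p)); }
    double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }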

    Quote Originally Posted by Shelwien View Post
    > I'm not using any APM/SSE stages, as i did some test and
    > it didn't give me better results (including the code size
    > for this stage).
    Yeah... newer instructions tend to have longer codes
    Right... I would have loved to use SSE2 with all the registers available... I guess it comes from the time I was using the 68000... bad habit.

    Quote Originally Posted by Shelwien View Post
    Well, in fact Matt even posted the results of zpaq version
    of a model similar to mix_test_v3 (http://encode.su/forum/showthread.php?p=7769#post7769)
    where we can see how helpful the proper parameter tuning is
    (219581 vs 214695 on book1 - don't look at "indirect model" results
    as contexts there are different so its unfit for mixer comparison.)
    I tried my compressor on book1 and book1bwt, and the results are actually quite good:
    Code:
                  crunky
    book1         216962   (13 contexts used)
    book1bwt      211907   (20 contexts used)
    It uses 13-20 contexts, though, and I'm not sure how many contexts were used in the zpaq comparison... The mixer I'm using is a very slightly different version of the paq series anyway...

    Quote Originally Posted by Shelwien View Post
    Well, 38k is not really that small - decomp8.exe from a recent
    hutter prize entry is only 16252 bytes in size, and it includes
    a complete (well, universal part) paq8 model and dictionary management
    and a WRT-like text preprocessor.
    Oops... I don't understand the comparison between the 38k of uncompressed data and the 16252 bytes of compressor code. What do you mean?

  22. #22
    Member Skymmer (Russia)
    Quote Originally Posted by Alexandre Mutel View Post
    Well, the size of a PAQ8k decompressor would be >> 1.5k even in assembler! So it's not really achievable... At this size, we need to compare the size of the compressed data + the size of the decompressor.
    For this particular file it's probably not achievable, but we are talking about a PE packer, which is supposed to work with quite different files varying in size and content, so I think a PAQ engine can prove its usefulness for some files. Furthermore, you're planning to beat Crinkler and, I suppose, kkrunchy, which will be quite a hard task, so having a couple of selectable compression engines will get you closer to victory, I think. Something like having nrv2b\nrv2d\nrv2e\lzma in UPX, or aplib\briefLZ\ffce\jcalg1\lzma in PECompact2.

  23. #23
    Member Alexandre Mutel (Paris)
    Quote Originally Posted by Skymmer View Post
    For this particular file it's probably not achievable, but we are talking about a PE packer, which is supposed to work with quite different files varying in size and content, so I think a PAQ engine can prove its usefulness for some files. Furthermore, you're planning to beat Crinkler and, I suppose, kkrunchy, which will be quite a hard task, so having a couple of selectable compression engines will get you closer to victory, I think. Something like having nrv2b\nrv2d\nrv2e\lzma in UPX, or aplib\briefLZ\ffce\jcalg1\lzma in PECompact2.
    Well, I'm not targeting competition with packers for large executables anyway... I'm more in the 4k-64k range of compressed executables. From the tests I did, it's almost impossible to do better than crinkler under 4k. With kkrunchy it's possible; currently, with only an E8 filter, my compressor seems to be 0.5% to 1% worse (well, I only did this test on putty.exe, so it's not really representative of the 64k target), but still, kkrunchy performs much more pre-filtering/processing on the code than I was planning to.
    But between crinkler and kkrunchy (between 4k and 64k, more precisely the 16k type), it's possible to achieve better results. Though, well, I agree that it is a very, very small niche market.

  24. #24
    Administrator Shelwien (Kharkov, Ukraine)
    > due to the fact that the mixer wastes time excluding bad
    > contexts (pushing weights << 0)

    Well, using a single mixer with multiple inputs is not
    the only way possible, as you could see from my coders.
    In fact, a binary mixer tree instead of a single multi-input mixer
    seems to improve compression, which can be partly explained by the absence
    of the Sum w_i = 1 constraint in most paq mixer implementations.

    And with such a design it's also possible to improve speed by
    discarding branches that have ~0 weights (in both estimation and update).
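
    A sketch of that binary-tree arrangement, reusing the two-input Mixer2 from the earlier sketch (again an illustration, not the actual mix_test code):
    Code:
    #include <vector>

    // Mix predictions pairwise, layer by layer, instead of with one
    // multi-input mixer; mx must hold enough Mixer2 nodes for the tree.
    double tree_mix(std::vector<double> p, std::vector<Mixer2>& mx) {
        size_t k = 0;
        while (p.size() > 1) {
            std::vector<double> next;
            for (size_t i = 0; i + 1 < p.size(); i += 2)
                next.push_back(mx[k++].mix(p[i], p[i + 1]));
            if (p.size() & 1) next.push_back(p.back());  // odd input passes through
            p.swap(next);
        }
        return p[0];
    }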

    >> There're still 9*15=135 bytes of context masks (7 mixers, 8 counters)
    >
    > I'm really interested in the masks you chose; what is the
    > meaning of the xxx in M_xxx_masky?

    The first pair of counters just got randomly named f and g,
    then I added a second pair (f1 and g1), and then 4 more.
    And these xxx are just the names of the context indexes for the
    corresponding counters.

    Also, you may have noticed the sh_model.idx files - they contain
    the original source from which sh_model_h.inc and sh_model_p.inc
    were generated.

    > What kind of expectation/kind of data (exe, etc.) is it based on?

    Well, the contexts in mix_test_vA were tuned to the executable part
    (after cutting off that image) of your intro sample, without E8 applied though.

    > Are they heuristics, or did you choose them by brute force?

    Both, I guess. Anyway, it took multiple hours of optimizer iterations,
    although both my optimizer and that specific coder implementation are
    far from really efficient, so the speed of the optimization process
    actually can be significantly improved - to a few minutes, I guess
    (though well, my few minutes, with an overclocked Q9450).

    > No, I just meant for the whole program of course!

    Btw, do you include the size of the PE headers in your decoder size estimations?
    I think it should be included, as it's possible to keep some data/code in some
    parts of the header.

    Also, intros don't necessarily have to be distributed as PE files, right?
    Did you consider writing a .COM dropper which would generate a real exe file
    and run it, or something like NE exe (its header is kinda more compact)?

    > But you proved me that asm translating for you is almost your natural language

    In fact it is... I'd used it for like 12 years before switching to C++
    (which I did only because of MMX/SSE and IntelC with automatic vectorization -
    it's very hard to compete with that in plain asm).

    Btw, here's a better example of asm code size optimizations
    http://shelwien.googlepages.com/worm.rar
    I tried to help somebody to fit it into 128 bytes, but didn't succeed either.
    Still, it (w.asm) shows how stack and string operations can be used, and
    there's also self-modification.

    > oh, for the rangecoder, the FPU seems to be a nightmare (your
    > implementation is... coming right from planet Mars )

    Actually it was supposed to be a speed optimization.
    First, it keeps the rangecoder state on the FPU stack,
    while the compressor is free to use all the integer registers.
    And second, it reads/writes the compressed data 32 bits at a time -
    and the FPU had 64 bits of integer precision even back when
    there was no MMX.

    > FPU seems to be relevant in size (except under 4k compressed
    > data, where crinkler here beats anything) if you want to get
    > rid of precalculated tables for the stretch/squash functions.

    Err... only that? Why not make it all floating-point - counters,
    mixer weights, rc?.. It might not only reduce the number of
    float-to-integer transitions, but also shave off some bytes
    due to improved precision...

    In fact, my first CMs (ppmy and ash) were written using
    completely floating-point models, and these are
    almost impossible to translate into integers - it'd
    probably require some kind of software floating-point anyway.

    But if it's only stretch/squash, then floats there really
    don't seem that necessary.

    > 128 bytes of data is more than the cost of using the FPU directly

    Yeah, but it doesn't have to be 128 bytes.
    For example, 10*64/8 is already 80 bytes (the deltas barely fit into 10 bits),
    and then it's surely possible to cut off a few bits of precision.
    Also, these tables actually can be merged... or 3 more tables
    can be added for better tuning.

    And if necessary, it's even possible to just use FPU-based
    squash and stretch there as well, as my tables are only
    approximations of these functions, although some data-specific
    static probability transformations are tuned into them.
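
    For reference, the pair of functions the tables approximate (a float sketch; an asm version would likely go through x87 transcendentals such as FYL2X/F2XM1 instead of libm):

    Code:
    #include <cmath>

    double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); } // R -> (0,1)
    double stretch(double p) { return std::log(p / (1.0 - p)); }    // (0,1) -> R, inverse of squash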

    > In fact, at this size, it's quite common to try to avoid any
    > precalculation, as the precalculation code takes
    > significant space compared to direct use inside the code.

    That's only true for very simple algorithms, but something
    on the scale of 300-500 bytes is already different imho.
    The overhead of precalculating something instead of computing it
    inline is fairly small (<10 bytes), and if the function
    is used multiple times, it might be well worth it.

    Although of course I don't know that for certain in the
    case of a logistic CM decoder, as I didn't write one yet.

    >>> I'm not using any APM/SSE stages, as I did some tests and
    >>> they didn't give me better results (including the code size
    >>> for this stage).
    >>
    >> Yeah... newer instructions tend to have longer codes
    >
    > right... I would have loved to use sse2 with all the
    > registers available...

    It was a pun actually, with different kinds of SSE.

    But actually, even though SSE might be more complex than
    mixing, it gives enough compression improvement to
    consider it imho - maybe not the one from paq, though.
    For example, a model with 2 counters and 2d SSE is similar
    (or better) in compression to a model with 4 mixed counters.
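
    As a rough illustration of such a stage (a sketch only - the sizes, rate and interpolation are placeholders, not paq's exact APM): the input probability is stretched, quantized against a small per-context curve, and the two neighbouring entries are interpolated, then later nudged toward the actual bit.

    Code:
    #include <cmath>

    // Minimal SSE/APM stage: refine probability p under a small context
    // by interpolating a per-context curve indexed by quantized stretch(p).
    struct APM {
        static const int N = 33;                       // stretch buckets
        double t[256][N];
        int cx, lo; double wt;                         // saved for update()
        APM() {
            for (int c = 0; c < 256; c++)
                for (int i = 0; i < N; i++)            // start as identity mapping
                    t[c][i] = 1 / (1 + std::exp(-(i - 16) / 2.0));
        }
        double pp(double p, int ctx) {
            double x = std::log(p / (1 - p)) * 2 + 16; // stretch, scaled to [0,32]
            if (x < 0) x = 0;
            if (x > 32) x = 32;
            cx = ctx & 255; lo = (int)x; wt = x - lo;
            if (lo >= N - 1) { lo = N - 2; wt = 1; }
            return t[cx][lo] * (1 - wt) + t[cx][lo + 1] * wt;
        }
        void update(int y, double rate = 0.02) {       // move used entries toward y
            t[cx][lo]     += rate * (1 - wt) * (y - t[cx][lo]);
            t[cx][lo + 1] += rate * wt       * (y - t[cx][lo + 1]);
        }
    };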

    > i tried my compressor on book1 and book1bwt
    > and the results are quite good actually
    > crunky
    > book1 216962 (13 contexts used)
    > book1bwt 211907 (20 contexts used)

    My current results are 209836 (book1) and 207961 (book1bwt)
    with mix_test and 8 counters (my mixers have contexts too,
    so there're 15 contexts).
    Also the book1 result would be better in a few days, I guess -
    it's pretty slow to optimize from scratch.

    > though, it uses 13-20 contexts, and I'm not sure how many
    > contexts were used in the zpaq comparison...

    There were plain order0 and order1 counter tables and
    a single mixer instance in my case. Probably the same
    with zpaq, though I didn't really try to decipher the config.

    > Oops.. I don't understand the comparison between the 38k of
    > uncompressed data and the 16252 bytes of compressor code?
    > What do you mean?

    Not sure... probably I wanted to sleep and missed a whole train
    of thought.
    Likely it's two merged statements actually.
    1. That 38k compressible to ~9k is a lot, as it allows adding
    a whole disassembler to the decoder to gain another 100 bytes
    (eg. 500 bytes gained - 400 bytes of disassembler) - or a few models
    for alternate formats. Basically, at such a scale, algorithms
    already can be more compact representations of the data.
    2. That a full paq8 won't be that large if properly written
    (around 1-2k probably), as C++ compiler output is far from
    compact, but still fits into 16k.

    Btw, considering [1], I've got an idea in the style of
    Kolmogorov complexity - it should be possible to generate
    some code that would produce data that would be useful
    for training the model before processing the actual file data.

    Also, did you try to train your model with your decoder's code?
    Though there're even better samples - some common files
    (eg. MSVCRT.dll) would probably be the same in all windows versions.

    I checked and it seems to work with paq8 at least:

    8614 intro_exe_unpack.paq8p
    111935 msvcrt_dll.paq8p
    120196 msvcrt_dll1.paq8p
    120196-111935=8261 which is kinda better than 8614

  25. #25
    Member
    Join Date
    Jul 2009
    Location
    Paris
    Posts
    12
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    Well, using a single mixer with multiple inputs is not
    the only way possible, as you could see from my coders.
    In fact, a binary mixer tree instead of a single multi-input mixer
    seems to improve compression, which can be partly explained by the absence
    of the Sum w_i = 1 constraint in most paq mixer implementations.
    And with such a design it's also possible to improve speed by
    discarding branches that have ~0 weights (in both estimation and update).
    Like in the mix_test_* in paq_mixer.inc? What's the meaning of M_mXwr and M_mXwq?

    Quote Originally Posted by Shelwien View Post
    Btw, do you include the size of the PE headers in your decoder size estimations?
    I think it should be included, as it's possible to keep some data/code in some
    parts of the header.
    Also, intros don't necessarily have to be distributed as PE files, right?
    Did you consider writing a .COM dropper which would generate a real exe file
    and run it, or something like NE exe (its header is kinda more compact)?
    You are right, but because the PE header doesn't really depend on the compression method, it's a static size. And yes, I will use unused parts of the header to store code or hashes for imported dll functions (as is done in crinkler). The .COM dropper is not targeted, as it is not always accepted, and I want to use a plain exe anyway...

    Quote Originally Posted by Shelwien View Post
    Err... only that? Why not make it all floating-point - counters,
    mixer weights, rc?.. It might not only reduce the number of
    float-to-integer transitions, but also shave off some bytes
    due to improved precision...
    I have to check the opportunity of updating the counters in float... I would like to use the probability update; I found your discussion with toffer about this (the parameter optimization thread), but well, I need some explanations!
    In your update formula : p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
    what are wr, mw, pm?
    Also, I'm not sure the rc is worth doing in the FPU... it seems size-consuming compared to a simple integer rc...

    Quote Originally Posted by Shelwien View Post
    Also, did you try to train your model with your decoder's code?
    Though there're even better samples - some common files
    (eg. MSVCRT.dll) would probably be the same in all windows versions.
    I tried, but it doesn't give any improvement... don't know why... probably because, even if the hash is huge... I don't handle collisions... but well, I haven't investigated further...

    Quote Originally Posted by Shelwien
    Just a masked context with a byte mask?
    Using a bit mask would certainly help, so I'd not stop
    at just 256 contexts.
    You were right! I was able to add a bit mask model in a simple manner, without adding too much code (I hope ~10-15 bytes), and it improves compression quite well: 9403 bytes on the normal file (compared to 9560 in the initial version).
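
    (For illustration, one way such a bit-masked context can be formed - this is my reading of the idea, not the actual code from the post: AND each history byte with a per-position bit mask before hashing, so only the selected bits enter the context.)

    Code:
    #include <cstdint>

    // Hypothetical bit-masked context: unlike a byte mask, which keeps or
    // drops whole history bytes, a bit mask keeps selected bits of each byte.
    uint32_t masked_context(const uint8_t* hist, const uint8_t* mask, int n) {
        uint32_t h = 0;
        for (int i = 0; i < n; i++)
            h = h * 0x9E3779B1u + (hist[i] & mask[i]);  // cheap multiplicative hash
        return h;
    }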
    ______________
    @lx - FRequency

  27. #27
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    193
    Thanked 968 Times in 501 Posts
    >> In fact, a binary mixer tree instead of a single multi-input
    >> mixer seems to improve compression, which can be partly
    >> explained by the absence of the Sum w_i = 1 constraint in most paq
    >> mixer implementations. And with such a design it's also
    >> possible to improve speed by discarding branches that have
    >> ~0 weights (in both estimation and update).
    >
    > Like in the mix_test_* in paq_mixer.inc?

    Yes, but any trees are in fact applicable, not necessarily
    binary or balanced ones.

    > What's the meaning of M_mXwr and M_mXwq?

    All the values in that coder have 15-bit precision.
    (Although the mixer weight is stored as 0x4000+w instead of
    clipping the value, as negative weight values might appear
    due to imprecision.)
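
    (That is, something along these lines - a sketch of the convention, with illustrative names:)

    Code:
    #include <cstdint>

    // 15-bit fixed point: 1.0 is represented as 1<<15. The mixer weight is
    // stored biased by 0x4000 so slightly negative values (from imprecision)
    // survive in an unsigned field without having to clip them.
    typedef uint16_t fx15;
    inline int  load_weight (fx15 s) { return (int)s - 0x4000; } // may be < 0
    inline fx15 store_weight(int w)  { return (fx15)(w + 0x4000); }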

    And m-wr is kinda a weight for static mixing with 0.5,
    while m-wq is the mixer update rate... normally wr was supposed
    to be that rate, but because of a mistake it ended up like this.

    > You are right, but because the PE header doesn't really
    > depend on the compression method, it's a static size.

    Only a few fields there are really static.
    And even those rare fields that have to be specified
    often have a rather large set of equally usable values.
    So my idea was that tweaking these, and using all of
    the file as a context for the model, might allow cutting
    off some more bytes without actually increasing the decoder size.

    > And yes, i will use unused parts of the header to store
    > code or hash for imported dll functions (as it is done in
    > crinkler).

    I'm not so certain that custom importing is any better than
    the native one. Imho they usually do that in decoders just to
    avoid dealing with the variable PE structure.
    There're even imports by ordinal, though I guess nobody has compared
    the ordinals of common functions in different windows versions.

    > In your update formula : p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
    > what are wr, mw, pm?

    I'd suggest looking at this:
    http://ctxmodel.net/rem.pl?-2
    http://ctxmodel.net/rem.pl?-6
    http://ctxmodel.net/rem.pl?-12

    Anyway, it's a static binary linear mixing application
    for the counter update, basically
    p'=mix2(mix2(p,y),pm) or
    p'=mix2(p,mix2(y,pm)), which is what's used in my counters
    (y=current bit, p=probability of 1 (in this case),
    pm=some static order-(-1) probability).
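
    In code form the whole update is one line (a sketch; the parameter values here are placeholders - the real ones come out of tuning):

    Code:
    // p' = mix2(mix2(p,y),pm): first a static mix of the old probability
    // with the current bit, then a static mix with the order-(-1) probability.
    struct LinearCounter {
        double p  = 0.5;      // P(bit==1)
        double wr = 0.05;     // rate of the (p,y) mix - adaptation speed
        double mw = 0.02;     // weight of the (.,pm) mix
        double pm = 0.5;      // static order-(-1) probability
        void update(int y) {
            p = (p * (1 - wr) + y * wr) * (1 - mw) + pm * mw;
        }
    };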

    > Also, I'm not sure the rc is worth doing in the FPU...
    > it seems size-consuming compared to a simple integer rc...

    Probably not, as some extra operations are required to
    perform precise integer arithmetic with the FPU.
    Like keeping all values in ((1<<63)+x) form to stop
    normalization.

    >> Also, did you try to train your model with your decoder's code?
    >> Though there're even better samples - some common files
    >> (eg. MSVCRT.dll) would probably be the same in all windows versions.
    >
    > I tried, but it doesn't give any improvement... don't know
    > why... probably because, even if the hash is huge...
    > I don't handle collisions... but well, I haven't investigated further...

    Well, in fact it didn't work with mix_test either;
    that's why I posted that paq8 example, to simplify things.
    But I found that it's possible to gain something by using
    a selected fragment of msvcrt.dll as a context.
    Probably our models are just not adaptive enough
    (well, no wonder without SSE etc).

    > You were right! I was able to add a bit mask model in a
    > simple manner, without adding too much code (I hope
    > ~10-15 bytes), and it improves compression quite well: 9403
    > bytes on the normal file (compared to 9560 in the initial
    > version).

    That's cool, though I really wonder how you are going
    to optimize these parameters.

    Then, I think this might be usable too (from paq):

    Code:
    //////////////////////////// exeModel /////////////////////////
    // Model x86 code.  The contexts are sparse containing only those
    // bits relevant to parsing (2 prefixes, opcode, and mod and r/m fields
    // of modR/M byte).
    // Get context at buf(i) relevant to parsing 32-bit x86 code
    U32 execxt(int i, int x=0) {
      int prefix=(buf(i+2)==0x0f)+2*(buf(i+2)==0x66)+3*(buf(i+2)==0x67)
        +4*(buf(i+3)==0x0f)+8*(buf(i+3)==0x66)+12*(buf(i+3)==0x67);
      int opcode=buf(i+1);
      int modrm=i ? buf(i)&0xc7 : 0;
      return prefix|opcode<<4|modrm<<12|x<<20;
    }
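
    (In paq8 the value returned by execxt() is then hashed at a few different lengths and used to index a context map, so something similar could drive one more masked-context model here.)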
    Last edited by Shelwien; 16th July 2009 at 08:27.

  28. #28
    Member
    Join Date
    Jul 2009
    Location
    Paris
    Posts
    12
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    > In your update formula : p'=(p*(1-wr)+y*wr)*(1-mw)+pm*mw
    > what are wr, mw, pm?
    I'm testing your update formula, but I'm getting about 6% worse compression than Mahoney's simple non-stationary update ( if (c[y] < 255) c[y]++; if (c[1-y] > 2) c[1-y] = c[1-y]/2 + 1; ). This is a huge difference... I tried to tweak wr, mw, pm, but I can't do better than 6%.

    What's wrong? Have you already performed a comparison with the simple non-stationary update rules? Could you try it on your model? I'm curious...
    ______________
    @lx - FRequency

  29. #29
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    193
    Thanked 968 Times in 501 Posts
    > I'm testing your update formula, but I'm getting about 6%
    > worse compression than Mahoney's simple
    > non-stationary update

    That's kinda expected in your case (with a small amount of data),
    as that linear counter with constant wr is only an approximation
    of a combinatoric counter and is especially imprecise for the first
    bits in a context.
    Although maybe you only replaced the counter and didn't try to
    reoptimize the whole model with it - that initial imprecision usually
    can be well compensated by lower-order contexts in the mix.
    Anyway, my linear counter is chosen for convenience reasons,
    not because of its prediction precision, so there's not much
    reason for me to "defend" it.

    > c[y] = c[y] + (c[y]<255);
    > c[1-y] = c[1-y] / 2 + 1 + (c[1-y]<2);

    I guess you should try tweaking this counter of Matt's instead;
    for example, rewrite it like this:

    > c[y] = c[y] + u*(c[y]<255);
    > c[1-y] = __max( t, c[1-y]*w ) + v;

    And try finding better (u,v,w,t) values than (1,1,0.5,2).

    Also, another thing can be added at probability calculation:

    instead of p = c[0]/(c[0]+c[1])
    you can use something like
    (c[0]+f0) / ( c[0]+c[1] + f0+f1 )
    so f0 and f1 would be additional parameters, similar to pm
    in the linear counter.
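
    Putting both together (a sketch - __max replaced by std::max, and the defaults are just the (1,1,0.5,2) starting point, not recommendations):

    Code:
    #include <algorithm>

    // Generalized non-stationary counter: (u,v,w,t) replace the fixed
    // (1,1,0.5,2), and (f0,f1) smooth the probability estimate.
    struct NSCounter {
        double c[2] = {0, 0};
        double u = 1, v = 1, w = 0.5, t = 2;   // update parameters to tune
        double f0 = 0.5, f1 = 0.5;             // probability smoothing to tune
        void update(int y) {
            c[y]     = c[y] + u * (c[y] < 255);        // bump the observed bit
            c[1 - y] = std::max(t, c[1 - y] * w) + v;  // decay the opposite one
        }
        double p() const {                     // smoothed estimate, as in
            return (c[0] + f0)                 // p = c[0]/(c[0]+c[1]) above
                 / (c[0] + c[1] + f0 + f1);
        }
    };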

    > This is a huge difference... I tried to tweak wr, mw, pm,
    > but I can't do better than 6%.

    There're versions of the linear counter which would fit
    your case better (ones with a dynamic update rate), but they won't
    be simple enough for your task.

    > What's wrong? Have you already performed a comparison with
    > the simple non-stationary update rules? Could you try it
    > on your model? I'm curious...

    Why, I did try, and it was insignificantly better, but noticeably
    slower... also not the /2 version but a generalized one, with
    all the parameters tuned.
    I tried that on book1 though, not with your intro sample.
