Hi,

It's a little nerdy to post a technical thing in the first hour of the year, but anyway...

I'm working on yet another version of TarsaLZP. Right now I'm working on the low order PPM part - the part that was very weak in all released versions so far.

I've created a bytewise coder which scores very well on enwik8 - for example 61.536.584 bytes with order-0 modelling (S = 512), with only a single set of probabilities (not frequencies). What's different in my model compared to usual bytewise coders is that I'm updating the model the same way fpaq0p does, i.e. I'm adding to the selected symbol's frequency a fraction of the sum of the other symbols' frequencies. Usual coders just increase the counter of the selected symbol by a fixed amount.

Adjusting in usual bytewise coders is done this way:
counter[selected] += C;

In my coder I'm doing (probabilities aren't normalized, I'm only maintaining correct relative weights):
increase = (probabilitySum - probability[selected]) / S;
probability[selected] += increase;
probabilitySum += increase;

This way a recent symbol always gets a greater weight than any older one, and neither the strength of rescalings nor the time since the last rescaling plays any role (except for a slight impact on rounding errors). Just like bitwise fpaq0p-style modelling.
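The update above can be written down as a tiny C sketch (only S = 512 and the variable names probability/probabilitySum come from the post; the flat initialization and array size are my assumptions):

```c
#include <stdint.h>

#define ALPHABET 256
enum { S = 512 }; /* learning rate from the post */

uint64_t probability[ALPHABET];
uint64_t probabilitySum;

/* Hypothetical flat initialization - every symbol starts with equal weight. */
void model_init(void) {
    for (int i = 0; i < ALPHABET; i++) probability[i] = S;
    probabilitySum = (uint64_t)ALPHABET * S;
}

/* Add to the selected symbol a 1/S fraction of the total weight of all
   the other symbols; no normalization, only relative weights matter. */
void model_update(int selected) {
    uint64_t increase = (probabilitySum - probability[selected]) / S;
    probability[selected] += increase;
    probabilitySum += increase;
}
```

Note that probabilitySum grows over time, unlike in fpaq0p where the total stays constant - a point raised later in the thread.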

What I want to adaptively tune is the learning rate S. I have ruled out model mixing because:
- model mixing with non-normalized probabilities would be expensive,
- there will be an LZP layer, and that means symbol exclusion, which is incompatible with probability normalization.

I want to tune the order-1 model learning rate. My current idea is to have three order-1 models:
- one with a fixed fast adaptation rate,
- one with a fixed slow adaptation rate,
- one with an adaptively tuned rate.
Actual coding would be done with probabilities from that third model, but I would want to use information from the first two models to update the learning rate of the third model. The question is: how to do that efficiently?

2. What you want is probably an indirect model: context -> bit history -> prediction. So if you have a sequence like 000001, you don't have to guess 1 (fast adapting) or 0 (slow adapting). You look at previous occurrences of the same sequence to see what happened (using a slow adapting model initialized in this case to 1/6).

ZPAQ uses an 8-bit bit history, which is not ideal because a slow adapting direct model still works better on uniform sources. Here are some tests using direct (CM 16..4 = slow..fast) and indirect (ICM) order-0 models. maxcomp.cat is the concatenated maximum compression corpus.

Code:
```comp 0 0 0 0 1  (enwik8    calgary.tar  maxcomp.cat)
(0 cm 9 16    (61241069    1681577    35339958)) (uncomment one line)
(0 cm 9 12    (61217687    1681256    35259845))
(0 cm 9 8     (61290591    1684348    35195798))
(0 cm 9 4     (61902641    1705057    35321488))
(0 icm 5      (62736218    1717028    34535069))
hcomp
halt
post
0
end```
Using a longer bit history might get better results, but then you have to think harder about the second modeling stage. Besides, ICM is designed more for higher order contexts.

3. Do you suggest a bitwise model? I already ruled that out because I will be doing symbol exclusion (usually of some very probable symbol) and bitwise coding is not compatible with that. Or do you suggest histories (where each history bit tells which model was better) and switching between models? Well, I think switching should be worse than an adaptive learning rate.

> Besides, ICM is designed more for higher order contexts.
I will use LZP for higher orders, like I did in previous versions of TarsaLZP - there were order-4 and order-8 LZP models, adaptively switched.

The goal is to beat Francesco's RINGS on enwik using a single core, with similar speed of course.

4. That looks interesting.
What kind of speed do you get with this method?
It apparently involves updating 256 fields per byte,
so it may not end up being faster than bit coding...

5. No, I'm updating only one probability, as I'm not normalizing probabilities. If I wanted to do a CM coder I would probably opt for nibblewise coding, no symbol exclusion, and normalizing to 2^N to avoid divisions entirely. Such a nibblewise CM should be both far superior to a bytewise coder in terms of speed and somewhat superior to a bitwise CM in terms of compression ratio on texts.

With normalization to 2^N I would be forced to recompute all values after each byte. So the routine would be:
sumOthers = 0;
for (symbol : symbols)
    if (symbol != selected)
        sumOthers += (probability[symbol] -= probability[symbol] >> S);
probability[selected] = (1 << N) - sumOthers;
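A runnable version of that routine might look like this (a sketch; the concrete values N = 16 and a 1/32 rate are my choices, not values from the post):

```c
#include <stdint.h>

#define ALPHABET 256
enum { N = 16, SHIFT_S = 5 }; /* hypothetical: total = 2^16, rate = 1/32 */

uint32_t prob[ALPHABET];

void renorm_init(void) {
    for (int i = 0; i < ALPHABET; i++) prob[i] = (1u << N) / ALPHABET;
}

/* The routine from the post: shrink every other symbol by a 2^-SHIFT_S
   fraction and give the freed mass to the selected symbol, so the
   total always stays exactly 2^N and no division is needed. */
void renorm_update(int selected) {
    uint32_t sumOthers = 0;
    for (int sym = 0; sym < ALPHABET; sym++)
        if (sym != selected)
            sumOthers += (prob[sym] -= prob[sym] >> SHIFT_S);
    prob[selected] = (1u << N) - sumOthers;
}
```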

What I'm doing now is grouping symbols into 16 groups of 16 symbols each. That speeds things up a lot and can be easily vectorized even with MMX, as I'm using 16-bit precision in the model.

Sample session (unnecessary info removed):
Code:
```piotrek@msi-wind:~/Pulpit/enwik$ time ~/NetBeansProjects/TarsaLZP/dist/Release/GNU-Linux-x86/tarsalzp c enwik8 enwik8.ari

real	0m11.496s
user	0m11.145s
sys	0m0.316s
piotrek@msi-wind:~/Pulpit/enwik$ time ~/NetBeansProjects/TarsaLZP/dist/Release/GNU-Linux-x86/tarsalzp d enwik8.ari enwik8.dec

real	0m16.862s
user	0m16.077s
sys	0m0.460s
piotrek@msi-wind:~/Pulpit/enwik$ time ~/NetBeansProjects/TarsaLZP/dist/Release/GNU-Linux-x86/tarsalzp c enwik9 enwik9.ari

real	2m1.621s
user	1m51.719s
sys	0m3.404s
piotrek@msi-wind:~/Pulpit/enwik$ time ~/NetBeansProjects/TarsaLZP/dist/Release/GNU-Linux-x86/tarsalzp d enwik9.ari enwik9.dec

real	2m48.654s
user	2m40.782s
sys	0m3.724s
piotrek@msi-wind:~/Pulpit/enwik$ time fpaq/fpaq0pv4A/a.out c enwik8 enwik8.fpaq
enwik8 (100000000 bytes) -> enwik8.fpaq (61457810 bytes) in 9.30 s.

real	0m9.735s
user	0m9.005s
sys	0m0.344s
piotrek@msi-wind:~/Pulpit/enwik$ time fpaq/fpaq0pv4A/a.out d enwik8.fpaq enwik8.dec
enwik8.fpaq (61457810 bytes) -> enwik8.dec (100000000 bytes) in 11.58 s.

real	0m11.975s
user	0m11.229s
sys	0m0.424s
piotrek@msi-wind:~/Pulpit/enwik$ time fpaq/fpaq0pv4A/a.out c enwik9 enwik9.fpaq
enwik9 (1000000000 bytes) -> enwik9.fpaq (622237009 bytes) in 95.88 s.

real	1m46.947s
user	1m32.854s
sys	0m3.044s
piotrek@msi-wind:~/Pulpit/enwik$ time fpaq/fpaq0pv4A/a.out d enwik9.fpaq enwik9.dec
enwik9.fpaq (622237009 bytes) -> enwik9.dec (1000000000 bytes) in 117.86 s.

real	2m11.885s
user	1m54.251s
sys	0m3.760s
piotrek@msi-wind:~/Pulpit/enwik$ ls -al
-r--r--r-- 1 piotrek piotrek  100000000 2011-06-01 16:24 enwik8
-rw-rw-r-- 1 piotrek piotrek   61536584 2012-01-01 13:17 enwik8.ari
-rw-rw-r-- 1 piotrek piotrek  100000000 2012-01-01 13:28 enwik8.dec
-rw-rw-r-- 1 piotrek piotrek   61457810 2012-01-01 13:27 enwik8.fpaq
-r--r--r-- 1 piotrek piotrek 1000000000 2011-06-01 17:29 enwik9
-rw-rw-r-- 1 piotrek piotrek  623688984 2012-01-01 13:20 enwik9.ari
-rw-rw-r-- 1 piotrek piotrek 1000000000 2012-01-01 13:32 enwik9.dec
-rw-rw-r-- 1 piotrek piotrek  622237009 2012-01-01 13:30 enwik9.fpaq
piotrek@msi-wind:~/Pulpit/enwik$```
Eugene's fpaq0something compiled with:
Code:
`piotrek@msi-wind:~/Pulpit/enwik/fpaq/fpaq0pv4A$ g++ -m32 -O3 -fomit-frame-pointer fpaq0pv4.cpp`
So fpaq0something is still faster than my bytewise coder, but my coder offers the possibility of easy and precise symbol exclusion, which is very important in TarsaLZP.

I've attached a snapshot of my coder (Linux 64-bit builds + NetBeans project with automatically generated makefiles that probably work standalone) and I'm still waiting for ideas on the questions I've asked in this topic.

6. Another version for comparison - http://nishi.dreamhosters.com/u/rc_v3a.rar
It's about 2x slower than fpaq0pv4b, but that's mostly a matter of obvious i/o speed optimizations.
But the enwik8 result here is 61172542.

7. > increase = (probabilitySum - probability[selected]) / S;

It's different in this case, though, because in fpaq0p the
equivalent of your probabilitySum remains constant... so
the actual probability should be different.

Still, I think it should be possible to recalculate the increment
in such a way that the _probability_ (not just the freq, like in your case)
would be updated in the same way as in fpaq0p.
It would likely be much slower, but imho it's interesting how it
would affect the compression.

> model mixing with non-normalized probabilities would be expensive,

Not really.
"Linear" mixing like (freq1*(1-w)+freq2*w)/(total1*(1-w)+total2*w)
is quite affordable imho (especially when *w can be replaced with shifts etc).
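That linear mix can be sketched with a fixed-point weight (the 12-bit weight resolution, the 16-bit output scale and the function name are my assumptions, not something from the post):

```c
#include <stdint.h>

enum { W_BITS = 12, W_ONE = 1 << W_BITS }; /* hypothetical weight scale */

/* "Linear" mixing of two non-normalized frequency models, per the formula
   above: p = (freq1*(1-w)+freq2*w) / (total1*(1-w)+total2*w).
   Returns the mixed probability as 16-bit fixed point (0..65536). */
uint32_t mix2(uint32_t freq1, uint32_t total1,
              uint32_t freq2, uint32_t total2, uint32_t w) {
    uint64_t num = (uint64_t)freq1 * (W_ONE - w) + (uint64_t)freq2 * w;
    uint64_t den = (uint64_t)total1 * (W_ONE - w) + (uint64_t)total2 * w;
    return (uint32_t)((num << 16) / den);
}
```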

And if you mean having to mix all 256 freqs at once,
its also not necessarily the case, as you can mix some kind
of cumulative freqs instead (like using unary or binary lookup).

But that's exactly how you end up with binary probabilities instead of freqs,
after some optimizations. Btw, did you see the counter in
http://ctxmodel.net/files/PPMd/ppmd_Jr1_sh8.rar
defined in mod_see.inc, used in ppmd_proc2.inc?
It's a tricky variable-rate shift-based counter.

> The question is: how to do that efficiently?

I'd say, by comparing the codelengths generated by other submodels.
In short, compute an exponentially-weighted entropy estimation for
a model, like clen = clen*(1-wr) + LOG2[p]
(or even (clen*=(1-wr))+=LOG2[total]-LOG2[freq] )
But again, after trying to make it more efficient for a while,
you're very likely to get a standard mixer from that.

An interesting alternative though is explicit switching + optimal parsing...
i.e. it's possible to make an asymmetric coder which would encode extra flags
to switch between models, and the example of lzma shows that with precise
estimation of overall entropy it's possible to minimize the redundancy
from such a non-bijective code and reach similar compression as with simple
mixing, while always using just a single submodel in the decoder (the encoder would be
slower though).

8. > increase = (probabilitySum - probability[selected]) / S;

> It's different in this case, though, because in fpaq0p the
> equivalent of your probabilitySum remains constant... so
> the actual probability should be different.
>
> Still, I think it should be possible to recalculate the increment
> in such a way that the _probability_ (not just the freq, like in your case)
> would be updated in the same way as in fpaq0p.
> It would likely be much slower, but imho it's interesting how it
> would affect the compression.
I think it will end up being basically the same. The only difference is that in my scheme without renormalization only the selected symbol's probability grows by a fixed fraction, while with renormalization the probabilities of non-selected symbols also shrink by a fixed fraction. So the only difference is that the adaptation speed is usually slightly faster in the normalizing scheme for the same constant factor.

> model mixing with non-normalized probabilites would be expensive,

> Not really.
> "Linear" mixing like (freq1*(1-w)+freq2*w)/(total1*(1-w)+total2*w)
> is quite affordable imho (especially when *w can be replaced with shifts etc).
>
> And if you mean having to mix all 256 freqs at once,
> it's also not necessarily the case, as you can mix some kind
> of cumulative freqs instead (like using unary or binary lookup).
>
> But that's exactly how you end up with binary probabilities instead of freqs,
> after some optimizations.
Yes, I mean mixing all symbols at once, and that's why I mentioned nibblewise model mixing. Anyway, mixing is rather incompatible with symbol exclusion; at the least, symbol exclusion makes things messy, slow and imprecise when mixing.

Besides, in my nibblewise model mixing idea there would be no symbol exclusion, so a renormalizing scheme could be used; then total1 could be set equal to total2 and both would be powers of 2. The division would then be replaced by a simple bit shift, and that would make mixing all symbols affordable - with an average of 8 mixings per nibble there would be on average 16 mixings per byte, and that's not too far off the 8 mixings in bitwise coders. With periodical symbol sorting the average number of mixings could even be reduced to fewer than 8 in the nibblewise case.
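With total1 = total2 = 2^N as described, Shelwien's mixing formula indeed collapses to a shift; a minimal sketch (the function name and the 12-bit weight scale are hypothetical):

```c
#include <stdint.h>

enum { W_BITS = 12, W_ONE = 1 << W_BITS }; /* hypothetical weight scale */

/* When both models keep their sums normalized to the same 2^N, the
   denominator total1*(1-w)+total2*w is the constant 2^N * W_ONE, so the
   division reduces to dropping the W_BITS weight scale with a shift.
   The result is a mixed freq on the same 2^N scale. */
uint32_t mix2_pow2(uint32_t freq1, uint32_t freq2, uint32_t w) {
    return (uint32_t)(((uint64_t)freq1 * (W_ONE - w)
                     + (uint64_t)freq2 * w) >> W_BITS);
}
```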

> I'd say, by comparing the codelengths generated by other submodels.
> In short, compute an exponentially-weighted entropy estimation for
> a model, like clen = clen*(1-wr) + LOG2[p]
> (or even (clen*=(1-wr))+=LOG2[total]-LOG2[freq] )
> But again, after trying to make it more efficient for a while,
> you're very likely to get a standard mixer from that.
Can you elaborate on that? What do the variables mean? Assume, like in my original idea, that we have three o1 models: one fast adapting, one slow adapting, and one which we actually use and want to compute a good learning rate for.

> An interesting alternative though is explicit switching + optimal parsing...
> i.e. it's possible to make an asymmetric coder which would encode extra flags
> to switch between models, and the example of lzma shows that with precise
> estimation of overall entropy it's possible to minimize the redundancy
> from such a non-bijective code and reach similar compression as with simple
> mixing, while always using just a single submodel in the decoder (the encoder would be
> slower though).
I'm almost sure that won't be worth the hassle. I want something rather symmetric, without excessively slowing down compression.

9. Originally Posted by Shelwien:
> clen = clen*(1-wr) + LOG2[p]
I'd use alen=alen*(1-wr)+w*LOG2[p]

10. Originally Posted by Piotr Tarsa:
> Can you elaborate on that? What do the variables mean?
This is one of many possible ways; it is used in CPVLMS. Use one fast model and one slow one. Count the entropy of the two over an interval. If the fast one was better, make the real rate a little faster. If the slow one was better, make the real rate a little slower.

11. > symbol exclusion

Btw, there _is_ a way to implement a bitwise PPM at least.
The escape can be encoded per bit (i.e. basically a ternary code),
so that the model would go to the specific branch of a lower-order context,
while all other symbols would be excluded automatically.

> Assume, like in my original idea, that we have three o1 models

As I said, the best way is to estimate the entropy generated
by submodels (without actual coding), and tweak the 3rd model
depending on results of 1st and 2nd.
(So in a way, the 3rd model becomes a mix of 1st and 2nd,
and trying to tweak such a nonlinear parameter as update rate
would likely make it worse than usual mixers).
But in such cases we can't use the total entropy, because that would
make it too stationary. Thus new values have to be added with higher weight,
same as with counters.
And then, exponential weighting is not necessarily the best here
from compression p.o.v, but surely is the most practical.

Btw, if the "1-wr" is what you don't understand there, the idea
is that we have a exponentially weighted sum of symbol coding costs:
clen = Sum[ w^i*l[pos-i], i=0..pos ]
which is incrementally computed via clen=clen*w+l.
And w is written as (1-wr) because "wr" is my usual name for update rates,
with common values near 0, thus (1-wr) is a little below 1.

> I'd use alen=alen*(1-wr)+w*LOG2[p]

That's actually the same in this case - there's no point in normalizing
the values when the only thing that matters is their relation.

12. Okay. Thanks for the explanations.

I understand that we have to compute after each symbol:
ef = ef*w + log(p_ef[s])
es = es*w + log(p_es[s])

Where e[f|s] is the weighted sum of costs in the [fast|slow] model, log(p_e[f|s]) is the cost of the current symbol in the [fast|slow] model, and w is the decay rate.

Now, how do I compute rate_m (m stands for 'main' there)? The problem for me is that with rate_m somewhere in between rate_f and rate_s we can get better compression than with both the fast and the slow model. How do I update rate_m then?

Also, do you know some neat method for fast and somewhat accurate (>10 bits of accuracy, for example) logarithm estimation using integer arithmetic?

13. As stated, for example:

Code:
```ef = ef + log(p_ef[s])
es = es + log(p_es[s])
nsample++
if (nsample > INTERVAL_LEN)
{
  if (ef < es) {
    rate_m = min((rate_m + rate_max)/2, rate_max)   (or min(rate_m*alpha, rate_max) with alpha > 1)
  } else {
    rate_m = max(beta*rate_m, rate_min)             (with beta < 1)
  }
  ef = es = 0
  nsample = 0
}```
so it gets faster more readily than slower; with the first variant, adaptation could be better.

With a test-interval version like this you should leave the "w" parameter out.

14. > I understand that we have to compute after each symbol:
> ef = ef*w + log(p_ef[s])
> es = es*w + log(p_es[s])

Yea, but it may be enough to compute
dfs = dfs*w + log(p_ef[s]) - log(p_es[s])
(btw, I imply that these are actually -log there, or log(1/p))

Also the common mistake in that kind of estimation is forgetting
to switch p depending on bit value.
I.e. it should be
p = model.probability(context); // prob of bit==1
(e*=w) += log2( bit ? p : 1-p );

> How to update rate_m then?

I'd suggest something like
rate_m = ef<es ? max(rate_f,rate_m-1) : min(rate_s,rate_m+1);
or maybe
squash(x) = 1/(1+exp(-x))
rate_m = rate_f + (rate_s-rate_f)/2 * squash( (ef-es)*coef )

> logarithm estimation using integer arithmetics?

Code:
```void LzmaEnc_InitPriceTables( uint* ProbPrices ) {
  uint i, j;
  for( i=(1<<kNumMoveReducingBits)/2; i<kBitModelTotal; i+=(1<<kNumMoveReducingBits) ) {
    const int kCyclesBits = kNumBitPriceShiftBits;
    uint w = i;
    uint bitCount = 0;
    for( j=0; j<kCyclesBits; j++ ) {
      w = w * w;
      bitCount <<= 1;
      while( w>=(1<<16) ) { w>>=1; bitCount++; }
    }
    ProbPrices[i>>kNumMoveReducingBits] = ((kNumBitModelTotalBits<<kCyclesBits) - 15 - bitCount);
  }
}```
"kCyclesBits" sets the precision there.
With some adjustments I've got a full match with math.h results.  Reply With Quote

15. I've opted for model switching (based on weighted costs) instead of adjusting the learning rate. Computing costs takes a surprisingly long time, even though I'm using a small 2-KiB LUT for computing logarithms. Currently my algorithm consists of two (slow and fast) o1 models and two (slow and fast) o2 models and selects only one when coding a symbol. Maybe I'll remove one o1 model.

My algo is awfully slow now (about 4x slower than fpaq0pv4B) but compresses rather well for an o2 coder. Currently it scores:
- 37 067 568 bytes for enwik8,
- 348 348 128 bytes for enwik9.

I have some constants in my algorithm that I would want to tune:

For example, the first one is used in updating the selected symbol in the o1 fast adapting model, i.e. prob[s] += (total - prob[s]) >> O1_FAST_ADAPTION_RATE. The cost adaptation rate is the base-2 logarithm of the fraction of the weighted sum that I'm subtracting after every symbol.

1. What do you use for parameter optimization?
2. Can you point me to some good order-2 (preferably also mixed with order-1) coders?

16. Originally Posted by Piotr Tarsa:
> 1. What do you use for parameter optimization?
I've used the MATLAB GA Toolbox for my optimization purposes. It tunes a single bit stream which consists of all tunable parameters. As a post-processing stage, I also used a random bit-flipping optimizer to find the local optimum where GA left off. It worked very well. You can confirm that yourself by looking at BIT's results.

17. > Computing costs takes surprisingly long time,

That might be due to a long dependency chain.
Explicit prefetching might help (especially with o2) - e.g. prefetching at the end of the previous iteration
instead of at the start of the iteration.

> My algo is awfully slow now (about 4x slower than fpaq0pv4B) but
> compresses rather good for a o2 coder.

o2 is completely different from o0-o1, because it doesn't fit into caches.
So 4x slower may be actually quite good.

> - 37 067 568 bytes for enwik8,

The result with ppmd -o2 is 36,800,798

> 1. What do you use for parameter optimization?

A simple perl script which does binary search on each parameter in a loop,
and runs the actual coder to test a parameter set.
Somehow I never actually needed anything stronger, though.

However, I use it in conjunction with a parameter description DSL,
and apparently my optimization framework is much easier to use than
all others (eg. toffer's from m1).

> 2. Can you point me to some good order-2 (preferably mixed with order-1 also) coders?

How about ppmd -o2? Also http://compression.ru/ds/ppmsj.rar

18. As you only use probabilities, you can mimic PPMd's inherited frequencies by just letting a higher order's prob be the lower order's prob when the context has not been used yet; this helped my o2 coder a lot.

19. > I've used MATLAB GA Toolbox for my optimization purposes.
MATLAB is unfortunately paid software, and I'm not a student anymore so I can't play with MATLAB for free at a university. Do you know some freeware (preferably Linux friendly) with decent optimization features?

And what to optimize for? A weighted compression ratio (on some predefined corpus of subjectively interesting data) doesn't have to be the optimal function, I think, as some files like FlashMX.pdf are much harder to compress. Looking at TarsaLZP results on maximumcompression.com: if I improve the compressed FlashMX.pdf size by 1% then I will jump from 84th place to 42nd place (exactly halfway), but if I improve the compressed english.dic size by 1% then my position will stay the same.

I could use a function that compares the output size of my algorithm to, e.g., the size of paq8's output, but then I see little sense in comparing to coders that have special filters, sparse contexts, etc. on data that benefits from those special models, for reasons similar to those stated in the paragraph about FlashMX.pdf - some files aren't very compressible without filters or sparse contexts, so striving to compress them better without those features would bring too little gain on such files to offset losses on other files.

Also, in the end, TarsaLZP will have 3 layers - high order LZP, medium order SR and low order PPM. As higher order layers are basically independent from lower layers, I could compute the cost of the output from higher layers and optimize it separately. What function (of compressed size) should I optimize then?

> - 37 067 568 bytes for enwik8,

> The result with ppmd -o2 is 36,800,798
Hmmm, ppmd AFAIR uses something like SSE based on frequency rank and lots of small contexts/flags, so there's probably a high level of indirection in probability mapping, but anyway ppmd is a good point to start from. I'll probably add an order-3 symbol ranking layer in between the LZP and PPM layers in the upcoming TarsaLZP, so before adding the LZP layer I think I will have a better comparison to ppmd.

On enwik9 the differences are smaller though, as ppmd -o2 scores 347 435 600 bytes, so my compressed size is only 0.26% larger.

> As you only use probabilities, you can mimic PPMDs inherited frequencies by just letting higher orders prob be lower orders prob when context was not used yet, this helped my o2-coder a lot.
Good idea. I think I'll add some additional counters to o2 contexts and temporarily disable them if they were seen, e.g., fewer than a couple of times.

20. Here are some ZPAQ results on enwik8 using some strictly order 2 models (no fallback or mixing with lower orders). Compression times are on a 2.0 GHz T3200, Win32. Compression times for all of the CM models are about the same (37-38 sec) so I didn't show all of them. Times are for 1 thread. For comparison:

ppmd -o2: 36800798, 8.4 sec, 1.3 MB memory
ppmonstr -o2: 33178379, 150 sec, 1.3 MB

Code:
```(test.cfg)
comp 0 1 0 0 3 (change n=3 to 1 for other models)
(0 icm 17   (36065829, 41s))
(0 cm 23 16 (37489742))
(0 cm 23 8  (37133663))
(0 cm 23 6  (37062353, 37s))
(0 cm 23 4  (37099224))
(0 cm 23 3  (37271166))
(0 cm 23 2  (37803631))
0 cm 23 4
1 cm 23 8
2 mix 0 0 2 32 255 (36634347, 70s)
hcomp
*b=a hash b++ hash *d=a halt
post 0 end```
The commented out models are single component models (n=1). "icm 17" is an indirect context model (context -> bit history -> prediction) using 2^(17+6) = 8M histories of 1 byte each. "cm 23 n" is a direct context model (context -> prediction) with 2^23 = 8M entries (32 MB) and an update rate of max(1/count, 1/(4*n)), so higher numbers adapt slower. The optimal rate is 6 (1/24). The uncommented model mixes 2 cm's on either side of the optimal rate with an order -1 (no context) mixer and a weight learning rate of 32/4096 (higher numbers are faster), which I tuned on enwik7 for best compression. The other parameters are 0 (number of context bits), 0 (start of input range), 2 (number of inputs) and 255 (order 0 context mask, irrelevant in this case).
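The update rate max(1/count, 1/(4*n)) described above can be sketched like this (a simplified floating-point sketch of the rule, not ZPAQ's actual fixed-point table-driven code; the names are mine):

```c
/* A direct-model counter: the prediction moves toward the observed bit
   by 1/count while count < limit, then levels off at the fixed rate
   1/limit - i.e. an effective rate of max(1/count, 1/limit),
   with limit = 4*n in the ZPAQ notation above. */
typedef struct { double p; unsigned count; } Counter;

void counter_update(Counter *c, int bit, unsigned limit) {
    if (c->count < limit) c->count++;
    c->p += (bit - c->p) / (double)c->count;
}
```

The count-based phase makes early estimates unbiased (like a running average), while the capped rate keeps the model adaptive afterwards; larger limits adapt more slowly.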

The COMP parameters 0 1 select the array sizes H[2^0=1] and M[2^1=2]. I use M as a rotating history buffer of the last 2 bytes from which the context hash is computed and put in H. All of the components share this context.

The HCOMP code is called after coding each byte with that byte in the A register. *B=A stores A in M[B mod 2]. HASH computes (A=A+*B+512)*773. *D=A stores A in H[D mod 1].

To compress I used: "zpaq -mtest c x \res\enwik8" to create x.zpaq. (You may get a slightly different result depending on the path length to enwik8.)

21. > Do you know something freeware (preferably Linux friendly) with
> decent optimization features?

In fact it's frequently not a good idea to use a universal optimization engine for this.
It would be much more likely to test all kinds of weird corner cases, and would probably
stop too early. Well, that's what most people want it to do, though, but it's clear that
a compressor optimization can't converge without even testing all the parameter bits.

Anyway, I can give you my optimization scripts - they're written in perl, so would work
in linux too. But you're probably better off writing some of your own.
You've tried tweaking it manually, right? How about automating that process?

> And what to optimize for? Weighted compression ratio

In most cases I used book1, book1bwt, or book1+wcc386 as optimization targets.
Well, also the whole SFC set in lzmarec.
I guess, most of my models are slow enough so that carefully choosing
the optimization target wasn't really an option.

But actually I never thought of integrating any weighted metrics
into the optimizer - it just looks at the compressed file size.
We can always compress multiple files into a single archive,
and also adjust individual sample sizes to fix up the metric,
so imho there's no need to add extra dependencies to the optimizer.

Btw, in some cases I also included the memory size to the metric,
but that was done by adding something like this to the actual coder:
Code:
```fseek( out, (memoryused>>12)-1, SEEK_CUR );
putc( 0, out);```
> if I improve compressed FlashMX.pdf size by 1% then I will jump from
> 84'th place to 42'th place (exacly halfway) but if I improve
> compressed english.dic size by 1% then my position will stay the same.

In a way, you are right here, as a different metric (eg. min sum of places in maxcomp)
would produce different results.

But imho you can think about that after getting it to work at all,
and plain compressed size of a single file (like enwik or something)
should be good enough for that.

> Also, in the end, TarsaLZP will have 3 layers
> What function (of compressed size) should I optimize then?

Simple reduction in compressed size is good enough imho,
the absolute value doesn't matter.

> Hmmm, ppmd AFAIR uses something like SSE based on frequency rank

Only ppmonstr has SSE.
ppmd only has SEE (ie separate binary models for escape flags),
so there's nothing wrong with it.

> On enwik9 the differences are smaller though as ppmd -o2 scores 347
> 435 600 bytes, so that my compressed size is only 0.26 % larger.

ppmd has additional options like -m and -r, so that may be not
the best result on enwik9.

> Good idea. I think I'll add some additional counters to o2 contexts
> and temporarily disable them if they were seen eg less than couple
> of times.

ppmd does that at symbol level though.
Afair it tries to assign the new freq in such a way that
the symbol would initially have the same probability as it previously had.
And with freqs, that's a bit tricky.

22. In my case, initially MATLAB generates a random bit string and runs a small utility to patch my compressor executable with that bit string. In the compressor source, I've intentionally left a code portion that can be found via a signature and then patched. After patching, my compressor runs on a target (for small improvements I usually used book1+wcc386; if it yields a better result I do a final optimization with the whole SFC). Whole-SFC optimization takes a very long time, varying between 7-14 days of non-stop work on my laptop (I've permanently damaged a MacBook Pro with that). The optimization function was not any kind of approximation - it was basically the compressed size. But it's better to integrate Shelwien's metric function or similar (compressing time + transmission time + decompressing time). One of the most important things: GA and similar algorithms deal with a quite large solution space, so they may never converge. If improvements become very very small, you should run a separate algorithm to squeeze out the last few KiB. In my case, I've used a Shelwien-script-like approach (bit-flipping).

Why this setup:
1. I don't need to deal with GA internals. There is an already working framework.
2. I can even use vectorized GA with a single option, which runs my compressor in parallel, thus shortening optimization time.
3. GA is good for such optimization.

Downsides:
1. Toffer's optimizer can converge faster, due to custom operators, e.g. non-linear quantization table optimization, which is a hard problem in my case.
2. MATLAB is a very bloated application and it's very resource hungry. I can't use it on all of my computers. A simple console application would be better.
3. Patching and creating an output compressed archive each time causes a huge amount of I/O. Thus a RAM drive could be better.

23. I'll deal with GA soon (hopefully), after adding SR and LZP. But for now there are still ways to improve the PPM part.

Matt:
Thanks for the comparisons and the script - it will be a helpful tool. But I still think ICM adds more information than simple frequencies or counters. AFAIU, ICM is to CM as fpaq0f is to fpaq0p, and there are controversies about fpaq0f being truly order-0. Anyway, it's not a dealbreaker. State maps are a simple and elegant way to deal with a lot of problems at once, but they're certainly not feasible with bytewise coding and probably don't bring much benefit when higher order models are also present.

My coder, without many optimizations, is less than twice as slow as PPMd -o2 (on my AMD E-350), which AFAIR does frequency sorting that should improve speed considerably on compressible data. So that's much better than ZPAQ does, although maybe ZPAQ's JIT isn't optimized for low order coding.

Code:
```piotrek@msi-wind:~$ cd Pulpit/enwik/
piotrek@msi-wind:~/Pulpit/enwik$ time ~/NetBeansProjects/cTarsaLZP/dist/Release/GNU-Linux-x86/ctarsalzp c enwik8 enwik8.ari

real	0m42.466s
user	0m40.955s
sys	0m0.932s
piotrek@msi-wind:~/Pulpit/enwik$ time ppmd/PPMd e -m256 -o2 enwik8
Fast PPMII compressor for textual data, variant J, Dec 18 2011
enwik8.pmd already exists, overwrite?: <Y>es, <N>o, <A>ll, <Q>uit?y
enwik8:100000000 >36800776, 2.09 bpb, used:  1.3MB, speed: 4663 KB/sec

real	0m25.485s
user	0m20.577s
sys	0m0.376s
piotrek@msi-wind:~/Pulpit/enwik$ time ~/NetBeansProjects/cTarsaLZP/dist/Release/GNU-Linux-x86/ctarsalzp d enwik8.ari enwik8.dec

real	0m49.282s
user	0m47.367s
sys	0m1.228s
piotrek@msi-wind:~/Pulpit/enwik$ time ppmd/PPMd d enwik8.pmd
Fast PPMII compressor for textual data, variant J, Dec 18 2011
enwik8 already exists, overwrite?: <Y>es, <N>o, <A>ll, <Q>uit?y
enwik8:36800776 >100000000, 2.09 bpb, used:  1.3MB, speed: 4105 KB/sec

real	0m31.470s
user	0m23.221s
sys	0m0.600s
piotrek@msi-wind:~/Pulpit/enwik$```
> ppmd has additional options like -m and -r, so that may be not
> the best result on enwik9.
I've used -o2 and -m256 so that there was no model restart/cut-off, to make the comparison as fair as possible.

I have another idea for adaptive disabling. Instead of disabling either o2 model based on total frequency (temporarily not using models of contexts that were rarely seen in the whole process before), I could compute weighted sums of the intervals between o2 context occurrences. I would use simple timestamps to compute the interval lengths. I would, e.g., maintain two weighted interval sums - one giving a higher weight to recently incoming intervals than the other. Then I would have to convert the relation between those two sums into a decision about disabling a particular o2 model temporarily. Maybe use some APM to convert the history of recent weighted interval sum relations, plus the current relation, into a decision about disabling a particular o2 model.
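The timestamp idea above can be sketched as follows (everything here - the decay rates 0.75/0.95, the struct, the "active" criterion - is a hypothetical illustration, not the author's code):

```c
#include <stdint.h>

/* Per o2 context: two exponentially weighted sums of the intervals
   between occurrences, driven by a global symbol-position timestamp.
   fastSum decays quicker, so it weights recent intervals more. */
typedef struct {
    uint64_t lastSeen;
    double fastSum, slowSum;
} ContextStats;

void context_touch(ContextStats *c, uint64_t now) {
    double interval = (double)(now - c->lastSeen);
    c->fastSum = c->fastSum * 0.75 + interval; /* hypothetical rates */
    c->slowSum = c->slowSum * 0.95 + interval;
    c->lastSeen = now;
}

/* Normalize both sums to comparable averages; a context whose recent
   intervals are short relative to its longer-term history looks "active". */
int context_active(const ContextStats *c) {
    double fastAvg = c->fastSum * (1.0 - 0.75);
    double slowAvg = c->slowSum * (1.0 - 0.95);
    return fastAvg <= slowAvg;
}
```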

Besides, development/research will probably be slowed down for a while, as I'm moving and, additionally, my netbook's cooling system is likely broken.

24. ZPAQ bit histories are tuned for higher order models. For low order, simply using the last 8 bits as the bit history works better. However, ZPAQ does not have such a model.
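A minimal sketch of such a low-order indirect model (context -> last 8 bits -> slowly adapting prediction); the 12-bit probability scale, table sizes, and update rate are illustrative assumptions, not ZPAQ code:

```c
#include <stdint.h>

enum { NCTX = 256 };

uint8_t  history[NCTX];          /* last 8 coded bits per context        */
uint16_t prediction[NCTX][256];  /* slow-adapting p(bit=1), 12-bit scale */

/* Start every slot at p(1) = 0.5. */
void initModel(void) {
    for (int c = 0; c < NCTX; c++)
        for (int h = 0; h < 256; h++)
            prediction[c][h] = 2048;
}

/* Predict the next bit for 'ctx' from its current bit history. */
uint16_t predictBit(int ctx) {
    return prediction[ctx][history[ctx]];
}

/* After coding 'bit': slowly adapt the selected prediction slot,
   then shift the bit into the history. */
void updateBit(int ctx, int bit) {
    uint16_t *p = &prediction[ctx][history[ctx]];
    *p += ((bit << 12) - (int)*p) / 32;
    history[ctx] = (uint8_t)((history[ctx] << 1) | bit);
}
```

The point is that a sequence like 000001 selects its own prediction slot, so the model recalls what followed that exact pattern before, instead of committing to a fast or slow adaptation rate.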

Also there is some extra overhead. ZPAQ computes and stores SHA-1 checksums. It compresses to a temporary file and then appends it to the archive. Also, the CM includes both a prediction and a count. The count indexes a table to get the learning rate and then multiplies by the prediction error, rather than using a simple shift like fpaq0f2.
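The difference between the two update rules might be sketched like this; the rate table contents are made-up illustrations, not ZPAQ's actual schedule:

```c
#include <stdint.h>

/* fpaq0f2 style: a fixed shift, i.e. a constant learning rate. */
void updateFixedShift(uint16_t *p, int bit) {
    *p += ((bit << 12) - (int)*p) / 32;
}

/* ZPAQ CM style: the component stores a prediction and a count;
   the count indexes a rate table, so early updates adapt faster
   and the rate settles down as evidence accumulates. */
typedef struct { uint16_t p; uint16_t n; } Counter;

const int rateTable[16] =
    {2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, 256};

void updateCounted(Counter *c, int bit) {
    int rate = rateTable[c->n];
    c->p += ((bit << 12) - (int)c->p) / rate;
    if (c->n < 15) c->n++;
}
```

The counted variant pays for the table lookup and the extra state, which is part of the overhead being described.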

I'm considering updating the ZPAQ standard to level 2 to support fast compressors. The idea would be to omit the context model and keep just the post-processor (written in ZPAQL). The preprocessed data would otherwise be stored uncompressed. This would be equivalent to compressing a string as an arbitrary program, plus its input data, that writes the string. The changes to libzpaq should be fairly minimal, and it would support arbitrary compression algorithms.

25. > I've used -o2 and -m256

It should be faster - http://compression.ru/ds/ppmsj.rar

> decision about disabling particular o2 model

As I said, it's not really a good idea. Why not just recompute the "bad" freq from lower orders, like ppmd does?
Disabling a whole context also prevents it from working where it could.

26. Ok, but what about ppms?

> It should be faster - http://compression.ru/ds/ppmsj.rar

No Makefile for GCC, and problems compiling under Ubuntu.

> As I said, it's not really a good idea. Why not just recompute the "bad" freq from lower orders, like ppmd does? Disabling a whole context also prevents it from working where it could.

I've talked about disabling for coding a specific symbol, not about disabling permanently or for a long run. Remember that I'm using switching between models, so disabling is essentially equivalent to preventing a switch to a particular model.

How, and how often, would you do the recomputing? Recomputing one symbol would take at least one division, as the probabilities aren't normalized. So it could mean low gain at high cost.

Matt:
I believe that the bit histories in your ICM and my idea about a pair of weighted interval sum estimators share some ideas.

Mainly, bit histories + APM allow for automatic detection of local stationarity: if we're in a nonstationary region of the file, the APM will favor the most recent part of the bit history, and if we're in a stationary region, the entire bit history will contribute to the computation.

My pair of weighted interval sum estimators should bring at least some of the advantages of the ICM. Having two estimators - one quickly adapting and one slowly adapting - we can detect if the data type has changed recently. My intuition is that if some context suddenly appears less often (ie the quickly adapting estimator shows a bigger interval length than the slowly adapting one) then we should favor the slowly adapting model and, by analogy, if some context suddenly appears more often then we should favor the quickly adapting model.

The problem is that I don't know whether my intuition is true, and I don't have a complete idea of how to use that homogeneity estimation, or how (and whether) it should affect model updating. There are also four models in total, so multiplying weighted costs (because that's what's used for switching between models) from only the o2 models will render the o1 models unusable. I need to sort that out.

27. http://nishi.dreamhosters.com/u/ppmsj_sh0.rar
http://nishi.dreamhosters.com/ppmsj_sh0/
Compile PPMs.cpp there, like "g++ -O3 PPMs.cpp"

> How and how often would you do recomputing?

Usually once - when the symbol occurs for the first time in the context.

28. I have checked what the result would be with ideal switching. The results are estimates, as I've used my low precision logarithm estimator (on random input it has less than 0.01 % error, so that should be enough).
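For reference, one common way to build a low-precision log2 estimator is to read the IEEE-754 exponent field and add a linear correction from the mantissa. This is a generic sketch of that technique (not the estimator actually used above, which is more accurate); it underestimates log2 by at most about 0.086:

```c
#include <stdint.h>
#include <string.h>

/* Low-precision log2 for positive normal floats:
   log2(m * 2^e) = e + log2(m), and log2(1+f) ~= f for f in [0,1). */
double fastLog2(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);               /* reinterpret safely */
    int exponent = (int)(bits >> 23) - 127;       /* unbiased exponent  */
    double f = (double)(bits & 0x7FFFFF) / (1 << 23);  /* fraction [0,1) */
    return exponent + f;                          /* exact at f = 0     */
}
```

Summing such values over a file gives a cheap total-code-length estimate for comparing models without doing any actual coding.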

Results for enwik8:

All four models ideally switched: 31 420 078 bytes
O2 models ideally switched with o1 models disabled: 33 813 520 bytes
Only slow o2 model: 38 223 691 bytes
Only fast o2 model: 38 391 744 bytes

Recall that the current working result is 37 067 568 bytes.

The results are quite surprising to me. It seems that the idea of switching based on weighted costs is either broken or insufficient. I also have a bug in the code that causes the o2 models to be heavily favored over the o1 models, and after fixing that bug compression gets worse. So it actually seems that explicit switching could give a gain! But I'm not convinced. I will search for better switching methods, probably not using cost estimation, as it's too slow (though maybe someone has a fast routine for that that they're willing to share).

I have thought about using those weighted interval sums. Currently my idea is to quantize the ratio of the fast and slow adapting weighted interval sums and map it, using an APM, to a cell containing four probabilities (the probability of each submodel being the best). For example, quantize o2slowsum / o2fastsum to 8 values. I could also quantize o1slowsum / o1fastsum to eg 4 values and make a 32-bin 2D SSE. I could additionally quantize the current o2 interval length as the next APM dimension. Finally, I could store the last permutation (or two) per o2 context of the order of probabilities (ie the order of probabilities from all four submodels - there would be 4! = 24 permutations). I need to test that and see if it works.
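The quantization step might look something like this sketch; the bin boundaries and the 2D bin layout are purely illustrative guesses:

```c
#include <stdint.h>

/* Quantize slowSum/fastSum to 8 bins. The ratio is scaled by 16,
   so r == 16 means the two sums are equal. A small r means the fast
   sum dominates (recent intervals are long: the context is fading
   out); a large r means recent intervals are short (the context is
   appearing more often lately). */
int quantizeRatio8(int32_t slowSum, int32_t fastSum) {
    if (fastSum <= 0) return 0;
    int32_t r = (int32_t)(((int64_t)slowSum * 16) / fastSum);
    if (r < 4)  return 0;   /* appearing much less often lately */
    if (r < 8)  return 1;
    if (r < 12) return 2;
    if (r < 16) return 3;
    if (r < 21) return 4;
    if (r < 32) return 5;
    if (r < 64) return 6;
    return 7;               /* appearing much more often lately */
}

/* 32-bin 2D SSE index: 8 values of the o2 ratio x 4 of the o1 ratio. */
int sseBin(int32_t o2slow, int32_t o2fast,
           int32_t o1slow, int32_t o1fast) {
    return quantizeRatio8(o2slow, o2fast) * 4
         + (quantizeRatio8(o1slow, o1fast) >> 1);
}
```

Each bin would then hold the four adaptively updated submodel probabilities described above.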

In the meantime I'm attaching the current source code (Linux binaries included).

And PPMs is indeed faster than PPMd, but not by much:
Code:
```piotrek@msi-wind:~/Pulpit/enwik$ time ppmsj_sh0/PPMs e -o2 enwik8
Small PPMII compressor with 1024KB memory heap, variant J, Jan  6 2012
enwik8:100000000 >36866726, 2.09 bpb, CutOffs:  5, speed: 5083 KB/sec

real	0m19.855s
user	0m18.801s
sys	0m0.412s
piotrek@msi-wind:~/Pulpit/enwik$ time ppmd/PPMd e -m256 -o2 enwik8
Fast PPMII compressor for textual data, variant J, Dec 18 2011
enwik8:100000000 >36800776, 2.09 bpb, used:  1.3MB, speed: 4654 KB/sec

real	0m22.506s
user	0m20.509s
sys	0m0.504s
piotrek@msi-wind:~/Pulpit/enwik$```

After a quick investigation: those "ideal switching" results are probably meaningless, as they get too good with some crazy update ratios.

29. I presume you're not coding any model ids in your "ideal switching".
Like that it's certainly worthless, because such switching between submodels which give p=0 and p=1 would compress everything to zero.
The real "ideal switching" is called mixing.
It makes use of the fact that the information about which submodel is best for the next symbol is unnecessary for us,
and simply adjusts the probability distribution of the actual symbols taking into account the p.d. of submodel selection.
Thus switching is always worse than mixing based on the same model.
Basically, mixing is just a more efficient coding method for switching: it lets us encode the actual symbols
with exactly the same entropy, while skipping the coding of the submodel choice.

However, there are also some benefits to explicit switching:
1. Mixing is always symmetric, but switching lets us build formats with faster decoding.
Good parsing optimization would even reduce the redundancy from explicit model choice coding to a negligible value.
With update exclusion, it can actually show better results than mixing.
2. Redundancy can be avoided if the information given by the submodel choice is removed from the p.d. of symbols,
ie masking is applied. Knowing which submodel is best for the next symbol, we can locate the symbols for which it is best,
fix up their probabilities, and only encode the information yet unknown to the decoder.
It's somewhat similar to what PPM does.
But this approach is also symmetric and relatively complicated, so CM is usually preferred.

30. Let w_1, ..., w_n be the probabilities that the (next) symbol x is drawn according to the distribution p_i(x), 0<i<=n.

1. Mixing

We define the linear mixture probability for x to be

p(x) := w_1 p_1(x) + ... + w_n p_n(x).

The (expected) code length for x is exactly

-log2( p(x) ).

2. Switching

We select model "i" with probability w_i. The expected code length is

- w_1 log2(p_1(x)) - ... - w_n log2(p_n(x))

= -log2( p_1(x)^w_1 p_2(x)^w_2 ... p_n(x)^w_n )

>= -log2( w_1 p_1(x) + ... + w_n p_n(x) ) [Jensen's inequality]

= -log2( p(x) ).

Thus, in expectation mixing is never worse than switching.

PS: I'd really appreciate TeX support.
