1. ## Question about LPAQ1 mixer

I'll paste the code I'm referring to:
```//////////////////////////// Mixer /////////////////////////////

// Mixer m(N, M) combines models using M neural networks with
//     N inputs each using 4*M*N bytes of memory.  It is used as follows:
// m.update(y) trains the network where the expected output is the
//     last bit, y.
// m.add(stretch(p)) inputs prediction from one of N models.  The
//     prediction should be positive to predict a 1 bit, negative for 0,
//     nominally -2K to 2K.
// m.set(cxt) selects cxt (0..M-1) as one of M neural networks to use.
// m.p() returns the output prediction that the next bit is 1 as a
//     12 bit number (0 to 4095).  The normal sequence per prediction is:
//
// - m.add(x) called N times with input x=(-2047..2047)
// - m.set(cxt) called once with cxt=(0..M-1)
// - m.p() called once to predict the next bit, returns 0..4095
// - m.update(y) called once for actual bit y=(0..1).

inline void train(int *t, int *w, int n, int err) {
for (int i=0; i<n; ++i)
w[i]+=t[i]*err+0x8000>>16;
}

inline int dot_product(int *t, int *w, int n) {
int sum=0;
for (int i=0; i<n; ++i)
sum+=t[i]*w[i];
return sum>>8;
}

class Mixer {
const int N, M;  // max inputs, max contexts
int* tx;         // N inputs
int* wx;         // N*M weights
int cxt;         // context
int nx;          // Number of inputs in tx, 0 to N
int pr;          // last result (scaled 12 bits)
public:
Mixer(int n, int m);

// Adjust weights to minimize coding cost of last prediction
void update(int y) {
int err=((y<<12)-pr)*7;
assert(err>=-32768 && err<32768);
train(&tx, &wx[cxt*N], N, err);
nx=0;
}

// Input x (call up to N times)
assert(nx<N);
tx[nx++]=x;
}

// Set a context
void set(int cx) {
assert(cx>=0 && cx<M);
cxt=cx;
}

// predict next bit
int p() {
return pr=squash(dot_product(&tx, &wx[cxt*N], N)>>8);
}
};

Mixer::Mixer(int n, int m):
N(n), M(m), tx(0), wx(0), cxt(0), nx(0), pr(2048) {
assert(n>0 && N>0 && M>0);
alloc(tx, N);
alloc(wx, N*M);
}```

Function p() does weighted summation of stretched predictions and then squashes the result. My concern is about the weighted summation. AFAIU, sum of weights has to be more or less constant for the whole algorithm to be working correctly. But it's not obvious by looking at the train() function. Bo you have some simple explanation of the convergence? What are possible errors, ie distance from the ideal weight sum (that is a fixed constant) caused by rounding errors or something like that?

A side question:
What is most time consuming in LPAQ1? Is that retrieving data from hashtables or the mapping and mixing (ie heavy computations) themselves?  Reply With Quote

2. Weights don't have to add to 1, although they usually are close. In ZPAQ, I have two types of mixers. a MIX uses 20 bit weights (range -8 to 8 with 16 bit fraction) and 12 bit stretched predictions (-8 to 8 with 8 bit fraction representing ln(p(1)/p(0)). It takes any number of inputs. A MIX2 takes 2 inputs and uses a single 16 bit weight, constrained to the range 0 to 1.

Depending on the model, one or the other might work better. An ISSE mixes a prediction with a constant 1 using a pair of weights selected by a context mapped to a bit history. I used a 2 input MIX rather than a MIX2 because it compresses better.

When I use a SSE (APM in PAQ) to adjust the final bit prediction, I use a MIX2 to mix its input and output. This works better than a 2 input MIX and better than PAQ's method of linear mixing with a constant weight (3/4 output, 1/4 input).  Reply With Quote

3. Weights don't have to add to 1, although they usually are close.
That's the tricky part for me.

If they don't add to 1, but eg to 2, then if all models give a prediction of 0.5, then the final prediction will be 1.0, which would be far far away from the original predictions. Is there any proof that functions like train() are numerically stable and the distance of the weight sum from 1 (ie the imprecision) is bounded?  Reply With Quote

4. > That's the tricky part for me.

In theory these weights are probabilities (of each submodel being the best for coding of the next bit),
so they have to add up to 1.0.
That's why I prefer to use chains of mixers with 2 inputs - they're a little more precise, have more parameters to tune,
don't really increase the amount of computations, and there's no problem with weight normalization.

Though in any case, the squash function would map anything to a correct probability interval, and weight space overflow
may even improve the compression indirectly (based on the same effect as my "probability extrapolation").  Reply With Quote

5. Weights only need to add to 1 for linear mixing.

> That's why I prefer to use chains of mixers with 2 inputs

A quick test.

zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255c0,0,255,255,255m -> 837929 (order 1, 2, 3 ICM and 3 input order 0 MIX with unconstrained weights).

zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255tc0,0,255,255,255t -> 851189 (1, 2, MIX2, 3, MIX2 with weights constrained to add to 1).

zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255mc0,0,255,255,255t -> 842967 (replace first MIX2 with MIX, removing constraint).

zpaq a x calgary -v 0 -m 4nc0,0,255c0,0,255,255mc0,0,255,255,255m -> 838360 (replace second MIX2 with MIX (but takes all components as input)).

zpaq a x calgary -v 0 -m 4nc0,0,255i2,1 -> 842012 (ICM1, ISSE2, ISSE3 chain with 2 implied 2 input MIX with unconstrained weights).

zpaq a x calgary -v 0 -m 4nc0,0,255i2,1m -> 832878 (as above with a final 3 input MIX).  Reply With Quote

6. > Weights only need to add to 1 for linear mixing.

Not really, if we'd mix (p-0.5) values, and cut off <=0 and >=1 overflows.

> A quick test.

Thanks, but I don't think that it tells us anything.
My opinion is based on a specific theory about the nature of weights in logistic mixer,
and that theory provides an explanation for results like yours, which I mentioned.
To test it, we'd need
(1) mix() that would give the same results as mix2() if we'd normalize its weights (ie w'[i]=w[i]/wsum)
(2) mix2() with a tuned constant coef
(3) maybe investigate the behavior of wsum and explicitly implement it for mix2, with parameters -
the difference is likely similar to freq pair vs linear counter.

In any case, I think that even if you have an algorithm which works better for unknown reasons,
it doesn't mean that it would be better in any other framework too.  Reply With Quote

7. I did research about something related. Here comes the math...

1. PAQ derivation

First of all, the paq mixture is the unique minimizer of

f(q) = w_1 D(q||p_1) + ... + w_m D(q||p_m)

where D(q||p) is the KL-divergence between PDFs q (source distribution) and p (model distribution); p_1, ..., p_m are the model distributions (i.e. predictions); and w_1, ..., w_m are non-negative weights where at least one is non-zero. All of these PDFs are over a binary alphabet.

Finally, the minimizer q* of f(q) is given by

q*(1)
= [ prod i=1..m p_i(1)^w_i' ] / [ sum x in {0, 1} prod i=1..m p_i(x)^w_i' ]
= squash( sum i=1..m w_i * stretch(p_i(1)) )
(probability of bit 1)

where w_i' = w_i/(w_1 + ... + w_m). Thus weights are *forced to sum to one* (and by the assumption are non-negative) by the nature of the above problem.

Minimizing f(q) has the following sense: Given model distributions p_1, ..., p_m and weights w_1, ..., w_m as a measure of the performance/importance of the model PDFs find the "closest" source distribution. Here we measure proximity via the KL-divergence (expected code length excess).

2. Performance analysis

Now suppose for an input sequence x_1, ..., x_n with inputs p_1,k, ..., p_m,k where k=0,1,...,n you want to find a "good" (in some sense) sequene of weight vectors w_1, w_2, ..., w_n using online gradient descent (OGD) s.t. the weight in step k may onl depend on steps 1, ..., k-1. (This is what we usually do and want.) Let w* be the the best static weight vector (i.e. w_1=w_2=...=w_n=w* minimizes the total code length).

If you use a constant step size for OGD you have a uniform bound (for any input sequence and any squence of model distributions):

(1) code length of OGD scheme <= A * (code length of best static weight vector w*) + O( |w_1-w*| * A^2/(1-A) )

The constant A depends on the step size. The smaller the step size, the closer A is to one. The problem with the above approach is that |w_1-w*| may be unbounded, if you don't bound the weights. Now if you do bound the weights (details omitted, something like min(1, max(0, w_k)) suffices) the additive redundancy (the big-Oh-term in (1)) is bounded. If you bound the weights to lie in some superset (like -1 <= w<=1) of the probability simplex (weights are non-negative, summing to one) this has two side effects:
i. the worst-case redundancy increases (more freedom, more chance to do something wrong)
ii. the best-case redundancy decreases (under some conditions it might even becom negative!)
iii. emperically no dramatic changes

Now someone might say: (1) is a dumb bound. In practice we'll be better anyway since we will adapt to local changes. This is can be confirmed by experiments - and theoretically, since the following holds:

(2) code length of OGD scheme <= A * sum_i=1..s [code length if i-th input subsequence starting at t_i, ending at t_(i+1)-1] + O( |w_i-w_i*| * A^2/(1-A) ).

This means that the OGD scheme does not perform "much worse" than a sequence of s comparision vectors w_i*, where w_i* minimizes the code length of the i-th subsequence (the string x_(t_i), ... x_(t_(i+1)-1) with model distributions p_1,k, ..., p_m,k where k = t_i, (t_i)+1, ..., (t_(i+1))-1). Both, the number s of sequences and their locations 1=t_1<t_2<...<t_(s+1)=n+1 are arbitrary. Hence you can take the minimum over these entities.

The above is written down from memory. There might be small quirks like missing/wrong indices.

I hope this helps.

EDIT: the above assumes that in each time step k you have 0< B<= p_1,k(x), ..., p_m,k(x) <= 1-B < 1, i.e. the PDFS have probabilities which are bounded from below by a constant B>0 and from above by 1-B.

Cheers  Reply With Quote

8. I think it makes sense for the model weights to be non-negative and add to 1 if the input models are well calibrated. But they might not be. Suppose you have a model that outputs p(1)=0.1 when the next bit is 1 and p(1)=0.9 when the next bit is 0. Then the ideal weight is negative. More practically, you might have a "weak" model that outputs p(1)=0.51 for 1 or p(1)=0.49 for 0. Then the ideal weight is much greater than 1. If you mix either of these models with a second model that outputs randomly or always outputs 0.5, then the ideal weight for the second model is 0, so this also holds for the summation.

As another experiment, consider mixing a simple order 0 model and an order 1 model. Here we use direct, stationary context models on the Calgary corpus. The final "t" indicates a MIX2, which has weights constrained to the range 0 to 1 and must sum to 1. "m" is a MIX which has neither constraint.

zpaq a x calgary -v 0 -m 4nc256c256,0,255t -> 1382194
zpaq a x calgary -v 0 -m 4nc256c256,0,255m -> 1367527

(4 means 2^4 MB blocks, n means no e8e9 filter, c256 means a CM with maximum count limit (256-1)*4 and no bytewise context, c256,0,256 means a CM likewise with no periodic or column context and a context byte mask of 255 (1 byte)).

Here is the same experiment using indirect context models (c0 = ICM = context -> bit history -> prediction).

zpaq a x calgary -v 0 -m 4nc0c0,0,255t -> 1277393
zpaq a x calgary -v 0 -m 4nc0c0,0,255m -> 1271825

Here is the same after a BWT (3 = BWT with 16 MB blocks).

zpaq a x calgary -v 0 -m 3nc0c0,0,255t -> 804253
zpaq a x calgary -v 0 -m 3nc0c0,0,255m -> 799110

However, it is not always true that a MIX outperforms a MIX2. In a sequence ending like "mst" (MIX, SSE, MIX2), or "mm16t" (MIX with order 0 and 1 contexts, MIX2) a final MIX2 compresses better than a 2 input MIX. In these cases, the inputs to the MIX2 are presumably accurate. The higher level models therefore use by default "mm16tst" (m also takes all earlier components as input, but t takes only the last 2).  Reply With Quote

9. OK, that sheds some light. Thans toffer & Matt.

I think it makes sense for the model weights to be non-negative and add to 1 if the input models are well calibrated. But they might not be. Suppose you have a model that outputs p(1)=0.1 when the next bit is 1 and p(1)=0.9 when the next bit is 0. Then the ideal weight is negative. More practically, you might have a "weak" model that outputs p(1)=0.51 for 1 or p(1)=0.49 for 0. Then the ideal weight is much greater than 1. If you mix either of these models with a second model that outputs randomly or always outputs 0.5, then the ideal weight for the second model is 0, so this also holds for the summation.
I would call those two models broken or at least not adapting at the right rate.

Assuming that we have enough models so that at least one is predicting well at every situation, we can avoid negative or higher than 1 weights. With enough models it should be sufficient to give the wrong model a weight of 0 and the well behaving model a weight close to but lower than 1. "Weak" model predictions should rather be "strengthened" by some APM I think. Overall having weights outside of range (0,1) and not summing to 1 doesn't really correspond to the term "mixing" for me.

Also you have 2-input restricted mixer and N-input unrestricted mixer but you don't have N-input restricted mixer. I wonder what would happen if we restrict the N-input mixer. Maybe I'll try it.  Reply With Quote

10. In my experiment, all mixers have 2 inputs. Also, when I developed ZPAQ I initially experimented with weights restricted to the range 0 to 1 like in PAQ (but not required to add to 1). I found that the compression was not as good as when I allowed the weights to range from -8 to 8. I would have preferred 0 to 1 because you need 16 bits of precision for good compression (in my experiments), which would have allowed a faster SSE2 implementation of the MIX using 16 bit saturated arithmetic. Unfortunately I had to use 20 bit weights to get good compression. I implemented both the predict and update functions in SSE2, but found that the update function was faster in ordinary x86 or x86-64.

It's not that you need 16 bit fractional parts to compute the prediction, but you do to make the update steps small enough. Using a 12 bit fraction could probably be done using randomized updates I guess. I suspect it would be slower. But 20 bits still allows for SSE2 prediction like this:

weight (20 bits) x input (12 bits) -> 32 bits
drop low 8 bits
sum up to 256 components -> 32 bits.
drop 20 bits.  Reply With Quote

11. So if I understand toffer's paper, it is necessary to put bounds on the weights to prove bounds on coding cost.  Reply With Quote

12. Originally Posted by Matt Mahoney So if I understand toffer's paper, it is necessary to put bounds on the weights to prove bounds on coding cost.
In the paper i only consider the case where the weight space is a compact (= closed and bounded) and convex subset of the n-dimensional euclidian space. This implies that (euclidian) distances |u-v| between weights u and v, are bounded.
(There is some constant C s.t. for arbitrary pairs u, v of weights we have |u-v|<=C.) I use this constraint for several reasons:
i. It is realistic and practical. In every implementation the weight space has to be bounded (finite memory/register size).
ii. Redundancy terms such as O(|w_1-w*|) from my previous post (or Prop. 3.3 from the paper) are O(1). Hence the redundancy is bounded and cannot be arbitrarily large.

On the other hand, the same proofs apply to the unbounded case. (If there is demand i can give some details.) However, ii. isn't satisfied - which i consider a problem.  Reply With Quote

13. So the paper says that GEO (PAQ7- always does at least as well as LIN (PAQ6)?  Reply With Quote

14. Originally Posted by Matt Mahoney So the paper says that GEO (PAQ7- always does at least as well as LIN (PAQ6)?
Almost. The worst-case bounds (Th. 4.7) for GEO are always better than for LIN. To say that the actual worst-case code length is always lower one needs to find an example input I=(x_1, ..., x_n; p_1,k, p_m,k for k=1,...,n) where:
code length bound GEO <= actual code length LIN for I <= code length bound LIN.
I didn't figure out such an input yet.

AFAIR LIN isn't PAQ6:
I defined linear mixture PDF p of m PDFs p_1, ..., p_m as: p(x) = w_1 * p_1(x) + ... + w_m * p_m(x) (mixed probability for letter x) where w_1, ..., w_m >= 0 and w_1 + ... + w_m = 1.
I call this linear mixing because the output PDF is a linear combination of input PDFs.
AFAIR in PAQ6 you use something different: a = a_1 * w_1 + ... a_m * w_m (mixed count for bit 1), b = b_1 * w_1 + ... b_m * w_m (mixed count for bit 0) and p(1) = a/(a+b).

Comparing LIN and GEO where weights are found with OGD the key messages are: For both, LIN and GEO,
i. we can provide code length guarantees (worst-case) for OGD compared to the best static weight vector (Prop. 3.3);
ii. we can probide code length guarantees (worst-case) for OGD compared to a sequence of optimal weight vectors (Th. 3.7);
iii. we can probide code length guarantees (worst-case) for OGD compared to a sequence of model distributions (Th. 4.7).

iii. actually is a weaker refinement of ii.; both ii. and iii. guarantee that we will prefer models (predicted PDFs) locally which provide better performance than other models if the change in compression is "dramatic" enough.
And interestingly the redundancy terms of GEO in Table 1 are O(B^2) vs. O(4^B/B) for LIN where the smallest probability for any letter in the input PDFs for mixing is 2^-B. This indicates (this is no proof though) that the worst-case code length of GEO w.r.t. a large range of input probabilities if exponentially lower than for LIN.

EDIT: Here i always assume that weights for both, LIN and GEO, are drawn from the probability simplex (non-negative, summing to one).  Reply With Quote

15. Yes, you're right that PAQ6 is not LIN. It is a weighted sum of bit counts, which has the effect of giving more weight to predictions near 0 or 1 (like GEO).  Reply With Quote

16. I wonder how GEO performs on (near-) uniform distributions. A probability (close to) 0.5 after stretching is (close to) 0.

Suppose that at some point we have a stream segment with uniform distributions. We have also two models:
- model A which was predicting poorly, but now is predicting well on the flat distribution,
- model B which was predicting well, but now is predicting worse than A,

In train() function weight delta depends on stretched probability, so weight for model A won't improve (bad?), but weight for model B will decrease (good). Shouldn't the weight for model A increase?  Reply With Quote

17. If model A is predicting well, then it will output a positive number (a) when the next bit is 1 and negative when 0. If model B is predicting poorly then it will output a number b independent of what bit is next. The prediction is p = squash(wa*a + wb*b) where squash(x) = 1/(1+exp(-x)). Then the prediction error is -p if the next bit is 0 or 1-p if the next bit is 1. Then the weights are updated: wa += a*error*rate, wb += b*error*rate, where the rate is typically around 0.001 to 0.01.

The update of wa and wb depends on the prediction error, which depends on both a and b. Therefore the effect of a high error rate for b will increase the update rate of wa.

What happens to wb depends on how b behaves. If b is always 0 (prediction = 1/2), then wb is not updated. But this does not matter because b has no effect on p. If b is random or fixed at some nonzero value, then because squash() is nonlinear, the effect on p will be larger when b is wrong than right and wb will decrease.

Note that if we only have a single model with perfect prediction (like a = +1 or -1 depending on the next bit and b = 0), then the weight wa will grow without bound (as wa ~ ln(n*rate) after n predictions).  Reply With Quote

18. Here is what happens to the weight of a single perfect predictor with update rates of 0.1, 0.01, and 0.001 after n=1..2^30 predictions. The numbers in parenthesis are the difference from ln(n*rate + 1).

Code:
```#include <stdio.h>
#include <math.h>
int main() {
double w1=0, w2=0, w3=0;
for (int i=1; i<1100000000; ++i) {
w1+=0.1/(1+exp(w1));
w2+=0.01/(1+exp(w2));
w3+=0.001/(1+exp(w3));
if (!(i&i-1))
printf("%10d %10.6f %10.6f %10.6f (%10.6f %10.6f %10.6f)\n",
i, w1, w2, w3, w1-log(i*0.1+1), w2-log(i*0.01+1), w3-log(i*0.001+1));
}
return 0;
}

1   0.050000   0.005000   0.000500 ( -0.045310  -0.004950  -0.000500)
2   0.098750   0.009988   0.001000 ( -0.083571  -0.009815  -0.000998)
4   0.192633   0.019925   0.001999 ( -0.143839  -0.019296  -0.001993)
8   0.366860   0.039652   0.003997 ( -0.220926  -0.037309  -0.003972)
16   0.668414   0.078518   0.007985 ( -0.287098  -0.069902  -0.007888)
32   1.132554   0.153958   0.015938 ( -0.302530  -0.123674  -0.015561)
64   1.743509   0.296139   0.031749 ( -0.257971  -0.198558  -0.030286)
128   2.438299   0.549381   0.062995 ( -0.186369  -0.274795  -0.057451)
256   3.159808   0.957576   0.124008 ( -0.121104  -0.312185  -0.103925)
512   3.881147   1.525894   0.240359 ( -0.073936  -0.285668  -0.173075)
1024   4.595130   2.202298   0.452262 ( -0.043475  -0.217181  -0.252814)
2048   5.301957   2.921540   0.807021 ( -0.024948  -0.145582  -0.307465)
4096   6.003548   3.646199   1.326941 ( -0.014072  -0.090518  -0.301515)
8192   6.701714   4.364046   1.976350 ( -0.007834  -0.053831  -0.241983)
16384   7.397769   5.073849   2.687668 ( -0.004317  -0.031127  -0.167882)
32768   8.092569   5.777429   3.413006 ( -0.002359  -0.017655  -0.106507)
65536   8.786643   6.476837   4.133631 ( -0.001280  -0.009872  -0.064112)
131072   9.480303   7.173635   4.845982 ( -0.000690  -0.005459  -0.037366)
262144  10.173732   7.868869   5.551389 ( -0.000370  -0.002991  -0.021313)
524288  10.867033   8.563190   6.251979 ( -0.000198  -0.001627  -0.011968)
1048576  11.560263   9.256990   6.949501 ( -0.000105  -0.000879  -0.006640)
2097152  12.253455   9.950496   7.645163 ( -0.000056  -0.000472  -0.003649)
4194304  12.946626  10.643839   8.339732 ( -0.000029  -0.000253  -0.001989)
8388608  13.639786  11.337092   9.033672 ( -0.000015  -0.000135  -0.001077)
16777216  14.332940  12.030297   9.727257 ( -0.000008  -0.000071  -0.000580)
33554432  15.026090  12.723475  10.420644 ( -0.000004  -0.000038  -0.000310)
67108864  15.719240  13.416638  11.113921 ( -0.000002  -0.000020  -0.000166)
134217728  16.412388  14.109794  11.807138 ( -0.000001  -0.000010  -0.000088)
268435456  17.105535  14.802946  12.500323 ( -0.000001  -0.000005  -0.000047)
536870912  17.798683  15.496095  13.193490 ( -0.000000  -0.000003  -0.000025)
1073741824  18.491830  16.189244  13.886648 ( -0.000000  -0.000002  -0.000013)```  Reply With Quote

19. Here is a simulation of a perfect predictor which outputs +1 or -1 and a 1 input mixer with a learning rate of R=0.01. The first column is i, the number of compressed bits. The second column is the mixer weight, w. The third column is the prediction error, p. The fourth column is the total compressed size so far in bits. The last column is the compressed size of the last i/2 bits, which converges to 1/R.

Code:
```#include <stdio.h>
#include <math.h>
int main() {
double w=0, R=0.01, h=0, h1=0;
for (int i=1; i<1100000000; ++i) {
double p=1/(1+exp(w));
h-=log(1-p);
w+=p*R;
if (!(i&i-1)) {
printf("%10d %10.6f %10.6f %12.6f %10.6f\n",
i, w, p, h/log(2), (h-h1)/log(2));
h1=h;
}
}
return 0;
}

bits     weight      error         size  last half
1   0.005000   0.500000     1.000000   1.000000
2   0.009988   0.498750     1.996398   0.996398
4   0.019925   0.496259     3.978458   1.982061
8   0.039652   0.491316     7.900139   3.921680
16   0.078518   0.481583    15.577643   7.677504
32   0.153958   0.462737    30.299327  14.721684
64   0.296139   0.427548    57.432137  27.132810
128   0.549381   0.366860   103.975880  46.543742
256   0.957576   0.277921   175.027346  71.051467
512   1.525894   0.178858   267.986515  92.959169
1024   2.202298   0.099634   372.896357 104.909841
2048   2.921540   0.051124   480.644961 107.748604
4096   3.646199   0.025433   587.167845 106.522884
8192   4.364046   0.012568   691.689107 104.521262
16384   5.073849   0.006220   794.557170 102.868062
32768   5.777429   0.003087   896.290153 101.732984
65536   6.476837   0.001536   997.305654 101.015500
131072   7.173635   0.000766  1097.888087 100.582433
262144   7.868869   0.000382  1198.216792 100.328705
524288   8.563190   0.000191  1298.399958 100.183166
1048576   9.256990   0.000095  1398.500975 100.101017
2097152   9.950496   0.000048  1498.556211 100.055236
4194304  10.643839   0.000024  1598.586196 100.029985
8388608  11.337092   0.000012  1698.602372 100.016177
16777216  12.030297   0.000006  1798.611053 100.008681
33554432  12.723475   0.000003  1898.615690 100.004637
67108864  13.416638   0.000001  1998.618157 100.002467
134217728  14.109794   0.000001  2098.619464 100.001307
268435456  14.802946   0.000000  2198.620155 100.000691
536870912  15.496095   0.000000  2298.620519 100.000364
1073741824  16.189244   0.000000  2398.620710 100.000191```  Reply With Quote

20. Thanks for the analysis. But I've been worried about a case where input is random and by "predicting well" I've meant outputting a prediction p near or equal to 1/2.

But after some thinking I realized that t[i] which is the stretched prediction, will be very low for the well behaving model (ie those that outputs predictions near 1/2, which after stretching are near 0) and thus it's weight will diminish much more slowly than the weight for a model that is predicting worse (ie randomly and not in correlation with the random input).

Logistic mixing starts to make much more sense for me now, also the intuitive restrictions like restricting weights to sum to one now looks unnecessary.  Reply With Quote

21. Now I have somewhat more difficult question.

As you know I'm working on Demixer and I'm using suffix tree (of limited depth) to retrieve statistics. Thus I can have a lot of statistics (currently only trimmed bit histories) at once, but their quantity is variable. Because their quantity is variable I cannot use a single N-input mixer with restriction that weights sum to 1. So I'm left with either a chain of mixers or a unrestricted mixer.

If I choose to have a chain of 2-input mixers (restricted or not) then for a single place in chain I can have a set of mixers to choose from (selecting by a context). I don't know if it would make sense for N-input mixer, ie a mixer which has a separate input for each order, but instead of a single vector of weights I would choose each weight based on a small context which is independent of contexts of weights for other orders. AFAIU the inputs in unrestricted mixer are fairly independent so it shouldn't matter if I have a set of weights per each input, provided that I don't use a single weight variable for different inputs. Also it shouldn't really matter if I disable (or don't have information about) some inputs sometimes. Am I right?

Also I have an impression that with a chain of 2-input mixers the error propagation would be rather slow and/ or less smooth if the chain is long compared to a single N-input mixer.  Reply With Quote

22. I haven't tried selecting weights for an n-input mixer by different contexts, but it would probably work. For a chain of 2 input mixers, that is the normal thing to do. If you have an n-input mixer and a variable number of models, you could select a mixer according to the number of inputs or else set unused models to 0. I guess you need to do the experiment.  Reply With Quote

23. From a theoretical POV having a chain of two input-mixers, one for each binary context, is very appealing. I played around a bit with such a model (for each context: two input mixers with OGD and probability estimation via linear counters) theoretically and it turns out that one can obtain superior theoretical guarantees - similar to CTW. (BTW the idea behind PPM is not much different from that.) For the actual value of a weight and a probability estimate for the first k, k is constant, occourences of any context one can use arbitrary heuristics (SSE, information inhertiance, etc.) and the theoretical guarantees still hold. Note that i didn't work out all the math, but everything i found up until now points into the same direction. This could be a happy union of theory and practise. I will implement such a scheme on my own, but i'd be happy if someone else tries. BTW Shelwien used such a model setup, too. (e.g. MIX* compressors on his site)

If you want some hints on the setup for PDF estimation and mixer updates for each context let me know.

Cheers  Reply With Quote

24. Thanks for the information. But according to my quick experiment described in thread about Demixer, long chains of mixers are unstable, at least if there's no additional context to select mixers at each step. I'm thinking about computing something like Levenshtein distance between bit histories for consecutive orders and based on that select or disable mixers at each place in the chain. I'll think about fine tuning the learning rates after I stabilize the predictions in long chains.  Reply With Quote

25. Originally Posted by Piotr Tarsa Thanks for the information. But according to my quick experiment described in thread about Demixer, long chains of mixers are unstable, at least if there's no additional context to select mixers at each step. I'm thinking about computing something like Levenshtein distance between bit histories for consecutive orders and based on that select or disable mixers at each place in the chain. I'll think about fine tuning the learning rates after I stabilize the predictions in long chains.
Just to clarify: You got one PDF estimator per context. For the scheme i mentioned you need one PDF and one mixer per context. And you need to mix "from top to bottom" (first mix the two highest order predictions). E.g. for mixing just orders 0 and 1 you have (at most) 255 + 256*255 PDF estimator and mixers.

And i don' think a proper implementation performs poorly. Compressors like PPMonstr use similar techniques - although they don't decompose the source alphabet - and perform pretty well.

EDIT: If you are interested i can provide a simple framework for parameter tuning. I suggest to have a look at genetic algorithms if you want to do that on your own.  Reply With Quote

26. Just to clarify: You got one PDF estimator per context. For the scheme i mentioned you need one PDF and one mixer per context. And you need to mix "from top to bottom" (first mix the two highest order predictions). E.g. for mixing just orders 0 and 1 you have (at most) 255 + 256*255 PDF estimator and mixers.
I don't get it - what's PDF estimator here? How it's different from mixer itself? And you mean that I need to have separate mixer for each context and not each context order?

Is there any reason that mixing "from top to bottom" should give better results than mixing in reverse order, given proper tuning of course?

EDIT: If you are interested i can provide a simple framework for parameter tuning.
Well, if you have a ready one then I'll happily look at it.

I suggest to have a look at genetic algorithms if you want to do that on your own.
Sure, but I want to play with various transforms, structures and schemes first.  Reply With Quote

27. I don't get it - what's PDF estimator here?
OK, we consider the binary case. For every binary context you model the probability of a one-bit and zero-bit, in other words a probability density function (PDF) over a binary alphabet. This is a PDF-estimator.

How it's different from mixer itself?
See above.

And you mean that I need to have separate mixer for each context and not each context order?
Yes. I mean a separate mixer for every context - and not just each order.

Is there any reason that mixing "from top to bottom" should give better results than mixing in reverse order, given proper tuning of course?
Theoretical reasons. ATM my math here is rather sketchy and consists of pencial and paper work only. I need to recheck this. But it looks plausible to me.

Well, if you have a ready one then I'll happily look at it.
I've recently rewritten and simplified my optimization framework. To optimize custom targets you need to:
1. provide a parameter definition (see example/definition)
2. provide a C function which returns the cost funtion value (see example/rastrigin.cpp)

I use Code::Blocks. To compile by hand with gcc:
g++ -o opt.exe -Iexample optimize\optimize.cpp example\rastrigin.cpp
(This builds the optimizer to minimize the Rastrigin function with parameter 2, i.e. in two dimensional space)
Then run opt.exe to see the options. Run e.g. "opt alg=dcga,pop=50,iter=100,miss=100,pm=0.06,log=log. txt,dst=pop.txt" to optimize for 100 iterations with specified GA options and logging to log.txt and parameter recording to pop.txt (one can resume optimization from such record files).

To optimize your own stuff do:
1. create directory "mystuff"
2. create parameter profile "mystuff/definition"
3. create cost function with driver program: "mystuff/mycost.cpp"
4. g++ -o opt.exe -Iexample optimize\optimize.cpp mystuff/mycost.cpp

Note that the provided files are experimental and not well-tested, i.e. there might be bugs. At least it passed my tests. Some functionality like multi-threading is missing here.  Reply With Quote

28. The optimizer program looks somewhat convoluted. AFAIU it's meant for direct integration with C++ sources, yes? I'll need to spawn new process per each optimization trial then and somehow pass the parameters.

Yes. I mean a separate mixer for every context - and not just each order.
Ouch, that seems like an overkill. And that would be incompatible with the suffix tree with compressed edges as with CTW-like weighting there would be multiple different mixer weights even on non-branching edges. Thus the tree could not have a linear size then. With bit histories it's a different story because bit histories are identical for every node on compressed edge and the number of branches in a suffix tree is linear.

I've took a look at CTW results and they aren't overly impressive. Shelwien did a bytewise mixing compressor http://encode.su/threads/541-Simple-...xt-mixing-demo but that also didn't score well (worse than CTW) on book1. One would expect that bytewise coders should do very good on byte-oriented data like book1.  Reply With Quote

29. The optimizer is perfectly separable - however i prefer it this way (integrated with C++ sources) since i don't need to write parsers for loading parameter configurations - the code is automatically generated.
You don't need to spawn a process per trail. And you don't need to take care of where the parameters come from. You just need to supply a function:

double F( const char* x ) {
PROBLEM_NAME p=x;
// compute cost function value y
return y;
}

and declare the contents of PROBLEM_NAME via the definition file.

As to CTW...

Well, CTW implementations are dated and don't use any modern heuristics.

As i said PPMonstr uses some modern heuristics and works similar. And its performance is pretty good.  Reply With Quote

30. I've tried mixer-chains but they where not better than the traditional approach.

One should not forget that the current math for PDF-estimation assumes memoryless subsources which contradicts with our usage of state-based-counters.
I have a PDF written for some subject in University about limited-order markov compression. Apparently it is in german.

I've also written a large optimizer framework which can work with any program which is controllable through a txt-based configuration file.
You can select between different algorithms like DDS, DE and ASA. DDS gives fairly good estimations for around 1000 function evaluations for dimensions >30.  Reply With Quote

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•