# Thread: Logistic mixing & modeling

1. ## Logistic mixing & modeling

After thinking a bit about PAQ's mixer - which uses gradient descent to compute weights iteratively - I thought about applying that to bit models. Instead of having a counter with a state p (probability) in linear space, one could have a logistic counter s = St(p) and update its state in logistic space. When using logistic mixing one doesn't need to convert the input probabilities p into logistic space - thus no stretch function needs to be implemented. Altogether, a logistic counter with a gradient-based update could look like:

h = -ln(a+b*p)
p = Sq(s)
dh/ds = -(y-p)

s' = s + L*(y-p) = s + L*(y-Sq(s))
Sq(x) - the squash function, 1/(1+e^-x)

That update can be derived like I did in the Mixing strategies thread - and actually equals a fixed input ("one") with a "weight" s. Hence it is just a weighted bias neuron. BTW that update appears in bias neuron updates using logistic mixing - since it's the same. Nothing new actually.
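
For illustration, a minimal sketch of such a logistic-space bit counter (floating point; the struct name and the learning rate are mine, not from any existing coder):

```
#include <cmath>

// Bit counter whose state s lives in logistic space; its prediction is p = Sq(s).
struct LogisticCounter {
    double s = 0.0;   // Sq(0) = 0.5, i.e. an unbiased start
    double L = 0.05;  // learning rate

    double p() const { return 1.0/(1.0+std::exp(-s)); }   // Sq(s)
    void update( int y ) { s += L*(y - p()); }             // s' = s + L*(y - Sq(s))
};
```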

Now I noticed that we could actually "reinvent" the wheel: when using logistic mixing along with such a model - why not take the mixing function's influence into account, adjusting the bit models proportionally to their contribution to the mixed output? Well, integrating that into logistic mixing gives us a combined, purely logistic mixing and modeling approach:

(1) each model i has a state s_i in logistic space

(2) the mixer assigns weights w_i to the models s_i

(3) a mixed prediction p in linear space is

p = Sq( sum w_i s_i )

(4) the coding error e is

e = y-p

(5) weights are updated

w_i' = w_i + L*e*s_i

(6) predictions are updated

s_i' = s_i + M*e*w_i

0 < L, M < 1 are learning rates. Note that -e*w_i is simply the partial derivative of h (coding cost) w.r.t. an input s_i. Altogether this can be interpreted as a neural network: the input layer is made of bias neurons, with weights selected by a context. The output layer is a single neuron with linear activation and squash as the output function. The hidden layer just forwards the inputs (linear neurons). It learns using error backpropagation, which minimizes coding cost rather than RMSE.
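
A minimal floating-point sketch of one step of this combined scheme (the class and the learning-rate values are illustrative assumptions, not code from an actual compressor; context selection of the weights is omitted):

```
#include <cmath>
#include <cstddef>
#include <vector>

// Combined logistic mixing and modeling: both states s_i and weights w_i are
// adjusted against the coding error e = y - p, as in the numbered steps above.
struct LogisticMixModel {
    std::vector<double> s;  // per-model states in logistic space
    std::vector<double> w;  // mixer weights
    double L = 0.02;        // weight learning rate
    double M = 0.02;        // model learning rate

    explicit LogisticMixModel( std::size_t n ) : s(n, 0.0), w(n, 1.0/n) {}

    double predict() const {                        // (3) p = Sq( sum w_i*s_i )
        double x = 0;
        for( std::size_t i = 0; i < s.size(); i++ ) x += w[i]*s[i];
        return 1.0/(1.0+std::exp(-x));
    }

    void update( int y, double p ) {                // after coding bit y
        double e = y - p;                           // (4) coding error
        for( std::size_t i = 0; i < s.size(); i++ ) {
            double wi = w[i];
            w[i] += L*e*s[i];                       // (5) w_i' = w_i + L*e*s_i
            s[i] += M*e*wi;                         // (6) s_i' = s_i + M*e*w_i
        }
    }
};
```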

I have no idea how well this will perform in practice - but I hope that it might work as well as logistic mixing. Most of this is already covered by logistic mixing; it is just extended by integrating the modeling.

I'd implement this when I've got some spare time.

It might even be possible to get rather good results with *static* weights (no weight update) since the logistic model updates take the actual weight values into account.

Comments?

2. It would be great if you'd release an fpaq0-like compressor as a demo. Just a simple model without templates and all those crazy constants that some authors like to embed into their programs to make the source code completely unreadable and a bit encrypted, so nobody can understand what's going on...

3. You were faster than me. I was thinking about removing the stretch/squash functions completely to decrease cache misses (I think the tables are big enough to pollute the cache). I thought of implementing a new counter model based on PAQ's mixer weight update / probability mixing functions. But I got stuck on how to remove even the final squash function. How can we integrate the final logistic probability into the RC? I mean, a multiplication-free RC?

4. @Osman
Actually the lookup tables are rather cache friendly since probabilities are clustered near 0/1 in linear space - just remember the model output histograms. I made some tests there and there is no difference in speed.

@Encode
Well, if you want to have different parameters for different models there is no other flexible way of passing them to the models.

I'd implement that if I've got spare time. Don't forget that I'm rather busy with my final thesis. Actually implementing it is pretty straightforward, since if you know how to update a PAQ mixer:

e = y-p
w_i' = w_i + L*e*s_i

You know how to update models, too

s_i' = s_i + M*e*w_i

5. @Toffer:
Yes, I remember. I meant a slightly different thing: they are big enough (2x8 KiB in total) to be placed in the cache (especially on AMD architectures). There is other data which should be cached as well (i.e. SSE entries). So removing stretch/squash from the whole implementation should leave more space for better caching. Also, I'm sure a multiplication-free RC would be super fast.

Edit: BTW, IIRC I intentionally removed the stretch/squash tables to see whether they were a bottleneck, prior to BIT 0.5 (maybe even older). And I observed that it was faster. Yes, I know the computation becomes more and more weird (the output is complete garbage), but it showed me that the stretch/squash functions are a slow part of the PAQ mixer.

6. I don't think that a multiplication-free RC would be possible (with enough precision). But I never made any investigations there.

The cache is adaptive - and I doubt that more than a few (<10) 64-byte cache lines are actually used, since - as I said - the functions are evaluated near 0/1 in almost all cases. These locations are frequently used and will be cached. Others won't, hence the other stuff is pushed out. As I said, I tested this effect. There is no speed difference. At first I was pretty surprised, but looking at a model output histogram reveals why this happens.

7. BTW, correct me if I'm wrong:
Initially, we have a model which holds w and s. They are all zero at the beginning, because we need to assign p=0.5, and if we derive s from p, s = St(p=0.5) = 0. So here is the question: how can we update the function if the coefficients are zero!? In the PAQ mixer the s_i part is in linear space, so there was no problem there (p=0.5 at the beginning). But here it's logistic, and it's zero at the beginning.

e = y-p
w_i' = w_i + L*e*s_i
s_i' = s_i + M*e*w_i

Edit: I miswrote. I meant that the s_i part is not in linear space in the PAQ mixer, but it comes from a linear-space source and can be a non-zero value.

8. That's an implementation detail. Only the model holds s, since it replaces the counters.

Just set w_i(0) = w0/N, where N is the number of models; w0 can be optimized or set to 1.

9. @encode:
> It would be great if you'll release a fpaq0-like
> compressor as a demo.

http://ctxmodel.net/files/mix_test_v3.rar
Contains simple (simpler than fpaq0 imho)
bitwise coder implementations, both one based
on my linear mixer and one based on toffer's logistic mixer.

> as some authors like to embed into their
> programs to make the source code completely unreadable

Now, look who's saying that...

> I was thinking about it to remove completely
> stretch/squash functions for decreasing cache misses (I
> think, the tables are big enough to pollute the cache).

As to the logistic mixer, Squash() for it can be
calculated by replacing the full table (SCALE) with a smaller
table (1<<N) and additional N multiplications (N=5 atm).
This can be also applied to Stretch(), but somehow I think
that it'd lose a lot of precision after that, as atm
the stretch table is initialized so that Squash(Stretch(i))=i.

Also, both squash and stretch tables can be cut in half
as they're symmetric.

> But, got stuck in how to remove "even" final squash
> function. How can we integrate final logistic probability
> in RC? I mean, multiplaction free RC?

Well, either with multiplications, or with lookup tables -
there's no other way (obviously).
In fact, there're many multiplication-free ari coders
which all use logistic counters (state-based) and maintain
log(range) instead of range. Afaik they all also have
lookup tables.

> Instead of having a counter with a state p (probability)
> in linear space one could have a logistic counter s=St(p)
> and update its state in logistic space.

Now, why? It's a shamanic approach in full bloom. In fact, I'm still waiting for an explanation even considering
why the logistic mixer works - aside from its fixed nonlinear
mapping, hidden in the form of the "update rate".
I mean, we have an explanation for the linear mixer - the weight
there is the probability of a certain submodel being optimal.
But now, what's the interpretation of a logistic weight?
And logistic mixing itself?
So, logistic mixing does work, that's nice and all.
But actually we need to understand why it works, and based on
which side effects - that would certainly allow us to acquire
even better results.
In fact, yesterday I finally tried an obvious thing -
replacing w0 and w1 in the binary logistic mixer with
w and 1-w and rederiving the update formula.
And what do you think, compression suddenly improved
from 214395 to 214027. That's beside the fact that
all 5 probability mappings involved in the binary mixing
and its update, can be separately tuned, providing some
additional improvement (quoted results only use single
tuned mapping - original result was 214695 btw).

Well, anyway, I already tried replacing the linear counters
with (0,1) mixers (improved ones), and optimizing the coder -
current result is 218244 and it doesn't seem to go anywhere further.

Also, there's some common sense behind the linear counters in fact.
If we have an order0 (memoryless) static generator, like
putc( '0'+(rand()>=P) )
and it outputs N0 zeroes and N1 ones, then there're (N0+N1)!/N0!/N1!
possible permutations of that (the number of the current permutation has
to be encoded), and the optimal coding probability is N0/(N0+N1),
with N0~=N*P/SCALE (the product of such intervals corresponds to the
number of permutations, so it's perfect).

Then, probability-based linear counters are imprecise approximations
of that, also taking into account data nonstationarity.
But actually they work completely the same as combinatoric counters
after a certain initial period (see http://ctxmodel.net/rem.pl?-2).
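
For illustration, a minimal sketch of the two kinds of counters being compared here (my own simplified rendering; the prior counts and the aging factor are arbitrary, and the linear update form is the p' = (1-c)*p + c*y one quoted later in the thread):

```
// Combinatoric counter: stationary estimate, P(bit=0) = N0/(N0+N1).
struct CombinatoricCounter {
    int n0 = 1, n1 = 1;                 // start from a uniform prior
    double p0() const { return double(n0)/(n0+n1); }
    void update( int y ) { if( y ) n1++; else n0++; }
};

// Linear counter: nonstationary approximation with aging factor c.
struct LinearCounter {
    double p = 0.5;                     // probability of a 1 bit
    double c = 0.05;                    // aging factor
    void update( int y ) { p = (1-c)*p + c*y; }
};
```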

Well, I can't deny the possibility that for some types of data
the logistic counters would be better (due to faster adaptation or something).
But BWT output is certainly not that - it's as close to a memoryless order0
source as natural data can be. And the same applies to context histories,
as they're fairly similar to BWT output.

10. Well this is not shamanic at all...

Reasons for the logistic mixer to work well:

* Stretch has high slopes near 0/1, hence the resolution increases there (it acts as a quantisation function - the derivative of St(p) is 1/(p*(1-p)), which grows large near 0 and 1). Luckily this coincides with the model output histogram of a binary model - most predictions are made near 0/1:
http://encode.su/forum/attachment.ph...2&d=1220619017

* The weight update formulation is a simplified iterative numeric minimization technique - gradient descent with a fixed step length. Thus weights are adjusted to minimize the actual coding cost.

The second thing - including a logistic estimation - just replaces St(p_i). The update rule I posted is again an iterative minimization of coding cost, *taking into account* the actual weight assigned to a submodel, thus using more information than a normal counter.

As to a linear counter... Indeed, an optimal (w.r.t. coding cost) estimation is P(y=1) = N1/N, assuming a memoryless stationary source, having seen N1 bits out of N being 1.

But who tells you that the nonstationary update N1'=c*N1+y, N'=c*N+1, or p' = (1-c)*p + c*y (this is what the first model becomes in the stationary state, N->infinity) does some kind of optimization regarding the coding cost? Isn't it "shamanic" to fix it for a nonstationary source by inserting an aging factor c?

Well, in fact this model minimizes a nonstationary RMSE, 1/N * sum_{k=0..N-1} c^(N-k)*(y(k)-p)^2, and is not related to entropy.

11. > Well this is not shamanic at all...

Magic with scientific elements is still magic. To be specific:
1. Log[p/(1-p)] logistic mapping comes off the ceiling.
I doubt you can say anything about its choice except
that it fits your image - which is obviously not enough.
And with similar but different mapping being better for
specific application than Log[p/(1-p)] it becomes even
harder to explain.
2. You didn't have any suggestions as to avoiding weight overflows,
or changing the mixing formula to improve the results.
But using a single weight there improves both compression
and memory usage (a single 16-bit weight is currently used),
and it doesn't seem final either.
3. Using numeric optimization techniques doesn't guarantee
that the optimization target and iteration functions are
chosen consciously. For example, having a rate coefficient
in the logistic mixer update is a really bad idea - that is,
same effect can be acquired in a better way.

My point is that well-understood techniques (non-shamanic, that is)
provide much better flexibility. Combinatoric/linear counters are
a good example of that. They can be modified for any specific
application, while about using the paq mixer with alphabet distributions
I hear just "it won't work". Why, it would work - once we
properly understand how it works.

> Reasons for the logistic mixer to work well:

I talked about the interpretation, not reasons.
Like "probability p is the size of interval in code space,
assigned to data strings starting with current symbol" or
"weight w in linear mixing is a probability of submodel M1
being optimal for coding".
We don't know anything similar about logistic mixer for now.

> * Stretch has high slopes near 0/1, hence the resolution
> increases there (quantisation function). Luckily this
> coincides with the model output histogram of a binary
> model - most predictions are made near 0/1:

I can just add the p'=Squash(wr*Stretch(p)) mappings (wr=constant)
into my linear coder and it would help too. So what?
That's not all a paq mixer does.

> * The weight update formulation is a simplified iterative
> numeric minimization technique - gradient descent with a
> fixed step length. Thus weights are adjusted to minimize
> the actual coding cost.

And then we have a choice of multiple possible iteration methods,
and which one is better can be only determined by manual testing,
and which is the best is unknown.

> The second thing - including a logistic estimation - just
> replaces St(p_i). The updates rule i posted is again an
> interative minimization of coding cost *taking into
> account* the actual weight assigned to a submodel thus

Anyway, for now logistic counters have worse performance.

> But who tells you that the nonstationary update
> N1'=c*N1+y, N'=c*N+1, or p' = (1-c)*p + c*y (this is the
> first model in case of N->infinity, stationary state) does
> some kind of optimization regarding the coding cost? Isn't
> it "shamanic" to fix it for a nonstationary source by
> inserting an aging factor c?

1. Afair it's optimal for weighted code length
Sum -c^(N-i)*Log[ p_i ]

2. It's true that this exponential weighting is arbitrary.
Even more can be said - different weighting functions
were tested and there were better ones.
But here it's a matter of computation cost, these better
functions are much more expensive, but not much better
compression-wise, so this one is used.
So this is not exactly shamanic, as it was consciously
chosen after some evaluation.

****************

Btw, I actually have a draft of logistic mixing interpretation.

1. The idea is that coding cost is a linear combination
of Log[p_i], so linear mixing of submodel inputs in
log space might be more fitting... and adaptive.

2. The bit coding cost after such mixing would look
as follows:
((1-y)*Log[p1]+y*Log[1-p1])*(1-w) + ((1-y)*Log[p2]+y*Log[1-p2])*w
and
(1-y)*Log[p]+y*Log[1-p] = Log[p]-y*(Log[p]-Log[1-p]) = Log[p] - y*Log[p/(1-p)]
so the mixing function becomes
Log[p1]*(1-w)+Log[p2]*w - y*(Log[p1/(1-p1)]*(1-w)+Log[p2/(1-p2)]*w)

And that right part is what my logistic mixer currently deals with,
not the full formula.

Incidentally, this might also solve the paq mixer incompatibility
with non-binary alphabets.

12. > I can just add the p'=Squash(wr*Stretch(p)) mappings (wr=constant)
> into my linear coder and it would help too. So what?

Btw, my interpretation of the effect of such static mappings is
that they shift the prediction closer to 0 or 1 after several updates.
And as it takes quite a few updates to arrive at a stable
probability, it's kinda reasonable that some compensation
mapping might help.

But this is kinda far from what one can see while looking
at the w_i += wr*e*s_i logistic weight update.

13. > 1. Log[p/(1-p)] logistic mapping comes off the ceiling.
> I doubt you can say anything about its choice except
> that it fits your image - which is obviously not enough.
> And with similar but different mapping being better for
> specific application than Log[p/(1-p)] it becomes even
> harder to explain.

My interpretation of Matt's design here is:

Matt's idea was to use an NN for mixing - a design constraint. A sigmoidal function (a differentiable threshold element) as the output neuron provides weight adaptation via minimizing coding cost without a *division*. Thus it is just a modification of his previous (< PAQ6) mixers, which minimized coding cost using gradient descent, too. Now if the output p=Sq(s) of that *nonlinear* output neuron has the meaning of a "probability", s certainly is something different. Thus one can assume Sq(s) to be the inverse of some function - St(p) in particular - which transformed probabilities into a different domain (#).

I know that this lacks a mathematical interpretation, but to my understanding that intuition makes sense. On the other hand I think that the actual design process was reversed: usually it'd be better to analyse the inputs and build the model based on that - not the other way around.

And I still don't know what'd be wrong in knowing the reasons for it to work well - even if these reasons luckily coincided with a more or less random design process.

> 2. You didn't have any suggestions as to avoiding weight overflows,
> or changing the mixing formula to improve the results.
> But using a single weight there improves both compression
> and memory usage (a single 16-bit weight is currently used),
> and it doesn't seem final either.

That was already done in ZPAQ. We already talked about this - I didn't reject any of these ideas. And why should I tell you how to implement it if you can do it on your own? I'm sure both of us know how this can be implemented.

> 3. Using numeric optimization techniques doesn't guarantee
> that the optimization target and iteration functions are
> chosen consciously.

Coding cost (entropy) should be a good target function. And I never said that gradient descent is a perfect minimization technique - I just said that its application here *makes sense*, since it is a technique to minimize a certain cost function. In addition, here it's easy to implement and seems to be efficient.

> For example, having a rate coefficient
> in the logistic mixer update is a really bad idea - that is,
> same effect can be acquired in a better way.

Well, as you know, mathematically it doesn't play a great role whether you locate it in the mixing or in the update. And I already explained why it appears in the update. The normal procedure in iterative numeric minimization of a function H(x) is to estimate a search direction d(k) and a scalar step length L(k) in each step k, given an initial estimate x(0). So x(k+1) = x(k) + L(k)*d(k). Having obtained a direction d(k), the dim(x)-dimensional optimization problem is reduced to the 1-dimensional problem min H( x(k) + L(k)*d(k) ) w.r.t. L(k) in each step k. Thus the step lengths L(k) generally differ in each step k. A simple solution is to assume L(k) = L for all k. That's where that L comes from.

> while about using paq mixer with alphabet distributions
> I hear just "it won't work".

I mentioned the reason several times: http://encode.su/forum/attachment.ph...2&d=1220619017
That's different for bytewise models as you know.

> I can just add the p'=Squash(wr*Stretch(p)) mappings (wr=constant)
> into my linear coder and it would help too. So what?
> That's not all what a paq mixer does.

I never said that, and I don't know what you mean there actually.

> And then we have a choice of multiple possible iteration methods,
> and which one is better can be only determined by manual testing,
> and which is the best is unknown.

Mathematically, using more information makes sense, since you differentiate the whole system dh/ds_i = dh/dp * dp/dx * dx/ds_i with
h = -ln(a+b*p)
p = Sq(x), x = sum s_i w_i
and not just bypass the effect of each individual component. That's the foundation of error backpropagation, too.

And finally, citing my first post: "...I have no idea how well this will perform in practice..."

> 1. Afair its optimal for weighted code length
> Sum -c^(N-i)*Log[ p_i ]

Right - I didn't know. But in this case the minimizer of -sum c_k*ln(a_k+b_k*p) equals the minimizer of sum c_k*(y_k-p)^2, which I found pretty interesting.

> 2. Its true that this exponential weighting is arbitrary.
> Even more can be said - different weighting functions
> were tested and there were better ones.
> But here its a matter of computation cost, these better
> functions are much more expensive, but not much better
> compression-wise, so this one is used.

Well, the same applies to simple gradient descent, applied here as one possibility of numeric minimization...

... and the Hessian is singular.
> ****************
>
> Btw, I actually have a draft of logistic mixing interpretation.
>
> 1. The idea is that coding cost is a linear combination
> of Log[p_i], so linear mixing of submodel inputs in
> log space might be more fitting... and adaptive.
>

But where does Squash come from? If you answer that it's applied to get probs back from logistic space you'd confirm (#).

14. Originally Posted by Shelwien: In fact, yesterday I finally tried an obvious thing -
replacing w0 and w1 in the binary logistic mixer with
w and 1-w and rederiving the update formula.
And what do you think, compression suddenly improved
from 214395 to 214027.

15. Get the inputs: s_i = St(p_i)

Apply mixing: p = Sq( (1-w)*s_0 + w*s_1 ) = Sq( s_0 + (s_1-s_0)*w )

Update: w' = max(0, min(1, w + L*(y-p)*(s_1-s_0) ))
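
A minimal floating-point sketch of this single-weight mix2 update (the struct name and the learning rate are illustrative, not taken from any existing coder):

```
#include <algorithm>
#include <cmath>

// Binary logistic mixer with a single weight w in [0,1], as described above.
struct Mix2 {
    double w = 0.5;   // single weight
    double L = 0.02;  // learning rate

    static double St( double p ) { return std::log(p/(1-p)); }
    static double Sq( double s ) { return 1/(1+std::exp(-s)); }

    // p = Sq( s_0 + (s_1-s_0)*w )
    double mix( double p0, double p1 ) const {
        return Sq( St(p0) + (St(p1)-St(p0))*w );
    }
    // w' = max(0, min(1, w + L*(y-p)*(s_1-s_0) ))
    void update( int y, double p, double p0, double p1 ) {
        w = std::max( 0.0, std::min( 1.0, w + L*(y-p)*(St(p1)-St(p0)) ) );
    }
};
```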

16. pm = w1*Log[p1/(1-p1)] + w2*Log[p2/(1-p2)]
Log[1/(1+Exp[-pm])*(1-2*y)+y]
(pm,p1,p2 are probabilities of 0, y=bit, result is code length (negative))

it can be transformed into

y=0: w1*Log[p1] +w2*Log[p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]
y=1: w1*Log[1-p1]+w2*Log[1-p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]

And then, if we convert that to probabilities:

y=0: p1^w1*p2^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
y=1: (1-p1)^w1*(1-p2)^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
((1-y)*p1^w1*p2^w2+y*(1-p1)^w1*(1-p2)^w2)/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)

So this is what the actual mixing looks like.
Now, I can interpret this as taking a probability of a string of N0*w1 zeroes
encoded with first model (using p1) and then N0*w2 zeroes encoded with second
model, and inferring an "average" zero probability out of it.

pm0^N0 = p1^(N0*w1)*p2^(N0*w2)

Then we do the same with a string of ones, and normalize the estimation into
the same distribution, and that's the result.

pm1^N1 = (1-p1)^(N1*w1)*(1-p2)^(N1*w2)
pm = pm0/(pm0+pm1)
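
This "normalized weighted geometric mean" form can be written down directly; a minimal sketch (my own illustration, with p1, p2 as the submodel probabilities of zero and w1, w2 as the weights):

```
#include <cmath>

// Mixed probability of a zero bit as the normalized weighted geometric mean
// of the submodel zero-probabilities - the "string probability" form above.
double mix_geometric( double p1, double p2, double w1, double w2 ) {
    double pm0 = std::pow(p1, w1) * std::pow(p2, w2);         // zero strings
    double pm1 = std::pow(1-p1, w1) * std::pow(1-p2, w2);     // one strings
    return pm0 / (pm0 + pm1);                                  // pm = pm0/(pm0+pm1)
}
```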

Unfortunately, this doesn't provide any hints for improvement right away,
as d/dw would be the same as in the "old" logistic mixer, and stretch tables
would still be required for the update, even though in the mixing part they
can be replaced with plain exp and log tables.

But still, now I understand what happens much better, which would be helpful.
Also, this representation provides an easy way of algorithm generalization
for larger alphabets (which might be useful for speed optimizations).

And then, this might open a way for more complex estimators based on
string probabilities - for example, with exponential weighting for
the string symbols.

Or, we can try finding an optimal parsing of already processed data
into symbols belonging to one of submodels, and thus improve the
weight estimation precision.

17. Here are some results of constraining a 2-input mixer so that the weights add to 1. This adds 3 results to my previous zpaq results on book1bwt. For simple mixing strategies, the compression is slightly worse (mix2) than when the two weights can change independently. In the third case, where the models are chained and mixed again, constraining the weights helps a bit. In each of the 3 new cases, I tuned the learning rate for max compression.

Code:
```(results for zpaq c(this_file) x book1bwt)
(note: parameters are tuned)
comp 2 0 0 0 (hh hm ph pm)
(uncomment any of below to run)
(235069 simple order 0)    (1 0 cm 9 3)
(237916 simple order 1)    (2 0 const 0  1 cm 17 5)
(220811 static mixer)      (3 0 cm 9 3   1 cm 17 5     2 avg 0 1 120)
(219581 adaptive mixer)    (3 0 cm 9 3   1 cm 17 5     2 mix 0 0 2 14 0)
(219858 adapt 1w mixer)    (3 0 cm 9 3   1 cm 17 5     2 mix2 0 0 1 140 0) (worse)

(221070 indirect order 0)  (1 0 icm 4)
(231883 indirect order 1)  (2 0 const 0  1 icm 12)
(215055 indirect chained)  (2 0 icm 4    1 isse 12 0)
(214936 ind static mixer)  (3 0 icm 4    1 icm 12      2 avg 0 1 160)
(214514 ind adapt mixer)   (3 0 icm 4    1 icm 12      2 mix 0 0 2 14 0)
(214809 ind adapt 1w mixer)(3 0 icm 4    1 icm 12      2 mix2 0 0 1 48 0) (worse)
(214772 chained & mixed)   (3 0 icm 4    1 isse 12 0   2 mix 0 0 2 14 0)
(214604 chained & 1w mixed)(3 0 icm 4    1 isse 12 0   2 mix2 0 0 1 220 0) (better)
hcomp
d= 1 a<<= 9 *d=a halt   (set order 1 context for model 1)
post
0
end```

18. Btw, my current results with book1bwt (using mix_test_bin):
(which only uses straight o0 and o1 stats, not icm)
214023 - rounding added to my binary logistic mixer implementation
213953 - Stretch mapping in update optimized separately from mapping in mix

Also, I noticed a small problem with C arithmetic which affects
some logistic mixer implementations.
That is, (x>>c) is not the same as (x/(1<<c)) for negative x.

19. Hopefully not zpaq. When I write in the document floor(x / 2^c) I mean of course (x >> c).

20. You'd better not mean what you said, as (int(-1)>>1)=-1.
So yeah, zpaq, and other ones.
Check this:
Code:
```#include <stdio.h>
int main( void ) { printf( "%i %i %i %i\n", -1>>1, 1>>1, (-254+2)>>2, (254+2)>>2 ); return 0; }

output:
-1 0 -63 64```

21. Got some interesting results in support of my substring theory.

214023 - baseline mix2
214395 - logistic mixer with two weights
214999 - linear update v1
214184 - linear update v2

Code:
```Node2j f0;
double cl0,cl1;

void Update_v1( int y, int p0,int p1 ) {
  int z, pz=f0.P;
  if( y==0 ) {
    z = (SCALE-pz)*p1 >= pz*p0;
  } else {
    z = (SCALE-pz)*(SCALE-p1) >= pz*(SCALE-p0);
  }
  f0[ctx].Update( z, M_l1wr, M_l1mw );
  w = SCALE-f0.P + hSCALE;
}

void Update_v2( int y, int p0,int p1 ) {
  cl0 = cl0*(SCALE-M_l0wr)/SCALE - log( float(y?SCALE-p0:p0)/SCALE );
  cl1 = cl1*(SCALE-M_l0wr)/SCALE - log( float(y?SCALE-p1:p1)/SCALE );
  int z = cl1>cl0;
  f0.Update( z, M_l1wr, M_l1mw );
  w = SCALE-f0.P + hSCALE;
}```
So, v1 constructs the submodel (o0 and o1) switching sequence (z) based
on comparisons of coding cost of model selection + model output.
But that's a greedy algorithm so I actually expected much worse performance
from it - it's a case similar to LZ greedy parsing.

And v2 constructs the switching sequence based on exponentially weighted
coding cost comparison - btw, the result becomes even better when a context
is used for a counter there (ie for probability estimation of first submodel selection).

Anyway, greedy switching doesn't even allow properly initializing the switching
sequence model, and averaged coding cost comparison might be somewhat
But a better parsing implementation should further improve the performance
of this scheme. And I hope that it might be possible to reach better results
than with gradient update using delayed counters or something like that.

So it looks like logistic mixing is actually quite related to optimal parsing
of processed data, and nobody noticed.

22. Changed the switching flag definition to
Code:
```int yp0 = y?SCALE-p0:p0;
int yp1 = y?SCALE-p1:p1;
int z = (cl1>cl0) || (yp0>yp1+1160);```
and now it's 214010 %)
So seems that improvements (over gradient update) are certainly possible.
They might be much more complicated and slower though...

And meanwhile, the current result with separately optimized mapping
for pm=Squash(x) (to avoid clipping or something) is 213766 %).
So it was possible to improve it almost by 1k just by tweaking the mixer
(comparing to http://encode.su/forum/showpost.php?p=7777&postcount=20).

23. > pm = w1*Log[p1/(1-p1)] + w2*Log[p2/(1-p2)]
> Log[1/(1+Exp[-pm])*(1-2*y)+y]
> (pm,p1,p2 are probabilities of 0, y=bit, result is code length (negative))

Ok, but remember to mention that you work with P(y=0). pm might be misleading, since it is not a linear probability, but a logistic estimate.

> it can be transformed into
>
> y=0: w1*Log[p1] +w2*Log[p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]
> y=1: w1*Log[1-p1]+w2*Log[1-p2] - Log[p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2]
>
> And then, if we convert that to probabilities:
>
> y=0: p1^w1*p2^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
> y=1: (1-p1)^w1*(1-p2)^w2/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
> ((1-y)*p1^w1*p2^w2+y*(1-p1)^w1*(1-p2)^w2)/(p1^w1*p2^w2+(1-p1)^w1*(1-p2)^w2)
>
> So this is what the actual mixing looks like.

I think "converting" here means the reverse transform from logistic space (your pm) to linear space, right? But you "converted" using exp(x), not Squash(x) - having ln(1+exp(... instead of ln(exp(... makes the transformations you did impossible. That's not the same, and the result doesn't have the meaning of a linear probability. The p_i in your equations are probabilities, but the whole expression is in logistic space, so the conversion should be done with Squash.

...
If it were a probability, the rest would be correct, I guess.

> And meanwhile, the current result with separately optimized mapping
> for pm=Squash(x) (to avoid clipping or something) is 213766 %).
> So it was possible to improve it almost by 1k just by tweaking the mixer
> (comparing to http://encode.su/forum/showpost.php?p=7777&postcount=20).

The "clipping" simply increases the dynamic range of Squash. Note that this can be useful, since the derivative (slope) near the asymptote at +-1 is rather small. In NN terms that is often done, too: gradient descent gets slowed down by the low slope. So my suggestion would be, instead of optimizing the mappings themselves (their stems) or the learning rate (as we discussed), to simply optimize the range of Squash. In the implementation I initially gave you it was from -8...8. Thus you could save stems for squash from -10...10 and optimize the range -r...r manually. Afterwards rescale the function to map to +-1 at the edges -r, r.
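
As a rough illustration of that last suggestion, a minimal sketch of generating a squash table whose input range [-r, r] is a tunable parameter; the table size, the fixed-point SCALE and rescaling the output to [0, SCALE] (rather than +-1) are my own assumptions for the example:

```
#include <cmath>
#include <vector>

// Build a squash lookup table over the input range [-r, r], then rescale it
// so that the edge entries map exactly to the extreme output values.
std::vector<int> make_squash_table( int entries, double r, int SCALE = 4096 ) {
    std::vector<int> t(entries);
    for( int i = 0; i < entries; i++ ) {
        double x = -r + 2.0*r*i/(entries-1);          // index -> [-r, r]
        t[i] = int( SCALE/(1.0+std::exp(-x)) + 0.5 ); // plain squash, scaled
    }
    int lo = t.front(), hi = t.back();                // rescale edges to 0..SCALE
    for( int& v : t ) v = int( double(v-lo)/(hi-lo)*SCALE + 0.5 );
    return t;
}
```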

24. Log[1/(1 + Exp[-w1*Log[p1/(1 - p1)] - w2*Log[p2/(1 - p2)]])*(1 - 2*y) + y]

then, for y=1:

sm = w1*Log[p1/(1-p1)] + w2*Log[p2/(1-p2)]

Exp[sm] = (p1/(1-p1))^w1 * (p2/(1-p2))^w2

Log[1-1/(1+Exp[-sm])] = Log[1-1/(1+1/Exp[sm])] =

Log[1-Exp[sm]/(Exp[sm]+1)] = Log[1/(Exp[sm]+1)]

Log[ 1 / ( 1 + (p1/(1-p1))^w1 * (p2/(1-p2))^w2 ) ] =

Log[ 1 / ( 1 + p1^w1*p2^w2 / (1-p1)^w1 / (1-p2)^w2 ) ] =

Log[ ( (1-p1)^w1 * (1-p2)^w2 ) / ( (1-p1)^w1*(1-p2)^w2 + p1^w1*p2^w2 ) ] =

w1*Log[1-p1] + w2*Log[1-p2] - Log[ (1-p1)^w1*(1-p2)^w2 + p1^w1*p2^w2 ]

25. I admit you are right - I made a mistake in the calculations. Looks like I'm more error-prone than Mathematica.

Good work.

BTW I found an alternate representation. Note p = P(y=1), b = 1-2*y. In the opposite case replace b with -b.

p = [(1-p1)^(b*w1) * (1-p2)^(b*w2)] / [(1-p1)^(b*w1) * (1-p2)^(b*w2) + p1^(b*w1) * p2^(b*w2)]

What about the clipping I mentioned?

26. Originally Posted by Shelwien: You'd better not mean what you said, as (int(-1)>>1)=-1.
floor(x) = the largest integer not greater than x, so floor(-1/2) = -1, right?

27. I mean, it's wrong behaviour for symmetric values.

28. http://ctxmodel.net/files/mix_test/mix_test_v4.rar

(v3)
215703 - dual fixed-rate linear mixer
214695 - paq-like logistic mixer
(v4)
213663 - logistic mix2 with gradient descent update
213599 - logistic mix2 with likelihood-based update

- Logistic mixer rewritten
- A new update based on new logistic mixing interpretation tested
- 3 mappings per coder optimized

Update: Tested with enwik8.bwt:
21169967 - v3/linear
21064282 - v3/logistic
21038280 - v4/logistic
21065319 - v4/likelihood

So the new mixer seems to be too tuned to book1.
On the other hand, it's still on par with the "original" logistic mixer.

29. Originally Posted by Shelwien: I mean, it's wrong behaviour for symmetric values.
That's true; accordingly, in adjusting the zpaq mixer weights I use round() instead of floor(). However, in computing the mixer output prediction, floor() works just as well experimentally, so I used it to save a few instructions. In fact, for this result:

Code:
`(219581 adaptive mixer)    3 0 cm 9 3   1 cm 17 5     2 mix 0 0 2 14 0`
I tried replacing the two instances of floor() with round() in predict() and the output was 8 bytes larger. In other words, I tried inserting "+128" below:

Code:
```for (int j=0; j<m; ++j)
  p[i]+=(wt[j]+128>>8)*p[cp+j];
p[i]=clamp2k(p[i]+128>>8);```
Inside the loop, round() made no difference at all.

30. Seems like there're still things to work on, considering lookup table generation.