I'll paste the code I'm referring to:

//////////////////////////// Mixer /////////////////////////////

// Mixer m(N, M) combines models using M neural networks with
// N inputs each using 4*M*N bytes of memory.  It is used as follows:
// m.update(y) trains the network where the expected output is the
// last bit, y.
// m.add(stretch(p)) inputs prediction from one of N models.  The
// prediction should be positive to predict a 1 bit, negative for 0,
// nominally -2K to 2K.
// m.set(cxt) selects cxt (0..M-1) as one of M neural networks to use.
// m.p() returns the output prediction that the next bit is 1 as a
// 12 bit number (0 to 4095).  The normal sequence per prediction is:
//
// - m.add(x)     called N times with input x=(-2047..2047)
// - m.set(cxt)   called once with cxt=(0..M-1)
// - m.p()        called once to predict the next bit, returns 0..4095
// - m.update(y)  called once for actual bit y=(0..1).

inline void train(int *t, int *w, int n, int err) {
  for (int i=0; i<n; ++i)
    w[i]+=(t[i]*err+0x8000)>>16;  // parenthesized for clarity; + binds tighter than >>
}

inline int dot_product(int *t, int *w, int n) {
  int sum=0;
  for (int i=0; i<n; ++i)
    sum+=t[i]*w[i];
  return sum>>8;
}

class Mixer {
  const int N, M;  // max inputs, max contexts
  int* tx;         // N inputs
  int* wx;         // N*M weights
  int cxt;         // context
  int nx;          // Number of inputs in tx, 0 to N
  int pr;          // last result (scaled 12 bits)
public:
  Mixer(int n, int m);

  // Adjust weights to minimize coding cost of last prediction
  void update(int y) {
    int err=((y<<12)-pr)*7;
    assert(err>=-32768 && err<32768);
    train(&tx[0], &wx[cxt*N], N, err);
    nx=0;
  }

  // Input x (call up to N times)
  void add(int x) {
    assert(nx<N);
    tx[nx++]=x;
  }

  // Set a context
  void set(int cx) {
    assert(cx>=0 && cx<M);
    cxt=cx;
  }

  // predict next bit
  int p() {
    return pr=squash(dot_product(&tx[0], &wx[cxt*N], N)>>8);
  }
};

Mixer::Mixer(int n, int m):
    N(n), M(m), tx(0), wx(0), cxt(0), nx(0), pr(2048) {
  assert(n>0 && N>0 && M>0);
  alloc(tx, N);
  alloc(wx, N*M);
}

Function p() does a weighted summation of the stretched predictions and then squashes the result. My concern is about the weighted summation. AFAIU, the sum of the weights has to stay more or less constant for the whole algorithm to work correctly, but that isn't obvious from looking at the train() function. Do you have a simple explanation of the convergence? What are the possible errors, i.e. the distance from the ideal weight sum (a fixed constant) caused by rounding errors or similar effects?

A side question:

What is the most time-consuming part of LPAQ1? Is it retrieving data from the hash tables, or the mapping and mixing (i.e. the heavy computation) themselves?