Same origin, I'm just posting.
Alpha version; it mostly supports -cO and -cc single-threaded decoding, with problems in some audio blocks.
Please help with deciphering -cc; it seems related to lpaq.
- replaced coroutine class with a getc/putc stub to simplify things;
- turned all functions to class methods, removed t-> prefixes;
- split functions and structures to multiple include files;
- replaced most of dynamic allocation with static;
- merged most init functions;
Code:
book1 768771->202310 c.time=1.031s d.time=1.031s
Deciphered an interesting counter.
It looks like this in the code:
// kDivideLookup[0] = 255;
// for( i=1; i<255; i++ ) kDivideLookup[i] = 1024 / (3 + i * 2);
// kDivideLookup[255] = 1;
uint C_Update1( uint x, uint bit, uint lim ) {
byte tt = byte(x);
uint z = (bit<<23) - (x>>9);
uint y = kDivideLookup[tt] * z;
return (y & 0xFFFFFF00) + x + (tt<lim);
}
but essentially works kinda like this:
union C1 {
uint X;
struct {
uint N :8;
uint P :32-8;
};
};
uint C_Update1( uint _x, uint bit, uint lim ) {
C1 x; x.X=_x;
if( bit==0 ) {
x.P -= 2*x.P/(x.N*2+3);
} else {
x.P += 2*((1<<24)-x.P)/(x.N*2+3);
}
x.N += (x.N<lim);
return x.X;
}
This version also improves book1 compression from 203310 to 202412.
In other words, it's something like P = (1<<24)*(N1+0.75)/(N0+0.75+N1+0.75).
The interesting part is the "vector" update: x + (tt<lim) + (y & 0xFFFFFF00)
updates both the occurrence-counter and the probability parts of the state without any shifts.
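In packed form (my reading of the code above, not something taken from the original source) the state is:
// x = PPPPPPPP PPPPPPPP PPPPPPPP NNNNNNNN  - 24-bit probability P above the 8-bit count N
// (tt<lim)          adds 0 or 1 to the N byte; it can't overflow into P, since it only increments while N<lim
// (y & 0xFFFFFF00)  has a zero low byte, so it only moves the P field
so a single 32-bit add updates both fields at once.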
Update: Sorry, it was actually 202310 in the original version. I guess the limited-precision division via multiplication by a LUT
has the effect of slightly slowing down the update rate for higher N (e.g. at N=100 the LUT holds floor(1024/203)=5 rather than 5.04, so the step is a bit smaller than exact division would give).
I tried, and got 202300 like this:
if( bit==0 ) {
x.P -= 4*x.P/(x.N*5+5);
} else {
x.P += 4*((1<<24)-x.P)/(x.N*5+5);
}
Reminds "Krichevski-Trofimov (KT) estimator" from CTW: http://mattmahoney.net/dc/dce.html#Section_423
But this kind of approach poorly adapts to local situation - finding average from beginning.
For adapting to local situation it is better to use exponential moving average based, like
CDF += (newCDF - CDF) >> rate
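For example, a shift-based binary counter of that kind is just (a generic sketch, not code from nzcc):
void UpdateBitP( unsigned& p, unsigned bit, unsigned rate ) {
  // p is a 16-bit probability of bit==1; move it a fixed fraction
  // (1/2^rate) of the remaining distance towards the observed bit
  if( bit ) p += (0xFFFF - p) >> rate;
  else      p -= p >> rate;
}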
1) N is actually limited (255 at most); when the limit is reached, it starts updating at a fixed rate.
2) It's actually used for APM counters - the lpaq-like FSM counter state is mapped to an actual probability using this.
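So the usage pattern is roughly this (simplified, with made-up names - the real submodel code is below):
byte* state = &fsm_states[ctx];                 // lpaq-like FSM counter state for this context
uint  p     = apm_model[*state];                // one adaptive probability per FSM state
apm_model[*state] = C_Update1( p, bit, 128 );   // update it with the counter above
// ...then advance *state through the FSM transition table for the observed bit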
- refactored FSM counter (former kLzModelLNext);
- refactored APM counter (the variable-rate one above);
- refactored main submodels (o1,o2,o3,o4,o6,o10,o24);
An example. Other submodels look the same, just with different indexes.
// order1 FSM+APM+stretch
C_Update1(modele[0].ptr,bit,128);
C_Update0(ptrs8[0],bit);
ptrs8[0] = &modelv[ ((out_byte<<8)&0xFF00) + L_LUT[out_byte_wip] ];
modele[0].ptr = &modele[0].model[*ptrs8[0]];
factors[0] = stretch1( modele[0].ptr );
// order2 FSM+APM+stretch
C_Update1(modele[1].ptr,bit,254);
C_Update0(ptrs8[1],bit);
ptrs8[1] = &modelw[ ((out_byte<<8)&0xFFFF00) + L_LUT[out_byte_wip] ];
ptrs8_cur[1] = *ptrs8[1];
modele[1].ptr = &modele[1].model[*ptrs8[1]];
factors[1] = stretch1( modele[1].ptr );
byte hist_and = ptrs8_cur[1]!=0;
byte hist_sum = hist_and;
// hashed order3 FSM+APM+stretch
C_Update1(modele[2].ptr,bit,254);
C_Update0(ptrs8[2],bit);
if( bit_index_in_byte==0 ) {
hashes_arr28[2] = (out_byte<<8) ^ byte(out_byte) ^ 0xFFFFFF00;
uint hash2 = 33595399 * hashes_arr28[2] ^ (hashes_arr28[2] >> 11);
ptrs8[2] = CM_GetHashSlot(hash2) + 4;
} else if( bit_index_in_byte==4 ) {
ptrs8[2] = CM_PtrGetX(ptrs8[2], (byte)(out_byte_wip ^ hashes_arr28[2]));
} else {
uint incr = (bit + 1) << ((bit_index_in_byte & 3) - 1);
ptrs8[2] += incr*4;
}
ptrs8_cur[2] = *ptrs8[2];
modele[2].ptr = &modele[2].model[ptrs8_cur[2]];
factors[2] = stretch1( modele[2].ptr );
hist_and &= (ptrs8_cur[2]!=0);
hist_sum += hist_and;
What's the objective here? I might be missing some history... NanoZip source code lost?
Well, it's another reverse-engineering project, but the target is not Christian's, so nobody complains.
Also, the author died.
It was never open-source (although "nanozip -cc" aka nzcc seems to be a speed-optimized lpaq).
Sad to hear!
Is no one close to him able to share his code?
Maybe someone is still reading private messages on: https://fi.linkedin.com/in/runsas ?
It's been a long time, so nope. https://encode.su/threads/111-NanoZi...c?p=44233&pp=1
- submodel detector based on entropy estimation
- bitmodel - unaligned bit model with 17-bit context
- wordmodel (parses last english word until length 128 or non-letter)
- sparse order1 FSM+APM: xxxxxxxx 00000000; disabled if normal order1 gives better results
- masked order3 FSM+APM: xxx00000 xxx00000 xxx00000; only used for high 5 bits
- logistic mixing with context based on wordmodel and matchmodel (see the sketch after the results below)
- interpolated SSE<13>
- adaptive logistic extrapolation ("TSE")
Main model mostly done, matchmodel remains.
Code:
202310 0.953s 0.985s // v0
202310 1.000s 1.015s // v1
202310 1.078s 1.093s // v2
202310 1.063s 1.078s // v3
202728 1.000s 1.016s // TSE disabled
213440 1.000s 1.000s // SSE disabled
221784 0.937s 0.937s // SSE+TSE disabled
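For reference, the logistic mixing step listed above works roughly like this (a generic lpaq/paq-style sketch with made-up names, not the actual nzcc code); the weight set w would be selected by the wordmodel/matchmodel context, and the mixed probability then goes through the SSE stage:
#include <math.h>
// st[i] = stretch(p[i]) = ln(p[i]/(1-p[i])) in fixed point; Squash() is the inverse
int Squash( int x ) { return (int)( 4096/(1+exp(-x/256.0)) ); }

int MixPredict( const int* st, const int* w, int n ) {
  long long sum = 0;
  for( int i=0; i<n; i++ ) sum += (long long)st[i]*w[i]; // weighted sum in the logistic domain
  return Squash( (int)(sum>>16) );                       // back to a 12-bit probability
}

void MixUpdate( const int* st, int* w, int n, int p, int bit, int lrate ) {
  int err = (bit<<12) - p;                               // prediction error, 12-bit scale
  for( int i=0; i<n; i++ ) w[i] += (st[i]*err) >> lrate; // gradient step per weight
}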
About the hashtable...
It's designed like this.
There are 64-byte cells, allocated per nibble.
Each cell contains 4 slots for binary counter trees (interleaved, maybe to avoid *15) and a hash byte per slot.
When a slot for an unknown hash byte is requested, the slot with the minimum high-bit-counter value is chosen and cleared out, and its hash byte is replaced.
When the hash byte is known, it just returns a pointer to the corresponding counter tree.
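In other words, each cell looks roughly like this (my reconstruction of the layout from the code below, not verified against the original):
struct HashCell {      // 64 bytes, one cell per hashed context
  byte hash[4];        // one hash-check byte per slot
  byte counters[60];   // 15 tree counters per slot, interleaved:
                       // counter k of slot i sits at counters[4*k+i]
};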
The original code uses some tricky arithmetic there:
byte v4 = (rp[5] < rp[4]);
byte v5 = -(rp[6] >= rp[v4 + 4]);
byte v6 = (v5 & v4) + (~v5 & 2);
byte v7 = -(rp[7] >= rp[v6 + 4]);
rp += (v7 & v6) + (~v7 & 3);
But that's actually just finding the index of the minimum of 4 bytes.
The new version looks like this:
byte* __restrict FindCounter( byte* __restrict p, byte b ) {
// find a slot with matching high byte of hash
uint i = (p[0]==b) + 2*(p[1]==b) + 4*(p[2]==b) + 8*(p[3]==b) + 16;
i = _bit_scan_forward(i);
if( i<4 ) return p+4+i;
// find a slot with min counter (for high bit)
i = (p[4+1]<p[4+0]) ? 1 : 0;
i = (p[4+2]<p[4+i]) ? 2 : i;
i = (p[4+3]<p[4+i]) ? 3 : i;
// claim the slot for current context
p[i] = b;
// init the counters
p += 4+i;
for( i=0; i<15; i++ ) p[4*i]=0;
return p;
}
// input: p = high nibble counter location, b = hash byte
byte* __restrict CM_PtrGetX( byte* __restrict p, byte b ) {
p -= (p-((byte*)0)) & 63; // align 64
p += 64 * (b|1); // |1 just to always get a different slot than high nibble?
return FindCounter( p, b );
}
byte* __restrict CM_GetHashSlot( uint hash ) {
return FindCounter( &some_ptr[(hash<<6) & some_mask], hash>>24 );
}
And apparently the tricky arithmetic doesn't really help - speed didn't change, and the new code is still branchless.
There seemed to be some potential for vectorization, so I tried rewriting it with _mm_minpos_epu16 (for finding p[i]==b and for finding the min),
and a vector AND with 0xFFFFFF00 for clearing out the slot, but speed didn't improve.
Some choices are pretty random though - *15 would probably be better for caching (and imho it would be better to use 16 slots and vector math),
clearing the slot saves 4 bytes on book1, and (b|1) loses 4 bytes.
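The minpos part of that experiment looks roughly like this (a sketch of the idea, with a made-up helper name; the actual rewrite also handled the p[i]==b match and the slot clearing):
#include <smmintrin.h> // SSE4.1
uint MinIndex4( const byte* p ) {
  // load the 4 per-slot counters p[4..7] into the low 4 bytes of a vector
  __m128i b = _mm_cvtsi32_si128( *(const int*)(p+4) );
  // zero-extend to 16-bit lanes; force the unused lanes 4..7 to 0xFFFF so they never win
  __m128i w = _mm_or_si128( _mm_cvtepu8_epi16(b), _mm_set_epi16(-1,-1,-1,-1,0,0,0,0) );
  // _mm_minpos_epu16 returns the min value in word 0 and its index in word 1
  return _mm_extract_epi16( _mm_minpos_epu16(w), 1 );
}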