
SMAC 1.16: Bitwise, Logistic Mixing, ICM, Orders-6 to 1 & Match Model

● A context seen for the first time points to a Bit history value of 0. With version 1.16, the pair of frequencies linked to this Bit history entry will not be updated, so its probability will remain p(0)=1/2. Therefore, because Stretch(p(0))=Log_{2}(0.5/0.5)=0, contexts seen for the first time will have no impact on the global prediction.
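To illustrate, here is a minimal Python sketch (SMAC itself is x86 assembly) of why a fresh context is neutral in the mix:

```python
import math

def stretch(p):
    """Logit in base 2: stretch(p) = log2(p / (1 - p))."""
    return math.log2(p / (1.0 - p))

# A never-seen context keeps p(0) = 1/2, so its stretched
# prediction is exactly 0 and it contributes nothing to the
# mixer's weighted sum.
print(stretch(0.5))   # prints 0.0
```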

● The Match model is triggered on a byte boundary. It searches, on encoding, the Source buffer history (data read so far) to find the same pattern as the current Context (8 bytes). On decoding, it will search the Destination buffer (data written so far).

The bits following that pattern will be used as the predicted bits until a mismatch occurs, with probabilities of p(0)=65535/65536 if bit 0 is predicted, and p(0)=1/65536 if bit 1 is predicted.

Eventually, that prediction will be included in the Logistic mixing process, along with Orders-6 to 1 predictions.
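For illustration, a hedged sketch of PAQ-style logistic mixing, assuming a simple gradient update on the weights; the function names and learning rate are illustrative assumptions, not SMAC's actual code:

```python
import math

def stretch(p):
    return math.log2(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + 2.0 ** (-x))   # inverse of stretch

def mix(probs, weights):
    """Weighted sum of stretched predictions, squashed back to a
    probability that the next bit is 0."""
    return squash(sum(w * stretch(p) for w, p in zip(weights, probs)))

def update(probs, weights, bit, lr=0.002):
    """One gradient step nudging the mix toward the observed bit."""
    err = (1 - bit) - mix(probs, weights)   # target is p(0)
    return [w + lr * err * stretch(p) for w, p in zip(weights, probs)]
```

Each model contributes its stretched prediction scaled by a weight; after coding a bit, the weights move in the direction that would have reduced the error.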

The Match model uses a hash-table of 512 Kilobytes (65536 entries of 8 bytes) to find in the buffer a pattern matching the current Context.

Each entry of the hash-table is designed like this :

http://fastpack.free.fr/MatchTable.PNG

The 64 bits of the Context are hashed to a 16-bit value. Because collisions are very likely, the low half of the current Context will be compared with the 2nd double-word of the Match hash-table entry, which contains the low half of the last Context having produced the same hash.

If both values differ, a collision is detected and the Match model will terminate with a *Match Prediction* value of 2, meaning that no prediction could be made, and a *Match Offset* value of 0, meaning the model is turned "off".

If both values are equal, the *Match Offset* value will be filled with the first double-word of the Match hash-table entry, which contains the offset in the buffer of the next byte that follows the same context. *Match Prediction* will be filled with the most significant bit of that byte.

In all cases (whether a prediction could be carried out or not), the entry of the Match hash-table will be updated with, respectively, the offset of the byte following the current Context and the low double-word of that Context.

During the next bit encoding/decoding process, the procedure will know that the Match model is "on" by testing the *Match Offset* variable, which will be different from 0, and will fetch the next bit of *Match Prediction* directly from that address. If that prediction turns out to be wrong, *Match Offset* will be set to 0, and a new prediction will be calculated at the next byte boundary.
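The lookup/update cycle described above can be sketched as follows; the class name, the multiplicative hash and the little-endian packing are illustrative assumptions, not SMAC's actual assembly code:

```python
class MatchTable:
    """Illustrative match-model table: 65536 entries, each holding
    (offset of the byte after a previous occurrence, low 32 bits of
    the 8-byte context that produced it)."""

    def __init__(self):
        self.table = [(0, 0)] * 65536

    def lookup_and_update(self, buf, pos):
        """buf: bytes coded so far; pos: current byte boundary (>= 8).
        Returns (match_offset, prediction); prediction 2 means none."""
        ctx = int.from_bytes(buf[pos - 8:pos], "little")  # 8-byte context
        h = ((ctx * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF) >> 48
        offset, low = self.table[h]
        if offset != 0 and low == ctx & 0xFFFFFFFF:
            match_offset = offset
            prediction = buf[offset] >> 7    # MSB of the predicted byte
        else:
            match_offset, prediction = 0, 2  # empty slot or collision
        self.table[h] = (pos, ctx & 0xFFFFFFFF)  # always refresh the entry
        return match_offset, prediction
```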

Results on enwik8 & enwik9 are 21831822 & 183551384.


SMAC 1.17 : Optimized for speed

After looking over Intel & Agner Fog’s documentation regarding Core Duo processors & above, I endeavoured to modify the code in order to give a boost to compression & decompression times.

I have discarded many instructions decoding to more than 1 µop or with high latencies, reorganized them to favor a 4-1-1-1 decoding scheme and, more generally, tried to take advantage of recent Intel processor features like store forwarding, the loop buffer, zero-latency movs, macro-fusion, etc.

The FNV-1 hash for Contexts has also been overhauled, so that the Order-4 hash uses the Order-3 hash as a partial result, and likewise for the Order-6 hash.
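A sketch of the idea, assuming bytes are fed into FNV-1 newest-first so that each higher-order hash simply continues the lower-order one; the 32-bit constants are the standard FNV-1 parameters, and the function names are illustrative:

```python
FNV_PRIME = 0x01000193   # standard 32-bit FNV-1 prime
FNV_BASIS = 0x811C9DC5   # standard 32-bit FNV-1 offset basis

def fnv1_step(h, byte):
    """One FNV-1 round: multiply by the prime, then XOR the byte."""
    return ((h * FNV_PRIME) & 0xFFFFFFFF) ^ byte

def context_hashes(history):
    """Hashes for Orders 1..6, newest byte first, so the Order-4
    hash is just the Order-3 hash extended by one more round."""
    h = FNV_BASIS
    hashes = {}
    for order, byte in enumerate(reversed(history[-6:]), start=1):
        h = fnv1_step(h, byte)
        hashes[order] = h
    return hashes
```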

There is no improvement in the compression engine itself, yet the results are a tad better due to a change in the rounding method of the Bit History counters. Also, the lookup table for 2^{32}/(n0+n1) now uses floating-point values instead of integers.

Results on enwik8 & enwik9 are 21816272 & 183459153.


SMAC 1.17A : more speed optimization

I managed to gain a further 20 seconds on enwik8 with my Core 2 Duo Merom, essentially by getting rid of the slow FSCALE instruction in the Squash function, used when calculating 2^{-p}.

I also replaced some FPU instructions with faster ones, plus a few more optimization tricks.

Compression results remain identical to the previous version.


SMAC 1.18 : optimized for speed (part III, final?)

* The Squash function has been entirely rebuilt using a chain of polynomial approximations (3rd degree, 32 intervals, step=1).

I've traded a bit of compression for the benefit of speed : -40 seconds on enwik8 on my Core 2 Duo.
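For illustration, a Python sketch of the piecewise idea using interpolating cubics on 32 unit-wide intervals; the actual tables are minimax polynomials (a better fit for the same degree), so the coefficients here are only a stand-in:

```python
import math

def squash(x):
    """Inverse of stretch: squash(x) = 1 / (1 + 2**(-x))."""
    return 1.0 / (1.0 + 2.0 ** (-x))

def cubic_through(points):
    """Lagrange-interpolating cubic through four (x, y) points."""
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(points):
            term = yi
            for j, (xj, _) in enumerate(points):
                if i != j:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly

# 32 unit-wide intervals covering [-16, 16], one cubic per interval.
pieces = []
for k in range(32):
    a = -16.0 + k
    pts = [(a + t, squash(a + t)) for t in (0.0, 1/3, 2/3, 1.0)]
    pieces.append(cubic_through(pts))

def squash_approx(x):
    i = min(31, max(0, int(math.floor(x + 16))))  # pick the interval
    return pieces[i](x)
```

Evaluating a degree-3 polynomial is a handful of multiply-adds, which is why trading the exact 2^{-x} for an approximation buys speed.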

* Until recently, I was not aware that XMM registers were considered volatile when calling an API.

On 64-bit Windows, it was possible (though not confirmed) that calls to ReadFile & WriteFile destroyed xmm0 & xmm2, which held the Context and a mask for a hash-table.

Ergo, I have replaced xmm0 & xmm2 with xmm6 & xmm7, which are preserved across calls by the 64-bit calling convention.

The Squash function now uses an SSE3 instruction (fisttp). Since SSE3 was introduced in 2004 (Intel) & 2005 (AMD), I've assumed that nowadays it is available on every PC (in other words, I've been too lazy to check for SSE3 compatibility).

The (minimax) polynomial tables for the Squash function are the work & idea of qWord : thanks to him, and all MASM32 forum members for their support!

Results on enwik8 & enwik9 are 21816285 & 183459860.


SMAC 1.19 : yet another speed improvement

- I have figured out that my Stretch function could be radically improved.

Indeed, if we substitute *p0 = (n0/(n0+n1))·2^{32}* into *Stretch(p0) = log_{2}(p0/(2^{32}-p0))*, we end up with :

*Stretch(p0) = log_{2}(n0/n1) = log_{2}(n0) - log_{2}(n1)*

with n0 = bit 0 count, n1 = bit 1 count, p0 = prediction that the next bit will be 0, and each n_{i} ranging over [1;65535].

Thus, there is no need to calculate p0 & p1. Compression of enwik8 now takes less than 5 min. on my Core 2 Duo.

- The Context hashes are now calculated earlier in order to increase the *Prefetch Scheduling Distance* : prefetch instructions benefit from this additional leeway to complete successfully.
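A quick numerical check of the identity, with probabilities normalized to [0;1] instead of 2^{32} for readability:

```python
import math

def stretch(p0):
    return math.log2(p0 / (1.0 - p0))

# With p0 = n0/(n0+n1), stretch(p0) = log2(n0) - log2(n1),
# so the stretched prediction can be read off the counts directly.
n0, n1 = 300, 7
p0 = n0 / (n0 + n1)
direct = math.log2(n0) - math.log2(n1)
assert abs(stretch(p0) - direct) < 1e-12
```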

Results on enwik8 & enwik9 are 21816323 & 183459942.


SMAC 1.20 : using Laplace estimator

I have replaced my prediction formula *p0=n0/(n0+n1)* with Laplace's rule of succession, which gives better results when estimating the probability of the next symbol.

From now on, *p0=(n0+1)/(n0+n1+2)*, and *Stretch(p0)=log2(n0+1)-log2(n1+1)*.

I made some experiments with the Krichevsky-Trofimov estimator, where *p0=(n0+0.5)/(n0+n1+1)* and *Stretch(p0)=log2(2n0+1)-log2(2n1+1)*, but results were not as good.

One positive side-effect of the Laplace estimator is that it solves the zero-frequency problem : with the previous formula, I had to initialize the n0 and n1 counts to one to avoid a division by zero. Now, the frequencies can range over [0;65535].
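Both estimators can be checked numerically; this sketch uses probabilities in [0;1] rather than 16-bit counts, and the function names are illustrative:

```python
import math

def stretch(p0):
    return math.log2(p0 / (1.0 - p0))

def p0_laplace(n0, n1):
    """Laplace's rule of succession."""
    return (n0 + 1) / (n0 + n1 + 2)

def p0_kt(n0, n1):
    """Krichevsky-Trofimov estimator."""
    return (n0 + 0.5) / (n0 + n1 + 1)

# Zero counts no longer divide by zero; a fresh context gets 1/2.
assert p0_laplace(0, 0) == 0.5 and p0_kt(0, 0) == 0.5

# The closed forms for the stretched predictions match:
n0, n1 = 12, 3
assert abs(stretch(p0_laplace(n0, n1))
           - (math.log2(n0 + 1) - math.log2(n1 + 1))) < 1e-12
assert abs(stretch(p0_kt(n0, n1))
           - (math.log2(2*n0 + 1) - math.log2(2*n1 + 1))) < 1e-12
```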

The Match model has been improved. The bit following a matching 8-byte pattern is now predicted with a probability of *p0=1-(1/x)* if bit 0 is predicted, and *p0=1/x* if bit 1 is predicted, with x=64 initially and x incremented by 1 after each correct prediction.
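A minimal sketch of this confidence schedule (the class and method names are illustrative, not SMAC's actual code):

```python
class MatchConfidence:
    """x starts at 64 and grows by one per correct prediction, so
    the model's stake rises the longer the match holds."""

    def __init__(self):
        self.x = 64

    def p0(self, predicted_bit):
        """Probability that the next bit is 0."""
        return 1 - 1/self.x if predicted_bit == 0 else 1/self.x

    def on_correct(self):
        self.x += 1
```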

Results on enwik8 & enwik9 are 21781544 & 183190888.


SMAC 1.22 : Secondary Symbol Estimation

● SSE improves the prediction by collecting further statistics in a 2D table, indexed by an Order-2 context and a stretched prediction ranging from -16 to +16 (33 boundaries).

My method roughly follows the one described by Matt in the latest PAQ archivers.

Each entry of the table stores a pair of 16-bit counters reflecting the observed bits for a given context & prediction.

From these counters, a new stretched prediction will be calculated.

http://fastpack.free.fr/sse.png

Like in PAQ, both counters are halved when one of them saturates, and the initial counters are set so that each prediction maps to itself.

The SSE prediction is calculated as a linear interpolation between the boundaries above and below, and the final prediction is equal to *3/4 SSE prediction + 1/4 Initial prediction*.

The entry closest to the interpolation point will be updated when the actual bit is known.
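A hedged sketch of the interpolation step, assuming the table stores stretched predictions directly and simplifying away the counter pair (names are illustrative):

```python
NUM_BOUNDARIES = 33          # stretched values -16, -15, ..., +16

def sse_stretch(table, ctx, st):
    """Return the interpolated SSE prediction for a stretched input
    st, plus the index of the closest boundary (the entry to update
    once the actual bit is known)."""
    st = max(-16.0, min(16.0, st))
    lo = min(31, int(st + 16))           # lower boundary index
    frac = (st + 16) - lo                # position between boundaries
    pred = table[ctx][lo] * (1 - frac) + table[ctx][lo + 1] * frac
    nearest = lo if frac < 0.5 else lo + 1
    return pred, nearest

# Initial entries map each prediction to itself, so SSE starts out
# as the identity and only bends the curve as statistics accumulate.
table = {0: [i - 16.0 for i in range(NUM_BOUNDARIES)]}
pred, nearest = sse_stretch(table, 0, 2.25)
```

The final output would then blend 3/4 of the SSE prediction with 1/4 of the initial one, as described above.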

● With hindsight, it appeared to me that my Match model was overly complicated.

Henceforth, each entry of the Match table is a simple double-word containing the offset of a matching pattern for the current context (8 bytes).

Both contexts (current context & Matching pattern) are still compared to detect hash collisions.

Also, I have increased the memory allocated for the table.

● FPU instructions have been replaced by SIMD ones (SSE & SSE2), except in the creation of the Logarithms lookup table.

● Some optimizations in the code.

● Source code has been ported to UASM.

Results on enwik8 & enwik9 are 21047188 & 176607820.