Here is a program I wrote, based on my refactor of paq8px. It takes a single command-line argument: the file to process. Like paq8px, it reads the file bit by bit and passes those bits to a context model. Unlike paq8px, it doesn't create an output file. It simply records each input bit together with the model's 12-bit prediction for it (0 to 4095). For each bit that isn't perfectly predicted, it outputs the bit position, the target value (0 for a 0 bit, 4095 for a 1 bit), and the prediction. It also outputs the average prediction error, i.e. the mean absolute difference between target and prediction, across all bits in the file.
Code:
#include "Mixer.hpp"
#include "MixerFactory.hpp"
#include "ModelStats.hpp"
#include "file/FileDisk.hpp"
#include "file/FileName.hpp"
#include "file/fileUtils2.hpp"
#include "model/ContextModel.hpp"
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include "Models.hpp"

auto main(int argc, char *argv[]) -> int {
  // The only argument is the file to process.
  if( argc < 2 ) {
    fprintf(stderr, "Usage: %s <file>\n", argv[0]);
    return 1;
  }
  auto shared = Shared::getInstance();
  shared->chosenSimd = SIMD_AVX2;
  shared->setLevel(9);
  FileName input;
  input += argv[1];
  FileDisk f;
  f.open(input.c_str(), true);
  uint64_t fSize = getFileSize(input.c_str());
  auto *modelStats = new ModelStats();
  auto *models = new Models(modelStats);
  ContextModel contextModel(modelStats, *models);
  auto updateBroadcaster = UpdateBroadcaster::getInstance();
  auto programChecker = ProgramChecker::getInstance();
  shared->reset();
  shared->buf.setSize(shared->mem * 8);
  int c = 0;
  uint8_t y = 0;
  // One prediction and one actual bit are stored per input bit.
  auto results = static_cast<uint16_t *>(malloc(8 * fSize * sizeof(uint16_t)));
  auto ys = static_cast<uint8_t *>(malloc(8 * fSize * sizeof(uint8_t)));
  if( fSize == 0 || results == nullptr || ys == nullptr ) {
    fprintf(stderr, "Empty input or allocation failure\n");
    return 1;
  }
  uint64_t position = 0;
  // 64-bit counter: avoids the signed/unsigned comparison and overflow on large files.
  for( uint64_t j = 0; j < fSize; ++j ) {
    c = f.getchar();
    for( int i = 7; i >= 0; --i ) {
      // Predict first, using only the bits seen so far...
      auto p = contextModel.p();
      results[position] = p;
      // ...then reveal the actual bit (most significant bit first) and update.
      y = (c >> i) & 1U;
      shared->y = y;
      shared->update();
      ys[position] = y;
      ++position;
      updateBroadcaster->broadcastUpdate();
    }
  }
  uint64_t sum = 0;
  for( uint64_t i = 0; i < position; ++i ) {
    y = ys[i];
    uint16_t target = y == 0 ? 0 : 4095;
    // Print only the bits whose prediction differs from the target.
    if( target != results[i] ) {
      printf("%" PRIu64 ", %d, %d\n", i, target, results[i]);
    }
    sum += abs(target - results[i]);
  }
  // Mean absolute error over all bits, on the 0..4095 scale.
  printf("(0 - %" PRIu64 ") %f\n", position, double(sum) / double(position));
  programChecker->print();
  free(results);
  free(ys);
  return 0;
}
Notice that the prediction is made before the model is updated with the next bit, such that the prediction of the first bit is made after having seen 0 bits, the prediction of the second bit is made after having seen 1 bit, and so on. Is this correct?
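To make the ordering concrete, here is a minimal sketch of the same predict-then-update loop with a toy model. `ToyPredictor` and `predictThenUpdate` are hypothetical names of my own, not part of paq8px, and the single adaptive counter is only a stand-in for the real context model. The point is just that the prediction recorded for bit k depends only on bits 0 through k-1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the context model: a single adaptive
// 12-bit estimate of P(next bit == 1), scaled to 0..4095.
struct ToyPredictor {
  uint16_t pr = 2048; // no information yet, so start near 50%

  uint16_t p() const { return pr; }

  // Move the estimate 1/16 of the way toward the observed bit.
  void update(int y) {
    assert(y == 0 || y == 1);
    if( y ) pr += (4095 - pr) >> 4;
    else    pr -= pr >> 4;
  }
};

// Returns one prediction per input bit, each recorded BEFORE the
// model has seen that bit (bits are taken MSB first).
std::vector<uint16_t> predictThenUpdate(const std::vector<uint8_t> &bytes) {
  ToyPredictor model;
  std::vector<uint16_t> predictions;
  predictions.reserve(bytes.size() * 8);
  for( uint8_t c : bytes ) {
    for( int i = 7; i >= 0; --i ) {
      predictions.push_back(model.p()); // predict with bits seen so far
      model.update((c >> i) & 1);       // only then reveal the bit
    }
  }
  return predictions;
}
```

With this toy model the first prediction is always 2048 regardless of the input, because nothing has been seen yet. That is the same reason the real program's first prediction sits near the midpoint of the range.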
Here are the first 11 lines of output from a run on enwik8:
Code:
0, 0, 2047
1, 0, 2047
2, 4095, 2047
3, 4095, 2047
4, 4095, 2047
5, 4095, 2043
6, 0, 2043
7, 0, 2043
8, 0, 2043
9, 4095, 2047
10, 4095, 2047
And here are the last few lines:
Code:
799999975, 0, 3753
799999979, 0, 11
799999980, 0, 2208
799999981, 0, 438
799999987, 4095, 3199
799999988, 0, 1
799999989, 0, 1546
799999991, 4095, 3909
799999993, 0, 5
799999996, 0, 6
(0 - 800000000) 295.095109
Time 26567.74 sec, used 3814 MB (3999353079 bytes) of memory
Notice that some bit positions aren't printed (e.g. 799999976 through 799999978) -- this means that the prediction exactly matched the target, i.e. those bits were perfectly predicted.
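For anyone who wants to reproduce the summary figure from stored bits and predictions, here is a small sketch of the calculation. `ErrorSummary` and `summarize` are hypothetical names used only for illustration. It assumes the same conventions as the program above: predictions are on a 0 to 4095 scale, and the target is 0 for a 0 bit and 4095 for a 1 bit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical helper type: how many bits were not perfectly
// predicted, and the mean absolute error on the 0..4095 scale.
struct ErrorSummary {
  uint64_t imperfect;
  double meanAbsError;
};

ErrorSummary summarize(const std::vector<uint8_t> &bits,
                       const std::vector<uint16_t> &preds) {
  assert(bits.size() == preds.size());
  uint64_t sum = 0;
  uint64_t imperfect = 0;
  for( size_t i = 0; i < bits.size(); ++i ) {
    int target = bits[i] ? 4095 : 0;
    int err = std::abs(target - static_cast<int>(preds[i]));
    if( err != 0 ) ++imperfect; // these are the lines the program prints
    sum += static_cast<uint64_t>(err);
  }
  double mean = bits.empty() ? 0.0 : double(sum) / double(bits.size());
  return {imperfect, mean};
}
```

On this scale, the reported average of 295.095109 corresponds to roughly 7.2% of the full 0 to 4095 range (295.095109 / 4095 ≈ 0.072).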
I have more questions but this post is long enough as it is -- I'll save those for later posts in this thread.