It's interesting that bsc is about 10 seconds on 100MB, but about 1 second on 1.5 MB with the default settings.

I made some progress on idea #1 in the first post:

Block size: 2 MB (ROLZ dictionary is the processed part of the block)

Actual memory is (1 + 2) * 2 MB because the input/output block is uint8, processed block is uint16

16k ROLZ tables with 16 entries each, hashed order-3

Including IO buffer and stack, it's a total of 8.7 MB

enwik8: 35,368k in 2.9 seconds (0.28 in Huffman coding, 2.32 in running ROLZ on blocks), unpacks in 1.4 seconds (0.69 in Huffman decoding with a small index table and a full one, 0.5 in unpacking ROLZ). I/O and other delays are why total time doesn't add up to the individual parts. Note that this speed relative to zling is worse because I'm on a desktop. Searching is just checking the 16 entries of the ROLZ table for that context (hash checked with the 3 bytes for the real context plus the next two bytes for MIN_LENGTH).

Symbol packing in zlite is similar to what I was doing with bits, this is how I use it:

Code:

const int MIN_LENGTH = 2;
const int MAX_LENGTH = 47;
const int ROLZ_NUM_TABLES_BITS = 14;
const int ROLZ_TABLE_SIZE_BITS = 4;
const int ROLZ_NUM_TABLES = 1 << ROLZ_NUM_TABLES_BITS;
const int ROLZ_NUM_TABLES_MASK = ROLZ_NUM_TABLES - 1;
const int ROLZ_TABLE_SIZE = 1 << ROLZ_TABLE_SIZE_BITS;
const int ROLZ_TABLE_SIZE_MASK = ROLZ_TABLE_SIZE - 1;
const int NUM_SYMBOLS = 1024;
const int MAX_REACH = 256 + ((MAX_LENGTH - MIN_LENGTH) * ROLZ_TABLE_SIZE); // Must be less than NUM_SYMBOLS
int match_sym(int idx, int len)
{
return 256 + ((len - MIN_LENGTH) << ROLZ_TABLE_SIZE_BITS) + idx;
}
int match_unsym_idx(int sym)
{
sym -= 256;
return sym & ROLZ_TABLE_SIZE_MASK;
}
int match_unsym_len(int sym)
{
sym -= 256;
sym >>= ROLZ_TABLE_SIZE_BITS;
return sym + MIN_LENGTH;
}

I'm not sure what the overhead of Polar coding vs Huffman is, more experiments are needed. Every block has a new Huffman tree (every 2 MB), which I store as RLE-encoded frequencies. I should replace that with just the code lengths for each symbol, even though it's only a difference of 1024 uint16 vs 1024 uint8 to store for every 2 MB.