In the version uploaded:

Code:

static const int TABLE_BITS = 16;
uint32 prev_pos[0x100][1 << TABLE_BITS];
uint16 prev_lens[0x100][1 << TABLE_BITS];
uint32 head[0x100];

ROLZ offsets and lengths are both written log-exp style: a CDF gives the log base, then raw bits give the remainder (base = floor(log2(value)), remainder = value - 2^base). Lengths are encoded the same as normal LZ lengths. In addition, lengths are context-coded by log2(prev_len), so we have CDF log_rolz_offset and CDF4 log_rolz_len_hi[8] + CDF4 log_rolz_len_lo[8][16].

Insert into the ROLZ table at every word (e.g. a literal is (pos, 0), a match is (dest pos, match len)). I run the same matchfinder as LZ but with a different acceptance criterion: instead of taking the longest match, it checks, for len = 2 up to the matched length, whether the cost of (rolz depth, len diff) is cheaper than the current cost (normal LZ for the same match length is always cheaper for smaller offsets). Including the previous byte in the match test (i.e. check pos - 1, then subtract 1 from the match length afterwards) means there's only one match finder for the whole system, not one per context.

Improvements after:

- Simple mistake: I forgot to context-code ROLZ offsets too (by previous byte), so we now have CDF log_rolz_offset[0x100] and CDF log_rolz_len[8]

- Switched length encoding from flat to log-exp (same as the offset encoding described above).

This reaches 22,728,933 on enwik8; however, the cost check runs for every length of every match from the binary-tree matchfinder (not just when the match length is longer).

Thinking about:

- Greatly reducing memory. Most of the cells are empty, so a scheme like this:

Code:

// 64k
const int CELL_SIZE = 256;
const int LUT_SIZE = 256;
struct Cell {
    uint32 prev_pos[CELL_SIZE];
    uint16 prev_len[CELL_SIZE];
};
struct CellLUT {
    uint32 cells[LUT_SIZE];
};
struct RolzTable {
    CellLUT lut[0x100];
    uint32 start[0x100], head[0x100];
    Cell *cells;
};
// Some scheme to assign cells and increment starts to drop old cells when available cells run low.

Nearly all matches (for some definition of "nearly all") are within 4k entries, so a bounded-memory scheme that doesn't reserve a full 4k entries for every previous-byte context will keep ROLZ memory usage more reasonable (<10 MB).

- A more accurate matcher instead of the classic increasing-length-only one. A hash-chain match finder (instead of the binary tree) reaches 22,274,413, but runs overnight.