You could split into 512 byte chunks, delta each uint16 vs the previous uint16 (512 bytes ago), zig-zag(ish) and var-int encode, and then chuck through your favourite compression tool or entropy encoder. The logic being that each 64k block of input is likely to have some similarity to the last. Alternatively build up an average frequency and delta vs that, if we believe the frequencies are static.
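A rough Python sketch of that transform; little-endian uint16s and a byte-oriented LEB128-style var-int are my assumptions, not necessarily the exact method used:

```python
# Delta each uint16 vs the value 512 bytes (256 uint16s) earlier, zig-zag,
# then var-int encode.  Assumes an even number of input bytes.
import struct

def transform(data):
    vals = struct.unpack('<%dH' % (len(data) // 2), data)
    out = bytearray()
    for i, v in enumerate(vals):
        prev = vals[i - 256] if i >= 256 else 0  # the uint16 512 bytes ago
        d = v - prev                             # signed delta
        z = (d << 1) ^ (d >> 63)                 # zig-zag: sign into the LSB
        while z >= 0x80:                         # var-int, 7 bits per byte
            out.append((z & 0x7f) | 0x80)
            z >>= 7
        out.append(z)
    return bytes(out)

def untransform(enc):
    vals, cur, sh = [], 0, 0
    for b in enc:
        cur |= (b & 0x7f) << sh
        sh += 7
        if not b & 0x80:
            d = (cur >> 1) ^ -(cur & 1)          # undo zig-zag
            vals.append((vals[-256] if len(vals) >= 256 else 0) + d)
            cur, sh = 0, 0
    return struct.pack('<%dH' % len(vals), *vals)
```

The point being that after the transform, slowly varying 16-bit data turns into mostly small var-ints, which the downstream compressor handles well.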

Doing the former and putting it through xz gave me 211044, assuming my process was actually reversible and bug-free. It harms more complex tools like paq though.

Alternatively a data rotation by 512. So byte 0, 512, 1024, 1536, ..., followed by byte 1, 513, 1025, 1537, ... etc. Maybe in 16-bit quantities instead of 8-bit too if you wish to keep high and low bytes per value together.

Rotate by 512 | paq8px_v101 yielded 183994 bytes (179070 on 16-bit quantities) and xz gets it to 192664. That's a trivially fast transform to use.

E.g. hacky Perl one-liner again for data rotation with 8- and 16-bit quantities.
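A Python sketch equivalent to that one-liner (the original Perl isn't reproduced here):

```python
# Data rotation (transpose): view the input as rows of `stride` bytes and
# emit it column by column.  width=2 keeps the two bytes of each 16-bit
# value together.  Assumes len(data) is a multiple of stride for simplicity.
def rotate(data, stride=512, width=1):
    out = bytearray()
    for col in range(0, stride, width):            # for each column...
        for row in range(0, len(data), stride):    # ...walk down the rows
            out += data[row + col:row + col + width]
    return bytes(out)
```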

Can't really rotate anything - I'm trying to make a blockwise adaptive rc to compete with rANS.
So I'd like each table to be stored in the header of each block.
But yeah, I guess there's more similarity in freqs of individual symbols than I expected.

Blockwise, you'd need to evaluate those compression tools by chopping the data into blocks and testing each block individually, but knowing how much you can get when not chopped at least hints at a baseline to aim for.

You may be able to predict the upcoming frequencies based on what's gone before in this block of frequencies. E.g. if we're encoding the freq for symbol 200 and the freqs for symbols 0 to 199 already add up to 65535, then we know there's only 1 symbol left out there. Depending on the cardinality of the symbols, it may also be better to sort by frequency and write the symbol letters out in addition to the delta frequencies, but generally I found that didn't work too well. On some data sets it may, though.

I'm not sure how FSE works, but it's likely Yann will have put some effort into finding a good solution that doesn't cost too much time.

Edit: or perhaps there is a sweet range, expressible as 2 bytes (e.g. 32 to 122), whose symbols we encode first; then we encode the remainder (0 to 31 and 123 to 255) armed with the knowledge that the total remaining frequency is very low. If you can work out which range leaves no unstored frequency >= 256, then you know the others all fit in 1 byte, meaning no var-int encoding is necessary and freqs between 128 and 255 don't take up an additional byte (assuming a byte-based var-int method instead of a bit-based one).
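A rough bit-count for the "sum so far bounds what's left" idea; a real coder would fold the bound into the range coder rather than use whole bits, but this shows how the tail frequencies become nearly free:

```python
# Write each frequency with just enough whole bits for the remaining
# unassigned total: once most of the total is spent, later freqs are cheap.
def freq_header_bits(freqs, total=65536):
    remaining, bits = total, 0
    for f in freqs:
        bits += max(1, remaining.bit_length())  # enough bits for 0..remaining
        remaining -= f
    return bits
```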

> knowing how much you can get when not chopped at least hints at a baseline to aim for.

It already works better than a similar coder with incremented freqs.
And overall compression improves when I use smaller blocks.

> You may be able to predict the upcoming frequencies based on
> what's gone before in this block of frequencies.

Well, I actually have a proper bitwise adaptive model for freqs there,
it already produces results similar to default lzma.
Just that it doesn't remember individual freqs from the previous block atm.

> E.g. if we're encoding freq for symbol 200 and the sum of all freqs for symbols 0 to 199
> add up to 65535 then we know there's only 1 symbol left out there.

That's probably a rare case.

> Depending on the cardinality of the symbols, it may also be better to sort by frequency
> and write the symbol letters out in addition to the delta frequencies,

That's the idea I had actually - I already sort symbols anyway.
But it also doesn't take into account similarity of symbol freqs in adjacent blocks.

> I'm not sure how FSE works, but it's likely Yann will have put some effort
> into finding a good solution that doesn't cost too much time.

I think zstd just compresses the stats with a default literal coder using precomputed state tables.
Jyrki Alakuijala said that it's not worth the extra complexity when headers take something like 0.2% of the compressed data (0.5% here).
But I think it's better to compress the headers as much as possible, since it doesn't change the processing speed anyway.

While Huffman has canonical coding, optimal encoding of frequencies for an accurate entropy coder is a difficult problem.
I had a thread about it: https://encode.su/threads/1883-Reduc...y-distribution
Ideally we would like to optimize minimal description length: header cost + length * KL(real distribution || stored distribution).

So e.g. we should use larger accuracy for longer blocks and for lower-probability symbols.
Here is an old construction of mine (I haven't worked further on it):
- encode the CDF as points in the (0,1) range,
- start by splitting the range into #symbols equal subranges and encode the number of points in each with unary coding - a trick from James Dow Allen which turns out to be nearly optimal (slides 17, 18 here),
- then scan symbols, adding single bits of information until reaching the chosen precision.
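The first step of that construction can be sketched as follows (the later refinement bits, the lower part of the diagram, are omitted):

```python
# Drop the CDF points cum[1..n-1] into (0, total), cut (0, total) into n
# equal subranges, and unary-code the number of points landing in each
# subrange ('1' * count + '0').
def subrange_counts_unary(freqs, total):
    n, cum, points = len(freqs), 0, []
    for f in freqs[:-1]:
        cum += f
        points.append(cum)
    bits = ''
    for i in range(n):
        lo, hi = i * total // n, (i + 1) * total // n
        bits += '1' * sum(lo <= p < hi for p in points) + '0'
    return bits
```

For a flat distribution this costs about 2 bits per symbol, which is roughly where the near-optimality claim comes from.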

> Ideally we would like to optimize minimal description length

With AC/rANS we can use all information from precise symbol counts, so precision trade-offs don't make any sense.

> start with splitting the range into #symbols equal subranges

My implementation recursively encodes range sums until it reaches individual symbols,
but based on paq results (and James' preprocessing) there's a potential for improvement.

Storing entire counts is a simple option, but not necessarily optimal - you might need a smaller total number of bits if storing approximated frequencies instead.

The splitting into subranges is just the first step - top of the diagram, then below is adding successive bits until reaching chosen accuracy: higher for low probabilities.

> Depending on the cardinality of the symbols, it may also be better to sort by frequency
> and write the symbol letters out in addition to the delta frequencies,

> That's the idea I had actually - I already sort symbols anyway.
> But it also doesn't take into account similarity of symbol freqs in adjacent blocks.

If you're encoding block by block but not having synchronisation points for random access, then yes using the previous freqs helps.
It's possible you may want to just exploit the previous symbol order though. One idea: for text, the first block stores ETAONRISHD... (the letters sorted by frequency), along with their frequencies as deltas to the previous one in the list. The next block we don't store the symbol order, but do store the frequencies in that order, as deltas (which may now be negative), and then sort to recover the actual frequency order - e.g. maybe I is now more common and has moved up a couple of places: ETAOINRSHD. In the second block the delta for I vs R would be negative, but zig-zag encoding can cope with that. I don't know if this helps much, but I'd guess the order of symbol frequencies varies less than the actual frequencies themselves. If so then perhaps we can use this to reduce the complexity of the frequency values.
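A sketch of that scheme (helper name and dict layout are mine, purely illustrative):

```python
# Reuse the previous block's symbol order: store this block's freqs as
# zig-zag deltas against the previous block's, in the previous order, then
# re-sort to get this block's order (the decoder does the same sort).
def encode_block_freqs(freqs, prev_order, prev_freqs):
    deltas = []
    for sym in prev_order:
        d = freqs.get(sym, 0) - prev_freqs.get(sym, 0)  # may be negative
        deltas.append((d << 1) ^ (d >> 63))             # zig-zag copes
    new_order = sorted(freqs, key=lambda s: -freqs[s])
    return deltas, new_order
```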

> Storing entire counts is a simple option, but not necessarily optimal - you might need a smaller total number of bits if storing approximated frequencies instead.
>
> The splitting into subranges is just the first step - top of the diagram, then below is adding successive bits until reaching chosen accuracy: higher for low probabilities.

One thing I wrestled with was how to store frequencies for both large and small blocks. Large blocks we'll be normalising the frequencies to, say, 14 bit anyway. Small blocks we don't want to normalise them upwards as that's inventing extra precision that we have to store. We could store the actual or normalised (whatever is smaller) and always renormalise on decode, but you have to be very careful when renormalising in the decoder to ensure the maths is exact with no potential rounding differences. Instead I normalise to the maximum bit count (say 14) or the next highest bit count if the block is small. Then the decode process can either use a smaller lookup table or, for my use in Order-1 tables where everything must be the same normalised number, scale up using a simple bit shift.
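A sketch of that normalisation rule, with assumed details (clamping non-zero freqs to at least 1, dumping the rounding error on the largest frequency) that a real implementation handles more carefully:

```python
# Normalise to 2^max_bits for big blocks, or to the next power of two at or
# above the raw total for small ones, so we never scale upwards and invent
# precision that then has to be stored.
def normalise(freqs, max_bits=14):
    total = sum(freqs)
    bits = min(max_bits, max(1, (total - 1).bit_length()))
    target = 1 << bits
    out = [max(1, f * target // total) if f else 0 for f in freqs]
    out[out.index(max(out))] += target - sum(out)  # absorb rounding error
    return out, bits
```

The decoder can then use a smaller lookup table, or scale up to the full precision with a plain bit shift.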

Storing order-1 frequencies, ie a 2D array, poses its own problems though. I wanted completely random access so couldn't use previous blocks, but I can use order-0 frequencies to aid the order-1 storing. I store a list of all the order-0 symbols followed by all the order-1 tables for those symbols only, including all the zero frequency entries. For those, I also have a run length because many order-1 frequencies have large runs of 0-frequency values (for symbols that may occur in another context). Example code here: https://github.com/jkbonfield/htscod...ns4x16.js#L919
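A simplified sketch of one order-1 row (the real rans4x16.js layout differs in detail; this just shows the zero-run idea):

```python
# Store one order-1 frequency row restricted to the order-0 alphabet; runs
# of zero frequencies are encoded as a 0 followed by the run length.
def pack_order1_row(row, order0_syms):
    out, i = [], 0
    while i < len(order0_syms):
        f = row.get(order0_syms[i], 0)
        if f:
            out.append(f)
            i += 1
        else:
            run = 1
            while i + run < len(order0_syms) and not row.get(order0_syms[i + run], 0):
                run += 1
            out += [0, run]
            i += run
    return out
```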

Especially for small blocks, it is worth considering very approximate frequencies, like in the representation from the diagram above, with accuracies optimizing header cost + length * Kullback-Leibler.
Canonical Huffman can be seen as an example of such approximation: it represents only probabilities of the form 1/2^k.
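To make the objective concrete (hypothetical helper, not from any implementation): the excess coding cost of using approximated probabilities instead of the empirical distribution is length times their KL divergence, in bits; add each candidate precision's header size and pick the minimum.

```python
import math

# Excess bits paid for coding `counts` with approximated probabilities `q`
# instead of the empirical distribution: total * KL(empirical || q).
def excess_bits(counts, q):
    total = sum(counts)
    kl = sum(c / total * math.log2((c / total) / qi)
             for c, qi in zip(counts, q) if c)
    return total * kl
```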

It might also be worth considering parametric distributions, e.g. Golomb codes for a Laplace distribution, storing/predicting only their parameters ( https://encode.su/threads/3124-LOCO-...ny-alternatves ).
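For reference, the standard Golomb coder is only a few lines (parameter m >= 2; quotient in unary, remainder in truncated binary):

```python
# Golomb code: near-optimal for geometrically distributed values, hence its
# fit to Laplace-like residuals when only the parameter m is stored.
def golomb_encode(n, m):
    q, r = divmod(n, m)
    b = (m - 1).bit_length()      # ceil(log2(m))
    t = (1 << b) - m              # remainders below t take b-1 bits
    code = '1' * q + '0'          # unary quotient
    if r < t:
        code += format(r, '0%db' % (b - 1)) if b > 1 else ''
    else:
        code += format(r + t, '0%db' % b)
    return code
```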