The higher order or context dependent method, instead of e.g. "tableDecode[state]", uses "tableDecode[history/context][state]" - many smaller coding tables.
Note that, since encoding proceeds in the backward direction, the history/context here means the context that will be available during the later forward-direction decoding.
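To make the table layout concrete, here is a minimal sketch of a context-dependent (order-1) tANS-style decoding loop. Only the "tableDecode[context][state]" indexing comes from the post; the function name, the (symbol, nb_bits, new_state_base) entry format, and the toy table are illustrative assumptions:

```python
def decode_stream(table_decode, bitstream, initial_state, n_symbols,
                  initial_context=0):
    """Sketch: table_decode[context][state] -> (symbol, nb_bits, new_state_base)."""
    out = []
    state = initial_state
    context = initial_context          # must match what the encoder assumed
    pos = 0                            # bit position in the stream
    for _ in range(n_symbols):
        symbol, nb_bits, new_state_base = table_decode[context][state]
        out.append(symbol)
        bits = 0
        for _ in range(nb_bits):       # refill the state from the bitstream
            bits = (bits << 1) | bitstream[pos]
            pos += 1
        state = new_state_base + bits
        context = symbol               # order-1: last decoded symbol is the context
    return out

# Toy order-1 table for a uniform binary source, 2 states per context:
# each step emits the current state as the symbol and reads 1 fresh bit.
toy_table = [[(s, 1, 0) for s in (0, 1)] for _ in (0, 1)]
```

The point is only the extra outer index: each context selects its own small decoding table, and the decoded symbol updates the context for the next step, mirroring what the encoder did in reverse.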
The minimal header problem is indeed interesting: how to find the best description/approximation, minimizing "header size + imprecision cost".
For large data the header cost becomes negligible (unless using some high order or complex context model) - static compressors can be better than adaptive ones, as they use the whole data to construct the optimal model: the gain can be larger than the header cost.
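This tradeoff can be roughly illustrated by weighing header bits against the redundancy (Kullback-Leibler divergence) caused by quantizing the stored probabilities. A toy model - the fixed per-symbol header size and the quantization scheme are assumptions chosen for illustration:

```python
import math

def kl_bits(p, q):
    """Expected extra bits per symbol when the true distribution p
    is coded with approximation q (Kullback-Leibler divergence)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_cost(p, n, precision_bits):
    """Toy 'header size + imprecision cost' in bits: store each of the
    len(p) probabilities with precision_bits, then pay n * KL for the error."""
    scale = 1 << precision_bits
    q = [max(round(pi * scale), 1) for pi in p]   # quantize, avoid zero counts
    s = sum(q)
    q = [qi / s for qi in q]                      # renormalize
    return precision_bits * len(p) + n * kl_bits(p, q)

def best_precision(p, n, max_bits=14):
    """Precision minimizing the toy total cost for data size n."""
    return min(range(1, max_bits + 1), key=lambda b: total_cost(p, n, b))
```

For e.g. p = [0.7, 0.2, 0.1], a small n favors coarse quantization (cheap header, tolerable imprecision), while a large n pushes toward finer precision - and as n grows, the whole header becomes negligible per symbol, which is the static-vs-adaptive point above.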
ps. As I have written in a different thread, for the "reversed entropy coding" problem we can use generalizations of the Kuznetsov-Tsybakov problem to encode such that the decoder doesn't have to know the probabilities used - no header is needed.
I don't see how to do it for standard entropy coding, but maybe something like that is possible? The information about the probability distribution seems somehow redundant ...