Huffman codelength gen algorithm, in-place, based on Moffat's

For a while lately I've been intent on taking a very close look at the Huffman algorithm, with attention paid to attempting to find out just how simple and efficient one can make it. I managed to get around to writing a slim, in-place algorithm for finding the code lengths, e.g. for feeding into a canonical encoder/decoder. I've copied the most relevant parts here, the functions which compute code lengths from symbol frequencies in two steps.

The code is still a bit rough around the edges; it's closer to a first draft than a final product. You can see the complete (experimental) program and actually run it here (neat site): http://ideone.com/CeyLzv

The two functions here are phase 1 and phase 2 of generating the codes. I split this into two functions, which is perhaps not ideal from a software engineering perspective, but at this early stage, it seemed to help.

I think my implementation manages to be even simpler and (perhaps) slightly more efficient than Moffat's (without having benchmarked on actual hardware). His algorithm makes 3 passes over the array; I got this down to more like 2. It's hard to guess whether Moffat failed to see these simplifications, or if he designed his the way he did deliberately for some important reason that I am not seeing. Either way, this algorithm seems to be approaching its lower bound on simplicity, and there doesn't look to be a whole lot left to improve upon.

void inplace_huffman_p2(int *A, int n) {
// Phase 2
int level_top = n - 2; //root
int depth = 1;
int i = n, j, k;
int total_nodes_at_level = 2;
while(i > 0) {
for(k=level_top; k>0 && A[k-1]>=level_top; k--) {}

int internal_nodes_at_level = level_top - k;
int leaves_at_level = total_nodes_at_level - internal_nodes_at_level;
for(j=0; j<leaves_at_level; j++) {
A[--i] = depth;
}

The code is shorter than I imagined. Could you give a high level description of what phase 1 and phase 2 do?

If you find and read the Moffat paper, it will provide a lot of useful context. I dont have a link handy, but its called "In-place calculation of minimum-redundancy codes." Phase 1 is basically straight from the paper.

By the way, the input to phase 1 is the frequencies in sorted order.

I'm not going to describe phase 1 in detail, because the description in the paper is good and there is a good diagram. At the end of phase 1, the array is composed of parent pointers, with each element corresponding to an internal node. The levels of the tree are laid out contiguously. The only thing you don't know during phase 1 is where the boundaries are. From the root, you can work backward and find the level boundaries. The root is level 1, then you find the root's two children and that's level 2, and so on. I count the parent pointers that cross the level boundary; those are the children that are internal nodes. The children you don't find are the leaves. You know when the level ends (when you find one that doesn't cross the boundary), so you can infer which ones are missing. Each internal node has 2 children, so that tells you how many nodes to expect on the next level.

That's about it. You can overwrite the array with the leaf depths/code lengths like in the paper (which is what the code I posted does). It seems like there might be advantages to writing the number of leaves at each level. In that case, each level gets a number, even when that number is zero. That list fits in the array, since there can't be more levels than symbols. In a lot of cases (maybe always?), that should give you a shorter list than enumerating the leaves.

Last edited by nburns; 10th February 2015 at 07:30.

That makes sense. I will find and read Moffat's paper one of these days.

It's really not a very taxing read, as papers go.

To follow up the rhetorical question I posed last time: indeed, enumerating levels would always give a shorter list. This isn't hard to prove: The worst case level count is when the tree degenerates into a linked list, and in that case, #symbols - #levels = 1. Any other tree shape reduces the number of levels without changing the number of symbols.

Before I got to the result I posted in this thread, I spent a lot of time trying to think of a simple one-pass solution. The result of that effort was a pretty good appreciation for why it can't be done: the boundaries for a given level are not yet determined when the level's nodes are written. They are determined higher in the tree, and it can be an arbitrary amount higher.

BTW, the other paper Kennon linked to is much later (1997) and informed by the earlier (1985) paper nburns linked to. Anybody looking at this stuff probably ought to check out both the early (conference) paper and the later (journal) paper.

As I understand things so far (anybody who knows better should please correct me)...

A theme of both of them is that you generally don't want to build Huffman trees explicitly, and it's usually better to sort symbol frequencies first (still n log n but with a constant factor win over Huffman's actual algorithm) and then use specialized techniques that generate minimum redundancy codes from sorted frequencies. Another theme is that you generally want to map symbols in some sparse space (e.g., all possible character strings) to a compact range of integers (corresponding to observed strings), which gives you several implementation wins.

They're particularly focused on "canonical" minimum redundancy prefix-free ("Huffman") codes, where the codes of a given length are basically consecutive integers, and you get fast encoding and decoding because you can use array indexing tricks. (Once you've gotten to the part of the table that holds codes of a given length, you can just use array indexing).

Plus there are tricks to make everything fast, like storing codes in left justified form so that you can easily do symbolwise operations on the high bits that matter (rather than fiddly bitwise operations with lots of little shifts), and having a single lookup table for all bit patterns that start with frequent (short) codes. That makes decoding fast O(1) in the common case and gracefully degrading (proportional to the code length & entropy) in other cases.

And, just for historical perspective, the 1976 paper by van Leeuwen that shows how to construct a Huffman tree in linear time from a list of sorted weights is available here:

To help me understand the code, I optimized phase 1 for speed rather than size. This is what I got:

void inplace_huffman_p1(int *A, int n) {

// Phase 1
if (n > 1) {
A[0] += A[1];
int r = 0, s = 2, t = 1;
while (s != n) {
if (A[r] < A[s]) {
if ((r+1 < t) && (A[r+1] < A[s])) {
A[t] = A[r] + A[r+1];
A[r++] = t;
A[r++] = t++;
}
else {
A[t] = A[r] + A[s++];
A[r++] = t++;
}
}
else {
if ((A[r] < A[s+1]) || (s+1 == n)) {
A[t] = A[s++] + A[r];
A[r++] = t++;
}
else {
A[t++] = A[s] + A[s+1];
s += 2;
}
}
}
while (t != n-1) {
A[t] = A[r] + A[r+1];
A[r++] = t;
A[r++] = t++;
}
}
disp_array("Phase 1", A, n);
}

I did not test it and now realize it's not so simple for Tree. I want a version that reserves code space for recency ranking and the new symbol escape and also deals with codes of equal length being put in 2^12 code bins.

Last edited by Kennon Conrad; 19th February 2015 at 21:42.

I want a version that reserves code space for recency ranking and the new symbol escape and also deals with codes of equal length being put in 2^12 code bins.

I would think it would work well to treat the new symbol escape code as just another non-ergodic symbol---throw it in the LRU cache with the rest.

That way it'll adjust the escape probability depending on whether you're hitting a lot of old stuff, or a lot of new stuff.

I would think it would work well to treat the new symbol escape code as just another non-ergodic symbol---throw it in the LRU cache with the rest.

That way it'll adjust the escape probability depending on whether you're hitting a lot of old stuff, or a lot of new stuff.

That sounds like a good idea, but won't work well with the current code. The highest probability given to any non-ergodic symbol when encoding enwik9 is currently 1/128 but the new symbol escape has a global probability of about 1/12. The new symbol escape code probability is also much more closely correlated to file position than any other symbol. The probability of the first symbol being the escape symbol is 100% and the probability of the last symbol being the escape symbol is 0%, so I think the ideal probability calculation would be different than for non-ergodic symbols. For now, the new symbol escape symbols mostly uses code space reserved for symbols that are yet to appear since this approximately matches the new symbol probability. In the long run, I want to use ANS to adjust probabilites of categories of symbols (new/ergodic/non-ergodic), but haven't taken the plunge yet.

I wouldn't think the general trend from high to low escape frequencies would be an issue for MTF-coding the escape symbol. MTF tracks trends just fine, on average, and if there's a problem I'd think it would be something more fundamental, like MTF being too twitchy.

My impression is that this is a case where twitchy is what you want. You'll often have bursts of many new symbols interspersed with stretches where you have few new symbols. (As when compressing a Wikipedia article on a strange subject, with many new subject terms, followed by one on a subject you've seen many related articles about before.)

Responding to those relatively short-term transients is usually much more important (if you've got them) than how precisely you track long-term trends.

MTF tends to work pretty well for combinations of long-term trends and short-term transients. It tracks stable or slowly changing statistics pretty well, and while it does get "distracted" by transients, it then rapidly re-adapts to a subsequent relatively stable (or slowly changed) phase again.

It's dead stupid and has to re-learn the stable or trending stats after responding to a transient, but that's usually better than not responding appropriately to transients, and assuming stability or slow change.

If you have frequent major transients, assuming stability or slow change will kill you--there is no useful bound on how wrong your stable or slowly-trending stats can be right now, when it matters.

On the other hand, if you have stability or slow change, being twitchy like MTF will hurt some, but not a lot, because if things really are stable or slowly-changing, the very short-term stats tend to be strongly correlated to the long-term stats, and being forgetful doesn't hurt all that much---if the underlying stats or trends really are pretty stable, and you forget them, you can easily and quickly re-learn (most of) what you forgot, because the ongoing phenomena are still going on and re-teach you what you forgot.

Last edited by Paul W.; 20th February 2015 at 17:51.
Reason: typo, clarification

That sounds like a good idea, but won't work well with the current code. The highest probability given to any non-ergodic symbol when encoding enwik9 is currently 1/128 but the new symbol escape has a global probability of about 1/12.

OK, now that I think about it there are two problems with what I suggested. It doesn't make sense to throw the escape code in the normal MTF queue.

The obvious reason is that it doesn't fit with other symbols' probability range OR skew, so it should be special cased to get the probability closer to right.

The less obvious but more fundamental reason is that it's a very different KIND of symbol, that's going to have very different regularities, so it can't be right to treat it the same.

For example, suppose you encounter a Wikipedia article about a common subject, but it's the first time you've seen an article on that subject.

You'll get a whole slew of "new" symbols (subject terms) that are only "new" because this is just happens to be the first article you encounter on that subject. Once you've seen those symbols once, they may behave perfectly normally, recency-wise, e.g., recurring within the article with about the same probabilities. Their "newness" is just an artifact of the sorta-random lexicographic ordering of the articles, nothing funny about the symbols themselves.

The new symbol escape is metadata, not normal data, and shouldn't be thrown in with normal data.

Now this makes me wonder about sorta-analogous issues with PPMC vs. PPMD, in terms of estimating the probability of falling back from failed high-order predictions to lower-order predictions, but I need to refresh my memory about the escape probability heuristics in different versions of PPM, and learn more about how this is handled in context mixers like PAQ. It seems to me that in all these cases, the escape probability should be special-cased, because an escape is just a whole different kind of thing than normal data.

There's a general principle there that some regularities are observable in very specific ways looking at very specific things (e.g., that a high-order (finite context) prediction is usually more reliable than a low-order one) but others are only observable in terms of aggregate behavior. (E.g., noticing that higher-order predictions are failing a lot lately, due to a transient, and changing the weighting of high- vs. low-order models because the high-order models are stale and the low-order models can adapt rapidly to what's actually going on NOW.)

I suspect there's a name for this general idea about adaptation, because it's a basic idea implicit in various things (some versions of PPM, some context mixers, some caching policies, some other resource allocation strategies) but I don't know it. (If anybody does, let me know.)

The general trend of the new symbol probability doesn't prevent it from being handled via MTF, it would just be less than optimal to ignore the trend. The main problem is that for typical files, the probability distribution of where the next new symbol escape symbol occurs is a lot different than for any other symbol. As an experiment, I would like to figure out a way to calculate "Huffman" codes for just the ergodic symbols, but tell the algorithm how many of the 4096 code bins to use.

Update: Okay, I see from your latest post you figured out what I was trying to describe. The only thing I would add is that I have seen fairly strong negative correlations between the local new symbol escape symbol probability and the appearance of symbols deemed to be non-ergodic.

Last edited by Kennon Conrad; 20th February 2015 at 20:04.

The only thing I would add is that I have seen fairly strong negative correlations between the local new symbol escape symbol probability and the appearance of symbols deemed to be non-ergodic.

Just checking because there's a double negative in there: you're saying that when you have a lot of new symbols, you DON'T get a lot of non-ergodic symbols?

I'd sorta expect that when you have a lot of new symbols, those symbols would usually be non-ergodic. (If they were ergodic, they'd be relatively randomly or evenly distributed, so you'd tend to encounter them pretty soon.)

For natural language text, you tend to see the frequent ergodic symbols ("and a" "or the" etc.) everywhere, including the first things to encounter, so you learn them right away, along with whatever subject terms are used in the first few things you encounter.

After that, I'd think you're mostly learning subject terms that occur in clusters, or mode-of-speech stuff that occurs in some kinds of writing but not others, and also occurs in clusters that overlap with topic clusters.

That intuition is based on something more like computational linguists' notion of "dispersion" than your much smaller-scale measure of "ergodicity," though, and I haven't figured out the implications of that.

Just checking because there's a double negative in there: you're saying that when you have a lot of new symbols, you DON'T get a lot of non-ergodic symbols?

I'm sorry, yes I tripped myself up on that. I meant ergodic rather than non-ergodic. When there are bursts of new symbols, there tend to be fewer symbols coming from the dictionary and more symbols coming from the MTF queues.

My thought has been that it would be good to use ANS for dealing with the varying overall new/ergodic/non-ergodic symbol probabilities but still use Huffman codes for the ergodic symbols and queue positions underneath the ANS coding. I was thinking I would do the Huffman coding first and liked nburns example, but was not sure how to handle building a tree that didn't use all of the code space. Now I think I can do it by adding fake symbols in the sorted symbols list with counts set according to the number of bits the fake symbol needs to get. For example, if I want to create a code tree that uses 11/16ths of the code space, I would add a fake symbol that would be guaranteed to get 1/4 of the code space (2 bits) and another that would be guaranteed to get 1/16th of the code space (4 bits). I'm just not sure exactly what's needed for to ensure the guarantee.

I want a version that reserves code space for recency ranking and the new symbol escape and also deals with codes of equal length being put in 2^12 code bins.

I'm not sure I am following along with all the details of what you're trying to accomplish, but I think it's probably a good idea to leave such details out of Huffman itself -- that is, if Huffman is really what you want. You can always create artificial symbols with the right probabilities and map onto that alphabet as an intermediate step, then apply Huffman to that sequence.

BTW, the other paper Kennon linked to is much later (1997) and informed by the earlier (1985) paper nburns linked to. Anybody looking at this stuff probably ought to check out both the early (conference) paper and the later (journal) paper.

The newer paper refers you to the 1985 paper for details on code-length calculation, and doesn't seem to revisit any of the same issues. Of course, the newer one is worth reading, also, if you're doing a complete implementation of Huffman.

As I understand things so far (anybody who knows better should please correct me)...

A theme of both of them is that you generally don't want to build Huffman trees explicitly, and it's usually better to sort symbol frequencies first (still n log n but with a constant factor win over Huffman's actual algorithm) and then use specialized techniques that generate minimum redundancy codes from sorted frequencies. Another theme is that you generally want to map symbols in some sparse space (e.g., all possible character strings) to a compact range of integers (corresponding to observed strings), which gives you several implementation wins.

It's never better to build a pointer-based tree if you can get the same result from a flat array. In-place algorithms and implicit data structures are generally the best you can do in terms of space utilization, and, in this case, the in-place algorithm is also elegant and fast, so it's win-win-win.

The textbook Huffman algorithm uses a priority queue to sort on the fly, which is about the same as sorting with heapsort. Heapsort is not the fastest sort on actual hardware, so you might as well use the fastest sort to pre-sort the frequencies, followed by this algorithm to calculate code lengths.

They're particularly focused on "canonical" minimum redundancy prefix-free ("Huffman") codes, where the codes of a given length are basically consecutive integers, and you get fast encoding and decoding because you can use array indexing tricks. (Once you've gotten to the part of the table that holds codes of a given length, you can just use array indexing).

Plus there are tricks to make everything fast, like storing codes in left justified form so that you can easily do symbolwise operations on the high bits that matter (rather than fiddly bitwise operations with lots of little shifts), and having a single lookup table for all bit patterns that start with frequent (short) codes. That makes decoding fast O(1) in the common case and gracefully degrading (proportional to the code length & entropy) in other cases.

There is another paper out there that deals with code-length calculation for very large alphabets. I haven't read much of it, but I think it does something like RLE to keep the array small and gets some extra wins.

I'm sorry, yes I tripped myself up on that. I meant ergodic rather than non-ergodic. When there are bursts of new symbols, there tend to be fewer symbols coming from the dictionary and more symbols coming from the MTF queues.

My thought has been that it would be good to use ANS for dealing with the varying overall new/ergodic/non-ergodic symbol probabilities but still use Huffman codes for the ergodic symbols and queue positions underneath the ANS coding. I was thinking I would do the Huffman coding first and liked nburns example, but was not sure how to handle building a tree that didn't use all of the code space. Now I think I can do it by adding fake symbols in the sorted symbols list with counts set according to the number of bits the fake symbol needs to get. For example, if I want to create a code tree that uses 11/16ths of the code space, I would add a fake symbol that would be guaranteed to get 1/4 of the code space (2 bits) and another that would be guaranteed to get 1/16th of the code space (4 bits). I'm just not sure exactly what's needed for to ensure the guarantee.

I would think that if you're going with ANS for the high-level, mostly metadata regularities, there'd be a simpler and more natural solution that did not require fiddling with the low-level details of Huffman, e.g., shoving fake symbols into the frequency table to reserve code space.

I'd think that if you're gluing things together with ANS (or any kind of range coder) at the high level, you could just use the most efficient implementation of Huffman for whatever very specific thing it's good for---e.g., encoding ergodic symbols, where your "symbols" are mostly non-tiny strings with stable stats, so that weirdly optimized Huffman implementations assuming big buffers and stable stats are just fine. (The stats are stable because you only do it for ergodic symbols, and Huffman doesn't lose much relative to arithmetic or range or ANS coding, because your "symbols" are mostly large enough that Huffman is effectively doing sub-bit precision anyhow, in terms of input bytes.)

There is another paper out there that deals with code-length calculation for very large alphabets. I haven't read much of it, but I think it does something like RLE to keep the array small and gets some extra wins.

I recall that too, but I didn't really grok it and don't know whether it was in the 1997 paper or some other paper it referred to but I have lost track of, and am not even sure what phase of processing it was about. (E.g., length assignment vs. code assignment vs. actually using those codes.) I need to go back and look and understand things better, and eventually I will...

There's some issue about how, for canonical codes, you can VERY rapidly narrow your search down to the part of the table where you can use constant-time array-indexing tricks. So there's an auxiliary table that indexes the main table by code lengths or something like that, and you search that table to tell you where in the main table to look with fast constant-time operations.

Given such an index you might do a binary search in that index to find the right code length, or you could just search it linearly from the front. You might think that a binary search would clearly be faster than a linear search, but it usually isn't in practice, because what a Huffman table encodes is usually very, very biased---the most frequent symbols are at the front and the distributions are usually Zipfy--so you usually get something like a log factor speedup in a linear search through the code lengths without doing anything at all, simply because short codes are so much more common than long ones. (Or they wouldn't be so short, or compression likely wouldn't be worthwhile at all.)

A linear search like that (with something like logarithmically distributed search lengths) has fantastic spatial locality, usually finding the right index into the relevant part of the main table in the first block or two or three (or four or five or six), so you almost never miss your cache, even if your cache is very small.

All that said, I definitely didn't really understand what I was reading---just that some issues sorta like that seemed to be in play. And my mental model of it doesn't really make sense, on reflection, because some of it depends on frequency skew in actual data, but some of it appears to only make sense after stats have been accumulated and those frequencies have been factored out...

(And I don't know if the advantage of linear over binary search matters anymore because these days even first-level caches hold orders of magnitude more blocks than you need to take advantage of the tremendous spatial locality of linear searches through a Zipfish-biased index. Binary searches---especially biased binary searches, taking into account Zipfiness but still giving a logarithmic bound---may just dominate linear searches now, both algorithmically and in terms of cache locality.)

I use code 2^12 code bins and a 4096 element lookup table to determine the number of bits in the Polar-like codes used for ergodic symbols and then calculate the symbol index from the code and the first bin used for that code length. The idea (at least for me) came from Bulat and worked great.

I don't have ANS yet. If I had ANS, I could just do a normal Huffman code length calculation. I was hoping to just replace the Polar-like codes with "Huffman" codes as an intermediate step, but that requires the codes only fill up a known portion of a "tree" (but I never really create a tree). It's probably not a big win in terms of compression but my Polar-like code length calculation code has not been 100% robust, isn't elegant, and in some cases may be further from optimal than I would like.

I use code 2^12 code bins and a 4096 element lookup table to determine the number of bits in the Polar-like codes used for ergodic symbols and then calculate the symbol index from the code and the first bin used for that code length. The idea (at least for me) came from Bulat and worked great.

I don't have ANS yet. If I had ANS, I could just do a normal Huffman code length calculation. I was hoping to just replace the Polar-like codes with "Huffman" codes as an intermediate step, but that requires the codes only fill up a known portion of a "tree" (but I never really create a tree). It's probably not a big win in terms of compression but my Polar-like code length calculation code has not been 100% robust, isn't elegant, and in some cases may be further from optimal than I would like.

I think I understand what you're trying to do (maybe).

If you want to reserve a contiguous block of code-space, insert a new symbol with probability equal to the total probability of the block. The code that Huffman assigns to this symbol then acts like a prefix for the codes inside the block -- or, equivalently -- like an escape that tells the decoder to expect a code from the block.

If you don't need the reserved space to be contiguous, you could break the block into sub-blocks, creating a symbol for each one as above. Smaller blocks gives Huffman more to work with and might result in a better code.

Last edited by nburns; 25th February 2015 at 05:05.