# Thread: 2 Huffman methods

1. ## 2 Huffman methods

I just came up with the following two Huffman methods. I think the first method is like adaptive Huffman, but I'm not sure. Perhaps these methods already exist; I just wanted to share them in case they're useful.

Method 1

From a decompression point of view: store the characters and their frequencies in the compressed file, then build a tree from those statistics. Read the first bit(s) of the data and output the matching character. Then decrease that character's frequency and build a new tree again. Repeat. This way the codes get shorter and shorter the farther you get into the data.
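A minimal Python sketch of Method 1 (my own illustration, not code from the thread): the compressor stores the full frequency table up front, then both sides rebuild the Huffman code before every symbol and decrement the coded symbol's count afterwards. The tie-breaking in `huffman_codes` is arbitrary but deterministic, which is what keeps encoder and decoder in sync.

```python
import heapq
import itertools
from collections import Counter

def huffman_codes(freqs):
    """Deterministic Huffman codes for a {symbol: count} table."""
    if len(freqs) == 1:
        return {s: "0" for s in freqs}
    counter = itertools.count()
    heap = [(f, next(counter), (s,)) for s, f in sorted(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (a, b)))
    codes = {}
    def walk(node, prefix):
        if len(node) == 1:            # leaf: a 1-tuple holding the symbol
            codes[node[0]] = prefix
        else:                         # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

def encode(data: bytes):
    """Return (frequency header, bit string)."""
    header = dict(Counter(data))
    work, bits = dict(header), []
    for b in data:
        bits.append(huffman_codes(work)[b])
        work[b] -= 1                  # shrink the model as we go
        if work[b] == 0:
            del work[b]
    return header, "".join(bits)

def decode(header, bits):
    work, out, pos = dict(header), bytearray(), 0
    while work:
        inv = {c: s for s, c in huffman_codes(work).items()}
        code = ""
        while code not in inv:        # safe: the codes are prefix-free
            code += bits[pos]
            pos += 1
        b = inv[code]
        out.append(b)
        work[b] -= 1
        if work[b] == 0:
            del work[b]
    return bytes(out)
```

`decode(*encode(b'ABRACADABRA'))` reproduces the input, and as the table empties toward the end the remaining codes shorten, matching the observation above.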

Method 2

This method doesn't require storing the characters and their frequencies in the file at all; you only need to store the size of the decompressed file, or perhaps use some 'End' node or something (although I think the file would end up bigger with another extra node). Maybe the codes I show below aren't optimised, but that doesn't really matter right now; this is just to give you the idea, and I'm sure this whole thing could be tweaked :) I'll use the string 'ABRACADABRA' as an example. Also, the node 'New' always has a higher importance than other frequency-1 characters (unless 128+ characters have a frequency of 1, in which case it no longer has a higher importance; and when 255 characters exist in the table you can replace 'New' with the last possible character).

Since there are no characters in the table at the start, grab the first character 'A' and make the table:

| Char | Freq | Code |
|------|------|------|
| New  | 1    | 0    |
| A    | 1    | 1    |

Since 'A' is the very first character just output it as 8 bits (01000001).

The next character is 'B' which isn't in the table yet, so output the node 'New' with code 0 and then output 8 bits for the character 'B' (01000010) and update the table:

| Char | Freq | Code |
|------|------|------|
| New  | 1    | 0    |
| A    | 1    | 10   |
| B    | 1    | 11   |

The next character is 'R' which isn't known yet either so output 'New' with code 0 and output 8 bits for 'R' (01010010) and update the table:

| Char | Freq | Code |
|------|------|------|
| New  | 1    | 00   |
| A    | 1    | 01   |
| B    | 1    | 10   |
| R    | 1    | 11   |

Next character is 'A' which is already present in the table and therefore output 01, then update the table:

| Char | Freq | Code |
|------|------|------|
| A    | 2    | 0    |
| New  | 1    | 10   |
| B    | 1    | 110  |
| R    | 1    | 111  |

Next is 'C' which isn't known so output 'New' as 10 and output 8 bits for 'C' (01000011) and update the table:

| Char | Freq | Code |
|------|------|------|
| A    | 2    | 0    |
| New  | 1    | 100  |
| B    | 1    | 101  |
| C    | 1    | 110  |
| R    | 1    | 111  |

Next is 'A' which is known, so output 0; the table stays the same except that the frequency of 'A' is increased.

Next is 'D' which isn't known, so output 100 and output 8 bits for 'D' (01000100) and update the table:

| Char | Freq | Code |
|------|------|------|
| A    | 3    | 0    |
| New  | 1    | 10   |
| B    | 1    | 1100 |
| C    | 1    | 1101 |
| D    | 1    | 1110 |
| R    | 1    | 1111 |

Next one is 'A' again so output 0, table stays the same except for A's frequency.

Next is 'B' so output 1100 and update the table:

| Char | Freq | Code |
|------|------|------|
| A    | 4    | 0    |
| B    | 2    | 10   |
| New  | 1    | 1100 |
| C    | 1    | 1101 |
| D    | 1    | 1110 |
| R    | 1    | 1111 |

Next is 'R' so output 1111 and update the table:

| Char | Freq | Code |
|------|------|------|
| A    | 4    | 0    |
| B    | 2    | 100  |
| R    | 2    | 101  |
| New  | 1    | 110  |
| C    | 1    | 1110 |
| D    | 1    | 1111 |

Next is the final character 'A' so output 0.

The final output will be (Minus the filesize/End Of File data):

01000001 0 01000010 0 01010010 01 10 01000011 0 100 01000100 0 1100 1111 0 = 60 bits (Original string 'ABRACADABRA' is 8 x 11 = 88 bits)
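A runnable Python sketch of Method 2 (my own illustration): an escape symbol `NEW` plays the role of the 'New' node, the very first character goes out as raw 8 bits, and the Huffman table is rebuilt after every symbol. Note my tie-breaking is just "deterministic", not the "New has higher importance" rule above, so the exact codes (and bit count) can differ slightly from the worked example.

```python
import heapq
import itertools

NEW = "NEW"   # escape symbol; stands in for the 'New' node above

def huffman_codes(freqs):
    """Deterministic Huffman codes for a {symbol: count} table."""
    if len(freqs) == 1:
        return {s: "0" for s in freqs}
    counter = itertools.count()
    heap = [(f, next(counter), (s,))
            for s, f in sorted(freqs.items(), key=lambda kv: str(kv[0]))]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (a, b)))
    codes = {}
    def walk(node, prefix):
        if len(node) == 1:               # leaf: 1-tuple holding the symbol
            codes[node[0]] = prefix
        else:                            # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

def encode(data: bytes) -> str:
    bits, freqs = [], {}
    for i, b in enumerate(data):
        if i == 0:                       # very first character: raw 8 bits
            bits.append(f"{b:08b}")
            freqs = {NEW: 1, b: 1}
            continue
        codes = huffman_codes(freqs)
        if b in freqs:
            bits.append(codes[b])
            freqs[b] += 1
        else:                            # escape code, then raw 8 bits
            bits.append(codes[NEW])
            bits.append(f"{b:08b}")
            freqs[b] = 1
    return "".join(bits)

def decode(bits: str, n: int) -> bytes:
    """n is the decompressed size, stored separately as the post suggests."""
    out, freqs, pos = bytearray(), {}, 0
    for i in range(n):
        if i == 0:
            b = int(bits[:8], 2)
            pos = 8
            freqs = {NEW: 1, b: 1}
        else:
            inv = {c: s for s, c in huffman_codes(freqs).items()}
            code = ""
            while code not in inv:       # safe: the codes are prefix-free
                code += bits[pos]
                pos += 1
            sym = inv[code]
            if sym == NEW:
                b = int(bits[pos:pos + 8], 2)
                pos += 8
                freqs[b] = 1
            else:
                b = sym
                freqs[b] += 1
        out.append(b)
    return bytes(out)
```

For 'ABRACADABRA' this lands within a couple of bits of the 60-bit figure above, and `decode(encode(s), len(s))` round-trips.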

2. Do you know Vitter's algorithm? http://en.wikipedia.org/wiki/Adaptive_Huffman_coding

Also, IMO most Huffman encoders are either quasi-adaptive (i.e. they recompute the Huffman codes every N processed symbols) or divide the data into small blocks and use canonical Huffman codes to minimize the encoded tree size.

3. These are both possible, but normally one uses arithmetic coding because it is faster than rebuilding the Huffman tree after each character, besides compressing better. The only reason anyone uses Huffman coding in new algorithms now is that it is faster when using a fixed code. (At one time people also used Huffman coding because arithmetic coding was patented, but those patents have expired.) And actually, the speed difference is not large, because modern processors have fast multiply instructions.

In Method 1, transmitting just the frequencies (or code lengths, actually) is the normal way to transmit a Huffman code. JPEG and deflate do this, for example. But they don't update the frequencies; it actually doesn't cost much not to.

Method 2 is order-0 PPM with a fixed escape probability. Modern PPM uses higher orders and a context model to estimate the escape probability for dropping to the next lower order.

4. No, I didn't know about Vitter's algorithm, Piotr. I'll check it out, thanks.

Yeah Matt, I had a feeling arithmetic coding would actually be better instead. I guess Huffman is starting to get a little obsolete by now...

5. Check out Vh1; it's a bijective Vitter Huffman.

6. Originally Posted by TheVoid
01000001 0 01000010 0 01010010 01 10 01000011 0 100 01000100 0 1100 1111 0 = 60 bits (Original string 'ABRACADABRA' is 8 x 11 = 88 bits)
Nice example. Only two things need to be done to make it bijective.

The first is to not always write out 8 bits for a new symbol. Yes, the first should be 8 bits, but the next few will vary between 8 and 7 bits, until there are 128 values left, at which point the next is 7 bits, and so on. Most people never do this; why, I don't know. However, for this short example it's not unreasonable to let them all be 8 bits.
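As a rough sketch of that shrinking-width idea (my simplification: a fixed ceil(log2 n)-bit index over the n not-yet-seen byte values, rather than a true phase-in code mixing 7- and 8-bit codes):

```python
import math

def encode_new(unseen: list, sym: int) -> str:
    """Code a first occurrence as its index among the remaining unseen
    byte values; the width shrinks as fewer values remain."""
    n = len(unseen)
    width = math.ceil(math.log2(n)) if n > 1 else 0
    idx = unseen.index(sym)
    unseen.remove(sym)
    return format(idx, f"0{width}b") if width else ""

def decode_new(unseen: list, bits: str, pos: int):
    """Mirror of encode_new: read the index, return (symbol, new position)."""
    n = len(unseen)
    width = math.ceil(math.log2(n)) if n > 1 else 0
    idx = int(bits[pos:pos + width], 2) if width else 0
    return unseen.pop(idx), pos + width
```

With all 256 values unseen the index takes 8 bits; once 128 remain it takes 7; and the very last new symbol costs 0 bits, which is the "no more New symbols" point.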

The second is the last byte: looking at your example, you have 60 bits, and the remainder of 60/8 is 4, so four bits are left over, that is [1110xxxx]. The fast way of doing a bijective ending is where I look at [0...], [100...] and the rest; since in this case the last field is [0], a one-bit field whose value is zero, you add 1000... to the ending and get [11101000] as the last byte, which makes 8 bytes for the compressed set of data.

So that is 64 bits (8 bytes), and that includes the EOF.

There is also a third approach I don't see people using. Look at my bijective bwts-dc-fibonocci: I create a table based on the start positions of each new symbol. The same could be done for adaptive Huffman or adaptive arithmetic coding. The nice thing about this is that for long files, where the symbols occur near the front, you don't need the NEW TOKEN symbol at all messing up what you do. And when all symbols are known there are no more new symbols, so the ZERO FREQUENCY problem goes away in a nice bijective manner.

Also, I must state there is an error in my current bwts-dc-fib: it still encodes and decodes any file correctly, but if you do the decoding first, in some rare cases two files decode to the same file. I will try to post a new exec soon.

7. Originally Posted by biject.bwts
The first is to not always write out 8 bits for a new symbol. Yes, the first should be 8 bits, but the next few will vary between 8 and 7 bits, until there are 128 values left, at which point the next is 7 bits, and so on. Most people never do this; why, I don't know. However, for this short example it's not unreasonable to let them all be 8 bits.
Good point, I hadn't realized that yet. I guess arithmetic coding would be better for this.

The second is the last byte: looking at your example, you have 60 bits, and the remainder of 60/8 is 4, so four bits are left over, that is [1110xxxx]. The fast way of doing a bijective ending is where I look at [0...], [100...] and the rest; since in this case the last field is [0], a one-bit field whose value is zero, you add 1000... to the ending and get [11101000] as the last byte, which makes 8 bytes for the compressed set of data.

So that is 64 bits (8 bytes), and that includes the EOF.
I don't really understand why you add 1000. For the decompressor, in my example, that would mean outputting 'B' and then 'A' and then stopping since there's no more data. I think in my example it would be better to use 1100, which selects the 'New' symbol (110), but then the last bit 0 can't hold a new character, which means you've reached the end. Although this would be problematic if there were no 'New' symbol left in the table once all characters are used...

EDIT: How about putting at the beginning of the compressed data the number of bits at the end that aren't used? Since the data always ends on a byte boundary, you could use the values 0...8 to identify the EOF. That would only take 4 bits (or a little more than 3 bits with arithmetic coding).
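A tiny sketch of that EDIT idea (my illustration): prefix a 4-bit count of the padding bits, then zero-pad the stream to a byte boundary. (With the count stored inside the padded stream, it only ever takes the values 0..7, so 3 bits would actually suffice; 4 bits matches the post's 0...8 framing.)

```python
def pad_to_byte(bits: str) -> str:
    """Prefix a 4-bit pad count, then zero-pad to a whole number of bytes."""
    pad = (8 - (4 + len(bits)) % 8) % 8
    return format(pad, "04b") + bits + "0" * pad

def unpad(framed: str) -> str:
    """Strip the 4-bit header and the padding bits it describes."""
    pad = int(framed[:4], 2)
    return framed[4:len(framed) - pad]
```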

There is also a third approach I don't see people using. Look at my bijective bwts-dc-fibonocci: I create a table based on the start positions of each new symbol. The same could be done for adaptive Huffman or adaptive arithmetic coding. The nice thing about this is that for long files, where the symbols occur near the front, you don't need the NEW TOKEN symbol at all messing up what you do. And when all symbols are known there are no more new symbols, so the ZERO FREQUENCY problem goes away in a nice bijective manner.
Yeah, that seems like a better method than using an extra symbol.
