Thread: Advanced Huffman Encoding

  1. #1
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts

    Advanced Huffman Encoding

    I am developing another LZ77 + Huffman encoder. I am very happy that I have a working compressor now. The results aren't that exciting (yet) because it lacks some major features, but it is already more than a minimal implementation.
    Enough of the introduction. At the moment I have many different blocks of ~4096 bytes which I compress with Huffman. I use one tree for all following blocks. Because of this I create a maximum tree (a tree containing every symbol that could occur in any of the blocks).
    Surely this isn't very good for compression, but it is much better than creating a tree for every block. As a compromise I create a new tree if the output data gets really big. Another advantage is faster encoding and, more importantly, faster decoding.

    But my point is not to show that this is ground-breaking or even good. I don't think so. I want to improve it by changing the tree: adding symbols, deleting symbols and moving symbols. Of course every one of those transformations has to be worthwhile and cheap to store.

    That's almost dynamic Huffman. My question is whether there are any code snippets doing something like this, or any papers?
    Any idea is welcome. As this is my first compressor I am no expert in compression, so be patient with me.
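
    To make the "maximum tree" concrete, here is a minimal C++ sketch (an illustration only, not the poster's actual code): every one of the 256 byte values gets a count floor of 1, so the resulting tree can encode any symbol of any block, and the measured block statistics are added on top.
    Code:
    // Hypothetical sketch of a "maximum tree": one Huffman tree covering
    // every symbol that could occur, built from block statistics plus a
    // floor of 1 so absent symbols stay representable.
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Node {
        uint64_t freq;
        int      left, right;                       // -1 for leaves
    };

    // Returns the code length (tree depth) of each of the 256 symbols.
    std::vector<int> buildMaxTreeLengths(const std::vector<uint64_t>& counts) {
        std::vector<Node> pool;
        using Item = std::pair<uint64_t, int>;      // (freq, node index)
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;

        for (int s = 0; s < 256; ++s) {
            pool.push_back({counts[s] + 1, -1, -1}); // +1 = maximum-tree floor
            pq.push({pool.back().freq, s});
        }
        while (pq.size() > 1) {                      // classic Huffman merge
            auto [fa, a] = pq.top(); pq.pop();
            auto [fb, b] = pq.top(); pq.pop();
            pool.push_back({fa + fb, a, b});
            pq.push({fa + fb, (int)pool.size() - 1});
        }
        std::vector<int> lengths(256, 0);
        struct Frame { int node, depth; };
        std::vector<Frame> stack{{(int)pool.size() - 1, 0}};
        while (!stack.empty()) {                     // recover leaf depths
            auto [n, d] = stack.back(); stack.pop_back();
            if (pool[n].left < 0) { lengths[n] = d; continue; }
            stack.push_back({pool[n].left,  d + 1});
            stack.push_back({pool[n].right, d + 1});
        }
        return lengths;
    }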

  2. #2
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Simon Berger
    You can find Huffman with periodic adaptation in FastHF and Bulat's Tornado.
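
    (Not from the FastHF or Tornado sources - just a hypothetical sketch of what periodic adaptation means in principle: encoder and decoder keep identical symbol counts, rebuild the code table every N symbols, and age the counts, so no tree ever has to be transmitted. REBUILD_PERIOD and rebuildTable are made-up names.)
    Code:
    // Hypothetical sketch of periodic adaptation (not actual FastHF or
    // Tornado code): every REBUILD_PERIOD symbols the table is rebuilt
    // from the counts and the counts are aged. The decoder mirrors this
    // exactly, so no tree is stored in the compressed stream.
    #include <cstdint>
    #include <vector>

    const int REBUILD_PERIOD = 4096;     // assumed tuning constant

    struct PeriodicHuffman {
        std::vector<uint32_t> counts = std::vector<uint32_t>(256, 1);
        int sinceRebuild = 0;

        void rebuildTable(const std::vector<uint32_t>&) {
            // build canonical codes from the counts here
        }
        void update(int symbol) {
            ++counts[symbol];
            if (++sinceRebuild == REBUILD_PERIOD) {
                rebuildTable(counts);
                for (auto& c : counts) c = (c + 1) / 2;  // age statistics
                sinceRebuild = 0;
            }
        }
    };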

  3. #3
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Periodic adaptation sounds nice, thanks.
    I wanted to have a look at Bulat's Tornado anyhow.

  4. #4
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    OK, I looked at the sources. But from what I understood, both do something different from what I want to do. They generate a better tree from the already processed data, so they don't have to store any tree data, because encoder and decoder can reproduce the same behaviour. That's the only way if you encode symbol by symbol, which is the normal approach in LZ77 encoding because of literal and length encoding...
    But I encode a whole block, so I can generate a "perfect" tree for the first block.

    For example, for a tree over literals: if the block only contains the symbols 0, 1, 2, the symbols 3-255 still have to be inside the tree. My intention is to generate a really perfect tree with only 0, 1, 2. When the next block has to be encoded, I generate a new perfect tree and store only the difference from the previous one.
    That was a very simplified view; in practice I wouldn't literally generate a second tree from scratch and diff it. Also, the following trees won't be perfect, because the resulting diff data would be too big, but they could be almost perfect.
    Last edited by Simon Berger; 7th November 2008 at 15:41.

  5. #5
    Programmer toffer
    Join Date
    May 2008
    Location
    Erfurt, Germany
    Posts
    587
    Thanks
    0
    Thanked 0 Times in 0 Posts
    You might want to construct the tree based on code lengths (canonical Huffman). Then you can easily store the code length differences between the next block's tree and the previous block's tree. You can apply some fixed coding like RLE with variable lengths for the code length diffs (they are likely to be small) if you want to go for speed.

    If I'm not wrong, (g)zip does something like that? Just google for the DEFLATE RFC.
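
    A sketch of the canonical code assignment toffer is referring to, following the scheme from RFC 1951 (DEFLATE): given only the per-symbol code lengths, encoder and decoder reconstruct identical codes, so a block header needs to carry nothing but lengths (or length diffs).
    Code:
    // Canonical Huffman code assignment (the RFC 1951 scheme): codes are
    // fully determined by the code lengths, in symbol order per length.
    #include <cstdint>
    #include <vector>

    std::vector<uint32_t> assignCanonicalCodes(const std::vector<int>& lengths) {
        int maxLen = 0;
        for (int l : lengths) if (l > maxLen) maxLen = l;

        std::vector<uint32_t> blCount(maxLen + 1, 0); // codes per length
        for (int l : lengths) if (l > 0) ++blCount[l];

        std::vector<uint32_t> nextCode(maxLen + 1, 0);
        uint32_t code = 0;
        for (int bits = 1; bits <= maxLen; ++bits) {  // first code per length
            code = (code + blCount[bits - 1]) << 1;
            nextCode[bits] = code;
        }
        std::vector<uint32_t> codes(lengths.size(), 0);
        for (size_t s = 0; s < lengths.size(); ++s)   // consecutive codes
            if (lengths[s] > 0) codes[s] = nextCode[lengths[s]]++;
        return codes;
    }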

  6. #6
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Writing symbol difference counts for blocks 2..n is indeed an idea I didn't have. I could generate a really perfect tree with this, and the overhead isn't too big for a small block of 4096 symbols (maybe I have to resize it anyway...).
    But the problem is more the symbols which don't occur in one block but do in the other. Transforming the tree to be really good would only be my second step. The first step would be to have a tree of only the needed symbols, because this has a more important effect than having symbol x at 3 bits instead of 5.

    I will think more deeply about your idea and consider doing a test implementation. Thanks!

    I made up a quick variable length code for the block, as far as I could think it through:

    For a symbol which is already in the tree
    Code:
    3 bits: difference      // LOOP
    1 bit:  more?           // while != 0
    1 bit:  + or -          // symbol deleted if difference = 0
    For a symbol which isn't in the tree (no need for + or -, and an additional "exists" bit in front)
    Code:
    1 bit:  more/exists?
    3 bits: difference      // LOOP
    1 bit:  more?           // while != 0
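
    A sketch of how the first variant could look in code (my illustration, not the poster's encoder; the BitWriter is a made-up minimal bit sink). The new-symbol variant would just prepend the "exists" bit and drop the sign.
    Code:
    // Sketch of the delta code described above: 3-bit chunks of |diff|,
    // each followed by a "more?" bit, then one sign bit; diff == 0
    // (chunk 000, more = 0, no sign) means the symbol was deleted.
    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    struct BitWriter {                   // made-up minimal bit sink
        std::vector<uint8_t> bytes;
        int bitPos = 0;
        void putBits(uint32_t value, int nbits) {   // LSB-first packing
            for (int i = 0; i < nbits; ++i, ++bitPos) {
                if (bitPos % 8 == 0) bytes.push_back(0);
                bytes.back() |= ((value >> i) & 1) << (bitPos % 8);
            }
        }
    };

    void writeLengthDelta(BitWriter& bw, int diff) {
        uint32_t mag = (uint32_t)std::abs(diff);
        do {
            bw.putBits(mag & 7, 3);              // 3-bit chunk of magnitude
            mag >>= 3;
            bw.putBits(mag != 0 ? 1 : 0, 1);     // "more?" continuation bit
        } while (mag != 0);
        if (diff != 0)
            bw.putBits(diff > 0 ? 1 : 0, 1);     // sign, omitted on delete
    }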
    Last edited by Simon Berger; 7th November 2008 at 16:48.

  7. #7
    Member biject.bwts
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    I have always liked Huffman, but I fail to see a good reason to store a tree. You can assume one at the start and then adaptively modify it using a modified Vitter. Scanning the block and building a new tree per block is usually wasteful if you store the tree in the output. An adaptive type of storage is best. Whatever your compressor does at block boundaries, the decompressor can do the same. It's better to keep the decisions in the code, not as wasted control bits that create the static tree at the front of a block.

  8. #8
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    I see some points:
    - am I wrong that an adaptive method always has to use a full tree, as I mentioned for my case too?
    - rebuilding a tree at a given interval needs more time. If I do it as often as with the method toffer described, i.e. before every new block, this won't be a good argument for an adaptive variant.
    - generating a tree in front of a block means I can be sure to have a more or less good one. Generating it from the history is dangerous if I use a long period, especially on changing data.

    I am not an expert, as I mentioned before. I am only describing my point of view and would be very happy to change it given the right arguments from someone else.

  9. #9
    Member biject.bwts
    Join Date
    Jun 2008
    Location
    texas
    Posts
    449
    Thanks
    23
    Thanked 14 Times in 10 Posts
    Let me put it this way: if I have an I.I.D. data source but don't know the statistics, and get a single block of data, then normal static Huffman is the best type of Huffman, except for the cost of storing the tree with the data. If you know the tree in advance, you don't need to store it.
    If you don't know the tree or the stats, it's generally better to do it adaptively, since it's only one pass and will generally take less space than storing the static tree plus the data. I know this is not well known, but it's true.

    Of course, if you want better compression, it is generally better to use an adaptive arithmetic coder than static arithmetic where a table is needed.

  10. #10
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Simon Berger
    For large blocks such an approach looks good to me too. The size of the stored tree is small, rebuilding time negligible, and decompression speed just great.

  11. #11
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts

    Code lengths

    Considering code lengths for blockwise static Huffman coding,
    I think that sequential coding of lengths or length differences
    is not the best idea (despite it being used everywhere).
    Imho tree-based coding is a better approach here - using a simple
    binary tree, not a Huffman tree.
    This method is easier to explain with frequency tables: knowing the
    total number of symbols (\x00-\xFF) at the tree root = N
    (it's the block size, so it's known),
    we encode the number of symbols \x00-\x7F = N0 (using 0 <= N0 <= N),
    and N1 (the number of symbols \x80-\xFF) is equal to N-N0.
    If a symbol count is 0, or only one symbol is left in an interval,
    then the symbol counts for that interval are known; otherwise we
    recursively process the sub-intervals with N0 and N1.
    Now, Huffman code lengths can be encoded the same way, as
    they correspond to power-of-2 symbol counts.
    Of course, similarities of statistics between blocks can be taken
    into account, using differences or other means.
    However, it might be reasonable to encode such trees with
    arithmetic coding, even if the actual data is
    encoded with Huffman codes for speed or whatever reason.
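
    A sketch of the recursive interval scheme just described (illustration only; encodeValue stands in for a real entropy coder and here just logs its decisions):
    Code:
    // Recursive interval coding of a frequency table: knowing the total
    // n for [lo, hi), encode how many symbols fall into the lower half,
    // then recurse. Intervals with n == 0 or a single symbol are free.
    #include <cstdio>
    #include <vector>

    void encodeValue(unsigned v, unsigned max) {  // stand-in entropy coder
        std::printf("value %u in [0..%u]\n", v, max);
    }

    void encodeCounts(const std::vector<unsigned>& count,
                      int lo, int hi, unsigned n) {
        if (n == 0 || hi - lo == 1) return;       // fully determined
        int mid = (lo + hi) / 2;
        unsigned n0 = 0;
        for (int s = lo; s < mid; ++s) n0 += count[s];
        encodeValue(n0, n);                       // 0 <= n0 <= n
        encodeCounts(count, lo, mid, n0);
        encodeCounts(count, mid, hi, n - n0);
    }

    // Top level: N is the block size, which the decoder already knows:
    // encodeCounts(count, 0, 256, N);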

    Also I'd like to throw in some links about so-called
    "nonprefix coding".
    Here's the original article (it's too chatty though):
    http://compression.ru/sh/shc1.htm (in Russian)
    And here's the demo:
    http://compression.ru/sh/shc1exam.rar
    (alas, it's very old and was written for DOS, so it
    might not work under Vista etc).
    In short, it's about dropping the prefix code property
    to create an asymmetric bitwise coder with better
    compression than Huffman-based ones.

    Anyway, I'd like to see a discussion of some less trivial things
    than new implementations of the same 100-year-old algorithms.
    Last edited by Shelwien; 8th November 2008 at 02:20.

  12. #12
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    I have a question in this context. I'm a bit unsure here and didn't really find anything about it anywhere:
    how long could a code length be in the worst case? I read something about 44% bigger than the shortest code, but that didn't make sense to me.
    So how long could it be for x elements? x = 256 in most cases.

    - - - - - - - - - - - - - -

    I tested storing the differences between symbol counts, and compression was a bit better, but not by so much that it would be worth it in my opinion. I will try more in this area.
    To show where I want to use it (if anyone is interested): at the moment I have 5 different blocks that have 5 different trees.

    1. The characters which couldn't be compressed (256 expected elements)
    2. A bit buffer with match-or-character flags (256 expected elements)
    3. An offset buffer with changed bit length (Huffman doesn't make much sense here... but it will be changed later; also 256 expected elements)
    4. A special offset buffer with 128 expected elements
    5. A length buffer with currently 128 expected elements; that is going to change at the moment too.

    Reading your posts, some sources and my own mind :-D I think it's really better to use adaptive Huffman for 2-5, but at least for 2 and 3. For the character buffer I'm sure there is a better method.
    Last edited by Simon Berger; 15th November 2008 at 03:59.

  13. #13
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Simon Berger
    You can have a tree like this:
    Code:
        /\
       /\
      /\
     /\
    /\
    So the maximum code length for n symbols is n-1.

  14. #14
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Mhm, yes, you're right. But that would be really BIG. In such a case almost all Huffman implementations wouldn't work, because they operate on ints, which only support code lengths up to 32 bits (mine too).
    Last edited by Simon Berger; 15th November 2008 at 15:09.

  15. #15
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Simon Berger
    Well, the minimal number of elements needed to reach height n is (2^n)-1.
    So if you never go past your int capacity, the same int is enough to store the code.

  16. #16
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Yes, that's it.
    How could I calculate the maximum code length for n symbols? I have no formula for something like this.

  17. #17
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Simon Berger
    After thinking some more, the previous calculation was wrong...

    Just think about what weights you need to build such a tree.
    A version that doesn't optimize the tree structure might build something like this:
    Bottom:
    (real)1 + (real)1
    One level up:
    (artificial)2 + (real)1
    Another level up:
    (artificial)3 + (real)2
    5+3
    8+5
    ...
    That's the Fibonacci series, about 1.62^n (the golden ratio), so it looks like you can actually exceed your int, depending on its length.

    If, while building the tree, whenever there are 2+ nodes with the second-lowest weight, you choose the one with the lowest subtree height, it's different:
    1+1
    2+1
    3+3
    6+4
    10+7
    ...
    n+m
    (n+m)+(n+1)
    (root=(n+m)+(n+1))
    Still significantly less than 2^n.
    Is it a problem for 32 bits? You'd have to calculate :P

    Correct me if I'm wrong again somewhere.

  18. #18
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Don't know why I only found it now, but here is a VERY good article about this:
    http://www.arturocampos.com/cp_ch3-4.html

    You were on the right track, m^2.

    From the previous we can know how many bytes must occur in our example before the code length is too high. In our case we need to have at least 34 symbols with a Fibonacci distribution. The first one would have a probability of 1, the second of 1, then 2, 3, 5 and so on. If we sum them together, the result is 14,930,351. This is the minimum number of symbols that must appear to make a Fibonacci sequence which can degenerate our binary tree. So we know that the case is rather rare, and that if you don't plan to code more than 14 megs of data, you don't need to care about this problem. Note that if you restrict your code length to 16 bits, the minimum is reduced to 6764 bytes. And if for any reason you used 8 bit codes, it would be 142 bytes!
    Could be interesting for someone here building fixed tables for Huffman lookups. I still use a tree search because of the high memory usage of tables, but I want to add such an option.
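
    A small sketch reproducing the article's thresholds, under the assumption that the worst case for Huffman depth is the Fibonacci-like weight distribution (so the minimum input size for a code of depth d is the sum of the first d+1 Fibonacci numbers, i.e. F(d+3)-1):
    Code:
    // Smallest total frequency that can force a Huffman code of depth d,
    // assuming the Fibonacci worst case: sum of the first d+1 Fibonacci
    // numbers, i.e. F(d+3) - 1.
    #include <cstdint>
    #include <cstdio>

    uint64_t minInputForDepth(int depth) {
        uint64_t a = 1, b = 1, sum = 0;
        for (int i = 0; i < depth + 1; ++i) {
            sum += a;
            uint64_t next = a + b;
            a = b;
            b = next;
        }
        return sum;
    }

    int main() {
        // 6764: matches the article's bound for exceeding 16-bit codes.
        std::printf("%llu\n", (unsigned long long)minInputForDepth(17));
        // 14930351: the article's ~14 MB bound for exceeding 32-bit codes.
        std::printf("%llu\n", (unsigned long long)minInputForDepth(33));
    }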

  19. #19
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    Actually, the article I mentioned (shc1.htm) contains a section
    about max code length calculation (it's the last one), and although
    the text is in Russian, the formulas in it are not

  20. #20
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    735
    Thanked 660 Times in 354 Posts
    Have you studied the zip and Tornado sources? I think both programs build Huffman lookup trees, and zip even does block-static Huffman encoding (i.e. with codes stored explicitly in the compressed data)

  21. #21
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    Yes, I studied your Tornado code; that was the reason I mentioned it here. I have only looked at simple zip implementations so far.
    I thought it could be useful for you because you use a reallocation, but maybe you could set it to 14-16 bits fixed (I'd have to calculate how many bits 2048 symbols need). Maybe you had all this information yourself and have arguments for doing it this way.

  22. #22
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    I believe that if, when you have a leaf and an inner node of equal weight, you choose the leaf, it's enough to push you into the second case.
    I calculated the outcome, and this improves the maximum to 20,633,237 bytes.
    For 64-bit values that's 72,723,460,248,141 (66 TB) and 100,501,350,283,428 (91 TB) respectively.
    I hoped that the gain would grow with size... wrong.

  23. #23
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    735
    Thanked 660 Times in 354 Posts
    If the whole problem is that codes may become longer than 16 bits, you should look into zip's solution to the problem. And read the whole deflate/inflate sources - it's a pleasure because they are so well commented

  24. #24
    Member
    Join Date
    Oct 2007
    Location
    Germany, Hamburg
    Posts
    408
    Thanks
    0
    Thanked 5 Times in 5 Posts
    It wasn't really a problem, but while building a Huffman class the issue with 32 bits came to my mind. Also, it's good to know for optimizations. Yes, thanks, I will look whether I find something in the sources.

  25. #25
    Programmer Bulat Ziganshin
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,497
    Thanks
    735
    Thanked 660 Times in 354 Posts
    trees.c:
    Code:
    int overflow = 0; /* number of elements with bit length too large */
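
    For context, the adjustment that overflow counter feeds into looks roughly like this (a simplified paraphrase of zlib's gen_bitlen in trees.c, not the verbatim source): bl_count[b] holds the number of codes of length b, and overflowed leaves are absorbed by demoting a shorter code.
    Code:
    // Simplified paraphrase of zlib's overflow fix in trees.c: demote one
    // leaf from a shorter code so it becomes the parent of two leaves one
    // level deeper; repeat until no code exceeds max_length.
    void limitCodeLengths(int bl_count[], int max_length, int overflow) {
        while (overflow > 0) {
            int bits = max_length - 1;
            while (bl_count[bits] == 0) --bits;
            --bl_count[bits];          // move one leaf down the tree...
            bl_count[bits + 1] += 2;   // ...as the parent of two leaves
            --bl_count[max_length];    // one overflowed leaf is absorbed
            overflow -= 2;
        }
        // zlib then reassigns lengths to symbols in frequency order.
    }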

  26. #26
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien
    The links don't work

  27. #27
    Administrator Shelwien
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,239
    Thanks
    192
    Thanked 968 Times in 501 Posts
    Indeed, it's working now. Temporary problems.

  28. #28
    Member m^2
    Join Date
    Sep 2008
    Location
    Ślůnsk, PL
    Posts
    1,611
    Thanks
    30
    Thanked 65 Times in 47 Posts
    Quote Originally Posted by Shelwien
    Indeed, it's working now. Temporary problems.
    Thanks.

  29. #29
    Member
    Join Date
    Jun 2008
    Location
    Lévis, Canada
    Posts
    30
    Thanks
    0
    Thanked 0 Times in 0 Posts
    In Libima I renormalize the frequencies by dividing them by 2 when the total is 32768 or more. Assuming that the Huffman tree is not optimized for minimum depth, the worst case is,

    1 1 1 2 3 5 8 13 21 34 55 89 etc...

    So the maximum code length is 22 bits.
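
    A sketch of that renormalization (illustration only, not Libima source): halve all counts once the total reaches the limit, rounding up so nonzero symbols keep a nonzero frequency.
    Code:
    // Halve all frequencies when the total reaches 32768, rounding up so
    // nonzero counts never drop to zero; this bounds the total and hence
    // the worst-case code length, as described above.
    #include <cstdint>

    const uint32_t TOTAL_LIMIT = 32768;

    void renormalize(uint32_t freq[256], uint32_t& total) {
        if (total < TOTAL_LIMIT) return;
        total = 0;
        for (int s = 0; s < 256; ++s) {
            freq[s] = (freq[s] + 1) / 2;
            total += freq[s];
        }
    }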
