
Thread: TOP4 algorithm

  1. #1
    JamesWasil (Member, Arizona)

    TOP4 algorithm

    Hi everyone,

    Ok. Well, I've come up with an algorithm that beats Huffman and doesn't use anything but binary output and the statistics from order-0 file counts. I've named it TOP4 because of the way it works.

    It isn't as good as arithmetic coding yet, but it requires no tree assembly or node creation, and in my tests it usually beats Huffman by several hundred bytes even with the 256-byte header table used for the symbol counts. In every case so far it has beaten Huffman by that much or more.

    There are other modifications that can be made to the basic algorithm to compress further, so that it still compresses slightly even in cases where Huffman would expand the output. There is also a dynamic header that reduces the header size from 256 bytes to 8.

    But is this still relevant or useful in 2018?

    Is it still useful now that few people use Huffman, the arithmetic coding patents have expired, and most new compression algorithms use arithmetic coding or ANS instead?

    It's pretty simple and straightforward. I'm surprised no one was using something like it to replace Huffman even before the arithmetic coding patents expired.

    I have a working demo of it on Windows and Linux (might do one for Mac OS X for the heck of it), and can explain it further if anyone is interested.

    James

  2. Thanks: xinix (8th October 2018)

  3. #2
    Member (France)
    Well, Huffman is still pretty much in use, whenever speed matters.
    Having an alternative that is marginally better on one or several aspects while abandoning no property can have a great future.

    Note there are already many huffman alternatives that compress better.
    But they tend to give up something in exchange,
    be it compression complexity, or memory usage, speed, etc.
    At the end of the day, this trade-off is what matters.

  4. Thanks: JamesWasil (8th October 2018)

  5. #3
    Member (Kraków, Poland)
    Huffman is optimal among order-0 prefix codes, so you have something intermediate and still FIFO?
    There are many approximations of AC, Shannon-Fano-Elias, variable-to-fixed like Tunstall, enumerative coding ...

  6. Thanks: JamesWasil (8th October 2018)

  7. #4
    JamesWasil (Member, Arizona)
    That's true. Pound for pound, yes it is a FIFO implementation and it is more efficient than Huffman wherever the probabilities are more even and there is more noise (where redundancy is less than what we see with known data types such as executables, mp3s, images, text files, etc).

    As mentioned, it is basically a tree-less approach to symbol probability: the symbols are ordered statistically into a hierarchy, and the smallest possible codes are assigned to a specific distribution of symbol frequencies, where a fixed number of codes are always shorter at the top and a proportionally balanced number are longer at the bottom, while the table remains complete.

    While Huffman uses the weighted probabilities to determine the prefix codes and trees, this instead has pre-designed, fixed codes that are known to both the compressor and decompressor in advance, with a header that holds the arrangement of the symbols in place of the tree.

    It's extremely rare for a file to use fewer than 256 distinct byte values, but that's the only case where I see Huffman still holding on against this, since it can then assign prefix codes shorter than 7 bits while this scheme is fixed at a 7-bit minimum.

    The advantages for compression with this method are:
    1) As fast as Huffman (when optimized)
    2) Simple to implement in any language (I currently have versions in BASIC / Visual Basic and C; they are slow at the moment with no optimizations but exist as proof of concept)
    3) Requires no tree structure (less coding, instant run)
    4) Has a small header (a fixed 256-byte or 8-byte header, either/or)
    5) Compresses data better than Huffman, and compresses some files that Huffman won't, as long as arithmetic or range coding was not used on them first.

    The variable-sized codes are separated by a vast stretch of plain 8-bit codes where nothing changes, which makes it possible to compress most average-size files by some percentage of bytes where Huffman can't touch them or would even expand the data. As you know, the variance between the symbol frequencies makes or breaks a statistical algorithm: if the values are favorable the compression will be great, but if they are not, compression will be worse or the data will expand.

    With TOP4, files already compressed with arithmetic or range coding seem to be the only ones that expand. After testing a few thousand files to date, MPEG-4, 7-Zip, ZIP, RAR, and other files that use range or arithmetic coding will expand. Everything else compresses by fractional to significant margins ahead of Huffman, and there is a slight modification that can be made to the algorithm to improve it further.

    Even after a decent LZ algorithm with a decent-size buffer, or a BWT pass over the data, TOP4 is still able to compress slightly more than Huffman as a final stage, and on certain files (small JPEGs, for example) it compresses them by a few dozen to a few hundred bytes where Huffman expands them every time.

    I think what I'm going to do is provide a Windows EXE for people to sample and test (warning: it will be slow, as it's a basic proof of concept only), along with a Huffman compressor that can be used side by side to compare results. It doesn't do CRC checking or anything fancy; it asks for the input and output files and expects them to either be in the same directory or be given a path, but WYSIWYG with it after that.

    Next post here will have an attachment of the compressor, the decompressor, and a huffman compressor to compare it to. If anyone has an adaptive huff compressor or an arithmetic compressor to compare it to, please feel free to attach that or share here as well.

  8. #5
    Member (Kraków, Poland)

  9. #6
    JamesWasil (Member, Arizona)
    Here is the proof-of-concept program. It is really slow and unoptimized; it was pulled from a debug build, but it works.

    Three Windows 32-bit executables are enclosed: the TOP43C compressor, the TOP43D decompressor, and HUFFMAN.exe. Each is a console program of about 300 KB.

    The 7z file is 5 MB total since it includes a sample MP3 for comparison.

    A basic Huffman compressor is included; it runs at the speed TOP4 reaches when optimized and not in debug mode. For now, TOP4 is very slow because of debug checks on every byte.

    Note: with the unoptimized version, I've noticed that TOP4 compresses worse than Huffman on highly redundant text files and on files that use fewer than 256 distinct byte values. The optimized version of the TOP4 algorithm yields an equal ratio by detecting the frequency differences (but it still beats Huffman on JPEGs, MP3s, and some other file types even without that optimization/modification to the algorithm).

    Attached Files

  10. Thanks: Sportman (13th October 2018)

  11. #7
    JamesWasil (Member, Arizona)
    P.S.: With the decompressor, you'll have to press a key after you enter the filenames and see the file size. I accidentally left in one of the functions that waits for a key press

    (it polled the keyboard before, when this was one program that used the arrow keys to choose whether to compress or decompress the selected files, but that is no longer needed in console mode).

  12. #8
    JamesWasil (Member, Arizona)
    I should probably explain it a little more. In its simplest form, the top 4 symbols of an assumed 256-symbol alphabet are changed to 7 bits (4/256 = .015625). The last 8 symbols are given an extra bit, changed from 8 to 9 bits (8/256 = .03125), to ensure that you still have a complete binary table representing all 256 symbols, though it is now a variable-bit table ranging from 7 to 9 bits, with 95.3% of it (244/256 = .953125) still remaining at 8 bits.

    The compression comes naturally from associating the most frequent counts, the top 4 symbols, with 7 bits, whereas expansion is very slight or seldom, since the last 8 symbols, which occur the least, get 9 bits but have frequencies too small to expand the output quickly; the majority of the data stays equal in size, and the most frequent data compresses by 1 bit per symbol each time it is seen.
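
    To put a number on that (my own restatement, not from the original post): if the order-0 symbol counts of an N-byte file are sorted as n(1) >= n(2) >= ... >= n(256), the coded size, ignoring the 12-byte header, is

    \[
    7\sum_{i=1}^{4} n_{(i)} + 8\sum_{i=5}^{248} n_{(i)} + 9\sum_{i=249}^{256} n_{(i)}
    \;=\; 8N - \sum_{i=1}^{4} n_{(i)} + \sum_{i=249}^{256} n_{(i)} \ \text{bits},
    \]

    so the scheme beats the plain 8-bit baseline exactly when the four most frequent symbols together occur more often than the eight least frequent ones combined.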

    So if you assume an 8 bit table from 0 to 255 like this:

    00000000
    00000001
    00000010
    00000011
    00000100
    00000101
    00000110
    00000111...
    etc
    ending with
    11111111

    you would modify it to where the top 4 are now 7 bits

    0000000 0
    0000001 1
    0000010 2
    0000011 3
    00001000 4
    00001001 5
    00001010 6
    00001011 7
    00001100 8
    00001101 9
    00001110 10
    00001111 11 etc
    ...............
    (the binary patterns for 4 to 247 remain equal at 8 bits)
    ...............
    11111000 244
    11111001 245
    11111010 246
    11111011 247
    ..............
    (the binary patterns for 248 to 255 are expanded to 9 bits until the end to hold the table)

    111111000 248
    111111001 249
    111111010 250
    111111011 251
    111111100 252
    111111101 253
    111111110 254
    111111111 255

    and then use an array that holds the frequencies of each symbol, sorted by frequency and paired with the table, so that the four most frequent symbols receive the 7-bit codes and the eight least frequent receive the 9-bit codes.

    When decompressing, the bits are read back as they were written (FIFO), and the process terminates when it reaches the end of the file.

    No tree required, no other probability weights needed other than the raw symbol statistics.

    A CRC check is still needed at the end because at times (not often, but on some smaller files) the last byte will be misread if it is a 9- or 7-bit code instead of 8, and a shift is needed there to fix that.
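
    For anyone who wants to see the layout in code, here is a minimal sketch of an encoder along these lines, assuming the four most frequent byte values take the 7-bit codes and the eight least frequent take the 9-bit codes; it writes the 12-byte symbol header followed by an MSB-first bitstream. All names are mine, not from the TOP4 demo, and it omits the CRC mentioned above and any file-length marker.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Code { uint32_t bits; int len; };

    // Ranks 0-3 -> 7-bit codes 0..3, ranks 4-247 -> 8-bit codes 8..251,
    // ranks 248-255 -> 9-bit codes 504..511 (prefix-free, Kraft sum exactly 1).
    static std::vector<Code> build_table() {
        std::vector<Code> t(256);
        for (int r = 0; r < 256; ++r) {
            if (r < 4)        t[r] = { (uint32_t)r,         7 };
            else if (r < 248) t[r] = { (uint32_t)(r + 4),   8 };
            else              t[r] = { (uint32_t)(r + 256), 9 };
        }
        return t;
    }

    int main(int argc, char **argv) {
        if (argc < 3) { std::fprintf(stderr, "usage: %s infile outfile\n", argv[0]); return 1; }
        FILE *in = std::fopen(argv[1], "rb"), *out = std::fopen(argv[2], "wb");
        if (!in || !out) return 1;

        // Order-0 counts, then sort symbols by descending frequency to get ranks.
        std::vector<long long> freq(256, 0);
        std::vector<uint8_t> data;
        int ch;
        while ((ch = std::fgetc(in)) != EOF) { data.push_back((uint8_t)ch); ++freq[ch]; }

        std::vector<int> order(256);
        for (int i = 0; i < 256; ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](int a, int b) { return freq[a] > freq[b]; });
        std::vector<int> rank(256);
        for (int r = 0; r < 256; ++r) rank[order[r]] = r;

        // 12-byte header: the 4 symbols that got 7-bit codes, then the 8 that got
        // 9-bit codes, so the decoder can rebuild the same mapping.
        for (int r = 0; r < 4; ++r)     std::fputc(order[r], out);
        for (int r = 248; r < 256; ++r) std::fputc(order[r], out);

        // Encode with a simple MSB-first bit writer.
        std::vector<Code> table = build_table();
        uint32_t buf = 0; int nbits = 0;
        for (uint8_t b : data) {
            Code c = table[rank[b]];
            buf = (buf << c.len) | c.bits; nbits += c.len;
            while (nbits >= 8) { std::fputc((buf >> (nbits - 8)) & 0xFF, out); nbits -= 8; }
        }
        if (nbits > 0) std::fputc((buf << (8 - nbits)) & 0xFF, out);  // pad the last byte
        std::fclose(in); std::fclose(out);
    }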

    There are other modifications to this that give it huge gains over Huffman, but you should find that applying it to data that is compressible by Huffman yet not highly redundant yields better results with TOP4. Whenever the data is highly redundant, or a file uses fewer than 128 distinct symbols, Huffman will still excel; for all other situations, TOP4 will do better.

    One of four modifications I have for this uses a byte modifier that tries to predict the next byte from the last; another uses a 1.5-byte sliding binary window. Both make it possible to compress a few file types that normally expand a little.

    I apologize for any lack of clarity or for the slowness of the code. It's a basic build meant only to demo it, but now that you see how it works, you can see it's simple enough to be done in literally any language, from BASIC to Pascal to C, in a few lines of code. With the other modifications it always beats Huffman, and without them it usually does, and it doesn't overlap with or affect any previous compression you've performed with an LZ-based method, regardless of which one you use.

    Hopefully that explains it a little more in detail. Let me know what you think!

    James

  13. Thanks: Sportman (13th October 2018)

  14. #9
    Member (Greece)
    I tested it on a .cpp source file and the compression ratio is poor compared to another Huffman coder.
    See for example https://github.com/algorithm314/FPC (executables are in the builds folder).

    paq8hp12.cpp (original)   115,636 bytes
    your method               108,532 bytes
    fpc -b 0                   73,221 bytes

  15. Thanks: JamesWasil (9th October 2018)

  16. #10
    JamesWasil (Member, Arizona)
    A C++ source file is a form of text file. As mentioned, that's one of the only areas where Huffman will still win. Try it on other files with less redundancy (MP3, WMA, etc.) and it will beat Huffman.

    Since most modern-day uses of Huffman are as an end-stage compressor, squeezing out whatever LZ and other methods didn't catch, the files it sees have more scattered probabilities, and that is where this does well.

    Most of what you would compress with a stand-alone Huffman algorithm (such as text or source files) is already compressed by more modern and better algorithms.

    The goal of this algorithm is not to overlap or compete with those results, but to beat Huffman in the areas where little redundancy remains, or where Huffman would normally expand the data.

    As mentioned, changes can be made so that it squeezes out the same or better as a stand-alone program, but those changes may hinder its usefulness for compressing already-compressed data one last time (or overlap with earlier stages, resulting in worse compression).

    Hence, it isn't meant to beat Huffman on raw text or EXE files, but to beat it on the output you get from running LZ/LZSS/LZW/BWT first.

    A hybrid of TOP4 could incorporate LZ and other methods first to beat Huffman every time, but that would only be ideal if you didn't plan to use another LZ method with it; if you do, it's going to overlap with that, so for demo purposes it was left as-is. There are a few file types I still haven't tested (OGG being one of them). If you run it on one of those or on a ZIP file and it expands, you should notice that it expands less than with Huffman, or compresses where Huffman won't. I will check out the link above and that compressor to compare - thanks! -J

    Quote Originally Posted by algorithm View Post
    I tested it on a .cpp source file and the compression ratio is poor compared to another Huffman coder.
    See for example https://github.com/algorithm314/FPC (executables are in the builds folder).

    paq8hp12.cpp (original)   115,636 bytes
    your method               108,532 bytes
    fpc -b 0                   73,221 bytes

  17. #11
    JamesWasil (Member, Arizona)
    Test of default SAMPLE.MP3 included with the 7z package:

    File                                           Size (bytes)   Bytes saved
    SAMPLE.MP3 (original file)                        4,829,825             -
    SAMPLE.TOP (TOP4 compressed)                      4,796,346        33,479
    SAMPLE.FPC (suggested FPC Huffman compressor)     4,811,485        18,340
    SAMPLE.HUF (included Huffman compressor)          4,812,574        17,251

    The FPC Huffman result is more efficient than the included default Huffman compressor, but FPC still did poorly and only removed 18,340 bytes. The TOP4 result was 15,139 bytes smaller than FPC's.

  18. #12
    JamesWasil (Member, Arizona)
    I guess I should have called this an "after-LZ algorithm" or a "post-LZ replacement for Huffman", since it seems people might expect it to be a stand-alone compressor that beats Huffman on common redundant data... when its purpose is to beat it on noisy data after dictionary processing.

    To understand why this works better on compressed files like WMA, MP3, AVI, and MPEG-4 than on EXE, TXT, or .cpp source files, you have to look at how the data is laid out.

    Visually:

    Exe, Text, and other redundant data looks like this:

    ********************************************
    ******************************
    **************************
    ****************
    ************
    *********
    *******
    ****
    **

    after processing from an LZ based algorithm, the data is more uniform:

    *************
    ************
    **********
    **********
    **********
    *********
    *********
    *********
    ********
    ******
    *****
    ****
    ***

    This means that if you apply Shannon-Fano or Huffman codes to the data WITHOUT any other preprocessing, THAT will beat TOP4, because the best TOP4's basic algorithm can do is an arrangement where the 1.5% of symbols at the top are expected to occur far more often than the 3% of symbols at the bottom, and that looks a little like this:

    ********
    ********
    ********
    ********
    ***********
    ***********
    ***********
    ***********
    ***********
    ***********
    *************
    *************
    *************
    *************

    Logically, if you try to represent data where the statistical ranges are restricted that much, then ANY entropy coder is going to do better because it is designed to approximate and arrange the data as closely to theoretical limits as possible.

    However, once those algorithms have been used, their output is mostly non-redundant, and it is there, after the other algorithms have done their work, that you can use this to save more than Huffman, because Huffman will always generate longer codes on noisy, low-redundancy data, where this will always generate shorter ones with the most even distribution possible.

    Arithmetic coding has a better distribution than this because it can represent symbols with fractional bits, whereas this is still limited to whole bits, with output codes that are never larger than 9 bits and never smaller than 7. Huffman can use as few as 1 bit per symbol while this is fixed at a 7-bit minimum, so Huffman gains those 6 bits and the differential space between them if you're coding for entropy in a single pass; there, Huffman will win. If you're post-processing after an LZ algorithm, you'll be able to clean up the partial fields that Huffman loses on.
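
    As a concrete illustration of that trade-off (my numbers, not from the post): in a 1,000,000-byte file where a single symbol accounts for 90% of the bytes, Huffman can give that symbol a 1-bit code while TOP4 still spends 7 bits on it:

    \[
    \underbrace{900{,}000 \times 1\ \text{bit}}_{\text{Huffman, top symbol}} \approx 112.5\ \text{KB}
    \qquad \text{vs.} \qquad
    \underbrace{900{,}000 \times 7\ \text{bits}}_{\text{TOP4, top symbol}} \approx 787.5\ \text{KB},
    \]

    so on skewed data Huffman wins by a wide margin; only when the distribution is nearly flat do the two approaches produce comparable code lengths.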

    You can verify this yourself by compressing already-compressed files, like a GIF or an MP4, with Huffman and with TOP4: the expansion will always be smaller with TOP4 than with Huffman (even if the file was arithmetically encoded and yields no compression), because of the way the data is laid out above. It will compress MP3, AVI, WMA, and other intermediate files more than Huffman because it can still grab the top frequencies while expanding the bottom ones as little as possible.

    I'm sure that this is well-understood already by most on here, but I wanted to make sure I explained it as much as possible for anyone trying to understand it who might want to use this or implement it as a standalone solution. For that, you'll want to add an LZ component or BWT first and then this, because you'll get more out of huffman as a straight entropy encoder and more out of top4 as a post-LZ entropy encoder. I hope that makes sense now.

  19. #13
    Member (USA)
    I think a better comparison would be with a simple Huffman coder that has just one table (not adaptive at all); FPC is trying to adapt to a nearly random source and likely suffers from having to periodically update its codes. On another note, have you tried TOP1, TOP2, TOP8, etc. variants? I can't find anything special about the choice of 4 here...

  20. Thanks: JamesWasil (9th October 2018)

  21. #14
    JamesWasil (Member, Arizona)
    Neither could I, but it seemed to stand out from the results after trying the others. Yes, at first I tried TOP1, where only the most frequent symbol was reduced to 7 bits and the least frequent was increased to 9 bits, with no other changes. That still worked, but only by a very small amount; using 2, 3, and 4 worked better and yielded higher compression.

    Going up to 4 symbols at the top and 8 at the bottom improved it significantly, so I then tried 10 symbols at the top and 20 at the bottom. Anything more than 4 started to make compression worse rather than better. There were a few instances where the gains were still there, but it depended on the data.

    Overall, the best results came from using 4 shorter and 8 longer codes, but I'm thinking an adaptive algorithm could pass over the data and determine the most efficient post-processing codes to use based on what is left, rather than using a static 4-and-8 model. If an adaptive version is done, something would have to be added to the header so the compressor and decompressor know which table to generate.
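
    A hypothetical sketch of that adaptive idea (names are mine, not from the demo): in a general "TOP-k" layout, k symbols get 7-bit codes and 2k symbols get 9-bit codes, since k/128 + (256 - 3k)/256 + 2k/512 = 1 is what keeps the code complete, so the net saving over 8 bits/byte is (count of the top k) minus (count of the bottom 2k) bits. One could scan a few values of k and keep the best:

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <vector>

    // sorted_freq: order-0 counts sorted in descending order.
    // Returns bits saved versus 8 bits/byte for a given k (header not counted).
    long long savings_for_k(const std::vector<long long> &sorted_freq, int k) {
        long long top = 0, bottom = 0;
        for (int i = 0; i < k; ++i)             top += sorted_freq[i];
        for (int i = 256 - 2 * k; i < 256; ++i) bottom += sorted_freq[i];
        return top - bottom;
    }

    int best_k(const std::vector<long long> &sorted_freq, int max_k = 32) {
        int best = 1;
        long long best_saved = savings_for_k(sorted_freq, 1);
        for (int k = 2; k <= max_k && 3 * k <= 256; ++k) {
            long long s = savings_for_k(sorted_freq, k);
            if (s > best_saved) { best_saved = s; best = k; }
        }
        return best;  // the chosen k would then go into the header for the decoder
    }

    int main() {
        // Synthetic near-flat distribution for illustration only.
        std::vector<long long> freq(256, 1000);
        for (int i = 0; i < 4; ++i)     freq[i] += 500;  // a few slightly hotter symbols
        for (int i = 248; i < 256; ++i) freq[i] -= 500;  // a few slightly colder symbols
        std::sort(freq.begin(), freq.end(), std::greater<long long>());
        std::printf("best k = %d\n", best_k(freq));
    }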

  22. #15
    Member (USA)
    I guess I don't see how what you have is any better in principle than a static order-0 Huffman with a single header; any savings you get by restricting the set of possible codes is a win only for the block header - it is obvious how to have, say, a 12-byte description of the TOP4 codes - but you pay the cost of a potential mismatch with the symbol statistics everywhere else. I'm also not sure the nearly-incompressible regime is that interesting...

  23. #16
    JamesWasil (Member, Arizona)
    LZ-based algorithms can usually get you to around 2.4 bits per symbol if the matches are right; on average, 4 to 6 bits per symbol is reasonable, and then you'd use Huffman or TOP4. But if you were to use Huffman before an LZ or other dictionary algorithm, you would run the risk of losing the string matches that are there. Most of your significant gains come from string matching or equivalent algorithms, not from entropy coding. That is why Huffman is used last, but once you reach that point, you don't really have to worry about a potential mismatch of symbol statistics, because most of the probabilities become closer to even. You've already compressed out the larger matches, and the frequency counts aren't going to vary that much.

    If you analyze a 7-Zip or RAR file, you'll see how true this is. The occurrence counts of the symbols are nearly even: hundreds to thousands of symbols throughout the file with nearly the same count, evenly spread, and successive symbols only a few dozen counts apart, if that. But if you look at an EXE or text file, those frequencies are wildly different: thousands more 0s and 255s than any other byte value in executables, for example, and of course vast numbers of symbols in the 0-127 range for text files.

    Once your frequencies become nearly even, if you run Huffman instead of TOP4 at that point, Huffman will generate the shortest codes it can, and it may even give you 6-bit codes rather than 7 to start with... but the prefix codes will always grow larger for Huffman than for a TOP4 code table. You save a lot more than just the header: you're saving by keeping static 8-bit codes throughout most of the data set, where Huffman would expand it to variable-sized 9-, 10-, 11-, or even 12-bit codes and larger for what should still be 8, or at most 9, bits.

    Those extra bits that you save at the top add up quickly when the distribution of symbols is nearly even, so you save there and on every code that remains 9 bits or less.

    Granted, not everyone is interested in hard-to-compress data streams, and the gratification of compressing them isn't what it used to be... but at volume, and in the right areas, I'm hoping it will still be significant. You can always use an arithmetic coder to get around all of this, but if you need something better than Huffman after processing and can't use arithmetic coding for some reason, this works for that.

  24. #17
    cfeck (Member, Germany)
    Quote Originally Posted by JamesWasil View Post
    Once your frequencies become nearly even, if you run Huffman instead of TOP4 at that point, Huffman will generate the shortest codes it can, and it may even give you 6-bit codes rather than 7 to start with... but the prefix codes will always grow larger for Huffman than for a TOP4 code table. You save a lot more than just the header: you're saving by keeping static 8-bit codes throughout most of the data set, where Huffman would expand it to variable-sized 9-, 10-, 11-, or even 12-bit codes and larger for what should still be 8, or at most 9, bits.
    How did you come to this conclusion? Huffman, given a nearly flat symbol distribution, could (and does) generate exactly the same 7/8/9-bit entries as TOP4. In fact, it would adapt the number of 7-bit symbols to optimize the coded size. The only difference I see is in table size, because most symbols still have 8 bits and you do not need to emit those. Your algorithm is not a replacement for an entropy coder. What you are doing with TOP4 is removing a little bit of redundancy _after_ (e.g.) Huffman.

  25. #18
    cfeck (Member, Germany)
    BTW, I fail to see how you could possibly encode the 4 top and the 8 bottom symbols with 8 bytes in the table. I would have expected 12. How do you generate the table from the distribution?

  26. Thanks: JamesWasil (9th October 2018)

  27. #19
    Member (USA)
    @James, I am not sure at all what you are saying. Imagine a data block where bytes 0-3 show up 4 times each, bytes 4-247 show up 2 times each, and bytes 2448-255 show up once each. The Huffman code for those should assign exactly the same codes as TOP4; and if a Huffman code gives something 11 or 12 bits, it had better be the case that the symbol is really much rarer - I don't know how you decide that something that got an 11-bit code from Huffman should "get at most 9 bits".

    In summary:
    1. TOP4 is an integer code
    2. TOP4 is a prefix code
    3. Huffman is the optimal integer prefix code
    ----------
    Therefore
    ----------
    The TOP4 code is at best as good as Huffman, ignoring the overhead of transmitting the actual code tree (which is <= 12 bytes for TOP4, because I can easily imagine how to encode the table for TOP4 using 12 bytes, and I'm pretending we're not encoding EOF).

    I'll let people that have written a Huffman coder in the last 20 years correct me (my last one was in 1998, I can't say that it was a great piece of engineering).
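
    To check the claim for that particular distribution, here is a small, hypothetical program (not one of the thread's attachments) that builds a Huffman code for counts of 4, 2, and 1 as described and prints the resulting code lengths; for this input they should come out as 7, 8, and 9 bits, matching the fixed TOP4 layout.

    #include <cstdio>
    #include <queue>
    #include <utility>
    #include <vector>

    int main() {
        // Bytes 0-3 occur 4 times each, 4-247 twice each, 248-255 once each.
        std::vector<long> freq(256);
        for (int i = 0; i < 256; ++i)
            freq[i] = (i < 4) ? 4 : (i < 248) ? 2 : 1;

        // Min-heap of (subtree weight, symbols contained in that subtree).
        using Node = std::pair<long, std::vector<int>>;
        auto cmp = [](const Node &a, const Node &b) { return a.first > b.first; };
        std::priority_queue<Node, std::vector<Node>, decltype(cmp)> pq(cmp);
        for (int i = 0; i < 256; ++i) pq.push({freq[i], {i}});

        std::vector<int> depth(256, 0);
        while (pq.size() > 1) {
            Node a = pq.top(); pq.pop();
            Node b = pq.top(); pq.pop();
            // Merging two subtrees pushes every symbol they contain one level deeper.
            for (int s : a.second) ++depth[s];
            for (int s : b.second) ++depth[s];
            a.second.insert(a.second.end(), b.second.begin(), b.second.end());
            pq.push({a.first + b.first, a.second});
        }
        std::printf("len(0)=%d len(100)=%d len(255)=%d\n", depth[0], depth[100], depth[255]);
    }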

  28. #20
    Member (USA)
    Wow, ninja'd by cfeck on encode.su; that is a first. You must have posted while I was trying to compose my answer.

  29. #21
    cfeck (Member, Germany)
    Sorry, https://xkcd.com/386/ indeed made my bed wait

  30. #22
    JamesWasil (Member, Arizona)
    Quote Originally Posted by cfeck View Post
    BTW, I fail to see how you could possibly encode the 4 top and the 8 bottom symbols with 8 bytes in the table. I would have expected 12. How do you generate the table from the distribution?
    You're correct, that was a typo; I meant to say 12 bytes, not 8, for the header in each instance. The minimum table size is 12, not 8: 4 * 8 bits for the top 4 symbols (32 bits) plus 8 * 8 bits for the last 8 symbols (64 bits) = 96 bits (12 bytes) total for the header (excluding the file size or any other data added later).
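
    A hypothetical layout matching that 12-byte figure (names are mine, not taken from the TOP4 demo): the decoder only needs to know which 4 symbols received the 7-bit codes and which 8 received the 9-bit codes, since everything else stays at 8 bits.

    #include <cstdint>

    struct Top4Header {
        uint8_t top4[4];     // the 4 most frequent byte values (7-bit codes)
        uint8_t bottom8[8];  // the 8 least frequent byte values (9-bit codes)
        // an optional 32-bit original-length field would bring this to 16 bytes,
        // as discussed later in the thread
    };

    static_assert(sizeof(Top4Header) == 12, "12-byte header");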

  31. #23
    JamesWasil (Member, Arizona)
    Quote Originally Posted by cfeck View Post
    How did you come to this conclusion? Huffman, given a nearly flat symbol distribution, could (and does) generate exactly the same 7/8/9-bit entries as TOP4.
    That is false. If that were the case, then post-processed LZ results would reflect that. They do not. Huffman is only optimal when the logarithmic probabilities of the data are favorable. When they are not, Huffman no longer generates optimal codes, and subspatial codes can be used in place of the ones Huffman makes. TOP4 generates those smaller codes when the symbol probabilities are no longer extreme enough for a short prefix code to compress large numbers of common symbols.

    Quote Originally Posted by cfeck View Post
    In fact, it would adapt the number of 7-bit symbols to optimize the coded size. The only difference I see is in table size, because most symbols still have 8 bits and you do not need to emit those.

    If that were the case, then why didn't it adapt and do that in any of the Huffman examples so far where an LZ dictionary was used first, without Huffman present?

    Quote Originally Posted by cfeck View Post
    Your algorithm is not a replacement for an entropy coder. What you are doing with TOP4 is removing a little bit of redundancy _after_ (e.g.) Huffman.

    I'm so glad you're an expert on what you didn't write lol. No, it is not a replacement for an entropy coder... it is a replacement for post-LZ and post-dictionary compression where Huffman is used as a final encoding stage, giving smaller results than Huffman there WITH OR WITHOUT Huffman being used previously. As I said above, test it on already-compressed data, whether Huffman is involved or not, such as LZ- or LZW-packaged data like MP3 packets or the image data in a GIF file, and you'll find that Huffman does not generate optimal codes where TOP4 will. It's not removing redundancy after Huffman if Huffman is never used lol.

  32. #24
    JamesWasil (Member, Arizona)
    Quote Originally Posted by Stefan Atev View Post
    @James, I am not sure at all what you are saying. Imagine a data block where bytes 0-3 show up 4 times each, bytes 4-247 show up 2 times each, and bytes 2448-255 show up once each. The Huffman code for those should assign exactly the same codes as TOP4; and if a Huffman code gives something 11 or 12 bits, it had better be the case that the symbol is really much rarer - I don't know how you decide that something that got an 11-bit code from Huffman should "get at most 9 bits".

    In summary:
    1. TOP4 is an integer code
    2. TOP4 is a prefix code
    3. Huffman is the optimal integer prefix code
    ----------
    Therefore
    ----------
    The TOP4 code is at best as good as Huffman, ignoring the overhead of transmitting the actual code tree (which is <= 12 bytes for TOP4, because I can easily imagine how to encode the table for TOP4 using 12 bytes, and I'm pretending we're not encoding EOF).

    I'll let people that have written a Huffman coder in the last 20 years correct me (my last one was in 1998, I can't say that it was a great piece of engineering).
    Well, Huffman is not an optimal integer code when the probabilities no longer favor it. That's why arithmetic coding and a few other solutions exist that prove Huffman to be non-optimal in certain cases. In this case, the non-optimality occurs when the data is too uniform for it to generate the codes it needs to compress, even though it is still possible to compress that data by expanding and contracting by only 1 bit in the places where you can.

    Huffman is no longer optimal if the distribution does not consist of powers of 2, as reflected by the delta of that space which top4 is using when data is uniform. It isn't until or unless those conditions are true that you'll see optimal results from top4 and not from huffman.

    For the demo code, no EOF symbol was used, only a file size marker as a header extension (i.e., for simplicity, a file size such as 47233 is stored one byte per digit and terminated with a marker symbol; it could be compacted into a 32-bit number instead, handling files up to 4 GB and making the total table size 16 bytes instead of 12).

    I don't know how byte 2448 will show up out of 255, but I'm going to assume that's a typo like my 8-byte header was. Huffman doesn't assign codes that are not powers of 2, and when the data is too uniform or random, Huffman will (and does) still assign larger codes based on frequency of occurrence, ignoring the volume of the data, when the symbols it represents are almost equal. If you're going for raw entropy in a single pass, then yes, that will work well.

    If you've already passed over the data with a dictionary compressor or other compressor, Huffman is not going to do as well as it would have otherwise (this is a given), or well at all on those results, where TOP4 will do better because of the restriction on code-size variance. No codes are made too large, even if it's still possible to make some codes smaller. If the distribution is a power of 2 and there is enough redundancy left, you'll still get a little out of Huffman, and maybe more than TOP4 in places. Huffman's codes come out of the table as powers of 2, but with TOP4 you control the code and fix it to about 1.5% and 3% respectively.

    When the redundancies in the data that Huffman exploits disappear, you're going to need the smallest codes possible that expand the least, and you won't get that from Huffman beyond that point. You'll only get it from slight reductions, or by using fractional bits with ANS, range, or arithmetic coding, which will do even better. Because of its code sizes, arithmetic coding can be used interchangeably, whether you're post-processing or entropy coding in a single shot, and you'll get smaller results there than with either of these.

  33. #25
    JamesWasil (Member, Arizona)
    Quote Originally Posted by cfeck View Post
    How did you come to this conclusion?
    I was analyzing the data and noticing a constant pattern: data that was already compressed, like 7-Zip files, RAR files, MP4, M4A, and others, always expanded less with TOP4 than the output that came out of Huffman. I wondered: if Huffman were generating optimal codes, shouldn't the size of the expansion be the same? I thought it should be... but it wasn't.

    Looking at the data, it became obvious why. When the patterns are no longer predictable and the data's complexity increases, optimal code generation comes down to whatever expands the data the least while still compressing it by a fractional margin. No matter how small that margin is, it has to be there... just as a small margin of expansion has to correspond to it. Huffman isn't designed to work on margins that small, and its codes are too large past that point to capture even the fractional amounts without expanding the data a little, or a lot, more than it should. UP UNTIL that point, Huffman will be optimal. After that point, it will not.

    That's why TOP4 seemed like an ideal substitute: it is small enough to still capture some of those gains without expanding the data, and if you weren't able to use arithmetic coding, you could still use it in its place to squeeze out a few more drops of delicious data.

  34. #26
    Member (Switzerland)
    Quote Originally Posted by JamesWasil View Post
    In every case so far it has beaten Huffman by that much or more.
    Just an FYI:

    Zopfli and Brotli do not use Huffman coding.

    A Huffman code is a prefix code. Zopfli and Brotli use a prefix code that is different from a Huffman code: my own prefix coding, which is more efficient than Huffman coding. It achieves the improved efficiency by optimizing not only the size of the coding but also by approximating and minimizing the size of the code-length code's lengths.

    This more-efficient-than-Huffman code gives about a 0.15% compression density gain (post-LZ77). Unfortunately, it is too hackish to ever be described in CS textbooks.

    A somewhat exhaustive search over the optimal prefix codes can give a further 0.15% compression density over current Zopfli/Brotli, but it slows down Zopfli or Brotli:11 by another two orders of magnitude (i.e., it will be disgustingly slow).

  35. Thanks (2): Bulat Ziganshin (10th October 2018), JamesWasil (9th October 2018)

  36. #27
    Member (Greece)
    Quote Originally Posted by Stefan Atev View Post
    I think a better comparison would be with a simple Huffman coder that has just one table (not adaptive at all); FPC is trying to adapt to a nearly random source and likely suffers from having to periodically update its codes. On another note, have you tried TOP1, TOP2, TOP8, etc. variants? I can't find anything special about the choice of 4 here...
    FPC is block-based. The adaptive variant just tries to pick good places to split the blocks.

  37. #28
    cfeck (Member, Germany)
    Quote Originally Posted by JamesWasil View Post
    That is false.
    I will stop here to give you a few years of research and coding to let you understand the Huffman algorithm.

  38. #29
    jibz (Member, Denmark)
    Quote Originally Posted by JamesWasil View Post
    SAMPLE.TOP (top4 compressed) 4,796,346
    saved 33,479 bytes
    I probably missed something in your description of the algorithm (or, equally likely I got the code below wrong), but I can't seem to get the numbers to add up. I wrote a small program that counts symbol frequencies, sorts them, and computes the encoded length of assigning 7 bits to the highest 4 and 9 bits to the lowest 8 for the file sample.mp3:

    #include <algorithm>
    #include <functional>
    #include <numeric>
    #include <iostream>
    #include <vector>

    #include <stdio.h>
    #include <stdlib.h>

    int main()
    {
        // Order-0 byte counts for the whole file.
        std::vector<size_t> count(256, 0);
        FILE *fp = fopen("sample.mp3", "rb");
        int ch;

        if (!fp) {
            return 1;
        }

        while ((ch = fgetc(fp)) != EOF) {
            count[ch]++;
        }
        fclose(fp);

        // Sort the counts in descending order; only the ranks matter here.
        std::sort(count.begin(), count.end(), std::greater<>());

        std::cout << std::accumulate(count.begin(), count.end(), (size_t)0) << " input bytes\n";

        // 7 bits for the 4 most frequent symbols, 9 bits for the 8 least
        // frequent, 8 bits for everything in between.
        size_t out_bits = 0;

        for (int i = 0; i < 256; ++i) {
            if (i < 4) {
                out_bits += 7 * count[i];
            }
            else if (i < 248) {
                out_bits += 8 * count[i];
            }
            else {
                out_bits += 9 * count[i];
            }
        }

        std::cout << (out_bits + 7) / 8 << " output bytes\n";
    }


    Output is:

    Code:
    4829825 input bytes
    4816362 output bytes

  39. #30
    JamesWasil (Member, Arizona)
    Quote Originally Posted by cfeck View Post
    I will stop here to give you a few years of research and coding to let you understand the Huffman algorithm.
    Yes, you need to stop there. You're the one who was wrong on the internet. I've had decades of research and programming since 1986, but thank you. Hope you got some sleep. zzz


