
Thread: LittleBit 0.1 alpha

  1. #1
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    68
    Thanks
    3
    Thanked 14 Times in 8 Posts

    LittleBit 0.1 alpha

    Last year I wrote a database engine for a company I work for. Because I am obsessed with compression, I also made an algorithm they did not ask for (in my own time). They are not interested in it, so I feel free to publish it in the wild.
    It is not ready for real-world usage, mainly because the encoding times are ridiculous on large files, but it shows promise for the future. Decoding is very fast, though.

    The main goal in designing this encoder was to be able to use it in a database, so it has to be able to start reading at random positions (database fields). To make that possible I used a static Huffman tree. A full explanation and the source code can be found at https://github.com/kapenga/LittleBit/ and an alpha release at https://github.com/kapenga/LittleBit/releases/
    Build your own jar from the source if you don't trust the release.

    Please don't complain about the encoding times; I am aware that these are not practical.
    I am, however, proud of the compression ratios LittleBit is able to produce, considering it is only allowed to work with a static Huffman tree. To be very clear: the compressed sizes below include the Huffman tree. After reading the Huffman tree, decoding can start at any bit that is the start of a codeword (a Huffman tree leaf).

    Book1: 768,771 original, 22,975 Huffman tree, 234,765 data, 257,740 total
    Enwik8: 100,000,000 original, 1,290,337 Huffman tree, 25,357,414 data, 26,647,751 total
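    To illustrate the random-access idea, here is a minimal sketch of a decoder that starts at an arbitrary bit position, assuming a known static canonical Huffman code. The class and method names are made up for this example; this is not LittleBit's actual code.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch: decode with a static canonical Huffman code, starting at an
    // arbitrary bit offset that is known to be the start of a codeword.
    public class CanonicalDecoder {
        private final Map<Long, Integer> table = new HashMap<>(); // key = (len << 32) | code

        // codeLengths[sym] = canonical code length in bits (0 = symbol unused)
        public CanonicalDecoder(int[] codeLengths) {
            int maxLen = 0;
            for (int len : codeLengths) maxLen = Math.max(maxLen, len);
            int code = 0;
            for (int len = 1; len <= maxLen; len++) {        // shorter codes first
                for (int sym = 0; sym < codeLengths.length; sym++) {
                    if (codeLengths[sym] == len) {
                        table.put(((long) len << 32) | code, sym);
                        code++;
                    }
                }
                code <<= 1;                                   // move to the next code length
            }
        }

        // Decode 'count' symbols starting at bit position 'bitPos' of 'data'.
        public int[] decodeFrom(byte[] data, long bitPos, int count) {
            int[] out = new int[count];
            for (int i = 0; i < count; i++) {
                int code = 0, len = 0;
                Integer sym;
                do {  // extend the code one bit at a time until it matches a leaf
                    int bit = (data[(int) (bitPos >> 3)] >> (7 - (int) (bitPos & 7))) & 1;
                    code = (code << 1) | bit;
                    len++;
                    bitPos++;
                    sym = table.get(((long) len << 32) | code);
                } while (sym == null);
                out[i] = sym;
            }
            return out;
        }
    }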
    Last edited by Kaw; 24th January 2020 at 18:01. Reason: Adding test results

  2. Thanks (4):

    encode (27th January 2020),JamesWasil (25th January 2020),Lucas (30th January 2020),xinix (26th January 2020)

  3. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,200 Times in 668 Posts
    1. I wonder if it's possible to turn it into a preprocessor for stronger compressors (like a dictionary preprocessor, or for stuff like "space stuffing").

    2. Did you think about doing the same with arithmetic coding? It might actually be simpler and faster in this case.
    Also, as with Huffman, it's possible to start reading AC from the middle - you just have to provide the interval bounds ("low" and "range" in the case of an RC) along with the byte offset.

  4. Thanks (2):

    Kaw (27th January 2020),xinix (26th January 2020)

  5. #3
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    68
    Thanks
    3
    Thanked 14 Times in 8 Posts
    Quote Originally Posted by Shelwien View Post
    1. I wonder if it's possible to turn it into a preprocessor for stronger compressors (like a dictionary preprocessor, or for stuff like "space stuffing").

    2. Did you think about doing the same with arithmetic coding? It might actually be simpler and faster in this case.
    Also, as with Huffman, it's possible to start reading AC from the middle - you just have to provide the interval bounds ("low" and "range" in the case of an RC) along with the byte offset.
    1. Maybe, if I cut the algorithm off at an early stage. When the algorithm is done, the remaining symbols are not very predictable anymore because they have become fairly unique. The majority of the symbols occur fewer than 20 times; that will be hard to predict for a more intelligent compressor.
    The array of symbols might, however, be an interesting input for an encoder from the LZ77 family. It might be able to reduce the number of offset/length pairs this way while also having fewer unique offset/length pairs. This should strengthen the algorithm in both ways.

    2. I'll have to think about what to do with the canonical Huffman tree, which is a surprisingly efficient way of saving the tree state. Currently, as far as I can see, I will have problems saving the symbols with their probabilities efficiently if I want to use arithmetic coding. I would win on the encoding, but lose on writing the static dictionary. I don't think it will make a big difference in the end - maybe 1-2% at most.
    Also, it is true that you can reference a position in a range coder when you include the low and range values, but for now I like the simplicity of just having to reference a bit position.

  6. Thanks:

    xinix (26th January 2020)

  7. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,200 Times in 668 Posts
    > That will be hard to predict for a more intelligent compressor.

    There's a preprocessing trick called "space stuffing".
    For example, "capital conversion" works like this:
    "word" -> "word"
    "Word" -> "!word" (here "!" is a flag which tells the decoder to upper-case the next symbol)
    It turns out that compression (at least for PPM/BWT/CM) is actually better if we insert a space after the flag:
    "Word" -> "! word"

    So why is the result better with the extra space?
    Because the compressor already has word-prefix statistics in the " " context anyway,
    and learning to predict " " in the context of "!" is much easier than learning all the word prefixes.
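    A toy sketch of that transform, just to make the mapping concrete (it assumes the flag byte "!" does not occur in the input, which a real filter would have to escape):

    // Toy sketch of "capital conversion" with space stuffing, as described above.
    // Assumes '!' never occurs in the input text; this only illustrates the
    // transform, it is not a real preprocessor.
    public class CapitalFilter {
        public static String encode(String s) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (Character.isUpperCase(c)) {
                    out.append("! ");                     // flag + stuffed space
                    out.append(Character.toLowerCase(c)); // store the letter in lower case
                } else {
                    out.append(c);
                }
            }
            return out.toString();
        }

        public static String decode(String s) {
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c == '!') {
                    // skip the stuffed space and upper-case the next character
                    out.append(Character.toUpperCase(s.charAt(i + 2)));
                    i += 2;
                } else {
                    out.append(c);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            String t = "Word and another Word";
            System.out.println(encode(t));          // "! word and another ! word"
            System.out.println(decode(encode(t)));  // round-trips to the original
        }
    }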

    Another similar example is deleting spaces from text - the file without spaces is smaller,
    but compression becomes a lot worse for many algorithms.

    My idea was that since you're already doing a related kind of parsing optimization anyway
    (at least so it seems to me), maybe it's possible to turn it into a universal preprocessor?

    > It might be able to reduce the number of offset/length pairs this way while also having fewer unique offset/length pairs.
    > This should strengthen the algorithm in both ways.

    You can also test it with NNCP: https://bellard.org/nncp/
    NNCP mainly works with a 16-bit alphabet (usually produced by its own preprocessor).

    > Currently, as far as I can see, I will have problems saving the symbols
    > with their probabilities efficiently if I want to use arithmetic coding.

    https://encode.su/threads/3175-freqtable-compression
    http://nishi.dreamhosters.com/u/freqtable_v2.7z

  8. Thanks:

    xinix (26th January 2020)

  9. #5
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    68
    Thanks
    3
    Thanked 14 Times in 8 Posts
    @Shelwien,

    I am working on a new version using your feedback. I'll try to finish it within two weeks.

  10. Thanks (3):

    JamesB (26th January 2020),Shelwien (26th January 2020),xinix (27th January 2020)

  11. #6
    Member
    Join Date
    Dec 2012
    Location
    japan
    Posts
    159
    Thanks
    30
    Thanked 62 Times in 38 Posts
    How about using an arithmetic coder to save the Huffman tree?

  12. Thanks:

    Kaw (27th January 2020)

  13. #7
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    68
    Thanks
    3
    Thanked 14 Times in 8 Posts
    Quote Originally Posted by xezz View Post
    How about using an arithmetic coder to save the Huffman tree?
    The next version will do that for the parts where counts are involved, such as how many leaves there are at a certain depth.
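    As a rough sketch of what such a header could look like (illustrative names only, not necessarily the format the next version will use): the canonical code is fully described by the number of leaves per depth plus the symbols listed in canonical order, and the per-depth counts are the part a small arithmetic coder could squeeze.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a "leaf counts per depth" header for a canonical Huffman code.
    public class CanonicalHeader {
        // codeLengths[sym] -> { countsPerDepth, symbolsInCanonicalOrder }
        public static int[][] toHeader(int[] codeLengths, int maxDepth) {
            int[] counts = new int[maxDepth + 1];       // counts[d] = number of leaves at depth d
            List<Integer> symbols = new ArrayList<>();  // symbols sorted by depth, then index
            for (int d = 1; d <= maxDepth; d++) {
                for (int sym = 0; sym < codeLengths.length; sym++) {
                    if (codeLengths[sym] == d) { counts[d]++; symbols.add(sym); }
                }
            }
            int[] syms = symbols.stream().mapToInt(Integer::intValue).toArray();
            return new int[][] { counts, syms };
        }

        // inverse: rebuild codeLengths[sym] from the header
        public static int[] fromHeader(int[] counts, int[] syms, int alphabetSize) {
            int[] codeLengths = new int[alphabetSize];
            int next = 0;
            for (int d = 1; d < counts.length; d++)
                for (int i = 0; i < counts[d]; i++)
                    codeLengths[syms[next++]] = d;
            return codeLengths;
        }
    }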

  14. #8
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    68
    Thanks
    3
    Thanked 14 Times in 8 Posts
    I tried a few things and have made no real progress. If I use a static Huffman tree as input for a PPM-like algorithm, I get PPM-like results. Not bad at all, but it's just a difficult and slow way of doing PPM, so it's not very satisfying to publish. The only advantage is a decrease in tree size for the PPM part, so the algorithm can look deeper; the greater diversity of symbols is worse for the compression, but that in turn is compensated by the fact that there are fewer decisions to make (one symbol replaces one or more bytes). If you add the fact that you have to store the Huffman tree, plus the added (time) complexity of it all... it's not progress.

  15. #9
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,200 Times in 668 Posts
    A frequency table is more useful than a Huffman length table.
    Compressing a frequency table and then decrementing the freqs as you code doesn't add redundancy (like Huffman lengths do);
    it actually improves compression compared to adaptive coding.
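    A quick way to see that effect is to compare ideal code lengths on a toy message (the message and counts below are made up, just to show the direction):

    // Compare coding cost with (a) a fixed frequency table and (b) the same
    // table with counts decremented after each coded symbol.
    public class FreqTableCost {
        static double log2(double x) { return Math.log(x) / Math.log(2); }

        // ideal code length with static probabilities taken from the full table
        static double staticCost(int[] msg, int[] freq) {
            int total = 0;
            for (int f : freq) total += f;
            double bits = 0;
            for (int s : msg) bits += -log2((double) freq[s] / total);
            return bits;
        }

        // ideal code length when both sides know the table and decrement the
        // count of each symbol after it has been coded
        static double decrementingCost(int[] msg, int[] freq) {
            int[] f = freq.clone();
            int total = 0;
            for (int x : f) total += x;
            double bits = 0;
            for (int s : msg) {
                bits += -log2((double) f[s] / total);
                f[s]--;    // remaining occurrences of this symbol
                total--;   // remaining symbols overall
            }
            return bits;
        }

        public static void main(String[] args) {
            int[] msg  = {0, 1, 0, 2, 0, 1, 0, 0, 2, 1};  // toy message
            int[] freq = {5, 3, 2};                        // its exact symbol counts
            // prints ~14.85 vs ~11.30 bits - the decremented model costs less
            System.out.printf("static:       %.2f bits%n", staticCost(msg, freq));
            System.out.printf("decrementing: %.2f bits%n", decrementingCost(msg, freq));
        }
    }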

  16. #10
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    84
    Thanks
    34
    Thanked 26 Times in 17 Posts
    You can use FPC for block-based Huffman coding. It is fast, but the alphabet must be at most 256 symbols: https://github.com/algorithm314/FPC

    But from what I understand, you might as well use an arithmetic coder like fpaq, since the bottleneck is not the entropy coder but the other part of the algorithm.

  17. #11
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    68
    Thanks
    3
    Thanked 14 Times in 8 Posts
    Quote Originally Posted by algorithm View Post
    You can use FPC for block-based Huffman coding. It is fast, but the alphabet must be at most 256 symbols: https://github.com/algorithm314/FPC
    From the GitHub page: "has higher compression ratio than many AC or ANS"

    How is that possible? It's 8-bit Huffman...

  18. #12
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    84
    Thanks
    34
    Thanked 26 Times in 17 Posts
    Compression is prediction + entropy coding. FPC has an intelligent algorithm for dividing files into blocks to get better statistics. Also, passing bit lengths instead of frequencies makes the block headers smaller, so the blocks themselves can be smaller, which again improves the statistics. But this only works up to a point: when the bit lengths get small, the error from rounding to whole bit lengths becomes important and AC wins. So with careful engineering, if you keep the bit lengths large, for example by combining symbols, the compression ratio can be high.
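    For what it's worth, a toy calculation of that rounding loss (made-up skewed distribution, plain two-symbol Huffman merging, no block splitting) - the gap to the entropy is exactly what an AC/ANS coder can win back:

    import java.util.PriorityQueue;

    // Average Huffman code length vs. entropy for a skewed distribution.
    public class HuffmanLoss {
        // one Huffman tree node: subtree probability plus accumulated depth cost
        static class Node implements Comparable<Node> {
            double p;     // probability mass of this subtree
            double cost;  // sum over its leaves of (probability * depth) so far
            Node(double p, double cost) { this.p = p; this.cost = cost; }
            public int compareTo(Node o) { return Double.compare(p, o.p); }
        }

        public static void main(String[] args) {
            double[] probs = {0.90, 0.05, 0.03, 0.02};   // skewed source, sums to 1

            // entropy in bits/symbol
            double h = 0;
            for (double p : probs) h += -p * (Math.log(p) / Math.log(2));

            // build the Huffman code; track the expected code length directly:
            // every merge pushes all leaves of both subtrees one level deeper,
            // which adds (p1 + p2) to the expected length.
            PriorityQueue<Node> q = new PriorityQueue<>();
            for (double p : probs) q.add(new Node(p, 0));
            while (q.size() > 1) {
                Node a = q.poll(), b = q.poll();
                q.add(new Node(a.p + b.p, a.cost + b.cost + a.p + b.p));
            }
            double avgLen = q.poll().cost;

            System.out.printf("entropy: %.3f bits/symbol%n", h);       // ~0.618
            System.out.printf("huffman: %.3f bits/symbol%n", avgLen);  // 1.150
        }
    }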

  19. #13
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    68
    Thanks
    3
    Thanked 14 Times in 8 Posts
    Okay, I did some tests over the last few days. On Book1 I gain around 12k bytes if I use a range encoder and save the symbols with frequencies. If I add a PPM-like encoder to the mix I only gain another 1-2k. This is explainable, because stage 1 of LittleBit already removes most of the redundancy in a file. I think a few kilobytes is not enough to justify the added complexity of a PPM-like encoder.
    On the other hand, the range encoder might be useful for cases where random-access reading isn't necessary, and it only changes the encoding part with little investment. I'm now creating a version I can publish that contains two encoding options.

  20. Thanks:

    Shelwien (3rd February 2020)

