
Thread: Small data compression – theory and practice

  1. #1
     SolidComp (USA)

    Small data compression – theory and practice

    Hi all – Are there any good write-ups on small data compression, on the theoretical challenges and optimal approaches? I wonder what the best approaches are to small message compression. I mean sizes of about 300 bytes to 3 KB. Does small message compression present any interesting problems or opportunities? I've been looking at data serialization formats like MessagePack, Protocol Buffers, FlatBuffers, JSON, CBOR, Amazon Ion, and so forth. Most are binary, and some are compact, but none are designed specifically for compression. There are two senses in which I mean this:


    • They weren't co-developed with a compression codec tailored for them.
    • They weren't developed with preprocessing in mind – preprocessing with respect to subsequent compression by an existing codec like gzip, brotli, Zstd, LZMA, etc.


    By contrast, I think WebAssembly (Wasm) was developed with compression in mind, with maybe two tiers – tailored compression followed by general compression – but I don't know much about it.

    Do you think there's still a lot of headroom for compressing small binary messages with tailored codecs? I was very impressed with some of the XML-specific codecs like XMill and XGrind (which is homomorphic). Which approaches do you think would work well on already-compact binary messages? What I'm getting at is whether there's a theoretical framework that pops out here, an approach to small payload compression. Something like JSON compresses well because there are lots of chunky and wasteful repeated strings. Compact binary formats are more challenging, but I don't know whether things like CM, rANS, FSE, etc. have any particularly interesting properties for small payloads.
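To make the overhead question concrete: at these sizes the container framing alone is a measurable fraction of the output. A quick sketch with Python's zlib/gzip (the JSON payload is invented for the example):

```python
import gzip
import json
import zlib

# Invented payload, roughly the size of the messages discussed here (~250 bytes).
payload = json.dumps({
    "id": "ord-0001", "amount": 1999, "currency": "USD",
    "status": "captured",
    "customer": {"name": "Jane Doe", "email": "jane@example.com"},
    "items": [{"sku": "A-100", "qty": 2, "price": 999}],
}).encode()

zl = zlib.compress(payload, 9)   # zlib container: 2-byte header + 4-byte Adler-32
raw = zl[2:-4]                   # the bare DEFLATE stream inside it
gz = gzip.compress(payload, 9)   # gzip container: 10-byte header + 8-byte trailer

# At multi-megabyte sizes the framing is noise; at a few hundred bytes the
# 18 bytes of gzip framing alone are several percent of the output.
print(len(payload), len(raw), len(zl), len(gz))
```

So any small-message scheme starts by throwing away the container, before the entropy coder even enters the picture.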

    Thanks.

    Links:


    Amazon Ion: http://amzn.github.io/ion-docs/

    Apache Arrow: https://arrow.apache.org/

    CBOR: https://cbor.io/

    FlatBuffers: https://google.github.io/flatbuffers/

    JSON: https://www.json.org

    MessagePack: https://msgpack.org

    Protocol Buffers: https://developers.google.com/protocol-buffers

  2. #2
     ivan2k2 (Russia)
    Hi. Look at https://github.com/antirez/smaz and https://ed-von-schleck.github.io/shoco/ - these are compressors designed for short strings/data.
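For readers new to smaz: its core trick is a fixed codebook of frequent text fragments, each replaced by a one-byte code. A toy sketch of the idea (the codebook below is invented for illustration; smaz's real codebook and literal packing differ):

```python
# Toy illustration of the smaz idea: a fixed codebook of frequent fragments,
# each replaced by a single-byte code; uncovered bytes are escaped.
# The fragments below are invented for the example; smaz's real codebook differs.
CODEBOOK = [b" the ", b"the ", b" of ", b"ing ", b"tion",
            b" and ", b"er", b"in", b"es", b" a "]
ESCAPE = 0xFF  # marker byte followed by one literal byte

def encode(text: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(text):
        for code, frag in enumerate(CODEBOOK):
            if text.startswith(frag, i):
                out.append(code)      # one byte replaces the whole fragment
                i += len(frag)
                break
        else:
            out += bytes([ESCAPE, text[i]])  # 2 bytes per uncovered literal
            i += 1
    return bytes(out)

def decode(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == ESCAPE:
            out.append(data[i + 1]); i += 2
        else:
            out += CODEBOOK[data[i]]; i += 1
    return bytes(out)

msg = b"the state of the art"
packed = encode(msg)
assert decode(packed) == msg
print(len(msg), len(packed))
```

Note the toy escape costs two bytes per uncovered literal, so it expands data the codebook doesn't cover; smaz packs literal runs more cheaply, which is why it stays useful on mixed input.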

  3. #3

  4. #4
     SolidComp (USA)
    Quote Originally Posted by ivan2k2 View Post
    Hi. Look at https://github.com/antirez/smaz and https://ed-von-schleck.github.io/shoco/ - these are compressors designed for short strings/data.
    Thanks, shoco looks very interesting. I wonder how it does with binary data. I'm surprised that neither of these projects seems to be maintained.

  5. #5
     SolidComp (USA)
    Quote Originally Posted by ivan2k2 View Post
    Hi. Look at https://github.com/antirez/smaz and https://ed-von-schleck.github.io/shoco/ - these are compressors designed for short strings/data.
    It looks like shoco is only good for strings up to 100 bytes or so. That's much smaller than I was thinking of, which was 300 bytes to 3 KB. I wonder if I could just compress 100-byte chunks at a time, and how the performance and ratios would compare to SLZ.
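On the chunking idea: compressing fixed-size chunks independently usually hurts, since each chunk restarts with empty history and pays its own stream overhead. A rough illustration with zlib standing in for the codec (the payload is synthetic):

```python
import zlib

# Synthetic repetitive ~1.8 KB message; real payloads repeat less, but the
# boundary effect is the same: no matches can cross chunk borders.
data = b'{"type":"payment","amount":1999,"currency":"USD","ok":true}' * 30

whole = len(zlib.compress(data, 9))
chunked = sum(len(zlib.compress(data[i:i + 100], 9))
              for i in range(0, len(data), 100))

print(whole, chunked)  # the chunked total is far larger than the whole-message size
assert chunked > whole
```

Codecs like shoco avoid part of this because their model is baked in rather than learned from the stream, but the per-chunk framing cost remains.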

  6. #6
     Jyrki Alakuijala (Switzerland)
    An earlier (pre-zopflication) version of brotli was pretty good at small files (<300 bytes). While zopflication improves things in general, it can add 20+ bytes on the shortest data.

  7. #7
     Bulat Ziganshin (Uzbekistan)
    What is zopflication? How can one find pre-zopflication versions of brotli?

  8. #8
     SolidComp (USA)
    Quote Originally Posted by Jyrki Alakuijala View Post
    An earlier (pre-zopflication) version of brotli was pretty good at small files (<300 bytes). While zopflication improves things in general, it can add 20+ bytes on the shortest data.
    Interesting. Brotli has performed very well in tests so far, beating everything else by a wide margin. This surprised me because I don't think the dictionary is going to have any significant matches with the JSON payments payloads I've tested.

  9. #9
     Shelwien (Kharkov, Ukraine)
    > Brotli has performed very well in tests so far, beating everything else by a wide margin.

    Brotli does have some advanced parts in its statistical model which other LZs lack - like o2-ish models for UTF-8 and 16-bit audio.
    But in the end it still uses Huffman coding, so the actual compression improvement is not that significant.

    > This surprised me because I don't think the dictionary is going to have any significant matches with the JSON payments payloads I've tested.

    You can download this file: https://github.com/google/brotli/blo...dictionary.bin
    And test it with zstd (and with other codecs if you want, by subtracting the compressed size of the dictionary from the compressed size of your file with the dictionary prepended).

  10. #10
      SolidComp (USA)
    Quote Originally Posted by Shelwien View Post

    Brotli does have some advanced parts in its statistical model which other LZs lack - like o2-ish models for UTF-8 and 16-bit audio.
    But in the end it still uses Huffman coding, so the actual compression improvement is not that significant.

    > This surprised me because I don't think the dictionary is going to have any significant matches with the JSON payments payloads I've tested.

    You can download this file: https://github.com/google/brotli/blo...dictionary.bin
    And test it with zstd (and other codecs if you want - by subtracting compressed size of dictionary from compressed size of your file with prepended dictionary).
    That brotli dictionary file will work with Zstd? How?

    What do you mean, the compression improvement is not significant? It is in my testing. Brotli even beats Pavlov's gzip by a wide margin, and Pavlov's gzip has been incredible on other files.

  11. #11
      Shelwien (Kharkov, Ukraine)
    > That brotli dictionary file will work with Zstd? How?
    Code:
    zstd.exe --ultra -22 -fo book1.zst book1
    zstd.exe -D dictionary.bin --ultra -22 -fo book1d.zst book1
    zstd.exe --patch-from=dictionary.bin --ultra -22 -fo book1p.zst book1
    > What do you mean that compression improvement is not significant?

    Code:
    768,771 book1
    768,771 book1r          // reorder.exe forward.xlt book1 book1r
    256,361 book1.bro       // brotli_gc82.exe -q 11 -fo book1.bro book1 
    263,455 book1r.bro      // brotli_gc82.exe -q 11 -fo book1r.bro book1r 
    261,092 book1.lzma      // lzma.exe e book1 book1.lzma
    262,109 book1r.lzma     // lzma.exe e book1r book1r.lzma
    264,731 book1.zst       // zstd.exe --ultra -22 -fo book1.zst book1
    264,800 book1r.zst      // zstd.exe --ultra -22 -fo book1r.zst book1r
    I mean that most of brotli's gains come from the integrated dictionary rather
    than entropy model improvements.
    Well, usually - there could be significant differences if we tried compressing 16-bit WAVs
    or Chinese text in UTF-8.

    > But it is in my testing. Brotli even beats Pavlov's gzip by a wide margin,
    > and Pavlov's gzip has been incredible on other files.

    Does it really?
    Code:
    32,768 book1r
    13,753 book1r.bro
    13,823 book1r.gz
    13,844 book1r.lzma
    13,879 book1r.zst
    You have to take into account that deflate only supports a 32k window size,
    so for files longer than 32k it's usually inferior to any newer codec, Pavlov or not.

    Also deflate doesn't have integrated dictionary either.

  12. #12
      SolidComp (USA)
    Are there any innovative theories for small data compression? It seems like there should be theories and strategies tailored to it. I found the shoco and Unishox compressors posted here a while back. I think they were hand-Huffman-coding letters or something.

    Small data might end up being best served by dictionaries, I guess. Maybe the technique used in brotli, where dictionary entries can be transformed.
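On the dictionary point: even plain DEFLATE accepts a caller-supplied preset dictionary (zlib's zdict), and for small structured messages that is often the biggest single win. A sketch (the dictionary and payload below are invented; in practice you would train the dictionary, e.g. with zstd --train, on real messages):

```python
import zlib

# Hypothetical dictionary built from one representative message; real deployments
# would train one on a corpus of payloads (e.g. `zstd --train`).
zdict = (b'{"type":"payment","amount":,"currency":"USD",'
         b'"status":"captured","customer_id":}')
msg = (b'{"type":"payment","amount":2499,"currency":"USD",'
       b'"status":"captured","customer_id":"c_42"}')

plain = zlib.compress(msg, 9)              # no shared context

co = zlib.compressobj(9, zdict=zdict)      # preset-dictionary DEFLATE
with_dict = co.compress(msg) + co.flush()

do = zlib.decompressobj(zdict=zdict)       # receiver must hold the same dict
assert do.decompress(with_dict) == msg

print(len(msg), len(plain), len(with_dict))  # dict version should be smaller
assert len(with_dict) < len(plain)
```

The trade-off is coordination: sender and receiver must agree on the exact dictionary bytes out of band, which is exactly what brotli sidesteps by baking its dictionary into the format.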

  13. #13

  14. Thanks:

    SolidComp (30th June 2020)

  15. #14
      Dresdenboy (Berlin)
    Here's my bunch of findings (I've been researching compression of small binary files for a while now -- also check the tiny decompressors thread by introspec):

    Lossless Message Compression:
    https://www.diva-portal.org/smash/ge...FULLTEXT01.pdf

    The Hitchhiker's guide to choosing the compression algorithm for your smart meter data:
    https://ieeexplore.ieee.org/document/6348285 (PDF freely available)
    Slides:
    https://www.ti5.tuhh.de/publications..._energycon.pdf

    A Review of IP Packet Compression Techniques:
    http://citeseerx.ist.psu.edu/viewdoc...0.1.1.111.6448

    Automatic message compression with overload protection:
    https://www.sciencedirect.com/scienc...64121216300267

    Also feel free to send me compressor executables (Win32/64) and test sets, so I can include them in my test bench (see the example plots in the already mentioned thread).

  16. Thanks:

    SolidComp (30th June 2020)

  17. #15
      Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by Bulat Ziganshin View Post
    What is zopflication? How one can find pre-zoplification versions of Brotli?
    Zopflification (an attempt at optimal parsing) was added somewhere around early 2015, so a version from 2014 would likely work. IIRC, we open-sourced the first version in October 2013. The first version is not file-format compatible with the standardized brotli, but it can give an idea of how well it can work for very short files. The first version only includes quality 11.

    Brotli's optimal parsing could still be greatly improved -- it still ignores the context modeling and thus overestimates the final literal cost. I suspect another 2% in density could be possible by doing this more thoroughly.

    Before zopflification, brotli would often outperform gzip on the smallest files (in the 500-byte category) by 40-50% in compression. After zopflification it got 10-20% worse in the smallest category. This is likely because zopfli leads to more entropy codes being used, and that is not accounted for in the optimal parsing.

  18. Thanks:

    Bulat Ziganshin (1st July 2020)

  19. #16
      Jyrki Alakuijala (Switzerland)
    Quote Originally Posted by Shelwien View Post
    Brotli does have some advanced parts in its statistical model which other LZs lack - like o2-ish models for UTF-8 and 16-bit audio.
    But in the end it still uses Huffman coding, so the actual compression improvement is not that significant.
    Brotli uses my own prefix coding that is different from (length-limited) Huffman coding. In my coding I optimize not only for the code length (like Huffman) but also for the complexity of the resulting entropy code representation. That gives roughly a 0.15% improvement over Huffman coding.

    In early 2014 Zoltan Szabadka compared our coding against ANS on our reference test corpora (web and fonts), and a simple ANS implementation was slightly less dense due to more overhead in the entropy code description.

    In typical use the prefix code that we use is within 1% of pure arithmetic coding, and it is simpler and faster (less shadowing of data because no reversals are needed).

    (Arithmetic coding gives about a 10% improvement over Huffman in practical implementations. 1% of that is because arithmetic coding is more efficient; the other 9% is because of the context modeling that arithmetic coding allows. In brotli we take much of that 9% by using context modeling that is compatible with prefix coding, but none of the 1% improvement.)
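The Huffman-vs-entropy gap described here is easy to see on a toy distribution: Huffman must round code lengths to whole bits, and the loss blows up when one symbol dominates, as literal statistics in small files often do. A small sketch (the frequencies are made up):

```python
import heapq
import math

def huffman_lengths(freqs):
    """Code length per symbol for a Huffman code over {symbol: frequency}."""
    heap = [(f, n, {s: 0}) for n, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tick = len(heap)  # unique tie-breaker so dicts are never compared
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tick, merged))
        tick += 1
    return heap[0][2]

# Made-up skewed distribution, like literal stats in a tiny structured message.
freqs = {"a": 90, "b": 5, "c": 3, "d": 2}
total = sum(freqs.values())
lengths = huffman_lengths(freqs)
huff_bits = sum(freqs[s] * lengths[s] for s in freqs)
entropy_bits = -sum(f * math.log2(f / total) for f in freqs.values())

# Huffman cannot give the dominant symbol less than 1 bit, so this skewed case
# pays far more over entropy than the ~1% typical of flatter distributions.
print(huff_bits, round(entropy_bits, 1))
```

Context modeling attacks exactly this: splitting symbols by context flattens each per-context distribution, which is how prefix codes recover most of the 9% mentioned above.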

  20. Thanks (3):

    Bulat Ziganshin (1st July 2020),compgt (2nd July 2020),SolidComp (1st July 2020)

  21. #17
      Dresdenboy (Berlin)
    Here's another paper describing an LZSS variant for small sensor-data packets (so related work on IoT, sensor-mesh, SMS, and network message compression looks promising):
    An Improving Data Compression Capability in Sensor Node to Support SensorML-Compatible for Internet-of-Things [sic]
    http://bit.kuas.edu.tw/~jni/2018/vol3/JNI_2018_vol3_n2_001.pdf
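For anyone who can't reach the paper, the baseline such schemes start from is plain greedy LZSS: emit literals until a back-reference of at least MIN_MATCH bytes is found in the recent window. A minimal sketch (token layout and parameters are invented for illustration, not the paper's format):

```python
# Minimal greedy LZSS of the kind sensor-node schemes build on.
# WINDOW and MIN_MATCH are arbitrary here; real packet codecs tune them to RAM.
WINDOW, MIN_MATCH = 255, 3

def lzss_encode(data: bytes):
    out, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(max(0, i - WINDOW), i):          # scan the recent window
            l = 0
            while i + l < len(data) and data[j + l] == data[i + l] and l < 255:
                l += 1
            if l > best_len:
                best_len, best_off = l, i - j
        if best_len >= MIN_MATCH:
            out.append(("match", best_off, best_len))   # back-reference token
            i += best_len
        else:
            out.append(("lit", data[i]))                # literal token
            i += 1
    return out

def lzss_decode(tokens):
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, off, length = t
            for _ in range(length):                     # supports overlapping copies
                out.append(out[-off])
    return bytes(out)

msg = b"sensor:23.5;sensor:23.6;sensor:23.7;"
assert lzss_decode(lzss_encode(msg)) == msg
```

The per-token flag bit and the offset/length packing are where the variants differ; on tiny packets those few bits per token are exactly what the papers fight over.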

  22. #18
      compgt (Philippines)
    Quote Originally Posted by Dresdenboy View Post
    Here's another paper describing an LZSS variant for small sensor-data packets (so related work on IoT, sensor-mesh, SMS, and network message compression looks promising):
    An Improving Data Compression Capability in Sensor Node to Support SensorML-Compatible for Internet-of-Things [sic]
    http://bit.kuas.edu.tw/~jni/2018/vol3/JNI_2018_vol3_n2_001.pdf

    Dresdenboy, if you're interested in LZSS coding, search for my "RLLZ" in this forum for my remarks on it.

    RLLZ doesn't need a literal/match prefix bit, a match_len for subsequent occurrences of an already-encoded string, or a literal_len. It was my idea in high school (1988-1992), but I was forgetting computer programming by then and we didn't have access to a computer. I remembered it in 2018, so it's here again, straightened out and better explained.

    https://encode.su/threads/3013-Reduc...highlight=RLLZ
    Last edited by compgt; 3rd July 2020 at 13:10.

  23. Thanks:

    Dresdenboy (3rd July 2020)

  24. #19
      compgt (Philippines)
    Quote Originally Posted by Jyrki Alakuijala View Post
    Brotli uses my own prefix coding that is different from (length-limited) Huffman coding. In my coding I optimize not only for the code length (like Huffman) but also for the complexity of the resulting entropy code representation. That gives roughly a 0.15% improvement over Huffman coding.

    In early 2014 Zoltan Szabadka compared our coding against ANS on our reference test corpora (web and fonts), and a simple ANS implementation was slightly less dense due to more overhead in the entropy code description.

    In typical use the prefix code that we use is within 1% of pure arithmetic coding, and it is simpler and faster (less shadowing of data because no reversals are needed).

    (Arithmetic coding gives about a 10% improvement over Huffman in practical implementations. 1% of that is because arithmetic coding is more efficient; the other 9% is because of the context modeling that arithmetic coding allows. In brotli we take much of that 9% by using context modeling that is compatible with prefix coding, but none of the 1% improvement.)

    Maybe Jyrki and Google (with all their expertise) can implement RLLZ in an actual compressor and see the actual gains for LZ77 and LZSS...

    Since the write buffer is imposed in the decoding algorithm, it should be very fast, like byte-aligned LZ.
    Last edited by compgt; 3rd July 2020 at 14:58.

  25. #20
      Dresdenboy (Berlin)
    Quote Originally Posted by compgt View Post
    Dresdenboy, if you're interested in LZSS coding, search for my "RLLZ" in this forum for my remarks on it.

    RLLZ doesn't need a literal/match prefix bit, a match_len for subsequent occurrences of an already-encoded string, or a literal_len. It was my idea in high school (1988-1992), but I was forgetting computer programming by then and we didn't have access to a computer. I remembered it in 2018, so it's here again, straightened out and better explained.

    https://encode.su/threads/3013-Reduc...highlight=RLLZ
    I already found this thread a few weeks ago. Your algo is very interesting. I think this might also work well with small files that don't have many matches (matches covering ~15-50% of total bytes for binary data).

    The decompressor code might even be very small.

    My literal encoding needs roughly half a bit per literal or more, depending on other parameters, and 0 bits when there are no literals between matches. Matches (starting from length 2) need more than one bit, of course. This roughly fits the probabilities.

  26. #21
      Dresdenboy (Berlin)
    Christian Steinruecken looked at several compression methods in an interesting way: he scaled the window that actually gets compressed, so we can see the compression ratio at each partial file size:
    [Attachment: small_files.png - compression ratio vs. partial file size]
    Source: https://q4.github.io/thesis.html

  27. #22
      Dresdenboy (Berlin)
    I just found a dissertation "Compression of Short Text on Embedded Systems" by Stephan Rein (full text: https://depositonce.tu-berlin.de/bit...okument_44.pdf). It covers both text and image compression, targeting smaller compute systems.

    Short paper: https://www.researchgate.net/publication/42803310_Compression_of_Short_Text_on_Embedded_Systems

    Interestingly, the supervisor (and one of the co-authors of the paper), Prof. Clemens Gühmann, is someone I occasionally meet when supervising students doing their theses at our company. Now I have another interesting topic to talk about.


