
Thread: which compress algorithm is best for me?

  1. #1
    Member
    Join Date
    Dec 2016
    Location
    Bei Jing
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts

which compress algorithm is best for me?

I'm designing a string (UTF-8) pool for my software. The basic interface of the pool is:

    HANDLE SetString(String); and
    String GetString(HANDLE);

HANDLE may be an int or whatever.
The pool may contain 10,000,000 strings (lengths between 0 and 255), which is why I think I need compression, and compression/decompression speed is VERY important.
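
To make it concrete, here is a minimal C++ sketch of the interface I mean (the internals are just a placeholder - the real pool would keep the strings compressed):

Code:
#include <cstdint>
#include <string>
#include <vector>

// Illustration only: HANDLE is just an index into the pool here.
using HANDLE = std::uint32_t;

class StringPool {
public:
    HANDLE SetString(const std::string& s) {
        strings_.push_back(s);                 // real pool would store compressed
        return static_cast<HANDLE>(strings_.size() - 1);
    }
    std::string GetString(HANDLE h) const {
        return strings_.at(h);                 // real pool would decompress here
    }
private:
    std::vector<std::string> strings_;         // up to ~10M entries, 0..255 bytes each
};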

    So, which algorithm is best?

    Thank you all !

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,199 Times in 667 Posts
0. It's not a given that you need compression at all - 10M strings of length 0..255 average out to about 1.2G of memory,
    which is normally available now.

0a. There are already database engines which support this:
    https://en.wikipedia.org/wiki/Column...MS#Compression
    Compression there is not that good though, afaik.

    1. Write a benchmark script, which would turn set/get speed + compression ratio into a single number (the optimization metric).
    For example, if you have your 10M strings ready in advance, there's normally no need to add strings by SetString -
    you can build/optimize the structure first, then run the benchmark.

    2. Choose a codec from http://encode.su/threads/?p=50982&pp=1

    3. Make a string array with sorted strings, compress blocks of convenient size (4k?)

    4. Set: binary search by first string in blocks, decompress a block (keep uncompressed in a cache), insert a string

5. Get: same as Set. Alternatively, it's possible to implement an additional hashtable for cached strings (in unpacked blocks). (A rough sketch of 3-5 follows at the end of this post.)

6. In your case it could be better to convert utf8 to utf16, or to some other form more efficient than utf8.
Maybe just make a "dictionary" where utf8-like (1- and 2-byte) codes are assigned to the actual utf8 symbols, based on observed frequency.

7. Fast codecs are pretty bad at sorted-list compression (see http://www.maximumcompression.com/data/dict.php ),
    so a preprocessor can be used - see 31.cpp in http://nishi.dreamhosters.com/u/dictpack3b.rar

8. Another option is to compress the strings with some static CM/PPM first (pre-trained with a dictionary),
    then use the same system as above on compressed strings, but without block compression.
    But this way is less adaptive and could be too slow with longer strings. Initial compression would be better though.
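
To illustrate 3-5, a rough sketch under my own assumptions - the codec is an identity placeholder (substitute any block codec from the list in 2), and names like SortedPool/Contains are made up for the example:

Code:
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Identity placeholder codec: replace with a real block codec from step 2.
static std::vector<char> compress_block(const std::string& raw) {
    return std::vector<char>(raw.begin(), raw.end());
}
static std::string decompress_block(const std::vector<char>& packed) {
    return std::string(packed.begin(), packed.end());
}

struct Block {
    std::string first;          // smallest string in the block, kept uncompressed
    std::vector<char> packed;   // '\0'-separated strings, compressed
};

class SortedPool {
public:
    // Step 3: sort all strings, pack them into blocks of a convenient size (~4k).
    explicit SortedPool(std::vector<std::string> strings, std::size_t block_size = 4096) {
        std::sort(strings.begin(), strings.end());
        std::string raw, first;
        for (const auto& s : strings) {
            if (raw.empty()) first = s;
            raw += s;
            raw += '\0';
            if (raw.size() >= block_size) flush(first, raw);
        }
        if (!raw.empty()) flush(first, raw);
    }

    // Steps 4-5: binary search by the first string of each block,
    // unpack that one block (cached), then scan inside it.
    bool Contains(const std::string& s) {
        if (blocks_.empty()) return false;
        std::size_t lo = 0, hi = blocks_.size() - 1;
        while (lo < hi) {                                // last block whose first string <= s
            std::size_t mid = (lo + hi + 1) / 2;
            if (blocks_[mid].first <= s) lo = mid; else hi = mid - 1;
        }
        const std::string& raw = unpack(lo);
        for (std::size_t pos = 0; pos < raw.size(); ) {  // linear scan of one small block
            std::string cur(raw.c_str() + pos);
            if (cur == s) return true;
            pos += cur.size() + 1;
        }
        return false;
    }

private:
    void flush(std::string& first, std::string& raw) {
        blocks_.push_back({first, compress_block(raw)});
        first.clear();
        raw.clear();
    }
    const std::string& unpack(std::size_t i) {
        auto it = cache_.find(i);
        if (it == cache_.end())
            it = cache_.emplace(i, decompress_block(blocks_[i].packed)).first;
        return it->second;
    }
    std::vector<Block> blocks_;
    std::unordered_map<std::size_t, std::string> cache_;  // cache of unpacked blocks
};

Set would work the same way - find the block, unpack, insert, and re-pack it (or keep it dirty in the cache) - and an extra hashtable over the cached blocks gives Get without the scan.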

  3. #3
    Member
    Join Date
    Dec 2016
    Location
    Bei Jing
    Posts
    2
    Thanks
    0
    Thanked 0 Times in 0 Posts
    Shelwien, thanks a lot for your suggestion.
The real scenario is that I'm writing an Everything-like tool (
http://www.voidtools.com/).
I've already finished almost every function, but it uses a bit too much memory. So I'm looking for a way to save memory, and I think I can compress the file names in memory.
Currently, 6M file names use about 100MB of memory, and I've tested LZW, which compresses them to less than 30MB.
I'll try what you said.
    Thanks again!

  4. #4
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
i wonder how you are using these compressed data? if you need to decompress everything during the search operation, you can just compress it in blocks of any size. and i suggest using lzma/zstd depending on the speed of the search itself. the names of those few files that successfully went through your filters can be kept uncompressed

  5. #5
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,199 Times in 667 Posts
    Would be interesting if you could post a sample file with your strings (a few MBs).
    We'd be able to test various things then.

  6. #6
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    766
    Thanks
    217
    Thanked 285 Times in 167 Posts
    Quote Originally Posted by skogkatt View Post
    So, which algorithm is best?
    Use a custom generated dictionary. Compress blocks of strings together (like 8-64 kb), after first sorting them. Use a normal text compressor like brotli on the blocks.
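
A minimal sketch of the block part with brotli's one-shot C API (I'm leaving the custom dictionary out here and only showing sort + group + compress; the group size and quality level are arbitrary choices):

Code:
#include <brotli/encode.h>
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Sort the names so neighbours share prefixes, join a group of them,
// and compress the whole group as one brotli block in text mode.
std::vector<uint8_t> compress_group(std::vector<std::string> names) {
    std::sort(names.begin(), names.end());
    std::string raw;
    for (const auto& n : names) { raw += n; raw += '\n'; }

    std::vector<uint8_t> out(BrotliEncoderMaxCompressedSize(raw.size()));
    size_t out_size = out.size();
    if (!BrotliEncoderCompress(/*quality=*/5, BROTLI_DEFAULT_WINDOW, BROTLI_MODE_TEXT,
                               raw.size(),
                               reinterpret_cast<const uint8_t*>(raw.data()),
                               &out_size, out.data()))
        return {};                             // compression failed
    out.resize(out_size);
    return out;
}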

  7. #7
    Member
    Join Date
    Dec 2011
    Location
    Cambridge, UK
    Posts
    487
    Thanks
    172
    Thanked 168 Times in 115 Posts
    If you have a pattern matching system for your strings then something like the FM-index (BWT based) can be both compressed in memory and also fast to query. Also see compressed suffix arrays or patricia tries.

Your best bet for speed is to remove redundancy in a way that doesn't require decompression to search against.
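
Not an FM-index, but a plain (uncompressed) suffix array over the concatenated names shows the "query the index instead of decompressing" idea; an FM-index or compressed suffix array answers the same count/locate query over a BWT-compressed representation. Names below are my own and the naive sort is only for illustration:

Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct SuffixIndex {
    std::string text;               // all names joined with '\n'
    std::vector<uint32_t> sa;       // suffix start offsets, sorted lexicographically

    explicit SuffixIndex(std::string t) : text(std::move(t)) {
        sa.resize(text.size());
        for (uint32_t i = 0; i < sa.size(); ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(), [&](uint32_t a, uint32_t b) {
            return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
        });
    }

    // Count occurrences of a substring pattern such as "lzma" in O(|p| log n).
    std::size_t count(const std::string& p) const {
        auto lo = std::lower_bound(sa.begin(), sa.end(), p,
            [&](uint32_t pos, const std::string& pat) {
                return text.compare(pos, pat.size(), pat) < 0;
            });
        auto hi = std::upper_bound(sa.begin(), sa.end(), p,
            [&](const std::string& pat, uint32_t pos) {
                return text.compare(pos, pat.size(), pat) > 0;
            });
        return static_cast<std::size_t>(hi - lo);
    }
};

For millions of names you'd build the array with a proper construction algorithm, and the 4 bytes per position is exactly the blow-up discussed in the next reply; the FM-index trades that explicit array for rank queries over the BWT.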

  8. Thanks:

    Bulat Ziganshin (15th December 2016)

  9. #8
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
James, cute idea, but the FM-index indexes just one big string, like BWT, yes? if you have to search wildcards like "*lzma*", you have to insert every position of every filename into the index, so for 10 mln files with 250 MB of total filename length, one 4-byte position per character makes the index about 1 GB. yes, it should give instant search for most patterns, but it means inflating the data 4 times rather than compressing it

  10. #9
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
It may contain 10,000,000 strings (lengths between 0 and 255)
    Sounds like it could be a great fit for dictionary compression.
Do you have a sample that could be downloaded for tests?

  11. #10
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
Cyan, are the savings from compressing a single filename worth its Huffman table?

while he talks about filenames up to 255 chars, the average size on my own disks is about 20 bytes. i think dictionary compression may be useful, but only with an xwrt-type dictionary

and if names are compressed in larger groups, he may just use larger blocks

  12. #11
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,199 Times in 667 Posts
You have to take into account that these are probably Chinese filenames, with 3-4 bytes per symbol (utf8).

Also, as to the dictionary: is it necessary to re-process the dictionary data each time the dictionary is used to compress a new string,
or is it possible to parse the dictionary once and then just reset the coder to a saved state?

  13. #12
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
Eugene, i wonder - in which language are most of the filenames on your HDD - Russian or Ukrainian? Most of my files have Tatar names, especially all those ICC installations, you know.

i wonder who you are asking (probably me) and which dictionary you mean (probably xwrt-style). an xwrt-style dictionary can be built on some training set of data and then used on other strings. it's just a list of "shortcuts". i don't remember much about the xwrt details, but at least my own dict algo works this way

  14. #13
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,760
    Thanks
    274
    Thanked 1,199 Times in 667 Posts
1. It's very hard to find millions of filenames on a normal partition though, and something like a file-hosting service might have a different percentage of non-English names.
In any case, the TS explicitly said that the strings are utf8.

    2. It was about zstd dictionary mode.

  15. #14
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
the filenames on my disk are utf8 too; it just ensures that you can encode any name and doesn't mean that the majority of them are non-English. and yeah, i have 3 mln files here

a zstd dictionary, afaik, is just a data block "preceding" the data itself. you can build it on one dataset and use it with another one. moreover, the idea is that you train it on a large dataset and it optimizes compression for this exact dataset, split into small blocks. so the closer the data being compressed is to the training set, the more compression will be improved

afair, the zstd dictionary builder just performs a bwt-sort and selects the most frequent byte sequences into the dictionary
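
for reference, training such a dictionary through zstd's public ZDICT API looks roughly like this (a sketch; the sample layout, capacity and error handling are my own choices):

Code:
#include <zdict.h>   // ZDICT_trainFromBuffer
#include <string>
#include <vector>

// Train a zstd dictionary on a set of sample filenames; each filename is one sample.
std::vector<char> train_dict(const std::vector<std::string>& samples,
                             size_t dict_capacity = 64 * 1024) {
    std::string buffer;                 // all samples concatenated back to back
    std::vector<size_t> sizes;          // size of each sample inside the buffer
    for (const auto& s : samples) { buffer += s; sizes.push_back(s.size()); }

    std::vector<char> dict(dict_capacity);
    size_t dict_size = ZDICT_trainFromBuffer(dict.data(), dict.size(),
                                             buffer.data(), sizes.data(),
                                             static_cast<unsigned>(sizes.size()));
    if (ZDICT_isError(dict_size)) return {};   // training can fail on tiny inputs
    dict.resize(dict_size);
    return dict;
}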

  16. #15
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
is it necessary to re-process the dictionary data each time the dictionary is used to compress a new string,
or is it possible to parse the dictionary once and then just reset the coder to a saved state?
    Just once

I suspect the dictionary will work fine even on blocks of 100-200 bytes.
    There's no obligation to go down to "one filename per decompression unit".
    It could be a few of them.
The point is, the dictionary is likely to be efficient before reaching 1 KB.

Compared to traditional methods, which require grouping names into "reasonably sized" blocks of 4-16 KB in order to achieve "reasonable" compression ratios, it lowers the bar.
The impact is on decompression speed: in order to extract one filename, one only has to decompress a smaller block, making the retrieval process proportionally faster.

The best approach would be to get some samples, in order to test which limits work best.
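
A sketch of that usage pattern with zstd's digested-dictionary API - the dictionary is parsed once into ZSTD_CDict/ZSTD_DDict objects and then reused for every small unit (the struct name and level are mine; cleanup with the ZSTD_free* functions is omitted for brevity):

Code:
#include <zstd.h>
#include <string>
#include <vector>

// The dictionary (from ZDICT training) is digested once, then reused for every
// small decompression unit - a handful of filenames, well under 1 KB each.
struct DictCodec {
    ZSTD_CCtx*  cctx  = ZSTD_createCCtx();
    ZSTD_DCtx*  dctx  = ZSTD_createDCtx();
    ZSTD_CDict* cdict = nullptr;
    ZSTD_DDict* ddict = nullptr;

    DictCodec(const void* dict, size_t dict_size, int level = 3)
        : cdict(ZSTD_createCDict(dict, dict_size, level)),
          ddict(ZSTD_createDDict(dict, dict_size)) {}

    std::vector<char> pack(const std::string& unit) const {
        std::vector<char> out(ZSTD_compressBound(unit.size()));
        size_t n = ZSTD_compress_usingCDict(cctx, out.data(), out.size(),
                                            unit.data(), unit.size(), cdict);
        out.resize(ZSTD_isError(n) ? 0 : n);
        return out;
    }

    std::string unpack(const std::vector<char>& packed, size_t raw_size) const {
        std::string raw(raw_size, '\0');
        size_t n = ZSTD_decompress_usingDDict(dctx, &raw[0], raw.size(),
                                              packed.data(), packed.size(), ddict);
        raw.resize(ZSTD_isError(n) ? 0 : n);
        return raw;
    }
};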

  17. #16
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
I suspect the dictionary will work fine even on blocks of 100-200 bytes.
do you have an idea of the approximate size of the entropy tables for such a small block, filled with "usual" filenames?

overall, for such cases (chess databases too), i think we may try separate entropy tables shared by multiple compressed blocks. they should be kept uncompressed, so they should be shared among 100-1000 KB of data.

  18. #17
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
    This is already done in the dictionary

  19. #18
    Programmer Bulat Ziganshin's Avatar
    Join Date
    Mar 2007
    Location
    Uzbekistan
    Posts
    4,537
    Thanks
    758
    Thanked 676 Times in 366 Posts
then why does one need to compress 100-200 bytes together to get a reasonable ratio?

in my idea, the overhead of decompressing data in 20-byte blocks is just an unused remainder of the last byte of compressed data, plus the structure holding all those compressed strings. the compression ratio even on a single filename may be 2-3x due to entropy coding plus lz matches for popular text strings (although an xwrt-style dict should compress even better than lz matches - since it encodes whole words rather than raw LZ positions)

sorry for my ignorance, i don't know how your dict really works, and i'm not sure whether it has been described anywhere outside of the sources themselves

  20. #19
    Member
    Join Date
    Sep 2008
    Location
    France
    Posts
    869
    Thanks
    470
    Thanked 261 Times in 108 Posts
There are other sources of inefficiency. Entropy headers are just one of them.

Frame and block headers are still there, for example.
They are small, but they matter at such a small scale.
By the way, it's even possible to get rid of them by using the direct block API.
This is not recommended in general, since interoperability is lost in the process, but when the data to compress is really very small, it can make a difference.

i don't know how your dict really works, and i'm not sure whether it has been described anywhere outside of the sources themselves
    It's documented in the Zstandard compression format specification, though it only explains what's there, not why.
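
For completeness, a rough sketch of what that block-level path looks like. It leans on zstd's experimental ZSTD_STATIC_LINKING_ONLY section (ZSTD_compressBegin/ZSTD_compressBlock and their decompression counterparts), so treat the exact calls as an assumption to check against the zstd headers you build with:

Code:
#define ZSTD_STATIC_LINKING_ONLY   // block-level API lives in the experimental section
#include <zstd.h>
#include <string>
#include <vector>

// Compress one tiny unit with no frame/block headers at all; the caller must
// remember the original size and whether the unit was stored raw.
// Blocks are limited to 128 KB, which is irrelevant for filename-sized units.
std::vector<char> pack_block(ZSTD_CCtx* cctx, const std::string& unit, bool* stored_raw) {
    ZSTD_compressBegin(cctx, /*compressionLevel=*/3);
    std::vector<char> out(ZSTD_compressBound(unit.size()));
    size_t n = ZSTD_compressBlock(cctx, out.data(), out.size(),
                                  unit.data(), unit.size());
    if (ZSTD_isError(n) || n == 0) {           // 0 means "not compressible, store raw"
        *stored_raw = true;
        return std::vector<char>(unit.begin(), unit.end());
    }
    *stored_raw = false;
    out.resize(n);
    return out;
}

std::string unpack_block(ZSTD_DCtx* dctx, const std::vector<char>& packed,
                         size_t raw_size, bool stored_raw) {
    if (stored_raw) return std::string(packed.begin(), packed.end());
    std::string raw(raw_size, '\0');
    ZSTD_decompressBegin(dctx);
    size_t n = ZSTD_decompressBlock(dctx, &raw[0], raw.size(),
                                    packed.data(), packed.size());
    raw.resize(ZSTD_isError(n) ? 0 : n);
    return raw;
}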
    Last edited by Cyan; 16th December 2016 at 08:38.

  21. Thanks:

    Shelwien (16th December 2016)

  22. #20
    Member RichSelian's Avatar
    Join Date
    Aug 2011
    Location
    Shenzhen, China
    Posts
    167
    Thanks
    20
    Thanked 59 Times in 28 Posts
small strings are difficult to compress because:
1. you cannot get enough information for model prediction
2. you have to store some metadata, which cannot be ignored for small data

instead of choosing a good compression algorithm, you should think about the way you store the data. for example, HBase stores a huge number of strings by sorting them in order and merging their common prefixes.
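
a tiny sketch of that prefix-merging idea (front coding, as used by sorted stores like HBase's block format; the struct and the replay-from-the-start lookup here are just illustrative):

Code:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Front coding: keep the strings sorted and store, for each entry, only the length
// of the prefix shared with the previous entry plus the remaining suffix.
struct FrontCoded {
    std::vector<std::pair<uint8_t, std::string>> entries;   // (shared prefix len, suffix)

    explicit FrontCoded(std::vector<std::string> names) {
        std::sort(names.begin(), names.end());               // sorting maximizes sharing
        std::string prev;
        for (const auto& s : names) {
            std::size_t k = 0;
            while (k < s.size() && k < prev.size() && k < 255 && s[k] == prev[k]) ++k;
            entries.push_back({static_cast<uint8_t>(k), s.substr(k)});
            prev = s;
        }
    }

    // Rebuild the i-th string by replaying prefixes from the start.
    std::string get(std::size_t i) const {
        std::string cur;
        for (std::size_t j = 0; j <= i; ++j)
            cur = cur.substr(0, entries[j].first) + entries[j].second;
        return cur;
    }
};

in practice you'd restart the prefix chain every N entries (restart points, as HBase and LevelDB blocks do) so get() only replays a short run, and you can still entropy-code the suffixes on top of it.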

