Results 1 to 4 of 4

Thread: Compression for random access

  1. #1
    Member
    Join Date
    May 2020
    Location
    AZ
    Posts
    1
    Thanks
    0
    Thanked 0 Times in 0 Posts

    Compression for random access

    Not sure if this is on topic here, but I am interested on opinions.

    Problem:
    I have tables of around 100k text strings from 1-1500 characters and am looking for a strategy for storing compressed versions for random access. Insertion performance is not a concern.

    What I've done:
    I take all the strings and based on some silly weighting logic based on length and occurrences, I turn it into a list of shared substrings.
    That list of substrings is then turned into one block of text by merging largest overlaps.
    This block of text (usually around 30k) becomes the training data for an LZMA encoder.
    I modified the 7zip source to be able to (after encoding or decoding the training data) quickly restore state and be able to process each entry.

    I know this whole effort is rather silly (it would probably be better to just compress the whole database and be done with it), but does this approach make sense? Was LZMA a good choice? I've only just now discovered the Large Text Compression Benchmark (which is how I ended up here somehow) so I have a lot to play with now.

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    Hi, welcome.

    Somehow we don't have much string compression implementations, but I feel that 1500-byte strings may be short enough to try something more advanced than LZ (maybe PPM?).

    The most recent related topic: https://encode.su/threads/3322-LittleBit-0-1-alpha https://github.com/kapenga/LittleBit/

    Also zstd and brotli have integrated dictionary support... zstd even has integrated dictionary generation from a set of files.

    This is my hack for ppmd dictionary mode: http://nishi.dreamhosters.com/u/pmd_dict_v0s.7z

    This is plain arithmetic coder for random access to strings with bit alignment: https://encode.su/threads/3174-Finit...ll=1#post61256
    Should be possible to switch to order1-2 without much changes.
    Attached Files Attached Files

  3. Thanks:

    Kaw (4th May 2020)

  4. #3
    Member
    Join Date
    Apr 2009
    Location
    The Netherlands
    Posts
    83
    Thanks
    6
    Thanked 18 Times in 11 Posts
    Thanks Shelwien for mentioning LittleBit.

    LittleBit is made for this purpose. You should be able to compress the tables to sizes that are on par with common mid range compression algorithms. Compression is slow but 100k tables should be a matter of seconds. Because of the static nature of the Huffman table decompression is as fast as your IO can handle.
    Also your approach seems to be a little complicated for the subject. If the tables are static of nature then LittleBit might be the tool for you.

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,977
    Thanks
    296
    Thanked 1,304 Times in 740 Posts
    > Also your approach seems to be a little complicated for the subject.

    I think he just added external dictionary support to LZMA and used some heuristic method to generate shared dictionary for strings.

    Its actually a good idea imho - LZMA has better compression and lower overhead (no headers) than zstd/brotli.
    Could be good to also use this patch: https://encode.su/threads/1546-init-...terals-in-lzma

Similar Threads

  1. random compression FAQ
    By Shelwien in forum Random Compression
    Replies: 7
    Last Post: 23rd February 2020, 04:39
  2. Lossy compression of random numbers
    By byronknoll in forum Data Compression
    Replies: 5
    Last Post: 18th January 2020, 23:39
  3. Random compression
    By Shelwien in forum Random Compression
    Replies: 6
    Last Post: 29th September 2019, 07:51
  4. Replies: 41
    Last Post: 6th May 2016, 19:13
  5. Random reads vs random writes
    By Piotr Tarsa in forum The Off-Topic Lounge
    Replies: 22
    Last Post: 16th May 2011, 09:58

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •