
Thread: LZ style compression with static Dictionary

  1. #1
    Member
    Join Date
    Aug 2018
    Location
    Gummersbach / Germany
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts

    LZ style compression with static Dictionary

    Hi, I'm looking for an LZ-style compressor, or a tool that creates/uses an optimal static dictionary for matches instead of a sliding window. Currently I'm using a quite simple LZ-style compressor with a 256-byte rolling history buffer, but this small window hurts compression a lot. The target system is a C64 with only 64 KB of RAM. The files I need to shorten are prepacked using a special RLE compressor. I only decompress one byte at a time, which won't be saved to memory, so I can't use previously decompressed data as a sliding window. I'm using this encoding right now:

    %xyyyyyyy: x=0: copy y (up to 127) literals from the input / x=1: match of length y (up to 127) bytes, and the following byte is the distance into the rolling history buffer

    My idea is to get rid of the history buffer and use a static match dictionary which holds all compressible strings, and decode using this:

    %xyyyyyyy: x=0: copy y (up to 127) literals from the input / x=1: match of length y (up to 127) bytes, and the next two bytes are the index into the dictionary, which won't grow larger than 64 KB
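    To make the format concrete, a decoder for it would look roughly like this (just a sketch; the little-endian index and the emit() callback are only illustrative, nothing is fixed yet):

    #include <cstdint>
    #include <cstddef>

    // Decoder sketch for the proposed token format. Assumptions (not decided yet):
    // the 16-bit dictionary index is little-endian, and emit() is whatever the
    // player does with each decompressed byte; both names are placeholders.
    void decode(const uint8_t* in, size_t inSize,
                const uint8_t* dict,              // static dictionary, <= 64 KB
                void (*emit)(uint8_t))            // one byte at a time, never stored
    {
        size_t p = 0;
        while (p < inSize) {
            uint8_t token = in[p++];
            uint8_t len   = token & 0x7F;         // y: run length 1..127
            if ((token & 0x80) == 0) {
                for (uint8_t i = 0; i < len; ++i) // x=0: copy literals from the input
                    emit(in[p++]);
            } else {
                uint16_t idx = in[p] | (in[p + 1] << 8);
                p += 2;                           // x=1: 16-bit index into the dictionary
                for (uint8_t i = 0; i < len; ++i)
                    emit(dict[idx + i]);
            }
        }
    }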

    The compressed data should consist only of incompressible literals and matches against the static dictionary. Zstd has an option to create a dictionary, but it refuses to work with too small a dataset. My files are only between 10 and 60 KB. SmalLZ4 can use a dictionary, but I didn't find a tool to create one.

    Any ideas regarding this topic?

  2. #2
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,842
    Thanks
    288
    Thanked 1,244 Times in 697 Posts
    Well, ideally you just need to implement optimal parsing for your method.
    Left-to-right, enumerate all ways to reach pos+len from pos and record the one with the minimum total code length;
    then right-to-left, reconstruct the optimal sequence that reaches endpos.
    But it's not too easy to write, so nobody would write it for you.
    I suppose you can try to modify something like lz4x: https://github.com/encode84/lz4x/tree/master/src
    Or any other codec with optimal parsing: https://github.com/inikep/lzbench https://github.com/powturbo/TurboBench
    But I doubt it would be of much help.
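    A sketch of that DP for your token format (written as the equivalent right-to-left pass; findLongestDictMatch() is a placeholder you'd have to supply, and the minimum match length of 4 is taken from your post):

    #include <cstdint>
    #include <cstddef>
    #include <vector>

    // Hypothetical helper: length of the longest dictionary match (4..127 bytes)
    // starting at data[i], 0 if none; dictIdx receives its position in the dictionary.
    size_t findLongestDictMatch(const uint8_t* data, size_t i, size_t n, uint16_t& dictIdx);

    // cost[i] = minimal number of compressed bytes needed to encode data[i..n).
    size_t optimalParseSize(const uint8_t* data, size_t n, std::vector<size_t>& step)
    {
        const size_t INF = 2 * n + 16;             // safely above any reachable cost
        std::vector<size_t> cost(n + 1, INF);
        step.assign(n + 1, 1);
        cost[n] = 0;

        for (size_t i = n; i-- > 0; ) {
            // option 1: literal run of l bytes = 1 token byte + l literal bytes
            for (size_t l = 1; l <= 127 && i + l <= n; ++l)
                if (1 + l + cost[i + l] < cost[i]) { cost[i] = 1 + l + cost[i + l]; step[i] = l; }

            // option 2: dictionary match of l bytes = 1 token byte + 2 index bytes
            uint16_t idx;
            size_t maxLen = findLongestDictMatch(data, i, n, idx);
            for (size_t l = 4; l <= maxLen && i + l <= n; ++l)
                if (3 + cost[i + l] < cost[i]) { cost[i] = 3 + cost[i + l]; step[i] = l; }
        }
        // Walking step[] forward from position 0 gives the optimal segmentation;
        // recording whether each step was a literal run or a match (and at which
        // dictionary index) is omitted here for brevity.
        return cost[0];
    }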

    I can also suggest a simpler way: generate a dictionary based on lzma's optimal parsing.
    It's rather close to truly optimal, so it works pretty well - I actually did this before.
    1. Take http://nishi.dreamhosters.com/u/lzma_markup_v0.rar
    2. Run "lzma e sample sample.lzma", then "dump d sample.lzma sample.rec".
    3. Then you can either directly parse sample.rec
    (it's the sequence of lzma tokens without entropy coding; there's source for parsing it in http://nishi.dreamhosters.com/u/lzma_delrep_v1.rar),
    or use the file "output" which is additionally generated by dump.exe and lists the token type for each input byte.
    4. Merge sequences of bytes with the same id in "output" into strings (i.e. "30 0\n31 0\n61 2\n62 2\n" -> string1="01", string2="ab").
    5. Sort the strings and keep the frequently occurring ones for the dictionary (see the sketch below).
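    For steps 4 and 5, something along these lines should do (a sketch only; it assumes each line of "output" is "<hex byte> <token id>" like in the example above - check the actual file layout):

    #include <algorithm>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    int main()
    {
        FILE* f = std::fopen("output", "r");
        if (!f) return 1;

        std::map<std::string, unsigned> freq;      // candidate string -> occurrence count
        std::string cur;
        unsigned byte = 0, id = 0, lastId = ~0u;

        // Step 4: merge consecutive bytes that share a token id into one string.
        while (std::fscanf(f, "%x %u", &byte, &id) == 2) {
            if (id != lastId && !cur.empty()) {    // token id changed: close the run
                if (cur.size() >= 4) ++freq[cur];  // keep only dictionary-worthy lengths
                cur.clear();
            }
            cur += static_cast<char>(byte);
            lastId = id;
        }
        if (cur.size() >= 4) ++freq[cur];
        std::fclose(f);

        // Step 5: sort by frequency, keep the strings that occur more than once.
        std::vector<std::pair<unsigned, std::string>> cand;
        for (const auto& kv : freq)
            if (kv.second > 1) cand.emplace_back(kv.second, kv.first);
        std::sort(cand.rbegin(), cand.rend());
        for (const auto& c : cand)                 // note: raw bytes, not escaped
            std::printf("%u\t%s\n", c.first, c.second.c_str());
        return 0;
    }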

    Alternatively, the same can be done using deflate's optimal parsing - use 7-zip/kzip to compress the sample,
    remove the entropy coding, and collect the frequent match strings.
    It turns out I actually have better tools for the deflate case: http://nishi.dreamhosters.com/u/defl_matches_v0.rar

  3. Thanks (2):

    Gonzalo (8th August 2018),Thcm (8th August 2018)

  4. #3
    Member
    Join Date
    Aug 2018
    Location
    Gummersbach / Germany
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Your examples look very interesting. Thanks a lot! Using existing compressors and extracting the matches is a nice idea. I think I have to "cripple" the encoder before compression so it only allows matches > 3 and < 128 bytes long, and collect all the matches for the dictionary. In the next step I think I have to sort the matches from largest to smallest and discard any smaller match that is contained in a larger one (see the sketch below). After that I can use my LZ compressor and check for matches inside the dictionary instead of the source data.
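    For the pruning step I'm thinking of something like this (purely illustrative, quadratic over the candidate list):

    #include <algorithm>
    #include <string>
    #include <vector>

    std::vector<std::string> pruneSubstrings(std::vector<std::string> cands)
    {
        // longest first, so every potential container comes before its substrings
        std::sort(cands.begin(), cands.end(),
                  [](const std::string& a, const std::string& b) { return a.size() > b.size(); });

        std::vector<std::string> kept;
        for (const std::string& s : cands) {
            bool contained = false;
            for (const std::string& k : kept)
                if (k.find(s) != std::string::npos) { contained = true; break; }
            if (!contained) kept.push_back(s);       // keep only strings not covered by a larger one
        }
        return kept;
    }

    The kept strings, concatenated, would then form the dictionary, and each match index would simply be the offset of a string inside that concatenation.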

    I'm no expert in this territory, and the more I think about creating the perfect dictionary for maximum compression, the more it seems that special treatment is needed, like you explained first. LZMA and Deflate only find matches in already decompressed data and won't scan the whole file from the beginning. Maybe I'm wrong, but putting matches in different order into the dictionary might lead to new and better matches.

  5. #4
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,842
    Thanks
    288
    Thanked 1,244 Times in 697 Posts
    > putting matches in different order into the dictionary might lead to new and better matches

    Well, you can also try to implement iterative optimization for the dictionary.
    Like, initially fill the dictionary with some set of strings (it could be match strings like above, or simply all strings of len=4..127 from your data),
    calculate the compressed size with that dictionary, then try to modify it (delete a string, or swap a random pair of strings), keep the modified
    dictionary if the result is better, and repeat the process until it gets boring.
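    A sketch of that hill climbing; compressedSize() here just stands for running your own encoder against the candidate dictionary, which isn't shown:

    #include <cstdlib>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical: compressed size of your data when encoded against this dictionary.
    size_t compressedSize(const std::vector<std::string>& dict);

    std::vector<std::string> optimizeDict(std::vector<std::string> dict, int rounds)
    {
        size_t best = compressedSize(dict);
        for (int r = 0; r < rounds && dict.size() > 1; ++r) {
            std::vector<std::string> trial = dict;
            if (std::rand() & 1) {                          // mutation 1: delete a random string
                trial.erase(trial.begin() + std::rand() % trial.size());
            } else {                                        // mutation 2: swap a random pair
                std::swap(trial[std::rand() % trial.size()],
                          trial[std::rand() % trial.size()]);
            }
            size_t size = compressedSize(trial);
            if (size < best) { best = size; dict = std::move(trial); }  // keep only improvements
        }
        return dict;
    }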

  6. #5
    Member
    Join Date
    Jan 2014
    Location
    Bothell, Washington, USA
    Posts
    695
    Thanks
    153
    Thanked 182 Times in 108 Posts
    Quote Originally Posted by Thcm View Post
    Hi, I'm looking for an LZ-style compressor, or a tool that creates/uses an optimal static dictionary for matches instead of a sliding window.
    You could look at GLZA. The compressor attempts to create an optimal static dictionary, but it's a difficult problem so it is not perfect. The dictionary is hierarchical and the encoder sends the dictionary dynamically, so perhaps not exactly what you want, although it wouldn't be that hard to change either. There are options to limit the dictionary size and even a relatively small dictionary can often be fairly effective. There are also options to print the dictionary.

  7. #6
    Member
    Join Date
    Aug 2018
    Location
    Gummersbach / Germany
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by Shelwien View Post
    > putting matches in different order into the dictionary might lead to new and better matches

    Well, you can also try to implement iterative optimization for the dictionary.
    Like, initially fill the dictionary with some set of strings (it could be match strings like above, or simply all strings of len=4..127 from your data),
    calculate the compressed size with that dictionary, then try to modify it (delete a string, or swap a random pair of strings), keep the modified
    dictionary if the result is better, and repeat the process until it gets boring.
    The size of the data is quite small and simple brute forcing might be good enough. I'll have a look at GLZA too. I found this on the web and it looks like it nearly does what I need: https://blog.cloudflare.com/improvin...te-dictionary/

  8. #7
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,842
    Thanks
    288
    Thanked 1,244 Times in 697 Posts
    > I found this on the web and it looks like it nearly does what I need

    Based on your first post, it's not, though?
    The idea there is to use the literal strings at the start of files, which are left uncompressed by deflate,
    to build a dictionary and improve compression (for multiple separately compressed files),
    while you need actual matches to be in the dictionary.

    Also, that blog post doesn't mention optimal parsing, which is important.

    As for GLZA, I also immediately thought about it, but it's likely too complicated for your purpose.
    Although I guess it may be possible to use it to build the dictionary for you.

  9. #8
    Member
    Join Date
    Aug 2018
    Location
    Gummersbach / Germany
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    It's not perfect and it looks like it doesn't do optimal parsing, but it's a start. It produces a usable dictionary for my specs (match length 4-127), and when I choose a large enough dictionary and window size, all the matches needed for decoding are there.

  10. #9
    Member
    Join Date
    Mar 2013
    Location
    Berlin
    Posts
    46
    Thanks
    14
    Thanked 72 Times in 31 Posts
    Quote Originally Posted by Thcm View Post
    SmalLZ4 can use a dictionary, but I didn't find a tool to create one.
    It's exactly the same kind of dictionary that LZ4 uses. Unfortunately, there is very little documentation on LZ4's dictionaries.

    Basically it's a simple text file that is virtually prepended to the data stream before compression/decompression.
    That means: usually the sliding window is initially empty - but if you include a dictionary then your dictionary will be the initial sliding window.

    As you process more and more data, the dictionary slowly moves out of the sliding window.
    So dictionaries are only useful for files smaller than the LZ4 sliding window (< 64k).
    I ran most of my tests on source code files (C++/JavaScript/JSON) and stored the most common keywords in the dictionary.

    Maybe I'll write a few words about LZ4 dictionaries on my website, too.
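    For reference, the reference LZ4 library exposes the same idea through its streaming API, roughly like this (simplified, no error handling):

    #include <lz4.h>
    #include <vector>

    std::vector<char> compressWithDict(const std::vector<char>& dict,
                                       const std::vector<char>& data)
    {
        LZ4_stream_t* s = LZ4_createStream();
        LZ4_loadDict(s, dict.data(), (int)dict.size());          // dictionary = initial window

        std::vector<char> out(LZ4_compressBound((int)data.size()));
        int n = LZ4_compress_fast_continue(s, data.data(), out.data(),
                                           (int)data.size(), (int)out.size(), 1);
        LZ4_freeStream(s);
        out.resize(n > 0 ? n : 0);
        return out;
    }

    std::vector<char> decompressWithDict(const std::vector<char>& dict,
                                         const std::vector<char>& comp, int originalSize)
    {
        std::vector<char> out(originalSize);
        int n = LZ4_decompress_safe_usingDict(comp.data(), out.data(),
                                              (int)comp.size(), originalSize,
                                              dict.data(), (int)dict.size());
        out.resize(n > 0 ? n : 0);
        return out;
    }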

  11. #10
    Member
    Join Date
    Aug 2018
    Location
    Gummersbach / Germany
    Posts
    5
    Thanks
    1
    Thanked 0 Times in 0 Posts
    Quote Originally Posted by stbrumme View Post
    It's exactly the same kind of dictionary that LZ4 uses. Unfortunately, there is very little documentation on LZ4's dictionaries.

    Basically it's a simple text file that is virtually prepended to the data stream before compression/decompression.
    That means: usually the sliding window is initially empty - but if you include a dictionary then your dictionary will be the initial sliding window.

    As you process more and more data, the dictionary slowly moves out of the sliding window.
    So dictionaries are only useful for files smaller than the LZ4 sliding window (< 64k).
    I ran most of my tests on source code files (C++/JavaScript/JSON) and stored the most common keywords in the dictionary.

    Maybe I'll write a few words about LZ4 dictionaries on my website, too.
    It would be nice if you included an option to create a dictionary from one or more files, like zstd does. My files or data streams are always smaller than 64 KB, so the dictionary is always smaller than 64 KB too. When compressing lots of small files with a dictionary, there might be no need to use solid compression and to sort the files by type to achieve good compression ratios. There isn't much info on the net regarding the creation of dictionaries, and I'm looking forward to your explanations.

