
Thread: Tunstall dictionary generation

  1. #1

    Tunstall dictionary generation

    How would you generate a near-optimal Tunstall dictionary? I'm aiming for a small one: up to 256 entries, up to 512 bytes of total entry data. The source files are small, 16kb each, with 5-200 used byte values per file. This is for fun, for a use case where FSE (tANS) is too slow and uses too much RAM.
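    For context, the standard Tunstall construction (compared against in the results below) builds the dictionary from symbol probabilities alone, by repeatedly expanding the most probable leaf into one child per symbol. A minimal sketch, assuming probabilities given as a dict and dictionary entries represented as symbol tuples:

    ```python
    import heapq

    def tunstall_dictionary(probs, max_entries):
        """Standard Tunstall: repeatedly expand the most probable leaf
        into one child per symbol until the entry limit is reached."""
        # each leaf is (-probability, word); the min-heap pops the most probable
        heap = [(-p, (s,)) for s, p in probs.items()]
        heapq.heapify(heap)
        # expanding one leaf removes it and adds |alphabet| children,
        # a net gain of |alphabet| - 1 entries
        while len(heap) + len(probs) - 1 <= max_entries:
            neg_p, word = heapq.heappop(heap)
            for s, p in probs.items():
                heapq.heappush(heap, (neg_p * p, word + (s,)))
        return sorted(word for _, word in heap)
    ```

    This yields a complete prefix-free set of source words that the decoder can rebuild from the probabilities alone, which is why the first variants in the results below don't need to ship the dictionary.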

    I tried a few experiments, and while I got good results, it's clear it's possible to do better; in one case my most advanced setup actually did worse than a simpler one. Marlin does better too, though it's not directly comparable due to its huge dictionaries.

    My brute-force approach:
    - for match lengths 2 to 128, count how many unique matches are found in the entire file
    - pick the match with the largest "unique matches * match length" product
    - greedily erase those bytes from the source data and add the match to the dictionary
    - repeat until the dictionary is big enough

    This takes about two seconds to compress a 16kb file; standard Tunstall takes a few ms, and the most-probable Tunstall variant about half a second.
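    The brute-force steps above could be sketched roughly like this (a naive Python sketch; it counts overlapping occurrences as a stand-in for the "unique matches" count, and the 256-entry / 512-byte limits are parameters):

    ```python
    def brute_force_dictionary(data: bytes, max_entries=256, max_bytes=512):
        """Greedy brute force: repeatedly pick the substring with the
        highest count*length score, add it to the dictionary, and erase
        its occurrences from the data."""
        dictionary, used = [], 0
        buf = bytes(data)
        while len(dictionary) < max_entries:
            best, best_score = None, 0
            for length in range(2, 129):
                if length > len(buf):
                    break
                counts = {}
                for i in range(len(buf) - length + 1):
                    chunk = buf[i:i + length]
                    counts[chunk] = counts.get(chunk, 0) + 1
                for chunk, n in counts.items():
                    if n * length > best_score:
                        best, best_score = chunk, n * length
            if best is None or used + len(best) > max_bytes:
                break
            dictionary.append(best)
            used += len(best)
            buf = buf.replace(best, b"")  # greedy erase of all occurrences
        return dictionary
    ```

    Counting every substring of every length this way is quadratic-ish per pass, which matches the seconds-per-file cost mentioned above; a suffix-array-based counter would be the obvious speedup.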

    Tunstall experiments over various 16kb files
    The first three variants do not ship the dictionary; they generate it
    from the probabilities. The header is included in the compressed size.
    The last variant brute-forces a small dictionary, and ships it.
    sparseblock, entropy 5.54096 bytes
    Standard Tunstall:      3287 bytes, dict size 1106
    Most probable:          80 bytes, dict size 32387
    +rounded prob:          80 bytes, dict size 32387
    Brute:                  142 bytes + dict 10+217
    FSE:                    14 bytes
    sparseblock2, entropy 1.93033 bytes
    Standard Tunstall:      2058 bytes, dict size 1554
    Most probable:          75 bytes, dict size 32641
    +rounded prob:          75 bytes, dict size 32641
    Brute:                  133 bytes + dict 9+231
    FSE:                    8 bytes
    sparseblock3, entropy 775.3 bytes
    Standard Tunstall:      8224 bytes, dict size 493
    Most probable:          1267 bytes, dict size 1982
    +rounded prob:          986 bytes, dict size 28459
    Brute:                  756 bytes + dict 88+339
    FSE:                    792 bytes
    shinjiblock, entropy 884.723 bytes
    Standard Tunstall:      5600 bytes, dict size 573
    Most probable:          1342 bytes, dict size 2696
    +rounded prob:          1512 bytes, dict size 29902
    Brute:                  1839 bytes + dict 62+268
    FSE:                    900 bytes
    ntrblock, entropy 4261.65 bytes
    Standard Tunstall:      12570 bytes, dict size 8086
    Most probable:          6636 bytes, dict size 10312
    +rounded prob:          6393 bytes, dict size 11902
    Brute:                  6259 bytes + dict 256+431
    FSE:                    4400 bytes
    block.abs, entropy 10628.6 bytes
    Standard Tunstall:      13918 bytes, dict size 8044
    Most probable:          13918 bytes, dict size 8044
    +rounded prob:          13918 bytes, dict size 8044
    Brute:                  12242 bytes + dict 256+364
    FSE:                    10796 bytes
    Marlin:                 11136 bytes (Marlin uses large built-in dicts)

  2. #2
    edit: sorry, I have since learned what Tunstall does; I don't think my approach helps your problem.

    @zeit have a look at my answer in my thread about a dictionary compressor for small strings. It seems related to your problem.

    > pick the match that gives the largest "unique matches * match length" multiplied value
    Great idea; currently I simply take the first largest match.

    My dictionary is generated this way:
    1. collect all ngrams (your matches) of size 3-8, counting them
    2. sort them by frequency
    3. iterate from the largest count to the smallest
    4. add each ngram to the dictionary:
    4.1 if the ngram to add is a substring of any ngram already present in the dictionary, skip it
    4.2 if some prefix or suffix of a dictionary ngram equals some suffix or prefix of the ngram to add,
    merge them into a new dictionary ngram (using the longest overlap)
    4.3 otherwise, simply add the ngram
    5. loop to 4 while the dictionary size is less than the size you need
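    Steps 1-5 above could look roughly like this (a hedged Python sketch: ngram counting includes overlapping occurrences, and only the first entry with an overlap gets merged):

    ```python
    from collections import Counter

    def build_dictionary(data: bytes, max_size: int):
        """Sketch of the steps above: collect 3..8-byte ngrams, then add
        them by descending frequency, skipping ngrams already covered and
        merging overlapping prefixes/suffixes into longer entries."""
        counts = Counter()
        for n in range(3, 9):
            for i in range(len(data) - n + 1):
                counts[data[i:i + n]] += 1
        dictionary = []
        for ngram, _ in counts.most_common():
            if sum(map(len, dictionary)) >= max_size:
                break  # step 5: stop once the dictionary is big enough
            if any(ngram in entry for entry in dictionary):
                continue  # step 4.1: already covered by an existing entry
            merged = False
            for idx, entry in enumerate(dictionary):
                # step 4.2: merge on the longest prefix/suffix overlap
                for k in range(min(len(entry), len(ngram)) - 1, 0, -1):
                    if entry[-k:] == ngram[:k]:
                        dictionary[idx] = entry + ngram[k:]
                        merged = True
                        break
                    if ngram[-k:] == entry[:k]:
                        dictionary[idx] = ngram + entry[k:]
                        merged = True
                        break
                if merged:
                    break
            if not merged:
                dictionary.append(ngram)  # step 4.3
        return dictionary
    ```

    On a repetitive input the merging in step 4.2 chains the frequent ngrams into a single long entry, which is exactly the "double the information in the same size" effect described below.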

    Step 4 is great: it lets me fit double the ngrams in the dictionary compared to simply adding the most frequent ones. So you double the "information" you pack into a dictionary of the same size. I copied this idea from many programs that do dictionary coding.

    Later I take this initial dictionary and mutate and improve it with a GA (genetic algorithm). That worked pretty well.

