Thread: How to use text file as dictionary?


    Here is what I want to do...

    My plan is to take a big JSON snapshot, e.g. every 4 hours, compressed on its own with no dependency or dictionary. Then, every 20 minutes, I'd produce a compressed 'changes' file that can rebuild the current state without needing to include the data that was already present in the last 4-hour snapshot.

    Are there any easy tools I can play around with? I like LZMA since it has about the right speed and efficiency for what I want. Ideally I would want to be able to decompress this in JavaScript.

    Is there a specific term I should be using to describe what I want to do? Is there any way to do this with 7-zip?

    I could keep track of just the changes, and just compress that by itself, but I'll miss out on compression efficiency :\.

    Assuming you are generating that change file somehow, liblzma has functionality for using a dictionary for this purpose (which you'd create from the snapshot), but you'd have to build the tool around it yourself.
    Alternatively, you could use the dictionary feature built into brotli or zstd (see their command-line help):

    zstd snapshot20mn -D reference4h

    and for decompression

    zstd -d snapshot20mn.zst -D reference4h

    For higher compression (closer to lzma):
    zstd -19 snapshot20mn -D reference4h

    If reference4h is really big (>8 MB), you'll want to add the --long option, or try an ultra level (--ultra with -20, -21 or -22).
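
    For experimenting without the zstd CLI, the same idea can be sketched in Python with zlib's preset-dictionary support (the zdict parameter). This is a stand-in and not zstd: deflate only looks back 32 KB, so only the tail of the snapshot is useful as a dictionary. The JSON data and the helper names below are made up for illustration.

```python
import json
import zlib


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Compress data with zlib, priming the window with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Decompress a blob produced by compress_with_dict with the same dictionary."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# Hypothetical data: a "4-hour snapshot" and a "20-minute update" that reuses
# much of the snapshot's text.
snapshot = json.dumps(
    [{"id": i, "name": f"user{i}", "status": "active"} for i in range(200)]
).encode()
update = json.dumps(
    [{"id": i, "name": f"user{i}", "status": "idle"} for i in range(50, 60)]
).encode()

# Deflate can only reference the last 32 KB, so keep only the snapshot's tail.
reference = snapshot[-32768:]

packed = compress_with_dict(update, reference)
plain = zlib.compress(update, 9)
restored = decompress_with_dict(packed, reference)

print("raw update:        ", len(update))
print("zlib, no dict:     ", len(plain))
print("zlib, with dict:   ", len(packed))
print("round trip matches:", restored == update)
```

    The decompressor must be given exactly the same dictionary bytes, which is the same constraint as passing -D reference4h on both the compression and decompression commands.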

    > I could keep track of just the changes, and just compress that by itself, but I'll miss out on compression efficiency :\.

    Note that, barring some specific corner cases, the expected savings from a dictionary are not that large:
    about a few KB per file.
    That may sound like very little, but when files are only a few KB, it's actually a lot.
    However, when files become much larger, say >1 MB, the gain is comparatively tiny, and therefore not worthwhile.

    The effect of dictionary compression can be easily tested with any sequential compression algorithm, via diff(compress(dict), compress(dict+file)):
    100,000,000 enwik8
        465,211 dict1        // DRT dictionary
        133,741 dict.7z      // 7z a -m0=lzma dict dict1
     25,895,917 file.7z      // 7z a -m0=lzma file enwik8
     26,015,709 dict+file.7z // 7z a -m0 dict+file dict1 enwik8
     25,919,495 file.patch   // hdiffz.exe dict.7z dict+file.7z file.patch
    // 25,919,495 > 25,895,917 - bad dictionary

    100,000,000 enwik8
        768,771 dict1        // BOOK1
        261,068 dict.7z
     25,895,917 file.7z
     26,140,756 dict+file.7z
     25,879,830 file.patch
    // 25,879,830 < 25,895,917 - better?
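
    The comparison above can be reproduced in miniature with Python's lzma module. This sketch estimates the marginal cost of the file as compress(dict+file) minus compress(dict), using plain size subtraction instead of hdiffz. The function name and the synthetic data are assumptions for illustration, so treat the numbers as a rough indication only.

```python
import json
import lzma
import random


def dictionary_gain(dictionary: bytes, data: bytes) -> int:
    """Estimate bytes saved by compressing data after dictionary instead of alone."""
    alone = len(lzma.compress(data))
    dict_only = len(lzma.compress(dictionary))
    combined = len(lzma.compress(dictionary + data))
    # Approximate cost of the data once the dictionary has already been seen.
    marginal = combined - dict_only
    return alone - marginal


rng = random.Random(0)  # fixed seed so runs are repeatable

# Hypothetical snapshot, plus a "changes" file that reuses 50 of its records.
records = [{"id": i, "value": rng.random()} for i in range(500)]
snapshot = json.dumps(records).encode()
changes = json.dumps(records[100:150]).encode()

# Data that shares nothing with the snapshot, for contrast.
unrelated = bytes(rng.getrandbits(8) for _ in range(2000))

gain_related = dictionary_gain(snapshot, changes)
gain_unrelated = dictionary_gain(snapshot, unrelated)

print("gain for related data:  ", gain_related)
print("gain for unrelated data:", gain_unrelated)
```

    The subtraction also removes one copy of the .xz container overhead, so a small positive number for unrelated data does not by itself indicate a useful dictionary.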

