Not sure if this is on topic here, but I am interested in opinions.
Problem:
I have tables of around 100k text strings, each 1-1500 characters long, and am looking for a strategy for storing compressed versions that still allow random access to individual entries. Insertion performance is not a concern.
What I've done:
I take all the strings and, using some admittedly silly weighting logic based on length and occurrence counts, turn them into a list of shared substrings.
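The exact weighting isn't important (mine is fairly arbitrary), but the rough idea in Python would be something like this simplified sketch, which scores fixed-length n-grams by how many bytes their repeats could save (not my real code):

```python
from collections import Counter

def candidate_substrings(strings, lengths=(8, 16, 32), top_k=200):
    """Toy stand-in for the weighting step: count fixed-length n-grams
    across all strings and score each by (occurrences - 1) * length,
    a rough proxy for the bytes a shared copy could save."""
    counts = Counter()
    for s in strings:
        for n in lengths:
            # count each n-gram at most once per string
            counts.update({s[i:i + n] for i in range(len(s) - n + 1)})
    scored = [((c - 1) * len(sub), sub) for sub, c in counts.items() if c > 1]
    scored.sort(reverse=True)
    return [sub for _, sub in scored[:top_k]]
```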
That list of substrings is then turned into a single block of text by repeatedly merging the largest overlaps.
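The merge is essentially a greedy shortest-common-superstring pass: keep joining the pair with the biggest suffix/prefix overlap until one block is left. Again, just a sketch of the idea rather than my actual code:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_block(substrings):
    """Greedily merge the pair with the largest overlap until one block remains."""
    # drop substrings that are already contained in another one
    pieces = [s for s in substrings
              if not any(s != t and s in t for t in substrings)]
    while len(pieces) > 1:
        best_k, best_i, best_j = 0, 0, 1
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = pieces[best_i] + pieces[best_j][best_k:]
        for idx in sorted((best_i, best_j), reverse=True):
            pieces.pop(idx)
        pieces.append(merged)
    return pieces[0] if pieces else ""
```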
This block of text (usually around 30 KB) becomes the training data for an LZMA encoder.
I modified the 7-Zip source so that, after encoding or decoding the training data, it can quickly restore that state and then process each entry independently.
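The state-restore trick is specific to my 7-Zip hack, but for anyone who wants to play with the general "shared dictionary, independently compressed entries" idea without patching an encoder, zlib exposes it directly through its preset-dictionary parameter. This is just an illustration of the concept, not what I actually did with LZMA:

```python
import zlib

def compress_entry(text: str, shared_block: bytes) -> bytes:
    # Prime the compressor with the shared block (zlib's preset dictionary),
    # then compress a single entry independently of all the others.
    c = zlib.compressobj(level=9, zdict=shared_block)
    return c.compress(text.encode("utf-8")) + c.flush()

def decompress_entry(blob: bytes, shared_block: bytes) -> str:
    # The decompressor must be primed with the same shared block.
    d = zlib.decompressobj(zdict=shared_block)
    return (d.decompress(blob) + d.flush()).decode("utf-8")
```

(Conveniently, zlib's window is 32 KB, so a ~30 KB shared block just about fits.)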
I know this whole effort is rather silly (it would probably be better to just compress the whole database and be done with it), but does this approach make sense? Was LZMA a good choice? I've only just now discovered the Large Text Compression Benchmark (which is how I ended up here somehow) so I have a lot to play with now.