https://paste.ee/p/qQroG
I implemented a 'best string finder', I think. It stores the possible sequences in a tree with counts, then it can efficiently find each string and its value. I did leave out the refining of the sorted list, but it's good enough to present. The value formula: a string's count * the letters in the string gives the savings; the letters in the string + the count*2 gives the cost to store the string and refer to it with a Huffman-style code (2 bytes = 16 bits per reference, enough if there are up to ~64,000 strings; currently manual adjusting is required). With cost and savings (they are printed, see code) you divide the savings by the cost to get the bitsPerByte ratio. The code then sorts them from greatest to least value. The best one from a run on 1 million bytes fed in was huge: you save/'store' ~5445 bytes for just ~253 bytes, a ratio of 5445/253 ≈ 21.5!
updated example (each row is [string, savings, cost, savings/cost]):
['[[Structure of Atlas Shrugged|section', 5624, 75, 74.98666666666666]
['Structure of Atlas Shrugged|section', 5600, 75, 74.66666666666667]
['[Structure of Atlas Shrugged|section', 5472, 74, 73.94594594594595]
['tructure of Atlas Shrugged|section', 5440, 74, 73.51351351351352]
['ructure of Atlas Shrugged|section', 5280, 73, 72.32876712328768]
['ucture of Atlas Shrugged|section', 5120, 72, 71.11111111111111]
['cture of Atlas Shrugged|section', 4960, 71, 69.85915492957747]
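To make the scoring concrete, here is a minimal Python sketch of the idea. It is not the pasted code: the brute-force substring counting (instead of the tree), the length limits, and the fixed 2-byte reference cost are my assumptions.

from collections import Counter

def best_strings(data: bytes, min_len=4, max_len=40, top=10):
    # Count every substring of length min_len..max_len (the real code uses a
    # tree with counts instead of this brute-force enumeration).
    counts = Counter()
    n = len(data)
    for i in range(n):
        for length in range(min_len, min(max_len, n - i) + 1):
            counts[data[i:i + length]] += 1

    rows = []
    for s, c in counts.items():
        if c < 2:
            continue  # a string seen only once can't save anything
        savings = c * len(s)     # count * letters in the string
        cost = len(s) + c * 2    # store the string once + a 2-byte (16-bit)
                                 # reference per occurrence (~64,000 strings max)
        rows.append([s.decode('latin-1'), savings, cost, savings / cost])

    rows.sort(key=lambda r: r[3], reverse=True)  # greatest value first
    return rows[:top]

if __name__ == '__main__':
    data = open('enwik8', 'rb').read(1_000_000)  # feed in 1 million bytes (file path assumed)
    for row in best_strings(data):
        print(row)

The rows it prints have the same shape as the example above; the tree version just collects the same counts without materializing every substring.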
Kennon's attempt to find the 'best strings' that are long and frequent is the same thing as Shelwien's Green algorithm https://encode.su/threads/541-Simple...xt-mixing-demo, except Shelwien's learns most of its context strings online. Notice that as Shelwien's tree/lists grow after eating what it generates, it gets longer branches with more stats on them, yup, long frequent strings. So when it steps on 16 bytes it basically sends them to the tree and finds the match, and instead of predicting whole strings it predicts just the next byte. Both their algorithms get enwik8's 100MB down to ~21MB. Shelwien's can store 16-byte-long contexts (costing ~2GB of RAM), so I think it can do ~150-byte masks? Enough to cover/detect most duplicates. For longer dups you can easily find a way to add them.
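For comparison, here is a toy sketch of that "take the last 16 bytes, look up the match, predict the next byte" step. This is not Shelwien's Green code; the plain dictionary lookup and frequency counts are my simplifications of the tree/lists described above.

from collections import defaultdict, Counter

CTX_LEN = 16  # the 16-byte contexts mentioned above

class MatchModel:
    """Maps each seen 16-byte context to counts of the byte that followed it."""
    def __init__(self):
        self.table = defaultdict(Counter)

    def update(self, data: bytes):
        # Learn online: every position contributes (previous 16 bytes -> next byte).
        for i in range(CTX_LEN, len(data)):
            self.table[data[i - CTX_LEN:i]][data[i]] += 1

    def predict(self, context: bytes):
        # Look up the last 16 bytes; return the most frequent next byte, or None.
        counts = self.table.get(context[-CTX_LEN:])
        if not counts:
            return None
        return counts.most_common(1)[0][0]

if __name__ == '__main__':
    m = MatchModel()
    m.update(b"[[Structure of Atlas Shrugged|section]] " * 50)
    print(m.predict(b"[[Structure of Atlas Shrugged|se"))  # prints 99, i.e. ord('c')

A real coder would turn those counts into a probability for arithmetic coding rather than just taking the most frequent byte, which is the "predicts just the next byte" part.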