I originally just wanted to test contrepl, but then tried a couple of other random things.
(contrepl is my tool for turning lossy replacement regexps into lossless ones.)
File was http://corpus.canterbury.ac.nz/descr...ge/E.coli.html
1. Apply the replacements t->a, g->c, c->a.
The result is a file consisting only of "a", plus 3 flag streams.
Compression is somewhat hard to evaluate, since all I have for flag compression is my CM with external contexts,
but compared to the same CM on the original file, compression is worse (1,146,253 vs 1,123,529 bytes).
Code:
4,638,690 e_coli
1,123,529 e_coli.CMo5
576,425 flg_ari_CA
290,275 flg_ari_GC
279,553 flg_ari_TA
1,146,253 CA+GC+TA
Surprisingly, switching to right-side contexts also had very little effect, on the order of +/-100 bytes.
Of course, this doesn't mean much, since CMo5 was tuned to book1 and the flag coder to the "contractions config",
so both results can obviously be improved (in fact, it's possible to tune 3 different models for the flags here).
But there's no immediate win, so it's not that interesting.
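To make the idea in (1) concrete, here is a minimal Python sketch of how a chain of lossy replacements can be made lossless with flag streams. This is not contrepl itself; the function names and the flag layout (one flag per resulting symbol, 1 = it was replaced) are my assumptions for illustration. With this layout the TA and GC streams each cover only part of the symbols, while CA covers all of them.

```python
# Hypothetical sketch, not contrepl: each rule src->dst emits one flag per
# resulting dst symbol (1 = it was src, 0 = it was already dst).
RULES = [("t", "a", "TA"), ("g", "c", "GC"), ("c", "a", "CA")]

def apply_rule(data, src, dst):
    out, flags = [], []
    for ch in data:
        if ch == src or ch == dst:
            flags.append(1 if ch == src else 0)
            out.append(dst)
        else:
            out.append(ch)
    return "".join(out), flags

def undo_rule(data, src, dst, flags):
    it = iter(flags)
    # Every dst symbol consumes exactly one flag; other symbols pass through.
    return "".join(src if ch == dst and next(it) else ch for ch in data)

def encode(data):
    streams = {}
    for src, dst, name in RULES:
        data, streams[name] = apply_rule(data, src, dst)
    return data, streams

def decode(data, streams):
    # Undo the rules in reverse order of application.
    for src, dst, name in reversed(RULES):
        data = undo_rule(data, src, dst, streams[name])
    return data

packed, streams = encode("acgtacgtttga")
assert packed == "a" * 12
assert decode(packed, streams) == "acgtacgtttga"
```

The flag streams would then be compressed separately, each with its own model and contexts.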
2. Identify some important sequences and insert spaces to split contexts
(it's a popular text preprocessing trick called "space stuffing")
step 1: compress e.coli with 7-zip to a zip archive with parsing optimization
step 2: extract optimized match strings from deflate stream using deflmatch kit
step 3: use a perl script to insert spaces after 256 most frequent match strings
Code:
4,638,690 e_coli
1,235,713 e_coli.zip
1,185,356 e_coli.lzma
1,129,315 e_coli.pmm // ppmonstr -o128
1,130,772 e_coli_pre.pmm // spaces after 256 most frequent matches (2.pl)
1,136,925 e_coli_pre1.pmm // spaces after 3717 most frequent matches
1,168,500 e_coli_pre2.pmm // spaces before 256 most frequent matches
1,129,493 e_coli_pre3.pmm // spaces after 8 most frequent matches (3.pl)
Well, it clearly does something, since the compressed size changes very little despite the significant
added redundancy (+500k in the pre1 case), but there's also no improvement.
Improvements might still be possible with a special tool for context clustering,
but no luck for now.
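For reference, step 3 in (2) can be sketched roughly like this in Python. It is not the actual 2.pl/3.pl perl script; `stuff_spaces`, its parameters and the longest-first matching order are assumptions of mine. The input list of match strings is assumed to come from step 2 (one entry per match).

```python
import re
from collections import Counter

def stuff_spaces(text, matches, top_n=256, before=False):
    """Insert a space after (or before) each occurrence of the top_n most
    frequent match strings. Hypothetical helper, not the original script."""
    top = [s for s, _ in Counter(matches).most_common(top_n)]
    if not top:
        return text
    # Longer strings first, so the alternation prefers the longest candidate.
    top.sort(key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in top))
    return pattern.sub(r" \g<0>" if before else r"\g<0> ", text)

def unstuff_spaces(text):
    # Lossless only because the original data contains no spaces.
    return text.replace(" ", "")

stuffed = stuff_spaces("acgttacgga", ["acg", "acg", "tt", "acg", "tt", "gga"], top_n=2)
assert unstuff_spaces(stuffed) == "acgttacgga"
```

Since e_coli contains no spaces, the transform is undone by simply deleting all spaces after decompression.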
3. An interesting detail in (2) was that all of the most frequent strings consisted of 7 symbols.
Code:
4,638,690 e_coli
1,187,935 e_coli.lzma
5,301,360 1 // stegdict.exe d e_coli_lst 1 2 (_lst has \n inserted after every 7 chars)
22,649 1.lzma // lzma -d23 -fb273 -mc9999 -lc0 -lp3 -pb3 -mt1 (sorted list of 7-grams)
1,374,446 2 // stegdict "diff" to restore the permutation
4,638,690 1 // stegdict.exe d16 e_coli 1 2 (just 16-char records, no padding)
605,214 2 // diff file
700,937 3 // lzma -d23 -fb273 -mc9999 -lc1 -lp2 -pb2 -mt1
1,306,151 2+3
5,275,326 e_coli_deflmatch // cut -d " " -f 2 00000000.txt | sed -e s/\x22//
5,275,326 4 // stegdict.exe d e_coli_deflmatch 4 5
1,345,276 5
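The general idea behind the listing in (3), as far as the comments there describe it, is to split the file into fixed-size records, compress the sorted list, and separately store what is needed to restore the original order. Below is a minimal Python sketch of that split/sort/restore round trip. It is only an illustration: stegdict's actual "diff" format is not described here, so the plain index list used below is an assumption, and all function names are hypothetical. As a sanity check on the sizes, 4,638,690 / 7 = 662,670 records, and with a \n after each one that gives 662,670 * 8 = 5,301,360 bytes, which matches file "1".

```python
def split_records(data, size):
    # Fixed-size records; the last one may be shorter.
    return [data[i:i + size] for i in range(0, len(data), size)]

def sort_with_permutation(records):
    # order[pos] = original index of the record that lands at sorted position pos.
    order = sorted(range(len(records)), key=lambda i: records[i])
    return [records[i] for i in order], order

def restore(sorted_records, order):
    out = [b""] * len(order)
    for pos, i in enumerate(order):
        out[i] = sorted_records[pos]
    return b"".join(out)

data = b"gattaca" + b"acgtacg" + b"gattaca" + b"ca"
records = split_records(data, 7)
sorted_records, order = sort_with_permutation(records)
assert restore(sorted_records, order) == data
```

The sorted list is highly compressible, but the permutation is where the cost moves to, which fits the sorted 7-gram list being tiny while the diff file is large.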