http://nishi.dreamhosters.com/u/lzma_delrep_v1.rar
I tried forcibly restoring rep-codes after removing them.
For enwik8 it looks like this:
Code:
opt=08 dictsize=08000000 filesize=100000000
Count the tokens in .rec: 11042877
Load the tokens into an array. Done
Compute backward rep sets. Done
Turn rep-coded tokens into plain matches/literals: 34547 litR0, 222219 rep0, 47353 rep1-3
Recompute rep-codes: 173979 litR0, 222219 rep0, 51422 rep1-3
Store the results. Done
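For reference, the "recompute rep-codes" pass boils down to tracking LZMA's distance history. A minimal sketch below (not the actual tool; the token fields are my assumptions) - LZMA keeps the 4 most recent match distances, and a match whose distance equals one of them can be coded as a cheap rep0..rep3 instead of an explicit distance:
Code:
// Sketch of rep-code recomputation; field names are hypothetical.
#include <cstdint>
#include <vector>

struct Token {
  bool is_match;     // false = literal
  uint32_t dist;     // match distance, if is_match
  uint32_t len;      // match length, if is_match
  int rep;           // out: -1 = explicit distance, 0..3 = rep0..rep3
};

void RecomputeReps(std::vector<Token>& tokens) {
  uint32_t rep[4] = {1, 1, 1, 1};       // initial history (exact init is moot here)
  for (Token& t : tokens) {
    t.rep = -1;
    if (!t.is_match) continue;          // literals leave the history alone
    for (int i = 0; i < 4; i++)
      if (t.dist == rep[i]) { t.rep = i; break; }
    if (t.rep > 0) {                    // rep1..rep3 hit: move to front
      uint32_t d = rep[t.rep];
      for (int i = t.rep; i > 0; i--) rep[i] = rep[i - 1];
      rep[0] = d;
    } else if (t.rep < 0) {             // explicit distance: push, drop rep3
      rep[3] = rep[2]; rep[2] = rep[1]; rep[1] = rep[0];
      rep[0] = t.dist;
    }
    // a len==1 rep0 match is what the log above counts as "litR0"
  }
}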
But compression actually got worse (24557177->24605378), despite the fact that
some distances were discarded and many literals were encoded with 4 bits instead of 9.
It seems the lzma encoder really does take the entropy coding of these tokens into account -
all literals are coded in the context of the rep0 byte, so explicit rep0 literals are rarely helpful.
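To illustrate, here's a rough sketch of that literal coder, simplified from the LZMA spec (the probability model is a static placeholder here, while the real coder adapts it - which is exactly what makes the in-sync bits cheap): each bit of a literal coded right after a match uses a context that includes the corresponding bit of the byte at distance rep0, until the first mismatch:
Code:
// Simplified matched-byte literal pricing; model is a placeholder.
#include <cmath>
#include <cstdint>
#include <vector>

// 0x300 bit contexts as in LZMA's literal coder: 0x100 plain-tree
// nodes plus 2*0x100 nodes indexed by the matched byte's bit.
static std::vector<double> p0(0x300, 0.5);  // P(bit==0) per context

static double PriceBit(unsigned ctx, unsigned bit) {
  double p = bit ? 1.0 - p0[ctx] : p0[ctx];
  return -std::log2(p);                     // ideal code length, in bits
}

// Price literal 'sym' coded right after a match; 'match_byte' is the
// byte at distance rep0.  While sym tracks match_byte, its bits sit in
// separate, well-predicted contexts.
double PriceLiteralAfterMatch(uint8_t sym, uint8_t match_byte) {
  unsigned tree = 1, in_sync = 1;
  double price = 0;
  for (int i = 7; i >= 0; i--) {
    unsigned mbit = (match_byte >> i) & 1;
    unsigned bit  = (sym >> i) & 1;
    unsigned ctx  = in_sync ? 0x100 + (mbit << 8) + tree : tree;
    price += PriceBit(ctx, bit);
    tree = (tree << 1) | bit;
    if (bit != mbit) in_sync = 0;           // fall back to plain contexts
  }
  return price;
}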
So I tried to restore only the previously existing rep0 literals:
Code:
Turn rep-coded tokens into plain matches/literals: 34547 litR0, 222219 rep0, 47353 rep1-3
Recompute rep-codes: 34547 litR0, 222219 rep0, 51422 rep1-3
Which was still no good (24557177->24558297).
Then I tried changing match distances to values used in _future_ matches
(of course, only when the referenced strings match too):
Code:
Turn rep-coded tokens into plain matches/literals: 34547 litR0, 222219 rep0, 47353 rep1-3
Recompute rep-codes: 34496 litR0, 226143 rep0, 58052 rep1-3
And this time it was an improvement (24557177->24556587).
It also clearly found quite a few extra rep hits, as the counts show.
But unfortunately this gain is far from universal.
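In pseudocode, that rewrite is a greedy lookahead pass like the sketch below (my reconstruction; the token layout and lookahead window are assumptions, and it ignores how the matches in between reshuffle the rep set):
Code:
// Sketch of the distance rewrite: if a nearby future match uses
// distance d2, and the current match's string is also present at
// distance d2, recode the current match with d2 so the future match
// lands in the rep set.  'data' is the original file.
#include <cstdint>
#include <cstring>
#include <vector>

struct Tok {
  bool is_match;
  uint32_t pos;      // absolute position in 'data'
  uint32_t dist, len;
};

void RewriteDistances(const uint8_t* data, std::vector<Tok>& tokens,
                      size_t lookahead = 8) {  // window size: assumption
  for (size_t i = 0; i < tokens.size(); i++) {
    Tok& t = tokens[i];
    if (!t.is_match) continue;
    for (size_t j = i + 1; j < tokens.size() && j <= i + lookahead; j++) {
      const Tok& f = tokens[j];
      if (!f.is_match || f.dist == t.dist || f.dist > t.pos) continue;
      // only rewrite when the referenced strings match too
      if (std::memcmp(data + t.pos - f.dist, data + t.pos, t.len) == 0) {
        t.dist = f.dist;   // the match at j can now become a rep hit
        break;
      }
    }
  }
}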
I'm still wondering how to properly do parsing optimization
with repeat codes...