So let's knock out as much of this as we can. Can you help me?
Shelwien's algorithm Green gets enwik8 down from 100MB to 21.8MB. It simply builds, online, a context tree up to 17 letters deep with counts, mixes the 17 order models based on each context's total count and number of distinct symbols (de-weighting the less useful orders), and arithmetically encodes the result. It prunes unused branches and skips mixing in statistically weak low orders when the higher orders already have enough stats. I've nearly recreated this. I'm pondering whether I should remove counts from lower orders once the symbol has been seen in a higher order, e.g. order-5 "a 6 b 7" plus order-4 "a 22 b 16 c 9 d 8" becomes order-4 "a 16 b 9 c 9 d 8"?
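To make the exclusion question concrete, here is a tiny sketch of the subtraction I'm considering (my own illustration, not Green's actual code; the container choice is just for readability):

```cpp
// Hypothetical sketch of the count-exclusion idea I'm asking about: subtract
// counts already seen in the higher order from the lower order before mixing,
// so the lower order only contributes "new" probability mass.
#include <cstdint>
#include <map>

using Counts = std::map<char, uint32_t>;  // symbol -> count for one context

// Remove from `lower` the counts already accounted for in `higher`.
// Example from above: higher = {a:6, b:7}, lower = {a:22, b:16, c:9, d:8}
// -> lower becomes {a:16, b:9, c:9, d:8}.
void exclude_higher_order(const Counts& higher, Counts& lower) {
    for (const auto& [sym, cnt] : higher) {
        auto it = lower.find(sym);
        if (it != lower.end()) {
            it->second = (it->second > cnt) ? it->second - cnt : 0;
        }
    }
}
```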
Next, I ran the 100MB through the latest cmix to pre-process it, and Green got the result down to 20.6MB. What exactly does cmix's pre-processing do? I got cmix from Matt's benchmark page.
Next, SSE (or is it SEE? Secondary Symbol Estimation): is this just about high orders that are missing the symbol you need to code, i.e. predicting the escape probability for a 0-count symbol, e.g. 0.003? E.g. we've seen 88 't', 5 'b', 1 'z', so the chance of seeing a new letter should be something like 0.00005? So the uncertainty of 0-count symbols is being estimated on top of the primary predictor? And this gets our 20.6MB down to about 19.6MB?
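Here is a rough sketch of how I imagine secondary estimation working, as a small adaptive table that corrects the primary probability; the table sizes, context choice, and update rate here are placeholders I made up, not taken from any particular program:

```cpp
// Hedged sketch of the SSE idea as I understand it: take the primary model's
// probability, quantize it, look it up in a table indexed by (small context,
// quantized p), and output what that table cell has actually observed. Cells
// are nudged toward the real outcome, so systematic biases (like escape
// probabilities for 0-count symbols) get corrected over time.
#include <array>
#include <cstdint>

struct SSE {
    static constexpr int kBuckets  = 33;    // quantization levels for primary p
    static constexpr int kContexts = 256;   // e.g. previous byte
    // Each cell: adaptive probability as 16-bit fixed point (0..65535).
    std::array<std::array<uint16_t, kBuckets>, kContexts> table{};

    SSE() {
        for (auto& row : table)
            for (int b = 0; b < kBuckets; ++b)
                row[b] = static_cast<uint16_t>(b * 65535 / (kBuckets - 1));  // start as identity map
    }

    // Refine a primary probability p (0..65535) given a small context.
    uint16_t predict(int ctx, uint16_t p) const {
        int bucket = p * (kBuckets - 1) / 65535;
        return table[ctx][bucket];
    }

    // After coding, move the used cell toward the observed outcome.
    void update(int ctx, uint16_t p, bool outcome, int rate = 5) {
        int bucket = p * (kBuckets - 1) / 65535;
        uint16_t& cell = table[ctx][bucket];
        int target = outcome ? 65535 : 0;
        cell = static_cast<uint16_t>(cell + (target - cell) / (1 << rate));
    }
};
```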
What next? Is there room left in this scheme to add a neural network that updates with activation/inhibition? How does that work, exactly?
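For reference, here is my understanding of the PAQ-style logistic mixer that people call the "NN", as a minimal sketch; I'm not sure this is what activation/inhibition refers to, and the learning rate is just a guess:

```cpp
// Sketch of a logistic mixer: each model's bit probability is stretched into
// the logistic domain, combined with learned weights, squashed back into a
// probability, and the weights are updated by gradient descent after each bit.
#include <cmath>
#include <vector>

struct LogisticMixer {
    std::vector<double> w;           // one weight per input model
    std::vector<double> stretched;   // cached stretched inputs from mix()
    double lr;                       // learning rate (placeholder value)

    explicit LogisticMixer(size_t n, double lr_ = 0.01)
        : w(n, 0.0), stretched(n, 0.0), lr(lr_) {}

    // Inputs must be strictly between 0 and 1.
    static double stretch(double p) { return std::log(p / (1.0 - p)); }
    static double squash(double x)  { return 1.0 / (1.0 + std::exp(-x)); }

    // probs[i] = model i's probability that the next bit is 1.
    double mix(const std::vector<double>& probs) {
        double dot = 0.0;
        for (size_t i = 0; i < probs.size(); ++i) {
            stretched[i] = stretch(probs[i]);
            dot += w[i] * stretched[i];
        }
        return squash(dot);
    }

    // After seeing the actual bit, move each weight in the direction that
    // would have reduced the coding cost (error = bit - predicted).
    void update(double predicted, int bit) {
        double err = bit - predicted;
        for (size_t i = 0; i < w.size(); ++i)
            w[i] += lr * err * stretched[i];
    }
};
```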
Does CTW (Context Tree Weighting) do what Green does? How does CTW work? I know it's patented, but I'm working on an AGI architecture.
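My current, possibly wrong, understanding of the basic binary CTW recursion, as a sketch (node names and layout are my own, not from any reference implementation):

```cpp
// Sketch of binary CTW: each tree node keeps a KT estimate of the bits seen
// in its context, and an internal node's weighted probability is half its own
// KT estimate plus half the product of its children's weighted probabilities.
#include <memory>

struct CtwNode {
    double kt = 1.0;     // KT estimator probability of the bits seen so far
    double pw = 1.0;     // weighted probability of this subtree
    int counts[2] = {0, 0};
    std::unique_ptr<CtwNode> child[2];   // child[c] = one-bit-longer context

    // Probability the KT estimator assigns to the next bit.
    double kt_next(int bit) const {
        return (counts[bit] + 0.5) / (counts[0] + counts[1] + 1.0);
    }

    // Update this node after observing `bit` in its context, then recompute
    // pw bottom-up:
    //   leaf:     pw = kt
    //   internal: pw = 0.5 * kt + 0.5 * child[0]->pw * child[1]->pw
    void update(int bit) {
        kt *= kt_next(bit);
        ++counts[bit];
        if (child[0] && child[1])
            pw = 0.5 * kt + 0.5 * child[0]->pw * child[1]->pw;
        else
            pw = kt;
    }
};
```

In a full coder you would update only the nodes along the current context path and read the coding probability off the root's pw, but the recursion above is the core of the weighting.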
And what does this mean: "Both programs use other techniques to improve compression. They use partial update exclusion. When a character is counted in some context, it is counted with a weight of 1/2 in the next lower order context. Also, when computing symbol probabilities, it performs a weighted averaging with the predictions of the lower order context, with the weight of the lower order context inversely proportional to the number of different higher order contexts of which it is a suffix."
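Here is how I read the "counted with a weight of 1/2 in the next lower order context" part, as my own sketch (not code from either program, and the fixed-point scale is arbitrary):

```cpp
// Sketch of partial update exclusion as I understand it: the coded symbol
// gets a full count in the highest-order context, and each successively lower
// order gets half the weight of the one above it. Counts are kept in fixed
// point so the fractional weights stay exact integers.
#include <cstdint>
#include <map>
#include <vector>

using Counts = std::map<char, uint32_t>;   // counts in units of 1/256

// contexts[0] is the highest-order context that saw the symbol, contexts[1]
// the next lower order, and so on down to order 0.
void partial_update(std::vector<Counts*>& contexts, char symbol) {
    uint32_t weight = 256;                 // 1.0 in fixed point
    for (Counts* ctx : contexts) {
        (*ctx)[symbol] += weight;
        weight /= 2;                       // next lower order gets half
        if (weight == 0) break;
    }
}
```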
Apparently, Matt used an NN over just 0-5 letter contexts or so and got 18MB. Did he use pre-processing or SSE?
I'm unsure about the remaining methods needed to reach the 14.8MB mark, but I'm guessing about 2MB comes off by grouping words; all the 15MB-ers must be doing that. So where does the drop from 18MB to 16MB (the other 2MB) come from? What else is happening there?
How much more does bit-level prediction get it? How much more does using a thousand models get it?