# Thread: Is this how to practically implement Arithmetic Coding?

1. I realize that before you mix anything, the result is basically perfect: the window/mask says it has seen 'a' come next 33 times, 'b' 22 times, 'c' 5, 'd' 2, and so on. So during arithmetic encoding the common symbols get shorter codes, which is almost optimal if the arithmetic coder is perfect. Great! But we have to normalize when we mix, and that has a solution I can probably cook up myself. The normalizing/rescaling/averaging is tedious since the models differ in scale, and I wager that causes issues, because we turn the counts into percentages and then mix them.

2. Why would switching off models (masks) during some areas of wiki8 be useful? And how do you know when to do that?

3. 1) It would be useful if they don't appear in some area (for example, in one area sparse models may be helpful due to frequent mistypes,
then start hurting compression while being useless with a relatively high weight).
The best (most precise) way to detect that is to estimate compressed size for both cases (with and without some model) and switch it off
when necessary.

2) There's also a "context switching" method as alternative to mixing.
We can explicitly encode the choice of best context for a symbol (or some chunk).
Optimizing submodel choices is called "parsing optimization" and is used frequently in LZ algorithms to enable faster decoding.
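The size-estimation idea in 1) can be sketched in a few lines: each symbol coded with probability p costs about -log2(p) bits, so you can total that over a block for both configurations and switch the model off when it doesn't pay. A generic sketch, with names of my own choosing (not from any specific coder):

```cpp
#include <cmath>
#include <vector>

// Estimated compressed size (in bits) of a block: each symbol coded
// with probability p costs -log2(p) bits.
double estimated_bits(const std::vector<double>& probs) {
    double bits = 0;
    for (double p : probs) bits += -std::log2(p);
    return bits;
}

// Keep a model enabled only if mixing it in predicts a smaller block.
// probs_with / probs_without are the per-symbol probabilities the mixer
// assigned to the actually-seen symbols, with and without the model.
bool keep_model(const std::vector<double>& probs_with,
                const std::vector<double>& probs_without) {
    return estimated_bits(probs_with) <= estimated_bits(probs_without);
}
```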

1) You said Green isn't adaptive, but in another post you said it dynamically grows depth to order3+ as it regenerates text (Online Learning).

2) Does Green predict the next byte or next bit?

3) So to predict the next byte/bit your algorithm takes the previous 16 masks/contexts, finds them in tree/ linked lists as EXACT MATCHES, then does a nice mixing formula and Arithmetic Coding?

4) By the time Green finishes the 100MB input, how many letters are stored in Tree? 16*100,000,000? I see my RAM goes up by ~2GB. I don't understand "Log 2N" so I won't understand if you use those terms.

5. > 1) You said Green isn't adaptive,
> but in another post say it dynamically grows depth
> to order3+ 'as' it regenerates text
> (Online Learning).

I don't remember saying these things and google doesn't either.

a) green.cpp doesn't use dynamic memory allocation,
but it uses different probability distributions on each context occurrence
(and each byte of input data), so it's certainly adaptive.
I also explicitly said that its weighting function is adaptive,
so I can't easily visualize it.

b) "DEPTH" is a constant in green.cpp which determines the max number
of most recent order3 context occurrences it can visit to rebuild
the order3+ probability distributions.

c) from the first byte actually processed by the model
(green.cpp copies the first 3 bytes as is), green.cpp does
access symbol counts for all contexts from 0 to MAXORDER.
Although contexts with total=0 are skipped in mixing.

> 2) Does Green predict the next byte or next bit?

green.cpp always works with byte distributions and encodes/decodes whole bytes.
But it is possible to convert it to a bitwise coder
using some parts from mod_ppmd.
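For illustration, the usual way to make a bytewise model bitwise (the approach paq-style coders take) is to code each byte as 8 binary decisions, where the context of each bit is the prefix of bits already coded. A sketch of just the index arithmetic, not mod_ppmd's actual code:

```cpp
#include <cstdint>

// Bitwise coding of a byte walks a depth-8 binary tree:
// the context of bit k is the byte's high bits seen so far,
// stored as an index 1..255 ("1" is the empty prefix).
// This only shows the index arithmetic a bitwise model would use.
void byte_to_bit_contexts(uint8_t byte, int ctx_out[8], int bit_out[8]) {
    int ctx = 1;
    for (int k = 7; k >= 0; --k) {
        int bit = (byte >> k) & 1;
        ctx_out[7 - k] = ctx;   // model index for this decision
        bit_out[7 - k] = bit;   // the decision actually coded
        ctx = ctx * 2 + bit;    // descend into the tree
    }
}
```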

> 3) So to predict the next byte/bit your algorithm takes the previous 16 masks/contexts,
> finds them in tree/ linked lists as EXACT MATCHES, then does a nice mixing formula and Arithmetic Coding?

Kind of, except it has static tables for order0..2 symbol counts.

> 4) By the time Green finishes the 100MB input, how many letters are stored in Tree?
> 16*100,000,000? I see my RAM goes up by ~2GB. I don't understand "Log 2N"
> so I won't understand if you use those terms.

If you mean tree.cpp, it creates about 97M of context nodes for enwik8.
They are variable-size though (min 12 bytes, plus 4 bytes per each new symbol in context),
that's why it hits 2G.
Unfortunately it's not x64-aware, but with order15 it fits in 2G and compresses
enwik8 to 21,094,300 (and decompresses it correctly).
(Like "tree.exe 15 enwik8 output").

6. That clears up more, thanks.

"no dynamic mixing or adaptive counters are used there."
From:
I'm assuming you mean the 16 models have their own counters. And order0-2 are static, while order3-16 update as decompression runs?

As it trains itself on wiki8 with its order-16 mask, am I supposed to store in e.g. a tree every 16 bytes the mask moves along? That's 100,000,000 steps the 16-byte mask makes. 16*100,000,000 = ~1,000,000,000 nodes! A billion... Am I supposed to erase some if not enough statistics are seen in 10KB? The same 16 letters/bytes don't often appear again in wiki8... So the leaves in the tree number ~a billion...

7. > "no dynamic mixing or adaptive counters are used there."
> From:

Well, the meaning here is different.
"Adaptive counter" is something like this: https://github.com/Shelwien/mod_CM/b...CM/counter.inc
While green just uses simple numbers of occurrences.

Same with mixing - good mixer adapts to actual performance of the submodels,
and green just uses a heuristic weight function which assigns lower weights
to contexts with more occurrences (presuming that there'd be more relevant higher-order contexts).

Green's model overall is adaptive (it's certainly not static), just that there're better models.
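A rough sketch of such a heuristic weight function, using the simple "w = 8192/total" variant (one of the weight functions benchmarked elsewhere in this thread); the constant is illustrative, and this is not green's tuned coef[] formula:

```cpp
// Heuristic weighting: give a context a weight that falls as its total
// occurrence count grows, on the assumption that frequently-seen
// (low-order) contexts are less specific than rare high-order ones.
double heuristic_weight(int total_count) {
    if (total_count == 0) return 0.0;  // empty contexts are skipped
    return 8192.0 / total_count;       // the "w = 8192/total" variant
}
```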

> I'm assuming you mean the 16 models have their own counters.
> And order0-2 are static, order3-16 updates as decompression runs?

They are all updated. Statistics are not stored in the compressed file; it's all adaptive.

> am I supposed to store in ex. a tree every 16 bytes the mask moves along?

Not quite like that, but with a tree, yes, you'd need to create a new context node
for most symbols of the file.
On the other hand, green.cpp only allocates 5*filesize for file data and order3 chains.

If you're concerned about memory usage, there're all kinds of tricks.
For example, bwmonstr manages to compress files with only 1.25*filesize memory usage:
http://mattmahoney.net/dc/text.html#1605

> That's 100,000,000 steps the 16-byte-mask makes. 16*100,000,000=~1,000,000,000 nodes!

Yes, and? There're 1,073,741,824 bytes in a gigabyte, and most computers now have 4GB+ memory?
Though it's certainly a good reason to not use Python for data compression.

> A billion...Am I supposed to erase some if not enough statistics are seen in 10KB?

There're all kinds of workarounds for that.
1) "Sliding window" - only keep recent N bytes indexed, remove contexts that go out of window.
2) Restart (ppmd -r0) - once memory pool fills up, reset the model and continue with empty tree.
3) Restart and retrain (ash /w) - reset and reindex N% of known data
4) Tree pruning (ppmd -r1) - delete less probable contexts to free memory.
5) Freezing (ppmd -r2) - just stop adding new contexts when there's no more memory

There're also simply different data structures like BWT Suffix Array, hashtables, bloom filters...

In any case, for enwik8 it should be possible to fit in 1GB with a more compact tree.
The tree in tree.cpp just uses 32-bit fields for every symbol,
but we could reduce the precision and go with 16-bit or even 8-bit (like paq) counters.
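A sketch of what an 8-bit counter could look like: halve on overflow, so recent statistics dominate and each slot costs 1 byte instead of 4. (paq's actual counters are probability state machines; this is just the simplest count-based form.)

```cpp
#include <cstdint>

// 8-bit occurrence counter that rescales instead of overflowing,
// which also gives a mild recency bias to the statistics.
struct Counter8 {
    uint8_t n = 0;
    void add() {
        if (n == 255) n >>= 1;  // halve on overflow
        ++n;
    }
};
```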

8. I added one question while you were typing:

Lastly, let's try something new: can you explain why the 'add and divide' mixing formula is not as good? For example, are the 17 models not normalized as well, and is that the major cause? Try to flesh this out for me in plain English, explaining the issues/idea and what is being mixed together. Use analogies. The more context, the better I will grasp it, no doubt.

9. > Can you explain why the mixing formula 'add and divide' is not as good?

There's no specific reason.
It's possible to generate a file where it would be better than green's weighting function.

Normalized weights are probabilities of each context predicting the highest probability for the current symbol.

21,094,300 // w=coef[j][n==1]/(1+coef[j][2]+total/(coef[j][3]+1))
22,664,336 // w=8192/total (probability distributions with equal weight)
61,654,200 // w=1 (symbol counts with equal weight)

Ideally we need to build a proper model for prediction successes.
It's possible, but the program for that would be much more complicated than "green",
since it would require storing extra statistics for the weight model, working with probabilities in the fixed-point/logistic domain, etc.
That kind of thing is also much easier to make for a bitwise model, rather than bytewise.

10. Let me explain then. A context model of order-3 has seen every offset of 3 letters in the file (or some of the file, if Online Learning is implemented). So it has higher counts/probabilities for its candidate predictions 1/0 (or a, b, c... z... `, !, ", \$). Most of the time its best prediction is right and is given a smaller code by the AC.

If we want to mix order-3, order-4 and order-5, the three models may differ in scale. Model1 may think a=50 counts, b=25 counts, while model2 may think a=480, b=200. Clearly they are similar, but they are scaled differently; otherwise they'd be perfectly mixed (by just adding each of the 3 and dividing by 3) and be like a standalone model with no mixing. If normalization/re-scaling is the issue, this is simple to fix, I guess. But if that's not the only issue, then there are other magical mysteries. Possibly the models themselves are only sort of perfect if doing Online Learning, are scaled differently, and each model is like a flavor, where a higher-order/more recent model is more accurate/overpowering while having fewer stats seen. Are there any more differences between 2 given models of varying size/position (e.g. order-3 and order-4)?

11. > they'd be perfectly mixed (by just adding each 3 & dividing by 3)

There's nothing perfect in contexts having equal weight.
For example, an order5 context "perfe" would predict next symbol to be 'c' with almost 100% probability.
While "erfe" already can be a part of "interference" and some other words.
In such a case it's best for a weight function to assign a high weight to order5,
and much lower weights to lower orders.

12. Now, the 17 models/masks: I imagine you store them in 1 tree for the tree version, instead of 17 dedicated trees? So at layer 5 for order-5 you have the predictions? Are these predictions "layer 6"? Or their own leaf stash inside layer 5?

Which way do you get the leaf predictions for order-2?
https://ibb.co/GsnNYn2

13. > Now the 17 models/masks, I imagine you store them in 1 tree for tree version?

Either way is possible. For example, it may be easier to look up each context in a hashtable independently.

In tree.cpp case, it has context node with a table of pointers (per symbol), which point either
to the text buffer (if the symbol only occurred once in this context), or to a node of a higher order context.

> Which way do you get the leaf predictions for order-2?

tree.cpp uses the same context nodes for everything, including order0.
But normally it's easier to keep up to order2 in static tables (it's just 64Mb).

14. Last question. Regarding information uncertainty: when you haven't seen the whole dataset (if doing Online Learning), you can go to a node where the prediction may not have been seen there even once. Do you fall back an order? Or do you automatically give it, e.g., a 1% range slice?

15. If you mean new symbols, there's a so-called order-(-1), which contains all symbols with equal probability.
In green it's represented by this (distribution init before mixing):
Code:
`for( j=0; j<CNUM; j++ ) freq[j]=1; // order-(-1)`
So for coding, every symbol would have a non-zero probability, although it's much less than 1%.

16. https://paste.ee/p/qQroG
I implemented a 'best string finder', I think. It stores the possible sequences by putting them in a tree with counts, so it can efficiently find each string and its value. I did leave out the refining of the sorted list, but it's good enough to present. The value formula: a string's counts * the letters in the string gives the savings; the cost is the letters in the string plus counts*2 to store/reference the string using Huffman (2 bytes = 16 bits, if there are 64,000 strings; currently manual adjusting is required). With cost and savings (they are printed, see code), you divide the savings by the cost to get the bits-per-byte ratio. The code then sorts them from greatest to least value. The first one below is huge: from 1 million bytes fed in, it is 21.5! You save/'store' 5445 bytes for just ~253 bytes.

updated example:
['[[Structure of Atlas Shrugged|section', 5624, 75, 74.98666666666666]
['Structure of Atlas Shrugged|section', 5600, 75, 74.66666666666667]
['[Structure of Atlas Shrugged|section', 5472, 74, 73.94594594594595]
['tructure of Atlas Shrugged|section', 5440, 74, 73.51351351351352]
['ructure of Atlas Shrugged|section', 5280, 73, 72.32876712328768]
['ucture of Atlas Shrugged|section', 5120, 72, 71.11111111111111]
['cture of Atlas Shrugged|section', 4960, 71, 69.85915492957747]

Kennon's attempt to find the 'best strings' that are long and frequent is the same thing as Shelwien's Green algorithm https://encode.su/threads/541-Simple...xt-mixing-demo, except Shelwien's learns most of its context strings online. Notice that as Shelwien's tree/lists grow after eating what it generates, it gets longer branches with more stats on them (yup, long frequent strings), so when it steps on 16 bytes it basically sends them to the tree and finds the match, and instead of predicting whole strings it predicts just the next byte. Both algorithms get the wiki8 100MB to ~21MB. Shelwien's can store 16-byte-long contexts (costs 2GB RAM), so I think it could do ~150-byte masks? Enough to cover/detect most duplicates. For longer dups you can easily find a way to add them.
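The value formula described in post 16 can be sketched directly (savings = counts * length, cost = length plus a 2-byte code per occurrence, as stated in the prose); the ratios in the posted table appear to use different units, so treat this as the idea rather than the exact code:

```cpp
// Value of a candidate dictionary string, per the post's description:
// replacing `counts` occurrences of a `len`-byte string saves
// counts*len bytes, and costs len bytes (to store the string once)
// plus a 2-byte Huffman code per occurrence.
struct StringValue {
    double savings, cost, ratio;
};

StringValue string_value(int counts, int len) {
    double savings = double(counts) * len;
    double cost = len + 2.0 * counts;
    return {savings, cost, savings / cost};
}
```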

17. Let me make sure the code works first wait, something seems off.

EDIT: Updated code.

18. Wait, if Green does order-16, why do you need order-15, 14, 13, etc? Mixing is for non-overlaps, I think. You just go down as far as you can match and then choose a range. One tree/list, growing as it learns more.

Or are we mixing because of info uncertainty, need a little stats of low order-2, order-6, order-4..as Matt said.

19. I think I might face a speed issue.
Say I pre-compute/store orders 0-2. Now I have to run down the tree to all children from a layer-3 node? That's ~20,000 leaves.

20. > Wait, if Green does order-16, why do you need order-15, 14, 13, etc?

Because order-16 statistics only exist if you had at least one previous
occurrence of the same 16-byte string.
And even that is only enough if current symbol already appeared in context.

> Mixing is for non-overlaps I think.
> You just go down as far as a match you can and then choose a range.
> One tree/list, grows as learns more.

There's a PPM algorithm which starts from max available context,
then goes to lower order until it is able to encode the current symbol.
https://en.wikipedia.org/wiki/Predic...rtial_matching

But PPM can be seen as implicit CM with chained binary mixing
and (1 - escape probability) as context's weight.

> Or are we mixing because of info uncertainty,
> need a little stats of low order-2, order-6, order-4..as Matt said.

There're cases where higher order context can provide worse prediction
(eg. suffixes in a sorted dictionary).

> Say I pre-compute/store orders 0-2.
> Now i have to run down tree to all children from a layer 3 node?
> That's ~20,000 leafs.

While processing the previous symbol, we already visited its order-(N-1) context node
and found the descriptor of the previous symbol in it.
So we can store a pointer to the current order-N context
(previous order-(N-1) context + previous symbol) and find it just by following that pointer.

Otherwise we can start from the tree's root, find the 1st symbol,
walk to the branch node, find the 2nd symbol... until we reach the descriptor
of the last symbol of the context string.
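A toy illustration of the two lookup paths (structure and names are mine, not tree.cpp's): the dumb symbol-by-symbol walk from the root, versus caching the child pointer found while coding the previous symbol, which makes the next context lookup O(1):

```cpp
#include <cstdint>
#include <map>
#include <string>

// Toy context tree node: one branch per next symbol.
struct Node {
    std::map<uint8_t, Node*> child;
    int count = 0;  // occurrences of this context
};

// The "dumb" lookup: walk from the root, one context symbol per level.
Node* walk_from_root(Node* root, const std::string& ctx) {
    Node* n = root;
    for (unsigned char c : ctx) {
        auto it = n->child.find(c);
        if (it == n->child.end()) return nullptr;
        n = it->second;
    }
    return n;
}
// With the cached-pointer scheme, after coding symbol 's' in context
// "cat" we already hold the node for "cats", so no walk is needed.
```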

Number of children etc only matters if you want to encode the tree
instead of mixing the probability distribution for sequential coding
of the symbols.
That's possible too, I just posted an old coder which does that:

But we don't have an implementation with state-of-art compression
for this approach (though it should be possible; some BWT coders are related).

21. What does "timer.inc" do in Green's folder? Time for what?

Let's focus on a tree version here. So you start at the root and usually see 256 paths, right? Then you pick one, and usually see e.g. 100 paths. It's getting thinner. Say it's 16 deep. Are any of these early layer-2/3 branch nodes where the predictions/counts are found? Or are predictions kept in the last layer's leaves? See my diagram below.

https://ibb.co/nn4BJhw

22. As you can see, the backwards tree looks most useful: it can back off from order-16 to lower orders without leaving the prior zone/context frontier. All predictions are on the other side, in the end leaves with the counts there only, so if you see a short order-4 you'll need to go to all its children... 32,000 sometimes.

Help....maybe I got it wrong.

23. > What does "timer.inc" do in Green's folder? Time for what?

It's used in this line:
Code:
`if( CheckTimer(500) ) printstats(i,inplen);`
To print stats during processing only twice per second (500ms).
You can remove "#include timer.inc", StartTimer() and the line above -
it would still work, just without progress info.

> Or are predictions kept in last layer leafs?

green.cpp/tree.cpp use predictions from all context orders _at once_.
They take probability distributions from all contexts and mix them.

There's a PPM algorithm which attempts to use only the distribution from
the "last layer leaf" first.
It's not the algorithm used in green/tree.

> so if you see a short order-4 you'll need to go to all children...32,000 sometimes.

It happens in "green" - it would visit all order3 instances and collect starts for order3+ contexts.
But currently it only visits 8192 most recent instances (DEPTH constant).

And in "tree" this problem doesn't exist at all - it just works with up to 256 symbols in each context.

24. Can you see my tree image above? How should I draw it... And where do I go, stop, and gather in it? Explain using terms like 'go to the next layer', 'all or only 1 branch', 'grab counts', or 'skip layers using a pointer'... Is your tree backwards like mine?

25. I don't understand your text very well, and understand your pictures even less.

> Is your tree backwards like mine, ?

green.cpp doesn't use any trees.
tree.cpp has a tree with a symbol table in each node and up to 256 branches, which lead to current_context_string+branch_symbol context nodes.

26. Take a look again at my image: the letter in the input, 's', is the prediction; these are at the end of the tree, the leaves. The letters prior to 's' are the context; they are stored in the tree, so you can find it backwards and then grab the predictions & counts.

So you mean each node in [your] tree stores predictions+counts, and branch children? Your tree is a tree made of contexts; which layer(s) hold the predictions? Which layer(s) hold the counts?

27. If your tree is made of layers, which layers hold the counts for range encoding? Last layer? All layers?

28. > Take a look again at my image, the letter in the input 's' is the prediction,
> these are at end of the tree, the leafs.

Yes, but normally we'd walk through the tree from root to leaf.
A tree is a data structure for hierarchical arrangement of elements.

> The letters prior to 's' are the context, they are stored in the tree,
> you can find it backwards and then grab predictions & counts.

I don't understand why would you want to find the context from the leaf node.
I don't understand either why box together the leaves from all context branches.

Context is already known, you just need to find the corresponding symbol table.
The dumb way to do that is to start from the tree root and choose a branch
corresponding to n'th symbol of context on n'th tree level.
But the tree would store the pointer to context node of 'cats' in
the symbol descriptor for 's' of context 'cat'.
So while finding the counter for symbol 's' in context 'cat',
we'd also find the context node for next symbol's context 'cats'.

> So you mean each node in [your] Tree stores predictions+counts, and branch children? ....
> Your tree is a tree made of contexts, which layer/s holds the predictions?? Which layer/s hold counts?

Actually my tree (one in tree.cpp) only has context counts.
So using it, I need to access all children (only direct children) of a context to build its probability distribution.

While usually symbol counters would be stored in symbol table, along with child pointers.
It's faster, but also uses more memory.

> If your tree is made of layers, which layers hold the counts for range encoding? Last layer? All layers?

I'd say all "layers", but it's not "made of layers"
(there's no easy way to access all symbols on a "layer").

My tree is made of context descriptors, which I call context nodes.
Each context node contains a counter (number of context instances seen before)
and a symbol table with addresses of context nodes of one-symbol-longer contexts.
"tree.cpp" would visit all of these child contexts (up to 256 per context)
to collect the counts for the probability distribution.

"counts for range encoding" are only computed (mixed) dynamically,
they're not stored anywhere.
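To make the layout concrete, here is a sketch of that scheme (field names are mine, not tree.cpp's): a node stores only its own instance count, and a symbol's count in a context is read from the corresponding one-symbol-longer child:

```cpp
#include <vector>

// Context node storing only its own instance count and per-symbol
// links to one-symbol-longer contexts, as described above.
struct CtxNode {
    int instances = 0;          // times this context was seen
    CtxNode* child[256] = {};   // per-symbol links to longer contexts
};

// Build the raw count distribution for a context by visiting its
// direct children only: symbol s occurred child[s]->instances times.
std::vector<int> distribution(const CtxNode& ctx) {
    std::vector<int> counts(256, 0);
    for (int s = 0; s < 256; ++s)
        if (ctx.child[s]) counts[s] = ctx.child[s]->instances;
    return counts;
}
```

Storing the counters directly in the parent's symbol table avoids the child visits at the price of extra memory, which is the trade-off mentioned above.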

29. So in my tree, I should increment any node I pass by 1. Then when I search for my 17 orders, I instantly get the next layer predictions with counts, like this:

https://ibb.co/HpYf8TQ

?
I think I got it now, checking...
every node has counts

-------------------

Of course, we cram runs of letters into single nodes until we need to split at some point.

30. Getting 2GB already for 100,000,000 wiki8 6-letter strings stored in the tree. How will I fit 100,000,000 17-letter-long strings into the tree?... I already crammed letters into single nodes... and the end of the tree is the widest.
