1) There are far fewer unique 6-letter strings in enwik8 than 100M: only 7,238,979, to be specific.
2) There's no need to create different trees for 6-letter strings and 17-letter strings... 17-letter strings would have the same 6-letter prefixes.
3) A BWT suffix array only needs 5N bytes to locate and count any n-grams in a file up to 4GB (then 6N up to 1TB).
Unfortunately it can't be used for sequential decoding.
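To illustrate the locate-and-count idea (not the 5N-byte layout itself), here's a minimal sketch with a naively built suffix array; a real implementation would use SA-IS or divsufsort and store 4-byte offsets alongside the N text bytes:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Naive O(n^2 log n) suffix-array construction, for clarity only.
std::vector<int> build_suffix_array(const std::string& s) {
    std::vector<int> sa(s.size());
    for (size_t i = 0; i < s.size(); ++i) sa[i] = (int)i;
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// Count occurrences of any n-gram with two binary searches over the
// sorted suffixes -- no per-order tree needed.
int count_ngram(const std::string& s, const std::vector<int>& sa,
                const std::string& ngram) {
    auto lo = std::lower_bound(sa.begin(), sa.end(), ngram,
        [&](int pos, const std::string& q) {
            return s.compare(pos, q.size(), q) < 0;  // suffix prefix < q
        });
    auto hi = std::upper_bound(sa.begin(), sa.end(), ngram,
        [&](const std::string& q, int pos) {
            return s.compare(pos, q.size(), q) > 0;  // suffix prefix > q
        });
    return (int)(hi - lo);
}
```

Counting is a range query, but note there's no way to walk the array in file order, which is why it can't drive sequential decoding.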

I mean that I have the file data in memory anyway, so I can store the offset of a string's instance in the file buffer.
In fact my implementations always start with that when a new symbol is added to a context node...
only after the symbol's second occurrence is a new context node allocated and filled with 1-2 new symbols
(the current one and one from the previous location).

My context model, mixing high-order modelling (orders 0-20) with Online Learning, got the 100MB losslessly compressed to 24MB in 34 mins. C++. Now I'm 9MB away from the world record. I had to come up with my own mixing formula. It uses an exponential curve, though I'm unsure if this is what others use the curve for. I can probably fine-tune the settings more to get it lower; I set them by hand over 2 hours.

Well, the best posted result for bytewise PPM (which is pretty similar to bytewise CM) is 16,209,167: http://mattmahoney.net/dc/text.html#1277
It uses preprocessing, but you can do that too.

So what do Ash or Durilca do that Green doesn't? Green > Ash > Durilca compressed sizes are approx. 22MB > 18MB > 16MB. What is shaving off those extra MB besides preprocessing? And why does it work?

The 16.2MB result from Durilca doesn't include the 310KB compressed size of "EnWiki.fsd";
it also uses different preprocessing, so the results can't be directly compared.

> What is shaving off those extra MB besides preprocessing? And why does it work?

1) Preprocessing.
For example, see https://github.com/inikep/XWRT
It's possible to improve compression by transforming the data first.
Compression algorithms have some known weaknesses that can be compensated for by preprocessing.
E.g. they usually work with only one version of the statistics,
so switching between chunks of different data types hurts compression;
sorting these chunks by type first would thus improve the results.

2) Extra contexts.
Durilca is not open-source, so we don't have a perfect understanding of its algorithm.
But it's known to use sparse contexts in addition to normal prefix contexts.

3) "Secondary Symbol Estimation"=SSE aka "Adaptive Probability Mapping"=APM.
There're some systematic prediction errors which appear in context models:
- imprecise statistics in new contexts
- slow adaptation of old contexts on data-type switch
- statistics can only provide a prior estimate,
while coding requires a posterior one (otherwise we can't encode new symbols);
there's a nonlinear dependency that can be extrapolated
- arithmetic coding bias: https://encode.su/threads/1537-Modelling-what-exactly
(basically using a probability of next_symbol==x for AC is wrong;
AC requires a prediction of the number of future occurrences of x;
these are similar for stationary data, but different enough for real data)

These errors can be partially corrected by secondary statistics,
like "what are actual occurrence counts of bits that were given probability estimation p".
But at this point it becomes inconvenient to use non-binary symbols.
So Ash and Durilca transform symbols to flags first:
all symbols are ranked by recency, then by context order (higher first),
and the flags (x==symbol[rank[i]]) (for i=0..254) are encoded instead of the original byte alphabet.
Primary predictions for these flags are computed from symbol statistics,
but then secondary statistics are used to improve these predictions.
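A rough sketch of the ranking side of this (recency-only MTF ranking, without the context-order tiebreak Ash/Durilca add; the class and names are mine):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A byte is coded as flags f_i = (x == rank[i]) for i = 0,1,2,... until
// the first hit; each flag then gets its own primary + secondary estimate.
// encode() returns the rank position, i.e. the number of 0-flags before
// the final 1-flag.
struct RankCoder {
    std::vector<uint8_t> rank;  // rank[i] = i-th most recent symbol
    RankCoder() {
        for (int i = 0; i < 256; ++i) rank.push_back((uint8_t)i);
    }
    int encode(uint8_t x) {
        int i = 0;
        while (rank[i] != x) ++i;          // find rank position of x
        rank.erase(rank.begin() + i);      // move-to-front update
        rank.insert(rank.begin(), x);
        return i;
    }
    // The decoder mirrors the encoder: decode flags until a 1, read
    // rank[i], then apply the same MTF update.
    uint8_t decode(int i) {
        uint8_t x = rank[i];
        rank.erase(rank.begin() + i);
        rank.insert(rank.begin(), x);
        return x;
    }
};
```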

In addition, SSE can be used to combine predictions from multiple submodels,
by using multiple predictions as the context for secondary statistics.
Normally it's hard for PPM to include non-prefix submodels, since PPM uses
implicit weighting via escape probabilities.
Of course, it's possible to build the whole distribution first, then apply
mixing normally... but SSE provides a more efficient solution.

4) Keep in mind that in the end bitwise models are more adaptive and easier to combine.
Bytewise (and larger) models are different enough to be useful, but it's easier
to map byte distributions to bitwise ones than the reverse (bitwise models would have to
be speculatively updated with each possible byte symbol, then restored).
On the other hand, state-of-the-art compressors (cmix etc.) show more use of bytewise mixing
than paq, so there's likely some potential in bytewise CM even though programming it
is harder.

How do I group words like cat/dog? Do I use my tree? Explain quickly but clearly. I.e. what decides cat=dog in compressors? And how is it used to gain compression?

Last edited by Self_Recursive_Data; 31st January 2020 at 09:18.

0. "Stop telling me what to do and I won't tell you where to go"

1. Forget that for now. Durilca and Ash only work at the level of predicting repeated strings; there are no semantic models.
You have to reach that level first.

2. If words appear in the same context, they would be grouped automatically.

3. "DRT" preprocessing (aka WRT aka LIPT) works by replacing real words with dictionary indexes encoded to look like words
(to let word-based models still work after preprocessing).
The main benefit of the preprocessing is the reduction of the file size (because dictionary indexes are shorter than English words),
but the DRT dictionary was specially optimized for the Hutter Prize contest, so there's some level of semantic grouping in the dictionary,
which results in similar indexes being used for related words (e.g. synonyms).
DRT doesn't particularly consider "cat" and "dog" to be similar; the encoding is cat=\xE4\xB2, dog=\xF5\x8E.
But, for example, "anything"=\xD0\x9B and "everything"=\xD0\x98 have the same prefix after transformation, and would thus
get shorter codes when they appear in similar contexts.
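A toy illustration of the index-encoding idea (the dictionary, the grouping, and the byte layout here are made up, not DRT's actual ones):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy DRT/WRT-style transform: each dictionary word maps to a short index
// encoded as high (>= 0x80) bytes, so word models still see "word-like"
// tokens. Words placed in the same semantic group get adjacent indexes,
// hence shared first bytes after encoding.
std::string drt_encode(const std::string& word,
                       const std::map<std::string, int>& dict) {
    auto it = dict.find(word);
    if (it == dict.end()) return word;    // unknown words pass through
    int idx = it->second;
    std::string code;
    code += (char)(0x80 | (idx >> 7));    // first byte ~ semantic group
    code += (char)(0x80 | (idx & 0x7F));  // second byte ~ word in group
    return code;
}
```

With a hypothetical dictionary where "anything" and "everything" get adjacent indexes (130 and 131), both encode to two bytes sharing the first byte, mirroring the shared-prefix effect described above.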

> 3) "Secondary Symbol Estimation"=SSE aka "Adaptive Probability Mapping"=APM.

Sorry, just an interested request for comments.
In my compressor series Glue I have a single context of order 1 or 2 over a subset of the symbols.
I do it this way because it ensures locally responsive behaviour and avoids the awkwardness of
estimating a non-constant probability for not-yet-seen symbols.
That subset is organized as a ranking, and it is the position in the ranking which is coded and sent to the
output in case of a good prediction.
Matched symbols advance by one step, while the escape symbol is subject to move-to-front;
these quantities are prefixed.
If I implemented a per-symbol statistic to control the length of the forward move in the ranking,
do you think one could call it a form of SSE?
Moreover, do you think that would help mitigate the lack of visibility of the actual final
distribution of the source stream?
Thank you in advance.

> it avoids the awkwardness of estimating a non-constant probability for not-yet-seen symbols.

A good universal solution to the "zero probability problem" is to compute posterior string probabilities
(aka likelihoods; including each possible next symbol, and using updated stats after adding the next symbol).

For example, suppose we have an alphabet of N symbols 0..N-1 and occurrence counts c[i] =
numbers of previous occurrences of each symbol.
The posterior likelihood for next symbol being x would be
l[x] = Product((c[i]+(i==x))!,i=0..N-1)/(total+1)!
p[x] = l[x]/Sum(l[i],i=0..N-1) = (c[x]+1)/(total+N)
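This reduction to the closed form (Laplace's rule of succession) can be checked numerically; here's a sketch that computes the factorial likelihoods directly (via lgamma, to avoid overflow) and compares with the closed form:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Closed-form posterior: p[x] = (c[x]+1)/(total+N).
double laplace_direct(const std::vector<int>& c, int x) {
    int total = 0;
    for (int ci : c) total += ci;
    return (c[x] + 1.0) / (total + (int)c.size());
}

// Same value via the factorial likelihoods from the post:
// l[y] = Product((c[i]+(i==y))!, i) / (total+1)!, then normalize.
double laplace_via_likelihoods(const std::vector<int>& c, int x) {
    auto lgfact = [](int n) { return std::lgamma(n + 1.0); };  // log n!
    int total = 0;
    for (int ci : c) total += ci;
    std::vector<double> l(c.size());
    double norm = 0;
    for (size_t y = 0; y < c.size(); ++y) {
        double s = -lgfact(total + 1);
        for (size_t i = 0; i < c.size(); ++i)
            s += lgfact(c[i] + (i == y ? 1 : 0));
        l[y] = std::exp(s);
        norm += l[y];
    }
    return l[x] / norm;
}
```

For counts {3, 0, 1} both routes give p[0] = 4/7, matching (c[0]+1)/(total+N) = (3+1)/(4+3).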

> If I implemented a per-symbol statistic to control the length of the forward move in the ranking,
> do you think one could call it a form of SSE?

I wouldn't. I'd just call it "smart ranking" or something.
Based on common usage, SSE is supposed to use quantized probabilities as context,
and map primary predictions to secondary statistics.

There may be some similarity, as ranking also uses indirect statistics,
but people would normally imagine something completely different
if you told them that you implemented ranking with SSE
(something like what ppmonstr/durilca do).

> Moreover, do you think that would help mitigate the lack of visibility of the actual final
> distribution of the source stream?

1. It's actually "visible" at the encoding stage.
It's possible to encode the distribution, or just compute some flags based on it
and use these to control the coding process (parsing optimization etc.).

2. The potential of ranking for compression is very limited.
For example, direct CM coders show much better results on BWT data than any known kind of ranking.

3. I did actually try different ranking methods in Ash (sorting symbols by descending probability, or moving up one or a few ranks, instead of to rank 0),
but simple MTF showed the best results with SSE.

Instead of using the count of the (last) next symbol, which I cannot see how the decoder
could resolve, is it possible to use as the probability for the not-yet-seen symbols

p = M/(total+N)

where M is the number of symbols with c[] equal to zero?

> Based on common usage, SSE is supposed to use quantized probabilities as context,

Referring to your example of flags, if I understand correctly, SSE finds "similarities" between promotions
at certain rank positions in different contexts, and uses these to refine the primary prediction; is
that how quantization applies?

> 1. Its actually "visible" at encoding stage.

I forgot to mention that in the Glue series encoding is strictly online, not block-based.

Last edited by Marco_B; 3rd February 2020 at 15:56.

> Instead of using the count of the (last) next symbol,
> which I cannot see how the decoder could resolve,

As I described before, mathematically it can be solved by computing
likelihoods of strings including all possible values of next symbol.

In practice I use the p'=sq(C*st(p)) extrapolation formula, which was derived
using these likelihoods.
Not sure if there's a practical way to apply this to non-binary coding though.
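For reference, sq/st here are presumably the usual logistic squash/stretch pair used in CM coders; a sketch with a made-up value of C (the real constant would be tuned):

```cpp
#include <cassert>
#include <cmath>

// st(p) = ln(p/(1-p)) maps (0,1) onto the real line ("stretch");
// sq is its inverse, the logistic function ("squash").
double st(double p) { return std::log(p / (1.0 - p)); }
double sq(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// p' = sq(C*st(p)): with C slightly above 1, probabilities are pushed
// away from 1/2, extrapolating the prior estimate toward a posterior one.
// C = 1.1 is a placeholder, not the value green/ash actually use.
double extrapolate(double p, double C = 1.1) { return sq(C * st(p)); }
```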

> is it possible to use as the probability for the not-yet-seen symbols
> p = M/(total+N)
> where M is the number of symbols with c[] equal to zero?

Well, it's basically the PPM escape model.
Sure, but there are various better models, up to SEE.
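For completeness, that escape-style estimate as code (a sketch of the formula from the question, not any specific PPM variant):

```cpp
#include <cassert>
#include <vector>

// Total probability mass for unseen symbols: with an N-symbol alphabet and
// M symbols still at zero count, the proposal gives the unseen mass
// M/(total+N), i.e. each unseen symbol keeps its Laplace share 1/(total+N).
double unseen_mass(const std::vector<int>& c) {
    int total = 0, M = 0;
    for (int ci : c) {
        total += ci;
        M += (ci == 0);
    }
    return (double)M / (total + (int)c.size());
}
```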

>> Based on common usage, SSE is supposed to use quantized probabilities as context,
> Referring to your example of flags, if I understand correctly, SSE finds "similarities" between promotions
> at certain rank positions in different contexts, and uses these to refine the primary prediction; is
> that how quantization applies?

Quantization is necessary because having individual statistics for 4k probability values (or 32k in my case) is too much.
So SSE is used like this:

p' = SSE[p>>(SCALElog-6)];
SSE[p>>(SCALElog-6)].Update(bit);

or maybe like this:

q = (p*64)/SCALE;
w = (p*64)&(SCALE-1);
p' = ((SCALE-w)*SSE[q+0]+w*SSE[q+1])>>SCALElog;
SSE[q].Update(bit);
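The interpolated variant can be wrapped into a small self-contained table; the update rule and rate below are simplified assumptions, and a real coder would add a context dimension (order, match state, etc.):

```cpp
#include <cassert>

// Minimal interpolated SSE: 65 buckets over a SCALE-based probability,
// initialized to the identity mapping so untrained SSE is a no-op.
enum { SCALElog = 15, SCALE = 1 << SCALElog };

struct SSE {
    int t[65];
    SSE() { for (int i = 0; i <= 64; ++i) t[i] = i * SCALE / 64; }

    int predict(int p) const {
        int q = (p * 64) >> SCALElog;     // bucket index, 0..63
        int w = (p * 64) & (SCALE - 1);   // fraction within bucket
        return ((SCALE - w) * t[q] + w * t[q + 1]) >> SCALElog;
    }
    // Move both neighbouring buckets toward the observed bit.
    // The shift-based rate is a simplification of real counter updates.
    void update(int p, int bit, int rate = 5) {
        int q = (p * 64) >> SCALElog;
        int target = bit ? SCALE - 1 : 0;
        t[q]     += (target - t[q])     >> rate;
        t[q + 1] += (target - t[q + 1]) >> rate;
    }
};
```

Untrained, predict(p) returns p itself; after seeing many 1-bits at a given input probability, the mapped prediction rises above it, which is exactly the systematic-error correction described above.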

> I forgot to mention that in the Glue series encoding is strictly online, not block-based.

Not having block headers (with ~64k..256k blocks) is really not a good thing.
There are many issues (like encountering an incompressible block, a few MBs of zeroes, or MT (de)compression)
which can't be properly solved without block headers.

See image below. Green starts outputting bytes immediately when I begin the .exe... I thought Green first builds a small tree of orders 0-2 over all the 100MB at the start, so it has no uncertainty in the "early layers", no? That'd take at least 5 seconds... https://ibb.co/0jPn57Y

remove memset.
clear order-(-1) with total count.
clear order-3+ with context mixing.
reduce steps of context mixing.
skip calculation loop of total count for limit check.
add rescale of order0 - order2 (compile with -DRESCALE).

It may be faster. And if RESCALE is defined, it often gives better compression (but book1 compression is slightly worse). I compiled with "-static". The attached file includes both the RESCALE and non-RESCALE versions.

Ah, nice. Originally 214,846 bytes for an input of 768,771 bytes. Your code got it down to 214,150 bytes, but it isn't faster than his Tree version at all. It's slower, and larger code...

Time to release my code. I made this from scratch in Blockly: tree, Online Learning, searching, context mixing, arithmetic encoding, decompressor. It is my first actual coding project (I'm 24 btw :p). It can compress 100MB down to ~23MB. I skipped refactoring it for now (there's a lot to shave off...); it is still small code too.

Attached below is the big-input version; try that one, the Blockly one is just to show the toy version. It makes 266,700 bytes into 74,959 bytes, while Shelwien's compressor that makes 100MB into 21.8MB makes the 266,700 bytes into 70,069 bytes. Note I didn't exhaustively find the best params; there ARE a few more bytes that can be squeezed out of it, and I'm not done with it just yet :-). To switch to decompression put 'no' at the top, make the input only have the first 15 letters rather than all 266,700, and put the code it encoded into the decode input at the top, e.g. 0.[here]. You can use the following link to convert the code into bits: https://www.rapidtables.com/convert/...o-decimal.html (although I just divide the length of the encoding by 3, then take one third * 4 and two thirds * 3, which gives approx. the same bit length. E.g. you create 0.487454848, which is 9 digits long, so: 9 / 3 = 3, 3*4=12, 6*3=18, 18+12=30, so 0.487454848 is ~30 bits long).

In practice, that's usually too slow, so we end up using some heuristics.
For example, ppmy/ash have a more complex weight function, and some extra contextual statistics
named "escapes" and "optimals",
where "escapes" is the number of cases where the next symbol never occurred in a given context,
and "optimals" is the number of cases where the context had the best prediction.

PPM, in contrast, bases its (implicit) weighting solely on escapes.

> also, I see this table is 17 by 4 dimensions, why 17?

Because order0..16.

> Why 4?

Because this formula has 4 coefs:
w = coef[j][n==1]/(1+coef[j][2]+total/(coef[j][3]+1));

> What does 15592 do.....what does 32832

These are the main coefs for (n>1) and (n==1) (where n is the alphabet size, aka the number of unique symbols in the context).
Contexts with n==1 are called "deterministic" and usually need a higher weight.

> do.....4.........1......????

And these are other two coefs.
[3] is kind of a multiplier for the main coef.
One, it allows changing the context's global weight, rather than [0] and [1] individually.
Two, "total" can grow to billions, so this kind of setup reduces the chance of overflow.

And [2] is a total offset; it reduces the context's weight for small values of total.

As to the specific numbers, these are purely empirical.
Basically I tried various coef values on specific files, and kept the ones that gave the best result.
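Written out as a function, with the two coefs quoted above plus placeholder values for the other two (the function name and layout are mine, not Green's actual code):

```cpp
#include <cassert>

// Green-style weight for one context order j:
//   w = coef[j][n==1] / (1 + coef[j][2] + total/(coef[j][3]+1))
// coef[0] is the main coef for n>1, coef[1] for deterministic contexts
// (n==1), coef[2] offsets small totals, coef[3] rescales total to avoid
// overflow. The [2]/[3] values below are illustrative, not tuned.
double order_weight(const int coef[4], int n, long long total) {
    int main_coef = coef[n == 1 ? 1 : 0];
    return (double)main_coef /
           (1.0 + coef[2] + (double)total / (coef[3] + 1));
}
```

With coefs {15592, 32832, 4, 1}, a deterministic context gets roughly twice the weight of an n>1 context at the same total, and the weight of either decays as total grows.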

So I'm just going to do this: to decide a model/order's weight during mixing, I'll look at the total counts in a prediction set/model, the total unique chars, and how high each individual count is. These 3 hints should give an optimal weight for a model.

Cool, my compressor got BOOK1 down to 212,499 bytes and I just verified decompression. Shelwien's Green did 214,843 lol. I saw in Charles Bloom's writing that he got 212,733 with PPM. Note I can still squeeze more out; I need to fix some things... and I'm struggling...

Compressing 100,000 bytes of enwik8:
31,516 < Green's tree
32,145 < mine

So Shelwien, your formula: when Green mixes the 17 contexts, the weight given to a context is based on:
1) how many statistics this order has (total).
2) how many unique symbols it predicts (n).
3) a global/local weight (the coefs).

Global weight... Is this telling the algorithm to somewhat ignore short-order contexts and very-long-order contexts? How does this help? I'd think the context itself would be all we need to decide the weight.

Does Green pre-compute an order 0-2 tree (full global view, no uncertainty) prior to doing Online Learning? Or is that inferior? (My algorithm only does Online Learning.)