18th January 2020, 17:29
> What does P0 mean?
Just an intermediate result.
We need to normalize the distribution, since the sum of the probabilities should be 1.
P is the final distribution used for coding.
> What is the constant? From what? A letter?
It's these constants (config.inc in green_v3a.rar):
uint coef[17][4] = {
27584, 4, 208, 0,
15592, 32832, 4, 1,
59299, 86318, 7, 0,
26624, 65536, 3, 5,
8178, 38914, 0, 32,
7520, 56802, 0, 58,
7104, 51840, 0, 73,
7425, 47736, 0, 248,
8192, 41088, 0, 76,
7236, 32264, 0, 17,
1280, 34881, 0, 162,
4096, 40960, 0, 152,
9380, 24634, 0, 45,
4096, 25120, 0, 13,
4096, 24576, 0, 14,
4096, 41088, 0, 15,
4096, 49469, 0, 16
};
They're for transforming n (number of different symbols that appeared in context),
total (number of occurrences of context), and order (j here) into a weight value.
This is the formula in the source:
uint w = coef[j][0]/(1+coef[j][1]*(n-1)+total/(coef[j][2]+1)); // mixer weight
As you can see, a context with (n==1), i.e. only one symbol, gets a higher weight,
since it's likely to continue like that.
> What is w1*0.35 mean and what is result for just that? What is w1?
I changed 35% to 0.35 because I normally work with probabilities in 0..1 range.
And we can start by normalizing weights (w1'=w1/(w1+w2+w3) etc) and probability distributions,
then w1*P1 would be the subrange of the arithmetic code assigned to symbol 'e' by the first submodel.
We don't need to tell the decoder which model seemed better to the encoder,
so we merge the same-symbol intervals of the different submodels.
> It'll just add to 100% and after we /sum will say 'e' is ex. 50% of it, which is what we knew.
This normalization step here also normalizes weights, so it has to be there anyway.
Also, computer programs don't really calculate anything precisely,
so we can get slightly more or less than 100% due to rounding of intermediate values, etc.
> My understanding is still each model has its %s for the probability for each letter,
> then add each model m+m+m, then divide by 3. help...
That's what it does; "add each model then divide by 3" just means
w1=1/3; w2=1/3; w3=1/3;
and it's not the best weight function.
> Your algorithm Green is really fast and 'good start' for me and I want to know how it works exactly.
> What are you storing?
There's this:
byte* inpbuf = new byte[inplen]; if( inpbuf==0 ) return 1; // unpacked data buffer (inplen = input size)
uint* l2 = new uint[inplen]; if( l2==0 ) return 1; // o3+ link buffer
static qword freq;
static uint low;
static uint o0[256]; bzero( o0 );
static uint o1[256][256]; bzero( o1 ); // 256k
static uint o2[256][256][256]; bzero( o2 ); // 64M
static uint p2[256][256][256]; bzero( p2 ); // +64M, last occurrence pointers
So it stores symbol counts up to order2 as is, then builds linked lists
(via l2/p2 arrays, which store location of each previous occurrence of each order3 context).
> I see you say bruteforce, what?
Because unlike normal CM/PPM, it builds the distributions for o3+ dynamically,
by visiting up to DEPTH (=8192) recent context occurrences.
> If we take a piece of text "hello world and hello again" how many
> letters far back does it consider after trained offline and is running now?
Well, it would go back once for "hel".."hello " contexts,
I don't see any other o3+ matches in your example.
> 16 orders? How many context models? 1 length of 16? No way. 40 models?
17 = order0 .. order16
> Do you use a dict or tree? What is it doing that I'm not.
"green.cpp" uses a data structure usually known as "hash chains" in LZ, although without a hash function.
"tree.cpp" implements the same model with an actual tree.
> Here is my formula/understanding.
> Can you tweak this image to Green's formula?
> I will grasp it if it's visual. https://ibb.co/xHyhnmk
I made a log of distributions during green's processing of
" the cat was running down the street"
(I added 3 spaces at the start, because green just writes the first 3 symbols as-is, without modelling.)
See it in attached file.