# Thread: All the lessons we learnt

1. So let's get through as much as we can here. Can you help me?

Shelwien's algorithm Green can get enwik8 down from 100MB to 21.8MB. It builds, online, a simple tree of contexts up to 17 letters long with counts, mixes the 17 order sets weighted by total count and total symbols, ignores low/high orders where appropriate, and arithmetically encodes the result. It prunes unused branches and skips mixing statistically weak low orders when higher orders have enough stats. I have nearly recreated this. I'm pondering whether I should remove counts from lower orders after seeing them in higher orders, e.g. order 5 "a 6 b 7" makes order 4 "a 22 b 16 c 9 d 8" become "a 16 b 9 c 9 d 8"?
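Here's my own toy reconstruction of the mixing step (my code and my simplification, not Green's actual source):

```python
# Toy count-weighted order mixing (my own reconstruction, not Green itself):
# each order keeps symbol counts per context, and predictions are blended
# with weight proportional to the evidence (total count) each order has.
from collections import defaultdict

def update(models, context, symbol):
    # add one count for the symbol in every order's current context
    for k in range(len(models)):
        ctx = context[-k:] if k else ""
        models[k].setdefault(ctx, defaultdict(int))[symbol] += 1

def mix_orders(models, context, vocab):
    mixed = {s: 1.0 for s in vocab}      # add-one style prior over the alphabet
    total = float(len(vocab))
    for k in range(len(models)):
        ctx = context[-k:] if k else ""
        counts = models[k].get(ctx, {})
        n = sum(counts.values())
        if n == 0:
            continue                     # this order has no stats: skip it
        for s, c in counts.items():
            mixed[s] += c                # an order's influence grows with n
        total += n
    return {s: v / total for s, v in mixed.items()}
```

Trained on a string like "ababab", the mixed distribution after the final 'b' favours 'a', since every order agrees.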

Next, I ran the 100MB through the best cmix pre-processor, and Green got it down to 20.6MB. What does cmix's pre-processing do exactly? I got it from Matt's benchmark page.

Next, SSE (or SEE? Secondary Symbol Estimation): is this just about high orders that are missing the symbols you need to code, i.e. predicting the escape probability of a 0-count symbol, e.g. 0.003? E.g. we've seen 88 't', 5 'b', 1 'z'; should the chance of seeing a new letter then be, say, 0.00005? So the uncertainty of 0 counts is being predicted for the primary predictor? And this gets our 20.6MB to about 19.6MB?
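While waiting for an answer: the classic estimate I found for this 0-count situation is PPM's escape probability, e.g. PPM "method C", which uses the number of distinct symbols seen (small sketch, my own code):

```python
# PPM "method C" escape estimate: P(escape) = u / (n + u),
# where n = total count in this context, u = distinct symbols seen.
def escape_prob_ppmc(counts):
    n = sum(counts.values())
    u = len(counts)
    return u / (n + u)

# the example above: 88 't', 5 'b', 1 'z'
p = escape_prob_ppmc({"t": 88, "b": 5, "z": 1})   # 3 / 97, about 0.031
```

So the standard estimate is far larger than the 0.00005 I guessed, because three distinct symbols in ~94 observations is still real evidence of novelty.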

What next? Is there room left in this scheme to update a neural network that uses activation/inhibition? How does it work (exactly)?

Does CTW (Context Tree Weighting) do what Green does? How does CTW work? I know it's patented but I'm working on an AGI architecture.

And what is this "Both programs use other techniques to improve compression. They use partial update exclusion. When a character is counted in some context, it is counted with a weight of 1/2 in the next lower order context. Also, when computing symbol probabilities, it performs a weighted averaging with the predictions of the lower order context, with the weight of the lower order context inversely proportional to the number of different higher order contexts of which it is a suffix."
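Here's my reading of the quoted passage in code form (my own interpretation; halving again at each further step down is my assumption, the quote only specifies 1/2 for the next lower order):

```python
# One reading of "partial update exclusion": the coded symbol gets a full
# count in its highest-order context and half weight one order down
# (I halve again at each further step, which is my own assumption).
def update_partial_exclusion(models, context, symbol):
    weight = 1.0
    for k in range(len(models) - 1, -1, -1):   # highest order first
        ctx = context[-k:] if k else ""
        counts = models[k].setdefault(ctx, {})
        counts[symbol] = counts.get(symbol, 0.0) + weight
        weight *= 0.5
```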

Apparently, Matt used just a 0-5 letter (or so) NN and got 18MB. Did he use pre-processing or SSE?

I'm unsure about the rest of the methods to get it to the 14.8MB mark, but I'm guessing 2MB is taken off by grouping words. All the 15MB-ers must be doing this. So where does the 16MB-18MB gap (2MB) get removed? What else is happening there?

How much more does bit prediction get it? How much more does using a thousand models get it?

2. > What does cmix's pre-processing do exactly? I got it from Matt's benchmark page.

There're many tricks.
http://nishi.dreamhosters.com/u/bwtpaper.txt

The main part is currently mostly called WRT (word replacement transform);
it was originally introduced as LIPT: http://vlsi.cs.ucf.edu/datacomp/papers/fauziaitcc.pdf
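The core trick fits in a few lines (toy dictionary and codewords invented for illustration):

```python
# Toy word replacement transform: frequent words become short codewords,
# so the context model sees shorter, more regular tokens.
DICT = {"the": "\x01", "compression": "\x02"}   # hypothetical tiny dictionary
INV = {v: k for k, v in DICT.items()}

def wrt_encode(text):
    return " ".join(DICT.get(w, w) for w in text.split(" "))

def wrt_decode(text):
    return " ".join(INV.get(w, w) for w in text.split(" "))
```

Real WRT/LIPT dictionaries are much larger, and also handle capitalization, punctuation, and word variants.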

> Next, SSE (SEE? Second Symbol Estimation) is, is this just the issue during high orders that are missing the symbols you need

SSE aka APM is mainly about the idea that a prediction can be used as context for secondary statistics.

It is _always_ useful; it can even replace the mixing algorithm.

SSE can compensate for various persistent prediction errors,
eg. the difference between prior and posterior probabilities.

But the main compression improvement from SSE comes from combining primary predictions
with some other information (match flags, prediction failure stats...)
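A minimal SSE/APM sketch (the bucket count, side-context size and learning rate here are arbitrary picks, not from any specific coder):

```python
# SSE table: the quantized primary prediction plus a small side context
# index a secondary probability, which is adapted toward the actual bits.
class SSE:
    def __init__(self, buckets=33, contexts=2, rate=0.02):
        self.rate, self.buckets = rate, buckets
        # initialize each row to the identity mapping p -> p
        self.table = [[i / (buckets - 1) for i in range(buckets)]
                      for _ in range(contexts)]

    def predict(self, p, ctx):
        i = min(int(p * (self.buckets - 1)), self.buckets - 1)
        self.last = (ctx, i)
        return self.table[ctx][i]    # real coders interpolate two cells

    def update(self, bit):
        ctx, i = self.last
        self.table[ctx][i] += self.rate * (bit - self.table[ctx][i])
```

If the primary model keeps saying 0.9 in some side context while the bit keeps being 0, the corresponding cell learns to output a low probability instead.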

> What next? Is there room left in this schema to update a neural network that uses activation/inhibition? How does it work (exactly)?

Sure, for example here's a purely NN-based compressor: https://bellard.org/nncp/

Anything works, if it's able to produce a probability estimation.

> Does CTW (Context Tree Weighting) do what Green does? How does CTW work? I know it's patented but I'm working on an AGI architecture.

For CTW there're papers and an open-source implementation, nobody would read them for you.

The main idea there is about working with context history likelihoods (string probabilities) instead of
individual symbol probabilities.
"Logistic mixing" introduced by paq is kinda based on a similar concept.

> And what is this "Both programs use other techniques to improve compression. They use partial update exclusion.

It's exactly what is written. PPM doesn't have explicit mixing, but instead has a special "escape" symbol
added to the alphabet, so when it tries to encode the symbol with a single context distribution
(usually starting from the highest available order), sometimes it has to encode escape until it finds
a context which does have statistics for the next symbol.

So, since there's an explicit coding context, there can be differences (compared to CM) related to this,
like whether we update statistics with the processed symbol only in the coding context, or in all contexts.
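Schematically (structure only; a real PPM adds exclusion sets and careful escape probability estimation):

```python
# Schematic PPM coding loop: try the highest order first, code escapes
# until some context has statistics for the symbol, fall back to order -1.
def ppm_code_symbol(models, context, symbol, max_order):
    """Return the coding events: zero or more ('esc', order), then ('sym', order)."""
    events = []
    for k in range(max_order, -1, -1):
        counts = models[k].get(context[-k:] if k else "", {})
        if symbol in counts:
            events.append(("sym", k))   # code the symbol in this context
            return events
        if counts:
            events.append(("esc", k))   # code an escape, drop an order
    events.append(("sym", -1))          # order -1: uniform over the alphabet
    return events
```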

> Apparently, Matt used just a 0-5 letter or so NN and got 18MB. Did he use pre-processing or SSE?

paq doesn't really have any relation to NNs, and it's much more than just order-5.
paq2+ uses SSE.

Original Mahoney's paq doesn't have integrated preprocessing, but has wordmodel and some other submodels
which have similar effect to a preprocessor.

There're also some paq branches (paq8pxd, cmix) which do have integrated preprocessing.

> I'm unsure about the rest of the methods to get it to the 14.8MB mark but I'm guessing 2MB are taken off using grouping words.
> All 15MB-ers must be doing this. So where does the 16MB-18MB (2MB) get removed? What else is there happening?

http://mattmahoney.net/dc/text.html
1) There're some 17,xxx results too
2) 2 of 5 top results (durilca and NNCP) don't have anything special aside from preprocessing.

So I'd say the main difference is in memory usage.
Many (especially older) compressors limited themselves to ~1GB of RAM, so their compression is limited too.
PAQ is special in this sense, because it essentially stores its statistics in compressed format (FSM counters + compact collision-friendly hashtables).

> How much more does bit prediction get it? How much more does using a thousand models get it?

It's very implementation-specific. 1000 dumb submodels (eg. all 10-bit bitmasks for prefix context)
won't help much, while 1000 hand-crafted handlers for specific parts of the file syntax could help a lot.
For example, there's a dedicated submodel for compression of the digits of pi, which improves compression
by orders of magnitude.

As to enwik - there's no clear lower bound; research continues.
But we certainly know that current records are not the best, since there're obvious parts
that are not handled yet (eg. english semantics or several types of markup syntax in enwik).

3. ## Thanks:

Alexander Rhatushnyak (3rd March 2020)

4. Ok, I'm reading it now, thanks. Will update (edit) here. Can anyone else weigh in, though, on all the methods to get it to 14.8MB? Shelwien is very knowledgeable, but English doesn't seem to be his first language, and I would like shorter, simplified, step-like instructions (#1, #2, #3) as I'm trying to code it all...

Honestly, I think "first master pre-processing, then master word grouping" is the wrong order; I should focus on the biggest compression/speed gains first, and leave the less important things, like 0.1% improvements from transforms, for later. And compression comes first: how interesting is a method that is super fast but only improves compression by 1%? 60% and ultra slow is more interesting. Of course we can test on a small dataset; speed isn't an issue, and I need only focus on compression (unless you are doing brute force lol).

Rearranging articles in enwik8 is a human-like thing AI would need to do. The pre-processing has room for improvement, in my belief. Finding patterns, plus making patterns, is how to better predict the very world you live in and probabilistically survive longer by maximizing food and offspring and repairing data. Arteries are fractals, so blood probabilistically reaches most areas. We want to know where, when, and what everything is, to save on energy etc. and predict better with reliable friend connections. Earth will become a total fractal of nanomachine units.

For BWT, I'm avoiding it; I believe it disturbs the data completely and is not beneficial.

For flagging EOLs, all-caps, duplicate phrases, etc., I intend to use only a non-hand-crafted grouping scheme like Word2Vec, for recognizing 'T' as 't' or 'run' as 'walk'.

There's no need to first replace common phrases with a 1-symbol code; they can be recognized as their own code to be predicted (so don't use bit prediction, unless it's equivalent).

The 'longer words are less frequent' study: yes, this is what we are doing already, really; they just suggest flagging a word's length and thereby narrowing down which word it is. Flagging may be a trap, like BWT seems to be. We could instead know all the words that follow the previous words and group-activate them by the net's child count (total unique symbols, total frequency, total children, total parents...).

Byte Pair Encoding is similar: you get near-optimal codes. And what if 50% of the dataset is a duplicate? You must search for these as a pre-process, and can't store all strings.

Update counts/weights based on prediction error - how does this work? :( Say an order's symbol count is very low but we need to arithmetic-code it and waste space; well, it isn't often, but what data do you get from prediction-error stats... I don't get it...

5. Is there any way to compress an already-compressed file (e.g. multimedia: MP4 and MKV...) again? At least 40%?

6. Originally Posted by uhost
Is there any way to compress an already-compressed file (e.g. multimedia: MP4 and MKV...) again? At least 40%?
By recompression - first decompress the compressed input, then apply a better compression algorithm to the uncompressed data.

7. ## Thanks:

uhost (6th March 2020)

8. Question - the top winning entries are 'fairly' fast and get enwik8 down to ~15MB; have any of you stumbled upon a trade-off that resulted in even slower algorithms that reached 13 or 10MB for enwik8? Or is the 14.8MB one the limit, the point of no return lol?

Has anyone here ever come across an algorithm that, although very slow and/or needing big RAM (feasible though), can compress enwik8 down from 100MB to 13 or 10MB?

9. Originally Posted by Self_Recursive_Data
Question - the top winning entries are 'fairly' fast and get enwik8 down to ~15MB; have any of you stumbled upon a trade-off that resulted in even slower algorithms that reached 13 or 10MB for enwik8? Or is the 14.8MB one the limit, the point of no return lol?

Has anyone here ever come across an algorithm that, although very slow and/or needing big RAM (feasible though), can compress enwik8 down from 100MB to 13 or 10MB?
Not that fast... phda needs approx. 5 hours.
I think it should be reached within a few years (or maybe months) - if you look at the LTCB results, the rate of reduction is not that big.

10. So no one here has figured out how to get enwik8's 100MB to 10MB, but with a method way too time-consuming to be accepted onto any formal benchmark listing/prize, being considered practically useless?

11. > I would like more short and simplified step-like instructions like #1 #2 #3 as I'm trying to code it all...

"Data Compression: The Complete Refrence" by David Salomon
https://www.amazon.com/dp/8184898002
https://libgen.is/book/index.php?md5...3F3DBDECF364BB

"Understanding Compression" by Colt McAnlis
https://www.amazon.com/dp/1491961538
https://libgen.is/book/index.php?md5...BB587932E9C792

"Data Compression Explained" by Matt Mahoney
http://mattmahoney.net/dc/dce.html

"Modeling for text compression" by T.Bell,I.Witten,J.Cleary
https://dl.acm.org/doi/abs/10.1145/76894.76896
https://sci-hub.se/10.1145/76894.76896
https://dacemirror.sci-hub.se/journa....pdf#view=FitH

Also preferably something about C programming, data structures, sort algorithms, fast string search etc.

Unfortunately C/C++ is the language with the best compiler support atm.

When programming in python, you need 1000x computing power for the same results as C/C++.
While computing power (and memory size and latency) is still the main limit for compression.

> I should focus on the biggest compression/speed earnings first,
> then the least important things latter like 0.1% improvements using transforms etc etc.

1) You're wrong about preprocessing, it actually improves compression by 5-10% for texts.

2) In any case, state-of-art context modelling and entropy coding algorithms are necessary.

> For BWT, I'm avoiding it, I believe it disturbs the data completely and is not beneficial.

Yes, although it's still able to compress texts much better than Green.

Also BWT Suffix Array is a useful compact data structure for string search.

> For flagging EOLs, all CAPS, duplicate phrases, etc, I intend to use only non-hand-crafted grouping scheme like Word2Vec for recognizing T as t or run as walk.

This kind of approach probably fails under Hutter Prize rules, because they count the compressor/decompressor size.
You can't afford 100MB word2vec databases there.

> There's no need to first replace common phrases with a 1 symbol code, they can be recognized as their own code to be predicted

Sure, the whole preprocessing idea overall is mostly a speed optimization trick, rather than a perfect solution.

But atm we don't have the resources to do probability estimation at the phrase level, or even at the word level.

For example, would you want to spend several extra weeks on porting paq8 to work with a 16-bit alphabet,
or just compress the data as utf8 rather than utf16?

> The 'longer words are less frequent' study, yes, this is what we are doing already really,
> just they suggest to flag its length and therefore narrow down which word it is...flagging may be a trap like how BWT seems to be.

It's not so simple. DRT/phda/cmix also have a dictionary with pre-optimized (specifically for enwik8) word order,
such that similar words have codes with similar prefixes.

> We could instead know all words that entail the previous words and group activate them by the net's child count

Ideally yes, but you won't be able to compress enwik9 like that on current PC hardware.

> Update counts/weights based on prediction error, how does this work :(.

Like NN training.
First we only have contextual statistics and compute a prediction.
Then we know the actual symbol and can adjust statistics to improve the next prediction.
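In the simplest form (the rate here is an arbitrary pick):

```python
# Error-driven update: after seeing the actual bit, nudge the stored
# probability toward it, proportionally to how wrong it was.
def adapt(p, bit, rate=0.05):
    return p + rate * (bit - p)

p = 0.5
for _ in range(100):
    p = adapt(p, 1)      # the bit keeps coming up 1
# p has moved close to 1; an already well-predicted bit moves p very little
```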

> Say an order's symbol count is very low but we need to arithmetic code it and waste space,
> well it isn't often, but what data do you get from prediction error stats... I don't get it...

It's like knowing that the data follows a normal distribution with specific parameters.
https://en.wikipedia.org/wiki/Maximu...arameter_space

> Question - the top winning entries are 'fairly' fast and get enwik8
> down to ~15MB, have any of you stumbled upon a trade-off that resulted in
> even slower algorithms that reached 13 or 10MB for enwik8? Or is the
> 14.8MB one the limit, the point of no return lol?

14.8MB is already the cmix result (with 25GB memory usage), which is unfit for HP.

cmix is also not a simple standalone model, but a complex hybrid of many previous implementations
(it includes WRT preprocessing, two PAQ versions, PPMD, LSTM model etc).

There're some methods to improve its results a little (extra preprocessing, extra models from
some other independent implementations), but improvements would be minor
and nobody bothers to test them because of testing speed.

You also have to understand that current cmix is already v18 - it integrated
lots of various improvements over the years.

So no, nobody reached 13 or 10 MB for enwik8.

In fact 10MB is likely impossible with HP rules.
