Fixed.
Lower memory footprint, smaller exe, but most importantly much better compression.
Code:
- Removed LSTM model, DMCForest, SmallStationaryContextMap, Adaptive Learning Rate
- Removed unused functions from utils.hpp
- Imported LargeStationaryMap, MatchModel, IndirectContext changes from (a pre-release of) paq8px
- Replaced the 12-bit Arithmetic Encoder with a new high-precision one
- New APM: "APMPost" to support the new Arithmetic Encoder with higher-precision probabilities
- Cleaned up NormalModel, refined some contexts
- Cleaned up SSE stage and Shared
- Tweaks, cosmetic changes
The reason behind the dramatic compression improvement lies almost entirely in the higher precision probability representation: the new APM map used in the SSE stage and the new Arithmetic Encoder.
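For readers curious about the mechanics, below is a minimal sketch of a paq-style binary arithmetic encoder with a configurable probability precision (illustrative only - names and structure are not paq8gen's actual class). The only thing that changes between a 12-bit and a higher-precision coder is how finely the range split can follow a skewed prediction.

Code:
#include <cstdint>
#include <vector>

// Minimal sketch (not paq8gen's actual implementation): a binary
// arithmetic encoder whose probability input has PBITS bits of
// precision. With PBITS = 12 the most skewed representable
// probability is 4095/4096; raising PBITS lets near-certain
// predictions pay (almost) their true, tiny entropy cost.
constexpr int PBITS = 20;  // probability precision in bits

struct Encoder {
  uint32_t lo = 0, hi = 0xFFFFFFFFu;
  std::vector<uint8_t> out;

  // bit: the bit to code; p1: P(bit == 1) scaled to [1, (1<<PBITS)-1]
  void encodeBit(int bit, uint32_t p1) {
    // split the current range [lo, hi] proportionally to p1
    uint32_t mid = lo + static_cast<uint32_t>(
        (static_cast<uint64_t>(hi - lo) * p1) >> PBITS);
    if (bit) hi = mid; else lo = mid + 1;
    while ((lo ^ hi) < (1u << 24)) {  // top byte settled: emit and renormalize
      out.push_back(static_cast<uint8_t>(lo >> 24));
      lo <<= 8;
      hi = (hi << 8) | 0xFF;
    }
  }
  void flush() {  // emit enough bytes to pin down the final range
    for (int i = 0; i < 4; i++) {
      out.push_back(static_cast<uint8_t>(lo >> 24));
      lo <<= 8;
    }
  }
};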
Source: https://github.com/GotthardtZ/paq8gen
Windows binaries: https://github.com/GotthardtZ/paq8gen/releases/tag/v3
Mike (31st January 2021),moisesmcardona (1st February 2021),mpais (31st January 2021)
Summary for paq8gen_v3 for coronavirus challenge (not submitted):
Size optimized exe in zip archive: 35'571 bytes
Transform source code in zip archive: 1'423 bytes
Compressed archive: 723'319 bytes
Total: 760'313 bytes.
coronavirus.fasta.preprocessed_seq_only.8 100'003'453 bytes
Code:
paq8gen_v1.exe -8 109364 bytes
paq8gen_v2.exe -8 107662 bytes
paq8gen_v3.exe -8  82331 bytes
coronavirus.fasta.preprocessed.full 1'317'937'667 bytes
Code:
paq8gen_v1.exe -12 1'118'915 bytes 23275.12 sec
paq8gen_v2.exe -12 1'085'007 bytes 25974.80 sec
paq8gen_v3.exe -12   723'319 bytes 23747.36 sec
Only now noticed this thread. Nice work! I'll try to benchmark paq8gen sometime this year.
Thank you for the encouragement!
Paq8gen is still at an early stage - not mature yet, still a lot to do. So expect some more changes in the near future. If you follow this thread you will see anyway!
Mike (2nd February 2021)
1) VS with size opt builds a smaller exe. It could be much smaller if not for (kinda unnecessary) C++1x syntax in the source.
2) You can use upx and other exepackers instead of just zip.
3) ~5k of exe size is text (mostly usage text), this can be removed.
Update: removing constexpr from Stretch.hpp makes the exe 2x smaller, saves 5k of compressed size.
1) a) The size optimized version is compiled by VS 2019 Community ("Favor small code (/Os)"). (see here)
1) b) The C++17 syntax comes from its parent, paq8px. Hmm. So without it the exe would be smaller? Or would the stdlib that comes with such a build make it smaller? Thanks for the hint! I'll try.
2) I had tried other exe packers, but not upx. Whichever I tried failed - maybe they didn't like the 64-bit exe. Tried upx now: success! But unfortunately it's not good enough: zip/LZMA: 35K, UPX: 42K.
3) Earlier I began removing text, CPU dispatching (so it would be AVX2-only), transform, detection, etc. Then I realized that I was heading towards a maintenance nightmare when I later want to develop the main branch and the stripped branch in sync. -> Probably I'll create a #define that disables the most serious "bloat" - for the sake of the challenge.
I appreciate the hints!
Here's my current result. VC6 could be 2x smaller still, if not for C++17.
Obviously x86 is smaller.
Update: Removed relocs. 59904 -> 57856
Update2: Added x64 version. 69632 bytes .exe
Nice! 35K (original + zip/LZMA) -> 30K (yours + zip/LZMA). (33K when UPX'ed.)
For "Update2" (your 64-bit version): 32K with zip/LZMA
Minor version (tweaks only).
Code:
- Small tweaks in NormalModel, LineModel and MatchModel
- Small tweak in APMPost: decreased a-priori counts
- Small tweak in SSE stage: using two APMPost maps
- Tweaked model memory use
- Tweaked mixer scaling
Source: https://github.com/GotthardtZ/paq8gen
Windows binaries: https://github.com/GotthardtZ/paq8gen/releases/tag/v4
moisesmcardona (7th February 2021)
coronavirus.fasta.preprocessed_seq_only.8 100'003'453 bytes
Code:
paq8gen_v1.exe -8 109364 bytes
paq8gen_v2.exe -8 107662 bytes
paq8gen_v3.exe -8  82331 bytes
paq8gen_v4.exe -8  78236 bytes
coronavirus.fasta.preprocessed.full 1'317'937'667 bytes
Code:
paq8gen_v1.exe -12 1'118'915 bytes 23275.12 sec
paq8gen_v2.exe -12 1'085'007 bytes 25974.80 sec
paq8gen_v3.exe -12   723'319 bytes 23747.36 sec
paq8gen_v4.exe -12   696'424 bytes 23969.18 sec
@Gotty:
First of all, nice job with this! I love to see that you used paq8px as a basis.
I'm curious, for the optimizations/tweaks that you've made in terms of size, do you do each by hand / intuition? Or do you have some sort of automated facility for trying out different tweaks?
Also, do you think anything you've developed here (especially the higher precision arithmetic encoder) would be useful to contribute back to paq8px?
Anything I can help out with? Happy to lend cycles for testing, or eyes/brain for development. I'm happy to start work on #define'ing out the parts that wouldn't be used in a true contest entry.
Yes, intuition is behind it. I try all the ideas I think would improve compression (i.e. model the data better).
For example, I improved the MatchModel in paq8px first (see the list of changes in v201) because I knew it would come in handy in paq8gen. Then I ported it to gen - and found it didn't help. That was a big surprise. Actually, nothing helped to improve compression in paq8gen any further. I investigated and found that the highly skewed probabilities don't come through the arithmetic encoder. We are losing thousands of bytes (!) just because the first bit of every byte is always zero (and we have 1 billion of them). Why? Because 12-bit precision can only represent p = 1/4096, but we would need p = 1/billion when the file is 1 GB. How much is the loss? For a 1 GB file with a 12-bit arithmetic encoder you lose -log2(4095/4096) * 2^30 ≈ 378K bits (at least); with a 20-bit precision encoder (-log2((2^20-1)/2^20) * 2^30) it can go as low as 1.4K bits (an oversimplified calculation to make the problem more apparent).
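The calculation above is easy to reproduce. Here is a small self-contained program (my own illustration, not from the paq8gen sources) that evaluates the loss formula for both precisions:

Code:
#include <cmath>
#include <cstdio>

// Back-of-the-envelope estimate from the post: with k bits of
// probability precision, the most skewed representable probability is
// (2^k - 1)/2^k, so every "practically certain" bit still costs
// -log2((2^k - 1)/2^k) bits. A 1 GB file has 2^30 such first-bits,
// one per byte.
int main() {
  const double n = 1024.0 * 1024.0 * 1024.0;  // 2^30 near-certain bits
  for (int k : {12, 20}) {
    const double pmax = (std::ldexp(1.0, k) - 1.0) / std::ldexp(1.0, k);
    const double lossBits = -std::log2(pmax) * n;
    std::printf("%2d-bit precision: minimum loss ~%.0f bits\n", k, lossBits);
  }
  // prints ~378K bits for k = 12 and ~1477 bits for k = 20,
  // matching the 378K / 1.4K figures above
  return 0;
}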
So I tried whether increasing the precision of the arithmetic encoder would help. And it did.
This is how it goes...
Yes!
Actually, the high precision arithmetic encoder published in paq8gen comes from an unreleased idea from paq8px. Unfortunately it does not help in the usual case (counterintuitive, right?) because the current 12-bit precision is actually a protection: it keeps the probabilities from becoming too skewed.
When paq8px starts to have a *strong* belief about what comes next, but the data changes suddenly, then you have a big loss. The stronger the belief, the bigger the loss. So we limit the loss with the low-precision arithmetic encoder.
But with paq8gen the data is mostly "regular", semi-static, and the models are tuned for this kind of data, so it fits there perfectly.
But there is a requirement: you have to predict "perfectly", otherwise you'll experience big losses on those skewed probabilities with every mispredicted bit.
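To put a number on that requirement (my own illustration, not paq8gen code): the cost of coding a bit is -log2(p), where p is the probability the model gave to the value that actually occurred, so a single bit mispredicted at near-certain confidence costs the encoder's full precision:

Code:
#include <cmath>
#include <cstdio>

// Cost of coding one bit is -log2(p), where p is the probability the
// model assigned to the value that actually occurred. At 20-bit
// precision a correctly predicted near-certain bit is almost free,
// while a mispredicted one costs ~20 bits.
int main() {
  const int k = 20;
  const double pRight = (std::ldexp(1.0, k) - 1.0) / std::ldexp(1.0, k);
  const double pWrong = 1.0 - pRight;  // 1 / 2^20
  std::printf("predicted:    %.7f bits\n", -std::log2(pRight));  // ~0.0000014
  std::printf("mispredicted: %.1f bits\n", -std::log2(pWrong));  // 20.0
  return 0;
}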
Also for @byronknoll:
If the requirement above cannot be fulfilled, you'll experience losses. So what happens when you try compressing the sars-2 challenge with a compressor that is not fine-tuned for the benchmark but has a high precision arithmetic encoder? Like cmix?
If you downgrade the 16-bit arithmetic encoder in cmix to 12 bits, you'll get a surprise! The original cmix compresses ~10 million bytes of sequence-only data from the sars-2 challenge to 15745 bytes, but with the 12-bit precision encoder it produces 14593 bytes. So in the case of cmix, downgrading helps. It tells us this: if the compressor becomes overconfident (and it's easy to be, as the sars-2 challenge data is very compressible) but the models do not predict the data well enough, then the compressor may experience big losses during compression. (Note: the same file for paq8gen_v5_pre (not yet released) is 12990 bytes.)
So having a high-precision arithmetic encoder is a double-edged sword: it helps tremendously when your models fit the data well; on the other hand, any misprediction costs you a lot when the model is too confident. I strongly suspect that the latter is the reason behind the "poor" compression from cmix. I might be wrong (I didn't investigate it deeper).
I would be glad! I'll publish v5 over the weekend (at least that's my plan) and the codebase will be open for any tweaks. What you can do in the meantime: you can start creating a (non-constexpr) singleton from squash and stretch (see the sketch below). As Shelwien spotted above, their tables are stored in the exe and take up much space. Also, you can #define out the help screen text or any related functionality (like redirection). Your call. You'll know what to do.
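For reference, such a non-constexpr singleton could look like the sketch below (the scaling constants and names are illustrative, not paq8gen's actual table values). The point is that the table is computed once at startup instead of being baked into the executable as constexpr data:

Code:
#include <cmath>

// Sketch of the suggested refactoring (illustrative, not paq8gen's
// actual code): build the stretch table at runtime instead of storing
// a constexpr table in the exe. stretch() is the inverse of the
// logistic squash(): ln(p / (1 - p)), in fixed point over the 12-bit
// probability domain used by paq-style mixers.
class StretchTable {
  short t[4096];
  StretchTable() {  // computed at startup, not stored in the exe
    for (int p = 0; p < 4096; p++) {
      double x = std::log((p + 0.5) / (4096.0 - p - 0.5));  // logit of p/4096
      t[p] = static_cast<short>(std::lround(x * 256.0));    // fixed-point scale
    }
  }
public:
  static const StretchTable& instance() {
    static StretchTable s;  // Meyers singleton: built on first use
    return s;
  }
  short operator()(int p) const { return t[p & 4095]; }
};

// usage: short st = StretchTable::instance()(pr);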
Last edited by Gotty; 21st February 2021 at 17:27. Reason: typos
byronknoll (21st February 2021),schnaader (19th February 2021)
Source: https://github.com/GotthardtZ/paq8gen
Code:
- Simplified replacement strategy in Bucket16
- Decreased final learning rate from 8 to 6 in the last mixer layer
- Removed IndirectContext and LargeIndirectContext
- Refined SSE stage: now using 4 high precision APMPost maps instead of 2
- Removed not-useful contexts from LineModel, added contexts to better model codons
- Removed not-useful contexts from MatchModel, added a context to identify ambiguous matches; now using initial match lengths that are multiples of 3
- Cosmetic changes in NormalModel and Shared
Windows binaries: https://github.com/GotthardtZ/paq8gen/releases/tag/v5
Mike (20th February 2021)
coronavirus.fasta.preprocessed_seq_only.8 100'003'453 bytes
Code:
paq8gen_v1.exe -8 109364 bytes
paq8gen_v2.exe -8 107662 bytes
paq8gen_v3.exe -8  82331 bytes
paq8gen_v4.exe -8  78236 bytes
paq8gen_v5.exe -8  73779 bytes
coronavirus.fasta.preprocessed.full 1'317'937'667 bytes
Code:
paq8gen_v1.exe -12 1'118'915 bytes 23275.12 sec
paq8gen_v2.exe -12 1'085'007 bytes 25974.80 sec
paq8gen_v3.exe -12   723'319 bytes 23747.36 sec
paq8gen_v4.exe -12   696'424 bytes 23969.18 sec
paq8gen_v5.exe -12   660'417 bytes 26623.40 sec
Last edited by Gotty; 20th February 2021 at 10:27.
hexagone (20th February 2021)
@Gotty, have you tested all the learning rates in the mixer on a smaller set?
Does the compressed data keep getting smaller, say for a 200 KB chunk of input, or any other larger size?
KZo
Gotty (20th February 2021)
For the first 1MB of sars-2 (sequence only):
n-m: size (n: final learning rate on 1st layer, m: final learning rate on 2nd (last) layer)
8-2: 8122
8-3: 8113
8-4: 8109
8-5: 8108
8-6: 8103 * best
8-7: 8105
8-8: 8106 <- max learning rate on both layers
7-8: 8109
6-8: 8109
5-8: 8112
4-8: 8118
3-8: 8123
2-8: 8128
So a small decay in the last layer helps, but anything less than the maximum learning rate on the first layer hurts. This is what I learned.
The difference is just a handful of bytes for a 1MB file, so I didn't test with smaller sizes or chunks. Let me run a test round with c7 (the first 10M bytes of sars-2) and record the compression rate after every MB.
In the meantime these are the learning rate results for the small DNA corpus (https://encode.su/threads/2105-DNA-Corpus)
It shows us that changing the learning rate on the last layer has a very, very small effect. It doesn't tell much, since the files are tiny. But it closely matches the 1MB sars-2 test: the learning rate should be large for these files, too.
Strangely, the best result is at learning rate 7, the worst at 6.
And so here it is: the first 10 MB of the sars-2 challenge sequence-only file - the effect of changing the final learning rate on the last layer.
The numbers indicate the compressed size of each chunk (the compressed size measured at the last byte of the chunk minus the size measured at its first byte).
Since the numbers show only very small fluctuations, I didn't actually measure the file size; I computed the entropy at the arithmetic encoder instead, so we can see the fractions, too.
The winner is: learning rate 6.
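For anyone wanting to replicate the measurement: computing the entropy at the arithmetic encoder amounts to accumulating the exact information content of each coded bit. A minimal sketch (my illustration, not paq8gen's actual code):

Code:
#include <cmath>
#include <cstdio>

// Sketch of the measurement technique described above: instead of
// reading compressed file sizes, accumulate the exact information
// content of every coded bit at the point where the arithmetic
// encoder consumes the prediction.
struct EntropyMeter {
  double bits = 0.0;
  // y: the actual bit; p1: the model's P(bit == 1) as a fraction in (0,1)
  void record(int y, double p1) {
    bits += -std::log2(y ? p1 : 1.0 - p1);
  }
  double bytes() const { return bits / 8.0; }
};

// e.g. call meter.record(y, p1) next to each encodeBit() call,
// and print meter.bytes() at every 1 MB chunk boundary.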
kaitz (20th February 2021)
If you change the AdaptiveMap update to use the full 32 bits for the prediction, like in pxd - for the main StateMap in the NormalModel CM, after 1MB of input - will it improve?
KZo
Gotty (20th February 2021)
Btw, model is still bitwise, right?
Is there an effect from alphabet reordering?
http://nishi.dreamhosters.com/u/bwt_reorder_v4b.rar
http://ctxmodel.net/files/BWT_reorder_v2.rar for source
Gotty (20th February 2021)
Yes, it is.
Earlier I measured the entropy of every bit (within a byte). It showed me that the model "knows" very consistently which bits are "easy" to predict and which are not. I reran it:
Code:
Bit  Entropy (in bytes)
 0       5.3  <- "easy"
 1      38.5  <- "easy"
 2       5.3  <- "easy"
 3    3696.6
 4     155.1  <- "easy"
 5    2234.1
 6    1775.5
 7      11.1  <- "easy"
(measured on c6 (1M bytes of sars-2 sequence only data))
So five of the bits are "very easy to predict"; 3 bits carry the entropy. This tells me that reordering probably won't help.
Question: did you construct the xlt files manually?
> This tells me that reordering probably won't help.
If it can make compression visibly worse, there's a chance that it can also improve it.
For example, maybe unary code for "ATCG" or some such.
> did you construct the xlt files manually?
With a heuristic optimizer based on a specific entropy coder, but it's slow.
Let's see: alphabet reordering.
The following frequency information is extracted from the full sars-2 challenge file, ignoring the sequence names (i.e. it's only the sequences).
Code:
     freq  char  old code  new code
418420417   T       84        0
388931944   A       65        1
255577497   G       71        2
239142742   C       67        3
 15186607   N       78        4
    18009   Y       89        5
    11622   K       75        6
     7865   R       82        7
     5764   W       87        8
     3505   M       77        9
     2381   S       83       11
      237   H       72       12
      144   D       68       13
       89   V       86       14
       59   B       66       15
            \n      10       10
We are lucky that there are exactly 16 distinct chars (including the newline).
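A sketch of how such a frequency-ranked remap table can be built (illustrative; not the exact transform shipped with paq8gen). Note how code 10 is reserved for '\n' so the line structure survives, which is why 'S' gets 11 rather than 10:

Code:
#include <algorithm>
#include <array>
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Rank the 16 symbols by frequency and assign new codes in that
// order, reserving code 10 for '\n'. Frequencies are the ones from
// the table above.
int main() {
  std::vector<std::pair<uint64_t, uint8_t>> freq = {
      {418420417, 'T'}, {388931944, 'A'}, {255577497, 'G'}, {239142742, 'C'},
      {15186607, 'N'},  {18009, 'Y'},     {11622, 'K'},     {7865, 'R'},
      {5764, 'W'},      {3505, 'M'},      {2381, 'S'},      {237, 'H'},
      {144, 'D'},       {89, 'V'},        {59, 'B'}};
  std::sort(freq.rbegin(), freq.rend());  // most frequent first
  std::array<uint8_t, 256> newCode{};
  uint8_t next = 0;
  for (const auto& [f, c] : freq) {
    if (next == 10) next++;  // code 10 is reserved for '\n'
    newCode[c] = next++;
  }
  newCode['\n'] = 10;
  for (const auto& p : freq)
    std::printf("%c -> %d\n", p.second, static_cast<int>(newCode[p.second]));
  return 0;
}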
Code:
                    with original   with new
                    alphabet        alphabet
paq8gen_v5x -8 c6     8104            8056 (-0.59%)
paq8gen_v5x -8 c7    12951           12770 (-1.39%)
paq8gen_v5x -8 c8    73726           72318 (-1.91%)
where:
- c6, c7, c8 are the first 1M, 10M, 100M bytes of sars-2 challenge file (sequence-only);
- paq8gen_v5x is a modified version of paq8gen_v5: the arithmetic encoder precision is increased from 20 to 28 bits and line type detection is disabled.
I didn't put much effort into finding an optimal reordering; nevertheless the goal is fulfilled and the validity of your idea is confirmed. Thanks a lot!
These preliminary results also show that the gain grows as the file size increases. So for the full sars-2 file it must be even greater.
coronavirus.fasta.preprocessed.full 1'317'937'667 bytes
Note:
Code:
paq8gen_v1.exe -12 1'118'915 bytes 23275.12 sec
paq8gen_v2.exe -12 1'085'007 bytes 25974.80 sec
paq8gen_v3.exe -12   723'319 bytes 23747.36 sec
paq8gen_v4.exe -12   696'424 bytes 23969.18 sec
paq8gen_v5.exe -12   660'417 bytes 26623.40 sec
Update:
paq8gen_v5.exe -12   659'727 bytes <- arithmetic encoder precision increased from 20 to 28 bits
paq8gen_v5.exe -12   646'466 bytes <- ^ + alphabet transformation applied (only in case of the sequences) - see the post above
The codebase is up to date with the above changes, but the alphabet transformation does not run automatically (as it is sars-cov2-challenge-specific): you have to compile with the transform included and run it manually (before compression, and after decompression to get back the original file).
Wow, it is an amazing result, now it can beat LILY. Congrats, Gotty!!!
From SARS-CoV-2 Coronavirus Data Compression Benchmark thread:
Reordering even helped LZMA get close to the paq8l result on the original fasta file.
Also, it's funny seeing how you find and tackle the same things as I do, like the low precision problem (you really went scorched earth on that one, 28 bits of precision?). The way you solved it was my plan B, in case the way I solved it in LILY wouldn't work.
If you're serious about the sars-cov-2 benchmark, why not just create a new branch of paq8gen on the repo dedicated to it, instead of polluting the main version with a bunch of #defines?
Thank you, Surya, it's very kind of you!
You have to know that paq8gen is a community effort, it's not "mine" - I'm standing on the shoulders of those who laid the foundation with hard, dedicated work, or who supported paq8* with ideas or testing.
Please don't root against LILY. Let's be fair with one another. mpais did an excellent job with LILY. Now under 600K.
It's clear that you are rooting against LILY for personal reasons. You have to know that paq8px, and also paq8gen, would not be as strong as they are today without mpais. The models and methods that make paq8gen "good" all come from paq8px, and guess who helped tremendously in enhancing those models? It's mpais. Fun fact: the strength of paq8gen comes mostly from the MatchModel (which is based on the MatchModel in EMMA). So any success of paq8gen is also a success for mpais.
I don't have a personal issue with mpais, I just see that paq8gen's result is better than LILY's.