Hello,
This thread is dedicated to anyone who has questions, suggestions, or advice regarding Matt Mahoney's Large Text Compression Benchmark website.
CompressMaster
Large-window Brotli results are missing. This leaves people confused about Brotli.
I submitted new scores for the paq8pxd BWT version and 2 Silesia scores directly to Matt some time ago, but there has been no answer.
Matt is busy these days. I'm not sure he has proper Internet access either.
Darek (4th January 2019)
Code:
Input: 1,000,000,000 bytes, enwik9

Output (Zstandard v1.3.8):
358,612,233 bytes, 2.661 sec. - 1.239 sec., zstd -1
331,386,910 bytes, 3.770 sec. - 1.356 sec., zstd -2
314,012,348 bytes, 4.574 sec. - 1.399 sec., zstd -3
308,269,618 bytes, 4.807 sec. - 1.480 sec., zstd -4
302,075,641 bytes, 7.695 sec. - 1.475 sec., zstd -5
295,590,745 bytes, 10.068 sec. - 1.516 sec., zstd -6
285,397,982 bytes, 14.049 sec. - 1.442 sec., zstd -7
281,180,106 bytes, 17.862 sec. - 1.354 sec., zstd -8
278,757,548 bytes, 25.683 sec. - 1.362 sec., zstd -9
273,782,728 bytes, 33.213 sec. - 1.392 sec., zstd -10
271,392,063 bytes, 46.106 sec. - 1.393 sec., zstd -11
269,321,220 bytes, 68.917 sec. - 1.403 sec., zstd -12
266,022,487 bytes, 84.879 sec. - 1.377 sec., zstd -13
261,574,115 bytes, 115.375 sec. - 1.397 sec., zstd -14
258,869,397 bytes, 156.278 sec. - 1.412 sec., zstd -15
250,212,437 bytes, 175.597 sec. - 1.368 sec., zstd -16
242,902,736 bytes, 256.838 sec. - 1.398 sec., zstd -17
239,765,452 bytes, 302.779 sec. - 1.449 sec., zstd -18
235,698,881 bytes, 401.229 sec. - 1.489 sec., zstd -19
226,024,466 bytes, 502.506 sec. - 1.627 sec., zstd -20
220,222,797 bytes, 566.301 sec. - 1.652 sec., zstd -21
215,032,608 bytes, 609.948 sec. - 1.682 sec., zstd -22

Output (Brotli v1.0.4):
378,841,123 bytes, 5.397 sec. - 3.275 sec., brotli -q 1 --lgwin=24
326,534,841 bytes, 9.315 sec. - 2.964 sec., brotli -q 2 --lgwin=24
323,112,082 bytes, 10.937 sec. - 2.832 sec., brotli -q 3 --lgwin=24
292,858,370 bytes, 16.786 sec. - 2.635 sec., brotli -q 4 --lgwin=24
277,475,304 bytes, 32.532 sec. - 2.736 sec., brotli -q 5 --lgwin=24
270,329,550 bytes, 44.030 sec. - 2.670 sec., brotli -q 6 --lgwin=24
263,446,043 bytes, 76.386 sec. - 2.649 sec., brotli -q 7 --lgwin=24
257,232,524 bytes, 138.933 sec. - 2.652 sec., brotli -q 8 --lgwin=24
251,687,047 bytes, 274.139 sec. - 2.727 sec., brotli -q 9 --lgwin=24
227,894,654 bytes, 1033.151 sec. - 2.999 sec., brotli -q 10 --lgwin=24
223,530,697 bytes, 2267.346 sec. - 2.808 sec., brotli -q 11 --lgwin=24

Output (Brotli v1.0.7, decompression fail):
352,779,477 bytes, 5.357 sec., bro -q 1 --lgwin=24
327,834,884 bytes, 10.541 sec., bro -q 2 --lgwin=24
324,410,103 bytes, 12.088 sec., bro -q 3 --lgwin=24
294,055,716 bytes, 15.743 sec., bro -q 4 --lgwin=24
278,596,096 bytes, 29.038 sec., bro -q 5 --lgwin=24
271,385,030 bytes, 40.653 sec., bro -q 6 --lgwin=24
264,466,868 bytes, 70.567 sec., bro -q 7 --lgwin=24
258,243,437 bytes, 128.828 sec., bro -q 8 --lgwin=24
252,692,493 bytes, 248.608 sec., bro -q 9 --lgwin=24
228,673,479 bytes, 856.122 sec., bro -q 10 --lgwin=24
224,208,958 bytes, 1691.600 sec., bro -q 11 --lgwin=24
Last edited by Sportman; 4th January 2019 at 20:53.
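For anyone wanting to reproduce a table in this format, here is a minimal timing sketch in Python. It is not Sportman's actual setup, just the general shape of such a run; it assumes zstd is on the PATH and enwik9 is in the working directory.

Code:
import os
import subprocess
import time

INPUT = "enwik9"  # assumes the test file is in the working directory

def timed(cmd):
    """Run a command and return its wall-clock time in seconds."""
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - t0

for level in range(1, 23):
    out = INPUT + ".zst"
    # --ultra unlocks levels 20-22; -f overwrites; -q silences progress output
    comp = timed(["zstd", "--ultra", f"-{level}", "-f", "-q", INPUT, "-o", out])
    decomp = timed(["zstd", "-d", "-f", "-q", out, "-o", os.devnull])
    print(f"{os.path.getsize(out):>11,} bytes, {comp:.3f} sec. - {decomp:.3f} sec., zstd -{level}")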
I think 24 is the non-large window size. Maybe 30?
https://github.com/google/brotli/blo.../brotli.c#L459 suggests using the option --large_window=30
Code:
Input: 1,000,000,000 bytes, enwik9

Output (Brotli v1.0.7, decompression fail):
352,779,477 bytes, 5.349 sec., bro -q 1 --large_window=30
327,834,884 bytes, 10.560 sec., bro -q 2 --large_window=30
324,047,710 bytes, 13.689 sec., bro -q 3 --large_window=30
293,900,179 bytes, 17.612 sec., bro -q 4 --large_window=30
277,291,160 bytes, 32.874 sec., bro -q 5 --large_window=30
270,302,479 bytes, 44.267 sec., bro -q 6 --large_window=30
263,499,861 bytes, 75.238 sec., bro -q 7 --large_window=30
257,322,354 bytes, 137.726 sec., bro -q 8 --large_window=30
251,049,163 bytes, 297.272 sec., bro -q 9 --large_window=30
203,991,421 bytes, 1158.281 sec., bro -q 10 --large_window=30
199,839,216 bytes, 2084.745 sec., bro -q 11 --large_window=30
The problem now is that you are comparing Brotli with the maximum window size at every level against zstd with the default window size for each level (I believe it goes up to 27 at -22).
So technically you could add --long=30 to zstd, but looking at the ratios and times you posted, it seems that Brotli ignores the large-window option at low quality levels?
To some degree I think it is fairer to compare the default choices made available by the compression levels of each tool, since the developers had the opportunity to tune those, rather than trying to match all the available parameters. As a note, the zstd tool now offers a lot of options to customize the compression parameters through the --zstd option.
Edit: Well, after some reflection, perhaps adjusting the window size to match, if such an option is readily available, does make sense for a direct comparison (see the sketch below).
Last edited by jibz; 5th January 2019 at 14:44.
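For what it's worth, a matched-window run would look roughly like the sketch below (--long has been available in zstd since v1.3.2 and raises the window log). The flag spellings are from the respective CLIs, but treat this as illustrative rather than a tested benchmark setup.

Code:
import subprocess

# Force a 2^30 window on both sides for a like-for-like comparison.
subprocess.run(["zstd", "--ultra", "-22", "--long=30", "-f",
                "enwik9", "-o", "enwik9.zst"], check=True)
subprocess.run(["brotli", "-q", "11", "--large_window=30", "-f",
                "-o", "enwik9.br", "enwik9"], check=True)

# Decoding the zstd stream needs the enlarged window limit as well.
subprocess.run(["zstd", "-d", "--long=30", "-f",
                "enwik9.zst", "-o", "enwik9.out"], check=True)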
I disagree with that. Window size is a decoding resource control, while the quality setting is an encoding resource control. A proper encoder separates them and does not attempt to bundle them in a way that makes benchmarks look pretty. It often makes a lot of sense to run expensive encoding with a tiny window size.
Code:
Input: 1,000,000,000 bytes, enwik9

Output (Brotli v1.0.4):
378,841,123 bytes, 5.282 sec. - 3.334 sec., brotli -q 1
326,534,841 bytes, 9.284 sec. - 2.910 sec., brotli -q 2
323,112,082 bytes, 11.007 sec. - 2.834 sec., brotli -q 3
292,858,370 bytes, 16.648 sec. - 2.657 sec., brotli -q 4
277,475,304 bytes, 32.467 sec. - 2.732 sec., brotli -q 5
270,329,550 bytes, 44.071 sec. - 2.679 sec., brotli -q 6
263,446,043 bytes, 77.841 sec. - 2.664 sec., brotli -q 7
257,232,524 bytes, 139.508 sec. - 2.671 sec., brotli -q 8
251,687,047 bytes, 274.220 sec. - 2.737 sec., brotli -q 9
227,894,654 bytes, 1033.381 sec. - 3.016 sec., brotli -q 10
223,530,697 bytes, 2265.848 sec. - 2.824 sec., brotli -q 11

Output (Brotli v1.0.7, decompression fail):
352,779,477 bytes, 5.401 sec., bro -q 1
327,834,884 bytes, 10.570 sec., bro -q 2
324,410,103 bytes, 12.105 sec., bro -q 3
294,055,716 bytes, 15.701 sec., bro -q 4
278,596,096 bytes, 29.240 sec., bro -q 5
271,385,030 bytes, 40.511 sec., bro -q 6
264,466,868 bytes, 71.022 sec., bro -q 7
258,243,437 bytes, 128.265 sec., bro -q 8
252,692,493 bytes, 249.385 sec., bro -q 9
228,673,479 bytes, 855.928 sec., bro -q 10
224,208,958 bytes, 1690.347 sec., bro -q 11
I think the decompression issue is in the Brotli Windows binary posted in the Brotli thread on the Encode forum.
It would be helpful if the Brotli GitHub project posted working Windows binaries, as it did for older Brotli versions; Windows is still the most used OS worldwide at 37.43%, while Linux is at only 0.85%: http://gs.statcounter.com/os-market-share
Without the equals sign (--large_window 30), or --large_window=30?
I'd like to ask if it's possible to enter this benchmark in conjunction with another compressor. I mean, 100 million byte values (and enwik9 is much larger) will be very time consuming for my custom algorithm (even 1 million is a very large value for it). Therefore, I don't think it would be possible to beat CMIX in a reasonable amount of time. But if I can first use a free archiver with fast and good compression (a few seconds, below 30 MB output), I will definitely be able to shrink it MUCH more than CMIX - I expect to get down to at least 5 MB, losslessly. I know I want to losslessly compress already compressed data, but as I said before, patterns are present even in incompressible data, so randomness for me means hardly compressible, not incompressible.
Thanks for your advice.
Preprocessing by itself is accepted (e.g. see how DRT is used there), so I guess you can use paq8hp12 or something as a first stage.
But it has to be decodable, and don't expect your result to get posted without verification.
CompressMaster (4th August 2019)
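To make the suggested two-stage setup concrete, the chain would look something like this sketch. The paq8hp12 invocations are assumptions (check its actual usage), and custom_pack/custom_unpack are hypothetical placeholders for the custom algorithm:

Code:
import subprocess

# Stage 1: a strong public compressor shrinks enwik8 first (usage assumed).
subprocess.run(["paq8hp12", "enwik8.paq", "enwik8"], check=True)
# Stage 2: the custom algorithm runs on the much smaller intermediate file.
subprocess.run(["custom_pack", "enwik8.paq", "enwik8.cst"], check=True)

# Decoding must invert both stages, and the whole chain counts as the
# decoder for verification purposes.
subprocess.run(["custom_unpack", "enwik8.cst", "enwik8.paq"], check=True)
subprocess.run(["paq8hp12", "-d", "enwik8.paq", "enwik8"], check=True)  # usage assumed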
1. I've looked at DRT and other preprocessing options, but that still leaves a lot of bytes for my terribly slow custom algorithm to process.
2. And what about preprocessing by compressing already compressed data? I think that's not prohibited (at least by the LTCB rules).
3. So, as a first stage, can I use some archiver that's able to losslessly compress enwik8 to below 18 MB or so in a good compression time (3 minutes or so)?
Memory usage is not a problem at all (1 MB will be enough for my algorithm - yes, really, or at least theoretically, since I haven't developed it yet due to little spare time); the problem is the time limit (8 hours).
4. Also, I'm confused by the compression times. For example, how many seconds are represented in:
phda9 1.8 15,010,414 86182 86305 6319 83
5. Can I use a pre-built compression database instead of time-consuming predictions? I mean, I would test my algorithm in a first pass in order to predict byte values/words, producing a database significantly smaller than enwik8. Then I would use this precomputed database for decompression (i.e. compression would not be required at all) instead of doing time-consuming compression. So, is a "compressor specifically targeted to this benchmark" in that form allowed?
Thanks.
3. Yes, you can.
4. The 86xxx numbers are the enwik9 compression/decompression times in seconds (roughly 24 hours each).
5. No, but it doesn't matter - the list is sorted by compressed size. Otherwise you could just compile the compressed file into the compressor and say that it is compressed at I/O speed.
CompressMaster (5th September 2019)
1. Something like that? The compressed size of enwik8 (enwik9's compression would be very time consuming, so I'll leave it untouched) together with the size of my compressor+decompressor packed into one file?
2. As for enwik9, is its compression necessary to win the Hutter Prize? Keep in mind that money isn't my primary goal; my goal is to help myself and other geeks achieve the maximum possible compression ratio for computer files, at the expense of slowness.
Last edited by CompressMaster; 5th September 2019 at 20:01. Reason: typo
1. Compressed data + decompressor
2. Only enwik8 decompression. enwik9 is only used for LTCB stats.
CompressMaster (5th September 2019)
1. Is it allowed to use external software for "preprocessing"? By preprocessing I mean, for example, converting enwik8 to smaller words, and another tool to split the data into smaller parts. As far as I know, it isn't prohibited.
2. >The total payout will be roughly 50,000€ if enwik8 can be compressed to 7 MB, about the lower bound of Shannon's estimate of 0.6 bits per character.
What if someone exceeds that and is able to compress enwik8, say, hypothetically, to 250 KB including the compressor+decompressor?
3. As another preprocessor, I decided to use mcm 0.84. Am I still eligible for the prize?
4. Why isn't Byron Knoll (the cmix author) mentioned as a winner? His compressor is the best (as of this writing), but phda's author is still the winner...
Last edited by CompressMaster; 19th November 2019 at 21:08. Reason: fixed typo
1. Code for any data transformations has to be included as part of the decompressor.
It's not listed in the rules, but I presume they might allow decompressor sources instead of an x86 binary, since for Python etc. a binary is not very portable.
Use of system libraries is explicitly allowed, but I'm not sure how an entry using zpaq (which might be preinstalled on Ubuntu) or some NN library would be counted.
2. 50k is the whole fund; Marcus Hutter is not listed by Forbes, and the contest is intended for AI improvement rather than for finding exploits in its rules.
3. Yes, but it still has to be counted as part of the decompressor.
4. The contest has a 1 GB RAM limit, while cmix requires 24 GB or more.
Also, for enwik8, LTCB lists cmix's raw compressed size without the decoder, but the cmix decoder (+dictionary) is larger than phda's.
Also, "you are eligible for a prize of 50'000€×(1-S/L). Minimum claim is 1500€." -> 50000*(1-S/15284944) >= 1500 -> S <= 14,826,395, so cmix is not good enough yet.
CompressMaster (8th October 2019)
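To spell out that arithmetic, a quick check of the claim threshold (constants taken from the prize formula quoted above):

Code:
# Hutter Prize payout: 50'000 EUR * (1 - S/L), minimum claim 1'500 EUR.
FUND = 50_000           # total prize fund, EUR
MIN_CLAIM = 1_500       # smallest claimable amount, EUR
L = 15_284_944          # baseline size in bytes

# A claim is valid when FUND * (1 - S/L) >= MIN_CLAIM, so:
S_max = int(L * (1 - MIN_CLAIM / FUND))
print(S_max)  # 14826395 -> an entry must be at most this many bytes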
Shelwien,
What about you and the enwik8 challenge?
What about it? mod_ppmd and mod_SSE were a good improvement, I think.
Alex quickly adopts any relevant open-source code, so I'd have to work on it in secret to make my own entry.
My most recent attempt is this: https://encode.su/threads/3072-contr...p-tools-for-HP
and it kind of shows promise, but it's boring to work on it without feedback.
CompressMaster (23rd February 2020)