
Thread: Large Text Compression Benchmark

  1. #1
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    183
    Thanks
    49
    Thanked 13 Times in 13 Posts

    Large Text Compression Benchmark

    Hello,
    This thread is dedicated to anyone who has questions, suggestions, or advice regarding Matt Mahoney's Large Text Compression Benchmark website.

    CompressMaster

  2. #2
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    767
    Thanks
    217
    Thanked 286 Times in 168 Posts
    Large-window brotli results are missing. This leaves people confused about brotli.

  3. #3
    Member
    Join Date
    Dec 2008
    Location
    Poland, Warsaw
    Posts
    1,022
    Thanks
    617
    Thanked 418 Times in 316 Posts
    I submitted new scores for the paq8pxd BWT version and two Silesia scores directly to Matt some time ago, but there has been no answer.

  4. #4
    The Founder encode's Avatar
    Join Date
    May 2006
    Location
    Moscow, Russia
    Posts
    4,004
    Thanks
    393
    Thanked 386 Times in 149 Posts
    Matt is busy these days. I'm not sure he has proper Internet access either.

  5. Thanks:

    Darek (4th January 2019)

  6. #5
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    899
    Thanks
    84
    Thanked 325 Times in 227 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    Large window brotli results are missing.
    Code:
    Input:
    1,000,000,000 bytes, enwik9
    
    Output (Zstandard v1.3.8):
    358,612,233 bytes,   2.661 sec. - 1.239 sec., zstd -1
    331,386,910 bytes,   3.770 sec. - 1.356 sec., zstd -2
    314,012,348 bytes,   4.574 sec. - 1.399 sec., zstd -3
    308,269,618 bytes,   4.807 sec. - 1.480 sec., zstd -4
    302,075,641 bytes,   7.695 sec. - 1.475 sec., zstd -5
    295,590,745 bytes,  10.068 sec. - 1.516 sec., zstd -6
    285,397,982 bytes,  14.049 sec. - 1.442 sec., zstd -7
    281,180,106 bytes,  17.862 sec. - 1.354 sec., zstd -8
    278,757,548 bytes,  25.683 sec. - 1.362 sec., zstd -9
    273,782,728 bytes,  33.213 sec. - 1.392 sec., zstd -10
    271,392,063 bytes,  46.106 sec. - 1.393 sec., zstd -11
    269,321,220 bytes,  68.917 sec. - 1.403 sec., zstd -12
    266,022,487 bytes,  84.879 sec. - 1.377 sec., zstd -13
    261,574,115 bytes, 115.375 sec. - 1.397 sec., zstd -14
    258,869,397 bytes, 156.278 sec. - 1.412 sec., zstd -15
    250,212,437 bytes, 175.597 sec. - 1.368 sec., zstd -16
    242,902,736 bytes, 256.838 sec. - 1.398 sec., zstd -17
    239,765,452 bytes, 302.779 sec. - 1.449 sec., zstd -18
    235,698,881 bytes, 401.229 sec. - 1.489 sec., zstd -19
    226,024,466 bytes, 502.506 sec. - 1.627 sec., zstd -20
    220,222,797 bytes, 566.301 sec. - 1.652 sec., zstd -21
    215,032,608 bytes, 609.948 sec. - 1.682 sec., zstd -22
    
    Output (Brotli v1.0.4):
    378,841,123 bytes,    5.397 sec. - 3.275 sec., brotli -q 1 --lgwin=24
    326,534,841 bytes,    9.315 sec. - 2.964 sec., brotli -q 2 --lgwin=24
    323,112,082 bytes,   10.937 sec. - 2.832 sec., brotli -q 3 --lgwin=24
    292,858,370 bytes,   16.786 sec. - 2.635 sec., brotli -q 4 --lgwin=24
    277,475,304 bytes,   32.532 sec. - 2.736 sec., brotli -q 5 --lgwin=24
    270,329,550 bytes,   44.030 sec. - 2.670 sec., brotli -q 6 --lgwin=24
    263,446,043 bytes,   76.386 sec. - 2.649 sec., brotli -q 7 --lgwin=24
    257,232,524 bytes,  138.933 sec. - 2.652 sec., brotli -q 8 --lgwin=24
    251,687,047 bytes,  274.139 sec. - 2.727 sec., brotli -q 9 --lgwin=24
    227,894,654 bytes, 1033.151 sec. - 2.999 sec., brotli -q 10 --lgwin=24
    223,530,697 bytes, 2267.346 sec. - 2.808 sec., brotli -q 11 --lgwin=24
    
    Output (Brotli v1.0.7, decompression fail):
    352,779,477 bytes,    5.357 sec., bro -q 1 --lgwin=24
    327,834,884 bytes,   10.541 sec., bro -q 2 --lgwin=24
    324,410,103 bytes,   12.088 sec., bro -q 3 --lgwin=24
    294,055,716 bytes,   15.743 sec., bro -q 4 --lgwin=24
    278,596,096 bytes,   29.038 sec., bro -q 5 --lgwin=24
    271,385,030 bytes,   40.653 sec., bro -q 6 --lgwin=24
    264,466,868 bytes,   70.567 sec., bro -q 7 --lgwin=24
    258,243,437 bytes,  128.828 sec., bro -q 8 --lgwin=24
    252,692,493 bytes,  248.608 sec., bro -q 9 --lgwin=24
    228,673,479 bytes,  856.122 sec., bro -q 10 --lgwin=24
    224,208,958 bytes, 1691.600 sec., bro -q 11 --lgwin=24
    Last edited by Sportman; 4th January 2019 at 19:53.

  7. #6
    Member
    Join Date
    Apr 2015
    Location
    Greece
    Posts
    84
    Thanks
    34
    Thanked 26 Times in 17 Posts
    I think 24 is the non-large window size. Maybe 30?

  8. #7
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    122
    Thanks
    105
    Thanked 71 Times in 51 Posts
    https://github.com/google/brotli/blo.../brotli.c#L459 suggests using the option --large_window=30

  9. #8
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    899
    Thanks
    84
    Thanked 325 Times in 227 Posts
    Quote Originally Posted by jibz View Post
    suggests using option --large_window=30
    Code:
    Input:
    1,000,000,000 bytes, enwik9
    
    Output (Brotli v1.0.7, decompression fail):
    352,779,477 bytes,    5.349 sec., bro -q 1 --large_window=30
    327,834,884 bytes,   10.560 sec., bro -q 2 --large_window=30
    324,047,710 bytes,   13.689 sec., bro -q 3 --large_window=30
    293,900,179 bytes,   17.612 sec., bro -q 4 --large_window=30
    277,291,160 bytes,   32.874 sec., bro -q 5 --large_window=30
    270,302,479 bytes,   44.267 sec., bro -q 6 --large_window=30
    263,499,861 bytes,   75.238 sec., bro -q 7 --large_window=30
    257,322,354 bytes,  137.726 sec., bro -q 8 --large_window=30
    251,049,163 bytes,  297.272 sec., bro -q 9 --large_window=30
    203,991,421 bytes, 1158.281 sec., bro -q 10 --large_window=30
    199,839,216 bytes, 2084.745 sec., bro -q 11 --large_window=30

  10. #9
    Member jibz's Avatar
    Join Date
    Jan 2015
    Location
    Denmark
    Posts
    122
    Thanks
    105
    Thanked 71 Times in 51 Posts
    The problem is that now you are comparing Brotli with the maximum window size at every level against zstd with the default window size for each level (I believe it goes up to 27 at -22).

    So technically you could add --long=30 to zstd, but looking at the ratios and times you posted, it seems that Brotli ignores the large-window option at the low quality levels?

    To some degree I think it is more fair to compare the default choices made available by the compression levels of each tool, since the developers had the opportunity to tune those, rather than trying to match all the available parameters. As a note, the zstd tool now offers a lot of options to customize the compression parameters through the --zstd option.

    Edit: Well, after some reflection, perhaps adjusting the window size to match, if such an option is readily available, does make sense for a direct comparison.
    Last edited by jibz; 5th January 2019 at 13:44.
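    As a rough illustration of putting the two codecs on the same window footing, here is a minimal sketch using the python-zstandard bindings. It assumes the "zstandard" package and its ZstdCompressionParameters.from_level / max_window_size API; treat the exact parameter names as an assumption to verify against the installed version. The command-line analogue is zstd's --long=30 flag at both compression and decompression time.
    Code:
    # Sketch only: raise zstd's window to 2^30 bytes independently of the level,
    # roughly matching brotli's --large_window=30.
    import zstandard as zstd
    
    def compress_large_window(data: bytes, level: int = 19, window_log: int = 30) -> bytes:
        # Start from the level's defaults, then override only the window size
        # (the highest levels would otherwise pick about 2^27).
        params = zstd.ZstdCompressionParameters.from_level(level, window_log=window_log)
        return zstd.ZstdCompressor(compression_params=params).compress(data)
    
    def decompress_large_window(data: bytes, window_log: int = 30) -> bytes:
        # The decoder must explicitly accept the larger window, otherwise it
        # rejects frames whose window exceeds its default limit.
        return zstd.ZstdDecompressor(max_window_size=1 << window_log).decompress(data)
    
    if __name__ == "__main__":
        raw = open("enwik9", "rb").read()  # hypothetical local copy
        packed = compress_large_window(raw)
        assert decompress_large_window(packed) == raw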

  11. #10
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    767
    Thanks
    217
    Thanked 286 Times in 168 Posts
    Quote Originally Posted by jibz View Post
    ... more fair to compare the default choices made available by the compression levels of each tool, since the developers had the opportunity to tune those
    I disagree with that. Window size is a decoding resource control; the quality setting is an encoding resource control. A proper encoder separates them and does not attempt to bundle them in a way that makes benchmarks look pretty. It often makes a lot of sense to run expensive encoding with a tiny window size.
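    To illustrate that point, here is a small sketch using the Python brotli bindings (assuming the official "brotli" PyPI package, whose compress() takes quality and lgwin keywords; the standard format caps lgwin at 24, and large-window mode is a separate extension). Quality and window size can be varied independently, and an expensive encode into a small window is a perfectly valid combination for memory-constrained decoders.
    Code:
    # Sketch: quality controls encoder effort, lgwin controls how much history
    # the decoder must be able to keep in memory; they are independent knobs.
    import brotli
    
    data = open("enwik8", "rb").read()  # hypothetical local copy
    
    for quality in (5, 9, 11):
        for lgwin in (18, 24):          # 256 KiB window vs 16 MiB window
            packed = brotli.compress(data, quality=quality, lgwin=lgwin)
            assert brotli.decompress(packed) == data
            print(f"q={quality:2d} lgwin={lgwin}: {len(packed):,} bytes")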

  12. #11
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    899
    Thanks
    84
    Thanked 325 Times in 227 Posts
    Quote Originally Posted by jibz View Post
    To some degree I think it is more fair to compare the default choices made available by the compression levels of each tool
    Code:
    Input:
    1,000,000,000 bytes, enwik9
    
    Output (Brotli v1.0.4):
    378,841,123 bytes,    5.282 sec. - 3.334 sec., brotli -q 1
    326,534,841 bytes,    9.284 sec. - 2.910 sec., brotli -q 2
    323,112,082 bytes,   11.007 sec. - 2.834 sec., brotli -q 3
    292,858,370 bytes,   16.648 sec. - 2.657 sec., brotli -q 4
    277,475,304 bytes,   32.467 sec. - 2.732 sec., brotli -q 5
    270,329,550 bytes,   44.071 sec. - 2.679 sec., brotli -q 6
    263,446,043 bytes,   77.841 sec. - 2.664 sec., brotli -q 7
    257,232,524 bytes,  139.508 sec. - 2.671 sec., brotli -q 8
    251,687,047 bytes,  274.220 sec. - 2.737 sec., brotli -q 9
    227,894,654 bytes, 1033.381 sec. - 3.016 sec., brotli -q 10
    223,530,697 bytes, 2265.848 sec. - 2.824 sec., brotli -q 11
    
    Output (Brotli v1.0.7, decompression fail):
    352,779,477 bytes,    5.401 sec., bro -q 1
    327,834,884 bytes,   10.570 sec., bro -q 2
    324,410,103 bytes,   12.105 sec., bro -q 3
    294,055,716 bytes,   15.701 sec., bro -q 4
    278,596,096 bytes,   29.240 sec., bro -q 5
    271,385,030 bytes,   40.511 sec., bro -q 6
    264,466,868 bytes,   71.022 sec., bro -q 7
    258,243,437 bytes,  128.265 sec., bro -q 8
    252,692,493 bytes,  249.385 sec., bro -q 9
    228,673,479 bytes,  855.928 sec., bro -q 10
    224,208,958 bytes, 1690.347 sec., bro -q 11

  13. #12
    Member
    Join Date
    Jun 2015
    Location
    Switzerland
    Posts
    767
    Thanks
    217
    Thanked 286 Times in 168 Posts
    Quote Originally Posted by Sportman View Post
    (Brotli v1.0.4 and v1.0.7 enwik9 results quoted above)
    File an issue in the brotli repository if you get a decompression failure.

    Report large text benchmark results with --large_window 30

  14. #13
    Member
    Join Date
    Aug 2008
    Location
    Planet Earth
    Posts
    899
    Thanks
    84
    Thanked 325 Times in 227 Posts
    Quote Originally Posted by Jyrki Alakuijala View Post
    File an issue in the brotli repository if you get a decompression failure.

    Report large text benchmark results with --large_window 30
    I think the decompression issue is in the Brotli Windows binary posted in the Brotli encode forum thread.
    It would be helpful if the Brotli GitHub repository posted working Windows binaries, as it did for older Brotli versions; Windows is still the most used OS worldwide at 37.43%, with Linux at only 0.85% (http://gs.statcounter.com/os-market-share).

    Without the equals sign (--large_window 30), or --large_window=30?

  15. #14
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    183
    Thanks
    49
    Thanked 13 Times in 13 Posts
    I'd like to ask whether it's possible to enter this benchmark in conjunction with another compressor. I mean, 100 million bytes (and the larger file even more so) would be very time consuming for my custom algorithm (even 1 million bytes is a very large input for it). Therefore, I don't think it would be possible to beat CMIX in a reasonable amount of time. But if I can first use a free archiver with fast and good compression (a few seconds, below 30 MB), I will definitely be able to shrink it MUCH more than CMIX; I expect an output of at least 5 MB, lossless. I know I want to losslessly compress already-compressed data, but as I said before, patterns are present even in incompressible data, so randomness for me means hardly compressible, not incompressible.

    Thanks for your advice.

  16. #15
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    Preprocessing by itself is accepted (e.g. see how DRT is used there), so I guess you can use paq8hp12 or something similar as a first stage.
    But it has to be decodable, and don't expect your result to get posted without verification.
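    A minimal sketch of the two-stage entry being discussed, with hypothetical placeholders for the custom algorithm and the standard library's lzma standing in for the first-stage compressor (paq8hp12/DRT in the actual discussion); the point is only that the full chain must round-trip and the whole decoder gets submitted and verified.
    Code:
    # Two-stage pipeline sketch. custom_encode/custom_decode are hypothetical
    # placeholders for the poster's own algorithm; lzma is only a stand-in for
    # whichever first-stage compressor/preprocessor is actually used.
    import lzma
    
    def custom_encode(data: bytes) -> bytes:
        return data   # placeholder: the slow custom second stage would go here
    
    def custom_decode(data: bytes) -> bytes:
        return data   # placeholder: must exactly invert custom_encode
    
    def compress(raw: bytes) -> bytes:
        stage1 = lzma.compress(raw, preset=9)   # first stage: general-purpose compressor
        return custom_encode(stage1)            # second stage: specialised algorithm
    
    def decompress(packed: bytes) -> bytes:
        return lzma.decompress(custom_decode(packed))  # undo the stages in reverse order
    
    if __name__ == "__main__":
        raw = open("enwik8", "rb").read()        # hypothetical local copy
        assert decompress(compress(raw)) == raw  # the chain must be decodable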

  17. Thanks:

    CompressMaster (4th August 2019)

  18. #16
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    183
    Thanks
    49
    Thanked 13 Times in 13 Posts
    1. I've looked at DRT and other preprocessing options, but that's still a lot of bytes to process with my terribly slow custom algorithm.

    2. And what about preprocessing by compressing already-compressed data? I think that's not prohibited (at least by the LTCB rules).

    3. So, as a first stage, can I use some archiver that's able to losslessly compress enwik8 to below 18 MB or so in a reasonable time (3 minutes or so)?
    Memory usage is not a problem at all (1 MB will be enough for my algorithm; yes, really, or at least theoretically, since I haven't developed it yet due to little spare time); the problem is time (8 hours).

    4. Also, I'm confused by the comp times. For example, how many seconds are in:
    phda9 1.8 15,010,414 86182 86305 6319 83

    5. Can I use a pre-built compression database instead of time-consuming predictions? I mean, I would test my algorithm in a first pass in order to predict byte values/words, producing a database significantly smaller than enwik8. Then I would use this precomputed database for decompression (i.e. compression would not be required at all) instead of doing time-consuming compression. So, is a "compressor specifically targeted to this benchmark" allowed in that form?

    Thanks.

  19. #17
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    3. Yes, you can.
    4. The 86xxx numbers are the enwik9 compression/decompression times in seconds.
    5. No, but it doesn't matter; the list is sorted by compressed size. Otherwise you could just compile the compressed file into the compressor and claim that it compresses at I/O speed.

  20. Thanks:

    CompressMaster (5th September 2019)

  21. #18
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    183
    Thanks
    49
    Thanked 13 Times in 13 Posts
    Quote Originally Posted by Shelwien View Post
    Otherwise you could just compile the compressed file into the compressor and claim that it compresses at I/O speed.
    1. Something like that? The compressed size of enwik8 (enwik9's compression would be very time consuming, so I'll leave it untouched) together with the size of my compressor+decompressor packed into one file?

    2. As for enwik9, is its compression necessary to win the Hutter Prize? Keep in mind that money isn't my primary goal; my goal is to help myself and other geeks achieve the maximum possible compression ratio of computer files at the expense of slowness.
    Last edited by CompressMaster; 5th September 2019 at 19:01. Reason: typo

  22. #19
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    1. Compressed data + decompressor

    2. Only enwik8 decompression. enwik9 is only used for LTCB stats.

  23. Thanks:

    CompressMaster (5th September 2019)

  24. #20
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    183
    Thanks
    49
    Thanked 13 Times in 13 Posts
    1. Is it allowed to use external software for "preprocessing"? By preprocessing I mean, for example, converting enwik8 to smaller words, and using another tool to split the data into smaller parts. As far as I know, it isn't prohibited.

    2. >The total payout will be roughly 50,000 € if enwik8 can be compressed to 7 MB, about the lower bound of Shannon's estimate of 0.6 bits per character.
    What if someone exceeds that and is able to compress enwik8, say hypothetically, to 250 KB including compressor+decompressor?

    3. As another preprocessor, I decided to use mcm 0.84. Am I still eligible for the prize?

    4. Why isn't Byron Knoll (the cmix author) mentioned as a winner? His compressor is the best (as of this writing), but phda's author is still the winner...
    Last edited by CompressMaster; 19th November 2019 at 20:08. Reason: fixed typo

  25. #21
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    1. Code for any data transformations has to be included as part of the decompressor.
    It's not listed in the rules, but I presume they might allow decompressor sources instead of an x86 binary, since for Python etc. a binary is not very portable.
    Use of system libraries is explicitly allowed, but I'm not sure how an entry using zpaq (which might be preinstalled in Ubuntu) or some NN library would be counted.

    2. 50k is the whole fund, Marcus Hutter is not listed by Forbes, and the contest is intended for AI improvement rather than for finding exploits in its rules.

    3. Yes, but it still has to be counted as part of the decompressor.

    4. The contest has a 1 GB RAM limit, while cmix requires 24 GB or more.
    Also, for enwik8/cmix, LTCB lists the raw compressed size without the decoder, but the cmix decoder (+dictionary) is larger than phda's.
    Also, "you are eligible for a prize of 50'000€×(1-S/L). Minimum claim is 1500€." -> 50000*(1-S/15284944) >= 1500 -> S <= 14826395, so cmix is not good enough yet (see the worked check below).

  26. Thanks:

    CompressMaster (8th October 2019)

  27. #22
    Member
    Join Date
    Jun 2018
    Location
    Slovakia
    Posts
    183
    Thanks
    49
    Thanked 13 Times in 13 Posts
    Shelwien,
    what about you and the enwik8 challenge?

  28. #23
    Administrator Shelwien's Avatar
    Join Date
    May 2008
    Location
    Kharkov, Ukraine
    Posts
    3,774
    Thanks
    276
    Thanked 1,206 Times in 671 Posts
    What about it? mod_ppmd and mod_SSE were a good improvement, I think.
    Alex quickly adopts any relevant open-source code, so I'd have to work on it in secret to make my own entry.
    My most recent attempt is this: https://encode.su/threads/3072-contr...p-tools-for-HP
    and it kind of shows promise, but it's boring to work on without feedback.

  29. Thanks:

    CompressMaster (22nd February 2020)

